This is an evolving repo optimized for machine-learning projects aimed at designing a new algorithm. They require sweeping over different hyperparameters, comparing to baselines, and iteratively refining an algorithm. Based of cookiecutter-data-science.
Organization
-
project_name
: should be renamed, contains main code for modeling (e.g. model architecture)
-
experiments
: code for runnning experiments (e.g. loading data, training models, evaluating models)
-
scripts
: scripts for hyperparameter sweeps (python scripts that launch jobs in experiments
folder with different hyperparams)
-
notebooks
: jupyter notebooks for analyzing results and making figures
-
tests
: unit tests
Setup
- first, rename
project_name
to your project name and modify setup.py
accordingly
- clone and run
pip install -e .
, resulting in a package named project_name
that can be imported
- see
setup.py
for dependencies, not all are required
- example run: run
python scripts/01_train_basic_models.py
(which calls experiments/01_train_model.py
then view the results in notebooks/01_model_results.ipynb
- keep tests upated and run using
pytest
Features
- scripts sweep over hyperparameters using easy-to-specify python code
- experiments automatically cache runs that have already completed
- caching uses the (non-default) arguments in the argparse namespace
- notebooks can easily evaluate results aggregated over multiple experiments using pandas
Guidelines
- See some useful packages here
- Avoid notebooks whenever possible (ideally, only for analyzing results, making figures)
- Paths should be specified relative to a file's location (e.g.
os.path.join(os.path.dirname(__file__), 'data')
)
- Naming variables: use the main thing first followed by the modifiers (e.g.
X_train
, acc_test
)
- binary arguments should start with the word "use" (e.g.
--use_caching
) and take values 0 or 1
- Use logging instead of print
- Use argparse and sweep over hyperparams using python scripts (or custom things, like amulet)
- Note, arguments get passed as strings so shouldn't pass args that aren't primitives or a list of primitives (more complex structures should be handled in the experiments code)
- Each run should save a single pickle file of its results
- All experiments that depend on each other should run end-to-end with one script (caching things along the way)
- Keep updated requirements in setup.py
- Follow sklearn apis whenever possible
- Use Huggingface whenever possible, then pytorch