Setting up a robust and deterministic environment for machine learning projects can be confusing if we don't set some ground rules at the beginning.

In this article, I want to go through how to set up a robust development environment for machine learning that facilitates managing dependencies and guarantees compatibility between the development and production stages throughout the project's life cycle.

The idea is to start from scratch and end with a folder containing:

  • An environment.yml file specifying our Python and cuda/cudnn versions
  • dev.in and prod.in files specifying the development and production package requirements, respectively
  • A Makefile containing commands to automatically update our environment every time we modify the environment.yml file or change the packages in the .in files
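Concretely, the final project folder should end up looking roughly like this (a sketch of the layout assumed in this article; the folder name is arbitrary, and the dev.txt and prod.txt lockfiles will be auto-generated later by pip-tools):

my-ml-project/
├── environment.yml
├── Makefile
└── requirements/
    ├── dev.in
    ├── dev.txt
    ├── prod.in
    └── prod.txt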

Disclaimer: most of the contents of this article come from the Full Stack Deep Learning course, which, besides being the main resource for this setup, has been my reference guide for all topics related to practical machine learning, so I strongly recommend you check it out!


Steps

  1. Set up Anaconda
  2. Create a virtual environment and install the dependencies
  3. Export the environment to an environment.yml file
  4. Create the requirements files and add our dependencies for development and production
  5. Write a Makefile

1. Set up Anaconda

  • Install Anaconda by following the official installation instructions for your platform

  • Confirm your conda version with: conda -V
    In my case I get: conda 4.10.3

  • Update conda to the latest version: conda update conda
    In my case I get: conda 4.11.0

2. Create a virtual environment and install the dependencies

For this project I will use a PyTorch example, so I will create the environment with the necessary cudatoolkit and cudnn first, like this:

conda create -n setup_ml_env python==3.7 cudatoolkit=10.2 cudnn=7.6

Now we activate the environment by running:

conda activate setup_ml_env

And we test the installation by running:

python -V

Expected output:

Python 3.7.11
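Optionally, you can also check that conda picked up the CUDA packages (the exact versions and build strings will vary by platform):

conda list cudatoolkit
conda list cudnn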

3. Export the environment to an environment.yml file

conda env export --from-history > environment.yml

The --from-history flag makes sure that only the packages you explicitly installed so far (in this case just python, cudatoolkit, and cudnn) end up in the environment.yml file.
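At this point, the exported file should look roughly like this (a sketch; the prefix line will contain the actual path to the environment on your machine):

name: setup_ml_env
channels:
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cudnn=7.6
prefix: path/to/setup_ml_env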

Let's add pip and pip-tools to this file (we will use them later to install our Python packages), and then print out the contents of the file to check:

cat environment.yml

Expected output:

name: setup_ml_env
channels:
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cudnn=7.6
  - pip
  - pip:
    - pip-tools
prefix: path/to/setup_ml_env

4. Create the requirements files and add our dependencies for development and production

In a Linux terminal:

mkdir requirements
touch requirements/dev.in
touch requirements/prod.in

Inside the dev.in file we write:

-c prod.txt
mypy
black

Here the -c prod.txt line constrains the development packages to the versions pinned in the production requirements file (prod.txt), which will be generated from the prod.in file.

Inside the prod.in file write:

torch
numpy

This is just an illustrative example of a toy project using the torch and numpy packages.

5. Write a Makefile

The Makefile for our project will contain:

# Command to print all the other targets, from https://stackoverflow.com/a/26339924
help:
    @$(MAKE) -pRrq -f $(lastword $(MAKEFILE_LIST)) : 2>/dev/null | awk -v RS= -F: '/^# File/,/^# Finished Make data base/ {if ($$1 !~ "^[#.]") {print $$1}}' | sort | egrep -v -e '^[^[:alnum:]]' -e '^$@$$'
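(Note: recipe lines in a Makefile must be indented with a real tab character, not spaces, otherwise make fails with a "missing separator" error.)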

The help target prints all the available targets in our Makefile.

# Install exact Python and CUDA versions
conda-update:
    conda env update --prune -f environment.yml
    echo "Activate your environment with: conda activate setup_ml_env"

This is the Makefile target for updating our environment every time we modify the environment.yml file.

# Compile and install exact pip packages
pip-tools:
    pip install pip-tools
    pip-compile requirements/prod.in && pip-compile requirements/dev.in
    pip-sync requirements/prod.txt requirements/dev.txt

The pip-tools target compiles and installs mutually compatible versions of all requirements. I won't cover linting here to avoid introducing more complexity to this article.
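After running make pip-tools, the auto-generated requirements/prod.txt lockfile will look roughly like this (a sketch; the exact header, pinned versions, and transitive dependencies depend on what pip-compile resolves on your machine):

# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile requirements/prod.in
#
numpy==1.21.5
    # via -r requirements/prod.in
torch==1.10.2
    # via -r requirements/prod.in
typing-extensions==4.1.1
    # via torch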

As described in the Full Stack Deep Learning course repo, using pip-tools allows us to:

  • Separate out dev from production dependencies (dev.in vs prod.in).

  • Have a lockfile of exact versions for all dependencies (the auto-generated dev.txt and prod.txt).

  • Easily deploy to targets that may not support the conda environment.

If you add, remove, or need to update the version of some requirement, edit the .in files and simply run make pip-tools again, as shown below.
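For example, to add a new production dependency (pandas here is purely illustrative), you would append it to prod.in and re-run the target:

echo "pandas" >> requirements/prod.in
make pip-tools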

Concluding Thoughts

At the beginning of my machine learning career I was just installing packages and running code, without considering the negative implications of things like dependency conflicts.

Now, even though there might still be things I am missing, I feel this approach covers the holes left by the naive way of developing a machine learning project.

In summary:

  • environment.yml specifies the Python version and, optionally, cuda/cudnn
  • make conda-update creates/updates the conda environment
  • requirements/prod.in and requirements/dev.in specify the Python package requirements
  • make pip-tools resolves and installs all Python packages
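As a final sanity check (assuming the toy torch/numpy requirements used in this article and a machine with a compatible GPU driver), you can verify that PyTorch was installed and can see CUDA:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"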

If you liked this post, consider joining me on Medium. Also, subscribe to my YouTube channel. Thanks and see you next time! :)