Setting up an Environment for Machine Learning with Conda & Pip-Tools
Setting up ML env with conda and pip-tools
Setting up a robust and deterministic environment for machine learning projects can sometimes be confusing if we don't set some ground rules at the beginning.
In this article, I want to go through how to set up a robust development environment for machine learning, one that facilitates managing dependencies and guarantees compatibility between the development and production stages throughout the life cycle of the project.
The idea is to start from scratch and end with a folder containing:
- An `environment.yml` file specifying our Python and cuda/cudnn versions
- `dev.in` and `prod.in` files specifying the development and production package requirements, respectively
- A `Makefile` containing commands to automatically update our environment every time we modify the `environment.yml` file or change the packages in the `.in` files
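By the end, the project folder should look roughly like this (a sketch; the folder name is up to you, and `dev.txt` and `prod.txt` are generated automatically in a later step):
ml-project/
├── Makefile
├── environment.yml
└── requirements/
    ├── dev.in
    ├── dev.txt
    ├── prod.in
    └── prod.txt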
Disclaimer: the contents of this article were written using the following main resources:
- https://github.com/full-stack-deep-learning/conda-piptools
- https://github.com/full-stack-deep-learning/fsdl-text-recognizer-project
- https://github.com/jazzband/pip-tools
Mainly, I learned a lot from the Full Stack Deep Learning course, which, besides being the main resource for this setup, has been my reference guide for all topics related to practical machine learning, so I strongly recommend you check it out!
2. Create a virtual environment and install the dependencies
In the case of this project I will use a `pytorch` example, so I will create the environment with the necessary `cudatoolkit` first, like this:
conda create -n setup_ml_env python=3.7 cudatoolkit=10.2 cudnn=7.6
Now we activate the environment by running:
conda activate setup_ml_env
And, we test the installation by running:
python -V
Expected output:
Python 3.7.11
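Optionally, we can also confirm that the CUDA-related packages made it into the environment (a quick check, assuming a Linux shell with grep available):
conda list | grep -E "cudatoolkit|cudnn"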
3. Export the environment to an environment.yml file
`conda env export --from-history > environment.yml`
The `--from-history` flag makes sure that you only add to the `environment.yml` file the packages you actually installed so far (in this case just the `cudatoolkit` and `cudnn` packages).
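At this point, before any edits, the exported file should look roughly like this (the `prefix` line will point to your local conda environment path):
name: setup_ml_env
channels:
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cudnn=7.6
prefix: path/to/setup_ml_env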
Let's add `pip` and `pip-tools` to this file to use later for installing our Python packages, and then print out the contents of the file to check:
cat environment.yml
Expected output:
name: setup_ml_env
channels:
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cudnn=7.6
  - pip
  - pip:
    - pip-tools
prefix: path/to/setup_ml_env
4. Create the requirements files and add our dependencies for development and production
In a Linux terminal:
mkdir requirements
touch requirements/dev.in
touch requirements/prod.in
Inside the `dev.in` file we write:
-c prod.txt
mypy
black
Here the `-c prod.txt` line will constrain the development packages to the packages specified in the production requirements that will be generated from the `prod.in` file.
Inside the `prod.in` file write:
torch
numpy
This is just an illustrative example of a toy project using the `torch` and `numpy` packages.
5. Write a Makefile
The makefile for our project will contain:
# Command to print all the other targets, from https://stackoverflow.com/a/26339924
help:
	@$(MAKE) -pRrq -f $(lastword $(MAKEFILE_LIST)) : 2>/dev/null | awk -v RS= -F: '/^# File/,/^# Finished Make data base/ {if ($$1 !~ "^[#.]") {print $$1}}' | sort | egrep -v -e '^[^[:alnum:]]' -e '^$@$$'
The `help` target prints all the available commands in our Makefile.
# Install exact Python and CUDA versions
conda-update:
	conda env update --prune -f environment.yml
	echo "Activate your environment with: conda activate setup_ml_env"
Here is the Makefile command for updating our environment every time we modify the `environment.yml` file.
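So if we later bump, say, the Python or CUDA version in `environment.yml`, bringing the environment back in sync is just:
make conda-update
conda activate setup_ml_env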
# Compile and install exact pip packages
pip-tools:
	pip install pip-tools
	pip-compile requirements/prod.in && pip-compile requirements/dev.in
	pip-sync requirements/prod.txt requirements/dev.txt
The `pip-tools` command compiles and installs mutually compatible versions of all requirements.
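For reference, the generated `requirements/prod.txt` will look something like this (the exact pins are illustrative; they depend on when you run `pip-compile`):
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile requirements/prod.in
#
numpy==1.21.2
    # via -r requirements/prod.in
torch==1.9.1
    # via -r requirements/prod.in
typing-extensions==3.10.0.2
    # via torch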
I won't cover linting here to avoid introducing more complexity to this article.
As described in the repo of the Full Stack Deep Learning course, using pip-tools allows us to:
- Separate out dev from production dependencies (`dev.in` vs `prod.in`).
- Have a lockfile of exact versions for all dependencies (the auto-generated `dev.txt` and `prod.txt`).
- Allow us to easily deploy to targets that may not support the conda environment.
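For that last point, on a deployment target that only has Python and pip available, the locked production dependencies can be installed directly with:
pip install -r requirements/prod.txt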
If you add, remove, or need to update versions of some requirements, edit the `.in` files, and simply run `make pip-tools` again.
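For example, adding a (hypothetical) new production dependency boils down to:
echo "pandas" >> requirements/prod.in  # pandas is just an example package
make pip-tools  # re-compiles prod.txt/dev.txt and syncs the environment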
Concluding Thoughts
At the beginning of my machine learning career I was just installing packages and running code, without considering the negative implications of things like dependency issues.
Now, even though there might still be things that I am missing, I feel like this setup covers the holes in the naive approach to developing a machine learning project.
In summary:
- `environment.yml` specifies Python and optionally `cuda/cudnn`
- `make conda-update` creates/updates the conda environment
- `requirements/prod.in` and `requirements/dev.in` specify the Python package requirements
- `make pip-tools` resolves and installs all Python packages
If you liked this post, consider joining me on Medium. Also, subscribe to my YouTube channel. Thanks and see you next time! :)