IBM Data Science - Best Practices
Dependency Management
Why do we need to manage dependencies?
Dependency management is like your city’s sewage system. When it’s working well, it’s easy to forget that it even exists. The only time you’ll remember it is when you experience the agony induced by its failure.
Some of the goals that a healthy dependency management process tries to achieve are:
- Builds should be stable across environments. If a project builds on my machine, it should build on others’ machines and on our build server.
- Builds should be stable over time. If a project builds now, it shouldn’t break in the future.
- Any team member should be able to easily download, build, and make changes to a project.
- We should be able to have many different projects with large dependency trees without running into dependency hell.
What is pip?
pip is a package manager for Python. That means it’s a tool that allows you to install and manage additional libraries and dependencies that are not distributed as part of the standard library.
The Python installer installs pip, so it should be ready for you to use, unless you installed an old version of Python. You can verify that pip is available by running the following command in your console:
$ pip --version
pip 19.2.3 from /Users/vladzamfirescu/opt/anaconda3/lib/python3.7/site-packages/pip (python 3.7)
pip is a Python tool that specializes in installing Python packages.
For example, just run pip install numpy to install numpy and its dependencies. pip also helps you keep your version control repositories small by giving you a reproducible way to install packages without needing to include them in your source code repo.
Not only does pip let you install normal source packages, but it can also install packages from source control repositories, wheels, and legacy binary distribution formats.
Using requirements.txt
When developing Python applications today, it’s standard practice to have a requirements.txt file in the root of your repository.
It’s easy to get a Python project off the ground by just using pip to install dependent packages as you go. This works fine as long as you’re the only one working on the project, but as soon as someone else wants to run your code, they’ll need to figure out which dependencies the project needs and install them all by hand. Worse yet, if they install a different version of a dependency than the one you used, they could end up with some very mysterious errors.
To prevent this, you can define a requirements.txt file that records all of your project’s dependencies, versions included. This way, others can run pip install -r requirements.txt and all the project’s dependencies will be installed automatically.
Placing this file into version control alongside the source code makes it easy for others to use and edit it. To ensure complete reproducibility, your requirements.txt file should include all of your project’s transitive (indirect) dependencies, not just your direct dependencies.
$ cat requirements.txt
numpy
pandas
pytest
wheel
This is an example of a basic requirements.txt file, and is effectively the user experience that everyone using requirements files wants. However, when a requirements.txt file like this is used to deploy to production, unexpected consequences can occur. Because versions haven’t been pinned, running pip install -r requirements.txt can give you different results today than it will tomorrow.
This is bad. As new versions of your dependencies and their sub-dependencies are released, a fresh pip install -r requirements.txt will install different packages, and your application may fail for reasons that are hard to trace.
To avoid the pitfalls of a basic requirements.txt file, you can define a complete list of all dependencies a project has, each with exact package versions specified.
$ cat requirements.txt
numpy==1.18.1
pandas==1.0.0
pytest==5.3.5
wheel==0.32.2
This is considered a best practice for deploying applications, and ensures an explicit runtime environment with deterministic builds.
All dependencies, including sub-dependencies, are listed, each with an exact version specified.
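The difference between the basic and the pinned file can be checked mechanically. The following is a minimal sketch, assuming a requirements file that only uses plain `name==version` lines; the helper name `unpinned_requirements` is hypothetical, and the simple regular expression does not cover every requirement form that pip accepts (URLs, local paths, and so on).

```python
import re

# Matches "name==version", with an optional extras block such as "requests[security]==2.22.0".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return the requirement lines that do not pin an exact version (hypothetical helper)."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]          # drop trailing comments
        line = line.split(";", 1)[0].strip()  # drop environment markers
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


basic = "numpy\npandas\npytest\nwheel\n"
pinned = "numpy==1.18.1\npandas==1.0.0\npytest==5.3.5\nwheel==0.32.2\n"

print(unpinned_requirements(basic))   # ['numpy', 'pandas', 'pytest', 'wheel']
print(unpinned_requirements(pinned))  # []
```

A check like this could run in continuous integration to reject a requirements file that contains unpinned entries.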
This type of requirements.txt is generated from the output of running pip freeze from within a current working runtime environment for the application. This encourages dev/prod parity, and encourages you to treat code within external packages with the same level of respect as your application code (because it is your application code).
Even though the fixed-version format for requirements.txt is considered a best practice, it can be cumbersome. Namely, if you are working on your project’s codebase and want to pip install --upgrade some or all of the packages, you can’t do so easily, because every pinned version has to be updated by hand.
For additional information on the requirements.txt file, feel free to consult the pip user guide.
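To make concrete what a frozen list contains, here is a rough approximation of it using only the standard library. This is a sketch, not a reimplementation of pip freeze: pip treats editable installs and a few packages specially, which this code ignores. The function names `freeze_lines` and `installed_pairs` are made up for illustration.

```python
from importlib import metadata


def freeze_lines(pairs):
    """Format (name, version) pairs as sorted, pinned requirement lines."""
    return sorted((f"{name}=={version}" for name, version in pairs), key=str.lower)


def installed_pairs():
    """Yield (name, version) for every distribution visible to this interpreter."""
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata is incomplete
            yield name, dist.version


if __name__ == "__main__":
    # Prints one "name==version" line per installed distribution.
    print("\n".join(freeze_lines(installed_pairs())))
```

Because the output reflects everything installed in the environment, it is only as clean as the environment itself, which is why freezing from a dedicated virtual environment matters.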
Taking dependency management to the next level with pipenv or poetry
Q: Is there a better approach than pip freeze?
A: Yes, because pip freeze carries a high risk of including dependencies that you don’t actually need. You should always aim to minimize your dependencies.
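One way to see which frozen packages are not actually needed is to compare the frozen set against the transitive closure of your direct dependencies. The sketch below assumes you already have a dependency graph as a plain dictionary; the graph, the `flask` entry, and the function name `required_closure` are hypothetical illustration data, not output from any tool.

```python
def required_closure(direct, graph):
    """Return the direct dependencies plus everything they pull in transitively."""
    needed, stack = set(), list(direct)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(graph.get(name, ()))  # packages without an entry have no dependencies
    return needed


# Hypothetical data for illustration only.
graph = {
    "pandas": ["numpy", "python-dateutil", "pytz"],
    "python-dateutil": ["six"],
}
direct = ["pandas"]
frozen = {"numpy", "pandas", "python-dateutil", "pytz", "six", "flask"}

# Anything frozen but not reachable from a direct dependency is a candidate for removal.
print(sorted(frozen - required_closure(direct, graph)))  # ['flask']
```

Tools such as poetry and pipenv do this bookkeeping for you by separating the dependencies you declare from the ones they resolve into the lock file.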
poetry is a dependency manager for Python projects and improves on the more traditional requirements.txt method described previously. If you’re familiar with Node.js’ npm or Ruby’s bundler, it is similar in spirit to those tools. While pip can install Python packages, poetry is recommended as it’s a higher-level tool that simplifies dependency management for common use cases.
It automatically creates and manages a virtualenv for your projects, and it adds packages to or removes them from your pyproject.toml as you install or uninstall them. It also generates the ever-important poetry.lock, which is used to produce deterministic builds.
pipenv does the same thing; the respective files are a Pipfile and a Pipfile.lock.
The key difference between pipenv and poetry is that poetry also covers packaging and distribution of your own Python package, which pipenv doesn’t.
Poetry tends to be a bit more modern and quicker, but both work well and should always be preferred over plain pip.
One very cool thing about pipenv is that if you already have a requirements.txt file in your project structure, it will automatically detect it when you run pipenv install, convert your requirements.txt into a Pipfile, and generate the Pipfile.lock as well.
Usage
Example pyproject.toml (for poetry)
[tool.poetry]
name = "poetry-demo"
version = "0.1.0"
description = ""
authors = ["Sébastien Eustace <sebastien@eustace.io>"]
[tool.poetry.dependencies]
python = "*"
[tool.poetry.dev-dependencies]
pytest = "^3.4"
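The `^3.4` entry above is a caret constraint: it allows updates that do not change the left-most non-zero version component. The following sketch expands such a constraint into an explicit range. It assumes purely numeric versions with at most three components, and `caret_range` is a hypothetical helper, not part of poetry.

```python
def caret_range(constraint: str) -> tuple[str, str]:
    """Expand a caret constraint like "^3.4" into (inclusive lower, exclusive upper)."""
    if not constraint.startswith("^"):
        raise ValueError("expected a caret constraint such as '^3.4'")
    parts = [int(p) for p in constraint[1:].split(".")]
    if len(parts) > 3:
        raise ValueError("this sketch supports at most three version components")
    lower = parts + [0] * (3 - len(parts))
    # Bump the left-most non-zero component; if all are zero, bump the last one given.
    index = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper = lower[:index] + [lower[index] + 1] + [0] * (2 - index)
    return ".".join(map(str, lower)), ".".join(map(str, upper))


print(caret_range("^3.4"))    # ('3.4.0', '4.0.0')
print(caret_range("^0.2.3"))  # ('0.2.3', '0.3.0')
```

So `pytest = "^3.4"` accepts any 3.x release from 3.4 onward but not 4.0, while the lock file records the single version that was actually resolved.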
Example Pipfile (for pipenv)
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"
[packages]
requests = "*"
[dev-packages]
pytest = "*"
Note: The lock files (poetry.lock and Pipfile.lock) are technically regular, editable files. However, do not adjust them manually. Always let poetry or pipenv adjust them.
General Recommendations & Version Control (poetry)
- Generally, keep both pyproject.toml and poetry.lock (and poetry.toml) in version control.
- Specify your target Python version in your pyproject.toml’s [tool.poetry.dependencies] section. Ideally, you should only have one target Python version, as this is a deployment tool.
- Use poetry add <module> to add additional packages (more details).
- Use poetry install --no-root to install only the dependencies, without installing your own module (more details).
- Note that the pyproject.toml uses the TOML Spec.
- Full poetry CLI documentation can be found here.
General Recommendations & Version Control (pipenv)
- Generally, keep both Pipfile and Pipfile.lock in version control.
- Do not keep Pipfile.lock in version control if multiple versions of Python are being targeted.
- Specify your target Python version in your Pipfile’s [requires] section. Ideally, you should only have one target Python version, as this is a deployment tool.
- python_version should be in the format X.Y and python_full_version should be in X.Y.Z format.
- pipenv install is fully compatible with pip install syntax, for which the full documentation can be found here.
- Note that the Pipfile uses the TOML Spec.
- Full pipenv CLI documentation can be found here.
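The two version formats in the [requires] section differ only in how many components they compare. As a rough illustration, the sketch below checks an interpreter version against either format; `matches_requires` is a hypothetical helper and not how pipenv itself performs the check.

```python
import sys


def matches_requires(required: str, version_info=sys.version_info) -> bool:
    """Check an interpreter version against an X.Y or X.Y.Z requirement string."""
    wanted = tuple(int(part) for part in required.split("."))
    # Compare only as many components as the requirement specifies.
    return tuple(version_info[: len(wanted)]) == wanted


print(matches_requires("3.7", (3, 7, 4)))    # True: X.Y ignores the patch level
print(matches_requires("3.7.3", (3, 7, 4)))  # False: X.Y.Z must match exactly
```

Using the X.Y form keeps the project installable across patch releases, while the X.Y.Z form ties it to one exact interpreter build.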
Organisation-wide Open Source Software (OSS) Guidelines
:warning: IMPORTANT :warning: Please note that OSS usage (both internal and external), including consumption and contribution, must comply with licensing, copyright law and your organization’s internal guidelines.
Copyright law gives authors rights, such as:
- The author determines appropriate uses of a work.
- You cannot reproduce or modify a work without an author’s permission.
- The author may grant permission via a license, for example, an open source license.
- A license grants permission, but may also impose obligations.
It is mandatory for IBMers to adhere to the IBM OSS Guidelines and associated processes (IBM Internal).
MIT
A short, permissive software license. Basically, you can do whatever you want as long as you include the original copyright and license notice in any copy of the software/source. There are many variations of this license in use.
Apache 2.0
You can do what you like with the software, as long as you include the required notices. This permissive license contains a patent license from the contributors of the code.
BSD
The BSD 2-Clause and BSD 3-Clause licenses allow you almost unlimited freedom with the software so long as you include the BSD copyright notice in it (found in the full text references below).
BSD-2-Clause Full Text
BSD-3-Clause Full Text
GPL
You may copy, distribute and modify the software as long as you track changes/dates in source files. Any modifications to or software including (via compiler) GPL-licensed code must also be made available under the GPL along with build & install instructions.
GPL 2.0 Full Text
GPL 3.0 Full Text
LGPL
This license is mainly applied to libraries. You may copy, distribute and modify the software provided that modifications are described and licensed for free under LGPL. Derivative works (including modifications or anything statically linked to the library) can only be redistributed under LGPL, but applications that use the library don’t have to be.
LGPL 2.0 Full Text
LGPL 2.1 Full Text
LGPL 3.0 Full Text
Examples
To find examples for these guidelines, go to the example repository: MLOps pipeline.
In this example implementation pipenv is used. As described above, the two relevant files are Pipfile and Pipfile.lock. Both of these are located in the root folder of the example implementation.
Pipfile:
...
mypy = "==0.761"
pandas = "==1.0.0"
pipenv = "==2018.11.26"
pipenv-setup = "==3.0.1"
sphinx = "==2.3.1"
twine = "==3.1.1"
jeepney = {version = "*", sys_platform = "== 'linux'"}
secretstorage = {version = "*", sys_platform = "== 'linux'"}
wheel = "==0.34.2"
yapf = "==0.29.0"
...
The Pipfile organizes packages into two groups: dev-packages and packages. The first group is only installed for the development environment, which includes building and experiments. The second group is used for the final assembled package and is therefore a lot smaller. One specialty in the dev-packages section is the Linux-specific packages (jeepney and secretstorage in the excerpt above), which are only required in a Linux environment. The sys_platform condition is there for cross-platform compatibility: pipenv evaluates the current environment and installs packages based on the result.
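The platform condition can be pictured as a small predicate over the current environment. The sketch below is a deliberately simplified illustration that only understands the single `== '<name>'` form used in the excerpt; real tools evaluate full environment markers, and `should_install` is a hypothetical name.

```python
import sys


def should_install(spec, platform=sys.platform):
    """Decide whether a Pipfile-style entry applies on the given platform.

    Only handles the marker form "== '<name>'"; anything else raises ValueError.
    """
    if isinstance(spec, str):  # plain version string, no markers
        return True
    marker = spec.get("sys_platform")
    if marker is None:
        return True
    operator, _, value = marker.partition(" ")
    if operator != "==":
        raise ValueError(f"unsupported marker: {marker!r}")
    return platform == value.strip().strip("'\"")


jeepney = {"version": "*", "sys_platform": "== 'linux'"}
print(should_install(jeepney, platform="linux"))     # True
print(should_install(jeepney, platform="darwin"))    # False
print(should_install("==0.34.2", platform="win32"))  # True
```

This is why the same Pipfile can be shared between developers on macOS, Windows and Linux without manual edits.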
Pipfile.lock:
"_meta": {
"hash": {
"sha256": "3000b68e11f41330aa5a7500462b47e913e95ae067eca401dcd70a62a204bbc6"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.7"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.python.org/simple",
"verify_ssl": true
},
{
"name": "nexus",
"url": "http://$NEXUS_USERNAME:'${NEXUS_PASSWORD}'@a737dee0740fd11eaa014026127b7ccb-1361029674.eu-central-1.elb.amazonaws.com:8081/repository/pypi-group/simple",
"verify_ssl": false
}
]
},
...
As described above, this file contains the specific versions of the required packages and the corresponding hashes. The dependency tree is resolved recursively and satisfies all transitive dependencies. In the makefile, the environment is created with the following pipenv command:
...
install:
pipenv install --deploy --dev
...
This installs both dev and default packages and aborts if the Pipfile.lock is out of date.
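The "out of date" check relies on the hash stored under `_meta` in the lock file. The sketch below illustrates the idea only: it hashes the raw Pipfile text, which is an assumption made for simplicity, whereas pipenv computes its hash over its own normalized representation of the Pipfile, so the digests produced here will not match real lock files. Both function names are hypothetical.

```python
import hashlib
import json


def pipfile_digest(pipfile_text: str) -> str:
    """Return a SHA-256 hex digest of the Pipfile text (simplified stand-in)."""
    return hashlib.sha256(pipfile_text.encode("utf-8")).hexdigest()


def lock_is_current(pipfile_text: str, lock_text: str) -> bool:
    """Compare the digest recorded in the lock file with the current Pipfile."""
    recorded = json.loads(lock_text)["_meta"]["hash"]["sha256"]
    return recorded == pipfile_digest(pipfile_text)


pipfile = '[packages]\nrequests = "*"\n'
lock = json.dumps({"_meta": {"hash": {"sha256": pipfile_digest(pipfile)}}})

print(lock_is_current(pipfile, lock))                        # True
print(lock_is_current(pipfile + 'pandas = "*"\n', lock))     # False
```

Failing fast on a stale lock file means a deployment never silently installs a set of packages that differs from what was tested.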