Running tests usually requires installing dependencies first: Python packages, Ruby gems, npm packages and so on. Nowadays tests are executed automatically for pull (merge) requests in various CI systems. Dependencies are pulled again and again for every test run, and that takes time. We cannot do much about it in cloud-based CI systems such as Travis, Circle CI or GitHub Actions, but there is room for improvement in self-hosted GitLab CI, Jenkins and others.
The first idea that comes to mind is to enable a cache. It looks reasonable: why do we need to download the same files again and again? Wouldn't it be better to keep them in nearby storage? The answer is yes, but there are some nuances. First of all we should decide how the cache will be implemented. I would highlight the following ways: a caching proxy, a cache in the local file system, and a combination of the two.
Using a proxy makes sense if your organization consumes a significant amount of traffic from package repositories. A caching proxy reduces incoming traffic and speeds up downloads. Besides, a local caching proxy can play a very important role in storing proprietary packages. There are several caching proxies; I have worked with Sonatype Nexus and devpi.
Most of the time I work with Python, so below I will provide examples for its package infrastructure. As you know, Python's standard package manager is pip, and in order to download packages from a caching proxy you have to specify an index URL by one of the following methods:
- configuration file:
    index-url = https://example.com/pypi/packages
- environment variable:
    PIP_INDEX_URL=https://example.com/pypi/packages
- command line argument:
    pip install -i https://example.com/pypi/packages -r requirements.txt
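When several of these methods are used at once, pip documents a precedence order: command line options override environment variables, which in turn override configuration files. The sketch below only illustrates that order; it is not pip's own code. The function name `resolve_index_url` is hypothetical, and reading only the `[global]` section is a simplifying assumption.

```python
import configparser

# Index used by pip when nothing else is configured.
DEFAULT_INDEX_URL = "https://pypi.org/simple"


def resolve_index_url(cli_value=None, environ=None, config_text=""):
    """Illustrate the precedence: command line > environment > config file.

    Hypothetical helper; only the [global] section of the config is read.
    """
    environ = environ or {}
    if cli_value:
        return cli_value
    if environ.get("PIP_INDEX_URL"):
        return environ["PIP_INDEX_URL"]
    # Interpolation is disabled so that '%' characters in URLs are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if parser.has_option("global", "index-url"):
        return parser.get("global", "index-url")
    return DEFAULT_INDEX_URL


config = "[global]\nindex-url = https://example.com/pypi/packages\n"
print(resolve_index_url(config_text=config))
```

In a CI job the environment variable is usually the most convenient of the three, because it can be set per pipeline without touching the image or the commands.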
pip, yum and other package managers don't download packages from the index URL immediately. Before that they check whether the requested packages are already stored locally in predefined directories of the file system. Different package managers use different cache directories.
On Linux pip stores downloads in $HOME/.cache/pip by default. Again, you can change that path via the PIP_CACHE_DIR environment variable or a command line argument.
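The "look in the local cache first, download only on a miss" behaviour can be sketched roughly as follows. This is a simplified illustration, not pip's implementation: pip's real cache layout is more involved than one file per package, other overrides of the default directory are ignored, and the helper names `pip_cache_dir` and `fetch` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def pip_cache_dir(environ):
    """Return PIP_CACHE_DIR if set, otherwise the Linux default $HOME/.cache/pip."""
    override = environ.get("PIP_CACHE_DIR")
    if override:
        return Path(override)
    home = environ.get("HOME") or os.path.expanduser("~")
    return Path(home) / ".cache" / "pip"


def fetch(filename, cache_dir, download):
    """Return (data, cache_hit); call download(filename) only on a cache miss."""
    cached = Path(cache_dir) / filename
    if cached.exists():
        return cached.read_bytes(), True
    data = download(filename)
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data, False


with tempfile.TemporaryDirectory() as tmp:
    cache = pip_cache_dir({"PIP_CACHE_DIR": tmp})
    fake_download = lambda name: b"fake wheel content"  # stands in for a network call
    _, first_hit = fetch("demo-1.0-py3-none-any.whl", cache, fake_download)
    _, second_hit = fetch("demo-1.0-py3-none-any.whl", cache, fake_download)
    print(first_hit, second_hit)  # False True
```

The second call never reaches the downloader, which is exactly the effect we want between two pipeline runs.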
Cache in pipelines
That was a sort of preamble. Now we have the knowledge required to start optimizing a pipeline. Let's state the prerequisites:
- you use GitLab CI;
- GitLab Runner is configured to use the Kubernetes executor;
- a pipeline has a stage that installs Python packages:
    - pip install -U pip setuptools
    - pip install -r requirements.txt
- there is a caching proxy with the index URL https://example.com/pypi/packages;
- you have a PVC with shared access.
First of all, let's specify the caching proxy index URL in the pipeline configuration:
    test:
      variables:
        PIP_INDEX_URL: https://example.com/pypi/packages
Then we need to modify the GitLab Runner config file in order to mount the PVC into the builder pods:
    [[runners]]
      executor = "kubernetes"
      # ...
      cache_dir = "/mnt/cache"
      [runners.kubernetes]
        # ...
        [runners.kubernetes.volumes]
          # ...
          [[runners.kubernetes.volumes.pvc]]
            name = "name_of_pvc"
            mount_path = "/mnt/cache"
This is a rather non-obvious trick, because according to the official documentation the cache_dir setting only applies to the Shell, Docker and SSH executors, and you are supposed to use an S3 backend for storing the cache. I found the trick in "Kubernetes cache support".
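The trick only works if cache_dir actually points inside the mounted PVC; otherwise the cache is written to the pod's ephemeral file system and is lost with the pod. A minimal sketch of such a sanity check is below. It assumes the runner section has already been parsed into a dictionary (for example with a TOML parser), and the function name `cache_dir_is_on_pvc` is hypothetical.

```python
from pathlib import PurePosixPath


def cache_dir_is_on_pvc(runner):
    """Return True if cache_dir equals, or lies under, a PVC mount path."""
    cache_dir = runner.get("cache_dir")
    if not cache_dir:
        return False
    cache_path = PurePosixPath(cache_dir)
    volumes = runner.get("kubernetes", {}).get("volumes", {}).get("pvc", [])
    for volume in volumes:
        mount = PurePosixPath(volume["mount_path"])
        if mount == cache_path or mount in cache_path.parents:
            return True
    return False


# The same values as in the runner config above, written as a parsed dictionary.
runner = {
    "executor": "kubernetes",
    "cache_dir": "/mnt/cache",
    "kubernetes": {
        "volumes": {"pvc": [{"name": "name_of_pvc", "mount_path": "/mnt/cache"}]}
    },
}
print(cache_dir_is_on_pvc(runner))  # True
```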
And the final part is to modify the pipeline stage config:
    test:
      stage: tests
      image: python
      cache:
        # that's the place where we tell Gitlab to treat that directory as a cache and
        # properly handle the content of the directory
        paths:
          - .cache/pip
      variables:
        # Change pip's cache directory to be inside the project directory since we can
        # only cache local items.
        PIP_CACHE_DIR: $CI_PROJECT_DIR/.cache/pip
      before_script:
        - pip install -U pip setuptools
        - pip install -r requirements.txt
      script:
        - pytest
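The two settings have to agree: the expanded PIP_CACHE_DIR must land inside the project directory, and its path relative to the project must match the entry under cache paths. The sketch below shows that relationship. The project directory /builds/group/project is a made-up example value and `cache_path_entry` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from string import Template


def cache_path_entry(pip_cache_dir, project_dir):
    """Expand $CI_PROJECT_DIR and return the cache path relative to the project.

    Raises ValueError if the cache directory is outside the project directory,
    in which case it could not be listed under cache paths.
    """
    expanded = Template(pip_cache_dir).substitute(CI_PROJECT_DIR=project_dir)
    return str(PurePosixPath(expanded).relative_to(project_dir))


# Hypothetical project directory, used only for illustration.
print(cache_path_entry("$CI_PROJECT_DIR/.cache/pip", "/builds/group/project"))  # .cache/pip
```

With pip's default directory under $HOME the helper raises ValueError, which mirrors why the variable is overridden in the job.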
Now in the pipeline log you should see something like this:
    Restoring cache
    Checking cache for default...
    No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted.
    Successfully extracted cache
    Downloading artifacts
    Running before_script and script
    $ pip install -U pip setuptools
    Collecting pip
    # we don't download the package but use already downloaded from prior run
      Using cached https://example.com/pypi/packages/pip-20.1.1-py2.py3-none-any.whl
    ...
    Saving cache
    Creating cache default...
    .cache/pip: found 1505 matching files
    No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally.
    Created cache
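To get a rough idea of how well the cache works, you can count the "Using cached" and "Downloading" lines that pip prints. The sketch below does that for a piece of job log. The function name `cache_stats` is hypothetical, the setuptools line in the sample is invented for illustration, and matching on archive extensions is an assumption made to skip GitLab's own "Downloading artifacts" line.

```python
import re

# pip reports cache hits as "Using cached <file or URL>".
CACHED = re.compile(r"^\s*Using cached \S+")
# Only count downloads of package archives, not GitLab's "Downloading artifacts".
DOWNLOADED = re.compile(r"^\s*Downloading \S+\.(?:whl|tar\.gz|zip)\b")


def cache_stats(log_text):
    """Return (cached, downloaded) counts for pip lines in a job log."""
    cached = downloaded = 0
    for line in log_text.splitlines():
        if CACHED.match(line):
            cached += 1
        elif DOWNLOADED.match(line):
            downloaded += 1
    return cached, downloaded


sample_log = """Downloading artifacts
$ pip install -U pip setuptools
Collecting pip
  Using cached https://example.com/pypi/packages/pip-20.1.1-py2.py3-none-any.whl
Collecting setuptools
  Downloading https://example.com/pypi/packages/setuptools-47.1.1-py3-none-any.whl (583kB)
"""
print(cache_stats(sample_log))  # (1, 1)
```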
In the end we should get the following package flow:
In some cases enabling the cache can speed up pipeline execution several times over, because not only downloaded packages are cached but compiled ones as well.