Pipelines speedup

July 06, 2020

· 4 min read

Often when you want to run some tests you need to install required dependencies such as python packages, ruby gems, npm packages and so on. Nowadays tests are executed automatically for pull (merge) requests in various CI systems. Dependency pulling is repeated again and again for every test run and it takes time. We cannot do much about it in cloud based CI systems such as Travis, Circle CI or Github Actions but there is some space for improvement for self-hosted Gitlab CI, Jenkins and others.

Cache

The first idea that would come to your mind is to enable cache. It looks reasonable. Why do need to redownload the same files again and again? Would it be better to store them in some near located storage? The answer is yes but there are some nuances. First of all we should decide how our cache mechanism will be implemented. I would highlight the following ways: caching proxy, cache in the same filesystem and the combination of these two methods.

Caching proxy

Using proxy makes sense if your organization consumes a significant amount traffic from package repositories. Having caching proxy reduces amount of incoming traffic and speeds up downloading. Besides local caching proxy can play a very important role in storing proprietary packages. There are several caching proxies and I worked with Sonatype Nexus and devpi.

diagram 1

The most of the time I work with Python therefore below I will provide examples for its package infrastructure. As you know Python standard package manager is pip and in order to download packages from a caching proxy you have to specify an index url by one of the following methods:

pip.conf:

txt [global] index-url = https://example.com/pypi/packages

environment variable: PIP_INDEX_URL=https://example.com/pypi/packages
command line argument: pip -i https://example.com/pypi/packages

Local cache

pip, gem, yum and others don't download packages from the index url immediately. Before that they check if requested packages are stored locally in predefined directories of the file system. Various package managers have different cache directories. pip in Linux stores downloads in $HOME/.cache/pip by default. And again you can change that path via pip.conf, environment variable PIP_CACHE_DIR or command line argument --cache-dir.

Cache in pipelines

That was a sort of preamble. Now we have a required knowledge in order to start a pipeline optimization. Let's designate prerequisites:

you use Gitlab CI;
Gitlab runner is configured to use kubernetes executor;
a pipeline has a stage with installing python packages:

yaml test: image: python stage: tests before_script: - pip install -U pip setuptools - pip install -r requirements.txt script: - pytest

there is a caching proxy that has the index url https://example.com/pypi/packages;
you have a PVC with shared access.

First of all let's specify caching proxy index url in .gitlab-ci.yml:

test:
  variables:
    PIP_INDEX_URL: https://example.com/pypi/packages

Then we need to modify Gitlab runner config file in order to mount the PVC to builder pods:

[[runners]]
  executor = "kubernetes"
  # ...
  cache_dir = "/mnt/cache"
  [runners.kubernetes]
    # ...
    [runners.kubernetes.volumes]
      # ...
      [[runners.kubernetes.volumes.pvc]]
        name = "name_of_pvc"
        mount_path = "/mnt/cache"

It's a kind of an unobvious trick because according to the official documentation cache_dir setting is only for Shell, Docker and SSH executors. And you supposed to use an S3 backend for storing cache. I found that trick in Kubernetes cache support issue.

And the final part is to modify the pipeline stage config:

test:
  stage: tests
  image: python
  cache:
    # that's the place where we tell Gitlab to treat that directory as a cache and
    # properly handle the content of the directory
    paths:
      - .cache/pip
  variables:
    # Change pip's cache directory to be inside the project directory since we can
    # only cache local items.
    PIP_CACHE_DIR: $CI_PROJECT_DIR/.cache/pip
  before_script:
    - pip install -U pip setuptools
    - pip install -r requirements.txt
  script:
    - pytest

Now in the pipeline log you should see something like that:

Restoring cache
Checking cache for default...
No URL provided, cache will not be downloaded from shared cache server. Instead a
local version of cache will be extracted.
Successfully extracted cache
Downloading artifacts
Running before_script and script
$ pip install -U pip setuptools
Collecting pip
  # we don't download the package but use already downloaded from prior run
  Using cached https://example.com/pypi/packages/pip-20.1.1-py2.py3-none-any.whl
...
Saving cache
Creating cache default...
.cache/pip: found 1505 matching files
No URL provided, cache will be not uploaded to shared cache server. Cache will be
stored only locally.
Created cache

In the end we should get the following package flow:

diagram 2

Enabling cache in some cases can speed up to several times pipeline execution. Because not only downloaded packages are cached but compiled as well.