At GnuTLS, our journey into optimizing GitLab CI began when we faced a significant challenge: we lost our GitLab.com Open Source Program subscription. While we still hope this limitation is temporary, it meant the CI/CD resources available to us became considerably more limited. We took this as an opportunity to find smarter ways to manage our pipelines and reduce our footprint.
This blog post shares the strategies we employed to optimize our GitLab CI usage, focusing on reducing running time and network resources, which are crucial for any open-source project operating with constrained resources.
CI on every PR: a best practice, but not cheap
While running CI on every commit is considered a best practice for secure software development, our experience setting up a self-hosted GitLab runner on a modest Virtual Private Server (VPS) highlighted its cost implications, especially with limited resources. We provisioned a VPS with 2GB of memory and 3 CPU cores, intending it to run our GnuTLS CI pipelines.
The reality, however, was a stark reminder of the resource demands. A single CI pipeline for GnuTLS took far longer to complete than we could reasonably wait for. Furthermore, the data transferred when fetching container images and dependencies and pushing build artifacts and results quickly hit the bandwidth limits imposed by our VPS provider, leading to throttled connections and further delays.
This experience underscored the importance of balancing CI best practices with available infrastructure and budget, particularly for resource-intensive projects.
Reducing CI running time
Efficient CI pipeline execution is paramount, especially when resources are scarce. GitLab provides an excellent article on pipeline efficiency, though in practice project-specific optimization is still needed. We focused on three key areas to achieve faster pipelines:
- Tiering tests
- Layering container images
- De-duplicating build artifacts
Tiering tests
Not all tests need to run on every PR. For more exotic or costly tasks, such as extensive fuzzing, generating documentation, or large-scale integration tests, we adopted a tiering approach. These types of tests are resource-intensive and often provide value even when run less frequently. Instead of scheduling them for every PR, they are triggered manually or on a periodic basis (e.g., nightly or weekly builds). This ensures that critical daily development workflows remain fast and efficient, while still providing comprehensive testing coverage for the project without incurring excessive resource usage on every minor change.
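For illustration, such a tiered job can be expressed in .gitlab-ci.yml with rules that match only scheduled or manually started pipelines. The job, image, and script names below are hypothetical placeholders, not our exact configuration:

```yaml
# Hypothetical tier-2 job: runs only in scheduled (e.g. nightly) pipelines,
# or when started manually from a pipeline created via the web UI.
# It never runs on ordinary pushes or merge requests.
fuzzing-extended:
  image: registry.example.com/gnutls/ci-fuzz:latest   # hypothetical image
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
    - if: '$CI_PIPELINE_SOURCE == "web"'
      when: manual
  script:
    - ./run-extended-fuzzers.sh                       # hypothetical script
```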
Layering container images
The tiering of tests gives us an idea of which CI images are most commonly used in the pipeline. For those common CI images, we transitioned to a more minimal base container image, such as fedora-minimal or debian:<flavor>-slim. This reduced the initial download size and the overall footprint of our build environment.
For specialized tasks, such as generating documentation or running cross-compiled tests that require additional tools, we adopted a layering approach. Instead of building a monolithic image with all possible dependencies, we created dedicated, smaller images for these specific purposes and layered them on top of our minimal base image as needed within the CI pipeline. This modular approach ensures that only the necessary tools are present for each job, minimizing unnecessary overhead.
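As a rough sketch of this layering, assuming hypothetical image names (ci-base, ci-docs), buildah as the image-building tool, and a fedora-minimal-based base image, a dedicated CI job could extend the minimal base like this:

```yaml
# Hypothetical job building a documentation image layered on top of the
# shared minimal base image, using buildah inside the CI job itself.
build-docs-image:
  image: quay.io/buildah/stable          # assumed builder image
  script:
    - buildah login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # Start from the already published minimal base image ...
    - ctr=$(buildah from "$CI_REGISTRY_IMAGE/ci-base:latest")
    # ... and add only the extra tools the documentation jobs need.
    - buildah run "$ctr" -- microdnf install -y doxygen texinfo   # package manager depends on the base
    - buildah commit "$ctr" "$CI_REGISTRY_IMAGE/ci-docs:latest"
    - buildah push "$CI_REGISTRY_IMAGE/ci-docs:latest"
```

Because each specialized image only adds a thin layer on top of the shared base, the base layers are downloaded once and reused across jobs.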
De-duplicating build artifacts
Historically, our CI pipelines involved many “configure && make” steps for various options. One of the major culprits of long build times is repeatedly compiling source code, often producing nearly identical results.
We realized that many of these compile-time options could be handled at runtime. By moving configurations that didn’t fundamentally alter the core compilation process to runtime, we simplified our build process and reduced the number of compilation steps required. This approach turns a lengthy recompilation into a quicker runtime check.
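At the pipeline level, the result looks roughly like the following sketch, where the image name, job names, and the LEGACY_FEATURES runtime switch are hypothetical stand-ins for the options we moved out of ./configure:

```yaml
default:
  image: $CI_REGISTRY_IMAGE/ci-base:latest   # hypothetical shared base image

# One build job produces the build tree ...
build:
  script:
    - ./configure && make -j$(nproc)
  artifacts:
    untracked: true        # hand the generated build tree to the test jobs
    expire_in: 2 hours

# ... and several test jobs reuse it, toggling behaviour at runtime
# instead of recompiling with different ./configure flags.
test-default:
  needs: [build]
  script:
    - make check

test-legacy-off:
  needs: [build]
  script:
    - LEGACY_FEATURES=off make check   # hypothetical runtime switch
```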
Of course, this approach cuts both ways: while it simplifies the compilation process, it can increase code size and attack surface. For example, support for legacy protocol features such as SSL 3.0 or SHA-1, which can weaken the overall security of the library, should still be possible to switch off at compile time.
Another caveat is that some compilation options are inherently incompatible with each other. For example, ThreadSanitizer cannot be enabled together with AddressSanitizer. In such cases, a separate build artifact is still needed.
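Where separate builds remain unavoidable, GitLab's parallel:matrix at least keeps the duplication in the CI configuration small; a sketch with hypothetical configure flags:

```yaml
# ASan and TSan cannot be combined in one binary, so these builds stay
# separate, but a matrix expresses both variants in a single job definition.
build-sanitizers:
  parallel:
    matrix:
      - SANITIZER: [address, thread]
  script:
    - ./configure CFLAGS="-fsanitize=$SANITIZER -g"   # hypothetical flags
    - make -j$(nproc) && make check
```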
The impact: tangible results
The efforts put into optimizing our GitLab CI configuration yielded significant benefits:
- The size of the container image used for our standard build jobs is now 2.5GB smaller than before. This substantial reduction in image size translates to faster job startup times and reduced storage consumption on our runners.
- 9 “configure && make” steps were removed from our standard build jobs. This streamlined the build process and directly contributed to faster execution times.
By implementing these strategies, we not only adapted to our reduced resources but also built a more efficient, cost-effective, and faster CI/CD pipeline for the GnuTLS project. These optimizations highlight that even small changes can lead to substantial improvements, especially in the context of open-source projects with limited resources.
For further information on this, please consult the actual changes.
Next steps
While the current optimizations have significantly improved our CI efficiency, we are continuously exploring further enhancements. Our future plans include:
- Distributed GitLab runners with external cache: To further scale and improve resource utilization, we are considering running GitLab runners on multiple VPS instances. To coordinate these distributed runners and avoid redundant data transfers, we could set up an external cache, potentially using a solution like MinIO. This would allow shared access to build artifacts, reducing bandwidth consumption and build times; a rough sketch of the pipeline-side cache definition follows at the end of this list.
- Addressing flaky tests: Flaky tests, which intermittently pass or fail without code changes, are a major bottleneck in any CI pipeline. They not only consume valuable CI resources by requiring entire jobs to be rerun but also erode developer confidence in the test suite. In TLS testing, it is common to write a test script that sets up a server and a client as separate processes, lets the server bind a unique port to which the client connects, and instructs the client to initiate certain events through a control channel. This kind of test can fail for reasons unrelated to the code under test, e.g., the port might already be in use by another test. Therefore, rewriting tests so that they do not require such a complex setup would be a good first step.
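On the runner side, such a distributed-cache setup would be configured in each runner's config.toml by pointing its cache at an S3-compatible endpoint such as MinIO. On the .gitlab-ci.yml side, a shared cache definition could look roughly like the sketch below, with hypothetical key files and paths:

```yaml
# With a shared external cache backend, every runner resolves the same
# cache key, so dependencies fetched on one VPS are reused on the others.
.shared-cache: &shared-cache
  cache:
    key:
      files:
        - bootstrap.conf        # hypothetical file to derive the cache key from
    paths:
      - .cache/                 # hypothetical cached directory
    policy: pull-push

build:
  <<: *shared-cache
  script:
    - ./bootstrap && ./configure && make -j$(nproc)
```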