
GitLab CI


GitLab CI is the system that orchestrates all the CI/CD workflows within our company. These workflows are the backbone of our entire development cycle. By using it, we become capable of:

  1. Running automated processes on every commit.
  2. Automating application testing.
  3. Automating application deployment.
  4. Automating every QA check we can think of.

These highly automated workflows let us deploy applications many times a day without sacrificing quality or security.

The main reasons why we chose GitLab CI over other alternatives are:

  1. It is open source.
  2. Built-in support for GitLab: As GitLab is the platform that hosts our product repository, being able to easily integrate our CI solution with it is an advantage. All GUI-related capabilities, like pipelines, jobs, CI/CD variables, environments, schedules, and container registries, are a consequence of that integration.
  3. It supports pipelines as code: We can declare all our pipelines in version-controlled configuration files instead of configuring them through a GUI.
  4. It supports horizontal autoscaling: To run hundreds of jobs for many developers, all in real time, our system must be able to scale workers horizontally on demand.
  5. It supports directed acyclic graphs (DAGs): This capability makes our integrations as fast as possible, because jobs depend only on what they actually need. It is a must when implementing a monorepo strategy like ours.
  6. Highly versatile configuration: As every piece of software usually has its own needs when it comes to building, testing, and deploying, GitLab CI offers a vast set of options that range from parallelism, static pages, and services to includes, workflows, and artifacts.
  7. Highly versatile infrastructure: The AWS autoscaler supports configuring S3 caching, machine type, maximum number of machines, spot instances, C5d instances with SSD disks, EBS volumes, off-peak periods, tagging, and the maximum number of builds before a machine is destroyed, among many other options. A highly versatile CI matters because our development cycle completely depends on it, so we expect clockwork-like responsiveness and computing that is as fast as possible.
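As an illustration of the DAG capability, pipelines can declare per-job dependencies with the needs keyword. In this sketch (job names and scripts are hypothetical), each test job starts as soon as its own build finishes instead of waiting for the whole build stage:

```yaml
# Two independent products in a monorepo.
build-app-a:
  stage: build
  script: ./build.sh app-a

build-app-b:
  stage: build
  script: ./build.sh app-b

test-app-a:
  stage: test
  needs: [build-app-a]  # does not wait for build-app-b
  script: ./test.sh app-a

test-app-b:
  stage: test
  needs: [build-app-b]  # does not wait for build-app-a
  script: ./test.sh app-b
```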


The following alternatives were considered but not chosen:

  1. Jenkins: It did not support pipelines as code at the time it was reviewed.
  2. TravisCI: It required licensing for private repositories at the time it was reviewed.
  3. CircleCI: It did not support GitLab, it was very expensive, and it was not as configurable.
  4. Buildkite: It is still pending review.


We use GitLab CI for:

  1. Running all our CI/CD jobs.
  2. Managing all our CI pipelines as code.
  3. Configuring our AWS autoscaler as code.
  4. Implementing a Continuous Delivery approach for our development cycle. Although the whole process is automated, including deployments to both development and production, a developer must still approve a merge request before changes can be deployed to production.
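A sketch of this Continuous Delivery setup (job names, scripts, and rules are illustrative): development deploys run on every pipeline, while the production deploy is restricted to the default branch, so it only triggers automatically once an approved merge request has been merged:

```yaml
deploy-development:
  stage: deploy
  environment: development
  script: ./deploy.sh development

deploy-production:
  stage: deploy
  environment: production
  rules:
    # Runs only on the default branch, i.e. after a
    # developer-approved merge request has been merged.
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script: ./deploy.sh production
```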

We do not use GitLab CI for:

  1. Highly time-consuming schedules that take longer than six hours, like analytics ETLs and machine learning training. We use AWS Batch instead. GitLab CI is not meant to run jobs for that many hours; they are often terminated before they can finish, mainly due to disconnections between the worker running the job and its GitLab CI bastion.



  1. Any changes to the CI pipelines must be done via Merge Requests.
  2. Any changes to the AWS autoscaler infrastructure must be done via Merge Requests by modifying its Terraform module.
  3. To learn how to test and apply infrastructure via Terraform, visit the Terraform Guidelines.
  4. If a scheduled job takes longer than six hours, it generally should run in AWS Batch; otherwise it can run in GitLab CI.
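To help enforce the six-hour guideline, GitLab CI supports a per-job timeout. A hypothetical scheduled job (name and script are illustrative) could cap itself like this:

```yaml
nightly-etl-check:
  timeout: 6h  # anything expected to exceed this belongs in AWS Batch
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script: ./run-check.sh
```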


We use:

  1. terraform-aws-gitlab-module for defining our CI as code.
  2. AWS Lambda for cleaning up orphaned machines every hour.
  3. AWS DynamoDB for locking Terraform states and avoiding race conditions.
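A minimal sketch of the state-locking setup, assuming an S3 backend (the bucket and table names below are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket = "example-terraform-states"  # hypothetical bucket
    key    = "ci/terraform.tfstate"
    region = "us-east-1"

    # The DynamoDB table holds a lock entry per state file, so two
    # concurrent applies cannot corrupt the same state.
    dynamodb_table = "example-terraform-locks"  # hypothetical table
  }
}
```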

Tuning the CI

Any team member can tune the CI for a specific product by modifying the values passed in the runners section of the Terraform module.

One of the most important values is the idle-count, as it:

  1. Specifies how many idle machines should be waiting for new jobs. The more jobs a product's pipelines have, the more idle machines it should keep. You can take the integrates-small runner as a reference.
  2. Dictates the rate at which the CI turns on new machines: if a pipeline with 100 jobs is triggered on a CI with idle-count = 8, machines are turned on in batches of 8 until demand stabilizes.
  3. More information about how the autoscaling algorithm works can be found here.
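As a rough sketch, tuning a runner might look like this inside the module's runners section (the module path and attribute names here are illustrative, not the module's actual variables; check the real module before editing):

```hcl
module "ci_runners" {
  source = "./terraform-aws-gitlab-module"  # illustrative path

  runners = {
    example-small = {
      # 8 warm machines waiting for jobs; scale-ups also
      # happen in batches of 8 until demand stabilizes.
      idle_count   = 8
      machine_type = "c5d.large"  # illustrative
    }
  }
}
```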


As we use a multi-bastion approach, the following tasks can help when debugging the CI.

Review Gitlab CI/CD Settings

If you're an admin in GitLab, you can visit the CI/CD settings to validate that bastions are communicating properly.

Inspect infrastructure

You can inspect both bastions and workers from the AWS EC2 console. Another useful place to look when you suspect spot-capacity issues is the spot requests view.

Connect to bastions or workers

You can connect to any bastion or worker using AWS Session Manager.

Just go to the AWS EC2 console, select the instance you want to connect to, click on Connect, and start a Session Manager session.

Debugging the bastion

Typical things you want to look at when debugging a bastion are:

  • docker-machine commands. This will allow you to inspect and access workers with commands like docker-machine ls, docker-machine inspect <worker>, and docker-machine ssh <worker>.
  • /var/log/messages for relevant logs from the gitlab-runner service.
  • /etc/gitlab-runner/config.toml for bastion configurations.

Debugging a specific CI job

You can find out which machine ran a job by looking at its logs.

For this example job, the line Running on runner-cabqrx3c-project-20741933-concurrent-0 via runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70... tells us that the worker named runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70 ran it.

From there you can access the bastion and debug the worker's memory or disk usage.