
GitLab CI


GitLab CI is the system that orchestrates all the CI/CD workflows within our company. These workflows are the backbone of our entire development cycle. By using it, we become capable of:

  1. Running automated processes on every commit.
  2. Automating application testing.
  3. Automating application deployment.
  4. Automating every QA check we can think of.

These highly automated workflows let us deploy applications many times a day without sacrificing quality or security.

The main reasons why we chose GitLab CI over other alternatives are:

  1. It is open source.
  2. Built-in support for GitLab: As GitLab is the platform that hosts our product repository, being able to easily integrate our CI solution with it is an advantage. All GUI-related capabilities, like pipelines, jobs, CI/CD variables, environments, schedules, and container registries, are a consequence of that integration.
  3. It supports pipelines as code: We can declare all our pipelines in version-controlled configuration files instead of configuring them through a GUI.
  4. It supports horizontal autoscaling: To run hundreds of jobs for many developers, all in real time, our system must be able to scale workers horizontally on demand.
  5. It supports directed acyclic graphs (DAGs): This capability makes our integrations as fast as possible, because jobs depend only on what they actually need. It is a must when implementing a monorepo strategy like ours.
  6. Highly versatile configuration: As every piece of software usually has its own needs when it comes to building, testing, and deploying, GitLab CI offers a vast set of options that range from parallelism, static pages, and services to includes, workflows, and artifacts.
  7. Highly versatile infrastructure: The AWS autoscaler supports configuring S3 caching, machine type, maximum number of machines, spot instances, C5d instances with SSD disks, EBS volumes, off-peak periods, tagging, and the maximum number of builds before a machine is destroyed, among many other options. A highly versatile CI matters because our development cycle completely depends on it, so we expect clockwork-like responsiveness and computing that is as fast as possible.
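As an illustration of the DAG capability, pipelines can declare per-job dependencies with the needs keyword. In this sketch (job names and scripts are hypothetical), each test job starts as soon as its own build finishes instead of waiting for the whole build stage:

```yaml
# Two independent products in a monorepo.
build-app-a:
  stage: build
  script: ./build.sh app-a

build-app-b:
  stage: build
  script: ./build.sh app-b

test-app-a:
  stage: test
  needs: [build-app-a]  # does not wait for build-app-b
  script: ./test.sh app-a

test-app-b:
  stage: test
  needs: [build-app-b]  # does not wait for build-app-a
  script: ./test.sh app-b
```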


The following alternatives were considered but not chosen:

  1. Jenkins: It did not support pipelines as code at the time it was reviewed.
  2. TravisCI: It required licensing for private repositories at the time it was reviewed.
  3. CircleCI: It did not support GitLab, it was very expensive, and it was not as configurable.
  4. Buildkite: It is still pending review.


We use GitLab CI for:

  1. Running all our CI/CD jobs.
  2. Managing all our CI pipelines as code.
  3. Configuring our AWS autoscaler as code.
  4. Implementing a Continuous Delivery approach for our development cycle. Although the whole process is automated, including deployments to both development and production, a developer must still approve a merge request before changes can be deployed to production.
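A sketch of this Continuous Delivery setup (job names, scripts, and rules are illustrative): development deploys run on every pipeline, while the production deploy is restricted to the default branch, so it only triggers automatically once an approved merge request has been merged:

```yaml
deploy-development:
  stage: deploy
  environment: development
  script: ./deploy.sh development

deploy-production:
  stage: deploy
  environment: production
  rules:
    # Runs only on the default branch, i.e. after a
    # developer-approved merge request has been merged.
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script: ./deploy.sh production
```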

We do not use GitLab CI for:

  1. Highly time-consuming schedules that take longer than six hours, like analytics ETLs and machine learning training. We use AWS Batch instead. GitLab CI is not meant to run jobs for that many hours; they are often terminated before they can finish, mainly due to disconnections between the worker running the job and its GitLab CI bastion.



  1. Any changes to the CI pipelines must be done via Merge Requests.
  2. Any changes to the AWS autoscaler infrastructure must be done via Merge Requests by modifying its Terraform module.
  3. To learn how to test and apply infrastructure via Terraform, visit the Terraform Guidelines.
  4. If a scheduled job takes longer than six hours, it generally should run in AWS Batch; otherwise it can run in GitLab CI.
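To help enforce the six-hour guideline, GitLab CI supports a per-job timeout. A hypothetical scheduled job (name and script are illustrative) could cap itself like this:

```yaml
nightly-etl-check:
  timeout: 6h  # anything expected to exceed this belongs in AWS Batch
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script: ./run-check.sh
```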


We use:

  1. terraform-aws-gitlab-module for defining our CI as code.
  2. AWS Lambda for cleaning up orphaned machines every hour.
  3. AWS DynamoDB for locking Terraform states and avoiding race conditions.
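A minimal sketch of the state-locking setup, assuming an S3 backend (the bucket and table names below are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket = "example-terraform-states"  # hypothetical bucket
    key    = "ci/terraform.tfstate"
    region = "us-east-1"

    # The DynamoDB table holds a lock entry per state file, so two
    # concurrent applies cannot corrupt the same state.
    dynamodb_table = "example-terraform-locks"  # hypothetical table
  }
}
```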

Tuning the CI

Any team member can tune the CI for a specific product by modifying the values passed in the runners section of the Terraform module.

One of the most important values is the idle-count, as it:

  1. Specifies how many idle machines should be waiting for new jobs. The more jobs a product's pipelines have, the more idle machines it should keep. You can take the integrates-small runner as a reference.
  2. Dictates the rate at which the CI turns on new machines: if a pipeline with 100 jobs is triggered on a CI with idle-count = 8, machines are turned on in batches of 8 until demand stabilizes.
  3. More information about how the autoscaling algorithm works can be found here.
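As a rough sketch, tuning a runner might look like this inside the module's runners section (the module path and attribute names here are illustrative, not the module's actual variables; check the real module before editing):

```hcl
module "ci_runners" {
  source = "./terraform-aws-gitlab-module"  # illustrative path

  runners = {
    example-small = {
      # 8 warm machines waiting for jobs; scale-ups also
      # happen in batches of 8 until demand stabilizes.
      idle_count   = 8
      machine_type = "c5d.large"  # illustrative
    }
  }
}
```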


As we use a multi-bastion approach, the following tasks can help when debugging the CI.

Review Gitlab CI/CD Settings

If you're an admin in GitLab, you can visit the CI/CD settings to validate that bastions are communicating properly.

Inspect infrastructure

You can inspect both bastions and workers from the AWS EC2 console. Another useful place to look when you suspect spot-capacity issues is the spot requests view.

Connect to bastions or workers

You can connect to any bastion or worker using AWS Session Manager.

Just go to the AWS EC2 console, select the instance you want to connect to, click on Connect, and start a Session Manager session.

Debugging the bastion

Typical things you want to look at when debugging a bastion are:

  • docker-machine commands. This will allow you to inspect and access workers with commands like docker-machine ls, docker-machine inspect <worker>, and docker-machine ssh <worker>.
  • /var/log/messages for relevant logs from the gitlab-runner service.
  • /etc/gitlab-runner/config.toml for bastion configurations.

Debugging a specific CI job

You can find out which machine ran a job by looking at its logs.

For this example job, the line Running on runner-cabqrx3c-project-20741933-concurrent-0 via runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70... tells us that the worker named runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70 ran it.

From there you can access the bastion and debug the worker's memory or disk usage.