- Running automated processes on every commit.
- Automating application testing.
- Automating application deployment.
- Automating every QA test we can think of.
By having highly automated workflows, we become capable of deploying applications many times a day without sacrificing quality or security.
The main reasons why we chose Gitlab CI over other alternatives are:
- It is open source.
- Built-in support for Gitlab: As Gitlab is the platform we use for our product repository, being able to easily integrate our CI solution with it is an advantage for us. All GUI-related capabilities like pipelines, jobs, CI/CD variables, environments, schedules, and container registries are a consequence of such integration.
- It supports pipelines as code: We can declare all our pipelines in version-controlled YAML files that live alongside the source code.
- It supports horizontal autoscaling: In order to be able to run hundreds of jobs for many developers, all in real time, our system must support horizontal autoscaling.
- It supports directed acyclic graphs (DAGs): Such capability allows us to make our integrations as fast as possible, as jobs exclusively depend on what they really should. It is a must when implementing a monorepo strategy like ours.
- Highly versatile configurations: As every piece of software usually has its own needs when it comes to building, testing, and deploying, Gitlab CI offers a vast set of configurations that range from parallelism, static pages, and services to includes, workflows, and artifacts.
- Highly versatile infrastructure: The AWS autoscaler allows configuring the S3 cache, machine type, maximum number of machines, spot instances, SSD disk usage for c5d instances, EBS disks, off-peak periods, tagging, and maximum builds before destruction, among many others. A highly versatile CI matters because our development cycle completely depends on it, which leads us to expect clockwork-like responsiveness and as-fast-as-possible computing speed.
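As an illustration of the pipelines-as-code and DAG capabilities listed above, a minimal `.gitlab-ci.yml` might look like the following sketch (job names and scripts are hypothetical, not taken from our actual pipelines):

```yaml
# Hypothetical .gitlab-ci.yml sketch: jobs declare explicit
# dependencies with `needs`, forming a DAG so each job starts
# as soon as its real dependencies finish.
stages:
  - build
  - test
  - deploy

build-app:
  stage: build
  script:
    - ./build.sh

lint-app:
  stage: test
  needs: []           # No dependencies: runs immediately, in parallel with build-app
  script:
    - ./lint.sh

test-app:
  stage: test
  needs: [build-app]  # Starts as soon as build-app finishes, not the whole stage
  script:
    - ./test.sh

deploy-app:
  stage: deploy
  needs: [test-app]
  script:
    - ./deploy.sh
```

In a monorepo, this `needs` keyword is what keeps unrelated products from waiting on each other's jobs.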
The following alternatives were considered but not chosen for the following reasons:
- Jenkins: It did not support pipelines as code at the time it was reviewed.
- TravisCI: It required licensing for private repositories at the time it was reviewed.
- CircleCI: It did not support Gitlab, it was very expensive, and it was not as parameterizable.
- Buildkite: It is still pending review.
We use Gitlab CI for:
- Running all our CI/CD jobs.
- Managing all our CI pipelines as code.
- Configuring our AWS autoscaler as code.
- Implementing a Continuous Delivery approach for our development cycle. This means that although the whole process is automated, including deployments for both development and production, a manual merge request approval from a developer is still required in order to be able to deploy changes to production.
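The Continuous Delivery gate described above can be sketched in `.gitlab-ci.yml`: the production job only runs on the default branch, so changes reach it exclusively through an approved and merged merge request (job names and scripts below are hypothetical):

```yaml
# Hypothetical sketch: development deployments run on feature branches,
# while the production deployment only runs once an approved merge
# request has been merged into the default branch.
deploy-development:
  stage: deploy
  script:
    - ./deploy.sh development
  rules:
    - if: $CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH

deploy-production:
  stage: deploy
  script:
    - ./deploy.sh production
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  environment: production
```

`CI_COMMIT_BRANCH` and `CI_DEFAULT_BRANCH` are predefined Gitlab CI variables, so the whole process stays automated while merge request approval remains the only manual step.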
We do not use Gitlab CI for:
- Highly time-consuming schedules that take longer than six hours, like analytics ETLs, machine learning training, among others. We use AWS Batch instead. The reason for this is that Gitlab CI is not meant to run jobs that take that many hours, often resulting in jobs being terminated before they can finish, mainly due to disconnections between the worker running the job and its Gitlab CI bastion.
- Any changes to the CI pipelines must be done via Merge Requests.
- Any changes to the AWS autoscaler infrastructure must be done via Merge Requests by modifying its Terraform module.
- To learn how to test and apply infrastructure via Terraform, visit the Terraform Guidelines.
- If a scheduled job takes longer than six hours, it generally should run in Batch, otherwise it can use the Gitlab CI.
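For scheduled jobs that stay on Gitlab CI, an explicit timeout below the six-hour threshold enforces the rule above in code. A hedged sketch (job name, timeout value, and script are illustrative assumptions, not our actual configuration):

```yaml
# Hypothetical scheduled job kept on Gitlab CI because it finishes
# well under the six-hour threshold; longer schedules go to AWS Batch.
nightly-etl-check:
  stage: test
  timeout: 3 hours   # Fail fast instead of risking bastion/worker disconnections
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - ./run_short_etl.sh
```

`timeout` and the `schedule` pipeline source are standard Gitlab CI keywords, so the job simply never runs long enough to hit the disconnection problem described above.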
- terraform-aws-gitlab-module for defining our CI as code.
- AWS Lambda for hourly cleanup of orphaned machines.
- AWS DynamoDB for locking Terraform states and avoiding race conditions.
Tuning the CI
Any team member can tune the CI for a specific product by modifying the values passed to it in the runners section of the Terraform module.
One of the most important values is `idle-count`, as it:
- Specifies how many idle machines should be waiting for new jobs. The more jobs a product pipeline has, the more idle machines it should have. You can take the integrates-small runner as a reference.
- Dictates the rate at which the CI turns on new machines: if a pipeline with 100 jobs is triggered for a CI with `idle-count = 8`, it will turn on new machines in batches of 8 until it stabilizes.
- More information about how the autoscaling algorithm works can be found here.
As we use a multi-bastion approach, the following tasks are useful when debugging the CI.
Review Gitlab CI/CD Settings
Connect to bastions or workers
You can connect to any bastion or worker using AWS Session Manager. Just go to the AWS EC2 console, select the instance you want to connect to, and start a Session Manager session.
Debugging the bastion
Typical things you want to look at when debugging a bastion are:
- `docker-machine` commands. These allow you to inspect and access workers with commands like `docker-machine inspect <worker>` and `docker-machine ssh <worker>`.
- `/var/log/messages` for relevant logs from the runner service.
- `/etc/gitlab-runner/config.toml` for bastion configurations.
Debugging a specific CI job
You can know which machine ran a job by looking at its logs. For this example job, the line `Running on runner-cabqrx3c-project-20741933-concurrent-0 via runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70...` tells us that the worker named `runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70` was the one that ran it.
From there you can access the bastion and run memory or disk debugging.