Batch
Rationale
We use Batch for running batch processing jobs in the cloud. The main reasons we chose it over the alternatives are the following:
- It is SaaS (software as a service), so we do not need to manage any infrastructure directly.
- It is free, so we only have to pay for the Elastic Compute Cloud (EC2) machines we use to process workloads.
- It complies with several ISO and CSA certifications, many of which focus on verifying that the entity follows best practices for secure cloud-based environments and information security.
- We can monitor job logs using CloudWatch.
- The jobs are highly resilient, which means they rarely become unresponsive. This is very important when jobs take several days to finish.
- It supports EC2 spot instances, which considerably decreases EC2 costs.
- All its settings can be written as code using Terraform.
- We can use Nix to queue jobs easily.
- It supports priority-based queuing, which allows us to prioritize jobs by assigning them to one queue or another.
- It supports automatic retries of jobs.
- It integrates with Identity and Access Management (IAM), allowing us to keep a least-privilege approach to authentication and authorization.
- EC2 workers running jobs can be monitored using CloudWatch.
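To illustrate the infrastructure-as-code, spot-instance, and priority-queue points above, here is a minimal Terraform sketch. All resource names, variables, and values are hypothetical, not taken from our actual configuration:

```hcl
# Hypothetical sketch: one spot-backed compute environment feeding
# two job queues with different priorities (all names illustrative).
resource "aws_batch_compute_environment" "spot" {
  compute_environment_name = "spot-workers"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch_service.arn

  compute_resources {
    type                = "SPOT"
    allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"
    max_vcpus           = 256
    min_vcpus           = 0
    instance_type       = ["c5", "m5"]
    subnets             = var.subnets
    security_group_ids  = var.security_groups
    instance_role       = aws_iam_instance_profile.ecs_instance.arn
  }
}

# Jobs submitted to the higher-priority queue are scheduled first
# when both queues share the same compute environment.
resource "aws_batch_job_queue" "high" {
  name                 = "high-priority"
  state                = "ENABLED"
  priority             = 10
  compute_environments = [aws_batch_compute_environment.spot.arn]
}

resource "aws_batch_job_queue" "low" {
  name                 = "low-priority"
  state                = "ENABLED"
  priority             = 1
  compute_environments = [aws_batch_compute_environment.spot.arn]
}
```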
Alternatives
GitLab CI
We used GitLab CI before implementing Batch. We migrated because GitLab CI is not intended for scheduled jobs that take many hours; jobs often became unresponsive before they could finish, mainly due to disconnections between the worker running the job and the GitLab CI Bastion. On top of this, GitLab CI limits the number of schedules per project, and running thousands of jobs puts a lot of pressure on the GitLab coordinator and the GitLab CI Bastion.
Buildkite
Pros:
- Handles submission of duplicated jobs
- Gives us logging, monitoring, and stability measurements out-of-the-box
- We can separate costs by having different queues (associated with different deployments)
- Notifications out-of-the-box to email and others
- Supports pipelines out-of-the-box
- Has an API to query information about past jobs in a pipeline and trigger new builds, which is much more flexible than Batch's API
Cons:
- Much more expensive.
Kubernetes Jobs
https://kubernetes.io/docs/concepts/workloads/controllers/job/
Pros:
- Allows better separation of costs.
Cons:
- It requires manually kick-starting builds because, unlike Batch, it does not automatically listen to a queue.
Usage
We use Batch for running:
- Production background schedules for all our components.
- ARM background tasks like cloning roots and refreshing targets of evaluation.
Guidelines
General
- You can access the Batch console after authenticating to AWS.
- Any changes to Batch infrastructure must be done via merge requests.
- You can queue new jobs to Batch using the compute-on-aws module.
- If a scheduled job takes longer than six hours, it should generally run in Batch; otherwise, you can use GitLab CI.
- To learn how to test and apply infrastructure via Terraform, visit the Terraform Guidelines.
- Terraform infrastructure for such schedules is also provisioned.
Schedules
Schedules are a powerful way to run tasks periodically.
You can find all schedules here.
Creating a new schedule
We highly advise taking a look at the existing schedules to get an idea of what is required.
Some special considerations are:
- The scheduleExpression option follows the AWS schedule expression syntax.
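For reference, AWS schedule expressions come in two forms: six-field cron expressions and rate expressions. A hypothetical schedule entry might look like the following; only the scheduleExpression option is confirmed by this document, and the schedule name is illustrative:

```nix
# Hypothetical schedule entry (the attribute name "exampleSchedule"
# is a placeholder; only scheduleExpression is documented here).
{
  exampleSchedule = {
    # AWS cron has six fields:
    # minute hour day-of-month month day-of-week year.
    # This one fires every day at 06:00 UTC.
    scheduleExpression = "cron(0 6 * * ? *)";
    # A rate expression is also valid, e.g.:
    # scheduleExpression = "rate(12 hours)";
  };
}
```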
Testing the schedules
Schedules are tested by two Makes jobs:
- m . /common/compute/schedule/test grants that:
  - all schedules comply with a given schema;
  - all schedules have at least one maintainer with access to the universe repository;
  - every schedule is reviewed by a maintainer on a monthly basis.
- m . /deployTerraform/commonCompute tests the infrastructure that will be deployed when new schedules are created.
Deploying schedules to production
Once a schedule reaches production, the infrastructure required to run it is created.
Technical details can be found here.
Local reproducibility in schedules
Once a new schedule is declared, a Makes job with the format computeOnAwsBatch/schedule_<name> is created for local reproducibility. Generally, to run any schedule, all that is necessary is to export the UNIVERSE_API_TOKEN variable.

Bear in mind that data.nix becomes the single source of truth regarding schedules. Everything is defined there, albeit with a few exceptions.
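The local reproducibility workflow above can be sketched as follows; the schedule name "example" and the token value are placeholders, not real values:

```shell
# Export the required token (value is a placeholder;
# use your own credentials).
export UNIVERSE_API_TOKEN="<your-token>"

# Run a schedule locally through its generated Makes job.
# "example" stands in for a real schedule name from data.nix.
m . /computeOnAwsBatch/schedule_example
```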