


We use Batch for running batch processing jobs in the cloud. Below are the main reasons why we chose it over the alternatives we considered:


GitLab CI

We used GitLab CI before implementing Batch. We migrated because GitLab CI is not intended to run scheduled jobs that take many hours; jobs often became unresponsive before they could finish, mainly due to disconnections between the worker running the job and the GitLab CI Bastion. On top of this, GitLab CI limits the number of schedules per project, and running thousands of jobs puts a lot of pressure on the GitLab coordinator and the GitLab CI Bastion.


Advantages

  • Handles submission of duplicated jobs
  • Gives us logging, monitoring, and stability measurements out-of-the-box
  • We can separate costs by having different queues (associated with different deployments)
  • Notifications out-of-the-box to email and other channels
  • Supports pipelines out-of-the-box
  • It has an API to query information about past jobs on a pipeline and trigger new builds, which is much more flexible than Batch's API

Disadvantages

  • Much more expensive.

Kubernetes Jobs


Advantages

  • Allows better separation of costs.

Disadvantages

  • It requires manually kick-starting a build, because it does not automatically listen to a queue the way Batch does.






Schedules are a powerful way to run tasks periodically.

You can find all schedules here.

Creating a new schedule

We strongly advise you to take a look at the existing schedules to get an idea of what is required.

Some special considerations are:

  1. The scheduleExpression option follows the AWS schedule expression syntax.
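For reference, AWS schedule expressions come in two forms, rate(...) and cron(...). The values below are illustrative, not taken from existing schedules:

```
rate(12 hours)        # run every 12 hours
cron(0 10 * * ? *)    # run every day at 10:00 UTC
```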

Testing the schedules

Schedules are tested by two Makes jobs:

  1. m . /common/compute/schedule/test ensures that
    • all schedules comply with a given schema;
    • all schedules have at least one maintainer with access to the universe repository;
    • every schedule is reviewed by a maintainer on a monthly basis.
  2. m . /deployTerraform/commonCompute tests the infrastructure that will be deployed when new schedules are created
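Both jobs can be run locally with Makes; for example, from the root of the universe repository:

```shell
# Validate schedule definitions (schema, maintainers, review dates)
m . /common/compute/schedule/test

# Test the Terraform infrastructure for schedules
m . /deployTerraform/commonCompute
```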

Deploying schedules to production

Once a schedule reaches production, the infrastructure required to run it is created.

Technical details can be found here.

Local reproducibility in schedules

Once a new schedule is declared, a Makes job with the format computeOnAwsBatch/schedule_<name> is created for local reproducibility.

Generally, to run any schedule, all you need to do is export the UNIVERSE_API_TOKEN variable. Bear in mind that data.nix is the single source of truth regarding schedules: everything is defined there, with a few exceptions.
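For example, to reproduce a hypothetical schedule called example locally (both the token value and the schedule name below are placeholders):

```shell
# Placeholder: obtain a real UNIVERSE_API_TOKEN before running
export UNIVERSE_API_TOKEN="<your-token>"

# Run the Makes job generated for the schedule (name is illustrative)
m . /computeOnAwsBatch/schedule_example
```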