BigCodeBench

Last updated: Feb 9, 2026

Rationale

We use BigCodeBench to select the best model for our AI features.

The main reasons we chose it over the alternatives are:

It is focused on code generation and solving complex programming tasks that require the use of imports and function calls.
Its dataset consists of tasks similar to the prompts we use in our use cases.
It offers a user-centric approach, incorporating diverse, real-world scenarios using Stack Overflow as a source.
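It scores each model with two different prompting methods (completing code from a docstring and generating code from a natural-language instruction) and also reports their average.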
It is open source.
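
To illustrate how the averaged rating can drive model selection, here is a minimal sketch. It assumes each model has one score per prompting method (labeled "complete" and "instruct" here). The model names and numbers are hypothetical placeholders, not real benchmark results, and the helper functions are illustrative only.

```python
from statistics import mean

# Hypothetical placeholder scores (pass@1, in percent); NOT real results.
scores = {
    "model-a": {"complete": 50.0, "instruct": 40.0},
    "model-b": {"complete": 48.0, "instruct": 46.0},
}


def average_score(split_scores):
    """Average the scores obtained with the two prompting methods."""
    return mean([split_scores["complete"], split_scores["instruct"]])


def best_model(all_scores):
    """Return the name of the model with the highest averaged score."""
    return max(all_scores, key=lambda name: average_score(all_scores[name]))


if __name__ == "__main__":
    for name, split_scores in scores.items():
        print(name, average_score(split_scores))
    print("best:", best_model(scores))
```

Ranking on the average rather than on a single method avoids favoring a model that does well with only one prompting style.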

The alternatives we considered and ruled out are:

HumanEval: Specific to Python and limited to short, self-contained functions, with a very small test dataset (164 problems), making it difficult to draw meaningful conclusions.
MBPP (Mostly Basic Python Problems): Specific to Python, featuring very simple problems.
RepoBench: Measures code autocompletion, which is an adjacent problem but not the primary goal of our use case.
CRUXEval: Focuses on evaluating how well models understand code to predict function inputs or outputs. This is not very relevant to our use case of refactoring.
DS-1000: Focused on data science problems rather than general-purpose code generation.

In practice, the benchmark results drive model selection for the features that suggest vulnerability fixes within our platform and VS Code extension.