TLDR: We introduce KernelMiniBench: a minimal subset of 160 KernelBench problems that closely reproduces the full benchmark's evaluation statistics for agentic LLMs, enabling faster yet reliable experiments. The 160 problems are split across levels 1, 2, 3 and 5 as 59, 61, 29 and 10 problems respectively. You can find KernelMiniBench on HuggingFace via this link.
At Mako, we design LLM-driven agents that autonomously generate, optimize, and evaluate GPU kernels for peak performance. One widely recognized benchmark we use to evaluate our agents' capabilities is KernelBench [1], presented at ICML 2025, which measures LLMs' ability to write efficient GPU kernels from PyTorch reference code.
KernelBench contains over 250 PyTorch programs spanning four difficulty levels:
Level 1 (100 tasks): Single-kernel operators (e.g., convolution, matrix multiply, layer normalization)
Level 2 (100 tasks): Simple fusion patterns (e.g., convolution + bias + ReLU, matmul + scale + sigmoid)
Level 3 (50 tasks): Full model architectures (e.g., MobileNet, VGG, MiniGPT)
Level 4 (~20 tasks): Aspirational tasks drawn from Hugging Face models—more complex, library-level code transformations
Evaluating on Level 4 is not practical because the problems are not provided directly as raw code but as HuggingFace links to models. We exclude Level 4 and instead include a new Level 5, drawn from the METR report on KernelBench, which features state-of-the-art end-to-end models such as Llama 2 and DeepSeek.
Below is an example result comparing one of our agents versus OpenAI o3 as a baseline (utilizing the same prompt) for level 1:

The red columns represent an agent that uses OpenAI o3 in a simple loop, giving it three tries to produce the fastest solution possible. The blue columns show our agent, MakoGenerate, with the same number of attempts.
A few observations can be made here: MakoGenerate outperforms PyTorch's torch.compile on most problems (the dashed line indicates a 1X speedup), with many cases delivering 1.5X or greater speedups.
That said, our objective in this article is not specifically about optimizing kernels. It's about optimizing the KernelBench benchmark itself! Why, you ask? Let's look at Google's AlphaEvolve paper for the answer.
By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini’s training time. … AlphaEvolve significantly reduces the engineering time required for kernel optimization, from weeks of expert effort to days of automated experiments.
It took Google DeepMind "days of automated experiments" to generate a single kernel using their proprietary, state-of-the-art framework, AlphaEvolve. At that rate, evaluating such a system over all of the KernelBench problems could take a year (or a great deal more compute). Our MakoGenerate framework would benefit from downscaling the set of problems in KernelBench while maintaining a similar level of representativeness. Our solution to this problem is the focus of this blog post.
Initial Analysis
In statistics, the accuracy of estimating a population parameter from a random sample is largely determined by the underlying variance: a dataset whose instances (problems) are highly varied will be approximated less accurately than one with low variance, for any fixed random-sample size.
To study this aspect of KernelBench, we used an internally tracked dataset of roughly 50 experiments run with different agentic systems. For each problem we logged the agent's functional correctness and measured speedup, and for each run we recorded the aggregate statistics across problems: (1) the percentage of kernels that outperform torch.compile (the "beat-compile percentage"), (2) the functionality rate, and (3) the geometric-mean (geomean) speedup relative to torch.compile.
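For concreteness, here is a minimal sketch (in Python/NumPy, not our internal tooling) of how these three aggregates can be computed from per-problem logs; the function and variable names are illustrative, and details such as how non-functional kernels enter the geomean are assumptions of this sketch.

```python
import numpy as np

def aggregate_metrics(speedups, functional):
    """Aggregate per-problem results into the three benchmark-level metrics.

    speedups   : per-problem speedup vs. torch.compile (1.0 = parity)
    functional : per-problem flag, True if the generated kernel was correct
    """
    speedups = np.asarray(speedups, dtype=float)
    functional = np.asarray(functional, dtype=bool)

    # A kernel only counts as "beating" torch.compile if it is also correct.
    beat_compile_pct = 100.0 * np.mean(functional & (speedups > 1.0))
    functionality_rate = 100.0 * np.mean(functional)
    # Geometric mean of speedups, restricted here to the correct kernels
    # (how non-functional kernels are treated is an assumption of this sketch).
    geomean_speedup = np.exp(np.mean(np.log(speedups[functional])))

    return beat_compile_pct, functionality_rate, geomean_speedup
```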

This plot shows the mean absolute deviation (weighted by the number of problems attempted) of the three main metrics we use in experiments, as a function of random-sample size; sampling is repeated at each size to obtain confidence intervals. It is like going back in time for experiments we had already completed in full and asking how different the aggregate evaluation metrics would have been had we evaluated only a random subset of the KernelBench problems.
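Conceptually, the experiment behind the plot looks like the sketch below, reusing the illustrative aggregate_metrics helper from above; the repeat count, the weighting details, and the exact confidence-interval construction are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_deviation(speedups, functional, keep_fraction, n_repeats=100):
    """How far do the aggregate metrics drift if only a random subset of
    problems is evaluated? (blacklist ratio = 1 - keep_fraction)"""
    speedups = np.asarray(speedups, dtype=float)
    functional = np.asarray(functional, dtype=bool)
    full = np.array(aggregate_metrics(speedups, functional))

    deviations = []
    for _ in range(n_repeats):
        idx = rng.choice(len(speedups),
                         size=max(1, int(keep_fraction * len(speedups))),
                         replace=False)
        sub = np.array(aggregate_metrics(speedups[idx], functional[idx]))
        deviations.append(np.abs(sub - full))

    # Mean absolute deviation per metric, plus its spread across repeats.
    return np.mean(deviations, axis=0), np.std(deviations, axis=0)
```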
We can draw the following conclusions from the plot above:
If operating on all levels, it is safe to consider a random sample of 30% (corresponding to a blacklist ratio of 70% on the plot), which should keep us within 2% of the actual beat-compile percentage, functionality rate and geomean speedup metrics.
For level 1, to maintain similar error bounds let your random sample be at least 50% of the problems
For level 2, the corresponding percentage for the random sample is around 45%
For level 3, it's also around 50%
For level 5, it's around 75-80%
You may wonder why the per-level requirements are stricter. The answer is that inter-level interactions (e.g., correlations) make it easier to take smaller samples from the full dataset (all levels combined) than from any individual level.
Correlation Analysis
Speaking of interactions, we have also analyzed the correlation in speedup and functionality metrics across problems (intra-level and inter-level). Here is a sample from the intra-level results:

For Level 1, clustering reveals two tightly correlated groups and several near-uncorrelated outliers. Inspecting KernelBench shows the first ~20 entries are matmul variants and the next ~10 are memory-bound nonlinear functions, closely matching the correlation structure. This implies many of these problems could be pruned without losing coverage of distinct kernel semantics.

This is another correlation plot, depicting interactions within level 3, where we see much weaker correlation overall. Notably, the first three problems appear as a dense cluster; these map to three MLP kernels in the benchmark, explaining their mutual similarity.

After studying correlation within each individual level, we produced this plot, which measures the extent of high correlations across the whole dataset. We observe that over 75 problems have a Pearson correlation of 0.8 or higher in speedup/functionality metrics with at least one other problem.
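As an illustration of how such a count can be obtained, the sketch below computes the problem-by-problem Pearson correlation matrix from a (problems × runs) results matrix and counts problems that have at least one strongly correlated partner; this is a simplified stand-in for our analysis (e.g., it ignores missing runs).

```python
import numpy as np

def count_highly_correlated(results, threshold=0.8):
    """results: (n_problems, n_runs) matrix, e.g. the speedup of each problem
    in each agentic experiment. Counts problems whose Pearson correlation with
    at least one *other* problem reaches the threshold."""
    corr = np.corrcoef(results)          # rows are problems -> (n, n) matrix
    np.fill_diagonal(corr, 0.0)          # ignore self-correlation
    return int(np.sum(np.max(corr, axis=1) >= threshold))
```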
Benchmark Reduction by Clustering
Motivated by the pronounced similarity among many KernelBench problems, we adopted a clustering-based reduction to eliminate redundancy while preserving representative coverage of distinct kernel behaviors. K-means clustering is not a suitable option because its cluster centers are not guaranteed to be actual problems; K-medoids, by design, selects a problem (the medoid) to represent each cluster.
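To make the idea concrete, here is a compact, self-contained K-medoids sketch (alternating assignment and medoid updates over per-problem result vectors); our actual pipeline may differ in distance metric, initialization, and library choice, so treat this as illustrative rather than the exact implementation.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Alternating K-medoids over problem feature vectors X (n_problems, n_features).
    Returns the indices of the medoid problems, i.e. the reduced benchmark."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Pairwise Euclidean distances between problem vectors.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every problem to its nearest medoid.
        labels = np.argmin(dists[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # The new medoid is the member minimizing total distance to its cluster.
            within = dists[np.ix_(members, members)]
            new_medoids[c] = members[np.argmin(within.sum(axis=1))]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return np.sort(medoids)
```

Calling k_medoids(feature_matrix, k=160) on a matrix whose rows are the per-problem result vectors then returns the indices of the representative problems.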

The figure above shows our clustering results. Each problem is represented by a high-dimensional vector composed of its experiment results. Some clear groupings emerge: convolution kernels cluster together (see the bottom row), while some nonlinear functions form a distinct cluster (fourth row). Note that the clusters were found based purely on similarity in evaluation results; the clustering algorithm has no knowledge of the semantics of the different problems.
We tuned the number of clusters and other hyperparameters and converged on 160 clusters. The following table compares the mean absolute deviation in metrics (relative to the ground truth of the full benchmark) for the subset we found against random samples of the same size (we considered 50 random samples). The cluster medoids, i.e. the 160 representative problems, form our KernelMiniBench.

The ± std in parentheses is the standard deviation of the mean absolute error across different agentic experiments, while the ± std outside the parentheses (shown only for the random baseline) captures the variation across different random samples. On the std over agentic experiments, the mini-bench looks at least as good, so we can set that term aside for both; looking at the outer std, the mini-bench performs close to, or better than, the typical best case of random sampling.
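The comparison in the table can be reproduced in spirit with a sketch like the one below, which measures, for a single agentic run, how far the medoid subset's aggregates deviate from the full-benchmark aggregates versus 50 random subsets of the same size (again reusing the illustrative aggregate_metrics helper; in the table these deviations are additionally averaged over our agentic experiments).

```python
import numpy as np

def compare_subset_to_random(speedups, functional, subset_idx,
                             n_random=50, seed=0):
    """MAD of aggregate metrics for a fixed subset (e.g. the 160 medoids)
    vs. random subsets of the same size, for one agentic run."""
    rng = np.random.default_rng(seed)
    speedups = np.asarray(speedups, dtype=float)
    functional = np.asarray(functional, dtype=bool)
    full = np.array(aggregate_metrics(speedups, functional))

    # Deviation of the fixed (clustering-derived) subset.
    mini = np.array(aggregate_metrics(speedups[subset_idx], functional[subset_idx]))
    mini_mad = np.abs(mini - full)

    # Deviations of random subsets of the same size.
    random_mads = []
    for _ in range(n_random):
        idx = rng.choice(len(speedups), size=len(subset_idx), replace=False)
        sub = np.array(aggregate_metrics(speedups[idx], functional[idx]))
        random_mads.append(np.abs(sub - full))

    return mini_mad, np.mean(random_mads, axis=0), np.std(random_mads, axis=0)
```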
KernelMiniBench
Based on the clustering analysis presented, we introduce KernelMiniBench: a minimal subset of 160 KernelBench problems that closely reproduces the full benchmark's evaluation statistics for agentic LLMs, enabling faster yet reliable experiments. The 160 problems are split across levels 1, 2, 3 and 5 as 59, 61, 29 and 10 problems respectively. You can find KernelMiniBench on HuggingFace via this link.
Conclusions
At Mako we build agents that generate, refine and optimize kernels with evolutionary computation.
We observed a moderate amount of redundancy in one of the most recognized benchmarks for kernel optimization and downscaled the dataset while maintaining a reasonable level of representativeness.
Semantic relationships between kernels can be inferred purely from the similarity of their evaluation profiles.
References
[1] Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, & Azalia Mirhoseini. KernelBench: Can LLMs Write Efficient GPU Kernels? arXiv preprint arXiv:2502.10517 (2025).