15x Faster CUDA Kernel Compilation for MakoGenerate
At Mako, our goal is to automatically unlock peak GPU performance through GPU kernel generation and optimization. When working with MakoGenerate - our AI agent for writing high-performance GPU kernels - compilation can become a critical bottleneck during both reinforcement fine-tuning of the model and iterative code-generation workflows. In this post, we explore why reducing compile times matters and how our engineering team implemented a fast CUDA compilation path that delivers up to 15x faster compilation.
Why Compilation Time Matters
Fine-Tuning for Syntactic Correctness: During the fine-tuning process, MakoGenerate writes candidate kernel code, which we compile to verify syntactic validity or identify compilation errors. Each round of evaluation may involve thousands of compile attempts. Slow compilation directly impacts iteration speed and increases cloud compute costs.
Iterative Code Generation and Benchmarking: When using MakoGenerate in production, we generate a kernel, compile it, check it for correctness, and then run benchmarks to measure performance. Since benchmarking only makes sense on kernels that compile successfully, every iteration requires a compile step. With the standard CUDA compilation path for building PyTorch extensions, each compile can take tens of seconds, which adds up quickly and slows down end-to-end workflows.
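To make the shape of this loop concrete, here is a minimal sketch of what one evaluation iteration might look like. The generate_kernel_source, check_correctness, and run_benchmark helpers (and the task object) are hypothetical stand-ins; only the load_inline call is real PyTorch API, and it is the step that dominates wall-clock time.

```python
import time
from torch.utils.cpp_extension import load_inline

def evaluate_candidate(task, attempt):
    # Hypothetical helper: asks MakoGenerate for candidate C++/CUDA sources.
    cpp_src, cuda_src = generate_kernel_source(task, attempt)

    start = time.perf_counter()
    try:
        # The compile step: tens of seconds per candidate on the standard path.
        module = load_inline(
            name=f"candidate_{task.id}_{attempt}",
            cpp_sources=cpp_src,
            cuda_sources=cuda_src,
            functions=["forward"],
        )
    except RuntimeError as err:
        return {"status": "compile_error", "error": str(err)}
    compile_s = time.perf_counter() - start

    if not check_correctness(module, task):  # hypothetical correctness check
        return {"status": "incorrect", "compile_s": compile_s}
    return {"status": "ok", "compile_s": compile_s,
            "latency_ms": run_benchmark(module, task)}  # hypothetical benchmark
```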
Because both fine-tuning and generation rely heavily on compile feedback, accelerating this step unlocks faster iteration cycles, lower costs, and smoother developer experiences.
Standard vs. Fast Compilation: Performance Improvements
Our team recently implemented a fast CUDA compilation path specifically tailored for MakoGenerate. Below are the key results observed on a suite of KernelBench’s level 1 problems (100 trials per metric):
Standard Compilation Time: 118.0 ± 14.0 seconds (range: 47.9 to 131.0)
Fast Compilation Time: 19.7 ± 33.8 seconds (range: 6.93 to 129.0)
Standard Reloading Time: 0.000934 ± 0.000181 seconds (range: 0.000412 to 0.00143)
Fast Reload Time: 0.000741 ± 0.000241 seconds (range: 0.000292 to 0.00114)
Fast Compilation Improvement: 14.8 ± 5.49× speedup (range: 0.811× to 18.5×)
Correctness Verification: Fast-compiled modules matched the standard outputs in 100/100 cases. Two individual evaluations (level 1 problem 37, sample 0) crashed unexpectedly under both the standard and fast paths.

These numbers demonstrate a clear reduction in end-to-end compile latency: the mean compile time drops from roughly two minutes to under twenty seconds. Even the reload path, a much smaller component, showed a modest speedup.
How the Fast Compilation Path Works
The core idea behind our fast path is to minimize dependencies on heavy Torch headers and to leverage precompiled headers wherever possible. Under the hood, the fast path intercepts any call to torch.utils.cpp_extension.load or load_inline and reorganizes the generated source code into separate pieces:
Isolate the Kernel Code: The kernel itself is written into a standalone .cu file. We strip out all torch/extension.h includes from the kernel file and replace them with only the essential standard CUDA headers (e.g., <cuda.h>, <cuda_runtime.h>). By doing so, the compiler avoids parsing bulky Torch headers when building the CUDA device code.
Split the Dispatch Logic: Any host code that references torch::Tensor or other PyTorch C++ types is moved into its own .cpp file. We compile this host file against a precompiled torch/extension.h header, dramatically reducing header processing time. At runtime, kernel launches route through a small, generated proxy dispatch function that bridges the host wrapper and the compiled CUDA kernel module. (A source-level sketch of this split appears below.)
Use Precompiled Headers (PCH): Both the host .cpp file and the proxy dispatch file are compiled against a PCH that includes torch/extension.h once. Subsequent compilations reuse the PCH, avoiding repeated parsing of PyTorch internals. This reuse is especially impactful when many compilation jobs happen in rapid succession, as during evaluation loops.
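For illustration, here is a rough sketch of how a torch/extension.h precompiled header could be produced with GCC. The header name, flags, and overall wiring are assumptions for this sketch rather than our exact build commands; in a real build, the PCH and every translation unit that includes it must be compiled with matching flags.

```python
import subprocess
import sysconfig

import torch
from torch.utils import cpp_extension

# Illustrative header name: the only thing it does is pull in torch/extension.h.
pch_header = "torch_pch.h"
with open(pch_header, "w") as f:
    f.write("#include <torch/extension.h>\n")

includes = [f"-I{p}" for p in cpp_extension.include_paths()]
includes.append(f"-I{sysconfig.get_paths()['include']}")  # Python.h for pybind11

# GCC writes torch_pch.h.gch next to the header; later compiles of files that
# start with `#include "torch_pch.h"` reuse it instead of reparsing PyTorch headers.
subprocess.check_call([
    "g++", "-x", "c++-header", pch_header, "-o", pch_header + ".gch",
    "-std=c++17", "-fPIC", "-O2",
    f"-D_GLIBCXX_USE_CXX11_ABI={int(torch._C._GLIBCXX_USE_CXX11_ABI)}",
    *includes,
])
```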
By reorganizing code in this fashion - placing only the minimal device code in the CUDA compilation unit and leveraging PCH for the host side - we minimize redundant parsing and compilation work.
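To make the split concrete, the sketch below shows the shape of the two source pieces for a toy scaling operator: a .cu file that sees only standard CUDA headers and exposes a plain C launcher, and a .cpp wrapper that is the only place torch::Tensor and torch/extension.h appear. The operator and names are illustrative, and for simplicity both pieces are fed through the standard load_inline here; our fast path instead compiles them as separate translation units, with the host side built against the PCH.

```python
import torch
from torch.utils.cpp_extension import load_inline

# Device code: no Torch headers, just CUDA plus a plain C entry point.
cuda_src = r"""
#include <cuda_runtime.h>

__global__ void scale_kernel(const float* x, float* y, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i];
}

extern "C" void launch_scale(const float* x, float* y, float alpha, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(x, y, alpha, n);
}
"""

# Host code: the only translation unit that touches torch::Tensor.
cpp_src = r"""
#include <torch/extension.h>

extern "C" void launch_scale(const float* x, float* y, float alpha, int n);

torch::Tensor scale(torch::Tensor x, double alpha) {
    auto y = torch::empty_like(x);
    launch_scale(x.data_ptr<float>(), y.data_ptr<float>(),
                 static_cast<float>(alpha), x.numel());
    return y;
}
"""

mod = load_inline(name="scale_ext", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["scale"])

x = torch.randn(1024, device="cuda")
print(torch.allclose(mod.scale(x, 2.0), 2.0 * x))
```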
Caveats and Fallbacks
While the fast path yields impressive speedups in most cases, it remains a best-effort, heuristic-driven approach. Key considerations include:
Compilation Fallback: If the kernel fails to compile via the fast path (e.g., due to templated kernel code or unusual macro usage), we automatically revert to the standard compilation process. This fallback keeps developer feedback accurate: any compilation error seen under the fast path is revalidated by the slow path to confirm that the error isn't an artifact of our heuristics. (A rough sketch of this wiring appears at the end of this section.)
Limited Scope: Currently, the fast path only covers scenarios that use torch.utils.cpp_extension.load and load_inline. Other build workflows (e.g., custom CMake-based builds or complex multi-file extensions) are untouched.
Tested on Simple Operators: Our experiments have focused on a set of 100 simple kernels and operators. While results are consistently positive there, other kernel families or more complex dispatch logic may require further heuristics and additional logic for splitting source code.
Edge Cases and Unexpected Failures: During our testing, two problem evaluations crashed under both fast and standard paths. In addition, 6/100 kernels required fallback to the slow path. As we expand test coverage, we expect to identify more such edge cases and refine our heuristics accordingly.
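As a rough illustration of the interception-and-fallback wiring described above, the sketch below monkey-patches torch.utils.cpp_extension.load_inline: it applies a simplified source rewrite (stripping torch/extension.h from the CUDA sources) and, if the rewritten build fails, recompiles the untouched sources so that any reported error comes from the standard path. The real fast path does considerably more (separate host and device translation units, the PCH, and a generated proxy dispatch), so treat this as a sketch of the control flow, not the implementation.

```python
import re
from torch.utils import cpp_extension

_standard_load_inline = cpp_extension.load_inline
_TORCH_HEADER = re.compile(r'#include\s*[<"]torch/extension\.h[>"]\s*')
_CUDA_PRELUDE = "#include <cuda.h>\n#include <cuda_runtime.h>\n"

def _strip_torch_headers(sources):
    # Simplified stand-in for the real source reorganization.
    srcs = [sources] if isinstance(sources, str) else list(sources)
    return [_CUDA_PRELUDE + _TORCH_HEADER.sub("", s) for s in srcs]

def fast_load_inline(name, cpp_sources, cuda_sources=None, **kwargs):
    """Best-effort fast path with automatic fallback to the standard build."""
    if cuda_sources is None:
        return _standard_load_inline(name, cpp_sources, **kwargs)
    try:
        return _standard_load_inline(
            name, cpp_sources,
            cuda_sources=_strip_torch_headers(cuda_sources), **kwargs)
    except Exception:
        # Revalidate with the untouched sources so any error the user sees
        # comes from the standard path, not from our heuristics.
        return _standard_load_inline(
            name, cpp_sources, cuda_sources=cuda_sources, **kwargs)

# Route every subsequent load_inline call through the wrapper.
cpp_extension.load_inline = fast_load_inline
```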
Putting It All Together: Benefits for MakoGenerate Users
By integrating the fast CUDA compilation path into MakoGenerate’s code generation pipeline, we achieve:
Faster Fine-Tuning Iterations: When evaluating candidate kernel outputs during model training, each compile-and-validate loop runs significantly faster. Engineers can iterate on prompt designs and data selection with less idle time.
Lower Cloud Compute Costs: GPU kernels are often tested at scale (hundreds to thousands of compile calls). Cutting compile latency by up to 15× translates directly into fewer billed GPU/CPU hours.
Improved Developer Productivity: Data scientists and researchers get quicker feedback loops when prototyping new kernels or experimenting with novel dispatch strategies.
Scalable Benchmarking: When benchmarking MakoGenerate-written kernels against hand-tuned baselines, faster compilation means larger sample sizes can be evaluated in a given time budget.
Conclusion and Future Directions
Accelerating compilation is a deceptively complex challenge that pays dividends across the entire MakoGenerate kernel-generation lifecycle. Our fast CUDA compilation path avoids unnecessary Torch header overhead and relies on precompiled headers to drastically reduce compile times. While current support is focused on simple level 1 kernels and Torch's load/load_inline APIs, we plan to extend coverage to more complex kernel templates and additional build systems.
Ongoing work includes:
Removing the current bottleneck, the precompiled headers themselves, by precompiling full object files that provide the necessary integration with the PyTorch frontend.
Expanding heuristic rules to detect and optimize templated kernels.
Continuous benchmarking on higher-level kernels to validate speedups and identify new edge cases.
By reducing one of the most persistent bottlenecks in CUDA development, we hope to make kernel research and deployment faster, more efficient, and more accessible.
Stay tuned for more updates as we push the envelope on AI-powered GPU optimization!