15x Faster CUDA Kernel Compilation for MakoGenerate

Optimizing the kernel generation pipeline through accelerated compilation

Written by

Łukasz Dudziak

Published on

Jul 22, 2025

At Mako, our goal is to automatically unlock peak GPU performance through GPU kernel generation and optimization. When working with MakoGenerate - our AI Agent for writing high-performance GPU kernels - compilation can become a critical bottleneck, both during reinforcement fine-tuning of the model and in iterative code generation workflows. In this post, we explore why reducing compile times matters and how our engineering team implemented a fast CUDA compilation path that delivers up to a 15x speedup.

Why Compilation Time Matters

  1. Fine-Tuning for Syntactic Correctness: During the fine-tuning process, MakoGenerate writes candidate kernel code, which we compile to verify syntactic validity or identify compilation errors. Each round of evaluation may involve thousands of compile attempts. Slow compilation directly impacts iteration speed and increases cloud compute costs.

  2. Iterative Code Generation and Benchmarking: When using MakoGenerate in production, we generate a kernel, compile it, check for correctness, and then run benchmarks to measure performance. Since benchmarking only makes sense on kernels that compile successfully, every iteration requires a compile step. With the standard CUDA compilation path for building PyTorch extensions, this step can take tens of seconds per kernel, which adds up and slows down end-to-end workflows (a minimal sketch of this loop follows below).
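
To make the cost of that compile step concrete, here is a minimal sketch of one generate-compile-check-benchmark iteration using PyTorch's inline extension API. The toy scaling kernel, module name, and timing setup are illustrative placeholders, not MakoGenerate's actual pipeline.

    # Minimal sketch of one compile-check-benchmark iteration (illustrative only).
    # The toy kernel below stands in for a MakoGenerate-written candidate.
    import torch
    from torch.utils.cpp_extension import load_inline

    cuda_source = r"""
    #include <torch/extension.h>

    __global__ void scale_kernel(const float* x, float* y, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = s * x[i];
    }

    torch::Tensor scale(torch::Tensor x, double s) {
        auto y = torch::empty_like(x);
        int n = x.numel();
        scale_kernel<<<(n + 255) / 256, 256>>>(
            x.data_ptr<float>(), y.data_ptr<float>(), (float)s, n);
        return y;
    }
    """

    # Every candidate pays this compile cost before it can even be checked.
    module = load_inline(
        name="candidate_kernel",
        cpp_sources="torch::Tensor scale(torch::Tensor x, double s);",
        cuda_sources=cuda_source,
        functions=["scale"],
    )

    # Correctness check, then a quick benchmark on the surviving kernel.
    x = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(module.scale(x, 2.0), 2.0 * x)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    module.scale(x, 2.0)
    end.record()
    torch.cuda.synchronize()
    print(f"kernel time: {start.elapsed_time(end):.3f} ms")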

Because both fine-tuning and generation rely heavily on compile feedback, accelerating this step unlocks faster iteration cycles, lower costs, and smoother developer experiences.

Standard vs. Fast Compilation: Performance Improvements

Our team recently implemented a fast CUDA compilation path specifically tailored for MakoGenerate. Below are the key results observed on a suite of KernelBench’s level 1 problems (100 trials per metric):

  • Standard Compilation Time: 118.0 ± 14.0 seconds (range: 47.9 to 131.0)

  • Fast Compilation Time: 19.7 ± 33.8 seconds (range: 6.93 to 129.0)

  • Standard Reload Time: 0.000934 ± 0.000181 seconds (range: 0.000412 to 0.00143)

  • Fast Reload Time: 0.000741 ± 0.000241 seconds (range: 0.000292 to 0.00114)

  • Fast Compilation Improvement: 14.8 ± 5.49× speedup (range: 0.811× to 18.5×)

  • Correctness Verification: Fast-compiled modules matched standard outputs in 100/100 cases. Two individual evaluations (level 1 problem 37, sample 0) crashed under both the standard and fast paths.

These numbers demonstrate a clear reduction in end-to-end compile latency, with a mean per-kernel speedup of 14.8x, even though a few outliers saw little or no benefit. Even the reload path, a much smaller component, showed a modest speed boost.

How the Fast Compilation Path Works

The core idea behind our fast path is to minimize dependencies on heavy Torch headers and leverage precompiled headers wherever possible. Under the hood, the fast path intercepts any call to torch.utils.cpp_extension.load and load_inline and reorganizes the generated source code into separate pieces:

  1. Isolate the Kernel Code:

    • The kernel itself is written into a standalone .cu file.

    • We strip out all torch/extension.h includes from the kernel file and replace them with only the essential standard CUDA headers (e.g., <cuda.h>, <cuda_runtime.h>).

    • By doing so, the compiler avoids parsing bulky Torch headers when building CUDA device code.

  2. Split the Dispatch Logic:

    • Any host code that references torch::Tensor or other PyTorch C++ types is moved into its own .cpp file.

    • We compile this host file against a precompiled torch/extension.h header, dramatically reducing header-processing time.

    • At runtime, kernel launches route through a small, generated proxy dispatch function that bridges between the host wrapper and the compiled CUDA kernel module.

  3. Use Precompiled Headers (PCH):

    • Both the host .cpp and the proxy dispatch files are compiled against a PCH that includes torch/extension.h once. Future compilations can reuse the PCH, avoiding repeated parsing of PyTorch internals.

    • This reuse is especially impactful when many compilation jobs happen in rapid succession (as during evaluation loops). A rough sketch of prebuilding such a PCH follows below.
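
As a rough idea of how such a precompiled header might be produced, the sketch below drives GCC's PCH mechanism from Python. The output directory, compiler flags, and the use of g++ directly are assumptions for illustration, not Mako's actual tooling; in practice the flags must match the extension build exactly, or the compiler will silently fall back to parsing the real header.

    # Hypothetical sketch: prebuilding a GCC precompiled header for torch/extension.h.
    # Paths and flags are illustrative assumptions, not Mako's actual tooling.
    import os
    import subprocess
    import sysconfig
    from torch.utils.cpp_extension import include_paths

    torch_inc = include_paths()  # PyTorch's C++ include directories
    header = os.path.join(torch_inc[0], "torch", "extension.h")

    # GCC looks for <name>.gch on the include path before the header itself,
    # so later builds can pick up the PCH via an extra -I flag.
    os.makedirs("torch_pch/torch", exist_ok=True)
    cmd = [
        "g++", "-std=c++17", "-fPIC", "-x", "c++-header", header,
        "-o", "torch_pch/torch/extension.h.gch",
        f"-I{sysconfig.get_paths()['include']}",  # Python headers for pybind11
    ] + [f"-I{p}" for p in torch_inc]
    subprocess.run(cmd, check=True)
    # Subsequent host compilations would then pass "-Itorch_pch" ahead of the real
    # include path so that #include <torch/extension.h> resolves to the PCH.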

By reorganizing code in this fashion - placing only the minimal device code in the CUDA compilation unit and leveraging PCH for the host side - we minimize redundant parsing and compilation work.
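
To make the reorganization concrete, here is a hedged sketch of what the split might look like for a trivial scaling kernel: the device-only source contains no Torch headers and exposes an extern "C" launcher, while the host wrapper is the only translation unit that includes torch/extension.h (and, in the real fast path, would be built against the PCH described above). The kernel, names, and the use of torch.utils.cpp_extension.load to drive both compilers are illustrative assumptions rather than MakoGenerate's actual interception code.

    # Illustrative sketch of the source split (not Mako's actual implementation).
    import pathlib
    import tempfile
    import torch
    from torch.utils.cpp_extension import load

    # (1) Device-only .cu file: only standard CUDA headers, so nvcc never parses
    #     PyTorch's heavy C++ API.
    device_source = r"""
    #include <cuda.h>
    #include <cuda_runtime.h>

    __global__ void scale_kernel(const float* x, float* y, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = s * x[i];
    }

    extern "C" void launch_scale(const float* x, float* y, float s, int n,
                                 cudaStream_t stream) {
        scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(x, y, s, n);
    }
    """

    # (2) Host-side dispatch: the only file that includes torch/extension.h, and
    #     the one that would be compiled against the precompiled header.
    host_source = r"""
    #include <torch/extension.h>
    #include <cuda_runtime.h>
    #include <c10/cuda/CUDAStream.h>

    extern "C" void launch_scale(const float* x, float* y, float s, int n,
                                 cudaStream_t stream);

    torch::Tensor scale(torch::Tensor x, double s) {
        auto y = torch::empty_like(x);
        launch_scale(x.data_ptr<float>(), y.data_ptr<float>(), (float)s,
                     (int)x.numel(), c10::cuda::getCurrentCUDAStream());
        return y;
    }

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
        m.def("scale", &scale, "scale x by s");
    }
    """

    src_dir = pathlib.Path(tempfile.mkdtemp())
    (src_dir / "kernel.cu").write_text(device_source)
    (src_dir / "host.cpp").write_text(host_source)

    # load() builds each file with the appropriate compiler and links them into
    # one extension module.
    module = load(name="split_candidate_kernel",
                  sources=[str(src_dir / "kernel.cu"), str(src_dir / "host.cpp")])

    x = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(module.scale(x, 2.0), 2.0 * x)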

Caveats and Fallbacks

While the fast path yields impressive speedups in most cases, it remains a best-effort, heuristic-driven approach. Key considerations include:

  • Compilation Fallback: If the kernel fails to compile via the fast path (e.g., due to templated kernel code or unusual macro usage), we automatically revert to the standard compilation process. This fallback ensures that developer feedback remains accurate: any compilation error seen under the fast path is revalidated by the slow path to confirm that the error isn’t an artifact of our heuristics (a minimal sketch of this fallback appears after this list).

  • Limited Scope: Currently, the fast path only covers scenarios using torch.utils.cpp_extension.load and load_inline. Other build workflows (e.g., custom CMake-based builds or complex multi-file extensions) are untouched.

  • Tested on simple operators: Our experiments have focused on a set of 100 simple kernels and operators. While results are consistently positive there, other kernel families or more complex dispatch logic may require further heuristics and additional logic for splitting source code.

  • Edge Cases and Unexpected Failures: During our testing, two problem evaluations crashed under both fast and standard paths. In addition, 6/100 kernels required fallback to the slow path. As we expand test coverage, we expect to identify more such edge cases and refine our heuristics accordingly.
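
A minimal sketch of that fallback logic might look like the following; the compile_candidate wrapper and the fast_path callable are hypothetical stand-ins for illustration, not real MakoGenerate or PyTorch APIs.

    # Hypothetical fallback wrapper (illustrative only; `fast_path` stands in
    # for the fast compilation routine and is not a real API).
    from torch.utils.cpp_extension import load_inline

    def compile_candidate(fast_path, **kwargs):
        """Try the fast path first; rebuild with the standard path on failure."""
        try:
            return fast_path(**kwargs)
        except Exception:
            # Rebuilding with stock load_inline confirms that any error reported
            # back to the model is genuine rather than an artifact of the
            # fast-path heuristics.
            return load_inline(**kwargs)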

Putting It All Together: Benefits for MakoGenerate Users

By integrating the fast CUDA compilation path into MakoGenerate’s code generation pipeline, we achieve:

  • Faster Fine-Tuning Iterations: When evaluating candidate kernel outputs during model training, each compile-and-validate loop runs significantly faster. Engineers can iterate on prompt designs and data selection with less idle time.

  • Lower Cloud Compute Costs: GPU kernels are often tested at scale (hundreds to thousands of compile calls). Cutting compile latency by up to 15× translates directly into fewer GPU/CPU hours billed.

  • Improved Developer Productivity: Data scientists and researchers get quicker feedback loops when prototyping new kernels or experimenting with novel dispatch strategies.

  • Scalable Benchmarking: When benchmarking MakoGenerate-written kernels against hand-tuned baselines, faster compilation means larger sample sizes can be evaluated in a given time budget.

Conclusion and Future Directions

Accelerating compilation is a deceptively complex challenge that pays dividends across the entire MakoGenerate kernel generation lifecycle. Our fast CUDA compilation path avoids unnecessary Torch header overhead and relies on precompiled headers to drastically reduce compile times. While current support is focused on simple level 1 kernels and Torch’s load/load_inline APIs, we plan to extend coverage to more complex kernel templates and additional build systems.

Ongoing work includes:

  • Eliminating the precompiled headers, which are now the main remaining bottleneck, by precompiling entire object files that provide the necessary integration with the PyTorch frontend.

  • Expanding heuristic rules to detect and optimize templated kernels.

  • Continuous benchmarking on higher-level kernels to validate speedups and identify new edge cases.

By reducing one of the most persistent bottlenecks in CUDA development, we hope to make kernel research and deployment faster, more efficient, and more accessible.

Stay tuned for more updates as we push the envelope on AI-powered GPU optimization!

Copyright © 2025 Mako. All rights reserved.
