How MakoGenerate Leverages PTX and Tensor Cores for Fast Matrix Multiplication

MakoGenerate writes inline PTX to achieve near-optimal GEMM performance

Written by

Cătălin Milu

Published on

Jul 29, 2025

At Mako, our mission is to revolutionize GPU development. Our flagship product, MakoGenerate, is an intelligent agent that automatically produces highly optimized GPU kernel code. Today, we're excited to show how MakoGenerate leverages inline PTX and NVIDIA's Tensor Cores to deliver unparalleled performance for intensive tasks like matrix multiplication.

The Challenge: Manual GPU Optimization is Hard

Achieving peak GPU performance requires a deep understanding of the hardware, the memory hierarchy, and low-level languages like PTX. Even for experienced CUDA developers, hand-optimizing a Tensor Core GEMM kernel is time-consuming and error-prone: small mismatches in fragment layout or indexing readily produce subtle inefficiencies or wrong results.

A Glimpse Under the Hood: Our Inline PTX GEMM Example

To illustrate, consider a manually written CUDA kernel that uses inline PTX to drive the Tensor Cores. While functional, code like this highlights the complexity, and the room for simple mismatches, inherent in low-level manual optimization.

__global__ void gemm_tensor_core_kernel(
    const half* __restrict__ A,
    const half* __restrict__ B,
    half* __restrict__ C,
    int M_dim, int N_dim, int K_dim
) {
    // Shared memory, block/warp/lane indices, register storage initialization
    // ...
    
    // Main computation loop over K dimension
    for (int block_k = 0; block_k < K_dim; block_k += BK) {
        // Load A and B tiles from global to shared memory
        // ...
        __syncthreads();
        
        // Process warp tiles in K dimension
        for (int warp_k = 0; warp_k < BK; warp_k += WK) {
            // Convert this warp's shared-memory tile pointers to the 32-bit
            // shared-state-space addresses that ldmatrix expects
            uint32_t A_warp_base = cvta_to_shared_u32(A_warp_tile);
            uint32_t B_warp_base = cvta_to_shared_u32(B_warp_tile);

            // Register-tile loops over mma_m, mma_k, and mma_n elided
            // ...
            
            // Load A tiles into registers using ldmatrix (inline PTX);
            // the address operand is the per-thread shared-memory address
            // for this tile (computation elided)
            asm volatile (
                "ldmatrix.sync.aligned.m8n8.x2.shared.b16 "
                "{%0, %1}, [%2];"
                : "=r"(A_register[mma_m][mma_k][0]),
                  "=r"(A_register[mma_m][mma_k][1])
                : "r"(thread_offset_bytes)
            );

            // Load B tiles into registers using ldmatrix (transposed,
            // inline PTX); per-thread address computation likewise elided
            asm volatile (
                "ldmatrix.sync.aligned.m8n8.x1.trans.shared.b16 "
                "{%0}, [%1];"
                : "=r"(B_register[mma_k][mma_n])
                : "r"(thread_offset_bytes)
            );
            
            // Perform MMA operations (inline PTX)
            asm volatile (
                "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 "
                "{%0, %1}, {%2, %3}, {%4}, {%5, %6};"
                : "=r"(C_register[mma_m][mma_n][0]), 
                  "=r"(C_register[mma_m][mma_n][1])
                : "r"(A_register[mma_m][mma_k][0]),
                  "r"(A_register[mma_m][mma_k][1]),
                  "r"(B_register[mma_k][mma_n]),
                  "r"(C_register[mma_m][mma_n][0]),
                  "r"(C_register[mma_m][mma_n][1])
            );
        }
    }
    // Store results back to global memory
    // ...
}
This kernel drives the Tensor Cores directly with PTX instructions such as ldmatrix and mma.sync. While powerful, manually managing register fragments, shared memory layout, and warp-level parallelism through inline assembly invites exactly the subtle bugs and performance cliffs described above.
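Note that cvta_to_shared_u32 in the kernel above is a small helper rather than a built-in. A minimal sketch of such a helper, assuming CUDA 11+ and the __cvta_generic_to_shared intrinsic, might look like this:

#include <cstdint>

// Assumed implementation of the address-conversion helper used above.
// __cvta_generic_to_shared (CUDA 11+) converts a generic pointer into a
// shared-state-space address; ldmatrix takes it as a 32-bit operand.
__device__ inline uint32_t cvta_to_shared_u32(const void* ptr) {
    return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
}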

Enter MakoGenerate: Your AI-Powered Optimization Agent

MakoGenerate sidesteps this by intelligently generating highly optimized PTX code, including efficient Tensor Core and shared memory utilization, handling complexities that plague manual efforts.

How MakoGenerate works:

  • High-Level Specification: You provide MakoGenerate with your computational task (e.g., matrix multiplication).

  • Architectural Awareness: It understands your GPU's architecture, including Tensor Cores and memory hierarchy.

  • PTX Generation: MakoGenerate produces tailored PTX with efficient shared memory usage, precise Tensor Core utilization, sound thread and warp scheduling, and none of the layout or indexing mismatches that creep into hand-written inline assembly.

  • Integration: The generated kernel drops into your CUDA C++ application like any hand-written one; a host-side launch sketch follows this list.
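Concretely, the integration story is plain CUDA C++. The sketch below launches the gemm_tensor_core_kernel shown earlier from the host; the tile sizes and launch geometry are illustrative assumptions, not MakoGenerate's actual output:

#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical host-side launcher for the kernel above. BM and BN are
// assumed block tile sizes; a real configuration comes from the generator.
void launch_gemm(const half* dA, const half* dB, half* dC,
                 int M, int N, int K) {
    constexpr int BM = 128, BN = 128;  // assumed block tile sizes
    dim3 grid((N + BN - 1) / BN, (M + BM - 1) / BM);
    dim3 block(256);                   // e.g., 8 warps per block

    gemm_tensor_core_kernel<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
    }
}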

The Benefits: Performance and Productivity

MakoGenerate offers clear advantages:

  • Superior Performance: It generates expertly optimized PTX, consistently achieving near-peak performance and often surpassing hand-tuned kernels. Even a manually optimized Tensor Core kernel like the one above already outpaces a naive baseline (sketched after this list).

  • Reduced Development Time: Focus on high-level logic, not low-level GPU optimization.

  • Increased Reliability: Automated generation minimizes human error, leading to more robust kernels.

  • Future-Proofing: It adapts to evolving GPU architectures, so your kernels keep up as the hardware changes.
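For context, the baseline such kernels are measured against is typically the textbook one-thread-per-output-element GEMM. A minimal sketch, assuming row-major fp16 operands and fp32 accumulation:

#include <cuda_fp16.h>

// Naive fp16 GEMM baseline: each thread computes one element of C by
// walking the K dimension in global memory. No shared memory, no Tensor
// Cores. Names and signature mirror the kernel above for comparison.
__global__ void gemm_naive_kernel(
    const half* __restrict__ A,
    const half* __restrict__ B,
    half* __restrict__ C,
    int M_dim, int N_dim, int K_dim
) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M_dim || col >= N_dim) return;

    float acc = 0.0f;  // accumulate in fp32 for accuracy
    for (int k = 0; k < K_dim; ++k) {
        acc += __half2float(A[row * K_dim + k]) *
               __half2float(B[k * N_dim + col]);
    }
    C[row * N_dim + col] = __float2half(acc);
}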

Conclusion

Generating efficient GPU kernel code, especially leveraging Tensor Cores via PTX, is transformative for high-performance computing. At Mako, we believe MakoGenerate empowers developers to push GPU capabilities, turning complex optimization into streamlined, automated processes.

Stay tuned for more updates as we evolve MakoGenerate and enhance your GPU development workflow!

Copyright © 2025 Mako. All rights reserved.