EvoKernel: AI reads 10 GPU kernels, beats the hand-tuned best

We optimized a 3D depthwise convolution kernel through 5 algorithmic steps, achieving 10.7x speedup over PyTorch (5.8ms → 0.54ms). The final breakthrough came from an AI reading an unrelated fused MLP kernel and extracting a compiler scheduling hint (sched_group_barrier) that the human engineer hadn't tried.
PyTorch: 5.8 ms · Hand-tuned (human): 0.61 ms · AI-optimized: 0.54 ms

1 Problem

3D depthwise convolution on AMD Instinct MI350X.

Input:   [1, 512, 61, 45, 80]  bf16   (NCHW)
Weight:  [512, 1, 3, 5, 5]     bf16   (depthwise: groups=512)
Output:  [1, 512, 59, 45, 80]  bf16
Padding: (0, 2, 2)   Stride: (1,1,1)   GFLOPs: 16.31
Context: Each output pixel requires 75 multiply-accumulate operations (3×5×5 filter). Depthwise means each channel is convolved independently — no cross-channel computation.

PyTorch (F.conv3d via MIOpen/ROCm) runs this in 5.8ms. The challenge: depthwise conv has no cross-channel computation (groups=512), which limits which GPU hardware features can help. Can AI find optimizations a human engineer missed?
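For concreteness, a tiny host-side sketch (plain C++, not part of the repo) that reproduces the shape and FLOP arithmetic above: the 3×5×5 filter gives 75 MACs per output pixel, and 512 × 59 × 45 × 80 output pixels at 2 FLOPs per MAC give 16.31 GFLOPs.

#include <cstdio>

int main() {
    const long C = 512;                      // channels; depthwise => groups == C
    const long D = 61, H = 45, W = 80;       // input spatial dims
    const long KD = 3, KH = 5, KW = 5;       // filter taps
    const long PD = 0, PH = 2, PW = 2;       // padding
    const long OD = D + 2 * PD - KD + 1;     // 59
    const long OH = H + 2 * PH - KH + 1;     // 45
    const long OW = W + 2 * PW - KW + 1;     // 80
    const long macs_per_pixel = KD * KH * KW;   // 75
    const long pixels = C * OD * OH * OW;
    printf("output: [1, %ld, %ld, %ld, %ld]\n", C, OD, OH, OW);
    printf("GFLOPs: %.2f\n", 2.0 * pixels * macs_per_pixel / 1e9);
    return 0;
}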

1.1 What AI Was Given

Human provides context declaratively. AI reads, implements, self-corrects. Human verifies.

  • task.md: the problem definition (conv3d shapes, PyTorch reference code), 33 lines. "Here's the problem, go."
  • gpu_arch/: CDNA3/4 ISA, LDS sizes, instruction throughput, roofline analysis. The source of truth for hardware facts.
  • rocprofv3: PMC hardware counters + instruction-level thread trace. AI measures bottlenecks, doesn't guess.

Human shows reference kernels declaratively: "read this hipconv code", "check fused_mlp.py" — points to good work, doesn't prescribe what to extract. Human owns final verification: correctness against PyTorch, reproducible timing.

2 The Optimization Journey

Five steps, five distinct algorithms. The failures teach as much as the wins.

[Figure 1 chart: per-kernel time (ms) 5.81, 5.14, 1.17, 0.62, 14.4, 0.54 and effective bandwidth (GB/s) 73, 82, 361, 682, 29, 783 for PyTorch (MIOpen) and Steps 1–5 (Naive, NHWC, LDS, MFMA, SGB).]
Figure 1. Execution time (bars) and effective memory bandwidth (line) for each kernel. Step 4 (MFMA) bar is truncated at 6ms (actual: 14.4ms). Step 5 (SGB) achieves 783 GB/s. 200 iterations, bf16, MI350X.

Step 1: Naive Baseline

5.14ms · 1.1x

One thread computes one output pixel. Each thread independently reads 75 input values and 75 weights from global memory, with boundary checks per tap. No data sharing between threads.

Roughly the same speed as PyTorch's MIOpen. The obvious approach is not enough.
Resources: 22 VGPRs, occupancy 8, no LDS. AI reads gpu_arch/quick_reference.md to learn: global memory ~500 cycles, LDS ~50 cycles.
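A hypothetical HIP sketch of this strategy (names are mine and the dtype is simplified to float; the repo's kernel uses bf16): one thread owns one output pixel and issues all 75 input loads and 75 weight loads straight from global memory, with a boundary check per tap.

#include <hip/hip_runtime.h>

// Launch assumption: ceil(C*OD*OH*OW / 256) blocks of 256 threads.
__global__ void naive_dwconv3d(const float* __restrict__ in,   // [C, D, H, W]
                               const float* __restrict__ wgt,  // [C, 3, 5, 5]
                               float* __restrict__ out,        // [C, OD, OH, OW]
                               int C, int D, int H, int W,
                               int OD, int OH, int OW) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= C * OD * OH * OW) return;

    // Decode one output coordinate per thread.
    int ow = idx % OW, t = idx / OW;
    int oh = t % OH; t /= OH;
    int od = t % OD; int c = t / OD;

    float acc = 0.f;
    for (int kd = 0; kd < 3; ++kd)
        for (int kh = 0; kh < 5; ++kh)
            for (int kw = 0; kw < 5; ++kw) {
                const int id = od + kd;        // pad_d = 0
                const int ih = oh + kh - 2;    // pad_h = 2
                const int iw = ow + kw - 2;    // pad_w = 2
                if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue;
                acc += in[((c * D + id) * H + ih) * W + iw]      // 75 global loads
                     * wgt[((c * 3 + kd) * 5 + kh) * 5 + kw];    // 75 more
            }
    out[idx] = acc;
}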

Step 2: NHWC Layout

1.17ms · 5.0x

NHWC layout makes adjacent threads access adjacent channels — coalesced 128-byte global reads. Each thread handles 8 channels at one spatial position. 5x faster, but each thread still reads 75 values independently.

Coalescing gives 5x, but each thread still does 75 independent global reads. For large filters, data reuse matters more than memory layout.
Resources: 46 VGPRs, occupancy 8, no LDS. 512 threads/block, 8 spatial positions, 64 threads per spatial.
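A hypothetical sketch of the channels-last access pattern (names and launch shape are assumptions; simplified to one channel per thread, whereas the real kernel vectorizes 8 channels per thread with 8 spatial positions per 512-thread block). The point is the trailing "+ c" in every index: consecutive lanes differ only in the channel, so their loads coalesce into 128-byte transactions.

#include <hip/hip_runtime.h>

// Launch assumption: gridDim.x = OD*OH*OW (one block per output position),
// blockDim.x = C = 512 (one thread per channel).
__global__ void nhwc_dwconv3d(const float* __restrict__ in,   // [D, H, W, C]
                              const float* __restrict__ wgt,  // [3, 5, 5, C]
                              float* __restrict__ out,        // [OD, OH, OW, C]
                              int C, int H, int W,
                              int OH, int OW) {
    const int c = threadIdx.x;
    int s = blockIdx.x;
    const int ow = s % OW; s /= OW;
    const int oh = s % OH;
    const int od = s / OH;

    float acc = 0.f;
    for (int kd = 0; kd < 3; ++kd)
        for (int kh = 0; kh < 5; ++kh)
            for (int kw = 0; kw < 5; ++kw) {
                const int id = od + kd, ih = oh + kh - 2, iw = ow + kw - 2;
                if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue;
                // Adjacent lanes differ only in c -> adjacent addresses -> coalesced.
                acc += in[((id * H + ih) * W + iw) * C + c]
                     * wgt[((kd * 5 + kh) * 5 + kw) * C + c];
            }
    out[blockIdx.x * C + c] = acc;   // still 75 independent loads per thread
}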

Step 3: NCHW + LDS Cache

0.62ms · 9.4x

Switch back to NCHW. 256 threads cooperatively load the input tile into LDS (Local Data Share, 160KB/CU on MI350X), then each thread reads from fast LDS. Weights cached in 75 float VGPRs. Data loaded once, reused by all threads. This is the hand-tuned production kernel.

256 threads cooperate:
HBM ⇒ LDS [3×49×84 bf16]    ← loaded once, shared
HBM → Regs [75 weights]        ← per-thread, reused
LDS → 45 ds_read_b32 → 150 v_fmac_f32 → store
AI profiles with rocprofv3 thread trace: 33.6% FMA, 20.6% bf16 extraction, 13.3% NOP. Over half the cycles are not compute. Can we do better?
Resources: 155 VGPRs, occupancy 3, 32KB LDS. LDS = Local Data Share — fast on-chip memory shared within a workgroup. 160KB per CU on MI350X (2.5x larger than CDNA3's 64KB).
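A hedged sketch of the Step 3 structure (the 3×49×84 tile constants follow the diagram above; float replaces bf16, which roughly doubles the LDS footprint; names and launch shape are assumptions, not the production kernel). The essential change: one cooperative HBM→LDS load, weights parked in registers, and every convolution read served from LDS.

#include <hip/hip_runtime.h>

#define TILE_H 49   // 45 output rows + 4-row halo for the 5-tap filter
#define TILE_W 84   // 80 output cols + 4-col halo

// Launch assumption: gridDim.x = C*OD (one block per channel/depth slice),
// blockDim.x = 256.
__global__ void lds_dwconv3d(const float* __restrict__ in,   // [C, D, H, W]
                             const float* __restrict__ wgt,  // [C, 3, 5, 5]
                             float* __restrict__ out,        // [C, OD, OH, OW]
                             int C, int D, int H, int W,
                             int OD, int OH, int OW) {
    __shared__ float tile[3][TILE_H][TILE_W];
    const int c  = blockIdx.x / OD;
    const int od = blockIdx.x % OD;

    // 1) Cooperative load: 256 threads stream the 3 needed depth planes into LDS once.
    for (int i = threadIdx.x; i < 3 * TILE_H * TILE_W; i += blockDim.x) {
        const int kd = i / (TILE_H * TILE_W);
        const int ih = (i / TILE_W) % TILE_H - 2;   // shift by pad_h = 2
        const int iw = i % TILE_W - 2;              // shift by pad_w = 2
        const bool inside = (ih >= 0 && ih < H && iw >= 0 && iw < W);
        tile[kd][ih + 2][iw + 2] =
            inside ? in[((c * D + od + kd) * H + ih) * W + iw] : 0.f;
    }

    // 2) The 75 weights live in registers and are reused for every output.
    float w[3][5][5];
    for (int kd = 0; kd < 3; ++kd)
        for (int kh = 0; kh < 5; ++kh)
            for (int kw = 0; kw < 5; ++kw)
                w[kd][kh][kw] = wgt[((c * 3 + kd) * 5 + kh) * 5 + kw];

    __syncthreads();

    // 3) Compute: all convolution reads now hit LDS, never HBM.
    for (int p = threadIdx.x; p < OH * OW; p += blockDim.x) {
        const int oh = p / OW, ow = p % OW;
        float acc = 0.f;
        for (int kd = 0; kd < 3; ++kd)
            for (int kh = 0; kh < 5; ++kh)
                for (int kw = 0; kw < 5; ++kw)
                    acc += tile[kd][oh + kh][ow + kw] * w[kd][kh][kw];
        out[((c * OD + od) * OH + oh) * OW + ow] = acc;
    }
}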

Step 4: MFMA Matrix Engine

14.4ms · 0.4x

AI studied the hipconv grouped convolution library and found the Toeplitz matrix + MFMA technique: reformulate the width convolution as a matrix multiply and use v_mfma_f32_4x4x4f16 (batch=16) to process 16 channels simultaneously. Produces correct results but is 6x slower than Step 3.

Why it failed: For depthwise conv (1 channel per group), loading 16 channels' data costs 16x more, but compute speeds up at most 16x — net zero. With only 64 threads (1 wave), cooperative LDS loading is 4x slower than Step 3's 256 threads.

Not every clever idea works. MFMA is powerful for grouped conv (cpg≥4) but useless for depthwise (cpg=1). The failure teaches when a technique applies.
Resources: 101 VGPRs, occupancy 2, 21KB LDS, 560B scratch (spills). MFMA = Matrix Fused Multiply-Add, a dedicated matrix engine separate from VALU.
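To make the borrowed idea concrete, here is a CPU-only illustration (hypothetical helper, not hipconv code) of the Toeplitz reformulation: a 1-D convolution along W is rewritten as a small matrix multiply out = T(in) · w, where T(in)[ow][kw] = in[ow + kw]. That matmul shape is what an MFMA tile consumes; the real kernel packs 16 such per-channel tiles into one v_mfma_f32_4x4x4f16 issue.

#include <vector>

// out[ow] = sum_kw in[ow + kw] * w[kw], written as rows of a Toeplitz matrix
// times the filter vector.
std::vector<float> conv1d_as_matmul(const std::vector<float>& in,   // OW + KW - 1 samples
                                    const std::vector<float>& w) {  // KW taps
    const int KW = (int)w.size();
    const int OW = (int)in.size() - KW + 1;
    std::vector<float> out(OW, 0.f);
    for (int ow = 0; ow < OW; ++ow)          // each row of T(in)
        for (int kw = 0; kw < KW; ++kw)      // each filter tap
            out[ow] += in[ow + kw] * w[kw];  // one MAC of the matmul
    return out;
}

int main() {
    std::vector<float> in{1, 2, 3, 4, 5, 6}; // 2 outputs with a 5-tap filter
    std::vector<float> w{1, 1, 1, 1, 1};
    auto out = conv1d_as_matmul(in, w);      // {15, 20}
    return (int)out.size();
}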

Step 5: sched_group_barrier

0.54ms · 10.7x

AI read fused_mlp.py (a Triton+Gluon fused MLP kernel) and found sched_group_barrier: a compiler hint that forces LLVM to interleave LDS reads with VALU compute, instead of batching all reads then all computes.

Two changes from Step 3:

Step 3: ds_read ×45 → waitcnt → v_fmac ×150    (batch all)
Step 5: for each of 15 filter rows:              (interleave)
          ds_read ×3
          sched_group_barrier(DS_READ, 3)
          sched_group_barrier(VALU, 10)
          waitcnt → v_fmac ×10

Results are bitwise identical to Step 3 — same algorithm, better instruction scheduling. 14% faster.

The same interleaving idea was tried in a JIT kernel earlier — it failed (no compiler hints). It only worked in HIP C++ with sched_group_barrier. The how matters as much as the what.
Resources: 86 VGPRs (down from 155), occupancy 5 (up from 3), 0 spills. Only 3 input regs per row instead of 45 batched.
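A hedged HIP sketch of the pattern (loop structure, names, and the float dtype are assumptions; only the two intrinsic calls are the technique itself). __builtin_amdgcn_sched_group_barrier(mask, size, sync_id) is the LLVM/HIP builtin the kernel uses: the mask selects an instruction class (0x100 = DS read, 0x002 = VALU in LLVM's AMDGPU scheduling-group encoding) and size is how many of them the scheduler should place at that point. The per-row counts (3 reads, 10 VALU) follow the pseudocode above; this sketch shows only 3 FMAs per row, whereas the real kernel keeps several output accumulators per row.

#include <hip/hip_runtime.h>

// Launch assumption: wgt holds 45 taps (15 rows x 3); out has one slot per thread.
__global__ void interleaved_rows(const float* __restrict__ wgt,
                                 float* __restrict__ out) {
    __shared__ float lds_in[15 * 96];        // stand-in for the staged input tile
    for (int i = threadIdx.x; i < 15 * 96; i += blockDim.x) lds_in[i] = 1.0f;
    __syncthreads();

    float w[15][3];
    for (int r = 0; r < 15; ++r)
        for (int k = 0; k < 3; ++k) w[r][k] = wgt[r * 3 + k];

    float acc = 0.f;
    for (int row = 0; row < 15; ++row) {     // 15 filter rows = 3 (kd) x 5 (kh)
        const float* src = &lds_in[row * 96 + (threadIdx.x % 64)];
        const float a = src[0], b = src[1], c = src[2];   // this row's LDS reads

        // Scheduling hint: keep this row's 3 DS reads together, then place the
        // row's VALU work right behind them, instead of hoisting all 45 reads
        // ahead of all 150 FMAs as the Step 3 schedule did.
        __builtin_amdgcn_sched_group_barrier(0x100, 3, 0);   // 3 x DS read
        __builtin_amdgcn_sched_group_barrier(0x002, 10, 0);  // ~10 x VALU

        acc = fmaf(a, w[row][0], acc);
        acc = fmaf(b, w[row][1], acc);
        acc = fmaf(c, w[row][2], acc);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}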

3 How AI Found What Humans Missed

The breakthrough came from reading a completely unrelated kernel.

Source: fused_mlp.py — a Triton+Gluon fused MLP kernel that uses sched_group_barrier to interleave VMEM reads with MFMA compute.

Pattern extracted: AI recognizes that this interleaving principle applies to DS_read + VALU in the depthwise conv kernel, not just VMEM + MFMA.

Result: 14% faster. The same interleaving idea had been tried in a JIT kernel (failed — no compiler hints); it succeeded in HIP C++ with sched_group_barrier.

Figure 2. Cross-codebase learning: a technique from fused MLP applied to depthwise convolution. The key was not just the code restructuring, but the compiler scheduling hint.

3.1 It Wasn't a Straight Line

Between Steps 3 and 5, AI explored 12 optimization ideas. Most failed. Each failure narrowed the search space.

  • v_dot2_f32_bf16 in HIP: WRONG — compiler register bug
  • Row-chunk streaming (V104): 2.5x slower — barrier overhead
  • Compiler pointer reads (V104b): 2.1x slower — ds_read_u16 bloat
  • NHWC weight tiling (V105): 2.0x slower — no input reuse
  • ds_read_b64 wider reads: 4.2x slower — gfx950 penalty
  • Software pipeline by depth: no improvement
  • KW_PACK=4 (4 outputs/iter): WRONG — indexing bug
  • JIT row-interleave (same idea!): 3% slower — no compiler hints
  • MFMA Toeplitz 16-channel: 6x slower — data loading dominates
  • v_pk_fma_f32 packed ops: WRONG — bf16 format mismatch
20+ kernel variants, 12 phases, most of them failures. The winning technique came from reading a completely unrelated kernel (fused_mlp.py). AI's advantage: it can explore at scale without fatigue.

4 Human-AI Collaboration

Human provides context declaratively. AI reads, implements, self-corrects. Human verifies.

Human (declarative)

  • Defines problem via task.md
  • Provides hardware docs (gpu_arch/)
  • Shows reference kernels: "read this hipconv", "check fused_mlp"
  • Challenges claims: "where's the source?"
  • Owns final verification: PyTorch correctness + timing

AI (autonomous)

  • Reads ISA docs and architecture references at scale
  • Studies reference kernels, extracts techniques
  • Proposes, implements, and benchmarks variants
  • Profiles with rocprofv3 (PMC + thread trace)
  • Self-corrects when evidence contradicts claims

Key moments

  • "Read hipconv group_conv" → AI extracts Toeplitz+MFMA. Step 4 — failed, but learned why.
  • "Read fused_mlp.py" → AI extracts sched_group_barrier. Step 5 — breakthrough.
  • "Where did you find LGKM queue overflow?" → AI traces to vm_cnt.md — it was an unverified guess. Corrected.
  • "What is actual LDS size?" → AI queries rocminfo. Found 160KB, not 64KB. Analysis fixed.

5 Self-Corrections

AI made confident claims. Human challenged. AI traced to sources and corrected.

LDS Size
Claim: 64 KB per CU (assumed from CDNA3 documentation)
Correction: 160 KB per CU (verified via rocminfo GROUP segment)

MI350X (CDNA4) has 2.5x more LDS than CDNA3. This changed the entire occupancy analysis.

LGKM Queue Depth
Claim: "16-entry LGKM queue overflows when batching 45 ds_reads"
Correction: unverified guess — vm_cnt.md only tested the VMEM queue, not LDS

The source confirmed VMEM has a 64-entry queue but only guessed LDS works similarly, noting "that causes no stall most-likely." LDS queue depth was never measured. Claim retracted.

Compiler Occupancy
Claim: compiler reports occupancy 5, so runtime occupancy is 5
Correction: the compiler reports only the VGPR limit — the LDS constraint is applied at runtime

Actual runtime occupancy = min(VGPR limit, LDS limit). On MI350X with 160KB LDS, both happen to give 5, so the number was coincidentally correct.
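A worked version of that arithmetic (assuming a 512-VGPR budget per SIMD and the 32 KB LDS tile carried over from Step 3; register-allocation granularity is ignored), just to show how both limits land on 5 for the Step 5 kernel:

#include <algorithm>
#include <cstdio>

int main() {
    const int vgpr_budget = 512, kernel_vgprs = 86;          // Step 5 kernel
    const int lds_per_cu_kb = 160, lds_per_block_kb = 32;    // verified MI350X LDS
    const int vgpr_limit = vgpr_budget / kernel_vgprs;       // 512 / 86 = 5
    const int lds_limit  = lds_per_cu_kb / lds_per_block_kb; // 160 / 32 = 5
    printf("occupancy = min(%d, %d) = %d\n",
           vgpr_limit, lds_limit, std::min(vgpr_limit, lds_limit));
    return 0;
}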

6 Results

All kernels verified against PyTorch F.conv3d using pyhip.calc_diff and the same slice rules as pyhip/tests/contrib/conv3d/test.py (see benchmark.py).

Kernel             Time       Speedup   Key technique
PyTorch            5.81 ms    1.0x      MIOpen / ROCm baseline
Step 1: Naive      5.14 ms    1.1x      1 thread = 1 output, global memory
Step 2: NHWC       1.17 ms    5.0x      Coalesced channels, no data reuse
Step 3: NCHW+LDS   0.62 ms    9.4x      Cooperative LDS cache, weight registers
Step 4: MFMA       14.40 ms   0.4x      Matrix engine Toeplitz (failure)
Step 5: SGB        0.54 ms    10.7x     sched_group_barrier + row interleave

Reproduce

cd EvoKernel
bash demo.sh              # default: 100 iterations
bash demo.sh --iters 200  # more stable timing

Requires: AMD MI300X/MI350X (gfx942/gfx950), ROCm, PyTorch with ROCm support.

The full 12-phase optimization journey — including assembly patching, JIT rewrites, 6 LDS optimization variants, and all the dead ends — is documented in optimization_trajectory.md.

Key Takeaways

  1. Ground truth beats generic GPU lore. If the hardware isn’t in the prompt, the model will still sound sure—treat hardware specifications and architecture / ISA documentation as the source of truth.
  2. Build fast; measure deeper than the stopwatch. Rapid iteration plus real observability with trusted tools—not just end-to-end time—turns guesses into evidence.
  3. Humans own the task, the benchmark, and the verdict. Define correctness and performance bars; only people sign off on what “done” means.
  4. Map the docs; don’t flood the context. An index and layered reading paths beat one enormous file agents must swallow whole.
  5. Remember the lesson, not every dead end. Capture what worked and what failed—then distill and drop noise so the trail stays sharp.
“All models are wrong, but some are useful.”

George E. P. Box