EvoKernel
AI reads 10 GPU kernels,
beats the hand-tuned best

Xiake Sun, Tong Qiu, Cheng Luo · Kernel Darwin Lab

3D depthwise convolution on AMD Instinct MI350X

5.8ms
PyTorch
 
0.62ms
Hand-tuned (Human)
 
0.54ms
AI-optimized
1. The Problem
3D Depthwise Convolution
Input: [1, 512, 61, 45, 80] bf16 · Weight: [512, 1, 3, 5, 5] bf16 · Output: [1, 512, 59, 45, 80] bf16 · Groups: 512 · Padding: (0, 2, 2) · GFLOPs: 16.31

Each output pixel: 75 multiply-accumulate ops (3×5×5 filter). Used in video models, 3D segmentation, and efficient architectures.
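
Sanity-checking the GFLOPs figure above: 512 × 59 × 45 × 80 ≈ 1.09 × 10⁸ output pixels, each costing 75 MACs = 150 flops, gives ≈ 16.3 GFLOP per forward pass.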

The challenge: depthwise conv has no cross-channel computation (groups=512). Each channel is independent. This limits which GPU hardware features can help.

PyTorch baseline: 5.8 ms (MIOpen / ROCm on MI350X, CDNA4 / gfx950)

Can AI do better than both PyTorch and a human expert?

2. What AI Was Given
The Setup

task.md

Problem definition

Conv3d shapes, PyTorch reference code. 33 lines. "Here's the problem, go."

gpu_arch/

Hardware specs

CDNA3/4 ISA, LDS sizes, instruction throughput, roofline analysis. The source of truth.

rocprofv3

Profiling tools

PMC hardware counters + instruction-level thread trace. AI measures, not guesses.

The human points at reference kernels declaratively ("read this hipconv code", "check fused_mlp.py") without prescribing what to extract. The human owns final verification: correctness against PyTorch and reproducible timing.
Step 1
Naive Baseline

One thread computes one output pixel. Each thread independently reads 75 inputs + 75 weights from global memory with boundary checks per tap.

22 VGPRs · Occupancy 8 · 0 LDS

Roughly the same speed as PyTorch's MIOpen. The obvious approach is not enough.
5.14 ms · 1.1x faster than PyTorch
Thread N:
  global → load 75 inputs  (slow)
  global → load 75 weights (slow)
  for each tap:
    if (in_bounds): acc += w * in
  global ← store output
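
Fleshed out in HIP, the shape of this kernel is roughly the following. A hedged sketch assuming NCDHW layout and the shapes above, not the actual Step 1 code:

  #include <hip/hip_runtime.h>
  #include <hip/hip_bf16.h>

  // One thread per output pixel; every tap does a bounds check and two
  // independent global loads. Launch: ceil(C*OD*H*W / 256) blocks × 256.
  __global__ void dwconv3d_naive(const __hip_bfloat16* __restrict__ in,
                                 const __hip_bfloat16* __restrict__ wgt,
                                 __hip_bfloat16* __restrict__ out) {
      constexpr int C = 512, D = 61, H = 45, W = 80, OD = 59;
      int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one output pixel
      if (idx >= C * OD * H * W) return;
      int ow = idx % W, oh = (idx / W) % H;
      int od = (idx / (W * H)) % OD, c = idx / (W * H * OD);
      float acc = 0.0f;
      for (int kd = 0; kd < 3; ++kd)                     // 75 taps total
        for (int kh = 0; kh < 5; ++kh)
          for (int kw = 0; kw < 5; ++kw) {
            int id = od + kd;                            // no depth padding
            int ih = oh + kh - 2, iw = ow + kw - 2;      // padding (2, 2) in H, W
            if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue;
            acc += __bfloat162float(wgt[((c * 3 + kd) * 5 + kh) * 5 + kw])
                 * __bfloat162float(in[((c * D + id) * H + ih) * W + iw]);
          }
      out[idx] = __float2bfloat16(acc);
  }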
Step 2
NHWC Memory Layout

NHWC layout: adjacent threads access adjacent channels, so reads coalesce into 128-byte transactions.

Coalescing gives 5x, but each thread still does 75 independent global reads. For large filters, data reuse matters more than memory layout.
1.17 ms · 5.0x
64 threads at spatial S:
  T0: ch[0..7]    |
  T1: ch[8..15]   |  coalesced
  T2: ch[16..23]  |  128B read
  ...             |
Each thread: 75 global loads, no reuse across threads
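
A minimal sketch of the addressing change in illustrative HIP; the helper name nhwc_index is mine, not from the kernel:

  #include <hip/hip_runtime.h>
  #include <hip/hip_bf16.h>

  // Channels-last (NHWC) addressing: the channel index varies fastest, so
  // lane i of a 64-wide wavefront reads channel base_c + i at the same
  // spatial position. 64 adjacent bf16 loads = 128 bytes, one coalesced read.
  __device__ inline size_t nhwc_index(int d, int h, int w, int c,
                                      int H, int W, int C) {
      return (((size_t)d * H + h) * W + w) * C + c;   // c is innermost
  }

  // Inside a kernel (illustrative):
  //   int c = base_c + threadIdx.x;   // adjacent lanes → adjacent channels
  //   float x = __bfloat162float(in[nhwc_index(d, h, w, c, H, W, C)]);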
Step 3 — Hand-tuned Best
NCHW + LDS Cache

256 threads cooperatively load input into LDS (fast on-chip memory). Each thread reads from LDS. Weights cached in 75 float VGPRs. Data loaded once, reused by all.

155 VGPRs · Occupancy 3 · 32KB LDS

AI profiles with rocprofv3 thread trace:
33.6% FMA, 20.6% bf16 extraction, 13.3% NOP. Over half the cycles are NOT compute. Can we do better?
0.62 ms · 9.4x — the production kernel
256 threads cooperate:
  HBM ⇒ LDS  [3×49×84 bf16]   ← loaded once, shared
  HBM → Regs [75 weights]     ← per-thread, reused
  LDS → 45 ds_read_b32 → 150 v_fmac_f32 → global_store
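
Reconstructed in HIP, the structure looks roughly like this. A hedged sketch with one workgroup per (channel, output-depth) pair; the production kernel tiles differently, and its 32 KB LDS layout includes padding this sketch omits:

  #include <hip/hip_runtime.h>
  #include <hip/hip_bf16.h>

  // Launch: dim3(512, 59) blocks × 256 threads.
  __global__ void dwconv3d_lds(const __hip_bfloat16* in,
                               const __hip_bfloat16* wgt,
                               __hip_bfloat16* out) {
      constexpr int C = 512, D = 61, H = 45, W = 80, OD = 59;
      constexpr int TH = H + 4, TW = W + 4;              // padded tile: 49 × 84
      __shared__ __hip_bfloat16 tile[3 * TH * TW];       // ≈24 KB of LDS

      int c  = blockIdx.x;                               // one channel per block
      int od = blockIdx.y;                               // one output depth per block

      // 1) 256 threads cooperatively stage 3 padded slices: HBM ⇒ LDS, once.
      for (int i = threadIdx.x; i < 3 * TH * TW; i += blockDim.x) {
          int kd = i / (TH * TW), ih = (i / TW) % TH - 2, iw = i % TW - 2;
          bool ok = ih >= 0 && ih < H && iw >= 0 && iw < W;
          tile[i] = ok ? in[((c * D + od + kd) * H + ih) * W + iw]
                       : __float2bfloat16(0.0f);
      }

      // 2) Each thread caches the channel's 75 filter taps in float VGPRs.
      float w[75];
      for (int t = 0; t < 75; ++t)
          w[t] = __bfloat162float(wgt[c * 75 + t]);

      __syncthreads();                                   // tile is ready

      // 3) Inner loop: LDS reads (ds_read) feed FMAs (v_fmac); no HBM traffic.
      for (int p = threadIdx.x; p < H * W; p += blockDim.x) {
          int oh = p / W, ow = p % W;
          float acc = 0.0f;
          for (int t = 0; t < 75; ++t) {
              int kd = t / 25, kh = (t / 5) % 5, kw = t % 5;
              acc += w[t] * __bfloat162float(
                  tile[(kd * TH + oh + kh) * TW + ow + kw]);
          }
          out[((c * OD + od) * H + oh) * W + ow] = __float2bfloat16(acc);
      }
  }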
Step 4 — Failure
MFMA Matrix Engine

AI studied hipconv's grouped convolution, reformulated the width convolution as a Toeplitz matrix multiply, and packed 16 channels into an MFMA batch of 16.

Correct results but 6x slower.

Loading 16 channels into LDS costs 16x more data movement, and only 64 threads (vs 256) are available for cooperative loading.

Not every clever idea works. MFMA is powerful for grouped conv (cpg ≥ 4 channels per group) but useless for depthwise (cpg = 1). The failure teaches when a technique applies.
14.4 ms · 0.4x — 6x slower!
Toeplitz F(4,5):
  [g0 g1 g2 g3 g4  .  .  . ]
  [ . g0 g1 g2 g3 g4  .  . ]
  [ .  . g0 g1 g2 g3 g4  . ]
  [ .  .  . g0 g1 g2 g3 g4 ]
16-channel data = 16× the LDS load
Only 64 threads (vs 256) for staging
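
The reformulation itself is easy to state. A minimal sketch in plain C++, sizes from the diagram; the MFMA packing and bf16 handling are omitted:

  // Row r of T is the 5-tap filter row g shifted right by r, so T · x over
  // an 8-wide input strip yields 4 adjacent width-convolution outputs as
  // one small matrix multiply.
  void build_toeplitz_f45(const float g[5], float T[4][8]) {
      for (int r = 0; r < 4; ++r)
          for (int k = 0; k < 8; ++k)
              T[r][k] = (k >= r && k < r + 5) ? g[k - r] : 0.0f;
  }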
Step 5 — Breakthrough
sched_group_barrier

AI read fused_mlp.py and found sched_group_barrier: a compiler hint that forces LLVM to interleave LDS reads with VALU compute.

Output bitwise identical to Step 3, yet 14% faster. Same algorithm, better scheduling.

The same interleaving idea was tried in a JIT kernel earlier — it failed (no compiler hints). It only worked in HIP C++ with sched_group_barrier. The how matters as much as the what.

VGPRs: 155 → 86 · Occupancy: 3 → 5 · Speedup: +14%

Per filter row (15 rows):
  ds_read ×3                  ← 3 VGPRs (not 45)
  sched_group_barrier(DS, 3)
  sched_group_barrier(VALU, 10)
  s_waitcnt lgkmcnt(0)
  v_fmac ×10                  ← overlaps with reads
VGPRs: 155 → 86
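
The hint itself is a Clang builtin on AMDGPU toolchains, __builtin_amdgcn_sched_group_barrier(mask, size, sync_id). A hedged sketch of the per-row pattern with illustrative arithmetic, not the production kernel; the mask encodings (0x0080 = DS, 0x0002 = VALU) follow LLVM's AMDGPU convention, so verify them on your ROCm version:

  #include <hip/hip_runtime.h>

  // Per filter row: issue the row's LDS reads, then ask the scheduler to
  // emit "3 DS instructions, then 10 VALU instructions" so reads overlap
  // compute instead of being hoisted to the loop top (which cost 45 VGPRs).
  __device__ float row_fma_interleaved(const float* lds_vals,  // points into LDS
                                       const float* w) {       // weights in VGPRs
      float acc = 0.0f;
      for (int row = 0; row < 15; ++row) {
          float a0 = lds_vals[3 * row + 0];   // ds_read ×3:
          float a1 = lds_vals[3 * row + 1];   // only 3 live VGPRs
          float a2 = lds_vals[3 * row + 2];   // instead of 45
          __builtin_amdgcn_sched_group_barrier(0x0080, 3, 0);   // group: 3 DS ops
          __builtin_amdgcn_sched_group_barrier(0x0002, 10, 0);  // then: 10 VALU ops
          // VALU ops consuming the reads (illustrative; the real kernel
          // unpacks bf16 pairs and issues ten v_fmac_f32 here):
          acc = fmaf(a0, w[3 * row + 0], acc);
          acc = fmaf(a1, w[3 * row + 1], acc);
          acc = fmaf(a2, w[3 * row + 2], acc);
      }
      return acc;
  }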
Results
Performance Comparison
Kernel             Time (ms)   Bandwidth (GB/s)
PyTorch (MIOpen)   5.81        73
Step 1 (Naive)     5.14        82
Step 2 (NHWC)      1.17        361
Step 3 (LDS)       0.62        682
Step 4 (MFMA)      14.4        29
Step 5 (SGB)       0.54        783
3. The Full Picture
It Wasn't a Straight Line

Between Steps 3 and 5, AI explored 12 optimization ideas. Most failed. Each failure narrowed the search space.

  • v_dot2_f32_bf16 in HIP — WRONG: compiler register bug
  • Row-chunk streaming (V104) — 2.5x slower: barrier overhead
  • Compiler pointer reads (V104b) — 2.1x slower: ds_read_u16 bloat
  • NHWC weight tiling (V105) — 2.0x slower: no input reuse
  • ds_read_b64 wider reads — 4.2x slower: gfx950 penalty
  • Software pipeline by depth — no improvement
  • KW_PACK=4 (4 outputs/iter) — WRONG: indexing bug
  • JIT row-interleave (same idea!) — 3% slower: no compiler hints
  • MFMA Toeplitz 16-channel — 6x slower: data loading dominates
  • v_pk_fma_f32 packed ops — WRONG: bf16 format mismatch
20+ kernel variants, 12 phases, most of them failures. The winning technique came from reading a completely unrelated kernel (fused_mlp.py). AI's advantage: it can explore at scale without fatigue.
4. Workflow
Human-AI Collaboration

Human (Declarative)

  • Defines problem via task.md
  • Provides gpu_arch/ hardware docs
  • Shows reference kernels: "read this code"
  • Challenges claims: "where's the source?"
  • Owns final verification

AI (Autonomous)

  • Reads ISA docs at scale
  • Studies kernels, extracts techniques
  • Implements and benchmarks variants
  • Profiles with rocprofv3
  • Self-corrects on evidence
"Read hipconv"
Toeplitz+MFMA → Step 4 (failed, but learned)
"Read fused_mlp.py"
sched_group_barrier → Step 5 (breakthrough)
5. Honesty
Self-Corrections

AI made confident claims. Human challenged. AI traced to sources and corrected.

LDS Size

Claimed: 64 KB per CU (CDNA3 docs) → Corrected: 160 KB per CU (rocminfo)

MI350X has 2.5x more LDS than assumed, which changed the entire occupancy analysis.

LGKM Queue

Claimed: "16-entry queue overflows" → Corrected: an unverified guess (vm_cnt.md)

The cited source only tested the VMEM queue; the LDS queue was never measured.

Compiler Occupancy

Claimed: the compiler reports occupancy 5, so runtime is 5 → Corrected: the compiler accounts for the VGPR limit only

Runtime occupancy is min(VGPR limit, LDS limit); both happen to give 5 on MI350X.
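
Worked through (assuming a 512-entry VGPR file per SIMD, which matches the occupancy numbers reported across the steps): the VGPR limit gives floor(512 / 86) = 5 waves and the LDS limit gives floor(160 KB / 32 KB) = 5 workgroups, so the two constraints coincide at 5; at Step 3's 155 VGPRs, registers alone capped occupancy at floor(512 / 155) = 3.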

Conclusion
10.7x
faster than PyTorch

AI read reference kernels (hipconv, PA, fused_mlp), extracted the sched_group_barrier technique, and beat the hand-tuned production kernel by 14%.

Human provides context declaratively.
AI reads, implements, self-corrects.
Human verifies the result.

Kernel              Time       Speedup
PyTorch             5.81 ms    1.0x
Step 1: Naive       5.14 ms    1.1x
Step 2: NHWC        1.17 ms    5.0x
Step 3: NCHW+LDS    0.62 ms    9.4x
Step 4: MFMA        14.40 ms   0.4x
Step 5: SGB         0.54 ms    10.7x

AMD Instinct MI350X · CDNA4 / gfx950 · bf16

Principles
Key Takeaways

Ground truth beats generic GPU lore.

If the hardware isn’t in the prompt, the model will still sound sure—treat hardware specifications and architecture / ISA documentation as the source of truth.

Build fast; measure deeper than the stopwatch.

Rapid iteration plus real observability with trusted tools—not just end-to-end time—turns guesses into evidence.

Humans own the task, the benchmark, and the verdict.

Define correctness and performance bars; only people sign off on what “done” means.

Map the docs; don’t flood the context.

An index and layered reading paths beat one enormous file agents must swallow whole.

Remember the lesson, not every dead end.

Capture what worked and what failed—then distill and drop noise so the trail stays sharp.

Thank you
“All models are wrong, but some are useful.”

George E. P. Box