EvoKernel
AI reads 10 GPU kernels,
beats the hand-tuned best

Xiake Sun, Tong Qiu, Cheng Luo · Kernel Darwin Lab

3D depthwise convolution on AMD Instinct MI350X

5.8ms
PyTorch
 
0.62ms
Hand-tuned (Human)
 
0.54ms
AI-optimized
1. The Problem
3D Depthwise Convolution
Input: [1, 512, 61, 45, 80] bf16 · Weight: [512, 1, 3, 5, 5] bf16 · Output: [1, 512, 59, 45, 80] bf16 · Groups: 512 · Padding: (0, 2, 2) · GFLOPs: 16.31

Each output pixel: 75 multiply-accumulate ops (3×5×5 filter). Used in video models, 3D segmentation, and efficient architectures.
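
Sanity-checking the GFLOPs figure above: 512 × 59 × 45 × 80 ≈ 1.09 × 10⁸ output pixels, each costing 75 MACs = 150 flops, gives ≈ 16.3 GFLOP per forward pass.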

The challenge: depthwise conv has no cross-channel computation (groups=512). Each channel is independent. This limits which GPU hardware features can help.

PyTorch baseline: 5.8 ms (MIOpen / ROCm on MI350X, CDNA4 / gfx950)

Can AI do better than both PyTorch and a human expert?

2. What AI Was Given
The Setup

task.md

Problem definition

Conv3d shapes, PyTorch reference code. 33 lines. "Here's the problem, go."

gpu_arch/

Hardware specs

CDNA3/4 ISA, LDS sizes, instruction throughput, roofline analysis. The source of truth.

rocprofv3

Profiling tools

PMC hardware counters + instruction-level thread trace. AI measures, not guesses.

The human points at reference kernels declaratively ("read this hipconv code", "check fused_mlp.py") without prescribing what to extract. The human owns final verification: correctness against PyTorch and reproducible timing.
Step 1
Naive Baseline

One thread computes one output pixel. Each thread independently reads 75 inputs + 75 weights from global memory with boundary checks per tap.

22 VGPRs · Occupancy 8 · 0 LDS

Roughly the same speed as PyTorch's MIOpen. The obvious approach is not enough.
5.14 ms · 1.1x faster than PyTorch
Thread N:
  global → load 75 inputs  (slow)
  global → load 75 weights (slow)
  for each tap:
    if (in_bounds): acc += w * in
  global ← store output
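
Fleshed out in HIP, the shape of this kernel is roughly the following. A hedged sketch assuming NCDHW layout and the shapes above, not the actual Step 1 code:

  #include <hip/hip_runtime.h>
  #include <hip/hip_bf16.h>

  // One thread per output pixel; every tap does a bounds check and two
  // independent global loads. Launch: ceil(C*OD*H*W / 256) blocks × 256.
  __global__ void dwconv3d_naive(const __hip_bfloat16* __restrict__ in,
                                 const __hip_bfloat16* __restrict__ wgt,
                                 __hip_bfloat16* __restrict__ out) {
      constexpr int C = 512, D = 61, H = 45, W = 80, OD = 59;
      int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one output pixel
      if (idx >= C * OD * H * W) return;
      int ow = idx % W, oh = (idx / W) % H;
      int od = (idx / (W * H)) % OD, c = idx / (W * H * OD);
      float acc = 0.0f;
      for (int kd = 0; kd < 3; ++kd)                     // 75 taps total
        for (int kh = 0; kh < 5; ++kh)
          for (int kw = 0; kw < 5; ++kw) {
            int id = od + kd;                            // no depth padding
            int ih = oh + kh - 2, iw = ow + kw - 2;      // padding (2, 2) in H, W
            if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue;
            acc += __bfloat162float(wgt[((c * 3 + kd) * 5 + kh) * 5 + kw])
                 * __bfloat162float(in[((c * D + id) * H + ih) * W + iw]);
          }
      out[idx] = __float2bfloat16(acc);
  }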
Step 2
NHWC Memory Layout

NHWC layout: adjacent threads access adjacent channels, so reads coalesce into 128-byte transactions.

Coalescing gives 5x, but each thread still does 75 independent global reads. For large filters, data reuse matters more than memory layout.
1.17 ms · 5.0x
64 threads at spatial S:
  T0: ch[0..7]    |
  T1: ch[8..15]   |  coalesced
  T2: ch[16..23]  |  128B read
  ...             |
Each thread: 75 global loads, no reuse across threads
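
A minimal sketch of the addressing change in illustrative HIP; the helper name nhwc_index is mine, not from the kernel:

  #include <hip/hip_runtime.h>
  #include <hip/hip_bf16.h>

  // Channels-last (NHWC) addressing: the channel index varies fastest, so
  // lane i of a 64-wide wavefront reads channel base_c + i at the same
  // spatial position. 64 adjacent bf16 loads = 128 bytes, one coalesced read.
  __device__ inline size_t nhwc_index(int d, int h, int w, int c,
                                      int H, int W, int C) {
      return (((size_t)d * H + h) * W + w) * C + c;   // c is innermost
  }

  // Inside a kernel (illustrative):
  //   int c = base_c + threadIdx.x;   // adjacent lanes → adjacent channels
  //   float x = __bfloat162float(in[nhwc_index(d, h, w, c, H, W, C)]);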
Step 3 — Hand-tuned Best
NCHW + LDS Cache

256 threads cooperatively load input into LDS (fast on-chip memory). Each thread reads from LDS. Weights cached in 75 float VGPRs. Data loaded once, reused by all.

155 VGPRs · Occupancy 3 · 32KB LDS

AI profiles with rocprofv3 thread trace:
33.6% FMA, 20.6% bf16 extraction, 13.3% NOP. Over half the cycles are NOT compute. Can we do better?
0.62 ms · 9.4x — the production kernel
256 threads cooperate:
  HBM ⇒ LDS  [3×49×84 bf16]   ← loaded once, shared
  HBM → Regs [75 weights]     ← per-thread, reused
  LDS → 45 ds_read_b32 → 150 v_fmac_f32 → global_store
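
Reconstructed in HIP, the structure looks roughly like this. A hedged sketch with one workgroup per (channel, output-depth) pair; the production kernel tiles differently, and its 32 KB LDS layout includes padding this sketch omits:

  #include <hip/hip_runtime.h>
  #include <hip/hip_bf16.h>

  // Launch: dim3(512, 59) blocks × 256 threads.
  __global__ void dwconv3d_lds(const __hip_bfloat16* in,
                               const __hip_bfloat16* wgt,
                               __hip_bfloat16* out) {
      constexpr int C = 512, D = 61, H = 45, W = 80, OD = 59;
      constexpr int TH = H + 4, TW = W + 4;              // padded tile: 49 × 84
      __shared__ __hip_bfloat16 tile[3 * TH * TW];       // ≈24 KB of LDS

      int c  = blockIdx.x;                               // one channel per block
      int od = blockIdx.y;                               // one output depth per block

      // 1) 256 threads cooperatively stage 3 padded slices: HBM ⇒ LDS, once.
      for (int i = threadIdx.x; i < 3 * TH * TW; i += blockDim.x) {
          int kd = i / (TH * TW), ih = (i / TW) % TH - 2, iw = i % TW - 2;
          bool ok = ih >= 0 && ih < H && iw >= 0 && iw < W;
          tile[i] = ok ? in[((c * D + od + kd) * H + ih) * W + iw]
                       : __float2bfloat16(0.0f);
      }

      // 2) Each thread caches the channel's 75 filter taps in float VGPRs.
      float w[75];
      for (int t = 0; t < 75; ++t)
          w[t] = __bfloat162float(wgt[c * 75 + t]);

      __syncthreads();                                   // tile is ready

      // 3) Inner loop: LDS reads (ds_read) feed FMAs (v_fmac); no HBM traffic.
      for (int p = threadIdx.x; p < H * W; p += blockDim.x) {
          int oh = p / W, ow = p % W;
          float acc = 0.0f;
          for (int t = 0; t < 75; ++t) {
              int kd = t / 25, kh = (t / 5) % 5, kw = t % 5;
              acc += w[t] * __bfloat162float(
                  tile[(kd * TH + oh + kh) * TW + ow + kw]);
          }
          out[((c * OD + od) * H + oh) * W + ow] = __float2bfloat16(acc);
      }
  }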
Step 4 — Failure
MFMA Matrix Engine

AI studied hipconv's grouped convolution, reformulated the width convolution as a Toeplitz matrix multiply, and packed 16 channels into an MFMA batch of 16.

Correct results but 6x slower.

Loading 16 channels into LDS costs 16x more data movement, and only 64 threads (vs 256) are available for cooperative loading.

Not every clever idea works. MFMA is powerful for grouped conv (cpg ≥ 4 channels per group) but useless for depthwise (cpg = 1). The failure teaches when a technique applies.
14.4 ms · 0.4x — 6x slower!
Toeplitz F(4,5):
  [g0 g1 g2 g3 g4  .  .  . ]
  [ . g0 g1 g2 g3 g4  .  . ]
  [ .  . g0 g1 g2 g3 g4  . ]
  [ .  .  . g0 g1 g2 g3 g4 ]
16-channel data = 16× the LDS load
Only 64 threads (vs 256) for staging
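
The reformulation itself is easy to state. A minimal sketch in plain C++, sizes from the diagram; the MFMA packing and bf16 handling are omitted:

  // Row r of T is the 5-tap filter row g shifted right by r, so T · x over
  // an 8-wide input strip yields 4 adjacent width-convolution outputs as
  // one small matrix multiply.
  void build_toeplitz_f45(const float g[5], float T[4][8]) {
      for (int r = 0; r < 4; ++r)
          for (int k = 0; k < 8; ++k)
              T[r][k] = (k >= r && k < r + 5) ? g[k - r] : 0.0f;
  }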
Step 5 — Breakthrough
sched_group_barrier

AI read fused_mlp.py and found sched_group_barrier: a compiler hint that forces LLVM to interleave LDS reads with VALU compute.

Output bitwise identical to Step 3, yet 14% faster. Same algorithm, better scheduling.

The same interleaving idea was tried in a JIT kernel earlier — it failed (no compiler hints). It only worked in HIP C++ with sched_group_barrier. The how matters as much as the what.

VGPRs: 155 → 86 · Occupancy: 3 → 5 · Speedup: +14%

Per filter row (15 rows):
  ds_read ×3                  ← 3 VGPRs (not 45)
  sched_group_barrier(DS, 3)
  sched_group_barrier(VALU, 10)
  s_waitcnt lgkmcnt(0)
  v_fmac ×10                  ← overlaps with reads
VGPRs: 155 → 86
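
The hint itself is a Clang builtin on AMDGPU toolchains, __builtin_amdgcn_sched_group_barrier(mask, size, sync_id). A hedged sketch of the per-row pattern with illustrative arithmetic, not the production kernel; the mask encodings (0x0080 = DS, 0x0002 = VALU) follow LLVM's AMDGPU convention, so verify them on your ROCm version:

  #include <hip/hip_runtime.h>

  // Per filter row: issue the row's LDS reads, then ask the scheduler to
  // emit "3 DS instructions, then 10 VALU instructions" so reads overlap
  // compute instead of being hoisted to the loop top (which cost 45 VGPRs).
  __device__ float row_fma_interleaved(const float* lds_vals,  // points into LDS
                                       const float* w) {       // weights in VGPRs
      float acc = 0.0f;
      for (int row = 0; row < 15; ++row) {
          float a0 = lds_vals[3 * row + 0];   // ds_read ×3:
          float a1 = lds_vals[3 * row + 1];   // only 3 live VGPRs
          float a2 = lds_vals[3 * row + 2];   // instead of 45
          __builtin_amdgcn_sched_group_barrier(0x0080, 3, 0);   // group: 3 DS ops
          __builtin_amdgcn_sched_group_barrier(0x0002, 10, 0);  // then: 10 VALU ops
          // VALU ops consuming the reads (illustrative; the real kernel
          // unpacks bf16 pairs and issues ten v_fmac_f32 here):
          acc = fmaf(a0, w[3 * row + 0], acc);
          acc = fmaf(a1, w[3 * row + 1], acc);
          acc = fmaf(a2, w[3 * row + 2], acc);
      }
      return acc;
  }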
Results
Performance Comparison
Kernel             Time (ms)   Bandwidth (GB/s)
PyTorch (MIOpen)   5.81        73
Step 1 (Naive)     5.14        82
Step 2 (NHWC)      1.17        361
Step 3 (LDS)       0.62        682
Step 4 (MFMA)      14.4        29
Step 5 (SGB)       0.54        783
3. The Full Picture
It Wasn't a Straight Line

Between Steps 3 and 5, AI explored 12 optimization ideas. Most failed. Each failure narrowed the search space.

  • v_dot2_f32_bf16 in HIP — WRONG: compiler register bug
  • Row-chunk streaming (V104) — 2.5x slower: barrier overhead
  • Compiler pointer reads (V104b) — 2.1x slower: ds_read_u16 bloat
  • NHWC weight tiling (V105) — 2.0x slower: no input reuse
  • ds_read_b64 wider reads — 4.2x slower: gfx950 penalty
  • Software pipeline by depth — no improvement
  • KW_PACK=4 (4 outputs/iter) — WRONG: indexing bug
  • JIT row-interleave (same idea!) — 3% slower: no compiler hints
  • MFMA Toeplitz 16-channel — 6x slower: data loading dominates
  • v_pk_fma_f32 packed ops — WRONG: bf16 format mismatch
20+ kernel variants, 12 phases, most of them failures. The winning technique came from reading a completely unrelated kernel (fused_mlp.py). AI's advantage: it can explore at scale without fatigue.
4. Workflow
Human-AI Collaboration

Human (Declarative)

  • Defines problem via task.md
  • Provides gpu_arch/ hardware docs
  • Shows reference kernels: "read this code"
  • Challenges claims: "where's the source?"
  • Owns final verification

AI (Autonomous)

  • Reads ISA docs at scale
  • Studies kernels, extracts techniques
  • Implements and benchmarks variants
  • Profiles with rocprofv3
  • Self-corrects on evidence
"Read hipconv"
Toeplitz+MFMA → Step 4 (failed, but learned)
"Read fused_mlp.py"
sched_group_barrier → Step 5 (breakthrough)
5. Honesty
Self-Corrections

AI made confident claims. Human challenged. AI traced to sources and corrected.

LDS Size

Claimed: 64 KB per CU (CDNA3 docs) → Corrected: 160 KB per CU (rocminfo)

MI350X has 2.5x more LDS than assumed, which changed the entire occupancy analysis.

LGKM Queue

Claimed: "16-entry queue overflows" → Corrected: an unverified guess (vm_cnt.md)

The cited source only tested the VMEM queue; the LDS queue was never measured.

Compiler Occupancy

Claimed: the compiler reports occupancy 5, so runtime is 5 → Corrected: the compiler accounts for the VGPR limit only

Runtime occupancy is min(VGPR limit, LDS limit); both happen to give 5 on MI350X.
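
Worked through (assuming a 512-entry VGPR file per SIMD, which matches the occupancy numbers reported across the steps): the VGPR limit gives floor(512 / 86) = 5 waves and the LDS limit gives floor(160 KB / 32 KB) = 5 workgroups, so the two constraints coincide at 5; at Step 3's 155 VGPRs, registers alone capped occupancy at floor(512 / 155) = 3.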

Conclusion
10.7x
faster than PyTorch

AI read reference kernels (hipconv, PA, fused_mlp), extracted the sched_group_barrier technique, and beat the hand-tuned production kernel by 14%.

Human provides context declaratively.
AI reads, implements, self-corrects.
Human verifies the result.

Kernel              Time       Speedup
PyTorch             5.81 ms    1.0x
Step 1: Naive       5.14 ms    1.1x
Step 2: NHWC        1.17 ms    5.0x
Step 3: NCHW+LDS    0.62 ms    9.4x
Step 4: MFMA        14.40 ms   0.4x
Step 5: SGB         0.54 ms    10.7x

AMD Instinct MI350X · CDNA4 / gfx950 · bf16

Principles
Key Takeaways

Ground truth beats generic GPU lore.

If the hardware isn’t in the prompt, the model will still sound sure—treat hardware specifications and architecture / ISA documentation as the source of truth.

Build fast; measure deeper than the stopwatch.

Rapid iteration plus real observability with trusted tools—not just end-to-end time—turns guesses into evidence.

Humans own the task, the benchmark, and the verdict.

Define correctness and performance bars; only people sign off on what “done” means.

Map the docs; don’t flood the context.

An index and layered reading paths beat one enormous file agents must swallow whole.

Remember the lesson, not every dead end.

Capture what worked and what failed—then distill and drop noise so the trail stays sharp.

Thank you
“All models are wrong, but some are useful.”

George E. P. Box