The AI's final win came from a compiler scheduling hint (sched_group_barrier) that the human engineer hadn't tried.
The problem: 3D depthwise convolution on AMD Instinct MI350X.
- Input: [1, 512, 61, 45, 80] bf16 (NCHW)
- Weight: [512, 1, 3, 5, 5] bf16 (depthwise: groups=512)
- Output: [1, 512, 59, 45, 80] bf16
- Padding: (0, 2, 2), Stride: (1, 1, 1)
- GFLOPs: 16.31
PyTorch (F.conv3d via MIOpen/ROCm) runs this in 5.8ms. The challenge: depthwise conv has no cross-channel computation (groups=512), which limits which GPU hardware features can help. Can AI find optimizations a human engineer missed?
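As a sanity check on those numbers, here is a minimal standalone sketch (plain C++, not from the repo) that derives the output shape and FLOP count from the spec:

```cpp
#include <cstdio>

// Conv output size per axis: out = (in + 2*pad - kernel) / stride + 1.
int main() {
    const int C = 512, D = 61, H = 45, W = 80;   // input  [1, 512, 61, 45, 80]
    const int KD = 3, KH = 5, KW = 5;            // weight [512, 1, 3, 5, 5]
    const int PD = 0, PH = 2, PW = 2;            // padding (0, 2, 2), stride (1, 1, 1)
    const int OD = D + 2 * PD - KD + 1;          // 59
    const int OH = H + 2 * PH - KH + 1;          // 45
    const int OW = W + 2 * PW - KW + 1;          // 80
    // Depthwise: every output element costs KD*KH*KW = 75 MACs = 150 FLOPs.
    const double flops = 2.0 * C * OD * OH * OW * KD * KH * KW;
    printf("output [1, %d, %d, %d, %d], %.2f GFLOPs\n", C, OD, OH, OW, flops / 1e9);
    // prints: output [1, 512, 59, 45, 80], 16.31 GFLOPs
}
```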
Human provides context declaratively. AI reads, implements, self-corrects. Human verifies.
- task.md: problem definition, conv3d shapes, PyTorch reference code. 33 lines. "Here's the problem, go."
- gpu_arch/: CDNA3/4 ISA, LDS sizes, instruction throughput, roofline analysis. The source of truth for hardware facts.
- rocprofv3: PMC hardware counters + instruction-level thread trace. AI measures bottlenecks, doesn't guess.
Five steps, five distinct algorithms. The failures teach as much as the wins.
Step 1 (naive): one thread computes one output pixel. Each thread independently reads 75 input values and 75 weights from global memory, with boundary checks per tap. No data sharing between threads.
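A minimal sketch of that mapping in HIP C++ (my reconstruction from the description, not the repo's kernel; assumes ROCm's __hip_bfloat16 float conversions from hip_bf16.h):

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_bf16.h>

// Step 1 pattern: one thread = one output pixel (c, od, oh, ow).
// Every one of the 75 taps reads input and weight straight from global memory.
__global__ void dwconv3d_naive(const __hip_bfloat16* in, const __hip_bfloat16* w,
                               __hip_bfloat16* out) {
    const int C = 512, D = 61, H = 45, W = 80, OD = 59, OH = 45, OW = 80;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= C * OD * OH * OW) return;
    int ow = idx % OW, oh = (idx / OW) % OH;
    int od = (idx / (OW * OH)) % OD, c = idx / (OW * OH * OD);
    float acc = 0.f;
    for (int kd = 0; kd < 3; ++kd)
        for (int kh = 0; kh < 5; ++kh)
            for (int kw = 0; kw < 5; ++kw) {
                int id = od + kd;                       // no depth padding
                int ih = oh + kh - 2, iw = ow + kw - 2; // pad (2, 2) in H, W
                if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue; // per-tap bounds check
                acc += float(in[((c * D + id) * H + ih) * W + iw])
                     * float(w[c * 75 + kd * 25 + kh * 5 + kw]);
            }
    out[idx] = __hip_bfloat16(acc);
}
```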
AI read gpu_arch/quick_reference.md and learned the latency gap: global memory ~500 cycles, LDS ~50 cycles.
Step 2 (NHWC): the NHWC layout makes adjacent threads access adjacent channels, giving coalesced 128-byte global reads. Each thread handles 8 channels at one spatial position. 5x faster, but each thread still reads 75 values independently.
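The layout change is easiest to see in the flattened index math. A sketch (hypothetical helper names; channels-last written NDHWC for the 5D case):

```cpp
#include <cstddef>

// Channels-first: channel stride is D*H*W, so adjacent lanes that differ in
// spatial position hit addresses far apart and waves issue many partial
// memory transactions.
inline size_t idx_ncdhw(int c, int d, int h, int x, int D, int H, int W) {
    return ((size_t(c) * D + d) * H + h) * W + x;
}

// Channels-last: channel stride is 1, so lanes covering consecutive channels
// at one (d, h, w) position hit consecutive bf16 addresses and the wave's
// requests coalesce into full 128-byte transactions.
inline size_t idx_ndhwc(int c, int d, int h, int x, int H, int W, int C) {
    return ((size_t(d) * H + h) * W + x) * C + c;
}
```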
Step 3 (NCHW + LDS): switch back to NCHW. 256 threads cooperatively load the input tile into LDS (Local Data Share, 160KB/CU on MI350X), then each thread reads from fast LDS. Weights are cached in 75 float VGPRs. Data is loaded once and reused by all threads. This is the hand-tuned production kernel.
```
256 threads cooperate:
  HBM → LDS  [3×49×84 bf16]    ← loaded once, shared
  HBM → Regs [75 weights]      ← per-thread, reused
  LDS → 45 ds_read_b32 → 150 v_fmac_f32 → store
```
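A simplified HIP C++ sketch of the Step 3 structure (one block per (channel, output depth) pair; my reconstruction, not the production kernel):

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_bf16.h>

__global__ void dwconv3d_lds(const __hip_bfloat16* in, const __hip_bfloat16* w,
                             __hip_bfloat16* out) {
    const int C = 512, D = 61, H = 45, W = 80, OD = 59, OH = 45, OW = 80;
    const int TH = OH + 4, TW = OW + 4;            // 49 × 84 tile with pad-2 halo
    __shared__ __hip_bfloat16 tile[3 * 49 * 84];   // ~24 KB of the 160 KB LDS
    const int c = blockIdx.x, od = blockIdx.y;

    // Weights for this channel: 75 floats held in VGPRs, reused by every pixel.
    float wreg[75];
    for (int i = 0; i < 75; ++i) wreg[i] = float(w[c * 75 + i]);

    // 256 threads cooperatively copy 3 halo'd depth planes HBM → LDS, once.
    for (int i = threadIdx.x; i < 3 * TH * TW; i += blockDim.x) {
        int kd = i / (TH * TW), r = (i / TW) % TH, col = i % TW;
        int id = od + kd, ih = r - 2, iw = col - 2;
        bool inb = (ih >= 0 && ih < H && iw >= 0 && iw < W);
        tile[i] = inb ? in[((c * D + id) * H + ih) * W + iw] : __hip_bfloat16(0.f);
    }
    __syncthreads();   // tile now visible to the whole block

    // Compute: every tap reads LDS (~50 cycles) instead of HBM (~500 cycles).
    for (int p = threadIdx.x; p < OH * OW; p += blockDim.x) {
        int oh = p / OW, ow = p % OW;
        float acc = 0.f;
        for (int kd = 0; kd < 3; ++kd)
            for (int kh = 0; kh < 5; ++kh)
                for (int kw = 0; kw < 5; ++kw)
                    acc += float(tile[(kd * TH + oh + kh) * TW + ow + kw])
                         * wreg[kd * 25 + kh * 5 + kw];
        out[((c * OD + od) * OH + oh) * OW + ow] = __hip_bfloat16(acc);
    }
}
// Launch sketch: dwconv3d_lds<<<dim3(512, 59), 256>>>(in, w, out);
```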
Step 4 (MFMA Toeplitz): AI studied the hipconv grouped convolution library and found the Toeplitz matrix + MFMA technique: reformulate the width convolution as a matrix multiply and use v_mfma_f32_4x4x4f16 (batch=16) to process 16 channels simultaneously. It produces correct results but is 6x slower than Step 3.
Why it failed: For depthwise conv (1 channel per group), loading 16 channels' data costs 16x more, but compute speeds up at most 16x — net zero. With only 64 threads (1 wave), cooperative LDS loading is 4x slower than Step 3's 256 threads.
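The back-of-envelope arithmetic, spelled out (only the 16x, 64-thread, and 256-thread figures come from the text above):

```cpp
#include <cstdio>

int main() {
    const double bytes_factor = 16.0;  // must stage 16 channels' input tiles, not 1
    const double flops_factor = 16.0;  // one MFMA batch retires 16 channels' taps
    // Arithmetic intensity (flops per byte) is what the roofline rewards:
    printf("intensity gain: %.1fx\n", flops_factor / bytes_factor);   // 1.0x: no win
    // And the single-wave kernel fills LDS with a quarter of the threads:
    printf("LDS-fill threads vs Step 3: %.0f%%\n", 100.0 * 64 / 256); // 25%
}
```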
Step 5 (sched_group_barrier): AI read fused_mlp.py (a Triton+Gluon fused MLP kernel) and found sched_group_barrier: a compiler hint that forces LLVM to interleave LDS reads with VALU compute, instead of batching all reads then all computes.
Two changes from Step 3:
```
Step 3:  ds_read ×45 → waitcnt → v_fmac ×150        (batch all)

Step 5:  for each of 15 filter rows:                (interleave)
           ds_read ×3
           sched_group_barrier(DS_READ, 3)
           sched_group_barrier(VALU, 10)
           waitcnt → v_fmac ×10
```
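In HIP C++ the hint is the clang builtin __builtin_amdgcn_sched_group_barrier(mask, size, sync_id), where mask 0x100 selects DS reads and 0x002 selects VALU. A simplified sketch of the Step 5 loop body (illustrative offsets and FMA count; the real kernel issues 10 VALU ops per row):

```cpp
#include <hip/hip_runtime.h>

// Per filter row: issue 3 LDS reads, then pin a scheduling group of
// "3 DS reads, then 10 VALU", so the compiler overlaps row r+1's reads
// with row r's FMAs instead of batching all 45 reads up front.
__device__ float conv_rows_interleaved(const float* lds_tile, const float* wreg,
                                       int base, int stride) {
    float acc = 0.f;
    for (int row = 0; row < 15; ++row) {                  // 3 kd × 5 kh = 15 filter rows
        float a = lds_tile[base + row * stride + 0];      // ds_read ×3
        float b = lds_tile[base + row * stride + 1];
        float c = lds_tile[base + row * stride + 2];
        __builtin_amdgcn_sched_group_barrier(0x100, 3, 0);  // group: 3 DS reads
        __builtin_amdgcn_sched_group_barrier(0x002, 10, 0); // group: 10 VALU ops
        acc += a * wreg[row * 3 + 0] + b * wreg[row * 3 + 1] + c * wreg[row * 3 + 2];
    }
    return acc;
}
```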
Results are bitwise identical to Step 3 — same algorithm, better instruction scheduling. 14% faster.
The key find: sched_group_barrier. The how matters as much as the what. The breakthrough came from reading a completely unrelated kernel:
1. fused_mlp.py, a Triton+Gluon fused MLP kernel, uses sched_group_barrier to interleave VMEM reads with MFMA compute.
2. AI recognized that the interleaving principle applies to DS_READ + VALU in the depthwise conv kernel, not just VMEM + MFMA.
3. Result: 14% faster. The same interleaving idea had been tried in a JIT kernel and failed (no compiler hints available there); it succeeded in HIP C++ with sched_group_barrier.
Between Steps 3 and 5, AI explored 12 optimization ideas. Most failed. Each failure narrowed the search space.
The same context artifacts (task.md, gpu_arch/, rocprofv3 with PMC + thread trace) anchored the review loop: AI made confident claims, the human challenged them, and AI traced each claim to its source and corrected. Three examples:
- MI350X (CDNA4) has 2.5x more LDS than CDNA3 (verified via the rocminfo GROUP segment). This changed the entire occupancy analysis.
- A claimed LDS queue depth rested on vm_cnt.md, which only tested the VMEM queue, not LDS. The source confirmed VMEM has a 64-entry queue but only guessed that LDS works similarly, noting "that causes no stall most-likely." LDS queue depth was never measured. Claim retracted.
- Actual runtime occupancy = min(VGPR limit, LDS limit). On MI350X with 160KB LDS, both happen to give 5, so the number was coincidentally correct, as the sketch below works through.
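A sketch of that min() calculation (only the 160 KB LDS figure is from the text; the VGPR file size, per-wave VGPR count, and per-workgroup LDS are assumed for illustration):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Assumed: 512-entry VGPR file per SIMD, kernel allocates 96 VGPRs/wave.
    const int vgpr_limit = 512 / 96;                     // 5 waves/SIMD
    // From the text: 160 KB LDS per CU. Assumed: ~30 KB LDS per workgroup,
    // 256-thread workgroups = 4 wave64s, 4 SIMDs per CU.
    const int wgs_per_cu = (160 * 1024) / (30 * 1024);   // 5 workgroups/CU
    const int lds_limit  = wgs_per_cu * 4 / 4;           // 5 waves/SIMD
    printf("occupancy = min(%d, %d) = %d waves/SIMD\n",
           vgpr_limit, lds_limit, std::min(vgpr_limit, lds_limit));
}
```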
All kernels verified against PyTorch F.conv3d using pyhip.calc_diff and the same slice rules as pyhip/tests/contrib/conv3d/test.py (see benchmark.py).
| Kernel | Time | Speedup | Key technique |
|---|---|---|---|
| PyTorch | 5.81 ms | 1.0x | MIOpen / ROCm baseline |
| Step 1: Naive | 5.14 ms | 1.1x | 1 thread = 1 output, global memory |
| Step 2: NHWC | 1.17 ms | 5.0x | Coalesced channels, no data reuse |
| Step 3: NCHW+LDS | 0.62 ms | 9.4x | Cooperative LDS cache, weight registers |
| Step 4: MFMA | 14.40 ms | 0.4x | Matrix engine Toeplitz (failure) |
| Step 5: SGB | 0.54 ms | 10.7x | sched_group_barrier + row interleave |
```bash
cd EvoKernel
bash demo.sh             # default: 100 iterations
bash demo.sh --iters 200 # more stable timing
```
Requires: AMD MI300X/MI350X (gfx942/gfx950), ROCm, PyTorch with ROCm support.
The full 12-phase optimization journey — including assembly patching, JIT rewrites, 6 LDS optimization variants, and all the dead ends — is documented in optimization_trajectory.md.
> “All models are wrong, but some are useful.”
>
> George E. P. Box