Xiake Sun, Tong Qiu, Cheng Luo · Kernel Darwin Lab
3D depthwise convolution on AMD Instinct MI350X
Each output pixel: 75 multiply-accumulate ops (3×5×5 filter). Used in video models, 3D segmentation, and efficient architectures.
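As a reference for what each thread must compute, here is a minimal NumPy sketch of depthwise 3D convolution with valid padding (shapes are hypothetical, not from the benchmark); the inner sum is exactly the 3×5×5 = 75 multiply-accumulates per output element:

```python
import numpy as np

def depthwise_conv3d_ref(x, w):
    """Depthwise 3D cross-correlation, valid padding.
    x: (C, D, H, W) input; w: (C, kd, kh, kw) one filter per channel."""
    C, D, H, W = x.shape
    _, kd, kh, kw = w.shape
    out = np.zeros((C, D - kd + 1, H - kh + 1, W - kw + 1), dtype=x.dtype)
    for c in range(C):
        for d in range(out.shape[1]):
            for h in range(out.shape[2]):
                for wi in range(out.shape[3]):
                    # For a 3x5x5 filter: 75 multiply-accumulates per output
                    out[c, d, h, wi] = np.sum(x[c, d:d + kd, h:h + kh, wi:wi + kw] * w[c])
    return out
```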
PyTorch baseline
MIOpen / ROCm on MI350X (CDNA4 / gfx950)
Can AI do better than both PyTorch and a human expert?
Conv3d shapes, PyTorch reference code. 33 lines. "Here's the problem, go."
CDNA3/4 ISA, LDS sizes, instruction throughput, roofline analysis. The source of truth.
PMC hardware counters + instruction-level thread trace. AI measures, not guesses.
One thread computes one output pixel. Each thread independently reads 75 inputs + 75 weights from global memory with boundary checks per tap.
22 VGPRs · Occupancy 8 · 0 LDS
NHWC layout: adjacent threads access adjacent channels. Coalesced 128-byte reads. 5x faster, but each thread still reads 75 values independently. No data reuse.
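Why channels-last coalesces can be seen from the index arithmetic alone. A plain-Python sketch (hypothetical sizes: C=64, D=8, H=W=32, bf16, and 3D layouts NCDHW vs NDHWC): when adjacent threads read adjacent channels, channels-last puts their bytes next to each other, while channels-first strides them a whole D·H·W plane apart:

```python
def offset_nchw(n, c, d, h, w, C, D, H, W):
    # Element index in NCDHW (channels-first) layout
    return (((n * C + c) * D + d) * H + h) * W + w

def offset_nhwc(n, d, h, w, c, D, H, W, C):
    # Element index in NDHWC (channels-last) layout
    return (((n * D + d) * H + h) * W + w) * C + c

C, D, H, W = 64, 8, 32, 32
BYTES = 2  # bf16
# Thread t reads channel t at the same spatial position (0, 0, 0, 0).
nchw = [offset_nchw(0, t, 0, 0, 0, C, D, H, W) * BYTES for t in range(4)]
nhwc = [offset_nhwc(0, 0, 0, 0, t, D, H, W, C) * BYTES for t in range(4)]
# nhwc: consecutive 2-byte elements -> one coalesced 128-byte read per wave
# nchw: 16 KB apart -> one memory transaction per thread
```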
256 threads cooperatively load input into LDS (fast on-chip memory). Each thread reads from LDS. Weights cached in 75 float VGPRs. Data loaded once, reused by all.
155 VGPRs · Occupancy 3 · 32KB LDS
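The cooperative-load pattern can be mimicked in plain Python (a 1D stand-in, not the HIP kernel): each input tile plus its halo is copied into a scratch buffer once, playing the role of LDS, and every output in the tile reuses that single copy instead of re-reading global memory:

```python
import numpy as np

def lds_style_conv1d(x, w, tile=256):
    """1D sketch of the LDS pattern: one cooperative load per tile,
    then every 'thread' in the tile reuses the shared copy."""
    K = len(w)
    halo = K - 1
    out = np.empty(len(x) - halo, dtype=x.dtype)
    for start in range(0, len(out), tile):
        stop = min(start + tile, len(out))
        lds = x[start:stop + halo].copy()     # loaded once per workgroup
        for t in range(stop - start):          # each thread reads fast memory
            out[start + t] = np.dot(lds[t:t + K], w)
    return out
```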
AI studied hipconv's grouped convolution, reformulated the width convolution as a Toeplitz matrix multiply, and packed 16 channels into an MFMA batch of 16.
Correct results but 6x slower.
Loading 16 channels at once moves 16x more data, and only 64 threads (vs 256) remain for the cooperative load.
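The Toeplitz reformulation itself is easy to sketch in NumPy for the 1D (width) case; the failure was in data movement, not in the math:

```python
import numpy as np

def toeplitz_conv1d(x, w):
    # Rows of T are sliding windows of x; the convolution becomes a matrix multiply,
    # which is what lets it map onto MFMA instructions.
    K = len(w)
    T = np.stack([x[i:i + K] for i in range(len(x) - K + 1)])  # (out_len, K)
    return T @ w
```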
AI read fused_mlp.py and found sched_group_barrier: a compiler hint that forces LLVM to interleave LDS reads with VALU compute.
Bitwise-identical output to Step 3, but 14% faster. Same algorithm, better scheduling.
sched_group_barrier. The how matters as much as the what.
VGPRs: 155 → 86 · Occupancy: 3 → 5 · Speedup: +14%
Between Steps 3 and 5, AI explored 12 optimization ideas. Most failed. Each failure narrowed the search space.
AI made confident claims. Human challenged. AI traced to sources and corrected.
MI350X has 2.5x more LDS. Changed entire occupancy analysis.
Source only tested VMEM queue. LDS queue never measured.
Occupancy at runtime is min(VGPR limit, LDS limit). Both limits give 5 on MI350X.
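A toy occupancy calculator reproduces the numbers in this write-up. Its limits are assumptions to verify against the CDNA4 ISA guide: 512 VGPRs per SIMD, 4 SIMDs per CU, 160 KB LDS per CU, wave64 (so a 256-thread workgroup is 4 waves), and a hardware cap of 8 waves per SIMD:

```python
def waves_per_simd(vgprs_per_wave, lds_per_workgroup, waves_per_wg,
                   vgpr_file=512, lds_per_cu=160 * 1024,
                   simds_per_cu=4, hw_max_waves=8):
    # Occupancy = min over every resource limit (assumed CDNA-style numbers).
    vgpr_limit = vgpr_file // vgprs_per_wave
    limits = [vgpr_limit, hw_max_waves]
    if lds_per_workgroup:
        wgs_per_cu = lds_per_cu // lds_per_workgroup
        limits.append(wgs_per_cu * waves_per_wg // simds_per_cu)
    return min(limits)

# Step 1:  22 VGPRs, no LDS            -> capped at the hardware max of 8
# Step 3: 155 VGPRs, 32 KB LDS, 4-wave -> VGPR-bound at 3
# Step 5:  86 VGPRs, 32 KB LDS, 4-wave -> VGPR and LDS limits agree at 5
```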
AI read reference kernels (hipconv, PA, fused_mlp), extracted the sched_group_barrier technique, and beat the hand-tuned production kernel by 14%.
Human provides context declaratively.
AI reads, implements, self-corrects.
Human verifies the result.
| Kernel | Time | Speedup |
|---|---|---|
| PyTorch | 5.81 ms | 1.0x |
| Step 1: Naive | 5.14 ms | 1.1x |
| Step 2: NHWC | 1.17 ms | 5.0x |
| Step 3: NCHW+LDS | 0.62 ms | 9.4x |
| Step 4: MFMA | 14.40 ms | 0.4x |
| Step 5: SGB | 0.54 ms | 10.7x |
AMD Instinct MI350X · CDNA4 / gfx950 · bf16
Ground truth beats generic GPU lore.
If the hardware isn't in the prompt, the model will still sound sure. Treat hardware specifications and architecture/ISA documentation as the source of truth.
Build fast; measure deeper than the stopwatch.
Rapid iteration plus real observability with trusted tools—not just end-to-end time—turns guesses into evidence.
Humans own the task, the benchmark, and the verdict.
Define correctness and performance bars; only people sign off on what “done” means.
Map the docs; don’t flood the context.
An index and layered reading paths beat one enormous file agents must swallow whole.
Remember the lesson, not every dead end.
Capture what worked and what failed—then distill and drop noise so the trail stays sharp.
“All models are wrong, but some are useful.”
George E. P. Box