GPU performance is opaque. With gdbg, your AI agent stops guessing at bottlenecks and starts measuring them.
GPU kernels are opaque. Profiling tools are fragmented. Agents need structured data, not raw traces.
Without gdbg: the agent reads CUDA code, guesses at bottlenecks, and suggests generic optimizations. Is the kernel compute-bound or memory-bound? No idea. Are there fusion opportunities? Can't tell without data.
With gdbg: the agent runs gdbg train.py, sees the roofline classification, finds that the kernel eating 40% of GPU time is memory-bound, and targets the fix. Data, not hunches.
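The compute-bound versus memory-bound question is a roofline test: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the GPU's ridge point (peak FLOP/s divided by peak bandwidth). The following is a minimal sketch of that test, not gdbg's implementation; the function name and the hardware numbers are hypothetical.

```python
def classify_roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify one kernel as 'memory' or 'compute' bound on a simple roofline.

    flops and bytes_moved describe a single kernel launch; peak_flops (FLOP/s)
    and peak_bandwidth (bytes/s) describe the GPU. All values are supplied by
    the caller and are assumptions in this sketch.
    """
    intensity = flops / bytes_moved        # FLOPs per byte moved
    ridge = peak_flops / peak_bandwidth    # intensity where the two roofs meet
    return "memory" if intensity < ridge else "compute"


# Hypothetical GPU: 20 TFLOP/s peak compute, 800 GB/s peak bandwidth.
PEAK_FLOPS = 20e12
PEAK_BW = 800e9

# A float32 vector add over n elements: n FLOPs, 12n bytes (two loads, one store).
n = 1 << 20
vector_add = classify_roofline(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# A made-up dense kernel doing 1000 FLOPs per byte.
dense = classify_roofline(1e12, 1e9, PEAK_FLOPS, PEAK_BW)
```

With these assumed peaks the ridge point is 25 FLOPs per byte, so the vector add lands on the memory side and the dense kernel on the compute side. A memory-bound kernel benefits from moving fewer bytes; a compute-bound one benefits from doing less arithmetic or using faster math units.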
One command. Every angle on GPU performance.
Stream Graph (0.0ms → 20.9ms, span 20.9ms)
s13 │B AAA B│
s14 │ AA │
s15 │ AA │
└────────────────────────────────────────────────────────┘
Legend:
A burst_kernel(float *, int, int) 490.2us
B quiet_kernel(float *, int) 25.2us
Hottest 5.0ms window:
  Window:    574.3ms → 579.3ms
  Busy time: 490.2us (9.8% of window)
  Launches:  12
  Kernel                         Launches  Time      % busy
  ─────────────────────────────  ────────  ────────  ──────
  burst_kernel                         12  490.2us   100.0%
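One way to find a "hottest window" like the one above is to slide a fixed-width window across the kernel timeline and keep the position with the most kernel time inside it. This is a rough sketch under that assumption; the function name and the sample intervals are hypothetical, and gdbg may compute the window differently.

```python
def hottest_window(kernels, width):
    """Return (window_start, busy_time) for the busiest fixed-width window.

    kernels is a list of (start, duration) tuples in one consistent time unit.
    Candidate windows begin at each kernel start, which is enough for a sketch;
    a full search would also try windows that end at each kernel end.
    Kernels that overlap on different streams each add their own time.
    """
    best_start, best_busy = None, 0.0
    for win_start, _ in kernels:
        win_end = win_start + width
        busy = 0.0
        for start, dur in kernels:
            # Portion of this kernel that falls inside the window.
            overlap = min(start + dur, win_end) - max(start, win_start)
            if overlap > 0:
                busy += overlap
        if busy > best_busy:
            best_start, best_busy = win_start, busy
    return best_start, best_busy


# Hypothetical timeline in microseconds: two kernels close together, one far away.
sample = [(0, 100), (150, 100), (5000, 50)]
start, busy = hottest_window(sample, width=1000)
busy_pct = 100.0 * busy / 1000
```

Here the window starting at 0 wins with 200us of kernel time, or 20% of the window.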
Per-kernel Memory Bandwidth:
  #   Kernel            Achieved     % peak  Bound
  ──  ────────────────  ──────────   ──────  ────────
  1   coalesced_copy    612.4 GB/s   82.3%   memory
  2   strided_copy      78.2 GB/s    10.5%   memory  ←low
  1 kernel under 50% of peak — likely memory-access bound (poor coalescing, low L2, uncoalesced strided loads)
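The "% peak" column is achieved bandwidth divided by the device's peak bandwidth, and the low flag marks kernels under 50% of peak. A small sketch of that arithmetic follows; the function name, the kernel names, and the 750 GB/s peak are assumptions chosen for illustration, not values read from gdbg.

```python
def bandwidth_report(kernels, peak_gbps, low_threshold=50.0):
    """Return (name, achieved GB/s, % of peak, is_low) for each kernel.

    kernels is a list of (name, bytes_moved, seconds) tuples. peak_gbps is the
    device's peak memory bandwidth in GB/s, supplied by the caller.
    """
    rows = []
    for name, bytes_moved, seconds in kernels:
        achieved = bytes_moved / seconds / 1e9   # GB/s
        pct = 100.0 * achieved / peak_gbps       # share of peak bandwidth
        rows.append((name, achieved, pct, pct < low_threshold))
    return rows


# Hypothetical measurements: both kernels run for 10 ms.
measurements = [
    ("fast_copy", 6e9, 0.01),   # 6 GB moved
    ("slow_copy", 6e8, 0.01),   # 0.6 GB moved
]
report = bandwidth_report(measurements, peak_gbps=750.0)
low_kernels = [name for name, _, _, is_low in report if is_low]
```

With these inputs fast_copy reaches about 80% of peak and slow_copy about 8%, so only slow_copy is flagged.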
Critical path chains (same stream, gap ≤ 100.0us):
  Longest chain: stream 7  span 12.4ms  active 12.2ms (98%)  24 kernel(s)
  Top kernels on chain:
  Kernel                 Launches  Time      % chain
  ─────────────────────  ────────  ────────  ───────
  trunk_step                   24  12.2ms    100.0%
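Going by the header above, a chain is a run of kernels on the same stream where each kernel starts within a maximum gap of the previous one ending. The sketch below is one plausible reading of that rule; the function names, dictionary keys, and sample data are hypothetical.

```python
def build_chains(kernels, max_gap):
    """Group kernels into chains: same stream, gap to the chain's end <= max_gap.

    kernels is a list of (stream, start, duration) tuples in one time unit.
    Each chain records its stream, start, end, summed active time, and count.
    """
    by_stream = {}
    for stream, start, dur in sorted(kernels, key=lambda k: (k[0], k[1])):
        chains = by_stream.setdefault(stream, [])
        if chains and start - chains[-1]["end"] <= max_gap:
            chain = chains[-1]                      # extend the current chain
            chain["end"] = max(chain["end"], start + dur)
            chain["active"] += dur
            chain["count"] += 1
        else:                                       # gap too large: new chain
            chains.append({"stream": stream, "start": start,
                           "end": start + dur, "active": dur, "count": 1})
    return [c for chains in by_stream.values() for c in chains]


def longest_chain(kernels, max_gap):
    """Return the chain with the largest span (end minus start)."""
    return max(build_chains(kernels, max_gap), key=lambda c: c["end"] - c["start"])


# Hypothetical launches in microseconds: (stream, start, duration).
launches = [(7, 0, 500), (7, 550, 500), (7, 5000, 100), (8, 0, 200)]
chain = longest_chain(launches, max_gap=100)
span = chain["end"] - chain["start"]
active_pct = 100.0 * chain["active"] / span
```

The first two launches on stream 7 are 50us apart and form one chain; the third starts far later and begins a new chain. The ratio of active time to span shows how tightly packed the chain is.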
One command triggers three independent collection phases. Each can fail without blocking the others. You always get the data that's available.
Kernel launches, memory transfers, stream activity, NVTX regions. The big picture of what the GPU actually did.
Occupancy, throughput, registers, shared memory, L2 hit rate. Collected on the hottest kernels, where the detail matters.
Which operator launched which kernel. Bridges the gap between your code and the hardware.
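The "each can fail without blocking the others" behavior can be pictured as running every phase inside its own error boundary and keeping whatever succeeds. This is an illustrative sketch only; the phase names and functions are hypothetical stand-ins, not gdbg's internals.

```python
def collect_all(phases):
    """Run every phase and record failures instead of stopping at the first one.

    phases maps a phase name to a zero-argument callable. Returns a pair
    (results, errors), each keyed by phase name.
    """
    results, errors = {}, {}
    for name, run in phases.items():
        try:
            results[name] = run()
        except Exception as exc:          # one broken phase must not block the rest
            errors[name] = str(exc)
    return results, errors


# Hypothetical phases standing in for the three collection steps.
def timeline_phase():
    return {"kernel_launches": 24}


def counters_phase():
    raise RuntimeError("hardware counter tool not found")


def operator_phase():
    return {"mapped_operators": 3}


results, errors = collect_all({
    "timeline": timeline_phase,
    "counters": counters_phase,
    "operators": operator_phase,
})
```

In this run the counters phase fails, yet the timeline and operator data are still returned alongside a record of what went wrong.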
Point gdbg at a file. It reads the imports and picks the right collection strategy.
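Import-based detection for a Python target could look something like the sketch below, which parses the file with the standard ast module and maps top-level package names to a strategy label. The strategy table, labels, and function names are assumptions made for illustration; the source does not say which frameworks gdbg recognizes or how.

```python
import ast

# Hypothetical mapping from an imported package to a collection strategy label.
STRATEGIES = {"torch": "pytorch", "jax": "jax", "tensorflow": "tensorflow"}


def imported_modules(source):
    """Return the set of top-level package names imported by Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import utils"
            names.add(node.module.split(".")[0])
    return names


def pick_strategy(source, default="generic"):
    """Pick the first strategy whose package appears in the imports."""
    modules = imported_modules(source)
    for package, strategy in STRATEGIES.items():
        if package in modules:
            return strategy
    return default


script = "import os\nimport torch.nn as nn\n"
strategy = pick_strategy(script)
```

Parsing the syntax tree instead of searching the text avoids false matches on comments or strings that merely mention a framework.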
After collection, the agent queries everything from a single interface.
The agent doesn't follow a script. It reasons about GPU performance the way an engineer would.
gdbg train.py, then stats and kernels. Now it knows where GPU time goes, not where it assumed it went.
roofline answers the question that matters: is this kernel starved for compute or starved for memory? The fix depends on the answer.
fuse finds sequential kernels that should be one. small finds kernels where launch overhead dominates. gaps finds idle time the GPU wasted.
inspect pulls a kernel's hardware counters, occupancy, timing, and the operator that launched it, all in one view.
diff. No guessing. Numbers go up or they don't.
gdbg ships with dbg. One install gives you both.
Install
cargo install dbg-cli
Check dependencies
gdbg check
Verifies all required tools are available. Tells you exactly what's missing and how to install it.
Profile
gdbg train.py
Auto-detects the target type, collects data, and drops you into the REPL.
trace matmul shows every kernel that a matmul op launched.