gdbg — GPU Profiler for AI Agents

Agents guess. Profilers measure.

GPU kernels are opaque. Profiling tools are fragmented. Agents need structured data, not raw traces.

Without gdbg

The agent reads CUDA code, guesses at bottlenecks, suggests generic optimizations. Is the kernel compute-bound or memory-bound? No idea. Are there fusion opportunities? Can't tell without data.

With gdbg

The agent runs gdbg train.py, sees the roofline classification, finds the kernel eating 40% of GPU time is memory-bound, and targets the fix. Data, not hunches.

See what the agent sees.

One command. Every angle on GPU performance.

gpu> stream-graph

  Stream Graph (0.0ms → 20.9ms, span 20.9ms)

  s13   │B                         AAA                          B│
  s14   │                           AA                           │
  s15   │                           AA                           │
        └────────────────────────────────────────────────────────┘

  Legend:
    A  burst_kernel(float *, int, int)       490.2us
    B  quiet_kernel(float *, int)             25.2us

gpu> hotspot 5000

  Hottest 5.0ms window:

  Window:     574.3ms → 579.3ms
  Busy time:  490.2us  (9.8% of window)
  Launches:   12

  Kernel                         Launches  Time      % busy
  ─────────────────────────────  ────────  ────────  ──────
  burst_kernel                         12   490.2us  100.0%

gpu> bandwidth

  Per-kernel Memory Bandwidth:

  #  Kernel            Achieved    % peak   Bound
  ── ────────────────  ──────────  ──────   ────────
  1  coalesced_copy    612.4 GB/s   82.3%   memory
  2  strided_copy       78.2 GB/s   10.5%   memory  ←low

  1 kernel under 50% of peak — likely memory-access bound
  (poor coalescing, low L2, uncoalesced strided loads)

gpu> critical-path

  Critical path chains (same stream, gap ≤ 100.0us):

  Longest chain: stream 7  span 12.4ms  active 12.2ms (98%)
                 24 kernel(s)

  Top kernels on chain:
  Kernel                 Launches  Time      % chain
  ─────────────────────  ────────  ────────  ───────
  trunk_step                   24    12.2ms   100.0%
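
The chain rule shown in the critical-path output (same stream, gap at or below a threshold) can be sketched in a few lines of Python. This is an illustrative reconstruction, not gdbg's implementation: the function names, the `(stream_id, start_us, duration_us)` tuple layout, and the choice to rank chains by span are all assumptions.

```python
from collections import defaultdict


def _keep_longer(best, candidate):
    # Chains are (stream, span_us, active_us, launches); compare by span.
    if best is None or candidate[1] > best[1]:
        return candidate
    return best


def longest_chain(launches, max_gap_us=100.0):
    """Find the longest same-stream chain of kernel launches.

    launches: iterable of (stream_id, start_us, duration_us) tuples
              (hypothetical layout, chosen for this sketch).
    Returns (stream_id, span_us, active_us, launch_count), or None if empty.
    """
    by_stream = defaultdict(list)
    for stream, start, duration in launches:
        by_stream[stream].append((start, duration))

    best = None
    for stream, items in by_stream.items():
        items.sort()
        chain_start = chain_end = None
        active, count = 0.0, 0
        for start, duration in items:
            if chain_end is None or start - chain_end > max_gap_us:
                # Gap too large (or first launch): close the old chain, open a new one.
                if chain_end is not None:
                    best = _keep_longer(
                        best, (stream, chain_end - chain_start, active, count)
                    )
                chain_start, chain_end = start, start
                active, count = 0.0, 0
            chain_end = max(chain_end, start + duration)
            active += duration
            count += 1
        if chain_end is not None:
            best = _keep_longer(best, (stream, chain_end - chain_start, active, count))
    return best


launches = [
    (7, 0.0, 500.0),
    (7, 510.0, 500.0),   # starts 10us after the previous kernel ends: same chain
    (7, 5000.0, 100.0),  # several ms later: starts a new chain
    (8, 0.0, 200.0),
]
print(longest_chain(launches))
```

Span is the wall-clock extent of the chain and active is the summed kernel time, so active divided by span gives the utilization percentage reported next to the longest chain.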

Collect everything. At once.

One command triggers three independent collection phases. Each can fail without blocking the others. You always get the data that's available.

Timeline

GPU Timeline

Kernel launches, memory transfers, stream activity, NVTX regions. The big picture of what the GPU actually did.

Metrics

Hardware Metrics

Occupancy, throughput, registers, shared memory, L2 hit rate. Collected on the hottest kernels, where the detail matters.

Mapping

Op Mapping

Which operator launched which kernel. Bridges the gap between your code and the hardware.

30+ commands.

After collection, the agent queries everything from a single interface.

Hotspots

kernels [N] [pattern]

Top kernels by total GPU time

ops [N] [pattern]

Top operators by GPU time (needs op mapping data)

stats

Overall session summary

top-ops [N] [pattern]

Operators ranked by GPU time contribution

Analysis

roofline [pattern]

Classify compute-bound vs memory-bound

bound <kernel>

Detailed boundedness diagnosis for a kernel

occupancy [N]

SM occupancy ranking

variance <kernel>

Launch-to-launch timing variance

warmup

Detect warmup launches before steady state

small [N]

Kernels where launch overhead exceeds compute

fuse [N]

Sequential kernels that could be fused

concurrency

Stream utilization and parallelism gaps

hotpath

Critical path through ops (CPU vs GPU bound)

compare-ops [N]

CPU vs GPU time ratio per operator

breakdown <op>

Which kernels an operator expands into

idle-between <a> <b>

GPU idle gap between two operators

Timeline

transfers [N]

Memory copies ranked by cost

gaps [N]

GPU idle periods

overlap

Compute/transfer concurrency

streams

Per-stream utilization breakdown

timeline [N]

Chronological kernel launches

Drill-down

inspect <kernel>

Full detail from all data layers

trace <op>

Operator to kernel mapping

callers <kernel>

Which operator launched this kernel

Filtering

focus <pattern>

Show only matching kernels

ignore <pattern>

Hide matching kernels

region <name>

Focus on NVTX / profiler step

reset

Clear all filters

Sessions

save <name>

Save session to .dbg/gpu/

list

List saved sessions

diff <name>

Compare current session against a saved one

layers

Show loaded data layers

suggest

Suggest what data to collect next

How the agent thinks.

The agent doesn't follow a script. It reasons about GPU performance the way an engineer would.

“What's actually slow?” The agent runs gdbg train.py, then stats and kernels. Now it knows where GPU time goes, not where it assumed it went.

“Why is it slow?” roofline answers the question that matters: is this kernel starved for compute or starved for memory? The fix depends on the answer.
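
The idea behind a roofline classification can be sketched as follows. This is a minimal illustration of the standard roofline model, not gdbg's code: the function name is hypothetical, and the peak figures are placeholders that would need to be replaced with the real numbers for a given GPU.

```python
def classify_kernel(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Classify a kernel with the roofline model.

    Arithmetic intensity (FLOP per byte) is compared against the ridge
    point, i.e. the intensity at which the device's compute peak and
    memory-bandwidth peak intersect.
    """
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    intensity = flops / bytes_moved
    ridge = peak_flops_per_s / peak_bytes_per_s
    bound = "compute" if intensity >= ridge else "memory"
    return intensity, ridge, bound


# Placeholder device peaks (assumptions for this sketch, not measured values).
PEAK_FLOPS = 20e12   # 20 TFLOP/s
PEAK_BW = 1.5e12     # 1.5 TB/s

# Vector add in fp32: 1 FLOP per element, 12 bytes moved (two loads, one store).
n = 1 << 20
print(classify_kernel(n, 12 * n, PEAK_FLOPS, PEAK_BW))

# Square fp32 matmul with idealized traffic: 2*m^3 FLOPs, three m*m matrices.
m = 4096
print(classify_kernel(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BW))
```

A memory-bound result points toward access-pattern and data-movement fixes, while a compute-bound result points toward instruction mix, precision, or tensor-core usage.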

“What else is wrong?” fuse finds sequential kernels that should be one. small finds kernels where launch overhead dominates. gaps finds idle time the GPU wasted.
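
As an illustration of the kind of analysis behind a command like gaps, idle periods can be found by sweeping over sorted activity intervals. This is a sketch under assumptions: the function name and the `(start_us, end_us)` input layout are hypothetical, and gdbg's actual logic may differ.

```python
def idle_gaps(intervals, min_gap_us=0.0):
    """Return GPU idle periods, longest first.

    intervals: (start_us, end_us) pairs of GPU activity across all streams.
    Overlapping activity (e.g. concurrent streams) is merged implicitly by
    tracking the latest end time seen so far.
    """
    gaps = []
    busy_until = None
    for start, end in sorted(intervals):
        if busy_until is not None and start - busy_until > min_gap_us:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return sorted(gaps, key=lambda g: g[1] - g[0], reverse=True)


activity = [(0, 10), (5, 20), (50, 60), (65, 70)]
for gap_start, gap_end in idle_gaps(activity):
    print(f"idle {gap_start}us -> {gap_end}us ({gap_end - gap_start}us)")
```

Raising `min_gap_us` filters out the short gaps that normally sit between back-to-back launches, leaving only the stalls worth investigating.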

“Show me everything about this kernel.” inspect pulls hardware counters, occupancy, timing, and the operator that launched it, all in one view.

“Did the fix actually help?” The agent saves a baseline, makes changes, re-profiles, and runs diff. No guessing. The numbers improve or they don't.

Get started.

gdbg ships with dbg. One install gives you both.

Install

cargo install dbg-cli

Check dependencies

gdbg check

Verifies all required tools are available. Tells you exactly what's missing and how to install it.

Profile

gdbg train.py

Auto-detects the target type, collects data, and drops you into the REPL.

How it works.

Three independent phases. Timeline, hardware metrics, and op mapping collect separately. A failure in one doesn't block the others. You always get the data that's available.
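
The failure-isolation pattern described here can be sketched like this. The collector functions are stand-in stubs invented for the example, and `collect_all` is a hypothetical name; none of this reflects how gdbg actually invokes its profilers.

```python
def collect_all(phases):
    """Run each collection phase independently.

    phases: dict mapping a layer name to a zero-argument callable.
    Returns (layers, errors): data from the phases that succeeded, and
    error messages from the phases that failed.
    """
    layers, errors = {}, {}
    for name, collect in phases.items():
        try:
            layers[name] = collect()
        except Exception as exc:  # one failing phase must not block the rest
            errors[name] = str(exc)
    return layers, errors


# Stub collectors (hypothetical, for illustration only).
def collect_timeline():
    return [("burst_kernel", 490.2)]


def collect_metrics():
    raise RuntimeError("hardware counters unavailable")


def collect_op_mapping():
    return {"matmul": ["burst_kernel"]}


layers, errors = collect_all({
    "timeline": collect_timeline,
    "metrics": collect_metrics,
    "op_mapping": collect_op_mapping,
})
print("loaded:", sorted(layers))
print("failed:", errors)
```

Keeping the errors alongside the loaded layers lets later commands explain why a given layer is missing instead of failing silently.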

One session, every layer. Timeline, hardware counters, and op mapping merge into a single queryable session. No jumping between tools or parsing different output formats.

Top-5 targeting. Hardware metrics collection is expensive. gdbg runs it only on the five hottest kernels from the timeline, not on every kernel. Fast enough to use interactively.
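
Selecting which kernels deserve the expensive pass amounts to ranking by total GPU time. A minimal sketch, assuming timeline data arrives as `(kernel_name, duration_us)` pairs; the function name and input shape are illustrative, not gdbg's internals.

```python
from collections import Counter


def hottest_kernels(launches, n=5):
    """Return the names of the n kernels with the most total GPU time.

    launches: iterable of (kernel_name, duration_us), one entry per launch.
    """
    totals = Counter()
    for name, duration_us in launches:
        totals[name] += duration_us
    return [name for name, _ in totals.most_common(n)]


launches = [
    ("gemm", 300.0),
    ("relu", 20.0),
    ("gemm", 310.0),
    ("softmax", 90.0),
]
print(hottest_kernels(launches, n=2))
```

Ranking by summed time rather than per-launch time means a cheap kernel launched thousands of times can still make the cut.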

Cross-layer correlation. Op mapping connects Python operators to GPU kernels. trace matmul shows every kernel that a matmul op launched.
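
One common way to link operators to kernels is to join the two layers on a shared launch identifier, such as the correlation IDs that GPU tracing APIs attach to launches. The sketch below assumes that approach; the function name, record layouts, and kernel names are hypothetical, and gdbg may correlate layers differently.

```python
from collections import defaultdict


def kernels_by_op(op_launches, kernel_records):
    """Group kernel names under the operator that launched them.

    op_launches:    iterable of (op_name, correlation_id) from the op layer.
    kernel_records: iterable of (correlation_id, kernel_name) from the timeline.
    Kernels with no matching operator are grouped under "<unmapped>".
    """
    op_for_id = {corr_id: op for op, corr_id in op_launches}
    grouped = defaultdict(list)
    for corr_id, kernel in kernel_records:
        grouped[op_for_id.get(corr_id, "<unmapped>")].append(kernel)
    return dict(grouped)


# Illustrative data only; the kernel names are made up.
ops = [("matmul", 1), ("matmul", 2), ("relu", 3)]
kernels = [
    (1, "gemm_tile_kernel"),
    (2, "splitk_reduce_kernel"),
    (3, "elementwise_kernel"),
    (9, "memset_kernel"),
]
mapping = kernels_by_op(ops, kernels)
print(mapping["matmul"])
```

The same table read in the other direction answers the reverse question: which operator launched a given kernel.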

Session diffing. Save a baseline, optimize, re-profile, diff. The agent sees exactly what changed: faster kernels, fewer launches, better occupancy.
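
At its core, a session diff is a per-kernel comparison of two measurements. A minimal sketch, assuming each session reduces to a mapping from kernel name to total GPU time in microseconds; the function name, session shape, and sample numbers are invented for illustration.

```python
def diff_sessions(baseline, current):
    """Compare per-kernel GPU time between two sessions.

    baseline, current: dict of kernel_name -> total time in microseconds.
    Returns rows of (kernel, before_us, after_us, delta_us), biggest
    improvement first. Kernels missing from one side count as 0.
    """
    rows = []
    for name in sorted(set(baseline) | set(current)):
        before = baseline.get(name, 0.0)
        after = current.get(name, 0.0)
        rows.append((name, before, after, after - before))
    return sorted(rows, key=lambda row: row[3])


# Illustrative numbers only.
baseline = {"strided_copy": 800.0, "relu": 20.0, "add": 15.0}
current = {"strided_copy": 120.0, "fused_add_relu": 25.0}

for kernel, before, after, delta in diff_sessions(baseline, current):
    print(f"{kernel:16s} {before:8.1f} -> {after:8.1f}  ({delta:+.1f}us)")
```

Treating a missing kernel as zero makes fusion visible in the diff: the original kernels drop to zero and the fused kernel appears as a new row.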

Multiple profilers. One answer.

GPU Timeline

Hardware Metrics

Op Mapping

CUDA

PyTorch

Triton