Multiple profilers. One answer.

GPU performance is opaque. With gdbg, your AI agent stops guessing at bottlenecks and starts measuring them.

Agents guess. Profilers measure.

GPU kernels are opaque. Profiling tools are fragmented. Agents need structured data, not raw traces.

Without gdbg

The agent reads CUDA code, guesses at bottlenecks, suggests generic optimizations. Is the kernel compute-bound or memory-bound? No idea. Are there fusion opportunities? Can't tell without data.

With gdbg

The agent runs gdbg train.py, sees the roofline classification, finds that the kernel eating 40% of GPU time is memory-bound, and targets the fix. Data, not hunches.
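
For readers new to the term, a roofline classification compares a kernel's arithmetic intensity (FLOPs per byte moved) with the device's ridge point (peak FLOP/s divided by peak bandwidth). The sketch below illustrates that general idea only. It is not gdbg's implementation, and the function name and the peak figures are assumed placeholders.

```python
def classify_roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return 'compute' or 'memory' for one kernel (illustrative only)."""
    intensity = flops / bytes_moved        # FLOPs per byte moved
    ridge = peak_flops / peak_bandwidth    # intensity where the two roofs meet
    return "compute" if intensity >= ridge else "memory"


# Hypothetical device: 10 TFLOP/s peak compute, 500 GB/s peak bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 500e9

# A copy-like kernel: 1 FLOP for every 8 bytes moved.
copy_bound = classify_roofline(1e9, 8e9, PEAK_FLOPS, PEAK_BW)
# A dense matmul-like kernel: 200 FLOPs per byte moved.
matmul_bound = classify_roofline(2e12, 1e10, PEAK_FLOPS, PEAK_BW)
```

On this made-up device the ridge point is 20 FLOPs per byte, so the copy-like kernel lands on the memory side and the matmul-like kernel on the compute side.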

See what the agent sees.

One command. Every angle on GPU performance.

gpu> stream-graph
  Stream Graph (0.0ms → 20.9ms, span 20.9ms)

  s13   │B                         AAA                          B│
  s14   │                           AA                           │
  s15   │                           AA                           │
        └────────────────────────────────────────────────────────┘

  Legend:
    A  burst_kernel(float *, int, int)       490.2us
    B  quiet_kernel(float *, int)             25.2us
gpu> hotspot 5000
  Hottest 5.0ms window:

  Window:     574.3ms → 579.3ms
  Busy time:  490.2us  (9.8% of window)
  Launches:   12

  Kernel                         Launches  Time      % busy
  ─────────────────────────────  ────────  ────────  ──────
  burst_kernel                         12   490.2us  100.0%
gpu> bandwidth
  Per-kernel Memory Bandwidth:

  #  Kernel            Achieved    % peak   Bound
  ── ────────────────  ──────────  ──────   ────────
  1  coalesced_copy    612.4 GB/s   82.3%   memory
  2  strided_copy       78.2 GB/s   10.5%   memory  ←low

  1 kernel under 50% of peak — likely memory-access bound
  (poor coalescing, low L2, uncoalesced strided loads)
gpu> critical-path
  Critical path chains (same stream, gap ≤ 100.0us):

  Longest chain: stream 7  span 12.4ms  active 12.2ms (98%)
                 24 kernel(s)

  Top kernels on chain:
  Kernel                 Launches  Time      % chain
  ─────────────────────  ────────  ────────  ───────
  trunk_step                   24    12.2ms   100.0%
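
The critical-path header above states its chaining rule: kernels on the same stream, separated by a gap of at most 100.0us. A minimal sketch of that grouping, assuming launches arrive as (stream, start, duration) tuples, might look like the following. The function name, the input shape, and the demo data are all hypothetical, and gdbg's actual algorithm may differ.

```python
from collections import defaultdict


def longest_chain(kernels, max_gap_us=100.0):
    """kernels: list of (stream, start_us, duration_us) tuples.

    Groups launches on the same stream into chains where the idle gap
    between consecutive kernels is at most max_gap_us, and returns the
    chain with the largest span.
    """
    by_stream = defaultdict(list)
    for stream, start, dur in kernels:
        by_stream[stream].append((start, dur))

    best = None
    for stream, launches in by_stream.items():
        launches.sort()
        chains, chain = [], [launches[0]]
        for start, dur in launches[1:]:
            prev_start, prev_dur = chain[-1]
            if start - (prev_start + prev_dur) <= max_gap_us:
                chain.append((start, dur))   # gap small enough: same chain
            else:
                chains.append(chain)         # gap too large: start a new chain
                chain = [(start, dur)]
        chains.append(chain)

        for c in chains:
            span = c[-1][0] + c[-1][1] - c[0][0]   # first start to last end
            active = sum(d for _, d in c)          # time kernels actually ran
            if best is None or span > best["span_us"]:
                best = {"stream": stream, "span_us": span,
                        "active_us": active, "kernels": len(c)}
    return best


# Hypothetical data: three near-back-to-back launches on stream 7,
# one late straggler on stream 7, one short launch on stream 3.
demo = [(7, 0.0, 500.0), (7, 510.0, 500.0), (7, 1020.0, 500.0),
        (3, 0.0, 50.0), (7, 5000.0, 10.0)]
result = longest_chain(demo)
```

Dividing active_us by span_us gives the kind of "active" percentage shown in the output above.
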
Collect everything. At once.

One command triggers three independent collection phases. Each can fail without blocking the others. You always get the data that's available.

1. GPU Timeline

Kernel launches, memory transfers, stream activity, NVTX regions. The big picture of what the GPU actually did.

2. Hardware Metrics

Occupancy, throughput, registers, shared memory, L2 hit rate. Collected on the hottest kernels, where the detail matters.

3. Op Mapping

Which operator launched which kernel. Bridges the gap between your code and the hardware.
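
The "each phase can fail without blocking the others" behavior described above can be sketched as a loop that isolates every phase in its own error handler. This is an assumption about the general pattern, not gdbg's code. The collector functions below are hypothetical stand-ins.

```python
def collect_all(phases):
    """Run each collection phase independently.

    phases: dict mapping a layer name to a zero-argument callable.
    Returns (layers, errors): results from the phases that succeeded
    and error messages from the phases that failed.
    """
    layers, errors = {}, {}
    for name, collect in phases.items():
        try:
            layers[name] = collect()
        except Exception as exc:   # a failing phase must not stop the rest
            errors[name] = str(exc)
    return layers, errors


# Hypothetical stand-ins for the real collectors.
def collect_timeline():
    return ["kernel_a", "kernel_b"]


def collect_metrics():
    raise RuntimeError("profiler not installed")


def collect_op_mapping():
    return {"matmul": ["kernel_a"]}


layers, errors = collect_all({
    "timeline": collect_timeline,
    "metrics": collect_metrics,
    "op_mapping": collect_op_mapping,
})
```

Here the metrics phase fails, yet the timeline and op mapping layers are still available to query.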

Auto-detects your target.

Point gdbg at a file. It reads the imports and picks the right collection strategy.

CUDA

  • .cu source files
  • Compiled CUDA binaries
  • PyCUDA / CuPy scripts

PyTorch

  • Training scripts
  • Inference pipelines
  • Custom autograd ops

Triton

  • Triton kernels
  • Flash Attention variants
  • Custom fused ops
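
As a rough illustration of import-based detection for Python targets, the sketch below parses a script with the standard ast module and maps its top-level imports to a target type. The rules are assumptions made for this example (for instance, preferring Triton when both triton and torch are imported). gdbg's real detection logic is not shown in this document.

```python
import ast


def detect_target(path, source=None):
    """Guess a target type from the extension and imports (hypothetical rules)."""
    if path.endswith(".cu"):
        return "cuda"
    if source is None:
        with open(path, encoding="utf-8") as f:
            source = f.read()

    # Collect the top-level package name of every import in the file.
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])

    if "triton" in modules:          # assumed priority: Triton before PyTorch
        return "triton"
    if "torch" in modules:
        return "pytorch"
    if modules & {"pycuda", "cupy"}:
        return "cuda"
    return "unknown"


kind = detect_target("train.py", "import torch\nfrom torch import nn\n")
```

Parsing the syntax tree rather than searching for strings avoids false matches on comments or docstrings that merely mention a library.
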
30+ commands.

After collection, the agent queries everything from a single interface.

Hotspots

  kernels [N] [pattern]   Top kernels by total GPU time
  ops [N] [pattern]       Top operators by GPU time (needs op mapping data)
  stats                   Overall session summary
  top-ops [N] [pattern]   Operators ranked by GPU time contribution

Analysis

  roofline [pattern]      Classify compute-bound vs memory-bound
  bound <kernel>          Detailed boundedness diagnosis for a kernel
  occupancy [N]           SM occupancy ranking
  variance <kernel>       Launch-to-launch timing variance
  warmup                  Detect warmup launches before steady state
  small [N]               Kernels where launch overhead exceeds compute
  fuse [N]                Sequential kernels that could be fused
  concurrency             Stream utilization and parallelism gaps
  hotpath                 Critical path through ops (CPU vs GPU bound)
  compare-ops [N]         CPU vs GPU time ratio per operator
  breakdown <op>          Which kernels an operator expands into
  idle-between <a> <b>    GPU idle gap between two operators

Timeline

  transfers [N]           Memory copies ranked by cost
  gaps [N]                GPU idle periods
  overlap                 Compute/transfer concurrency
  streams                 Per-stream utilization breakdown
  timeline [N]            Chronological kernel launches

Drill-down

  inspect <kernel>        Full detail from all data layers
  trace <op>              Operator to kernel mapping
  callers <kernel>        Which operator launched this kernel

Filtering

  focus <pattern>         Show only matching kernels
  ignore <pattern>        Hide matching kernels
  region <name>           Focus on NVTX / profiler step
  reset                   Clear all filters

Sessions

  save <name>             Save session to .dbg/gpu/
  list                    List saved sessions
  diff <name>             Compare current session against a saved one
  layers                  Show loaded data layers
  suggest                 Suggest what data to collect next
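
To make the shape of these queries concrete, here is a sketch of what a command like kernels [N] [pattern] conceptually computes: total GPU time per kernel, optionally filtered, largest first. The function, the substring matching, and the sample launches are assumptions made for illustration, not gdbg's code or its pattern syntax.

```python
from collections import Counter


def top_kernels(launches, n=10, pattern=None):
    """launches: iterable of (kernel_name, duration_us).

    Returns up to n (name, total_us) pairs, largest total first,
    optionally keeping only names that contain pattern.
    """
    totals = Counter()
    for name, dur in launches:
        if pattern is None or pattern in name:
            totals[name] += dur
    return totals.most_common(n)


# Hypothetical launches.
demo = [("burst_kernel", 40.0), ("quiet_kernel", 25.0), ("burst_kernel", 45.0)]
ranked = top_kernels(demo, n=2)
only_quiet = top_kernels(demo, pattern="quiet")
```
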
How the agent thinks.

The agent doesn't follow a script. It reasons about GPU performance the way an engineer would.

“What's actually slow?”
The agent runs gdbg train.py, then stats and kernels. Now it knows where GPU time goes, not where it assumed it went.

“Why is it slow?”
roofline answers the question that matters: is this kernel starved for compute or starved for memory? The fix depends on the answer.

“What else is wrong?”
fuse finds sequential kernels that should be one. small finds kernels where launch overhead dominates. gaps finds idle time the GPU wasted.

“Show me everything about this kernel.”
inspect pulls hardware counters, occupancy, timing, and the operator that launched it, all in one view.

“Did the fix actually help?”
The agent saves a baseline, makes changes, re-profiles, and runs diff. No guessing. The numbers improve or they don't.
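
Of the checks named above, idle-gap detection is the simplest to illustrate. Assuming kernel activity is available as (start, end) intervals, a sketch could sweep the sorted intervals, track how far the GPU has been busy, and report the holes. The names and data here are hypothetical, and this is not gdbg's implementation.

```python
def idle_gaps(intervals, min_gap_us=0.0):
    """intervals: list of (start_us, end_us) for every kernel on the GPU.

    Sweeps the intervals in start order, tracking how far the GPU has
    been busy, and returns (gap_start, gap_end) pairs where nothing was
    running, longest gap first.
    """
    gaps = []
    busy_until = None
    for start, end in sorted(intervals):
        if busy_until is not None and start - busy_until > min_gap_us:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return sorted(gaps, key=lambda g: g[1] - g[0], reverse=True)


# Hypothetical activity: two overlapping kernels, then two more after idle time.
demo = [(0.0, 10.0), (5.0, 12.0), (20.0, 25.0), (26.0, 30.0)]
gaps = idle_gaps(demo)
```

Tracking busy_until handles overlapping kernels on different streams, so time covered by any kernel is never counted as idle.
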
Get started.

gdbg ships with dbg. One install gives you both.

1. Install

cargo install dbg-cli

2. Check dependencies

gdbg check

Verifies all required tools are available. Tells you exactly what's missing and how to install it.

3. Profile

gdbg train.py

Auto-detects the target type, collects data, and drops you into the REPL.

How it works.

Three independent phases.
Timeline, hardware metrics, and op mapping collect separately. A failure in one doesn't block the others. You always get the data that's available.

One session, every layer.
Timeline, hardware counters, and op mapping merge into a single queryable session. No jumping between tools or parsing different output formats.

Top-5 targeting.
Hardware metrics collection is expensive. gdbg runs it only on the five hottest kernels from the timeline rather than on every kernel, which keeps it fast enough to use interactively.

Cross-layer correlation.
Op mapping connects Python operators to GPU kernels. trace matmul shows every kernel that a matmul op launched.

Session diffing.
Save a baseline, optimize, re-profile, diff. The agent sees exactly what changed: faster kernels, fewer launches, better occupancy.
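
The save, optimize, re-profile, diff loop can be pictured as a comparison of per-kernel totals between two sessions. The sketch below assumes each session reduces to a dict of kernel name to total GPU time. That data shape, the function name, and the numbers are hypothetical and do not reflect gdbg's session format.

```python
def diff_sessions(baseline, current):
    """baseline/current: dict of kernel name -> total GPU time in us.

    Returns rows of (name, before_us, after_us, delta_us), biggest
    improvement (most negative delta) first. A kernel missing from one
    session counts as 0 there.
    """
    rows = []
    for name in sorted(set(baseline) | set(current)):
        before = baseline.get(name, 0.0)
        after = current.get(name, 0.0)
        rows.append((name, before, after, after - before))
    return sorted(rows, key=lambda r: r[3])


# Hypothetical sessions: a strided copy replaced by a coalesced one.
baseline = {"strided_copy": 800.0, "trunk_step": 500.0}
current = {"coalesced_copy": 100.0, "trunk_step": 500.0}
rows = diff_sessions(baseline, current)
```

Treating missing kernels as zero makes removed and newly introduced kernels show up in the same table as kernels that merely got faster or slower.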