SGLang Close to the Metal

We selected SGLang to analyze close to the metal. It sits at a compelling cross-section where compiled-output behavior matters: fused kernels, MoE routing, multiple config and backend paths, and architecture-specific execution patterns on GPUs.

What changes after lowering, and can those changes become reviewable?

We are not attempting to benchmark SGLang performance or rank backend and config options. Source review, profilers, and benchmarks already cover that surface. This analysis reveals the emitted structure of those choices and uncovers fundamental aspects of SGLang’s execution pathways.

The main case is SGLang PR #26588, which reverted two mathematically equivalent fused paths after a downstream Gemma4 GSM8K regression. The interesting question is not whether the source math was equivalent. It is what changed downstream after lowering.

Two related cases help illustrate this style of analysis: DeepGEMM backend paths checked against a compiled-output review policy, and a cross-architecture Ampere vs. Hopper execution comparison with a Triton-normalized backend.

Each case is meant to show what reading PTX and SASS reveals about SGLang’s execution patterns. Even small config changes in a project as broad as SGLang can produce materially different execution patterns, and we believe that information is useful to maintainers and users alike.

Methodology: For each case, we collected compiled artifacts across similar SGLang serving configs, statically read the emitted PTX and SASS, grouped kernels by family, and compared kernel presence, instruction deltas, and configured review policies across states. The analyzer does not execute kernels, run the model, collect profiler counters, or benchmark runtime. It produces a compiled-output diff: what kernel families appeared, disappeared, or changed, and what instruction-level execution structures moved.
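As a rough illustration of the static read described above, the sketch below counts PTX instructions by category from text alone, without executing anything. The category names and prefix rules are hypothetical stand-ins chosen for this example; they are not the analyzer's actual classification, and the PTX sample is hand-written.

```python
import re
from collections import Counter

# Hypothetical category rules, checked in order; the first matching prefix wins.
CATEGORY_RULES = [
    ("shared memory", ("ld.shared", "st.shared")),
    ("memory movement", ("ld.global", "st.global", "cp.async")),
    ("synchronization", ("bar.", "barrier.", "membar", "fence")),
    ("control flow", ("bra", "ret", "call", "exit")),
]

# An optional predicate guard such as "@%p1", then the dotted opcode.
INSTRUCTION = re.compile(r"^\s*(?:@!?%\w+\s+)?([a-z][\w.]*)")


def classify(opcode: str) -> str:
    for category, prefixes in CATEGORY_RULES:
        if opcode.startswith(prefixes):
            return category
    return "scalar compute"  # fallback bucket for everything else


def count_categories(ptx: str) -> Counter:
    counts = Counter()
    for line in ptx.splitlines():
        stripped = line.strip()
        # Skip blanks, comments, directives (.reg, .visible), braces, and labels.
        if not stripped or stripped.startswith(("//", ".", "{", "}")):
            continue
        if stripped.endswith(":"):
            continue
        match = INSTRUCTION.match(stripped)
        if match:
            counts[classify(match.group(1))] += 1
    return counts


SAMPLE_PTX = """
    .reg .b32 %r<4>;
    ld.global.b32 %r1, [%rd1];
    st.shared.b32 [%r2], %r1;
    bar.sync 0;
    ld.shared.b32 %r3, [%r2];
    cvt.rn.bf16.f32 %rs1, %f1;
    @%p1 bra $L__BB0_2;
$L__BB0_2:
    ret;
"""

counts = count_categories(SAMPLE_PTX)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Running the same counter over two builds and subtracting the results gives the kind of per-category delta reported in the tables that follow.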

PR #26588: equivalent math, different floating-point execution

PR #26588 comparison config

Shared SGLang settings used across baseline, pre-revert, and accepted PR states.

model-path: google/gemma-4-26B-A4B-it
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: triton
fp8-gemm-backend: triton
moe-runner-backend: triton
mem-fraction-static: 0.82

The interesting thing about PR #26588 is not that SGLang had a bug. It is that the reverted paths were mathematically equivalent at the source level, but not equivalent as floating-point execution after lowering. The two reverted commits compute the same math as the eager code paths they replaced, so there is no logic bug in either path. The divergence comes from floating-point evaluation order, in the PR’s own words:

d72d246a3: rmsnorm(x, fused_scale, eps) fast path in Gemma4Router.forward. The math is equivalent to (self.norm(x) with weight=1) * fused_scale, but BF16 accumulation order in the fused kernel diverges from the two-step path when fused_scale ≈ hidden_size**-0.5 ≈ 0.022.

03826cdd9: _gemma4_topk_softmax_scale_kernel doing topk + stable-softmax + per-expert scale in one pass. Equivalent to softmax(topk_logits) * scale[topk_ids] for topk ≤ 8, but its FP32-internal exp(top_logit - top1) / sum_top_exp ordering doesn’t match torch.nn.functional.softmax bit-for-bit.

That is the failure mode: fused Triton kernels do the same math as the eager versions they replaced, but in a different order. Floating-point arithmetic is not associative, and BF16 keeps only seven mantissa bits, so the bits diverge. In a router that scales by 0.022 and feeds expert selection, the bit drift propagates: enough tokens get routed to a different expert that GSM8K drops five points. Numerical execution changes silently while source-level math stays equivalent.
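A minimal sketch of how a two-step path and a fused path can disagree in bits, assuming a simplified software emulation of BF16 rounding. The `to_bf16` helper and the input value `1.0039` are hypothetical and chosen for illustration; this is not the SGLang kernel code, only the rounding effect it is exposed to.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to float32, then to BF16 with round-to-nearest-even.

    Simplified emulation: NaN and infinity handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even BF16 mantissa
    bits = (bits + bias) & 0xFFFF0000   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


normed = 1.0039  # hypothetical normalized activation, before any BF16 store
scale = 0.022    # roughly hidden_size**-0.5, the value quoted in the PR

# Two-step: store the normalized value in BF16, then scale and store again.
two_step = to_bf16(to_bf16(normed) * scale)

# Fused: scale at higher precision and store to BF16 once.
fused = to_bf16(normed * scale)

print(two_step, fused, two_step == fused)
```

Both results approximate the same real number, but they land on neighbouring BF16 values. A router comparing logits that close together can then pick a different expert.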

The signal that something was wrong was not a non-equivalence in the math, or a slower kernel in the profiler. It was a downstream accuracy regression, and that signal did not say where the regression lived.

Two surfaces for review: source and compiled output

GPU code has two review surfaces.

The first is the source. You wrote it; you can read it, change it, and test it against what you meant. SGLang already has mature review surfaces around source, config, tests, benchmarks, and profiler traces. Those are the surfaces maintainers normally use to understand a PR: what changed in the code, which backend path is selected, whether the benchmark moved, whether the profiler looks better, and whether the eval still passes.

The second surface is what comes out of the compiler: the PTX and SASS that hits the GPU. Register allocation, instruction selection, the shape of the reduction tree, and so forth all become concrete here. You do not make these decisions; the compiler does. Triton lowers your source to PTX, ptxas lowers PTX to SASS, and the binary that runs on the chip is the result.

The two surfaces are related but functionally independent for review. A clean algorithm can compile to something that loses tensor cores on an architecture port. A messy algorithm can compile cleanly. Source review and benchmarks only look at the first. The second traditionally goes unread. Profilers show symptoms of the compiled-output surface, but not the surface itself. We believe that reviewing PTX and SASS exposes this second layer of critical information. A compiled-output diff can answer questions such as:

Which kernel families appeared, disappeared, or changed?
Did a fused numerical path alter conversion, reduction, or rounding structure?
Did a backend switch emit the expected memory-movement path?
Did an architecture migration introduce the expected tensor-core and warpgroup instructions?
Should the change be PASS, REVIEW, or FAIL under the project’s policy?

PR #26588 shows this exact pattern. The source-level identities were intended to be equivalent, but the fused kernels emitted a different numerical execution path. The compiler surface is what needs investigation to find the source of the bug. A compiled-output diff can thus augment the eval process. It would give maintainers a smaller review artifact showing which fused paths changed before the eval result became the first useful signal.

What the compiled output shows

We ran a static read on three SGLang states: baseline (7ed53d15f, main pre-PR), pre-revert (256e1d6c6, all four commits live), and the accepted PR state (847ce14, BBuf/sglang) after removing precision-sensitive fused paths.

Same model, same dtype, same Triton backend, same MoE configs. Three caches, diffed. We simply read the emitted PTX/SASS from the compiled artifacts.

At the source level, the PR describes the router RMSNorm change as a fused rmsnorm(x, fused_scale, eps) fast path in Gemma4Router.forward. The important compiled-output question is whether that source-equivalent change leaves a different execution path after lowering.

The relevant conversion boundary is visible in PTX:

cvt.rn.bf16.f32   %rs1, %f1;

Round one FP32 value to BF16, nearest-even, seven mantissa bits. Sum a long FP32 chain into a BF16 store, do it again with the chain reordered, and you get two answers that are close in magnitude and different in bits.
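The reordering effect can be reproduced in a few lines, assuming a simplified software emulation of FP32 accumulation and a BF16 store. The helpers and the three input values are hypothetical, picked so the arithmetic can be checked by hand; real reduction trees are much longer.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16 with round-to-nearest-even (NaN and inf not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32_sum(values):
    """Accumulate left to right, rounding to float32 after every add."""
    total = 0.0
    for v in values:
        total = to_f32(total + v)
    return total


base = 1.0 + 2.0**-8  # exactly halfway between two neighbouring BF16 values
tiny = 2.0**-24       # half of one FP32 ulp at this magnitude

chain_a = f32_sum([base, tiny, tiny])  # each tiny term is rounded away
chain_b = f32_sum([tiny, tiny, base])  # the tiny terms combine first and survive

print(chain_a == chain_b)                 # the FP32 sums differ by one ulp
print(to_bf16(chain_a), to_bf16(chain_b)) # the BF16 stores differ by one BF16 step
```

The two FP32 sums differ by a single ulp, yet the BF16 store sends them to different values because one of them sits exactly on a rounding boundary.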

BBuf's PR outcome

State	Commit	Commits included	Result	GSM8K MTP topk=1	Avg accept length
Baseline	376635c1e	all reverts = main	pass	0.445	4.494
Accepted	27cb94c45	c1+c4 kept	pass	0.445	4.494
Pre Revert	10ab189e3	all 4 commits	fail	0.360	4.475

RMSNorm instruction mix

Category totals compare the RMSNorm kernel family across base, pre-revert, and accepted builds.

Category	Base	Pre Revert	Accepted	Delta (Accepted vs. Base)
Scalar compute	2,623	3,123	3,497	+874
Synchronization	1,036	1,144	1,144	+108
Memory movement	654	760	840	+186
Shared memory	416	464	464	+48
Wide shared memory	416	464	464	+48
Wide memory movement	216	248	324	+108
Control flow	6	28	28	+22

Pre Revert adds the by-head RMSNorm path. Accepted removes that path but keeps the larger standard RMSNorm footprint, so total RMSNorm instructions remain above base.

RMSNorm instruction averages

Each row scales independently to show how the per-kernel average moves across the three builds.

Category	Base	Pre Revert	Accepted	Delta (Accepted vs. Base)
Scalar compute	1,311.5	780.75	874.25	-437.25
Synchronization	518	286	286	-232
Memory movement	327	190	210	-117
Shared memory	208	116	116	-92
Wide shared memory	208	116	116	-92
Wide memory movement	108	62	81	-27
Control flow	3	7	7	+4

Accepted creates the four-kernel RMSNorm footprint, so per-kernel averages remain below base even as total RMSNorm instructions rise. The larger standard path is split across more specialized emitted kernels.
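The relationship between rising totals and falling averages is plain arithmetic. The sketch below recomputes the scalar-compute averages from the totals and RMSNorm kernel counts reported in this article; the dictionary names are only for illustration.

```python
# Scalar-compute totals for the RMSNorm family, as reported above.
totals = {"base": 2623, "pre_revert": 3123, "accepted": 3497}

# RMSNorm kernels per build: 2 standard; 2 standard + 2 by-head; 4 standard.
kernels = {"base": 2, "pre_revert": 4, "accepted": 4}

averages = {state: totals[state] / kernels[state] for state in totals}

for state, avg in averages.items():
    print(f"{state}: total={totals[state]} kernels={kernels[state]} avg={avg}")
```

The accepted build carries 874 more scalar instructions than base in total, but spreads them over twice as many kernels, so the per-kernel average drops.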

In compiled-output analysis, most of the fleet stays the same between the Pre-Revert and Accepted states. The pre-revert run has 59 observed kernels; the accepted run has 58. Across this diff, 56 kernels are compared, 54 are unchanged, 2 changed, 2 were added, and 3 were removed.

Compiled kernel comparison

Kernel counts across baseline, accepted, and pre-revert builds.

Kernel family	Baseline	Accepted	Pre Revert
Observed kernels	44	58	59
fused_moe_kernel	12	24	24
_gemma_qkv_rmsnorm_kernel	2	4	2

Accepted removes the by-head RMSNorm path and the fused top-k softmax-scale path, while keeping the broader MoE expansion.

That is the shape of the PR in compiled form. The accepted patch does not say “undo the optimization.” It says: keep the broader MoE and attention changes, remove the fused numerical paths that changed routing behavior.

The top-k site shows the cleanest version of this. BBuf’s PR says _gemma4_topk_softmax_scale_kernel fused top-k, stable softmax, and per-expert scale into one pass. In real arithmetic, that is the same as softmax. In floating-point execution, the order is different.

The compiled diff shows that fused path directly. In the pre-revert build, the fused top-k kernel exists. In the accepted build, it is gone. Triton’s IR treats reordering of the FP32 work as equivalent so long as it maintains real-arithmetic semantics, so its passes are free to schedule the work however the optimizer prefers. The schedule it picked is not the one PyTorch’s two-step path produces.

Removed fused top-k softmax-scale path

Signal	PTX/SASS instruction	Pre-revert	Accepted
Kernel family	_gemma4_topk_softmax_scale_kernel	1	0
BF16 conversion in this path	cvt.f32.bf16	present	removed
Exponentiation	ex2.approx.f32	present	removed
Reciprocal path	MUFU.RCP	present	removed

Same softmax in real arithmetic. Different compiled numerical path. The kernel count and instructions all move. The algorithm surface says “equivalent.” The compiled surface says “different execution.”
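One way a normalization step can change bits without changing the math is dividing versus multiplying by a reciprocal. The sketch below shows this in Python's double precision with a hand-picked value. It is an illustration of the general effect only: it does not model MUFU.RCP, which is an approximate hardware instruction, and it makes no claim about which form either SGLang path uses.

```python
numerator = 49.0
denominator = 49.0

# One correctly rounded division.
divided = numerator / denominator

# The reciprocal is rounded first, then the product is rounded again.
via_reciprocal = numerator * (1.0 / denominator)

print(divided, via_reciprocal, divided == via_reciprocal)
```

In real arithmetic both expressions equal 1. In floating point the second form rounds twice and lands one step below 1.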

The RMSNorm side has the same shape. The pre-revert build contains two _gemma_qkv_rmsnorm_by_head_kernel instances that disappear in the accepted build. The accepted build instead emits four _gemma_qkv_rmsnorm_kernel instances.

RMSNorm path migration

Baseline shape carried through the pre-revert and accepted PR outcomes.

State	Component	Kernels
Base + delta	base standard RMSNorm	2
Base + delta	PR delta path	2
Pre Revert	standard RMSNorm	2
Pre Revert	by-head RMSNorm	2
Accepted	standard RMSNorm	4

Base contributes two standard RMSNorm kernels. The PR adds two more RMSNorm-path kernels; Pre Revert materializes them as by-head kernels, while Accepted materializes them as standard kernels.
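The family-level comparison can be sketched as a dictionary diff. The function below is a hypothetical helper, not the analyzer's implementation; the inputs use the kernel counts reported in this article for the pre-revert and accepted builds, restricted to the four families discussed here.

```python
def diff_families(before: dict, after: dict) -> dict:
    """Compare kernel counts per family between two builds."""
    report = {"added": {}, "removed": {}, "changed": {}, "unchanged": []}
    for family in sorted(set(before) | set(after)):
        old, new = before.get(family, 0), after.get(family, 0)
        if old == new:
            report["unchanged"].append(family)
        elif old == 0:
            report["added"][family] = new
        elif new == 0:
            report["removed"][family] = old
        else:
            report["changed"][family] = (old, new)
    return report


pre_revert = {
    "fused_moe_kernel": 24,
    "_gemma_qkv_rmsnorm_kernel": 2,
    "_gemma_qkv_rmsnorm_by_head_kernel": 2,
    "_gemma4_topk_softmax_scale_kernel": 1,
}
accepted = {
    "fused_moe_kernel": 24,
    "_gemma_qkv_rmsnorm_kernel": 4,
}

report = diff_families(pre_revert, accepted)

# Kernel-level movement across these families.
kernels_removed = sum(report["removed"].values())
kernels_added = sum(new - old for old, new in report["changed"].values() if new > old)

print(report)
print("kernels removed:", kernels_removed, "kernels added:", kernels_added)
```

On these inputs the diff reports two removed families and one expanded family, which lines up with the three removed and two added kernels counted earlier.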

The compiled-output diff shows the precision-sensitive execution paths that changed.

SGLang had the right validation machinery: the kernel author understood the change, the profiler showed the fused path was faster, and the deterministic accuracy CI fired. But those tools expose different surfaces. Source review sees equivalent math. The unit tests pass because the values are close. The profiler is happy because the kernel is faster. The only signal pointing at the regression is the GSM8K score, and the score does not say where.

So BBuf bisected: four commits, multiple configurations, 200 GSM8K questions per config, on a TP=2 H200. The bisect was correct. It is also the work you do when nothing reads the compiler surface for you. The compiled-output diff above arrives at the same answer BBuf did without the bisect: it points at the execution paths that changed silently and provides a review artifact for the exact source. This diff does not replace validation, but it narrows the suspect set earlier by showing that the changed PR state introduced and removed specific fused numerical paths, with the relevant conversion and reduction instructions.

What this could look like in PR review

Compiled-output PR review artifact

A compact policy artifact can flag emitted kernel-family changes without requiring maintainers to inspect the full PTX/SASS diff.

version: v1
rendering:
  default: delta
rules:
- id: review-kernel-family-deltas
  effect: review
  match:
    family: "*"
  reason: New specialized kernels have been emitted, flagged for review

Review artifact decision flow

Reports are always available for review. Teams can build on this config with sharper flagging for problematic PTX and SASS changes.

Compiled output diff: PTX and SASS after lowering.

Kernel-family delta check: compare emitted kernel families.

Pass path: no new kernels created. The compiled surface stays within the known family set. The review artifact reports pass, and the report remains available after a pass.

Review path: new kernels created. Specialized emitted kernels appear in the diff. The review artifact reports required review, and the report flags the emitted kernel-family delta.

The value is not asking a maintainer to read every line of PTX or SASS. The value is producing a structured review artifact. A kernel owner can then decide whether the change should pass, require review, or be tied to an accuracy or profiler gate.
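The pass and review paths can be sketched as a small decision function. Everything here is a hypothetical illustration: the function name, the return shape, and the assumption that the baseline build lacks the two fused families are ours, not part of the policy format.

```python
from fnmatch import fnmatchcase


def decide(known_families, emitted_families, pattern="*"):
    """Return REVIEW when a matching family appears that was not known before."""
    new = sorted(
        family
        for family in emitted_families
        if fnmatchcase(family, pattern) and family not in known_families
    )
    return {"decision": "REVIEW" if new else "PASS", "new_families": new}


# Assumed family sets, using kernel names that appear in this article.
baseline = {"fused_moe_kernel", "_gemma_qkv_rmsnorm_kernel"}
pre_revert = baseline | {
    "_gemma_qkv_rmsnorm_by_head_kernel",
    "_gemma4_topk_softmax_scale_kernel",
}
accepted = set(baseline)

pre_revert_result = decide(baseline, pre_revert)
accepted_result = decide(baseline, accepted)

print(pre_revert_result)
print(accepted_result)
```

A project could tighten the `pattern` argument or attach an accuracy gate to specific families; the wildcard here mirrors the `family: "*"` match in the policy above.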

DeepGEMM kernels as a compiled-output CI gate

PR #26588 shows numerical execution changing after lowering. DeepGEMM shows a second use case: turning backend expectations into a compiled-output review policy.

FlashInfer DeepGEMM config

FP8 path using flashinfer_deepgemm with deep_gemm MoE runner.

model-path: DeepSeek-V2-Lite-Chat-FP8
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: triton
fp8-gemm-backend: flashinfer_deepgemm
moe-runner-backend: deep_gemm
mem-fraction-static: 0.82

SGLang internal DeepGEMM config

FP8 path using SGLang's deep_gemm backend and deep_gemm MoE runner.

model-path: DeepSeek-V2-Lite-Chat-FP8
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: fa3
fp8-gemm-backend: deep_gemm
moe-runner-backend: deep_gemm
mem-fraction-static: 0.82

TMA is specifically designed to offload bulk movement between global memory and shared memory on Hopper-class GPUs. If a backend path is configured with the expectation that tile movement should use TMA, then the compiled output should be able to confirm whether that expectation survived lowering.

That gives us a natural policy question:

If a backend path is configured with the expectation that tile movement should use TMA, does the emitted PTX/SASS actually match that expectation?

We ran the DeepGEMM comparison with a strict review policy:

If a GEMM path is expected to use TMA, flag repeated manual global-to-shared staging for review.

Compiled-output policy rule

Review when a DeepGEMM kernel emits matched PTX shared-store instructions.

version: v1
rules:
- id: review-gemm-when-tma-expected
  effect: review
  match:
    all:
      - kernel: deep*gemm*
      - operation: st.shared.*
        layer: ptx
        exclude:
          operation: st.shared.f32
  when:
    count:
      gte: 1
  reason: GEMM kernel used repeated manual global-to-shared copies. Check whether this tile movement should have lowered to TMA.

Under that policy, the rule fires on both configurations, the FlashInfer DeepGEMM path and the SGLang internal DeepGEMM path:

DeepGEMM policy review signals

The two configurations differ by FP8 GEMM backend while hitting the same PTX shared-store review rule.

Configuration	fp8-gemm-backend	moe-runner-backend	Policy signal	Impacted kernels	Matched instructions	Review threshold
FlashInfer DeepGEMM	flashinfer_deepgemm	deep_gemm	PTX st.shared.*	deepgemm_compute_src2dst_triton_kernel	8	>= 1
SGLang internal DeepGEMM	deep_gemm	deep_gemm	PTX st.shared.*	deepgemm_compute_src2dst_triton_kernel, deep_gemm::transpose_fp32	13	>= 1

Counts are matched PTX shared-store instructions. The policy threshold is >= 1, so both configurations require review under this rule.

The gate fires on DeepGEMM integration paths: flashinfer_deepgemm hits deepgemm_compute_src2dst_triton_kernel (8 instructions); deep_gemm hits the same kernel and also deep_gemm::transpose_fp32 (5 instructions). These are different kernels with the same manual shared-staging pattern, so the gate catches a class of behavior rather than one library’s choice.
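A minimal sketch of how such a rule could be evaluated against per-kernel PTX counts. The rule dictionary mirrors the fields of the policy above, but the evaluator itself is hypothetical, and it assumes the `count` threshold applies per kernel. The observed counts are the PTX figures reported for the two helper kernels.

```python
from fnmatch import fnmatchcase

RULE = {
    "kernel": "deep*gemm*",
    "operation": "st.shared.*",
    "exclude": "st.shared.f32",
    "gte": 1,
}

# PTX instruction counts per kernel, taken from the evidence in this article.
OBSERVED = {
    "deepgemm_compute_src2dst_triton_kernel": {
        "ld.global.b64": 6,
        "st.shared.b32": 4,
        "st.shared.b64": 4,
    },
    "deep_gemm::transpose_fp32": {"st.shared.u32": 5},
}


def matched_count(kernel, ptx_counts, rule):
    """Sum the instructions in one kernel that the rule selects."""
    if not fnmatchcase(kernel, rule["kernel"]):
        return 0
    return sum(
        n
        for op, n in ptx_counts.items()
        if fnmatchcase(op, rule["operation"]) and not fnmatchcase(op, rule["exclude"])
    )


def evaluate(observed, rule):
    hits = {k: matched_count(k, c, rule) for k, c in observed.items()}
    hits = {k: n for k, n in hits.items() if n >= rule["gte"]}
    return {"effect": "review" if hits else "pass", "hits": hits}


result = evaluate(OBSERVED, RULE)
print(result)
```

With both helper kernels present, the matched total is 13, the figure reported for the SGLang internal DeepGEMM configuration.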

The impacted helper kernel is concrete. In the normal FP8 path, it does not appear. In the DeepGEMM TMA-config run, it appears twice and brings memory movement, shared-memory activity, synchronization, and scalar compute with it.

src2dst helper instruction evidence

PTX and SASS instructions observed in deepgemm_compute_src2dst_triton_kernel.

Instruction	Count
PTX ld.global.v2.b64	1
PTX ld.global.b32	4
PTX ld.global.b64	6
PTX ld.shared.b32	4
PTX ld.shared.b64	4
PTX st.shared.b32	4
PTX st.shared.b64	4
SASS LDG.E	8
SASS LDG.E.128	1
SASS LDG.E.64	2
SASS LDS	4
SASS LDS.64	4
SASS STS	4
SASS STS.64	4
SASS STG.E	4

transpose helper instruction evidence

PTX and SASS instructions observed in deep_gemm::transpose_fp32.

Instruction	Count
PTX st.shared.u32	5
SASS LDS	6
SASS STS	40
SASS STG.E	6

Source and config review says ‘DeepGEMM path enabled.’ The compiled-output gate asks a narrower question: did the emitted kernels match the movement pattern this policy expected? In the captures above, the answer is not ‘the backend is wrong.’ The answer is ‘review required.’ A kernel owner can then accept the emitted path, tune the policy, or investigate whether the helper path should use a different movement structure.

Architecture changes move the compiled-output surface

The third case is an architecture migration: same SGLang serving path, Ampere → Hopper. This is where source/config equality can hide large binary-level movement.

Ampere to Hopper comparison config

Same SGLang serving path used to compare emitted instructions across GPU architectures.

model-path: DeepSeek-V2-Lite-Chat
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: triton
fp8-gemm-backend: triton
moe-runner-backend: triton
mem-fraction-static: 0.82

The kernel families are stable: 13 families compared, 13 kernels compared, 0 added, and 0 removed. But every compared kernel changes. Migrating across architectures moves the emitted execution structure.

Ampere to Hopper instruction movement

Same SGLang serving path, different GPU architecture: totals show which emitted instruction categories moved after lowering.

Category	Ampere	Hopper	Delta
Scalar compute	18,408	18,118	-290
Memory movement	3,399	3,645	+246
Shared memory	898	814	-84
HMMA tensor-core path	808	744	-64
Hopper WGMMA tensor compute	0	8	+8
Warpgroup / register control	0	14	+14
Synchronization	680	708	+28
Async movement	200	200	0

The fused_moe_kernel family shows the same thing. The family count stays fixed at 10 kernels on both architectures, but the emitted execution structure moves:

fused_moe_kernel family movement

On Hopper, fused_moe_kernel emits WGMMA tensor-core and warpgroup-control instructions, while its HMMA, memory, shared-memory, and scalar counts shift.

Category	Ampere	Hopper	Delta
Kernel count	10	10	0
HMMA tensor-core path	128	64	-64
Hopper WGMMA tensor compute	0	8	+8
Warpgroup / register control	0	14	+14
Memory movement	574	627	+53
Shared memory	218	170	-48
Scalar compute	4,515	4,479	-36
Async movement	192	192	0

Hopper-specific instruction evidence

Hopper-specific tensor-core and warpgroup instructions appear in the emitted PTX and SASS.

Signal	Layer	Observed count
wgmma.mma_async...	PTX	12
wgmma.commit_group.sync.aligned	PTX	4
wgmma.fence.sync.aligned	PTX	4
wgmma.wait_group.sync.aligned	PTX	8
HGMMA	SASS	8
WARPGROUP.ARRIVE	SASS	4
WARPGROUP.DEPBAR.LE	SASS	8

Source review says “same kernel family.” Config review says “same serving path, different architecture.” The compiled-output diff shows what actually changed: which tensor-core path appeared, which shared-memory pattern moved, and which synchronization/control structures changed.
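The architecture comparison reduces to a per-category subtraction plus a check for categories that exist on only one side. The sketch below uses the totals reported in this article; the helper names are hypothetical.

```python
AMPERE = {
    "Scalar compute": 18408,
    "Memory movement": 3399,
    "Shared memory": 898,
    "HMMA tensor-core path": 808,
    "Hopper WGMMA tensor compute": 0,
    "Warpgroup / register control": 0,
    "Synchronization": 680,
    "Async movement": 200,
}
HOPPER = {
    "Scalar compute": 18118,
    "Memory movement": 3645,
    "Shared memory": 814,
    "HMMA tensor-core path": 744,
    "Hopper WGMMA tensor compute": 8,
    "Warpgroup / register control": 14,
    "Synchronization": 708,
    "Async movement": 200,
}


def category_deltas(before, after):
    """Signed change per instruction category (after minus before)."""
    categories = sorted(set(before) | set(after))
    return {c: after.get(c, 0) - before.get(c, 0) for c in categories}


def only_in_target(before, after):
    """Categories emitted on the target architecture but absent on the source."""
    return sorted(c for c, n in after.items() if n > 0 and before.get(c, 0) == 0)


deltas = category_deltas(AMPERE, HOPPER)
hopper_only = only_in_target(AMPERE, HOPPER)

for category, delta in deltas.items():
    print(f"{category}: {delta:+d}")
print("Hopper-only categories:", hopper_only)
```

A migration policy could then assert that the Hopper-only list is non-empty for families expected to use warpgroup instructions, and flag families where it stays empty.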

What this means

The algorithm surface and the compiled-output surface diverge, and that divergence can be invisible to source review, benchmarks, and profilers. PR #26588 is the worked example where the divergence ships and gets caught downstream. DeepGEMM is the same shape made into a CI policy a maintainer can run on every commit. The Ampere-to-Hopper case is the same analysis inside a deployment migration.

The point of a compiled-output diff is that it produces a small, structured artifact a maintainer or an agent can read instead of raw SASS. In PR #26588, the artifact would have flagged two removed fused kernel families and one expanded kernel family, with the precision-sensitive instruction pattern listed inline, before the bisect. In a DeepGEMM CI run, the same artifact answers a configured policy question and gates accordingly. In an architecture migration, the artifact shows which kernel families moved to the new tensor and memory paths and which did not.

There are still directions to improve this analysis: better source-to-SASS attribution, richer policy presets, and integration with profiler traces. But the core artifact is useful: a compiled-output diff that makes execution-path changes visible before they only show up through a benchmark, profiler, or downstream eval.

At Gestell, we are excited for how compiled-output review may augment open source projects like SGLang.

Please reach out to hello@gestell.ai for the full reports of compiled-output analysis detailed above.