We selected SGLang to analyze close to the metal. It sits at a compelling cross-section where compiled-output behavior matters: fused kernels, MoE routing, multiple config and backend paths, and architecture-specific execution patterns on GPUs.
What changes after lowering, and can those changes become reviewable?
We are not attempting to benchmark SGLang performance or rank backend and config options. Source review, profilers, and benchmarks already cover that surface. This analysis reveals the emitted structure of those choices and uncovers fundamental aspects of SGLang’s execution pathways.
The main case is SGLang PR #26588, which reverted two mathematically equivalent fused paths after a downstream Gemma4 GSM8K regression. The interesting question is not whether the source math was equivalent. It is what changed downstream after lowering.
Lastly, a related case helps elucidate this style of analysis: cross-architecture Ampere vs. Hopper execution comparisons with a Triton-normalized backend.
Each case is meant to show what reading PTX and SASS reveals about SGLang’s execution patterns. Even small config changes in a project as broad as SGLang can produce materially different execution patterns, and we believe that information is useful to maintainers and users alike.
Methodology: For each case, we collected compiled artifacts across similar SGLang serving configs, statically read the emitted PTX and SASS, grouped kernels by family, and compared kernel presence, instruction deltas, and configured review policies across states. The analyzer does not execute kernels, run the model, collect profiler counters, or benchmark runtime. It produces a compiled-output diff: what kernel families appeared, disappeared, or changed, and what instruction-level execution structures moved.
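The "instruction deltas" step of the methodology can be illustrated with a minimal Python sketch. It is not the analyzer described here: the function names, the regular expression, and the sample PTX are all assumptions made for illustration, and real PTX has more syntax than this parser handles.

```python
import re
from collections import Counter

# Optional predicate guard (e.g. "@%p1" or "@!%p1"), then the opcode token.
_OPCODE = re.compile(r"^(?:@!?%\w+\s+)?([a-z][\w.:]*)")


def ptx_opcode_histogram(ptx_text: str) -> Counter:
    """Count PTX opcodes in a kernel body, skipping directives and labels."""
    counts = Counter()
    for raw in ptx_text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith(".") or line.endswith(":"):
            continue  # blank line, directive, or label
        match = _OPCODE.match(line)
        if match:  # braces and parentheses do not match and are skipped
            counts[match.group(1)] += 1
    return counts


def histogram_delta(before: Counter, after: Counter) -> dict:
    """Return {opcode: after - before} for opcodes whose count changed."""
    ops = set(before) | set(after)
    return {op: after[op] - before[op] for op in sorted(ops) if after[op] != before[op]}


# Hypothetical PTX fragments, written by hand for this example.
SAMPLE_BEFORE = """
    ld.global.b32 %r1, [%rd1];
    cvt.rn.bf16.f32 %rs1, %f1;
    @%p1 bra LBB0_2;
    st.shared.b32 [%r2], %r1;
LBB0_2:
    ret;
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace(
    "st.shared.b32 [%r2], %r1;", "st.global.b32 [%rd2], %r1;"
)

if __name__ == "__main__":
    before = ptx_opcode_histogram(SAMPLE_BEFORE)
    after = ptx_opcode_histogram(SAMPLE_AFTER)
    print(histogram_delta(before, after))
```

Counting opcodes per kernel and diffing the counts is the simplest form of a compiled-output diff; grouping kernels by family would sit on top of a histogram like this one.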
PR #26588: equivalent math, different floating-point execution
Shared SGLang settings used across baseline, pre-revert, and accepted PR states.
model-path: google/gemma-4-26B-A4B-it
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: triton
fp8-gemm-backend: triton
moe-runner-backend: triton
mem-fraction-static: 0.82
The interesting thing about PR #26588 is not that SGLang had a bug. The two reverted commits are mathematically equivalent to the eager code paths they replaced, but they are not equivalent as floating-point execution after lowering. The source math is not where the problem lives; the floating-point ordering changes during lowering. In the PR’s own words:
d72d246a3: rmsnorm(x, fused_scale, eps) fast path in Gemma4Router.forward. The math is equivalent to (self.norm(x) with weight=1) * fused_scale, but BF16 accumulation order in the fused kernel diverges from the two-step path when fused_scale ≈ hidden_size**-0.5 ≈ 0.022.
03826cdd9: _gemma4_topk_softmax_scale_kernel doing topk + stable-softmax + per-expert scale in one pass. Equivalent to softmax(topk_logits) * scale[topk_ids] for topk ≤ 8, but its FP32-internal exp(top_logit - top1) / sum_top_exp ordering doesn’t match torch.nn.functional.softmax bit-for-bit.
That is the failure mode: fused Triton kernels are doing the same math as the eager versions they replaced, but in a different order. Floating-point addition is not associative, and BF16 keeps only seven explicit mantissa bits, so the bits diverge. In a router that scales by 0.022 and feeds expert selection, the bit drift propagates: enough tokens get routed to a different expert that GSM8K drops five points. Numerical execution changes silently while source-level math stays equivalent.
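One way a fused path can diverge from a two-step path is by rounding once instead of twice. The sketch below emulates BF16 rounding in pure Python. It assumes the two-step path stores a BF16 value between the norm and the scale, which the source does not spell out. The input value and the scale (chosen to be exactly representable and close to 0.022) are hypothetical, not taken from the Gemma4 router.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16(x: float) -> float:
    """Round to BF16 with round-to-nearest-even (NaN/Inf handling omitted)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # bias that implements nearest-even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


SCALE = 0.0234375      # hypothetical scale: 1.5 * 2**-6, close to 0.022
NORMED = 1.0048828125  # hypothetical FP32 norm output: 1 + 2**-8 + 2**-10

# Two-step path (assumed): the norm result is stored as BF16, then scaled
# and stored as BF16 again, so the value is rounded twice.
two_step = bf16(f32(bf16(NORMED) * SCALE))

# Fused path (assumed): norm and scale stay in FP32, with a single rounding
# at the final BF16 store.
fused = bf16(f32(NORMED * SCALE))

if __name__ == "__main__":
    print(two_step, fused, two_step == fused)
```

For this input the two results land one BF16 step apart, even though both paths compute the same product in real arithmetic.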
The signal that something was wrong was not a non-equivalence in the math, or a slower kernel in the profiler. It was a downstream accuracy regression, and that signal did not say where the regression lived.
Two surfaces for review: source and compiled output
GPU code has two review surfaces.
The first is the source. You wrote it; you can read it, change it, and test it against what you meant. SGLang already has mature review surfaces around source, config, tests, benchmarks, and profiler traces. Those are the surfaces maintainers normally use to understand a PR: what changed in the code, which backend path is selected, whether the benchmark moved, whether the profiler looks better, and whether the eval still passes.
The second surface is what comes out of the compiler: the PTX and SASS that hits the GPU. Register allocation, instruction selection, the shape of the reduction tree, and so forth all become concrete here. You do not make these decisions; the compiler does. Triton lowers your source to PTX, ptxas lowers PTX to SASS, and the binary that runs on the chip is the result.
The two surfaces are related but functionally independent for review. A clean algorithm can compile to something that loses tensor cores on an architecture port. A messy algorithm can compile cleanly. Source review and benchmarks only look at the first. The second traditionally goes unread. Profilers show symptoms of the compiled-output surface, but not the surface itself. We believe that reviewing PTX and SASS exposes this second layer directly. A compiled-output diff can answer questions like:
- Which kernel families appeared, disappeared, or changed?
- Did a fused numerical path alter conversion, reduction, or rounding structure?
- Did a backend switch emit the expected memory-movement path?
- Did an architecture migration introduce the expected tensor-core and warpgroup instructions?
- Should the change be PASS, REVIEW, or FAIL under the project’s policy?
PR #26588 shows this exact pattern. The source-level identities were intended to be equivalent, but the fused kernels emitted a different numerical execution path. The compiler surface is what needs investigation to find the source of the bug. A compiled-output diff can thus augment the eval process. It would give maintainers a smaller review artifact showing which fused paths changed before the eval result became the first useful signal.
What the compiled output shows
We ran a static read on 3 SGLang states: baseline (7ed53d15f, main pre-PR), pre-revert (256e1d6c6, all four commits live), and the accepted PR state (847ce14, BBuf/sglang) after removing precision-sensitive fused paths.
Same model, same dtype, same Triton backend, same MoE configs. Three caches, diffed. We simply read the emitted PTX/SASS from the compiled artifacts.
At the source level, the PR describes the router RMSNorm change as a fused rmsnorm(x, fused_scale, eps) fast path in Gemma4Router.forward. The important compiled-output question is whether that source-equivalent change leaves a different execution path after lowering.
The relevant conversion boundary is visible in PTX:
cvt.rn.bf16.f32 %rs1, %f1;
Round one FP32 value to BF16, nearest-even, seven mantissa bits. Sum a long FP32 chain into a BF16 store, do it again with the chain reordered, and you get two answers that are close in magnitude and different in bits.
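The reordering effect can be reproduced without a GPU. This minimal sketch emulates FP32 accumulation followed by a round-to-nearest-even BF16 store, the step that `cvt.rn.bf16.f32` performs. The three input values are hand-picked assumptions chosen so the FP32 sum sits on a BF16 rounding boundary; they are not drawn from the SGLang kernels.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16(x: float) -> float:
    """Round to BF16 with round-to-nearest-even (NaN/Inf handling omitted)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def sum_fp32_store_bf16(values):
    """Accumulate left to right in FP32, then store the result as BF16."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)  # each partial sum is rounded to FP32
    return bf16(acc)        # final conversion, as in cvt.rn.bf16.f32


BIG = 1.0 + 2.0**-8  # exactly halfway between two adjacent BF16 values
TINY = 2.0**-24      # half an FP32 ulp at 1.0

large_first = sum_fp32_store_bf16([BIG, TINY, TINY])
small_first = sum_fp32_store_bf16([TINY, TINY, BIG])

if __name__ == "__main__":
    print(large_first, small_first)
```

Adding the large value first absorbs each tiny term, so the sum stays on the tie and rounds down to 1.0. Adding the tiny terms first lets them combine into one FP32 ulp, which pushes the sum past the tie and rounds up to 1.0078125.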
| GSM8K MTP topk=1 | avg accept length |
|---|---|
| 0.445 | 4.494 |
| 0.445 | 4.494 |
| 0.360 | 4.475 |
Category totals compare the RMSNorm kernel family across base, pre-revert, and accepted builds.
Pre Revert adds the by-head RMSNorm path. Accepted removes that path but keeps the larger standard RMSNorm footprint, so total RMSNorm instructions remain above base.
Each row scales independently to show how the per-kernel average moves across the three builds.
Accepted creates the four-kernel RMSNorm footprint, so per-kernel averages remain below base even as total RMSNorm instructions rise. The larger standard path is split across more specialized emitted kernels.
In compiled-output analysis, most of the fleet stays the same between the Pre-Revert and Accepted states. The pre-revert run has 59 observed kernels; the accepted run has 58. Across this diff, 56 kernels are compared, 54 are unchanged, 2 changed, 2 were added, and 3 were removed.
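The bookkeeping behind those counts is a set comparison over kernel names plus an equality check on each shared kernel. The sketch below is an assumption about how such a diff could be computed, not the analyzer's actual implementation. The kernel names and digests are synthetic placeholders sized to match the counts reported above.

```python
def diff_kernels(before: dict, after: dict) -> dict:
    """Compare two {kernel_name: digest} maps from two builds."""
    shared = before.keys() & after.keys()
    changed = sorted(k for k in shared if before[k] != after[k])
    return {
        "compared": len(shared),
        "unchanged": len(shared) - len(changed),
        "changed": changed,
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
    }


# Synthetic stand-ins: 56 shared kernels, 2 of which change between builds.
pre_revert = {f"kernel_{i:02d}": "digest-a" for i in range(56)}
accepted = dict(pre_revert)
accepted["kernel_00"] = "digest-b"
accepted["kernel_01"] = "digest-b"

# 3 kernels exist only in the pre-revert build, 2 only in the accepted build.
pre_revert.update({f"only_pre_revert_{i}": "digest-a" for i in range(3)})
accepted.update({f"only_accepted_{i}": "digest-a" for i in range(2)})

report = diff_kernels(pre_revert, accepted)

if __name__ == "__main__":
    print({k: (v if isinstance(v, int) else len(v)) for k, v in report.items()})
```

In a real run the digest would be something derived from the emitted code, such as a hash of the SASS or an opcode histogram, so that "changed" means the compiled output moved rather than the source.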
Kernel counts across baseline, accepted, and pre-revert builds.
That is the shape of the PR in compiled form. The accepted patch does not say “undo the optimization.” It says: keep the broader MoE and attention changes, remove the fused numerical paths that changed routing behavior.
The top-k site shows the cleanest version of this. BBuf’s PR says _gemma4_topk_softmax_scale_kernel fused top-k, stable softmax, and per-expert scale into one pass. In real arithmetic, that is the same as softmax. In floating-point execution, the order is different.
The compiled diff shows that fused path directly. In the pre-revert build, the fused top-k kernel exists. In the accepted build, it is gone. Triton’s IR treats reordering of the FP32 work as equivalent so long as it maintains real-arithmetic semantics, so its passes are free to schedule the work however the optimizer prefers. The schedule it picked is not the one PyTorch’s two-step path produces.
Same softmax in real arithmetic. Different compiled numerical path. The kernel count and the emitted instructions both move. The algorithm surface says “equivalent.” The compiled surface says “different execution.”
The RMSNorm side has the same shape. The pre-revert build contains two _gemma_qkv_rmsnorm_by_head_kernel instances that disappear in the accepted build. The accepted build instead emits four _gemma_qkv_rmsnorm_kernel instances.
Baseline shape carried through the pre-revert and accepted PR outcomes.
Base contributes two standard RMSNorm kernels. The PR adds two more RMSNorm-path kernels; Pre Revert materializes them as by-head kernels, while Accepted materializes them as standard kernels.
The compiled-output diff shows the precision-sensitive execution paths that changed.
SGLang had the right validation machinery: the kernel author understood the change, the profiler showed the fused path was faster, and the deterministic accuracy CI fired. But those tools expose different surfaces. Source review sees equivalent math. The unit tests pass because the values are close. The profiler is happy because the kernel is faster. The only signal pointing at the regression is the GSM8K score, and the score does not say where.
So BBuf bisected: four commits, multiple configurations, 200 GSM8K questions per config, on a TP=2 H200. The bisect was correct. It is also the work you do when nothing reads the compiler surface for you. The compiled-output diff above arrives at the same answer without the bisect: it points at the execution paths that changed silently and gives reviewers an artifact tied to the exact kernels. The diff does not replace validation. It narrows the suspect set earlier by showing which fused numerical paths the PR state introduced and removed, along with the relevant conversion and reduction instructions.
What this could look like in PR review
A compact policy artifact can flag emitted kernel-family changes without requiring maintainers to inspect the full PTX/SASS diff.
version: v1
rendering:
  default: delta
rules:
  - id: review-kernel-family-deltas
    effect: review
    match:
      family: "*"
    reason: New specialized kernels have been emitted, flagged for review
Reports are always available for review. Teams can build on this config with sharper flagging for problematic PTX and SASS changes.
The value is not asking a maintainer to read every line of PTX or SASS. The value is producing a structured review artifact. A kernel owner can then decide whether the change should pass, require review, or be tied to an accuracy or profiler gate.
DeepGEMM kernels as a compiled-output CI gate
PR #26588 shows numerical execution changing after lowering. DeepGEMM shows a second use case: turning backend expectations into a compiled-output review policy.
FP8 path using flashinfer_deepgemm with deep_gemm MoE runner.
model-path: DeepSeek-V2-Lite-Chat-FP8
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: triton
fp8-gemm-backend: flashinfer_deepgemm
moe-runner-backend: deep_gemm
mem-fraction-static: 0.82
FP8 path using SGLang's deep_gemm backend and deep_gemm MoE runner.
model-path: DeepSeek-V2-Lite-Chat-FP8
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: fa3
fp8-gemm-backend: deep_gemm
moe-runner-backend: deep_gemm
mem-fraction-static: 0.82
TMA is specifically designed to offload bulk movement between global memory and shared memory on Hopper-class GPUs. If a backend path is configured with the expectation that tile movement should use TMA, then the compiled output should be able to confirm whether that expectation survived lowering.
That gives us a natural policy question:
If a backend path is configured with the expectation that tile movement should use TMA, does the emitted PTX/SASS actually match that expectation?
We ran the DeepGEMM comparison with a strict review policy:
If a GEMM path is expected to use TMA, flag repeated manual global-to-shared staging for review.
Review when a DeepGEMM kernel emits matched PTX shared-store instructions.
version: v1
rules:
  - id: review-gemm-when-tma-expected
    effect: review
    match:
      all:
        - kernel: deep*gemm*
        - operation: st.shared.*
          layer: ptx
      exclude:
        operation: st.shared.f32
    when:
      count:
        gte: 1
    reason: GEMM kernel used repeated manual global-to-shared copies. Check whether this tile movement should have lowered to TMA.
Under that policy, this config fires on both the SGLang FP8 path and the DeepGEMM path:
The two configurations differ by FP8 GEMM backend while hitting the same PTX shared-store review rule.
| fp8-gemm-backend | moe-runner-backend | Policy signal | Impacted kernels (matched count) |
|---|---|---|---|
| flashinfer_deepgemm | deep_gemm | PTX st.shared.* | deepgemm_compute_src2dst_triton_kernel (8) |
| deep_gemm | deep_gemm | PTX st.shared.* | deepgemm_compute_src2dst_triton_kernel (8), deep_gemm::transpose_fp32 (5) |
Counts are matched PTX shared-store instructions. The policy threshold is >= 1, so both configurations require review under this rule.
The gate fires on both DeepGEMM integration paths: flashinfer_deepgemm hits deepgemm_compute_src2dst_triton_kernel (8 instructions), and deep_gemm hits the same kernel plus deep_gemm::transpose_fp32 (5 instructions). These are different kernels with the same manual shared-staging pattern, so the gate catches a class of behavior rather than a particular library’s choice.
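The matching logic behind a rule like this can be approximated in a few lines. The sketch below is an assumption about how such a rule could be evaluated, not the policy engine itself. The flattened `RULE` dictionary and the function names are hypothetical, and glob matching is done with Python's `fnmatch`. The per-kernel PTX counts are the ones listed in the instruction tables for the two impacted kernels.

```python
from fnmatch import fnmatchcase

# Hypothetical flattened form of the review-gemm-when-tma-expected rule.
RULE = {
    "kernel": "deep*gemm*",
    "operation": "st.shared.*",
    "exclude": "st.shared.f32",
    "gte": 1,
    "effect": "review",
}


def matched_count(kernel: str, ptx_counts: dict, rule: dict) -> int:
    """Sum the counts of PTX opcodes that the rule selects for this kernel."""
    if not fnmatchcase(kernel, rule["kernel"]):
        return 0
    return sum(
        n
        for op, n in ptx_counts.items()
        if fnmatchcase(op, rule["operation"]) and not fnmatchcase(op, rule["exclude"])
    )


def verdict(kernel: str, ptx_counts: dict, rule: dict) -> str:
    """Return the rule's effect if the threshold is met, otherwise 'pass'."""
    hit = matched_count(kernel, ptx_counts, rule) >= rule["gte"]
    return rule["effect"] if hit else "pass"


# PTX counts as reported for the two impacted kernels.
SRC2DST = {
    "ld.global.v2.b64": 1, "ld.global.b32": 4, "ld.global.b64": 6,
    "ld.shared.b32": 4, "ld.shared.b64": 4,
    "st.shared.b32": 4, "st.shared.b64": 4,
}
TRANSPOSE = {"st.shared.u32": 5}

if __name__ == "__main__":
    for name, counts in [
        ("deepgemm_compute_src2dst_triton_kernel", SRC2DST),
        ("deep_gemm::transpose_fp32", TRANSPOSE),
    ]:
        print(name, matched_count(name, counts, RULE), verdict(name, counts, RULE))
```

Because the rule matches on an instruction pattern instead of a kernel name, both helper kernels trip it, while a kernel outside the `deep*gemm*` glob passes regardless of its shared stores.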
The impacted helper kernel is concrete. In the normal FP8 path, it does not appear. In the DeepGEMM TMA-config run, it appears twice and brings memory movement, shared-memory activity, synchronization, and scalar compute with it.
PTX and SASS instructions observed in deepgemm_compute_src2dst_triton_kernel.
| Instruction | Count |
|---|---|
| PTX ld.global.v2.b64 | 1 |
| PTX ld.global.b32 | 4 |
| PTX ld.global.b64 | 6 |
| PTX ld.shared.b32 | 4 |
| PTX ld.shared.b64 | 4 |
| PTX st.shared.b32 | 4 |
| PTX st.shared.b64 | 4 |
| SASS LDG.E | 8 |
| SASS LDG.E.128 | 1 |
| SASS LDG.E.64 | 2 |
| SASS LDS | 4 |
| SASS LDS.64 | 4 |
| SASS STS | 4 |
| SASS STS.64 | 4 |
| SASS STG.E | 4 |
PTX and SASS instructions observed in deep_gemm::transpose_fp32.
| Instruction | Count |
|---|---|
| PTX st.shared.u32 | 5 |
| SASS LDS | 6 |
| SASS STS | 40 |
| SASS STG.E | 6 |
Source and config review says ‘DeepGEMM path enabled.’ The compiled-output gate asks a narrower question: did the emitted kernels match the movement pattern this policy expected? In the captures above, the answer is not ‘the backend is wrong.’ The answer is ‘review required.’ A kernel owner can then accept the emitted path, tune the policy, or investigate whether the helper path should use a different movement structure.
Architecture changes move the compiled-output surface
The third case is an architecture migration: same SGLang serving path, Ampere → Hopper. This is where source/config equality can hide large binary-level movement.
Same SGLang serving path used to compare emitted instructions across GPU architectures.
model-path: DeepSeek-V2-Lite-Chat
dtype: bfloat16
kv-cache-dtype: bf16
tensor-parallel-size: 1
expert-parallel-size: 1
attention-backend: triton
fp8-gemm-backend: triton
moe-runner-backend: triton
mem-fraction-static: 0.82
The kernel families are stable: 13 families compared, 13 kernels compared, 0 added, and 0 removed. But every compared kernel changes. Migrating across architectures moves the emitted execution structure.
Same SGLang serving path, different GPU architecture: totals show which emitted instruction categories moved after lowering.
The fused_moe_kernel family shows the same thing. The family count stays fixed at 10 kernels on both architectures, but the emitted execution structure moves:
fused_moe_kernel family movement
On Hopper, fused_moe_kernel now emits tensor-core, warpgroup, memory, and scalar instructions.
Hopper-specific tensor-core and warpgroup instructions appear in the emitted PTX and SASS.
Source review says “same kernel family.” Config review says “same serving path, different architecture.” The compiled-output diff shows what actually changed: which tensor-core path appeared, which shared-memory pattern moved, and which synchronization/control structures changed.
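Category totals like these can be built by bucketing PTX opcodes by prefix. The sketch below is an illustrative assumption, not the analyzer's classification. The prefix table is a simplification (for example, it files every `wgmma.*` opcode under "warpgroup"), and the two opcode histograms are invented for the example rather than measured from fused_moe_kernel.

```python
from collections import Counter

# Assumed prefix-to-category mapping; first matching entry wins.
CATEGORY_PREFIXES = [
    ("warpgroup", ("wgmma.",)),
    ("tensor-core", ("mma.", "wmma.")),
    ("memory", ("ld.", "st.", "cp.async", "ldmatrix.")),
]


def classify(opcode: str) -> str:
    """Map a PTX opcode to a coarse category; anything unmatched is 'scalar'."""
    for category, prefixes in CATEGORY_PREFIXES:
        if opcode.startswith(prefixes):
            return category
    return "scalar"


def category_totals(opcode_counts: dict) -> Counter:
    """Sum opcode counts per category."""
    totals = Counter()
    for opcode, count in opcode_counts.items():
        totals[classify(opcode)] += count
    return totals


# Invented histograms for one kernel compiled for two architectures.
AMPERE = {
    "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32": 16,
    "ld.shared.b32": 8,
    "add.f32": 4,
}
HOPPER = {
    "wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16": 4,
    "wgmma.fence.sync.aligned": 1,
    "ld.shared.b32": 2,
    "add.f32": 4,
}

if __name__ == "__main__":
    print(dict(category_totals(AMPERE)))
    print(dict(category_totals(HOPPER)))
```

Comparing the two totals for the same kernel family shows which categories appeared or shrank after the architecture change, which is the kind of movement the source and config views cannot show.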
What this means
The algorithm surface and the compiled-output surface diverge, and that divergence can be invisible to source review, benchmarks, and profilers. PR #26588 is the worked example where the divergence ships and gets caught downstream. DeepGEMM is the same shape made into a CI policy a maintainer can run on every commit. The Ampere-to-Hopper case is the same analysis inside a deployment migration.
The point of a compiled-output diff is that it produces a small, structured artifact a maintainer or an agent can read instead of raw SASS. In PR #26588, the artifact would have flagged two removed fused kernels and one expanded kernel, with the precision-sensitive instruction pattern listed inline, before the bisect. In a DeepGEMM CI run, the same artifact answers a configured policy question and gates accordingly. In an architecture migration, the artifact shows which kernel families moved to the new tensor and memory paths and which did not.
There are still directions to improve this analysis: better source-to-SASS attribution, richer policy presets, and integration with profiler traces. But the core artifact is useful: a compiled-output diff that makes execution-path changes visible before they only show up through a benchmark, profiler, or downstream eval.
At Gestell, we are excited about how compiled-output review may augment open source projects like SGLang.
Please reach out to hello@gestell.ai for the full reports of compiled-output analysis detailed above.