Introducing Structural Inference Optimization

Structure is performance.

Models have internal geometry. That geometry is full of friction. Dystrio is the first structural compiler for AI inference — we reshape models before they run, and emit standard artifacts that deploy in any stack.

Platform Agnostic · Portable Artifacts · No Runtime Changes
NVIDIA Inception Program Member

Every efficient system in nature converges on the same geometry — the shape that moves the most through the least friction.

AI models have structure. That structure is inefficient relative to real workloads. The industry has optimized runtimes, kernels, and serving stacks — everything around the model. We optimize the model itself: its width, its topology, its internal allocation of compute.

We call this Structural Inference Optimization. It is a new layer of the inference stack — a compiler step between training and deployment that didn't exist before.

We don't change how models run. We change what they are.

Forge
Topology-Aware Expert Placement

MoE models pay for cross-GPU communication every time co-activating experts land on different ranks — even on NVLink. Forge observes routing patterns, builds a co-activation graph, and places experts where they belong. Same model. Same stack. Less friction.

Read-only observation. Output is a placement artifact you apply at deploy time. No runtime changes.
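
To make the idea concrete, here is a minimal sketch of topology-aware placement. This is not Dystrio's actual algorithm; the function names and the greedy heuristic are illustrative assumptions. It builds a co-activation graph from top-k routing traces, then greedily colocates the most frequently co-activated experts on the same rank under an equal-capacity constraint.

```python
# Illustrative sketch only, not Forge's implementation: build a co-activation
# graph from routing traces and greedily group co-activated experts per rank.
from collections import defaultdict
from itertools import combinations

def coactivation_graph(routing_traces):
    """routing_traces: list of per-token expert-id lists (top-k routing)."""
    graph = defaultdict(int)
    for experts in routing_traces:
        for a, b in combinations(sorted(set(experts)), 2):
            graph[(a, b)] += 1
    return graph

def greedy_placement(num_experts, num_ranks, graph):
    """Assign experts to ranks, preferring the rank that already holds a
    strong co-activation partner, subject to equal capacity per rank."""
    capacity = num_experts // num_ranks
    placement, load = {}, [0] * num_ranks
    # Visit expert pairs in order of decreasing co-activation count.
    for (a, b), _ in sorted(graph.items(), key=lambda kv: -kv[1]):
        for e, partner in ((a, b), (b, a)):
            if e in placement:
                continue
            # Prefer the partner's rank if it has room, else the least-loaded rank.
            if partner in placement and load[placement[partner]] < capacity:
                rank = placement[partner]
            else:
                rank = min(range(num_ranks), key=lambda r: load[r])
            placement[e] = rank
            load[rank] += 1
    # Experts never observed co-activating go to the least-loaded rank.
    for e in range(num_experts):
        if e not in placement:
            rank = min(range(num_ranks), key=lambda r: load[r])
            placement[e] = rank
            load[rank] += 1
    return placement

traces = [[0, 1], [0, 1], [2, 3], [0, 2]]
placement = greedy_placement(4, 2, coactivation_graph(traces))
```

In this toy trace, experts 0 and 1 co-activate most often and end up on the same rank, as do 2 and 3, so their token traffic stays on-rank instead of crossing the interconnect.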

Single node · 4× A100 NVLink
Even on NVLink, structure matters.
−30% P95/P99 tail latency · 17× throughput stability · +2.7% throughput under skew
Multi-node · 8× A100 · 2 nodes
Across nodes, it's critical.
−86% throughput variance · −4.1% P95 tail latency · +3.5% throughput
Validated on A100-SXM4 and H100 multi-node clusters (vLLM 0.15.0, allenai/OLMoE-1B-7B). Forge includes a prescriptive decision gate: it quantifies the expected gain and recommends whether to apply the placement for your workload.
Sculpt
Structural Inference Recompilation

Models allocate uniform width across every layer regardless of actual activation demand. Sculpt measures that demand, physically rewrites MLP dimensions, stabilizes the result, and emits a standard dense model. Not masking. Not sparse pruning. Tensor shape recompilation.

Hugging Face compatible. No custom kernels. No sparse runtime. Loads in vLLM, SGLang, Transformers.
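
A minimal sketch of what width recompilation means in practice. This is not Sculpt's pipeline (Sculpt adds staged stabilization and repair); it only illustrates the core move: rank MLP intermediate channels by observed activation demand, then physically slice the weight tensors so the output is a smaller dense model rather than a masked or sparse one.

```python
# Illustrative sketch only, not Sculpt's implementation: slice an MLP's
# intermediate dimension down to its highest-demand channels, emitting
# smaller *dense* tensors (no masks, no sparse kernels).
import numpy as np

def recompile_mlp_width(w_in, w_out, calib_x, keep_ratio=0.75):
    """w_in: (hidden, inter), w_out: (inter, hidden),
    calib_x: (tokens, hidden) calibration activations."""
    acts = np.maximum(calib_x @ w_in, 0.0)      # intermediate activations (ReLU)
    demand = np.abs(acts).mean(axis=0)          # per-channel activation demand
    k = int(w_in.shape[1] * keep_ratio)
    keep = np.sort(np.argsort(demand)[-k:])     # highest-demand channel indices
    # Physically rewrite the tensor shapes: the result is a narrower dense MLP.
    return w_in[:, keep], w_out[keep, :]

rng = np.random.default_rng(0)
w_in = rng.standard_normal((16, 8))
w_out = rng.standard_normal((8, 16))
calib = rng.standard_normal((32, 16))
new_in, new_out = recompile_mlp_width(w_in, w_out, calib, keep_ratio=0.5)
# new_in.shape == (16, 4); new_out.shape == (4, 16)
```

Because the output tensors simply have new shapes, the recompiled model loads anywhere a dense checkpoint loads, which is why no sparse runtime or custom kernels are needed downstream.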

Mistral-7B · 32 layers · Safe tier
Structured tradeoff. Stable staged repair. No divergence.
+22% prefill throughput · ~20% GPU-hour reduction · Dense, standard HF artifact
Mistral-7B · 32 layers · Aggressive tier
Upper bound of structural throughput gains under dense deployment.
+31% prefill throughput · ~30% GPU-hour reduction · Dense, standard HF artifact
Full-model recompilation. No runtime modifications. No sparse kernels. No serving stack changes. Quantization and structural recompilation are complementary — Sculpt composes with existing optimization pipelines.

Reshape the model. Not the stack.

Platform agnostic

The output is a model, not a runtime. Works wherever models work.

Portable artifacts

Standard model files and placement JSON. Zero pipeline modification.
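
For illustration, a placement artifact might look something like the JSON below. The source only says the artifact is a placement JSON applied at deploy time; every field name here is a hypothetical example, not Dystrio's actual schema.

```json
{
  "model": "allenai/OLMoE-1B-7B",
  "placement": {
    "layer_0": { "rank_0": [0, 1, 5, 9], "rank_1": [2, 3, 4, 6] }
  }
}
```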

Composable

Structural optimization before quantization, fine-tuning, and serving. Makes every downstream step more efficient.

Prescriptive

Every product quantifies expected gain and recommends whether to apply. When optimization won't help, Dystrio tells you.

PyTorch · vLLM · TensorRT-LLM · NCCL · Ray · Kubernetes · Quantization · LoRA · Distillation

Want to see Dystrio on your workload?

We're working with design partners running inference in production.