Introducing Structural Inference Optimization

Structure is performance.

Models have internal geometry. That geometry is full of friction. Dystrio is the first structural compiler for AI inference — we reshape models before they run, and emit standard artifacts that deploy in any stack.

Platform Agnostic · Portable Artifacts · No Runtime Changes
NVIDIA Inception Program Member

Every efficient system in nature converges on the same geometry — the shape that moves the most through the least friction.

AI models have structure. That structure is inefficient relative to real workloads. The industry has optimized runtimes, kernels, and serving stacks — everything around the model. We optimize the model itself: its width, its topology, its internal allocation of compute.

We call this Structural Inference Optimization. It is a new layer of the inference stack — a compiler step between training and deployment that didn't exist before.

We don't change how models run. We change what they are.

Forge
Topology-Aware Expert Placement

MoE models pay for cross-GPU communication every time co-activating experts land on different ranks — even on NVLink. Forge observes routing patterns, builds a co-activation graph, and places experts where they belong. Same model. Same stack. Less friction.

Read-only observation. Output is a placement artifact you apply at deploy time. No runtime changes.
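
To make the idea concrete, here is a minimal sketch of topology-aware placement. This is not Dystrio's actual algorithm; the function names and the greedy heuristic are illustrative assumptions. It builds a co-activation graph from top-k routing traces, then greedily colocates the most frequently co-activated experts on the same rank under an equal-capacity constraint.

```python
# Illustrative sketch only, not Forge's implementation: build a co-activation
# graph from routing traces and greedily group co-activated experts per rank.
from collections import defaultdict
from itertools import combinations

def coactivation_graph(routing_traces):
    """routing_traces: list of per-token expert-id lists (top-k routing)."""
    graph = defaultdict(int)
    for experts in routing_traces:
        for a, b in combinations(sorted(set(experts)), 2):
            graph[(a, b)] += 1
    return graph

def greedy_placement(num_experts, num_ranks, graph):
    """Assign experts to ranks, preferring the rank that already holds a
    strong co-activation partner, subject to equal capacity per rank."""
    capacity = num_experts // num_ranks
    placement, load = {}, [0] * num_ranks
    # Visit expert pairs in order of decreasing co-activation count.
    for (a, b), _ in sorted(graph.items(), key=lambda kv: -kv[1]):
        for e, partner in ((a, b), (b, a)):
            if e in placement:
                continue
            # Prefer the partner's rank if it has room, else the least-loaded rank.
            if partner in placement and load[placement[partner]] < capacity:
                rank = placement[partner]
            else:
                rank = min(range(num_ranks), key=lambda r: load[r])
            placement[e] = rank
            load[rank] += 1
    # Experts never observed co-activating go to the least-loaded rank.
    for e in range(num_experts):
        if e not in placement:
            rank = min(range(num_ranks), key=lambda r: load[r])
            placement[e] = rank
            load[rank] += 1
    return placement

traces = [[0, 1], [0, 1], [2, 3], [0, 2]]
placement = greedy_placement(4, 2, coactivation_graph(traces))
```

In this toy trace, experts 0 and 1 co-activate most often and end up on the same rank, as do 2 and 3, so their token traffic stays on-rank instead of crossing the interconnect.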

Single node · 4× A100 NVLink
Even on NVLink, structure matters.
−30% P95/P99 tail latency · 17× throughput stability · +2.7% throughput under skew
Multi-node · 8× A100 · 2 nodes
Across nodes, it's critical.
−86% throughput variance · −4.1% P95 tail latency · +3.5% throughput
Validated on A100-SXM4 and H100 multi-node clusters (vLLM 0.15.0, allenai/OLMoE-1B-7B). Forge includes a prescriptive decision gate: it quantifies the expected gain and recommends whether to apply the placement for your workload.
Sculpt
Structural Inference Recompilation

Models allocate uniform width across every layer regardless of actual activation demand. Sculpt measures that demand, physically rewrites MLP dimensions, stabilizes the result, and emits a standard dense model. Not masking. Not sparse pruning. Tensor shape recompilation.

Hugging Face compatible. No custom kernels. No sparse runtime. Loads in vLLM, SGLang, Transformers.
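
A minimal sketch of what width recompilation means in practice. This is not Sculpt's pipeline (Sculpt adds staged stabilization and repair); it only illustrates the core move: rank MLP intermediate channels by observed activation demand, then physically slice the weight tensors so the output is a smaller dense model rather than a masked or sparse one.

```python
# Illustrative sketch only, not Sculpt's implementation: slice an MLP's
# intermediate dimension down to its highest-demand channels, emitting
# smaller *dense* tensors (no masks, no sparse kernels).
import numpy as np

def recompile_mlp_width(w_in, w_out, calib_x, keep_ratio=0.75):
    """w_in: (hidden, inter), w_out: (inter, hidden),
    calib_x: (tokens, hidden) calibration activations."""
    acts = np.maximum(calib_x @ w_in, 0.0)      # intermediate activations (ReLU)
    demand = np.abs(acts).mean(axis=0)          # per-channel activation demand
    k = int(w_in.shape[1] * keep_ratio)
    keep = np.sort(np.argsort(demand)[-k:])     # highest-demand channel indices
    # Physically rewrite the tensor shapes: the result is a narrower dense MLP.
    return w_in[:, keep], w_out[keep, :]

rng = np.random.default_rng(0)
w_in = rng.standard_normal((16, 8))
w_out = rng.standard_normal((8, 16))
calib = rng.standard_normal((32, 16))
new_in, new_out = recompile_mlp_width(w_in, w_out, calib, keep_ratio=0.5)
# new_in.shape == (16, 4); new_out.shape == (4, 16)
```

Because the output tensors simply have new shapes, the recompiled model loads anywhere a dense checkpoint loads, which is why no sparse runtime or custom kernels are needed downstream.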

Mistral-7B · 32 layers · Safe tier
Structured tradeoff. Stable staged repair. No divergence.
+22% prefill throughput · ~20% GPU-hour reduction · Dense, standard HF artifact
Mistral-7B · 32 layers · Aggressive tier
Upper bound of structural throughput gains under dense deployment.
+31% prefill throughput · ~30% GPU-hour reduction · Dense, standard HF artifact
Full-model recompilation. No runtime modifications. No sparse kernels. No serving stack changes. Quantization and structural recompilation are complementary — Sculpt composes with existing optimization pipelines.

Reshape the model. Not the stack.

Platform agnostic

The output is a model, not a runtime. Works wherever models work.

Portable artifacts

Standard model files and placement JSON. Zero pipeline modification.
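
For illustration, a placement artifact might look something like the JSON below. The source only says the artifact is a placement JSON applied at deploy time; every field name here is a hypothetical example, not Dystrio's actual schema.

```json
{
  "model": "allenai/OLMoE-1B-7B",
  "placement": {
    "layer_0": { "rank_0": [0, 1, 5, 9], "rank_1": [2, 3, 4, 6] }
  }
}
```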

Composable

Structural optimization before quantization, fine-tuning, and serving. Makes every downstream step more efficient.

Prescriptive

Every product quantifies expected gain and recommends whether to apply. When optimization won't help, Dystrio tells you.

PyTorch · vLLM · TensorRT-LLM · NCCL · Ray · Kubernetes · Quantization · LoRA · Distillation

Want to see Dystrio on your workload?

We're working with design partners running inference in production.