About This Substack

Inference is the constraint that matters.

Whether you optimize for cost, latency, or throughput, the physics is the same. Memory bandwidth vs. compute. Batch size vs. tail latency. Hardware tradeoffs compound.
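
As a back-of-envelope illustration of that first tradeoff, here is a minimal Python sketch of the roofline argument for a single LLM decode step. Everything in it is an assumption for the sake of the example: H100-class peak numbers, a hypothetical 7B fp16 model, and KV-cache traffic ignored.

```python
# Illustrative roofline check: is one decode step memory- or
# compute-bound? All constants below are assumed, not measured.

PARAMS = 7e9          # assumed model size: 7B parameters
BYTES_PER_PARAM = 2   # fp16/bf16 weights
PEAK_FLOPS = 989e12   # assumed peak dense fp16, FLOP/s (H100-class)
PEAK_BW = 3.35e12     # assumed HBM bandwidth, bytes/s (H100-class)

def decode_bound(batch_size: int) -> str:
    """Classify one decode step as memory- or compute-bound.

    Each step streams the full weights once (params * bytes) and does
    roughly 2 * params FLOPs per sequence in the batch (one multiply
    and one add per weight). KV-cache reads are ignored to keep the
    sketch minimal.
    """
    flops = 2 * PARAMS * batch_size          # compute per step
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weight traffic per step
    intensity = flops / bytes_moved          # arithmetic intensity, FLOP/byte
    balance = PEAK_FLOPS / PEAK_BW           # machine balance, FLOP/byte
    return "compute-bound" if intensity > balance else "memory-bound"

for bs in (1, 8, 64, 512):
    print(f"batch={bs:4d}: {decode_bound(bs)}")
```

On these assumed numbers the crossover lands near batch ≈ 300: small-batch decode is pinned by memory bandwidth, and pushing the batch past the balance point buys compute utilization at the cost of tail latency.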

This Substack documents what I’m learning from real experiments: the physics of inference, hardware constraints, and how fundamentals shape architecture.

Built on production systems, real failures, real data.

Why Is Inference So Expensive?

Most teams inherit stacks optimized for one cloud, one GPU, one framework.

Then things change: a hardware shortage hits, costs spike, a new model arrives with different requirements.

They get stuck troubleshooting. You skip that.

What You Get Here

Every post is a real experiment on production-grade infrastructure. Not simulations. Not white papers.

• The exact hardware-agnostic configs (.yaml, .tf, .promql)
• Tradeoff analysis: gains, losses, costs
• Failure modes and how to avoid them
• Open-source: replicate on your own cluster

Who Should Subscribe

You own inference costs or latency metrics. You deploy to production. You’ve hit scaling limits and need to know whether the bottleneck is code, hardware, or architecture.

2-3 posts per week. Runnable tests. Patterns you can adapt. Failure paths to avoid.

No vendor drama. No marketing. Just the metal.

Join us. The inference tax is real. Let’s make it avoidable.

Subscribe to The Inference Lab

Benchmarking the universal laws of AI inference. Daily field notes on scaling language models across any hardware: T4s to H100s, AWS to on-prem.
