About This Substack
Inference is the constraint that matters.
Whether you optimize for cost, latency, or throughput, the physics is the same. Memory bandwidth vs. compute. Batch size vs. tail latency. Hardware tradeoffs compound.
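A back-of-the-envelope roofline sketch makes the bandwidth-vs-compute point concrete. The hardware numbers below are illustrative assumptions (roughly H100-class peaks), not measurements from these experiments, and KV-cache traffic is ignored for simplicity:

```python
# Minimal roofline sketch: is LLM decode memory-bound or compute-bound?
# Hardware numbers are assumed (roughly H100-class), purely for illustration.

PEAK_FLOPS = 989e12   # assumed peak BF16 throughput, FLOP/s
PEAK_BW = 3.35e12     # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2   # BF16 weights

def decode_intensity(batch_size: int) -> float:
    """Arithmetic intensity (FLOP/byte) of one decode step.

    Each token costs ~2 FLOPs per parameter (multiply + add), and the
    weights are streamed from HBM once per step regardless of batch size,
    so intensity grows linearly with batch. KV-cache traffic is ignored.
    """
    return (2 * batch_size) / BYTES_PER_PARAM

# Machine balance: the intensity needed to saturate compute instead of bandwidth.
machine_balance = PEAK_FLOPS / PEAK_BW  # ~295 FLOP/byte for the assumed chip

for batch in (1, 8, 64, 512):
    intensity = decode_intensity(batch)
    bound = "memory-bound" if intensity < machine_balance else "compute-bound"
    print(f"batch={batch:4d}  intensity={intensity:7.1f} FLOP/byte  -> {bound}")
```

At batch size 1, decode does two FLOPs per parameter while streaming every weight from memory, so it sits far below the machine balance and is starved for bandwidth. Batching amortizes the weight traffic, which is exactly the batch-size-vs-tail-latency tension above: bigger batches buy throughput at the cost of per-request latency.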
This Substack documents what I’m learning from real experiments: the physics of inference, hardware constraints, and how fundamentals shape architecture.
Built on production systems, real failures, real data.
Why Is Inference So Expensive?
Most teams inherit stacks optimized for one cloud, one GPU, one framework.
Then things change: a hardware shortage, a cost spike, a new model with different requirements.
They get stuck troubleshooting. You skip that.
What You Get Here
Every post is a real experiment on production-grade infrastructure. Not simulations. Not white papers.
• The exact hardware-agnostic configs (.yaml, .tf, .promql)
• Tradeoff analysis: gains, losses, costs
• Failure modes and how to avoid them
• Open-source: replicate on your own cluster
Who Should Subscribe
You own inference costs or latency metrics. You deploy to production. You’ve hit scaling limits and need to know whether the bottleneck is code, hardware, or architecture.
2-3 posts per week. Runnable tests. Patterns you can adapt. Failure paths to avoid.
No vendor drama. No marketing. Just the metal.
Join us. The inference tax is real. Let’s make it avoidable.
The Real-World PhD (Without the 4-Year Wait)

