We think our Control Layer (dwctl) is the fastest AI gateway around. We believe this because it's written in Rust[1], and because we thought about performance a lot while we were building it. We put it in production in our self-hosted inference stack, and we knew that it was fast because we didn't notice it. It's so good that we are open sourcing it. And once it's out there, it can be used in lots of different places, in lots of different ways. And so, to prove that it will be fast everywhere, we have to do benchmarking[2].
[2] The usual caveats about general-case benchmarks apply: the only realistic benchmarks are built by you, the user, since only you know what your application looks like. Every highly technical business for which performance is a proof point eventually releases a weary blog post explaining that performance is multifaceted and can't be captured by simple benchmarks. See here, here, here, and here for interesting content.
When technology infrastructure—such as GPUs and servers—is owned and managed by a central IT team, the need to allocate costs back to the business units that benefit from these resources becomes a critical consideration. This is particularly relevant in the context of self-hosting AI models, where the initial investment in high-performance GPUs, servers, and supporting infrastructure can be substantial. Without a clear chargeback mechanism, it becomes difficult to ensure accountability, optimize resource usage, and justify the ROI of such investments.
So, how do you design a chargeback system that is scalable, transparent, and easy to manage as your organization grows from supporting a handful of users to thousands of downstream business units? In this guide, we’ll explore how to architect and implement a chargeback system that not only integrates seamlessly with your existing AI infrastructure but also provides clear visibility into costs and benefits. By doing so, you can ensure that the value of your AI investments is both measurable and aligned with business goals.
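A minimal sketch of the core accounting step helps ground the idea: meter token usage per business unit and convert it into cost with an internal rate card. The record fields, model names, and rates below are illustrative assumptions, not part of any particular stack.

```python
from collections import defaultdict

# Illustrative internal rates per 1K tokens; in practice these would be derived
# from amortized GPU, power, and hosting costs for each self-hosted model.
RATES_PER_1K_TOKENS = {
    "llama-3-70b": 0.0009,
    "mistral-7b": 0.0002,
}

def chargeback(usage_records):
    """Aggregate token usage into a per-business-unit bill.

    Each record is assumed to look like:
    {"team": "marketing", "model": "llama-3-70b",
     "prompt_tokens": 1200, "completion_tokens": 300}
    """
    bill = defaultdict(float)
    for rec in usage_records:
        tokens = rec["prompt_tokens"] + rec["completion_tokens"]
        rate = RATES_PER_1K_TOKENS[rec["model"]]
        bill[rec["team"]] += tokens / 1000 * rate
    return dict(bill)

if __name__ == "__main__":
    records = [
        {"team": "marketing", "model": "llama-3-70b", "prompt_tokens": 1200, "completion_tokens": 300},
        {"team": "support", "model": "mistral-7b", "prompt_tokens": 800, "completion_tokens": 200},
    ]
    print(chargeback(records))  # e.g. {'marketing': 0.00135, 'support': 0.0002}
```

The hard part in production is not the arithmetic but attribution: every request passing through your gateway needs to carry a team or cost-centre identifier so usage can be metered at the source.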
Selecting the right AI model for deployment is a critical decision that can significantly impact the performance, cost, and user experience of your application. With a wide variety of models available—each with unique strengths and trade-offs—it’s essential to evaluate them carefully based on relevant criteria. In this post, we’ll explore the three key factors to consider when comparing models for deployment: quality, cost, and speed. Understanding how these factors interact and influence your application will help you make informed choices that align with your technical requirements and business goals.
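One simple way to make the quality/cost/speed trade-off concrete is a weighted score over normalised metrics. The candidate models, numbers, and weights below are purely illustrative; real comparisons should use measurements from your own workload.

```python
# Purely illustrative numbers: quality as a 0-1 benchmark-style score,
# cost in $ per 1M tokens, speed in output tokens/second.
candidates = {
    "model-a": {"quality": 0.82, "cost": 1.50, "speed": 95},
    "model-b": {"quality": 0.74, "cost": 0.40, "speed": 160},
}

# Weights reflect how much your application cares about each factor.
weights = {"quality": 0.5, "cost": 0.3, "speed": 0.2}

def normalise(values, higher_is_better=True):
    """Rescale a {model: metric} dict to 0-1 so factors are comparable."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {k: ((v - lo) / span if higher_is_better else (hi - v) / span)
            for k, v in values.items()}

quality = normalise({k: v["quality"] for k, v in candidates.items()})
cost = normalise({k: v["cost"] for k, v in candidates.items()}, higher_is_better=False)
speed = normalise({k: v["speed"] for k, v in candidates.items()})

scores = {k: weights["quality"] * quality[k]
             + weights["cost"] * cost[k]
             + weights["speed"] * speed[k]
          for k in candidates}
print(max(scores, key=scores.get), scores)
```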
Introduction: The Cost Challenge in LLM Workloads
Running LLMs at scale can be expensive. Whether you’re building customer-facing chatbots, document extraction pipelines, or research tools, token costs can quickly balloon into thousands of dollars. While infrastructure teams often focus on throughput optimizations (batching requests on the GPU, prefix caching, etc.), there’s another lever to pull: endpoint design. One of the most powerful - and under-discussed - endpoint types is the batched endpoint. Instead of prioritizing instant responses, batched endpoints trade latency for cost, cutting your LLM bill in half (or more in some cases).
In this blog, we’ll cover:
What batched endpoints are and how they differ from standard APIs
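To make the difference concrete before diving in, here is a sketch of the client-side pattern: a synchronous call returns immediately, while a batched submission is queued and polled later in exchange for a lower per-token price. The /v1/batches endpoint, field names, and completion window here are hypothetical placeholders, not any specific provider's API.

```python
import time
import requests

BASE_URL = "https://llm.example.com"  # hypothetical gateway

# Synchronous endpoint: one request, a low-latency response, full price.
resp = requests.post(f"{BASE_URL}/v1/chat/completions", json={
    "model": "llama-3-70b",
    "messages": [{"role": "user", "content": "Summarise this contract..."}],
})
print(resp.json())

# Batched endpoint: submit many requests at once and poll for results later.
# Latency is traded for a lower per-token price, since the server can schedule
# this work into otherwise idle GPU capacity.
batch = requests.post(f"{BASE_URL}/v1/batches", json={
    "requests": [
        {"model": "llama-3-70b",
         "messages": [{"role": "user", "content": f"Summarise document {i}"}]}
        for i in range(1000)
    ],
    "completion_window": "24h",
}).json()

while True:
    status = requests.get(f"{BASE_URL}/v1/batches/{batch['id']}").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(60)
```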
Introduction: The Hidden Challenge in LLM Selection
Choosing the right LLM for your workload isn’t just about picking the latest open-source release or switching to a cheaper closed model. If you’re self-hosting language models - whether for RAG pipelines, agents, or fine-tuned data tasks - knowing how good a model is (and compared to what) is a critical part of the decision.
Most teams rely on academic benchmarks like MMLU, ARC, or HumanEval. But these don’t always reflect real-world usage. Benchmark scores may go up while actual task performance stays flat.
The only way to evaluate models with complete confidence would be to build an in-house evaluation pipeline tailored to your exact use case. That means defining your task - whether it's data extraction, question answering, or multi-step reasoning - then collecting example documents, crafting queries, running each model in a controlled environment, and comparing results against a gold standard set you’ve manually verified.
This lets you directly compare open and closed-source models on your terms. But there's a catch: it’s incredibly time-consuming, complex, and expensive to do well.
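For illustration, a stripped-down version of such a pipeline might look like the sketch below. It assumes OpenAI-compatible endpoints for each model, a hand-verified gold set, and a deliberately naive contains-match scorer; a real pipeline would use task-specific metrics or an LLM judge, and the model names, URLs, and examples are placeholders.

```python
from openai import OpenAI

# Hand-verified gold standard for your own task (illustrative examples).
gold_set = [
    {"query": "What is the invoice total in doc_17.pdf?",
     "context": "...extracted text...", "answer": "$4,310.00"},
    {"query": "Who signed the 2023 services agreement?",
     "context": "...extracted text...", "answer": "Jane Doe"},
]

# Models to compare; base_url points at a self-hosted or hosted endpoint.
models = {
    "llama-3-70b": OpenAI(base_url="http://localhost:8000/v1", api_key="none"),
    "gpt-4o": OpenAI(),  # uses OPENAI_API_KEY from the environment
}

def evaluate(client, model_id):
    correct = 0
    for ex in gold_set:
        out = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"{ex['context']}\n\nQuestion: {ex['query']}"},
            ],
            temperature=0,
        ).choices[0].message.content
        # Naive scoring: does the gold answer appear verbatim in the output?
        correct += ex["answer"].lower() in out.lower()
    return correct / len(gold_set)

for name, client in models.items():
    print(name, evaluate(client, name))
```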
Introduction: The Hidden Cost of Choosing the Wrong Inference Engine
Inference engines are the backbone of self-hosted LLM stacks. They’re responsible for turning model weights into real-time, token-by-token output.
But here's the trap: most people choose one based on benchmark scores - and completely miss the bigger picture.
In reality, the best inference engine for your deployment depends on who’s using it, where it’s running, and how often it’s being called. That means the trade-offs between engines like Llama.cpp and vLLM go far beyond just speed. While the Doubleword Stack supports all major inference engines, selecting the best one still depends on your specific workload characteristics.
In this guide, we break down:
The two major deployment patterns for LLM inference
Introduction: Quantization Isn’t Just About Memory - It’s About Making LLMs Practical
Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.
At its core, quantization is the process of reducing the precision of numerical values in a model - from 16-bit floats to 8-bit, 4-bit, or even lower. This seemingly simple change has huge implications: lower memory usage, faster inference, and reduced costs.
It typically applies to two things:
Weights: the learned, static parameters of the model
Activations: the dynamic, intermediate values produced at each layer as the model processes input
Activations vary with every inference and can consume significant memory - especially for long prompts - while weights remain fixed. Compressing either (or both) can bring efficiency gains, but with different trade-offs.
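As a concrete, deliberately simplified illustration of the core idea, the sketch below applies symmetric per-tensor int8 quantization to a weight matrix with NumPy. Production methods (GPTQ, AWQ, FP8, and so on) are more sophisticated, but the principle of mapping floats onto a small integer grid plus a scale factor is the same.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0                       # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)       # a stand-in weight matrix
q, scale = quantize_int8(w)

print(f"fp32 size: {w.nbytes / 1e6:.1f} MB, int8 size: {q.nbytes / 1e6:.1f} MB")
print(f"max abs error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```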
And here’s the catch: not all quantization methods benefit all workloads equally. Choosing between weight-only quantization and full weight+activation quantization isn’t just a technical decision - it’s a strategic one that depends on your model architecture, input/output patterns, and the hardware you’re running on.
This blog walks through how to choose the right quantization strategy for your specific use case - so you can cut costs and improve performance without falling into common traps.
AI agents are transforming everything from customer support to autonomous workflows. But under the hood, most AI agent architectures suffer from one major problem: growing latency and cost at scale.
Each reasoning step adds more tokens to the input, and because most systems (especially API-based or naive self-hosted setups) resend the entire prompt history on every call, AI agents end up:
Repeating compute from earlier steps
Wasting GPU cycles
Scaling inference cost and latency quadratically
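A quick back-of-the-envelope calculation shows where that quadratic growth comes from; the step counts and token sizes below are purely illustrative.

```python
# Each agent step appends roughly `step_tokens` new tokens (thoughts, tool
# results) and, without caching, resends the entire history as the prompt.
system_tokens = 500
step_tokens = 400
steps = 20

total_prompt_tokens = 0
history = system_tokens
for _ in range(steps):
    total_prompt_tokens += history   # the whole history is re-processed...
    history += step_tokens           # ...and then grows before the next step

print(total_prompt_tokens)  # 86,000 prompt tokens for 20 steps
# Doubling the steps to 40 roughly quadruples this total rather than doubling it.
```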
Even modern caching APIs fall short - they don’t cache intermediate thoughts, tool results, or agent memory effectively.
Prefix caching is a feature available in advanced self-hosted AI inference engines like vLLM, SGLang, and TGI. It allows your AI agents to reuse previously computed context efficiently, cutting down latency and cost - without changing the logic of your agent.
In this post, you’ll learn:
Why traditional AI agent chains are inefficient
How prefix caching works inside LLM inference
When and how to deploy it
What infrastructure patterns support it best
If you're running multi-step AI agents, this is a foundational optimization strategy.
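As a small example of what enabling it can look like, here is a sketch using vLLM's offline LLM API with its enable_prefix_caching option (in recent vLLM versions prefix caching is on by default; check the docs for your version). The model name and agent transcript are illustrative.

```python
from vllm import LLM, SamplingParams

# Hypothetical agent transcript: every call shares the same long prefix
# (system prompt + earlier steps) and only adds a short new suffix.
shared_prefix = (
    "You are a research agent.\n"
    "...long tool outputs and reasoning from earlier steps...\n"
)
prompts = [
    shared_prefix + "Step 3: decide which tool to call next.",
    shared_prefix + "Step 4: summarise the findings so far.",
]

# enable_prefix_caching lets the engine reuse the KV cache computed for the
# shared prefix instead of re-running attention over it on every call.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)
outputs = llm.generate(prompts, SamplingParams(max_tokens=128, temperature=0))

for out in outputs:
    print(out.outputs[0].text)
```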
Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:
Tables with nested headers, merged cells, or embedded footnotes
Charts and images that convey critical insights
Layout-heavy formats like invoices, reports, or scanned documents
When such content passes through standard RAG pipelines, the results are often poor - irrelevant retrieval and hallucinated outputs during generation.
This post explores practical strategies to enable accurate retrieval and grounded generation from messy, multimodal documents. We focus on two key stages:
Retrieval – How to index and surface relevant content that isn’t just plain text
Generation – How to present structured or visual content to an LLM for high-quality answers
We’ll cover proven architectures, model recommendations, and implementation details used in real-world production systems.
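As one small example on the generation side, a common pattern is to preserve extracted tables as markdown rather than flattening them into prose before handing them to the model. The sketch below uses pdfplumber as one possible extraction library; the file name and chunking scheme are illustrative.

```python
import pdfplumber

def table_to_markdown(rows):
    """Render an extracted table as markdown so an LLM can see its structure."""
    header, *body = rows
    lines = ["| " + " | ".join(str(c or "") for c in header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(c or "") for c in row) + " |" for row in body]
    return "\n".join(lines)

def extract_chunks(pdf_path):
    """Produce chunks where tables survive as markdown instead of flattened text."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if text.strip():
                chunks.append(text)
            for table in page.extract_tables():
                chunks.append(table_to_markdown(table))
    return chunks

# These chunks can then be embedded for retrieval and passed verbatim to the
# LLM at generation time, so row/column relationships aren't lost.
print("\n\n".join(extract_chunks("invoice.pdf")[:3]))
```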
If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.
This blog explains:
Why traditional load balancing breaks for language models
Why standard metrics like GPU utilization are misleading
How KV cache utilization provides a better signal
How to go further with prefix-aware routing to reduce latency
What’s required to implement an LLM-aware balancer in practice
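To make the KV cache idea from the list above concrete, here is a sketch of a routing decision that prefers a replica likely to already hold the request's prefix and otherwise picks the replica with the most KV cache headroom. The metric names, the 0.9 pressure threshold, and the prefix-hashing scheme are assumptions for illustration, not any specific product's behaviour.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Replica:
    url: str
    kv_cache_utilization: float          # 0.0-1.0, scraped from the engine's metrics
    cached_prefixes: set = field(default_factory=set)  # hashes of recently served prefixes

def prefix_hash(prompt: str, block_chars: int = 2048) -> str:
    """Hash the leading block of the prompt as a cheap proxy for a shared prefix."""
    return hashlib.sha256(prompt[:block_chars].encode()).hexdigest()

def pick_replica(prompt: str, replicas: list[Replica]) -> Replica:
    h = prefix_hash(prompt)
    # 1. Prefer replicas that likely still hold this prefix in their KV cache,
    #    as long as they aren't under enough memory pressure to have evicted it.
    warm = [r for r in replicas if h in r.cached_prefixes and r.kv_cache_utilization < 0.9]
    if warm:
        return min(warm, key=lambda r: r.kv_cache_utilization)
    # 2. Otherwise, route by KV cache headroom rather than raw GPU utilization.
    chosen = min(replicas, key=lambda r: r.kv_cache_utilization)
    chosen.cached_prefixes.add(h)
    return chosen

replicas = [
    Replica("http://node-1:8000", 0.35),
    Replica("http://node-2:8000", 0.70),
]
print(pick_replica("You are a support agent...\nUser: my invoice is wrong", replicas).url)
```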