
Understanding Chargeback in the Context of Self-Hosted Systems

· 7 min read
Amanda Milberg
Principal Solutions Engineer, Doubleword

Introduction

When technology infrastructure—such as GPUs and servers—is owned and managed by a central IT team, the need to allocate costs back to the business units that benefit from these resources becomes a critical consideration. This is particularly relevant in the context of self-hosting AI models, where the initial investment in high-performance GPUs, servers, and supporting infrastructure can be substantial. Without a clear chargeback mechanism, it becomes difficult to ensure accountability, optimize resource usage, and justify the ROI of such investments.

So, how do you design a chargeback system that is scalable, transparent, and easy to manage as your organization grows from supporting a handful of users to thousands of downstream business units? In this guide, we’ll explore how to architect and implement a chargeback system that not only integrates seamlessly with your existing AI infrastructure but also provides clear visibility into costs and benefits. By doing so, you can ensure that the value of your AI investments is both measurable and aligned with business goals.
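To make this concrete, here is a minimal sketch in Python of the core chargeback calculation: aggregate metered token usage per business unit and price it against an internal rate card. The usage records, rate card, and field names are hypothetical placeholders - in practice they would come from your AI gateway logs and your finance team.

```python
# Minimal sketch of a usage-based chargeback calculation.
# The usage records, rate card, and field names are hypothetical;
# in practice they would come from your gateway or billing system.
from collections import defaultdict

# Example usage log: one record per request routed through the AI gateway.
usage_records = [
    {"business_unit": "marketing", "model": "llama-3-70b", "tokens": 12_000},
    {"business_unit": "marketing", "model": "llama-3-70b", "tokens": 8_500},
    {"business_unit": "finance",   "model": "llama-3-70b", "tokens": 30_000},
]

# Hypothetical internal rate card: cost recovered per 1,000 tokens, per model.
rate_per_1k_tokens = {"llama-3-70b": 0.004}

def chargeback_by_unit(records, rates):
    """Aggregate token usage per business unit and price it against the rate card."""
    totals = defaultdict(float)
    for r in records:
        totals[r["business_unit"]] += (r["tokens"] / 1000) * rates[r["model"]]
    return dict(totals)

print(chargeback_by_unit(usage_records, rate_per_1k_tokens))
# {'marketing': 0.082, 'finance': 0.12}
```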

Choosing the Right Model for the Use Case

· 6 min read
Amanda Milberg
Principal Solutions Engineer, Doubleword

Introduction

Selecting the right AI model for deployment is a critical decision that can significantly impact the performance, cost, and user experience of your application. With a wide variety of models available—each with unique strengths and trade-offs—it’s essential to evaluate them carefully based on relevant criteria. In this post, we’ll explore the three key factors to consider when comparing models for deployment: quality, cost, and speed. Understanding how these factors interact and influence your application will help you make informed choices that align with your technical requirements and business goals.
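One lightweight way to make those trade-offs explicit is a weighted scorecard. The sketch below is purely illustrative - the candidate models, normalized scores, and weights are placeholders you would replace with your own evaluation results and pricing.

```python
# Illustrative sketch of scoring candidate models on quality, cost, and speed.
# The candidate models, scores, and weights below are made-up placeholders;
# real values would come from your own evaluations and pricing.

candidates = {
    # name: normalized 0-1 scores, higher is better
    "model-a": {"quality": 0.90, "cost": 0.40, "speed": 0.55},
    "model-b": {"quality": 0.75, "cost": 0.85, "speed": 0.80},
}

# Weights encode what matters most for this particular application.
weights = {"quality": 0.5, "cost": 0.3, "speed": 0.2}

def weighted_score(scores, weights):
    return sum(scores[k] * weights[k] for k in weights)

best = max(candidates, key=lambda name: weighted_score(candidates[name], weights))
for name, scores in candidates.items():
    print(name, round(weighted_score(scores, weights), 3))
print("best fit:", best)
```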

Behind the Stack, Ep 10: Batched Endpoints

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Cost Challenge in LLM Workloads

Running LLMs at scale can be expensive. Whether you’re building customer-facing chatbots, document extraction pipelines, or research tools, token costs can quickly balloon into the thousands of dollars. While infrastructure teams often focus on throughput optimizations (batching requests on the GPU, prefix caching, etc.), there’s another lever to pull: endpoint design. One of the most powerful - and under-discussed - endpoint types is the batched endpoint. Instead of prioritizing instant responses, batched endpoints trade latency for cost, cutting your LLM bill in half (or more in some cases).

In this blog, we’ll cover:

  • What batched endpoints are and how they differ from standard APIs
  • How providers reduce costs behind the scenes
  • Advanced optimization strategies (spot instances, prefix caching, request reordering)
  • How to self-host your own batched endpoint
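As a concrete example, here is a minimal sketch of submitting a job to an OpenAI-style batch endpoint. The field names follow the OpenAI Batch API; other providers, or a self-hosted gateway, may expose a different interface, so treat this as an illustration of the workflow rather than a recipe.

```python
# Minimal sketch of submitting work to an OpenAI-style batch endpoint.
# Field names follow the OpenAI Batch API; other providers (or a self-hosted
# gateway) may differ, so treat this as an illustration of the workflow.
import json
from openai import OpenAI

client = OpenAI()

# 1. Write the requests to a JSONL file - one request per line.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarise document {i}"}],
        },
    }
    for i in range(100)
]
with open("requests.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in requests)

# 2. Upload the file and create the batch with a relaxed completion window.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # the latency you trade away for the discount
)

# 3. Poll later and download the output file once batch.status == "completed".
print(batch.id, batch.status)
```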

Behind the Stack, Ep 9: How to Evaluate Open Source LLMs

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Hidden Challenge in LLM Selection

Choosing the right LLM for your workload isn’t just about picking the latest open-source release or switching to a cheaper closed model. If you’re self-hosting language models - whether for RAG pipelines, agents, or fine-tuned data tasks - knowing how good a model is (and compared to what) is critical to making that decision well.

Most teams rely on academic benchmarks like MMLU, ARC, or HumanEval. But these don’t always reflect real-world usage. Benchmark scores may go up while actual task performance stays flat.

The only way to evaluate models with complete confidence would be to build an in-house evaluation pipeline tailored to your exact use case. That means defining your task - whether it's data extraction, question answering, or multi-step reasoning - then collecting example documents, crafting queries, running each model in a controlled environment, and comparing results against a gold standard set you’ve manually verified.

This lets you directly compare open and closed-source models on your terms. But there's a catch: it’s incredibly time-consuming, complex, and expensive to do well.
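To make the shape of such a pipeline concrete, here is a heavily stripped-down sketch. The task (extracting an invoice total), the gold answers, and the generate() helper are hypothetical stand-ins for your own documents, queries, and model endpoints.

```python
# A stripped-down sketch of the in-house evaluation loop described above.
# The task (invoice total extraction), the gold answers, and the generate()
# helper are all hypothetical stand-ins for your own pipeline.

gold_set = [
    {"document": "Invoice #1042 ... Total due: $1,250.00", "answer": "$1,250.00"},
    {"document": "Invoice #1043 ... Total due: $980.50",  "answer": "$980.50"},
]

def generate(model_name: str, prompt: str) -> str:
    """Placeholder for a call to whichever model you are evaluating
    (self-hosted endpoint, closed API, etc.)."""
    raise NotImplementedError

def evaluate(model_name: str) -> float:
    correct = 0
    for example in gold_set:
        prompt = f"Extract the total due from this invoice:\n{example['document']}"
        prediction = generate(model_name, prompt)
        correct += prediction.strip() == example["answer"]
    return correct / len(gold_set)

# Run the same controlled comparison for every candidate model.
for model in ["open-model-a", "closed-model-b"]:
    print(model, evaluate(model))
```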

Behind the Stack, Ep 8: Choosing the Right Inference Engine for Your LLM Deployment

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Hidden Cost of Choosing the Wrong Inference Engine

Inference engines are the backbone of self-hosted LLM stacks. They’re responsible for turning model weights into real-time, token-by-token output.

But here's the trap: most people choose one based on benchmark scores - and completely miss the bigger picture.

In reality, the best inference engine for your deployment depends on who’s using it, where it’s running, and how often it’s being called. That means the trade-offs between engines like Llama.cpp and vLLM go far beyond just speed. While the Doubleword Stack supports all major inference engines, selecting the best one still depends on your specific workload characteristics.

In this guide, we break down:

  • The two major deployment patterns for LLM inference
  • What each pattern demands from your engine
  • Which open-source projects are optimized for each
  • And how to choose the right engine for your stack

Behind the Stack, Ep 7: Choosing the Right Quantization for Self-Hosted LLMs

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: Quantization Isn’t Just About Memory - It’s About Making LLMs Practical

Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.

At its core, quantization is the process of reducing the precision of numerical values in a model - from 16-bit floats to 8-bit, 4-bit, or even lower. This seemingly simple change has huge implications: lower memory usage, faster inference, and reduced costs.

It typically applies to two things:

  • Weights: the learned, static parameters of the model
  • Activations: the dynamic, intermediate values produced at each layer as the model processes input

Activations vary with every inference and can consume significant memory - especially for long prompts - while weights remain fixed. Compressing either (or both) can bring efficiency gains, but with different trade-offs.
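As a toy illustration of what quantization does numerically, the sketch below maps a float tensor onto an 8-bit integer grid with a single scale factor and then dequantizes it. Real schemes (GPTQ, AWQ, FP8, and so on) are more sophisticated, but the core idea is the same.

```python
# Toy illustration of quantization: map float values onto an 8-bit integer
# grid with a scale factor, then dequantize. Real schemes are more
# sophisticated, but the underlying idea is the same.
import numpy as np

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0          # one symmetric scale per tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for a weight matrix
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

print("max error:", np.abs(weights - recovered).max())        # small, but not zero
print("memory: 4 bytes/value ->", q.itemsize, "byte/value")   # 4x smaller weights
```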

And here’s the catch: not all quantization methods benefit all workloads equally. Choosing between weight-only quantization and full weight+activation quantization isn’t just a technical decision - it’s a strategic one that depends on your model architecture, input/output patterns, and the hardware you’re running on.

This blog walks through how to choose the right quantization strategy for your specific use case - so you can cut costs and improve performance without falling into common traps.

Behind the Stack, Ep 6: How to Speed up the Inference of AI Agents

· 6 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Latency Problem in AI Agents

AI agents are transforming everything from customer support to autonomous workflows. But under the hood, most AI agent architectures suffer from one major problem: growing latency and cost at scale.

Each reasoning step adds more tokens to the input, and because most systems (especially API-based or naive self-hosted setups) resend the entire prompt history on every call, AI agents end up:

  • Repeating compute from earlier steps
  • Wasting GPU cycles
  • Scaling inference cost and latency quadratically

Even modern caching APIs fall short - they don’t cache intermediate thoughts, tool results, or agent memory effectively.

The Solution? Prefix Caching for AI Agents

Prefix caching is a feature available in advanced self-hosted AI inference engines like vLLM, SGLang, and TGI. It allows your AI agents to reuse previously computed context efficiently, cutting down latency and cost - without changing the logic of your agent.
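As a rough sketch, this is what turning on prefix caching looks like in vLLM. The enable_prefix_caching flag reflects recent vLLM versions (and may already default to on in yours); the model name and agent prompts below are placeholders.

```python
# A minimal sketch of enabling prefix caching in vLLM. The flag name reflects
# recent vLLM versions and may differ or default to on in yours; the model
# and prompts are placeholders. With it enabled, the shared system prompt and
# earlier agent steps are computed once and reused on later calls.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

system_and_history = "You are a research agent...\n[step 1 reasoning]\n[tool result]\n"

# Each new agent step appends to the same prefix; vLLM reuses the cached KV
# blocks for the shared prefix instead of recomputing them.
step_2 = llm.generate([system_and_history + "Step 2: decide the next action."], params)
step_3 = llm.generate([system_and_history + "[step 2 output]\nStep 3: finalise."], params)
```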

In this post, you’ll learn:

  • Why traditional AI agent chains are inefficient
  • How prefix caching works inside LLM inference
  • When and how to deploy it
  • What infrastructure patterns support it best

If you're running multi-step AI agents, this is a foundational optimization strategy.

Behind the Stack, Ep 5: Making RAG Work for Multimodal Documents

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:

  • Tables with nested headers, merged cells, or embedded footnotes
  • Charts and images that convey critical insights
  • Layout-heavy formats like invoices, reports, or scanned documents

When such content passes through standard RAG pipelines, the results are often poor - irrelevant retrieval and hallucinated outputs during generation.

This post explores practical strategies to enable accurate retrieval and grounded generation from messy, multimodal documents. We focus on two key stages:

  1. Retrieval – How to index and surface relevant content that isn’t just plain text
  2. Generation – How to present structured or visual content to an LLM for high-quality answers

We’ll cover proven architectures, model recommendations, and implementation details used in real-world production systems.
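As a small illustration of the retrieval side, one common pattern is to index a text rendering of each table or figure while keeping a pointer back to the original structure (or page image) to hand to the LLM at generation time. The helper and chunk format below are hypothetical - adapt them to your own parser and vector store.

```python
# Illustrative sketch of one common pattern: index a text rendering of each
# table for retrieval, but keep a pointer back to the original structure
# (or page image) to pass to the LLM at generation time. The helper name and
# chunk format are hypothetical.

def table_to_markdown(table: list[list[str]]) -> str:
    """Render a parsed table as markdown so it can be embedded and retrieved."""
    header, *rows = table
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

invoice_table = [
    ["Item", "Qty", "Unit price"],
    ["GPU server", "2", "$9,500"],
    ["Support contract", "1", "$1,200"],
]

chunk = {
    "text": table_to_markdown(invoice_table),   # what the retriever embeds and sees
    "source": {"page": 3, "kind": "table"},     # pointer back to the original content
}
print(chunk["text"])
```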

Behind the Stack, Ep 4: Making Your Load Balancer LLM-Aware

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword


Introduction

If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.

This blog explains:

  • Why traditional load balancing breaks for language models
  • Why standard metrics like GPU utilization are misleading
  • How KV cache utilization provides a better signal
  • How to go further with prefix-aware routing to reduce latency
  • What’s required to implement an LLM-aware balancer in practice
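To make the idea concrete, here is a toy sketch of the core routing decision in an LLM-aware balancer: send the next request to the replica with the most free KV cache rather than the fewest connections. The replica list and metric name are hypothetical; in practice you would scrape them from your inference engine's metrics endpoint.

```python
# Toy sketch of the core routing decision in an LLM-aware load balancer:
# pick the replica with the most KV cache headroom rather than the fewest
# connections. The replica metrics here are hypothetical; in practice they
# would be scraped from the inference engine's metrics endpoint.

replicas = [
    {"url": "http://llm-0:8000", "kv_cache_utilization": 0.82},
    {"url": "http://llm-1:8000", "kv_cache_utilization": 0.35},
    {"url": "http://llm-2:8000", "kv_cache_utilization": 0.61},
]

def pick_replica(replicas):
    """Route the next request to the replica with the lowest KV cache utilization."""
    return min(replicas, key=lambda r: r["kv_cache_utilization"])

print(pick_replica(replicas)["url"])   # http://llm-1:8000
```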

Behind the Stack, Ep 3: How to Serve 100 Models on a Single GPU with No Cold Starts

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. And before long, you’ve got dozens of fine-tuned variants - each trained with a LoRA or other parameter-efficient technique.

Training these models is relatively lightweight. Serving them efficiently is a much harder problem. In this video, I break down how to serve many LoRAs (or other PEFTs) on a single GPU, support dynamic load patterns, and avoid the high cost and latency of traditional serverless setups.

What Is a LoRA (and Why Use One)?

LoRA (Low-Rank Adaptation) is a popular form of parameter-efficient fine-tuning. Instead of updating full weight matrices, LoRA inserts small trainable adapters at key layers.

  • Only a small fraction of parameters are updated
  • Training uses much less memory
  • The resulting adapters are tiny (often <1% of the model size)

These benefits make LoRA a go-to method for use cases where you want to:

  • Customize a base model per task or domain
  • Run many fine-tunes without retraining or duplicating the base model
  • Stay compatible with quantized or frozen weights

At inference time, LoRA can either be merged into the model (for zero overhead), or kept separate to allow swapping between fine-tunes.
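For reference, this is roughly what attaching and merging a LoRA adapter looks like with Hugging Face PEFT. The base model, rank, target modules, and adapter repository are illustrative choices rather than recommendations.

```python
# A short sketch of attaching a LoRA adapter with Hugging Face PEFT. The base
# model, rank, target modules, and adapter repo are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Small low-rank adapters are injected into the attention projections only.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # typically well under 1% of total parameters

# At inference time you can merge a trained adapter into the base for zero overhead...
fresh_base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
merged = PeftModel.from_pretrained(fresh_base, "my-org/task-a-adapter").merge_and_unload()
# ...or keep adapters separate so one server can hot-swap many fine-tunes.
```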