Behind the Stack, Ep 10: Batched Endpoints
Introduction: The Cost Challenge in LLM Workloads
Running LLMs at scale can be expensive. Whether you’re building customer-facing chatbots, document extraction pipelines, or research tools, token costs can quickly balloon into thousands of dollars. While infrastructure teams often focus on throughput optimizations (batching requests on the GPU, prefix caching, etc.), there’s another lever to pull: endpoint design. One of the most powerful, and most under-discussed, endpoint types is the batched endpoint. Instead of prioritizing instant responses, batched endpoints trade latency for cost, cutting your LLM bill in half or more in some cases.
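To make the contrast concrete, here’s a minimal sketch of what using a batched endpoint looks like in practice, using OpenAI’s Batch API as one example (the model name, file name, and prompts are illustrative). Instead of firing requests one at a time and waiting on each response, you upload a whole file of requests and collect the results later at a discounted per-token price:

```python
# Minimal sketch of a batched-endpoint workflow (OpenAI Batch API as an
# example; other providers offer similar "upload now, collect later" flows).
import json
from openai import OpenAI

client = OpenAI()

# 1. Write all requests to a JSONL file - one request per line.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarize document {i}"}],
        },
    }
    for i in range(1000)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# 2. Upload the file and create the batch job. Results arrive within the
#    completion window (here 24h) rather than immediately.
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later and download the output file
```

The key design choice is that nothing in the request demands an immediate answer, which is exactly what gives the provider room to schedule the work cheaply.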
In this blog, we’ll cover:
- What batched endpoints are and how they differ from standard APIs
- How providers reduce costs behind the scenes
- Advanced optimization strategies (spot instances, prefix caching, request reordering)
- How to self-host your own batched endpoint