
Understanding Chargeback in the Context of Self-Hosted Systems

· 7 min read
Amanda Milberg
Principal Solutions Engineer, Doubleword

Introduction

When technology infrastructure—such as GPUs and servers—is owned and managed by a central IT team, the need to allocate costs back to the business units that benefit from these resources becomes a critical consideration. This is particularly relevant in the context of self-hosting AI models, where the initial investment in high-performance GPUs, servers, and supporting infrastructure can be substantial. Without a clear chargeback mechanism, it becomes difficult to ensure accountability, optimize resource usage, and justify the ROI of such investments.

So, how do you design a chargeback system that is scalable, transparent, and easy to manage as your organization grows from supporting a handful of users to thousands of downstream business units? In this guide, we’ll explore how to architect and implement a chargeback system that not only integrates seamlessly with your existing AI infrastructure but also provides clear visibility into costs and benefits. By doing so, you can ensure that the value of your AI investments is both measurable and aligned with business goals.
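To make this concrete, here is a minimal sketch in Python of the core chargeback calculation: aggregate metered token usage per business unit and price it against an internal rate card. The usage records, rate card, and field names are hypothetical placeholders - in practice they would come from your AI gateway logs and your finance team.

```python
# Minimal sketch of a usage-based chargeback calculation.
# The usage records, rate card, and field names are hypothetical;
# in practice they would come from your gateway or billing system.
from collections import defaultdict

# Example usage log: one record per request routed through the AI gateway.
usage_records = [
    {"business_unit": "marketing", "model": "llama-3-70b", "tokens": 12_000},
    {"business_unit": "marketing", "model": "llama-3-70b", "tokens": 8_500},
    {"business_unit": "finance",   "model": "llama-3-70b", "tokens": 30_000},
]

# Hypothetical internal rate card: cost recovered per 1,000 tokens, per model.
rate_per_1k_tokens = {"llama-3-70b": 0.004}

def chargeback_by_unit(records, rates):
    """Aggregate token usage per business unit and price it against the rate card."""
    totals = defaultdict(float)
    for r in records:
        totals[r["business_unit"]] += (r["tokens"] / 1000) * rates[r["model"]]
    return dict(totals)

print(chargeback_by_unit(usage_records, rate_per_1k_tokens))
# {'marketing': 0.082, 'finance': 0.12}
```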

Choosing the Right Model for the Use Case

· 6 min read
Amanda Milberg
Principal Solutions Engineer, Doubleword

Introduction

Selecting the right AI model for deployment is a critical decision that can significantly impact the performance, cost, and user experience of your application. With a wide variety of models available—each with unique strengths and trade-offs—it’s essential to evaluate them carefully based on relevant criteria. In this post, we’ll explore the three key factors to consider when comparing models for deployment: quality, cost, and speed. Understanding how these factors interact and influence your application will help you make informed choices that align with your technical requirements and business goals.
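One lightweight way to make those trade-offs explicit is a weighted scorecard. The sketch below is purely illustrative - the candidate models, normalized scores, and weights are placeholders you would replace with your own evaluation results and pricing.

```python
# Illustrative sketch of scoring candidate models on quality, cost, and speed.
# The candidate models, scores, and weights below are made-up placeholders;
# real values would come from your own evaluations and pricing.

candidates = {
    # name: normalized 0-1 scores, higher is better
    "model-a": {"quality": 0.90, "cost": 0.40, "speed": 0.55},
    "model-b": {"quality": 0.75, "cost": 0.85, "speed": 0.80},
}

# Weights encode what matters most for this particular application.
weights = {"quality": 0.5, "cost": 0.3, "speed": 0.2}

def weighted_score(scores, weights):
    return sum(scores[k] * weights[k] for k in weights)

best = max(candidates, key=lambda name: weighted_score(candidates[name], weights))
for name, scores in candidates.items():
    print(name, round(weighted_score(scores, weights), 3))
print("best fit:", best)
```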

Behind the Stack, Ep 10: Batched Endpoints

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Cost Challenge in LLM Workloads

Running LLMs at scale can be expensive. Whether you’re building customer-facing chatbots, document extraction pipelines, or research tools, token costs can quickly balloon into the thousands of dollars. While infrastructure teams often focus on throughput optimizations (batching requests on the GPU, prefix caching, etc.), there’s another lever to pull: endpoint design. One of the most powerful - and under-discussed - endpoint types is the batched endpoint. Instead of prioritizing instant responses, batched endpoints trade latency for cost, cutting your LLM bill in half (or more in some cases).

In this blog, we’ll cover:

  • What batched endpoints are and how they differ from standard APIs
  • How providers reduce costs behind the scenes
  • Advanced optimization strategies (spot instances, prefix caching, request reordering)
  • How to self-host your own batched endpoint
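As a concrete example, here is a minimal sketch of submitting a job to an OpenAI-style batch endpoint. The field names follow the OpenAI Batch API; other providers, or a self-hosted gateway, may expose a different interface, so treat this as an illustration of the workflow rather than a recipe.

```python
# Minimal sketch of submitting work to an OpenAI-style batch endpoint.
# Field names follow the OpenAI Batch API; other providers (or a self-hosted
# gateway) may differ, so treat this as an illustration of the workflow.
import json
from openai import OpenAI

client = OpenAI()

# 1. Write the requests to a JSONL file - one request per line.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarise document {i}"}],
        },
    }
    for i in range(100)
]
with open("requests.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in requests)

# 2. Upload the file and create the batch with a relaxed completion window.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # the latency you trade away for the discount
)

# 3. Poll later and download the output file once batch.status == "completed".
print(batch.id, batch.status)
```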

Behind the Stack, Ep 9: How to Evaluate Open Source LLMs

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Hidden Challenge in LLM Selection

Choosing the right LLM for your workload isn’t just about picking the latest open-source release or switching to a cheaper closed model. If you’re self-hosting language models - whether for RAG pipelines, agents, or fine-tuned data tasks - knowing how good a model is (and compared to what) is critical to making that decision well.

Most teams rely on academic benchmarks like MMLU, ARC, or HumanEval. But these don’t always reflect real-world usage. Benchmark scores may go up while actual task performance stays flat.

The only way to evaluate models with complete confidence would be to build an in-house evaluation pipeline tailored to your exact use case. That means defining your task - whether it's data extraction, question answering, or multi-step reasoning - then collecting example documents, crafting queries, running each model in a controlled environment, and comparing results against a gold standard set you’ve manually verified.

This lets you directly compare open and closed-source models on your terms. But there's a catch: it’s incredibly time-consuming, complex, and expensive to do well.
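To make the shape of such a pipeline concrete, here is a heavily stripped-down sketch. The task (extracting an invoice total), the gold answers, and the generate() helper are hypothetical stand-ins for your own documents, queries, and model endpoints.

```python
# A stripped-down sketch of the in-house evaluation loop described above.
# The task (invoice total extraction), the gold answers, and the generate()
# helper are all hypothetical stand-ins for your own pipeline.

gold_set = [
    {"document": "Invoice #1042 ... Total due: $1,250.00", "answer": "$1,250.00"},
    {"document": "Invoice #1043 ... Total due: $980.50",  "answer": "$980.50"},
]

def generate(model_name: str, prompt: str) -> str:
    """Placeholder for a call to whichever model you are evaluating
    (self-hosted endpoint, closed API, etc.)."""
    raise NotImplementedError

def evaluate(model_name: str) -> float:
    correct = 0
    for example in gold_set:
        prompt = f"Extract the total due from this invoice:\n{example['document']}"
        prediction = generate(model_name, prompt)
        correct += prediction.strip() == example["answer"]
    return correct / len(gold_set)

# Run the same controlled comparison for every candidate model.
for model in ["open-model-a", "closed-model-b"]:
    print(model, evaluate(model))
```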

Behind the Stack, Ep 8: Choosing the Right Inference Engine for Your LLM Deployment

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Hidden Cost of Choosing the Wrong Inference Engine

Inference engines are the backbone of self-hosted LLM stacks. They’re responsible for turning model weights into real-time, token-by-token output.

But here's the trap: most people choose one based on benchmark scores - and completely miss the bigger picture.

In reality, the best inference engine for your deployment depends on who’s using it, where it’s running, and how often it’s being called. That means the trade-offs between engines like Llama.cpp and vLLM go far beyond just speed. While the Doubleword Stack supports all major inference engines, selecting the best one still depends on your specific workload characteristics.

In this guide, we break down:

  • The two major deployment patterns for LLM inference
  • What each pattern demands from your engine
  • Which open-source projects are optimized for each
  • And how to choose the right engine for your stack

Behind the Stack, Ep 7: Choosing the Right Quantization for Self-Hosted LLMs

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: Quantization Isn’t Just About Memory - It’s About Making LLMs Practical

Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.

At its core, quantization is the process of reducing the precision of numerical values in a model - from 16-bit floats to 8-bit, 4-bit, or even lower. This seemingly simple change has huge implications: lower memory usage, faster inference, and reduced costs.

It typically applies to two things:

  • Weights: the learned, static parameters of the model
  • Activations: the dynamic, intermediate values produced at each layer as the model processes input

Activations vary with every inference and can consume significant memory - especially for long prompts - while weights remain fixed. Compressing either (or both) can bring efficiency gains, but with different trade-offs.
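As a toy illustration of what quantization does numerically, the sketch below maps a float tensor onto an 8-bit integer grid with a single scale factor and then dequantizes it. Real schemes (GPTQ, AWQ, FP8, and so on) are more sophisticated, but the core idea is the same.

```python
# Toy illustration of quantization: map float values onto an 8-bit integer
# grid with a scale factor, then dequantize. Real schemes are more
# sophisticated, but the underlying idea is the same.
import numpy as np

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0          # one symmetric scale per tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for a weight matrix
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

print("max error:", np.abs(weights - recovered).max())        # small, but not zero
print("memory: 4 bytes/value ->", q.itemsize, "byte/value")   # 4x smaller weights
```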

And here’s the catch: not all quantization methods benefit all workloads equally. Choosing between weight-only quantization and full weight+activation quantization isn’t just a technical decision - it’s a strategic one that depends on your model architecture, input/output patterns, and the hardware you’re running on.

This blog walks through how to choose the right quantization strategy for your specific use case - so you can cut costs and improve performance without falling into common traps.

Behind the Stack, Ep 6: How to Speed up the Inference of AI Agents

· 6 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Latency Problem in AI Agents

AI agents are transforming everything from customer support to autonomous workflows. But under the hood, most AI agent architectures suffer from one major problem: growing latency and cost at scale.

Each reasoning step adds more tokens to the input, and because most systems (especially API-based or naive self-hosted setups) resend the entire prompt history on every call, AI agents end up:

  • Repeating compute from earlier steps
  • Wasting GPU cycles
  • Scaling inference cost and latency quadratically

Even modern caching APIs fall short - they don’t cache intermediate thoughts, tool results, or agent memory effectively.

The Solution? Prefix Caching for AI Agents

Prefix caching is a feature available in advanced self-hosted AI inference engines like vLLM, SGLang, and TGI. It allows your AI agents to reuse previously computed context efficiently, cutting down latency and cost - without changing the logic of your agent.
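As a rough sketch, this is what turning on prefix caching looks like in vLLM. The enable_prefix_caching flag reflects recent vLLM versions (and may already default to on in yours); the model name and agent prompts below are placeholders.

```python
# A minimal sketch of enabling prefix caching in vLLM. The flag name reflects
# recent vLLM versions and may differ or default to on in yours; the model
# and prompts are placeholders. With it enabled, the shared system prompt and
# earlier agent steps are computed once and reused on later calls.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

system_and_history = "You are a research agent...\n[step 1 reasoning]\n[tool result]\n"

# Each new agent step appends to the same prefix; vLLM reuses the cached KV
# blocks for the shared prefix instead of recomputing them.
step_2 = llm.generate([system_and_history + "Step 2: decide the next action."], params)
step_3 = llm.generate([system_and_history + "[step 2 output]\nStep 3: finalise."], params)
```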

In this post, you’ll learn:

  • Why traditional AI agent chains are inefficient
  • How prefix caching works inside LLM inference
  • When and how to deploy it
  • What infrastructure patterns support it best

If you're running multi-step AI agents, this is a foundational optimization strategy.

Behind the Stack, Ep 5: Making RAG Work for Multimodal Documents

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:

  • Tables with nested headers, merged cells, or embedded footnotes
  • Charts and images that convey critical insights
  • Layout-heavy formats like invoices, reports, or scanned documents

When such content passes through standard RAG pipelines, the results are often poor - irrelevant retrieval and hallucinated outputs during generation.

This post explores practical strategies to enable accurate retrieval and grounded generation from messy, multimodal documents. We focus on two key stages:

  1. Retrieval – How to index and surface relevant content that isn’t just plain text
  2. Generation – How to present structured or visual content to an LLM for high-quality answers

We’ll cover proven architectures, model recommendations, and implementation details used in real-world production systems.
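As a small illustration of the retrieval side, one common pattern is to index a text rendering of each table or figure while keeping a pointer back to the original structure (or page image) to hand to the LLM at generation time. The helper and chunk format below are hypothetical - adapt them to your own parser and vector store.

```python
# Illustrative sketch of one common pattern: index a text rendering of each
# table for retrieval, but keep a pointer back to the original structure
# (or page image) to pass to the LLM at generation time. The helper name and
# chunk format are hypothetical.

def table_to_markdown(table: list[list[str]]) -> str:
    """Render a parsed table as markdown so it can be embedded and retrieved."""
    header, *rows = table
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

invoice_table = [
    ["Item", "Qty", "Unit price"],
    ["GPU server", "2", "$9,500"],
    ["Support contract", "1", "$1,200"],
]

chunk = {
    "text": table_to_markdown(invoice_table),   # what the retriever embeds and sees
    "source": {"page": 3, "kind": "table"},     # pointer back to the original content
}
print(chunk["text"])
```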

Behind the Stack, Ep 4: Making Your Load Balancer LLM-Aware

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword


Introduction

If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.

This blog explains:

  • Why traditional load balancing breaks for language models
  • Why standard metrics like GPU utilization are misleading
  • How KV cache utilization provides a better signal
  • How to go further with prefix-aware routing to reduce latency
  • What’s required to implement an LLM-aware balancer in practice
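To make the idea concrete, here is a toy sketch of the core routing decision in an LLM-aware balancer: send the next request to the replica with the most free KV cache rather than the fewest connections. The replica list and metric name are hypothetical; in practice you would scrape them from your inference engine's metrics endpoint.

```python
# Toy sketch of the core routing decision in an LLM-aware load balancer:
# pick the replica with the most KV cache headroom rather than the fewest
# connections. The replica metrics here are hypothetical; in practice they
# would be scraped from the inference engine's metrics endpoint.

replicas = [
    {"url": "http://llm-0:8000", "kv_cache_utilization": 0.82},
    {"url": "http://llm-1:8000", "kv_cache_utilization": 0.35},
    {"url": "http://llm-2:8000", "kv_cache_utilization": 0.61},
]

def pick_replica(replicas):
    """Route the next request to the replica with the lowest KV cache utilization."""
    return min(replicas, key=lambda r: r["kv_cache_utilization"])

print(pick_replica(replicas)["url"])   # http://llm-1:8000
```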

Behind the Stack, Ep 3: How to Serve 100 Models on a Single GPU with No Cold Starts

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. And before long, you’ve got dozens of fine-tuned variants - each trained with a LoRA or other parameter-efficient technique.

Training these models is relatively lightweight. Serving them efficiently is a much harder problem. In this video, I break down how to serve many LoRAs (or other PEFTs) on a single GPU, support dynamic load patterns, and avoid the high cost and latency of traditional serverless setups.

What Is a LoRA (and Why Use One)?

LoRA (Low-Rank Adaptation) is a popular form of parameter-efficient fine-tuning. Instead of updating full weight matrices, LoRA inserts small trainable adapters at key layers.

  • Only a small fraction of parameters are updated
  • Training uses much less memory
  • The resulting adapters are tiny (often <1% of the model size)

These benefits make LoRA a go-to method for use cases where you want to:

  • Customize a base model per task or domain
  • Run many fine-tunes without retraining or duplicating the base model
  • Stay compatible with quantized or frozen weights

At inference time, LoRA can either be merged into the model (for zero overhead), or kept separate to allow swapping between fine-tunes.
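For reference, this is roughly what attaching and merging a LoRA adapter looks like with Hugging Face PEFT. The base model, rank, target modules, and adapter repository are illustrative choices rather than recommendations.

```python
# A short sketch of attaching a LoRA adapter with Hugging Face PEFT. The base
# model, rank, target modules, and adapter repo are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Small low-rank adapters are injected into the attention projections only.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # typically well under 1% of total parameters

# At inference time you can merge a trained adapter into the base for zero overhead...
fresh_base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
merged = PeftModel.from_pretrained(fresh_base, "my-org/task-a-adapter").merge_and_unload()
# ...or keep adapters separate so one server can hot-swap many fine-tunes.
```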