Posts

Everything in one stream so writing stays simple and discoverable. Use tags or search when the archive gets bigger.

Post 19 Jun 2026 2 min read

The lethal trifecta in AI agents

When agents can read private data, process untrusted content, and communicate outward, prompt injection becomes a much more serious security problem.

Post 17 Jun 2026 3 min read

How Plan Caching Reduces LLM Agent Costs

Plan caching reuses planning templates across similar agent tasks, cutting cost and latency without throwing away accuracy.

Post 5 Jun 2026 3 min read

Stop filling your agent's context window just because you can

Bigger context windows do not remove failure modes. They create new ones when we stop being intentional about what goes into an agent's context.

Post 21 May 2026 3 min read

I benchmarked 5 embedding models across 4 datasets

I benchmarked five embedding models across four NanoBEIR datasets and found that bigger embeddings did not always produce better retrieval.

Post 19 May 2026 3 min read

Why reranking matters with cross-encoders

Bi-encoders make retrieval fast, but cross-encoders expose why reranking matters when meaning depends on the query.

Post 11 May 2026 8 min read

A beginner-friendly guide to the GGUF model format

GGUF made local LLM inference feel practical by packaging model weights, vocabulary, hyperparameters, and architecture metadata into one runnable format.

Post 27 Mar 2026 3 min read

Semantic Caching in Production

Repeated user intents can quietly inflate LLM cost and latency. Semantic caching helps, but production use comes with trade-offs.