Engineering Blog

AI · Native Cloud Notes

Practical engineering notes — AI agents, MLOps/RAG, on-device AI, cloud-native.

#AI (2) · #Cloud-Native (2) · #MLOps (2) · #Cost-Optimization (1) · #LLM (1) · #RAG (1) · #Vector-Search (1)

May 30, 2026 · 16 min
LLM Inference Cost Optimization 2026: A Three-Layer Caching Strategy
Stack prompt caching, semantic caching, and vLLM KV prefix caching into three layers to cut LLM inference cost by 40-90% — a 2026 production architecture with working code.
#AI #Cloud-Native #Cost-Optimization #LLM #MLOps
May 23, 2026 · 19 min
Production RAG 2026: Lifting Search Quality with Hybrid Search + Reranking
A 2026 production RAG architecture that fuses BM25 sparse search with dense vector search via RRF and sharpens precision with a cross-encoder reranker — with working code.
#AI #Cloud-Native #MLOps #RAG #Vector-Search