Engineering Blog
AI · Native Cloud Notes
Practical engineering notes — AI agents, MLOps/RAG, on-device AI, cloud-native.
- 16 min read
LLM Inference Cost Optimization 2026: A Three-Layer Caching Strategy
Stack prompt caching, semantic caching, and vLLM KV prefix caching into three layers to cut LLM inference cost by 40-90% — a 2026 production architecture with working code.
AI · Cloud-Native · Cost-Optimization · LLM · MLOps
- 19 min read
Production RAG 2026: Lifting Search Quality with Hybrid Search + Reranking
A 2026 production RAG architecture that fuses BM25 sparse search with dense vector search via RRF and sharpens precision with a cross-encoder reranker — with working code.
AI · Cloud-Native · MLOps · RAG · Vector-Search