#Inference Endpoint — blogs.social

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

RLHF (Reinforcement Learning from Human Feedback)

RLHF is the training recipe that turned raw language models (good at predicting text) into aligned assistants (good at following instructions helpfully and safely). It was popularized by InstructGPT (2022) and is the foundation of every major chat LLM.

The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pre-trained base LLM. Fine-tune it on a dataset of (prompt, ideal…

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

OpenRouter

A unified API gateway for large language models that lets you call 100+ LLMs from different providers through a single OpenAI-compatible endpoint with automatic fallback and cost routing.

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

vLLM (Virtual LLM) is an open-source inference engine from UC Berkeley that dramatically increases the throughput of serving large language models on GPU hardware. It was introduced in 2023 with PagedAttention, a novel memory management technique that treats the KV cache like virtual memory in an OS, reducing waste from up to 60–80% of GPU memory down to under 4%.

The Problem: KV Cache…

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-token billing.

How It Works

Ollama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads…

Read more →