#Vllm — blogs.social

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

An inference endpoint is the serving layer for a trained model. After training (or downloading) an LLM, you need infrastructure to accept requests, run the forward pass, and return outputs at scale. That infrastructure, whether it's Hugging Face Inference Endpoints, AWS SageMaker, your own Vllm deployment, or a managed service like OpenAI, is the inference endpoint.

Request Flow

Client…

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

Tokenization is the first step in any LLM pipeline: converting raw text into a sequence of integer IDs that the model actually processes. Understanding tokenization helps you reason about context window limits, API costs, and why LLMs sometimes struggle with tasks that seem simple.

How Tokens Work

Tokens are typically subword units, not quite characters, not quite words. Common English words…

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating all billions of parameters in a large model, LoRA freezes the original weights and injects trainable low-rank matrices into each transformer layer. The insight: weight updates during fine-tuning have low "intrinsic rank", most of the useful signal lives in a much smaller…

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

OpenRouter

A unified API gateway for large language models that lets you call 100+ LLMs from different providers through a single OpenAI-compatible endpoint with automatic fallback and cost routing.

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

GitHub Copilot, launched in 2021 and built on OpenAI Codex (later GPT-4), was the first AI pair programmer to reach mainstream adoption. It integrates as an extension into VS Code, JetBrains, Neovim, and Visual Studio, making it the broadest-reaching AI coding tool by editor support.

Core Capabilities

Inline completions, suggests the next line or block as you type, shown as ghost text
*…

Read more →

Sahil Kapoor's Playbook @sahil.sahilkapoor.com.ap.brid.gy

May 17

Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-token billing.

How It Works

Ollama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads…

Read more →