RLHF (Reinforcement Learning from Human Feedback) is the training recipe that turned raw language models (good at predicting text) into aligned assistants (good at following instructions helpfully and safely). It was popularized by InstructGPT (2022), and it or a close variant underlies most major chat LLMs.
The Three-Stage Pipeline
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained base LLM. Fine-tune it on a dataset of (prompt, ideal…
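As a rough illustration of what Stage 1 optimizes, the sketch below computes a token-level SFT loss in plain Python. It assumes each training example pairs a prompt with a target completion and that only completion tokens contribute to the loss. That masking is a common convention, not something stated above. The names `sft_loss`, `step_logits`, `target_ids`, and `loss_mask` are hypothetical, and a real setup would use a deep learning framework instead of Python lists.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary scores.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(step_logits, target_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    Assumption: prompt positions carry mask 0 and completion positions
    carry mask 1, so the model is only trained to reproduce the completion.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(step_logits, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 3 tokens, 4 positions.
# The first two positions belong to the prompt and are masked out.
step_logits = [[0.0, 0.0, 0.0]] * 4   # an untrained model: uniform scores
target_ids = [0, 1, 2, 1]
loss_mask = [0, 0, 1, 1]

loss = sft_loss(step_logits, target_ids, loss_mask)
print(f"SFT loss: {loss:.4f}")
```

With uniform scores over three tokens the loss equals log(3), the cost of guessing at random. Fine-tuning lowers this value by shifting probability toward the target token at each unmasked position.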