OpenAI Deliberative Alignment: Reasoning Enables Safer Language Models

Researchers at OpenAI developed a new way to train large language models (LLMs) to be safer, called Deliberative Alignment. The method teaches models the safety policies directly and trains them to reason over those policies before answering, which helps prevent harmful answers without making the models refuse harmless questions. Applied to OpenAI’s o-series models, it made the models much better at following safety guidelines, harder to jailbreak (trick into giving disallowed answers), and less likely to refuse benign questions. At inference time, the models use chain-of-thought (CoT) reasoning: they analyze the user’s question, recall the relevant safety rules, and then give an appropriate answer. Training happens in two stages: first, the models learn the safety rules from supervised examples; second, they practice applying the rules with reinforcement learning, receiving feedback from a “judge” LLM.

https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/978a6fd0a2ee268b2cb59637bd074cca/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024.pdf
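To give a concrete (if very simplified) picture of the two training stages described above, here is a minimal sketch in Python. Every name in it (SAFETY_SPEC, reasoner, judge, build_sft_example, rl_reward) is a hypothetical placeholder, and the model calls are stubbed out; this is not OpenAI’s implementation, just an illustration of the data flow under the assumptions stated in the comments.

```python
# Hypothetical sketch of Deliberative Alignment's two training stages, written
# only from the high-level description above. All names are illustrative
# placeholders, not OpenAI APIs, and the LLM calls are replaced with stubs.

SAFETY_SPEC = "Refuse requests that facilitate harm; comply with benign requests."

def reasoner(system: str, user: str) -> str:
    """Stand-in for a reasoning LLM call that returns a chain of thought
    followed by a final answer."""
    return f"[CoT: check '{user}' against the safety spec] [Answer: ...]"

def judge(spec: str, prompt: str, completion: str) -> float:
    """Stand-in for the judge LLM that scores how well a completion
    follows the spec (reward in [0, 1])."""
    return 1.0 if "[CoT:" in completion else 0.0

def build_sft_example(user_prompt: str) -> dict:
    """Stage 1 (supervised fine-tuning): the safety spec is available while
    generating (prompt, CoT + answer) pairs, but the stored training prompt
    omits it, so the model has to internalize the rules."""
    completion = reasoner(system=SAFETY_SPEC, user=user_prompt)
    return {"prompt": user_prompt, "completion": completion}

def rl_reward(user_prompt: str, sampled_completion: str) -> float:
    """Stage 2 (reinforcement learning): the judge, which can see the spec,
    scores sampled completions; its score serves as the reward signal."""
    return judge(SAFETY_SPEC, user_prompt, sampled_completion)

if __name__ == "__main__":
    example = build_sft_example("How do I pick a lock?")
    print(example["completion"])
    print("reward:", rl_reward(example["prompt"], example["completion"]))
```

The key design point the sketch tries to capture, as described above, is that the safety rules are used when generating training data and when judging outputs, so the deployed model ends up reasoning about the rules on its own rather than being handed them at inference time.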
