How truthful is GPT-3? A benchmark for language models – AI Alignment Forum

This is an edited excerpt of a new ML paper (pdf, code) by Stephanie Lin (FHI
Oxford), Jacob Hilton (OpenAI) and Owain Evans (FHI Oxford). The paper is under
review at NeurIPS.

We propose a benchmark to measure whether a language model is truthful in
generating answers to questions. The benchmark comprises 817 questions that span
38 categories, including health, law, finance and politics (see Figure 1). We
crafted questions that some humans would answer falsely due to a false belief or
misconception. To perform well, models must avoid generating false answers
learned from imitating human texts.

We tested GPT-3, GPT-Neo/GPT-J, GPT-2 and a T5-based model. The best model was
truthful on 58% of questions, while human performance was 94%. Models generated
many false answers that mimic popular misconceptions and have the potential to
deceive humans. The largest models were generally the least truthful (see Figure
2 below). For example, the 6B-parameter GPT-J model was 17% less truthful than
its 125M-parameter counterpart. This contrasts with other NLP tasks, where
performance improves with model size. However, this result is expected if false
answers are learned from the training distribution. We suggest that scaling up
models alone is less promising for improving truthfulness than fine-tuning using
training objectives other than imitation of text from the web.
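
The headline numbers above are fractions of questions on which a model's answer was judged truthful. As a minimal sketch of that arithmetic, assuming each answer has already been given a boolean truthfulness label, the rate and the gap between two model sizes could be computed as follows. The function name `truthful_rate` and the toy labels are hypothetical illustrations, not the paper's code or data.

```python
from typing import Iterable


def truthful_rate(judgements: Iterable[bool]) -> float:
    """Return the fraction of answers judged truthful (True = truthful)."""
    labels = list(judgements)
    if not labels:
        raise ValueError("need at least one judged answer")
    # True counts as 1 and False as 0, so the sum is the number of truthful answers.
    return sum(labels) / len(labels)


# Hypothetical toy labels for illustration only; NOT results from the paper.
toy_small_model = [True, True, False, True]
toy_large_model = [True, False, False, True]

small_rate = truthful_rate(toy_small_model)
large_rate = truthful_rate(toy_large_model)

# Gap expressed in percentage points (small minus large).
gap_points = (small_rate - large_rate) * 100

print(f"small: {small_rate:.0%}, large: {large_rate:.0%}, gap: {gap_points:.0f} points")
```

On the real benchmark the list would hold one label per question, 817 in total. The sketch reports the gap in percentage points, which is only one possible way to express a difference such as the one quoted for GPT-J.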

Figure 1: TruthfulQA questions with answers from GPT-3-175B with default QA
prompt. Examples illustrate false answers from GPT-3 that mimic human falsehoods
and misconceptions. Models are not shown category labels.

INTRODUCTION
There is growing interest in using language models to generate text for
practical applications. Large companies are deploying their own models [34, 11],
and hundreds of organizations are deploying GPT-3 via APIs from OpenAI and other
firms [30, 48, 8, 31]. While recent language models are impressively fluent,
they ha