Figure 1: When asked to output a random number, GPT-4o answers 7 (b) 70% of the time. In contrast, in multi-turn conversations where the LLM observes its past answers to the same question, it can de-bias itself, choosing subsequent numbers such that all numbers in the history form a nearly uniform distribution (c).
Figure 2: For Random questions, the probability of the top choice drops substantially when LLMs are allowed to view their answer history. For Subjective and Hard questions, the top-choice probability drops only slightly, since LLMs tend to alternate among multiple plausible options. For Easy questions, LLMs are confident and keep choosing the same answer even in the presence of answer history.
Figure 3: B-score can be used to improve answer-verification accuracy; shown here is a 2-step verification process based on confidence scores and B-scores.
We have introduced B-score, a new metric designed to detect and quantify biases in large language models (LLMs) by analyzing their response histories. This work, “B-score: Detecting biases in large language models using response history,” was published at ICML 2025, one of the most prestigious conferences in machine learning with an acceptance rate of 26.9% (3,260/12,107 submissions).
LLMs are known to display various forms of bias, from social biases, such as those against women, to seemingly arbitrary tendencies, such as favoring the number 7. In this study, we investigate whether LLMs can mitigate such biases when allowed to observe their own prior responses in a multi-turn conversational setting. To examine this systematically, we curated a novel benchmark of questions covering nine topics, grouped into three categories: Subjective, Random, and Objective. Our experiments reveal that LLMs can effectively “de-bias” themselves in multi-turn conversations, particularly for Random questions, where unbiased answers are expected.
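To make the multi-turn setting concrete, here is a minimal sketch of the loop we have in mind: the model is asked the same question repeatedly, and each new turn includes its own previous answers. The prompt wording, turn count, and use of the OpenAI Python client are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a multi-turn conversation in which an LLM observes its
# own answer history. Assumes the `openai` Python package (v1+) and an
# OPENAI_API_KEY in the environment; the prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()

def multi_turn_answers(question: str, num_turns: int = 30,
                       model: str = "gpt-4o") -> list[str]:
    messages = [{"role": "user", "content": question}]
    answers: list[str] = []
    for _ in range(num_turns):
        response = client.chat.completions.create(model=model, messages=messages)
        answer = response.choices[0].message.content.strip()
        answers.append(answer)
        # Feed the answer back and re-ask the same question, so the model
        # sees its full answer history on the next turn.
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": question})
    return answers

history = multi_turn_answers(
    "Output a random integer between 0 and 9. Answer with the number only.")
```

With a biased model, the early turns of `history` tend to repeat the favored answer (e.g., 7), while later turns spread out toward a uniform distribution, which is the de-biasing effect shown in Figure 1.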
To measure bias more rigorously, we propose B-score, a metric that successfully detects bias across Subjective, Random, Easy, and Hard questions (Easy and Hard subdividing the Objective category). Beyond bias detection, B-score also proves valuable for answer verification. On widely used benchmarks such as MMLU, HLE, and CSQA, integrating B-score significantly improves verification accuracy, accepting correct answers and rejecting incorrect ones more reliably than verbalized confidence scores or single-turn answer frequencies alone.
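As a rough illustration of how such a metric can be computed and used, the sketch below treats an answer's B-score as the gap between its single-turn frequency (fresh conversations with no history) and its multi-turn frequency (conversations where the model sees its past answers). This reading, the threshold values, and the 2-step acceptance rule are our assumptions for illustration, not the paper's exact formulation.

```python
from collections import Counter

def frequency(answers: list[str], choice: str) -> float:
    """Empirical frequency of `choice` among sampled answers."""
    return Counter(answers)[choice] / len(answers)

def b_score(single_turn: list[str], multi_turn: list[str], choice: str) -> float:
    # Assumed reading of the metric: how much more often the model gives
    # `choice` without history than with history. A large positive value
    # flags a bias the model corrects once it sees its own answers.
    return frequency(single_turn, choice) - frequency(multi_turn, choice)

def accept_answer(choice: str, confidence: float,
                  single_turn: list[str], multi_turn: list[str],
                  conf_threshold: float = 0.9, b_threshold: float = 0.2) -> bool:
    # Hypothetical 2-step rule in the spirit of Figure 3: gate on the
    # verbalized confidence score first, then reject answers whose B-score
    # suggests bias. Both thresholds are illustrative.
    if confidence < conf_threshold:
        return False
    return b_score(single_turn, multi_turn, choice) < b_threshold

# Toy example mirroring Figure 1: "7" dominates single-turn samples but not
# multi-turn ones, yielding a high B-score that flags the bias.
single = ["7"] * 7 + ["3", "5", "9"]
multi = [str(d) for d in range(10)]
print(b_score(single, multi, "7"))  # 0.7 - 0.1 = 0.6
```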
This work provides both a new methodological tool and a practical evaluation resource for understanding and mitigating biases in LLMs. By enabling models to reflect on their own response histories, B-score opens a pathway toward fairer, more reliable, and more trustworthy AI systems.
Code and data are publicly available at: https://b-score.github.io