Trained on human data, LLMs may inherit many of the psychological properties of the humans who produced that training data. While this has been tentatively shown in some social and moral domains (e.g., racial bias), we seek to demonstrate, in two phases, that many *cognitive* biases that afflict human judgment (e.g., base-rate neglect) also become problems for LLMs.

In Phase 1, using various language models, we will systematically evaluate the extent to which LLMs pass validated benchmarks of rationality (e.g., Stanovich & West’s “rationality quotient”). We will then compare actual performance against the predictions of laypeople and a special sample of computer scientists; we expect that people view LLMs as closer to ideal reasoners than they are and fail to account for the biases LLMs inherit.

In Phase 2, we will test the hypothesis that, contrary to the common view that LLMs become more rational as they advance, Reinforcement Learning from Human Feedback (RLHF) can actually exacerbate cognitive bias in LLMs. We will first give a sample of human raters a battery of questions known to elicit bias; bias will be assessed by having participants choose between two possible responses to each question, an intuitive (wrong) answer and a correct answer, mimicking the human-feedback component of the RLHF process. We will then fine-tune a model on these human responses and compare the accuracy of LLM responses pre- versus post-RLHF. We expect RLHF to move LLMs further from ideal reasoning wherever human judgments systematically deviate from rationality.

Altogether, we hope to demonstrate ways in which LLMs inherit human cognitive biases, and how RLHF may exacerbate them, thereby improving the chances of aligning AI reasoning with human goals.
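For concreteness, here is a minimal Python sketch of the Phase 2 pre/post comparison. The `BiasItem` format, the two-option prompt, and the `base_model`/`rlhf_model` stubs are illustrative assumptions standing in for the actual harness the engineer would build:

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BiasItem:
    question: str
    correct: str
    intuitive_wrong: str

def accuracy(model: Callable[[str], str], items: List[BiasItem]) -> float:
    """Fraction of items on which the model picks the correct option."""
    hits = 0
    for item in items:
        # Randomize option order so answer position carries no signal.
        options = [item.correct, item.intuitive_wrong]
        random.shuffle(options)
        prompt = (
            f"{item.question}\n"
            f"A) {options[0]}\n"
            f"B) {options[1]}\n"
            "Answer with A or B only."
        )
        reply = model(prompt).strip().upper()
        chosen = options[0] if reply.startswith("A") else options[1]
        if chosen == item.correct:
            hits += 1
    return hits / len(items)

# One classic item (the CRT bat-and-ball problem) where the intuitive
# answer diverges from the correct one.
items = [
    BiasItem(
        question="A bat and a ball cost $1.10 in total. The bat costs "
                 "$1.00 more than the ball. How much does the ball cost?",
        correct="5 cents",
        intuitive_wrong="10 cents",
    ),
]

# Placeholder stubs: in practice these would wrap API calls to the base
# model and to the same model after fine-tuning on human rater choices.
def base_model(prompt: str) -> str:
    return "A"

def rlhf_model(prompt: str) -> str:
    return "B"

print("pre-RLHF accuracy: ", accuracy(base_model, items))
print("post-RLHF accuracy:", accuracy(rlhf_model, items))
```

Randomizing option order keeps the model from exploiting answer position, mirroring the counterbalancing we would use with human raters.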
My PhD is in judgment & decision making, and this project is joint with two other faculty members with similar expertise: Lucius Caviola (GPI/Oxford) & Josh Lewis (NYU). The engineer we would hire with this grant money has extensive experience doing closely related work and comes recommended by top engineers in a colleague's CS lab.
$14,000
This figure is an engineer's quote for his time plus a compute purchase.
>90% chance of learning something publishable in a general science/psychology journal and useful to the community; >65% chance of publication in a top-3 journal.
Application status across various funders