Artificial intelligence exhibits human-like behavior in performance evaluation

19.02.2026

Artificial intelligence often behaves in a surprisingly human way when evaluating performance: it tends to be lenient and avoid negative judgments, especially when objective standards are lacking. These are the findings of a new study by Dirk Sliwka, Professor at the Cluster of Excellence ECONtribute at the University of Cologne, and Rainer Michael Rilke, Professor at the WHU Otto Beisheim School of Management.


Artificial intelligence brings both promise and challenges to the workplace. It is expected to speed up decision-making and cut costs. But when it comes to evaluating human performance, for example in hiring, questions arise: What standards does AI use to make its judgments? And how objective are its decisions?

A recent study by the Cluster of Excellence ECONtribute, published as a discussion paper, addressed these questions. The research team, consisting of Professor Dirk Sliwka (University of Cologne) and Professor Rainer Michael Rilke (WHU Otto Beisheim School of Management), came to the following conclusion: Language models such as ChatGPT often adopt human evaluation patterns, particularly when assessing individuals without clear benchmarks. In such cases, the models tend to be lenient, giving mostly average ratings and avoiding negative judgments.

Language models reproduce learned evaluation patterns

For the study, the researchers analyzed three scenarios, submitting prompts to the GPT-5 mini language model via the application programming interface (API) of the software company OpenAI. In the first scenario, the AI rated the performance of 500 CEOs of large companies on a scale from one (“unsatisfactory”) to five (“outstanding”). The result: the AI predominantly gave ratings around the midpoint and was cautious with negative judgments. “Language models reproduce evaluation patterns that they know from their training data,” explains Professor Dirk Sliwka. “This includes the human tendency to judge leniently when in doubt.” Even when tasked with assessing each CEO individually and deciding whether they were among the worst 20% of the 500 managers evaluated, the AI classified fewer than 0.3% in this category.
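To give a sense of the setup, a minimal sketch of such a rating query through OpenAI's Python SDK might look as follows. The prompt wording, the model identifier string, and the output parsing are illustrative assumptions, not the study's published materials.

# Sketch: asking a language model for a 1-5 performance rating via the
# OpenAI API. Prompt text and parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_ceo(profile: str) -> int:
    """Return a single rating from 1 ("unsatisfactory") to 5 ("outstanding")."""
    response = client.chat.completions.create(
        model="gpt-5-mini",  # assumed API identifier for the GPT-5 mini model
        messages=[
            {"role": "system",
             "content": "Rate this CEO's performance on a scale from 1 "
                        "(unsatisfactory) to 5 (outstanding). "
                        "Reply with the number only."},
            {"role": "user", "content": profile},
        ],
    )
    return int(response.choices[0].message.content.strip())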

In the second scenario, the AI evaluated job applications of varying quality that had themselves been generated with AI. As in the first scenario, the ratings were lenient. Only when the AI had to compare several applications simultaneously and grade them according to a prespecified distribution did its ability to differentiate increase significantly, resulting in more precise judgments.
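One way to picture this comparative, forced-distribution setup is a single prompt that contains all applications and imposes a quota per grade. The following sketch illustrates the idea; the quota wording and prompt layout are assumptions.

# Sketch: comparative rating under a prespecified grade distribution.
# The quota wording and prompt layout are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
applications = [f"Application text {i} ..." for i in range(1, 6)]  # placeholders

prompt = (
    "Grade the following applications from 1 to 5. "
    "Assign each grade to exactly 20% of the applications.\n\n"
    + "\n\n".join(f"Application {i + 1}:\n{text}"
                  for i, text in enumerate(applications))
)
response = client.chat.completions.create(
    model="gpt-5-mini",  # assumed model identifier, as above
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)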

AI evaluates objective data more accurately than humans

The third scenario was based on objective performance signals from an experiment with crowd workers who completed well-defined tasks. The AI received the same noisy but objective information about work performance as human raters. Here, the AI’s evaluations were far more accurate than those of the human comparison group, who tended to give better ratings when they knew, for example, that their evaluations determined a worker’s bonus. “As soon as there is an objective anchor, AI gets very close to the optimal statistical benchmark and then gives much more accurate ratings than human evaluators because it makes fewer mistakes,” summarizes Professor Sliwka.
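What an “optimal statistical benchmark” means here can be illustrated with a textbook shrinkage estimate: if true performance and the observation noise are both normally distributed, the best rating weights the noisy signal by its reliability instead of defaulting to an average grade. The following stylized sketch uses assumed distributions and parameters, not the study’s actual benchmark.

# Stylized sketch of an optimal benchmark for a noisy performance signal:
# with ability ~ N(mu, tau2) and signal = ability + noise, noise ~ N(0, sigma2),
# the posterior mean shrinks the signal toward mu. All values are assumptions.
mu, tau2 = 3.0, 1.0   # prior mean and variance of true performance (1-5 scale)
sigma2 = 0.5          # variance of the observation noise

def optimal_rating(signal: float) -> float:
    weight = tau2 / (tau2 + sigma2)       # reliability of the signal
    return weight * signal + (1 - weight) * mu

print(round(optimal_rating(2.0), 2))  # 2.33: a weak signal stays below average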

Press and Communication:

Maria John Sánchez
PR Manager
M maria.johnsanchez@uni-bonn.de

Researcher:

Dirk Sliwka

Cluster of Excellence ECONtribute, University of Cologne

M sliwka@wiso.uni-koeln.de

Links

Discussion paper