LessWrong (Curated & Popular) podcast

“About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong” by bohaska

FutureHouse is a company that builds literature research agents. They tested their agents on the biology and chemistry subset of HLE questions and noticed errors in the answers.

The post's first paragraph:

Humanity's Last Exam has become the most prominent eval representing PhD-level research. We found the questions puzzling and investigated with a team of experts in biology and chemistry to evaluate the answer-reasoning pairs in Humanity's Last Exam. We found that 29 ± 3.7% (95% CI) of the text-only chemistry and biology questions had answers with directly conflicting evidence in peer reviewed literature. We believe this arose from the incentive used to build the benchmark. Based on human experts and our own research tools, we have created an HLE Bio/Chem Gold, a subset of AI and human validated questions.
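The "29 ± 3.7% (95% CI)" figure reads as a standard normal-approximation confidence interval for a proportion. As a sketch of how such a margin arises, the snippet below uses hypothetical counts (the post excerpt does not state the sample size here) chosen so the numbers come out close to those reported:

```python
import math

# Hypothetical counts for illustration only; the actual number of
# text-only chemistry/biology questions reviewed is not given here.
k, n = 168, 578          # flagged answers, questions reviewed

p = k / n                # observed error rate
z = 1.96                 # z-score for a 95% confidence level

# Wald (normal-approximation) margin of error for a proportion
margin = z * math.sqrt(p * (1 - p) / n)

print(f"{p:.1%} ± {margin:.1%}")  # roughly 29% ± 3.7%
```

With these assumed counts the interval matches the post's headline figure; other (n, k) pairs of similar size would do the same.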

About the initial review process for HLE questions:

[...] Reviewers were given explicit instructions: “Questions should ask for something precise [...]

---

First published:
July 29th, 2025

Source:
https://www.lesswrong.com/posts/JANqfGrMyBgcKtGgK/about-30-of-humanity-s-last-exam-chemistry-biology-answers

---



Narrated by TYPE III AUDIO.
