Prajna-Bench - Results Summary

Google DeepMind x Kaggle AGI Hackathon · Metacognition Track · April 2026 · prajna_v4_final.ipynb

Models evaluated:    7 (frontier + open-weight)
Items per model:     100 (3 independent runs)
Kruskal-Wallis H:    20.55 (p = 0.0022 **)
Max Prajna score:    0.58 (all models below 0.6)

Leaderboard

#  Model              Provider       Prajna  Pramana  Neti-Neti  Adhyasa  Sakshi  Conf=100  Tier
1  Claude Sonnet 4.6  Anthropic      0.58    ~0.65    ~0.40      ~0.34    42%     0%        Tier 1
2  GPT-5.4            OpenAI         0.53    ~0.60    ~0.38      ~0.30    33%     0%        Tier 2
3  Qwen3-235B         Alibaba        0.50    ~0.58    ~0.36      ~0.28    31%     8%        Tier 2
4  Gemini 2.5 Pro     Google         0.48    ~0.55    ~0.34      ~0.26    12%     27%       Tier 2
5  DeepSeek R1        DeepSeek       0.46    ~0.53    ~0.33      ~0.24    40%     14%       Tier 2
6  Gemini 2.5 Flash   Google         0.38    ~0.45    ~0.27      ~0.20    27%     26%       Tier 3
7  Gemma-3-27B        Google (open)  0.30    ~0.36    ~0.21      ~0.16    23%     5%        Tier 3

Conf=100 = rate of maximum-confidence responses. Sakshi = genuine self-correction rate in phase 2. Per-axis scores are approximated from the reported composites and ordinal patterns.

P > N > A hierarchy - all models

Pramana > Neti-Neti > Adhyasa ordering holds across all 7 models without exception, suggesting a structural property of autoregressive generation.

Claude Sonnet 4.6:  P ~.65 > N ~.40 > A ~.34
GPT-5.4:            P ~.60 > N ~.38 > A ~.30
Gemini 2.5 Pro:     P ~.55 > N ~.34 > A ~.26
DeepSeek R1:        P ~.53 > N ~.33 > A ~.24
Gemma-3-27B:        P ~.36 > N ~.21 > A ~.16
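The ordering claim can be checked mechanically. A minimal sketch using the approximate per-axis scores from the leaderboard:

```python
# Approximate per-axis scores (Pramana, Neti-Neti, Adhyasa) from the leaderboard.
axis_scores = {
    "Claude Sonnet 4.6": (0.65, 0.40, 0.34),
    "GPT-5.4":           (0.60, 0.38, 0.30),
    "Qwen3-235B":        (0.58, 0.36, 0.28),
    "Gemini 2.5 Pro":    (0.55, 0.34, 0.26),
    "DeepSeek R1":       (0.53, 0.33, 0.24),
    "Gemini 2.5 Flash":  (0.45, 0.27, 0.20),
    "Gemma-3-27B":       (0.36, 0.21, 0.16),
}

# The hierarchy holds iff every (P, N, A) tuple is strictly decreasing.
ordering_holds = all(p > n > a for p, n, a in axis_scores.values())
print(ordering_holds)  # True: P > N > A for all 7 models
```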

Sakshi self-correction rates

After phase 2 reflection, what fraction of phase 1 errors were genuinely corrected? Low rates indicate self-persuasion: confidence rises without accuracy improvement.

Claude Sonnet 4.6:  42%
DeepSeek R1:        40%
GPT-5.4:            33%
Qwen3-235B:         31%
Gemini 2.5 Flash:   27%
Gemma-3-27B:        23%
Gemini 2.5 Pro:     12%
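As defined above, the Sakshi rate conditions on phase-1 errors only. A minimal sketch, assuming hypothetical per-item records with `phase1_correct` and `phase2_correct` flags (field names are illustrative, not from the notebook):

```python
def sakshi_rate(items):
    """Fraction of phase-1 errors that are correct after phase-2 reflection.

    Each item is a dict with boolean 'phase1_correct' and 'phase2_correct'
    flags (hypothetical field names).
    """
    errors = [it for it in items if not it["phase1_correct"]]
    if not errors:
        return 0.0
    corrected = sum(it["phase2_correct"] for it in errors)
    return corrected / len(errors)

# Toy example: 4 phase-1 errors, 2 genuinely corrected -> rate 0.5.
toy = [
    {"phase1_correct": True,  "phase2_correct": True},
    {"phase1_correct": False, "phase2_correct": True},
    {"phase1_correct": False, "phase2_correct": False},
    {"phase1_correct": False, "phase2_correct": True},
    {"phase1_correct": False, "phase2_correct": False},
]
print(sakshi_rate(toy))  # 0.5
```

Note that items the model already got right in phase 1 do not enter the denominator, which is what lets confidence rise without the rate moving.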
Statistical tests and evaluation setup

Kruskal-Wallis H (composite):   H = 20.55, p = 0.0022 **
Adhyasa axis H:                 H = 37+, p < 0.001 ***
Neti-Neti axis H:               H = 15.14, p < 0.001 ***
Cohen's d (Claude vs GPT-5.4):  d = 0.617 (medium)
Gemini overconfidence rate:     26-27% conf=100
Gemini accuracy @ conf=100:     57.9%
Gemini Sakshi gap:              p = 0.0005
Rotating judges (3 total):      Opus 4.6, Gemini 3.1 Pro, GPT-5.4
Self-judging rule:              provider excluded from its own evaluation
Runs per model:                 3 independent runs
Benchmark items:                100 structured items
Axes evaluated:                 Pramana, Neti-Neti, Adhyasa, Sakshi
Notebook:                       prajna_v4_final.ipynb (15 cells)
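For reference, the reported H statistics and effect size follow standard formulas. A stdlib-only sketch on placeholder per-run scores (illustrative values, not the actual run data from prajna_v4_final.ipynb); for k = 3 groups the chi-square survival function with df = 2 simplifies to exp(-H/2):

```python
import math
from itertools import chain

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction; assumes distinct values)."""
    pooled = sorted(chain.from_iterable(groups))
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks
    n = len(pooled)
    return 12 / (n * (n + 1)) * sum(
        len(g) * (sum(rank[v] for v in g) / len(g)) ** 2 for g in groups
    ) - 3 * (n + 1)

def cohens_d(a, b):
    """Cohen's d with pooled sample standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    sp = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))
    return (ma - mb) / sp

# Placeholder per-run composites for three models (illustrative values only).
runs = {
    "claude": [0.59, 0.57, 0.58],
    "gpt":    [0.54, 0.52, 0.53],
    "gemma":  [0.31, 0.29, 0.30],
}
h = kruskal_h(*runs.values())
p = math.exp(-h / 2)  # chi-square survival function with df = k - 1 = 2
d = cohens_d(runs["claude"], runs["gpt"])
```

The real analysis presumably ranks many more observations per model; with only three toy runs per group this is a shape illustration, not a power claim.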

Three headline findings

Finding 1: Universal metacognitive weakness
No frontier model exceeded 0.6 on the Prajna composite. The benchmark's primary thesis is confirmed: this is not a ranking exercise - it exposes a structural deficit across all architectures. Models can narrate their reasoning but cannot reliably detect their own knowledge boundaries.

Finding 2: P > N > A ordering is universal
The axis hierarchy Pramana > Neti-Neti > Adhyasa holds across all 7 models without exception, from Claude Sonnet 4.6 down to Gemma-3-27B. Consistency across architecturally different models suggests a structural property of autoregressive generation rather than a training-specific artifact.

Finding 3: Gemini overconfidence is the standout
Gemini 2.5 Pro claims maximum confidence on 27% of items but achieves only 57.9% accuracy on those items. Claude Sonnet 4.6 and GPT-5.4, by contrast, never emit conf=100 responses. Gemini's Sakshi gap is significant at p = 0.0005: reflection increases confidence without improving accuracy.
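The conf=100 diagnostic in Finding 3 reduces to two conditional counts. A minimal sketch, assuming hypothetical per-item records with `confidence` and `correct` fields (names are illustrative, not from the notebook):

```python
def conf100_stats(items):
    """Rate of maximum-confidence responses and accuracy on those items.

    Each item is a dict with 'confidence' (0-100) and boolean 'correct'
    (hypothetical field names). Returns (rate, accuracy); accuracy is None
    when no item reaches conf=100.
    """
    maxed = [it for it in items if it["confidence"] == 100]
    rate = len(maxed) / len(items)
    accuracy = sum(it["correct"] for it in maxed) / len(maxed) if maxed else None
    return rate, accuracy

# Toy example: 4 of 10 items at conf=100, 3 of those 4 correct.
toy = (
    [{"confidence": 100, "correct": True}] * 3
    + [{"confidence": 100, "correct": False}]
    + [{"confidence": 70, "correct": True}] * 6
)
rate, acc = conf100_stats(toy)
print(rate, acc)  # 0.4 0.75
```

A well-calibrated model would show accuracy near 1.0 on its conf=100 subset; the gap between the two numbers is the overconfidence signal.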

Prajna-Bench · Raman369AI · Generated May 2026