Prajna-Bench - Results Summary

Google DeepMind x Kaggle AGI Hackathon · Metacognition Track · April 2026 · prajna_v4_final.ipynb

Models evaluated:    7 (frontier + open-weight)
Items per model:     100 (3 independent runs)
Kruskal-Wallis H:    20.55 (p = 0.0022 **)
Max Prajna score:    0.58 (all models below 0.6)

Leaderboard

#  Model              Provider       Prajna  Pramana  Neti-Neti  Adhyasa  Sakshi  Conf=100  Tier
1  Claude Sonnet 4.6  Anthropic      0.58    ~0.65    ~0.40      ~0.34    42%     0%        Tier 1
2  GPT-5.4            OpenAI         0.53    ~0.60    ~0.38      ~0.30    33%     0%        Tier 2
3  Qwen3-235B         Alibaba        0.50    ~0.58    ~0.36      ~0.28    31%     8%        Tier 2
4  Gemini 2.5 Pro     Google         0.48    ~0.55    ~0.34      ~0.26    12%     27%       Tier 2
5  DeepSeek R1        DeepSeek       0.46    ~0.53    ~0.33      ~0.24    40%     14%       Tier 2
6  Gemini 2.5 Flash   Google         0.38    ~0.45    ~0.27      ~0.20    27%     26%       Tier 3
7  Gemma-3-27B        Google (open)  0.30    ~0.36    ~0.21      ~0.16    23%     5%        Tier 3

Conf=100 = rate of maximum-confidence responses. Sakshi = genuine self-correction rate in phase 2. Per-axis scores are approximated from the reported composites and ordinal patterns.

P > N > A hierarchy - all models

Pramana > Neti-Neti > Adhyasa ordering holds across all 7 models without exception, suggesting a structural property of autoregressive generation.

Claude Sonnet 4.6:  P ~.65 > N ~.40 > A ~.34
GPT-5.4:            P ~.60 > N ~.38 > A ~.30
Gemini 2.5 Pro:     P ~.55 > N ~.34 > A ~.26
DeepSeek R1:        P ~.53 > N ~.33 > A ~.24
Gemma-3-27B:        P ~.36 > N ~.21 > A ~.16
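The ordering claim can be checked mechanically. A minimal sketch using the approximate per-axis scores from the leaderboard:

```python
# Approximate per-axis scores (Pramana, Neti-Neti, Adhyasa) from the leaderboard.
axis_scores = {
    "Claude Sonnet 4.6": (0.65, 0.40, 0.34),
    "GPT-5.4":           (0.60, 0.38, 0.30),
    "Qwen3-235B":        (0.58, 0.36, 0.28),
    "Gemini 2.5 Pro":    (0.55, 0.34, 0.26),
    "DeepSeek R1":       (0.53, 0.33, 0.24),
    "Gemini 2.5 Flash":  (0.45, 0.27, 0.20),
    "Gemma-3-27B":       (0.36, 0.21, 0.16),
}

# The hierarchy holds iff every (P, N, A) tuple is strictly decreasing.
ordering_holds = all(p > n > a for p, n, a in axis_scores.values())
print(ordering_holds)  # True: P > N > A for all 7 models
```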

Sakshi self-correction rates

After phase 2 reflection, what fraction of phase 1 errors were genuinely corrected? Low rates indicate self-persuasion: confidence rises without accuracy improvement.

Claude Sonnet 4.6:  42%
DeepSeek R1:        40%
GPT-5.4:            33%
Qwen3-235B:         31%
Gemini 2.5 Flash:   27%
Gemma-3-27B:        23%
Gemini 2.5 Pro:     12%
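As defined above, the Sakshi rate conditions on phase-1 errors only. A minimal sketch, assuming hypothetical per-item records with `phase1_correct` and `phase2_correct` flags (field names are illustrative, not from the notebook):

```python
def sakshi_rate(items):
    """Fraction of phase-1 errors that are correct after phase-2 reflection.

    Each item is a dict with boolean 'phase1_correct' and 'phase2_correct'
    flags (hypothetical field names).
    """
    errors = [it for it in items if not it["phase1_correct"]]
    if not errors:
        return 0.0
    corrected = sum(it["phase2_correct"] for it in errors)
    return corrected / len(errors)

# Toy example: 4 phase-1 errors, 2 genuinely corrected -> rate 0.5.
toy = [
    {"phase1_correct": True,  "phase2_correct": True},
    {"phase1_correct": False, "phase2_correct": True},
    {"phase1_correct": False, "phase2_correct": False},
    {"phase1_correct": False, "phase2_correct": True},
    {"phase1_correct": False, "phase2_correct": False},
]
print(sakshi_rate(toy))  # 0.5
```

Note that items the model already got right in phase 1 do not enter the denominator, which is what lets confidence rise without the rate moving.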
Statistical tests and evaluation setup

Kruskal-Wallis H (composite):   H = 20.55, p = 0.0022 **
Adhyasa axis H:                 H = 37+, p < 0.001 ***
Neti-Neti axis H:               H = 15.14, p < 0.001 ***
Cohen's d (Claude vs GPT-5.4):  d = 0.617 (medium)
Gemini overconfidence rate:     26-27% conf=100
Gemini accuracy @ conf=100:     57.9%
Gemini Sakshi gap:              p = 0.0005
Rotating judges (3 total):      Opus 4.6, Gemini 3.1 Pro, GPT-5.4
Self-judging rule:              provider excluded from its own evaluation
Runs per model:                 3 independent runs
Benchmark items:                100 structured items
Axes evaluated:                 Pramana, Neti-Neti, Adhyasa, Sakshi
Notebook:                       prajna_v4_final.ipynb (15 cells)
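For reference, the reported H statistics and effect size follow standard formulas. A stdlib-only sketch on placeholder per-run scores (illustrative values, not the actual run data from prajna_v4_final.ipynb); for k = 3 groups the chi-square survival function with df = 2 simplifies to exp(-H/2):

```python
import math
from itertools import chain

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction; assumes distinct values)."""
    pooled = sorted(chain.from_iterable(groups))
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks
    n = len(pooled)
    return 12 / (n * (n + 1)) * sum(
        len(g) * (sum(rank[v] for v in g) / len(g)) ** 2 for g in groups
    ) - 3 * (n + 1)

def cohens_d(a, b):
    """Cohen's d with pooled sample standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    sp = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))
    return (ma - mb) / sp

# Placeholder per-run composites for three models (illustrative values only).
runs = {
    "claude": [0.59, 0.57, 0.58],
    "gpt":    [0.54, 0.52, 0.53],
    "gemma":  [0.31, 0.29, 0.30],
}
h = kruskal_h(*runs.values())
p = math.exp(-h / 2)  # chi-square survival function with df = k - 1 = 2
d = cohens_d(runs["claude"], runs["gpt"])
```

The real analysis presumably ranks many more observations per model; with only three toy runs per group this is a shape illustration, not a power claim.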

Three headline findings

Finding 1: Universal metacognitive weakness
No frontier model exceeded 0.6 on the Prajna composite. The benchmark's primary thesis is confirmed: this is not a ranking exercise - it exposes a structural deficit across all architectures. Models can narrate their reasoning but cannot reliably detect their own knowledge boundaries.

Finding 2: P > N > A ordering is universal
The axis hierarchy Pramana > Neti-Neti > Adhyasa holds across all 7 models without exception, from Claude Sonnet 4.6 down to Gemma-3-27B. Consistency across architecturally different models suggests a structural property of autoregressive generation rather than a training-specific artifact.

Finding 3: Gemini overconfidence is the standout
Gemini 2.5 Pro claims maximum confidence on 27% of items but achieves only 57.9% accuracy on those items. Claude Sonnet 4.6 and GPT-5.4, by contrast, never emit conf=100 responses. Gemini's Sakshi gap is significant at p = 0.0005: reflection increases confidence without improving accuracy.
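The conf=100 diagnostic in Finding 3 reduces to two conditional counts. A minimal sketch, assuming hypothetical per-item records with `confidence` and `correct` fields (names are illustrative, not from the notebook):

```python
def conf100_stats(items):
    """Rate of maximum-confidence responses and accuracy on those items.

    Each item is a dict with 'confidence' (0-100) and boolean 'correct'
    (hypothetical field names). Returns (rate, accuracy); accuracy is None
    when no item reaches conf=100.
    """
    maxed = [it for it in items if it["confidence"] == 100]
    rate = len(maxed) / len(items)
    accuracy = sum(it["correct"] for it in maxed) / len(maxed) if maxed else None
    return rate, accuracy

# Toy example: 4 of 10 items at conf=100, 3 of those 4 correct.
toy = (
    [{"confidence": 100, "correct": True}] * 3
    + [{"confidence": 100, "correct": False}]
    + [{"confidence": 70, "correct": True}] * 6
)
rate, acc = conf100_stats(toy)
print(rate, acc)  # 0.4 0.75
```

A well-calibrated model would show accuracy near 1.0 on its conf=100 subset; the gap between the two numbers is the overconfidence signal.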

Prajna-Bench · Raman369AI · Generated May 2026