When I first started working with large language models, I was amazed by how often they got things right, and how confidently they got things wrong. There's a fundamental gap in how we evaluate AI: most benchmarks test what models can do, but very few test whether models know what they can't do.
What I Built
MetaCode is a coding-focused benchmark that measures three distinct metacognitive skills in LLMs:
- Confidence Calibration: Does the model know if its answer is right? It states a confidence (0-100%), and we compare that against empirical accuracy.
- Error Detection: Can the model identify the correct solution when shown two options (one right, one buggy)?
- Self-Correction: Can the model fix its own code when given test failure feedback? We give it 3 attempts.
The benchmark uses 50 Python coding problems from the HumanEval dataset, and produces a composite score combining all three skills.
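Conceptually, the composite folds the three skills into a single number in [0, 1]. The sketch below uses an equal-weighted average, with calibration entered as one minus the calibration error (defined under Task 1 below) so that higher is always better; the weighting here is illustrative, not a fixed part of the benchmark.

```python
def composite_score(ece: float, detection_accuracy: float, self_correction_rate: float) -> float:
    """Fold the three metacognition skills into one score in [0, 1].

    ece                  -- Expected Calibration Error from Task 1 (lower is better)
    detection_accuracy   -- fraction of Task 2 pairs where the correct solution was chosen
    self_correction_rate -- fraction of Task 3 problems solved within 3 attempts
    The equal weighting is illustrative, not a fixed part of the benchmark.
    """
    calibration = 1.0 - ece  # flip so that higher means better calibrated
    return (calibration + detection_accuracy + self_correction_rate) / 3.0
```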
Why This Matters
Current AI benchmarks are great at measuring crystallized knowledge: what a model has memorized or can derive. But they tell us almost nothing about a model's metacognitive abilities:
- Does it know when to ask for help?
- Can it recognize when its reasoning has gone off track?
- Will it double-check its work, or just plow ahead confidently wrong?
These questions matter enormously in real-world deployment. A coding assistant that writes great code 80% of the time but can't recognize when the other 20% is wrong is a liability: you can't trust it to flag its own mistakes.
Approaches Considered
Before settling on this design, I considered several approaches:
- Verbal confidence only: Just ask the model "how confident are you?" This is too easy to game. Models can say "90%" without any real calibration.
- Single-try benchmarks: Measure whether the model gets it right or wrong. Doesn't capture self-monitoring at all.
- Generic QA datasets: Using general knowledge questions instead of code. But code is higher-stakes: bugs have real consequences, and tests provide ground truth.
I chose code for three reasons: (1) test results provide unambiguous ground truth, (2) it's a high-stakes domain where metacognition really matters, and (3) there's existing infrastructure for evaluation.
How It Works
The benchmark is split into three tasks, each probing a different skill:
Task 1: Confidence Calibration
The model receives a problem description and function signature. It outputs:
{"confidence": 85, "code": "def solve(): ..."}
We sample the model 5 times at temperature > 0, run each completion against the tests to get an empirical accuracy, and compare that to the stated confidence. The metric is Expected Calibration Error (ECE); lower is better.
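For concreteness, here is a minimal ECE implementation over per-problem (confidence, accuracy) pairs. It uses the standard 10 equal-width bins and assumes the stated 0-100% confidence has already been rescaled to [0, 1]; the bin count is a conventional choice, not something the benchmark fixes.

```python
import numpy as np

def expected_calibration_error(confidences, accuracies, n_bins: int = 10) -> float:
    """Binned ECE: the bin-weighted mean gap between stated confidence and observed accuracy.

    confidences -- stated confidence per problem, rescaled to [0, 1]
    accuracies  -- observed pass rate per problem (e.g. over the 5 sampled runs), in [0, 1]
    """
    confidences = np.asarray(confidences, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    # assign each problem to one of n_bins equal-width confidence bins
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - accuracies[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the fraction of problems in the bin
    return float(ece)
```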
Task 2: Error Detection
The model sees two solutions: one correct, one buggy. It must pick the right one:
{"answer": "A"}
This tests whether the model can diagnose code, not just write it.
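Grading this task is mostly plumbing: build the A/B pair, parse the model's JSON reply, and compare against the answer key. A sketch is below; randomizing which slot holds the correct solution guards against position bias, and the function names are illustrative rather than taken from the repo.

```python
import json
import random

def make_detection_item(correct_code: str, buggy_code: str):
    """Randomly assign the correct solution to slot A or B (guards against position bias)."""
    key = random.choice(["A", "B"])
    pair = {"A": correct_code, "B": buggy_code} if key == "A" else {"A": buggy_code, "B": correct_code}
    return pair, key

def grade_detection(model_output: str, key: str) -> bool:
    """Parse the model's {"answer": "A"} reply and compare it to the answer key."""
    try:
        answer = json.loads(model_output).get("answer", "").strip().upper()
    except (json.JSONDecodeError, AttributeError):
        return False  # unparseable output counts as wrong
    return answer == key
```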
Task 3: Self-Correction
The model attempts a problem, gets test-failure feedback, and has 3 tries to fix it. We measure the solve rate and track the improvement curve across attempts.
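The evaluation loop for this task looks roughly like the sketch below. `generate_fix` is a placeholder for the model call, and feeding the raw test output back as feedback is an assumption about the prompt format; the general shape (attempt, run tests, feed back the failure, repeat up to 3 times) is the point.

```python
import subprocess
import tempfile

def run_tests(code: str, test_code: str, timeout: int = 10):
    """Write candidate code plus a runnable test snippet to a temp file, run it, return (passed, output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return proc.returncode == 0, proc.stdout + proc.stderr

def self_correction_trial(problem: dict, generate_fix, max_attempts: int = 3):
    """Give the model up to max_attempts tries, feeding test output back after each failure.

    problem      -- dict with the task prompt and a runnable test snippet under "test"
    generate_fix -- placeholder: generate_fix(problem, feedback) -> candidate code string
    Returns the attempt number that passed, or None if every attempt failed.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate_fix(problem, feedback)
        passed, output = run_tests(code, problem["test"])
        if passed:
            return attempt
        feedback = output  # the failure trace becomes feedback for the next attempt
    return None
```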
Results
In pilot testing, I found some striking patterns:
- Models are overconfident: Stated confidence exceeded actual accuracy by 15-20% across most models.
- Error detection is harder than solving: Models that could write correct code often failed to identify bugs in alternatives.
- Self-correction plateaus early: Most fixes happen in attempts 1-2. After that, models tend to spiral rather than improve.
These results confirm what many suspected but few had measured: even capable models have significant blind spots about their own limitations.
What This Reveals
The novel contribution here is the composite measurement. Previous benchmarks measured one thing: correctness. MetaCode measures how a model relates to its own correctness:
- Does it know what it knows?
- Can it find its own bugs?
- Will it course-correct under pressure?
This provides signal that correctness-only benchmarks can't: actionable information about where a model will fail silently, and whether you can trust it to flag its own errors.
What's Next
This benchmark was built for the Kaggle Measuring AGI Challenge (Metacognition Track). But the approach extends to other domains: medical reasoning, legal analysis, financial forecasting. Wherever high-stakes decisions are made, metacognition matters.
The code is open source: github.com/gbose01/kaggle-metacognition-benchmark
Built for the Kaggle Measuring AGI Challenge, Metacognition Track.