
Building MetaCode: A Metacognitive Benchmark for LLMs

When I first started working with large language models, I was amazed by how often they got things right, and how confidently they got things wrong. There's a fundamental gap in how we evaluate AI: most benchmarks test what models can do, but very few test whether models know what they can't do.

Current AI benchmarks measure crystallized knowledge: what a model has memorized or can derive. But they tell us almost nothing about a model's metacognitive abilities: does it know when to ask for help? Can it recognize when its reasoning has gone off track? Will it double-check its work, or just plow ahead confidently wrong?

These questions matter enormously in real-world deployment. A coding assistant that writes great code 80% of the time but can't recognize that the other 20% is wrong is a liability; you can't trust it to flag its own mistakes.

What I Built

MetaCode is a coding-focused benchmark that measures three distinct metacognitive skills in LLMs:

1. Confidence calibration: does the model's stated confidence match its actual accuracy?
2. Error detection: can the model tell a correct solution from a buggy one?
3. Self-correction: can the model fix its own code when given test-failure feedback?

The benchmark uses 50 Python coding problems from the HumanEval dataset and produces a composite score combining all three skills.
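The exact aggregation isn't spelled out above, so here is a minimal sketch of one way the composite might be computed; the equal weighting and the 1 − ECE normalization are my assumptions, not the benchmark's specification:

```python
# Hypothetical composite score combining the three skill metrics.
# Equal weights and the 1 - ECE normalization are illustrative
# assumptions, not the benchmark's actual formula.

def composite_score(ece, detection_accuracy, self_correction_rate):
    """All inputs lie in [0, 1]; a higher composite is better."""
    calibration = 1.0 - ece  # lower ECE -> better calibration
    return (calibration + detection_accuracy + self_correction_rate) / 3

print(composite_score(0.15, 0.7, 0.5))
```

A weighted average would work just as well if one skill matters more for a given deployment.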


Approaches Considered

Before settling on this design, I considered several approaches.

I chose code for three reasons: (1) test results provide unambiguous ground truth, (2) it's a high-stakes domain where metacognition really matters, and (3) there's existing infrastructure for evaluation.

How It Works

Each task in the benchmark works differently:

Task 1: Confidence Calibration

The model receives a problem description and function signature. It outputs:

{"confidence": 85, "code": "def solve(): ..."}

We sample 5 solutions at temperature > 0, measure the actual pass rate, and compare it to the stated confidence. The metric is Expected Calibration Error (ECE); lower is better.
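As a concrete sketch of the scoring step, here is a standard binned ECE computation over per-problem (stated confidence, measured pass rate) pairs; the function name and bin count are illustrative, not the benchmark's exact code:

```python
# Sketch of Expected Calibration Error (ECE) over per-problem results.
# `results` pairs each problem's stated confidence (0-100, as in the
# JSON output above) with its measured pass rate across repeated runs.

def expected_calibration_error(results, n_bins=10):
    """results: list of (confidence_pct, pass_rate) pairs."""
    bins = [[] for _ in range(n_bins)]
    for conf_pct, pass_rate in results:
        conf = conf_pct / 100.0
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, pass_rate))
    total = len(results)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        avg_acc = sum(a for _, a in bucket) / len(bucket)
        # each bin contributes its |confidence - accuracy| gap,
        # weighted by how many problems fell into it
        ece += len(bucket) / total * abs(avg_conf - avg_acc)
    return ece

print(expected_calibration_error([(85, 0.6), (90, 1.0), (40, 0.2)]))
```

A perfectly calibrated model (confidence always equal to pass rate) scores 0.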

Task 2: Error Detection

The model sees two solutions: one correct, one buggy. It must pick the right one:

{"answer": "A"}

This tests whether the model can diagnose code, not just write it.
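Grading this task reduces to parsing the JSON answer and comparing it to the answer key. A minimal sketch, where the helper name is mine and malformed output is simply counted as a miss:

```python
import json

# Minimal grading sketch for the error-detection task. The JSON shape
# matches the example above; the grading helper itself is illustrative.

def grade_detection(model_output: str, correct_label: str) -> bool:
    """Return True if the model picked the non-buggy solution."""
    try:
        answer = json.loads(model_output).get("answer", "").strip().upper()
    except (json.JSONDecodeError, AttributeError):
        return False  # malformed or non-object output counts as a miss
    return answer == correct_label.upper()

print(grade_detection('{"answer": "A"}', "A"))  # True
```

Task accuracy is then just the fraction of problems graded True.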

Task 3: Self-Correction

Model attempts a problem, gets test failure feedback, and has 3 tries to fix. We measure solve rate and track improvement curve across attempts.
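The attempt loop described above can be sketched as follows; `call_model` and `run_tests` stand in for the real model API and test harness, so treat this as an outline rather than the benchmark's actual driver:

```python
# Illustrative self-correction loop: the model receives test-failure
# feedback and gets up to 3 attempts. `call_model(problem, feedback)`
# and `run_tests(code)` are placeholder hooks, not real APIs.

MAX_ATTEMPTS = 3

def self_correction_loop(problem, call_model, run_tests):
    """Return (solved, attempts_used, per-attempt pass flags)."""
    feedback = None  # no feedback on the first attempt
    history = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        code = call_model(problem, feedback)
        passed, feedback = run_tests(code)
        history.append(passed)
        if passed:
            return True, attempt, history
    return False, MAX_ATTEMPTS, history
```

The `history` list is what lets you plot the improvement curve across attempts.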

Results

In pilot testing, I found some striking patterns.

These results confirm what many suspected but few had measured: even capable models have significant blind spots about their own limitations.

What This Reveals

The novel contribution here is the composite measurement. Previous benchmarks measured one thing: correctness. MetaCode measures how a model relates to its own correctness.

This provides signal that correctness-only benchmarks can't: actionable information about where a model will fail silently, and whether you can trust it to flag its own errors.

What's Next

This benchmark was built for the Kaggle Measuring AGI Challenge (Metacognition Track). But the approach extends to other domains: medical reasoning, legal analysis, financial forecasting. Anywhere high-stakes decisions are made, metacognition matters.

The code is open source: github.com/gbose01/kaggle-metacognition-benchmark


Built for the Kaggle Measuring AGI Challenge, Metacognition Track.