The $5, $100, and $15,000 Ways to Win a Kaggle Reasoning Competition

Source code: github.com/gbose01/nemotron_kaggle

Kaggle reasoning competitions have three realistic budget tiers: about $5, about $100, and about $15,000. I learned that mapping after fine-tuning NVIDIA's Nemotron-3-Nano-30B for the Model Reasoning Challenge and landing a 0.62 score while the leaderboard topped out around 0.90.

That 0.28 gap looks like a skill problem until you price out what the top teams actually did. I spent roughly $5 on a RunPod RTX 4090, shipped a valid LoRA adapter, and proved the full pipeline works. Closing the rest of the gap is mostly a data-generation and training-spend problem.

From Benchmarking to Fine-Tuning

After building MetaCode, a Kaggle benchmark for measuring whether models know what they don't know, I wanted the other half of the loop: actually updating model weights to improve reasoning.

The Nemotron challenge is a fine-tuning competition, not an API bake-off. You start from NVIDIA/Nemotron-3-Nano-30B-Base, enrich training rows with chain-of-thought traces, and submit a LoRA adapter ZIP (rank ≤ 32). Kaggle scores you on \boxed{Answer} extraction with exact or numeric-tolerance matching.

A raw training row like "What is the sum of the first five prime numbers?" with answer 28 becomes a trace like: "The first five primes are 2, 3, 5, 7, and 11. Adding them: 2 + 3 + 5 + 7 + 11 = 28. \boxed{28}"

I used Cursor agents to build most of the repo: generate_sft_dataset.py for trace generation, data_cleaner.py for the CSV, train_qlora.py for 4-bit QLoRA SFT, and pack_submission.py to zip the adapter for upload.

What Broke Before the $5 Run

The training recipe was simple. The environment work was not.

Attempt 1: local Windows. Dependency hell. bitsandbytes, mamba-ssm, and a 30B gated model do not play nicely on a laptop. I burned days on CUDA version mismatches and OOM errors before admitting this was a cloud problem.

Attempt 2: Kaggle Blackwell, internet off. The competition pushes you toward Kaggle's RTX PRO 6000 Blackwell pods, but internet is disabled on that accelerator. That means pre-staging wheels, mocking CUTLASS imports, patching Triton for Blackwell, and praying mamba_ssm imports cleanly. My blackwell_env.py file exists because of this path. I got smoke tests working, but the setup tax is brutal for a side project.

Attempt 3: RunPod, internet on. This is what actually shipped. A single RTX 4090 for a few hours, with two fixes that mattered: pin torch==2.4.1 and triton==3.0.0 so mamba-ssm builds without upgrading PyTorch, and set HF_HOME=/workspace/hf_cache so the 60 GB model download lands on the volume disk instead of the tiny container partition.

Total compute bill: about $5.

python src/train_qlora.py --config src/train_config.yaml
python src/pack_submission.py \
  --source outputs/nemotron_lora_adapter \
  --output submission.zip
python src/verify_submission.py --submission submission.zip

The real win was not the score. It was a submission.zip that passed verification and scored on the leaderboard at all.

The Scoreboard Is a Budget Map

My final score was 0.62. The top of the board was around 0.90. Once I stopped treating that gap as a mystery and started pricing out what each tier buys, the competition got legible.

Leaderboard score is mostly a function of how much synthetic reasoning data you can afford to generate, not how clever your training script is.

Tier 1: Learner (~$5)

This was my run. Open-source datasets, minimal compute, one goal: prove the pipeline end to end.

I trained 4-bit QLoRA on a small SFT JSONL sample, packed the adapter, and submitted. You learn data enrichment, chat-template formatting, submission layout, and where the eval harness is picky (single backslash \boxed{}, tokenizer files bundled with the adapter). You are not buying leaderboard position. You are buying fluency in the mechanics.

Tier 2: Solo competitor (~$100)

If I had another hundred dollars, I would not spend it on more hyperparameter fishing. I would spend it on curation and one serious SFT run on an A100.

The lever I would test first is test-time compute. Kaggle gives you free inference at scoring time. My repo already has consensus_inference.py for multi-sample voting: generate 15 answers per problem, majority-vote the \boxed{} value. That buys accuracy without another training dollar. I did not push this hard on my $5 run, but it is the highest-ROI move I see at this tier.

Tier 3: Leaderboard factory (~$15,000)

The 0.90 scores come from a different game. Roughly half the budget goes to a data factory: rent 8× H100 clusters, run a 340B teacher model, and generate hundreds of thousands of synthetic reasoning traces. Then full-parameter or high-rank fine-tuning, STaR-style self-play loops, and iteration until the boxed-answer metric stops moving.

At that tier, infrastructure stops being the bottleneck because you throw money at it. The differentiator is data quality at scale, plus the budget to regenerate it when the metric plateaus.

What I Took Away

I build evaluation systems at Google and side projects like this in Cursor. The pattern is the same: the hard part is rarely "can we access a frontier model?" It is getting clean data through a reliable pipeline, then deciding how much compute each marginal accuracy point is worth.

MetaCode taught me how to measure reasoning limits. This competition taught me what it costs to move them. For a learner-tier budget, that is the right trade.