By Collin Lee | March 2026
Every March, tens of millions of Americans fill out a bracket. And every March, most of us do it the same way: we pick our alma mater to go further than they should, we ride the hot team we watched last weekend, and we overthink that 5-12 upset because someone said it feels like an upset year.
I wanted to know what happens when you take all of that out. No gut feelings. No revenge narratives. No "they just look like a Final Four team." Just data, algorithms, and a system that doesn't care whether Duke is wearing cool shoes this year.
So I built one.
Update: The Round of 64 is done. 25 of 32 correct. See the full breakdown.
The system is straightforward in concept, obsessive in execution. It blends two models: a logistic regression built on efficiency and four-factors fundamentals, and an XGBoost model that can learn more complex interactions between those same inputs.
The two models vote on every game, blended 30% logistic / 70% XGBoost. That ratio was found by searching thousands of combinations on held-out tournament data - another decision I didn't make by hand.
The system pulls ratings from Torvik, KenPom, and Evan Miya - three of the most respected advanced stats systems in college basketball - plus the Massey Composite, which averages 50+ different ranking systems together. All sourced across 18 seasons of data.
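For the curious, here's a minimal sketch of what that blend looks like in code. The model objects and feature layout are placeholders rather than the actual pipeline; the only detail taken from the article is the 30/70 weighting.

```python
# Minimal sketch of the 30/70 ensemble blend. logit_model and xgb_model are
# assumed to be fitted scikit-learn-style classifiers; the feature row would
# be built from the Torvik / KenPom / Evan Miya / Massey ratings described above.
import numpy as np

LOGIT_WEIGHT = 0.30   # weight on the logistic regression
XGB_WEIGHT = 0.70     # weight on XGBoost

def blended_win_prob(features: np.ndarray, logit_model, xgb_model) -> float:
    """Probability that Team A beats Team B for one matchup's feature row."""
    p_logit = logit_model.predict_proba(features.reshape(1, -1))[0, 1]
    p_xgb = xgb_model.predict_proba(features.reshape(1, -1))[0, 1]
    return LOGIT_WEIGHT * p_logit + XGB_WEIGHT * p_xgb
```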
For 2026, the emotionless bracket is... kind of boring. All four 1-seeds make the Final Four. Seven upsets total, all mild: four 5-over-4 matchups, two 9-over-8 flips, and one 10-over-7. No Cinderellas. No chaos.
The most interesting picks aren't the upsets - they're the coin flips where the model has a slight conviction. It picks 9-seed Iowa over 8-seed Clemson at 84%, seeing something in the Hawkeyes' numbers that the seeding committee didn't. It gives 10-seed Santa Clara a 53% edge over 7-seed Kentucky - technically an upset but really a toss-up where the data slightly favors the "underdog." It has Duke winning every East Region game at 83% or better, capped by a 93% verdict over UConn to take the East.
Michigan dominates the Midwest. Florida takes the South. Arizona runs the West.
Championship: Michigan over Duke, 79-78 - a one-point margin even in the model's own predicted scores.
No narrative. No drama. Just numbers.
Ensemble model: 30% logistic regression + 70% XGBoost. Upsets marked in bold.
**East Region (Round of 64)**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Duke | (16) Siena | 99% | 77-56 |
| (8) Ohio St. | (9) TCU | 52% | 77-76 |
| (5) St. John's | (12) Northern Iowa | 96% | 72-65 |
| (4) Kansas | (13) Cal Baptist | 94% | 76-61 |
| (6) Louisville | (11) South Florida | 78% | 90-79 |
| (3) Michigan St. | (14) N. Dakota St. | 98% | 80-66 |
| (7) UCLA | (10) UCF | 81% | 80-73 |
| (2) Connecticut | (15) Furman | 99% | 80-56 |

**West Region (Round of 64)**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Arizona | (16) LIU | 96% | 96-63 |
| **(9) Utah St.** | (8) Villanova | 69% | 74-72 |
| (5) Wisconsin | (12) High Point | 96% | 88-72 |
| (4) Arkansas | (13) Hawaii | 96% | 89-71 |
| (6) BYU | (11) Texas | 76% | 81-80 |
| (3) Gonzaga | (14) Kennesaw St. | 98% | 92-69 |
| (7) Miami FL | (10) Missouri | 68% | 81-76 |
| (2) Purdue | (15) Queens | 99% | 91-69 |

**South Region (Round of 64)**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Florida | (16) Prairie View A&M | 99% | 95-68 |
| **(9) Iowa** | (8) Clemson | 84% | 73-68 |
| (5) Vanderbilt | (12) McNeese St. | 92% | 82-68 |
| (4) Nebraska | (13) Troy | 97% | 77-59 |
| (6) North Carolina | (11) VCU | 59% | 79-77 |
| (3) Illinois | (14) Penn | 98% | 89-62 |
| (7) Saint Mary's | (10) Texas A&M | 64% | 76-75 |
| (2) Houston | (15) Idaho | 99% | 79-59 |

**Midwest Region (Round of 64)**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Michigan | (16) Howard | 99% | 101-61 |
| (8) Georgia | (9) Saint Louis | 50% | 90-88 |
| (5) Texas Tech | (12) Akron | 90% | 86-73 |
| (4) Alabama | (13) Hofstra | 88% | 86-76 |
| (6) Tennessee | (11) Miami OH | 97% | 84-71 |
| (3) Virginia | (14) Wright St. | 98% | 83-66 |
| **(10) Santa Clara** | (7) Kentucky | 53% | 83-82 |
| (2) Iowa St. | (15) Tennessee St. | 97% | 88-65 |

**Round of 32**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Duke | (8) Ohio St. | 95% | 80-71 |
| **(5) St. John's** | (4) Kansas | 73% | 78-76 |
| (3) Michigan St. | (6) Louisville | 73% | 78-76 |
| (2) Connecticut | (7) UCLA | 78% | 75-70 |
| (1) Arizona | (9) Utah St. | 96% | 89-73 |
| **(5) Wisconsin** | (4) Arkansas | 62% | 87-86 |
| (3) Gonzaga | (6) BYU | 85% | 88-83 |
| (2) Purdue | (7) Miami FL | 91% | 80-69 |
| (1) Florida | (9) Iowa | 91% | 81-72 |
| **(5) Vanderbilt** | (4) Nebraska | 72% | 78-76 |
| (3) Illinois | (6) North Carolina | 93% | 82-69 |
| (2) Houston | (7) Saint Mary's | 90% | 76-67 |
| (1) Michigan | (8) Georgia | 96% | 96-77 |
| **(5) Texas Tech** | (4) Alabama | 51% | 87-85 |
| (3) Virginia | (6) Tennessee | 73% | 73-72 |
| (2) Iowa St. | (10) Santa Clara | 90% | 83-78 |

**Sweet 16**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Duke | (5) St. John's | 83% | 78-72 |
| (2) Connecticut | (3) Michigan St. | 75% | 75-74 |
| (1) Arizona | (5) Wisconsin | 93% | 87-76 |
| (2) Purdue | (3) Gonzaga | 75% | 79-74 |
| (1) Florida | (5) Vanderbilt | 79% | 88-80 |
| (2) Houston | (3) Illinois | 72% | 72-72 |
| (1) Michigan | (5) Texas Tech | 93% | 84-71 |
| (2) Iowa St. | (3) Virginia | 64% | 74-74 |

**Elite Eight**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Duke | (2) Connecticut | 93% | 77-65 |
| (1) Arizona | (2) Purdue | 81% | 85-71 |
| (1) Florida | (2) Houston | 64% | 76-73 |
| (1) Michigan | (2) Iowa St. | 87% | 84-74 |

**Final Four**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Duke | (1) Florida | 73% | 80-72 |
| (1) Michigan | (1) Arizona | 66% | 90-86 |

**Championship**

| Winner | Loser | Prob | Score |
|---|---|---|---|
| (1) Michigan | (1) Duke | 80% | 79-78 |

**Champion: Michigan.** Predicted final: Michigan 79, Duke 78 (combined total: 157).
The Round of 64 is complete. See the report card for every game — what the model got right, what it got wrong, and whether the cascade effect is already working against us.
Here's where it gets interesting. Before I trusted the machine with a real bracket, I ran 15 experiments to understand how it thinks and where it breaks. Some of the findings changed how I think about brackets entirely.
Removing emotion from the bracket also meant removing ego from the modeling process. I had to accept that most of my clever ideas would fail - and set acceptance criteria before running each experiment so I couldn't talk myself into adopting marginal results.
One finding stopped me cold. I analyzed five years of brackets and found that 76.5% of Sweet 16-and-beyond misses aren't the model picking the wrong winner - they're the wrong team being in the game in the first place.
Think about it: you pick a 3-seed to make the Elite Eight. They lose in the first round. Now your Elite Eight pick is automatically wrong, your Final Four pick might be wrong, and your champion pick could be dead. One miss in round one can cascade into five dead picks downstream.
When I tested what the model would score if the correct teams always advanced (an "oracle" bracket), accuracy jumped from 52.2% to 73.6%. The model is actually much better at picking winners than a bracket's score suggests. The bracket format just punishes early mistakes brutally.
The practical takeaway: obsessing over your Final Four picks matters way less than getting the first two rounds right. Every correct Round of 64 pick protects the five downstream picks that depend on it.
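Here's a rough sketch of how that miss classification can be computed. The data structures are hypothetical stand-ins for the real bracket bookkeeping; the logic just separates "right matchup, wrong winner" from "the team never got there."

```python
# Sketch of the "cascade" diagnosis for Sweet 16-and-later slots, assuming
# hypothetical dictionaries that map each bracket slot to the predicted and
# actual matchup and winner.
def classify_misses(predicted, actual):
    """predicted/actual: {slot: {"teams": {a, b}, "winner": name}}.
    Returns counts of the two miss types."""
    wrong_winner = wrong_participant = 0
    for slot, pred in predicted.items():
        act = actual[slot]
        if pred["winner"] == act["winner"]:
            continue                      # pick was correct
        if pred["winner"] in act["teams"]:
            wrong_winner += 1             # right matchup, wrong side
        else:
            wrong_participant += 1        # cascade: team never made the game
    return wrong_winner, wrong_participant
```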
Here's a finding that surprised me: the optimal blend of models shifts as the tournament progresses.
In the first round, the logistic regression model - the boring one that just looks at efficiency and four-factors - actually pulls its weight. When a 1-seed plays a 16-seed, you don't need fancy machine learning. The fundamentals tell the whole story.
But by the Final Four? The logistic model actively hurts predictions. I tested this by hiding one tournament year at a time and checking whether the pattern held when trained on the rest - 17 folds in all. Pure XGBoost was optimal for the Final Four and Championship in every single fold. Not most of the time. Not 16 out of 17. All 17 out of 17.
Why? When Duke plays Michigan in a championship game, both teams have elite efficiency ratings. The simple features converge. But XGBoost can pick up complex interactions - combinations of shooting tendencies, tempo mismatches, rebounding advantages - that a simpler model can't see.
So the machine uses a 70/30 XGBoost/logistic blend for early rounds but switches to 100% XGBoost for the Final Four and championship. It's the equivalent of switching from your reading glasses to a microscope when the matchups get tighter.
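A minimal sketch of that per-round search, assuming a placeholder `evaluate` function that retrains the blend on the given seasons and returns accuracy on the held-out one. Everything else here is illustrative scaffolding.

```python
# Leave-one-season-out search for the best XGBoost weight (alpha) per round.
# evaluate(round_name, alpha, train_seasons, test_season) is a stand-in for
# the real backtest.
import numpy as np

def best_alpha_per_round(seasons, rounds, alphas, evaluate):
    best = {}
    for rnd in rounds:
        mean_acc = []
        for alpha in alphas:                      # alpha = weight on XGBoost
            fold_acc = [
                evaluate(rnd, alpha, [s for s in seasons if s != held_out], held_out)
                for held_out in seasons           # hold out one season at a time
            ]
            mean_acc.append(np.mean(fold_acc))
        best[rnd] = alphas[int(np.argmax(mean_acc))]
    return best

# e.g. best_alpha_per_round(tournament_years, ["R64", "R32", "S16", "E8", "F4", "Champ"],
#                           [0.5, 0.7, 0.9, 1.0], evaluate)
```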
Here's a partial list of things that sounded smart and turned out to be useless:
Monte Carlo simulation for bracket optimization: I simulated 200,000 complete brackets per season to compute path probabilities, then tried to optimize picks for maximum expected ESPN points instead of just picking game-by-game favorites. Result: +1.8% improvement. Basically noise. The simple approach of "pick the team more likely to win each game" is already near-optimal. (A rough sketch of the simulation loop appears after this list.)
A neural network for the Final Four: If XGBoost is good at late-round predictions, maybe a neural network could find even deeper patterns? The neural net hit 85.2% accuracy on Final Four games; XGBoost hit 83.3%. Sounds like a win - but with only 54 Final Four games in the entire dataset, that 1.9 percentage point difference is well within noise. Neither model can reliably beat the other at this sample size.
Temporal weighting - caring more about late-season games: The theory is intuitive. A team's play in February should matter more than their performance in November. So I weighted recent games more heavily in training. No improvement. Turns out the season-level ratings from Torvik, KenPom, and Evan Miya already account for this. I was adding a second recency adjustment on top of systems that already had one.
Per-round tuning at the bracket level: Maybe the model should use different blends for each round? I searched 1,296 configurations. The best one improved scores by 1.5%. Not enough to trust it wasn't just overfitting to five test seasons.
Each one failed its pre-set acceptance criteria. And each one taught me something about why the simple approach is hard to beat.
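For reference, here's roughly what the simulation loop from the first item looks like. The `win_prob` function is a placeholder for the blended model, and the scoring follows ESPN's standard 10-20-40-80-160-320 points per correct pick by round; the rest is illustrative.

```python
# Sketch of the Monte Carlo idea: sample complete tournaments from the model's
# pairwise win probabilities, then score a candidate bracket against each
# simulated outcome.
import random

ROUND_POINTS = [10, 20, 40, 80, 160, 320]   # ESPN scoring by round

def simulate_bracket(field, win_prob):
    """field: list of 64 team names in bracket order. Returns the advancing
    teams round by round for one simulated tournament."""
    results, current = [], field
    while len(current) > 1:
        nxt = []
        for i in range(0, len(current), 2):
            a, b = current[i], current[i + 1]
            nxt.append(a if random.random() < win_prob(a, b) else b)
        results.append(nxt)
        current = nxt
    return results

def expected_points(picks, field, win_prob, n_sims=10_000):
    """picks: same round-by-round structure as simulate_bracket's output."""
    total = 0
    for _ in range(n_sims):
        sim = simulate_bracket(field, win_prob)
        for rnd, (picked, actual) in enumerate(zip(picks, sim)):
            total += ROUND_POINTS[rnd] * sum(p == a for p, a in zip(picked, actual))
    return total / n_sims
```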
Here's the part that would have been impossible a year ago: I didn't write most of this code by hand.
The entire experiment pipeline was built collaboratively with a coding agent - Claude from Anthropic, running as a command-line development partner. I'd describe what I wanted to test ("run a leave-one-season-out cross-validation on a neural net for Final Four games, sweep these architectures, and set the acceptance bar at 90%"), and the agent would write the experiment script, wire up the data loading, handle the Kaggle team name mismatches, and run the backtest. When results came back, I'd review the code and the numbers, decide what to try next, and the agent would build the next experiment.
The traditional workflow for this kind of research is: think of a hypothesis, spend a day writing boilerplate, debug data pipeline issues, run the experiment, analyze results, repeat. With a coding agent, the cycle collapsed. The boilerplate writes itself. The team-name-matching bugs that would normally eat an afternoon get caught by the agent cross-referencing our alias tables. I'd describe an experiment at 10am and have validated results by lunch.
This changed the kind of research I could do. Instead of committing to one or two big ideas and hoping they panned out, I could cheaply test a dozen hypotheses with proper acceptance criteria, rigorous cross-validation, and automated code review after each one. The 1,296-configuration grid search for per-round alpha tuning? That would have been a week of manual work. With the agent, it was an afternoon including the code review.
The irony isn't lost on me: I used an AI to build a system that removes emotion from bracket prediction. The AI didn't have opinions about which experiments to run - it just executed them cleanly, flagged when something looked off (like the seed canonicalization bug it caught during a code review), and moved on to the next one. It turns out that removing emotion from the process of building the model is just as valuable as removing emotion from the bracket itself.
During the temporal weighting experiment, my initial results showed a 3.1% improvement - enough to consider adopting. But something felt off. I dug in and realized I'd accidentally trained the baseline on a smaller dataset (12 seasons instead of 17). The "improvement" wasn't temporal weighting being good - it was the baseline being artificially weak.
When I matched the datasets properly, the improvement vanished.
This is the kind of mistake that's easy to make and hard to catch without rigorous methodology. How many sports models out there are built on exactly this kind of subtle baseline mismatch?
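One cheap safeguard, sketched below with hypothetical experiment objects, is to refuse to compare results unless the baseline and the variant saw exactly the same training seasons.

```python
# Sketch of a guard against baseline mismatch. The .train_seasons attribute
# and .score() method are hypothetical, not part of the actual pipeline.
def compare(baseline, variant):
    assert set(baseline.train_seasons) == set(variant.train_seasons), (
        "Baseline and variant were trained on different seasons; "
        "any 'improvement' may just be a weaker baseline."
    )
    return variant.score() - baseline.score()
```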
Building experiment infrastructure paid an unexpected dividend. While backtesting the Monte Carlo simulator against historical tournaments, the scores came back lower than expected. A code review uncovered a subtle bug: the XGBoost model expects "Team A" to always be the better-seeded team, but the bracket-filling code sometimes passed teams in bracket order instead. For matchups where the higher seed appeared in the second slot, the model was essentially predicting backwards.
The bug had been in the production code the whole time. Testing against known outcomes - not just predicting forward - is what caught it.
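The fix is a small canonicalization wrapper, sketched here with a placeholder prediction function: always hand the model the better seed as Team A, and flip the probability back if the caller's order was reversed.

```python
# Sketch of the ordering fix. predict_a_beats_b is a stand-in for the blended
# model, which expects Team A to be the better-seeded side.
def win_prob_in_bracket_order(team1, seed1, team2, seed2, predict_a_beats_b):
    """Return P(team1 beats team2) regardless of which side is better seeded."""
    if seed1 <= seed2:                       # team1 is already the better seed
        return predict_a_beats_b(team1, team2)
    # Otherwise canonicalize: predict for the better seed, then flip.
    return 1.0 - predict_a_beats_b(team2, team1)
```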
I can already hear it. "All four 1-seeds? Every 1 and 2 in the Elite Eight? Where's the fun in that?"
Fair. If you're filling out a bracket for entertainment, this one won't win you any style points. No one's going to high-five you for picking Duke to beat Siena. The machine has zero interest in being interesting.
But here's what 18 years of data actually says about March Madness: it's less mad than we think. Since 2008, 1-seeds have made the Elite Eight roughly 70% of the time and the Final Four roughly 40% of the time. The average tournament has one or two genuine Cinderellas - but most of the 63 games go to the better team. We just don't remember the 1-seed blowouts. We remember the upsets because they're surprising, and surprising is exactly what makes them bad bracket picks.
The model doesn't pick chalk because it's lazy. It picks chalk because when it gives Duke a 99% win probability in the first round and a 95% win probability in the second round, picking against them in either game is statistically lighting your bracket on fire. And when 76.5% of your late-round misses come from the wrong team being in the game - not from picking the wrong winner - the rational move is to protect the downstream picks that depend on early favorites advancing.
That said, the machine isn't pure chalk. It picks seven upsets where the data supports it. It just won't pick an upset for vibes. The model uses a 48% threshold - meaning it'll take the underdog when its own numbers say the underdog is actually favored, or close to it. A 9-seed with a 53% win probability isn't an upset pick. It's a math pick that happens to disagree with the seeding.
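In code, the rule is a one-liner. The names below are illustrative; only the 48% threshold itself comes from the system.

```python
# Sketch of the upset rule: take the worse seed only when the model's own
# probability for that team clears the threshold.
UPSET_THRESHOLD = 0.48

def pick_game(favorite, underdog, p_underdog):
    """favorite = better seed; p_underdog = model's win probability for the
    worse seed. Returns the team to advance in the bracket."""
    return underdog if p_underdog >= UPSET_THRESHOLD else favorite
```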
Is this the bracket that'll win your office pool? Maybe not - in a pool of hundreds, you might need a differentiated pick to stand out. But if you're optimizing for expected score rather than lottery-ticket upside, the boring bracket is the smart bracket.
The system has earned about 260 ESPN points of improvement over a naive baseline across five test seasons - roughly 52 extra points per year, equivalent to five additional correct first-round picks annually. That came mostly from two changes: using pure XGBoost for the Final Four (+160 points) and using that 48% upset threshold instead of a coin flip (+100 points). Real gains. From removing human judgment. Not from adding excitement.
The machine is excellent at the things humans are bad at: not overreacting to a single game, properly weighing 39 different factors simultaneously, and maintaining calibrated probabilities. When it says a game is 65/35, it means it. It doesn't talk itself into a 50/50 game being a sure thing because the team "has momentum."
But the machine has limits. It can't account for injuries announced the morning of a game. It can't read a team's body language in warmups. And with only 54 Final Four games to train on, it's flying somewhat blind in the biggest moments.
The real lesson isn't "trust the machine" or "trust your gut." It's that most bracket mistakes come from emotion masquerading as analysis. "I feel like this is an upset spot" isn't analysis. "This 12-seed has a top-30 adjusted offensive efficiency and their opponent ranks 180th in turnover rate" - that's analysis.
The machine just does that kind of analysis for every game, simultaneously, without getting excited about any of it.
My bracket this year is Michigan over Duke, 79-78. The machine says so. And for the first time, I'm not going to argue with it.
Update: All four 1-seeds survived. Championship pick still alive. Full Round of 64 retrospective here.
© 2026 Collin Lee