Turning Injury News Into Betting Edges Before the Lines Move
The Situation
In the NBA player prop market, the money is in the minutes between news and line adjustment. A star player gets ruled OUT on Twitter. Within 5-10 minutes, sportsbooks adjust their lines for teammates. In that window, the projections for backcourt players change, rebound totals shift, and assist numbers move. A bettor who can calculate those shifts before the books adjust has an edge. Everyone else is buying stale numbers.
The challenge for the manual bettor: they see the injury tweet, open a spreadsheet, pull up season averages, eyeball the redistribution, and try to place a bet. By the time they're done, the line has moved. The entire value proposition depends on speed, and human spreadsheet work isn't fast enough.
The Cost of Doing Nothing
Manual scenario analysis takes roughly 2 hours per game day during the NBA season. Across a full 82-game regular season plus playoffs, that's approximately 360 hours of labor. But the real cost isn't the time. It's the decay. A prediction calculated 20 minutes after news breaks is stale. The line has already moved. The edge is gone. Manual analysis doesn't just cost time: it costs every bet that would have hit during the window and didn't because the calculation wasn't fast enough.
What I Built
A scenario planning engine that pre-computes injury scenarios for every NBA game. It fetches game logs from the NBA API, trains a two-stage minutes prediction model (one classifier for "will they play," one regressor for "how many minutes"), and chains that into a multi-output stats predictor for PTS, REB, and AST. When injury news comes in, the engine runs the affected team through the Usage Vacuum model, which redistributes the missing player's usage to teammates based on positional overlap, and outputs new projections on a per-player basis.
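The two-stage structure can be sketched roughly as follows. This is a minimal illustration, not the production model: the gradient-boosting estimators, feature shapes, and class names are my assumptions; the source specifies only the classifier-gates-regressor design.

```python
# Sketch of a two-stage minutes model: a classifier decides "will they
# play," and a regressor (trained only on games actually played)
# predicts "how many minutes" for the players the classifier passes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor


class TwoStageMinutesModel:
    def __init__(self):
        self.plays_clf = GradientBoostingClassifier()
        self.minutes_reg = GradientBoostingRegressor()

    def fit(self, X, minutes):
        played = minutes > 0
        self.plays_clf.fit(X, played)
        # Regressor never sees zero-minute rows; the classifier owns those.
        self.minutes_reg.fit(X[played], minutes[played])
        return self

    def predict(self, X):
        # Gate the regressor with the classifier: predicted DNP -> 0 minutes.
        will_play = self.plays_clf.predict(X).astype(bool)
        mins = np.zeros(len(X))
        if will_play.any():
            mins[will_play] = self.minutes_reg.predict(X[will_play])
        return mins
```

The predicted minutes then become an input feature for the downstream multi-output PTS/REB/AST stage (e.g. scikit-learn's `MultiOutputRegressor` wrapping a per-target regressor).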
The hardest part was the data itself. The NBA API doesn't return rows for games a player didn't play. That absence of data is the most important signal for the classifier. I had to build a synthetic data generation step into the ingestion pipeline that creates zero-minute rows for every player on a team's roster who doesn't appear in the game log for that night. Without those synthetic zeros, the classifier predicted that everyone would play.
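The zero-row step can be sketched like this. The column names (`game_id`, `team`, `player_id`, `minutes`) and the separate roster table are assumptions; the source describes the behavior, not the schema.

```python
# Synthetic zero-minute rows: for each (game, team), emit a 0-minute row
# for every rostered player who has no row in that night's game log, so
# absence from the log becomes an explicit negative label.
import pandas as pd


def add_missing_player_rows(game_log: pd.DataFrame, roster: pd.DataFrame) -> pd.DataFrame:
    games = game_log[["game_id", "team"]].drop_duplicates()
    # Every rostered player crossed with every game their team appeared in.
    expected = games.merge(roster, on="team")
    merged = expected.merge(
        game_log, on=["game_id", "team", "player_id"], how="left", indicator=True
    )
    # Rows present in `expected` but absent from the log are the DNPs.
    missing = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    missing = missing.assign(minutes=0.0)
    return pd.concat([game_log, missing], ignore_index=True)
```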
The news pipeline was its own problem. Injury tweets have no standard format. @ShamsCharania writes differently from @Underdog__NBA. I built a scraper using Apify that pulls the latest tweets from 6 NBA insider accounts, filters by keywords (OUT, DOUBTFUL, QUESTIONABLE), and then sends each tweet through GPT-4o with a forced JSON response format to extract {player, team, status, confidence}. The structured output feeds directly into the scenario engine.
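A sketch of the extraction call, assuming the official OpenAI Python SDK's chat completions interface with `response_format={"type": "json_object"}`. The prompt wording and the validation helper are illustrative; only the `{player, team, status, confidence}` schema, temperature 0, and the forced-JSON setting come from the text.

```python
# Tweet -> structured injury record via GPT-4o with forced JSON output.
import json

STATUSES = {"OUT", "DOUBTFUL", "QUESTIONABLE"}

PROMPT = (
    "Extract injury info from this NBA reporter tweet. Respond with JSON "
    'containing exactly these keys: "player", "team", "status" '
    '(one of OUT, DOUBTFUL, QUESTIONABLE), "confidence" (0 to 1).\n\nTweet: {tweet}'
)


def parse_injury_tweet(tweet, client):
    """`client` is an openai.OpenAI() instance (SDK assumed installed)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,                            # deterministic extraction
        response_format={"type": "json_object"},  # force parseable JSON back
        messages=[{"role": "user", "content": PROMPT.format(tweet=tweet)}],
    )
    return validate_injury_json(resp.choices[0].message.content)


def validate_injury_json(raw: str) -> dict:
    """Reject malformed model output before it reaches the scenario engine."""
    data = json.loads(raw)
    if data.get("status") not in STATUSES:
        raise ValueError(f"unexpected status: {data.get('status')}")
    return {k: data[k] for k in ("player", "team", "status", "confidence")}
```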
System Flow, Data Model, Architecture Layers
(Diagrams for these three views accompanied the original write-up and are not reproduced here.)
The Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| SQLite with composite keys | PostgreSQL or cloud-hosted DB | Single-user batch system. No concurrent reads. The entire 10-table database stays under 110MB. Moving data between a managed database and local compute adds latency with zero benefit. |
| Two-stage minutes model | Single regression | DNP players don't appear in NBA API game logs. A single regressor trained on active-only data never predicts zero minutes. The classifier gates the regressor and prevents phantom projections for inactive players. |
| Synthetic zero-minute rows | Dropping DNP games entirely | The classifier needs negative examples. Without synthetic zeros for players who were on the roster but didn't play, the model has no concept of inactivity. |
| GPT-4o for tweet parsing | Regex extraction | Injury tweets have no standard format across reporters. Regex patterns would need constant maintenance. Structured JSON output from GPT-4o at temperature 0 handles format variation reliably. |
| Positional usage redistribution (60/40) | Uniform redistribution | When a guard sits, guard teammates absorb more than big men do. The 60/40 positional split approximates observed league-wide redistribution patterns without needing per-team calibration data that doesn't exist in sufficient volume. |
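The 60/40 split from the last row above can be sketched as follows. The record shape and the proportional-to-baseline allocation within each group are my assumptions; the source specifies only the positional 60/40 weighting.

```python
# Usage Vacuum sketch: 60% of the vacated usage goes to same-position
# teammates, 40% to everyone else, split within each group in proportion
# to each teammate's baseline usage rate.
SAME_POS_SHARE = 0.60


def redistribute_usage(out_player: dict, teammates: list[dict]) -> dict:
    """Each record: {"name": str, "position": str, "usage": float}.
    Returns each teammate's usage bump when `out_player` sits."""
    vacated = out_player["usage"]
    same = [t for t in teammates if t["position"] == out_player["position"]]
    other = [t for t in teammates if t["position"] != out_player["position"]]
    bumps = {}
    for group, share in ((same, SAME_POS_SHARE), (other, 1 - SAME_POS_SHARE)):
        total = sum(t["usage"] for t in group)
        for t in group:
            # Allocate the group's share proportionally to baseline usage.
            bumps[t["name"]] = share * vacated * (t["usage"] / total) if total else 0.0
    return bumps
```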
Results
Before the system, scenario analysis required roughly 2 hours per game day, and by the time the calculations were done, the lines had already moved. The window between injury news and line adjustment was consistently missed.
After deployment, the full pipeline (data ingestion, feature engineering, model prediction, scenario generation, bet sheet output) runs in under 3 minutes on a local machine. The bottleneck is the NBA API's 0.6-second rate limit, not computation. The system covers ~600 active players across 30 teams, generates base-case predictions plus targeted injury scenarios for any player flagged as OUT or DOUBTFUL, and surfaces edges above 1.5 standard deviations with a max-one-bet-per-team rule to limit correlated exposure.
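The edge filter and exposure rule can be sketched like this. The record fields and the tie-breaking-by-largest-edge choice are assumptions; the 1.5-sigma threshold and the one-bet-per-team cap come from the text.

```python
# Surface edges above 1.5 standard deviations, keeping at most one bet
# per team (the strongest) to limit correlated exposure.
EDGE_THRESHOLD = 1.5  # std devs between projection and posted line


def select_bets(edges: list[dict]) -> list[dict]:
    """Each edge: {"player": str, "team": str, "z": float}."""
    qualified = [e for e in edges if abs(e["z"]) >= EDGE_THRESHOLD]
    best_per_team: dict[str, dict] = {}
    for e in sorted(qualified, key=lambda e: abs(e["z"]), reverse=True):
        # First hit per team is the largest |z| thanks to the sort.
        best_per_team.setdefault(e["team"], e)
    return list(best_per_team.values())
```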
The paper trading module runs dual portfolios (filtered top picks vs all qualifying edges) to measure whether the selection step adds or destroys value. At 10x scale (covering international leagues or lower-tier markets), the architecture would need a move from SQLite to PostgreSQL for concurrent access and a proper job queue replacing the CLI orchestrator, but the ML pipeline and scenario logic would transfer unchanged.
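The dual-portfolio comparison reduces to measuring whether filtering improves return. A minimal sketch, assuming flat one-unit stakes and -110 pricing for settlement (both are illustrative assumptions, not details from the source):

```python
# Dual-portfolio paper trading: compare the filtered top-picks portfolio
# against the all-qualifying-edges portfolio to see whether the
# selection step adds or destroys value.
WIN_PROFIT = 1.0 / 1.1  # profit per one-unit stake at -110 odds


def portfolio_roi(results: list[bool]) -> float:
    """ROI of settled flat-stake bets (True = win)."""
    if not results:
        return 0.0
    profit = sum(WIN_PROFIT if won else -1.0 for won in results)
    return profit / len(results)


def selection_value(filtered: list[bool], all_edges: list[bool]) -> float:
    """Positive when filtering beat simply betting every qualifying edge."""
    return portfolio_roi(filtered) - portfolio_roi(all_edges)
```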