
NBA Scenario Engine

5 min read · Kingsley Onoh

Architectural Brief: NBA Scenario Engine

The NBA player prop market moves within minutes of an injury tweet. A manual bettor sees the news, opens three tabs, compares projections, and places a bet. By then the line has already shifted. This system was built to pre-compute injury scenarios before the news breaks: predict what happens to every teammate's stat line when a star player sits, and surface the edges that exist in the gap between the news and the line movement.

System Topology

Infrastructure Decisions

  • Compute: Local machine with scheduled CLI orchestrator (main.py). Chose over cloud deployment (Lambda, ECS) because the system runs once daily during the NBA season and handles a single user. No concurrent access. The PipelineOrchestrator class manages all four pipelines (news, predict, retrain, paper-trade) through CLI flags with idempotency checks and force-rerun support.
  • Data Layer: SQLite via SQLAlchemy ORM. Chose over PostgreSQL because this is a single-user batch system with no concurrent reads. The entire database (10 tables, ~110MB) fits comfortably on disk. Moving data between a managed database and local compute adds latency for zero benefit at this scale. Composite primary key on (PlayerID, GameID) enforces uniqueness at the schema level.
  • ML Framework: scikit-learn's HistGradientBoostingClassifier and HistGradientBoostingRegressor. Chose over XGBoost for simpler dependency management and native handling of missing values. Chose over neural networks because the dataset (2,847 training records after feature engineering) is too small for deep learning to outperform gradient boosting.
  • News Pipeline: Apify (Twitter scraper) into GPT-4o (structured JSON extraction). Chose Apify over direct Twitter API because the official API pricing changed and the target accounts (@ShamsCharania, @wojespn, @Underdog__NBA) post infrequently enough that a scraper-based approach is cheaper than maintaining an API subscription. GPT-4o parses tweets into {player, team, status, confidence} JSON with forced json_object response format at temperature 0.
  • Data Bus: Flat files (CSV, JSON, .pkl model artifacts) on local disk. Chose over S3 or cloud storage because the system is single-machine. Predictions, bet sheets, and paper trading logs are date-stamped CSVs under data/. Models are joblib-serialized pickle files.
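The idempotency-plus-force pattern behind the CLI orchestrator can be sketched in a few lines. This is a minimal illustration using argparse and date-stamped marker files; the marker-file layout, function names, and flag spelling are assumptions, not the actual main.py:

```python
import argparse
from datetime import date
from pathlib import Path

# Hypothetical marker directory; the real orchestrator's layout may differ.
MARKER_DIR = Path("data/run_markers")

def already_ran(pipeline: str, run_date: date) -> bool:
    """Idempotency check: a marker file records that a pipeline ran today."""
    return (MARKER_DIR / f"{pipeline}_{run_date.isoformat()}.done").exists()

def mark_done(pipeline: str, run_date: date) -> None:
    MARKER_DIR.mkdir(parents=True, exist_ok=True)
    (MARKER_DIR / f"{pipeline}_{run_date.isoformat()}.done").touch()

def run_pipeline(pipeline: str, force: bool = False) -> bool:
    """Run one of the four pipelines unless it already ran today."""
    today = date.today()
    if already_ran(pipeline, today) and not force:
        return False  # skipped: already ran today
    # ... actual pipeline work would go here ...
    mark_done(pipeline, today)
    return True

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="NBA Scenario Engine orchestrator")
    parser.add_argument("pipeline", choices=["news", "predict", "retrain", "paper-trade"])
    parser.add_argument("--force", action="store_true", help="re-run even if already done today")
    return parser

# Parse flags in-process rather than from sys.argv, for illustration.
args = build_parser().parse_args(["predict", "--force"])
```

The marker-file approach keeps the daily cron invocation safe to re-run: a second invocation the same day is a no-op unless `--force` is passed.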

Constraints That Shaped the Design

  • Input: NBA game logs from nba_api (0.6 second mandatory delay between requests to avoid IP blocks). Player props and lines from TheOddsAPI (paid plan required for historical data, available from May 2023 onwards). Injury news from 6 Twitter accounts scraped via Apify, filtered by keyword, parsed by LLM. Position mappings from auto-generated player_positions.json with manual overrides.json for edge cases (Jokic classified as Big, etc.).
  • Output: Date-stamped prediction CSVs with columns for Player, Matchup, Scenario, Mins, PTS, REB, AST. Bet sheets filtered to edges exceeding 1.5 standard deviations. Telegram alerts with impact analysis when a player is confirmed OUT. Paper trading logs tracking dual-portfolio P&L (Selected Bets vs All Bets).
  • Scale Handled: ~600 active players across 30 teams per season. Feature matrix grows by ~2,800 records per season. At 5 seasons of historical data, the feature matrix would hit ~14,000 records, still well within scikit-learn's capacity without GPU. The bottleneck at scale is NBA API rate limiting, not model training.
  • Hard Constraints: NBA API rate limit at 0.6s per request (hardcoded delay). TheOddsAPI quota caps the number of daily requests. A 24-hour tweet freshness filter in the news monitor prevents stale injury reports from triggering scenarios. A ghost-player filter drops anyone projected under 20 minutes, keeping bench players out of the prediction output. Max 1 bet per team per day (highest edge only) prevents correlated exposure.
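Two of these hard constraints, the 0.6-second request spacing and the 24-hour freshness window, reduce to small helpers. A sketch under assumed names (the production module is structured differently):

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Optional

NBA_API_DELAY_S = 0.6                 # hard gap between nba_api requests
TWEET_MAX_AGE = timedelta(hours=24)   # freshness window for injury tweets

_last_request_at = 0.0

def throttled(fetch):
    """Decorator enforcing the 0.6 s spacing between successive API calls."""
    def wrapper(*args, **kwargs):
        global _last_request_at
        wait = NBA_API_DELAY_S - (time.monotonic() - _last_request_at)
        if wait > 0:
            time.sleep(wait)
        _last_request_at = time.monotonic()
        return fetch(*args, **kwargs)
    return wrapper

def is_fresh(tweet_time: datetime, now: Optional[datetime] = None) -> bool:
    """24-hour freshness filter: stale injury reports never trigger scenarios."""
    now = now or datetime.now(timezone.utc)
    return now - tweet_time <= TWEET_MAX_AGE
```

Wrapping every nba_api call in `throttled` makes the rate limit a property of the fetch layer rather than something each caller must remember.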

Decision Log

  • Two-stage minutes prediction (classifier then regressor) over a single regression model: Players who don't play at all (DNP/injury) produce zero-minute games. A single regressor trained on the full dataset predicts 8-12 minutes for DNP players instead of zero. The classifier gates the regressor: if play probability is below 50%, minutes are set to zero before stats prediction runs.
  • TimeSeriesSplit for validation over a random train/test split: Random splits in temporal data leak future information. A player's stats in April would appear in training data used to predict their February performance. TimeSeriesSplit ensures the model only trains on past data, matching how the system operates in production.
  • Synthetic zero-minute rows for inactive players over dropping DNP games from training: The classifier needs negative examples. Without synthetic zeros for players who were on the roster but didn't play, the model never learns what "not playing" looks like. The ingest pipeline creates 0-minute rows for every inactive player in every game.
  • Usage Vacuum positional redistribution (60/40 split) over uniform redistribution across all teammates: When a star sits, their usage doesn't distribute evenly. Guards absorb more of a guard's usage than bigs do. The 60/40 split (same position gets 60%, other positions split 40%) approximates observed redistribution patterns in NBA data.
  • GPT-4o for tweet parsing over regex-based extraction: Injury tweets have no standard format. @ShamsCharania writes differently from @Underdog__NBA. Regex patterns would need constant maintenance. GPT-4o with forced JSON output and temperature 0 handles format variation reliably.
  • Dual-portfolio paper trading over single portfolio tracking: Running "Selected Bets" (filtered top 20) alongside "All Bets" (every qualifying edge) reveals whether the filtering step actually improves P&L. If the All Bets portfolio outperforms, the selection logic is destroying value.
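The Usage Vacuum split lends itself to a direct sketch. Assuming a flat usage number and a simple position map (hypothetical function and argument names; the production feature pipeline is richer than this):

```python
def redistribute_usage(vacated_usage: float, out_position: str,
                       teammates: dict[str, str]) -> dict[str, float]:
    """Split a sidelined player's usage: 60% to teammates at the same
    position, 40% to everyone else, evenly within each group."""
    same = [p for p, pos in teammates.items() if pos == out_position]
    other = [p for p, pos in teammates.items() if pos != out_position]
    boosts = {}
    if same:
        for p in same:
            boosts[p] = 0.60 * vacated_usage / len(same)
    if other:
        # Edge case: with no same-position teammate, the 60% share would
        # be lost, so hand the full amount to the remaining players.
        share = vacated_usage if not same else 0.40 * vacated_usage
        for p in other:
            boosts[p] = share / len(other)
    return boosts
```

With 30% vacated usage and two same-position guards on the roster, each guard picks up 9 points of usage while the two remaining players split the other 12, matching the 60/40 rule described above.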
#python #scikit-learn #sqlite #nba-api #sports-analytics

