S6E2 Heart Disease — Team Dashboard

Best Local Score (OOF)

0.955708

CB+LR Cross-Stack (5-seed)

Best Leaderboard Score

0.95392

Rank 175 / ~2,700 teams

Target (First Page)

0.95400

Need +0.00008 more

Models Trained

127

88 ready for submission

Submissions Made

5 / 10

5 remaining today

Days Remaining

Deadline: Feb 28, 2026

🔬 Our Approach (Pipeline)

1. Data

630K rows, 13 features

2. Feature Eng.

Freq + Orig Stats + TE

3. Models

CatBoost, XGB, LR, NN

4. Stacking

LR preds → CatBoost

5. Blending

Hill-climb ensemble

💡 Key Findings So Far

1. Treating ALL features as categorical (+0.005 AUC over official split) — even numeric ones like Age, BP.
2. Original UCI dataset target statistics as features (+0.006 AUC) — the #1 trick from top notebooks.
3. ~11% of labels are noise (contradictory samples) — creates a ceiling around 0.956 AUC.
4. Cross-stacking (LR predictions as CatBoost feature) is our strongest technique.
5. CV→LB gap is a stable 0.00183 — we can predict LB score from local results.

📈 Progress Over Time

Each dot = one model trained. Blue line = cumulative best. Shows how our performance improved throughout the day.

📊 Models by Category

How many models we trained in each approach category. Hover for details.

📂 Model Categories Explained

What each type of model means and how it works.

Category	Count	Best OOF	What It Means
Blend	9	0.955692	Rank-average or weighted combination of 2+ models. Simple but effective.
Multi-seed	18	0.955623	Same model architecture trained with different random seeds, then blended. Reduces variance.
Neural Network	5	0.953317	Deep learning models (MLP). Weaker than GBDTs on this tabular dataset.
Novel	13	0.955677	Our team's original ideas — innovations we designed for this competition. Important for academic report.
Single Model	11	0.955595	Individual ML model (CatBoost, XGBoost, LightGBM, LR, RF) — one algorithm, one config.
Stacking	22	0.955708	Model A's predictions become an input feature for Model B. Captures what A learned. Our best approach.
Sweep	39	0.955468	Systematic ablation experiments testing which features help/hurt when treated as categorical.
Target Encoding	10	0.955702	Replace categorical values with the average target (heart disease rate) for that category. Adds signal.

🔬 All Model Results — 143 Models

Click any model for full details. ✅ = can submit to Kaggle, ❌ = evaluation only.

Model & Description	Submit
Logistic Regression — Raw Features Logistic regression on raw numeric features without any encoding. Baseline floor.	✅
CatBoost — Raw Features CatBoost with default parameters on raw features. No feature engineering.	✅
XGBoost — Raw Features XGBoost with default parameters on raw features. Baseline comparison.	✅
LightGBM — Raw Features LightGBM with default parameters on raw features. Baseline comparison.	✅
Random Forest — Raw Features Random Forest (500 trees) on raw features. Weakest baseline.	✅
Logistic Regression — One-Hot All Features LR with all 13 features one-hot encoded (even numeric ones like Age). Surprisingly competitive at 0.9555 because low cardinality lets LR learn per-val...	✅
CatBoost — Original UCI Target Stats CatBoost with features mapped to P(disease\|value) from the original 303-patient UCI dataset. Key insight: the original data has clean clinical labels ...	✅
XGBoost — Original UCI Target Stats XGBoost with original UCI target statistics as features.	✅
LightGBM — Original UCI Target Stats LightGBM with original UCI target statistics as features.	✅
CatBoost — Pairwise Feature Interactions CatBoost with manually created pairwise interaction features (e.g., Age×ChestPain).	✅
CatBoost — Frequency Encoding ⭐ CatBoost with frequency encoding (how often each value appears in train+test). Best single-seed model at 0.9556. The frequency captures population-lev...	✅
GBDT Blend — CB+XGB+LGBM Average Simple average of CatBoost, XGBoost, and LightGBM predictions. Blend of 3 GBDT families.	✅
Original Stats Blend Blend of all models using original UCI target statistics as features.	✅
Grand Blend — All Phase 1 Models Rank-blend of all Phase 1 models. Kitchen-sink approach.	✅
CB + LR Blend Blend of CatBoost (freq) and Logistic Regression (one-hot). Two very different model families.	✅
Top-3 Blend — CB_freq + CB_orig + LR ⭐ Rank-blend of the 3 best single models. First Kaggle submission: LB 0.95384.	✅
Top-4 Blend Rank-blend of top 4 models from Phase 1.	✅
CB Freq — 5-Seed Average Average of 5 CatBoost freq models with different random seeds. Reduces initialization variance.	✅
CB Freq — Seed 123 CatBoost freq encoding, random seed 123.	✅
CB Freq — Seed 2024 CatBoost freq encoding, random seed 2024.	✅
CB Freq — Seed 42 CatBoost freq encoding, random seed 42.	✅
CB Freq — Seed 456 CatBoost freq encoding, random seed 456.	✅
CB Freq — Seed 789 CatBoost freq encoding, random seed 789.	✅
HeartMLP — Custom 3-Layer Neural Net Custom MLP (256→128→64, BatchNorm, Dropout 0.3) trained on MPS GPU. Uses one-hot + freq + orig stats features. OOF 0.9531.	✅
LR One-Hot — 5-Seed Average Average of 5 LR one-hot models. LR is near-deterministic so seed variation is minimal.	✅
LR One-Hot — Seed 123 LR one-hot, random seed 123.	✅
LR One-Hot — Seed 2024 LR one-hot, random seed 2024.	✅
LR One-Hot — Seed 42 LR one-hot, random seed 42.	✅
LR One-Hot — Seed 456 LR one-hot, random seed 456.	✅
LR One-Hot — Seed 789 LR one-hot, random seed 789.	✅
20 Cb Origstats Multiseed Avg Model: 20_cb_origstats_multiseed_avg	✅
CB OrigStats — Seed 123 CatBoost with UCI stats, seed 123.	✅
CB OrigStats — Seed 2024 CatBoost with UCI stats, seed 2024.	✅
CB OrigStats — Seed 42 CatBoost with UCI stats, seed 42.	✅
CB OrigStats — Seed 456 CatBoost with UCI stats, seed 456.	✅
CB OrigStats — Seed 789 CatBoost with UCI stats, seed 789.	✅
RealMLP — 3-Seed Blend Blend of 3 RealMLP seeds. Neural net diversity for ensemble.	✅
RealMLP — Tabular Foundation Model RealMLP from pytabkit library. Uses Mish activation and PLR embeddings. State-of-the-art tabular neural net architecture.	✅
RealMLP — Seed 123 RealMLP with random seed 123.	✅
RealMLP — Seed 42 RealMLP with random seed 42.	✅
RealMLP — Seed 456 RealMLP with random seed 456.	✅
CB Freq — 5-Seed Rank Blend ⭐ Rank-blend (not average) of 5 CatBoost freq seeds. Rank-blending is calibration-invariant.	✅
Multi-Seed Grand Blend Blend of all multi-seed CatBoost variants (freq + origstats).	✅
CB+LR Cross-Stack — 5-Seed Blend ⭐⭐ OUR STRONGEST SINGLE STRATEGY. LR predictions become feature #14 for CatBoost. CatBoost learns WHEN to trust the linear model. 5 seeds rank-blended → ...	✅
CB+LR Cross-Stack — Seed 123 Cross-stacked CB+LR, seed 123.	✅
CB+LR Cross-Stack — Seed 2024 Cross-stacked CB+LR, seed 2024.	✅
CB+LR Cross-Stack — Seed 42 CatBoost trained with LR out-of-fold predictions as feature #14. Seed 42.	✅
CB+LR Cross-Stack — Seed 456 Cross-stacked CB+LR, seed 456.	✅
CB+LR Cross-Stack — Seed 789 Cross-stacked CB+LR, seed 789.	✅
CB Stacked (v1) — Seed 123 Earlier stacking variant, seed 123.	✅
CB Stacked (v1) — Seed 42 Earlier stacking variant, seed 42.	✅
CB Stacked (v1) — Seed 456 Earlier stacking variant, seed 456.	✅
CB+LR+CBO Stack — 5-Seed Blend Adding CB-origstats OOF to the CB+LR stack. WORSE than CB+LR alone (0.9557 vs 0.9557) — too correlated, no new information added.	✅
CB+LR+CBO Stack — Seed 123 Triple cross-stack, seed 123.	✅
CB+LR+CBO Stack — Seed 2024 Triple cross-stack, seed 2024.	✅
CB+LR+CBO Stack — Seed 42 CatBoost with both LR and CB-origstats OOF as features. Seed 42.	✅
CB+LR+CBO Stack — Seed 456 Triple cross-stack, seed 456.	✅
CB+LR+CBO Stack — Seed 789 Triple cross-stack, seed 789.	✅
CB Mega-Stack — Seed 123 Mega-stack variant, seed 123.	✅
CB Mega-Stack — Seed 42 CatBoost with ALL 4 model predictions (LR, CB, CBO, XGB) as stacked features.	✅
CB Mega-Stack — Seed 456 Mega-stack variant, seed 456.	✅
CB Mega-Stack v2 — Seed 123 Revised mega-stack, seed 123.	✅
CB Mega-Stack v2 — Seed 42 Revised mega-stack, seed 42.	✅
CB Mega-Stack v2 — Seed 456 Revised mega-stack, seed 456.	✅
Feature Ablation — Drop Age CatBoost trained WITHOUT the Age feature. Measures Age's contribution to prediction.	✅
Feature Ablation — Drop Blood Pressure CatBoost without Resting Blood Pressure.	✅
Feature Ablation — Drop Chest Pain Type CatBoost without Chest Pain Type (a strong predictor).	✅
Feature Ablation — Drop Cholesterol CatBoost without Cholesterol.	✅
Feature Ablation — Drop EKG Results CatBoost without EKG/ECG results.	✅
Feature Ablation — Drop Exercise Angina CatBoost without Exercise-Induced Angina.	✅
Feature Ablation — Drop Fasting Blood Sugar CatBoost without Fasting Blood Sugar > 120 mg/dl.	✅
Feature Ablation — Drop Max Heart Rate CatBoost without Maximum Heart Rate Achieved.	✅
Feature Ablation — Drop Fluoroscopy Vessels CatBoost without Number of Major Vessels Colored by Fluoroscopy (strong predictor).	✅
Feature Ablation — Drop ST Depression CatBoost without Exercise-Induced ST Depression value.	✅
Feature Ablation — Drop Sex CatBoost without the Sex feature.	✅
Feature Ablation — Drop ST Slope CatBoost without Slope of Peak Exercise ST Segment.	✅
Feature Ablation — Drop Thallium CatBoost without Thallium stress test result (THE strongest single predictor, ρ=0.61).	✅
Cross-Stack + Multi-Seed TE (top 10) Top-10 candidate blend from cross-stacking + target encoding multi-seed.	✅
Cross-Stack + Multi-Seed TE (top 15) Top-15 candidate blend.	✅
Cross-Stack + Multi-Seed TE (top 20) Top-20 candidate blend.	✅
CB + Clean Prob K=3 (Fixed, No Leakage) CatBoost with distance-weighted clean probability from 3 nearest UCI patients. Properly computed without target leakage. OOF 0.9555 — no improvement o...	✅
Noise-Aware Reweighting (α=0.0) CatBoost with sample weights based on label noise probability. α=0 = baseline (no reweighting).	✅
Noise-Aware Reweighting (α=0.1) Samples likely mislabeled get 10% lower weight. Identifies noisy samples via cross-fold disagreement.	✅
Noise-Aware Reweighting (α=0.2) Noisy samples downweighted by 20%. Stronger penalty for disagreement.	✅
Noise-Aware Reweighting (α=0.3) Noisy samples downweighted by 30%.	✅
Noise-Aware Reweighting (α=0.5) Noisy samples downweighted by 50%. Most aggressive noise suppression.	✅
Confidence-Weighted Ensemble (Novel) Instead of fixed blend weights, each model's contribution is weighted by its prediction confidence (distance from 0.5) per sample. Uncertain models ar...	✅
LR + Clean Prob K=10 LR with clean prob, K=10 neighbors.	✅
LR + Clean Prob K=20 LR with clean prob, K=20 neighbors.	✅
LR + Clean Prob K=3 Logistic Regression with clean probability feature from 3 nearest UCI neighbors.	✅
LR + Clean Prob K=50 LR with clean prob, K=50 neighbors.	✅
LR + Clean Prob K=5 LR with clean prob, K=5 neighbors.	✅
Mega Meta-Learner — 62 Models (Novel) CatBoost trained on 62 base model predictions as features. Overfits badly (0.9549) — too many correlated features relative to the signal.	✅
Meta-Learner with Confidence Features (Novel) CatBoost meta-model trained on: base predictions + confidence scores + inter-model agreement. Learns which model to trust for each sample.	✅
LR Sweep — All Features as Categorical Logistic regression treating ALL 13 features as categorical (one-hot encoded).	❌
LR Sweep — All Features as Numeric LR treating all features as raw numbers. Much worse — loses categorical structure.	❌
LR Sweep — All Categorical Except Age LR with all features one-hot encoded EXCEPT Age kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except BP LR with all features one-hot encoded EXCEPT BP kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Chest pain type LR with all features one-hot encoded EXCEPT Chest pain type kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Cholesterol LR with all features one-hot encoded EXCEPT Cholesterol kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except EKG results LR with all features one-hot encoded EXCEPT EKG results kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Exercise angina LR with all features one-hot encoded EXCEPT Exercise angina kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except FBS over 120 LR with all features one-hot encoded EXCEPT FBS over 120 kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Max HR LR with all features one-hot encoded EXCEPT Max HR kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Number of vessels fluro LR with all features one-hot encoded EXCEPT Number of vessels fluro kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except ST depression LR with all features one-hot encoded EXCEPT ST depression kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Sex LR with all features one-hot encoded EXCEPT Sex kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Slope of ST LR with all features one-hot encoded EXCEPT Slope of ST kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — All Categorical Except Thallium LR with all features one-hot encoded EXCEPT Thallium kept as numeric. Tests whether this feature works better as a number or category.	❌
LR Sweep — Categorical + Age + BP LR with officially categorical features one-hot encoded, plus Age + BP also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Age + Cholesterol LR with officially categorical features one-hot encoded, plus Age + Cholesterol also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Age + Max HR LR with officially categorical features one-hot encoded, plus Age + Max HR also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Age + ST depression LR with officially categorical features one-hot encoded, plus Age + ST depression also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Age LR with officially categorical features one-hot encoded, plus Age also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + BP + Cholesterol LR with officially categorical features one-hot encoded, plus BP + Cholesterol also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + BP + Max HR LR with officially categorical features one-hot encoded, plus BP + Max HR also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + BP + ST depression LR with officially categorical features one-hot encoded, plus BP + ST depression also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + BP LR with officially categorical features one-hot encoded, plus BP also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Cholesterol + Max HR LR with officially categorical features one-hot encoded, plus Cholesterol + Max HR also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Cholesterol + ST depression LR with officially categorical features one-hot encoded, plus Cholesterol + ST depression also treated as categorical. Incremental category addition t...	❌
LR Sweep — Categorical + Cholesterol LR with officially categorical features one-hot encoded, plus Cholesterol also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Max HR + ST depression LR with officially categorical features one-hot encoded, plus Max HR + ST depression also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + Max HR LR with officially categorical features one-hot encoded, plus Max HR also treated as categorical. Incremental category addition test.	❌
LR Sweep — Categorical + ST depression LR with officially categorical features one-hot encoded, plus ST depression also treated as categorical. Incremental category addition test.	❌
LR Sweep — Numeric + Chest pain type as Category LR with numeric features as-is, plus Chest pain type one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Numeric + EKG results as Category LR with numeric features as-is, plus EKG results one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Numeric + Exercise angina as Category LR with numeric features as-is, plus Exercise angina one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Numeric + FBS over 120 as Category LR with numeric features as-is, plus FBS over 120 one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Numeric + Number of vessels fluro as Category LR with numeric features as-is, plus Number of vessels fluro one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Numeric + Sex as Category LR with numeric features as-is, plus Sex one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Numeric + Slope of ST as Category LR with numeric features as-is, plus Slope of ST one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Numeric + Thallium as Category LR with numeric features as-is, plus Thallium one-hot encoded. Tests adding one categorical feature to numeric baseline.	❌
LR Sweep — Official Cat/Num Split LR using the "official" categorical vs numeric distinction from feature metadata.	❌
CatBoost + Target Encoding (α=10) ⭐ CatBoost with target-encoded features using smoothing α=10. Each category value is replaced by smoothed P(disease\|value) from training folds. Combined...	✅
LR + Target Encoding (α=10) LR with one-hot + target encoding, smoothing α=10.	✅
LR + Target Encoding (α=50) LR with one-hot + target encoding, smoothing α=50 (heavy smoothing).	✅
LR + Target Encoding (α=5) Logistic Regression with one-hot + target encoding features, smoothing α=5.	✅
LR + Target Encoding + UCI Stats (α=10) LR with one-hot + target encoding + original UCI statistics combined.	✅
LR — Target Encoding Only (α=100) LR with target encoding only, heavy smoothing.	✅
LR — Target Encoding Only (α=10) LR with target encoding only, α=10.	✅
LR — Target Encoding Only (α=1) LR using ONLY target-encoded features (no one-hot), minimal smoothing.	✅
LR — Target Encoding Only (α=50) LR with target encoding only, α=50.	✅
LR — Target Encoding Only (α=5) LR with target encoding only, α=5.	✅

📈 OOF vs Leaderboard Score

Blue = local cross-validation estimate. Green = actual Kaggle leaderboard score. The consistent gap shows our CV is trustworthy.

#	Submitted	Model	OOF AUC	LB Score	Rank	CV→LB Gap
1	Feb 15 07:04	Top-3 Blend (CB_freq + CB_orig + LR)	0.955682	0.95384	297	0.00184
2	Feb 15 07:04	CatBoost + Freq Encoding (single)	0.955595	0.95375	—	0.00184
3	Feb 15 07:04	CatBoost Raw (no FE)	0.955503	0.95358	—	0.00192
4	Feb 15 12:36	3-Model Rank Blend (stack + conf + multi)	0.955729	0.95389	201	0.00184
5	Feb 15 12:57	Hill-Climb: Cross-Stack + TE CatBoost	0.955751	0.95392	175	0.00183

Finding: Clean probability adds no predictive value to CatBoost (0.955477 vs baseline 0.955595). The noise is uniformly distributed across feature space — CatBoost already handles it implicitly. Initial version had target leakage (0.999 AUC) via disagreement feature — caught and fixed.

Why it\'s novel: Original idea. Not found in competition notebooks or literature for this dataset scale.

')">

💡 Idea 1: Label Denoising via KNN Consensus

❌ No benefit OOF: 0.955477

The dataset has ~11% label noise (contradictory samples with identical features but opposite labels). We compute a "clean probability" using K-nearest…

💡 Idea 2: Confidence-Weighted Dynamic Ensemble

✅ Marginal gain OOF: 0.955677

Instead of fixed weights, each model's vote is weighted by its confidence — how far its prediction is from the decision boundary (0.5). Confident pred…

Finding: Hurts performance at all alpha values tested (0.0–0.5). The noise is too uniformly distributed for selective weighting to help. CatBoost\'s built-in regularization already handles this.

Why it\'s novel: Curriculum-learning inspired approach adapted for label noise in tabular data.

')">

💡 Idea 3: Noise-Aware Sample Reweighting

❌ No benefit OOF: 0.955461

Down-weight training samples that are likely mislabeled (low KNN consensus). Give the model permission to "ignore" noisy examples via sample weights.

💡 Idea 4: Bayesian Posterior Stacking

❌ Overfits OOF: 0.954874

Use 62 model predictions as meta-features in a Bayesian-inspired meta-learner. The idea: capture complex model agreement patterns.

💡 Idea 5: Autoencoder Latent Features

🔄 Running Pending...

Train a 13→16→8→16→13 autoencoder to learn compressed representations of patient features. Extract the 8-dim bottleneck as new features for CatBoost. …

💡 Idea 6: Adversarial Validation

🔄 Running Pending...

Train a classifier to distinguish train from test data. If it succeeds (AUC > 0.5), the distributions differ — we can reweight training samples to mat…

💡 Idea 7: Pseudo-Labeling with Confidence Thresholds

🔄 Running Pending...

Use our best model to predict test labels. Add high-confidence test predictions (>0.95, 0.90, 0.85, 0.80 thresholds) back to training data. Retrain wi…

💡 Idea 8: Slow-Learning Deep CatBoost

🔄 Running Pending...

CatBoost with very low learning rate (0.02–0.015), deep trees (depth 7–8), and 2500–3000 iterations. Trades compute for potentially better generalizat…

📋 Full Experiment Plan — 211 Experiments

23 Done

13 Running

127 Planned

11 Ideas

Every experiment explained — what it is, why we're testing it, and what we expect.

ID	Experiment & Description	Status
A1. Gradient Boosted Decision Trees (GBDT)
`A1.1`	CatBoost — Default Parameters Yandex's gradient boosting with native categorical feature handling. Uses ordered target statistics internally, which is ideal for our all-categorical data. Default hyperparameters: depth=6, lr=0.03, 1000 iterations.	Done `0.95550`
`A1.2`	XGBoost — Default Parameters The most popular gradient boosting library. Uses histogram-based splits. Serves as comparison point against CatBoost on the same features.	Done `0.95535`
`A1.3`	LightGBM — Default Parameters Microsoft's gradient boosting. Leaf-wise tree growth (vs level-wise in XGBoost). Fastest GBDT but may overfit on noisy data.	Done `0.95511`
`A1.4`	CatBoost — Optuna Hyperparameter Tuning 100-trial Bayesian optimization over: iterations, learning_rate, depth, l2_leaf_reg, random_strength, bagging_temperature, border_count, min_data_in_leaf. Searches the full hyperparameter space to find optimal CB configuration.	Running
`A1.5`	XGBoost — Optuna Hyperparameter Tuning Bayesian optimization for XGBoost: max_depth, learning_rate, subsample, colsample_bytree, reg_lambda, reg_alpha, min_child_weight. May find a different optimum than CatBoost.	Planned
`A1.6`	LightGBM — Optuna Hyperparameter Tuning Bayesian optimization for LightGBM: num_leaves, learning_rate, subsample, colsample_bytree, reg_lambda, min_child_samples. LGBM has different optimal hyperparameters than CB/XGB.	Planned
`A1.7`	CatBoost — Grow Policy Comparison Compare 3 tree-building strategies: SymmetricTree (default, balanced splits), Depthwise (level-by-level like XGBoost), Lossguide (leaf-wise like LightGBM, picks the leaf that reduces loss most). Different policies capture different patterns in the data.	Planned
`A1.8`	CatBoost — Ordered Boosting CatBoost's unique boosting_type="Ordered" uses a permutation-driven approach designed to prevent target leakage during training. Originally designed for small datasets. May help with our noisy labels by being more conservative.	Planned
`A1.9`	XGBoost — DART Mode Dropout Additive Regression Trees: randomly drops trees during boosting (like neural net dropout). Prevents later trees from over-correcting earlier ones. Better generalization on noisy data.	Planned
`A1.10`	LightGBM — GOSS + EFB Gradient-based One-Side Sampling: keeps all samples with large gradients (hard examples), randomly samples from small gradients (easy examples). Exclusive Feature Bundling merges sparse features. Faster and may regularize better.	Planned
`A1.11`	CatBoost — Label Smoothing Smooth target labels: instead of 0/1, use 0+ε and 1-ε (e.g., 0.05 and 0.95). Directly addresses our ~11% label noise by telling the model "don't be 100% confident in any label". Test ε = 0.05, 0.10, 0.15.	Planned
`A1.12`	HistGradientBoosting — Scikit-learn Sklearn's histogram-based gradient boosting. Different implementation from CB/XGB/LGBM. Adds diversity to our GBDT ensemble even if slightly weaker individually.	Planned
A2. Linear Models
`A2.1`	Logistic Regression — Raw Features Baseline: logistic regression on raw numeric features. P(disease) = sigmoid(w₁×Age + w₂×BP + ...). Expected to be weak because it treats Age=45 and Age=46 as almost identical, losing categorical structure.	Done `0.95049`
`A2.2`	Logistic Regression — One-Hot All Features KEY FINDING: One-hot encode ALL 13 features (even "numeric" ones like Age). This lets LR learn "Age=45 → coefficient X" independently from "Age=65 → coefficient Y". Achieves 0.9555 — matching CatBoost! Proves all features are effectively categorical.	Done `0.95552`
`A2.3`	Ridge Classifier Like logistic regression but with L2-regularized least squares loss instead of log loss. Faster to train, different decision boundary from LR. Adds linear model diversity.	Planned
`A2.4`	SGD Classifier — Stochastic Gradient Descent Online learning with log loss. Processes samples one-at-a-time instead of full batch. Different optimization trajectory may find different local optima.	Planned
`A2.5`	Logistic Regression — ElasticNet Regularization Combines L1 (sparsity, feature selection) and L2 (shrinkage) penalties. l1_ratio controls the mix. L1 may zero out useless one-hot features, reducing overfitting.	Planned
`A2.6`	LR — Polynomial Feature Interactions (Degree 2) Create ALL pairwise interaction features: Age×ChestPain, BP×Cholesterol, etc. Lets LR capture "Age=55 AND ChestPain=Typical → high risk" relationships that single features miss. Feature count explodes but may capture non-linear patterns.	Planned
`A2.7`	LR — One-Hot + Target Encoding Combined Feed LR both one-hot features AND target-encoded features simultaneously. One-hot captures per-value patterns; target encoding captures smoothed population rates.	Planned
`A2.8`	LR — IsolationForest Anomaly Scores From a top-scoring public LR notebook: add anomaly scores from IsolationForest as a feature. Flags unusual patients whose feature combinations are rare in the training data.	Planned
A3. Ensemble Tree Methods (Non-Boosted)
`A3.1`	Random Forest — Default Ensemble of 500 independent decision trees, each trained on a random bootstrap sample. Weaker than boosting (0.952) but makes completely different errors — valuable for diversity.	Done `0.95222`
`A3.2`	ExtraTrees — Extremely Randomized Trees Like Random Forest but splits are chosen randomly instead of optimally. Even more variance reduction. Much faster to train. Different error patterns.	Planned
`A3.3`	Balanced Random Forest Random Forest with class-weighted sampling. Our dataset is 55/45 split — slight imbalance. This ensures each tree sees balanced classes, which may improve AUC.	Planned
`A3.4`	Random Forest on GBDT Residuals CLEVER TRICK from a public notebook (LB 0.95395): Train CatBoost+XGBoost first, then train Random Forest on what they get WRONG (residuals). RF captures patterns the boosters miss. Final prediction = GBDT + RF correction.	Planned
A4. Neural Networks — Tabular-Specific
`A4.1`	RealMLP — State-of-the-Art Tabular MLP From the pytabkit library. Uses Mish activation (smooth ReLU), Piecewise Linear Representation (PLR) embeddings for numeric features, and careful initialization. PUBLIC NOTEBOOK ACHIEVES LB 0.95397 — better than our current best! High priority.	Planned
`A4.2`	TabM — Tabular Model from pytabkit Second-best public solo model (LB 0.95381). TabM-mini-normal architecture. Different from RealMLP — could add neural net diversity to our ensemble.	Planned
`A4.3`	FT-Transformer — Feature Tokenizer Transformer Converts each feature to a token embedding, then applies Transformer self-attention. Can learn complex feature interactions that tree models miss. Different inductive bias: trees do axis-aligned splits, transformers do attention-weighted combinations.	Planned
`A4.4`	SAINT — Self-Attention + Inter-Sample Attention Novel architecture: Attention between features (like FT-Transformer) PLUS attention between samples (each sample attends to similar training samples). Captures both feature relationships and patient-to-patient similarities. Unique approach for tabular data.	Planned
`A4.5`	TabTransformer Transformer applied only to categorical feature embeddings, then concatenated with numeric features and fed through MLP. Lighter than FT-Transformer. Good for our all-categorical data.	Planned
`A4.6`	NODE — Neural Oblivious Decision Ensembles Differentiable version of decision trees. Learns soft, gradient-optimizable tree splits. Bridges the gap between GBDT and neural nets. Can be trained end-to-end with backpropagation.	Planned
`A4.7`	DANet — Deep Abstract Network Tabular-specific deep learning with abstract layers that learn hierarchical feature representations. Less common architecture — adds novel diversity.	Planned
`A4.8`	Simple 3-Layer MLP Baseline neural net: Input → 256 → 128 → 64 → 1 with BatchNorm, ReLU, Dropout(0.3). Simple but establishes the neural net floor. Our HeartMLP variant achieved 0.9531.	Planned
`A4.9`	MLP with Periodic Embeddings From a top-voted Kaggle discussion (52 upvotes). Maps numeric features through sine/cosine functions before feeding to MLP. Periodic embeddings help neural nets learn non-monotonic relationships (e.g., very low AND very high cholesterol both indicate risk).	Planned
`A4.10`	MLP with PLR Embeddings Piecewise Linear Representation: each numeric feature is split into bins with learned linear interpolation. Turns continuous features into rich representations. Key ingredient of RealMLP's success.	Planned
A5. In-Context Learning / Foundation Models
`A5.1`	TabPFN v1 — Zero-Shot Tabular Classification Pre-trained transformer that classifies tabular data WITHOUT training on your dataset. Learned to classify from millions of synthetic datasets. UNUSED by any public notebook — innovation opportunity! Limitation: doesn't scale to 630K rows directly, needs subsampling.	Planned
`A5.2`	TabPFN v2 — Scalable Version Improved TabPFN with support for larger datasets via chunked inference. May handle our 630K rows with batching. Worth testing for pure diversity.	Planned
`A5.3`	HyperFast — Meta-Learned Hypernetwork A neural net that GENERATES the weights of a classifier for your specific dataset. Instant classification without training. Completely different approach from everything else.	Planned
A6. Other Classifiers
`A6.1`	SVM — Radial Basis Function Kernel Support Vector Machine with RBF kernel. Projects features into infinite-dimensional space and finds the maximum-margin decision boundary. Very different from tree/linear models. Slow on 630K rows but adds maximum diversity.	Planned
`A6.2`	K-Nearest Neighbors Instance-based: classify each patient by majority vote of K most similar patients in training data. No model learned at all — pure memorization. Weak alone but captures local neighborhood patterns.	Planned
`A6.3`	Gaussian Naive Bayes Assumes features are independent given the class. Obviously wrong (features correlate) but the resulting probability estimates are well-calibrated. Fast, different, adds diversity.	Planned
`A6.4`	Quadratic Discriminant Analysis Fits a Gaussian to each class and classifies by likelihood ratio. Quadratic (non-linear) decision boundary. Captures class-specific covariance structures.	Planned
A7. AutoML
`A7.1`	AutoGluon — Best Quality Amazon's AutoML: automatically trains dozens of models (NN, GBDT, KNN, etc.), performs multi-layer stacking, and selects the best ensemble. "best_quality" preset uses more models and longer training. May discover combinations we missed.	Planned
`A7.2`	AutoGluon — High Quality Faster AutoGluon preset. Same approach but fewer models and less stacking depth. Good for quick comparison.	Planned
`A7.3`	H2O AutoML Alternative AutoML framework	Planned
B1. Encoding Strategies
`B1.1`	Frequency Encoding Replace each category value with how often it appears in train+test combined. Rare values get low frequencies. Captures population-level patterns. Example: "ChestPain=Typical" appears in 47% of data → mapped to 0.47.	Done
`B1.2`	Target Encoding (Smoothed) Replace each value with smoothed P(disease \| value) from training data. α-smoothing blends per-value rate with global rate: TE = (n×mean + α×global) / (n + α). Prevents overfitting on rare categories. We sweep α = 1, 5, 10, 50, 100.	Done `0.95552`
`B1.3`	Original UCI Target Statistics THE MOST IMPACTFUL FEATURE. Compute P(disease \| feature_value) from the ORIGINAL 303 patients (real clinical diagnoses), not the noisy synthetic data. Example: "Thallium=7" → 85% disease rate in original data. Adds ~0.006 AUC — massive gain.	Planned
`B1.4`	Binary/One-Hot Encoding Standard one-hot: each category value becomes a 0/1 column. "ChestPain" with 4 values becomes 4 binary columns. Creates sparse high-dimensional features. Works surprisingly well for LR.	Planned
`B1.5`	All-Categorical Treatment Treat ALL features (including "numeric" Age, BP, Cholesterol) as categorical. Works because: Age has only ~50 unique values in 630K rows — it IS categorical in this dataset. CatBoost with all-cat achieves 0.9555.	Done `0.95560`
`B1.6`	Target encoding (in-fold) Mean target per feature value, computed within CV fold	Planned
`B1.7`	Original UCI target stats Mean/median/std/count from original data	Done `0.95558`
`B1.8`	Leave-one-out encoding LOO target encoding — less biased than mean	Planned
`B1.9`	WoE (Weight of Evidence) Log(P(1	Planned
`B1.10`	James-Stein encoding Shrinkage-based target encoding	Planned
`B1.11`	Binary encoding Binary representation of categorical levels	Planned
`B1.12`	Helmert encoding Compares each level to mean of subsequent levels	Planned
B2. Categorical/Numerical Treatment Combinations
`B2.1`	Pairwise Interactions Create features like Age×ChestPain, Sex×Thallium, etc. Captures "combination" effects that single features miss. Example: "Male + Typical Angina" may be higher risk than either alone.	Planned
`B2.2`	Ratio Features Create ratios: MaxHR/Age (heart rate relative to age), Cholesterol/Age, etc. Clinically meaningful: high heart rate is more concerning in older patients.	Done
`B2.3`	Age Binning × Interactions Group Age into bins (young/middle/old) then interact with other features. Captures age-dependent risk factors.	Planned
`B2.4`	High-info categoricals only: {Thal,ChestPain,Vessels} as cat, rest numerical Top 3 categorical predictors	Planned
`B2.5`	Reverse: treat Tier3 features (BP,FBS,Chol) as categorical, rest numerical They have low info anyway	Planned
`B2.6`	Age + MaxHR + STdep as numerical, everything else categorical Only true continuous features	Planned
`B2.7`	Optimal split search (Optuna) Let optimizer find best cat/num assignment	Planned
B3. Feature Interactions & Transformations
`B3.1`	Feature Selection via Importance Use CatBoost feature importance to rank features, then train with only the top-K. If some features add noise, removing them may improve generalization.	Done `0.95545`
`B3.2`	Recursive Feature Elimination Iteratively remove the least important feature and retrain. Finds the minimal feature set that maintains (or improves) AUC.	Planned
`B3.3`	Forward Feature Selection Start with zero features, add one at a time (the one that improves AUC most). Greedy but finds good feature subsets.	Planned
`B3.4`	Log/sqrt/square transforms Non-linear transforms of numericals	Planned
`B3.5`	KBinsDiscretizer (10 bins) Binned numericals — from top baseline notebook	Planned
`B3.6`	Clinical risk composites Framingham-like score, Duke clinical score	Planned
`B3.7`	PCA components (top 5) Dimensionality-reduced features	Planned
`B3.8`	UMAP embeddings (2D-3D) Non-linear dimensionality reduction	Planned
`B3.9`	Cluster assignments (KMeans k=5,10) Cluster membership as feature	Planned
`B3.10`	IsolationForest anomaly scores From top LR notebook	Planned
`B3.11`	Autoencoder reconstruction error Learn normal patterns, deviation = risk	Planned
`B3.12`	KNN distance features Distance to k-nearest of each class	Planned
B4. Multi-Dataset Integration
`B4.1`	Original UCI target stats (merged) Already using	Done
`B4.2`	Original data as extra training rows Concatenate with weight adjustment	Planned
`B4.3`	Original data weighted by similarity Adversarial validation to find similar samples	Planned
`B4.4`	Multi-source target encoding (Cleveland + Statlog + synthetic) Different target encodings from each source	Planned
`B4.5`	Domain adaptation: original → synthetic Transfer learning approach	Planned
C1. Cross-Validation Schemes
`C1.1`	Multi-Seed Training (5 Seeds) Train the SAME model architecture 5 times with different random seeds (42, 123, 456, 789, 2024). Each seed produces slightly different trees → rank-blending reduces variance. Typically adds +0.0001–0.0003 AUC for free.	Done
`C1.2`	Multi-Fold Variants (5-fold vs 10-fold vs 20-fold) Compare different numbers of CV folds. More folds = more training data per fold but more variance in OOF estimates. Find the sweet spot.	Done
`C1.3`	Multi-seed (5 seeds × 10 folds) Running now for top models	Running
`C1.4`	RepeatedStratifiedKFold (3×10) 30 folds, averaged	Planned
`C1.5`	Stratified on Thallium×Target Ensures balanced Thallium distribution	Planned
`C1.6`	GroupKFold by feature clusters Prevents data leakage if clusters exist	Planned
C2. Noise-Aware Training
`C2.1`	Cross-Stacking — LR → CatBoost Train LR on 10 folds, save OOF predictions. Use LR_pred as feature #14 for CatBoost. CatBoost learns WHEN the linear model is right/wrong. Our strongest technique: +0.0001 AUC.	Planned
`C2.2`	Cross-Stacking — Multiple Base Models Stack predictions from LR + XGB + LGBM as features for CatBoost meta-learner. Risk: too many correlated features → overfitting (confirmed: CB+LR+CBO stack was WORSE).	Planned
`C2.3`	Blending — Rank-Based Convert each model's predictions to ranks (1 to N), then average ranks. Calibration-invariant: doesn't matter if Model A predicts 0.7 and Model B predicts 0.95 for the same sample. Better than probability averaging for diverse model families.	Planned
`C2.4`	Hill-Climbing Ensemble Greedy algorithm: start with best model, try adding each remaining model, keep the one that improves blend AUC most. Repeat. Our best result (0.955751) came from hill-climbing over 105+ models — found that te_cb_a10 uniquely complements the CB+LR stack.	Planned
`C2.5`	Bayesian Blend Optimization Use Optuna to find optimal blend weights instead of greedy hill-climbing. Searches continuous weight space. May find better weights than equal weighting.	Planned
`C2.6`	Symmetric cross-entropy loss Noise-robust loss function	Planned
C3. Post-Processing
`C3.1`	Pseudo-Labeling — Confident Test Predictions Semi-supervised: use our model's most confident test predictions (prob > 0.95 or < 0.05) as additional training data. Increases effective training set. Must be careful with threshold choice.	Done
`C3.2`	Pseudo-Labeling — Multi-Round Iterative pseudo-labeling: train → predict test → add confident predictions → retrain → repeat. Each round should improve, but risks confirmation bias (reinforcing model errors).	Planned
`C3.3`	Knowledge Distillation Train a large ensemble, then train a single model to mimic the ensemble's SOFT predictions (probabilities) rather than hard labels. Transfers ensemble knowledge into one model.	Planned
`C3.4`	Temperature scaling From CB+XGB+Residual RF notebook	Planned
`C3.5`	Pseudo-labeling (iterative) High-confidence test predictions as training data	Planned
D1. Level-0 → Level-1 Stacking
`D1.1`	SHAP Feature Importance SHapley Additive exPlanations: game-theory-based feature importance. Shows WHICH features drive predictions for EACH patient, not just globally. Required for the academic report.	Planned
`D1.2`	Partial Dependence Plots Show how each feature affects prediction probability when all other features are held constant. Reveals non-linear relationships: e.g., risk increases sharply above Age=55.	Planned
`D1.3`	Feature Interaction Analysis SHAP interaction values: which feature PAIRS interact most? E.g., does Thallium + Exercise Angina have a synergistic effect beyond their individual contributions?	Planned
`D1.4`	Neural Network Meta-Learner Train a neural network as the Level-1 meta-learner over OOF predictions instead of CatBoost.	Planned
`D1.5`	3-level stacking L0: diverse models, L1: blenders, L2: final	Planned
D2. Blending Strategies
`D2.1`	Cross-Validation Stability Analysis How much does OOF AUC vary across folds? High variance = model is unstable. Important for report: shows our results are robust, not lucky folds.	Done `0.95547`
`D2.2`	Noise Ceiling Estimation Quantify the theoretical maximum AUC given ~11% label noise. Important for report: explains WHY we can't exceed ~0.956 regardless of model choice.	Done `0.95568`
`D2.3`	Learning Curves Train on 10%, 20%, ..., 100% of data. If AUC still improving at 100%, more data would help. If it plateaus, we're data-saturated (likely given 630K rows and noise ceiling).	Planned
`D2.4`	Bayesian blend weight optimization Optuna on blend weights	Planned
`D2.5`	Random percentile sampling From "Blend the Blender" (LB 0.954)	Planned
`D2.6`	Geometric mean blend Alternative to arithmetic mean	Planned
`D2.7`	Power mean blend (p=0.5, p=2) Generalized mean with tunable p	Planned
D3. Diversity Maximization
`D3.1`	Multi-Dataset Generalization Apply our FULL pipeline to the original UCI Cleveland, Hungarian, Switzerland, and VA datasets. Shows our approach generalizes beyond the competition data. Prof specifically requested this.	Planned
`D3.2`	Feature-bagged ensembles Each model sees different feature subset	Planned
`D3.3`	Row-sampled ensembles Bootstrap aggregation with different samples	Planned
`D3.4`	Architecture-diverse blend 1 GBDT + 1 Linear + 1 NN + 1 TabPFN	Planned
E1. Beyond Standard Approaches
`E1.1`	Noise-transition matrix estimation Estimate P(observed\	Idea
`E1.2`	Co-teaching Two models trained simultaneously, each teaching the other on clean samples (Han et al. 2018)	Idea
`E1.3`	DivideMix Semi-supervised learning + noisy label handling (Li et al. 2020)	Idea
`E1.4`	Confident Learning with cleanlab Characterize label noise, prune/reweight/fix samples	Idea
`E1.5`	Feature importance × noise analysis Which features contribute most to misclassification in the 11% noisy zone?	Idea
`E1.6`	Conditional ensemble Different models for different regions of feature space (e.g., high Thallium vs low)	Idea
`E1.7`	Prototype-based classification Learn class prototypes, classify by distance. Robust to noise.	Idea
`E1.8`	Self-training with high-confidence filtering Use model's own confident predictions to augment training	Idea
`E1.9`	Multi-task learning Predict target + reconstruct features simultaneously	Idea
`E1.10`	Curriculum learning Train on "easy" samples first, progressively add harder ones	Idea
E2. Data-Level Innovation
`E2.1`	Synthetic minority oversampling (SMOTE) Address slight class imbalance	Planned
`E2.2`	Adversarial data augmentation Generate adversarial perturbations to improve robustness	Idea
`E2.3`	Feature permutation importance analysis Beyond standard — permutation within folds for stable estimates	Planned
`E2.4`	SHAP-based feature selection Select features by SHAP importance, not just correlation	Planned
`E2.5`	Boruta feature selection Shadow feature comparison method	Planned
F1. Heart Disease ML Literature
`Source`	Best AUC Year	Unknown
`Cleveland UCI (original)`	~0.90-0.92 2010s	Unknown
`Grinsztajn et al. 2022`	— 2022	Unknown
`TabZilla (NeurIPS 2023)`	— 2023	Unknown
`Gorishniy et al. 2022`	— 2022	Unknown
`Regularization Cocktails 2023`	— 2023	Unknown
`TabPFN (Hollmann et al. 2023)`	— 2023	Unknown
`HyperFast (Bonet et al. 2024)`	— 2024	Unknown
F2. Kaggle Playground Series — Winning Patterns
`Competition`	Metric Winning Approach	Unknown
`S3E7 (Cirrhosis)`	Log Loss CatBoost + stacking	Unknown
`S3E8 (Kidney Stone)`	AUC LightGBM + feature eng	Unknown
`S3E12 (Kidney Disease)`	AUC Ensemble GBDT + NN	Unknown
`S3E17 (Wine)`	QWK CatBoost ordinal	Unknown
`S4E1 (Binary)`	AUC Stacking + original data	Unknown
`S4E8 (Mushroom)`	MCC CatBoost native categoricals	Unknown
`S5E2 (Backorder)`	AUC GBDT + imbalanced learning	Unknown
`Common patterns:`	Planned experiment:	Unknown
`—`	— Original data integration	Unknown
`—`	— Multi-seed ensembling	Unknown
`—`	— GBDT + linear blend	Unknown
`—`	— Proper CV (≥5 fold stratified)	Unknown
F3. Tabular Deep Learning State-of-the-Art (2024-2025)
`Model`	Paper Performance vs GBDT	Unknown
`RealMLP (pytabkit)`	Holzmüller et al. 2024 Competitive on medium datasets	Unknown
`TabM`	Gorishniy et al. 2024 Matches GBDT on many benchmarks	Unknown
`FT-Transformer`	Gorishniy et al. 2021 Competitive on high-cardinality	Unknown
`TabPFN v2`	Hollmann et al. 2025 SOTA on small-medium tabular	Unknown
`ModernNCA`	Ye et al. 2024 Strong on noisy data	Unknown
`ExcelFormer`	Chen et al. 2024 Excel at feature interaction	Unknown
`GRANDE`	Marton et al. 2024 Best of both worlds	Unknown
Wave 1 — Foundation (DONE ✅)
`1`	Raw baselines: LR, CatBoost, XGBoost, LightGBM, RF Planned experiment: Raw baselines: LR, CatBoost, XGBoost, LightGBM, RF	Done
`2`	Key FE: one-hot, freq encoding, orig target stats, interactions Planned experiment: Key FE: one-hot, freq encoding, orig target stats, interactions	Done
`3`	Selective blending Planned experiment: Selective blending	Done
`4`	Multi-seed top models Planned experiment: Multi-seed top models	Running
`5`	Optuna CatBoost tuning Planned experiment: Optuna CatBoost tuning	Running
Wave 2 — Diverse Models (NEXT)
`6`	RealMLP + TabM (pytabkit) — proven top solo performers Planned experiment: RealMLP + TabM (pytabkit) — proven top solo performers	Planned
`7`	TabPFN — novel, unused publicly Planned experiment: TabPFN — novel, unused publicly	Planned
`8`	FT-Transformer — different DL architecture Planned experiment: FT-Transformer — different DL architecture	Planned
`9`	LR + polynomial + anomaly features Planned experiment: LR + polynomial + anomaly features	Planned
`10`	ExtraTrees, HistGradientBoosting Planned experiment: ExtraTrees, HistGradientBoosting	Planned
Wave 3 — Advanced FE
`11`	Categorical treatment sweep (B2.1-B2.7) Planned experiment: Categorical treatment sweep (B2.1-B2.7)	Planned
`12`	Target encoding (in-fold) for GBDT models Planned experiment: Target encoding (in-fold) for GBDT models	Planned
`13`	Label smoothing on CatBoost Planned experiment: Label smoothing on CatBoost	Planned
`14`	KBins + PCA + cluster features Planned experiment: KBins + PCA + cluster features	Planned
`15`	Multi-dataset: original data as extra rows Planned experiment: Multi-dataset: original data as extra rows	Planned
Wave 4 — Stacking & Meta-Learning
`16`	Proper Level-0/Level-1 stacking with LR meta-learner Planned experiment: Proper Level-0/Level-1 stacking with LR meta-learner	Planned
`17`	Hill-climbing blend weight optimization Planned experiment: Hill-climbing blend weight optimization	Planned
`18`	3-level stacking pyramid Planned experiment: 3-level stacking pyramid	Planned
`19`	OOF correlation analysis for diversity selection Planned experiment: OOF correlation analysis for diversity selection	Planned
Wave 5 — Innovation & Refinement
`20`	Noise-aware training — clean_prob features, noise-aware reweighting (no signific Planned experiment: Noise-aware training — clean_prob features, noise-aware reweighting (no signific	Done
`21`	Pseudo-labeling (see Phase 6 below) Planned experiment: Pseudo-labeling (see Phase 6 below)	Running
`22`	Conditional/confidence ensembles — conf_weighted 0.955677, meta_conf 0.955670 Planned experiment: Conditional/confidence ensembles — conf_weighted 0.955677, meta_conf 0.955670	Done
`23`	AutoGluon run Planned experiment: AutoGluon run	Planned
`24`	Final mega-blend (hill-climbing done: 0.955751) Planned experiment: Final mega-blend (hill-climbing done: 0.955751)	Running
Wave 6 — "Big Ideas" (Breakthrough Attempts)
`W6.1`	Pseudo-labeling / Self-training Use best ensemble to predict test set. High-confidence samples (>0.90 prob) get pseudo-labels and are added to training. Effectively 800K+ training samples. Proven technique in Playground Series competitions where synthetic data benefits from seeing test distribution. Testing thresholds: 0.90, 0.85,	Running
`W6.2`	UCI-trained model as meta-feature Train a model on the 920 original UCI heart disease samples (cleaner labels, no synthetic noise). Run predict_proba on our 630K train + 270K test. The UCI model's prediction becomes a meta-feature — it captures non-linear patterns from the original distribution that our synthetic-data models can't l	Planned
`W6.3`	Adversarial validation + sample reweighting Train a classifier to distinguish train (label=0) from test (label=1). Each training sample gets a "test-likeness" score. Upweight test-like training samples during CatBoost training via `sample_weight`. This aligns training distribution with test, potentially closing the CV→LB gap (currently 0.0018	Planned
`W6.4`	XGBoost/LightGBM with full FE stack We've only run XGB/LGBM with raw features. Our best CatBoost uses: all-categorical + freq encoding + orig stats. Applying the same FE to XGB/LGBM could produce models at 0.9555+ that are structurally different from CatBoost — real diversity for blending.	Planned
`W6.5`	Probability calibration Platt scaling or isotonic regression on OOF predictions before blending. If our model probabilities are miscalibrated (systematically over/under-confident), calibration could improve AUC. Apply per-model before rank-blending.	Planned
`W6.6`	Feature interaction mining Systematic pairwise ratio/product/difference search across all 13 features. CatBoost handles interactions implicitly but explicit features could help LR and the meta-learner. Top interactions selected by mutual information with target.	Planned
`W6.7`	Multi-round pseudo-labeling Iterative: pseudo-label → retrain → re-predict test → pseudo-label again. Each round refines confidence. Risk of confirmation bias, so track AUC per round carefully.	Planned
Wave 7 — Literature-Inspired Ideas (from Deep Research Review)
`W7.1`	Slow-learning deep CatBoost Gemini report, top Kaggle solutions	Running
`W7.2`	Isotonic probability calibration Gemini report Section 2.2	Running
`W7.3`	Autoencoder latent features Alghamdi et al. 2024, Gemini Section 3.1	Running
`W7.4`	KNN diversity models Systematic review, Chandrasekhar et al.	Running
`W7.5`	SVM with RBF kernel Multiple papers, systematic review	Running
`W7.6`	AdaBoost + sklearn GBM Chandrasekhar et al., Jan et al.	Running
`W7.7`	Soft voting ensemble Chandrasekhar et al. 2023	Planned
`W7.8`	Feature interaction mining Gemini Section 2.3	Planned
`W7.9`	Feature ablation (drop-one analysis) "SF-2" finding in literature	Planned
`W7.10`	Tabular-to-image + pretrained CNN VGG16 transfer learning paper (2024)	Planned
`W7.11`	CNN-BiLSTM hybrid Kayalvizhi et al. 2024	Planned
`W7.12`	Newton-Raphson optimization for NN Kayalvizhi et al. 2024	Planned
`W7.13`	RL-based model routing ScienceDirect 2025	Planned
`W7.14`	SHAP + LIME interpretability Multiple papers, Gemini Section 6	Planned
`W7.15`	Clinical composite features Domain knowledge, Gemini Section 2.3	Planned
Wave 8 — Dropped Ideas (Documented with Reasoning)
`D1`	GAN Data Augmentation (Dropped) Dropped: data is already synthetic (630K rows). More synthetic data adds noise, not signal.	Unknown
`D2`	SMOTE Oversampling (Dropped) Dropped: class balance is 55/45, barely imbalanced. SMOTE would add noise.	Unknown
`D3`	TabPFN (Dropped — OOM) Dropped: crashes on 630K rows. Designed for small datasets only.	Unknown
`D4`	Clean Probability Feature (Dropped) Dropped: confirmed no benefit. CatBoost already captures this via target encoding.	Unknown `0.955477`
`D5`	Mega meta-learner (62 models) Planned experiment: Mega meta-learner (62 models)	Unknown `0.954874`
`D6`	CB+LR+CBO cross-stacking Planned experiment: CB+LR+CBO cross-stacking	Unknown `0.955669`
`D7`	AttGRU-HMSI / LSTM-XGBoost Planned experiment: AttGRU-HMSI / LSTM-XGBoost	Unknown
`D8`	Target spilling as ensemble nudge Planned experiment: Target spilling as ensemble nudge	Unknown

Term	What It Means
AUC-ROC	Area Under the Receiver Operating Characteristic curve. Measures how well the model distinguishes between heart disease present/absent. 1.0 = perfect, 0.5 = random guessing. Our target: ≥0.954.
OOF (Out-of-Fold)	Our local evaluation method. We split data into 10 folds. Train on 9, predict on the 1 held out. Repeat 10 times. This gives an unbiased estimate of model quality without using test data.
LB (Leaderboard)	Kaggle's official score. They evaluate our predictions on hidden test labels. We can submit 10 times per day.
CV→LB Gap	Difference between our local OOF and Kaggle LB score. Ours is consistently ~0.00183, meaning our CV is reliable.
CatBoost	A gradient boosted decision tree algorithm by Yandex. Excels with categorical features. Our best single-model framework.
Cross-Stacking	Using one model's OOF predictions as an input feature for another model. E.g., LR predictions become a feature for CatBoost. Our strongest technique.
Feature Engineering (FE)	Creating new input features from existing ones. Our key FE: frequency encoding, original UCI stats, target encoding, all-categorical treatment.
Rank Blending	Converting predictions to ranks before averaging. More robust than averaging raw probabilities because it's invariant to each model's calibration.
Hill-Climbing	Greedy algorithm that tries adding each model to the ensemble and keeps the one that improves the score most. Repeats until no improvement.
Label Noise	~11% of training samples have contradictory labels (same features, different diagnosis). This is real clinical ambiguity, not data error. Creates a ceiling on achievable AUC.
Multi-Seed	Training the same model with different random seeds. Each seed gives slightly different results. Blending them reduces variance.
Target Encoding	Replacing a categorical value with the average target rate for that category (e.g., "chest pain type 4" → 0.72 heart disease rate). Must be done carefully to avoid leakage.
Pseudo-Labeling	Using our model's confident predictions on test data as additional training labels. Semi-supervised technique.
Adversarial Validation	Training a model to tell train from test data. If it can't (AUC ≈ 0.5), the distributions match. If it can, we need to reweight samples.

🫀 S6E2 Heart Disease

🎯 What are we doing?

🔬 Our Approach (Pipeline)

💡 Key Findings So Far

📈 Progress Over Time

📊 Models by Category

📂 Model Categories Explained

🔬 All Model Results — 143 Models

📤 Kaggle Submissions

📈 OOF vs Leaderboard Score

🏆 Leaderboard Context

🧪 Our Novel Approaches

💡 Idea 1: Label Denoising via KNN Consensus

💡 Idea 2: Confidence-Weighted Dynamic Ensemble

💡 Idea 3: Noise-Aware Sample Reweighting

💡 Idea 4: Bayesian Posterior Stacking

💡 Idea 5: Autoencoder Latent Features

💡 Idea 6: Adversarial Validation

💡 Idea 7: Pseudo-Labeling with Confidence Thresholds

💡 Idea 8: Slow-Learning Deep CatBoost

📋 Full Experiment Plan — 211 Experiments

📖 Glossary — Key Terms Explained