🎯 What are we doing?

We're competing in Kaggle Playground Series S6E2 — predicting heart disease from 13 clinical features (age, chest pain, cholesterol, etc.) on a 630,000-row synthetic dataset. The metric is AUC-ROC (Area Under the ROC Curve) — higher = better at distinguishing sick vs healthy patients. We're a team of 3 working on this as an academic project. Presentations March 28, report April 4.

Best Local Score (OOF): 0.955708 (CB+LR Cross-Stack, 5-seed)
Best Leaderboard Score: 0.95392 (rank 175 / ~2,700 teams)
Target (First Page): 0.95400 (need +0.00008 more)
Models Trained: 127 (88 ready for submission)
Submissions Made: 5 / 10 (5 remaining today)
Days Remaining: 13 (deadline: Feb 28, 2026)

🔬 Our Approach (Pipeline)

1. Data
630K rows, 13 features
2. Feature Eng.
Freq + Orig Stats + TE
3. Models
CatBoost, XGB, LR, NN
4. Stacking
LR preds → CatBoost
5. Blending
Hill-climb ensemble

💡 Key Findings So Far

1. Treating ALL features as categorical (+0.005 AUC over official split) — even numeric ones like Age, BP.
2. Original UCI dataset target statistics as features (+0.006 AUC) — the #1 trick from top notebooks.
3. ~11% of labels are noise (contradictory samples) — creates a ceiling around 0.956 AUC.
4. Cross-stacking (LR predictions as CatBoost feature) is our strongest technique.
5. CV→LB gap is a stable 0.00183 — we can predict LB score from local results.

📈 Progress Over Time

Each dot = one model trained. Blue line = cumulative best. Shows how our performance improved throughout the day.

📊 Models by Category

How many models we trained in each approach category. Hover for details.

📂 Model Categories Explained

What each type of model means and how it works.
Category | Count | Best OOF | What It Means
Blend | 9 | 0.955692 | Rank-average or weighted combination of 2+ models. Simple but effective.
Multi-seed | 18 | 0.955623 | Same model architecture trained with different random seeds, then blended. Reduces variance.
Neural Network | 5 | 0.953317 | Deep learning models (MLP). Weaker than GBDTs on this tabular dataset.
Novel | 13 | 0.955677 | Our team's original ideas — innovations we designed for this competition. Important for academic report.
Single Model | 11 | 0.955595 | Individual ML model (CatBoost, XGBoost, LightGBM, LR, RF) — one algorithm, one config.
Stacking | 22 | 0.955708 | Model A's predictions become an input feature for Model B. Captures what A learned. Our best approach.
Sweep | 39 | 0.955468 | Systematic ablation experiments testing which features help/hurt when treated as categorical.
Target Encoding | 10 | 0.955702 | Replace categorical values with the average target (heart disease rate) for that category. Adds signal.

🔬 All Model Results — 143 Models

Click any model for full details. ✅ = can submit to Kaggle, ❌ = evaluation only.

Model & Description | Submit
Logistic Regression — Raw Features
Logistic regression on raw numeric features without any encoding. Baseline floor.
CatBoost — Raw Features
CatBoost with default parameters on raw features. No feature engineering.
XGBoost — Raw Features
XGBoost with default parameters on raw features. Baseline comparison.
LightGBM — Raw Features
LightGBM with default parameters on raw features. Baseline comparison.
Random Forest — Raw Features
Random Forest (500 trees) on raw features. Weakest baseline.
Logistic Regression — One-Hot All Features
LR with all 13 features one-hot encoded (even numeric ones like Age). Surprisingly competitive at 0.9555 because low cardinality lets LR learn per-val...
CatBoost — Original UCI Target Stats
CatBoost with features mapped to P(disease|value) from the original 303-patient UCI dataset. Key insight: the original data has clean clinical labels ...
XGBoost — Original UCI Target Stats
XGBoost with original UCI target statistics as features.
LightGBM — Original UCI Target Stats
LightGBM with original UCI target statistics as features.
CatBoost — Pairwise Feature Interactions
CatBoost with manually created pairwise interaction features (e.g., Age×ChestPain).
CatBoost — Frequency Encoding ⭐
CatBoost with frequency encoding (how often each value appears in train+test). Best single-seed model at 0.9556. The frequency captures population-lev...
GBDT Blend — CB+XGB+LGBM Average
Simple average of CatBoost, XGBoost, and LightGBM predictions. Blend of 3 GBDT families.
Original Stats Blend
Blend of all models using original UCI target statistics as features.
Grand Blend — All Phase 1 Models
Rank-blend of all Phase 1 models. Kitchen-sink approach.
CB + LR Blend
Blend of CatBoost (freq) and Logistic Regression (one-hot). Two very different model families.
Top-3 Blend — CB_freq + CB_orig + LR ⭐
Rank-blend of the 3 best single models. First Kaggle submission: LB 0.95384.
Top-4 Blend
Rank-blend of top 4 models from Phase 1.
CB Freq — 5-Seed Average
Average of 5 CatBoost freq models with different random seeds. Reduces initialization variance.
CB Freq — Seed 123
CatBoost freq encoding, random seed 123.
CB Freq — Seed 2024
CatBoost freq encoding, random seed 2024.
CB Freq — Seed 42
CatBoost freq encoding, random seed 42.
CB Freq — Seed 456
CatBoost freq encoding, random seed 456.
CB Freq — Seed 789
CatBoost freq encoding, random seed 789.
HeartMLP — Custom 3-Layer Neural Net
Custom MLP (256→128→64, BatchNorm, Dropout 0.3) trained on MPS GPU. Uses one-hot + freq + orig stats features. OOF 0.9531.
LR One-Hot — 5-Seed Average
Average of 5 LR one-hot models. LR is near-deterministic so seed variation is minimal.
LR One-Hot — Seed 123
LR one-hot, random seed 123.
LR One-Hot — Seed 2024
LR one-hot, random seed 2024.
LR One-Hot — Seed 42
LR one-hot, random seed 42.
LR One-Hot — Seed 456
LR one-hot, random seed 456.
LR One-Hot — Seed 789
LR one-hot, random seed 789.
CB OrigStats — 5-Seed Average
Average of the five CatBoost UCI-stats seed models (model id: 20_cb_origstats_multiseed_avg).
CB OrigStats — Seed 123
CatBoost with UCI stats, seed 123.
CB OrigStats — Seed 2024
CatBoost with UCI stats, seed 2024.
CB OrigStats — Seed 42
CatBoost with UCI stats, seed 42.
CB OrigStats — Seed 456
CatBoost with UCI stats, seed 456.
CB OrigStats — Seed 789
CatBoost with UCI stats, seed 789.
RealMLP — 3-Seed Blend
Blend of 3 RealMLP seeds. Neural net diversity for ensemble.
RealMLP — Tabular Foundation Model
RealMLP from pytabkit library. Uses Mish activation and PLR embeddings. State-of-the-art tabular neural net architecture.
RealMLP — Seed 123
RealMLP with random seed 123.
RealMLP — Seed 42
RealMLP with random seed 42.
RealMLP — Seed 456
RealMLP with random seed 456.
CB Freq — 5-Seed Rank Blend ⭐
Rank-blend (not average) of 5 CatBoost freq seeds. Rank-blending is calibration-invariant.
Multi-Seed Grand Blend
Blend of all multi-seed CatBoost variants (freq + origstats).
CB+LR Cross-Stack — 5-Seed Blend ⭐⭐
OUR STRONGEST SINGLE STRATEGY. LR predictions become feature #14 for CatBoost. CatBoost learns WHEN to trust the linear model. 5 seeds rank-blended → ...
CB+LR Cross-Stack — Seed 123
Cross-stacked CB+LR, seed 123.
CB+LR Cross-Stack — Seed 2024
Cross-stacked CB+LR, seed 2024.
CB+LR Cross-Stack — Seed 42
CatBoost trained with LR out-of-fold predictions as feature #14. Seed 42.
CB+LR Cross-Stack — Seed 456
Cross-stacked CB+LR, seed 456.
CB+LR Cross-Stack — Seed 789
Cross-stacked CB+LR, seed 789.
CB Stacked (v1) — Seed 123
Earlier stacking variant, seed 123.
CB Stacked (v1) — Seed 42
Earlier stacking variant, seed 42.
CB Stacked (v1) — Seed 456
Earlier stacking variant, seed 456.
CB+LR+CBO Stack — 5-Seed Blend
Adding CB-origstats OOF to the CB+LR stack. WORSE than CB+LR alone (0.955669 vs 0.955708) — too correlated, no new information added.
CB+LR+CBO Stack — Seed 123
Triple cross-stack, seed 123.
CB+LR+CBO Stack — Seed 2024
Triple cross-stack, seed 2024.
CB+LR+CBO Stack — Seed 42
CatBoost with both LR and CB-origstats OOF as features. Seed 42.
CB+LR+CBO Stack — Seed 456
Triple cross-stack, seed 456.
CB+LR+CBO Stack — Seed 789
Triple cross-stack, seed 789.
CB Mega-Stack — Seed 123
Mega-stack variant, seed 123.
CB Mega-Stack — Seed 42
CatBoost with ALL 4 model predictions (LR, CB, CBO, XGB) as stacked features.
CB Mega-Stack — Seed 456
Mega-stack variant, seed 456.
CB Mega-Stack v2 — Seed 123
Revised mega-stack, seed 123.
CB Mega-Stack v2 — Seed 42
Revised mega-stack, seed 42.
CB Mega-Stack v2 — Seed 456
Revised mega-stack, seed 456.
Feature Ablation — Drop Age
CatBoost trained WITHOUT the Age feature. Measures Age's contribution to prediction.
Feature Ablation — Drop Blood Pressure
CatBoost without Resting Blood Pressure.
Feature Ablation — Drop Chest Pain Type
CatBoost without Chest Pain Type (a strong predictor).
Feature Ablation — Drop Cholesterol
CatBoost without Cholesterol.
Feature Ablation — Drop EKG Results
CatBoost without EKG/ECG results.
Feature Ablation — Drop Exercise Angina
CatBoost without Exercise-Induced Angina.
Feature Ablation — Drop Fasting Blood Sugar
CatBoost without Fasting Blood Sugar > 120 mg/dl.
Feature Ablation — Drop Max Heart Rate
CatBoost without Maximum Heart Rate Achieved.
Feature Ablation — Drop Fluoroscopy Vessels
CatBoost without Number of Major Vessels Colored by Fluoroscopy (strong predictor).
Feature Ablation — Drop ST Depression
CatBoost without Exercise-Induced ST Depression value.
Feature Ablation — Drop Sex
CatBoost without the Sex feature.
Feature Ablation — Drop ST Slope
CatBoost without Slope of Peak Exercise ST Segment.
Feature Ablation — Drop Thallium
CatBoost without Thallium stress test result (THE strongest single predictor, ρ=0.61).
Cross-Stack + Multi-Seed TE (top 10)
Top-10 candidate blend from cross-stacking + target encoding multi-seed.
Cross-Stack + Multi-Seed TE (top 15)
Top-15 candidate blend.
Cross-Stack + Multi-Seed TE (top 20)
Top-20 candidate blend.
CB + Clean Prob K=3 (Fixed, No Leakage)
CatBoost with distance-weighted clean probability from 3 nearest UCI patients. Properly computed without target leakage. OOF 0.9555 — no improvement o...
Noise-Aware Reweighting (α=0.0)
CatBoost with sample weights based on label noise probability. α=0 = baseline (no reweighting).
Noise-Aware Reweighting (α=0.1)
Samples likely mislabeled get 10% lower weight. Identifies noisy samples via cross-fold disagreement.
Noise-Aware Reweighting (α=0.2)
Noisy samples downweighted by 20%. Stronger penalty for disagreement.
Noise-Aware Reweighting (α=0.3)
Noisy samples downweighted by 30%.
Noise-Aware Reweighting (α=0.5)
Noisy samples downweighted by 50%. Most aggressive noise suppression.
Confidence-Weighted Ensemble (Novel)
Instead of fixed blend weights, each model's contribution is weighted by its prediction confidence (distance from 0.5) per sample. Uncertain models ar...
LR + Clean Prob K=10
LR with clean prob, K=10 neighbors.
LR + Clean Prob K=20
LR with clean prob, K=20 neighbors.
LR + Clean Prob K=3
Logistic Regression with clean probability feature from 3 nearest UCI neighbors.
LR + Clean Prob K=50
LR with clean prob, K=50 neighbors.
LR + Clean Prob K=5
LR with clean prob, K=5 neighbors.
Mega Meta-Learner — 62 Models (Novel)
CatBoost trained on 62 base model predictions as features. Overfits badly (0.9549) — too many correlated features relative to the signal.
Meta-Learner with Confidence Features (Novel)
CatBoost meta-model trained on: base predictions + confidence scores + inter-model agreement. Learns which model to trust for each sample.
LR Sweep — All Features as Categorical
Logistic regression treating ALL 13 features as categorical (one-hot encoded).
LR Sweep — All Features as Numeric
LR treating all features as raw numbers. Much worse — loses categorical structure.
LR Sweep — All Categorical Except Age
LR with all features one-hot encoded EXCEPT Age kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except BP
LR with all features one-hot encoded EXCEPT BP kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Chest pain type
LR with all features one-hot encoded EXCEPT Chest pain type kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Cholesterol
LR with all features one-hot encoded EXCEPT Cholesterol kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except EKG results
LR with all features one-hot encoded EXCEPT EKG results kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Exercise angina
LR with all features one-hot encoded EXCEPT Exercise angina kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except FBS over 120
LR with all features one-hot encoded EXCEPT FBS over 120 kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Max HR
LR with all features one-hot encoded EXCEPT Max HR kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Number of vessels fluro
LR with all features one-hot encoded EXCEPT Number of vessels fluro kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except ST depression
LR with all features one-hot encoded EXCEPT ST depression kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Sex
LR with all features one-hot encoded EXCEPT Sex kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Slope of ST
LR with all features one-hot encoded EXCEPT Slope of ST kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — All Categorical Except Thallium
LR with all features one-hot encoded EXCEPT Thallium kept as numeric. Tests whether this feature works better as a number or category.
LR Sweep — Categorical + Age + BP
LR with officially categorical features one-hot encoded, plus Age + BP also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Age + Cholesterol
LR with officially categorical features one-hot encoded, plus Age + Cholesterol also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Age + Max HR
LR with officially categorical features one-hot encoded, plus Age + Max HR also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Age + ST depression
LR with officially categorical features one-hot encoded, plus Age + ST depression also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Age
LR with officially categorical features one-hot encoded, plus Age also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + BP + Cholesterol
LR with officially categorical features one-hot encoded, plus BP + Cholesterol also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + BP + Max HR
LR with officially categorical features one-hot encoded, plus BP + Max HR also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + BP + ST depression
LR with officially categorical features one-hot encoded, plus BP + ST depression also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + BP
LR with officially categorical features one-hot encoded, plus BP also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Cholesterol + Max HR
LR with officially categorical features one-hot encoded, plus Cholesterol + Max HR also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Cholesterol + ST depression
LR with officially categorical features one-hot encoded, plus Cholesterol + ST depression also treated as categorical. Incremental category addition t...
LR Sweep — Categorical + Cholesterol
LR with officially categorical features one-hot encoded, plus Cholesterol also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Max HR + ST depression
LR with officially categorical features one-hot encoded, plus Max HR + ST depression also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + Max HR
LR with officially categorical features one-hot encoded, plus Max HR also treated as categorical. Incremental category addition test.
LR Sweep — Categorical + ST depression
LR with officially categorical features one-hot encoded, plus ST depression also treated as categorical. Incremental category addition test.
LR Sweep — Numeric + Chest pain type as Category
LR with numeric features as-is, plus Chest pain type one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Numeric + EKG results as Category
LR with numeric features as-is, plus EKG results one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Numeric + Exercise angina as Category
LR with numeric features as-is, plus Exercise angina one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Numeric + FBS over 120 as Category
LR with numeric features as-is, plus FBS over 120 one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Numeric + Number of vessels fluro as Category
LR with numeric features as-is, plus Number of vessels fluro one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Numeric + Sex as Category
LR with numeric features as-is, plus Sex one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Numeric + Slope of ST as Category
LR with numeric features as-is, plus Slope of ST one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Numeric + Thallium as Category
LR with numeric features as-is, plus Thallium one-hot encoded. Tests adding one categorical feature to numeric baseline.
LR Sweep — Official Cat/Num Split
LR using the "official" categorical vs numeric distinction from feature metadata.
CatBoost + Target Encoding (α=10) ⭐
CatBoost with target-encoded features using smoothing α=10. Each category value is replaced by smoothed P(disease|value) from training folds. Combined...
LR + Target Encoding (α=10)
LR with one-hot + target encoding, smoothing α=10.
LR + Target Encoding (α=50)
LR with one-hot + target encoding, smoothing α=50 (heavy smoothing).
LR + Target Encoding (α=5)
Logistic Regression with one-hot + target encoding features, smoothing α=5.
LR + Target Encoding + UCI Stats (α=10)
LR with one-hot + target encoding + original UCI statistics combined.
LR — Target Encoding Only (α=100)
LR with target encoding only, heavy smoothing.
LR — Target Encoding Only (α=10)
LR with target encoding only, α=10.
LR — Target Encoding Only (α=1)
LR using ONLY target-encoded features (no one-hot), minimal smoothing.
LR — Target Encoding Only (α=50)
LR with target encoding only, α=50.
LR — Target Encoding Only (α=5)
LR with target encoding only, α=5.

📤 Kaggle Submissions

LB Score = Leaderboard score — Kaggle evaluates our predictions on their hidden test labels. This is the "real" score. The gap between our local OOF and LB is consistently ~0.00183, meaning our cross-validation is reliable. We get 10 submissions per day. Click a row for full details.

📈 OOF vs Leaderboard Score

Blue = local cross-validation estimate. Green = actual Kaggle leaderboard score. The consistent gap shows our CV is trustworthy.
# | Submitted | Model | OOF AUC | LB Score | Rank | CV→LB Gap
1 | Feb 15 07:04 | Top-3 Blend (CB_freq + CB_orig + LR) | 0.955682 | 0.95384 | 297 | 0.00184
2 | Feb 15 07:04 | CatBoost + Freq Encoding (single) | 0.955595 | 0.95375 | n/a | 0.00184
3 | Feb 15 07:04 | CatBoost Raw (no FE) | 0.955503 | 0.95358 | n/a | 0.00192
4 | Feb 15 12:36 | 3-Model Rank Blend (stack + conf + multi) | 0.955729 | 0.95389 | 201 | 0.00184
5 | Feb 15 12:57 | Hill-Climb: Cross-Stack + TE CatBoost | 0.955751 | 0.95392 | 175 | 0.00183

🏆 Leaderboard Context

Top score: 0.95405  |  First page (top 20): all at 0.95400  |  Our rank: 175 / ~2,700 (~top 6.5%)  |  Top 5% cutoff: ~rank 135 ≈ 0.95395+

🧪 Our Novel Approaches

These are original ideas our team designed — not copied from public notebooks. Even if they didn't improve the score, they demonstrate research thinking and are valuable for the academic report. The professor values novelty attempts. Click each card for full details.

💡 Idea 1: Label Denoising via KNN Consensus

❌ No benefit   OOF: 0.955477

The dataset has ~11% label noise (contradictory samples with identical features but opposite labels). We compute a "clean probability" using K-nearest…

Finding: Clean probability adds no predictive value to CatBoost (0.955477 vs baseline 0.955595). The noise is uniformly distributed across feature space — CatBoost already handles it implicitly. The initial version had target leakage (0.999 AUC) via the disagreement feature — caught and fixed.

Why it's novel: Original idea. Not found in competition notebooks or literature for this dataset scale.
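In code, the idea is roughly this (a sketch with assumed names: `uci_X`/`uci_y` are the original UCI features and labels, `train_X`/`train_y` are ours; the production version may differ):

```python
# Sketch only: distance-weighted agreement with the K nearest original-UCI patients.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def clean_probability(train_X, train_y, uci_X, uci_y, k=3, eps=1e-6):
    nn = NearestNeighbors(n_neighbors=k).fit(uci_X)
    dist, idx = nn.kneighbors(train_X)            # (n, k) distances / neighbor indices
    w = 1.0 / (dist + eps)                        # closer neighbors get more weight
    w /= w.sum(axis=1, keepdims=True)
    consensus = (w * uci_y[idx]).sum(axis=1)      # neighbors' weighted P(disease)
    # Agreement of our (possibly noisy) label with the clean-data consensus.
    # Using only UCI labels here avoids the target leakage the first version had.
    return np.where(train_y == 1, consensus, 1.0 - consensus)
```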

💡 Idea 2: Confidence-Weighted Dynamic Ensemble

✅ Marginal gain   OOF: 0.955677

Instead of fixed weights, each model's vote is weighted by its confidence — how far its prediction is from the decision boundary (0.5). Confident pred…
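A minimal sketch of the weighting (assuming `preds` is an (n_models, n_samples) array of probabilities; the actual scheme may differ):

```python
import numpy as np

def confidence_weighted_blend(preds, temperature=1.0):
    conf = np.abs(preds - 0.5) * 2.0               # 0 = maximally uncertain, 1 = fully confident
    w = conf ** temperature                        # temperature sharpens/softens the weighting
    w = w / np.clip(w.sum(axis=0), 1e-12, None)    # normalize weights per sample
    return (w * preds).sum(axis=0)                 # confident models dominate each vote
```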

💡 Idea 3: Noise-Aware Sample Reweighting

❌ No benefit   OOF: 0.955461

Down-weight training samples that are likely mislabeled (low KNN consensus). Give the model permission to "ignore" noisy examples via sample weights.

Finding: Hurts performance at all alpha values tested (0.0–0.5). The noise is too uniformly distributed for selective weighting to help. CatBoost's built-in regularization already handles this.

Why it's novel: Curriculum-learning inspired approach adapted for label noise in tabular data.

💡 Idea 4: Bayesian Posterior Stacking

❌ Overfits   OOF: 0.954874

Use 62 model predictions as meta-features in a Bayesian-inspired meta-learner. The idea: capture complex model agreement patterns.

💡 Idea 5: Autoencoder Latent Features

🔄 Running   Pending...

Train a 13→16→8→16→13 autoencoder to learn compressed representations of patient features. Extract the 8-dim bottleneck as new features for CatBoost. …

💡 Idea 6: Adversarial Validation

🔄 Running   Pending...

Train a classifier to distinguish train from test data. If it succeeds (AUC > 0.5), the distributions differ — we can reweight training samples to mat…

💡 Idea 7: Pseudo-Labeling with Confidence Thresholds

🔄 Running   Pending...

Use our best model to predict test labels. Add high-confidence test predictions (>0.95, 0.90, 0.85, 0.80 thresholds) back to training data. Retrain wi…

💡 Idea 8: Slow-Learning Deep CatBoost

🔄 Running   Pending...

CatBoost with very low learning rate (0.02–0.015), deep trees (depth 7–8), and 2500–3000 iterations. Trades compute for potentially better generalizat…

📋 Full Experiment Plan — 211 Experiments

23 Done
13 Running
127 Planned
11 Ideas

Every experiment explained — what it is, why we're testing it, and what we expect.

ID Experiment & Description Status
A1. Gradient Boosted Decision Trees (GBDT)
A1.1
CatBoost — Default Parameters
Yandex's gradient boosting with native categorical feature handling. Uses ordered target statistics internally, which is ideal for our all-categorical data. Default hyperparameters: depth=6, lr=0.03, 1000 iterations.
Done 0.95550
A1.2
XGBoost — Default Parameters
The most popular gradient boosting library. Uses histogram-based splits. Serves as comparison point against CatBoost on the same features.
Done 0.95535
A1.3
LightGBM — Default Parameters
Microsoft's gradient boosting. Leaf-wise tree growth (vs level-wise in XGBoost). Fastest GBDT but may overfit on noisy data.
Done 0.95511
A1.4
CatBoost — Optuna Hyperparameter Tuning
100-trial Bayesian optimization over: iterations, learning_rate, depth, l2_leaf_reg, random_strength, bagging_temperature, border_count, min_data_in_leaf. Searches the full hyperparameter space to find optimal CB configuration.
Running
A1.5
XGBoost — Optuna Hyperparameter Tuning
Bayesian optimization for XGBoost: max_depth, learning_rate, subsample, colsample_bytree, reg_lambda, reg_alpha, min_child_weight. May find a different optimum than CatBoost.
Planned
A1.6
LightGBM — Optuna Hyperparameter Tuning
Bayesian optimization for LightGBM: num_leaves, learning_rate, subsample, colsample_bytree, reg_lambda, min_child_samples. LGBM has different optimal hyperparameters than CB/XGB.
Planned
A1.7
CatBoost — Grow Policy Comparison
Compare 3 tree-building strategies: SymmetricTree (default, balanced splits), Depthwise (level-by-level like XGBoost), Lossguide (leaf-wise like LightGBM, picks the leaf that reduces loss most). Different policies capture different patterns in the data.
Planned
A1.8
CatBoost — Ordered Boosting
CatBoost's unique boosting_type="Ordered" uses a permutation-driven approach designed to prevent target leakage during training. Originally designed for small datasets. May help with our noisy labels by being more conservative.
Planned
A1.9
XGBoost — DART Mode
Dropout Additive Regression Trees: randomly drops trees during boosting (like neural net dropout). Prevents later trees from over-correcting earlier ones. Better generalization on noisy data.
Planned
A1.10
LightGBM — GOSS + EFB
Gradient-based One-Side Sampling: keeps all samples with large gradients (hard examples), randomly samples from small gradients (easy examples). Exclusive Feature Bundling merges sparse features. Faster and may regularize better.
Planned
A1.11
CatBoost — Label Smoothing
Smooth target labels: instead of 0/1, use 0+ε and 1-ε (e.g., 0.05 and 0.95). Directly addresses our ~11% label noise by telling the model "don't be 100% confident in any label". Test ε = 0.05, 0.10, 0.15.
Planned
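One plausible implementation, since CatBoost's CrossEntropy loss accepts soft targets in [0, 1] (a sketch under that assumption; `X` and `y` are assumed training arrays, not our confirmed config):

```python
import numpy as np
from catboost import CatBoost

eps = 0.05
y_soft = np.where(y == 1, 1.0 - eps, eps)   # 0/1 labels -> 0.05/0.95

model = CatBoost({"loss_function": "CrossEntropy", "iterations": 1000, "verbose": False})
model.fit(X, y_soft)                        # y_soft is interpreted as P(y=1)
probs = model.predict(X, prediction_type="Probability")  # predicted probabilities
```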
A1.12
HistGradientBoosting — Scikit-learn
Sklearn's histogram-based gradient boosting. Different implementation from CB/XGB/LGBM. Adds diversity to our GBDT ensemble even if slightly weaker individually.
Planned
A2. Linear Models
A2.1
Logistic Regression — Raw Features
Baseline: logistic regression on raw numeric features. P(disease) = sigmoid(w₁×Age + w₂×BP + ...). Expected to be weak because it treats Age=45 and Age=46 as almost identical, losing categorical structure.
Done 0.95049
A2.2
Logistic Regression — One-Hot All Features
KEY FINDING: One-hot encode ALL 13 features (even "numeric" ones like Age). This lets LR learn "Age=45 → coefficient X" independently from "Age=65 → coefficient Y". Achieves 0.9555 — matching CatBoost! Proves all features are effectively categorical.
Done 0.95552
A2.3
Ridge Classifier
Like logistic regression but with L2-regularized least squares loss instead of log loss. Faster to train, different decision boundary from LR. Adds linear model diversity.
Planned
A2.4
SGD Classifier — Stochastic Gradient Descent
Online learning with log loss. Processes samples one-at-a-time instead of full batch. Different optimization trajectory may find different local optima.
Planned
A2.5
Logistic Regression — ElasticNet Regularization
Combines L1 (sparsity, feature selection) and L2 (shrinkage) penalties. l1_ratio controls the mix. L1 may zero out useless one-hot features, reducing overfitting.
Planned
A2.6
LR — Polynomial Feature Interactions (Degree 2)
Create ALL pairwise interaction features: Age×ChestPain, BP×Cholesterol, etc. Lets LR capture "Age=55 AND ChestPain=Typical → high risk" relationships that single features miss. Feature count explodes but may capture non-linear patterns.
Planned
A2.7
LR — One-Hot + Target Encoding Combined
Feed LR both one-hot features AND target-encoded features simultaneously. One-hot captures per-value patterns; target encoding captures smoothed population rates.
Planned
A2.8
LR — IsolationForest Anomaly Scores
From a top-scoring public LR notebook: add anomaly scores from IsolationForest as a feature. Flags unusual patients whose feature combinations are rare in the training data.
Planned
A3. Ensemble Tree Methods (Non-Boosted)
A3.1
Random Forest — Default
Ensemble of 500 independent decision trees, each trained on a random bootstrap sample. Weaker than boosting (0.952) but makes completely different errors — valuable for diversity.
Done 0.95222
A3.2
ExtraTrees — Extremely Randomized Trees
Like Random Forest but splits are chosen randomly instead of optimally. Even more variance reduction. Much faster to train. Different error patterns.
Planned
A3.3
Balanced Random Forest
Random Forest with class-weighted sampling. Our dataset is 55/45 split — slight imbalance. This ensures each tree sees balanced classes, which may improve AUC.
Planned
A3.4
Random Forest on GBDT Residuals
CLEVER TRICK from a public notebook (LB 0.95395): Train CatBoost+XGBoost first, then train Random Forest on what they get WRONG (residuals). RF captures patterns the boosters miss. Final prediction = GBDT + RF correction.
Planned
A4. Neural Networks — Tabular-Specific
A4.1
RealMLP — State-of-the-Art Tabular MLP
From the pytabkit library. Uses Mish activation (smooth ReLU), Piecewise Linear Representation (PLR) embeddings for numeric features, and careful initialization. PUBLIC NOTEBOOK ACHIEVES LB 0.95397 — better than our current best! High priority.
Planned
A4.2
TabM — Tabular Model from pytabkit
Second-best public solo model (LB 0.95381). TabM-mini-normal architecture. Different from RealMLP — could add neural net diversity to our ensemble.
Planned
A4.3
FT-Transformer — Feature Tokenizer Transformer
Converts each feature to a token embedding, then applies Transformer self-attention. Can learn complex feature interactions that tree models miss. Different inductive bias: trees do axis-aligned splits, transformers do attention-weighted combinations.
Planned
A4.4
SAINT — Self-Attention + Inter-Sample Attention
Novel architecture: Attention between features (like FT-Transformer) PLUS attention between samples (each sample attends to similar training samples). Captures both feature relationships and patient-to-patient similarities. Unique approach for tabular data.
Planned
A4.5
TabTransformer
Transformer applied only to categorical feature embeddings, then concatenated with numeric features and fed through MLP. Lighter than FT-Transformer. Good for our all-categorical data.
Planned
A4.6
NODE — Neural Oblivious Decision Ensembles
Differentiable version of decision trees. Learns soft, gradient-optimizable tree splits. Bridges the gap between GBDT and neural nets. Can be trained end-to-end with backpropagation.
Planned
A4.7
DANet — Deep Abstract Network
Tabular-specific deep learning with abstract layers that learn hierarchical feature representations. Less common architecture — adds novel diversity.
Planned
A4.8
Simple 3-Layer MLP
Baseline neural net: Input → 256 → 128 → 64 → 1 with BatchNorm, ReLU, Dropout(0.3). Simple but establishes the neural net floor. Our HeartMLP variant achieved 0.9531.
Planned
A4.9
MLP with Periodic Embeddings
From a top-voted Kaggle discussion (52 upvotes). Maps numeric features through sine/cosine functions before feeding to MLP. Periodic embeddings help neural nets learn non-monotonic relationships (e.g., very low AND very high cholesterol both indicate risk).
Planned
A4.10
MLP with PLR Embeddings
Piecewise Linear Representation: each numeric feature is split into bins with learned linear interpolation. Turns continuous features into rich representations. Key ingredient of RealMLP's success.
Planned
A5. In-Context Learning / Foundation Models
A5.1
TabPFN v1 — Zero-Shot Tabular Classification
Pre-trained transformer that classifies tabular data WITHOUT training on your dataset. Learned to classify from millions of synthetic datasets. UNUSED by any public notebook — innovation opportunity! Limitation: doesn't scale to 630K rows directly, needs subsampling.
Planned
A5.2
TabPFN v2 — Scalable Version
Improved TabPFN with support for larger datasets via chunked inference. May handle our 630K rows with batching. Worth testing for pure diversity.
Planned
A5.3
HyperFast — Meta-Learned Hypernetwork
A neural net that GENERATES the weights of a classifier for your specific dataset. Instant classification without training. Completely different approach from everything else.
Planned
A6. Other Classifiers
A6.1
SVM — Radial Basis Function Kernel
Support Vector Machine with RBF kernel. Projects features into infinite-dimensional space and finds the maximum-margin decision boundary. Very different from tree/linear models. Slow on 630K rows but adds maximum diversity.
Planned
A6.2
K-Nearest Neighbors
Instance-based: classify each patient by majority vote of K most similar patients in training data. No model learned at all — pure memorization. Weak alone but captures local neighborhood patterns.
Planned
A6.3
Gaussian Naive Bayes
Assumes features are independent given the class. Obviously wrong (features correlate) but the resulting probability estimates are well-calibrated. Fast, different, adds diversity.
Planned
A6.4
Quadratic Discriminant Analysis
Fits a Gaussian to each class and classifies by likelihood ratio. Quadratic (non-linear) decision boundary. Captures class-specific covariance structures.
Planned
A7. AutoML
A7.1
AutoGluon — Best Quality
Amazon's AutoML: automatically trains dozens of models (NN, GBDT, KNN, etc.), performs multi-layer stacking, and selects the best ensemble. "best_quality" preset uses more models and longer training. May discover combinations we missed.
Planned
A7.2
AutoGluon — High Quality
Faster AutoGluon preset. Same approach but fewer models and less stacking depth. Good for quick comparison.
Planned
A7.3
H2O AutoML
Alternative AutoML framework
Planned
B1. Encoding Strategies
B1.1
Frequency Encoding
Replace each category value with how often it appears in train+test combined. Rare values get low frequencies. Captures population-level patterns. Example: "ChestPain=Typical" appears in 47% of data → mapped to 0.47.
Done
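A minimal sketch of B1.1 (assuming pandas DataFrames `train`/`test` and a `features` column list):

```python
import pandas as pd

def add_freq_encoding(train: pd.DataFrame, test: pd.DataFrame, features):
    full = pd.concat([train[features], test[features]], axis=0)
    for col in features:
        freq = full[col].value_counts(normalize=True)   # value -> share of all rows
        train[f"{col}_freq"] = train[col].map(freq)     # e.g. "ChestPain=Typical" -> 0.47
        test[f"{col}_freq"] = test[col].map(freq)
    return train, test
```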
B1.2
Target Encoding (Smoothed)
Replace each value with smoothed P(disease | value) from training data. α-smoothing blends per-value rate with global rate: TE = (n×mean + α×global) / (n + α). Prevents overfitting on rare categories. We sweep α = 1, 5, 10, 50, 100.
Done 0.95552
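The formula above, as an out-of-fold sketch (assumed names: `df` holds the feature columns plus a binary `target` column):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def target_encode_oof(df, col, target="target", alpha=10, n_splits=10, seed=42):
    te = pd.Series(np.nan, index=df.index)
    global_mean = df[target].mean()
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in skf.split(df, df[target]):
        stats = df.iloc[tr_idx].groupby(col)[target].agg(["mean", "count"])
        # TE = (n*mean + alpha*global) / (n + alpha), per category value
        smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
        # Encode held-out rows with statistics from the other folds only (no leakage).
        te.iloc[va_idx] = df[col].iloc[va_idx].map(smoothed).fillna(global_mean).values
    return te
```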
B1.3
Original UCI Target Statistics
THE MOST IMPACTFUL FEATURE. Compute P(disease | feature_value) from the ORIGINAL 303 patients (real clinical diagnoses), not the noisy synthetic data. Example: "Thallium=7" → 85% disease rate in original data. Adds ~0.006 AUC — massive gain.
Planned
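A sketch of the mapping (assumed names: `uci` is the original dataframe with the same columns plus its clean `target`):

```python
import pandas as pd

def add_uci_stats(train, test, uci, features, target="target"):
    for col in features:
        rate = uci.groupby(col)[target].mean()   # P(disease | value) in the original data
        default = uci[target].mean()             # fallback for values unseen in the original
        train[f"{col}_uci"] = train[col].map(rate).fillna(default)
        test[f"{col}_uci"] = test[col].map(rate).fillna(default)
    return train, test
```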
B1.4
Binary/One-Hot Encoding
Standard one-hot: each category value becomes a 0/1 column. "ChestPain" with 4 values becomes 4 binary columns. Creates sparse high-dimensional features. Works surprisingly well for LR.
Planned
B1.5
All-Categorical Treatment
Treat ALL features (including "numeric" Age, BP, Cholesterol) as categorical. Works because: Age has only ~50 unique values in 630K rows — it IS categorical in this dataset. CatBoost with all-cat achieves 0.9555.
Done 0.95560
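A sketch of the trick (`train` and `FEATURES` are assumed names; casting to str is one common way to force categorical treatment):

```python
from catboost import CatBoostClassifier, Pool

# Casting to str makes CatBoost treat every column, Age included, as categorical.
pool = Pool(train[FEATURES].astype(str), label=train["target"],
            cat_features=list(FEATURES))
model = CatBoostClassifier(iterations=1000, verbose=0).fit(pool)
```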
B1.6
Target encoding (in-fold)
Mean target per feature value, computed within CV fold
Planned
B1.7
Original UCI target stats
Mean/median/std/count from original data
Done 0.95558
B1.8
Leave-one-out encoding
LOO target encoding — less biased than mean
Planned
B1.9
WoE (Weight of Evidence)
Encodes each category level as log(P(value | y=1) / P(value | y=0))
Planned
B1.10
James-Stein encoding
Shrinkage-based target encoding
Planned
B1.11
Binary encoding
Binary representation of categorical levels
Planned
B1.12
Helmert encoding
Compares each level to mean of subsequent levels
Planned
B2. Categorical/Numerical Treatment Combinations
B2.1
Pairwise Interactions
Create features like Age×ChestPain, Sex×Thallium, etc. Captures "combination" effects that single features miss. Example: "Male + Typical Angina" may be higher risk than either alone.
Planned
B2.2
Ratio Features
Create ratios: MaxHR/Age (heart rate relative to age), Cholesterol/Age, etc. Clinically meaningful: high heart rate is more concerning in older patients.
Done
B2.3
Age Binning × Interactions
Group Age into bins (young/middle/old) then interact with other features. Captures age-dependent risk factors.
Planned
B2.4
High-info categoricals only: {Thal,ChestPain,Vessels} as cat, rest numerical
Top 3 categorical predictors
Planned
B2.5
Reverse: treat Tier3 features (BP,FBS,Chol) as categorical, rest numerical
They have low info anyway
Planned
B2.6
Age + MaxHR + STdep as numerical, everything else categorical
Only true continuous features
Planned
B2.7
Optimal split search (Optuna)
Let optimizer find best cat/num assignment
Planned
B3. Feature Interactions & Transformations
B3.1
Feature Selection via Importance
Use CatBoost feature importance to rank features, then train with only the top-K. If some features add noise, removing them may improve generalization.
Done 0.95545
B3.2
Recursive Feature Elimination
Iteratively remove the least important feature and retrain. Finds the minimal feature set that maintains (or improves) AUC.
Planned
B3.3
Forward Feature Selection
Start with zero features, add one at a time (the one that improves AUC most). Greedy but finds good feature subsets.
Planned
B3.4
Log/sqrt/square transforms
Non-linear transforms of numericals
Planned
B3.5
KBinsDiscretizer (10 bins)
Binned numericals — from top baseline notebook
Planned
B3.6
Clinical risk composites
Framingham-like score, Duke clinical score
Planned
B3.7
PCA components (top 5)
Dimensionality-reduced features
Planned
B3.8
UMAP embeddings (2D-3D)
Non-linear dimensionality reduction
Planned
B3.9
Cluster assignments (KMeans k=5,10)
Cluster membership as feature
Planned
B3.10
IsolationForest anomaly scores
From top LR notebook
Planned
B3.11
Autoencoder reconstruction error
Learn normal patterns, deviation = risk
Planned
B3.12
KNN distance features
Distance to k-nearest of each class
Planned
B4. Multi-Dataset Integration
B4.1
Original UCI target stats (merged)
Already using
Done
B4.2
Original data as extra training rows
Concatenate with weight adjustment
Planned
B4.3
Original data weighted by similarity
Adversarial validation to find similar samples
Planned
B4.4
Multi-source target encoding (Cleveland + Statlog + synthetic)
Different target encodings from each source
Planned
B4.5
Domain adaptation: original → synthetic
Transfer learning approach
Planned
C1. Cross-Validation Schemes
C1.1
Multi-Seed Training (5 Seeds)
Train the SAME model architecture 5 times with different random seeds (42, 123, 456, 789, 2024). Each seed produces slightly different trees → rank-blending reduces variance. Typically adds +0.0001–0.0003 AUC for free.
Done
C1.2
Multi-Fold Variants (5-fold vs 10-fold vs 20-fold)
Compare different numbers of CV folds. More folds = more training data per fold but more variance in OOF estimates. Find the sweet spot.
Done
C1.3
Multi-seed (5 seeds × 10 folds)
Running now for top models
Running
C1.4
RepeatedStratifiedKFold (3×10)
30 folds, averaged
Planned
C1.5
Stratified on Thallium×Target
Ensures balanced Thallium distribution
Planned
C1.6
GroupKFold by feature clusters
Prevents data leakage if clusters exist
Planned
C2. Noise-Aware Training
C2.1
Cross-Stacking — LR → CatBoost
Train LR on 10 folds, save OOF predictions. Use LR_pred as feature #14 for CatBoost. CatBoost learns WHEN the linear model is right/wrong. Our strongest technique: +0.0001 AUC.
Planned
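A sketch of the OOF discipline that makes this leak-free (assumed names; `X_onehot` is the LR feature matrix):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def lr_oof_feature(X_onehot, y, n_splits=10, seed=42):
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in skf.split(X_onehot, y):
        lr = LogisticRegression(max_iter=1000).fit(X_onehot[tr], y[tr])
        oof[va] = lr.predict_proba(X_onehot[va])[:, 1]  # each row scored by a fold that never saw it
    return oof

# The OOF vector then becomes "feature #14" for CatBoost, e.g.:
# X_stacked = np.column_stack([X_cat, lr_oof_feature(X_onehot, y)])
```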
C2.2
Cross-Stacking — Multiple Base Models
Stack predictions from LR + XGB + LGBM as features for CatBoost meta-learner. Risk: too many correlated features → overfitting (confirmed: CB+LR+CBO stack was WORSE).
Planned
C2.3
Blending — Rank-Based
Convert each model's predictions to ranks (1 to N), then average ranks. Calibration-invariant: doesn't matter if Model A predicts 0.7 and Model B predicts 0.95 for the same sample. Better than probability averaging for diverse model families.
Planned
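A minimal sketch (assuming `preds` is a list of 1-D probability arrays):

```python
import numpy as np
from scipy.stats import rankdata

def rank_blend(preds):
    ranks = [rankdata(p) / len(p) for p in preds]   # ranks normalized to (0, 1]
    return np.mean(ranks, axis=0)                   # average rank = calibration-invariant blend
```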
C2.4
Hill-Climbing Ensemble
Greedy algorithm: start with best model, try adding each remaining model, keep the one that improves blend AUC most. Repeat. Our best result (0.955751) came from hill-climbing over 105+ models — found that te_cb_a10 uniquely complements the CB+LR stack.
Planned
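A sketch of the greedy loop (assuming `oof` maps model name to OOF probability array and `y` is the label vector):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hill_climb(oof, y, max_models=10):
    chosen = [max(oof, key=lambda m: roc_auc_score(y, oof[m]))]  # start from the best single model
    best = roc_auc_score(y, oof[chosen[0]])
    while len(chosen) < max_models:
        trials = {m: roc_auc_score(y, np.mean([oof[c] for c in chosen + [m]], axis=0))
                  for m in oof if m not in chosen}
        cand, auc = max(trials.items(), key=lambda kv: kv[1])
        if auc <= best:
            break                                   # no remaining model improves the blend
        chosen.append(cand)
        best = auc
    return chosen, best
```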
C2.5
Bayesian Blend Optimization
Use Optuna to find optimal blend weights instead of greedy hill-climbing. Searches continuous weight space. May find better weights than equal weighting.
Planned
C2.6
Symmetric cross-entropy loss
Noise-robust loss function
Planned
C3. Post-Processing
C3.1
Pseudo-Labeling — Confident Test Predictions
Semi-supervised: use our model's most confident test predictions (prob > 0.95 or < 0.05) as additional training data. Increases effective training set. Must be careful with threshold choice.
Done
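A sketch of one round (assumed names; `test_pred` holds the best model's test probabilities):

```python
import numpy as np

def pseudo_label(X_train, y_train, X_test, test_pred, thresh=0.95):
    confident = (test_pred > thresh) | (test_pred < 1.0 - thresh)
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, (test_pred[confident] > 0.5).astype(int)])
    return X_aug, y_aug   # retrain on the augmented set; sweep `thresh` carefully
```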
C3.2
Pseudo-Labeling — Multi-Round
Iterative pseudo-labeling: train → predict test → add confident predictions → retrain → repeat. Each round should improve, but risks confirmation bias (reinforcing model errors).
Planned
C3.3
Knowledge Distillation
Train a large ensemble, then train a single model to mimic the ensemble's SOFT predictions (probabilities) rather than hard labels. Transfers ensemble knowledge into one model.
Planned
C3.4
Temperature scaling
From CB+XGB+Residual RF notebook
Planned
C3.5
Pseudo-labeling (iterative)
High-confidence test predictions as training data
Planned
D1. Level-0 → Level-1 Stacking
D1.1
SHAP Feature Importance
SHapley Additive exPlanations: game-theory-based feature importance. Shows WHICH features drive predictions for EACH patient, not just globally. Required for the academic report.
Planned
D1.2
Partial Dependence Plots
Show how each feature affects prediction probability when all other features are held constant. Reveals non-linear relationships: e.g., risk increases sharply above Age=55.
Planned
D1.3
Feature Interaction Analysis
SHAP interaction values: which feature PAIRS interact most? E.g., does Thallium + Exercise Angina have a synergistic effect beyond their individual contributions?
Planned
D1.4
Neural Network Meta-Learner
Train a neural network as the Level-1 meta-learner over OOF predictions instead of CatBoost.
Planned
D1.5
3-level stacking
L0: diverse models, L1: blenders, L2: final
Planned
D2. Blending Strategies
D2.1
Cross-Validation Stability Analysis
How much does OOF AUC vary across folds? High variance = model is unstable. Important for report: shows our results are robust, not lucky folds.
Done 0.95547
D2.2
Noise Ceiling Estimation
Quantify the theoretical maximum AUC given ~11% label noise. Important for report: explains WHY we can't exceed ~0.956 regardless of model choice.
Done 0.95568
D2.3
Learning Curves
Train on 10%, 20%, ..., 100% of data. If AUC still improving at 100%, more data would help. If it plateaus, we're data-saturated (likely given 630K rows and noise ceiling).
Planned
D2.4
Bayesian blend weight optimization
Optuna on blend weights
Planned
D2.5
Random percentile sampling
From "Blend the Blender" (LB 0.954)
Planned
D2.6
Geometric mean blend
Alternative to arithmetic mean
Planned
D2.7
Power mean blend (p=0.5, p=2)
Generalized mean with tunable p
Planned
D3. Diversity Maximization
D3.1
Multi-Dataset Generalization
Apply our FULL pipeline to the original UCI Cleveland, Hungarian, Switzerland, and VA datasets. Shows our approach generalizes beyond the competition data. Prof specifically requested this.
Planned
D3.2
Feature-bagged ensembles
Each model sees different feature subset
Planned
D3.3
Row-sampled ensembles
Bootstrap aggregation with different samples
Planned
D3.4
Architecture-diverse blend
1 GBDT + 1 Linear + 1 NN + 1 TabPFN
Planned
E1. Beyond Standard Approaches
E1.1
Noise-transition matrix estimation
Estimate P(observed label | true label) to model the label-noise process
Idea
E1.2
Co-teaching
Two models trained simultaneously, each teaching the other on clean samples (Han et al. 2018)
Idea
E1.3
DivideMix
Semi-supervised learning + noisy label handling (Li et al. 2020)
Idea
E1.4
Confident Learning with cleanlab
Characterize label noise, prune/reweight/fix samples
Idea
E1.5
Feature importance × noise analysis
Which features contribute most to misclassification in the 11% noisy zone?
Idea
E1.6
Conditional ensemble
Different models for different regions of feature space (e.g., high Thallium vs low)
Idea
E1.7
Prototype-based classification
Learn class prototypes, classify by distance. Robust to noise.
Idea
E1.8
Self-training with high-confidence filtering
Use model's own confident predictions to augment training
Idea
E1.9
Multi-task learning
Predict target + reconstruct features simultaneously
Idea
E1.10
Curriculum learning
Train on "easy" samples first, progressively add harder ones
Idea
E2. Data-Level Innovation
E2.1
Synthetic minority oversampling (SMOTE)
Address slight class imbalance
Planned
E2.2
Adversarial data augmentation
Generate adversarial perturbations to improve robustness
Idea
E2.3
Feature permutation importance analysis
Beyond standard — permutation within folds for stable estimates
Planned
E2.4
SHAP-based feature selection
Select features by SHAP importance, not just correlation
Planned
E2.5
Boruta feature selection
Shadow feature comparison method
Planned
F1. Heart Disease ML Literature
Source | Best AUC | Year
Cleveland UCI (original) | ~0.90-0.92 | 2010s
Grinsztajn et al. 2022 | n/a | 2022
TabZilla (NeurIPS 2023) | n/a | 2023
Gorishniy et al. 2022 | n/a | 2022
Regularization Cocktails 2023 | n/a | 2023
TabPFN (Hollmann et al. 2023) | n/a | 2023
HyperFast (Bonet et al. 2024) | n/a | 2024
F2. Kaggle Playground Series — Winning Patterns
Competition | Metric | Winning Approach
S3E7 (Cirrhosis) | Log Loss | CatBoost + stacking
S3E8 (Kidney Stone) | AUC | LightGBM + feature eng
S3E12 (Kidney Disease) | AUC | Ensemble GBDT + NN
S3E17 (Wine) | QWK | CatBoost ordinal
S4E1 (Binary) | AUC | Stacking + original data
S4E8 (Mushroom) | MCC | CatBoost native categoricals
S5E2 (Backorder) | AUC | GBDT + imbalanced learning
Common patterns (each covered by a planned experiment): original data integration, multi-seed ensembling, GBDT + linear blend, proper CV (≥5-fold stratified).
Unknown
F3. Tabular Deep Learning State-of-the-Art (2024-2025)
Model | Paper | Performance vs GBDT
RealMLP (pytabkit) | Holzmüller et al. 2024 | Competitive on medium datasets
TabM | Gorishniy et al. 2024 | Matches GBDT on many benchmarks
FT-Transformer | Gorishniy et al. 2021 | Competitive on high-cardinality data
TabPFN v2 | Hollmann et al. 2025 | SOTA on small-medium tabular
ModernNCA | Ye et al. 2024 | Strong on noisy data
ExcelFormer | Chen et al. 2024 | Excels at feature interactions
GRANDE | Marton et al. 2024 | Best of both worlds
Unknown
Wave 1 — Foundation (DONE ✅)
1
Raw baselines: LR, CatBoost, XGBoost, LightGBM, RF
Done
2
Key FE: one-hot, freq encoding, orig target stats, interactions
Done
3
Selective blending
Done
4
Multi-seed top models
Running
5
Optuna CatBoost tuning
Running
Wave 2 — Diverse Models (NEXT)
6
RealMLP + TabM (pytabkit) — proven top solo performers
Planned
7
TabPFN — novel, unused publicly
Planned
8
FT-Transformer — different DL architecture
Planned
9
LR + polynomial + anomaly features
Planned
10
ExtraTrees, HistGradientBoosting
Planned
Wave 3 — Advanced FE
11
Categorical treatment sweep (B2.1-B2.7)
Planned
12
Target encoding (in-fold) for GBDT models
Planned
13
Label smoothing on CatBoost
Planned
14
KBins + PCA + cluster features
Planned
15
Multi-dataset: original data as extra rows
Planned
Wave 4 — Stacking & Meta-Learning
16
Proper Level-0/Level-1 stacking with LR meta-learner
Planned
17
Hill-climbing blend weight optimization
Planned
18
3-level stacking pyramid
Planned
19
OOF correlation analysis for diversity selection
Planned
Wave 5 — Innovation & Refinement
20
Noise-aware training — clean_prob features, noise-aware reweighting (no signific
Done
21
Pseudo-labeling (see Wave 6 below)
Running
22
Conditional/confidence ensembles — conf_weighted 0.955677, meta_conf 0.955670
Done
23
AutoGluon run
Planned
24
Final mega-blend (hill-climbing done: 0.955751)
Running
Wave 6 — "Big Ideas" (Breakthrough Attempts)
W6.1
Pseudo-labeling / Self-training
Use best ensemble to predict test set. High-confidence samples (>0.90 prob) get pseudo-labels and are added to training. Effectively 800K+ training samples. Proven technique in Playground Series competitions where synthetic data benefits from seeing test distribution. Testing thresholds: 0.90, 0.85,
Running
W6.2
UCI-trained model as meta-feature
Train a model on the 920 original UCI heart disease samples (cleaner labels, no synthetic noise). Run predict_proba on our 630K train + 270K test. The UCI model's prediction becomes a meta-feature — it captures non-linear patterns from the original distribution that our synthetic-data models can't l
Planned
W6.3
Adversarial validation + sample reweighting
Train a classifier to distinguish train (label=0) from test (label=1). Each training sample gets a "test-likeness" score. Upweight test-like training samples during CatBoost training via `sample_weight`. This aligns training distribution with test, potentially closing the CV→LB gap (currently 0.0018
Planned
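A sketch of the weight computation (assumed names; any fast classifier works as the discriminator, CatBoost shown since it is sklearn-compatible):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from catboost import CatBoostClassifier

def test_likeness_weights(X_train, X_test):
    X = np.vstack([X_train, X_test])
    is_test = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = CatBoostClassifier(iterations=200, verbose=0)
    p = cross_val_predict(clf, X, is_test, cv=5, method="predict_proba")[:, 1]
    p_train = p[: len(X_train)]                # P(row looks like test) for each training row
    return p_train / (1.0 - p_train + 1e-12)   # odds ratio = importance weight for sample_weight
```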
W6.4
XGBoost/LightGBM with full FE stack
We've only run XGB/LGBM with raw features. Our best CatBoost uses: all-categorical + freq encoding + orig stats. Applying the same FE to XGB/LGBM could produce models at 0.9555+ that are structurally different from CatBoost — real diversity for blending.
Planned
W6.5
Probability calibration
Platt scaling or isotonic regression on OOF predictions before blending. If our model probabilities are miscalibrated (systematically over/under-confident), calibration could improve AUC. Apply per-model before rank-blending.
Planned
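A per-model sketch with isotonic regression (assumed names; fit on OOF predictions vs truth, then apply to test predictions):

```python
from sklearn.isotonic import IsotonicRegression

def calibrate(oof_pred, y, test_pred):
    iso = IsotonicRegression(out_of_bounds="clip").fit(oof_pred, y)  # learn monotone mapping on OOF
    return iso.transform(test_pred)                                  # calibrated test probabilities
```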
W6.6
Feature interaction mining
Systematic pairwise ratio/product/difference search across all 13 features. CatBoost handles interactions implicitly but explicit features could help LR and the meta-learner. Top interactions selected by mutual information with target.
Planned
W6.7
Multi-round pseudo-labeling
Iterative: pseudo-label → retrain → re-predict test → pseudo-label again. Each round refines confidence. Risk of confirmation bias, so track AUC per round carefully.
Planned
Wave 7 — Literature-Inspired Ideas (from Deep Research Review)
W7.1
Slow-learning deep CatBoost
Gemini report, top Kaggle solutions
Running
W7.2
Isotonic probability calibration
Gemini report Section 2.2
Running
W7.3
Autoencoder latent features
Alghamdi et al. 2024, Gemini Section 3.1
Running
W7.4
KNN diversity models
Systematic review, Chandrasekhar et al.
Running
W7.5
SVM with RBF kernel
Multiple papers, systematic review
Running
W7.6
AdaBoost + sklearn GBM
Chandrasekhar et al., Jan et al.
Running
W7.7
Soft voting ensemble
Chandrasekhar et al. 2023
Planned
W7.8
Feature interaction mining
Gemini Section 2.3
Planned
W7.9
Feature ablation (drop-one analysis)
"SF-2" finding in literature
Planned
W7.10
Tabular-to-image + pretrained CNN
VGG16 transfer learning paper (2024)
Planned
W7.11
CNN-BiLSTM hybrid
Kayalvizhi et al. 2024
Planned
W7.12
Newton-Raphson optimization for NN
Kayalvizhi et al. 2024
Planned
W7.13
RL-based model routing
ScienceDirect 2025
Planned
W7.14
SHAP + LIME interpretability
Multiple papers, Gemini Section 6
Planned
W7.15
Clinical composite features
Domain knowledge, Gemini Section 2.3
Planned
Wave 8 — Dropped Ideas (Documented with Reasoning)
D1
GAN Data Augmentation (Dropped)
Dropped: data is already synthetic (630K rows). More synthetic data adds noise, not signal.
Unknown
D2
SMOTE Oversampling (Dropped)
Dropped: class balance is 55/45, barely imbalanced. SMOTE would add noise.
Unknown
D3
TabPFN (Dropped — OOM)
Dropped: crashes on 630K rows. Designed for small datasets only.
Unknown
D4
Clean Probability Feature (Dropped)
Dropped: confirmed no benefit. CatBoost already captures this via target encoding.
Unknown 0.955477
D5
Mega meta-learner (62 models)
Dropped: overfits badly. Too many correlated meta-features relative to the signal.
Unknown 0.954874
D6
CB+LR+CBO cross-stacking
Dropped: worse than CB+LR alone (0.955669 vs 0.955708). The extra OOF feature is too correlated to add information.
Unknown 0.955669
D7
AttGRU-HMSI / LSTM-XGBoost
Unknown
D8
Target spilling as ensemble nudge
Unknown

📖 Glossary — Key Terms Explained

Quick reference for teammates who want to understand the metrics, techniques, and tools we're using.

Term | What It Means
AUC-ROC | Area Under the Receiver Operating Characteristic curve. Measures how well the model distinguishes between heart disease present/absent. 1.0 = perfect, 0.5 = random guessing. Our target: ≥0.954.
OOF (Out-of-Fold) | Our local evaluation method. We split data into 10 folds, train on 9, predict on the 1 held out, and repeat 10 times. This gives an unbiased estimate of model quality without using test data.
LB (Leaderboard) | Kaggle's official score. They evaluate our predictions on hidden test labels. We can submit 10 times per day.
CV→LB Gap | Difference between our local OOF and Kaggle LB score. Ours is consistently ~0.00183, meaning our CV is reliable.
CatBoost | A gradient-boosted decision tree algorithm by Yandex. Excels with categorical features. Our best single-model framework.
Cross-Stacking | Using one model's OOF predictions as an input feature for another model. E.g., LR predictions become a feature for CatBoost. Our strongest technique.
Feature Engineering (FE) | Creating new input features from existing ones. Our key FE: frequency encoding, original UCI stats, target encoding, all-categorical treatment.
Rank Blending | Converting predictions to ranks before averaging. More robust than averaging raw probabilities because it's invariant to each model's calibration.
Hill-Climbing | Greedy algorithm that tries adding each model to the ensemble and keeps the one that improves the score most. Repeats until no improvement.
Label Noise | ~11% of training samples have contradictory labels (same features, different diagnosis). This is real clinical ambiguity, not data error. Creates a ceiling on achievable AUC.
Multi-Seed | Training the same model with different random seeds. Each seed gives slightly different results; blending them reduces variance.
Target Encoding | Replacing a categorical value with the average target rate for that category (e.g., "chest pain type 4" → 0.72 heart disease rate). Must be done carefully to avoid leakage.
Pseudo-Labeling | Using our model's confident predictions on test data as additional training labels. Semi-supervised technique.
Adversarial Validation | Training a model to tell train from test data. If it can't (AUC ≈ 0.5), the distributions match. If it can, we need to reweight samples.
Generated 2026-02-15 · Team Dashboard
Refresh: cd kaggle-s6e2 && .venv/bin/python3 src/build_dashboard.py