| Category | Count | Best OOF | What It Means |
|---|---|---|---|
| Blend | 9 | 0.955692 | Rank-average or weighted combination of 2+ models. Simple but effective. |
| Multi-seed | 18 | 0.955623 | Same model architecture trained with different random seeds, then blended. Reduces variance. |
| Neural Network | 5 | 0.953317 | Deep learning models (MLP). Weaker than GBDTs on this tabular dataset. |
| Novel | 13 | 0.955677 | Our team's original ideas — innovations we designed for this competition. Important for academic report. |
| Single Model | 11 | 0.955595 | Individual ML model (CatBoost, XGBoost, LightGBM, LR, RF) — one algorithm, one config. |
| Stacking | 22 | 0.955708 | Model A's predictions become an input feature for Model B. Captures what A learned. Our best approach. |
| Sweep | 39 | 0.955468 | Systematic ablation experiments testing which features help/hurt when treated as categorical. |
| Target Encoding | 10 | 0.955702 | Replace categorical values with the average target (heart disease rate) for that category. Adds signal. |
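Several categories above (Blend, Multi-seed, Stacking inputs) reduce to rank-averaging model predictions. A minimal sketch of rank-average blending, assuming each model's predictions arrive as a NumPy array of probabilities over the same rows:

```python
import numpy as np
from scipy.stats import rankdata

def rank_blend(preds, weights=None):
    """Rank-average blend: convert each model's probabilities to
    normalized ranks, then take a (weighted) mean. Invariant to each
    model's calibration, unlike raw probability averaging."""
    ranks = np.stack([rankdata(p) / len(p) for p in preds])
    return np.average(ranks, axis=0, weights=weights)
```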
✅ = can submit to Kaggle, ❌ = evaluation only.
| Model & Description | Submit |
|---|---|
Logistic Regression — Raw Features | ✅ |
CatBoost — Raw Features | ✅ |
XGBoost — Raw Features | ✅ |
LightGBM — Raw Features | ✅ |
Random Forest — Raw Features | ✅ |
Logistic Regression — One-Hot All Features | ✅ |
CatBoost — Original UCI Target Stats | ✅ |
XGBoost — Original UCI Target Stats | ✅ |
LightGBM — Original UCI Target Stats | ✅ |
CatBoost — Pairwise Feature Interactions | ✅ |
CatBoost — Frequency Encoding ⭐ | ✅ |
GBDT Blend — CB+XGB+LGBM Average | ✅ |
Original Stats Blend | ✅ |
Grand Blend — All Phase 1 Models | ✅ |
CB + LR Blend | ✅ |
Top-3 Blend — CB_freq + CB_orig + LR ⭐ | ✅ |
Top-4 Blend | ✅ |
CB Freq — 5-Seed Average | ✅ |
CB Freq — Seed 123 | ✅ |
CB Freq — Seed 2024 | ✅ |
CB Freq — Seed 42 | ✅ |
CB Freq — Seed 456 | ✅ |
CB Freq — Seed 789 | ✅ |
HeartMLP — Custom 3-Layer Neural Net | ✅ |
LR One-Hot — 5-Seed Average | ✅ |
LR One-Hot — Seed 123 | ✅ |
LR One-Hot — Seed 2024 | ✅ |
LR One-Hot — Seed 42 | ✅ |
LR One-Hot — Seed 456 | ✅ |
LR One-Hot — Seed 789 | ✅ |
CB OrigStats — 5-Seed Average | ✅ |
CB OrigStats — Seed 123 | ✅ |
CB OrigStats — Seed 2024 | ✅ |
CB OrigStats — Seed 42 | ✅ |
CB OrigStats — Seed 456 | ✅ |
CB OrigStats — Seed 789 | ✅ |
RealMLP — 3-Seed Blend | ✅ |
RealMLP — Tabular Foundation Model | ✅ |
RealMLP — Seed 123 | ✅ |
RealMLP — Seed 42 | ✅ |
RealMLP — Seed 456 | ✅ |
CB Freq — 5-Seed Rank Blend ⭐ | ✅ |
Multi-Seed Grand Blend | ✅ |
CB+LR Cross-Stack — 5-Seed Blend ⭐⭐ | ✅ |
CB+LR Cross-Stack — Seed 123 | ✅ |
CB+LR Cross-Stack — Seed 2024 | ✅ |
CB+LR Cross-Stack — Seed 42 | ✅ |
CB+LR Cross-Stack — Seed 456 | ✅ |
CB+LR Cross-Stack — Seed 789 | ✅ |
CB Stacked (v1) — Seed 123 | ✅ |
CB Stacked (v1) — Seed 42 | ✅ |
CB Stacked (v1) — Seed 456 | ✅ |
CB+LR+CBO Stack — 5-Seed Blend | ✅ |
CB+LR+CBO Stack — Seed 123 | ✅ |
CB+LR+CBO Stack — Seed 2024 | ✅ |
CB+LR+CBO Stack — Seed 42 | ✅ |
CB+LR+CBO Stack — Seed 456 | ✅ |
CB+LR+CBO Stack — Seed 789 | ✅ |
CB Mega-Stack — Seed 123 | ✅ |
CB Mega-Stack — Seed 42 | ✅ |
CB Mega-Stack — Seed 456 | ✅ |
CB Mega-Stack v2 — Seed 123 | ✅ |
CB Mega-Stack v2 — Seed 42 | ✅ |
CB Mega-Stack v2 — Seed 456 | ✅ |
Feature Ablation — Drop Age | ✅ |
Feature Ablation — Drop Blood Pressure | ✅ |
Feature Ablation — Drop Chest Pain Type | ✅ |
Feature Ablation — Drop Cholesterol | ✅ |
Feature Ablation — Drop EKG Results | ✅ |
Feature Ablation — Drop Exercise Angina | ✅ |
Feature Ablation — Drop Fasting Blood Sugar | ✅ |
Feature Ablation — Drop Max Heart Rate | ✅ |
Feature Ablation — Drop Fluoroscopy Vessels | ✅ |
Feature Ablation — Drop ST Depression | ✅ |
Feature Ablation — Drop Sex | ✅ |
Feature Ablation — Drop ST Slope | ✅ |
Feature Ablation — Drop Thallium | ✅ |
Cross-Stack + Multi-Seed TE (top 10) | ✅ |
Cross-Stack + Multi-Seed TE (top 15) | ✅ |
Cross-Stack + Multi-Seed TE (top 20) | ✅ |
CB + Clean Prob K=3 (Fixed, No Leakage) | ✅ |
Noise-Aware Reweighting (α=0.0) | ✅ |
Noise-Aware Reweighting (α=0.1) | ✅ |
Noise-Aware Reweighting (α=0.2) | ✅ |
Noise-Aware Reweighting (α=0.3) | ✅ |
Noise-Aware Reweighting (α=0.5) | ✅ |
Confidence-Weighted Ensemble (Novel) | ✅ |
LR + Clean Prob K=10 | ✅ |
LR + Clean Prob K=20 | ✅ |
LR + Clean Prob K=3 | ✅ |
LR + Clean Prob K=50 | ✅ |
LR + Clean Prob K=5 | ✅ |
Mega Meta-Learner — 62 Models (Novel) | ✅ |
Meta-Learner with Confidence Features (Novel) | ✅ |
LR Sweep — All Features as Categorical | ❌ |
LR Sweep — All Features as Numeric | ❌ |
LR Sweep — All Categorical Except Age | ❌ |
LR Sweep — All Categorical Except BP | ❌ |
LR Sweep — All Categorical Except Chest pain type | ❌ |
LR Sweep — All Categorical Except Cholesterol | ❌ |
LR Sweep — All Categorical Except EKG results | ❌ |
LR Sweep — All Categorical Except Exercise angina | ❌ |
LR Sweep — All Categorical Except FBS over 120 | ❌ |
LR Sweep — All Categorical Except Max HR | ❌ |
LR Sweep — All Categorical Except Number of vessels fluro | ❌ |
LR Sweep — All Categorical Except ST depression | ❌ |
LR Sweep — All Categorical Except Sex | ❌ |
LR Sweep — All Categorical Except Slope of ST | ❌ |
LR Sweep — All Categorical Except Thallium | ❌ |
LR Sweep — Categorical + Age + BP | ❌ |
LR Sweep — Categorical + Age + Cholesterol | ❌ |
LR Sweep — Categorical + Age + Max HR | ❌ |
LR Sweep — Categorical + Age + ST depression | ❌ |
LR Sweep — Categorical + Age | ❌ |
LR Sweep — Categorical + BP + Cholesterol | ❌ |
LR Sweep — Categorical + BP + Max HR | ❌ |
LR Sweep — Categorical + BP + ST depression | ❌ |
LR Sweep — Categorical + BP | ❌ |
LR Sweep — Categorical + Cholesterol + Max HR | ❌ |
LR Sweep — Categorical + Cholesterol + ST depression | ❌ |
LR Sweep — Categorical + Cholesterol | ❌ |
LR Sweep — Categorical + Max HR + ST depression | ❌ |
LR Sweep — Categorical + Max HR | ❌ |
LR Sweep — Categorical + ST depression | ❌ |
LR Sweep — Numeric + Chest pain type as Category | ❌ |
LR Sweep — Numeric + EKG results as Category | ❌ |
LR Sweep — Numeric + Exercise angina as Category | ❌ |
LR Sweep — Numeric + FBS over 120 as Category | ❌ |
LR Sweep — Numeric + Number of vessels fluro as Category | ❌ |
LR Sweep — Numeric + Sex as Category | ❌ |
LR Sweep — Numeric + Slope of ST as Category | ❌ |
LR Sweep — Numeric + Thallium as Category | ❌ |
LR Sweep — Official Cat/Num Split | ❌ |
CatBoost + Target Encoding (α=10) ⭐ | ✅ |
LR + Target Encoding (α=10) | ✅ |
LR + Target Encoding (α=50) | ✅ |
LR + Target Encoding (α=5) | ✅ |
LR + Target Encoding + UCI Stats (α=10) | ✅ |
LR — Target Encoding Only (α=100) | ✅ |
LR — Target Encoding Only (α=10) | ✅ |
LR — Target Encoding Only (α=1) | ✅ |
LR — Target Encoding Only (α=50) | ✅ |
LR — Target Encoding Only (α=5) | ✅ |
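All of the Target Encoding models above share the α-smoothed formula detailed in B1.2 below: TE = (n×mean + α×global) / (n + α). A minimal pandas sketch; fitting it inside each CV fold (not shown) is what prevents leakage. Column names in the usage lines are hypothetical:

```python
import pandas as pd

def smoothed_target_encoding(train, col, target, alpha=10.0):
    """Map each category value to (n * cat_mean + alpha * global_mean) / (n + alpha);
    larger alpha shrinks rare categories harder toward the global rate."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    return (stats["count"] * stats["mean"]
            + alpha * global_mean) / (stats["count"] + alpha)

# Usage sketch (hypothetical column names):
# te = smoothed_target_encoding(train, "ChestPain", "target", alpha=10)
# train["chest_pain_te"] = train["ChestPain"].map(te)
```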
| # | Submitted | Model | OOF AUC | LB Score | Rank | CV→LB Gap |
|---|---|---|---|---|---|---|
| 1 | Feb 15 07:04 | Top-3 Blend (CB_freq + CB_orig + LR) | 0.955682 | 0.95384 | 297 | 0.00184 |
| 2 | Feb 15 07:04 | CatBoost + Freq Encoding (single) | 0.955595 | 0.95375 | — | 0.00184 |
| 3 | Feb 15 07:04 | CatBoost Raw (no FE) | 0.955503 | 0.95358 | — | 0.00192 |
| 4 | Feb 15 12:36 | 3-Model Rank Blend (stack + conf + multi) | 0.955729 | 0.95389 | 201 | 0.00184 |
| 5 | Feb 15 12:57 | Hill-Climb: Cross-Stack + TE CatBoost | 0.955751 | 0.95392 | 175 | 0.00183 |
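Submission #5 came from hill-climbing over the model pool (see C2.4 below). A sketch of the greedy loop, assuming `oof` maps model names to OOF probability arrays over the same rows; re-adding a model is allowed, which amounts to fractional weights:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hill_climb_blend(oof, y, max_rounds=30, tol=1e-6):
    """Greedy hill-climbing: start from the best single model, then keep
    adding whichever model (repeats allowed, i.e. fractional weights)
    most improves the blend's OOF AUC."""
    names = list(oof)
    start = max(names, key=lambda n: roc_auc_score(y, oof[n]))
    members, blend = [start], oof[start].astype(float).copy()
    best_auc = roc_auc_score(y, blend)
    for _ in range(max_rounds):
        k = len(members)
        auc, pick = max((roc_auc_score(y, (blend * k + oof[n]) / (k + 1)), n)
                        for n in names)
        if auc <= best_auc + tol:
            break
        blend = (blend * k + oof[pick]) / (k + 1)
        members.append(pick)
        best_auc = auc
    return members, blend, best_auc
```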
The dataset has ~11% label noise (contradictory samples with identical features but opposite labels). We compute a "clean probability" using K-nearest…
Finding: Clean probability adds no predictive value to CatBoost (0.955477 vs baseline 0.955595). The noise is uniformly distributed across feature space — CatBoost already handles it implicitly. An initial version had target leakage (0.999 AUC) via a disagreement feature — caught and fixed.
Why it's novel: Original idea. Not found in competition notebooks or literature for this dataset scale.
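A sketch of the clean-probability computation, assuming K-nearest-neighbour label consensus over suitably encoded features; the exact metric and encoding used in the experiment are not specified here:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def clean_probability(X, y, k=10):
    """Fraction of each sample's k nearest neighbours (itself excluded)
    that share its label; low values flag likely-mislabeled rows."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # column 0 is the sample itself
    neighbour_labels = y[idx[:, 1:]]     # shape (n_samples, k)
    return (neighbour_labels == y[:, None]).mean(axis=1)
```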
Instead of fixed weights, each model's vote is weighted by its confidence — how far its prediction is from the decision boundary (0.5). Confident pred…
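A minimal sketch of that confidence weighting; the per-sample renormalisation is an assumption about the implementation:

```python
import numpy as np

def confidence_weighted_blend(pred_matrix):
    """pred_matrix: (n_models, n_samples) probabilities. Each model's vote
    is weighted by its confidence |p - 0.5|, renormalised per sample."""
    conf = np.abs(pred_matrix - 0.5) + 1e-9   # epsilon avoids zero weights
    weights = conf / conf.sum(axis=0, keepdims=True)
    return (weights * pred_matrix).sum(axis=0)
```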
Down-weight training samples that are likely mislabeled (low KNN consensus). Give the model permission to "ignore" noisy examples via sample weights.
Finding: Hurts performance at all alpha values tested (0.0–0.5). The noise is too uniformly distributed for selective weighting to help. CatBoost's built-in regularization already handles this.
Why it's novel: Curriculum-learning-inspired approach adapted for label noise in tabular data.
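A sketch of the reweighting, assuming a linear interpolation between uniform weights (α=0) and full KNN-consensus weights; the exact weighting form used in the experiment may differ:

```python
import numpy as np
from catboost import CatBoostClassifier

def noise_aware_weights(clean_prob, alpha):
    """alpha=0.0 reproduces uniform weights; larger alpha down-weights
    samples whose neighbours disagree with their label."""
    return 1.0 - alpha * (1.0 - np.asarray(clean_prob))

# model = CatBoostClassifier(verbose=0)
# model.fit(X_train, y_train,
#           sample_weight=noise_aware_weights(clean_prob, alpha=0.3))
```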
Use 62 model predictions as meta-features in a Bayesian-inspired meta-learner. The idea: capture complex model agreement patterns.
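A sketch of the meta-learner setup; logistic regression stands in for the "Bayesian-inspired" learner, whose exact form is not given. The strong regularisation reflects the finding (D5 below) that ~62 highly correlated inputs overfit easily:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_meta_learner(oof_matrix, y):
    """oof_matrix: (n_samples, n_models) OOF probabilities from the base
    models. Regularised hard, because many correlated inputs overfit
    (this idea ultimately scored below the best single model)."""
    meta = LogisticRegression(C=0.1, max_iter=1000)
    meta_oof = cross_val_predict(meta, oof_matrix, y, cv=10,
                                 method="predict_proba")[:, 1]
    meta.fit(oof_matrix, y)
    return meta, meta_oof
```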
Train a 13→16→8→16→13 autoencoder to learn compressed representations of patient features. Extract the 8-dim bottleneck as new features for CatBoost. …
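A PyTorch sketch of that 13→16→8→16→13 autoencoder; activation choice and training details are assumptions, not the experiment's exact settings:

```python
import torch
import torch.nn as nn

class HeartAE(nn.Module):
    """13 -> 16 -> 8 -> 16 -> 13 autoencoder; the 8-dim bottleneck is
    extracted as extra features for CatBoost."""
    def __init__(self, n_features=13):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                     nn.Linear(16, 8))
        self.decoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Training sketch: minimise MSE reconstruction loss on standardised
# features, then call model.encoder(x) to harvest the 8-dim bottleneck.
```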
Train a classifier to distinguish train from test data. If it succeeds (AUC > 0.5), the distributions differ — we can reweight training samples to mat…
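A sketch of adversarial validation, with LightGBM as the train-vs-test discriminator (the classifier choice is an assumption); the p/(1−p) importance weights implement the reweighting idea:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation(X_train, X_test):
    """Train-vs-test classifier. AUC ~ 0.5 means the distributions match;
    above that, p/(1-p) gives importance weights for training rows."""
    X = np.vstack([X_train, X_test])
    is_test = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    p = cross_val_predict(LGBMClassifier(), X, is_test, cv=5,
                          method="predict_proba")[:, 1]
    auc = roc_auc_score(is_test, p)
    p_tr = np.clip(p[: len(X_train)], 1e-6, 1 - 1e-6)
    weights = p_tr / (1 - p_tr)
    return auc, weights / weights.mean()
```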
Use our best model to predict test labels. Add high-confidence test predictions (>0.95, 0.90, 0.85, 0.80 thresholds) back to training data. Retrain wi…
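A minimal sketch of one pseudo-labeling round at a single confidence threshold:

```python
import numpy as np

def pseudo_label(X_train, y_train, X_test, test_prob, threshold=0.95):
    """Keep test rows predicted confidently near 0 or 1, append them to
    the training set with hard pseudo-labels."""
    confident = (test_prob > threshold) | (test_prob < 1 - threshold)
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train,
                            (test_prob[confident] > 0.5).astype(int)])
    return X_aug, y_aug
```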
CatBoost with very low learning rate (0.02–0.015), deep trees (depth 7–8), and 2500–3000 iterations. Trades compute for potentially better generalizat…
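One point from that grid, expressed as a CatBoost config; parameters beyond learning rate, depth, and iterations are assumptions rather than the experiment's exact settings:

```python
from catboost import CatBoostClassifier

cat_cols = list(range(13))  # placeholder: all 13 features as categorical

slow_cb = CatBoostClassifier(
    iterations=3000,        # grid: 2500-3000
    learning_rate=0.015,    # grid: 0.02-0.015
    depth=8,                # grid: 7-8
    eval_metric="AUC",
    cat_features=cat_cols,
    verbose=500,
)
# slow_cb.fit(X_train, y_train)
```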
Every experiment explained — what it is, why we're testing it, and what we expect.
| ID | Experiment & Description | Status |
|---|---|---|
| A1. Gradient Boosted Decision Trees (GBDT) | ||
A1.1 | CatBoost — Default Parameters Yandex's gradient boosting with native categorical feature handling. Uses ordered target statistics internally, which is ideal for our all-categorical data. Default hyperparameters: depth=6, lr=0.03, 1000 iterations. | Done 0.95550 |
A1.2 | XGBoost — Default Parameters The most popular gradient boosting library. Uses histogram-based splits. Serves as comparison point against CatBoost on the same features. | Done 0.95535 |
A1.3 | LightGBM — Default Parameters Microsoft's gradient boosting. Leaf-wise tree growth (vs level-wise in XGBoost). Fastest GBDT but may overfit on noisy data. | Done 0.95511 |
A1.4 | CatBoost — Optuna Hyperparameter Tuning 100-trial Bayesian optimization over: iterations, learning_rate, depth, l2_leaf_reg, random_strength, bagging_temperature, border_count, min_data_in_leaf. Searches the full hyperparameter space to find optimal CB configuration. | Running |
A1.5 | XGBoost — Optuna Hyperparameter Tuning Bayesian optimization for XGBoost: max_depth, learning_rate, subsample, colsample_bytree, reg_lambda, reg_alpha, min_child_weight. May find a different optimum than CatBoost. | Planned |
A1.6 | LightGBM — Optuna Hyperparameter Tuning Bayesian optimization for LightGBM: num_leaves, learning_rate, subsample, colsample_bytree, reg_lambda, min_child_samples. LGBM has different optimal hyperparameters than CB/XGB. | Planned |
A1.7 | CatBoost — Grow Policy Comparison Compare 3 tree-building strategies: SymmetricTree (default, balanced splits), Depthwise (level-by-level like XGBoost), Lossguide (leaf-wise like LightGBM, picks the leaf that reduces loss most). Different policies capture different patterns in the data. | Planned |
A1.8 | CatBoost — Ordered Boosting CatBoost's unique boosting_type="Ordered" uses a permutation-driven approach designed to prevent target leakage during training. Originally designed for small datasets. May help with our noisy labels by being more conservative. | Planned |
A1.9 | XGBoost — DART Mode Dropout Additive Regression Trees: randomly drops trees during boosting (like neural net dropout). Prevents later trees from over-correcting earlier ones. Better generalization on noisy data. | Planned |
A1.10 | LightGBM — GOSS + EFB Gradient-based One-Side Sampling: keeps all samples with large gradients (hard examples), randomly samples from small gradients (easy examples). Exclusive Feature Bundling merges sparse features. Faster and may regularize better. | Planned |
A1.11 | CatBoost — Label Smoothing Smooth target labels: instead of 0/1, use 0+ε and 1-ε (e.g., 0.05 and 0.95). Directly addresses our ~11% label noise by telling the model "don't be 100% confident in any label". Test ε = 0.05, 0.10, 0.15. | Planned |
A1.12 | HistGradientBoosting — Scikit-learn Sklearn's histogram-based gradient boosting. Different implementation from CB/XGB/LGBM. Adds diversity to our GBDT ensemble even if slightly weaker individually. | Planned |
| A2. Linear Models | ||
A2.1 | Logistic Regression — Raw Features Baseline: logistic regression on raw numeric features. P(disease) = sigmoid(w₁×Age + w₂×BP + ...). Expected to be weak because it treats Age=45 and Age=46 as almost identical, losing categorical structure. | Done 0.95049 |
A2.2 | Logistic Regression — One-Hot All Features KEY FINDING: One-hot encode ALL 13 features (even "numeric" ones like Age). This lets LR learn "Age=45 → coefficient X" independently from "Age=65 → coefficient Y". Achieves 0.9555 — matching CatBoost! Proves all features are effectively categorical. | Done 0.95552 |
A2.3 | Ridge Classifier Like logistic regression but with L2-regularized least squares loss instead of log loss. Faster to train, different decision boundary from LR. Adds linear model diversity. | Planned |
A2.4 | SGD Classifier — Stochastic Gradient Descent Online learning with log loss. Processes samples one-at-a-time instead of full batch. Different optimization trajectory may find different local optima. | Planned |
A2.5 | Logistic Regression — ElasticNet Regularization Combines L1 (sparsity, feature selection) and L2 (shrinkage) penalties. l1_ratio controls the mix. L1 may zero out useless one-hot features, reducing overfitting. | Planned |
A2.6 | LR — Polynomial Feature Interactions (Degree 2) Create ALL pairwise interaction features: Age×ChestPain, BP×Cholesterol, etc. Lets LR capture "Age=55 AND ChestPain=Typical → high risk" relationships that single features miss. Feature count explodes but may capture non-linear patterns. | Planned |
A2.7 | LR — One-Hot + Target Encoding Combined Feed LR both one-hot features AND target-encoded features simultaneously. One-hot captures per-value patterns; target encoding captures smoothed population rates. | Planned |
A2.8 | LR — IsolationForest Anomaly Scores From a top-scoring public LR notebook: add anomaly scores from IsolationForest as a feature. Flags unusual patients whose feature combinations are rare in the training data. | Planned |
| A3. Ensemble Tree Methods (Non-Boosted) | ||
A3.1 | Random Forest — Default Ensemble of 500 independent decision trees, each trained on a random bootstrap sample. Weaker than boosting (0.952) but makes completely different errors — valuable for diversity. | Done 0.95222 |
A3.2 | ExtraTrees — Extremely Randomized Trees Like Random Forest but splits are chosen randomly instead of optimally. Even more variance reduction. Much faster to train. Different error patterns. | Planned |
A3.3 | Balanced Random Forest Random Forest with class-weighted sampling. Our dataset is 55/45 split — slight imbalance. This ensures each tree sees balanced classes, which may improve AUC. | Planned |
A3.4 | Random Forest on GBDT Residuals CLEVER TRICK from a public notebook (LB 0.95395): Train CatBoost+XGBoost first, then train Random Forest on what they get WRONG (residuals). RF captures patterns the boosters miss. Final prediction = GBDT + RF correction. | Planned |
| A4. Neural Networks — Tabular-Specific | ||
A4.1 | RealMLP — State-of-the-Art Tabular MLP From the pytabkit library. Uses Mish activation (smooth ReLU), Piecewise Linear Representation (PLR) embeddings for numeric features, and careful initialization. PUBLIC NOTEBOOK ACHIEVES LB 0.95397 — better than our current best! High priority. | Planned |
A4.2 | TabM — Tabular Model from pytabkit Second-best public solo model (LB 0.95381). TabM-mini-normal architecture. Different from RealMLP — could add neural net diversity to our ensemble. | Planned |
A4.3 | FT-Transformer — Feature Tokenizer Transformer Converts each feature to a token embedding, then applies Transformer self-attention. Can learn complex feature interactions that tree models miss. Different inductive bias: trees do axis-aligned splits, transformers do attention-weighted combinations. | Planned |
A4.4 | SAINT — Self-Attention + Inter-Sample Attention Novel architecture: Attention between features (like FT-Transformer) PLUS attention between samples (each sample attends to similar training samples). Captures both feature relationships and patient-to-patient similarities. Unique approach for tabular data. | Planned |
A4.5 | TabTransformer Transformer applied only to categorical feature embeddings, then concatenated with numeric features and fed through MLP. Lighter than FT-Transformer. Good for our all-categorical data. | Planned |
A4.6 | NODE — Neural Oblivious Decision Ensembles Differentiable version of decision trees. Learns soft, gradient-optimizable tree splits. Bridges the gap between GBDT and neural nets. Can be trained end-to-end with backpropagation. | Planned |
A4.7 | DANet — Deep Abstract Network Tabular-specific deep learning with abstract layers that learn hierarchical feature representations. Less common architecture — adds novel diversity. | Planned |
A4.8 | Simple 3-Layer MLP Baseline neural net: Input → 256 → 128 → 64 → 1 with BatchNorm, ReLU, Dropout(0.3). Simple but establishes the neural net floor. Our HeartMLP variant achieved 0.9531. | Planned |
A4.9 | MLP with Periodic Embeddings From a top-voted Kaggle discussion (52 upvotes). Maps numeric features through sine/cosine functions before feeding to MLP. Periodic embeddings help neural nets learn non-monotonic relationships (e.g., very low AND very high cholesterol both indicate risk). | Planned |
A4.10 | MLP with PLR Embeddings Piecewise Linear Representation: each numeric feature is split into bins with learned linear interpolation. Turns continuous features into rich representations. Key ingredient of RealMLP's success. | Planned |
| A5. In-Context Learning / Foundation Models | ||
A5.1 | TabPFN v1 — Zero-Shot Tabular Classification Pre-trained transformer that classifies tabular data WITHOUT training on your dataset. Learned to classify from millions of synthetic datasets. UNUSED by any public notebook — innovation opportunity! Limitation: doesn't scale to 630K rows directly, needs subsampling. | Planned |
A5.2 | TabPFN v2 — Scalable Version Improved TabPFN with support for larger datasets via chunked inference. May handle our 630K rows with batching. Worth testing for pure diversity. | Planned |
A5.3 | HyperFast — Meta-Learned Hypernetwork A neural net that GENERATES the weights of a classifier for your specific dataset. Instant classification without training. Completely different approach from everything else. | Planned |
| A6. Other Classifiers | ||
A6.1 | SVM — Radial Basis Function Kernel Support Vector Machine with RBF kernel. Projects features into infinite-dimensional space and finds the maximum-margin decision boundary. Very different from tree/linear models. Slow on 630K rows but adds maximum diversity. | Planned |
A6.2 | K-Nearest Neighbors Instance-based: classify each patient by majority vote of K most similar patients in training data. No model learned at all — pure memorization. Weak alone but captures local neighborhood patterns. | Planned |
A6.3 | Gaussian Naive Bayes Assumes features are independent given the class. Obviously wrong (features correlate) but the resulting probability estimates are well-calibrated. Fast, different, adds diversity. | Planned |
A6.4 | Quadratic Discriminant Analysis Fits a Gaussian to each class and classifies by likelihood ratio. Quadratic (non-linear) decision boundary. Captures class-specific covariance structures. | Planned |
| A7. AutoML | ||
A7.1 | AutoGluon — Best Quality Amazon's AutoML: automatically trains dozens of models (NN, GBDT, KNN, etc.), performs multi-layer stacking, and selects the best ensemble. "best_quality" preset uses more models and longer training. May discover combinations we missed. | Planned |
A7.2 | AutoGluon — High Quality Faster AutoGluon preset. Same approach but fewer models and less stacking depth. Good for quick comparison. | Planned |
A7.3 | H2O AutoML Alternative AutoML framework | Planned |
| B1. Encoding Strategies | ||
B1.1 | Frequency Encoding Replace each category value with how often it appears in train+test combined. Rare values get low frequencies. Captures population-level patterns. Example: "ChestPain=Typical" appears in 47% of data → mapped to 0.47. | Done |
B1.2 | Target Encoding (Smoothed) Replace each value with smoothed P(disease | value) from training data. α-smoothing blends per-value rate with global rate: TE = (n×mean + α×global) / (n + α). Prevents overfitting on rare categories. We sweep α = 1, 5, 10, 50, 100. | Done 0.95552 |
B1.3 | Original UCI Target Statistics THE MOST IMPACTFUL FEATURE. Compute P(disease | feature_value) from the ORIGINAL 303 patients (real clinical diagnoses), not the noisy synthetic data. Example: "Thallium=7" → 85% disease rate in original data. Adds ~0.006 AUC — massive gain. | Planned |
B1.4 | Binary/One-Hot Encoding Standard one-hot: each category value becomes a 0/1 column. "ChestPain" with 4 values becomes 4 binary columns. Creates sparse high-dimensional features. Works surprisingly well for LR. | Planned |
B1.5 | All-Categorical Treatment Treat ALL features (including "numeric" Age, BP, Cholesterol) as categorical. Works because: Age has only ~50 unique values in 630K rows — it IS categorical in this dataset. CatBoost with all-cat achieves 0.9555. | Done 0.95560 |
B1.6 | Target encoding (in-fold) Mean target per feature value, computed within CV fold | Planned |
B1.7 | Original UCI target stats Mean/median/std/count from original data | Done 0.95558 |
B1.8 | Leave-one-out encoding LOO target encoding — less biased than mean | Planned |
B1.9 | WoE (Weight of Evidence) Per-level log odds: ln of P(value given y=1) over P(value given y=0) | Planned |
B1.10 | James-Stein encoding Shrinkage-based target encoding | Planned |
B1.11 | Binary encoding Binary representation of categorical levels | Planned |
B1.12 | Helmert encoding Compares each level to mean of subsequent levels | Planned |
| B2. Categorical/Numerical Treatment Combinations | ||
B2.1 | Pairwise Interactions Create features like Age×ChestPain, Sex×Thallium, etc. Captures "combination" effects that single features miss. Example: "Male + Typical Angina" may be higher risk than either alone. | Planned |
B2.2 | Ratio Features Create ratios: MaxHR/Age (heart rate relative to age), Cholesterol/Age, etc. Clinically meaningful: high heart rate is more concerning in older patients. | Done |
B2.3 | Age Binning × Interactions Group Age into bins (young/middle/old) then interact with other features. Captures age-dependent risk factors. | Planned |
B2.4 | High-info categoricals only: {Thal,ChestPain,Vessels} as cat, rest numerical Top 3 categorical predictors | Planned |
B2.5 | Reverse: treat Tier3 features (BP,FBS,Chol) as categorical, rest numerical They have low info anyway | Planned |
B2.6 | Age + MaxHR + STdep as numerical, everything else categorical Only true continuous features | Planned |
B2.7 | Optimal split search (Optuna) Let optimizer find best cat/num assignment | Planned |
| B3. Feature Interactions & Transformations | ||
B3.1 | Feature Selection via Importance Use CatBoost feature importance to rank features, then train with only the top-K. If some features add noise, removing them may improve generalization. | Done 0.95545 |
B3.2 | Recursive Feature Elimination Iteratively remove the least important feature and retrain. Finds the minimal feature set that maintains (or improves) AUC. | Planned |
B3.3 | Forward Feature Selection Start with zero features, add one at a time (the one that improves AUC most). Greedy but finds good feature subsets. | Planned |
B3.4 | Log/sqrt/square transforms Non-linear transforms of numericals | Planned |
B3.5 | KBinsDiscretizer (10 bins) Binned numericals — from top baseline notebook | Planned |
B3.6 | Clinical risk composites Framingham-like score, Duke clinical score | Planned |
B3.7 | PCA components (top 5) Dimensionality-reduced features | Planned |
B3.8 | UMAP embeddings (2D-3D) Non-linear dimensionality reduction | Planned |
B3.9 | Cluster assignments (KMeans k=5,10) Cluster membership as feature | Planned |
B3.10 | IsolationForest anomaly scores From top LR notebook | Planned |
B3.11 | Autoencoder reconstruction error Learn normal patterns, deviation = risk | Planned |
B3.12 | KNN distance features Distance to k-nearest of each class | Planned |
| B4. Multi-Dataset Integration | ||
B4.1 | Original UCI target stats (merged) Already using | Done |
B4.2 | Original data as extra training rows Concatenate with weight adjustment | Planned |
B4.3 | Original data weighted by similarity Adversarial validation to find similar samples | Planned |
B4.4 | Multi-source target encoding (Cleveland + Statlog + synthetic) Different target encodings from each source | Planned |
B4.5 | Domain adaptation: original → synthetic Transfer learning approach | Planned |
| C1. Cross-Validation Schemes | ||
C1.1 | Multi-Seed Training (5 Seeds) Train the SAME model architecture 5 times with different random seeds (42, 123, 456, 789, 2024). Each seed produces slightly different trees → rank-blending reduces variance. Typically adds +0.0001–0.0003 AUC for free. | Done |
C1.2 | Multi-Fold Variants (5-fold vs 10-fold vs 20-fold) Compare different numbers of CV folds. More folds = more training data per fold but more variance in OOF estimates. Find the sweet spot. | Done |
C1.3 | Multi-seed (5 seeds × 10 folds) Running now for top models | Running |
C1.4 | RepeatedStratifiedKFold (3×10) 30 folds, averaged | Planned |
C1.5 | Stratified on Thallium×Target Ensures balanced Thallium distribution | Planned |
C1.6 | GroupKFold by feature clusters Prevents data leakage if clusters exist | Planned |
| C2. Noise-Aware Training | ||
C2.1 | Cross-Stacking — LR → CatBoost Train LR on 10 folds, save OOF predictions. Use LR_pred as feature #14 for CatBoost. CatBoost learns WHEN the linear model is right/wrong. Our strongest technique: +0.0001 AUC. | Planned |
C2.2 | Cross-Stacking — Multiple Base Models Stack predictions from LR + XGB + LGBM as features for CatBoost meta-learner. Risk: too many correlated features → overfitting (confirmed: CB+LR+CBO stack was WORSE). | Planned |
C2.3 | Blending — Rank-Based Convert each model's predictions to ranks (1 to N), then average ranks. Calibration-invariant: doesn't matter if Model A predicts 0.7 and Model B predicts 0.95 for the same sample. Better than probability averaging for diverse model families. | Planned |
C2.4 | Hill-Climbing Ensemble Greedy algorithm: start with best model, try adding each remaining model, keep the one that improves blend AUC most. Repeat. Our best result (0.955751) came from hill-climbing over 105+ models — found that te_cb_a10 uniquely complements the CB+LR stack. | Planned |
C2.5 | Bayesian Blend Optimization Use Optuna to find optimal blend weights instead of greedy hill-climbing. Searches continuous weight space. May find better weights than equal weighting. | Planned |
C2.6 | Symmetric cross-entropy loss Noise-robust loss function | Planned |
| C3. Post-Processing | ||
C3.1 | Pseudo-Labeling — Confident Test Predictions Semi-supervised: use our model's most confident test predictions (prob > 0.95 or < 0.05) as additional training data. Increases effective training set. Must be careful with threshold choice. | Done |
C3.2 | Pseudo-Labeling — Multi-Round Iterative pseudo-labeling: train → predict test → add confident predictions → retrain → repeat. Each round should improve, but risks confirmation bias (reinforcing model errors). | Planned |
C3.3 | Knowledge Distillation Train a large ensemble, then train a single model to mimic the ensemble's SOFT predictions (probabilities) rather than hard labels. Transfers ensemble knowledge into one model. | Planned |
C3.4 | Temperature scaling From CB+XGB+Residual RF notebook | Planned |
C3.5 | Pseudo-labeling (iterative) High-confidence test predictions as training data | Planned |
| D1. Level-0 → Level-1 Stacking | ||
D1.1 | SHAP Feature Importance SHapley Additive exPlanations: game-theory-based feature importance. Shows WHICH features drive predictions for EACH patient, not just globally. Required for the academic report. | Planned |
D1.2 | Partial Dependence Plots Show how each feature affects prediction probability when all other features are held constant. Reveals non-linear relationships: e.g., risk increases sharply above Age=55. | Planned |
D1.3 | Feature Interaction Analysis SHAP interaction values: which feature PAIRS interact most? E.g., does Thallium + Exercise Angina have a synergistic effect beyond their individual contributions? | Planned |
D1.4 | Neural Network Meta-Learner Train a neural network as the Level-1 meta-learner over OOF predictions instead of CatBoost. | Planned |
D1.5 | 3-level stacking L0: diverse models, L1: blenders, L2: final | Planned |
| D2. Blending Strategies | ||
D2.1 | Cross-Validation Stability Analysis How much does OOF AUC vary across folds? High variance = model is unstable. Important for report: shows our results are robust, not lucky folds. | Done 0.95547 |
D2.2 | Noise Ceiling Estimation Quantify the theoretical maximum AUC given ~11% label noise. Important for report: explains WHY we can't exceed ~0.956 regardless of model choice. | Done 0.95568 |
D2.3 | Learning Curves Train on 10%, 20%, ..., 100% of data. If AUC still improving at 100%, more data would help. If it plateaus, we're data-saturated (likely given 630K rows and noise ceiling). | Planned |
D2.4 | Bayesian blend weight optimization Optuna on blend weights | Planned |
D2.5 | Random percentile sampling From "Blend the Blender" (LB 0.954) | Planned |
D2.6 | Geometric mean blend Alternative to arithmetic mean | Planned |
D2.7 | Power mean blend (p=0.5, p=2) Generalized mean with tunable p | Planned |
| D3. Diversity Maximization | ||
D3.1 | Multi-Dataset Generalization Apply our FULL pipeline to the original UCI Cleveland, Hungarian, Switzerland, and VA datasets. Shows our approach generalizes beyond the competition data. Prof specifically requested this. | Planned |
D3.2 | Feature-bagged ensembles Each model sees different feature subset | Planned |
D3.3 | Row-sampled ensembles Bootstrap aggregation with different samples | Planned |
D3.4 | Architecture-diverse blend 1 GBDT + 1 Linear + 1 NN + 1 TabPFN | Planned |
| E1. Beyond Standard Approaches | ||
E1.1 | Noise-transition matrix estimation Estimate P(observed label given true label) and correct training accordingly | Idea |
E1.2 | Co-teaching Two models trained simultaneously, each teaching the other on clean samples (Han et al. 2018) | Idea |
E1.3 | DivideMix Semi-supervised learning + noisy label handling (Li et al. 2020) | Idea |
E1.4 | Confident Learning with cleanlab Characterize label noise, prune/reweight/fix samples | Idea |
E1.5 | Feature importance × noise analysis Which features contribute most to misclassification in the 11% noisy zone? | Idea |
E1.6 | Conditional ensemble Different models for different regions of feature space (e.g., high Thallium vs low) | Idea |
E1.7 | Prototype-based classification Learn class prototypes, classify by distance. Robust to noise. | Idea |
E1.8 | Self-training with high-confidence filtering Use model's own confident predictions to augment training | Idea |
E1.9 | Multi-task learning Predict target + reconstruct features simultaneously | Idea |
E1.10 | Curriculum learning Train on "easy" samples first, progressively add harder ones | Idea |
| E2. Data-Level Innovation | ||
E2.1 | Synthetic minority oversampling (SMOTE) Address slight class imbalance | Planned |
E2.2 | Adversarial data augmentation Generate adversarial perturbations to improve robustness | Idea |
E2.3 | Feature permutation importance analysis Beyond standard — permutation within folds for stable estimates | Planned |
E2.4 | SHAP-based feature selection Select features by SHAP importance, not just correlation | Planned |
E2.5 | Boruta feature selection Shadow feature comparison method | Planned |
| F1. Heart Disease ML Literature | ||
| Source | Best AUC | Year |
|---|---|---|
| Cleveland UCI (original) | ~0.90–0.92 | 2010s |
| Grinsztajn et al. 2022 | — | 2022 |
| TabZilla (NeurIPS 2023) | — | 2023 |
| Gorishniy et al. 2022 | — | 2022 |
| Regularization Cocktails 2023 | — | 2023 |
| TabPFN (Hollmann et al. 2023) | — | 2023 |
| HyperFast (Bonet et al. 2024) | — | 2024 |
| F2. Kaggle Playground Series — Winning Patterns | ||
| Competition | Metric | Winning Approach |
|---|---|---|
| S3E7 (Cirrhosis) | Log Loss | CatBoost + stacking |
| S3E8 (Kidney Stone) | AUC | LightGBM + feature eng |
| S3E12 (Kidney Disease) | AUC | Ensemble GBDT + NN |
| S3E17 (Wine) | QWK | CatBoost ordinal |
| S4E1 (Binary) | AUC | Stacking + original data |
| S4E8 (Mushroom) | MCC | CatBoost native categoricals |
| S5E2 (Backorder) | AUC | GBDT + imbalanced learning |
Common patterns across these winners, each mapped to a planned experiment: original data integration; multi-seed ensembling; GBDT + linear blend; proper CV (≥5-fold stratified).
| F3. Tabular Deep Learning State-of-the-Art (2024-2025) | ||
| Model | Paper | Performance vs GBDT |
|---|---|---|
| RealMLP (pytabkit) | Holzmüller et al. 2024 | Competitive on medium datasets |
| TabM | Gorishniy et al. 2024 | Matches GBDT on many benchmarks |
| FT-Transformer | Gorishniy et al. 2021 | Competitive on high-cardinality |
| TabPFN v2 | Hollmann et al. 2025 | SOTA on small-medium tabular |
| ModernNCA | Ye et al. 2024 | Strong on noisy data |
| ExcelFormer | Chen et al. 2024 | Excels at feature interaction |
| GRANDE | Marton et al. 2024 | Best of both worlds |
| Wave 1 — Foundation (DONE ✅) | ||
1 | Raw baselines: LR, CatBoost, XGBoost, LightGBM, RF | Done |
2 | Key FE: one-hot, freq encoding, orig target stats, interactions | Done |
3 | Selective blending | Done |
4 | Multi-seed top models | Running |
5 | Optuna CatBoost tuning | Running |
| Wave 2 — Diverse Models (NEXT) | ||
6 | RealMLP + TabM (pytabkit) — proven top solo performers | Planned |
7 | TabPFN — novel, unused publicly | Planned |
8 | FT-Transformer — different DL architecture | Planned |
9 | LR + polynomial + anomaly features | Planned |
10 | ExtraTrees, HistGradientBoosting | Planned |
| Wave 3 — Advanced FE | ||
11 | Categorical treatment sweep (B2.1–B2.7) | Planned |
12 | Target encoding (in-fold) for GBDT models | Planned |
13 | Label smoothing on CatBoost | Planned |
14 | KBins + PCA + cluster features | Planned |
15 | Multi-dataset: original data as extra rows | Planned |
| Wave 4 — Stacking & Meta-Learning | ||
16 | Proper Level-0/Level-1 stacking with LR meta-learner | Planned |
17 | Hill-climbing blend weight optimization | Planned |
18 | 3-level stacking pyramid | Planned |
19 | OOF correlation analysis for diversity selection | Planned |
| Wave 5 — Innovation & Refinement | ||
20 | Noise-aware training — clean_prob features, noise-aware reweighting (no significant gain) | Done |
21 | Pseudo-labeling (see Wave 6 below) | Running |
22 | Conditional/confidence ensembles — conf_weighted 0.955677, meta_conf 0.955670 | Done |
23 | AutoGluon run | Planned |
24 | Final mega-blend (hill-climbing done: 0.955751) | Running |
| Wave 6 — "Big Ideas" (Breakthrough Attempts) | ||
W6.1 | Pseudo-labeling / Self-training Use best ensemble to predict test set. High-confidence samples (>0.90 prob) get pseudo-labels and are added to training. Effectively 800K+ training samples. Proven technique in Playground Series competitions where synthetic data benefits from seeing test distribution. Testing thresholds: 0.90, 0.85, 0.80. | Running |
W6.2 | UCI-trained model as meta-feature Train a model on the 920 original UCI heart disease samples (cleaner labels, no synthetic noise). Run predict_proba on our 630K train + 270K test. The UCI model's prediction becomes a meta-feature — it captures non-linear patterns from the original distribution that our synthetic-data models can't learn. | Planned |
W6.3 | Adversarial validation + sample reweighting Train a classifier to distinguish train (label=0) from test (label=1). Each training sample gets a "test-likeness" score. Upweight test-like training samples during CatBoost training via `sample_weight`. This aligns training distribution with test, potentially closing the CV→LB gap (currently ~0.0018). | Planned |
W6.4 | XGBoost/LightGBM with full FE stack We've only run XGB/LGBM with raw features. Our best CatBoost uses: all-categorical + freq encoding + orig stats. Applying the same FE to XGB/LGBM could produce models at 0.9555+ that are structurally different from CatBoost — real diversity for blending. | Planned |
W6.5 | Probability calibration Platt scaling or isotonic regression on OOF predictions before blending. If our model probabilities are miscalibrated (systematically over/under-confident), calibration could improve AUC. Apply per-model before rank-blending. | Planned |
W6.6 | Feature interaction mining Systematic pairwise ratio/product/difference search across all 13 features. CatBoost handles interactions implicitly but explicit features could help LR and the meta-learner. Top interactions selected by mutual information with target. | Planned |
W6.7 | Multi-round pseudo-labeling Iterative: pseudo-label → retrain → re-predict test → pseudo-label again. Each round refines confidence. Risk of confirmation bias, so track AUC per round carefully. | Planned |
| Wave 7 — Literature-Inspired Ideas (from Deep Research Review) | ||
W7.1 | Slow-learning deep CatBoost Gemini report, top Kaggle solutions | Running |
W7.2 | Isotonic probability calibration Gemini report Section 2.2 | Running |
W7.3 | Autoencoder latent features Alghamdi et al. 2024, Gemini Section 3.1 | Running |
W7.4 | KNN diversity models Systematic review, Chandrasekhar et al. | Running |
W7.5 | SVM with RBF kernel Multiple papers, systematic review | Running |
W7.6 | AdaBoost + sklearn GBM Chandrasekhar et al., Jan et al. | Running |
W7.7 | Soft voting ensemble Chandrasekhar et al. 2023 | Planned |
W7.8 | Feature interaction mining Gemini Section 2.3 | Planned |
W7.9 | Feature ablation (drop-one analysis) "SF-2" finding in literature | Planned |
W7.10 | Tabular-to-image + pretrained CNN VGG16 transfer learning paper (2024) | Planned |
W7.11 | CNN-BiLSTM hybrid Kayalvizhi et al. 2024 | Planned |
W7.12 | Newton-Raphson optimization for NN Kayalvizhi et al. 2024 | Planned |
W7.13 | RL-based model routing ScienceDirect 2025 | Planned |
W7.14 | SHAP + LIME interpretability Multiple papers, Gemini Section 6 | Planned |
W7.15 | Clinical composite features Domain knowledge, Gemini Section 2.3 | Planned |
| Wave 8 — Dropped Ideas (Documented with Reasoning) | ||
D1 | GAN Data Augmentation (Dropped) Dropped: data is already synthetic (630K rows). More synthetic data adds noise, not signal. | Unknown |
D2 | SMOTE Oversampling (Dropped) Dropped: class balance is 55/45, barely imbalanced. SMOTE would add noise. | Unknown |
D3 | TabPFN (Dropped — OOM) Dropped: crashes on 630K rows. Designed for small datasets only. | Unknown |
D4 | Clean Probability Feature (Dropped) Dropped: confirmed no benefit. CatBoost already captures this via target encoding. | Unknown 0.955477 |
D5 | Mega meta-learner (62 models) Dropped: OOF 0.954874, below the best single model. | Unknown 0.954874 |
D6 | CB+LR+CBO cross-stacking Dropped: underperformed the simpler CB+LR stack. | Unknown 0.955669 |
D7 | AttGRU-HMSI / LSTM-XGBoost | Unknown |
D8 | Target spilling as ensemble nudge | Unknown |
| Term | What It Means |
|---|---|
| AUC-ROC | Area Under the Receiver Operating Characteristic curve. Measures how well the model distinguishes between heart disease present/absent. 1.0 = perfect, 0.5 = random guessing. Our target: ≥0.954. |
| OOF (Out-of-Fold) | Our local evaluation method. We split data into 10 folds. Train on 9, predict on the 1 held out. Repeat 10 times. This gives an unbiased estimate of model quality without using test data. |
| LB (Leaderboard) | Kaggle's official score. They evaluate our predictions on hidden test labels. We can submit 10 times per day. |
| CV→LB Gap | Difference between our local OOF and Kaggle LB score. Ours is consistently ~0.00183, meaning our CV is reliable. |
| CatBoost | A gradient boosted decision tree algorithm by Yandex. Excels with categorical features. Our best single-model framework. |
| Cross-Stacking | Using one model's OOF predictions as an input feature for another model. E.g., LR predictions become a feature for CatBoost. Our strongest technique. |
| Feature Engineering (FE) | Creating new input features from existing ones. Our key FE: frequency encoding, original UCI stats, target encoding, all-categorical treatment. |
| Rank Blending | Converting predictions to ranks before averaging. More robust than averaging raw probabilities because it's invariant to each model's calibration. |
| Hill-Climbing | Greedy algorithm that tries adding each model to the ensemble and keeps the one that improves the score most. Repeats until no improvement. |
| Label Noise | ~11% of training samples have contradictory labels (same features, different diagnosis). This is real clinical ambiguity, not data error. Creates a ceiling on achievable AUC. |
| Multi-Seed | Training the same model with different random seeds. Each seed gives slightly different results. Blending them reduces variance. |
| Target Encoding | Replacing a categorical value with the average target rate for that category (e.g., "chest pain type 4" → 0.72 heart disease rate). Must be done carefully to avoid leakage. |
| Pseudo-Labeling | Using our model's confident predictions on test data as additional training labels. Semi-supervised technique. |
| Adversarial Validation | Training a model to tell train from test data. If it can't (AUC ≈ 0.5), the distributions match. If it can, we need to reweight samples. |
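For reference, the 10-fold OOF protocol from the glossary as a sketch; `model_factory` is any callable returning a fresh estimator, and arrays are assumed NumPy-indexable:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def oof_predict(model_factory, X, y, n_splits=10, seed=42):
    """Train on 9 folds, predict the held-out fold, repeat; the assembled
    OOF vector gives an unbiased local estimate of LB performance."""
    X, y = np.asarray(X), np.asarray(y)
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in skf.split(X, y):
        model = model_factory()          # fresh estimator per fold
        model.fit(X[tr], y[tr])
        oof[va] = model.predict_proba(X[va])[:, 1]
    return oof, roc_auc_score(y, oof)
```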
To rebuild this dashboard:

```bash
cd kaggle-s6e2 && .venv/bin/python3 src/build_dashboard.py
```