Clinical Outcome Prediction · CS4082 ML Project
🏥 CS4082 · Machine Learning

Clinical Outcome Prediction
from Noisy Medical Records

Predicting 30-day hospital readmission in diabetic patients using supervised machine learning on real-world clinical data.

📚 Course: CS4082 Machine Learning · 👩‍🏫 Instructor: Dr. Naila Marir · 🗃 Dataset: Diabetes 130-US Hospitals

What This Project Is About

The Problem

This project focuses on predicting hospital readmission for diabetic patients using real-world clinical data. The main goal is to build and evaluate machine learning models that can predict whether a patient will be readmitted to the hospital within 30 days.

Early identification of high-risk patients can help hospitals improve care, reduce avoidable readmissions, and support better clinical decision-making.

The Data Source

The project uses the Diabetes 130-US Hospitals dataset, which contains real clinical records collected from 130 hospitals in the United States between 1999 and 2008.

This dataset includes real medical noise, missing values, duplicate patient encounters, categorical clinical features, and an imbalanced target label — reflecting realistic challenges in medical records.

Why It Matters

Unlike small clean datasets, this dataset reflects realistic healthcare challenges. The combination of noisy labels, class imbalance, and high dimensionality makes it an ideal testbed for evaluating model robustness.

Seven machine learning classifiers were trained, compared, and tested under adversarial label noise to identify the most robust predictor.

Diabetes 130-US Hospitals

101,766 Original Rows · 50 Original Columns · 69,970 Rows After Cleaning · 93 Encoded Features · 25 PCA Components · ~90% Variance Retained

Target Variable: readmitted

Originally a three-class variable, the target was converted to a binary classification task:

Class 1
Readmitted within 30 days (positive · minority class)
Class 0
Not readmitted within 30 days (negative · majority class)

The positive class (<30 days) is clinically important because readmission within 30 days triggers preventable-readmission penalties for hospitals.
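As a minimal sketch of the binarisation (assuming the UCI encoding of the readmitted column as "<30", ">30", and "NO", and a hypothetical diabetic_data.csv file):

```python
import pandas as pd

# Collapse the three-class `readmitted` label into a binary 30-day target.
# Assumes the UCI encoding of the column as "<30", ">30", and "NO".
df = pd.read_csv("diabetic_data.csv")                            # hypothetical file name
df["readmitted_30d"] = (df["readmitted"] == "<30").astype(int)   # 1 = readmitted within 30 days
print(df["readmitted_30d"].value_counts(normalize=True))         # shows the class imbalance
```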

Data Challenges

The dataset contains real-world challenges including missing values represented as "?" in multiple columns, duplicate patient encounters across time, highly imbalanced class distribution, mixed categorical and numerical features, and near-zero variance medication columns. These challenges were all addressed in the preprocessing pipeline.

Understanding the Data

📊

Class Imbalance

The dataset is highly imbalanced. Most patients were not readmitted within 30 days. This means accuracy alone is unreliable — Recall, F1-score, and AUC-ROC are essential evaluation metrics.

🔍

Missing Value Analysis

The weight column had ~97% missing values and was removed. race, payer_code, and medical_specialty were filled with "Unknown." Lab result columns were kept — absence itself can be clinically meaningful.
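A minimal sketch of this handling, assuming the UCI column names and a hypothetical diabetic_data.csv file:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetic_data.csv")            # hypothetical file name
df = df.replace("?", np.nan)                     # "?" marks missing values in this dataset

df = df.drop(columns=["weight"])                 # ~97% missing, dropped entirely
for col in ["race", "payer_code", "medical_specialty"]:
    df[col] = df[col].fillna("Unknown")          # keep the rows, flag the gap
# Lab-result columns (e.g. A1Cresult, max_glu_serum) are left untouched:
# "not measured" can itself be clinically meaningful.
```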

🏥

Strongest Predictor

number_inpatient was one of the strongest predictors of readmission. Patients with more previous inpatient visits had a significantly higher risk of being readmitted.

👴

Age Patterns

Older patients — especially those between 60 and 90 years old — had higher readmission rates. Medication usage analysis revealed many columns had very low usage, so near-zero variance medication features were removed.

Preprocessing Pipeline

1
Data Cleaning
Missing values represented as "?" replaced with NaN. Duplicate patient records handled by keeping only the first encounter per patient to reduce data leakage. Identifier columns (encounter_id, patient_nbr) removed. Patients who died or were discharged to hospice removed.
2
Feature Encoding
Age encoded ordinally (natural order). Medication columns encoded based on status: No / Steady / Up / Down. All other categorical variables one-hot encoded, yielding 93 encoded features.
3
Feature Selection
Random Forest feature importance used to rank features. The top features covering 85% of cumulative importance were selected, removing low-signal columns.
4
Scaling & Dimensionality Reduction
StandardScaler applied to normalise features. PCA then reduced the feature space to 25 components while retaining ~90% of the variance.
5
Train / Validation / Test Split
Data split into 70% training, 15% validation (model comparison & hyperparameter tuning), and 15% test (final evaluation only).
StandardScaler · PCA (25 components) · Ordinal Encoding · One-Hot Encoding · RF Feature Importance · 70/15/15 Split · Data Leakage Prevention
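The five steps above can be condensed into the following sketch. It assumes the UCI "Diabetes 130-US Hospitals" schema; the specific discharge codes, the medication columns listed, and the omission of high-cardinality diagnosis codes are assumptions, not the project's exact implementation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetic_data.csv").replace("?", np.nan)       # hypothetical file name

# 1) Cleaning: one encounter per patient, drop identifiers, drop expired / hospice discharges.
df = df.sort_values("encounter_id").drop_duplicates("patient_nbr", keep="first")
df = df[~df["discharge_disposition_id"].isin([11, 13, 14, 19, 20, 21])]   # assumed code list
df = df.drop(columns=["encounter_id", "patient_nbr", "weight"])
df = df.drop(columns=["diag_1", "diag_2", "diag_3"])             # high-cardinality codes omitted here

# 2) Encoding: ordinal age, ordinal medication status, one-hot for the remaining categoricals.
df["age"] = df["age"].map({f"[{10*i}-{10*(i+1)})": i for i in range(10)})
med_status = {"No": 0, "Down": 1, "Steady": 2, "Up": 3}
med_cols = [c for c in ["insulin", "metformin", "glipizide", "glyburide"] if c in df.columns]
df[med_cols] = df[med_cols].apply(lambda s: s.map(med_status))

y = (df.pop("readmitted") == "<30").astype(int)
X = pd.get_dummies(df, dummy_na=True)

# 3) Feature selection: keep the features covering 85% of cumulative RF importance.
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1).fit(X, y)
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
X = X[imp[imp.cumsum() <= 0.85].index]

# 4) Scaling, then PCA down to 25 components (~90% variance retained in the report).
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=25).fit_transform(X_scaled)

# 5) Stratified 70 / 15 / 15 split: train, validation, test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X_pca, y, test_size=0.30,
                                                  stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50,
                                                stratify=y_tmp, random_state=42)
```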

Simulated Label Noise

⚠️ Adversarial Label Noise

To align with the project theme of "Clinical Outcome Prediction from Noisy Medical Records," label noise was introduced as the adversarial condition. 10% of training labels were randomly flipped, simulating real-world recording errors in healthcare datasets.

6,997
Labels Flipped

Out of 69,970 total samples, 10% of training labels were randomly changed to simulate noisy medical documentation.
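A minimal sketch of the flipping step, assuming the y_train array produced by the preprocessing sketch above:

```python
import numpy as np

# Flip 10% of the training labels to simulate noisy clinical documentation.
rng = np.random.default_rng(42)
y_train_noisy = np.asarray(y_train).copy()
n_flip = int(0.10 * len(y_train_noisy))
flip_idx = rng.choice(len(y_train_noisy), size=n_flip, replace=False)
y_train_noisy[flip_idx] = 1 - y_train_noisy[flip_idx]   # 0 <-> 1
```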

🎯

Purpose

The goal was to test how robust each model is when training labels contain errors — a realistic and clinically important challenge in real healthcare record systems.

📉

Outcome

Label noise reduced model reliability and highlighted the importance of robustness in healthcare prediction tasks. Linear models like Logistic Regression were more resilient than complex tree-based methods.

Baseline Classifiers

Seven baseline machine learning classifiers were trained and compared. Each model was evaluated using Accuracy, Precision, Recall, F1-score, and AUC-ROC. Confusion matrices and ROC curves were also used to better understand model behavior.

📈 Logistic Regression · 🌳 Decision Tree · 🌲 Random Forest · 🎲 Naive Bayes · 📍 K-Nearest Neighbors · SVM · 🚀 Gradient Boosting
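A hedged sketch of the comparison for these seven baselines, using scikit-learn defaults (the project's exact hyperparameters are not documented here) and the noisy training labels and validation split from the preceding sections:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=42),   # probability=True needed for AUC; slow on large sets
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit on the noisy training labels, report all five metrics on the validation split.
for name, model in models.items():
    model.fit(X_train, y_train_noisy)
    pred = model.predict(X_val)
    proba = model.predict_proba(X_val)[:, 1]
    print(f"{name:22s} acc={accuracy_score(y_val, pred):.3f} "
          f"prec={precision_score(y_val, pred, zero_division=0):.3f} "
          f"rec={recall_score(y_val, pred):.3f} "
          f"f1={f1_score(y_val, pred):.3f} "
          f"auc={roc_auc_score(y_val, proba):.3f}")
```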

Validation Performance

Model                  Accuracy   AUC-ROC   Selection
Logistic Regression    0.636      0.624     Best Model
Decision Tree          0.597      0.577
Random Forest          0.732      0.607
Naive Bayes            0.879      0.593
K-Nearest Neighbors    0.905      0.507
SVM                    0.694      0.622     Runner-Up
Gradient Boosting      0.910      0.615

AUC-ROC Comparison

Logistic Reg. 0.624 · SVM 0.622 · Gradient Boosting 0.615 · Random Forest 0.607 · Naive Bayes 0.593 · Decision Tree 0.577 · KNN 0.507

Key Takeaway

Although KNN and Gradient Boosting achieved the highest accuracy (0.905 and 0.910 respectively), they performed poorly in detecting readmitted patients. This demonstrates that high accuracy can be deeply misleading in imbalanced medical datasets. Logistic Regression was selected as the best model because it achieved the highest AUC-ROC and had better recall for the minority class. SVM was also competitive, with an AUC close to Logistic Regression.

Optimising the Best Models

📈

Logistic Regression

Achieved the best validation AUC overall. Best parameters found: C = 0.01, solver = lbfgs. Tuning only slightly changed parameters — the baseline was already near optimal.

Best Tuned Model
🌲

Random Forest

Tuning improved AUC slightly compared to the baseline, but it still did not outperform Logistic Regression on the validation AUC metric.

Improved, Not Best

SVM

Tuned SVM performed close to Logistic Regression but remained slightly lower in validation AUC, confirming Logistic Regression as the model of choice.

Competitive Runner-Up
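A sketch of the tuning step for Logistic Regression. The grid itself is an assumption, but it includes the reported best setting (C = 0.01, solver = lbfgs) and uses validation AUC-ROC as the selection metric, as in the project:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; includes the reported best parameters C=0.01, solver="lbfgs".
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",      # AUC-ROC is the selection metric used throughout the project
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train_noisy)
print(search.best_params_, search.best_score_)
```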

What We Learned

⚖️
The dataset is highly imbalanced, so accuracy alone is not a sufficient evaluation metric. Recall, F1-score, and AUC-ROC must be used together to assess model performance on the minority class.
🏆
Logistic Regression performed best overall based on AUC-ROC and recall for the readmitted class, despite having lower raw accuracy than KNN and Gradient Boosting.
⚠️
KNN and Gradient Boosting achieved high accuracy (0.905 and 0.910) but failed to correctly identify many readmitted patients, making them unsuitable for this clinical prediction task.
📋
number_inpatient was one of the strongest individual predictors of 30-day readmission — patients with more prior inpatient visits carried significantly higher risk.
👴
Older patients (especially ages 60–90) had higher readmission risk. Age should always be included as a feature in clinical outcome prediction models.
🔊
Simulated label noise (10% random flips) reduced model reliability across the board, underscoring the importance of data quality and model robustness when deploying ML in healthcare settings.

What Could Be Improved

Current Limitations

The highly imbalanced dataset makes it difficult for models to correctly identify the minority class even with careful preprocessing. Missing and noisy values reduce model performance regardless of the cleaning strategy. Some clinical features may not fully capture the complexity of patient health. Introduced label noise makes classification harder because some training labels do not reflect the true outcome.

Future Directions

Future work could explore SMOTE or other oversampling techniques to address class imbalance. Decision threshold tuning and cost-sensitive learning could further improve recall for the minority class. More interpretable healthcare-specific models and Explainable AI methods (like SHAP values) would increase clinical trust. Ensemble methods combining the best-performing models may also yield improvements.

SMOTE · Threshold Tuning · Cost-Sensitive Learning · SHAP / XAI · Ensemble Methods
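As a hypothetical sketch of two of these directions (SMOTE oversampling plus decision-threshold tuning), assuming the imbalanced-learn package and the validation split from earlier; this is not part of the submitted project:

```python
import numpy as np
from imblearn.over_sampling import SMOTE          # requires the imbalanced-learn package
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score

# Oversample the minority (readmitted) class in the training data only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train_noisy)
clf = LogisticRegression(C=0.01, max_iter=1000).fit(X_res, y_res)

# Tune the decision threshold on the validation set instead of using 0.5.
proba = clf.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.1, 0.9, 17)
preds = lambda t: (proba >= t).astype(int)
best_t = max(thresholds, key=lambda t: f1_score(y_val, preds(t)))
print(f"best threshold={best_t:.2f} recall={recall_score(y_val, preds(best_t)):.3f}")
```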

Conclusion

This project successfully applied supervised machine learning to predict 30-day hospital readmission using noisy clinical records. The results showed that Logistic Regression performed best overall, while some models with high accuracy failed to detect the minority class effectively. The project highlights the critical importance of preprocessing, proper evaluation metrics, and robustness testing when working with real-world healthcare data.

The Researchers

👩‍💻
Machine Learning Engineer
CS4082 · Machine Learning
Dr. Naila Marir