Clinical Outcome Prediction
from Noisy Medical Records
Predicting 30-day hospital readmission in diabetic patients using supervised machine learning on real-world clinical data.
What This Project Is About
The Problem
This project focuses on predicting hospital readmission for diabetic patients using real-world clinical data. The main goal is to build and evaluate machine learning models that can predict whether a patient will be readmitted to the hospital within 30 days.
Early identification of high-risk patients can help hospitals improve care, reduce avoidable readmissions, and support better clinical decision-making.
The Data Source
The project uses the Diabetes 130-US Hospitals dataset, which contains real clinical records collected from 130 hospitals in the United States between 1999 and 2008.
This dataset includes real medical noise, missing values, duplicate patient encounters, categorical clinical features, and an imbalanced target label — reflecting realistic challenges in medical records.
Why It Matters
Unlike small clean datasets, this dataset reflects realistic healthcare challenges. The combination of noisy labels, class imbalance, and high dimensionality makes it an ideal testbed for evaluating model robustness.
Seven machine learning classifiers were trained, compared, and tested under adversarial label noise to identify the most robust predictor.
Diabetes 130-US Hospitals
Target Variable: readmitted
Originally a three-class variable ("<30", ">30", "NO"), converted to binary classification: encounters readmitted within 30 days ("<30") form the positive class, while ">30" and "NO" are combined into the negative class.
The positive class (<30 days) is clinically important because readmission within 30 days triggers preventable-readmission penalties for hospitals.
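Deriving the binary target from the three-class column is a single comparison; a minimal sketch using pandas (the column name `readmitted` is from the original dataset, the new column name is illustrative):

```python
import pandas as pd

# Toy frame mimicking the dataset's three-class "readmitted" column.
df = pd.DataFrame({"readmitted": ["<30", ">30", "NO", "<30", "NO"]})

# Binary target: 1 if readmitted within 30 days, else 0.
df["readmit_30d"] = (df["readmitted"] == "<30").astype(int)

print(df["readmit_30d"].tolist())  # [1, 0, 0, 1, 0]
```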
Data Challenges
The dataset contains real-world challenges including missing values represented as "?" in multiple columns, duplicate patient encounters across time, highly imbalanced class distribution, mixed categorical and numerical features, and near-zero variance medication columns. These challenges were all addressed in the preprocessing pipeline.
Understanding the Data
Class Imbalance
The dataset is highly imbalanced. Most patients were not readmitted within 30 days. This means accuracy alone is unreliable — Recall, F1-score, and AUC-ROC are essential evaluation metrics.
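A quick illustration of why accuracy alone misleads: a classifier that always predicts "not readmitted" scores high accuracy but zero recall on the minority class. The 10% positive rate below is a toy assumption for the demo, not the dataset's exact figure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy labels: roughly 10% positive (readmitted < 30 days), 90% negative.
y_true = (rng.random(1000) < 0.10).astype(int)
y_pred = np.zeros_like(y_true)  # always predict "not readmitted"

accuracy = (y_pred == y_true).mean()            # high, because negatives dominate
recall = y_pred[y_true == 1].sum() / max(y_true.sum(), 1)  # zero: no positives found

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Despite the high accuracy, the model catches none of the patients it was supposed to flag, which is exactly what recall, F1, and AUC-ROC expose.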
Missing Value Analysis
The weight column had ~97% missing values and was removed. race, payer_code, and medical_specialty were filled with "Unknown." Lab result columns were kept — absence itself can be clinically meaningful.
Strongest Predictor
number_inpatient was one of the strongest predictors of readmission. Patients with more previous inpatient visits had a significantly higher risk of being readmitted.
Age Patterns
Older patients — especially those between 60 and 90 years old — had higher readmission rates. Medication usage analysis revealed many columns had very low usage, so near-zero variance medication features were removed.
Preprocessing Pipeline
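The cleaning steps described above (dropping the mostly-missing weight column, filling demographic gaps with "Unknown", and removing near-zero-variance medication columns) can be sketched as follows. Column names match the public dataset, but the 99% dominance cutoff for "near-zero variance" is an illustrative choice:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pipeline for the Diabetes 130-US Hospitals data."""
    df = df.replace("?", np.nan)                       # "?" marks missing values
    df = df.drop(columns=["weight"], errors="ignore")  # ~97% missing, dropped
    for col in ("race", "payer_code", "medical_specialty"):
        if col in df:
            df[col] = df[col].fillna("Unknown")
    # Drop near-zero-variance columns (e.g. rarely used medications):
    # keep a column only if its most common value covers < 99% of rows.
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True, dropna=False).iloc[0] < 0.99]
    return df[keep]

# Toy demonstration on a few of the real column names.
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "?", "?"],
    "race": ["Caucasian", "?", "Asian", "?", "AfricanAmerican"],
    "metformin": ["No", "No", "No", "No", "No"],  # near-zero variance
    "number_inpatient": [0, 1, 2, 0, 3],
})
cleaned = preprocess(toy)
print(list(cleaned.columns))  # ['race', 'number_inpatient']
```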
Simulated Label Noise
To align with the project theme of "Clinical Outcome Prediction from Noisy Medical Records," label noise was introduced as the adversarial condition: out of 69,970 total samples, 10% of the training labels were randomly flipped, simulating the recording errors found in real-world healthcare documentation.
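A minimal sketch of this kind of label-flip injection. The 10% rate comes from the project setup; the RNG seed is arbitrary:

```python
import numpy as np

def flip_labels(y: np.ndarray, rate: float = 0.10, seed: int = 42) -> np.ndarray:
    """Randomly flip a fraction of binary labels to simulate recording errors."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(len(y) * rate)
    idx = rng.choice(len(y), size=n_flip, replace=False)  # rows to corrupt
    y_noisy[idx] = 1 - y_noisy[idx]                       # 0 -> 1, 1 -> 0
    return y_noisy

y_clean = np.zeros(1000, dtype=int)
y_noisy = flip_labels(y_clean)
print(int(y_noisy.sum()))  # 100 — exactly 10% of labels flipped
```

Only the training labels are corrupted this way; validation labels stay clean so that the robustness comparison remains fair.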
Purpose
The goal was to test how robust each model is when training labels contain errors — a realistic and clinically important challenge in real healthcare record systems.
Outcome
Label noise reduced model reliability and highlighted the importance of robustness in healthcare prediction tasks. Linear models like Logistic Regression were more resilient than complex tree-based methods.
Baseline Classifiers
Seven baseline machine learning classifiers were trained and compared. Each model was evaluated using Accuracy, Precision, Recall, F1-score, and AUC-ROC. Confusion matrices and ROC curves were also used to better understand model behavior.
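A condensed sketch of that train-and-compare loop, shown with two of the seven models and a synthetic imbalanced stand-in for the clinical data (the metric functions are scikit-learn's standard ones):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 20 features, roughly 10% positive class.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_va)[:, 1]   # scores for AUC-ROC
    pred = model.predict(X_va)
    results[name] = {
        "accuracy": accuracy_score(y_va, pred),
        "recall": recall_score(y_va, pred),
        "f1": f1_score(y_va, pred),
        "auc": roc_auc_score(y_va, proba),
    }
print(results)
```

Extending the dictionary with the remaining five classifiers reproduces the full comparison; confusion matrices and ROC curves come from the same predictions.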
Validation Performance
| Model | Accuracy | AUC-ROC | Selection |
|---|---|---|---|
| Logistic Regression | 0.636 | 0.624 | Best Model |
| Decision Tree | 0.597 | 0.577 | |
| Random Forest | 0.732 | 0.607 | |
| Naive Bayes | 0.879 | 0.593 | |
| K-Nearest Neighbors | 0.905 | 0.507 | |
| SVM | 0.694 | 0.622 | Runner-Up |
| Gradient Boosting | 0.910 | 0.615 | |
AUC-ROC Comparison
Key Takeaway
Although KNN and Gradient Boosting achieved the highest accuracy (0.905 and 0.910 respectively), they performed poorly in detecting readmitted patients. This demonstrates that high accuracy can be deeply misleading in imbalanced medical datasets. Logistic Regression was selected as the best model because it achieved the highest AUC-ROC and had better recall for the minority class. SVM was also competitive, with an AUC close to Logistic Regression.
Optimising the Best Models
Logistic Regression (Best Tuned Model)
Achieved the best validation AUC overall. Best parameters found: C = 0.01, solver = lbfgs. Tuning only slightly changed the parameters — the baseline was already near optimal.
Random Forest (Improved, Not Best)
Tuning improved AUC slightly compared to the baseline, but it still did not outperform Logistic Regression on the validation AUC metric.
SVM (Competitive Runner-Up)
Tuned SVM performed close to Logistic Regression but remained slightly lower in validation AUC, confirming Logistic Regression as the model of choice.
What We Learned
What Could Be Improved
Current Limitations
The highly imbalanced dataset makes it difficult for models to correctly identify the minority class even with careful preprocessing. Missing and noisy values reduce model performance regardless of the cleaning strategy. Some clinical features may not fully capture the complexity of patient health. Introduced label noise makes classification harder because some training labels do not reflect the true outcome.
Future Directions
Future work could explore SMOTE or other oversampling techniques to address class imbalance. Decision threshold tuning and cost-sensitive learning could further improve recall for the minority class. More interpretable healthcare-specific models and Explainable AI methods (like SHAP values) would increase clinical trust. Ensemble methods combining the best-performing models may also yield improvements.
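Of these directions, decision-threshold tuning is the simplest to prototype: instead of the default 0.5 cutoff, scan candidate thresholds and keep the one maximizing F1. A hedged sketch with hand-picked toy scores (the threshold grid and the F1 criterion are illustrative choices):

```python
import numpy as np

def best_f1_threshold(y_true: np.ndarray, scores: np.ndarray):
    """Scan candidate thresholds; return the one maximizing F1 on y_true."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 19):
        pred = (scores >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy data: 3 positives among 10, positives tend to score higher.
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.6, 0.42, 0.45, 0.3, 0.2, 0.1, 0.05, 0.35, 0.15])
t, f1 = best_f1_threshold(y, scores)
print(round(t, 2), round(f1, 3))  # a cutoff below 0.5 recovers the third positive
```

Lowering the cutoff trades some precision for recall, which is the direction that matters when missing a readmission is costlier than a false alarm.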
Conclusion
This project successfully applied supervised machine learning to predict 30-day hospital readmission using noisy clinical records. The results showed that Logistic Regression performed best overall, while some models with high accuracy failed to detect the minority class effectively. The project highlights the critical importance of preprocessing, proper evaluation metrics, and robustness testing when working with real-world healthcare data.
The Researchers
Resources & Deliverables
All project links will be updated upon submission.