Chronic Obstructive Pulmonary Disease (COPD) is a respiratory illness characterized by airflow limitation, inflammation, and recurrent exacerbation events that significantly impact patient morbidity and healthcare costs. As the fourth leading cause of death worldwide, COPD presents an urgent need for improved predictive tools to manage disease progression and reduce hospitalizations. However, accurate prediction of acute exacerbations in COPD remains a critical challenge due to the limited availability of high-quality, temporally rich patient data. This study presents a novel application of TimeGAN, a generative adversarial network tailored for time-series data, to synthesize realistic multivariate health records derived from wearable sensors monitoring COPD patients. By preserving both temporal dynamics and feature-level dependencies, the synthetic data generated by TimeGAN enhances model training for downstream prediction tasks. Classifiers trained on augmented datasets demonstrate a significant improvement, with gains exceeding 16% in predictive accuracy, compared to models trained on the limited real data alone. The approach not only improves forecasting of exacerbation events but also supports equitable model performance across underrepresented patient data. These findings highlight the potential of generative modeling to address data scarcity in respiratory healthcare and advance the development of more generalizable, personalized predictive systems. This work is currently available as a preprint and is under review for publication.
The toughest step was cleaning the data: removing noise, interpolating gaps, and standardizing inputs. Once reliable sequences were built, I applied TimeGAN for augmentation and trained LSTMs for prediction. Careful validation guarded against data leakage between training and evaluation sets, making the pipeline robust and reproducible for real-world health AI applications.
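The cleaning and leakage-control steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the helper names (interpolate_gaps, make_windows, split_patients, fit_scaler), the window length, the patient IDs, and the random toy data are all assumptions made for the example, and noise removal is left out. The sketch assumes leakage is avoided by splitting at the patient level and fitting the scaler on training patients only.

```python
import numpy as np

def interpolate_gaps(series):
    """Linearly interpolate NaN gaps in a 1-D array (edge gaps take the nearest valid value)."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(series.size)
    valid = ~np.isnan(series)
    if not valid.any():
        raise ValueError("series has no valid samples")
    return np.interp(idx, idx[valid], series[valid])

def make_windows(features, window):
    """Slice a (time, n_features) array into overlapping (window, n_features) sequences."""
    n = features.shape[0] - window + 1
    if n <= 0:
        return np.empty((0, window, features.shape[1]))
    return np.stack([features[i:i + window] for i in range(n)])

def split_patients(patient_ids, test_fraction=0.25, seed=0):
    """Assign whole patients to train or test so no patient appears in both."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(patient_ids)))
    rng.shuffle(unique)
    n_test = max(1, int(round(test_fraction * unique.size)))
    return set(unique[n_test:].tolist()), set(unique[:n_test].tolist())

def fit_scaler(train_arrays):
    """Compute per-feature mean and std from training data only."""
    stacked = np.concatenate(train_arrays, axis=0)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

# Toy data: 4 hypothetical patients, 2 sensor features, 10 time steps each.
rng = np.random.default_rng(42)
records = {}
for pid in ["p1", "p2", "p3", "p4"]:
    x = rng.normal(size=(10, 2))
    x[3, 0] = np.nan  # simulate a dropped sensor reading
    records[pid] = np.column_stack([interpolate_gaps(x[:, j]) for j in range(2)])

train_ids, test_ids = split_patients(list(records))
mean, std = fit_scaler([records[p] for p in train_ids])
train_windows = np.concatenate(
    [make_windows((records[p] - mean) / std, window=5) for p in sorted(train_ids)]
)
```

Under this arrangement, only train_windows would be passed to the generator, so synthetic sequences are never derived from patients held out for evaluation.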
To test whether the synthetic data resembled real patient data, I used PCA and t-SNE visualizations. Overlapping clusters of real and synthetic points indicated that TimeGAN produced realistic sequences. These sanity checks lowered the risk of introducing bias and gave confidence that the augmented data could be used for model training without distorting outcomes.
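A PCA comparison of this kind can be sketched with NumPy alone. The example below is an assumption-laden illustration rather than the project's code: the function name pca_project, the array shapes, and the random stand-in data are hypothetical, and flattening each sequence into a single vector is just one possible way to prepare sequences for projection. The t-SNE view and the scatter plot itself are omitted.

```python
import numpy as np

def pca_project(real, synthetic, n_components=2):
    """Fit PCA on flattened real sequences and project both sets onto the same axes."""
    real_flat = real.reshape(real.shape[0], -1)
    syn_flat = synthetic.reshape(synthetic.shape[0], -1)
    mean = real_flat.mean(axis=0)
    # Rows of vt are the principal directions of the centered real data.
    _, _, vt = np.linalg.svd(real_flat - mean, full_matrices=False)
    components = vt[:n_components]
    return (real_flat - mean) @ components.T, (syn_flat - mean) @ components.T

# Stand-in data shaped (n_sequences, seq_len, n_features); not real patient data.
rng = np.random.default_rng(0)
real = rng.normal(size=(50, 24, 3))
synthetic = rng.normal(size=(50, 24, 3))

real_2d, syn_2d = pca_project(real, synthetic)

# A crude numeric companion to the scatter plot: distance between the two centroids.
centroid_gap = np.linalg.norm(real_2d.mean(axis=0) - syn_2d.mean(axis=0))
```

Plotting real_2d and syn_2d in two colors gives the overlap view. Fitting the projection on real data only keeps the axes from being shaped by the synthetic samples under inspection.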
TimeGAN-augmented training improved LSTM performance: accuracy rose from 0.60 to 0.76 and the F1 score from 0.40 to 0.66. These gains highlight the power of synthetic augmentation in small, imbalanced medical datasets, transforming limited raw data into more reliable early-warning models for COPD.
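To make the reported numbers concrete, the short sketch below computes the absolute and relative gains from the four figures quoted above, and shows with hypothetical labels why F1 is worth reporting alongside accuracy on imbalanced data. The function names and the toy label vectors are assumptions for illustration; they are not the study's predictions.

```python
def gains(before, after):
    """Return (absolute gain, relative gain) between two metric values."""
    absolute = after - before
    return absolute, absolute / before

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no positives at all."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Gains computed from the reported metrics.
acc_abs, acc_rel = gains(0.60, 0.76)
f1_abs, f1_rel = gains(0.40, 0.66)
print(round(acc_abs, 2), round(acc_rel, 3))  # 0.16 0.267
print(round(f1_abs, 2), round(f1_rel, 3))    # 0.26 0.65

# Hypothetical imbalanced labels: a model that never predicts an exacerbation.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))  # 0.6 0.0
```

The toy case shows that an accuracy of 0.6 is achievable without detecting a single positive event, which is why the F1 improvement is the more informative of the two figures for an early-warning task.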