- Arshia Koul - October 15, 2024
This year I trained a TimeGAN model on a modest wearable-sensor dataset for chronic obstructive pulmonary disease (COPD). The model eventually behaved, but the project’s biggest lesson wasn’t technical; it was about scarcity. The sample was simply small: too few nights of sleep data, too few exercise sessions, too few patients overall to capture the full range of COPD experiences.
While I battled those limitations in Python, my mom was battling uncertainty in doctors’ offices, searching for answers about painful uterine fibroids. Each specialist admitted the research base is thinner than it should be, especially for women who don’t fit the “average” study participant. Her story turned my coding frustration into something personal.
What the COPD project taught me:
Limited data limits insight. With only a handful of patients, the model sometimes mistook the elevated heart rate after a flight of stairs for the onset of an exacerbation.
Synthetic data needs solid roots. TimeGAN can expand a dataset, but it can’t invent patterns that were never recorded. Real, diverse signals are still the foundation.
Context is everything. Every graph in my notebook now starts with a note: “Sample size and collection window.” That reminder stops me from overpromising what the model can do.
Watching my mom cycle through scans, medication trials, and “come back in six months” visits made the parallel clear: when original data are thin, everyone has to work harder—patients, doctors, and, yes, algorithms. Better datasets could mean earlier detection of fibroid growth, clearer guidelines, and fewer anxious months waiting for the next appointment.
Large reviews have documented how women are often underrepresented in medical studies, from cardiovascular trials (https://pubmed.ncbi.nlm.nih.gov/34079530/) to basic biomedical research (https://pubmed.ncbi.nlm.nih.gov/35602460/) to broad clinical-trial registries (https://news.northwestern.edu/stories/2021/06/women-and-men-are-underrepresented-in-clinical-trials/). When data are thin, care guidelines stay vague, diagnoses arrive late, and families like mine feel the cost.
How I want to push this work forward
The COPD project and my mom’s experience convinced me that filling data gaps is the next research frontier I want to tackle. Going forward, I’m especially interested in:
Building or contributing to open datasets that include women across ages and ethnicities for conditions such as fibroids, endometriosis, and PCOS.
Developing simple “data-equity checklists” and code snippets that help students audit who is—and isn’t—in their spreadsheets before they train a model.
Writing and speaking about these gaps so more budding researchers ask, “Who’s missing from my data?” before they click Run.
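The "data-equity checklist" idea above can be sketched as a small audit function. This is a hypothetical illustration, not code from the project: the column names and rows below are made up, and any real audit would need the demographic columns your own spreadsheet actually has.

```python
from collections import Counter

def audit_column(rows, column):
    """Report each value's share of a demographic column, so you can
    see who is (and isn't) represented before training a model."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Toy rows standing in for a real spreadsheet (columns are illustrative).
rows = [
    {"patient_id": 1, "sex": "F"},
    {"patient_id": 2, "sex": "M"},
    {"patient_id": 3, "sex": "M"},
    {"patient_id": 4, "sex": "M"},
]

shares = audit_column(rows, "sex")
print(shares)  # {'F': 0.25, 'M': 0.75} -> women underrepresented here
```

Running this over every demographic column before clicking Run is exactly the kind of thirty-second check the question "Who's missing from my data?" calls for.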
My mom’s chart and my COPD CSV may live in different folders, but they share the same message: when data are scarce, real people pay the price. Closing those gaps isn’t just a technical goal; it’s a family one.
- Arshia Koul - January 15, 2025
When I joined the COPD early-warning project, I imagined the hard part would be training the models. Instead, the hardest part turned out to be the most invisible: making the data usable.
The dataset I worked with came from multiple sources such as wearable devices (heart rate, steps, oxygen), daily symptom surveys (EXACT score), and inhaler sensors (controller and rescue use). Each had its own quirks: missing values, noisy readings, and different scales. To get them to “speak the same language,” I had to normalize and standardize everything. I used SciPy interpolators to fill shorter gaps, dropped unreliable runs, and aligned all streams into day-by-day timelines.
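The actual project used SciPy interpolators; as a simplified pure-Python stand-in, the two core steps (fill short gaps by linear interpolation, then standardize each stream to a shared scale) look roughly like this, assuming `None` marks a missing daily reading:

```python
def fill_short_gaps(values, max_gap=2):
    """Linearly interpolate interior runs of None no longer than max_gap;
    longer runs stay missing (unreliable runs were dropped in practice)."""
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i
            while i < len(filled) and filled[i] is None:
                i += 1
            gap = i - start
            if 0 < start and i < len(filled) and gap <= max_gap:
                left, right = filled[start - 1], filled[i]
                for k in range(gap):
                    filled[start + k] = left + (right - left) * (k + 1) / (gap + 1)
        else:
            i += 1
    return filled

def zscore(values):
    """Standardize to zero mean, unit variance so heart rate, steps,
    and oxygen streams 'speak the same language'."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var ** 0.5 for v in values]

daily_hr = [72, None, 78, 80]      # one missing day in a heart-rate stream
print(fill_short_gaps(daily_hr))   # [72, 75.0, 78, 80]
```

In the real pipeline the equivalent of `fill_short_gaps` was `scipy.interpolate` applied per stream, followed by aligning every source onto the same day-by-day index.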
Only after this heavy lifting could I move to modeling. I trained an LSTM forecaster to detect pre-exacerbation signatures, and used TimeGAN to create synthetic sequences that balanced out the scarcity of positive events. To make sure my synthetic data wasn’t garbage, I ran PCA/t-SNE checks to confirm the distributions overlapped with real data.
The results were clear: garbage in, garbage out. Clean, consistent data in, on the other hand, meant accuracy and F1 scores jumped significantly. For me, the lesson was simple but powerful: in AI, data is king.
- Arshia Koul - April 27, 2025
When I tell people I worked on COPD prediction models, their first question is usually about the AI algorithm. Was it a neural net? Did it use deep learning? And yes, it was an LSTM forecaster. But what often gets overlooked is the challenge of not having enough real patient data in the first place.
In healthcare, this problem is everywhere. Collecting data from patients is expensive, time-consuming, and full of privacy concerns. For COPD, flare-ups don’t happen every day, so we had far fewer “positive events” than “normal days.” That imbalance makes it really hard for a model to learn.
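One standard way to quantify (and partially counter) that imbalance is inverse-frequency class weighting, so the loss function stops rewarding a model for always predicting "normal day." This sketch is an illustration of the idea, not code from the project:

```python
def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally
    larger weight in the training loss."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}

# 95 normal days vs. 5 flare-up days
labels = [0] * 95 + [1] * 5
print(class_weights(labels))  # {0: ~0.53, 1: 10.0}
```

Weighting helps, but it can't add information that was never recorded, which is why the project turned to synthetic data instead of relying on weights alone.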
This is where synthetic data comes in. Using a method called TimeGAN, I generated realistic sequences that looked statistically similar to the real patient data. Think of it like making high-quality practice problems for your model. PCA and t-SNE visualizations showed the synthetic clusters overlapping with real ones, which gave me confidence that the new data wasn’t nonsense.
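The project's check was visual (PCA and t-SNE cluster overlap); a cruder, numeric stand-in for the same sanity check is to compare per-feature means of synthetic data against the real data's spread. This simplified sketch is not the project's code, just the shape of the idea:

```python
def summary(rows):
    """Per-feature mean and standard deviation of a list of samples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        (sum((r[j] - means[j]) ** 2 for r in rows) / n) ** 0.5
        for j in range(d)
    ]
    return means, stds

def roughly_overlaps(real, synthetic, tol=0.5):
    """Flag synthetic data whose per-feature mean drifts more than
    tol real-data standard deviations from the real mean."""
    (rm, rs), (sm, _) = summary(real), summary(synthetic)
    return all(abs(sm[j] - rm[j]) <= tol * rs[j] for j in range(len(rm)))

# [heart_rate, spo2] samples (values are illustrative)
real = [[70, 96], [75, 95], [80, 94], [72, 97]]
fake_ok = [[73, 95], [78, 96]]
fake_bad = [[120, 80], [130, 78]]
print(roughly_overlaps(real, fake_ok))   # True
print(roughly_overlaps(real, fake_bad))  # False
```

A check like this catches gross failures; the PCA/t-SNE plots go further by showing whether the synthetic samples land inside the real clusters rather than merely matching their averages.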
The impact was huge: our model’s accuracy and F1 score jumped by over 16%. Without creating more real patients or risking privacy, we gave the AI more to learn from.
To me, that’s the promise of synthetic data in healthcare. It won’t replace real data, but it can fill critical gaps, level out imbalances, and make models more reliable. For patients, that means earlier warnings, fewer missed diagnoses, and maybe one day, better outcomes.
For me as a student, it was eye-opening to see how creative data science can be. Sometimes solving the problem isn’t about inventing a new algorithm, but about making the data itself strong enough to unlock what AI can really do.
- Arshia Koul - May 15, 2025
When I first started working on my COPD project, I thought data generation would be simple. Just make more of what I already had.
But it wasn’t.
COPD, or chronic obstructive pulmonary disease, is unpredictable. Wearable devices record patients’ vitals day after day (heart rate, oxygen levels, activity), yet the moments that truly matter, the flare-ups, are rare. Out of months of data, only a few days captured those critical events. My model needed to learn from them, but it didn’t have enough examples to recognize the early warning signs.
Like most beginners, I tried SMOTE, a classic data-balancing method I’d seen in tutorials. SMOTE creates “in-between” samples by averaging nearby data points, which works great for static spreadsheets. But when I applied it to time-series health data, everything fell apart.
The new samples looked mathematically valid but biologically impossible, a mix of one patient’s high heart rate with another’s normal oxygen levels. It was like blending two songs with different rhythms; the beats didn’t line up. My model learned noise instead of patterns, and while recall improved slightly, accuracy dropped.
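SMOTE's core step is just linear interpolation between a minority sample and one of its neighbors. A toy sketch (a simplified illustration, not the imbalanced-learn implementation) shows how that can splice two patients' readings into a vital-sign combination neither ever exhibited:

```python
import random

def smote_sample(a, b, rng):
    """SMOTE-style synthesis: a random point on the segment between
    two minority-class samples, feature by feature."""
    t = rng.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)
# [heart_rate, spo2] snapshots from two different patients
patient_a = [120, 85]   # tachycardic and desaturating
patient_b = [65, 98]    # resting with healthy oxygen
blended = smote_sample(patient_a, patient_b, rng)
print(blended)  # an elevated-ish HR paired with near-normal SpO2:
                # mathematically in-between, physiologically dubious
```

For static tabular data this interpolation is often harmless; for multivariate time series it ignores both the temporal ordering and the physiological coupling between signals, which is exactly why the model ended up learning noise.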
That’s when I discovered TimeGAN.
TimeGAN doesn’t just generate random numbers; it learns how data moves through time. It’s a hybrid model combining the realism of a generative adversarial network (GAN) with the sequence awareness of an LSTM. Instead of producing isolated snapshots, it builds entire timelines that evolve realistically.
When I trained TimeGAN on my COPD dataset, the results felt different, almost alive. The synthetic sequences showed realistic progressions, like oxygen levels gradually dipping before a flare-up while heart rate increased. It was the first time I felt that the AI wasn’t just copying data; it was understanding it.
After adding this synthetic data to my training set, the LSTM classifier’s F1-score jumped from 0.40 to 0.66 (and from 0.54 to 0.70 in another test). But beyond the metrics, I learned something more important: good AI isn’t just about more data. It’s about more meaningful data.
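For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, which is why it is a better yardstick than accuracy when flare-up days are rare. A small helper shows how a score like those above is computed from raw prediction counts (the labels below are made up for illustration):

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels: 1 = flare-up day, 0 = normal day
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
print(round(f1_score(y_true, y_pred), 2))  # 0.67
```

Because F1 ignores true negatives, a model that predicts "normal" every day scores 0.0 despite high accuracy, which is what makes the 0.40-to-0.66 jump meaningful on an imbalanced dataset.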
SMOTE treated my readings like points on a graph. TimeGAN treated them like parts of a story.
Working with TimeGAN also made me think about privacy. The model generates data that looks real but doesn’t belong to any real person, a crucial step for responsible medical research. It showed me that technology can expand what’s possible without crossing ethical lines.
Looking back, I realize how much this project changed how I see AI. I began by trying to “fix” an unbalanced dataset. I ended up learning how to make machines learn more like humans, not by copying what exists, but by understanding how it works over time.