Fiona X. Cai

Hi! I’m a second year PhD student in the Department of Biomedical Data Science at Stanford University where I am fortunate to be advised by Emily Alsentzer and Serena Yeung-Levy. My current research interests lie at the intersection of machine learning, computer vision, and representation learning and their applications to healthcare and medical imaging. My work is supported by the NSF Graduate Research Fellowship .

I previously completed my B.S and M.Eng in Computer Science at Massachusetts Institute of Technology (MIT), where I was advised by Prof. John Guttag. In the past, I have worked on clinical informatics problems using traditional statistical machine learning approaches for tasks like clinical trial recruitment and probabilistic phenotyping.

Email / Google Scholar / LinkedIn / Twitter

selected publications

Uncertainty Inclusive Contrastive Learning for Leveraging Synthetic Images

Fiona Cai, Emily Mu, and John Guttag

CVPR 2024 Synthetic Data for Computer Vision Workshop, 2024

Abs HTML

Recent advancements in text-to-image generation models have sparked a growing interest in using synthesized training data to improve few-shot learning performance. Prevailing approaches treat all generated data as uniformly important, neglecting the fact that the quality of generated images varies across different domains, datasets, and methods of generation. Using poor-quality images can hurt learning performance. In this work, we present Uncertaininclusive Contrastive Learning (UniCon), a novel contrastive loss function that incorporates uncertainty weights for synthetic images during training. Extending the framework of supervised contrastive learning, we add a learned hyperparameter that weights the synthetic input images per class, adjusting the influence of synthetic images during the training process. We evaluate the effectiveness of UniCon-learned representations against traditional supervised contrastive learning, both with and without synthetic images. Across three different finegrained classification datasets, we find that the learned representation space generated by the UniCon loss function leads to significantly improved downstream classification performance in comparison to supervised contrastive learning baselines.
Improving the Efficiency of Clinical Trial Recruitment Using an Ensemble Machine Learning to Assist With Eligibility Screening

Tianrun Cai, Fiona Cai, Kumar P. Dahal, and 7 more authors

ACR Open Rheumatology, Jul 2021

Abs HTML

Objective Efficiently identifying eligible patients is a crucial first step for a successful clinical trial. The objective of this study was to test whether an approach using electronic health record (EHR) data and an ensemble machine learning algorithm incorporating billing codes and data from clinical notes processed by natural language processing (NLP) can improve the efficiency of eligibility screening. Methods We studied patients screened for a clinical trial of rheumatoid arthritis (RA) with one or more International Classification of Diseases (ICD) code for RA and age greater than 35 years, from a tertiary care center and a community hospital. The following three groups of EHR features were considered for the algorithm: 1) structured features, 2) the counts of NLP concepts from notes, 3) health care utilization. All features were linked to dates. We applied random forest and logistic regression with least absolute shrinkage and selection operator penalty against the following two standard approaches: 1) one or more RA ICD code and no ICD codes related to exclusion criteria (ScreenRAICD1+EX) and 2) two or more RA ICD codes (ScreenRAICD2). To test the portability, we trained the algorithm at one institution and tested it at the other. Results In total, 3359 patients at Brigham and Women’s Hospital (BWH) and 642 patients at Faulkner Hospital (FH) were studied, with 461 (13.7%) eligible patients at BWH and 84 (13.4%) at FH. The application of the algorithm reduced ineligible patients from chart review by 40.5% at the tertiary care center and by 57.0% at the community hospital. In contrast, ScreenRAICD2 reduced patients for chart review by 2.7% to 11.3%; ScreenRAICD1+EX reduced patients for chart review by 63% to 65% but excluded 22% to 27% of eligible patients. Conclusion The ensemble machine learning algorithm incorporating billing codes and NLP data increased the efficiency of eligibility screening by reducing the number of patients requiring chart review while not excluding eligible patients. Moreover, this approach can be trained at one institution and applied at another for multicenter clinical trials.
PheProb: probabilistic phenotyping using diagnosis codes to improve power for genetic association studies

Jennifer A Sinnott, Fiona Cai, Sheng Yu, and 4 more authors

Journal of the American Medical Informatics Association, May 2018

Abs HTML

Objective: Standard approaches for large scale phenotypic screens using electronic health record (EHR) data apply thresholds, such as ≥2 diagnosis codes, to define subjects as having a phenotype. However, the variation in the accuracy of diagnosis codes can impair the power of such screens. Our objective was to develop and evaluate an approach which converts diagnosis codes into a probability of a phenotype (PheProb). We hypothesized that this alternate approach for defining phenotypes would improve power for genetic association studies. Methods: The PheProb approach employs unsupervised clustering to separate patients into 2 groups based on diagnosis codes. Subjects are assigned a probability of having the phenotype based on the number of diagnosis codes. This approach was developed using simulated EHR data and tested in a real world EHR cohort. In the latter, we tested the association between low density lipoprotein cholesterol (LDL-C) genetic risk alleles known for association with hyperlipidemia and hyperlipidemia codes (ICD-9 272.x). PheProb and thresholding approaches were compared. Results: Among n = 1462 subjects in the real world EHR cohort, the threshold-based p-values for association between the genetic risk score (GRS) and hyperlipidemia were 0.126 (≥1 code), 0.123 (≥2 codes), and 0.142 (≥3 codes). The PheProb approach produced the expected significant association between the GRS and hyperlipidemia: p = .001. Conclusions: PheProb improves statistical power for association studies relative to standard thresholding approaches by leveraging information about the phenotype in the billing code counts. The PheProb approach has direct applications where efficient approaches are required, such as in Phenome-Wide Association Studies.
On-Body Piezoelectric Energy Harvesters through Innovative Designs and Conformable Structures

Sara V. Fernandez, Fiona Cai, Sophia Chen, and 7 more authors

ACS Biomaterials Science & Engineering, Nov 2021