The Battle of Two Cultures: Statistics versus (?) Data Science
The title of my talk is inspired by Leo Breiman's seminal 2001 paper, "Statistical Modeling: The Two Cultures", in which Breiman describes the intellectual tension between the classical stochastic modeler and the algorithmic modeler. The shock that "data science" has injected into the world of statistics has been palpable. The next generation of students perhaps finds the job title "data scientist" more exciting than that of a good old statistician. In this talk, I will try to share the "joy" (and associated anxiety) of being a classically trained statistician at a time when our science and society are undergoing an unprecedented information/data revolution. I will discuss statistical challenges and opportunities in the joint analysis of electronic health records and genomic data through phenome-wide association studies (PheWAS). I will posit a modeling framework that helps us understand the effects of both selection bias and outcome misclassification in assessing genetic associations across the medical phenome. To illustrate the analytic framework, I will use data from the UK Biobank and the Michigan Genomics Initiative, a longitudinal biorepository at Michigan Medicine launched in 2012. The examples show that understanding sampling design and selection bias matters for big data and is at the heart of doing good science with data. This is joint work with Lauren Beesley and Lars Fritsche at the University of Michigan.