Statistical Methods for Big Stream Data
Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. In this talk, I will start with a brief overview of recent statistical developments for big data. The presentation will then focus on an online updating approach for big data streams, and in particular on online updating for survival analysis. When large amounts of survival data arrive in streams, conventional estimation methods become computationally infeasible because they require storage of all the risk sets at each accumulation point. We develop methods for carrying out survival analysis under the Cox proportional hazards model in an online-updating framework. Specifically, we propose online-updating estimators, together with their standard errors, for both the regression coefficients and the baseline hazard function. An extensive simulation study is conducted to investigate the empirical performance of the proposed estimators. A large colon cancer data set from the Surveillance, Epidemiology, and End Results (SEER) program is analyzed to demonstrate the utility of the proposed methodologies.