Understanding generalization in deep learning
One of the defining properties of deep learning is that models are chosen to have many more parameters than there are training examples and are then fit using multiple epochs of stochastic gradient descent (SGD), which can be viewed as approximate empirical risk minimization. In light of this capacity for overfitting, it is remarkable that SGD reliably returns solutions with low test error. A popular hypothesis is that SGD performs some type of implicit regularization. One roadblock to explaining empirical performance in terms of implicit regularization is that the sample complexity bounds offered by statistical learning theory are empirically vacuous when applied to networks learned by SGD in this "deep learning" regime. Logically, in order to explain generalization in terms of bounds, we need nonvacuous bounds.
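For concreteness, the setup can be sketched as follows (the notation is illustrative, not taken from the talk): given a training set $S = \{(x_i, y_i)\}_{i=1}^m$, a network $f_w$ with parameters $w$, a loss $\ell$, and step sizes $\eta_t$, SGD approximately minimizes the empirical risk via noisy gradient steps,

\[
\hat{L}_S(w) \;=\; \frac{1}{m} \sum_{i=1}^{m} \ell\big(f_w(x_i), y_i\big),
\qquad
w_{t+1} \;=\; w_t - \eta_t \, \nabla_w \ell\big(f_{w_t}(x_{i_t}), y_{i_t}\big),
\]

where the index $i_t$ is chosen at random at each step (in practice, by cycling through shuffled minibatches over multiple epochs).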
I will discuss recent work using so-called PAC-Bayesian bounds and nonconvex optimization to arrive at nonvacuous generalization bounds for neural networks with millions of parameters trained on only tens of thousands of examples. We connect our findings to recent and older work on "flat minima" and MDL-based explanations of generalization, as well as to variational inference for deep learning. Time permitting, I'll discuss recent work on data-dependent priors for tighter generalization bounds.
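To fix ideas, one standard PAC-Bayesian bound (a form due to McAllester, stated here as background, not as the specific bound developed in the talk) reads: for any prior $P$ over parameters chosen before seeing the data and any $\delta \in (0,1)$, with probability at least $1 - \delta$ over an i.i.d. sample $S$ of size $m$, simultaneously for all posteriors $Q$,

\[
\mathbb{E}_{w \sim Q}\big[L(w)\big]
\;\le\;
\mathbb{E}_{w \sim Q}\big[\hat{L}_S(w)\big]
\;+\;
\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(m/\delta)}{2(m-1)}},
\]

where $L$ is the true risk and $\hat{L}_S$ the empirical risk. Obtaining a nonvacuous bound then amounts to finding a posterior $Q$ concentrated near the SGD solution whose KL divergence to the prior is small relative to $m$, which is where nonconvex optimization enters.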
Daniel Roy is an assistant professor in the Department of Statistical Sciences at the University of Toronto. Roy is a recent recipient of an Ontario Early Researcher Award and a Google Faculty Research Award. Prior to joining Toronto, Roy was a Research Fellow of Emmanuel College and a Newton International Fellow of the Royal Society and Royal Academy of Engineering, hosted by the University of Cambridge. Roy completed his doctorate in Computer Science at the Massachusetts Institute of Technology, where his dissertation was awarded the MIT EECS Sprowls Award, given to the top theses in computer science that year.