In this talk, I discuss how deep learning can statistically outperform shallow methods such as kernel ridge regression. First, I present excess risk bounds for deep learning over function classes such as Besov spaces, and show that sparsity and the non-convex geometry of the target function class play an essential role in characterizing the superiority of deep learning. In particular, deep learning attains better performance for high-dimensional (or infinite-dimensional) inputs. In the latter half, I discuss the optimization of neural networks and its impact on statistical performance. I consider optimization methods in a mean-field regime based on gradient Langevin dynamics and show that they achieve globally optimal solutions with convergence rate guarantees. It is shown that optimization in the mean-field regime yields adaptivity and statistical superiority of deep learning over linear estimators.
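As a rough illustration of the kind of dynamics referred to above, the sketch below implements a single plain gradient Langevin dynamics step (a gradient step plus Gaussian noise scaled by an inverse temperature); it is a generic textbook update under my own choice of names and parameters, not the specific mean-field algorithm analyzed in the talk.

```python
import numpy as np

def langevin_step(theta, grad_loss, eta=1e-2, beta=1e3, rng=None):
    """One gradient Langevin dynamics update (hypothetical illustrative helper):

        theta <- theta - eta * grad L(theta) + sqrt(2 * eta / beta) * xi,

    where xi ~ N(0, I), eta is the step size, and beta is the inverse
    temperature controlling the noise level.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = np.sqrt(2.0 * eta / beta) * rng.standard_normal(np.shape(theta))
    return theta - eta * grad_loss(theta) + noise

if __name__ == "__main__":
    # Toy usage: minimize the quadratic loss L(theta) = ||theta||^2 / 2,
    # whose gradient is simply theta. Iterating the noisy update drives
    # theta toward a neighborhood of the minimizer at the origin.
    rng = np.random.default_rng(0)
    theta = np.array([5.0])
    for _ in range(3000):
        theta = langevin_step(theta, lambda t: t, eta=0.01, beta=1000.0, rng=rng)
    print(theta)
```

For large beta the injected noise is small and the iterates concentrate near a minimizer; the talk's point is that in the mean-field regime such noisy dynamics come with global convergence guarantees rather than only local ones.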