Alex Smola, Amr Ahmed, Markus Weimer


Scalable data analysis has come a long way since the introduction of the MapReduce paradigm a decade ago. In this tutorial we present algorithms for synchronous and asynchronous data processing. They are capable of dealing with the amounts of data typically available on the internet.

We give a brief description of the problems one faces when performing scalable machine learning on the internet. To motivate the discussion, we provide a number of scenarios from spam filtering, advertising, and collaborative filtering. This is followed by an extensive discussion of current and novel synchronous data processing techniques. In particular, we emphasize how insights from systems research and databases can be used to achieve significant improvements in both the expressiveness and the efficiency of the deployed algorithms.

We then describe asynchronous data analysis and inference methods. These are particularly necessary whenever the estimation problem requires a significant number of latent variables, as in clustering, topic models, or graph factorization. We provide numerous motivating examples and applications, ranging from user profiling to the analysis of communication networks. Special emphasis is placed on the approximations needed to scale algorithms to hundreds of millions of users and billions of documents.