Distributed and Scalable PCA in the Cloud

Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer and Vijay Narayanan

Abstract

Principal Component Analysis (PCA) is a popular technique with many applications. Recent randomized PCA algorithms scale to large datasets but face a bottleneck when the number of features is also large. We propose to mitigate this issue using a composition of structured and unstructured randomness within a randomized PCA algorithm. Initial experiments using a large graph dataset from Twitter show promising results. We demonstrate the scalability of our algorithm by implementing it on both Hadoop and a more flexible platform named REEF.
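As a rough illustration of what composing structured and unstructured randomness can look like, the sketch below runs a standard randomized range finder whose test matrix is the product of a sparse sign-hashing map (structured) and a small dense Gaussian matrix (unstructured). The abstract does not say which structured transform the authors use, so the hashing step, the function name `randomized_pca_composed`, and the parameters `m`, `k`, and `seed` are all assumptions made for this example. It runs on a single machine with NumPy and is not the Hadoop or REEF implementation.

```python
import numpy as np

def randomized_pca_composed(A, r, m, k, seed=0):
    """Approximate the top-r principal directions of A (n x d).

    Assumes the columns of A are already centered.
    The test matrix is Omega = S @ G, where
      S (d x m): sign-hashing map, one +/-1 entry per feature (structured),
      G (m x k): dense Gaussian matrix (unstructured).
    S is never formed explicitly; only O(d) hash data and the
    m x k Gaussian are stored instead of a dense d x k matrix.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Structured step: hash each feature to one of m buckets with a random sign.
    bucket = rng.integers(0, m, size=d)
    sign = rng.choice([-1.0, 1.0], size=d)
    AS = np.zeros((m, n))
    np.add.at(AS, bucket, (A * sign).T)  # accumulates rows of (A S)^T
    AS = AS.T                            # n x m

    # Unstructured step: dense Gaussian projection down to k columns.
    G = rng.standard_normal((m, k))
    Y = AS @ G                           # n x k sketch of the range of A

    # Orthonormal basis for the sketch, then a small SVD.
    Q, _ = np.linalg.qr(Y)               # n x k
    B = Q.T @ A                          # k x d
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:r], s[:r]
```

Calling `randomized_pca_composed(A, r=3, m=64, k=10)` returns an `r x d` array of approximate principal directions together with the matching approximate singular values. Choosing `m` well below `d` is what keeps the dense random matrix small when the number of features is large.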
