Mikhail Bilenko, Tom Finley, Shon Katzenberger, Sebastian Kochman, Dhruv Mahajan, Shravan Narayanamurthy, Julia Wang, Shizhen Wang, Markus Weimer

Abstract

Most existing frameworks for distributed machine learning are either tied to a specific data platform, or focus on novel computational and communication abstractions. The latter often neglect the constraints of shared-use clusters, such as fault tolerance, fair resource (network, CPU) usage, and isolation. This paper proposes a new distributed ML framework, SALMON, that abstracts the key components (control flow, partitioned data store, group communication) and relies only on above-resource-manager platform dependencies (via Apache REEF). The resulting framework is both expressive for common ML algorithm patterns (e.g., iterative MapReduce and parameter server), and flexible to operate on a variety of conventional, shared-use platforms (e.g., Apache Hadoop and HPC). Early experiments demonstrate the promise of this approach via comparisons with Apache Spark on a large-scale production dataset.

Download PDF

BibTeX

@inproceedings{chun2015dolphin,
  title={Salmon: Towards Production-Grade, Platform-Independent Distributed ML},
  author={Mikhail Bilenko, and Tom Finley, and Shon Katzenberger, and Sebastian Kochman, and Dhruv Mahajan, and Shravan Narayanamurthy, and Julia Wang, and Shizhen Wang, and **Markus Weimer**},
  booktitle={The ML Systems Workshop at ICML},
  year={2016}
}