Salmon: Towards Production-Grade, Platform-Independent Distributed ML
Mikhail Bilenko, Tom Finley, Shon Katzenberger, Sebastian Kochman, Dhruv Mahajan, Shravan Narayanamurthy, Julia Wang, Shizhen Wang, Markus Weimer
Abstract
Most existing frameworks for distributed machine learning are either tied to a specific data platform, or focus on novel computational and communication abstractions. The latter often neglect the constraints of shared-use clusters, such as fault tolerance, fair resource (network, CPU) usage, and isolation. This paper proposes a new distributed ML framework, SALMON, that abstracts the key components (control flow, partitioned data store, group communication) and relies only on above-resource-manager platform dependencies (via Apache REEF). The resulting framework is both expressive for common ML algorithm patterns (e.g., iterative MapReduce and parameter server), and flexible to operate on a variety of conventional, shared-use platforms (e.g., Apache Hadoop and HPC). Early experiments demonstrate the promise of this approach via comparisons with Apache Spark on a large-scale production dataset.
BibTeX
@inproceedings{chun2015dolphin,
title={Salmon: Towards Production-Grade, Platform-Independent Distributed ML},
author={Mikhail Bilenko, and Tom Finley, and Shon Katzenberger, and Sebastian Kochman, and Dhruv Mahajan, and Shravan Narayanamurthy, and Julia Wang, and Shizhen Wang, and **Markus Weimer**},
booktitle={The ML Systems Workshop at ICML},
year={2016}
}