Intel high-performance analytics toolkit and dataframes

HPAT: High Performance Analytics with Scripting Ease-of-Use
E. Totoni, T. A. Anderson, T. Shpeisman

HiFrames: High Performance Data Frames in a Scripting Language
E. Totoni, W. U. Hassan, T. A. Anderson, T. Shpeisman

Long ago, in the ivory tower of high-performance computing, massive parallel computing tasks were carried out by complex, specialized programs written in low-level languages such as C or Fortran. Google’s MapReduce paradigm, published in 2004, democratized distributed computing, providing a simple model accessible to data wranglers without domain expertise in parallel and distributed programming.

The MapReduce paradigm deals with problems that can be formulated as coordinate-wise transforms, sorting, and consolidation of key-value pairs, and is ill-suited for data analytics tasks that require frequent exploratory interactions with the dataset. This led to the development of interactive, in-memory data processing tools such as Google Pregel and Apache Spark. Such tools are often built atop existing MapReduce clusters such as Apache Hadoop, and the many layers introduce substantial overhead, resulting in processing speeds several orders of magnitude slower than parallel programs written in low-level languages.
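To make the paradigm concrete, here is a minimal word-count sketch of the map/shuffle/reduce pattern in plain Python (illustrative only, not Google's API). A single pass like this fits the model well; an exploratory or iterative analysis would have to re-run the entire pipeline for every step, which is the mismatch described above.

```python
from collections import defaultdict
from functools import reduce

def map_phase(documents):
    """Mappers emit (key, value) pairs: here, (word, 1) per occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """The framework groups values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducers fold each group: here, summing the counts per word."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))  # counts["fox"] == 2
```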

An outstanding challenge, then, is to develop a productive, interactive data analytics tool with high performance. Intel’s High-Performance Analytics Toolkit (HPAT) takes a stab at the challenge by compiling high-level scripting syntax—HPAT uses Julia—down to high-performance, low-level code, e.g., OpenMP/MPI. Automatic parallelization and optimization are achieved by assuming that “the map/reduce parallel pattern inherently underlies the target application domain. … distribution of parallel vectors and matrices is mainly one-dimensional (1D) for analytics tasks … the data-parallel computations are in the form of high-level matrix/vector computations or comprehensions.”
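A hypothetical Python analogue of the pattern those assumptions describe (the names and the serial emulation here are mine, not HPAT's API): a high-level reduction over a vector, where a compiler could recognize the per-block computation as a map and the final sum as a reduction, distribute the vector in 1D blocks across ranks, and emit MPI code.

```python
def block_distribute(data, num_ranks):
    """Split a vector into contiguous 1D blocks, one per (simulated) rank."""
    n = len(data)
    block = (n + num_ranks - 1) // num_ranks
    return [data[i * block:(i + 1) * block] for i in range(num_ranks)]

def parallel_sum_of_squares(data, num_ranks=4):
    # map: each rank computes a local partial result over its own block
    partials = [sum(x * x for x in chunk)
                for chunk in block_distribute(data, num_ranks)]
    # reduce: combine the partials (an MPI_Allreduce in generated code)
    return sum(partials)

total = parallel_sum_of_squares([1.0, 2.0, 3.0, 4.0])  # 30.0
```

The point is that the programmer writes only the comprehension-style scripting code; the 1D block distribution and the map/reduce split are inferred, not annotated.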

Building on HPAT, HiFrames provides an alternative to fast but non-distributed dataframes such as Python Pandas and distributed but non-array-like dataframes such as Spark SQL. Unlike Spark’s resilient distributed datasets, which underlie Spark SQL, HiFrames does not provide fault tolerance, with the justification that “the portion of [the research group’s] target programs with relational operations [is] significantly shorter than the mean time between failure (MTBF) of moderate-sized clusters … in practice most clusters consist of 30-60 machines which is a scale at which fault tolerance is not a big concern.” The omission of fault tolerance allows a significant boost in performance. HiFrames compiles Julia-style code down to HPAT, with additional relational optimizations.
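For a sense of the workload, here is an illustrative sketch (not HiFrames' actual API) of the three relational operations such a system compiles and optimizes—a filter, a hash join, and a grouped aggregate—written over rows represented as plain Python dicts:

```python
from collections import defaultdict

orders = [
    {"cust": 1, "amount": 10.0},
    {"cust": 2, "amount": 25.0},
    {"cust": 1, "amount": 5.0},
]
customers = [{"id": 1, "region": "east"}, {"id": 2, "region": "west"}]

# filter: keep orders at or above a threshold
big = [o for o in orders if o["amount"] >= 5.0]

# join: build a hash table on the smaller relation, then probe it
region_of = {c["id"]: c["region"] for c in customers}
joined = [{**o, "region": region_of[o["cust"]]} for o in big]

# aggregate: total amount per region
totals = defaultdict(float)
for row in joined:
    totals[row["region"]] += row["amount"]
```

In a dataframe system these three steps would each be a one-line call; the relational optimizations mentioned above amount to fusing and reordering such steps before generating low-level parallel code.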

The papers claim that “HPAT is 369x to 2033x faster than Spark on the Cori supercomputer and 20x to 256x times on Amazon AWS,” and that “HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations, and can be up to 20,000x faster for advanced analytic operations.”

Update, 6/22/2017: Ehsan Totoni informs me that HPAT is now implemented in Python as well. Here’s an excerpt from his email:

“We actually moved to Python recently, which will help attract users as you mentioned. … Unfortunately, there is no documentation yet since development started very recently. It requires an MPI installation, and this branch of Numba. … we are contributing our automatic shared-memory parallelization work to Numba. It’s in the development branch and will be released soon.”