Silicon Valley Code Camp: October 11th and 12th, 2014

An Introduction to using Apache Spark for Machine Learning

Apache Spark is a fast distributed processing engine written in Scala that runs atop Hadoop. This talk will describe the Spark MLLib library and then dive into coding examples in Classification, Clustering, and Feature Selection.

About This Session

Apache Spark is a fast distributed processing engine written in Scala with additional support for Java and Python.

The core data structure of Spark is the Resilient Distributed Dataset (RDD), which provides fast, flexible in-memory processing while supporting massive datasets and fault tolerance by running on top of Hadoop/HDFS (and, more recently, additional scalable backends, notably Cassandra).
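To make the RDD idea concrete, here is a minimal sketch assuming the Spark 1.x (2014-era) Scala API and a local Spark installation; the object and app names are illustrative:

```scala
// Minimal RDD sketch, assuming Spark 1.x; requires Spark on the classpath.
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddExample").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // An RDD can be built from a local collection, or from HDFS via sc.textFile(...).
    val nums = sc.parallelize(1 to 1000000)

    // cache() keeps the computed partitions in memory, so repeated actions
    // avoid recomputation -- the property iterative algorithms rely on.
    val squares = nums.map(n => n.toLong * n).cache()
    println(squares.sum())

    sc.stop()
  }
}
```

If a partition is lost, Spark rebuilds it from the recorded lineage (here, the `parallelize` plus `map`), which is how RDDs get fault tolerance without replicating the data itself.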

Spark includes several modules on top of its Core; this talk will focus on Spark MLLib, a machine learning library built on Spark Core. MLLib leverages Spark's memory-resident capabilities to enable fast implementations of iterative algorithms in ways that are not possible with traditional Map/Reduce on top of Hadoop.
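As an illustration of such an iterative algorithm, the sketch below runs MLLib's KMeans on toy data, assuming the Spark 1.x `org.apache.spark.mllib` API; the data and parameter values are made up:

```scala
// Sketch of an iterative MLLib algorithm (KMeans), assuming Spark 1.x MLLib.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("KMeansSketch").setMaster("local[2]"))

    // Toy 2-D points; a real job would parse them from HDFS.
    // cache() matters here: KMeans scans the data once per iteration,
    // and a memory-resident RDD makes each pass cheap.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    )).cache()

    val model = KMeans.train(points, 2, 20) // k = 2 clusters, max 20 iterations
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```

With plain Map/Reduce, each of those 20 iterations would re-read the input from disk; keeping the cached RDD in memory is exactly the advantage described above.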

This talk will briefly describe the Spark Core and MLLib architectures and then focus on Scala-language implementations of examples in the following areas: Classification, Regression, and Clustering (if time allows: Collaborative Filtering and Feature Selection). Additionally, SparkSQL will be briefly discussed and then used to examine some of the results.
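Using SparkSQL to examine results might look like the following sketch, assuming the Spark 1.1-era `SQLContext` API; the `Prediction` case class, table name, and sample rows are hypothetical:

```scala
// Sketch of inspecting model output with SparkSQL, assuming Spark 1.1.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical result row for a classification run.
case class Prediction(id: Int, label: Double, predicted: Double)

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SqlSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion (Spark 1.x)

    val preds = sc.parallelize(Seq(
      Prediction(1, 1.0, 1.0),
      Prediction(2, 0.0, 1.0)))
    preds.registerTempTable("predictions")

    // Plain SQL over in-memory results, e.g. counting misclassifications.
    sqlContext.sql("SELECT COUNT(*) FROM predictions WHERE label <> predicted")
      .collect().foreach(println)

    sc.stop()
  }
}
```

The appeal is that once the model output is registered as a table, familiar SQL replaces hand-written RDD aggregations for ad-hoc inspection.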

This talk assumes a working knowledge of Scala functional programming methods and constructs.

Time: 9:45 AM Saturday    Room: 5001 

The Speaker(s)


Stephen Boesch

Scala/Spark/Machine Learning Developer, Intuit

I am a developer focusing on scalable data pipelines and machine learning applications on Spark and Hadoop.