Apache Spark is a fast distributed processing engine written in Scala, with APIs also available for Java and Python.
The core data structure of Spark is the Resilient Distributed Dataset (RDD), which provides fast, flexible in-memory processing while still supporting massive datasets and fault tolerance by running on top of Hadoop/HDFS (and, more recently, additional scalable backends, notably Cassandra).
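As a rough illustration of the RDD model, the following sketch (assuming the Spark shell's predefined SparkContext sc and an illustrative HDFS path) builds an RDD lazily, caches it in memory, and reuses it across two actions:

    // Build an RDD from a file on HDFS; nothing is read until an action runs.
    val lines = sc.textFile("hdfs:///data/sample.txt") // illustrative path
    val words = lines.flatMap(_.split("\\s+"))
    words.cache() // keep the partitions memory-resident once computed
    val total = words.count()             // first action: reads, computes, caches
    val unique = words.distinct().count() // second action: served from the cache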
Spark includes several modules on top of its core: this talk will focus on MLlib, the machine learning library built on Spark Core. MLlib leverages the memory-resident capabilities of Spark to enable fast implementations of iterative algorithms in ways that are not possible with traditional MapReduce on top of Hadoop.
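To make the iterative point concrete, here is a minimal MLlib sketch (again assuming sc, plus an illustrative path to comma-separated features) in which k-means repeatedly scans a cached RDD of vectors rather than re-reading from disk on every pass:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse one dense feature vector per line and cache the result:
    // k-means makes many passes, so caching pays off on every iteration.
    val data = sc.textFile("hdfs:///data/features.csv") // illustrative path
    val vectors = data
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()
    val model = KMeans.train(vectors, 3, 20) // k = 3 clusters, 20 max iterations
    model.clusterCenters.foreach(println)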
This talk will briefly describe the Spark Core and MLlib architectures and then switch focus to Scala implementations of examples in the following areas: Classification, Regression, and Clustering (if time allows: Collaborative Filtering and Feature Selection). Additionally, Spark SQL will be briefly discussed and then used to examine some of the results.
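As a taste of the Spark SQL portion, a minimal sketch (assuming Spark 1.3+ and the vectors and model values from the previous sketch) might register the cluster assignments as a temporary table and query them:

    import org.apache.spark.sql.SQLContext

    case class Assignment(cluster: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Score each vector with the trained model and expose the result to SQL.
    val assignments = vectors.map(v => Assignment(model.predict(v))).toDF()
    assignments.registerTempTable("assignments")
    sqlContext
      .sql("SELECT cluster, COUNT(*) AS size FROM assignments GROUP BY cluster")
      .show()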
This talk will assume a working knowledge of Scala functional programming methods and constructs.