Introduction

Apache Spark

Spark is a fast, general-purpose engine for large-scale data processing. You write a Driver Program containing the script that tells Spark what to do with your data, and Spark builds a Directed Acyclic Graph (DAG) to optimize the workflow. With a massive dataset, the work can be processed concurrently across multiple machines.
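To make that concrete, here is a minimal sketch of a driver program in Scala: the classic word count. It assumes a local Spark installation and a hypothetical input file at `data/input.txt`; the paths and app name are just placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // The driver builds a SparkContext; "local[*]" uses all local cores.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Each transformation below only adds a node to the DAG;
    // nothing actually runs until an action (saveAsTextFile) is called.
    val counts = sc.textFile("data/input.txt")      // hypothetical path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("data/output")            // hypothetical path
    sc.stop()
  }
}
```

Notice that the transformations (`flatMap`, `map`, `reduceByKey`) are lazy; Spark only schedules work across the cluster when the final action runs.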

Let's take a second to discuss the components of Spark:

  • Spark Core - All functionality is built on this layer (task scheduling, memory management, fault recovery, and interaction with storage systems). It also exposes the API that defines RDDs, which we will discuss later. Four modules sit on top of it:
    • Spark Streaming - Processing of live data streams, such as log files. Its APIs are similar to the RDD API.
    • MLlib - Scalable machine learning library
    • Spark SQL - Library for working with structured data. Supports Hive, Parquet, JSON, CSV, etc. (see the sketch after this list).
    • GraphX - API for graphs and graph-parallel computation. Clustering, classification, traversal, searching, and path-finding are possible on the graph. We'll come back to this much later on.
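As a taste of Spark SQL, here is a minimal sketch that reads structured data and queries it with plain SQL. It assumes a hypothetical file `data/people.json` containing records like `{"name": "Ada", "age": 36}`; the file path, view name, and query are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlExample")
      .master("local[*]")
      .getOrCreate()

    // Spark SQL infers the schema from the JSON records.
    val people = spark.read.json("data/people.json")  // hypothetical path
    people.createOrReplaceTempView("people")

    // Plain SQL over the data; familiar ground if you already know SQL.
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```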

Why Scala?

You can use Spark with many languages, primarily Python, R, Java, and Scala. I like Scala because it's a functional, type-safe, JVM-friendly language. Also, since Spark itself is written in Scala, running scripts in any other language adds a slight overhead.

Besides a knowledge of programming, familiarity with SQL will make Spark very easy to learn.