Let’s take a closer look on Apache Spark shell. Its main mission is to represent Spark API and its opportunities.
People do not usually switch to Spark-shell in scope of Big Data tasks. But it is still a powerful tool for data processing. To install Spark use our guide.
As we mentioned in previous lesson, you can start the shell by typing:
In this lesson we will only work with scala CLI.
You should already know about RDD (Resilient Distributed Dataset). This is a collection of entries that can be created from HDFS or after transforming of another RDD. In our example we will create an RDD from the default readme file in the spark directory.
scala> val readmeFile = sc.textFile("README.md")
This code will parse the content of the file and save it into RDD that we called “readmeFile”. Note that sc is an instance of SparkContext class. The execution of this line will result in such kind of output:
readmeFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at textFile at <console>:22
Let’s recall transformations and actions from previous lessons. Now we need to calculate the number of entries and than take the first one. Try to execute this commands and observe the output:
It was the really basic RDD actions in Spark shell. Now let’s do some transformation stuff here. Next task is to filter this file for lines that contain the word “apache” (in lowercase) and print out the result RDD.
scala> val linesWithApache = readmeFile.filter(line => line.contains("apache"))
This will create the needed RDD inside the Spark shell. To print it out just use the method .collect() on this dataset.
Spark shell MapReduce
To summarize let’s make some serious computation. We will be calculating the number of each word used in README.md file. It is a typical task for Hadoop MapReduce called “Word Count”.
scala> import java.lang.Math
This command imports the Math class to the Spark-shell. The next steps are splitting words by spaces mapping each word and reducing grouping them by key.
scala> val wordCounts = readmeFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
A brief explanation of line from above…
1) .flatMap(function) is a transformation that is similar to map. But this time the argument function could return more than one element. The RDD returned from this mapping is flattened.
2) Then .map() function maps each entry of RDD to a key-value pair, where key is a word from readme file and key is always “1”.
3) Reduce phase is needed to group entries with same key and summarize their occurrences.
After all, when we execute wordCounts.collect() we will see all counts of each word.
These are the basics of Spark shell. It is a part of our Apache Spark Java Tutorial. To discover more read the official documentation.
In next article we will create a simple Java application with Spark.