What is Apache Spark?


The Apache Spark project is growing at a remarkable pace, and it has quickly become a trending buzzword in the Big Data field.

So, what is Apache Spark? First of all, it is a fast and efficient framework for processing big data.

About Apache Spark

Since 2014, when it became a top-level Apache project (licensed under Apache 2.0), Spark has been evolving rapidly. Today it consists of a number of modules – Spark Core, Spark Streaming, MLlib (Machine Learning), Spark SQL, GraphX and others.

Apache Spark is used for executing complex analytics tasks in parallel. It is written in Scala and provides APIs in Scala, Java, Python and R. It has no storage layer of its own; instead, it is configured to work with existing storage systems – HDFS, HBase, JDBC sources, etc.
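For example, here is a minimal sketch of pointing Spark at existing storage (the application name, master URL and HDFS path below are placeholders, not values from this article):

import org.apache.spark.{SparkConf, SparkContext}

// Configure and create the entry point to the cluster (SparkContext is covered at the end of this article)
val conf = new SparkConf().setAppName("storage-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Spark has no storage of its own, so it reads from an existing system such as HDFS
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")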

Apache Spark RDD

The central concept in Spark is the RDD – Resilient Distributed Dataset. An RDD can be operated on in two ways – transformations and actions. A transformation of an RDD usually results in another RDD: a new dataset is created once the first one is transformed. Actions, on the other hand, return values or export data; they are used for things such as writing an RDD to disk, printing it out, etc.
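A minimal sketch of the difference, assuming a SparkContext called sc is already available (the sample numbers are only illustrative):

// Transformations return a new RDD and are evaluated lazily
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))  // source RDD
val doubled = numbers.map(n => n * 2)             // transformation – produces a new RDD

// Actions trigger the computation and return a value or write data out
println(doubled.count())                // action – returns 5
doubled.saveAsTextFile("/tmp/doubled")  // action – writes the RDD to disk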


RDD Transformations

.map(function) – applies the argument function to every entry of the dataset;

.filter(function) – returns all entries for which argument function returned true;

.distinct() – returns a dataset containing the distinct entries of the original one.

It also supports methods for working with sets – .union(otherDataset), .intersection(otherDataset), etc. (see the sketch below).
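A minimal sketch of these transformations (the sample words are made up for illustration; sc is the SparkContext described at the end of this article):

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hdfs"))
val other = sc.parallelize(Seq("hdfs", "hbase"))

val lengths  = words.map(w => w.length)         // length of every entry
val longOnes = words.filter(w => w.length > 4)  // only entries for which the function returned true
val unique   = words.distinct()                 // "spark", "hadoop", "hdfs"
val all      = words.union(other)               // union of the two datasets
val common   = words.intersection(other)        // "hdfs"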

RDD Actions

.saveAsTextFile(path) – saves data into a text file;

.count() – returns the number of elements in the RDD;

.reduce(function) – aggregates the entries of a dataset using an argument function;

.collect() – returns all entries as an array;

.first() – returns the first element of the dataset;

.take(n) – returns the first n entries of the RDD (see the sketch below).
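A minimal sketch that puts these actions together (the numbers and the output path are only illustrative):

val numbers = sc.parallelize(Seq(3, 1, 4, 1, 5, 9))

println(numbers.count())                  // 6
println(numbers.reduce((a, b) => a + b))  // 23 – aggregates the entries with the given function
println(numbers.collect().mkString(", ")) // all entries as an array
println(numbers.first())                  // 3
println(numbers.take(2).mkString(", "))   // 3, 1
numbers.saveAsTextFile("/tmp/numbers")    // writes the RDD out as text, one part file per partition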

Spark provides APIs for the major supported programming languages. The entry point for working with a cluster is a class called SparkContext. For a first introduction there is the Spark shell (spark-shell) – a command-line interface for communicating with Spark. It comes with an already created SparkContext instance, available as sc, so you can execute operations straight away.
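For example, a short word-count-style snippet typed into the shell might look like this (the input path is a placeholder; sc is the SparkContext the shell creates for you):

// start the shell from the command line with: spark-shell
val lines = sc.textFile("/tmp/input.txt")           // placeholder path
val words = lines.flatMap(line => line.split(" "))  // transformation – split lines into words
println(words.count())                              // action – number of words in the file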

In the next article of this Apache Spark tutorial you will learn how to install Spark, work with its shell and create applications with Spark.
