Category: BIG DATA

BIG DATA

WHAT IS APACHE SPARK?

The Apache Spark project has grown rapidly and quickly became a buzzword in the Big Data field. So, what is Apache Spark? First of all, it is a fast and efficient framework for big data processing. ABOUT APACHE SPARK Since 2014, when it came under the Apache 2.0 license, Spark has kept evolving. Now it consists of […]

thaiphan 
BIG DATA

APACHE YARN BASICS

Apache YARN was born as a module of Hadoop 2.0. It is responsible for resource management and task scheduling. Previously, all of this was handled by MapReduce’s JobTracker. In YARN these responsibilities are split between the ResourceManager and the ApplicationMaster. YARN can work with MapReduce applications and with any distributed application that implements the corresponding API (Apache […]

thaiphan 
BIG DATA

WHAT IS APACHE AMBARI?

Apache Ambari is a web application that manages Big Data clusters. It lets you monitor a variety of metrics such as disk, network, and CPU usage on hosts, the number of live components, host states, and many other specific indicators. If you’re asking yourself “What does Ambari mean?”, Mahadev Konar (Hortonworks Co-Founder) explained the meaning in one […]

thaiphan 
BIG DATA

WHAT IS HORTONWORKS SANDBOX?

The Sandbox is a virtual machine image with Hortonworks Data Platform already installed. Its main purpose is to showcase common tools and frameworks of the Big Data world. In its HUE (Hadoop User Experience) interface you can load data sets into HDFS and manipulate them using Pig and Hive. It’s one of the easiest ways to […]

thaiphan 
BIG DATA

MAPREDUCE EXPLAINED

MapReduce is an algorithm for processing large amounts of data very quickly. It splits the data into pieces and gives each piece to a separate computer; the computers do their counting jobs in parallel, and then the algorithm gathers and summarizes the results. The name MapReduce is a combination of the two main stages in the algorithm: map for feeding a set […]
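The split, count in parallel, and gather flow described above can be sketched in plain Java without Hadoop. This is only a minimal single-machine illustration: the class name WordCountSketch and its methods are hypothetical, and a parallel stream stands in for the separate computers of a real cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine sketch of the MapReduce idea (not the Hadoop API).
class WordCountSketch {

    // Map stage: turn one piece of the input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce stage: group the pairs by word and sum the counts of each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    // Driver: map every chunk in parallel, gather all pairs, then reduce them.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> allPairs = chunks.parallelStream()
                .map(WordCountSketch::map)
                .flatMap(List::stream)
                .collect(Collectors.toList());
        return reduce(allPairs);
    }

    public static void main(String[] args) {
        // Each string plays the role of one piece handed to one computer.
        List<String> chunks = List.of(
                "to be or not to be",
                "that is the question",
                "to be is to do");
        System.out.println(wordCount(chunks));
    }
}
```

In a real cluster the map calls run on different machines and the framework moves the pairs between the stages; here both stages run in one JVM so that the two steps are easy to see side by side.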

thaiphan 
BIG DATA

FIXING LOGSTASH: (OPENSSL::X509::STOREERROR) SETTING DEFAULT PATH FAILED: THE TRUSTANCHORS PARAMETER MUST BE NON-EMPTY

When starting Logstash you may run into this error: org.jruby.exceptions.RaiseException: (OpenSSL::X509::StoreError) setting default path failed: the trustAnchors parameter must be non-empty This is not caused by Logstash directly; it indicates a problem with the security certificates in Java. The solution is very easy. To fix it you just need to remove the […]

thaiphan 
BIG DATA

SPARKEXCEPTION – JAVA.IO.NOTSERIALIZABLEEXCEPTION: JAVA.IO.PRINTSTREAM

While parsing a CSV dataset with Apache Spark, I decided to iterate over it and print every record to the console using System.out.println() inside the foreach method of a Dataset object. But all I got was exceptions: org.apache.spark.SparkException: Task not serializable and java.io.NotSerializableException: java.io.PrintStream FIXING NOTSERIALIZABLEEXCEPTION: JAVA.IO.PRINTSTREAM IN SPARK This only means that Spark tries to use the System.out object, which has […]
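The teaser does not show the original code, so the following is only a sketch of one common way this exception arises, reproduced with plain Java serialization and no Spark dependency. It assumes the function handed to foreach was a bound method reference such as System.out::println, which captures the PrintStream itself. SerializableConsumer and trySerialize are hypothetical names introduced for the illustration; they are not part of the Spark API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical demo: why a task that captures System.out cannot be serialized.
class SerializableLambdaSketch {

    // Stand-in for a serializable function type (an assumption, not Spark's own interface).
    interface SerializableConsumer<T> extends Consumer<T>, Serializable {}

    // Returns null if the task serializes, otherwise the name of the offending class.
    static String trySerialize(Object task) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(task);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Bound method reference: System.out is evaluated now and captured by the task.
        SerializableConsumer<String> capturing = System.out::println;

        // Lambda: System.out is looked up only when the body runs, so nothing is captured.
        SerializableConsumer<String> nonCapturing = record -> System.out.println(record);

        System.out.println("method reference failed on: " + trySerialize(capturing));
        System.out.println("lambda failed on: " + trySerialize(nonCapturing));
    }
}
```

Under this assumption, writing the body as a lambda that calls System.out.println(record) avoids the captured PrintStream. Keep in mind that on a cluster such output appears in the executor logs rather than on the driver console.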

thaiphan 
BIG DATA

BIG DATA BASICS

The term Big Data has become pretty popular recently. But what lies behind its original meaning of “large and complex data sets”? Here we will talk about Big Data basics. Nowadays the term implies a broader concept: a series of approaches, tools and methods for handling structured and unstructured data of huge […]

thaiphan 
BIG DATA

APACHE FLINK HELLO WORLD JAVA EXAMPLE

Apache Flink is a distributed streaming platform for big datasets. In this article we are going to show you a simple Hello World example written in Java. Flink has a flexible API for Java and Scala, which is what we will use. We will use Maven as the build tool for dependency management. You don’t need Hadoop or […]

thaiphan 
BIG DATA

ERROR “ORG.APACHE.SPARK.SPARKEXCEPTION: A MASTER URL MUST BE SET IN YOUR CONFIGURATION”

15/10/06 17:14:43 ERROR SparkContext: Error initializing SparkContext. org.apache.spark.SparkException: A master URL must be set in your configuration If you see this Spark exception in the output, it means you simply forgot to specify the master URL. You are probably running Spark locally. The most common mistake when running the application from an IDE is the absence […]

thaiphan