BIG DATA BASICS
The term Big Data has become very popular recently. But what lies behind its original meaning of “large and complex data sets”? In this article we cover the basics of Big Data.
Today the term implies a broader concept: a set of approaches, tools and methods for handling huge volumes of structured and unstructured data.
The concept of Big Data today is built on three Vs (Volume, Velocity and Variety), which mean:
- Volume – the amount of data,
- Velocity – the need to process information at high speed,
- Variety – the diversity of the data and its frequent lack of structure.
For instance, checking a card balance during a cash withdrawal has to complete in milliseconds. These are the requirements dictated by the market. The other side of the issue is the diversity and lack of structure of the data. Increasingly, there is demand for processing media content, blog entries, poorly structured documents, etc.
To address these challenges, there is a specific set of approaches and technologies. At the heart of Big Data approaches is distributed computing, where processing large amounts of data requires not one high-performance machine but a whole group of them: a cluster.
It is worth noting that Big Data technologies are designed to handle a continuous stream of data. This means they help you cope not only with high system load but also with an ever-increasing volume of data. In addition, with a suitable design, individual data operations can complete within milliseconds.
In the context of Big Data, you might have heard of concepts and tools such as MapReduce, Hadoop or HDFS. What are they?
MapReduce is a programming model (a pattern) that helps to build high-performance, scalable programs for processing large data sets. The basic idea is to map, sort and reduce data with a distributed algorithm running on a cluster.
With MapReduce, you can effectively handle data of different types and sizes. It is mostly used for sorting, filtering, collating, summing and counting, but is not limited to these. To understand MapReduce's range of application, just take a look at its most famous use: Internet search engines. For instance, with MapReduce patterns a search engine can count keywords on web pages to return the most relevant results for your query. You can read more about the MapReduce algorithm in our MapReduce explained article.
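To make the keyword-counting idea concrete, here is a minimal single-machine sketch of the map, sort and reduce steps in plain Python. It is an illustration only and does not use Hadoop: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample pages are made up for this example, and on a real cluster each phase would run in parallel on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(page_text):
    """Map: emit a (word, 1) pair for every word on one page."""
    return [(word.lower(), 1) for word in page_text.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle/sort: group all emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(pages):
    """Run the three phases in sequence over a list of page texts."""
    mapped = chain.from_iterable(map_phase(page) for page in pages)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical sample input: two tiny "web pages".
pages = ["big data basics", "big clusters process big data"]
counts = word_count(pages)
print(counts)
```

The map step only looks at one page at a time and the reduce step only looks at one word at a time, which is what lets a framework spread both steps across a cluster.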
Hadoop is a free, open-source set of tools, libraries and frameworks for developing and running distributed programs on clusters of hundreds or thousands of nodes. It is used for search and contextual mechanisms on many heavily loaded websites, including, for instance, Yahoo! and Facebook.
Hadoop has its own distributed file system, HDFS (Hadoop Distributed File System), which stores large files in blocks distributed across the nodes of a cluster. It is designed to store large amounts of data and to serve it to many distributed clients across a network. A big advantage of HDFS is its reliability: each block is stored as several copies (replicas) on different nodes. For example, if the IT department accidentally destroyed 50% of our servers while relocating equipment, the irretrievable data loss could be as low as about 3%, depending on how many replicas of each block are kept.
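A rough way to see where a figure like this comes from is to estimate the chance that every replica of a block happened to sit on a destroyed server. The sketch below is a back-of-the-envelope model, not HDFS's actual placement logic: it assumes replicas land on servers independently and at random, ignores rack-aware placement and re-replication, and the function name is made up for this example.

```python
def all_replicas_lost_probability(failed_fraction, replication_factor):
    """Chance that every replica of one block was on a failed server.

    Simplifying assumption: each replica is placed on a server
    independently and uniformly at random, so the chance that one
    replica is lost equals the fraction of servers that failed.
    """
    return failed_fraction ** replication_factor


# Half of the servers are destroyed; compare different replica counts.
for replicas in (1, 3, 5):
    lost = all_replicas_lost_probability(0.5, replicas)
    print(f"{replicas} replica(s): about {lost:.3f} of blocks lost")
```

Under these assumptions, HDFS's default of three replicas gives a loss of 0.5 ** 3 = 12.5% of blocks, while a loss of roughly 3% corresponds to five replicas (0.5 ** 5 ≈ 3.1%). Real clusters will differ, but the trend holds: each extra replica sharply reduces the expected loss.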
IT infrastructure is growing fast nowadays. Thanks to open-source solutions, there are many basic Big Data tools that simplify data processing and make it much quicker, among them Apache Spark, YARN and Flink. There are also many vendors that distribute their own data platforms, such as Hortonworks, Cloudera and MapR.
That’s all on Big Data basics for today. Stay tuned for more updates on new frameworks, features and products.