MapReduce is an algorithm for processing large amounts of data very quickly. It splits the data into pieces, gives each piece to a separate computer, lets the computers do their jobs in parallel, and then gathers and summarizes the results.
The name MapReduce combines the two main stages of the algorithm: map, which hands each set of data to a machine for processing, and reduce, which summarizes the many results into a few.
This method is a great timesaver: 10 machines, each working on its own part of the job at the same time, can get the result up to about ten times faster than one machine! In practice the gain is a little smaller, because splitting the data and gathering the results also take time.
The algorithm was introduced by two Google engineers, Jeffrey Dean and Sanjay Ghemawat, back in 2004 and helped Google index billions of web pages across the World Wide Web.
How does it work?
Let’s take a very simple example. In one version of Cinderella, the Witchy Stepmom mixed a bucket of wheat and rice and told Cindy she could only go to the Masquerade after sorting the grains and counting each kind separately. It would take Cindy a week to do that on her own, but a dozen little mice came to the rescue. This is how they worked:
|INPUT||Cindy takes the mixture of wheat and rice and divides it into bunches: 1, 2… N|
|MAP||She gives one bunch to one mouse: Mouse 1, Mouse 2… Mouse N|
|PROCESS||All the mice do their job in parallel, separating grains and counting them|
|REDUCE||Cindy combines the mice’s sorted piles of wheat and rice and sums the many (N) answers into two totals: one for wheat, one for rice|
MapReduce explained in the picture:
That’s it! Of course, in real computing there can be hundreds of data bunches (sets) and machines doing the processing, but the approach remains the same.
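For readers who like to see it in code, the four steps in the table above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names (`split_into_bunches`, `count_bunch`, `map_reduce_grains`) and the bucket contents are made up for this example, and threads on one computer stand in for the mice, whereas a real MapReduce system spreads the work across many separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_bunches(grains, n_bunches):
    """INPUT: divide the mixed grains into n_bunches roughly equal bunches."""
    return [grains[i::n_bunches] for i in range(n_bunches)]


def count_bunch(bunch):
    """PROCESS: one mouse sorts its bunch and counts each kind of grain."""
    return Counter(bunch)


def map_reduce_grains(grains, n_mice=12):
    bunches = split_into_bunches(grains, n_mice)
    # MAP: hand one bunch to each mouse; the mice work concurrently.
    # (Threads are a stand-in here; real systems use separate machines.)
    with ThreadPoolExecutor(max_workers=n_mice) as mice:
        partial_counts = list(mice.map(count_bunch, bunches))
    # REDUCE: add the N partial answers into one total per kind of grain.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


# A hypothetical bucket: 700 grains of wheat mixed with 500 grains of rice.
bucket = ["wheat"] * 700 + ["rice"] * 500
totals = map_reduce_grains(bucket)
print(totals["wheat"], totals["rice"])  # prints: 700 500
```

Each mouse returns its own small tally, and the reduce step simply adds those tallies together, so it does not matter how many mice there are or which grains each one received.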