MapReduce is a programming model for distributed computation. The term is widely used in the Big Data field. Large IT companies such as Google, Yahoo! and Amazon use it to process large datasets…
MapReduce is a significant part of the Hadoop ecosystem, since Hadoop tasks are implemented as MapReduce jobs. It is also used in NoSQL databases such as MongoDB and CouchDB.
There are many different areas where MapReduce can be applied. Along the way, the community has faced a lot of recurring challenges and has come up with design patterns to simplify development. Here we describe three basic groups of patterns: filtering, summarization and structural.
The filtering pattern is used when we need to go through the dataset and choose a subset that matches some criteria. Here are some typical cases of this pattern:
– simple filter – when you simply reduce the input according to a condition (example: select only users who are older than 18);
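– Bloom filter – when you keep only the records whose key probably belongs to a predefined set, using a compact probabilistic membership test;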
– top-N list – when you need to form a top-N list from a dataset (like finding the 10 most popular pages of your site from a web-server log file).
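To make the filtering cases more concrete, here is a minimal in-process sketch in plain Python (no Hadoop involved). The sample data and the function names `filter_map`, `top_n_map` and `top_n_reduce` are made up for illustration. The top-N part assumes each page appears in only one input split with its hit count already aggregated; otherwise a counting job would have to run first.

```python
import heapq

# Hypothetical sample data, invented for this illustration.
USERS = [
    {"name": "ann", "age": 17},
    {"name": "bob", "age": 25},
    {"name": "eve", "age": 42},
]

def filter_map(record):
    """Map-only filter: emit the record if it passes the condition."""
    if record["age"] > 18:
        yield record

def adults(records):
    """Run the filter mapper over all records; no reducer is needed."""
    return [out for rec in records for out in filter_map(rec)]

def top_n_map(split, n):
    """Mapper: keep only the n largest (page, hits) pairs of one split."""
    return heapq.nlargest(n, split, key=lambda pair: pair[1])

def top_n_reduce(local_tops, n):
    """Single reducer: merge per-split candidates into the global top n."""
    merged = [pair for top in local_tops for pair in top]
    return heapq.nlargest(n, merged, key=lambda pair: pair[1])

if __name__ == "__main__":
    print([u["name"] for u in adults(USERS)])
    splits = [[("/a", 5), ("/b", 9)], [("/c", 7), ("/d", 1)]]
    print(top_n_reduce([top_n_map(s, 2) for s in splits], 2))
```

The point of the top-N sketch is that each mapper sends at most N candidates to the reducer, so the single reducer only has to merge a small amount of data.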
Summarization patterns also require a pass over the whole dataset, but this time the records are grouped by key and aggregated instead of being selected. Typical tasks are indexing and numerical summarization: creating an inverted index, counting words or phrases, finding minimum/maximum values of some field, etc.
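A word count and an inverted index can be sketched in a few lines. This is only an in-process imitation of a MapReduce run: the `run_job` helper, the mapper and reducer names and the sample documents are all assumptions made for this example, not part of any Hadoop API.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Tiny stand-in for a MapReduce run: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def word_count_map(doc):
    """Emit (word, 1) for every word occurrence in the document."""
    _doc_id, text = doc
    for word in text.lower().split():
        yield word, 1

def word_count_reduce(_word, counts):
    """Sum the ones emitted for a single word."""
    return sum(counts)

def index_map(doc):
    """Emit (word, doc_id) once per distinct word in the document."""
    doc_id, text = doc
    for word in set(text.lower().split()):
        yield word, doc_id

def index_reduce(_word, doc_ids):
    """Collect the sorted list of documents that contain the word."""
    return sorted(doc_ids)

# Hypothetical sample documents as (doc_id, text) pairs.
DOCS = [(1, "big data big ideas"), (2, "data pipelines")]

if __name__ == "__main__":
    print(run_job(DOCS, word_count_map, word_count_reduce))
    print(run_job(DOCS, index_map, index_reduce))
```

Both jobs share the same skeleton; only the emitted value and the aggregation in the reducer differ.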
Structural patterns are used for migrating data from relational databases to Hadoop systems or for combining several datasets. For instance, we can use this pattern to combine information from a customers table and an orders table by the shared user ID key.
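The customers/orders example can be sketched as a reduce-side join, which is one common way (not the only one) to combine datasets in MapReduce. The tables, field names and function names below are hypothetical, and the sketch performs an inner join: users without orders, and orders without a matching customer, are dropped.

```python
from collections import defaultdict

# Hypothetical sample tables, invented for this illustration.
CUSTOMERS = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
ORDERS = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "lamp"},
]

def join_map(table, record):
    """Mapper: key each record by user ID and tag it with its source table."""
    yield record["user_id"], (table, record)

def join_reduce(user_id, tagged):
    """Reducer: pair every customer with every order for the same user ID."""
    customers = [rec for tag, rec in tagged if tag == "customers"]
    orders = [rec for tag, rec in tagged if tag == "orders"]
    for customer in customers:
        for order in orders:
            yield {"user_id": user_id, "name": customer["name"], "item": order["item"]}

def reduce_side_join(customers, orders):
    """In-process imitation of the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for table, records in (("customers", customers), ("orders", orders)):
        for record in records:
            for key, value in join_map(table, record):
                groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reduce(key, groups[key]))
    return joined

if __name__ == "__main__":
    for row in reduce_side_join(CUSTOMERS, ORDERS):
        print(row)
```

Tagging each record with its source table is what lets the reducer tell the two sides apart after the shuffle has mixed them under the same key.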
If you are interested in MapReduce, there is a great book called "MapReduce Design Patterns". It dives deep into all of these patterns with real examples.