Some guys at Google came up with the idea of specifying a whole class of problems as two-stage processes.
Step 1: Map (write a mapping function that takes a key/value pair and maps it to a new set of key/value pairs)
** do this for every key/value pair **
Step 2: Reduce (for each of the new keys, take all of its values and reduce them down to a single value)
Any problem that can be broken down into these two steps parallelizes really easily and scales just by adding more boxes.
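Concretely, the whole contract fits in a few lines of Python. This is just a minimal single-machine sketch; `map_reduce`, `map_fn`, and `reduce_fn` are names I've made up here, not anything from the paper:

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Stage 1: map every input pair to intermediate pairs, grouping
    # the new values by their new key as we go (the 'shuffle' step).
    groups = defaultdict(list)
    for key, value in inputs:
        for new_key, new_value in map_fn(key, value):
            groups[new_key].append(new_value)
    # Stage 2: reduce each key's list of values down to a single value.
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```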
An example might be 'count word occurrences on the web':
Map: [url, document] => for each occurrence of a word in the document at url, emit a [word, 1] pair
Reduce: [word, values] => the number of items in the value list 'values', i.e. that word's total count
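Plugged into the sketch above, word count only needs a couple of lines per stage (the URLs and documents here are made-up examples):

```python
import re

def word_count_map(url, document):
    # Emit a (word, 1) pair for every word occurrence in the document.
    for word in re.findall(r"\w+", document.lower()):
        yield (word, 1)

def word_count_reduce(word, counts):
    # All of a word's 1s arrive together; summing them gives its count.
    return sum(counts)

pages = [("http://a.example", "the cat sat"),
         ("http://b.example", "the cat ran")]
print(map_reduce(pages, word_count_map, word_count_reduce))
# => {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```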
Anyway... lots of problems can be defined this way, and the code that runs your map/reduce functions for you handles all sorts of headaches on your behalf: networking issues, machine loading, failures, etc...
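To see why the scaling claim holds, here's the same driver with the map calls farmed out to a process pool. This is a toy stand-in for a cluster, assuming module-level (picklable) map functions; a real framework also distributes the reduce stage and survives machine failures, which this doesn't:

```python
from collections import defaultdict
from multiprocessing import Pool

def _apply_map(args):
    # Module-level helper so the pool can pickle it and ship work out.
    map_fn, key, value = args
    return list(map_fn(key, value))

def parallel_map_reduce(inputs, map_fn, reduce_fn, workers=4):
    # The map stage is embarrassingly parallel: no input pair depends
    # on any other, so each one can go to any worker (or any box).
    # (Run behind an `if __name__ == "__main__":` guard on macOS/Windows.)
    with Pool(workers) as pool:
        mapped = pool.map(_apply_map, [(map_fn, k, v) for k, v in inputs])
    # Shuffle and reduce happen once the mapped pairs come back.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

The point is that nothing in the map or reduce functions has to change to go parallel.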
Blaine Cook wrote a version in Ruby here. Or see mapreducerb@github.
I can't wait to have a play with this.