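What is MapReduce?
MapReduce is a software framework proposed by Google for parallel computation over large datasets. The idea is simple: in the Map step, many machines work in parallel to turn all of the input into <Key, Value> pairs. Before the Reduce step, the system assigns keys to machines so that all pairs with the same key land on the same machine, for example simply by hash(key) % number of machines (this redistribution is called the shuffle). Each machine then runs the reduce operation on the data it received, merging and aggregating the values that share a key. The end result is one processed result per key.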
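The shuffle rule described above can be sketched in a few lines of Python. This is only a single-process illustration, not how a real cluster moves data: the names `partition` and `shuffle` are made up for this example, and `zlib.crc32` stands in for the hash function as an assumption, chosen because Python's built-in `hash()` for strings is randomized per process by default.

```python
import zlib
from collections import defaultdict


def partition(key, num_machines):
    """Pick a machine index for a key: hash(key) % number of machines.

    crc32 is an assumed stand-in for the hash; any stable hash would do.
    """
    return zlib.crc32(key.encode("utf-8")) % num_machines


def shuffle(pairs, num_machines):
    """Send each (key, value) pair to its machine and group values by key."""
    machines = [defaultdict(list) for _ in range(num_machines)]
    for key, value in pairs:
        machines[partition(key, num_machines)][key].append(value)
    return machines


# Pairs as they might come out of the Map step of a word count.
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1)]
machines = shuffle(pairs, 3)

for index, groups in enumerate(machines):
    print(index, dict(groups))
```

Whichever machine a key is assigned to, every value for that key ends up in the same group, which is what lets each reducer work independently.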
MapReduce processing steps
1. Input
2. Split
3. Map
4. Shuffle (transfer and sort)
5. Reduce
6. Output
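MapReduce function interface
Running a parallel computation with MapReduce really comes down to supplying a custom Map function and a custom Reduce function.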
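To make the interface concrete, here is a minimal local driver, assuming the same convention the Lintcode templates use: `mapper(self, _, value)` and `reducer(self, key, values)`, both yielding `key, value` pairs. The helper `run_job` and the `WordCount` job are hypothetical names invented for this sketch; a real framework would run the same two functions across many machines.

```python
from collections import defaultdict


def run_job(job, inputs):
    """Run job.mapper and job.reducer locally, in a single process."""
    groups = defaultdict(list)
    for value in inputs:
        # Map: each input record may yield any number of (key, value) pairs.
        for key, out in job.mapper(None, value):
            # Shuffle: collect every value that shares a key.
            groups[key].append(out)
    results = {}
    for key in sorted(groups):
        # Reduce: one call per key, with all of that key's values.
        for out_key, out_value in job.reducer(key, groups[key]):
            results[out_key] = out_value
    return results


class WordCount:
    """Hypothetical example job, not one of the Lintcode problems."""

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)


counts = run_job(WordCount(), ["a b a", "b c"])
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

Only the two methods change from job to job; the driver stays the same, which is the point of the interface.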
Lintcode problems
504 Inverted Index
Description
Implement an inverted index using MapReduce.
Python3 accepted (AC) code
'''
Definition of Document
class Document:
    def __init__(self, id, content):
        self.id = id
        self.content = content
'''
class InvertedIndex:

    # @param {Document} value is a document
    def mapper(self, _, value):
        # Write your code here
        # Please use 'yield key, value' here
        # Emit one (word, document id) pair per word in the document.
        for word in value.content.split():
            if word:
                yield word, value.id

    # @param key is from mapper
    # @param values is a set of value with the same key
    def reducer(self, key, values):
        # Write your code here
        # Please use 'yield key, value' here
        # Deduplicate the document ids, then sort them numerically.
        result = list(set(values))
        result = sorted(result, key=lambda x: int(x))
        yield key, result
Note: the list of ids in the output must also be sorted for the solution to be accepted.
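A small local run may help show what the index looks like and why the sort matters. This sketch does the map, shuffle and reduce steps inline rather than through the judge; the sample documents are invented, and the `Document` class is rewritten here from the definition in the problem template.

```python
from collections import defaultdict


class Document:
    def __init__(self, id, content):
        self.id = id
        self.content = content


# Invented sample input, deliberately not in id order.
docs = [
    Document(2, "lintcode is fun"),
    Document(1, "lintcode is a site"),
    Document(10, "fun fun site"),
]

# Map + shuffle: collect the ids of the documents that contain each word.
postings = defaultdict(list)
for doc in docs:
    for word in doc.content.split():
        postings[word].append(doc.id)

# Reduce: drop duplicate ids, then sort each list numerically.
index = {word: sorted(set(ids), key=int) for word, ids in postings.items()}

print(index["fun"])       # [2, 10]
print(index["lintcode"])  # [1, 2]
```

The `key=int` guards against the case where ids arrive as strings: a plain `sorted(["10", "2"])` would put "10" first, while sorting by integer value keeps "2" before "10".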
503 Anagram
class Anagram:

    # @param {str} line a text, for example "Bye Bye see you next"
    def mapper(self, _, line):
        # Write your code here
        # Please use 'yield key, value' here
        # Words that are anagrams of each other share the same sorted letters.
        for word in line.split():
            if word:
                yield ''.join(sorted(word)), word

    # @param key is from mapper
    # @param values is a set of value with the same key
    def reducer(self, key, values):
        # Write your code here
        # Please use 'yield key, value' here
        yield key, list(values)
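The key trick in this solution is that sorting a word's letters gives the same string for every anagram of that word, so the shuffle groups anagrams together for free. A minimal sketch of that idea outside the MapReduce template, using an invented input line and a hypothetical helper `anagram_key`:

```python
from collections import defaultdict


def anagram_key(word):
    """Return the word's letters in sorted order, e.g. 'lint' -> 'ilnt'."""
    return "".join(sorted(word))


groups = defaultdict(list)
for word in "lint intl inlt code".split():
    groups[anagram_key(word)].append(word)

print(dict(groups))  # {'ilnt': ['lint', 'intl', 'inlt'], 'cdeo': ['code']}
```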