Python Word Frequency Count Example

Project Overview

This project implements a simple word frequency count using two Python files.


[Image: project screenshot]

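The project consists of four files:

  • file01: the text file whose words are to be counted.
  • maptest.py: the first MapReduce stage, map.
  • file02: the file that holds the intermediate result.
  • reducetest.py: the second MapReduce stage, reduce.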
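
As a rough illustration of how the two stages listed above fit together, here is a minimal in-memory sketch. The function names `map_phase` and `reduce_phase` are hypothetical and do not appear in the project, and a Python list stands in for the intermediate file file02.

```python
# Hypothetical helper names; a list stands in for the intermediate file.

def map_phase(lines):
    """Split each line on whitespace and return all words in sorted order."""
    words = []
    for line in lines:
        words.extend(line.split())
    words.sort()
    return words


def reduce_phase(sorted_words):
    """Count runs of identical adjacent words in an already sorted list."""
    counts = []
    for word in sorted_words:
        if counts and counts[-1][0] == word:
            # Same word as the previous one: extend the current run
            counts[-1] = (word, counts[-1][1] + 1)
        else:
            # A new word starts a new run
            counts.append((word, 1))
    return counts


sample = ["a buffer against losses", "a buffer"]
result = reduce_phase(map_phase(sample))
print(result)
```

The sort in the map stage is what lets the reduce stage count each word in a single pass while remembering only the current word.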

Contents of each file:

Contents of file01:

We think that could provide quite a buffer for the hormone replacement franchise
Meanwhile Spiros is determined to buffer his family against this uncertainty despite his deep patriotism
Everyone agrees on the destination: lots more pure equity, the highest-quality buffer against losses
The wind whooshed and whined, a buffer against the lonesome quiet of my strange hotel room
Meanwhile Spiros is determined to buffer his family against this uncertainty despite his deep patriotism

Contents of maptest.py:

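# wordcount: map stage
"""
1. Read file01 and append every word to a list.
2. Sort the list.
3. Write the words to file02, one per line.
"""
words = []
with open("file01", "r") as infile:
    for line in infile:
        # split() with no argument splits on any run of whitespace,
        # so blank lines and repeated spaces produce no empty tokens
        words.extend(line.split())

# Sorting puts identical words on adjacent lines,
# which the reduce stage relies on
words.sort()

with open("file02", "w") as outfile:
    for word in words:
        outfile.write(word + "\n")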
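
Because the map stage splits only on whitespace, tokens such as "destination:" and "equity," keep their punctuation, and "The" is counted separately from "the". If that is not wanted, one possible variant (an assumption, not part of the original project) normalizes each token before sorting. The helper name `normalize` is hypothetical.

```python
import string


def normalize(token):
    """Lowercase a token and trim punctuation from both ends only."""
    # strip() removes leading/trailing characters, so the hyphen inside
    # "highest-quality" is preserved
    return token.strip(string.punctuation).lower()


line = ("Everyone agrees on the destination: lots more pure equity, "
        "the highest-quality buffer")

# Normalize every token, drop any that become empty, then sort
words = sorted(w for w in (normalize(t) for t in line.split()) if w)
print(words)
```

Whether to fold case and drop punctuation depends on what should count as the same word, so this step is optional.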

Contents of file02:

Everyone
Meanwhile
Meanwhile
Spiros
Spiros
The
We
a
a
against
against
against
against
agrees
and
buffer
buffer
buffer
buffer
buffer
could
deep
deep
despite
despite
destination:
determined
determined
equity,
family
family
for
franchise
highest-quality
his
his
his
his
hormone
hotel
is
is
lonesome
losses
lots
more
my
of
on
patriotism
patriotism
provide
pure
quiet
quite
replacement
room
strange
that
the
the
the
the
think
this
this
to
to
uncertainty
uncertainty
whined,
whooshed
wind

Contents of reducetest.py:

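# wordcount: reduce stage
# file02 is sorted, so identical words sit on adjacent lines and
# each word's count is the length of its run.

cur_word = None
count = 0

with open("file02", "r") as infile:
    for line in infile:
        word = line.strip()
        if cur_word is None:
            cur_word = word
        if cur_word != word:
            # The word changed: emit the finished run and start a new one
            print('\t'.join([cur_word, str(count)]))
            cur_word = word
            count = 0
        count += 1

# Emit the last run, unless file02 was empty
if cur_word is not None:
    print('\t'.join([cur_word, str(count)]))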
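
The run counting above can also be expressed with the standard library's `itertools.groupby`, which groups consecutive equal items. The following is a sketch under the assumption that the words are already sorted. The short `sorted_words` list is illustrative data standing in for the lines of file02.

```python
from collections import Counter
from itertools import groupby

# Illustrative stand-in for the sorted lines of file02
sorted_words = ["a", "a", "against", "buffer", "buffer", "buffer", "could"]

# groupby yields one group per run of equal adjacent items, which is why
# the input must already be sorted
grouped = {}
for word, group in groupby(sorted_words):
    grouped[word] = sum(1 for _ in group)
    print(f"{word}\t{grouped[word]}")

# Cross-check against Counter, which does not require sorted input
assert grouped == dict(Counter(sorted_words))
```

`Counter` alone would give the same totals without the sort, but the sorted, run-based form mirrors how a reduce stage receives its input grouped by key.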
