# Fuzzy matching in Python with the fuzzywuzzy library: a worked example

fuzzywuzzy is a fuzzy string matching library for Python. It scores the difference between two sequences using the Levenshtein Distance algorithm.

`Levenshtein Distance`, also known as `Edit Distance`, is the minimum number of single-character edits required to turn one string into the other. The permitted edits are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.
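The definition above can be sketched as a small dynamic program. This is only an illustration of the distance itself, not part of fuzzywuzzy's API (fuzzywuzzy computes it internally, optionally via the `python-Levenshtein` C extension):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Three edits turn "kitten" into "sitting" (k→s, e→i, insert g), which matches the classic textbook example.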

## Installing with pip from PyPI

```shell
pip install fuzzywuzzy
```

## Usage

```python
>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process
```

## Simple matching (Simple Ratio)

```python
>>> fuzz.ratio("this is a test", "this is a test!")
97
```

## Partial matching (Partial Ratio)

```python
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
```

## Order-insensitive matching (Token Sort Ratio)

```python
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
```

## Deduplicated subset matching (Token Set Ratio)

```python
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
```

## Process

```python
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
```

```python
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)
```

sp_code is the primary key of the rawdata table, so we match the primary key straight into the result.

```python
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

# Load the two tables to be matched into DataFrames and concatenate the
# descriptive columns of each record into one matching string.
# (The read_excel calls and sheet names are assumptions; the original
# snippet defined file_path but omitted the loading step.)
file_path = r"fuzzywuzzy test data.xlsx"
sp_rawdata = pd.read_excel(file_path, sheet_name='sp')
tr_rawdata = pd.read_excel(file_path, sheet_name='tr')
sp_rawdata['text'] = sp_rawdata['sp_webiste'] + sp_rawdata['sp_channel'] + sp_rawdata['sp_position'] + sp_rawdata['sp_format']
tr_rawdata['text'] = tr_rawdata['tr_Website'] + tr_rawdata['tr_Position_Channel'] + tr_rawdata['tr_Format']

# Get all deduplicated values of cacode from each DataFrame as an array
sp_listtype = sp_rawdata['cacode'].unique()
tr_listtype = tr_rawdata['cacode'].unique()

scorelist = []
rawlist = []
#df = pd.DataFrame(columns=["cacode", "tr_campaign_name", "tr_Website", "tr_Position_Channel", "tr_Format"])
for i in sp_listtype:
    # isin() takes a list and tests whether each element of the column is in it;
    # the resulting boolean index then filters the DataFrame (str.contains() is
    # a similar filter). Here it splits each DataFrame by cacode (e.g. 1, 2, 3),
    # so the two tables are only matched against rows sharing the same cacode.
    sp_data = sp_rawdata[sp_rawdata['cacode'].isin([i])]
    tr_data = tr_rawdata[tr_rawdata['cacode'].isin([i])]
    # Walk the filtered DataFrame row by row
    for row in tr_data.itertuples():
        rawlist.append(row)
    for text in tr_data['text']:
        # Order-insensitive match; keep the highest-scoring candidate
        score = process.extractOne(str(text), sp_data['text'].astype(str), scorer=fuzz.token_sort_ratio)
        scorelist.append(score)

# Convert the lists into DataFrames
scorecode = pd.DataFrame(scorelist)
df = pd.DataFrame(rawlist)
# Rename the columns of the converted DataFrames (note that 0 and 1 here are ints, not strings)
```
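The renaming step the last comment refers to can be sketched as follows. When `process.extractOne` is given a pandas Series as the choices, it returns `(matched_text, score, index)` tuples, so `pd.DataFrame(scorelist)` starts out with the integer column labels 0, 1, 2. The column names below are my assumption, and the sample tuples are made-up placeholders:

```python
import pandas as pd

# Placeholder result tuples standing in for the extractOne output above
scorelist = [("siteA channel1 banner", 95, 0),
             ("siteB channel2 video", 88, 3)]
scorecode = pd.DataFrame(scorelist)

# The default labels 0/1/2 are ints, not the strings "0"/"1"/"2",
# so the rename mapping must use int keys.
scorecode = scorecode.rename(columns={0: "matched_text", 1: "score", 2: "sp_index"})
print(scorecode.columns.tolist())  # → ['matched_text', 'score', 'sp_index']
```

With named columns in place, `scorecode` can be concatenated or merged back onto the raw rows collected in `df`.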