使用word2vec和xgboost寻找Quora上的相似问题

原作者 Susan Li

Changing the world, one article at a time. Sr. Data Scientist, Toronto Canada. Opinion=my own.

http://www.linkedin.com/in/susanli/

备注：Quora是一个国外的问答网站，可以理解为外国的知乎。

image.jpeg

上周，我们探索了几种不同的去重技术，尝试使用BOW，TFIDF，Xgboost方法来识别相似文档。我们发现使用传统的TFIDF方法可以解决一些比较明显的问题。这可以解释为什么谷歌在搜索领域长期使用TFIDF方法来判断一个单词对于一个页面的重要程度。

为了深入研究和提升能力，我们来探索一些新的方法来解决类似的匹配和去重问题，首先我们把去重问题引申为一个分类问题，然后再去解决它。

数据

这个任务的目标是鉴别Quora中的一对问题是不是表达同样的意思，在数据中，每一组数据包含两个问题，以及人类专家（难道不是运营）标注的这俩问题是否属于同一个意思的标签。不过需要注意的是，这个标注过程是很主观的，对于同一对问题是否表述同一个意思，不同的专家可能有不同的意见。所以这个标签算是一种参考，它不是100%准确的。

df = pd.read_csv('quora_train.csv')

df = df.dropna(how="any").reset_index(drop=True) 

a= 0

for i in range(a,a+10):

    print(df.question1[i])

    print(df.question2[i])

    print()

image.png

计算词移距离（WMD：word mover’s distance）

词移距离是一个能让我们使用“距离”评估两个文档相似度的方法，这种方法不关心是不是有相同的单词。因为它使用了word2vec的向量进行计算。它的主要思想是利用文档中词的embedded向量，来计算一篇文档“游走”到另一篇文档的最小距离来衡量两篇文章的差异性。我们来看一个例子。下面是一对已经标注重复的数据：

question1 = 'What would a Trump presidency mean for

current international master’s students on an F1

visa?'

question2 = 'How will a Trump presidency affect the

students presently in US or planning to study in US?'

question1 = question1.lower().split()

question2 = question2.lower().split()

question1 = [w for w in question1 if w not in

stop_words]

question2 = [w for w in question2 if w not in

stop_words]

上面的代码做了两个操作，一个是转换大小写，一个是去除停用词，算是初步清洗样本吧。

我们预先用google news的语料训练了Word2vec模型。使用了genim的word2vec算法包。


import gensim

from gensim.models import Word2Vec

model =

gensim.models.KeyedVectors.load_word2vec_format('./wor

d2Vec_models/GoogleNews-vectors-negative300.bin.gz',

binary=True)

下面开始计算两个问题的WMD距离。注意，这俩文本表达了同样的意思，并被标注为重复Quora数据。

distance = model.wmdistance(question1, question2)

print('distance = %.4f' % distance)

结果是 distance = 1.8293

结果的值看起来很大，我们需要把它进行标准化。

标准化word2vec向量

在使用wmd方法时，首先去标准化word2vec向量，这是有好处的，这样他们就有一样的长度了。

model.init_sims(replace=True)

distance = model.wmdistance(question1, question2)

print('normalized distance = %.4f' % distance)

normalized distance = 0.7589

标准化之后，这个值就小多了。

我们再来做一对实验，这次用的是两个不重复的问题。

question3 = 'Why am I mentally very lonely? How can I

solve it?'

question4 = 'Find the remainder when [math]23^{24}

[/math] is divided by 24,23?'

question3 = question3.lower().split()

question4 = question4.lower().split()

question3 = [w for w in question3 if w not in

stop_words]

question4 = [w for w in question4 if w not in

stop_words]

distance = model.wmdistance(question3, question4)

print('distance = %.4f' % distance)

distance = 1.2637

model.init_sims(replace=True)

distance = model.wmdistance(question3, question4)

print('normalized distance = %.4f' % distance)

normalized distance = 1.2637 、

这一次，标准化之后的值没有发生变化。WMD方法认为这一组数据不如第一组那么相似，看起来很有效果不是吗。

FuzzyWuzzy

在前面的一篇文章中，我们已经了解过Python中模糊字符匹配的方法https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49

快速了解一下FuzzyWuzzy工具在处理我们的重复问题上能起什么作用。

from fuzzywuzzy import fuzz

question1 = 'What would a Trump presidency mean for

current international master’s students on an F1

visa?'

question2 = 'How will a Trump presidency affect the

students presently in US or planning to study in US?'

fuzz.ratio(question1, question2)

53

fuzz.partial_token_set_ratio(question1, question2)

100

question3 = 'Why am I mentally very lonely? How can I

solve it?'

question4 = 'Find the remainder when [math]23^{24}

[/math] is divided by 24,23?'

fuzz.ratio(question3, question4)

28

fuzz.partial_token_set_ratio(question3, question4)

从上面的结果看起来，FuzzyWuzzy工具也不认为第二组数据相似。这很好，因为人类专家也这么认为。

特征工程

首先，我们先实现几个函数——计算WMD，标准化WMD，word2vec表达

def wmd(q1, q2):

    q1 = str(q1).lower().split()

    q2 = str(q2).lower().split()

    stop_words = stopwords.words('english')

    q1 = [w for w in q1 if w not in stop_words]

    q2 = [w for w in q2 if w not in stop_words]

    return model.wmdistance(q1, q2)

def norm_wmd(q1, q2):

    q1 = str(q1).lower().split()

    q2 = str(q2).lower().split()

    stop_words = stopwords.words('english')

    q1 = [w for w in q1 if w not in stop_words]

    q2 = [w for w in q2 if w not in stop_words]

    return norm_model.wmdistance(q1, q2)

def sent2vec(s):

    words = str(s).lower()

    words = word_tokenize(words)

    words = [w for w in words if not w in stop_words]

    M = []

    for w in words:

    try:

        M.append(model[w])

    except:

        continue

    M = np.array(M)

    v = M.sum(axis=0)

    return v / np.sqrt((v ** 2).sum())

我们需要构建一些新的特征：

1.单词个数

2.字符个数

3.问题1和问题2中相同单词的个数

4.问题1和问题2中不同单词的个数

5.问题1和问题2的向量余弦距离

6.问题1和问题2的向量曼哈顿距离

7.--杰卡德距离

8.--兰氏距离

9.--欧氏距离

10.--闵可夫斯基距离

11.--布雷柯蒂斯距离

12.峰度和偏度

13.词移距离

14.标准化词移距离

所有以上距离计算公式都可以在scipy.spatial.distance中找到。

1\. df['len_q1'] = df.question1.apply(lambda x: len(str(x))

2\. df['len_q2'] = df.question2.apply(lambda x: len(str(x))

3\. df['diff_len'] = df.len_q1 - df.len_q2

4\. df['len_char_q1'] = df.question1.apply(lambda x: len(''.join(set(str(x).replace(' ',''))))

5\. df['len_char_q2'] = df.question2.apply(lambda x: len(''.join(set(str(x).replace(' ',''))))

6\. df['len_word_q1'] = df.question1.apply(lambda x: len(str(x).split())

7\. df['len_word_q2'] = df.question2.apply(lambda x: len(str(x).split())

8\. df['common_words'] = df.apply(lambda x: len(set(str(x[‘question1’]).lower().split()).intersection(set(str(x[‘question2’]).lower().split()))), axis=1)

9\. df['fuzz_ratio'] = df.apply(lambda x: fuzz.ratio(str(x[‘question1’]), str(x[‘question2'])),axis=1)

10.df['fuzz_partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(str(x[‘question1’]), str(x['question2'])), axis=1)

11.df['fuzz_partial_token_set_ratio'] = df.apply(lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)

12.df[‘fuzz_partial_token_sort_ratio'] = df.apply(lambda x: fuzz.partial_token_sort_ratio(str(x['question1']),str(x['question2'])), axis=1)

13.df['fuzz_token_set_ratio’] = df.apply(lambda x: fuzz.token_set_ratio(str(x[‘question1’]), str(x['question2'])), axis=1)

14.df[‘fuzz_token_sort_ratio’] = df.apply(lambda x: fuzz.token_sort_ratio(str(x[‘question1’]),str(x[‘question2'])), axis=1)

word2vec模型

前面说了，我们使用预先训练好的google news 语料的Word2vec模型。我下载下来并保存在word2Vec_models文件夹里面。我们用gensim的模块加载这个模型。

model =

gensim.models.KeyedVectors.load_word2vec_format('./wor

d2Vec_models/GoogleNews-vectors-negative300.bin.gz',

binary=True)

df['wmd'] = df.apply(lambda x: wmd(x['question1'],

x['question2']), axis=1)

标准化word2vec模型

norm_model =

gensim.models.KeyedVectors.load_word2vec_format('./word2Vec_models/GoogleNews-vectors-negative300.bin.gz',

binary=True)

norm_model.init_sims(replace=True)

df['norm_wmd'] = df.apply(lambda x:

norm_wmd(x['question1'], x['question2']), axis=1)

好了，到这里，我们构建的所有特征都已经有了，计算出他们的距离。

question1_vectors = np.zeros((df.shape[0], 300))

for i, q in enumerate(tqdm_notebook(df.question1.values)):

    question1_vectors[i, :] = sent2vec(q)

question2_vectors = np.zeros((df.shape[0], 300))

for i, q in enumerate(tqdm_notebook(df.question2.values)):

    question2_vectors[i, :] = sent2vec(q)

df['cosine_distance'] = [cosine(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]

df['cityblock_distance'] = [cityblock(x, y) for (x, y) in zip(np.nan_to_num(question1_cectors). np.nan_to_num(question2_vectors))]

df['jaccard_distance'] = [jaccard(x, y) for (x, y) in zip(np.nan_to_num(question1_cectors). np.nan_to_num(question2_vectors))]

df['canberra_distance'] = [canberra(x, y) for (x, y) in zip(np.nan_to_num(question1_cectors). np.nan_to_num(question2_vectors))]

df['euclidean_distance'] = [euclidean(x, y) for (x, y) in zip(np.nan_to_num(question1_cectors). np.nan_to_num(question2_vectors))]

df['skew_q1vec'] = [skew(x) for x in np.nan_to_num(question1_vectors)]

df['skew_q2vec'] = [skew(x) for x in np.nan_to_num(question2_vectors)]

df[‘kur_q1vec’] = [kurtosis(x) for x in mp.nan_to_num(question1_vectors)]

df[‘kur_q2vec’] = [kurtosis(x) for x in mp.nan_to_num(question2_vectors)]

使用Xgboost进行分类模型训练

df.drop(['question1', 'question2'], axis=1, inplace=True)

df = df[pd.notnull(df['cosine_distance'])]

df = df[pd.notnull(df['jaccard_distance'])]

X = df.loc[:, df.columns != 'is_duplicate']

y = df.loc[:, df.columns == 'is_duplicate']

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=0)

import xgboost as xgb

 model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1, colsample_bytree=.7, gamma=0, reg_alpha=4, objective=’binary:logistic’, eta=0.3,silent=1, subsample=0.8).fit(X_train, y_train.values.ravel))

prediction = model.predict(X_test)

cm = confusion_matrix(y_test, prediction)

print(cm)

print(’Accuracy’, accyracy_score(y_test, prediction))

print(classification_report(y_test, prediction))

image.png

使用我们构建的特征，Xgboost分类准确率为0.77，如果融合Xgboost+TFIDF可以达到0.8

同时，我们的召回从0.67提升到了0.73

Jupyter notebook can be found on Github. 周末加油鸭！

Reference:

https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 159,835评论 4赞 364
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,598评论 1赞 295
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 109,569评论 0赞 244
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 44,159评论 0赞 213
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,533评论 3赞 287
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,710评论 1赞 222
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,923评论 2赞 313
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,674评论 0赞 203
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,421评论 1赞 246
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,622评论 2赞 245
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,115评论 1赞 260
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,428评论 2赞 254
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,114评论 3赞 238
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,097评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,875评论 0赞 197
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,753评论 2赞 276
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,649评论 2赞 271

使用word2vec和xgboost寻找Quora上的相似问题

推荐阅读更多精彩内容