021 [NLP] Sentiment Analysis Kaggle Competition

I have spent the past few days on this Kaggle project: Bag of Words Meets Bags of Popcorn

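My goals for this project were to learn how to use the word2vec model and to get comfortable with ensemble methods. I started from an existing project and modified it; the original is pangolulu/sentiment-analysis. Along the way I ported its Python 2 code to Python 3. Before this, I did a more basic word2vec project on the same dataset: word2vec-movies. It is worth reading word2vec-movies first and then coming back to this one.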

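My notes describe how to use the doc2vec model in some detail. After going through a lot of material (most of the code I found did not run), I got it working properly. If you want to learn how to use doc2vec, these notes should help.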

Project repository: sentiment-analysis

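Conclusion first: the best score I achieved is 0.89, short of the original author's 0.96. The author left many things unexplained, such as how the feature_chi.txt file in data was produced, or how sentence_vector_org.txt was produced. The author also trained word2vec with C code; since I am not familiar with that part, I deleted all of it and reimplemented it myself after some research. My own approach may well be the reason I cannot reach 0.96. If you manage to reach 0.96 with this project, please let me know where I went wrong.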

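For the most concise implementation, see the .py files. If something is unclear, look at the notebooks. They are rather long-winded and may be awkward to read, but they contain many explanations (written in Chinese), which should help in understanding the code.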

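I am still not satisfied with the results of this project, and I plan to move on to a more recent Kaggle NLP competition to keep learning. If you spot anything unreasonable in my code or have suggestions for improvement, issues and PRs are welcome.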

Usage

The three models live in three folders under Sentiment/src/: bow, doc2vec, and ensemble. See the contents of these folders for the preprocessing, model building, and prediction details.

Run from the project root:

python Sentiment/src/bow/runBow.py
python Sentiment/src/doc2vec/doc2vec_lr.py
python Sentiment/src/ensemble/ensemble.py

requirements

python==3.5
pandas==0.21.0
numpy==1.13.3
jupyter==1.0.0
scipy==0.19.1
scikit-learn==0.19.0
nltk==3.2.1
gensim==2.2.0

The text below is from the original author's project README, with my own notes added (originally written in Chinese).

sentiment-classification

Kaggle challenge "Bag of Words Meets Bags of Popcorn". The submission ranked 57th of 578, with a score of 0.96255.
The website is https://www.kaggle.com/c/word2vec-nlp-tutorial.

Method

My method has three parts: first I train a shallow model, then a deep model, and finally I combine the two to train an ensemble model.

Shallow Model

The method uses a bag-of-words model, which represents a sentence or document as a vector of words. Because the sentences contain a lot of noise, I apply feature selection using the chi-square statistic, which yields a feature vector that is more relevant to the classification label. I then use the TF-IDF score as the value of each dimension of the feature vector. Even after feature selection, the feature vector is still very high-dimensional (I use 19000 features in the model), so I train the classifier with logistic regression, a linear model that copes well with this many features, and apply L1 regularization. I call this method BOW. (The original README shows the training process in a figure here.)

Why do I call this model shallow? Mainly because it is based on bag-of-words, which only captures surface word frequencies in the sentence and ignores its syntax and semantics. Next I introduce a deep model that can capture more of a sentence's meaning.

My implementation of this model ends up at 0.88.
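As a rough illustration of the BOW steps described above (word counts, chi-square feature selection, TF-IDF weighting, L1-regularized logistic regression), here is a minimal scikit-learn sketch. It is not the code from the repository: the toy reviews, `k=4`, and `C=10.0` are assumptions chosen so the example runs on a few sentences, whereas the original keeps about 19000 features.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the labeled movie reviews (1 = positive, 0 = negative).
reviews = [
    "a wonderful touching film with great acting",
    "great story and wonderful cast",
    "boring plot and terrible acting",
    "a terrible boring waste of time",
]
labels = [1, 1, 0, 0]

bow_model = make_pipeline(
    CountVectorizer(),        # bag-of-words counts
    SelectKBest(chi2, k=4),   # keep the k words most associated with the label (assumed k)
    TfidfTransformer(),       # re-weight the selected counts by TF-IDF
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),  # assumed C
)
bow_model.fit(reviews, labels)

# Probability of the positive class for two unseen snippets.
probs = bow_model.predict_proba(["wonderful great film", "terrible boring plot"])[:, 1]
print(probs)
```

Putting the selection step inside the pipeline means the chi-square scores are computed only from the data passed to `fit`, which keeps cross-validation honest if the pipeline is later evaluated that way.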

Deep Model

Le & Mikolov proposed an unsupervised method for learning distributed representations of words and paragraphs. The key idea is to learn a compact representation of a word or paragraph by predicting nearby words in a fixed context window. This captures co-occurrence statistics and yields word and paragraph embeddings that carry rich semantics. Synonyms and similar paragraphs tend to appear in similar contexts, so they are mapped to nearby feature vectors (and vice versa). I call this method Doc2Vec. Doc2Vec resembles a neural network but contains no hidden layers, and its output is a softmax layer. To avoid the high time complexity of the softmax output layer, they propose hierarchical softmax based on a Huffman tree. (The original README shows the model architecture in a figure here.)

Such embeddings can then represent sentences or paragraphs and serve as input to a classifier. In my method, I first train 200-dimensional paragraph vectors and then fit an SVM classifier with an RBF kernel.
(The original README shows the training process for the deep model in a figure here.)

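The best score I got for this model is 0.87, using 100-dimensional doc2vec vectors and either an SVM or logistic regression as the classifier. Training the SVM takes a long time; switching to logistic regression makes little difference to the score. The author wrote the word2vec training in C, which I removed entirely and reimplemented, mainly with the doc2vec model in gensim. This model outputs one vector for each piece of text, which is very convenient for sentiment analysis. The official documentation is poor, though, so I mostly had to rely on other material. Two resources I found useful: "A gentle introduction to Doc2Vec" and word2vec-sentiments.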

Ensemble Model

The ensemble model combines the two methods above (BOW and Doc2Vec). In practice, an ensemble usually achieves a higher score than any single model, and the more diverse the base models are, the better the ensemble tends to perform. Combining the shallow model and the deep model is therefore reasonable. Rather than simply averaging the outputs of the two base models, I feed their outputs as input to another classifier. (The original README shows the ensemble architecture in a figure here.)

For this second-level (L2) learner, I use logistic regression.
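The two-level idea can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: the data is synthetic, the two column groups merely stand in for the BOW features and the Doc2Vec vectors, both base models are logistic regressions for speed, and the use of out-of-fold predictions to train the second level is a common stacking practice that the original README does not spell out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data; the two column groups play the role of two feature views.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=6, random_state=0)
X_bow, X_d2v = X[:, :10], X[:, 10:]
train, test = slice(0, 150), slice(150, 200)

# Level 1: one base model per feature view.
base_bow = LogisticRegression(penalty="l1", solver="liblinear")
base_d2v = LogisticRegression()

# Out-of-fold probabilities: each training row is scored by a model
# that did not see that row, so the level-2 inputs are not overfit.
oof_bow = cross_val_predict(base_bow, X_bow[train], y[train],
                            cv=5, method="predict_proba")[:, 1]
oof_d2v = cross_val_predict(base_d2v, X_d2v[train], y[train],
                            cv=5, method="predict_proba")[:, 1]
meta_X_train = np.column_stack([oof_bow, oof_d2v])

# Level 2: logistic regression on the base models' outputs.
meta = LogisticRegression().fit(meta_X_train, y[train])

# At prediction time, refit the base models on all training rows,
# then pass their test-set probabilities to the level-2 model.
base_bow.fit(X_bow[train], y[train])
base_d2v.fit(X_d2v[train], y[train])
meta_X_test = np.column_stack([
    base_bow.predict_proba(X_bow[test])[:, 1],
    base_d2v.predict_proba(X_d2v[test])[:, 1],
])
final_probs = meta.predict_proba(meta_X_test)[:, 1]
print(final_probs[:5])
```

Compared with plain averaging, the level-2 model learns how much weight each base model deserves from the data.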

The ensemble gives the highest score of the three, 0.89.
I also drew a diagram based on the code to make it easier to see how the ensemble is built.

Last edited: 2018.02.14 20:55:16
