A Walkthrough of vecmap: Code for Mapping Bilingual Word Embeddings

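The previous post, on the unsupervised machine translation paper "An Effective Approach to Unsupervised Machine Translation", noted that some of today's strong machine translation models can translate without using any parallel corpora.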

Why is it possible to do without bilingual parallel corpora?

Python code for bilingual word embedding mapping: vecmap

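To build your own mapping between the word embeddings of different languages, first choose a tool for training monolingual word embeddings (e.g. word2vec or fastText), then use today's tool, vecmap, to map the embeddings of the two languages into a shared space.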

Getting the test data for the experiments

./get_data.sh

The available mapping modes

Supervised mode

python3 map_embeddings.py --supervised TRAIN.DICT SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

Semi-supervised mode

python3 map_embeddings.py --semi_supervised TRAIN.DICT SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

Identical mode

python3 map_embeddings.py --identical SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

Unsupervised mode

python3 map_embeddings.py --unsupervised SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

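·SRC.EMB: the source-language embeddings (input)
·TRG.EMB: the target-language embeddings (input)
·SRC_MAPPED.EMB: the source-language embeddings after mapping (output)
·TRG_MAPPED.EMB: the target-language embeddings after mapping (output)
·TRAIN.DICT: the training dictionary of word pairs, needed only by the supervised and semi-supervised modes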
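To make the idea of a "mapping" more concrete, here is a minimal sketch of the simplest supervised case: stack the source and target vectors of the dictionary pairs row by row, then look for an orthogonal matrix W that moves the source rows as close as possible to the target rows (the orthogonal Procrustes problem). This is an illustration only and not vecmap's own code, which does more than this; the helper name `learn_orthogonal_mapping` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def learn_orthogonal_mapping(x, z):
    """Return the orthogonal W minimizing ||x @ W - z||_F.

    x, z: (n_pairs, dim) arrays; row i of x and row i of z hold the
    vectors of the i-th dictionary pair. (Hypothetical helper.)
    """
    u, _, vt = np.linalg.svd(x.T @ z)
    return u @ vt


# Synthetic check: the "target" space is a rotated copy of the "source" space.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 5))                  # 50 made-up source vectors
q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # a random orthogonal matrix
z = x @ q                                         # made-up target vectors
w = learn_orthogonal_mapping(x, z)
print(np.allclose(w, q))                          # was the rotation recovered?
```

Because W is constrained to be orthogonal, distances and angles inside the source space are preserved; only its orientation changes.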

The unsupervised mode in practice

Preparing a small demo dataset: Snow White (Chinese and English)
English file: en.txt

In the harsh winter season, the goose-like snow flakes are flying around in the sky. There is a queen sitting in a window in the palace, doing needlework for her daughter, the wind blowing snow flakes into the window, the ebony window sill There are a lot of snowflakes falling on it. She looked up and looked out the window. She did not pay attention, and a needle stuck into her finger. The red blood flowed out of the needle, and three drops of blood fell on the snow on the window. She thoughtfully stared at the red blood drops on the white snow, and looked at the ebony window sill. She said, "I hope my little daughter's skin will be white and red, and it looks like this white snow and The red blood is the same, so gorgeous, so arrogant, the hair looks like the ebony of this window is generally black and bright!"
Her little daughter has grown up, and the little girl is so beautiful that she is beautiful, beautiful and moving. Her skin is really white like snow, ruddy with blood, and black hair like ebony. So the queen gave her a name, called Snow White. But Snow White has not grown up, and her queen mother died.
Soon, the father of the king married another wife. This queen is very beautiful, but she is very proud and conceited. She is very strong and can't stand it if she hears someone is more beautiful than her. She has a mirror, she often goes to the mirror to appreciate herself and asks: "Tell me, mirror, tell me the truth! All the women here are the most beautiful? Tell me who she is?"
The mirror replied: "It's you, queen! You are the most beautiful woman here."
When she heard this, she would smile with satisfaction. But Snow White grew up slowly and became more and more beautiful. By the time she was seven years old, she was more dazzling than the bright spring, more beautiful than the queen. Until one day, when the queen went to ask the mirror as usual, the mirror made the answer: "The queen, you are beautiful and beautiful, but Bai Xuegong is more beautiful than you!"
She heard this, her heart filled with anger and jealousy, and her face became pale. She called a servant and said to him, "Give me Snow White to the big forest. I don't want to see her anymore." The servant took Snow White away. When he was about to kill her in the forest, she cried and begged him not to kill her. Facing the pleading of the pitiful little princess, the servant’s sympathy came to life. He said, “You are a child who loves you, I will not kill you.” In this way, he left her alone. In the forest. When the servant decided not to kill Snow White and left her there, even though he knew that in the uninhabited big forest, she would be torn into pieces by the beast, but thought that he did not have to kill her by hand. He felt that a heavy stone that was pressing on his heart fell.
After the servant left, Snow White was very scared. She was everywhere in the forest, looking for a way out. The beast screamed beside her, but did not hurt her. In the evening, she came to a small house. When she determined that there was no one in the house, she pushed the door and went to rest, because she was really unable to move. As soon as she entered the door, she found that everything in the house was well organized and clean. A table was covered with white cloth with seven small plates, each with a piece of bread and some other food. There were seven glasses filled with wine next to the plate, seven knives. And the fork, etc., and the wall is also discharged with seven small beds. At this time she felt hungry and thirsty, and she did not care who it was. She went up to cut a small piece of bread from each piece of bread and drank a little bit of wine in each glass. After eating and drinking, she felt very tired and wanted to lie down and rest. So she came to the bed and almost tried every single of the seven beds. It was not too long, it was too short. It was not until the seventh bed was tried. She lay down on it and soon fell asleep.

Chinese file: zh.txt

严冬时节,鹅毛一样的大雪片在天空中到处飞舞着,有一个王后坐在王宫里的一扇窗子边,正在为她的女儿做针线活儿,寒风卷着雪片飘进了窗子,乌木窗台上飘落了不少雪花。她抬头向窗外望去,一不留神,针刺进了她的手指,红红的鲜血从针口流了出来,有三点血滴飘落在窗子的雪花上。她若有所思地凝视着点缀在白雪上的鲜红血滴,又看了看乌木窗台,说道:“但愿我小女儿的皮肤长得白里透红,看起来就像这洁白的雪和鲜红的血一样,那么艳丽,那么骄嫩,头发长得就像这窗子的乌木一般又黑又亮!”
她的小女儿渐渐长大了,小姑娘长得水灵灵的,真是人见人爱,美丽动人。她的皮肤真的就像雪一样的白嫩,又透着血一样的红润,头发像乌木一样的黑亮。所以王后给她取了个名字,叫白雪公主。但白雪公主还没有长大,她的王后妈妈就死去了。
不久,国王爸爸又娶了一个妻子。这个王后长得非常漂亮,但她很骄傲自负,嫉妒心极强,只要听说有人比她漂亮,她都不能忍受。她有一块魔镜,她经常走到镜子面前自我欣赏,并问道:“告诉我,镜子,告诉我实话!这儿所有的女人谁最漂亮?告诉我她是谁?”
镜子回答道:“是你,王后!你就是这儿最漂亮的女人。”
听到这样的话,她就会满意地笑起来。但白雪公主慢慢地长大,并出落得越来越标致漂亮了。到了七岁时,她长得比明媚的春光还要艳丽夺目,比王后更美丽动人。直到有一天,王后像往常一样地去问那面魔镜时,镜子作出了这样的回答:“王后,你是美丽漂亮的,但是白雪公主要比你更加漂亮!”
她听到了这话,心里充满了愤怒和妒忌,脸也变得苍白起来。她叫来了一名仆人对他说:“给我把白雪公主抓到大森林里去,我再也不希望看到她了。”仆人把白雪公主带走了。在森林里他正要动手杀死她时,她哭泣着哀求他不要杀害她。面对楚楚动人的可怜小公主的哀求,仆人的同情之心油然而生,他说道:“你是一个人见人爱的孩子,我不会杀害你。”就这样,他把她单独留在了森林里。当仆人决定不再杀害白雪公主,而把她留在那儿时,尽管他知道在那荒无人际的大森林里,她十有八九会被野兽撕成碎片,但想到他不必亲手杀害她,他就觉得压在心上的一块沉重的大石头落了下来。
仆人走了以后,白雪公主一个人非常害怕,她在森林里到处徘徊,寻找出去的路。野兽在她身旁吼叫,但却没去伤害她。到了晚上,她来到了一间小房子跟前。当她确定这间房子没有人时,就推门走进去想休息一下,因为她已经实在走不动了。一进门,她就发现房子里的一切都布置得井井有条,十分整洁干净。一张桌子上铺着白布,上面摆放着七个小盘子,每个盘子里都装有一块面包和其它一些吃的东西,盘子旁边依次放着七个装满葡萄酒的玻璃杯,七把刀子和叉子等,靠墙还并排放着七张小床。此时她感到又饿又渴,也顾不得这是谁的了,走上前去从每块面包上切了一小块吃了,又把每只玻璃杯里的酒喝了一点点。吃过喝过之后,她觉得非常疲倦,想躺下休息休息,于是来到那些床前,七张床的每一张她几乎都试过了,不是这一张太长,就是那一张太短,直到试了第七张床才合适。她在上面躺下来,很快就睡着了。

Data processing code

from gensim.models import word2vec
import jieba


def read_lines(path, lowercase=False):
    # Read the non-empty lines of a file; each line is used as one training sentence.
    with open(path, 'r', encoding='utf8') as f:
        lines = [line.strip() for line in f]
    lines = [line for line in lines if line]
    return [line.lower() for line in lines] if lowercase else lines


def get_zh_embedding():
    sentences = read_lines('zh.txt')
    # Segment each Chinese sentence into words with jieba. The result is the
    # input format gensim's Word2Vec expects: a list of token lists.
    sens_list = [jieba.lcut(sen) for sen in sentences]
    # epochs is the gensim 4.x name of this argument; gensim 3.x calls it iter.
    model = word2vec.Word2Vec(sens_list, min_count=1, epochs=20)
    model.wv.save_word2vec_format('SRC.EMB', binary=False)


def get_en_embedding():
    sentences = read_lines('en.txt', lowercase=True)
    # Whitespace tokenization; punctuation stays attached to the words.
    sens_list = [sen.split() for sen in sentences]
    model = word2vec.Word2Vec(sens_list, min_count=1, epochs=20)
    model.wv.save_word2vec_format('TRG.EMB', binary=False)


get_en_embedding()
get_zh_embedding()

The SRC.EMB file

The TRG.EMB file
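Both files are in the word2vec text format: the first line holds the vocabulary size and the vector dimension, and every following line holds one word followed by its vector components, separated by spaces. The short sketch below reads such a file with the standard library only so you can check what was written; the function name `read_word2vec_text` is a hypothetical helper, not part of gensim or vecmap.

```python
import os


def read_word2vec_text(path):
    """Parse a word2vec text-format file into (dim, {word: [floats]})."""
    vectors = {}
    with open(path, 'r', encoding='utf8') as f:
        _count, dim = (int(n) for n in f.readline().split())
        for line in f:
            # Split from the right so that exactly `dim` numbers are taken.
            parts = line.rstrip().rsplit(' ', dim)
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return dim, vectors


for path in ('SRC.EMB', 'TRG.EMB'):
    if os.path.exists(path):  # only runs once the files have been generated
        dim, vectors = read_word2vec_text(path)
        print(path, len(vectors), 'words,', dim, 'dimensions')
```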

Running the mapping

python3 map_embeddings.py --unsupervised SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

SRC_MAPPED.EMB as produced by the vecmap mapping

TRG_MAPPED.EMB as produced by the vecmap mapping
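Since the two mapped files live in one shared space, the target words nearest to a source word can be read as its candidate translations. The sketch below shows plain nearest-neighbor retrieval by cosine similarity in pure Python. The function names and the toy vectors are assumptions for illustration; for real evaluation, rely on vecmap's own eval_translation.py.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def translate(word, src_vectors, trg_vectors, k=3):
    """Return the k target words closest to `word` by cosine similarity."""
    query = src_vectors[word]
    ranked = sorted(trg_vectors,
                    key=lambda w: cosine(query, trg_vectors[w]),
                    reverse=True)
    return ranked[:k]


# Toy vectors standing in for SRC_MAPPED.EMB and TRG_MAPPED.EMB (made-up numbers).
src = {'王后': [0.9, 0.1], '他': [0.1, 0.9]}
trg = {'queen': [1.0, 0.0], 'he': [0.0, 1.0], 'snow': [0.7, 0.7]}
print(translate('王后', src, trg, k=2))
```

On real data, the dictionaries would be loaded from the mapped embedding files instead of being written by hand.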

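Evaluation

The contents of TEST.DICT are shown below. Each line is one translation pair in the form: source word, a space, target word. As far as possible, use words that are present in the two embedding files above.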

就 on
里 in
王后 queen
白雪公主 snow
白雪公主 white
他 he
在 in
又 also
我 i
漂亮 pretty
一样 same
长 long
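Dictionary pairs whose words are missing from the embeddings cannot be evaluated, so it can help to check coverage before running the evaluation. Keep in mind that the English tokenizer above splits on whitespace only, so a word may exist in the vocabulary only with punctuation attached. The following is a minimal sketch; the function names and the small example vocabularies are hypothetical.

```python
def load_dictionary(lines):
    """Parse 'source target' pairs, one per line, skipping anything else."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def split_by_coverage(pairs, src_vocab, trg_vocab):
    """Separate pairs found in both vocabularies from the remaining pairs."""
    covered = [p for p in pairs if p[0] in src_vocab and p[1] in trg_vocab]
    missing = [p for p in pairs if p not in covered]
    return covered, missing


# Hypothetical vocabularies; in practice, collect the words listed in
# SRC_MAPPED.EMB and TRG_MAPPED.EMB.
src_vocab = {'王后', '他', '我'}
trg_vocab = {'queen', 'he', 'i'}
pairs = load_dictionary(['王后 queen', '他 he', '漂亮 pretty', ''])
covered, missing = split_by_coverage(pairs, src_vocab, trg_vocab)
print(len(covered), 'covered;', 'missing:', missing)
```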

Evaluation command

python3 eval_translation.py SRC_MAPPED.EMB TRG_MAPPED.EMB -d TEST.DICT

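Evaluation results (this is only a small demo, so it mainly shows that the pipeline runs correctly with the file formats above; for good results, test with the original data fetched by get_data.sh, or train your own monolingual word embeddings on a large corpus with more iterations)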

