AlphaML(算法篇)-基于语义路径的文本数据结构化算法

文本数据结构化在知识图谱、标签标注、chatbot等应用中占据重要地位，普通的文本是无结构的数据，进行进一步的分析和理解，需要将无结构化的文本进行“有结构化”，科班自然语言处理的专业人士多会采用词法分析、句法分析等专业方法，该系列方法不在本文讨论范畴。本文主要基于统计+“深度学习”方式进行粗暴解析，涉及语义树、word2vec、贝叶斯、威尔逊置信区间等知识对文本语料进行结构化抽取。抽取逻辑：先计算语料归属语义树中相关节点的概率和置信度，然后根据树结构抽取合并当前标注的路径。假设语料：小米科技生产的手机很实惠。抽取后的可能路径为：科技—>制造—>电子产品—>正面。当然也可以进行意图路径等抽取，只需要在语义树中设置意图类型和表达即可，剩下交给统计来处理。

语义树(图)的定义

该树的定义很关键，关系文本结构化后路径的走向，由使用者根据自己的业务方向进行制作，后期也只需维护和更新该树的内容即可，不同任务的长期积累即形成有价值的知识库，具体定义如下：

  {
      "domain": "代表该规则的抽象概念",
      "concepts": [
          "该规则的表达（关键词集）"
      ],
      "children": ["该规则的子规则List"]
  }
  ```

示例：

{
    "domain": "体育",
    "concepts": [
        "健身","打篮球","瑜伽","跑步"
    ],
    "children": [
        "健身",
        "篮球爱好",
        "瑜伽爱好者",
    ]
}

{
    "domain": "健身",
    "concepts": [
        "健身房","哑铃","跑步机"
    ],
    "children": [
    ]
}
```

树状结构即为：

Semantic_Tree_Path.png

注意点：
对语义树中domain的抽象表达，越高的节点上概念越宽泛，如“健身”，“运动会”，“人工智能”等，越低的层次概念越细化，如“贝叶斯信任网络”，“08奥运会”，“白鹿原”等。人工抽取出不同概念层次的词作为标签/意图等比较关键，会影响语料抽取的合理性，也就是说抽取的词概念既不要太宽泛，也不要太细化，例如，“计算机”，“汽车”，“电视剧”等词过于宽泛；而“健身房”，“三分球”，“白鹿原”又过于细化，理想情况下，抽取出“健身”，“蓝球”，“剧情/伦理”可能更为合适。

对给定语料domain的抽取

在该步骤中，我们先忘掉语义树的结构，把语义树的每个节点看作独立的个体，此时domain的抽取就退化为一个分类问题，任务变为：一个语料可以被预测为相关domain的概率，为实现该任务比较容易想到的方法是基于语料级文本分类，但该方式需要标注大量训练样本。除此之外，容易想到的方法为基于keyword的匹配，比如文本中出现篮球，即给归为体育，出现三分球即归为篮球，进一步想到通过word2vec对keyword进行扩展，即出现和三分球相似的滞空扣篮也归为篮球，可以看出这个想法看似也make sense，但细想可能禁不起推敲，举例：假如语料中出现小米，是给其分类到科技还是植物呢？没错，这是个概率问题，另外一个大段文本中被打上体育标签的有20个keyword，而被打上娱乐标签的有2个，那么针对该语料这两个被打上的标签那哪个更可信呢？所以除了需要去计算其归属概率，还要判断是否可信，或在多大概率下可信（置信区间/置信度）。通过以上分析，思路逐渐清晰，接下来第一步我们需要去train每个token/keyword对应相关domain分类的概率，然后在predict的时候给出置信度（confidence）。

p(domain|token)的估计
容易想到的方法：bayes估计（细节可参见Naive Bayes spam filtering），首先
![][2]
[2]: http://latex.codecogs.com/gif.latex?\Pr(S|W)={\frac{\Pr(W|S)\cdot\Pr(S)}{\Pr(W|S)\cdot\Pr(S)+\Pr(W|H)\cdot\Pr(H)}}
Pr(S|W)：为语料中token/similarity(token)为某domain S的概率, 比如篮球属于体育的概率；
Pr(S)为对于给定任何语料或token/similarity(token)情况下domain S的先验概率；
Pr(W|S)为token/similarity(token) W出现在domain S的概率；
Pr(H) 为对于给定任何语料或token/similarity(token)情况下不在tag S中的先验概率；
Pr(W|H)为token/similarity(token) W没有出现在标签S的概率；
在train之前，我们先用word2vec对每个domain的concepts进行扩展，具体方法在spotify开源annoy[Github]库基础上封装一层方法类，具体如下：

# coding: utf-8
"""
The MIT License (MIT)

Copyright (c) 2017 Leepand
"""
#keyord_knn.py
import gzip
import struct
import cPickle
import collections
class BaseANN(object):
    def use_threads(self):
        return True
class Annoy(BaseANN):
    def __init__(self, metric, n_trees, search_k):
        self._n_trees = n_trees
        self._search_k = search_k
        self._metric = metric
        self.name = 'Annoy(n_trees=%d, search_k=%d)' % (n_trees, search_k)
        self.id2word={}
        self.word2id={}
    @classmethod
    def get_vectors(cls,fn, n=float('inf')):
        def _get_vectors(fn):
            if fn.endswith('.gz'):
                f = gzip.open(fn)
                fn = fn[:-3]
            else:
                f = open(fn)
            if fn.endswith('.bin'): # word2vec format
                words, size = (int(x) for x in f.readline().strip().split())
                t = 'f' * size
                while True:
                    pos = f.tell()
                    buf = f.read(1024)
                    if buf == '' or buf == '\n': return
                    i = buf.index(' ')
                    word = buf[:i]
                    f.seek(pos + i + 1)
                    vec = struct.unpack(t, f.read(4 * size))
                    yield word, vec

            elif fn.endswith('.txt'): # Assume simple text format
                for line in f:
                    items = line.strip().split()
                    yield items[0], [float(x) for x in items[1:]]

            elif fn.endswith('.pkl'): # Assume pickle (MNIST)
                mi = 0
                for pics, labels in cPickle.load(f):
                    for pic in pics:
                        yield mi, pic
                        mi += 1
        i = 0
        for line in _get_vectors(fn):
            yield line
            i += 1
            if i >= n:
                break
    def fit(self, vecpath):
        import annoy
        self._annoy = annoy.AnnoyIndex(f=200, metric=self._metric)
        getvec=self.get_vectors(vecpath)
        
        j=0
        for i, x in getvec:
            self.id2word[int(j)]=str(i.strip())
            self.word2id[str(i.strip())]=int(j)
            self._annoy.add_item(j, x)
            j=j+1
        print 'building annoy tree...'
        self._annoy.build(self._n_trees)
        self._annoy.save('wiki2.ann')
        with open('id2word.pickle', 'wb') as f:
            cPickle.dump(self.id2word, f)     
        with open('word2id.pickle', 'wb') as f:
            cPickle.dump(self.word2id, f)
        print 'building annoy tree DONE!'
    def predict(self,annoytreepath):
        import annoy
        self._annoy = annoy.AnnoyIndex(f=200, metric=self._metric)
        self._annoy.load(annoytreepath)
        with open('word2id.pickle', 'rb') as f:
            self.word2id = cPickle.load(f) 
        with open('id2word.pickle', 'rb') as f:
            self.id2word = cPickle.load(f) 
    def query(self, v, n):
        return self._annoy.get_nns_by_vector(v.tolist(), n, self._search_k)
    def search(self,word):
        index=self.word2id[str(word)]
        knn=dict(zip((self._annoy.get_nns_by_item(index, 5,include_distances=True))[0] ,\
                     (self._annoy.get_nns_by_item(index, 5,include_distances=True))[1] ))
        knn_word_dist=collections.OrderedDict({self.id2word[int(k)]:(1-(v**2)/2) for k,v in knn.items()})       
        return knn_word_dist
    def get_cosine_by_items(self,word1,word2):
        if word1 in self.word2id and word2 in self.word2id:
            index1=self.word2id[str(word1)]
            index2=self.word2id[str(word2)]
            cosine=1-(self._annoy.get_distance(index1, index2)**2)/2
            return cosine
        else:
            return 'NoItem'
#build annoy tree
an=Annoy(metric = 'angular',n_trees=10,search_k=10)
an.fit(vecpath='./data/word_vec.bin')

p(domain|token)的估计：

import json
def calculate_category_probability(self):
        """
        Caches the individual probabilities for each category
        """
        total_tally = 0.0
        probs = {}
        for category, bayes_category in \
                self.categories.get_categories().items():
            count = bayes_category.get_tally()
            total_tally += count
            probs[category] = count

        # Calculating the probability
        for category, count in probs.items():
            if total_tally > 0:
                probs[category] = float(count)/float(total_tally)
            else:
                probs[category] = 0.0

        for category, probability in probs.items():
            self.probabilities[category] = {
                # Probability that any given token is of this category
                'prc': probability,
                # Probability that any given token is not of this category
                'prnc': sum(probs.values()) - probability
            }
def calculate_bayesian_probability(self, cat, token_score, token_tally):
        """
        Calculates the bayesian probability for a given token/category

        :param cat: The category we're scoring for this token
        :type cat: str
        :param token_score: The tally of this token for this category
        :type token_score: float
        :param token_tally: The tally total for this token from all categories
        :type token_tally: float
        :return: bayesian probability
        :rtype: float
        """
        # P that any given token IS in this category
        prc = self.probabilities[cat]['prc']
        # P that any given token is NOT in this category
        prnc = self.probabilities[cat]['prnc']
        # P that this token is NOT of this category
        prtnc = (token_tally - token_score) / token_tally
        # P that this token IS of this category
        prtc = token_score / token_tally

        # Assembling the parts of the bayes equation
        numerator = (prtc * prc)
        denominator = (numerator + (prtnc * prnc))

        # Returning the calculated bayes probability unless the denom. is 0
        return numerator / denominator if denominator != 0.0 else 0.0
###train:
def train(self, category, text):
        """
        Trains a category with a sample of text

        :param category: the name of the category we want to train
        :type category: str
        :param text: the text we want to train the category with
        :type text: str
        """
        try:
            bayes_category = self.categories.get_category(category)
        except KeyError:
            bayes_category = self.categories.add_category(category)

        tokens = self.tokenizer(str(text))
        occurrence_counts = self.count_token_occurrences(tokens)

        for word, count in occurrence_counts.items():
            bayes_category.train_token(word, count)

        # Updating our per-category overall probabilities
        self.calculate_category_probability()
def my_tokenizer(sample):
    return sample
def load_rules(path, reload=False, is_root=False):
    with open('word2id.pickle', 'rb') as f:
        word2id_dict = cPickle.load(f)
    """
    Build the rulebase by loading the rules terms from the given file.
    Args: the path of file.
    """
    bn=Annoy(metric = 'angular',n_trees=10,search_k=10)
    bn.predict(annoytreepath='wiki2.ann')
    bayes = simplebayes.SimpleBayes(cache_path='./my/tree/')
    bayes.flush()
    bayes.cache_train()
    with open(path, 'r') as input:
        json_data = json.load(input)
        # load rule and build an instance
        for data in json_data:
            simlist=[]
            domain = data["domain"]
            concepts_list = data["concepts"]
            children_list = data["children"]
            simstr=' '.join(concepts_list).encode('utf-8')+' '
            for word in concepts_list:
                if str(word.encode('utf-8')) in word2id_dict:
                    print 'word:',str(word.encode('utf-8'))
                    simstr+=' '.join([k for k,v in bn.search(str(word.encode('utf-8')) ).items() if v>0.5])
                else:
                    continue
            #' '.join(concepts_list).encode('utf-8')
            print 'sim',domain,simstr
            bayes.train(domain,simstr)

predict/score过程中p(domain|tokens)的置信计算：
置信区间的实质，就是进行可信度的修正，弥补样本量过小的影响。如果样本多，就说明比较可信，不需要很大的修正，所以置信区间会比较窄，下限值会比较大；如果样本少，就说明不一定可信，必须进行较大的修正，所以置信区间会比较宽，下限值会比较小。
二项分布的置信区间有多种计算公式，最常见的是"正态区间"（Normal approximation interval），教科书里几乎都是这种方法。但是，它只适用于样本较多的情况（np > 5 且 n(1 − p) > 5），对于小样本，它的准确性很差。
1927年，美国数学家 Edwin Bidwell Wilson提出了一个修正公式，被称为"威尔逊区间"，很好地解决了小样本的准确性问题。
- 计算每个标签的投票率（即一个语料归属该标签的标签词的比率）
- 计算每个投票率低置信区间（95%）
- 根据置信区间的下限值作为该标签的置信度
- 原理解释：置信区间的宽窄与样本的数量有关，比如，某条语料中归属A标签的标签词1个，9个为其他标签；另外一条归属B标签的有3个，27个为其他标签，两个标签的投票比率都是10%，但是B的置信度要高于A
- 语料中token归属任一节点的概率：Wilson_score_interval*p(s|w)
- 估计公式：
  ![][1]
- 代码示例：

@classmethod
    def confidence(cls,supports, downs):
        if ups == 0:
            return -downs
        n = supports + downs
        z = 1.6
        liked = float(supports) / n
        return (liked+z*z/(2*n)-z*sqrt((liked*(1-liked)+z*z/(4*n))/n))/(1+z*z/n)
#z为不同置信度对应的分位值
def score(self, text):
        """
        Scores a sample of text

        :param text: sample text to score
        :type text: str
        :return: dict of scores per category
        :rtype: dict
        """
        occurs = self.count_token_occurrences(self.tokenizer(text))
        scores = {}
        for category in self.categories.get_categories().keys():
            scores[category] = 0

        categories = self.categories.get_categories().items()

        for word, count in occurs.items():
            token_scores = {}

            # Adding up individual token scores
            for category, bayes_category in categories:
                token_scores[category] = \
                    float(bayes_category.get_token_count(word))

            # We use this to get token-in-category probabilities
            token_tally = sum(token_scores.values())
            #print word,count,token_tally,sum(occurs.values()),self.confidence(count,sum(occurs.values())-count)

            # If this token isn't found anywhere its probability is 0
            if token_tally == 0.0:
                continue

            # Calculating bayes probabiltity for this token
            # http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering
            Wilson_score_interval=self.confidence(count,sum(occurs.values())-count)
            for category, token_score in token_scores.items():
                # Bayes probability * the number of occurances of this token
                #print self.calculate_bayesian_probability(category,token_score,token_tally)
                #using Wilson_score_interval inplace count


                scores[category] += Wilson_score_interval * \
                    self.calculate_bayesian_probability(
                        category,
                        token_score,
                        token_tally
                    )

        # Removing empty categories from the results
        final_scores = {}
        for category, score in scores.items():
            if score > 0:
                final_scores[category] = score

        return final_scores

类别抽取示例：

# -*- coding:utf-8 -*-  
#tagger_score.py
load_rules('/Users/leepand/Downloads/Semantic_Tree_Path/rules/rule.json')
b_p=simplebayes.SimpleBayes(cache_path='./my/tree/')
b_p.get_cache_location()
print b_p.get_cache_location()
b_p.cache_train()
print 'predict:'
import jieba
#print ' '.join(jieba.lcut('姚明喜欢打篮球和逛街住宿贝叶斯数据挖掘'))             
for k,v in b_p.score(' '.join(jieba.lcut('姚明喜欢打篮球和逛街住宿贝叶斯数据挖掘晶体学')).encode('utf-8')).items():
    print k,v

语义路径提取

该过程将上一步骤抽取的domain进行路径合并，用于相应业务分析和应用，具体方法：

树规则序列化及模型预测

# -*- coding: utf-8 -*-
import os
import json
class ParseTree(object):
    """
    Store the concept terms of a rule, and calculate the rule match.
    """

    def __init__(self, domain, rule_terms, children,bayes_model):

        self.id_term = domain
        self.terms = rule_terms
        self.model = bayes_model
        self.children = children

    def __str__(self):
        res = 'Domain:' + self.id_term
        if self.has_child():
            res += ' with children: '
            for child in self.children:
                res += ' ' + str(child)
        return res

    def serialize(self):

        """
        Convert the instance to json format.
        """

        ch_list = []
        for child in self.children:
            ch_list.append(child.id_term)

        cp_list = []
        for t in self.terms:
            cp_list.append(t)

        response = []

        data = {
                "domain": str(self.id_term),
                "concepts": cp_list,
                "children": ch_list
        }

        return data

    def add_child(self,child_rule):

        """
        Add child rule into children list , e.g: Purchase(Parent) -> Drinks(Child).
        """

        self.children.append(child_rule)

    def has_child(self):
        return len(self.children)

    def has_response(self):
        return len(self.response)

    def match(self, sentence, threshold=0):

        """
        Calculate the p*confidence of the input using bayes model.
        Args:
            threshold: a threshold to ignore the low confidence.
            sentence : a list of words.
        Returns:
            a dict : [domain,p*confidence]
        """
        matchs_dict=self.model.score(' '.join(jieba.lcut(sentence)).encode('utf-8'))
        return matchs_dict

解析树建立及domain路径搜索
创建树解析方法类：

class ParseBase(object):

  """
  to store rules, and load the trained bayes model.
  """

  def __init__(self, domain="general"):
      self.rules = {}
      self.domain = domain
      self.model = None
      self.forest_base_roots = []

  def __str__(self):
      res = "There are " + str(self.rule_amount()) + " rules in the ParseBase:"
      res+= "\n-------\n"
      for key,rulebody in self.rules.items():
          res += str(rulebody) + '\n'
      return res

  def rule_amount(self):
      return len(self.rules)

  def output_as_json(self, path='rule.json'):
      rule_list = []
      print self.rules.values()
      for rule in self.rules.values():
          print rule.serialize()
          rule_list.append(rule.serialize())

      with open(path,'w') as op:
          op.write(json.dumps(rule_list, indent=4))

规则配置数据导入

 def load_rules(self, path, reload=False, is_root=False):

        """
        Build the rulebase by loading the rules terms from the given file.
        Args: the path of file.
        """
        assert self.model is not None, "Please load the bayes model before loading rules."
        if reload:
            self.rules.clear()

        with open(path, 'r') as input:
            json_data = json.load(input)
            # load rule and build an instance
            for data in json_data:

                domain = data["domain"]
                concepts_list = data["concepts"]
                children_list = data["children"]
                response = data["response"]

                if domain not in self.rules:
                    self.rule = ParseTree(domain, concepts_list, children_list,self.model)
                    self.rules[domain] = self.rule
                    if is_root:
                        self.forest_base_roots.append(self.rule)
                else:
                    print("[Rules]: Detect a duplicate domain name '%s'." % domain)

    def load_rules_from_dic(self,path):

        """
        load all rule_files in given path
        """

        for file_name in os.listdir(path):
            if not file_name.startswith('.'):  # escape .DS_Store on OSX.
                if file_name == "rule.json": # roots of forest
                    self.load_rules(path + file_name, is_root=True)
                else:
                    self.load_rules(path + file_name)

load已训练好的token-tags bayes 模型

def load_model(self,path='./my/tree/'):

        """
        Load a trained bayes model(binary format only).
        Args:
        path: the path of the model.
        """
        self.model = simplebayes.SimpleBayes(cache_path=path)
        self.model.get_cache_location()
        print self.model.get_cache_location()
        self.model.cache_train()

路径搜索的迭代宽度优先搜索算法

@classmethod
    def iterative_bfs(cls,graph, start): 
        '''iterative breadth first search from start''' 
        bfs_tree = {start: {"parents":[], "children":[], "level":0}} 
        bfs_path=start+'>'
        q = [start] 
        while q: 
            current = q.pop(0) 
            for v in graph[current]: 
                if not v in bfs_tree: 
                    bfs_tree[v]={"parents":[current], "children":[], "level": bfs_tree[current]["level"] + 1} 
                    #print v
                    bfs_path+=v+'>'
                    bfs_tree[current]["children"].append(v) 
                    q.append(v) 
                else: 
                    if bfs_tree[v]["level"] > bfs_tree[current]["level"]: 
                        bfs_tree[current]["children"].append(v) 
                        bfs_tree[v]["parents"].append(current)
                        bfs_path+=current+'>'
        #print bfs_path
        return bfs_path

路径解析主函数

def parse(self, sentence, threshold=0, root=None):

        """
        parse the sentence score order by bayes probability and confidence.
        Args:
            sentence: a list of domain and p*confidence
            threshold: a threshold to ignore the low confidence.
        Return:
            a list holds the max-path rules and the classification tree travel path.
        """

        log = open("matching_log.txt",'w')
        result_list  = []
        at_leaf_node = False
        
        assert self.model is not None, "Please load the model before any match."
        if root is None: # then search from roots of forest.
            focused_rule = self.forest_base_roots[:]
        else:
            focused_rule = [self.rules[root]]
        #rule=focused_rule
        rule_match2=self.rule.match(sentence, threshold)
        rule_match={}
        for d,w in rule_match2.items():
            if w>threshold:
                rule_match[d]=w
        domain_graph={}
        for domain,weight in rule_match.items():
            term_trans   = ""
            if self.rules[domain].has_child():
                domain_graph[domain]=[domain]
                term_trans += domain+'>'
                for child_id in self.rules[domain].children:
                    if child_id in rule_match:
                        term_trans += child_id+'>'
                        domain_graph[domain]=[child_id]
                    else:
                        continue
            else:
                term_trans= domain
                domain_graph[domain]=[domain]
            print 'all',term_trans.split('>')
            print 'end',term_trans.split('>')[-2]
            print 'begin',term_trans.split('>')[0]
            path_begin=term_trans.split('>')[0]
            path_end=term_trans.split('>')[-2]
                
            print 'domain tree path:',term_trans
        tmp_list=[]
        for domain,child in domain_graph.items():
            tag_path=self.iterative_bfs(domain_graph,domain)
            if len(tag_path.split('>'))>2:
                if tag_path not in ' '.join(tmp_list):
                    tmp_list.append(tag_path)
                else:
                    continue
        print 'This sentence\'s Domain Path is:','|'.join(tmp_list)

调用示例

# -*- coding: utf-8 -*-
import jieba
rb=ParseBase()
#load_bayes_model
rb.load_model()
#load domain tree
rb.load_rules_from_dic('./rules/')
#rb.output_as_json(path='rule.json')
rb.rule_amount()
sentence='贝叶斯数据挖掘人工智能时代的新算法'
rb.parse(sentence, threshold=0.02, root=None)
###############Log:######################
[Rules]: Detect a duplicate domain name '鬧鐘'.
[Rules]: Detect a duplicate domain name '觀光'.
[Rules]: Detect a duplicate domain name '天氣'.
[Rules]: Detect a duplicate domain name '吃喝玩樂'.
[Rules]: Detect a duplicate domain name '股票'.
[Rules]: Detect a duplicate domain name '住宿'.
[Rules]: Detect a duplicate domain name '購買'.
[Rules]: Detect a duplicate domain name '病症'.
domain tree path: 机器学习>算法>
domain tree path: 算法>贝叶斯>
This sentence's Domain Path is: 机器学习>算法>贝叶斯>

思考

1、扩展部分采用词向量方式，该方式扩展时属于弱匹配，虽然弹性比较大，但很容易发生干扰情形，比如有一些高频词「嘻嘻」、「哈哈」经常在匹配过程中得到较高分数，而一些如「动漫」有明确意图的被迫做了陪衬，也不能武断的将这些词柜为停用词，思考贝叶斯+置信度+路径的方式可以规避掉这个问题吗？
2、当domain“样本”非均衡时需要对先验概率做相应调整么？
[1]: http://latex.codecogs.com/gif.latex?{Wilson_score_interval}=\frac{\hat\p+\frac1{2n}+z^{2_{1-\frac{\alpha}{2}}\pm\z_{1-\frac\alpha{2}}\sqrt{\frac{\hat\p(1-\hat\p)}{n}+\frac{z}2_{1-\frac\alpha{2}}}{4n^{2}}}{1+\frac1{n}z}2_{1-\frac\alpha{2}}}

最后编辑于：2017.12.08 03:24:39

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 158,736评论 4赞 362
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,167评论 1赞 291
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,442评论 0赞 243
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,902评论 0赞 204
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,302评论 3赞 287
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,573评论 1赞 216
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,847评论 2赞 312
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,562评论 0赞 197
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,260评论 1赞 241
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,531评论 2赞 245
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,021评论 1赞 258
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,367评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,016评论 3赞 235
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,068评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,827评论 0赞 194
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,610评论 2赞 274
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,514评论 2赞 269

AlphaML(算法篇)-基于语义路径的文本数据结构化算法

语义树(图)的定义

对给定语料domain的抽取

语义路径提取

思考

推荐阅读更多精彩内容