
# Chinese text similarity algorithms on iOS and integrating Chinese word segmentation (jieba)

## Theory

#### Edit distance algorithm:

``````
#import "NSString+Distance.h"

@implementation NSString (Distance)

// Returns the similarity of stringA and stringB as a percentage (0-100),
// based on the Levenshtein edit distance over UTF-16 code units.
- (float)SimilarPercentWithStringA:(NSString *)stringA andStringB:(NSString *)stringB {
    NSInteger n = stringA.length;
    NSInteger m = stringB.length;
    if (m == 0 || n == 0) return 0;

    // Construct the (n + 1) x (m + 1) distance matrix.
    // This is a variable-length array (needs C99 support) and lives on the stack,
    // so it is only suitable for short strings.
    NSInteger matrix[n + 1][m + 1];
    // First column: deleting i characters from stringA.
    for (NSInteger i = 0; i <= n; i++) {
        matrix[i][0] = i;
    }
    // First row: inserting j characters of stringB.
    for (NSInteger j = 0; j <= m; j++) {
        matrix[0][j] = j;
    }
    for (NSInteger i = 1; i <= n; i++) {
        unichar si = [stringA characterAtIndex:i - 1];
        for (NSInteger j = 1; j <= m; j++) {
            unichar dj = [stringB characterAtIndex:j - 1];
            // Substitution costs nothing when the characters match.
            NSInteger cost = (si == dj) ? 0 : 1;
            const NSInteger above = matrix[i - 1][j] + 1;        // deletion
            const NSInteger left = matrix[i][j - 1] + 1;         // insertion
            const NSInteger diag = matrix[i - 1][j - 1] + cost;  // substitution
            matrix[i][j] = MIN(above, MIN(left, diag));
        }
    }
    // Normalize by the longer string so the result stays within 0-100.
    return 100.0 - 100.0 * matrix[n][m] / MAX(n, m);
}

@end
``````
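
The matrix in the method above needs (n + 1) x (m + 1) cells on the stack. As a rough sketch of the same recurrence using only two rows, the plain C functions below (which should compile unchanged in an Objective-C `.m` file) work on UTF-16 code units such as those returned by `characterAtIndex:`. The names `edit_distance` and `similar_percent` are illustrative and not part of the original category, and normalizing by the longer input is an assumption of this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Levenshtein distance between two arrays of UTF-16 code units.
   Keeps only two rows of the matrix, so memory is O(m) instead of O(n * m).
   Returns -1 if allocation fails. Hypothetical helper, not from the original. */
long edit_distance(const uint16_t *a, size_t n, const uint16_t *b, size_t m) {
    long *prev = malloc((m + 1) * sizeof *prev);
    long *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return -1; }

    for (size_t j = 0; j <= m; j++) prev[j] = (long)j;   /* row 0: j insertions */

    for (size_t i = 1; i <= n; i++) {
        curr[0] = (long)i;                               /* column 0: i deletions */
        for (size_t j = 1; j <= m; j++) {
            long cost  = (a[i - 1] == b[j - 1]) ? 0 : 1;
            long above = prev[j] + 1;                    /* deletion */
            long left  = curr[j - 1] + 1;                /* insertion */
            long diag  = prev[j - 1] + cost;             /* substitution */
            long best  = above < left ? above : left;
            curr[j] = best < diag ? best : diag;
        }
        long *tmp = prev; prev = curr; curr = tmp;       /* reuse the two buffers */
    }
    long result = prev[m];
    free(prev);
    free(curr);
    return result;
}

/* Similarity as a percentage, normalized by the longer input (an assumption). */
double similar_percent(const uint16_t *a, size_t n, const uint16_t *b, size_t m) {
    size_t longest = n > m ? n : m;
    if (longest == 0) return 100.0;
    long d = edit_distance(a, n, b, m);
    if (d < 0) return 0.0;
    return 100.0 - 100.0 * (double)d / (double)longest;
}
```

For example, `kitten` and `sitting` are 3 edits apart, so with the longer length of 7 this sketch gives roughly 57.1%.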

#### Word-frequency vector cosine angle algorithm:
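
For two word-frequency vectors $a$ and $b$ over the same vocabulary, the quantities in this section follow the standard cosine formula. The final mapping from the angle to a similarity score is the one used by the `similarityPercentWithSentenceA:sentenceB:` code in this article:

$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}},
\qquad
\text{similarity} = 1 - \frac{\theta}{\pi}
$$

Because word counts are never negative, $\theta$ lies in $[0, \pi/2]$, so this score ranges from 0.5 (no shared words) to 1 (vectors pointing in the same direction).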

a · b = 2×2 (斗鱼) + 1×1 (伴侣) + 1×0 (真是) + 0×1 (挺) + 1×1 (有意思) + 1×0 (支持) + 1×1 (直播) + 0×1 (可以) + 0×1 (用)
= 7

||a|| = sqrt(2×2 (斗鱼) + 1×1 (伴侣) + 1×1 (真是) + 1×1 (有意思) + 1×1 (支持) + 1×1 (直播))
= 3

||b|| = sqrt(2×2 (斗鱼) + 1×1 (伴侣) + 1×1 (挺) + 1×1 (有意思) + 1×1 (直播) + 1×1 (可以) + 1×1 (用))
= 3.16....

cos θ = 7 / (3 × 3.16...) = 0.737865

similarity = 1 - arccos(0.737865) / M_PI
= 0.764166

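## In practice

1. Word segmentation: iOS does ship with a built-in tokenization API, but its support for Chinese is not that good.

2. Building the vectors and computing: constructing the vectors directly on iOS is not that easy to implement either.

#### Word segmentation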

https://github.com/yanyiwu/iosjieba

iosjieba.bundle/dict/user.dict.utf8

``````
// Initialize once, then use directly.
// The enclosing method was elided in the original; the name setupJieba is illustrative.
- (void)setupJieba {
    NSString *dictPath = [[[NSBundle mainBundle] resourcePath] stringByAppendingPathComponent:@"iosjieba.bundle/dict/jieba.dict.small.utf8"];
    NSString *hmmPath = [[[NSBundle mainBundle] resourcePath] stringByAppendingPathComponent:@"iosjieba.bundle/dict/hmm_model.utf8"];
    NSString *userDictPath = [[[NSBundle mainBundle] resourcePath] stringByAppendingPathComponent:@"iosjieba.bundle/dict/user.dict.utf8"];

    const char *cDictPath = [dictPath UTF8String];
    const char *cHmmPath = [hmmPath UTF8String];
    const char *cUserDictPath = [userDictPath UTF8String];

    JiebaInit(cDictPath, cHmmPath, cUserDictPath);
}

// Convert a string into an array of words.
- (NSArray *)stringCutByJieba:(NSString *)string {
    // Segment the sentence with jieba.
    const char *sentence = [string UTF8String];
    std::vector<std::string> words;
    JiebaCut(sentence, words);

    // Copy each word into the array directly. Serializing the vector to a string
    // and splitting it again breaks on text that contains commas, quotes,
    // brackets, or spaces.
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray<NSString *> *wordsArray = [NSMutableArray arrayWithCapacity:words.size()];
    for (const std::string &word : words) {
        NSString *wordString = [NSString stringWithUTF8String:word.c_str()];
        // Skip invalid UTF-8 (nil) and whitespace-only tokens.
        if ([wordString stringByTrimmingCharactersInSet:whitespace].length > 0) {
            [wordsArray addObject:wordString];
        }
    }

    return wordsArray;
}
``````

#### Calculation

``````
// Two BASentenceModel objects are built here. Each one stores the original text,
// the array of segmented words, and the word-frequency dictionary.

- (void)setWordsArray:(NSArray *)wordsArray {
    _wordsArray = wordsArray;

    // Build a dictionary mapping each word to the number of times it occurs in the sentence.
    NSMutableDictionary<NSString *, NSNumber *> *wordsDic = [NSMutableDictionary dictionary];
    for (NSString *word in wordsArray) {
        // A missing key yields nil, whose integerValue is 0, so the first occurrence stores 1.
        wordsDic[word] = @([wordsDic[word] integerValue] + 1);
    }
    _wordsDic = wordsDic;
}

// Pass in two sentence objects to get the similarity between the two sentences.

/**
 Computes sentence similarity from the angle between the two word-frequency vectors.
 */
- (CGFloat)similarityPercentWithSentenceA:(BASentenceModel *)sentenceA sentenceB:(BASentenceModel *)sentenceB {
    __block NSInteger A = 0; // dot product of the two vectors
    __block NSInteger B = 0; // squared norm of the first sentence's vector
    __block NSInteger C = 0; // squared norm of the second sentence's vector
    [sentenceA.wordsDic enumerateKeysAndObjectsUsingBlock:^(NSString *key1, NSNumber *value1, BOOL * _Nonnull stop) {
        // A word missing from sentenceB gives nil here and contributes 0 to the dot product.
        NSNumber *value2 = [sentenceB.wordsDic objectForKey:key1];
        A += value1.integerValue * value2.integerValue;
        B += value1.integerValue * value1.integerValue;
    }];

    [sentenceB.wordsDic enumerateKeysAndObjectsUsingBlock:^(NSString *key2, NSNumber *value2, BOOL * _Nonnull stop) {
        C += value2.integerValue * value2.integerValue;
    }];

    // An empty sentence has a zero-length vector; avoid dividing by zero.
    if (B == 0 || C == 0) return 0;

    // Clamp to 1 so floating-point rounding cannot push acos outside its domain.
    CGFloat cosine = MIN(1.0, A / (sqrt(B) * sqrt(C)));
    // Counts are non-negative, so the angle is in [0, pi/2] and the result is in [0.5, 1].
    CGFloat percent = 1 - acos(cosine) / M_PI;

    return percent;
}
``````
##### Conclusion

Feel free to download the app and try it out, and leave a good review while you are at it. Thanks!
