• 我 在 北京 天安门 ("I am at Tiananmen in Beijing")
• 在 我 北京 天安门
• 北京 我 在 天安门
• 北京 在 我 天安门
• ...

# Language Models

## Probability Representation

The probability of a sentence $S$ consisting of words $w_1,\cdots,w_n$ is the joint probability of its words:

$$P(S) = P(w_1,w_2,\cdots,w_n)$$

Expanding by the chain rule of conditional probability:

$$P(S) = P(w_1,w_2,\cdots,w_n) = P(w_1) \cdot P(w_2\vert w_1) \cdot P(w_3\vert w_1,w_2) \cdots P(w_n\vert w_1,w_2,\cdots,w_{n-1})$$
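
For example, for the sentence $S$ = 我 在 北京 天安门 from the introduction, the chain rule expands to:

$$P(S) = P(\text{我}) \cdot P(\text{在}\vert \text{我}) \cdot P(\text{北京}\vert \text{我},\text{在}) \cdot P(\text{天安门}\vert \text{我},\text{在},\text{北京})$$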

## The Markov Assumption

Assuming each word depends only on the word immediately before it (a first-order Markov assumption), the expansion collapses to:

$$P(S) = P(w_1,w_2,\cdots,w_n) \approx P(w_1) \cdot P(w_2\vert w_1) \cdot P(w_3\vert w_2) \cdots P(w_n\vert w_{n-1})$$
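
Under this first-order (bigram) assumption, the same example sentence simplifies to:

$$P(S) \approx P(\text{我}) \cdot P(\text{在}\vert \text{我}) \cdot P(\text{北京}\vert \text{在}) \cdot P(\text{天安门}\vert \text{北京})$$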

# Probability Estimation

1. A sentence can be viewed as one permutation of a sequence of words.
2. Each permutation can be scored with the conditional-probability formula above, yielding a probability that reflects how plausible the sentence is.
3. Introducing the Markov assumption simplifies the computation of this sentence probability.

Each conditional probability can then be estimated from corpus counts, where $\#(\cdot)$ denotes a count over the training corpus and a bare $\#$ is the total number of tokens:

$$P(w_i\vert w_{i-1}) = \frac{P(w_{i-1},w_i)}{P(w_{i-1})} = \frac{\frac{\#(w_{i-1},w_i)}{\#}}{\frac{\#(w_{i-1})}{\#}} = \frac{\#(w_{i-1},w_i)}{\#(w_{i-1})}$$
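
A minimal sketch of this maximum-likelihood estimate, using a hypothetical two-sentence toy corpus (the corpus and the function name `mle` are illustrative, not part of the implementation below):

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of tokens.
corpus = [
    ['我', '在', '北京', '天安门'],
    ['我', '在', '北京'],
]

unigram = Counter()   # #(w_{i-1}): unigram counts
bigram = Counter()    # #(w_{i-1}, w_i): bigram counts
for sentence in corpus:
    unigram.update(sentence)
    bigram.update(zip(sentence, sentence[1:]))

def mle(prev, word):
    """P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})."""
    return bigram[(prev, word)] / unigram[prev]

print(mle('我', '在'))        # 2/2 = 1.0
print(mle('北京', '天安门'))  # 1/2 = 0.5
```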

# Model Training & the Zero-Probability Problem

Maximum-likelihood counts assign probability zero to any N-gram that never occurs in the training data. Common smoothing methods that address this include:

• Additive smoothing
• Good-Turing estimation
• Katz smoothing
• Jelinek-Mercer smoothing
• Witten-Bell smoothing
• Absolute discounting
• Kneser-Ney smoothing

Additive smoothing, used in the implementation below, adds a constant $\lambda$ to every count, where $\vert V \vert$ is the vocabulary size:

$$P(w_i\vert w_{i-1}) = \frac{\#(w_{i-1},w_i) + \lambda}{\#(w_{i-1}) + \lambda\vert V \vert}$$
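
A minimal sketch of this add-$\lambda$ estimate, reusing the same hypothetical toy corpus (all names are illustrative):

```python
from collections import Counter

# Same hypothetical toy corpus as before.
corpus = [['我', '在', '北京', '天安门'], ['我', '在', '北京']]
unigram, bigram = Counter(), Counter()
for s in corpus:
    unigram.update(s)
    bigram.update(zip(s, s[1:]))

lam = 1.0            # the additive constant lambda
V = len(unigram)     # vocabulary size |V|

def add_lambda(prev, word):
    """P(w_i | w_{i-1}) = (#(w_{i-1},w_i) + lam) / (#(w_{i-1}) + lam*|V|)."""
    return (bigram[(prev, word)] + lam) / (unigram[prev] + lam * V)

print(add_lambda('北京', '天安门'))  # seen bigram: (1+1)/(2+4) ≈ 0.33
print(add_lambda('天安门', '我'))    # unseen bigram, yet nonzero: (0+1)/(1+4) = 0.2
```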

# Implementation

```python
# coding:utf-8
"""
Program: A simple language model implementation.
Description: A bare-bones N-gram language model; N is a configurable
             parameter.
Author: Flyaway - flyaway1217@gmail.com
Date: 2014-01-16 19:23:29
Last modified: 2014-01-16 21:54:08
Python release: 3.2.3
"""
import math


class LM:
    def __init__(self, train_path, step=2, lam=1):
        # Punctuation marks that are stripped from the training data.
        self.punctuation = set(["，", "、", "。", "？", "“", "”", "！",
                                "；", "‘", "’", "《", "》", "%"])
        self.step = step    # N-gram order
        self.lam = lam      # additive smoothing constant
        self.words = []     # tokenized training sentences
        self.freq = {}      # freq[context][word] = count

        # Read the training data and remove punctuation.
        with open(train_path) as f:
            unique = set()
            for line in f:
                s = [item.strip() for item in line.strip().split()
                     if item not in self.punctuation]
                # Gather the unique words.
                unique.update(s)
                # Add the start/end symbol (the empty string).
                s.insert(0, '')
                s.append('')
                self.words.append(s)
        self.size = len(unique)   # vocabulary size |V|

    def getNgram(self, sentence):
        """Yield (context, word) pairs, where the context is a tuple of
        up to step-1 preceding words."""
        for e in range(1, len(sentence)):
            s = max(0, e - self.step + 1)
            yield tuple(sentence[s:e]), sentence[e]

    def train(self):
        # Accumulate conditional frequencies #(context, word).
        for sentence in self.words:
            for cond, w in self.getNgram(sentence):
                self.freq.setdefault(cond, {})
                self.freq[cond][w] = self.freq[cond].get(w, 0) + 1

    def getProb(self, word, cond):
        """Additively smoothed conditional probability P(word | cond)."""
        counts = self.freq.get(cond, {})
        cond_num = sum(counts.values())
        w_num = counts.get(word, 0)
        # Additive (add-lambda) smoothing keeps unseen N-grams nonzero.
        return (w_num + self.lam) / (cond_num + self.size * self.lam)

    def sentenceProb(self, sentence):
        """Log2-probability of a whitespace-tokenized sentence."""
        mysentence = sentence.strip().split()
        mysentence.insert(0, '')
        mysentence.append('')
        result = 0
        for cond, w in self.getNgram(mysentence):
            result += math.log(self.getProb(w, cond), 2)
        return result


if __name__ == '__main__':
    inpath = './Data/train.in'
    print('Reading data...')
    lm = LM(inpath, 2)
    print('Start training...')
    lm.train()
    s1 = '我 在 北京 天安门'
    s2 = '在 我 北京 天安门'
    s3 = '北京 我 在 天安门'
    s4 = '北京 在 我 天安门'
    for s in [s1, s2, s3, s4]:
        print(s, end=":")
        print(lm.sentenceProb(s))
```
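
Trained on a whitespace-tokenized corpus at `./Data/train.in` (one sentence per line), the script scores the four word-order permutations from the introduction. Since each score is a sum of per-bigram log terms, a higher (less negative) value marks the more plausible order, and with enough training data the grammatical order 我 在 北京 天安门 should come out on top.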
