[Natural Language Processing with Python] 7.3 Developing and Evaluating Chunkers
Reading IOB Format and the CoNLL-2000 Chunking Corpus
The CoNLL-2000 corpus consists of text that has already been POS-tagged and chunked using IOB notation. The corpus provides three chunk types: NP, VP, and PP.
For example:
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...
The function nltk.chunk.conllstr2tree() builds a tree representation from a string in this format.
For example:
>>> import nltk
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
Running this opens a window displaying the resulting chunk tree.
We can access the CoNLL-2000 chunking corpus as follows:
# Access the chunked corpus files
>>> from nltk.corpus import conll2000
>>> print conll2000.chunked_sents('train.txt')[99]
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)
# If we are only interested in NP chunks, we can write:
>>> print conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)
Simple Evaluation and Baselines
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> grammar = r"NP: {<[CDJNP].*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> print cp.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 87.7%
Precision: 70.6%
Recall: 67.8%
F-Measure: 69.2%
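As a point of reference, the NLTK book also measures a trivial baseline: a RegexpParser with an empty grammar creates no chunks at all, yet it still scores well above zero on IOB accuracy, simply because so many tokens are correctly left outside any chunk; its precision and recall on whole chunks are zero. A quick sketch:
>>> cp = nltk.RegexpParser("")      # creates no chunks; every token is tagged O
>>> print cp.evaluate(test_sents)   # nonzero IOB accuracy, 0% precision/recall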
We can construct a chunker using a unigram tagger. Example 7-4 defines such a chunker: its constructor trains the tagger, and its parse method chunks new sentences.
Example 7-4. Noun phrase chunking with a unigram tagger.
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Flatten each training tree into (POS, chunk-tag) pairs and train
        # a unigram tagger to map POS tags to IOB chunk tags
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        # Tag the POS sequence with chunk tags, then rebuild a chunk tree
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
Note how the parse method works:
1. Take a POS-tagged sentence as input.
2. Extract the part-of-speech tags from that sentence.
3. Use the tagger trained in the constructor, self.tagger, to assign an IOB chunk tag to each POS tag.
4. Pair the chunk tags back up with the words of the original sentence.
5. Combine the resulting triples into a chunk tree (the conversion helpers are demonstrated in the sketch below).
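The two helpers at the heart of this workflow are nltk.chunk.tree2conlltags() and nltk.chunk.conlltags2tree(), which convert between chunk trees and (word, POS, IOB-tag) triples. A minimal round-trip sketch, reusing sentence 99 from above; given the tree printed earlier, we would expect:
>>> sent = conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
>>> triples = nltk.chunk.tree2conlltags(sent)
>>> triples[:4]
[('Over', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('cup', 'NN', 'I-NP'), ('of', 'IN', 'O')]
>>> nltk.chunk.conlltags2tree(triples) == sent   # the round trip preserves the tree
True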
Once the chunker class is defined, we train it on the chunking corpus:
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> unigram_chunker = UnigramChunker(train_sents)
>>> print unigram_chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 92.9%
Precision: 79.9%
Recall: 86.8%
F-Measure: 83.2%
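As a quick sanity check, we can chunk a freshly POS-tagged sentence (an illustrative, made-up example, not from the corpus). Given the tag-to-chunk mapping printed below, where DT, JJ, and NN map into NP tags while VBD and IN map to O, we would expect output along these lines:
>>> sent = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
...         ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
>>> print unigram_chunker.parse(sent)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))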
# We can inspect what the chunker has learned with the following code:
>>> postags = sorted(set(pos for sent in train_sents
...                      for (word, pos) in sent.leaves()))
>>> print unigram_chunker.tagger.tag(postags)
[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
 (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
 ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
 ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
 ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
 ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
 ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
 ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
 ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
 ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]
Similarly, we can build a chunker based on a bigram tagger.
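Following the NLTK book, BigramChunker is identical to UnigramChunker except that its constructor trains an nltk.BigramTagger, which conditions each chunk tag on the previous chunk tag as well as the current POS tag. A minimal sketch under that assumption:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        # The only change from UnigramChunker: a bigram tagger, which also
        # looks at the previous chunk tag when tagging each POS tag
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)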
>>> bigram_chunker = BigramChunker(train_sents)
>>> print bigram_chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 93.3%
Precision: 82.3%
Recall: 86.8%
F-Measure: 84.5%
Training Classifier-Based Chunkers
Both of the chunkers discussed so far, the regular-expression chunker and the n-gram chunker, decide what chunks to create entirely on the basis of part-of-speech tags. However, sometimes POS tags are insufficient to determine how a sentence should be chunked. For example:
(3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
    b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.
Although the tag sequences are identical, the sentences are clearly chunked differently: in the first, "the farmer" and "rice" are separate NPs, while in the second, "my computer monitor" is a single NP. So we need to use information about the content of the words to supplement the POS tags. One way to do this is to use a classifier-based tagger to chunk the sentence. The basic code for a classifier-based NP chunker is shown below:
# The second class below is essentially a wrapper around the tagger that turns it into a chunker.
# During training, it maps the chunk trees in the training corpus to tag sequences;
# in its parse method, it converts the tag sequence produced by the tagger back into a chunk tree.
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            # Each training "sentence" is a list of ((word, POS), chunk-tag) pairs
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # 'megam' requires the external megam binary; if it is unavailable,
        # another algorithm such as 'iis' or 'gis' can be substituted
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Convert each chunk tree into ((word, POS), chunk-tag) pairs
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        # Tag the sentence, then convert the output back into a chunk tree
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
Next, we define a feature extractor. We begin with a bare-bones version that provides just the POS tag of the current token:
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 92.9%
Precision: 79.9%
Recall: 86.7%
F-Measure: 83.2%
With only the current POS tag as a feature, this classifier performs much like the unigram chunker. We can improve it by also adding the previous POS tag as a feature:
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 93.6%
Precision: 81.9%
Recall: 87.1%
F-Measure: 84.4%
Rather than using POS tags alone as features, we can also add the content of the current word:
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 94.2%
Precision: 83.4%
Recall: 88.6%
F-Measure: 85.9%
We can keep experimenting with additional feature extractors to improve the chunker's performance. The code below adds a lookahead feature (the next word's POS tag), paired tag features, and a more complex contextual feature, tags-since-dt, which builds a string describing all the POS tags encountered since the most recent determiner (see the worked example after the results):
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,
...             "prevpos+pos": "%s+%s" % (prevpos, pos),
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()    # reset at each determiner
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 95.9%
Precision: 88.3%
Recall: 90.7%
F-Measure: 89.5%
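To make the tags-since-dt feature concrete, here is a hand-run example on an illustrative, made-up sentence. For the word dog at index 3, the most recent determiner is "the" at index 0, so only the adjective tags seen since then survive:
>>> tags_since_dt([('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'),
...                ('dog', 'NN')], 3)
'JJ'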