[Natural Language Processing with Python] 2.1 Accessing Text Corpora


Added: 2013-5-23

Gutenberg Corpus (a small selection of texts from the Project Gutenberg electronic text archive)

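# List all the files in the corpus
import nltk
nltk.corpus.gutenberg.fileids()

# The same, with a shorter import
from nltk.corpus import gutenberg
gutenberg.fileids()
emma = gutenberg.words('austen-emma.txt')

# Select one of the works to operate on
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

# Basic counts for one file; fileid can be any name returned by gutenberg.fileids()
fileid = 'austen-emma.txt'
num_chars = len(gutenberg.raw(fileid))    # characters, including whitespace
num_words = len(gutenberg.words(fileid))  # tokens
num_sents = len(gutenberg.sents(fileid))  # sentences
num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # distinct lowercased tokens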
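The four counts above are usually combined into per-text figures such as average word length, average sentence length, and lexical diversity. Here is a minimal sketch of that step. The helper name `corpus_stats` is made up for this illustration and is not part of NLTK; it only assumes a reader object with `raw`, `words`, and `sents` methods, such as `nltk.corpus.gutenberg`.

```python
def corpus_stats(reader, fileid):
    """Return simple size ratios for one file of a corpus reader.

    `reader` is assumed to offer raw(fileid), words(fileid) and
    sents(fileid), as nltk.corpus.gutenberg does.
    """
    num_chars = len(reader.raw(fileid))
    num_words = len(reader.words(fileid))
    num_sents = len(reader.sents(fileid))
    num_vocab = len(set(w.lower() for w in reader.words(fileid)))
    return {
        # Characters per token; the raw text includes whitespace,
        # so this slightly overstates the true word length.
        "avg_word_len": num_chars / num_words,
        # Tokens per sentence.
        "avg_sent_len": num_words / num_sents,
        # Average number of times each distinct word is used.
        "lexical_diversity": num_words / num_vocab,
    }
```

With NLTK and the Gutenberg data installed, a call such as `corpus_stats(gutenberg, 'austen-emma.txt')` should return the three ratios for that novel.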

Web and Chat Text

from nltk.corpus import webtext
from nltk.corpus import nps_chat

Brown Corpus

# Some operations on the Brown Corpus:
import nltk
from nltk.corpus import brown
brown.categories()  # the categories (genres) in the corpus
brown.words(categories='news')  # access the words of one genre
brown.words(fileids=['cg22'])  # access the words of one file
brown.sents(categories=['news', 'editorial', 'reviews'])

# Use a conditional frequency distribution to gather some statistics
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre)
)
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
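To see what a conditional frequency distribution holds, it can help to build a tiny one by hand. The sketch below uses only the standard library and is not how NLTK implements `ConditionalFreqDist`. The function names and the toy word lists are assumptions made for this example. The idea is one counter per condition (genre), filled from `(condition, sample)` pairs.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Count samples separately for each condition.

    `pairs` is any iterable of (condition, sample) tuples.
    """
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

def tabulate(table, conditions, samples):
    """Return a text table: one row per condition, one column per sample."""
    header = " " * 10 + "".join(f"{s:>8}" for s in samples)
    rows = [
        f"{c:<10}" + "".join(f"{table[c][s]:>8}" for s in samples)
        for c in conditions
    ]
    return "\n".join([header] + rows)

# Toy data standing in for brown.words(categories=genre)
toy = {"news": ["can", "will", "will"], "humor": ["might", "can"]}
pairs = ((genre, word) for genre in toy for word in toy[genre])
table = conditional_counts(pairs)
print(tabulate(table, ["news", "humor"], ["can", "might", "will"]))
```

Restricting the table to chosen conditions and samples, as `tabulate` does here, is what keeps the output readable when the corpus has many genres and a large vocabulary.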

Reuters Corpus (news documents classified into 90 topics and split into two sets, training and test)
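In NLTK's Reuters reader the two sets show up as a prefix on each file identifier, `training/` or `test/`. A minimal sketch of separating them follows. The helper name `split_by_group` is hypothetical, and the sample identifiers are only written in the same shape as the real ones.

```python
def split_by_group(fileids):
    """Partition file identifiers into training and test lists by prefix."""
    training = [f for f in fileids if f.startswith("training/")]
    test = [f for f in fileids if f.startswith("test/")]
    return training, test

# Sample identifiers in the shape used by the Reuters reader
sample_ids = ["test/14826", "test/14828", "training/9865"]
training, test = split_by_group(sample_ids)
```

With NLTK and the corpus data installed, the same function can be given `reuters.fileids()`, and `reuters.categories()` lists the topics.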

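Inaugural Address Corpus

# Use a conditional frequency distribution to gather some statistics
import nltk
from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])  # fileid[:4] is the year of the address
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target)
)
cfd.plot()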
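The generator expression yields one `(target, year)` pair for every word that starts with one of the target prefixes, and `fileid[:4]` takes the year from file names such as `1789-Washington.txt`. The sketch below replays that logic on invented toy data with a plain `Counter` in place of NLTK. The function name and the sample documents are assumptions for illustration only.

```python
from collections import Counter

def target_year_pairs(docs, targets):
    """Yield (target, year) for each word starting with a target prefix.

    `docs` maps a file name that begins with a four-digit year to its words.
    """
    for fileid, words in docs.items():
        for w in words:
            for target in targets:
                if w.lower().startswith(target):
                    yield target, fileid[:4]

# Toy stand-in for the inaugural corpus (invented for illustration)
docs = {
    "1789-Example.txt": ["Citizens", "of", "America"],
    "1793-Example.txt": ["American", "citizen", "Americans"],
}
counts = Counter(target_year_pairs(docs, ["america", "citizen"]))
```

Because the match is on lowercased prefixes, "American" and "Americans" both count toward `america`, which is the behaviour the plot of usage over time relies on.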

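Annotated Text Corpora (containing linguistic annotations such as part-of-speech tags, named entities, syntactic structures, semantic roles, and so on)

Corpora in Other Languages

Text Corpus Structure

Loading Your Own Corpus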

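# Pattern (wildcard) symbols can be used in some places, such as the file pattern below
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')  # '.*' selects every file
wordlists.fileids()
wordlists.words('connectives')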



# A corpus stored on the local hard disk
from nltk.corpus import BracketParseCorpusReader
corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
file_pattern = r".*/wsj_.*\.mrg"
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
ptb.fileids()
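The second argument of both readers is a regular expression that selects files under `corpus_root`. The sketch below only imitates that selection with the standard `re` module on a list of invented relative paths. It assumes the pattern has to match the whole relative path; it is an approximation for illustration and not NLTK's own code, and `select_fileids` is a hypothetical name.

```python
import re

def select_fileids(paths, pattern):
    """Keep the paths that the regular expression matches in full."""
    return [p for p in paths if re.fullmatch(pattern, p)]

# Invented relative paths, as they might appear under a corpus root
paths = ["connectives", "words", "00/wsj_0001.mrg", "00/readme.txt"]

everything = select_fileids(paths, r".*")             # every file
mrg_only = select_fileids(paths, r".*/wsj_.*\.mrg")   # only the parsed WSJ files
single_char = select_fileids(paths, r".")             # only one-character names
```

Under this assumption a bare `.` selects nothing useful, which is why a pattern such as `.*` is the usual choice for "all files".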
