[Natural Language Processing with Python] 7.5 Named Entity Recognition / 7.6 Relation Extraction

Added: 2013-5-31

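7.5 Named Entity Recognition (NER)

The goal is to identify every named entity mentioned in a text.

The task splits into two subtasks: identifying the boundaries of each named entity (NE), and identifying its type.

NLTK provides a pre-trained classifier that recognizes named entities. If we set the parameter binary=True, named entities are tagged only as NE, with no type label. The following code shows both modes: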
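The two subtasks can be made concrete with a small sketch. It assumes the entities are encoded as IOB tags (B-TYPE opens an entity, I-TYPE continues it, O is outside), which is an assumption of this example rather than something this section prescribes. The function name `decode_iob` and the sample tags are hypothetical. Boundaries come from the B-/I- prefixes, and the type comes from the suffix.

```python
def decode_iob(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_words, current_type
        if current_words:
            entities.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # Boundary subtask: a B- tag always starts a new entity.
            flush()
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Same type as the open entity: extend it.
            current_words.append(token)
        else:
            # "O", or an I- tag with no matching open entity: treat as outside.
            flush()
    flush()
    return entities


# Hand-written tags for illustration only (not classifier output).
tokens = ["The", "U.S.", "is", "one", "according", "to", "Brooke", "T.", "Mossman"]
tags = ["O", "B-GPE", "O", "O", "O", "O", "B-PERSON", "I-PERSON", "I-PERSON"]
found = decode_iob(tokens, tags)
print(found)
```

Dropping the type suffix from each pair would give the untyped view, where only the boundaries are known.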

>>> import nltk
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print(nltk.ne_chunk(sent, binary=True))
(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (NE Brooke/NNP T./NNP Mossman/NNP)
  ...)
>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)
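To compare the two modes without NLTK installed, the sketch below pulls the entities out of the printed bracket notation with a regular expression. It assumes the flat one-level chunk format shown above, with the elided "..." lines dropped; the helper name `entities_from_printed_tree` is hypothetical, and this is not how NLTK itself reads its trees.

```python
import re

# Matches one flat chunk such as "(PERSON Brooke/NNP T./NNP Mossman/NNP)".
# The outer "(S ...)" never matches because it contains nested parentheses.
CHUNK = re.compile(r"\((\w+)((?:\s+[^\s()]+/[^\s()]+)+)\s*\)")


def entities_from_printed_tree(text):
    """Return (label, entity_text) pairs found in a printed chunk tree."""
    found = []
    for label, body in CHUNK.findall(text):
        # Each token looks like word/TAG; keep the part before the last slash.
        words = [tok.rsplit("/", 1)[0] for tok in body.split()]
        found.append((label, " ".join(words)))
    return found


typed = """(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP))"""

# Simulate the binary=True output by collapsing every type label to NE.
binary = typed.replace("GPE", "NE").replace("PERSON", "NE")

typed_entities = entities_from_printed_tree(typed)
binary_entities = entities_from_printed_tree(binary)
print(typed_entities)
print(binary_entities)
```

Both runs find the same spans; only the labels differ, which is the whole effect of binary=True.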

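7.6 Relation Extraction

Once the named entities in a text have been identified, we can extract the relations that hold between them.

One way to do this is to look for all triples of the form (X, α, Y), where X and Y are named entities of the required types and α is the string of words between them. We can then use a regular expression to pick out just those instances of α that express the relation we are looking for. The following example searches for strings that contain the word in.

The special regular expression (?!\b.+ing\b) is a negative lookahead assertion. It lets us disregard strings such as success in supervising the transition of, where in is followed by a gerund.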
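The lookahead can be tried on its own with the standard `re` module before it is run against a corpus. This is a small sketch; the pattern is written here as `r".*\bin\b(?!\b.+ing)"`, and the last test string is made up for illustration.

```python
import re

# "in" as a whole word, with nothing ending in "ing" anywhere after it.
IN = re.compile(r".*\bin\b(?!\b.+ing)")

fillers = [
    "in",
    ", the research group in",
    "success in supervising the transition of",
    "offices in the building",  # made-up example, not from any corpus
]

# match() anchors at the start of the string, hence the leading ".*".
results = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, accepted in results.items():
    print(accepted, repr(filler))
```

The last case shows that the filter is coarse: any later word ending in ing causes a rejection, so a noun such as building is treated the same way as a gerund.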

>>> import re
>>> IN = re.compile(r'.*\bin\b(?!\b.+ing)')
>>> for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
...     for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
...                                      corpus='ieer', pattern=IN):
...         # rtuple() is the NLTK 3 name; NLTK 2 called it show_raw_rtuple()
...         print(nltk.sem.rtuple(rel))
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
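The (X, α, Y) search itself can be sketched in plain Python. The version below is a simplified stand-in, not NLTK's implementation: it assumes the sentence has already been reduced to a list of plain words and (type, text) entity tuples, it only pairs each entity with the next one, and the function name `extract_triples` and all entity names in the sample are invented.

```python
import re

IN = re.compile(r".*\bin\b(?!\b.+ing)")


def extract_triples(chunks, subj_type, obj_type, pattern):
    """Collect (X, alpha, Y) for adjacent entities of the requested types.

    `chunks` mixes plain word strings with (entity_type, entity_text) tuples.
    """
    triples = []
    prev_entity = None  # most recent entity seen so far
    filler = []         # words seen since that entity
    for item in chunks:
        if isinstance(item, tuple):
            if prev_entity is not None:
                alpha = " ".join(filler)
                if (prev_entity[0] == subj_type and item[0] == obj_type
                        and pattern.match(alpha)):
                    triples.append((prev_entity[1], alpha, item[1]))
            # The current entity becomes the left-hand side of the next pair.
            prev_entity, filler = item, []
        else:
            filler.append(item)
    return triples


# Invented sentence fragments for illustration only.
chunks = [
    ("ORG", "Acme"), "in", ("LOC", "Springfield"), "and",
    ("ORG", "Globex"), ",", "based", "in", ("LOC", "Shelbyville"), "praised",
    ("ORG", "Initech"), "for", "success", "in", "supervising", "the",
    "move", "to", ("LOC", "Ogdenville"),
]
triples = extract_triples(chunks, "ORG", "LOC", IN)
for x, alpha, y in triples:
    print(f"[ORG: {x!r}] {alpha!r} [LOC: {y!r}]")
```

The third ORG/LOC pair is dropped because its filler contains in followed by a gerund, which the lookahead rejects.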

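As shown earlier, the Dutch section of the CoNLL 2002 Named Entity Corpus contains part-of-speech tags as well as named entity annotation. This lets us design patterns that are sensitive to those tags, as the next example shows. The clause() method (named show_clause() in NLTK 2) prints each relation in clausal form, with the binary relation symbol given as the value of the relsym parameter.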

>>> from nltk.corpus import conll2002
>>> vnv = """
... (
... is/V|    # 3rd sing present and
... was/V|   # past forms of the verb zijn ('be')
... werd/V|  # and also present
... wordt/V  # past of worden ('become')
... )
... .*       # followed by anything
... van/Prep # followed by van ('of')
... """
>>> VAN = re.compile(vnv, re.VERBOSE)
>>> for doc in conll2002.chunked_sents('ned.train'):
...     for r in nltk.sem.extract_rels('PER', 'ORG', doc,
...                                    corpus='conll2002', pattern=VAN):
...         # clause() is the NLTK 3 name; NLTK 2 called it show_clause()
...         print(nltk.sem.clause(r, relsym="VAN"))
VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')
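The tag-sensitive pattern and the clausal output can both be tried without the corpus. In this sketch the word/tag filler strings are made up, and `to_clause` is a hypothetical helper that imitates the output format shown above (lowercased names joined with underscores); NLTK's own formatting may differ in details.

```python
import re

VNV = r"""
(
is/V|     # present of zijn ('be')
was/V|    # past of zijn
werd/V|   # past of worden ('become')
wordt/V   # present of worden
)
.*        # followed by anything
van/Prep  # followed by van ('of')
"""
# re.VERBOSE ignores the whitespace and the # comments inside the pattern.
VAN = re.compile(VNV, re.VERBOSE)


def to_clause(relsym, subj, obj):
    """Format a relation as RELSYM('subject', 'object')."""
    def sym(name):
        return name.lower().replace(" ", "_")
    return f"{relsym}({sym(subj)!r}, {sym(obj)!r})"


# Made-up tagged fillers, written as space-separated word/tag tokens.
fillers = [
    "is/V voorzitter/N van/Prep",
    "wordt/V directeur/N van/Prep",
    "en/Conj voorzitter/N van/Prep",  # does not start with one of the verbs
    "is/V lid/N",                     # no van/Prep
]
matches = [bool(VAN.match(f)) for f in fillers]
print(matches)
print(to_clause("VAN", "Annie Lennox", "Eurythmics"))
print(to_clause("VAN", "Cornet d'Elzius", "Buitenlandse Handel"))
```

Because the pattern is applied with match(), the verb must come first in the filler, which is why the third string is rejected even though it contains van/Prep.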
