
    [Natural Language Processing with Python] 11.4 Working with XML / 11.5 Working with Toolbox Data

    Added: 2013-6-6

    11.4 Working with XML

    Using XML for Linguistic Structures

    (2) <entry>
    <headword>whale</headword>
    <pos>noun</pos>
    <gloss>any of the larger cetacean mammals having a streamlined
    body and breathing through a blowhole on the head
    </gloss>
    </entry>



    The Role of XML



    (For more background on the basics of XML, please consult other references.)



    The ElementTree Interface




    >>> import nltk
    >>> from nltk.etree.ElementTree import ElementTree
    >>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
    >>> merchant = ElementTree().parse(merchant_file)
    >>> merchant
    <Element PLAY at 22fa800>
    >>> merchant[0]
    <Element TITLE at 22fa828>
    >>> merchant[0].text
    'The Merchant of Venice'
    >>> merchant.getchildren()
    [<Element TITLE at 22fa828>, <Element PERSONAE at 22fa7b0>, <Element SCNDESCR at 2300170>,
    <Element PLAYSUBT at 2300198>, <Element ACT at 23001e8>, <Element ACT at 234ec88>,
    <Element ACT at 23c87d8>, <Element ACT at 2439198>, <Element ACT at 24923c8>]
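
    As a minimal sketch of the same interface, assuming Python 3 and the standard library's xml.etree.ElementTree in place of nltk.etree, the snippet below parses a tiny made-up PLAY document and navigates it by index. The XML string and the variable names are illustrative assumptions, not the real merchant.xml. Note that getchildren() was removed in Python 3.9, so the sketch iterates over the element directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for merchant.xml (not the real file).
PLAY_XML = """<PLAY>
<TITLE>The Merchant of Venice</TITLE>
<ACT><TITLE>ACT I</TITLE></ACT>
<ACT><TITLE>ACT II</TITLE></ACT>
</PLAY>"""

play = ET.fromstring(PLAY_XML)   # root element of the document
title = play[0]                  # children are reachable by index
child_tags = [child.tag for child in play]  # iterate instead of getchildren()

print(play.tag, title.tag, title.text)
print(child_tags)
```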





    We can use further methods to work with the XML. Here we search every line of the play for the word "music":




    >>> for i, act in enumerate(merchant.findall('ACT')):
    ...     for j, scene in enumerate(act.findall('SCENE')):
    ...         for k, speech in enumerate(scene.findall('SPEECH')):
    ...             for line in speech.findall('LINE'):
    ...                 if 'music' in str(line.text):
    ...                     print "Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text)
    Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
    Act 3 Scene 2 Speech 9: Fading in music: that the comparison
    Act 3 Scene 2 Speech 9: And what is music then? Then music is
    Act 5 Scene 1 Speech 23: And bring your music forth into the air.
    Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
    Act 5 Scene 1 Speech 23: And draw her home with music.
    Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
    Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
    Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
    Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
    Act 5 Scene 1 Speech 25: The man that hath no music in himself,
    Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
    Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
    Act 5 Scene 1 Speech 32: No better a musician than the wren.
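
    The nested findall() search above can be sketched as a reusable function. This is only an illustration, assuming Python 3 and the standard library's xml.etree.ElementTree; the sample XML and the name find_lines are hypothetical and do not come from the book or the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-act document using the same tag names as merchant.xml.
PLAY_XML = """<PLAY>
<ACT>
  <SCENE>
    <SPEECH><SPEAKER>A</SPEAKER><LINE>No song here.</LINE></SPEECH>
    <SPEECH><SPEAKER>B</SPEAKER><LINE>Let music sound.</LINE><LINE>And be still.</LINE></SPEECH>
  </SCENE>
</ACT>
<ACT>
  <SCENE>
    <SPEECH><SPEAKER>A</SPEAKER><LINE>Mark the music.</LINE></SPEECH>
  </SCENE>
</ACT>
</PLAY>"""

def find_lines(play, word):
    """Return (act, scene, speech, text) for every LINE containing word."""
    hits = []
    for i, act in enumerate(play.findall('ACT'), start=1):
        for j, scene in enumerate(act.findall('SCENE'), start=1):
            for k, speech in enumerate(scene.findall('SPEECH'), start=1):
                for line in speech.findall('LINE'):
                    # line.text is None for an empty element, so guard it
                    if word in (line.text or ''):
                        hits.append((i, j, k, line.text))
    return hits

hits = find_lines(ET.fromstring(PLAY_XML), 'music')
for i, j, k, text in hits:
    print("Act %d Scene %d Speech %d: %s" % (i, j, k, text))
```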



    We can also look at the sequence of speakers. A frequency distribution shows who has the most to say:




    >>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
    >>> speaker_freq = nltk.FreqDist(speaker_seq)
    >>> top5 = speaker_freq.keys()[:5]
    >>> top5
    ['PORTIA', 'SHYLOCK', 'BASSANIO', 'GRATIANO', 'ANTONIO']
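
    The session above relies on FreqDist.keys() returning keys in order of decreasing frequency, which holds for the NLTK version used here but not for NLTK 3. A minimal sketch of the same count, assuming Python 3 and collections.Counter as a stand-in for FreqDist, uses most_common() instead. The sample XML is invented for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical speaker data; only the tag names mirror merchant.xml.
PLAY_XML = """<PLAY><ACT><SCENE>
<SPEECH><SPEAKER>PORTIA</SPEAKER></SPEECH>
<SPEECH><SPEAKER>SHYLOCK</SPEAKER></SPEECH>
<SPEECH><SPEAKER>PORTIA</SPEAKER></SPEECH>
<SPEECH><SPEAKER>ANTONIO</SPEAKER></SPEECH>
<SPEECH><SPEAKER>PORTIA</SPEAKER></SPEECH>
<SPEECH><SPEAKER>SHYLOCK</SPEAKER></SPEECH>
</SCENE></ACT></PLAY>"""

play = ET.fromstring(PLAY_XML)
# A path expression walks several levels of the tree in one call.
speaker_seq = [s.text for s in play.findall('ACT/SCENE/SPEECH/SPEAKER')]
speaker_freq = Counter(speaker_seq)
# most_common(n) returns the n most frequent (item, count) pairs.
top2 = [name for name, count in speaker_freq.most_common(2)]
print(top2)
```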



    We can also look for patterns in who follows whom in the dialogue.




    >>> mapping = nltk.defaultdict(lambda: 'OTH')
    >>> for s in top5:
    ...     mapping[s] = s[:4]
    ...
    >>> speaker_seq2 = [mapping[s] for s in speaker_seq]
    >>> cfd = nltk.ConditionalFreqDist(nltk.ibigrams(speaker_seq2))
    >>> cfd.tabulate()
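
    To make the "who follows whom" idea concrete, here is a small sketch in plain Python 3, assuming a defaultdict of Counter objects as a rough substitute for ConditionalFreqDist and zip() in place of nltk.ibigrams. The speaker sequence is made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical speaker sequence and "top" list, for illustration only.
speaker_seq = ['PORTIA', 'SHYLOCK', 'PORTIA', 'NERISSA', 'PORTIA', 'SHYLOCK']
top = ['PORTIA', 'SHYLOCK']

# Abbreviate the top speakers; everyone else collapses to 'OTH'.
mapping = defaultdict(lambda: 'OTH')
for s in top:
    mapping[s] = s[:4]
abbrev = [mapping[s] for s in speaker_seq]

# Count each (previous speaker, next speaker) pair.
follows = defaultdict(Counter)
for prev, nxt in zip(abbrev, abbrev[1:]):
    follows[prev][nxt] += 1

# Print a simple table: rows are the previous speaker, columns the next one.
labels = sorted(set(abbrev))
print('     ' + ' '.join('%4s' % c for c in labels))
for row in labels:
    print('%4s ' % row + ' '.join('%4d' % follows[row][c] for c in labels))
```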



    Using ElementTree to Access Toolbox Data



    We can use toolbox.xml() to access a Toolbox file.




    >>> from nltk.corpus import toolbox
    >>> lexicon = toolbox.xml('rotokas.dic')



    We can access its contents by index like this:




    >>> lexicon[3][0]
    <Element lx at 77bd28>
    >>> lexicon[3][0].tag
    'lx'
    >>> lexicon[3][0].text
    'kaa'



    We can also access the XML content using paths:




    >>> [lexeme.text.lower() for lexeme in lexicon.findall('record/lx')]
    ['kaa', 'kaa', 'kaa', 'kaakaaro', 'kaakaaviko', 'kaakaavo', 'kaakaoko',
    'kaakasi', 'kaakau', 'kaakauko', 'kaakito', 'kaakuupato', ..., 'kuvuto']
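
    The two access styles, by index and by path, can be tried without the Rotokas corpus. The sketch below assumes Python 3 and the standard library's xml.etree.ElementTree; the two-record lexicon and its root tag are a hypothetical stand-in for what toolbox.xml() returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the tree returned by toolbox.xml('rotokas.dic').
LEXICON_XML = """<toolbox_data>
<record><lx>kaa</lx><ps>N</ps><ge>cooking banana</ge></record>
<record><lx>Kakarapaia</lx><ps>N</ps><ge>village name</ge></record>
</toolbox_data>"""

lexicon = ET.fromstring(LEXICON_XML)

# Access by index: first record, first field.
first_field = lexicon[0][0]
print(first_field.tag, first_field.text)

# Access by path: every lx field of every record.
lexemes = [lx.text.lower() for lx in lexicon.findall('record/lx')]
print(lexemes)
```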




    >>> import sys
    >>> from nltk.etree.ElementTree import ElementTree
    >>> tree = ElementTree(lexicon[3])
    >>> tree.write(sys.stdout)
    <record>
    <lx>kaa</lx>
    <ps>N</ps>
    <pt>MASC</pt>
    <cl>isi</cl>
    <ge>cooking banana</ge>
    <tkp>banana bilong kukim</tkp>
    <pt>itoo</pt>
    <sf>FLORA</sf>
    <dt>12/Aug/2005</dt>
    <ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
    <xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
    <xe>Taeavi planted banana in order to cook it.</xe>
    </record>
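
    Writing a single record back out as XML works much the same way in Python 3 with the standard library, with one caveat: to write to a text stream such as sys.stdout, pass encoding='unicode'. The sketch below assumes xml.etree.ElementTree and a made-up record.

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical record with a few of the fields shown above.
record = ET.fromstring(
    "<record><lx>kaa</lx><ps>N</ps><ge>cooking banana</ge></record>")

# encoding='unicode' makes write() emit str rather than bytes,
# which is what a text stream like sys.stdout expects.
ET.ElementTree(record).write(sys.stdout, encoding='unicode')
print()

# tostring() returns the same serialization as a Python string.
xml_text = ET.tostring(record, encoding='unicode')
```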



    Formatting Entries



    We can generate output in whatever format we need. Here we build an HTML table from ten lexical entries:




    >>> html = "<table>\n"
    >>> for entry in lexicon[70:80]:
    ...     lx = entry.findtext('lx')
    ...     ps = entry.findtext('ps')
    ...     ge = entry.findtext('ge')
    ...     html += " <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
    >>> html += "</table>"
    >>> print html
    <table>
     <tr><td>kakae</td><td>???</td><td>small</td></tr>
     <tr><td>kakae</td><td>CLASS</td><td>child</td></tr>
     <tr><td>kakaevira</td><td>ADV</td><td>small-like</td></tr>
     <tr><td>kakapikoa</td><td>???</td><td>small</td></tr>
     <tr><td>kakapikoto</td><td>N</td><td>newborn baby</td></tr>
     <tr><td>kakapu</td><td>V</td><td>place in sling for purpose of carrying</td></tr>
     <tr><td>kakapua</td><td>N</td><td>sling for lifting</td></tr>
     <tr><td>kakara</td><td>N</td><td>arm band</td></tr>
     <tr><td>Kakarapaia</td><td>N</td><td>village name</td></tr>
     <tr><td>kakarau</td><td>N</td><td>frog</td></tr>
    </table>
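
    The table-building loop can be wrapped in a function. The following is a sketch only, assuming Python 3; the function name entries_to_html, its parameters, and the sample records are hypothetical. Unlike the session above, it escapes field text with html.escape and substitutes a placeholder when a field is absent, since findtext() returns None in that case.

```python
from html import escape
import xml.etree.ElementTree as ET

def entries_to_html(entries, tags=('lx', 'ps', 'ge'), missing=''):
    """Render one table row per entry, one cell per requested field tag."""
    rows = []
    for entry in entries:
        cells = ''.join(
            '<td>%s</td>' % escape(entry.findtext(tag) or missing)
            for tag in tags)
        rows.append(' <tr>%s</tr>' % cells)
    return '<table>\n' + '\n'.join(rows) + '\n</table>'

# Hypothetical records; the second one has no ps field.
sample = ET.fromstring(
    "<toolbox_data>"
    "<record><lx>kakara</lx><ps>N</ps><ge>arm band</ge></record>"
    "<record><lx>a&amp;b</lx><ge>x</ge></record>"
    "</toolbox_data>")

html_table = entries_to_html(sample.findall('record'))
print(html_table)
```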



    11.5 Working with Toolbox Data



    Adding a Field to Each Entry




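    Example 11-2. Adding a new cv field to a lexical entry

    import re
    from nltk.etree.ElementTree import SubElement

    def cv(s):
        s = s.lower()
        s = re.sub(r'[^a-z]', r'_', s)    # anything that is not a letter becomes _
        s = re.sub(r'[aeiou]', r'V', s)   # vowels become V
        s = re.sub(r'[^V_]', r'C', s)     # everything else is a consonant, C
        return (s)

    def add_cv_field(entry):
        for field in entry:
            if field.tag == 'lx':
                cv_field = SubElement(entry, 'cv')
                cv_field.text = cv(field.text)
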
    >>> lexicon = toolbox.xml('rotokas.dic')
    >>> add_cv_field(lexicon[53])
    >>> print nltk.to_sfm_string(lexicon[53])
    \lx kaeviro
    \ps V
    \pt A
    \ge lift off
    \ge take off
    \tkp go antap
    \sc MOTION
    \vx 1
    \nt used to describe action of plane
    \dt 03/Jun/2005
    \ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu.
    \xp Pita i go antap na lukim haus win i bagarapim.
    \xe Peter went to look at the house that the wind destroyed.
    \cv CVVCVCV
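
    Example 11-2 adds the field to a single entry. As a rough sketch of applying it to every entry, assuming Python 3 and the standard library's xml.etree.ElementTree, the code below loops over all records of a small made-up lexicon. The sample data is hypothetical, and the loop iterates over a copy of the entry's fields so that appending the new element does not affect the iteration.

```python
import re
import xml.etree.ElementTree as ET

def cv(s):
    """Map a word to its consonant/vowel template, e.g. kaeviro -> CVVCVCV."""
    s = s.lower()
    s = re.sub(r'[^a-z]', '_', s)    # non-letters become _
    s = re.sub(r'[aeiou]', 'V', s)   # vowels become V
    s = re.sub(r'[^V_]', 'C', s)     # remaining letters are consonants
    return s

def add_cv_field(entry):
    # Iterate over a copy because SubElement appends to the entry.
    for field in list(entry):
        if field.tag == 'lx':
            ET.SubElement(entry, 'cv').text = cv(field.text)

# Hypothetical two-record lexicon.
lexicon = ET.fromstring(
    "<toolbox_data>"
    "<record><lx>kaeviro</lx><ps>V</ps></record>"
    "<record><lx>kaa</lx><ps>N</ps></record>"
    "</toolbox_data>")

for record in lexicon.findall('record'):
    add_cv_field(record)

cv_values = [r.findtext('cv') for r in lexicon.findall('record')]
print(cv_values)
```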



    Validating a Toolbox Lexicon



    Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or may order existing fields in a new way.



    For example, with the help of a FreqDist we can easily find field sequences whose frequency is unusual:




    >>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)
    >>> fd.items()
    [('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41), ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37),
    ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27), ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20),
    ..., ('lx:alt:rt:ps:pt:ge:eng:eng:eng:tkp:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe', 1)]
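
    The same check can be sketched in Python 3 with collections.Counter standing in for FreqDist. The three-record lexicon below is invented for illustration; the idea is simply that field sequences occurring only once are candidates for manual inspection.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical lexicon: the third record has ps and ge in the opposite order.
LEXICON_XML = """<toolbox_data>
<record><lx>a</lx><ps>N</ps><ge>x</ge></record>
<record><lx>b</lx><ps>V</ps><ge>y</ge></record>
<record><lx>c</lx><ge>z</ge><ps>N</ps></record>
</toolbox_data>"""

lexicon = ET.fromstring(LEXICON_XML)

# One colon-joined string of field tags per record, then count them.
seq_counts = Counter(
    ':'.join(field.tag for field in record)
    for record in lexicon.findall('record'))

# Sequences seen only once are the unusual ones worth a second look.
rare = [seq for seq, n in seq_counts.items() if n == 1]
print(seq_counts.most_common())
print(rare)
```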
