
How can I improve NLTK sentence segmentation?

  •  9
  • Abdulrahman Bres Cristiana Chavez  · Tech Community  · 7 years ago

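    An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.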

    I run NLTK's nltk.sent_tokenize on this text to get the sentences. This returns:

    ['An ambitious campus expansion plan was proposed by Fr.', 
    'Vernon F. Gallagher in 1952.', 
    'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', 
    'It was during the tenure of Fr.', 
    "Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
     ] 
    

    While NLTK can treat a name such as Henry J. McAnulty as one entity, it fails on Fr., which splits the sentence in two.

    The correct tokenization should be:

    [
    'An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 
    'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', 
    "It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
     ] 
    

    1 answer  |  last active 3 years ago
        1
  •  13
  •   alvas    7 years ago

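    The great thing about the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So, given a new text, you can retrain the model on that text and then apply it, e.g.: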

    >>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
    >>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
    
    # Training a new model with the text.
    >>> tokenizer = PunktSentenceTokenizer()
    >>> tokenizer.train(text)
    <nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
    
    # It automatically learns the abbreviations.
    >>> tokenizer._params.abbrev_types
    {'f', 'fr', 'j'}
    
    # Use the customized tokenizer.
    >>> tokenizer.tokenize(text)
    ['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
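
    The retraining step can also be written with an explicit PunktTrainer. This is only a minimal sketch: the names `trainer`, `learned_params` and `custom_tokenizer` are my own, and the abbreviations learned from such a short text may differ between NLTK versions. It passes the learned parameters to the constructor explicitly, so it does not rely on `train` updating the tokenizer in place.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to "
    "action."
)

# Collect statistics from the text; with finalize=False, train() can be
# called again with more text before the parameters are finalized.
trainer = PunktTrainer()
trainer.train(text, finalize=False)

# get_params() finalizes training and returns a PunktParameters object.
learned_params = trainer.get_params()

# Build a tokenizer that uses exactly these learned parameters.
custom_tokenizer = PunktSentenceTokenizer(learned_params)

# Inspect what was learned, then tokenize.
print(sorted(learned_params.abbrev_types))
sentences = custom_tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

    In practice you would feed the trainer much more text than a single paragraph, since the statistics are only as good as the data behind them.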
    

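    When retraining the model, if there is not enough data to produce good statistics, you can also supply a predefined list of abbreviations before training; see How to avoid NLTK's sentence tokenizer spliting on abbreviations?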

    >>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
    
    >>> punkt_param = PunktParameters()
    >>> abbreviation = ['f', 'fr', 'k']
    >>> punkt_param.abbrev_types = set(abbreviation)
    
    >>> tokenizer = PunktSentenceTokenizer(punkt_param)
    >>> tokenizer.train(text)
    <nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
    
    >>> tokenizer.tokenize(text)
    ['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
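
    If you only need the predefined abbreviations, a sketch without any training step is shown below. It assumes the paragraph from the question (with "Fr. Henry J. McAnulty") and a hypothetical one-entry abbreviation set; entries are lowercase and written without the trailing period, as in the example above. The variable names are illustrative.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Paragraph from the question.
question_text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to "
    "action."
)

# Start from empty parameters and register "Fr." as a known abbreviation.
manual_params = PunktParameters()
manual_params.abbrev_types = {"fr"}

# Passing a PunktParameters object (not a string) skips training and
# uses the given parameters directly.
manual_tokenizer = PunktSentenceTokenizer(manual_params)

manual_sentences = manual_tokenizer.tokenize(question_text)
for sentence in manual_sentences:
    print(sentence)
```

    Under these assumptions the output should match the three sentences expected in the question. Single-letter initials such as "F." and "J." do not need to be listed here, since Punkt has a separate heuristic for initials followed by a capitalized word.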