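The remarkable thing about the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So given a new text, you should retrain the model on it and then apply the retrained model to that text, e.g.: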
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
>>> tokenizer._params.abbrev_types
{'f', 'fr', 'j'}
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
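If there is not enough data to produce good statistics when retraining the model, you can also supply a predetermined list of abbreviations before training; see
How to avoid NLTK's sentence tokenizer splitting on abbreviations?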
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> punkt_param = PunktParameters()
>>> abbreviation = ['f', 'fr', 'k']
>>> punkt_param.abbrev_types = set(abbreviation)
>>> tokenizer = PunktSentenceTokenizer(punkt_param)
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
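To see why seeding the abbreviation set matters, here is a deliberately simplified, standard-library-only sketch. It is not Punkt and not how NLTK implements it; `split_sentences` is a hypothetical helper written for illustration. Real Punkt also weighs collocations and orthographic context. The sketch only shows the basic idea that a period following a known abbreviation (stored lowercase, without the trailing period, as in `abbrev_types` above) is not treated as a sentence boundary.

```python
import re


def split_sentences(text, abbrev_types):
    """Toy splitter (NOT the Punkt algorithm): break after '.', '!' or '?'
    followed by whitespace, unless the token before a period is listed
    in abbrev_types (lowercase, without the trailing period)."""
    sentences = []
    start = 0
    # Each match is: a run of non-space characters, one terminator, whitespace.
    for match in re.finditer(r"(\S+)([.!?])\s+", text):
        token, punct = match.group(1), match.group(2)
        if punct == "." and token.lower() in abbrev_types:
            continue  # known abbreviation: do not split here
        sentences.append(text[start:match.end(2)])
        start = match.end()
    # Whatever is left after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Fr. Gallagher proposed the plan in 1952. Rockwell Hall was dedicated in 1958."
with_abbrevs = split_sentences(sample, {"f", "fr", "k"})
without_abbrevs = split_sentences(sample, set())
print(with_abbrevs)
print(without_abbrevs)
```

With the abbreviation set, `Fr.` stays attached to its sentence; with an empty set, the toy splitter wrongly breaks after `Fr.`. That is the failure mode the predetermined list guards against when the training text is too small for Punkt to learn the abbreviation by itself.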