An Approach to the POS Tagging Problem Using Genetic Algorithms

Automatic part-of-speech tagging is the process of assigning a part-of-speech (POS) tag to each word of a text. The words of a language are grouped into grammatical categories that represent the function they may have in a sentence. These grammatical classes (or categories) are usually called parts of speech. In most languages, however, a large number of words can be used in different ways and therefore have more than one possible part of speech. To choose the right tag for a particular word, a POS tagger must consider the parts of speech of the surrounding words. The neighboring words may themselves have more than one possible tag. To solve the problem, we therefore need a method to disambiguate each word’s set of possible tags. In this work, we model part-of-speech tagging as a combinatorial optimization problem, which we solve using a genetic algorithm. The search for the best combinatorial solution is guided by a set of disambiguation rules that we first discover using a classification algorithm, which also includes a genetic algorithm. Using rules to disambiguate the tagging allowed us to generalize the context information present in the training tables adopted by approaches based on probabilistic data. We were also able to incorporate other types of information that help to identify a word’s grammatical class. The results obtained on two different corpora are among the best published.
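To make the combinatorial formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy lexicon (`LEXICON`) and a handful of hand-written scores over pairs of adjacent tags (`RULES`), which stand in for the disambiguation rules that the paper learns from a corpus. Every name, tag and parameter value below is hypothetical. An individual is one candidate tag sequence for the sentence, and its fitness is the total score of the rules it satisfies.

```python
import random

# Hypothetical toy lexicon: each word maps to its candidate tags.
LEXICON = {
    "the": ["DET"],
    "old": ["ADJ", "NOUN"],
    "man": ["NOUN", "VERB"],
    "boats": ["NOUN", "VERB"],
}

# Hypothetical hand-written scores for (previous tag, current tag) pairs.
# They stand in for learned disambiguation rules; "<s>" marks sentence start.
RULES = {
    ("<s>", "DET"): 1.0,
    ("DET", "ADJ"): 1.0,
    ("DET", "NOUN"): 1.0,
    ("ADJ", "NOUN"): 1.0,
    ("NOUN", "VERB"): 1.0,
    ("VERB", "NOUN"): 1.0,
}


def fitness(tags):
    """Sum the scores of all adjacent tag pairs in a candidate tagging."""
    score = 0.0
    prev = "<s>"
    for tag in tags:
        score += RULES.get((prev, tag), 0.0)
        prev = tag
    return score


def random_individual(words, rng):
    """Pick one candidate tag per word, uniformly at random."""
    return [rng.choice(LEXICON[w]) for w in words]


def crossover(a, b, rng):
    """One-point crossover; positions line up, so the child stays valid."""
    if len(a) < 2:
        return a[:]
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(words, tags, rng, rate):
    """Resample each tag from its word's candidates with probability `rate`."""
    return [
        rng.choice(LEXICON[w]) if rng.random() < rate else t
        for w, t in zip(words, tags)
    ]


def tag_sentence(words, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Search for a high-scoring tag sequence with a simple elitist GA."""
    rng = random.Random(seed)
    population = [random_individual(words, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection
        children = [population[0]]             # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(words, crossover(a, b, rng), rng, mutation_rate))
        population = children
    return max(population, key=fitness)


if __name__ == "__main__":
    sentence = ["the", "old", "man", "boats"]
    print(list(zip(sentence, tag_sentence(sentence))))
```

Because each gene is drawn only from the candidate tags of its own word, crossover and mutation never produce an invalid tagging. The search space is the Cartesian product of the per-word tag sets, which is the sense in which the task is combinatorial.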
