A Generalized LCS Algorithm and Its Application to Corpus Alignment

This paper addresses the problem of text variation, which often hinders the interoperable use or reuse of corpora and annotations. It presents a systematic solution based on a variation of the Longest Common Subsequence (LCS) algorithm. An empirical experiment with 20 full-text articles shows that it works well in a real-world application.
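
To make the idea of LCS-based alignment concrete, the following is a minimal sketch of the textbook character-level LCS, not the generalized variant the paper proposes. It aligns two versions of a text and uses the aligned positions to carry an annotation span from one version to the other. The function names `lcs_alignment` and `project_span` and the sample strings are hypothetical, chosen only for illustration.

```python
def lcs_alignment(a, b):
    """Return index pairs (i, j), with a[i] == b[j], forming one longest common subsequence."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping a[i] loses nothing
        else:
            j += 1  # skipping b[j] loses nothing
    return pairs


def project_span(span, pairs):
    """Map a (begin, end) character span of text a onto text b.

    Assumption: the projected span runs from the first to the last aligned
    character inside the original span. Returns None if none is aligned.
    """
    begin, end = span
    mapping = dict(pairs)
    hits = [mapping[i] for i in range(begin, end) if i in mapping]
    if not hits:
        return None
    return (hits[0], hits[-1] + 1)


if __name__ == "__main__":
    original = "the p53 protein"
    variant = "the  p53 protein"  # same text with an extra space
    pairs = lcs_alignment(original, variant)
    begin, end = project_span((4, 7), pairs)  # span of "p53" in the original
    print(variant[begin:end])
```

This quadratic-time, quadratic-space version is only practical for short texts. Aligning full-text articles would need a more economical formulation, which is presumably part of what motivates a generalized algorithm.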
