Paper Information - A Generalized LCS Algorithm and Its Application to Corpus Alignment

The paper addresses the problem of text variation, which often hinders the interoperable use or reuse of corpora and annotations. It presents a systematic solution based on a variant of the Longest Common Subsequence (LCS) algorithm. An empirical experiment with 20 full-text articles shows that the approach works well in a real-world application.
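To make the idea concrete, here is a minimal sketch of how an LCS alignment can carry an annotation span from one version of a text to another. It uses the classic character-level LCS with dynamic programming, not the generalized variant the paper proposes, whose details are not given in this summary. The function names `lcs_alignment` and `project_span` and the sample strings are hypothetical, chosen only for illustration.

```python
from typing import Dict, Optional, Tuple


def lcs_alignment(a: str, b: str) -> Dict[int, int]:
    """Map character indices of `a` to indices of `b` along one LCS."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table to recover one optimal alignment.
    mapping: Dict[int, int] = {}
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            mapping[i] = j
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skip a character that exists only in `a`
        else:
            j += 1  # skip a character that exists only in `b`
    return mapping


def project_span(
    span: Tuple[int, int], mapping: Dict[int, int]
) -> Optional[Tuple[int, int]]:
    """Project a (begin, end) span (end exclusive) from `a` onto `b`.

    Returns None when no character of the span was aligned.
    """
    begin, end = span
    targets = [mapping[k] for k in range(begin, end) if k in mapping]
    if not targets:
        return None
    return (targets[0], targets[-1] + 1)


# Hypothetical example: two versions of a text that differ only in whitespace.
original = "IL-2 gene expression"
variant = "IL-2  gene  expression"
gene_span = (5, 9)  # covers "gene" in `original`
projected = project_span(gene_span, lcs_alignment(original, variant))
print(projected, variant[projected[0]:projected[1]])
```

The table costs O(n·m) time and memory, so this sketch suits short passages; aligning full-text articles would likely need a more economical method.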

Jin-Dong Kim
