Coupling Maximum Entropy and Probabilistic Context-Free Grammar Models for XML Annotation of Documents

We consider the problem of semantic annotation of semi-structured documents according to a target XML schema. The task is to annotate a document in a tree-like manner, where the annotation tree is an instance of a tree class defined by a DTD or W3C XML Schema description. In the probabilistic setting, we treat the tree annotation problem as generalized probabilistic context-free parsing of an observation sequence, where each observation comes with a probability distribution over terminals supplied by a probabilistic classifier associated with the document content. We determine the most probable tree annotation by maximizing the joint probability of selecting a terminal sequence for the observation sequence and the most probable parse for that terminal sequence. We extend the inside-outside algorithm for probabilistic context-free grammars and establish a Naive Bayes-like requirement that the content classifier must satisfy when estimating the terminal probabilities.
Keywords: Machine learning, Semantic Web, Information extraction
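The joint maximization described above can be illustrated with a small sketch: a Viterbi-style CYK parse over a CNF grammar, where each leaf observation contributes a probability distribution over terminals (as a maximum-entropy content classifier would supply). The grammar, nonterminal names, and probabilities below are illustrative assumptions, not the paper's actual schema.

```python
from collections import defaultdict

# Illustrative CNF grammar for a tiny target schema (assumed names):
# binary rules A -> B C with probabilities, and lexical rules A -> terminal.
BINARY = [("Doc", "Title", "Body", 1.0),
          ("Body", "Para", "Para", 0.6),
          ("Body", "Para", "Body", 0.4)]
LEXICAL = [("Title", "title-text", 1.0),
           ("Para", "para-text", 1.0)]

def viterbi_annotation(obs, start="Doc"):
    """obs[i] maps each terminal to P(terminal | observation i),
    i.e. the output of a probabilistic content classifier."""
    n = len(obs)
    best = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    # Leaves: joint probability of choosing terminal t and applying A -> t.
    for i, dist in enumerate(obs):
        for a, t, p in LEXICAL:
            score = p * dist.get(t, 0.0)
            if score > best[i][i + 1][a]:
                best[i][i + 1][a] = score
                back[i][i + 1][a] = (t,)
    # Inner spans: maximize over binary rules and split points.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for a, b, c, p in BINARY:
                for k in range(i + 1, j):
                    score = p * best[i][k][b] * best[k][j][c]
                    if score > best[i][j][a]:
                        best[i][j][a] = score
                        back[i][j][a] = (b, c, k)
    def build(a, i, j):
        entry = back[i][j][a]
        if len(entry) == 1:          # lexical leaf: (terminal,)
            return (a, entry[0])
        b, c, k = entry
        return (a, build(b, i, k), build(c, k, j))
    return best[0][n][start], build(start, 0, n)

# Three observations, each with classifier-supplied terminal probabilities.
obs = [{"title-text": 0.8, "para-text": 0.2},
       {"title-text": 0.3, "para-text": 0.7},
       {"title-text": 0.1, "para-text": 0.9}]
prob, tree = viterbi_annotation(obs)
```

Because the leaf score multiplies the rule probability by the classifier's terminal probability, the argmax is over the joint probability of the terminal sequence and the parse, matching the maximization described in the abstract.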
