Machine Learning Techniques for Document Processing and Web Security

[1]  Herbert Bos,et al.  Ruler: high-speed packet matching and rewriting on NPUs , 2007, ANCS '07.

[2]  Niels Provos,et al.  All Your iFRAMEs Point to Us , 2008, USENIX Security Symposium.

[3]  Kalyanmoy Deb,et al.  A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput.

[4]  Massimo Ruffolo,et al.  XONTO: An Ontology-Based System for Semantic Information Extraction from PDF Documents , 2008, 2008 20th IEEE International Conference on Tools with Artificial Intelligence.

[5]  Jan P. Allebach,et al.  Document visual similarity measure for document search , 2011, DocEng '11.

[6]  Wei Liu,et al.  ViDE: A Vision-Based Approach for Deep Web Data Extraction , 2010, IEEE Transactions on Knowledge and Data Engineering.

[7]  Elio Masciari,et al.  A Fuzzy Logic Approach to Wrapping PDF Documents , 2011, IEEE Transactions on Knowledge and Data Engineering.

[8]  Jayant Madhavan,et al.  Harvesting Relational Tables from Lists on the Web , 2009, Proc. VLDB Endow.

[9]  Juliana Freire,et al.  Organizing Hidden-Web Databases by Clustering Visible Web Documents , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[10]  Karen Kukich,et al.  Techniques for automatically correcting words in text , 1992, CSUR.

[11]  Richard M. Schwartz,et al.  Named Entity Extraction from Noisy Input: Speech and OCR , 2000, ANLP.

[12]  Shui-Lung Chuang,et al.  Context-Aware Wrapping: Synchronized Data Extraction , 2007, VLDB.

[13]  Michael Wick,et al.  Context-Sensitive Error Correction: Using Topic Models to Improve OCR , 2007.

[14]  Sriram Raghavan,et al.  Regular Expression Learning for Information Extraction , 2008, EMNLP.

[15]  Eric Medvet,et al.  The Reaction Time to Web Site Defacements , 2009, IEEE Internet Computing.

[16]  Andreas Dengel,et al.  Seizing the Treasure: Transferring Knowledge in Invoice Analysis , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[17]  Tyler Moore,et al.  Measuring and Analyzing Search-Redirection Attacks in the Illicit Online Prescription Drug Trade , 2011, USENIX Security Symposium.

[18]  Chih-Jen Lin,et al.  Probability Estimates for Multi-class Classification by Pairwise Coupling , 2003, J. Mach. Learn. Res.

[19]  María Dolores Rodríguez-Moreno,et al.  Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions , 2009, Data Mining and Multi-agent Integration.

[20]  Brian J. Ross,et al.  Probabilistic Pattern Matching and the Evolution of Stochastic Regular Expressions , 2000, Applied Intelligence.

[21]  Alon Y. Halevy,et al.  Data Integration for the Relational Web , 2009, Proc. VLDB Endow.

[22]  C. M. Sperberg-McQueen,et al.  Extensible Markup Language (XML) , 1997, World Wide Web J.

[23]  Sargur N. Srihari,et al.  Experiments in Text Recognition with Binary n-Gram and Viterbi Algorithms , 1982, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Jeffrey E. F. Friedl.  Mastering Regular Expressions , 1997.

[25]  Tae-Hoon Kim,et al.  Automatic generation of XForms code using DTD , 2005, Fourth Annual ACIS International Conference on Computer and Information Science (ICIS'05).

[26]  Hanchuan Peng,et al.  Document Image Recognition Based on Template Matching of Component Block Projections , 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[27]  Boris Chidlovskii.  Schema Extraction from XML: A Grammatical Inference Approach , 2001, KRDB.

[28]  Jan-Ming Ho,et al.  BibPro: A Citation Parser Based on Sequence Alignment , 2012, IEEE Trans. Knowl. Data Eng.

[29]  Daniela Florescu.  Managing Semi-Structured Data , 2005, ACM Queue.

[30]  Aristides Gionis,et al.  XTRACT: a system for extracting document type descriptors from XML documents , 2000, SIGMOD 2000.

[31]  Somesh Jha,et al.  Automatic Generation of Remediation Procedures for Malware Infections , 2010, USENIX Security Symposium.

[32]  Stamatis Vassiliadis,et al.  Regular Expression Matching in Reconfigurable Hardware , 2008, J. Signal Process. Syst.

[33]  Lawrence K. Saul,et al.  Identifying suspicious URLs: an application of large-scale online learning , 2009, ICML '09.

[34]  Giovanni Vigna,et al.  Prophiler: a fast filter for the large-scale detection of malicious web pages , 2011, WWW.

[35]  Felix Naumann,et al.  XStruct: Efficient Schema Extraction from Multiple and Large XML Documents , 2006, 22nd International Conference on Data Engineering Workshops (ICDEW'06).

[36]  Michalis Faloutsos,et al.  PhishDef: URL names say it all , 2010, 2011 Proceedings IEEE INFOCOM.

[37]  Bertin Klein,et al.  Results of a Study on Invoice-Reading Systems in Germany , 2004, Document Analysis Systems.

[38]  Tamir Hassan.  User-Guided Wrapping of PDF Documents Using Graph Matching Techniques , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[39]  Francesca Cesarini,et al.  Analysis and understanding of multi-class invoices , 2003, Document Analysis and Recognition.

[40]  Huajun Huang,et al.  A SVM-based Technique to Detect Phishing URLs , 2012.

[41]  Eric Medvet,et al.  A probabilistic approach to printed document understanding , 2011, International Journal on Document Analysis and Recognition (IJDAR).

[42]  Eric Medvet,et al.  Semisupervised Wrapper Choice and Generation for Print-Oriented Documents , 2014, IEEE Transactions on Knowledge and Data Engineering.

[43]  Masakazu Suzuki,et al.  Syntactic Detection and Correction of Misrecognitions in Mathematical OCR , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[44]  Thamar Solorio,et al.  Lexical feature based phishing URL detection using online learning , 2010, AISec '10.

[45]  Guofei Gu,et al.  WebPatrol: automated collection and replay of web-based malware scenarios , 2011, ASIACCS '11.

[46]  Frank Neven,et al.  Learning deterministic regular expressions for the inference of schemas from XML data , 2010, ACM Trans. Web.

[47]  Ken Thompson,et al.  Programming Techniques: Regular expression search algorithm , 1968, Commun. ACM.

[48]  Yuan An,et al.  Understanding deep web search interfaces: a survey , 2010, SIGMOD Rec.

[49]  Justin Tung Ma,et al.  Learning to detect malicious URLs , 2011, TIST.

[50]  Ee-Peng Lim,et al.  DTD-Miner: a tool for mining DTD from XML documents , 2000, Proceedings Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems. WECWIS 2000.

[51]  William W. Cohen,et al.  Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text , 2005, HLT.

[52]  Eric Medvet,et al.  A look at hidden web pages in Italian public administrations , 2012, 2012 Fourth International Conference on Computational Aspects of Social Networks (CASoN).

[53]  Valter Crescenzi,et al.  Wrapper Generation for Overlapping Web Sources , 2011, 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology.

[54]  Herbert Shiu,et al.  Recovering data semantics from XML documents into DTD graph with SAX , 2006.

[55]  Eric Medvet,et al.  A Framework for Large-Scale Detection of Web Site Defacements , 2010, TOIT.

[56]  Ahmet Cetinkaya.  Regular expression generation through grammatical evolution , 2007, GECCO '07.

[57]  Ramana Rao Kompella,et al.  PhishNet: Predictive Blacklisting to Detect Phishing Attacks , 2010, 2010 Proceedings IEEE INFOCOM.

[58]  Hector Garcia-Molina,et al.  Web Spam Taxonomy , 2005, AIRWeb.

[59]  Boaz Ophir,et al.  A Generic Form Processing Approach for Large Variant Templates , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[60]  Khaled Shaalan,et al.  A Survey of Web Information Extraction Systems , 2006, IEEE Transactions on Knowledge and Data Engineering.

[61]  Frederick E. Petry,et al.  Regular language induction with genetic programming , 1994, Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence.

[62]  Clement T. Yu,et al.  Automatic integration of Web search interfaces with WISE-Integrator , 2004, The VLDB Journal.

[63]  Sotiris Ioannidis,et al.  Regular Expression Matching on Graphics Hardware for Intrusion Detection , 2009, RAID.