Passage detection using text classification

Passages can be hidden within a text to circumvent restrictions on their transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval determines the degree of relevance of the blocks of text, namely passages, that make up a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. To do so, modified traditional information-retrieval techniques compare the terms in a user query with the individual passages to compute a similarity score for each passage of interest. In passage detection, by contrast, passages are classified into predetermined categories. More often than not, passage-detection techniques are deployed to detect hidden paragraphs in documents; that is, to hide information, hidden text is injected into the passages of a document. Rather than matching query terms against passages to determine their relevance, passage detection classifies the passages using text-mining techniques. Documents that contain hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages, in which each passage is labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms the other document-splitting approaches by 12% to 18% in the passage-detection and passage category-prediction tasks, a statistically significant improvement (99% confidence). Furthermore, we evaluate the effects of feature selection, passage length, ambiguous passages, and training-data category distribution on passage-detection accuracy.
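The contrast between the two tasks can be illustrated with a minimal sketch. It is not the KDP approach, whose details the abstract does not give. It assumes fixed-length word windows as the document-splitting step and uses a nearest-centroid classifier with cosine similarity as a stand-in for the text classifier. All function names, the window size, the threshold, and the toy training data are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as terms."""
    return re.findall(r"[a-z]+", text.lower())


def split_into_passages(text, window=8):
    """Assumed baseline split: fixed-length, non-overlapping word windows."""
    words = tokenize(text)
    return [words[i:i + window] for i in range(0, len(words), window)]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(passages, query):
    """Passage retrieval: rank passages by similarity to a user query."""
    q = Counter(tokenize(query))
    scored = [(cosine(Counter(p), q), i) for i, p in enumerate(passages)]
    return sorted(scored, reverse=True)


def train_centroids(labeled_texts):
    """Build one term-frequency centroid per predetermined category."""
    centroids = {}
    for label, text in labeled_texts:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids


def detect(passages, centroids, threshold=0.2):
    """Passage detection: label each passage with a category, or None."""
    labels = []
    for p in passages:
        vec = Counter(p)
        best_score, best_label = max(
            (cosine(vec, c), label) for label, c in centroids.items()
        )
        # Passages that resemble no category closely enough stay unlabeled.
        labels.append(best_label if best_score >= threshold else None)
    return labels


# Hypothetical toy data: two categories and a document with injected text.
training = [
    ("finance", "quarterly revenue profit merger acquisition earnings forecast"),
    ("medical", "patient diagnosis treatment dosage clinical trial symptoms"),
]
document = (
    "The team picnic is on Friday and everyone should bring a dish to share. "
    "Confidential merger forecast shows quarterly revenue and profit earnings rising."
)

passages = split_into_passages(document)
ranking = retrieve(passages, "quarterly revenue forecast")
labels = detect(passages, train_centroids(training))
infected = any(label is not None for label in labels)
print(ranking[0][1], labels, infected)
```

Retrieval needs a query and returns a ranking, whereas detection needs only the trained categories and returns a label per passage; under this sketch a document counts as infected when any passage receives a category label. The second window shows why the splitting strategy matters: it straddles the injected text, and its single matching term is too weak to pass the threshold.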
