Automating the Measurement of Linguistic Features to Help Classify Texts as Technical

Text classification plays a central role in software systems that perform automatic information classification and retrieval. Any mechanism that classifies or characterizes natural language text by topic, style, genre or, in our case, degree of technicality must count occurrences of linguistic feature values. We discuss the methodology and key details of the feature value extraction process, paying attention to fast and reliable implementation. Our results are mixed but support continued investigation: while a significant level of automation has been achieved, the successfully extracted feature counts do not always correlate with technicality as strongly as anticipated.
