A Methodology to Segment the Text for Index Terms

Information overload has become a pressing issue with the growth of the World Wide Web. The need for tools that can absorb this volume of information and mitigate the problem is evident, especially for information retrieval (IR) systems. Text is not a simple sequence of words; it carries structure. It is therefore essential to handle complex sentence structures and the grammatical and lexical irrelevancy of different units. The main idea for handling these problems is to segment the text into elementary units, which are simpler and less complex than the original text. We use cue phrases and punctuation marks as segmentation signals. We present an algorithm that is efficient and handles more than 500 cue phrases and most punctuation marks. The elementary units it yields can be passed to a Rhetorical Relations Finder to identify the relations among them; these relations can in turn be used by an RST (Rhetorical Structure Theory) parser to construct an RST tree, which will be used to design an RST-based indexer. In future, the algorithm can be extended to handle other discourse markers, which would enable it to cover the most complex cases where cue phrases and punctuation are not applicable.
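To make the segmentation idea concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that a unit boundary falls at any run of punctuation and immediately before any whole-word cue phrase. The names `CUE_PHRASES`, `BOUNDARY_PUNCT`, and `segment` are hypothetical, and the six-entry cue list merely stands in for the more than 500 cue phrases mentioned above. It relies on `re.split` accepting zero-width matches, which requires Python 3.7 or later.

```python
import re

# Hypothetical, tiny cue-phrase list for illustration only.
CUE_PHRASES = ["although", "because", "but", "however", "in addition", "so that"]

# Punctuation treated as unit boundaries (an assumption of this sketch).
BOUNDARY_PUNCT = ".;:!?,"


def segment(text):
    """Split text into elementary units at punctuation and before cue phrases."""
    # Try longer phrases first so multi-word cues win over shorter overlaps.
    cues = sorted(CUE_PHRASES, key=len, reverse=True)
    cue_alternatives = "|".join(re.escape(c) for c in cues)
    # A boundary is either a run of boundary punctuation (consumed by the split)
    # or the zero-width position just before a whole-word cue phrase.
    pattern = re.compile(
        "[" + re.escape(BOUNDARY_PUNCT) + "]+"
        + r"|(?=\b(?:" + cue_alternatives + r")\b)",
        re.IGNORECASE,
    )
    units = [piece.strip() for piece in pattern.split(text)]
    # Drop empty pieces produced by adjacent boundaries.
    return [u for u in units if u]


if __name__ == "__main__":
    sample = "Although it was raining, we went out because the tickets were booked."
    for unit in segment(sample):
        print(unit)
```

On the sample sentence this prints three units: "Although it was raining", "we went out", and "because the tickets were booked". A sketch like this over-segments in cases such as decimal numbers or abbreviations, which is one reason a practical segmenter needs more careful punctuation rules.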
