Stanford Parser-Based Approach for Extraction of Link-Context from Non-Descriptive Anchor-Text

Link context analysis has been widely explored as a way to determine the context of a target web page. Most researchers, however, have considered only descriptive (meaningful) anchor text and have ignored non-descriptive anchor text. Our examination of the World Wide Web indicates that a substantial share of web pages can be reached by following non-descriptive anchor text. This paper therefore proposes and implements an algorithm for Link Context Determination (LCD) that determines the context of non-descriptive anchor text. A corpus of web pages belonging to a common domain was considered first. The pages were then analyzed manually to discover the relations between the anchor text and the words in its vicinity. Based on these relationships, a number of rules were formed and represented as a tree. The proposed architecture for LCD has three components: (1) the Stanford parser, (2) the rules, and (3) the link context determiner. An input sentence is given to the Stanford parser, which creates a parse tree for it. The link context determiner then uses this parse tree, together with the appropriate rules tree, to determine the link context. The approach has been implemented and validated on a limited sample of non-descriptive anchor texts (ATs). In these tests, the proposed LCD extracted the actual link context for 100% of the non-descriptive ATs considered.
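The parse-tree-plus-rules pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: the bracketed parse string is hand-written in the Penn Treebank style that constituency parsers such as the Stanford parser emit, and the single rule used here (take the noun phrases inside the nearest VP or S that encloses the anchor) is a hypothetical stand-in for the paper's rules tree, which the abstract does not spell out. All function names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                   # constituent or POS tag
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                   # set only on leaf nodes

    def leaves(self) -> List[str]:
        if self.word is not None:
            return [self.word]
        words: List[str] = []
        for child in self.children:
            words.extend(child.leaves())
        return words


def parse_bracketed(text: str) -> Node:
    """Read a Penn Treebank style bracketed tree into Node objects."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = Node(tokens[pos])
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.word = tokens[pos]
                pos += 1
        pos += 1                                 # consume ")"
        return node

    return read()


def path_to(node: Node, anchor: str, trail: tuple = ()) -> Optional[tuple]:
    """Return the root-to-leaf path ending at the anchor word, if present."""
    trail = trail + (node,)
    if node.word is not None:
        return trail if node.word.lower() == anchor.lower() else None
    for child in node.children:
        found = path_to(child, anchor, trail)
        if found is not None:
            return found
    return None


def outer_noun_phrases(node: Node, on_path: set) -> List[str]:
    """Collect outermost NPs that do not contain the anchor itself."""
    if node.label == "NP" and id(node) not in on_path:
        return [" ".join(node.leaves())]
    phrases: List[str] = []
    for child in node.children:
        phrases.extend(outer_noun_phrases(child, on_path))
    return phrases


def link_context(tree: Node, anchor: str) -> List[str]:
    """Hypothetical rule: NPs inside the nearest VP or S around the anchor."""
    path = path_to(tree, anchor)
    if path is None:
        return []
    on_path = {id(n) for n in path}
    for node in reversed(path):
        if node.label in ("VP", "S"):
            return outer_noun_phrases(node, on_path)
    return []


# Hand-written parse of "Click here to download the annual report."
PARSE = ("(ROOT (S (VP (VB Click) (ADVP (RB here)) (S (VP (TO to) "
         "(VP (VB download) (NP (DT the) (JJ annual) (NN report))))))"
         " (. .)))")

if __name__ == "__main__":
    print(link_context(parse_bracketed(PARSE), "here"))
```

For the non-descriptive anchor "here", this sketch returns the noun phrase "the annual report" as the candidate link context. A real system would obtain the parse from the parser itself and would select among several rules depending on the anchor's position in the tree.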
