This paper presents a novel approach for extracting the main content from Web documents written in languages not based on the Latin alphabet. HTML tags are based on the English language, and English characters are encoded in the interval [0,127] of the Unicode character set, i.e., the ASCII range. Many languages, such as Arabic, use a different interval for their characters. In the first phase of our approach, we exploit this distinction to quickly separate the non-ASCII characters from the ASCII ones. We then determine the areas of the HTML file with a high density of non-ASCII characters and a low density of ASCII characters, and at the end of this phase we use this density to identify the areas that contain the main content. Finally, we feed those areas to our parser in order to extract the main content of the Web page. The proposed algorithm, called DANA, outperforms alternative approaches in both efficiency and effectiveness, and has the potential to be extended to languages based on ASCII characters as well.
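To make the density idea concrete, the following is a minimal sketch, not the authors' implementation of DANA: it scores fixed-size windows of an HTML string by the fraction of non-ASCII characters and keeps contiguous high-density windows as candidate main-content regions. The window size, threshold, and function names are illustrative assumptions; the paper's algorithm may compute and combine the two densities differently.

```python
# Illustrative sketch (not the authors' implementation): score sliding
# windows of an HTML document by the density of non-ASCII characters,
# then keep windows whose density exceeds a threshold. The window size
# and threshold are hypothetical parameters chosen for illustration.

def non_ascii_density(text: str) -> float:
    """Fraction of characters outside the ASCII range [0, 127]."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)

def candidate_regions(html: str, window: int = 200, threshold: float = 0.5):
    """Return (start, end) spans whose non-ASCII density is high.

    Adjacent qualifying windows are merged into one region.
    """
    regions = []
    for start in range(0, len(html), window):
        chunk = html[start:start + window]
        if non_ascii_density(chunk) >= threshold:
            end = start + len(chunk)
            if regions and regions[-1][1] == start:
                regions[-1] = (regions[-1][0], end)  # merge contiguous windows
            else:
                regions.append((start, end))
    return regions

if __name__ == "__main__":
    page = "<html><body><div>menu</div><p>هذا مثال على نص عربي طويل داخل صفحة ويب</p></body></html>"
    for start, end in candidate_regions(page, window=30, threshold=0.4):
        print(page[start:end])
```

In such a pipeline, the spans returned by this step would then be handed to an HTML parser to strip the remaining markup and emit the main content, as the abstract describes for DANA's final phase.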