TARENTe: an experimental tool for extracting and exploring Web aggregates

How can one extract and visually explore the topology of an open, large-scale hypertext system such as the Web? We address this issue by developing an experimental tool for extracting, exploring and analyzing aggregates of Web documents. This tool, called TARENTe, combines a crawling technology with algorithms for content analysis and for authority-graph calculations (such as Kleinberg's HITS), linked with visualization solutions. We provide a series of experimental results on different topics that allow us to describe the Web's structure in terms of topic aggregates. The TARENTe system was designed to provide multiple services, including Web crawling, network analysis, data mining and information visualization tools. For these purposes we chose to build it using an ad hoc modular Java framework, which allows the integration of open-source code for each task. For simplicity, we organized the information gathering and analysis process around a MySQL database, which can be accessed by different crawlers as well as by multiple infoviz tools and analysis plug-ins.
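
To make the authority-graph calculation mentioned above more concrete, the following is a minimal sketch of Kleinberg's HITS iteration in Java. It is not taken from TARENTe itself: the class name HitsSketch, the method hits, the edge-list representation and the small four-page example graph are all assumptions chosen for illustration. Each round recomputes authority scores from the hubs that link in, then hub scores from the authorities linked to, and rescales both vectors to unit length.

```java
import java.util.Arrays;

// Illustrative only: names and graph representation are assumptions,
// not part of the TARENTe code base.
public class HitsSketch {

    /**
     * Runs a fixed number of HITS iterations on a directed graph with n nodes.
     * edges[k] = {source, target} means page "source" links to page "target".
     * Returns a two-row array: row 0 holds hub scores, row 1 authority scores.
     */
    public static double[][] hits(int n, int[][] edges, int iterations) {
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);

        for (int it = 0; it < iterations; it++) {
            // Authority update: sum the hub scores of all pages linking in.
            double[] newAuth = new double[n];
            for (int[] e : edges) {
                newAuth[e[1]] += hub[e[0]];
            }
            normalize(newAuth);

            // Hub update: sum the fresh authority scores of all pages linked to.
            double[] newHub = new double[n];
            for (int[] e : edges) {
                newHub[e[0]] += newAuth[e[1]];
            }
            normalize(newHub);

            auth = newAuth;
            hub = newHub;
        }
        return new double[][] { hub, auth };
    }

    // Rescales v to unit Euclidean length; an all-zero vector is left unchanged.
    private static void normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        if (norm == 0.0) {
            return;
        }
        for (int i = 0; i < v.length; i++) {
            v[i] /= norm;
        }
    }

    public static void main(String[] args) {
        // Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
        int[][] edges = { {0, 2}, {1, 2}, {0, 3} };
        double[][] scores = hits(4, edges, 50);
        System.out.println("hubs        = " + Arrays.toString(scores[0]));
        System.out.println("authorities = " + Arrays.toString(scores[1]));
    }
}
```

In this toy graph, page 2 receives the most in-links and so ends up with the highest authority score, while page 0, which points to both authorities, ends up as the strongest hub. In a setting like the one described here, the edge list would presumably be read from the crawl database rather than hard-coded.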