Blogosphere community formation, structure and visualization

Even as social networks have become fashionable over the last few years, the emphasis has been on "artificial" social networks (like Orkut, where you state your links explicitly) over "natural" social networks (where social links are deduced from user actions, such as those underlying the blogosphere, globally or regionally). Blog readers and writers form communities, and several tools make it possible to visualize them. By mapping a subset of the blogosphere at different points in time, a picture of its evolution can also be drawn. Finally, by looking closely at each community and its evolution, some conclusions can be drawn about the feature that defines that community. In this presentation, we will show which tools are available to find and map weblog communities, how selected communities have evolved, and what conclusions we draw from that evolution. In particular, we will present a neural-network-based tool, Kohonen's self-organizing map, which we have introduced for mapping and representing weblog communities.

Introduction and state of the art

Weblogs, or blogs, can be considered on-screen renderings of communities of readers and writers who establish long-running relationships; these communities include weblog owners, writers, or editors, people who post comments on weblog stories, and silent but persistent readers; commenters and readers might themselves have their own weblogs. A weblog by itself need not be important, but as part of a community, its importance cannot be disregarded. All weblogs in the world can be seen as components of a set of communities, each one with its own idols, axioms, enemies, and hierarchies. Communities are not clear-cut, since a particular weblog might belong to several communities at the same time, even though most weblogs (in fact, all weblogs in the Spanish-speaking community; Tricas et al., 2003) are connected to each other by a finite set of links.
Since blogs perform a sort of collaborative filtering of information published on the web at large, and are starting to be used as knowledge management tools, identifying communities becomes especially important. Information flows more easily within communities than outside them; getting a message across to as many people as possible becomes, then, a matter of identifying communities and the position of different sites within them. As straightforward as this view of the community concept might seem, the main problem is that there is no universally accepted definition of community in complex networks. Informally, a community can be defined as a set of blogs (or websites) that share common interests, but this only shifts the problem to defining "common" and "interest". Another possible definition is to consider a community as a set of blogs that have a stronger relationship among themselves than with the rest of the websites of the same class. Equating relationship with hyperlinks means that a community is a set of weblogs that has more links within the group than to outside sites. However, while heavy linking implies belonging to the same community, the converse does not necessarily hold: two weblogs might both link to the same one, and thus belong, in a sense, to the same community without being aware of each other or of the community. (From now on, every time we refer to weblogs in a community context, we actually refer to the group of persons related to that weblog: readers, writers, commenters, and even those who link to it without reading it.) In practice, the data available to discover community membership has to come from the web page source code, which is text formatted using HTML tags and some additional meta-tags; sometimes, each text can be assigned a time-stamp. The aforementioned common interest will have to be identified by using this data.
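The link-based definition above (more links within the group than to outside sites) can be illustrated with a minimal sketch. The function names and the toy link list are hypothetical and chosen only for illustration; this is not the method used in this work, just a direct reading of the definition, assuming every hyperlink is available as a (source, target) pair.

```python
def link_balance(links, group):
    """Count links that stay inside `group` versus links that leave it.

    `links` is an iterable of (source, target) pairs, one per hyperlink,
    so a repeated pair counts as a heavier link.
    """
    group = set(group)
    internal = external = 0
    for source, target in links:
        if source not in group:
            continue  # only links originating inside the group are counted
        if target in group:
            internal += 1
        else:
            external += 1
    return internal, external


def is_community(links, group):
    """True if the group has more links within itself than to outside sites."""
    internal, external = link_balance(links, group)
    return internal > external


# Hypothetical toy data: blogs a, b, c link mostly to each other.
links = [
    ("a", "b"), ("b", "a"), ("a", "c"), ("c", "b"),  # stay inside {a, b, c}
    ("c", "x"),                                       # leaves the group
    ("x", "y"), ("y", "x"),
]

print(link_balance(links, {"a", "b", "c"}))
print(is_community(links, {"a", "b", "c"}))
print(is_community(links, {"a", "x"}))
```

Note that such a test only checks a candidate group that is already given; it does not say how to find the groups in the first place.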
From the point of view of text content, two websites are related if they deal with approximately the same topics. Considering links, two websites are related if one links to the other, in either direction. These two definitions are actually correlated: Menczer has shown that pages that link to each other are semantically related. Defining communities by content nevertheless raises several problems: if a community is defined by keywords, then synonyms and hypernyms that are ignored or poorly chosen can lead to certain websites being overlooked. This problem is aggravated by the distinctive nature of weblogs, which change rapidly and do not focus on a single topic or set of topics. Using content requires a vector space representation, usually term frequency/inverse document frequency. This representation is usually high-dimensional, much more so than one based on links to other members of the set of websites under study. For a small set of sites, a link-based representation is much more compact. A relationship expressed by content distance, however, is implicit: two weblogs talking about politics, for instance, need not know each other, although it is very likely that they do, since at least the Spanish blogosphere is connected (Tricas et al., 2003). Moreover, in many cases, communities are multilingual; two weblogs closely related to each other (for instance, written by the same author) but written in different languages (for instance, Spanish and Catalan, or Spanish and English) will be completely unrelated if only content is taken into account. Meta-content following protocols such as Friend of a Friend (FOAF) could, in principle, also be used as network arcs, but its use is not widespread, and it represents simply a binary relation (either you are a FOAF or you are not), while links have some quantitative quality (linking several times is different from linking only once).
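The multilingual problem with content-based relatedness described above can be made concrete with a small sketch of a term frequency/inverse document frequency representation. The tokenizer (lowercase and split on whitespace), the particular weighting variant, and the three toy texts are all assumptions made for illustration; real systems typically add stemming, stop-word removal, and smoothing.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: tf-idf weight} vector.

    `docs` maps a name to its text. Term frequency is relative to document
    length; inverse document frequency is log(N / document frequency).
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in tokenized.items():
        total = len(tokens) or 1  # avoid dividing by zero on empty texts
        vectors[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in Counter(tokens).items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy texts: the first two cover the same topic in two languages.
docs = {
    "blog_es": "elecciones gobierno política elecciones",
    "blog_en": "elections government politics elections",
    "blog_es2": "política gobierno fútbol",
}
vectors = tfidf_vectors(docs)
same_topic_other_language = cosine(vectors["blog_es"], vectors["blog_en"])
same_language = cosine(vectors["blog_es"], vectors["blog_es2"])
print(same_topic_other_language, same_language)
```

Because the Spanish and English texts share no terms, their similarity is zero even though they cover the same topic, while the two Spanish texts have a positive similarity. Each vector also has one component per vocabulary term, which is why this representation grows much faster than a link-based one.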
In this work, links have been chosen over content because they are easily parsed from the document source; this choice allows for a low-dimensional representation in which each blog is represented by a vector with as many components as there are blogs in the group under study. This obviously only holds if the number of relevant sites is smaller than the vocabulary needed to represent the same sites in a vector space model. A link is also unambiguous: it clearly identifies its origin (the weblog it has been found in) and its destination (from the URL). Links represent a real relationship between the blogs they join: they imply that, at least, one author has read the other, which shows a kind of community relation. This is inferred because communities are created by reading other blogs, writing about them, or commenting on them. It is true that there might be other members of the community not uncovered by this method (for instance, loyal readers or people who use comments to participate); similarly, a member of the community could be linked to another via a blog not belonging to the set of blogs under study (Blogalia, in this case); however, we do not attempt to say the last word about community structure in the blogosphere (as the set of all weblogs is usually called). Our aim is to present a method to identify communities by considering hyperlinks a good enough indicator of community relationship. Content (distance in vector space) or links (number of links, or just the existence or absence of links) are used to create a complex network of the set of sites under study; consequently, a community must be defined by some measure that distinguishes, or sets apart, some sites from others.
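The link-based representation described above (one vector per blog, with one component per blog in the group) can be sketched with Python's standard HTML parser. This is an illustration under stated assumptions, not the extraction code used in this work: it assumes each blog is identified by its host name, that links are absolute URLs in `<a href>` attributes, and the host names and page contents below are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_vector(html, own_host, hosts):
    """Count links from one page to each blog in `hosts`, ignoring self-links."""
    parser = LinkCollector()
    parser.feed(html)
    counts = dict.fromkeys(hosts, 0)
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        if host in counts and host != own_host:
            counts[host] += 1  # repeated links increase the weight
    return [counts[host] for host in hosts]


def link_matrix(pages, hosts):
    """Row i holds the link vector of blog hosts[i]; column j is the target."""
    return [link_vector(pages[host], host, hosts) for host in hosts]


# Hypothetical set of three blogs and their page sources.
hosts = ["a.example.org", "b.example.org", "c.example.org"]
pages = {
    "a.example.org": '<p><a href="http://b.example.org/post/1">b</a> '
                     '<a href="http://b.example.org/post/2">b again</a> '
                     '<a href="http://a.example.org/archive">self</a></p>',
    "b.example.org": '<a href="http://c.example.org/">c</a> '
                     '<a href="http://elsewhere.example.net/">outside</a>',
    "c.example.org": "<p>no links here</p>",
}
matrix = link_matrix(pages, hosts)
for host, row in zip(hosts, matrix):
    print(host, row)
```

The matrix is directed and weighted: a row records whom a blog points to and how often, and links to sites outside the set under study are simply dropped, which matches the limitation acknowledged in the text.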
There are several possible network structures that could be considered communities: cliques, or sets of sites that all link to each other; bipartite cliques, or sets of sites which all link to another, different, set of sites; k-cores or factions, sets of sites each connected to at least k other sites in the group; and bipartite cores, which include both the connector and the connected sites. Most of these structures can be computed and displayed with programs such as Pajek (available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/) or UCINET (available from http://www.analytictech.com/), but they require some initial parameters, such as the number of cliques or the number of cores we want to divide the original set into. All of these are valid definitions, and can be used in some cases. However, some of them are restrictive in the sense that they only take into account binary relations, and not the link weight (the number of times a link has been used) or its direction. In the case at hand, direction is important: usually, a blog that has been "pointed to" might not even be aware of it. Most of the concepts defined above do not create a clear visual image of the community they describe. Sometimes, further steps must be taken to infer complex network communities. Some of these methods are geared toward specific communities, e.g. communities expressed via web pages or email messages, like the one we are dealing with in this paper. Gibson et al. proposed one of the first algorithms to infer web communities; it defined a community as a core of central, authoritative pages linked by hub pages. This definition is somewhat fuzzy and does not provide clear-cut partitions of a set of websites, but it is interesting in that it was one of the first to recognize the importance of communities on the web and to propose an algorithm to define them. Shortly afterwards, Flake et al.
used a maximum-flow/minimum-cut algorithm to define the edges and nodes that act as boundaries between communities. Other algorithms detect partitions of the original set according to properties of links, as opposed to properties of nodes. One of these is the Girvan-Newman algorithm (see Merelo et al. 2004 or Radicchi et al. 2003 for the state of the art and references), which detects links that, when removed, isolate some part of the original set. Clusters, or communities, are then computed according to where these removed links are. This algorithm discovers communities quite efficiently, but, once again, it does not discover the internal structure of each com

[1] Teuvo Kohonen et al. The self-organizing map, 1990, Neurocomputing.

[2] Claudio Castellano et al. Defining and identifying communities in networks, 2003, Proceedings of the National Academy of Sciences of the United States of America.

[3] Alberto Prieto et al. Clustering web-based communities using self-organizing maps, 2004.