Scalable Group Detection via a Mutual Information Model

A major problem in the area of link discovery is the discovery of hidden organizational structure such as groups and their members [5]. The group detection task can be divided into (1) discovering hidden members of known groups (group extension) and (2) identifying completely unknown groups. Adibi et al. [1] describe the KOJAK Group Finder (GF) system, which uses a novel mutual information (MI) approach combined with logic-based reasoning to find hidden groups and group members in large evidence databases. In this paper we report on the wider applicability and scalability of the GF by applying it to a variety of synthetic datasets that contain up to 7,500,000 links.

The GF detects groups in four phases. (1) A logic-based group seed generator analyzes the evidence and outputs a set of seed groups using deductive and abductive reasoning. (2) An MI model finds likely new candidates for each group, producing an extended group. (3) The MI model is used to rank these likely members by how strongly connected they are to the seed members. (4) The ranked extended group is pruned using a threshold to produce the final output.

After phase 1 has completed and seed groups have been generated from the available evidence, the GF tries to identify additional members by looking for people who are strongly connected to one or more of the seed members. To decide whether two entities are strongly connected, we aggregate the known links between them and statistically contrast these links with their connections to other candidates and to the general population. This is done by an MI model that exploits evidence such as individuals sharing an attribute (e.g., their address) or being involved in the same activity (e.g., communicating via email). These attributes and actions are represented as random variables, and connection strength is measured as the MI between them. If the variables (or entities) are independent, the MI between them is zero.
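To make the independence property concrete, the following minimal Python sketch estimates MI from paired observations using the standard Shannon definition for discrete variables. It is an illustration only and not the GF implementation: the function name `mutual_information` and the toy participation vectors are hypothetical, and the estimate is a simple plug-in count-based one.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of MI (in bits) between two equal-length
    sequences of discrete observations.

    Hypothetical helper for illustration; not part of the GF system.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")

    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        ratio = c * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log2(ratio)
    return mi


# Toy data (assumed): 1 means the person took part in an event, 0 means not.
person_a = [0, 0, 1, 1]
unrelated = [0, 1, 0, 1]   # participation pattern independent of person_a
associate = [0, 0, 1, 1]   # participation pattern identical to person_a

print(mutual_information(person_a, unrelated))  # 0.0
print(mutual_information(person_a, associate))  # 1.0
```

In this toy setting the unrelated pair scores zero, while the pair with identical participation patterns scores one bit, the maximum for a balanced binary variable. A candidate with a score like the second would rank ahead of one with a score like the first.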
If two variables are strongly dependent, the MI between them is large. The MI between X and Y is defined as: