Efforts Towards Automatically Generating Personas in Real-time Using Actual User Data

The use of personas is an interaction design technique with considerable potential for product and content development. A persona is a representation of a group or segment of users who share common behavioral characteristics. Although it represents a segment of users, a persona is generally developed as a detailed narrative about an explicit but fictitious individual who stands for the collection of users possessing similar behaviors or characteristics. To make the fictitious individual appear as a real person to product developers, the persona narrative usually contains a variety of demographic and behavioral details, such as socioeconomic status, gender, hobbies, family members, friends, and possessions, among many other data. The narrative normally also addresses the goals, needs, wants, frustrations, and other emotional aspects of the fictitious individual that are pertinent to the product being designed. However, personas have typically been viewed as fairly static. In this research, we demonstrate an approach for creating and validating personas in real time, based on automated analysis of actual user data. Our data collection site and research partner is AJ+ (http://ajplus.net/), a news channel from Al Jazeera Media Network that is natively digital, with a presence only on social media platforms and a mobile application. Its media concept is unique in that AJ+ was designed from the ground up to serve news in the medium of the viewer, rather than offering a teaser in one medium with a redirect to a website. In pursuit of our overall research objective of automatically generating personas in real time, for the research reported in this manuscript we are specifically interested in understanding the AJ+ audience by identifying (1) whom they are reaching (i.e., the market segment) and (2) what competitive (i.e., non-AJ+) content is associated with each market segment.
Focusing on one aspect of user behavior, we collect 8,065,350 instances of link sharing by 54,892 users of an online news channel, specifically examining the domains these users share. We then cluster users based on the similarity of the domains they share, identifying seven personas based on this behavioral aspect. We first conduct term frequency-inverse document frequency (tf-idf) vectorization. We remove outlier domains: those with fewer than 5 shares (too unique) and those appearing in more than 80% of all users' shares (too popular). We use K-means++ clustering (K = 2, ..., 10), a variant of K-means that improves the selection of initial seeds, because K-means++ works effectively on a very sparse (user-link) matrix. We use the “elbow” method to choose the optimal number of clusters, which is eight in this case. To characterize each cluster, we list the top 100 domains from each cluster and discover that there are large overlaps among clusters. We then remove from each cluster the domains that also appear in another cluster, in order to identify the relevant, unique, and impactful domains. This de-duplication eliminates one cluster, leaving us with a set of clusters in which each cluster is characterized by domains that are shared only by users within that cluster. We note that the K-means++ clustering method can easily be replaced with other clustering methods in various situations. The research results provide insights into competitive marketing, topic interests, and preferred system features for the users of the online news medium, and demonstrate that these insights can be used to develop personas in real time. Using the description of each of the shared links, we detect their languages. Of the users, 55.2% (30,294) share links in just one language and 44.8% share links in multiple languages.
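As an illustration only, the vectorization and clustering steps described in this section can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in this research: all function names (`tfidf_vectors`, `kmeans_pp`, `elbow_curve`) are hypothetical, the pruning thresholds are read here as document-frequency cut-offs over users, and the tf-idf weighting and L2 normalization are one common variant among several.

```python
import math
import random
from collections import Counter


def tfidf_vectors(user_domains, min_df=5, max_df_ratio=0.8):
    """Map each user to a sparse, L2-normalized {domain: weight} tf-idf vector.

    Assumption: a domain is pruned when fewer than `min_df` users share it
    or when more than `max_df_ratio` of all users share it.
    """
    n_users = len(user_domains)
    df = Counter()
    for domains in user_domains.values():
        df.update(set(domains))
    vocab = {d for d, c in df.items()
             if c >= min_df and c / n_users <= max_df_ratio}
    vectors = {}
    for user, domains in user_domains.items():
        tf = Counter(d for d in domains if d in vocab)
        vec = {d: n * math.log(n_users / df[d]) for d, n in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:  # users left with no informative domain are dropped
            vectors[user] = {d: w / norm for d, w in vec.items()}
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def kmeans_pp(vectors, k, n_iter=50, seed=0):
    """K-means with k-means++ seeding; returns ({user: label}, inertia)."""
    rng = random.Random(seed)
    users = sorted(vectors)
    points = [vectors[u] for u in users]
    # k-means++ seeding: the first center is uniform at random, each later
    # one is drawn with probability proportional to the squared distance
    # to the closest center chosen so far.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # fewer distinct points than requested clusters
            break
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    # Lloyd iterations: assign points, then recompute each center as a mean.
    for _ in range(n_iter):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                updated.append(old)
                continue
            total = Counter()
            for m in members:
                for d, w in m.items():
                    total[d] += w
            updated.append({d: w / len(members) for d, w in total.items()})
        if updated == centers:
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return dict(zip(users, labels)), inertia


def elbow_curve(vectors, k_values=range(2, 11), seed=0):
    """Within-cluster sum of squares for each K, to be inspected for an elbow."""
    return {k: kmeans_pp(vectors, k, seed=seed)[1] for k in k_values}
```

Here `user_domains` is assumed to map each user identifier to the list of domains that user shared. The elbow is then read off the values returned by `elbow_curve`, as the K beyond which the within-cluster sum of squares stops decreasing sharply. At the scale reported here, an optimized sparse-matrix library would be a more realistic choice than this pure-Python loop.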
The most frequently used language is English (31.98%), followed by German (5.69%), Spanish (5.02%), French (4.75%), Italian (3.46%), Indonesian (2.99%), Portuguese (2.94%), Dutch (2.94%), Tagalog (2.71%), and Afrikaans (2.69%). As there were millions of domains shared, we utilize the top one hundred domains for each cluster, resulting in 700 top domains shared by the 54,892 AJ+ users. As mentioned, we de-duplicate these domains, which eliminates one cluster (11,011 users, 20.06%). We therefore have seven unique clusters based on the sharing of domains, representing 43,881 users. We then demonstrate how these findings can be leveraged to generate real-time personas based on actual user data. We stream the results of the data analysis into a relational database and combine them with other demographic data that we gleaned from available sources, such as Facebook and other social media accounts, using each of the seven clusters as representative of a persona. We give each persona a fictional name and use a stock photo as its face. Each persona is linked to the top alternate (i.e., non-AJ+) domains it most commonly shares, and each persona's shared links can be updated as new data arrive. The research implication is that personas can be generated in real time, instead of being the result of a laborious, time-consuming development process.
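The characterization and de-duplication of clusters described above can likewise be sketched in a few lines of Python. Again, this is only an illustrative sketch: the function names `top_domains` and `unique_domains` are hypothetical, and we assume here that a cluster is eliminated when none of its top domains remains after de-duplication, which is one plausible reading of the procedure.

```python
from collections import Counter


def top_domains(shares, labels, n=100):
    """List the n most shared domains per cluster.

    shares: iterable of (user, domain) pairs, one per sharing instance.
    labels: mapping from user to cluster id.
    """
    counts = {}
    for user, domain in shares:
        if user in labels:
            counts.setdefault(labels[user], Counter())[domain] += 1
    return {c: [d for d, _ in cnt.most_common(n)]
            for c, cnt in counts.items()}


def unique_domains(top):
    """Keep, per cluster, only domains absent from every other cluster's list.

    Assumption: a cluster left with no unique domain is dropped.
    """
    seen = Counter(d for domains in top.values() for d in set(domains))
    unique = {c: [d for d in domains if seen[d] == 1]
              for c, domains in top.items()}
    return {c: ds for c, ds in unique.items() if ds}
```

Each surviving cluster, together with its unique domains, could then be stored as one persona record and refreshed whenever new sharing data arrive.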