XOXO Gossip Insights
Another Way to Visualise Keywords
The aim of my master's thesis was to design and implement a prototype that extracts and visualises trending topics from social data and enables analysts and social media managers to explore these trends in context. For this purpose, I developed a multi-stage process that extracts significant keywords from a collection of tweets and ranks them using various heuristics and a graph; the same graph is also used to visualise the keywords.
The Motivation in a Nutshell
Brandwatch Analytics is software that enables real-time analysis of social media data by collecting mentions. To do so, the application continually gathers data using queries defined by customers or analysts. When the line charts that visualise the volume of mentions per time interval show significant peaks, it can be very useful to quickly identify the most discussed topics in order to react to trends promptly.
A word cloud lets the user identify the most common terms in the peak data. Although the word cloud provides a good overview of discussed topics and a wide range of insights into trends, such as the visualisation of growth or sentiment, some disadvantages were identified in cooperation with data scientists and research analysts.
Since the axes of a word cloud usually have no particular meaning, terms are arranged more or less at random. Because the visualisation is comparatively simple, a lot of mostly context-related information is lost. As a result, word clouds often contain several terms that belong to the same topic without this being immediately apparent.
Moreover, the algorithm that extracts the terms from the corpus is mainly based on term frequency. As a result, terms from wide-spread discussions are often suppressed by retweet-related terms. That's because the choice of words in wide-spread conversations tends to vary more, so the frequency of these terms is usually lower than that of terms from retweets.
The Data Science-ish Part
First, all texts are pre-processed and normalised to compensate for specific characteristics of social snippets and to generalise the algorithm better. This includes removing corrupted Unicode symbols, email addresses, phone numbers and URLs as well as transliterating all characters and flattening contractions.
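To make the normalisation step a bit more tangible, here is a minimal sketch of what such a pre-processing function could look like. The regular expressions, the tiny contraction table and the function name `normalise` are my own illustrative assumptions, not the rules used in the prototype.

```python
import re
import unicodedata

# Illustrative patterns only; real-world rules would be more thorough.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")
# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def normalise(text: str) -> str:
    """Strip noise from a social snippet and flatten contractions."""
    text = text.replace("\ufffd", "")    # drop Unicode replacement characters
    text = text.replace("\u2019", "'")   # unify curly apostrophes
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    # Transliterate: decompose accented characters, then drop non-ASCII marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    for short, full in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


print(normalise("Caf\u00e9 don\u2019t miss https://t.co/abc mail me@example.com"))
```

URLs are removed before email addresses so that an `@` inside a link is not mistaken for an address.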
As the last step of data preparation, the data is restructured to simplify the subsequent steps. The restructuring takes place in two steps: grouping the mentions by day and flattening the hierarchy of these groups. These mentions are not the mentions of the peak itself but background data, which is used to draw more peak-specific conclusions by comparing its statistics with those of the peak. For this purpose, the mentions grouped by day can be merged into so-called pseudo-documents.
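The grouping into daily pseudo-documents could be sketched roughly as follows. The mention structure (a dict with `created_at` and `text`) and the function name are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def pseudo_documents_by_day(mentions):
    """Group mention texts by calendar day and join each group into one document."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[mention["created_at"].date()].append(mention["text"])
    # Flatten the hierarchy: one plain string per day.
    return {day: " ".join(texts) for day, texts in sorted(grouped.items())}


background = [
    {"created_at": datetime(2019, 3, 1, 9, 30), "text": "first mention"},
    {"created_at": datetime(2019, 3, 1, 18, 5), "text": "second mention"},
    {"created_at": datetime(2019, 3, 2, 7, 45), "text": "third mention"},
]
documents = pseudo_documents_by_day(background)
```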
Twitter Thread Tree
To take threads on Twitter into account, thread information must be available for each tweet. To create the thread tree, the superordinate tweets are requested from the Twitter API. Superordinate tweets are tweets that were replied to as well as tweets that were retweeted with a comment. This step is repeated until no superordinate tweet is available. Once the tweets are grouped into threads, pseudo-documents can be created again. Both the pseudo-documents and the thread volumes, i.e. the number of tweets per thread, make it possible to define co-occurrences in threads later and to calculate the corresponding weights.
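The iterative walk up to the top of a thread might look like the sketch below. To keep it self-contained, a plain dictionary `parent_of` stands in for the lookups against the Twitter API; this dictionary and both function names are hypothetical.

```python
from collections import defaultdict


def thread_root(tweet_id, parent_of):
    """Follow parent links until no superordinate tweet is available."""
    seen = set()
    while parent_of.get(tweet_id) is not None and tweet_id not in seen:
        seen.add(tweet_id)  # guards against accidental cycles in the data
        tweet_id = parent_of[tweet_id]
    return tweet_id


def group_threads(tweets, parent_of):
    """Return per-thread pseudo-documents and thread volumes."""
    threads = defaultdict(list)
    for tweet_id, text in tweets.items():
        threads[thread_root(tweet_id, parent_of)].append(text)
    docs = {root: " ".join(texts) for root, texts in threads.items()}
    volumes = {root: len(texts) for root, texts in threads.items()}
    return docs, volumes


tweets = {1: "original", 2: "a reply", 3: "reply to the reply", 4: "unrelated"}
parent_of = {2: 1, 3: 2}  # tweet 3 answers tweet 2, which answers tweet 1
thread_docs, thread_volumes = group_threads(tweets, parent_of)
```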
Extraction of Keyword Candidates
In addition to the extraction of usernames and hashtags, the following POS tag patterns are the basis for keyword extraction. Nouns and proper nouns are treated interchangeably, since the model tags proper nouns mostly based on capitalisation, which is often ignored on social media.
// Nouns
(ADJ)?(NOUN|PROPN)*(STOP|X)?(NOUN|PROPN)+
// Numerics
(SYM)?(NUM)+(SYM)?(NOUN)*
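One way to apply such patterns is to map every tag to a single character, so that a regular expression over the resulting string yields token offsets directly. The sketch below assumes the text has already been tagged (including a `STOP` tag for stop words); the tag codes and the function name are my own choices for illustration.

```python
import re

# One character per tag, so match offsets equal token indices.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PROPN": "P", "STOP": "S",
             "X": "X", "SYM": "Y", "NUM": "M"}
NOUN_PATTERN = re.compile(r"A?[NP]*[SX]?[NP]+")   # (ADJ)?(NOUN|PROPN)*(STOP|X)?(NOUN|PROPN)+
NUMERIC_PATTERN = re.compile(r"Y?M+Y?N*")         # (SYM)?(NUM)+(SYM)?(NOUN)*


def extract_candidates(tagged):
    """Return keyword candidates from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    candidates = []
    for pattern in (NOUN_PATTERN, NUMERIC_PATTERN):
        for match in pattern.finditer(codes):
            words = [word for word, _ in tagged[match.start():match.end()]]
            candidates.append(" ".join(words))
    return candidates


tagged = [("the", "DET"), ("new", "ADJ"), ("iPhone", "PROPN"),
          ("costs", "VERB"), ("$", "SYM"), ("999", "NUM")]
print(extract_candidates(tagged))
```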
For these terms, the frequency within the mentions is then determined, taking word boundaries into account. Additionally, a subsumption count is determined, which states how many other terms contain a given term. This subsumption count is used later to reduce the weighting of terms that offer less context, thus minimising overlaps and duplicates.
To further reduce duplicates, a second representation of each term is created: the lemma of the keyword with all whitespace removed. It is used to merge keyword candidates whose representations overlap. The previously determined heuristics, such as the frequency and the subsumption count, have to be merged as well.
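Both heuristics can be sketched with a boundary-aware regular expression. The function names are hypothetical, and case-insensitive matching is an assumption of this example.

```python
import re


def _bounded(term):
    """Pattern that matches the term only as a whole word or phrase."""
    return re.compile(rf"(?<!\w){re.escape(term)}(?!\w)", re.IGNORECASE)


def term_frequency(term, texts):
    """Count occurrences of a term across texts, respecting word boundaries."""
    pattern = _bounded(term)
    return sum(len(pattern.findall(text)) for text in texts)


def subsumption_counts(terms):
    """For each term, count how many other terms contain it."""
    counts = {}
    for term in terms:
        pattern = _bounded(term)
        counts[term] = sum(1 for other in terms
                           if other != term and pattern.search(other))
    return counts


texts = ["The iPhone launch was big", "iPhones everywhere, new iPhone launch"]
terms = ["iPhone", "iPhone launch", "new iPhone launch", "launch"]
frequencies = {term: term_frequency(term, texts) for term in terms}
subsumed_by = subsumption_counts(terms)
```

Note that "iPhones" does not count towards "iPhone" here, which is exactly the effect of respecting word boundaries.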
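A rough sketch of this merging step follows. How exactly the heuristics are combined is not spelled out above, so summing the frequencies and keeping the larger subsumption count is purely an assumption here, as are the toy lemmatiser and all names.

```python
def merge_candidates(candidates, lemmatise):
    """Merge candidates that share the same whitespace-free lemma key."""
    merged = {}
    for candidate in candidates:
        key = "".join(lemmatise(candidate["term"]).lower().split())
        if key not in merged:
            merged[key] = dict(candidate)  # the first spelling seen is kept
            continue
        kept = merged[key]
        kept["frequency"] += candidate["frequency"]                # assumption: sum
        kept["subsumption"] = max(kept["subsumption"],
                                  candidate["subsumption"])        # assumption: max
    return list(merged.values())


def toy_lemmatise(term):
    """Stand-in for a real lemmatiser: strips a trailing plural 's'."""
    return " ".join(word.rstrip("s") for word in term.split())


candidates = [
    {"term": "word cloud", "frequency": 3, "subsumption": 1},
    {"term": "wordclouds", "frequency": 2, "subsumption": 0},
    {"term": "Word Clouds", "frequency": 1, "subsumption": 2},
]
merged = merge_candidates(candidates, toy_lemmatise)
```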
Ranking and Selection of Keywords
In a second step, the keywords are ranked and selected using a variation of TF-IDF. To favour longer n-grams, the TF-IDF score is extended by the square root of n. On its own, this non-linear factor has only a minor influence on the result, but it reinforces the effect of the subsumption count.
Next, the keywords are filtered: those with a score below 0.005 are removed, and of the remaining ones, up to 150 with the highest scores are selected.
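As an illustration, the length-aware score could be computed like this. I am assuming that "extended by" means multiplication, that n is the number of words in the term, and that IDF is computed over the background pseudo-documents with simple smoothing; the subsumption penalty is left out because its exact form is not given above.

```python
import math


def ngram_tfidf(term, peak_count, peak_total, docs_with_term, n_docs):
    """TF-IDF multiplied by sqrt(n), where n is the number of words in the term."""
    tf = peak_count / peak_total
    # Smoothed IDF (an assumption) to avoid division by zero for unseen terms.
    idf = math.log((1 + n_docs) / (1 + docs_with_term)) + 1
    n = len(term.split())
    return tf * idf * math.sqrt(n)


unigram = ngram_tfidf("cloud", peak_count=5, peak_total=100, docs_with_term=2, n_docs=10)
bigram = ngram_tfidf("word cloud", peak_count=5, peak_total=100, docs_with_term=2, n_docs=10)
```

With otherwise identical statistics, the bigram scores higher than the unigram by a factor of the square root of 2.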
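The filter itself is small; a sketch using the threshold and limit mentioned above (the function name and the input format are assumptions):

```python
def select_keywords(scores, min_score=0.005, limit=150):
    """Drop keywords scoring below min_score, then keep the top `limit` by score."""
    kept = [(term, score) for term, score in scores.items() if score >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:limit]


scores = {"iphone launch": 0.12, "launch": 0.03, "the": 0.001, "apple": 0.08}
selected = select_keywords(scores, limit=2)
```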
Graph Creation and Clustering
The creation of the graph and the corresponding community detection take several steps. In addition to the actual creation and clustering, this includes filtering, customisations and optional steps. The edges and their weights, which are derived from the co-occurrences in threads, are used to create an undirected graph. The resulting graph can be processed directly, and its communities are detected using the Louvain method.
The weight of a node is the product of its previously calculated TF-IDF score and its degree. This is based on the observation that central nodes in the network are often keyword candidates. Furthermore, the most central node of each community is defined as its main topic. Afterwards, the generated nodes and communities are cleaned up by removing all communities that consist exclusively of hashtags or Twitter handles. Such communities tend to be spam and are therefore negligible.
To make the visualisation simpler and more transparent, the number of edges and nodes has to be reduced to a reasonable level. Thus, the nodes are filtered again by selecting at most thirty nodes with the highest weights. This keeps the necessary contextual information between the nodes while making the graph easier to read.
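The edge construction could be sketched as follows. The exact weighting formula is not stated above, so adding up the thread volume for every thread in which two keywords co-occur is an assumption, as are the names and the input format.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(thread_keywords, thread_volumes):
    """Build undirected, weighted edges from keyword co-occurrences in threads."""
    edges = Counter()
    for thread_id, keywords in thread_keywords.items():
        # Sorting gives every undirected pair one canonical orientation.
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += thread_volumes.get(thread_id, 1)
    return edges


thread_keywords = {"t1": ["apple", "iphone", "launch"], "t2": ["iphone", "apple"]}
thread_volumes = {"t1": 3, "t2": 2}
edges = cooccurrence_edges(thread_keywords, thread_volumes)
```

If NetworkX is available, a graph built from these edges could then be passed to its `louvain_communities` function; whether the prototype relied on that library is not stated here.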
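Assuming the communities have already been detected, the node weighting and clean-up might look like this sketch. Picking the main topic by highest node weight is my interpretation of "most central", and all names are hypothetical.

```python
from collections import Counter


def weight_nodes(tfidf, edges):
    """Node weight = TF-IDF score multiplied by the node's degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {node: score * degree[node] for node, score in tfidf.items()}


def clean_communities(communities, weights):
    """Drop hashtag/handle-only communities and pick a main topic for the rest."""
    cleaned = []
    for members in communities:
        if all(member.startswith(("#", "@")) for member in members):
            continue  # likely spam
        main_topic = max(members, key=lambda member: weights.get(member, 0.0))
        cleaned.append({"main_topic": main_topic, "members": sorted(members)})
    return cleaned


tfidf = {"apple": 0.2, "iphone": 0.1, "launch": 0.05, "#win": 0.3, "@bot": 0.3}
edges = {("apple", "iphone"): 5, ("apple", "launch"): 3, ("#win", "@bot"): 9}
weights = weight_nodes(tfidf, edges)
communities = clean_communities([{"apple", "iphone", "launch"}, {"#win", "@bot"}], weights)
```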
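The final reduction is essentially a top-k selection followed by dropping every edge that lost an endpoint. A minimal sketch, with assumed names and input format:

```python
def prune_graph(weights, edges, max_nodes=30):
    """Keep the heaviest nodes and only the edges between them."""
    keep = set(sorted(weights, key=weights.get, reverse=True)[:max_nodes])
    kept_edges = {pair: weight for pair, weight in edges.items()
                  if pair[0] in keep and pair[1] in keep}
    return keep, kept_edges


weights = {"apple": 0.4, "iphone": 0.1, "launch": 0.05}
edges = {("apple", "iphone"): 5, ("apple", "launch"): 3}
nodes, pruned_edges = prune_graph(weights, edges, max_nodes=2)
```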
A Demo Is Worth a 1000 Words
An exemplary graph with subclusters
I don't want to waste words explaining the visualisation itself. The only hint: it is a charge-based force-directed layout. And now, take a look at this demo, which is based on the same dataset as the above word cloud. It's fun — promised.
While researching approaches, technologies and implementations that deal with this problem, I found solutions only for individual aspects, not one that was entirely suitable. On closer examination, I picked out the approaches that seemed reasonable and promising in combination. The challenges became evident during the design and implementation of the pilot experiment:
- POS patterns-based keyword extraction in noisy social media data.
- Balanced and meaningful selection of keywords through ranking.
- Maintaining the visual balance between detail and overview.
Nevertheless, the final prototype provides a solid base which can already be used. Of course, there are dozens of ideas ready to improve extraction and visualisation.
P.S. thanks a million to Brandwatch, my internal supervisors Dr. Dan Chalmers & Yanick Nedderhoff, the Stuttgart-based Team Awesome, the Data Science team in Brighton as well as all the UX research participants. It was an honour.