Posted by Nodus Labs | September 22, 2019
Generate a Word Cloud with a Context
Word clouds are used to generate visual summaries of text. However, they have one major deficiency: they completely lose track of the context. As a result, there is a risk of getting misleading or meaningless results and making the wrong assumptions about the underlying data. Text network analysis and visualization can solve this problem.
We’ve already presented this approach in our post on word cloud generation. Below we will explain how it is different from the standard word clouds. The text network approach also proposes a different way of thinking about text: not a cloud of concepts but a cloud of relations between them.
Standard Word Cloud Generator
For example, if we take the text of Obama’s 2013 inauguration speech and put it into one of the popular word cloud generators, we will get the following result:
The most frequently mentioned words are bigger, and the words are arranged in alphabetical order. We can understand from the picture above that Obama talked about the “people”, “american”, “citizens”, as well as something about “requires” and “freedom”, but that’s about it: just a collection of disjointed terms without a context. This post on Thematic describes all the shortcomings in more detail, but it is clear that word clouds are oversimplified and should probably only be used for decorative purposes.
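To make the limitation concrete, here is a minimal sketch of what a standard word cloud generator computes before drawing anything: a plain frequency count. The stopword list, the function name and the sample sentence are our own illustrative assumptions, not the code of any particular generator.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "we", "our"}

def word_frequencies(text):
    """Count how often each non-stopword appears, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

# Made-up sample sentence, not a quotation from the speech.
sample = "We the people believe that freedom requires the people to act."
freqs = word_frequencies(sample)

# A word cloud sizes each term by its count and lays the terms out
# (here: alphabetically). Word order and context are discarded.
for word in sorted(freqs):
    print(word, freqs[word])
```

Nothing in the resulting counts tells us which words were used together, which is exactly the information a reader needs to interpret them.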
Text Network Visualization: Word Cloud with a Context
So, how can we improve the word clouds? The best place to start is to introduce the context. Then we can see not only the most influential words but also how they are used together, providing a much better overview of the text’s meaning. One way to do that is to take into account the words’ co-occurrences or the n-grams (e.g. the bigrams), so we can see which words tend to appear next to each other.
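As a rough sketch of the bigram idea, the snippet below counts how often each pair of adjacent words occurs. The function name and the sample text are assumptions made for illustration; a real pipeline would usually also remove stopwords and normalize word forms first.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count pairs of words that appear directly next to each other."""
    words = re.findall(r"[a-z']+", text.lower())
    # zip(words, words[1:]) pairs every word with its right-hand neighbour.
    return Counter(zip(words, words[1:]))

# Made-up sample text, not a quotation.
sample = "the american people believe the american people act"
counts = bigram_counts(sample)

for (left, right), n in counts.most_common():
    print(left, right, n)
```

Unlike single-word frequencies, these pairs already carry a bit of context: we see not only that “people” is frequent, but also which words it appears next to.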
While there are many different ways to do that, text network visualization is a powerful one because it is based on graph theory, which provides a quantitative base layer for the qualitative insights we can derive from visualizations.
To visualize that same text as a network we will use the open-source tool InfraNodus. We will simply copy and paste the text into the system to visualize it as a network:
We can see now the most influential terms (which are the bigger nodes on the graph), as well as the relations between them — the context. The words that belong to the same clusters are shown with the same colors close to each other (based on the modularity community detection algorithm from graph theory).
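The general idea of turning a text into a network can be sketched in a few lines. This is only an illustration under our own assumptions: it links words that appear within a small window of each other and uses weighted degree as a simple stand-in for influence. It does not reproduce the ranking measure InfraNodus uses, and it leaves out the modularity-based community detection entirely.

```python
import re
from collections import defaultdict

def cooccurrence_graph(text, window=1):
    """Link each word to the `window` words that follow it.

    Returns a nested dict: graph[a][b] is the number of times
    a and b co-occurred (edges are undirected).
    """
    words = re.findall(r"[a-z']+", text.lower())
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word][other] += 1
                graph[other][word] += 1
    return graph

def weighted_degree(graph):
    """Sum of edge weights per node: a simple proxy for influence."""
    return {word: sum(nbrs.values()) for word, nbrs in graph.items()}

# Made-up, already stopword-free sample, not a quotation.
sample = "people believe american people require principle require time"
graph = cooccurrence_graph(sample)
degree = weighted_degree(graph)

# Nodes with a higher score would be drawn bigger.
for word in sorted(degree, key=degree.get, reverse=True):
    print(word, degree[word], dict(graph[word]))
```

Each node now comes with its neighbours, so the “size” of a word is always shown together with the words it was used with.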
For example, we can see that the word “require” was used in conjunction with the words “principle” and “time”, so we can now see what exactly is required.
We can also see that “believe” is quite an important word, which was not previously detected by the word cloud generator. It is connected to the frequently used “american” and “people”, which highlights the ideological and unifying connotation of the context in which these words were used.
The Interactive Word Cloud Text Network
As we’ve shown above, text network visualizations generate more precise word clouds that take the context into account. However, they can also provide interactive features that standard word clouds lack.
We often see word clouds that are used in navigational elements: click on the term and you’ll see all the articles tagged with this term. So the only thing you can do is to click on “Nation” and see all the other political speeches that contain this term: that is, all of them 🙂
Alternatively, a text network graph can provide a much more precise way to search through a text or a corpus of texts. When we select a specific word, for example “nation”, the graph will show us all the words it’s connected to. If we then click on “technology”, we get an excerpt of the text that contains those two terms simultaneously: a passage about sustainable energy development.
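The two interactions described here (select a word to see its connections, then add a second word to get the matching excerpts) can be approximated with a small sketch. We assume, for simplicity, that two words are connected when they share a sentence; the function names and the sample text are hypothetical and do not show how InfraNodus implements its search.

```python
import re

def sentences(text):
    """Split text into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    """Lower-cased set of words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def connected_words(text, term):
    """All words that share at least one sentence with `term`."""
    linked = set()
    for sentence in sentences(text):
        words = tokens(sentence)
        if term in words:
            linked |= words - {term}
    return linked

def excerpts_with(text, *terms):
    """Sentences that contain every one of the given terms."""
    return [s for s in sentences(text) if set(terms) <= tokens(s)]

# Made-up sample text, not a quotation from the speech.
sample = (
    "Our nation needs new technology for sustainable energy. "
    "The nation honors its citizens. "
    "Technology alone is not enough."
)

print(sorted(connected_words(sample, "nation")))
print(excerpts_with(sample, "nation", "technology"))
```

Selecting one term widens the view to its neighbours, and adding a second term narrows it down to the passages where both are actually used together.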
If we then search for the other presidential speeches that contain the same terms, “nation” and “technology”, we will see that Obama’s 2009 speech also used these two terms:
If we then click on that graph, “obama2009”, we will see that in his 2009 address Obama used these words in a context connected to “health” and “healthcare”, not sustainable energy. So the focus has shifted.
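The same kind of search across several texts can be sketched as follows. Everything here is an assumption made for illustration: the corpus is a plain dictionary, its keys are hypothetical names, and its texts are invented placeholder sentences rather than quotations from any speech.

```python
import re

def tokens(text):
    """Lower-cased set of words in a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def texts_containing(corpus, *terms):
    """Names of the texts that contain every one of the given terms."""
    return [name for name, text in corpus.items() if set(terms) <= tokens(text)]

def shared_context(text, terms, stopwords=frozenset()):
    """Words that appear in the same sentence as all of the given terms."""
    wanted = set(terms)
    context = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = tokens(sentence)
        if wanted <= words:
            context |= words - wanted - stopwords
    return context

# Hypothetical corpus with invented placeholder sentences.
corpus = {
    "speech_a": "The nation will lead in technology for sustainable energy.",
    "speech_b": "Our nation will use technology to improve health care.",
    "speech_c": "The nation honors its citizens.",
}

terms = ("nation", "technology")
for name in texts_containing(corpus, *terms):
    print(name, sorted(shared_context(corpus[name], terms)))
```

Comparing the context words per text is what lets us notice a shift in focus between texts that use the very same pair of terms.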
A simple word cloud generator cannot provide this level of insight about the text’s context and would not be able to provide the same level of precision when searching through several texts, finding connections between them.
You can try the text network visualization of word clouds on www.infranodus.com or download a free open-source standalone version on your computer from github.com/noduslabs/infranodus