Posted by Nodus Labs | July 10, 2021
Analysis of Typical Elements in Various Movie Genres using TNA
This study of Netflix's movie genres was prepared by Johnny Peng, an active InfraNodus user. We find this use case particularly interesting because it follows an easy-to-follow workflow and arrives at insights that reveal some peculiarities of our culture and its biases.
1. The Purpose and Preliminary Results
The purpose of this report is to discuss the effectiveness of using the InfraNodus text network analysis tool in understanding the commonalities between the different topics/themes in various movie genres.
We discovered some interesting correlations between movie genres and the vocabulary used in the descriptions of the movies from those genres. For instance, a horror movie is very likely to have a woman as the main protagonist, possibly because women are portrayed as more vulnerable, while a crime movie is likely to be focused on a murder.
These findings can reveal an implicit bias in cinema narratives and can also be useful for screenwriters and film aficionados who are interested in exploring how movies are made.
2. InfraNodus
InfraNodus is an open-source tool that automatically converts free text into network diagrams. Stop words in the text are filtered out automatically, and the remaining text is used to generate the Word Network.
3. Word Network
A Word Network is a node-link diagram inferred from free text. Each unique word in the text (except for stop words) becomes a node. An edge between two words means that they appeared in the same bi-gram or four-gram. The size of each node scales with its betweenness centrality. It is a great alternative to a Word Cloud because, in addition to showing the important words, it also shows the co-occurrence relationships between them.
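To make the construction above more concrete, here is a minimal sketch in plain Python. It is not the InfraNodus implementation: the stop word list, the function names, and the sample sentences are all assumptions made for illustration. It links words that fall inside the same four-word window and then computes unnormalised betweenness centrality with Brandes' algorithm.

```python
from collections import defaultdict, deque
import re

# Tiny illustrative stop word list (an assumption, not the list InfraNodus uses).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "who", "his", "her"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_word_network(texts, window=4):
    """Link every pair of distinct words that fall inside the same n-word window."""
    graph = defaultdict(set)
    for text in texts:
        words = tokenize(text)
        for i, word in enumerate(words):
            graph[word]  # make sure words without neighbours still become nodes
            for other in words[i + 1 : i + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return graph

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalised)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {v: [] for v in graph}
        paths = dict.fromkeys(graph, 0)
        dist = dict.fromkeys(graph, -1)
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected path was counted once from each of its two endpoints.
    return {v: s / 2 for v, s in score.items()}

# Made-up descriptions, used only to exercise the functions.
descriptions = [
    "A detective investigates a murder case",
    "A young cop tries to solve the murder",
]
graph = build_word_network(descriptions, window=4)
centrality = betweenness(graph)
for word, value in sorted(centrality.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(value, 2))
```

Setting `window=2` restricts edges to adjacent words (bi-grams). A graph library would normally replace the hand-written centrality function; it is spelled out here only to keep the sketch self-contained.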
4. Data sets
The dataset used for this report comes from Kaggle (link below):
https://www.kaggle.com/shivamb/netflix-shows
According to the dataset description, the data consists of TV shows and movies available on Netflix as of 2019, and was originally sourced from Flixable. The original dataset contains 6234 rows and 12 columns, and consists of a mixture of numerical, categorical, and free-text entries.
We will use only the last 2 columns of the dataset, which are:
- listed_in – A multi-label categorical variable representing the genres assigned to each title, with the genres separated by commas.
- description – Free text containing a general description of each title (one per row).
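As a rough sketch of how these two columns could be read with Python's standard `csv` module: the column names `listed_in` and `description` come from the dataset, while the function name and the two sample rows are made up so that the example runs on its own.

```python
import csv
import io

def load_genres_and_descriptions(csv_file):
    """Return (genres, description) pairs from an open CSV file object."""
    rows = []
    for row in csv.DictReader(csv_file):
        # listed_in holds comma-separated genre labels, e.g. "Dramas, Thrillers".
        genres = [g.strip() for g in row["listed_in"].split(",") if g.strip()]
        rows.append((genres, row["description"].strip()))
    return rows

# A two-row stand-in for the real file (invented rows, not taken from the dataset).
sample = io.StringIO(
    "title,listed_in,description\n"
    'Example A,"Horror Movies, Thrillers",A young woman moves into a haunted home.\n'
    "Example B,Crime TV Shows,A detective investigates a murder case.\n"
)
rows = load_genres_and_descriptions(sample)
print(rows[0])
```

For the downloaded file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the Kaggle CSV was saved.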
5. Data Preprocessing
Genre Grouping
In the data, it was found that there were 42 unique genres. However, many genres were similar and thus could be grouped together. For instance, ‘Romantic Movies’ and ‘Romantic TV Shows’ could be merged into one genre.
To do this, a Python dictionary was created with the genre names as keys and a corresponding integer id as the value. Genres that share the same id in the dictionary are grouped together. For instance, “Documentaries” and “Docuseries” are both mapped to the integer 8, which means that these 2 genres were merged. Uninformative genres, such as those titled “Movies” and “TV Shows”, are excluded from this analysis.
After filtering and grouping similar genres together, 27 unique genres remain (shown below), and 28 txt files were created: one for each of the 27 genres plus 1 txt file containing the text from all genres. These 28 txt files were then imported into InfraNodus to create 28 corresponding Word Networks.
{'Action & Adventure': 0,
'Anime Features': 1,
'Anime Series': 1,
'British TV Shows': 2,
'Children & Family Movies': 3,
'Classic & Cult TV': 4,
'Classic Movies': 5,
'Comedies': 6,
'Crime TV Shows': 7,
'Cult Movies': 4,
'Documentaries': 8,
'Docuseries': 8,
'Dramas': 9,
'Faith & Spirituality': 10,
'Horror Movies': 11,
'Independent Movies': 12,
"Kids' TV": 13,
'Korean TV Shows': 14,
'LGBTQ Movies': 15,
'Music & Musicals': 16,
'Reality TV': 17,
'Romantic Movies': 18,
'Romantic TV Shows': 18,
'Sci-Fi & Fantasy': 19,
'Science & Nature TV': 20,
'Spanish-Language TV Shows': 21,
'Sports Movies': 22,
'Stand-Up Comedy': 23,
'Stand-Up Comedy & Talk Shows': 23,
'TV Action & Adventure': 0,
'TV Comedies': 6,
'TV Dramas': 9,
'TV Horror': 11,
'TV Mysteries': 24,
'TV Sci-Fi & Fantasy': 19,
'TV Thrillers': 25,
'Teen TV Shows': 26,
'Thrillers': 25}
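The grouping step can be sketched as follows. This is an illustration under assumptions rather than the author's script: only an excerpt of the mapping is used, the function names and output file names are hypothetical, and titles whose labels are all excluded are assumed to be left out of the combined file as well.

```python
from collections import defaultdict
from pathlib import Path

# Excerpt of the genre-to-id mapping; labels missing from it (e.g. "Movies") are skipped.
GENRE_TO_ID = {
    "Documentaries": 8,
    "Docuseries": 8,
    "Horror Movies": 11,
    "TV Horror": 11,
    "Thrillers": 25,
    "TV Thrillers": 25,
}

def group_descriptions(rows, genre_to_id):
    """Collect descriptions per merged genre id, plus one list covering all genres."""
    by_group = defaultdict(list)
    all_texts = []
    for listed_in, description in rows:
        labels = [label.strip() for label in listed_in.split(",")]
        # A set keeps a title from being added twice to the same merged genre.
        group_ids = {genre_to_id[label] for label in labels if label in genre_to_id}
        for group_id in sorted(group_ids):
            by_group[group_id].append(description)
        if group_ids:
            all_texts.append(description)
    return dict(by_group), all_texts

def write_text_files(by_group, all_texts, out_dir):
    """Write one txt file per merged genre and one for all genres (hypothetical names)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for group_id, texts in by_group.items():
        (out / f"genre_{group_id}.txt").write_text("\n".join(texts), encoding="utf-8")
    (out / "all_genres.txt").write_text("\n".join(all_texts), encoding="utf-8")

# Made-up rows in the shape (listed_in, description).
rows = [
    ("Horror Movies, Thrillers", "A young woman moves into a haunted home."),
    ("Docuseries", "A series about deep-sea life."),
    ("Movies", "A label that is excluded from the analysis."),
]
by_group, all_texts = group_descriptions(rows, GENRE_TO_ID)
print(sorted(by_group))
```

Because a title can carry several labels, its description is written to every genre file it belongs to, which is worth keeping in mind when reading the per-genre graphs.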
6. Results & Discussion
6.1 Genre Analysis – Crime
The following graph is the Word Network produced by InfraNodus from the descriptions in the Crime genre. It shows the following insights:
- The most influential elements for Crime movies are Murder, Crime, Drug, and Solve, which is in line with expectations.
- Clusters of closely related topics that fall into the same topical group (e.g. Murder, Solve, Case).
Overall, I think InfraNodus did a good job of producing this graph, because we can almost picture the classic setting of a Crime Movie / Drama just from the large nodes (with text) shown in the graph below. For example:
- First, you need to have some sort of “Crime” happening; “Murder” is a common one.
- Then it becomes a “Case”, which a “Detective” / “Police” / “Cop” will try to “Investigate” and “Solve”.
- It might be good to add some “Drug”-related elements, but they are not essential.
6.2 Genre Analysis – Horror
The following graph is the Word Network produced by InfraNodus from the descriptions of Horror Movies. It shows the following insights:
- The most influential elements for Horror movies are Home, Young, Woman, and Evil.
- Clusters of closely related topics that fall into the same topical group.
The most influential elements shown here are interesting, and we might be able to infer the following hypotheses:
- There is always some sort of evil force in a horror movie, e.g. “Ghost”, “Deadly Demon”.
- Horror movies like to cast someone “Young”, and likely a “Woman”, as the main character, since such characters are considered to be more vulnerable.
- The common setting of horror movies is the characters’ home: mysterious things happening in your own home are likely to be scarier than the same things happening anywhere else.
6.3 Limitations
The main issue with these visualizations is that one movie can be categorised into multiple genres, so the Word Networks of different genres are correlated with each other. This leads to some generic topics, such as “Man” and “Woman”, being the most influential topics in a lot of genres.
In addition, the word network we produced for the texts from all genres (shown below) again indicates that “Love”, “Woman”, and “Man” are the main topical groups.
This is to be expected, as a “Man” or a “Woman” is the main subject of most movies / dramas, and “Love” is a classic theme that fits well in almost all genres of movies / dramas.
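One rough way to spot such generic words outside InfraNodus is to check which words rank near the top in several genres at once. The sketch below uses plain word frequency as a stand-in for betweenness centrality, which is a simplification; the function names, thresholds, and toy texts are assumptions for illustration only.

```python
from collections import Counter
import re

def top_words(text, k):
    """Return the k most frequent words in a text (frequency as a proxy for influence)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word for word, _ in counts.most_common(k)}

def shared_top_words(genre_texts, k=3, min_genres=2):
    """Words that rank in the top k of at least min_genres genres."""
    seen = Counter()
    for text in genre_texts.values():
        seen.update(top_words(text, k))
    return sorted(word for word, n in seen.items() if n >= min_genres)

# Toy texts standing in for the per-genre description files.
genre_texts = {
    "horror": "woman woman home evil",
    "romance": "woman woman love love man",
    "crime": "murder murder case cop",
}
generic = shared_top_words(genre_texts, k=2, min_genres=2)
print(generic)
```

Words returned by such a check are candidates for closer inspection, not an automatic verdict: a word can be frequent in many genres and still carry genre-specific meaning.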
To address this problem, we can simply remove the most influential and obvious words from the graph by clicking on them and adding them to the stop words list. In our case, these are the words “woman”, “man”, “love”, “girl”, “young”, “life”, “series” and “story”. Removing them reveals the topics that were hiding behind them, as shown in the graph below.
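In InfraNodus this is done through the interface. For readers who prefer to pre-filter their text files before importing them, the same idea can be approximated in a few lines of Python. The word list is the one named above; the function name and the sample sentence are illustrative assumptions.

```python
import re

# The words named in the text above, treated as additional stop words.
EXTRA_STOP_WORDS = {"woman", "man", "love", "girl", "young", "life", "series", "story"}

def remove_words(text, stop_words):
    """Drop every word found in stop_words (case-insensitive); return the remaining words."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() not in stop_words]

# Made-up sentence used only to show the effect of the filter.
kept = remove_words("A young woman uncovers a family secret", EXTRA_STOP_WORDS)
print(" ".join(kept))
```

Note that this simple filter also strips punctuation, so it suits text that is only going to be turned into a word network afterwards.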
7. Conclusion
Overall, the Word Network produced by InfraNodus is an easy-to-use, powerful alternative to a Word Cloud: it provides contextual information about how words relate to each other, which Word Clouds cannot display. In addition, InfraNodus has many built-in analytical functions that can help you detect deeper structural relationships in the network.
If you are interested in seeing more details of the Word Networks produced from the Netflix movie descriptions, you can have a look at them via the link below:
https://infranodus.com/woshipjl
This article was written by Johnny Peng, who is an active user of InfraNodus. If you would like to contact him, you can do so via his LinkedIn profile: https://www.linkedin.com/in/johnny-peng-b09365b9/
To try InfraNodus with your own datasets, please sign up for an account on www.infranodus.com