Netflix: A Data-Driven Approach

Analyzing Trends and Viewer Preferences

By 🐼 panData · Published in Level Up Coding · 16 min read · May 5, 2024

In the rapidly evolving landscape of streaming services, the integration of robust data analysis is pivotal.

Through a comprehensive data-driven examination, we uncover insights that not only highlight current preferences and trends but also predict future shifts in viewer consumption and content creation.

Methodology

The dataset used for this analysis was sourced from the Netflix Movies and TV Shows dataset available on Kaggle.

This comprehensive dataset provides a broad overview of the content available on Netflix, including details on genres, release years, and more.

Example of Treemap Analysis by 🐼 panData

Our analytical approach utilized several key Python libraries including Pandas for data manipulation and Plotly for data visualization. These tools were instrumental in facilitating a detailed exploration of the dataset.

Data Loading and Encoding

To begin our analysis, we load the Netflix dataset using Pandas in Python. The attempt is first made with UTF-8 encoding, which is compatible with a wide range of characters.

If a UnicodeDecodeError occurs, indicating encoding issues, we fall back to ISO-8859-1, also known as Latin-1, which covers characters from Western European languages.

import pandas as pd

# Try loading the dataset with UTF-8 encoding; use ISO-8859-1 if UTF-8 fails
# (the codec name needs a plain ASCII hyphen, otherwise Python raises LookupError)
try:
    df = pd.read_csv('netflix_titles.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('netflix_titles.csv', encoding='ISO-8859-1')

This ensures the file loads even when it is not valid UTF-8, so the textual data can be imported for our analysis.
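One caveat with this fallback is worth seeing in isolation: ISO-8859-1 maps every possible byte to a character, so the second attempt never raises, even when the file is actually in some other encoding. The sketch below illustrates this on raw bytes. Note that `decode_with_fallback` is a hypothetical helper written for this example, not part of Pandas.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 is decoded on the first attempt.
utf8_text, utf8_used = decode_with_fallback("café".encode("utf-8"))

# Latin-1 bytes are not valid UTF-8, so the fallback kicks in.
latin_text, latin_used = decode_with_fallback("café".encode("iso-8859-1"))

# Windows-1252 bytes for the euro sign also "succeed" under Latin-1,
# but come out as an invisible control character instead of the euro sign.
euro_text, euro_used = decode_with_fallback("€".encode("cp1252"))

print(utf8_used, latin_used, euro_used, repr(euro_text))
```

If titles show odd symbols after loading, it may be worth trying `cp1252` as a candidate before ISO-8859-1.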

Data Cleaning

To ensure our dataset is manageable and focused, we remove any extraneous columns.

This process involves creating a list of essential columns that are relevant to our analysis, including identifiers, content types, and descriptive attributes.

# Drop the "Unnamed" columns and keep only the specified columns
specified_columns = [
 'show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
 'release_year', 'rating', 'duration', 'listed_in', 'description']

# Create a new DataFrame with only the specified columns
# (.copy() avoids SettingWithCopyWarning when columns are added later)
df_cleaned = df[specified_columns].copy()

# Display the first few rows of the cleaned DataFrame to verify
df_cleaned.head()

This step creates a cleaner and more focused dataset, allowing for more efficient analysis by concentrating only on relevant data.
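Selecting columns by name raises a KeyError if any of them is absent, which can happen when a different export of the dataset is used. As a minimal sketch on a toy DataFrame (the values and the `Unnamed: 12` column name are made up for illustration), the check below reports missing columns before selecting:

```python
import pandas as pd

# Toy stand-in for the raw CSV, including one stray "Unnamed" column
raw = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "title": ["Example A", "Example B"],
    "Unnamed: 12": [None, None],
})

wanted = ["show_id", "title", "rating"]

# Split the wanted columns into those present and those missing
present = [col for col in wanted if col in raw.columns]
missing = [col for col in wanted if col not in raw.columns]

# Keep only the columns that exist; .copy() gives an independent DataFrame
cleaned = raw[present].copy()

print("missing:", missing)
print("kept:", list(cleaned.columns))
```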

Data Structure Overview

Next, we confirm that the data is well-organized and ready for further analysis.

The following code provides an overview of the cleaned DataFrame, detailing the data types it contains and flagging any adjustments needed before processing.

# Displaying information about data types and structure
# (info() prints its report directly and returns None, so no print() wrapper is needed)
df_cleaned.info()

This command outputs information about each column, such as data type and non-null counts, ensuring transparency for the next stages of analysis.
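The non-null counts can also be turned into an explicit missing-value summary. A small sketch on made-up rows, assuming missing entries are loaded as NaN:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Example A", "Example B", "Example C", "Example D"],
    "director": ["Director X", None, None, "Director Y"],
    "country": ["India", "Japan", None, "Brazil"],
})

# Count missing values per column and express them as a share of all rows
missing_count = toy.isna().sum()
missing_share = (missing_count / len(toy) * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Columns such as director, cast, and country are the ones to watch here, since later sections split and count their values.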

Statistical Overview

To gain insights into the numerical variables of our dataset, we generate a statistical summary.

This summary includes key metrics such as mean, standard deviation, min, and max values, which help us understand the distribution and central tendencies of these variables.

# Statistical summary of numerical variables
print(df_cleaned.describe())

However, release_year is the only numeric column in this dataset, so the summary is mainly useful for confirming the range of release years, from the earliest to the most recent.

Dataset Dimensionality

Before diving deeper into the analysis, it is crucial to understand the scope of our dataset by examining its dimensions.

# Print the dataset's dimensions
print(df_cleaned.shape)

The output is (8809, 12): our dataset contains 8,809 entries across 12 attributes, providing a substantial amount of data for our analysis.

Data Visualization

1. Distribution of Content by Type

Visualizing the distribution of different types of content available on Netflix helps in understanding the current focus of their catalog.

The following code uses the Plotly library to create a bar chart, which displays the count of Movies versus TV Shows in the dataset.

import plotly.express as px

content_type_distribution = df_cleaned['type'].value_counts().reset_index()

# Renaming columns for clarity
content_type_distribution.columns = ['Content Type', 'Count'] 

# Plotting
fig = px.bar(
content_type_distribution, 
x='Content Type', 
y='Count', 
title='Distribution of Content by Type',
labels={'Content Type': 'Content Type', 'Count': 'Count'}, 
template='plotly_white')
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',
xaxis_showgrid=False, yaxis_showgrid=False)
fig.show()

This demonstrates that Netflix has a significantly higher number of Movies compared to TV Shows, which could influence content strategy decisions.
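Raw counts are easier to compare when expressed as shares of the catalog. A minimal sketch on a made-up `type` column, using the `normalize` option of `value_counts`:

```python
import pandas as pd

toy_types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of raw counts
type_share = (toy_types.value_counts(normalize=True) * 100).round(1)

print(type_share)
```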

2. Tracking Content Release Patterns Over Time

To examine the evolution of content releases on Netflix, we analyze the number of titles released each year.

This involves grouping the data by the release year and counting the titles to identify trends.

The code below uses the Plotly library to create a line graph, which visually represents these trends, highlighting fluctuations and growth patterns.

# Grouping data by release year and counting the titles released each year
release_trends = df_cleaned.groupby('release_year').size().reset_index(name='Count')

# Sorting data by release year
release_trends_sorted = release_trends.sort_values('release_year')

# Plotting
fig_trends = px.line(
release_trends_sorted, 
x='release_year', 
y='Count', 
title='Trends of Content Release Over the Years',
labels={'release_year': 'Release Year', 'Count': 'Number of Titles'}, 
template='plotly_white')
fig_trends.update_layout(plot_bgcolor='rgba(0,0,0,0)', 
xaxis_showgrid=False, yaxis_showgrid=False)

# Adjusting Y-axis to log scale
fig_trends.update_yaxes(type='log')  
fig_trends.show()

This visualization helps us understand how Netflix’s content offerings have expanded or shifted over the years.

3. Identifying Top Content Producing Countries

To understand Netflix’s global content strategy, we analyze the geographical distribution of where titles are produced.

This analysis counts the number of titles produced in each country. We then visualize the top 20 countries that have produced the most content.

# Counting the number of titles by country
country_distribution = df_cleaned['country'].str.split(', ').explode().value_counts().reset_index()
country_distribution.columns = ['Country', 'Count']

# Plotting
fig_countries = px.bar(
country_distribution.head(20), 
x='Country',
y='Count', 
title='Top 20 Countries by Content Production',
labels={'Country': 'Country', 'Count': 'Number of Titles'}, 
template='plotly_white')

fig_countries.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_countries.show()

This bar chart provides a clear view of which countries are the largest producers of content for Netflix, which can inform strategic decisions related to regional market investments and content localization efforts.
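Splitting on ', ' assumes every entry is cleanly formatted. If some rows carry a leading or trailing comma (an assumption worth checking on the real file), the split produces empty or padded tokens that get counted as separate countries. A defensive variant, shown on made-up rows:

```python
import pandas as pd

toy_country = pd.Series([
    "United States, India",
    ", France, Algeria",      # leading comma
    "India,",                 # trailing comma
    None,                     # missing value
])

countries = (
    toy_country
    .dropna()
    .str.split(",")   # split on the bare comma...
    .explode()
    .str.strip()      # ...then trim whitespace around each token
)

# Drop empty tokens left behind by stray commas
countries = countries[countries != ""]

country_counts = countries.value_counts()
print(country_counts)
```

Also keep in mind that a co-production is counted once for each country listed, so the counts add up to more than the number of titles.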

4. Duration Trends in Netflix Movies

To delve deeper into the characteristics of movies on Netflix, we analyze the duration of films to identify trends and common lengths.

First, we filter our dataset to include only entries categorized as Movie.

Next, we extract and convert the numeric values from the duration column, which are formatted as strings like 90 min.

This allows us to handle and analyze these values quantitatively.

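# Filtering for movies only (.copy() so new columns can be added without warnings)
df_movies = df_cleaned[df_cleaned['type'] == 'Movie'].copy()

# Extract numeric values from 'duration' which are stored in format like '90 min'
# (raw string so that \d is not treated as an invalid escape sequence)
df_movies['duration_min'] = df_movies['duration'].str.extract(r'(\d+)').astype(float)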

# Handling NaN values by dropping them (you could also consider filling them if appropriate)
df_movies = df_movies.dropna(subset=['duration_min'])

# Plotting a histogram of movie durations
fig_duration = px.histogram(
df_movies, x='duration_min', 
nbins=30, 
title='Distribution of Movie Durations',
labels={'duration_min': 'Duration (minutes)'}, 
template='plotly_white')

fig_duration.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_duration.show()

This histogram provides insights into the typical movie lengths available on Netflix, illustrating the range and distribution of film durations.
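The pattern (\d+) grabs the first run of digits regardless of the unit. After filtering to movies that is usually fine, but a stricter pattern makes the assumption explicit. A plain-Python sketch with made-up values:

```python
import re
import statistics

durations = ["90 min", "125 min", None, "2 Seasons", "61 min"]

minutes = []
for value in durations:
    if value is None:
        continue
    # Only accept strings of the form "<digits> min"
    match = re.fullmatch(r"(\d+)\s*min", value.strip())
    if match:
        minutes.append(int(match.group(1)))

print(minutes)
print(statistics.median(minutes))
```

The median is a useful companion to the histogram because a handful of very long or very short titles do not shift it much.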

5. Temporal Patterns in Netflix’s Catalog Expansion

Understanding when and how frequently Netflix adds new titles to its platform provides insights into its content strategy and market response dynamics.

To do this, we first parse the date_added field into a datetime format, which allows us to extract and focus on the specific periods of content additions.

We then create a year_month_added column to aggregate data by month and year, providing a more granular view of content addition trends.

# Parsing 'date_added' to datetime 
df_cleaned['date_added'] = pd.to_datetime(df_cleaned['date_added'].str.strip(), errors='coerce')

# Creating 'year_month_added' column as period
df_cleaned['year_month_added'] = df_cleaned['date_added'].dt.to_period('M').astype(str)

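# Grouping data by 'year_month_added' and counting the titles added each period
# (rows whose date could not be parsed are excluded; they would otherwise form a 'NaT' group)
addition_trends = df_cleaned.dropna(subset=['date_added']).groupby('year_month_added').size().reset_index(name='Count')
addition_trends_sorted = addition_trends.sort_values('year_month_added')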

# Plotting
fig_addition_trends = px.line(
    addition_trends_sorted, 
    x='year_month_added', 
    y='Count',
    title='Frequency of Content Addition Over Time',
    labels={'year_month_added': 'Year and Month', 'Count': 'Number of Titles Added'},
    template='plotly_white'
)
# Updating layout to improve y-axis visualization
fig_addition_trends.update_layout(
    plot_bgcolor='rgba(0,0,0,0)', 
    xaxis_showgrid=False, 
    yaxis_showgrid=False,
    yaxis_range=[-5, addition_trends_sorted['Count'].max() + 10]  # Adjusting y-axis to start slightly below zero
)
fig_addition_trends.show()

This visualization tracks the frequency of content additions, highlighting key periods of growth and allowing us to speculate on strategic shifts in Netflix’s approach to updating its catalog.
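Grouping by month only yields the months that have at least one title, so gaps are silently skipped on the line chart. The sketch below fills those gaps with zeros. It assumes `date_added` strings look like "January 1, 2020" and uses made-up values:

```python
import pandas as pd

toy_dates = pd.Series([
    "January 1, 2020",
    " January 15, 2020",   # leading space
    "March 3, 2020",
    "not a date",          # becomes NaT
])

parsed = pd.to_datetime(toy_dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Count titles per month; NaT rows are dropped before counting
monthly = parsed.dropna().dt.to_period("M").value_counts().sort_index()

# Reindex over the full month range so months with no additions show as 0
full_range = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
monthly = monthly.reindex(full_range, fill_value=0)

print(monthly)
```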

6. Analyzing Content Distribution by Rating and Type

To understand how different types of content such as movies and TV shows are distributed across various ratings, we analyze the count of titles for each combination of rating and type.

# Creating a count of titles for each rating and type combination
rating_type_distribution = df_cleaned.groupby(['rating', 'type']).size().reset_index(name='Count')

# Plotting
fig_rating_type = px.bar(
rating_type_distribution, 
x='rating', 
y='Count', 
color='type',
title='Relationship Between Rating and Type of Content',
labels={'rating': 'Rating', 'Count': 'Number of Titles'}, 
barmode='group',
template='plotly_white')

fig_rating_type.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_rating_type.show()

The chart displays the distribution of Netflix titles across various content ratings, broken down by Movies and TV Shows. Below is a succinct explanation of each rating category along with a brief analysis:

G: General audiences — All ages admitted.

PG: Parental Guidance Suggested — Some material may not be suitable for children.

PG-13: Parents Strongly Cautioned — Some material may be inappropriate for children under 13.

R: Restricted — Under 17 requires accompanying parent or adult guardian.

NC-17: Adults Only — No one 17 and under admitted.

NR (Not Rated): This content has not been formally rated by a rating organization.

TV-G: General Audience — Suitable for all ages.

TV-PG: Parental Guidance Suggested — Contains material that parents may find unsuitable for younger children.

TV-14: Parents Strongly Cautioned — Contains material that many parents would find unsuitable for children under 14 years of age.

TV-MA: Mature Audience Only — Specifically designed to be viewed by adults and therefore may be unsuitable for children under 17.

TV-Y: All Children — Designed to be appropriate for all children.

TV-Y7: Directed to Older Children — Designed for children age 7 and above.

TV-Y7-FV: Directed to Older Children — Fantasy Violence — Suitable for children age 7 and above with more intense or fantasy violence.

UR (Unrated): This content has not been rated by the MPAA.

The chart reveals that most content falls under TV-MA and R ratings, suggesting a significant focus on adult audiences.
Movies dominate the R category, indicating a robust selection of films with mature themes.
Conversely, TV Shows are more prevalent in TV-MA, reflecting a trend towards series tailored for an adult viewership.
The presence of family-friendly content is also notable in ratings like TV-Y and G, though these categories feature fewer titles compared to adult-oriented ratings.
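Because the catalog holds more Movies than TV Shows overall, raw counts per rating can be hard to compare. One option is to look at the share of each type within a rating. A minimal sketch on made-up rows, using the `normalize` option of `pd.crosstab`:

```python
import pandas as pd

toy = pd.DataFrame({
    "rating": ["TV-MA", "TV-MA", "TV-MA", "R", "R", "TV-Y"],
    "type": ["Movie", "TV Show", "TV Show", "Movie", "Movie", "TV Show"],
})

# Row-normalized crosstab: each rating's row sums to 1.0
share_by_rating = pd.crosstab(toy["rating"], toy["type"], normalize="index").round(2)

print(share_by_rating)
```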

7. Influential Directors in Netflix’s Catalog

This analysis segment focuses on understanding which directors have the most titles listed on Netflix.

The process involves splitting multiple director names listed under a single title, counting their occurrences, and identifying the top 10 directors.

# Processing directors data, handling multiple directors per title
director_counts = df_cleaned['director'].str.split(', ').explode().value_counts().reset_index()
director_counts.columns = ['Director', 'Count']

# Filtering to show only the top 10 directors
top_directors = director_counts.head(10)

# Plotting
fig_directors = px.bar(
top_directors, 
x='Director', 
y='Count', 
title='Top 10 Directors by Number of Titles',
labels={'Director': 'Director', 'Count': 'Number of Titles'}, 
template='plotly_white')

fig_directors.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_directors.show()

This bar chart clearly illustrates which international directors are the most prolific on Netflix, offering insights into the type of content that is frequently produced and possibly favored by the platform.

8. Prominent Directors of U.S. Content

To gain insights into which directors are most influential in shaping the U.S. film content on Netflix, we narrow our focus to titles explicitly produced in the United States.

# Filtering for movies from the United States
us_movies = df_cleaned[(df_cleaned['type'] == 'Movie') & (df_cleaned['country'].str.contains('United States', na=False))]

# Processing directors data for U.S. movies, handling multiple directors per title
us_director_counts = us_movies['director'].str.split(', ').explode().value_counts().reset_index()
us_director_counts.columns = ['Director', 'Count']

# Filtering to show only the top 10 directors
top_us_directors = us_director_counts.head(10)

# Plotting
fig_us_directors = px.bar(
top_us_directors, 
x='Director', 
y='Count', 
title='Top 10 Directors of U.S. Films on Netflix',
labels={'Director': 'Director', 'Count': 'Number of Titles'}, 
template='plotly_white')

fig_us_directors.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_us_directors.show()

Jay Karas and Marcus Raboy lead the chart, indicating a strong representation on Netflix. These directors are known for their work in comedy specials and light-hearted content, suggesting a popular demand for such genres among Netflix’s U.S. audience.

Jay Chapman, another director with a significant number of titles, also tends to direct comedy specials, reinforcing the trend that this type of content is frequently produced and consumed in the U.S.

Martin Scorsese and Steven Spielberg, both highly acclaimed filmmakers, appear in the middle of the list, highlighting their enduring appeal and the platform’s investment in critically acclaimed, high-quality content.

Shannon Hartman, Don Michael Paul, and Troy Miller further showcase the variety in content, spanning from comedy to action and drama, which appeals to a broad audience.

Robert Rodriguez and Quentin Tarantino, both known for their distinctive styles and strong narratives, close the list, underscoring Netflix’s strategy to feature bold and engaging content that attracts diverse viewer segments.

9. Common Themes in Netflix Descriptions

To better understand the themes and topics that are prevalent in Netflix content descriptions, we utilize natural language processing techniques.

Specifically, we apply CountVectorizer from the sklearn library to extract and count the most common words found in the descriptions, excluding common English stopwords to focus on more meaningful keywords.

from sklearn.feature_extraction.text import CountVectorizer
import plotly.express as px

# Using CountVectorizer to extract the most common words from descriptions
vectorizer = CountVectorizer(stop_words='english', max_features=20)
description_matrix = vectorizer.fit_transform(df_cleaned['description'].dropna())
description_counts = description_matrix.sum(axis=0)
words_freq = [(word, description_counts[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

# Extracting the top 20 words
top_words = words_freq[:20]

# Preparing data for plotting
words, counts = zip(*top_words)

# Plotting using Plotly
fig = px.bar(
x=words, 
y=counts, 
labels={'x': 'Words', 'y': 'Frequency'}, 
title='Top 20 Common Words in Descriptions')
fig.update_traces(marker_color='blue')

fig.update_layout(xaxis_title='Words', yaxis_title='Frequency', xaxis_tickangle=-45)
fig.show()

This bar chart visually presents the top 20 most frequent words found in the descriptions of Netflix’s content.
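To make the counting step less of a black box, here is a rough plain-Python equivalent. It lowercases the text and keeps tokens of two or more word characters, which mirrors CountVectorizer's default behavior. The stop-word set is a tiny hand-picked list for this example, not sklearn's English list, and the descriptions are made up:

```python
import re
from collections import Counter

descriptions = [
    "A young detective hunts a killer in the city.",
    "In the city, a young chef opens a restaurant.",
    "Two friends hunt for treasure.",
]

# Tiny hand-picked stop-word list for this example only
stop_words = {"a", "in", "the", "for", "two"}

# Same idea as CountVectorizer's default token pattern:
# lowercase, then keep tokens of two or more word characters
token_pattern = re.compile(r"\b\w\w+\b")

word_counts = Counter()
for text in descriptions:
    tokens = token_pattern.findall(text.lower())
    word_counts.update(t for t in tokens if t not in stop_words)

print(word_counts.most_common(3))
```

Note that "hunt" and "hunts" are counted separately: neither this sketch nor a default CountVectorizer performs stemming.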

10. Exploring Genre Distribution in Netflix’s Catalog

To uncover which genres dominate Netflix’s offerings, we analyze the listed_in column of the dataset, which may contain multiple genres for each title.

This analysis shows which types of content are most prevalent in the catalog. Note that title counts measure catalog share rather than viewership, since the dataset contains no viewing data.

# Processing the 'listed_in' column which may contain multiple genres per title
genre_counts = df_cleaned['listed_in'].str.split(', ').explode().value_counts().reset_index()
genre_counts.columns = ['Genre', 'Count']

# Plotting
fig_genres = px.bar(
genre_counts.head(20), 
x='Genre', 
y='Count', 
title='Top 20 Most Popular Genres',
labels={'Genre': 'Genre', 'Count': 'Number of Titles'}, 
template='plotly_white')

fig_genres.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_genres.show()

This visualization offers a clear snapshot of the top 20 genres, indicating which categories are most frequently listed on Netflix.

Dramas and Comedies are the most represented genres, indicating a strong viewer preference for these categories. Their universal appeal likely drives their dominance in the catalog.

International Movies and Documentaries also rank highly, showcasing Netflix’s commitment to diverse content that appeals to a global audience and caters to niche interests.

Action & Adventure and TV Dramas maintain a solid presence, reflecting the ongoing popularity of engaging, plot-driven content.

Genres such as Children & Family Movies, Crime TV Shows, and Thrillers highlight the variety in Netflix’s offerings, catering to different age groups and tastes.

Romantic Movies, Horror Movies, and Reality TV also make the list, though with fewer titles, indicating more targeted demographics or niche appeal.

Stand-Up Comedy and Music & Musicals are less represented but are crucial for adding variety and catering to specific viewer interests, potentially drawing in audiences looking for light entertainment and cultural productions.

11. Highlighting the Most Featured Actors in Netflix’s Catalog

This analysis explores which actors appear most frequently across Netflix’s diverse array of content.

Understanding the presence of these actors can provide insights into casting trends, popular figures in the entertainment industry, and the potential drawing power of actors in attracting a viewing audience.

# Processing actors data, which may contain multiple actors per title
df_cleaned['cast_list'] = df_cleaned['cast'].str.split(', ')
actor_counts = df_cleaned.explode('cast_list')['cast_list'].value_counts().reset_index()
actor_counts.columns = ['Actor', 'Count']

# Filtering to show only the top 10 actors
top_actors = actor_counts.head(10)

# Plotting
fig_actors = px.bar(
top_actors, 
x='Actor', 
y='Count', 
title='Top 10 Actors by Number of Titles',
labels={'Actor': 'Actor', 'Count': 'Number of Titles'}, 
template='plotly_white')

fig_actors.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_actors.show()

This visualization offers a clear snapshot of the top 10 actors, indicating which ones are most frequently featured in titles available on Netflix.

12. Identifying the Most Featured US Actors in Netflix’s Catalog

This analysis focuses on discovering which actors are most frequently featured in U.S.-produced titles available on Netflix. The filter applies to a title's production country, not to an actor's nationality.

By isolating entries related to the United States, we can pinpoint influential actors in one of Netflix’s largest markets, offering insights into casting trends and the popularity of actors within American productions.

# Filtering the dataset for entries where 'country' includes 'United States'
us_entries = df_cleaned[df_cleaned['country'].str.contains('United States', na=False)]

# Processing actors data, which may contain multiple actors per title
us_actor_counts = us_entries.explode('cast_list')['cast_list'].value_counts().reset_index()
us_actor_counts.columns = ['Actor', 'Count']

# Filtering to show only the top 10 US actors
top_us_actors = us_actor_counts.head(10)

# Plotting
fig_us_actors = px.bar(
top_us_actors, 
x='Actor', 
y='Count', 
title='Top 10 US Actors by Number of Titles',
labels={'Actor': 'Actor', 'Count': 'Number of Titles'}, 
template='plotly_white')

fig_us_actors.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_us_actors.show()

This bar chart provides a visual representation of the most prolific US actors on Netflix, indicating which individuals are potentially driving viewership through their frequent appearances.

13. Analyzing the Length of Movies

This analysis investigates the durations of movies on Netflix, providing insights into the typical movie lengths that dominate the platform.

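# Filtering for movie entries (.copy() so new columns can be added without warnings)
df_movies = df_cleaned[df_cleaned['type'] == 'Movie'].copy()

# Extract numeric values from 'duration' which are stored in format like '90 min'
# (raw string so that \d is not treated as an invalid escape sequence)
df_movies['duration_min'] = df_movies['duration'].str.extract(r'(\d+)').astype(float)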

# Handling NaN values by dropping them (you could also consider filling them if appropriate)
df_movies = df_movies.dropna(subset=['duration_min'])

# Now safely convert to int since all NaNs are removed
df_movies['duration_min'] = df_movies['duration_min'].astype(int)

# Plotting a histogram of movie durations
fig_movie_durations = px.histogram(
df_movies, 
x='duration_min', 
nbins=40, 
title='Distribution of Movie Durations',
labels={'duration_min': 'Duration (minutes)'}, 
template='plotly_white')

fig_movie_durations.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig_movie_durations.show()

This histogram provides a view of the range and frequency of movie durations, revealing whether shorter or longer films are more prevalent.

14. Viewer Engagement with Longevity on Netflix

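This analysis aims to explore whether there is a correlation between the number of seasons a TV show has and its viewer ratings. The dataset does not include viewer ratings, so the code generates random placeholder values purely to demonstrate the technique.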

import numpy as np

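# Filtering for TV Show entries (.copy() so new columns can be added without warnings)
df_tv_shows = df_cleaned[df_cleaned['type'] == 'TV Show'].copy()

# Extracting the number of seasons from the duration column
# (raw string so that \d is not treated as an invalid escape sequence)
df_tv_shows['num_seasons'] = df_tv_shows['duration'].str.extract(r'(\d+)').astype(float)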

# Generating random viewer ratings for demonstration (since actual ratings data is not provided)
np.random.seed(42)
df_tv_shows['viewer_ratings'] = np.random.uniform(5, 10, size=len(df_tv_shows))

# Dropping any NaN values in num_seasons or viewer_ratings for clean plotting
df_tv_shows = df_tv_shows.dropna(subset=['num_seasons', 'viewer_ratings'])

# Creating a scatter plot to examine the correlation between number of seasons and viewer ratings
fig = px.scatter(
df_tv_shows, 
x='num_seasons', 
y='viewer_ratings', 
trendline="ols",  # the OLS trendline needs the statsmodels package installed
title='Correlation Between Number of Seasons and Viewer Ratings',
labels={'num_seasons': 'Number of Seasons', 'viewer_ratings': 'Viewer Ratings'},
template='plotly_white')

fig.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False)
fig.show()

This approach shows how such a relationship could be visualized. Because the ratings here are randomly generated, the trendline carries no real signal. With actual ratings data, the same plot would help content creators and platform curators see whether longevity tracks viewer engagement.
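To see why randomly generated ratings cannot reveal a relationship, it helps to compute the correlation directly. The sketch below uses only the standard library; `pearson_r` is a helper defined here for illustration, and both columns are synthetic:

```python
import math
import random

random.seed(42)

n = 2000
seasons = [random.randint(1, 10) for _ in range(n)]
ratings = [random.uniform(5, 10) for _ in range(n)]  # independent of seasons


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(seasons, ratings)
print(f"Pearson r = {r:.3f}")
```

The coefficient lands close to zero, as expected for two independent columns. Any slope the trendline shows on such data is noise.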

15. Top 30 Long-Running Series

This visualization aims to highlight the top 30 TV shows on Netflix that have achieved more than 8 seasons, providing insights into which series have maintained enduring popularity and viewer commitment.

import pandas as pd
import plotly.express as px

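# Filtering df_cleaned for TV Show entries (.copy() so new columns can be added without warnings)
df_tv_shows = df_cleaned[df_cleaned['type'] == 'TV Show'].copy()

# Extracting the number of seasons from the duration column and ensuring it is numeric
# (float first, so rows with a missing duration become NaN instead of failing the int conversion)
df_tv_shows['num_seasons'] = df_tv_shows['duration'].str.extract(r'(\d+)').astype(float)
df_tv_shows = df_tv_shows.dropna(subset=['num_seasons'])
df_tv_shows['num_seasons'] = df_tv_shows['num_seasons'].astype(int)

# Filtering to find shows with more than 8 seasons
long_running_shows = df_tv_shows[df_tv_shows['num_seasons'] > 8]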

# Sorting by number of seasons and selecting the top 30
top_long_running_shows = long_running_shows.sort_values(by='num_seasons', ascending=False)
top_long_running_shows = top_long_running_shows.head(min(30, len(top_long_running_shows)))

# Plotting a bar chart of the top shows with more than 8 seasons
fig = px.bar(top_long_running_shows, x='title', y='num_seasons',
             title='Top Netflix TV Shows with More Than 8 Seasons',
             labels={'title': 'TV Show Title', 'num_seasons': 'Number of Seasons'},
             template='plotly_white')
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=False, yaxis_showgrid=False,
                  xaxis_title="TV Show Title",
                  yaxis_title="Number of Seasons",
                  xaxis_tickangle=-45)  # Rotate labels for better readability
fig.show()

This approach provides a visualization of which TV shows have not only surpassed the 8-season mark but are also the top runners, offering a detailed view of long-term engagement and success on the platform.
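An alternative way to parse the season count is `pd.to_numeric`, which tolerates missing durations without a separate drop step. A minimal sketch on made-up shows:

```python
import pandas as pd

toy_shows = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C", "Show D"],
    "duration": ["9 Seasons", "1 Season", None, "15 Seasons"],
})

# expand=False returns a Series; to_numeric turns missing values into NaN
season_text = toy_shows["duration"].str.extract(r"(\d+)", expand=False)
toy_shows["num_seasons"] = pd.to_numeric(season_text, errors="coerce")

# Comparisons with NaN are False, so rows without a duration are excluded
long_running = (
    toy_shows[toy_shows["num_seasons"] > 8]
    .sort_values("num_seasons", ascending=False)
)

print(long_running[["title", "num_seasons"]])
```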

16. Global Distribution of Content Genres

This analysis uses a treemap to illustrate how different genres are distributed across various countries, providing insights into the global content strategies and preferences.

Treemaps are excellent for displaying hierarchical (tree-structured) data and for visualizing part-to-whole relationships, making it easy to see which genres dominate in specific regions.

# Creating lists of genres and countries for each title
df_cleaned['genres'] = df_cleaned['listed_in'].str.split(', ')
df_cleaned['countries'] = df_cleaned['country'].str.split(', ')

# Exploding the DataFrame to have one genre and one country per row
exploded_genres_countries = df_cleaned.explode('genres').explode('countries')

# Creating a cross-tabulation of genres by country
genre_country_distribution = pd.crosstab(exploded_genres_countries['countries'], exploded_genres_countries['genres'])

# Transforming the crosstab to a DataFrame for plotting
genre_country_df = genre_country_distribution.reset_index()
genre_country_df_melted = genre_country_df.melt(id_vars=['countries'], var_name='Genre', value_name='Count')

# Plotting a treemap to show the distribution of genres across countries
fig = px.treemap(
genre_country_df_melted, 
path=['countries', 'Genre'], 
values='Count',
title='Treemap of Genre Popularity by Country',
color_continuous_scale='Blues',
template='plotly_white')

fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')
fig.show()

This approach effectively communicates complex multi-dimensional data in a visually engaging format, making it easier for decision-makers to derive meaningful insights from the dataset.
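Two details of the double explode are easy to miss. A title with several genres and several countries contributes one row per combination, so the totals exceed the number of titles. The crosstab followed by melt also produces a zero-count row for every genre a country never appears with. The sketch below, on made-up rows, counts only the combinations that occur:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Example A", "Example B"],
    "listed_in": ["Dramas, Comedies", "Dramas"],
    "country": ["India, France", None],
})

toy["genres"] = toy["listed_in"].str.split(", ")
toy["countries"] = toy["country"].str.split(", ")

# Two genres x two countries -> four rows for "Example A"
# (the index is reset between the two explodes to keep it unique)
pairs = toy.explode("genres").reset_index(drop=True).explode("countries")

# Count only combinations that occur; rows without a country are dropped
pair_counts = (
    pairs.dropna(subset=["countries"])
    .groupby(["countries", "genres"])
    .size()
    .reset_index(name="Count")
)

print(pair_counts)
```

A frame shaped like `pair_counts` has the same country, genre, and count columns that the treemap call reads from.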

Conclusion

In our analysis, we explored Netflix’s content catalog, examining genre popularity, actor prominence, and content distribution across countries.

We looked at how content is distributed across countries and genres, identified the most frequently featured actors and directors, and highlighted TV shows with many seasons. Additionally, reviewing release years revealed the breadth of content from historical to modern titles.

This data-driven exploration aids Netflix in aligning its offerings with viewer preferences and regional tastes.

Thank you for joining me on this insightful journey! 🐼❤️