Netflix Movies and TV Shows Clustering
- Tech Stack: Python,Machine Learning.
- Github URL: Project Link
🎬 Netflix Movies and TV Shows Clustering: Analyzing Netflix's content library to explore content distribution, trends, and similarities across genres and countries by performing clustering on movies and TV shows available on the platform.
Overview:
- Analysis of Netflix's content library with a focus on clustering movies and TV shows.
- Objective to explore content distribution, trends, and similarities across genres and countries.
Objectives:
- Conduct Exploratory Data Analysis (EDA).
- Understand the type of content available in different countries.
- Check if Netflix is increasingly focusing on TV shows rather than movies in recent years.
- Cluster similar content by matching text-based features (like descriptions).
Methods Used:
- Descriptive Statistics: Summarizing and describing the data.
- Data Visualization: Visualizing trends and insights from the dataset.
- Machine Learning: K-means clustering to group similar content based on text-based features.
Libraries Utilized:
- NumPy and Pandas: For dataset cleaning and analysis.
- Matplotlib, Plotly, and Seaborn: For data visualization.
- SkLearn and nltk: For machine learning and clustering.
Dataset Used:
- TV shows and movies available on Netflix as of 2019, collected from Flixable, a third-party Netflix search engine.
- The dataset contains the following attributes:
- show_id: Unique ID for every Movie/TV Show
- type: Identifier - Movie or TV Show
- title: Title of the Movie/TV Show
- director: Director of the Movie
- cast: Actors involved in the Movie/Show
- country: Country where the Movie/Show was produced
- date_added: Date the Movie/Show was added on Netflix
- release_year: Actual release year of the Movie/Show
- rating: TV rating of the Movie/Show
- duration: Total duration (in minutes or number of seasons)
- listed_in: Genre
- description: Summary description of the Movie/Show
Project Workflow:
- Data Preprocessing: Load the dataset, inspect the data, handle missing values, and perform feature engineering.
- Exploratory Data Analysis (EDA): Visualize trends, content type distribution, and country-specific content.
- Text Data Preprocessing: Clean text fields, apply TF-IDF Vectorizer to convert text into numerical form for clustering.
- Clustering Using K-Means: Perform K-means clustering to create 10 distinct clusters based on text similarity.
- Recommender System: Implement a movie/TV show recommender system based on clustering results.
Outcome:
- Insight into prevalent content types (movies vs. TV shows) and country-specific trends.
- Analysis of Netflix's focus shift towards TV shows over time.
- Clustering of similar movies and TV shows based on text data.
- A simple recommender system to suggest similar content to users based on their preferences.
Tools & Technologies:
- Python: For data analysis and machine learning.
- Libraries: Pandas, NumPy, Sklearn, nltk, Matplotlib, Seaborn, Plotly.
- K-Means Clustering: To group content based on textual features.