Netflix Movies and TV Shows Clustering

  • Tech Stack: Python, Machine Learning.
  • GitHub URL: Project Link

🎬 Netflix Movies and TV Shows Clustering: Analyzing Netflix's content library to explore content distribution, trends, and similarities across genres and countries by performing clustering on movies and TV shows available on the platform.

Overview:

  • Analyzes Netflix's content library, with a focus on clustering movies and TV shows.
  • Aims to explore content distribution, trends, and similarities across genres and countries.

Objectives:

  • Conduct Exploratory Data Analysis (EDA).
  • Understand the type of content available in different countries.
  • Check whether Netflix has increasingly focused on TV shows rather than movies in recent years.
  • Cluster similar content by matching text-based features (like descriptions).
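
The TV-show-versus-movie objective can be checked by computing, for each year, the share of newly added titles that are TV shows. The snippet below is a minimal sketch, not the project's actual code: the sample rows are made up, the helper name tv_share_by_year is hypothetical, and the date_added format ("Month D, YYYY") is an assumption about how the column is stored.

```python
import pandas as pd

# Made-up sample rows for illustration; the real dataset is much larger.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "date_added": ["January 1, 2017", "March 5, 2017", "June 9, 2017",
                   "February 2, 2019", "July 7, 2019", "August 8, 2019"],
})


def tv_share_by_year(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of titles added in each year that are TV shows."""
    # Unparseable dates become NaT and are dropped by the groupby.
    year_added = pd.to_datetime(df["date_added"].str.strip(), errors="coerce").dt.year
    is_tv = df["type"].eq("TV Show")
    return is_tv.groupby(year_added).mean()


share = tv_share_by_year(sample)
print(share)
```

A share that rises from year to year would support the hypothesis; a flat or falling share would not.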

Methods Used:

  • Descriptive Statistics: Summarizing and describing the data.
  • Data Visualization: Visualizing trends and insights from the dataset.
  • Machine Learning: K-means clustering to group similar content based on text-based features.

Libraries Utilized:

  • NumPy and Pandas: For dataset cleaning and analysis.
  • Matplotlib, Plotly, and Seaborn: For data visualization.
  • scikit-learn and NLTK: For text preprocessing, machine learning, and clustering.

Dataset Used:

  • TV shows and movies available on Netflix as of 2019, collected from Flixable, a third-party Netflix search engine.
  • The dataset contains the following attributes:
    • show_id: Unique ID for every Movie/TV Show
    • type: Identifier - Movie or TV Show
    • title: Title of the Movie/TV Show
    • director: Director of the Movie/Show
    • cast: Actors involved in the Movie/Show
    • country: Country where the Movie/Show was produced
    • date_added: Date the Movie/Show was added to Netflix
    • release_year: Original release year of the Movie/Show
    • rating: Maturity rating of the Movie/Show (e.g., TV-MA, PG-13)
    • duration: Total duration (in minutes or number of seasons)
    • listed_in: Genre
    • description: Summary description of the Movie/Show
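
Because duration mixes two units (minutes for movies, seasons for TV shows), a typical feature-engineering step is to split it into two numeric columns. The sketch below assumes the raw strings look like "90 min" and "2 Seasons"; that format, the sample rows, and the helper name split_duration are assumptions for illustration.

```python
import pandas as pd


def split_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Add numeric 'minutes' and 'seasons' columns derived from 'duration'."""
    out = df.copy()
    # Pull the leading number out of strings such as "90 min" or "2 Seasons".
    value = out["duration"].str.extract(r"(\d+)", expand=False).astype(float)
    is_movie = out["type"].eq("Movie")
    out["minutes"] = value.where(is_movie)    # NaN for TV shows
    out["seasons"] = value.where(~is_movie)   # NaN for movies
    return out


# Made-up rows in the assumed format.
raw = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})
parsed = split_duration(raw)
print(parsed)
```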

Project Workflow:

  • Data Preprocessing: Load the dataset, inspect the data, handle missing values, and perform feature engineering.
  • Exploratory Data Analysis (EDA): Visualize trends, content type distribution, and country-specific content.
  • Text Data Preprocessing: Clean text fields and apply a TF-IDF vectorizer to convert text into numerical form for clustering.
  • Clustering Using K-Means: Perform K-means clustering to create 10 distinct clusters based on text similarity.
  • Recommender System: Implement a movie/TV show recommender system based on clustering results.
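
The text-vectorization and clustering steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the descriptions are invented, scikit-learn's built-in English stop-word list stands in for the NLTK-based cleaning, and the toy example uses 2 clusters instead of the project's 10 because it has only six documents.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented descriptions for illustration only.
descriptions = [
    "A detective hunts a serial killer in a rainy city.",
    "Two detectives investigate a murder in a small town.",
    "A police officer tracks a killer across the country.",
    "A chef opens a restaurant and learns to cook family recipes.",
    "Home cooks compete in a baking contest judged by chefs.",
    "A food critic travels to taste street food and local recipes.",
]

# Convert each description into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(descriptions)

# Group the vectors; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

for label, text in zip(labels, descriptions):
    print(label, text)
```

On the full dataset, n_clusters would be set to 10 to match the workflow described above.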

Outcome:

  • Insight into prevalent content types (movies vs. TV shows) and country-specific trends.
  • Analysis of Netflix's focus shift towards TV shows over time.
  • Clustering of similar movies and TV shows based on text data.
  • A simple recommender system to suggest similar content to users based on their preferences.
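
One way a cluster-based recommender could work is to look up the cluster of a given title and rank the other titles in that cluster by cosine similarity of their TF-IDF vectors. The sketch below assumes that design; the catalog rows, the function name recommend, and its parameters are hypothetical and not taken from the project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented catalog for illustration only.
catalog = pd.DataFrame({
    "title": ["Night Case", "Cold Trail", "Badge", "Family Table", "Bake Off", "Street Bites"],
    "description": [
        "A detective hunts a serial killer in a rainy city.",
        "Two detectives investigate a murder in a small town.",
        "A police officer tracks a killer across the country.",
        "A chef opens a restaurant and learns to cook family recipes.",
        "Home cooks compete in a baking contest judged by chefs.",
        "A food critic travels to taste street food and local recipes.",
    ],
})


def recommend(title, catalog, n_clusters=2, top_n=3):
    """Return up to top_n titles from the same cluster, most similar first."""
    catalog = catalog.reset_index(drop=True)
    X = TfidfVectorizer(stop_words="english").fit_transform(catalog["description"])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    matches = catalog.index[catalog["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"unknown title: {title!r}")
    query = int(matches[0])

    # Candidates are the other titles that share the query's cluster.
    candidates = [i for i in range(len(catalog))
                  if labels[i] == labels[query] and i != query]
    if not candidates:
        return []

    # Rank candidates by cosine similarity to the query description.
    sims = cosine_similarity(X[query], X[candidates]).ravel()
    ranked = sorted(zip(candidates, sims), key=lambda pair: pair[1], reverse=True)
    return [catalog.loc[i, "title"] for i, _ in ranked[:top_n]]


suggestions = recommend("Night Case", catalog)
print(suggestions)
```

In practice the vectorizer and cluster labels would be fitted once and reused, rather than recomputed on every call as in this short example.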

Tools & Technologies:

  • Python: For data analysis and machine learning.
  • Libraries: Pandas, NumPy, scikit-learn, NLTK, Matplotlib, Seaborn, Plotly.
  • K-Means Clustering: To group content based on textual features.