Netflix Movies and TV Shows Clustering

Tech Stack: Python, Machine Learning
GitHub URL: Project Link

🎬 Netflix Movies and TV Shows Clustering: Analyzing Netflix's content library to explore content distribution, trends, and similarities across genres and countries by performing clustering on movies and TV shows available on the platform.

Overview:

This project analyzes Netflix's content library, with a focus on clustering movies and TV shows.
The goal is to explore content distribution, trends, and similarities across genres and countries.

Objectives:

Conduct Exploratory Data Analysis (EDA).
Understand the type of content available in different countries.
Check whether Netflix has increasingly focused on TV shows rather than movies in recent years.
Cluster similar content by matching text-based features (like descriptions).
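
The TV-show-versus-movie question above can be checked by computing, for each year, the share of newly added titles that are TV shows. The following is a minimal sketch rather than the project's actual code: the sample rows are made up, the helper name tv_show_share_by_year is hypothetical, and the date format ("January 1, 2017") is an assumption about how date_added is stored.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more titles.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "date_added": ["January 1, 2017", "March 5, 2017", "June 9, 2017",
                   "February 2, 2019", "April 4, 2019", "July 7, 2019"],
})


def tv_show_share_by_year(df):
    """Return the fraction of titles added each year that are TV shows."""
    # Assumed date format; unparseable or missing dates become NaT and are dropped by groupby.
    added = pd.to_datetime(df["date_added"].str.strip(),
                           format="%B %d, %Y", errors="coerce")
    is_tv = df["type"].eq("TV Show")
    # The mean of a boolean Series is the proportion of True values.
    return is_tv.groupby(added.dt.year).mean()


share = tv_show_share_by_year(sample)
print(share)
```

A rising share over successive years would be consistent with a shift towards TV shows; a flat or falling share would not.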

Methods Used:

Descriptive Statistics: Summarizing and describing the data.
Data Visualization: Visualizing trends and insights from the dataset.
Machine Learning: K-means clustering to group similar content based on text-based features.

Libraries Utilized:

NumPy and Pandas: For dataset cleaning and analysis.
Matplotlib, Plotly, and Seaborn: For data visualization.
scikit-learn and NLTK: For machine learning and clustering.

Dataset Used:

TV shows and movies available on Netflix as of 2019, collected from Flixable, a third-party Netflix search engine.
The dataset contains the following attributes:

show_id: Unique ID for every Movie/TV Show
type: Identifier - Movie or TV Show
title: Title of the Movie/TV Show
director: Director of the Movie/TV Show
cast: Actors involved in the Movie/Show
country: Country where the Movie/Show was produced
date_added: Date the Movie/Show was added to Netflix
release_year: Actual release year of the Movie/Show
rating: TV rating of the Movie/Show
duration: Total duration (in minutes or number of seasons)
listed_in: Genre
description: Summary description of the Movie/Show
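
Because duration mixes two units (minutes for movies, seasons for TV shows), it is usually easier to analyze once split into two numeric columns. The sketch below shows one possible feature-engineering step. It assumes values look like "90 min" and "2 Seasons"; the helper name split_duration and the new column names minutes and seasons are illustrative, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows using the assumed "90 min" / "2 Seasons" formats.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "1 Season", "3 Seasons"],
})


def split_duration(df):
    """Add numeric 'minutes' (movies) and 'seasons' (TV shows) columns."""
    out = df.copy()
    # Pull the leading number out of strings such as "90 min" or "3 Seasons".
    number = pd.to_numeric(
        out["duration"].str.extract(r"(\d+)", expand=False), errors="coerce"
    )
    is_movie = out["type"].eq("Movie")
    out["minutes"] = number.where(is_movie)    # NaN for TV shows
    out["seasons"] = number.where(~is_movie)   # NaN for movies
    return out


parsed = split_duration(sample)
print(parsed)
```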

Project Workflow:

Data Preprocessing: Load the dataset, inspect the data, handle missing values, and perform feature engineering.
Exploratory Data Analysis (EDA): Visualize trends, content type distribution, and country-specific content.
Text Data Preprocessing: Clean text fields, then apply a TF-IDF vectorizer to convert the text into numerical form for clustering.
Clustering Using K-Means: Perform K-means clustering to create 10 distinct clusters based on text similarity.
Recommender System: Implement a movie/TV show recommender system based on clustering results.
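
The last three workflow steps (TF-IDF, K-means, and a cluster-based recommender) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the titles and descriptions are invented, only 2 clusters are used because the sample is tiny (the project uses 10), and ranking titles within a cluster by cosine similarity is one plausible way to turn clusters into recommendations. The function names cluster_descriptions and recommend are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented sample data standing in for the title and description columns.
titles = ["Kitchen Wars", "Chef's Journey", "Baking Duel",
          "Orbit", "Mars Mission", "Deep Space"]
descriptions = [
    "Rival chefs compete in a kitchen cooking contest.",
    "A young chef travels to learn cooking from kitchen masters.",
    "Amateur bakers compete in a cooking and baking contest.",
    "Astronauts on a space station face a mission failure.",
    "A crew of astronauts begins a mission to Mars in space.",
    "A space mission sends astronauts beyond the solar system.",
]


def cluster_descriptions(texts, n_clusters, seed=0):
    """Vectorize texts with TF-IDF and assign each one a K-means cluster label."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)  # sparse (n_texts, n_terms) matrix
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(matrix)
    return matrix, labels


def recommend(title, titles, matrix, labels, top_n=3):
    """Return up to top_n titles from the same cluster, most similar first."""
    idx = titles.index(title)
    same = [i for i, lab in enumerate(labels) if lab == labels[idx] and i != idx]
    if not same:
        return []
    # Cosine similarity between the query row and the other rows in its cluster.
    scores = cosine_similarity(matrix[[idx]], matrix[same]).ravel()
    ranked = sorted(zip(scores, same), reverse=True)
    return [titles[i] for _, i in ranked[:top_n]]


matrix, labels = cluster_descriptions(descriptions, n_clusters=2)
recs = recommend("Kitchen Wars", titles, matrix, labels)
print(recs)
```

Restricting candidates to the query's cluster keeps the lookup cheap on a large catalog, at the cost of missing similar titles that K-means happened to place in a neighboring cluster.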

Outcome:

Insight into prevalent content types (movies vs. TV shows) and country-specific trends.
Analysis of Netflix's focus shift towards TV shows over time.
Clustering of similar movies and TV shows based on text data.
A simple recommender system to suggest similar content to users based on their preferences.

Tools & Technologies:

Python: For data analysis and machine learning.
Libraries: Pandas, NumPy, scikit-learn, NLTK, Matplotlib, Seaborn, Plotly.
K-Means Clustering: To group content based on textual features.