
project 5

The document outlines a structured analysis of a movie dataset using Pandas, covering data reading, cleaning, and analysis tasks. Key findings include the most profitable movies, the top IMDb-rated films, and the actors with the most user and critic reviews among three lead actors. The analysis is intended to be useful to movie enthusiasts and industry professionals.

Uploaded by

Aisha Sheikh
Copyright: © All Rights Reserved

Task 1: Reading and Inspection

Subtask 1.1: Import and read the movie database

We begin by importing the necessary libraries and reading the movie dataset into a Pandas DataFrame.

import numpy as np
import pandas as pd

# Read the movie dataset
movies = pd.read_csv("Movies.csv")

Subtask 1.2: Inspect the dataframe

We inspect the dataset to understand its structure and contents.

# Check the number of rows and columns
print("Number of rows and columns:", movies.shape)

# Count the columns that contain at least one null value
print("Number of columns with null values:", (movies.isnull().sum() > 0).sum())

Answers to Questions:

1. There are 3821 rows and 26 columns in the dataframe.

2. Three columns have null values.
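
The inspection above reports only how many columns contain nulls. A minimal sketch of also listing which columns they are, using a small made-up frame (`toy` and its values are hypothetical stand-ins, not the real Movies.csv data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movie dataset
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "language": ["English", np.nan, "French", np.nan],
    "gross": [1.0e6, 2.5e6, np.nan, 4.0e6],
    "imdb_score": [7.1, 8.2, 6.5, 9.0],
})

n_rows, n_cols = toy.shape

# Null count per column, then keep only the columns with at least one null
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(f"{n_rows} rows, {n_cols} columns")
print("Columns with null values:", cols_with_nulls)
```

Listing the column names, rather than only counting them, makes it easier to decide which columns need cleaning later.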

Task 2: Cleaning the Data


Subtask 2.1: Drop unnecessary columns

We drop columns that are not required for our analysis.

columns_to_drop = [
'color', 'director_facebook_likes', 'actor_1_facebook_likes',
'actor_2_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'cast_total_facebook_likes', 'actor_3_name', 'duration',
'facenumber_in_poster', 'content_rating', 'country',
'movie_imdb_link', 'aspect_ratio', 'plot_keywords'
]

movies.drop(columns=columns_to_drop, inplace=True)

Answers to Questions:

3. After dropping unnecessary columns, the dataframe contains 10 columns.
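
One way to double-check a column count after a drop is to compare the drop list against the columns actually present. The sketch below does this on a hypothetical toy frame (`toy`, `to_drop`, and the values are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame; only some of the listed columns exist in it
toy = pd.DataFrame({
    "movie_title": ["A", "B"],
    "color": ["Color", "Color"],
    "duration": [120, 95],
    "gross": [1.0, 2.0],
})

to_drop = ["color", "duration", "aspect_ratio"]

# Split the drop list into columns that exist and columns that do not
present = [c for c in to_drop if c in toy.columns]
missing = [c for c in to_drop if c not in toy.columns]

trimmed = toy.drop(columns=present)
remaining = list(trimmed.columns)

print("Dropped:", present)
print("Not found:", missing)
print("Remaining column count:", len(remaining))
```

The remaining count should equal the original column count minus the number of columns that were actually dropped, which is a quick sanity check on the reported figure.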

Subtask 2.2: Inspect Null values

We find the percentage of null values in each column.

null_percentage = (movies.isnull().sum() / len(movies)) * 100
# Show the columns with the largest share of nulls first
print(null_percentage.sort_values(ascending=False))

Answers to Questions:

4. The column with the highest percentage of null values is “language”.
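
As a small illustration of the null-percentage calculation, the sketch below uses `isnull().mean()`, which is equivalent to dividing the null count by the number of rows, and picks the column with the largest share via `idxmax()`. The frame `toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical toy frame with nulls in two columns
toy = pd.DataFrame({
    "language": ["English", None, None, "French"],
    "gross": [1.0, None, 3.0, 4.0],
    "budget": [1.0, 2.0, 3.0, 4.0],
})

# Mean of a boolean mask is the fraction of True values
null_pct = toy.isnull().mean() * 100

# Column label with the highest null percentage
worst_column = null_pct.idxmax()

print(null_pct.sort_values(ascending=False))
print("Highest null percentage:", worst_column)
```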

Subtask 2.3: Fill NaN values

We fill NaN values in the “language” column with “English”.

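# Assign the result back; inplace fillna on a selected column may not update the dataframe in newer pandas
movies["language"] = movies["language"].fillna("English")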

Answers to Questions:

5. After filling NaN values, there are 3670 movies made in the English language.
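
The count of English-language movies can be obtained by comparing the filled column against "English" and summing the matches. A minimal sketch on hypothetical data (`toy` is an assumed stand-in for the real dataframe):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing languages
toy = pd.DataFrame({
    "language": ["English", np.nan, "French", np.nan, "English"],
})

# Fill missing languages, assigning the result back to the column
toy["language"] = toy["language"].fillna("English")

# Count rows whose language is English after filling
english_count = int((toy["language"] == "English").sum())

print("English-language movies:", english_count)
```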

Task 3: Data Analysis


Subtask 3.1: Change the unit of columns

We convert the units of the “budget” and “gross” columns from dollars to millions of dollars.

# Convert dollars to millions of dollars
movies["gross"] = movies["gross"] / 1_000_000
movies["budget"] = movies["budget"] / 1_000_000

Subtask 3.2: Find the movies with the highest profit

We calculate the “Profit” for each movie as gross minus budget and find the ten most profitable movies.

movies["Profit"] = movies.gross - movies.budget
top10 = movies.sort_values("Profit", ascending=False).head(10)

Answers to Questions:

6. The movie ranked 5th from the top in the list is “The Avengers”.
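
The unit conversion and profit ranking can be sketched end to end on made-up numbers. Here `nlargest` is used as an alternative to sorting and taking the head; the frame `toy`, its titles, and its figures are all hypothetical:

```python
import pandas as pd

# Hypothetical toy data, with gross and budget in dollars
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [500_000_000, 300_000_000, 900_000_000, 100_000_000],
    "budget": [200_000_000, 250_000_000, 300_000_000, 20_000_000],
})

# Convert both columns to millions of dollars
toy["gross"] = toy["gross"] / 1_000_000
toy["budget"] = toy["budget"] / 1_000_000

# Profit in millions, then the three most profitable rows
toy["Profit"] = toy["gross"] - toy["budget"]
top3 = toy.nlargest(3, "Profit").reset_index(drop=True)

# Position 1 is the second-ranked movie (positions start at 0)
second_title = top3.loc[1, "movie_title"]

print(top3[["movie_title", "Profit"]])
```

Resetting the index makes rank lookups by position straightforward, for example reading the title at position 4 for the 5th-ranked movie in a top-ten list.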

Subtask 3.3: Find IMDb Top 250


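We create a dataframe IMDb_Top_250 containing up to 250 movies with the highest IMDb rating, restricted to movies where num_voted_users is greater than 25,000, and add a Rank column. The code also keeps only movies with imdb_score above 8.0, so the result may contain fewer than 250 movies.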

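# Keep movies rated above 8.0 with more than 25,000 votes
IMDb_Top_250 = movies[(movies['imdb_score'] > 8.0) &
                      (movies['num_voted_users'] > 25000)]
# Sort by rating and keep at most 250 rows; copy() avoids assigning to a slice
IMDb_Top_250 = IMDb_Top_250.sort_values(by='imdb_score',
                                        ascending=False).head(250).copy()
IMDb_Top_250['Rank'] = range(1, IMDb_Top_250.shape[0] + 1)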

Answers to Questions:

7. The bucket holding the maximum number of movies from IMDb_Top_250 is "8 to 8.5".
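
The bucketing step itself is not shown in the document. One possible way to do it is with `pd.cut`; the sketch below is an assumption about how such buckets could be built, with hypothetical scores, bin edges, and labels (only the "8 to 8.5" label comes from the text above):

```python
import pandas as pd

# Hypothetical IMDb scores, all above 8.0
scores = pd.Series([8.1, 8.3, 8.5, 8.6, 8.9, 9.2, 8.2], name="imdb_score")

# Assumed bucket edges and labels; intervals are right-inclusive, e.g. (8.0, 8.5]
bins = [8.0, 8.5, 9.0, 9.5, 10.0]
labels = ["8 to 8.5", "8.5 to 9", "9 to 9.5", "9.5 to 10"]

buckets = pd.cut(scores, bins=bins, labels=labels)

# Movies per bucket, and the bucket holding the most movies
counts = buckets.value_counts()
largest_bucket = counts.idxmax()

print(counts)
print("Largest bucket:", largest_bucket)
```

With right-inclusive intervals, a score of exactly 8.5 falls in "8 to 8.5"; a different edge convention could shift borderline movies into the next bucket.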

Subtask 3.4: Find the critic-favorite and audience-favorite actors

We create dataframes for three actors, namely Meryl_Streep, Leo_Caprio, and Brad_Pitt, containing the movies in which they are the lead actor. Then, we combine these dataframes, group by actor, and find the mean number of critic reviews and user reviews per actor.

Meryl_Streep = movies[movies["actor_1_name"] == "Meryl Streep"]
Leo_Caprio = movies[movies["actor_1_name"] == "Leonardo DiCaprio"]
Brad_Pitt = movies[movies["actor_1_name"] == "Brad Pitt"]

# Stack the three dataframes and average the review counts per actor
Combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt], axis=0)
actor_reviews = Combined.groupby(by="actor_1_name")[
    ["num_critic_for_reviews", "num_user_for_reviews"]
].mean()

Answers to Questions:

8. By mean number of user reviews, “Leonardo DiCaprio” ranks highest among the three actors.

9. By mean number of critic reviews, “Leonardo DiCaprio” also ranks highest among the three actors.
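
The same comparison can be sketched more compactly by filtering with `isin` instead of building and concatenating three dataframes. This is an alternative formulation, not the document's method, and the frame `toy` and its review counts are hypothetical:

```python
import pandas as pd

# Hypothetical toy data; "Someone Else" should be filtered out
toy = pd.DataFrame({
    "actor_1_name": ["Meryl Streep", "Meryl Streep", "Leonardo DiCaprio",
                     "Leonardo DiCaprio", "Brad Pitt", "Someone Else"],
    "num_critic_for_reviews": [100, 200, 400, 600, 300, 50],
    "num_user_for_reviews": [300, 500, 900, 1100, 700, 10],
})

actors = ["Meryl Streep", "Leonardo DiCaprio", "Brad Pitt"]

# Keep only rows whose lead actor is one of the three
subset = toy[toy["actor_1_name"].isin(actors)]

# Mean review counts per actor
means = subset.groupby("actor_1_name")[
    ["num_critic_for_reviews", "num_user_for_reviews"]
].mean()

# Actor with the highest mean in each column
critic_favorite = means["num_critic_for_reviews"].idxmax()
user_favorite = means["num_user_for_reviews"].idxmax()

print(means)
```

Note that these columns hold review counts, so the comparison measures how much attention an actor's movies receive rather than how favorably they are rated.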

Conclusion
In this analysis, we explored a movie dataset, cleaned the data, and
ran several analyses on movies, actors, and ratings. We identified
the most profitable movies, built an IMDb top 250 list, and compared
three lead actors by their mean number of critic and user reviews.
These results may be of interest to movie enthusiasts and industry
professionals.
