IMDB Dataframe Insights
1. Inspect the data frame for dimensions, null-values, and summary of different
columns
2. Get the summary of numeric columns
3. Convert the unit of the budget and gross columns from $ to million $
4. Create a new column profit and sort the data frame by profit. Then extract the
ten most profitable movies in descending order and store them in a new data frame,
top10
5. Plot a scatter plot for budget and profit features and write a few words on what you
observed
6. Extract the movies with a negative profit and store them in a new data frame –
negative_profit
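Steps 3, 4, and 6 can be sketched roughly as follows. This is a minimal illustration on an invented toy data frame, not the real data set: the column names Title, budget, and Gross are assumptions, and profit is assumed to mean gross minus budget, which the task does not state explicitly.

```python
import pandas as pd

# Toy stand-in for the movies data frame; all values are invented.
# Column names "Title", "budget" and "Gross" are assumptions.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "budget": [100_000_000, 20_000_000, 50_000_000, 5_000_000],
    "Gross": [350_000_000, 15_000_000, 50_000_000, 45_000_000],
})

# Step 3: convert dollars to millions of dollars.
movies[["budget", "Gross"]] = movies[["budget", "Gross"]] / 1_000_000

# Step 4: profit (assumed to be Gross minus budget), sorted in descending order.
movies["profit"] = movies["Gross"] - movies["budget"]
movies = movies.sort_values("profit", ascending=False)
top10 = movies.head(10)

# Step 6: movies that lost money.
negative_profit = movies[movies["profit"] < 0]
```

Because the frame is already sorted, head(10) returns the top ten rows; on the toy data it simply returns all four.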
7. Create a new column Avg_rating (average of the MetaCritic and Rating) in the data
frame and arrange the movies in the descending order of Avg_rating
8. Find the trios with the highest combined number of Facebook likes (i.e., the sum
of actor_1_facebook_likes, actor_2_facebook_likes, and actor_3_facebook_likes should
be maximum). Find the top 5 most popular trios and output their names in a list.
Write a few words on what you observed
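One possible way to approach step 8 is sketched below on invented data. The three *_facebook_likes columns come from the task text; the actor_1_name, actor_2_name, and actor_3_name columns and the helper column trio_likes are assumptions made for this example.

```python
import pandas as pd

# Toy data; names and like counts are invented.
movies = pd.DataFrame({
    "actor_1_name": ["P", "Q", "R"],
    "actor_2_name": ["S", "T", "U"],
    "actor_3_name": ["V", "W", "X"],
    "actor_1_facebook_likes": [1000, 500, 300],
    "actor_2_facebook_likes": [200, 900, 100],
    "actor_3_facebook_likes": [50, 800, 10],
})

like_cols = ["actor_1_facebook_likes", "actor_2_facebook_likes",
             "actor_3_facebook_likes"]
name_cols = ["actor_1_name", "actor_2_name", "actor_3_name"]  # assumed names

# Combined likes of the three lead actors of each movie.
movies["trio_likes"] = movies[like_cols].sum(axis=1)

# Five rows with the largest combined likes, then their names as a list of lists.
top5 = movies.nlargest(5, "trio_likes")
top5_trios = top5[name_cols].values.tolist()
```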
9. Check how the Runtime variable is distributed by plotting a histogram or a seaborn
distplot, and find the Runtime range that most of the movies fall into.
10. Although R-rated movies are restricted for the under-18 age group, there are still
vote counts from that age group. Filter the R-rated movies and sort them by
'CVotesU18' in descending order. Get the top 5 among all the R-rated movies that have
been voted on by the under-18 age group.
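Step 10 might look something like this. CVotesU18 is named in the task; the column content_rating (holding values such as "R") and the Title column are assumptions, and the rows are invented.

```python
import pandas as pd

# Toy data; "content_rating" and "Title" are assumed column names.
movies = pd.DataFrame({
    "Title": ["M1", "M2", "M3", "M4"],
    "content_rating": ["R", "PG-13", "R", "R"],
    "CVotesU18": [120, 900, 450, 30],
})

# Keep only R-rated movies, then rank them by under-18 vote count.
r_rated = movies[movies["content_rating"] == "R"]
top5_r_u18 = r_rated.sort_values("CVotesU18", ascending=False).head(5)
```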
11. Display Title of The Movie Having Runtime Greater Than or Equal to 180 Minutes
12. In Which Year Was There The Highest Average Voting?
13. In Which Year Was There The Highest Average Revenue?
14. Find The Average Rating For Each Director
15. Display Top 10 Lengthy Movie Titles and Runtime
16. Display Number of Movies Released Per Year
17. Find Most Popular Movie Title (Highest Revenue)
18. Display Top 10 Highest Rated Movies And Their Directors
19. Display Top 10 Highest Revenue Movies
20. Find Average Rating of Movies Year Wise
21. Does Rating Affect The Revenue?
22. Classify Movies Based on Ratings [Excellent, Good, and Average]
23. Count the Number of Action Movies
24. How Many Films of Each Genre Were Made?
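A few of the questions above (11, 20, and 22) can be sketched with a filter, a groupby, and pd.cut. The column names Title, Year, Runtime, and Rating are assumptions, and so are the rating thresholds (up to 6 is Average, up to 8 is Good, above 8 is Excellent), since the task does not define the classes.

```python
import pandas as pd

# Toy data; column names and values are assumptions for illustration.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [2010, 2010, 2011, 2011],
    "Runtime": [95, 181, 180, 120],
    "Rating": [5.5, 7.2, 8.9, 8.0],
})

# Question 11: titles of movies running 180 minutes or longer.
long_titles = movies.loc[movies["Runtime"] >= 180, "Title"].tolist()

# Question 20: average rating per year; idxmax() gives the best year.
avg_rating_by_year = movies.groupby("Year")["Rating"].mean()
best_year = avg_rating_by_year.idxmax()

# Question 22: bin ratings into classes (thresholds are assumed, not given).
movies["rating_class"] = pd.cut(
    movies["Rating"],
    bins=[0, 6, 8, 10],            # intervals (0, 6], (6, 8], (8, 10]
    labels=["Average", "Good", "Excellent"],
)
```

The same groupby pattern, with a different key or value column, would cover the per-director and per-year revenue questions.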
25. Demographic Analysis – Create a new data frame genre_top10 as below
a. Create a new dataframe df_by_genre that contains genre_1, genre_2,
and genre_3 and all the columns related to CVotes/Votes from the movies data
frame. There are 47 columns to be extracted in total. Add a column called cnt to the
dataframe df_by_genre and initialize it to one.
b. Group the dataframe df_by_genre by genre_1 and find the sum of all the numeric
columns (cnt and the CVotes- and Votes-related columns), and store it in
a dataframe df_by_g1. Perform the same operation for genre_2 and genre_3 and
store the results in dataframes df_by_g2 and df_by_g3 respectively.
c. Now that we have 3 dataframes obtained by grouping over genre_1, genre_2,
and genre_3 separately, it's time to combine them. For this, add the three
dataframes and store the result in a new dataframe df_add, so that the corresponding
values of Votes/CVotes get added for each genre (use the function add()).
d. After aggregation, the column cnt holds the number of
occurrences of each genre. Subset the genres that have at least 10 movies into a
new dataframe genre_top10 based on the cnt column value.
e. Now, take the mean of all the numeric columns by dividing them by the column
value cnt and store it back in the same dataframe.
f. Since the number of votes can't be a fraction, typecast all the CVotes-related
columns to integers. Also, round off all the Votes-related columns to two digits
after the decimal point.
g. Now the final data frame genre_top10 should have the complete information about
all the demographic (Votes- and CVotes-related) columns across the top 10 genres.
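Sub-steps a to g of step 25 can be sketched as a small pipeline. The example below is a hedged illustration on invented data with only two vote columns (CVotes10 and Votes1829 are placeholder names), and it uses a threshold of 2 movies instead of 10 so that the toy frame produces a result. It also assumes that add() is called with fill_value=0 so that genres missing from one of the three grouped frames are not turned into NaN.

```python
import pandas as pd

# Toy data; the vote column names and all values are invented.
movies = pd.DataFrame({
    "genre_1": ["Action", "Drama", "Action", "Drama"],
    "genre_2": ["Drama", "Romance", "Thriller", None],
    "genre_3": ["Thriller", None, None, None],
    "CVotes10": [100, 40, 60, 20],
    "Votes1829": [8.0, 7.0, 7.5, 6.5],
})

# a. Genre columns plus every CVotes/Votes column, and a counter set to one.
vote_cols = [c for c in movies.columns if c.startswith(("CVotes", "Votes"))]
df_by_genre = movies[["genre_1", "genre_2", "genre_3"] + vote_cols].copy()
df_by_genre["cnt"] = 1

# b. Sum the numeric columns per genre, separately for each genre column.
num_cols = vote_cols + ["cnt"]
df_by_g1 = df_by_genre.groupby("genre_1")[num_cols].sum()
df_by_g2 = df_by_genre.groupby("genre_2")[num_cols].sum()
df_by_g3 = df_by_genre.groupby("genre_3")[num_cols].sum()

# c. Add the three frames; fill_value=0 keeps genres absent from one frame.
df_add = df_by_g1.add(df_by_g2, fill_value=0).add(df_by_g3, fill_value=0)

# d. Keep genres with enough movies (10 in the task, 2 for this toy data).
MIN_MOVIES = 2
genre_top10 = df_add[df_add["cnt"] >= MIN_MOVIES].copy()

# e. Turn the sums into per-movie means (cnt itself is left as a count).
genre_top10[vote_cols] = genre_top10[vote_cols].div(genre_top10["cnt"], axis=0)

# f. Integer vote counts, Votes columns rounded to two decimals.
cvote_cols = [c for c in vote_cols if c.startswith("CVotes")]
rating_cols = [c for c in vote_cols if c.startswith("Votes")]
genre_top10[cvote_cols] = genre_top10[cvote_cols].astype(int)
genre_top10[rating_cols] = genre_top10[rating_cols].round(2)
```

Note that groupby drops rows whose key is missing, so movies without a genre_2 or genre_3 simply do not contribute to those two grouped frames.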
26. Using the genre_top10 data frame (created in the step above), draw some
insights as below
a. Plot a bar chart of the different genres vs cnt using seaborn
b. Plot a heatmap to see how the average number of votes from males varies across
the genres. Use a seaborn heatmap for this analysis. The X-axis should contain the
four age groups for males, i.e., CVotesU18M, CVotes1829M, CVotes3044M,
and CVotes45AM. The Y-axis should show the genres, and the annotations in the
heatmap should give the average number of votes for each male age group. Draw
inferences from this plot.
c. Plot a heatmap to see how the average number of votes from females varies across the
genres. Use a seaborn heatmap for this analysis. The X-axis should contain the four
age groups for females, i.e., CVotesU18F, CVotes1829F, CVotes3044F,
and CVotes45AF. The Y-axis should show the genres, and the annotations in the heatmap
should give the average number of votes for each female age group. Draw inferences
from this plot.
d. Plot a heatmap to see how the average values of the female Votes columns vary across
the genres. Use a seaborn heatmap for this analysis. The X-axis should contain the
four age groups for females, i.e., VotesU18F, Votes1829F, Votes3044F,
and Votes45AF. The Y-axis should show the genres, and the annotations in the heatmap
should give the average value for each female age group. Draw inferences
from this plot.
e. Sort the dataframe genre_top10 based on the value of CVotes1000 in descending
order.
27. USA vs non-USA cross-analysis. Use the movies data frame for this analysis.
- Create a column IFUS in the dataframe movies. The column IFUS should contain
the value "USA" if the Country of the movie is "USA". For all other
countries, IFUS should contain the value "non-USA".
- Make a boxplot that shows how the number of votes from US voters,
i.e. CVotesUS, varies for the US and non-US movies. Make use of the
column IFUS to make this plot. Similarly, make another subplot that shows how
non-US voters have voted for the US and non-US movies by
plotting CVotesnUS for both the US and non-US movies.
- Draw the inferences for this analysis
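The IFUS column in step 27 could be built along these lines. Country, CVotesUS, and CVotesnUS are taken from the task text; the rows are invented, and the per-group medians are only a numeric stand-in for what the two box plots would show.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
movies = pd.DataFrame({
    "Country": ["USA", "UK", "USA", "India"],
    "CVotesUS": [500, 80, 300, 20],
    "CVotesnUS": [400, 200, 250, 150],
})

# "USA" where the country is the USA, "non-USA" everywhere else.
movies["IFUS"] = np.where(movies["Country"] == "USA", "USA", "non-USA")

# Median vote counts per group: the centre line of each box in the box plots.
medians = movies.groupby("IFUS")[["CVotesUS", "CVotesnUS"]].median()
```

With seaborn installed, each box plot could then be drawn with a call such as sns.boxplot(x="IFUS", y="CVotesUS", data=movies), one per matplotlib subplot.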
28. Write a complete report on IMDB data collected based on the analysis
done in all the above steps and also show the visualization using Power BI
*******************************HAPPY ANALYSIS****************************************