Investigating a Dataset: Movies

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction

What are the properties of high revenue films? Using data from The Movie Database (TMDb), I investigated the relationship between popularity, runtime, and vote average (rating) and the adjusted revenue of the films. To account for inflation, the adjusted revenue column reflects a film’s revenue in 2010 dollars.

Has film revenue changed over time?

Are high revenue films more popular than low revenue films?

Is there a positive relationship between runtime and revenue? Is it worth it to produce a long film?

Are high revenue films highly rated?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df_movies = pd.read_csv(r"C:\Users\bsear\Desktop\Class Assignments\movies.csv")
df_movies.head(1)

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/2015	5562	6.5	2015	137999939.3	1.392446e+09

1 rows × 21 columns

Data Wrangling

A quick review of the dataset tells me what data types I have, if there are null values, and what kind of clean-up, if any, is needed before evaluation.

At first glance, it appears that all of the variables I want to investigate are populated with data. Before I take a closer look at those columns, I removed the columns “ID”, “IMDB ID”, “Homepage”, “Tagline”, and “Overview”. Since I am not using those variables in my evaluation, I took those out in order to keep the dataframe tidy and easy to read.

I kept other columns, such as titles, directors, cast, genres, production companies, and release dates in case those variables provide additional insights later on.

After closer inspection, I had to clean the runtime and adjusted revenue columns.

#what does my dataset look like?
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 id                    10866 non-null  int64  
 imdb_id               10856 non-null  object 
 popularity            10866 non-null  float64
 budget                10866 non-null  int64  
 revenue               10866 non-null  int64  
 original_title        10866 non-null  object 
 cast                  10790 non-null  object 
 homepage              2936 non-null   object 
 director              10822 non-null  object 
 tagline               8042 non-null   object 
keywords              9373 non-null   object 
overview              10862 non-null  object 
runtime               10866 non-null  int64  
genres                10843 non-null  object 
production_companies  9836 non-null   object 
release_date          10866 non-null  object 
vote_count            10866 non-null  int64  
vote_average          10866 non-null  float64
release_year          10866 non-null  int64  
budget_adj            10866 non-null  float64
revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

#Removing extraneous data columns here
df_movies.drop(['id', 'imdb_id', 'homepage','tagline','overview'], axis=1, inplace=True)
df_movies.head()

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/2015	5562	6.5	2015	137999939.3	1.392446e+09
1	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	George Miller	future\|chase\|post-apocalyptic\|dystopia\|australia	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/2015	6185	7.1	2015	137999939.3	3.481613e+08
2	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	Robert Schwentke	based on novel\|revolution\|dystopia\|sequel\|dyst...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/2015	2480	6.3	2015	101199955.5	2.716190e+08
3	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	J.J. Abrams	android\|spaceship\|jedi\|space opera\|3d	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/2015	5292	7.5	2015	183999919.0	1.902723e+09
4	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	James Wan	car race\|speed\|revenge\|suspense\|car	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/2015	2947	7.3	2015	174799923.1	1.385749e+09

#checking to see if there are duplicates and removing them
df_movies.drop_duplicates(inplace=True)
sum(df_movies.duplicated())

#check for null values, particularly in my chosen variables
list(df_movies.isnull().sum().items())

[('popularity', 0),
 ('budget', 0),
 ('revenue', 0),
 ('original_title', 0),
 ('cast', 76),
 ('director', 44),
 ('keywords', 1493),
 ('runtime', 0),
 ('genres', 23),
 ('production_companies', 1030),
 ('release_date', 0),
 ('vote_count', 0),
 ('vote_average', 0),
 ('release_year', 0),
 ('budget_adj', 0),
 ('revenue_adj', 0)]

#before starting my analysis, I want to investigate the lack of null-values just a little more.
#this serves as a double check.
df_movies.describe()

	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10865.000000	1.086500e+04	1.086500e+04	10865.000000	10865.000000	10865.000000	10865.000000	1.086500e+04	1.086500e+04
mean	0.646446	1.462429e+07	3.982690e+07	102.071790	217.399632	5.975012	2001.321859	1.754989e+07	5.136900e+07
std	1.000231	3.091428e+07	1.170083e+08	31.382701	575.644627	0.935138	12.813260	3.430753e+07	1.446383e+08
min	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	0.207575	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	0.383831	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	0.713857	1.500000e+07	2.400000e+07	111.000000	146.000000	6.600000	2011.000000	2.085325e+07	3.370173e+07
max	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

#even though the previous null data check came back as 0 for my chosen variables, the describe function shows
#that my minimum value for revenue_adj is 0 - which seems odd. Here I take a closer look
null_data=df_movies[df_movies.isnull().any(axis=1)]
null_data.head(1)

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
130	1.284541	0	4719695	True Story	Jonah Hill\|James Franco\|Felicity Jones\|Maria D...	Rupert Goold	NaN	100	Crime\|Drama\|Mystery	Plan B Entertainment\|Regency Enterprises\|New R...	4/17/2015	354	6.0	2015	0.0	4342117.489

#why does the runtime column have 0 as a minimum runtime?
df_movies[df_movies['runtime']==df_movies['runtime'].min()].head(1)

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
92	1.876037	0	0	Mythica: The Necromancer	Melanie Stone\|Adam Johnson\|Kevin Sorbo\|Nicola ...	A. Todd Smith	sword\|magic\|sorcery\|necromancer	0	Fantasy\|Action\|Adventure	Arrowstorm Entertainment\|Camera 40 Productions...	12/19/2015	11	5.4	2015	0.0	0.0

#The other variables in the table indicate that those films with "0" runtime must actually be null values
#Here I replaced the '0' values with null
df_movies['runtime'].replace(0, np.nan, inplace=True)
df_movies['runtime'].describe()

count    10834.000000
mean       102.363855
std         30.948225
min          2.000000
25%         90.000000
50%         99.000000
75%        112.000000
max        900.000000
Name: runtime, dtype: float64

#this is a variable I plan to use in my analysis, so I am replacing those null values with the mean runtime
#this allows me to compare the same number of movies across my variables
runtime_mean=df_movies['runtime'].mean()
df_movies['runtime'].fillna(runtime_mean, inplace = True)
df_movies['runtime'].describe()

count    10865.000000
mean       102.363855
std         30.904039
min          2.000000
25%         90.000000
50%         99.000000
75%        111.000000
max        900.000000
Name: runtime, dtype: float64

#why does the adjusted revenue column have 0 as a minimum value?
#Just like with runtime, the adjusted revenue column also has '0' values where null is more appropriate
df_movies[df_movies['revenue_adj']==df_movies['revenue_adj'].min()].head(1)

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
48	2.93234	30000000	0	Wild Card	Jason Statham\|Michael Angarano\|Milo Ventimigli...	Simon West	gambling\|bodyguard\|remake	92.0	Thriller\|Crime\|Drama	Current Entertainment\|Lionsgate\|Sierra / Affin...	1/14/2015	481	5.3	2015	27599987.86	0.0

#The other variables in the table indicate that those films with "0" adjusted revenue must actually be null values
#Here I am replacing the '0' values with null
df_movies['revenue_adj'].replace(0, np.nan, inplace=True)
df_movies['revenue_adj'].describe()

count    4.849000e+03
mean     1.151009e+08
std      1.988557e+08
min      2.370705e+00
25%      1.046585e+07
50%      4.395666e+07
75%      1.316482e+08
max      2.827124e+09
Name: revenue_adj, dtype: float64

#Now I am determining the mean adjusted revenue and replacing the null values with the mean
revenue_adj_mean=df_movies['revenue_adj'].mean()
df_movies['revenue_adj'].fillna(revenue_adj_mean, inplace = True)
df_movies['revenue_adj'].describe()

count    1.086500e+04
mean     1.151009e+08
std      1.328386e+08
min      2.370705e+00
25%      5.878518e+07
50%      1.151009e+08
75%      1.151009e+08
max      2.827124e+09
Name: revenue_adj, dtype: float64

#now I have the same number of non-null values for the characteristics I want to test.
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   popularity            10865 non-null  float64
 1   budget                10865 non-null  int64  
 2   revenue               10865 non-null  int64  
 3   original_title        10865 non-null  object 
 4   cast                  10789 non-null  object 
 5   director              10821 non-null  object 
 6   keywords              9372 non-null   object 
 7   runtime               10865 non-null  float64
 8   genres                10842 non-null  object 
 9   production_companies  9835 non-null   object 
 10  release_date          10865 non-null  object 
 11  vote_count            10865 non-null  int64  
 12  vote_average          10865 non-null  float64
 13  release_year          10865 non-null  int64  
 14  budget_adj            10865 non-null  float64
 15  revenue_adj           10865 non-null  float64
dtypes: float64(5), int64(4), object(7)
memory usage: 1.4+ MB

Exploratory Data Analysis

After an initial cleanup of my data, I can now begin my data analysis to answer my research questions:

1. Has film revenue changed over time?
2. Are high revenue films more popular than low revenue films?
3. Is there a positive relationship between runtime and revenue?
4. Are high revenue films highly rated?

1. Has film revenue changed over time?

Are movies making more money now than in the past?

#initial scatter plot
#I referenced the matplotlib.org page to learn the syntax for .set_xlabel 
film_plot = df_movies.plot(x='release_year', y='revenue_adj', 
                           title='Adjusted Revenue of Films Over Time', kind='scatter');
film_plot.set_xlabel('Release Year')
film_plot.set_ylabel('Adjusted Revenue in 2010 Dollars')

Text(0, 0.5, 'Adjusted Revenue in 2010 Dollars')

png

This chart visually represents every film in the dataset and its adjusted revenue. At first glance, this chart shows a higher number of films produced in later decades, as well as a number of outliers in each year. It is not, however, very clear to me if there is an obvious trend in this chart. I will need to create a better visual. I have over 10,000 records, so I will need to organize the films into broader categories, such as decades.

#Here I will cut the values into decades based on release year
bin_edges=[1960, 1970, 1980, 1990, 2000, 2010, 2020]
bin_names=['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']
df_movies['release_decade']=pd.cut(df_movies['release_year'], bin_edges, labels=bin_names)
df_movies.head(1) #testing what it looks like

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	release_decade
0	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124.0	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/2015	5562	6.5	2015	137999939.3	1.392446e+09	2010s

#pie chart visual
film_pie = df_movies['release_decade'].value_counts().plot(kind='pie', 
                                                           title='Count of Films Released per Decade');
film_pie.set_ylabel(" ")

Text(0, 0.5, ' ')

png

In this pie chart, I wanted to see if there were more film releases in certain decades to confirm my initial impression from the scatter plot. It shows an increase in the number of films released each decade.

#to prepare for my descriptive bar chart, I start by calucalting
#the mean revenue for the decades
decade_revenue_means = df_movies.groupby('release_decade')['revenue_adj'].mean()
decade_revenue_means

release_decade
1960s    1.486290e+08
1970s    1.520075e+08
1980s    1.066893e+08
1990s    1.173703e+08
2000s    1.118565e+08
2010s    1.102115e+08
Name: revenue_adj, dtype: float64

#then organize my data into the chart
locations = [1,2,3,4, 5, 6]
heights = decade_revenue_means
labels = ['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']
plt.bar(locations, heights, tick_label=labels)
plt.title('Mean Adjusted Revenue by Decade')
plt.xlabel('Decade')
plt.ylabel('Mean Adjusted Revenue in 2010 Dollars')

Text(0, 0.5, 'Mean Adjusted Revenue in 2010 Dollars')

png

Thought: It appears that, while there are more movies made in later decades, the mean adjusted revenue is higher for earlier films. I decided to create this kind of visualization to compare the adjusted revenue means. The pie chart was interesting, but it really focused on the count of movies. The scatter plot showed every film and it’s adjusted revenue, but looking at individual films in this way is too detailed. This bar chart is a great visual to show, regardless of number of films produced each decade, which decade had the highest mean adjusted revenue. In short, it’s a single visual that answers my question.

This is a visualization I really like and plan to use again, so I will make a function here to reduce the repetition of the lengthy code.

#defining my function
def bar_plot(locations, heights, labels, xlabel, ylabel, title):
    plt.bar(locations, heights, tick_label=labels)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)

#creating a list of labels
decades = list(df_movies['release_decade'].unique().dropna())
decades.sort(reverse=False)
decades

['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']

#running my new function
bar_plot([1,2,3,4,5,6], decade_revenue_means, decades,'Decades', 
         'Mean Adjusted Revenue in 2010 Dollars', 'Mean Adjusted Revenue by Decade')

png

2. Are high revenue (adj) films more popular than low revenue films?

How popular are the high revenue films? Are they more or less popular than low revenue films?

#Information about the 'popularity' variable
df_movies['popularity'].describe()

count    10865.000000
mean         0.646446
std          1.000231
min          0.000065
25%          0.207575
50%          0.383831
75%          0.713857
max         32.985763
Name: popularity, dtype: float64

df_movies['popularity'].plot(kind='hist', title= "Popularity");

png

#a closer look at the minimum value to better understand the variable
df_movies[df_movies['popularity']==df_movies['popularity'].min()]

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	release_decade	popularity_level
6181	0.000065	0	0	North and South, Book I	Patrick Swayze\|Philip Casnoff\|Kirstie Alley\|Ge...	NaN	NaN	561.0	Drama\|History\|Western	NaN	11/3/1985	17	6.0	1985	0.0	1.151009e+08	1980s	NaN

#Getting more information about what the 'popularity' variable means
df_movies.groupby('revenue_adj')['popularity'].mean()

revenue_adj
370705e+00     0.462609
861934e+00     0.552091
038360e+00     0.352054
926763e+00     0.208637
951084e+00     0.578849
                  ...    
907006e+09     2.563191
167325e+09     2.010733
506406e+09     4.355219
789712e+09    12.037933
827124e+09     9.432768
Name: popularity, Length: 4840, dtype: float64

In this case, I’m not sure that the variable ‘popularity’ is really understandable in its current format. I think I need to adjust the values into more meaningful phrases.

#changing the values into string labels
bin_edges=[0.000065, 0.207575, 0.383831, 32.985763]
bin_names=['Low', 'Moderate', 'High']
df_movies['popularity_level']=pd.cut(df_movies['popularity'], bin_edges, labels=bin_names)
df_movies.head(1) #testing what it looks like

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	release_decade	popularity_level
0	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124.0	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/2015	5562	6.5	2015	137999939.3	1.392446e+09	2010s	High

I think my new column ‘popularity level’ better conveys the meaning from the ‘popularity’ column.

#build a visual to compare popularity and mean adjusted revenue
popularity_revenue_means = df_movies.groupby('popularity_level')['revenue_adj'].mean()
popularity_revenue_means

popularity_level
Low         1.013358e+08
Moderate    9.190113e+07
High        1.335833e+08
Name: revenue_adj, dtype: float64

#creating a list of labels
popularity_level = ['Low', 'Moderate', 'High']
popularity_level

['Low', 'Moderate', 'High']

#running my new function
bar_plot([1,2,3], popularity_revenue_means, popularity_level,'Popularity Level', 'Mean Adjusted Revenue in 2010 Dollars', 
         'Popularity of Films and Mean Adjusted Revenue')

png

Thought: Just like in my first question, I wanted to create a bar graph to answer my question. It’s an easy way to see the difference in popularity levels and film revenue, regardess of the number of films produced. It’s not particularly surprising that highly popular films also bring in the most revenues. It’s interesting, however, that low popularity films make more revenue than moderatly popular films.

3. Is there a positive relationship between runtime and revenue (adj)?

Do long movies make more money?

#Descriptive stats on the 'runtime' variable
df_movies['runtime'].describe()

count    10865.000000
mean       102.363855
std         30.904039
min          2.000000
25%         90.000000
50%         99.000000
75%        111.000000
max        900.000000
Name: runtime, dtype: float64

Just like ‘popularity’, I think looking at the length of a film in terms of minutes might not be meaningful #to people reading the information. For example, it’s easier to understand what 1 hour 40 minutes means #in terms of movie length compared to 100 minutes.

#quick review of my variable
df_movies['runtime'].plot(kind='hist', title="Movie Runtimes");

png

#Here I create a new measure for the movie runtimes
bin_edges=[0,90,99,111,900]
bin_names=['short', 'medium', 'moderately long', 'long']
df_movies['movie_length']=pd.cut(df_movies['runtime'], bin_edges, labels=bin_names)
df_movies.head(1) #testing what it looks like

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	release_decade	popularity_level	movie_length
0	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124.0	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/2015	5562	6.5	2015	137999939.3	1.392446e+09	2010s	High	long

#How many of each movie length type
df_movies['movie_length'].value_counts()

short              2931
long               2710
medium             2651
moderately long    2573
Name: movie_length, dtype: int64

The number of films fall fairly evenly into each category of length

#As percentages to confirm my impression
(df_movies['movie_length'].value_counts()/df_movies['movie_length'].value_counts().sum())*100

short              26.976530
long               24.942476
medium             24.399448
moderately long    23.681546
Name: movie_length, dtype: float64

Is there a relationship between movie length and revenue?

#Calculating means for each length type
runtime_revenue_means = df_movies.groupby('movie_length')['revenue_adj'].mean()
runtime_revenue_means

movie_length
short              1.035472e+08
medium             9.980478e+07
moderately long    1.009606e+08
long               1.559853e+08
Name: revenue_adj, dtype: float64

#creating a list of labels
movie_lengths = list(df_movies['movie_length'].unique().dropna())
movie_lengths.sort(reverse=True)
movie_lengths

['short', 'moderately long', 'medium', 'long']

#running my function
bar_plot([1,2,3,4], runtime_revenue_means, movie_lengths,'Film Runtime', 
         'Mean Adjusted Revenue in 2010 Dollars', 'Mean Adjusted Revenue by Film Runtime')

png

Thought: Again, using this kind of visualization is a great way to see where difference exists. A pie chart might also be an interesting visual for this output, as well; particularly since there is one value much greater than the rest. I a little suprrised by the result. There is clearly a jump in revenue for “long” films. “Long” films are at least 1 hour and 51 minutes long.

4. Are high revenue (adj) films highly rated?

Just because a movie makes a lot of money - does it mean that it’s also highly rated?

#Descriptive stats for the variable
df_movies['vote_average'].describe()

count    10865.000000
mean         5.975012
std          0.935138
min          1.500000
25%          5.400000
50%          6.000000
75%          6.600000
max          9.200000
Name: vote_average, dtype: float64

After looking at the descriptive stats for ‘vote average’, these values are a little more unclear at first glance. Some descriptive language might make the values have more meaning to readers.

#more context for my analysis
df_movies['vote_average'].plot(kind='hist', title="Vote Averages");

png

#again, creating descriptive values by defining bin edges and cutting the data into my new measures.
bin_edges=[1.5, 5.4,6.0,6.6,9.2]
bin_names=['low', 'medium', 'moderately high', 'high']
df_movies['voting_score']=pd.cut(df_movies['vote_average'], bin_edges, labels=bin_names)
df_movies.head(1)

	popularity	budget	revenue	original_title	cast	director	keywords	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	release_decade	popularity_level	movie_length	voting_score
0	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124.0	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/2015	5562	6.5	2015	137999939.3	1.392446e+09	2010s	High	long	moderately high

#creating means
vote_values_means = df_movies.groupby('voting_score')['popularity'].mean()
vote_values_means

voting_score
low                0.420572
medium             0.555395
moderately high    0.705920
high               0.930673
Name: popularity, dtype: float64

#creating a list of labels
scores = ['low', 'medium', 'moderately high', 'high']
scores

['low', 'medium', 'moderately high', 'high']

#running my function
bar_plot([1,2,3,4], vote_values_means, scores,'Voting Scores', 
         'Popularity', 'Relationship Between Voting Scores and Popularity')

png

Thought: A detailed bar chart shows the relationship between voting scores (ratings) and popularity. It’s a clear trend indicating that popular films are highly rated by audiences. Does the same trend hold with another kind of visual?

#Scatter
popularity_plot = df_movies.plot(x='vote_average', y='popularity',
                                 title="Vote Average and Popularity", kind='scatter');
popularity_plot.set_xlabel('Vote Average')
popularity_plot.set_ylabel('Popularity')

Text(0, 0.5, 'Popularity')

png

The trend holds within a scatter plot; while this kind of chart shows the outliers, the data clearly shows a positive relationship between vote average and popularity.

Conclusions

1. Has film revenue changed over time?

Yes, absolutely. I found it particularly interesting that the number of films being made has increased, but the revenue is down. Movies released in the 1960s and 1970s were fewer in number but earned more at the boxoffice. Why?

Fewer TVs? Less at-home entertainment could explain why revenue was higher in teh 1960s and 1970s.
The rise of the VHS in the 1980s and the prevalence of at-home movie consumption in the 1990s and 2000s could explain the drop in revenue despite the increased number of film releases.

Limitation: I am making an educated guess as to why films brought in higher revenue in the 1960s and 1970s. An additonal dataset with information about at-home entertainment systems would be an interesting evaluation. Is there a correlation?

Limitation: Were people going to the movies more often back in teh 1960s and 1970s? This dataset may record “popularity” but it doesn’t necessarily indicate how may people attended these movies.

2. Are high revenue films more popular than low revenue films?

As expected, yes, high revenue films are popular films. It is an easy assumption to make that, if people like a film, they will invest in it more, either by seeing it in the theatre multiple times and/or purchasing the film.

Limitation: high revenue films are popular, but are popular films high earners? It would be a good idea to confirm this. Are both variables predictors of the other?

Limitation: the variable “popularity” is a somewhat vague variable and I could not find much information online about how TMDb defines popularity. Is it the number of people who see the film? The number of times it plays? How many copies of the film people buy? How long the film is in theatres? Or is it another measure? While the relationship between popularity and revenue doesn’t change, the meaning of the results could change if the variable is more clearly defined.

3. Is there a positive relationship between runtime and revenue?

There is a positive relationship between runtime and revenue. You can clearly see in the bar graph how much higher revenue is for the “long” film runtime.

I was surprised by this result at first, as I wasn’t sure there would be a well-defined relationship at all.
I also expected that if there was a relationship, I didn’t think long films would necessarily be popular.

Limitations: Time is relative and subjective. Using runtime as a variable may be useful in determining the minimum and maximum film runtime to get the most return on investment, but people’s definitions of a “long” or “short” film may vary.

Liminations: After looking at the data more closely, 111 minutes (1 hour 51 minutes) isn’t really that long (to me). As I mentioned above, movie time length can be subjective and others may think a 2 hour film is long. If I did this analysis again, I might categorize the lengths differently or change the minute runtime into hour runtimes.

4. Are high revenue films highly rated?

Yes, high revenue films also receive high ratings. I was a little surprised at these findings; just because a movie brings in a lot of money doesn’t automatically make it a “good” film. Then again, this is a subjective measure.

Limitations: Just like with the “popularity” variable, TMDb doesn’t define “average rating” in this dataset. I made the assumption that it was a film rating.

Final Thoughts

This dataset was very interesting to work with and there are many ways to measure the characteristics of a high revenue film than the ones I used. If I were to do this project over, I would re-consider some of the variables I chose to evaluate my questions. The dataset could be improved with more information about some of the variables, such as popularity and average ratings. While the lack of a definition doesn’t affect the calculations, it can affect the story I tell with the data.