karolines-questions

Investigating a Dataset: Movies

Introduction

What are the properties of high revenue films? Using data from The Movie Database (TMDb), I investigated how popularity, runtime, and vote average (rating) relate to the adjusted revenue of films. To account for inflation, the adjusted revenue column reflects a film’s revenue in 2010 dollars.

  • Has film revenue changed over time?
  • Are high revenue films more popular than low revenue films?
  • Is there a positive relationship between runtime and revenue? Is it worth it to produce a long film?
  • Are high revenue films highly rated?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df_movies = pd.read_csv(r"C:\Users\bsear\Desktop\Class Assignments\movies.csv")
df_movies.head(1)
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/2015 5562 6.5 2015 137999939.3 1.392446e+09

1 rows × 21 columns


Data Wrangling

A quick review of the dataset tells me what data types I have, whether there are null values, and what kind of clean-up, if any, is needed before evaluation.

At first glance, it appears that all of the variables I want to investigate are populated with data. Before taking a closer look at those columns, I removed the columns “id”, “imdb_id”, “homepage”, “tagline”, and “overview”. Since I am not using those variables in my evaluation, I took them out to keep the dataframe tidy and easy to read.

I kept other columns, such as titles, directors, cast, genres, production companies, and release dates in case those variables provide additional insights later on.

After closer inspection, I had to clean the runtime and adjusted revenue columns.

#what does my dataset look like?
df_movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
#Removing extraneous data columns here
df_movies.drop(['id', 'imdb_id', 'homepage','tagline','overview'], axis=1, inplace=True)
df_movies.head()
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/2015 5562 6.5 2015 137999939.3 1.392446e+09
1 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller future|chase|post-apocalyptic|dystopia|australia 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/2015 6185 7.1 2015 137999939.3 3.481613e+08
2 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... Robert Schwentke based on novel|revolution|dystopia|sequel|dyst... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/2015 2480 6.3 2015 101199955.5 2.716190e+08
3 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... J.J. Abrams android|spaceship|jedi|space opera|3d 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/2015 5292 7.5 2015 183999919.0 1.902723e+09
4 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... James Wan car race|speed|revenge|suspense|car 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/2015 2947 7.3 2015 174799923.1 1.385749e+09
#checking to see if there are duplicates and removing them
df_movies.drop_duplicates(inplace=True)
sum(df_movies.duplicated())
0
#check for null values, particularly in my chosen variables
list(df_movies.isnull().sum().items())
[('popularity', 0),
 ('budget', 0),
 ('revenue', 0),
 ('original_title', 0),
 ('cast', 76),
 ('director', 44),
 ('keywords', 1493),
 ('runtime', 0),
 ('genres', 23),
 ('production_companies', 1030),
 ('release_date', 0),
 ('vote_count', 0),
 ('vote_average', 0),
 ('release_year', 0),
 ('budget_adj', 0),
 ('revenue_adj', 0)]
#before starting my analysis, I want to investigate the lack of null-values just a little more.
#this serves as a double check.
df_movies.describe()
popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000 1.086500e+04 1.086500e+04
mean 0.646446 1.462429e+07 3.982690e+07 102.071790 217.399632 5.975012 2001.321859 1.754989e+07 5.136900e+07
std 1.000231 3.091428e+07 1.170083e+08 31.382701 575.644627 0.935138 12.813260 3.430753e+07 1.446383e+08
min 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 0.207575 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 0.383831 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 0.713857 1.500000e+07 2.400000e+07 111.000000 146.000000 6.600000 2011.000000 2.085325e+07 3.370173e+07
max 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
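
A null check only counts NaN entries, so zeros that stand in for missing values pass through it unnoticed. The following is a minimal sketch, on a small made-up frame (the values are hypothetical and not taken from the dataset), of counting zeros alongside nulls:

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "runtime": [124, 0, 99],
    "revenue_adj": [1.4e9, 0.0, 0.0],
    "budget_adj": [1.4e8, 0.0, 2.8e7],
})

# isnull() reports nothing, because a zero is a valid number.
null_counts = toy.isnull().sum()

# Comparing against 0 shows how many placeholder zeros each column holds.
zero_counts = (toy == 0).sum()

print(null_counts)
print(zero_counts)
```

Running the same comparison on df_movies would show how many rows each zero-to-null replacement is going to affect.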
#even though the previous null data check came back as 0 for my chosen variables, the describe function shows
#that my minimum value for revenue_adj is 0, which seems odd. First, a look at the rows that do contain null values
null_data=df_movies[df_movies.isnull().any(axis=1)]
null_data.head(1)
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
130 1.284541 0 4719695 True Story Jonah Hill|James Franco|Felicity Jones|Maria D... Rupert Goold NaN 100 Crime|Drama|Mystery Plan B Entertainment|Regency Enterprises|New R... 4/17/2015 354 6.0 2015 0.0 4342117.489
#why does the runtime column have 0 as a minimum runtime?
df_movies[df_movies['runtime']==df_movies['runtime'].min()].head(1)
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
92 1.876037 0 0 Mythica: The Necromancer Melanie Stone|Adam Johnson|Kevin Sorbo|Nicola ... A. Todd Smith sword|magic|sorcery|necromancer 0 Fantasy|Action|Adventure Arrowstorm Entertainment|Camera 40 Productions... 12/19/2015 11 5.4 2015 0.0 0.0
#The other variables in the table indicate that those films with "0" runtime must actually be null values
#Here I replaced the '0' values with null
df_movies['runtime'] = df_movies['runtime'].replace(0, np.nan)
df_movies['runtime'].describe()
count    10834.000000
mean       102.363855
std         30.948225
min          2.000000
25%         90.000000
50%         99.000000
75%        112.000000
max        900.000000
Name: runtime, dtype: float64
#this is a variable I plan to use in my analysis, so I am replacing those null values with the mean runtime
#this allows me to compare the same number of movies across my variables
runtime_mean=df_movies['runtime'].mean()
df_movies['runtime'] = df_movies['runtime'].fillna(runtime_mean)
df_movies['runtime'].describe()
count    10865.000000
mean       102.363855
std         30.904039
min          2.000000
25%         90.000000
50%         99.000000
75%        111.000000
max        900.000000
Name: runtime, dtype: float64
#why does the adjusted revenue column have 0 as a minimum value?
#Just like with runtime, the adjusted revenue column also has '0' values where null is more appropriate
df_movies[df_movies['revenue_adj']==df_movies['revenue_adj'].min()].head(1)
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
48 2.93234 30000000 0 Wild Card Jason Statham|Michael Angarano|Milo Ventimigli... Simon West gambling|bodyguard|remake 92.0 Thriller|Crime|Drama Current Entertainment|Lionsgate|Sierra / Affin... 1/14/2015 481 5.3 2015 27599987.86 0.0
#The other variables in the table indicate that those films with "0" adjusted revenue must actually be null values
#Here I am replacing the '0' values with null
df_movies['revenue_adj'] = df_movies['revenue_adj'].replace(0, np.nan)
df_movies['revenue_adj'].describe()
count    4.849000e+03
mean     1.151009e+08
std      1.988557e+08
min      2.370705e+00
25%      1.046585e+07
50%      4.395666e+07
75%      1.316482e+08
max      2.827124e+09
Name: revenue_adj, dtype: float64
#Now I am determining the mean adjusted revenue and replacing the null values with the mean
revenue_adj_mean=df_movies['revenue_adj'].mean()
df_movies['revenue_adj'] = df_movies['revenue_adj'].fillna(revenue_adj_mean)
df_movies['revenue_adj'].describe()
count    1.086500e+04
mean     1.151009e+08
std      1.328386e+08
min      2.370705e+00
25%      5.878518e+07
50%      1.151009e+08
75%      1.151009e+08
max      2.827124e+09
Name: revenue_adj, dtype: float64
#now I have the same number of non-null values for the characteristics I want to test.
df_movies.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   popularity            10865 non-null  float64
 1   budget                10865 non-null  int64  
 2   revenue               10865 non-null  int64  
 3   original_title        10865 non-null  object 
 4   cast                  10789 non-null  object 
 5   director              10821 non-null  object 
 6   keywords              9372 non-null   object 
 7   runtime               10865 non-null  float64
 8   genres                10842 non-null  object 
 9   production_companies  9835 non-null   object 
 10  release_date          10865 non-null  object 
 11  vote_count            10865 non-null  int64  
 12  vote_average          10865 non-null  float64
 13  release_year          10865 non-null  int64  
 14  budget_adj            10865 non-null  float64
 15  revenue_adj           10865 non-null  float64
dtypes: float64(5), int64(4), object(7)
memory usage: 1.4+ MB
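
Filling missing values with the overall mean keeps the row count the same, but it also pulls every group mean toward that overall mean and shrinks the spread (in the describe() output above, the median and 75th percentile of revenue_adj both equal the fill value). A minimal sketch, on invented numbers, comparing that approach with simply leaving the gaps as NaN, which groupby().mean() skips by default:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: zeros stand in for unreported revenue.
toy = pd.DataFrame({
    "decade": ["1990s", "1990s", "1990s", "2000s", "2000s", "2000s"],
    "revenue_adj": [100.0, 0.0, 300.0, 0.0, 0.0, 50.0],
})

# Treat zeros as missing values.
toy["revenue_adj"] = toy["revenue_adj"].replace(0, np.nan)

# Option 1: fill the gaps with the overall mean.
overall_mean = toy["revenue_adj"].mean()
filled = toy["revenue_adj"].fillna(overall_mean)
means_filled = filled.groupby(toy["decade"]).mean()

# Option 2: leave the gaps; NaN values are ignored when averaging.
means_skipped = toy.groupby("decade")["revenue_adj"].mean()

print(means_filled)
print(means_skipped)
```

In this toy case the filled group means sit closer together than the skipped ones. Which option is appropriate depends on the question being asked; this is only an illustration of the trade-off.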

Exploratory Data Analysis

After an initial cleanup of my data, I can now begin my data analysis to answer my research questions:


1. Has film revenue changed over time?

Are movies making more money now than in the past?

#initial scatter plot
#I referenced the matplotlib.org page to learn the syntax for .set_xlabel 
film_plot = df_movies.plot(x='release_year', y='revenue_adj', 
                           title='Adjusted Revenue of Films Over Time', kind='scatter');
film_plot.set_xlabel('Release Year')
film_plot.set_ylabel('Adjusted Revenue in 2010 Dollars')
Text(0, 0.5, 'Adjusted Revenue in 2010 Dollars')

[Figure: scatter plot, Adjusted Revenue of Films Over Time]

This chart visually represents every film in the dataset and its adjusted revenue. At first glance, it shows a higher number of films produced in later decades, as well as a number of outliers in each year. It is not clear to me, however, whether there is a trend in this chart, so I need a better visual. With over 10,000 records, I will organize the films into broader categories, such as decades.

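#Here I will cut the values into decades based on release year
#note: pd.cut bins are right-closed by default, e.g. (1960, 1970], so a 1970 film is labeled '1960s'
#and films released in 1960 (the lowest edge) are left unlabeled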
bin_edges=[1960, 1970, 1980, 1990, 2000, 2010, 2020]
bin_names=['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']
df_movies['release_decade']=pd.cut(df_movies['release_year'], bin_edges, labels=bin_names)
df_movies.head(1) #testing what it looks like
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj release_decade
0 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124.0 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/2015 5562 6.5 2015 137999939.3 1.392446e+09 2010s
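
How pd.cut treats the bin edges matters for the decade labels. A minimal sketch, on a handful of hypothetical years, of the default right-closed bins next to left-closed bins (right=False), which may be closer to the usual meaning of a decade:

```python
import pandas as pd

# A few hypothetical release years, chosen to sit on or near bin edges.
years = pd.Series([1960, 1969, 1970, 2015])
edges = [1960, 1970, 1980, 1990, 2000, 2010, 2020]
names = ["1960s", "1970s", "1980s", "1990s", "2000s", "2010s"]

# Default (right=True): bins are (1960, 1970], (1970, 1980], ...
default_bins = pd.cut(years, edges, labels=names)

# right=False: bins are [1960, 1970), [1970, 1980), ...
left_closed = pd.cut(years, edges, labels=names, right=False)

print(pd.DataFrame({"year": years,
                    "default": default_bins,
                    "left_closed": left_closed}))
```

With the default, 1960 receives no label and 1970 is grouped with the 1960s; with right=False, both land in the decade their name suggests.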
#pie chart visual
film_pie = df_movies['release_decade'].value_counts().plot(kind='pie', 
                                                           title='Count of Films Released per Decade');
film_pie.set_ylabel(" ")
Text(0, 0.5, ' ')

[Figure: pie chart, Count of Films Released per Decade]

In this pie chart, I wanted to see if there were more film releases in certain decades to confirm my initial impression from the scatter plot. It shows an increase in the number of films released each decade.

#to prepare for my descriptive bar chart, I start by calculating
#the mean revenue for the decades
decade_revenue_means = df_movies.groupby('release_decade')['revenue_adj'].mean()
decade_revenue_means
release_decade
1960s    1.486290e+08
1970s    1.520075e+08
1980s    1.066893e+08
1990s    1.173703e+08
2000s    1.118565e+08
2010s    1.102115e+08
Name: revenue_adj, dtype: float64
#then organize my data into the chart
locations = [1,2,3,4, 5, 6]
heights = decade_revenue_means
labels = ['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']
plt.bar(locations, heights, tick_label=labels)
plt.title('Mean Adjusted Revenue by Decade')
plt.xlabel('Decade')
plt.ylabel('Mean Adjusted Revenue in 2010 Dollars')
Text(0, 0.5, 'Mean Adjusted Revenue in 2010 Dollars')

[Figure: bar chart, Mean Adjusted Revenue by Decade]

Thought: It appears that, while there are more movies made in later decades, the mean adjusted revenue is higher for earlier films. I decided to create this kind of visualization to compare the adjusted revenue means. The pie chart was interesting, but it really focused on the count of movies. The scatter plot showed every film and its adjusted revenue, but looking at individual films in this way is too detailed. This bar chart is a great visual to show, regardless of the number of films produced each decade, which decade had the highest mean adjusted revenue. In short, it’s a single visual that answers my question.

  • This is a visualization I really like and plan to use again, so I will make a function here to reduce the repetition of the lengthy code.
#defining my function
def bar_plot(locations, heights, labels, xlabel, ylabel, title):
    plt.bar(locations, heights, tick_label=labels)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
#creating a list of labels
decades = list(df_movies['release_decade'].unique().dropna())
decades.sort(reverse=False)
decades
['1960s', '1970s', '1980s', '1990s', '2000s', '2010s']
#running my new function
bar_plot([1,2,3,4,5,6], decade_revenue_means, decades,'Decades', 
         'Mean Adjusted Revenue in 2010 Dollars', 'Mean Adjusted Revenue by Decade')

[Figure: bar chart, Mean Adjusted Revenue by Decade]
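
Because the bar heights and the labels are passed to the function separately, they only line up if both lists happen to be in the same order. One possible variation, sketched here under the assumption that the heights always come from a grouped Series, reads the labels straight from the Series index. The name bar_plot_from_series and the toy values are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def bar_plot_from_series(means, xlabel, ylabel, title):
    """Hypothetical variant of bar_plot that reads its labels from the Series index."""
    labels = [str(label) for label in means.index]
    locations = list(range(1, len(means) + 1))
    fig, ax = plt.subplots()
    ax.bar(locations, means.to_numpy(), tick_label=labels)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return ax


# Invented group means; the index order defines the bar order.
toy_means = pd.Series([3.0, 1.0, 2.0], index=["short", "medium", "long"])
ax = bar_plot_from_series(toy_means, "Film Runtime", "Mean value", "Toy example")

heights = [patch.get_height() for patch in ax.patches]
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(heights, tick_labels)
```

The design choice is simply to keep each label attached to its value, so that sorting one cannot silently mislabel the other.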


2. Are high revenue (adj) films more popular than low revenue films?

How popular are the high revenue films? Are they more or less popular than low revenue films?

#Information about the 'popularity' variable
df_movies['popularity'].describe()
count    10865.000000
mean         0.646446
std          1.000231
min          0.000065
25%          0.207575
50%          0.383831
75%          0.713857
max         32.985763
Name: popularity, dtype: float64
df_movies['popularity'].plot(kind='hist', title= "Popularity");

[Figure: histogram, Popularity]

#a closer look at the minimum value to better understand the variable
df_movies[df_movies['popularity']==df_movies['popularity'].min()]
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj release_decade popularity_level
6181 0.000065 0 0 North and South, Book I Patrick Swayze|Philip Casnoff|Kirstie Alley|Ge... NaN NaN 561.0 Drama|History|Western NaN 11/3/1985 17 6.0 1985 0.0 1.151009e+08 1980s NaN
#Getting more information about what the 'popularity' variable means
df_movies.groupby('revenue_adj')['popularity'].mean()
revenue_adj
2.370705e+00     0.462609
2.861934e+00     0.552091
3.038360e+00     0.352054
5.926763e+00     0.208637
6.951084e+00     0.578849
                  ...    
1.907006e+09     2.563191
2.167325e+09     2.010733
2.506406e+09     4.355219
2.789712e+09    12.037933
2.827124e+09     9.432768
Name: popularity, Length: 4840, dtype: float64

In this case, I’m not sure that the variable ‘popularity’ is really understandable in its current format. I think I need to adjust the values into more meaningful phrases.

#changing the values into string labels
#the edges are the min, 25%, 50% and max from describe(), so 'High' covers the top half of films
bin_edges=[0.000065, 0.207575, 0.383831, 32.985763]
bin_names=['Low', 'Moderate', 'High']
df_movies['popularity_level']=pd.cut(df_movies['popularity'], bin_edges, labels=bin_names)
df_movies.head(1) #testing what it looks like
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj release_decade popularity_level
0 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124.0 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/2015 5562 6.5 2015 137999939.3 1.392446e+09 2010s High

I think my new column ‘popularity level’ better conveys the meaning from the ‘popularity’ column.
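
The bin edges for the popularity levels were copied by hand from the describe() output. As a sketch of an alternative, pd.qcut can compute quantile-based edges directly and always includes the lowest value. The popularity scores below are invented for illustration:

```python
import pandas as pd

# Hypothetical popularity scores.
popularity = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

# Same split as above: bottom quarter, second quarter, top half.
levels = pd.qcut(popularity, q=[0, 0.25, 0.5, 1.0],
                 labels=["Low", "Moderate", "High"])

print(levels.value_counts())
```

Because the edges are derived from the data, they stay correct if rows are added or removed before the binning step.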

#build a visual to compare popularity and mean adjusted revenue
popularity_revenue_means = df_movies.groupby('popularity_level')['revenue_adj'].mean()
popularity_revenue_means
popularity_level
Low         1.013358e+08
Moderate    9.190113e+07
High        1.335833e+08
Name: revenue_adj, dtype: float64
#creating a list of labels
popularity_level = ['Low', 'Moderate', 'High']
popularity_level
['Low', 'Moderate', 'High']
#running my new function
bar_plot([1,2,3], popularity_revenue_means, popularity_level,'Popularity Level', 'Mean Adjusted Revenue in 2010 Dollars', 
         'Popularity of Films and Mean Adjusted Revenue')

[Figure: bar chart, Popularity of Films and Mean Adjusted Revenue]

Thought: Just like in my first question, I wanted to create a bar graph to answer my question. It’s an easy way to see the difference in popularity levels and film revenue, regardless of the number of films produced. It’s not particularly surprising that highly popular films also bring in the most revenue. It’s interesting, however, that low popularity films make more revenue than moderately popular films.


3. Is there a positive relationship between runtime and revenue (adj)?

Do long movies make more money?

#Descriptive stats on the 'runtime' variable
df_movies['runtime'].describe()
count    10865.000000
mean       102.363855
std         30.904039
min          2.000000
25%         90.000000
50%         99.000000
75%        111.000000
max        900.000000
Name: runtime, dtype: float64

Just like ‘popularity’, I think looking at the length of a film in terms of minutes might not be meaningful to people reading the information. For example, it’s easier to understand what 1 hour 40 minutes means in terms of movie length compared to 100 minutes.
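
Turning minutes into an hours-and-minutes string is a small calculation. A minimal sketch, where format_runtime is a hypothetical helper and not part of the analysis:

```python
import pandas as pd


def format_runtime(minutes):
    """Hypothetical helper: turn a runtime in minutes into an 'H h MM min' string."""
    hours, mins = divmod(int(round(minutes)), 60)
    return f"{hours} h {mins:02d} min"


# A few example runtimes in minutes.
runtimes = pd.Series([90, 100, 111])
readable = runtimes.map(format_runtime)

print(list(readable))
```

The rounding step is there because runtimes filled with a mean value are no longer whole numbers.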

#quick review of my variable
df_movies['runtime'].plot(kind='hist', title="Movie Runtimes");

[Figure: histogram, Movie Runtimes]

#Here I create a new measure for the movie runtimes
bin_edges=[0,90,99,111,900]
bin_names=['short', 'medium', 'moderately long', 'long']
df_movies['movie_length']=pd.cut(df_movies['runtime'], bin_edges, labels=bin_names)
df_movies.head(1) #testing what it looks like
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj release_decade popularity_level movie_length
0 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124.0 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/2015 5562 6.5 2015 137999939.3 1.392446e+09 2010s High long
#How many of each movie length type
df_movies['movie_length'].value_counts()
short              2931
long               2710
medium             2651
moderately long    2573
Name: movie_length, dtype: int64

The films fall fairly evenly into each length category.

#As percentages to confirm my impression
(df_movies['movie_length'].value_counts()/df_movies['movie_length'].value_counts().sum())*100
short              26.976530
long               24.942476
medium             24.399448
moderately long    23.681546
Name: movie_length, dtype: float64

Is there a relationship between movie length and revenue?

#Calculating means for each length type
runtime_revenue_means = df_movies.groupby('movie_length')['revenue_adj'].mean()
runtime_revenue_means
movie_length
short              1.035472e+08
medium             9.980478e+07
moderately long    1.009606e+08
long               1.559853e+08
Name: revenue_adj, dtype: float64
#creating a list of labels in the same order as the grouped means
movie_lengths = list(runtime_revenue_means.index)
movie_lengths
['short', 'medium', 'moderately long', 'long']
#running my function
bar_plot([1,2,3,4], runtime_revenue_means, movie_lengths,'Film Runtime', 
         'Mean Adjusted Revenue in 2010 Dollars', 'Mean Adjusted Revenue by Film Runtime')

[Figure: bar chart, Mean Adjusted Revenue by Film Runtime]

Thought: Again, using this kind of visualization is a great way to see where differences exist. A pie chart might also be an interesting visual for this output, particularly since there is one value much greater than the rest. I am a little surprised by the result. There is clearly a jump in revenue for “long” films. “Long” films are longer than 1 hour and 51 minutes.

4. Are high revenue (adj) films highly rated?

Just because a movie makes a lot of money - does it mean that it’s also highly rated?

#Descriptive stats for the variable
df_movies['vote_average'].describe()
count    10865.000000
mean         5.975012
std          0.935138
min          1.500000
25%          5.400000
50%          6.000000
75%          6.600000
max          9.200000
Name: vote_average, dtype: float64

After looking at the descriptive stats for ‘vote average’, these values are a little unclear at first glance. Some descriptive language might give the values more meaning for readers.

#more context for my analysis
df_movies['vote_average'].plot(kind='hist', title="Vote Averages");

[Figure: histogram, Vote Averages]

#again, creating descriptive values by defining bin edges and cutting the data into my new measures.
bin_edges=[1.5, 5.4,6.0,6.6,9.2]
bin_names=['low', 'medium', 'moderately high', 'high']
df_movies['voting_score']=pd.cut(df_movies['vote_average'], bin_edges, labels=bin_names)
df_movies.head(1)
popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj release_decade popularity_level movie_length voting_score
0 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124.0 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/2015 5562 6.5 2015 137999939.3 1.392446e+09 2010s High long moderately high
#creating the mean popularity for each voting score
vote_values_means = df_movies.groupby('voting_score')['popularity'].mean()
vote_values_means
voting_score
low                0.420572
medium             0.555395
moderately high    0.705920
high               0.930673
Name: popularity, dtype: float64
#creating a list of labels
scores = ['low', 'medium', 'moderately high', 'high']
scores
['low', 'medium', 'moderately high', 'high']
#running my function
bar_plot([1,2,3,4], vote_values_means, scores,'Voting Scores', 
         'Popularity', 'Relationship Between Voting Scores and Popularity')

[Figure: bar chart, Relationship Between Voting Scores and Popularity]

Thought: A detailed bar chart shows the relationship between voting scores (ratings) and popularity. The trend is clear: mean popularity rises with each voting score category, so highly rated films tend to be popular films. Does the same trend hold with another kind of visual?

#Scatter
popularity_plot = df_movies.plot(x='vote_average', y='popularity',
                                 title="Vote Average and Popularity", kind='scatter');
popularity_plot.set_xlabel('Vote Average')
popularity_plot.set_ylabel('Popularity')
Text(0, 0.5, 'Popularity')

[Figure: scatter plot, Vote Average and Popularity]

The trend holds within a scatter plot; while this kind of chart shows the outliers, the data clearly shows a positive relationship between vote average and popularity.
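
The bar chart and scatter plot above relate vote average to popularity. To compare ratings with revenue directly, the same group-and-average pattern could be pointed at revenue_adj instead. A minimal sketch on invented values (the column names follow the dataset; the numbers do not, and include_lowest=True is added so the minimum score is kept):

```python
import numpy as np
import pandas as pd

# Hypothetical films: vote averages with some unreported revenues.
toy = pd.DataFrame({
    "vote_average": [4.0, 5.0, 5.8, 6.2, 7.0, 8.0],
    "revenue_adj": [2e6, np.nan, 8e6, 3e7, np.nan, 2e8],
})

edges = [1.5, 5.4, 6.0, 6.6, 9.2]
names = ["low", "medium", "moderately high", "high"]
toy["voting_score"] = pd.cut(toy["vote_average"], edges, labels=names,
                             include_lowest=True)

# Mean adjusted revenue per voting score; NaN revenues are skipped.
revenue_by_score = toy.groupby("voting_score", observed=False)["revenue_adj"].mean()

print(revenue_by_score)
```

On the real data, the resulting Series could be passed to the bar chart function in place of the popularity means.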


Conclusions

1. Has film revenue changed over time?

Yes, absolutely. I found it particularly interesting that the number of films being made has increased, but the mean adjusted revenue is down. Movies released in the 1960s and 1970s were fewer in number but earned more on average at the box office. Why?

  • Fewer TVs? Less at-home entertainment could explain why revenue was higher in the 1960s and 1970s.
  • The rise of the VHS in the 1980s and the prevalence of at-home movie consumption in the 1990s and 2000s could explain the drop in revenue despite the increased number of film releases.

Limitation: I am making an educated guess as to why films brought in higher revenue in the 1960s and 1970s. An additional dataset with information about at-home entertainment systems would make for an interesting evaluation. Is there a correlation?

Limitation: Were people going to the movies more often back in the 1960s and 1970s? This dataset may record “popularity” but it doesn’t necessarily indicate how many people attended these movies.

2. Are high revenue films more popular than low revenue films?

As expected, yes, high revenue films are popular films. It is an easy assumption to make that, if people like a film, they will invest in it more, either by seeing it in the theatre multiple times and/or purchasing the film.

Limitation: high revenue films are popular, but are popular films high earners? It would be a good idea to confirm this. Does each variable predict the other?

Limitation: the variable “popularity” is a somewhat vague variable and I could not find much information online about how TMDb defines popularity. Is it the number of people who see the film? The number of times it plays? How many copies of the film people buy? How long the film is in theatres? Or is it another measure? While the relationship between popularity and revenue doesn’t change, the meaning of the results could change if the variable is more clearly defined.
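
One way to follow up on the first limitation is to look at the relationship from both sides: a rank correlation between the two variables, and the mean popularity within revenue groups. A minimal sketch on invented numbers; the Spearman correlation is computed here as the Pearson correlation of the ranks:

```python
import pandas as pd

# Hypothetical films; values are invented for illustration.
toy = pd.DataFrame({
    "popularity": [0.2, 0.5, 0.9, 1.5, 3.0, 8.0],
    "revenue_adj": [1e6, 5e6, 2e6, 4e7, 9e7, 6e8],
})

# Rank correlation: symmetric, so it does not depend on which variable comes first.
rank_corr = toy["popularity"].rank().corr(toy["revenue_adj"].rank())

# Reverse view: split films at the median revenue and average their popularity.
toy["revenue_level"] = pd.qcut(toy["revenue_adj"], q=2,
                               labels=["Lower half", "Upper half"])
popularity_by_revenue = toy.groupby("revenue_level", observed=False)["popularity"].mean()

print(round(rank_corr, 3))
print(popularity_by_revenue)
```

A correlation describes association only; it does not say which variable drives the other.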

3. Is there a positive relationship between runtime and revenue?

Partly. Mean adjusted revenue is roughly the same for the “short”, “medium”, and “moderately long” categories and then jumps for the “long” category, as the bar graph clearly shows. The relationship looks less like a steady rise and more like a step up for the longest films.

  • I was surprised by this result at first, as I wasn’t sure there would be a well-defined relationship at all.
  • Even if there was a relationship, I did not expect long films to necessarily be popular.

Limitations: Time is relative and subjective. Using runtime as a variable may be useful in determining the minimum and maximum film runtime to get the most return on investment, but people’s definitions of a “long” or “short” film may vary.

Limitations: After looking at the data more closely, 111 minutes (1 hour 51 minutes) isn’t really that long (to me). As I mentioned above, movie length can be subjective and others may think a 2 hour film is long. If I did this analysis again, I might categorize the lengths differently or convert the runtimes from minutes to hours.

4. Are high revenue films highly rated?

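Indirectly, yes. The charts above compare vote average with popularity rather than with adjusted revenue: highly rated films are more popular, and the most popular films have the highest mean adjusted revenue. I was a little surprised at these findings; just because a movie brings in a lot of money doesn’t automatically make it a “good” film. Then again, this is a subjective measure.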

Limitations: Just like with the “popularity” variable, TMDb doesn’t define “average rating” in this dataset. I made the assumption that it was a film rating.

Final Thoughts

This dataset was very interesting to work with, and there are many more ways to measure the characteristics of a high revenue film than the ones I used. If I were to do this project over, I would reconsider some of the variables I chose to evaluate my questions. The dataset could be improved with more information about some of the variables, such as popularity and average ratings. While the lack of a definition doesn’t affect the calculations, it can affect the story I tell with the data.