2017 - Predicting Oscar winners & box office hits

Predicting box office hits & Oscar winners using things you found on the Internet
Deborah Hanus @deborahhanus

Background

Chart (credit: Gil Press, Forbes): 82% data wrangling, 13% EDA & ML, 5% other

Chart: the 82% splits into 62% data wrangling and 20% data understanding; 13% EDA & ML; 5% other

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

What factors drive movie revenue? Image: unclaimedmoney.com

Define a question you can answer
Not so good (vague): Will my movie be a box office hit?
Good (likelihood): What is the likelihood that my movie will be a box office hit given that it has X features?
Good (correlation): What attributes of a movie are correlated with box office success?

What is good data?
• Relevant
• Structured
• (Relatively) complete

Where to find good data?

How to get good data
• Use an API
• Write a web scraper
• Get all the text
• Make the text queryable

Writing a web scraper
• Make an HTTP request to get the HTML
Requests example

Writing a web scraper
http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=%2017.htm
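The HTTP request step can be sketched as follows. This is a minimal sketch, not the speaker's code: it assumes the query parameters visible in the URL above (`page`, `view`, `view2`, `yr`), assumes `yr` takes a plain four-digit year, and uses the hypothetical helper names `build_chart_url` and `fetch_html`. The site may have changed since 2017, so the URL may no longer resolve.

```python
from urllib.parse import urlencode

# Base path copied from the slide's URL; the site may have changed since 2017.
BASE_URL = "http://www.boxofficemojo.com/yearly/chart/"


def build_chart_url(year, page=1):
    """Build a yearly chart URL from the query parameters seen on the slide."""
    params = {
        "page": page,
        "view": "releasedate",
        "view2": "domestic",
        "yr": year,  # assumption: a plain four-digit year
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_html(url, timeout=10):
    """Make an HTTP GET request and return the page's HTML as text."""
    # Third-party dependency; imported here so build_chart_url works without it.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text


if __name__ == "__main__":
    print(build_chart_url(2017))
```

Keeping URL construction separate from the network call makes it possible to check the URLs a scraper will hit before sending any requests.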

Writing a web scraper
• Make HTML queryable using BeautifulSoup & PyQuery
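A minimal sketch of making HTML queryable with BeautifulSoup is shown below. The HTML snippet, the `chart` table id, the movie titles, and the `parse_chart` helper are all made up for illustration; a real page's structure has to be found with View Source first.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded chart page.
SAMPLE_HTML = """
<table id="chart">
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td><a href="/movies/?id=a.htm">Movie A</a></td><td>$1,000</td></tr>
  <tr><td><a href="/movies/?id=b.htm">Movie B</a></td><td>$250</td></tr>
</table>
"""


def parse_chart(html):
    """Turn the table rows into a list of {'title', 'gross'} dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table#chart tr"):
        cells = tr.find_all("td")
        if len(cells) != 2:
            continue  # the header row has <th> cells, so it is skipped
        title = cells[0].get_text(strip=True)
        # Strip the currency formatting so the value can be used numerically.
        gross = int(cells[1].get_text(strip=True).replace("$", "").replace(",", ""))
        rows.append({"title": title, "gross": gross})
    return rows


if __name__ == "__main__":
    for row in parse_chart(SAMPLE_HTML):
        print(row)
```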

Right click: View Source

Common problems
• Rate limiting
• API keys
• Selenium
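One common way to cope with rate limiting is to pause between requests and back off after a failure. The sketch below is an assumption about how that might look, not something shown in the talk; `fetch_all`, its parameters, and the injected `fetch` and `sleep` callables are hypothetical names.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and retrying failures.

    `fetch` is any callable that takes a URL and returns its content,
    for example a thin wrapper around requests.get.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except Exception:  # in real code, catch only the errors you expect
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                # Exponential backoff: 1x, 2x, 4x ... the base delay.
                sleep(delay_seconds * 2 ** attempt)
        sleep(delay_seconds)  # polite pause before the next URL
    return results
```

Passing `sleep` in as a parameter keeps the function testable without actually waiting.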

Exploratory Data Analysis
Factors we can explore:
• Movie budget
• IMDb rating
• Power studios
• Opening weekend
• How many opening theaters
• Seasonality
• MPAA rating

Exploratory Data Analysis
Gross revenue vs. # opening theaters (plot annotation: ~3500)

Exploratory Data Analysis
Gross revenue vs. quality rating: no relationship

Exploratory Data Analysis
Gross revenue vs. opening gross: predictive

Exploratory Data Analysis
Multivariate regression: predictive
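A multivariate regression of gross revenue on several factors can be sketched with ordinary least squares. The numbers below are made up (noise-free, in arbitrary units) so the fit recovers the coefficients exactly; NumPy is used here as a stand-in, and the talk's own fitting code is not shown in the slides. `fit_linear_model` is a hypothetical helper name.

```python
import numpy as np


def fit_linear_model(features, target):
    """Ordinary least squares; returns [intercept, coefficient per feature]."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    # Prepend a column of ones so the model learns an intercept.
    design = np.column_stack([np.ones(len(features)), features])
    coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefficients


# Made-up data: columns are (budget, opening weekend gross).
features = [[10, 5], [20, 3], [30, 12], [40, 8], [50, 20]]
# Synthetic target built as 5 + 0.5 * budget + 3 * opening gross.
target = [5 + 0.5 * budget + 3 * opening for budget, opening in features]

if __name__ == "__main__":
    intercept, budget_coef, opening_coef = fit_linear_model(features, target)
    print(intercept, budget_coef, opening_coef)
```

On real data the coefficients come with noise, so a library such as Statsmodels, which also reports standard errors, is a better fit for drawing conclusions.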

What did we find?
• Budget helps (but only a little).
• Timing is important. December is a great release date.
• PG- & G-rated movies make more.
• Money made in opening weekend is important.

What does it take to win an Oscar?

What makes an Oscar winner? Image: superawesomevectors.com

Define a question you can answer
Not so good (vague): Will this movie win an Oscar?
Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?

Acquiring Data IMDBpy Drama

Define a question you can answer
Not so good (vague): Will this movie win an Oscar?
Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
Best (conditional correlation): Given that a movie has been nominated for an Oscar, what attributes are correlated with winning?

Exploratory Data Analysis
Factors we can explore:
• Movie nomination category
• Thematic content (e.g. family, violence, war, father-son relationship, smoking)
• Movie genre
• Where the movie was made
• When the movie debuted

Exploratory Data Analysis Countries associated with winning Oscar movies

Exploratory Data Analysis Number of winning movies per month

Exploratory Data Analysis Ratio of winning films to nominated films by month
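Raw counts of winners per month can mislead if more nominated films debut in some months than others, which is why the ratio of winners to nominees is worth computing. A minimal sketch, using made-up (month, won) pairs and a hypothetical `win_ratio_by_month` helper:

```python
from collections import Counter


def win_ratio_by_month(nominees):
    """nominees: iterable of (release_month, won) pairs for nominated films."""
    nominated = Counter()
    won = Counter()
    for month, is_winner in nominees:
        nominated[month] += 1
        if is_winner:
            won[month] += 1
    # Share of nominated films released in each month that went on to win.
    return {month: won[month] / nominated[month] for month in sorted(nominated)}


# Made-up sample: (release month, whether the nominated film won).
sample = [(1, False), (1, False), (6, True), (12, True), (12, False), (12, True)]

if __name__ == "__main__":
    print(win_ratio_by_month(sample))
```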

What do we want from a model?
• Binary output - winner/non-winner
• More accurate than baseline

Potential models
• Logistic regression (Ridge, Lasso, ElasticNet)
• Support vector machine (SVM)
• Ensemble methods

Establish Baselines
• Predict all winners: accuracy = 29%
• Predict all losers: accuracy = 71%
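The two baselines are what a model scores if it ignores the features and always predicts the same class. A minimal sketch, assuming a toy label list of 100 nominees with the 29/71 split quoted above (the list size is made up; `baseline_accuracy` is a hypothetical helper):

```python
def baseline_accuracy(labels, constant_prediction):
    """Accuracy of a 'model' that predicts the same class for every film."""
    correct = sum(1 for label in labels if label == constant_prediction)
    return correct / len(labels)


# Toy labels with the class balance from the slide: 1 = winner, 0 = loser.
labels = [1] * 29 + [0] * 71

if __name__ == "__main__":
    print("all winners:", baseline_accuracy(labels, 1))
    print("all losers: ", baseline_accuracy(labels, 0))
```

Any real model has to beat the better of the two, here the 71% obtained by predicting that every nominee loses.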

To select a model, think about what it gets wrong.

Selecting a model Confusion Matrix

Selecting a model
• Accuracy = (TP+TN)/(TP+TN+FN+FP)
• Recall = TP/(TP+FN)
• Precision = TP/(TP+FP)
• F1 = 2*(Precision*Recall)/(Precision+Recall)
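These metrics all follow from the four confusion matrix counts. The sketch below uses made-up labels and hypothetical helper names; F1 is computed as the harmonic mean of precision and recall, 2PR/(P+R).

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = winner)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 from the confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never seen or predicted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Made-up example: 3 winners, 5 losers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print(classification_metrics(*confusion_counts(y_true, y_pred)))
```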

Selecting a model Receiver-Operating Characteristic (ROC)
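An ROC curve plots the true positive rate against the false positive rate as the decision threshold is lowered. The following is a minimal pure-Python sketch with made-up scores and hypothetical helper names; in practice a library such as scikit-learn would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    # Lower the threshold one distinct score at a time, strictest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and model scores.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

if __name__ == "__main__":
    curve = roc_points(y_true, scores)
    print(curve, area_under_curve(curve))
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of winners above losers.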

What did we find?
• Nominated films made in Italy & Spain have a good chance at winning.
• Films are more likely to win if they are released later in the year.
• Tone down the gore (unless it is a war film).
• If a film is nominated for best picture, its odds of winning are good.
• If a film is nominated for best cinematography, its odds are less good.

What did our model get wrong?

Analyze errors
• What did your model misclassify?
• Are any of those errors systematic?
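Both questions can be approached by listing the misclassified films and then comparing error rates across groups. This is a minimal sketch with made-up records; the field names (`title`, `category`, `won`) and the helpers are hypothetical.

```python
from collections import Counter


def misclassified(records, predictions):
    """Records whose true label differs from the model's prediction."""
    return [r for r, p in zip(records, predictions) if r["won"] != p]


def error_rate_by(records, predictions, key):
    """Error rate within each group of `key`; uneven rates hint at systematic errors."""
    totals, errors = Counter(), Counter()
    for record, prediction in zip(records, predictions):
        totals[record[key]] += 1
        if record["won"] != prediction:
            errors[record[key]] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Made-up nominees and predictions (1 = winner, 0 = loser).
records = [
    {"title": "Film A", "category": "Best Picture", "won": 1},
    {"title": "Film B", "category": "Best Picture", "won": 0},
    {"title": "Film C", "category": "Cinematography", "won": 1},
    {"title": "Film D", "category": "Cinematography", "won": 0},
]
predictions = [1, 0, 0, 1]

if __name__ == "__main__":
    print([r["title"] for r in misclassified(records, predictions)])
    print(error_rate_by(records, predictions, "category"))
```

If one group's error rate is far above the rest, as with the Cinematography rows in this toy data, the errors are probably not random.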

Image: NYT Coded Gaze: Joy Buolamwini

Labeling sensitive content

Always analyze your errors

What did we do?
• Defined an answerable question
• Built a web scraper
• Explored the data
• Fit a model to the data
• Analyzed our errors

Where to go from here?
Building a scraper
• Requests - HTTP for Humans
• BeautifulSoup or PyQuery
Analyzing data
• Jupyter
• scikit-learn
• Statsmodels
Example projects
• http://oscarpredictor.github.io
Deborah Hanus @deborahhanus
