Top 50 All Time Grossing Movies
A Quick Scrape of Top Grossing Films from boxofficemojo.com « Consistently Infrequent
Introduction
I was looking at a list of the top grossing films of all time (available from boxofficemojo.com) and wondered what kind of graphs I could come up with from that data. I still don't know what I'd construct beyond a simple barplot, but I figured I'd at least get the basics done and revisit this in the future if I feel motivated enough.
Objective
Scrape the information available on boxofficemojo.com into R and make a simple barplot.
Solution
This is probably one of the easier scraping challenges. The function readHTMLTable() from the XML package does all the hard work. We just take the URL of the page we're interested in and feed it into the function, which pulls out every table on the webpage as a list of data.frames. We then choose which data.frame we want. Here's a single wrapper function:
box_office_mojo_top <- function(num.pages) {

  # load required packages
  require(XML)

  # local helper functions
  get_table <- function(u) {
    table <- readHTMLTable(u)[[3]]
    names(table) <- c("Rank", "Title", "Studio", "Worldwide.Gross",
                      "Domestic.Gross", "Domestic.pct",
                      "Overseas.Gross", "Overseas.pct", "Year")
    df <- as.data.frame(lapply(table[-1, ], as.character), stringsAsFactors=FALSE)
    df <- as.data.frame(df, stringsAsFactors=FALSE)
    return(df)
  }

  clean_df <- function(df) {
    clean <- function(col) {
      col <- gsub("$", "", col, fixed = TRUE)
      col <- gsub("%", "", col, fixed = TRUE)
      col <- gsub(",", "", col, fixed = TRUE)
      col <- gsub("^", "", col, fixed = TRUE)
      return(col)
    }
    df <- sapply(df, clean)
    df <- as.data.frame(df, stringsAsFactors=FALSE)
    return(df)
  }

  # Main
  # Step 1: construct URLs
  urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:num.pages, "&p=.htm", sep = "")

  # Step 2: scrape website
  df <- do.call("rbind", lapply(urls, get_table))

  # Step 3: clean dataframe
  df <- clean_df(df)

  # Step 4: set column types
  s <- c(1, 4:9)
  df[, s] <- sapply(df[, s], as.numeric)
  df$Studio <- as.factor(df$Studio)

  # Step 5: return dataframe
  return(df)
}
Which we use as follows:
num.pages <- 5
df <- box_office_mojo_top(num.pages)

head(df)
#   Rank                                         Title Studio Worldwide.Gross Domestic.Gross Domestic.pct Overseas.Gross Overseas.pct Year
# 1    1                                        Avatar    Fox          2782.3          760.5         27.3         2021.8         72.7 2009
# 2    2                                       Titanic   Par.          1843.2          600.8         32.6         1242.4         67.4 1997
# 3    3   Harry Potter and the Deathly Hallows Part 2     WB          1328.1          381.0         28.7          947.1         71.3 2011
# 4    4                Transformers: Dark of the Moon   P/DW          1123.7          352.4         31.4          771.4         68.6 2011
# 5    5 The Lord of the Rings: The Return of the King     NL          1119.9          377.8         33.7          742.1         66.3 2003
# 6    6    Pirates of the Caribbean: Dead Man's Chest     BV          1066.2          423.3         39.7          642.9         60.3 2006

str(df)
# 'data.frame': 475 obs. of 9 variables:
#  $ Rank           : num 1 2 3 4 5 6 7 8 9 10 ...
#  $ Title          : chr "Avatar" "Titanic" "Harry Potter and the Deathly Hallows Part 2" "Transformers: Dark of the Moon" ...
#  $ Studio         : Factor w/ 35 levels "Art.","BV","Col.",..: 7 20 33 19 16 2 2 2 2 33 ...
#  $ Worldwide.Gross: num 2782 1843 1328 1124 1120 ...
#  $ Domestic.Gross : num 760 601 381 352 378 ...
#  $ Domestic.pct   : num 27.3 32.6 28.7 31.4 33.7 39.7 39 23.1 32.6 53.2 ...
#  $ Overseas.Gross : num 2022 1242 947 771 742 ...
#  $ Overseas.pct   : num 72.7 67.4 71.3 68.6 66.3 60.3 61 76.9 67.4 46.8 ...
#  $ Year           : num 2009 1997 2011 2011 2003 ...
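The cleaning and type-conversion steps inside the wrapper can be tried on their own, without hitting the website. The following is only a minimal sketch in base R: the `raw` data frame is a hypothetical stand-in for what the scraped table might look like before cleaning (the "$", ",", "%" and "^" decorations are assumed from the symbols the wrapper strips), and `strip_symbols` and `numeric_cols` are illustrative names that do not appear in the original function.

```r
# Hypothetical raw rows, shaped like the scraped table before cleaning.
# The exact decorations on the live page are an assumption.
raw <- data.frame(
  Rank            = c("1", "2"),
  Title           = c("Avatar", "Titanic"),
  Worldwide.Gross = c("$2,782.3", "$1,843.2"),
  Domestic.pct    = c("27.3%", "32.6%"),
  Year            = c("2009^", "1997^"),
  stringsAsFactors = FALSE
)

strip_symbols <- function(col) {
  # fixed = TRUE matches each symbol literally; as regular expressions,
  # "$" and "^" would be anchors and would not remove the characters.
  for (sym in c("$", "%", ",", "^")) {
    col <- gsub(sym, "", col, fixed = TRUE)
  }
  col
}

# lapply() keeps each column as a separate vector, so the result
# converts back to a data.frame without passing through a matrix.
cleaned <- as.data.frame(lapply(raw, strip_symbols), stringsAsFactors = FALSE)

# Convert the columns that should be numeric; Title stays character.
numeric_cols <- c("Rank", "Worldwide.Gross", "Domestic.pct", "Year")
cleaned[numeric_cols] <- lapply(cleaned[numeric_cols], as.numeric)

str(cleaned)
```

Note that, as in the wrapper, the symbols are stripped from every column, so a comma inside a film title would be removed as well.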
We can even do a simple barplot of the top 50 films by worldwide gross (in millions of USD):
library(ggplot2)

df2 <- subset(df, Rank <= 50)

ggplot(df2, aes(reorder(Title, Worldwide.Gross), Worldwide.Gross)) +
  geom_col() +  # bar lengths are taken directly from Worldwide.Gross
  theme(axis.text.x = element_text(angle = 0),
        axis.text.y = element_text(angle = 0)) +
  coord_flip() +
  ylab("Worldwide Gross (USD $ millions)") +
  xlab("Title") +
  ggtitle("TOP 50 FILMS BY WORLDWIDE GROSS")
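The ordering of the bars comes from `reorder(Title, Worldwide.Gross)`, which turns the titles into a factor whose levels are sorted by gross. A small sketch of that behaviour in base R, using three rows from the output shown earlier (the vector names `titles`, `gross` and `ordered_titles` are just for illustration):

```r
# Three films from the scraped data, deliberately out of order.
titles <- c("Titanic", "Avatar", "Harry Potter and the Deathly Hallows Part 2")
gross  <- c(1843.2, 2782.3, 1328.1)

# reorder() sorts the factor levels by increasing gross
# (by default it uses the mean of the values per level).
ordered_titles <- reorder(factor(titles), gross)

levels(ordered_titles)
```

The levels run from the lowest gross to the highest, and since the first level sits at the bottom of the axis after `coord_flip()`, the highest-grossing film ends up at the top of the plot.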