Monday, February 6, 2012

Top 50 All Time Grossing Movies


A Quick Scrape of Top Grossing Films from boxofficemojo.com « Consistently Infrequent

 

Introduction

I was looking at a list of the top grossing films of all time (available from boxofficemojo.com) and was wondering what kind of graphs I would come up with if I had that kind of data. I still don't know what kind of graphs I'd construct other than a simple barplot but figured I'd at least get the basics done and then if I feel motivated enough I could revisit this in the future.

Objective

Scrape the information available on boxofficemojo.com using R and make a simple barplot.

Solution

This is probably one of the easier scraping challenges. The readHTMLTable() function from the XML package does all the hard work: we pass it the URL of the page we're interested in, and it returns every table on that page as a list of data.frames. We then pick the data.frame we want. Here's a single wrapper function:


box_office_mojo_top <- function(num.pages) {
  # load required packages
  require(XML)

  # local helper functions
  get_table <- function(u) {
    table <- readHTMLTable(u)[[3]]
    names(table) <- c("Rank", "Title", "Studio", "Worldwide.Gross", "Domestic.Gross",
                      "Domestic.pct", "Overseas.Gross", "Overseas.pct", "Year")
    df <- as.data.frame(lapply(table[-1, ], as.character), stringsAsFactors=FALSE)
    df <- as.data.frame(df, stringsAsFactors=FALSE)
    return(df)
  }
  clean_df <- function(df) {
    clean <- function(col) {
      col <- gsub("$", "", col, fixed = TRUE)
      col <- gsub("%", "", col, fixed = TRUE)
      col <- gsub(",", "", col, fixed = TRUE)
      col <- gsub("^", "", col, fixed = TRUE)
      return(col)
    }

    df <- sapply(df, clean)
    df <- as.data.frame(df, stringsAsFactors=FALSE)
    return(df)
  }

  # Main
  # Step 1: construct URLs
  urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:num.pages, "&p=.htm", sep = "")

  # Step 2: scrape website
  df <- do.call("rbind", lapply(urls, get_table))

  # Step 3: clean dataframe
  df <- clean_df(df)

  # Step 4: set column types
  s <- c(1, 4:9)
  df[, s] <- sapply(df[, s], as.numeric)
  df$Studio <- as.factor(df$Studio)

  # step 5: return dataframe
  return(df)
}
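The stacking and cleaning steps inside the wrapper can be tried offline, without touching the website. What follows is only a minimal sketch: the data frames page1 and page2, the helper strip_symbols and the raw cell strings (such as "$2,782.3", "27.3%" and "1997^") are hypothetical stand-ins I made up for illustration, not output fetched from boxofficemojo.com.

```r
# Hypothetical stand-ins for two scraped pages; the raw string formats
# ("$", ",", "%", "^") are assumptions, not fetched from the site.
page1 <- data.frame(Rank = c("1", "2"),
                    Title = c("Avatar", "Titanic"),
                    Worldwide.Gross = c("$2,782.3", "$1,843.2"),
                    Domestic.pct = c("27.3%", "32.6%"),
                    Year = c("2009", "1997^"),
                    stringsAsFactors = FALSE)
page2 <- data.frame(Rank = "3",
                    Title = "Harry Potter and the Deathly Hallows Part 2",
                    Worldwide.Gross = "$1,328.1",
                    Domestic.pct = "28.7%",
                    Year = "2011",
                    stringsAsFactors = FALSE)

# Stack the per-page data frames into one, as do.call("rbind", ...) does
# in the wrapper function.
df_demo <- do.call("rbind", list(page1, page2))

# Remove "$", "%", "," and "^" as literal characters.
strip_symbols <- function(col) {
  for (sym in c("$", "%", ",", "^")) {
    col <- gsub(sym, "", col, fixed = TRUE)
  }
  col
}

# Clean the columns that should be numeric, then convert them.
num_cols <- c("Rank", "Worldwide.Gross", "Domestic.pct", "Year")
df_demo[num_cols] <- lapply(df_demo[num_cols],
                            function(col) as.numeric(strip_symbols(col)))

str(df_demo)
```

The fixed = TRUE argument matters here because "$" and "^" are anchors in regular expressions; without it, gsub("$", "", col) would match the end of each string rather than the dollar sign and leave the text unchanged.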

Which we use as follows:


num.pages <- 5
df <- box_office_mojo_top(num.pages)

head(df)
#   Rank                                         Title Studio Worldwide.Gross Domestic.Gross Domestic.pct Overseas.Gross Overseas.pct Year
# 1    1                                        Avatar    Fox          2782.3          760.5         27.3         2021.8         72.7 2009
# 2    2                                       Titanic   Par.          1843.2          600.8         32.6         1242.4         67.4 1997
# 3    3   Harry Potter and the Deathly Hallows Part 2     WB          1328.1          381.0         28.7          947.1         71.3 2011
# 4    4                Transformers: Dark of the Moon   P/DW          1123.7          352.4         31.4          771.4         68.6 2011
# 5    5 The Lord of the Rings: The Return of the King     NL          1119.9          377.8         33.7          742.1         66.3 2003
# 6    6    Pirates of the Caribbean: Dead Man's Chest     BV          1066.2          423.3         39.7          642.9         60.3 2006

str(df)
# 'data.frame': 475 obs. of 9 variables:
#  $ Rank           : num 1 2 3 4 5 6 7 8 9 10 ...
#  $ Title          : chr "Avatar" "Titanic" "Harry Potter and the Deathly Hallows Part 2" "Transformers: Dark of the Moon" ...
#  $ Studio         : Factor w/ 35 levels "Art.","BV","Col.",..: 7 20 33 19 16 2 2 2 2 33 ...
#  $ Worldwide.Gross: num 2782 1843 1328 1124 1120 ...
#  $ Domestic.Gross : num 760 601 381 352 378 ...
#  $ Domestic.pct   : num 27.3 32.6 28.7 31.4 33.7 39.7 39 23.1 32.6 53.2 ...
#  $ Overseas.Gross : num 2022 1242 947 771 742 ...
#  $ Overseas.pct   : num 72.7 67.4 71.3 68.6 66.3 60.3 61 76.9 67.4 46.8 ...
#  $ Year           : num 2009 1997 2011 2011 2003 ...
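A quick way to check that the numeric conversion worked is to test whether the columns agree with each other: domestic plus overseas gross should roughly equal worldwide gross, and the two percentages should sum to about 100. The sketch below retypes the six rows printed by head(df) so it runs without scraping; the data frame name top6 and the tolerance of 0.15 (to absorb rounding in the published figures) are my own assumptions.

```r
# The six rows shown by head(df) above, retyped by hand so the check
# runs offline. top6 is a hypothetical name used only for this sketch.
top6 <- data.frame(
  Title = c("Avatar", "Titanic",
            "Harry Potter and the Deathly Hallows Part 2",
            "Transformers: Dark of the Moon",
            "The Lord of the Rings: The Return of the King",
            "Pirates of the Caribbean: Dead Man's Chest"),
  Worldwide.Gross = c(2782.3, 1843.2, 1328.1, 1123.7, 1119.9, 1066.2),
  Domestic.Gross  = c(760.5, 600.8, 381.0, 352.4, 377.8, 423.3),
  Overseas.Gross  = c(2021.8, 1242.4, 947.1, 771.4, 742.1, 642.9),
  Domestic.pct    = c(27.3, 32.6, 28.7, 31.4, 33.7, 39.7),
  Overseas.pct    = c(72.7, 67.4, 71.3, 68.6, 66.3, 60.3),
  stringsAsFactors = FALSE
)

# Assumed tolerance: the figures are rounded to one decimal place.
tol <- 0.15

# Absolute gap between the parts and the reported totals.
gross_gap <- abs(top6$Domestic.Gross + top6$Overseas.Gross - top6$Worldwide.Gross)
pct_gap   <- abs(top6$Domestic.pct + top6$Overseas.pct - 100)

gross_ok <- gross_gap <= tol
pct_ok   <- pct_gap <= tol

# Show any rows that fail either check (none are expected for these six).
print(top6[!(gross_ok & pct_ok), "Title"])
```

The same two comparisons could be run on the full df returned by box_office_mojo_top(); rows that fail would point to a parsing problem or to a gap in the source table.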

We can also make a simple barplot of the top 50 films by worldwide gross (in USD millions):

library(ggplot2)

# keep only the 50 highest-ranked films
df2 <- subset(df, Rank <= 50)

# horizontal bars, ordered by worldwide gross
ggplot(df2, aes(reorder(Title, Worldwide.Gross), Worldwide.Gross)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 0)) +
  theme(axis.text.y = element_text(angle = 0)) +
  coord_flip() +
  ylab("Worldwide Gross (USD $ millions)") +
  xlab("Title") +
  ggtitle("TOP 50 FILMS BY WORLDWIDE GROSS")
