Scraping IMDB with Python

Scraping is fun, whether you’re doing it for fun or for profit. I have already written a couple of scrapers for iTunes, Paintbottle (deleted at the request of the site admin), Cricinfo, Google’s Did you Mean? and more. Check them out on Github.

IMDB does not have an API for accessing information on movies or TV series, so I had to write a scraper to fetch their information on movies.

I did know about a couple of unofficial APIs (including omdb), but creating your own solution is always fun :)

If you don’t want to go into the technical details and are just looking to use it, it is hosted at http://getimdb.herokuapp.com.

The scraper is written in Python and uses lxml for parsing the webpages. I’m using XPath for selecting elements from the DOM.
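To give a feel for how lxml and XPath work together, here is a minimal sketch. The HTML snippet is made up for illustration and is not IMDB's real markup; only `lxml.html.document_fromstring` and `xpath()` are taken from the scraper itself.

```python
import lxml.html

# Made-up sample markup (an assumption for illustration, not IMDB's real page).
SAMPLE = """
<html><body>
<div id="overview-top">
  <h1><span>Furious 6</span> <span>(<a href="#">2013</a>)</span></h1>
</div>
</body></html>
"""

doc = lxml.html.document_fromstring(SAMPLE)

# xpath() always returns a list: the matching text nodes, or [] if nothing matches.
titles = doc.xpath('//*[@id="overview-top"]/h1/span[1]/text()')
years = doc.xpath('//*[@id="overview-top"]/h1/span[2]/a/text()')
missing = doc.xpath('//*[@id="overview-top"]/p[2]/text()')
```

Because a query that matches nothing gives back an empty list, taking `[0]` from it raises IndexError, which is why the scraper wraps those lookups in try/except.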

The dependencies are as follows, and can be installed using pip:

requests==1.2.3
lxml==3.2.1

The code:

#!/usr/bin/env python

import sys
import requests
import lxml.html

def main(id):
    hxs = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/" + id).content)
    movie = {}
    try:
        movie['title'] = hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')[0].strip()
    except IndexError:
        movie['title'] = ""
    try:
        movie['year'] = hxs.xpath('//*[@id="overview-top"]/h1/span[2]/a/text()')[0].strip()
    except IndexError:
        try:
            movie['year'] = hxs.xpath('//*[@id="overview-top"]/h1/span[3]/a/text()')[0].strip()
        except IndexError:
            movie['year'] = ""
    try:
        movie['certification'] = hxs.xpath('//*[@id="overview-top"]/div[2]/span[1]/@title')[0].strip()
    except IndexError:
        movie['certification'] = ""
    try:
        movie['running_time'] = hxs.xpath('//*[@id="overview-top"]/div[2]/time/text()')[0].strip()
    except IndexError:
        movie['running_time'] = ""
    try:
        movie['genre'] = hxs.xpath('//*[@id="overview-top"]/div[2]/a/span/text()')
    except IndexError:
        movie['genre'] = []
    try:
        movie['release_date'] = hxs.xpath('//*[@id="overview-top"]/div[2]/span[3]/a/text()')[0].strip()
    except IndexError:
        try:
            movie['release_date'] = hxs.xpath('//*[@id="overview-top"]/div[2]/span[4]/a/text()')[0].strip()
        except IndexError:
            movie['release_date'] = ""
    try:
        movie['rating'] = hxs.xpath('//*[@id="overview-top"]/div[3]/div[3]/strong/span/text()')[0]
    except IndexError:
        movie['rating'] = ""
    try:
        movie['metascore'] = hxs.xpath('//*[@id="overview-top"]/div[3]/div[3]/a[2]/text()')[0].strip().split('/')[0]
    except IndexError:
        movie['metascore'] = ""
    try:
        movie['description'] = hxs.xpath('//*[@id="overview-top"]/p[2]/text()')[0].strip()
    except IndexError:
        movie['description'] = ""
    try:
        movie['director'] = hxs.xpath('//*[@id="overview-top"]/div[4]/a/span/text()')[0].strip()
    except IndexError:
        movie['director'] = ""
    try:
        movie['stars'] = hxs.xpath('//*[@id="overview-top"]/div[6]/a/span/text()')
    except IndexError:
        movie['stars'] = []
    try:
        movie['poster'] = hxs.xpath('//*[@id="img_primary"]/div/a/img/@src')[0]
    except IndexError:
        movie['poster'] = ""
    try:
        movie['gallery'] = hxs.xpath('//*[@id="combined-photos"]/div/a/img/@src')
    except IndexError:
        movie['gallery'] = []
    try:
        movie['storyline'] = hxs.xpath('//*[@id="titleStoryLine"]/div[1]/p/text()')[0].strip()
    except IndexError:
        movie['storyline'] = ""
    try:
        movie['votes'] = hxs.xpath('//*[@id="overview-top"]/div[3]/div[3]/a[1]/span/text()')[0].strip()
    except IndexError:
        movie['votes'] = ""
    return movie

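if __name__ == "__main__":
    import json
    # Serialize the dict so the output is real JSON rather than a Python repr
    print(json.dumps(main(sys.argv[1]), indent=2, sort_keys=True))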
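The scraper repeats the same try/except pattern for every field, and a small helper could fold that into one place. This is only a sketch: `first_text` is a hypothetical name, not part of the original script, and the `fake_page` dictionary stands in for `hxs.xpath` so the example runs without a network request.

```python
def first_text(query, paths, default=""):
    """Return the first stripped match among several XPath candidates.

    `query` is any callable that maps an XPath string to a list of
    strings (for example `hxs.xpath`); `paths` are tried in order.
    """
    for path in paths:
        results = query(path)
        if results:
            return results[0].strip()
    return default


# Stand-in for hxs.xpath (an assumption for this demo): the first path
# matches nothing, the second one matches.
fake_page = {
    '//h1/span[2]/a/text()': [],
    '//h1/span[3]/a/text()': [' 2013 '],
}


def fake_query(path):
    return fake_page.get(path, [])


year = first_text(fake_query, ['//h1/span[2]/a/text()', '//h1/span[3]/a/text()'])
missing = first_text(fake_query, ['//p[9]/text()'])
```

With something like this, the year lookup with its span[2] and span[3] fallback would shrink to a single call that passes `hxs.xpath` and both paths.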

You can use it by passing any valid IMDB id as an argument:

$ python imdb.py tt1905041
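Since the id is pasted straight into the URL, it may be worth checking it first. A rough sketch, assuming title ids look like "tt" followed by seven or more digits (that pattern is my assumption from ids such as tt1905041, and `is_title_id` is a hypothetical helper, not part of the script):

```python
import re

# Assumption: a title id is "tt" followed by at least seven digits.
# \Z anchors at the very end, so a trailing newline is rejected too.
IMDB_ID = re.compile(r'^tt\d{7,}\Z')


def is_title_id(value):
    """Return True if value looks like an IMDB title id."""
    return bool(IMDB_ID.match(value))


ok = is_title_id("tt1905041")
no_prefix = is_title_id("1905041")
path_tricks = is_title_id("tt1905041/../")
```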

And the output will be returned as follows:

{
  "certification": "PG-13",
  "description": "Hobbs has Dom and Brian reassemble their crew in order to take down a mastermind who commands an organization of mercenary drivers across 12 countries. Payment? Full pardons for them all.",
  "director": "Justin Lin", 
  "gallery": [
    "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/small/unknown-1394846836._V379391227_.png", 
    "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/small/unknown-1394846836._V379391227_.png", 
    "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/small/unknown-1394846836._V379391227_.png"
  ], 
  "genre": [
    "Action", 
    "Crime", 
    "Thriller"
  ], 
  "metascore": "61", 
  "poster": "http://ia.media-imdb.com/images/M/MV5BMTM3NTg2NDQzOF5BMl5BanBnXkFtZTcwNjc2NzQzOQ@@._V1_SX214_.jpg", 
  "rating": "7.2", 
  "release_date": "24 May 2013", 
  "running_time": "130 min", 
  "stars": [
    "Vin Diesel", 
    "Paul Walker", 
    "Dwayne Johnson"
  ], 
  "storyline": "Since Dom (Diesel) and Brian's (Walker) Rio heist toppled a kingpin's empire and left their crew with $100 million, our heroes have scattered across the globe. But their inability to return home and living forever on the lam have left their lives incomplete. Meanwhile, Hobbs (Johnson) has been tracking an organization of lethally skilled mercenary drivers across 12 countries, whose mastermind (Evans) is aided by a ruthless second-in-command revealed to be the love Dom thought was dead, Letty (Rodriguez). The only way to stop the criminal outfit is to outmatch them at street level, so Hobbs asks Dom to assemble his elite team in London. Payment? Full pardons for all of them so they can return home and make their families whole again.", 
  "title": "Furious 6", 
  "votes": "154,139", 
  "year": "2013"
}

The output is a JSON object containing the data for the movie. You can fork the code on Github, or try it out at http://getimdb.herokuapp.com/.
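Note that every value in the output is a string (or a list of strings), so numbers such as the vote count need converting before you can do arithmetic on them. A small sketch of how that might look; `to_int` is a hypothetical helper and the `raw` string is just a trimmed copy of the sample output above:

```python
import json

# A trimmed copy of the sample output, used here as stand-in input.
raw = '{"votes": "154,139", "running_time": "130 min", "rating": "7.2"}'
movie = json.loads(raw)


def to_int(text, default=None):
    """Keep only the digits of text and convert them, e.g. "154,139" -> 154139."""
    digits = ''.join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else default


votes = to_int(movie['votes'])
minutes = to_int(movie['running_time'])
rating = float(movie['rating'])
```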

Virendra Rajput (BkVirendra)
