Virendra Rajput (BkVirendra)

Google's "Did You Mean" Hack in Python

By Virendra Rajput

I've always been pretty fond of Google Search (well, everyone is). Google has some really handy features that help when you're searching for something you can't actually spell correctly (there are a lot of things that aren't easy to spell, unless you are an English professor or maybe an expert in literature).

So I had this problem with one of my apps, Nearme, where people weren't actually querying correctly (there were a lot of misspelled words in the queries). Since these queries were proper nouns, there was no specific dictionary or source that I could use to correct them. So I thought of using Google’s “Did You Mean”, since it corrects all types of words (including proper nouns that aren't included in any dictionary).

So here’s a hack that I wrote to solve this problem of fixing the spelling mistakes that users made while querying my app. (It's not the BEST solution to the problem, but well, it works.)

The code is in Python and makes use of one of my favorite modules, BeautifulSoup.

The getPage function asks the server for a gzip-compressed page and decompresses the response when the server honors that request, which reduces bandwidth usage while retrieving the page.
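To make the gzip step concrete, here is a minimal sketch of just the decoding logic, written for Python 3 (the script itself is Python 2) and runnable without a network connection. The helper name decode_body and the simulated page are my own assumptions for illustration; they are not part of the script.

```python
import gzip

def decode_body(raw, content_encoding):
    """Return the response body, decompressing only if the server used gzip."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    return raw

# Simulate what a server would send for a repetitive HTML page (made-up data).
page = b"<li>result</li>" * 200
compressed = gzip.compress(page)

# The same helper handles both compressed and uncompressed responses.
assert decode_body(compressed, "gzip") == page
assert decode_body(page, None) == page

# Compare the sizes to see how much smaller the compressed transfer is.
print(len(page), len(compressed))
```

Checking the Content-Encoding value before decompressing matters because a server is free to ignore the Accept-encoding header and send plain HTML.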

didYouMean is the main function: you call it with a word as the argument, and it returns the corrected word if the word is misspelled, or else it simply returns 1, which means the word has no corrections.
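Because the function returns either a string or the integer 1, the caller has to check for that 1 before using the result. A small sketch of how that might look, in Python 3, with a stand-in for didYouMean so it runs offline. The names resolve_query and fake_suggest are hypothetical and only for illustration.

```python
def resolve_query(word, suggest):
    """Return the corrected word, or the original word if there is no correction.

    `suggest` is any callable that follows the convention described above:
    it returns a corrected string, or the integer 1 when nothing needs fixing.
    """
    result = suggest(word)
    if result == 1:
        return word
    return result

# Stand-in for didYouMean (hypothetical data) so the example needs no network.
def fake_suggest(word):
    corrections = {"pyhton": "python"}
    return corrections.get(word, 1)

print(resolve_query("pyhton", fake_suggest))  # python
print(resolve_query("python", fake_suggest))  # python
```

Returning None instead of 1 would arguably be more idiomatic Python, but the sketch sticks to the convention the script uses.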

The code for the script:

import gzip
import re
import sys
import urllib
import urllib2

from bs4 import BeautifulSoup
from StringIO import StringIO

def getPage(url):
    request = urllib2.Request(url)
    request.add_header('Accept-encoding', 'gzip')
    request.add_header('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20')
    response = urllib2.urlopen(request)
    if response.info().get('Content-Encoding') == 'gzip':
        buf = StringIO( response.read())
        f = gzip.GzipFile(fileobj=buf)
        data = f.read()
    else:
        data = response.read()
    return data

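def didYouMean(q):
    # Normalize the query: lowercase it and trim surrounding whitespace
    q = str(q).lower().strip()
    url = "http://www.google.com/search?q=" + urllib.quote(q)
    html = getPage(url)
    # Name the parser explicitly so results don't depend on what is installed
    soup = BeautifulSoup(html, "html.parser")
    # The suggestion link is looked up by its "spell" class
    ans = soup.find('a', attrs={'class' : 'spell'})
    try:
        # Take the link's text only, which drops the <b>/<i> markup
        result = ans.get_text()
        # Keep letters, digits and spaces, then collapse repeated whitespace
        result = re.sub(r'[^A-Za-z0-9\s]+', '', result)
        result = re.sub(r'\s+', ' ', result).strip()
    except AttributeError:
        # ans is None when there is no suggestion link: nothing to correct
        result = 1
    return result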

if __name__ == "__main__":
    response = didYouMean(sys.argv[1])
    print(response)

You can even fork it on GitHub here.
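If you want to try the extraction step without hitting Google or installing anything, here is a rough Python 3 sketch that swaps BeautifulSoup for the standard library's html.parser. The class SpellLinkParser, the function extract_suggestion and the sample HTML are all assumptions made up for this example; the sample only mimics the a tag with class "spell" that the script looks for, and real result pages may look different.

```python
from html.parser import HTMLParser

class SpellLinkParser(HTMLParser):
    """Collect the text inside the first <a class="spell"> element.

    Assumes the markup inside the link is well nested.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside the spell link
        self.done = False   # True once the first spell link has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1  # nested tag such as <b> or <i>
        elif tag == "a" and "spell" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def extract_suggestion(html):
    """Return the suggestion text, or 1 if there is no spell link."""
    parser = SpellLinkParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text if text else 1

# Made-up markup for illustration only.
sample = ('<p>Did you mean: <a class="spell" href="/search?q=python">'
          '<b><i>python</i></b> tutorial</a></p>')

print(extract_suggestion(sample))                  # python tutorial
print(extract_suggestion("<p>no suggestion</p>"))  # 1
```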