I've always been pretty fond of Google Search (well, everyone is). Google has some really handy features that help when you're searching for something you can't actually spell correctly (and there are a lot of things that aren't easy to spell, unless you're an English professor or maybe an expert in literature).
So I had this problem with one of my apps, Nearme, where people weren't querying correctly (there were a lot of misspelled words in the queries). Since these queries were proper nouns, there was no specific dictionary or source I could use to correct them. So I thought of using Google’s “Did You Mean”, since it corrects all kinds of words (including proper nouns that aren't found in any dictionary).
So here’s a hack that I wrote to solve this problem of fixing the spelling mistakes users made while querying my app. (It's not the BEST solution to the problem, but it works.)
The code is in Python, and makes use of one of my favorite modules, BeautifulSoup.
The getPage function retrieves pages gzip-compressed, which reduces the bandwidth used while fetching them. didYouMean is the main function: you call it with a word as its argument, and it returns the corrected word (if it is misspelled) or simply returns 1, which means the word has no correction.
The code for the script:
import os
import urllib2
import io
import gzip
import sys
import urllib
import re
from bs4 import BeautifulSoup
from StringIO import StringIO
def getPage(url):
    # Ask for the page gzip-compressed to save bandwidth, and decompress it if needed.
    request = urllib2.Request(url)
    request.add_header('Accept-encoding', 'gzip')
    request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20')
    response = urllib2.urlopen(request)
    if response.info().get('Content-Encoding') == 'gzip':
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        data = f.read()
    else:
        data = response.read()
    return data
def didYouMean(q):
    q = str(str.lower(q)).strip()
    url = "http://www.google.com/search?q=" + urllib.quote(q)
    html = getPage(url)
    soup = BeautifulSoup(html)
    # Google marks its "Did you mean" suggestion with an anchor of class 'spell'.
    ans = soup.find('a', attrs={'class': 'spell'})
    try:
        # Strip the markup and unicode prefixes, leaving just the suggested words.
        result = repr(ans.contents)
        result = result.replace("u'", "")
        result = result.replace("/", "")
        result = result.replace("<b>", "")
        result = result.replace("<i>", "")
        result = re.sub('[^A-Za-z0-9\s]+', '', result)
        result = re.sub(' +', ' ', result)
    except AttributeError:
        # No suggestion found: the word has no correction.
        result = 1
    return result
if __name__ == "__main__":
    response = didYouMean(sys.argv[1])
    print response
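If you'd rather call it from another script than from the command line, here's a rough sketch of how you might wrap it. (It assumes you saved the script above as didyoumean.py; the misspelled queries are just made-up examples.)

# Sketch of using didYouMean from another script.
# Assumes the code above is saved as didyoumean.py; the queries are made-up examples.
from didyoumean import didYouMean

def correctQuery(q):
    suggestion = didYouMean(q)
    if suggestion == 1:
        # Google had no suggestion, so keep the user's query as-is.
        return q
    return suggestion

if __name__ == "__main__":
    for query in ["starbucks cofee", "restarant near me"]:
        print query, "->", correctQuery(query)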
You can even fork it on GitHub here.