Virendra Rajput (BkVirendra)


Launching an API based App | Scrapit - Extract keywords from webpages

By Virendra Rajput

I was recently trying to find a good web scraper alternative for one of my apps, which had to scrape links and extract keywords from webpages. It needed to be quite robust, handling broken HTML and heaps of text.

I browsed around the web looking for existing libraries that could help me out, since I don't like to build things from scratch. I couldn't find exactly what I wanted, but I found some things that could help me build it.

Python has some really good text processing modules, along with HTML processing libraries. So I ended up using:

Topia.termextract for text processing

lxml for HTML parsing
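The pipeline behind this is simple: pull the visible text out of the HTML, then score candidate keywords. Here's a minimal stdlib-only sketch of that idea — a plain word-frequency count stands in for topia.termextract's term scoring, and html.parser stands in for lxml — so treat it as an illustration of the approach, not the actual Scrapit code:

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_keywords(html, min_occurs=1):
    """Return (word, count) pairs for words seen at least min_occurs times.

    A crude length filter stands in for real stop-word removal and the
    statistical scoring that topia.termextract would do.
    """
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    words = [w.lower().strip(".,!?:;\"'()") for w in text.split()]
    counts = Counter(w for w in words if len(w) > 3)
    return [(w, n) for w, n in counts.most_common() if n >= min_occurs]
```

For example, `extract_keywords("<p>Movies and movies and more movies.</p>", min_occurs=2)` returns `[("movies", 3)]`.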

After writing the code, I tested it pretty thoroughly, and once the module was complete I decided to launch it as an API service. Deployment was not an issue, since Heroku is my go-to option. So I got the API deployed on Heroku, with some modifications, running on a gunicorn server.

What is Scrapit?

Scrapit is an API for scraping webpages for keywords. Using Scrapit you can extract important keywords from a webpage that are relevant to the page.

Using Scrapit:

You need to make calls to

http://scrapit.herokuapp.com/q/?q={url} 

Parameters:

q : (required) the URL to be fetched

occurs : (optional) Only returns words that are repeated more than once on the webpage. Set to '1' to enable it

pretty : (optional) Pretty-prints the response. Set to '1' to enable it

Example Usage:

http://scrapit.herokuapp.com/q/?q=http://imdb.com

http://scrapit.herokuapp.com/q/?q=http://imdb.com&pretty=1

http://scrapit.herokuapp.com/q/?q=http://imdb.com&pretty=1&occurs=1
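From Python, the calls above can be made with just the standard library. The small wrapper below is my own illustrative helper, not part of Scrapit itself; it assembles the query string from the parameters listed above (percent-encoding the target URL, which the raw examples above skip) and fetches the response:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://scrapit.herokuapp.com/q/"


def scrapit_url(url, occurs=False, pretty=False):
    """Build a Scrapit request URL from the documented parameters."""
    params = {"q": url}
    if occurs:
        params["occurs"] = "1"
    if pretty:
        params["pretty"] = "1"
    return BASE + "?" + urlencode(params)


def scrapit(url, occurs=False, pretty=False):
    """Fetch the raw response (requires the service to be reachable)."""
    with urlopen(scrapit_url(url, occurs, pretty)) as resp:
        return resp.read().decode()
```

So `scrapit("http://imdb.com", pretty=True)` would hit the second example URL above, just with the target URL percent-encoded.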

I'm going to keep trying to fix the bugs in the API.

So if you have any suggestions that would make Scrapit any better, they are welcome here :)