Wednesday 24 July 2013

life is short - you need Python!

 I have been using Python to write web crawler/spider/scraper for a long time. And it's an interesting experience indeed. The good news is, I have decided to share my web crawler experience with you. I shall use the terms crawler, spider, scraper alternatively.

The most basic thing to write a web spider is to get the html source (i.e. content) of an URL. There are many ways to do it. Here I post a simple code that gets the html source from an url.


import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data


Urllib2 is a very useful module for the spiderman ;) so take a look at the documentation http://www.python.org/doc/current/lib/module-urllib2.html

No comments:

Post a Comment