I have been using Python to write web crawler/spider/scraper for a long time. And it's an interesting experience indeed. The good news is, I have decided to share my web crawler experience with you. I shall use the terms crawler, spider, scraper alternatively.
The most basic thing to write a web spider is to get the html source (i.e. content) of an URL. There are many ways to do it. Here I post a simple code that gets the html source from an url.
Urllib2 is a very useful module for the spiderman ;) so take a look at the documentation http://www.python.org/doc/current/lib/module-urllib2.html
The most basic thing to write a web spider is to get the html source (i.e. content) of an URL. There are many ways to do it. Here I post a simple code that gets the html source from an url.
import urllib2
url = 'http://abc.com' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
Urllib2 is a very useful module for the spiderman ;) so take a look at the documentation http://www.python.org/doc/current/lib/module-urllib2.html
No comments:
Post a Comment