I got kind of bored today and wrote a pretty simple web crawler in Python, and it turned out to be less than 50 lines. It doesn’t store any output; I’ll leave that up to anyone who wants to use the code, because, well, there are just too many ways to choose from. Right now you pass it a starting link as a command-line argument and it will crawl forever until it runs out of links, which isn’t a likely condition. So here ya go. Have fun, and feel free to ask questions.
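To run it, assuming you save the script below as crawler.py (the filename is just my example), you'd kick it off with a seed URL like this:

    python crawler.py http://example.com/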
import sys
import re
import urllib2
import urlparse

# URLs still to visit (seeded from the command line) and URLs already fetched
tocrawl = set([sys.argv[1]])
crawled = set()

keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        break
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()

    # pull the page title out of the raw HTML
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print title

    # pull the meta keywords, if the page has any
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print keywordlist

    # collect every link on the page and queue the ones we haven't seen yet
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
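If you do want to keep what it finds, one minimal option (just a sketch, not part of the script above; the results.csv filename and the save_result helper are my own invention) is to append each page's URL, title, and keywords to a CSV file as the crawl runs:

    import csv

    def save_result(writer, url, title, keywords):
        # write one row per crawled page: URL, title, comma-joined keywords
        writer.writerow([url, title, ', '.join(keywords)])

    # open the output file once, before the while loop starts
    output = open('results.csv', 'ab')
    writer = csv.writer(output)

    # then, inside the loop, after the title and keyword list are extracted:
    # save_result(writer, crawling, title, keywordlist)

You could just as easily dump the same fields into sqlite or plain text; CSV is only the laziest choice.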
** EDIT **
This was a very early draft of this program. As it turns out, I revisited this project a few months later and it evolved much more.
If you would like to check out the more evolved version, feel free to have a look here at my GitHub!