Writing a simple Web Crawler
Although I shouldn’t really be procrastinating, writing for hours makes me depressed, while learning a new programming language makes me happy. So for the past two or three weeks, I spent two or three hours each Saturday or Sunday building a web crawler, something I had never done before. The crawler captures posts from my Korean blog and imports them into this WordPress blog. To do this, I learned a new language, Python.
This is what I did:
1. I opened an HTTP connection using the urllib2 module.
2. To parse the content of interest, I used the BeautifulSoup module. It is built on top of regular expressions and SGML parsing, so I can traverse the HTML tree very easily and search for a node using regular expressions.
3. I dumped the result to a text file in Movable Type format, which I then fed into the WordPress import system.
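The three steps above can be sketched roughly as follows. This is a minimal illustration, not the crawler itself: it uses Python 3's urllib.request (the successor to urllib2) and the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages. The tag and class names (h1.title, div.post), the author name, and the date value are all hypothetical placeholders, and the Movable Type output covers only a handful of fields.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: open an HTTP connection and return the page as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PostParser(HTMLParser):
    """Step 2: collect the text inside <h1 class="title"> and <div class="post">.

    Both tag/class pairs are assumptions; adjust them to the real blog markup.
    """

    TARGETS = {("h1", "title"): "title", ("div", "post"): "body"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "body": []}
        self._current = None  # name of the field being collected, if any
        self._tag = None      # tag that opened the current field
        self._depth = 0       # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        if self._current is None:
            classes = (dict(attrs).get("class") or "").split()
            for (name, cls), field in self.TARGETS.items():
                if tag == name and cls in classes:
                    self._current, self._tag, self._depth = field, tag, 1
                    break
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current].append(data)


def extract_post(html):
    """Return a dict with the post's title and body text."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return {key: "".join(parts).strip() for key, parts in parser.fields.items()}


def to_movable_type(post, author="me", date="01/01/2010 12:00:00 PM"):
    """Step 3: format one post as a Movable Type import entry."""
    return (
        f"AUTHOR: {author}\n"
        f"TITLE: {post['title']}\n"
        f"DATE: {date}\n"
        "-----\n"
        "BODY:\n"
        f"{post['body']}\n"
        "-----\n"
        "--------\n"
    )
```

With a real URL, the pieces would be chained as to_movable_type(extract_post(fetch(url))) and the result appended to a text file for the importer. Keeping the fetch, parse, and format steps in separate functions makes it possible to try the parsing on a saved HTML file before touching the network.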