Sunday, May 3, 2009

Web Spider for Wiki

Recently, I wrote a web spider to crawl text from the Wikipedia website. The spider pauses 500 ms per page and saves each page as text data taken from the central content area, like the following page:


Everything outside that central area is cropped as useless information.
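The cropping step can be sketched roughly as follows. This is a minimal Python illustration, not the C# code from the spider framework: the class `ContentExtractor`, the function `extract_text`, and the default element id `bodyContent` are all assumptions made for the example, and the id should be adjusted to whatever element holds the article text in the page being crawled.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Tags whose text content is not article text.
SKIP_TAGS = {"script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text found inside the element with a given id (hypothetical helper)."""

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.depth = 0    # nesting depth inside the target element; 0 means outside
        self.skip = 0     # nesting depth inside script/style within the target
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag in SKIP_TAGS:
                self.skip += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.parts.append(data)


def extract_text(html, content_id="bodyContent"):
    """Return the whitespace-normalized text of the element with id `content_id`."""
    parser = ContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

The sketch counts open and close tags, so it assumes reasonably well-formed markup; pages that leave tags such as `<p>` or `<li>` unclosed would need a more forgiving HTML parser.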

After saving a page, the spider clicks the "Random article" link, see below, to get another wiki page.



I slowed the spider down by 500 ms after each page. After 10,000 random pages have been downloaded, the spider stops running. So it needs more than 5,000 seconds (10,000 pages × 0.5 s) to finish the task.
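The loop described above can be sketched like this. Again this is an illustrative Python version rather than the original C# code. The names `crawl`, `min_runtime_seconds`, `fetch`, and `save` are hypothetical, and the sketch assumes that the "Random article" link resolves to Wikipedia's `Special:Random` page. The fetch, save, and sleep functions are passed in so that the loop can be tried without touching the network.

```python
import time

# Assumption: this is the target of the "Random article" link.
RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"


def min_runtime_seconds(pages, delay_s):
    """Lower bound on total run time: the delays alone, ignoring download time."""
    return pages * delay_s


def crawl(fetch, save, pages=10_000, delay_s=0.5, sleep=time.sleep):
    """Fetch `pages` random articles, pausing `delay_s` seconds after each one.

    fetch(url) returns the page content; save(index, content) stores it.
    Both are supplied by the caller (hypothetical hooks for this sketch).
    """
    for index in range(pages):
        save(index, fetch(RANDOM_URL))
        sleep(delay_s)  # throttle so the site is not hammered
    return pages
```

With the defaults, `min_runtime_seconds(10_000, 0.5)` gives 5,000 seconds, a little under an hour and a half, and the real run takes longer because each download adds its own time on top of the delay.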

You can download the source code from http://code.google.com/p/spiderframework/downloads/list.

It is written in C# with Visual Studio 2005.

If you need the compiled binary directly, please email me. I can build it for you.

1 comment:

  1. Hi, love the framework, great job! I am having trouble crawling into an IFrame, I believe.

    I am trying to crawl http://www.healthsouth.com/careers/job_postings.asp and get the job listings and detail for the job.

    Not sure what I am doing wrong.

    Can you help me? Can you post an example spider.config for this?

    Thanks!

    -RDG
