Recently, I wrote a web spider to crawl text from the Wikipedia website. The spider fetches one page every 500 ms and saves, as plain text, the content that sits in the center area of each page, like the following page:
The useless information outside that area is cropped out.
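To make the cropping step concrete, here is a minimal sketch in Python (not the C# code of the spider itself). It assumes the center area is the element with the id `mw-content-text`, which is my assumption about Wikipedia's markup rather than something the spider's source confirms; the class and function names are made up for illustration. It also assumes reasonably well-formed HTML with closing tags.

```python
from html.parser import HTMLParser


class ContentTextExtractor(HTMLParser):
    """Collect the text found inside the element with a given id."""

    # Tags that never have a closing tag, so they must not change the depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}
    # Tags whose text is not page content.
    SKIP_TAGS = {"script", "style"}

    def __init__(self, content_id="mw-content-text"):  # id is an assumption
        super().__init__()
        self.content_id = content_id
        self.depth = 0       # 0 means we are outside the content element
        self.skip_depth = 0  # > 0 means we are inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
            if tag in self.SKIP_TAGS:
                self.skip_depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or self.depth == 0:
            return
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def extract_content_text(html, content_id="mw-content-text"):
    """Return the text of the content element, one text fragment per line."""
    parser = ContentTextExtractor(content_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Everything outside the chosen element (menus, sidebars, footer) never reaches `parts`, which is the cropping described above. A real page with unclosed tags would need a more forgiving parser.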
After saving a page, the spider clicks the "Random article" link, see below, to get another wiki page.
I slowed the spider down by making it pause for 500 ms after each page. After 10,000 random pages have been downloaded, the spider stops running. So, it needs more than 5,000 seconds (10,000 pages × 0.5 s, plus download time) to finish the task.
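The loop described above can be sketched roughly as follows, again in Python rather than the spider's C#. The `Special:Random` URL is my assumption for where the "Random article" link leads, and `crawl_random_pages`, `fetch_with_urllib`, `minimum_sleep_seconds`, and the User-Agent string are names invented for this example. The `fetch`, `save`, and `sleep` parameters are passed in so the loop can be tried without touching the network.

```python
import time
import urllib.request

# Assumed target of the "Random article" link.
RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"


def fetch_with_urllib(url):
    """Download a page and return its HTML as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-spider/0.1"}  # placeholder value
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def minimum_sleep_seconds(max_pages, delay_seconds):
    """Lower bound on run time: only the pauses, no download time."""
    return max_pages * delay_seconds


def crawl_random_pages(fetch, save, max_pages=10_000, delay_seconds=0.5,
                       sleep=time.sleep):
    """Fetch random pages one by one, pausing after each, then stop."""
    for index in range(max_pages):
        html = fetch(RANDOM_URL)   # follow the "Random article" link
        save(index, html)          # store the page (cropping happens elsewhere)
        sleep(delay_seconds)       # slow the spider down after every page
    return max_pages
```

With the numbers from this post, `minimum_sleep_seconds(10_000, 0.5)` gives 5000.0, which is why the whole run takes more than 5,000 seconds once download time is added. A real run would look like `crawl_random_pages(fetch_with_urllib, my_save_function)`, where `my_save_function` is whatever writes the text to disk.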
You can download the source code from http://code.google.com/p/spiderframework/downloads/list.
It is written in C# with Visual Studio 2005.
If you need a compiled binary instead, please email me. I can build it for you.
Hi, love the framework, great job! I believe I am having trouble crawling into an IFrame.
I am trying to crawl http://www.healthsouth.com/careers/job_postings.asp and get the job listings and the details for each job.
Not sure what I am doing wrong.
Can you help me? Can you post an example spider.config for this?
Thanks!
-RDG