Recently, I wrote a web spider that crawls text from the Wikipedia website. The spider runs at about 500ms per page and saves only the text that resides in the center content area of each page; the useless information around it is cropped out.
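For reference, here is a minimal sketch of that extraction step. It assumes the spider targets Wikipedia's "bodyContent" div (the id of the center article area at the time), which is my assumption about the markup; the actual project may parse the HTML differently.

    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    class ContentExtractor
    {
        // Downloads a page and returns the plain text inside the
        // center content area. "bodyContent" is assumed to be the id
        // of Wikipedia's main article div.
        public static string ExtractCenterText(string url)
        {
            WebClient client = new WebClient();
            string html = client.DownloadString(url);

            // Grab everything inside the main content div. The greedy
            // match is adequate for a sketch; a real parser would
            // balance nested divs properly.
            Match m = Regex.Match(html,
                "<div id=\"bodyContent\"[^>]*>(.*)</div>",
                RegexOptions.Singleline);
            if (!m.Success)
                return string.Empty;

            // Strip the remaining tags so only plain text is saved.
            string text = Regex.Replace(m.Groups[1].Value, "<[^>]+>", " ");
            return Regex.Replace(text, "\\s+", " ").Trim();
        }
    }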
After saving a page, the spider clicks the "Random article" link in the sidebar, as sketched below, to load another wiki article.
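The "Random article" link simply points to Wikipedia's Special:Random page, which redirects to a random article, so the spider does not even need to parse the sidebar. A sketch of that step (the project itself may follow the link differently):

    using System;
    using System.Net;

    class RandomArticle
    {
        // Requests Special:Random and returns the URL it redirects to.
        // HttpWebRequest follows the redirect automatically, so the
        // final ResponseUri is the random article's address.
        public static string GetRandomArticleUrl()
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
                "http://en.wikipedia.org/wiki/Special:Random");
            using (HttpWebResponse response =
                (HttpWebResponse)request.GetResponse())
            {
                return response.ResponseUri.AbsoluteUri;
            }
        }
    }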
I slow the spider down by pausing 500ms after each page. After 10,000 random pages have been downloaded, the spider stops running, so the whole task takes more than 5,000 seconds (the pauses alone add up to 5,000 seconds, plus the download time).
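Putting the pieces together, the main crawl loop looks roughly like this. It reuses the two sketch methods above; the file naming and overall structure are my assumptions, not necessarily how the framework is organized.

    using System.IO;
    using System.Threading;

    class SpiderLoop
    {
        static void Main()
        {
            // Download 10,000 random pages, one text file per page.
            for (int i = 0; i < 10000; i++)
            {
                string url = RandomArticle.GetRandomArticleUrl();
                string text = ContentExtractor.ExtractCenterText(url);
                File.WriteAllText("page" + i + ".txt", text);

                // Be polite to the server: wait 500ms before the
                // next request.
                Thread.Sleep(500);
            }
        }
    }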
You can download the source code from
http://code.google.com/p/spiderframework/downloads/list.
It is written in C# with Visual Studio 2005.
If you need the binary directly, please email me and I will build it for you.