Fox.IT: Web Spider for Wiki

Sunday, May 3, 2009

Web Spider for Wiki

Recently, I wrote a web spider to crawl text from the Wikipedia website. The spider processes roughly one page every 500 ms and saves, as text data, the content that sits in the center area of each web page, as in the following page:

Everything outside that center area is cropped.
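To make the cropping step concrete, here is a minimal Python sketch of keeping only the text inside one content element. It is not the spider's actual C# code. The element id `bodyContent`, the class name `ContentTextExtractor`, and the helper `extract_text` are assumptions made for illustration, and the sketch assumes reasonably well-formed markup in which non-void tags are closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentTextExtractor(HTMLParser):
    """Collects text found inside the element with a given id (assumed name)."""

    def __init__(self, content_id="bodyContent"):
        super().__init__()
        self.content_id = content_id
        self.depth = 0          # 0 means we are outside the content element
        self.skipping = False   # True while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1
        if tag in ("script", "style"):
            self.skipping = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in ("script", "style"):
            self.skipping = False
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that lies inside the content element.
        if self.depth > 0 and not self.skipping:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html, content_id="bodyContent"):
    """Return the text of the content element, with everything else cropped."""
    parser = ContentTextExtractor(content_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `extract_text(page_html)` on a downloaded page would return only the article text, provided the page really wraps its article in an element with that id. The id is passed as a parameter so it can be changed if the site's markup differs.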

After saving a page, the spider follows the "Random article" link (see below) to get another wiki page.
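Following the "Random article" link could look roughly like the Python sketch below, which searches a page for an anchor with that visible text and resolves its `href` against the page URL. The names `LinkFinder` and `find_link` are hypothetical, and the sketch is an illustration rather than the spider's own C# logic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkFinder(HTMLParser):
    """Finds the first <a> element whose visible text equals link_text."""

    def __init__(self, link_text):
        super().__init__()
        self.link_text = link_text
        self.current_href = None   # href of the anchor being read, if any
        self.current_text = []
        self.found = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.found is None:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            if "".join(self.current_text).strip() == self.link_text:
                self.found = self.current_href
            self.current_href = None


def find_link(html, base_url, link_text="Random article"):
    """Return the absolute URL of the matching link, or None if absent."""
    finder = LinkFinder(link_text)
    finder.feed(html)
    finder.close()
    return urljoin(base_url, finder.found) if finder.found else None
```

On English Wikipedia the "Random article" link points to `/wiki/Special:Random`, so for a page fetched from `https://en.wikipedia.org/wiki/Fox` the function would return `https://en.wikipedia.org/wiki/Special:Random`.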

I slowed the spider down by adding a 500 ms pause after each page. After 10,000 random pages have been downloaded, the spider stops running. So it needs more than 5,000 seconds (10,000 pages × 0.5 s) to finish the task.
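The pause-and-stop behaviour and the runtime estimate can be sketched in Python as follows. The functions `crawl` and `minimum_runtime_seconds`, and the `fetch_page` and `save_page` callbacks, are hypothetical names used for illustration, not part of the original C# project.

```python
import time


def crawl(fetch_page, save_page, max_pages=10_000, delay_seconds=0.5,
          sleep=time.sleep):
    """Fetch and save pages one by one, pausing after each, up to max_pages."""
    saved = 0
    while saved < max_pages:
        save_page(fetch_page())   # download one random page and store its text
        saved += 1
        sleep(delay_seconds)      # be polite: wait before the next request
    return saved


def minimum_runtime_seconds(max_pages=10_000, delay_seconds=0.5):
    """Lower bound on total runtime: the pauses alone, ignoring download time."""
    return max_pages * delay_seconds
```

With the defaults, `minimum_runtime_seconds()` gives 5000.0 seconds, a little under an hour and a half. The real run takes longer because downloading and saving each page adds to the pauses.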

You can download the source code from http://code.google.com/p/spiderframework/downloads/list.

It is written in C# with Visual Studio 2005.

If you need a compiled binary directly, please email me and I can build it for you.

1 comment:

rgauny, May 15, 2009, 4:44 AM
Hi, love the framework, great job! I am having trouble crawling into an IFrame, I believe.

I am trying to crawl http://www.healthsouth.com/careers/job_postings.asp and get the job listings and detail for the job.

Not sure what I am doing wrong.

Can you help me? Can you post an example spider.config for this?

Thanks!

-RDG