Shim-Crawler was written by Shim Wonbo of the Chikayama-Taura Laboratory. Its main purpose is to collect web pages for research on web search and data mining. We are also planning to use it to crawl weblogs. The crawler is used by members of the Chikayama-Taura Laboratory to crawl web pages for research purposes only. Our crawling policy follows the generally accepted crawling norms. We understand the concerns of webmasters, and we would like to assure you that our crawler collects pages only for research and not for any business use. Please take a look at our crawling policy below for details. We sincerely appreciate your cooperation and support.
Our crawler always respects the following common crawling norms:
  • If a web page includes the following meta tag, our crawler never crawls the page.
    Example: <meta name="robots" content="nofollow, noindex">
  • It always reads "robots.txt" and never crawls disallowed pages. For example:
    User-agent: *
    Disallow: /cgi-bin
    User-agent: Shim-Crawler
    Disallow: /
    Other organizations run crawlers that use the same program as Shim-Crawler. To refuse access by all of them, please use 'LC-Crawler' as the generic name.
    User-agent: LC-Crawler
    Disallow: /
  • If Crawl-Delay is given in /robots.txt, our crawler will connect once every "Crawl-Delay" seconds. Otherwise, the interval between connections is 1 minute, except for the request that immediately follows /robots.txt (i.e., your server will first get a request for /robots.txt and then another one right after it). We send up to five requests per connection.
  • If you do not want your pages to be crawled at all, please contact us, and we will make sure your request is respected from then on.
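The robots.txt and Crawl-Delay rules above can be checked by a webmaster with a short script. The following is only a minimal sketch using Python's standard urllib.robotparser module; it is not the crawler's own code, and the crawler may match rules differently. The sample robots.txt content, the helper names (parse_rules, is_allowed, connection_interval), and the choice to treat a page as blocked when either 'Shim-Crawler' or 'LC-Crawler' is disallowed are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from the policy text above.
CRAWLER_NAMES = ("Shim-Crawler", "LC-Crawler")
# Fallback interval between connections stated in the policy (1 minute).
DEFAULT_INTERVAL_SECONDS = 60

# Hypothetical robots.txt: the first rule shown in the policy plus a
# Crawl-delay line added for illustration.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Crawl-delay: 10
"""


def parse_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str) -> bool:
    """Return True only if neither crawler name is disallowed for this path.

    Assumption: a rule for either name is enough to block the page.
    """
    return all(parser.can_fetch(name, path) for name in CRAWLER_NAMES)


def connection_interval(parser: RobotFileParser) -> int:
    """Return the Crawl-delay that applies to Shim-Crawler, else the default."""
    delay = parser.crawl_delay("Shim-Crawler")
    return delay if delay is not None else DEFAULT_INTERVAL_SECONDS


if __name__ == "__main__":
    rules = parse_rules(EXAMPLE_ROBOTS_TXT)
    print(is_allowed(rules, "/index.html"))      # True
    print(is_allowed(rules, "/cgi-bin/search"))  # False
    print(connection_interval(rules))            # 10
```

To test your own site, pass the text of your /robots.txt to parse_rules and call is_allowed with the paths you care about. Note that Python's parser decides which rule group applies by substring matching on the user-agent name, which may not be how every crawler behaves.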
Information on Current Crawling
Machines currently used for crawling:
tako(dot) <133_11_238_6>
taz(dot) <133_11_238_7>

We would like to clarify again that our crawler collects pages solely for research purposes. We are interested in crawling a large volume of pages for the following ongoing research projects at our lab:
  • Analyzing policies for suitable, fast, and efficient crawling.
  • Discovering web communities from the collected pages.
  • Observing the clustering of web pages within our corpus.
  • Analyzing weblogs separately and devising suitable ways to crawl, differentiate, and analyze them specifically.
For any questions, comments, or requests, please send us e-mail.

Chikayama-Taura Lab,
Department of Information and Communication Engineering,
Faculty of Engineering,
The University of Tokyo.
Copyright ©2005 Chikayama-Taura Lab