Full Version: crawl-4.cuill.com.
gproctor
crawl-4.cuill.com is really crawling my site and causing very slow response times. Has anyone dealt with this one, and can you offer suggestions on how to block it?

TIA
JoeP
www.houseofproctor.org
arnold
There are several ways.

1. Write to Cuill at contact@cuill.com and ask that they stop visiting your website. They are very nice.
2. Add them to your robots.txt file (Disallow: cuill.com)
3a. Add them to your IP Deny Manager
3b. Add them to your .htaccess file (see the sketch below)

I would go with number one or two above.
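
For options 3a/3b, the end result is a set of deny rules in your .htaccess file. A minimal sketch of that, assuming you have already identified the crawler's addresses in your access logs (the 192.0.2.x addresses below are documentation placeholders, not Cuill's real ones):

CODE
# Deny the crawler's IP addresses; cPanel's IP Deny Manager writes
# equivalent "Deny from" lines into .htaccess for you
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 192.0.2.0/24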
gproctor
Thank You.
I will post results of contacting them.
gproctor
I followed your suggestions 1-3. Cuill returned this message:

Twiceler is an experimental crawler that we are developing for our new search engine.
It is important to us that it obey robots.txt, and that it not crawl sites that do not
wish to be crawled. I will add www.houseofproctor.org to our list of sites to exclude
and I apologize for any inconvenience this has caused you.

Sincerely,

James Akers
Operations Engineer
Cuill, Inc.

Thx Again,
JoeP
Steven Grim
Just an FYI, I found that this web bot has been hitting my site also.

The web bot's name is Twiceler, NOT cuill.com.

From their website:

CODE
Twiceler Info
Twiceler is an experimental robot. The user-agent to block this is twiceler. It could take 24-48 hours for us to re-read your robots.txt file. If you need something blocked immediately, please email us using the link below.


HTH.

Steven
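
Based on that, a minimal robots.txt entry to block it by user-agent might look like the sketch below (as Cuill's note says, it can take 24-48 hours before the crawler re-reads robots.txt, so don't expect an instant effect):

CODE
# Tell Twiceler (matched by user-agent) to stay out of the entire site
User-agent: Twiceler
Disallow: /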
reverendspam
After I wrote a not-so-nice email to these guys asking their intent, they did respond, and the good thing is their email did not bounce off any Russian or Asian servers. This does not prove that their site is not spoofed, but at least getting a response was comforting. Most hackers do not respond; they just unleash self-running code onto the net.

I complained that their bot was hammering my database-driven site at four times the rate of any other bot that visits me.

They explained their bot was experimental and in beta status. They asked for my logs to show that their bot was causing issues. Of course I nicely responded that I did not want to be a beta tester for their badly behaved bot and that I know how to easily block it, but I do hate it for those unsuspecting folks whose bandwidth is getting sucked dry. No response to that, of course, as I didn't expect any :)


If these folks are legit, they are not making very many friends by unleashing this bandwidth sucker on the net. This is again why I think they are suspect, or they are just full of themselves. At least they do have some phone numbers listed on the website, and I may just get bored one day and call, or call Stanford U., since the website owner is using a Stanford email address.

I was told to add a timer to my robots.txt file and asked to report back to them. I have a problem with this mentality: hey, I'll screw with your site and it's up to you to keep our bad bot at bay.


Their suggestion:

CODE
User-agent: Twiceler
Crawl-delay: 300


Supposedly this tells the bot to hit your site at most once every 300 seconds (5 minutes), that is, if the bot listens to the request at all.

The jury is still out on that though, as the bot has not reacted to this directive in the past 12 hours. I will give it a couple of days to see if the bot will listen. Right now it is still hammering my site, as it has for 2 weeks now, at one hit every 1 second to 2 minutes. It has hit over 25,000 pages in my stats. I think I only have maybe 6-10 thousand pages that it could possibly read.


The thing that worries me is that I blocked the bot at a higher level for two days, and instead of going somewhere else it kept trying to hammer my site. I deduced this because as soon as I took the block off, the Twiceler bot immediately began crawling my site again.

To block the bot from your site totally, add the following code to your .htaccess file:

CODE
# Flag any request whose User-Agent header contains "Twiceler" (case-insensitive)
SetEnvIfNoCase User-Agent .*Twiceler.* bad_bot

# Allow everyone except requests flagged above (Apache 2.2-style access control)
Order Allow,Deny
Allow from all
Deny from env=bad_bot


After you do that, if they want to keep hammering your site, that's their problem and their bandwidth.

-Joe

P.S. If anyone is so inclined to build a bot trap for bad bots, check out this link:

http://www.kloth.net/internet/bottrap.php
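
The general idea behind such a trap, sketched very loosely (the /bot-trap/ path below is a placeholder of my own, not taken from the kloth.net article): robots.txt forbids a decoy directory, a hidden link points into it, and a small script in that directory records the IP of anything that requests it, since whatever follows that link is ignoring robots.txt. Those IPs then go into "Deny from" lines like the ones above. The robots.txt half looks roughly like this:

CODE
# Decoy directory; compliant crawlers will never request anything under it
User-agent: *
Disallow: /bot-trap/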
genfan
I placed this in my "robots.txt" file:

CODE

User-agent: *
Disallow: /cgi-bin/
Disallow: cuill.com
Sitemap: <sitemap_location>
http://www.genfan.com/tngsitemap1.xml


This seems to have done nothing, as Cuill is still blasting my site. I could not find an .htaccess file on my site. Do I have to create one? What should the contents look like?

I want to stop this bot and everything I have done until now has been fruitless.
arnold
1. The easiest thing to do is just contact them. See Post #2 above.

2. As to your .htaccess file, go to your cPanelX (or its equivalent), then File Manager, then public_html, and there you will find .htaccess. You can edit the file at this point.

3. I am unable to find .htaccess with my FTP program. If I recall correctly, this is because the file name begins with a period/dot. There is a setting in my FTP program which allows me to view these files, but I cannot find it at the moment. :P
deboard
I don't think that the Disallow: Cuill.com does anything, since it seems like it would be disallowing access to a directory called cuill.com on your server.

This should work:

CODE
User-agent: Twiceler
Disallow: /

Which is a dumbed-down version of reverendspam's more elegant suggestion.

The cuill.com site says it may take 24-48 hours to re-read your robots.txt. Not sure why it would take so long, but it looks like you may have to wait to see the hits quit.

arnold
QUOTE
I don't think that the Disallow: Cuill.com does anything, since it seems like it would be disallowing access to a directory called cuill.com on your server.

Our experience is that it will work. Below is what we have entered in Exclude Host Names*:
googlebot.com, gigablast.com, proxy.aol.com, inktomisearch.com, live.com, crawl.yahoo.net
brumer
My hosting company runs cPanel, but the way it is set up, .htaccess is not allowed. It can be done in a php.ini file though.
It seems I might have to restrict them as well.
arnold
QUOTE
My hosting company runs cPanel, but the way it is set up, .htaccess is not allowed.

You can still pretty much do it via IP Deny Manager in your cPanelX. However, .htaccess gives you more latitude as to what is allowed.