You want some 'bots to come by. In particular, you probably want the Googlebot to come by, because Google runs a real search engine and you do want people who might be readers or customers to find your site.
What you generally want to discourage are 'bots that are:
- sucking utterly excessive amounts of bandwidth, CPU or memory doing entirely unreasonable searches to detect "copyright" violations,
- sucking utterly excessive amounts of bandwidth, CPU or memory so they can build their own database to sell SEO services,
- sucking utterly excessive amounts of bandwidth, CPU or memory because they just don't want to spend money programming a bot that doesn't misbehave, or
- showing any evidence of doing something dangerous, like trying to crack into your server, steal email addresses, or leave malware.
The copyright issue comes up here, but it's really not the main reason to try to control bots. Also, controlling bots can't protect you from a copyright owner suing you if you really are abusing copyright. What it can do is give you limited protection by making it more difficult for something like the PicScout bot to race quickly through your site downloading hundreds of pages at little expense to itself but at significant expense to you. If you selectively block these bots, you can at least make it so that Getty or PicScout have to spend their own human time and effort searching your site.
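To give you an idea of what "selectively block" means in practice, the simplest version is just refusing to serve pages to user agents you don't want, while leaving Googlebot and ordinary browsers alone. Here is a rough sketch in Python; the patterns are made-up examples rather than a vetted blocklist, and a determined scraper can fake its user agent, which is why the IP-based auto-banning I describe below matters more.

```python
# Sketch of user-agent based blocking. The patterns are illustrative only;
# they are not a vetted blocklist, and scrapers can fake their user agents.
import re

BLOCKED_AGENT_PATTERNS = [
    r"picscout",          # example: an image-rights scraper
    r"seo[-_ ]?crawler",  # example: an SEO data harvester
    r"libwww-perl",       # example: a generic script that rarely belongs to a real reader
]

def is_blocked_agent(user_agent):
    """Return True if the request's user agent matches a blocked pattern."""
    ua = (user_agent or "").lower()
    return any(re.search(pattern, ua) for pattern in BLOCKED_AGENT_PATTERNS)

# A request handler would check this before doing any real work, e.g.:
#   if is_blocked_agent(request.headers.get("User-Agent", "")):
#       return 403  # deny the page instead of spending resources rendering it
```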
To tell you how bad it can get: shortly before I got my Getty letter, I noticed my blog was crashing a lot. I don't know if this had anything to do with PicScout, but I could see spikes. After I got my second Getty letter, I noticed that an agent hosted on Bezeqint hammered my blog and caused it to crash. Oddly, they were not served pages. ZBblock was denying them pages, but the bot just kept requesting over and over and over and over and... you get the picture.
My blog has a decent readership with quite a few people who work in IT. I got a great deal of advice, and decided that on shared hosting the solution would be to a) put the site behind Cloudflare and b) write some code to auto-ban IPs that do certain things. The system seems to be working quite well, and I have greatly reduced image scraping, crack attempts, visits from bandwidth-sucking referrer loggers and so on.
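For those who are curious, the auto-ban part is conceptually very simple: count how often each IP hits the site, and if an IP does something no human reader would do (like requesting well over a hundred pages in a minute), stop serving it. The sketch below is a minimal illustration in Python, not my actual code; the window, the threshold and the ban-file path are made-up values you would tune for your own site.

```python
# Minimal sketch of rate-based auto-banning. The thresholds and the ban file
# are illustrative; a real setup would tune these and feed the ban list to the
# web server or firewall.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # look at the last minute of requests
MAX_REQUESTS = 120           # more than this per minute looks like a misbehaving bot
BAN_FILE = "banned_ips.txt"  # something the web server can be told to consult

recent_hits = defaultdict(deque)  # ip -> timestamps of its recent requests
banned = set()

def record_hit(ip, now=None):
    """Record one request from `ip`; return True if the IP is (now) banned."""
    if ip in banned:
        return True
    now = time.time() if now is None else now
    hits = recent_hits[ip]
    hits.append(now)
    # Drop timestamps that have fallen out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS:
        banned.add(ip)
        with open(BAN_FILE, "a") as f:
            f.write(ip + "\n")
        return True
    return False
```

The useful part is that once an IP lands in the ban list, the web server or firewall can refuse it outright instead of letting it chew through your blog software on every single request.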
I have no illusion that someone looking for copyright violations can't visit my site, but it is now somewhat more difficult for them to race through quickly using automated tools. This means it is more expensive for something like PicScout to run a scraper.
But returning to the issue with Google: I don't block Google. I let it visit, cache, and surf whatever it likes. It's a mostly well-behaved bot, and Google has actually created a service that benefits me. I'm willing to let it use all the bandwidth it uses.
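One caveat with rate-based banning, at least as I sketched it above, is making sure the real Googlebot never gets caught in it. Google publishes a way to verify its crawler: reverse-DNS the IP, check the hostname is under googlebot.com or google.com, then forward-confirm it. Here is a rough sketch of that check, again illustrative rather than my actual code:

```python
# Sketch of verifying that an IP really is Googlebot before whitelisting it,
# along the lines of Google's published reverse-DNS / forward-confirm advice.
import socket

def is_real_googlebot(ip):
    """Reverse-DNS the IP, check the domain, then forward-confirm it."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not (hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```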
For what it's worth: if PicScout's method of looking for images ends up being to write a tool that does one bajillion searches at Google and afterwards visits the blog posts where they think a violation exists, that's fine with me. If they do it that way, they'll mostly be sucking up Google's bandwidth, not mine. Also, their bot will have to spend much more processing time to search. That's money from their pockets, not mine.
I have no objection to copyright owners trying to identify copyright infringers and enforce their rights. What I object to is services that operate in a way that deprives my blog readers of access, because my blog periodically crashes whenever the image-scraping service decides it wants to search, and that shift much of the cost of their searching onto me by increasing the resources I need to rent to keep my blog running. If they want to make money by searching for violations, that's fine. But I don't want to donate my server resources to their commercial enterprise.