Messages - lucia


631

Getty Images Letter Forum / Re: Picscout / DMCA question

« on: December 13, 2011, 03:15:00 PM »

OK, but now you might wonder: as a technical nuts-and-bolts matter, can anything lie about the useragent?

Easily! I can very easily program Firefox to leave the useragent "Googlebot/2.1 ( http://www.googlebot.com/bot.html)", or I could program it to tell the server I am visiting using "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)". Doing so would be called "spoofing".
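To give a rough idea of how little effort this takes outside a browser, here is a minimal Python sketch using only the standard library. The URL is a placeholder (mydomain.com stands in for a site you own) and the function name is made up for illustration. The request is only built here, not sent.

```python
import urllib.request

# User agent string quoted in the post; any text at all can be put here.
SPOOFED_UA = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
              "+http://www.baidu.com/search/spider.html)")

def build_spoofed_request(url, user_agent=SPOOFED_UA):
    """Build (but do not send) a request that claims to be `user_agent`."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# Placeholder URL: point this at your own site when testing your own blocks.
req = build_spoofed_request("http://mydomain.com/blog/name_of_page/")
print(req.get_header("User-agent"))
# urllib.request.urlopen(req) would actually send it. If the block works,
# the server should answer with a 403 and urlopen raises an HTTPError.
```

The server has no way to check the header against anything. It just logs whatever text the client chose to send.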

Spoofing can be legitimate. For example: I have spoofed the user agent and hit my own site to see whether I have correctly programmed .htaccess to block the baidubot. After testing, I change my user agent back to the default. Should I ever mistakenly crawl as the baidubot, I'll probably find myself blocked all over the place!

Next question: Do things spoof? Oh yes! I could show examples of obvious spoofing, but I'm going to show one of suspected spoofing instead. Let me return to the example of something I saw in my server logs. This time I'm going to point out something called the IP, which is the first item on the line:

180.175.7.236 - - [12/Dec/2011:01:17:22 -0800] "GET /blog/name_of_page/ HTTP/1.1" 403 521 "-" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.803.0 Safari/535.1"

The IP address is related to the machine through which the connection occurred. It tells me more about "who" might have connected than the useragent. In fact, I can look up 180.175.7.236 here:
http://whois.domaintools.com/180.175.7.236
What this tells me is that whatever that is connects through "China Chinanet Shanghai Province Network". In fact, many IPs starting with 180. come through "China Chinanet Shanghai Province Network" or similar networks, though not every one does.

If you do a little checking you will find that rumor has it that IPs near that value are the Baiduspider! My guess is this entry corresponds to an attempt by the baidubot to connect while spoofing the user agent. (Mind you, this is not necessarily so. But I suspect it.) Remember, if it leaves "Baidu" in the user agent, it is blocked. But it's possible for the person who programmed the bot to tell it to initially tell the truth, but change user agent if it gets back a "403" or "forbidden" message. (You can see this happen in server logs.) Now, I could maybe call a lawyer and try to assemble a case about Baidu-- but likely that would be expensive. And anyway, I might have trouble proving it was Baidu lying. After all, for all I know "China Chinanet Shanghai Province Network" is the Chinese equivalent of Comcast and I'm blocking a real visitor. I doubt it, but I might be.
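If you want to look for this "tell the truth, get a 403, then change user agent" pattern yourself, something like the following Python sketch could be a starting point. It assumes the log has already been boiled down to time-ordered (ip, status, user agent) tuples. The function name, the sample entries and the 192.0.2.x addresses (a range reserved for documentation) are all made up for illustration.

```python
def find_ua_switchers(entries):
    """Return IPs that showed up with a different user agent after a 403.

    `entries` is a time-ordered list of (ip, status, user_agent) tuples,
    a simplified stand-in for parsed server log lines.
    """
    refused = {}     # ip -> set of user agents that were answered with 403
    switchers = []
    for ip, status, user_agent in entries:
        seen = refused.get(ip, set())
        # Same IP, already refused once, now presenting a different agent.
        if seen and user_agent not in seen and ip not in switchers:
            switchers.append(ip)
        if status == 403:
            refused.setdefault(ip, set()).add(user_agent)
    return switchers

# Made-up sample entries, not real log data.
sample = [
    ("192.0.2.10", 403, "Mozilla/5.0 (compatible; Baiduspider/2.0)"),
    ("192.0.2.10", 200, "Mozilla/5.0 (Windows NT 5.1) Chrome/14.0.803.0"),
    ("192.0.2.20", 200, "Mozilla/5.0 (Macintosh) Firefox/8.0.1"),
]
print(find_ua_switchers(sample))
```

A hit here is only a hint, not proof. Several real people can share one IP, so two different user agents from the same address may be perfectly innocent.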

Since I don't have any particular need for traffic from China, I decided to deal with the huge number of spammy hits from this IP range by blocking by IP. It turns out my .htaccess also contains

order allow,deny
# baidu spider various ranges
deny from 119.
deny from 123.125.71
deny from 124.114
deny from 124.115
deny from 180
deny from 220.181
deny from 183
# china China Fujian Chinanet Fujian Province Network
deny from 120.37.209.57
# copyscape
deny from 212.100.254.105
# block picscout
deny from bezeqint.net
deny from 82.80.249
deny from 82.80.252
deny from 62.0.8.
deny from gettyimages.com
deny from gettywan.com
deny from picscout.com
deny from istockphoto.com
allow from all

Once again, this is edited to keep from filling the entire screen. The 'deny from 180' line means I deny all IPs starting with 180-- which means no one can visit my blog if they connect through "China Chinanet Shanghai Province Network".
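For anyone unsure how wide a partial-IP line like 'deny from 180' reaches, here is a small Python approximation of the matching, as I understand it: the line matches when the leading whole numbers of the IP equal the ones listed. This is only a sketch of the idea, not Apache's actual code, and it ignores the hostname lines (bezeqint.net and so on). The function names are made up.

```python
def ip_matches_prefix(ip, prefix):
    """True if `ip` begins with the whole octets listed in `prefix`."""
    prefix_octets = prefix.rstrip(".").split(".")
    return ip.split(".")[:len(prefix_octets)] == prefix_octets

def is_denied(ip, denied_prefixes):
    """True if any partial-IP entry covers `ip`."""
    return any(ip_matches_prefix(ip, p) for p in denied_prefixes)

# The partial-IP entries from the block above.
DENIED = ["119.", "123.125.71", "124.114", "124.115", "180", "220.181", "183"]

print(is_denied("180.175.7.236", DENIED))   # the IP from the log line
print(is_denied("18.0.1.2", DENIED))        # "18" is not "180"
```

The shorter the prefix, the more it sweeps up. 'deny from 180' covers about 16 million addresses, while 'deny from 123.125.71' covers 256.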

I'm sure you've also noticed '# block picscout', right? All the commands between that line and "allow from all" block everything I've found that either is known or rumored to be associated with picscout or getty surfing. I have other blocks in place too.

As I've said on another thread, I'm trying to put together a php script that will auto install and implement a lot of these blocks for people. I thought I'd be done more quickly, but as I checked things out, I needed to make sure it really, really does what I want it to do, and fairly easily. Given that, I might be charging a small amount for it. ($10 or so.) But it would put in blocks for things known or rumored to be getty/ picscout, etc. And do a few more things to protect people with a range of sites to some extent. (Nothing will give perfect protection. But you can make your site much less vulnerable by making it harder for picscout to crawl!)

632

Getty Images Letter Forum / Re: Picscout / DMCA question

« on: December 13, 2011, 03:08:23 PM »

Quote from: Lettered on December 13, 2011, 01:23:00 PM

With the "lack of permission" issue off the table, by faking the user agent aren't they are basically just requesting the information without identifying themselves and receiving it? I can't see how that could be construed as circumvention under the DMCA.

I hope I am wrong, by the way. I'm not saying picscout isn't breaking any laws ... i just don't think they are violating the DMCA circumvention laws.

Not quite. The answer will be long and I'm going to follow it with further stuff.

First, I'm not a lawyer. My training is mechanical engineering, but I self host and organize my own web site. So, I can describe a little what I mean about user agents. My illustration will use blocking with .htaccess as an example of a method to block user agents. People who know more about .htaccess should feel free to correct my mis-usage of terms etc. (I'm sure to do so.)

This is going to be long because I assume lots of people don't know what certain things are. So what are the different things that get recorded when something hits a page? Here's a slightly edited example of something that I would see if something I blocked hit the address "mydomain.com/blog/name_of_page".

180.175.7.236 - - [12/Dec/2011:01:17:22 -0800] "GET /blog/name_of_page/ HTTP/1.1" 403 521 "-" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.803.0 Safari/535.1"

The part that is the useragent string is on the far right, inside the last pair of quotes. I can tell I successfully blocked it because the '403' appears after "HTTP/1.1". In contrast, "200" appearing where 403 appears would mean my server sent them the page. I'll explain how I blocked this later and relate that to user agent.
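If it helps to see the pieces of that log line pulled apart, here is a short Python sketch. The regular expression is my own guess at the layout based on the example line (IP first, then the time in brackets, the request, the status code, the size, the referrer and finally the user agent). Other servers may be configured to log different fields.

```python
import re

# Assumed layout, inferred from the example line above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Split one log line into named fields, or return None if it doesn't fit."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('180.175.7.236 - - [12/Dec/2011:01:17:22 -0800] '
        '"GET /blog/name_of_page/ HTTP/1.1" 403 521 "-" '
        '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) '
        'Chrome/14.0.803.0 Safari/535.1"')

entry = parse_log_line(line)
print(entry["ip"])          # who connected
print(entry["status"])      # 403 means blocked, 200 means the page was sent
print(entry["user_agent"])  # what the visitor claimed to be
```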

But for now: What is a useragent? I found a long, good explanation here: http://whatsmyuseragent.com/WhatsAUserAgent.asp My short approximate explanation is this:

When you surf the web, you will be using some sort of utility. This is typically a browser. I often use Firefox 8.0.1 on the mac. Firefox 8.0.1 is a useragent. This user agent will identify itself to the web site you visit by leaving a "useragent string". The string I leave is

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:8.0.1) Gecko/20100101 Firefox/8.0.1

This string tells them what utility I used to download the page. Because lots of people use Firefox 8.0.1 on the Mac, the useragent string alone doesn't tell them who I am.

In contrast, when google's crawler visits, it doesn't use Firefox 8.0.1. It uses a different useragent. In fact it has more than one possible agent-- one agent looks at pages, one looks at images. The different crawlers tell me who they are. One says

Googlebot/2.1 ( http://www.googlebot.com/bot.html)

another says

Googlebot-Image/1.0 ( http://www.googlebot.com/bot.html)

Needless to say even the braindead can figure out these are representing themselves as google, and guess they are "bots". But you can also look these up at Google's site. Note: They leave a web site address where you can learn more! These are nicely behaved bots.

Meanwhile a pesky Chinese spider sometimes uses useragent strings like this:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

To block anything with this useragent, my .htaccess file contains a bit of code that looks like this:

Options +FollowSymlinks
RewriteEngine on
# agents
RewriteCond %{HTTP_USER_AGENT} Baidu [nc,or]
RewriteCond %{HTTP_USER_AGENT} ^$ [or]
RewriteCond %{HTTP_USER_AGENT} Ezooms [nc,or]
RewriteCond %{HTTP_USER_AGENT} picscout [nc,or]
RewriteCond %{HTTP_USER_AGENT} java [nc,or]
# methods
RewriteCond %{REQUEST_METHOD} ^PROPFIND$ [NC,OR]
RewriteCond %{REQUEST_METHOD} ^OPTIONS$ [NC,OR]
# referrers
RewriteCond %{HTTP_REFERER} (getty|picscout) [NC]
RewriteRule .* - [F]

I've edited my block down so that I don't fill the comment-- but I've left a few key things in there, and you can ask about why they are there if you like.

For now, the "RewriteCond %{HTTP_USER_AGENT} Baidu [nc,or]" blocks anything that contains "Baidu" in the user agent. So, if something visits and shows my server "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" it is blocked. Period. When I look at my server logs, if the useragent says

"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
The access code will say "403".

Now, in principle, if anything visits using the Baidu useragent-- a program--, it will either
a) tell the truth and leave an agent that contains "Baidu" in it,
b) leave no user agent, in which case, I will see a "-" where the useragent string belongs or
c) leave a false user agent, which is presenting false information or lying.

So (b) is "requesting the information without identifying themselves and receiving it?" but (c) is lying.

Now, if you look at the code I use to block user agents, you'll also see:

"RewriteCond %{HTTP_USER_AGENT} ^$ [or]"

This command will block anything that refuses to provide a useragent. So, that eliminates the possibility that they would gain access by merely not identifying themselves. Because I refuse to supply pages both to visitors with Baidu in their UA string and visitors with no UA string, if someone wants to use the baidu bot to crawl my site, they must lie. They can't just not tell me what useragent they used.
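To make the logic of the block easier to follow, here is a rough Python imitation of the decision it makes. This is only my sketch of the idea (any one matching condition is enough to refuse), not how Apache works internally. The function name is made up, and the word lists are copied from the edited-down block shown earlier.

```python
import re

BLOCKED_AGENT_WORDS = ["baidu", "ezooms", "picscout", "java"]
BLOCKED_METHODS = {"PROPFIND", "OPTIONS"}
BLOCKED_REFERRER = re.compile(r"(getty|picscout)", re.IGNORECASE)

def status_for(user_agent, method="GET", referrer=""):
    """Return 403 if any condition matches, otherwise 200."""
    ua = user_agent.lower()
    if ua == "":                                   # no user agent at all
        return 403
    if any(word in ua for word in BLOCKED_AGENT_WORDS):
        return 403
    if method.upper() in BLOCKED_METHODS:
        return 403
    if BLOCKED_REFERRER.search(referrer):
        return 403
    return 200

honest = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
lying = ("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) "
         "Chrome/14.0.803.0 Safari/535.1")

print(status_for(honest))   # (a) tells the truth
print(status_for(""))       # (b) leaves no user agent
print(status_for(lying))    # (c) presents a false user agent
```

Cases (a) and (b) are refused. Only case (c), the one that presents a false user agent, gets the page.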

The text buddhapi left suggests lying about the USERAGENT to gain access that is otherwise refused violates the DMCA. If that is true, then anything surfing using the baidu-bot and avoiding being blocked would be violating the DMCA. (But this is a legal issue, and you lawyers can decide what the DMCA says. I can only tell you what the block does.)

With regard to picscout, I've tried to block them by blocking connections with "picscout" in the UA string. But I'm not sure that appears in their UA string. Picscout's page doesn't seem to reveal what their UA string is-- which makes things a bit difficult technically, and likely legally. (Legal people could maybe look into whether we can send a letter to groups whose bot refuses to tell us what the UA string is.)

Next post will discuss a related topic, but I think I've now discussed the answer to the question you actually asked.

633

Hawaiian Letters & Lawsuits Forum / Re: Hawaiin Artwork, LLC Files Lawsuit!

« on: December 13, 2011, 10:38:58 AM »

Quote from: summer99 on December 12, 2011, 10:38:23 PM

Quote from: lucia on December 12, 2011, 12:51:32 PM
Quote from: summer99 on December 11, 2011, 06:12:42 PM
I have had a very large increase in the pages read on my blog for about the last two months. It went from 2500 pages to about 11000 - averaging from 350 to 400 each day. Didn't really know why but after checking the history now realize that most of it is searching images with the direct names of my blogs. Someone has spent a lot of hours going through my site page by page looking for things. This might be a heads up to anyone else when they have a high increase in volume which they did not have before.

Thanks again.
I saw that at my blog too. I'm working on a set of scripts to mitigate this for people. For now, at least add the following two lines to the htaccess file in the root for your domain:

# don't permit directory listing
IndexIgnore *

The first line is a comment and does nothing but remind you about the function of the second line. The second line closes off a great big superhighway method some bots use to find the names of all your image files. Backroads still exist. That's why I'm trying to find other methods to help people block off the backroads and catch this sudden grazing in real time. But at least add the

IndexIgnore *

line.

Even aside from any copyright issues, this behavior is bandwidth sucking (i.e. increasing your hosting costs). So you want to stop it anyway.

Sorry to be so computer ignorant but could you give me some idea of how to get to the access file on my blog so I can insert the command you gave to block the bots?

First: the .htaccess isn't technically on your blog. It would be on the server that hosts your blog. Are you self hosted?

Example: I self host using Wordpress software which is installed on the physical computer (i.e. server) owned by my host, which is Dreamhost. But other people use Wordpress hosted by wordpress. Their web addresses often end with "Wordpress.com". If you selfhost like I do, you can probably install an htaccess file, and I can tell you where to find it. If you are on wordpress.com or blogspot or something like that you can't install an .htaccess file.

I'm guessing based on what you wrote that you self host. But I need to know for sure. After that, I can ask how you installed Wordpress in the first place-- and then I can explain how to find .htaccess, explaining all this based on knowledge you already have.

634

Getty Images Letter Forum / Re: Picscout / DMCA question

« on: December 12, 2011, 07:05:49 PM »

Lettered,
Other than with some pedantic nitpicking, I don't disagree with your interpretation of what the court might be saying about robots.txt.

But the reason I was saying that I don't think this is what buddhapi started out discussing is that in his introductory comment, he bolded this from the law:

Quote

It is also a crime under US law to use any trick or false information to gain access to a computer system. Running a robot that pretends to be a user by faking its useragent is crime under US Law because it is using false information to gain access to a computer system.

Notice the bit he quotes says nothing about robots.txt. It says something about faking a user agent.

What I'm going to say next has nothing to do with legalities. It has to do with nuts and bolts of running a web site:

Nothing needs to fake a user agent to get around robots.txt. This is because robots.txt is not a block. (In fact, the reason the court seems to recognize disobeying robots.txt isn't necessarily violating DMCA is that robots.txt is not really a block.)
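Here is a small Python sketch of why robots.txt is a request rather than a block. It uses the standard library's robots.txt reader. The "imagecrawler" agent name and the URL are placeholders, and the robots.txt text is a made-up two-line example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that asks one crawler to stay away from everything.
ROBOTS_TXT = """\
User-agent: imagecrawler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://mydomain.com/images/photo.jpg"   # placeholder URL

# A polite crawler asks this question before fetching anything.
print(parser.can_fetch("imagecrawler", url))      # asked to stay out
print(parser.can_fetch("Googlebot-Image", url))   # not mentioned, so allowed
```

Notice that the check runs inside the crawler, not on the server. A crawler that never bothers to ask simply requests the image, and the server sends it unless something like .htaccess refuses.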

Faking user agents is a way to get around a real, honest to goodness block like the kind in .htaccess on Apache. Also: In discussions above and on other threads, people have been talking about picscout faking useragents.

So while I think a case discussing robots.txt especially as it involves the Wayback machine is interesting, I think maybe people are getting distracted by an interesting discussion of robots.txt and forgetting about the issue of faking useragents.

635

Hawaiian Letters & Lawsuits Forum / Re: Hawaiin Artwork, LLC Files Lawsuit!

« on: December 12, 2011, 12:51:32 PM »

Quote from: summer99 on December 11, 2011, 06:12:42 PM

I have had a very large increase in the pages read on my blog for about the last two months. It went from 2500 pages to about 11000 - averaging from 350 to 400 each day. Didn't really know why but after checking the history now realize that most of it is searching images with the direct names of my blogs. Someone has spent a lot of hours going through my site page by page looking for things. This might be a heads up to anyone else when they have a high increase in volume which they did not have before.

Thanks again.

I saw that at my blog too. I'm working on a set of scripts to mitigate this for people. For now, at least add the following two lines to the htaccess file in the root for your domain:

# don't permit directory listing
IndexIgnore *

The first line is a comment and does nothing but remind you about the function of the second line. The second line closes off a great big superhighway method some bots use to find the names of all your image files. Backroads still exist. That's why I'm trying to find other methods to help people block off the backroads and catch this sudden grazing in real time. But at least add the

IndexIgnore *

line.

Even aside from any copyright issues, this behavior is bandwidth sucking (i.e. increasing your hosting costs). So you want to stop it anyway.
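To show why an open directory listing is such a superhighway, here is a minimal Python sketch of what a bot can do with one. The two HTML snippets are made-up imitations of a listing page before and after the file names are hidden, and the function name is hypothetical.

```python
import re

# Links to common image types inside a listing page.
IMAGE_LINK = re.compile(r'href="([^"]+\.(?:jpe?g|png|gif))"', re.IGNORECASE)

def exposed_image_names(listing_html):
    """Return the image file names a directory listing page gives away."""
    return IMAGE_LINK.findall(listing_html)

# Made-up listing pages, for illustration only.
listing_before = """
<h1>Index of /images</h1>
<a href="photo1.jpg">photo1.jpg</a>
<a href="photo2.png">photo2.png</a>
"""
listing_after = """
<h1>Index of /images</h1>
"""

print(exposed_image_names(listing_before))
print(exposed_image_names(listing_after))
```

One request to the listing hands over every file name at once. With the names hidden, the bot has to find each image some other way, such as by reading your pages one at a time.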

636

Getty Images Letter Forum / Re: Picscout / DMCA question

« on: December 12, 2011, 12:41:40 PM »

Lettered--
I think that case is interesting, but I don't think it's what buddhapi is talking about. In that case, as far as I can see, Harding just didn't do anything to violate robots.txt or access Healthcare Advocates' server illegally.

What's discussed further upstream are things like this:

1) Server X includes a "disallow imagecrawler" in their robots.txt. But imagecrawler crawls anyway. Their crawling would be violating robots.txt. (Lots of bots violate this because robots.txt is like a verbal 'imagecrawler, please go away'. This violation didn't happen in Healthcare Advocates v. Harding.)

2) Server X excludes the 'imagecrawler' useragent in .htaccess. This is a little harder for agents to get around because .htaccess is more like a bouncer that picks up the agent and kicks them out. But a browser or bot can 'fake' their useragent. That is: it can present a type of fake ID. So, maybe imagecrawler presents a fake ID. The bouncer can't recognize them, and lets them in. This didn't happen in Healthcare Advocates v. Harding.

3) Server X excludes everything from the server or ISP where 'imagecrawler' operates. (Example: if you wanted to keep everyone who surfs using Comcast out, you can exclude comcast.com.) This is a bit like the bouncer too. It just looks at a different thing. The image crawler just goes to find another ISP. Maybe they go to ATT. Now they aren't excluded.

None of these three things happened in Healthcare Advocates v. Harding. But some suggest picscout might do them. (I don't know if there is any evidence picscout does do them.)

It seems to me that Healthcare Advocates v. Harding can't tell us anything about the legality of 1-3 because none of those things happened.

637

Getty Images Letter Forum / Re: Picscout / DMCA question

« on: December 10, 2011, 12:04:10 PM »

Quote from: buddhapi on December 10, 2011, 07:13:52 AM

The links below are a bit more in depth. Again, the problem would be proving picscout was sucking your bandwidth. The RIAA fiasco has made it difficult to use IPs as any sort of evidence, not to mention the issue with them being in Israel as well.

Sorry, but I'm not up to speed. I did a little search on RIAA, but I'm not sure which part you consider the fiasco and I don't know why it makes it difficult to use IPs as any sort of evidence. Could you elaborate? Thanks!

638

Getty Images Letter Forum / Re: Picscout / DMCA question

« on: December 09, 2011, 05:59:49 PM »

I can't begin to guess on the legalities but my hosting service is in the US. So, any intrusion is into a machine physically located on US soil.

As for the difficulty proving something, I guess I was thinking more along the lines of hand offs. It could be that picscout might find things on Google's image base. Then, afterwards, someone sends an email to someone somewhere else: that is "person B". That person might not be excluded in robots.txt. Anyway, they aren't a robot so you don't expect them to read or obey robots.txt. If asked the lawyer says the evidence was obtained when "person B" visited google, clicked a link and then got a screen shot. Maybe they saved the html for the page and so on. It would be true enough. Even if something illegal was done and even if it would matter, getting back to how 'person B' knew to do anything and tracing it to anything illegal and demonstrating it in court might be pretty hard.

That's not to say you shouldn't look into it. After all, I could be wrong and it might be that 100% of the evidence came from a picscout bot prowling around disobeying robots.txt.

639

Getty Images Letter Forum / Re: Picscout / DMCA question

« on: December 08, 2011, 09:21:45 PM »

Quote

I can't imagine them bringing such data to court as evidence.
It would certainly seem odd to arrive in a US court with evidence collected "legally offshore", that would have been "illegal" to collect on US soil.
Again, that's assuming that the DMCA prevents the trolls from ignoring attempts to block "robots".

I'm curious how, as a practical matter, you are ever going to be able to demonstrate that any evidence that might be presented was collected illegally.

640

Getty Images Letter Forum / Re: Too all webmasters, do you recall this Getty traffic source?

« on: December 06, 2011, 09:27:06 PM »

buddhapi
I'm partly writing to keep track of what I find. I intend to write a script that helps people implement things simply and also lets them tailor things for their site.

641

Getty Images Letter Forum / Re: Too all webmasters, do you recall this Getty traffic source?

« on: December 06, 2011, 08:12:52 PM »

I ran across something that could leave gettywan.com referrers when googling about the GettyImage issue. This used to exist:
http://gettyunauth.application.gettywan.com/CaseView.aspx
(I'm getting no server errors now.)

I can't recall for sure, but I think it used to return a page the way this one now does:

https://stock.picscout.com/monitoring/getty/login.aspx

It may be that getty or picscout have tools that let authors surf themselves. I don't know what IP would be left in server logs with this.

I've been collecting together a list of good practices and I've been writing a script to automate implementation for people. In the meantime, people who want to do things manually could try these four things:

http://rankexploits.com/protect/2011/12/four-steps-to-slow-down-image-scrapers/

Step 4 would be the one that would block the gettywan referrer because it contains 'getty'. Unfortunately, after writing I realized I'm not entirely sure step 4 works. I think it works but I need to figure out how to spoof referrers so I can verify.
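One possible way to spoof a referrer for this kind of check is sketched below in Python, using only the standard library. It assumes step 4 is a referrer rule along the lines of the (getty|picscout) condition shown in the .htaccess excerpt elsewhere on this page. The target URL is a placeholder for your own site and the function name is made up. The request is only built here, not sent.

```python
import re
import urllib.request

# Assumed pattern, same idea as: RewriteCond %{HTTP_REFERER} (getty|picscout) [NC]
REFERRER_RULE = re.compile(r"(getty|picscout)", re.IGNORECASE)

def build_request_with_referrer(url, referrer):
    """Build (but do not send) a request that claims to come from `referrer`."""
    return urllib.request.Request(url, headers={"Referer": referrer})

fake_referrer = "http://gettyunauth.application.gettywan.com/CaseView.aspx"
req = build_request_with_referrer("http://mydomain.com/blog/name_of_page/",
                                  fake_referrer)

print(req.get_header("Referer"))
print(bool(REFERRER_RULE.search(fake_referrer)))   # should this referrer be caught?
# urllib.request.urlopen(req) would send it. If the rule works, the server
# should answer 403 and urlopen raises urllib.error.HTTPError.
```

If the server returns the page instead of a 403, the referrer rule is not doing what it is supposed to.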

If you know how to edit or create an .htaccess file, the first step is quick and very useful. It impedes crawling through images. It doesn't prevent it because the crawler could find other ways to crawl, but it impedes it.

642

Getty Images Letter Forum / Re: I got hit 2 separate letters 2 images - HawaiiArt.com or Hawaii Art Network LLC

« on: December 05, 2011, 05:04:58 PM »

Where did they file the suit?

643

Getty Images Letter Forum / Re: Bot trap for image browsing

« on: December 02, 2011, 03:12:26 PM »

I saw this bot-trap suggested a few times. I got to it quickly because, believe it or not, I was working on something to bounce bots from Wordpress already, and I'd been using some of the ideas behind the bot-trap already.

You'll see I discussed some fiddling at my blog:
http://rankexploits.com/musings/2011/sorry-bergen-norway/

I even set up a new blog to discuss the fiddling. (Though, very little is discussed at the new one.)
http://rankexploits.com/protect/

But up until this week, I didn't see any big reason to block the bots that do nothing but load images. I thought they were obnoxious, but they didn't spike memory or cpu.

So, suggest away. The sooner the better. I'm not a programmer-- but I can program. One thing I find is that it helps to have a plan before coding rather than coding away and then changing to suit a new plan. Plus, I'm perfectly capable of ignoring an idea if I think there is some reason it should be ignored. Also, if it gets too off topic, we can move the conversation to the "new" blog, and then just post synopses.

644

Getty Images Letter Forum / Bot trap for image browsing

« on: December 02, 2011, 02:03:17 PM »

Hi all,

I got an extortion letter early this week. Oddly, I'd been working on banning bots all last month but I hadn't been worried about images. My issue was cpu and memory, which bots were sucking like crazy. Images are static, so don't cause that problem. Needless to say, I now see the need to try to trap bots that are racing through images. I've been reading various concerns and had some of my own. These include:

1) Not knowing the current IPs of things like PicScout for certain.
2) PicScout (or others in future) changing IPs.
3) Masking of user agents

etc.

So, I want to come up with a way that a blogger, web site or forum host can at least identify bots as they crawl and block them. No way is going to be 100% effective, but this morning I've been ginning up an idea based on the bot trap here:

http://danielwebb.us/software/bot-trap/

That bot trap will not work for catching some programmed image crawling bots I've seen crawling my blog because at least some programmed image crawling bots aren't going to hit a php file on purpose. They are programmed to just crawl through images, leaving php files alone. They also don't make mistakes. (I know how to catch bots that make mistakes on a wordpress blog and would know how to do it here at the forum. More on that later.)

My idea for catching what I might name "pure image browsing bots" is to do this:

1) add directory specific .htaccess files to directories I wish to protect from browsing by image bots. (These would be at least in my image directories. I could put them higher up-- but I need to be sure I know how to avoid screwing up a complicated .htaccess file in that case. Anyway, I really only want to block these guys from images.)
2) add an image or multiple images that I *never* link on purpose to my site. These can be 1 pixel colored images or anything. For now, call that image 'honeyPotImage.jpg'
3) in a top level htaccess, send any bot trying to load 'honeyPotImage.jpg' (or any of those specific images) to a bot-trap written in php. This bot-trap is somewhat similar to the one above.
4) Add the IP of all bots sent to the trap to the appropriate htaccess files.
5) After (4) the bots (or whoever gets trapped) will no longer be able to load images in the protected directories even when they load text. Note: because they can load text, human visitors to my blog will be able to tell me that images vanished. This will let me unban them-- taking care to do this in a way that I think will still protect me from bots.

FWIW: I'll be adding some whitelisted hosts to the tool. My first draft has google and bingbot white listed.
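The steps above can be sketched roughly as follows. The tool itself is planned in php. This is just a Python sketch of the bookkeeping, with made-up function names. It assumes the caller has already looked up the visitor's host name, and the whitelisted host endings are my assumption about what google and bingbot hosts look like, so check them before relying on them.

```python
HONEYPOT_NAME = "honeyPotImage.jpg"

# Assumed host endings for the whitelisted crawlers; verify before use.
WHITELISTED_HOST_ENDINGS = ("googlebot.com", "search.msn.com")

def is_whitelisted(host):
    """True if `host` is one of the whitelisted domains or a subdomain of one."""
    host = host.lower()
    return any(host == end or host.endswith("." + end)
               for end in WHITELISTED_HOST_ENDINGS)

def trap_hit(path, ip, host, banned_ips):
    """Add `ip` to `banned_ips` if it asked for the honeypot image.

    Returns True only when a new ban was recorded.
    """
    if not path.endswith("/" + HONEYPOT_NAME):
        return False            # an ordinary request, nothing to do
    if is_whitelisted(host) or ip in banned_ips:
        return False
    banned_ips.append(ip)
    return True

def deny_lines(banned_ips):
    """Render the bans as lines for a directory-level .htaccess file."""
    return ["deny from " + ip for ip in banned_ips]

# Made-up visitors using documentation-only IP addresses.
banned = []
trap_hit("/images/honeyPotImage.jpg", "192.0.2.10", "unknown.example.net", banned)
trap_hit("/images/honeyPotImage.jpg", "192.0.2.20", "crawl-1.googlebot.com", banned)
trap_hit("/images/real_photo.jpg", "192.0.2.30", "unknown.example.net", banned)
print(deny_lines(banned))
```

Since nothing on the site ever links to the honeypot image, a request for it is a fair sign that the visitor got the file name by crawling the directory rather than by reading a page.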

I'm going to get this working for my blogs. I was wondering if others would be interested in using it once it's working? If yes, I might ask you questions to figure out how to make this user friendly. Also, if people do use it, at some point, we may want to share lists of user agents and IPs we are seeing racing through images.

This sharing could be automated and would help us identify any changes in IP ranges or host addresses and help people at sites 2-N ban the creepy bots as soon as they are detected at site 1.

FWIW: Lots of people at web host forums are complaining about these bots for reasons other than concerns about getting a Getty letter. The bots just race through, suck bandwidth, clutter up server logs and are just a plain old nuisance. Because of the latter, if the system is made convenient, we might be able to get lots of people using it. But first I think I just need to know if anyone would like to volunteer to try it in a week or two after I have it working. Actually, probably by Wed.

645

Getty Images Letter Forum / Re: Re: ELI Website Traffic Statistics Trivia

« on: December 01, 2011, 01:55:14 PM »

Do you know if it's true that picscout's IP is associated with bezeqint.net? I've added "deny from bezeqint.net" to my .htaccess block for now. That might be over-inclusive, but I'll be looking for more.

Oddly, I'd been on a mission to block or bounce bandwidth, memory, cpu hogging crawlers in Oct/Nov. Even apart from the Getty issue it sounds like picscout falls in that category.
