Make your scraper script pretend to be a browser when scraping in PHP

by Yang Yang on December 29, 2008


It’s easy to write a simple scraper script in PHP, but it’s just as easy for data-centric sites to detect and block suspicious runs of page requests made in large numbers over a short period of time. There are usually two ways for a site to detect possible scraping activity.

One is to check that the visiting client identifies itself with a plausible name (its User-Agent), which many amateur scrapers don’t send; the other is to detect whether a large number of pages are fetched within a relatively short time span.
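To make the two checks concrete, here is a rough sketch of what they might look like on the server side. It is only an illustration: the function names, the "starts with Mozilla/" heuristic, and the thresholds (20 requests in 10 seconds) are all assumptions made up for this example, not how any particular site actually does it.

```php
<?php
// Hypothetical server-side checks, for illustration only.

// Check 1: does the client send a browser-like User-Agent?
function has_plausible_user_agent(?string $userAgent): bool
{
    if ($userAgent === null || trim($userAgent) === '') {
        return false; // a simple script may send no User-Agent at all
    }
    // Assumed heuristic: mainstream browsers start their User-Agent with "Mozilla/".
    return stripos($userAgent, 'Mozilla/') === 0;
}

// Check 2: did this client make too many requests within a short window?
// $timestamps holds the Unix times of the client's requests so far.
// The default window and limit are arbitrary example values.
function is_fetching_too_fast(array $timestamps, int $now, int $windowSeconds = 10, int $maxRequests = 20): bool
{
    // Keep only the requests that fall inside the window ending at $now.
    $recent = array_filter($timestamps, function ($t) use ($now, $windowSeconds) {
        return $t > $now - $windowSeconds && $t <= $now;
    });
    return count($recent) > $maxRequests;
}
```

A check like the first one only looks at what the client claims about itself, which is why it can be fooled by sending different headers. The second looks at behaviour, which headers cannot hide.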

It’s almost impossible to deal with the second, but we sure can pretend to be someone we are not, can’t we? The solution is to use cURL, the PHP Client URL Library. You don’t just use cURL, though; you also specify the HTTP request headers and the referer:

$url = 'http://www.somesite.com';

$headers = array(
	"Accept-Language: en-us",
	"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)",
	"Connection: Keep-Alive",
	"Cache-Control: no-cache"
	);

$referer = 'http://www.google.com/search';

$get = curl_init($url);

// These headers make the scraper look like IE6 on Windows XP. You can pretend to be
// another browser as long as you know the headers it sends.
curl_setopt($get, CURLOPT_HTTPHEADER, $headers);

// Tell the server we are a visitor who arrived here through a Google search.
curl_setopt($get, CURLOPT_REFERER, $referer);

// Optional: CURLOPT_USERAGENT sets the User-Agent header on its own. $headers already
// sets it, so it is not needed this time.
// curl_setopt($get, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

curl_setopt($get, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it

$html = curl_exec($get); // send the request; $html is false on failure

Good to go. Now you know why server analytics and traffic statistics are just a bunch of well organized lies. Welcome to the real world!
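If you scrape more than one page, it can be handy to wrap the same setup in a couple of functions. The following is a minimal sketch rather than a definitive implementation: the names build_browser_headers() and fetch_as_browser() are made up for this example, and the only cURL calls used are curl_init(), curl_setopt() and curl_exec().

```php
<?php
// Build the request header list for a given User-Agent string.
// (Hypothetical helper; the header set mirrors the one used above.)
function build_browser_headers(string $userAgent, string $language = 'en-us'): array
{
    return array(
        'Accept-Language: ' . $language,
        'User-Agent: ' . $userAgent,
        'Connection: Keep-Alive',
        'Cache-Control: no-cache',
    );
}

// Fetch $url with the given headers and referer.
// Returns the response body as a string, or false on failure.
function fetch_as_browser(string $url, array $headers, string $referer)
{
    $get = curl_init($url);
    if ($get === false) {
        return false;
    }

    curl_setopt($get, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($get, CURLOPT_REFERER, $referer);
    curl_setopt($get, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it

    return curl_exec($get);
}

// Example usage (commented out because it makes a real request):
// $headers = build_browser_headers('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)');
// $html = fetch_as_browser('http://www.somesite.com', $headers, 'http://www.google.com/search');
```

Keeping the header list in its own function means switching to another browser's identity is a one-line change at the call site.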


5 comments

jibblerish December 6, 2009 at 2:34 pm

“Now you know why server analytics and traffic statistics are just a bunch of well organized lies. Welcome to the real world!” – Just True.
Also, could you tell us what the closest solution is for the problem of PHP cURL not running JavaScript? Or is there just nothing that can be done about something like that?

Thanks


Shock Marketer December 8, 2009 at 10:37 am

Could you also have a tutorial on regex when screen scraping?


Yang Yang February 26, 2010 at 3:23 pm

I’ll have to charge on that. ;)


Clayton McIlrath January 18, 2011 at 5:36 am

This post is really helpful, but I’m trying to mimic a session for the server I’m scraping. I have this all working correctly in my curl and it’s getting the cookie/session just fine:

curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, DOCROOT.'test/test-cookies.txt');

However, my cookie looks like this:
#HttpOnly_www.mcd911.net FALSE / FALSE 0 ASP.NET_SessionId qz11nb450k0ta355ldetxd55
This kind of feels like some sort of obfuscation trick with the HttpOnly prefix, but I can’t really tell for sure. Any insight into what I should do to get this cookie set when I post to the server?


Thomas October 21, 2011 at 2:05 am

Thanks for the post. It really helped me a lot while I made my scraper :)
