Pretend your scraper script as a browser when scraping in PHP

It’s easy to write a simple scraper script in PHP, but it’s just as easy for data-centric sites to detect and block suspicious bursts of page requests made in large numbers over a short period of time. There are usually two ways for a site to detect possible scraping activity.

One is to check that the visiting client identifies itself with a plausible name (its User-Agent), which many amateur scrapers don’t send. The other is to detect whether a large number of pages are fetched within a relatively short time span.

It’s almost impossible to get around the second, but we sure can pretend to be someone we are not, can’t we? The solution is to use cURL, the PHP Client URL Library. Beyond simply using cURL, you also specify the HTTP request headers and the referer:

$url = 'http://www.somesite.com';

// Request headers as a real browser would send them (here: IE6 on Windows XP).
$headers = array(
	"Accept-Language: en-us",
	"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)",
	"Connection: Keep-Alive",
	"Cache-Control: no-cache"
);

$referer = 'http://www.google.com/search';

$get = curl_init($url);

// Make the scraper look like IE6 on Windows XP. You can pretend to be any
// other browser, as long as you know the headers it sends.
curl_setopt($get, CURLOPT_HTTPHEADER, $headers);

// Tell the server we are a visitor who arrived here through a Google search.
curl_setopt($get, CURLOPT_REFERER, $referer);

// Optional: CURLOPT_USERAGENT sets only the User-Agent header. Here it would
// pass along the User-Agent of whoever requested this script. We already set
// User-Agent in $headers, so we don't need it this time.
// curl_setopt($get, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

// Return the page as a string instead of printing it, then send the request.
curl_setopt($get, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($get);
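If you scrape more than one page, it may be handy to keep the disguise in one place. Here is a minimal sketch of that idea, not a definitive implementation. The function names browser_options() and fetch_as_browser() are made up for this example, and following redirects (CURLOPT_FOLLOWLOCATION) is an extra assumption that the code above doesn't make:

```php
<?php
// Hypothetical helper: bundles the "browser disguise" into one options array
// that can be passed to curl_setopt_array().
function browser_options(array $headers, string $referer): array
{
    return array(
        CURLOPT_HTTPHEADER     => $headers,  // browser-like request headers
        CURLOPT_REFERER        => $referer,  // where we claim to come from
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,      // assumption: follow redirects like a browser
    );
}

// Hypothetical helper: fetches $url with the disguise applied.
// Returns the response body as a string, or false if the request failed.
function fetch_as_browser(string $url, array $headers, string $referer)
{
    $get = curl_init($url);
    curl_setopt_array($get, browser_options($headers, $referer));
    return curl_exec($get);
}
```

With the variables from above, you would call it as $page = fetch_as_browser($url, $headers, $referer); and check for false before parsing the result.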

Good to go. Now you know why server analytics and traffic statistics are just a bunch of well organized lies. Welcome to the real world!

6 thoughts on “Pretend your scraper script as a browser when scraping in PHP”

  1. “Now you know why server analytics and traffic statistics are just a bunch of well organized lies. Welcome to the real world!” – Just True.
    Also, could you tell us what the closest solution is for the issue of running JavaScript with PHP cURL? Or is there just nothing that can be done about something like that?

    Thanks

  2. This post is really helpful, but I’m trying to mimic a session for the server I’m scraping. I have this all working correctly in my curl and it’s getting the cookie/session just fine:

    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, DOCROOT.'test/test-cookies.txt');

    However, my cookie looks like this:
    #HttpOnly_www.mcd911.net FALSE / FALSE 0 ASP.NET_SessionId qz11nb450k0ta355ldetxd55
    This kind of feels like an obfuscation trick with the HttpOnly prefix, but I can’t really tell for sure. Any insight into what I should do to get this cookie set when I post to the server?
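A note on the cookie line above: the #HttpOnly_ prefix is not obfuscation. It is how cURL marks a cookie that the server flagged as HttpOnly when it writes its cookie jar, which uses the Netscape cookie file format. The rough sketch below reads one such line. The name parse_cookie_jar_line() is made up for this example, and splitting on any whitespace is an assumption (real jar files separate fields with tabs):

```php
<?php
// Hypothetical helper: parses one line of a cURL cookie jar file.
// Returns null for blank lines, comment lines, and lines with fewer
// than 7 fields (for example a cookie with an empty value).
function parse_cookie_jar_line(string $line): ?array
{
    $line = trim($line);
    $httpOnly = false;

    if (strpos($line, '#HttpOnly_') === 0) {
        // A real cookie: cURL prefixes HttpOnly cookies with this marker.
        $httpOnly = true;
        $line = substr($line, strlen('#HttpOnly_'));
    } elseif ($line === '' || $line[0] === '#') {
        return null; // blank line or ordinary comment
    }

    // Fields: domain, subdomain flag, path, secure flag, expiry, name, value.
    $fields = preg_split('/\s+/', $line, 7);
    if (count($fields) < 7) {
        return null;
    }

    return array(
        'domain'    => $fields[0],
        'path'      => $fields[2],
        'name'      => $fields[5],
        'value'     => $fields[6],
        'http_only' => $httpOnly,
    );
}

// To send the saved cookie back on the next request, read the same file:
// curl_setopt($ch, CURLOPT_COOKIEFILE, DOCROOT.'test/test-cookies.txt');
```

CURLOPT_COOKIEJAR only saves cookies (the file is written when the handle is closed), while CURLOPT_COOKIEFILE loads them for sending, so setting both to the same file is the usual way to keep a session across requests.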
