It’s easy to write a simple scraper script in PHP, but it’s just as easy for data-centric sites to detect and block suspicious runs of page requests made in large numbers over a short period. There are usually two ways for a site to detect possible scraping.
One is to check that the visiting client identifies itself with a plausible name (its User-Agent), which a lot of amateur scrapers don’t; the other is to detect a large number of page fetches within a relatively short time span.
It’s hard to do much about the second, but we sure can pretend to be someone we are not, can’t we? The solution is to use cURL, the PHP Client URL Library. You don’t just use cURL, though; you also specify the HTTP request headers and the referer:
$url = 'http://www.somesite.com';
$headers = array(
    "Accept-Language: en-us",
    "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)",
    "Connection: Keep-Alive",
    "Cache-Control: no-cache"
);
$referer = 'http://www.google.com/search';

$get = curl_init($url);
// Make this scraper look like IE6 on Windows XP. You can pretend to be
// any other browser, as long as you know the correct headers.
curl_setopt($get, CURLOPT_HTTPHEADER, $headers);
// Lie to the server that we are a visitor who arrived here through a Google search.
curl_setopt($get, CURLOPT_REFERER, $referer);
// This one's a plus: it specifically sets User-Agent in the headers.
// As we have already set that in $headers, we don't need it this time.
// curl_setopt($get, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
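The snippet above only configures the handle; it never sends the request. As a rough sketch of how the pieces might fit together, the two functions below build the header list and then run the fetch. The names build_browser_headers and fetch_as_browser are made up for this illustration, and the redirect and 30-second timeout settings are my own assumptions rather than anything the original setup requires.

```php
<?php
// Hypothetical helper: builds the request headers for a given browser identity.
function build_browser_headers(string $userAgent, string $language = 'en-us'): array
{
    return array(
        "Accept-Language: $language",
        "User-Agent: $userAgent",
        "Connection: Keep-Alive",
        "Cache-Control: no-cache",
    );
}

// Hypothetical helper: fetches $url while sending the given headers and referer.
// Returns the response body as a string, or false if the transfer failed.
function fetch_as_browser(string $url, array $headers, string $referer)
{
    $get = curl_init($url);
    curl_setopt($get, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($get, CURLOPT_REFERER, $referer);
    curl_setopt($get, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($get, CURLOPT_FOLLOWLOCATION, true);  // assumption: we want to follow redirects
    curl_setopt($get, CURLOPT_TIMEOUT, 30);           // assumption: give up after 30 seconds

    $body = curl_exec($get);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($get));
    }
    return $body;
}

// Example usage (left commented out so that including this file sends no request):
// $headers = build_browser_headers('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)');
// $html = fetch_as_browser('http://www.somesite.com', $headers, 'http://www.google.com/search');
```

Keeping the header list in its own function makes it easy to swap in another browser identity without touching the fetch logic.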
Good to go. Now you know why server analytics and traffic statistics are just a bunch of well organized lies. Welcome to the real world!
6 thoughts on “Pretend your scraper script as a browser when scraping in PHP”
“Now you know why server analytics and traffic statistics are just a bunch of well organized lies. Welcome to the real world!” – Just True.
Could you also have a tutorial on regex when screen scraping?
I’ll have to charge on that. 😉
This post is really helpful, but I’m trying to mimic a session for the server I’m scraping. I have this all working correctly in my curl and it’s getting the cookie/session just fine:
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, DOCROOT.'test/test-cookies.txt');
However, my cookie file looks like this:
#HttpOnly_www.mcd911.net FALSE / FALSE 0 ASP.NET_SessionId qz11nb450k0ta355ldetxd55
That kind of feels like some kind of obfuscation trick with the HttpOnly prefix, but I can’t really tell for sure. Any insight into what I should do to get this cookie set when I post to the server?
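For what it’s worth, the #HttpOnly_ prefix is not obfuscation: it is how libcurl marks a cookie that the server flagged as HttpOnly when it writes the cookie jar in Netscape format. The sketch below parses one such line to show that the rest is an ordinary cookie entry. The function name parse_cookie_jar_line is invented for this example, and it assumes the fields are separated by whitespace and that the cookie value contains no spaces.

```php
<?php
// Hypothetical helper: parses one line of a Netscape-format cookie file
// as written by CURLOPT_COOKIEJAR. Returns null for blank lines, ordinary
// comments, and lines with too few fields.
function parse_cookie_jar_line(string $line): ?array
{
    $line = trim($line);
    $httpOnly = false;

    if (strpos($line, '#HttpOnly_') === 0) {
        // libcurl's marker for an HttpOnly cookie; the domain follows directly.
        $httpOnly = true;
        $line = substr($line, strlen('#HttpOnly_'));
    } elseif ($line === '' || $line[0] === '#') {
        return null;
    }

    // Assumption: whitespace-separated fields, no spaces inside the value.
    $fields = preg_split('/\s+/', $line);
    if (count($fields) < 7) {
        return null;
    }

    return array(
        'domain'             => $fields[0],
        'include_subdomains' => $fields[1] === 'TRUE',
        'path'               => $fields[2],
        'secure'             => $fields[3] === 'TRUE',
        'expires'            => (int) $fields[4], // 0 means a session cookie
        'name'               => $fields[5],
        'value'              => $fields[6],
        'http_only'          => $httpOnly,
    );
}

// To send the saved cookie back on a later request, point cURL at the same file:
// curl_setopt($ch, CURLOPT_COOKIEFILE, DOCROOT.'test/test-cookies.txt');
```

Note that cURL only writes the jar file when the handle is closed or released, so the file may look empty if it is read while the handle that set CURLOPT_COOKIEJAR is still open.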
Thanks for the post. It really helped me a lot while I made my scraper 🙂
Thanks for the post. I am new to the scraping thing and this sure did help 😀