It's easy to write a simple scraper script in PHP, but it's also easy for data-centric sites to detect and block suspiciously sustained page accesses made in large numbers over a short period of time. There are usually two ways for a site to detect possible scraping activity.
One is to check that the visiting client identifies itself with a plausible name (the User-Agent header), which many amateur scrapers don't send; the other is to detect when a large number of page fetches arrive within a relatively short time span.
It's almost impossible to get around the second, but we can certainly pretend to be someone we are not, can't we? The solution is cURL, the PHP Client URL library. Beyond simply using cURL, you also specify the HTTP request headers and the referer:
$url = 'http://www.somesite.com';
$headers = array(
    "Accept-Language: en-us",
    "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)",
    "Connection: Keep-Alive",
    "Cache-Control: no-cache"
);
$referer = 'http://www.google.com/search';

$get = curl_init($url);
// Pretend this scraper is the IE6 browser on Windows XP; you can pretend
// to be any other browser as long as you know its correct headers.
curl_setopt($get, CURLOPT_HTTPHEADER, $headers);
// Lie to the server that we are a visitor who arrived via a Google search.
curl_setopt($get, CURLOPT_REFERER, $referer);
// curl_setopt($get, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
// Optional: CURLOPT_USERAGENT sets the User-Agent specifically, but since
// we have already put it in $headers we don't need it this time.
// Return the page as a string instead of printing it, then fetch it.
curl_setopt($get, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($get);
curl_close($get);
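Since the only part that changes when you pretend to be a different browser is the User-Agent string, you could wrap the header list in a small helper. This is just a sketch of that idea; the function name and the example Firefox string below are my own, not from the snippet above.

```php
<?php
// Hypothetical helper: build the same header array as above for any
// browser identity, so switching the pretended client is one string change.
function browser_headers($user_agent)
{
    return array(
        "Accept-Language: en-us",
        "User-Agent: " . $user_agent,
        "Connection: Keep-Alive",
        "Cache-Control: no-cache"
    );
}

// e.g. pretend to be an old Firefox on Windows XP instead of IE6
// (example UA string, assumed for illustration)
$headers = browser_headers(
    'Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0'
);
```

You would then pass `$headers` to `curl_setopt($get, CURLOPT_HTTPHEADER, $headers)` exactly as before.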
Good to go. Now you know why server analytics and traffic statistics are just a bunch of well-organized lies. Welcome to the real world!