March Madness: Product Info Screen Scraping!
Since everyone else appears to be at SXSW, I suppose I’ll have to step up for today’s March Madness. I bring you: a screen scraper for retail products on ecommerce sites.
While I’m hardly the first person to write such a tool, finding useful examples or libraries among the hundreds of pages of screen scraper spam has proven difficult. I ended up writing one from scratch in PHP using the DOMDocument class.
The goal of the scraper is to come up with the product title, price, and three most likely product photos from any given product URL. To make it a bit faster (it’s pretty painfully slow), I attempt to filter out images which are obviously not product photos: those which are very wide or very tall, and those which are not displayed in the browser. Then, for a bit of extra fun, it sorts the image array by each image’s “likeliness” to be a product photo. Obviously it needs some refining to actually be useful.
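To make the filter-and-rank idea concrete before the full script, here is a stripped-down sketch that works on image dimensions you already have in hand, so it needs no network access. The function names `scoreImage` and `rankImages` and the sample data are made up for illustration; the thresholds mirror the ones used in the script, but the logic is simplified to a single size check.

```php
<?php
// Hypothetical helper: returns a likeliness score, or null to reject the image.
function scoreImage($w, $h, $minW, $minH)
{
    if ($w <= 0 || $h <= 0 || $w < $minW || $h < $minH) {
        return null; // missing dimensions, or too small to be a product photo
    }
    $ratio = $w / $h;
    if ($ratio > 3 || $ratio < 0.33) {
        return null; // banner- or spacer-shaped
    }
    $likely = 1.0;
    if ($ratio >= 2 || $ratio <= 0.5) {
        $likely *= 0.5; // elongated images are less likely candidates
    }
    return $likely;
}

// Hypothetical helper: keeps the $limit best-scoring images, most likely first.
function rankImages(array $images, $minW, $minH, $limit = 3)
{
    $scored = array();
    foreach ($images as $img) {
        $score = scoreImage($img['w'], $img['h'], $minW, $minH);
        if ($score !== null) {
            $scored[] = array('url' => $img['url'], 'likely' => $score);
        }
    }
    // Sort by score, highest first
    usort($scored, function ($a, $b) {
        if ($a['likely'] == $b['likely']) {
            return 0;
        }
        return ($a['likely'] < $b['likely']) ? 1 : -1;
    });
    return array_slice($scored, 0, $limit);
}

// Made-up sample data
$candidates = array(
    array('url' => 'banner.png', 'w' => 900, 'h' => 150), // ratio 6, rejected
    array('url' => 'wide.jpg',   'w' => 400, 'h' => 200), // ratio 2, score halved
    array('url' => 'photo.jpg',  'w' => 300, 'h' => 300), // square, full score
    array('url' => 'icon.gif',   'w' => 16,  'h' => 16),  // too small, rejected
);
$top = rankImages($candidates, 150, 150);
```

With a 150×150 minimum, the banner and the icon are thrown out, and photo.jpg ranks ahead of wide.jpg.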
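<?php
$time1 = microtime(true);
//placeholder inputs (assumed values): the product page URL and the minimum image size in pixels
$link = 'http://www.example.com/product-page';
$width = 150;
$height = 150;
$thumbs = array();
$price = '';
$dom = new DOMDocument();
//the @ suppresses the warnings DOMDocument emits on malformed real-world HTML
if (@$dom->loadHTMLFile($link)) {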
//get the title
$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
//get the images
$images = $dom->getElementsByTagName('img');
//get the images most likely to be product photos
foreach ($images as $img) {
$skip = FALSE;
$likely = 1;
unset($img_height); unset($img_width);
//get the style on the images
$style = $img->getAttribute('style');
//if the style is set to not display, it's not worth checking
if(preg_match('/display\s*:\s*none/i',$style)) $skip = TRUE;
//attempt to get the height and width if they are set
$img_height = $img->getAttribute('height');
$img_width = $img->getAttribute('width');
//throw out images whose declared size is below the minimum
if(is_numeric($img_height) && $img_height<$height) {
$skip = TRUE;
} else if(is_numeric($img_width) && $img_width<$width) {
$skip = TRUE;
} else if(is_numeric($img_height) && is_numeric($img_width)) {
//throw out very wide or very tall images
if(($img_width/$img_height)>3 || ($img_width/$img_height)<0.33) $skip = TRUE;
}
//if it's not already thrown out
if($skip === FALSE) {
//rel2abs() is a helper (not included in this listing) that resolves a relative src against the page URL
//getimagesize() has to fetch each remote image, which is likely the slow part
if (
($url = rel2abs($img->getAttribute('src'), $link)) &&
($i = getimagesize($url)) &&
$i[0] >= ($width-10) &&
$i[1] >= ($height-10)
) {
//if the aspect ratio is 2:1 or more extreme, it's unlikely that it's a product image
if($i[0]/$i[1]>=2 || $i[0]/$i[1]<=0.5) {
$likely = $likely*0.5;
}
$thumbs[] = array('url'=>$url,'likely'=>$likely);
}
}
}
//sort the array by likeliness, most likely first (done once, after the loop)
if (!empty($thumbs)) {
$likeliness = array();
foreach ($thumbs as $key => $row) {
$likeliness[$key] = $row['likely'];
}
array_multisort($likeliness,SORT_DESC,$thumbs);
}
//gross hack to try to find price
$xmlstring = $dom->saveHTML();
if(preg_match_all('/\$[0-9\.]+/',$xmlstring,$matches)){
$price = $matches[0][0];
}
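//output to browser (the tags used here are assumed)
echo "<h1>" . htmlspecialchars($title) . "</h1>";
echo "<p>" . htmlspecialchars($price) . "</p>";
foreach ($thumbs as $thumb){
$src = htmlspecialchars($thumb['url']);
echo "<img src=\"$src\" />";
}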
}
$time2 = microtime(true);
$diff = $time2-$time1;
echo "Script executed in $diff seconds";
?>
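The price lookup is the weakest part of the script: it takes the first dollar-looking string anywhere in the page source. One possible refinement, offered only as a sketch, is to look first at elements whose `class` or `id` mentions "price" and to fall back to the whole body text. The function name `findPrice`, the "price" naming convention, and the sample HTML are all assumptions; plenty of real pages will not follow them.

```php
<?php
// Hypothetical helper: returns the first dollar amount found, or null.
function findPrice($html)
{
    if ($html === '') {
        return null;
    }
    $dom = new DOMDocument();
    // The @ silences warnings about malformed markup
    if (!@$dom->loadHTML($html)) {
        return null;
    }
    // A dollar sign, digits, optional thousands groups, optional cents
    $pattern = '/\$\s?[0-9]+(?:,[0-9]{3})*(?:\.[0-9]{2})?/';

    // First pass: elements whose class or id mentions "price" (an assumed convention)
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//*[contains(@class, "price") or contains(@id, "price")]');
    foreach ($nodes as $node) {
        if (preg_match($pattern, $node->textContent, $m)) {
            return $m[0];
        }
    }

    // Fallback: first match anywhere in the text of the body
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null && preg_match($pattern, $body->textContent, $m)) {
        return $m[0];
    }
    return null;
}

// Made-up sample page: the shipping notice comes first but is not the price
$html = '<html><body><p>Free shipping over $50</p>'
      . '<div class="product-price">$1,299.99</div></body></html>';
echo findPrice($html), "\n"; // prints $1,299.99
```

This is still a heuristic: it only handles dollar amounts, and a page with several "price" elements (list price, sale price, related products) will return whichever comes first in the document.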
Did you have any thoughts about a nicer way to get price information?