Web Scraping consists in extracting information from a webpage by using software or scripts that we write.
This software usually analyzes the HTML structure of a webpage and extracts information from those tags of interest. So, before writing any script or start using any kind of software, we must take a look first to how the HTML of the webpage we want to take information from is organized.
Let’s see an example of a scraper that will take information from smart TVs at Amazon:
The first thing we must notice is the url as it will be used in our scraper to take the information we want.
Once in the webpage, we start by analyzing the HTML structure.
We will see that all prices are listed inside a div
with the s-main-slot
class, and at the same time each product is placed inside another div
with the data-uuid
attribute.
We have a good starting point with this: now we know in which part of the HTML is the content we need.
In our example we are going to extract the name, price, and url of each product, so we must analyze where is all this information inside each product’s div
:
- Price:
span
with thea-price-whole
class - Name:
a
(link) with thea-link-normal
ID inside an H2 with thea-size-mini
class - Link:
a
(link) with thea-link-normal
ID inside an H2 with thea-size-mini
class
As you can see, the name and link are inside the same tag. The difference is that the name is inside the tag and the link is in the href
attribute of the link.
So, now that we know how the HTML is structured, it is time to start entering the code and create a scraper using PHP.
Scraper written in PHP
The first thing to do is to start a new project using composer, so we open a terminal and write:
composer init
To start scraping we will use Goutte, which we will install using composer:
composer require fabpot/goutte
The good thing about this scraper is that it allows us to work with CSS selectors, turning content filtering into a quite intuitive task.
With this short script we will be able to scrap the Amazon’s deals page:
<?php require_once '../vendor/autoload.php'; use Goutte\Client; use Symfony\Component\DomCrawler\Crawler; $client = new Client(); $crawler = $client->request('GET', 'https://www.amazon.es/s?k=televisores+smart+tv&__mk_es_ES=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=21Y28T9QMH8LQ&sprefix=televisores%2Caps%2C173&ref=nb_sb_ss_i_1_11'); $crawler = $crawler->filter('div.s-main-slot div[data-uuid]')->each(function (Crawler $node){ try { $price = $node->filter('span.a-price-whole')->first()->html(); $title = $node->filter('h2.a-size-mini a.a-link-normal')->first()->text(); $url = $node->filter('h2.a-size-mini a.a-link-normal')->first()->attr('href'); echo 'PRODUCT: ' . $title . PHP_EOL; echo 'PRICE: ' . $price . PHP_EOL; echo 'URL: ' . $url . PHP_EOL; echo '...............' . PHP_EOL; } catch (InvalidArgumentException $e) { echo 'FAILED GETTING A PRODUCT ' . $e->getMessage() . $e->getMessage(); } });
For this example we show the information by the terminal:
PRODUCT: Television LED 50" 4K INFINITON Smart TV-Android TV (TDT2, HDMI, VGA, USB) (50 Pulgadas) PRICE: 289,00 URL: /gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0702342WW2SPT5PAQCT&url=%2FTelevision-INFINITON-Smart-TV-Android-Pulgadas%2Fdp%2FB07VXQQ5JD%2Fref%3Dsr_1_1_sspa%3F__mk_es_ES%3D%25C3%2585M%25C3%2585%25C5%25BD%25C3%2595%25C3%2591%26crid%3D21Y28T9QMH8LQ%26dchild%3D1%26keywords%3Dtelevisores%2Bsmart%2Btv%26qid%3D1603434010%26sprefix%3Dtelevisores%252Caps%252C173%26sr%3D8-1-spons%26psc%3D1&qualifier=1603434010&id=4114882010574106&widgetName=sp_atf ............... PRODUCT: RCA RS32H2 Android TV (32 Pulgadas HD Smart TV con Google Assistant), Chromecast Incorporado, HDMI+USB, Triple Tuner, 60Hz PRICE: 189,99 URL: /gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A08846331IQPR268E4WHF&url=%2FRCA-Pulgadas-Assistant-Chromecast-Incorporado%2Fdp%2FB082MMBRVP%2Fref%3Dsr_1_2_sspa%3F__mk_es_ES%3D%25C3%2585M%25C3%2585%25C5%25BD%25C3%2595%25C3%2591%26crid%3D21Y28T9QMH8LQ%26dchild%3D1%26keywords%3Dtelevisores%2Bsmart%2Btv%26qid%3D1603434010%26sprefix%3Dtelevisores%252Caps%252C173%26sr%3D8-2-spons%26psc%3D1&qualifier=1603434010&id=4114882010574106&widgetName=sp_atf ............... PRODUCT: CHiQ Televisor Smart TV LED 40 Pulgadas FHD, HDR, WiFi, Bluetooth, Youtube, Netflix, Prime Video, 3 x HDMI, 2 x USB - L40H7N PRICE: 249,99 URL: /gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0618336VEK5HQU91ZMY&url=%2FCHiQ-L40H7N-Televisi%25C3%25B3n-Bluetooth-Sintonizador%2Fdp%2FB07ZJBRLF1%2Fref%3Dsr_1_3_sspa%3F__mk_es_ES%3D%25C3%2585M%25C3%2585%25C5%25BD%25C3%2595%25C3%2591%26crid%3D21Y28T9QMH8LQ%26dchild%3D1%26keywords%3Dtelevisores%2Bsmart%2Btv%26qid%3D1603434010%26sprefix%3Dtelevisores%252Caps%252C173%26sr%3D8-3-spons%26psc%3D1&qualifier=1603434010&id=4114882010574106&widgetName=sp_atf ...............
What this script does in the end is to send a HTTP petition to the web server, and so the web server will answer with the HTML we will use to extract the information. This is a very efficient method to use with those webpages using a static HTML.
Scraper using Puppeteer JS
Things may get a little more complicated with those webs using Javascript to render, or Ajax requests to complete some of the information. With this type of scraping we don’t run the Javascript code on these webpages: we only take their HTML.
This can be seen in the Amazon flash deals’ webpage. This page uses Ajax requests every now and then to check the deals status, and content is not rendered until those requests are made. If we stop the Javascript execution in our browser we will see the page is shown without content:
A way to solve this problem is to use the browser engine as if there was a user visiting the webpage. One of the most useful tools for this is Puppeteer.
Puppeteer is a Node.js library with a broad range of different uses, such as to take screenshots of a webpage and create a pdf, to automate UI tests…
In order to do all this, Puppeteer comes with a Chromium version by default. Puppeteer will create a Chromium instance with the specified website to interact with.
To begin with our scraper we will start a new project using npm. We open a terminal and write:
npm init
Now we install Puppeteer in our project by writing in our terminal the following:
npm install --save puppeteer
Now it is time to write our scraper in Javascript:
const puppeteer = require('puppeteer'); (async () => { const browser = await puppeteer.launch() const page = await browser.newPage() await page.goto('https://www.amazon.es/gp/goldbox/ref=gbpp_itr_s-4_caa3_TDLDS?&gb_f_deals1=dealTypes:LIGHTNING_DEAL&gb_ttl_deals1=Ofertas%2520flash&ie=UTF8') const result = await page.evaluate(() => { var data = []; document.querySelectorAll('div#widgetContent div.singleCell').forEach(product => { data.push({ "price": product.querySelector('div.priceBlock').innerText, "name": product.querySelector('a.singleCellTitle').innerText, "url": product.querySelector('a.singleCellTitle').getAttribute('href') }); }); return data; }); console.log(result); await browser.close() })()
And as we did with our scraper written in PHP, we will show the result on the terminal:
[ { price: '5,94 € - 8,49 €', name: 'Pecute Cortauña Perro, Cortauñas de uñas para Perros Gatos Conej...', url: 'https://www.amazon.es/Cortau%C3%B1as-Profesional-Inoxidable-Medianos-Grandes/dp/B01N9TAYLH/ref=gbps_tit_s-5_4603_538e1f2a?smid=A13EKVTW8DBOVU&pf_rd_p=2aff475b-2ec2-456a-8bbb-e33958304603&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A1AT7YVPFBWXBL&pf_rd_r=XWNV9E33ZERJQXPH4FYB' }, { price: '9,59 €', name: 'Luz Nocturna Infantil (2 Pack) OMERIL Luz Noche con Luz Sen...', url: 'https://www.amazon.es/Nocturna-OMERIL-Quitamiedos-Habitaci%C3%B3n-Dormitorio/dp/B07HR89KC6/ref=gbps_tit_s-5_4603_93005d34?smid=A379JFQXZJDMZG&pf_rd_p=2aff475b-2ec2-456a-8bbb-e33958304603&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A1AT7YVPFBWXBL&pf_rd_r=XWNV9E33ZERJQXPH4FYB' }, { price: '14,24 € - 14,99 €', name: 'TUUHAW Braguita de Talle Alto Algodón para Mujer Pack de 5 Cu...', url: 'https://www.amazon.es/TUUHAW-Braguita-Algod%C3%B3n-Culotte-Cintura/dp/B07TM271V8/ref=gbps_tit_s-5_4603_8d826976?smid=A3AH2Y3JH6APFA&pf_rd_p=2aff475b-2ec2-456a-8bbb-e33958304603&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A1AT7YVPFBWXBL&pf_rd_r=XWNV9E33ZERJQXPH4FYB' } ]
As you can see, both PHP and Javascript scraper code examples are very simple. But the libraries used in each example are much more powerful and can perform more functions than those shown in the examples.
I would like to invite you to visit Goutte and Puppeteer documents so you can see all they can do by yourselves.
Some aspects to consider
In the best of cases, we can extract the information we want whenever we want it by using our scrapers, but this will not be the usual thing. There are some pages that can use several techniques to make things difficult for scrapers, both at a server level or by means of the systems developed inside the web.
Servers may limit the number of requests made in a period of time, or even cut the access to those IP directions making many requests. So we must be very careful with the number of requests made by our scrapers and with the number of times we execute them.
In the case of Puppeteer, running the default Chromium instance in the headless mode, there are some techniques using Javascript to detect browsers in this headless mode.
Some sites may block our scrapers by checking the User Agent of the HTTP request. Both with Goutte as with Puppeteer we can modify the User Agent.
A change in the page design will make the scraper stop working.