Web Scraping With PHP
Get ready for an interesting ride! PHP is a popular server-side language that is widely used to scrape the web. With PHP, you can be set up and scraping the web in a matter of minutes. Of course, you may run into errors or improperly formatted JSON with a quick, thrown-together script. However, if you take your time and set things up properly, you can build a beast of a scraper. PHP makes it easy to tackle complex scraping projects with libraries such as GuzzleHttp, DomCrawler, and CssSelector, to name a few. That flexibility is what makes PHP one of the best languages for scraping the web in 2025.
Using cURL with PHP
Let's get started with one of the most basic ways to pull information from the web: cURL. cURL is a library for transferring data to or from a server, and PHP exposes it through its built-in cURL extension. Let's write a simple PHP cURL web scraping script to pull the HTML from books.toscrape.com.
<?php
// Initialize cURL session
$ch = curl_init();
// Set the URL to books.toscrape.com
curl_setopt($ch, CURLOPT_URL, "http://books.toscrape.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Execute the request and get the response
$response = curl_exec($ch);
// Close cURL session
curl_close($ch);
// Print the raw HTML
echo $response;
In this PHP web scraping example, the script does the following:
- Connects to books.toscrape.com
- Gets the HTML content
- Prints the HTML to the terminal
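Even in a quick script like this, it is worth checking whether the request actually succeeded. Here is a minimal sketch using cURL's built-in curl_error() function to report transport failures such as DNS errors or timeouts:
<?php
// Initialize cURL session
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://books.toscrape.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
// curl_exec() returns false on transport errors (DNS failure, timeout, etc.)
if ($response === false) {
    echo "cURL error: " . curl_error($ch) . "\n";
} else {
    echo $response;
}
curl_close($ch);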
This is a great start to web scraping. However, this pulls all of the HTML, making it very difficult to find the information you need. That is where DOMDocument comes in: it is a built-in PHP class that helps you parse and extract data from HTML and XML documents, which makes data scraping a breeze. It's like dumping out a box of Legos versus organizing the Legos by size and color so everything is easy to find.
Here is an example using DOMDocument. It keeps things as simple as our first script, just adding a way to pull the titles out of the HTML instead of having to search through the whole page of books.toscrape.com.
<?php
// Create a new DOMDocument instance
$dom = new DOMDocument();
// Load HTML from books.toscrape.com
$html = file_get_contents('http://books.toscrape.com');
// The @ suppresses warnings about tags DOMDocument doesn't recognize
@$dom->loadHTML($html);
// Get all book titles (h3 elements)
$titles = $dom->getElementsByTagName('h3');
// Print each title
foreach ($titles as $title) {
echo $title->textContent . "\n";
}
This does the following:
- Grabs the HTML
- Loads it into DOMDocument
- Prints the title of each book
Here are the results:
A Light in the ...
Tipping the Velvet
Soumission
Sharp Objects
As you can see, this is much faster and cleaner to work with, giving us the exact results we are looking for.
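One catch: the h3 text on books.toscrape.com is truncated, while the full title lives in the link's title attribute. Here is a minimal sketch, assuming that page structure, using PHP's built-in DOMXPath class to pull the full titles:
<?php
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents('http://books.toscrape.com'));
// DOMXPath lets us query the document with XPath expressions
$xpath = new DOMXPath($dom);
// Select the <a> inside each book's <h3> and read its title attribute
$links = $xpath->query('//article[@class="product_pod"]//h3/a');
foreach ($links as $link) {
    echo $link->getAttribute('title') . "\n";
}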
You can get pretty far with PHP cURL web scraping, but there are some drawbacks, like not being able to render JavaScript. Plus, it's a little tedious to write. If you need to test something or grab some info real quick, cURL is the go-to option. If you are tackling a more complex task, like loading JavaScript, that is where we need to start using some of the libraries we talked about earlier.
Guzzle
The first library we will talk about is GuzzleHttp. GuzzleHttp is a modern PHP HTTP client that makes sending requests much easier than cURL, with far less boilerplate to worry about. You can install it through Composer by running composer require guzzlehttp/guzzle.
Here is the same cURL script we made before, but using Guzzle:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client();
$response = $client->get('http://books.toscrape.com');
echo $response->getBody();
This simple script already shows the benefits of Guzzle: we pull the same HTML with a much simpler syntax. Guzzle also enables easy error handling and modern PHP features like built-in JSON handling and middleware support. Think of cURL web scraping as a manual transmission car and Guzzle as an automatic. Both will get you there; the automatic will just be a little easier.
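For example, when a site exposes a JSON endpoint, you can decode the response body straight into a PHP array. Here is a minimal sketch; the API URL is a hypothetical placeholder:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client();
// Hypothetical JSON endpoint, used only for illustration
$response = $client->get('https://api.example.com/books.json');
// Guzzle returns a PSR-7 response; decode its body into an associative array
$data = json_decode($response->getBody(), true);
print_r($data);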
Here is the basic script with simple error handling and a request timeout:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
// Create a client with options
$client = new Client([
'timeout' => 5, // 5 second timeout
'http_errors' => true // Enable exceptions for errors
]);
try {
$response = $client->get('http://books.toscrape.com');
echo $response->getBody();
} catch (RequestException $e) {
echo "Error: " . $e->getMessage();
}
This version handles timeouts on its own, and we add a catch for request exceptions so we know if something went wrong.
DomCrawler
Now that we are extracting the HTML in a better way, we need to parse the information. Instead of using cURL to extract the HTML and DOMDocument to parse it, we will use GuzzleHttp to extract and DomCrawler to parse. You might be asking: what is DomCrawler? DomCrawler is a Symfony component that makes it easy to extract data from HTML documents, processing the page to return only what you need.
Let's take a look at our basic PHP web scraping example using both of these together:
<?php
// Require Composer's autoloader
require 'vendor/autoload.php';
// Import the classes we need
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
// Create Guzzle client
$client = new Client();
// Get the webpage
$response = $client->get('http://books.toscrape.com');
// Create crawler with the response HTML
$crawler = new Crawler($response->getBody()->getContents());
// Find and print all book titles
$crawler->filter('article.product_pod h3 a')->each(function ($node) {
echo "Title: " . $node->attr('title') . "\n";
});
Here are the results:
Title: A Light in the Attic
Title: Tipping the Velvet
Title: Soumission
Title: Sharp Objects
As you can see, it grabbed the full title this time and prefixed each line with "Title:".
Now let's show off what DomCrawler can do. Let's make a script that grabs both the title and the price of each book, in a format that is nice to read.
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
// Create Guzzle client
$client = new Client();
// Get the webpage
$response = $client->get('http://books.toscrape.com');
// Create crawler
$crawler = new Crawler($response->getBody()->getContents());
// Find all book articles and extract info
$crawler->filter('article.product_pod')->each(function ($node) {
// Get title and price using different selectors
$title = $node->filter('h3 a')->attr('title');
$price = $node->filter('p.price_color')->text();
// Print in a nice format
echo "📚 Book: " . $title . "\n";
echo "💰 Price: " . $price . "\n";
echo "-------------------\n";
});
This is what the output looks like:
📚 Book: A Light in the Attic
💰 Price: £51.77
-------------------
📚 Book: Tipping the Velvet
💰 Price: £53.74
-------------------
📚 Book: Soumission
💰 Price: £50.10
-------------------
📚 Book: Sharp Objects
💰 Price: £47.82
As you can see, DomCrawler makes it easy to find specific elements like the price and title, extract multiple types of content at once, and format the output nicely for us to read.
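DomCrawler can also pull attributes from every matched node in one pass with its extract() method, which returns a plain PHP array instead of looping node by node. A quick sketch:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client();
$response = $client->get('http://books.toscrape.com');
$crawler = new Crawler($response->getBody()->getContents());
// extract() collects the given attributes from every matched node at once
$titles = $crawler->filter('article.product_pod h3 a')->extract(['title']);
print_r($titles);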
CssSelector
Now that we are using modern libraries like GuzzleHttp and DomCrawler, let's add one more: CssSelector. CSS selectors are patterns used to select HTML elements, and Symfony's CssSelector component is what lets DomCrawler's filter() method understand them, translating them into XPath under the hood.
Here are some examples of CSS Selectors.
// Basic CSS Selector patterns:
// 1. Select by class
'.product_pod' // Selects elements with class="product_pod"
// 2. Select by ID
'#banner' // Selects element with id="banner"
// 3. Select by tag name
'h3' // Selects all <h3> elements
// 4. Select nested elements
'article h3' // Selects <h3> inside <article>
// 5. Select direct children
'article > p' // Selects <p> that are direct children of <article>
// 6. Select by multiple classes
'.price_color.sale' // Selects elements with both classes
Let's take a look at the CSS Selectors for books.toscrape.com:
'article.product_pod' // Select book containers
'h3 a' // Select book titles
'p.price_color' // Select prices
'.star-rating' // Select ratings
'img.thumbnail' // Select book images
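Here is a minimal sketch putting a couple of those selectors to work, assuming symfony/css-selector is installed alongside DomCrawler (filter() needs it to translate CSS into XPath). On books.toscrape.com the rating is stored in the element's class, e.g. "star-rating Three":
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client();
$response = $client->get('http://books.toscrape.com');
$crawler = new Crawler($response->getBody()->getContents());
$crawler->filter('article.product_pod')->each(function ($node) {
    $title = $node->filter('h3 a')->attr('title');
    // The rating is the second class on the element, e.g. "star-rating Three"
    $rating = trim(str_replace('star-rating', '', $node->filter('.star-rating')->attr('class')));
    echo $title . " (" . $rating . " stars)\n";
});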
These libraries are useful for webpages without much interactivity, but they start to fall short when you need to interact with a dynamic page with lots of JavaScript. When the content is not inside the initial HTML, or you need to interact with the page, you will need a headless browser.
Headless Browser
What is a headless browser? A headless browser is just like the browser you use to browse the internet, except it doesn't have a graphical interface. It can do anything a normal browser can do, like filling out forms, scrolling pages, logging into websites, and handling other complex user interactions. This makes scraping very versatile, as you can now grab anything the website is showing.
There are a few drawbacks to headless browsers, the main one being they are easily detected and blocked. There are ways around that, of course, but that’s for another time. Let’s take a look at a headless browser called Panther.
Panther
Panther is a PHP library that makes setting up and running a headless browser easy. With Panther, we will be able to fully customize and automate any task we want on Chrome or Firefox through PHP.
Examples of Panther's Capabilities
- Simulate user actions:
  - Click buttons and links.
  - Perform keyboard and mouse movements.
  - Upload files.
  - Run custom JavaScript in the browser.
- Parallel processing:
  - Handle multiple tabs or browsers at once.
- Web scraping:
  - Extract content.
  - Grab text, attributes, and HTML.
  - Handle JavaScript-heavy websites.
  - Wait for content to load.
  - Accept or dismiss pop-up alerts.
  - Spoof HTTP headers and user agents.
  - Use proxies to avoid bans.
Let's use our basic script example with Panther:
<?php
require 'vendor/autoload.php';
use Symfony\Component\Panther\Client;
// Create Panther client (headless browser)
$client = Client::createChromeClient();
// Go to the website
$crawler = $client->request('GET', 'http://books.toscrape.com');
// Get and print the page content
echo $crawler->html();
// Quit the browser
$client->quit();
This will dump the raw HTML into the terminal, so we will need to add a few things to make this script useful. Let's show off a cool feature of Panther by clicking the next button and grabbing the titles of the books on page 2.
<?php
require 'vendor/autoload.php';
use Symfony\Component\Panther\Client;
// Create Panther client
$client = Client::createChromeClient();
// Go to the website
$crawler = $client->request('GET', 'http://books.toscrape.com');
// Print titles from first page
echo "=== Page 1 Books ===\n";
$crawler->filter('h3 a')->each(function ($node) {
echo $node->text() . "\n";
});
// Click the 'next' button and get more books
$crawler = $client->clickLink('next');
// Print titles from second page
echo "\n=== Page 2 Books ===\n";
$crawler->filter('h3 a')->each(function ($node) {
echo $node->text() . "\n";
});
// Quit the browser
$client->quit();
Here are the results from our PHP web scraping example. As you can see, we are now able to go to the second page and continue scraping content:
=== Page 1 Books ===
A Light in the ...
Tipping the Velvet
Soumission
Sharp Objects
=== Page 2 Books ===
In Her Wake
How Music Works
Foolproof Preserving: A Guide ...
Chase Me (Paris Nights ...
These are just a few of the features built into Panther that you can leverage, with a few lines of code, to make a powerful scraper.
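One of those features worth singling out is waiting for content to load. On JavaScript-heavy pages, elements often appear only after the initial HTML arrives, and Panther's waitFor() blocks until a selector shows up. A minimal sketch (books.toscrape.com renders instantly, so this is purely for illustration):
<?php
require 'vendor/autoload.php';
use Symfony\Component\Panther\Client;
$client = Client::createChromeClient();
$client->request('GET', 'http://books.toscrape.com');
// waitFor() polls until the selector appears (or a timeout is hit)
$crawler = $client->waitFor('article.product_pod');
echo $crawler->filter('h3 a')->first()->text() . "\n";
$client->quit();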
Proxies
Proxies are your best friend when scraping. They hide your real IP and enable you to run tasks in parallel without being rate-limited or banned.
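Most of the tools above support proxies out of the box. Here is a minimal sketch of routing Guzzle requests through one, using Guzzle's proxy request option; the address and credentials are placeholders:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
// Placeholder proxy address and credentials; swap in your own
$client = new Client([
    'proxy' => 'http://username:password@proxy.example.com:8080',
    'timeout' => 10,
]);
$response = $client->get('http://books.toscrape.com');
echo $response->getStatusCode() . "\n";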
Not every proxy is equal. Some proxies are much better than others at avoiding bans. A 4G mobile proxy is a great option, but it might be overkill if you just want the titles of books from books.toscrape.com. Choosing the right type of proxy for your project can save you both money and time: a proxy that is too slow to load the pages you are scraping can add enormous amounts of time, or even break your script when content fails to load in time. You will run into blocks, errors, rate limits, and many other frustrations when trying to scrape. That's why we've created ScrapingBytes: to eliminate all the frustration and make scraping a frictionless process. Sign up now and get 1,000 free tokens to start scraping today!
Level Up Your Web Scraping Game!
Simply use the ScrapingBytes API. Get started for free with 1000 credits. No credit card required.
We’ve gone over the basics of web scraping with PHP, but let me tell you, there is still much more to learn.