README

A PHP package that fetches product information from any URL and returns structured data. It parses JSON-LD structured data, Open Graph meta tags, and HTML image elements to extract product details like name, description, price, and images.

Installation

composer require givetwice/product-info-fetcher

Usage

Basic

use GiveTwice\ProductInfoFetcher\ProductInfoFetcher;

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->fetchAndParse();

// Core fields
echo $product->name;          // "iPhone 15 Pro"
echo $product->description;   // "The latest iPhone with A17 Pro chip"
echo $product->priceInCents;  // 99900
echo $product->priceCurrency; // "USD"
echo $product->url;           // "https://example.com/product"
echo $product->imageUrl;      // "https://example.com/images/iphone.jpg"
print_r($product->allImageUrls); // ["https://...", "https://..."] (all found images; arrays cannot be echoed)

// Additional fields
echo $product->brand;         // "Apple"
echo $product->sku;           // "IPHONE15PRO-256"
echo $product->gtin;          // "0194253392200"
echo $product->availability?->value; // "InStock" (ProductAvailability::InStock)
echo $product->condition?->value;    // "New" (ProductCondition::New)
echo $product->rating;        // 4.8
echo $product->reviewCount;   // 1250

// For display purposes
echo $product->getFormattedPrice(); // "USD 999.00"

Pricing

Prices are stored as integers in cents to avoid floating-point precision issues. This follows the same approach used by payment systems like Stripe.

$product->priceInCents;  // 139099 (integer)
$product->priceCurrency; // "EUR" (ISO 4217 currency code)

// For display
$product->getFormattedPrice(); // "EUR 1390.99"

// For calculations (no floating-point issues)
$total = $product->priceInCents * $quantity;
$displayPrice = number_format($total / 100, 2);
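The floating-point problem mentioned above is easy to reproduce in plain PHP. The sketch below is independent of the package: `formatCents()` is a hypothetical helper written for illustration, not part of the library's API, and its output format simply mirrors the `getFormattedPrice()` examples shown here.

```php
<?php
// Floats cannot represent most decimal fractions exactly.
var_dump(0.1 + 0.2 === 0.3); // bool(false)

// Integer cents stay exact under addition and multiplication.
$priceInCents = 139099;
$quantity = 3;
$total = $priceInCents * $quantity; // 417297

// Hypothetical helper: format integer cents without any float division.
function formatCents(int $cents, string $currency): string
{
    $sign = $cents < 0 ? '-' : '';
    $abs = abs($cents);

    return sprintf('%s %s%d.%02d', $currency, $sign, intdiv($abs, 100), $abs % 100);
}

echo formatCents($total, 'EUR'), PHP_EOL; // EUR 4172.97
```

Keeping the arithmetic in integers and only splitting into units and cents at display time avoids rounding surprises in totals.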

The parser normalizes various price formats:

String prices: "999.00" → 99900
Integer prices: 1479 → 147900
European format: "1.234,56" → 123456
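To make the conversions above concrete, here is a minimal sketch of how such a normalization could work. It is not the package's actual parser: `normalizePriceToCents()` is a hypothetical function, and the rule for a lone separator followed by exactly three digits (treated as a thousands separator) is an assumption.

```php
<?php
// Hypothetical normalizer: converts a raw price into integer cents.
function normalizePriceToCents(int|float|string $raw): ?int
{
    if (is_int($raw)) {
        return $raw * 100; // 1479 is read as 1479.00
    }
    if (is_float($raw)) {
        return (int) round($raw * 100);
    }

    // Keep digits and separators only (drops currency symbols and spaces).
    $clean = (string) preg_replace('/[^0-9.,]/', '', $raw);
    if (!preg_match('/\d/', $clean)) {
        return null;
    }

    $lastDot = strrpos($clean, '.');
    $lastComma = strrpos($clean, ',');
    $decimalSep = null;

    if ($lastDot !== false && $lastComma !== false) {
        // Both present: whichever comes last is the decimal separator.
        $decimalSep = $lastDot > $lastComma ? '.' : ',';
    } elseif ($lastDot !== false || $lastComma !== false) {
        $sep = $lastDot !== false ? '.' : ',';
        $digitsAfter = strlen($clean) - strrpos($clean, $sep) - 1;
        // Assumption: a single separator followed by exactly three digits
        // is a thousands separator ("1.234" means 1234).
        if (substr_count($clean, $sep) === 1 && $digitsAfter !== 3) {
            $decimalSep = $sep;
        }
    }

    if ($decimalSep === null) {
        $whole = preg_replace('/[.,]/', '', $clean);
        $fraction = '';
    } else {
        $pos = strrpos($clean, $decimalSep);
        $whole = preg_replace('/[.,]/', '', substr($clean, 0, $pos));
        $fraction = substr($clean, $pos + 1);
    }

    // Pad or truncate the fraction to two digits, then combine as integers.
    $fraction = substr(str_pad($fraction, 2, '0'), 0, 2);

    return (int) $whole * 100 + (int) $fraction;
}

echo normalizePriceToCents('999.00'), PHP_EOL;   // 99900
echo normalizePriceToCents(1479), PHP_EOL;       // 147900
echo normalizePriceToCents('1.234,56'), PHP_EOL; // 123456
```

The string branch never goes through a float, so the result is exact for any input that fits in an integer.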

Availability & Condition

The availability and condition fields return enum instances:

use GiveTwice\ProductInfoFetcher\Enum\ProductAvailability;
use GiveTwice\ProductInfoFetcher\Enum\ProductCondition;

// Availability values
ProductAvailability::InStock
ProductAvailability::OutOfStock
ProductAvailability::PreOrder
ProductAvailability::BackOrder
ProductAvailability::Discontinued

// Condition values
ProductCondition::New
ProductCondition::Used
ProductCondition::Refurbished
ProductCondition::Damaged

// Usage
if ($product->availability === ProductAvailability::InStock) {
    // Product is available
}

// Get string value
$product->availability?->value; // "InStock"
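Because the `value` of each case is a plain string such as "InStock", these enums behave like PHP backed enums. The following sketch shows how a schema.org availability value might be mapped to such an enum. `Availability` and `availabilityFromSchemaValue()` are hypothetical stand-ins defined here so the example runs on its own; they are not the package's classes.

```php
<?php
// Hypothetical stand-in for a string-backed availability enum (subset of cases).
enum Availability: string
{
    case InStock = 'InStock';
    case OutOfStock = 'OutOfStock';
    case PreOrder = 'PreOrder';
}

// Accepts "InStock" as well as "http(s)://schema.org/InStock".
function availabilityFromSchemaValue(?string $value): ?Availability
{
    if ($value === null) {
        return null;
    }

    // Strip the schema.org prefix so only the short name remains.
    $short = preg_replace('#^https?://schema\.org/#i', '', trim($value));

    // tryFrom() returns null instead of throwing for unknown values.
    return Availability::tryFrom($short);
}

$availability = availabilityFromSchemaValue('https://schema.org/InStock');
echo $availability?->value, PHP_EOL; // InStock

var_dump(availabilityFromSchemaValue('SomethingElse')); // NULL
```

Returning null for unknown values matches the nullable access (`?->`) used in the examples above.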

Multiple Images

The imageUrl field contains the primary image (first found). The allImageUrls array contains all unique images found across all sources (JSON-LD, meta tags, and HTML image elements):

$product->imageUrl;      // Primary image (first found)
$product->allImageUrls;  // All unique images from all sources

// Example: different resolutions from different sources
// [0] "http://example.com/product-370x370.jpg"  (from JSON-LD)
// [1] "//example.com/product-big.jpg"           (from og:image)

This is useful when sources provide different image sizes or when you want fallback options.
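As an illustration of "first found is primary, all unique images are kept", here is a small standalone sketch. `mergeImageUrls()` is a hypothetical helper and the URLs are made up; the package's internal merging may differ.

```php
<?php
// Hypothetical helper: merge image lists in source order, dropping duplicates.
function mergeImageUrls(array ...$sources): array
{
    $merged = [];
    foreach ($sources as $urls) {
        foreach ($urls as $url) {
            if (is_string($url) && $url !== '' && !in_array($url, $merged, true)) {
                $merged[] = $url;
            }
        }
    }

    return $merged;
}

$fromJsonLd = ['https://example.com/product-370x370.jpg'];
$fromMeta = ['https://example.com/product-big.jpg', 'https://example.com/product-370x370.jpg'];
$fromHtml = [];

$allImageUrls = mergeImageUrls($fromJsonLd, $fromMeta, $fromHtml);
$imageUrl = $allImageUrls[0] ?? null; // primary image, or null when nothing was found

print_r($allImageUrls);
```

A caller that needs a fallback can walk the merged list in order and use the first URL that actually loads.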

With Options

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->setUserAgent('MyApp/1.0 (https://myapp.com)')
    ->setTimeout(10)
    ->setConnectTimeout(5)
    ->setAcceptLanguage('nl-BE,nl;q=0.9,en;q=0.8')
    ->fetchAndParse();

Custom Headers

For sites with stricter bot detection, you can add extra HTTP headers to mimic a real browser:

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->withExtraHeaders([
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Cache-Control' => 'no-cache',
        'DNT' => '1',
        'Sec-CH-UA' => '"Google Chrome";v="131", "Chromium";v="131"',
        'Sec-CH-UA-Mobile' => '?0',
        'Sec-CH-UA-Platform' => '"macOS"',
        'Sec-Fetch-Dest' => 'document',
        'Sec-Fetch-Mode' => 'navigate',
        'Sec-Fetch-Site' => 'none',
        'Sec-Fetch-User' => '?1',
        'Upgrade-Insecure-Requests' => '1',
    ])
    ->fetchAndParse();

Extra headers are merged with defaults and can override them. Multiple withExtraHeaders() calls can be chained.
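The merge-and-override behaviour can be pictured with a short standalone sketch. `mergeHeaders()` and the default values below are hypothetical; in particular, matching header names case-insensitively is an assumption based on how HTTP treats header names, not documented package behaviour.

```php
<?php
// Hypothetical merge: later header sets override earlier ones.
function mergeHeaders(array $defaults, array ...$extraSets): array
{
    $merged = $defaults;
    foreach ($extraSets as $extra) {
        foreach ($extra as $name => $value) {
            // HTTP header names are case-insensitive, so drop any existing
            // entry that differs only in letter case before adding the new one.
            foreach (array_keys($merged) as $existing) {
                if (strcasecmp((string) $existing, (string) $name) === 0) {
                    unset($merged[$existing]);
                }
            }
            $merged[$name] = $value;
        }
    }

    return $merged;
}

$defaults = ['User-Agent' => 'MyApp/1.0', 'Accept' => 'text/html'];

// Two sets, as if withExtraHeaders() had been called twice.
$headers = mergeHeaders($defaults, ['accept' => '*/*'], ['DNT' => '1']);

print_r($headers);
```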

HTTP Proxy

Route requests through an HTTP proxy:

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->viaProxy('http://proxy.example.com:3128')
    ->fetchAndParse();

// With authentication
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->viaProxy('http://username:password@proxy.example.com:3128')
    ->fetchAndParse();

The proxy is used for both regular HTTP requests (via Guzzle/cURL) and headless browser requests (via Chrome's --proxy-server flag). Proxy authentication is handled automatically.
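One reason authentication needs special handling is that a proxy URL with embedded credentials usually has to be split into a server address and a username/password pair before it can be passed to a browser. The sketch below shows that split with PHP's `parse_url()`. `splitProxyUrl()` is a hypothetical helper; how the package does this internally is not documented here.

```php
<?php
// Hypothetical helper: separate the proxy server address from its credentials.
function splitProxyUrl(string $proxyUrl): array
{
    $parts = parse_url($proxyUrl);
    if ($parts === false || !isset($parts['host'])) {
        throw new InvalidArgumentException("Invalid proxy URL: {$proxyUrl}");
    }

    $server = ($parts['scheme'] ?? 'http') . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $server .= ':' . $parts['port'];
    }

    return [
        'server' => $server,
        // Credentials may be percent-encoded inside a URL.
        'username' => isset($parts['user']) ? rawurldecode($parts['user']) : null,
        'password' => isset($parts['pass']) ? rawurldecode($parts['pass']) : null,
    ];
}

$proxy = splitProxyUrl('http://username:password@proxy.example.com:3128');

echo '--proxy-server=' . $proxy['server'], PHP_EOL; // --proxy-server=http://proxy.example.com:3128
```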

Headless Browser

Some sites use advanced bot protection (Akamai, Cloudflare) that blocks simple HTTP requests. For these sites, you can use a headless Chrome browser via Puppeteer to fetch the page.

Prefer Headless (recommended for known-protected sites):

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->preferHeadless()
    ->fetchAndParse();

Use preferHeadless() when you know the site has bot protection. It skips the plain HTTP request entirely and goes straight to headless Chrome. This tends to be more reliable, because a blocked HTTP attempt can get your IP flagged, which may cause the subsequent headless request to fail as well.

Fallback Mode:

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->enableHeadlessFallback()
    ->fetchAndParse();

When enabled, the fetcher will:

1. Attempt a normal HTTP request first
2. If blocked with a 403 status, automatically retry using a headless Chrome browser
3. Parse the resulting HTML as usual

Use enableHeadlessFallback() when you're unsure if the site has bot protection and want to try the faster HTTP request first.
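The fallback flow described above boils down to a small piece of control logic. The sketch below models it with two callables so it runs without a network or a browser. `fetchWithHeadlessFallback()` and the stub fetchers are hypothetical and only illustrate the order of operations.

```php
<?php
// Hypothetical control flow: plain HTTP first, headless only on a 403.
function fetchWithHeadlessFallback(callable $httpFetch, callable $headlessFetch): string
{
    ['status' => $status, 'body' => $body] = $httpFetch();

    if ($status === 403) {
        // Blocked: retry through the (slower) headless browser.
        return $headlessFetch();
    }

    return $body;
}

// Stub fetchers standing in for real HTTP and headless Chrome requests.
$okHttp = fn (): array => ['status' => 200, 'body' => '<html>from HTTP</html>'];
$blockedHttp = fn (): array => ['status' => 403, 'body' => 'Access denied'];
$headless = fn (): string => '<html>from headless Chrome</html>';

echo fetchWithHeadlessFallback($okHttp, $headless), PHP_EOL;      // <html>from HTTP</html>
echo fetchWithHeadlessFallback($blockedHttp, $headless), PHP_EOL; // <html>from headless Chrome</html>
```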

Requirements:

To use either headless mode (preferHeadless() or enableHeadlessFallback()), you need Node.js 18+ and Puppeteer installed:

npm install puppeteer@^23.0 puppeteer-extra@^3.3 puppeteer-extra-plugin-stealth@^2.11

On Linux servers, you may also need system dependencies for headless Chrome:

# Ubuntu 24.04+
apt-get install -y libnss3 libatk1.0-0t64 libatk-bridge2.0-0t64 \
  libcups2t64 libdrm2 libxkbcommon0 libxcomposite1 \
  libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2t64 \
  libpango-1.0-0 libcairo2

# Ubuntu 22.04 and earlier
apt-get install -y libnss3 libatk1.0-0 libatk-bridge2.0-0 \
  libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
  libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2

Custom Paths:

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->enableHeadlessFallback()
    ->setNodeBinary('/usr/local/bin/node')
    ->setChromePath('/usr/bin/chromium')
    ->fetchAndParse();

Separate Fetch and Parse

$fetcher = new ProductInfoFetcher('https://example.com/product');
$fetcher->fetch();
$product = $fetcher->parse();

Parse Existing HTML

$product = (new ProductInfoFetcher())
    ->setHtml($html)
    ->parse();

Access as Array

$product = (new ProductInfoFetcher($url))->fetchAndParse();

$data = $product->toArray();
// [
//     'name' => 'iPhone 15 Pro',
//     'description' => 'The latest iPhone...',
//     'url' => 'https://example.com/product',
//     'priceInCents' => 99900,
//     'priceCurrency' => 'USD',
//     'imageUrl' => 'https://example.com/image.jpg',
//     'allImageUrls' => ['https://...', 'https://...'],
//     'brand' => 'Apple',
//     'sku' => 'IPHONE15PRO-256',
//     'gtin' => '0194253392200',
//     'availability' => 'InStock',
//     'condition' => 'New',
//     'rating' => 4.8,
//     'reviewCount' => 1250,
// ]

Check Completeness

if ($product->isComplete()) {
    // Product has name and description
}

How It Works

The package attempts to extract product information in the following order:

1. JSON-LD - Looks for <script type="application/ld+json"> with @type: Product or @type: ProductGroup
2. Meta Tags - Falls back to Open Graph (og:), Twitter Cards (twitter:), and standard meta tags
3. HTML Images - Extracts product images directly from <img> elements using common patterns (Amazon's landingImage, product image classes, data attributes)

If the first parser returns complete data (name and description), the package returns that result immediately. Otherwise, it merges results from multiple parsers. In both cases, images from all three sources are combined.
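The merging step can be sketched as "first non-empty value wins", with parsers ordered by priority. This is an assumption about the precedence rule, and `mergeParserResults()` is a hypothetical function operating on plain arrays rather than the package's own result objects. Image lists are left out to keep the example short.

```php
<?php
// Hypothetical merge: results are passed in priority order (JSON-LD first).
function mergeParserResults(array ...$results): array
{
    $merged = [];
    foreach ($results as $result) {
        foreach ($result as $field => $value) {
            // Keep the first non-empty value seen for each field.
            if ($value !== null && $value !== '' && !isset($merged[$field])) {
                $merged[$field] = $value;
            }
        }
    }

    return $merged;
}

$jsonLd = ['name' => 'iPhone 15 Pro', 'description' => null, 'priceInCents' => 99900];
$metaTags = ['name' => 'iPhone 15 Pro | Example Shop', 'description' => 'The latest iPhone with A17 Pro chip'];

$product = mergeParserResults($jsonLd, $metaTags);

print_r($product);
// name and priceInCents come from JSON-LD, description from the meta tags
```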

Supported Structures

schema.org Product - Standard product markup including offers, brand, sku, gtin, aggregateRating
schema.org ProductGroup - Product variants (e.g., bol.com) with hasVariant[]
Open Graph - og:title, og:description, og:image, product:price:amount, product:price:currency, product:availability, product:condition

Both short ("@type": "Product") and full URL ("@type": "http://schema.org/Product") formats are supported for all schema.org types.
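For readers curious what accepting both @type formats involves, here is a minimal sketch that finds a Product node in a page's JSON-LD using PHP's DOM extension. `isProductType()` and `findJsonLdProduct()` are hypothetical names; the package's real parser is likely more thorough.

```php
<?php
// Hypothetical check: accepts "Product", "http://schema.org/Product", or a list of types.
function isProductType(mixed $type): bool
{
    foreach (is_array($type) ? $type : [$type] as $candidate) {
        if (!is_string($candidate)) {
            continue;
        }
        $short = preg_replace('#^https?://schema\.org/#i', '', $candidate);
        if ($short === 'Product' || $short === 'ProductGroup') {
            return true;
        }
    }

    return false;
}

// Hypothetical extractor: returns the first Product/ProductGroup node, or null.
function findJsonLdProduct(string $html): ?array
{
    if (trim($html) === '') {
        return null;
    }

    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (!is_array($data)) {
            continue;
        }

        // A block may hold one node, a list of nodes, or an @graph list.
        $graph = $data['@graph'] ?? null;
        $nodes = array_is_list($data) ? $data : (is_array($graph) ? $graph : [$data]);

        foreach ($nodes as $node) {
            if (is_array($node) && isProductType($node['@type'] ?? null)) {
                return $node;
            }
        }
    }

    return null;
}

$html = '<html><head><script type="application/ld+json">'
    . '{"@context":"https://schema.org","@type":"http://schema.org/Product","name":"iPhone 15 Pro"}'
    . '</script></head><body></body></html>';

echo findJsonLdProduct($html)['name'] ?? 'not found', PHP_EOL; // iPhone 15 Pro
```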

Meta Tag Fallback Chain

When JSON-LD is unavailable, the parser tries multiple sources:

name: og:title → twitter:title → <title>
description: og:description → twitter:description → <meta name="description">
image: og:image → twitter:image → HTML image elements
url: <link rel="canonical"> → og:url
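A fallback chain like the ones listed above can be expressed as an ordered list of XPath queries, where the first non-empty result wins. The sketch below uses PHP's DOM extension. `firstNonEmpty()` is a hypothetical helper and the sample HTML is made up.

```php
<?php
// Hypothetical helper: return the first non-empty string produced by the queries.
function firstNonEmpty(DOMXPath $xpath, array $queries): ?string
{
    foreach ($queries as $query) {
        // string(...) yields the text of the first matching node, or '' if none.
        $value = trim((string) $xpath->evaluate("string({$query})"));
        if ($value !== '') {
            return $value;
        }
    }

    return null;
}

// Sample page without og:title or og:description.
$html = '<html><head>'
    . '<title>Page Title</title>'
    . '<meta name="twitter:title" content="Twitter Name">'
    . '<meta name="description" content="Plain meta description">'
    . '</head><body></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$name = firstNonEmpty($xpath, [
    '//meta[@property="og:title"]/@content',
    '//meta[@name="twitter:title"]/@content',
    '//title',
]);

$description = firstNonEmpty($xpath, [
    '//meta[@property="og:description"]/@content',
    '//meta[@name="twitter:description"]/@content',
    '//meta[@name="description"]/@content',
]);

echo $name, PHP_EOL;        // Twitter Name
echo $description, PHP_EOL; // Plain meta description
```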

HTML Image Extraction

For sites without structured data or meta tags (e.g., Amazon), the package extracts images directly from HTML:

Amazon pattern: <img id="landingImage"> with data-old-hires for high-res images
Common IDs: main-image, product-image, hero-image
Common classes: product-image, main-image, gallery-image
Data attributes: data-zoom-image, data-large-image, data-src

High-resolution images are prioritized when available.
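The patterns above translate fairly directly into an XPath query plus an ordered list of attributes to try. The following is a minimal sketch under that assumption. `extractHtmlImages()` is hypothetical, and the attribute priority (high-resolution attributes before `src`) is one plausible reading of "prioritized", not the package's confirmed behaviour.

```php
<?php
// Hypothetical extractor: collect product image URLs from <img> elements.
function extractHtmlImages(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Match the known IDs and classes listed above.
    $conditions = [];
    foreach (['landingImage', 'main-image', 'product-image', 'hero-image'] as $id) {
        $conditions[] = sprintf('@id="%s"', $id);
    }
    foreach (['product-image', 'main-image', 'gallery-image'] as $class) {
        $conditions[] = sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
    }
    $query = '//img[' . implode(' or ', $conditions) . ']';

    $images = [];
    foreach ($xpath->query($query) as $img) {
        // Assumed priority: high-resolution attributes first, plain src last.
        foreach (['data-old-hires', 'data-zoom-image', 'data-large-image', 'data-src', 'src'] as $attr) {
            $url = trim($img->getAttribute($attr));
            if ($url !== '') {
                $images[] = $url;
                break;
            }
        }
    }

    return array_values(array_unique($images));
}

$html = '<html><body>'
    . '<img id="landingImage" src="https://example.com/small.jpg" data-old-hires="https://example.com/large.jpg">'
    . '<img class="gallery-image thumb" data-src="https://example.com/gallery.jpg">'
    . '<img class="logo" src="https://example.com/logo.png">'
    . '</body></html>';

print_r(extractHtmlImages($html));
```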

Testing

composer test

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security Vulnerabilities

Please review our security policy on how to report security vulnerabilities.

Credits

License

The MIT License (MIT). Please see License File for more information.
