givetwice / product-info-fetcher
Given a product URL, returns structured data about that product, including its name, description, and price.
pkg:composer/givetwice/product-info-fetcher
Requires
- php: ^8.4
- guzzlehttp/guzzle: ^7.9
- symfony/process: ^6.0|^7.0
Requires (Dev)
- laravel/pint: ^1.0
- pestphp/pest: ^4.0
- spatie/ray: ^1.28
This package is auto-updated.
Last update: 2026-01-07 09:10:17 UTC
README
A PHP package that fetches product information from any URL and returns structured data. It parses JSON-LD structured data, Open Graph meta tags, and HTML image elements to extract product details like name, description, price, and images.
Installation
composer require givetwice/product-info-fetcher
Usage
Basic
use GiveTwice\ProductInfoFetcher\ProductInfoFetcher;

$product = (new ProductInfoFetcher('https://example.com/product'))
    ->fetchAndParse();

// Core fields
echo $product->name;          // "iPhone 15 Pro"
echo $product->description;   // "The latest iPhone with A17 Pro chip"
echo $product->priceInCents;  // 99900
echo $product->priceCurrency; // "USD"
echo $product->url;           // "https://example.com/product"
echo $product->imageUrl;      // "https://example.com/images/iphone.jpg"
print_r($product->allImageUrls); // ["https://...", "https://..."] (all found images; an array, so not echo-able)

// Additional fields
echo $product->brand;                // "Apple"
echo $product->sku;                  // "IPHONE15PRO-256"
echo $product->gtin;                 // "0194253392200"
echo $product->availability?->value; // "InStock" (the field itself is ProductAvailability::InStock)
echo $product->condition?->value;    // "New" (the field itself is ProductCondition::New)
echo $product->rating;               // 4.8
echo $product->reviewCount;          // 1250

// For display purposes
echo $product->getFormattedPrice(); // "USD 999.00"
Pricing
Prices are stored as integers in cents to avoid floating-point precision issues. This follows the same approach used by payment systems like Stripe.
$product->priceInCents;  // 139099 (integer)
$product->priceCurrency; // "EUR" (ISO 4217 currency code)

// For display
$product->getFormattedPrice(); // "EUR 1390.99"

// For calculations (no floating-point issues)
$total = $product->priceInCents * $quantity;
$displayPrice = number_format($total / 100, 2);
The parser normalizes various price formats:
- String prices: "999.00" → 99900
- Integer prices: 1479 → 147900
- European format: "1.234,56" → 123456
Availability & Condition
The availability and condition fields return enum instances:
use GiveTwice\ProductInfoFetcher\Enum\ProductAvailability;
use GiveTwice\ProductInfoFetcher\Enum\ProductCondition;

// Availability values:
//   ProductAvailability::InStock
//   ProductAvailability::OutOfStock
//   ProductAvailability::PreOrder
//   ProductAvailability::BackOrder
//   ProductAvailability::Discontinued

// Condition values:
//   ProductCondition::New
//   ProductCondition::Used
//   ProductCondition::Refurbished
//   ProductCondition::Damaged

// Usage
if ($product->availability === ProductAvailability::InStock) {
    // Product is available
}

// Get string value (null-safe, because the field may be null)
$product->availability?->value; // "InStock"
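Structured data usually expresses availability as a schema.org URL such as https://schema.org/InStock. As a rough sketch of how such a value could map onto a string-backed enum, the example below uses a local stand-in enum so it runs without the package. The Availability enum and availabilityFromSchemaOrg() are hypothetical, and only the "InStock" backing value is confirmed by the README; the others are assumed to follow the same pattern.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in mirroring the case names listed above.
// The real enum lives in GiveTwice\ProductInfoFetcher\Enum.
enum Availability: string
{
    case InStock = 'InStock';
    case OutOfStock = 'OutOfStock';     // assumed backing value
    case PreOrder = 'PreOrder';         // assumed backing value
    case BackOrder = 'BackOrder';       // assumed backing value
    case Discontinued = 'Discontinued'; // assumed backing value
}

/**
 * Maps "https://schema.org/InStock", "http://schema.org/InStock",
 * or plain "InStock" to an enum case; unknown values become null.
 */
function availabilityFromSchemaOrg(?string $raw): ?Availability
{
    if ($raw === null) {
        return null;
    }

    $raw = trim($raw);
    $slash = strrpos($raw, '/');
    $name = $slash === false ? $raw : substr($raw, $slash + 1);

    // tryFrom() returns null instead of throwing on unknown values.
    return Availability::tryFrom($name);
}
```

Returning null for unrecognized values keeps the nullable-field convention shown above, so callers can keep using the null-safe ?->value access.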
Multiple Images
The imageUrl field contains the primary image (first found). The allImageUrls array contains all unique images found across all sources (JSON-LD, meta tags, and HTML image elements):
$product->imageUrl;     // Primary image (first found)
$product->allImageUrls; // All unique images from all sources

// Example: different resolutions from different sources
// [0] "http://example.com/product-370x370.jpg" (from JSON-LD)
// [1] "//example.com/product-big.jpg" (from og:image)
This is useful when sources provide different image sizes or when you want fallback options.
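Because entries can be protocol-relative (like the //example.com/... URL in the example), you may want to normalize the list before using it as a set of fallbacks. A minimal sketch, assuming you do this in your own code: normalizeImageUrls() is a hypothetical helper, and defaulting to https is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: makes protocol-relative URLs absolute and
 * removes duplicates while preserving the original order.
 *
 * @param list<string> $urls
 * @return list<string>
 */
function normalizeImageUrls(array $urls, string $defaultScheme = 'https'): array
{
    $unique = [];

    foreach ($urls as $url) {
        $url = trim($url);
        if ($url === '') {
            continue;
        }
        if (str_starts_with($url, '//')) {
            $url = $defaultScheme . ':' . $url; // "//host/x" -> "https://host/x"
        }
        $unique[$url] = true; // array keys deduplicate, insertion order is kept
    }

    return array_keys($unique);
}

$images = normalizeImageUrls([
    'http://example.com/product-370x370.jpg',
    '//example.com/product-big.jpg',
    '//example.com/product-big.jpg',
]);
```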
With Options
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->setUserAgent('MyApp/1.0 (https://myapp.com)')
    ->setTimeout(10)
    ->setConnectTimeout(5)
    ->setAcceptLanguage('nl-BE,nl;q=0.9,en;q=0.8')
    ->fetchAndParse();
Custom Headers
For sites with stricter bot detection, you can add extra HTTP headers to mimic a real browser:
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->withExtraHeaders([
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Cache-Control' => 'no-cache',
        'DNT' => '1',
        'Sec-CH-UA' => '"Google Chrome";v="131", "Chromium";v="131"',
        'Sec-CH-UA-Mobile' => '?0',
        'Sec-CH-UA-Platform' => '"macOS"',
        'Sec-Fetch-Dest' => 'document',
        'Sec-Fetch-Mode' => 'navigate',
        'Sec-Fetch-Site' => 'none',
        'Sec-Fetch-User' => '?1',
        'Upgrade-Insecure-Requests' => '1',
    ])
    ->fetchAndParse();
Extra headers are merged with defaults and can override them. Multiple withExtraHeaders() calls can be chained.
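The merge-and-override behaviour can be pictured with the sketch below. It is not the package's implementation: mergeHeaders() is a hypothetical function, and matching header names case-insensitively is an assumption (HTTP header names are case-insensitive, but the README does not say how the package compares them).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical merge: later header sets override earlier ones,
 * with names compared case-insensitively.
 *
 * @param array<string, string> $defaults
 * @param array<string, string> ...$extraSets
 * @return array<string, string>
 */
function mergeHeaders(array $defaults, array ...$extraSets): array
{
    $byLowerName = [];

    foreach ([$defaults, ...$extraSets] as $headers) {
        foreach ($headers as $name => $value) {
            // The last writer wins; the spelling of the last writer is kept.
            $byLowerName[strtolower((string) $name)] = [(string) $name, $value];
        }
    }

    $merged = [];
    foreach ($byLowerName as [$name, $value]) {
        $merged[$name] = $value;
    }

    return $merged;
}

$headers = mergeHeaders(
    ['Accept' => '*/*', 'User-Agent' => 'Demo/1.0'],
    ['accept' => 'text/html'],
    ['DNT' => '1'],
);
```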
HTTP Proxy
Route requests through an HTTP proxy:
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->viaProxy('http://proxy.example.com:3128')
    ->fetchAndParse();

// With authentication
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->viaProxy('http://username:password@proxy.example.com:3128')
    ->fetchAndParse();
The proxy is used for both regular HTTP requests (via Guzzle/cURL) and headless browser requests (via Chrome's --proxy-server flag). Proxy authentication is handled automatically.
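One reason proxy authentication needs special handling is that a browser's proxy flag typically takes only the scheme, host, and port, so credentials have to be supplied separately. The sketch below shows how a proxy URL could be split for that purpose. splitProxyUrl() is a hypothetical helper, and whether the package does anything like this internally is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: splits a proxy URL into the credential-free
 * server address and the optional username and password.
 *
 * @return array{server: string, username: ?string, password: ?string}
 */
function splitProxyUrl(string $proxyUrl): array
{
    $parts = parse_url($proxyUrl);

    if ($parts === false || !isset($parts['host'])) {
        throw new InvalidArgumentException("Invalid proxy URL: {$proxyUrl}");
    }

    $server = ($parts['scheme'] ?? 'http') . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $server .= ':' . $parts['port'];
    }

    return [
        'server' => $server,
        // Credentials in URLs are percent-encoded, so decode them.
        'username' => isset($parts['user']) ? rawurldecode($parts['user']) : null,
        'password' => isset($parts['pass']) ? rawurldecode($parts['pass']) : null,
    ];
}

$proxy = splitProxyUrl('http://username:password@proxy.example.com:3128');
```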
Headless Browser
Some sites use advanced bot protection (Akamai, Cloudflare) that blocks simple HTTP requests. For these sites, you can use a headless Chrome browser via Puppeteer to fetch the page.
Prefer Headless (recommended for known-protected sites):
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->preferHeadless()
    ->fetchAndParse();
Use preferHeadless() when you know the site has bot protection. It skips the plain HTTP request entirely and goes straight to headless Chrome. This tends to be more reliable, because a blocked HTTP attempt can get your IP flagged, causing the subsequent headless request to fail as well.
Fallback Mode:
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->enableHeadlessFallback()
    ->fetchAndParse();
When enabled, the fetcher will:
- First attempt a normal HTTP request
- If blocked with a 403 status, automatically retry using a headless Chrome browser
- Parse the resulting HTML as usual
Use enableHeadlessFallback() when you're unsure if the site has bot protection and want to try the faster HTTP request first.
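The fallback flow described above can be sketched as follows. The two fetchers are passed in as callables so the example runs on its own. HttpResult and fetchHtmlWithFallback() are hypothetical names, and the real package wires this up internally with Guzzle and Puppeteer.

```php
<?php

declare(strict_types=1);

// Hypothetical value object for the outcome of a plain HTTP request.
final class HttpResult
{
    public function __construct(
        public readonly int $status,
        public readonly string $body,
    ) {
    }
}

/**
 * Tries the cheap HTTP fetch first and only falls back to the
 * headless fetch when the response status is 403.
 *
 * @param callable(string): HttpResult $httpFetch
 * @param callable(string): string     $headlessFetch
 */
function fetchHtmlWithFallback(string $url, callable $httpFetch, callable $headlessFetch): string
{
    $result = $httpFetch($url);

    if ($result->status === 403) {
        return $headlessFetch($url);
    }

    return $result->body;
}

// Stub fetchers standing in for the real HTTP client and headless browser.
$blockedHttp = fn (string $url): HttpResult => new HttpResult(403, 'Access denied');
$fakeHeadless = fn (string $url): string => '<html><title>Demo</title></html>';

$html = fetchHtmlWithFallback('https://example.com/product', $blockedHttp, $fakeHeadless);
```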
Requirements:
To use headless mode (preferHeadless() or enableHeadlessFallback()), you need Node.js 18+ and Puppeteer installed:
npm install puppeteer@^23.0 puppeteer-extra@^3.3 puppeteer-extra-plugin-stealth@^2.11
On Linux servers, you may also need system dependencies for headless Chrome:
# Ubuntu 24.04+
apt-get install -y libnss3 libatk1.0-0t64 libatk-bridge2.0-0t64 \
    libcups2t64 libdrm2 libxkbcommon0 libxcomposite1 \
    libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2t64 \
    libpango-1.0-0 libcairo2

# Ubuntu 22.04 and earlier
apt-get install -y libnss3 libatk1.0-0 libatk-bridge2.0-0 \
    libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
    libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2
Custom Paths:
$product = (new ProductInfoFetcher('https://example.com/product'))
    ->enableHeadlessFallback()
    ->setNodeBinary('/usr/local/bin/node')
    ->setChromePath('/usr/bin/chromium')
    ->fetchAndParse();
Separate Fetch and Parse
$fetcher = new ProductInfoFetcher('https://example.com/product');
$fetcher->fetch();
$product = $fetcher->parse();
Parse Existing HTML
$product = (new ProductInfoFetcher())
    ->setHtml($html)
    ->parse();
Access as Array
$product = (new ProductInfoFetcher($url))->fetchAndParse();
$data = $product->toArray();

// [
//     'name' => 'iPhone 15 Pro',
//     'description' => 'The latest iPhone...',
//     'url' => 'https://example.com/product',
//     'priceInCents' => 99900,
//     'priceCurrency' => 'USD',
//     'imageUrl' => 'https://example.com/image.jpg',
//     'allImageUrls' => ['https://...', 'https://...'],
//     'brand' => 'Apple',
//     'sku' => 'IPHONE15PRO-256',
//     'gtin' => '0194253392200',
//     'availability' => 'InStock',
//     'condition' => 'New',
//     'rating' => 4.8,
//     'reviewCount' => 1250,
// ]
Check Completeness
if ($product->isComplete()) {
    // Product has name and description
}
How It Works
The package attempts to extract product information in the following order:
- JSON-LD - Looks for <script type="application/ld+json"> with @type: Product or @type: ProductGroup
- Meta Tags - Falls back to Open Graph (og:), Twitter Cards (twitter:), and standard meta tags
- HTML Images - Extracts product images directly from <img> elements using common patterns (Amazon's landingImage, product image classes, data attributes)
If the first parser returns complete data (name and description), it returns immediately. Otherwise, it merges results from multiple parsers. Images from all three sources are always combined.
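To illustrate the merge step, here is a minimal sketch using plain arrays in place of the package's product object. mergeParserResults() and the array-based isComplete() are hypothetical, and "the earlier parser wins per field" is an assumption about how conflicts are resolved.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical merge: results are passed in priority order, the first
 * non-null value wins for each field, and image URLs accumulate.
 *
 * @param array<string, mixed> ...$results
 * @return array<string, mixed>
 */
function mergeParserResults(array ...$results): array
{
    $merged = ['allImageUrls' => []];

    foreach ($results as $result) {
        foreach ($result as $field => $value) {
            if ($field === 'allImageUrls') {
                foreach ($value as $url) {
                    if (!in_array($url, $merged['allImageUrls'], true)) {
                        $merged['allImageUrls'][] = $url;
                    }
                }
                continue;
            }
            if ($value !== null && !isset($merged[$field])) {
                $merged[$field] = $value;
            }
        }
    }

    return $merged;
}

// Mirrors the README's definition of completeness: name and description.
function isComplete(array $product): bool
{
    return ($product['name'] ?? '') !== '' && ($product['description'] ?? '') !== '';
}

$jsonLd = ['name' => 'Demo Phone', 'description' => null, 'allImageUrls' => ['a.jpg']];
$meta = ['name' => 'Demo Phone | Shop', 'description' => 'A demo product', 'allImageUrls' => ['a.jpg', 'b.jpg']];

$product = mergeParserResults($jsonLd, $meta);
```

Here the JSON-LD result alone is incomplete (no description), so the meta tag result fills the gap while the JSON-LD name is kept.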
Supported Structures
- schema.org Product - Standard product markup including offers, brand, sku, gtin, aggregateRating
- schema.org ProductGroup - Product variants (e.g., bol.com) with hasVariant[]
- Open Graph - og:title, og:description, og:image, product:price:amount, product:price:currency, product:availability, product:condition
Both short ("@type": "Product") and full URL ("@type": "http://schema.org/Product") formats are supported for all schema.org types.
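As an illustration of locating a Product node and accepting both @type formats, the sketch below uses PHP's DOM extension. It is not the package's parser: findJsonLdProduct() and isProductType() are hypothetical, and looking inside @graph is an added assumption.

```php
<?php

declare(strict_types=1);

/** Accepts "Product", "http://schema.org/Product", and the ProductGroup equivalents. */
function isProductType(mixed $type): bool
{
    // @type may be a single string or a list of strings.
    foreach ((array) $type as $candidate) {
        if (!is_string($candidate)) {
            continue;
        }
        $short = preg_replace('#^https?://schema\.org/#', '', $candidate);
        if ($short === 'Product' || $short === 'ProductGroup') {
            return true;
        }
    }

    return false;
}

/** Returns the first JSON-LD node typed as a product, or null if none is found. */
function findJsonLdProduct(string $html): ?array
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (!is_array($data)) {
            continue; // invalid JSON or a scalar
        }

        // A block may hold one node, a list of nodes, or an @graph wrapper.
        $nodes = array_is_list($data) ? $data : ($data['@graph'] ?? [$data]);
        foreach ($nodes as $node) {
            if (is_array($node) && isProductType($node['@type'] ?? null)) {
                return $node;
            }
        }
    }

    return null;
}
```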
Meta Tag Fallback Chain
When JSON-LD is unavailable, the parser tries multiple sources:
- name: og:title → twitter:title → <title>
- description: og:description → twitter:description → <meta name="description">
- image: og:image → twitter:image → HTML image elements
- url: <link rel="canonical"> → og:url
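Each chain above amounts to "take the first source that has a non-empty value". A minimal sketch, assuming the tags have already been collected into a flat array: firstNonEmpty() and the key names in $tags are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the first non-empty value found
 * under the given keys, checked in order, or null if none exists.
 *
 * @param array<string, ?string> $values
 * @param list<string>           $keys
 */
function firstNonEmpty(array $values, array $keys): ?string
{
    foreach ($keys as $key) {
        $value = trim((string) ($values[$key] ?? ''));
        if ($value !== '') {
            return $value;
        }
    }

    return null;
}

// Example input: og:title is missing, twitter:title is blank.
$tags = [
    'twitter:title' => '  ',
    'title' => 'Demo Phone | Example Shop',
    'og:description' => 'A demo product',
];

$name = firstNonEmpty($tags, ['og:title', 'twitter:title', 'title']);
$description = firstNonEmpty($tags, ['og:description', 'twitter:description', 'description']);
```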
HTML Image Extraction
For sites without structured data or meta tags (e.g., Amazon), the package extracts images directly from HTML:
- Amazon pattern: <img id="landingImage"> with data-old-hires for high-res images
- Common IDs: main-image, product-image, hero-image
- Common classes: product-image, main-image, gallery-image
- Data attributes: data-zoom-image, data-large-image, data-src
High-resolution images are prioritized when available.
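The patterns above could be matched with an XPath query along these lines. This is a sketch rather than the package's extractor: extractProductImages() is hypothetical, and the attribute priority order (high-resolution attributes before src) is an assumption based on the description.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extractor: finds <img> elements matching the IDs and
 * classes listed above and prefers high-resolution attributes.
 *
 * @return list<string>
 */
function extractProductImages(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $conditions = [];
    foreach (['landingImage', 'main-image', 'product-image', 'hero-image'] as $id) {
        $conditions[] = "@id=\"{$id}\"";
    }
    foreach (['product-image', 'main-image', 'gallery-image'] as $class) {
        // Whole-word class match, so "product-image-small" does not qualify.
        $conditions[] = "contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')";
    }
    $query = '//img[' . implode(' or ', $conditions) . ']';

    // Assumed priority: high-resolution attributes first, plain src last.
    $attributes = ['data-old-hires', 'data-zoom-image', 'data-large-image', 'data-src', 'src'];

    $urls = [];
    foreach ((new DOMXPath($dom))->query($query) as $img) {
        foreach ($attributes as $attribute) {
            $value = trim($img->getAttribute($attribute));
            if ($value !== '') {
                $urls[$value] = true; // keys deduplicate and keep order
                break;
            }
        }
    }

    return array_keys($urls);
}
```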
Testing
composer test
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Please see CONTRIBUTING for details.
Security Vulnerabilities
Please review our security policy on how to report security vulnerabilities.
Credits
License
The MIT License (MIT). Please see License File for more information.