dprmc/biz-journals

A PHP library to interface with the BizJournals.com website.

Package info

github.com/DPRMC/BizJournals

pkg:composer/dprmc/biz-journals

v0.4 2026-03-25 23:03 UTC

This package is auto-updated.

Last update: 2026-03-26 04:29:06 UTC


README

A PHP library for authenticating with BizJournals, crawling section pages, and returning article data as structured JSON.

Scope

This repository now includes a first-pass scraping framework for:

  • establishing an authenticated BizJournals session
  • crawling one or more root section index pages
  • discovering article URLs from those section pages
  • fetching a specific article page
  • normalizing article content into JSON-serializable models
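
As a rough illustration of the last item, the sketch below pulls a headline out of an article page, preferring JSON-LD metadata and falling back to the DOM, and returns a plain array that json_encode() can serialize. The function name exampleExtractArticle, the "headline" key, and the h1 fallback selector are assumptions made for this example; they are not the library's ArticleParser or Story API.

```php
<?php

/**
 * Hypothetical sketch, not the library's ArticleParser.
 * Returns ['headline' => ?string, 'source' => 'json-ld'|'dom'|null].
 */
function exampleExtractArticle(string $html): array
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid; collect parse warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Preferred source: JSON-LD metadata blocks.
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (is_array($data) && isset($data['headline']) && is_string($data['headline'])) {
            return ['headline' => trim($data['headline']), 'source' => 'json-ld'];
        }
    }

    // Fallback: the first <h1> in the document (assumed selector).
    $h1 = $xpath->query('//h1')->item(0);
    if ($h1 !== null) {
        return ['headline' => trim($h1->textContent), 'source' => 'dom'];
    }

    return ['headline' => null, 'source' => null];
}
```

Passing the returned array to json_encode() gives the JSON form; a real model would carry more fields (URL, author, publication date, body).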

The initial root URL targeted by the example CLI is:

https://www.bizjournals.com/news/commercial-real-estate
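
Index pages are paginated with a ?page=N query parameter, so a root URL like the one above can be expanded into a list of index URLs before crawling. The helper below is a minimal sketch of that expansion; the name examplePageUrls and the choice to treat page 1 as the bare root URL are assumptions, not the spider's actual behavior.

```php
<?php

/**
 * Hypothetical helper: build index URLs for a root section URL.
 * Assumes page 1 is the root URL itself and later pages add page=N.
 *
 * @return string[]
 */
function examplePageUrls(string $rootUrl, int $pageLimit): array
{
    $urls = [];
    for ($page = 1; $page <= $pageLimit; $page++) {
        if ($page === 1) {
            $urls[] = $rootUrl;
            continue;
        }
        // Use "&" when the root URL already carries a query string.
        $separator = strpos($rootUrl, '?') === false ? '?' : '&';
        $urls[] = $rootUrl . $separator . 'page=' . $page;
    }

    return $urls;
}
```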

Install

Install dependencies, including development dependencies:

composer install

Run the test suite:

composer test

Usage

Set credentials if the crawl requires an authenticated session:

export BIZJOURNALS_EMAIL="you@example.com"
export BIZJOURNALS_PASSWORD="your-password"
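
For code that drives the library directly instead of through the CLI, the same variables can be read with getenv(). This is only a sketch: the helper name exampleCredentialsFromEnv and the decision to return null when either value is missing are assumptions, and the bundled CLI scripts may read the variables differently.

```php
<?php

/**
 * Hypothetical helper: read BizJournals credentials from the environment.
 * Returns null when either variable is unset or empty, so the caller can
 * decide whether to crawl without an authenticated session.
 */
function exampleCredentialsFromEnv(): ?array
{
    $email = getenv('BIZJOURNALS_EMAIL');
    $password = getenv('BIZJOURNALS_PASSWORD');

    if ($email === false || $email === '' || $password === false || $password === '') {
        return null;
    }

    return ['email' => $email, 'password' => $password];
}
```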

Run the example spider and single-article commands:

php bin/bizjournals-spider
php bin/bizjournals-spider https://www.bizjournals.com/news/commercial-real-estate 3 25
php bin/bizjournals-spider https://www.bizjournals.com/news/commercial-real-estate 3 25 --debug
php bin/bizjournals-article https://www.bizjournals.com/boston/news/2026/03/25/lender-s-95m-offer-is-winning-bid-for-back-bay-of.html
php bin/bizjournals-article https://www.bizjournals.com/boston/news/2026/03/25/lender-s-95m-offer-is-winning-bid-for-back-bay-of.html --debug --debug-dir=/tmp/bizjournals-debug

Architecture

  • Dprmc\BizJournals\Http\BizJournalsSession: owns the HTTP client, cookies, and login flow.
  • Dprmc\BizJournals\Http\ChromiumLoginAuthenticator: performs the login flow in a real Chromium browser so JavaScript executes and inputs are typed into the page.
  • Dprmc\BizJournals\Crawler\BizJournalsSpider: exposes crawlIndex() for section indexes and crawlArticle() for individual story pages.
  • Dprmc\BizJournals\Debug\DebugArtifactRecorder: saves response HTML and screenshots for each unique loaded URL when debug mode is enabled.
  • Dprmc\BizJournals\Parser\CategoryPageParser: extracts article URLs from a section page.
  • Dprmc\BizJournals\Parser\ArticleParser: extracts normalized article data from a story page.
  • Dprmc\BizJournals\Model\Story and SpiderResult: JSON-ready output objects.

Testing

  • PHPUnit is configured through phpunit.xml.dist.
  • The sample login test is in tests/Http/BizJournalsSessionTest.php.
  • The login test uses Symfony's mocked HTTP client, so it validates the session flow without making live requests to BizJournals.

Notes

  • The login form field names and success detection are configurable through Dprmc\BizJournals\Config\LoginConfig.
  • The default login URL now matches the captured BizJournals login page in development/login.html: https://www.bizjournals.com/bizjournals/login?r=%2F.
  • Authentication now uses a real Chromium-driven login flow instead of raw form posts. JavaScript renders the email/password steps and the automation types into the page; the resulting cookies are then imported back into the crawler session.
  • Story discovery currently uses URL heuristics for BizJournals article paths.
  • Index crawling supports a pageLimit value and currently expands pages using BizJournals' ?page=N pagination format.
  • Debug mode can be enabled with --debug; it saves .html and .png files for each newly loaded URL and reports the output directory in the index JSON.
  • Article extraction prefers JSON-LD metadata when available, then falls back to DOM selectors.
  • If BizJournals changes its login flow or adds further bot mitigation, the session layer is the place to adjust or replace the browser automation implementation.
  • On March 25, 2026, the commercial real estate section returned a Cloudflare mitigation response (403 with cf-mitigated: challenge) during verification. The framework now throws an explicit access-blocked exception in that case instead of returning an empty crawl.
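
The Cloudflare case in the last note can be detected from the status code and the cf-mitigated header alone. The following is a minimal sketch of such a check; ExampleAccessBlockedException and exampleAssertNotBlocked are hypothetical names for illustration, since this README does not give the name of the library's own exception class.

```php
<?php

/** Hypothetical exception type used only in this example. */
final class ExampleAccessBlockedException extends RuntimeException
{
}

/**
 * Throws when a response looks like a Cloudflare challenge
 * (HTTP 403 with a "cf-mitigated: challenge" header).
 *
 * @param array<string, string> $headers response headers keyed by name
 */
function exampleAssertNotBlocked(int $statusCode, array $headers, string $url): void
{
    // Header names are case-insensitive, so normalize before lookup.
    $normalized = array_change_key_case($headers, CASE_LOWER);
    $mitigated = strtolower(trim($normalized['cf-mitigated'] ?? ''));

    if ($statusCode === 403 && $mitigated === 'challenge') {
        throw new ExampleAccessBlockedException(
            sprintf('Access blocked by a Cloudflare challenge for %s', $url)
        );
    }
}
```

Failing loudly here keeps a blocked crawl distinguishable from a section page that simply has no articles.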