ramphor / rake
Rake 2.0 core crawling/migration library
1.3.5
2025-07-14 01:40 UTC
Requires
- php: >=7.4
- guzzlehttp/psr7: ^2.7.0
- nesbot/carbon: ^2.73|^3.9.0
- php-extended/php-html-object: ^6.1|^8.0
- php-http/httplug: ^2.1
- psr/log: ^1.0.1
- ramphor/httplug-guzzle6-adapter: ^2.0.2
- ramphor/php-html-parser: ^3.1.0
- ramphor/sql: 1.0.0
README
-
Context & Goals:
- Rake is a cross-platform data crawling/migration framework, being refactored for simplicity, plugin-ability, modern design patterns, extensibility, maintainability, and testability.
- There are 3 main components: wp-crawlflow (main app), rake (core library), and rake-wordpress-adapter (WordPress adapter).
-
Overall Architecture:
- The main app (wp-crawlflow) initializes the Rake App, which calls the config loader. The Tooth Factory creates multiple Tooth instances from the config.
- Each Tooth has multiple Data Sources (Feeds), each Feed has a Data Type (URL, CSV, Sitemap XML, JSON, etc.).
- Data Source uses a Reception to detect data type, assign mapping rules, and pass to the Parser.
- The Parser generates Feed Items (product, post, movie, etc.), which are classified and sent to the appropriate Processor Chain (Chain of Responsibility).
- The Processor Chain processes items sequentially (insert DB, backup, export, etc.), with outputs such as saving to DB, exporting EPUB, JSON, DOCX, CSV, etc.
- All files/images only store URLs in the resources table; a worker downloads files separately, checks for duplicates via checksum, and updates the resource and parent resource after download.
-
Applied Patterns & Managers:
- Factory Pattern: Tooth Factory, HTTP Client Manager, Database Driver Manager.
- Builder Pattern: Feed Item Builder.
- Chain of Responsibility: Processor Chain.
- Strategy Pattern: Selects appropriate parser, reception, client, driver.
- Registry/Service Locator: Central managers (Reception, Parser, Preset, Feed Item Builder, Processor, HTTP Client, Database Driver).
- Adapter Pattern: HTTP Client, Database Driver (standardized interfaces, easily replaceable).
- Command Pattern: Queue/Worker (each task is a command object, easy to retry, rollback, audit).
- State Pattern: Feed Item, Resource, Tooth (pending, processing, done, error, etc.).
- Decorator Pattern: Processor/Feed Item (wrap for logging, validation, caching, etc.).
- Observer/Event Bus: Event-driven, easy for logging, notification, auditing, integrating auxiliary systems.
-
Resource-centric:
- All data (product, post, project, category, image, etc.) is a resource, with URL/file, type, metadata, status, parent link, checksum, etc.
- All processing steps operate via resources.
-
Config Loader & Database Driver Loader:
- Prefer loading from files (rake.config.php, rake.db.php); if not available, use a transformer to fetch from the database.
- Tooth Factory and Database Driver Manager receive standard configs to create corresponding Tooth and driver instances.
-
HTTP Client & DB Driver:
- Supports multiple clients (Guzzle, Selenium, Webdriver, Web Bidi, Unirest, etc.) and drivers (wpdb, Eloquent, Doctrine, etc.); developers can register custom adapters.
- Selects suitable client/driver for each Feed/Tooth.
-
Technology Packages:
- symfony/dom-crawler: for DOM parsing.
- nesbot/carbon: for date/time handling.
-
Batch Mode, CLI-first:
- Optimized for automation, cronjobs, CI/CD, detailed logging, resume, retry, no UI required for the core.
- Separate GUI for mapping rules to inspect DOM and edit rules visually.
-
Diagram:
- Multiple Mermaid diagrams have been created to clearly show modules: Boot, App, Managers, Plugins, processing flows, resources, workers, event bus, patterns, processor groups, file deduplication, resource updates, etc.
- Ensures all output goes through resources, with patterns clearly highlighted.
-
Technical Evaluation:
- The architecture is robust, plugin-based, extensible, testable, maintainable, cross-platform, resource-centric, and event-driven.
- Points to note: complexity, need for validation tools, documentation, tests, state/rule/driver/event management.
-
Description for Rake 2.0:
- Proposed description emphasizes modularity, extensibility, event-driven, resource-centric, plugin-based, batch mode, CLI-first, integration of modern packages, suitable for large-scale, multi-source, cross-platform crawling/migration.
In summary: Rake 2.0 is a modern PHP framework for data crawling/migration, plugin-based, resource-centric, event-driven, fully applying design patterns, supporting cross-platform, extensible, maintainable, testable, and professional operation.