ramphor/rake

Rake 2.0 core crawling/migration library

1.3.5 2025-07-14 01:40 UTC

Last update: 2025-07-21 16:44:55 UTC


README

  1. Context & Goals:

    • Rake is a cross-platform data crawling/migration framework. It is being refactored for simplicity, pluggability, modern design patterns, extensibility, maintainability, and testability.
    • There are three main components: wp-crawlflow (main app), rake (core library), and rake-wordpress-adapter (WordPress adapter).
  2. Overall Architecture:

    • The main app (wp-crawlflow) initializes the Rake App, which calls the config loader. The Tooth Factory creates multiple Tooth instances from the config.
    • Each Tooth has multiple Data Sources (Feeds); each Feed has a Data Type (URL, CSV, Sitemap XML, JSON, etc.).
    • Each Data Source uses a Reception to detect the data type, assign mapping rules, and pass the data to the Parser.
    • The Parser generates Feed Items (product, post, movie, etc.), which are classified and sent to the appropriate Processor Chain (Chain of Responsibility).
    • The Processor Chain processes items sequentially (insert DB, backup, export, etc.), with outputs such as saving to DB, exporting EPUB, JSON, DOCX, CSV, etc.
    • Files and images are stored only as URLs in the resources table; a separate worker downloads them, checks for duplicates by checksum, and updates the resource and its parent resource after download.
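
The download-and-deduplicate step described above can be sketched as follows. This is a minimal illustration, not Rake's actual implementation: `ResourceTable`, `runDownloadWorker`, the row fields, the SHA-256 choice, and the `files/` path scheme are all assumptions, and the parent-resource update is left out for brevity. The fetch function is injected so the sketch runs without network access (PHP 8.0+).

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-in for the resources table (the real schema is not shown here).
final class ResourceTable
{
    /** @var array<int, array{url: string, status: string, checksum: ?string, path: ?string}> */
    public array $rows = [];

    public function add(string $url): int
    {
        $this->rows[] = ['url' => $url, 'status' => 'pending', 'checksum' => null, 'path' => null];
        return count($this->rows) - 1;
    }

    // Returns the stored path of an already-downloaded file with the same checksum, if any.
    public function pathForChecksum(string $checksum): ?string
    {
        foreach ($this->rows as $row) {
            if ($row['checksum'] === $checksum) {
                return $row['path'];
            }
        }
        return null;
    }
}

// Processes every pending row and returns how many distinct files were stored.
function runDownloadWorker(ResourceTable $table, callable $fetch): int
{
    $stored = 0;
    foreach (array_keys($table->rows) as $id) {
        if ($table->rows[$id]['status'] !== 'pending') {
            continue;
        }
        $body = $fetch($table->rows[$id]['url']);
        $checksum = hash('sha256', $body);
        $path = $table->pathForChecksum($checksum);
        if ($path === null) {
            // A real worker would write $body to disk here; the path scheme is assumed.
            $path = 'files/' . $checksum;
            $stored++;
        }
        $table->rows[$id]['status'] = 'done';
        $table->rows[$id]['checksum'] = $checksum;
        $table->rows[$id]['path'] = $path;
    }
    return $stored;
}

// Usage: two URLs serve identical bytes, so only two files are stored for three resources.
$table = new ResourceTable();
$first = $table->add('https://example.com/cover.jpg');
$second = $table->add('https://example.com/cover-copy.jpg');
$third = $table->add('https://example.com/other.jpg');
$fetch = fn (string $url): string => str_contains($url, 'cover') ? 'cover-bytes' : 'other-bytes';
$stored = runDownloadWorker($table, $fetch);
```

Keeping the checksum on the resource row means a duplicate can be detected with a single lookup before anything is written to disk.
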
  3. Applied Patterns & Managers:

    • Factory Pattern: Tooth Factory, HTTP Client Manager, Database Driver Manager.
    • Builder Pattern: Feed Item Builder.
    • Chain of Responsibility: Processor Chain.
    • Strategy Pattern: selects the appropriate parser, reception, client, and driver at runtime.
    • Registry/Service Locator: Central managers (Reception, Parser, Preset, Feed Item Builder, Processor, HTTP Client, Database Driver).
    • Adapter Pattern: HTTP Client, Database Driver (standardized interfaces, easily replaceable).
    • Command Pattern: Queue/Worker (each task is a command object that is easy to retry, roll back, and audit).
    • State Pattern: Feed Item, Resource, Tooth (pending, processing, done, error, etc.).
    • Decorator Pattern: Processor/Feed Item (wrap for logging, validation, caching, etc.).
    • Observer/Event Bus: event-driven design that simplifies logging, notification, auditing, and integration with auxiliary systems.
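
As an illustration of two of the patterns listed above, here is a minimal sketch of a sequential Processor Chain with a logging Decorator around one processor. The interface and class names (`Processor`, `ProcessorChain`, `MarkSaved`, `ExportJson`, `LoggingProcessor`) are hypothetical and are not taken from the Rake codebase; feed items are modelled as plain arrays for brevity (PHP 8.0+).

```php
<?php
declare(strict_types=1);

// Hypothetical processor contract: take a feed item, return the (possibly modified) item.
interface Processor
{
    public function process(array $item): array;
}

// Runs processors in the order they were added, passing each result to the next.
final class ProcessorChain
{
    /** @var Processor[] */
    private array $processors = [];

    public function add(Processor $processor): self
    {
        $this->processors[] = $processor;
        return $this;
    }

    public function run(array $item): array
    {
        foreach ($this->processors as $processor) {
            $item = $processor->process($item);
        }
        return $item;
    }
}

// Stand-in for a "save to DB" step; it only flags the item.
final class MarkSaved implements Processor
{
    public function process(array $item): array
    {
        $item['saved'] = true;
        return $item;
    }
}

// Stand-in for an export step; it attaches a JSON rendering of the title.
final class ExportJson implements Processor
{
    public function process(array $item): array
    {
        $item['json'] = json_encode(['title' => $item['title']]);
        return $item;
    }
}

// Decorator: records which processor ran, then delegates to it.
final class LoggingProcessor implements Processor
{
    /** @var string[] */
    public array $log = [];

    public function __construct(private Processor $inner)
    {
    }

    public function process(array $item): array
    {
        $this->log[] = get_class($this->inner);
        return $this->inner->process($item);
    }
}

$logged = new LoggingProcessor(new ExportJson());
$chain = (new ProcessorChain())->add(new MarkSaved())->add($logged);
$result = $chain->run(['type' => 'post', 'title' => 'Hello']);
```

Because the decorator implements the same interface as the processor it wraps, the chain does not need to know whether logging is enabled.
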
  4. Resource-centric:

    • All data (product, post, project, category, image, etc.) is a resource, with URL/file, type, metadata, status, parent link, checksum, etc.
    • All processing steps operate via resources.
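
A resource as described above could be modelled roughly like this. The class name `RakeResource`, its constructor shape, and the transition rules are assumptions made for illustration; only the field list and the status names (pending, processing, done, error) come from this README (PHP 8.0+).

```php
<?php
declare(strict_types=1);

// Hypothetical resource model. "RakeResource" avoids PHP's soft-reserved word "resource".
final class RakeResource
{
    // Assumed transition rules; the README names the states but not how they connect.
    private const TRANSITIONS = [
        'pending'    => ['processing'],
        'processing' => ['done', 'error'],
        'error'      => ['pending'],
        'done'       => [],
    ];

    public string $status = 'pending';
    public ?string $checksum = null;

    public function __construct(
        public string $url,
        public string $type,
        public array $metadata = [],
        public ?RakeResource $parent = null
    ) {
    }

    // Moves to the next status, rejecting transitions that the table above does not allow.
    public function moveTo(string $next): void
    {
        if (!in_array($next, self::TRANSITIONS[$this->status], true)) {
            throw new LogicException("Cannot move from {$this->status} to {$next}");
        }
        $this->status = $next;
    }
}

// Usage: an image resource linked to its parent post.
$post = new RakeResource('https://example.com/post/1', 'post', ['title' => 'Hello']);
$image = new RakeResource('https://example.com/img.jpg', 'image', [], $post);
$image->moveTo('processing');
$image->moveTo('done');
```
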
  5. Config Loader & Database Driver Loader:

    • Configuration is loaded from files (rake.config.php, rake.db.php) when they are present; otherwise a transformer fetches it from the database.
    • Tooth Factory and Database Driver Manager receive standard configs to create corresponding Tooth and driver instances.
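
The file-first, database-fallback loading order can be sketched as below. The function name `loadRakeConfig`, the assumption that the config file returns a PHP array, and the `teeth` key are all hypothetical; the database transformer is represented by a plain callable.

```php
<?php
declare(strict_types=1);

/**
 * Loads config from a PHP file if it exists, otherwise from a fallback source.
 * Assumes the file uses the common "return [...];" convention.
 */
function loadRakeConfig(string $file, callable $fromDatabase): array
{
    if (is_file($file) && is_readable($file)) {
        $config = require $file;
        if (is_array($config)) {
            return $config;
        }
    }
    // Hypothetical transformer that builds the same structure from database rows.
    return $fromDatabase();
}

// Usage with a temporary file standing in for rake.config.php.
$file = tempnam(sys_get_temp_dir(), 'rake');
file_put_contents($file, "<?php return ['teeth' => [['name' => 'blog']]];");
$fromFile = loadRakeConfig($file, fn (): array => ['teeth' => []]);

// Once the file is gone, the loader falls back to the callable.
unlink($file);
$fromDb = loadRakeConfig($file, fn (): array => ['teeth' => [['name' => 'from-db']]]);
```

Returning the same array shape from both sources lets the Tooth Factory stay unaware of where the configuration came from.
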
  6. HTTP Client & DB Driver:

    • Supports multiple clients (Guzzle, Selenium, WebDriver, WebDriver BiDi, Unirest, etc.) and drivers (wpdb, Eloquent, Doctrine, etc.); developers can register custom adapters.
    • A suitable client/driver is selected for each Feed/Tooth.
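
Registering a custom adapter and selecting a client per Feed might look like the following sketch. `HttpClientAdapter`, `HttpClientManager`, `StaticClient`, and the `client` config key are hypothetical names, not Rake's real API; the stub client returns a fixed body so the example needs no network or third-party package (PHP 8.0+).

```php
<?php
declare(strict_types=1);

// Hypothetical common interface that every HTTP client adapter would implement.
interface HttpClientAdapter
{
    public function get(string $url): string;
}

// Registry of named factories; a Feed's config picks one by name.
final class HttpClientManager
{
    /** @var array<string, callable> */
    private array $factories = [];

    public function register(string $name, callable $factory): void
    {
        $this->factories[$name] = $factory;
    }

    public function create(string $name): HttpClientAdapter
    {
        if (!isset($this->factories[$name])) {
            throw new InvalidArgumentException("Unknown HTTP client: {$name}");
        }
        return ($this->factories[$name])();
    }
}

// Stub adapter used for illustration; a real one would wrap Guzzle, Selenium, etc.
final class StaticClient implements HttpClientAdapter
{
    public function __construct(private string $body)
    {
    }

    public function get(string $url): string
    {
        return $this->body;
    }
}

$manager = new HttpClientManager();
$manager->register('static', fn (): HttpClientAdapter => new StaticClient('<html>stub</html>'));

// Assumed per-feed config: the "client" key names the adapter to use.
$feedConfig = ['url' => 'https://example.com/feed', 'client' => 'static'];
$client = $manager->create($feedConfig['client']);
$body = $client->get($feedConfig['url']);
```
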
  7. Technology Packages:

    • symfony/dom-crawler: for DOM parsing.
    • nesbot/carbon: for date/time handling.
  8. Batch Mode, CLI-first:

    • The core is optimized for automation (cron jobs, CI/CD) with detailed logging, resume, and retry; it requires no UI.
    • A separate GUI handles mapping rules, letting users inspect the DOM and edit rules visually.
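
Resume and retry in a batch run can be sketched as follows. The function `runBatch`, the task IDs, and the default of three attempts are assumptions for illustration; tasks are plain callables here rather than full command objects, and the list of finished IDs would in practice be persisted between runs.

```php
<?php
declare(strict_types=1);

/**
 * Runs each task up to $maxAttempts times, skipping IDs finished in an earlier run.
 *
 * @param array<string, callable> $tasks task ID => callable that throws on failure
 * @param string[] $done IDs already completed (the resume state)
 * @return array{done: string[], failed: string[]}
 */
function runBatch(array $tasks, array $done, int $maxAttempts = 3): array
{
    $failed = [];
    foreach ($tasks as $id => $task) {
        if (in_array($id, $done, true)) {
            continue; // resume: already completed in a previous run
        }
        for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
            try {
                $task();
                $done[] = $id;
                continue 2; // success: move on to the next task
            } catch (Throwable $e) {
                error_log("[{$id}] attempt {$attempt} failed: {$e->getMessage()}");
            }
        }
        $failed[] = $id; // every attempt failed
    }
    return ['done' => $done, 'failed' => $failed];
}

// Usage: feed-a was finished earlier, feed-b succeeds on its second attempt, feed-c never succeeds.
$calls = 0;
$tasks = [
    'feed-a' => function (): void {
        throw new RuntimeException('must not run again');
    },
    'feed-b' => function () use (&$calls): void {
        if (++$calls < 2) {
            throw new RuntimeException('temporary error');
        }
    },
    'feed-c' => function (): void {
        throw new RuntimeException('permanent error');
    },
];
$result = runBatch($tasks, ['feed-a']);
```
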
  9. Diagram:

    • Multiple Mermaid diagrams show the modules clearly: Boot, App, Managers, Plugins, processing flows, resources, workers, event bus, patterns, processor groups, file deduplication, resource updates, etc.
    • The diagrams show that all output goes through resources and highlight the patterns used.
  10. Technical Evaluation:

    • The architecture is robust, plugin-based, extensible, testable, maintainable, cross-platform, resource-centric, and event-driven.
    • Points to watch: overall complexity, and the need for validation tools, documentation, tests, and management of state, rules, drivers, and events.
  11. Description for Rake 2.0:

    • The proposed description emphasizes modularity, extensibility, an event-driven and resource-centric design, a plugin-based architecture, batch mode, a CLI-first workflow, and integration of modern packages. It positions Rake for large-scale, multi-source, cross-platform crawling/migration.

In summary: Rake 2.0 is a modern PHP framework for data crawling/migration. It is plugin-based, resource-centric, and event-driven, applies design patterns throughout, and is designed to be cross-platform, extensible, maintainable, and testable.