juvo/as-processor

Process huge datasets for import or sync with ease.


README

The AS Processor library is a robust synchronization and data chunking framework designed specifically for WordPress environments. Leveraging asynchronous task management through the Action Scheduler, it provides a flexible and efficient orchestration for large-scale data processing tasks, such as API synchronizations, file-based (CSV, Excel, JSON) imports, and seamless chunk-wise data management.

Core Features

  1. Data Chunking and Processing:

    • The library introduces a consistent chunking mechanism to split large datasets (from files or APIs) into smaller, manageable pieces. Each chunk is processed asynchronously, reducing memory usage and improving load balancing.
    • Multiple data sources are supported, including API endpoints, Excel files, CSV files, and JSON files.
  2. Asynchronous Task Management:

    • Powered by the awesome Action Scheduler, tasks can be managed asynchronously, ensuring smooth execution without blocking other site processes.
    • Tasks are queued for execution, with efficient handling of task timeouts, retries, and cancellations.
  3. Data Source Adaptability:

    • The library provides abstract, extensible classes for different data formats:
      • CSV Processor (CSV_Sync): Handles CSV imports split into chunks, supporting UTF-8 conversions, custom delimiters, optional headers, and efficient file cleanup.
      • Excel Processor (Excel): Manages Excel files with optional headers and the ability to process specific worksheets. Includes support for rich text cell values and row skipping.
      • JSON Processor (JSON): Enables chunked processing of JSON files and supports JSON Pointer for extracting specific portions of the data.
      • API Processor (API): Works with paginated APIs, automatically handles rate limiting, request intervals, and pagination (by page, offset, or URL).
  4. Highly Extensible and Customizable:

    • Abstract classes allow developers to implement their own data-fetching or processing methods tailored to their use case.
    • A robust foundation supports advanced features like progressive pagination, deep merging, lock management, and transient-based sync data storage.
  5. Reliable Sync Management:

    • Built-in sync lifecycle management, including hooks for starting, progressing, and completing synchronizations.
    • Automatic cleanup of completed tasks and retention of synchronization data for specified durations.
  6. Error Handling and Recovery:

    • Exception handling is seamlessly integrated at every critical stage, ensuring the process fails gracefully and any issues are recorded for diagnosis.
    • Built-in support for handling job timeouts, cancellations, and retries with appropriate sync lifecycle callbacks.

Use Cases

  • Importing Data: Import and process data from large datasets such as customer lists in Excel, product catalogs in JSON, or user data from CSV files.
  • API Data Sync: Synchronize large-scale data from external APIs, with built-in features like pagination handling and rate limiting.
  • Scheduled and Batch Processing: Execute complex batch processing tasks (like order processing or generating reports) asynchronously without affecting website performance.
  • Custom Data Processing Flows: Build scalable workflows for chunking and processing any large dataset with minimal memory consumption and maximum fault tolerance.

Technical Highlights

  1. Core Components:

    • The Chunker trait schedules and processes data chunks, leveraging Action Scheduler for asynchronous execution.
    • The Sync class serves as an abstract base that manages the entire synchronization process lifecycle.
    • The Sync_Data trait provides reliable synchronization data storage using WordPress's transients, enabling flexible data sharing and locking mechanisms.
  2. Focus on Memory Efficiency:

    • Uses iterative processing (e.g., PHP Generators, chunk-based file reading) to minimize memory usage and ensure scalability for large datasets.
  3. Modular Design:

    • The clean separation of concerns allows developers to modify or extend specific aspects like chunk processing or API fetching independently.
  4. Action Scheduler Integration:

    • Uses the Action Scheduler's event-driven architecture to manage background tasks effectively, along with group-level synchronization to maintain context-aware processing.

Ideal For

This library is an excellent choice for WordPress developers and enterprises dealing with:

  • High-volume data integration from various sources.
  • Automating repetitive and resource-intensive synchronization tasks.
  • Optimizing workflows for applications that rely on large datasets or slow APIs.
  • Frequent e-commerce product imports

AS Processor combines the power of modern WordPress development practices, Action Scheduler's asynchronous processing capabilities, and a highly abstracted framework to enable seamless and fault-tolerant data processing at scale. It offers developers a solid foundation for building efficient, scalable synchronization solutions tailored to their applications' unique requirements.