mbaynton / batch-framework
An API and foundational algorithms for efficient processing of long-running jobs that can be divided into small work units.
Requires
- php: >=5.4.0
- psr/http-message: ^1.0
Requires (Dev)
- guzzlehttp/psr7: ^1.3
- phpunit/phpunit: ~4.8
- satooshi/php-coveralls: ^1.0
This package is auto-updated.
Last update: 2024-11-29 05:33:08 UTC
This library offers foundational algorithms and structures to enable scenarios where long-running tasks that can be divided into small work units get processed progressively by successive calls to a PHP script on a webserver. This avoids exceeding script execution time and network timeout limitations often found in web execution environments.
It emphasizes minimal overhead of the framework itself so that jobs complete as quickly as possible.
Features include:
- Support for processing the batch of work units across the lifespan of many requests when being run in a web environment. This prevents individual responses and webserver processes from running longer than is desirable.
- Efficient determination of when to stop running more work units based on past work units' runtimes so that requests complete around a target duration.
- Attention to minimizing the amount of state data and the number of trips to a backing store involved in handing off between requests.
- Support for parallel execution of embarrassingly parallelizable problems, i.e. those where individual work units do not need to communicate or coordinate with each other during their execution. See parallelization for details.
- No requirement to use a particular PHP framework, but with an awareness of controller and service design patterns.
As this is a library, it offers no functionality "out of the box."
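The time-budgeted stopping mentioned in the feature list can be illustrated with a minimal sketch. It is not the framework's actual algorithm: the function name `runWithinBudget`, its parameters, and the use of a simple running mean as the runtime estimate are all assumptions made for illustration. The clock is passed in as a callable so the logic can be exercised without real delays.

```php
<?php
/**
 * Runs work units until the next one would likely overrun the target duration.
 * Hypothetical helper for illustration; not part of batch-framework's API.
 *
 * @param callable[] $units         Work units, each a callable taking no arguments.
 * @param float      $targetSeconds Desired wall-clock duration for this request.
 * @param callable   $clock         Returns the current time in seconds as a float.
 * @return int Number of units executed before stopping.
 */
function runWithinBudget(array $units, $targetSeconds, callable $clock)
{
    $start = $clock();
    $completed = 0;
    $totalUnitTime = 0.0;

    foreach ($units as $unit) {
        $elapsed = $clock() - $start;
        // Predict the next unit's cost from the mean runtime of the units run so far.
        $expected = $completed > 0 ? $totalUnitTime / $completed : 0.0;
        // Always run at least one unit so every request makes progress.
        if ($completed > 0 && $elapsed + $expected > $targetSeconds) {
            break; // The next unit would probably overrun; hand off to a new request.
        }
        $before = $clock();
        $unit();
        $totalUnitTime += $clock() - $before;
        $completed++;
    }

    return $completed;
}
```

In a real request, `$clock` could be `function () { return microtime(true); }`. A mean is only one possible estimator; tasks whose work units vary widely in cost might want a more conservative prediction.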
Dependencies
- PHP 5.4+
- Psr\Http\Message\ResponseInterface, available via Composer, and any implementation of this interface.
Documentation / Examples
The docs here will help you get started writing code that's meant to work with this framework. If you encounter gaps or have questions about the info here, you might want to refer to the Curator application on GitHub, which uses and was written alongside this framework.
Documentation is accurate for v1.0.0.
Terms and their definitions
- Runnable:
  One of the user-implemented classes that models a long-running task. An instance of a Runnable models and provides the implementation for a single unit of work. It is its run() method whose body does the actual work/computation to further the Task's progress.
- Runnable Iterator:
  A PHP \Iterator (please extend AbstractRunnableIterator) that produces Runnables appropriate to the segment of the overall task that should be performed, given as input the Runner rank and number of Runnables already performed on prior incarnations of the Runner.
- Runner:
  The server-side code that runs the show. The Runner pumps the Runnable iterator for new Runnables, launches them, monitors the time runnables are taking and the time remaining to decide when to stop, dispatches Runnable and Task execution events to Task and Controller callbacks, and initiates Runnable and Task intermediate result aggregation.
- Runner id:
  An integer uniquely identifying a given logical Runner. Clients are expected to create as many corresponding Runner requests as the framework's current Task instance state supports, initially assigning a unique integer id that the client has not used before to each of these requests.
- Runner incarnation:
  Logically, the framework tries to create the illusion of n Runnable units of work that are executed by x Runners (concurrently if x > 1). However, in order to prevent the HTTP request that started the Runnable from remaining incomplete for longer than desired, the framework may stop launching new Runnables, let the Runner stop doing work early, and signal the client to make a successive request with the same Runner id. Each HTTP request that's handled by starting a Runner bearing the same Runner id is called an incarnation of the runner with that id. All incarnations of a Runner also will share the same Runner rank.
- Runner rank:
  A number uniquely identifying a given Runner within a Task. If your Task only supports one concurrent Runner, this will always be 0. If your Task declares support for n concurrent Runners, this will range from 0 to n-1. Differs from Runner id in that its range is always 0 to n-1.
- Task:
  One of the user-implemented classes that models a long-running task. The Task serves as a factory for Runnable Iterators, tells the framework what to do with results of Runnables, may intervene in the event a Runnable experiences a throwable error or exception, provides methods to reduce multiple Runnable results to simpler intermediate results, and provides a method to translate the complete Runnable results to a Psr\Http\Message\ResponseInterface.
- Task instance state:
  One of the user-implemented classes that models a long-running task. Task instance state captures the variable properties of a given task execution, such as where to find inputs to operate on, who (in terms of PHP session id) is currently running this Task, how large the Task is estimated to be (in terms of Runnables), and how many concurrent Runners the Task supports. Typically, one can extend the TaskInstanceState class, which handles most everything but your task's unique inputs. Note that this class is not intended to be used to capture Runnable output.
This framework primarily provides an implementation of the Runner in the class AbstractRunner.
A complete system leveraging this library will typically include a concrete extension of AbstractRunner to interface with your application's persistence layer (e.g., database), and a controller or other script making use of the HttpRunnerControllerTrait to handle incoming requests and interface with your application's session layer.
Coding a long-running task typically involves setting up the following components:
- An implementation of TaskInterface.
- An extension of AbstractRunnableIterator to serve Runnables.
- An implementation of RunnableInterface to do the work units.
- An extension of TaskInstanceState to provide input properties specific to the job.
Parallelization: using multiple runners
Strictly speaking, this framework supports concurrent execution of more than one runnable from the same Task at a time. But, in order to do concurrent runnables, lots of other code must support this, too:
- Your extension of AbstractRunner must implement its methods in a concurrency-safe manner. In particular, AbstractRunner::retrieveRunnerState() and AbstractRunner::finalizeRunner() should read and write to their underlying storage in a way that does not cause corruption or lost writes should several instances for the same Task instance run simultaneously.
- Your client must be programmed to send multiple concurrent batch runner requests.
- The work you want to do must be embarrassingly parallelizable. Each runnable can produce output, but runnables cannot take other runnables' output from the Task as input or otherwise interfere with each other if they access a shared resource.
- Your Task instance state's getNumRunners() must return more than 1 to declare concurrent support for more than 1 Runner.
- The Runnable iterator constructed by your Task must take the Runner rank into account and be able to assign a portion of the total Runnables to each Runner rank, as evenly as possible, with each Runnable unit of work being given out to one of the Runners exactly once.
- Your overall application (request controller, etc.) must not be impacted by several simultaneous requests from the same user, and must not be holding the PHP session lock when the runnables are executing.
Why is the Task's final result always an HTTP response?
Packaging the batch run's overall result in a standard HTTP response format enables
applications to receive requests and decide whether or not to defer them to a batch task.
In either case, the HTTP response that the client is expecting is ultimately generated. This
works well when clients are implemented using libraries that support request middleware
and the Promise pattern. The request middleware watches for raw responses that indicate
a batch task is necessary, and rather than resolving the client application code's Promise
with this incomplete raw response, launches Runner requests until it obtains the result HTTP
response, which it finally resolves the original Promise with.
License
MIT