cognesy/instructor-php

Structured data extraction in PHP, powered by LLMs

v0.6.6 2024-04-30 09:25 UTC

README

Structured data extraction in PHP, powered by LLMs. Designed for simplicity, transparency, and control.

What is Instructor?

Instructor is a library that allows you to extract structured, validated data from unstructured text or OpenAI style chat sequence arrays. It is powered by Large Language Models (LLMs).

Instructor for PHP is inspired by the Instructor library for Python created by Jason Liu.


Instructor in Other Languages

Check out implementations in other languages below:

If you want to port Instructor to another language, please reach out to us on Twitter; we'd love to help you get started!

How Instructor Enhances Your Workflow

Instructor introduces three key enhancements compared to direct API usage.

Response Model

You just specify a PHP class to extract data into via the 'magic' of LLM chat completion. And that's it.

Instructor reduces the brittleness of code that extracts information from textual data by leveraging structured LLM responses.

Instructor helps you write simpler, easier-to-understand code - you no longer have to define lengthy function call definitions or write code to assign returned JSON to target data objects.

Validation

The response model generated by the LLM can be automatically validated against a set of rules. Currently, Instructor supports only Symfony validation.

You can also provide a context object to use enhanced validator capabilities.

Max Retries

You can set the number of retry attempts for requests.

Instructor will repeat the request in case of a validation or deserialization error, up to the specified number of times, trying to get a valid response from the LLM.

Get Started

Installing Instructor is simple. Run the following command in your terminal and you're on your way to a smoother data handling experience!

composer require cognesy/instructor-php

Usage

Basic example

This is a simple example demonstrating how Instructor retrieves structured information from provided text (or chat message sequence).

The response model class is a plain PHP class with type hints specifying the types of the object's fields.

use Cognesy\Instructor\Instructor;

// Step 0: Create .env file in your project root:
// OPENAI_API_KEY=your_api_key

// Step 1: Define target data structure(s)
class Person {
    public string $name;
    public int $age;
}

// Step 2: Provide content to process
$text = "His name is Jason and he is 28 years old.";

// Step 3: Use Instructor to run LLM inference
$person = (new Instructor)->respond(
    messages: [['role' => 'user', 'content' => $text]],
    responseModel: Person::class,
);

// Step 4: Work with structured response data
assert($person instanceof Person); // true
assert($person->name === 'Jason'); // true
assert($person->age === 28); // true

echo $person->name; // Jason
echo $person->age; // 28

var_dump($person);
// Person {
//     name: "Jason",
//     age: 28
// }    

NOTE: Instructor only supports classes / objects as response models. If you want to extract simple types or enums, you need to wrap them in a Scalar adapter - see the Extracting Scalar Values section below. If you want to define the shape of data at runtime, you can use structures (see the Structures section).

Validation

Instructor validates the results of the LLM response against the validation rules specified in your data model.

For further details on available validation rules, check Symfony Validation constraints.

use Symfony\Component\Validator\Constraints as Assert;

class Person {
    public string $name;
    #[Assert\PositiveOrZero]
    public int $age;
}

$text = "His name is Jason, he is -28 years old.";
$person = (new Instructor)->respond(
    messages: [['role' => 'user', 'content' => $text]],
    responseModel: Person::class,
);

// if the resulting object does not validate, Instructor throws an exception

Max Retries

If the maxRetries parameter is provided and the LLM response does not meet the validation criteria, Instructor will make subsequent inference attempts until the results meet the requirements or maxRetries is reached.

Instructor uses the validation errors to inform the LLM about the problems identified in the response, so that the LLM can try to self-correct in the next attempt.

use Symfony\Component\Validator\Constraints as Assert;

class Person {
    #[Assert\Length(min: 3)]
    public string $name;
    #[Assert\PositiveOrZero]
    public int $age;
}

$text = "His name is JX, aka Jason, he is -28 years old.";
$person = (new Instructor)->respond(
    messages: [['role' => 'user', 'content' => $text]],
    responseModel: Person::class,
    maxRetries: 3,
);

// if all LLM's attempts to self-correct the results fail, Instructor throws an exception

Alternative ways to call Instructor

You can call the request() method to set the parameters of the request and then call get() to get the response.

use Cognesy\Instructor\Instructor;

$instructor = (new Instructor)->request(
    messages: "His name is Jason, he is 28 years old.",
    responseModel: Person::class,
);
$person = $instructor->get();

You can also initialize Instructor with a request object.

use Cognesy\Instructor\Instructor;
use Cognesy\Instructor\Data\Request;

$instructor = (new Instructor)->withRequest(new Request(
    messages: "His name is Jason, he is 28 years old.",
    responseModel: Person::class,
))->get();

Partial results

You can define an onPartialUpdate() callback to receive partial results, which can be used to start updating the UI before the LLM completes the inference.

NOTE: Partial updates are not validated. The response is only validated after it is fully received.

use Cognesy\Instructor\Instructor;

function updateUI($person) {
    // Here you get a partially completed Person object; update the UI with the partial result
}

$person = (new Instructor)->request(
    messages: "His name is Jason, he is 28 years old.",
    responseModel: Person::class,
    options: ['stream' => true]
)->onPartialUpdate(
    fn($partial) => updateUI($partial)
)->get();

// Here you get completed and validated Person object
$this->db->save($person); // ...for example: save to DB

Shortcuts

String as Input

You can provide a string instead of an array of messages. This is useful when you want to extract data from a single block of text and want to keep your code simple.

// Usually, you work with sequences of messages:

$value = (new Instructor)->respond(
    messages: [['role' => 'user', 'content' => "His name is Jason, he is 28 years old."]],
    responseModel: Person::class,
);

// ...but if you want to keep it simple, you can just pass a string:

$value = (new Instructor)->respond(
    messages: "His name is Jason, he is 28 years old.",
    responseModel: Person::class,
);

Extracting Scalar Values

Sometimes we just want to get quick results without defining a class for the response model, especially if we're trying to get a straight, simple answer in the form of a string, integer, boolean, or float. Instructor provides a simplified API for such cases.

use Cognesy\Instructor\Extras\Scalars\Scalar;
use Cognesy\Instructor\Instructor;

$value = (new Instructor)->respond(
    messages: "His name is Jason, he is 28 years old.",
    responseModel: Scalar::integer('age'),
);

var_dump($value);
// int(28)

In this example, we're extracting a single integer value from the text. You can also use Scalar::string(), Scalar::boolean() and Scalar::float() to extract other types of values.

Extracting Enum Values

Additionally, you can use the Scalar adapter to extract one of the provided options by using Scalar::enum().

use Cognesy\Instructor\Extras\Scalars\Scalar;
use Cognesy\Instructor\Instructor;

enum ActivityType : string {
    case Work = 'work';
    case Entertainment = 'entertainment';
    case Sport = 'sport';
    case Other = 'other';
}

$value = (new Instructor)->respond(
    messages: "His name is Jason, he currently plays Doom Eternal.",
    responseModel: Scalar::enum(ActivityType::class, 'activityType'),
);

var_dump($value);
// enum(ActivityType::Entertainment)

Extracting Sequences of Objects

Sequence is a wrapper class that can be used to represent a list of objects to be extracted by Instructor from the provided context.

It is usually more convenient not to create a dedicated class with a single array property just to handle a list of objects of a given class.

An additional, unique feature of sequences is that they can be streamed per completed item in the sequence, rather than on every property update.

class Person
{
    public string $name;
    public int $age;
}

$text = <<<TEXT
    Jason is 25 years old. Jane is 18 yo. John is 30 years old
    and Anna is 2 years younger than him.
TEXT;

$list = (new Instructor)->respond(
    messages: [['role' => 'user', 'content' => $text]],
    responseModel: Sequence::of(Person::class),
    options: ['stream' => true]
);

See more about sequences in the Sequences section.

Specifying Data Model

Type Hints

Use PHP type hints to specify the type of extracted data.

Use nullable types to indicate that a given field is optional.

class Person {
    public string $name;
    public ?int $age;
    public Address $address;
}

DocBlock type hints

You can also use PHP DocBlock style comments to specify the type of extracted data. This is useful when you want to specify property types for the LLM, but can't or don't want to enforce types at the code level.

class Person {
    /** @var string */
    public $name;
    /** @var int */
    public $age;
    /** @var Address $address person's address */
    public $address;
}

For more details on DocBlocks, see the PHPDoc documentation.

Typed Collections / Arrays

PHP currently does not support generics or type hints that specify the type of array elements.

Use PHP DocBlock style comments to specify the type of array elements.

class Person {
    // ...
}

class Event {
    // ...
    /** @var Person[] list of extracted event participants */
    public array $participants;
    // ...
}

Complex data extraction

Instructor can retrieve complex data structures from text. Your response model can contain nested objects, arrays, and enums.

use Cognesy\Instructor\Instructor;

// define data structures to extract data into
class Person {
    public string $name;
    public int $age;
    public string $profession;
    /** @var Skill[] */
    public array $skills;
}

class Skill {
    public string $name;
    public SkillType $type;
}

enum SkillType : string {
    case Technical = 'technical';
    case Other = 'other';
}

$text = "Alex is 25 years old software engineer, who knows PHP, Python and can play the guitar.";

$person = (new Instructor)->respond(
    messages: [['role' => 'user', 'content' => $text]],
    responseModel: Person::class,
);

// data is extracted into an object of given class
assert($person instanceof Person); // true

// you can access object's extracted property values
echo $person->name; // Alex
echo $person->age; // 25
echo $person->profession; // software engineer
echo $person->skills[0]->name; // PHP
echo $person->skills[0]->type->value; // technical
// ...

var_dump($person);
// Person {
//     name: "Alex",
//     age: 25,
//     profession: "software engineer",
//     skills: [
//         Skill {
//              name: "PHP",
//              type: SkillType::Technical,
//         },
//         Skill {
//              name: "Python",
//              type: SkillType::Technical,
//         },
//         Skill {
//              name: "guitar",
//              type: SkillType::Other
//         },
//     ]
// }

Changing LLM model and options

You can specify the model and other options that will be passed to the OpenAI / LLM endpoint.

use Cognesy\Instructor\Instructor;
use Cognesy\Instructor\Clients\OpenAI\OpenAIClient;

// OpenAI auth params
$yourApiKey = Env::get('OPENAI_API_KEY'); // use your own API key

// Create instance of OpenAI client initialized with custom parameters
$client = new OpenAIClient(
    $yourApiKey,
    baseUri: 'https://api.openai.com', // you can change base URI
    organization: '',
    connectTimeout: 3,
    requestTimeout: 30,
);

// Get Instructor with the default client component overridden with your own
$instructor = (new Instructor)->withClient($client);

$user = $instructor->respond(
    messages: "Jason (@jxnlco) is 25 years old and is the admin of this project. He likes playing football and reading books.",
    responseModel: User::class,
    model: 'gpt-3.5-turbo',
    options: ['stream' => true ]
);

Support for language models and API providers

Instructor offers out-of-the-box support for the following API providers:

  • Anthropic
  • Anyscale
  • Azure OpenAI
  • Fireworks AI
  • Groq
  • Mistral
  • Ollama (on localhost)
  • OpenAI
  • OpenRouter
  • Together AI

For usage examples, check Hub section or examples directory in the code repository.

Using DocBlocks as Additional Instructions for LLM

You can use PHP DocBlocks (/** */) to provide additional instructions for the LLM at the class or field level, for example to clarify what you expect or how the LLM should process your data.

Instructor extracts PHP DocBlock comments from class and property definitions and includes them in the response model specification sent to the LLM.

Using PHP DocBlock instructions is not required, but sometimes you may want to clarify your intentions to improve the LLM's inference results.

/**
 * Represents a skill of a person and context in which it was mentioned. 
 */
class Skill {
    public string $name;
    /** @var SkillType $type type of the skill, derived from the description and context */
    public SkillType $type;
    /** Directly quoted, full sentence mentioning person's skill */
    public string $context;
}

Customizing Validation

ValidationMixin

You can use the ValidationMixin trait to easily add custom validation to your data objects.

use Cognesy\Instructor\Validation\Traits\ValidationMixin;

class User {
    use ValidationMixin;

    public int $age;
    public string $name;

    public function validate() : array {
        if ($this->age < 18) {
            return ["User has to be adult to sign the contract."];
        }
        return [];
    }
}

Validation Callback

Instructor uses the Symfony validation component to validate extracted data. You can use the #[Assert\Callback] annotation to build fully customized validation logic.

use Cognesy\Instructor\Instructor;
use Symfony\Component\Validator\Constraints as Assert;
use Symfony\Component\Validator\Context\ExecutionContextInterface;

class UserDetails
{
    public string $name;
    public int $age;
    
    #[Assert\Callback]
    public function validateName(ExecutionContextInterface $context, mixed $payload) {
        if ($this->name !== strtoupper($this->name)) {
            $context->buildViolation("Name must be in uppercase.")
                ->atPath('name')
                ->setInvalidValue($this->name)
                ->addViolation();
        }
    }
}

$user = (new Instructor)->respond(
    messages: [['role' => 'user', 'content' => 'jason is 25 years old']],
    responseModel: UserDetails::class,
    maxRetries: 2
);

assert($user->name === "JASON");

See the Symfony docs for more details on how to use the Callback constraint.

Internals

Lifecycle

As Instructor for PHP processes your request, it goes through several stages:

  1. Initialize and self-configure (with possible overrides defined by developer).
  2. Analyze classes and properties of the response data model specified by developer.
  3. Encode data model into a schema that can be provided to LLM.
  4. Execute request to LLM using specified messages (content) and response model metadata.
  5. Receive a response from LLM or multiple partial responses (if streaming enabled).
  6. Deserialize response received from LLM into originally requested classes and their properties.
  7. If the response contains incomplete or corrupted data, or errors are encountered during deserialization, create a feedback message for the LLM and request regeneration of the response.
  8. Execute the validations defined by the developer for the data model; if any of them fail, create a feedback message for the LLM and request regeneration of the response.
  9. Repeat steps 4-8 until the response passes validation or the specified retry limit is reached.
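The validate-and-retry behavior described above can be sketched in simplified form. Note that respondWithRetries() and its parameters are illustrative names, not Instructor's actual internals:

```php
// Simplified sketch of the validate-and-retry loop (steps 4-9).
// All names here are illustrative, not Instructor's actual internals.
function respondWithRetries(callable $callLlm, callable $validate, int $maxRetries): object
{
    $feedback = [];
    $attempt = 0;
    while (true) {
        $response = $callLlm($feedback);   // steps 4-6: request, receive, deserialize
        $errors = $validate($response);    // steps 7-8: check data, run validations
        if ($errors === []) {
            return $response;              // valid response - done
        }
        if (++$attempt > $maxRetries) {
            throw new RuntimeException('Validation failed: ' . implode('; ', $errors));
        }
        $feedback = $errors;               // step 9: feed errors back to the LLM
    }
}
```

The key point is that each failed attempt's validation errors become part of the next request, giving the LLM a chance to self-correct.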

Receiving notification on internal events

Instructor allows you to receive detailed information at every stage of request and response processing via events.

  • (new Instructor)->onEvent(string $class, callable $callback) method - receive a callback when the specified type of event is dispatched
  • (new Instructor)->wiretap(callable $callback) method - receive every event dispatched by Instructor; useful for debugging or performance analysis
  • (new Instructor)->onError(callable $callback) method - receive a callback on any uncaught error, so you can customize handling it, for example by logging the error or using a fallback mechanism to attempt recovery

Receiving events can help you monitor the execution process and makes it easier to understand and resolve any processing issues.

$instructor = (new Instructor)
    // see requests to LLM
    ->onEvent(RequestSentToLLM::class, fn($e) => dump($e))
    // see responses from LLM
    ->onEvent(ResponseReceivedFromLLM::class, fn($event) => dump($event))
    // see all events in console-friendly format
    ->wiretap(fn($event) => dump($event->toConsole()))
    // log errors via your custom logger
    ->onError(fn($request, $error) => $logger->log($error));

$instructor->respond(
    messages: "What is the population of Paris?",
    responseModel: Scalar::integer(),
);
// check your console for the details on the Instructor execution

Response Models

Instructor can process several types of input provided as a response model, giving you more flexibility in how you interact with the library.

The signature of Instructor's respond() method states that the responseModel can be a string, an object, or an array.

Handling string $responseModel value

If a string value is provided, it is used as the name of the response model class.

Instructor checks if the class exists and analyzes the class and property type information and doc comments to generate the schema needed to specify the LLM response constraints.

The best way to provide the name of the response model class is to use NameOfTheClass::class instead of a string literal, making it possible for the IDE to execute type checks, handle refactorings, etc.
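The ::class constant resolves to the fully qualified class name at compile time, which is why it is safer than a hand-written string:

```php
namespace App;

class Person {
    public string $name;
}

// The ::class constant yields the fully qualified class name,
// so IDEs and static analyzers can track renames and refactorings.
$model = Person::class;

var_dump($model); // string(10) "App\Person"
```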

Handling object $responseModel value

If an object value is provided, it is considered an instance of the response model. Instructor checks the class of the instance, then analyzes it and its property type data to specify the LLM response constraints.

Handling array $responseModel value

If an array value is provided, it is considered a raw JSON Schema, allowing Instructor to use it directly in LLM requests (after wrapping it in the appropriate context - e.g. a function call).

Instructor requires information about the class of each nested object in your JSON Schema so it can correctly deserialize the data into the appropriate type.

This information is available to Instructor when you pass $responseModel as a class name or an instance, but it is missing from a raw JSON Schema.

The current design uses the JSON Schema $comment field on a property to overcome this. Instructor expects the developer to use the $comment field to provide the fully qualified name of the target class to be used to deserialize property data of object or enum type.
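For example, a raw JSON Schema passed as a PHP array could carry the target class names in $comment fields like this (the YourApp\Person and YourApp\Address class names are purely illustrative):

```php
// Raw JSON Schema as a PHP array; $comment carries the fully qualified
// class name used to deserialize each object-typed node.
// YourApp\Person and YourApp\Address are illustrative names.
$schema = [
    'type' => 'object',
    '$comment' => 'YourApp\\Person',
    'properties' => [
        'name' => ['type' => 'string'],
        'address' => [
            'type' => 'object',
            '$comment' => 'YourApp\\Address',
            'properties' => [
                'city' => ['type' => 'string'],
            ],
        ],
    ],
];
```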

Response model contracts

Instructor also allows you to customize the processing of the $responseModel value by looking at the interfaces the class or instance implements:

  • CanProvideJsonSchema - implement to provide a JSON Schema for the response model, overriding Instructor's default approach of analyzing the $responseModel value's class information,
  • CanDeserializeSelf - implement to customize the way the response from LLM is deserialized from JSON into PHP object,
  • CanValidateSelf - implement to customize the way the deserialized object is validated,
  • CanTransformSelf - implement to transform the validated object into target value received by the caller (e.g. unwrap simple type from a class to a scalar value).

Additional Notes

PHP ecosystem does not (yet) have a strong equivalent of Pydantic, which is at the core of Instructor for Python.

To provide the essential functionality needed here, Instructor for PHP leverages:

Dependencies

Instructor for PHP is compatible with PHP 8.2 or later and, due to minimal dependencies, should work with any framework of your choice.

TODOs

  • Async support
  • Documentation

Contributing

If you want to help, check out some of the issues. All contributions are welcome - code improvements, documentation, bug reports, blog posts / articles, or new cookbooks and application examples.

License

This project is licensed under the terms of the MIT License.
