franzip / serp-page-serializer
Serialize/deserialize Search Engine Result Pages to JSON and XML (JMS/Serializer wrapper).
Requires
- php: >=8.0.0
- doctrine/annotations: 2.x.*
- jms/serializer: 3.x.*
Requires (Dev)
- mustangostang/spyc: 0.6.*@dev
- phpunit/phpunit: 11.x.*
Last update: 2024-11-14 15:16:57 UTC
SerpPageSerializer
Installing via Composer (recommended)
Install Composer in your project:
curl -s https://getcomposer.org/installer | php
Create a composer.json file in your project root:
{
    "require": {
        "franzip/serp-page-serializer": "1.0.*"
    }
}
Install the dependencies via Composer:
php composer.phar install
Constructor
$serpSerializer = new SerpPageSerializer($cacheDir = "serializer_cache");
Data type constraints
Serialization
The SerpPageSerializer->serialize() method accepts only a SerializableSerpPage object and returns a SerializedSerpPage object.
The serialized content is available through the SerializedSerpPage->getContent() method.
Before using the serializer, normalize your data as follows:
use Franzip\SerpPageSerializer\Models\SerializableSerpPage;

// assuming you have already extracted the data somehow
$serializableSerpPage = new SerializableSerpPage($engine, $keyword, $pageUrl, $pageNumber, $age, $entries);
Where:
- $engine (string): the Search Engine vendor (e.g. Google, Bing).
- $keyword (string): the keyword associated with the Search Engine page.
- $pageUrl (string): the url of the Search Engine page for the given keyword/pageNumber.
- $pageNumber (integer): the page number within the results of the given keyword search.
- $age (DateTime object): when the data were extracted.
- $entries (array): the core data (see below).
Every Search Engine result page entry has a tripartite structure:
- A title, usually highlighted in blue.
- A url.
- A textual snippet.
The $entries array must follow the schema above, with the sequential array index standing for the entry's position in the page:
array(
    array('url' => 'someurl', 'snippet' => 'somesnippet', 'title' => 'sometitle'),
    array('url' => 'someurl', 'snippet' => 'somesnippet', 'title' => 'sometitle'),
    array('url' => 'someurl', 'snippet' => 'somesnippet', 'title' => 'sometitle'),
    ...
);
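A quick pre-flight check on $entries can catch malformed data before it reaches the serializer. The sketch below is plain PHP and not part of the package: the function name isValidEntries is hypothetical, and it assumes that every entry needs string values for the url, snippet and title keys and that the outer array is indexed sequentially from 0.

```php
<?php

// Hypothetical helper, NOT part of franzip/serp-page-serializer.
// Assumption: each entry needs string 'url', 'snippet' and 'title' values,
// and the outer array is sequentially indexed starting at 0.
function isValidEntries(array $entries): bool
{
    // array_values() reindexes from 0; a mismatch means the keys were not sequential.
    if ($entries !== array_values($entries)) {
        return false;
    }

    foreach ($entries as $entry) {
        if (!is_array($entry)) {
            return false;
        }
        foreach (['url', 'snippet', 'title'] as $key) {
            if (!isset($entry[$key]) || !is_string($entry[$key])) {
                return false;
            }
        }
    }

    return true;
}

$entries = array(
    array('url' => 'someurl', 'snippet' => 'somesnippet', 'title' => 'sometitle'),
);

var_dump(isValidEntries($entries));                          // bool(true)
var_dump(isValidEntries(array(array('url' => 'someurl'))));  // bool(false)
```

The comparison against array_values() is used instead of array_is_list() because the latter only exists from PHP 8.1, while the package declares php >=8.0.0.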
Deserialization
The SerpPageSerializer->deserialize() method accepts only a SerializedSerpPage as input and returns a SerpPageJSON or a SerpPageXML object.
Usage (serialize data)
use Franzip\SerpPageSerializer\SerpPageSerializer;
use Franzip\SerpPageSerializer\Models\SerializableSerpPage;

$engine     = 'google';
$keyword    = 'foobar';
$pageUrl    = 'https://www.google.com/search?q=foobar';
$pageNumber = 1;
$age        = new \DateTime();
$age->setTimestamp(time());
$entries    = array(array('url' => 'www.foobar2000.org', 'title' => 'foobar2000', 'snippet' => 'blabla'), array(...), ...);

$serpSerializer  = new SerpPageSerializer();
$pageToSerialize = new SerializableSerpPage($engine, $keyword, $pageUrl, $pageNumber, $age, $entries);

// serialize() takes the SerializableSerpPage itself and returns a SerializedSerpPage;
// the serialized string is read back through getContent()
$serializedXMLData = $serpSerializer->serialize($pageToSerialize, 'xml');
var_dump($serializedXMLData->getContent());
/*
 * <?xml version="1.0" encoding="UTF-8"?>
 * <serp_page engine="google" page_number="1" page_url="https://www.google.com/search?q=foobar" keyword="foobar" age="2015-03-19">
 *   <entry position="1">
 *     <url>www.foobar2000.org</url>
 *     <title>foobar2000</title>
 *     <snippet>blabla</snippet>
 *   </entry>
 *   <entry position="2">
 *     ...
 *   </entry>
 * </serp_page>
 */

$serializedJSONData = $serpSerializer->serialize($pageToSerialize, 'json');
var_dump($serializedJSONData->getContent());
/*
 * {
 *   "engine": "google",
 *   "page_number": 1,
 *   "page_url": "https:\/\/www.google.com\/search?q=foobar",
 *   "keyword":"foobar",
 *   "age":"2015-03-19",
 *   "entries":[
 *     {
 *       "position": 1,
 *       "url": "www.foobar2000.org",
 *       "title": "foobar2000",
 *       "snippet": "blabla"
 *     },
 *     {
 *       "position": 2,
 *       ...
 *     },
 *     ...
 *   ]
 * }
 */
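The JSON shape in the sample output above can be approximated with plain json_encode(), which may help when writing fixtures or checking what the serializer is expected to emit. This is only an illustration and does not use or reproduce the package's internals: the helper name withPositions is hypothetical, and the rule that position equals the array index plus one is inferred from the sample output (first entry has position 1) rather than confirmed.

```php
<?php

// Plain-PHP illustration only; NOT the package's implementation.
// Hypothetical helper: turns 0-based array indexes into 1-based positions
// (assumption inferred from the sample output, where the first entry has position 1).
function withPositions(array $entries): array
{
    $positioned = [];
    foreach (array_values($entries) as $index => $entry) {
        $positioned[] = [
            'position' => $index + 1,
            'url'      => $entry['url'],
            'title'    => $entry['title'],
            'snippet'  => $entry['snippet'],
        ];
    }

    return $positioned;
}

$entries = array(
    array('url' => 'www.foobar2000.org', 'title' => 'foobar2000', 'snippet' => 'blabla'),
);

$page = [
    'engine'      => 'google',
    'page_number' => 1,
    'page_url'    => 'https://www.google.com/search?q=foobar',
    'keyword'     => 'foobar',
    'age'         => (new DateTime('2015-03-19'))->format('Y-m-d'),
    'entries'     => withPositions($entries),
];

// json_encode() escapes forward slashes by default, hence "https:\/\/..." in the output.
echo json_encode($page, JSON_PRETTY_PRINT), PHP_EOL;
```

Passing JSON_UNESCAPED_SLASHES to json_encode() would print the url without the backslashes; the decoded value is the same either way.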
Usage (deserialize data)
use Franzip\SerpPageSerializer\SerpPageSerializer;

$serpSerializer = new SerpPageSerializer();

$serpPageXML = $serpSerializer->deserialize($serializedXMLPage, 'xml');
var_dump($serializedXMLPage);
// object(Franzip\SerpPageSerializer\Models\SerializedSerpPage) (1) {
// ...
var_dump($serpPageXML);
// object(Franzip\SerpPageSerializer\Models\SerpPageXML) (6) {
// ...

$serpPageJSON = $serpSerializer->deserialize($serializedJSONPage, 'json');
var_dump($serializedJSONPage);
// object(Franzip\SerpPageSerializer\Models\SerializedSerpPage) (1) {
// ...
var_dump($serpPageJSON);
// object(Franzip\SerpPageSerializer\Models\SerpPageJSON) (6) {
// ...
TODOs
- Add a default $cacheDir to constructor.
- A decent exceptions system.
- Allow typechecking on deserialization by wrapping serialized strings in a dedicated class.
- Fix serialization tests.
- Fix deserialization tests.
- Rewrite docs.
- CSV serialization/deserialization support.
- Fix messy tests.
License
MIT Public License.