gm314 / diavazo
PHP 7 HTML Parser
Requires
- php: >=7.0.0
- gm314/common: master
This package is not auto-updated.
Last update: 2025-01-05 05:21:27 UTC
README
Diavazo is a wrapper around \DOMDocument and \DOMElement. It adds some useful
functionality for searching within descendants or querying by class. The
HTMLDocument class can load a string, a file, or a URL. Some basic search
methods are available as well.
For example, the method getElement("p .spanClass b.bClass") searches for
elements, classes, and a combination of both: it finds all <p> elements, all
elements with the class spanClass, and all <b class="bClass"> elements.
The result of such a search is an array of HTMLElement objects. These can be
queried in turn, with the difference that the search is applied only to the
element's own descendants.
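To make the selector semantics concrete, here is a minimal sketch that uses only PHP's built-in DOMDocument and DOMXPath. It is not Diavazo's implementation; the helper name selectorToXPath is hypothetical, and the translation to an XPath union is just one plausible way to get the behaviour described above.

```php
<?php
// Hypothetical helper, not part of Diavazo: translate a space-separated
// selector list such as "p .spanClass b.bClass" into one XPath union.
function selectorToXPath(string $selector): string
{
    $parts = [];
    foreach (preg_split('/\s+/', trim($selector), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Split "tag.class" into tag and class; either side may be empty.
        list($tag, $class) = array_pad(explode('.', $token, 2), 2, '');
        $path = '//' . ($tag !== '' ? $tag : '*');
        if ($class !== '') {
            // Match the class as a whole word inside the class attribute.
            $path .= "[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
        }
        $parts[] = $path;
    }
    return implode(' | ', $parts);
}

$html = '<html><body><p>one</p><span class="spanClass x">two</span>'
      . '<b class="bClass">three</b><b>four</b></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the text of every node matched by any of the three tokens.
$found = [];
foreach ($xpath->query(selectorToXPath('p .spanClass b.bClass')) as $node) {
    $found[] = $node->textContent;
}
```

The plain <b>four</b> element is not matched, because the token b.bClass requires both the tag and the class.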
Installation
composer require gm314/diavazo
Usage
use Diavazo\HTMLDocument;

$document = new HTMLDocument();

// load file
$document->loadFile("local.html");
$document->loadFile("http://mypage.com/test.html");

// load from string
$document->loadString("<html></html>");
HTMLDocument methods
$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");

// get element by id
$table = $document->getElementById("associateArrayTest");

// get element by tag name
$elementList = $document->getElementByTagName("div");

// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass">
$elementList = $document->getElement("p .spanClass b.bClass");

// xpath query
$title = $document->query("/html/head/title");

// get root (<html>)
$root = $document->getRootElement();
HTMLElement descendants methods
An HTMLElement is the result of queries such as getElementById. Further search
methods can be applied to the element; they search within all of its descendants.
The method getDescendantByName("td th") searches for several tags at once.
$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");
$table = $document->getElementById("table");

// will return the first tr (breadth-first search)
$table->getFirstDescendantByName("tr");

// will return all td and th elements
$tdList = $table->getDescendantByName("td th");

// will find all elements that have the class 'active'
$root = $document->getRootElement();
$elementsWithClass = $root->getDescendantWithClassName("active");

// will find all elements that have the class 'myClass' and are td or th elements
$elementsWithClass = $root->getDescendantWithClassName("myClass", "td th");

// will find all elements having only the class 'testClass'
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass");

// will find all elements having only the class 'testClass' and are td or th elements
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass", "td th");

// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass">
// that are descendants of #myId
$anyElement = $document->getElementById("myId");
$elementList = $anyElement->getElement("p .spanClass b.bClass");
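The difference between the regular and the strict class search can be illustrated with plain DOMElement objects. This is a sketch under the assumption that "strict" means the class attribute contains exactly the one requested class; the helper names classList, hasClass and hasOnlyClass are hypothetical and not part of Diavazo.

```php
<?php
// Hypothetical helpers illustrating the two matching rules; not Diavazo's code.

// Split the class attribute into individual class names.
function classList(DOMElement $el): array
{
    return preg_split('/\s+/', $el->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
}

// Regular match: the class is one of possibly several classes.
function hasClass(DOMElement $el, string $class): bool
{
    return in_array($class, classList($el), true);
}

// Strict match (assumed meaning): the class is the only class on the element.
function hasOnlyClass(DOMElement $el, string $class): bool
{
    return classList($el) === [$class];
}

$dom = new DOMDocument();
$dom->loadHTML('<div><p class="testClass">a</p><p class="testClass other">b</p><p>c</p></div>');

$loose = [];
$strict = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    if (hasClass($p, 'testClass')) {
        $loose[] = $p->textContent;
    }
    if (hasOnlyClass($p, 'testClass')) {
        $strict[] = $p->textContent;
    }
}
```

Under this reading, the second paragraph is found by the regular search but skipped by the strict one, because it carries an additional class.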
HTMLElement attribute methods
$document = new HTMLDocument();
$document->loadFile("myFile.html");
$table = $document->getElementById("myTable");

// returns the attribute value as a string, or null if the attribute does not exist
$table->getAttributeValue("align");
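For comparison, the built-in DOMElement::getAttribute returns an empty string for a missing attribute, so a null result needs an explicit existence check. The following sketch shows one way to get that behaviour with plain DOM classes; attributeOrNull is a hypothetical helper, not Diavazo's code.

```php
<?php
// Hypothetical helper: return the attribute value, or null if it is absent.
function attributeOrNull(DOMElement $el, string $name)
{
    return $el->hasAttribute($name) ? $el->getAttribute($name) : null;
}

$dom = new DOMDocument();
$dom->loadHTML('<table id="myTable" align="center"><tr><td>x</td></tr></table>');
$table = $dom->getElementsByTagName('table')->item(0);

$align = attributeOrNull($table, 'align');   // attribute is present
$border = attributeOrNull($table, 'border'); // attribute is missing
```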
Table to Array Converter
Diavazo can convert a table to an associative or index-based array. The associative variant uses the cells of the first row as keys.
$document = new HTMLDocument();
$document->loadFile("tabletest.html");
$table = $document->getElementById("myTableID");

$arrayConverter = new TableToArrayConverter($table);
$array = $arrayConverter->getAsAssociativeArray();

<table id="myTableID">
    <tr>
        <td>Key1</td>
        <td>Key2</td>
    </tr>
    <tr>
        <td>Value 1</td>
        <td>Value 2</td>
    </tr>
    ...
</table>

will result in:

$array = [
    [
        "Key1" => "Value 1",
        "Key2" => "Value 2"
    ],
    ...
]
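The conversion itself can be sketched with PHP's built-in DOM classes. This is an illustration of the idea (first row supplies the keys, every later row becomes one record), not the converter's actual code; the function name tableToAssociativeArray is hypothetical.

```php
<?php
// Hypothetical sketch, not Diavazo's implementation.
function tableToAssociativeArray(DOMElement $table): array
{
    // Collect the trimmed text of each <td>/<th> cell, row by row.
    $rows = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $child) {
            if ($child instanceof DOMElement && in_array($child->tagName, ['td', 'th'], true)) {
                $cells[] = trim($child->textContent);
            }
        }
        $rows[] = $cells;
    }

    // The first row supplies the keys; an empty table yields an empty array.
    $keys = array_shift($rows);
    if ($keys === null) {
        return [];
    }

    $result = [];
    foreach ($rows as $cells) {
        $record = [];
        foreach ($keys as $i => $key) {
            // Missing cells in short rows are filled with null.
            $record[$key] = $cells[$i] ?? null;
        }
        $result[] = $record;
    }
    return $result;
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<table id="myTableID"><tr><td>Key1</td><td>Key2</td></tr>'
    . '<tr><td>Value 1</td><td>Value 2</td></tr></table>'
);
$array = tableToAssociativeArray($dom->getElementsByTagName('table')->item(0));
```

Filling short rows with null is a design choice of this sketch; how Diavazo treats rows with fewer cells than the header row is not stated here.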
Table to Array using an extractor
The following example shows how to register an extractor. The closure is invoked
with the table data cell (<td>) and is expected to return the value that will be
added to the array. Here it gets the first <a> element and extracts its href attribute.
$document = $this->getDocument();
$table = $document->getElementById("extractorTest");

$arrayConverter = new TableToArrayConverter($table);
$arrayConverter->registerExtractor("columnName", function (HTMLElement $td) {
    $a = $td->getFirstDescendantByName("a");
    return $a->getAttributeValue("href");
});

$array = $arrayConverter->getAsAssociativeArray();
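The extractor idea can also be sketched without the library, using only built-in DOM classes. The function tableToArrayWithExtractors and its $extractors parameter (column name mapped to a callable) are hypothetical stand-ins for the converter, meant only to show how a per-column closure can replace the default text extraction.

```php
<?php
// Hypothetical stand-in for the converter: $extractors maps a column name to a
// callable that receives the <td> as a DOMElement and returns the cell value.
function tableToArrayWithExtractors(DOMElement $table, array $extractors): array
{
    $header = null;
    $result = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = $td;
        }
        if ($header === null) {
            // The first row supplies the column names.
            $header = array_map(function (DOMElement $td) {
                return trim($td->textContent);
            }, $cells);
            continue;
        }
        $record = [];
        foreach ($cells as $i => $td) {
            if (!isset($header[$i])) {
                continue; // ignore cells without a matching column name
            }
            $key = $header[$i];
            // Use the registered extractor if there is one, else the cell text.
            $record[$key] = isset($extractors[$key])
                ? $extractors[$key]($td)
                : trim($td->textContent);
        }
        $result[] = $record;
    }
    return $result;
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<table><tr><td>name</td><td>link</td></tr>'
    . '<tr><td>Docs</td><td><a href="/docs">open</a></td></tr></table>'
);
$table = $dom->getElementsByTagName('table')->item(0);

$rows = tableToArrayWithExtractors($table, [
    'link' => function (DOMElement $td) {
        $a = $td->getElementsByTagName('a')->item(0);
        // Guard against cells that contain no <a> element.
        return $a !== null ? $a->getAttribute('href') : null;
    },
]);
```

The null check inside the closure is worth copying into real extractors as well, since a cell without an <a> element would otherwise cause a call on a missing element.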