mattwright/urlresolver

PHP class that attempts to resolve URLs to a final, canonical link.

2.0 2019-01-18 00:59 UTC

This package is not auto-updated.

Last update: 2024-05-05 01:49:58 UTC


README

URLResolver.php is a PHP class that attempts to resolve URLs to a final, canonical link. On the web today, link shorteners, tracking codes and more can result in many different links that ultimately point to the same resource. By following HTTP redirects and parsing web pages for Open Graph and canonical URLs, URLResolver.php attempts to solve this issue.

Patterns Recognized

  • Follows 301, 302, and 303 redirects found in HTTP headers
  • Follows Open Graph URL <meta> tags found in web page <head>
  • Follows Canonical URL <link> tags found in web page <head>
  • Aborts download quickly if content type is not an HTML page

I am open to additional suggestions for improvement.
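To make the two in-page patterns concrete, here is a minimal sketch of reading an og:url <meta> tag and a rel=canonical <link> tag out of an HTML string. It is not how URLResolver.php is implemented (the changelog indicates the library relies on simple_html_dom); findDeclaredURLs() is a hypothetical helper, and the sketch assumes PHP's DOM extension is available.

```php
<?php
// Illustrative only: findDeclaredURLs() is a hypothetical helper and is
// not part of URLResolver.php. It assumes the DOM extension is enabled.
function findDeclaredURLs(string $html): array
{
    $found = ['og:url' => null, 'canonical' => null];

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // For brevity this scans the whole document rather than only <head>.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('property')) === 'og:url') {
            $found['og:url'] = $meta->getAttribute('content');
            break;
        }
    }
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'canonical') {
            $found['canonical'] = $link->getAttribute('href');
            break;
        }
    }
    return $found;
}

$html = '<html><head>'
      . '<meta property="og:url" content="http://www.example.com/article">'
      . '<link rel="canonical" href="http://www.example.com/article?id=1">'
      . '</head><body></body></html>';
$urls = findDeclaredURLs($html);
print $urls['og:url'] . "\n";
print $urls['canonical'] . "\n";
```

A page may declare both tags with different values, which is the situation the setPreferCanonicalURL() option in the API section addresses.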

Usage

Resolving a URL can be as easy as:

<?php require_once('URLResolver.php');

$resolver = new mattwright\URLResolver();
print $resolver->resolveURL('http://goo.gl/0GMP1')->getURL();

If you installed this library using Composer, you would change the first line above to:

<?php require_once('vendor/autoload.php');

However, in most cases you will want to perform a little extra setup. The following code sets a user agent to identify your crawler (otherwise the default will be used) and also designates a temporary file that can be used for storing cookies during the session. Some web sites will test the browser for cookie support, so this will enhance your results.

<?php require_once('URLResolver.php');
$resolver = new mattwright\URLResolver();

# Identify your crawler (otherwise the default will be used)
$resolver->setUserAgent('Mozilla/5.0 (compatible; YourAppName/1.0; +http://www.example.com)');

# Designate a temporary file that will store cookies during the session.
# Some web sites test the browser for cookie support, so this enhances results.
$resolver->setCookieJar('/tmp/url_resolver.cookies');

# resolveURL() returns an object that allows for additional information.
$url = 'http://goo.gl/0GMP1';
$url_result = $resolver->resolveURL($url);

# Test to see if any error occurred while resolving the URL:
if ($url_result->didErrorOccur()) {
	print "there was an error resolving $url:\n  ";
	print $url_result->getErrorMessageString();
}

# Otherwise, print out the resolved URL.  The HTTP status code will tell you
# additional information about the success/failure. For instance, if the
# link resulted in a 404 Not Found error, it would print '404: http://...'
# The successful status code is 200.
else {
	print $url_result->getHTTPStatusCode();
	print ': ';
	print $url_result->getURL();
}
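Since many different links can point to the same resource, one use of the resolver is grouping input links by their resolved URL. The following is a rough sketch under that assumption; groupByResolvedURL() is a hypothetical helper, not part of the library, and it relies only on resolveURL(), didErrorOccur() and getURL() as described in this README.

```php
<?php
// groupByResolvedURL() is a hypothetical helper, not part of URLResolver.php.
// $resolver is expected to be a configured mattwright\URLResolver instance.
function groupByResolvedURL($resolver, array $urls): array
{
    $groups = [];
    foreach ($urls as $url) {
        $url_result = $resolver->resolveURL($url);
        if ($url_result->didErrorOccur()) {
            // Links that could not be resolved are left out of the grouping.
            continue;
        }
        // Key each group by the resolved URL; the value lists the inputs.
        $groups[$url_result->getURL()][] = $url;
    }
    return $groups;
}

// Usage, assuming the library has been loaded as shown above:
// $resolver = new mattwright\URLResolver();
// print_r(groupByResolvedURL($resolver, ['http://goo.gl/0GMP1']));
```

Each resolveURL() call may make several HTTP requests, so a long list of links can take a while to process.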

Installation and Requirements

License

URLResolver.php is licensed under the MIT License, viewable in the source code.

Install with Composer

composer require mattwright/urlresolver

Download

Download URLResolver.php as a .tar.gz or .zip file.

Requirements

API

URLResolver()

$resolver = new mattwright\URLResolver();
Create the URL resolver object that you call additional methods on.

$resolver->resolveURL($url);
$url is the link you want to resolve.
Returns a URLResolverResult object that contains the final, resolved URL.

$resolver->setUserAgent($user_agent);
Pass in a string that is sent to each web server to identify your crawler.

$resolver->setCookieJar($cookie_file); # Cookies are disabled by default
Pass in the path to a file used to store cookies during each resolveURL() call.
*** This file will be removed at the end of each resolveURL() call. ***
If no cookie file is set, cookies are disabled and results may suffer.
This file must not already exist. If it does, pass true as the second argument to enable overwriting.
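Because the cookie file must not already exist, a helper such as PHP's tempnam() is a poor fit here: it creates the file it names. One possible approach, sketched below, is to build a unique path without creating anything. uniqueCookieJarPath() is a hypothetical helper, not part of the library.

```php
<?php
// uniqueCookieJarPath() is a hypothetical helper, not part of URLResolver.php.
// It returns a path in the system temp directory without creating the file,
// since setCookieJar() expects a file that does not exist yet.
function uniqueCookieJarPath(string $prefix = 'url_resolver_'): string
{
    do {
        $path = sys_get_temp_dir() . DIRECTORY_SEPARATOR
              . $prefix . bin2hex(random_bytes(8)) . '.cookies';
    } while (file_exists($path)); // extremely unlikely, but cheap to check
    return $path;
}

$cookie_file = uniqueCookieJarPath();
print $cookie_file . "\n";

// With the library loaded, the path would then be handed to the resolver:
// $resolver->setCookieJar($cookie_file);
```

A unique path per process also avoids two crawlers sharing, and then deleting, the same cookie file.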

$resolver->setMaxRedirects($max_redirects); # Defaults to 10
Set the maximum number of URL requests to attempt during each resolveURL() call.

$resolver->setMaxResponseDataSize($max_bytes); # Defaults to 120000
Pass in an integer specifying the maximum number of bytes to download per request.
Multiple URL requests may occur during each resolveURL() call.
Setting this too low may limit the usefulness of results.

$resolver->setRequestTimeout($num_seconds); # Defaults to 30
Set the maximum amount of time, in seconds, any URL request can take.
Multiple URL requests may occur during each resolveURL() call.

$resolver->setPreferCanonicalURL($value); # Defaults to false
Set $value to true to prioritize canonical URL over Open Graph URL.

$resolver->isDebugMode($value); # Defaults to false
Set $value to true to enable debug mode and false to disable (the default).
This will print out each link visited, along with status codes and link types.
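The setters above can be combined into a single setup step. The sketch below tightens some of the defaults for a crawler that should fail fast; applyConservativeLimits() is a hypothetical helper, and the specific values are illustrative assumptions rather than recommendations from the library.

```php
<?php
// applyConservativeLimits() is a hypothetical helper, not part of
// URLResolver.php. The values are illustrative only.
function applyConservativeLimits($resolver): void
{
    $resolver->setMaxRedirects(5);          // documented default is 10
    $resolver->setRequestTimeout(10);       // documented default is 30 seconds
    $resolver->setPreferCanonicalURL(true); // documented default is false
}

// Usage, assuming the library has been loaded as shown in the Usage section:
// $resolver = new mattwright\URLResolver();
// applyConservativeLimits($resolver);
```

The response size limit is left at its default here, since lowering it may limit the usefulness of results.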

URLResolverResult()

$url_result = $resolver->resolveURL($url);
Retrieve the URLResolverResult() object representing the resolution of $url.

$url_result->getURL();
This is the best resolved URL we could obtain after following redirects.

$url_result->getHTTPStatusCode();
Returns the integer HTTP status code for the resolved URL.
Examples: 200 - OK (success), 404 - Not Found, 301 - Moved Permanently, ...

$url_result->hasSuccessHTTPStatus();
Returns true if the HTTP status code for the resolved URL is 200.

$url_result->hasRedirectHTTPStatus();
Returns true if the HTTP status code for the resolved URL is 301, 302, or 303.

$url_result->getContentType();
Returns the value of the Content-Type HTTP header for the resolved URL.
If the header is not provided, null is returned. Examples: text/html, image/jpeg, ...

$url_result->getContentLength();
Returns the size, in bytes, of the content at the resolved URL.
Determined only by the Content-Length HTTP header; null is returned if the header is absent.

$url_result->isOpenGraphURL();
Returns true if the resolved URL was marked as the Open Graph URL (og:url).

$url_result->isCanonicalURL();
Returns true if the resolved URL was marked as the Canonical URL (rel=canonical).

$url_result->isStartingURL();
Returns true if resolved URL was also the URL you passed to resolveURL().
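The three flags above can be folded into a short label for logging. This is a sketch only: describeResolution() is a hypothetical helper, and the README does not say whether more than one flag can be true at once, so the order of the checks is an assumption. The final branch likewise assumes that a URL matching none of the flags was reached through an HTTP redirect.

```php
<?php
// describeResolution() is a hypothetical helper, not part of URLResolver.php.
// $url_result is expected to be the object returned by resolveURL().
function describeResolution($url_result): string
{
    if ($url_result->isStartingURL()) {
        return 'unchanged';
    }
    if ($url_result->isOpenGraphURL()) {
        return 'og:url';
    }
    if ($url_result->isCanonicalURL()) {
        return 'rel=canonical';
    }
    // Assumption: nothing else matched, so an HTTP redirect led here.
    return 'redirect';
}

// Usage, assuming $url_result came from $resolver->resolveURL($url):
// print describeResolution($url_result) . ': ' . $url_result->getURL();
```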

$url_result->didErrorOccur();
Returns true if an error occurred while resolving the URL.
If this returns false, $url_result is guaranteed to have a status code.

$url_result->getErrorMessageString();
Returns an explanation of what went wrong if didErrorOccur() returns true.

$url_result->didConnectionFail();
Returns true if there was a connection error (no header or no body returned).
This may indicate a transient failure that is worth retrying at least once.
If this returns true, didErrorOccur() will return true as well.
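A retry on connection failure could look like the following sketch. resolveWithRetry() is a hypothetical helper, not part of the library; the attempt count and delay are arbitrary assumptions.

```php
<?php
// resolveWithRetry() is a hypothetical helper, not part of URLResolver.php.
// It retries only when didConnectionFail() reports a connection error.
function resolveWithRetry($resolver, string $url, int $max_attempts = 2, int $delay_seconds = 1)
{
    $max_attempts = max(1, $max_attempts); // always try at least once
    $url_result = null;
    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        $url_result = $resolver->resolveURL($url);
        if (!$url_result->didConnectionFail()) {
            break; // success, or an error that a retry is unlikely to fix
        }
        if ($attempt < $max_attempts && $delay_seconds > 0) {
            sleep($delay_seconds); // brief pause before the next attempt
        }
    }
    return $url_result;
}

// Usage, assuming the library has been loaded as shown in the Usage section:
// $url_result = resolveWithRetry($resolver, 'http://goo.gl/0GMP1');
```

The returned object should still be checked with didErrorOccur(), since the last attempt may also have failed.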

Changelog

  • v2.0 - January 17, 2019

    • Breaking change: namespaced the library for use with Composer PSR-4 autoloading
    • Add requested option to prefer canonical URL over Open Graph
    • Minor fixes / improvements
    • Upgrade simple_html_dom to 1.8.1
  • v1.1 - June 3, 2014

    • Support HTTP redirect code 303
  • v1.0 - December 3, 2011

    • Initial release supports HTTP header redirects, og:url and rel=canonical