larra-press/blog-poster

Automated poster for your Laravel-based blog. Configure the scraper to fetch data from the selected third-party resource, test it, set up a job, and enjoy.

1.1.2 2021-08-28 11:17 UTC

This package is auto-updated.

Last update: 2024-04-29 04:53:38 UTC


README

Automatically scrape third-party sources, post the results to your database, and download media files to your storage!


About Package

This package was developed by Alexey Khachatryan for personal use. The author then decided to release it publicly and created the LarraPress project, which aims to help developers build powerful blogs with the help of third-party packages. The purpose of this package is to scrape articles from third-party sources and post them on your blog. It is still in alpha, so there are many things to add and fix. Feel free to report bugs, ask questions, and create PRs.

So far this package has the following features:

  • Scrape posts from third party resources
  • Download selected media files
  • Remove useless elements from scraped articles
  • Work with lazy-loaded media files by replacing HTML tag attribute values
  • Detect duplicates
  • Scrape multi-value elements such as article tags
  • Create thumbnails
  • Test the job before publishing

Installation

composer require larra-press/blog-poster

Configuration

Publish package assets

php artisan vendor:publish --tag=larrapress-blog-poster

Add routes

LarraPress\BlogPoster\Facades\BlogPoster::routes();

ATTENTION! These routes MUST be added under an auth middleware so that not everybody can edit your blog poster. For example:

Route::middleware('auth')->group(function () {
    LarraPress\BlogPoster\Facades\BlogPoster::routes();
});

Run the migrations to create the required tables

php artisan migrate

Create your scraping job

You can create one scraping job class for all the jobs you'll create, or a separate job class for each scraping job. To create a scraping job class that will work for all of your scraping jobs:

php artisan make:scraping_job ScrapingJobName

Or you can create a separate job dedicated to CNN, or to whatever source you want:

php artisan make:scraping_job ScrapingCNNSource

What matters is not what you name them, but how you use them.

Queues

Because scraping a website takes some time, we use Laravel queues. If you don't want to use queues, you can override the parent ScrapingJob class, \LarraPress\BlogPoster\Jobs\ScrapingJob, and remove the queueable traits and interfaces.

Setting Up your first scraping job

ScrapingJob classes handle a ScrapingJobModel that holds all the configs. To create your scraping job, go to the dashboard. The URL of the dashboard depends on how and where you registered its routes. If you are not sure where they are, run this command:

php artisan route:list # on UNIX machines you can filter by adding "| grep blog-poster" without quotation marks

  1. Click on the Add New Job button

  2. Fill in the Job Properties form

  • Name - the name of the source. It's a hint just for you
  • Source - the full URL of the web page that lists the articles/posts
  • Icon - the icon of the source. You can manually enter an icon URL here or click the PARSE button to fetch it
  • Identifier In List - the selector of a single post in the list. It must be the selector of the anchor element
  • Category - tells the system in which category to post the articles coming from this source
  • Daily Limit - some sources post a lot of articles. You can set a daily limit for this source
  • Is Draft - the status of the scraping job. Useful while you are testing or when you decide to pause scraping from this source
  3. Add New Attribute

Each post/article has a title, body, image (with thumbnail), tags and so on. We call each of these elements an Article Attribute. If you want to parse titles, bodies and images, you need to create 3 Article Attributes.

In this box you can see 3 tabs:

Attribute Main Configs - the basic information about the attribute. It contains:

  • As Thumbnail - enable this if the selector points to an image and you want to turn it into a thumbnail. Note that the full-size file will not be downloaded. To have both the full image and the thumbnail, you need to create two Article Attributes
  • Is File - tells the Crawler that it must download the content the selector points to
  • Is HTML - this is useful for article bodies, where you can get comments in HTML or other unwanted stuff
  • Attribute Name - this name will be processed by the Crawler and then passed to the ScrapingJob class, where you can work with it. It will be the index of the attribute.
  • Attribute Selector - the CSS selector of the attribute
  • Attribute Type - there are 3 types so far: array, URL and default. If you want to scrape an image or some other file, set the type to URL. This tells the Crawler that the value is a URL, which is sometimes not a full URL but a relative one like this: /path/to/image.jpg. If you want to scrape article tags (there are many tags), use the array type. This tells the Crawler that the article contains many elements with this selector and that all of them must be scraped
  • Custom Tag Attribute - modern blogs often use lazy loading, so the real URL of the media is not in the SRC attribute but, let's say, in SRCSET. Set srcset here to get the URL from a different attribute.
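
To make the URL type and the Custom Tag Attribute more concrete, here is a minimal sketch in plain PHP of what such processing could look like. It is not the package's actual code: the function names resolveUrl, firstSrcsetUrl and extractMediaUrl are hypothetical, and the example.com URLs are placeholders.

```php
<?php

// Hypothetical helper: resolve a possibly relative URL against the page it was found on.
// Covers absolute, protocol-relative, root-relative and path-relative URLs only.
function resolveUrl(string $url, string $pageUrl): string
{
    if (preg_match('#^https?://#i', $url) === 1) {
        return $url; // already a full URL
    }

    $base = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'https';
    if (strpos($url, '//') === 0) {
        return $scheme . ':' . $url; // protocol-relative
    }

    $origin = $scheme . '://' . ($base['host'] ?? '')
        . (isset($base['port']) ? ':' . $base['port'] : '');
    if (strpos($url, '/') === 0) {
        return $origin . $url; // root-relative, like /path/to/image.jpg
    }

    // Path-relative: resolve against the directory of the current page.
    $directory = preg_replace('#/[^/]*$#', '/', $base['path'] ?? '/');
    return $origin . $directory . $url;
}

// Hypothetical helper: take the first candidate URL of a srcset value,
// e.g. "a.jpg 1x, b.jpg 2x" gives "a.jpg". URLs containing commas are not handled.
function firstSrcsetUrl(string $srcset): string
{
    $firstCandidate = trim(explode(',', $srcset)[0]);
    return preg_split('/\s+/', $firstCandidate)[0];
}

// Read the media URL from a custom attribute (falling back to src) and make it absolute.
function extractMediaUrl(DOMElement $element, string $customTagAttribute, string $pageUrl): string
{
    $attribute = $customTagAttribute !== '' ? $customTagAttribute : 'src';
    $raw = $element->getAttribute($attribute);
    if (strtolower($attribute) === 'srcset') {
        $raw = firstSrcsetUrl($raw);
    }
    return resolveUrl($raw, $pageUrl);
}

// Usage: the real image lives in srcset, src only holds a placeholder.
libxml_use_internal_errors(true); // keep HTML parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML('<img src="placeholder.gif" srcset="/media/cover.jpg 1x, /media/cover-2x.jpg 2x">');
$img = $doc->getElementsByTagName('img')->item(0);
echo extractMediaUrl($img, 'srcset', 'https://example.com/news/post-1'), PHP_EOL;
```

With srcset as the custom attribute, the placeholder in src is ignored and the relative path is turned into a full URL on the source host.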

Ignoring Elements

The original article body may contain elements that need to be removed, such as injected ads or referral links. Just create a new Ignoring Attribute and add the selector of the HTML tag you want to remove from the body or wherever.
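
The effect of an Ignoring Attribute can be sketched in plain PHP as follows. This only illustrates the idea and is not how the package necessarily implements it: the function name removeIgnoredElements is hypothetical, and the sketch uses XPath expressions because PHP's built-in DOM extension does not understand the CSS selectors that the dashboard expects. Character encoding handling is left out for brevity.

```php
<?php

// Hypothetical helper: remove every node matched by the given XPath expressions
// and return the cleaned markup (the inner HTML of <body>).
function removeIgnoredElements(string $html, array $ignoredXPaths): string
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    foreach ($ignoredXPaths as $expression) {
        $nodes = $xpath->query($expression);
        if ($nodes === false) {
            continue; // invalid expression, skip it
        }
        // Copy the node list first so that removing nodes does not disturb the iteration.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize only the children of <body>, not the wrapper loadHTML() adds.
    $body = $doc->getElementsByTagName('body')->item(0);
    $result = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $result .= $doc->saveHTML($child);
        }
    }
    return $result;
}

// Usage: strip an injected ad and a referral link from an article body.
$html = '<div class="article"><p>Real text.</p>'
    . '<div class="ad">Buy now</div><a class="ref" href="#">Partner</a></div>';
$clean = removeIgnoredElements($html, ['//div[@class="ad"]', '//a[@class="ref"]']);
echo $clean, PHP_EOL;
```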

Replacing Elements

If the body of the article you want to scrape has lazy-loaded media, you can use this feature. Unlike the Custom Tag Attribute field from the Attribute Main Configs tab, this feature works in a body or wherever. For example, if you want to scrape a single image and get the URL from a custom attribute, use Custom Tag Attribute. If you want to scrape an article body that contains lazy-loaded media, use Replacing Elements. The difference between these features is that Custom Tag Attribute works for a single element with a specific selector, while Replacing Elements works with CHILD elements inside the element with a specific selector.
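
As an illustration of what replacing attributes on child elements means, here is a minimal plain PHP sketch. It is an assumption about the general technique, not the package's internals: the function name replaceLazyAttributes, the data-src attribute and the XPath container expression are all hypothetical choices for this example.

```php
<?php

// Hypothetical helper: inside every container matched by $containerXPath, move the value
// of $fromAttribute (e.g. data-src) into $toAttribute (e.g. src) on all child elements.
function replaceLazyAttributes(
    string $html,
    string $containerXPath,
    string $fromAttribute,
    string $toAttribute
): string {
    libxml_use_internal_errors(true); // keep HTML parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    foreach ($xpath->query($containerXPath) ?: [] as $container) {
        // ".//*[@attr]" selects descendants of the container that carry the lazy attribute.
        $children = $xpath->query('.//*[@' . $fromAttribute . ']', $container) ?: [];
        foreach ($children as $child) {
            $child->setAttribute($toAttribute, $child->getAttribute($fromAttribute));
            $child->removeAttribute($fromAttribute);
        }
    }

    // Serialize only the children of <body>, not the wrapper loadHTML() adds.
    $body = $doc->getElementsByTagName('body')->item(0);
    $result = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $result .= $doc->saveHTML($child);
        }
    }
    return $result;
}

// Usage: the article body contains a lazy-loaded image whose real URL is in data-src.
$html = '<div class="body"><p>Text</p>'
    . '<img src="spinner.gif" data-src="https://example.com/real.jpg"></div>';
$fixed = replaceLazyAttributes($html, '//div[@class="body"]', 'data-src', 'src');
echo $fixed, PHP_EOL;
```

The container selector targets the body, while the attribute swap is applied to the elements inside it, which is the distinction from Custom Tag Attribute described above.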

Run scraping job

After you create a scraping job class and a model with all configs, you can start the scraping process. Just dispatch the ScrapingJob job and pass the newly created model to the job constructor.
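
A dispatch could look roughly like the sketch below. Several names here are assumptions rather than facts taken from the package: the App\Jobs namespace of the generated job class, the LarraPress\BlogPoster\Models namespace of ScrapingJobModel, the idea that ScrapingJobModel is an Eloquent model, and the helper function runScrapingJob itself. Check the generated class and the published package files for the real namespaces.

```php
<?php

use App\Jobs\ScrapingCNNSource;                    // assumed namespace of the class made with make:scraping_job
use LarraPress\BlogPoster\Models\ScrapingJobModel; // assumed namespace of the model

// Hypothetical helper: load the job config created in the dashboard and queue the scraping job.
function runScrapingJob(int $scrapingJobId): void
{
    // Assumes ScrapingJobModel is an Eloquent model, so findOrFail() is available.
    $model = ScrapingJobModel::findOrFail($scrapingJobId);

    // Standard Laravel job dispatch; the model is passed to the job constructor.
    ScrapingCNNSource::dispatch($model);
}
```

Remember that a queue worker (for example php artisan queue:work) has to be running for the dispatched job to be processed.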

TODO

  • Handle errors from Crawler and pass to the user while testing
  • Handle all errors from Crawler and properly log
  • Create queue management in dashboard to check the health and status of scraping queue
  • Write tests
  • Write full documentation

Security

If you discover any security related issues, please email alexey.khachatryan@gmail.com instead of using the issue tracker.

Credits

Used packages

Versioning

Version example: 1.0.0. The package version consists of 3 parts:

  • Global update
  • Feature
  • Bugfix