larra-press / blog-poster
Automatized poster for your Laravel based blog. Configure the scraper to properly get the data from the selected third party resource, test it, set a Job and enjoy.
Installs: 16
Dependents: 0
Suggesters: 0
Security: 0
Stars: 15
Watchers: 2
Forks: 0
Open Issues: 0
Language:CSS
Requires
- php: >=7.2
- ext-json: *
- illuminate/support: ^6.0|^7.0|^8.0
- intervention/image: ^2.6
- spatie/laravel-enum: ^2.5
- symfony/css-selector: ^5.3
- symfony/dom-crawler: ^5.3
Requires (Dev)
- orchestra/testbench: ^4.0|^5.0|^6.0
- phpunit/phpunit: ^8.4|^9.0
This package is auto-updated.
Last update: 2024-10-29 06:12:51 UTC
README
LarraPress BlogPoster
Autoscraping from third party sources, automatically posting to DB, downloading media files to your storage!
About Package
This package was developed by Alexey Khachatryan for personal usage, but then author decided to make it public package for world usage and created LarraPress Project This project will help developers to create powerful blogs and use third party package for better blog owning. The meaning of this package is to scrape articles from third party sources and post on your blog. There are many things to add and fix, because this is in Alpha version. Feel free to report bugs, ask questions and create PRs.
So far this package has the following features:
- Scrape posts from third party resources
- Download selected media files
- Remove useless elements from scraped articles
- Work with lazy-loaded media files by replacing HTML tag attributes values
- Detect duplications
- Scrape multi-value elements such as article tags
- Create thumbnails
- Test the job before publishing
Installation
composer require larra-press/blog-poster
Configuration
Publish package assets
php artisan vendor:publish --tag=larrapress-blog-poster
Add routes
LarraPress\BlogPoster\Facades\BlogPoster::routes();
ATTENTION! These routes MUST to be added under some auth middleware to prevent everybody to edit your blog poster. For example:
Route::middleware('auth', function (){ LarraPress\BlogPoster\Facades\BlogPoster::routes(); });
Run migrations, create required tables
php artisan migrate
Create your scraping job
You can create one scraping job class for all jobs you'll create or have different job classes for each of your scraping job. Creating a scraping job class which will work for all of yours scraping jobs
php artisan make:scraping_job ScrapingJobName
Or you can create a separated job special for CNN or whatever you want
php artisan make:scraping_job ScrapingCNNSource
No matter how you call them, but how you use them.
Queues
As website scraping job takes some time to finish we use laravel queues for proper work. If you don't want to use the queues you can override parent ScrapingJob class: \LarraPress\BlogPoster\Jobs\ScrapingJob and remove queueable traits and interfaces.
Setting Up your first scraping job
ScrapingJob classes handle ScrapingJobModel with all configs. To create your scraping job, go to dashboard. The URL of the dashboard depends on how and where you put its routes. If you not sure where are they kindly run this command:
php artisan route:list # on UNIX machines you can filter by adding "| grep blog-poster" without quotation marks
- Name - the name of the source, it's a hint just for you
- Source - the full URL of the web page where the articles/posts are. The list of posts
- Icon - the icon of the source. You can manually put some icon URL here or click on PARSE button to fetch it
- Identifier In List - the selector of single post in the list. You need to put a selector of anchor
- Category - tell the system in which category you want to post the articles came from this source
- Daily Limit - some of the source posts a lot of articles. You can set a daily limit for this source
- Is Draft - the status of the scraping job. Useful when you do some tests or decided to pause scraping from this source
Each post/article has title, body, image(with thumb), tags and so on. We call that elements here Article Attribute. If you want to parse titles, bodies and images you need to create 3 Article Attributes.
In this box you can see 3 tabs:
Attribute Main Configs - the basis of the information about attribute. It contains:
- As Thumbnail - if you set a selector to some image and want to make it a thumbnail - enable it. Note that the real file will not be downloaded. To have both of full image and thumb you need to create two Article Attributes
- Is File - let Crawler know that it must to download the content of the selector
- Is HTML - this is usefull for articles bodies where you can get comments in HTML or other bad staff
- Attribute Name - this name will be processed with a Crawler and then passed to the ScrapingJob class where you can play with it. It'll be the index of the attribute.
- Attribute Selector - the CSS selector of the attribute
- Attribute Type - There are 3 types so far: array, URL and default. If you want to scrape and image or some file, set the type to URL. By that way you tell Crawler that it's a URL. Sometimes there can be not full URL like this: /path/to/image.jpg If you want to scrape article tags (there are many tags) use array type. By this way you tell Crawler that there are many elements in the article with this selector and all of them must to be scraped
- Custom Tag Attribute - There are lazy loading in modern blogs. So the real URL of the media will not be in SRC attr, but, let's say, in SRCSET. Set srcset here to get URL from different attr.
You can have elements in original article body which need to be removed. Elements such as injected ads, or some referal links. Just create a new Ignoring Attribute and add that selector of the HTML tag you want to remove from body or whatever.
If the body of the article you want to scrape has lazy-loaded media you can use this feature. Unlike Custom Tag attribute field from Attribute Main Configs tab this feature will work in a body or whereever. For example if you want to scrape a single image and get the URL from custom attribute, you use Custom Tag attribute. If you want to scrape an article body, but it contains media with lazyloading, you need to use it. The differense between these features is that Custom Tag attribute work for a single element with specific selector, while Replaing Elements feature works with CHILD elements in the element with a specific selector.
Run scraping job
After you create a scraping job class a model with all configs, you can start the scraping process. Just dispatch the ScrapingJob job and pass the new created model to the job construct.
TODO
- Handle errors from Crawler and pass to the user while testing
- Handle all errors from Crawler and properly log
- Create queue management in dashboard to check the health and status of scraping queue
- Write tests
- Write full documentation
Security
If you discover any security related issues, please email alexey.khachatryan@gmail.com instead of using the issue tracker.
Credits
Used packages
Versioning
The version example: 1.0.0 The package version is divided by 3 parts:
- Global update
- Feature
- Bugfix