alirezasedghi / virastar
Virastar is a Persian text cleaner.
Requires
- php: >=7.0.0
This package is auto-updated.
Last update: 2024-11-15 20:32:34 UTC
README
Virastar
ویراستار
Virastar is a Persian text cleaner.
کتابخانه ویراستاری متن فارسی برای PHP
This repository is PHP port of brothersincode/virastar
Official website and Persian usage guide
Install
composer require alirezasedghi/virastar
Usage
// Require Composer's autoloader. require 'vendor/autoload.php'; // Using Virastar namespace. use Alirezasedghi\Virastar\Virastar; $virastar = new Virastar(); $text = "فارسي را كمی درست تر می نويسيم"; $cleaned = $virastar->cleanup($text); echo $cleaned; // Outputs: "فارسی را کمی درستتر مینویسیم"
Virastar([options])
options
Type: array
$virastar = new Virastar([ "fix_english_numbers": false, "cleanup_line_breaks": false ]);
Options and Specifications
Virastar comes with a list of options to control its behavior.
normalize_eol
default: true
- replace windows end of lines with unix eol (
\n
)
decode_html_entities
default: true
- converts numeral and selected html character-sets into original characters
fix_dashes
default: true
- replaces triple dash to mdash
- replaces double dash to ndash
fix_three_dots
default: true
- removes spaces between dots
- replaces three dots with ellipsis character
normalize_ellipsis
default: true
- replaces more than one ellipsis with one
- replaces (space|tab|zwnj) after ellipsis with one space
normalize_dates
default: true
- re-orders date parts with slash as delimiter
fix_english_quotes_pairs
default: true
- replaces english quote pairs (
“”
) with their persian equivalent («»
)
fix_english_quotes
default: true
- replaces english quote marks with their persian equivalent
fix_hamzeh
default: true
- replaces
ه
followed by (space|ZWNJ|lrm) follow byی
withهٔ
- replaces
ه
followed by (space|ZWNJ|lrm|nothing) follow byء
withهٔ
- replaces
هٓ
or single-characterۀ
with the standardهٔ
fix_hamzeh_arabic
default: false
- converts arabic hamzeh
ة
toهٔ
cleanup_rlm
default: true
- converts Right-to-left marks followed by persian characters to zero-width non-joiners (ZWNJ)
cleanup_zwnj
default: true
- converts all soft hyphens (
­
) into zwnj - removes more than one zwnj
- cleans zwnj after characters that don't connect to the next
- cleans zwnj before and after numbers, english words, spaces and punctuations
- removes unnecessary zwnj on start/end of each line
fix_arabic_numbers
default: true
- replaces arabic numbers with their persian equivalent
fix_english_numbers
default: true
- replaces english numbers with their persian equivalent
fix_numeral_symbols
default: true
- replaces english percent signs (U+066A)
- replaces dots between numbers into decimal separator (U+066B)
- replaces commas between numbers into thousands separator (U+066C)
fix_misc_non_persian_chars
default: true
- replaces arabic normal/swash kaf with its persian equivalent
- replaces arabic/urdu/pushtu/uyghur yeh with its persian equivalent
- replaces kurdish he with its persian equivalent
fix_punctuations
default: true
- replaces
,
,;
with its persian equivalent
fix_question_mark
default: true
- replaces question marks with its persian equivalent
fix_prefix_spacing
default: true
- puts zwnj between the word and the prefix:
mi*
,nemi*
,bi*
fix_suffix_spacing
default: true
- puts zwnj between the word and the suffix:
*ha
,*haye
*am
,*at
,*ash
,*ei
,*eid
,*eem
,*and
,*man
,*tan
,*shan
*tar
,*tari
,*tarin
*hayee
,*hayam
,*hayat
,*hayash
,*hayetan
,*hayeman
,*hayeshan
fix_suffix_misc
default: true
- replaces
ه
followed byئ
orی
, and then byی
, withهای
fix_spacing_for_braces_and_quotes
default: true
- removes inside spaces and more than one outside for
()
,[]
,{}
,“”
and«»
fix_spacing_for_punctuations
default: true
- removes space before punctuations
- removes more than one space after punctuations, except followed by new-lines
- removes space after colon that separates time parts
- removes space after dots in numbers
- removes space before some common domain tlds
- removes space between question and exclamation marks
- removes space between same marks
fix_diacritics
default: true
- cleans zwnj before diacritic characters
- cleans more than one diacritic characters
- clean spaces before diacritic characters
remove_diacritics
default: false
- removes all diacritic characters
fix_persian_glyphs
default: true
- converts incorrect persian glyphs to standard characters
fix_misc_spacing
default: true
- removes space before parentheses on misc cases
- removes space before braces containing numbers
cleanup_spacing
default: true
- replaces more than one space with just a single one
- cleans whitespace/zwnj between new-lines
cleanup_line_breaks
default: true
- cleans more than two contiguous line breaks
cleanup_begin_and_end
default: true
- removes space/tab/zwnj/nbsp from the beginning of the new-lines
- remove spaces, tabs, zwnj, direction marks and new lines from the beginning and end of text
markdown
markdown_normalize_braces
default: true
- remove spaces between
[]
and()
([text] (link)
into[text](link)
) - removes space between
!
and opening brace (! [alt](src)
into![alt](src)
) - remove spaces inside double
()
,[]
,{}
([[ text ]]
into[[text]]
) - remove spaces between double
()
,[]
,{}
([[text] ]
into[[text]]
)
markdown_normalize_lists
default: true
- removes extra lines between two items on a markdown list beginning with
-
,*
or#
skip_markdown_ordered_lists_numbers_conversion
default: false
- skips converting english numbers of ordered lists in markdown
aggressive editing
cleanup_extra_marks
default: true
- replaces more than one exclamation mark with just one
- replaces more than one english or persian question mark with just one
- re-orders consecutive marks:
?!
into!?
kashidas_as_parenthetic
default: true
- replaces kashidas to ndash in parenthetic
cleanup_kashidas
default: true
- converts kashida between numbers to ndash
- removes all kashidas between non-whitespace characters
extras
preserve_front_matter
default: true
- preserves front matter data in the text
preserve_HTML
default: true
- preserves all html tags in the text
preserve_comments
default: true
- preserves all html comments in the text
preserve_entities
default: true
- preserves all html entities in the text
preserve_URIs
default: true
- preserves all uri strings in the text
preserve_brackets
default: false
- preserves strings inside square brackets (
[]
)
preserve_braces
default: false
- preserves strings inside curly braces (
{}
)
preserve_nbsp
default: true
- preserves all no-break space entities in the text
License
This software is licensed under the MIT License. View the license.