octoberfa/virastar

Virastar is a Persian text cleaner.

1.0.1 2019-01-07 12:02 UTC

This package is auto-updated.

Last update: 2024-11-10 21:14:28 UTC


README

Virastar is a Persian text cleaner.

Latest Stable Version Latest Unstable Version Build Status Total Downloads License composer.lock

A php port of juvee/virastar

Install

composer require octoberfa/virastar

Usage

require "./vendor/autoload.php";
echo virastar("فارسي را كمی درست تر می نويسيم");
// Outputs: "فارسی را کمی درست‌تر می‌نویسیم"

virastar([text] [,options])

text

Type: string

String of Persian source to be cleaned.

options

Type: array

virastar("سلام 123" ,["fix_english_numbers":false]);
// Outputs:"سلام 123"

Options and Specifications

Virastar comes with a list of options to control its behavior.

all options are enabled by default.

  • normalize_eol

    • replace Windows end of lines with Unix EOL (\n)
  • decode_htmlentities

    • converts all HTML characterSets into original characters
  • fix_dashes

    • replace double dash to ndash and triple dash to mdash
  • fix_three_dots

    • replace three dots with ellipsis
  • fix_english_quotes_pairs

    • replace English quotes pairs (“”) with their Persian equivalent («»)
  • fix_english_quotes

    • replace English quotes, commas and semicolons with their Persian equivalent
  • fix_hamzeh

    • convert ه ی to هٔ
  • cleanup_rlm

    • converting Right-to-left marks followed by Persian characters to zero-width non-joiners (ZWNJ)
  • cleanup_zwnj

    • remove more than one zwnj chars
    • remove unnecessary zwnj chars that are succeeded/preceded by a space
    • clean zwnj chars after Persian characters that don't conncet to the next letter
    • clean zwnj chars before English characters
    • clean zwnj chars after and before punctuation
  • fix_arabic_numbers

    • replace Arabic numbers with their Persian equivalent
  • fix_english_numbers

    • replace English numbers with their Persian equivalent
    • should not replace English numbers in English phrases
  • skip_markdown_ordered_lists_numbers_conversion

    • skip converting English numbers of ordered lists in markdown
  • fix_misc_non_persian_chars

    • replace Arabic kaf and Yeh with its Persian equivalent
  • fix_question_mark

    • replace question marks with its Persian equivalent
  • fix_perfix_spacing

    • put zwnj between word and prefix (mi* nemi*)
  • fix_suffix_spacing

    • put zwnj between word and suffix (*tar *tarin *ha *haye)
  • fix_spacing_for_braces_and_quotes

    • fix spacing for () [] {} “” «» (one space outside, no space inside)
    • correct :;,.?! spacing (one space after and no space before)
  • cleanup_spacing

    • replace more than one space with just a single one
  • cleanup_begin_and_end

    • remove spaces, tabs, and new lines from the beginning and end of text
  • cleanup_extra_marks

    • replace more than one ! or ? mark with just one
  • kashidas_as_parenthetic

    • replace kashidas to ndash in parenthetic
  • cleanup_kashidas

    • remove all kashidas
  • preserve_HTML

    • preserve all HTML tags
  • preserve_URIs

    • preserve all URI links in the text
  • preserve_brackets

    • preserve strings inside square brackets ([])
  • preserve_braces

    • preserve strings inside curly braces ({})
  • preserve_code

    • preserve strings inside html code tag and markdown "```"
  • preserve_pre

    • preserve strings inside html pre tag

License

This software is licensed under the MIT License. View the license.