vgip/datanorm

Data normalization

1.2.1 2021-08-09 17:15 UTC

This package is auto-updated.

Last update: 2024-05-09 23:38:56 UTC


README

Normalizes data from several open sources.

Installation

System Requirements

PHP >= 7.4 is required; the latest stable version of PHP is recommended.

Composer

$ composer require vgip/datanorm

Functionality list

Transliteration from Ukrainian into English (KMU resolution #55 of 2010-01-27)

use Vgip\Datanorm\Transliteration\UkrEng\Cabmin2010;

$word = 'Єзгїґіпенєп';
$cabmin2010 = new Cabmin2010();
$wordTransliterated = $cabmin2010->transliterate($word);
echo $word.' -> '.$wordTransliterated;
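The position-dependent rules of this transliteration scheme can be illustrated with a simplified sketch. It is not the package's implementation: `transliterateLowerSketch` is a hypothetical helper that handles a single lowercase word only, and it assumes the source file is saved as UTF-8. It shows the two special cases of the scheme: є, ї, й, ю, я are rendered differently at the start of a word, and the pair "зг" becomes "zgh".

```php
<?php
// Hypothetical sketch, not part of vgip/datanorm: one lowercase word only.
function transliterateLowerSketch(string $word): string
{
    // Letters with a single rendering; soft sign and apostrophes are dropped.
    $map = [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'h', 'ґ' => 'g', 'д' => 'd',
        'е' => 'e', 'ж' => 'zh', 'з' => 'z', 'и' => 'y', 'і' => 'i', 'к' => 'k',
        'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r',
        'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'kh', 'ц' => 'ts',
        'ч' => 'ch', 'ш' => 'sh', 'щ' => 'shch',
        'ь' => '', "\u{02BC}" => '', "'" => '',
    ];
    // Letters rendered differently at the start of a word and elsewhere.
    $initial = ['є' => 'ye', 'ї' => 'yi', 'й' => 'y', 'ю' => 'yu', 'я' => 'ya'];
    $other   = ['є' => 'ie', 'ї' => 'i',  'й' => 'i', 'ю' => 'iu', 'я' => 'ia'];

    // Split into Unicode characters without relying on the mbstring extension.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    foreach ($chars as $i => $ch) {
        if ($ch === 'г' && $i > 0 && $chars[$i - 1] === 'з') {
            $out .= 'gh';            // "зг" is written "zgh", not "zh"
            continue;
        }
        if (isset($initial[$ch])) {
            $out .= ($i === 0) ? $initial[$ch] : $other[$ch];
            continue;
        }
        $out .= $map[$ch] ?? $ch;    // unknown characters pass through unchanged
    }
    return $out;
}
```

For real data, use `Cabmin2010::transliterate()` as shown above; the sketch ignores capitalization and multi-word input.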

Kyiv street getter from kga.gov.ua

Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga

Returns an array of normalized data from a CSV file.

Checks and normalizes street name data:

  • Converts all apostrophe variants to a single character (ʼ, U+02BC).
  • Checks the id for forbidden characters and duplicates. Errors are recorded in $this->warning.
  • Checks the street type against a whitelist. Unknown types are saved to $this->warning and $this->typeNotFound.
  • Checks the Kyiv district name against a whitelist. Unknown district names are saved to $this->warning and $this->districtNotFound.
  • Checks street names and normalizes them (if normalization data has been set in the $this->streetNormalization array).
  • Generates the $this->nameDouble array: street names that occur two or more times.
  • Generates $this->nameList: all unique street names.
  • Generates $this->typeCounter: the number of streets of each type in Kyiv.
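The whitelist check and the counters described in the list above can be sketched in plain PHP. This is only an illustration of the idea, not the package's code: `summarizeStreets` and the sample type keys are hypothetical, and the input is assumed to be a list of rows with `name` and `type_key` fields.

```php
<?php
// Hypothetical helper, not part of vgip/datanorm: illustrates the checks only.
function summarizeStreets(array $rows, array $typeWhitelist): array
{
    $typeCounter = [];   // street type => number of streets
    $typeNotFound = [];  // street types missing from the whitelist
    $nameCount = [];     // street name => number of occurrences

    foreach ($rows as $row) {
        $type = $row['type_key'];
        $name = $row['name'];

        if (!in_array($type, $typeWhitelist, true)) {
            $typeNotFound[$type] = true;
        }
        $typeCounter[$type] = ($typeCounter[$type] ?? 0) + 1;
        $nameCount[$name] = ($nameCount[$name] ?? 0) + 1;
    }

    return [
        'type_counter'   => $typeCounter,
        'type_not_found' => array_keys($typeNotFound),
        'name_list'      => array_keys($nameCount),
        // Names that occur two or more times.
        'name_double'    => array_keys(
            array_filter($nameCount, fn (int $n): bool => $n >= 2)
        ),
    ];
}
```

Counting in a single pass keeps the unique name list, the duplicates, and the per-type totals consistent with each other.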

Result array returned by the getCsvAsArray() method:

  • ['number'] - (int) serial number from file
  • ['id'] - (int) identifier from file
  • ['name_original'] - (string) street name from file
  • ['name'] - (string) normalized street name
  • ['type_name'] - (string) street type name from file
  • ['type_key'] - (string) street type key
  • ['district_string'] - (string) street districts from file
  • ['district_list'] - (array) street districts ['district_key', 'district_key', ...]
  • ['document_name'] - (string) Document on assigning the name of the object
  • ['document_date'] - (string) Date of the document on assigning the name of the object
  • ['document_number'] - (string) Number of the document on assigning the name of the object
  • ['document_title'] - (string) The title of the document on the naming of the object
  • ['place_description'] - (string) Location of the object in the city
  • ['name_old'] - (string) Former name of the object
  • ['type_old'] - (string) Former category (type) of the object
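As a usage sketch, the result can be regrouped by district. This assumes that getCsvAsArray() returns a list of rows, each carrying the keys listed above; the README does not spell out the outer structure, so treat that as an assumption. `groupNamesByDistrict` is a hypothetical helper, and the district keys in the usage example are made up.

```php
<?php
// Hypothetical helper: assumes $data is a list of rows with the keys listed above.
function groupNamesByDistrict(array $data): array
{
    $byDistrict = [];
    foreach ($data as $row) {
        // A street may belong to several districts, so it can appear under several keys.
        foreach ($row['district_list'] as $districtKey) {
            $byDistrict[$districtKey][] = $row['name'];
        }
    }
    return $byDistrict;
}

// Usage with made-up rows (real rows contain all the keys listed above).
$sample = [
    ['name' => 'Садова', 'district_list' => ['district_a']],
    ['name' => 'Лісова', 'district_list' => ['district_a', 'district_b']],
];
print_r(groupNamesByDistrict($sample));
```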

Example

use Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga;
use Vgip\Datanorm\Directory\Address\Country\Ukr\Address as DirUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\City\Kyiv as DirKyiv;
use Vgip\Datanorm\Directory\Lang\Ukr\Pattern as PatternUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalizedList;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalization;

$dirUkrAddress = DirUkrAddress::getInstance();
$dirKyiv = DirKyiv::getInstance();
$patternUkrAddress = PatternUkrAddress::getInstance();
$streetNormalizedListObj = StreetNormalizedList::getInstance();
$streetNormalizedList = $streetNormalizedListObj->getNormalization();

/** Get configuration and whitelist data */
$pathSourceFile = join(DIRECTORY_SEPARATOR, ['file', 'Reestr_vulits_Kyiva_2020_10_25.csv']);
$streetTypeList = $dirUkrAddress->getStreetTypeWhitelist();
$districtWhitelist = $dirKyiv->getDistrictWhitelist();
$patternStreetName = $patternUkrAddress->getStreetName();

/** Object initialization */
$streetNameKga = new StreetNameKga();

/** Set parameter */
$streetNameKga->setTypeWhitelist($streetTypeList);
$streetNameKga->setDistrictWhitelist($districtWhitelist);
$streetNameKga->setStreetNormalization($streetNormalizedList);
$streetNameKga->setPatternStreetName($patternStreetName);

/** Get a result (array) with normalized data */
$data = $streetNameKga->getCsvAsArray($pathSourceFile);

/** Get other data */
$res = [];
$res['type_list'] = $streetNameKga->getTypeList();
$res['type_counter'] = $streetNameKga->getTypeCounter();
$res['name_list'] = $streetNameKga->getNameList();
$res['name_double'] = $streetNameKga->getNameDouble();
$res['district_not_whitelist'] = $streetNameKga->getDistrictNotFound();

/** Get warnings if present */
$warning = $streetNameKga->getWarning();
$warningValue = $streetNameKga->getWarningValue();
if (null !== $warning && count($warning) > 0) {
    print_r($warning);
}
print_r($data);
print_r($res);

Ukrainian language

Apostrophe

The resulting data uses the Unicode character U+02BC (ʼ) as the Ukrainian apostrophe. All similar characters in the source data (' U+0027, ’ U+2019, etc.) are replaced with ʼ (U+02BC). U+02BC is the character used in Ukrainian domain names (ICANN).
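A minimal sketch of this replacement in plain PHP follows. `normalizeApostrophe` is a hypothetical helper, and the list of look-alike characters is an assumption; the package's own list may differ.

```php
<?php
// Hypothetical helper, not part of vgip/datanorm.
function normalizeApostrophe(string $text): string
{
    // Look-alike characters to replace (assumed list):
    // U+0027 apostrophe, U+2019 and U+2018 quotation marks,
    // U+0060 grave accent, U+00B4 acute accent.
    $variants = ["'", "\u{2019}", "\u{2018}", "\u{0060}", "\u{00B4}"];

    // str_replace() works on bytes, which is safe for UTF-8 input because
    // one character's byte sequence never matches inside another character.
    return str_replace($variants, "\u{02BC}", $text);
}

echo normalizeApostrophe("Солом'янська"), PHP_EOL;
```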

Street name normalization

  • Title or rank and surname - Академіка Єфремова, Генерала Авдєєнка, Маршала Бірюзова
  • Name and surname - Леоніда Бикова
  • Family relationships and surname - Братів Зерових, Родини Рудинських

Versioning

Data normalization follows Semantic Versioning.