vgip / datanorm
Data normalization
1.2.1
2021-08-09 17:15 UTC
Requires
- php: >=5.6.0
README
Data normalization from some open sources
Installation
System Requirements
PHP >= 7.4 is required; the latest stable version of PHP is recommended.
Composer
$ composer require vgip/datanorm
Functionality list
- Transliteration from Ukrainian into English KMU 2010-01-27 #55
- Kyiv street getter from kga.gov.ua
Transliteration from Ukrainian into English KMU 2010-01-27 #55
use Vgip\Datanorm\Transliteration\UkrEng\Cabmin2010;

$word = 'Єзгїґіпенєп';

$cabmin2010 = new Cabmin2010();
$wordTransliterated = $cabmin2010->transliterate($word);

echo $word.' -> '.$wordTransliterated;
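To illustrate what a position-dependent transliteration involves, here is a minimal sketch of the rules from resolution No. 55 for a single word. It is not the Cabmin2010 class: the function name transliterateSketch is hypothetical, the table covers only lowercase letters plus the five capitals that change at the start of a word, and other capital letters pass through untouched.

```php
<?php
// Sketch only: simplified single-word transliteration, not the library's code.

function transliterateSketch(string $word): string
{
    // Letters that are rendered differently at the start of a word.
    $atStart = [
        'Є' => 'Ye', 'Ї' => 'Yi', 'Й' => 'Y', 'Ю' => 'Yu', 'Я' => 'Ya',
        'є' => 'ye', 'ї' => 'yi', 'й' => 'y', 'ю' => 'yu', 'я' => 'ya',
    ];
    // Lowercase letters in any other position.
    // The soft sign and the apostrophe (U+02BC) are dropped.
    $elsewhere = [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'h', 'ґ' => 'g',
        'д' => 'd', 'е' => 'e', 'є' => 'ie', 'ж' => 'zh', 'з' => 'z',
        'и' => 'y', 'і' => 'i', 'ї' => 'i', 'й' => 'i', 'к' => 'k',
        'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p',
        'р' => 'r', 'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f',
        'х' => 'kh', 'ц' => 'ts', 'ч' => 'ch', 'ш' => 'sh', 'щ' => 'shch',
        'ю' => 'iu', 'я' => 'ia', 'ь' => '', "\u{02BC}" => '',
    ];

    // "зг" becomes "zgh" so that it cannot be confused with "zh" (ж).
    $word = str_replace(['Зг', 'зг'], ['Zgh', 'zgh'], $word);

    $result = '';
    // Split the UTF-8 string into single characters.
    $letters = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($letters as $index => $letter) {
        if ($index === 0 && isset($atStart[$letter])) {
            $result .= $atStart[$letter];
        } else {
            // Anything missing from the table (Latin letters included) is kept as is.
            $result .= $elsewhere[$letter] ?? $letter;
        }
    }

    return $result;
}

echo transliterateSketch('Єзгїґіпенєп'), PHP_EOL; // prints "Yezghigipeniep"
```

The sample word exercises the three special cases at once: Є at the start of the word, the зг combination, and є in the middle of the word.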
Kyiv street getter from kga.gov.ua
Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga
Returns an array of normalized data from a CSV file.
Checks and normalizes street name data:
- Convert all apostrophe variants to a single symbol (ʼ, U+02BC).
- Check the id (forbidden symbols, duplicates). Errors are recorded in $this->warning.
- Check the street type against a whitelist. Unknown types are saved to $this->warning and $this->typeNotFound.
- Check the Kyiv district name against a whitelist. Unknown district names are saved to $this->warning and $this->districtNotFound.
- Check the street names and normalize them (if normalization data has been saved to the $this->streetNormalization array).
- Generate $this->nameDouble - street names that occur two or more times.
- Generate $this->nameList - all unique street names.
- Generate $this->typeCounter - the number of streets of each type in Kyiv.
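The three generated arrays can be pictured with a short sketch. It is an illustration only, not the code of StreetNameKga: the function name summarizeStreets, the returned keys and the sample values are hypothetical, while the row keys 'name' and 'type_key' are taken from the result format of getCsvAsArray().

```php
<?php
// Sketch only: derive summary arrays from rows that are already normalized.

function summarizeStreets(array $rows): array
{
    // How many times each normalized street name occurs.
    $nameCounts = array_count_values(array_column($rows, 'name'));

    // Keep only the names that occur two or more times.
    $nameDouble = array_filter($nameCounts, function ($count) {
        return $count >= 2;
    });

    return [
        'name_list'    => array_keys($nameCounts),
        'name_double'  => $nameDouble,
        'type_counter' => array_count_values(array_column($rows, 'type_key')),
    ];
}

// Made-up rows for demonstration; real rows come from getCsvAsArray().
$rows = [
    ['name' => 'Alpha', 'type_key' => 'street'],
    ['name' => 'Alpha', 'type_key' => 'lane'],
    ['name' => 'Beta',  'type_key' => 'street'],
];

print_r(summarizeStreets($rows));
```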
Result array from method getCsvAsArray():
- ['number'] - (int) serial number from file
- ['id'] - (int) identifier from file
- ['name_original'] - (string) street name from file
- ['name'] - (string) normalized street name
- ['type_name'] - (string) street type name from file
- ['type_key'] - (string) street type key
- ['district_string'] - (string) street districts from file
- ['district_list'] - (array) street districts ['district_key', 'district_key', ...]
- ['document_name'] - (string) Document on assigning the name of the object
- ['document_date'] - (string) Date of the document on assigning the name of the object
- ['document_number'] - (string) Number of the document on assigning the name of the object
- ['document_title'] - (string) The title of the document on the naming of the object
- ['place_description'] - (string) Location of the object in the city
- ['name_old'] - (string) Former name of the object
- ['type_old'] - (string) Former category (type) of the object
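As a usage illustration, the sketch below groups street names by district using the 'name' and 'district_list' fields. It assumes that getCsvAsArray() returns a list of rows with the keys listed above; the function name groupNamesByDistrict and the sample district keys are hypothetical.

```php
<?php
// Sketch only: index street names by district key.

function groupNamesByDistrict(array $rows): array
{
    $byDistrict = [];
    foreach ($rows as $row) {
        // A street may belong to several districts.
        foreach ($row['district_list'] as $districtKey) {
            $byDistrict[$districtKey][] = $row['name'];
        }
    }

    return $byDistrict;
}

// Made-up rows for demonstration; real rows come from getCsvAsArray().
$rows = [
    ['name' => 'Alpha', 'district_list' => ['district_a', 'district_b']],
    ['name' => 'Beta',  'district_list' => ['district_a']],
];

print_r(groupNamesByDistrict($rows));
```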
Example
use Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga;
use Vgip\Datanorm\Directory\Address\Country\Ukr\Address as DirUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\City\Kyiv as DirKyiv;
use Vgip\Datanorm\Directory\Lang\Ukr\Pattern as PatternUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalizedList;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalization;

$dirUkrAddress = DirUkrAddress::getInstance();
$dirKyiv = DirKyiv::getInstance();
$patternUkrAddress = PatternUkrAddress::getInstance();
$streetNormalizedListObj = StreetNormalizedList::getInstance();
$streetNormalizedList = $streetNormalizedListObj->getNormalization();

/** Get configuration and whitelist data */
$pathSourceFile = join(DIRECTORY_SEPARATOR, ['file', 'Reestr_vulits_Kyiva_2020_10_25.csv']);
$streetTypeList = $dirUkrAddress->getStreetTypeWhitelist();
$districtWhitelist = $dirKyiv->getDistrictWhitelist();
$patternStreetName = $patternUkrAddress->getStreetName();

/** Object initialization */
$streetNameKga = new StreetNameKga();

/** Set parameters */
$streetNameKga->setTypeWhitelist($streetTypeList);
$streetNameKga->setDistrictWhitelist($districtWhitelist);
$streetNameKga->setStreetNormalization($streetNormalizedList);
$streetNameKga->setPatternStreetName($patternStreetName);

/** Get a result (array) with normalized data */
$data = $streetNameKga->getCsvAsArray($pathSourceFile);

/** Get other data */
$res = [];
$res['type_list'] = $streetNameKga->getTypeList();
$res['type_counter'] = $streetNameKga->getTypeCounter();
$res['name_list'] = $streetNameKga->getNameList();
$res['name_double'] = $streetNameKga->getNameDouble();
$res['district_not_whitelist'] = $streetNameKga->getDistrictNotFound();

/** Get warnings if present */
$warning = $streetNameKga->getWarning();
$warningValue = $streetNameKga->getWarningValue();

if (null !== $warning && count($warning) > 0) {
    print_r($warning);
}

print_r($data);
print_r($res);
Ukrainian language
Apostrophe
The resulting data uses the Unicode character U+02BC (ʼ) as the Ukrainian apostrophe. All similar characters in the source data (' U+0027, ’ U+2019, etc.) are replaced with ʼ (U+02BC). U+02BC is the apostrophe used in Ukrainian domain names (ICANN).
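The replacement described above can be sketched in a few lines. This is an illustration, not the library's implementation: the function name normalizeApostrophe is hypothetical, and the variant list is an assumption that adds U+2018 and the backtick to the two characters named in this section.

```php
<?php
// Sketch only: replace apostrophe look-alikes with U+02BC.

function normalizeApostrophe(string $text): string
{
    // U+0027 and U+2019 are named in the README; U+2018 and "`" are assumptions.
    $variants = ["'", "\u{2019}", "\u{2018}", "`"];

    // str_replace works on bytes, which is safe for complete UTF-8 sequences.
    return str_replace($variants, "\u{02BC}", $text);
}

echo normalizeApostrophe("п'ять"), PHP_EOL;
```

Text that already uses U+02BC is returned unchanged, so the function can be applied repeatedly without side effects.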
Street name normalization
- Position and surname - Академіка Єфремова, Генерала Авдєєнка, Маршала Бірюзова
- Name and surname - Леоніда Бикова
- Family relationships and surname - Братів Зерових, Родини Рудинських
Versioning
Data normalization follows Semantic Versioning.