ghostjat / dna
Description of project DNA.
Requires
- ghostjat/plot: *
- ghostjat/pml: *
- psr/log: 3.0.2
Last update: 2026-04-30 08:22:29 UTC
README
Build a multi-class DNA sequence classifier in pure PHP using PHP-ML, from raw data to predictions.
Introduction
This tutorial demonstrates how to use the PHP-ML library to build a machine learning model that classifies DNA sequences into:
- Bacteria
- Animal
- Fungi
- Virus
- Plant
You'll go through the complete pipeline:
- Data preparation
- Exploratory Data Analysis (EDA)
- Model training
- Evaluation
- Prediction
All examples are located in:
example/dna/
├── eda.php
├── train.php
└── predict.php
Problem Overview
DNA sequences contain patterns that can be used to identify their biological origin. Instead of binary promoter detection, this project performs multi-class classification across five organism types.
Why Machine Learning?
Machine learning helps by:
- Automatically discovering patterns in DNA sequences
- Scaling to large biological datasets
- Providing fast and accurate classification
Prerequisites
Ensure you have:
- PHP ≥ 8.2
- Composer
- Install PHP-ML:
composer require "ghostjat/pml:*"
- Basic command-line knowledge
Dataset Overview
Summary
- Total Samples: 244,447
- Features: 256 (k-mer frequencies)
- Classes: 5
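To make the feature count concrete: 256 equals 4^4, the number of distinct 4-letter words over A, C, G and T. The sketch below therefore assumes k = 4 and normalized counts. Neither the value of k nor the way the CSV files were generated is stated here, and `kmerFrequencies()` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper: turns a DNA string into a normalized k-mer frequency vector.
// Assumption: k = 4 (4^4 = 256 features); the dataset's actual k is not stated.
function kmerFrequencies(string $sequence, int $k = 4): array
{
    $alphabet = ['A', 'C', 'G', 'T'];

    // Enumerate all 4^k k-mers in lexicographic order so every sequence
    // maps to the same feature positions.
    $kmers = [''];
    for ($i = 0; $i < $k; $i++) {
        $next = [];
        foreach ($kmers as $prefix) {
            foreach ($alphabet as $base) {
                $next[] = $prefix . $base;
            }
        }
        $kmers = $next;
    }
    $counts = array_fill_keys($kmers, 0);

    // Slide a window of length k over the sequence; windows containing
    // characters outside A/C/G/T (for example N) are skipped.
    $sequence = strtoupper($sequence);
    $total = 0;
    for ($i = 0, $last = strlen($sequence) - $k; $i <= $last; $i++) {
        $kmer = substr($sequence, $i, $k);
        if (isset($counts[$kmer])) {
            $counts[$kmer]++;
            $total++;
        }
    }

    // Normalize counts to frequencies; a sequence without any valid k-mer
    // yields an all-zero vector.
    return array_map(
        fn (int $c): float => $total > 0 ? $c / $total : 0.0,
        array_values($counts)
    );
}

$features = kmerFrequencies('ACGTACGTAC');
echo count($features), PHP_EOL;
```

Each returned vector has one entry per possible k-mer, so sequences of different lengths end up with the same number of columns.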
Classes
- bacteria
- animal
- fungi
- virus
- plant
Storage
datasets/train_*.csv
Step 1: Exploratory Data Analysis (eda.php)
This script loads and inspects the dataset.
$trainFiles = glob(__DIR__ . '/datasets/train_*.csv');

// Load the first file, then stack the remaining files onto it.
$dataset = loadDna($trainFiles[0]);
for ($i = 1; $i < count($trainFiles); $i++) {
    $dataset = $dataset->stack(loadDna($trainFiles[$i]));
}

// Read the categories of the last column from the first file.
$df0     = DataFrame::fromCSV($trainFiles[0], false);
$cols0   = $df0->columns();
$classes = $df0->categories(end($cols0));
What it does
- Loads multiple CSV files
- Merges them into one dataset
- Extracts class distribution
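The class-distribution step can be illustrated without the library's DataFrame API. The sketch below is plain PHP and assumes each row is an array whose last element is the class label. `classDistribution()` and the toy rows are hypothetical, for illustration only.

```php
<?php
// Hypothetical helper, independent of the library's DataFrame API.
// Assumption: each row is an array whose last element is the class label.
function classDistribution(array $rows): array
{
    $counts = [];
    foreach ($rows as $row) {
        $label = (string) end($row);
        $counts[$label] = ($counts[$label] ?? 0) + 1;
    }

    // Attach each class's share of the total next to its raw count.
    $total = count($rows);
    $distribution = [];
    foreach ($counts as $label => $count) {
        $distribution[$label] = [
            'count' => $count,
            'share' => $count / $total,
        ];
    }
    return $distribution;
}

// Toy rows: two feature values followed by a label.
$rows = [
    [0.1, 0.3, 'bacteria'],
    [0.2, 0.1, 'virus'],
    [0.4, 0.2, 'bacteria'],
    [0.3, 0.3, 'plant'],
];

foreach (classDistribution($rows) as $label => $stats) {
    printf("%-10s %d (%.1f%%)\n", $label, $stats['count'], $stats['share'] * 100);
}
```

A strongly skewed distribution is worth knowing about before training, because plain accuracy can look good while a rare class is mostly misclassified.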
Step 2: Training & Evaluation (train.php)
Train a neural network using MLPClassifier.
$pipeline = new Pipeline(
    [new NumericStringConverter(), new ZScaleStandardizer()],
    new MLPClassifier(
        architecture: [32, 16],
        epochs: 10,
        learningRate: 0.01,
        batchSize: 32
    )
);

// Shuffle reproducibly, then split 80/20 into training and validation sets.
Dataset::seed(42);
$dataset->randomize();
[$train, $val] = $dataset->split(0.8);

$pipeline->train($train);

// Score the held-out validation set.
$valPreds = $pipeline->predict($val);
$valAcc   = (new Accuracy())->score($valPreds, $val->labels());
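The pipeline runs a `ZScaleStandardizer` before the network. Judging only by its name, this is assumed to be z-score standardization (subtract the column mean, divide by the column standard deviation). The stand-alone sketch below shows that idea in plain PHP. It is not the library's implementation, and details such as the handling of constant columns may differ. `fitZScore()` and `applyZScore()` are hypothetical names.

```php
<?php
// Hypothetical stand-alone sketch of z-score standardization (not the library's code).
// Fits per-column mean and population standard deviation on the given rows.
function fitZScore(array $rows): array
{
    $n = count($rows);
    $numCols = count($rows[0]);
    $means = array_fill(0, $numCols, 0.0);
    $sumSq = array_fill(0, $numCols, 0.0);

    foreach ($rows as $row) {
        foreach ($row as $j => $value) {
            $means[$j] += $value;
        }
    }
    foreach ($means as $j => $sum) {
        $means[$j] = $sum / $n;
    }
    foreach ($rows as $row) {
        foreach ($row as $j => $value) {
            $sumSq[$j] += ($value - $means[$j]) ** 2;
        }
    }

    // Columns with (near) zero variance get a divisor of 1.0 to avoid dividing by zero.
    $stds = array_map(
        fn (float $s): float => $s / $n > 1e-12 ? sqrt($s / $n) : 1.0,
        $sumSq
    );

    return ['means' => $means, 'stds' => $stds];
}

// Applies previously fitted statistics, so validation data can reuse
// the statistics computed on the training data.
function applyZScore(array $rows, array $stats): array
{
    return array_map(
        fn (array $row): array => array_map(
            fn ($value, float $mean, float $std): float => ($value - $mean) / $std,
            $row,
            $stats['means'],
            $stats['stds']
        ),
        $rows
    );
}

$train  = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]];
$stats  = fitZScore($train);
$scaled = applyZScore($train, $stats);
```

Fitting on the training rows and reusing the same statistics for validation data keeps information from the validation set out of the preprocessing step.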
Training Details
- Train Samples: 195,558
- Validation Samples: 48,889
- Validation Accuracy: ~90.07%
- Training Time: ~20 seconds
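The sample counts are consistent with an 80/20 split of 244,447 rows: 244,447 × 0.8 = 195,557.6, which rounds to 195,558 and leaves 48,889 for validation. The sketch below reproduces that arithmetic with a seeded shuffle in plain PHP. The rounding rule is an assumption inferred from the reported counts, and `splitIndices()` is a hypothetical helper rather than the library's `split()`.

```php
<?php
// Hypothetical stand-alone sketch of a seeded shuffle followed by a ratio split.
// Assumption: the cut point is round(n * ratio), which matches the reported
// counts; the library's own rule is not documented here.
function splitIndices(int $numSamples, float $ratio, int $seed): array
{
    if ($numSamples < 1) {
        return [[], []];
    }
    $indices = range(0, $numSamples - 1);

    // Fisher-Yates shuffle driven by a seeded generator for reproducibility.
    mt_srand($seed);
    for ($i = $numSamples - 1; $i > 0; $i--) {
        $j = mt_rand(0, $i);
        [$indices[$i], $indices[$j]] = [$indices[$j], $indices[$i]];
    }

    $cut = (int) round($numSamples * $ratio);
    return [array_slice($indices, 0, $cut), array_slice($indices, $cut)];
}

[$trainIdx, $valIdx] = splitIndices(244447, 0.8, 42);
printf("train=%d val=%d\n", count($trainIdx), count($valIdx));
```

Working with row indices instead of copying rows keeps the sketch cheap in memory; the indices can then be used to pick rows from whatever structure holds the data.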
Step 3: Prediction (predict.php)
Use a trained model to classify new DNA sequences.
// -- 1. Load model + class map --
$logger->info('Loading model …');
$pipeline = Pipeline::load($modelDir);
$classes  = json_decode(file_get_contents($modelDir . '/classes.json'), true);
$logger->info('Model loaded', ['classes' => $classes]);

// -- 2. Load unknown CSV --
$logger->info('Loading unknown data …');
$df   = DataFrame::fromCSV($unknownCsv, false);
$cols = $df->columns();

// Check if last col is a label (STRING) or a feature (float32)
$dtypes    = $df->dtypes();
$lastCol   = end($cols);
$hasLabels = ($dtypes[$lastCol] === 'string');

$X       = $df->drop($hasLabels ? [$lastCol] : [])->toTensor();
$dataset = new Dataset($X);
$logger->info('Data ready', ['rows' => $dataset->numRows(), 'features' => $dataset->numColumns()]);

// -- 3. Predict --
$logger->info('Predicting …');
$predIndices = $pipeline->predict($dataset)->toFlatArray(); // [N] class indices

// -- 4. Evaluate if labels available --
if ($hasLabels) {
    $yTrue = $df->castToFloat($lastCol)->col($lastCol)->squeeze();
    $predT = \Pml\Tensor::fromArray($predIndices);
    $acc   = (new Accuracy())->score($predT, $yTrue);
    $logger->info(sprintf('Test accuracy: %.4f (%.2f%%)', $acc, $acc * 100));
}
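The script produces class indices, which are easier to read once mapped back to names. The sketch below assumes the class map decodes to a plain list whose positions match the model's output indices. The real layout of `classes.json` is not shown here, the class order in the example is illustrative, and `indicesToLabels()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: converts predicted class indices into readable labels.
// Assumption: the class map is a list such as ["bacteria", "animal", ...] whose
// positions match the model's output indices; the real classes.json may differ.
function indicesToLabels(array $predIndices, array $classes): array
{
    return array_map(
        // Indices may arrive as floats (for example 3.0), so cast before the lookup.
        fn ($index): string => $classes[(int) $index] ?? 'unknown',
        $predIndices
    );
}

// Illustrative class order only; use the decoded classes.json in practice.
$classes = json_decode('["bacteria","animal","fungi","virus","plant"]', true);
$labels  = indicesToLabels([0, 3.0, 4, 9], $classes);
echo implode(', ', $labels), PHP_EOL;
```

Falling back to `'unknown'` for an out-of-range index makes a mismatch between the model and the class map visible instead of raising a warning mid-run.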
Running the Example
EDA
php eda.php
Training
php train.php      # softmax
php trainMLP.php
Prediction
php predict.php
Interpreting Results
- Accuracy: the share of samples whose predicted class matches the true class
- Multi-class predictions: each sample is assigned exactly one of the 5 class labels
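A single accuracy figure hides which classes get confused with each other. The sketch below computes accuracy together with a confusion matrix in plain PHP, using made-up labels for three classes. Both functions are hypothetical stand-ins and do not use the library's `Accuracy` class.

```php
<?php
// Hypothetical stand-alone metrics, independent of the library's Accuracy class.
// Both functions expect integer class indices of equal length.
function accuracy(array $predicted, array $actual): float
{
    $correct = 0;
    foreach ($predicted as $i => $label) {
        if ($label === $actual[$i]) {
            $correct++;
        }
    }
    return count($actual) > 0 ? $correct / count($actual) : 0.0;
}

// Rows are true classes, columns are predicted classes.
function confusionMatrix(array $predicted, array $actual, int $numClasses): array
{
    $matrix = array_fill(0, $numClasses, array_fill(0, $numClasses, 0));
    foreach ($predicted as $i => $label) {
        $matrix[$actual[$i]][$label]++;
    }
    return $matrix;
}

// Made-up labels for illustration.
$actual    = [0, 0, 1, 2, 2, 2];
$predicted = [0, 1, 1, 2, 2, 0];

$acc    = accuracy($predicted, $actual);
$matrix = confusionMatrix($predicted, $actual, 3);

printf("accuracy=%.4f\n", $acc);
foreach ($matrix as $row) {
    echo implode(' ', $row), PHP_EOL;
}
```

Off-diagonal entries show which pairs of classes the model mixes up, which is often more actionable than the overall score.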
Extending the Tutorial
- Increase epochs beyond the 10 used here, which may improve accuracy at the cost of longer training
- Try deeper architectures
- Experiment with other classifiers
- Add cross-validation
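For the cross-validation idea, one possible starting point is to generate k-fold index sets and train one model per fold. The sketch below covers only the index generation in plain PHP. `kFoldIndices()` is a hypothetical helper, and wiring it to a classifier is left open.

```php
<?php
// Hypothetical sketch of k-fold index generation; training and scoring on each
// fold are left to whichever classifier is used.
function kFoldIndices(int $numSamples, int $k, int $seed): array
{
    $indices = range(0, $numSamples - 1);

    // Shuffle once so folds are not tied to the original row order.
    mt_srand($seed);
    shuffle($indices);

    $folds = [];
    for ($f = 0; $f < $k; $f++) {
        $training = [];
        $validation = [];
        foreach ($indices as $pos => $index) {
            // The position modulo k decides which fold holds the sample out.
            if ($pos % $k === $f) {
                $validation[] = $index;
            } else {
                $training[] = $index;
            }
        }
        $folds[] = ['train' => $training, 'validation' => $validation];
    }
    return $folds;
}

$folds = kFoldIndices(10, 5, 42);
foreach ($folds as $f => $fold) {
    printf("fold %d: %d train, %d validation\n", $f, count($fold['train']), count($fold['validation']));
}
```

Averaging the validation accuracy over all folds gives a steadier estimate than a single 80/20 split, at roughly k times the training cost.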
Conclusion
You now have a complete workflow for building a multi-class DNA classifier in PHP.
Final Note
Push PHP beyond traditional limits, even into machine learning.
Happy coding!