decatur-vote/text-diff

A minimal library that produces diffs in a fairly human readable, plain-text format. Diffs allow converting old text to new text & visa versa.

1.0.0 2023-05-16 17:07 UTC

This package is auto-updated.

Last update: 2024-03-30 00:31:39 UTC


README

A minimal library that produces diffs in a fairly human readable, plain-text format. Diffs allow converting old text to new text & visa versa.

Installation

composer require decatur-vote/text-diff v1.0.x-dev   

or in your composer.json

{"require":{ "decatur-vote/text-diff": "v1.0.x-dev"}}  

Usage

Generate a diff, and move forward or backward with the operations (opps).

<?php  
  
$differ = new \DecaturVote\Differ();  
$old = 'old text';  
// $old = $differ->fix_html($old); # optional - will beautify html using DOMDocument, via composer package taeluf/phtml   
$new = 'new text';  
$opps = $differ->get_opps($old, $new);  
// file_put_contents('file_path.txt', $opps)  
  
$recovered_old_text = $differ->backward($new, $opps);  
$restored_new_text = $differ->forward($old, $opps);  

Notes

  • Takes 122ms (on my machine) to generate diffs for two 15,000 character strings and then run both forward() and backward(). Run via phptest LargeRandomStringDiffs
  • phptest -test MultipleDiffs -run dup - tests a specific diff test. See the test class.
  • There are 27 different diff tests that are all passing, several of which were added when bugs were found via LargeRandomStringDiffs test. Most (maybe all) bugs were because of duplicate lines not being handled.

TODO maybe?

  • create diff html (I don't know how that'll work with my move operations, so I probably won't do this)

How it works

Overview of how the Differ works. We'll cover what diffs look like, how they're processed, THEN how they're created.

What diffs look like

This will remove lines 1 & 3, add lines 'c' & 'def' at positions 3 & 5, then move line 6 to line 4 & line 7 to line 3.

1-abc  
3-baby  
3+c  
5+def  
6>4  
7>3  

NOTE: 0-based line indexes. The first line is line 0. The second line is line 1. 1-abc means there is a line BEFORE abc

NOTE: remove & add line operations depend on previous instructions having already been applied. 3-baby, above, says 'Remove Line 3, which contains the text "baby".'. However, the original text has baby as line 4. 3-baby is processed AFTER 1-abc ... so the 4th line ("baby") becomes the 3rd line after line 1 is removed, so we get 3-baby

Processing Diffs

Forward diffs are processed from top to bottom, always in this order:

  1. remove lines
  2. add lines
  3. move lines

NOTE: For 6>4, line 4 is replaced by line 6. Line 6 is unchanged, unless a different move opp overwrites line 6.

Backward diffs are processed from bottom to top:

  1. Move lines
  2. Remove lines that are listed as add operations (ex 5+def would be removed)
  3. Add lines that are listed as remove operations (ex 1-abc would be added)

Generating Diffs

Duplicate lines are handled at each stage (most of the complexity of the algorithm comes from managing duplicate lines).

  1. Generate an array of lines for $old_text and $new_text
  2. Find any lines in $old_text that don't exist in $new_text & generate opps for these lines to be removed
  3. Find any lines in $new_text that don't exist in $old_text & generate opps for these lines to be added
  4. Apply the remove & add opps to the $old_text to create an out-of-order copy of the new text.
  5. Compare each line of the out-of-order new text to the supplied $new_text. Find the correct location.