I’m just reading an excellent MSc dissertation on digital forensics, and it mentions Lempel-Ziv, and I remember I wrote a good deal of things around the method a while ago. It was in 1977, that Abraham Lempel and Jacob Ziv developed the Lempel-Ziv class of adaptive dictionary data compression techniques (also known as LZ-77 coding), and are now some of the most popular compression techniques.
The LZ coding scheme is especially suited to data which has a high degree of repetition, and makes back references to these repeated parts. Typically a flag is normally used to identify coded and unencoded parts, where the flag creates back references to the repeated sequence. An example piece of text could be:
This text has several repeated sequences, such as ‘ is ‘, ‘it’, ‘en’, ‘re’ and ‘ receiv’. For example the repetitive sequence ‘ recei’ (as shown by the underlined highlight), and the encoded sequence could be modified with the flag sequence #m#n where m represents the number of characters to trace back to find the character sequence and n the number of replaced characters. Thus the encoded message could become:
‘The receiver#10#3quires a#20#5pt for it. This is automatically sent wh#6#2 it #30#2#47#5ved.’
Normally a long sequence of text has many repeated words and phases, such as ‘and’, ‘there’, and so on. Note that in some cases this could lead to longer files if short sequences were replaced with codes that were longer than the actual sequence itself.
Mulchester 3 Madchester 2
Smellmore Wanderers 60 Drinksome Wanderers 23
we could compress this with:
Mulchester 3 Mad#13#7 2
Smellmore Wanderers 60 Drinksome#23#1123
— I’ll add more to this later …