Pure Bollocks Issue 22_042

  

 
                            ---------------------- 

                            *  C  O  D  I  N  G  * 

                            ---------------------- 

 
From:           john@cooper.cooper.EDU (John Barkaus) 
Newsgroups:     comp.graphics 
Subject:        GIF file format responses 5/5 
Date:           21 Apr 89 20:58:01 GMT 
Organization:   The Cooper Union (NY, NY) 

 
         ------------------------------------------------------------- 

 
                             LZW and GIF explained 

                              by Steve Blackstock 

 
    I hope this little document will help  enlighten those of you out there who 
want to know more about the Lempel-Ziv-Welch compression algorithm, and,
specifically, the implementation that GIF uses. 
    Before we start, here's  a  little  terminology,  for  the purposes of this 
document: 

    "character":    a fundamental data element. In  normal  text files, this is 
                    just a single byte. In  raster  images, which is what we're 
                    interested in, it's an index that  specifies the color of a 
                    given pixel. I'll refer to an arbitrary character as "K".
    "charstream":   a stream of characters, as in a data file. 
    "string":       a number of  continuous  characters,  anywhere  from one to 
                    very many characters in length.  I can specify an arbitrary 
                    string as "[...]K". 
    "prefix":       almost the same as a string,  but with the implication that 
                    a prefix immediately precedes a character, and a prefix can 
                    have a length of zero. So, a prefix and a character make up 
                    a string. I will refer to an arbitrary prefix as "[...]". 
    "root":         a single-character string.  For  most  purposes,  this is a 
                    character, but we may  occasionally  make a distinction. It 
                    is [...]K, where [...] is empty. 
    "code":         a number, specified by a  known  number of bits, which maps 
                    to a string. 
    "codestream":   the output stream of codes, as in the "raster data" 
    "entry":        a code and its string. 
    "string table": a list of entries; usually, but not necessarily, unique. 

    That should be enough of that. 

    LZW is a way of  compressing  data  that  takes  advantage of repetition of 
strings in  the  data.  Since  raster  data  usually  contains  a  lot  of this 
repetition, LZW is a good way of compressing and decompressing it. 
    For the moment, let's consider normal LZW encoding and decoding. GIF's
variation on the concept is just an extension from there.
    LZW manipulates three objects  in  both  compression and decompression: the 
charstream,  the  codestream,  and  the   string  table.  In  compression,  the 
charstream is the input and the codestream is the output. In decompression, the 
codestream is the input and the charstream is the output. The string table is a 
product of both compression and decompression, but  is never passed from one to 
the other. 
    The first thing we do in LZW compression is initialize our string table. To 
do this, we need to choose a code size (how many bits) and know how many values 
our characters can possibly take. Let's say  our  code size is 12 bits, meaning 
we can store 0->FFF, or 4096 entries in our string table. Let's also say that we
have 32 possible different characters. (This  corresponds to, say, a picture in 
which there are 32 different colors possible for each pixel.) To initialize the 
table, we set code#0 to character#0, code  #1  to character#1, and so on, until 
code#31 to character#31. Actually, we are  specifying  that each code from 0 to 
31 maps to a root. There will be  no  more  entries in the table that have this 
property. 
    Now we start compressing  data.  Let's  first  define  something called the 
"current prefix". It's just a  prefix  that  we'll  store things in and compare 
things to now and then. I will  refer  to it as "[.c.]". Initially, the current 
prefix has nothing in it. Let's also  define  a "current string", which will be 
the current prefix plus the next character  in  the charstream. I will refer to 
the current string as "[.c.]K",  where  K  is  some  character. OK, look at the 
first character in the charstream. Call  it  P. Make [.c.]P the current string. 
(At this point, of course, it's just the root P.) Now search through the string 
table to see if [.c.]P  appears  in  it.  Of  course,  it does now, because our 
string table is initialized to have  all  roots.  So  we don't do anything. Now 
make [.c.]P the current prefix. Look  at  the next character in the charstream. 
Call it Q. Add it to the current prefix to form [.c.]Q, the current string. Now 
search through the string table to see  if  [.c.]Q appears in it. In this case, 
of course, it doesn't. Aha! Now we get to do something. Add [.c.]Q (which is PQ 
in this case) to the string table for code#32, and output the code for [.c.] to 
the codestream. Now start over again with the current prefix being just the
root Q. Keep adding characters to [.c.] to form [.c.]K, until you can't find
[.c.]K in the string table. Then output  the  code  for [.c.] and add [.c.]K to 
the string table. In pseudo-code, the algorithm goes something like this: 

    [1] Initialize string table; 
    [2] [.c.] <- empty; 
    [3] K <- next character in charstream; 
    [4] Is [.c.]K in string table? 
        (yes: [.c.] <- [.c.]K; 
            go to [3]; 
        ) 
        (no: add [.c.]K to the string table; 
            output the code for [.c.] to the codestream; 
            [.c.] <- K; 
            go to [3]; 
        ) 

    It's as simple as that!  Of  course,  when  you  get  to step [3] and there 
aren't any more characters left, you just  output  the code for [.c.] and throw 
the table away. You're done. 
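
    As a rough illustration, the pseudo-code above might be sketched in Python
like this. The name lzw_compress and its alphabet_size parameter are made up
for the example, characters are assumed to be small integers, and string table
overflow is simply ignored.

```python
def lzw_compress(charstream, alphabet_size):
    """Plain (fixed-width) LZW: return the list of codes for charstream.

    charstream is a sequence of integers in range(alphabet_size).
    Table overflow is not handled in this sketch.
    """
    # [1] Initialize the string table with one root per character.
    table = {(k,): k for k in range(alphabet_size)}
    next_code = alphabet_size
    codes = []
    prefix = ()                      # [2] [.c.] <- empty
    for k in charstream:             # [3] K <- next character
        candidate = prefix + (k,)    # the current string [.c.]K
        if candidate in table:       # [4] yes: the string becomes the prefix
            prefix = candidate
        else:                        # [4] no: add the string, emit the prefix
            table[candidate] = next_code
            next_code += 1
            codes.append(table[prefix])
            prefix = (k,)
    if prefix:                       # no characters left: flush [.c.]
        codes.append(table[prefix])
    return codes
```

    With a four-character alphabet numbered 0 to 3, the call
lzw_compress([0, 1, 0, 2, 0, 1, 0], 4) returns [0, 1, 0, 2, 4, 0].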

 
    Wanna do an  example?  Let's  pretend  we  have  a four-character alphabet: 
A,B,C,D. The charstream  looks  like  ABACABA.  Let's  compress  it.  First, we 
initialize our string table to: #0=A, #1=B,  #2=C, #3=D. The first character is 
A, which is in the string table, so  [.c.]  becomes A. Next we get AB, which is 
not in the table, so we output code  #0  (for  [.c.]), and add AB to the string 
table as code #4. [.c.] becomes B. Next we get [.c.]A = BA, which is not in the 
string table, so output code #1, and  add  BA  to  the string table as code #5. 
[.c.] becomes A. Next we get AC, which  is not in the string table. Output code 
#0, and add AC to the string table as code #6. Now [.c.] becomes C. Next we get 
[.c.]A = CA, which is not in the table. Output #2 for C, and add CA to table as 
code#7. Now [.c.] becomes A. Next we get  AB,  which IS in the string table, so 
[.c.] gets AB, and we look at ABA, which  is not in the string table, so output 
the code for AB, which is #4, and add ABA to the string table as code #8. [.c.] 
becomes A. We can't get any more characters,  so we just output #0 for the code 
for A, and we're done. So, the codestream is #0#1#0#2#4#0. 
    A few words (four) should  be  said  here  about  efficiency: use a hashing 
strategy. The search through the string table can be computationally intensive, 
and some hashing is  well  worth  the  effort.  Also,  note that "straight LZW" 
compression runs the risk of overflowing the  string  table - getting to a code 
which can't be represented in the  number  of  bits you've set aside for codes. 
There are several ways of dealing with  this problem, and GIF implements a very 
clever one, but we'll get to that. 
    An important thing to notice is that,  at any point during the compression, 
if [...]K is in the string table,  [...]  is  there also. This fact suggests an 
efficient method for storing strings in the table. Rather than store the entire 
string of K's in the  table,  realize  that  any  string  can be expressed as a 
prefix plus a character: [...]K. If we're  about  to store [...]K in the table, 
we know that [...] is already there,  so  we  can just store the code for [...] 
plus the final character K. 
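
    One possible way to combine the hashing advice with this storage trick is
sketched below. It is only an illustration: the names lzw_compress_compact and
expand are invented here, and a Python dict keyed on (prefix code, K) stands in
for whatever hashing scheme you prefer.

```python
def lzw_compress_compact(charstream, alphabet_size):
    """LZW compression with entries stored as (prefix code, K) pairs."""
    # Roots need no stored entry: code k stands for character k itself.
    table = {}                       # (prefix_code, K) -> code, hashed lookup
    next_code = alphabet_size
    codes = []
    prefix_code = None               # None stands for the empty prefix
    for k in charstream:
        if prefix_code is None:
            prefix_code = k          # [.c.]K is just the root K
        elif (prefix_code, k) in table:
            prefix_code = table[(prefix_code, k)]
        else:
            table[(prefix_code, k)] = next_code
            next_code += 1
            codes.append(prefix_code)
            prefix_code = k
    if prefix_code is not None:
        codes.append(prefix_code)
    return codes, table


def expand(code, table, alphabet_size):
    """Rebuild the full string for a code by walking prefix links back."""
    by_code = {c: pair for pair, c in table.items()}
    chars = []
    while code >= alphabet_size:     # not a root yet: peel off the last K
        code, k = by_code[code]
        chars.append(k)
    chars.append(code)               # what is left is a root
    return chars[::-1]
```

    Each entry costs one code and one character no matter how long the string
it stands for, which is the whole point of the trick.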

 
    Ok, that takes care of compression. Decompression is perhaps more difficult 
conceptually, but it is really easier to program. 
    Here's how it goes:  We  again  have  to  start  with an initialized string 
table. This table comes from what  knowledge  we have about the charstream that 
we will eventually get, like what  possible  values the characters can take. In 
GIF files, this information is in  the  header  as the number of possible pixel 
values. The beauty of LZW, though, is that this is all we need to know. We will 
build the rest  of  the  string  table  as  we  decompress  the codestream. The 
compression is done in such a way  that  we  will never encounter a code in the 
codestream that we can't translate into a string. 
    We need to define something called a  "current code", which I will refer to 
as "<code>", and an "old-code",  which  I  will  refer  to as "<old>". To start 
things off, look at the first code.  This  is  now <code>. This code will be in 
the initialized string table as the code for a root. Output the root to the
charstream. Make this code the old-code <old>. *Now look at the next code, and
make it <code>. It is possible that this code will not be in the string table,
but let's assume for now that it is. Output the string corresponding to <code>
to the charstream. Now find the first character in the string you just
translated. Call this K. Add this to the prefix [...] generated by <old> to
form a new string [...]K. Add this string [...]K to the string table, and set
the old-code <old> to the current code <code>. Repeat from where I typed the
asterisk, and you're all set. Read this paragraph again if you just skimmed
it!!!
    Now let's consider the possibility that <code> is not in the string
table. Think back to compression, and  try  to understand what happens when you 
have a string like P[...]P[...]PQ appear  in  the charstream. Suppose P[...] is 
already in the string table, but P[...]P  is not. The compressor will parse out 
P[...], and find that P[...]P is not  in  the  string table. It will output the 
code for P[...], and add P[...]P to  the  string  table. Then it will get up to 
P[...]P for the next string, and find that P[...]P is in the table, as the code 
just added. So it will output the code for P[...]P if it finds that P[...]PQ is 
not in the table. The decompressor is  always "one step behind" the compressor. 
When the decompressor sees the code  for  P[...]P,  it will not have added that 
code to its string table yet because it needed the beginning character of
P[...]P to add to the string for  the  last  code, P[...], to form the code for 
P[...]P. However, when a decompressor finds a code that it doesn't know yet, it 
will always be the very next one  to  be  added  to the string table. So it can 
guess at what the string for the code  should  be, and, in fact, it will always 
be correct. If I am a decompressor, and I see code#124, and yet my string table 
has entries only up to code#123, I can figure out what code#124 must be, add it 
to my string table, and output  the  string.  If code#123 generated the string, 
which I will refer to here as  a  prefix, [...], then code#124, in this special 
case, will be [...] plus the first  character  of  [...]. So just add the first 
character of [...] to the end of itself. Not too bad.
    As an example (and a very common one) of this special case, let's assume
we have a raster image in which the first three pixels have the same color
value. That is, my charstream looks like: QQQ.... For the sake of argument,
let's say we have 32 colors, and Q is the color#12. The compressor will
generate the code sequence 12,32,.... (If you don't know why, take a minute to
understand it.) Remember that #32 is not in the initial table, which goes from
#0 to #31. The decompressor will see #12 and translate it just fine as color
Q. Then it will see #32 and not yet know what that means. But if it thinks
about it long enough, it can figure out that QQ should be entry#32 in the
table and QQ should be the next string output.
    So the decompression pseudo-code goes something like:

    [1] Initialize string table; 
    [2] get first code: <code>; 
    [3] output the string for <code> to the charstream; 
    [4] <old> <- <code>;
    [5] <code> <- next code in codestream; 
    [6] does <code> exist in the string table? 
        (yes: output the string for <code> to the charstream; 
            [...] <- translation for <old>; 
            K <- first character of translation for <code>; 
            add [...]K to the string table; 
            <old> <- <code> 
        ) 
        (no: [...] <- translation for <old>; 
            K <- first character of [...]; 
            output [...]K to charstream and add it to string table; 
            <old> <- <code> 
        ) 
    [7] go to [5]; 

    Again, when you get  to  step  [5]  and  there  are  no  more codes, you're 
finished.  Outputting of strings, and finding  of initial characters in strings 
are efficiency problems all to themselves, but I'm not going to suggest ways to 
do them here. Half the fun of programming is figuring these things out! 
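
    For readers who would like something to check their own attempt against,
here is one naive way the decompression pseudo-code could look in Python. The
name lzw_decompress is hypothetical, whole strings are stored as tuples (so
none of the efficiency problems are addressed), and anything other than the
one legitimate unknown code is treated as an error.

```python
def lzw_decompress(codestream, alphabet_size):
    """Plain (fixed-width) LZW: rebuild the charstream from a list of codes."""
    # [1] Initialize the string table with the roots.
    table = {k: (k,) for k in range(alphabet_size)}
    next_code = alphabet_size
    codes = iter(codestream)
    try:
        old = next(codes)            # [2] first code, always a root
    except StopIteration:
        return []
    out = list(table[old])           # [3] output its string
    for code in codes:               # [5] <code> <- next code
        prefix = table[old]          # [...] <- translation for <old>
        if code in table:            # [6] yes: the ordinary case
            entry = table[code]
            new_string = prefix + entry[:1]
        elif code == next_code:      # [6] no: must be the very next entry
            entry = prefix + prefix[:1]
            new_string = entry
        else:
            raise ValueError(f"invalid code {code}")
        out.extend(entry)
        table[next_code] = new_string
        next_code += 1
        old = code
    return out
```

    The QQQ case shows up as lzw_decompress([12, 32], 32), which returns
[12, 12, 12].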

 
    Now for the GIF variations on the  theme.  In  part  of the header of a GIF 
file, there is a field, in the Raster  Data stream, called "code size". This is 
a very misleading name for the field, but  we  have to live with it. What it is 
really is the "root size". The actual  size,  in bits, of the compression codes 
actually changes during compression/decompression,  and  I  will  refer to that 
size here as the "compression size".  The  initial  table is just the codes for 
all the roots, as usual,  but  two  special  codes  are  added on top of those. 
Suppose you have a "code size", which  is  usually the number of bits per pixel 
in the image, of N. If the number of  bits/pixel  is one, then N must be 2: the 
roots take up slots #0 and #1 in  the  initial table, and the two special codes 
will take up slots #4 and #5. In  any  other  case, N is the number of bits per 
pixel, and the roots take up slots  #0 through #(2**N-1), and the special codes 
are (2**N) and (2**N + 1). The  initial  compression  size will be N+1 bits per 
code. If you're encoding, you output the  codes  (N+1)  bits at a time to start 
with, and if you're decoding,  you  grab  (N+1)  bits  from the codestream at a 
time.
    As for the special codes: <CC>, or the clear code, is (2**N), and <EOI>,
or end-of-information, is (2**N + 1). <CC> tells the decompressor to
re-initialize the string table, and to reset the compression size to (N+1).
<EOI> means there's no more in the codestream. If you're encoding or decoding,
you should start adding things to the string table at <CC> + 2. If you're
encoding, you should output <CC> as the very first code, and then whenever
after that you reach code #4095 (hex FFF), because GIF does not allow
compression sizes to be greater than 12 bits. If you're decoding, you should
reinitialize your string table when you observe <CC>.
    The variable compression sizes are really no big deal. If you're encoding,
you start with a compression size of (N+1) bits, and, whenever you output the
code (2**(compression size)-1), you bump the compression size up one bit. So
the next code you output will be one bit longer. Remember that the largest
compression size is 12 bits, corresponding to a code of 4095. If you get that
far, you must output <CC> as the next code, and start over.
    If you're decoding, you must increase your compression size AS SOON AS YOU
write entry #(2**(compression size) - 1) to the string table. The next code
you READ will be one bit longer. Don't make the mistake of waiting until you
need to add the code (2**(compression size)) to the table. You'll have already
missed a bit from the last code.
    The packaging of codes into a bitstream for the raster data is also a
potential stumbling block for the novice encoder or decoder. The lowest order
bit in the code should coincide with the lowest available bit in the first
available byte in the codestream. For example, if you're starting with 5-bit
compression codes, and your first three codes are, say, <abcde>, <fghij>,
<klmno>, where e, j, and o are bit#0, then your codestream will start off
like:

    byte#0: hijabcde 
    byte#1: .klmnofg 
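
    The packing rule can be tried out with a small Python sketch like the one
below. The helper name pack_codes and its list-of-pairs input are assumptions
made for the example; a real encoder would track the compression size itself.

```python
def pack_codes(codes_and_sizes):
    """Pack (code, size_in_bits) pairs into bytes, lowest order bit first."""
    out = bytearray()
    acc = 0          # bit accumulator; pending bits sit at the low end
    nbits = 0        # number of pending bits in acc
    for code, size in codes_and_sizes:
        acc |= code << nbits         # new code goes above the pending bits
        nbits += size
        while nbits >= 8:            # flush every completed byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)       # last partial byte, unused bits are zero
    return bytes(out)
```

    Taking abcde = 10101, fghij = 00111 and klmno = 11000 as sample bit
patterns, pack_codes([(0b10101, 5), (0b00111, 5), (0b11000, 5)]) gives the two
bytes 11110101 and 01100000, which is the hijabcde / .klmnofg layout shown
above with the unused top bit left at zero.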

    So the differences between straight  LZW  and  GIF  LZW are: two additional 
special codes and variable compression  sizes.  If  you understand LZW, and you 
understand those variations, you understand it all! 
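
    Putting the variations together, a GIF-style decoder might look roughly
like the Python sketch below. Treat it as an illustration under assumptions:
the name gif_lzw_decode is invented, the raster data is assumed to have been
gathered into a single bytes object already, and running out of data before
<EOI> simply ends the decode.

```python
def gif_lzw_decode(data, code_size):
    """Decode GIF-style LZW data given the "code size" field N.

    Returns the list of decoded characters (pixel indices).
    """
    clear_code = 1 << code_size          # <CC>
    eoi_code = clear_code + 1            # <EOI>
    max_bits = 12                        # largest compression size

    def reset():
        # Roots only; new entries start at <CC> + 2, codes are N+1 bits.
        table = {k: (k,) for k in range(clear_code)}
        return table, clear_code + 2, code_size + 1

    table, next_code, size = reset()
    out = []
    old = None                           # no <old> right after a <CC>
    acc = nbits = pos = 0
    while True:
        # Pull in whole bytes until one code of `size` bits is available.
        while nbits < size:
            if pos >= len(data):
                return out               # data ended without <EOI>
            acc |= data[pos] << nbits
            pos += 1
            nbits += 8
        code = acc & ((1 << size) - 1)   # lowest bits are the next code
        acc >>= size
        nbits -= size

        if code == clear_code:
            table, next_code, size = reset()
            old = None
            continue
        if code == eoi_code:
            return out
        if code in table:
            entry = table[code]
        elif old is not None and code == next_code:
            entry = table[old] + table[old][:1]   # the special case
        else:
            raise ValueError(f"invalid code {code}")
        out.extend(entry)
        if old is not None and next_code < (1 << max_bits):
            table[next_code] = table[old] + entry[:1]
            next_code += 1
            # Grow as soon as entry #(2**size - 1) has been written.
            if next_code == (1 << size) and size < max_bits:
                size += 1
        old = code
```

    As a tiny check, with N = 2 the codes <CC>, 0, 1, <EOI> at three bits each
pack into the bytes 0x44 0x0A, and gif_lzw_decode(bytes([0x44, 0x0A]), 2)
returns [0, 1].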
    Just as sort of a P.S., you may have noticed that a compressor has a little 
bit of flexibility at compression time. I  specified a "greedy" approach to the 
compression, grabbing as many characters  as  possible before outputting codes. 
This is, in fact, the standard LZW way  of  doing things, and it will yield the 
best compression ratio. But  there's  no  rule  saying  you can't stop anywhere 
along the line and just output  the  code  for the current prefix, whether it's 
already in the table or not, and add that string plus the next character to the 
string table. There are various reasons  for  wanting to do this, especially if 
the strings get extremely long and make  hashing  difficult. If you need to, do 
it. 
    Hope this helps out. 

 
    Steve Blackstock 

  --------------------------------------------------------------------------- 

                        MORE INFO ABOUT LZW COMPRESSION 

LZW - A sophisticated  data  compression  algorithm  based  on   work  done  by 
Lempel-Ziv & Welch which has  the  feature  of very efficient one-pass encoding 
and decoding. This allows the image to  be  decompressed and displayed  at  the 
same time. The original article from which this technique was adapted is: 

    Terry  A.   Welch,  "A  Technique  for  High Performance Data Compression", 
    IEEE Computer, vol 17 no 6 (June 1984) 

This basic algorithm is also used in the  public  domain  ARC  file compression 
utilities. 

 
  ---------------------------------------------------------------------------