Salza2 - Create compressed data from Common Lisp

Abstract

Salza2 is a Common Lisp library for creating compressed data in the ZLIB, DEFLATE, or GZIP data formats, described in RFC 1950, RFC 1951, and RFC 1952, respectively. It does not use any external libraries for compression. It does not yet support decompression. Salza2 is available under a BSD-like license. The current version is 2.0.7, released on June 12, 2009.

Download shortcut:

http://www.xach.com/lisp/salza2.tgz

Contents

  1. Overview and Limitations
  2. Dictionary
  3. References
  4. Acknowledgements
  5. Feedback

Overview and Limitations

Salza2 provides an interface for creating a compressor object. This object acts as a sink for octets (either individual octets or vectors of octets), and is a source for octets in a compressed data format. The compressed octet data is provided to a user-defined callback that can write it to a stream, copy it to another vector, etc.

Salza2 has built-in compressors that support the ZLIB, DEFLATE, and GZIP data formats. The classes and generic function protocol are available to make it easy to support similar formats via subclassing and new methods. ZLIB and GZIP are extensions to the DEFLATE format and are implemented as subclasses of DEFLATE-COMPRESSOR with a few methods implemented for the protocol.

Salza2 is the successor to Salza, but it is not backwards-compatible. Among other changes, Salza2 drops support for compressing Lisp character data, since the compression formats are octet-based and obtaining encoded octets from Lisp characters varies from implementation to implementation.

There are a number of functions that provide a simple interface to specific tasks such as gzipping a file or compressing a single vector.

Salza2 does not decode compressed data. There is no support for dynamically defined Huffman codes. There is currently no interface for changing the tradeoff between compression speed and compressed data size.

Dictionary

The following symbols are exported from the SALZA2 package.

Standard Compressors

[Classes]
deflate-compressor
zlib-compressor
gzip-compressor

Instances of these classes may be created via make-instance. The only supported initarg is :CALLBACK. See CALLBACK for the expected value.

[Accessor]
callback compressor => callback
(setf (callback compressor) new-value) => new-value

Gets or sets the callback function of compressor. The callback should be a function of two arguments, an octet vector and an end index. It should treat the octets from index 0 up to, but not including, the end index as the next part of the compressor's compressed output data stream. See MAKE-STREAM-OUTPUT-CALLBACK for an example callback.
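
As an illustration of the callback contract, here is a minimal sketch that collects compressed output in memory. The helper names make-collecting-callback and deflate-octets are hypothetical and not part of Salza2. The sketch assumes ASDF can locate the salza2 system, and it copies octets out of the buffer on the cautious assumption that the compressor may reuse that buffer between calls.

```lisp
;; Assumption: ASDF can find the "salza2" system (for example via Quicklisp).
(require :asdf)
(asdf:load-system "salza2")

(defun make-collecting-callback (sink)
  "Return a callback that appends the first END octets of BUFFER to SINK,
an adjustable octet vector with a fill pointer (hypothetical helper)."
  (lambda (buffer end)
    ;; Copy the octets; do not keep a reference to BUFFER itself.
    (loop for i from 0 below end
          do (vector-push-extend (aref buffer i) sink))))

(defun deflate-octets (octets)
  "Compress the octet vector OCTETS and return the DEFLATE data as a fresh
simple octet vector (hypothetical helper)."
  (let* ((sink (make-array 0 :element-type '(unsigned-byte 8)
                             :adjustable t :fill-pointer 0))
         (compressor (make-instance 'salza2:deflate-compressor
                                    :callback (make-collecting-callback sink))))
    (salza2:compress-octet-vector octets compressor)
    ;; Required so the final octets are passed to the callback.
    (salza2:finish-compression compressor)
    (subseq sink 0)))
```

The callback may be invoked several times for large inputs, so the sink grows incrementally rather than assuming a single call.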

[Function]
compress-octet octet compressor => |

Adds octet to compressor to be compressed.

[Function]
compress-octet-vector vector compressor &key start end => |

Adds the octets from vector to compressor to be compressed, beginning with the octet at start and ending at the octet at end - 1. If start is not specified, it defaults to 0. If end is not specified, it defaults to the total length of vector. Equivalent to (but much more efficient than) the following:
(loop for i from start below end
      do (compress-octet (aref vector i) compressor))

[Generic function]
finish-compression compressor => |

Compresses any pending data, concludes the data format for compressor with FINISH-DATA-FORMAT, and invokes the user callback for the final octets of the compressed data format. This function must be called at the end of compression to ensure the validity of the data format; it is called implicitly by WITH-COMPRESSOR.

[Generic function]
reset compressor => |

The default method for DEFLATE-COMPRESSOR objects resets the internal state of compressor and calls START-DATA-FORMAT. This allows the re-use of a single compressor object for multiple compression tasks.
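
A minimal sketch of re-using one compressor for several inputs follows. The helper name gzip-each is hypothetical. It assumes ASDF can locate the salza2 system, and it relies on the documented behavior that COMPRESS-DATA accepts an existing compressor object and that a compressor may be created without a :CALLBACK initarg.

```lisp
;; Assumption: ASDF can find the "salza2" system (for example via Quicklisp).
(require :asdf)
(asdf:load-system "salza2")

(defun gzip-each (vectors)
  "Return a list holding one GZIP-compressed octet vector per element of
VECTORS, using a single compressor object throughout (hypothetical helper)."
  (let ((compressor (make-instance 'salza2:gzip-compressor)))
    (loop for vector in vectors
          ;; COMPRESS-DATA installs its own callback and finishes the stream.
          collect (salza2:compress-data vector compressor)
          ;; RESET prepares the same object for the next, independent stream.
          do (salza2:reset compressor))))
```

Each element of the result is intended to be a complete, independent GZIP stream; whether re-use is faster than creating fresh compressors is not something this sketch measures.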

[Macro]
with-compressor (var class &rest initargs &key &allow-other-keys) &body body => |

Evaluates body with var bound to a new compressor created as if by (apply #'make-instance class initargs). FINISH-COMPRESSION is implicitly called on the compressor at the end of evaluation.
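
A short sketch of WITH-COMPRESSOR in use follows. The helper name zlib-compress-ascii is hypothetical, the sketch assumes ASDF can locate the salza2 system, and it assumes the input string contains only characters whose char-code is below 256 (plain ASCII text in practice). Note that the class argument is a quoted symbol, matching the make-instance description above.

```lisp
;; Assumption: ASDF can find the "salza2" system (for example via Quicklisp).
(require :asdf)
(asdf:load-system "salza2")

(defun zlib-compress-ascii (string)
  "Compress the character codes of STRING in ZLIB format and return the
result as an octet vector (hypothetical helper; ASCII input assumed)."
  (let ((chunks '()))
    (salza2:with-compressor (compressor 'salza2:zlib-compressor
                                        :callback (lambda (buffer end)
                                                    ;; Keep a copy of each chunk.
                                                    (push (subseq buffer 0 end)
                                                          chunks)))
      (loop for char across string
            do (salza2:compress-octet (char-code char) compressor)))
    ;; FINISH-COMPRESSION has already run when the macro form exits.
    (apply #'concatenate '(vector (unsigned-byte 8)) (nreverse chunks))))
```

Feeding octets one at a time keeps the sketch simple; COMPRESS-OCTET-VECTOR is described above as the more efficient choice when the data is already in a vector.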

Customizing Compressors

Compressor objects follow a protocol that makes it easy to create specialized data formats. The ZLIB data format is essentially the same as the DEFLATE format with an additional header and a trailing checksum; this is implemented by creating a new class and adding a few new methods to the generic functions below.

For example, consider a new compressed data format FOO that encapsulates a DEFLATE data stream but adds four signature octets, F0 0D 00 D1, to the start of the output data stream, and adds a trailing 32-bit length value, MSB first, after the end. It could be implemented like this:

(defclass foo-compressor (deflate-compressor)
  ((data-length
    :initarg :data-length
    :accessor data-length))
  (:default-initargs
   :data-length 0))

(defmethod start-data-format :before ((compressor foo-compressor))
  (write-octet #xF0 compressor)
  (write-octet #x0D compressor)
  (write-octet #x00 compressor)
  (write-octet #xD1 compressor))

(defmethod process-input :after ((compressor foo-compressor) input start count)
  (declare (ignore input start))
  (incf (data-length compressor) count))

(defmethod finish-data-format :after ((compressor foo-compressor))
  (let ((length (data-length compressor)))
    (write-octet (ldb (byte 8 24) length) compressor)
    (write-octet (ldb (byte 8 16) length) compressor)
    (write-octet (ldb (byte 8  8) length) compressor)
    (write-octet (ldb (byte 8  0) length) compressor)))

(defmethod reset :after ((compressor foo-compressor))
  (setf (data-length compressor) 0))

[Function]
write-bits code size compressor => |

Writes size low bits of the integer code to the output buffer of compressor. Follows the bit packing layout described in RFC 1951. The bits are not compressed, but become literal parts of the output stream.
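
The RFC 1951 packing order can be illustrated without Salza2 at all. The following standalone sketch is not Salza2's implementation, and pack-bits is a hypothetical name. It packs each value starting at the least significant free bit of the current octet, which is the layout RFC 1951 specifies for data elements.

```lisp
(defun pack-bits (fields)
  "FIELDS is a list of (CODE SIZE) pairs. Pack the low SIZE bits of each CODE,
least significant bit first, and return the resulting octets as a list.
A final partial octet is padded with zero bits. (Hypothetical helper.)"
  (let ((accumulator 0)   ; bits not yet emitted; bit 0 is the oldest
        (bit-count 0)     ; number of valid bits in ACCUMULATOR
        (octets '()))
    (loop for (code size) in fields
          do (setf accumulator
                   (logior accumulator
                           (ash (ldb (byte size 0) code) bit-count)))
             (incf bit-count size)
             ;; Emit every complete octet that has accumulated.
             (loop while (>= bit-count 8)
                   do (push (ldb (byte 8 0) accumulator) octets)
                      (setf accumulator (ash accumulator -8))
                      (decf bit-count 8)))
    (when (plusp bit-count)
      (push (ldb (byte 8 0) accumulator) octets))
    (nreverse octets)))

;; A 1-bit field with value 1 followed by a 2-bit field with value 1
;; occupies the three lowest bits of the first octet: binary 011.
(pack-bits '((1 1) (1 2)))                  ; => (3)
(pack-bits '((#b101 3) (#b11111 5) (1 1)))  ; => (253 1)
```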

[Function]
write-octet octet compressor => |

Writes octet to the output buffer of compressor. Bits of the octet are not packed; the octet is added to the output buffer at the next octet boundary. The octet is not compressed, but becomes a literal part of the output stream.

[Generic function]
start-data-format compressor => |

Outputs any prologue bits or octets needed to produce a valid compressed data stream for compressor. Called from initialize-instance and RESET for subclasses of deflate-compressor. Should not be called directly, but subclasses may add methods to customize what literal data is added to the beginning of the output buffer.

[Generic function]
process-input compressor input start count => |

Called when count octets of the octet vector input, starting from start, are about to be compressed. This generic function should not be called directly, but may be specialized.

This is useful for data formats that must maintain information about the uncompressed contents of a compressed data stream, such as checksums or total data length.
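
As a sketch of the checksum case, the following subclass keeps a CRC32 of the uncompressed input. The class name crc-tracking-compressor and the reader input-checksum are hypothetical, and the sketch assumes ASDF can locate the salza2 system. Because PROCESS-INPUT is called when octets are about to be compressed, the checksum should only be read after compression has finished.

```lisp
;; Assumption: ASDF can find the "salza2" system (for example via Quicklisp).
(require :asdf)
(asdf:load-system "salza2")

(defclass crc-tracking-compressor (salza2:deflate-compressor)
  ((input-checksum
    :initform (make-instance 'salza2:crc32-checksum)
    :reader input-checksum))
  (:documentation "DEFLATE compressor that also tracks a CRC32 of its input
(hypothetical class for illustration)."))

(defmethod salza2:process-input :after ((compressor crc-tracking-compressor)
                                        input start count)
  ;; Fold the octets that are about to be compressed into the checksum.
  (salza2:update (input-checksum compressor) input start count))

(defmethod salza2:reset :after ((compressor crc-tracking-compressor))
  ;; Keep the checksum in step when the compressor is re-used.
  (salza2:reset (input-checksum compressor)))

;; Usage: compress some octets, then read the checksum of the input.
(let ((compressor (make-instance 'crc-tracking-compressor))
      (data (make-array 9 :element-type '(unsigned-byte 8)
                          :initial-contents (map 'list #'char-code "123456789"))))
  (salza2:compress-data data compressor)
  (format t "~8,'0X~%" (salza2:result (input-checksum compressor))))
```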

[Generic function]
finish-data-format compressor => |

Called by FINISH-COMPRESSION. Outputs any epilogue bits or octets needed to produce a valid compressed data stream for compressor. This generic function should not be called directly, but may be specialized.

Checksums

Checksums are used in several data formats to check data integrity. For example, PNG uses a CRC32 checksum for its chunks of data. Salza2 exports support for two common checksums.

[Standard classes]
adler32-checksum
crc32-checksum

Instances of these classes may be created directly with make-instance.

[Generic function]
update checksum buffer start count => |

Updates checksum with count octets from the octet vector buffer, starting at start.

[Generic function]
result checksum => result

Returns the accumulated value of checksum as an integer.

[Generic function]
result-octets checksum => result-list

Returns the individual octets of checksum as a list of octets, most significant octet first.

[Generic function]
reset checksum => |

The default method for checksum objects resets the internal state of checksum so it may be re-used.
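
The checksum protocol can be exercised on its own, as in the following sketch. The helper names ascii-octets and checksum-of are hypothetical, and the sketch assumes ASDF can locate the salza2 system and that char-code yields ASCII codes for the characters used.

```lisp
;; Assumption: ASDF can find the "salza2" system (for example via Quicklisp).
(require :asdf)
(asdf:load-system "salza2")

(defun ascii-octets (string)
  "Return a simple octet vector of the character codes of STRING
(hypothetical helper; ASCII input assumed)."
  (make-array (length string)
              :element-type '(unsigned-byte 8)
              :initial-contents (map 'list #'char-code string)))

(defun checksum-of (class octets)
  "Create a checksum of CLASS, feed it all of OCTETS, and return its
integer value (hypothetical helper)."
  (let ((checksum (make-instance class)))
    (salza2:update checksum octets 0 (length octets))
    (salza2:result checksum)))

;; The standard CRC-32 check value for "123456789" is CBF43926, and the
;; Adler-32 value commonly quoted for "Wikipedia" is 11E60398.
(format t "~8,'0X~%" (checksum-of 'salza2:crc32-checksum
                                  (ascii-octets "123456789")))
(format t "~8,'0X~%" (checksum-of 'salza2:adler32-checksum
                                  (ascii-octets "Wikipedia")))
```

For data that arrives in pieces, UPDATE can be called once per piece on the same checksum object before reading RESULT.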

Shortcuts

Some shortcuts for common compression tasks are available.

[Function]
make-stream-output-callback stream => callback

Creates and returns a callback function that writes all compressed data to stream. It is defined like this:
(defun make-stream-output-callback (stream)
  (lambda (buffer end)
    (write-sequence buffer stream :end end)))
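
A sketch of the stream callback in use follows. The helper name gzip-octets-to-file is hypothetical and the sketch assumes ASDF can locate the salza2 system. The output stream is opened with an octet element type so that write-sequence can write the compressed octets directly.

```lisp
;; Assumption: ASDF can find the "salza2" system (for example via Quicklisp).
(require :asdf)
(asdf:load-system "salza2")

(defun gzip-octets-to-file (octets output-file)
  "Write the octet vector OCTETS to OUTPUT-FILE in GZIP format and return
the truename of the file (hypothetical helper)."
  (with-open-file (stream output-file
                          :direction :output
                          :element-type '(unsigned-byte 8)
                          :if-exists :supersede)
    ;; WITH-COMPRESSOR calls FINISH-COMPRESSION while the stream is still
    ;; open, so the end of the GZIP data reaches the file.
    (salza2:with-compressor (compressor 'salza2:gzip-compressor
                                        :callback (salza2:make-stream-output-callback
                                                   stream))
      (salza2:compress-octet-vector octets compressor)))
  (probe-file output-file))
```

Nesting the compressor inside with-open-file matters here: finishing compression after the stream has been closed would leave the file truncated.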

[Function]
gzip-stream input-stream output-stream => |

Compresses all data read from input-stream and writes the compressed data to output-stream.

[Function]
gzip-file input-file output-file => pathname

Compresses input-file and writes the compressed data to output-file.

[Function]
compress-data data compressor-designator &rest initargs => compressed-data

Compresses the octet vector data and returns the compressed data as an octet vector. compressor-designator should be either a compressor object, designating itself, or a symbol, designating a compressor created as if by (apply #'make-instance compressor-designator initargs).

For example:

* (compress-data (sb-ext:string-to-octets "Hello, hello, hello, hello world.")
                 'zlib-compressor)
#(8 153 243 72 205 201 201 215 81 200 192 164 20 202 243 139 114 82 244 0 194 64 11 139)

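References

  1. Deutsch and Gailly, ZLIB Compressed Data Format Specification version 3.3 (RFC 1950)
  2. Deutsch, DEFLATE Compressed Data Format Specification version 1.3 (RFC 1951)
  3. Deutsch, GZIP file format specification version 4.3 (RFC 1952)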

Acknowledgements

Thanks to Paul Khuong for his help optimizing the modulo-8191 hashing.

Thanks to Austin Haas for providing some test SWF files demonstrating a data format bug.

Feedback

Please direct any comments, questions, bug reports, or other feedback to Zach Beane.