Who invented zip files?
What is even more exciting is that we can left-shift those limits so that they are all three bits wide. Let us call them the sentinel bits for each codeword length. This means we can look at three bits of input and compare against the sentinel bits to figure out how long our codeword is. Once that is done, we shift the input bits so as to consider only the right number of them, and then find the symbol index as shown above.
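To make that concrete, here is a minimal sketch of the decoding step, assuming a toy code with a maximum codeword length of three bits and assuming the next bits have already been extracted in codeword (MSB-first) order; all names are invented for this sketch:

```c
#include <stdint.h>

#define MAX_LEN 3 /* Longest codeword length in this toy code. */

/* Precomputed during decoder setup:
   sentinel[l]: the limit for length l (one past its last codeword),
                left-shifted so that it is MAX_LEN bits wide.
   offset[l]:   index of the first symbol with codeword length l in
                sorted order, minus the first codeword of that length
                (may be negative, hence int). */
struct decoder {
    uint16_t sentinel[MAX_LEN + 1];
    int offset[MAX_LEN + 1];
    const uint8_t *sorted_syms; /* Symbols sorted by (length, codeword). */
};

/* Decode one symbol. bits holds the next MAX_LEN input bits in
   codeword order; *len receives the decoded codeword's length. */
static int huf_decode(const struct decoder *d, uint16_t bits, int *len)
{
    for (int l = 1; l <= MAX_LEN; l++) {
        if (bits < d->sentinel[l]) {
            /* The codeword is the l leading bits. */
            uint16_t codeword = bits >> (MAX_LEN - l);
            *len = l;
            return d->sorted_syms[d->offset[l] + (int)codeword];
        }
    }
    return -1; /* No codeword matched: invalid input. */
}
```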
The time complexity of this is linear in the number of codeword bits, but it is space efficient, requires only a load and a comparison per step, and since shorter codewords are more frequent, it optimizes for the common case.

Deflate, introduced with PKZip 2.04c in 1993, is the compression method used in most Zip files today. It is also the compression method used in gzip, PNG, and many other file formats. It uses LZ77 compression and Huffman coding in a combination which will be described and implemented in this section.
Although those methods are rarely seen today, they remained in use for some time after the introduction of Deflate since they required less memory. Those legacy methods are covered in a follow-up article.
Deflate stores Huffman codewords in a least-significant-bit-first (LSB-first) bitstream, meaning that the first bit of the stream is stored in the least significant bit of the first byte.
For example, consider this bit stream, read left to right: 1-0-0-1-1. When stored LSB-first in a byte, the byte's value becomes 0b00011001 (binary) or 0x19 (hexadecimal). This might seem backwards (in a sense, it is), but one advantage is that it makes it easy to get the first N bits from a computer word: just mask off the N lowest bits. The following routines are from bitstream.h.
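A routine along these lines (a sketch, not the literal bitstream.h code) grabs the n lowest bits:

```c
#include <stdint.h>

/* Return the n least significant bits of x. With an LSB-first stream,
   these are the next n bits of input. Assumes 0 <= n < 64, since
   shifting a 64-bit value by 64 is undefined behavior in C. */
static inline uint64_t lsb(uint64_t x, int n)
{
    return x & (((uint64_t)1 << n) - 1);
}
```

For the 0x19 byte above, lsb(0x19, 3) yields 0b001: the first three stream bits 1, 0, 0, with the first bit in the least significant position.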
For our Huffman decoder, we want to look at the next bits in the stream (enough bits for the longest possible codeword), and then advance the stream by the number of bits used by the decoded symbol. For the output bitstream, we write bits using a read-modify-write sequence: in the fast case, a bit write can be done with a 64-bit read, some bit operations, and a 64-bit write. We also want an efficient way of writing bytes to the stream. One could of course perform repeated 8-bit writes, but using memcpy is much faster.
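A sketch of such a byte-writing routine might look as follows (the struct fields and names are assumptions for this sketch, not the original bitstream.h interface):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical output bitstream state. */
struct ostream {
    uint8_t *dst;   /* Output buffer. */
    size_t cap;     /* Capacity of dst in bytes. */
    size_t bit_pos; /* Number of bits written so far. */
};

/* Write len bytes starting at the next byte boundary. A single
   memcpy is much faster than len separate 8-bit writes. */
static bool ostream_write_bytes_aligned(struct ostream *os,
                                        const uint8_t *src, size_t len)
{
    size_t byte_pos = (os->bit_pos + 7) / 8; /* Round up to a byte. */

    if (byte_pos + len > os->cap)
        return false;

    memcpy(&os->dst[byte_pos], src, len);
    os->bit_pos = (byte_pos + len) * 8;
    return true;
}
```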
Since the compression algorithm is called Deflate (to let the air out of something), the decompression process is sometimes referred to as Inflation. Studying this process first will give us an understanding of how the format works. The code is available in the first part of deflate.h.

Deflate-compressed data is stored as a series of blocks. Each block starts with a 3-bit header where the first (least significant) bit is set if this is the final block of the series, and the other two bits indicate the block type.
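A sketch of a loop that reads this header and dispatches on the block type (the istream helpers are hypothetical, and the per-type functions, covered below, are only declared; the three block types are described just after this):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit-reader interface. */
struct istream;
uint64_t istream_bits(struct istream *is);       /* Peek next bits, LSB-first. */
bool istream_advance(struct istream *is, int n); /* Consume n bits. */

/* One helper per block type; implemented further below. */
bool inflate_stored_block(struct istream *is);
bool inflate_fixed_block(struct istream *is);
bool inflate_dynamic_block(struct istream *is);

/* Process blocks until one with the final-block bit set has been read. */
bool inflate(struct istream *is)
{
    bool bfinal;

    do {
        uint64_t bits = istream_bits(is);
        bfinal = bits & 1;                  /* Bit 0: is this the last block? */
        int btype = (int)((bits >> 1) & 3); /* Bits 1-2: the block type. */
        if (!istream_advance(is, 3))
            return false;

        bool ok;
        switch (btype) {
        case 0:  ok = inflate_stored_block(is);  break;
        case 1:  ok = inflate_fixed_block(is);   break;
        case 2:  ok = inflate_dynamic_block(is); break;
        default: return false;              /* Block type 3 is reserved. */
        }
        if (!ok)
            return false;
    } while (!bfinal);

    return true;
}
```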
There are three block types: uncompressed (0), compressed with fixed Huffman codes (1), and compressed with "dynamic" Huffman codes (2). The decompression is driven by a loop like the sketch above, relying on helper functions for the different block types which will be implemented further below.

The simplest block type is the non-compressed or "stored" block. It begins at the next 8-bit boundary of the bitstream, with a 16-bit word len indicating the length of the block, followed by another 16-bit word nlen which is the ones' complement (all bits inverted) of len.
The idea is presumably that nlen acts as a simple checksum of len: if the file is corrupted, it is likely that the values are no longer each other's complements, and the program can detect the error.
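A sketch of stored-block processing, assuming the input has already been advanced to the byte boundary after the block header (raw pointers instead of the bitstream type, and a simpler signature than in the driver sketch earlier; all names invented here):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a stored block from src into dst. Returns the number of input
   bytes consumed, or 0 on error. */
static size_t read_stored_block(const uint8_t *src, size_t src_len,
                                uint8_t *dst, size_t dst_cap)
{
    if (src_len < 4)
        return 0;

    /* len and nlen are 16-bit little-endian words. */
    uint16_t len  = (uint16_t)(src[0] | (src[1] << 8));
    uint16_t nlen = (uint16_t)(src[2] | (src[3] << 8));

    /* nlen must be the ones' complement of len. */
    if (nlen != (uint16_t)~len)
        return 0;

    if (src_len < 4 + (size_t)len || dst_cap < len)
        return 0;

    memcpy(dst, &src[4], len);
    return 4 + (size_t)len;
}
```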
After len and nlen follows the non-compressed data. Because the block length is a 16-bit value, it is limited to 65,535 bytes.

Compressed Deflate blocks use Huffman codes to represent a sequence of LZ77 literals and back references terminated by an end-of-block marker. One Huffman code, the litlen code, is used for literals, back reference lengths, and the end-of-block marker.
A second code, the dist code, is used for back reference distances. The litlen code encodes values between 0 and 285. Values 0 through 255 represent literal bytes, 256 is the end-of-block marker, and values 257 through 285 represent back reference lengths.
Back references are between 3 and 258 bytes long. The litlen value determines a base length to which zero or more extra bits from the stream are added to get the full length according to the table below. For example, a litlen value of 269 indicates a base length of 19 and two extra bits. Adding the next two bits from the stream yields a final length between 19 and 22. Note that litlen value 284 plus five extra bits could actually represent lengths 227 through 258, but the specification indicates that 258, the maximum back reference length, should be represented using a separate litlen value, 285.
This is presumably to allow for a shorter encoding in cases where the maximum length is common. The decompressor uses a table that maps from litlen value minus 257 to base length and extra bits.

The fixed litlen Huffman code is a canonical code with the following codeword lengths: 8 bits for values 0 through 143, 9 bits for 144 through 255, 7 bits for 256 through 279, and 8 bits for 280 through 287 (286 and 287 are not valid litlen values, but they participate in the code construction).

Back reference distances, ranging from 1 to 32,768, are encoded using a scheme similar to the one for lengths. The dist Huffman code encodes values between 0 and 29, each corresponding to a base distance to which a number of extra bits are added to get the final distance. Both tables are sketched below.
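For reference, the two tables look like this in C, with the values taken from the Deflate specification (RFC 1951); the struct and array names are invented for this sketch:

```c
#include <stdint.h>

/* Base lengths and extra bits for litlen values 257..285 (RFC 1951).
   Index by litlen value minus 257. */
struct litlen_entry { uint16_t base_len; uint8_t ebits; };

static const struct litlen_entry litlen_tbl[29] = {
    {   3, 0 }, {   4, 0 }, {   5, 0 }, {   6, 0 }, {   7, 0 }, {   8, 0 },
    {   9, 0 }, {  10, 0 }, {  11, 1 }, {  13, 1 }, {  15, 1 }, {  17, 1 },
    {  19, 2 }, {  23, 2 }, {  27, 2 }, {  31, 2 }, {  35, 3 }, {  43, 3 },
    {  51, 3 }, {  59, 3 }, {  67, 4 }, {  83, 4 }, {  99, 4 }, { 115, 4 },
    { 131, 5 }, { 163, 5 }, { 195, 5 }, { 227, 5 }, { 258, 0 },
};

/* Base distances and extra bits for dist values 0..29 (RFC 1951). */
struct dist_entry { uint16_t base_dist; uint8_t ebits; };

static const struct dist_entry dist_tbl[30] = {
    {     1,  0 }, {     2,  0 }, {     3,  0 }, {     4,  0 },
    {     5,  1 }, {     7,  1 }, {     9,  2 }, {    13,  2 },
    {    17,  3 }, {    25,  3 }, {    33,  4 }, {    49,  4 },
    {    65,  5 }, {    97,  5 }, {   129,  6 }, {   193,  6 },
    {   257,  7 }, {   385,  7 }, {   513,  8 }, {   769,  8 },
    {  1025,  9 }, {  1537,  9 }, {  2049, 10 }, {  3073, 10 },
    {  4097, 11 }, {  6145, 11 }, {  8193, 12 }, { 12289, 12 },
    { 16385, 13 }, { 24577, 13 },
};
```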
The fixed dist code is a canonical Huffman code where all codewords are 5 bits long. Note that, as an optimization, when there is enough room in the output buffer, we output back references using a routine which copies 64 bits at a time. Short back references will then all be handled by a single iteration, which is great for branch prediction.
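A sketch of that routine. It may write up to 7 bytes past the end of the back reference, so the caller must guarantee that much headroom in the output buffer; the byte-by-byte fallback handles self-overlapping copies:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-by-byte fallback for self-overlapping back references. */
static void output_backref(uint8_t *dst, size_t dst_pos,
                           size_t dist, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[dst_pos + i] = dst[dst_pos - dist + i];
}

/* Copy a back reference 64 bits at a time. Any garbage written past
   dst_pos + len is overwritten by subsequent output. */
static void output_backref64(uint8_t *dst, size_t dst_pos,
                             size_t dist, size_t len)
{
    assert(len > 0);
    assert(dist <= dst_pos && "cannot reference before start of dst");

    if (len > dist) {
        /* The copy would overlap itself; fall back. */
        output_backref(dst, dst_pos, dist, len);
        return;
    }

    size_t i = 0;
    do {
        uint64_t tmp;
        memcpy(&tmp, &dst[dst_pos - dist + i], 8);
        memcpy(&dst[dst_pos + i], &tmp, 8);
        i += 8;
    } while (i < len);
}
```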
Deflate blocks using dynamic Huffman codes work similarly to the blocks described above, but instead of using pre-determined Huffman codes for the litlen and dist codes, they use codes that are stored in the Deflate stream itself, at the start of the block.
The name is perhaps unfortunate, since dynamic Huffman codes can also refer to codes that change during the coding process, sometimes called adaptive Huffman coding. The codes described here have nothing to do with that; they are only dynamic in the sense that different blocks can use different codes. The litlen and dist codes for a dynamic Deflate block are stored as a series of codeword lengths.
Those codeword lengths are themselves encoded using a third Huffman code, which we will call the codelen code. Did I mention it was intricate?
At the beginning of the dynamic block are 14 bits that define the number of litlen, dist, and codelen codeword lengths that should be read from the block. After those bits follow the codeword lengths for the codelen code, as sketched below. Note that fewer than 19 codelen lengths may be stored; the lengths are kept in a special order, chosen so that the trailing lengths are likely to all be zero and do not have to be stored in the block.
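A sketch of reading that header (the istream helpers are hypothetical; the bit widths, offsets, and the special order are from RFC 1951):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct istream;
uint64_t istream_bits(struct istream *is);       /* Peek next bits, LSB-first. */
bool istream_advance(struct istream *is, int n); /* Consume n bits. */

/* The order in which codelen codeword lengths are stored (RFC 1951). */
static const int codelen_order[19] =
    { 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 };

/* Read the 14 header bits (5-bit hlit, 5-bit hdist, 4-bit hclen) and
   then the 3-bit codelen codeword lengths, in the special order. */
static bool read_dynamic_header(struct istream *is,
                                size_t *num_litlen_lens,
                                size_t *num_dist_lens,
                                uint8_t codelen_lens[19])
{
    uint64_t bits = istream_bits(is);

    *num_litlen_lens = (bits & 0x1f) + 257; bits >>= 5; /* hlit + 257 */
    *num_dist_lens   = (bits & 0x1f) + 1;   bits >>= 5; /* hdist + 1 */
    size_t num_codelen_lens = (bits & 0xf) + 4;         /* hclen + 4 */
    if (!istream_advance(is, 14))
        return false;

    /* Lengths not present in the stream are zero. */
    for (int i = 0; i < 19; i++)
        codelen_lens[codelen_order[i]] = 0;

    for (size_t i = 0; i < num_codelen_lens; i++) {
        codelen_lens[codelen_order[i]] = istream_bits(is) & 0x7;
        if (!istream_advance(is, 3))
            return false;
    }
    return true;
}
```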
With the codelen decoder set up, we can proceed to read the litlen and dist codeword lengths from the stream. Lengths 16, 17, and 18 are not real lengths; rather, they indicate repetitions: 16 repeats the previous length 3 to 6 times, 17 repeats a zero length 3 to 10 times, and 18 repeats a zero length 11 to 138 times, with the exact repeat count determined by extra bits from the stream.
The litlen and dist codeword lengths are read in one go, not separately, because code length runs can carry over from the last litlen lengths to the first dist lengths. A sketch of this decoding:
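```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, assumed to exist for this sketch: decode one
   symbol with the codelen Huffman decoder, and read n extra bits. */
int next_codelen_sym(void);
int next_bits(int n);

/* Decode num_lens codeword lengths (litlen lengths followed directly
   by dist lengths) using the codelen code. A stricter decoder would
   reject runs that overrun num_lens instead of truncating them. */
static bool read_code_lengths(uint8_t *lens, size_t num_lens)
{
    size_t i = 0;

    while (i < num_lens) {
        int sym = next_codelen_sym();

        if (sym >= 0 && sym <= 15) {
            lens[i++] = (uint8_t)sym;          /* A literal code length. */
        } else if (sym == 16) {
            if (i == 0)
                return false;                  /* Nothing to repeat. */
            int n = 3 + next_bits(2);          /* Repeat previous 3-6 times. */
            while (n-- > 0 && i < num_lens) {
                lens[i] = lens[i - 1];
                i++;
            }
        } else if (sym == 17) {
            int n = 3 + next_bits(3);          /* 3-10 zero lengths. */
            while (n-- > 0 && i < num_lens)
                lens[i++] = 0;
        } else if (sym == 18) {
            int n = 11 + next_bits(7);         /* 11-138 zero lengths. */
            while (n-- > 0 && i < num_lens)
                lens[i++] = 0;
        } else {
            return false;                      /* Invalid symbol. */
        }
    }
    return true;
}
```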
With the codeword lengths ready for use, we can set up the Huffman decoders and return to the task of decoding literals and back references.

From the sections above, we have all the tools needed for Deflate compression: Lempel-Ziv, Huffman coding, bitstreams, and the description of the three Deflate block types.
This section puts the pieces together to finally perform Deflate compression. Lempel—Ziv compression parses the source data into a sequence of back references and literals. This sequence needs to be divided and encoded into Deflate blocks as described in the previous section. Choosing how to do this division is sometimes referred to as block splitting. On the one hand, each new block carries some overhead which varies depending on block type and contents, so fewer blocks means less overhead.
On the other hand, the overhead from starting a new block might be worth it, for example if the characteristics of the data lead to a more efficient Huffman encoding in the new block and smaller output overall.
Block splitting is a difficult optimization problem. Some compressors such as Zopfli try harder than others, but most just use a greedy approach: output a block once a certain size has been reached.
To be able to freely choose any of the three block types, we limit the block size to at most 65,535 bytes, the maximum size of a stored block. We use a structure to keep track of the output bitstream and the contents of the current block during deflation.
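Such a structure might look roughly like this (the field names are assumptions for this sketch, not the original source):

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCK_LEN 65535

struct ostream; /* Output bitstream, as sketched earlier. */

/* Hypothetical deflater state: the output stream plus the contents
   and statistics of the block currently being built. */
struct deflater {
    struct ostream *os;

    const uint8_t *block_src; /* First source byte of the current block. */
    size_t block_len;         /* Number of source bytes in the block. */

    /* Symbol frequencies for constructing the block's Huffman codes. */
    uint16_t litlen_freqs[286];
    uint16_t dist_freqs[30];
    /* ... buffered literals and back references would go here ... */
};
```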
The interesting part is of course writing the blocks. Writing an uncompressed block is straightforward. To write a static Huffman block, we first generate canonical Huffman codes based on the fixed codeword lengths for the litlen and dist codes, and then iterate through the block, writing the symbols using those codes. Dynamic Huffman blocks are of course the trickiest to write, since they include the intricate encoding of the litlen and dist codes.
We will use a struct to represent their encoding. First, we drop trailing zero litlen and dist codeword lengths and copy the remaining lengths into a common array for encoding.
We cannot drop all trailing zeros: it is not possible to encode a Deflate block with fewer than one dist code, and it is not possible to have fewer than 257 litlen codes; since there is always an end-of-block marker, there will always be a non-zero code length for symbol 256. Once the code lengths are in a single array, we perform the encoding, using special symbols for runs of identical code lengths, as in the sketch below.
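A sketch of that run-length encoding (the symbol semantics are from RFC 1951; the output arrays, which must hold at least n entries, receive the codelen symbols and the value of each symbol's extra bits):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode lens[0..n) as codelen symbols 0-18. Returns the number of
   symbols written to syms/extra. */
static size_t encode_lens(const uint8_t *lens, size_t n,
                          uint8_t *syms, uint8_t *extra)
{
    size_t i = 0, out = 0;

    while (i < n) {
        /* Find the run of identical lengths starting at i. */
        size_t run = 1;
        while (i + run < n && lens[i + run] == lens[i])
            run++;

        if (lens[i] == 0 && run >= 3) {
            if (run > 138)
                run = 138;
            if (run <= 10) {
                syms[out] = 17; extra[out++] = (uint8_t)(run - 3);
            } else {
                syms[out] = 18; extra[out++] = (uint8_t)(run - 11);
            }
            i += run;
        } else if (i > 0 && lens[i] == lens[i - 1] && run >= 3) {
            if (run > 6)
                run = 6;
            syms[out] = 16; extra[out++] = (uint8_t)(run - 3); /* Repeat prev. */
            i += run;
        } else {
            syms[out] = lens[i]; extra[out++] = 0; /* Plain length 0-15. */
            i++;
        }
    }
    return out;
}
```

Note that symbol 16 repeats the previous length, so the first length of a non-zero run is always emitted literally before a repeat symbol can be used.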
The symbols used in the encoding above will in turn get written using a Huffman code, the "codelen code". The codeword lengths of the codelen code are written to the block in a certain order, with lengths more likely to be zero coming last, and a function is used to count how many of the lengths need to be written. Assuming we have the litlen and dist codes set up, the encoding of their codeword lengths, and the code for that encoding, we can write the dynamic Huffman block.

For each block, we want to use the type that needs the smallest number of bits.
For an uncompressed block, the length can be computed quickly. For Huffman-encoded blocks, we can compute the length of the body using the litlen and dist symbol frequencies and codeword lengths, as in the sketch below. For a static block, the total length is 3 bits for the header plus the length of the body; for a dynamic block, computing the size of the header requires a bit more work.
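The body cost is simply frequency times codeword length, plus any extra bits (a sketch; the ebits arrays hold each symbol's extra-bit count and are zero for literals and the end-of-block marker):

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed for the body of a Huffman-coded block: every
   occurrence of a symbol costs its codeword length plus the extra
   bits that follow it. */
static size_t huffman_block_body_bits(
        const uint16_t litlen_freqs[286], const uint8_t litlen_lens[286],
        const uint8_t litlen_ebits[286],
        const uint16_t dist_freqs[30], const uint8_t dist_lens[30],
        const uint8_t dist_ebits[30])
{
    size_t bits = 0;

    for (size_t i = 0; i < 286; i++)
        bits += (size_t)litlen_freqs[i] * (litlen_lens[i] + litlen_ebits[i]);

    for (size_t i = 0; i < 30; i++)
        bits += (size_t)dist_freqs[i] * (dist_lens[i] + dist_ebits[i]);

    return bits;
}
```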
Finally, the driver of the whole deflation process simply has to set up the initial state, kick off the Lempel-Ziv compression, and write the final block.

We have seen above exactly how the Deflate compression used in Zip files works, but what about the file format itself? This section explains it in detail and provides an implementation; the code is available in zip.h.

Each archive member is compressed and stored individually.
This means that even if there are similarities between files in the archive, those similarities are not exploited to generate better compression. Having the Central Directory at the end enables an archive to be created gradually. As each member file is compressed, it gets written to the archive, and the index is written afterwards when all the compressed sizes, and therefore the file offsets, are known. A file can also be added to an existing archive fairly easily, by putting it after the last member and re-writing the Central Directory.
The ability to create archives gradually was especially important for archives spanning multiple floppy disks, or volumes.
As compression progressed, PKZip would prompt the user to insert new floppies, and finally write the Central Directory to the last one(s). To extract a multi-volume archive, PKZip would first ask for the last floppy in order to read the Central Directory, and then for whatever floppies were needed to extract the requested files.

Perhaps surprisingly, there is no rule against having multiple files with the same name in an archive.
This can lead to great confusion during file extraction: if there are multiple files with the specified name, which one should be extracted?

All values in the file format are stored in little-endian byte order, and each structure's length field counts its length in bytes. The structures in a Zip file are marked with 4-byte signatures; the End of Central Directory record, for example, can be located by searching for its unique signature, 0x06054b50.
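Reading those fields portably means assembling the values byte by byte, for example with helpers like these (names invented for this sketch):

```c
#include <stdint.h>

/* Read little-endian 16- and 32-bit values from a byte buffer,
   independent of the host's endianness. */
static uint16_t read16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

static uint32_t read32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Because the End of Central Directory record is followed by a variable-length comment, a reader typically scans backwards from the end of the file until read32le finds the signature.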
Katz is also credited with pioneering shareware marketing. The company sold its compression tools to developers, but offered downloadable and shareable software to consumers.
Users may use the software for free on a limited basis, but to continue using the program, they must register and pay. Steve Burg, president of NeoWorx Inc., described Katz, who he said struggled with alcoholism in the latter years of his life, as a friend and mentor.
He kept most of his personal thoughts to himself.

One persistent problem for these early archivers was the size of the single files they produced.
If you create an archive of every file on a computer, the size of that one file is obviously at least equal to the combined size of all the files stored on the computer. Data transfer and storage have almost always been the slowest and most expensive aspects of computer operation, so making data smaller has always been a very profitable business. And like AI, data compression has a reputation even among software engineers as an arcane art.
By the early 1990s, data compression engineers had achieved a sort of rockstar status within the software industry for their arcane statistical wizardry and low-level programming incantations.
Among the luminaries of this generation was a man from Milwaukee named Phil Katz. In the mid-to-late 80s, PCs became a rapidly growing component of American business and education, producing ever-larger volumes of data. In the blink of an eye, fixed disks changed from an expensive luxury into an absolute necessity, every form of software was sold in a box containing floppy disks, and office supply stores started selling floppies directly to consumers in bulk.
Anyone who could reliably cut the consumption of floppies had the ear of investors and computer enthusiasts. Katz wrote a program that combined the roles of archiving and compression, producing single files that both aggregated multiple files and condensed them into a much smaller amount of storage space. He spoke the language of machines better than most, and the compression program he wrote was both speedier and more efficient than its predecessor, a program named ARC.
In fact, he originally planned to name it PKARC, but a trademark dispute forced him to find a new name for his brainchild, one that would fit in the three letters allowed for file extensions in those days. .ZIP took its place in the pantheon of famous extensions alongside .PDF and others.

Two things helped the format spread. First, the PKZIP program was distributed as shareware, meaning that there was a free version that people could distribute along with their compressed archives, so that anyone could decompress them without having to buy their own copy of the software.
Second, and perhaps most importantly, Katz distributed the ZIP standard, the rulebook for how to make and read ZIP files, along with every copy of the software. This meant that any software engineer reasonably versed in the art could write their own program to work with the ZIP format, and many did. The signature that separates the sections of a ZIP file appears at least twice for every file in the archive. The content of the separator is arbitrary; it just needs to be consistent, and it so happens that Katz made it his initials, PK.
The format has been extended multiple times in his absence, new features added and old ones changed; but his initials remain, an indelible reminder of his legacy.