update bundled libdeflate to v1.14

Signed-off-by: Ivailo Monev <xakepa10@gmail.com>
Ivailo Monev 2022-09-23 13:26:45 +03:00
parent c50a283a39
commit a7a9b5317b
12 changed files with 793 additions and 447 deletions

View file

@ -1,2 +1,2 @@
This is Git checkout 72b2ce0d28970b1affc31efeb86daeffee1d7410
This is Git checkout 18d6cc22b75643ec52111efeb27a22b9d860a982
from https://github.com/ebiggers/libdeflate that has not been modified.

View file

@ -27,11 +27,13 @@ For the release notes, see the [NEWS file](NEWS.md).
## Table of Contents
- [Building](#building)
- [For UNIX](#for-unix)
- [For macOS](#for-macos)
- [For Windows](#for-windows)
- [Using Cygwin](#using-cygwin)
- [Using MSYS2](#using-msys2)
- [Using the Makefile](#using-the-makefile)
- [For UNIX](#for-unix)
- [For macOS](#for-macos)
- [For Windows](#for-windows)
- [Using Cygwin](#using-cygwin)
- [Using MSYS2](#using-msys2)
- [Using a custom build system](#using-a-custom-build-system)
- [API](#api)
- [Bindings for other programming languages](#bindings-for-other-programming-languages)
- [DEFLATE vs. zlib vs. gzip](#deflate-vs-zlib-vs-gzip)
@ -42,7 +44,14 @@ For the release notes, see the [NEWS file](NEWS.md).
# Building
## For UNIX
libdeflate and the provided programs like `gzip` can be built using the provided
Makefile. If only the library is needed, it can alternatively be easily
integrated into applications and built using any build system; see [Using a
custom build system](#using-a-custom-build-system).
## Using the Makefile
### For UNIX
Just run `make`, then (if desired) `make install`. You need GNU Make and either
GCC or Clang. GCC is recommended because it builds slightly faster binaries.
@ -57,7 +66,7 @@ There are also many options which can be set on the `make` command line, e.g. to
omit library features or to customize the directories into which `make install`
installs files. See the Makefile for details.
## For macOS
### For macOS
Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh):
@ -65,7 +74,7 @@ Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh):
But if you need to build the binaries yourself, see the section for UNIX above.
## For Windows
### For Windows
Prebuilt Windows binaries can be downloaded from
https://github.com/ebiggers/libdeflate/releases. But if you need to build the
@ -84,7 +93,7 @@ binaries built with MinGW will be significantly faster.
Also note that 64-bit binaries are faster than 32-bit binaries and should be
preferred whenever possible.
### Using Cygwin
#### Using Cygwin
Run the Cygwin installer, available from https://cygwin.com/setup-x86_64.exe.
When you get to the package selection screen, choose the following additional
@ -119,7 +128,7 @@ or to build 32-bit binaries:
make CC=i686-w64-mingw32-gcc
### Using MSYS2
#### Using MSYS2
Run the MSYS2 installer, available from http://www.msys2.org/. After
installing, open an MSYS2 shell and run:
@ -161,6 +170,23 @@ and run the following commands:
Or to build 32-bit binaries, do the same but use "MSYS2 MinGW 32-bit" instead.
## Using a custom build system
The source files of the library are designed to be compilable directly, without
any prerequisite step like running a `./configure` script. Therefore, as an
alternative to building the library using the provided Makefile, the library
source files can be easily integrated directly into your application and built
using any build system.
You should compile both `lib/*.c` and `lib/*/*.c`. You don't need to worry
about excluding irrelevant architecture-specific code, as this is already
handled in the source files themselves using `#ifdef`s.
It is **strongly** recommended to use either gcc or clang, and to use `-O2`.
If you are doing a freestanding build with `-ffreestanding`, you must add
`-DFREESTANDING` as well, otherwise performance will suffer greatly.
# API
libdeflate has a simple API that is not zlib-compatible. You can create
@ -183,10 +209,7 @@ guessing. However, libdeflate's decompression routines do optionally provide
the actual number of output bytes in case you need it.
Windows developers: note that the calling convention of libdeflate.dll is
"stdcall" -- the same as the Win32 API. If you call into libdeflate.dll using a
non-C/C++ language, or dynamically using LoadLibrary(), make sure to use the
stdcall convention. Using the wrong convention may crash your application.
(Note: older versions of libdeflate used the "cdecl" convention instead.)
"cdecl". (libdeflate v1.4 through v1.12 used "stdcall" instead.)
# Bindings for other programming languages

View file

@ -144,8 +144,17 @@ typedef size_t machine_word_t;
/* restrict - hint that writes only occur through the given pointer */
#ifdef __GNUC__
# define restrict __restrict__
#elif defined(_MSC_VER)
/*
* Don't use MSVC's __restrict; it has nonstandard behavior.
* Standard restrict is okay, if it is supported.
*/
# if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L)
# define restrict restrict
# else
# define restrict
# endif
#else
/* Don't use MSVC's __restrict; it has nonstandard behavior. */
# define restrict
#endif
@ -200,6 +209,7 @@ typedef size_t machine_word_t;
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define STATIC_ASSERT(expr) ((void)sizeof(char[1 - 2 * !(expr)]))
#define ALIGN(n, a) (((n) + (a) - 1) & ~((a) - 1))
#define ROUND_UP(n, d) ((d) * DIV_ROUND_UP((n), (d)))
/* ========================================================================== */
/* Endianness handling */
@ -513,8 +523,10 @@ bsr32(u32 v)
#ifdef __GNUC__
return 31 - __builtin_clz(v);
#elif defined(_MSC_VER)
_BitScanReverse(&v, v);
return v;
unsigned long i;
_BitScanReverse(&i, v);
return i;
#else
unsigned i = 0;
@ -529,9 +541,11 @@ bsr64(u64 v)
{
#ifdef __GNUC__
return 63 - __builtin_clzll(v);
#elif defined(_MSC_VER) && defined(_M_X64)
_BitScanReverse64(&v, v);
return v;
#elif defined(_MSC_VER) && defined(_WIN64)
unsigned long i;
_BitScanReverse64(&i, v);
return i;
#else
unsigned i = 0;
@ -563,8 +577,10 @@ bsf32(u32 v)
#ifdef __GNUC__
return __builtin_ctz(v);
#elif defined(_MSC_VER)
_BitScanForward(&v, v);
return v;
unsigned long i;
_BitScanForward(&i, v);
return i;
#else
unsigned i = 0;
@ -579,9 +595,11 @@ bsf64(u64 v)
{
#ifdef __GNUC__
return __builtin_ctzll(v);
#elif defined(_MSC_VER) && defined(_M_X64)
_BitScanForward64(&v, v);
return v;
#elif defined(_MSC_VER) && defined(_WIN64)
unsigned long i;
_BitScanForward64(&i, v);
return i;
#else
unsigned i = 0;

View file

@ -28,7 +28,9 @@
#ifndef LIB_ARM_MATCHFINDER_IMPL_H
#define LIB_ARM_MATCHFINDER_IMPL_H
#ifdef __ARM_NEON
#include "cpu_features.h"
#if HAVE_NEON_NATIVE
# include <arm_neon.h>
static forceinline void
matchfinder_init_neon(mf_pos_t *data, size_t size)
@ -81,6 +83,6 @@ matchfinder_rebase_neon(mf_pos_t *data, size_t size)
}
#define matchfinder_rebase matchfinder_rebase_neon
#endif /* __ARM_NEON */
#endif /* HAVE_NEON_NATIVE */
#endif /* LIB_ARM_MATCHFINDER_IMPL_H */

View file

@ -31,7 +31,17 @@
* target instruction sets.
*/
static enum libdeflate_result ATTRIBUTES
#ifndef ATTRIBUTES
# define ATTRIBUTES
#endif
#ifndef EXTRACT_VARBITS
# define EXTRACT_VARBITS(word, count) ((word) & BITMASK(count))
#endif
#ifndef EXTRACT_VARBITS8
# define EXTRACT_VARBITS8(word, count) ((word) & BITMASK((u8)(count)))
#endif
static enum libdeflate_result ATTRIBUTES MAYBE_UNUSED
FUNCNAME(struct libdeflate_decompressor * restrict d,
const void * restrict in, size_t in_nbytes,
void * restrict out, size_t out_nbytes_avail,
@ -41,35 +51,36 @@ FUNCNAME(struct libdeflate_decompressor * restrict d,
u8 * const out_end = out_next + out_nbytes_avail;
u8 * const out_fastloop_end =
out_end - MIN(out_nbytes_avail, FASTLOOP_MAX_BYTES_WRITTEN);
/* Input bitstream state; see deflate_decompress.c for documentation */
const u8 *in_next = in;
const u8 * const in_end = in_next + in_nbytes;
const u8 * const in_fastloop_end =
in_end - MIN(in_nbytes, FASTLOOP_MAX_BYTES_READ);
bitbuf_t bitbuf = 0;
bitbuf_t saved_bitbuf;
machine_word_t bitsleft = 0;
u32 bitsleft = 0;
size_t overread_count = 0;
unsigned i;
bool is_final_block;
unsigned block_type;
u16 len;
u16 nlen;
unsigned num_litlen_syms;
unsigned num_offset_syms;
bitbuf_t tmpbits;
bitbuf_t litlen_tablemask;
u32 entry;
next_block:
/* Starting to read the next block */
;
STATIC_ASSERT(CAN_ENSURE(1 + 2 + 5 + 5 + 4 + 3));
STATIC_ASSERT(CAN_CONSUME(1 + 2 + 5 + 5 + 4 + 3));
REFILL_BITS();
/* BFINAL: 1 bit */
is_final_block = POP_BITS(1);
is_final_block = bitbuf & BITMASK(1);
/* BTYPE: 2 bits */
block_type = POP_BITS(2);
block_type = (bitbuf >> 1) & BITMASK(2);
if (block_type == DEFLATE_BLOCKTYPE_DYNAMIC_HUFFMAN) {
@ -81,17 +92,18 @@ next_block:
};
unsigned num_explicit_precode_lens;
unsigned i;
/* Read the codeword length counts. */
STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == ((1 << 5) - 1) + 257);
num_litlen_syms = POP_BITS(5) + 257;
STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == 257 + BITMASK(5));
num_litlen_syms = 257 + ((bitbuf >> 3) & BITMASK(5));
STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == ((1 << 5) - 1) + 1);
num_offset_syms = POP_BITS(5) + 1;
STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == 1 + BITMASK(5));
num_offset_syms = 1 + ((bitbuf >> 8) & BITMASK(5));
STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == ((1 << 4) - 1) + 4);
num_explicit_precode_lens = POP_BITS(4) + 4;
STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == 4 + BITMASK(4));
num_explicit_precode_lens = 4 + ((bitbuf >> 13) & BITMASK(4));
d->static_codes_loaded = false;
@ -103,16 +115,31 @@ next_block:
* merge one len with the previous fields.
*/
STATIC_ASSERT(DEFLATE_MAX_PRE_CODEWORD_LEN == (1 << 3) - 1);
if (CAN_ENSURE(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) {
d->u.precode_lens[deflate_precode_lens_permutation[0]] = POP_BITS(3);
if (CAN_CONSUME(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) {
d->u.precode_lens[deflate_precode_lens_permutation[0]] =
(bitbuf >> 17) & BITMASK(3);
bitbuf >>= 20;
bitsleft -= 20;
REFILL_BITS();
for (i = 1; i < num_explicit_precode_lens; i++)
d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3);
i = 1;
do {
d->u.precode_lens[deflate_precode_lens_permutation[i]] =
bitbuf & BITMASK(3);
bitbuf >>= 3;
bitsleft -= 3;
} while (++i < num_explicit_precode_lens);
} else {
for (i = 0; i < num_explicit_precode_lens; i++) {
ENSURE_BITS(3);
d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3);
}
bitbuf >>= 17;
bitsleft -= 17;
i = 0;
do {
if ((u8)bitsleft < 3)
REFILL_BITS();
d->u.precode_lens[deflate_precode_lens_permutation[i]] =
bitbuf & BITMASK(3);
bitbuf >>= 3;
bitsleft -= 3;
} while (++i < num_explicit_precode_lens);
}
for (; i < DEFLATE_NUM_PRECODE_SYMS; i++)
d->u.precode_lens[deflate_precode_lens_permutation[i]] = 0;
@ -121,13 +148,14 @@ next_block:
SAFETY_CHECK(build_precode_decode_table(d));
/* Decode the litlen and offset codeword lengths. */
for (i = 0; i < num_litlen_syms + num_offset_syms; ) {
u32 entry;
i = 0;
do {
unsigned presym;
u8 rep_val;
unsigned rep_count;
ENSURE_BITS(DEFLATE_MAX_PRE_CODEWORD_LEN + 7);
if ((u8)bitsleft < DEFLATE_MAX_PRE_CODEWORD_LEN + 7)
REFILL_BITS();
/*
* The code below assumes that the precode decode table
@ -135,9 +163,11 @@ next_block:
*/
STATIC_ASSERT(PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN);
/* Read the next precode symbol. */
entry = d->u.l.precode_decode_table[BITS(DEFLATE_MAX_PRE_CODEWORD_LEN)];
REMOVE_BITS((u8)entry);
/* Decode the next precode symbol. */
entry = d->u.l.precode_decode_table[
bitbuf & BITMASK(DEFLATE_MAX_PRE_CODEWORD_LEN)];
bitbuf >>= (u8)entry;
bitsleft -= entry; /* optimization: subtract full entry */
presym = entry >> 16;
if (presym < 16) {
@ -171,8 +201,10 @@ next_block:
/* Repeat the previous length 3 - 6 times. */
SAFETY_CHECK(i != 0);
rep_val = d->u.l.lens[i - 1];
STATIC_ASSERT(3 + ((1 << 2) - 1) == 6);
rep_count = 3 + POP_BITS(2);
STATIC_ASSERT(3 + BITMASK(2) == 6);
rep_count = 3 + (bitbuf & BITMASK(2));
bitbuf >>= 2;
bitsleft -= 2;
d->u.l.lens[i + 0] = rep_val;
d->u.l.lens[i + 1] = rep_val;
d->u.l.lens[i + 2] = rep_val;
@ -182,8 +214,10 @@ next_block:
i += rep_count;
} else if (presym == 17) {
/* Repeat zero 3 - 10 times. */
STATIC_ASSERT(3 + ((1 << 3) - 1) == 10);
rep_count = 3 + POP_BITS(3);
STATIC_ASSERT(3 + BITMASK(3) == 10);
rep_count = 3 + (bitbuf & BITMASK(3));
bitbuf >>= 3;
bitsleft -= 3;
d->u.l.lens[i + 0] = 0;
d->u.l.lens[i + 1] = 0;
d->u.l.lens[i + 2] = 0;
@ -197,20 +231,39 @@ next_block:
i += rep_count;
} else {
/* Repeat zero 11 - 138 times. */
STATIC_ASSERT(11 + ((1 << 7) - 1) == 138);
rep_count = 11 + POP_BITS(7);
STATIC_ASSERT(11 + BITMASK(7) == 138);
rep_count = 11 + (bitbuf & BITMASK(7));
bitbuf >>= 7;
bitsleft -= 7;
memset(&d->u.l.lens[i], 0,
rep_count * sizeof(d->u.l.lens[i]));
i += rep_count;
}
}
} while (i < num_litlen_syms + num_offset_syms);
} else if (block_type == DEFLATE_BLOCKTYPE_UNCOMPRESSED) {
u16 len, nlen;
/*
* Uncompressed block: copy 'len' bytes literally from the input
* buffer to the output buffer.
*/
ALIGN_INPUT();
bitsleft -= 3; /* for BTYPE and BFINAL */
/*
* Align the bitstream to the next byte boundary. This means
* the next byte boundary as if we were reading a byte at a
* time. Therefore, we have to rewind 'in_next' by any bytes
* that have been refilled but not actually consumed yet (not
* counting overread bytes, which don't increment 'in_next').
*/
bitsleft = (u8)bitsleft;
SAFETY_CHECK(overread_count <= (bitsleft >> 3));
in_next -= (bitsleft >> 3) - overread_count;
overread_count = 0;
bitbuf = 0;
bitsleft = 0;
SAFETY_CHECK(in_end - in_next >= 4);
len = get_unaligned_le16(in_next);
@ -229,6 +282,8 @@ next_block:
goto block_done;
} else {
unsigned i;
SAFETY_CHECK(block_type == DEFLATE_BLOCKTYPE_STATIC_HUFFMAN);
/*
@ -241,6 +296,9 @@ next_block:
* dynamic Huffman block.
*/
bitbuf >>= 3; /* for BTYPE and BFINAL */
bitsleft -= 3;
if (d->static_codes_loaded)
goto have_decode_tables;
@ -270,186 +328,344 @@ next_block:
SAFETY_CHECK(build_offset_decode_table(d, num_litlen_syms, num_offset_syms));
SAFETY_CHECK(build_litlen_decode_table(d, num_litlen_syms, num_offset_syms));
have_decode_tables:
litlen_tablemask = BITMASK(d->litlen_tablebits);
/*
* This is the "fastloop" for decoding literals and matches. It does
* bounds checks on in_next and out_next in the loop conditions so that
* additional bounds checks aren't needed inside the loop body.
*
* To reduce latency, the bitbuffer is refilled and the next litlen
* decode table entry is preloaded before each loop iteration.
*/
while (in_next < in_fastloop_end && out_next < out_fastloop_end) {
u32 entry, length, offset;
u8 lit;
if (in_next >= in_fastloop_end || out_next >= out_fastloop_end)
goto generic_loop;
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
do {
u32 length, offset, lit;
const u8 *src;
u8 *dst;
/* Refill the bitbuffer and decode a litlen symbol. */
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
preloaded:
if (CAN_ENSURE(3 * LITLEN_TABLEBITS +
DEFLATE_MAX_LITLEN_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_LENGTH_BITS) &&
(entry & HUFFDEC_LITERAL)) {
/*
* Consume the bits for the litlen decode table entry. Save the
* original bitbuf for later, in case the extra match length
* bits need to be extracted from it.
*/
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry; /* optimization: subtract full entry */
/*
* Begin by checking for a "fast" literal, i.e. a literal that
* doesn't need a subtable.
*/
if (entry & HUFFDEC_LITERAL) {
/*
* 64-bit only: fast path for decoding literals that
* don't need subtables. We do up to 3 of these before
* proceeding to the general case. This is the largest
* number of times that LITLEN_TABLEBITS bits can be
* extracted from a refilled 64-bit bitbuffer while
* still leaving enough bits to decode any match length.
* On 64-bit platforms, we decode up to 2 extra fast
* literals in addition to the primary item, as this
* increases performance and still leaves enough bits
* remaining for what follows. We could actually do 3,
* assuming LITLEN_TABLEBITS=11, but that actually
* decreases performance slightly (perhaps by messing
* with the branch prediction of the conditional refill
* that happens later while decoding the match offset).
*
* Note: the definitions of FASTLOOP_MAX_BYTES_WRITTEN
* and FASTLOOP_MAX_BYTES_READ need to be updated if the
* maximum number of literals decoded here is changed.
* number of extra literals decoded here is changed.
*/
REMOVE_ENTRY_BITS_FAST(entry);
lit = entry >> 16;
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
*out_next++ = lit;
if (entry & HUFFDEC_LITERAL) {
REMOVE_ENTRY_BITS_FAST(entry);
if (/* enough bits for 2 fast literals + length + offset preload? */
CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS +
LENGTH_MAXBITS,
OFFSET_TABLEBITS) &&
/* enough bits for 2 fast literals + slow literal + litlen preload? */
CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS +
DEFLATE_MAX_LITLEN_CODEWORD_LEN,
LITLEN_TABLEBITS)) {
/* 1st extra fast literal */
lit = entry >> 16;
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
*out_next++ = lit;
if (entry & HUFFDEC_LITERAL) {
REMOVE_ENTRY_BITS_FAST(entry);
/* 2nd extra fast literal */
lit = entry >> 16;
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
*out_next++ = lit;
if (entry & HUFFDEC_LITERAL) {
/*
* Another fast literal, but
* this one is in lieu of the
* primary item, so it doesn't
* count as one of the extras.
*/
lit = entry >> 16;
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
*out_next++ = lit;
continue;
}
}
} else {
/*
* Decode a literal. While doing so, preload
* the next litlen decode table entry and refill
* the bitbuffer. To reduce latency, we've
* arranged for there to be enough "preloadable"
* bits remaining to do the table preload
* independently of the refill.
*/
STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD(
LITLEN_TABLEBITS, LITLEN_TABLEBITS));
lit = entry >> 16;
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
*out_next++ = lit;
continue;
}
}
/*
* It's not a literal entry, so it can be a length entry, a
* subtable pointer entry, or an end-of-block entry. Detect the
* two unlikely cases by testing the HUFFDEC_EXCEPTIONAL flag.
*/
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
/* Subtable pointer or end-of-block entry */
if (entry & HUFFDEC_SUBTABLE_POINTER) {
REMOVE_BITS(LITLEN_TABLEBITS);
entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)];
}
SAVE_BITBUF();
REMOVE_ENTRY_BITS_FAST(entry);
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
goto block_done;
/* Literal or length entry, from a subtable */
} else {
/* Literal or length entry, from the main table */
SAVE_BITBUF();
REMOVE_ENTRY_BITS_FAST(entry);
}
length = entry >> 16;
if (entry & HUFFDEC_LITERAL) {
/*
* Literal that didn't get handled by the literal fast
* path earlier
* A subtable is required. Load and consume the
* subtable entry. The subtable entry can be of any
* type: literal, length, or end-of-block.
*/
*out_next++ = length;
continue;
}
/*
* Match length. Finish decoding it. We don't need to check
* for too-long matches here, as this is inside the fastloop
* where it's already been verified that the output buffer has
* enough space remaining to copy a max-length match.
*/
length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8);
entry = d->u.litlen_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
/* Decode the match offset. */
/* Refill the bitbuffer if it may be needed for the offset. */
if (unlikely(GET_REAL_BITSLEFT() <
DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS))
REFILL_BITS_IN_FASTLOOP();
STATIC_ASSERT(CAN_ENSURE(OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS));
STATIC_ASSERT(CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN -
OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS));
entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)];
if (entry & HUFFDEC_EXCEPTIONAL) {
/* Offset codeword requires a subtable */
REMOVE_BITS(OFFSET_TABLEBITS);
entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)];
/*
* On 32-bit, we might not be able to decode the offset
* symbol and extra offset bits without refilling the
* bitbuffer in between. However, this is only an issue
* when a subtable is needed, so do the refill here.
* 32-bit platforms that use the byte-at-a-time refill
* method have to do a refill here for there to always
* be enough bits to decode a literal that requires a
* subtable, then preload the next litlen decode table
* entry; or to decode a match length that requires a
* subtable, then preload the offset decode table entry.
*/
if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS))
if (!CAN_CONSUME_AND_THEN_PRELOAD(DEFLATE_MAX_LITLEN_CODEWORD_LEN,
LITLEN_TABLEBITS) ||
!CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXBITS,
OFFSET_TABLEBITS))
REFILL_BITS_IN_FASTLOOP();
if (entry & HUFFDEC_LITERAL) {
/* Decode a literal that required a subtable. */
lit = entry >> 16;
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
*out_next++ = lit;
continue;
}
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
goto block_done;
/* Else, it's a length that required a subtable. */
}
SAVE_BITBUF();
REMOVE_ENTRY_BITS_FAST(entry);
offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8));
/*
* Decode the match length: the length base value associated
* with the litlen symbol (which we extract from the decode
* table entry), plus the extra length bits. We don't need to
* consume the extra length bits here, as they were included in
* the bits consumed by the entry earlier. We also don't need
* to check for too-long matches here, as this is inside the
* fastloop where it's already been verified that the output
* buffer has enough space remaining to copy a max-length match.
*/
length = entry >> 16;
length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
/*
* Decode the match offset. There are enough "preloadable" bits
* remaining to preload the offset decode table entry, but a
* refill might be needed before consuming it.
*/
STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXFASTBITS,
OFFSET_TABLEBITS));
entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)];
if (CAN_CONSUME_AND_THEN_PRELOAD(OFFSET_MAXBITS,
LITLEN_TABLEBITS)) {
/*
* Decoding a match offset on a 64-bit platform. We may
* need to refill once, but then we can decode the whole
* offset and preload the next litlen table entry.
*/
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
/* Offset codeword requires a subtable */
if (unlikely((u8)bitsleft < OFFSET_MAXBITS +
LITLEN_TABLEBITS - PRELOAD_SLACK))
REFILL_BITS_IN_FASTLOOP();
bitbuf >>= OFFSET_TABLEBITS;
bitsleft -= OFFSET_TABLEBITS;
entry = d->offset_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
} else if (unlikely((u8)bitsleft < OFFSET_MAXFASTBITS +
LITLEN_TABLEBITS - PRELOAD_SLACK))
REFILL_BITS_IN_FASTLOOP();
} else {
/* Decoding a match offset on a 32-bit platform */
REFILL_BITS_IN_FASTLOOP();
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
/* Offset codeword requires a subtable */
bitbuf >>= OFFSET_TABLEBITS;
bitsleft -= OFFSET_TABLEBITS;
entry = d->offset_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
REFILL_BITS_IN_FASTLOOP();
/* No further refill needed before extra bits */
STATIC_ASSERT(CAN_CONSUME(
OFFSET_MAXBITS - OFFSET_TABLEBITS));
} else {
/* No refill needed before extra bits */
STATIC_ASSERT(CAN_CONSUME(OFFSET_MAXFASTBITS));
}
}
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry; /* optimization: subtract full entry */
offset = entry >> 16;
offset += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
/* Validate the match offset; needed even in the fastloop. */
SAFETY_CHECK(offset <= out_next - (const u8 *)out);
src = out_next - offset;
dst = out_next;
out_next += length;
/*
* Before starting to copy the match, refill the bitbuffer and
* preload the litlen decode table entry for the next loop
* iteration. This can increase performance by allowing the
* latency of the two operations to overlap.
* Before starting to issue the instructions to copy the match,
* refill the bitbuffer and preload the litlen decode table
* entry for the next loop iteration. This can increase
* performance by allowing the latency of the match copy to
* overlap with these other operations. To further reduce
* latency, we've arranged for there to be enough bits remaining
* to do the table preload independently of the refill, except
* on 32-bit platforms using the byte-at-a-time refill method.
*/
if (!CAN_CONSUME_AND_THEN_PRELOAD(
MAX(OFFSET_MAXBITS - OFFSET_TABLEBITS,
OFFSET_MAXFASTBITS),
LITLEN_TABLEBITS) &&
unlikely((u8)bitsleft < LITLEN_TABLEBITS - PRELOAD_SLACK))
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
/*
* Copy the match. On most CPUs the fastest method is a
* word-at-a-time copy, unconditionally copying at least 3 words
* word-at-a-time copy, unconditionally copying about 5 words
* since this is enough for most matches without being too much.
*
* The normal word-at-a-time copy works for offset >= WORDBYTES,
* which is most cases. The case of offset == 1 is also common
* and is worth optimizing for, since it is just RLE encoding of
* the previous byte, which is the result of compressing long
* runs of the same byte. We currently don't optimize for the
* less common cases of offset > 1 && offset < WORDBYTES; we
* just fall back to a traditional byte-at-a-time copy for them.
* runs of the same byte.
*
* Writing past the match 'length' is allowed here, since it's
* been ensured there is enough output space left for a slight
* overrun. FASTLOOP_MAX_BYTES_WRITTEN needs to be updated if
* the maximum possible overrun here is changed.
*/
src = out_next - offset;
dst = out_next;
out_next += length;
if (UNALIGNED_ACCESS_IS_FAST && offset >= WORDBYTES) {
copy_word_unaligned(src, dst);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
copy_word_unaligned(src, dst);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
do {
copy_word_unaligned(src, dst);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
while (dst < out_next) {
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
} while (dst < out_next);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
}
} else if (UNALIGNED_ACCESS_IS_FAST && offset == 1) {
machine_word_t v = repeat_byte(*src);
machine_word_t v;
/*
* This part tends to get auto-vectorized, so keep it
* copying a multiple of 16 bytes at a time.
*/
v = (machine_word_t)0x0101010101010101 * src[0];
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
do {
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
while (dst < out_next) {
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
}
} else if (UNALIGNED_ACCESS_IS_FAST) {
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
do {
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
} while (dst < out_next);
} else {
STATIC_ASSERT(DEFLATE_MIN_MATCH_LEN == 3);
*dst++ = *src++;
*dst++ = *src++;
do {
*dst++ = *src++;
} while (dst < out_next);
}
if (in_next < in_fastloop_end && out_next < out_fastloop_end)
goto preloaded;
break;
}
/* MASK_BITSLEFT() is needed when leaving the fastloop. */
MASK_BITSLEFT();
} while (in_next < in_fastloop_end && out_next < out_fastloop_end);
/*
* This is the generic loop for decoding literals and matches. This
@ -458,19 +674,24 @@ preloaded:
* critical, as most time is spent in the fastloop above instead. We
* therefore omit some optimizations here in favor of smaller code.
*/
generic_loop:
for (;;) {
u32 entry, length, offset;
u32 length, offset;
const u8 *src;
u8 *dst;
REFILL_BITS();
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
if (unlikely(entry & HUFFDEC_SUBTABLE_POINTER)) {
REMOVE_BITS(LITLEN_TABLEBITS);
entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)];
entry = d->u.litlen_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
}
SAVE_BITBUF();
REMOVE_BITS((u8)entry);
length = entry >> 16;
if (entry & HUFFDEC_LITERAL) {
if (unlikely(out_next == out_end))
@ -480,34 +701,27 @@ preloaded:
}
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
goto block_done;
length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8);
length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
if (unlikely(length > out_end - out_next))
return LIBDEFLATE_INSUFFICIENT_SPACE;
if (CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS)) {
ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS);
} else {
ENSURE_BITS(OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS);
if (!CAN_CONSUME(LENGTH_MAXBITS + OFFSET_MAXBITS))
REFILL_BITS();
entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)];
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
bitbuf >>= OFFSET_TABLEBITS;
bitsleft -= OFFSET_TABLEBITS;
entry = d->offset_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
if (!CAN_CONSUME(OFFSET_MAXBITS))
REFILL_BITS();
}
entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)];
if (entry & HUFFDEC_EXCEPTIONAL) {
REMOVE_BITS(OFFSET_TABLEBITS);
entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)];
if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS))
ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN -
OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS);
}
SAVE_BITBUF();
REMOVE_BITS((u8)entry);
offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8));
offset = entry >> 16;
offset += EXTRACT_VARBITS8(bitbuf, entry) >> (u8)(entry >> 8);
bitbuf >>= (u8)entry;
bitsleft -= entry;
SAFETY_CHECK(offset <= out_next - (const u8 *)out);
src = out_next - offset;
dst = out_next;
out_next += length;
@ -521,9 +735,6 @@ preloaded:
}
block_done:
/* MASK_BITSLEFT() is needed when leaving the fastloop. */
MASK_BITSLEFT();
/* Finished decoding a block */
if (!is_final_block)
@ -531,12 +742,21 @@ block_done:
/* That was the last block. */
/* Discard any readahead bits and check for excessive overread. */
ALIGN_INPUT();
bitsleft = (u8)bitsleft;
/*
* If any of the implicit appended zero bytes were consumed (not just
* refilled) before hitting end of stream, then the data is bad.
*/
SAFETY_CHECK(overread_count <= (bitsleft >> 3));
/* Optionally return the actual number of bytes consumed. */
if (actual_in_nbytes_ret) {
/* Don't count bytes that were refilled but not consumed. */
in_next -= (bitsleft >> 3) - overread_count;
/* Optionally return the actual number of bytes read. */
if (actual_in_nbytes_ret)
*actual_in_nbytes_ret = in_next - (u8 *)in;
}
/* Optionally return the actual number of bytes written. */
if (actual_out_nbytes_ret) {
@ -550,3 +770,5 @@ block_done:
#undef FUNCNAME
#undef ATTRIBUTES
#undef EXTRACT_VARBITS
#undef EXTRACT_VARBITS8

View file

@ -475,7 +475,7 @@ struct deflate_output_bitstream;
struct libdeflate_compressor {
/* Pointer to the compress() implementation chosen at allocation time */
void (*impl)(struct libdeflate_compressor *c, const u8 *in,
void (*impl)(struct libdeflate_compressor *restrict c, const u8 *in,
size_t in_nbytes, struct deflate_output_bitstream *os);
/* The compression level with which this compressor was created */
@ -1041,7 +1041,6 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[],
unsigned parent = A[node] >> NUM_SYMBOL_BITS;
unsigned parent_depth = A[parent] >> NUM_SYMBOL_BITS;
unsigned depth = parent_depth + 1;
unsigned len = depth;
/*
* Set the depth of this node so that it is available when its
@ -1054,19 +1053,19 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[],
* constraint. This is not the optimal method for generating
* length-limited Huffman codes! But it should be good enough.
*/
if (len >= max_codeword_len) {
len = max_codeword_len;
if (depth >= max_codeword_len) {
depth = max_codeword_len;
do {
len--;
} while (len_counts[len] == 0);
depth--;
} while (len_counts[depth] == 0);
}
/*
* Account for the fact that we have a non-leaf node at the
* current depth.
*/
len_counts[len]--;
len_counts[len + 1] += 2;
len_counts[depth]--;
len_counts[depth + 1] += 2;
}
}
@ -1189,11 +1188,9 @@ gen_codewords(u32 A[], u8 lens[], const unsigned len_counts[],
(next_codewords[len - 1] + len_counts[len - 1]) << 1;
for (sym = 0; sym < num_syms; sym++) {
u8 len = lens[sym];
u32 codeword = next_codewords[len]++;
/* DEFLATE requires bit-reversed codewords. */
A[sym] = reverse_codeword(codeword, len);
A[sym] = reverse_codeword(next_codewords[lens[sym]]++,
lens[sym]);
}
}

View file

@ -49,11 +49,8 @@
/*
* Maximum number of extra bits that may be required to represent a match
* length or offset.
*
* TODO: are we going to have full DEFLATE64 support? If so, up to 16
* length bits must be supported.
*/
#define DEFLATE_MAX_EXTRA_LENGTH_BITS 5
#define DEFLATE_MAX_EXTRA_OFFSET_BITS 14
#define DEFLATE_MAX_EXTRA_OFFSET_BITS 13
#endif /* LIB_DEFLATE_CONSTANTS_H */

View file

@ -27,14 +27,14 @@
* ---------------------------------------------------------------------------
*
* This is a highly optimized DEFLATE decompressor. It is much faster than
* zlib, typically more than twice as fast, though results vary by CPU.
* vanilla zlib, typically well over twice as fast, though results vary by CPU.
*
* Why this is faster than zlib's implementation:
* Why this is faster than vanilla zlib:
*
* - Word accesses rather than byte accesses when reading input
* - Word accesses rather than byte accesses when copying matches
* - Faster Huffman decoding combined with various DEFLATE-specific tricks
* - Larger bitbuffer variable that doesn't need to be filled as often
* - Larger bitbuffer variable that doesn't need to be refilled as often
* - Other optimizations to remove unnecessary branches
* - Only full-buffer decompression is supported, so the code doesn't need to
* support stopping and resuming decompression.
@ -71,22 +71,32 @@
/*
* The state of the "input bitstream" consists of the following variables:
*
* - in_next: pointer to the next unread byte in the input buffer
* - in_next: a pointer to the next unread byte in the input buffer
*
* - in_end: pointer just past the end of the input buffer
* - in_end: a pointer to just past the end of the input buffer
*
* - bitbuf: a word-sized variable containing bits that have been read from
* the input buffer. The buffered bits are right-aligned
* (they're the low-order bits).
* the input buffer or from the implicit appended zero bytes
*
* - bitsleft: number of bits in 'bitbuf' that are valid. NOTE: in the
* fastloop, bits 8 and above of bitsleft can contain garbage.
* - bitsleft: the number of bits in 'bitbuf' available to be consumed.
* After REFILL_BITS_BRANCHLESS(), 'bitbuf' can actually
* contain more bits than this. However, only the bits counted
* by 'bitsleft' can actually be consumed; the rest can only be
* used for preloading.
*
* - overread_count: number of implicit 0 bytes past 'in_end' that have
* been loaded into the bitbuffer
* As a micro-optimization, we allow bits 8 and higher of
* 'bitsleft' to contain garbage. When consuming the bits
* associated with a decode table entry, this allows us to do
* 'bitsleft -= entry' instead of 'bitsleft -= (u8)entry'.
* On some CPUs, this helps reduce instruction dependencies.
* This does have the disadvantage that 'bitsleft' sometimes
* needs to be cast to 'u8', such as when it's used as a shift
* amount in REFILL_BITS_BRANCHLESS(). But that one happens
* for free since most CPUs ignore high bits in shift amounts.
*
* For performance reasons, these variables are declared as standalone variables
* and are manipulated using macros, rather than being packed into a struct.
* - overread_count: the total number of implicit appended zero bytes that
* have been loaded into the bitbuffer, including any
* counted by 'bitsleft' and any already consumed
*/
/*
@ -97,60 +107,92 @@
* which they don't have to refill as often.
*/
typedef machine_word_t bitbuf_t;
#define BITBUF_NBITS (8 * (int)sizeof(bitbuf_t))
/* BITMASK(n) returns a bitmask of length 'n'. */
#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1)
/*
* BITBUF_NBITS is the number of bits the bitbuffer variable can hold. See
* REFILL_BITS_WORDWISE() for why this is 1 less than the obvious value.
* MAX_BITSLEFT is the maximum number of consumable bits, i.e. the maximum value
* of '(u8)bitsleft'. This is the size of the bitbuffer variable, minus 1 if
* the branchless refill method is being used (see REFILL_BITS_BRANCHLESS()).
*/
#define BITBUF_NBITS (8 * sizeof(bitbuf_t) - 1)
#define MAX_BITSLEFT \
(UNALIGNED_ACCESS_IS_FAST ? BITBUF_NBITS - 1 : BITBUF_NBITS)
/*
* REFILL_GUARANTEED_NBITS is the number of bits that are guaranteed in the
* bitbuffer variable after refilling it with ENSURE_BITS(n), REFILL_BITS(), or
* REFILL_BITS_IN_FASTLOOP(). There might be up to BITBUF_NBITS bits; however,
* since only whole bytes can be added, only 'BITBUF_NBITS - 7' bits are
* guaranteed. That is the smallest amount where another byte doesn't fit.
* CONSUMABLE_NBITS is the minimum number of bits that are guaranteed to be
* consumable (counted in 'bitsleft') immediately after refilling the bitbuffer.
* Since only whole bytes can be added to 'bitsleft', the worst case is
* 'MAX_BITSLEFT - 7': the smallest amount where another byte doesn't fit.
*/
#define REFILL_GUARANTEED_NBITS (BITBUF_NBITS - 7)
#define CONSUMABLE_NBITS (MAX_BITSLEFT - 7)
/*
* CAN_ENSURE(n) evaluates to true if the bitbuffer variable is guaranteed to
* contain at least 'n' bits after a refill. See REFILL_GUARANTEED_NBITS.
*
* This can be used to choose between alternate refill strategies based on the
* size of the bitbuffer variable. 'n' should be a compile-time constant.
* FASTLOOP_PRELOADABLE_NBITS is the minimum number of bits that are guaranteed
* to be preloadable immediately after REFILL_BITS_IN_FASTLOOP(). (It is *not*
* guaranteed after REFILL_BITS(), since REFILL_BITS() falls back to a
* byte-at-a-time refill method near the end of input.) This may exceed the
* number of consumable bits (counted by 'bitsleft'). Any bits not counted in
* 'bitsleft' can only be used for precomputation and cannot be consumed.
*/
#define CAN_ENSURE(n) ((n) <= REFILL_GUARANTEED_NBITS)
#define FASTLOOP_PRELOADABLE_NBITS \
(UNALIGNED_ACCESS_IS_FAST ? BITBUF_NBITS : CONSUMABLE_NBITS)
/*
* REFILL_BITS_WORDWISE() branchlessly refills the bitbuffer variable by reading
* the next word from the input buffer and updating 'in_next' and 'bitsleft'
* based on how many bits were refilled -- counting whole bytes only. This is
* much faster than reading a byte at a time, at least if the CPU is little
* endian and supports fast unaligned memory accesses.
* PRELOAD_SLACK is the minimum number of bits that are guaranteed to be
* preloadable but not consumable, following REFILL_BITS_IN_FASTLOOP() and any
* subsequent consumptions. This is 1 bit if the branchless refill method is
* being used, and 0 bits otherwise.
*/
#define PRELOAD_SLACK MAX(0, FASTLOOP_PRELOADABLE_NBITS - MAX_BITSLEFT)
/*
* CAN_CONSUME(n) is true if it's guaranteed that if the bitbuffer has just been
* refilled, then it's always possible to consume 'n' bits from it. 'n' should
* be a compile-time constant, to enable compile-time evaluation.
*/
#define CAN_CONSUME(n) (CONSUMABLE_NBITS >= (n))
/*
* CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) is true if it's
* guaranteed that after REFILL_BITS_IN_FASTLOOP(), it's always possible to
* consume 'consume_nbits' bits, then preload 'preload_nbits' bits. The
* arguments should be compile-time constants to enable compile-time evaluation.
*/
#define CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) \
(CONSUMABLE_NBITS >= (consume_nbits) && \
FASTLOOP_PRELOADABLE_NBITS >= (consume_nbits) + (preload_nbits))
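
To make these bit-budget macros concrete, here is an illustrative evaluation for two representative configurations (the numbers are not part of the upstream source; they simply follow from the definitions above):

64-bit build with the branchless refill (BITBUF_NBITS = 64):
    MAX_BITSLEFT               = 64 - 1          = 63
    CONSUMABLE_NBITS           = 63 - 7          = 56
    FASTLOOP_PRELOADABLE_NBITS = BITBUF_NBITS    = 64
    PRELOAD_SLACK              = MAX(0, 64 - 63) = 1
    CAN_CONSUME(n)                     holds for n <= 56
    CAN_CONSUME_AND_THEN_PRELOAD(c, p) holds for c <= 56 and c + p <= 64

32-bit build without fast unaligned access (BITBUF_NBITS = 32):
    MAX_BITSLEFT = 32, CONSUMABLE_NBITS = 25,
    FASTLOOP_PRELOADABLE_NBITS = 25, PRELOAD_SLACK = 0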
/*
* REFILL_BITS_BRANCHLESS() branchlessly refills the bitbuffer variable by
* reading the next word from the input buffer and updating 'in_next' and
* 'bitsleft' based on how many bits were refilled -- counting whole bytes only.
* This is much faster than reading a byte at a time, at least if the CPU is
* little endian and supports fast unaligned memory accesses.
*
* The simplest way of branchlessly updating 'bitsleft' would be:
*
* bitsleft += (BITBUF_NBITS - bitsleft) & ~7;
* bitsleft += (MAX_BITSLEFT - bitsleft) & ~7;
*
* To make it faster, we define BITBUF_NBITS to be 'WORDBITS - 1' rather than
* To make it faster, we define MAX_BITSLEFT to be 'WORDBITS - 1' rather than
* WORDBITS, so that in binary it looks like 111111 or 11111. Then, we update
* 'bitsleft' just by setting the bits above the low 3 bits:
* 'bitsleft' by just setting the bits above the low 3 bits:
*
* bitsleft |= BITBUF_NBITS & ~7;
* bitsleft |= MAX_BITSLEFT & ~7;
*
* That compiles down to a single instruction like 'or $0x38, %rbp'. Using
* 'BITBUF_NBITS == WORDBITS - 1' also has the advantage that refills can be
* done when 'bitsleft == BITBUF_NBITS' without invoking undefined behavior.
* 'MAX_BITSLEFT == WORDBITS - 1' also has the advantage that refills can be
* done when 'bitsleft == MAX_BITSLEFT' without invoking undefined behavior.
*
* The simplest way of branchlessly updating 'in_next' would be:
*
* in_next += (BITBUF_NBITS - bitsleft) >> 3;
* in_next += (MAX_BITSLEFT - bitsleft) >> 3;
*
* With 'BITBUF_NBITS == WORDBITS - 1' we could use an XOR instead, though this
* With 'MAX_BITSLEFT == WORDBITS - 1' we could use an XOR instead, though this
* isn't really better:
*
* in_next += (BITBUF_NBITS ^ bitsleft) >> 3;
* in_next += (MAX_BITSLEFT ^ bitsleft) >> 3;
*
* An alternative which can be marginally better is the following:
*
@ -162,22 +204,23 @@ typedef machine_word_t bitbuf_t;
* extraction instruction (e.g. arm's ubfx), it stays at 3, and is potentially
* more efficient because the length of the longest dependency chain decreases
* from 3 to 2. This alternative also has the advantage that it ignores the
* high bits in 'bitsleft', so it is compatible with the fastloop optimization
* (described later) where we let the high bits of 'bitsleft' contain garbage.
* high bits in 'bitsleft', so it is compatible with the micro-optimization we
* use where we let the high bits of 'bitsleft' contain garbage.
*/
#define REFILL_BITS_WORDWISE() \
do { \
bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft;\
in_next += sizeof(bitbuf_t) - 1; \
in_next -= (bitsleft >> 3) & 0x7; \
bitsleft |= BITBUF_NBITS & ~7; \
#define REFILL_BITS_BRANCHLESS() \
do { \
bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft; \
in_next += sizeof(bitbuf_t) - 1; \
in_next -= (bitsleft >> 3) & 0x7; \
bitsleft |= MAX_BITSLEFT & ~7; \
} while (0)
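
As a quick sanity check of the arithmetic above, the following standalone snippet (illustrative only, and assuming a 64-bit bitbuf_t; it merely mirrors the 'in_next'/'bitsleft' bookkeeping of REFILL_BITS_BRANCHLESS() without touching a real input buffer) shows how many bytes are consumed and what 'bitsleft' becomes for a few starting values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const unsigned max_bitsleft = 8 * sizeof(uint64_t) - 1;  /* 63 */
	const unsigned starts[] = { 0, 3, 9 };  /* (u8)bitsleft before the refill */
	unsigned i;

	for (i = 0; i < 3; i++) {
		unsigned bitsleft = starts[i];
		/* bytes by which 'in_next' advances: 7 - (bitsleft / 8) */
		unsigned advance = (unsigned)(sizeof(uint64_t) - 1) -
				   ((bitsleft >> 3) & 0x7);

		bitsleft |= max_bitsleft & ~7u;  /* the branchless update */
		printf("start=%u: in_next advances %u bytes, bitsleft becomes %u\n",
		       starts[i], advance, bitsleft);
	}
	return 0;
}

Running it prints 56, 59, and 57 for the three starting values, i.e. each refill credits exactly 8 bits of 'bitsleft' per byte by which 'in_next' advances, which is the invariant the refill logic relies on.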
/*
* REFILL_BITS() loads bits from the input buffer until the bitbuffer variable
* contains at least REFILL_GUARANTEED_NBITS bits.
* contains at least CONSUMABLE_NBITS consumable bits.
*
* This checks for the end of input, and it cannot be used in the fastloop.
* This checks for the end of input, and it doesn't guarantee
* FASTLOOP_PRELOADABLE_NBITS, so it can't be used in the fastloop.
*
* If we would overread the input buffer, we just don't read anything, leaving
* the bits zeroed but marking them filled. This simplifies the decompressor
@ -196,121 +239,66 @@ do { \
*/
#define REFILL_BITS() \
do { \
if (CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST && \
if (UNALIGNED_ACCESS_IS_FAST && \
likely(in_end - in_next >= sizeof(bitbuf_t))) { \
REFILL_BITS_WORDWISE(); \
REFILL_BITS_BRANCHLESS(); \
} else { \
while (bitsleft < REFILL_GUARANTEED_NBITS) { \
while ((u8)bitsleft < CONSUMABLE_NBITS) { \
if (likely(in_next != in_end)) { \
bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \
bitbuf |= (bitbuf_t)*in_next++ << \
(u8)bitsleft; \
} else { \
overread_count++; \
SAFETY_CHECK(overread_count <= sizeof(bitbuf)); \
} \
SAFETY_CHECK(overread_count <= \
sizeof(bitbuf_t)); \
} \
bitsleft += 8; \
} \
} \
} while (0)
/* ENSURE_BITS(n) calls REFILL_BITS() if fewer than 'n' bits are buffered. */
#define ENSURE_BITS(n) \
/*
* REFILL_BITS_IN_FASTLOOP() is like REFILL_BITS(), but it doesn't check for the
* end of the input. It can only be used in the fastloop.
*/
#define REFILL_BITS_IN_FASTLOOP() \
do { \
if (bitsleft < (n)) \
REFILL_BITS(); \
STATIC_ASSERT(UNALIGNED_ACCESS_IS_FAST || \
FASTLOOP_PRELOADABLE_NBITS == CONSUMABLE_NBITS); \
if (UNALIGNED_ACCESS_IS_FAST) { \
REFILL_BITS_BRANCHLESS(); \
} else { \
while ((u8)bitsleft < CONSUMABLE_NBITS) { \
bitbuf |= (bitbuf_t)*in_next++ << (u8)bitsleft; \
bitsleft += 8; \
} \
} \
} while (0)
#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1)
/* BITS(n) returns the next 'n' buffered bits without removing them. */
#define BITS(n) (bitbuf & BITMASK(n))
/* Macros to save the value of the bitbuffer variable and use it later. */
#define SAVE_BITBUF() (saved_bitbuf = bitbuf)
#define SAVED_BITS(n) (saved_bitbuf & BITMASK(n))
/* REMOVE_BITS(n) removes the next 'n' buffered bits. */
#define REMOVE_BITS(n) (bitbuf >>= (n), bitsleft -= (n))
/* POP_BITS(n) removes and returns the next 'n' buffered bits. */
#define POP_BITS(n) (tmpbits = BITS(n), REMOVE_BITS(n), tmpbits)
/*
* ALIGN_INPUT() verifies that the input buffer hasn't been overread, then
* aligns the bitstream to the next byte boundary, discarding any unused bits in
* the current byte.
*
* Note that if the bitbuffer variable currently contains more than 7 bits, then
* we must rewind 'in_next', effectively putting those bits back. Only the bits
* in what would be the "current" byte if we were reading one byte at a time can
* be actually discarded.
*/
#define ALIGN_INPUT() \
do { \
SAFETY_CHECK(overread_count <= (bitsleft >> 3)); \
in_next -= (bitsleft >> 3) - overread_count; \
overread_count = 0; \
bitbuf = 0; \
bitsleft = 0; \
} while(0)
/*
* Macros used in the "fastloop": the loop that decodes literals and matches
* while there is still plenty of space left in the input and output buffers.
*
* In the fastloop, we improve performance by skipping redundant bounds checks.
* On platforms where it helps, we also use an optimization where we allow bits
* 8 and higher of 'bitsleft' to contain garbage. This is sometimes a useful
* microoptimization because it means the whole 32-bit decode table entry can be
* subtracted from 'bitsleft' without an intermediate step to convert it to 8
* bits. (It still needs to be converted to 8 bits for the shift of 'bitbuf',
* but most CPUs ignore high bits in shift amounts, so that happens implicitly
* with zero overhead.) REMOVE_ENTRY_BITS_FAST() implements this optimization.
*
* MASK_BITSLEFT() is used to clear the garbage bits when leaving the fastloop.
*/
#if CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST
# define REFILL_BITS_IN_FASTLOOP() REFILL_BITS_WORDWISE()
# define REMOVE_ENTRY_BITS_FAST(entry) (bitbuf >>= (u8)entry, bitsleft -= entry)
# define GET_REAL_BITSLEFT() ((u8)bitsleft)
# define MASK_BITSLEFT() (bitsleft &= 0xFF)
#else
# define REFILL_BITS_IN_FASTLOOP() \
while (bitsleft < REFILL_GUARANTEED_NBITS) { \
bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \
bitsleft += 8; \
}
# define REMOVE_ENTRY_BITS_FAST(entry) REMOVE_BITS((u8)entry)
# define GET_REAL_BITSLEFT() bitsleft
# define MASK_BITSLEFT()
#endif
/*
* This is the worst-case maximum number of output bytes that are written to
* during each iteration of the fastloop. The worst case is 3 literals, then a
* match of length DEFLATE_MAX_MATCH_LEN. The match length must be rounded up
* to a word boundary due to the word-at-a-time match copy implementation.
* during each iteration of the fastloop. The worst case is 2 literals, then a
* match of length DEFLATE_MAX_MATCH_LEN. Additionally, some slack space must
* be included for the intentional overrun in the match copy implementation.
*/
#define FASTLOOP_MAX_BYTES_WRITTEN \
(3 + ALIGN(DEFLATE_MAX_MATCH_LEN, WORDBYTES))
(2 + DEFLATE_MAX_MATCH_LEN + (5 * WORDBYTES) - 1)
/*
* This is the worst-case maximum number of input bytes that are read during
* each iteration of the fastloop. To get this value, we first compute the
* greatest number of bits that can be refilled during a loop iteration. The
* refill at the beginning can add at most BITBUF_NBITS, and the amount that can
* refill at the beginning can add at most MAX_BITSLEFT, and the amount that can
* be refilled later is no more than the maximum amount that can be consumed by
* 3 literals that don't need a subtable, then a match. We convert this value
* to bytes, rounding up. Finally, we added sizeof(bitbuf_t) to account for
* REFILL_BITS_WORDWISE() reading up to a word past the part really used.
* 2 literals that don't need a subtable, then a match. We convert this value
* to bytes, rounding up; this gives the maximum number of bytes that 'in_next'
* can be advanced. Finally, we add sizeof(bitbuf_t) to account for
* REFILL_BITS_BRANCHLESS() reading a word past 'in_next'.
*/
#define FASTLOOP_MAX_BYTES_READ \
(DIV_ROUND_UP(BITBUF_NBITS + \
((3 * LITLEN_TABLEBITS) + \
DEFLATE_MAX_LITLEN_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_LENGTH_BITS + \
DEFLATE_MAX_OFFSET_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_OFFSET_BITS), 8) + \
sizeof(bitbuf_t))
(DIV_ROUND_UP(MAX_BITSLEFT + (2 * LITLEN_TABLEBITS) + \
LENGTH_MAXBITS + OFFSET_MAXBITS, 8) + \
sizeof(bitbuf_t))
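
For a concrete sense of scale, plugging in the values of a typical 64-bit build (an illustrative assumption: MAX_BITSLEFT = 63, WORDBYTES = 8, LITLEN_TABLEBITS = 11, LENGTH_MAXBITS = 20, OFFSET_MAXBITS = 28, DEFLATE_MAX_MATCH_LEN = 258) gives:

    FASTLOOP_MAX_BYTES_WRITTEN = 2 + 258 + (5 * 8) - 1 = 299
    FASTLOOP_MAX_BYTES_READ    = DIV_ROUND_UP(63 + 2*11 + 20 + 28, 8) + 8
                               = DIV_ROUND_UP(133, 8) + 8 = 17 + 8 = 25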
/*****************************************************************************
* Huffman decoding *
@ -358,24 +346,37 @@ do { \
* take longer, which decreases performance. We choose values that work well in
* practice, making subtables rarely needed without making the tables too large.
*
* Our choice of OFFSET_TABLEBITS == 8 is a bit low; without any special
* considerations, 9 would fit the trade-off curve better. However, there is a
* performance benefit to using exactly 8 bits when it is a compile-time
* constant, as many CPUs can take the low byte more easily than the low 9 bits.
*
* zlib treats its equivalents of TABLEBITS as maximum values; whenever it
* builds a table, it caps the actual table_bits to the longest codeword. This
* makes sense in theory, as there's no need for the table to be any larger than
* needed to support the longest codeword. However, having the table bits be a
* compile-time constant is beneficial to the performance of the decode loop, so
* there is a trade-off. libdeflate currently uses the dynamic table_bits
* strategy for the litlen table only, due to its larger maximum size.
* PRECODE_TABLEBITS and OFFSET_TABLEBITS are smaller, so going dynamic there
* isn't as useful, and OFFSET_TABLEBITS=8 is useful as mentioned above.
*
* Each TABLEBITS value has a corresponding ENOUGH value that gives the
* worst-case maximum number of decode table entries, including the main table
* and all subtables. The ENOUGH value depends on three parameters:
*
* (1) the maximum number of symbols in the code (DEFLATE_NUM_*_SYMS)
* (2) the number of main table bits (the corresponding TABLEBITS value)
* (2) the maximum number of main table bits (*_TABLEBITS)
* (3) the maximum allowed codeword length (DEFLATE_MAX_*_CODEWORD_LEN)
*
* The ENOUGH values were computed using the utility program 'enough' from zlib.
*/
#define PRECODE_TABLEBITS 7
#define PRECODE_ENOUGH 128 /* enough 19 7 7 */
#define LITLEN_TABLEBITS 11
#define LITLEN_ENOUGH 2342 /* enough 288 11 15 */
#define OFFSET_TABLEBITS 9
#define OFFSET_ENOUGH 594 /* enough 32 9 15 */
#define OFFSET_TABLEBITS 8
#define OFFSET_ENOUGH 402 /* enough 32 8 15 */
/*
* make_decode_table_entry() creates a decode table entry for the given symbol
@ -387,7 +388,7 @@ do { \
* appropriately-formatted decode table entry. See the definitions of the
* *_decode_results[] arrays below, where the entry format is described.
*/
static inline u32
static forceinline u32
make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len)
{
return decode_results[sym] + (len << 8) + len;
@ -398,8 +399,8 @@ make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len)
* described contain zeroes:
*
* Bit 20-16: presym
* Bit 10-8: codeword_len [not used]
* Bit 2-0: codeword_len
* Bit 10-8: codeword length [not used]
* Bit 2-0: codeword length
*
* The precode decode table never has subtables, since we use
* PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN.
@ -431,6 +432,12 @@ static const u32 precode_decode_results[] = {
/* Indicates an end-of-block entry in the litlen decode table */
#define HUFFDEC_END_OF_BLOCK 0x00002000
/* Maximum number of bits that can be consumed by decoding a match length */
#define LENGTH_MAXBITS (DEFLATE_MAX_LITLEN_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_LENGTH_BITS)
#define LENGTH_MAXFASTBITS (LITLEN_TABLEBITS /* no subtable needed */ + \
DEFLATE_MAX_EXTRA_LENGTH_BITS)
/*
* Here is the format of our litlen decode table entries. Bits not explicitly
* described contain zeroes:
@ -464,7 +471,8 @@ static const u32 precode_decode_results[] = {
* Bit 15: 1 (HUFFDEC_EXCEPTIONAL)
* Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER)
* Bit 13: 0 (!HUFFDEC_END_OF_BLOCK)
* Bit 3-0: number of subtable bits
* Bit 11-8: number of subtable bits
* Bit 3-0: number of main table bits
*
* This format has several desirable properties:
*
@ -481,20 +489,26 @@ static const u32 precode_decode_results[] = {
*
* - The low byte is the number of bits that need to be removed from the
* bitstream; this makes this value easily accessible, and it enables the
* optimization used in REMOVE_ENTRY_BITS_FAST(). It also includes the
* number of extra bits, so they don't need to be removed separately.
* micro-optimization of doing 'bitsleft -= entry' instead of
* 'bitsleft -= (u8)entry'. It also includes the number of extra bits,
* so they don't need to be removed separately.
*
* - The flags in bits 13-15 are arranged to be 0 when the number of
* non-extra bits (the value in bits 11-8) is needed, making this value
* - The flags in bits 15-13 are arranged to be 0 when the
* "remaining codeword length" in bits 11-8 is needed, making this value
* fairly easily accessible as well via a shift and downcast.
*
* - Similarly, bits 13-12 are 0 when the "subtable bits" in bits 11-8 are
* needed, making it possible to extract this value with '& 0x3F' rather
* than '& 0xF'. This value is only used as a shift amount, so this can
* save an 'and' instruction as the masking by 0x3F happens implicitly.
*
* litlen_decode_results[] contains the static part of the entry for each
* symbol. make_decode_table_entry() produces the final entries.
*/
static const u32 litlen_decode_results[] = {
/* Literals */
#define ENTRY(literal) (((u32)literal << 16) | HUFFDEC_LITERAL)
#define ENTRY(literal) (HUFFDEC_LITERAL | ((u32)literal << 16))
ENTRY(0) , ENTRY(1) , ENTRY(2) , ENTRY(3) ,
ENTRY(4) , ENTRY(5) , ENTRY(6) , ENTRY(7) ,
ENTRY(8) , ENTRY(9) , ENTRY(10) , ENTRY(11) ,
@ -578,6 +592,12 @@ static const u32 litlen_decode_results[] = {
#undef ENTRY
};
/* Maximum number of bits that can be consumed by decoding a match offset */
#define OFFSET_MAXBITS (DEFLATE_MAX_OFFSET_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_OFFSET_BITS)
#define OFFSET_MAXFASTBITS (OFFSET_TABLEBITS /* no subtable needed */ + \
DEFLATE_MAX_EXTRA_OFFSET_BITS)
/*
* Here is the format of our offset decode table entries. Bits not explicitly
* described contain zeroes:
@ -592,7 +612,8 @@ static const u32 litlen_decode_results[] = {
* Bit 31-16: index of start of subtable
* Bit 15: 1 (HUFFDEC_EXCEPTIONAL)
* Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER)
* Bit 3-0: number of subtable bits
* Bit 11-8: number of subtable bits
* Bit 3-0: number of main table bits
*
* These work the same way as the length entries and subtable pointer entries in
* the litlen decode table; see litlen_decode_results[] above.
@ -607,15 +628,20 @@ static const u32 offset_decode_results[] = {
ENTRY(257 , 7) , ENTRY(385 , 7) , ENTRY(513 , 8) , ENTRY(769 , 8) ,
ENTRY(1025 , 9) , ENTRY(1537 , 9) , ENTRY(2049 , 10) , ENTRY(3073 , 10) ,
ENTRY(4097 , 11) , ENTRY(6145 , 11) , ENTRY(8193 , 12) , ENTRY(12289 , 12) ,
ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(32769 , 14) , ENTRY(49153 , 14) ,
ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) ,
#undef ENTRY
};
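/*
* [Editorial illustration.]  Each ENTRY() above pairs an offset base with the
* number of extra bits that follow the codeword.  For example, ENTRY(24577, 13)
* covers offsets 24577 through 24577 + (2^13 - 1) = 32768, the maximum match
* offset DEFLATE allows.  The decoder adds the base from the entry's high bits
* to the extra bits it reads next; as noted for the litlen entries, the
* extra-bit count is folded into the entry's low byte, so a single
* 'bitsleft -= entry' removes the codeword and its extra bits together.
*/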
/*
* The main DEFLATE decompressor structure. Since this implementation only
* supports full buffer decompression, this structure does not store the entire
* decompression state, but rather only some arrays that are too large to
* comfortably allocate on the stack.
* The main DEFLATE decompressor structure. Since libdeflate only supports
* full-buffer decompression, this structure doesn't store the entire
* decompression state, most of which is in stack variables. Instead, this
* struct just contains the decode tables and some temporary arrays used for
* building them, as these are too large to comfortably allocate on the stack.
*
* Storing the decode tables in the decompressor struct also allows the decode
* tables for the static codes to be reused whenever two static Huffman blocks
* are decoded without an intervening dynamic block, even across streams.
*/
struct libdeflate_decompressor {
@ -648,6 +674,7 @@ struct libdeflate_decompressor {
u16 sorted_syms[DEFLATE_MAX_NUM_SYMS];
bool static_codes_loaded;
unsigned litlen_tablebits;
};
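/*
* [Editorial illustration, not part of the upstream source.]  The static-code
* reuse described above boils down to a pattern along these lines (schematic;
* the real field and helper names may differ):
*
*	if (block_type == DEFLATE_BLOCKTYPE_STATIC_HUFFMAN) {
*		if (!d->static_codes_loaded) {
*			// fill in the fixed codeword lengths defined by the
*			// DEFLATE spec, then:
*			build_litlen_decode_table(d, ...);
*			build_offset_decode_table(d, ...);
*			d->static_codes_loaded = true;
*		}
*		// else: the tables left over from the previous static block,
*		// possibly from an earlier stream, are still valid
*	}
*/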
/*
@ -678,11 +705,16 @@ struct libdeflate_decompressor {
* make the final decode table entries using make_decode_table_entry().
* @table_bits
* The log base-2 of the number of main table entries to use.
* If @table_bits_ret != NULL, then @table_bits is treated as a maximum
* value and it will be decreased if a smaller table would be sufficient.
* @max_codeword_len
* The maximum allowed codeword length for this Huffman code.
* Must be <= DEFLATE_MAX_CODEWORD_LEN.
* @sorted_syms
* A temporary array of length @num_syms.
* @table_bits_ret
* If non-NULL, then the dynamic table_bits is enabled, and the actual
* table_bits value will be returned here.
*
* Returns %true if successful; %false if the codeword lengths do not form a
* valid Huffman code.
@ -692,9 +724,10 @@ build_decode_table(u32 decode_table[],
const u8 lens[],
const unsigned num_syms,
const u32 decode_results[],
const unsigned table_bits,
const unsigned max_codeword_len,
u16 *sorted_syms)
unsigned table_bits,
unsigned max_codeword_len,
u16 *sorted_syms,
unsigned *table_bits_ret)
{
unsigned len_counts[DEFLATE_MAX_CODEWORD_LEN + 1];
unsigned offsets[DEFLATE_MAX_CODEWORD_LEN + 1];
@ -714,6 +747,17 @@ build_decode_table(u32 decode_table[],
for (sym = 0; sym < num_syms; sym++)
len_counts[lens[sym]]++;
/*
* Determine the actual maximum codeword length that was used, and
* decrease table_bits to it if allowed.
*/
while (max_codeword_len > 1 && len_counts[max_codeword_len] == 0)
max_codeword_len--;
if (table_bits_ret != NULL) {
table_bits = MIN(table_bits, max_codeword_len);
*table_bits_ret = table_bits;
}
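/*
* [Editorial illustration.]  For example, if table_bits came in as 11 (a
* typical LITLEN_TABLEBITS value) but the longest codeword actually used is
* only 6 bits, the loop above lowers max_codeword_len to 6, the main table
* shrinks from 2^11 = 2048 entries to 2^6 = 64, and the caller learns via
* *table_bits_ret that lookups must be masked with BITMASK(6) rather than
* BITMASK(11).
*/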
/*
* Sort the symbols primarily by increasing codeword length and
* secondarily by increasing symbol value; or equivalently by their
@ -919,16 +963,13 @@ build_decode_table(u32 decode_table[],
/*
* Create the entry that points from the main table to
* the subtable. This entry contains the index of the
* start of the subtable and the number of bits with
* which the subtable is indexed (the log base 2 of the
* number of entries it contains).
* the subtable.
*/
decode_table[subtable_prefix] =
((u32)subtable_start << 16) |
HUFFDEC_EXCEPTIONAL |
HUFFDEC_SUBTABLE_POINTER |
subtable_bits;
(subtable_bits << 8) | table_bits;
}
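/*
* [Editorial illustration, not part of the upstream source.]  When the decoder
* later hits such an entry, it can unpack both fields and follow the pointer
* roughly like this (variable names are schematic):
*
*	bitbuf >>= (u8)entry;		// drop the table_bits main-table bits
*	bitsleft -= (u8)entry;
*	entry = decode_table[(entry >> 16) +
*			     (bitbuf & BITMASK((entry >> 8) & 0x3F))];
*/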
/* Fill the subtable entries for the current codeword. */
@ -969,7 +1010,8 @@ build_precode_decode_table(struct libdeflate_decompressor *d)
precode_decode_results,
PRECODE_TABLEBITS,
DEFLATE_MAX_PRE_CODEWORD_LEN,
d->sorted_syms);
d->sorted_syms,
NULL);
}
/* Build the decode table for the literal/length code. */
@ -989,7 +1031,8 @@ build_litlen_decode_table(struct libdeflate_decompressor *d,
litlen_decode_results,
LITLEN_TABLEBITS,
DEFLATE_MAX_LITLEN_CODEWORD_LEN,
d->sorted_syms);
d->sorted_syms,
&d->litlen_tablebits);
}
/* Build the decode table for the offset code. */
@ -998,7 +1041,7 @@ build_offset_decode_table(struct libdeflate_decompressor *d,
unsigned num_litlen_syms, unsigned num_offset_syms)
{
/* When you change TABLEBITS, you must change ENOUGH, and vice versa! */
STATIC_ASSERT(OFFSET_TABLEBITS == 9 && OFFSET_ENOUGH == 594);
STATIC_ASSERT(OFFSET_TABLEBITS == 8 && OFFSET_ENOUGH == 402);
STATIC_ASSERT(ARRAY_LEN(offset_decode_results) ==
DEFLATE_NUM_OFFSET_SYMS);
@ -1009,27 +1052,8 @@ build_offset_decode_table(struct libdeflate_decompressor *d,
offset_decode_results,
OFFSET_TABLEBITS,
DEFLATE_MAX_OFFSET_CODEWORD_LEN,
d->sorted_syms);
}
static forceinline machine_word_t
repeat_byte(u8 b)
{
machine_word_t v;
STATIC_ASSERT(WORDBITS == 32 || WORDBITS == 64);
v = b;
v |= v << 8;
v |= v << 16;
v |= v << ((WORDBITS == 64) ? 32 : 0);
return v;
}
static forceinline void
copy_word_unaligned(const void *src, void *dst)
{
store_word_unaligned(load_word_unaligned(src), dst);
d->sorted_syms,
NULL);
}
/*****************************************************************************
@ -1037,12 +1061,15 @@ copy_word_unaligned(const void *src, void *dst)
*****************************************************************************/
typedef enum libdeflate_result (*decompress_func_t)
(struct libdeflate_decompressor *d,
const void *in, size_t in_nbytes, void *out, size_t out_nbytes_avail,
(struct libdeflate_decompressor * restrict d,
const void * restrict in, size_t in_nbytes,
void * restrict out, size_t out_nbytes_avail,
size_t *actual_in_nbytes_ret, size_t *actual_out_nbytes_ret);
#define FUNCNAME deflate_decompress_default
#define ATTRIBUTES
#undef ATTRIBUTES
#undef EXTRACT_VARBITS
#undef EXTRACT_VARBITS8
#include "decompress_template.h"
/* Include architecture-specific implementation(s) if available. */

@ -156,6 +156,8 @@ typedef char __v64qi __attribute__((__vector_size__(64)));
#define HAVE_BMI2_TARGET \
(HAVE_DYNAMIC_X86_CPU_FEATURES && \
(GCC_PREREQ(4, 7) || __has_builtin(__builtin_ia32_pdep_di)))
#define HAVE_BMI2_INTRIN \
(HAVE_BMI2_NATIVE || (HAVE_BMI2_TARGET && HAVE_TARGET_INTRINSICS))
#endif /* __i386__ || __x86_64__ */

@ -4,18 +4,46 @@
#include "cpu_features.h"
/* BMI2 optimized version */
#if HAVE_BMI2_TARGET && !HAVE_BMI2_NATIVE
# define FUNCNAME deflate_decompress_bmi2
# define ATTRIBUTES __attribute__((target("bmi2")))
#if HAVE_BMI2_INTRIN
# define deflate_decompress_bmi2 deflate_decompress_bmi2
# define FUNCNAME deflate_decompress_bmi2
# if !HAVE_BMI2_NATIVE
# define ATTRIBUTES __attribute__((target("bmi2")))
# endif
/*
* Even with __attribute__((target("bmi2"))), gcc doesn't reliably use the
* bzhi instruction for 'word & BITMASK(count)'. So use the bzhi intrinsic
* explicitly. EXTRACT_VARBITS() is equivalent to 'word & BITMASK(count)';
* EXTRACT_VARBITS8() is equivalent to 'word & BITMASK((u8)count)'.
* Nevertheless, their implementation using the bzhi intrinsic is identical,
* as the bzhi instruction truncates the count to 8 bits implicitly.
*/
# ifndef __clang__
# include <immintrin.h>
# ifdef __x86_64__
# define EXTRACT_VARBITS(word, count) _bzhi_u64((word), (count))
# define EXTRACT_VARBITS8(word, count) _bzhi_u64((word), (count))
# else
# define EXTRACT_VARBITS(word, count) _bzhi_u32((word), (count))
# define EXTRACT_VARBITS8(word, count) _bzhi_u32((word), (count))
# endif
# endif
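/*
* [Editorial illustration.]  Where these macros are left undefined, the
* template presumably falls back to the plain form described above, i.e.
* something like '(word) & BITMASK(count)'.  With BMI2, the same result comes
* from one instruction: _bzhi_u64(word, count) yields
* 'word & (((u64)1 << count) - 1)' for count < 64, with the count implicitly
* truncated to its low 8 bits.
*/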
# include "../decompress_template.h"
#endif /* HAVE_BMI2_INTRIN */
#if defined(deflate_decompress_bmi2) && HAVE_BMI2_NATIVE
#define DEFAULT_IMPL deflate_decompress_bmi2
#else
static inline decompress_func_t
arch_select_decompress_func(void)
{
#ifdef deflate_decompress_bmi2
if (HAVE_BMI2(get_x86_cpu_features()))
return deflate_decompress_bmi2;
#endif
return NULL;
}
# define arch_select_decompress_func arch_select_decompress_func
#define arch_select_decompress_func arch_select_decompress_func
#endif
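/*
* [Editorial illustration, not part of the upstream source.]  A caller of
* arch_select_decompress_func() would typically resolve the implementation
* once and cache it, along the lines of this hypothetical sketch:
*
*	static decompress_func_t cached_impl;	// hypothetical cache
*	...
*	decompress_func_t impl = cached_impl;
*	if (impl == NULL) {
*		impl = arch_select_decompress_func();
*		if (impl == NULL)
*			impl = deflate_decompress_default;
*		cached_impl = impl;
*	}
*	return impl(d, in, in_nbytes, out, out_nbytes_avail,
*		    actual_in_nbytes_ret, actual_out_nbytes_ret);
*/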
#endif /* LIB_X86_DECOMPRESS_IMPL_H */

@ -28,7 +28,9 @@
#ifndef LIB_X86_MATCHFINDER_IMPL_H
#define LIB_X86_MATCHFINDER_IMPL_H
#ifdef __AVX2__
#include "cpu_features.h"
#if HAVE_AVX2_NATIVE
# include <immintrin.h>
static forceinline void
matchfinder_init_avx2(mf_pos_t *data, size_t size)
@ -73,7 +75,7 @@ matchfinder_rebase_avx2(mf_pos_t *data, size_t size)
}
#define matchfinder_rebase matchfinder_rebase_avx2
#elif defined(__SSE2__)
#elif HAVE_SSE2_NATIVE
# include <emmintrin.h>
static forceinline void
matchfinder_init_sse2(mf_pos_t *data, size_t size)
@ -117,6 +119,6 @@ matchfinder_rebase_sse2(mf_pos_t *data, size_t size)
} while (size != 0);
}
#define matchfinder_rebase matchfinder_rebase_sse2
#endif /* __SSE2__ */
#endif /* HAVE_SSE2_NATIVE */
#endif /* LIB_X86_MATCHFINDER_IMPL_H */

@ -10,33 +10,36 @@ extern "C" {
#endif
#define LIBDEFLATE_VERSION_MAJOR 1
#define LIBDEFLATE_VERSION_MINOR 13
#define LIBDEFLATE_VERSION_STRING "1.13"
#define LIBDEFLATE_VERSION_MINOR 14
#define LIBDEFLATE_VERSION_STRING "1.14"
#include <stddef.h>
#include <stdint.h>
/*
* On Windows, if you want to link to the DLL version of libdeflate, then
* #define LIBDEFLATE_DLL. Note that the calling convention is "cdecl".
* On Windows, you must define LIBDEFLATE_STATIC if you are linking to the
* static library version of libdeflate instead of the DLL. On other platforms,
* LIBDEFLATE_STATIC has no effect.
*/
#ifdef LIBDEFLATE_DLL
# ifdef BUILDING_LIBDEFLATE
# define LIBDEFLATEEXPORT LIBEXPORT
# elif defined(_WIN32) || defined(__CYGWIN__)
#ifdef _WIN32
# if defined(LIBDEFLATE_STATIC)
# define LIBDEFLATEEXPORT
# elif defined(BUILDING_LIBDEFLATE)
# define LIBDEFLATEEXPORT __declspec(dllexport)
# else
# define LIBDEFLATEEXPORT __declspec(dllimport)
# endif
#endif
#ifndef LIBDEFLATEEXPORT
# define LIBDEFLATEEXPORT
#else
# define LIBDEFLATEEXPORT __attribute__((visibility("default")))
#endif
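/*
* [Editorial illustration, not part of the upstream header.]  For example, a
* Windows application linking against the static library would define
* LIBDEFLATE_STATIC before including this header (or pass -DLIBDEFLATE_STATIC
* on the compiler command line):
*
*	#define LIBDEFLATE_STATIC
*	#include <libdeflate.h>
*
* so that LIBDEFLATEEXPORT expands to nothing rather than
* __declspec(dllimport).
*/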
#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && \
defined(_WIN32) && !defined(_WIN64)
#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && defined(__i386__)
/*
* On 32-bit Windows, gcc assumes 16-byte stack alignment but MSVC only 4.
* Realign the stack when entering libdeflate to avoid crashing in SSE/AVX
* code when called from an MSVC-compiled application.
* On i386, gcc assumes that the stack is 16-byte aligned at function entry.
* However, some compilers (e.g. MSVC) and programming languages (e.g.
* Delphi) only guarantee 4-byte alignment when calling functions. Work
* around this ABI incompatibility by realigning the stack pointer when
* entering libdeflate. This prevents crashes in SSE/AVX code.
*/
# define LIBDEFLATEAPI __attribute__((force_align_arg_pointer))
#else
@ -72,10 +75,35 @@ libdeflate_alloc_compressor(int compression_level);
/*
* libdeflate_deflate_compress() performs raw DEFLATE compression on a buffer of
* data. The function attempts to compress 'in_nbytes' bytes of data located at
* 'in' and write the results to 'out', which has space for 'out_nbytes_avail'
* bytes. The return value is the compressed size in bytes, or 0 if the data
* could not be compressed to 'out_nbytes_avail' bytes or fewer.
* data. It attempts to compress 'in_nbytes' bytes of data located at 'in' and
* write the results to 'out', which has space for 'out_nbytes_avail' bytes.
* The return value is the compressed size in bytes, or 0 if the data could not
* be compressed to 'out_nbytes_avail' bytes or fewer (but see note below).
*
* If compression is successful, then the output data is guaranteed to be a
* valid DEFLATE stream that decompresses to the input data. No other
* guarantees are made about the output data. Notably, different versions of
* libdeflate can produce different compressed data for the same uncompressed
* data, even at the same compression level. Do ***NOT*** do things like
* writing tests that compare compressed data to a golden output, as this can
* break when libdeflate is updated. (This property isn't specific to
* libdeflate; the same is true for zlib and other compression libraries too.)
*
* Note: due to a performance optimization, libdeflate_deflate_compress()
* currently needs a small amount of slack space at the end of the output
* buffer. As a result, it can't actually report compressed sizes very close to
* 'out_nbytes_avail'. This doesn't matter in real-world use cases, and
* libdeflate_deflate_compress_bound() already includes the slack space.
* However, it does mean that testing code that redundantly compresses data
* using an exact-sized output buffer won't work as might be expected:
*
* out_nbytes = libdeflate_deflate_compress(c, in, in_nbytes, out,
* libdeflate_deflate_compress_bound(in_nbytes));
* // The following assertion will fail.
* assert(libdeflate_deflate_compress(c, in, in_nbytes, out, out_nbytes) != 0);
*
* To avoid this, either don't write tests like the above, or make sure to
* include at least 9 bytes of slack space in 'out_nbytes_avail'.
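*
* [Editorial illustration.]  The usual correct pattern is to size the output
* buffer with libdeflate_deflate_compress_bound() and compress only once,
* schematically (following the shorthand of the snippet above):
*
*	out_nbytes_avail = libdeflate_deflate_compress_bound(in_nbytes);
*	out = malloc(out_nbytes_avail);
*	out_nbytes = libdeflate_deflate_compress(c, in, in_nbytes,
*						 out, out_nbytes_avail);
*	// out_nbytes == 0 would mean the output didn't fit, which should not
*	// happen when the buffer was sized with the bound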
*/
LIBDEFLATEEXPORT size_t LIBDEFLATEAPI
libdeflate_deflate_compress(struct libdeflate_compressor *compressor,