From a7a9b5317baef59b53c7df7f675719ca3ee3da66 Mon Sep 17 00:00:00 2001 From: Ivailo Monev Date: Fri, 23 Sep 2022 13:26:45 +0300 Subject: [PATCH] update bundled libdeflate to v1.14 Signed-off-by: Ivailo Monev --- src/3rdparty/libdeflate/NOTE | 2 +- src/3rdparty/libdeflate/README.md | 51 +- src/3rdparty/libdeflate/common_defs.h | 40 +- .../libdeflate/lib/arm/matchfinder_impl.h | 6 +- .../libdeflate/lib/decompress_template.h | 580 ++++++++++++------ .../libdeflate/lib/deflate_compress.c | 21 +- .../libdeflate/lib/deflate_constants.h | 5 +- .../libdeflate/lib/deflate_decompress.c | 421 +++++++------ .../libdeflate/lib/x86/cpu_features.h | 2 + .../libdeflate/lib/x86/decompress_impl.h | 36 +- .../libdeflate/lib/x86/matchfinder_impl.h | 8 +- src/3rdparty/libdeflate/libdeflate.h | 68 +- 12 files changed, 793 insertions(+), 447 deletions(-) diff --git a/src/3rdparty/libdeflate/NOTE b/src/3rdparty/libdeflate/NOTE index 71618247f..f16b56766 100644 --- a/src/3rdparty/libdeflate/NOTE +++ b/src/3rdparty/libdeflate/NOTE @@ -1,2 +1,2 @@ -This is Git checkout 72b2ce0d28970b1affc31efeb86daeffee1d7410 +This is Git checkout 18d6cc22b75643ec52111efeb27a22b9d860a982 from https://github.com/ebiggers/libdeflate that has not been modified. diff --git a/src/3rdparty/libdeflate/README.md b/src/3rdparty/libdeflate/README.md index 6c2f2f2a5..b2e27c543 100644 --- a/src/3rdparty/libdeflate/README.md +++ b/src/3rdparty/libdeflate/README.md @@ -27,11 +27,13 @@ For the release notes, see the [NEWS file](NEWS.md). ## Table of Contents - [Building](#building) - - [For UNIX](#for-unix) - - [For macOS](#for-macos) - - [For Windows](#for-windows) - - [Using Cygwin](#using-cygwin) - - [Using MSYS2](#using-msys2) + - [Using the Makefile](#using-the-makefile) + - [For UNIX](#for-unix) + - [For macOS](#for-macos) + - [For Windows](#for-windows) + - [Using Cygwin](#using-cygwin) + - [Using MSYS2](#using-msys2) + - [Using a custom build system](#using-a-custom-build-system) - [API](#api) - [Bindings for other programming languages](#bindings-for-other-programming-languages) - [DEFLATE vs. zlib vs. gzip](#deflate-vs-zlib-vs-gzip) @@ -42,7 +44,14 @@ For the release notes, see the [NEWS file](NEWS.md). # Building -## For UNIX +libdeflate and the provided programs like `gzip` can be built using the provided +Makefile. If only the library is needed, it can alternatively be easily +integrated into applications and built using any build system; see [Using a +custom build system](#using-a-custom-build-system). + +## Using the Makefile + +### For UNIX Just run `make`, then (if desired) `make install`. You need GNU Make and either GCC or Clang. GCC is recommended because it builds slightly faster binaries. @@ -57,7 +66,7 @@ There are also many options which can be set on the `make` command line, e.g. to omit library features or to customize the directories into which `make install` installs files. See the Makefile for details. -## For macOS +### For macOS Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh): @@ -65,7 +74,7 @@ Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh): But if you need to build the binaries yourself, see the section for UNIX above. -## For Windows +### For Windows Prebuilt Windows binaries can be downloaded from https://github.com/ebiggers/libdeflate/releases. But if you need to build the @@ -84,7 +93,7 @@ binaries built with MinGW will be significantly faster. Also note that 64-bit binaries are faster than 32-bit binaries and should be preferred whenever possible. 
-### Using Cygwin +#### Using Cygwin Run the Cygwin installer, available from https://cygwin.com/setup-x86_64.exe. When you get to the package selection screen, choose the following additional @@ -119,7 +128,7 @@ or to build 32-bit binaries: make CC=i686-w64-mingw32-gcc -### Using MSYS2 +#### Using MSYS2 Run the MSYS2 installer, available from http://www.msys2.org/. After installing, open an MSYS2 shell and run: @@ -161,6 +170,23 @@ and run the following commands: Or to build 32-bit binaries, do the same but use "MSYS2 MinGW 32-bit" instead. +## Using a custom build system + +The source files of the library are designed to be compilable directly, without +any prerequisite step like running a `./configure` script. Therefore, as an +alternative to building the library using the provided Makefile, the library +source files can be easily integrated directly into your application and built +using any build system. + +You should compile both `lib/*.c` and `lib/*/*.c`. You don't need to worry +about excluding irrelevant architecture-specific code, as this is already +handled in the source files themselves using `#ifdef`s. + +It is **strongly** recommended to use either gcc or clang, and to use `-O2`. + +If you are doing a freestanding build with `-ffreestanding`, you must add +`-DFREESTANDING` as well, otherwise performance will suffer greatly. + # API libdeflate has a simple API that is not zlib-compatible. You can create @@ -183,10 +209,7 @@ guessing. However, libdeflate's decompression routines do optionally provide the actual number of output bytes in case you need it. Windows developers: note that the calling convention of libdeflate.dll is -"stdcall" -- the same as the Win32 API. If you call into libdeflate.dll using a -non-C/C++ language, or dynamically using LoadLibrary(), make sure to use the -stdcall convention. Using the wrong convention may crash your application. -(Note: older versions of libdeflate used the "cdecl" convention instead.) +"cdecl". (libdeflate v1.4 through v1.12 used "stdcall" instead.) # Bindings for other programming languages diff --git a/src/3rdparty/libdeflate/common_defs.h b/src/3rdparty/libdeflate/common_defs.h index a0efb9b4c..cfe6fd62b 100644 --- a/src/3rdparty/libdeflate/common_defs.h +++ b/src/3rdparty/libdeflate/common_defs.h @@ -144,8 +144,17 @@ typedef size_t machine_word_t; /* restrict - hint that writes only occur through the given pointer */ #ifdef __GNUC__ # define restrict __restrict__ +#elif defined(_MSC_VER) + /* + * Don't use MSVC's __restrict; it has nonstandard behavior. + * Standard restrict is okay, if it is supported. + */ +# if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) +# define restrict restrict +# else +# define restrict +# endif #else -/* Don't use MSVC's __restrict; it has nonstandard behavior. 
*/ # define restrict #endif @@ -200,6 +209,7 @@ typedef size_t machine_word_t; #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d)) #define STATIC_ASSERT(expr) ((void)sizeof(char[1 - 2 * !(expr)])) #define ALIGN(n, a) (((n) + (a) - 1) & ~((a) - 1)) +#define ROUND_UP(n, d) ((d) * DIV_ROUND_UP((n), (d))) /* ========================================================================== */ /* Endianness handling */ @@ -513,8 +523,10 @@ bsr32(u32 v) #ifdef __GNUC__ return 31 - __builtin_clz(v); #elif defined(_MSC_VER) - _BitScanReverse(&v, v); - return v; + unsigned long i; + + _BitScanReverse(&i, v); + return i; #else unsigned i = 0; @@ -529,9 +541,11 @@ bsr64(u64 v) { #ifdef __GNUC__ return 63 - __builtin_clzll(v); -#elif defined(_MSC_VER) && defined(_M_X64) - _BitScanReverse64(&v, v); - return v; +#elif defined(_MSC_VER) && defined(_WIN64) + unsigned long i; + + _BitScanReverse64(&i, v); + return i; #else unsigned i = 0; @@ -563,8 +577,10 @@ bsf32(u32 v) #ifdef __GNUC__ return __builtin_ctz(v); #elif defined(_MSC_VER) - _BitScanForward(&v, v); - return v; + unsigned long i; + + _BitScanForward(&i, v); + return i; #else unsigned i = 0; @@ -579,9 +595,11 @@ bsf64(u64 v) { #ifdef __GNUC__ return __builtin_ctzll(v); -#elif defined(_MSC_VER) && defined(_M_X64) - _BitScanForward64(&v, v); - return v; +#elif defined(_MSC_VER) && defined(_WIN64) + unsigned long i; + + _BitScanForward64(&i, v); + return i; #else unsigned i = 0; diff --git a/src/3rdparty/libdeflate/lib/arm/matchfinder_impl.h b/src/3rdparty/libdeflate/lib/arm/matchfinder_impl.h index da0d2fd79..4b10ba2f8 100644 --- a/src/3rdparty/libdeflate/lib/arm/matchfinder_impl.h +++ b/src/3rdparty/libdeflate/lib/arm/matchfinder_impl.h @@ -28,7 +28,9 @@ #ifndef LIB_ARM_MATCHFINDER_IMPL_H #define LIB_ARM_MATCHFINDER_IMPL_H -#ifdef __ARM_NEON +#include "cpu_features.h" + +#if HAVE_NEON_NATIVE # include static forceinline void matchfinder_init_neon(mf_pos_t *data, size_t size) @@ -81,6 +83,6 @@ matchfinder_rebase_neon(mf_pos_t *data, size_t size) } #define matchfinder_rebase matchfinder_rebase_neon -#endif /* __ARM_NEON */ +#endif /* HAVE_NEON_NATIVE */ #endif /* LIB_ARM_MATCHFINDER_IMPL_H */ diff --git a/src/3rdparty/libdeflate/lib/decompress_template.h b/src/3rdparty/libdeflate/lib/decompress_template.h index 22ab41e85..2d9dfa82b 100644 --- a/src/3rdparty/libdeflate/lib/decompress_template.h +++ b/src/3rdparty/libdeflate/lib/decompress_template.h @@ -31,7 +31,17 @@ * target instruction sets. 
*/ -static enum libdeflate_result ATTRIBUTES +#ifndef ATTRIBUTES +# define ATTRIBUTES +#endif +#ifndef EXTRACT_VARBITS +# define EXTRACT_VARBITS(word, count) ((word) & BITMASK(count)) +#endif +#ifndef EXTRACT_VARBITS8 +# define EXTRACT_VARBITS8(word, count) ((word) & BITMASK((u8)(count))) +#endif + +static enum libdeflate_result ATTRIBUTES MAYBE_UNUSED FUNCNAME(struct libdeflate_decompressor * restrict d, const void * restrict in, size_t in_nbytes, void * restrict out, size_t out_nbytes_avail, @@ -41,35 +51,36 @@ FUNCNAME(struct libdeflate_decompressor * restrict d, u8 * const out_end = out_next + out_nbytes_avail; u8 * const out_fastloop_end = out_end - MIN(out_nbytes_avail, FASTLOOP_MAX_BYTES_WRITTEN); + + /* Input bitstream state; see deflate_decompress.c for documentation */ const u8 *in_next = in; const u8 * const in_end = in_next + in_nbytes; const u8 * const in_fastloop_end = in_end - MIN(in_nbytes, FASTLOOP_MAX_BYTES_READ); bitbuf_t bitbuf = 0; bitbuf_t saved_bitbuf; - machine_word_t bitsleft = 0; + u32 bitsleft = 0; size_t overread_count = 0; - unsigned i; + bool is_final_block; unsigned block_type; - u16 len; - u16 nlen; unsigned num_litlen_syms; unsigned num_offset_syms; - bitbuf_t tmpbits; + bitbuf_t litlen_tablemask; + u32 entry; next_block: /* Starting to read the next block */ ; - STATIC_ASSERT(CAN_ENSURE(1 + 2 + 5 + 5 + 4 + 3)); + STATIC_ASSERT(CAN_CONSUME(1 + 2 + 5 + 5 + 4 + 3)); REFILL_BITS(); /* BFINAL: 1 bit */ - is_final_block = POP_BITS(1); + is_final_block = bitbuf & BITMASK(1); /* BTYPE: 2 bits */ - block_type = POP_BITS(2); + block_type = (bitbuf >> 1) & BITMASK(2); if (block_type == DEFLATE_BLOCKTYPE_DYNAMIC_HUFFMAN) { @@ -81,17 +92,18 @@ next_block: }; unsigned num_explicit_precode_lens; + unsigned i; /* Read the codeword length counts. */ - STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == ((1 << 5) - 1) + 257); - num_litlen_syms = POP_BITS(5) + 257; + STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == 257 + BITMASK(5)); + num_litlen_syms = 257 + ((bitbuf >> 3) & BITMASK(5)); - STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == ((1 << 5) - 1) + 1); - num_offset_syms = POP_BITS(5) + 1; + STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == 1 + BITMASK(5)); + num_offset_syms = 1 + ((bitbuf >> 8) & BITMASK(5)); - STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == ((1 << 4) - 1) + 4); - num_explicit_precode_lens = POP_BITS(4) + 4; + STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == 4 + BITMASK(4)); + num_explicit_precode_lens = 4 + ((bitbuf >> 13) & BITMASK(4)); d->static_codes_loaded = false; @@ -103,16 +115,31 @@ next_block: * merge one len with the previous fields. 
*/ STATIC_ASSERT(DEFLATE_MAX_PRE_CODEWORD_LEN == (1 << 3) - 1); - if (CAN_ENSURE(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) { - d->u.precode_lens[deflate_precode_lens_permutation[0]] = POP_BITS(3); + if (CAN_CONSUME(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) { + d->u.precode_lens[deflate_precode_lens_permutation[0]] = + (bitbuf >> 17) & BITMASK(3); + bitbuf >>= 20; + bitsleft -= 20; REFILL_BITS(); - for (i = 1; i < num_explicit_precode_lens; i++) - d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3); + i = 1; + do { + d->u.precode_lens[deflate_precode_lens_permutation[i]] = + bitbuf & BITMASK(3); + bitbuf >>= 3; + bitsleft -= 3; + } while (++i < num_explicit_precode_lens); } else { - for (i = 0; i < num_explicit_precode_lens; i++) { - ENSURE_BITS(3); - d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3); - } + bitbuf >>= 17; + bitsleft -= 17; + i = 0; + do { + if ((u8)bitsleft < 3) + REFILL_BITS(); + d->u.precode_lens[deflate_precode_lens_permutation[i]] = + bitbuf & BITMASK(3); + bitbuf >>= 3; + bitsleft -= 3; + } while (++i < num_explicit_precode_lens); } for (; i < DEFLATE_NUM_PRECODE_SYMS; i++) d->u.precode_lens[deflate_precode_lens_permutation[i]] = 0; @@ -121,13 +148,14 @@ next_block: SAFETY_CHECK(build_precode_decode_table(d)); /* Decode the litlen and offset codeword lengths. */ - for (i = 0; i < num_litlen_syms + num_offset_syms; ) { - u32 entry; + i = 0; + do { unsigned presym; u8 rep_val; unsigned rep_count; - ENSURE_BITS(DEFLATE_MAX_PRE_CODEWORD_LEN + 7); + if ((u8)bitsleft < DEFLATE_MAX_PRE_CODEWORD_LEN + 7) + REFILL_BITS(); /* * The code below assumes that the precode decode table @@ -135,9 +163,11 @@ next_block: */ STATIC_ASSERT(PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN); - /* Read the next precode symbol. */ - entry = d->u.l.precode_decode_table[BITS(DEFLATE_MAX_PRE_CODEWORD_LEN)]; - REMOVE_BITS((u8)entry); + /* Decode the next precode symbol. */ + entry = d->u.l.precode_decode_table[ + bitbuf & BITMASK(DEFLATE_MAX_PRE_CODEWORD_LEN)]; + bitbuf >>= (u8)entry; + bitsleft -= entry; /* optimization: subtract full entry */ presym = entry >> 16; if (presym < 16) { @@ -171,8 +201,10 @@ next_block: /* Repeat the previous length 3 - 6 times. */ SAFETY_CHECK(i != 0); rep_val = d->u.l.lens[i - 1]; - STATIC_ASSERT(3 + ((1 << 2) - 1) == 6); - rep_count = 3 + POP_BITS(2); + STATIC_ASSERT(3 + BITMASK(2) == 6); + rep_count = 3 + (bitbuf & BITMASK(2)); + bitbuf >>= 2; + bitsleft -= 2; d->u.l.lens[i + 0] = rep_val; d->u.l.lens[i + 1] = rep_val; d->u.l.lens[i + 2] = rep_val; @@ -182,8 +214,10 @@ next_block: i += rep_count; } else if (presym == 17) { /* Repeat zero 3 - 10 times. */ - STATIC_ASSERT(3 + ((1 << 3) - 1) == 10); - rep_count = 3 + POP_BITS(3); + STATIC_ASSERT(3 + BITMASK(3) == 10); + rep_count = 3 + (bitbuf & BITMASK(3)); + bitbuf >>= 3; + bitsleft -= 3; d->u.l.lens[i + 0] = 0; d->u.l.lens[i + 1] = 0; d->u.l.lens[i + 2] = 0; @@ -197,20 +231,39 @@ next_block: i += rep_count; } else { /* Repeat zero 11 - 138 times. */ - STATIC_ASSERT(11 + ((1 << 7) - 1) == 138); - rep_count = 11 + POP_BITS(7); + STATIC_ASSERT(11 + BITMASK(7) == 138); + rep_count = 11 + (bitbuf & BITMASK(7)); + bitbuf >>= 7; + bitsleft -= 7; memset(&d->u.l.lens[i], 0, rep_count * sizeof(d->u.l.lens[i])); i += rep_count; } - } + } while (i < num_litlen_syms + num_offset_syms); + } else if (block_type == DEFLATE_BLOCKTYPE_UNCOMPRESSED) { + u16 len, nlen; + /* * Uncompressed block: copy 'len' bytes literally from the input * buffer to the output buffer. 
*/ - ALIGN_INPUT(); + bitsleft -= 3; /* for BTYPE and BFINAL */ + + /* + * Align the bitstream to the next byte boundary. This means + * the next byte boundary as if we were reading a byte at a + * time. Therefore, we have to rewind 'in_next' by any bytes + * that have been refilled but not actually consumed yet (not + * counting overread bytes, which don't increment 'in_next'). + */ + bitsleft = (u8)bitsleft; + SAFETY_CHECK(overread_count <= (bitsleft >> 3)); + in_next -= (bitsleft >> 3) - overread_count; + overread_count = 0; + bitbuf = 0; + bitsleft = 0; SAFETY_CHECK(in_end - in_next >= 4); len = get_unaligned_le16(in_next); @@ -229,6 +282,8 @@ next_block: goto block_done; } else { + unsigned i; + SAFETY_CHECK(block_type == DEFLATE_BLOCKTYPE_STATIC_HUFFMAN); /* @@ -241,6 +296,9 @@ next_block: * dynamic Huffman block. */ + bitbuf >>= 3; /* for BTYPE and BFINAL */ + bitsleft -= 3; + if (d->static_codes_loaded) goto have_decode_tables; @@ -270,186 +328,344 @@ next_block: SAFETY_CHECK(build_offset_decode_table(d, num_litlen_syms, num_offset_syms)); SAFETY_CHECK(build_litlen_decode_table(d, num_litlen_syms, num_offset_syms)); have_decode_tables: + litlen_tablemask = BITMASK(d->litlen_tablebits); /* * This is the "fastloop" for decoding literals and matches. It does * bounds checks on in_next and out_next in the loop conditions so that * additional bounds checks aren't needed inside the loop body. + * + * To reduce latency, the bitbuffer is refilled and the next litlen + * decode table entry is preloaded before each loop iteration. */ - while (in_next < in_fastloop_end && out_next < out_fastloop_end) { - u32 entry, length, offset; - u8 lit; + if (in_next >= in_fastloop_end || out_next >= out_fastloop_end) + goto generic_loop; + REFILL_BITS_IN_FASTLOOP(); + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; + do { + u32 length, offset, lit; const u8 *src; u8 *dst; - /* Refill the bitbuffer and decode a litlen symbol. */ - REFILL_BITS_IN_FASTLOOP(); - entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)]; -preloaded: - if (CAN_ENSURE(3 * LITLEN_TABLEBITS + - DEFLATE_MAX_LITLEN_CODEWORD_LEN + - DEFLATE_MAX_EXTRA_LENGTH_BITS) && - (entry & HUFFDEC_LITERAL)) { + /* + * Consume the bits for the litlen decode table entry. Save the + * original bitbuf for later, in case the extra match length + * bits need to be extracted from it. + */ + saved_bitbuf = bitbuf; + bitbuf >>= (u8)entry; + bitsleft -= entry; /* optimization: subtract full entry */ + + /* + * Begin by checking for a "fast" literal, i.e. a literal that + * doesn't need a subtable. + */ + if (entry & HUFFDEC_LITERAL) { /* - * 64-bit only: fast path for decoding literals that - * don't need subtables. We do up to 3 of these before - * proceeding to the general case. This is the largest - * number of times that LITLEN_TABLEBITS bits can be - * extracted from a refilled 64-bit bitbuffer while - * still leaving enough bits to decode any match length. + * On 64-bit platforms, we decode up to 2 extra fast + * literals in addition to the primary item, as this + * increases performance and still leaves enough bits + * remaining for what follows. We could actually do 3, + * assuming LITLEN_TABLEBITS=11, but that actually + * decreases performance slightly (perhaps by messing + * with the branch prediction of the conditional refill + * that happens later while decoding the match offset). 
* * Note: the definitions of FASTLOOP_MAX_BYTES_WRITTEN * and FASTLOOP_MAX_BYTES_READ need to be updated if the - * maximum number of literals decoded here is changed. + * number of extra literals decoded here is changed. */ - REMOVE_ENTRY_BITS_FAST(entry); - lit = entry >> 16; - entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)]; - *out_next++ = lit; - if (entry & HUFFDEC_LITERAL) { - REMOVE_ENTRY_BITS_FAST(entry); + if (/* enough bits for 2 fast literals + length + offset preload? */ + CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS + + LENGTH_MAXBITS, + OFFSET_TABLEBITS) && + /* enough bits for 2 fast literals + slow literal + litlen preload? */ + CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS + + DEFLATE_MAX_LITLEN_CODEWORD_LEN, + LITLEN_TABLEBITS)) { + /* 1st extra fast literal */ lit = entry >> 16; - entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)]; + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; + saved_bitbuf = bitbuf; + bitbuf >>= (u8)entry; + bitsleft -= entry; *out_next++ = lit; if (entry & HUFFDEC_LITERAL) { - REMOVE_ENTRY_BITS_FAST(entry); + /* 2nd extra fast literal */ lit = entry >> 16; - entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)]; + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; + saved_bitbuf = bitbuf; + bitbuf >>= (u8)entry; + bitsleft -= entry; *out_next++ = lit; + if (entry & HUFFDEC_LITERAL) { + /* + * Another fast literal, but + * this one is in lieu of the + * primary item, so it doesn't + * count as one of the extras. + */ + lit = entry >> 16; + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; + REFILL_BITS_IN_FASTLOOP(); + *out_next++ = lit; + continue; + } } + } else { + /* + * Decode a literal. While doing so, preload + * the next litlen decode table entry and refill + * the bitbuffer. To reduce latency, we've + * arranged for there to be enough "preloadable" + * bits remaining to do the table preload + * independently of the refill. + */ + STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD( + LITLEN_TABLEBITS, LITLEN_TABLEBITS)); + lit = entry >> 16; + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; + REFILL_BITS_IN_FASTLOOP(); + *out_next++ = lit; + continue; } } + + /* + * It's not a literal entry, so it can be a length entry, a + * subtable pointer entry, or an end-of-block entry. Detect the + * two unlikely cases by testing the HUFFDEC_EXCEPTIONAL flag. + */ if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) { /* Subtable pointer or end-of-block entry */ - if (entry & HUFFDEC_SUBTABLE_POINTER) { - REMOVE_BITS(LITLEN_TABLEBITS); - entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)]; - } - SAVE_BITBUF(); - REMOVE_ENTRY_BITS_FAST(entry); + if (unlikely(entry & HUFFDEC_END_OF_BLOCK)) goto block_done; - /* Literal or length entry, from a subtable */ - } else { - /* Literal or length entry, from the main table */ - SAVE_BITBUF(); - REMOVE_ENTRY_BITS_FAST(entry); - } - length = entry >> 16; - if (entry & HUFFDEC_LITERAL) { + /* - * Literal that didn't get handled by the literal fast - * path earlier + * A subtable is required. Load and consume the + * subtable entry. The subtable entry can be of any + * type: literal, length, or end-of-block. */ - *out_next++ = length; - continue; - } - /* - * Match length. Finish decoding it. We don't need to check - * for too-long matches here, as this is inside the fastloop - * where it's already been verified that the output buffer has - * enough space remaining to copy a max-length match. 
- */ - length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8); + entry = d->u.litlen_decode_table[(entry >> 16) + + EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)]; + saved_bitbuf = bitbuf; + bitbuf >>= (u8)entry; + bitsleft -= entry; - /* Decode the match offset. */ - - /* Refill the bitbuffer if it may be needed for the offset. */ - if (unlikely(GET_REAL_BITSLEFT() < - DEFLATE_MAX_OFFSET_CODEWORD_LEN + - DEFLATE_MAX_EXTRA_OFFSET_BITS)) - REFILL_BITS_IN_FASTLOOP(); - - STATIC_ASSERT(CAN_ENSURE(OFFSET_TABLEBITS + - DEFLATE_MAX_EXTRA_OFFSET_BITS)); - STATIC_ASSERT(CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN - - OFFSET_TABLEBITS + - DEFLATE_MAX_EXTRA_OFFSET_BITS)); - - entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)]; - if (entry & HUFFDEC_EXCEPTIONAL) { - /* Offset codeword requires a subtable */ - REMOVE_BITS(OFFSET_TABLEBITS); - entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)]; /* - * On 32-bit, we might not be able to decode the offset - * symbol and extra offset bits without refilling the - * bitbuffer in between. However, this is only an issue - * when a subtable is needed, so do the refill here. + * 32-bit platforms that use the byte-at-a-time refill + * method have to do a refill here for there to always + * be enough bits to decode a literal that requires a + * subtable, then preload the next litlen decode table + * entry; or to decode a match length that requires a + * subtable, then preload the offset decode table entry. */ - if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN + - DEFLATE_MAX_EXTRA_OFFSET_BITS)) + if (!CAN_CONSUME_AND_THEN_PRELOAD(DEFLATE_MAX_LITLEN_CODEWORD_LEN, + LITLEN_TABLEBITS) || + !CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXBITS, + OFFSET_TABLEBITS)) REFILL_BITS_IN_FASTLOOP(); + if (entry & HUFFDEC_LITERAL) { + /* Decode a literal that required a subtable. */ + lit = entry >> 16; + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; + REFILL_BITS_IN_FASTLOOP(); + *out_next++ = lit; + continue; + } + if (unlikely(entry & HUFFDEC_END_OF_BLOCK)) + goto block_done; + /* Else, it's a length that required a subtable. */ } - SAVE_BITBUF(); - REMOVE_ENTRY_BITS_FAST(entry); - offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8)); + + /* + * Decode the match length: the length base value associated + * with the litlen symbol (which we extract from the decode + * table entry), plus the extra length bits. We don't need to + * consume the extra length bits here, as they were included in + * the bits consumed by the entry earlier. We also don't need + * to check for too-long matches here, as this is inside the + * fastloop where it's already been verified that the output + * buffer has enough space remaining to copy a max-length match. + */ + length = entry >> 16; + length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8); + + /* + * Decode the match offset. There are enough "preloadable" bits + * remaining to preload the offset decode table entry, but a + * refill might be needed before consuming it. + */ + STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXFASTBITS, + OFFSET_TABLEBITS)); + entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)]; + if (CAN_CONSUME_AND_THEN_PRELOAD(OFFSET_MAXBITS, + LITLEN_TABLEBITS)) { + /* + * Decoding a match offset on a 64-bit platform. We may + * need to refill once, but then we can decode the whole + * offset and preload the next litlen table entry. 
+ */ + if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) { + /* Offset codeword requires a subtable */ + if (unlikely((u8)bitsleft < OFFSET_MAXBITS + + LITLEN_TABLEBITS - PRELOAD_SLACK)) + REFILL_BITS_IN_FASTLOOP(); + bitbuf >>= OFFSET_TABLEBITS; + bitsleft -= OFFSET_TABLEBITS; + entry = d->offset_decode_table[(entry >> 16) + + EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)]; + } else if (unlikely((u8)bitsleft < OFFSET_MAXFASTBITS + + LITLEN_TABLEBITS - PRELOAD_SLACK)) + REFILL_BITS_IN_FASTLOOP(); + } else { + /* Decoding a match offset on a 32-bit platform */ + REFILL_BITS_IN_FASTLOOP(); + if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) { + /* Offset codeword requires a subtable */ + bitbuf >>= OFFSET_TABLEBITS; + bitsleft -= OFFSET_TABLEBITS; + entry = d->offset_decode_table[(entry >> 16) + + EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)]; + REFILL_BITS_IN_FASTLOOP(); + /* No further refill needed before extra bits */ + STATIC_ASSERT(CAN_CONSUME( + OFFSET_MAXBITS - OFFSET_TABLEBITS)); + } else { + /* No refill needed before extra bits */ + STATIC_ASSERT(CAN_CONSUME(OFFSET_MAXFASTBITS)); + } + } + saved_bitbuf = bitbuf; + bitbuf >>= (u8)entry; + bitsleft -= entry; /* optimization: subtract full entry */ + offset = entry >> 16; + offset += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8); /* Validate the match offset; needed even in the fastloop. */ SAFETY_CHECK(offset <= out_next - (const u8 *)out); + src = out_next - offset; + dst = out_next; + out_next += length; /* - * Before starting to copy the match, refill the bitbuffer and - * preload the litlen decode table entry for the next loop - * iteration. This can increase performance by allowing the - * latency of the two operations to overlap. + * Before starting to issue the instructions to copy the match, + * refill the bitbuffer and preload the litlen decode table + * entry for the next loop iteration. This can increase + * performance by allowing the latency of the match copy to + * overlap with these other operations. To further reduce + * latency, we've arranged for there to be enough bits remaining + * to do the table preload independently of the refill, except + * on 32-bit platforms using the byte-at-a-time refill method. */ + if (!CAN_CONSUME_AND_THEN_PRELOAD( + MAX(OFFSET_MAXBITS - OFFSET_TABLEBITS, + OFFSET_MAXFASTBITS), + LITLEN_TABLEBITS) && + unlikely((u8)bitsleft < LITLEN_TABLEBITS - PRELOAD_SLACK)) + REFILL_BITS_IN_FASTLOOP(); + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; REFILL_BITS_IN_FASTLOOP(); - entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)]; /* * Copy the match. On most CPUs the fastest method is a - * word-at-a-time copy, unconditionally copying at least 3 words + * word-at-a-time copy, unconditionally copying about 5 words * since this is enough for most matches without being too much. * * The normal word-at-a-time copy works for offset >= WORDBYTES, * which is most cases. The case of offset == 1 is also common * and is worth optimizing for, since it is just RLE encoding of * the previous byte, which is the result of compressing long - * runs of the same byte. We currently don't optimize for the - * less common cases of offset > 1 && offset < WORDBYTES; we - * just fall back to a traditional byte-at-a-time copy for them. + * runs of the same byte. + * + * Writing past the match 'length' is allowed here, since it's + * been ensured there is enough output space left for a slight + * overrun. FASTLOOP_MAX_BYTES_WRITTEN needs to be updated if + * the maximum possible overrun here is changed. 
*/ - src = out_next - offset; - dst = out_next; - out_next += length; if (UNALIGNED_ACCESS_IS_FAST && offset >= WORDBYTES) { - copy_word_unaligned(src, dst); + store_word_unaligned(load_word_unaligned(src), dst); src += WORDBYTES; dst += WORDBYTES; - copy_word_unaligned(src, dst); + store_word_unaligned(load_word_unaligned(src), dst); src += WORDBYTES; dst += WORDBYTES; - do { - copy_word_unaligned(src, dst); + store_word_unaligned(load_word_unaligned(src), dst); + src += WORDBYTES; + dst += WORDBYTES; + store_word_unaligned(load_word_unaligned(src), dst); + src += WORDBYTES; + dst += WORDBYTES; + store_word_unaligned(load_word_unaligned(src), dst); + src += WORDBYTES; + dst += WORDBYTES; + while (dst < out_next) { + store_word_unaligned(load_word_unaligned(src), dst); src += WORDBYTES; dst += WORDBYTES; - } while (dst < out_next); + store_word_unaligned(load_word_unaligned(src), dst); + src += WORDBYTES; + dst += WORDBYTES; + store_word_unaligned(load_word_unaligned(src), dst); + src += WORDBYTES; + dst += WORDBYTES; + store_word_unaligned(load_word_unaligned(src), dst); + src += WORDBYTES; + dst += WORDBYTES; + store_word_unaligned(load_word_unaligned(src), dst); + src += WORDBYTES; + dst += WORDBYTES; + } } else if (UNALIGNED_ACCESS_IS_FAST && offset == 1) { - machine_word_t v = repeat_byte(*src); + machine_word_t v; + /* + * This part tends to get auto-vectorized, so keep it + * copying a multiple of 16 bytes at a time. + */ + v = (machine_word_t)0x0101010101010101 * src[0]; store_word_unaligned(v, dst); dst += WORDBYTES; store_word_unaligned(v, dst); dst += WORDBYTES; - do { + store_word_unaligned(v, dst); + dst += WORDBYTES; + store_word_unaligned(v, dst); + dst += WORDBYTES; + while (dst < out_next) { store_word_unaligned(v, dst); dst += WORDBYTES; + store_word_unaligned(v, dst); + dst += WORDBYTES; + store_word_unaligned(v, dst); + dst += WORDBYTES; + store_word_unaligned(v, dst); + dst += WORDBYTES; + } + } else if (UNALIGNED_ACCESS_IS_FAST) { + store_word_unaligned(load_word_unaligned(src), dst); + src += offset; + dst += offset; + store_word_unaligned(load_word_unaligned(src), dst); + src += offset; + dst += offset; + do { + store_word_unaligned(load_word_unaligned(src), dst); + src += offset; + dst += offset; + store_word_unaligned(load_word_unaligned(src), dst); + src += offset; + dst += offset; } while (dst < out_next); } else { - STATIC_ASSERT(DEFLATE_MIN_MATCH_LEN == 3); *dst++ = *src++; *dst++ = *src++; do { *dst++ = *src++; } while (dst < out_next); } - if (in_next < in_fastloop_end && out_next < out_fastloop_end) - goto preloaded; - break; - } - /* MASK_BITSLEFT() is needed when leaving the fastloop. */ - MASK_BITSLEFT(); + } while (in_next < in_fastloop_end && out_next < out_fastloop_end); /* * This is the generic loop for decoding literals and matches. This @@ -458,19 +674,24 @@ preloaded: * critical, as most time is spent in the fastloop above instead. We * therefore omit some optimizations here in favor of smaller code. 
*/ +generic_loop: for (;;) { - u32 entry, length, offset; + u32 length, offset; const u8 *src; u8 *dst; REFILL_BITS(); - entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)]; + entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask]; + saved_bitbuf = bitbuf; + bitbuf >>= (u8)entry; + bitsleft -= entry; if (unlikely(entry & HUFFDEC_SUBTABLE_POINTER)) { - REMOVE_BITS(LITLEN_TABLEBITS); - entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)]; + entry = d->u.litlen_decode_table[(entry >> 16) + + EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)]; + saved_bitbuf = bitbuf; + bitbuf >>= (u8)entry; + bitsleft -= entry; } - SAVE_BITBUF(); - REMOVE_BITS((u8)entry); length = entry >> 16; if (entry & HUFFDEC_LITERAL) { if (unlikely(out_next == out_end)) @@ -480,34 +701,27 @@ preloaded: } if (unlikely(entry & HUFFDEC_END_OF_BLOCK)) goto block_done; - length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8); + length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8); if (unlikely(length > out_end - out_next)) return LIBDEFLATE_INSUFFICIENT_SPACE; - if (CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN + - DEFLATE_MAX_EXTRA_OFFSET_BITS)) { - ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN + - DEFLATE_MAX_EXTRA_OFFSET_BITS); - } else { - ENSURE_BITS(OFFSET_TABLEBITS + - DEFLATE_MAX_EXTRA_OFFSET_BITS); + if (!CAN_CONSUME(LENGTH_MAXBITS + OFFSET_MAXBITS)) + REFILL_BITS(); + entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)]; + if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) { + bitbuf >>= OFFSET_TABLEBITS; + bitsleft -= OFFSET_TABLEBITS; + entry = d->offset_decode_table[(entry >> 16) + + EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)]; + if (!CAN_CONSUME(OFFSET_MAXBITS)) + REFILL_BITS(); } - entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)]; - if (entry & HUFFDEC_EXCEPTIONAL) { - REMOVE_BITS(OFFSET_TABLEBITS); - entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)]; - if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN + - DEFLATE_MAX_EXTRA_OFFSET_BITS)) - ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN - - OFFSET_TABLEBITS + - DEFLATE_MAX_EXTRA_OFFSET_BITS); - } - SAVE_BITBUF(); - REMOVE_BITS((u8)entry); - offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8)); + offset = entry >> 16; + offset += EXTRACT_VARBITS8(bitbuf, entry) >> (u8)(entry >> 8); + bitbuf >>= (u8)entry; + bitsleft -= entry; SAFETY_CHECK(offset <= out_next - (const u8 *)out); - src = out_next - offset; dst = out_next; out_next += length; @@ -521,9 +735,6 @@ preloaded: } block_done: - /* MASK_BITSLEFT() is needed when leaving the fastloop. */ - MASK_BITSLEFT(); - /* Finished decoding a block */ if (!is_final_block) @@ -531,12 +742,21 @@ block_done: /* That was the last block. */ - /* Discard any readahead bits and check for excessive overread. */ - ALIGN_INPUT(); + bitsleft = (u8)bitsleft; + + /* + * If any of the implicit appended zero bytes were consumed (not just + * refilled) before hitting end of stream, then the data is bad. + */ + SAFETY_CHECK(overread_count <= (bitsleft >> 3)); + + /* Optionally return the actual number of bytes consumed. */ + if (actual_in_nbytes_ret) { + /* Don't count bytes that were refilled but not consumed. */ + in_next -= (bitsleft >> 3) - overread_count; - /* Optionally return the actual number of bytes read. */ - if (actual_in_nbytes_ret) *actual_in_nbytes_ret = in_next - (u8 *)in; + } /* Optionally return the actual number of bytes written. 
*/ if (actual_out_nbytes_ret) { @@ -550,3 +770,5 @@ block_done: #undef FUNCNAME #undef ATTRIBUTES +#undef EXTRACT_VARBITS +#undef EXTRACT_VARBITS8 diff --git a/src/3rdparty/libdeflate/lib/deflate_compress.c b/src/3rdparty/libdeflate/lib/deflate_compress.c index 096c0668a..c2dcbb4cf 100644 --- a/src/3rdparty/libdeflate/lib/deflate_compress.c +++ b/src/3rdparty/libdeflate/lib/deflate_compress.c @@ -475,7 +475,7 @@ struct deflate_output_bitstream; struct libdeflate_compressor { /* Pointer to the compress() implementation chosen at allocation time */ - void (*impl)(struct libdeflate_compressor *c, const u8 *in, + void (*impl)(struct libdeflate_compressor *restrict c, const u8 *in, size_t in_nbytes, struct deflate_output_bitstream *os); /* The compression level with which this compressor was created */ @@ -1041,7 +1041,6 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[], unsigned parent = A[node] >> NUM_SYMBOL_BITS; unsigned parent_depth = A[parent] >> NUM_SYMBOL_BITS; unsigned depth = parent_depth + 1; - unsigned len = depth; /* * Set the depth of this node so that it is available when its @@ -1054,19 +1053,19 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[], * constraint. This is not the optimal method for generating * length-limited Huffman codes! But it should be good enough. */ - if (len >= max_codeword_len) { - len = max_codeword_len; + if (depth >= max_codeword_len) { + depth = max_codeword_len; do { - len--; - } while (len_counts[len] == 0); + depth--; + } while (len_counts[depth] == 0); } /* * Account for the fact that we have a non-leaf node at the * current depth. */ - len_counts[len]--; - len_counts[len + 1] += 2; + len_counts[depth]--; + len_counts[depth + 1] += 2; } } @@ -1189,11 +1188,9 @@ gen_codewords(u32 A[], u8 lens[], const unsigned len_counts[], (next_codewords[len - 1] + len_counts[len - 1]) << 1; for (sym = 0; sym < num_syms; sym++) { - u8 len = lens[sym]; - u32 codeword = next_codewords[len]++; - /* DEFLATE requires bit-reversed codewords. */ - A[sym] = reverse_codeword(codeword, len); + A[sym] = reverse_codeword(next_codewords[lens[sym]]++, + lens[sym]); } } diff --git a/src/3rdparty/libdeflate/lib/deflate_constants.h b/src/3rdparty/libdeflate/lib/deflate_constants.h index 5982c152b..95c9e0a50 100644 --- a/src/3rdparty/libdeflate/lib/deflate_constants.h +++ b/src/3rdparty/libdeflate/lib/deflate_constants.h @@ -49,11 +49,8 @@ /* * Maximum number of extra bits that may be required to represent a match * length or offset. - * - * TODO: are we going to have full DEFLATE64 support? If so, up to 16 - * length bits must be supported. */ #define DEFLATE_MAX_EXTRA_LENGTH_BITS 5 -#define DEFLATE_MAX_EXTRA_OFFSET_BITS 14 +#define DEFLATE_MAX_EXTRA_OFFSET_BITS 13 #endif /* LIB_DEFLATE_CONSTANTS_H */ diff --git a/src/3rdparty/libdeflate/lib/deflate_decompress.c b/src/3rdparty/libdeflate/lib/deflate_decompress.c index 05d497e5c..fd6dde836 100644 --- a/src/3rdparty/libdeflate/lib/deflate_decompress.c +++ b/src/3rdparty/libdeflate/lib/deflate_decompress.c @@ -27,14 +27,14 @@ * --------------------------------------------------------------------------- * * This is a highly optimized DEFLATE decompressor. It is much faster than - * zlib, typically more than twice as fast, though results vary by CPU. + * vanilla zlib, typically well over twice as fast, though results vary by CPU. 
* - * Why this is faster than zlib's implementation: + * Why this is faster than vanilla zlib: * * - Word accesses rather than byte accesses when reading input * - Word accesses rather than byte accesses when copying matches * - Faster Huffman decoding combined with various DEFLATE-specific tricks - * - Larger bitbuffer variable that doesn't need to be filled as often + * - Larger bitbuffer variable that doesn't need to be refilled as often * - Other optimizations to remove unnecessary branches * - Only full-buffer decompression is supported, so the code doesn't need to * support stopping and resuming decompression. @@ -71,22 +71,32 @@ /* * The state of the "input bitstream" consists of the following variables: * - * - in_next: pointer to the next unread byte in the input buffer + * - in_next: a pointer to the next unread byte in the input buffer * - * - in_end: pointer just past the end of the input buffer + * - in_end: a pointer to just past the end of the input buffer * * - bitbuf: a word-sized variable containing bits that have been read from - * the input buffer. The buffered bits are right-aligned - * (they're the low-order bits). + * the input buffer or from the implicit appended zero bytes * - * - bitsleft: number of bits in 'bitbuf' that are valid. NOTE: in the - * fastloop, bits 8 and above of bitsleft can contain garbage. + * - bitsleft: the number of bits in 'bitbuf' available to be consumed. + * After REFILL_BITS_BRANCHLESS(), 'bitbuf' can actually + * contain more bits than this. However, only the bits counted + * by 'bitsleft' can actually be consumed; the rest can only be + * used for preloading. * - * - overread_count: number of implicit 0 bytes past 'in_end' that have - * been loaded into the bitbuffer + * As a micro-optimization, we allow bits 8 and higher of + * 'bitsleft' to contain garbage. When consuming the bits + * associated with a decode table entry, this allows us to do + * 'bitsleft -= entry' instead of 'bitsleft -= (u8)entry'. + * On some CPUs, this helps reduce instruction dependencies. + * This does have the disadvantage that 'bitsleft' sometimes + * needs to be cast to 'u8', such as when it's used as a shift + * amount in REFILL_BITS_BRANCHLESS(). But that one happens + * for free since most CPUs ignore high bits in shift amounts. * - * For performance reasons, these variables are declared as standalone variables - * and are manipulated using macros, rather than being packed into a struct. + * - overread_count: the total number of implicit appended zero bytes that + * have been loaded into the bitbuffer, including any + * counted by 'bitsleft' and any already consumed */ /* @@ -97,60 +107,92 @@ * which they don't have to refill as often. */ typedef machine_word_t bitbuf_t; +#define BITBUF_NBITS (8 * (int)sizeof(bitbuf_t)) + +/* BITMASK(n) returns a bitmask of length 'n'. */ +#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1) /* - * BITBUF_NBITS is the number of bits the bitbuffer variable can hold. See - * REFILL_BITS_WORDWISE() for why this is 1 less than the obvious value. + * MAX_BITSLEFT is the maximum number of consumable bits, i.e. the maximum value + * of '(u8)bitsleft'. This is the size of the bitbuffer variable, minus 1 if + * the branchless refill method is being used (see REFILL_BITS_BRANCHLESS()). */ -#define BITBUF_NBITS (8 * sizeof(bitbuf_t) - 1) +#define MAX_BITSLEFT \ + (UNALIGNED_ACCESS_IS_FAST ? 
BITBUF_NBITS - 1 : BITBUF_NBITS) /* - * REFILL_GUARANTEED_NBITS is the number of bits that are guaranteed in the - * bitbuffer variable after refilling it with ENSURE_BITS(n), REFILL_BITS(), or - * REFILL_BITS_IN_FASTLOOP(). There might be up to BITBUF_NBITS bits; however, - * since only whole bytes can be added, only 'BITBUF_NBITS - 7' bits are - * guaranteed. That is the smallest amount where another byte doesn't fit. + * CONSUMABLE_NBITS is the minimum number of bits that are guaranteed to be + * consumable (counted in 'bitsleft') immediately after refilling the bitbuffer. + * Since only whole bytes can be added to 'bitsleft', the worst case is + * 'MAX_BITSLEFT - 7': the smallest amount where another byte doesn't fit. */ -#define REFILL_GUARANTEED_NBITS (BITBUF_NBITS - 7) +#define CONSUMABLE_NBITS (MAX_BITSLEFT - 7) /* - * CAN_ENSURE(n) evaluates to true if the bitbuffer variable is guaranteed to - * contain at least 'n' bits after a refill. See REFILL_GUARANTEED_NBITS. - * - * This can be used to choose between alternate refill strategies based on the - * size of the bitbuffer variable. 'n' should be a compile-time constant. + * FASTLOOP_PRELOADABLE_NBITS is the minimum number of bits that are guaranteed + * to be preloadable immediately after REFILL_BITS_IN_FASTLOOP(). (It is *not* + * guaranteed after REFILL_BITS(), since REFILL_BITS() falls back to a + * byte-at-a-time refill method near the end of input.) This may exceed the + * number of consumable bits (counted by 'bitsleft'). Any bits not counted in + * 'bitsleft' can only be used for precomputation and cannot be consumed. */ -#define CAN_ENSURE(n) ((n) <= REFILL_GUARANTEED_NBITS) +#define FASTLOOP_PRELOADABLE_NBITS \ + (UNALIGNED_ACCESS_IS_FAST ? BITBUF_NBITS : CONSUMABLE_NBITS) /* - * REFILL_BITS_WORDWISE() branchlessly refills the bitbuffer variable by reading - * the next word from the input buffer and updating 'in_next' and 'bitsleft' - * based on how many bits were refilled -- counting whole bytes only. This is - * much faster than reading a byte at a time, at least if the CPU is little - * endian and supports fast unaligned memory accesses. + * PRELOAD_SLACK is the minimum number of bits that are guaranteed to be + * preloadable but not consumable, following REFILL_BITS_IN_FASTLOOP() and any + * subsequent consumptions. This is 1 bit if the branchless refill method is + * being used, and 0 bits otherwise. + */ +#define PRELOAD_SLACK MAX(0, FASTLOOP_PRELOADABLE_NBITS - MAX_BITSLEFT) + +/* + * CAN_CONSUME(n) is true if it's guaranteed that if the bitbuffer has just been + * refilled, then it's always possible to consume 'n' bits from it. 'n' should + * be a compile-time constant, to enable compile-time evaluation. + */ +#define CAN_CONSUME(n) (CONSUMABLE_NBITS >= (n)) + +/* + * CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) is true if it's + * guaranteed that after REFILL_BITS_IN_FASTLOOP(), it's always possible to + * consume 'consume_nbits' bits, then preload 'preload_nbits' bits. The + * arguments should be compile-time constants to enable compile-time evaluation. + */ +#define CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) \ + (CONSUMABLE_NBITS >= (consume_nbits) && \ + FASTLOOP_PRELOADABLE_NBITS >= (consume_nbits) + (preload_nbits)) + +/* + * REFILL_BITS_BRANCHLESS() branchlessly refills the bitbuffer variable by + * reading the next word from the input buffer and updating 'in_next' and + * 'bitsleft' based on how many bits were refilled -- counting whole bytes only. 
+ * This is much faster than reading a byte at a time, at least if the CPU is + * little endian and supports fast unaligned memory accesses. * * The simplest way of branchlessly updating 'bitsleft' would be: * - * bitsleft += (BITBUF_NBITS - bitsleft) & ~7; + * bitsleft += (MAX_BITSLEFT - bitsleft) & ~7; * - * To make it faster, we define BITBUF_NBITS to be 'WORDBITS - 1' rather than + * To make it faster, we define MAX_BITSLEFT to be 'WORDBITS - 1' rather than * WORDBITS, so that in binary it looks like 111111 or 11111. Then, we update - * 'bitsleft' just by setting the bits above the low 3 bits: + * 'bitsleft' by just setting the bits above the low 3 bits: * - * bitsleft |= BITBUF_NBITS & ~7; + * bitsleft |= MAX_BITSLEFT & ~7; * * That compiles down to a single instruction like 'or $0x38, %rbp'. Using - * 'BITBUF_NBITS == WORDBITS - 1' also has the advantage that refills can be - * done when 'bitsleft == BITBUF_NBITS' without invoking undefined behavior. + * 'MAX_BITSLEFT == WORDBITS - 1' also has the advantage that refills can be + * done when 'bitsleft == MAX_BITSLEFT' without invoking undefined behavior. * * The simplest way of branchlessly updating 'in_next' would be: * - * in_next += (BITBUF_NBITS - bitsleft) >> 3; + * in_next += (MAX_BITSLEFT - bitsleft) >> 3; * - * With 'BITBUF_NBITS == WORDBITS - 1' we could use an XOR instead, though this + * With 'MAX_BITSLEFT == WORDBITS - 1' we could use an XOR instead, though this * isn't really better: * - * in_next += (BITBUF_NBITS ^ bitsleft) >> 3; + * in_next += (MAX_BITSLEFT ^ bitsleft) >> 3; * * An alternative which can be marginally better is the following: * @@ -162,22 +204,23 @@ typedef machine_word_t bitbuf_t; * extraction instruction (e.g. arm's ubfx), it stays at 3, and is potentially * more efficient because the length of the longest dependency chain decreases * from 3 to 2. This alternative also has the advantage that it ignores the - * high bits in 'bitsleft', so it is compatible with the fastloop optimization - * (described later) where we let the high bits of 'bitsleft' contain garbage. + * high bits in 'bitsleft', so it is compatible with the micro-optimization we + * use where we let the high bits of 'bitsleft' contain garbage. */ -#define REFILL_BITS_WORDWISE() \ -do { \ - bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft;\ - in_next += sizeof(bitbuf_t) - 1; \ - in_next -= (bitsleft >> 3) & 0x7; \ - bitsleft |= BITBUF_NBITS & ~7; \ +#define REFILL_BITS_BRANCHLESS() \ +do { \ + bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft; \ + in_next += sizeof(bitbuf_t) - 1; \ + in_next -= (bitsleft >> 3) & 0x7; \ + bitsleft |= MAX_BITSLEFT & ~7; \ } while (0) /* * REFILL_BITS() loads bits from the input buffer until the bitbuffer variable - * contains at least REFILL_GUARANTEED_NBITS bits. + * contains at least CONSUMABLE_NBITS consumable bits. * - * This checks for the end of input, and it cannot be used in the fastloop. + * This checks for the end of input, and it doesn't guarantee + * FASTLOOP_PRELOADABLE_NBITS, so it can't be used in the fastloop. * * If we would overread the input buffer, we just don't read anything, leaving * the bits zeroed but marking them filled. 
This simplifies the decompressor @@ -196,121 +239,66 @@ do { \ */ #define REFILL_BITS() \ do { \ - if (CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST && \ + if (UNALIGNED_ACCESS_IS_FAST && \ likely(in_end - in_next >= sizeof(bitbuf_t))) { \ - REFILL_BITS_WORDWISE(); \ + REFILL_BITS_BRANCHLESS(); \ } else { \ - while (bitsleft < REFILL_GUARANTEED_NBITS) { \ + while ((u8)bitsleft < CONSUMABLE_NBITS) { \ if (likely(in_next != in_end)) { \ - bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \ + bitbuf |= (bitbuf_t)*in_next++ << \ + (u8)bitsleft; \ } else { \ overread_count++; \ - SAFETY_CHECK(overread_count <= sizeof(bitbuf)); \ - } \ + SAFETY_CHECK(overread_count <= \ + sizeof(bitbuf_t)); \ + } \ bitsleft += 8; \ } \ } \ } while (0) -/* ENSURE_BITS(n) calls REFILL_BITS() if fewer than 'n' bits are buffered. */ -#define ENSURE_BITS(n) \ +/* + * REFILL_BITS_IN_FASTLOOP() is like REFILL_BITS(), but it doesn't check for the + * end of the input. It can only be used in the fastloop. + */ +#define REFILL_BITS_IN_FASTLOOP() \ do { \ - if (bitsleft < (n)) \ - REFILL_BITS(); \ + STATIC_ASSERT(UNALIGNED_ACCESS_IS_FAST || \ + FASTLOOP_PRELOADABLE_NBITS == CONSUMABLE_NBITS); \ + if (UNALIGNED_ACCESS_IS_FAST) { \ + REFILL_BITS_BRANCHLESS(); \ + } else { \ + while ((u8)bitsleft < CONSUMABLE_NBITS) { \ + bitbuf |= (bitbuf_t)*in_next++ << (u8)bitsleft; \ + bitsleft += 8; \ + } \ + } \ } while (0) -#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1) - -/* BITS(n) returns the next 'n' buffered bits without removing them. */ -#define BITS(n) (bitbuf & BITMASK(n)) - -/* Macros to save the value of the bitbuffer variable and use it later. */ -#define SAVE_BITBUF() (saved_bitbuf = bitbuf) -#define SAVED_BITS(n) (saved_bitbuf & BITMASK(n)) - -/* REMOVE_BITS(n) removes the next 'n' buffered bits. */ -#define REMOVE_BITS(n) (bitbuf >>= (n), bitsleft -= (n)) - -/* POP_BITS(n) removes and returns the next 'n' buffered bits. */ -#define POP_BITS(n) (tmpbits = BITS(n), REMOVE_BITS(n), tmpbits) - -/* - * ALIGN_INPUT() verifies that the input buffer hasn't been overread, then - * aligns the bitstream to the next byte boundary, discarding any unused bits in - * the current byte. - * - * Note that if the bitbuffer variable currently contains more than 7 bits, then - * we must rewind 'in_next', effectively putting those bits back. Only the bits - * in what would be the "current" byte if we were reading one byte at a time can - * be actually discarded. - */ -#define ALIGN_INPUT() \ -do { \ - SAFETY_CHECK(overread_count <= (bitsleft >> 3)); \ - in_next -= (bitsleft >> 3) - overread_count; \ - overread_count = 0; \ - bitbuf = 0; \ - bitsleft = 0; \ -} while(0) - -/* - * Macros used in the "fastloop": the loop that decodes literals and matches - * while there is still plenty of space left in the input and output buffers. - * - * In the fastloop, we improve performance by skipping redundant bounds checks. - * On platforms where it helps, we also use an optimization where we allow bits - * 8 and higher of 'bitsleft' to contain garbage. This is sometimes a useful - * microoptimization because it means the whole 32-bit decode table entry can be - * subtracted from 'bitsleft' without an intermediate step to convert it to 8 - * bits. (It still needs to be converted to 8 bits for the shift of 'bitbuf', - * but most CPUs ignore high bits in shift amounts, so that happens implicitly - * with zero overhead.) REMOVE_ENTRY_BITS_FAST() implements this optimization. - * - * MASK_BITSLEFT() is used to clear the garbage bits when leaving the fastloop. 
- */ -#if CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST -# define REFILL_BITS_IN_FASTLOOP() REFILL_BITS_WORDWISE() -# define REMOVE_ENTRY_BITS_FAST(entry) (bitbuf >>= (u8)entry, bitsleft -= entry) -# define GET_REAL_BITSLEFT() ((u8)bitsleft) -# define MASK_BITSLEFT() (bitsleft &= 0xFF) -#else -# define REFILL_BITS_IN_FASTLOOP() \ - while (bitsleft < REFILL_GUARANTEED_NBITS) { \ - bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \ - bitsleft += 8; \ - } -# define REMOVE_ENTRY_BITS_FAST(entry) REMOVE_BITS((u8)entry) -# define GET_REAL_BITSLEFT() bitsleft -# define MASK_BITSLEFT() -#endif - /* * This is the worst-case maximum number of output bytes that are written to - * during each iteration of the fastloop. The worst case is 3 literals, then a - * match of length DEFLATE_MAX_MATCH_LEN. The match length must be rounded up - * to a word boundary due to the word-at-a-time match copy implementation. + * during each iteration of the fastloop. The worst case is 2 literals, then a + * match of length DEFLATE_MAX_MATCH_LEN. Additionally, some slack space must + * be included for the intentional overrun in the match copy implementation. */ #define FASTLOOP_MAX_BYTES_WRITTEN \ - (3 + ALIGN(DEFLATE_MAX_MATCH_LEN, WORDBYTES)) + (2 + DEFLATE_MAX_MATCH_LEN + (5 * WORDBYTES) - 1) /* * This is the worst-case maximum number of input bytes that are read during * each iteration of the fastloop. To get this value, we first compute the * greatest number of bits that can be refilled during a loop iteration. The - * refill at the beginning can add at most BITBUF_NBITS, and the amount that can + * refill at the beginning can add at most MAX_BITSLEFT, and the amount that can * be refilled later is no more than the maximum amount that can be consumed by - * 3 literals that don't need a subtable, then a match. We convert this value - * to bytes, rounding up. Finally, we added sizeof(bitbuf_t) to account for - * REFILL_BITS_WORDWISE() reading up to a word past the part really used. + * 2 literals that don't need a subtable, then a match. We convert this value + * to bytes, rounding up; this gives the maximum number of bytes that 'in_next' + * can be advanced. Finally, we add sizeof(bitbuf_t) to account for + * REFILL_BITS_BRANCHLESS() reading a word past 'in_next'. */ #define FASTLOOP_MAX_BYTES_READ \ - (DIV_ROUND_UP(BITBUF_NBITS + \ - ((3 * LITLEN_TABLEBITS) + \ - DEFLATE_MAX_LITLEN_CODEWORD_LEN + \ - DEFLATE_MAX_EXTRA_LENGTH_BITS + \ - DEFLATE_MAX_OFFSET_CODEWORD_LEN + \ - DEFLATE_MAX_EXTRA_OFFSET_BITS), 8) + \ - sizeof(bitbuf_t)) + (DIV_ROUND_UP(MAX_BITSLEFT + (2 * LITLEN_TABLEBITS) + \ + LENGTH_MAXBITS + OFFSET_MAXBITS, 8) + \ + sizeof(bitbuf_t)) /***************************************************************************** * Huffman decoding * @@ -358,24 +346,37 @@ do { \ * take longer, which decreases performance. We choose values that work well in * practice, making subtables rarely needed without making the tables too large. * + * Our choice of OFFSET_TABLEBITS == 8 is a bit low; without any special + * considerations, 9 would fit the trade-off curve better. However, there is a + * performance benefit to using exactly 8 bits when it is a compile-time + * constant, as many CPUs can take the low byte more easily than the low 9 bits. + * + * zlib treats its equivalents of TABLEBITS as maximum values; whenever it + * builds a table, it caps the actual table_bits to the longest codeword. 
This + * makes sense in theory, as there's no need for the table to be any larger than + * needed to support the longest codeword. However, having the table bits be a + * compile-time constant is beneficial to the performance of the decode loop, so + * there is a trade-off. libdeflate currently uses the dynamic table_bits + * strategy for the litlen table only, due to its larger maximum size. + * PRECODE_TABLEBITS and OFFSET_TABLEBITS are smaller, so going dynamic there + * isn't as useful, and OFFSET_TABLEBITS=8 is useful as mentioned above. + * * Each TABLEBITS value has a corresponding ENOUGH value that gives the * worst-case maximum number of decode table entries, including the main table * and all subtables. The ENOUGH value depends on three parameters: * * (1) the maximum number of symbols in the code (DEFLATE_NUM_*_SYMS) - * (2) the number of main table bits (the corresponding TABLEBITS value) + * (2) the maximum number of main table bits (*_TABLEBITS) * (3) the maximum allowed codeword length (DEFLATE_MAX_*_CODEWORD_LEN) * * The ENOUGH values were computed using the utility program 'enough' from zlib. */ #define PRECODE_TABLEBITS 7 #define PRECODE_ENOUGH 128 /* enough 19 7 7 */ - #define LITLEN_TABLEBITS 11 #define LITLEN_ENOUGH 2342 /* enough 288 11 15 */ - -#define OFFSET_TABLEBITS 9 -#define OFFSET_ENOUGH 594 /* enough 32 9 15 */ +#define OFFSET_TABLEBITS 8 +#define OFFSET_ENOUGH 402 /* enough 32 8 15 */ /* * make_decode_table_entry() creates a decode table entry for the given symbol @@ -387,7 +388,7 @@ do { \ * appropriately-formatted decode table entry. See the definitions of the * *_decode_results[] arrays below, where the entry format is described. */ -static inline u32 +static forceinline u32 make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len) { return decode_results[sym] + (len << 8) + len; @@ -398,8 +399,8 @@ make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len) * described contain zeroes: * * Bit 20-16: presym - * Bit 10-8: codeword_len [not used] - * Bit 2-0: codeword_len + * Bit 10-8: codeword length [not used] + * Bit 2-0: codeword length * * The precode decode table never has subtables, since we use * PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN. @@ -431,6 +432,12 @@ static const u32 precode_decode_results[] = { /* Indicates an end-of-block entry in the litlen decode table */ #define HUFFDEC_END_OF_BLOCK 0x00002000 +/* Maximum number of bits that can be consumed by decoding a match length */ +#define LENGTH_MAXBITS (DEFLATE_MAX_LITLEN_CODEWORD_LEN + \ + DEFLATE_MAX_EXTRA_LENGTH_BITS) +#define LENGTH_MAXFASTBITS (LITLEN_TABLEBITS /* no subtable needed */ + \ + DEFLATE_MAX_EXTRA_LENGTH_BITS) + /* * Here is the format of our litlen decode table entries. Bits not explicitly * described contain zeroes: @@ -464,7 +471,8 @@ static const u32 precode_decode_results[] = { * Bit 15: 1 (HUFFDEC_EXCEPTIONAL) * Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER) * Bit 13: 0 (!HUFFDEC_END_OF_BLOCK) - * Bit 3-0: number of subtable bits + * Bit 11-8: number of subtable bits + * Bit 3-0: number of main table bits * * This format has several desirable properties: * @@ -481,20 +489,26 @@ static const u32 precode_decode_results[] = { * * - The low byte is the number of bits that need to be removed from the * bitstream; this makes this value easily accessible, and it enables the - * optimization used in REMOVE_ENTRY_BITS_FAST(). It also includes the - * number of extra bits, so they don't need to be removed separately. 
+ * micro-optimization of doing 'bitsleft -= entry' instead of + * 'bitsleft -= (u8)entry'. It also includes the number of extra bits, + * so they don't need to be removed separately. * - * - The flags in bits 13-15 are arranged to be 0 when the number of - * non-extra bits (the value in bits 11-8) is needed, making this value + * - The flags in bits 15-13 are arranged to be 0 when the + * "remaining codeword length" in bits 11-8 is needed, making this value * fairly easily accessible as well via a shift and downcast. * + * - Similarly, bits 13-12 are 0 when the "subtable bits" in bits 11-8 are + * needed, making it possible to extract this value with '& 0x3F' rather + * than '& 0xF'. This value is only used as a shift amount, so this can + * save an 'and' instruction as the masking by 0x3F happens implicitly. + * * litlen_decode_results[] contains the static part of the entry for each * symbol. make_decode_table_entry() produces the final entries. */ static const u32 litlen_decode_results[] = { /* Literals */ -#define ENTRY(literal) (((u32)literal << 16) | HUFFDEC_LITERAL) +#define ENTRY(literal) (HUFFDEC_LITERAL | ((u32)literal << 16)) ENTRY(0) , ENTRY(1) , ENTRY(2) , ENTRY(3) , ENTRY(4) , ENTRY(5) , ENTRY(6) , ENTRY(7) , ENTRY(8) , ENTRY(9) , ENTRY(10) , ENTRY(11) , @@ -578,6 +592,12 @@ static const u32 litlen_decode_results[] = { #undef ENTRY }; +/* Maximum number of bits that can be consumed by decoding a match offset */ +#define OFFSET_MAXBITS (DEFLATE_MAX_OFFSET_CODEWORD_LEN + \ + DEFLATE_MAX_EXTRA_OFFSET_BITS) +#define OFFSET_MAXFASTBITS (OFFSET_TABLEBITS /* no subtable needed */ + \ + DEFLATE_MAX_EXTRA_OFFSET_BITS) + /* * Here is the format of our offset decode table entries. Bits not explicitly * described contain zeroes: @@ -592,7 +612,8 @@ static const u32 litlen_decode_results[] = { * Bit 31-16: index of start of subtable * Bit 15: 1 (HUFFDEC_EXCEPTIONAL) * Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER) - * Bit 3-0: number of subtable bits + * Bit 11-8: number of subtable bits + * Bit 3-0: number of main table bits * * These work the same way as the length entries and subtable pointer entries in * the litlen decode table; see litlen_decode_results[] above. @@ -607,15 +628,20 @@ static const u32 offset_decode_results[] = { ENTRY(257 , 7) , ENTRY(385 , 7) , ENTRY(513 , 8) , ENTRY(769 , 8) , ENTRY(1025 , 9) , ENTRY(1537 , 9) , ENTRY(2049 , 10) , ENTRY(3073 , 10) , ENTRY(4097 , 11) , ENTRY(6145 , 11) , ENTRY(8193 , 12) , ENTRY(12289 , 12) , - ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(32769 , 14) , ENTRY(49153 , 14) , + ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) , #undef ENTRY }; /* - * The main DEFLATE decompressor structure. Since this implementation only - * supports full buffer decompression, this structure does not store the entire - * decompression state, but rather only some arrays that are too large to - * comfortably allocate on the stack. + * The main DEFLATE decompressor structure. Since libdeflate only supports + * full-buffer decompression, this structure doesn't store the entire + * decompression state, most of which is in stack variables. Instead, this + * struct just contains the decode tables and some temporary arrays used for + * building them, as these are too large to comfortably allocate on the stack. 
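+ * (For a rough sense of scale: the litlen and offset decode tables alone
+ * come to LITLEN_ENOUGH*4 + OFFSET_ENOUGH*4 = 2342*4 + 402*4 bytes, about
+ * 10.7 KiB, before counting the temporary arrays.)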
+ * + * Storing the decode tables in the decompressor struct also allows the decode + * tables for the static codes to be reused whenever two static Huffman blocks + * are decoded without an intervening dynamic block, even across streams. */ struct libdeflate_decompressor { @@ -648,6 +674,7 @@ struct libdeflate_decompressor { u16 sorted_syms[DEFLATE_MAX_NUM_SYMS]; bool static_codes_loaded; + unsigned litlen_tablebits; }; /* @@ -678,11 +705,16 @@ struct libdeflate_decompressor { * make the final decode table entries using make_decode_table_entry(). * @table_bits * The log base-2 of the number of main table entries to use. + * If @table_bits_ret != NULL, then @table_bits is treated as a maximum + * value and it will be decreased if a smaller table would be sufficient. * @max_codeword_len * The maximum allowed codeword length for this Huffman code. * Must be <= DEFLATE_MAX_CODEWORD_LEN. * @sorted_syms * A temporary array of length @num_syms. + * @table_bits_ret + * If non-NULL, then the dynamic table_bits is enabled, and the actual + * table_bits value will be returned here. * * Returns %true if successful; %false if the codeword lengths do not form a * valid Huffman code. @@ -692,9 +724,10 @@ build_decode_table(u32 decode_table[], const u8 lens[], const unsigned num_syms, const u32 decode_results[], - const unsigned table_bits, - const unsigned max_codeword_len, - u16 *sorted_syms) + unsigned table_bits, + unsigned max_codeword_len, + u16 *sorted_syms, + unsigned *table_bits_ret) { unsigned len_counts[DEFLATE_MAX_CODEWORD_LEN + 1]; unsigned offsets[DEFLATE_MAX_CODEWORD_LEN + 1]; @@ -714,6 +747,17 @@ build_decode_table(u32 decode_table[], for (sym = 0; sym < num_syms; sym++) len_counts[lens[sym]]++; + /* + * Determine the actual maximum codeword length that was used, and + * decrease table_bits to it if allowed. + */ + while (max_codeword_len > 1 && len_counts[max_codeword_len] == 0) + max_codeword_len--; + if (table_bits_ret != NULL) { + table_bits = MIN(table_bits, max_codeword_len); + *table_bits_ret = table_bits; + } + /* * Sort the symbols primarily by increasing codeword length and * secondarily by increasing symbol value; or equivalently by their @@ -919,16 +963,13 @@ build_decode_table(u32 decode_table[], /* * Create the entry that points from the main table to - * the subtable. This entry contains the index of the - * start of the subtable and the number of bits with - * which the subtable is indexed (the log base 2 of the - * number of entries it contains). + * the subtable. */ decode_table[subtable_prefix] = ((u32)subtable_start << 16) | HUFFDEC_EXCEPTIONAL | HUFFDEC_SUBTABLE_POINTER | - subtable_bits; + (subtable_bits << 8) | table_bits; } /* Fill the subtable entries for the current codeword. */ @@ -969,7 +1010,8 @@ build_precode_decode_table(struct libdeflate_decompressor *d) precode_decode_results, PRECODE_TABLEBITS, DEFLATE_MAX_PRE_CODEWORD_LEN, - d->sorted_syms); + d->sorted_syms, + NULL); } /* Build the decode table for the literal/length code. */ @@ -989,7 +1031,8 @@ build_litlen_decode_table(struct libdeflate_decompressor *d, litlen_decode_results, LITLEN_TABLEBITS, DEFLATE_MAX_LITLEN_CODEWORD_LEN, - d->sorted_syms); + d->sorted_syms, + &d->litlen_tablebits); } /* Build the decode table for the offset code. */ @@ -998,7 +1041,7 @@ build_offset_decode_table(struct libdeflate_decompressor *d, unsigned num_litlen_syms, unsigned num_offset_syms) { /* When you change TABLEBITS, you must change ENOUGH, and vice versa! 
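	 * (The values in use here can be cross-checked against zlib's 'enough'
	 * utility, as noted above: "enough 19 7 7" -> 128, "enough 288 11 15"
	 * -> 2342, and "enough 32 8 15" -> 402.)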
*/ - STATIC_ASSERT(OFFSET_TABLEBITS == 9 && OFFSET_ENOUGH == 594); + STATIC_ASSERT(OFFSET_TABLEBITS == 8 && OFFSET_ENOUGH == 402); STATIC_ASSERT(ARRAY_LEN(offset_decode_results) == DEFLATE_NUM_OFFSET_SYMS); @@ -1009,27 +1052,8 @@ build_offset_decode_table(struct libdeflate_decompressor *d, offset_decode_results, OFFSET_TABLEBITS, DEFLATE_MAX_OFFSET_CODEWORD_LEN, - d->sorted_syms); -} - -static forceinline machine_word_t -repeat_byte(u8 b) -{ - machine_word_t v; - - STATIC_ASSERT(WORDBITS == 32 || WORDBITS == 64); - - v = b; - v |= v << 8; - v |= v << 16; - v |= v << ((WORDBITS == 64) ? 32 : 0); - return v; -} - -static forceinline void -copy_word_unaligned(const void *src, void *dst) -{ - store_word_unaligned(load_word_unaligned(src), dst); + d->sorted_syms, + NULL); } /***************************************************************************** @@ -1037,12 +1061,15 @@ copy_word_unaligned(const void *src, void *dst) *****************************************************************************/ typedef enum libdeflate_result (*decompress_func_t) - (struct libdeflate_decompressor *d, - const void *in, size_t in_nbytes, void *out, size_t out_nbytes_avail, + (struct libdeflate_decompressor * restrict d, + const void * restrict in, size_t in_nbytes, + void * restrict out, size_t out_nbytes_avail, size_t *actual_in_nbytes_ret, size_t *actual_out_nbytes_ret); #define FUNCNAME deflate_decompress_default -#define ATTRIBUTES +#undef ATTRIBUTES +#undef EXTRACT_VARBITS +#undef EXTRACT_VARBITS8 #include "decompress_template.h" /* Include architecture-specific implementation(s) if available. */ diff --git a/src/3rdparty/libdeflate/lib/x86/cpu_features.h b/src/3rdparty/libdeflate/lib/x86/cpu_features.h index fb7127081..796cfa8d9 100644 --- a/src/3rdparty/libdeflate/lib/x86/cpu_features.h +++ b/src/3rdparty/libdeflate/lib/x86/cpu_features.h @@ -156,6 +156,8 @@ typedef char __v64qi __attribute__((__vector_size__(64))); #define HAVE_BMI2_TARGET \ (HAVE_DYNAMIC_X86_CPU_FEATURES && \ (GCC_PREREQ(4, 7) || __has_builtin(__builtin_ia32_pdep_di))) +#define HAVE_BMI2_INTRIN \ + (HAVE_BMI2_NATIVE || (HAVE_BMI2_TARGET && HAVE_TARGET_INTRINSICS)) #endif /* __i386__ || __x86_64__ */ diff --git a/src/3rdparty/libdeflate/lib/x86/decompress_impl.h b/src/3rdparty/libdeflate/lib/x86/decompress_impl.h index 3c621da74..3dc189285 100644 --- a/src/3rdparty/libdeflate/lib/x86/decompress_impl.h +++ b/src/3rdparty/libdeflate/lib/x86/decompress_impl.h @@ -4,18 +4,46 @@ #include "cpu_features.h" /* BMI2 optimized version */ -#if HAVE_BMI2_TARGET && !HAVE_BMI2_NATIVE -# define FUNCNAME deflate_decompress_bmi2 -# define ATTRIBUTES __attribute__((target("bmi2"))) +#if HAVE_BMI2_INTRIN +# define deflate_decompress_bmi2 deflate_decompress_bmi2 +# define FUNCNAME deflate_decompress_bmi2 +# if !HAVE_BMI2_NATIVE +# define ATTRIBUTES __attribute__((target("bmi2"))) +# endif + /* + * Even with __attribute__((target("bmi2"))), gcc doesn't reliably use the + * bzhi instruction for 'word & BITMASK(count)'. So use the bzhi intrinsic + * explicitly. EXTRACT_VARBITS() is equivalent to 'word & BITMASK(count)'; + * EXTRACT_VARBITS8() is equivalent to 'word & BITMASK((u8)count)'. + * Nevertheless, their implementation using the bzhi intrinsic is identical, + * as the bzhi instruction truncates the count to 8 bits implicitly. 
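+ * (Concretely: bzhi reads only bits 7:0 of its index operand, so e.g.
+ * _bzhi_u64(word, 0x108) and _bzhi_u64(word, 8) both extract the low 8
+ * bits, which is exactly the '(u8)count' behavior wanted here.)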
+ */ +# ifndef __clang__ +# include +# ifdef __x86_64__ +# define EXTRACT_VARBITS(word, count) _bzhi_u64((word), (count)) +# define EXTRACT_VARBITS8(word, count) _bzhi_u64((word), (count)) +# else +# define EXTRACT_VARBITS(word, count) _bzhi_u32((word), (count)) +# define EXTRACT_VARBITS8(word, count) _bzhi_u32((word), (count)) +# endif +# endif # include "../decompress_template.h" +#endif /* HAVE_BMI2_INTRIN */ + +#if defined(deflate_decompress_bmi2) && HAVE_BMI2_NATIVE +#define DEFAULT_IMPL deflate_decompress_bmi2 +#else static inline decompress_func_t arch_select_decompress_func(void) { +#ifdef deflate_decompress_bmi2 if (HAVE_BMI2(get_x86_cpu_features())) return deflate_decompress_bmi2; +#endif return NULL; } -# define arch_select_decompress_func arch_select_decompress_func +#define arch_select_decompress_func arch_select_decompress_func #endif #endif /* LIB_X86_DECOMPRESS_IMPL_H */ diff --git a/src/3rdparty/libdeflate/lib/x86/matchfinder_impl.h b/src/3rdparty/libdeflate/lib/x86/matchfinder_impl.h index 99fbebe8d..8433b9b10 100644 --- a/src/3rdparty/libdeflate/lib/x86/matchfinder_impl.h +++ b/src/3rdparty/libdeflate/lib/x86/matchfinder_impl.h @@ -28,7 +28,9 @@ #ifndef LIB_X86_MATCHFINDER_IMPL_H #define LIB_X86_MATCHFINDER_IMPL_H -#ifdef __AVX2__ +#include "cpu_features.h" + +#if HAVE_AVX2_NATIVE # include static forceinline void matchfinder_init_avx2(mf_pos_t *data, size_t size) @@ -73,7 +75,7 @@ matchfinder_rebase_avx2(mf_pos_t *data, size_t size) } #define matchfinder_rebase matchfinder_rebase_avx2 -#elif defined(__SSE2__) +#elif HAVE_SSE2_NATIVE # include static forceinline void matchfinder_init_sse2(mf_pos_t *data, size_t size) @@ -117,6 +119,6 @@ matchfinder_rebase_sse2(mf_pos_t *data, size_t size) } while (size != 0); } #define matchfinder_rebase matchfinder_rebase_sse2 -#endif /* __SSE2__ */ +#endif /* HAVE_SSE2_NATIVE */ #endif /* LIB_X86_MATCHFINDER_IMPL_H */ diff --git a/src/3rdparty/libdeflate/libdeflate.h b/src/3rdparty/libdeflate/libdeflate.h index a8b16b92a..ffe402e2a 100644 --- a/src/3rdparty/libdeflate/libdeflate.h +++ b/src/3rdparty/libdeflate/libdeflate.h @@ -10,33 +10,36 @@ extern "C" { #endif #define LIBDEFLATE_VERSION_MAJOR 1 -#define LIBDEFLATE_VERSION_MINOR 13 -#define LIBDEFLATE_VERSION_STRING "1.13" +#define LIBDEFLATE_VERSION_MINOR 14 +#define LIBDEFLATE_VERSION_STRING "1.14" #include #include /* - * On Windows, if you want to link to the DLL version of libdeflate, then - * #define LIBDEFLATE_DLL. Note that the calling convention is "cdecl". + * On Windows, you must define LIBDEFLATE_STATIC if you are linking to the + * static library version of libdeflate instead of the DLL. On other platforms, + * LIBDEFLATE_STATIC has no effect. */ -#ifdef LIBDEFLATE_DLL -# ifdef BUILDING_LIBDEFLATE -# define LIBDEFLATEEXPORT LIBEXPORT -# elif defined(_WIN32) || defined(__CYGWIN__) +#ifdef _WIN32 +# if defined(LIBDEFLATE_STATIC) +# define LIBDEFLATEEXPORT +# elif defined(BUILDING_LIBDEFLATE) +# define LIBDEFLATEEXPORT __declspec(dllexport) +# else # define LIBDEFLATEEXPORT __declspec(dllimport) # endif -#endif -#ifndef LIBDEFLATEEXPORT -# define LIBDEFLATEEXPORT +#else +# define LIBDEFLATEEXPORT __attribute__((visibility("default"))) #endif -#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && \ - defined(_WIN32) && !defined(_WIN64) +#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && defined(__i386__) /* - * On 32-bit Windows, gcc assumes 16-byte stack alignment but MSVC only 4. 
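
For illustration only (a consumer-side sketch, not part of this patch; the
include style is just an example), the LIBDEFLATE_STATIC convention introduced
above means a Windows application that links the static library defines that
macro before including the header, so LIBDEFLATEEXPORT does not expand to
__declspec(dllimport):

    /* Static linking on Windows: opt out of dllimport. */
    #define LIBDEFLATE_STATIC
    #include "libdeflate.h"

    /* A DLL consumer just includes the header; no extra define is needed. */
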
- * Realign the stack when entering libdeflate to avoid crashing in SSE/AVX - * code when called from an MSVC-compiled application. + * On i386, gcc assumes that the stack is 16-byte aligned at function entry. + * However, some compilers (e.g. MSVC) and programming languages (e.g. + * Delphi) only guarantee 4-byte alignment when calling functions. Work + * around this ABI incompatibility by realigning the stack pointer when + * entering libdeflate. This prevents crashes in SSE/AVX code. */ # define LIBDEFLATEAPI __attribute__((force_align_arg_pointer)) #else @@ -72,10 +75,35 @@ libdeflate_alloc_compressor(int compression_level); /* * libdeflate_deflate_compress() performs raw DEFLATE compression on a buffer of - * data. The function attempts to compress 'in_nbytes' bytes of data located at - * 'in' and write the results to 'out', which has space for 'out_nbytes_avail' - * bytes. The return value is the compressed size in bytes, or 0 if the data - * could not be compressed to 'out_nbytes_avail' bytes or fewer. + * data. It attempts to compress 'in_nbytes' bytes of data located at 'in' and + * write the results to 'out', which has space for 'out_nbytes_avail' bytes. + * The return value is the compressed size in bytes, or 0 if the data could not + * be compressed to 'out_nbytes_avail' bytes or fewer (but see note below). + * + * If compression is successful, then the output data is guaranteed to be a + * valid DEFLATE stream that decompresses to the input data. No other + * guarantees are made about the output data. Notably, different versions of + * libdeflate can produce different compressed data for the same uncompressed + * data, even at the same compression level. Do ***NOT*** do things like + * writing tests that compare compressed data to a golden output, as this can + * break when libdeflate is updated. (This property isn't specific to + * libdeflate; the same is true for zlib and other compression libraries too.) + * + * Note: due to a performance optimization, libdeflate_deflate_compress() + * currently needs a small amount of slack space at the end of the output + * buffer. As a result, it can't actually report compressed sizes very close to + * 'out_nbytes_avail'. This doesn't matter in real-world use cases, and + * libdeflate_deflate_compress_bound() already includes the slack space. + * However, it does mean that testing code that redundantly compresses data + * using an exact-sized output buffer won't work as might be expected: + * + * out_nbytes = libdeflate_deflate_compress(c, in, in_nbytes, out, + * libdeflate_deflate_compress_bound(in_nbytes)); + * // The following assertion will fail. + * assert(libdeflate_deflate_compress(c, in, in_nbytes, out, out_nbytes) != 0); + * + * To avoid this, either don't write tests like the above, or make sure to + * include at least 9 bytes of slack space in 'out_nbytes_avail'. */ LIBDEFLATEEXPORT size_t LIBDEFLATEAPI libdeflate_deflate_compress(struct libdeflate_compressor *compressor,