update bundled libdeflate to v1.14

Signed-off-by: Ivailo Monev <xakepa10@gmail.com>
Ivailo Monev 2022-09-23 13:26:45 +03:00
parent c50a283a39
commit a7a9b5317b
12 changed files with 793 additions and 447 deletions

View file

@ -1,2 +1,2 @@
This is Git checkout 72b2ce0d28970b1affc31efeb86daeffee1d7410
This is Git checkout 18d6cc22b75643ec52111efeb27a22b9d860a982
from https://github.com/ebiggers/libdeflate that has not been modified.

View file

@ -27,11 +27,13 @@ For the release notes, see the [NEWS file](NEWS.md).
## Table of Contents
- [Building](#building)
- [For UNIX](#for-unix)
- [For macOS](#for-macos)
- [For Windows](#for-windows)
- [Using Cygwin](#using-cygwin)
- [Using MSYS2](#using-msys2)
- [Using the Makefile](#using-the-makefile)
- [For UNIX](#for-unix)
- [For macOS](#for-macos)
- [For Windows](#for-windows)
- [Using Cygwin](#using-cygwin)
- [Using MSYS2](#using-msys2)
- [Using a custom build system](#using-a-custom-build-system)
- [API](#api)
- [Bindings for other programming languages](#bindings-for-other-programming-languages)
- [DEFLATE vs. zlib vs. gzip](#deflate-vs-zlib-vs-gzip)
@ -42,7 +44,14 @@ For the release notes, see the [NEWS file](NEWS.md).
# Building
## For UNIX
libdeflate and the provided programs like `gzip` can be built using the provided
Makefile. If only the library is needed, it can alternatively be easily
integrated into applications and built using any build system; see [Using a
custom build system](#using-a-custom-build-system).
## Using the Makefile
### For UNIX
Just run `make`, then (if desired) `make install`. You need GNU Make and either
GCC or Clang. GCC is recommended because it builds slightly faster binaries.
@ -57,7 +66,7 @@ There are also many options which can be set on the `make` command line, e.g. to
omit library features or to customize the directories into which `make install`
installs files. See the Makefile for details.
## For macOS
### For macOS
Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh):
@ -65,7 +74,7 @@ Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh):
But if you need to build the binaries yourself, see the section for UNIX above.
## For Windows
### For Windows
Prebuilt Windows binaries can be downloaded from
https://github.com/ebiggers/libdeflate/releases. But if you need to build the
@ -84,7 +93,7 @@ binaries built with MinGW will be significantly faster.
Also note that 64-bit binaries are faster than 32-bit binaries and should be
preferred whenever possible.
### Using Cygwin
#### Using Cygwin
Run the Cygwin installer, available from https://cygwin.com/setup-x86_64.exe.
When you get to the package selection screen, choose the following additional
@ -119,7 +128,7 @@ or to build 32-bit binaries:
make CC=i686-w64-mingw32-gcc
### Using MSYS2
#### Using MSYS2
Run the MSYS2 installer, available from http://www.msys2.org/. After
installing, open an MSYS2 shell and run:
@ -161,6 +170,23 @@ and run the following commands:
Or to build 32-bit binaries, do the same but use "MSYS2 MinGW 32-bit" instead.
## Using a custom build system
The source files of the library are designed to be compilable directly, without
any prerequisite step like running a `./configure` script. Therefore, as an
alternative to building the library using the provided Makefile, the library
source files can be easily integrated directly into your application and built
using any build system.
You should compile both `lib/*.c` and `lib/*/*.c`. You don't need to worry
about excluding irrelevant architecture-specific code, as this is already
handled in the source files themselves using `#ifdef`s.
It is **strongly** recommended to use either gcc or clang, and to use `-O2`.
If you are doing a freestanding build with `-ffreestanding`, you must add
`-DFREESTANDING` as well, otherwise performance will suffer greatly.
# API
libdeflate has a simple API that is not zlib-compatible. You can create
@ -183,10 +209,7 @@ guessing. However, libdeflate's decompression routines do optionally provide
the actual number of output bytes in case you need it.
Windows developers: note that the calling convention of libdeflate.dll is
"stdcall" -- the same as the Win32 API. If you call into libdeflate.dll using a
non-C/C++ language, or dynamically using LoadLibrary(), make sure to use the
stdcall convention. Using the wrong convention may crash your application.
(Note: older versions of libdeflate used the "cdecl" convention instead.)
"cdecl". (libdeflate v1.4 through v1.12 used "stdcall" instead.)
# Bindings for other programming languages

View file

@ -144,8 +144,17 @@ typedef size_t machine_word_t;
/* restrict - hint that writes only occur through the given pointer */
#ifdef __GNUC__
# define restrict __restrict__
#elif defined(_MSC_VER)
/*
* Don't use MSVC's __restrict; it has nonstandard behavior.
* Standard restrict is okay, if it is supported.
*/
# if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L)
# define restrict restrict
# else
# define restrict
# endif
#else
/* Don't use MSVC's __restrict; it has nonstandard behavior. */
# define restrict
#endif
@ -200,6 +209,7 @@ typedef size_t machine_word_t;
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define STATIC_ASSERT(expr) ((void)sizeof(char[1 - 2 * !(expr)]))
#define ALIGN(n, a) (((n) + (a) - 1) & ~((a) - 1))
#define ROUND_UP(n, d) ((d) * DIV_ROUND_UP((n), (d)))
/* ========================================================================== */
/* Endianness handling */
@ -513,8 +523,10 @@ bsr32(u32 v)
#ifdef __GNUC__
return 31 - __builtin_clz(v);
#elif defined(_MSC_VER)
_BitScanReverse(&v, v);
return v;
unsigned long i;
_BitScanReverse(&i, v);
return i;
#else
unsigned i = 0;
@ -529,9 +541,11 @@ bsr64(u64 v)
{
#ifdef __GNUC__
return 63 - __builtin_clzll(v);
#elif defined(_MSC_VER) && defined(_M_X64)
_BitScanReverse64(&v, v);
return v;
#elif defined(_MSC_VER) && defined(_WIN64)
unsigned long i;
_BitScanReverse64(&i, v);
return i;
#else
unsigned i = 0;
@ -563,8 +577,10 @@ bsf32(u32 v)
#ifdef __GNUC__
return __builtin_ctz(v);
#elif defined(_MSC_VER)
_BitScanForward(&v, v);
return v;
unsigned long i;
_BitScanForward(&i, v);
return i;
#else
unsigned i = 0;
@ -579,9 +595,11 @@ bsf64(u64 v)
{
#ifdef __GNUC__
return __builtin_ctzll(v);
#elif defined(_MSC_VER) && defined(_M_X64)
_BitScanForward64(&v, v);
return v;
#elif defined(_MSC_VER) && defined(_WIN64)
unsigned long i;
_BitScanForward64(&i, v);
return i;
#else
unsigned i = 0;

View file

@ -28,7 +28,9 @@
#ifndef LIB_ARM_MATCHFINDER_IMPL_H
#define LIB_ARM_MATCHFINDER_IMPL_H
#ifdef __ARM_NEON
#include "cpu_features.h"
#if HAVE_NEON_NATIVE
# include <arm_neon.h>
static forceinline void
matchfinder_init_neon(mf_pos_t *data, size_t size)
@ -81,6 +83,6 @@ matchfinder_rebase_neon(mf_pos_t *data, size_t size)
}
#define matchfinder_rebase matchfinder_rebase_neon
#endif /* __ARM_NEON */
#endif /* HAVE_NEON_NATIVE */
#endif /* LIB_ARM_MATCHFINDER_IMPL_H */

View file

@ -31,7 +31,17 @@
* target instruction sets.
*/
static enum libdeflate_result ATTRIBUTES
#ifndef ATTRIBUTES
# define ATTRIBUTES
#endif
#ifndef EXTRACT_VARBITS
# define EXTRACT_VARBITS(word, count) ((word) & BITMASK(count))
#endif
#ifndef EXTRACT_VARBITS8
# define EXTRACT_VARBITS8(word, count) ((word) & BITMASK((u8)(count)))
#endif
static enum libdeflate_result ATTRIBUTES MAYBE_UNUSED
FUNCNAME(struct libdeflate_decompressor * restrict d,
const void * restrict in, size_t in_nbytes,
void * restrict out, size_t out_nbytes_avail,
@ -41,35 +51,36 @@ FUNCNAME(struct libdeflate_decompressor * restrict d,
u8 * const out_end = out_next + out_nbytes_avail;
u8 * const out_fastloop_end =
out_end - MIN(out_nbytes_avail, FASTLOOP_MAX_BYTES_WRITTEN);
/* Input bitstream state; see deflate_decompress.c for documentation */
const u8 *in_next = in;
const u8 * const in_end = in_next + in_nbytes;
const u8 * const in_fastloop_end =
in_end - MIN(in_nbytes, FASTLOOP_MAX_BYTES_READ);
bitbuf_t bitbuf = 0;
bitbuf_t saved_bitbuf;
machine_word_t bitsleft = 0;
u32 bitsleft = 0;
size_t overread_count = 0;
unsigned i;
bool is_final_block;
unsigned block_type;
u16 len;
u16 nlen;
unsigned num_litlen_syms;
unsigned num_offset_syms;
bitbuf_t tmpbits;
bitbuf_t litlen_tablemask;
u32 entry;
next_block:
/* Starting to read the next block */
;
STATIC_ASSERT(CAN_ENSURE(1 + 2 + 5 + 5 + 4 + 3));
STATIC_ASSERT(CAN_CONSUME(1 + 2 + 5 + 5 + 4 + 3));
REFILL_BITS();
/* BFINAL: 1 bit */
is_final_block = POP_BITS(1);
is_final_block = bitbuf & BITMASK(1);
/* BTYPE: 2 bits */
block_type = POP_BITS(2);
block_type = (bitbuf >> 1) & BITMASK(2);
if (block_type == DEFLATE_BLOCKTYPE_DYNAMIC_HUFFMAN) {
@ -81,17 +92,18 @@ next_block:
};
unsigned num_explicit_precode_lens;
unsigned i;
/* Read the codeword length counts. */
STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == ((1 << 5) - 1) + 257);
num_litlen_syms = POP_BITS(5) + 257;
STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == 257 + BITMASK(5));
num_litlen_syms = 257 + ((bitbuf >> 3) & BITMASK(5));
STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == ((1 << 5) - 1) + 1);
num_offset_syms = POP_BITS(5) + 1;
STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == 1 + BITMASK(5));
num_offset_syms = 1 + ((bitbuf >> 8) & BITMASK(5));
STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == ((1 << 4) - 1) + 4);
num_explicit_precode_lens = POP_BITS(4) + 4;
STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == 4 + BITMASK(4));
num_explicit_precode_lens = 4 + ((bitbuf >> 13) & BITMASK(4));
d->static_codes_loaded = false;
@ -103,16 +115,31 @@ next_block:
* merge one len with the previous fields.
*/
STATIC_ASSERT(DEFLATE_MAX_PRE_CODEWORD_LEN == (1 << 3) - 1);
if (CAN_ENSURE(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) {
d->u.precode_lens[deflate_precode_lens_permutation[0]] = POP_BITS(3);
if (CAN_CONSUME(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) {
d->u.precode_lens[deflate_precode_lens_permutation[0]] =
(bitbuf >> 17) & BITMASK(3);
bitbuf >>= 20;
bitsleft -= 20;
REFILL_BITS();
for (i = 1; i < num_explicit_precode_lens; i++)
d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3);
i = 1;
do {
d->u.precode_lens[deflate_precode_lens_permutation[i]] =
bitbuf & BITMASK(3);
bitbuf >>= 3;
bitsleft -= 3;
} while (++i < num_explicit_precode_lens);
} else {
for (i = 0; i < num_explicit_precode_lens; i++) {
ENSURE_BITS(3);
d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3);
}
bitbuf >>= 17;
bitsleft -= 17;
i = 0;
do {
if ((u8)bitsleft < 3)
REFILL_BITS();
d->u.precode_lens[deflate_precode_lens_permutation[i]] =
bitbuf & BITMASK(3);
bitbuf >>= 3;
bitsleft -= 3;
} while (++i < num_explicit_precode_lens);
}
for (; i < DEFLATE_NUM_PRECODE_SYMS; i++)
d->u.precode_lens[deflate_precode_lens_permutation[i]] = 0;
@ -121,13 +148,14 @@ next_block:
SAFETY_CHECK(build_precode_decode_table(d));
/* Decode the litlen and offset codeword lengths. */
for (i = 0; i < num_litlen_syms + num_offset_syms; ) {
u32 entry;
i = 0;
do {
unsigned presym;
u8 rep_val;
unsigned rep_count;
ENSURE_BITS(DEFLATE_MAX_PRE_CODEWORD_LEN + 7);
if ((u8)bitsleft < DEFLATE_MAX_PRE_CODEWORD_LEN + 7)
REFILL_BITS();
/*
* The code below assumes that the precode decode table
@ -135,9 +163,11 @@ next_block:
*/
STATIC_ASSERT(PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN);
/* Read the next precode symbol. */
entry = d->u.l.precode_decode_table[BITS(DEFLATE_MAX_PRE_CODEWORD_LEN)];
REMOVE_BITS((u8)entry);
/* Decode the next precode symbol. */
entry = d->u.l.precode_decode_table[
bitbuf & BITMASK(DEFLATE_MAX_PRE_CODEWORD_LEN)];
bitbuf >>= (u8)entry;
bitsleft -= entry; /* optimization: subtract full entry */
presym = entry >> 16;
if (presym < 16) {
@ -171,8 +201,10 @@ next_block:
/* Repeat the previous length 3 - 6 times. */
SAFETY_CHECK(i != 0);
rep_val = d->u.l.lens[i - 1];
STATIC_ASSERT(3 + ((1 << 2) - 1) == 6);
rep_count = 3 + POP_BITS(2);
STATIC_ASSERT(3 + BITMASK(2) == 6);
rep_count = 3 + (bitbuf & BITMASK(2));
bitbuf >>= 2;
bitsleft -= 2;
d->u.l.lens[i + 0] = rep_val;
d->u.l.lens[i + 1] = rep_val;
d->u.l.lens[i + 2] = rep_val;
@ -182,8 +214,10 @@ next_block:
i += rep_count;
} else if (presym == 17) {
/* Repeat zero 3 - 10 times. */
STATIC_ASSERT(3 + ((1 << 3) - 1) == 10);
rep_count = 3 + POP_BITS(3);
STATIC_ASSERT(3 + BITMASK(3) == 10);
rep_count = 3 + (bitbuf & BITMASK(3));
bitbuf >>= 3;
bitsleft -= 3;
d->u.l.lens[i + 0] = 0;
d->u.l.lens[i + 1] = 0;
d->u.l.lens[i + 2] = 0;
@ -197,20 +231,39 @@ next_block:
i += rep_count;
} else {
/* Repeat zero 11 - 138 times. */
STATIC_ASSERT(11 + ((1 << 7) - 1) == 138);
rep_count = 11 + POP_BITS(7);
STATIC_ASSERT(11 + BITMASK(7) == 138);
rep_count = 11 + (bitbuf & BITMASK(7));
bitbuf >>= 7;
bitsleft -= 7;
memset(&d->u.l.lens[i], 0,
rep_count * sizeof(d->u.l.lens[i]));
i += rep_count;
}
}
} while (i < num_litlen_syms + num_offset_syms);
} else if (block_type == DEFLATE_BLOCKTYPE_UNCOMPRESSED) {
u16 len, nlen;
/*
* Uncompressed block: copy 'len' bytes literally from the input
* buffer to the output buffer.
*/
ALIGN_INPUT();
bitsleft -= 3; /* for BTYPE and BFINAL */
/*
* Align the bitstream to the next byte boundary. This means
* the next byte boundary as if we were reading a byte at a
* time. Therefore, we have to rewind 'in_next' by any bytes
* that have been refilled but not actually consumed yet (not
* counting overread bytes, which don't increment 'in_next').
*/
bitsleft = (u8)bitsleft;
SAFETY_CHECK(overread_count <= (bitsleft >> 3));
in_next -= (bitsleft >> 3) - overread_count;
overread_count = 0;
bitbuf = 0;
bitsleft = 0;
SAFETY_CHECK(in_end - in_next >= 4);
len = get_unaligned_le16(in_next);
@ -229,6 +282,8 @@ next_block:
goto block_done;
} else {
unsigned i;
SAFETY_CHECK(block_type == DEFLATE_BLOCKTYPE_STATIC_HUFFMAN);
/*
@ -241,6 +296,9 @@ next_block:
* dynamic Huffman block.
*/
bitbuf >>= 3; /* for BTYPE and BFINAL */
bitsleft -= 3;
if (d->static_codes_loaded)
goto have_decode_tables;
@ -270,186 +328,344 @@ next_block:
SAFETY_CHECK(build_offset_decode_table(d, num_litlen_syms, num_offset_syms));
SAFETY_CHECK(build_litlen_decode_table(d, num_litlen_syms, num_offset_syms));
have_decode_tables:
litlen_tablemask = BITMASK(d->litlen_tablebits);
/*
* This is the "fastloop" for decoding literals and matches. It does
* bounds checks on in_next and out_next in the loop conditions so that
* additional bounds checks aren't needed inside the loop body.
*
* To reduce latency, the bitbuffer is refilled and the next litlen
* decode table entry is preloaded before each loop iteration.
*/
while (in_next < in_fastloop_end && out_next < out_fastloop_end) {
u32 entry, length, offset;
u8 lit;
if (in_next >= in_fastloop_end || out_next >= out_fastloop_end)
goto generic_loop;
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
do {
u32 length, offset, lit;
const u8 *src;
u8 *dst;
/* Refill the bitbuffer and decode a litlen symbol. */
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
preloaded:
if (CAN_ENSURE(3 * LITLEN_TABLEBITS +
DEFLATE_MAX_LITLEN_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_LENGTH_BITS) &&
(entry & HUFFDEC_LITERAL)) {
/*
* Consume the bits for the litlen decode table entry. Save the
* original bitbuf for later, in case the extra match length
* bits need to be extracted from it.
*/
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry; /* optimization: subtract full entry */
/*
* Begin by checking for a "fast" literal, i.e. a literal that
* doesn't need a subtable.
*/
if (entry & HUFFDEC_LITERAL) {
/*
* 64-bit only: fast path for decoding literals that
* don't need subtables. We do up to 3 of these before
* proceeding to the general case. This is the largest
* number of times that LITLEN_TABLEBITS bits can be
* extracted from a refilled 64-bit bitbuffer while
* still leaving enough bits to decode any match length.
* On 64-bit platforms, we decode up to 2 extra fast
* literals in addition to the primary item, as this
* increases performance and still leaves enough bits
* remaining for what follows. We could actually do 3,
* assuming LITLEN_TABLEBITS=11, but that actually
* decreases performance slightly (perhaps by messing
* with the branch prediction of the conditional refill
* that happens later while decoding the match offset).
*
* Note: the definitions of FASTLOOP_MAX_BYTES_WRITTEN
* and FASTLOOP_MAX_BYTES_READ need to be updated if the
* maximum number of literals decoded here is changed.
* number of extra literals decoded here is changed.
*/
REMOVE_ENTRY_BITS_FAST(entry);
lit = entry >> 16;
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
*out_next++ = lit;
if (entry & HUFFDEC_LITERAL) {
REMOVE_ENTRY_BITS_FAST(entry);
if (/* enough bits for 2 fast literals + length + offset preload? */
CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS +
LENGTH_MAXBITS,
OFFSET_TABLEBITS) &&
/* enough bits for 2 fast literals + slow literal + litlen preload? */
CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS +
DEFLATE_MAX_LITLEN_CODEWORD_LEN,
LITLEN_TABLEBITS)) {
/* 1st extra fast literal */
lit = entry >> 16;
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
*out_next++ = lit;
if (entry & HUFFDEC_LITERAL) {
REMOVE_ENTRY_BITS_FAST(entry);
/* 2nd extra fast literal */
lit = entry >> 16;
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
*out_next++ = lit;
if (entry & HUFFDEC_LITERAL) {
/*
* Another fast literal, but
* this one is in lieu of the
* primary item, so it doesn't
* count as one of the extras.
*/
lit = entry >> 16;
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
*out_next++ = lit;
continue;
}
}
} else {
/*
* Decode a literal. While doing so, preload
* the next litlen decode table entry and refill
* the bitbuffer. To reduce latency, we've
* arranged for there to be enough "preloadable"
* bits remaining to do the table preload
* independently of the refill.
*/
STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD(
LITLEN_TABLEBITS, LITLEN_TABLEBITS));
lit = entry >> 16;
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
*out_next++ = lit;
continue;
}
}
/*
* It's not a literal entry, so it can be a length entry, a
* subtable pointer entry, or an end-of-block entry. Detect the
* two unlikely cases by testing the HUFFDEC_EXCEPTIONAL flag.
*/
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
/* Subtable pointer or end-of-block entry */
if (entry & HUFFDEC_SUBTABLE_POINTER) {
REMOVE_BITS(LITLEN_TABLEBITS);
entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)];
}
SAVE_BITBUF();
REMOVE_ENTRY_BITS_FAST(entry);
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
goto block_done;
/* Literal or length entry, from a subtable */
} else {
/* Literal or length entry, from the main table */
SAVE_BITBUF();
REMOVE_ENTRY_BITS_FAST(entry);
}
length = entry >> 16;
if (entry & HUFFDEC_LITERAL) {
/*
* Literal that didn't get handled by the literal fast
* path earlier
* A subtable is required. Load and consume the
* subtable entry. The subtable entry can be of any
* type: literal, length, or end-of-block.
*/
*out_next++ = length;
continue;
}
/*
* Match length. Finish decoding it. We don't need to check
* for too-long matches here, as this is inside the fastloop
* where it's already been verified that the output buffer has
* enough space remaining to copy a max-length match.
*/
length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8);
entry = d->u.litlen_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
/* Decode the match offset. */
/* Refill the bitbuffer if it may be needed for the offset. */
if (unlikely(GET_REAL_BITSLEFT() <
DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS))
REFILL_BITS_IN_FASTLOOP();
STATIC_ASSERT(CAN_ENSURE(OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS));
STATIC_ASSERT(CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN -
OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS));
entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)];
if (entry & HUFFDEC_EXCEPTIONAL) {
/* Offset codeword requires a subtable */
REMOVE_BITS(OFFSET_TABLEBITS);
entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)];
/*
* On 32-bit, we might not be able to decode the offset
* symbol and extra offset bits without refilling the
* bitbuffer in between. However, this is only an issue
* when a subtable is needed, so do the refill here.
* 32-bit platforms that use the byte-at-a-time refill
* method have to do a refill here for there to always
* be enough bits to decode a literal that requires a
* subtable, then preload the next litlen decode table
* entry; or to decode a match length that requires a
* subtable, then preload the offset decode table entry.
*/
if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS))
if (!CAN_CONSUME_AND_THEN_PRELOAD(DEFLATE_MAX_LITLEN_CODEWORD_LEN,
LITLEN_TABLEBITS) ||
!CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXBITS,
OFFSET_TABLEBITS))
REFILL_BITS_IN_FASTLOOP();
if (entry & HUFFDEC_LITERAL) {
/* Decode a literal that required a subtable. */
lit = entry >> 16;
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
*out_next++ = lit;
continue;
}
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
goto block_done;
/* Else, it's a length that required a subtable. */
}
SAVE_BITBUF();
REMOVE_ENTRY_BITS_FAST(entry);
offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8));
/*
* Decode the match length: the length base value associated
* with the litlen symbol (which we extract from the decode
* table entry), plus the extra length bits. We don't need to
* consume the extra length bits here, as they were included in
* the bits consumed by the entry earlier. We also don't need
* to check for too-long matches here, as this is inside the
* fastloop where it's already been verified that the output
* buffer has enough space remaining to copy a max-length match.
*/
length = entry >> 16;
length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
/*
* Decode the match offset. There are enough "preloadable" bits
* remaining to preload the offset decode table entry, but a
* refill might be needed before consuming it.
*/
STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXFASTBITS,
OFFSET_TABLEBITS));
entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)];
if (CAN_CONSUME_AND_THEN_PRELOAD(OFFSET_MAXBITS,
LITLEN_TABLEBITS)) {
/*
* Decoding a match offset on a 64-bit platform. We may
* need to refill once, but then we can decode the whole
* offset and preload the next litlen table entry.
*/
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
/* Offset codeword requires a subtable */
if (unlikely((u8)bitsleft < OFFSET_MAXBITS +
LITLEN_TABLEBITS - PRELOAD_SLACK))
REFILL_BITS_IN_FASTLOOP();
bitbuf >>= OFFSET_TABLEBITS;
bitsleft -= OFFSET_TABLEBITS;
entry = d->offset_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
} else if (unlikely((u8)bitsleft < OFFSET_MAXFASTBITS +
LITLEN_TABLEBITS - PRELOAD_SLACK))
REFILL_BITS_IN_FASTLOOP();
} else {
/* Decoding a match offset on a 32-bit platform */
REFILL_BITS_IN_FASTLOOP();
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
/* Offset codeword requires a subtable */
bitbuf >>= OFFSET_TABLEBITS;
bitsleft -= OFFSET_TABLEBITS;
entry = d->offset_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
REFILL_BITS_IN_FASTLOOP();
/* No further refill needed before extra bits */
STATIC_ASSERT(CAN_CONSUME(
OFFSET_MAXBITS - OFFSET_TABLEBITS));
} else {
/* No refill needed before extra bits */
STATIC_ASSERT(CAN_CONSUME(OFFSET_MAXFASTBITS));
}
}
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry; /* optimization: subtract full entry */
offset = entry >> 16;
offset += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
/* Validate the match offset; needed even in the fastloop. */
SAFETY_CHECK(offset <= out_next - (const u8 *)out);
src = out_next - offset;
dst = out_next;
out_next += length;
/*
* Before starting to copy the match, refill the bitbuffer and
* preload the litlen decode table entry for the next loop
* iteration. This can increase performance by allowing the
* latency of the two operations to overlap.
* Before starting to issue the instructions to copy the match,
* refill the bitbuffer and preload the litlen decode table
* entry for the next loop iteration. This can increase
* performance by allowing the latency of the match copy to
* overlap with these other operations. To further reduce
* latency, we've arranged for there to be enough bits remaining
* to do the table preload independently of the refill, except
* on 32-bit platforms using the byte-at-a-time refill method.
*/
if (!CAN_CONSUME_AND_THEN_PRELOAD(
MAX(OFFSET_MAXBITS - OFFSET_TABLEBITS,
OFFSET_MAXFASTBITS),
LITLEN_TABLEBITS) &&
unlikely((u8)bitsleft < LITLEN_TABLEBITS - PRELOAD_SLACK))
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
REFILL_BITS_IN_FASTLOOP();
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
/*
* Copy the match. On most CPUs the fastest method is a
* word-at-a-time copy, unconditionally copying at least 3 words
* word-at-a-time copy, unconditionally copying about 5 words
* since this is enough for most matches without being too much.
*
* The normal word-at-a-time copy works for offset >= WORDBYTES,
* which is most cases. The case of offset == 1 is also common
* and is worth optimizing for, since it is just RLE encoding of
* the previous byte, which is the result of compressing long
* runs of the same byte. We currently don't optimize for the
* less common cases of offset > 1 && offset < WORDBYTES; we
* just fall back to a traditional byte-at-a-time copy for them.
* runs of the same byte.
*
* Writing past the match 'length' is allowed here, since it's
* been ensured there is enough output space left for a slight
* overrun. FASTLOOP_MAX_BYTES_WRITTEN needs to be updated if
* the maximum possible overrun here is changed.
*/
src = out_next - offset;
dst = out_next;
out_next += length;
if (UNALIGNED_ACCESS_IS_FAST && offset >= WORDBYTES) {
copy_word_unaligned(src, dst);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
copy_word_unaligned(src, dst);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
do {
copy_word_unaligned(src, dst);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
while (dst < out_next) {
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
} while (dst < out_next);
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDBYTES;
dst += WORDBYTES;
}
} else if (UNALIGNED_ACCESS_IS_FAST && offset == 1) {
machine_word_t v = repeat_byte(*src);
machine_word_t v;
/*
* This part tends to get auto-vectorized, so keep it
* copying a multiple of 16 bytes at a time.
*/
v = (machine_word_t)0x0101010101010101 * src[0];
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
do {
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
while (dst < out_next) {
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
store_word_unaligned(v, dst);
dst += WORDBYTES;
}
} else if (UNALIGNED_ACCESS_IS_FAST) {
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
do {
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
store_word_unaligned(load_word_unaligned(src), dst);
src += offset;
dst += offset;
} while (dst < out_next);
} else {
STATIC_ASSERT(DEFLATE_MIN_MATCH_LEN == 3);
*dst++ = *src++;
*dst++ = *src++;
do {
*dst++ = *src++;
} while (dst < out_next);
}
if (in_next < in_fastloop_end && out_next < out_fastloop_end)
goto preloaded;
break;
}
/* MASK_BITSLEFT() is needed when leaving the fastloop. */
MASK_BITSLEFT();
} while (in_next < in_fastloop_end && out_next < out_fastloop_end);
/*
* This is the generic loop for decoding literals and matches. This
@ -458,19 +674,24 @@ preloaded:
* critical, as most time is spent in the fastloop above instead. We
* therefore omit some optimizations here in favor of smaller code.
*/
generic_loop:
for (;;) {
u32 entry, length, offset;
u32 length, offset;
const u8 *src;
u8 *dst;
REFILL_BITS();
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
if (unlikely(entry & HUFFDEC_SUBTABLE_POINTER)) {
REMOVE_BITS(LITLEN_TABLEBITS);
entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)];
entry = d->u.litlen_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
saved_bitbuf = bitbuf;
bitbuf >>= (u8)entry;
bitsleft -= entry;
}
SAVE_BITBUF();
REMOVE_BITS((u8)entry);
length = entry >> 16;
if (entry & HUFFDEC_LITERAL) {
if (unlikely(out_next == out_end))
@ -480,34 +701,27 @@ preloaded:
}
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
goto block_done;
length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8);
length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
if (unlikely(length > out_end - out_next))
return LIBDEFLATE_INSUFFICIENT_SPACE;
if (CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS)) {
ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS);
} else {
ENSURE_BITS(OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS);
if (!CAN_CONSUME(LENGTH_MAXBITS + OFFSET_MAXBITS))
REFILL_BITS();
entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)];
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
bitbuf >>= OFFSET_TABLEBITS;
bitsleft -= OFFSET_TABLEBITS;
entry = d->offset_decode_table[(entry >> 16) +
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
if (!CAN_CONSUME(OFFSET_MAXBITS))
REFILL_BITS();
}
entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)];
if (entry & HUFFDEC_EXCEPTIONAL) {
REMOVE_BITS(OFFSET_TABLEBITS);
entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)];
if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
DEFLATE_MAX_EXTRA_OFFSET_BITS))
ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN -
OFFSET_TABLEBITS +
DEFLATE_MAX_EXTRA_OFFSET_BITS);
}
SAVE_BITBUF();
REMOVE_BITS((u8)entry);
offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8));
offset = entry >> 16;
offset += EXTRACT_VARBITS8(bitbuf, entry) >> (u8)(entry >> 8);
bitbuf >>= (u8)entry;
bitsleft -= entry;
SAFETY_CHECK(offset <= out_next - (const u8 *)out);
src = out_next - offset;
dst = out_next;
out_next += length;
@ -521,9 +735,6 @@ preloaded:
}
block_done:
/* MASK_BITSLEFT() is needed when leaving the fastloop. */
MASK_BITSLEFT();
/* Finished decoding a block */
if (!is_final_block)
@ -531,12 +742,21 @@ block_done:
/* That was the last block. */
/* Discard any readahead bits and check for excessive overread. */
ALIGN_INPUT();
bitsleft = (u8)bitsleft;
/*
* If any of the implicit appended zero bytes were consumed (not just
* refilled) before hitting end of stream, then the data is bad.
*/
SAFETY_CHECK(overread_count <= (bitsleft >> 3));
/* Optionally return the actual number of bytes consumed. */
if (actual_in_nbytes_ret) {
/* Don't count bytes that were refilled but not consumed. */
in_next -= (bitsleft >> 3) - overread_count;
/* Optionally return the actual number of bytes read. */
if (actual_in_nbytes_ret)
*actual_in_nbytes_ret = in_next - (u8 *)in;
}
/* Optionally return the actual number of bytes written. */
if (actual_out_nbytes_ret) {
@ -550,3 +770,5 @@ block_done:
#undef FUNCNAME
#undef ATTRIBUTES
#undef EXTRACT_VARBITS
#undef EXTRACT_VARBITS8

View file

@ -475,7 +475,7 @@ struct deflate_output_bitstream;
struct libdeflate_compressor {
/* Pointer to the compress() implementation chosen at allocation time */
void (*impl)(struct libdeflate_compressor *c, const u8 *in,
void (*impl)(struct libdeflate_compressor *restrict c, const u8 *in,
size_t in_nbytes, struct deflate_output_bitstream *os);
/* The compression level with which this compressor was created */
@ -1041,7 +1041,6 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[],
unsigned parent = A[node] >> NUM_SYMBOL_BITS;
unsigned parent_depth = A[parent] >> NUM_SYMBOL_BITS;
unsigned depth = parent_depth + 1;
unsigned len = depth;
/*
* Set the depth of this node so that it is available when its
@ -1054,19 +1053,19 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[],
* constraint. This is not the optimal method for generating
* length-limited Huffman codes! But it should be good enough.
*/
if (len >= max_codeword_len) {
len = max_codeword_len;
if (depth >= max_codeword_len) {
depth = max_codeword_len;
do {
len--;
} while (len_counts[len] == 0);
depth--;
} while (len_counts[depth] == 0);
}
/*
* Account for the fact that we have a non-leaf node at the
* current depth.
*/
len_counts[len]--;
len_counts[len + 1] += 2;
len_counts[depth]--;
len_counts[depth + 1] += 2;
}
}
@ -1189,11 +1188,9 @@ gen_codewords(u32 A[], u8 lens[], const unsigned len_counts[],
(next_codewords[len - 1] + len_counts[len - 1]) << 1;
for (sym = 0; sym < num_syms; sym++) {
u8 len = lens[sym];
u32 codeword = next_codewords[len]++;
/* DEFLATE requires bit-reversed codewords. */
A[sym] = reverse_codeword(codeword, len);
A[sym] = reverse_codeword(next_codewords[lens[sym]]++,
lens[sym]);
}
}

View file

@ -49,11 +49,8 @@
/*
* Maximum number of extra bits that may be required to represent a match
* length or offset.
*
* TODO: are we going to have full DEFLATE64 support? If so, up to 16
* length bits must be supported.
*/
#define DEFLATE_MAX_EXTRA_LENGTH_BITS 5
#define DEFLATE_MAX_EXTRA_OFFSET_BITS 14
#define DEFLATE_MAX_EXTRA_OFFSET_BITS 13
#endif /* LIB_DEFLATE_CONSTANTS_H */

View file

@ -27,14 +27,14 @@
* ---------------------------------------------------------------------------
*
* This is a highly optimized DEFLATE decompressor. It is much faster than
* zlib, typically more than twice as fast, though results vary by CPU.
* vanilla zlib, typically well over twice as fast, though results vary by CPU.
*
* Why this is faster than zlib's implementation:
* Why this is faster than vanilla zlib:
*
* - Word accesses rather than byte accesses when reading input
* - Word accesses rather than byte accesses when copying matches
* - Faster Huffman decoding combined with various DEFLATE-specific tricks
* - Larger bitbuffer variable that doesn't need to be filled as often
* - Larger bitbuffer variable that doesn't need to be refilled as often
* - Other optimizations to remove unnecessary branches
* - Only full-buffer decompression is supported, so the code doesn't need to
* support stopping and resuming decompression.
@ -71,22 +71,32 @@
/*
* The state of the "input bitstream" consists of the following variables:
*
* - in_next: pointer to the next unread byte in the input buffer
* - in_next: a pointer to the next unread byte in the input buffer
*
* - in_end: pointer just past the end of the input buffer
* - in_end: a pointer to just past the end of the input buffer
*
* - bitbuf: a word-sized variable containing bits that have been read from
* the input buffer. The buffered bits are right-aligned
* (they're the low-order bits).
* the input buffer or from the implicit appended zero bytes
*
* - bitsleft: number of bits in 'bitbuf' that are valid. NOTE: in the
* fastloop, bits 8 and above of bitsleft can contain garbage.
* - bitsleft: the number of bits in 'bitbuf' available to be consumed.
* After REFILL_BITS_BRANCHLESS(), 'bitbuf' can actually
* contain more bits than this. However, only the bits counted
* by 'bitsleft' can actually be consumed; the rest can only be
* used for preloading.
*
* - overread_count: number of implicit 0 bytes past 'in_end' that have
* been loaded into the bitbuffer
* As a micro-optimization, we allow bits 8 and higher of
* 'bitsleft' to contain garbage. When consuming the bits
* associated with a decode table entry, this allows us to do
* 'bitsleft -= entry' instead of 'bitsleft -= (u8)entry'.
* On some CPUs, this helps reduce instruction dependencies.
* This does have the disadvantage that 'bitsleft' sometimes
* needs to be cast to 'u8', such as when it's used as a shift
* amount in REFILL_BITS_BRANCHLESS(). But that one happens
* for free since most CPUs ignore high bits in shift amounts.
*
* For performance reasons, these variables are declared as standalone variables
* and are manipulated using macros, rather than being packed into a struct.
* - overread_count: the total number of implicit appended zero bytes that
* have been loaded into the bitbuffer, including any
* counted by 'bitsleft' and any already consumed
*/
/*
@ -97,60 +107,92 @@
* which they don't have to refill as often.
*/
typedef machine_word_t bitbuf_t;
#define BITBUF_NBITS (8 * (int)sizeof(bitbuf_t))
/* BITMASK(n) returns a bitmask of length 'n'. */
#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1)
/*
* BITBUF_NBITS is the number of bits the bitbuffer variable can hold. See
* REFILL_BITS_WORDWISE() for why this is 1 less than the obvious value.
* MAX_BITSLEFT is the maximum number of consumable bits, i.e. the maximum value
* of '(u8)bitsleft'. This is the size of the bitbuffer variable, minus 1 if
* the branchless refill method is being used (see REFILL_BITS_BRANCHLESS()).
*/
#define BITBUF_NBITS (8 * sizeof(bitbuf_t) - 1)
#define MAX_BITSLEFT \
(UNALIGNED_ACCESS_IS_FAST ? BITBUF_NBITS - 1 : BITBUF_NBITS)
/*
* REFILL_GUARANTEED_NBITS is the number of bits that are guaranteed in the
* bitbuffer variable after refilling it with ENSURE_BITS(n), REFILL_BITS(), or
* REFILL_BITS_IN_FASTLOOP(). There might be up to BITBUF_NBITS bits; however,
* since only whole bytes can be added, only 'BITBUF_NBITS - 7' bits are
* guaranteed. That is the smallest amount where another byte doesn't fit.
* CONSUMABLE_NBITS is the minimum number of bits that are guaranteed to be
* consumable (counted in 'bitsleft') immediately after refilling the bitbuffer.
* Since only whole bytes can be added to 'bitsleft', the worst case is
* 'MAX_BITSLEFT - 7': the smallest amount where another byte doesn't fit.
*/
#define REFILL_GUARANTEED_NBITS (BITBUF_NBITS - 7)
#define CONSUMABLE_NBITS (MAX_BITSLEFT - 7)
/*
* CAN_ENSURE(n) evaluates to true if the bitbuffer variable is guaranteed to
* contain at least 'n' bits after a refill. See REFILL_GUARANTEED_NBITS.
*
* This can be used to choose between alternate refill strategies based on the
* size of the bitbuffer variable. 'n' should be a compile-time constant.
* FASTLOOP_PRELOADABLE_NBITS is the minimum number of bits that are guaranteed
* to be preloadable immediately after REFILL_BITS_IN_FASTLOOP(). (It is *not*
* guaranteed after REFILL_BITS(), since REFILL_BITS() falls back to a
* byte-at-a-time refill method near the end of input.) This may exceed the
* number of consumable bits (counted by 'bitsleft'). Any bits not counted in
* 'bitsleft' can only be used for precomputation and cannot be consumed.
*/
#define CAN_ENSURE(n) ((n) <= REFILL_GUARANTEED_NBITS)
#define FASTLOOP_PRELOADABLE_NBITS \
(UNALIGNED_ACCESS_IS_FAST ? BITBUF_NBITS : CONSUMABLE_NBITS)
/*
* REFILL_BITS_WORDWISE() branchlessly refills the bitbuffer variable by reading
* the next word from the input buffer and updating 'in_next' and 'bitsleft'
* based on how many bits were refilled -- counting whole bytes only. This is
* much faster than reading a byte at a time, at least if the CPU is little
* endian and supports fast unaligned memory accesses.
* PRELOAD_SLACK is the minimum number of bits that are guaranteed to be
* preloadable but not consumable, following REFILL_BITS_IN_FASTLOOP() and any
* subsequent consumptions. This is 1 bit if the branchless refill method is
* being used, and 0 bits otherwise.
*/
#define PRELOAD_SLACK MAX(0, FASTLOOP_PRELOADABLE_NBITS - MAX_BITSLEFT)
/*
* CAN_CONSUME(n) is true if it's guaranteed that if the bitbuffer has just been
* refilled, then it's always possible to consume 'n' bits from it. 'n' should
* be a compile-time constant, to enable compile-time evaluation.
*/
#define CAN_CONSUME(n) (CONSUMABLE_NBITS >= (n))
/*
* CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) is true if it's
* guaranteed that after REFILL_BITS_IN_FASTLOOP(), it's always possible to
* consume 'consume_nbits' bits, then preload 'preload_nbits' bits. The
* arguments should be compile-time constants to enable compile-time evaluation.
*/
#define CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) \
(CONSUMABLE_NBITS >= (consume_nbits) && \
FASTLOOP_PRELOADABLE_NBITS >= (consume_nbits) + (preload_nbits))
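
To make these bit-budget macros concrete, here is an illustrative evaluation for two representative configurations (the numbers are not part of the upstream source; they simply follow from the definitions above):

64-bit build with the branchless refill (BITBUF_NBITS = 64):
    MAX_BITSLEFT               = 64 - 1          = 63
    CONSUMABLE_NBITS           = 63 - 7          = 56
    FASTLOOP_PRELOADABLE_NBITS = BITBUF_NBITS    = 64
    PRELOAD_SLACK              = MAX(0, 64 - 63) = 1
    CAN_CONSUME(n)                     holds for n <= 56
    CAN_CONSUME_AND_THEN_PRELOAD(c, p) holds for c <= 56 and c + p <= 64

32-bit build without fast unaligned access (BITBUF_NBITS = 32):
    MAX_BITSLEFT = 32, CONSUMABLE_NBITS = 25,
    FASTLOOP_PRELOADABLE_NBITS = 25, PRELOAD_SLACK = 0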
/*
* REFILL_BITS_BRANCHLESS() branchlessly refills the bitbuffer variable by
* reading the next word from the input buffer and updating 'in_next' and
* 'bitsleft' based on how many bits were refilled -- counting whole bytes only.
* This is much faster than reading a byte at a time, at least if the CPU is
* little endian and supports fast unaligned memory accesses.
*
* The simplest way of branchlessly updating 'bitsleft' would be:
*
* bitsleft += (BITBUF_NBITS - bitsleft) & ~7;
* bitsleft += (MAX_BITSLEFT - bitsleft) & ~7;
*
* To make it faster, we define BITBUF_NBITS to be 'WORDBITS - 1' rather than
* To make it faster, we define MAX_BITSLEFT to be 'WORDBITS - 1' rather than
* WORDBITS, so that in binary it looks like 111111 or 11111. Then, we update
* 'bitsleft' just by setting the bits above the low 3 bits:
* 'bitsleft' by just setting the bits above the low 3 bits:
*
* bitsleft |= BITBUF_NBITS & ~7;
* bitsleft |= MAX_BITSLEFT & ~7;
*
* That compiles down to a single instruction like 'or $0x38, %rbp'. Using
* 'BITBUF_NBITS == WORDBITS - 1' also has the advantage that refills can be
* done when 'bitsleft == BITBUF_NBITS' without invoking undefined behavior.
* 'MAX_BITSLEFT == WORDBITS - 1' also has the advantage that refills can be
* done when 'bitsleft == MAX_BITSLEFT' without invoking undefined behavior.
*
* The simplest way of branchlessly updating 'in_next' would be:
*
* in_next += (BITBUF_NBITS - bitsleft) >> 3;
* in_next += (MAX_BITSLEFT - bitsleft) >> 3;
*
* With 'BITBUF_NBITS == WORDBITS - 1' we could use an XOR instead, though this
* With 'MAX_BITSLEFT == WORDBITS - 1' we could use an XOR instead, though this
* isn't really better:
*
* in_next += (BITBUF_NBITS ^ bitsleft) >> 3;
* in_next += (MAX_BITSLEFT ^ bitsleft) >> 3;
*
* An alternative which can be marginally better is the following:
*
@ -162,22 +204,23 @@ typedef machine_word_t bitbuf_t;
* extraction instruction (e.g. arm's ubfx), it stays at 3, and is potentially
* more efficient because the length of the longest dependency chain decreases
* from 3 to 2. This alternative also has the advantage that it ignores the
* high bits in 'bitsleft', so it is compatible with the fastloop optimization
* (described later) where we let the high bits of 'bitsleft' contain garbage.
* high bits in 'bitsleft', so it is compatible with the micro-optimization we
* use where we let the high bits of 'bitsleft' contain garbage.
*/
#define REFILL_BITS_WORDWISE() \
do { \
bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft;\
in_next += sizeof(bitbuf_t) - 1; \
in_next -= (bitsleft >> 3) & 0x7; \
bitsleft |= BITBUF_NBITS & ~7; \
#define REFILL_BITS_BRANCHLESS() \
do { \
bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft; \
in_next += sizeof(bitbuf_t) - 1; \
in_next -= (bitsleft >> 3) & 0x7; \
bitsleft |= MAX_BITSLEFT & ~7; \
} while (0)
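
As a quick sanity check of the arithmetic above, the following standalone snippet (illustrative only, and assuming a 64-bit bitbuf_t; it merely mirrors the 'in_next'/'bitsleft' bookkeeping of REFILL_BITS_BRANCHLESS() without touching a real input buffer) shows how many bytes are consumed and what 'bitsleft' becomes for a few starting values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const unsigned max_bitsleft = 8 * sizeof(uint64_t) - 1;  /* 63 */
	const unsigned starts[] = { 0, 3, 9 };  /* (u8)bitsleft before the refill */
	unsigned i;

	for (i = 0; i < 3; i++) {
		unsigned bitsleft = starts[i];
		/* bytes by which 'in_next' advances: 7 - (bitsleft / 8) */
		unsigned advance = (unsigned)(sizeof(uint64_t) - 1) -
				   ((bitsleft >> 3) & 0x7);

		bitsleft |= max_bitsleft & ~7u;  /* the branchless update */
		printf("start=%u: in_next advances %u bytes, bitsleft becomes %u\n",
		       starts[i], advance, bitsleft);
	}
	return 0;
}

Running it prints 56, 59, and 57 for the three starting values, i.e. each refill credits exactly 8 bits of 'bitsleft' per byte by which 'in_next' advances, which is the invariant the refill logic relies on.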
/*
* REFILL_BITS() loads bits from the input buffer until the bitbuffer variable
* contains at least REFILL_GUARANTEED_NBITS bits.
* contains at least CONSUMABLE_NBITS consumable bits.
*
* This checks for the end of input, and it cannot be used in the fastloop.
* This checks for the end of input, and it doesn't guarantee
* FASTLOOP_PRELOADABLE_NBITS, so it can't be used in the fastloop.
*
* If we would overread the input buffer, we just don't read anything, leaving
* the bits zeroed but marking them filled. This simplifies the decompressor
@ -196,121 +239,66 @@ do { \
*/
#define REFILL_BITS() \
do { \
if (CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST && \
if (UNALIGNED_ACCESS_IS_FAST && \
likely(in_end - in_next >= sizeof(bitbuf_t))) { \
REFILL_BITS_WORDWISE(); \
REFILL_BITS_BRANCHLESS(); \
} else { \
while (bitsleft < REFILL_GUARANTEED_NBITS) { \
while ((u8)bitsleft < CONSUMABLE_NBITS) { \
if (likely(in_next != in_end)) { \
bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \
bitbuf |= (bitbuf_t)*in_next++ << \
(u8)bitsleft; \
} else { \
overread_count++; \
SAFETY_CHECK(overread_count <= sizeof(bitbuf)); \
} \
SAFETY_CHECK(overread_count <= \
sizeof(bitbuf_t)); \
} \
bitsleft += 8; \
} \
} \
} while (0)
/* ENSURE_BITS(n) calls REFILL_BITS() if fewer than 'n' bits are buffered. */
#define ENSURE_BITS(n) \
/*
* REFILL_BITS_IN_FASTLOOP() is like REFILL_BITS(), but it doesn't check for the
* end of the input. It can only be used in the fastloop.
*/
#define REFILL_BITS_IN_FASTLOOP() \
do { \
if (bitsleft < (n)) \
REFILL_BITS(); \
STATIC_ASSERT(UNALIGNED_ACCESS_IS_FAST || \
FASTLOOP_PRELOADABLE_NBITS == CONSUMABLE_NBITS); \
if (UNALIGNED_ACCESS_IS_FAST) { \
REFILL_BITS_BRANCHLESS(); \
} else { \
while ((u8)bitsleft < CONSUMABLE_NBITS) { \
bitbuf |= (bitbuf_t)*in_next++ << (u8)bitsleft; \
bitsleft += 8; \
} \
} \
} while (0)
#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1)
/* BITS(n) returns the next 'n' buffered bits without removing them. */
#define BITS(n) (bitbuf & BITMASK(n))
/* Macros to save the value of the bitbuffer variable and use it later. */
#define SAVE_BITBUF() (saved_bitbuf = bitbuf)
#define SAVED_BITS(n) (saved_bitbuf & BITMASK(n))
/* REMOVE_BITS(n) removes the next 'n' buffered bits. */
#define REMOVE_BITS(n) (bitbuf >>= (n), bitsleft -= (n))
/* POP_BITS(n) removes and returns the next 'n' buffered bits. */
#define POP_BITS(n) (tmpbits = BITS(n), REMOVE_BITS(n), tmpbits)
/*
* ALIGN_INPUT() verifies that the input buffer hasn't been overread, then
* aligns the bitstream to the next byte boundary, discarding any unused bits in
* the current byte.
*
* Note that if the bitbuffer variable currently contains more than 7 bits, then
* we must rewind 'in_next', effectively putting those bits back. Only the bits
* in what would be the "current" byte if we were reading one byte at a time can
* be actually discarded.
*/
#define ALIGN_INPUT() \
do { \
SAFETY_CHECK(overread_count <= (bitsleft >> 3)); \
in_next -= (bitsleft >> 3) - overread_count; \
overread_count = 0; \
bitbuf = 0; \
bitsleft = 0; \
} while(0)
/*
* Macros used in the "fastloop": the loop that decodes literals and matches
* while there is still plenty of space left in the input and output buffers.
*
* In the fastloop, we improve performance by skipping redundant bounds checks.
* On platforms where it helps, we also use an optimization where we allow bits
* 8 and higher of 'bitsleft' to contain garbage. This is sometimes a useful
* microoptimization because it means the whole 32-bit decode table entry can be
* subtracted from 'bitsleft' without an intermediate step to convert it to 8
* bits. (It still needs to be converted to 8 bits for the shift of 'bitbuf',
* but most CPUs ignore high bits in shift amounts, so that happens implicitly
* with zero overhead.) REMOVE_ENTRY_BITS_FAST() implements this optimization.
*
* MASK_BITSLEFT() is used to clear the garbage bits when leaving the fastloop.
*/
#if CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST
# define REFILL_BITS_IN_FASTLOOP() REFILL_BITS_WORDWISE()
# define REMOVE_ENTRY_BITS_FAST(entry) (bitbuf >>= (u8)entry, bitsleft -= entry)
# define GET_REAL_BITSLEFT() ((u8)bitsleft)
# define MASK_BITSLEFT() (bitsleft &= 0xFF)
#else
# define REFILL_BITS_IN_FASTLOOP() \
while (bitsleft < REFILL_GUARANTEED_NBITS) { \
bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \
bitsleft += 8; \
}
# define REMOVE_ENTRY_BITS_FAST(entry) REMOVE_BITS((u8)entry)
# define GET_REAL_BITSLEFT() bitsleft
# define MASK_BITSLEFT()
#endif
/*
* This is the worst-case maximum number of output bytes that are written to
* during each iteration of the fastloop. The worst case is 3 literals, then a
* match of length DEFLATE_MAX_MATCH_LEN. The match length must be rounded up
* to a word boundary due to the word-at-a-time match copy implementation.
* during each iteration of the fastloop. The worst case is 2 literals, then a
* match of length DEFLATE_MAX_MATCH_LEN. Additionally, some slack space must
* be included for the intentional overrun in the match copy implementation.
*/
#define FASTLOOP_MAX_BYTES_WRITTEN \
(3 + ALIGN(DEFLATE_MAX_MATCH_LEN, WORDBYTES))
(2 + DEFLATE_MAX_MATCH_LEN + (5 * WORDBYTES) - 1)
/*
* This is the worst-case maximum number of input bytes that are read during
* each iteration of the fastloop. To get this value, we first compute the
* greatest number of bits that can be refilled during a loop iteration. The
* refill at the beginning can add at most BITBUF_NBITS, and the amount that can
* refill at the beginning can add at most MAX_BITSLEFT, and the amount that can
* be refilled later is no more than the maximum amount that can be consumed by
* 3 literals that don't need a subtable, then a match. We convert this value
* to bytes, rounding up. Finally, we added sizeof(bitbuf_t) to account for
* REFILL_BITS_WORDWISE() reading up to a word past the part really used.
* 2 literals that don't need a subtable, then a match. We convert this value
* to bytes, rounding up; this gives the maximum number of bytes that 'in_next'
* can be advanced. Finally, we add sizeof(bitbuf_t) to account for
* REFILL_BITS_BRANCHLESS() reading a word past 'in_next'.
*/
#define FASTLOOP_MAX_BYTES_READ \
(DIV_ROUND_UP(BITBUF_NBITS + \
((3 * LITLEN_TABLEBITS) + \
DEFLATE_MAX_LITLEN_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_LENGTH_BITS + \
DEFLATE_MAX_OFFSET_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_OFFSET_BITS), 8) + \
sizeof(bitbuf_t))
(DIV_ROUND_UP(MAX_BITSLEFT + (2 * LITLEN_TABLEBITS) + \
LENGTH_MAXBITS + OFFSET_MAXBITS, 8) + \
sizeof(bitbuf_t))
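
For a concrete sense of scale, plugging in the values of a typical 64-bit build (an illustrative assumption: MAX_BITSLEFT = 63, WORDBYTES = 8, LITLEN_TABLEBITS = 11, LENGTH_MAXBITS = 20, OFFSET_MAXBITS = 28, DEFLATE_MAX_MATCH_LEN = 258) gives:

    FASTLOOP_MAX_BYTES_WRITTEN = 2 + 258 + (5 * 8) - 1 = 299
    FASTLOOP_MAX_BYTES_READ    = DIV_ROUND_UP(63 + 2*11 + 20 + 28, 8) + 8
                               = DIV_ROUND_UP(133, 8) + 8 = 17 + 8 = 25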
/*****************************************************************************
* Huffman decoding *
@ -358,24 +346,37 @@ do { \
* take longer, which decreases performance. We choose values that work well in
* practice, making subtables rarely needed without making the tables too large.
*
* Our choice of OFFSET_TABLEBITS == 8 is a bit low; without any special
* considerations, 9 would fit the trade-off curve better. However, there is a
* performance benefit to using exactly 8 bits when it is a compile-time
* constant, as many CPUs can take the low byte more easily than the low 9 bits.
*
* zlib treats its equivalents of TABLEBITS as maximum values; whenever it
* builds a table, it caps the actual table_bits to the longest codeword. This
* makes sense in theory, as there's no need for the table to be any larger than
* needed to support the longest codeword. However, having the table bits be a
* compile-time constant is beneficial to the performance of the decode loop, so
* there is a trade-off. libdeflate currently uses the dynamic table_bits
* strategy for the litlen table only, due to its larger maximum size.
* PRECODE_TABLEBITS and OFFSET_TABLEBITS are smaller, so going dynamic there
* isn't as useful, and OFFSET_TABLEBITS=8 is useful as mentioned above.
*
* Each TABLEBITS value has a corresponding ENOUGH value that gives the
* worst-case maximum number of decode table entries, including the main table
* and all subtables. The ENOUGH value depends on three parameters:
*
* (1) the maximum number of symbols in the code (DEFLATE_NUM_*_SYMS)
* (2) the number of main table bits (the corresponding TABLEBITS value)
* (2) the maximum number of main table bits (*_TABLEBITS)
* (3) the maximum allowed codeword length (DEFLATE_MAX_*_CODEWORD_LEN)
*
* The ENOUGH values were computed using the utility program 'enough' from zlib.
*/
#define PRECODE_TABLEBITS 7
#define PRECODE_ENOUGH 128 /* enough 19 7 7 */
#define LITLEN_TABLEBITS 11
#define LITLEN_ENOUGH 2342 /* enough 288 11 15 */
#define OFFSET_TABLEBITS 9
#define OFFSET_ENOUGH 594 /* enough 32 9 15 */
#define OFFSET_TABLEBITS 8
#define OFFSET_ENOUGH 402 /* enough 32 8 15 */
/*
* make_decode_table_entry() creates a decode table entry for the given symbol
@ -387,7 +388,7 @@ do { \
* appropriately-formatted decode table entry. See the definitions of the
* *_decode_results[] arrays below, where the entry format is described.
*/
static inline u32
static forceinline u32
make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len)
{
return decode_results[sym] + (len << 8) + len;
@ -398,8 +399,8 @@ make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len)
* described contain zeroes:
*
* Bit 20-16: presym
* Bit 10-8: codeword_len [not used]
* Bit 2-0: codeword_len
* Bit 10-8: codeword length [not used]
* Bit 2-0: codeword length
*
* The precode decode table never has subtables, since we use
* PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN.
@ -431,6 +432,12 @@ static const u32 precode_decode_results[] = {
/* Indicates an end-of-block entry in the litlen decode table */
#define HUFFDEC_END_OF_BLOCK 0x00002000
/* Maximum number of bits that can be consumed by decoding a match length */
#define LENGTH_MAXBITS (DEFLATE_MAX_LITLEN_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_LENGTH_BITS)
#define LENGTH_MAXFASTBITS (LITLEN_TABLEBITS /* no subtable needed */ + \
DEFLATE_MAX_EXTRA_LENGTH_BITS)
/*
* Here is the format of our litlen decode table entries. Bits not explicitly
* described contain zeroes:
@ -464,7 +471,8 @@ static const u32 precode_decode_results[] = {
* Bit 15: 1 (HUFFDEC_EXCEPTIONAL)
* Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER)
* Bit 13: 0 (!HUFFDEC_END_OF_BLOCK)
* Bit 3-0: number of subtable bits
* Bit 11-8: number of subtable bits
* Bit 3-0: number of main table bits
*
* This format has several desirable properties:
*
@ -481,20 +489,26 @@ static const u32 precode_decode_results[] = {
*
* - The low byte is the number of bits that need to be removed from the
* bitstream; this makes this value easily accessible, and it enables the
* optimization used in REMOVE_ENTRY_BITS_FAST(). It also includes the
* number of extra bits, so they don't need to be removed separately.
* micro-optimization of doing 'bitsleft -= entry' instead of
* 'bitsleft -= (u8)entry'. It also includes the number of extra bits,
* so they don't need to be removed separately.
*
* - The flags in bits 13-15 are arranged to be 0 when the number of
* non-extra bits (the value in bits 11-8) is needed, making this value
* - The flags in bits 15-13 are arranged to be 0 when the
* "remaining codeword length" in bits 11-8 is needed, making this value
* fairly easily accessible as well via a shift and downcast.
*
* - Similarly, bits 13-12 are 0 when the "subtable bits" in bits 11-8 are
* needed, making it possible to extract this value with '& 0x3F' rather
* than '& 0xF'. This value is only used as a shift amount, so this can
* save an 'and' instruction as the masking by 0x3F happens implicitly.
*
* litlen_decode_results[] contains the static part of the entry for each
* symbol. make_decode_table_entry() produces the final entries.
*/
static const u32 litlen_decode_results[] = {
/* Literals */
#define ENTRY(literal) (((u32)literal << 16) | HUFFDEC_LITERAL)
#define ENTRY(literal) (HUFFDEC_LITERAL | ((u32)literal << 16))
ENTRY(0) , ENTRY(1) , ENTRY(2) , ENTRY(3) ,
ENTRY(4) , ENTRY(5) , ENTRY(6) , ENTRY(7) ,
ENTRY(8) , ENTRY(9) , ENTRY(10) , ENTRY(11) ,
@ -578,6 +592,12 @@ static const u32 litlen_decode_results[] = {
#undef ENTRY
};
/* Maximum number of bits that can be consumed by decoding a match offset */
#define OFFSET_MAXBITS (DEFLATE_MAX_OFFSET_CODEWORD_LEN + \
DEFLATE_MAX_EXTRA_OFFSET_BITS)
#define OFFSET_MAXFASTBITS (OFFSET_TABLEBITS /* no subtable needed */ + \
DEFLATE_MAX_EXTRA_OFFSET_BITS)
/*
* Here is the format of our offset decode table entries. Bits not explicitly
* described contain zeroes:
@ -592,7 +612,8 @@ static const u32 litlen_decode_results[] = {
* Bit 31-16: index of start of subtable
* Bit 15: 1 (HUFFDEC_EXCEPTIONAL)
* Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER)
* Bit 3-0: number of subtable bits
* Bit 11-8: number of subtable bits
* Bit 3-0: number of main table bits
*
* These work the same way as the length entries and subtable pointer entries in
* the litlen decode table; see litlen_decode_results[] above.
@ -607,15 +628,20 @@ static const u32 offset_decode_results[] = {
ENTRY(257 , 7) , ENTRY(385 , 7) , ENTRY(513 , 8) , ENTRY(769 , 8) ,
ENTRY(1025 , 9) , ENTRY(1537 , 9) , ENTRY(2049 , 10) , ENTRY(3073 , 10) ,
ENTRY(4097 , 11) , ENTRY(6145 , 11) , ENTRY(8193 , 12) , ENTRY(12289 , 12) ,
ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(32769 , 14) , ENTRY(49153 , 14) ,
ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) ,
#undef ENTRY
};
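/*
* [Editorial illustration.]  Each ENTRY() above pairs an offset base with the
* number of extra bits that follow the codeword.  For example, ENTRY(24577, 13)
* covers offsets 24577 through 24577 + (2^13 - 1) = 32768, the maximum match
* offset DEFLATE allows.  The decoder adds the base from the entry's high bits
* to the extra bits it reads next; as noted for the litlen entries, the
* extra-bit count is folded into the entry's low byte, so a single
* 'bitsleft -= entry' removes the codeword and its extra bits together.
*/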
/*
* The main DEFLATE decompressor structure. Since this implementation only
* supports full buffer decompression, this structure does not store the entire
* decompression state, but rather only some arrays that are too large to
* comfortably allocate on the stack.
* The main DEFLATE decompressor structure. Since libdeflate only supports
* full-buffer decompression, this structure doesn't store the entire
* decompression state, most of which is in stack variables. Instead, this
* struct just contains the decode tables and some temporary arrays used for
* building them, as these are too large to comfortably allocate on the stack.
*
* Storing the decode tables in the decompressor struct also allows the decode
* tables for the static codes to be reused whenever two static Huffman blocks
* are decoded without an intervening dynamic block, even across streams.
*/
struct libdeflate_decompressor {
@ -648,6 +674,7 @@ struct libdeflate_decompressor {
u16 sorted_syms[DEFLATE_MAX_NUM_SYMS];
bool static_codes_loaded;
unsigned litlen_tablebits;
};
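/*
* [Editorial illustration, not part of the upstream source.]  The static-code
* reuse described above boils down to a pattern along these lines (schematic;
* the real field and helper names may differ):
*
*	if (block_type == DEFLATE_BLOCKTYPE_STATIC_HUFFMAN) {
*		if (!d->static_codes_loaded) {
*			// fill in the fixed codeword lengths defined by the
*			// DEFLATE spec, then:
*			build_litlen_decode_table(d, ...);
*			build_offset_decode_table(d, ...);
*			d->static_codes_loaded = true;
*		}
*		// else: the tables left over from the previous static block,
*		// possibly from an earlier stream, are still valid
*	}
*/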
/*
@ -678,11 +705,16 @@ struct libdeflate_decompressor {
* make the final decode table entries using make_decode_table_entry().
* @table_bits
* The log base-2 of the number of main table entries to use.
* If @table_bits_ret != NULL, then @table_bits is treated as a maximum
* value and it will be decreased if a smaller table would be sufficient.
* @max_codeword_len
* The maximum allowed codeword length for this Huffman code.
* Must be <= DEFLATE_MAX_CODEWORD_LEN.
* @sorted_syms
* A temporary array of length @num_syms.
* @table_bits_ret
* If non-NULL, then the dynamic table_bits is enabled, and the actual
* table_bits value will be returned here.
*
* Returns %true if successful; %false if the codeword lengths do not form a
* valid Huffman code.
@ -692,9 +724,10 @@ build_decode_table(u32 decode_table[],
const u8 lens[],
const unsigned num_syms,
const u32 decode_results[],
const unsigned table_bits,
const unsigned max_codeword_len,
u16 *sorted_syms)
unsigned table_bits,
unsigned max_codeword_len,
u16 *sorted_syms,
unsigned *table_bits_ret)
{
unsigned len_counts[DEFLATE_MAX_CODEWORD_LEN + 1];
unsigned offsets[DEFLATE_MAX_CODEWORD_LEN + 1];
@ -714,6 +747,17 @@ build_decode_table(u32 decode_table[],
for (sym = 0; sym < num_syms; sym++)
len_counts[lens[sym]]++;
/*
* Determine the actual maximum codeword length that was used, and
* decrease table_bits to it if allowed.
*/
while (max_codeword_len > 1 && len_counts[max_codeword_len] == 0)
max_codeword_len--;
if (table_bits_ret != NULL) {
table_bits = MIN(table_bits, max_codeword_len);
*table_bits_ret = table_bits;
}
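/*
* [Editorial illustration.]  For example, if table_bits came in as 11 (a
* typical LITLEN_TABLEBITS value) but the longest codeword actually used is
* only 6 bits, the loop above lowers max_codeword_len to 6, the main table
* shrinks from 2^11 = 2048 entries to 2^6 = 64, and the caller learns via
* *table_bits_ret that lookups must be masked with BITMASK(6) rather than
* BITMASK(11).
*/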
/*
* Sort the symbols primarily by increasing codeword length and
* secondarily by increasing symbol value; or equivalently by their
@ -919,16 +963,13 @@ build_decode_table(u32 decode_table[],
/*
* Create the entry that points from the main table to
* the subtable. This entry contains the index of the
* start of the subtable and the number of bits with
* which the subtable is indexed (the log base 2 of the
* number of entries it contains).
* the subtable.
*/
decode_table[subtable_prefix] =
((u32)subtable_start << 16) |
HUFFDEC_EXCEPTIONAL |
HUFFDEC_SUBTABLE_POINTER |
subtable_bits;
(subtable_bits << 8) | table_bits;
}
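/*
* [Editorial illustration, not part of the upstream source.]  When the decoder
* later hits such an entry, it can unpack both fields and follow the pointer
* roughly like this (variable names are schematic):
*
*	bitbuf >>= (u8)entry;		// drop the table_bits main-table bits
*	bitsleft -= (u8)entry;
*	entry = decode_table[(entry >> 16) +
*			     (bitbuf & BITMASK((entry >> 8) & 0x3F))];
*/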
/* Fill the subtable entries for the current codeword. */
@ -969,7 +1010,8 @@ build_precode_decode_table(struct libdeflate_decompressor *d)
precode_decode_results,
PRECODE_TABLEBITS,
DEFLATE_MAX_PRE_CODEWORD_LEN,
d->sorted_syms);
d->sorted_syms,
NULL);
}
/* Build the decode table for the literal/length code. */
@ -989,7 +1031,8 @@ build_litlen_decode_table(struct libdeflate_decompressor *d,
litlen_decode_results,
LITLEN_TABLEBITS,
DEFLATE_MAX_LITLEN_CODEWORD_LEN,
d->sorted_syms);
d->sorted_syms,
&d->litlen_tablebits);
}
/* Build the decode table for the offset code. */
@ -998,7 +1041,7 @@ build_offset_decode_table(struct libdeflate_decompressor *d,
unsigned num_litlen_syms, unsigned num_offset_syms)
{
/* When you change TABLEBITS, you must change ENOUGH, and vice versa! */
STATIC_ASSERT(OFFSET_TABLEBITS == 9 && OFFSET_ENOUGH == 594);
STATIC_ASSERT(OFFSET_TABLEBITS == 8 && OFFSET_ENOUGH == 402);
STATIC_ASSERT(ARRAY_LEN(offset_decode_results) ==
DEFLATE_NUM_OFFSET_SYMS);
@ -1009,27 +1052,8 @@ build_offset_decode_table(struct libdeflate_decompressor *d,
offset_decode_results,
OFFSET_TABLEBITS,
DEFLATE_MAX_OFFSET_CODEWORD_LEN,
d->sorted_syms);
}
static forceinline machine_word_t
repeat_byte(u8 b)
{
machine_word_t v;
STATIC_ASSERT(WORDBITS == 32 || WORDBITS == 64);
v = b;
v |= v << 8;
v |= v << 16;
v |= v << ((WORDBITS == 64) ? 32 : 0);
return v;
}
static forceinline void
copy_word_unaligned(const void *src, void *dst)
{
store_word_unaligned(load_word_unaligned(src), dst);
d->sorted_syms,
NULL);
}
/*****************************************************************************
@ -1037,12 +1061,15 @@ copy_word_unaligned(const void *src, void *dst)
*****************************************************************************/
typedef enum libdeflate_result (*decompress_func_t)
(struct libdeflate_decompressor *d,
const void *in, size_t in_nbytes, void *out, size_t out_nbytes_avail,
(struct libdeflate_decompressor * restrict d,
const void * restrict in, size_t in_nbytes,
void * restrict out, size_t out_nbytes_avail,
size_t *actual_in_nbytes_ret, size_t *actual_out_nbytes_ret);
#define FUNCNAME deflate_decompress_default
#define ATTRIBUTES
#undef ATTRIBUTES
#undef EXTRACT_VARBITS
#undef EXTRACT_VARBITS8
#include "decompress_template.h"
/* Include architecture-specific implementation(s) if available. */

@ -156,6 +156,8 @@ typedef char __v64qi __attribute__((__vector_size__(64)));
#define HAVE_BMI2_TARGET \
(HAVE_DYNAMIC_X86_CPU_FEATURES && \
(GCC_PREREQ(4, 7) || __has_builtin(__builtin_ia32_pdep_di)))
#define HAVE_BMI2_INTRIN \
(HAVE_BMI2_NATIVE || (HAVE_BMI2_TARGET && HAVE_TARGET_INTRINSICS))
#endif /* __i386__ || __x86_64__ */

@ -4,18 +4,46 @@
#include "cpu_features.h"
/* BMI2 optimized version */
#if HAVE_BMI2_TARGET && !HAVE_BMI2_NATIVE
# define FUNCNAME deflate_decompress_bmi2
# define ATTRIBUTES __attribute__((target("bmi2")))
#if HAVE_BMI2_INTRIN
# define deflate_decompress_bmi2 deflate_decompress_bmi2
# define FUNCNAME deflate_decompress_bmi2
# if !HAVE_BMI2_NATIVE
# define ATTRIBUTES __attribute__((target("bmi2")))
# endif
/*
* Even with __attribute__((target("bmi2"))), gcc doesn't reliably use the
* bzhi instruction for 'word & BITMASK(count)'. So use the bzhi intrinsic
* explicitly. EXTRACT_VARBITS() is equivalent to 'word & BITMASK(count)';
* EXTRACT_VARBITS8() is equivalent to 'word & BITMASK((u8)count)'.
* Nevertheless, their implementation using the bzhi intrinsic is identical,
* as the bzhi instruction truncates the count to 8 bits implicitly.
*/
# ifndef __clang__
# include <immintrin.h>
# ifdef __x86_64__
# define EXTRACT_VARBITS(word, count) _bzhi_u64((word), (count))
# define EXTRACT_VARBITS8(word, count) _bzhi_u64((word), (count))
# else
# define EXTRACT_VARBITS(word, count) _bzhi_u32((word), (count))
# define EXTRACT_VARBITS8(word, count) _bzhi_u32((word), (count))
# endif
# endif
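/*
* [Editorial illustration.]  Where these macros are left undefined, the
* template presumably falls back to the plain form described above, i.e.
* something like '(word) & BITMASK(count)'.  With BMI2, the same result comes
* from one instruction: _bzhi_u64(word, count) yields
* 'word & (((u64)1 << count) - 1)' for count < 64, with the count implicitly
* truncated to its low 8 bits.
*/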
# include "../decompress_template.h"
#endif /* HAVE_BMI2_INTRIN */
#if defined(deflate_decompress_bmi2) && HAVE_BMI2_NATIVE
#define DEFAULT_IMPL deflate_decompress_bmi2
#else
static inline decompress_func_t
arch_select_decompress_func(void)
{
#ifdef deflate_decompress_bmi2
if (HAVE_BMI2(get_x86_cpu_features()))
return deflate_decompress_bmi2;
#endif
return NULL;
}
# define arch_select_decompress_func arch_select_decompress_func
#define arch_select_decompress_func arch_select_decompress_func
#endif
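/*
* [Editorial illustration, not part of the upstream source.]  A caller of
* arch_select_decompress_func() would typically resolve the implementation
* once and cache it, along the lines of this hypothetical sketch:
*
*	static decompress_func_t cached_impl;	// hypothetical cache
*	...
*	decompress_func_t impl = cached_impl;
*	if (impl == NULL) {
*		impl = arch_select_decompress_func();
*		if (impl == NULL)
*			impl = deflate_decompress_default;
*		cached_impl = impl;
*	}
*	return impl(d, in, in_nbytes, out, out_nbytes_avail,
*		    actual_in_nbytes_ret, actual_out_nbytes_ret);
*/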
#endif /* LIB_X86_DECOMPRESS_IMPL_H */

@ -28,7 +28,9 @@
#ifndef LIB_X86_MATCHFINDER_IMPL_H
#define LIB_X86_MATCHFINDER_IMPL_H
#ifdef __AVX2__
#include "cpu_features.h"
#if HAVE_AVX2_NATIVE
# include <immintrin.h>
static forceinline void
matchfinder_init_avx2(mf_pos_t *data, size_t size)
@ -73,7 +75,7 @@ matchfinder_rebase_avx2(mf_pos_t *data, size_t size)
}
#define matchfinder_rebase matchfinder_rebase_avx2
#elif defined(__SSE2__)
#elif HAVE_SSE2_NATIVE
# include <emmintrin.h>
static forceinline void
matchfinder_init_sse2(mf_pos_t *data, size_t size)
@ -117,6 +119,6 @@ matchfinder_rebase_sse2(mf_pos_t *data, size_t size)
} while (size != 0);
}
#define matchfinder_rebase matchfinder_rebase_sse2
#endif /* __SSE2__ */
#endif /* HAVE_SSE2_NATIVE */
#endif /* LIB_X86_MATCHFINDER_IMPL_H */

@ -10,33 +10,36 @@ extern "C" {
#endif
#define LIBDEFLATE_VERSION_MAJOR 1
#define LIBDEFLATE_VERSION_MINOR 13
#define LIBDEFLATE_VERSION_STRING "1.13"
#define LIBDEFLATE_VERSION_MINOR 14
#define LIBDEFLATE_VERSION_STRING "1.14"
#include <stddef.h>
#include <stdint.h>
/*
* On Windows, if you want to link to the DLL version of libdeflate, then
* #define LIBDEFLATE_DLL. Note that the calling convention is "cdecl".
* On Windows, you must define LIBDEFLATE_STATIC if you are linking to the
* static library version of libdeflate instead of the DLL. On other platforms,
* LIBDEFLATE_STATIC has no effect.
*/
#ifdef LIBDEFLATE_DLL
# ifdef BUILDING_LIBDEFLATE
# define LIBDEFLATEEXPORT LIBEXPORT
# elif defined(_WIN32) || defined(__CYGWIN__)
#ifdef _WIN32
# if defined(LIBDEFLATE_STATIC)
# define LIBDEFLATEEXPORT
# elif defined(BUILDING_LIBDEFLATE)
# define LIBDEFLATEEXPORT __declspec(dllexport)
# else
# define LIBDEFLATEEXPORT __declspec(dllimport)
# endif
#endif
#ifndef LIBDEFLATEEXPORT
# define LIBDEFLATEEXPORT
#else
# define LIBDEFLATEEXPORT __attribute__((visibility("default")))
#endif
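/*
* [Editorial illustration, not part of the upstream header.]  For example, a
* Windows application linking against the static library would define
* LIBDEFLATE_STATIC before including this header (or pass -DLIBDEFLATE_STATIC
* on the compiler command line):
*
*	#define LIBDEFLATE_STATIC
*	#include <libdeflate.h>
*
* so that LIBDEFLATEEXPORT expands to nothing rather than
* __declspec(dllimport).
*/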
#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && \
defined(_WIN32) && !defined(_WIN64)
#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && defined(__i386__)
/*
* On 32-bit Windows, gcc assumes 16-byte stack alignment but MSVC only 4.
* Realign the stack when entering libdeflate to avoid crashing in SSE/AVX
* code when called from an MSVC-compiled application.
* On i386, gcc assumes that the stack is 16-byte aligned at function entry.
* However, some compilers (e.g. MSVC) and programming languages (e.g.
* Delphi) only guarantee 4-byte alignment when calling functions. Work
* around this ABI incompatibility by realigning the stack pointer when
* entering libdeflate. This prevents crashes in SSE/AVX code.
*/
# define LIBDEFLATEAPI __attribute__((force_align_arg_pointer))
#else
@ -72,10 +75,35 @@ libdeflate_alloc_compressor(int compression_level);
/*
* libdeflate_deflate_compress() performs raw DEFLATE compression on a buffer of
* data. The function attempts to compress 'in_nbytes' bytes of data located at
* 'in' and write the results to 'out', which has space for 'out_nbytes_avail'
* bytes. The return value is the compressed size in bytes, or 0 if the data
* could not be compressed to 'out_nbytes_avail' bytes or fewer.
* data. It attempts to compress 'in_nbytes' bytes of data located at 'in' and
* write the results to 'out', which has space for 'out_nbytes_avail' bytes.
* The return value is the compressed size in bytes, or 0 if the data could not
* be compressed to 'out_nbytes_avail' bytes or fewer (but see note below).
*
* If compression is successful, then the output data is guaranteed to be a
* valid DEFLATE stream that decompresses to the input data. No other
* guarantees are made about the output data. Notably, different versions of
* libdeflate can produce different compressed data for the same uncompressed
* data, even at the same compression level. Do ***NOT*** do things like
* writing tests that compare compressed data to a golden output, as this can
* break when libdeflate is updated. (This property isn't specific to
* libdeflate; the same is true for zlib and other compression libraries too.)
*
* Note: due to a performance optimization, libdeflate_deflate_compress()
* currently needs a small amount of slack space at the end of the output
* buffer. As a result, it can't actually report compressed sizes very close to
* 'out_nbytes_avail'. This doesn't matter in real-world use cases, and
* libdeflate_deflate_compress_bound() already includes the slack space.
* However, it does mean that testing code that redundantly compresses data
* using an exact-sized output buffer won't work as might be expected:
*
* out_nbytes = libdeflate_deflate_compress(c, in, in_nbytes, out,
* libdeflate_deflate_compress_bound(in_nbytes));
* // The following assertion will fail.
* assert(libdeflate_deflate_compress(c, in, in_nbytes, out, out_nbytes) != 0);
*
* To avoid this, either don't write tests like the above, or make sure to
* include at least 9 bytes of slack space in 'out_nbytes_avail'.
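*
* [Editorial illustration.]  The usual correct pattern is to size the output
* buffer with libdeflate_deflate_compress_bound() and compress only once,
* schematically (following the shorthand of the snippet above):
*
*	out_nbytes_avail = libdeflate_deflate_compress_bound(in_nbytes);
*	out = malloc(out_nbytes_avail);
*	out_nbytes = libdeflate_deflate_compress(c, in, in_nbytes,
*						 out, out_nbytes_avail);
*	// out_nbytes == 0 would mean the output didn't fit, which should not
*	// happen when the buffer was sized with the bound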
*/
LIBDEFLATEEXPORT size_t LIBDEFLATEAPI
libdeflate_deflate_compress(struct libdeflate_compressor *compressor,