Mirror of https://bitbucket.org/smil3y/katie.git (synced 2025-02-23 18:32:55 +00:00)

commit a7a9b5317b (parent c50a283a39)

    update bundled libdeflate to v1.14

    Signed-off-by: Ivailo Monev <xakepa10@gmail.com>

12 changed files with 793 additions and 447 deletions
src/3rdparty/libdeflate/NOTE (vendored): 2 changed lines
@@ -1,2 +1,2 @@
-This is Git checkout 72b2ce0d28970b1affc31efeb86daeffee1d7410
+This is Git checkout 18d6cc22b75643ec52111efeb27a22b9d860a982
 from https://github.com/ebiggers/libdeflate that has not been modified.
src/3rdparty/libdeflate/README.md (vendored): 51 changed lines
@@ -27,11 +27,13 @@ For the release notes, see the [NEWS file](NEWS.md).
 ## Table of Contents
 
 - [Building](#building)
-  - [For UNIX](#for-unix)
-  - [For macOS](#for-macos)
-  - [For Windows](#for-windows)
-    - [Using Cygwin](#using-cygwin)
-    - [Using MSYS2](#using-msys2)
+  - [Using the Makefile](#using-the-makefile)
+    - [For UNIX](#for-unix)
+    - [For macOS](#for-macos)
+    - [For Windows](#for-windows)
+      - [Using Cygwin](#using-cygwin)
+      - [Using MSYS2](#using-msys2)
+  - [Using a custom build system](#using-a-custom-build-system)
 - [API](#api)
 - [Bindings for other programming languages](#bindings-for-other-programming-languages)
 - [DEFLATE vs. zlib vs. gzip](#deflate-vs-zlib-vs-gzip)
@@ -42,7 +44,14 @@ For the release notes, see the [NEWS file](NEWS.md).
 
 # Building
 
-## For UNIX
+libdeflate and the provided programs like `gzip` can be built using the provided
+Makefile. If only the library is needed, it can alternatively be easily
+integrated into applications and built using any build system; see [Using a
+custom build system](#using-a-custom-build-system).
+
+## Using the Makefile
+
+### For UNIX
 
 Just run `make`, then (if desired) `make install`. You need GNU Make and either
 GCC or Clang. GCC is recommended because it builds slightly faster binaries.
@@ -57,7 +66,7 @@ There are also many options which can be set on the `make` command line, e.g. to
 omit library features or to customize the directories into which `make install`
 installs files. See the Makefile for details.
 
-## For macOS
+### For macOS
 
 Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh):
 
@@ -65,7 +74,7 @@ Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh):
 
 But if you need to build the binaries yourself, see the section for UNIX above.
 
-## For Windows
+### For Windows
 
 Prebuilt Windows binaries can be downloaded from
 https://github.com/ebiggers/libdeflate/releases. But if you need to build the
@@ -84,7 +93,7 @@ binaries built with MinGW will be significantly faster.
 Also note that 64-bit binaries are faster than 32-bit binaries and should be
 preferred whenever possible.
 
-### Using Cygwin
+#### Using Cygwin
 
 Run the Cygwin installer, available from https://cygwin.com/setup-x86_64.exe.
 When you get to the package selection screen, choose the following additional
@@ -119,7 +128,7 @@ or to build 32-bit binaries:
 
     make CC=i686-w64-mingw32-gcc
 
-### Using MSYS2
+#### Using MSYS2
 
 Run the MSYS2 installer, available from http://www.msys2.org/. After
 installing, open an MSYS2 shell and run:
@@ -161,6 +170,23 @@ and run the following commands:
 
 Or to build 32-bit binaries, do the same but use "MSYS2 MinGW 32-bit" instead.
 
+## Using a custom build system
+
+The source files of the library are designed to be compilable directly, without
+any prerequisite step like running a `./configure` script. Therefore, as an
+alternative to building the library using the provided Makefile, the library
+source files can be easily integrated directly into your application and built
+using any build system.
+
+You should compile both `lib/*.c` and `lib/*/*.c`. You don't need to worry
+about excluding irrelevant architecture-specific code, as this is already
+handled in the source files themselves using `#ifdef`s.
+
+It is **strongly** recommended to use either gcc or clang, and to use `-O2`.
+
+If you are doing a freestanding build with `-ffreestanding`, you must add
+`-DFREESTANDING` as well, otherwise performance will suffer greatly.
+
 # API
 
 libdeflate has a simple API that is not zlib-compatible. You can create
@@ -183,10 +209,7 @@ guessing. However, libdeflate's decompression routines do optionally provide
 the actual number of output bytes in case you need it.
 
 Windows developers: note that the calling convention of libdeflate.dll is
-"stdcall" -- the same as the Win32 API. If you call into libdeflate.dll using a
-non-C/C++ language, or dynamically using LoadLibrary(), make sure to use the
-stdcall convention. Using the wrong convention may crash your application.
-(Note: older versions of libdeflate used the "cdecl" convention instead.)
+"cdecl". (libdeflate v1.4 through v1.12 used "stdcall" instead.)
 
 # Bindings for other programming languages
 
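Since the README excerpt above only introduces the in-memory API, here is a minimal, hedged sketch of round-tripping a buffer with libdeflate's public functions (as declared in `libdeflate.h`); the buffer sizing and error handling are illustrative, not taken from this commit:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libdeflate.h>

int main(void)
{
	const char msg[] = "libdeflate round-trip example";
	size_t in_nbytes = sizeof(msg);

	/* Compress with a level-6 compressor (levels 0-12 are accepted). */
	struct libdeflate_compressor *c = libdeflate_alloc_compressor(6);
	size_t bound = libdeflate_deflate_compress_bound(c, in_nbytes);
	void *cbuf = malloc(bound);
	size_t csize = libdeflate_deflate_compress(c, msg, in_nbytes, cbuf, bound);
	if (csize == 0) { fprintf(stderr, "compression failed\n"); return 1; }

	/* Decompress; the caller must know (or guess) the original size. */
	struct libdeflate_decompressor *d = libdeflate_alloc_decompressor();
	char out[sizeof(msg)];
	size_t actual = 0;
	enum libdeflate_result res = libdeflate_deflate_decompress(
			d, cbuf, csize, out, sizeof(out), &actual);
	if (res != LIBDEFLATE_SUCCESS || actual != in_nbytes ||
	    memcmp(out, msg, in_nbytes) != 0) {
		fprintf(stderr, "decompression failed\n");
		return 1;
	}
	printf("OK: %zu -> %zu -> %zu bytes\n", in_nbytes, csize, actual);

	free(cbuf);
	libdeflate_free_compressor(c);
	libdeflate_free_decompressor(d);
	return 0;
}
```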
src/3rdparty/libdeflate/common_defs.h (vendored): 40 changed lines
@@ -144,8 +144,17 @@ typedef size_t machine_word_t;
 /* restrict - hint that writes only occur through the given pointer */
 #ifdef __GNUC__
 # define restrict __restrict__
+#elif defined(_MSC_VER)
+/*
+ * Don't use MSVC's __restrict; it has nonstandard behavior.
+ * Standard restrict is okay, if it is supported.
+ */
+# if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L)
+#  define restrict restrict
+# else
+#  define restrict
+# endif
 #else
-/* Don't use MSVC's __restrict; it has nonstandard behavior. */
 # define restrict
 #endif
 
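The bundled header defines `restrict` as a macro because the sources are also built in modes where the C99 keyword isn't usable directly. The snippet below is my own illustration, in plain C99, of what the hint promises the compiler; it is not taken from libdeflate:

```c
/* With restrict-qualified pointers the compiler may assume 'dst' and 'src'
 * never alias, so it can vectorize or reorder the loads and stores freely. */
#include <stdio.h>
#include <stddef.h>

static void add_bias(int *restrict dst, const int *restrict src,
		     size_t n, int bias)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i] + bias;
}

int main(void)
{
	int in[4] = { 1, 2, 3, 4 };
	int out[4];

	add_bias(out, in, 4, 10);
	printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 12 13 14 */
	return 0;
}
```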
@@ -200,6 +209,7 @@ typedef size_t machine_word_t;
 #define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
 #define STATIC_ASSERT(expr)	((void)sizeof(char[1 - 2 * !(expr)]))
 #define ALIGN(n, a)		(((n) + (a) - 1) & ~((a) - 1))
+#define ROUND_UP(n, d)		((d) * DIV_ROUND_UP((n), (d)))
 
 /* ========================================================================== */
 /* Endianness handling */
@@ -513,8 +523,10 @@ bsr32(u32 v)
 #ifdef __GNUC__
 	return 31 - __builtin_clz(v);
 #elif defined(_MSC_VER)
-	_BitScanReverse(&v, v);
-	return v;
+	unsigned long i;
+
+	_BitScanReverse(&i, v);
+	return i;
 #else
 	unsigned i = 0;
 
@@ -529,9 +541,11 @@ bsr64(u64 v)
 {
 #ifdef __GNUC__
 	return 63 - __builtin_clzll(v);
-#elif defined(_MSC_VER) && defined(_M_X64)
-	_BitScanReverse64(&v, v);
-	return v;
+#elif defined(_MSC_VER) && defined(_WIN64)
+	unsigned long i;
+
+	_BitScanReverse64(&i, v);
+	return i;
 #else
 	unsigned i = 0;
 
@@ -563,8 +577,10 @@ bsf32(u32 v)
 #ifdef __GNUC__
 	return __builtin_ctz(v);
 #elif defined(_MSC_VER)
-	_BitScanForward(&v, v);
-	return v;
+	unsigned long i;
+
+	_BitScanForward(&i, v);
+	return i;
 #else
 	unsigned i = 0;
 
@@ -579,9 +595,11 @@ bsf64(u64 v)
 {
 #ifdef __GNUC__
 	return __builtin_ctzll(v);
-#elif defined(_MSC_VER) && defined(_M_X64)
-	_BitScanForward64(&v, v);
-	return v;
+#elif defined(_MSC_VER) && defined(_WIN64)
+	unsigned long i;
+
+	_BitScanForward64(&i, v);
+	return i;
 #else
 	unsigned i = 0;
 
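These hunks make the MSVC branches write the bit index into a separate `unsigned long`, which is the type the `_BitScan*` intrinsics take for their output parameter, instead of reusing the input variable. A hedged, self-contained sketch of the same pattern plus the generic loop fallback that the `#else` branch hints at (this program is mine, not libdeflate's code):

```c
/* bsr32(v) = index of the highest set bit; v must be nonzero. */
#include <stdio.h>

typedef unsigned int u32;

#if defined(_MSC_VER)
#  include <intrin.h>
#endif

static unsigned bsr32(u32 v)
{
#if defined(__GNUC__)
	return 31 - __builtin_clz(v);
#elif defined(_MSC_VER)
	/* The intrinsic writes the index through an 'unsigned long *',
	 * so use a separate variable instead of reusing 'v'. */
	unsigned long i;

	_BitScanReverse(&i, v);
	return i;
#else
	/* Portable fallback: shift until only the top bit remains. */
	unsigned i = 0;

	while ((v >>= 1) != 0)
		i++;
	return i;
#endif
}

int main(void)
{
	printf("bsr32(1) = %u, bsr32(0x80000000) = %u\n",
	       bsr32(1), bsr32(0x80000000u));	/* expect 0 and 31 */
	return 0;
}
```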
src/3rdparty/libdeflate/lib/arm/matchfinder_impl.h (vendored)

@@ -28,7 +28,9 @@
 #ifndef LIB_ARM_MATCHFINDER_IMPL_H
 #define LIB_ARM_MATCHFINDER_IMPL_H
 
-#ifdef __ARM_NEON
+#include "cpu_features.h"
+
+#if HAVE_NEON_NATIVE
 # include <arm_neon.h>
 static forceinline void
 matchfinder_init_neon(mf_pos_t *data, size_t size)
@@ -81,6 +83,6 @@ matchfinder_rebase_neon(mf_pos_t *data, size_t size)
 }
 #define matchfinder_rebase matchfinder_rebase_neon
 
-#endif /* __ARM_NEON */
+#endif /* HAVE_NEON_NATIVE */
 
 #endif /* LIB_ARM_MATCHFINDER_IMPL_H */
src/3rdparty/libdeflate/lib/decompress_template.h (vendored): 580 changed lines
@@ -31,7 +31,17 @@
  * target instruction sets.
  */
 
-static enum libdeflate_result ATTRIBUTES
+#ifndef ATTRIBUTES
+# define ATTRIBUTES
+#endif
+#ifndef EXTRACT_VARBITS
+# define EXTRACT_VARBITS(word, count) ((word) & BITMASK(count))
+#endif
+#ifndef EXTRACT_VARBITS8
+# define EXTRACT_VARBITS8(word, count) ((word) & BITMASK((u8)(count)))
+#endif
+
+static enum libdeflate_result ATTRIBUTES MAYBE_UNUSED
 FUNCNAME(struct libdeflate_decompressor * restrict d,
 	 const void * restrict in, size_t in_nbytes,
 	 void * restrict out, size_t out_nbytes_avail,
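These macros mask a variable number of low bits out of the bitbuffer. The `(u8)(count)` cast in `EXTRACT_VARBITS8` lets callers pass a whole 32-bit decode-table entry as the count while only its low byte (the number of bits to consume) is honored, as seen later in expressions like `EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8)`. A standalone sketch of the masking idea follows; the packed "entry" layout here is a simplified stand-in of my own, not libdeflate's exact format:

```c
#include <stdio.h>
#include <stdint.h>

typedef uint64_t bitbuf_t;

#define BITMASK(n)			(((bitbuf_t)1 << (n)) - 1)
#define EXTRACT_VARBITS(word, count)	((word) & BITMASK(count))
#define EXTRACT_VARBITS8(word, count)	((word) & BITMASK((uint8_t)(count)))

int main(void)
{
	bitbuf_t bitbuf = 0x1b5;	/* low bits: ...1_1011_0101 */

	/* Hypothetical packed table entry: low byte = 5 bits to consume,
	 * upper bits = other per-symbol data that must be ignored here. */
	uint32_t entry = (42u << 16) | 5;

	/* Passing the raw entry to BITMASK() would shift by a huge amount;
	 * the u8 cast keeps only the bit count. */
	printf("low 5 bits = 0x%llx\n",
	       (unsigned long long)EXTRACT_VARBITS8(bitbuf, entry));	/* 0x15 */
	printf("same thing = 0x%llx\n",
	       (unsigned long long)EXTRACT_VARBITS(bitbuf, (uint8_t)entry));
	return 0;
}
```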
@ -41,35 +51,36 @@ FUNCNAME(struct libdeflate_decompressor * restrict d,
|
|||
u8 * const out_end = out_next + out_nbytes_avail;
|
||||
u8 * const out_fastloop_end =
|
||||
out_end - MIN(out_nbytes_avail, FASTLOOP_MAX_BYTES_WRITTEN);
|
||||
|
||||
/* Input bitstream state; see deflate_decompress.c for documentation */
|
||||
const u8 *in_next = in;
|
||||
const u8 * const in_end = in_next + in_nbytes;
|
||||
const u8 * const in_fastloop_end =
|
||||
in_end - MIN(in_nbytes, FASTLOOP_MAX_BYTES_READ);
|
||||
bitbuf_t bitbuf = 0;
|
||||
bitbuf_t saved_bitbuf;
|
||||
machine_word_t bitsleft = 0;
|
||||
u32 bitsleft = 0;
|
||||
size_t overread_count = 0;
|
||||
unsigned i;
|
||||
|
||||
bool is_final_block;
|
||||
unsigned block_type;
|
||||
u16 len;
|
||||
u16 nlen;
|
||||
unsigned num_litlen_syms;
|
||||
unsigned num_offset_syms;
|
||||
bitbuf_t tmpbits;
|
||||
bitbuf_t litlen_tablemask;
|
||||
u32 entry;
|
||||
|
||||
next_block:
|
||||
/* Starting to read the next block */
|
||||
;
|
||||
|
||||
STATIC_ASSERT(CAN_ENSURE(1 + 2 + 5 + 5 + 4 + 3));
|
||||
STATIC_ASSERT(CAN_CONSUME(1 + 2 + 5 + 5 + 4 + 3));
|
||||
REFILL_BITS();
|
||||
|
||||
/* BFINAL: 1 bit */
|
||||
is_final_block = POP_BITS(1);
|
||||
is_final_block = bitbuf & BITMASK(1);
|
||||
|
||||
/* BTYPE: 2 bits */
|
||||
block_type = POP_BITS(2);
|
||||
block_type = (bitbuf >> 1) & BITMASK(2);
|
||||
|
||||
if (block_type == DEFLATE_BLOCKTYPE_DYNAMIC_HUFFMAN) {
|
||||
|
||||
|
@ -81,17 +92,18 @@ next_block:
|
|||
};
|
||||
|
||||
unsigned num_explicit_precode_lens;
|
||||
unsigned i;
|
||||
|
||||
/* Read the codeword length counts. */
|
||||
|
||||
STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == ((1 << 5) - 1) + 257);
|
||||
num_litlen_syms = POP_BITS(5) + 257;
|
||||
STATIC_ASSERT(DEFLATE_NUM_LITLEN_SYMS == 257 + BITMASK(5));
|
||||
num_litlen_syms = 257 + ((bitbuf >> 3) & BITMASK(5));
|
||||
|
||||
STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == ((1 << 5) - 1) + 1);
|
||||
num_offset_syms = POP_BITS(5) + 1;
|
||||
STATIC_ASSERT(DEFLATE_NUM_OFFSET_SYMS == 1 + BITMASK(5));
|
||||
num_offset_syms = 1 + ((bitbuf >> 8) & BITMASK(5));
|
||||
|
||||
STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == ((1 << 4) - 1) + 4);
|
||||
num_explicit_precode_lens = POP_BITS(4) + 4;
|
||||
STATIC_ASSERT(DEFLATE_NUM_PRECODE_SYMS == 4 + BITMASK(4));
|
||||
num_explicit_precode_lens = 4 + ((bitbuf >> 13) & BITMASK(4));
|
||||
|
||||
d->static_codes_loaded = false;
|
||||
|
||||
|
@ -103,16 +115,31 @@ next_block:
|
|||
* merge one len with the previous fields.
|
||||
*/
|
||||
STATIC_ASSERT(DEFLATE_MAX_PRE_CODEWORD_LEN == (1 << 3) - 1);
|
||||
if (CAN_ENSURE(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) {
|
||||
d->u.precode_lens[deflate_precode_lens_permutation[0]] = POP_BITS(3);
|
||||
if (CAN_CONSUME(3 * (DEFLATE_NUM_PRECODE_SYMS - 1))) {
|
||||
d->u.precode_lens[deflate_precode_lens_permutation[0]] =
|
||||
(bitbuf >> 17) & BITMASK(3);
|
||||
bitbuf >>= 20;
|
||||
bitsleft -= 20;
|
||||
REFILL_BITS();
|
||||
for (i = 1; i < num_explicit_precode_lens; i++)
|
||||
d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3);
|
||||
i = 1;
|
||||
do {
|
||||
d->u.precode_lens[deflate_precode_lens_permutation[i]] =
|
||||
bitbuf & BITMASK(3);
|
||||
bitbuf >>= 3;
|
||||
bitsleft -= 3;
|
||||
} while (++i < num_explicit_precode_lens);
|
||||
} else {
|
||||
for (i = 0; i < num_explicit_precode_lens; i++) {
|
||||
ENSURE_BITS(3);
|
||||
d->u.precode_lens[deflate_precode_lens_permutation[i]] = POP_BITS(3);
|
||||
}
|
||||
bitbuf >>= 17;
|
||||
bitsleft -= 17;
|
||||
i = 0;
|
||||
do {
|
||||
if ((u8)bitsleft < 3)
|
||||
REFILL_BITS();
|
||||
d->u.precode_lens[deflate_precode_lens_permutation[i]] =
|
||||
bitbuf & BITMASK(3);
|
||||
bitbuf >>= 3;
|
||||
bitsleft -= 3;
|
||||
} while (++i < num_explicit_precode_lens);
|
||||
}
|
||||
for (; i < DEFLATE_NUM_PRECODE_SYMS; i++)
|
||||
d->u.precode_lens[deflate_precode_lens_permutation[i]] = 0;
|
||||
|
@ -121,13 +148,14 @@ next_block:
|
|||
SAFETY_CHECK(build_precode_decode_table(d));
|
||||
|
||||
/* Decode the litlen and offset codeword lengths. */
|
||||
for (i = 0; i < num_litlen_syms + num_offset_syms; ) {
|
||||
u32 entry;
|
||||
i = 0;
|
||||
do {
|
||||
unsigned presym;
|
||||
u8 rep_val;
|
||||
unsigned rep_count;
|
||||
|
||||
ENSURE_BITS(DEFLATE_MAX_PRE_CODEWORD_LEN + 7);
|
||||
if ((u8)bitsleft < DEFLATE_MAX_PRE_CODEWORD_LEN + 7)
|
||||
REFILL_BITS();
|
||||
|
||||
/*
|
||||
* The code below assumes that the precode decode table
|
||||
|
@ -135,9 +163,11 @@ next_block:
|
|||
*/
|
||||
STATIC_ASSERT(PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN);
|
||||
|
||||
/* Read the next precode symbol. */
|
||||
entry = d->u.l.precode_decode_table[BITS(DEFLATE_MAX_PRE_CODEWORD_LEN)];
|
||||
REMOVE_BITS((u8)entry);
|
||||
/* Decode the next precode symbol. */
|
||||
entry = d->u.l.precode_decode_table[
|
||||
bitbuf & BITMASK(DEFLATE_MAX_PRE_CODEWORD_LEN)];
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry; /* optimization: subtract full entry */
|
||||
presym = entry >> 16;
|
||||
|
||||
if (presym < 16) {
|
||||
|
@ -171,8 +201,10 @@ next_block:
|
|||
/* Repeat the previous length 3 - 6 times. */
|
||||
SAFETY_CHECK(i != 0);
|
||||
rep_val = d->u.l.lens[i - 1];
|
||||
STATIC_ASSERT(3 + ((1 << 2) - 1) == 6);
|
||||
rep_count = 3 + POP_BITS(2);
|
||||
STATIC_ASSERT(3 + BITMASK(2) == 6);
|
||||
rep_count = 3 + (bitbuf & BITMASK(2));
|
||||
bitbuf >>= 2;
|
||||
bitsleft -= 2;
|
||||
d->u.l.lens[i + 0] = rep_val;
|
||||
d->u.l.lens[i + 1] = rep_val;
|
||||
d->u.l.lens[i + 2] = rep_val;
|
||||
|
@ -182,8 +214,10 @@ next_block:
|
|||
i += rep_count;
|
||||
} else if (presym == 17) {
|
||||
/* Repeat zero 3 - 10 times. */
|
||||
STATIC_ASSERT(3 + ((1 << 3) - 1) == 10);
|
||||
rep_count = 3 + POP_BITS(3);
|
||||
STATIC_ASSERT(3 + BITMASK(3) == 10);
|
||||
rep_count = 3 + (bitbuf & BITMASK(3));
|
||||
bitbuf >>= 3;
|
||||
bitsleft -= 3;
|
||||
d->u.l.lens[i + 0] = 0;
|
||||
d->u.l.lens[i + 1] = 0;
|
||||
d->u.l.lens[i + 2] = 0;
|
||||
|
@ -197,20 +231,39 @@ next_block:
|
|||
i += rep_count;
|
||||
} else {
|
||||
/* Repeat zero 11 - 138 times. */
|
||||
STATIC_ASSERT(11 + ((1 << 7) - 1) == 138);
|
||||
rep_count = 11 + POP_BITS(7);
|
||||
STATIC_ASSERT(11 + BITMASK(7) == 138);
|
||||
rep_count = 11 + (bitbuf & BITMASK(7));
|
||||
bitbuf >>= 7;
|
||||
bitsleft -= 7;
|
||||
memset(&d->u.l.lens[i], 0,
|
||||
rep_count * sizeof(d->u.l.lens[i]));
|
||||
i += rep_count;
|
||||
}
|
||||
}
|
||||
} while (i < num_litlen_syms + num_offset_syms);
|
||||
|
||||
} else if (block_type == DEFLATE_BLOCKTYPE_UNCOMPRESSED) {
|
||||
u16 len, nlen;
|
||||
|
||||
/*
|
||||
* Uncompressed block: copy 'len' bytes literally from the input
|
||||
* buffer to the output buffer.
|
||||
*/
|
||||
|
||||
ALIGN_INPUT();
|
||||
bitsleft -= 3; /* for BTYPE and BFINAL */
|
||||
|
||||
/*
|
||||
* Align the bitstream to the next byte boundary. This means
|
||||
* the next byte boundary as if we were reading a byte at a
|
||||
* time. Therefore, we have to rewind 'in_next' by any bytes
|
||||
* that have been refilled but not actually consumed yet (not
|
||||
* counting overread bytes, which don't increment 'in_next').
|
||||
*/
|
||||
bitsleft = (u8)bitsleft;
|
||||
SAFETY_CHECK(overread_count <= (bitsleft >> 3));
|
||||
in_next -= (bitsleft >> 3) - overread_count;
|
||||
overread_count = 0;
|
||||
bitbuf = 0;
|
||||
bitsleft = 0;
|
||||
|
||||
SAFETY_CHECK(in_end - in_next >= 4);
|
||||
len = get_unaligned_le16(in_next);
|
||||
|
@ -229,6 +282,8 @@ next_block:
|
|||
goto block_done;
|
||||
|
||||
} else {
|
||||
unsigned i;
|
||||
|
||||
SAFETY_CHECK(block_type == DEFLATE_BLOCKTYPE_STATIC_HUFFMAN);
|
||||
|
||||
/*
|
||||
|
@ -241,6 +296,9 @@ next_block:
|
|||
* dynamic Huffman block.
|
||||
*/
|
||||
|
||||
bitbuf >>= 3; /* for BTYPE and BFINAL */
|
||||
bitsleft -= 3;
|
||||
|
||||
if (d->static_codes_loaded)
|
||||
goto have_decode_tables;
|
||||
|
||||
|
@ -270,186 +328,344 @@ next_block:
|
|||
SAFETY_CHECK(build_offset_decode_table(d, num_litlen_syms, num_offset_syms));
|
||||
SAFETY_CHECK(build_litlen_decode_table(d, num_litlen_syms, num_offset_syms));
|
||||
have_decode_tables:
|
||||
litlen_tablemask = BITMASK(d->litlen_tablebits);
|
||||
|
||||
/*
|
||||
* This is the "fastloop" for decoding literals and matches. It does
|
||||
* bounds checks on in_next and out_next in the loop conditions so that
|
||||
* additional bounds checks aren't needed inside the loop body.
|
||||
*
|
||||
* To reduce latency, the bitbuffer is refilled and the next litlen
|
||||
* decode table entry is preloaded before each loop iteration.
|
||||
*/
|
||||
while (in_next < in_fastloop_end && out_next < out_fastloop_end) {
|
||||
u32 entry, length, offset;
|
||||
u8 lit;
|
||||
if (in_next >= in_fastloop_end || out_next >= out_fastloop_end)
|
||||
goto generic_loop;
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
do {
|
||||
u32 length, offset, lit;
|
||||
const u8 *src;
|
||||
u8 *dst;
|
||||
|
||||
/* Refill the bitbuffer and decode a litlen symbol. */
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
|
||||
preloaded:
|
||||
if (CAN_ENSURE(3 * LITLEN_TABLEBITS +
|
||||
DEFLATE_MAX_LITLEN_CODEWORD_LEN +
|
||||
DEFLATE_MAX_EXTRA_LENGTH_BITS) &&
|
||||
(entry & HUFFDEC_LITERAL)) {
|
||||
/*
|
||||
* Consume the bits for the litlen decode table entry. Save the
|
||||
* original bitbuf for later, in case the extra match length
|
||||
* bits need to be extracted from it.
|
||||
*/
|
||||
saved_bitbuf = bitbuf;
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry; /* optimization: subtract full entry */
|
||||
|
||||
/*
|
||||
* Begin by checking for a "fast" literal, i.e. a literal that
|
||||
* doesn't need a subtable.
|
||||
*/
|
||||
if (entry & HUFFDEC_LITERAL) {
|
||||
/*
|
||||
* 64-bit only: fast path for decoding literals that
|
||||
* don't need subtables. We do up to 3 of these before
|
||||
* proceeding to the general case. This is the largest
|
||||
* number of times that LITLEN_TABLEBITS bits can be
|
||||
* extracted from a refilled 64-bit bitbuffer while
|
||||
* still leaving enough bits to decode any match length.
|
||||
* On 64-bit platforms, we decode up to 2 extra fast
|
||||
* literals in addition to the primary item, as this
|
||||
* increases performance and still leaves enough bits
|
||||
* remaining for what follows. We could actually do 3,
|
||||
* assuming LITLEN_TABLEBITS=11, but that actually
|
||||
* decreases performance slightly (perhaps by messing
|
||||
* with the branch prediction of the conditional refill
|
||||
* that happens later while decoding the match offset).
|
||||
*
|
||||
* Note: the definitions of FASTLOOP_MAX_BYTES_WRITTEN
|
||||
* and FASTLOOP_MAX_BYTES_READ need to be updated if the
|
||||
* maximum number of literals decoded here is changed.
|
||||
* number of extra literals decoded here is changed.
|
||||
*/
|
||||
REMOVE_ENTRY_BITS_FAST(entry);
|
||||
lit = entry >> 16;
|
||||
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
|
||||
*out_next++ = lit;
|
||||
if (entry & HUFFDEC_LITERAL) {
|
||||
REMOVE_ENTRY_BITS_FAST(entry);
|
||||
if (/* enough bits for 2 fast literals + length + offset preload? */
|
||||
CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS +
|
||||
LENGTH_MAXBITS,
|
||||
OFFSET_TABLEBITS) &&
|
||||
/* enough bits for 2 fast literals + slow literal + litlen preload? */
|
||||
CAN_CONSUME_AND_THEN_PRELOAD(2 * LITLEN_TABLEBITS +
|
||||
DEFLATE_MAX_LITLEN_CODEWORD_LEN,
|
||||
LITLEN_TABLEBITS)) {
|
||||
/* 1st extra fast literal */
|
||||
lit = entry >> 16;
|
||||
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
saved_bitbuf = bitbuf;
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry;
|
||||
*out_next++ = lit;
|
||||
if (entry & HUFFDEC_LITERAL) {
|
||||
REMOVE_ENTRY_BITS_FAST(entry);
|
||||
/* 2nd extra fast literal */
|
||||
lit = entry >> 16;
|
||||
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
saved_bitbuf = bitbuf;
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry;
|
||||
*out_next++ = lit;
|
||||
if (entry & HUFFDEC_LITERAL) {
|
||||
/*
|
||||
* Another fast literal, but
|
||||
* this one is in lieu of the
|
||||
* primary item, so it doesn't
|
||||
* count as one of the extras.
|
||||
*/
|
||||
lit = entry >> 16;
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
*out_next++ = lit;
|
||||
continue;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
/*
|
||||
* Decode a literal. While doing so, preload
|
||||
* the next litlen decode table entry and refill
|
||||
* the bitbuffer. To reduce latency, we've
|
||||
* arranged for there to be enough "preloadable"
|
||||
* bits remaining to do the table preload
|
||||
* independently of the refill.
|
||||
*/
|
||||
STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD(
|
||||
LITLEN_TABLEBITS, LITLEN_TABLEBITS));
|
||||
lit = entry >> 16;
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
*out_next++ = lit;
|
||||
continue;
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* It's not a literal entry, so it can be a length entry, a
|
||||
* subtable pointer entry, or an end-of-block entry. Detect the
|
||||
* two unlikely cases by testing the HUFFDEC_EXCEPTIONAL flag.
|
||||
*/
|
||||
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
|
||||
/* Subtable pointer or end-of-block entry */
|
||||
if (entry & HUFFDEC_SUBTABLE_POINTER) {
|
||||
REMOVE_BITS(LITLEN_TABLEBITS);
|
||||
entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)];
|
||||
}
|
||||
SAVE_BITBUF();
|
||||
REMOVE_ENTRY_BITS_FAST(entry);
|
||||
|
||||
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
|
||||
goto block_done;
|
||||
/* Literal or length entry, from a subtable */
|
||||
} else {
|
||||
/* Literal or length entry, from the main table */
|
||||
SAVE_BITBUF();
|
||||
REMOVE_ENTRY_BITS_FAST(entry);
|
||||
}
|
||||
length = entry >> 16;
|
||||
if (entry & HUFFDEC_LITERAL) {
|
||||
|
||||
/*
|
||||
* Literal that didn't get handled by the literal fast
|
||||
* path earlier
|
||||
* A subtable is required. Load and consume the
|
||||
* subtable entry. The subtable entry can be of any
|
||||
* type: literal, length, or end-of-block.
|
||||
*/
|
||||
*out_next++ = length;
|
||||
continue;
|
||||
}
|
||||
/*
|
||||
* Match length. Finish decoding it. We don't need to check
|
||||
* for too-long matches here, as this is inside the fastloop
|
||||
* where it's already been verified that the output buffer has
|
||||
* enough space remaining to copy a max-length match.
|
||||
*/
|
||||
length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8);
|
||||
entry = d->u.litlen_decode_table[(entry >> 16) +
|
||||
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
|
||||
saved_bitbuf = bitbuf;
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry;
|
||||
|
||||
/* Decode the match offset. */
|
||||
|
||||
/* Refill the bitbuffer if it may be needed for the offset. */
|
||||
if (unlikely(GET_REAL_BITSLEFT() <
|
||||
DEFLATE_MAX_OFFSET_CODEWORD_LEN +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS))
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
|
||||
STATIC_ASSERT(CAN_ENSURE(OFFSET_TABLEBITS +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS));
|
||||
STATIC_ASSERT(CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN -
|
||||
OFFSET_TABLEBITS +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS));
|
||||
|
||||
entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)];
|
||||
if (entry & HUFFDEC_EXCEPTIONAL) {
|
||||
/* Offset codeword requires a subtable */
|
||||
REMOVE_BITS(OFFSET_TABLEBITS);
|
||||
entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)];
|
||||
/*
|
||||
* On 32-bit, we might not be able to decode the offset
|
||||
* symbol and extra offset bits without refilling the
|
||||
* bitbuffer in between. However, this is only an issue
|
||||
* when a subtable is needed, so do the refill here.
|
||||
* 32-bit platforms that use the byte-at-a-time refill
|
||||
* method have to do a refill here for there to always
|
||||
* be enough bits to decode a literal that requires a
|
||||
* subtable, then preload the next litlen decode table
|
||||
* entry; or to decode a match length that requires a
|
||||
* subtable, then preload the offset decode table entry.
|
||||
*/
|
||||
if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS))
|
||||
if (!CAN_CONSUME_AND_THEN_PRELOAD(DEFLATE_MAX_LITLEN_CODEWORD_LEN,
|
||||
LITLEN_TABLEBITS) ||
|
||||
!CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXBITS,
|
||||
OFFSET_TABLEBITS))
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
if (entry & HUFFDEC_LITERAL) {
|
||||
/* Decode a literal that required a subtable. */
|
||||
lit = entry >> 16;
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
*out_next++ = lit;
|
||||
continue;
|
||||
}
|
||||
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
|
||||
goto block_done;
|
||||
/* Else, it's a length that required a subtable. */
|
||||
}
|
||||
SAVE_BITBUF();
|
||||
REMOVE_ENTRY_BITS_FAST(entry);
|
||||
offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8));
|
||||
|
||||
/*
|
||||
* Decode the match length: the length base value associated
|
||||
* with the litlen symbol (which we extract from the decode
|
||||
* table entry), plus the extra length bits. We don't need to
|
||||
* consume the extra length bits here, as they were included in
|
||||
* the bits consumed by the entry earlier. We also don't need
|
||||
* to check for too-long matches here, as this is inside the
|
||||
* fastloop where it's already been verified that the output
|
||||
* buffer has enough space remaining to copy a max-length match.
|
||||
*/
|
||||
length = entry >> 16;
|
||||
length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
|
||||
|
||||
/*
|
||||
* Decode the match offset. There are enough "preloadable" bits
|
||||
* remaining to preload the offset decode table entry, but a
|
||||
* refill might be needed before consuming it.
|
||||
*/
|
||||
STATIC_ASSERT(CAN_CONSUME_AND_THEN_PRELOAD(LENGTH_MAXFASTBITS,
|
||||
OFFSET_TABLEBITS));
|
||||
entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)];
|
||||
if (CAN_CONSUME_AND_THEN_PRELOAD(OFFSET_MAXBITS,
|
||||
LITLEN_TABLEBITS)) {
|
||||
/*
|
||||
* Decoding a match offset on a 64-bit platform. We may
|
||||
* need to refill once, but then we can decode the whole
|
||||
* offset and preload the next litlen table entry.
|
||||
*/
|
||||
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
|
||||
/* Offset codeword requires a subtable */
|
||||
if (unlikely((u8)bitsleft < OFFSET_MAXBITS +
|
||||
LITLEN_TABLEBITS - PRELOAD_SLACK))
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
bitbuf >>= OFFSET_TABLEBITS;
|
||||
bitsleft -= OFFSET_TABLEBITS;
|
||||
entry = d->offset_decode_table[(entry >> 16) +
|
||||
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
|
||||
} else if (unlikely((u8)bitsleft < OFFSET_MAXFASTBITS +
|
||||
LITLEN_TABLEBITS - PRELOAD_SLACK))
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
} else {
|
||||
/* Decoding a match offset on a 32-bit platform */
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
|
||||
/* Offset codeword requires a subtable */
|
||||
bitbuf >>= OFFSET_TABLEBITS;
|
||||
bitsleft -= OFFSET_TABLEBITS;
|
||||
entry = d->offset_decode_table[(entry >> 16) +
|
||||
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
/* No further refill needed before extra bits */
|
||||
STATIC_ASSERT(CAN_CONSUME(
|
||||
OFFSET_MAXBITS - OFFSET_TABLEBITS));
|
||||
} else {
|
||||
/* No refill needed before extra bits */
|
||||
STATIC_ASSERT(CAN_CONSUME(OFFSET_MAXFASTBITS));
|
||||
}
|
||||
}
|
||||
saved_bitbuf = bitbuf;
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry; /* optimization: subtract full entry */
|
||||
offset = entry >> 16;
|
||||
offset += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
|
||||
|
||||
/* Validate the match offset; needed even in the fastloop. */
|
||||
SAFETY_CHECK(offset <= out_next - (const u8 *)out);
|
||||
src = out_next - offset;
|
||||
dst = out_next;
|
||||
out_next += length;
|
||||
|
||||
/*
|
||||
* Before starting to copy the match, refill the bitbuffer and
|
||||
* preload the litlen decode table entry for the next loop
|
||||
* iteration. This can increase performance by allowing the
|
||||
* latency of the two operations to overlap.
|
||||
* Before starting to issue the instructions to copy the match,
|
||||
* refill the bitbuffer and preload the litlen decode table
|
||||
* entry for the next loop iteration. This can increase
|
||||
* performance by allowing the latency of the match copy to
|
||||
* overlap with these other operations. To further reduce
|
||||
* latency, we've arranged for there to be enough bits remaining
|
||||
* to do the table preload independently of the refill, except
|
||||
* on 32-bit platforms using the byte-at-a-time refill method.
|
||||
*/
|
||||
if (!CAN_CONSUME_AND_THEN_PRELOAD(
|
||||
MAX(OFFSET_MAXBITS - OFFSET_TABLEBITS,
|
||||
OFFSET_MAXFASTBITS),
|
||||
LITLEN_TABLEBITS) &&
|
||||
unlikely((u8)bitsleft < LITLEN_TABLEBITS - PRELOAD_SLACK))
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
REFILL_BITS_IN_FASTLOOP();
|
||||
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
|
||||
|
||||
/*
|
||||
* Copy the match. On most CPUs the fastest method is a
|
||||
* word-at-a-time copy, unconditionally copying at least 3 words
|
||||
* word-at-a-time copy, unconditionally copying about 5 words
|
||||
* since this is enough for most matches without being too much.
|
||||
*
|
||||
* The normal word-at-a-time copy works for offset >= WORDBYTES,
|
||||
* which is most cases. The case of offset == 1 is also common
|
||||
* and is worth optimizing for, since it is just RLE encoding of
|
||||
* the previous byte, which is the result of compressing long
|
||||
* runs of the same byte. We currently don't optimize for the
|
||||
* less common cases of offset > 1 && offset < WORDBYTES; we
|
||||
* just fall back to a traditional byte-at-a-time copy for them.
|
||||
* runs of the same byte.
|
||||
*
|
||||
* Writing past the match 'length' is allowed here, since it's
|
||||
* been ensured there is enough output space left for a slight
|
||||
* overrun. FASTLOOP_MAX_BYTES_WRITTEN needs to be updated if
|
||||
* the maximum possible overrun here is changed.
|
||||
*/
|
||||
src = out_next - offset;
|
||||
dst = out_next;
|
||||
out_next += length;
|
||||
if (UNALIGNED_ACCESS_IS_FAST && offset >= WORDBYTES) {
|
||||
copy_word_unaligned(src, dst);
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
copy_word_unaligned(src, dst);
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
do {
|
||||
copy_word_unaligned(src, dst);
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
while (dst < out_next) {
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
} while (dst < out_next);
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += WORDBYTES;
|
||||
dst += WORDBYTES;
|
||||
}
|
||||
} else if (UNALIGNED_ACCESS_IS_FAST && offset == 1) {
|
||||
machine_word_t v = repeat_byte(*src);
|
||||
machine_word_t v;
|
||||
|
||||
/*
|
||||
* This part tends to get auto-vectorized, so keep it
|
||||
* copying a multiple of 16 bytes at a time.
|
||||
*/
|
||||
v = (machine_word_t)0x0101010101010101 * src[0];
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
do {
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
while (dst < out_next) {
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
store_word_unaligned(v, dst);
|
||||
dst += WORDBYTES;
|
||||
}
|
||||
} else if (UNALIGNED_ACCESS_IS_FAST) {
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += offset;
|
||||
dst += offset;
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += offset;
|
||||
dst += offset;
|
||||
do {
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += offset;
|
||||
dst += offset;
|
||||
store_word_unaligned(load_word_unaligned(src), dst);
|
||||
src += offset;
|
||||
dst += offset;
|
||||
} while (dst < out_next);
|
||||
} else {
|
||||
STATIC_ASSERT(DEFLATE_MIN_MATCH_LEN == 3);
|
||||
*dst++ = *src++;
|
||||
*dst++ = *src++;
|
||||
do {
|
||||
*dst++ = *src++;
|
||||
} while (dst < out_next);
|
||||
}
|
||||
if (in_next < in_fastloop_end && out_next < out_fastloop_end)
|
||||
goto preloaded;
|
||||
break;
|
||||
}
|
||||
/* MASK_BITSLEFT() is needed when leaving the fastloop. */
|
||||
MASK_BITSLEFT();
|
||||
} while (in_next < in_fastloop_end && out_next < out_fastloop_end);
|
||||
|
||||
/*
|
||||
* This is the generic loop for decoding literals and matches. This
|
||||
|
@ -458,19 +674,24 @@ preloaded:
|
|||
* critical, as most time is spent in the fastloop above instead. We
|
||||
* therefore omit some optimizations here in favor of smaller code.
|
||||
*/
|
||||
generic_loop:
|
||||
for (;;) {
|
||||
u32 entry, length, offset;
|
||||
u32 length, offset;
|
||||
const u8 *src;
|
||||
u8 *dst;
|
||||
|
||||
REFILL_BITS();
|
||||
entry = d->u.litlen_decode_table[BITS(LITLEN_TABLEBITS)];
|
||||
entry = d->u.litlen_decode_table[bitbuf & litlen_tablemask];
|
||||
saved_bitbuf = bitbuf;
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry;
|
||||
if (unlikely(entry & HUFFDEC_SUBTABLE_POINTER)) {
|
||||
REMOVE_BITS(LITLEN_TABLEBITS);
|
||||
entry = d->u.litlen_decode_table[(entry >> 16) + BITS((u8)entry)];
|
||||
entry = d->u.litlen_decode_table[(entry >> 16) +
|
||||
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
|
||||
saved_bitbuf = bitbuf;
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry;
|
||||
}
|
||||
SAVE_BITBUF();
|
||||
REMOVE_BITS((u8)entry);
|
||||
length = entry >> 16;
|
||||
if (entry & HUFFDEC_LITERAL) {
|
||||
if (unlikely(out_next == out_end))
|
||||
|
@ -480,34 +701,27 @@ preloaded:
|
|||
}
|
||||
if (unlikely(entry & HUFFDEC_END_OF_BLOCK))
|
||||
goto block_done;
|
||||
length += SAVED_BITS((u8)entry) >> (u8)(entry >> 8);
|
||||
length += EXTRACT_VARBITS8(saved_bitbuf, entry) >> (u8)(entry >> 8);
|
||||
if (unlikely(length > out_end - out_next))
|
||||
return LIBDEFLATE_INSUFFICIENT_SPACE;
|
||||
|
||||
if (CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS)) {
|
||||
ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS);
|
||||
} else {
|
||||
ENSURE_BITS(OFFSET_TABLEBITS +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS);
|
||||
if (!CAN_CONSUME(LENGTH_MAXBITS + OFFSET_MAXBITS))
|
||||
REFILL_BITS();
|
||||
entry = d->offset_decode_table[bitbuf & BITMASK(OFFSET_TABLEBITS)];
|
||||
if (unlikely(entry & HUFFDEC_EXCEPTIONAL)) {
|
||||
bitbuf >>= OFFSET_TABLEBITS;
|
||||
bitsleft -= OFFSET_TABLEBITS;
|
||||
entry = d->offset_decode_table[(entry >> 16) +
|
||||
EXTRACT_VARBITS(bitbuf, (entry >> 8) & 0x3F)];
|
||||
if (!CAN_CONSUME(OFFSET_MAXBITS))
|
||||
REFILL_BITS();
|
||||
}
|
||||
entry = d->offset_decode_table[BITS(OFFSET_TABLEBITS)];
|
||||
if (entry & HUFFDEC_EXCEPTIONAL) {
|
||||
REMOVE_BITS(OFFSET_TABLEBITS);
|
||||
entry = d->offset_decode_table[(entry >> 16) + BITS((u8)entry)];
|
||||
if (!CAN_ENSURE(DEFLATE_MAX_OFFSET_CODEWORD_LEN +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS))
|
||||
ENSURE_BITS(DEFLATE_MAX_OFFSET_CODEWORD_LEN -
|
||||
OFFSET_TABLEBITS +
|
||||
DEFLATE_MAX_EXTRA_OFFSET_BITS);
|
||||
}
|
||||
SAVE_BITBUF();
|
||||
REMOVE_BITS((u8)entry);
|
||||
offset = (entry >> 16) + (SAVED_BITS((u8)entry) >> (u8)(entry >> 8));
|
||||
offset = entry >> 16;
|
||||
offset += EXTRACT_VARBITS8(bitbuf, entry) >> (u8)(entry >> 8);
|
||||
bitbuf >>= (u8)entry;
|
||||
bitsleft -= entry;
|
||||
|
||||
SAFETY_CHECK(offset <= out_next - (const u8 *)out);
|
||||
|
||||
src = out_next - offset;
|
||||
dst = out_next;
|
||||
out_next += length;
|
||||
|
@@ -521,9 +735,6 @@ preloaded:
 	}
 
 block_done:
-	/* MASK_BITSLEFT() is needed when leaving the fastloop. */
-	MASK_BITSLEFT();
-
 	/* Finished decoding a block */
 
 	if (!is_final_block)
@@ -531,12 +742,21 @@ block_done:
 
 	/* That was the last block. */
 
-	/* Discard any readahead bits and check for excessive overread. */
-	ALIGN_INPUT();
+	bitsleft = (u8)bitsleft;
+
+	/*
+	 * If any of the implicit appended zero bytes were consumed (not just
+	 * refilled) before hitting end of stream, then the data is bad.
+	 */
+	SAFETY_CHECK(overread_count <= (bitsleft >> 3));
+
+	/* Optionally return the actual number of bytes consumed. */
+	if (actual_in_nbytes_ret) {
+		/* Don't count bytes that were refilled but not consumed. */
+		in_next -= (bitsleft >> 3) - overread_count;
 
-	/* Optionally return the actual number of bytes read. */
-	if (actual_in_nbytes_ret)
 		*actual_in_nbytes_ret = in_next - (u8 *)in;
+	}
 
 	/* Optionally return the actual number of bytes written. */
 	if (actual_out_nbytes_ret) {
@@ -550,3 +770,5 @@ block_done:
 
 #undef FUNCNAME
 #undef ATTRIBUTES
+#undef EXTRACT_VARBITS
+#undef EXTRACT_VARBITS8
src/3rdparty/libdeflate/lib/deflate_compress.c (vendored): 21 changed lines
@@ -475,7 +475,7 @@ struct deflate_output_bitstream;
 struct libdeflate_compressor {
 
 	/* Pointer to the compress() implementation chosen at allocation time */
-	void (*impl)(struct libdeflate_compressor *c, const u8 *in,
+	void (*impl)(struct libdeflate_compressor *restrict c, const u8 *in,
 		     size_t in_nbytes, struct deflate_output_bitstream *os);
 
 	/* The compression level with which this compressor was created */
@@ -1041,7 +1041,6 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[],
 		unsigned parent = A[node] >> NUM_SYMBOL_BITS;
 		unsigned parent_depth = A[parent] >> NUM_SYMBOL_BITS;
 		unsigned depth = parent_depth + 1;
-		unsigned len = depth;
 
 		/*
 		 * Set the depth of this node so that it is available when its
@@ -1054,19 +1053,19 @@ compute_length_counts(u32 A[], unsigned root_idx, unsigned len_counts[],
 		 * constraint. This is not the optimal method for generating
 		 * length-limited Huffman codes! But it should be good enough.
 		 */
-		if (len >= max_codeword_len) {
-			len = max_codeword_len;
+		if (depth >= max_codeword_len) {
+			depth = max_codeword_len;
 			do {
-				len--;
-			} while (len_counts[len] == 0);
+				depth--;
+			} while (len_counts[depth] == 0);
 		}
 
 		/*
 		 * Account for the fact that we have a non-leaf node at the
 		 * current depth.
 		 */
-		len_counts[len]--;
-		len_counts[len + 1] += 2;
+		len_counts[depth]--;
+		len_counts[depth + 1] += 2;
 	}
 }
 
@@ -1189,11 +1188,9 @@ gen_codewords(u32 A[], u8 lens[], const unsigned len_counts[],
 			(next_codewords[len - 1] + len_counts[len - 1]) << 1;
 
 	for (sym = 0; sym < num_syms; sym++) {
-		u8 len = lens[sym];
-		u32 codeword = next_codewords[len]++;
-
 		/* DEFLATE requires bit-reversed codewords. */
-		A[sym] = reverse_codeword(codeword, len);
+		A[sym] = reverse_codeword(next_codewords[lens[sym]]++,
+					  lens[sym]);
 	}
 }
 
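The hunk above keeps the comment "DEFLATE requires bit-reversed codewords": DEFLATE packs Huffman codewords into the bitstream starting from the least-significant bit, so the MSB-first canonical codeword has to be reversed before it is emitted. `reverse_codeword()` itself isn't shown in this diff; the helper below is an illustrative stand-in of my own, not libdeflate's implementation:

```c
#include <stdio.h>
#include <stdint.h>

/* Reverse the low 'len' bits of a canonical Huffman codeword. */
static uint32_t reverse_bits(uint32_t codeword, uint8_t len)
{
	uint32_t reversed = 0;

	for (uint8_t i = 0; i < len; i++) {
		reversed = (reversed << 1) | (codeword & 1);
		codeword >>= 1;
	}
	return reversed;
}

int main(void)
{
	/* Canonical code 0b110 (length 3) becomes 0b011 when bit-reversed. */
	printf("0x%x\n", reverse_bits(0x6, 3));	/* prints 0x3 */
	return 0;
}
```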
src/3rdparty/libdeflate/lib/deflate_constants.h (vendored)

@@ -49,11 +49,8 @@
 /*
  * Maximum number of extra bits that may be required to represent a match
  * length or offset.
- *
- * TODO: are we going to have full DEFLATE64 support? If so, up to 16
- * length bits must be supported.
  */
 #define DEFLATE_MAX_EXTRA_LENGTH_BITS	5
-#define DEFLATE_MAX_EXTRA_OFFSET_BITS	14
+#define DEFLATE_MAX_EXTRA_OFFSET_BITS	13
 
 #endif /* LIB_DEFLATE_CONSTANTS_H */
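Tightening `DEFLATE_MAX_EXTRA_OFFSET_BITS` from 14 to 13 matches plain DEFLATE, where the largest offset codes carry 13 extra bits; 14 would only matter for DEFLATE64, which the dropped TODO referenced. The quick check below uses the offset-code table from RFC 1951 section 3.2.5 (the table is from the RFC, not from this diff):

```c
#include <stdio.h>

int main(void)
{
	/* Extra bits per DEFLATE offset (distance) code 0..29, per RFC 1951. */
	static const int offset_extra_bits[30] = {
		0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
		7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13
	};
	int max = 0;

	for (int i = 0; i < 30; i++)
		if (offset_extra_bits[i] > max)
			max = offset_extra_bits[i];
	printf("max extra offset bits = %d\n", max);	/* 13 */
	return 0;
}
```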
src/3rdparty/libdeflate/lib/deflate_decompress.c (vendored): 421 changed lines
@@ -27,14 +27,14 @@
  * ---------------------------------------------------------------------------
  *
  * This is a highly optimized DEFLATE decompressor. It is much faster than
- * zlib, typically more than twice as fast, though results vary by CPU.
+ * vanilla zlib, typically well over twice as fast, though results vary by CPU.
  *
- * Why this is faster than zlib's implementation:
+ * Why this is faster than vanilla zlib:
  *
  * - Word accesses rather than byte accesses when reading input
  * - Word accesses rather than byte accesses when copying matches
  * - Faster Huffman decoding combined with various DEFLATE-specific tricks
- * - Larger bitbuffer variable that doesn't need to be filled as often
+ * - Larger bitbuffer variable that doesn't need to be refilled as often
  * - Other optimizations to remove unnecessary branches
  * - Only full-buffer decompression is supported, so the code doesn't need to
  *   support stopping and resuming decompression.
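One of the bullets above, "Word accesses rather than byte accesses when copying matches", refers to the fastloop's match copy, which deliberately overruns the match length by up to a few words because enough output slack has been reserved. A hedged, self-contained sketch of that idea (the function and buffer sizes here are mine, not libdeflate's):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define WORDBYTES	sizeof(uint64_t)

/* Copy an LZ77 match of 'length' bytes from 'dst - offset' to 'dst', one word
 * at a time. Assumes offset >= WORDBYTES and enough slack after the match for
 * the intentional overrun (the copy rounds up to whole words). */
static void copy_match_wordwise(uint8_t *dst, size_t offset, size_t length)
{
	const uint8_t *src = dst - offset;
	uint8_t *end = dst + length;
	uint64_t w;

	do {
		memcpy(&w, src, WORDBYTES);	/* unaligned load */
		memcpy(dst, &w, WORDBYTES);	/* unaligned store */
		src += WORDBYTES;
		dst += WORDBYTES;
	} while (dst < end);			/* may overshoot 'end' slightly */
}

int main(void)
{
	/* 16 bytes already decoded, plus slack for the overrun. */
	uint8_t buf[64] = "abcdefgh12345678";

	copy_match_wordwise(buf + 16, 16, 10);	/* copy 10 bytes from 16 back */
	printf("%.26s\n", buf);			/* abcdefgh12345678abcdefgh12 */
	return 0;
}
```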
@@ -71,22 +71,32 @@
 /*
  * The state of the "input bitstream" consists of the following variables:
  *
- *	- in_next: pointer to the next unread byte in the input buffer
+ *	- in_next: a pointer to the next unread byte in the input buffer
  *
- *	- in_end: pointer just past the end of the input buffer
+ *	- in_end: a pointer to just past the end of the input buffer
  *
  *	- bitbuf: a word-sized variable containing bits that have been read from
- *		  the input buffer. The buffered bits are right-aligned
- *		  (they're the low-order bits).
+ *		  the input buffer or from the implicit appended zero bytes
  *
- *	- bitsleft: number of bits in 'bitbuf' that are valid. NOTE: in the
- *		    fastloop, bits 8 and above of bitsleft can contain garbage.
+ *	- bitsleft: the number of bits in 'bitbuf' available to be consumed.
+ *		    After REFILL_BITS_BRANCHLESS(), 'bitbuf' can actually
+ *		    contain more bits than this. However, only the bits counted
+ *		    by 'bitsleft' can actually be consumed; the rest can only be
+ *		    used for preloading.
  *
- *	- overread_count: number of implicit 0 bytes past 'in_end' that have
- *			  been loaded into the bitbuffer
+ *		    As a micro-optimization, we allow bits 8 and higher of
+ *		    'bitsleft' to contain garbage. When consuming the bits
+ *		    associated with a decode table entry, this allows us to do
+ *		    'bitsleft -= entry' instead of 'bitsleft -= (u8)entry'.
+ *		    On some CPUs, this helps reduce instruction dependencies.
+ *		    This does have the disadvantage that 'bitsleft' sometimes
+ *		    needs to be cast to 'u8', such as when it's used as a shift
+ *		    amount in REFILL_BITS_BRANCHLESS(). But that one happens
+ *		    for free since most CPUs ignore high bits in shift amounts.
  *
- * For performance reasons, these variables are declared as standalone variables
- * and are manipulated using macros, rather than being packed into a struct.
+ *	- overread_count: the total number of implicit appended zero bytes that
+ *			  have been loaded into the bitbuffer, including any
+ *			  counted by 'bitsleft' and any already consumed
  */
 
 /*
@ -97,60 +107,92 @@
|
|||
* which they don't have to refill as often.
|
||||
*/
|
||||
typedef machine_word_t bitbuf_t;
|
||||
#define BITBUF_NBITS (8 * (int)sizeof(bitbuf_t))
|
||||
|
||||
/* BITMASK(n) returns a bitmask of length 'n'. */
|
||||
#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1)
|
||||
|
||||
/*
|
||||
* BITBUF_NBITS is the number of bits the bitbuffer variable can hold. See
|
||||
* REFILL_BITS_WORDWISE() for why this is 1 less than the obvious value.
|
||||
* MAX_BITSLEFT is the maximum number of consumable bits, i.e. the maximum value
|
||||
* of '(u8)bitsleft'. This is the size of the bitbuffer variable, minus 1 if
|
||||
* the branchless refill method is being used (see REFILL_BITS_BRANCHLESS()).
|
||||
*/
|
||||
#define BITBUF_NBITS (8 * sizeof(bitbuf_t) - 1)
|
||||
#define MAX_BITSLEFT \
|
||||
(UNALIGNED_ACCESS_IS_FAST ? BITBUF_NBITS - 1 : BITBUF_NBITS)
|
||||
|
||||
/*
|
||||
* REFILL_GUARANTEED_NBITS is the number of bits that are guaranteed in the
|
||||
* bitbuffer variable after refilling it with ENSURE_BITS(n), REFILL_BITS(), or
|
||||
* REFILL_BITS_IN_FASTLOOP(). There might be up to BITBUF_NBITS bits; however,
|
||||
* since only whole bytes can be added, only 'BITBUF_NBITS - 7' bits are
|
||||
* guaranteed. That is the smallest amount where another byte doesn't fit.
|
||||
* CONSUMABLE_NBITS is the minimum number of bits that are guaranteed to be
|
||||
* consumable (counted in 'bitsleft') immediately after refilling the bitbuffer.
|
||||
* Since only whole bytes can be added to 'bitsleft', the worst case is
|
||||
* 'MAX_BITSLEFT - 7': the smallest amount where another byte doesn't fit.
|
||||
*/
|
||||
#define REFILL_GUARANTEED_NBITS (BITBUF_NBITS - 7)
|
||||
#define CONSUMABLE_NBITS (MAX_BITSLEFT - 7)
|
||||
|
||||
/*
|
||||
* CAN_ENSURE(n) evaluates to true if the bitbuffer variable is guaranteed to
|
||||
* contain at least 'n' bits after a refill. See REFILL_GUARANTEED_NBITS.
|
||||
*
|
||||
* This can be used to choose between alternate refill strategies based on the
|
||||
* size of the bitbuffer variable. 'n' should be a compile-time constant.
|
||||
* FASTLOOP_PRELOADABLE_NBITS is the minimum number of bits that are guaranteed
|
||||
* to be preloadable immediately after REFILL_BITS_IN_FASTLOOP(). (It is *not*
|
||||
* guaranteed after REFILL_BITS(), since REFILL_BITS() falls back to a
|
||||
* byte-at-a-time refill method near the end of input.) This may exceed the
|
||||
* number of consumable bits (counted by 'bitsleft'). Any bits not counted in
|
||||
* 'bitsleft' can only be used for precomputation and cannot be consumed.
|
||||
*/
|
||||
#define CAN_ENSURE(n) ((n) <= REFILL_GUARANTEED_NBITS)
|
||||
#define FASTLOOP_PRELOADABLE_NBITS \
|
||||
(UNALIGNED_ACCESS_IS_FAST ? BITBUF_NBITS : CONSUMABLE_NBITS)
|
||||
|
||||
/*
|
||||
* REFILL_BITS_WORDWISE() branchlessly refills the bitbuffer variable by reading
|
||||
* the next word from the input buffer and updating 'in_next' and 'bitsleft'
|
||||
* based on how many bits were refilled -- counting whole bytes only. This is
|
||||
* much faster than reading a byte at a time, at least if the CPU is little
|
||||
* endian and supports fast unaligned memory accesses.
|
||||
* PRELOAD_SLACK is the minimum number of bits that are guaranteed to be
|
||||
* preloadable but not consumable, following REFILL_BITS_IN_FASTLOOP() and any
|
||||
* subsequent consumptions. This is 1 bit if the branchless refill method is
|
||||
* being used, and 0 bits otherwise.
|
||||
*/
|
||||
#define PRELOAD_SLACK MAX(0, FASTLOOP_PRELOADABLE_NBITS - MAX_BITSLEFT)
|
||||
|
||||
/*
|
||||
* CAN_CONSUME(n) is true if it's guaranteed that if the bitbuffer has just been
|
||||
* refilled, then it's always possible to consume 'n' bits from it. 'n' should
|
||||
* be a compile-time constant, to enable compile-time evaluation.
|
||||
*/
|
||||
#define CAN_CONSUME(n) (CONSUMABLE_NBITS >= (n))
|
||||
|
||||
/*
|
||||
* CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) is true if it's
|
||||
* guaranteed that after REFILL_BITS_IN_FASTLOOP(), it's always possible to
|
||||
* consume 'consume_nbits' bits, then preload 'preload_nbits' bits. The
|
||||
* arguments should be compile-time constants to enable compile-time evaluation.
|
||||
*/
|
||||
#define CAN_CONSUME_AND_THEN_PRELOAD(consume_nbits, preload_nbits) \
|
||||
(CONSUMABLE_NBITS >= (consume_nbits) && \
|
||||
FASTLOOP_PRELOADABLE_NBITS >= (consume_nbits) + (preload_nbits))
|
||||
|
||||
/*
|
||||
* REFILL_BITS_BRANCHLESS() branchlessly refills the bitbuffer variable by
|
||||
* reading the next word from the input buffer and updating 'in_next' and
|
||||
* 'bitsleft' based on how many bits were refilled -- counting whole bytes only.
|
||||
* This is much faster than reading a byte at a time, at least if the CPU is
|
||||
* little endian and supports fast unaligned memory accesses.
|
||||
*
|
||||
* The simplest way of branchlessly updating 'bitsleft' would be:
|
||||
*
|
||||
* bitsleft += (BITBUF_NBITS - bitsleft) & ~7;
|
||||
* bitsleft += (MAX_BITSLEFT - bitsleft) & ~7;
|
||||
*
|
||||
* To make it faster, we define BITBUF_NBITS to be 'WORDBITS - 1' rather than
|
||||
* To make it faster, we define MAX_BITSLEFT to be 'WORDBITS - 1' rather than
|
||||
* WORDBITS, so that in binary it looks like 111111 or 11111. Then, we update
|
||||
* 'bitsleft' just by setting the bits above the low 3 bits:
|
||||
* 'bitsleft' by just setting the bits above the low 3 bits:
|
||||
*
|
||||
* bitsleft |= BITBUF_NBITS & ~7;
|
||||
* bitsleft |= MAX_BITSLEFT & ~7;
|
||||
*
|
||||
* That compiles down to a single instruction like 'or $0x38, %rbp'. Using
|
||||
* 'BITBUF_NBITS == WORDBITS - 1' also has the advantage that refills can be
|
||||
* done when 'bitsleft == BITBUF_NBITS' without invoking undefined behavior.
|
||||
* 'MAX_BITSLEFT == WORDBITS - 1' also has the advantage that refills can be
|
||||
* done when 'bitsleft == MAX_BITSLEFT' without invoking undefined behavior.
|
||||
*
|
||||
* The simplest way of branchlessly updating 'in_next' would be:
|
||||
*
|
||||
* in_next += (BITBUF_NBITS - bitsleft) >> 3;
|
||||
* in_next += (MAX_BITSLEFT - bitsleft) >> 3;
|
||||
*
|
||||
* With 'BITBUF_NBITS == WORDBITS - 1' we could use an XOR instead, though this
|
||||
* With 'MAX_BITSLEFT == WORDBITS - 1' we could use an XOR instead, though this
|
||||
* isn't really better:
|
||||
*
|
||||
* in_next += (BITBUF_NBITS ^ bitsleft) >> 3;
|
||||
* in_next += (MAX_BITSLEFT ^ bitsleft) >> 3;
|
||||
*
|
||||
* An alternative which can be marginally better is the following:
|
||||
*
|
||||
|
@@ -162,22 +204,23 @@ typedef machine_word_t bitbuf_t;
  * extraction instruction (e.g. arm's ubfx), it stays at 3, and is potentially
  * more efficient because the length of the longest dependency chain decreases
  * from 3 to 2. This alternative also has the advantage that it ignores the
- * high bits in 'bitsleft', so it is compatible with the fastloop optimization
- * (described later) where we let the high bits of 'bitsleft' contain garbage.
+ * high bits in 'bitsleft', so it is compatible with the micro-optimization we
+ * use where we let the high bits of 'bitsleft' contain garbage.
  */
-#define REFILL_BITS_WORDWISE() \
-do { \
-	bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft;\
-	in_next += sizeof(bitbuf_t) - 1; \
-	in_next -= (bitsleft >> 3) & 0x7; \
-	bitsleft |= BITBUF_NBITS & ~7; \
+#define REFILL_BITS_BRANCHLESS() \
+do { \
+	bitbuf |= get_unaligned_leword(in_next) << (u8)bitsleft; \
+	in_next += sizeof(bitbuf_t) - 1; \
+	in_next -= (bitsleft >> 3) & 0x7; \
+	bitsleft |= MAX_BITSLEFT & ~7; \
 } while (0)
 
 /*
  * REFILL_BITS() loads bits from the input buffer until the bitbuffer variable
- * contains at least REFILL_GUARANTEED_NBITS bits.
+ * contains at least CONSUMABLE_NBITS consumable bits.
  *
- * This checks for the end of input, and it cannot be used in the fastloop.
+ * This checks for the end of input, and it doesn't guarantee
+ * FASTLOOP_PRELOADABLE_NBITS, so it can't be used in the fastloop.
 *
 * If we would overread the input buffer, we just don't read anything, leaving
 * the bits zeroed but marking them filled. This simplifies the decompressor
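The branchless refill above is the central trick: read a whole little-endian word, shift it past the bits already buffered, then advance `in_next` by however many whole bytes actually fit. Below is a self-contained sketch of the same idea on a plain byte buffer; the names (`bitreader`, `pop_bits`) and the `memcpy` little-endian load standing in for `get_unaligned_leword()` are my own assumptions, not libdeflate's internals:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t bitbuf_t;

/* With the branchless method, at most 63 bits are ever counted as consumable,
 * so 'bitsleft |= 56' and the shift below never invoke undefined behavior. */
#define MAX_BITSLEFT	(8 * sizeof(bitbuf_t) - 1)

struct bitreader {
	const uint8_t *in_next;
	bitbuf_t bitbuf;
	unsigned bitsleft;
};

/* Branchless refill: top up the bitbuffer with whole bytes from the input.
 * No bounds check here; a real decompressor only uses this away from the end
 * of the input buffer. */
static void refill_branchless(struct bitreader *br)
{
	bitbuf_t word;

	memcpy(&word, br->in_next, sizeof(word));	/* little-endian assumed */
	br->bitbuf |= word << (uint8_t)br->bitsleft;
	br->in_next += sizeof(bitbuf_t) - 1;
	br->in_next -= (br->bitsleft >> 3) & 0x7;	/* undo bytes that didn't fit */
	br->bitsleft |= MAX_BITSLEFT & ~7;		/* now 56..63 bits are valid */
}

static uint32_t pop_bits(struct bitreader *br, unsigned n)
{
	uint32_t v = br->bitbuf & (((bitbuf_t)1 << n) - 1);

	br->bitbuf >>= n;
	br->bitsleft -= n;
	return v;
}

int main(void)
{
	const uint8_t data[16] = { 0xb5, 0x01 };	/* bits ...0001_1011_0101 */
	struct bitreader br = { data, 0, 0 };

	refill_branchless(&br);
	printf("%u %u\n", pop_bits(&br, 5), pop_bits(&br, 4));	/* 21 13 */
	return 0;
}
```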
@ -196,121 +239,66 @@ do { \
|
|||
*/
|
||||
#define REFILL_BITS() \
|
||||
do { \
|
||||
if (CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST && \
|
||||
if (UNALIGNED_ACCESS_IS_FAST && \
|
||||
likely(in_end - in_next >= sizeof(bitbuf_t))) { \
|
||||
REFILL_BITS_WORDWISE(); \
|
||||
REFILL_BITS_BRANCHLESS(); \
|
||||
} else { \
|
||||
while (bitsleft < REFILL_GUARANTEED_NBITS) { \
|
||||
while ((u8)bitsleft < CONSUMABLE_NBITS) { \
|
||||
if (likely(in_next != in_end)) { \
|
||||
bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \
|
||||
bitbuf |= (bitbuf_t)*in_next++ << \
|
||||
(u8)bitsleft; \
|
||||
} else { \
|
||||
overread_count++; \
|
||||
SAFETY_CHECK(overread_count <= sizeof(bitbuf)); \
|
||||
} \
|
||||
SAFETY_CHECK(overread_count <= \
|
||||
sizeof(bitbuf_t)); \
|
||||
} \
|
||||
bitsleft += 8; \
|
||||
} \
|
||||
} \
|
||||
} while (0)
|
||||
|
||||
/* ENSURE_BITS(n) calls REFILL_BITS() if fewer than 'n' bits are buffered. */
#define ENSURE_BITS(n) \
/*
 * REFILL_BITS_IN_FASTLOOP() is like REFILL_BITS(), but it doesn't check for the
 * end of the input. It can only be used in the fastloop.
 */
#define REFILL_BITS_IN_FASTLOOP() \
do { \
    if (bitsleft < (n)) \
        REFILL_BITS(); \
    STATIC_ASSERT(UNALIGNED_ACCESS_IS_FAST || \
                  FASTLOOP_PRELOADABLE_NBITS == CONSUMABLE_NBITS); \
    if (UNALIGNED_ACCESS_IS_FAST) { \
        REFILL_BITS_BRANCHLESS(); \
    } else { \
        while ((u8)bitsleft < CONSUMABLE_NBITS) { \
            bitbuf |= (bitbuf_t)*in_next++ << (u8)bitsleft; \
            bitsleft += 8; \
        } \
    } \
} while (0)

#define BITMASK(n) (((bitbuf_t)1 << (n)) - 1)

/* BITS(n) returns the next 'n' buffered bits without removing them. */
#define BITS(n) (bitbuf & BITMASK(n))

/* Macros to save the value of the bitbuffer variable and use it later. */
#define SAVE_BITBUF() (saved_bitbuf = bitbuf)
#define SAVED_BITS(n) (saved_bitbuf & BITMASK(n))

/* REMOVE_BITS(n) removes the next 'n' buffered bits. */
#define REMOVE_BITS(n) (bitbuf >>= (n), bitsleft -= (n))

/* POP_BITS(n) removes and returns the next 'n' buffered bits. */
#define POP_BITS(n) (tmpbits = BITS(n), REMOVE_BITS(n), tmpbits)
/*
 * ALIGN_INPUT() verifies that the input buffer hasn't been overread, then
 * aligns the bitstream to the next byte boundary, discarding any unused bits in
 * the current byte.
 *
 * Note that if the bitbuffer variable currently contains more than 7 bits, then
 * we must rewind 'in_next', effectively putting those bits back. Only the bits
 * in what would be the "current" byte if we were reading one byte at a time can
 * be actually discarded.
 */
#define ALIGN_INPUT() \
do { \
    SAFETY_CHECK(overread_count <= (bitsleft >> 3)); \
    in_next -= (bitsleft >> 3) - overread_count; \
    overread_count = 0; \
    bitbuf = 0; \
    bitsleft = 0; \
} while(0)
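A toy illustration of the rewind performed by ALIGN_INPUT(), with made-up numbers rather than real decoder state:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t buf[16] = {0};
        const uint8_t *in_next = buf + 12;  /* pretend 12 bytes were fetched */
        unsigned bitsleft = 59;             /* 59 bits still buffered */
        unsigned overread_count = 0;        /* no zero-padding was needed */

        /* bitsleft >> 3 == 7: seven whole bytes are unconsumed in the bit
         * buffer, so the byte-aligned position is 7 bytes behind in_next. */
        in_next -= (bitsleft >> 3) - overread_count;
        printf("byte-aligned offset: %td\n", in_next - buf);  /* prints 5 */
        return 0;
    }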
/*
 * Macros used in the "fastloop": the loop that decodes literals and matches
 * while there is still plenty of space left in the input and output buffers.
 *
 * In the fastloop, we improve performance by skipping redundant bounds checks.
 * On platforms where it helps, we also use an optimization where we allow bits
 * 8 and higher of 'bitsleft' to contain garbage. This is sometimes a useful
 * microoptimization because it means the whole 32-bit decode table entry can be
 * subtracted from 'bitsleft' without an intermediate step to convert it to 8
 * bits. (It still needs to be converted to 8 bits for the shift of 'bitbuf',
 * but most CPUs ignore high bits in shift amounts, so that happens implicitly
 * with zero overhead.) REMOVE_ENTRY_BITS_FAST() implements this optimization.
 *
 * MASK_BITSLEFT() is used to clear the garbage bits when leaving the fastloop.
 */
#if CPU_IS_LITTLE_ENDIAN() && UNALIGNED_ACCESS_IS_FAST
# define REFILL_BITS_IN_FASTLOOP() REFILL_BITS_WORDWISE()
# define REMOVE_ENTRY_BITS_FAST(entry) (bitbuf >>= (u8)entry, bitsleft -= entry)
# define GET_REAL_BITSLEFT() ((u8)bitsleft)
# define MASK_BITSLEFT() (bitsleft &= 0xFF)
#else
# define REFILL_BITS_IN_FASTLOOP() \
    while (bitsleft < REFILL_GUARANTEED_NBITS) { \
        bitbuf |= (bitbuf_t)*in_next++ << bitsleft; \
        bitsleft += 8; \
    }
# define REMOVE_ENTRY_BITS_FAST(entry) REMOVE_BITS((u8)entry)
# define GET_REAL_BITSLEFT() bitsleft
# define MASK_BITSLEFT()
#endif
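The 'bitsleft -= entry' trick relied on here can be checked with a toy example (the entry value and bit counts below are invented for illustration): because the decoder never removes more bits than are buffered, the subtraction in the low byte never borrows, so the garbage above bit 7 of 'bitsleft' never contaminates the low byte.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t entry = 0x00a51209;     /* made-up entry; low byte says "remove 9 bits" */
        uint32_t bitsleft = 0xdead0038;  /* garbage above bit 7; really 0x38 = 56 bits */

        bitsleft -= entry;               /* subtract the whole 32-bit entry */

        printf("(u8)bitsleft = %u\n", (unsigned)(uint8_t)bitsleft);  /* prints 47 = 56 - 9 */
        return 0;
    }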
/*
 * This is the worst-case maximum number of output bytes that are written to
 * during each iteration of the fastloop. The worst case is 3 literals, then a
 * match of length DEFLATE_MAX_MATCH_LEN. The match length must be rounded up
 * to a word boundary due to the word-at-a-time match copy implementation.
 * during each iteration of the fastloop. The worst case is 2 literals, then a
 * match of length DEFLATE_MAX_MATCH_LEN. Additionally, some slack space must
 * be included for the intentional overrun in the match copy implementation.
 */
#define FASTLOOP_MAX_BYTES_WRITTEN \
    (3 + ALIGN(DEFLATE_MAX_MATCH_LEN, WORDBYTES))
    (2 + DEFLATE_MAX_MATCH_LEN + (5 * WORDBYTES) - 1)

/*
 * This is the worst-case maximum number of input bytes that are read during
 * each iteration of the fastloop. To get this value, we first compute the
 * greatest number of bits that can be refilled during a loop iteration. The
 * refill at the beginning can add at most BITBUF_NBITS, and the amount that can
 * refill at the beginning can add at most MAX_BITSLEFT, and the amount that can
 * be refilled later is no more than the maximum amount that can be consumed by
 * 3 literals that don't need a subtable, then a match. We convert this value
 * to bytes, rounding up. Finally, we added sizeof(bitbuf_t) to account for
 * REFILL_BITS_WORDWISE() reading up to a word past the part really used.
 * 2 literals that don't need a subtable, then a match. We convert this value
 * to bytes, rounding up; this gives the maximum number of bytes that 'in_next'
 * can be advanced. Finally, we add sizeof(bitbuf_t) to account for
 * REFILL_BITS_BRANCHLESS() reading a word past 'in_next'.
 */
#define FASTLOOP_MAX_BYTES_READ \
    (DIV_ROUND_UP(BITBUF_NBITS + \
                  ((3 * LITLEN_TABLEBITS) + \
                   DEFLATE_MAX_LITLEN_CODEWORD_LEN + \
                   DEFLATE_MAX_EXTRA_LENGTH_BITS + \
                   DEFLATE_MAX_OFFSET_CODEWORD_LEN + \
                   DEFLATE_MAX_EXTRA_OFFSET_BITS), 8) + \
     sizeof(bitbuf_t))
    (DIV_ROUND_UP(MAX_BITSLEFT + (2 * LITLEN_TABLEBITS) + \
                  LENGTH_MAXBITS + OFFSET_MAXBITS, 8) + \
     sizeof(bitbuf_t))
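Plugging in concrete numbers (an assumption-laden sketch: a 64-bit bitbuf_t, so WORDBYTES is 8 and MAX_BITSLEFT is 63, plus the usual DEFLATE limits of a 258-byte maximum match, 15-bit maximum codewords, and 5/13 maximum extra bits for lengths/offsets) gives 299 bytes written and 25 bytes read per fastloop iteration:

    #include <stdio.h>

    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    int main(void)
    {
        int wordbytes = 8, max_bitsleft = 63, litlen_tablebits = 11;
        int length_maxbits = 15 + 5, offset_maxbits = 15 + 13;

        int max_written = 2 + 258 + (5 * wordbytes) - 1;
        int max_read = DIV_ROUND_UP(max_bitsleft + 2 * litlen_tablebits +
                                    length_maxbits + offset_maxbits, 8) + wordbytes;

        printf("FASTLOOP_MAX_BYTES_WRITTEN = %d\n", max_written);  /* 299 */
        printf("FASTLOOP_MAX_BYTES_READ    = %d\n", max_read);     /* 25 */
        return 0;
    }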
/*****************************************************************************
 *                             Huffman decoding                              *
@@ -358,24 +346,37 @@ do { \
 * take longer, which decreases performance. We choose values that work well in
 * practice, making subtables rarely needed without making the tables too large.
 *
 * Our choice of OFFSET_TABLEBITS == 8 is a bit low; without any special
 * considerations, 9 would fit the trade-off curve better. However, there is a
 * performance benefit to using exactly 8 bits when it is a compile-time
 * constant, as many CPUs can take the low byte more easily than the low 9 bits.
 *
 * zlib treats its equivalents of TABLEBITS as maximum values; whenever it
 * builds a table, it caps the actual table_bits to the longest codeword. This
 * makes sense in theory, as there's no need for the table to be any larger than
 * needed to support the longest codeword. However, having the table bits be a
 * compile-time constant is beneficial to the performance of the decode loop, so
 * there is a trade-off. libdeflate currently uses the dynamic table_bits
 * strategy for the litlen table only, due to its larger maximum size.
 * PRECODE_TABLEBITS and OFFSET_TABLEBITS are smaller, so going dynamic there
 * isn't as useful, and OFFSET_TABLEBITS=8 is useful as mentioned above.
 *
 * Each TABLEBITS value has a corresponding ENOUGH value that gives the
 * worst-case maximum number of decode table entries, including the main table
 * and all subtables. The ENOUGH value depends on three parameters:
 *
 *   (1) the maximum number of symbols in the code (DEFLATE_NUM_*_SYMS)
 *   (2) the number of main table bits (the corresponding TABLEBITS value)
 *   (2) the maximum number of main table bits (*_TABLEBITS)
 *   (3) the maximum allowed codeword length (DEFLATE_MAX_*_CODEWORD_LEN)
 *
 * The ENOUGH values were computed using the utility program 'enough' from zlib.
 */
#define PRECODE_TABLEBITS 7
#define PRECODE_ENOUGH 128 /* enough 19 7 7 */

#define LITLEN_TABLEBITS 11
#define LITLEN_ENOUGH 2342 /* enough 288 11 15 */

#define OFFSET_TABLEBITS 9
#define OFFSET_ENOUGH 594 /* enough 32 9 15 */
#define OFFSET_TABLEBITS 8
#define OFFSET_ENOUGH 402 /* enough 32 8 15 */
/*
 * make_decode_table_entry() creates a decode table entry for the given symbol
@@ -387,7 +388,7 @@ do { \
 * appropriately-formatted decode table entry. See the definitions of the
 * *_decode_results[] arrays below, where the entry format is described.
 */
static inline u32
static forceinline u32
make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len)
{
    return decode_results[sym] + (len << 8) + len;
@@ -398,8 +399,8 @@ make_decode_table_entry(const u32 decode_results[], u32 sym, u32 len)
 * described contain zeroes:
 *
 * Bit 20-16: presym
 * Bit 10-8: codeword_len [not used]
 * Bit 2-0: codeword_len
 * Bit 10-8: codeword length [not used]
 * Bit 2-0: codeword length
 *
 * The precode decode table never has subtables, since we use
 * PRECODE_TABLEBITS == DEFLATE_MAX_PRE_CODEWORD_LEN.
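As a worked example of the packing (a standalone sketch; the 'presym << 16' static part is an assumption inferred from the bit layout above, not copied from precode_decode_results[]):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy re-creation of make_decode_table_entry() for a precode symbol. */
    static uint32_t pack_precode_entry(uint32_t presym, uint32_t len)
    {
        return (presym << 16) + (len << 8) + len;
    }

    int main(void)
    {
        uint32_t entry = pack_precode_entry(17, 3);

        printf("presym = %u, bits to remove = %u\n",
               (unsigned)((entry >> 16) & 0x1f),   /* presym = 17 */
               (unsigned)(entry & 0xff));          /* bits = 3 */
        return 0;
    }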
@@ -431,6 +432,12 @@ static const u32 precode_decode_results[] = {
/* Indicates an end-of-block entry in the litlen decode table */
#define HUFFDEC_END_OF_BLOCK 0x00002000

/* Maximum number of bits that can be consumed by decoding a match length */
#define LENGTH_MAXBITS (DEFLATE_MAX_LITLEN_CODEWORD_LEN + \
                        DEFLATE_MAX_EXTRA_LENGTH_BITS)
#define LENGTH_MAXFASTBITS (LITLEN_TABLEBITS /* no subtable needed */ + \
                            DEFLATE_MAX_EXTRA_LENGTH_BITS)

/*
 * Here is the format of our litlen decode table entries. Bits not explicitly
 * described contain zeroes:
@@ -464,7 +471,8 @@ static const u32 precode_decode_results[] = {
 * Bit 15: 1 (HUFFDEC_EXCEPTIONAL)
 * Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER)
 * Bit 13: 0 (!HUFFDEC_END_OF_BLOCK)
 * Bit 3-0: number of subtable bits
 * Bit 11-8: number of subtable bits
 * Bit 3-0: number of main table bits
 *
 * This format has several desirable properties:
 *
@@ -481,20 +489,26 @@ static const u32 precode_decode_results[] = {
 *
 * - The low byte is the number of bits that need to be removed from the
 *   bitstream; this makes this value easily accessible, and it enables the
 *   optimization used in REMOVE_ENTRY_BITS_FAST(). It also includes the
 *   number of extra bits, so they don't need to be removed separately.
 *   micro-optimization of doing 'bitsleft -= entry' instead of
 *   'bitsleft -= (u8)entry'. It also includes the number of extra bits,
 *   so they don't need to be removed separately.
 *
 * - The flags in bits 13-15 are arranged to be 0 when the number of
 *   non-extra bits (the value in bits 11-8) is needed, making this value
 * - The flags in bits 15-13 are arranged to be 0 when the
 *   "remaining codeword length" in bits 11-8 is needed, making this value
 *   fairly easily accessible as well via a shift and downcast.
 *
 * - Similarly, bits 13-12 are 0 when the "subtable bits" in bits 11-8 are
 *   needed, making it possible to extract this value with '& 0x3F' rather
 *   than '& 0xF'. This value is only used as a shift amount, so this can
 *   save an 'and' instruction as the masking by 0x3F happens implicitly.
 *
 * litlen_decode_results[] contains the static part of the entry for each
 * symbol. make_decode_table_entry() produces the final entries.
 */
static const u32 litlen_decode_results[] = {

    /* Literals */
#define ENTRY(literal) (((u32)literal << 16) | HUFFDEC_LITERAL)
#define ENTRY(literal) (HUFFDEC_LITERAL | ((u32)literal << 16))
    ENTRY(0) , ENTRY(1) , ENTRY(2) , ENTRY(3) ,
    ENTRY(4) , ENTRY(5) , ENTRY(6) , ENTRY(7) ,
    ENTRY(8) , ENTRY(9) , ENTRY(10) , ENTRY(11) ,
@@ -578,6 +592,12 @@ static const u32 litlen_decode_results[] = {
#undef ENTRY
};
/* Maximum number of bits that can be consumed by decoding a match offset */
#define OFFSET_MAXBITS (DEFLATE_MAX_OFFSET_CODEWORD_LEN + \
                        DEFLATE_MAX_EXTRA_OFFSET_BITS)
#define OFFSET_MAXFASTBITS (OFFSET_TABLEBITS /* no subtable needed */ + \
                            DEFLATE_MAX_EXTRA_OFFSET_BITS)

/*
 * Here is the format of our offset decode table entries. Bits not explicitly
 * described contain zeroes:
@@ -592,7 +612,8 @@ static const u32 litlen_decode_results[] = {
 * Bit 31-16: index of start of subtable
 * Bit 15: 1 (HUFFDEC_EXCEPTIONAL)
 * Bit 14: 1 (HUFFDEC_SUBTABLE_POINTER)
 * Bit 3-0: number of subtable bits
 * Bit 11-8: number of subtable bits
 * Bit 3-0: number of main table bits
 *
 * These work the same way as the length entries and subtable pointer entries in
 * the litlen decode table; see litlen_decode_results[] above.
@@ -607,15 +628,20 @@ static const u32 offset_decode_results[] = {
    ENTRY(257 , 7) , ENTRY(385 , 7) , ENTRY(513 , 8) , ENTRY(769 , 8) ,
    ENTRY(1025 , 9) , ENTRY(1537 , 9) , ENTRY(2049 , 10) , ENTRY(3073 , 10) ,
    ENTRY(4097 , 11) , ENTRY(6145 , 11) , ENTRY(8193 , 12) , ENTRY(12289 , 12) ,
    ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(32769 , 14) , ENTRY(49153 , 14) ,
    ENTRY(16385 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) , ENTRY(24577 , 13) ,
#undef ENTRY
};
/*
 * The main DEFLATE decompressor structure. Since this implementation only
 * supports full buffer decompression, this structure does not store the entire
 * decompression state, but rather only some arrays that are too large to
 * comfortably allocate on the stack.
 * The main DEFLATE decompressor structure. Since libdeflate only supports
 * full-buffer decompression, this structure doesn't store the entire
 * decompression state, most of which is in stack variables. Instead, this
 * struct just contains the decode tables and some temporary arrays used for
 * building them, as these are too large to comfortably allocate on the stack.
 *
 * Storing the decode tables in the decompressor struct also allows the decode
 * tables for the static codes to be reused whenever two static Huffman blocks
 * are decoded without an intervening dynamic block, even across streams.
 */
struct libdeflate_decompressor {

@@ -648,6 +674,7 @@ struct libdeflate_decompressor {
    u16 sorted_syms[DEFLATE_MAX_NUM_SYMS];

    bool static_codes_loaded;
    unsigned litlen_tablebits;
};

/*
@@ -678,11 +705,16 @@ struct libdeflate_decompressor {
 *   make the final decode table entries using make_decode_table_entry().
 * @table_bits
 *   The log base-2 of the number of main table entries to use.
 *   If @table_bits_ret != NULL, then @table_bits is treated as a maximum
 *   value and it will be decreased if a smaller table would be sufficient.
 * @max_codeword_len
 *   The maximum allowed codeword length for this Huffman code.
 *   Must be <= DEFLATE_MAX_CODEWORD_LEN.
 * @sorted_syms
 *   A temporary array of length @num_syms.
 * @table_bits_ret
 *   If non-NULL, then the dynamic table_bits is enabled, and the actual
 *   table_bits value will be returned here.
 *
 * Returns %true if successful; %false if the codeword lengths do not form a
 * valid Huffman code.
@@ -692,9 +724,10 @@ build_decode_table(u32 decode_table[],
                   const u8 lens[],
                   const unsigned num_syms,
                   const u32 decode_results[],
                   const unsigned table_bits,
                   const unsigned max_codeword_len,
                   u16 *sorted_syms)
                   unsigned table_bits,
                   unsigned max_codeword_len,
                   u16 *sorted_syms,
                   unsigned *table_bits_ret)
{
    unsigned len_counts[DEFLATE_MAX_CODEWORD_LEN + 1];
    unsigned offsets[DEFLATE_MAX_CODEWORD_LEN + 1];
@@ -714,6 +747,17 @@ build_decode_table(u32 decode_table[],
    for (sym = 0; sym < num_syms; sym++)
        len_counts[lens[sym]]++;

    /*
     * Determine the actual maximum codeword length that was used, and
     * decrease table_bits to it if allowed.
     */
    while (max_codeword_len > 1 && len_counts[max_codeword_len] == 0)
        max_codeword_len--;
    if (table_bits_ret != NULL) {
        table_bits = MIN(table_bits, max_codeword_len);
        *table_bits_ret = table_bits;
    }
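For a sense of what the cap buys: DEFLATE's static litlen code never uses codewords longer than 9 bits (a DEFLATE constant, not taken from this file), so with the dynamic table_bits the litlen main table shrinks from 2^11 to 2^9 entries. A sketch with hard-coded numbers:

    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
        unsigned table_bits = 11;       /* LITLEN_TABLEBITS */
        unsigned max_codeword_len = 9;  /* longest codeword of the static litlen code */

        table_bits = MIN(table_bits, max_codeword_len);
        printf("main table entries: %u\n", 1u << table_bits);  /* 512 rather than 2048 */
        return 0;
    }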

    /*
     * Sort the symbols primarily by increasing codeword length and
     * secondarily by increasing symbol value; or equivalently by their
@@ -919,16 +963,13 @@ build_decode_table(u32 decode_table[],

            /*
             * Create the entry that points from the main table to
             * the subtable. This entry contains the index of the
             * start of the subtable and the number of bits with
             * which the subtable is indexed (the log base 2 of the
             * number of entries it contains).
             * the subtable.
             */
            decode_table[subtable_prefix] =
                ((u32)subtable_start << 16) |
                HUFFDEC_EXCEPTIONAL |
                HUFFDEC_SUBTABLE_POINTER |
                subtable_bits;
                (subtable_bits << 8) | table_bits;
        }

        /* Fill the subtable entries for the current codeword. */
@@ -969,7 +1010,8 @@ build_precode_decode_table(struct libdeflate_decompressor *d)
                              precode_decode_results,
                              PRECODE_TABLEBITS,
                              DEFLATE_MAX_PRE_CODEWORD_LEN,
                              d->sorted_syms);
                              d->sorted_syms,
                              NULL);
}

/* Build the decode table for the literal/length code. */
@@ -989,7 +1031,8 @@ build_litlen_decode_table(struct libdeflate_decompressor *d,
                              litlen_decode_results,
                              LITLEN_TABLEBITS,
                              DEFLATE_MAX_LITLEN_CODEWORD_LEN,
                              d->sorted_syms);
                              d->sorted_syms,
                              &d->litlen_tablebits);
}

/* Build the decode table for the offset code. */
@@ -998,7 +1041,7 @@ build_offset_decode_table(struct libdeflate_decompressor *d,
                          unsigned num_litlen_syms, unsigned num_offset_syms)
{
    /* When you change TABLEBITS, you must change ENOUGH, and vice versa! */
    STATIC_ASSERT(OFFSET_TABLEBITS == 9 && OFFSET_ENOUGH == 594);
    STATIC_ASSERT(OFFSET_TABLEBITS == 8 && OFFSET_ENOUGH == 402);

    STATIC_ASSERT(ARRAY_LEN(offset_decode_results) ==
                  DEFLATE_NUM_OFFSET_SYMS);
@@ -1009,27 +1052,8 @@ build_offset_decode_table(struct libdeflate_decompressor *d,
                              offset_decode_results,
                              OFFSET_TABLEBITS,
                              DEFLATE_MAX_OFFSET_CODEWORD_LEN,
                              d->sorted_syms);
}

static forceinline machine_word_t
repeat_byte(u8 b)
{
    machine_word_t v;

    STATIC_ASSERT(WORDBITS == 32 || WORDBITS == 64);

    v = b;
    v |= v << 8;
    v |= v << 16;
    v |= v << ((WORDBITS == 64) ? 32 : 0);
    return v;
}

static forceinline void
copy_word_unaligned(const void *src, void *dst)
{
    store_word_unaligned(load_word_unaligned(src), dst);
                              d->sorted_syms,
                              NULL);
}

/*****************************************************************************
@@ -1037,12 +1061,15 @@ copy_word_unaligned(const void *src, void *dst)
 *****************************************************************************/

typedef enum libdeflate_result (*decompress_func_t)
    (struct libdeflate_decompressor *d,
     const void *in, size_t in_nbytes, void *out, size_t out_nbytes_avail,
    (struct libdeflate_decompressor * restrict d,
     const void * restrict in, size_t in_nbytes,
     void * restrict out, size_t out_nbytes_avail,
     size_t *actual_in_nbytes_ret, size_t *actual_out_nbytes_ret);

#define FUNCNAME deflate_decompress_default
#define ATTRIBUTES
#undef ATTRIBUTES
#undef EXTRACT_VARBITS
#undef EXTRACT_VARBITS8
#include "decompress_template.h"

/* Include architecture-specific implementation(s) if available. */
@@ -156,6 +156,8 @@ typedef char __v64qi __attribute__((__vector_size__(64)));
#define HAVE_BMI2_TARGET \
    (HAVE_DYNAMIC_X86_CPU_FEATURES && \
     (GCC_PREREQ(4, 7) || __has_builtin(__builtin_ia32_pdep_di)))
#define HAVE_BMI2_INTRIN \
    (HAVE_BMI2_NATIVE || (HAVE_BMI2_TARGET && HAVE_TARGET_INTRINSICS))

#endif /* __i386__ || __x86_64__ */
@@ -4,18 +4,46 @@
#include "cpu_features.h"

/* BMI2 optimized version */
#if HAVE_BMI2_TARGET && !HAVE_BMI2_NATIVE
# define FUNCNAME deflate_decompress_bmi2
# define ATTRIBUTES __attribute__((target("bmi2")))
#if HAVE_BMI2_INTRIN
# define deflate_decompress_bmi2 deflate_decompress_bmi2
# define FUNCNAME deflate_decompress_bmi2
# if !HAVE_BMI2_NATIVE
# define ATTRIBUTES __attribute__((target("bmi2")))
# endif
/*
 * Even with __attribute__((target("bmi2"))), gcc doesn't reliably use the
 * bzhi instruction for 'word & BITMASK(count)'. So use the bzhi intrinsic
 * explicitly. EXTRACT_VARBITS() is equivalent to 'word & BITMASK(count)';
 * EXTRACT_VARBITS8() is equivalent to 'word & BITMASK((u8)count)'.
 * Nevertheless, their implementation using the bzhi intrinsic is identical,
 * as the bzhi instruction truncates the count to 8 bits implicitly.
 */
# ifndef __clang__
# include <immintrin.h>
# ifdef __x86_64__
# define EXTRACT_VARBITS(word, count) _bzhi_u64((word), (count))
# define EXTRACT_VARBITS8(word, count) _bzhi_u64((word), (count))
# else
# define EXTRACT_VARBITS(word, count) _bzhi_u32((word), (count))
# define EXTRACT_VARBITS8(word, count) _bzhi_u32((word), (count))
# endif
# endif
# include "../decompress_template.h"
#endif /* HAVE_BMI2_INTRIN */
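The claimed equivalence is easy to spot-check on an x86-64 machine with BMI2; the file name and build line below are just an example:

    /* bzhi_demo.c: build with  gcc -O2 -mbmi2 bzhi_demo.c  on x86-64 */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t word = 0x0123456789abcdefULL;
        unsigned count = 13;

        uint64_t a = _bzhi_u64(word, count);               /* clear bits >= count */
        uint64_t b = word & (((uint64_t)1 << count) - 1);  /* the plain mask form */

        printf("bzhi: %#llx  mask: %#llx  equal: %d\n",
               (unsigned long long)a, (unsigned long long)b, a == b);
        return 0;
    }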
#if defined(deflate_decompress_bmi2) && HAVE_BMI2_NATIVE
#define DEFAULT_IMPL deflate_decompress_bmi2
#else
static inline decompress_func_t
arch_select_decompress_func(void)
{
#ifdef deflate_decompress_bmi2
    if (HAVE_BMI2(get_x86_cpu_features()))
        return deflate_decompress_bmi2;
#endif
    return NULL;
}
# define arch_select_decompress_func arch_select_decompress_func
#define arch_select_decompress_func arch_select_decompress_func
#endif

#endif /* LIB_X86_DECOMPRESS_IMPL_H */
@@ -28,7 +28,9 @@
#ifndef LIB_X86_MATCHFINDER_IMPL_H
#define LIB_X86_MATCHFINDER_IMPL_H

#ifdef __AVX2__
#include "cpu_features.h"

#if HAVE_AVX2_NATIVE
# include <immintrin.h>
static forceinline void
matchfinder_init_avx2(mf_pos_t *data, size_t size)
@@ -73,7 +75,7 @@ matchfinder_rebase_avx2(mf_pos_t *data, size_t size)
}
#define matchfinder_rebase matchfinder_rebase_avx2

#elif defined(__SSE2__)
#elif HAVE_SSE2_NATIVE
# include <emmintrin.h>
static forceinline void
matchfinder_init_sse2(mf_pos_t *data, size_t size)
@@ -117,6 +119,6 @@ matchfinder_rebase_sse2(mf_pos_t *data, size_t size)
    } while (size != 0);
}
#define matchfinder_rebase matchfinder_rebase_sse2
#endif /* __SSE2__ */
#endif /* HAVE_SSE2_NATIVE */

#endif /* LIB_X86_MATCHFINDER_IMPL_H */

68 src/3rdparty/libdeflate/libdeflate.h vendored
@@ -10,33 +10,36 @@ extern "C" {
#endif

#define LIBDEFLATE_VERSION_MAJOR 1
#define LIBDEFLATE_VERSION_MINOR 13
#define LIBDEFLATE_VERSION_STRING "1.13"
#define LIBDEFLATE_VERSION_MINOR 14
#define LIBDEFLATE_VERSION_STRING "1.14"

#include <stddef.h>
#include <stdint.h>

/*
 * On Windows, if you want to link to the DLL version of libdeflate, then
 * #define LIBDEFLATE_DLL. Note that the calling convention is "cdecl".
 * On Windows, you must define LIBDEFLATE_STATIC if you are linking to the
 * static library version of libdeflate instead of the DLL. On other platforms,
 * LIBDEFLATE_STATIC has no effect.
 */
#ifdef LIBDEFLATE_DLL
# ifdef BUILDING_LIBDEFLATE
# define LIBDEFLATEEXPORT LIBEXPORT
# elif defined(_WIN32) || defined(__CYGWIN__)
#ifdef _WIN32
# if defined(LIBDEFLATE_STATIC)
# define LIBDEFLATEEXPORT
# elif defined(BUILDING_LIBDEFLATE)
# define LIBDEFLATEEXPORT __declspec(dllexport)
# else
# define LIBDEFLATEEXPORT __declspec(dllimport)
# endif
#endif
#ifndef LIBDEFLATEEXPORT
# define LIBDEFLATEEXPORT
#else
# define LIBDEFLATEEXPORT __attribute__((visibility("default")))
#endif
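Minimal consumer-side sketch of the new convention (hypothetical program, assuming the header is on the include path): an application linking the static library on Windows defines LIBDEFLATE_STATIC before including the header; on other platforms the define is harmless.

    /* app.c (hypothetical): link against the static libdeflate on Windows */
    #define LIBDEFLATE_STATIC   /* must come before the include when linking statically */
    #include <libdeflate.h>
    #include <stdio.h>

    int main(void)
    {
        printf("built against libdeflate %s\n", LIBDEFLATE_VERSION_STRING);
        return 0;
    }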
#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && \
    defined(_WIN32) && !defined(_WIN64)
#if defined(BUILDING_LIBDEFLATE) && defined(__GNUC__) && defined(__i386__)
/*
 * On 32-bit Windows, gcc assumes 16-byte stack alignment but MSVC only 4.
 * Realign the stack when entering libdeflate to avoid crashing in SSE/AVX
 * code when called from an MSVC-compiled application.
 * On i386, gcc assumes that the stack is 16-byte aligned at function entry.
 * However, some compilers (e.g. MSVC) and programming languages (e.g.
 * Delphi) only guarantee 4-byte alignment when calling functions. Work
 * around this ABI incompatibility by realigning the stack pointer when
 * entering libdeflate. This prevents crashes in SSE/AVX code.
 */
# define LIBDEFLATEAPI __attribute__((force_align_arg_pointer))
#else
@@ -72,10 +75,35 @@ libdeflate_alloc_compressor(int compression_level);
/*
 * libdeflate_deflate_compress() performs raw DEFLATE compression on a buffer of
 * data. The function attempts to compress 'in_nbytes' bytes of data located at
 * 'in' and write the results to 'out', which has space for 'out_nbytes_avail'
 * bytes. The return value is the compressed size in bytes, or 0 if the data
 * could not be compressed to 'out_nbytes_avail' bytes or fewer.
 * data. It attempts to compress 'in_nbytes' bytes of data located at 'in' and
 * write the results to 'out', which has space for 'out_nbytes_avail' bytes.
 * The return value is the compressed size in bytes, or 0 if the data could not
 * be compressed to 'out_nbytes_avail' bytes or fewer (but see note below).
 *
 * If compression is successful, then the output data is guaranteed to be a
 * valid DEFLATE stream that decompresses to the input data. No other
 * guarantees are made about the output data. Notably, different versions of
 * libdeflate can produce different compressed data for the same uncompressed
 * data, even at the same compression level. Do ***NOT*** do things like
 * writing tests that compare compressed data to a golden output, as this can
 * break when libdeflate is updated. (This property isn't specific to
 * libdeflate; the same is true for zlib and other compression libraries too.)
 *
 * Note: due to a performance optimization, libdeflate_deflate_compress()
 * currently needs a small amount of slack space at the end of the output
 * buffer. As a result, it can't actually report compressed sizes very close to
 * 'out_nbytes_avail'. This doesn't matter in real-world use cases, and
 * libdeflate_deflate_compress_bound() already includes the slack space.
 * However, it does mean that testing code that redundantly compresses data
 * using an exact-sized output buffer won't work as might be expected:
 *
 *     out_nbytes = libdeflate_deflate_compress(c, in, in_nbytes, out,
 *                         libdeflate_deflate_compress_bound(in_nbytes));
 *     // The following assertion will fail.
 *     assert(libdeflate_deflate_compress(c, in, in_nbytes, out, out_nbytes) != 0);
 *
 * To avoid this, either don't write tests like the above, or make sure to
 * include at least 9 bytes of slack space in 'out_nbytes_avail'.
 */
LIBDEFLATEEXPORT size_t LIBDEFLATEAPI
libdeflate_deflate_compress(struct libdeflate_compressor *compressor,
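A minimal sketch of the calling pattern described above (error handling trimmed; libdeflate_deflate_compress_bound() and libdeflate_free_compressor() belong to the same API but their declarations fall outside this excerpt):

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <libdeflate.h>

    int main(void)
    {
        const char *in = "example example example example";
        size_t in_nbytes = strlen(in);

        struct libdeflate_compressor *c = libdeflate_alloc_compressor(6);
        size_t bound = libdeflate_deflate_compress_bound(c, in_nbytes);
        void *out = malloc(bound);

        size_t out_nbytes = libdeflate_deflate_compress(c, in, in_nbytes, out, bound);
        if (out_nbytes == 0)
            printf("did not fit (should not happen when sizing with the bound)\n");
        else
            printf("compressed %zu -> %zu bytes\n", in_nbytes, out_nbytes);

        free(out);
        libdeflate_free_compressor(c);
        return 0;
    }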