I'm working on a large blog post investigating the performance characteristics of various PNG decoders across several programming languages. A large part of this work is profiling and benchmarking them.

The first decoder I've been looking at as part of this work is image-rs/png, which is the most downloaded PNG decoder in the rust ecosystem.

To start this kind of work, I like getting some cursory profiles and benchmarks just to see what the libraries are doing. The image I like to start with is this PNG of the periodic table of elements from Wikimedia Commons. The file is in the public domain, so we can do whatever we want with it, and at ~2.2mb it's a pretty good size.

To benchmark image-rs/png, I have to create a binary that uses it, since by itself it's just a library. I've created a simple rust program that looks like this:

fn main() {
    const BYTES: &[u8] = include_bytes!("Periodic_table_large.png");

    let decoder = png::Decoder::new(BYTES);
    let mut reader = decoder.read_info().unwrap();

    let height = reader.info().height;
    let row_length = reader.info().raw_row_length();

    // Reserve space for every decoded row up front so the buffer never has
    // to grow inside the decoding loop.
    let mut buffer = Vec::with_capacity(height as usize * row_length);

    while let Some(row) = reader.next_row().unwrap() {
        buffer.extend_from_slice(row.data());
    }

    std::hint::black_box(buffer);
}

This isn't perfect, but it works for my purposes right now. A more robust solution would skip the command-line binary (and the process startup overhead that comes with it) and benchmark only the PNG operations themselves, but at the start I just want a rough idea of how the library performs.

I've made a few optimizations to improve the benchmark. Namely, I avoid file IO by using rust's include_bytes! macro, which embeds the entire contents of the file in the binary at compile time instead of reading it at runtime, and I pre-allocate the output buffer so that it never has to grow during the benchmark.

I'm using std::hint::black_box to make sure the compiler doesn't optimize anything differently just because we aren't actually using the result of the decoding.

To run this benchmark I'm going to use hyperfine, which is a great tool for ad hoc benchmarking of command line utilities.

cargo b --release
hyperfine ./target/release/test-png

gives us

Benchmark 1: ./target/release/test-png
  Time (mean ± σ):     253.5 ms ±   3.3 ms    [User: 211.7 ms, System: 39.9 ms]
  Range (min … max):   250.6 ms … 262.4 ms    11 runs

So about 250ms to decode a 2mb image. Is that reasonable? We don't really know -- this is the first decoder we're looking at. Intuitively it feels a bit slow. This is executing on a dedicated Linux server that's not running any other programs.

Out of curiosity, let's profile the binary using perf to see where it's spending its time.

perf record -e cpu-clock ./target/release/test-png

prints

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.047 MB perf.data (1017 samples) ]

and creates a perf.data file. We can read this file using perf report.

perf report

This opens an interactive viewer in our terminal that looks something like this:

perf report output

Reading this output isn't too complex. The first column displays a percentage of execution spent in the function. The last column shows the symbol name, which is either the actual name of a function or a memory address.

Does this profile make sense?

The second line looks like it belongs. We'd expect decoding the next interlaced row to be the bulk of the time in a benchmark that loops over the rows of a PNG file. image-rs/png uses #![forbid(unsafe_code)], which means that unless they're using an external crate, they likely don't use handwritten SIMD intrinsics to decode PNG filters. LLVM can sometimes do an ok job of autovectorizing, but in general we'd expect the equivalent higher-level code not to be as fast. It makes sense that this might be a bit slow.
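
For reference, that's a crate-level attribute: with it in place, any unsafe block anywhere in the crate becomes a hard compile error. A minimal sketch of what it looks like:

#![forbid(unsafe_code)]

fn main() {
    // Writing `unsafe { ... }` anywhere in this crate now fails to compile,
    // with the error pointing back at the forbid attribute above.
}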

So we expect that most of the time would be spent in png::decoder::Reader<R>::next_raw_interlaced_row, but what about miniz_oxide::inflate::core::transfer?

From the name it's clear that this is coming from the DEFLATE library that image-rs/png relies on. It sounds like all it's doing is copying data from one part of a buffer to another. If we look at the implementation in miniz_oxide, we can see that's exactly what it's doing:

#[inline]
fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    for _ in 0..match_len >> 2 {
        out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
        out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
        out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
        out_slice[out_pos + 3] = out_slice[(source_pos + 3) & out_buf_size_mask];
        source_pos += 4;
        out_pos += 4;
    }

    match match_len & 3 {
        0 => (),
        1 => out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask],
        2 => {
            out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
            out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
        }
        3 => {
            out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
            out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
            out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
        }
        _ => unreachable!(),
    }
}

Given a buffer, out_slice, this function reads bytes from within out_slice starting at source_pos and copies them to out_slice starting at out_pos. The specific way it does the copying is a bit more complex, but that's the gist of it.
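
Why copy within the same buffer at all? This is the LZ77 match-copy step of DEFLATE: a back-reference says "copy match_len bytes from some distance behind the current output position", and that distance is allowed to be smaller than the length, so the source and destination ranges can overlap. Here's a minimal standalone sketch of that rule (the values are made up; this isn't miniz_oxide code):

fn main() {
    // An LZ77-style match copy: `length` bytes copied from `distance` bytes
    // behind the end of the output, one byte at a time.
    let mut out = b"ab".to_vec();
    let (distance, length) = (2usize, 5usize);
    for _ in 0..length {
        let byte = out[out.len() - distance];
        out.push(byte);
    }
    // Because distance < length, the match overlaps bytes it just produced.
    assert_eq!(out, b"abababa".to_vec());
}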

Our PNG decoder spends a third of its time copying bytes. That raises a few red flags. Modern memcpy is pretty fast and shouldn't be a bottleneck for our decoder. This function probably isn't being optimized to a straight memcpy, but that's the upper bound we should be targeting, and I think we should be able to get pretty close. Let's look at the disassembly to see what's going wrong. When we look at the disassembly, let's ignore the bottom match statement. That's only executed once per function call and should have a much smaller impact on perf compared to the loop [1].

miniz_oxide::inflate::core::transfer:
        push    rax
        cmp     r8, 4
        jb      .LBB0_11
        ; match_len >> 2
        shr     r8, 2
.LBB0_2:
        ; out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
        mov     rax, rdx
        and     rax, r9
        cmp     rax, rsi
        jae     .LBB0_12
        cmp     rcx, rsi
        jae     .LBB0_14
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rdi + rcx], al

        ; out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
        inc     rdx
        mov     rax, rdx
        and     rax, r9
        cmp     rax, rsi
        jae     .LBB0_15
        lea     r10, [rcx + 1]
        cmp     r10, rsi
        jae     .LBB0_16
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rdi + rcx + 1], al

        ; out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
        inc     rdx
        mov     rax, rdx
        and     rax, r9
        cmp     rax, rsi
        jae     .LBB0_18
        lea     r10, [rcx + 2]
        cmp     r10, rsi
        jae     .LBB0_19
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rdi + rcx + 2], al

        ; out_slice[out_pos + 3] = out_slice[(source_pos + 3) & out_buf_size_mask];
        inc     rdx
        mov     rax, rdx
        and     rax, r9
        cmp     rax, rsi
        jae     .LBB0_20
        lea     r10, [rcx + 3]
        cmp     r10, rsi
        jae     .LBB0_21
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rdi + rcx + 3], al

        ; source_pos += 4;
        ; out_pos += 4;
        add     rcx, 4
        inc     rdx
        dec     r8
        jne     .LBB0_2
.LBB0_11:
        pop     rax
        ret

; ... several lines of panic handling code are omitted

Let's break the disassembly down into chunks.

We see a block like this repeated 4 times.

mov     rax, rdx
and     rax, r9
cmp     rax, rsi
jae     .LBB0_12
cmp     rcx, rsi
jae     .LBB0_14
movzx   eax, byte ptr [rdi + rax]
mov     byte ptr [rdi + rcx], al

Each instance corresponds to a line of the form out_slice[out_pos + X] = out_slice[(source_pos + X) & out_buf_size_mask]; in the original source. Since we compiled this on Linux, we're using the System V ABI's calling convention. We pass in out_slice as the first argument. In rust, slices are represented by a pointer and a length. These values get passed to our function in rdi and rsi respectively. The next parameter, in this case source_pos, is passed to the function in rdx. So we move the value of source_pos from the register rdx to the register rax. Then we bitmask it with r9, which contains the value of out_buf_size_mask.

Then we compare that bitmasked value to the length of out_slice stored in the register rsi. The jae instruction, jump if above or equal, does exactly what it sounds like. If rax is greater than or equal to the length, then we jump to .LBB0_12, which sets us up to panic. This is an array bounds check. We have a similar check in the two instructions below, where we check that out_pos is also in the bounds of the slice.

Following that, we load the byte at out_slice[source_pos & out_buf_size_mask] into the register eax. Then we move the lower 8 bits of that register, al, to out_slice[out_pos].

We repeat this 4 times until we get to the end of the loop.

add     rcx, 4
inc     rdx
dec     r8
jne     .LBB0_2

Then we add 4 to out_pos. During iteration we've already been adding to source_pos, so here we just have to increment its value by 1. Then we subtract 1 from our loop counter and jump to the start of the loop if it isn't zero.
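
To keep the register assignments straight for the rest of the disassembly, here's the full mapping of transfer's arguments as we've pieced it together so far:

rdi  pointer to out_slice
rsi  length of out_slice
rdx  source_pos
rcx  out_pos
r8   match_len (shifted right by 2 at the top, then used as the loop counter)
r9   out_buf_size_mask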

You might already be able to see a bit of inefficiency, even if you aren't that familiar with assembly. On every iteration we perform 2 bounds checks per line, for a total of 8 bounds checks per iteration. Bounds checks do add overhead, but much like integer overflow checks, they aren't by themselves that slow. In our case, going out of bounds is exceptional and should never happen during the course of regular execution, which means the branch predictor should do a really good job here.

Typically the problem with integer overflow checks and bounds checks is that they prevent other optimizations, like autovectorization. Perhaps that's what's happening here -- we're doing a lot of bounds checks inside a tight loop and so LLVM can't optimize it that well. Let's see what removing the bounds checks does to the codegen.

Let's start by trying to remove them in safe rust. We can try asserting one large length condition at the start of the function, and hope that LLVM is able to elide the later checks.

The new code looks something like this:

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    assert!(out_slice.len() > (match_len >> 2) * 4 + out_pos - 1);
    assert!(out_slice.len() > (match_len >> 2) * 4 + source_pos - 1);
    for _ in 0..match_len >> 2 {
        out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
        out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
        out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
        out_slice[out_pos + 3] = out_slice[(source_pos + 3) & out_buf_size_mask];
        source_pos += 4;
        out_pos += 4;
    }
}

Remember we're omitting the final match statement for simplicity. Let's look at the disassembly for the first line in our loop body now.

; out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
mov     rax, rdx
and     rax, r9
cmp     rax, rsi
jae     .LBB0_16
cmp     rcx, rsi
jae     .LBB0_18
movzx   eax, byte ptr [rdi + rax]
mov     byte ptr [rdi + rcx], al

The bounds checks are still there :/

LLVM wasn't smart enough to elide them based on our assert at the start of the function. Admittedly the check was a bit complex. What if we try asserting at the start of the loop body? That should be a lot simpler for the compiler to reason about.

Our new code looks like this

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    for _ in 0..match_len >> 2 {
        assert!(out_slice.len() > out_pos + 3);
        assert!(out_slice.len() > source_pos + 3);
        out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
        out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
        out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
        out_slice[out_pos + 3] = out_slice[(source_pos + 3) & out_buf_size_mask];
        source_pos += 4;
        out_pos += 4;
    }
}

Looking at the assembly,

; out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
lea     rax, [rdx + rbx]
and     rax, r9
cmp     rax, rsi
jae     .LBB0_12
lea     r11, [rcx + rbx]
cmp     r11, rsi
jae     .LBB0_14
movzx   eax, byte ptr [rdi + rax]
mov     byte ptr [r14 + rbx - 3], al

The cmp and jae are still there twice for every line in the loop. At this point let's just give up and use unsafe to see if what we're trying to do will actually have a meaningful impact on the codegen. We can use slice::get_unchecked and slice::get_unchecked_mut to elide bounds checks that we know are safe.

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    assert!(out_slice.len() > (match_len >> 2) * 4 + out_pos - 1);
    assert!(out_slice.len() > (match_len >> 2) * 4 + source_pos - 1);

    for _ in 0..match_len >> 2 {
        unsafe {
            *out_slice.get_unchecked_mut(out_pos) =
                *out_slice.get_unchecked(source_pos & out_buf_size_mask);
            *out_slice.get_unchecked_mut(out_pos + 1) =
                *out_slice.get_unchecked((source_pos + 1) & out_buf_size_mask);
            *out_slice.get_unchecked_mut(out_pos + 2) =
                *out_slice.get_unchecked((source_pos + 2) & out_buf_size_mask);
            *out_slice.get_unchecked_mut(out_pos + 3) =
                *out_slice.get_unchecked((source_pos + 3) & out_buf_size_mask);
        }
        source_pos += 4;
        out_pos += 4;
    }
}

We'll add back our original asserts to make sure we don't accidentally do any out-of-bounds reads. What does the codegen for this look like?

example::transfer:
        push    rax
        mov     r10, r8
        and     r10, -4
        lea     rax, [r10 + rcx]
        cmp     rax, rsi
        jae     .LBB0_6
        add     r10, rdx
        cmp     r10, rsi
        jae     .LBB0_7
        cmp     r8, 4
        jb      .LBB0_5
        shr     r8, 2
        cmp     r8, 2
        mov     r10d, 1
        cmovae  r10, r8
        add     rcx, rdi
        add     rcx, 3
        xor     esi, esi
.LBB0_4:
        mov     rax, rdx
        and     rax, r9
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rcx + 4*rsi - 3], al
        lea     rax, [rdx + 1]
        and     rax, r9
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rcx + 4*rsi - 2], al
        lea     rax, [rdx + 2]
        and     rax, r9
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rcx + 4*rsi - 1], al
        lea     rax, [rdx + 3]
        and     rax, r9
        movzx   eax, byte ptr [rdi + rax]
        mov     byte ptr [rcx + 4*rsi], al
        lea     rax, [rsi + 1]
        add     rdx, 4
        mov     rsi, rax
        cmp     r10, rax
        jne     .LBB0_4
.LBB0_5:
        pop     rax
        ret
.LBB0_6:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_2]
        mov     esi, 66
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2
.LBB0_7:
        lea     rdi, [rip + .L__unnamed_3]
        lea     rdx, [rip + .L__unnamed_4]
        mov     esi, 69
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2

No bounds checks! And a lot fewer instructions than what we started with (55 lines vs 82 originally). So how much faster is the code without bounds checks?

Let's start by cloning the miniz_oxide repo and patching our dependency to point at the local checkout so we can modify it.

If our original Cargo.toml looked like this,

[dependencies]
png = "0.17.7"

we just have to add this section

[patch.crates-io]
miniz_oxide = { path = "../miniz_oxide/miniz_oxide" }

Before we start making changes, let's make a copy of the original binary we used to benchmark so we can use it as a point of comparison for our changes.

cp ./target/release/test-png ./original

Next let's replace the old implementation of transfer with our new one. Then we can build and compare the two implementations.

cargo b --release
hyperfine ./original ./target/release/test-png --warmup 5

which prints

Benchmark 1: ./original
  Time (mean ± σ):     255.9 ms ±   4.5 ms    [User: 216.9 ms, System: 35.5 ms]
  Range (min … max):   251.7 ms … 265.2 ms    11 runs
 
Benchmark 2: ./target/release/test-png
  Time (mean ± σ):     254.5 ms ±   5.0 ms    [User: 214.7 ms, System: 38.4 ms]
  Range (min … max):   250.2 ms … 267.3 ms    11 runs
 
Summary
    './target/release/test-png' ran
    1.01 ± 0.03 times faster than './original'

A 1% improvement. These results are within the noise, even though we removed the bounds checks. We are doing a bit of new work now for the 2 assertions at the start of the function, but that wouldn't be the cause of such a small speed-up. Fundamentally, removing the bounds checks isn't buying us anything here, and the optimizer isn't able to do much better even with them gone. If we want to speed up this function, we have to start thinking about the algorithm.

Looking at the assembly, right now our implementation is entirely scalar. We load one array element at a time, mask it, and then copy it to the out position. It should be trivial for the compiler to vectorize this, so what's preventing it?

There are two problems here. First, we mask the array indices with out_buf_size_mask, which means it's possible for us to be forced to copy non-consecutive elements from the array. Non-consecutive reads may prevent us from using SIMD here at all.

The other issue is that our data is highly dependent on previous calculations. This is especially apparent when source_pos and out_pos differ by less than 4. We can see this if we think through an example where our array is [1, 2, 3, 4, 5, 6, 7, 8], source_pos is 0, and out_pos is 2. We can pretend we don't do any masking for now. Here's what the code looks like if we substitute concrete values for source_pos and out_pos:

out_slice = [1, 2, 3, 4, 5, 6, 7, 8]

out_slice[2] = out_slice[0]
out_slice[3] = out_slice[1]

# !!!
out_slice[4] = out_slice[2]
out_slice[5] = out_slice[3]

After the first iteration, the array is now [1, 2, 1, 2, 1, 2, 7, 8].

On the third line of our loop, we depend on the results of the first line of our loop. If we tried to load all the values for this iteration at once, we wouldn't get the correct result. We'll have to be really creative if we end up having to work around this dependency in the general case [2].
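
To make the dependency concrete, here's a standalone sketch (using the example values above) comparing the sequential byte-by-byte copy with a copy that reads all of its source bytes up front:

fn main() {
    // Byte by byte, in order -- what transfer actually does.
    let mut sequential = [1u8, 2, 3, 4, 5, 6, 7, 8];
    for i in 0..4 {
        sequential[2 + i] = sequential[i];
    }
    assert_eq!(sequential, [1, 2, 1, 2, 1, 2, 7, 8]);

    // Reading all four source bytes before writing any of them -- what a
    // naive wide load and store would do.
    let mut bulk = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let snapshot = bulk;
    for i in 0..4 {
        bulk[2 + i] = snapshot[i];
    }
    assert_eq!(bulk, [1, 2, 1, 2, 3, 4, 7, 8]); // not the same result
}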

Let's look at the first problem -- masking can cause non-consecutive reads. Is there any way around this? One thing we could look at is the value we're masking by. If it's always the same value, we might be able to make some interesting optimizations. To inspect its value, we can use an ad hoc profiling tool called counts that I really love. To use it, we just need to insert some prints into the code and then pipe the results to counts.

Our new code looks like this:

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    println!("{}", out_buf_size_mask);
    for _ in 0..match_len >> 2 {
        out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
        out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
        out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
        out_slice[out_pos + 3] = out_slice[(source_pos + 3) & out_buf_size_mask];
        source_pos += 4;
        out_pos += 4;
    }
}

Pretty simple. Then to use it:

cargo r --release | counts

We run in release mode just because it's prohibitively slow to execute in debug mode. After running this, we get our results:

255047 counts
(  1)   255047 (100.0%,100.0%): 18446744073709551615

Looking at the output, the first number in parentheses is just the line number of the output. The second number is the count for the given value. The last number in the output is the value that we printed.

Based on this, we can see that the mask value is always the same. It looks like some large 64 bit integer. I don't have special 64 bit integers memorized, so let's open up a python repl to see what it looks like in binary.

>>> bin(18446744073709551615)
'0b1111111111111111111111111111111111111111111111111111111111111111'

The number is all ones. Let's check how long it is. The number we're hoping for is 64:

>>> len(_) - 2
64

Ok, so in every case that we call this function for this input, we don't actually do any masking. That's our first problem solved. We can just special-case our code when out_buf_size_mask == usize::MAX.

Our next issue is the data dependency when the difference between out_pos and source_pos is small enough that they overlap. Let's see how often this comes up in practice. We can use counts here again to help us:

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    println!("{}", source_pos.abs_diff(out_pos));
    
    // ...
}

cargo r --release | counts

gives us

255047 counts
(  1)   253652 (99.5%, 99.5%): 1
(  2)      516 ( 0.2%, 99.7%): 4
(  3)      150 ( 0.1%, 99.7%): 1200
(  4)       46 ( 0.0%, 99.7%): 24001
(  5)       24 ( 0.0%, 99.7%): 2400
(  6)       16 ( 0.0%, 99.7%): 8
(  7)       12 ( 0.0%, 99.8%): 281
(  8)       12 ( 0.0%, 99.8%): 285
(  9)       11 ( 0.0%, 99.8%): 120
( 10)       10 ( 0.0%, 99.8%): 265
( 11)       10 ( 0.0%, 99.8%): 3600
( 12)       10 ( 0.0%, 99.8%): 4800
( 13)        8 ( 0.0%, 99.8%): 273
( 14)        8 ( 0.0%, 99.8%): 277
( 15)        8 ( 0.0%, 99.8%): 305
... 215 more results omitted

Quite a bit of variance. But if we look at the percentages, 99.5% of cases have a difference of 1. The rest of the cases all have pretty low counts relative to the total. So now we know which case we want to investigate.

Right now we're checking the absolute difference, so we're not actually checking whether out_pos is greater than or less than source_pos. This can affect the changes that we're making, so let's see which one is more common.

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    println!("{:?}", source_pos.cmp(&out_pos));
    
    // ...
}

cargo r --release | counts

gives us

255047 counts
(  1)   255047 (100.0%,100.0%): Less

source_pos is always less than out_pos. That means our fast path only has to handle that one case, and everything else can fall back to the existing loop. That's pretty helpful for our optimization work. Now we can start trying to come up with a fast path:

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    if out_buf_size_mask == usize::MAX && source_pos.abs_diff(out_pos) == 1 {
        // our super-fast specialized code goes here
    } else {
        for _ in 0..match_len >> 2 {
            out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
            out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
            out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
            out_slice[out_pos + 3] = out_slice[(source_pos + 3) & out_buf_size_mask];
            source_pos += 4;
            out_pos += 4;
        }
    }
}

Let's revisit our example from before. If we start with an array of [1, 2, 3, 4, 5, 6, 7, 8], let's say source_pos is 0, and out_pos is 1 this time. We don't have to worry about masking because we know the number we're masking by is usize::MAX. Let's think through what the array looks like after 1 iteration.

out_slice = [1, 2, 3, 4, 5, 6, 7, 8]

# [1, 2, 3, 4, 5, 6, 7, 8]
out_slice[1] = out_slice[0]
# [1, 1, 3, 4, 5, 6, 7, 8]
out_slice[2] = out_slice[1]
# [1, 1, 1, 4, 5, 6, 7, 8]
out_slice[3] = out_slice[2]
# [1, 1, 1, 1, 5, 6, 7, 8]
out_slice[4] = out_slice[3]

There's a pretty clear -- and exciting -- pattern here. The whole range just gets filled with a single value. So if we calculate the end position, we can pretty succinctly do memset(&out_slice[out_pos], out_slice[source_pos], end - out_pos). In rust we can get this codegen pretty easily with out_slice[out_pos..end].fill(out_slice[source_pos]). To calculate the end, we can reuse some of our assertion code from before.
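
As a quick sanity check, here's a standalone sketch that plugs in the example values from above and confirms a single fill produces the same result as stepping through the four copies by hand:

fn main() {
    let mut out_slice = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let (source_pos, out_pos, match_len) = (0usize, 1usize, 4usize);

    // The fast path: one read, then a fill over the whole destination range.
    let fill_byte = out_slice[source_pos];
    let end = (match_len >> 2) * 4 + out_pos;
    out_slice[out_pos..end].fill(fill_byte);

    // Same array we arrived at in the walkthrough above.
    assert_eq!(out_slice, [1, 1, 1, 1, 1, 6, 7, 8]);
}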

Our new code is

fn transfer(
    out_slice: &mut [u8],
    mut source_pos: usize,
    mut out_pos: usize,
    match_len: usize,
    out_buf_size_mask: usize,
) {
    if out_buf_size_mask == usize::MAX && source_pos.abs_diff(out_pos) == 1 {
        let fill_byte = out_slice[source_pos];
        let end = (match_len >> 2) * 4 + out_pos;
        out_slice[out_pos..end].fill(fill_byte);
        source_pos = end - 1;
        out_pos = end;
    } else {
        for _ in 0..match_len >> 2 {
            out_slice[out_pos] = out_slice[source_pos & out_buf_size_mask];
            out_slice[out_pos + 1] = out_slice[(source_pos + 1) & out_buf_size_mask];
            out_slice[out_pos + 2] = out_slice[(source_pos + 2) & out_buf_size_mask];
            out_slice[out_pos + 3] = out_slice[(source_pos + 3) & out_buf_size_mask];
            source_pos += 4;
            out_pos += 4;
        }
    }
}

We can't compute the fill_byte as an argument to .fill(..) because of rust's borrow checker, so we have to pull it out into a separate variable. We also have to update source_pos and out_pos in our fast path because the later match statement depends on their values.
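
Concretely, the one-liner version is rejected because the slice is already mutably borrowed for the fill when we try to read the source byte from it. A small sketch of the problem and the workaround (fill_run is a hypothetical helper, not part of miniz_oxide):

fn fill_run(out_slice: &mut [u8], source_pos: usize, out_pos: usize, end: usize) {
    // Rejected by the borrow checker: `out_slice[out_pos..end]` takes a
    // mutable borrow of the slice, and `out_slice[source_pos]` tries to read
    // from it inside the same call.
    //
    //     out_slice[out_pos..end].fill(out_slice[source_pos]);

    // Reading the byte first ends the shared borrow before the mutable
    // borrow for the fill begins.
    let fill_byte = out_slice[source_pos];
    out_slice[out_pos..end].fill(fill_byte);
}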

Now let's compile this and compare it to our original binary.

cargo b --release
hyperfine ./original ./target/release/test-png --warmup 5

which prints

Benchmark 1: ./original
  Time (mean ± σ):     253.1 ms ±   3.1 ms    [User: 207.9 ms, System: 43.4 ms]
  Range (min … max):   250.5 ms … 260.7 ms    11 runs
 
Benchmark 2: ./target/release/test-png
  Time (mean ± σ):     164.8 ms ±   1.3 ms    [User: 129.9 ms, System: 34.0 ms]
  Range (min … max):   163.7 ms … 168.1 ms    17 runs
 
Summary
    './target/release/test-png' ran
    1.54 ± 0.02 times faster than './original'

A roughly 1.5x improvement. This benchmark measures how long it takes to decode an entire PNG, so this would be a pretty meaningful optimization. Let's double check that we didn't break anything by running miniz_oxide's test suite.

cd ../miniz_oxide
cargo t --all
test result: ok. 22 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

# ... full test results omitted

All of the tests pass; this is likely a sound optimization. At this point we should be comfortable making a PR with our changes.

This patch was submitted as #131 to the miniz_oxide repo. It contains a bit more than described here. miniz_oxide has its own benchmark suite, and after looking at it we can make a few small changes to improve performance there as well.

The results of our optimization are pretty striking: for some inputs we see a more than 2x improvement. miniz_oxide is a port of a C library (miniz) that has roughly the same performance characteristics, which means that, after our changes, miniz_oxide is now also more than 2x faster than the original C implementation for these inputs.

The relevant benchmarks are:

// before
test oxide::decompress_compressed_lvl_1 ... bench:      89,253 ns/iter (+/- 1,412)
test oxide::decompress_compressed_lvl_6 ... bench:     176,299 ns/iter (+/- 6,266)
test oxide::decompress_compressed_lvl_9 ... bench:     175,840 ns/iter (+/- 3,131)
// after
test oxide::decompress_compressed_lvl_1 ... bench:      83,992 ns/iter (+/- 1,347)
test oxide::decompress_compressed_lvl_6 ... bench:      76,640 ns/iter (+/- 2,288)
test oxide::decompress_compressed_lvl_9 ... bench:      76,791 ns/iter (+/- 1,978)
[1]: In practice this was verified by looking at the annotated disassembly in perf report.

[2]: An implementation of this problem in the general case can be found here, though it ends up not being much faster.