Replace platform-specific SIMD with portable wide crate by lilith · Pull Request #127 · ImageOptim/libimagequant

lilith · 2026-01-18T10:37:04Z

Summary

Replace platform-specific SSE2/NEON intrinsics in f_pixel::diff() with portable wide::f32x4
Remove HistSortTmp union in favor of plain u32 (union was unnecessary since PalIndex fits in u32)
Remove unnecessary get_unchecked in qsort_pivot (3 bounds checks is negligible overhead)

Details

The diff() function had three separate implementations using unsafe intrinsics:

#[cfg(target_arch = "x86_64")] with SSE2
#[cfg(target_arch = "aarch64")] with NEON
Scalar fallback

Now there's a single portable implementation using wide::f32x4 that compiles to equivalent SIMD on both architectures. A scalar reference implementation is retained under #[cfg(test)] with a brute-force comparison test that verifies both implementations match across edge cases.

Test plan

All existing tests pass
New diff_simd_matches_scalar test verifies SIMD matches scalar reference
cargo clippy passes

Replace hand-written SSE2 (x86_64) and NEON (aarch64) intrinsics in f_pixel::diff() with safe, portable SIMD using the wide crate's f32x4. - Direct bytemuck cast from ARGBF to f32x4 (both are Pod) - Single implementation works on all platforms (x86, ARM, WASM, etc.) - Includes scalar reference and brute-force comparison test - ~70 lines removed, eliminates all unsafe in this function

The union was used to reinterpret the same memory as either mc_sort_value (u32) during median cut sorting, or likely_palette_index (PalIndex) after. Since PalIndex fits in u32, we can simply store u32 and cast on read. This eliminates unsafe union field access with no runtime cost.

This sorts 3 pivot indices and accesses them 3 times total. The bounds check overhead is negligible compared to the partitioning work that follows. Safe indexing is simpler and equally fast.

lilith · 2026-01-18T13:25:54Z

Follow-up: MaybeUninit removal feasibility

I've prototyped making MaybeUninit opt-in via an unsafe-uninit feature flag. The default path uses zero-initialized buffers instead, which eliminates most unsafe code for pure Rust users.

Benchmark results (512×512 image, 3 runs each)

Mode	quantize	remap/dither	histogram
Safe (zero-init)	35.9 / 35.7 / 36.9 ms	11.6 / 11.9 / 11.5 ms	5.5 / 6.4 / 5.3 ms
MaybeUninit	35.5 / 35.2 / 35.9 ms	11.9 / 11.2 / 11.5 ms	5.7 / 5.9 / 5.9 ms

The difference is within noise (~1-3%). The zero-initialization cost is negligible compared to the actual quantization work.

Changes required

Slot<T> type alias: MaybeUninit<T> or T based on feature
Helper functions that work in both modes
Image::new_fn gated behind unsafe-uninit (callback API inherently requires unsafe)
_internal_c_ffi implies unsafe-uninit (C users keep current behavior)

Would you be interested in this approach for the main repo? It would allow #![deny(unsafe_code)] for pure Rust builds while maintaining backward compatibility for C FFI users.

kornelski · 2026-02-11T00:01:42Z

Thanks for the PR. Sorry, I'm busy and can't review PRs in a timely manner.

It's nice to have portable SIMD, but it's a shame it requires adding another dependency. I'd rather keep the existing implementations for x86 and ARM.

How about making the wide dependency target-specific and used only on non-arm non-x86 platforms?

kornelski · 2026-02-11T01:07:08Z

I'd prefer changes in separate PRs, so I can merge them separately.

The removal of union is fine, but it left the code setting the field a bit cryptic. I've cherry-picked it and added a setter. 21a6396

kornelski · 2026-02-11T01:07:34Z

I really wanted to get rid of panics in quicksort. It used to have only that one bad case, but I checked it now, and there was half a dozen other missed cases ;( LLVM's optimizations there have regressed a lot. I suspect it's all due to if/else making LLVM forget which variables were in bounds.

I fudged it to be panic-free. Hopefully will get all the perf back when LLVM fixes regressions.

lilith · 2026-02-11T01:32:59Z

Sorry for the lousy PR, I actually did abandon this approach and should have commented when I closed it. Specialized intrinsics might be best for this crate.

wide is great for wasm 128 and all compile-time optimization - really a beautiful approach that gets 80% of the perf gains. Unfortunately, it doesn't do runtime 'upgrade' based on CPU features, so you have to combine it with the multiversion/multiversed macros to generate specialized, optimized versions with runtime dispatch. And since wide is compile-time, it produces substandard intrinsic and we have to rely on LVVM to hopefully upgrade them. Which it does absurdly well about 60% of the time.

But wide probably isn't for libimagequant. Nor is archmage, but I wanted to mention it since I've been using it in this type of library a lot recently.

I've moved about a seven projects over to my archmage crate, which allows #!forbid(unsafe_code] + runtime dispatch + safe core::arch + safe_unaligned_simd for SIMD load/store/mem ops. Normally, intrinsics are a pain, but when they can be compile-time-verified-safe, I don't mind having Claude generate and brute-force-test them against the scalar implementations. I made magetypes as a token-proof runtime version of wide, but I'm getting less and less invested in portable simd-style solutions as testing and compile-time safety become enough.

// Your code:
#[arcane]
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}

// Generated:
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, a: __m256, b: __m256) -> __m256 {
        _mm256_add_ps(a, b)  // Safe inside #[target_feature]!
    }
    // SAFETY: Token proves CPU support was verified
    unsafe { __inner(token, a, b) }
}

use archmage::{Desktop64, Arm64, Simd128Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // Try x86 AVX2
    if let Some(token) = Desktop64::summon() {
        return process_x86(token, data);
    }

    // Try ARM NEON
    if let Some(token) = Arm64::summon() {
        return process_arm(token, data);
    }

    // Try WASM SIMD
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }

    // Scalar fallback
    process_scalar(data);
}

#[arcane]
fn process_x86(token: Desktop64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_arm(token: Arm64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_wasm(token: Simd128Token, data: &mut [f32]) { /* ... */ }

fn process_scalar(data: &mut [f32]) { /* ... */ }

lilith added 3 commits January 18, 2026 03:35

Remove unnecessary get_unchecked in qsort_pivot

4bf8a30

This sorts 3 pivot indices and accesses them 3 times total. The bounds check overhead is negligible compared to the partitioning work that follows. Safe indexing is simpler and equally fast.

lilith closed this Feb 8, 2026

kornelski reopened this Feb 10, 2026

kornelski closed this Feb 11, 2026

lilith deleted the pr-safe-simd branch March 24, 2026 19:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Replace platform-specific SIMD with portable wide crate#127

Replace platform-specific SIMD with portable wide crate#127
lilith wants to merge 3 commits into
ImageOptim:mainfrom
lilith:pr-safe-simd

lilith commented Jan 18, 2026

lilith commented Jan 18, 2026

kornelski commented Feb 11, 2026

kornelski commented Feb 11, 2026 •

edited

Loading

kornelski commented Feb 11, 2026 •

edited

Loading

lilith commented Feb 11, 2026

Labels

2 participants

Uh oh!

Conversation

lilith commented Jan 18, 2026

Summary

Details

Test plan

lilith commented Jan 18, 2026

Follow-up: MaybeUninit removal feasibility

Benchmark results (512×512 image, 3 runs each)

Changes required

kornelski commented Feb 11, 2026

kornelski commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

kornelski commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

lilith commented Feb 11, 2026

Labels

2 participants

kornelski commented Feb 11, 2026 •

edited

Loading

kornelski commented Feb 11, 2026 •

edited

Loading