Skip to content

Replace platform-specific SIMD with portable wide crate#127

Closed
lilith wants to merge 3 commits into
ImageOptim:mainfrom
lilith:pr-safe-simd
Closed

Replace platform-specific SIMD with portable wide crate#127
lilith wants to merge 3 commits into
ImageOptim:mainfrom
lilith:pr-safe-simd

Conversation

@lilith

@lilith lilith commented Jan 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace platform-specific SSE2/NEON intrinsics in f_pixel::diff() with portable wide::f32x4
  • Remove HistSortTmp union in favor of plain u32 (union was unnecessary since PalIndex fits in u32)
  • Remove unnecessary get_unchecked in qsort_pivot (3 bounds checks is negligible overhead)

Details

The diff() function had three separate implementations using unsafe intrinsics:

  • #[cfg(target_arch = "x86_64")] with SSE2
  • #[cfg(target_arch = "aarch64")] with NEON
  • Scalar fallback

Now there's a single portable implementation using wide::f32x4 that compiles to equivalent SIMD on both architectures. A scalar reference implementation is retained under #[cfg(test)] with a brute-force comparison test that verifies both implementations match across edge cases.

Test plan

  • All existing tests pass
  • New diff_simd_matches_scalar test verifies SIMD matches scalar reference
  • cargo clippy passes
Replace hand-written SSE2 (x86_64) and NEON (aarch64) intrinsics in
f_pixel::diff() with safe, portable SIMD using the wide crate's f32x4.

- Direct bytemuck cast from ARGBF to f32x4 (both are Pod)
- Single implementation works on all platforms (x86, ARM, WASM, etc.)
- Includes scalar reference and brute-force comparison test
- ~70 lines removed, eliminates all unsafe in this function
The union was used to reinterpret the same memory as either mc_sort_value
(u32) during median cut sorting, or likely_palette_index (PalIndex) after.

Since PalIndex fits in u32, we can simply store u32 and cast on read.
This eliminates unsafe union field access with no runtime cost.
This sorts 3 pivot indices and accesses them 3 times total.
The bounds check overhead is negligible compared to the partitioning
work that follows. Safe indexing is simpler and equally fast.
@lilith

lilith commented Jan 18, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: MaybeUninit removal feasibility

I've prototyped making MaybeUninit opt-in via an unsafe-uninit feature flag. The default path uses zero-initialized buffers instead, which eliminates most unsafe code for pure Rust users.

Benchmark results (512×512 image, 3 runs each)

Mode quantize remap/dither histogram
Safe (zero-init) 35.9 / 35.7 / 36.9 ms 11.6 / 11.9 / 11.5 ms 5.5 / 6.4 / 5.3 ms
MaybeUninit 35.5 / 35.2 / 35.9 ms 11.9 / 11.2 / 11.5 ms 5.7 / 5.9 / 5.9 ms

The difference is within noise (~1-3%). The zero-initialization cost is negligible compared to the actual quantization work.

Changes required

  • Slot<T> type alias: MaybeUninit<T> or T based on feature
  • Helper functions that work in both modes
  • Image::new_fn gated behind unsafe-uninit (callback API inherently requires unsafe)
  • _internal_c_ffi implies unsafe-uninit (C users keep current behavior)

Would you be interested in this approach for the main repo? It would allow #![deny(unsafe_code)] for pure Rust builds while maintaining backward compatibility for C FFI users.

@lilith lilith closed this Feb 8, 2026
@kornelski kornelski reopened this Feb 10, 2026
@kornelski

Copy link
Copy Markdown
Member

Thanks for the PR. Sorry, I'm busy and can't review PRs in a timely manner.

It's nice to have portable SIMD, but it's a shame it requires adding another dependency. I'd rather keep the existing implementations for x86 and ARM.

How about making the wide dependency target-specific and used only on non-arm non-x86 platforms?

@kornelski

kornelski commented Feb 11, 2026

Copy link
Copy Markdown
Member

I'd prefer changes in separate PRs, so I can merge them separately.

The removal of union is fine, but it left the code setting the field a bit cryptic. I've cherry-picked it and added a setter. 21a6396

@kornelski

kornelski commented Feb 11, 2026

Copy link
Copy Markdown
Member

I really wanted to get rid of panics in quicksort. It used to have only that one bad case, but I checked it now, and there was half a dozen other missed cases ;( LLVM's optimizations there have regressed a lot. I suspect it's all due to if/else making LLVM forget which variables were in bounds.

I fudged it to be panic-free. Hopefully will get all the perf back when LLVM fixes regressions.

@lilith

lilith commented Feb 11, 2026

Copy link
Copy Markdown
Contributor Author

Sorry for the lousy PR, I actually did abandon this approach and should have commented when I closed it. Specialized intrinsics might be best for this crate.

wide is great for wasm 128 and all compile-time optimization - really a beautiful approach that gets 80% of the perf gains. Unfortunately, it doesn't do runtime 'upgrade' based on CPU features, so you have to combine it with the multiversion/multiversed macros to generate specialized, optimized versions with runtime dispatch. And since wide is compile-time, it produces substandard intrinsic and we have to rely on LVVM to hopefully upgrade them. Which it does absurdly well about 60% of the time.

But wide probably isn't for libimagequant. Nor is archmage, but I wanted to mention it since I've been using it in this type of library a lot recently.

I've moved about a seven projects over to my archmage crate, which allows #!forbid(unsafe_code] + runtime dispatch + safe core::arch + safe_unaligned_simd for SIMD load/store/mem ops. Normally, intrinsics are a pain, but when they can be compile-time-verified-safe, I don't mind having Claude generate and brute-force-test them against the scalar implementations. I made magetypes as a token-proof runtime version of wide, but I'm getting less and less invested in portable simd-style solutions as testing and compile-time safety become enough.

// Your code:
#[arcane]
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}

// Generated:
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, a: __m256, b: __m256) -> __m256 {
        _mm256_add_ps(a, b)  // Safe inside #[target_feature]!
    }
    // SAFETY: Token proves CPU support was verified
    unsafe { __inner(token, a, b) }
}

use archmage::{Desktop64, Arm64, Simd128Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // Try x86 AVX2
    if let Some(token) = Desktop64::summon() {
        return process_x86(token, data);
    }

    // Try ARM NEON
    if let Some(token) = Arm64::summon() {
        return process_arm(token, data);
    }

    // Try WASM SIMD
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }

    // Scalar fallback
    process_scalar(data);
}

#[arcane]
fn process_x86(token: Desktop64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_arm(token: Arm64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_wasm(token: Simd128Token, data: &mut [f32]) { /* ... */ }

fn process_scalar(data: &mut [f32]) { /* ... */ }
@kornelski kornelski closed this Feb 11, 2026
@lilith lilith deleted the pr-safe-simd branch March 24, 2026 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants