In C++, I'm trying to obtain a NumPy array from a PDF page using PDFium:

py::array_t<uint8_t> render_page_helper(FPDF_PAGE page, int target_width = 0, int target_height = 0, int dpi = 80) {
    int width, height;

    if (target_width > 0 && target_height > 0) {
        width = target_width;
        height = target_height;
    } else {
        width = static_cast<int>(FPDF_GetPageWidth(page) * dpi / 72.0);
        height = static_cast<int>(FPDF_GetPageHeight(page) * dpi / 72.0);
    }

    FPDF_BITMAP bitmap = FPDFBitmap_Create(width, height, 1);
    if (!bitmap) throw std::runtime_error("Failed to create bitmap");

    FPDFBitmap_FillRect(bitmap, 0, 0, width, height, 0xFFFFFFFF);
    FPDF_RenderPageBitmap(bitmap, page, 0, 0, width, height, 0, FPDF_ANNOT);

    int stride = FPDFBitmap_GetStride(bitmap);
    uint8_t* buffer = static_cast<uint8_t*>(FPDFBitmap_GetBuffer(bitmap));

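    // Copy into a NumPy array of shape (height, width, 4) in BGRA order.
    // Pass the row stride explicitly, since rows may be padded beyond width * 4 bytes.
    auto result = py::array_t<uint8_t>({height, width, 4}, {stride, 4, 1}, buffer);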
    FPDFBitmap_Destroy(bitmap);
    return result;
}

The result then gets passed back into Python and processed with:

arr = arr_bgra[:, :, [2, 1, 0]]

This drops the alpha channel and reorders the remaining channels into RGB.
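As a quick sanity check of that indexing, here is a minimal sketch on a made-up 1x2 BGRA array. The pixel values are invented for illustration and are not taken from a real PDFium render.

```python
import numpy as np

# Hypothetical 1x2 BGRA image: one pure-blue pixel and one pure-red pixel, both opaque.
arr_bgra = np.array([[[255, 0, 0, 255], [0, 0, 255, 255]]], dtype=np.uint8)

# Fancy indexing selects the channels in R, G, B order and leaves out index 3 (alpha).
arr_rgb = arr_bgra[:, :, [2, 1, 0]]

print(arr_rgb.shape)            # (1, 2, 3)
print(arr_rgb[0, 0].tolist())   # [0, 0, 255] -> the blue pixel, now in RGB order
print(arr_rgb[0, 1].tolist())   # [255, 0, 0] -> the red pixel, now in RGB order
```

Because the index is a list, NumPy returns a new array rather than a view, so the result does not depend on the original buffer.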

When given an image, I currently handle it with stb_image:

py::array_t<uint8_t> render_image(const std::string& filename, int target_width = 224, int target_height = 224) {
    int width, height, channels;
    unsigned char* rgba = stbi_load(filename.c_str(), &width, &height, &channels, 4); // force RGBA
    if (!rgba) throw std::runtime_error("Failed to load image");

    // Temporary buffer (still RGBA after resize)
    std::vector<uint8_t> resized(target_width * target_height * 4);
    stbir_resize_uint8(rgba, width, height, 0,
                       resized.data(), target_width, target_height, 0, 4);
    stbi_image_free(rgba);

    // Allocate Python-owned buffer for final RGB output
    py::array_t<uint8_t> result({target_height, target_width, 3});
    auto buf = result.mutable_unchecked<3>();

    // Convert RGBA → RGB (drop alpha)
    for (int y = 0; y < target_height; ++y) {
        for (int x = 0; x < target_width; ++x) {
            int idx = (y * target_width + x) * 4;
            buf(y, x, 0) = resized[idx + 0]; // R
            buf(y, x, 1) = resized[idx + 1]; // G
            buf(y, x, 2) = resized[idx + 2]; // B
        }
    }

    return result;
}

This processes the image and returns a NumPy array directly.

Both work well. However, when given a PDF and an image with identical content, the two pipelines produce very different arrays.

I've tried switching image renderers and even converting both outputs to PIL Images, to no avail. Is it even possible to get similar results without ditching PDFium? Using it is more or less a requirement.

Here's the minimal working example of this problem: https://github.com/Maximilus-thethird/pdfium-stb-image-syncing

You can reproduce the problem by opening example.py and pasting in the paths to the test samples I provided.

  • We can't run this code, so we have no idea what "very different arrays" means. You would have to show it. Or, better, create minimal working code and add an example file so we can test it and make changes. Commented Sep 1 at 12:53
  • How do you know if the return value of stbi_load has 4 components? From a brief look at the stb source code, it looks like it takes the number of components from the file. If the image file had 3 components, that could account for them being different. Commented Sep 2 at 0:37
  • @NickODell Thanks for the feedback. You're right that the function takes the number of components from the file, and the one I pass in is a JPEG, so it has only 3 components. However, editing the code to convert all channels to plain RGB still doesn't return the right result. Commented Sep 2 at 10:23
  • Can you unpack the .rar and include its contents in the repo? This is useful in case someone wants to preview the repo without downloading it. Commented Sep 2 at 15:13
  • @NickODell I've unpacked the .rar into its component files. The C++ wrapper, where most of the issue lies, is in example_wrapper.cpp, and the main part is in example.py. However, since pdfium.dll is larger than the 25 MB upload limit, I left it packed. To run the code, you need to unpack it, or download the PDFium library yourself and put pdfium.dll in the main folder. Thank you. Commented Sep 3 at 9:56

1 Answer

After some fiddling around, I think I've cracked it. The results didn't match because I was rendering the PDF page directly at 224x224 to maximize performance. That is very different from rendering the page at a high DPI and then downscaling, which is effectively what I did with the images.
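To illustrate how much two ways of reaching the same small size can differ, here is a minimal sketch on a synthetic checkerboard. It does not reproduce PDFium's rasterizer or stb's resizer. The two helper functions are hypothetical stand-ins (block averaging versus plain point sampling) chosen only to show that identical content can yield very different arrays.

```python
import numpy as np

def block_average(img, factor):
    """Downscale a 2-D array by an integer factor by averaging factor x factor blocks."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def point_sample(img, factor):
    """Downscale by keeping every factor-th pixel in each direction."""
    return img[::factor, ::factor].astype(np.float64)

# Synthetic "high-resolution render": an 8x8 checkerboard of 0 and 255.
yy, xx = np.indices((8, 8))
high_res = ((yy + xx) % 2 * 255).astype(np.uint8)

averaged = block_average(high_res, 2)  # each 2x2 block holds two 0s and two 255s
sampled = point_sample(high_res, 2)    # keeps only pixels where y and x are both even

# Euclidean distance between the two 4x4 results.
distance = np.linalg.norm(averaged - sampled)
print(averaged[0, 0], sampled[0, 0], distance)  # 127.5 0.0 510.0
```

The checkerboard is a deliberately extreme case. Real pages differ less, but the same effect applies wherever the content has fine detail such as text edges.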

Digging deeper, I had three options for loading images: Pillow, cv2, and my custom render_image() function. Of the three, cv2 gives the best result (measured by Euclidean distance to the PDF array; smaller is better), while Pillow and render_image() are tied behind it. cv2 also has the advantage of resizing an array directly instead of requiring a conversion to an Image first, so both the image and the PDF render can go through the same resizing pipeline.
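The shared pipeline and the distance check can be sketched roughly as follows. This is not my actual code: downscale() is a hypothetical NumPy box filter standing in for a real resizer such as cv2.resize with interpolation=cv2.INTER_AREA, and the two input arrays are random placeholders rather than a real image and PDF render.

```python
import numpy as np

def downscale(rgb, factor):
    """Box-filter downscale of an (H, W, 3) uint8 array by an integer factor.

    Hypothetical stand-in for a real resizer; both sources go through this one function.
    """
    h, w, c = rgb.shape
    h, w = h - h % factor, w - w % factor  # crop so the size divides evenly
    blocks = rgb[:h, :w].reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3)).round().astype(np.uint8)

def euclidean_distance(a, b):
    """L2 distance between two equally shaped arrays, in float to avoid uint8 wrap-around."""
    return float(np.linalg.norm(a.astype(np.float64) - b.astype(np.float64)))

rng = np.random.default_rng(0)
# Placeholder for the decoded image at twice the target size.
from_image = rng.integers(0, 256, size=(448, 448, 3), dtype=np.uint8)
# Placeholder for the PDF render: the same content with small per-pixel differences.
noise = rng.integers(-2, 3, size=from_image.shape)
from_pdf = np.clip(from_image.astype(np.int16) + noise, 0, 255).astype(np.uint8)

small_image = downscale(from_image, 2)
small_pdf = downscale(from_pdf, 2)
print(small_image.shape, euclidean_distance(small_image, small_pdf))
```

Casting to float before subtracting matters: subtracting two uint8 arrays directly wraps around modulo 256 and would give a misleading distance.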

However, this method isn't perfect: the Euclidean distance between an image and a PDF with identical content never quite reaches 0, it only gets very close. The higher the DPI you choose when rendering the PDF page, the smaller the distance, so you have to find a sweet spot between performance and accuracy.
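One way to reason about that trade-off is to work out how many pixels a given DPI actually produces, using the same points-to-pixels formula as render_page_helper in the question. The sketch below is only an aid for picking a starting value. Both function names are hypothetical, and the oversampling factor of 4 is an arbitrary assumption, not a value I measured.

```python
import math

POINTS_PER_INCH = 72.0  # PDF page sizes are expressed in points

def render_size(width_pt, height_pt, dpi):
    """Pixel size of a page rendered at the given DPI (points * dpi / 72, truncated)."""
    return int(width_pt * dpi / POINTS_PER_INCH), int(height_pt * dpi / POINTS_PER_INCH)

def dpi_for_oversampling(width_pt, height_pt, target_px, factor):
    """Smallest whole DPI at which the page's shorter side is about factor * target_px pixels or more."""
    shorter_pt = min(width_pt, height_pt)
    return math.ceil(factor * target_px * POINTS_PER_INCH / shorter_pt)

# US Letter is 612 x 792 points.
print(render_size(612, 792, 80))               # (680, 880)
print(dpi_for_oversampling(612, 792, 224, 4))  # 106
print(render_size(612, 792, 106))              # (901, 1166)
```

Render cost grows roughly with the pixel count, which scales with the square of the DPI, so doubling the DPI means about four times as many pixels to rasterize and downscale.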
