In C++, I'm trying to obtain a NumPy array from a PDF page using PDFium:
py::array_t<uint8_t> render_page_helper(FPDF_PAGE page, int target_width = 0, int target_height = 0, int dpi = 80) {
    int width, height;
    if (target_width > 0 && target_height > 0) {
        width = target_width;
        height = target_height;
    } else {
        width = static_cast<int>(FPDF_GetPageWidth(page) * dpi / 72.0);
        height = static_cast<int>(FPDF_GetPageHeight(page) * dpi / 72.0);
    }
    FPDF_BITMAP bitmap = FPDFBitmap_Create(width, height, 1);
    if (!bitmap) throw std::runtime_error("Failed to create bitmap");
    FPDFBitmap_FillRect(bitmap, 0, 0, width, height, 0xFFFFFFFF);
    FPDF_RenderPageBitmap(bitmap, page, 0, 0, width, height, 0, FPDF_ANNOT);
    int stride = FPDFBitmap_GetStride(bitmap);
    uint8_t* buffer = static_cast<uint8_t*>(FPDFBitmap_GetBuffer(bitmap));
    // Return numpy array with shape (height, width, 4) = BGRA
    auto result = py::array_t<uint8_t>({height, width, 4}, buffer);
    FPDFBitmap_Destroy(bitmap);
    return result;
}
The result then gets passed back into Python and processed with:
arr = arr_bgra[:, :, [2, 1, 0]]
This drops the alpha channel and reorders the remaining channels from BGR to RGB.
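For comparison with the stb_image path, the same channel reordering could also be done on the C++ side before the array is handed to Python. This is only a minimal sketch using the standard library: `bgra_to_rgb` is a hypothetical helper name (not part of PDFium or pybind11), and it assumes 4 bytes per pixel in BGRA order with rows `stride` bytes apart, as reported by FPDFBitmap_GetStride:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of PDFium or pybind11): copies a BGRA buffer
// whose rows start `stride` bytes apart into a tightly packed RGB vector.
std::vector<std::uint8_t> bgra_to_rgb(const std::uint8_t* bgra,
                                      int width, int height, int stride) {
    std::vector<std::uint8_t> rgb(static_cast<std::size_t>(width) * height * 3);
    for (int y = 0; y < height; ++y) {
        // Step by stride rather than width * 4, in case rows are padded.
        const std::uint8_t* row = bgra + static_cast<std::size_t>(y) * stride;
        for (int x = 0; x < width; ++x) {
            const std::size_t out = (static_cast<std::size_t>(y) * width + x) * 3;
            rgb[out + 0] = row[x * 4 + 2]; // R
            rgb[out + 1] = row[x * 4 + 1]; // G
            rgb[out + 2] = row[x * 4 + 0]; // B (alpha at offset 3 is dropped)
        }
    }
    return rgb;
}
```

The returned vector has the same layout as the `(height, width, 3)` array produced by the image path, so the two outputs can be compared byte for byte.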
When given an image instead, I currently handle it using stb_image:
py::array_t<uint8_t> render_image(const std::string& filename, int target_width = 224, int target_height = 224) {
    int width, height, channels;
    unsigned char* rgba = stbi_load(filename.c_str(), &width, &height, &channels, 4); // force RGBA
    if (!rgba) throw std::runtime_error("Failed to load image");
    // Temporary buffer (still RGBA after resize)
    std::vector<uint8_t> resized(target_width * target_height * 4);
    stbir_resize_uint8(rgba, width, height, 0,
                       resized.data(), target_width, target_height, 0, 4);
    stbi_image_free(rgba);
    // Allocate Python-owned buffer for final RGB output
    py::array_t<uint8_t> result({target_height, target_width, 3});
    auto buf = result.mutable_unchecked<3>();
    // Convert RGBA → RGB (drop alpha)
    for (int y = 0; y < target_height; ++y) {
        for (int x = 0; x < target_width; ++x) {
            int idx = (y * target_width + x) * 4;
            buf(y, x, 0) = resized[idx + 0]; // R
            buf(y, x, 1) = resized[idx + 1]; // G
            buf(y, x, 2) = resized[idx + 2]; // B
        }
    }
    return result;
}
This resizes the image and returns a NumPy array directly.
Both work well on their own. However, when given a PDF and an image with identical contents, the two pipelines produce very different arrays.
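To put a number on "very different", one option is the mean absolute difference per byte between the two RGB buffers once both have the same dimensions. A minimal sketch, where `mean_abs_diff` is a hypothetical helper and both inputs are assumed to be tightly packed RGB data of equal size:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <vector>

// Hypothetical helper: mean absolute per-byte difference between two buffers.
// Both buffers are assumed to hold tightly packed RGB data of identical size.
double mean_abs_diff(const std::vector<std::uint8_t>& a,
                     const std::vector<std::uint8_t>& b) {
    if (a.empty() || a.size() != b.size()) {
        throw std::invalid_argument("buffers must be non-empty and equal in size");
    }
    unsigned long long total = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Widen to int before subtracting so the difference cannot wrap around.
        total += static_cast<unsigned long long>(
            std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i])));
    }
    return static_cast<double>(total) / static_cast<double>(a.size());
}
```

A value near 0 means the buffers nearly match; 255 is the largest possible value.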
I've tried switching image renderers and have even tried converting both to PIL Images, to no avail. Is it even possible to produce similar results without ditching PDFium? Using it is somewhat of a requirement.
Here's a minimal working example of the problem: https://github.com/Maximilus-thethird/pdfium-stb-image-syncing
You can recreate the problem by opening example.py and pasting in the paths to the test samples I provided.
[...] minimal working code and add example file - so we could test it and make changes.
[...] stbi_load has 4 components? From a brief look at the stb source code, it looks like it takes the number of components from the file. If the image file had 3 components, that could account for them being different.
[...] .rar into component files, the C++ wrapper is inside example_wrapper.cpp where the issue mostly lies, while the main part is inside example.py. However, since pdfium.dll is larger than the allowed 25mb upload limit, I decided to leave it packed. To run the code you need to unpack it, or download the PDFium library yourself and put pdfium.dll inside the main folder. Thank you.