Extracting a Specific Text from PDF using RegEx

Question

Without getting into too much detail about how we ended up in this situation (a lot of poor business decisions), I need to find the text "SomeID=[Integer]" in a PDF file (e.g. SomeID=123456). Ultimately, what I need is the 123456. This text will only be on the first page of the PDF. It will actually be colored white, so it is "invisible".

My initial thought was to grab all the text from the first page of the PDF and then use a regex to parse for "SomeID=[Integer]". I do not care about any of the other text in this PDF. I only care about finding "SomeID=" and the integer that follows.
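The parsing half of this idea can be sketched as follows. It is written in Python for brevity, but the same pattern should also work with .NET's Regex class. The sketch assumes the first page's text has already been extracted into a string. The function name and the sample text are hypothetical.

```python
import re

# Literal marker followed by one or more digits; the digits are captured.
SOME_ID_PATTERN = re.compile(r"SomeID=(\d+)")


def extract_some_id(page_text):
    """Return the integer after 'SomeID=', or None if the marker is absent."""
    match = SOME_ID_PATTERN.search(page_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical page text, standing in for whatever the extractor returns.
    sample = "Invoice header\n               SomeID=123456\nOther text"
    print(extract_some_id(sample))
```

Returning None rather than raising lets the caller decide what a missing ID means for the business process.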

What is a simple way to get all the text from a PDF without using a NuGet library?

That said, I can try to get one approved if there is a solid one.

If you want to hide metadata, I think you could just append it to the end of the file. Like appending a .zip file would work. Or you could just use your own byte format.
– Jeremy Lakeman, Commented Feb 5 at 1:01
"without using a Nuget Library" - is only nuget a problem? Or are third party tools and libs in general?
– mkl, Commented Feb 5 at 6:11
"without using a Nuget Library?" - basically: Good Luck. I wouldn't touch that with a 10ft pole.
– Fildor, Commented Feb 5 at 8:53
@mkl Third party tools, in general. Essentially, they would like to limit that as much as possible. We have a DevExpress library, though I'm not super familiar with it, and I wonder if there's a way to do this with that.
– user3121062, Commented Feb 5 at 14:54
Working with arbitrary PDF files without using existing libraries or tools means a lot of work. If you only need to process very special files, the situation can be easier but still not easy.
– mkl, Commented Feb 5 at 15:29

K J · Accepted Answer · 2025-02-04 23:27:29Z

Text may be invisible in a PDF rendering. In this example the "Default" black text is visible on the left, while SomeID on the right has no colour set for that area. That area could equally be printer-tracking yellow, white, or any other unseen colour.

To see all PDF text in black and white we simply need to read the page as if it were plain text.

If we have a region of interest at a known location, we can trim the extraction down to just that area of the page.
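As a rough sketch of trimming the extraction area: the Poppler build of pdftotext accepts -x, -y, -W and -H to define a crop rectangle. Other builds may use different options, so it is worth checking the tool's usage output first. The helper name and the coordinates are hypothetical, and the right values depend on where the hidden text sits in your files.

```python
def build_region_command(pdf_path, x, y, width, height, page=1):
    """Build a pdftotext argument list that reads one rectangle of one page.

    Assumes Poppler's pdftotext, where -x/-y give the top-left corner of
    the crop area and -W/-H give its width and height.
    """
    return [
        "pdftotext", "-layout",
        "-f", str(page), "-l", str(page),   # first and last page to read
        "-x", str(x), "-y", str(y),         # crop origin
        "-W", str(width), "-H", str(height),
        pdf_path, "-",                      # "-" sends the text to stdout
    ]


if __name__ == "__main__":
    # Hypothetical rectangle; adjust to the real location of the hidden text.
    print(" ".join(build_region_command("hiddentext.pdf", 0, 0, 300, 100)))
```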

Thus any application, on any platform, can shell out to such a command line and, using redirection, read just that small zone (or a larger one), or extract the relevant line by finding it and splitting the result.

-f 1 -l 1 restricts the search to the first page.

>pdftotext -layout -f 1 -l 1 hiddentext.pdf -|find "SomeID"
               SomeID=123456
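Shelling out to that same command from an application might look like the sketch below. It assumes a pdftotext executable is on the PATH. The function names and the injectable `runner` parameter are hypothetical, the latter added only so the logic can be exercised without the tool installed.

```python
import subprocess


def lines_containing(text, needle):
    """Return the stripped lines of text that contain needle (like find)."""
    return [line.strip() for line in text.splitlines() if needle in line]


def first_page_matches(pdf_path, needle="SomeID", runner=subprocess.run):
    """Run pdftotext on page 1 only and keep the lines containing needle."""
    result = runner(
        ["pdftotext", "-layout", "-f", "1", "-l", "1", pdf_path, "-"],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise if pdftotext exits with an error
    )
    return lines_containing(result.stdout, needle)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.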

You can even set an environment variable, with some system-specific contortions.

>cmd /V:ON /r pdftotext -layout -f 1 -l 1  hiddentext.pdf -|find "SomeID">%temp%\output.txt &set /p input=<%temp%\output.txt&&set output=%input:~-6%&&echo/&&set output

output=123456
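The `%input:~-6%` substring keeps the last six characters of the line, so it relies on the ID always being exactly six digits. A length-independent variant of the "find and split" step could be sketched like this, given a line that has already been extracted. The function name is hypothetical.

```python
def id_from_line(line):
    """Return the integer to the right of '=' in a 'SomeID=...' line, or None."""
    _, sep, value = line.strip().partition("=")
    value = value.strip()
    # Accept only plain ASCII digits so int() cannot fail on odd characters.
    if sep and value.isascii() and value.isdigit():
        return int(value)
    return None


if __name__ == "__main__":
    print(id_from_line("               SomeID=123456"))
```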

Thank you for this! I'll have to ask to see if we can add that utility. But this definitely gives us an idea. Really, the problem is a business process issue, but that's what we do, right? Write dumb code to make up for poor business processes.
Before asking, you should test the tool with your files - if the text in your file is reasonably encoded, it should work. But text can also be obfuscated to prohibit text extraction.
