python - How to match and highlight text in a PDF with PyMuPDF when control characters are present between sentences?

I'm using PyMuPDF (fitz) to search for and highlight text in a PDF. However, the PDF text contains various control characters between sentences, which makes it difficult to match multi-sentence strings.

For example, when I extract the text from a page using page.get_text(), I get something like:

\x15\x15\x13
The quick brown fox jumps over the lazy dog.
ETX
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
ETX
Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
ETX
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
ETX
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
ETX
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Sincerely,
Mr. John Doe

Suppose I want to search for the string:

The quick brown fox jumps over the lazy dog. Lorem ipsum dolor sit amet.

This won't match because there are control characters (like ETX) between the sentences in the actual PDF text.
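The mismatch can be reproduced without a PDF at all. The sketch below is illustrative only: `extracted` is an assumed stand-in for the output of page.get_text(), with ETX taken to be the single character "\x03", and `normalize` is a hypothetical helper, not part of PyMuPDF. The needle is deliberately cut before the comma so that it is a true substring of the page text once separators are ignored.

```python
import re

# Assumed stand-in for page.get_text(); "\x03" is the ETX control character.
extracted = (
    "The quick brown fox jumps over the lazy dog.\n\x03\n"
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
)

# Two sentences joined by a single space, as a user would type them.
needle = "The quick brown fox jumps over the lazy dog. Lorem ipsum dolor sit amet"


def normalize(text):
    """Collapse every run of control characters and whitespace into one space."""
    return re.sub(r"[\x00-\x20\x7f]+", " ", text).strip()


# The raw comparison fails because "\n\x03\n" sits where the needle has " ".
print(needle in extracted)

# After normalizing both sides, the separators no longer differ.
print(normalize(needle) in normalize(extracted))
```

This only shows why a plain string comparison fails. It does not by itself give the rectangles needed for highlighting, since the normalized text no longer lines up with positions on the page.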

Here’s the relevant part of my code:

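import io

import fitz  # PyMuPDF


def highlight_and_annotate_pdf(input_pdf, matches_per_page):
    # input_pdf is a binary file-like object; matches_per_page maps a
    # zero-based page number to {"matches": [{"text": ...}, ...]}.
    output_buffer = io.BytesIO()
    doc = fitz.open(stream=input_pdf.read(), filetype="pdf")

    for page_num in range(len(doc)):
        page = doc[page_num]
        matches = matches_per_page.get(page_num, {"matches": []})["matches"]

        if not matches:
            continue

        # Used by the elided annotation code to place notes near the right edge.
        page_width = page.rect.width
        margin_x = page_width - 60

        for match in matches:
            # Returns one rectangle per hit; empty when the text is not found.
            text_instances = page.search_for(match["text"])
            if text_instances:
                # ... highlight and annotate code ...
                pass

    doc.save(output_buffer)
    doc.close()
    output_buffer.seek(0)  # rewind so callers can read from the start
    return output_buffer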

How can I reliably search for and highlight text in a PDF when there are control characters or invisible formatting between sentences?
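One possible direction, offered as a sketch rather than a confirmed fix, is to avoid searching across the separators at all: split the match text into sentences and pass each sentence to page.search_for on its own. The helpers `split_sentences` and `search_by_sentence` below are hypothetical names, and the only PyMuPDF call assumed is page.search_for(text) returning a list of rectangles, as already used in the code above.

```python
import re


def split_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace.

    Abbreviations such as "Mr." are split too, so this is only a rough cut.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def search_by_sentence(page, text):
    """Search each sentence separately and collect every hit rectangle.

    Returns an empty list if any sentence is not found on the page, so a
    partial match is never reported as a full one.
    """
    rects = []
    for sentence in split_sentences(text):
        hits = page.search_for(sentence)  # same call as in the question's code
        if not hits:
            return []
        rects.extend(hits)
    return rects
```

In the loop above, `text_instances = search_by_sentence(page, match["text"])` would replace the direct page.search_for call. Two caveats apply under these assumptions: a sentence that occurs more than once on the page contributes all of its occurrences, and nothing checks that the sentences found are actually adjacent.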

edited May 21 at 20:20

InSync

asked May 21 at 19:43

Shantanu
