I'm using PyMuPDF (fitz) to search for and highlight text in a PDF. However, the PDF text contains various control characters between sentences, which makes it difficult to match multi-sentence strings.

For example, when I extract the text from a page using page.get_text(), I get something like:

\x15\x15\x13
The quick brown fox jumps over the lazy dog.
ETX
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
ETX
Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
ETX
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
ETX
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
ETX
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Sincerely,
Mr. John Doe 

Suppose I want to search for the string:

The quick brown fox jumps over the lazy dog. Lorem ipsum dolor sit amet.

This won't match because there are control characters (like ETX) between the sentences in the actual PDF text.
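
To make the mismatch concrete, here is a minimal sketch that works only on the extracted string, without PyMuPDF. It assumes the ETX shown above is the single control character "\x03" and that sentences are separated by newlines; the sample text and the normalize helper are hypothetical, made up for illustration.

```python
import re

# Hypothetical sample imitating the extracted text above.
# Assumption: "ETX" in the output is the control character "\x03".
extracted = (
    "\x15\x15\x13\n"
    "The quick brown fox jumps over the lazy dog.\n"
    "\x03\n"
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
)


def normalize(text: str) -> str:
    """Replace ASCII control characters with spaces, then collapse whitespace runs."""
    without_controls = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", without_controls).strip()


needle = "The quick brown fox jumps over the lazy dog. Lorem ipsum dolor sit amet"

print(needle in extracted)             # False: newline and ETX sit between the sentences
print(needle in normalize(extracted))  # True: the separators became a single space
```

This only shows that the string comparison succeeds after normalization. It does not give page coordinates, so on its own it cannot drive the highlighting.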

Here’s the relevant part of my code:

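import io

import fitz  # PyMuPDF


def highlight_and_annotate_pdf(input_pdf, matches_per_page):
    output_buffer = io.BytesIO()
    doc = fitz.open(stream=input_pdf.read(), filetype="pdf")

    for page_num in range(len(doc)):
        page = doc[page_num]
        # matches_per_page maps a 0-based page number to {"matches": [{"text": ...}, ...]}
        matches = matches_per_page.get(page_num, {"matches": []})["matches"]

        if not matches:
            continue

        page_width = page.rect.width
        margin_x = page_width - 60

        for match in matches:
            # search_for returns a list of rectangles; it is empty when nothing matches
            text_instances = page.search_for(match["text"])
            if text_instances:
                # ... highlight and annotate code ...
                pass

    doc.save(output_buffer)
    doc.close()
    output_buffer.seek(0)  # rewind so callers can read the saved PDF from the start
    return output_buffer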
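
One direction that might work, sketched here under the assumption that each individual sentence is stored contiguously in the PDF: split the match text into sentences and call page.search_for once per sentence. The helpers split_sentences and find_sentence_rects are hypothetical names, and the splitting rule is a naive heuristic rather than a robust sentence tokenizer.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (a naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def find_sentence_rects(page, text):
    """Search for each sentence separately and collect all hits.

    `page` is assumed to be a fitz.Page; only its search_for method is used.
    """
    rects = []
    for sentence in split_sentences(text):
        rects.extend(page.search_for(sentence))
    return rects
```

The collected rectangles could then be handed to page.add_highlight_annot. Two caveats: the heuristic also splits after abbreviations such as "Mr.", and searching sentence by sentence can pick up the same sentence elsewhere on the page, so the hits may need filtering by position.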

How can I reliably search for and highlight text in a PDF when there are control characters or invisible formatting between sentences?
