I'm using PyMuPDF (fitz) to search for and highlight text in a PDF. However, the PDF text contains various control characters between sentences, which makes it difficult to match multi-sentence strings.
For example, when I extract the text from a page using page.get_text(), I get something like:
\x15\x15\x13
The quick brown fox jumps over the lazy dog.
ETX
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
ETX
Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
ETX
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
ETX
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
ETX
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Sincerely,
Mr. John Doe
Suppose I want to search for the string:
The quick brown fox jumps over the lazy dog. Lorem ipsum dolor sit amet.
This won't match because there are control characters (like ETX) between the sentences in the actual PDF text.
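To illustrate the mismatch outside of PyMuPDF, here is a minimal sketch in plain Python. It assumes the ETX markers shown above are the raw control character \x03, and the `normalize` helper is a hypothetical name of my own. It only shows that the string matches once control characters are collapsed into spaces; it does not give me positions on the page to highlight.

```python
import re

# Assumption: "ETX" in the dump above is the raw control character \x03.
# This class also covers \n and \t, so line breaks are collapsed too.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]+")


def normalize(text: str) -> str:
    """Replace runs of control characters with a space and collapse whitespace."""
    return re.sub(r"\s+", " ", CONTROL_CHARS.sub(" ", text)).strip()


# A shortened stand-in for what page.get_text() returns on my page.
extracted = (
    "\x15\x15\x13\n"
    "The quick brown fox jumps over the lazy dog.\n"
    "\x03\n"
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
)
query = "The quick brown fox jumps over the lazy dog. Lorem ipsum dolor sit amet"

found_raw = query in extracted                    # control characters get in the way
found_normalized = query in normalize(extracted)  # matches after normalization
```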
Here’s the relevant part of my code:
    import io

    import fitz  # PyMuPDF


    def highlight_and_annotate_pdf(input_pdf, matches_per_page):
        output_buffer = io.BytesIO()
        doc = fitz.open(stream=input_pdf.read(), filetype="pdf")
        for page_num in range(len(doc)):
            page = doc[page_num]
            # Pages without an entry are treated as having no matches.
            matches = matches_per_page.get(page_num, {"matches": []})["matches"]
            if not matches:
                continue
            page_width = page.rect.width
            margin_x = page_width - 60
            for match in matches:
                # Returns an empty list when the text spans control characters.
                text_instances = page.search_for(match["text"])
                if text_instances:
                    # ... highlight and annotate code ...
                    pass
        doc.save(output_buffer)
        doc.close()
        return output_buffer
How can I reliably search for and highlight text in a PDF when there are control characters or invisible formatting between sentences?
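One workaround I have been considering, as a rough sketch rather than a confirmed solution, is to split the search string into sentences and call page.search_for() on each one separately. The helper names `split_sentences` and `search_multi_sentence` are hypothetical, and the sentence splitting is deliberately naive (it would wrongly split after an abbreviation such as "Mr.").

```python
import re


def split_sentences(query):
    """Naively split after '.', '!' or '?' followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", query.strip())
    return [part for part in parts if part]


def search_multi_sentence(page, query):
    """Search each sentence on its own and collect every hit.

    `page` is expected to behave like a PyMuPDF page, i.e. to offer
    search_for(text) returning a list of rectangles.
    Returns [] as soon as one sentence is not found on the page.
    """
    rects = []
    for sentence in split_sentences(query):
        hits = page.search_for(sentence)
        if not hits:
            return []
        rects.extend(hits)
    return rects
```

In my function this would replace the single page.search_for(match["text"]) call, and the resulting rectangles could then be passed to page.add_highlight_annot(). A known weakness is that a sentence occurring more than once on the page would have all of its occurrences highlighted, not only the one adjacent to the other sentences.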