Paper 2026/278

Exploiting PDF Obfuscation in LLMs, arXiv, and More

Zhongtang Luo, Purdue University West Lafayette
Jianting Zhang, Purdue University West Lafayette
Zheng Zhong, Purdue University West Lafayette
Abstract

Many modern systems parse PDF files to extract semantic information, including multimodal large language models and academic submission platforms. We show that this practice is extremely vulnerable in real-world use cases. By exploiting standards-compliant features of the PDF page description language, an adversary can craft PDFs whose parsed content differs arbitrarily from what is visually rendered to human readers and whose metadata can be manipulated to mislead automated systems. We demonstrate two concrete vulnerabilities. First, we build adversarial PDFs using font-level glyph remapping that cause several widely deployed multimodal language models to extract incorrect text while remaining visually indistinguishable from benign documents. Across six platforms, most systems that rely on PDF text extraction are vulnerable, whereas OCR-based pipelines are robust. Second, we analyze arXiv's TeX-detection mechanism and show that it relies on brittle metadata and font heuristics, which can be fully bypassed without changing the visual output. Our findings reveal a risk arising from discrepancies between automated PDF parsing and human-visible semantics. We argue that rendering-based interpretation, followed by computer vision, is a more robust approach for security-sensitive PDF interpretation.
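
The glyph-remapping idea in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' construction: it does not build a real PDF, and the function names (`render`, `extract`, `build_adversarial_font`) and the example strings are hypothetical. It relies on one general fact about PDF fonts: a text run stores character codes, the font decides which glyph each code draws, and a separate ToUnicode mapping tells text extractors which character each code stands for. Nothing forces the two mappings to agree.

```python
# Toy model of font-level glyph remapping (all names here are hypothetical).
# "glyphs" stands in for the font's code -> glyph table (what is drawn);
# "to_unicode" stands in for the ToUnicode table (what a parser extracts).

def render(codes, glyphs):
    """Text as a renderer, and therefore a human or an OCR pipeline, sees it."""
    return "".join(glyphs[c] for c in codes)


def extract(codes, to_unicode):
    """Text as a text-extraction pipeline reports it."""
    return "".join(to_unicode[c] for c in codes)


def build_adversarial_font(visible, extracted):
    """Give every position its own code, so each code can carry one
    visible glyph and a different extracted character."""
    if len(visible) != len(extracted):
        raise ValueError("visible and extracted text must have equal length")
    codes = list(range(len(visible)))
    glyphs = dict(zip(codes, visible))
    to_unicode = dict(zip(codes, extracted))
    return codes, glyphs, to_unicode


codes, glyphs, to_unicode = build_adversarial_font("accept", "reject")
shown = render(codes, glyphs)        # what the page looks like
parsed = extract(codes, to_unicode)  # what a text extractor returns
print(f"rendered: {shown!r}, extracted: {parsed!r}")
```

In this model the same code sequence yields one string on screen and another under extraction, which is why a pipeline that renders the page and then reads the pixels is unaffected while one that trusts the embedded mapping is not.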

Metadata
Available format(s)
PDF
Category
Applications
Publication info
Preprint.
Keywords
PDF obfuscation
Contact author(s)
luo401 @ purdue edu
zhan4674 @ purdue edu
zhong183 @ purdue edu
History
2026-02-17: approved
2026-02-16: received
Short URL
https://ia.cr/2026/278
License
Creative Commons Attribution-NonCommercial
CC BY-NC

BibTeX

@misc{cryptoeprint:2026/278,
      author = {Zhongtang Luo and Jianting Zhang and Zheng Zhong},
      title = {Exploiting {PDF} Obfuscation in {LLMs}, {arXiv}, and More},
      howpublished = {Cryptology {ePrint} Archive, Paper 2026/278},
      year = {2026},
      url = {https://eprint.iacr.org/2026/278}
}