Exploiting PDF Obfuscation in LLMs, arXiv, and More

Paper 2026/278

Zhongtang Luo, Purdue University West Lafayette

Jianting Zhang, Purdue University West Lafayette

Zheng Zhong, Purdue University West Lafayette

Abstract

Many modern systems, including multimodal large language models and academic submission platforms, parse PDF files to extract semantic information. We show that this practice is extremely vulnerable in real-world use cases. By exploiting standard-compliant features of the PDF page description language, an adversary can craft PDFs whose parsed content differs arbitrarily from what is visually rendered to human readers and whose metadata can be manipulated to mislead automated systems. We demonstrate two concrete vulnerabilities. First, we build adversarial PDFs using font-level glyph remapping that cause several widely deployed multimodal language models to extract incorrect text while remaining visually indistinguishable from benign documents. Across six platforms, most systems that rely on PDF text extraction are vulnerable, whereas OCR-based pipelines are robust. Second, we analyze arXiv's TeX-detection mechanism and show that it relies on brittle metadata and font heuristics, which can be fully bypassed without changing the visual output. Our findings reveal a potential risk arising from discrepancies between automated PDF parsing and human-visible semantics. We argue that rendering-based interpretation, followed by computer vision, is a better approach for security-sensitive PDF interpretation.
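
To make the first vulnerability concrete, here is a toy sketch of the mismatch between rendered and extracted text. It is not the authors' construction and does not read or write real PDF files. It assumes only that a renderer draws the glyph shape attached to each character code, while a text extractor trusts the font's declared code-to-Unicode mapping (in PDF, typically a ToUnicode CMap). All function and variable names are hypothetical.

```python
# Toy model only: no real PDF is parsed or produced here.
# glyph_shapes: character code -> shape a renderer would draw (what a human sees)
# to_unicode:   character code -> Unicode a text extractor would report


def render(codes, glyph_shapes):
    """What a human reader (or an OCR pipeline) sees: the drawn glyph shapes."""
    return "".join(glyph_shapes[c] for c in codes)


def extract(codes, to_unicode):
    """What a text-extraction parser reports: the declared Unicode mapping."""
    return "".join(to_unicode[c] for c in codes)


def make_adversarial_font(visible, hidden):
    """Build hypothetical per-code tables so that a single code sequence
    renders as `visible` but extracts as `hidden`."""
    if len(visible) != len(hidden):
        raise ValueError("this toy model requires equal-length strings")
    # Give every position its own character code so the two tables never conflict.
    codes = list(range(len(visible)))
    glyph_shapes = {c: visible[c] for c in codes}
    to_unicode = {c: hidden[c] for c in codes}
    return codes, glyph_shapes, to_unicode


codes, shapes, mapping = make_adversarial_font("accept", "reject")
print(render(codes, shapes))    # accept
print(extract(codes, mapping))  # reject
```

In this model, a pipeline that reads only the declared mapping sees "reject", while one that works from the rendered shapes sees "accept". That matches the abstract's observation that OCR-based pipelines are robust where text-extraction pipelines are not.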

Metadata

Available format(s): PDF
Category: Applications
Publication info: Preprint.
Keywords: PDF obfuscation
Contact author(s): luo401 @ purdue edu
zhan4674 @ purdue edu
zhong183 @ purdue edu
History: 2026-02-17: approved; 2026-02-16: received
Short URL: https://ia.cr/2026/278
License: CC BY-NC

BibTeX

@misc{cryptoeprint:2026/278,
      author = {Zhongtang Luo and Jianting Zhang and Zheng Zhong},
      title = {Exploiting {PDF} Obfuscation in {LLMs}, {arXiv}, and More},
      howpublished = {Cryptology {ePrint} Archive, Paper 2026/278},
      year = {2026},
      url = {https://eprint.iacr.org/2026/278}
}