When I open a PDF file via a normal link (e.g., https://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf), Chrome opens the PDF in the same tab with Chrome's integrated PDF viewer component.

In this state, I want to access elements of the document, e.g., the <title> tag in the <head> of the document. However, I cannot read any elements via Selenium:

Calling driver.page_source only returns a minimal HTML skeleton (without the <title> tag or any visible content).

Even an explicit find_element (e.g., on head > title) fails.

Even if I wait until the page is fully loaded, the elements remain untraceable.

In the DevTools Inspector, however, I can see that Chrome internally uses a shadow root structure for the PDF viewer. I don't want to access the shadow DOM elements, though, only what Chrome already displays in the root document.

[Screenshot: DOM when the PDF is opened]

Question: How can I get this element and its text? I have quite some experience with Selenium, but I just can't see what I'm doing wrong. Here is the code I used for my example:

_driver.Navigate().GoToUrl("https://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf");
// I set a breakpoint here to wait until the page is fully loaded, then continue
var pageSourceUni = _driver.PageSource;
var allElements = _driver.FindElements(By.XPath("//*"));
// allElements contains only 4 WebElements (html, head, body, embed)
var titleElement = _driver.FindElements(By.CssSelector("head > title"));
// titleElement is an empty collection: no <title> element is found

Here is also what pageSourceUni contains:

<html>
<head></head>
<body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(40, 40, 40);">
<embed name="788D80E57272EEC95D5C09EEECDBAFF7" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="788D80E57272EEC95D5C09EEECDBAFF7">
</body>
</html>

Honestly, I don't understand why I get this HTML in pageSource. The only explanation I can think of is that this is the basic HTML, and then everything else is reloaded and modified with AJAX, so that in the end I get the HTML that I also see in Chrome Inspector.

  • What you're seeing in "Inspect" is HTML rendered by a built-in Chrome extension. That has its own sandbox and is probably not accessible by Selenium. Commented Oct 16, 2025 at 16:55
  • Does this apply to the whole document, as in my provided screenshot? I am only interested in the tab title in this context, which is outside of the sandbox, isn't it? If the whole document is part of the sandbox, how can it "write" to the tab title, which should be outside of the sandbox? So shouldn't I somehow have access to the actual title of the tab anyway? Commented Oct 16, 2025 at 20:38
  • This makes me wonder whether I can disable this security feature when starting the browser. If an extension can access it, Selenium should be able to as well, if there are no checks for trusted areas, maybe? Commented Oct 20, 2025 at 13:08

1 Answer

Selenium can only access an HTML or XML document as a tree structure in which each node is an object representing a part of the document. A PDF does not offer this, nor does it work in this way.

The browser has a PDF reader engine that paints the contents to the pdf-viewer element. This renderer draws the PDF onto a canvas (or a shadow DOM internal element), which is a painted surface, not HTML. You will need to use a different tool, such as a PDF parsing library that can parse the file and read its content. Unfortunately, Selenium was not designed for this.
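As a rough illustration of the "PDF parsing library" route, the sketch below downloads the PDF outside the browser and reads the Title entry from its document information. It assumes the third-party PdfPig NuGet package (namespace UglyToad.PdfPig); the class and method names are hypothetical helpers, not part of Selenium. It also assumes that Chrome's viewer takes the tab title from the PDF's Title metadata and falls back to the file name when there is none, which is worth verifying against your own files.

```csharp
// Requires the third-party NuGet package "PdfPig" (an assumption, not used in the question).
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using UglyToad.PdfPig;

public static class PdfTitleReader
{
    // Hypothetical helper: last path segment of the URL, used when the PDF has no Title entry.
    public static string FileNameFromUrl(string url)
    {
        string path = new Uri(url).AbsolutePath;
        return Uri.UnescapeDataString(Path.GetFileName(path));
    }

    // Hypothetical helper: reads the Title entry from the PDF's document information.
    public static string TitleFromPdf(byte[] pdfBytes)
    {
        using (PdfDocument document = PdfDocument.Open(pdfBytes))
        {
            return document.Information.Title;
        }
    }

    // Downloads the PDF directly (bypassing the browser) and returns the metadata title,
    // or the file name if the metadata title is missing or blank.
    public static async Task<string> GetExpectedTabTitleAsync(string url)
    {
        using (var http = new HttpClient())
        {
            byte[] pdfBytes = await http.GetByteArrayAsync(url);
            string title = TitleFromPdf(pdfBytes);
            return string.IsNullOrWhiteSpace(title) ? FileNameFromUrl(url) : title;
        }
    }
}
```

With the driver from the question, this could be called as `string title = await PdfTitleReader.GetExpectedTabTitleAsync(_driver.Url);`. Note that HttpClient does not share the browser session, so a PDF behind a login would need the relevant cookies or headers passed along.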

5 Comments

But as mentioned, I don't want the PDF content. I only want the value of the browser's tab title, which works normally until a PDF is opened in this tab. The painted surface you mention seems to be accessible via an extension, as K J commented.
Okay, so have you tried switching to the window tab that opens the PDF and then calling "_driver.Title" to retrieve it? When a browser displays a PDF, it stops rendering HTML and instead hands control over to its built-in PDF viewer engine. That means the page you're seeing is not a DOM made of <html>, <head>, <body>, etc. Selenium's DOM API (document, XPath, CSS selectors) only sees whatever is in the actual HTML DOM.
"_driver.Title" returns an empty string. Selenium gets some HTML, but it isn't the HTML I see in the dev tools. I was wondering how the tab title (which shouldn't be part of the PDF renderer's restricted area) gets its value. I thought the PDF viewer could just set the tab title, which could then be accessed from outside. But it seems the title itself lies in the restricted area.
If it's returning an empty string, then unfortunately Selenium cannot help you any further here, because what you are seeing in the dev tools is a fake wrapper DOM provided by the browser's internal PDF viewer app, not the actual PDF structure. It's the UI shell of the viewer itself: you're not inspecting the PDF file's DOM (because PDFs don't have one), you're inspecting the viewer app's DOM. Selenium was designed to interact with a true DOM, not a wrapper DOM that a browser uses to display content rendered by other engines.
Browsers unify their developer tools, so everything they show you (HTML, extensions, the PDF viewer, even DevTools itself) appears as an HTML-like structure. That's because the viewer's interface (zoom buttons, scrollbars, etc.) is made using HTML and JavaScript. The actual PDF contents are drawn into a <canvas> or <embed> via low-level rendering code, not represented by DOM elements. So it feels like inspecting a web page, but those "nodes" aren't your document's structure; they're the viewer's structure.
