In Python, two libraries are often used in tandem: pdf2image, which relies on Poppler, and pytesseract, which relies on Tesseract. Both Poppler and Tesseract are external downloads that pip does not provide. The general recommendation for Windows is to download these files separately from the pip install and then set the path to them. This solution does not work for me, because the files take up too much space in my project folder.

Right now, within my project folder, I have two folders, Poppler and Tesseract, which contain all the necessary files. I point the libraries at them like this:

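import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = path_to_tesseract  # full path to tesseract.exe
# and
convert_from_path(file_path, poppler_path=POPPLER_PATH)  # Poppler's bin folder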

However, this doesn't work in production, because the two folders take up so much space in my project. What I need is to download them both somewhere relative to the pip installs, so I don't need to set a path for either.

Right now, I have a PowerShell script which pip installs everything I need. I should be able to download Tesseract and Poppler at the same time as the rest of my pip installs.

$libraries = @(
    "pdf2image", # turnIntoImage()
    "pytesseract"
)

foreach ($lib in $libraries) {
    Write-Host "Installing $lib..."
    pip install $lib
}

# Add code here which downloads Poppler and Tesseract

I've done a lot of research, and this is what I've tried:

  • Downloading the files myself (not programmatic)
  • Downloading the files when a Python file is first run (this should happen in the ps1 script instead)
  • Running a tesseract.exe installer from the ps1 script (not good practice, takes forever)
  • Using the ps1 script to download the files straight to the C: drive (no access, lots of errors, didn't work)
  • What exactly is your question? Commented Nov 18 at 18:18
  • @SantiagoSquarzon I'd like to download Poppler and Tesseract programmatically with PowerShell. Is that possible, and if so, how do I do it? Commented Nov 18 at 18:25
  • It's probably possible. Are you asking others to research how to do this and then write the code for you? If so, that's not the purpose of this site. Commented Nov 18 at 18:41
  • For those who know PowerShell but aren't well versed in Python, you'll need to explain what Poppler and Tesseract are and where you get them from. Is there a download link? Commented Nov 18 at 18:51
  • For Tesseract it should be pretty easy, assuming you want the latest release: make a web request to https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest, then pick either the tarball_url or the zipball_url property from the JSON response. These contain the download links for the latest release as .tar.gz or .zip. For Poppler, it doesn't look like they have an API, so it seems you'll need to do some web scraping. Commented Nov 18 at 19:22

2 Answers

This might give you a start on how to approach it programmatically. It isn't entirely straightforward, as one of the downloads requires web scraping.

First, regarding the location relative to pip, my assumption is that you want to drop these downloads where pip installs all Python modules. In that case, this looks like it works to get that location (not sure if there is a better or easier way):

$piplocation = (pip show pip | Select-String '(?<=^Location: ).+').Matches[0].Value
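
For readers who would rather do this lookup from Python, the same idea can be sketched as follows. This is only a sketch and not part of the original approach: `parse_pip_location` and `pip_location` are hypothetical helper names, and it assumes `pip show` prints a line of the form `Location: <path>`, which is what the regex above also relies on.

```python
import re
import subprocess
import sys
from typing import Optional


def parse_pip_location(pip_show_output: str) -> Optional[str]:
    """Return the value of the 'Location:' line in `pip show` output, or None."""
    match = re.search(r"^Location: (.+)$", pip_show_output, flags=re.MULTILINE)
    # strip() also drops a trailing carriage return from Windows line endings
    return match.group(1).strip() if match else None


def pip_location() -> Optional[str]:
    """Run `pip show pip` with the current interpreter and parse its output."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "show", "pip"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_pip_location(result.stdout)
```

Calling pip through `sys.executable -m pip` is a design choice here: it queries the pip that belongs to the running interpreter, which may differ from whichever `pip` is first on PATH.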

Then use that location as the download target. For Tesseract, you can use the GitHub API to get the download link for the latest release:

$req = Invoke-RestMethod https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest

# NOTE: Use `$req.tarball_url` if you want the `.tar.gz` instead of the `.zip`
$downloadPath = Join-Path $piplocation "tesseract-ocr.$($req.name).zip"
Invoke-WebRequest $req.zipball_url -OutFile $downloadPath

EDIT: OP has found a much better and more reliable way to obtain the latest Poppler build using the GitHub API, see his answer.

Then for Poppler, it looks like you need web scraping to get the link. This might work for now, but be aware that web scraping is never a robust solution. You should research whether they have an API for getting the latest download link.

$latest = (Invoke-WebRequest https://poppler.freedesktop.org/).Links |
    Where-Object outerHtml -Match '(?<=a href=")poppler.+?\.tar\.xz(?=")' |
    Select-Object -ExpandProperty href

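# Give -OutFile a full file path; a bare directory is not accepted by every PowerShell version
Invoke-WebRequest "https://poppler.freedesktop.org/$latest" -OutFile (Join-Path $piplocation $latest)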
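
The link extraction above can also be sketched in Python, which makes it easy to check the pattern against a saved copy of the page before relying on it. This is an illustrative sketch under assumptions: `find_poppler_tarball` is a hypothetical helper, and it assumes the page links to the tarball with a relative `href` such as `poppler-<version>.tar.xz`, as the PowerShell regex does.

```python
import re
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://poppler.freedesktop.org/"


def find_poppler_tarball(html: str) -> Optional[str]:
    """Return the absolute URL of the first poppler .tar.xz link in the HTML, or None."""
    # [^"]+? keeps the match inside a single href attribute value
    match = re.search(r'href="(poppler-[^"]+?\.tar\.xz)"', html)
    return urljoin(BASE_URL, match.group(1)) if match else None
```

Returning `None` when nothing matches lets the caller fail with a clear message if the page layout changes, which is the main risk with scraping.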

And that's it. In $piplocation you should now have both archives ready to extract:

PS ..\pwsh> Get-ChildItem $piplocation -File | Where-Object Name -Match 'tesseract-ocr|poppler'

    Directory: C:\Users\...\AppData\Local\Programs\Python\...\site-packages

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          11/18/2025  5:16 PM        1988596 poppler-25.11.0.tar.xz
-a---          11/18/2025  5:16 PM        2490329 tesseract-ocr.5.5.1.zip

2 Comments

Thank you Santiago! I'm going ahead and marking this answer as correct since it was so dang helpful. I was able to use this to download both Tesseract and Poppler, and I found a Poppler repository which stays updated. The Tesseract logic is much simpler and more straightforward, so I put the Poppler logic into a different answer. Feel free to edit and include it here.
No need to edit mine! I'm glad this helped. By the way, you can reuse the same logic for Poppler since you found the GitHub repo for it. Instead of $req.assets | Where-Object { $_.browser_download_url -like "*Release-*.zip" } you can use $downloadPath = Join-Path $piplocation "Poppler-$($req.tag_name).zip" and then Invoke-WebRequest $req.zipball_url -OutFile $downloadPath (very similar to what I used here for Tesseract, but using $req.tag_name instead of $req.name to end up with a file named Poppler-v25.11.0-0.zip). Let me know if you have any doubts!

Here is the logic for installing Poppler at a known path. I extract the archive so I can keep only the bin folder and delete everything else.

    $piplocation = (pip show pip | Select-String '(?<=^Location: ).+').Matches[0].Value
    Write-Host "Pip files installed in: $piplocation"
    $targetPath = Join-Path $piplocation "pytesseract_document_reader\poppler"

    Write-Host "Installing Poppler to PATH..."
    $req = Invoke-RestMethod -Uri "https://api.github.com/repos/oschwartz10612/poppler-windows/releases/latest"
    $asset = $req.assets | Where-Object { $_.browser_download_url -like "*Release-*.zip" }
    $downloadPath = Join-Path $piplocation $asset.name
    Invoke-WebRequest -Uri $asset.browser_download_url -OutFile $downloadPath

    Get-ChildItem $piplocation -File | Where-Object Name -Match $asset.name

    Write-Host "Extracting Poppler..."
    Expand-Archive -LiteralPath $downloadPath -DestinationPath $piplocation -Force

    Write-Host "Moving Poppler to pytesseract_document_reader\poppler\bin..."
    $popplerRoot = Get-ChildItem -Path $piplocation -Directory | Where-Object { $_.Name -match '^poppler-\d' } | Select-Object -First 1
    $sourceBin = Join-Path $popplerRoot.FullName  "Library\bin"
    if (-not (Test-Path $targetPath)) {
        New-Item -ItemType Directory -Path $targetPath | Out-Null
    }
    Move-Item -Path $sourceBin -Destination $targetPath -Force

    Write-Host "Cleaning up Poppler ZIP and extracted folder..."
    Remove-Item $downloadPath -Force
    Remove-Item $popplerRoot.FullName -Recurse -Force
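
The asset filter in the script above (`-like "*Release-*.zip"`) is the step most likely to break if the release naming changes, so here is a minimal Python sketch of the same selection that can be tested offline against a saved API response. `pick_asset` is a hypothetical helper name, and the sketch assumes the release JSON has an `assets` list whose entries carry `browser_download_url`, as the script above uses.

```python
from fnmatch import fnmatchcase
from typing import Optional


def pick_asset(release: dict, pattern: str = "*Release-*.zip") -> Optional[dict]:
    """Return the first release asset whose download URL matches the glob pattern."""
    for asset in release.get("assets", []):
        url = asset.get("browser_download_url", "")
        # Compare in lowercase to mirror the case-insensitive PowerShell -like operator
        if fnmatchcase(url.lower(), pattern.lower()):
            return asset
    return None
```

A `None` result signals that no asset matched, which is worth checking before building the download path.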

In Python, I get the path using:

import os
import sys
POPPLER_PATH = os.path.join(sys.base_prefix, "Lib", "site-packages", "pytesseract_document_reader", "poppler", "bin")
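
A slightly more defensive variant of that lookup is sketched below. It is an assumption-laden sketch rather than the method used above: `bundled_poppler_bin` is a hypothetical helper, and it asks `sysconfig` for the site-packages directory instead of hardcoding `Lib\site-packages`, then returns `None` when the folder is missing.

```python
import os
import sysconfig
from typing import Optional


def bundled_poppler_bin(
    site_packages: Optional[str] = None,
    package_dir: str = "pytesseract_document_reader",
) -> Optional[str]:
    """Return the bundled Poppler bin folder if it exists, otherwise None."""
    if site_packages is None:
        # "purelib" is the site-packages directory of the running interpreter
        site_packages = sysconfig.get_paths()["purelib"]
    candidate = os.path.join(site_packages, package_dir, "poppler", "bin")
    return candidate if os.path.isdir(candidate) else None
```

The result can be passed straight to `convert_from_path(file_path, poppler_path=bundled_poppler_bin())`. When the helper returns `None`, pdf2image falls back to looking for Poppler on PATH, since `None` is the default for `poppler_path`.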
