In Python, two libraries are often used in tandem: pdf2image and pytesseract. Both depend on external programs to function: Poppler and Tesseract, respectively. The general recommendation for Windows is to download these programs separately from the pip install, and then set the path to them. This solution does not work for me, because they take up too much space in my project folder.
Right now, within my project folder, I have two folders, Poppler and Tesseract, which contain all the necessary files. I point the libraries at them like this:
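import pytesseract
from pdf2image import convert_from_path
# tesseract_cmd lives on the inner pytesseract module
pytesseract.pytesseract.tesseract_cmd = path_to_tesseract
# and
convert_from_path(file_path, poppler_path=POPPLER_PATH)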
However, this doesn't work in production, because they take up so much space in my folder. What I need is to download them both somewhere relative to the pip installs, so I don't need to set a path for either.
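To make "no explicit path" concrete, here is a minimal sketch under a few assumptions. As far as I understand, pytesseract falls back to calling `tesseract` from PATH and pdf2image looks for Poppler's `pdftoppm` on PATH when `poppler_path` is not given, so prepending the tool folders to PATH for the current process should be enough. The `ocr-tools` folder under `sys.prefix` and the `tesseract` / `poppler/bin` layout are hypothetical names chosen for illustration, not something either library defines.

```python
import os
import shutil
import sys
from pathlib import Path


def tool_dirs(base):
    """Return the folders expected to hold the executables.

    Hypothetical layout: <base>/tesseract and <base>/poppler/bin.
    """
    base = Path(base)
    return [base / "tesseract", base / "poppler" / "bin"]


def prepend_to_path(dirs, env=None):
    """Put `dirs` in front of PATH in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    parts = [str(d) for d in dirs]
    existing = env.get("PATH", "")
    if existing:
        parts.append(existing)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


if __name__ == "__main__":
    # Assumption: the tools were unpacked next to the Python environment.
    base = Path(sys.prefix) / "ocr-tools"
    prepend_to_path(tool_dirs(base))
    # Prints the resolved executables, or None if they are not there.
    print(shutil.which("tesseract"), shutil.which("pdftoppm"))
```

This only changes PATH for the running process, so nothing is written to the system settings and nothing has to live in the project folder.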
Right now, I have a PowerShell script which pip installs everything I need. I should be able to download Tesseract and Poppler at the same time as the rest of my pip installs.
$libraries = @(
    "pdf2image", # turnIntoImage()
    "pytesseract"
)
foreach ($lib in $libraries) {
    Write-Host "Installing $lib..."
    pip install $lib
}
# Add code here which downloads Poppler and Tesseract
I've done a lot of research, and this is what I've tried:
- Downloading the files myself (not programmatic)
- Downloading the files upon accessing a Python file (should be in ps1)
- Running a tesseract.exe file in ps1 (not good practice, takes forever)
- Using ps1 to download files straight to the C: drive (no access, lots of errors, didn't work)
Query https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest, then pick either the tarball_url or zipball_url property from the JSON response; these contain the download links for the latest release in .tar.gz or .zip. For the other one it doesn't look like they have an API... seems like you'll need to do some web scraping
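The lookup described above could be sketched roughly as follows, using only the standard library. The function names and the `User-Agent` string are made up for illustration. One caveat: to my knowledge `zipball_url` and `tarball_url` point at source archives rather than a ready-to-run Windows build, so this shows how to read the release metadata, not a complete install step.

```python
import json
import urllib.request

LATEST_RELEASE_URL = (
    "https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest"
)


def pick_archive_url(release, prefer="zip"):
    """Pick zipball_url or tarball_url from a release JSON object."""
    key = "zipball_url" if prefer == "zip" else "tarball_url"
    return release[key]


def fetch_latest_release(url=LATEST_RELEASE_URL):
    """Download and decode the latest-release JSON (network request)."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "release-lookup-sketch",  # hypothetical name
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Usage (performs a network request, so it is left commented out):
# release = fetch_latest_release()
# print(release["tag_name"], pick_archive_url(release, prefer="zip"))
```

Keeping the URL selection separate from the download means the selection logic can be checked without touching the network.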