feat(files): add Mistral OCR and configurable OCR backends for uploads#423
feat(files): add Mistral OCR and configurable OCR backends for uploads#423adavyas wants to merge 6 commits intoplastic-labs:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds configurable OCR settings and integrates OCR into file processing: processors accept Changes
Sequence DiagramsequenceDiagram
participant Client
participant FileService as FileProcessingService
participant PDFProc as PDFProcessor
participant ImgProc as ImageProcessor
participant OCR as OCRProvider
participant Config as AppConfig
Client->>FileService: upload_file(content, content_type)
FileService->>Config: read OCRSettings
FileService->>FileService: _get_processor(content_type)
alt PDF
FileService->>PDFProc: extract_text(content, content_type)
PDFProc->>PDFProc: _native_pdf_text(content)
alt OCR MODE off
PDFProc-->>FileService: return native_text
else OCR enabled/fallback
alt native_text length >= MIN_EXTRACTED_TEXT_CHARS
PDFProc-->>FileService: return native_text
else
PDFProc->>OCR: POST _ocr_endpoint() with _ocr_payload
alt OCR success
OCR-->>PDFProc: ocr_response
PDFProc->>PDFProc: _coerce_ocr_text(response)
PDFProc-->>FileService: return ocr_text
else OCR failure
PDFProc-->>FileService: return native_text
end
end
end
else Image
FileService->>ImgProc: extract_text(content, content_type)
ImgProc->>OCR: POST _ocr_endpoint() with _ocr_payload
alt OCR success
OCR-->>ImgProc: ocr_response
ImgProc->>ImgProc: _coerce_ocr_text(response)
ImgProc-->>FileService: return ocr_text
else OCR failure
ImgProc-->>FileService: return error/empty
end
else Text/JSON
FileService->>FileService: TextProcessor/JSONProcessor.extract_text(...)
FileService-->>Client: return extracted_text
end
FileService-->>Client: extracted_text
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/utils/files.py (1)
138-142: Consider logging the OCR exception details for debugging.Currently, when OCR fails and falls back to native text, only a warning is logged without the exception details. Consider including exception info for easier debugging.
♻️ Suggested enhancement for exception logging
try: return await _ocr_extract_text(content, content_type) - except Exception: + except Exception as e: if native_text.strip(): - logger.warning("OCR failed for PDF upload, falling back to native text") + logger.warning("OCR failed for PDF upload, falling back to native text: %s", e) return native_text raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/files.py` around lines 138 - 142, In the except Exception block that handles OCR fallback in src/utils/files.py (the block that checks native_text.strip()), capture the exception (e.g., except Exception as e:) and include its details in the log instead of just a plain warning; update the logger call to include the exception (for example use logger.exception(...) or logger.warning(..., exc_info=True)) while preserving the existing behavior of returning native_text when available and re-raising when not.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/utils/files.py`:
- Around line 138-142: In the except Exception block that handles OCR fallback
in src/utils/files.py (the block that checks native_text.strip()), capture the
exception (e.g., except Exception as e:) and include its details in the log
instead of just a plain warning; update the logger call to include the exception
(for example use logger.exception(...) or logger.warning(..., exc_info=True))
while preserving the existing behavior of returning native_text when available
and re-raising when not.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 66000383-22d4-4d89-beb8-b82390c3b841
📒 Files selected for processing (4)
src/config.pysrc/utils/files.pytests/conftest.pytests/routes/test_files.py
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/utils/files.py (2)
26-35: Consider running blocking I/O in a thread pool.
_native_pdf_textperforms synchronous blocking I/O withpdfplumber, which can block the event loop when called from the asyncPDFProcessor.extract_text. For small files this may be acceptable, but for larger PDFs or concurrent uploads, wrapping inasyncio.to_thread()would prevent blocking.♻️ Optional: Wrap in thread pool executor
+import asyncio + + def _native_pdf_text(content: bytes) -> str: import pdfplumber with pdfplumber.open(BytesIO(content)) as pdf_reader: text_parts: list[str] = [] for page_num, page in enumerate(pdf_reader.pages): text = page.extract_text() if text and text.strip(): text_parts.append(f"[Page {page_num + 1}]\n{text}") return "\n\n".join(text_parts) + + +async def _native_pdf_text_async(content: bytes) -> str: + return await asyncio.to_thread(_native_pdf_text, content)Then call
_native_pdf_text_asyncfromPDFProcessor.extract_text.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/files.py` around lines 26 - 35, The synchronous _native_pdf_text function does blocking I/O and should be run off the event loop; create an async wrapper (e.g., _native_pdf_text_async) that calls the existing _native_pdf_text via asyncio.to_thread(...) (or use asyncio.get_running_loop().run_in_executor) and then update PDFProcessor.extract_text to call _native_pdf_text_async instead of calling _native_pdf_text directly so PDF extraction no longer blocks the event loop during pdfplumber processing.
106-117: Consider wrapping HTTP exceptions for consistent error handling.The
httpxclient can raise various exceptions (ConnectError,TimeoutException,HTTPStatusError) that will propagate directly to callers. ForPDFProcessor, the bareexcept Exceptionhandles this gracefully. However,ImageProcessorhas no fallback, so these low-level exceptions will bubble up to the API layer.Wrapping in a domain-specific exception (e.g.,
FileProcessingError) would provide more consistent error handling and avoid leaking implementation details.♻️ Proposed: Wrap httpx exceptions
async def _ocr_extract_text(content: bytes, content_type: str) -> str: if settings.OCR.MODE == "off": raise UnsupportedFileTypeError("OCR is not enabled") - async with httpx.AsyncClient(timeout=settings.OCR.TIMEOUT_SECONDS) as client: - response = await client.post( - _ocr_endpoint(), - headers=_ocr_headers(), - json=_ocr_payload(content, content_type), - ) - response.raise_for_status() - return _coerce_ocr_text(response.json()) + try: + async with httpx.AsyncClient(timeout=settings.OCR.TIMEOUT_SECONDS) as client: + response = await client.post( + _ocr_endpoint(), + headers=_ocr_headers(), + json=_ocr_payload(content, content_type), + ) + response.raise_for_status() + return _coerce_ocr_text(response.json()) + except httpx.HTTPStatusError as e: + raise FileProcessingError(f"OCR request failed: {e.response.status_code}") from e + except httpx.RequestError as e: + raise FileProcessingError(f"OCR request error: {e}") from e🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/files.py` around lines 106 - 117, The _ocr_extract_text function currently lets httpx exceptions leak to callers; catch httpx.RequestError and httpx.HTTPStatusError (or broadly httpx.HTTPError) around the client.post/response.raise_for_status call and re-raise a domain-specific exception (e.g., FileProcessingError) that includes the original exception message/context so callers like ImageProcessor and PDFProcessor get consistent errors; ensure the FileProcessingError retains the original exception as the __cause__ or includes it in its message for logging/diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/utils/files.py`:
- Around line 173-178: The supports_file_type method currently returns True for
any "image/*" even when OCR is disabled; update
ImageProcessor.supports_file_type to first check the OCR mode (e.g., call the
existing OCR flag or helper such as is_ocr_enabled() / config.OCR_ENABLED) and
return False when OCR is not enabled so images are rejected up-front; keep
extract_text as-is (it will still call _ocr_extract_text) and ensure the
rejection uses the same "unsupported file type" flow so callers surface the
standard UnsupportedFileTypeError message rather than the later "OCR is not
enabled" error.
---
Nitpick comments:
In `@src/utils/files.py`:
- Around line 26-35: The synchronous _native_pdf_text function does blocking I/O
and should be run off the event loop; create an async wrapper (e.g.,
_native_pdf_text_async) that calls the existing _native_pdf_text via
asyncio.to_thread(...) (or use asyncio.get_running_loop().run_in_executor) and
then update PDFProcessor.extract_text to call _native_pdf_text_async instead of
calling _native_pdf_text directly so PDF extraction no longer blocks the event
loop during pdfplumber processing.
- Around line 106-117: The _ocr_extract_text function currently lets httpx
exceptions leak to callers; catch httpx.RequestError and httpx.HTTPStatusError
(or broadly httpx.HTTPError) around the client.post/response.raise_for_status
call and re-raise a domain-specific exception (e.g., FileProcessingError) that
includes the original exception message/context so callers like ImageProcessor
and PDFProcessor get consistent errors; ensure the FileProcessingError retains
the original exception as the __cause__ or includes it in its message for
logging/diagnostics.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/utils/files.py`:
- Around line 183-188: The processors list currently always includes
ImageProcessor (self.processors) which causes supported-types to advertise image
support even when OCR or image backends are unavailable; modify the
initialization of self.processors to only append ImageProcessor() when the
runtime/config checks indicate image support (e.g., OCR enabled or required
libraries successfully imported), or instead compute the supported-types list by
iterating active processors (instances in self.processors) and asking each for
supported mime/extensions so disabled processors are never reported; update the
ImageProcessor registration and the code that builds the supported-types list
(references: self.processors, ImageProcessor, FileProcessor and the code that
emits the supported-types/unsupported-type error) to ensure image support is
only advertised when ImageProcessor is actually usable.
- Around line 124-144: The extract_text method currently always calls
_native_pdf_text and returns native_text on any OCR exception, which breaks
OCR.MODE == "force"; change extract_text so that when settings.OCR.MODE ==
"force" it does not call _native_pdf_text before attempting OCR (call await
_ocr_extract_text directly) and only use _native_pdf_text as a fallback inside
the except block when settings.OCR.MODE == "fallback"; ensure that in "force"
mode OCR exceptions are propagated (do not return native_text) and keep the
existing fallback behavior for "off" and "fallback" modes, updating references
to _native_pdf_text, _ocr_extract_text, settings.OCR.MODE, and the extract_text
method accordingly.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/utils/files.py (1)
131-145:⚠️ Potential issue | 🟠 MajorFallback mode never reaches OCR when native PDF extraction throws.
In fallback mode, Line 131 runs
_native_pdf_text()before thetry, so anypdfplumberparse error escapes immediately and OCR is never attempted. That breaks the native-first fallback contract for scanned or parser-hostile PDFs.🛠️ Suggested change
async def extract_text(self, content: bytes, content_type: str) -> str: if settings.OCR.MODE == "off": return _native_pdf_text(content) if settings.OCR.MODE == "force": return await _ocr_extract_text(content, content_type) - native_text = _native_pdf_text(content) + native_text = "" + try: + native_text = _native_pdf_text(content) + except Exception: + logger.warning( + "Native PDF extraction failed; trying OCR", + exc_info=True, + ) if ( settings.OCR.MODE == "fallback" and len(native_text.strip()) >= settings.OCR.MIN_EXTRACTED_TEXT_CHARS ): return native_text🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/files.py` around lines 131 - 145, _native_pdf_text is called outside the try/except so its exceptions bypass OCR; wrap the native extraction and the OCR call in a controlled flow: call _native_pdf_text inside a try block (capture exceptions to native_text = "" and logger.warning the parse error), then if settings.OCR.MODE == "fallback" and len(native_text.strip()) >= settings.OCR.MIN_EXTRACTED_TEXT_CHARS return native_text, otherwise attempt await _ocr_extract_text(content, content_type) and, if that raises, if settings.OCR.MODE == "fallback" and native_text.strip() return native_text else re-raise the exception; reference functions/_symbols: _native_pdf_text, _ocr_extract_text, settings.OCR.MODE, settings.OCR.MIN_EXTRACTED_TEXT_CHARS, logger.warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/utils/files.py`:
- Around line 131-145: _native_pdf_text is called outside the try/except so its
exceptions bypass OCR; wrap the native extraction and the OCR call in a
controlled flow: call _native_pdf_text inside a try block (capture exceptions to
native_text = "" and logger.warning the parse error), then if settings.OCR.MODE
== "fallback" and len(native_text.strip()) >=
settings.OCR.MIN_EXTRACTED_TEXT_CHARS return native_text, otherwise attempt
await _ocr_extract_text(content, content_type) and, if that raises, if
settings.OCR.MODE == "fallback" and native_text.strip() return native_text else
re-raise the exception; reference functions/_symbols: _native_pdf_text,
_ocr_extract_text, settings.OCR.MODE, settings.OCR.MIN_EXTRACTED_TEXT_CHARS,
logger.warning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 991ff92c-ac90-46aa-afd2-fdbbc5956b4e
📒 Files selected for processing (2)
src/utils/files.pytests/routes/test_files.py
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/utils/files.py (1)
84-103: Consider a more specific exception type.The
ValueErroron Line 103 is acceptable but could be more specific for debugging OCR integration issues.♻️ Optional: Use a domain-specific exception
+from src.exceptions import FileProcessingError ... - raise ValueError("OCR response did not contain extracted text") + raise FileProcessingError("OCR response did not contain extracted text")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/files.py` around lines 84 - 103, The function _coerce_ocr_text currently raises a generic ValueError when it can't find extracted text; define and raise a domain-specific exception (e.g., OcrExtractionError or OCRTextNotFoundError) and use that in _coerce_ocr_text instead of ValueError so callers can distinguish OCR integration failures; add the new exception class near other utility exceptions (or top of src/utils/files.py) and update any callers/tests that expect ValueError to handle or catch the new exception name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/utils/files.py`:
- Around line 84-103: The function _coerce_ocr_text currently raises a generic
ValueError when it can't find extracted text; define and raise a domain-specific
exception (e.g., OcrExtractionError or OCRTextNotFoundError) and use that in
_coerce_ocr_text instead of ValueError so callers can distinguish OCR
integration failures; add the new exception class near other utility exceptions
(or top of src/utils/files.py) and update any callers/tests that expect
ValueError to handle or catch the new exception name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 656d6f71-855d-4959-911b-602ef7e6e9a8
📒 Files selected for processing (2)
src/utils/files.pytests/routes/test_files.py
Summary
Adds configurable OCR support to the upload pipeline for PDFs and images.
This introduces Mistral OCR as the primary hosted OCR backend for uploaded files, while also supporting DeepSeek-compatible OCR endpoints. Native PDF extraction via
pdfplumberremains in place, and OCR can be configured to run inoff,fallback, orforcemode.Motivation
Honcho currently relies on native PDF text extraction, which does not work well for scanned PDFs or image-based uploads. This change adds OCR support directly to the file-ingestion path so uploads can still be converted into messages when native extraction is insufficient.
Closes #410
What Changed
src/config.pysrc/utils/files.pyto support:simple_batch_embed()so tests remain hermeticDesign Notes
offTesting
Validated with:
Result:
916 passedAdditional OCR-specific checks:
Result:
19 passed, 1 skippedOpt-in live Mistral OCR smoke test:
Result:
1 passedNotes
-n auto) hit Postgres teardown contention in this environment, so verification was performed with-n 0Summary by CodeRabbit
New Features
Tests