feat(files): add Mistral OCR and configurable OCR backends for uploads #423

Draft
adavyas wants to merge 6 commits into plastic-labs:main from adavyas:feature/ocr-upload-backends

Conversation

Contributor

@adavyas adavyas commented Mar 11, 2026

Summary

Adds configurable OCR support to the upload pipeline for PDFs and images.

This introduces Mistral OCR as the primary hosted OCR backend for uploaded files, while also supporting DeepSeek-compatible OCR endpoints. Native PDF extraction via pdfplumber remains in place, and OCR can be configured to run in off, fallback, or force mode.
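
As a rough illustration of the three modes, here is a minimal sketch of what a settings object with mode validation might look like. It uses a plain dataclass rather than whatever `src/config.py` actually uses. `MODE`, `TIMEOUT_SECONDS`, and `MIN_EXTRACTED_TEXT_CHARS` are names that appear in the review comments on this PR; `PROVIDER` and all default values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# The three modes named in the PR description.
OCRMode = Literal["off", "fallback", "force"]


@dataclass(frozen=True)
class OCRSettingsSketch:
    """Illustrative stand-in for the PR's OCR settings (not the real class)."""

    MODE: OCRMode = "off"
    PROVIDER: str = "mistral"            # hypothetical field name
    TIMEOUT_SECONDS: float = 30.0        # illustrative default
    MIN_EXTRACTED_TEXT_CHARS: int = 50   # illustrative default

    def __post_init__(self) -> None:
        # Reject anything other than the three documented modes.
        if self.MODE not in get_args(OCRMode):
            raise ValueError(f"Unknown OCR mode: {self.MODE!r}")
        if self.TIMEOUT_SECONDS <= 0:
            raise ValueError("TIMEOUT_SECONDS must be positive")
        if self.MIN_EXTRACTED_TEXT_CHARS < 0:
            raise ValueError("MIN_EXTRACTED_TEXT_CHARS must be non-negative")
```

With this shape, `OCRSettingsSketch()` keeps OCR off by default, which matches the stated goal that existing behavior is unchanged unless OCR is enabled.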

Motivation

Honcho currently relies on native PDF text extraction, which does not work well for scanned PDFs or image-based uploads. This change adds OCR support directly to the file-ingestion path so uploads can still be converted into messages when native extraction is insufficient.

Closes #410

What Changed

  • Added OCR settings in src/config.py
  • Extended src/utils/files.py to support:
    • native PDF extraction only
    • native-first OCR fallback
    • forced OCR
  • Added Mistral OCR support for uploaded PDFs and images
  • Added support for DeepSeek-compatible OCR endpoints
  • Added image OCR support for uploads
  • Added focused upload tests for:
    • image OCR
    • PDF native extraction in fallback mode
    • PDF OCR fallback when native extraction is insufficient
  • Added an opt-in live Mistral OCR smoke test for the upload route
  • Updated the shared embedding test mock to cover simple_batch_embed() so tests remain hermetic

Design Notes

  • The OCR integration is intentionally limited to the existing upload/text-extraction path
  • Existing behavior remains unchanged when OCR mode is off
  • No changes were made to deriver, dialectic, queue orchestration, or the primary LLM client abstraction

Testing

Validated with:

env DB_CONNECTION_URI=postgresql+psycopg://testuser:testpwd@127.0.0.1:5433/honcho \
LLM_GEMINI_API_KEY=test-gemini-key \
LLM_ANTHROPIC_API_KEY=test-anthropic-key \
LLM_OPENAI_API_KEY=test-openai-key \
LLM_GROQ_API_KEY=test-groq-key \
uv run pytest -n 0 -q

Result:

  • 916 passed

Additional OCR-specific checks:

env DB_CONNECTION_URI=postgresql+psycopg://testuser:testpwd@127.0.0.1:5433/honcho \
LLM_GEMINI_API_KEY=test-gemini-key \
LLM_ANTHROPIC_API_KEY=test-anthropic-key \
LLM_OPENAI_API_KEY=test-openai-key \
LLM_GROQ_API_KEY=test-groq-key \
uv run pytest -n 0 tests/routes/test_files.py -q

Result:

  • 19 passed, 1 skipped

Opt-in live Mistral OCR smoke test:

env DB_CONNECTION_URI=postgresql+psycopg://testuser:testpwd@127.0.0.1:5433/honcho \
LLM_GEMINI_API_KEY=test-gemini-key \
LLM_ANTHROPIC_API_KEY=test-anthropic-key \
LLM_OPENAI_API_KEY=test-openai-key \
LLM_GROQ_API_KEY=test-groq-key \
RUN_LIVE_MISTRAL_OCR_TEST=1 \
MISTRAL_API_KEY=... \
uv run pytest -n 0 tests/routes/test_files.py -q -k live_mistral

Result:

  • 1 passed

Notes

  • The repository's default parallel pytest configuration (-n auto) hit Postgres teardown contention in this environment, so verification was performed with -n 0
  • The live Mistral test is skipped by default and only runs when explicitly opted in with a real API key
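
The opt-in gate described above could be sketched as follows. The environment variable names come from the command shown earlier; the helper name and the exact check (flag equal to `"1"` plus a non-empty key) are assumptions, not necessarily what `tests/routes/test_files.py` does.

```python
import os
from collections.abc import Mapping


def live_mistral_test_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the opt-in flag is "1" and an API key is set.

    Hypothetical helper: the real test may gate on these variables differently.
    """
    opted_in = env.get("RUN_LIVE_MISTRAL_OCR_TEST") == "1"
    has_key = bool(env.get("MISTRAL_API_KEY"))
    return opted_in and has_key
```

In a test module, a helper like this would typically feed `pytest.mark.skipif(not live_mistral_test_enabled(), reason="live Mistral OCR test is opt-in")`, so the test reports as skipped rather than failed in a default run.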

Summary by CodeRabbit

  • New Features

    • OCR support for images and PDFs with configurable providers, modes (force/fallback/off), timeouts, and minimum-text thresholds.
    • App config exposes OCR settings and recognizes an "ocr" config section.
    • File processing updated to prefer native extraction but use OCR according to configuration, with image OCR enabled when configured.
  • Tests

    • Expanded coverage for OCR flows, PDF native/OCR fallback and force modes, image OCR, metadata/chunking, embedding mocks, and error cases.
Contributor

coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds configurable OCR settings and integrates OCR into file processing. Processors now accept content_type, PDF handling combines native extraction with OCR force/fallback logic, and an ImageProcessor with async OCR calls is added. Tests are expanded to cover OCR flows, PDF behaviors, metadata, and chunking, and the embeddings mock is extended.

Changes

| Cohort / File(s) | Summary |
|---|---|
| OCR Configuration: src/config.py | Added OCRSettings class, validator, and added OCR: OCRSettings to AppSettings; mapped TOML section "OCR" → "ocr". |
| File Processing & OCR Integration: src/utils/files.py | Updated FileProcessor.extract_text signatures to accept content_type; added native PDF extraction helpers (_native_pdf_text) and OCR helpers (_ocr_endpoint, _ocr_headers, _ocr_payload, _coerce_ocr_text, _ocr_extract_text); introduced async OCR calls, ImageProcessor, and PDF force/fallback logic; updated FileProcessingService wiring and error handling. |
| Tests: Embeddings Mock: tests/conftest.py | Extended mock_openai_embeddings to patch/expose simple_batch_embed alongside existing embed and batch_embed. |
| Tests: File Uploads / OCR: tests/routes/test_files.py | Added PDF helper and many tests covering image OCR, PDF native/OCR fallback and force modes, metadata/chunking behavior, unsupported types, created_at handling, and a guarded live-Mistral OCR smoke test. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant FileService as FileProcessingService
    participant PDFProc as PDFProcessor
    participant ImgProc as ImageProcessor
    participant OCR as OCRProvider
    participant Config as AppConfig

    Client->>FileService: upload_file(content, content_type)
    FileService->>Config: read OCRSettings
    FileService->>FileService: _get_processor(content_type)

    alt PDF
        FileService->>PDFProc: extract_text(content, content_type)
        PDFProc->>PDFProc: _native_pdf_text(content)
        alt OCR MODE off
            PDFProc-->>FileService: return native_text
        else OCR enabled/fallback
            alt native_text length >= MIN_EXTRACTED_TEXT_CHARS
                PDFProc-->>FileService: return native_text
            else
                PDFProc->>OCR: POST _ocr_endpoint() with _ocr_payload
                alt OCR success
                    OCR-->>PDFProc: ocr_response
                    PDFProc->>PDFProc: _coerce_ocr_text(response)
                    PDFProc-->>FileService: return ocr_text
                else OCR failure
                    PDFProc-->>FileService: return native_text
                end
            end
        end
    else Image
        FileService->>ImgProc: extract_text(content, content_type)
        ImgProc->>OCR: POST _ocr_endpoint() with _ocr_payload
        alt OCR success
            OCR-->>ImgProc: ocr_response
            ImgProc->>ImgProc: _coerce_ocr_text(response)
            ImgProc-->>FileService: return ocr_text
        else OCR failure
            ImgProc-->>FileService: return error/empty
        end
    else Text/JSON
        FileService->>FileService: TextProcessor/JSONProcessor.extract_text(...)
        FileService-->>Client: return extracted_text
    end

    FileService-->>Client: extracted_text

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through bytes and lines anew,
PDFs whispered what images knew,
Mistral and Deepseek lent their light,
I chewed the text by moonlit byte,
Hooray — OCR feels right.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 34.15% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main changes: adding Mistral OCR and configurable OCR backends for the upload pipeline, which aligns with the primary feature being delivered. |
| Linked Issues check | ✅ Passed | All coding objectives from issue #410 are met: OCR configuration in src/config.py, Mistral and DeepSeek backend support, integration into upload/extraction flow, native extraction preservation, and configurable modes (off/fallback/force). |
| Out of Scope Changes check | ✅ Passed | All changes are within scope of issue #410. File changes limited to config, upload utilities, and related tests; no modifications to unrelated subsystems like deriver, queue orchestration, or primary LLM client. |


Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
src/utils/files.py (1)

138-142: Consider logging the OCR exception details for debugging.

Currently, when OCR fails and falls back to native text, only a warning is logged without the exception details. Consider including exception info for easier debugging.

♻️ Suggested enhancement for exception logging
         try:
             return await _ocr_extract_text(content, content_type)
-        except Exception:
+        except Exception as e:
             if native_text.strip():
-                logger.warning("OCR failed for PDF upload, falling back to native text")
+                logger.warning("OCR failed for PDF upload, falling back to native text: %s", e)
                 return native_text
             raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/files.py` around lines 138 - 142, In the except Exception block
that handles OCR fallback in src/utils/files.py (the block that checks
native_text.strip()), capture the exception (e.g., except Exception as e:) and
include its details in the log instead of just a plain warning; update the
logger call to include the exception (for example use logger.exception(...) or
logger.warning(..., exc_info=True)) while preserving the existing behavior of
returning native_text when available and re-raising when not.
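
To make the difference between the two logging styles concrete, here is a small self-contained sketch. It records log entries in memory and shows that formatting the exception with `%s` keeps only the message, while `exc_info=True` also attaches the traceback. The logger name, handler class, and simulated `TimeoutError` are all invented for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log records in memory so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("ocr_logging_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def log_ocr_failure() -> None:
    """Simulate an OCR failure and log it in both styles."""
    try:
        raise TimeoutError("OCR backend timed out")
    except Exception as e:
        # Style 1: message only; the traceback is not recorded.
        logger.warning("OCR failed, falling back to native text: %s", e)
        # Style 2: same warning level, but the traceback is attached.
        logger.warning("OCR failed, falling back to native text", exc_info=True)
```

After calling `log_ocr_failure()`, the first record has no `exc_info`, while the second carries the exception type, value, and traceback.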

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 66000383-22d4-4d89-beb8-b82390c3b841

📥 Commits

Reviewing files that changed from the base of the PR and between 24f94f3 and 396e7ed.

📒 Files selected for processing (4)
  • src/config.py
  • src/utils/files.py
  • tests/conftest.py
  • tests/routes/test_files.py
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/utils/files.py (2)

26-35: Consider running blocking I/O in a thread pool.

_native_pdf_text performs synchronous blocking I/O with pdfplumber, which can block the event loop when called from the async PDFProcessor.extract_text. For small files this may be acceptable, but for larger PDFs or concurrent uploads, wrapping in asyncio.to_thread() would prevent blocking.

♻️ Optional: Wrap in thread pool executor
+import asyncio
+
+
 def _native_pdf_text(content: bytes) -> str:
     import pdfplumber
 
     with pdfplumber.open(BytesIO(content)) as pdf_reader:
         text_parts: list[str] = []
         for page_num, page in enumerate(pdf_reader.pages):
             text = page.extract_text()
             if text and text.strip():
                 text_parts.append(f"[Page {page_num + 1}]\n{text}")
         return "\n\n".join(text_parts)
+
+
+async def _native_pdf_text_async(content: bytes) -> str:
+    return await asyncio.to_thread(_native_pdf_text, content)

Then call _native_pdf_text_async from PDFProcessor.extract_text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/files.py` around lines 26 - 35, The synchronous _native_pdf_text
function does blocking I/O and should be run off the event loop; create an async
wrapper (e.g., _native_pdf_text_async) that calls the existing _native_pdf_text
via asyncio.to_thread(...) (or use asyncio.get_running_loop().run_in_executor)
and then update PDFProcessor.extract_text to call _native_pdf_text_async instead
of calling _native_pdf_text directly so PDF extraction no longer blocks the
event loop during pdfplumber processing.
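
The effect of moving blocking work off the event loop can be seen in a small runnable sketch. `blocking_extract` below is a stand-in for pdfplumber-based extraction (it only sleeps and decodes bytes), and the ticker task is there purely to show that the loop keeps running while the worker thread is busy. All names are hypothetical.

```python
import asyncio
import time


def blocking_extract(content: bytes) -> str:
    """Stand-in for a synchronous, CPU- or I/O-bound PDF extraction call."""
    time.sleep(0.1)  # simulate slow parsing
    return content.decode("utf-8", errors="replace")


async def extract_async(content: bytes) -> str:
    # Run the blocking function in a worker thread (Python 3.9+).
    return await asyncio.to_thread(blocking_extract, content)


async def demo() -> tuple[str, int]:
    """Return the extracted text and how often the loop ticked meanwhile."""
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    text = await extract_async(b"page text")
    task.cancel()
    return text, ticks
```

Running `asyncio.run(demo())` returns the decoded text together with a non-zero tick count, which would stay at zero if `blocking_extract` were called directly inside the coroutine.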

106-117: Consider wrapping HTTP exceptions for consistent error handling.

The httpx client can raise various exceptions (ConnectError, TimeoutException, HTTPStatusError) that will propagate directly to callers. For PDFProcessor, the bare except Exception handles this gracefully. However, ImageProcessor has no fallback, so these low-level exceptions will bubble up to the API layer.

Wrapping in a domain-specific exception (e.g., FileProcessingError) would provide more consistent error handling and avoid leaking implementation details.

♻️ Proposed: Wrap httpx exceptions
 async def _ocr_extract_text(content: bytes, content_type: str) -> str:
     if settings.OCR.MODE == "off":
         raise UnsupportedFileTypeError("OCR is not enabled")
 
-    async with httpx.AsyncClient(timeout=settings.OCR.TIMEOUT_SECONDS) as client:
-        response = await client.post(
-            _ocr_endpoint(),
-            headers=_ocr_headers(),
-            json=_ocr_payload(content, content_type),
-        )
-        response.raise_for_status()
-        return _coerce_ocr_text(response.json())
+    try:
+        async with httpx.AsyncClient(timeout=settings.OCR.TIMEOUT_SECONDS) as client:
+            response = await client.post(
+                _ocr_endpoint(),
+                headers=_ocr_headers(),
+                json=_ocr_payload(content, content_type),
+            )
+            response.raise_for_status()
+            return _coerce_ocr_text(response.json())
+    except httpx.HTTPStatusError as e:
+        raise FileProcessingError(f"OCR request failed: {e.response.status_code}") from e
+    except httpx.RequestError as e:
+        raise FileProcessingError(f"OCR request error: {e}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/files.py` around lines 106 - 117, The _ocr_extract_text function
currently lets httpx exceptions leak to callers; catch httpx.RequestError and
httpx.HTTPStatusError (or broadly httpx.HTTPError) around the
client.post/response.raise_for_status call and re-raise a domain-specific
exception (e.g., FileProcessingError) that includes the original exception
message/context so callers like ImageProcessor and PDFProcessor get consistent
errors; ensure the FileProcessingError retains the original exception as the
__cause__ or includes it in its message for logging/diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/files.py`:
- Around line 173-178: The supports_file_type method currently returns True for
any "image/*" even when OCR is disabled; update
ImageProcessor.supports_file_type to first check the OCR mode (e.g., call the
existing OCR flag or helper such as is_ocr_enabled() / config.OCR_ENABLED) and
return False when OCR is not enabled so images are rejected up-front; keep
extract_text as-is (it will still call _ocr_extract_text) and ensure the
rejection uses the same "unsupported file type" flow so callers surface the
standard UnsupportedFileTypeError message rather than the later "OCR is not
enabled" error.
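
The inline comment above does not include a diff, so here is one possible shape of the change as a minimal sketch. The class is a simplified stand-in: it takes the OCR mode as a constructor argument, whereas the real `ImageProcessor` would presumably read `settings.OCR.MODE`.

```python
from dataclasses import dataclass


@dataclass
class ImageProcessorSketch:
    """Simplified stand-in for ImageProcessor (not the PR's actual class)."""

    ocr_mode: str  # assumption: real code would consult settings.OCR.MODE

    def supports_file_type(self, content_type: str) -> bool:
        # Reject images up front when OCR is disabled, so callers hit the
        # standard "unsupported file type" path instead of a later OCR error.
        if self.ocr_mode == "off":
            return False
        return content_type.startswith("image/")
```

With this check in place, an `image/png` upload is rejected before any processor is selected when OCR is off, and accepted in `fallback` or `force` mode.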


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e1a4d1cf-8492-4c16-8caf-6c35080ffc8c

📥 Commits

Reviewing files that changed from the base of the PR and between 396e7ed and a570619.

📒 Files selected for processing (1)
  • src/utils/files.py
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/files.py`:
- Around line 183-188: The processors list currently always includes
ImageProcessor (self.processors) which causes supported-types to advertise image
support even when OCR or image backends are unavailable; modify the
initialization of self.processors to only append ImageProcessor() when the
runtime/config checks indicate image support (e.g., OCR enabled or required
libraries successfully imported), or instead compute the supported-types list by
iterating active processors (instances in self.processors) and asking each for
supported mime/extensions so disabled processors are never reported; update the
ImageProcessor registration and the code that builds the supported-types list
(references: self.processors, ImageProcessor, FileProcessor and the code that
emits the supported-types/unsupported-type error) to ensure image support is
only advertised when ImageProcessor is actually usable.
- Around line 124-144: The extract_text method currently always calls
_native_pdf_text and returns native_text on any OCR exception, which breaks
OCR.MODE == "force"; change extract_text so that when settings.OCR.MODE ==
"force" it does not call _native_pdf_text before attempting OCR (call await
_ocr_extract_text directly) and only use _native_pdf_text as a fallback inside
the except block when settings.OCR.MODE == "fallback"; ensure that in "force"
mode OCR exceptions are propagated (do not return native_text) and keep the
existing fallback behavior for "off" and "fallback" modes, updating references
to _native_pdf_text, _ocr_extract_text, settings.OCR.MODE, and the extract_text
method accordingly.
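
For the first inline comment above, one way to keep the advertised types honest is to derive them from the processors that are actually registered. The sketch below assumes each processor exposes a `supported_types` tuple, which is a hypothetical attribute; the MIME types listed are illustrative and the real `FileProcessingService` may be structured differently.

```python
class PDFProcessorSketch:
    supported_types = ("application/pdf",)


class TextProcessorSketch:
    supported_types = ("text/plain",)


class ImageProcessorSketch:
    supported_types = ("image/png", "image/jpeg")


def build_processors(ocr_mode: str) -> list[object]:
    """Register the image processor only when OCR can actually run."""
    processors: list[object] = [PDFProcessorSketch(), TextProcessorSketch()]
    if ocr_mode != "off":
        processors.append(ImageProcessorSketch())
    return processors


def advertised_types(processors: list[object]) -> list[str]:
    """Build the supported-types list from active processors only."""
    return sorted({t for p in processors for t in p.supported_types})
```

Because the list is computed from the active processors, disabling OCR removes image types from both processor selection and the unsupported-type error message in one place.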

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e10efdc-164f-4f7f-a815-c06bcd80c5e4

📥 Commits

Reviewing files that changed from the base of the PR and between a570619 and 2808d1d.

📒 Files selected for processing (1)
  • src/utils/files.py
@adavyas adavyas marked this pull request as draft March 11, 2026 09:22
@adavyas adavyas marked this pull request as ready for review March 11, 2026 09:46
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
src/utils/files.py (1)

131-145: ⚠️ Potential issue | 🟠 Major

Fallback mode never reaches OCR when native PDF extraction throws.

In fallback mode, Line 131 runs _native_pdf_text() before the try, so any pdfplumber parse error escapes immediately and OCR is never attempted. That breaks the native-first fallback contract for scanned or parser-hostile PDFs.

🛠️ Suggested change
     async def extract_text(self, content: bytes, content_type: str) -> str:
         if settings.OCR.MODE == "off":
             return _native_pdf_text(content)

         if settings.OCR.MODE == "force":
             return await _ocr_extract_text(content, content_type)

-        native_text = _native_pdf_text(content)
+        native_text = ""
+        try:
+            native_text = _native_pdf_text(content)
+        except Exception:
+            logger.warning(
+                "Native PDF extraction failed; trying OCR",
+                exc_info=True,
+            )

         if (
             settings.OCR.MODE == "fallback"
             and len(native_text.strip()) >= settings.OCR.MIN_EXTRACTED_TEXT_CHARS
         ):
             return native_text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/files.py` around lines 131 - 145, _native_pdf_text is called
outside the try/except so its exceptions bypass OCR; wrap the native extraction
and the OCR call in a controlled flow: call _native_pdf_text inside a try block
(capture exceptions to native_text = "" and logger.warning the parse error),
then if settings.OCR.MODE == "fallback" and len(native_text.strip()) >=
settings.OCR.MIN_EXTRACTED_TEXT_CHARS return native_text, otherwise attempt
await _ocr_extract_text(content, content_type) and, if that raises, if
settings.OCR.MODE == "fallback" and native_text.strip() return native_text else
re-raise the exception; reference functions/_symbols: _native_pdf_text,
_ocr_extract_text, settings.OCR.MODE, settings.OCR.MIN_EXTRACTED_TEXT_CHARS,
logger.warning.
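
Putting the review feedback together, the control flow described in the prompt above can be sketched end to end. This is not the PR's `PDFProcessor.extract_text`: the native and OCR steps are injected as callables so the example runs without pdfplumber or a network call, and the function and parameter names are assumptions.

```python
import asyncio
import logging
from collections.abc import Awaitable, Callable

logger = logging.getLogger("pdf_flow_sketch")  # hypothetical logger name


async def extract_pdf_text(
    content: bytes,
    *,
    mode: str,
    min_chars: int,
    native: Callable[[bytes], str],
    ocr: Callable[[bytes], Awaitable[str]],
) -> str:
    """Sketch of the off / force / fallback decision for PDFs."""
    if mode == "off":
        return native(content)
    if mode == "force":
        # Native extraction is skipped entirely; OCR errors propagate.
        return await ocr(content)

    # Fallback mode: try native first, but do not let a parse error stop OCR.
    native_text = ""
    try:
        native_text = native(content)
    except Exception:
        logger.warning("Native PDF extraction failed; trying OCR", exc_info=True)

    if len(native_text.strip()) >= min_chars:
        return native_text

    try:
        return await ocr(content)
    except Exception:
        if native_text.strip():
            logger.warning("OCR failed; returning native text", exc_info=True)
            return native_text
        raise
```

In this sketch, any mode other than `off` or `force` is treated as `fallback`; a real implementation would more likely rely on the settings validator to rule out unknown values.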

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 991ff92c-ac90-46aa-afd2-fdbbc5956b4e

📥 Commits

Reviewing files that changed from the base of the PR and between 2808d1d and c54db51.

📒 Files selected for processing (2)
  • src/utils/files.py
  • tests/routes/test_files.py
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
src/utils/files.py (1)

84-103: Consider a more specific exception type.

The ValueError on Line 103 is acceptable but could be more specific for debugging OCR integration issues.

♻️ Optional: Use a domain-specific exception
+from src.exceptions import FileProcessingError
 ...
-    raise ValueError("OCR response did not contain extracted text")
+    raise FileProcessingError("OCR response did not contain extracted text")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/files.py` around lines 84 - 103, The function _coerce_ocr_text
currently raises a generic ValueError when it can't find extracted text; define
and raise a domain-specific exception (e.g., OcrExtractionError or
OCRTextNotFoundError) and use that in _coerce_ocr_text instead of ValueError so
callers can distinguish OCR integration failures; add the new exception class
near other utility exceptions (or top of src/utils/files.py) and update any
callers/tests that expect ValueError to handle or catch the new exception name.
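
A domain-specific exception for this case might look like the sketch below. The exception name is one of the two suggested in the prompt above. The response shapes it handles (a top-level `text` string, or a `pages` list whose items carry a `markdown` string) are assumptions chosen for illustration; the PR's `_coerce_ocr_text` may accept different shapes.

```python
class OCRTextNotFoundError(Exception):
    """Raised when an OCR response contains no usable text."""


def coerce_ocr_text(payload: object) -> str:
    """Pull text out of an OCR response, or raise a domain-specific error.

    The accepted payload shapes are illustrative assumptions.
    """
    if isinstance(payload, dict):
        # Shape 1 (assumed): {"text": "..."}
        text = payload.get("text")
        if isinstance(text, str) and text.strip():
            return text

        # Shape 2 (assumed): {"pages": [{"markdown": "..."}, ...]}
        pages = payload.get("pages")
        if isinstance(pages, list):
            parts = [
                page["markdown"]
                for page in pages
                if isinstance(page, dict)
                and isinstance(page.get("markdown"), str)
                and page["markdown"].strip()
            ]
            if parts:
                return "\n\n".join(parts)

    raise OCRTextNotFoundError("OCR response did not contain extracted text")
```

Callers can then catch `OCRTextNotFoundError` specifically, which separates "the backend answered but returned nothing usable" from unrelated `ValueError`s raised elsewhere.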

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 656d6f71-855d-4959-911b-602ef7e6e9a8

📥 Commits

Reviewing files that changed from the base of the PR and between c54db51 and f684264.

📒 Files selected for processing (2)
  • src/utils/files.py
  • tests/routes/test_files.py
@adavyas adavyas marked this pull request as draft March 14, 2026 21:00
@adavyas adavyas marked this pull request as ready for review March 15, 2026 21:59
@adavyas adavyas marked this pull request as draft March 19, 2026 20:03