
I am trying to integrate an image-to-text model into a React Native mobile app.

My requirements: The model should support image + text input → text output. It should be lightweight enough to run on mobile devices.

What I tried

moondream-0.5b (ONNX conversion)

  • Tried converting it to ONNX.
  • Faced issues with the tokenizer for encoding/decoding.
  • The output tokens were irrelevant.

microsoft/florence-base (fine-tuned, .pt format)

  • Fine-tuned it for my use case.
  • Converted to .pt.
  • Integration failed due to an error: “corrupted PyTorch model”.

1 Answer


It sounds like you're looking for an all-in-one answer. Reworking one of your initial attempts might eventually get you there, but I'm a fan of breaking things up. Personally, I have a work requirement related to expense tracking, so I've been researching OCR for mobile and found:

https://github.com/a7medev/react-native-ml-kit (also published on NPM)
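If you go that route, on-device OCR can be as small as the sketch below. This assumes the text-recognition module from that repo and its `recognize(imageUri)` call as described in the package README; double-check the exact package name and API before relying on it.

```typescript
// Sketch: on-device text extraction with react-native-ml-kit's text-recognition module.
// Package name and API assumed from the repo's README; verify against the docs.
import TextRecognition from '@react-native-ml-kit/text-recognition';

export async function extractText(imageUri: string): Promise<string> {
  // Runs ML Kit's on-device text recognizer against a local image URI
  const result = await TextRecognition.recognize(imageUri);
  // result.text is the full recognized string; result.blocks carries per-block detail
  return result.text;
}
```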

With the extracted text in hand, you could run a cheap or free server (AWS free tier, Google Cloud free tier, a low-cost Heroku dyno) hosting a small LLM, then send the text plus your prompt to it, keeping the heavy load off the user's mobile device.
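On the client side, that hand-off is just a POST request. The endpoint URL and response shape below are hypothetical placeholders for whatever your server exposes:

```typescript
// Sketch: send the OCR output plus a prompt to a small self-hosted LLM endpoint.
// The URL and JSON shapes are made-up placeholders for your own server's API.
export async function describeImageText(extractedText: string, prompt: string): Promise<string> {
  const response = await fetch('https://your-server.example/describe', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: extractedText, prompt }),
  });
  if (!response.ok) {
    throw new Error(`Server returned ${response.status}`);
  }
  const data: { answer: string } = await response.json();
  return data.answer;
}
```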

Consider whether you truly want everything on a mobile device.

Even after quantizing a model, you're still looking at roughly 50-100 MB for the model weights alone, which makes for a pretty large app. I believe Google Play caps the app download size around 150 MB, and beyond that you have to do some funky file splitting to ship the rest (I think).
