I am trying to integrate an image-to-text model into a React Native mobile app.
My requirements: the model should accept image + text input and produce text output, and it should be lightweight enough to run on-device on mobile.
What I tried
moondream-0.5b (ONNX conversion)
- Tried converting it to ONNX.
- Ran into problems with the tokenizer when encoding the prompt and decoding the output.
- The generated tokens decoded to text that was unrelated to the image and prompt.
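One way to narrow down whether the tokenizer or the converted model is at fault is a round-trip check: encode a prompt, decode the ids straight back, and compare. The sketch below is only an illustration. `check_round_trip` and the toy whitespace tokenizer are hypothetical stand-ins, not part of moondream or ONNX Runtime, and real subword tokenizers may normalize text, so an exact match is not always expected.

```python
from typing import Callable, List, Sequence, Tuple

def check_round_trip(
    encode: Callable[[str], List[int]],
    decode: Callable[[Sequence[int]], str],
    samples: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return (original, restored) pairs where decode(encode(text)) != text."""
    failures = []
    for text in samples:
        restored = decode(encode(text))
        if restored.strip() != text.strip():
            failures.append((text, restored))
    return failures

# Toy vocabulary, purely hypothetical, standing in for the real tokenizer files.
VOCAB = {"<unk>": 0, "describe": 1, "this": 2, "image": 3}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def toy_encode(text: str) -> List[int]:
    # Unknown words map to id 0 (<unk>).
    return [VOCAB.get(word, 0) for word in text.lower().split()]

def toy_decode(ids: Sequence[int]) -> str:
    return " ".join(ID_TO_TOKEN[i] for i in ids)

def shifted_decode(ids: Sequence[int]) -> str:
    # Simulates decoding with a vocabulary that does not match the encoder:
    # every id is looked up one position off.
    return " ".join(ID_TO_TOKEN[(i + 1) % len(ID_TO_TOKEN)] for i in ids)
```

If the round trip already fails before the model is involved, the vocabulary or special-token setup used in the app likely differs from the one the model was trained with. If it passes, the problem is more likely in the exported graph or in the generation loop.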
microsoft/Florence-2-base (fine-tuned, .pt format)
- Fine-tuned it for my use case.
- Converted to .pt.
- Integration into the app failed with the error “corrupted PyTorch model”.
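A possible cause, stated as an assumption rather than a diagnosis: a `.pt` file written with `torch.save` is a pickled checkpoint, while mobile runtimes generally expect a TorchScript or lite-interpreter export, and a mismatch can surface as a "corrupted" file. The sketch below guesses how a file was saved by listing its archive entries using only the standard library. The entry names it looks for (`bytecode.pkl`, `constants.pkl`, `code/`, `data.pkl`) reflect the layout as I understand it and may differ across PyTorch versions; `classify_pt_file` is a hypothetical helper name.

```python
import zipfile

def classify_pt_file(path: str) -> str:
    """Guess how a .pt file was saved, based on the entries in its zip archive."""
    if not zipfile.is_zipfile(path):
        # Legacy (pre-zip) checkpoint, truncated download, or a different format.
        return "not-a-zip"
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Assumption: lite-interpreter exports carry a bytecode.pkl entry.
    if any(name.endswith("bytecode.pkl") for name in names):
        return "lite-interpreter"
    # Assumption: TorchScript archives carry constants.pkl and a code/ directory.
    if any(name.endswith("constants.pkl") or "/code/" in name for name in names):
        return "torchscript"
    # Assumption: plain torch.save checkpoints carry only data.pkl plus tensor data.
    if any(name.endswith("data.pkl") for name in names):
        return "pickled-checkpoint"
    return "unknown"
```

Calling `classify_pt_file("model.pt")` (the path is a placeholder) and getting `"pickled-checkpoint"` or `"not-a-zip"` would suggest re-exporting the model in the format the mobile runtime expects, rather than a problem in the app code.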