0 votes · 0 answers · 178 views

I'm trying to run Mistral-Small-3.1-24B-Instruct-2503 in multimodal mode (with image_url) using vLLM, but I'm hitting a tokenizer error: AttributeError: 'MistralTokenizer' object has no attribute '...
asked by weiming
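
This error typically means vLLM loaded the model through its Hugging Face tokenizer path instead of the native Mistral one. A minimal sketch of the setup that usually works, assuming vLLM's OpenAI-compatible server; the exact launch flags for this checkpoint are an assumption:

    # Serve with the native Mistral tokenizer (assumption: this is the missing piece):
    #   vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
    #     --tokenizer-mode mistral --config-format mistral --load-format mistral
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)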
3 votes · 2 answers · 214 views

I am working with the OmniEmbed model (https://huggingface.co/Tevatron/OmniEmbed-v0.1), which is built on Qwen2.5 7B. My goal is to get a multimodal embedding for images and videos. I have the following ...
asked by n_arch
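
Without knowing OmniEmbed's exact interface, a common pattern for embedders built on chat backbones is to run the model and pool the last hidden state. A hypothetical sketch; the loading classes, prompt format, and pooling choice are all assumptions, so check the Tevatron examples for the real ones:

    import torch
    from transformers import AutoModel, AutoProcessor
    from PIL import Image

    # Assumption: the checkpoint loads via AutoModel/AutoProcessor with remote code.
    name = "Tevatron/OmniEmbed-v0.1"
    processor = AutoProcessor.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(name, trust_remote_code=True, torch_dtype=torch.bfloat16)

    image = Image.open("photo.jpg")
    inputs = processor(text="Represent this image.", images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Mean-pool the final hidden state into one vector (assumption: many embedders
    # instead take the last token's state; the Tevatron code defines the real rule).
    embedding = out.last_hidden_state.mean(dim=1)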
0 votes · 1 answer · 2k views

I have llama-server up and running on a VPS with Ubuntu 24.04. I can send curl requests from an external IP and get answers, for text embedding for instance. Now I want to use multimodal models through ...
asked by user3102556
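
llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, and for vision models images are usually sent as base64 data URIs; whether the server was started with the model's multimodal projector is an assumption here. A sketch:

    import base64
    import requests

    # Assumption: the server was launched with a vision model plus its projector, e.g.
    #   llama-server -m model.gguf --mmproj mmproj.gguf --host 0.0.0.0 --port 8080
    with open("photo.jpg", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "http://YOUR_VPS_IP:8080/v1/chat/completions",
        json={"messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }]},
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])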
4 votes · 1 answer · 387 views

I am new to this. I have been trying but could not make the model answer questions about images. from llama_cpp import Llama import torch from PIL import Image import base64 llm = Llama( model_path='Holo1-...
asked by Abhash Rai
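
In llama-cpp-python, images are only seen when the Llama object is constructed with a vision chat handler plus the model's projector GGUF; a plain Llama(model_path=...) ignores them. A sketch using the generic LLaVA-style handler; the file paths, and whether Holo1 is compatible with this particular handler, are assumptions:

    import base64
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import Llava15ChatHandler

    # Assumption: Holo1 ships a compatible mmproj; other models need other handlers.
    chat_handler = Llava15ChatHandler(clip_model_path="mmproj.gguf")
    llm = Llama(model_path="Holo1-7B.Q4_K_M.gguf", chat_handler=chat_handler, n_ctx=4096)

    with open("screenshot.png", "rb") as f:
        data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    out = llm.create_chat_completion(messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": "What does this image show?"},
        ],
    }])
    print(out["choices"][0]["message"]["content"])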
0 votes · 0 answers · 492 views

I am trying to build a chatbot leveraging the capabilities of CopilotKit and the GPT-4o model. I am also using a React-based frontend UI, which is also supported by CopilotKit. What's happening is ...
asked by Gaurav Singh
0 votes · 2 answers · 202 views

I have a chatbot pipeline built with Haystack. I am following the Haystack docs to create a pipeline; here is an example of the pipeline using a prompt builder: from haystack import Pipeline ...
asked by Abstract
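
For reference, a minimal prompt-builder pipeline in the style of the Haystack 2.x docs; the generator component and model name are assumptions:

    from haystack import Pipeline
    from haystack.components.builders import PromptBuilder
    from haystack.components.generators import OpenAIGenerator

    # Build the prompt from a Jinja template, then feed it to the generator.
    template = "Answer the question.\nQuestion: {{ question }}\nAnswer:"
    pipe = Pipeline()
    pipe.add_component("prompt_builder", PromptBuilder(template=template))
    pipe.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
    pipe.connect("prompt_builder.prompt", "llm.prompt")

    result = pipe.run({"prompt_builder": {"question": "What is Haystack?"}})
    print(result["llm"]["replies"][0])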
0 votes · 1 answer · 1k views

I'm experimenting with Llama 3.2 Vision 11B and I'm having a bit of a rough time attaching an image, whether it's local or online, to the chat. Here's my Python code: import io import base64 import ...
asked by Za3tour420
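
A sketch of the documented Hugging Face transformers route for this model, assuming that is the API being used (an HTTP-based setup would differ):

    import requests
    import torch
    from PIL import Image
    from transformers import MllamaForConditionalGeneration, AutoProcessor

    model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    model = MllamaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # Online image; a local file works the same via Image.open("local.jpg").
    image = Image.open(requests.get("https://example.com/img.jpg", stream=True).raw)

    # The {"type": "image"} placeholder tells the processor where the image goes.
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, text, return_tensors="pt").to(model.device)
    print(processor.decode(model.generate(**inputs, max_new_tokens=100)[0]))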
2 votes · 0 answers · 193 views

I am dealing with two embeddings, text and image; both are the last_hidden_state of transformer models (BERT and ViT), so the shapes are (batch, seq, emd_dim). I want to feed text information to image using ...
asked by m sh
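
One standard way to feed text information into an image representation is cross-attention with the image tokens as queries. A sketch with random tensors standing in for the two last_hidden_state outputs; equal hidden sizes are assumed, so add a linear projection if BERT's and ViT's differ:

    import torch
    import torch.nn as nn

    batch, seq_img, seq_txt, dim = 2, 197, 128, 768
    image = torch.randn(batch, seq_img, dim)  # ViT last_hidden_state
    text = torch.randn(batch, seq_txt, dim)   # BERT last_hidden_state

    # Image tokens attend over text tokens, so text information flows into the
    # image representation; the output shape stays (batch, seq_img, dim).
    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
    fused, _ = attn(query=image, key=text, value=text)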
1 vote · 0 answers · 120 views

I'm trying to speed up the generation of captions on a large set of images using BLIP-2. The below code for one image works fine: prompt = "this is a picture of" inputs = processor(trainData[...
asked by Paul
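
BLIP-2's processor and generate both accept batches, so the usual speedup is to process many images per call instead of looping one at a time. A sketch, assuming the blip2-opt-2.7b checkpoint and placeholder image paths:

    import torch
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

    prompt = "this is a picture of"
    images = [Image.open(p) for p in ["a.jpg", "b.jpg"]]  # one batch of images

    # Pass the whole batch at once; the processor pads, generate runs batched.
    inputs = processor(images=images, text=[prompt] * len(images),
                       return_tensors="pt", padding=True).to(model.device, torch.float16)
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=30)
    captions = processor.batch_decode(ids, skip_special_tokens=True)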
1 vote · 3 answers · 3k views

I'm trying to use the Gemini model to generate descriptions for online images, but I failed when converting the Pillow image format to the Vertex AI image format. Running the code below encounters this error: ...
asked by Koala S
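
The Vertex AI SDK has no constructor that takes a PIL object directly; the usual workaround is to round-trip through bytes and build a Part. A sketch; the model name and placeholder project values are assumptions:

    import io
    import requests
    import vertexai
    from PIL import Image as PILImage
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project="your-project-id", location="us-central1")

    pil_img = PILImage.open(requests.get("https://example.com/img.jpg", stream=True).raw)

    # Serialize the PIL image to PNG bytes, then wrap the bytes as a Part.
    buf = io.BytesIO()
    pil_img.save(buf, format="PNG")
    image_part = Part.from_data(data=buf.getvalue(), mime_type="image/png")

    model = GenerativeModel("gemini-1.5-flash")
    print(model.generate_content([image_part, "Describe this image."]).text)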
2 votes · 1 answer · 649 views

I am using the transformers library (Hugging Face) to extract all hidden units of LLaVA 1.5. The Hugging Face documentation shows that it is possible to extract image hidden states from the ...
asked by Mihir Mehta
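
A sketch of pulling hidden states out of the llava-hf checkpoint with output_hidden_states=True; which positions correspond to image tokens depends on the <image> placeholder in the prompt:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")

    prompt = "USER: <image>\nWhat is shown here? ASSISTANT:"
    inputs = processor(images=Image.open("img.jpg"), text=prompt,
                       return_tensors="pt").to(model.device)

    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: one (batch, seq, dim) tensor per language-model layer;
    # the <image> placeholder positions hold the projected image tokens.
    # Newer transformers versions also expose out.image_hidden_states directly.
    print(len(out.hidden_states), out.hidden_states[-1].shape)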
1 vote · 1 answer · 2k views

On this page Google shows sample code for sending multimodal prompt requests (image + text). import vertexai from vertexai.generative_models import GenerativeModel, Part # ...
asked by Matheus Torquato
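
The sample in question follows this general shape; a sketch with placeholder project values, using the Cloud Storage sample image Google's docs commonly reference:

    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project="your-project-id", location="us-central1")

    model = GenerativeModel("gemini-1.5-flash")
    # Images in Cloud Storage can be referenced by URI instead of uploaded bytes.
    image = Part.from_uri(
        "gs://cloud-samples-data/generative-ai/image/scones.jpg",
        mime_type="image/jpeg",
    )
    print(model.generate_content([image, "What is shown in this image?"]).text)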
0 votes · 1 answer · 119 views

I am attempting to make a Gradio demo for nanoLLaVA by @stablequan. I am porting over just the structure of the Apache 2.0 licensed code in the Moondream repo. The nanoLLaVA repo has example code in the ...
asked by CoderCowMoo
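
The Gradio side of such a demo is small; a sketch in which answer() is a hypothetical stand-in for the nanoLLaVA inference call from the model repo:

    import gradio as gr

    def answer(image, question):
        # Hypothetical: replace with the nanoLLaVA generate call on (image, question).
        return f"(model output for: {question})"

    demo = gr.Interface(
        fn=answer,
        inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
        outputs=gr.Textbox(label="Answer"),
        title="nanoLLaVA demo",
    )
    demo.launch()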
5 votes · 0 answers · 895 views

I'm currently implementing a multi-modal RAG system leveraging LLaVA, Chroma, and LangChain. However, I'm having a hard time finding the embedding function LLaVA uses. Can anybody help me with that? ...
asked by Danilo Dresen
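
LLaVA itself does not ship a retrieval embedding function; its vision tower is a CLIP encoder (openai/clip-vit-large-patch14-336 for LLaVA 1.5), so a common workaround is to embed images and queries with that CLIP model directly and store the vectors in Chroma. A sketch:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    name = "openai/clip-vit-large-patch14-336"
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)

    image_inputs = processor(images=Image.open("img.jpg"), return_tensors="pt")
    text_inputs = processor(text=["a diagram of a pipeline"],
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        # Both land in the shared CLIP space, so text queries can retrieve images.
        img_emb = model.get_image_features(**image_inputs)
        txt_emb = model.get_text_features(**text_inputs)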
1 vote · 1 answer · 318 views

On trying to load Video-LLaVA with Hugging Face on Colab, I get this error: HTTPError ...
asked by Kamakshi Ramamurthy
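
An HTTPError at load time is often a wrong or renamed repo id (or a missing Hugging Face login for gated models). A sketch assuming the transformers-native conversion is the intended checkpoint:

    import torch
    from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

    # The "-hf" repo is the transformers-compatible conversion; loading the original
    # LanguageBind/Video-LLaVA-7B repo with these classes fails.
    model_id = "LanguageBind/Video-LLaVA-7B-hf"
    processor = VideoLlavaProcessor.from_pretrained(model_id)
    model = VideoLlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")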
