
I'm trying to run a RAG system on my Mac M3 Pro (18 GB RAM) using LangChain and `Llama-3.2-3B-Instruct` in a Jupyter notebook, with Milvus as the vector store.
When I invoke the chain built with RetrievalQA.from_chain_type, the cell runs indefinitely (at least 15 minutes; I did not let it run longer).
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # (optional)
    chain_type_kwargs={"prompt": prompt}
)
response = qa_chain.invoke({"query": question})

Can you help me resolve this, please?

The llm, retriever, and prompt are defined as follows:

from langchain.llms.base import LLM
from typing import Any, List, Optional
from pydantic import PrivateAttr

class HuggingFaceLLM(LLM):
    # Keep the pipeline as a private attribute so pydantic does not validate it
    _pipeline: Any = PrivateAttr()

    def __init__(self, pipeline):
        super().__init__()
        self._pipeline = pipeline

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # Generate text using the Hugging Face pipeline
        # response = self._pipeline(prompt, max_length=512, num_return_sequences=1)
        response = self._pipeline(prompt, num_return_sequences=1)
        return response[0]["generated_text"]

    @property
    def _identifying_params(self):
        return {"name": "HuggingFaceLLM"}

    @property
    def _llm_type(self):
        return "custom"

llm = HuggingFaceLLM(pipeline=llm_pipeline)
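
For reference, this is how I would call the wrapper directly, outside of any chain (just a sketch; it assumes the llm_pipeline from the block below is already loaded, and the prompt string is a placeholder):

# Call the wrapper directly, the same way LangChain calls it internally,
# to see what one generation returns for a short placeholder prompt.
print(llm.invoke("What is retrieval-augmented generation?"))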

The LLM pipeline:

from langchain.prompts import PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_name, token=hf_token)

llm_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    truncation=True,
)
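
To see whether a single generation is slow on its own, independent of LangChain, I can time a direct call to the raw pipeline (a minimal sketch; the short prompt and the small max_new_tokens are just for the test):

import time

# Time one short generation straight from the transformers pipeline,
# with a small token budget so the test itself cannot run for long.
start = time.time()
out = llm_pipeline("Hello, how are you?", max_new_tokens=20)
print(out[0]["generated_text"])
print(f"Raw pipeline call took {time.time() - start:.1f}s")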

Prompt:

prompt_template = """
You are a helpful assistant. Use the following context to answer the question concisely.
If you do not know the answer from the context, please state so and do not search for an answer elsewhere.

Context:
{context}

Question:
{question}

Answer:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

Retriever:

from typing import Any, Callable, List

import numpy as np
from langchain.schema import BaseRetriever, Document
from pydantic import BaseModel

class MilvusRetriever(BaseRetriever, BaseModel):
    collection: Any                                   # Milvus collection
    embedding_function: Callable[[str], np.ndarray]   # maps a query string to its embedding
    text_field: str                                   # field holding the raw text
    vector_field: str                                 # field holding the embeddings
    top_k: int = 5                                    # number of hits to return

    def get_relevant_documents(self, query: str) -> List[Document]:
        query_embedding = self.embedding_function(query)

        search_params = {"metric_type": "IP", "params": {"nprobe": 10}}
        results = self.collection.search(
            data=[query_embedding],
            anns_field=self.vector_field,
            param=search_params,
            limit=self.top_k,
            output_fields=[self.text_field]
        )

        documents = []
        for hit in results[0]:
            documents.append(
                Document(
                    page_content=hit.entity.get(self.text_field),
                    metadata={"score": hit.distance}
                )
            )
        return documents

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        """Asynchronous version of get_relevant_documents."""
        return self.get_relevant_documents(query)

retriever = MilvusRetriever(
    collection=collection,
    embedding_function=embed_model.embed_query,
    text_field="text",
    vector_field="embedding",
    top_k=5,
)
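
To check the retriever on its own, I can query it directly (a minimal sketch; question is the same variable used in the chain call above):

# Query Milvus through the retriever, without the QA chain,
# and print the score and the start of each returned chunk.
docs = retriever.get_relevant_documents(question)
for doc in docs:
    print(doc.metadata["score"], doc.page_content[:100])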

I am also checking that the Mac GPU (MPS) is available:

import torch  
if torch.backends.mps.is_available():
    print("MPS is available!")

Edit 1: As recommended in the comments, I tried making the chain verbose:

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # (optional)
    # return_source_documents=False,  # (optional)
    verbose=True,
    chain_type_kwargs={
        "verbose": True,
        "prompt": prompt
        }
)

Now the output is:

> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
<MY PROMPT>

Context:
<some context from my data, seems like this is done ok.>

Question:
<MY QUESTION>

Answer:

(and still stuck here)

Comment (Jan 15 at 6:06): Can you try making the chain verbose? That should let you understand why it is hanging.
