  1. Reducing inference latency

- Use smaller or distilled models where possible (e.g., a fine-tuned small LLM instead of always defaulting to the largest one).

- Enable response streaming so users start seeing tokens immediately; this improves perceived latency even when total generation time is unchanged (see the first sketch after this list).

- Cache frequent prompts and embedding results (Redis works well).

- Batch embedding requests instead of calling the API once per text (see the second sketch after this list).

- For production, consider model quantization or hosted inference with GPU-backed providers.
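
As a rough illustration of streaming, here is a minimal sketch assuming the OpenAI Python SDK (v1+); any provider that exposes a streaming flag or SSE endpoint works the same way, and the model name is just a placeholder.

```python
# A minimal streaming sketch, assuming the OpenAI Python SDK (v1+).
# The model name is a placeholder; any chat-capable model works.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Summarize our return policy."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (e.g., the initial role-only delta)
        print(delta, end="", flush=True)
```

And a minimal sketch of the caching and batching points together, assuming redis-py and the same SDK; the key prefix, TTL, and embedding model are illustrative choices, not requirements.

```python
# Redis-backed embedding cache plus batched embedding calls (sketch).
# Assumes redis-py and the OpenAI Python SDK; names and TTL are illustrative.
import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379, db=0)
client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # assumption; use whatever model you embed with
TTL_SECONDS = 60 * 60 * 24              # recompute at most once a day


def _key(text: str) -> str:
    return "emb:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Return embeddings for `texts`, calling the API only for cache misses, in one batch."""
    vectors: dict[int, list[float]] = {}
    misses: list[tuple[int, str]] = []

    for i, text in enumerate(texts):
        cached = r.get(_key(text))
        if cached is not None:
            vectors[i] = json.loads(cached)
        else:
            misses.append((i, text))

    if misses:
        # One batched request instead of one call per text.
        response = client.embeddings.create(
            model=EMBED_MODEL, input=[t for _, t in misses]
        )
        for (i, text), item in zip(misses, response.data):
            vectors[i] = item.embedding
            r.setex(_key(text), TTL_SECONDS, json.dumps(item.embedding))

    return [vectors[i] for i in range(len(texts))]
```

The same cache pattern works for frequent prompts: hash the full prompt, store the completion, and serve repeats straight from Redis.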

  2. Efficient API & vector database integration

- Generate embeddings once and store them; never recompute unless the underlying data changes.

- Use hybrid search (vector similarity + keyword filtering) if your vector DB supports it.

- Keep chunk sizes consistent (usually 300–800 tokens) and attach metadata to each chunk for filtering (see the sketch after this list).

P…
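
A minimal sketch of consistent chunking with metadata, assuming the OpenAI Python SDK for embeddings; chunk size is approximated in words rather than exact tokens, and `vector_db_upsert` is a hypothetical placeholder for whatever client your vector DB provides (Pinecone, Weaviate, pgvector, etc.).

```python
# Chunk-and-embed sketch. Chunk size is approximated in words as a stand-in
# for the 300-800 token guideline; `vector_db_upsert` is a hypothetical
# placeholder for your vector DB client's write call.
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # assumption; match your indexing model
CHUNK_WORDS = 400                        # roughly mid-range for English prose


def chunk_text(text: str, size: int = CHUNK_WORDS) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def index_document(doc_id: str, text: str, source: str) -> None:
    chunks = chunk_text(text)
    if not chunks:
        return
    # One batched embedding call for the whole document.
    response = client.embeddings.create(model=EMBED_MODEL, input=chunks)
    records = [
        {
            "id": f"{doc_id}-{i}",
            "vector": item.embedding,
            "metadata": {"doc_id": doc_id, "source": source, "chunk": i},
        }
        for i, item in enumerate(response.data)
    ]
    vector_db_upsert(records)  # replace with your DB client's upsert/insert call


def vector_db_upsert(records: list[dict]) -> None:
    """Placeholder: swap in Pinecone/Weaviate/pgvector/etc. write logic here."""
    print(f"Would upsert {len(records)} records")
```

Storing the document ID, source, and chunk index as metadata is what lets the database filter by keyword or attribute alongside the vector search, which is the basis of hybrid search.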
