This project aims to create an intelligent system that allows users to search through audio or video content using natural language queries (text or audio). The system identifies speakers, transcribes audio into Hindi, and performs semantic segmentation to enable question-based search.
Key components include:
- Speech transcription using OpenAI's Whisper
- Speaker diarization using PyAnnote and Resemblyzer
- Semantic search via ChromaDB embeddings
- Question answering with Retrieval-Augmented Generation (RAG)
- Web interface for transcript search and media playback
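As an illustration of how the diarization and transcription stages might be joined, speaker turns can be assigned to transcript segments by maximal time overlap. This is a sketch only: the dict shapes, field names, and overlap heuristic are assumptions, not the project's actual code, and real PyAnnote/Whisper outputs would need adapting to this form.

```python
# Sketch: align diarized speaker turns with transcript segments by time overlap.
# Segment/turn dicts with "start"/"end" in seconds are illustrative assumptions.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds (>= 0)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        scores = [
            (overlap(seg["start"], seg["end"], t["start"], t["end"]), t["speaker"])
            for t in turns
        ]
        best_score, best_speaker = max(scores, default=(0.0, "unknown"))
        # Fall back to "unknown" when no turn overlaps the segment at all.
        labeled.append({**seg, "speaker": best_speaker if best_score > 0 else "unknown"})
    return labeled

turns = [{"start": 0.0, "end": 5.0, "speaker": "teacher"},
         {"start": 5.0, "end": 9.0, "speaker": "student"}]
segments = [{"start": 1.0, "end": 4.0, "text": "..."},
            {"start": 5.5, "end": 8.0, "text": "..."}]
labeled = assign_speakers(segments, turns)
```

With the sample data above, the first segment falls inside the teacher's turn and the second inside the student's, so each is labeled accordingly.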
- Extract audio from video
- Segment speech with timestamps
- Identify speaker roles (e.g., teacher vs. student)
- Transcribe speech to Hindi with speaker & time metadata
- Break content into semantic units for embedding
- Support question-based search (Hindi/English)
- Web interface for searching & playback with labeled transcript
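Two of the steps above, breaking content into embedding-sized semantic units and producing playback timestamps, can be sketched with plain Python. The greedy same-speaker merging rule and the `max_chars` limit are assumptions for illustration, not the project's actual chunking strategy.

```python
# Sketch: merge consecutive same-speaker segments into units sized for embedding,
# and format second offsets as HH:MM:SS for the playback interface.

def fmt_ts(seconds):
    """Format a second offset as HH:MM:SS for transcript display/playback."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def chunk_segments(segments, max_chars=300):
    """Greedily merge consecutive segments by the same speaker, up to max_chars."""
    chunks = []
    for seg in segments:
        if (chunks
                and chunks[-1]["speaker"] == seg["speaker"]
                and len(chunks[-1]["text"]) + len(seg["text"]) + 1 <= max_chars):
            chunks[-1]["text"] += " " + seg["text"]
            chunks[-1]["end"] = seg["end"]       # extend the unit's time span
        else:
            chunks.append(dict(seg))             # start a new semantic unit
    return chunks
```

Each resulting chunk keeps its speaker label and start/end times, so a search hit can be played back from the right position in the media.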
- Educational lecture transcription and indexing
- Automatic subtitle generation in Hindi
- Conversational analytics for meetings and interviews
- Interactive semantic search over long-form speech
- Real-time Q&A from recorded class videos
- High computation cost for diarization (especially on CPU)
- Accurate speaker alignment with transcribed content
- Ensuring Whisper's Hindi output retains contextual correctness
- Maintaining uniform output structure across pipeline stages
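One way to address the last challenge, keeping a uniform output structure across pipeline stages, is a single shared record type that every stage reads and writes. The type and field names below are a sketch for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class Segment:
    """Uniform record passed between pipeline stages (illustrative fields)."""
    start: float               # seconds from the start of the media
    end: float                 # end of the span, in seconds
    text: str                  # Hindi transcript for this span
    speaker: str = "unknown"   # filled in by the diarization stage

seg = Segment(start=12.5, end=17.0, text="...", speaker="teacher")
record = asdict(seg)  # plain dict, ready for JSON export or embedding metadata
```

Because every stage consumes and produces the same shape, adding a stage (e.g., role labeling) only means filling in or adding fields rather than reshaping the data.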