This project extracts text from PDF textbooks, generates hierarchical embeddings using Sentence-BERT, and stores them in a PostgreSQL database for retrieval and analysis.
Ensure you have the following installed:
- Python 3.8 or later
- PostgreSQL with the `vector` extension
- Required Python packages (see below)
```bash
git clone --branch knowledgeBase https://github.com/zarouz/Hr_interview_preperationAgent.git
pip install -r requirements.txt
```

Dependencies include:

- `psycopg2`
- `sentence-transformers`
- `nltk`
- `pdfplumber`
- `numpy`
- `python-dotenv`
Ensure PostgreSQL is running and create the necessary schema:
```bash
psql -U <your-username> -d KnowledgeBase -f schema.sql
```

Alternatively, run `setup_database()` in the script to create the tables automatically.
To parse a textbook directory and store embeddings, execute:
```bash
python main.py /path/to/textbook/folder
```

To ensure proper processing, store textbooks in the following format:
- Place all textbooks inside the `textbooks/` folder.
- Each textbook should have its own subfolder named after the book title.
- Chapters should be stored as separate PDF files with a consistent naming convention:

```
textbooks/
├── Book_Title/
│   ├── Introduction_1.pdf
│   ├── Basics_2.pdf
│   ├── Advanced_Topics_3.pdf
│   └── ...
```

- Use a consistent naming format (`name_number.pdf`).
- Avoid spaces in filenames; use underscores (`_`) instead.
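As a minimal sketch, the `name_number.pdf` convention above can be discovered and ordered with a few lines of standard-library Python. The function below is illustrative only and not part of the project's actual code:

```python
import re
from pathlib import Path

# Matches the name_number.pdf convention, e.g. Advanced_Topics_3.pdf
CHAPTER_PATTERN = re.compile(r"^(?P<name>.+)_(?P<number>\d+)\.pdf$")

def ordered_chapters(book_dir):
    """Return (chapter_number, chapter_name, path) tuples sorted by chapter number."""
    chapters = []
    for pdf in Path(book_dir).glob("*.pdf"):
        match = CHAPTER_PATTERN.match(pdf.name)
        if match is None:
            continue  # skip files that do not follow the naming convention
        chapters.append((int(match.group("number")), match.group("name"), pdf))
    return sorted(chapters)
```

Because the greedy `.+` stops at the last underscore, multi-word names such as `Advanced_Topics_3.pdf` parse correctly.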
The `bookmaker.py` utility helps prepare textbooks by splitting PDFs into chapters based on user input. It uses PyPDF2 to divide PDFs into meaningful sections.
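The exact interface of `bookmaker.py` is not shown here, but the core of such a splitter is turning user-supplied chapter start pages into page ranges; each range would then be written out with PyPDF2's `PdfReader`/`PdfWriter`. The helper below is a hypothetical illustration, not the actual `bookmaker.py` code:

```python
def chapter_page_ranges(start_pages, total_pages):
    """Convert 1-based chapter start pages into (start, end) ranges, end exclusive.

    For example, start_pages=[1, 12, 30] for a 45-page book yields
    [(1, 12), (12, 30), (30, 46)]: each chapter runs up to the next
    chapter's start, and the last chapter runs to the end of the book.
    Each range could then be copied to its own PDF with PyPDF2.
    """
    ranges = []
    for i, start in enumerate(start_pages):
        end = start_pages[i + 1] if i + 1 < len(start_pages) else total_pages + 1
        ranges.append((start, end))
    return ranges
```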
The database containing embeddings of the OS book is available for download at the following Google Drive link: Download Database
To restore the database from the provided SQL dump file, use the following command:
```bash
psql -U <your-username> -d <your-database-name> -f knowledgeBase.sql
```

Example:

```bash
psql -U karthikyadav -d KnowledgeBase -f knowledgeBase.sql
```

If you wish to upload the database after making changes, export it using:

```bash
pg_dump -U <your-username> -d <your-database-name> -f knowledgeBase.sql
```

Example:

```bash
pg_dump -U karthikyadav -d KnowledgeBase -f knowledgeBase.sql
```

This will create a backup file that can be shared or uploaded for others to use.
- Parsing PDFs: The script expects textbooks in the structured folder layout described above.
- Storing Embeddings: The Sentence-BERT model generates embeddings for hierarchical text units.
- Retrieving Data: Query the database using vector similarity search on `chunks.embedding`.
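As a sketch of the retrieval step, a nearest-neighbour query using pgvector's cosine-distance operator `<=>` might look like the following. Only `chunks.embedding` is confirmed by this README; the other column names and the distance operator choice are assumptions:

```sql
-- Hypothetical query: "id" and "text" column names are assumptions.
SELECT id, text
FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector  -- query embedding from Sentence-BERT
LIMIT 5;
```

pgvector also provides `<->` (Euclidean distance) and `<#>` (negative inner product) if a different similarity measure suits the embeddings better.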
- If `vector` extension errors occur, install it in PostgreSQL:

```bash
psql -U <your-username> -d KnowledgeBase -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

- Ensure your `.env` file is correctly set up.
- Verify that the `nltk_data` path matches your system setup.
For issues or improvements, submit a pull request or open an issue in the repository.