Public dataset of all Sefaria texts, hosted on Google Cloud Storage.
This repository is a lightweight index and set of tools for accessing the Sefaria text corpus. The actual text data (~26GB, ~85K files) lives in a public GCS bucket and can be downloaded without authentication.
# List top-level formats and directories
./examples/browse_bucket.sh
# Drill into a specific category
./examples/browse_bucket.sh json/Talmud

# Download a single file
curl -O "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json"

# Using the helper script
./examples/download_category.sh Talmud # all Talmud in JSON
./examples/download_category.sh Mishnah txt # all Mishnah in TXT
# Or directly with gcloud/gsutil
gcloud storage cp -r "gs://sefaria-export/json/Talmud/" ./talmud/
gsutil -m cp -r "gs://sefaria-export/json/Talmud/" ./talmud/

# Download the entire bucket
gcloud storage cp -r "gs://sefaria-export/" ./sefaria-data/

# Load the books.json index
import requests
books = requests.get(
    "https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/books.json"
).json()
# Find all Talmud texts
for book in books["books"]:
    if "Talmud" in book["categories"]:
        print(book["title"], book.get("json_url"))

Or use the ready-made script:
# Download all English Mishnah texts as JSON
python examples/download_from_books_json.py --category Mishnah --language English
# Download a specific title
python examples/download_from_books_json.py --title "Genesis"
# List what's available without downloading
python examples/download_from_books_json.py --category Tanakh --list

The GCS bucket is organized hierarchically by format, category, title, language, and version:
gs://sefaria-export/
  json/{categories}/{title}/{language}/{versionTitle}.json
  txt/{categories}/{title}/{language}/{versionTitle}.txt
  cltk-full/{categories}/{title}/{language}/{versionTitle}.json
  cltk-flat/{categories}/{title}/{language}/{versionTitle}.json
  schemas/{title}.json
  links/links0.csv ... links12.csv
  table_of_contents.json
json/Tanakh/Torah/Genesis/English/merged.json
json/Talmud/Bavli/Seder Moed/Shabbat/Hebrew/merged.json
txt/Mishnah/Seder Zeraim/Mishnah Berakhot/English/merged.txt
schemas/Genesis.json
links/links0.csv
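
The path layout above can be turned into a public download URL programmatically. The following is a minimal sketch, not part of the repository: the function name `build_url` and its parameters are hypothetical, and it assumes that path components containing spaces (such as "Seder Moed") should be percent-encoded in the HTTPS URL. It covers only the json, txt, cltk-full, and cltk-flat trees.

```python
from urllib.parse import quote

# Public HTTPS endpoint for the bucket, as used in the curl example.
BASE_URL = "https://storage.googleapis.com/sefaria-export"


def build_url(fmt, categories, title, language, version_title="merged"):
    """Build a download URL following the documented bucket layout.

    Hypothetical helper: `fmt` is one of "json", "txt", "cltk-full",
    or "cltk-flat". Only the txt tree uses the .txt extension.
    """
    ext = "txt" if fmt == "txt" else "json"
    parts = [fmt, *categories, title, language, f"{version_title}.{ext}"]
    # Assumption: each path component is percent-encoded (space -> %20).
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


genesis_url = build_url("json", ["Tanakh", "Torah"], "Genesis", "English")
shabbat_url = build_url("json", ["Talmud", "Bavli", "Seder Moed"], "Shabbat", "Hebrew")
print(genesis_url)
print(shabbat_url)
```

For real downloads, prefer the URLs listed in books.json, since they reflect what actually exists in the bucket.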
| Format | Description |
|---|---|
| json/ | Structured JSON with text content, verse-level arrays |
| txt/ | Plain text, one file per version |
| cltk-full/ | JSON formatted for the Classical Language Toolkit |
| cltk-flat/ | Flattened CLTK format |
| schemas/ | Schema/structure metadata for each text |
| links/ | CSV files of all intertextual connections |
Each text directory includes a merged file (e.g., merged.json, merged.txt). This file combines the maximal content available from all versions, using Sefaria's merging logic. When a single complete version exists, the merged file is a copy of it. Use merged files when you want the most complete text available.
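
To give a rough sense of what "maximal content" can mean, here is a toy illustration that fills each segment from the first version that has non-empty text at that position. This is a simplified sketch under that assumption, not Sefaria's actual merging logic; the function `merge_versions` and the flat list-of-strings input are hypothetical.

```python
def merge_versions(versions):
    """Merge versions given as flat lists of segment strings.

    Toy illustration only: versions are taken in priority order, and
    each position is filled from the first version with non-empty text.
    """
    length = max((len(v) for v in versions), default=0)
    merged = []
    for i in range(length):
        chosen = ""
        for version in versions:
            if i < len(version) and version[i].strip():
                chosen = version[i]
                break
        merged.append(chosen)
    return merged


# A single complete version comes back unchanged.
complete = merge_versions([["a", "b", "c"]])
# Gaps in the first version are filled from the second.
patched = merge_versions([["a", "", "c"], ["x", "b"]])
print(complete, patched)
```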
books.json is an index of every text in the bucket. Each entry contains:
{
"title": "Genesis",
"language": "English",
"versionTitle": "merged",
"categories": ["Tanakh", "Torah"],
"json_url": "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json",
"txt_url": "https://storage.googleapis.com/sefaria-export/txt/Tanakh/Torah/Genesis/English/merged.txt",
"cltk_full_url": "...",
"cltk_flat_url": "..."
}

This file is regenerated by a GitHub Action on the 2nd of each month, the day after the GCS export. The workflow can also be triggered manually from the Actions tab.
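
Entries of this shape can be filtered on several fields at once without touching the network. The sketch below is illustrative only: the helper `filter_books` and the sample entries are hypothetical, and the sample omits the URL fields for brevity.

```python
def filter_books(entries, category=None, language=None, version_title=None):
    """Return the entries that match every filter that is not None."""
    matches = []
    for entry in entries:
        if category is not None and category not in entry.get("categories", []):
            continue
        if language is not None and entry.get("language") != language:
            continue
        if version_title is not None and entry.get("versionTitle") != version_title:
            continue
        matches.append(entry)
    return matches


# Hypothetical sample entries mirroring the documented fields.
sample = [
    {"title": "Genesis", "language": "English", "versionTitle": "merged",
     "categories": ["Tanakh", "Torah"]},
    {"title": "Genesis", "language": "Hebrew", "versionTitle": "merged",
     "categories": ["Tanakh", "Torah"]},
    {"title": "Mishnah Berakhot", "language": "English", "versionTitle": "merged",
     "categories": ["Mishnah", "Seder Zeraim"]},
]

english_tanakh = filter_books(sample, category="Tanakh", language="English")
print([entry["title"] for entry in english_tanakh])
```

With the real index, pass the list stored under the "books" key instead of `sample`.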
| Path | Description |
|---|---|
| books.json | Index of all texts with metadata and download URLs |
| scripts/generate_books_json.py | Generates books.json from the GCS bucket listing |
| examples/download_from_books_json.py | Filter and download texts using books.json |
| examples/download_category.sh | Download all texts in a category via gcloud |
| examples/browse_bucket.sh | Browse available categories and texts |
| .github/workflows/generate-books-json.yml | Monthly CI to regenerate books.json (also supports manual trigger) |
- Sefaria-Project - Sefaria's main application source code
- Sefaria API - REST API for accessing Sefaria data programmatically
See LICENSE.md.