Sefaria/Sefaria-Export

Sefaria-Export

Public dataset of all Sefaria texts, hosted on Google Cloud Storage.

This repository is a lightweight index and a set of tools for accessing the Sefaria text corpus. The actual text data (~26 GB, ~85K files) lives in a public GCS bucket and can be downloaded without authentication.

Quick Start

Browse what's available

# List top-level formats and directories
./examples/browse_bucket.sh

# Drill into a specific category
./examples/browse_bucket.sh json/Talmud
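
If gcloud is not installed, a similar listing can be sketched in Python against the Cloud Storage JSON API. This is an illustration only, not necessarily how browse_bucket.sh works: it assumes the bucket permits anonymous object listing (the README only states that downloads need no authentication), and the function names `listing_url`, `parse_listing`, and `browse` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Cloud Storage JSON API endpoint for listing objects in the bucket.
API = "https://storage.googleapis.com/storage/v1/b/sefaria-export/o"


def listing_url(prefix="", page_token=None):
    """Build an objects.list request URL for one 'directory' level."""
    # delimiter="/" makes the API group deeper paths into "prefixes".
    params = {"prefix": prefix, "delimiter": "/"}
    if page_token:
        params["pageToken"] = page_token
    return f"{API}?{urlencode(params)}"


def parse_listing(payload):
    """Split a list response into (subdirectories, files, next_page_token)."""
    dirs = payload.get("prefixes", [])
    files = [item["name"] for item in payload.get("items", [])]
    return dirs, files, payload.get("nextPageToken")


def browse(prefix=""):
    """Yield every subdirectory and file name directly under `prefix`."""
    token = None
    while True:
        # Assumption: anonymous listing is allowed on this bucket.
        with urlopen(listing_url(prefix, token), timeout=30) as resp:
            dirs, files, token = parse_listing(json.load(resp))
        yield from dirs
        yield from files
        if not token:
            break
```

Calling `for name in browse("json/Talmud/"): print(name)` would then print one level of the hierarchy, following pagination until no `nextPageToken` is returned.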

Download a single text

curl -O "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json"

Download an entire category

# Using the helper script
./examples/download_category.sh Talmud          # all Talmud in JSON
./examples/download_category.sh Mishnah txt     # all Mishnah in TXT

# Or directly with gcloud or gsutil (equivalent commands; run one of them)
gcloud storage cp -r "gs://sefaria-export/json/Talmud/" ./talmud/
gsutil -m cp -r "gs://sefaria-export/json/Talmud/" ./talmud/

Download everything

gcloud storage cp -r "gs://sefaria-export/" ./sefaria-data/

Use books.json to filter and download programmatically

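import requests

# Fetch the index of all texts from the repository
response = requests.get(
    "https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/books.json",
    timeout=30,
)
response.raise_for_status()  # fail early on HTTP errors
books = response.json()

# Find all Talmud texts
for book in books["books"]:
    if "Talmud" in book["categories"]:
        print(book["title"], book.get("json_url"))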

Or use the ready-made script:

# Download all English Mishnah texts as JSON
python examples/download_from_books_json.py --category Mishnah --language English

# Download a specific title
python examples/download_from_books_json.py --title "Genesis"

# List what's available without downloading
python examples/download_from_books_json.py --category Tanakh --list

Bucket Structure

The GCS bucket is organized hierarchically by format, category, title, language, and version:

gs://sefaria-export/
  json/{categories}/{title}/{language}/{versionTitle}.json
  txt/{categories}/{title}/{language}/{versionTitle}.txt
  cltk-full/{categories}/{title}/{language}/{versionTitle}.json
  cltk-flat/{categories}/{title}/{language}/{versionTitle}.json
  schemas/{title}.json
  links/links0.csv ... links12.csv
  table_of_contents.json

Example paths

json/Tanakh/Torah/Genesis/English/merged.json
json/Talmud/Bavli/Seder Moed/Shabbat/Hebrew/merged.json
txt/Mishnah/Seder Zeraim/Mishnah Berakhot/English/merged.txt
schemas/Genesis.json
links/links0.csv
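
The path layout above can be turned into public download URLs programmatically. The following is a minimal sketch, assuming that object names map directly onto `https://storage.googleapis.com/sefaria-export/...` (as in the curl example) and that spaces in category or title names need percent-encoding. The helper name `text_url` and the `EXTENSIONS` table are hypothetical, and how the export handles unusual characters in version titles is not covered here.

```python
from urllib.parse import quote

BASE_URL = "https://storage.googleapis.com/sefaria-export"

# File extension used by each text format directory in the layout above.
EXTENSIONS = {"json": "json", "txt": "txt", "cltk-full": "json", "cltk-flat": "json"}


def text_url(fmt, categories, title, language, version_title="merged"):
    """Build the public HTTPS URL for one text file in the bucket."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unknown format: {fmt!r}")
    filename = f"{version_title}.{EXTENSIONS[fmt]}"
    parts = [fmt, *categories, title, language, filename]
    # Percent-encode each segment separately so spaces become %20
    # while the "/" separators between segments are preserved.
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)


print(text_url("json", ["Tanakh", "Torah"], "Genesis", "English"))
print(text_url("txt", ["Mishnah", "Seder Zeraim"], "Mishnah Berakhot", "English"))
```

The first call reproduces the Genesis URL from the Quick Start; the second shows the encoded form of a path that contains spaces.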

Formats

| Format | Description |
| --- | --- |
| json/ | Structured JSON with text content, verse-level arrays |
| txt/ | Plain text, one file per version |
| cltk-full/ | JSON formatted for the Classical Language Toolkit |
| cltk-flat/ | Flattened CLTK format |
| schemas/ | Schema/structure metadata for each text |
| links/ | CSV files of all intertextual connections |

Merged files

Each text directory includes a merged file (e.g., merged.json, merged.txt). This file combines the maximal content available from all versions, using Sefaria's merging logic. When a single complete version exists, the merged file is a copy of it. Use merged files when you want the most complete text available.
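
One way to apply this in code is to ask for a specific version and fall back to the merged file when that version is not present. The sketch below assumes index entries carry `title`, `language`, and `versionTitle` fields as in the books.json example, and that non-merged versions are listed alongside the merged one. The function name `pick_version` and the version title "Example Edition 1900" are made up for illustration.

```python
def pick_version(entries, title, language, preferred=None):
    """Return the entry for (title, language), preferring a named version.

    Falls back to the merged entry when `preferred` is missing or not given.
    Returns None if no matching entry exists at all.
    """
    candidates = [
        e for e in entries
        if e["title"] == title and e["language"] == language
    ]
    by_version = {e["versionTitle"]: e for e in candidates}
    if preferred is not None and preferred in by_version:
        return by_version[preferred]
    return by_version.get("merged")


# Sample index entries; "Example Edition 1900" is a made-up version title.
entries = [
    {"title": "Genesis", "language": "English", "versionTitle": "merged"},
    {"title": "Genesis", "language": "English", "versionTitle": "Example Edition 1900"},
    {"title": "Genesis", "language": "Hebrew", "versionTitle": "merged"},
]

chosen = pick_version(entries, "Genesis", "English", preferred="Missing Edition")
print(chosen["versionTitle"])  # falls back to the merged entry
```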

books.json

books.json is an index of every text in the bucket. Each entry in its top-level "books" list (the key used in the Quick Start example) contains:

{
  "title": "Genesis",
  "language": "English",
  "versionTitle": "merged",
  "categories": ["Tanakh", "Torah"],
  "json_url": "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json",
  "txt_url": "https://storage.googleapis.com/sefaria-export/txt/Tanakh/Torah/Genesis/English/merged.txt",
  "cltk_full_url": "...",
  "cltk_flat_url": "..."
}

A GitHub Action regenerates this file monthly, on the 2nd of each month, the day after the GCS export. The workflow can also be triggered manually from the Actions tab.
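
When downloading many entries from the index, it can help to save each file under a local path that mirrors the bucket hierarchy. The following is a rough sketch, assuming URLs of the form shown in the entry above. The helper name `local_path_for` and the default root directory are hypothetical.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse

BUCKET_PREFIX = "/sefaria-export/"


def local_path_for(url, root="sefaria-data"):
    """Map a storage.googleapis.com URL to a local path mirroring the bucket."""
    # Decode %20 and similar escapes back into readable directory names.
    path = unquote(urlparse(url).path)
    if not path.startswith(BUCKET_PREFIX):
        raise ValueError(f"not a sefaria-export URL: {url}")
    segments = path[len(BUCKET_PREFIX):].split("/")
    # Refuse parent-directory segments so files stay inside `root`.
    if ".." in segments:
        raise ValueError(f"unsafe path in URL: {url}")
    return Path(root).joinpath(*segments)


url = "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json"
print(local_path_for(url).as_posix())
```

Creating the parent directories with `path.parent.mkdir(parents=True, exist_ok=True)` before writing keeps the local tree aligned with the bucket layout.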

Repository Contents

| Path | Description |
| --- | --- |
| books.json | Index of all texts with metadata and download URLs |
| scripts/generate_books_json.py | Generates books.json from the GCS bucket listing |
| examples/download_from_books_json.py | Filter and download texts using books.json |
| examples/download_category.sh | Download all texts in a category via gcloud |
| examples/browse_bucket.sh | Browse available categories and texts |
| .github/workflows/generate-books-json.yml | Monthly CI to regenerate books.json (also supports manual trigger) |

Related Projects

License

See LICENSE.md.

About

Structured Jewish texts and metadata exported from Sefaria's database.
