Sefaria-Export

Public dataset of all Sefaria texts, hosted on Google Cloud Storage.

This repository provides a lightweight index and a set of tools for accessing the Sefaria text corpus. The text data itself (~26 GB, ~85K files) lives in a public GCS bucket and can be downloaded without authentication.

Quick Start

Browse what's available

# List top-level formats and directories
./examples/browse_bucket.sh

# Drill into a specific category
./examples/browse_bucket.sh json/Talmud

Download a single text

curl -O "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json"

Download an entire category

# Using the helper script
./examples/download_category.sh Talmud          # all Talmud in JSON
./examples/download_category.sh Mishnah txt     # all Mishnah in TXT

# Or directly with gcloud or gsutil (the two commands are alternatives; run one)
gcloud storage cp -r "gs://sefaria-export/json/Talmud/" ./talmud/
gsutil -m cp -r "gs://sefaria-export/json/Talmud/" ./talmud/

Download everything

# Copies the entire bucket (~26 GB, ~85K files); check available disk space first
gcloud storage cp -r "gs://sefaria-export/" ./sefaria-data/

Use books.json to filter and download programmatically

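import requests

# books.json is served from the repository's master branch
BOOKS_URL = "https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/books.json"

response = requests.get(BOOKS_URL, timeout=30)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
books = response.json()

# Print the title and JSON download URL of every Talmud text
for book in books["books"]:
    if "Talmud" in book["categories"]:
        print(book["title"], book.get("json_url"))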

Or use the ready-made script:

# Download all English Mishnah texts as JSON
python examples/download_from_books_json.py --category Mishnah --language English

# Download a specific title
python examples/download_from_books_json.py --title "Genesis"

# List what's available without downloading
python examples/download_from_books_json.py --category Tanakh --list

Bucket Structure

The GCS bucket is organized hierarchically by format, category, title, language, and version:

gs://sefaria-export/
  json/{categories}/{title}/{language}/{versionTitle}.json
  txt/{categories}/{title}/{language}/{versionTitle}.txt
  cltk-full/{categories}/{title}/{language}/{versionTitle}.json
  cltk-flat/{categories}/{title}/{language}/{versionTitle}.json
  schemas/{title}.json
  links/links0.csv ... links12.csv
  table_of_contents.json
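To make the layout above concrete, here is a minimal sketch that splits a text path into its components. The names `TextPath` and `parse_text_path` are hypothetical helpers, not part of this repository, and the sketch assumes that every text path ends in `{title}/{language}/{versionTitle}.{ext}` with one or more category directories before the title.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Top-level prefixes that follow the {categories}/{title}/{language}/{versionTitle} layout
TEXT_FORMATS = {"json", "txt", "cltk-full", "cltk-flat"}


@dataclass(frozen=True)
class TextPath:
    fmt: str
    categories: tuple
    title: str
    language: str
    version_title: str


def parse_text_path(path: str) -> TextPath:
    """Split a bucket-relative text path into its components."""
    parts = PurePosixPath(path).parts
    # Need at least: format, one category, title, language, file name
    if len(parts) < 5 or parts[0] not in TEXT_FORMATS:
        raise ValueError(f"not a text path: {path!r}")
    fmt, *categories, title, language, filename = parts
    # Drop only the final extension; a version title may itself contain dots
    version_title = PurePosixPath(filename).stem
    return TextPath(fmt, tuple(categories), title, language, version_title)


print(parse_text_path("json/Talmud/Bavli/Seder Moed/Shabbat/Hebrew/merged.json"))
```

Paths under `schemas/` and `links/` do not follow this layout, so the sketch rejects them with a `ValueError` rather than guessing.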

Example paths

json/Tanakh/Torah/Genesis/English/merged.json
json/Talmud/Bavli/Seder Moed/Shabbat/Hebrew/merged.json
txt/Mishnah/Seder Zeraim/Mishnah Berakhot/English/merged.txt
schemas/Genesis.json
links/links0.csv
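Several of these paths contain spaces (for example `Seder Moed`), which must be percent-encoded when the path is used in an HTTPS URL. The following sketch builds a download URL from a bucket path with the standard library; `public_url` is a hypothetical helper name, and the base URL is the one used in the Quick Start `curl` example.

```python
from urllib.parse import quote

BASE_URL = "https://storage.googleapis.com/sefaria-export/"


def public_url(bucket_path: str) -> str:
    """Return the HTTPS URL for a bucket-relative path, percent-encoding each segment."""
    # safe="/" keeps the path separators; spaces and non-ASCII characters are encoded
    return BASE_URL + quote(bucket_path.lstrip("/"), safe="/")


print(public_url("json/Talmud/Bavli/Seder Moed/Shabbat/Hebrew/merged.json"))
# https://storage.googleapis.com/sefaria-export/json/Talmud/Bavli/Seder%20Moed/Shabbat/Hebrew/merged.json
```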

Formats

| Format | Description |
| --- | --- |
| `json/` | Structured JSON with text content, verse-level arrays |
| `txt/` | Plain text, one file per version |
| `cltk-full/` | JSON formatted for the Classical Language Toolkit |
| `cltk-flat/` | Flattened CLTK format |
| `schemas/` | Schema/structure metadata for each text |
| `links/` | CSV files of all intertextual connections |
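The `json/` files are described as holding verse-level arrays. As a rough sketch of how such nested arrays might be walked, the function below yields every non-empty string together with its 1-based position. It assumes the content is a nested list of strings stored under a top-level `"text"` key; both the key name and the nesting depth are assumptions not documented here, so inspect a downloaded file before relying on them. `iter_segments` is a hypothetical helper, and the sample strings are placeholders, not real text.

```python
def iter_segments(node, position=()):
    """Yield (position, text) for each non-empty string in nested lists.

    position is a tuple of 1-based indices, e.g. (chapter, verse).
    """
    if isinstance(node, str):
        if node:  # skip empty placeholders
            yield position, node
    elif isinstance(node, list):
        for index, child in enumerate(node, start=1):
            yield from iter_segments(child, position + (index,))


# Placeholder document; the "text" key is an assumption about the file layout
sample = {"text": [["segment a", "segment b"], ["segment c", ""]]}

for position, segment in iter_segments(sample["text"]):
    print(position, segment)
```

Because the function recurses on lists of any depth, the same code would cover texts with two levels (chapter, verse) or more, provided the assumption about the layout holds.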

Merged files

Each text directory includes a merged file (e.g., merged.json, merged.txt). It combines content from all available versions into the most complete text possible, using Sefaria's merging logic. When a single complete version exists, the merged file is a copy of it. Use merged files when you want the most complete text available.
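When working from books.json entries, one way to apply this advice is to keep a single entry per title and language, preferring the merged version. The sketch below assumes merged entries are identified by `"versionTitle": "merged"`, as in the sample entry in the books.json section; `prefer_merged` is a hypothetical helper, and the non-merged version name in the sample data is made up for illustration.

```python
def prefer_merged(entries):
    """Return one entry per (title, language), preferring the merged version.

    Falls back to the first version seen when no merged entry exists.
    """
    chosen = {}
    for entry in entries:
        key = (entry["title"], entry["language"])
        current = chosen.get(key)
        is_merged = entry["versionTitle"] == "merged"
        if current is None or (is_merged and current["versionTitle"] != "merged"):
            chosen[key] = entry
    return list(chosen.values())


# Illustrative entries only; "Example Edition" is not a real version title
entries = [
    {"title": "Genesis", "language": "English", "versionTitle": "Example Edition"},
    {"title": "Genesis", "language": "English", "versionTitle": "merged"},
    {"title": "Genesis", "language": "Hebrew", "versionTitle": "merged"},
]

for entry in prefer_merged(entries):
    print(entry["title"], entry["language"], entry["versionTitle"])
```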

books.json

books.json is an index of every text in the bucket. Each entry contains:

{
  "title": "Genesis",
  "language": "English",
  "versionTitle": "merged",
  "categories": ["Tanakh", "Torah"],
  "json_url": "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json",
  "txt_url": "https://storage.googleapis.com/sefaria-export/txt/Tanakh/Torah/Genesis/English/merged.txt",
  "cltk_full_url": "...",
  "cltk_flat_url": "..."
}
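Given entries of this shape, filtering by category and language and collecting the URLs for one format takes only a few lines. The following is a minimal sketch, not the implementation of `examples/download_from_books_json.py`; `select_urls` is a hypothetical helper, and it assumes each format's URL lives under the key shown above (`json_url`, `txt_url`, `cltk_full_url`, `cltk_flat_url`).

```python
def select_urls(entries, category=None, language=None, fmt="json"):
    """Return download URLs for entries matching the given category and language.

    fmt is one of "json", "txt", "cltk-full", "cltk-flat".
    """
    url_key = fmt.replace("-", "_") + "_url"  # e.g. "cltk-full" -> "cltk_full_url"
    urls = []
    for entry in entries:
        if category is not None and category not in entry.get("categories", []):
            continue
        if language is not None and entry.get("language") != language:
            continue
        url = entry.get(url_key)
        if url:  # skip entries that lack this format
            urls.append(url)
    return urls


# The sample entry from above, trimmed to the fields used here
entries = [
    {
        "title": "Genesis",
        "language": "English",
        "versionTitle": "merged",
        "categories": ["Tanakh", "Torah"],
        "json_url": "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json",
        "txt_url": "https://storage.googleapis.com/sefaria-export/txt/Tanakh/Torah/Genesis/English/merged.txt",
    }
]

print(select_urls(entries, category="Torah", language="English", fmt="txt"))
```

The returned URLs can then be passed to any HTTP client, such as `requests.get` or `curl`.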

This file is regenerated by a GitHub Action on the 2nd of each month, the day after the monthly GCS export. The workflow can also be triggered manually from the Actions tab.

Repository Contents

| Path | Description |
| --- | --- |
| `books.json` | Index of all texts with metadata and download URLs |
| `scripts/generate_books_json.py` | Generates books.json from the GCS bucket listing |
| `examples/download_from_books_json.py` | Filter and download texts using books.json |
| `examples/download_category.sh` | Download all texts in a category via gcloud |
| `examples/browse_bucket.sh` | Browse available categories and texts |
| `.github/workflows/generate-books-json.yml` | Monthly CI to regenerate books.json (also supports manual trigger) |

Related Projects

Sefaria-Project - Sefaria's main application source code
Sefaria API - REST API for accessing Sefaria data programmatically

License

See LICENSE.md.
