Skip to content

sagar0/RDBMS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RDBMS

A modular, minimal relational database management system written in modern C++. RDBMS implements a minimal but end-to-end SQL pipeline—from lexical analysis through semantic binding, volcano-style query execution, and pluggable storage—designed as baseline infrastructure for researching and extending database internals state-of-the art in the future.

The storage engine is deliberately decoupled from the SQL front end, enabling future engine swaps (for example, append-only logs, B-Trees, or LSM-Trees) without rewriting the parser or execution layer.

Almost all existing RDBMS out there are too heavy for most of the projects and use-case that independent developers and small teams work on, on a daily basis. They are heavily bloated. So I implemented my own and it wasn't hard as I knew most of the functionality already. It might miss many important things, and get some things wrong. I also woulnd't expect this to be run in production in its current state or ever.


Table of Contents


Design Goals

Goal Implementation
Modularity Strict boundaries between parser, catalog, binder, executor, and storage
Pluggable storage IStorageEngine / IScanner abstract raw byte records; SQL layers never touch disk layout directly
Modern C++ C++20, RAII, smart pointers, std::optional, std::variant, std::string_view
Cross-platform CMake build verified on macOS (Apple Clang) and Linux (GCC 11+ / Clang)
Test-driven Google Test cases covering every major subsystem
Readable codebase Headers separated from implementations; interfaces isolated under include/rdbms/

Architecture

RDBMS follows a classic layered, monolithic architecture. Each layer depends only on the abstractions below it.

flowchart TB
    subgraph client [Client Layer]
        REPL["rdbms_repl"]
    end

    subgraph frontend [SQL Front End]
        Lexer["Lexer"]
        Parser["Recursive Descent Parser"]
        AST["AST"]
        Binder["Semantic Binder"]
    end

    subgraph metadata [Metadata]
        Catalog["CatalogManager"]
    end

    subgraph execution [Execution Engine — Volcano Model]
        Planner["Plan Builder"]
        Scan["SequentialScanNode"]
        Filter["FilterNode"]
    end

    subgraph storage [Storage Layer]
        Codec["Tuple Codec"]
        Engine["InMemoryStorageEngine"]
    end

    REPL --> Lexer
    Lexer --> Parser
    Parser --> AST
    AST --> Binder
    Binder --> Catalog
    Binder --> Planner
    Planner --> Filter
    Filter --> Scan
    Scan --> Engine
    Scan --> Codec
    Engine --> Codec
Loading

Request lifecycle (SELECT)

SQL string
  → Lexer (tokens)
  → Parser (AST)
  → Binder (validates against CatalogManager, produces BoundStatement)
  → Plan tree: FilterNode → SequentialScanNode
  → IStorageEngine::GetScanner (raw bytes)
  → Tuple codec (deserialize)
  → Predicate evaluation (WHERE)
  → ASCII table formatting (stdout)

Request lifecycle (INSERT)

SQL string → Parser → Binder → Tuple codec (serialize) → IStorageEngine::InsertRecord

DDL (CREATE TABLE) updates both the catalog (schema metadata) and the storage engine (physical table slot).


Supported SQL (v1)

v1 intentionally supports a minimal ANSI subset. Keywords are case-insensitive. Trailing semicolons are optional.

CREATE TABLE

CREATE TABLE users (id INT, name VARCHAR);
Rule Detail
Types INT, VARCHAR only
Constraints Table must not already exist; column names must be unique
Output OK

INSERT

INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');
Rule Detail
Literals Integers (42) or single-quoted strings ('Alice')
Arity Value count must match column count
Types Values must match column types (INT ↔ integer, VARCHAR ↔ string)
Output OK

SELECT

SELECT id, name FROM users;
SELECT * FROM users;
SELECT id, name FROM users WHERE id = 1;
SELECT * FROM users WHERE name = 'Alice';
Rule Detail
Projection Explicit column list or *
WHERE Optional; supports column = literal equality only
Output ASCII table + row count, or (0 rows)

REPL controls

Command Action
exit / quit Exit the REPL

Not supported in v 0.1

UPDATE, DELETE, DROP, joins, aggregates, ORDER BY, LIMIT, transactions, indexes, network protocol, and multi-statement scripts.


Quick Start

Prerequisites

Requirement Version
CMake ≥ 3.20
C++ compiler C++20 (Apple Clang, GCC 11+, or Clang)
Git Required for FetchContent (Google Test)

Build

git clone <repository-url>
cd RDBMS

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

Run the REPL

./build/rdbms_repl

Example session:

Toy RDBMS v1. Type SQL, or 'exit' to quit.
rdbms> CREATE TABLE users (id INT, name VARCHAR);
OK
rdbms> INSERT INTO users VALUES (1, 'Alice');
OK
rdbms> SELECT * FROM users;
+----+-------+
| id | name  |
+----+-------+
| 1  | Alice |
+----+-------+
(1 row)
rdbms> exit

Run tests

ctest --test-dir build --output-on-failure

All 51 tests should pass.


Project Layout

RDBMS/
├── CMakeLists.txt              # Root build; static lib + rdbms_repl + tests
├── cmake/
│   └── CompilerWarnings.cmake  # -Wall -Wextra -Wpedantic (cross-platform)
├── include/rdbms/              # Public headers
│   ├── ast/                    # Abstract syntax tree + visitor
│   ├── binder/                 # Semantic binder + bound AST
│   ├── catalog/                # Schema metadata
│   ├── common/                 # Shared utilities (Expected<T,E>)
│   ├── execution/              # Volcano plan nodes + predicate eval
│   ├── parser/                 # Lexer + recursive descent parser
│   ├── repl/                   # QueryExecutor + result formatting
│   ├── storage/                # Storage interface + in-memory engine
│   └── types/                  # Tuple, Value, RecordID, DataType
├── src/                        # Implementations (mirrors include/)
├── test/                       # Google Test suites
└── src/main.cpp                # REPL entry point

Module Reference

Module Key Types Responsibility
Types Tuple, Value, RecordID, DataType Runtime value representation
Storage (interface) IStorageEngine, IScanner Physical record I/O boundary (table name + raw bytes only)
Storage (v1 impl) InMemoryStorageEngine In-memory vector<uint8_t> record store
Tuple codec encode_tuple, decode_tuple Schema-aware serialization between Tuple and byte payloads
Catalog CatalogManager, TableSchema In-memory schema registry (tables, columns, byte offsets)
AST Statement, Expr, AstVisitor Parsed SQL structure (std::variant-based)
Binder Binder, BoundStatement Resolves identifiers, validates types, produces bound AST
Execution PlanNode, SequentialScanNode, FilterNode Iterator-model (Volcano) query execution
Parser Lexer, Parser Hand-rolled tokenizer + recursive descent grammar
REPL QueryExecutor, format_table End-to-end orchestration and console output

Storage engine contract

The storage layer is intentionally ignorant of SQL. It operates on:

  • Table names (std::string)
  • Record IDs (RecordID{page_id, slot_id})
  • Opaque byte payloads (std::vector<uint8_t>)
class IStorageEngine {
public:
    virtual void Init() = 0;
    virtual void Shutdown() = 0;
    virtual void CreateTable(const std::string& table_name) = 0;
    virtual RecordID InsertRecord(const std::string& table_name,
                                const std::vector<uint8_t>& payload) = 0;
    virtual bool UpdateRecord(const std::string& table_name, RecordID rid,
                              const std::vector<uint8_t>& payload) = 0;
    virtual std::unique_ptr<IScanner> GetScanner(const std::string& table_name) = 0;
};

InMemoryStorageEngine is the v1 concrete implementation. UpdateRecord is implemented at the storage layer but not yet exposed via SQL.

Execution model

Every plan node implements the Volcano iterator interface:

class PlanNode {
public:
    virtual void Init() = 0;
    virtual std::optional<Tuple> Next() = 0;  // std::nullopt = EOF
};

v1 builds FilterNode → SequentialScanNode trees for SELECT statements with an optional WHERE clause.


Physical Data Model

Column byte offsets are assigned at table creation time and stored in the catalog.

Type On-disk / in-memory layout
INT 8 bytes (int64_t) at column offset
VARCHAR 4-byte length prefix (uint32_t) + variable string data

Example schema users(id INT, name VARCHAR):

Column Offset Size (fixed prefix)
id 0 8 bytes
name 8 4-byte length + N bytes of string data

Build & Test

CMake targets

Target Description
rdbms Static library containing all engine components
rdbms_repl Interactive REPL executable
rdbms_tests Google Test runner (51 tests)

Compiler flags

Non-MSVC builds enable -Wall -Wextra -Wpedantic via cmake/CompilerWarnings.cmake.

Test coverage by area

Test file Area
build_pipeline_test.cpp Build / C++20 smoke
types_test.cpp Core types + storage interface
catalog_manager_test.cpp Schema CRUD + error paths
ast_test.cpp AST construction + visitor
binder_test.cpp Semantic validation
plan_node_test.cpp PlanNode iterator contract
execution_operators_test.cpp Scan + filter pipeline (mock storage)
in_memory_storage_engine_test.cpp Storage engine + codec
parser_test.cpp Lexer + parser (valid + invalid SQL)
repl_test.cpp End-to-end QueryExecutor

Debug build

cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build

compile_commands.json is generated for clangd / IDE integration.


Roadmap

Planned extensions that align with the existing architecture:

Area Enhancement
Storage Persistent append-only log, B-Tree pages, or LSM-Tree backend behind IStorageEngine
SQL UPDATE / DELETE, richer WHERE (comparisons, AND / OR)
Execution NestedLoopJoinNode, projection node, simple cost-based planner
Catalog JSON / binary persistence across restarts
Client Wire protocol (PostgreSQL-compatible or custom RPC)
Durability Write-ahead log (WAL), crash recovery

Contributions in these areas should preserve the storage boundary and avoid leaking SQL concepts into IStorageEngine.


Contributing

Workflow

  1. Fork the repository and create a feature branch from main.
  2. Implement changes with corresponding Google Tests.
  3. Ensure the build is clean on your platform:
 cmake -B build -DCMAKE_BUILD_TYPE=Debug
 cmake --build build
 ctest --test-dir build --output-on-failure
  1. Open a pull request with a clear description of the architectural impact.

Code standards

  • C++20 with RAII; no owning raw pointers.
  • Prefer std::optional, std::variant, and the project-local Expected<T, E> for error propagation.
  • Keep headers minimal; use forward declarations where possible.
  • Match existing module boundaries—do not couple the parser or executor to a specific storage implementation.
  • One logical concern per commit; write descriptive commit messages (no step-number prefixes required).

Adding a new storage engine

  1. Subclass IStorageEngine and IScanner.
  2. Operate exclusively on std::vector<uint8_t> payloads.
  3. Use tuple_codec for serialization—do not duplicate layout logic.
  4. Add a dedicated test file under test/.

License

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors