Database Internals

Why am I reading this book?

To dive deeper into database systems, get hands-on experience with systems programming, and write an end-to-end database system.

Intro

Database systems usually consist of ->

  • Transport Layer
    • Accepts requests from clients.
  • Query Processor
    • Figures out the most efficient way to execute the requested query.
  • Execution Engine
    • Takes the plan from the query optimizer, carries it out, and aggregates the results of remote and local execution.
  • Storage Engine
    • Stores, retrieves, and manages data in memory and on disk; designed to capture a persistent, long-term memory of each node.
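
The layering above can be sketched as a small pipeline. This is only an illustration under heavy assumptions: the class names, the method names, and the two-command query language (`GET key`, `PUT key value`) are all made up for this example, the "plan" is a plain tuple, and a dict stands in for real storage.

```python
# Hypothetical sketch: every class and method name here is illustrative.

class StorageEngine:
    """Stores and retrieves data; a dict stands in for memory/disk structures."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class QueryProcessor:
    """Parses a request and produces a plan (here, a plain tuple)."""

    def plan(self, query):
        parts = query.split()
        if len(parts) == 2 and parts[0].upper() == "GET":
            return ("get", parts[1])
        if len(parts) == 3 and parts[0].upper() == "PUT":
            return ("put", parts[1], parts[2])
        raise ValueError(f"unsupported query: {query!r}")


class ExecutionEngine:
    """Carries out a plan against the storage engine and returns the result."""

    def __init__(self, storage):
        self._storage = storage

    def execute(self, plan):
        if plan[0] == "get":
            return self._storage.get(plan[1])
        self._storage.put(plan[1], plan[2])
        return "OK"


class Transport:
    """Accepts client requests; a real one would listen on a socket."""

    def __init__(self, processor, engine):
        self._processor = processor
        self._engine = engine

    def handle(self, request):
        return self._engine.execute(self._processor.plan(request))


db = Transport(QueryProcessor(), ExecutionEngine(StorageEngine()))
print(db.handle("PUT name alice"))  # OK
print(db.handle("GET name"))        # alice
```

Each layer only talks to the one below it, so any of them could be swapped out without touching the others.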

These components are usually built independently and in isolation, though some DBMS variants tightly couple a few or all of the modules for performance reasons (for example, saving allocations or thread creation on the hot path). Usually, though, a different combination of them can yield a new variant suited to a specific type of workload, use case, or data storage.

Storage Engines

Storage engines actually store the data in memory or on disk and provide a simple API to access and manipulate it. A DBMS that supports complex queries is essentially a fancy parser application built on top of a storage engine, and it can provide more capabilities such as a schema, a query language, and transactions.
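
As a rough illustration of that split, the sketch below puts a schema-checking layer on top of a byte-oriented key-value engine. `KVStorageEngine` and `Table` are hypothetical names invented for this example, JSON is an arbitrary choice of row encoding, and a dict stands in for whatever the engine would really keep on disk.

```python
import json


class KVStorageEngine:
    """Byte-oriented key-value store; a dict stands in for on-disk structures."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)


class Table:
    """Adds a schema on top of the engine: column names and types are checked
    before a row is serialized and handed to the storage engine."""

    def __init__(self, engine, name, schema):
        self._engine = engine
        self._name = name
        self._schema = schema  # e.g. {"id": int, "name": str}

    def _key(self, row_id):
        return f"{self._name}/{row_id}".encode()

    def insert(self, row_id, row):
        if set(row) != set(self._schema):
            raise ValueError("row columns do not match schema")
        for column, expected in self._schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"{column} must be {expected.__name__}")
        self._engine.put(self._key(row_id), json.dumps(row).encode())

    def select(self, row_id):
        raw = self._engine.get(self._key(row_id))
        return None if raw is None else json.loads(raw)


users = Table(KVStorageEngine(), "users", {"id": int, "name": str})
users.insert(1, {"id": 1, "name": "alice"})
print(users.select(1))  # {'id': 1, 'name': 'alice'}
```

The engine never learns what a row or a column is; it only sees opaque keys and values, which is what lets different higher layers share one engine.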

Databases are an important piece of software for most applications, and hence the choice of database can have long-term consequences. Each database was designed with one thing or one set of things in mind. The best time to find out that a certain database will not be a good fit, for performance, consistency, or other reasons, is at the beginning. Since not everyone can afford to rigorously test multiple databases to find the best one for their requirements, there are standard benchmarking tools such as YCSB that provide high-level workloads which can, in most cases, be adapted to the exact requirements.

These benchmarks should be used with caution, since it's easy to draw a premature or invalid conclusion without looking at the data extensively and without listing the requirements first.

But performance is often not the only criterion for choosing a database system for the long run, since programming is as much a human construct as it is a machine's. If software is programming integrated over time, then familiarity, ease of debugging, the pain of upgrading, and the community around the database are often equally important factors, if not more so.

Past smooth upgrades do not guarantee that future ones will be as smooth, but complicated upgrades in the past might be a sign that future ones won’t be easy, either.
The same holds for any third-party software decision.

Difference from a programming problem

Any sufficiently complex piece of software is very different from a standalone programming problem. With a database, you also have to ->

  • Design the physical layout.
  • Organize pointers.
  • Decide on a serialization format.
  • Choose a garbage collection scheme for dead tuples/data.
  • Make it work in concurrent environments.
  • Provide data durability guarantees.
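
To make the first few items concrete, here is a minimal sketch of a physical layout and serialization format, assuming fixed-size records packed into a single 4096-byte page. The page size, the record fields (`id`, `name`, a deleted flag), and the function names are all assumptions made for this example. The slot number plays the role of a pointer within the page, and the deleted flag hints at how dead tuples could be marked for later garbage collection; concurrency and durability are not addressed here.

```python
import struct

PAGE_SIZE = 4096                  # assumed page size
HEADER = struct.Struct("<H")      # page header: record count (uint16)
RECORD = struct.Struct("<Q16s?")  # id (uint64), name (16 bytes), deleted flag


def new_page():
    """Return an empty, zero-filled page."""
    return bytearray(PAGE_SIZE)


def append_record(page, record_id, name, deleted=False):
    """Serialize one record into the next free slot and return its slot number."""
    encoded = name.encode("utf-8")
    if len(encoded) > 16:
        raise ValueError("name does not fit in 16 bytes")
    (count,) = HEADER.unpack_from(page, 0)
    offset = HEADER.size + count * RECORD.size
    if offset + RECORD.size > PAGE_SIZE:
        raise ValueError("page is full")
    RECORD.pack_into(page, offset, record_id, encoded, deleted)
    HEADER.pack_into(page, 0, count + 1)
    return count


def read_record(page, slot):
    """Locate a record by slot number and deserialize it."""
    (count,) = HEADER.unpack_from(page, 0)
    if not 0 <= slot < count:
        raise IndexError("no such slot")
    offset = HEADER.size + slot * RECORD.size
    record_id, raw_name, deleted = RECORD.unpack_from(page, offset)
    # Names shorter than 16 bytes are zero-padded by struct; strip the padding.
    return record_id, raw_name.rstrip(b"\x00").decode("utf-8"), deleted


page = new_page()
slot = append_record(page, 42, "alice")
print(read_record(page, slot))  # (42, 'alice', False)
```

Fixed-size records keep the offset arithmetic trivial; variable-size records would need an extra level of indirection, such as a table of offsets inside the page.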

-> Chapter 1