Databases

What Are Databases?

Databases are organized collections of structured or semi-structured data managed by software that provides mechanisms for storage, retrieval, update, and administration. A database management system (DBMS) enforces data integrity, controls concurrent access by multiple users, and supports recovery from hardware or software failures. The field draws from formal logic, set theory, file system design, and distributed systems theory. Databases are foundational infrastructure for information systems in virtually every industry. The models, languages, and architectures that govern their design have evolved substantially in the more than five decades since Edgar Codd's 1970 paper introduced the relational model.

Relational Databases and SQL

The relational model organizes data into tables (relations) composed of rows (tuples) and columns (attributes), with relationships between tables expressed through shared key values. Edgar Codd's model was given concrete form in IBM's System R project in the 1970s, which produced the Structured Query Language (SQL), still the dominant database query language. SQL is standardized as ISO/IEC 9075, most recently revised in 2023, and its core constructs (SELECT, INSERT, UPDATE, DELETE, JOIN) are implemented by commercial systems including Oracle Database, Microsoft SQL Server, and IBM Db2, as well as open-source systems including PostgreSQL and MySQL. Relational databases enforce constraints such as primary key uniqueness, referential integrity between foreign keys and the keys they reference, and data type restrictions, so that stored data always satisfies the declared invariants. The ACID properties (Atomicity, Consistency, Isolation, Durability) govern transaction behavior: a transaction takes effect entirely or not at all, concurrent transactions do not observe each other's partial work, and committed changes survive system failures.
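The constraint and transaction behavior described above can be illustrated with a minimal sketch using Python's standard sqlite3 module and an in-memory SQLite database. The table names, column names, and the run_demo function are assumptions made for the example, not part of any standard. The second insert references a customer that does not exist, so the foreign key constraint rejects it and the whole transaction, including the valid first insert, is rolled back.

```python
import sqlite3


def run_demo():
    """Show a foreign-key violation rolling back a whole transaction."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign-key enforcement off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " total_cents INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
    conn.commit()

    rolled_back = False
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute("INSERT INTO orders (customer_id, total_cents) VALUES (1, 500)")
            # Customer 99 does not exist, so this violates referential integrity.
            conn.execute("INSERT INTO orders (customer_id, total_cents) VALUES (99, 700)")
    except sqlite3.IntegrityError:
        rolled_back = True

    # Atomicity: the valid first insert was undone together with the failing one.
    orders_after_failure = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    with conn:
        conn.execute("INSERT INTO orders (customer_id, total_cents) VALUES (1, 500)")

    # JOIN follows the shared key value from orders back to customers.
    joined = conn.execute(
        "SELECT c.name, o.total_cents FROM orders AS o"
        " JOIN customers AS c ON c.id = o.customer_id"
    ).fetchall()
    conn.close()
    return rolled_back, orders_after_failure, joined


if __name__ == "__main__":
    print(run_demo())
```

Other relational systems express the same ideas with slightly different syntax and defaults; for instance, most enforce foreign keys without a per-connection setting.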

Distributed Databases

A distributed database stores data across multiple nodes, which may be geographically separated, while presenting a unified interface to applications. Distribution improves availability and throughput but introduces challenges of consistency, partition tolerance, and latency that do not exist in single-node systems. The CAP theorem, conjectured by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002, establishes that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition tolerance. In practice, network partitions cannot be ruled out, so the theorem forces a choice between consistency and availability while a partition lasts. Systems such as Google Spanner combine tightly synchronized clocks with bounded uncertainty (TrueTime) and two-phase commit to achieve external consistency across data centers at the cost of higher latency. The ACM SIGMOD community has published foundational distributed database research and annually recognizes influential contributions through the SIGMOD Edgar F. Codd Innovations Award.
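Two-phase commit, mentioned above, can be sketched in a few lines. The following is a simplified, single-process illustration and not how Spanner or any other production system implements it: the Participant class, its can_commit flag (a stand-in for a local failure), and the two_phase_commit function are hypothetical names, and the sketch omits networking, timeouts, durable logging, and coordinator failure.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    can_commit: bool = True  # hypothetical switch simulating a local failure
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> str:
    # Phase 1 (voting): ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


if __name__ == "__main__":
    nodes = [Participant("us-east"), Participant("eu-west", can_commit=False)]
    print(two_phase_commit(nodes), [p.state for p in nodes])
```

A single "no" vote aborts the transaction on every node, which is what keeps the nodes mutually consistent; the price is that every participant must wait for the slowest vote.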

NoSQL Databases

NoSQL databases depart from the relational model to optimize for specific access patterns, scalability characteristics, or data structures that tabular schemas represent poorly. The major categories are key-value stores (Redis, Amazon DynamoDB), document stores (MongoDB, Couchbase), column-family stores (Apache Cassandra, HBase), and graph databases (Neo4j, Amazon Neptune). Key-value stores typically provide near-constant-time lookup by key (O(1) on average for hash-based implementations) and excel at caching and session management. Document stores index JSON or BSON documents and support flexible schemas that can evolve without an upfront schema migration, although applications must then handle documents of differing shapes. Column-family stores optimize for write-heavy workloads spread across many nodes by grouping related columns together on disk. Graph databases represent entities as nodes and relationships as edges with properties, enabling efficient traversal of highly connected data such as social networks, supply chains, and knowledge graphs.
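The difference between these data models can be made concrete with plain Python structures. This is only an in-memory illustration of the shapes of data and the typical access pattern for each model; it does not use the API of any product named above, and all names and sample values are invented for the example.

```python
from collections import deque

# Key-value: an opaque value addressed only by its key.
sessions = {"session:42": "user=ada;expires=1700000000"}

# Document: nested, self-describing records whose fields may differ per document.
users = [
    {"_id": 1, "name": "Ada", "tags": ["admin"]},
    {"_id": 2, "name": "Grace", "address": {"city": "Arlington"}},
]

# Graph: nodes connected by directed edges, stored here as an adjacency list.
follows = {"ada": ["grace", "edgar"], "grace": ["edgar"], "edgar": ["jim"], "jim": []}


def find_documents(docs, field, value):
    """Return documents whose top-level field equals value; missing fields are skipped."""
    return [d for d in docs if d.get(field) == value]


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal: nodes reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return found


if __name__ == "__main__":
    print(sessions["session:42"])
    print(find_documents(users, "name", "Grace"))
    print(reachable_within(follows, "ada", 2))
```

The traversal in reachable_within is the kind of query that needs one self-join per hop in a relational schema, which is why graph databases treat edge traversal as a primitive operation.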

Query Optimization

Query optimization is the process by which a DBMS selects an efficient execution plan for a query from the space of logically equivalent plans. The optimizer generates candidate plans through algebraic transformations such as predicate pushdown and join reordering, and through physical choices such as index selection. It estimates the cost of each candidate using statistics about data distribution (histograms, cardinality estimates) and models of I/O and CPU cost, and keeps the plan with the lowest estimated cost. The Selinger optimizer, described in 1979 for IBM System R, introduced cost-based optimization with dynamic programming over join orderings and remains the conceptual basis of query planning in modern systems. The VLDB Endowment, which publishes the Proceedings of the VLDB Endowment (PVLDB) journal, is a primary venue for research on query processing, cardinality estimation, and learned query optimizers that use machine learning to improve plan quality.
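Dynamic programming over join orderings can be sketched as follows. This is a toy in the spirit of the Selinger approach, not a reproduction of it: it considers only left-deep plans, uses the sum of estimated intermediate result sizes as its cost, and ignores access paths, interesting orders, and the usual rule against cross products. The function name, the table names, and the cardinality and selectivity figures are assumptions made for the example.

```python
from itertools import combinations


def best_left_deep_plan(cardinalities, selectivities):
    """Return (cost, rows, join_order) for the cheapest left-deep join order.

    cardinalities: {table: estimated row count}
    selectivities: {frozenset({a, b}): fraction of the a x b pairs that join};
                   pairs without an entry are treated as cross products (1.0).
    """
    tables = sorted(cardinalities)
    # best[subset] holds the cheapest plan found for joining exactly that subset.
    best = {frozenset([t]): (0.0, float(cardinalities[t]), (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            candidates = []
            for last in subset:
                # Reuse the optimal plan for the remaining tables, then join `last`.
                cost, rows, order = best[subset - {last}]
                selectivity = 1.0
                for t in order:
                    selectivity *= selectivities.get(frozenset((t, last)), 1.0)
                out_rows = rows * cardinalities[last] * selectivity
                candidates.append((cost + out_rows, out_rows, order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(tables)]


if __name__ == "__main__":
    cards = {"customers": 1_000, "orders": 100_000, "items": 500_000}
    sels = {
        frozenset(("customers", "orders")): 1 / 1_000,  # each order matches one customer
        frozenset(("orders", "items")): 1 / 100_000,    # each item matches one order
    }
    print(best_left_deep_plan(cards, sels))
```

With these figures the sketch chooses the order customers, orders, items, because joining the two smaller inputs first keeps the intermediate result near 100,000 rows. The key idea it shares with cost-based optimizers is that the best plan for a set of tables is built from the best plans for its subsets, so each subset is costed once rather than once per permutation.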

Applications

Databases are used across a wide range of industries, including:

E-commerce and retail: product catalogs, order management systems, and customer profile stores handling millions of concurrent transactions
Healthcare: electronic health record systems, clinical trial data repositories, and population health analytics platforms
Financial services: transaction ledgers, risk management systems, and trading infrastructure requiring high throughput and strict ACID guarantees
Search engines: inverted index structures and distributed key-value stores enabling sub-second full-text search across billions of documents
Telecommunications: subscriber management, call detail records, and network configuration databases supporting carrier-scale operations