Dictionaries

What Are Dictionaries?

Dictionaries are abstract data structures that store collections of key-value pairs, supporting efficient lookup, insertion, and deletion of values by their associated keys. Also called associative arrays or maps in many programming contexts, dictionaries generalize the index-based access of arrays to arbitrary key types: rather than retrieving an element by an integer position, a program retrieves it by a meaningful identifier such as a string, a tuple, or any other hashable object. This abstraction is foundational in software engineering because it models the natural structure of many real-world problems, from symbol tables in compilers to routing tables in network routers to attribute stores in databases.

The interface of a dictionary is simple: insert a key-value pair, retrieve the value associated with a given key, delete a key-value pair, and query whether a key is present. The richness of the data structure lies in how efficiently these operations can be performed on large collections and in how different implementation strategies trade off time, memory, and worst-case behavior.

Hash Table Implementation

The dominant implementation of dictionaries in practice is the hash table. A hash table allocates an array of fixed size and uses a hash function to map each key to an index within that array. The hash function takes an input key of arbitrary size and computes a fixed-size integer, which is then reduced modulo the array size to yield a position. When the hash function distributes keys uniformly across the array, lookup, insertion, and deletion all require constant time on average, which is O(1) in big-O notation. This characteristic distinguishes hash tables from other dictionary implementations such as balanced binary search trees, which guarantee O(log n) worst-case time but require keys to be totally ordered.

The Oregon State University data structures textbook chapter on dictionaries and hash tables provides a detailed treatment of hash function design, covering polynomial rolling hash for strings, multiplication-based hash functions, and the significance of prime table sizes for minimizing collision clustering. The quality of the hash function directly determines the practical performance of the table: a poor hash function that maps many keys to the same index degrades performance toward O(n) for all operations.

Collision Resolution

When two distinct keys hash to the same array index, a collision occurs and must be resolved by one of several strategies. Separate chaining stores a linked list or dynamic array at each array slot, appending new entries to the chain when a collision occurs. Lookup in a chained table traverses the chain at the computed index, comparing keys until the target is found or the chain is exhausted. Under uniform hashing with a load factor below one, the expected chain length is constant.

Open addressing handles collisions by searching for an alternative empty slot within the array itself, following a deterministic probe sequence. Linear probing increments the index by one on each step; quadratic probing uses a quadratic function of the probe count; double hashing uses a second independent hash function to determine the step size. As analyzed in the MPI Informatics treatment of hash tables and associative arrays, open addressing achieves better cache performance than chaining because all data resides in a contiguous array, but it is more sensitive to load factor and suffers from clustering effects that chaining avoids.

Dictionary Operations and Language Support

Modern programming languages provide built-in dictionary types optimized for their respective execution environments. Python's dict uses a compact open-addressing scheme with pseudo-random probing, and since Python 3.7 preserves insertion order. Java's HashMap uses separate chaining with a conversion to balanced binary trees within buckets that exceed a threshold length, providing O(log n) worst-case performance even under adversarial hash flooding. The C++ std::unordered_map follows a chaining approach defined by the C++ standard library specification. According to data structures reference material on hash tables, the load factor threshold at which a hash table is resized is typically between 0.5 and 0.75, triggering a rehash that copies all entries into a larger array.

Applications

Dictionaries have applications across a broad range of computing domains, including:

  • Compiler symbol tables for mapping variable and function names to their types and memory locations
  • Database indexing systems for fast key-based record retrieval
  • Network routing and forwarding tables in routers and switches
  • Cache implementations in web browsers, content delivery networks, and operating systems
  • Configuration and settings storage in applications and operating systems
  • Natural language processing for tokenization, frequency counting, and vocabulary management
Loading…