Overview

  1. CPython allocation memory to save dictionary, the initial table size is 8, entries are saved as <hash,key,value> in each slot(The slot content changed after Python 3.6).
  2. When a new key is added, python use i = hash(key) & mask where mask=table_size-1 to calculate which slot it should be placed. If the slot is occupied, CPython using a probing algorithm to find the empty slot to store new item.
  3. When 23 of the table is full, the table will be resized.
  4. When getting item from dictionary, both hash and key must be equal.

Resizing

When elements size is below 50000, the table size will increase by a factor of 4 based on used slots. Otherwise, it will increase by a factor of 2.

So the table size increase like this: 8->8*23*4=24->24*23*4=64->…

Removing item from dictionary doesn’t lead to shrink table. The value of the item will marks as null but not empty. To prevent early stopping when looking for other element.

Probing

CPython used a modified random probing algorithm to choose the empty slot. This algorithm can traval all of the slots in a pseudo random order.

The travel order can be calculated by this formula: j = ((5*j) + 1) mod 2**i, where j is slot index.

For example, if table size is 8, and the calculate slot index is 2, then the traversal order should be:

2 -> (5*2+1) mod 8 = 3 -> (5*3+1) mod 8 = 0 -> (5*0+1) mod 8 = 1 -> 6 -> 7 -> 4 -> 5 -> 2

CPython changed this formula by adding perturb and PERTURB_SHIFT variables, where perturb is hash value and PERTURB_SHIFT is 5. By adding PERTURB_SHIFT, the probe sequence depends on every bit in the hash code, and the collision probability is decreased. And perturb will eventually becomes to 0, this ensures that all of the slots will be checked.

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
j = j % 2**i

Dictionary improvement after 3.6

CPython 3.6 use a compact representation to save entries, and “The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5”.

Compact Hash Table

As mentioned before, entries saved in the form of <hash,key,value>. This will takes 3B on 64 bit machine. And no matter how much item is added into the dictionary, the memory usage is the same(3B*table_size).

After 3.6, CPython use two structure to save data. One is index, another is the real data.

For example, if the table size is 8, and there is an item in slot 1, the index looks like this:

[null, 0, null, null, null, null, null, null]

And the real data is:

| hash | key  | value |
| xxx1 | yyy1 | zzz1  |

0 represents the items index on real data. If another item is added in slot 3, the new index become this:

[null, 0, null, 1, null, null, null, null]

The real data become this:

| hash | key  | value |
| xxx1 | yyy1 | zzz1  |
| xxx2 | yyy2 | zzz2  |

This saves memory, especially when table is not full.

Ref:

  1. How are Python’s Built In Dictionaries Implemented
  2. cpython source code
  3. Is it possible to give a python dict an initial capacity (and is it useful)
  4. Python dictionary implementation