Dictionaries

Link to incomplete Jupyter Notebook for this section of the notes (for you to fill out while following along with lecture)
Link to completed Jupyter Notebook for this section of the notes

Another useful data type built into Python is the dictionary. A dictionary is like a list, but allows more general indices. In a list, the indices have to be integers; in a dictionary they can be (almost) any type and are now called keys.

Relational storage

You can think of a dictionary as a mapping between two things:

keys: a set of indices,
values: a set of values corresponding to each key.

Each key maps to a value. The association of a key to a value is called a key-value pair.

You can define an empty dictionary in two ways. One way is to use the built-in function dict

>>> eng2kor = dict()

or alternatively, use empty curly brackets, {}

>>> eng2kor = {}

In both cases you would see the following

>>> type(eng2kor)
<class 'dict'>
>>> print(eng2kor)
{}

Let’s add a new pair to the dictionary, eng2kor. To add one, you can use square brackets

>>> eng2kor['one'] = 'hana'

This creates a new key-value pair that maps from the key 'one' to the value 'hana'. If we print the dictionary again

>>> print(eng2kor)
{'one': 'hana'}
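
The same square brackets read a value back. As a minimal sketch (not part of the original notes), looking up a key that was never stored raises a KeyError, so it can help to test membership with in first. The key 'two' is deliberately absent here, and the variable name missing is only for illustration:

```python
eng2kor = {}
eng2kor['one'] = 'hana'

# Look up an existing key
print(eng2kor['one'])                # hana

# Looking up a key that was never stored raises KeyError
try:
    eng2kor['two']
except KeyError as err:
    missing = err.args[0]            # the key that was not found
    print('no such key:', missing)   # no such key: two

# A membership test avoids the exception
print('two' in eng2kor)              # False
```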

You can initialize a dictionary with multiple items like this:

>>> eng2kor={'one':'hana','two':'dool','three':'set','four':'net'}

The append method cannot be invoked on a dictionary directly (i.e., eng2kor.append('five') won’t work). Instead, one can keep adding new pairs using

>>> eng2kor['five'] = 'dasut'

or using update (try help(dict) or dir(dict) to see more options)

>>> eng2kor.update({'five':'dasut'})

Let’s now print to see what we have defined so far

>>> print(eng2kor)
{'one': 'hana', 'two': 'dool', 'three': 'set', 'four': 'net', 'five': 'dasut'}

The pairs appear in the order in which they were inserted. This ordering is guaranteed from Python 3.7 onward; in older versions the order was unpredictable and could differ between computers, because the elements of a dictionary are indexed through a hash function. Either way, dictionaries are iterable.

Traversing through the dictionary will show

>>> for i in eng2kor:
...     print(i)
...
one
two
three
four
five

Here we see that traversing a dictionary runs over the keys, and not the values. Note that keys must be unique, while values need not be (the map from keys to values need not be one-to-one):

>>> eng2kor[5] = 'dasut'

>>> eng2kor['five']
'dasut'

>>> eng2kor[5]
'dasut'
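
Uniqueness of keys also means that assigning to an existing key replaces its value rather than adding a second pair. A small sketch of this behavior; the spelling 'daseot' is an alternative romanization used here only for illustration:

```python
eng2kor = {'five': 'dasut'}

# A second, different key may map to the same value
eng2kor[5] = 'dasut'
print(len(eng2kor))        # 2

# Assigning to an existing key replaces its value; no new pair is created
eng2kor['five'] = 'daseot'
print(len(eng2kor))        # 2
print(eng2kor['five'])     # daseot
print(eng2kor[5])          # dasut
```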

The dictionary method items returns something akin to a list of tuples which gives a useful way to iterate over the keys and values together:

>>> for eng, kor in eng2kor.items():
...     print(eng, kor)
...

one hana
two dool
three set
four net
five dasut
5 dasut

or, to just get keys

>>> for eng in eng2kor.keys():
...     print(eng)
...

one
two
three
four
five
5

and finally to get only the values

>>> for kor in eng2kor.values():
...     print(kor)
...
hana
dool
set
net
dasut
dasut
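
When a particular order such as alphabetical is wanted, one option is to sort explicitly with the built-in sorted. The sketch below uses string keys only, because Python 3 cannot compare the integer key 5 with the string keys; the name by_value is just for illustration:

```python
eng2kor = {'one': 'hana', 'two': 'dool', 'three': 'set', 'four': 'net'}

# sorted() applied to a dictionary returns a new list of its keys, sorted
for eng in sorted(eng2kor):
    print(eng, eng2kor[eng])

# Sort the (key, value) pairs by value instead of by key
by_value = sorted(eng2kor.items(), key=lambda pair: pair[1])
print(by_value)
```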

We can apply some of the built-in functions and operators we learned so far to dictionaries as well:

>>> len(eng2kor)
6

>>> 'one' in eng2kor
True

>>> 'net' in eng2kor
False

The second example of the in operator tells us that Python checks if the search word appears as a key, but not as a value in the dictionary.

To see whether something appears as a value instead of a key, use the method values, which returns a view of the values that behaves much like a list

>>> print(eng2kor.values())
dict_values(['hana', 'dool', 'set', 'net', 'dasut', 'dasut'])

With this we can now search

>>> 'net' in eng2kor.values()
True
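
Knowing that a value is present does not say which key (or keys) it belongs to. One possible way to recover them is a linear scan over items; the helper name keys_for_value below is hypothetical and not part of the original notes:

```python
def keys_for_value(d, target):
    """Return a list of all keys in d whose value equals target."""
    return [key for key, val in d.items() if val == target]

eng2kor = {'one': 'hana', 'four': 'net', 'five': 'dasut', 5: 'dasut'}

print(keys_for_value(eng2kor, 'dasut'))   # ['five', 5]
print(keys_for_value(eng2kor, 'yeol'))    # []
```

Unlike a lookup by key, this scan visits every pair, so it becomes slow for large dictionaries that are searched often.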

Dictionary as a set of counters

A common string processing problem is to examine the frequency of characters or substrings. Let’s see how we could use a dictionary to help keep track of characters (you’ll get to look at the substring version in HW4):

"""
/lectureNote/chapters/chapt03/codes/examples/dictionaries/histogram.py

"""

def histogram(s):
    # initialize with an empty dictionary
    d = dict()

    for c in s:
        #print(c)
        if c not in d:
            # This is the first instance of c
            # Insert it as a key and set the value to 1
            d[c] = 1
        else:
            # c has already appeared, increment counter
            d[c] += 1

    # return dictionary
    return d

def histogram_ternary(s):

    # This is exactly the same as histogram
    # but using a so-called 'ternary operator':
    # a if test else b
    #
    # Ex: x='apple' if a > 2 else 'orange'
    # Translating this into English gives
    # x is 'apple' if a > 2; otherwise x is 'orange'

    d = dict()
    for c in s:
        # the ternary expression is shorter, though also terse
        d[c] = 1 if c not in d else d[c]+1
    return d

def histogram_ternary_get(s):

    # This is exactly the same as histogram
    # but using the 'get' method defined on dictionaries.
    # See help(dict) and check out:
    #
    # get(...)
    # D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
    # i.e., if D is a dictionary,
    #                    / D[k] if k in D
    #      D.get(k,d) = |
    #                    \ d if k not in D
    #
    # Ex: d.get('a', 0) is d['a'] if 'a' is a key of d; otherwise it is 0

    d = dict()
    for c in s:
        # get returns the current count of c,
        # or 0 if c is not yet a key (get itself inserts nothing).
        # Either way, increment before storing
        d[c] = d.get(c,0) + 1
    return d

if __name__ == "__main__":
    # first function
    h1 = histogram('apple')
    print( '(a):', h1 )

    # second function which uses the ternary operator
    h2 = histogram_ternary('apple')
    print( '(b):', h2 )

    # are they the same?
    print( '(c):', h1 is h2 )
    print( '(d):', h1 == h2)

    # print keys
    print( '(e):', h1.keys())

    # does 'a' appear as a key?
    print( '(f):', 'a' in h1 )

    # print values
    print( '(g):', h1.values() )

    # does 2 appear as a value?
    print( '(h):', 2 in h1.values())

    # 'get' takes a key and a default value
    # If the key appears in the dictionary
    # 'get' returns the corresponding value;
    # otherwise it returns the user defined
    # default value, e.g., 159 in the following example:
    print( '(i):', h1.get('a',159) )
    print( '(j):', h1.get('w',159) )


Running this in the script mode will give

$ python histogram.py
(a): {'a': 1, 'p': 2, 'l': 1, 'e': 1}
(b): {'a': 1, 'p': 2, 'l': 1, 'e': 1}
(c): False
(d): True
(e): dict_keys(['a', 'p', 'l', 'e'])
(f): True
(g): dict_values([1, 2, 1, 1])
(h): True
(i): 1
(j): 159
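
For comparison, the standard library offers collections.Counter, a dictionary subclass built for exactly this kind of counting. This is an alternative to the hand-written functions above rather than the approach of these notes:

```python
from collections import Counter

h = Counter('apple')

print(h['p'])      # 2
print(h['w'])      # 0, a missing key counts as zero instead of raising KeyError

# A Counter compares equal to a plain dictionary with the same pairs
same = (h == {'a': 1, 'p': 2, 'l': 1, 'e': 1})
print(same)        # True
```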

Dictionaries and lists

Lists can appear as values in a dictionary, but not as keys. For example, if you try

>>> t=['a','e','l']
>>> type(t)
<class 'list'>
>>> d = dict()
>>> d[t]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

>>> d['1'] = t
>>> d
{'1': ['a', 'e', 'l']}

The above example confirms that lists can only be used as values. For the most part, keys need to be immutable objects (more specifically they must have a __hash__() method).
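
Tuples, in contrast, are hashable as long as their items are, so they can serve as keys. A brief sketch; the names grid and d are chosen only for illustration:

```python
# Tuples of hashable items are themselves hashable
grid = {}
grid[(0, 0)] = 'origin'
grid[(1, 2)] = 'point A'
print(grid[(1, 2)])      # point A

# A list can be converted to a tuple before being used as a key
t = ['a', 'e', 'l']
d = {tuple(t): 1}
print(d)                 # {('a', 'e', 'l'): 1}
```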

Now, let’s consider an application of using lists as values. Take a look at what we obtained in the last outcome, {'1': ['a','e','l']}. This looks like an inverse map of the output from (a) or (b)! This example tells us that we could try to create an inverse map from values to keys in a dictionary (note this only works if the values are all hashable types). Here is a function that inverts a dictionary:

"""
/lectureNote/chapters/chapt03/codes/examples/dictionaries/invert_dictionary.py

"""

def invert_dictionary(d):
    # create an empty dictionary
    inverse = dict()

    # traverse through keys in dictionary "d"
    for key,val in d.items():
        if val not in inverse:
            # val hasn't been seen yet
            # insert it into the inverse dictionary
            # note [key] is wrapped as a list
            inverse[val] = [key]
        else:
            # val has been seen before
            # append the key to list stored in inverse[val]
            inverse[val].append(key)
    return inverse


if __name__ == "__main__":

    # import histogram method from histogram.py
    from histogram import histogram

    # compute histogram
    hist = histogram('apple')
    print( hist )

    # compute inverse map of dictionary
    inv = invert_dictionary(hist)
    print( inv )


The result looks like

$ python3 invert_dictionary.py
{'a': 1, 'p': 2, 'l': 1, 'e': 1}
{1: ['a', 'l', 'e'], 2: ['p']}
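
The if/else bookkeeping in the inversion can also be delegated to collections.defaultdict, which creates an empty list the first time a new value is seen. This is a sketch of an alternative, with the hypothetical name invert_dictionary_dd, and the histogram of 'apple' written out by hand so the block is self-contained:

```python
from collections import defaultdict

def invert_dictionary_dd(d):
    # Each missing key starts out as an empty list
    inverse = defaultdict(list)
    for key, val in d.items():
        inverse[val].append(key)
    # Convert back to a plain dictionary before returning
    return dict(inverse)

hist = {'a': 1, 'p': 2, 'l': 1, 'e': 1}
inv = invert_dictionary_dd(hist)
print(inv)   # {1: ['a', 'l', 'e'], 2: ['p']}
```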

Dictionaries and memoization

Dictionaries are an excellent way to store the results of expensive calculations using the inputs as keys. In this way, whenever a set of inputs that have already been used are queried one can fetch the result from the dictionary. This avoids repeating potentially expensive calculations, and trades off computational effort for a larger memory footprint.

For example, a naive implementation of the Fibonacci sequence looks like:

"""
/lectureNote/chapters/chapt03/codes/examples/dictionaries/fibonacci.py

Fibonacci sequence using recursion

"""

def fibonacci(n):
    if n == 0:
        # First base case
        return 0
    elif n == 1:
        # Second base case
        return 1
    else:
        # Otherwise, call backwards in sequence recursively
        res = fibonacci(n-1) + fibonacci(n-2)
        return res
    
if __name__ == "__main__":
    print(fibonacci(12))


Notice that early terms in the sequence are re-evaluated a huge number of times through those recursive calls. A dictionary can be used to cache the result for previously encountered values, which can short-circuit those recursive calls quite efficiently.

"""
/lectureNote/chapters/chapt03/codes/examples/dictionaries/fibonacci_dict.py

Fibonacci sequence using a dictionary "known" which keeps track of values that
have already been computed and stores them for reuse.

"""

from fibonacci import fibonacci

# initialize a dictionary, with the first two Fib. numbers
known = {0:0,1:1}

# Memoized call using full dictionary
def fibonacci_dict(n):
    if n in known:
        # n has been encountered before, return it
        # base case slides up the series as more terms are queried
        return known[n]
    else:
        # otherwise, call back recursively and cache value
        known[n] = fibonacci_dict(n-1) + fibonacci_dict(n-2)
        return known[n]

# Initialize empty dictionary
sparse = {}
    
def fibonacci_dict_sparse(n):
    if n in sparse:
        # n has been queried before, return value
        return sparse[n]
    else:
        # otherwise, call naive method and cache
        sparse[n] = fibonacci(n)
        return sparse[n]
    
if __name__ == "__main__":
    print(fibonacci_dict(12))
    print(known)
    print(fibonacci_dict_sparse(12))
    print(sparse)


In the above example, known is a dictionary that stores the Fibonacci numbers we already know. It starts with the first two terms in the sequence: F0=0 and F1=1, or in other words, 0 maps to 0 and 1 maps to 1. This caches every value seen along the way. Note that we could use a list or array for this version.

The sparse version only caches the values we explicitly ask for. This is slower on the initial calls, but has a smaller memory footprint. This is particularly useful if you know that the function will be queried for the same inputs many times.
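
The standard library packages the same idea as a decorator, functools.lru_cache, which keeps its own internal cache keyed by the function arguments. The sketch below is an alternative to the hand-written known dictionary, not the approach used in these notes, and the name fibonacci_cached is hypothetical:

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # maxsize=None lets the cache grow without bound
def fibonacci_cached(n):
    if n < 2:
        # Base cases: F0 = 0 and F1 = 1
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

print(fibonacci_cached(12))            # 144
print(fibonacci_cached.cache_info())   # hit/miss statistics of the cache
```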

To compare wall-clock runtime in seconds (time.time measures elapsed time, not CPU time), we can do as follows:

"""
/lectureNote/chapters/chapt03/codes/examples/dictionaries/run_fibonacci.py

Runtime comparison of three fibonacci implementations

"""

import time
from fibonacci import fibonacci
from fibonacci_dict import fibonacci_dict, fibonacci_dict_sparse

id_set = [4,12,15,30,32]

start_time1 = time.time()
for i in id_set:
    fibonacci(i)
elapsed_time1 = time.time() - start_time1

start_time2 = time.time()
for i in id_set:
    fibonacci_dict(i)
elapsed_time2 = time.time() - start_time2

start_time3 = time.time()
for i in id_set:
    fibonacci_dict_sparse(i)
elapsed_time3 = time.time() - start_time3

print('First runs through (sec):')
print( '  fibonacci             = ', elapsed_time1 )
print( '  fibonacci_dict        = ', elapsed_time2 )
print( '  fibonacci_dict_sparse = ', elapsed_time3 )

start_time1 = time.time()
for i in id_set:
    fibonacci(i)
elapsed_time1 = time.time() - start_time1

start_time2 = time.time()
for i in id_set:
    fibonacci_dict(i)
elapsed_time2 = time.time() - start_time2

start_time3 = time.time()
for i in id_set:
    fibonacci_dict_sparse(i)
elapsed_time3 = time.time() - start_time3

print('Second runs through (sec):')
print( '  fibonacci             = ', elapsed_time1 )
print( '  fibonacci_dict        = ', elapsed_time2 )
print( '  fibonacci_dict_sparse = ', elapsed_time3 )


Running this will produce something akin to this:

First runs through (sec):
  fibonacci             =  0.7044386863708496
  fibonacci_dict        =  1.4066696166992188e-05
  fibonacci_dict_sparse =  0.7033944129943848
Second runs through (sec):
  fibonacci             =  0.7008662223815918
  fibonacci_dict        =  3.337860107421875e-06
  fibonacci_dict_sparse =  1.430511474609375e-06
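
Timings this short are close to the resolution of time.time. For measuring intervals, time.perf_counter is generally the better clock. A minimal sketch, with a hypothetical helper time_call and a compact copy of the naive function so that the block runs on its own:

```python
import time

def fibonacci(n):
    # Naive recursion, same logic as fibonacci.py
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)

def time_call(func, arg):
    """Return (result, elapsed seconds) for a single call func(arg)."""
    start = time.perf_counter()
    result = func(arg)
    return result, time.perf_counter() - start

result, elapsed = time_call(fibonacci, 20)
print(result, elapsed)
```

The elapsed value will differ from run to run and from machine to machine, so only the relative sizes of such timings are meaningful.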

Global variables

Consider running fibonacci_dict.py directly. known is initialized outside any function, and thus belongs to the module namespace. Variables declared as such are global from the perspective of this file, and can be accessed by any function here (compare this to module variables in Fortran). When we import that file as a module in run_fibonacci.py, the dictionary known will live in the fibonacci_dict namespace.

The following example illustrates how global variables behave and how they could be modified within a local function:

"""
/lectureNote/chapters/chapt03/codes/examples/dictionaries/global.py

"""

been_called = False

def local_var():
    been_called = True
    print( '(a):', been_called )

local_var()
print( '(b):', been_called )


def global_var():
    global been_called
    been_called = True
    print( '(c):', been_called )


global_var()
print( '(d):', been_called )


been_called = False
def return_var():
    been_called = True
    return been_called

return_var()
print( '(e):', been_called )
print( '(f):', return_var() )


The result is:

(a): True
(b): False
(c): True
(d): True
(e): False
(f): True
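
This also explains why fibonacci_dict could update known without a global statement: the declaration is needed only when a function assigns to the global name itself, not when it modifies the object the name refers to. A small sketch with hypothetical function names:

```python
known = {0: 0, 1: 1}
count = 0

def add_entry():
    # Modifying the dictionary that 'known' refers to needs no declaration
    known[2] = 1

def bump_count_local():
    # Plain assignment creates a new local name; the global is untouched
    count = 1
    return count

def bump_count_global():
    # 'global' makes the assignment rebind the module-level name
    global count
    count = count + 1

add_entry()
bump_count_local()
print(known)   # {0: 0, 1: 1, 2: 1}
print(count)   # 0
bump_count_global()
print(count)   # 1
```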

An example study

Consider the following example, which was originally adapted from Dive Into Python 3 and modified for the class.

This routine takes a computer file size in bytes as input and converts it approximately to a human-readable form, e.g., 1 TB or 931 GiB.

"""
/lectureNote/chapters/chapt03/codes/examples/dictionaries/humansize.py

NOTE: This routine has been extracted from

   http://www.diveintopython3.net/your-first-python-program.html

and modified by Prof. Dongwook Lee for AMS 209
and modified by Youngjun Lee for AMS 129

"""

SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    '''Convert a file size to human-readable form.

    Keyword arguments:
    size -- file size in bytes
    a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
                                if False, use multiples of 1000

    Returns: file size in a string format

    '''
    if size < 0:
        raise ValueError('number must be non-negative')

    if a_kilobyte_is_1024_bytes:
        multiple = 1024
    else:
        multiple = 1000

    # Initialize an empty dictionary size_dict to keep track of
    # the file sizes and suffixes.
    # The result is going to be the last key:value pair when
    # a computed size becomes smaller than the file size unit (i.e., multiple).
    size_dict=dict()
    for suffix in SUFFIXES[multiple]:
        #print suffix
        size /= multiple  # <==> size = size/multiple
        #print size
        size_dict[size]=suffix

        # Keep dividing until a size is less than the chosen file size unit
        if size < multiple:
            return str(round(size)) + ' ' + size_dict[size]

    raise ValueError('number too large')

if __name__ == '__main__':
    print( '(a) with the multiple of 1000 bytes: ', approximate_size(1000000000000, False) )
    print( '(b) with the multiple of 1024 bytes: ', approximate_size(1000000000000) )


The output from running the routine looks like

$ python3 humansize.py
(a) with the multiple of 1000 bytes:  1 TB
(b) with the multiple of 1024 bytes:  931 GiB
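
The two results can be checked by hand. The following sketch redoes the arithmetic that the loop performs, outside of the function; the variable name gib is only for illustration:

```python
size = 1000000000000   # bytes

# Decimal units: three divisions give 1000.0, which is not < 1000,
# so the loop goes one step further and lands on 'TB'
print(size / 1000**3)   # 1000.0
print(size / 1000**4)   # 1.0

# Binary units: the third division already falls below 1024 ('GiB')
gib = size / 1024**3
print(round(gib))       # 931
```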
