Dictionaries¶
Another useful data type built into Python is the dictionary. A dictionary
is like a list, but allows more general indices. In a list, the indices have
to be integers; in a dictionary they can be (almost) any type and are now called
keys
.
Relational storage¶
You can think of a dictionary as a mapping between two things:
keys: a set of indices,
values: a set of values corresponding to each key.
Each key maps to a value. The association of a key to a value is called a key-value pair.
You can define an empty dictionary in two ways.
One way is to use a built-in function dict
>>> eng2kor = dict()
or alternatively, use an empty curly brackets, {}
>>> eng2kor = {}
In both cases you would see the following
>>> type(eng2kor)
<type 'dict'>
>>> print(eng2kor)
{}
Let’s add a new pair to the dictionary, eng2kor
. To add one, you can
use square brackets
>>> eng2kor['one'] = 'hana'
This creates a new key-value pair that maps from the key 'one'
to the value
'hana'
. If we print the dictionary again
>>> print(eng2kor)
{'one': 'hana'}
You can initialize a dictionary with multiple items like this:
>>> eng2kor={'one':'hana','two':'dool','three':'set','four':'net'}
The append
method cannot be invoked on a dictionary directly
(i.e., eng2kor.append('five')
won’t work).
Instead, one can keep adding new pairs using
>>> eng2kor['five'] = 'dasut'
or using update
(try help(dict)
or dir(dict)
to see more options)
>>> eng2kor.update({'five':'dasut'})
Let’s now print to see what we have defined so far
>>> print(eng2kor)
{'four': 'net', 'three': 'set', 'five': 'dasut', 'two': 'dool', 'one': 'hana'}
The order of the key-value pairs may not look like what you probably expected.
In fact, they might look different on different computers. The order of pairs in a
dictionary is unpredictable. This is because the elements of a dictionary are indexed
through a hash
function. In spite of this, dictionaries are still iterable.
Traversing through the dictionary will show
>>> for i in eng2kor:
... print(i)
...
four
three
five
two
one
Here we see that traversing a dictionary runs over the keys, and not the values. Note that keys must be unique, while values may not be (there is a surjection between keys and values):
>>> eng2kor[5] = 'dasut'
>>> eng2kor['five']
'dasut'
>>> eng2kor[5]
'dasut'
The dictionary method items
returns something akin to a list of tuples
which gives a useful way to iterate over the keys and values together:
>>> for eng, kor in eng2kor.items():
... print(eng, kor)
...
one hana
two dool
three set
four net
five dasut
5 dasut
or, to just get keys
>>> for eng in eng2kor.keys():
... print(eng)
...
one
two
three
four
five
5
and finally to get only the values
>>> for kor in eng2kor.values():
... print(kor)
...
hana
dool
set
net
dasut
dasut
We can apply some of the built-in functions and operators we learned so far to dictionaries as well:
>>> len(eng2kor)
6
>>> 'one' in eng2kor
True
>>> 'net' in eng2kor
False
The second example of the in
operator tells us that Python checks if
the search word appears as a key
, but not as a value
in the dictionary.
To see whether something appears as a value instead of a key, is to use the method
values
which returns the values as a list
>>> print(eng2kor.values())
dict_values(['hana', 'dool', 'set', 'net'])
With this we can now search
>>> 'net' in eng2kor.values()
True
Dictionary as a set of counters¶
A common string processing problem is to examine the frequency of characters or substrings. Let’s see how we could use a dictionary to help keep track of characters (you’ll get to look at the substring version in HW4):
1"""
2/lectureNote/chapters/chapt03/codes/examples/dictionaries/histogram.py
3
4"""
5
6def histogram(s):
7 # initialize with an empty dictionary
8 d = dict()
9
10 for c in s:
11 #print(c)
12 if c not in d:
13 # This is the first instance of c
14 # Insert it as a key and set the value to 1
15 d[c] = 1
16 else:
17 # c has already appeared, increment counter
18 d[c] += 1
19
20 # return dictionary
21 return d
22
23def histogram_ternary(s):
24
25 # This is exactly the same as histogram
26 # but using a so-called 'ternary operator':
27 # a if test else b
28 #
29 # Ex: x='apple' if a > 2 else 'orange'
30 # Translating this into English gives
31 # x is 'apple' if a > 2; otherwise x is 'orange'
32
33 d = dict()
34 for c in s:
35 # the ternary expression is shorter, though also terse
36 d[c] = 1 if c not in d else d[c]+1
37 return d
38
39def histogram_ternary_get(s):
40
41 # This is exactly the same as histogram
42 # but using 'get' method defined in dictionary:
43 # See help(dict) and check out:
44 #
45 # get(...)
46 # D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.
47 # i.e., if D is a dictionary,
48 # /D[k] if k in D
49 # D.get(k,d) = |
50 # \ d if k not in D
51 #
52 # Ex: x='apple' if a > 2 else 'orange'
53 # Translating this into English will be
54 # x is 'apple' if a > 2; otherwise x is 'orange'
55
56 d = dict()
57 for c in s:
58 # Insert c with value 0 if it doesn't exist yet
59 # otherwise return current value.
60 # Either way, increment before storing
61 d[c] = d.get(c,0) + 1
62 return d
63
64if __name__ == "__main__":
65 # first function
66 h1 = histogram('apple')
67 print( '(a):', h1 )
68
69 # second function which uses the ternary operator
70 h2 = histogram_ternary('apple')
71 print( '(b):', h2 )
72
73 # are they the same?
74 print( '(c):', h1 is h2 )
75 print( '(d):', h1 == h2)
76
77 # print keys
78 print( '(e):', h1.keys())
79
80 # does 'a' appear as a key?
81 print( '(f):', 'a' in h1 )
82
83 # print values
84 print( '(g):', h1.values() )
85
86 # does 2 appear as a value?
87 print( '(h):', 2 in h1.values())
88
89 # 'get' takes a key and a default value
90 # If the key appears in the dictionary
91 # 'get' returns the corresponding value;
92 # otherwise it returns the user defined
93 # default value, e.g., 159 in the following example:
94 print( '(i):', h1.get('a',159) )
95 print( '(j):', h1.get('w',159) )
Running this in the script mode will give
$ python histogram.py
(a): {'a': 1, 'p': 2, 'l': 1, 'e': 1}
(b): {'a': 1, 'p': 2, 'l': 1, 'e': 1}
(c): False
(d): True
(e): dict_keys(['a', 'p', 'l', 'e'])
(f): True
(g): dict_values([1, 2, 1, 1])
(h): True
(i): 1
(j): 159
Dictionaries and lists¶
Lists can only appear as values in a dictionary, but not keys. For example, if you try
>>> t=['a','e','l']
>>> type(t)
<class 'list'>
>>> d = dict()
>>> d[t]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d['1'] = t
>>> d
{'1': ['a', 'e', 'l']}
The above example confirms that lists can only be used as values. For the most part,
keys need to be immutable objects (more specifically they must have a __hash__()
method).
Now, let’s consider an application of using lists as values. Take a look at what we
obtained in the last outcome, {'1': ['a','e','l']}
. This looks like an inverse
map of the output from (a)
or (b)
! This example tells us that we could try to
create an inverse map from values to keys in a dictionary (note this only works if the
values are all hashable types). Here is a function that inverts a dictionary:
1"""
2/lectureNote/chapters/chapt03/codes/examples/dictionaries/invert_dictionary.py
3
4"""
5
6def invert_dictionary(d):
7 # create an empty dictionary
8 inverse = dict()
9
10 # traverse through keys in dictionary "d"
11 for key,val in d.items():
12 if val not in inverse:
13 # val hasn't been seen yet
14 # insert it into the inverse dictionary
15 # note [key] is wrapped as a list
16 inverse[val] = [key]
17 else:
18 # val has been seen before
19 # append the key to list stored in inverse[val]
20 inverse[val].append(key)
21 return inverse
22
23
24if __name__ == "__main__":
25
26 # import histogram method from histogram.py
27 from histogram import histogram
28
29 # compute histogram
30 hist = histogram('apple')
31 print( hist )
32
33 # compute inverse map of dictionary
34 inv = invert_dictionary(hist)
35 print( inv )
The result look like
$ python3 invert_dictionary.py
{'a': 1, 'p': 2, 'e': 1, 'l': 1}
{1: ['a', 'e', 'l'], 2: ['p']}
Dictionaries and memoization¶
Dictionaries are an excellent way to store the results of expensive calculations using the inputs as keys. In this way, whenever a set of inputs that have already been used are queried one can fetch the result from the dictionary. This avoids repeating potentially expensive calculations, and trades off computational effort for a larger memory footprint.
For example, a naive implementation of the Fibonacci sequence looks like:
1"""
2/lectureNote/chapters/chapt03/codes/examples/dictionaries/fibonacci.py
3
4Fibonaci sequence using recursion
5
6"""
7
8def fibonacci(n):
9 if n == 0:
10 # First base case
11 return 0
12 elif n == 1:
13 # Second base case
14 return 1
15 else:
16 # Otherwise, call backwards in sequence recursively
17 res = fibonacci(n-1) + fibonacci(n-2)
18 return res
19
20if __name__ == "__main__":
21 print(fibonacci(12))
Notice that early terms in the sequence are re-evaluated a huge number of times through those recursive calls. A dictionary can be used to cache the result for previously encountered values, which can short-circuit those recursive calls quite efficiently.
1"""
2/lectureNote/chapters/chapt03/codes/examples/dictionaries/fibonacci_dict.py
3
4Fibonacci sequence using a dictionary "known" which keeps track of values that
5have already been computed and stores them for reuse.
6
7"""
8
9from fibonacci import fibonacci
10
11# initialize a dictionary, with the first two Fib. numbers
12known = {0:0,1:1}
13
14# Memoized call using full dictionary
15def fibonacci_dict(n):
16 if n in known:
17 # n has been encoutered before, return it
18 # base case slides up the series as more terms are queried
19 return known[n]
20 else:
21 # otherwise, call back recursively and cache value
22 known[n] = fibonacci_dict(n-1) + fibonacci_dict(n-2)
23 return known[n]
24
25# Initialize empty dictionary
26sparse = {}
27
28def fibonacci_dict_sparse(n):
29 if n in sparse:
30 # n has been queried before, return value
31 return sparse[n]
32 else:
33 # otherwise, call naive method and cache
34 sparse[n] = fibonacci(n)
35 return sparse[n]
36
37if __name__ == "__main__":
38 print(fibonacci_dict(12))
39 print(known)
40 print(fibonacci_dict_sparse(12))
41 print(sparse)
In the above example, known
is a dictionary that stores the Fibonacci numbers
we already know. It starts with the first two terms in the sequence: F0=0
and F1=1
,
or in other words, 0 maps to 0 and 1 maps to 1. This caches every value seen along way.
Note that we could use a list or array for this version.
The sparse version only caches the values we explicitly ask for. This is slower on the initial calls, but has a smaller memory footprint. This is particularly useful if you know that the function will be queried for the same inputs many times.
To compare CPU runtime in seconds, we can do as follows:
1"""
2/lectureNote/chapters/chapt03/codes/examples/dictionaries/run_fibonacci.py
3
4Runtime comparison of three fibonacci implementations
5
6"""
7
8import time
9from fibonacci import fibonacci
10from fibonacci_dict import fibonacci_dict, fibonacci_dict_sparse
11
12id_set = [4,12,15,30,32]
13
14start_time1 = time.time()
15for i in id_set:
16 fibonacci(i)
17elapsed_time1 = time.time() - start_time1
18
19start_time2 = time.time()
20for i in id_set:
21 fibonacci_dict(i)
22elapsed_time2 = time.time() - start_time2
23
24start_time3 = time.time()
25for i in id_set:
26 fibonacci_dict_sparse(i)
27elapsed_time3 = time.time() - start_time3
28
29print('First runs through (sec):')
30print( ' fibonacci = ', elapsed_time1 )
31print( ' fibonacci_dict = ', elapsed_time2 )
32print( ' fibonacci_dict_sparse = ', elapsed_time3 )
33
34start_time1 = time.time()
35for i in id_set:
36 fibonacci(i)
37elapsed_time1 = time.time() - start_time1
38
39start_time2 = time.time()
40for i in id_set:
41 fibonacci_dict(i)
42elapsed_time2 = time.time() - start_time2
43
44start_time3 = time.time()
45for i in id_set:
46 fibonacci_dict_sparse(i)
47elapsed_time3 = time.time() - start_time3
48
49print('Second runs through (sec):')
50print( ' fibonacci = ', elapsed_time1 )
51print( ' fibonacci_dict = ', elapsed_time2 )
52print( ' fibonacci_dict_sparse = ', elapsed_time3 )
53
Running this will produce something akin to this:
First runs through (sec):
fibonacci = 0.7044386863708496
fibonacci_dict = 1.4066696166992188e-05
fibonacci_dict_sparse = 0.7033944129943848
Second runs through (sec):
fibonacci = 0.7008662223815918
fibonacci_dict = 3.337860107421875e-06
fibonacci_dict_sparse = 1.430511474609375e-06
Global variables¶
Consider running fibonacci_dict.py
directly. known
is initialized outside
any function, and thus belongs to the module namespace. Variables declared as such
are global from the perspective of this file, and can be accessed by any function
here (compare this to module variables in Fortran). When we import that file as a
module in run_fibonacci.py
, the dictionary known will live in the fibonacci_dict
namespace.
The following example illustrates how global variables behave and how they could be modified within a local function:
1"""
2/lectureNote/chapters/chapt03/codes/examples/dictionaries/global.py
3
4"""
5
6been_called = False
7
8def local_var():
9 been_called = True
10 print( '(a):', been_called )
11
12local_var()
13print( '(b):', been_called )
14
15
16def global_var():
17 global been_called
18 been_called = True
19 print( '(c):', been_called )
20
21
22global_var()
23print( '(d):', been_called )
24
25
26been_called = False
27def return_var():
28 been_called = True
29 return been_called
30
31return_var()
32print( '(e):', been_called )
33print( '(f):', return_var() )
The result is:
(a): True
(b): False
(c): True
(d): True
(e): False
(f): True
An example study¶
Consider the following example which has been originally adopted from Dive Into Python 3 and modified for the class.
This routine takes a computer file size in kilobytes as an input and converts it approximately to a human-readable form, e.g., 1TB, or 931 GiB, etc.
1"""
2/lectureNote/chapters/chapt03/codes/examples/dictionaries/humansize.py
3
4NOTE: This routine has been extracted from
5
6 http://www.diveintopython3.net/your-first-python-program.html
7
8and modified by Prof. Dongwook Lee for AMS 209
9and modified by Youngjun Lee for AMS 129
10
11"""
12
13SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
14 1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
15
16def approximate_size(size, a_kilobyte_is_1024_bytes=True):
17 '''Convert a file size to human-readable form.
18
19 Keyword arguments:
20 size -- file size in bytes
21 a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
22 if False, use multiples of 1000
23
24 Returns: file size in a string format
25
26 '''
27 if size < 0:
28 print( 'number must be non-negative' )
29
30 if a_kilobyte_is_1024_bytes:
31 multiple = 1024
32 else:
33 multiple = 1000
34
35 # Initialize an empty size_dict array to keep track of
36 # the file sizes and suffixes.
37 # The result is going to be the last key:value pair when
38 # a computed size becomes smaller than the file size unit (i.e., multiple).
39 size_dict=dict()
40 for suffix in SUFFIXES[multiple]:
41 #print suffix
42 size /= multiple # <==> size = size/multiple
43 #print size
44 size_dict[size]=suffix
45
46 # Keep dividing until a size is less than the chosen file size unit
47 if size < multiple:
48 return str(round(size)) + ' ' + size_dict[size]
49
50 print( 'number too large' )
51
52if __name__ == '__main__':
53 print( '(a) with the multiple of 1000 bytes: ', approximate_size(1000000000000, False) )
54 print( '(b) with the multiple of 1024 bytes: ', approximate_size(1000000000000) )
The output from running the routine looks like
$ python3 humansize.py
(a) with the multiple of 1000 bytes: 1 TB
(b) with the multiple of 1024 bytes: 931 GiB