.. _ch04-python-numpy: =============================================================== Outlook =============================================================== Google drive folder with Jupyter Notebooks used in class: `Jupyter Notebooks from class `_ In this chapter, we are going to learn how to support our endeavors in scientific computing with `NumPy `_, `matplotlib `_, and a few other tools. As we already have seen, NumPy is one of the core mathematical tools for numerical computing in Python. We'll first cover how to handle mathematical objects using NumPy. Then we'll look at how to produce nice looking plots using Matplotlib -- the core plotting tool for Python. Following these topics we will also look at symbolic calculation using the `SymPy `_ package, more specialized numerical tools in the `SciPy `_ package, and then close out with a glance at some higher level libraries that combine many of these tools together. =============================================================== Reading Materials =============================================================== The materials of this chapter in part have been adopted and modified from: * `SciPy Lecture Notes `_, * `Prof. LeVeque's (Univ. of Washington) online note on Python `_. =============================================================== Array manipulations in NumPy =============================================================== Arrays are an essential construct in scientific computing. For example, an array can contain: * 2D fields or images recorded through experiment or simulation * Solutions of PDE on logically rectangular grids in 2, 3, or more dimensions * Time varying signals recorded by one or more measurement devices * and many, many more Wait a minute... we already have something like this, i.e., Python lists! Unfortunately, Python lists are not suitable for numerical computing. First, let's see why. ------------------------------------------------------------------------------ Python lists vs. NumPy arrays ------------------------------------------------------------------------------ Python has lists as a built-in data type, e.g. .. code-block:: python >>> x = [1., 2., 3.] defines a list that contains 3 real numbers and might be viewed as a vector. However, you cannot easily do arithmetic on such lists the way we could in Fortran, e.g. 2*x does not give [2., 4., 6.] as you might hope. Trying this produces: .. code-block:: python >>> 2*x [1.0, 2.0, 3.0, 1.0, 2.0, 3.0] Instead of doubling the elements of the list this doubles the length of ``x`` by appending another copy (``x + x`` would give the same thing). These are certainly *not* the results of scaling or adding vectors. Two-dimensional arrays are also a bit clumsy when using Python lists, as they have to be specified as a list of lists, e.g., a :math:`3\times 2` array with the elements ``11,12`` in the first row, ``21,22`` in the second row, ``31,32`` in the third row would be specified by .. code-block:: python >>> A = [[11, 12], [21, 22], [31, 32]] Note that indexing always starts with 0 in Python, so we find for example that .. code-block:: python >>> A[0] [11, 12] >>> A[1] [21, 22] >>> A[1][0] 21 Here ``A[0]`` refers to the 0-index element of ``A``, which is itself a list ``[11, 12]``. ``A[1][0]`` can be understood as the 0-index element of ``A[1] = [21, 22]``, hence ``A[1][0]`` is 21. Working with ``A`` as a matrix would be quite clumsy. Any non-trivial operation would require explicit loops over the entries. Most importantly, these loops would have to be done *inside Python*. NumPy was developed to make it easy to do the sorts of things we want to do with matrices and vectors, and more generally n-dimensional arrays. Importantly, most operations on NumPy arrays call out to highly optimized precompiled code. This means that any required looping happens *outside* of Python. For example .. code-block:: python >>> import numpy as np >>> x = np.array([1., 2., 3.]) >>> x array([ 1., 2., 3.]) >>> 2*x array([ 2., 4., 6.]) >>> np.sqrt(x) array([ 1. , 1.41421356, 1.73205081]) We see that we can multiply by a scalar or take component-wise square roots. Indeed, the syntax is very reminiscent of Fortran. We can slice arrays, broadcast products, and do elementwise operations easily. ------------------------------------------------------------------------------ Performance comparison ------------------------------------------------------------------------------ NumPy arrays are more efficient than Python lists, particularly when applying a mathematical operation to the whole array. To see the different behavior, we can use the ``timeit`` module which gives a simple way to time small bits of Python code. For a more full description, see the `documentation `_. For instance, timing an operation on a regular Python's list .. code-block:: python >>> import timeit >>> tList = timeit.Timer('[i**2 for i in L]',setup='L=range(50)') >>> tList.timeit() 7.368645437993109 On the contrary, timing an operation on a NumPy array of the same size gives: .. code-block:: python >>> import timeit # not needed if imported already >>> setupArr = ''' ... import numpy as np ... a = np.arange(50) ... ''' >>> tArr = timeit.Timer('a**2',setup=setupArr) >>> tArr.timeit() 0.6314092200191226 which is substantially faster, and even **includes the time required to import numpy**. ------------------------------------------------------------------------------ Creating NumPy data arrays ------------------------------------------------------------------------------ We looked at a brief example of initializing a NumPy array in the above, where individual items were given explicitly as inputs. In practice, however, we probably want to avoid adding items one by one. Some common examples are as follows: * Evenly spaced values (best for integers) .. code-block:: python >>> import numpy as np >>> a = np.arange(10) # np.arange(n) provides indices from 0 to n-1 >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> a.dtype dtype('int64') >>> b = np.arange(1., 9., 2) # syntax : start, end, step >>> b array([ 1., 3., 5., 7.]) >>> b.dtype dtype('float64') or by specifying the number of points (best for floats) .. code-block:: python >>> c = np.linspace(0, 1, 6) >>> c array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. ]) >>> d = np.linspace(0, 1, 5, endpoint=False) >>> d array([ 0. , 0.2, 0.4, 0.6, 0.8]) Alternatively, you can use ``np.arange`` to get floats by specifying the data type: .. code-block:: python >>> a = np.arange(10, dtype=np.float64) * There are built-in constructors for several of the most common arrays. These include arrays of all ones, all zeros, identity arrays, and diagonal arrays (among others). .. code-block:: python >>> a = np.ones((3,3)) >>> a array([[ 1., 1., 1.], [ 1., 1., 1.], [ 1., 1., 1.]]) >>> a.dtype dtype('float64') >>> b = np.ones(5, dtype=np.int) >>> b array([1, 1, 1, 1, 1]) >>> c = np.zeros((2,3)) >>> c array([[ 0., 0., 0.], [ 0., 0., 0.]]) >>> d = np.eye(3) >>> d array([[ 1., 0., 0.], [ 0., 1., 0.], [ 0., 0., 1.]]) The first argument in each case is a tuple with the size of each dimension in the desired array. Compare this to the ``dimension(M,N,...)`` keyword in Fortran. .. note:: Just like Fortran arrays have a maximum rank (compiler dependent), NumPy arrays have a maximum rank. Currently this cap is 32. If you're problem requires more than 32 dimensions, you should strongly consider finding an alternative way to pose the problem. For more information, try reading up on `the curse of dimensionality `_. * You can also create arrays using nested python lists, which is good for entering small arrays by hand. The :math:`3\times 2` array from above could be written as .. code-block:: python A = np.array([[11, 12], [21, 22], [31, 32]]) ------------------------------------- Querying properties of NumPy arrays ------------------------------------- In Fortran we used the ``size`` intrinsic function to write functions and subroutines that worked for any size array as long as the rank was known. NumPy arrays support similar queries. A few quite useful ones are: * Get the data type of the elements .. code-block:: python >>> print(A.dtype) int64 Note that this is not necessarily the same as Python's intrinsic ``int`` type. Similarly, floats will default to .. code-block:: python >>> B = np.array([1.0,2.0]) >>> print(B.dtype) float64 Often these are the same as the Python intrinsics, but NumPy is capable of storing many different types. See more `here `_. * Get the number of dimensions (rank) .. code-block:: python >>> print(A.ndim) 2 * Get the shape of the array .. code-block:: python >>> s = A.shape >>> print(s) (3, 2) >>> print(type(s)) * Get the size of the array .. code-block:: python >>> s = A.size >>> print(s) 6 >>> print(type(s)) Note that ``size`` is the total number of elements, while ``shape`` is the number of elements along each rank of the array. ------------------------------------------------------------------------------ NumPy array indexing ------------------------------------------------------------------------------ The items of an array can be accessed in a similar same way to the other Python sequences (list, tuple) .. code-block:: python >>> a = np.arange(10) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> b = a[0], a[2], a[-1] >>> b (0, 2, 9) >>> type(b) tuple For a multidimensional array ``a``, indexes are tuples of integers, that is, for a 3D array, ``a[(i,j,k)]``, or simply ``a[i,j,k]`` .. code-block:: python >>> a = np.diag(np.arange(5)) >>> a array([[0, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 2, 0, 0], [0, 0, 0, 3, 0], [0, 0, 0, 0, 4]]) >>> a[1,1] 1 >>> a[2,1] = 10 # third line (or row), second column >>> a array([[ 0, 0, 0, 0, 0], [ 0, 1, 0, 0, 0], [ 0, 10, 2, 0, 0], [ 0, 0, 0, 3, 0], [ 0, 0, 0, 0, 4]]) >>> a[2] # this is the same as a[2,:] array([ 0, 10, 2, 0, 0]) Note that: * All indices occur in one pair of square brackets * In 2D, the first dimension corresponds to lines (i.e., rows), the second to columns. * For an array a with more than one dimension, ``a[i]`` is interpreted as ``a[i,:]`` or ``a[i,:,:]`` (or whatever rank is needed). --------------------------------- NumPy array slicing and masking --------------------------------- Slicing extracts multiple elements out of an array simultaneously, and works in a comparable way to list indexing: .. code-block:: python >>> a = np.arange(10) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> a[2:8:2] # [start:end:step] array([2, 4, 6]) Note that the last index is not included! .. code-block:: python >>> a[:4] array([0, 1, 2, 3]) ``start:end:step`` is a slice object which represents the set of indices ``range(start, end, step)``. Slice objects can also be created explicitly: .. code-block:: python >>> sl = slice(1, 9, 2) >>> a = np.arange(10) >>> b = 2*a + 1 >>> a, b (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])) >>> a[sl], b[sl] (array([1, 3, 5, 7]), array([ 3, 7, 11, 15])) All three slice components are not required: by default, start is 0, end is the last index in the associated rank, and step is 1. Using these defaults we can do all of the following: .. code-block:: python >>> a[1:3] array([1, 2]) >>> a[::2] array([0, 2, 4, 6, 8]) >>> a[3:] array([3, 4, 5, 6, 7, 8, 9]) **Question:** What is the difference between ``a[0:2]`` and ``a[0::2]``? Of course, all of this works with multidimensional arrays too: .. code-block:: python >>> a = np.eye(5) >>> a array([[ 1., 0., 0., 0., 0.], [ 0., 1., 0., 0., 0.], [ 0., 0., 1., 0., 0.], [ 0., 0., 0., 1., 0.], [ 0., 0., 0., 0., 1.]]) >>> a[2:4,:3] #3rd and 4th rows, 3 first columns array([[ 0., 0., 1.], [ 0., 0., 0.]]) All elements specified by a slice can be easily modified .. code-block:: python >>> a[:3,:3] = 4 >>> a array([[ 4., 4., 4., 0., 0.], [ 4., 4., 4., 0., 0.], [ 4., 4., 4., 0., 0.], [ 0., 0., 0., 1., 0.], [ 0., 0., 0., 0., 1.]]) Here is a small illustrated summary of NumPy indexing and slicing: .. _fig1_ch04: .. figure:: ./_figures/numpy_indexing.png :align: center NumPy array indexing. You can also combine assignment and slicing .. code-block:: python >>> a = np.arange(10) >>> a[5:] = 10 >>> a array([ 0, 1, 2, 3, 4, 10, 10, 10, 10, 10]) >>> b = np.arange(5) >>> a[5:] = b[::-1] >>> a array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0]) We can also mask parts of the array using a boolean array in place of the slice object. For instance: .. code-block:: python >>> B = np.eye(5) >>> B array([[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.], [0., 0., 1., 0., 0.], [0., 0., 0., 1., 0.], [0., 0., 0., 0., 1.]]) >>> mask = B<0.5 >>> mask array([[False, True, True, True, True], [ True, False, True, True, True], [ True, True, False, True, True], [ True, True, True, False, True], [ True, True, True, True, False]]) >>> B[mask] = -2.0 >>> B array([[ 1., -2., -2., -2., -2.], [-2., 1., -2., -2., -2.], [-2., -2., 1., -2., -2.], [-2., -2., -2., 1., -2.], [-2., -2., -2., -2., 1.]]) You can also create the mask in place. The assignment ``B[mask] = -2.0`` could have instead been written as ``B[B<0.5] = -2.0``. ------------------------------------------------------------------------------ Copies and views ------------------------------------------------------------------------------ A slicing operation creates a **view** of the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. You can use `np.may_share_memory() `_ to check if two arrays share the same memory block. Note, however, that this uses heuristics and may give you false positives. It may seem a little odd that this isn't a guaranteed answer. We'll return to why this is when we talk about C and pointer aliasing. When modifying the view, the original array is modified as well .. code-block:: python >>> a = np.arange(10) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> b = a[::2] >>> b array([0, 2, 4, 6, 8]) >>> np.may_share_memory(a, b) True >>> b[0] = 12 >>> b array([12, 2, 4, 6, 8]) >>> a # (!) array([12, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> c = a[::2].copy() # force a copy >>> c[0] = 47 >>> a array([12, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> c array([47, 2, 4, 6, 8]) >>> np.may_share_memory(a, c) False .. note:: You can **not** create a view using an implicit mask, e.g. ``b = a[a<3]`` will create a distinct copy. Additionally, this will always return a rank 1 array regardless of the shape of the original array. ---------------------------- Linear algebraic operations ---------------------------- NumPy arrays behave a lot like Fortran arrays under arithmetic. Most operations will default to being performed elementwise. If you are accustomed to MatLab, you should be careful. Here ``A*B`` will be an elementwise product, and *not* a matrix-matrix product (the MatLab equivalent would be ``A.*B``). To get a matrix-matrix product you must use ``dot``, ``matmul`` or the ``@`` operator .. code-block:: python >>> A = np.array([[1,2], [3,4]]) >>> B = np.array([[5,0], [0,7]]) >>> A array([[1, 2], [3, 4]]) >>> B array([[5, 0], [0, 7]]) >>> A*B # this will not give a matrix-product, but an elementwise product array([[ 5, 0], [ 0, 28]]) >>> np.dot(A,B) # this is the expected matrix multiplication array([[ 5, 14], [15, 28]]) >>> B.dot(A) # Matrix multiplication is not commutative! array([[ 5, 10], [21, 28]]) >>> np.matmul(A,B) # Note: matmul is not a member function of ndarray ... >>> B @ A ... .. note:: The ``np.dot`` function generalizes to higher rank tensors. However, that generality comes with a performance penalty. You should use ``np.matmul`` or the ``@`` operator whenever possible. It also makes the code a bit easier to read, though that is subjective. Consult also the functions ``np.tensordot`` and ``np.vdot`` for other use cases. Many other linear algebra tools can be found in NumPy. For example, to solve a linear system ``Ax = b`` using Gaussian Elimination, we can do .. code-block:: python >>> A array([[1, 2], [3, 4]]) >>> b = np.array([2,3]) >>> x = np.linalg.solve(A,b) >>> x array([-1. , 1.5]) To find the eigenvalues and eigenvectors of ``A`` we can do: .. code-block:: python >>> evals, evecs = np.linalg.eig(A) >>> evals array([-0.37228132, 5.37228132]) >>> evecs array([[-0.82456484, -0.41597356], [ 0.56576746, -0.90937671]]) .. note:: You may be tempted to use the variable name ``lambda`` for the eigenvalues of a matrix, but this isn't allowed in Python because ``lambda`` is a keyword of the language. ---------------------------------- Reductions of NumPy arrays ---------------------------------- In this context, reductions are operations that map arrays to individual scalar values or arrays of lower rank. These include operations like summation, norms, max/min values, etc. To compute sums .. code-block:: python >>> x = np.array([1, 2, 3, 4]) >>> np.sum(x) 10 >>> x.sum() 10 Sum by rows and by columns .. code-block:: python >>> x = np.array([[1, 2], [3, 4]]) >>> x array([[1, 2], [3, 4]]) >>> x.sum(axis=0) # sum along columns (first dimension) array([4, 6]) >>> x[:, 0].sum(), x[:, 1].sum() (4, 6) >>> x.sum(axis=1) # sum along rows (second dimension) array([3, 7]) >>> x[0, :].sum(), x[1, :].sum() (3, 7) or similarly in even higher dimensions .. code-block:: python >>> x = np.array([ [[1,6],[7,4]], [[5,2],[3,8]] ] ) >>> x array([[[1, 6], [7, 4]], [[5, 2], [3, 8]]]) >>> x.sum(axis = 0) array([[ 6, 8], [10, 12]]) >>> x.sum(axis = 1) array([[8, 10], [8, 10]]) >>> x.sum(axis = 2) array([[7, 11], [7, 11]]) To obtain extrema .. code-block:: python >>> x.min() 1 >>> x.max() 8 >>> x.max(axis=0) array([[5, 6], [7, 8]]) >>> x.argmin() # index of minimum in flattened array 0 >>> x.argmax(axis=0) # index of maxima along 0th dimension array([[1, 0], [0, 1]]) Logical operations .. code-block:: python >>> np.all([True, True, False]) False >>> np.any([True, True, False]) True >>> np.all(x<3) False >>> np.any(x<3) True Finally, we can list a few examples from statistics .. code-block:: python >>> x = np.array([1, 2, 3, 1]) >>> y = np.array([[1, 2, 3], [5, 6, 1]]) >>> x.mean() 1.75 >>> np.median(x) 1.5 >>> np.median(y, axis=-1) # last axis, equivalently, np.median(y, axis=1) array([ 2., 5.]) >>> x.std() # full population standard deviation. 0.82915619758884995 -------------------- Loading data files -------------------- Text files ^^^^^^^^^^^^ Consider an example: ``populations.txt``: .. literalinclude:: ./codes/populations.txt :language: text :linenos: :download:`Download this data <./codes/populations.txt>` To load ``populations.txt`` .. code-block:: python >>> data = np.loadtxt('populations.txt') >>> data array([[ 1900., 30000., 4000., 48300.], [ 1901., 47200., 6100., 48200.], [ 1902., 70200., 9800., 41500.], ... >>> np.mean(data[:,1:],axis=0) array([34080.95238095, 20166.66666667, 42400. ]) >>> np.std(data[:,1:],axis=0) array([20897.90645809, 16254.59153691, 3322.50622558]) To save it to another name, ``pop2.txt`` .. code-block:: python >>> np.savetxt('pop2.txt', data) Other well-known file formats ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * HDF5: h5py, PyTables * We will look at some examples regarding reading in HDF5 files after we cover matplotlib * NetCDF: scipy.io.netcdf_file, netcdf4-python, others * Matlab: scipy.io.loadmat, scipy.io.savemat * IDL: scipy.io.readsav .. _ch04-python-numpy-findiff: --------------------------------------- An example: Numerical differentiation --------------------------------------- Differential equations and dynamical systems are ubiquitous in science, and by extension, in scientific computing. One will often need to approximate the derivative of a function knowing only its values at discrete points. One particularly simple way to approximate derivatives is through *finite differences*. These get their name from revisiting the limit definition of a derivative .. math:: \frac{df}{dx} := \lim\limits_{h\rightarrow 0} \frac{f(x+h) - f(x)}{h} and taking :math:`h` to be a finite instead of infinitesimal value. This same idea can be extended to higher derivatives and more accurate approximations. Some of the final projects will dig into these methods more deeply. In this example we'll use NumPy to approximate the first and second derivative of a function on a uniform grid. .. literalinclude:: ./codes/finiteDiff.py :download:`Download this code <./codes/finiteDiff.py>` **Exercises:** * Modify the code to report the relative error instead of absolute error * Modify the code to report the space-averaged error instead of pointwise error - E.g. report the :math:`L_1` error instead of the :math:`L_\infty` error ---------------------------- An example: Revisiting HW3 ---------------------------- Here we'll look at an example program using NumPy that solves the same boundary value problem as in the second homework. The initial code is: .. literalinclude:: ./codes/dftBvp.py :download:`Download this code <./codes/dftBvp.py>` First, we can run this directly from the command line to get the same answer as what we saw in the homework for :math:`N=21`: .. code-block:: console $ python dftBvp.py $ Solving with 21 grid points $ Max error: 0.009817949242564072 As an exercise, let's now consider the following: #. Add a function to wrap the steps required to go from a forcing function to a solution. #. Loop over a variety of resolutions and record the maximum errors we obtain. #. Segue into the next section by plotting the solution and forcing function. ---------------------------------------- An example: Spectra of random matrices ---------------------------------------- There are some fascinating results regarding the spectra of (large) random symmetric matrices. The study of these spectra first arose in nuclear physics, but have since seen applications in number theory, theoretical neuroscience, and optimal control. The wiki page on `Random Matrices `_ has plenty more information. Let's consider this problem numerically instead of analytically. We are going to generate a large number of random symmetric :math:`N\times N` matrices :math:`\mathbf{A}` such that .. math:: \mathbf{A}_{ij} \sim \mathcal{N}(0,1),\qquad\mathbf{A}_{ij} = \mathbf{A}_{ij}. For each matrix we'll find it's eigenvalues using ``np.linalg.eigvalsh``, which finds the eigenvalues of symmetric (or complex Hermitian) matrices. For consistency, we'll rescale these eigenvalues by :math:`N^{-1/2}` and plot them using a histogram. **Preliminary question:** What distribution do you think the eigenvalues all combined together follow? What if we sort them and consider them one-by-one? The code to accomplish this is: .. literalinclude:: ./codes/randEigH.py :download:`Download this code <./codes/randEigH.py>` Consider a few exercises using this code: #. Run the code as is. Do the produced histograms match your intuition? #. Modify ``N`` and ``numSamples``. How does run time depend on them? How closely can you match the limit distribution using different values? #. Add another set of plots to examine the size of spectral gaps. #. Try changing ``rng.standard_normal`` to a different one using `this list `_ from the NumPy documentation. How strongly does the limit distribution for the eigenvalues depend on the distribution of the entries?