01B. Efficient Programming in Python

Mingyang Lu

12/16/2023

Avoid growing vectors

The following code grows a list one element at a time inside a loop, which can be slow for large n.

In [1]:
# Set the value of n to 10
n = 10

# Create an empty list v
v = []

# Use a for loop to iterate from 1 to n
for i in range(1, n+1):
    # Append the square of i to list v
    v.append(i**2)

# Print the resulting list v
print(v)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
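How much growing costs depends on the container. Python's list.append is amortized constant time, whereas np.append allocates a new array and copies the old contents on every call, so the penalty is most pronounced when a NumPy array is grown inside a loop. The following is a minimal sketch of that copying behavior; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Start with a small array
a = np.zeros(3)

# np.append does not modify a; it returns a new, longer array
b = np.append(a, 1.0)

# Check whether the two arrays share the same memory buffer
grew_in_place = np.shares_memory(a, b)

print(a.shape, b.shape, grew_in_place)
```

Because every call copies the whole array, appending n elements one by one performs on the order of n**2 element copies in total.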

A better way is to allocate an array of the final length first and then fill it in place.

In [2]:
# Import the numpy library to create numeric arrays
import numpy as np

# Set the value of n to 10
n = 10

# Create a numeric array v filled with zeros
v = np.zeros(n)

# Use a for loop to iterate from 1 to n
for i in range(1, n+1):
    # Assign the square of i to the corresponding element in the array v
    v[i-1] = i**2

# Print the resulting numeric array v
print(v)
[  1.   4.   9.  16.  25.  36.  49.  64.  81. 100.]
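The output above shows floats (1., 4., ...) because np.zeros creates a float64 array by default. If integer results are wanted, a dtype can be passed at allocation time. This is a small sketch; the choice of np.int64 is an assumption, not something the original notebook specifies.

```python
import numpy as np

n = 10

# Preallocate an integer array instead of the default float64
v = np.zeros(n, dtype=np.int64)

# Fill the preallocated array in place
for i in range(1, n + 1):
    v[i - 1] = i**2

print(v.dtype, v)
```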

Vectorize code

An even better approach replaces the explicit loop with vector operations.

In [3]:
import numpy as np

n = 10

# Generate a sequence from 1 to n using NumPy
v = np.arange(1, n+1)

# Square each element in the array v
v = v ** 2

# Print the resulting array v
print(v)
[  1   4   9  16  25  36  49  64  81 100]
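As a quick sanity check, the loop-based and vectorized versions can be compared element by element. This is only an illustrative sketch; the names v_loop and v_vec are hypothetical.

```python
import numpy as np

n = 10

# Loop version: fill a preallocated integer array
v_loop = np.zeros(n, dtype=np.int64)
for i in range(1, n + 1):
    v_loop[i - 1] = i**2

# Vectorized version: square the whole array at once
v_vec = np.arange(1, n + 1) ** 2

# Both approaches should produce identical values
same = np.array_equal(v_loop, v_vec)
print(same)
```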

Explicit iteration, e.g., with a for loop, is slow in Python compared with vectorized operations. For example, the following code calculates the mean and standard deviation (SD) of a series of numbers.

In [4]:
# Iterations with a for loop
import math

# Initialize variables
my_sum = 0
my_sum2 = 0
num = 100

# Loop through a range from 1 to num
for i in range(1, num + 1):
    my_sum += i
    my_sum2 += i**2

# Calculate mean and standard deviation
my_mean = my_sum / num
my_sd = math.sqrt(my_sum2 / num - my_mean**2)

# Print the mean and standard deviation
print(my_mean, my_sd)
50.5 28.86607004772212

While the above code is typical for C or Fortran, a better approach for Python is to use vector operations.

In [5]:
# Vectorization
import numpy as np

# Define num
num = 100

# Create a sequence from 1 to num
v = np.arange(1, num + 1)

# Calculate mean, mean square, and standard deviation
my_mean = np.mean(v)
my_mean_square = np.mean(v**2)
my_sd = np.sqrt(my_mean_square - my_mean**2)

# Print the mean and standard deviation
print(my_mean, my_sd)
50.5 28.86607004772212
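The formula used here, the square root of the mean square minus the squared mean, is the population SD. As a hedged cross-check, it should agree with np.std, which uses ddof=0 by default; passing ddof=1 gives the sample SD instead. The variable names below are illustrative.

```python
import numpy as np

v = np.arange(1, 101)

# SD from the formula sqrt(mean(x^2) - mean(x)^2)
sd_formula = np.sqrt(np.mean(v**2) - np.mean(v) ** 2)

# Built-in population SD (ddof=0 is the default)
sd_population = np.std(v)

# Sample SD divides by n - 1 instead of n, so it is slightly larger
sd_sample = np.std(v, ddof=1)

print(np.isclose(sd_formula, sd_population), sd_sample > sd_population)
```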

Another way is to use a Pandas Series and its apply method, which applies a function to each element.

In [6]:
# Apply
import numpy as np
import pandas as pd

# Define num
num = 100

# Create a sequence from 1 to num
v = np.arange(1, num + 1)

# Convert the NumPy array to a Pandas Series
v_series = pd.Series(v)

# Calculate mean, mean square, and standard deviation using apply
my_mean = v_series.mean()
my_mean_square = v_series.apply(lambda x: x**2).mean()
my_sd = np.sqrt(my_mean_square - my_mean**2)

print(my_mean, my_sd)
50.5 28.86607004772212
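A Pandas Series also supports vectorized arithmetic directly, so for a simple operation like squaring, apply is not strictly needed. The sketch below compares the two on the same data; the names sq_apply and sq_vec are hypothetical.

```python
import numpy as np
import pandas as pd

v_series = pd.Series(np.arange(1, 101))

# Element-wise square via apply (calls the lambda once per element)
sq_apply = v_series.apply(lambda x: x**2)

# Element-wise square via vectorized arithmetic on the Series
sq_vec = v_series**2

# Compare the underlying values
same = np.array_equal(sq_apply.to_numpy(), sq_vec.to_numpy())
print(same, sq_vec.mean())
```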

Apply can also be used to perform an operation on each column (axis=0) or each row (axis=1) of a matrix stored as a DataFrame.

In [7]:
import numpy as np
import pandas as pd

# Generate a random matrix of size 4x4
mat = np.random.normal(size=(4, 4))

# Convert the matrix to a DataFrame
df = pd.DataFrame(mat)

# Calculate means, means square, and standard deviations using apply
means = df.apply(np.mean, axis=0)
means2 = df.apply(lambda col: np.mean(col**2), axis=0)
sd = np.sqrt(means2 - means**2)

# Print the standard deviations
print(sd)
0    0.519095
1    0.643977
2    0.662455
3    0.507251
dtype: float64
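For common statistics, DataFrame provides built-in column-wise and row-wise methods that can stand in for apply. The sketch below assumes a fixed random seed for reproducibility (the original cell is unseeded) and checks the apply-based SD against df.std with ddof=0.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption made here so the example is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(4, 4)))

# Column-wise population SD computed with apply
means = df.apply(lambda col: col.mean(), axis=0)
means2 = df.apply(lambda col: (col**2).mean(), axis=0)
sd_apply = np.sqrt(means2 - means**2)

# The same quantity from the built-in method (ddof=0 for population SD)
sd_builtin = df.std(axis=0, ddof=0)

# axis=1 applies the function to each row instead
row_means = df.apply(lambda row: row.mean(), axis=1)

print(np.allclose(sd_apply.to_numpy(), sd_builtin.to_numpy()))
```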

Performance evaluation

In [8]:
import math
import numpy as np
import pandas as pd

# function using for loop
def f1(num):
    my_sum = 0
    my_sum2 = 0

    for i in range(1, num + 1):
        my_sum += i
        my_sum2 += i**2

    my_mean = my_sum / num
    my_sd = math.sqrt(my_sum2 / num - my_mean**2)
    return [my_mean, my_sd]

# function using vectorization
def f2(num):
    v = np.arange(1, num + 1)

    my_mean = np.mean(v)
    my_mean_square = np.mean(v**2)
    my_sd = np.sqrt(my_mean_square - my_mean**2)
    return [my_mean, my_sd]

# function using apply
def f3(num):
    v = np.arange(1, num + 1)
    
    v_series = pd.Series(v)
    my_mean = v_series.mean()
    my_mean_square = v_series.apply(lambda x: x**2).mean()
    my_sd = np.sqrt(my_mean_square - my_mean**2)
    return [my_mean, my_sd]

%timeit f1(10000)
%timeit f2(10000)
%timeit f3(10000)
1.03 ms ± 27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
50.6 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
2.93 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
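%timeit is an IPython magic and is only available in IPython or Jupyter. In a plain Python script, a similar measurement can be sketched with the standard timeit module. The functions below restate f1 and f2 from the cell above; the repetition count of 20 is an arbitrary assumption, and the timings will vary by machine.

```python
import math
import timeit

import numpy as np

# For-loop version, as in f1 above
def f1(num):
    my_sum = 0
    my_sum2 = 0
    for i in range(1, num + 1):
        my_sum += i
        my_sum2 += i**2
    my_mean = my_sum / num
    return [my_mean, math.sqrt(my_sum2 / num - my_mean**2)]

# Vectorized version, as in f2 above
def f2(num):
    v = np.arange(1, num + 1)
    my_mean = np.mean(v)
    return [my_mean, np.sqrt(np.mean(v**2) - my_mean**2)]

# Total time in seconds for `repeats` calls of each function
repeats = 20
t_loop = timeit.timeit(lambda: f1(10_000), number=repeats)
t_vec = timeit.timeit(lambda: f2(10_000), number=repeats)

print(f"for loop:   {t_loop / repeats * 1e6:.1f} µs per call")
print(f"vectorized: {t_vec / repeats * 1e6:.1f} µs per call")
```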

Typically, a for loop is easy to understand but slow for large datasets because Python is interpreted. Vectorization (using NumPy) is efficient for numerical operations on large datasets but may be less intuitive for complex operations. Apply (using Pandas) is convenient for applying a function along the rows or columns of a DataFrame or to the elements of a Series, but it may be slower than vectorized operations. In the %timeit results above, the vectorized version is roughly 20 times faster than the for loop, while apply is nearly 3 times slower than the loop.