### <center>San Jose State University<br>Department of Applied Data Science<br><br>**DATA 200<br>Computational Programming for Data Analytics**<br><br>Spring 2024<br>Instructor: Ron Mak</center>

# Linear Regression and Weighted Averages (List Version)

#### Suppose you're analyzing the test scores of students who studied for different numbers of hours. What is the relationship between hours of study and test scores?
#### You obtain the data by interviewing the students. Each row records a number of study hours, how many students reported studying that long, and their average score.
## Data as a 2-d list:

In [None]:
scores_list = [
    [ 0,  3,  5],
    [ 1,  4, 11],
    [ 2,  4, 13],
    [ 3,  5, 39],
    [ 4,  5, 44],
    [ 5, 10, 56],
    [ 6,  9, 57],
    [ 7,  3, 80],
    [ 8,  2, 87],
    [ 9,  1, 89],
    [10,  1, 99]
]

In [None]:
print('Hours  Count  Avg. score')

for row in scores_list:
    print(f'{row[0]:3}{row[1]:7}{row[2]:10}')

## `X` is a list of hours, `Y` is a list of scores

In [None]:
X = []
Y = []

for i in range(len(scores_list)):
    row = scores_list[i]

    X.append(row[0])  # element from column 0
    Y.append(row[2])  # element from column 2

In [None]:
print(' hours X:', X)
print('scores Y:', Y)
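
#### As an alternative sketch, the same two columns can be pulled out with list comprehensions instead of an index loop. The names `X_alt` and `Y_alt` are hypothetical, chosen so the original `X` and `Y` are left untouched; the data is repeated so the cell runs on its own.

```python
# Same data as scores_list above: [hours, student count, average score]
scores_list = [
    [ 0,  3,  5],
    [ 1,  4, 11],
    [ 2,  4, 13],
    [ 3,  5, 39],
    [ 4,  5, 44],
    [ 5, 10, 56],
    [ 6,  9, 57],
    [ 7,  3, 80],
    [ 8,  2, 87],
    [ 9,  1, 89],
    [10,  1, 99]
]

# Each comprehension walks the rows once and keeps one column.
X_alt = [row[0] for row in scores_list]  # hours  (column 0)
Y_alt = [row[2] for row in scores_list]  # scores (column 2)

print(' hours X_alt:', X_alt)
print('scores Y_alt:', Y_alt)
```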

## Scatter plot of scores vs. hours

In [None]:
import matplotlib
import matplotlib.pyplot as plt

In [None]:
plt.scatter(X, Y)
plt.show()

## The trend line
#### Now let's calculate and plot the trend line, which mathematicians call the **least-squares linear regression line**.

![Screenshot 2023-03-08 at 2.09.29 PM.png](attachment:067bcb00-e205-4bad-831b-8694649971e2.png)
![Screenshot 2023-03-08 at 2.11.13 PM.png](attachment:bf4a83fb-611f-4d13-9873-d369a4ef3f61.png)
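
#### The screenshots may not render outside the original notebook. Written out, the formulas that the function below computes appear to be the following, where $n$ is the number of points, every sum runs over all $n$ points, and $\bar{x}$ and $\bar{y}$ are the means. The notation is an assumption; the arithmetic matches the code.

$$m = \frac{\sum x_i y_i - \dfrac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}, \qquad b = \bar{y} - m\,\bar{x}$$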

## Slope-intercept function (list version)
#### Based on the above formulas, we can define a function that calculates the slope `m` and the y intercept `b` given the x and y values.

In [None]:
def calculate_slope_intercept_list_version(X, Y):
    """
    Calculate and return the slope m and y intercept b
    of the least-squares linear regression line y = mx + b
    given the vectors of x and y values.
    """
    n = len(X)
    
    sum_X  = 0
    sum_Y  = 0
    sum_XX = 0
    sum_XY = 0

    for i in range(n):
        x = X[i]
        y = Y[i]
        
        sum_X  += x
        sum_Y  += y
        sum_XX += x*x
        sum_XY += x*y
        
    mean_X = sum_X/n
    mean_Y = sum_Y/n
    
    numerator   = sum_XY - (sum_X*sum_Y)/n
    denominator = sum_XX - ((sum_X*sum_X)/n)
    
    m = numerator/denominator
    b = mean_Y - m*mean_X
    
    return m, b  # slope and intercept    

## Calculate the slope and y intercept

In [None]:
m, b = calculate_slope_intercept_list_version(X, Y)  # unpack m and b

print(f'      slope m = {m:.2f}')
print(f'y intercept b = {b:.2f}')
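
#### One way to sanity-check the result is to compute the slope from the equivalent mean-centered form, $m = \sum (x_i - \bar{x})(y_i - \bar{y}) \,/\, \sum (x_i - \bar{x})^2$. This is a minimal sketch: the helper name `slope_intercept_centered` and the variable names are hypothetical, and the hours and scores are repeated so the cell runs on its own.

```python
def slope_intercept_centered(X, Y):
    """
    Least-squares slope and intercept via the mean-centered form.
    Hypothetical helper, intended only as a cross-check.
    """
    n = len(X)
    mean_X = sum(X)/n
    mean_Y = sum(Y)/n

    # Sums of products of deviations from the means.
    s_xy = sum((x - mean_X)*(y - mean_Y) for x, y in zip(X, Y))
    s_xx = sum((x - mean_X)**2 for x in X)

    m = s_xy/s_xx
    b = mean_Y - m*mean_X

    return m, b

hours  = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [5, 11, 13, 39, 44, 56, 57, 80, 87, 89, 99]

m_check, b_check = slope_intercept_centered(hours, scores)

print(f'      slope m = {m_check:.2f}')
print(f'y intercept b = {b_check:.2f}')
```

#### Both forms should agree up to floating-point rounding. For this data the slope works out to about 9.99 and the y intercept to about 2.77.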

## Scatter plot with regression line

#### Because the regression line is straight, we only need its two end points to plot it. The function `plt.plot()` can take a **list** of the x values and a list of the y values of the line to plot. Let's make the regression line red.

In [None]:
plt.scatter(X, Y)

# The regression line. We only need the end points.
# End point 1: x = 0,     y = b
# End point 2: x = X[-1], y = m*X[-1] + b
# (X is in ascending order and starts at 0, so these are its extremes.)
plt.plot([0, X[-1]], [b, m*X[-1] + b], color='red')
plt.show()

## The weighted average

#### Let's calculate the weighted average of the scores. A simple average of the score column would be misleading because each score represents a different number of students. Instead, we treat the number of students per score as that score's weight.
![Screenshot 2023-03-08 at 2.23.20 PM.png](attachment:95dd8413-59c2-403d-b67c-89f616d01803.png)
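
#### In case the screenshot does not render, the weighted average of values $v_i$ with weights $w_i$ can be written as follows. The symbol $\bar{v}_w$ is an assumed notation.

$$\bar{v}_w = \frac{\sum_{i=1}^{n} w_i v_i}{\sum_{i=1}^{n} w_i}$$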

## Weighted average function (list version)
#### Based on the above formulas, we can define a function that calculates a weighted average given weights `W` and values `V`.

In [None]:
def calculate_weighted_average_list_version(W, V):
    """
    Calculate and return the weighted average of values
    given the weights W and the values V.
    """
    total = 0
    
    for i in range(len(W)):
        total += W[i]*V[i]
    
    return total/sum(W)

## Weights `W` is a list of counts, values `V` is a list of scores

In [None]:
W = []
V = []

for i in range(len(scores_list)):
    row = scores_list[i]

    W.append(row[1])  # element from column 1
    V.append(row[2])  # element from column 2

In [None]:
print('counts W:', W)
print('scores V:', V)

## Calculate and print the weighted average score

In [None]:
w_avg_score = calculate_weighted_average_list_version(W, V)

print(f'weighted average score = {w_avg_score:.2f}')
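
#### To see why the weights matter, this sketch compares the plain mean of the scores with the weighted mean. It uses `zip()` rather than an index loop, and the names `counts`, `scores`, `plain_avg`, and `weighted_avg` are hypothetical; the data is repeated so the cell runs on its own.

```python
counts = [3, 4, 4, 5, 5, 10, 9, 3, 2, 1, 1]           # column 1 of scores_list
scores = [5, 11, 13, 39, 44, 56, 57, 80, 87, 89, 99]  # column 2 of scores_list

# Plain mean: every row counts equally, no matter how many students it represents.
plain_avg = sum(scores)/len(scores)

# Weighted mean: each score is multiplied by its student count,
# and the total is divided by the total number of students.
weighted_avg = sum(w*v for w, v in zip(counts, scores))/sum(counts)

print(f'   plain average score = {plain_avg:.2f}')
print(f'weighted average score = {weighted_avg:.2f}')
```

#### For this data the weighted average (about 46.83) is lower than the plain average (about 52.73), because only a few students reported the long study hours that go with the highest scores.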

In [None]:
# (C) Copyright 2024 by Ronald Mak