Linear Regression with NumPy & SciPy

y = mx + b, What is r-squared, variance, standard deviation…

Oct 24, 2018

For our example, let’s create a data set where y = mx + b.

x will be N = 200 values drawn from a normal distribution with a standard deviation σ (sigma) of 1 around a mean value μ (mu) of 5.

The standard deviation ‘σ’ measures how much the members of a group typically differ from the mean of the group.

The slope ‘m’ will be 3 and the intercept ‘b’ will be 60.

import numpy as np

x = np.random.normal(5.0,1.0,200) # (mean, std. deviation, N)
m = 3
b = 60
y = m * (x + np.random.normal(0,0.2,200)) + b # add Gaussian noise (std. deviation 0.2) to x to get more realistic data

Normal distribution or ‘Gaussian’:

import matplotlib.pyplot as plt

plt.hist(x,50) # histogram of x with 50 bins
plt.show()

The histogram shows how the data is spread around the mean value, following our normal distribution.

Let’s visualise our data.

plt.scatter(x,y)
plt.show()

stats.linregress()

stats.linregress() gives us the values of m and b. It also returns r_value, the correlation coefficient, which we square to judge how well our line fits the data: r-squared lies between 0 and 1, from a bad fit to a good fit.

from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

print('Slope: ',slope,'\nIntercept: ',intercept)

Slope: 2.98104902278
Intercept: 60.1146144847
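To see where these numbers come from, here is a minimal sketch of the closed-form least-squares estimates, compared against stats.linregress(). The names x_demo, y_demo, slope_manual and intercept_manual are hypothetical, and the data is a seeded stand-in for the x and y above, so the printed values will differ from the ones in this post.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data (assumption): same shape as the example above,
# seeded so the run is repeatable.
rng = np.random.default_rng(0)
x_demo = rng.normal(5.0, 1.0, 200)
y_demo = 3 * x_demo + 60 + rng.normal(0, 0.6, 200)

# Closed-form least-squares estimates:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x_demo.mean(), y_demo.mean()
slope_manual = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
intercept_manual = y_mean - slope_manual * x_mean

result = stats.linregress(x_demo, y_demo)
print('Manual slope: ', slope_manual, '\nSciPy slope:  ', result.slope)
print('Manual intercept: ', intercept_manual, '\nSciPy intercept:  ', result.intercept)
```

The two pairs of values should agree up to floating-point rounding.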

r-squared :

r_value**2

0.96018831950537364
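For a straight-line fit with an intercept, r-squared can also be read as the share of the variance in y that the line explains. The sketch below checks this on hypothetical seeded data (x_demo and y_demo are assumptions, not the x and y from this post) by computing 1 - SS_res / SS_tot and comparing it to the squared r value.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data (assumption), seeded so the run is repeatable.
rng = np.random.default_rng(1)
x_demo = rng.normal(5.0, 1.0, 200)
y_demo = 3 * x_demo + 60 + rng.normal(0, 0.6, 200)

fit = stats.linregress(x_demo, y_demo)
y_pred = fit.slope * x_demo + fit.intercept

ss_res = np.sum((y_demo - y_pred) ** 2)         # squared error left after the fit
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # squared error of predicting the mean
r_squared = 1 - ss_res / ss_tot

print('1 - SS_res/SS_tot: ', r_squared, '\nr_value**2:        ', fit.rvalue ** 2)
```

A value close to 1 means the line leaves little unexplained variation; a value close to 0 means it does barely better than always predicting the mean of y.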

def predict_y_for(x):
    return slope * x + intercept

plt.scatter(x,y)
plt.plot(x, predict_y_for(x), c='r')
plt.show()

Variance and Standard deviation

Get the mean and standard deviation with NumPy

print('Mean: ',np.mean(x),'\nStandard deviation: ',np.std(x))

Mean: 5.04321665207
Standard deviation: 0.972660025762

The variance ‘σ²’ is the average of the squared differences from the mean.
We can find the standard deviation ‘σ’ by taking the square root of the variance.

N = len(x)

mu = sum(x)/N

print('Mean: ',mu)

Mean: 5.04321665207

from math import sqrt

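def calc_std_dev(x):
    N = len(x)
    mu = sum(x)/N # mean of the data, computed here so the function stands alone
    v = 0
    for n in x:
        v += ((n-mu)**2) # accumulate the squared differences from the mean
    pop_variance = v/N # population variance: divide by N
    sigma = sqrt(pop_variance)

    return sigma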

print('Standard deviation: ',calc_std_dev(x))

Standard deviation: 0.9726600257624177

N or N-1, population or sample

The population variance σ², or the average of squared differences, is defined by dividing the sum of squared differences by N, where N = len(population).

σ² = ∑ (x-μ)² / N

The sample variance S² is defined by dividing the sum of squared differences from the sample mean x̄ by N-1, where N = len(sample), for example when working on a training set of data.

S² = ∑ (x-x̄)² / (N-1)
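In NumPy, the choice between N and N-1 is controlled by the ddof ("delta degrees of freedom") argument of np.var and np.std: the divisor is N - ddof, and ddof defaults to 0. A small sketch on a hypothetical array (data is made up for illustration):

```python
import numpy as np

# Hypothetical data (assumption): mean is 5, sum of squared differences is 32.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(data)             # divides by N (ddof=0 is the default)
sample_var = np.var(data, ddof=1)  # divides by N-1

pop_std = np.std(data)             # square root of the population variance
sample_std = np.std(data, ddof=1)  # square root of the sample variance

print('Population variance: ', pop_var, '\nSample variance: ', sample_var)
print('Population std: ', pop_std, '\nSample std: ', sample_std)
```

Here the population variance is 32 / 8 = 4.0 and the sample variance is 32 / 7, so the sample version is always slightly larger. This also means the np.std(x) call earlier in this post returns the population standard deviation.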