Linear Regression with Numpy & Scipy
y = mx + b, What is r-squared, variance, standard deviation…
For our example, let’s create the data set where y is mx + b.
x will be N = 200 values drawn from a normal distribution with a standard deviation σ (sigma) of 1 around a mean value μ (mu) of 5.
The standard deviation ‘σ’ expresses how much the members of a group typically differ from the mean of the group.
The slope ‘m’ will be 3 and the intercept ‘b’ will be 60.
import numpy as np
x = np.random.normal(5.0,1.0,200) # (mean, std. deviation, N)
m = 3
b = 60
y = m * (x + np.random.normal(0,0.2,200)) + b # add Gaussian noise (σ = 0.2) to x for more realistic data
Normal distribution or ‘Gaussian’:
import matplotlib.pyplot as plt
plt.hist(x,50) # histogram of x with 50 bins
plt.show()
The histogram above shows how the values of x are spread around the mean, following our normal distribution.
Let’s visualise our data.
plt.scatter(x,y)
plt.show()
stats.linregress( )
stats.linregress() returns the values of m and b along with r_value, the correlation coefficient between x and y. Squaring it gives r-squared, a value between 0 and 1 that tells us how well our line fits the data, from a bad fit to a good one.
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print('Slope: ',slope,'\nIntercept: ',intercept)
Slope: 2.98104902278
Intercept: 60.1146144847
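To see where these two numbers come from, here is a minimal sketch of the closed-form least-squares formulas, compared with stats.linregress. The fixed seed and the names x_demo, y_demo, slope_manual and intercept_manual are assumptions made for this illustration; they are not part of the example above.

```python
import numpy as np
from scipy import stats

# Illustrative data: seed and variable names are assumptions for this sketch
rng = np.random.default_rng(0)
x_demo = rng.normal(5.0, 1.0, 200)
y_demo = 3 * x_demo + 60 + rng.normal(0, 0.6, 200)

# Least-squares slope: covariance of x and y divided by the variance of x
x_mean, y_mean = x_demo.mean(), y_demo.mean()
slope_manual = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)

# The fitted line always passes through the point (mean of x, mean of y)
intercept_manual = y_mean - slope_manual * x_mean

result = stats.linregress(x_demo, y_demo)
print('Manual: ', slope_manual, intercept_manual)
print('SciPy:  ', result.slope, result.intercept)
```

Both lines should print the same slope and intercept up to floating-point rounding.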
r-squared :
r_value**2
0.96018831950537364
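For a straight-line fit like this one, r-squared can also be read as the share of the variation in y that the line explains. The sketch below computes it as 1 - SS_res / SS_tot and compares it with the squared r value; the seed and the names x_demo, y_demo, ss_res and ss_tot are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

# Illustrative data: seed and variable names are assumptions for this sketch
rng = np.random.default_rng(0)
x_demo = rng.normal(5.0, 1.0, 200)
y_demo = 3 * x_demo + 60 + rng.normal(0, 0.6, 200)

fit = stats.linregress(x_demo, y_demo)
y_hat = fit.slope * x_demo + fit.intercept  # predictions from the fitted line

ss_res = np.sum((y_demo - y_hat) ** 2)          # variation the line does not explain
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation of y around its mean
r_squared_manual = 1 - ss_res / ss_tot

print('Manual r-squared: ', r_squared_manual)
print('r_value squared:  ', fit.rvalue ** 2)
```

The two values should agree up to floating-point rounding.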
def predict_y_for(x):
    return slope * x + intercept
plt.scatter(x,y)
plt.plot(x, predict_y_for(x), c='r')
plt.show()
Variance and Standard deviation
Get the mean and standard deviation with Numpy
print('Mean: ',np.mean(x),'\nStandard deviation: ',np.std(x))
Mean: 5.04321665207
Standard deviation: 0.972660025762
The variance ‘σ²’ is the average of the squared differences from the mean.
We can find the standard deviation ‘σ’ by taking the square root of the variance.
N = len(x)
mu = sum(x)/N
print('Mean: ',mu)
Mean: 5.04321665207
from math import sqrt
def calc_std_dev(x):
    N = len(x)
    mu = sum(x)/N # mean of the values passed in
    v = 0
    for n in x:
        v += ((n-mu)**2) # sum of squared differences from the mean
    pop_variance = v/N
    sigma = sqrt(pop_variance)
    return sigma
print('Standard deviation: ',calc_std_dev(x))
Standard deviation: 0.9726600257624177
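The same calculation can be written without an explicit loop. This is a minimal vectorized sketch using NumPy arrays; the seed and the name x_demo are assumptions made for the illustration.

```python
import numpy as np

# Illustrative data: seed and variable name are assumptions for this sketch
rng = np.random.default_rng(0)
x_demo = rng.normal(5.0, 1.0, 200)

# Average of the squared differences from the mean, then the square root
pop_variance = np.mean((x_demo - x_demo.mean()) ** 2)
sigma = np.sqrt(pop_variance)

print('Vectorized: ', sigma)
print('np.std:     ', np.std(x_demo))
```

Both values should match, since np.std divides by N by default.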
N or N-1, population or sample
The population variance σ², the average of the squared differences from the mean, is defined by dividing the sum of squared differences by N, where N = len(population).
σ² = ∑ (x-μ)² / N
The sample variance S² is defined by dividing the sum of squared differences from the sample mean x̄ by N - 1, where N = len(sample), for example when working on a training set of data.
S² = ∑ (x - x̄)² / (N - 1)
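In NumPy, the choice between N and N - 1 is controlled by the ddof ("delta degrees of freedom") argument of np.var and np.std: the divisor is N - ddof, and ddof defaults to 0. Here is a small sketch on a made-up set of eight values, chosen only because the arithmetic is easy to check by hand (mean 5, sum of squared differences 32).

```python
import numpy as np

# Made-up values for illustration: mean is 5, squared differences sum to 32
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(data)             # ddof=0 (default): divide by N = 8
sample_var = np.var(data, ddof=1)  # ddof=1: divide by N - 1 = 7

print('Population variance: ', pop_var)     # 32 / 8 = 4.0
print('Sample variance:     ', sample_var)  # 32 / 7, about 4.571
print('Sample std. deviation: ', np.std(data, ddof=1))
```

The sample variance is always slightly larger than the population variance computed on the same values, and the gap shrinks as N grows.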