Linear Regression using statsmodels


Linear regression is a predictive model that assumes a linear relationship between the dependent variable and one or more independent variables. In this article, I’ll fit a linear regression model on a synthetic dataset using the Python library “statsmodels.”

Linear Regression

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (the dependent variable) and one or more explanatory variables (the independent variables). With a single explanatory variable the model is called simple linear regression; with more than one it is called multiple linear regression.

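  1. Simple Linear Regression: $y = \beta_0 + \beta_1 x + \varepsilon$
  2. Multiple Linear Regression: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$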
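
To make the difference concrete, here is a minimal sketch that fits both forms with NumPy's least-squares solver. The toy data and variable names are made up for illustration and are not part of the article's dataset; because the data is noise-free, the fit recovers the coefficients used to generate it.

```python
import numpy as np

# Toy, noise-free data (hypothetical values chosen for illustration).
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])

# Simple linear regression: one predictor, y = 1 + 2*x1.
y_simple = 1.0 + 2.0 * x1
X_simple = np.column_stack([np.ones_like(x1), x1])  # intercept column + x1
beta_simple, *_ = np.linalg.lstsq(X_simple, y_simple, rcond=None)

# Multiple linear regression: two predictors, y = 1 + 2*x1 - 3*x2.
y_multiple = 1.0 + 2.0 * x1 - 3.0 * x2
X_multiple = np.column_stack([np.ones_like(x1), x1, x2])
beta_multiple, *_ = np.linalg.lstsq(X_multiple, y_multiple, rcond=None)

print("simple:  ", beta_simple)    # intercept, slope
print("multiple:", beta_multiple)  # intercept, slope for x1, slope for x2
```

The only structural change between the two fits is the number of columns in the design matrix.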

Are multiple and multivariate regression really different?
Multiple regression (also called multivariable regression) involves one dependent variable and multiple independent variables: $y = f(x_1, \dots, x_k)$. Multivariate regression involves multiple dependent variables and multiple independent variables: $(y_1, \dots, y_m) = f(x_1, \dots, x_k)$.
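
In code, the distinction shows up in the shape of the response. The sketch below uses made-up, noise-free data with two response columns; the least-squares solver then returns one column of coefficients per response. The names `B_true` and `B_hat` are illustrative only.

```python
import numpy as np

# Design matrix: intercept column plus two predictors (hypothetical values).
X = np.column_stack([
    np.ones(5),
    [0.0, 1.0, 2.0, 3.0, 4.0],
    [1.0, 0.0, 2.0, 1.0, 3.0],
])

# Multivariate case: two dependent variables, so the coefficients form a
# 3 x 2 matrix (one column of coefficients per response).
B_true = np.array([
    [1.0, 0.5],
    [2.0, -1.0],
    [-3.0, 4.0],
])
Y = X @ B_true  # shape (5, 2): five observations, two responses

B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(B_hat.shape)
print(B_hat)
```

A multiple regression is the special case where `Y` has a single column.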

The OLS Assumptions

In this tutorial I group the assumptions of ordinary least squares (OLS) into five. You should know all of them, and check them, before you undertake regression analysis.

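  • Linearity: the dependent variable is a linear function of the independent variables plus an error term.
  • No endogeneity: the independent variables are not correlated with the error term.
  • Normality and homoscedasticity: the error term is normally distributed with a mean of zero and a constant variance.
  • No autocorrelation: the error terms of different observations are not correlated with each other.
  • No multicollinearity: no independent variable is a (near-)perfect linear combination of the others.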

These are the most important OLS assumptions for regression analysis.
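
As one example of checking an assumption, the sketch below looks at multicollinearity between the two predictors of the dataset introduced in this article, using the correlation coefficient and the variance inflation factor (VIF). The helper `vif` is a hand-written illustration in NumPy, not a statsmodels function, and the common rule of thumb that a VIF above roughly 5 to 10 deserves attention is a convention rather than a hard threshold.

```python
import numpy as np

# Predictor columns copied from the article's table.
interest = np.array([2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25,
                     2.0, 2.0, 2.0, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75,
                     1.75, 1.75, 1.75, 1.75, 1.75])
unemployment = np.array([5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6,
                         5.7, 5.7, 6.0, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1,
                         6.1, 5.9, 6.2, 6.2, 6.1])


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2), regressing column j on the others."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])  # add intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)


X = np.column_stack([interest, unemployment])
corr = np.corrcoef(interest, unemployment)[0, 1]
vifs = [vif(X, j) for j in range(X.shape[1])]

print("correlation between predictors:", corr)
print("VIF per predictor:", vifs)
```

With only two predictors, both VIFs are equal to 1 / (1 - r²), where r is the correlation between them.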


Let’s make a synthetic stock dataset for demonstration purposes.

year month interest unemployment price
2021 12 2.75 5.3 1464
2021 11 2.50 5.3 1394
2021 10 2.50 5.3 1357
2021 9 2.50 5.3 1293
2021 8 2.50 5.4 1256
2021 7 2.50 5.6 1254
2021 6 2.50 5.5 1234
2021 5 2.25 5.5 1195
2021 4 2.25 5.5 1159
2021 3 2.25 5.6 1167
2021 2 2.00 5.7 1130
2021 1 2.00 5.7 1130
2020 12 2.00 6.0 1047
2020 11 1.75 5.9 965
2020 10 1.75 5.8 943
2020 9 1.75 6.1 958
2020 8 1.75 6.2 971
2020 7 1.75 6.1 949
2020 6 1.75 6.1 884
2020 5 1.75 6.1 866
2020 4 1.75 5.9 876
2020 3 1.75 6.2 822
2020 2 1.75 6.2 704
2020 1 1.75 6.1 719
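
The regression script in this article reads the data from `./stock.csv`. One way to create that file from the table is sketched below with Python's standard `csv` module; the file name and the column names are taken from the script, and writing to the current directory is an assumption.

```python
import csv

# Rows copied from the table: year, month, interest, unemployment, price.
rows = [
    (2021, 12, 2.75, 5.3, 1464), (2021, 11, 2.50, 5.3, 1394),
    (2021, 10, 2.50, 5.3, 1357), (2021, 9, 2.50, 5.3, 1293),
    (2021, 8, 2.50, 5.4, 1256), (2021, 7, 2.50, 5.6, 1254),
    (2021, 6, 2.50, 5.5, 1234), (2021, 5, 2.25, 5.5, 1195),
    (2021, 4, 2.25, 5.5, 1159), (2021, 3, 2.25, 5.6, 1167),
    (2021, 2, 2.00, 5.7, 1130), (2021, 1, 2.00, 5.7, 1130),
    (2020, 12, 2.00, 6.0, 1047), (2020, 11, 1.75, 5.9, 965),
    (2020, 10, 1.75, 5.8, 943), (2020, 9, 1.75, 6.1, 958),
    (2020, 8, 1.75, 6.2, 971), (2020, 7, 1.75, 6.1, 949),
    (2020, 6, 1.75, 6.1, 884), (2020, 5, 1.75, 6.1, 866),
    (2020, 4, 1.75, 5.9, 876), (2020, 3, 1.75, 6.2, 822),
    (2020, 2, 1.75, 6.2, 704), (2020, 1, 1.75, 6.1, 719),
]

header = ["year", "month", "interest", "unemployment", "price"]

# newline="" lets the csv module control line endings itself.
with open("stock.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

print(f"wrote {len(rows)} rows to stock.csv")
```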

The objective is to estimate the stock price from two independent variables: the interest rate and the unemployment rate. The Python code below fits a multiple linear regression to this data.

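import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler


def get_data():
    # Load the dataset and split it into predictors and target
    df = pd.read_csv("./stock.csv")
    features = df[["interest", "unemployment"]]
    labels = df["price"]
    return features, labels


if __name__ == "__main__":
    features, labels = get_data()

    # Rescale each predictor to the [0, 1] range; the result is a NumPy array,
    # which is why the summary labels the predictors x1 and x2
    scaler = MinMaxScaler()
    features = scaler.fit_transform(features)

    # statsmodels does not add an intercept automatically
    features = sm.add_constant(features)
    model = sm.OLS(labels, features).fit()
    predictions = model.predict(features)  # in-sample fitted values

    # Print the regression table
    print(model.summary())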

When you run the Python code, you’ll get the following result.

                            OLS Regression Results
Dep. Variable:                  price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sat, 28 Aug 2021   Prob (F-statistic):           4.04e-11
Time:                        23:52:11   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
const       1077.3223     91.490     11.775      0.000     887.059    1267.586
x1           345.5401    111.367      3.103      0.005     113.940     577.140
x2          -225.1319    106.155     -2.121      0.046    -445.893      -4.371
Omnibus:                        2.691   Durbin-Watson:                   0.530
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551
Skew:                          -0.612   Prob(JB):                        0.461
Kurtosis:                       3.226   Cond. No.                         14.4

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


  • F-statistic compares the fitted model against a model in which the effect of every variable is set to 0 (an intercept-only model), to test whether the variables are jointly statistically significant.
  • Prob (F-statistic) is the p-value of that test: the probability of observing an F-statistic at least this large if the null hypothesis (the effect of every variable is 0) were true. A very small value, as here, is evidence against the null hypothesis.
  • AIC and BIC are used to compare candidate models. Both reward goodness of fit and penalize the number of parameters; lower values are better, which makes them useful for feature selection.
  • R-squared measures how much of the variation in the dependent variable is explained by the independent variables. It ranges from 0 to 1, with a greater value indicating a closer fit.
  • Adj. R-squared penalizes R-squared for the number of variables in the model. An adjusted score that is noticeably lower than R-squared may be telling you that some variables contribute little to the fit.
  • x1 (interest) coefficient represents the change in the predicted price for a one-unit change in x1, everything else held constant. Because the features were min-max scaled, one unit corresponds to moving the interest rate from its lowest to its highest observed value.
  • x2 (unemployment) coefficient represents the change in the predicted price for a one-unit change in x2, everything else held constant. Again, one unit spans the full observed range of the unemployment rate.
  • std err is the estimated standard deviation of each coefficient estimate. The lower the number, the more precisely the coefficient is estimated.
  • P>|t| is the p-value for the null hypothesis that the coefficient is 0. By convention, a p-value below 0.05 is treated as statistically significant.
  • [0.025 0.975] gives the lower and upper bounds of the 95 percent confidence interval for each coefficient.
  • Omnibus is a test of the normality of the residuals, based on their skew and kurtosis. A 0 would indicate perfect normality.
  • Prob(Omnibus) is the p-value of the Omnibus test. Values close to 1 are consistent with normally distributed residuals; small values suggest the residuals are not normal.
  • Skew is a measurement of the symmetry of the residuals, with 0 being perfect symmetry.
  • Kurtosis measures how heavy the tails of the residual distribution are. A normal distribution has a kurtosis of 3; higher kurtosis implies heavier tails and therefore more outliers.
  • Durbin-Watson falls between 0 and 4. A value of 2 denotes no autocorrelation in the residuals, and a value lower than 1 or higher than 3 is cause for alarm. The value of 0.530 here points to positive autocorrelation.
  • Jarque-Bera (JB) and Prob(JB) are an alternative normality test for the residuals, also based on skewness and kurtosis.
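
Several of these statistics follow directly from the residuals, so they can be reproduced by hand. The sketch below is an illustration in plain NumPy (not how statsmodels computes them internally): it refits the model on the article's table with min-max scaled features and applies the textbook formulas for R-squared, adjusted R-squared, the F-statistic and Durbin-Watson, where n is the number of observations and k the number of predictors.

```python
import numpy as np

# Columns copied from the article's table.
interest = np.array([2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25,
                     2.0, 2.0, 2.0, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75,
                     1.75, 1.75, 1.75, 1.75, 1.75])
unemployment = np.array([5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6,
                         5.7, 5.7, 6.0, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1,
                         6.1, 5.9, 6.2, 6.2, 6.1])
price = np.array([1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159,
                  1167, 1130, 1130, 1047, 965, 943, 958, 971, 949, 884,
                  866, 876, 822, 704, 719], dtype=float)

# Min-max scale the predictors and add an intercept column.
X = np.column_stack([interest, unemployment])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
A = np.column_stack([np.ones(len(price)), X_scaled])

# Ordinary least squares fit and residuals.
beta, *_ = np.linalg.lstsq(A, price, rcond=None)
resid = price - A @ beta

n = len(price)       # number of observations
k = A.shape[1] - 1   # number of predictors (intercept excluded)

ss_res = resid @ resid                         # residual sum of squares
ss_tot = np.sum((price - price.mean()) ** 2)   # total sum of squares

r2 = 1.0 - ss_res / ss_tot
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
durbin_watson = np.sum(np.diff(resid) ** 2) / ss_res

print("R-squared:     ", r2)
print("Adj. R-squared:", adj_r2)
print("F-statistic:   ", f_stat)
print("Durbin-Watson: ", durbin_watson)
```

The adjusted R-squared is always below R-squared when the model has at least one predictor and the fit is not perfect, which is the penalty described above.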

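Recall that the equation for multiple linear regression is $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$. Plugging in the coefficients from the summary, our example looks like this:

$\widehat{\text{price}} = 1077.3223 + 345.5401\,x_1 - 225.1319\,x_2$

where $x_1$ and $x_2$ are the min-max scaled interest and unemployment rates.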
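
As a small worked sketch, the fitted equation can be evaluated by hand. The coefficients below are copied from the summary table, and the minimum and maximum of each feature are read off the dataset table; the function name `predict_price` is made up for illustration. Raw inputs have to be min-max scaled first, because the model was fitted on scaled features.

```python
# Coefficients copied from the summary table (const, x1, x2).
CONST, B_INTEREST, B_UNEMPLOYMENT = 1077.3223, 345.5401, -225.1319

# Minimum and maximum of each feature in the dataset table;
# min-max scaling maps them to 0 and 1 respectively.
INTEREST_MIN, INTEREST_MAX = 1.75, 2.75
UNEMPLOYMENT_MIN, UNEMPLOYMENT_MAX = 5.3, 6.2


def predict_price(interest, unemployment):
    """Evaluate the fitted equation for raw (unscaled) inputs."""
    x1 = (interest - INTEREST_MIN) / (INTEREST_MAX - INTEREST_MIN)
    x2 = (unemployment - UNEMPLOYMENT_MIN) / (UNEMPLOYMENT_MAX - UNEMPLOYMENT_MIN)
    return CONST + B_INTEREST * x1 + B_UNEMPLOYMENT * x2


# Highest interest rate, lowest unemployment rate: x1 = 1, x2 = 0,
# so the prediction is const + x1 coefficient.
print(predict_price(2.75, 5.3))

# Lowest interest rate, lowest unemployment rate: x1 = 0, x2 = 0,
# so the prediction is just the constant.
print(predict_price(1.75, 5.3))
```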


Most simple and multiple linear regression models are fitted with OLS. I hope this post clarified some topics for you, and I look forward to hearing from you in the comments section. Happy statistics!



