The Hows and Whys of Regression Analysis


Machine learning experts have borrowed the methods of regression analysis from statistics because they make it possible to form predictions from as little as one known variable (as well as from multiple variables). They’re useful in financial analysis, weather forecasting, medical diagnosis, and many other fields.

What’s Regression in Statistics?

Regression analysis determines the relationship between one dependent variable and a set of independent variables. This sounds a bit complicated, so let’s look at an example.

Imagine you run your own restaurant. You have a waiter who receives tips. The size of a tip usually correlates with the total bill for the meal: the bigger the tip, the more expensive the meal was.

You have a list of orders and the tips received for them. If you tried to reconstruct how large each meal was (the dependent variable) using just the tip data (the independent variable), this would be an example of simple linear regression analysis.

(This example was borrowed from the magnificent video by Brandon Foltz.)

A similar case would be trying to predict how much an apartment will cost based just on its size. The estimate won’t be perfect, but a larger apartment will usually cost more than a smaller one.

To be honest, simple linear regression isn’t the only type of regression in machine learning — and not even the most practical one. However, it’s the easiest to understand.

Representation of the Regression Model

The representation of a linear regression model is a linear equation:

Y = a + bX

In this equation, Y is the value we’re trying to predict and X is the independent input value. As for the parameters: b is the coefficient that each input value is multiplied by, and a (the intercept coefficient) is the constant term added at the end. Changing b changes the slope of the line, while changing a moves the line up and down the y-axis.
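
To make this concrete, here’s a minimal sketch in Python; the coefficient values are made up purely for illustration:

```python
# The line Y = a + bX with invented coefficients.
def predict(x, a=2.0, b=0.5):
    """Return the predicted Y for a given X using Y = a + bX."""
    return a + b * x

# Changing b tilts the line; changing a shifts it up or down the y-axis.
print(predict(10))          # 2.0 + 0.5 * 10 = 7.0
print(predict(10, a=5.0))   # same slope, line shifted up: 10.0
print(predict(10, b=1.0))   # steeper slope: 12.0
```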

Types of Regression for ML

Let’s now look at the common types of linear regression analysis used in machine learning. There are five basic techniques that we’re going to have a look at here. Other regression models exist, but they’re not so commonly used.

Simple linear regression

Simple linear regression uses one independent variable to explain or predict the outcome.

For example, you have a table with the sample data concerning the temperature of cables and their durability. Now, you can do simple linear regression to create a model that can predict the durability of a cable based on its temperature.

The predictions you make with simple regression will usually be rather inaccurate. A cable’s durability depends on much more than temperature alone: wear, the weight of the carriage, humidity, and other factors. That’s why simple linear regression isn’t usually used to solve real-life tasks.
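
Still, a simple regression is only a few lines of code. Here’s a rough sketch using scikit-learn; the temperature and durability numbers are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented sample data: cable temperature (°C) and a durability score.
temperature = np.array([[20], [40], [60], [80], [100]])  # independent variable
durability = np.array([95, 88, 80, 71, 60])              # dependent variable

model = LinearRegression().fit(temperature, durability)
print("intercept a:", model.intercept_)
print("slope b:", model.coef_[0])

# Predict the durability of a cable running at 70 °C.
print("prediction at 70 °C:", model.predict([[70]])[0])
```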

Multiple linear regression for machine learning

Unlike simple linear regression, *multiple linear regression* uses several explanatory variables to predict the value of the response (dependent) variable.

A multiple linear regression model looks like this:

Y = a + b₁X₁ + b₂X₂ + b₃X₃ + … + bₙXₙ

Here, Y is the variable you’re trying to predict, the X’s are the variables you’re using to predict Y, a is the intercept, and the b’s are the regression coefficients: they show how much a change in a given X predicts a change in Y, everything else being equal.
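
In code, the only difference from the simple case is that the input has several columns. A minimal sketch, again with invented numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: two predictors (X1, X2) and the outcome Y.
X = np.array([[1.0, 3.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 2.0],
              [5.0, 5.0]])
y = np.array([6.1, 5.9, 10.2, 9.8, 14.1])

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)
print("coefficients b1, b2:", model.coef_)

# Predict Y for a new observation with X1 = 6 and X2 = 3.
print(model.predict([[6.0, 3.0]])[0])
```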

In real life, multiple regression can be used by ML-powered algorithms to predict the price of stocks based on fluctuations in similar stocks.

However, it would be erroneous to say that the more variables you have, the more accurate your ML prediction is.

Problems with multiple linear regression

Two possible problems arise with the use of multiple regression: overfitting and multicollinearity.

  • Overfitting means that the model you build with multiple regression becomes too tailored to the training data and doesn’t generalize well. It works OK on the training set of your machine learning model but performs poorly on data it hasn’t seen before.

  • Multicollinearity describes the situation when there’s correlation not only between the independent variables and the dependent variable but also between the independent variables themselves. We don’t want this to happen because it leads to misleading results for the model; the sketch below shows a quick way to spot it.
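
One simple way to check for multicollinearity is to look at the pairwise correlations between the independent variables. A sketch with synthetic data:

```python
import numpy as np

# Synthetic predictors: x2 is almost a rescaled copy of x1 (multicollinearity),
# while x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=100)
x3 = rng.normal(size=100)

# Correlation matrix of the independent variables (rows are variables).
corr = np.corrcoef(np.stack([x1, x2, x3]))
print(np.round(corr, 2))
# An off-diagonal value close to ±1 (here corr[0, 1]) signals that
# two predictors carry almost the same information.
```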

To conduct this type of analysis properly, you need to carefully prepare your data. We’re going to talk about that later in this post.

Ordinary least squares

Another method of linear regression is ordinary least squares. This procedure helps you find the optimal line for a set of data points by minimizing the sum of the squared residuals.

Every data point represents the relationship between an independent variable and a dependent variable (that we’re trying to predict).

To represent the regression visually, you start by plotting the data points and then draw the line that has the smallest sum of squared distances (residuals) between itself and the data points. In ordinary least squares, this line is found analytically: the partial derivatives of the squared-error function are set to zero and solved for the coefficients.
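
For the simple one-variable case, that closed-form solution fits in a few lines of numpy. A sketch with made-up points:

```python
import numpy as np

# Made-up data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Setting the partial derivatives of the squared error to zero
# gives the classic closed-form solution for Y = a + bX.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print("intercept a:", a)
print("slope b:", b)
print("sum of squared residuals:", np.sum((y - (a + b * x)) ** 2))
```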

Gradient descent

*Gradient descent* is used for the optimization and fine-tuning of models.

Gradient descent is the process of finding an approximate local minimum of a function by repeatedly adjusting the parameters in the direction that makes the function’s value smaller.

In the context of linear regression, we can use it to iteratively find the line with the smallest sum of squared residuals without solving for the optimal coefficients analytically.

We start with random values for each parameter of the model and calculate the sum of squared errors. Then, we iteratively update the parameters so that the sum of squared errors becomes smaller than with the previous values. We do this until the sum doesn’t decrease anymore. At that point, gradient descent has converged, and the parameters we have should correspond to a local minimum.

When you apply this technique, you need to choose a learning rate that determines the size of the improvement step to take on each iteration of the procedure. The process is repeated until a minimum sum squared error is achieved or no further improvement is possible.

The learning rate is an important concept in gradient descent. It describes the size of each step. With a high learning rate you cover more ground per step, but you risk overshooting the minimum and losing accuracy; if the step is large enough, the algorithm can fail to converge at all. A low learning rate is more precise, but it requires so many updates that the process becomes slow, and the extra accuracy is usually not worth the time gradient descent takes.

In practice, gradient descent is useful when you have a lot of variables or data points, since calculating the answer might be expensive. In most situations, it will result in a line comparable to one drawn by OLS.
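
Here’s a bare-bones sketch of that loop for the simple one-variable line, with made-up data and an arbitrarily chosen learning rate:

```python
import numpy as np

# Made-up data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

a, b = 0.0, 0.0        # start from arbitrary parameter values
learning_rate = 0.01   # size of each update step

for _ in range(10_000):
    error = (a + b * x) - y
    # Gradients of the mean squared error with respect to a and b.
    grad_a = 2.0 * error.mean()
    grad_b = 2.0 * (error * x).mean()
    # Move the parameters in the direction that reduces the error.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print("intercept a:", a)
print("slope b:", b)  # close to the coefficients ordinary least squares would give
```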

Regularization

This linear regression technique tries to reduce the complexity of the model by adding restrictions or prior assumptions that help to avoid overfitting.

These regularization methods help when there is multicollinearity between your independent variables and using the ordinary least squares method causes overfitting:

  • Lasso regression: Ordinary least squares is changed to also minimize the sum of the absolute values of the coefficients (called L1 regularization). In the case of overfitting, we frequently get very large coefficients. We can avoid this by minimizing not only the sum of the squared errors but also some function of the coefficients.

  • Ridge regression: Ordinary least squares is changed to also minimize the sum of the squared coefficients (called L2 regularization).
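
Both are available out of the box in scikit-learn; here’s a sketch with synthetic, nearly redundant predictors (the alpha parameter controls how strongly the coefficients are penalized):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with two highly correlated predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)      # nearly redundant copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=100)

# L1 regularization tends to zero out coefficients; L2 shrinks them toward zero.
print("lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_)
print("ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
```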

Data Preparation and Making Predictions With Regression

Now let’s see, step by step, how you approach a regression problem in ML.

1. Generate a list of potential variables

Analyze your problem, and come up with potential independent variables that will help you to predict the dependent variable. For example, you can use regression to predict the impact of the product price and the marketing budget on sales.

2. Collect data on the variables

Now it’s time to collect historical data samples. Every company keeps track of the sales, marketing budget, and prices of all the products they make. For our regression model, we need a dataset with these three columns: the price, the marketing budget, and the number of sales for each period.

3. Check the relationship between each independent variable and the dependent variable using scatter plots and correlations

Placing the data points on a scatter plot is an intuitive way to see whether there’s a linear relationship between the variables. I used the linear regression calculator on Alcula.com, but you can use any tool you like.

Let’s start with the relationship between the price and the number of sales.

We can fit a line to the observed data.
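
If you’d rather stay in Python than use an online calculator, a matplotlib scatter plot with a fitted line does the same job. The price and sales numbers below are placeholders, not the article’s actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: product price vs. number of sales.
price = np.array([10, 12, 15, 18, 20, 22, 25, 30])
sales = np.array([200, 190, 180, 170, 175, 150, 140, 120])

# Fit a straight line and draw it over the scatter plot.
b, a = np.polyfit(price, sales, deg=1)   # slope, then intercept
plt.scatter(price, sales)
plt.plot(price, a + b * price, color="red")
plt.xlabel("price")
plt.ylabel("sales")
plt.show()
```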

Now we need to check the correlation between the variables. For that, I used an online calculator.

The correlation equals -0.441. This is called a negative correlation: one variable increases when the other one decreases. The higher the price, the lower the number of sales.

However, we also want to check the relationship between the money we invested in marketing and the number of sales.

Here’s how our data points look on a scatter plot.

We can see that there’s a clear correlation between the marketing budget and the number of sold items.

Indeed, when we calculate the coefficient (again, using Alcula.com), we get 0.967. The closer the coefficient is to 1, the stronger the correlation between the variables. In this case, we see a strong positive correlation.
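
The same coefficients can be computed programmatically. A sketch using the placeholder numbers from the plot above:

```python
import numpy as np

# Placeholder numbers, not the article’s actual data.
price = np.array([10, 12, 15, 18, 20, 22, 25, 30])
sales = np.array([200, 190, 180, 170, 175, 150, 140, 120])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(price, sales)[0, 1]
print(round(r, 3))   # negative here: sales tend to fall as the price rises
```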

4. Check the relationship between the independent variables

An important step to build an accurate model is to check that there’s no correlation between the independent variables. Otherwise, we won’t be able to tell which factor affects the output, and our efforts will be pointless.

Oh no: there is, indeed, a correlation between our two independent variables. What should we do?

5. Use nonredundant independent variables in analysis to discover the best fitting model

If you find yourself in a situation where there’s a correlation between two independent variables, you’re at risk. In specialized lingo, such variables are called redundant. If the redundancy is moderate, it may only affect the interpretation of the results. More often, though, redundant variables add noise to your model. Some people believe that redundant variables are pure evil, and I can’t blame them.

So in our case, we won’t use both of our variables for our predictions. Using our scatter plots, we can see a strong correlation between the marketing budget and the sales, so we’ll use the marketing budget variable for our model.

6. Use the ML model to make predictions

My example was greatly simplified. In real life, you’ll probably have more than two variables to make predictions with. You can use this plan to get rid of redundant or useless variables; conduct these steps as many times as you need.

Now, you’re ready to create a linear regression model using machine learning.
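
Once the redundant variable has been dropped, the final model is just a regression on the remaining predictor. A sketch with placeholder budget and sales figures:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: marketing budget (in thousands) and number of sales.
budget = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [4.0], [5.0]])
sales = np.array([120, 150, 170, 185, 210, 260, 310])

model = LinearRegression().fit(budget, sales)

# Predict the number of sales for a planned budget of 3.5 (thousand).
print(model.predict([[3.5]])[0])
```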

Machine Learning Resources That Mention Regression Models/Algorithms

If you’d like to learn more about regression, check out these valuable resources:

  • Statistics 101 by Brandon Foltz. This YouTube channel aims to help beginners in ML and desperate first-year students understand the most important concepts of statistics. Brandon takes it slow and always provides real-life examples along with his explanations. He has a whole series dedicated to different regression methods and related concepts.

  • StatQuest with Josh Starmer. The always-funny Josh Starmer will make you fall in love with statistics. You’ll learn how to apply regression and other ML models to real-life situations.

  • Machine Learning Mastery wouldn’t be a machine learning–mastery blog if it didn’t provide detailed information about different machine learning topics, including regression.

  • The Serokell blog. We regularly publish new materials about artificial intelligence and machine learning algorithms.