*Linear Regression Intuition:*

Linear regression is used throughout finance in a wide range of applications. In previous tutorials, we calculated a company's beta relative to an index using the ordinary least squares (OLS) method. Now, we will use linear regression to estimate stock prices.

Linear regression is a method used to model the relationship between a dependent variable (y) and an independent variable (x). Simple linear regression has exactly one independent variable; a model with several independent variables is called multiple linear regression. Here we have only one independent variable, the date. Each date is represented by an integer, starting at 1 for the first date and going up to the length of the vector of dates, which varies with the time series data. Our dependent variable is the price of a stock. To understand linear regression, you need a fairly elementary equation you probably learned early on in school.

*y = a + bx*

*Where:*

- y = the predicted value, or dependent variable
- b = the slope of the line (the coefficient on x)
- x = the independent variable
- a = the y-intercept
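As a small illustration of the equation, the sketch below encodes a few dates as integers and evaluates the line at each one. The dates and the values of `a` and `b` are made up for the example; they are not estimated from any real stock.

```python
import datetime

import numpy as np

# Hypothetical trading dates, used only for illustration
trading_dates = [
    datetime.date(2017, 1, 3),
    datetime.date(2017, 1, 4),
    datetime.date(2017, 1, 5),
    datetime.date(2017, 1, 6),
]

# Independent variable: each date replaced by its position, 1 through n
x = np.arange(1, len(trading_dates) + 1)

# Assumed coefficients (not fitted to data)
a = 100.0  # y-intercept: predicted price at x = 0
b = 0.5    # slope: change in predicted price per one-step increase in x

# Dependent variable predicted by the line y = a + bx
y = a + b * x

print(x.tolist())  # [1, 2, 3, 4]
print(y.tolist())  # [100.5, 101.0, 101.5, 102.0]
```

With a slope of 0.5, each additional date adds 0.5 to the predicted price.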

Essentially, this equation will be our line of best fit on the data. Conceptually, you can picture many candidate lines being drawn through the dataset. The goal of OLS is to find the line that minimizes the sum of squared errors (SSE) between the actual stock price (y) and our predicted stock price over all the points in our dataset. This is represented by the figure below. For each candidate line, there is a difference between each point in the dataset and its corresponding predicted value output by the model. Each of these differences is squared, and the squares are added up to produce the sum of squares. The line with the smallest sum is our line of best fit. In practice, OLS does not try lines one by one; it solves for the minimizing slope and intercept directly.
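The SSE idea can be sketched numerically. The example below uses made-up data and the standard closed-form estimates for simple linear regression, then checks that nudging the fitted line in any direction increases the SSE. The helper `sse` is a hypothetical name introduced for this illustration.

```python
import numpy as np


def sse(a, b, x, y):
    """Sum of squared errors of the line y = a + b*x on the data."""
    residuals = y - (a + b * x)
    return float(np.sum(residuals ** 2))


# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form OLS estimates for one independent variable
x_mean, y_mean = x.mean(), y.mean()
b_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a_hat = y_mean - b_hat * x_mean

best = sse(a_hat, b_hat, x, y)
print("slope:", round(b_hat, 4), "intercept:", round(a_hat, 4), "SSE:", round(best, 4))

# Perturbed lines all fit worse than the OLS line
for da, db in [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.05)]:
    worse = sse(a_hat + da, b_hat + db, x, y)
    print("shifted by", (da, db), "-> SSE", round(worse, 4), "| larger:", worse > best)
```

This is why the "many lines" picture is only an intuition: the minimizing slope and intercept can be computed in one step from the data.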

**Part One: Getting the Data:**

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import quandl
import datetime

style.use('ggplot')

# Dates
start_date = datetime.date(2017, 1, 3)
end_date = datetime.date.today()

quandl.ApiConfig.api_key = "your key here"

# Get data from Quandl
df = quandl.get('WIKI/AAP.4', start_date=start_date, end_date=end_date, collapse="daily")
df = df.reset_index()
prices = df['Close'].tolist()
dates = df.index.tolist()

# Convert to column vectors of shape (n, 1), as scikit-learn expects
dates = np.reshape(dates, (len(dates), 1))
prices = np.reshape(prices, (len(prices), 1))
```

*Part Two: Creating a Regressor Object:*

```python
# Define linear regressor object
regressor = LinearRegression()
regressor.fit(dates, prices)

# Visualize results
plt.scatter(dates, prices, color='yellow', label='Actual Price')  # plot the initial data points
plt.plot(dates, regressor.predict(dates), color='red', linewidth=3, label='Predicted Price')  # plot the line made by linear regression
plt.title('Linear Regression | Time vs. Price')
plt.legend()
plt.xlabel('Date Integer')
plt.show()

# Predict price on a given date
date = 10
# predict() expects a 2D array of shape (n_samples, n_features), not a bare integer
predicted_price = regressor.predict(np.array([[date]]))
print(predicted_price[0][0], regressor.coef_[0][0], regressor.intercept_[0])
```

*Output:*

*Predicted Price on Date Input:*
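To see how the three printed values relate, the sketch below fits the same kind of model to synthetic prices (a known line plus random noise, not market data) and confirms that the prediction is the intercept plus the coefficient times the date integer. The true slope 0.25 and intercept 50 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic prices (not market data): a known line plus small noise
rng = np.random.default_rng(0)
dates = np.arange(1, 101).reshape(-1, 1)
prices = 50.0 + 0.25 * dates + rng.normal(0.0, 0.5, size=dates.shape)

regressor = LinearRegression()
regressor.fit(dates, prices)

a = regressor.intercept_[0]  # fitted y-intercept
b = regressor.coef_[0][0]    # fitted slope

date = 10
predicted = regressor.predict(np.array([[date]]))[0][0]
manual = a + b * date  # y = a + bx computed by hand

print("predict():", predicted)
print("a + b*x:  ", manual)
print("match:", bool(np.isclose(predicted, manual)))
```

The fitted `a` and `b` should land close to the values used to generate the data, and the two predictions agree because `predict` simply evaluates the fitted line.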

**Bonus: Creating train/test sets**

```python
# Split the dataset into the training set and test set
xtrain, xtest, ytrain, ytest = train_test_split(dates, prices, test_size=0.33, random_state=42)

regressor.fit(xtrain, ytrain)

# Train set graph
plt.scatter(xtrain, ytrain, color='yellow', label='Actual Price')  # plot the initial data points
plt.plot(xtrain, regressor.predict(xtrain), color='blue', linewidth=3, label='Predicted Price')  # plot the line made by linear regression
plt.title('Linear Regression | Time vs. Price')
plt.legend()
plt.xlabel('Date Integer')
plt.show()

# Test set graph
plt.scatter(xtest, ytest, color='yellow', label='Actual Price')  # plot the initial data points
plt.plot(xtest, regressor.predict(xtest), color='blue', linewidth=3, label='Predicted Price')  # plot the line made by linear regression
plt.title('Linear Regression | Time vs. Price')
plt.legend()
plt.xlabel('Date Integer')
plt.show()
```

*Output:*

*Train Set:*

*Test Set:*
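Beyond comparing the two graphs by eye, the fit on each set can be summarized with a number. The sketch below is one possible way to do that, using synthetic prices (assumed slope 0.1 and intercept 80, not market data) and the R² value returned by `LinearRegression.score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic prices (not market data): a known line plus noise
rng = np.random.default_rng(1)
dates = np.arange(1, 201).reshape(-1, 1)
prices = 80.0 + 0.1 * dates + rng.normal(0.0, 1.0, size=dates.shape)

# Same split settings as the tutorial code
xtrain, xtest, ytrain, ytest = train_test_split(
    dates, prices, test_size=0.33, random_state=42
)

regressor = LinearRegression()
regressor.fit(xtrain, ytrain)

# score() returns R^2: 1.0 is a perfect fit, 0.0 is no better than the mean
train_r2 = regressor.score(xtrain, ytrain)
test_r2 = regressor.score(xtest, ytest)

print("train R^2:", round(train_r2, 3))
print("test R^2: ", round(test_r2, 3))
```

Because `train_test_split` shuffles by default, the test dates are interleaved with the training dates, so this measures how well the line fills gaps rather than how well it forecasts. If forecasting is the goal, passing `shuffle=False` keeps the test set strictly after the training set.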