Sklearn Linear Regression (Step-By-Step Explanation)

Scikit-learn (Sklearn) is Python's most useful and robust machine learning package. It offers a set of fast tools for machine learning and statistical modeling, such as classification, regression, clustering, and dimensionality reduction, via a Python interface. This mostly Python-written package is based on NumPy, SciPy, and Matplotlib. In this article you’ll understand more about sklearn linear regression.

What is SKlearn Linear Regression?

Scikit-learn is a Python package that makes it easier to apply a variety of Machine Learning (ML) algorithms for predictive data analysis, such as linear regression.

Linear regression is defined as the process of determining the straight line that best fits a set of dispersed data points:

The line can then be projected to forecast fresh data points. Because of its simplicity and essential features, linear regression is a fundamental Machine Learning method.

Sklearn Linear Regression Concepts

When working with scikit-linear learn's regression approach, you will encounter the following fundamental concepts:

Best Fit - The straight line in a plot that minimizes the divergence between related dispersed data points
Coefficient - Also known as a parameter, is the factor that is multiplied by a variable. A coefficient in linear regression represents changes in a Response Variable
Coefficient of Determination - It is the correlation coefficient. In a regression, this term is used to define the precision or degree of fit
Correlation - the measurable intensity and degree of association between two variables, often known as the 'degree of correlation.' The values range from -1.0 to 1.0
Dependent Feature - A variable represented as y in the slope equation y=ax+b. Also referred to as an Output or a Response
Estimated Regression Line - the straight line that best fits a set of randomly distributed data points
Independent Feature - a variable represented by the letter x in the slope equation y=ax+b. Also referred to as an Input or a predictor
Intercept - It is the point at where the slope intersects the Y-axis, indicated by the letter b in the slope equation y=ax+b
Least Squares - a method for calculating the best fit to data by minimizing the sum of the squares of the discrepancies between observed and estimated values
Mean - an average of a group of numbers; nevertheless, in linear regression, Mean is represented by a linear function
OLS (Ordinary Least Squares Regression) - sometimes known as Linear Regression.
Residual - the vertical distance between a data point and the regression line
Regression - is an assessment of a variable's predicted change in relation to changes in other variables
Regression Model - The optimum formula for approximating a regression
Response Variables - This category covers both the Predicted Response (the value predicted by the regression) and the Actual Response (the actual value of the data point)
Slope - the steepness of a regression line. The linear relationship between two variables may be defined using slope and intercept: y=ax+b
Simple linear regression - A linear regression with a single independent variable

How to Create a Sklearn Linear Regression Model

Step 1: Importing All the Required Libraries

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn import preprocessing, svm

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

Step 2: Reading the Dataset

cd C:\Users\Dev\Desktop\Kaggle\Salinity

# Changing the file read location to the location of the dataset

df = pd.read_csv('bottle.csv')

df_binary = df[['Salnty', 'T_degC']]

# Taking only the selected two attributes from the dataset

df_binary.columns = ['Sal', 'Temp']

# Renaming the columns for easier writing of the code

df_binary.head()

# Displaying only the 1st rows along with the column names

Step 3: Exploring the Data Scatter

sns.lmplot(x ="Sal", y ="Temp", data = df_binary, order = 2, ci = None)

# Plotting the data scatter

Step 4: Data Cleaning

# Eliminating NaN or missing input numbers

df_binary.fillna(method ='ffill', inplace = True)

Step 5: Training Our Model

X = np.array(df_binary['Sal']).reshape(-1, 1)

y = np.array(df_binary['Temp']).reshape(-1, 1)

# Separating the data into independent and dependent variables

# Converting each dataframe into a numpy array

# since each dataframe contains only one column

df_binary.dropna(inplace = True)

# Dropping any rows with Nan values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Splitting the data into training and testing data

regr = LinearRegression()

regr.fit(X_train, y_train)

print(regr.score(X_test, y_test))

Step 6: Exploring Our Results

y_pred = regr.predict(X_test)

plt.scatter(X_test, y_test, color ='b')

plt.plot(X_test, y_pred, color ='k')

plt.show()

# Data scatter of predicted values

Our model's poor accuracy score indicates that our regressive model did not match the current data very well. This implies that our data is ineligible for linear regression. However, a dataset may accept a linear regressor if only a portion of it is considered. Let us investigate that option.

Step 7: Working With a Smaller Dataset

df_binary500 = df_binary[:][:500]

# Selecting the 1st 500 rows of the data

sns.lmplot(x ="Sal", y ="Temp", data = df_binary500,

order = 2, ci = None)

We can observe that the first 500 rows adhere to a linear model. Continuing in the same manner as previously.

df_binary500.fillna(method ='ffill', inplace = True)

X = np.array(df_binary500['Sal']).reshape(-1, 1)

y = np.array(df_binary500['Temp']).reshape(-1, 1)

df_binary500.dropna(inplace = True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

regr = LinearRegression()

regr.fit(X_train, y_train)

print(regr.score(X_test, y_test))

y_pred = regr.predict(X_test)

plt.scatter(X_test, y_test, color ='b')

plt.plot(X_test, y_pred, color ='k')

plt.show()

Conclusion

Enroll in Simplilearn’s PG in Data Science to learn more about application of Python and become better python and data professionals. This Post Graduation in Data Science program by Economic Times is ranked number 1 in the world, offers over a dozen tools and skills and concepts and includes seminars by Purdue academics and IBM professionals, as well as private hackathons and IBM Ask Me Anything sessions.

Name	Date	Place
Advanced Certificate Program in Data Science	Cohort starts on 27th Apr 2023, Weekend batch	Your City	View Details
Advanced Certificate Program in Data Science	Cohort starts on 11th May 2023, Weekend batch	Your City	View Details
Advanced Certificate Program in Data Science	Cohort starts on 25th May 2023, Weekend batch	Your City	View Details

Sklearn Linear Regression

Table of Contents

Master The Right AI Tools For The Right Job!

What is SKlearn Linear Regression?

Sklearn Linear Regression Concepts

Become an Expert in All Things AI and ML!

How to Create a Sklearn Linear Regression Model