Scikit-learn (sklearn) is one of the most widely used and robust machine learning packages for Python. It offers a set of efficient tools for machine learning and statistical modeling, such as classification, regression, clustering, and dimensionality reduction, through a consistent Python interface. The package is written mostly in Python and builds on NumPy, SciPy, and Matplotlib. This article walks through linear regression with sklearn.
What Is Sklearn Linear Regression?
Scikit-learn is a Python package that makes it easier to apply a variety of Machine Learning (ML) algorithms for predictive data analysis, such as linear regression.
Linear regression is the process of determining the straight line that best fits a set of scattered data points. The line can then be extended to predict new data points. Because of its simplicity and essential features, linear regression is a fundamental Machine Learning method.
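As a minimal sketch of this idea, the snippet below fits a line to a handful of made-up points and then projects it to inputs that were not in the data. The values (generated from y = 2x + 1) and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 2x + 1 (no noise), for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)  # scikit-learn expects 2-D inputs
y = 2.0 * x.ravel() + 1.0

model = LinearRegression()
model.fit(x, y)

# Project the fitted line to inputs that were not part of the fitted data
new_x = np.array([[5.0], [10.0]])
forecast = model.predict(new_x)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("forecast:", forecast)
```

Because the toy data lies exactly on a line, the fitted slope and intercept recover the values used to generate it; real data will only approximate them.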
Sklearn Linear Regression Concepts
When working with scikit-learn's linear regression approach, you will encounter the following fundamental concepts:
- Best Fit - The straight line in a plot that minimizes the deviation between the line and the scattered data points
- Coefficient - Also known as a parameter, it is the factor by which a variable is multiplied. In linear regression, a coefficient represents the change in the response variable for a one-unit change in the corresponding predictor
- Coefficient of Determination - Also written R², it is the proportion of the variance in the response that the model explains. In simple linear regression it equals the square of the correlation coefficient. This term is used to describe the degree of fit
- Correlation - the measurable intensity and degree of association between two variables, often known as the 'degree of correlation.' The values range from -1.0 to 1.0
- Dependent Feature - A variable represented as y in the slope equation y=ax+b. Also referred to as an Output or a Response
- Estimated Regression Line - the straight line that best fits a set of randomly distributed data points
- Independent Feature - a variable represented by the letter x in the slope equation y=ax+b. Also referred to as an Input or a predictor
- Intercept - the point where the regression line intersects the Y-axis, indicated by the letter b in the slope equation y=ax+b
- Least Squares - a method for calculating the best fit to data by minimizing the sum of the squares of the discrepancies between observed and estimated values
- Mean - an average of a group of numbers; in linear regression, the mean of the response for a given input is modeled by a linear function
- OLS (Ordinary Least Squares) Regression - the most common form of linear regression, which fits the line using the least squares method
- Residual - the vertical distance between a data point and the regression line
- Regression - an assessment of a variable's predicted change in relation to changes in other variables
- Regression Model - the fitted formula used to approximate the relationship between the response and the predictors
- Response Variables - This category covers both the Predicted Response (the value predicted by the regression) and the Actual Response (the actual value of the data point)
- Slope - the steepness of a regression line. The linear relationship between two variables may be defined using slope and intercept: y=ax+b
- Simple linear regression - A linear regression with a single independent variable
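Several of these concepts can be tied together in a short sketch. The snippet below computes the least-squares slope and intercept by hand, then the residuals, the coefficient of determination, and the correlation. The five sample points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical sample points, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares slope (a) and intercept (b) for y = ax + b
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean

y_hat = a * x + b       # predicted responses on the estimated regression line
residuals = y - y_hat   # vertical distances from each point to the line

# Coefficient of determination: 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y_mean) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print("slope:", a, "intercept:", b)
print("R^2:", r_squared, "correlation squared:", r ** 2)
```

In simple linear regression the squared correlation and the coefficient of determination agree, which is why the two terms are easy to confuse.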
How to Create a Sklearn Linear Regression Model
Step 1: Importing All the Required Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Step 2: Reading the Dataset
# In a Jupyter notebook, change to the folder that holds the dataset (IPython magic)
%cd C:\Users\Dev\Desktop\Kaggle\Salinity
df = pd.read_csv('bottle.csv')
# Taking only the selected two attributes from the dataset;
# .copy() gives an independent DataFrame rather than a view of df
df_binary = df[['Salnty', 'T_degC']].copy()
# Renaming the columns for easier writing of the code
df_binary.columns = ['Sal', 'Temp']
# Displaying the first five rows along with the column names
df_binary.head()
Step 3: Exploring the Data Scatter
# Plotting the data scatter
sns.lmplot(x="Sal", y="Temp", data=df_binary, order=2, ci=None)
Step 4: Data Cleaning
# Filling NaN (missing) values with the last valid value above them.
# fillna(method='ffill') is deprecated in recent pandas releases, so ffill() is used instead
df_binary = df_binary.ffill()
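Forward filling copies each missing value from the row above it, so a gap in the very first row has nothing to copy from and stays missing. The sketch below shows this on a tiny hypothetical frame that borrows the `Sal` and `Temp` column names; the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same column names as df_binary
demo = pd.DataFrame({
    "Sal":  [np.nan, 33.4, np.nan, 33.5],
    "Temp": [10.5, np.nan, 10.3, 10.1],
})

filled = demo.ffill()  # copy each gap from the previous row
print(filled)

# The first Sal value has no previous row to copy from, so it is still NaN
remaining = int(filled.isna().sum().sum())
print("NaN values left after ffill:", remaining)

cleaned = filled.dropna()  # drop whatever forward filling could not repair
print("rows kept:", len(cleaned))
```

This is why a `dropna()` call is still useful after forward filling, provided it runs before the data is converted to arrays.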
Step 5: Training Our Model
# Dropping any rows that still contain NaN values, before the arrays are built
df_binary = df_binary.dropna()
# Separating the data into independent (X) and dependent (y) variables,
# converting each single-column selection into a 2-D NumPy array
X = np.array(df_binary['Sal']).reshape(-1, 1)
y = np.array(df_binary['Temp']).reshape(-1, 1)
# Splitting the data into training (75%) and testing (25%) data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
regr = LinearRegression()
regr.fit(X_train, y_train)
# score() returns the coefficient of determination (R^2) on the test data
print(regr.score(X_test, y_test))
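The value printed by `score` is the coefficient of determination rather than a classification-style accuracy. The sketch below checks this against `sklearn.metrics.r2_score` on synthetic data. The numbers are assumed stand-ins and not the real salinity dataset, and `random_state` is added here only so that the split is repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Sal/Temp columns (assumed values, not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(33.0, 35.0, size=(200, 1))
y = -4.0 * X[:, 0] + 150.0 + rng.normal(0.0, 0.5, size=200)

# random_state fixes the shuffle so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

regr = LinearRegression().fit(X_train, y_train)

score = regr.score(X_test, y_test)                  # R^2 on the held-out data
manual_r2 = r2_score(y_test, regr.predict(X_test))  # same quantity, computed explicitly

print("slope:", regr.coef_[0], "intercept:", regr.intercept_)
print("score:", score, "r2_score:", manual_r2)
```

The fitted slope and intercept are available as `coef_` and `intercept_` after `fit`, which makes it easy to write the estimated line in the y=ax+b form used above.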
Step 6: Exploring Our Results
y_pred = regr.predict(X_test)
# Data scatter of the actual test values (blue) with the predicted line (black)
plt.scatter(X_test, y_test, color='b')
plt.plot(X_test, y_pred, color='k')
plt.show()
The model's low R² score indicates that a straight line does not fit the full dataset very well, which suggests that the data as a whole is a poor candidate for linear regression. However, a linear regressor may still suit a dataset if only a portion of it is considered. Let us investigate that option.
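To illustrate how a line can fit a portion of a dataset far better than the whole, the sketch below uses hypothetical curved data (y = x²) rather than the salinity data. The helper `r2_of_line` is a made-up name, and for brevity it scores the line on the same points it was fitted to instead of using a train/test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical curved data: y = x^2 is not linear over the whole range
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

def r2_of_line(x_part, y_part):
    """Fit a straight line to the given points and return R^2 on those same points."""
    X_part = x_part.reshape(-1, 1)
    return LinearRegression().fit(X_part, y_part).score(X_part, y_part)

full_r2 = r2_of_line(x, y)                # whole range: a line explains almost nothing
mask = x >= 1.0                           # keep only the right-hand portion of the curve
subset_r2 = r2_of_line(x[mask], y[mask])  # on this portion the curve is close to a line

print("R^2 on all points:", full_r2)
print("R^2 on the subset:", subset_r2)
```

A high score on a subset only says that the relationship is roughly linear within that range; it does not make the line valid for the rest of the data.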
Step 7: Working With a Smaller Dataset
# Selecting the first 500 rows of the data
df_binary500 = df_binary.iloc[:500].copy()
sns.lmplot(x="Sal", y="Temp", data=df_binary500, order=2, ci=None)
We can observe that the first 500 rows follow a roughly linear trend, so we continue in the same manner as before.
# Forward-fill missing values, then drop any rows that are still NaN, before building the arrays
df_binary500 = df_binary500.ffill().dropna()
X = np.array(df_binary500['Sal']).reshape(-1, 1)
y = np.array(df_binary500['Temp']).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
regr = LinearRegression()
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))
y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color ='b')
plt.plot(X_test, y_pred, color ='k')
plt.show()
Conclusion
Enroll in Simplilearn’s PG in Data Science to learn more about applications of Python and become a better Python and data professional. This Post Graduation in Data Science program by Economic Times is ranked number 1 in the world. It covers over a dozen tools, skills, and concepts, and includes seminars by Purdue academics and IBM professionals, as well as private hackathons and IBM Ask Me Anything sessions.