Common Machine Learning Algorithms for Classification

Machine learning algorithms for classification enable computers to automatically classify and categorize data into predefined classes or categories. These algorithms analyze input data, learn from it, and then make predictions or assign labels to new data based on what they have learned.

Here we’ll cover 7 machine learning algorithms for classification.

What is Classification?

Classification is the process of predicting the class of given data points. It belongs to the supervised machine learning category, where a labeled dataset is used: we have input variables (X) and output variables (Y), and we apply an appropriate algorithm to learn the mapping function f from input to output, Y = f(X).

Basic Terminologies

Before discussing the machine learning algorithms used for classification, it is necessary to know some basic terminologies.

  • Classifier: An algorithm that maps input data to a particular category or class.
  • Classification model: A model that draws conclusions from the training data and predicts class labels for new data.
  • Feature: An individual measurable property of a phenomenon being observed.
  • Binary classification: There are two possible outcomes, for example, gender classification into male and female.
  • Multi-class classification: There are more than two classes, and each sample is assigned to one and only one target label. For example, a fruit can be a mango or an apple but not both at the same time.
  • Multi-label classification: Each sample is mapped to a set of target labels, i.e. more than one class. For example, a research article can be about computer science, a computer part, and the computer industry at the same time.

Examples of Classification Problems

Some common examples of classification problems are given below.

  • Natural Language Processing (NLP), for example, spoken language understanding.
  • Machine vision (for example, face detection)
  • Fraud detection
  • Text Categorization (for example, spam filtering)
  • Bioinformatics (for example, classify the proteins as per their functions)
  • Optical character recognition
  • Market segmentation (for example, forecast if a customer will respond to promotion)

Machine Learning Algorithms for Classification

In supervised machine learning, all the data is labeled and algorithms learn to predict the output from the input data, while in unsupervised learning, all data is unlabeled and algorithms learn the inherent structure of the input data.

Some popular machine learning algorithms for classification are briefly discussed here.

  1. Logistic Regression
  2. Naive Bayes
  3. Decision Tree
  4. Support Vector Machine
  5. Random Forests
  6. Stochastic Gradient Descent
  7. K-Nearest Neighbors (KNN)

1. Logistic Regression

Logistic regression is a statistical modeling technique used for binary classification tasks. It is commonly used when the goal is to predict a binary outcome, where the dependent variable can take one of two possible values, such as “yes” or “no,” “true” or “false,” or 0 or 1.

The logistic regression algorithm models the relationship between the independent variables and the probability of the binary outcome. It estimates the probability of the outcome using a logistic function, also known as the sigmoid function. This function maps any real-valued input to a value between 0 and 1 and represents the probability of the positive class.

The algorithm works by fitting a linear model to the training data, using a technique called maximum likelihood estimation. The fitted model separates the feature space into two regions, corresponding to the two possible outcomes. During the prediction phase, the algorithm calculates the probability of the positive class from the learned coefficients and a new set of input features. If the probability exceeds a certain threshold (usually 0.5), the instance is classified as the positive class; otherwise, it is classified as the negative class.
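For illustration, a minimal scikit-learn sketch of this workflow might look like this (the iris data, the binary relabeling, and the 0.5 threshold are convenient choices for the example, not part of any particular dataset discussed here):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
y = (y == 0).astype(int)                      # binary task: one class vs. the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]       # sigmoid output: P(positive class)
print((probs > 0.5).astype(int))              # 0.5 threshold gives the class labels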

2. Naïve Bayes

Naive Bayes is a classification algorithm that is based on the Bayes’ theorem. It is widely used for text classification tasks, spam filtering, sentiment analysis, and other applications where the input data consists of categorical or discrete features.

The algorithm is termed “naive” because it simplifies the classification problem by assuming that all features are conditionally independent of each other given the class label. Despite this naive assumption, Naive Bayes often performs well in practice and can be very efficient for large datasets.

The Naive Bayes algorithm calculates the probability of each class given a set of input features and then predicts the class with the highest probability. It utilizes Bayes’ theorem, which describes the relationship between the conditional probability of an event and its prior probability. In the context of Naive Bayes, it calculates the posterior probability of each class given the input features.

To build a Naive Bayes model, the algorithm learns the prior probabilities of each class from the training data. It also estimates the conditional probabilities of the features for each class. During the prediction phase, the algorithm applies Bayes’ theorem to calculate the posterior probabilities and assigns the class with the highest probability as the predicted class.

It can handle high-dimensional datasets with many features, and its assumption of feature independence makes it particularly suitable for text classification tasks. However, this assumption can be a limitation if the features are correlated in reality.
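As a rough illustration, a tiny text-classification sketch with scikit-learn might look like this (the example messages and labels are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer inside", "notes from today's meeting"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = ham (invented labels)

vec = CountVectorizer()
X = vec.fit_transform(texts)                 # discrete word-count features
clf = MultinomialNB().fit(X, labels)         # learns priors and per-class likelihoods
print(clf.predict(vec.transform(["claim your free prize"])))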

3. Decision Tree

A decision tree is a popular machine learning algorithm used for both classification and regression tasks. It creates a flowchart-like structure resembling a tree to make decisions based on input features.

The algorithm works by recursively partitioning the feature space into subsets based on the values of different features. It selects the most informative feature at each step to split the data to maximize the separation between different classes or minimize the variability within each subset.

Starting from the root node, the decision tree algorithm evaluates the feature conditions and assigns data points to subsequent nodes based on their feature values. This process continues until a stopping criterion is met, such as reaching a maximum depth or a minimum number of data points in a node.

Each internal node of the tree represents a decision based on a specific feature, which leads to different branches. The leaf nodes, also known as terminal nodes, represent the final decision or prediction for a given input.

During the training phase, the decision tree algorithm learns the optimal feature splits by analyzing the training data.

Once the decision tree is built, it can be used to make predictions for new instances by traversing the tree based on the feature values of the input data. The final prediction is determined by the majority class in the leaf node reached by the input instance.

Decision trees have several benefits, including their interpretability, as the flowchart-like structure allows for easy understanding of the decision-making process. They can handle both numerical and categorical features and can capture complex relationships between variables.
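A minimal scikit-learn sketch (using the bundled iris dataset purely for convenience):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # max_depth is a stopping criterion
print(export_text(tree))                               # the learned flowchart of feature splits
print(tree.predict(X[:1]))                             # traverse the tree for a new instance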

4. Support Vector Machine

A Support Vector Machine (SVM) is a powerful machine learning algorithm used for both classification and regression tasks. It is particularly effective in cases where the data has complex relationships and requires a clear separation between classes.

The primary goal of an SVM is to find a hyperplane in a high-dimensional feature space that best separates the data points belonging to different classes. This hyperplane acts as a decision boundary, maximizing the margin, which is the distance between the closest data points of different classes.

The key idea behind SVM is to transform the input data into a higher-dimensional space using a kernel function. In this transformed space, the SVM seeks to find an optimal hyperplane that achieves the best separation between the classes.

During the training phase, the SVM algorithm identifies support vectors, which are the data points closest to the decision boundary. These support vectors play a crucial role in determining the optimal hyperplane. The algorithm adjusts the position and orientation of the hyperplane to maximize the margin and minimize the classification errors.

Once the SVM is trained, it can classify new instances by mapping them into the feature space and determining which side of the decision boundary they fall on. The SVM assigns the class label based on the side of the hyperplane the data point lies.

Applications of SVM are in different fields, including text classification, image recognition, bioinformatics, and finance.
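A minimal sketch with scikit-learn's SVC (iris data used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf').fit(X, y)     # the kernel handles the higher-dimensional mapping
print(len(clf.support_vectors_))      # the points closest to the decision boundary
print(clf.predict(X[:3]))             # which side of the boundary each point falls on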

5. Random Forests

Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to make predictions. It is known for its robustness in handling both classification and regression tasks.

The algorithm constructs an ensemble, or a collection, of decision trees by training each tree on a different subset of the training data and a random subset of the input features. Each decision tree independently makes predictions, and the final prediction is determined through a voting or averaging mechanism.

Random Forest introduces randomness in two key aspects. First, during the construction of each decision tree, a random subset of the training data, known as bootstrap samples, is selected with replacement. This technique, called bagging, introduces diversity and helps reduce overfitting.

Second, at each node of the decision tree, a random subset of features is considered for splitting, typically referred to as feature subsampling. By randomly selecting a subset of features, Random Forest introduces further variability and prevents certain features from dominating the decision-making process.

Random Forest has many benefits. It can handle high-dimensional data with many features and is resistant to overfitting. It can handle both categorical and numerical features, and it provides an estimate of feature importance.
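A brief scikit-learn sketch (iris data as a stand-in):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# each tree sees a bootstrap sample and considers a random
# subset of features at every split (bagging + feature subsampling)
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt').fit(X, y)
print(forest.feature_importances_)    # the feature-importance estimate mentioned above
print(forest.predict(X[:3]))          # majority vote across the 100 trees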

6. Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is an optimization algorithm commonly used in machine learning for training models, particularly with large datasets. It is a variant of the gradient descent algorithm that offers computational efficiency by updating model parameters using a random subset of the training data at each iteration.

The basic idea behind SGD is to iteratively adjust the model parameters to minimize a given loss function. Instead of considering the entire training dataset in each iteration, SGD randomly selects a small batch, known as a mini-batch, of training examples. This mini-batch is used to compute the gradient of the loss function with respect to the model parameters.

The gradient represents the direction of steepest ascent of the loss function, so the model parameters are updated in the opposite direction of this gradient estimate, using a learning rate that controls the size of the updates.

By repeatedly sampling mini-batches and updating the parameters, SGD gradually converges towards a minimum of the loss function, hopefully reaching a good solution for the learning task.

SGD has many advantages. It is computationally efficient, particularly when dealing with large datasets, as it operates on subsets of the data instead of the entire dataset. It is suitable for online learning scenarios where new data arrives continuously, as it can update the model incrementally.
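A short scikit-learn sketch of incremental training (the loss name 'log_loss' assumes a recent scikit-learn release; older versions call it 'log'):

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
clf = SGDClassifier(loss='log_loss', random_state=1)  # logistic regression trained with SGD
clf.partial_fit(X[:75], y[:75], classes=[0, 1, 2])    # first mini-batch, incremental update
clf.partial_fit(X[75:], y[75:])                       # later data can be learned online
print(clf.predict(X[:3]))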

7. K-Nearest Neighbors (KNN)

The K-Nearest Neighbors (K-NN) algorithm is a non-parametric method that makes predictions based on the similarities between the input data points.

The K-NN algorithm operates on a training dataset with labeled instances. During the training phase, the algorithm simply stores the data points and their corresponding labels.

When a new, unlabeled instance needs to be classified or predicted, the K-NN algorithm compares it to the labeled instances in the training set. It measures the similarity between the new instance and the existing instances using a distance metric, commonly the Euclidean distance.

The “K” in K-NN refers to the number of nearest neighbors to consider for making predictions. K is a hyperparameter that needs to be specified beforehand. The algorithm identifies the K nearest neighbors of the new instance based on the distance metric.

For classification tasks, the K-NN algorithm assigns the class label to the new instance based on the majority vote of its K nearest neighbors. The class that appears most frequently among the neighbors is considered the predicted class for the new instance.

The algorithm’s main drawback is its computational complexity, especially for large datasets, as it requires calculating the distances between the new instance and all training instances.
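A minimal scikit-learn sketch (iris data for illustration):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)   # K = 5
knn.fit(X, y)                               # "training" only stores the labeled points
print(knn.predict(X[:3]))                   # majority vote among the 5 nearest neighbours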



How to do regression in Excel? (Simple Linear Regression)

Performing regression analysis in Excel is a very easy task. Before going practical, let's briefly recall the concepts of regression.

Regression is used for predictive analysis. It is used for finding the strength of predictors and forecasting an effect.

Its most common type is linear regression. The linear regression can be further divided into simple linear regression, multiple linear regression, and logistic linear regression.

You may have performed simple linear and multiple linear regression in programming languages like Python and R.

Today we are going to perform the regression analysis in Excel.

Let’s start!

If you do not find the "Data Analysis" tool in the "Data" tab, go to the "File" menu and click "Options". Go to "Add-Ins" and select "Analysis ToolPak".


After pressing OK, you will see this box. Select “Analysis ToolPak” add-in and press OK.


Now the "Data Analysis" option will appear in your "Data" tab.


Click on the "Data Analysis" tool. A box showing the list of available tools will open. Choose "Regression" from the list of analysis tools and press "OK".


After selecting the "Regression" option, the regression dialog box appears.

  1. Set the input ranges of X and Y by selecting the relevant columns of values in the worksheet. Here we have selected A1:A21 and B1:B21.
  2. "Constant is Zero": this option is optional; selecting it forces the regression line to start from zero.
  3. Check "Labels" if headings are included in your table.
  4. You can select an output range to put the result on specific cells, or select "New Worksheet Ply" to place the result in a new worksheet.
  5. "Residuals": selecting this checkbox reports the differences between actual and predicted values. It is optional.
  6. Press "OK" and see the regression analysis on a new worksheet.

After performing the regression, you will see this output.


Looking at the summary output, we have some results, but what do they mean? Let's briefly explain them.

  • Multiple R: The correlation coefficient, which measures the strength of the relationship between the variables. A larger value shows a stronger relationship, while 0 means no relationship.
  • R Square: The proportion of variation in the dependent variable explained by the regression, i.e. how closely the points fall on the regression line.
  • Adjusted R Square: R Square adjusted for the number of independent variables.
  • Standard Error: It shows the precision of the regression analysis.
  • Observations: The number of observations.

Now, let's look at the second part of the output, "ANOVA". Here, the "Significance F" value indicates the reliability of the model. If the value is less than 0.05, the model is fine; if it is greater than 0.05, you should reconsider your independent variable.

Simple Linear Regression in Excel

In simple linear regression, there is one dependent variable i.e. interval or ratio, and one independent variable i.e. interval or ratio or dichotomous.

We can perform simple linear regression in Excel. To do this, select the columns and go to Insert -> Charts -> Scatter.


Right-click on a data point in the graph and add a trendline.


Use the formatting pane on the right side to change the formatting of the line; you can change the color and other options that best fit your model.

Implementing Support Vector Machine (SVM) in Python

Machine learning is a popular technique for predicting future outcomes or classifying data to help people make important decisions.

The algorithms are trained on data; they learn from past experience in order to make predictions about the future.

There are three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.

In this article, I want to acquaint you with a powerful machine learning technique known as the Support Vector Machine (SVM).

Before we start formally, it is essential to know about supervised machine learning:

Supervised Machine Learning

In supervised machine learning, a labeled dataset is used. You must have input variables (X) and output variables (Y) then you apply an appropriate algorithm to find the mapping function from input to output.

Y = f(X)

Supervised machine learning can be categorized into the following:-

  1. Classification – where the output variable is a category like black or white, plus or minus. Naïve Bayes (NB), Support Vector Machine (SVM), and Decision Tree (DT) are among the most popular supervised machine learning algorithms.
  2. Regression – where the output variable is a real value like weight, dollars, etc. Linear regression is used for regression problems.

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm that is mostly used for data classification and regression analysis.

We can perform linear and non-linear classification with the help of Support Vector Machine.

SVM Classifier splits the data into two classes using a hyperplane which is basically a line that divides a plane into two parts.


Applications of Support Vector Machine in Real Life

As you already know, the Support Vector Machine (SVM) is a supervised machine learning algorithm, so its fundamental aim is to classify unseen data.

It is popular due to its memory efficiency, effectiveness in high-dimensional spaces, and versatility. There are several applications of SVM in real life; some of them are mentioned here.

  • Face detection
  • Image classification
  • Handwriting recognition
  • Geo and environmental sciences
  • Bioinformatics
  • Text categorization
  • Protein fold and remote homology detection
  • Generalized predictive control

Examples of SVM Kernels

  1. Polynomial kernel – mostly used in image processing.
  2. Linear splines kernel in one dimension – used in text categorization and helpful in dealing with large sparse data vectors.
  3. Gaussian kernel – used when there is no prior knowledge about the data.
  4. Gaussian Radial Basis Function (RBF) – commonly used where there is no prior knowledge about the data.
  5. Hyperbolic tangent kernel – used in neural networks.
  6. Bessel function of the first kind kernel – used to eliminate the cross term in mathematical functions.
  7. Sigmoid kernel – can be used as an alternative to neural networks.
  8. ANOVA radial basis kernel – mostly used in regression problems.
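In scikit-learn, several of these kernels can be selected directly through the kernel argument of SVC; a quick sketch:

from sklearn.svm import SVC

svc_poly = SVC(kernel='poly', degree=3)   # polynomial kernel
svc_rbf = SVC(kernel='rbf')               # Gaussian radial basis function
svc_sig = SVC(kernel='sigmoid')           # sigmoid (hyperbolic tangent) kernel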

Support Vector Machine (SVM) implementation in Python:

Now, let's start coding in Python. First, we import the important libraries such as pandas, numpy, matplotlib, and sklearn.

import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

Load the dataset:

Now we load the dataset, apples_and_oranges.csv, which is placed in the same folder as the svm.ipynb file, and check what is inside the file.

df = pd.read_csv('apples_and_oranges.csv')
df

We can also represent this data frame as a scatter plot.

plt.xlabel('weight')
plt.ylabel('size')
# plot weight against size, one colour per class
# (the label column name 'Class' and its values are assumptions about the dataset)
apples = df[df['Class'] == 'apple']
oranges = df[df['Class'] == 'orange']
plt.scatter(apples['weight'], apples['size'], color="green", marker='+', linewidths=5)
plt.scatter(oranges['weight'], oranges['size'], color="blue", marker='.', linewidths=5)

Split the apples-and-oranges dataset into training and test samples in a ratio of 80% to 20%.

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2)

Now we separate the predictors and the target.

x_train = train_set.iloc[:,0:2].values
y_train = train_set.iloc[:,2].values
x_test = test_set.iloc[:,0:2].values
y_test = test_set.iloc[:,2].values

We can also check the lengths of train_set and test_set by using this code:
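len(train_set), len(test_set)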

Now we initialize the Support Vector Machine (SVM) classifier and fit it to the training data.

from sklearn.svm import SVC
model = SVC(kernel='rbf', random_state = 1)
model.fit(x_train, y_train)

Now, we will check the accuracy of our model.

model.score(x_test, y_test)

Wow… our model worked perfectly, as it gives 100% accuracy, but this may not happen all the time, especially where a large number of features are involved.

Now, we will predict the class of a fruit whose weight is 55 and size is 4.

model.predict([[55,4]])

Another check to predict the class of a fruit whose weight is 60 and size is 5.50.

model.predict([[60,5.50]])

Hence, it is clear from the above that the Support Vector Machine (SVM) is an elegant and powerful algorithm.

We can also use another kernel i.e. linear and check the model score like this.

model_linear_kernal = SVC(kernel='linear')
model_linear_kernal.fit(x_train, y_train)
model_linear_kernal.score(x_test, y_test)

Problem Statement No.1:

Train a Support Vector Machine (SVM) Classifier by using any suitable dataset and then find out the accuracy of your model by utilizing rbf and linear kernels.


What is Clustering & its Types? K-Means Clustering Example (Python)

Cluster Analysis

A cluster is a group of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis is a technique used to classify the data objects into relative groups called clusters.

Clustering is an unsupervised learning approach in which there are no predefined classes.

The basic aim of clustering is to group the related entities in a way that the entities within a group are alike to each other but the groups are dissimilar from each other.

In K-means clustering, "K" defines the number of clusters. K-means clustering, hierarchical clustering, and density-based spatial clustering are among the most popular clustering algorithms.

Examples of Clustering Applications:

  • Cluster analysis is used in marketing for segmenting customers based on the benefits they obtain from purchasing merchandise, and for finding homogenous groups of consumers.
  • Cluster analyses are used for earthquake studies.
  • Cluster analyses are used for city planning in order to find out the collection of houses according to their house type, worth and geographical locality.

Major Clustering Approaches:

Major clustering approaches are described as under: –

Partitioning Clustering

In this technique, the dataset is subdivided into a set of k groups (where k is the number of groups, predefined by the analyst).

K-means is the well-known clustering technique in which each cluster is represented by the center of the data points belonging to the cluster.

K-medoids clustering is an alternative to K-means that is less sensitive to outliers than K-means.

K-means clustering method is also known as hard clustering as it produces partitions in which each observation belongs to only one cluster. 

Hierarchy Clustering

Hierarchical clustering is used to identify groups in the dataset without requiring the analyst to pre-specify the number of clusters to be generated.

The result obtained from this clustering is a tree-based representation of the objects known as a dendrogram. Furthermore, observations can be subdivided into groups by slicing the dendrogram at the desired similarity level.

Fuzzy Clustering

Fuzzy clustering is also known as soft clustering which permits one piece of data to belong to more than one cluster.

Fuzzy clustering is frequently used in pattern recognition. Fuzzy C-means clustering algorithm is commonly used worldwide.  

Density-based Clustering (DBSCAN)

DBSCAN stands for density-based spatial clustering of applications with noise. It is a method introduced by Ester et al. in 1996 that can be used to find clusters of any shape in a dataset containing noise and outliers.

The main advantage of DBSCAN is that there is no need to specify the number of clusters to be generated by the user.

Grid-based Clustering

This clustering approach uses a multi-resolution grid data structure, offering high processing speed with low memory consumption.

Model-based Clustering:

In this clustering approach, it is assumed that the data comes from a distribution that is a mixture of two or more clusters.

Model-based clustering is used to resolve the issues that can arise in K-means or fuzzy K-means algorithms.

Difference between Classification and Clustering

Classification | Clustering
The classification technique is widely used in data mining to classify datasets where the output variable is a category, like black or white, plus or minus. | A cluster is a group of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters; cluster analysis classifies data objects into relative groups called clusters.
Naïve Bayes, Support Vector Machine, and Decision Tree are the most popular supervised machine learning algorithms. | Clustering is unsupervised learning in which there are no predefined classes.

Process of applying K-means Clustering

  • Choose the number of clusters
  • Specify the cluster seeds
  • Assign each point to a centroid
  • Adjust the centroids and repeat until the assignments stop changing

Pros and Cons of Clustering

K-means

  • Pros: It is simple to comprehend and works well on both small and large datasets. This clustering technique is fast and efficient.
  • Cons: The number of clusters must be selected in advance.

Hierarchical Clustering

  • Pros: The ideal number of clusters can be acquired by the model itself.
  • Cons: Hierarchical clustering is not suitable for large datasets.

K-Means Clustering Example (Python)

These are the steps to perform the example.

Import the relevant libraries.

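A minimal sketch of the imports this example needs:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans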

Load the data

Now we load the data in .csv format, placed in the same folder where the clustering.ipynb file is saved, and check what is inside the file.

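A sketch of this step; the file name Countries.csv is a placeholder:

data = pd.read_csv('Countries.csv')   # hypothetical file name
data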

In order to map the data, we create a new variable data_mapped equal to data.copy(), and set data_mapped['continent'] equal to data_mapped['continent'].map(...), mapping Africa to 0, Asia to 1, Europe to 2, North America to 3, and South America to 4:
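data_mapped = data.copy()
data_mapped['continent'] = data_mapped['continent'].map({'Africa': 0, 'Asia': 1, 'Europe': 2,
                                                         'North America': 3, 'South America': 4})
data_mapped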

Further, we select the features that we intend to use for clustering, as below.
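A sketch of this step; the column order is an assumption:

# keep every column except the first one, i.e. country
# (the column order country, Latitude, Longitude, continent is assumed)
x = data_mapped.iloc[:, 1:4]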

Here we keep three columns and leave out only one column, i.e. country.

Perform K-Mean Clustering

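kmeans = KMeans(n_clusters=5)
identified_clusters = kmeans.fit_predict(x)
identified_clusters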

Above, we perform K-means clustering with 5 clusters; the results are shown below.

Now we create a data frame, data_with_clusters, which is equal to data. Furthermore, we add an extra column, Cluster, which is equal to identified_clusters:
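data_with_clusters = data.copy()
data_with_clusters['Cluster'] = identified_clusters
data_with_clusters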

It is clear from the above output that Angola, Burundi & Benin are in cluster 0; Aruba, Anguilla, Antigua & Barbuda are in cluster 1; Albania, Aland, Andorra, Austria & Belgium are in cluster 2; and Afghanistan, United Arab Emirates & Azerbaijan are in cluster 3.

Finally, we plot a scatter plot in order to obtain a map of the real world, taking Longitude along the x-axis and Latitude along the y-axis.
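A sketch of the plot (the Latitude and Longitude column names are assumptions about the dataset):

plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'],
            c=data_with_clusters['Cluster'], cmap='rainbow')
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.show()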

These clusters are based on geographical location; therefore, the resulting plot resembles a map of the world.

Logistic Regression (Python) Explained using Practical Example

Logistic regression is a predictive analysis used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. It is mostly used in biological sciences and social science applications; for instance, predicting whether a received email is spam or not, or whether a customer will purchase a product or not.

Statistical software is used to conduct the analysis, as logistic regression is a bit more difficult to interpret than linear regression.

The main kinds of logistic regression analysis are:

  1. Binary logistic regression – two possible outcomes, e.g. an email is spam or otherwise.
  2. Multinomial logistic regression – three or more categories with no ordering, e.g. during college admission, students choose among a general program, an academic program, or a vocational program.
  3. Ordinal logistic regression – three or more categories with ordering, e.g. a mobile phone rating from 1 to 5.

Logistic Regression Model

p = P(Y = 1 | X) = e^(β0 + β1x1) / (1 + e^(β0 + β1x1))

Practical example of Logistic Regression

Import the relevant libraries and load the data.

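A sketch of this step (the file name is a placeholder; the job and code columns are used below):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

data = pd.read_csv('job_data.csv')   # hypothetical file name
data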

For quantitative analysis, we must convert the 'yes' and 'no' entries into 1 and 0:
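data['job'] = data['job'].map({'yes': 1, 'no': 0})   # assuming 'yes' maps to 1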

Now we are going to visualize our data. We are predicting job status, therefore job is our Y variable and code (used for education) will be our X variable.

Here, we observe that for all the observations below a certain education level the outcome is zero (jobless), whereas all the persons above it successfully got the job. Now we plot a regression line.

Linear regression is an awesome technique, but it is not suitable for this kind of analysis, as it does not know that our values are bounded between 0 and 1. Our data is non-linear; therefore, we must use a non-linear approach. Hence, we now fit and plot a logistic regression curve.
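A sketch of fitting the logistic model with statsmodels (the column names follow the text above):

y = data['job']
x1 = data['code']

x = sm.add_constant(x1)
reg_log = sm.Logit(y, x)
results_log = reg_log.fit()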

This function depicts the probability of getting a job given an education code. When the education level is low, the probability of getting a job is 0 (nil), whereas when the education level is high, the probability of getting a job is 1 (100%).

It is clear from the curve that when the education is 'BA', the probability of getting a job is about 60%.

The logistic regression summary is shown below.
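results_log.summary()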

MLE stands for maximum likelihood estimation.

Likelihood function

It is a function that estimates how likely it is that the model at hand describes the real underlying relationship between the variables. The larger the likelihood function, the higher the probability that our model is correct.

Maximum likelihood estimation tries to maximize the likelihood function. The computer goes through various values until it finds a model for which the likelihood is optimal; when no more improvement is possible, it simply stops the optimization.

Pseudo R-squared (Pseudo R-squ.) is mostly useful for comparing variations of the same model; different models have different pseudo R-squared values. A value between 0.2 and 0.4 is considered decent.

LL-Null stands for log likelihood-null: the log-likelihood of a model with no independent variables.

LLR stands for log likelihood ratio, which measures whether our model is statistically different from LL-Null.

Calculating the accuracy of the model

In order to find the accuracy of the model, we use the results_log.predict() command, which returns the values predicted by our model. We also apply some formatting to make the results more readable by using this command:

np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
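results_log.predict()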

Here, a value less than 0.5 means the chance of getting a job is below 50%, and a value of 0.93 means the chance of getting a job is 93%.

Now, we compare the actual values of the model with the predicted values.

If 90% of the predicted values of the model match with the actual values of the model, we say that the model has 90% accuracy.

In order to compare the predicted and actual values in the form of a table, we use the results_log.pred_table() command.

This result is a bit difficult to understand, so we put these results in the form of a confusion matrix, as shown below.

Let's interpret this confusion matrix: for 3 observations the model predicted 0 and the actual value was also 0; similarly, for 9 observations the model predicted 1 and the actual value was also 1. The model did a good job there.

Furthermore, for 2 observations the model predicted 0 whereas the actual value was 1; similarly, for 1 observation the model predicted 1 whereas the actual value was 0. There the model got confused.

Finally, the confusion matrix shows that the model made an accurate estimation in 12 out of 15 cases, which means our model works with (12/15)*100 = 80% accuracy.

We can also calculate the accuracy of the model by using this code

cm = np.array(cm_df)                            # cm_df is the confusion matrix as a data frame
accuracy_model = (cm[0,0]+cm[1,1])/cm.sum()*100 # correct predictions / all predictions
accuracy_model
Simple and Multiple Linear Regression in Python

Generally, linear regression is used for predictive analysis. It is a linear approximation of a fundamental relationship between two or more variables.

Main processes of linear regression

  • Get sample data
  • Design a model that works best for that sample
  • Make prediction for the whole population

Main uses of regression analysis

  • Finding the strength of predictors
  • Forecasting an effect
  • Trend forecasting

Some types of linear regression analysis

Simple Linear Regression

One dependent variable i.e. interval or ratio, and one independent variable i.e. interval or ratio or dichotomous

Multiple Linear Regression

One dependent variable i.e. interval or ratio, and two plus independent variables i.e. interval or ratio or dichotomous

Logistic Linear Regression

One dependent variable i.e. dichotomous, and two plus independent variables i.e. interval or ratio or dichotomous

Ordinal Regression

One dependent variable i.e. ordinal, and one plus independent variables i.e. nominal or dichotomous

Multinomial Regression

One dependent variable i.e. nominal, and one plus independent variables i.e. interval or ratio or dichotomous.

Types of Variables in Linear Regression

In linear regression, there are two types of variables:

  • Dependent Variable
  • Independent Variable

Dependent variables are those which we are going to predict while independent variables are predictors.

Let’s briefly explain them with the help of example.

y = F(x1, x2, x3, …, xk)

In above equation, y is dependent variable which is a function of independent variables x1 to xk.

The population formula of the simple linear regression model is given below:

y = β0 + β1x1 + ε

Look at the above equation: y is the dependent variable, β0 is the regression constant, β1 is the coefficient that quantifies the effect of the independent variable on the dependent variable, x1 is the sample data for the independent variable, and ε is the error of estimation.

Now let's take an example to understand this equation. For instance, if income is the dependent variable (y) and education is the independent variable (x1), then we say that income depends on education: more education will tend to ensure a higher income.

The error of estimation is the difference between the observed income and the income the regression predicted. On average, however, the error of estimation is zero.

The estimated simple linear regression equation is given below:

ŷ = b0 + b1x1

Difference between Regression and Correlation

Regression | Correlation
It is used to measure how one variable affects the other variable. | It is the relationship between two variables.
It is used to fit a best line and estimate one variable on the basis of another. | It is used to show the connection between two variables.
The dependent and independent variables are distinct. | There is no difference between dependent and independent variables.
It is one-way. | It is symmetric: p(x, y) = p(y, x).
It is represented by a line. | It is represented by a single point.
Geometrical representation of the Linear Regression Model

Python Packages Installation

Python libraries will be used during our practical example of linear regression.

To see the libraries installed with Anaconda, write the following command in the Anaconda Prompt:

C:\Users\Iliya>conda list 

We can also install more libraries in Anaconda with this command:

C:\Users\Iliya>conda install numpy

Before we start the practical example of linear regression in Python, let's discuss its important libraries.

NumPy

It is a library for Python programming that allows us to work with multidimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Pandas

It is a software library for Python for data manipulation and analysis in tabular form.

Matplotlib

It is a 2D plotting library for Python, specially designed for visualization of NumPy computations.

SciPy

It is an open-source Python library used for scientific and technical computing. It contains modules for optimization, linear algebra, integration, image processing, and machine learning.

Seaborn

It is a Python data visualization library based on matplotlib. Seaborn offers a high-level interface for drawing attractive and informative graphics.

Statsmodels

It is a Python package that permits users to explore data, estimate statistical models, and execute statistical tests.

Scikit-learn

It is a free machine learning library for Python.

Practical example of Simple Linear Regression

Import the relevant libraries

Load the data

Now we load the data in .csv format, saved in the same folder as the regression_example.ipynb file, and check what is inside the file.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
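Loading and inspecting the data might look like this (the file name is a placeholder):

data = pd.read_csv('regression_data.csv')   # hypothetical file name
data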

In order to show informative statistics, we use the describe() command.

data.describe()

Now we define the dependent and independent variables. In our example, code (allotted to each education level) is the independent variable, whereas salary is the dependent variable.

y = data['salary']
x1 = data['code']

In order to explore the data in the shape of a scatter plot, we first define the horizontal axis and then the vertical axis:
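plt.scatter(x1, y)
plt.xlabel('code', fontsize=20)
plt.ylabel('salary', fontsize=20)
plt.show()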

Now we add a constant, which means adding a new column consisting of only 1s.

 x = sm.add_constant(x1) 

Fit the model according to the Ordinary Least Squares (OLS) method, with dependent variable 'y' and independent variable 'x'.

results = sm.OLS(y,x).fit() 

Finally, we print a summary of the regression.

results.summary()

Now we are going to create a scatter plot

plt.scatter(x1,y)

then, define the regression equation yhat = 5914.2857*x1+6466.6667

and now plot the regression line against the independent variable i.e. code (used for education)

fig = plt.plot(x1,yhat, lw=4, c='orange', label ='regression line')

Now, label the x-axis and y-axis

plt.xlabel('Education', fontsize = 20)
plt.ylabel('Salary', fontsize = 20)
plt.show()

Now look at the complete code and its output:

plt.scatter(x1,y)
yhat = 5914.2857*x1+6466.6667
fig = plt.plot(x1,yhat, lw=4, c='orange', label ='regression line')
plt.xlabel('Education', fontsize = 20)
plt.ylabel('Salary', fontsize = 20)
plt.show()

Interpret the Regression Results

Now, put the following lines of code to interpret the regression results.

x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()
results.summary()

Salary is the dependent variable.

R-squared shows the fit of the model; its values range from 0 to 1. In our example, the R-squared value is 0.911. Higher values indicate a better fit.

Simple linear regression is given by:

ŷ = b0 + b1x1

In our example, the constant (const), i.e. b0, is 5152.5157, and the coefficient on code, i.e. b1, is 6240.5660.

Std err shows the level of accuracy of the coefficient: the lower the standard error, the higher the level of accuracy.

P > |t| is the p-value. A value less than 0.05 is considered statistically significant.

Therefore,

Salary = 5152.5157 + 6240.5660 × code

If code = 2 then salary will be

17633.6477 = 5152.5157 + 6240.5660 × 2

Hence, according to our model, the expected salary of an employee whose education is FA is 17633.65. That is the predictive power of linear regression.

The null hypothesis of this test is that the coefficient is equal to zero (H0: β = 0). If the intercept coefficient is zero, the line crosses the y-axis at the origin, as shown below.

plt.scatter(x1,y)
yhat = 5914.2857*x1+0
fig = plt.plot(x1,yhat, lw=4, c='red', label='regression line')
plt.xlabel('Education', fontsize = 20)
plt.ylabel('Salary', fontsize = 20)
plt.xlim(0)
plt.ylim(0)
plt.show()

If b1 = 0, then ŷ = b0; graphically, this variable will not be considered in the model.

Therefore, we conclude that a horizontal regression line always passes through the intercept value.

Practical example of Multiple Linear Regression

Import the relevant libraries and load the data

In order to show informative statistics, we use the describe() command.

Now we define the dependent and independent variables. In our example, code (allotted to each education level) and year are the independent variables, whereas salary is the dependent variable.
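A sketch of this step with statsmodels (the column names follow the text above):

y = data['salary']
x1 = data[['code', 'year']]

x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
results.summary()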

In order to explore the data in the shape of a scatter plot, we first define the horizontal axis and then the vertical axis.

Interpret the Regression Results

Now we can easily compare the results of the regression models with one or more variables.
