What is Data Science?
Data Science is the process of collecting large quantities of raw data, which in turn can be processed and analyzed in order to extract useful patterns and insights.
The data science project life cycle starts with identifying a problem, followed by data collection, data preparation, data cleaning, data analysis, model development, model evaluation, and finally deployment or data visualization.
With data being the new world currency, it is no surprise that fields such as data science and machine learning are booming. It is estimated that by the end of 2020 there were 64.2 zettabytes of data in existence, a figure projected to nearly triple to 180 zettabytes by 2025!
To analyze such data properly, data scientists fit it to dedicated machine learning models and algorithms designed to extract as much insight as possible. In this article, we will explain some of the most well-known and commonly used data science and machine learning algorithms and their use cases.
Data Science Algorithms
What is Data Science Regression Analysis?
Regression is currently the most well-known machine learning algorithm and is also considered the simplest. Regression is a supervised machine learning algorithm that models the relationship between one or more independent (input) variables and a dependent (output) variable. The two most popular regression algorithms are linear and logistic regression.
1. Linear Regression
A Linear Regression model is used when the target (output) value returns a continuous value. For example, linear regression can be used to predict the selling price of a real estate property depending on known features such as the size, location, and neighborhood of the given property.
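As a rough illustration, here is a minimal sketch of simple linear regression using the closed-form least-squares solution. The property sizes and prices below are made-up toy numbers, not real market data:

```python
# A minimal sketch of simple linear regression: fit a straight line
# y = slope * x + intercept by minimizing squared error.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: property size (square meters) vs. price (thousands)
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 250, 300, 350]
slope, intercept = fit_line(sizes, prices)
print(slope, intercept)  # -> 2.5 25.0, i.e. price = 2.5 * size + 25
```

Once fitted, the line predicts the price of an unseen property by plugging its size into the equation. Real projects would typically use a library implementation (for example scikit-learn) rather than this hand-rolled version.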
Examples of Linear Regression
Some real-life examples of linear regression include helping medical researchers better understand the relationship between drug dosage and blood pressure. Linear regression is also an easy way to determine the ratio of advertising spending to sales in different businesses. Think of linear regression as a way of quantifying correlation; researchers using it have to be careful when comparing two variables, since correlation does not always imply causation.
2. Logistic Regression
In the case of logistic regression, the output of the algorithm is a categorical (discrete) variable. For example, logistic regression can be used to classify different objects, such as cats and dogs.
Examples of Logistic Regression
A real-life use of logistic regression is credit card fraud detection. When a credit card transaction happens, the bank records several factors: the date of the transaction, the amount, the location, the type of purchase, and so on. Based on these factors, it develops a logistic regression model that predicts whether or not the transaction is fraudulent. If the amount is extreme and the bank knows its client never makes purchases that high, it may label the transaction as fraud. That is why, when individuals make large purchases outside their home country, banks will often call or text to confirm the charge.
3. Clustering Algorithm
Clustering is an unsupervised machine learning algorithm, meaning that the data fed into the model is not labeled. The goal of a clustering algorithm is to divide the dataset into groups. As shown in the image below, the algorithm separates the dataset so that the points inside the same circle share similarities and common features.
Examples of clustering algorithms include K-Means Clustering, Agglomerative/Hierarchical Clustering, and Affinity Propagation.
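To make the idea concrete, here is a minimal sketch of K-Means on one-dimensional toy points. The data is made up, and production code would use a library implementation such as scikit-learn's KMeans:

```python
import random

# A minimal sketch of K-Means: alternate between assigning each point
# to its nearest center and moving each center to its cluster's mean.

def kmeans_1d(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]
print(kmeans_1d(data, 2))  # two centers, one near 1 and one near 10
```

Because the data was never labeled, the algorithm discovers the two groups on its own, which is exactly the unsupervised behavior described above.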
Examples of Clustering Algorithms
Clustering algorithms are especially useful when you have a set of unlabeled data that you want to make sense of. Some real-life examples of clustering algorithms include spam filtering, marketing and sales targeting, and classifying network traffic. With Clustering, people can be grouped with similar traits and purchase history. Using this classification of data analysis, businesses in marketing or ad generation can target favorable prospects to increase efficiency and reduce cost.
4. Support Vector Machines
A Support Vector Machine (SVM) is a supervised machine learning algorithm that splits the given data points into different groups. Although SVM is better suited to classification problems, it can be used for regression as well. The SVM algorithm plots the given data values and draws the line that best splits the data into categories, maximizing the distance between the line and the nearest points of each category.
As shown in the image above, the algorithm plots all the given points and then draws straight lines to split them into categories. Line A splits the data better than line B, resulting in more accurate predictions. A model using line B may confuse the two points close to the line, resulting in false positives. If our model runs on only two features, the SVM algorithm splits the data with a single straight line, while with more than two features a hyperplane is drawn to accommodate the additional dimensions.
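The margin idea behind why line A beats line B can be sketched directly. The two candidate lines and the points below are hypothetical stand-ins for the figure; a real SVM solver (e.g. scikit-learn's SVC) would find the best line automatically:

```python
import math

# A minimal sketch of the SVM margin: the better separating line is the
# one whose distance to the nearest point of either class is largest.

def margin(line, points):
    """Smallest distance from any point to the line a*x + b*y + c = 0."""
    a, b, c = line
    norm = math.hypot(a, b)
    return min(abs(a * x + b * y + c) / norm for x, y in points)

points = [(1, 1), (2, 1), (1, 2),   # class 1
          (5, 5), (6, 5), (5, 6)]   # class 2
line_a = (1, 1, -7)  # x + y = 7, runs halfway between the groups
line_b = (1, 1, -4)  # x + y = 4, hugs class 1
print(margin(line_a, points))  # larger margin: the better separator
print(margin(line_b, points))  # smaller margin: risks misclassification
```

Line A leaves a wide buffer on both sides, while line B passes close to class 1, which is exactly why points near it can be misclassified.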
Uses of Support Vector Machines
SVM models are generally used to classify different data into different categories. Some examples of SVM include identifying gene classifications, handwriting recognition, and protein remote homology detection. Many smartphones and scanners now have handwriting or text recognition. This can further be used to efficiently translate from one language to another with the help of SVMs.
5. K-Nearest Neighbor
The K-nearest neighbor algorithm, KNN for short, is a supervised machine learning algorithm that groups data points based on the features they have in common. KNN can be used for both regression and classification but is more commonly used for classification problems.
KNN is known as a lazy algorithm: rather than building an explicit model during training, it simply memorizes the data it receives and defers computation until it is asked to make a prediction.
An example of a KNN model would be one that identifies cats and dogs. When the model receives a new image of a dog, it classifies it according to the stored examples whose features it most closely resembles.
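The cat-and-dog example can be sketched in a few lines of KNN. The two-dimensional feature vectors below are invented for illustration (imagine simple measurements extracted from each image):

```python
import math
from collections import Counter

# A minimal sketch of K-Nearest Neighbors classification: find the k
# stored examples closest to the query and take a majority vote.

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; classify query by vote."""
    neighbors = sorted(train,
                       key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors for labeled training images
train = [((1.0, 1.0), "cat"), ((1.2, 0.9), "cat"), ((0.9, 1.1), "cat"),
         ((4.0, 4.2), "dog"), ((4.3, 3.9), "dog"), ((3.8, 4.1), "dog")]
print(knn_predict(train, (1.1, 1.0)))  # -> cat
print(knn_predict(train, (4.1, 4.0)))  # -> dog
```

Note that all the work happens at prediction time, which is what makes KNN a lazy learner: "training" is just storing the labeled examples.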
Uses of K-Nearest Neighbor
Some real-life examples of KNN include predicting the chance a patient may have breast cancer based on the patient's previous history or similar patients and their outcomes. KNN can also be used as a recommender system for shopping and suggestions based on current user history or other user extrapolation. An e-commerce site may suggest other products based on what kind of products you are looking to shop for. Different shoppers may obtain different suggestions when viewing a specific product. Suppose the item is a spray paint; an automotive shopper may be suggested items such as car vinyl wrap whereas a home improvement shopper may be suggested items such as wallpaper or wood stains.
We only covered 5 of the 10 most popular data science algorithms this week. Next week we will continue the list with more advanced data science algorithms that are being used today. If you want to start using data science algorithms for your business, we would be happy to help or point you in the right direction with any questions you might have! Otherwise, browse more articles related to computer vision, machine learning, and a plethora of other topics over on our blog.