Data Science Fundamentals: Regression

Data Science is very complementary to Business Intelligence, in that they are both used to gain insights from data. While Business Intelligence, generally speaking, is more about answering known questions, Data Science is more about discovery and providing information for previously unknown questions.

This is a continuation of a series of Data Science Fundamentals posts that I will be doing over the next few weeks. In this post, I will be covering Regression and will include an example to make it more meaningful. Previous posts covered Classification and Clustering. Upcoming posts over the next few days will cover Matching and other fundamental data science concepts.

Regression analysis is a predictive modeling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It can be used to predict the value of a variable and, with methods such as logistic regression, the class an entity belongs to. It also identifies the strength of the relationships between the variables and the strength of their impact on one another. There are many variations of regression, with linear and logistic regression being the most commonly used methods. The various regression methods will be explored at a later point in time.

As an example of how Regression can be used, you may identify products similar to a given product, that is, products in the same class or category as your subject product. You can then review the historical performance of those similar products under certain promotions, and use that to estimate/predict how well the subject product will perform under similar promotions.

As another example, you may use the classification of a customer or prospect to estimate/predict how much that customer/prospect is likely to spend on your products and services each year.
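The promotion example above can be sketched with a simple least-squares line. This is only a minimal illustration under stated assumptions: the discount percentages, the unit sales, and the helper name `fit_line` are all hypothetical, and a single predictor (the discount) is assumed.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history for similar products: discount (%) and units sold.
discounts = [5, 10, 15, 20, 25]
units_sold = [112, 118, 131, 139, 150]

slope, intercept = fit_line(discounts, units_sold)

# Estimate how the subject product might sell under a 30% discount.
predicted_units = slope * 30 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.1f}, predicted={predicted_units:.1f}")
```

The slope is a rough measure of the strength of impact: how many extra units each additional percentage point of discount is associated with in the historical data.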

Classification determines the group/class of an entity, whereas Regression determines where on the spectrum of that class (expressed as a numerical value) the entity falls. An example using a hotel customer – Classification: Elite Customer; Regression: 200 nights per year (on a scale of 100-366 nights per year) or top 10% of customers.

Data Science Fundamentals: Clustering

Like Business Intelligence, the essential purpose of Data Science is to gain knowledge and insights from data. This knowledge can then be used for a variety of purposes – such as, driving more sales, retaining more employees, reducing marketing costs, and saving lives.

This is a continuation of a series of Data Science Fundamentals posts that I will be doing over the next few weeks.  In this post, I will be covering Clustering and will include an example to make it more meaningful.  A previous post covered Classification. Upcoming posts over the next few days will cover Regression, Matching, and other data science fundamental concepts.

Clustering is similar to Classification in that both are used to categorize and segment data. But Clustering differs from Classification in that it segments the data into groups (clusters) that were not previously defined, or in some cases not even known. Clustering explores the data and finds natural groupings/clusters/classes without any targets (previously defined classes). This is called “unsupervised” segmentation. It clusters the data entities based on some similarity that makes them more like each other than like entities in other clusters. Therefore, this is a great first step if little is known about the data set.
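One common way to find such groupings is the k-means algorithm, sketched below. The post does not name a specific algorithm, so treat this as one possible illustration: the points, the choice of 3 clusters, and the fixed starting centroids are all assumptions made to keep the example small and repeatable.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical two-dimensional data with no labels attached.
points = [(1, 1), (1.5, 2), (1, 1.5),
          (8, 8), (8.5, 9), (9, 8),
          (1, 9), (1.5, 8.5), (2, 9)]

# Assumed starting centroids: one point picked from each region.
clusters, centroids = kmeans(points, [points[0], points[3], points[6]])

for cluster, centre in zip(clusters, centroids):
    print(centre, cluster)
```

No class labels are supplied anywhere; the three groups come purely from how close the points are to each other.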

Clustering: 3 clusters formed (with an outlier)

The Clustering process may yield clusters/groups that can later be used for Classification. Using the defined classes as targets is called “supervised” segmentation. In the diagram, there are 3 clusters that have been formed (red pluses, blue circles, green diamonds).

 

After a Clustering process is completed, there may be some data entities that are clustered by themselves.  In other words, they do not fall into any of the other clusters containing multiple entities.  These are classified as outliers.  An example of this can be seen in the diagram where there is an outlier in the top-left corner (purple square).  Analysis on these outliers can sometimes yield additional insight.
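The idea of an entity that ends up clustered by itself can be sketched with a simple distance-threshold grouping: points are linked when they lie within a chosen distance of each other, and any group of size one is treated as an outlier. The function name `threshold_clusters`, the data, and the `max_gap` value are hypothetical, and this is only one of several ways outliers can surface.

```python
import math


def threshold_clusters(points, max_gap):
    """Group point indices so that points connected by a chain of
    neighbours (each step no longer than max_gap) share a cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack = [start]
        cluster = [start]
        while stack:
            i = stack.pop()
            neighbours = [j for j in unvisited
                          if math.dist(points[i], points[j]) <= max_gap]
            for j in neighbours:
                unvisited.remove(j)
                stack.append(j)
                cluster.append(j)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical data: three tight groups plus one isolated point.
points = [(1, 1), (1.5, 2), (1, 1.5),
          (8, 8), (8.5, 9), (9, 8),
          (1, 9), (1.5, 8.5), (2, 9),
          (1, 20)]

clusters = threshold_clusters(points, max_gap=2.0)

# A cluster containing a single entity is flagged as an outlier.
outliers = [points[c[0]] for c in clusters if len(c) == 1]
print(clusters)
print(outliers)
```

The result depends heavily on the assumed `max_gap`: too small and every point becomes an outlier, too large and the outlier is absorbed into a neighbouring cluster.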

Software such as R and Python provide functions for performing cluster analysis/segmentation on datasets. Future posts will cover these topics along with more details on Clustering.

Data Science Fundamentals: Classification and Class Probability Estimation (Scoring)

Over the next 3 months, I will be focusing on Data Science and my next few posts will cover some fundamental topics of Data Science.

The essential purpose of Data Science, like Business Intelligence, is to gain knowledge and insights from data. This knowledge can then be used for a variety of purposes – such as, driving more sales, retaining more employees, reducing marketing costs, and saving lives.

In this post, I will be covering Classification and will include examples to make it more meaningful.  Upcoming posts over the next few days will cover Clustering, Regression, Matching, and other data science fundamental concepts.

Classification is the process of using the characteristics, features, and attributes of a data entity (such as a person, company, or thing) to determine what class (group or category) it belongs to, and then assigning it to that class. As an example, demographic data is usually expressed as classifications – marital status (married, single, divorced), income bracket (wealthy, middle-class, poor), homeowner status (homeowner or renter), age bracket (old, middle-aged, young), etc.

Shapes are classified by characteristics such as number of sides, length of sides, etc.

When a large amount of data needs to be analyzed, Classification needs to be an automated process. If the classes are not known ahead of time, a process called Clustering can be used on existing data to discover groups that can in some way be used to form the classes. (Clustering will be covered in an upcoming post.)

Class Probability Estimation (Scoring) is the process of producing a score that represents the probability of the data entity being in a particular class.  As an example, Income Bracket – top 5%.
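One simple way to produce such a score is to look at the nearest already-classified entities and report the fraction that belong to each class (a k-nearest-neighbours estimate). This is a sketch under assumptions, not the method the post prescribes: the feature values, the labels, the function name `class_probabilities`, and `k=3` are all hypothetical.

```python
import math
from collections import Counter


def class_probabilities(labelled_points, query, k=3):
    """Estimate class probabilities for `query` as the share of each
    class among its k nearest labelled neighbours."""
    nearest = sorted(labelled_points,
                     key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}


# Hypothetical, already-scaled features for entities with known classes.
labelled = [
    ((1.0, 1.0), "renter"),
    ((1.5, 2.0), "renter"),
    ((2.0, 1.0), "renter"),
    ((4.0, 4.0), "homeowner"),
    ((6.0, 6.0), "homeowner"),
    ((7.0, 6.5), "homeowner"),
    ((6.5, 7.5), "homeowner"),
]

scores = class_probabilities(labelled, query=(3.5, 3.5))

# Classification picks the most probable class; scoring keeps the number.
predicted_class = max(scores, key=scores.get)
print(predicted_class, scores)
```

Classification alone would report only the winning class, while the score also shows how close the call was.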

A few Use Cases and examples of Classification and Class Probability Estimation/Scoring are:

(1) Financial: credit risk – High-Risk, Medium-Risk, Low-Risk, Safe.
A person’s past credit history (or lack of one) will determine their credit score. And their credit score will determine what class of credit risk they fall into, and therefore, will determine if they get the loan, and how favorable the terms of the loan would be.

As an example of Class Probability Estimation (Scoring) for this use case, a person may fall in the Low-Risk class, but their credit score (sometimes called a FICO score) shows that they are at the low end of the Low-Risk class, which puts them on the border of Medium-Risk.
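The difference between the class and the position within the class can be sketched as follows. The band cut-offs below are invented purely for illustration and are not real lending thresholds; `classify_credit` is a hypothetical helper.

```python
# Hypothetical band floors, highest first. Real lenders define their own.
RISK_BANDS = [
    (760, "Safe"),
    (700, "Low-Risk"),
    (640, "Medium-Risk"),
    (0, "High-Risk"),
]


def classify_credit(score):
    """Return the risk class and how many points the score sits
    above the floor of that class."""
    for floor, label in RISK_BANDS:
        if score >= floor:
            return label, score - floor
    raise ValueError("score must be non-negative")


risk_class, margin = classify_credit(705)
print(risk_class, margin)
```

With these assumed cut-offs, a score of 705 is classified as Low-Risk, but the small margin shows it sits only a few points above Medium-Risk.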

(2) Marketing: Marketing offer/promotion interest – Highly likely, Likely, Unlikely
Based on past promotions and those who responded to them, classification can be used to determine the likelihood of a person being interested in a specific marketing offer/promotion. This is known as targeted marketing, where specific promotions are sent only to those who are likely to be interested. As a result, different classes/groups may receive different marketing messages from the same company.

As an example of Class Probability Estimation (Scoring) for this use case, a customer or prospect could be scored as 70% Unlikely, or 90% Highly Likely.

(3) Customer Base: Top-customer, Seasonal Customer, Loyal customer, High-Chance of Losing customer, …
A company may use some set of criteria to classify customers into various categories. These categories can be used for various customer-focused efforts, such as marketing, special offers, rewards, and more.

(4) Fraud detection & security:  Transaction or Activity occurrence – Highly Unusual, Unusual, Normal
Based on past activity and all other activities as a whole, a person’s activity/transaction can be classified as unusual or normal, and the appropriate actions taken to protect their accounts.
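A very simple version of this idea compares a new transaction with the person's own past amounts and measures how many standard deviations away it falls. The thresholds of 2 and 3 standard deviations, the amounts, and the function name `classify_transaction` are assumptions for illustration; a real fraud system would also use activity across all accounts and many more signals.

```python
import statistics


def classify_transaction(history, amount, unusual_z=2.0, highly_unusual_z=3.0):
    """Label a transaction by its distance from the historical mean,
    measured in sample standard deviations.
    Assumes `history` has at least two values that are not all equal."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    z = abs(amount - mean) / spread
    if z >= highly_unusual_z:
        return "Highly Unusual"
    if z >= unusual_z:
        return "Unusual"
    return "Normal"


# Hypothetical past transaction amounts for one account.
past_amounts = [20, 25, 22, 30, 28, 24, 26]

for amount in (27, 33, 400):
    print(amount, classify_transaction(past_amounts, amount))
```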

(5) Healthcare:
Data from past health analyses and treatments can be used to classify the level of a patient’s illness and to assign a treatment class. This will then drive the recommended treatment.

(6) Human behavior/Workforce:
Today’s workforce consists of multiple generations of workers (Baby Boomers, GenX, GenY/Millennials, etc.). Generational classification of people based on the period in which they were born is used for marketing purposes, but it is also used to help a diverse workforce understand team members of different generations and how to work with them.

There are of course many more types of classification and use cases. Feel free to share your use cases.