
Statistics Basics – Descriptive vs Inferential Statistics

Descriptive Statistics
Statistics that quantitatively describe an observed data set. Descriptive analysis is performed on the observed data only, and its conclusions apply to that data only; it does not take into account any larger population of data.

Inferential Statistics
Statistics that make inferences about a larger population of data based on the observed data set. Analysis for inferential statistics takes into account that the observed data is taken from a larger population of data, and infers or predicts characteristics about the population.


Statistics Basics – Measures of Central Tendency & Measures of Variability

Measures of Central Tendency and Measures of Variability are frequently used in data analysis.  This post provides simple definitions of the common measures.

 

Measures of Central Tendency

Mean / Average – sum of all data points or observations in a dataset divided by the total number of data points or observations in the dataset.

The mean or average of this dataset with 5 numbers {2, 4, 6, 8, 10} is: 6

Mean = (sum of all data points) / (number of data points) = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
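
The same calculation can be sketched in R, the language used in the later posts here. This is only an illustration; the variable names are my own.

```r
# Example dataset from the text
x <- c(2, 4, 6, 8, 10)

# Mean by hand: sum of the values divided by the number of values
manual_mean <- sum(x) / length(x)

# Built-in equivalent
builtin_mean <- mean(x)

print(manual_mean)   # 6
print(builtin_mean)  # 6
```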

Median – with the values (data points) in the dataset listed in increasing (ascending) order, the median is the midpoint of the values, such that there are an equal number of data points above and below the median.  If there is an odd number of data points in the dataset, then the median will be the single midpoint value. If there is an even number of data points in the dataset, then the median will be the mean/average of the two midpoint values.

The median of the same dataset {2, 4, 6, 8, 10} is:  6
This dataset has an odd number of data points (5), and the middle data point is the value 6, with 2 numbers below (2, 4) and 2 numbers above (8, 10).

Using an example of a dataset with an even number of data points:
The median of this dataset {2, 4, 6, 8, 10, 12} is: (6 + 8) / 2 = 7
Since there are 2 middle data points (6, 8), then we need to calculate the mean of those 2 numbers to determine the median.
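
As a quick check, both cases above can be reproduced with R's built-in median() function. The variable names below are illustrative only.

```r
# Odd number of data points: the median is the single middle value
odd_set <- c(2, 4, 6, 8, 10)
odd_median <- median(odd_set)

# Even number of data points: the median is the mean of the two middle values
even_set <- c(2, 4, 6, 8, 10, 12)
even_median <- median(even_set)

# median() orders the values itself, so unsorted input gives the same answer
unsorted_median <- median(c(10, 2, 8, 4, 6))

print(odd_median)       # 6
print(even_median)      # 7
print(unsorted_median)  # 6
```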

Mode – the data point that appears the most times in the dataset.

Using our original dataset {2, 4, 6, 8, 10}, each value appears only once, so no value appears more often than the others, and this dataset does not have a mode.

Using a new dataset {2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10}, the Mode in this case is: 4
4 is the value that appears the most times in the dataset.
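
Base R does not include a function for the statistical mode (its mode() function reports a variable's storage type instead), so the sketch below counts the values with table(). The helper name stat_mode is hypothetical, and returning NULL when no value repeats is simply one way to follow the "no mode" convention used above.

```r
# Hypothetical helper: returns the most frequent value(s),
# or NULL when every value appears only once (no mode).
stat_mode <- function(values) {
  counts <- table(values)              # how many times each value appears
  if (max(counts) == 1) {
    return(NULL)                       # no value repeats, so no mode
  }
  # Keep the value(s) with the highest count; ties return several values
  as.numeric(names(counts)[counts == max(counts)])
}

no_mode  <- stat_mode(c(2, 4, 6, 8, 10))
one_mode <- stat_mode(c(2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10))

print(no_mode)   # NULL
print(one_mode)  # 4
```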

Measures of Variability

Min – the minimum value of all the values in the dataset.
Min {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 1.

Max – the maximum value of all the values in the dataset.
Max {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 9.

Variance – a calculated value that quantifies how dispersed the values in the dataset are around their mean.  It is the average of the squared differences from the mean.

Variance of {2, 3, 4, 5, 6} is calculated as follows …

First find the Mean.  Mean = (2 + 3 + 4 + 5 + 6) / 5 = 4

Then, find the Squared Differences from the Mean … where ^2 means squared …
(2 – 4)^2 = (-2)^2 = 4
(3 – 4)^2 = (-1)^2 = 1
(4 – 4)^2 = (0)^2 = 0
(5 – 4)^2 = (1)^2 = 1
(6 – 4)^2 = (2)^2 = 4
Average of Squared Differences: (4 + 1 + 0 + 1 + 4) / 5 = 2
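
The steps above can be sketched in R as follows. One caveat worth knowing: R's built-in var() divides by n − 1 (the sample variance) rather than by n, so it gives a different number than the average of squared differences described in this post. The variable names are illustrative.

```r
x <- c(2, 3, 4, 5, 6)

# Squared differences from the mean: 4 1 0 1 4
squared_diffs <- (x - mean(x))^2

# Average of the squared differences (divides by n), as in the text
population_variance <- mean(squared_diffs)

# Built-in var() divides by n - 1 instead (sample variance)
sample_variance <- var(x)

print(population_variance)  # 2
print(sample_variance)      # 2.5
```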

Standard Deviation – a calculated value that quantifies how dispersed the values in the dataset are around their mean, expressed in the same units as the data.  It is the square root of the Variance (defined above).

For the above dataset, Standard Deviation {2, 3, 4, 5, 6} = Square Root (2) =~ 1.414
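
A minimal sketch of this in R. As with var(), the built-in sd() uses the n − 1 (sample) form, so the value from the text is computed by hand here. The variable names are illustrative.

```r
x <- c(2, 3, 4, 5, 6)

# Square root of the average squared difference from the mean (as in the text)
population_sd <- sqrt(mean((x - mean(x))^2))

# Built-in sd() is the square root of the n - 1 (sample) variance
sample_sd <- sd(x)

print(round(population_sd, 3))  # 1.414
print(round(sample_sd, 3))      # 1.581
```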

Kurtosis – a calculated value that describes how heavy the tails of the dataset's distribution are, usually in comparison with the tails of a normal distribution* (which has a kurtosis of 3).

Skewness – a calculated value that describes how asymmetric the distribution of the dataset is.  A symmetric distribution, such as a normal distribution*, has a skewness of 0.

* A normal distribution, also known as the bell curve, is a probability distribution in which most values are toward the center (closer to the average) and fewer and fewer observations occur as you go further from the center.
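
Base R does not ship skewness or kurtosis functions, so the sketch below uses the simple moment-based definitions (third and fourth central moments scaled by powers of the second). Other definitions with small-sample corrections exist and give slightly different numbers; the function names here are my own, not from any package.

```r
# k-th central moment: average of (value - mean)^k
central_moment <- function(values, k) {
  mean((values - mean(values))^k)
}

# Moment-based skewness: 0 for a perfectly symmetric dataset
skewness <- function(values) {
  central_moment(values, 3) / central_moment(values, 2)^1.5
}

# Moment-based kurtosis: a normal distribution has a value of 3
kurtosis <- function(values) {
  central_moment(values, 4) / central_moment(values, 2)^2
}

x <- c(2, 3, 4, 5, 6)
symmetric_skew <- skewness(x)
flat_kurtosis  <- kurtosis(x)

# A dataset with one large value on the right has positive skewness
right_skew <- skewness(c(1, 1, 1, 2, 10))

print(symmetric_skew)  # 0
print(flat_kurtosis)   # 1.7
```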

Range – the difference between the largest number in the dataset and the smallest number in the dataset.
Range {2, 4, 6, 8, 10} = 10 – 2 = 8
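
For reference, here is how Min, Max, and Range might be computed in R. Note that R's range() returns the pair (minimum, maximum) rather than a single number, so diff() is used to get the difference described above. The variable names are illustrative.

```r
values <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9)

smallest <- min(values)
largest  <- max(values)

print(smallest)  # 1
print(largest)   # 9

# range() returns c(min, max); diff() subtracts the first from the second
x <- c(2, 4, 6, 8, 10)
value_range <- diff(range(x))

print(value_range)  # 8
```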

 

Thanks for reading!

 

Exploring the RStudio Interface

In this post we will explore the RStudio interface.  This is where you will manage your R environment, issue commands for processing and analyzing data, create scripts, view results, and much more.  Below is an image of the default RStudio interface.

RStudio_Environment

On the left:
Console – the window where you enter commands, and where output is displayed.

On the top-right:
Environment tab – shows the variables and values created through the console
History tab – shows the history of past executed commands

On the bottom-right:
Files tab – displays folders and files from the file system, from which you can select files, set working directory, create folders, copy and move folders and files, and more.
Plots tab – displays the plots that have been created and allows you to export them.
Packages tab – displays all the packages currently installed and available.  Loaded packages will have the checkbox checked and packages must be loaded before they can be used.
Help tab – useful for getting help about R and R packages, and keyword search is available which can be very helpful when you don’t know exactly what you are looking for.
Viewer tab – can be used to view local web content, such as static HTML files written to the session temporary directory or a locally run web application.

On the top-left (when a script is created or opened):
Script pane and tabs – when you create or open an R script, a new pane is created in the top-left of the application window, and the Console pane is shifted down to the bottom-left area.

RStudio_Environment_with_Script_pane

A new Script tab will open in this pane for each new script opened or created. From this window, you will be able to run your script line by line or in its entirety, among many other functions.

Thanks for reading!

Installing, Loading, Unloading, and Removing R Packages in RStudio

R has thousands of packages available for statistics and data analytics, but before you can use them, they need to be installed.  In this post I cover installing, loading, unloading, and removing R packages in RStudio. In these examples, I use the ggplot2 package – a popular graphics and visualization package in R.  Wherever you see ggplot2 in the examples below, you can replace it with the package you want to perform these actions on.

To install a package via the User Interface

In RStudio, select Tools -> Install Packages from the main menu, or click Install in the Packages tab on the bottom-right.
R_Installing_Packages_ToolsMenu

The Install Packages dialog appears.
R_Installing_Packages_ToolsMenu_InstallDialog

Start typing the name of the package you want to install, and a list of all packages that start with the letters you have typed will show up in the selection list.
R_Installing_Packages_ToolsMenu_InstallDialog_PartialNameFind

Select (or type the full name of) the package you want to install, ensure that “Install dependencies” is checked, and click “Install”.
The statement will be automatically entered and run as shown below.
R_Installing_Packages_ToolsMenu_Output

And the output will show if the package is successfully installed.
R_Installing_Packages_ToolsMenu_Output2

At this point, you will be able to see the package in the list in the Packages tab on the right.
R_Installing_Packages_ToolsMenu_Output3

To install a package via script
Instead of using the user interface (menu), you can also install packages directly via script.

install.packages("ggplot2")

See script statement below.  And as before, the package shows in the Packages tab.
R_Installing_Packages_Script
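
If you rerun a script often, you may not want to reinstall the package every time. One possible pattern, sketched below, checks whether the package is already available before calling install.packages(). The helper name install_if_missing is hypothetical, not part of R.

```r
# Hypothetical helper: installs a package only when it is not already available.
# Returns FALSE if nothing needed to be installed, TRUE if an install was attempted.
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))
  }
  install.packages(pkg)
  invisible(TRUE)
}

# "stats" ships with R, so no install is needed here
needed_install <- install_if_missing("stats")
print(needed_install)  # FALSE

# Usage with the package from this post (uncomment to run):
# install_if_missing("ggplot2")
```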

After a package has been installed, it needs to be loaded before you can use it.

Loading and Unloading a package (via user interface or script):

To load a package you can simply check the checkbox beside the package name in the Packages tab – as shown by the yellow box highlight below.  This will automatically enter and execute the command shown with the yellow arrow.

Or you can enter the script, by entering the command as shown with the green arrow:

library("ggplot2")

As an alternative, the require(ggplot2) command will also load the package.

To unload the package, you can simply uncheck the checkbox beside the package name in the Packages tab, or enter the command shown by the red arrow:

detach("package:ggplot2", unload=TRUE)

R_Loading_Unloading_Packages

You will notice that after running the detach command, the package checkbox will be unchecked.
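
To confirm from the console whether a package is currently loaded, you can look for it in the search path. The sketch below also shows one practical difference between library() and require(): when a package is missing, library() stops with an error, while require() returns FALSE with a warning. The helper name is_attached and the nonexistent package name are made up for illustration.

```r
# Hypothetical helper: TRUE if the package is attached to the search path
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

base_attached <- is_attached("base")                      # base is always attached
fake_attached <- is_attached("aPackageThatDoesNotExist")  # made-up name

print(base_attached)  # TRUE
print(fake_attached)  # FALSE

# require() returns FALSE (with a warning) instead of stopping with an error
loaded_ok <- suppressWarnings(
  require("aPackageThatDoesNotExist", character.only = TRUE, quietly = TRUE)
)
print(loaded_ok)  # FALSE

# With ggplot2 installed, the same check can be used around library()/detach():
# library("ggplot2"); is_attached("ggplot2")
# detach("package:ggplot2", unload = TRUE); is_attached("ggplot2")
```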

To remove (uninstall) a package (via user interface or script):

To remove a package, you can simply click the “x” icon shown to the right of the package in the Packages window. See the yellow box highlight beside ggplot2 below.

Or you can run the script command as shown below in the Console window.

remove.packages("ggplot2")

R_Removing_Packages_Script_or_GUI

The below shows the output after removing a package.  You will notice that the package is no longer in the Packages list on the right hand side.  In this example, ggplot2 is no longer in the list of packages.
R_Removing_Packages_Output

An advantage of using the script option instead of the user interface methods to perform the above actions is that you will have a history of what you have done.

Thanks for reading!

Installing RStudio on Windows

RStudio is an open-source integrated development environment (IDE) for R. Commercial versions with expanded capabilities are also available (at a cost).  It runs on the desktop on multiple operating systems, or in a browser connected to an RStudio server.  RStudio includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, and workspace management. This post covers installing RStudio.  Note that R needs to be installed first – see this post for installing R.

To get started, go to http://www.rstudio.com.

Under RStudio, click Download.
RStudio_download

Choose the desired version of RStudio.  You will likely want the “RStudio Desktop (Open Source License)” version. On the same page, you will be able to read about the various options available – free and pay versions.
RStudio_choose_version

This will bring you down in the page to the installers.  Choose the installer that is appropriate for you.  In this example, we are installing on Windows, and so we chose the “RStudio 1.0.153 – Windows Vista/7/8/10” version.  Note: RStudio requires that R is installed.  If you have not already installed R, do so first (see this post for Installing R).
RStudio_choose_installer

After the download is complete, run the exe by double-clicking on it.
RStudio_install_run_exe

Click Next at the Welcome screen.
RStudio_install_welcome_1

Choose the install directory, click Next
RStudio_install_location_2

Choose a Start Menu Folder, click Install
RStudio_install_start_menu_folder_3

Installing …
RStudio_install_installing_4

Complete the installation.
RStudio_install_complete_5

Run RStudio
RStudio_install_RStudio_icon

RStudio IDE
RStudio_install_run_RStudio

You will notice that the left window “Console” is the same as the “R Console” window in the stand-alone R installation.  This is because RStudio is built on top of R.

Good luck on your R journey!

Installing R on Windows

R is an open source software platform for data manipulation, statistical computing, calculation, analytics, and graphics.  It provides a wide variety of statistical/mathematical and high-quality graphical capabilities.  Its statistical capabilities include linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.
You will find R useful in Analytics and Business Intelligence environments where data needs to be analyzed to uncover patterns, improve understanding, and help make predictions and decisions.
In this post, we cover the installation of R.

To get started, go to http://www.r-project.org.

Click the “download R” link (underlined in yellow below).
rprojectorg

Choose the CRAN (Comprehensive R Archive Network) mirror location closest to you.
ChooseBestLocation

Choose the version for your install computer’s operating system (OS). In this example, we are installing on Windows – so we chose “Download R for Windows”.
R_ChooseOS

Assuming this is your first install, click “base” or “Install R for the first time”.
R_Install

Then, click “Download R 3.4.1 for Windows” (or whatever the appropriate version is at the time)
R_download

After the download is complete, go to the download directory, and double-click the R exe to run it.
R_run_exe

Choose your language
R_install_lang

Click Next
R_install_welcome_2

Review the license agreement, click Next
R_install_license_3

Accept the default directory or enter/select a new one.
R_install_dir_4

Select the components you want.  If your PC is 32-bit, then unselect 64-bit if it is shown as an option.  If your PC is 64-bit, you can install both 32-bit and 64-bit (default) or choose one of them.
R_install_components_5

Choose No and click Next (unless you want to customize the startup options for R, but this can be done later)
R_install_startup_6

Click Next
R_install_start_menu_folder_7

Choose Icon and Registry options
R_install_additional_8

Installing
R_install_installing_9

Click Finish to complete the installation
R_install_complete_10

Desktop and Quick Launch icons
R_install_desktop_icons     R_install_quicklaunch_icons

Run R.
R_install_runR

Next, we’ll cover installing RStudio.

Good luck on your R journey.

Data Science Fundamentals: Matching

This is a continuation of a series of Data Science Fundamentals posts.  In this post I will briefly describe Matching.

Matching, also known as Similarity Matching, is a technique of using data about objects to identify “like” objects. For example, Amazon or Walmart may use matching to identify “like” customers based on their browsing, liking, and purchasing history.

This information can then be used to provide product recommendations to these customers.

matching-recommendations

Product recommendations based on browsing and purchase history, and similarity matching
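
To make the idea concrete, here is a small sketch in R that uses cosine similarity, one common way to score how alike two customers are. This is not necessarily what any retailer actually uses; the customers, products, and purchase data below are entirely made up, and the function name cosine_similarity is my own.

```r
# Hypothetical purchase history: rows are customers, columns are products
# (1 = purchased, 0 = not purchased). All names and values are made up.
purchases <- rbind(
  alice = c(book = 1, lamp = 1, desk = 0, chair = 0),
  bob   = c(book = 1, lamp = 1, desk = 1, chair = 0),
  carol = c(book = 0, lamp = 0, desk = 1, chair = 1)
)

# Cosine similarity: 1 means identical purchase patterns, 0 means nothing in common
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

alice_bob   <- cosine_similarity(purchases["alice", ], purchases["bob", ])
alice_carol <- cosine_similarity(purchases["alice", ], purchases["carol", ])

print(round(alice_bob, 3))  # 0.816
print(alice_carol)          # 0

# Bob is the customer most "like" Alice, so recommend what Bob bought and Alice has not
recommended <- colnames(purchases)[purchases["bob", ] == 1 & purchases["alice", ] == 0]
print(recommended)  # "desk"
```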

The results of Matching can be used for Classification and Regression; and Matching underlies Clustering.  These techniques were described in previous posts.