Advertisements

Working with built-in datasets in R

This is a quick cheat sheet of commands for working with built-in datasets in R.

R has a number of base datasets that come with the install, and there are many packages that also include additional datasets.  These are very useful and easy to work with.

 

> ?datasets                                  # to get help info on the datasets package

> library(help=”datasets”)     # provides detailed information on the datasets package, including listing and description of the datasets in the package

> data()                                       # lists all the datasets currently available by package

data(package = .packages(all.available = TRUE))  # lists all the available datasets by package even if not installed

> data(EuStockMarkets)            # loads  the dataset EuStockMarkets into the workspace

> summary(AirPassengers)      # provides a summary of the dataset AirPassengers

> ?airquality                                    # outputs in the Help window, information about the dataset “airquality”

> str(airmiles)                                 # shows the structure of the dataset “airmiles”

> View(airmiles)             # shows the data of the dataset “airmiles” in spreadsheet format

 

 

Advertisements

QlikView Sheet Objects Quick Overview

This post provides a quick overview for each of the QlikView Sheet Objects.
Sheet Objects are the various objects that a developer or analyst places on a QlikView document’s sheet to provide the rich functionality of dashboards.

By right-clicking on an open area in a sheet, the below menu will appear.  Select “New Sheet Object” to see the available Sheet Object options – which are – List Box, Statistics Box, Multi Box, Table Box, Chart, Input Box, Current Selections Box, Button, Text Object, Line/Arrow Object, Slider/Calendar Object, Bookmark Object, Search Object, Container, Custom Object, and System Table.
Qlik_SheetObject_ListBox

The sheet objects can also be created from the Design Toolbar selections.
Qlik_SheetObject_DesignMenu

Some of our example images are taken from the Movies Database application that comes with QlikView.

List Box
The “List Box” is one of the most heavily used QlikView sheet objects.  It lists all the distinct values of a particular field. For example, you may have a List Box that displays all the products available within your data set.  The List Box is often used as a menu, where a user selects a value to analyze the data based on that value.
Qlik_SheetObject_ListBox_example2
The above example displays 3 List Boxes – the Director list box, the Actor list box, and the Title list box. When a user selects one or more values from one or more list boxes 9or any other object), the data in all other objects cascade to show only records relevant to that combination of selections.

Statistics Box
The “Statistics Box” is used to show aggregated representations of the data.  There are several statistical aggregation functions that are available, such as Sum, Average, Min, Max, Std Dev, Null Count, and several more.
Qlik_SheetObject_StatisticsBox_example
The above example shows “Statistics” for the Rating field – the number of ratings, the average rating, the min and max ratings, and the average and median ratings.

Multi Box
A Multi Box allows you to display several fields in a single object, with the values for each field available via a drop down. This object can be described as a set of List Boxes compacted together. The example screenshot below shows a Multi Box with the 3 fields, Director/Actor/Title, and by clicking any of the fields, you will get a drop down of all the values for that field – similar to a List Box for that field.
Qlik_SheetObject_MultiBox_example

Clicking on Director will display the list of all Directors, as shown below. Similarly, you can Actor or Title to get those respective drop down lists.
Qlik_SheetObject_MultiBox_dropdown_example

Table Box
The Table Box, as the name implies, displays a table of records for the selected data fields.
Qlik_SheetObject_TableBox_example

Chart
The Chart object is used for creating charts of various types. Some of the chart types available include bar, line, combo, radar, scatter, grid, pie, block, funnel, gauge, and mekko charts, and pivot and straight tables.
Qlik_SheetObject_Chart
The above shows one of the properties dialog tabs for the chart object – which shows some of the various chart types you can create.

Input Box
The Input Box object allows for users to enter a value. (which sets a variable that can be used for further processing)
Qlik_SheetObject_InputBox

Current Selections Box
The Current Selections Box displays all the selections that the user has selected. The Current Selections box is usually placed at the top of a sheet (dashboard page) in a position that allows users to easily see the selections that have been made.
Users are able to clear selections by clicking on the eraser beside the selection in the Current Selections Box, and the data on the sheet will change accordingly.
Qlik_SheetObject_CurrentSelectionsBox
In the example above, the user has selected Rating 5 and Length Range 1 to 1.5 hrs.

Button
The Button object creates a button that can be configured to execute various types of actions. When clicked the configured action will be executed.
Qlik_SheetObject_Button

Text Object
The Text Object is used for adding simple text to the document/sheet. It can be used to display informational messages, directions, descriptions or titles and labels, or any kind of information that needs to be shared with viewers/users of the document.
Qlik_SheetObject_TextObject

Line/Arrow Object
This object is used to create a line with or without arrows, and vertical or horizontal or diagonal. This object can be used for many purposes such showing relationships and flows.
Qlik_SheetObject_LineArrow

Slider/Calendar Object
The Slider/Calendar object is used to create either a slider with multiple values that the user can slide across to select, or a calendar that the user can use to find and select a date.
Qlik_SheetObject_Slider                 Qlik_SheetObject_Calendar
In the above slider, the user has selected Rating 4.  In the calendar object, clicking the grid icon opens a calendar that allows selection.

Bookmark Object
The Bookmark object is used to save the selection state of the current sheet. If a bookmark is created after a specific Actor has been selected, when you select that bookmark, the selection and corresponding data in the sheet will be displayed based on that saved selection state of the bookmark.
Multiple bookmarks can be saved for each sheet. Each having a different set of saved selections.
Qlik_SheetObject_Bookmark

Search Object
The Search Object allows users to search for values in entire data set or within a specific set of fields.
Qlik_SheetObject_Search

Container
The Container is a sheet space saving object that can contain multiple other objects, such as multiple chart objects. This allows you to display multiple views of the data within the same sheet “real estate” and change the view by clicking on the appropriate tab to get to each chart.
Qlik_SheetObject_Container
In the above example, the container contains 3 objects (3 charts), and therefore displays 3 tabs across the top, one for each chart. There are different possible orientations for the tabs – they can be to the left, right, top (as in this example), or bottom.

Custom Object
The custom object is a sheet object specifically intended to carry custom defined OCX replacement controls.
The Custom Object: Object Menu can be accessed from the Object menu, when the custom object is the active object.
Replacement controls are windowless OCX controls programmed by Qlik, yourself or third party vendors, which comply with the specifications for QlikView replacement OCX controls. The OCX control will be drawn in a rectangle on the sheet outlined by the underlying custom object. Communication between the OCX control and the QlikView document is maintained via the internal macro interpreter.

System Table
The System Table object creates a pivot table that shows the data structure of the qlikview document.
Qlik_SheetObject_SystemTable

In future posts, we will be covering each object in depth.

Working with special character letters in Oracle database (grave accent, acute accent, circumflex accent, tilde, and umlaut)

We have some external data feeds that we receive as csv files, and we load them into our data warehouse and process them.    It turns out that data values that include special characters were getting messed up at some point before or during the load to the Oracle database.

In this post, I will go over one way to select, insert or update data values that contain these special characters.  The “special characters” I am referring to are the grave accent, acute accent, circumflex accent, tilde, and umlaut.

The most common one we come across is the “acute accent” which is present in words such as café or entrée (accent over the e in both words).

If you want to insert these words without the accents, into an example table, WORDS, with a single column, SPECIAL_CHAR_WORD, it would simply be:

insert into WORDS (SPECIAL_CHAR_WORD) values ('cafe');
insert into WORDS (SPECIAL_CHAR_WORD) values ('entree');

But if you want to insert these words with the accents, then you would need to do this:

insert into WORDS (SPECIAL_CHAR_WORD) values UNISTR('cafe\0301');
insert into WORDS (SPECIAL_CHAR_WORD) values UNISTR('entre\0301e');

To select the word café with the accent, run this statement:

select UNISTR('cafe\0301') from dual;

Once a column value is already in a table in the proper format, you can simply select the column name using a normal SQL select statement, and the output will show properly (maybe with just the need for some formatting as you will later in the article).

And for update – to update the word entree with the word entrée, run this statement:

update WORD set SPECIAL_CHAR_WORD = UNISTR('entre\0301e') where SPECIAL_CHAR_WORD = 'entree';

To see several special characters (grave accent, acute accent, circumflex accent, tilde, and umlaut), run this statement …

select UNISTR('bare\0300ge') from dual   -- barège    -- grave accent
union
select UNISTR('entre\0301e') from dual   -- entrée    -- acute accent
union
select UNISTR('pa\0302turer') from dual  -- pâturer   -- circumflex accent
union
select UNISTR('jalapen\0303o') from dual -- jalapeño  -- tilde
union 
select UNISTR('fu\0308r') from dual;          -- für    --umlaut

… the output would look like this … which seems not quite right …
Special_Characters_UNISTR

Add the COMPOSE statement (which is just for formatting) as shown below …

select COMPOSE(UNISTR('bare\0300ge')) as SPECIAL_CHAR_WORD from dual   -- barège    -- grave accent
union
select COMPOSE(UNISTR('entre\0301e')) from dual   -- entrée    -- acute accent
union
select COMPOSE(UNISTR('pa\0302turer')) from dual  -- pâturer   -- circumflex accent
union
select COMPOSE(UNISTR('jalapen\0303o')) from dual -- jalapeño  -- tilde
union 
select COMPOSE(UNISTR('fu\0308r')) from dual;     -- führer    --umlaut

and the output will look like this …

Special_Characters_UNISTR_and_COMPOSE2

As  you can see, the key to this is knowing the code for the special character you need, and then using the UNISTR function to add the special character to the rest of the text, and if necessary, use COMPOSE for display purposes.

Thanks for reading. Hope you found this helpful.

Maximum number of characters allowed in the Informatica SQL Qualifier & Lookup transformations’ SQL Query

The Informatica SQL Qualifier and Lookup transformations’ SQL Query parameter has a limitation on the number of characters that your SQL can contain. There is no limitation on the number of lines, the limitation is on the number of characters.

Your SQL is allowed a maximum of of 32,767 characters. If you copied and pasted SQL into the field, after 32767 characters the query will be truncated.  If you are typing into the field, after 32767 characters, no more characters can be entered.

To reduce your SQL character count, removing unnecessary spaces used for formatting will help. You may also examine your aliases to see if they can be shortened.

Statistics Basics – Descriptive vs Inferential Statistics

Descriptive Statistics
Statistics that quantitatively describes an observed data set. Analysis for descriptive statistics is performed on and conclusions drawn from the observed data only, and does not take into account any larger population of data.

Inferential Statistics
Statistics that make inferences about a larger population of data based on the observed data set. Analysis for inferential statistics takes into account that the observed data is taken from a larger population of data, and infers or predicts characteristics about the population.

Statistics Basics – Measures of Central Tendency & Measures of Variability

Measures of Central Tendency and Measures of Variability are frequently used in data analysis.  This post provides simple definitions of the common measures.

 

Measures of Central Tendency

Mean / Average – sum of all data points or observations in a dataset divided by the total number of data points or observations in the dataset.

The mean or average of this dataset with 5 numbers {2, 4, 6, 8, 10} is: 6

Sum of all data points:     (2+4+6+8+10)
Divided by:                       ———————–  = 6
Number of data points:              5

Median – with the values (data points) in the dataset listed in increasing (ascending) order, the median is the midpoint of the values, such that there are an equal number of data points above and below the median.  If there are an odd number of data points in the dataset, then the median value will be a single midpoint value. If there an even number of data points in the dataset, then the median value will be the mean/average of the two midpoint values.

The median of the same dataset {2, 4, 6, 8, 10} is:  6
This dataset has an odd number of data points (5), and the middle data point is the value 6, with 2 numbers below (2, 4) and 2 numbers above (8, 10).

Using an example of a dataset with an even number of data points:
The median of this dataset {2, 4, 6, 8, 10, 12} is: (6 + 8) / 2 = 7
Since there are 2 middle data points (6, 8), then we need to calculate the mean of those 2 numbers to determine the median.

Mode – the data point that appears the most times in the dataset.

Using our original dataset {2, 4, 6, 8, 10}, since each of the values only appear once, none appearing more times than the others, this dataset does not have a mode.

Using a new dataset {2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10}, the Mode in this case is: 4
4 is the value that appears the most times in the dataset.

Measures of Variability

Min – the minimum value of the all values in the dataset.
Min {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 1.

Max – the maximum value of the all values in the dataset.
Max {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 9.

Variance – a calculated value that quantifies how close or how dispersed the values in the dataset are to/from their average/mean value.  It is the average of the squared differences from the mean.

Variance of {2, 3, 4, 5, 6} is calculated as follows …

First find the Mean.  Mean = (2 + 3 + 4 + 5 + 6) / 5 = 4

Then, find the Squared Differences from the Mean … where ^2 means squared …
(2 – 4)^2 = (-2)^2 = 4
(3 – 4)^2 = (-1)^2 = 1
(4 – 4)^2 = (0)^2 = 0
(5 – 4)^2 = (1)^2 = 1
(6 – 4)^2 = (2)^2 = 4
Average of Squared Differences: (4 + 1 + 0 + 1 + 4) / 5 = 2

Standard Deviation – a calculated value that quantifies how close or how dispersed the values in the dataset are to/from each other.  It is the square root of the Variance (defined above).

For the above dataset, Standard Deviation {2, 3, 4, 5, 6} = Square Root (2) =~ 1.414

Kurtosis – a calculated value that represents how close the tail of the distribution of the dataset is to the tail of a normal distribution*.

Skewness – a calculated value that represents how close the symmetry of the distribution of the dataset is to the symmetry of a normal distribution*.

* A normal distribution, also known as the bell curve, is a probability distribution in which most values are toward the center (closer to the average) and less and less observations occur as you go further from the center.

Range – the difference between the largest number in the dataset and the smallest number in the dataset.
Range {2, 4, 6, 8, 10} = 10 – 2 = 8

 

Thanks for reading!

 

Updating R packages

In this post, I cover updating R packages in RStudio.  You have a few options for updating your R packages in RStudio.

(Option 1) From the menu, select Tools -> “Check for Package Updates”RStudio_Menu_Tools_CheckforPackageUpdates

You will get this message if all packages are up-to-date:

RStudio_AllPackagesUpToDate

(Option 2) From the Packages tab in the bottom-right pane, click on “Update”RStudio_PackagesTab_Update

(Option 3) Execute the command update.packages()

RStudio_Script_UpdatePackages

You may see output like this …

RStudio_Script_UpdatePackages2

Enter “y” to update, “N” to not update, and “c” to cancel.

Thanks for reading!