Working with data

Manipulate datasets

As seen during the previous lesson, some datasets are always available in R, and others are included in specific packages.

For example, in the previous lesson, we worked with mtcars, beav2 and hflights.

However, in the most common real-world scenario, we need to read and write our own data, typically starting from some sort of table.

Read and write files

In R, it is possible to acquire datasets in different ways, not only through packages. One of the simplest and most intuitive ways is reading from a file. There are many formats that can contain data. The simplest and most widespread are .csv (comma-separated values) and .tsv (tab-separated values) files.

One of the most used commands to read files is read.table(), which reads the content of a tabular file and saves it as a data frame. Let’s take a practical example.

Let’s consider the BodyTemperature dataset, which contains data about the body temperature of some patients. We can download the dataset to the most convenient folder (for example, data) and read it using read.table(), specifying the path to the file.

BodyTemperature <- read.table(file = "../data/BodyTemperature.txt", header = TRUE, sep = " ")

The first argument of read.table() is the path to the file. Then, we can indicate whether there is a header, i.e., whether the first row contains the names of the variables. Another argument is the separator, which is the character used in the file to delimit the elements of a line. The default separator of read.table() is white space (one or more spaces or tabs), but it is possible to indicate a different one. If the file had been a csv (comma-separated values), we could have indicated sep = ",". Likewise, we could have used the read.csv() function, which uses the comma as the default separator. If the values were separated by semicolons we could have used read.csv2() (which also expects a comma as the decimal mark), and read.delim() if they were separated by tabs (\t).

The function read.csv() also accepts a URL as an argument. In this way, R downloads the dataset by itself and saves it in the indicated variable.

(BodyTemperature <- read.csv(url('http://extras.springer.com/2012/978-1-4614-1301-1/BodyTemperature.txt'), sep=" "))

Similarly to read.table(), we can use write.table() to write a data frame (or a variable) to a file. Also in this case, we can use write.csv() to write a comma-separated values file. For example, the following command writes the dataset downloaded from the internet to the data folder in csv format.

write.csv(BodyTemperature, "./data/BodyTemperature.txt")
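
To see how the reading and writing functions pair up, the following sketch writes a small data frame to a temporary file and reads it back. The data frame toy and its values are made up for illustration, and tempfile() is used only so that the example does not depend on an existing data folder.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write a csv file; row.names = FALSE avoids an extra column of row numbers
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# read.csv() assumes header = TRUE and sep = ","
toy_csv <- read.csv(csv_path)

# The same file can be read with read.table() by stating header and separator
toy_tab <- read.table(csv_path, header = TRUE, sep = ",")

# Tab-separated round trip with write.table() and read.delim()
tsv_path <- tempfile(fileext = ".tsv")
write.table(toy, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
toy_tsv <- read.delim(tsv_path)
```

By default write.csv() also writes the row names as a first column; setting row.names = FALSE keeps the file limited to the actual variables.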

Explore the data

Once we read or download our dataset, we can start analyzing the content. Often the datasets are accompanied by descriptions that explain the content and it is a good rule to read them.

We can access the head or the tail of a dataset using the head() and tail() commands. We can even visualize the entire dataset in an RStudio window using the View() command. The dim() command returns the dimensions of the dataset (rows and columns), and names() returns the column names.

The str() command returns information about the structure of the dataset. In RStudio, the equivalent of this command is accessible in the Environment window. All these functions are a good starting point for understanding the data you are dealing with.
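
As a short illustration of these commands, here is a sketch that uses mtcars (one of the built-in datasets mentioned above) instead of BodyTemperature, so that it runs without downloading anything.

```r
# mtcars ships with R, so no download is needed
first_rows <- head(mtcars)      # first 6 rows by default
last_rows  <- tail(mtcars, 3)   # last 3 rows
size       <- dim(mtcars)       # number of rows, number of columns
columns    <- names(mtcars)     # column names

print(first_rows)
print(size)
str(mtcars)                     # type and first values of each column
```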

Another handy function is summary(). For numerical variables, it returns the minimum, maximum, mean, median, and first and third quartiles. We say numerical because a data frame can also contain non-numeric variables, for which summary() reports counts or basic type information instead. For example, in the case of BodyTemperature, the first column is the categorical variable of the patient’s sex.

Other useful functions are mean() and sd(), which return the mean and standard deviation of a vector. Analogously, the IQR() and range() functions return the interquartile range and the range (minimum and maximum) of a vector.
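
The following sketch applies these functions to the mpg column of the built-in mtcars dataset, chosen here only because it is always available; the same calls work on any numeric column.

```r
# Summary statistics for one numeric column of mtcars
print(summary(mtcars$mpg))        # min, quartiles, median, mean, max

mpg_mean  <- mean(mtcars$mpg)
mpg_sd    <- sd(mtcars$mpg)
mpg_iqr   <- IQR(mtcars$mpg)      # third quartile minus first quartile
mpg_range <- range(mtcars$mpg)    # c(minimum, maximum)

# On a categorical variable (factor), summary() returns counts per level
cyl_counts <- summary(factor(mtcars$cyl))
print(cyl_counts)
```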

A useful command to apply functions designed for scalars or vectors to the rows or columns of an array or matrix is apply(); its second argument selects the dimension along which the function is applied. The lapply() and sapply() variants work similarly on the elements of a list. Another command for data manipulation is round(), which rounds a value or a vector to the nearest integer or, if the number of decimal places is given, to that number of decimal digits.
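
A minimal sketch of these commands follows. The matrix m and the list lst are made-up objects used only for illustration.

```r
# A 2 x 3 matrix, filled column by column: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

row_sums  <- apply(m, 1, sum)    # MARGIN = 1 applies the function to each row
col_means <- apply(m, 2, mean)   # MARGIN = 2 applies the function to each column

# lapply() returns a list, sapply() simplifies the result to a vector
lst <- list(a = 1:3, b = c(10, 20))
means_list <- lapply(lst, mean)
means_vec  <- sapply(lst, mean)  # named vector: a = 2, b = 15

r0 <- round(3.14159)             # 3
r2 <- round(3.14159, 2)          # 3.14
```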

Other useful functions are sort(), which sorts the elements of a vector, and unique(), which removes repeated entries from a vector or data frame. Functions like any() and which() check whether a condition is satisfied: any() returns TRUE if at least one element satisfies the condition, whereas which() returns the indices of the elements that satisfy it.
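
These functions can be tried on a small vector; x below is a made-up example.

```r
x <- c(3, 1, 2, 3, 1)

sorted_x <- sort(x)                     # 1 1 2 3 3
sorted_desc <- sort(x, decreasing = TRUE)
unique_x <- unique(x)                   # 3 1 2 (first occurrences, original order)

has_large <- any(x > 2)                 # TRUE: at least one element is greater than 2
large_idx <- which(x > 2)               # 1 4: positions of the elements greater than 2
```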

Let’s look at some examples with the BodyTemperature dataset.

Merge datasets

We have seen at length how to work with and manipulate data. However, one important function for grouping data has not been explained in class yet: aggregate(). This command applies a function to subsets of the data, defined by one or more grouping variables. Let’s look at some examples.
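
As one possible example, the sketch below uses the formula interface of aggregate() on the built-in mtcars dataset; the choice of dataset and variables is ours, made only for illustration.

```r
# Mean miles per gallon (mpg) for each number of cylinders (cyl)
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# Several variables at once, grouped by two variables (cyl and am)
by_cyl_am <- aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)
print(by_cyl_am)
```

The left-hand side of the formula lists the variables to summarise and the right-hand side lists the grouping variables.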

Another useful command is merge(), which allows the user to merge data frames according to a specific variable. If we are sure that our observations are “aligned”, we can obtain the same result by using the commands rbind() and cbind().
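
The sketch below illustrates merge(), cbind() and rbind() on two small data frames. The data frames patients and temps, and all their values, are hypothetical and serve only to show the mechanics.

```r
# Hypothetical data frames sharing the key column "id"
patients <- data.frame(id = c(1, 2, 3), sex = c("F", "M", "F"))
temps    <- data.frame(id = c(3, 1, 4), temperature = c(36.8, 37.1, 36.5))

# Keep only the ids present in both data frames (1 and 3)
merged <- merge(patients, temps, by = "id")

# all = TRUE keeps every id and fills the missing values with NA
merged_all <- merge(patients, temps, by = "id", all = TRUE)

# cbind() adds columns and rbind() adds rows; both assume the data are already aligned
with_age <- cbind(patients, age = c(30, 41, 25))
stacked  <- rbind(patients, data.frame(id = 5, sex = "M"))
```

Unlike cbind(), merge() matches rows by the value of the key, so the order of the rows in the two data frames does not matter.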

Factors

Sometimes categorical variables are saved as numeric variables. Let’s consider the dataset birthwt (birth weight) in the MASS package. We can note that some variables, although numerical, represent categories: for example, smoker or not, or hypertensive or not. If we use the summary() function, we see that they are treated as numerical variables. The as.factor() function converts numeric variables to categorical variables.

library(MASS)  # the birthwt dataset is provided by the MASS package
birthwt1 <- birthwt
birthwt1$race <- as.factor(birthwt1$race)

From the description of the variables in the RStudio Environment window, we can see that after the assignment race is indicated as a factor and no longer as an integer. If we call summary(), the results for this variable are different from before: we get the counts for each level instead of numerical statistics.

The table() function builds a contingency table of the counts for each combination of factor levels (use the help to see what prop.table() does). levels() returns the levels of a factor. Using levels() it is also possible to assign or change the levels. Let’s use this function to change the levels of race.
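
A possible version of this step is sketched below. It repeats the conversion to factor so that it runs on its own; the labels follow the coding given in the birthwt help page (1 = white, 2 = black, 3 = other).

```r
library(MASS)  # provides the birthwt dataset

birthwt1 <- birthwt
birthwt1$race <- as.factor(birthwt1$race)
old_levels <- levels(birthwt1$race)            # "1" "2" "3"

# Replace the numeric codes with descriptive labels, in the same order
levels(birthwt1$race) <- c("white", "black", "other")

race_counts <- table(birthwt1$race)            # counts per level
race_props  <- prop.table(race_counts)         # relative frequencies
race_smoke  <- table(birthwt1$race, birthwt1$smoke)  # race by smoking status

print(race_counts)
print(round(race_props, 3))
print(race_smoke)
```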

Missing data

Often data may contain unreliable values, due to errors in a few lines or missing values. This is a significant problem, both because, if not recognized in advance, these values can influence our results, and because there are no standard protocols to deal with these cases.

Some elements may be missing; they are usually reported as NA (not available). R has a dedicated function to find these elements: is.na(). This function returns logical values that indicate whether each element is NA or not. Similarly, na.omit() returns its argument without the rows containing NA.

In a similar way, the is.nan() command checks whether the content is NaN (not a number). There are also the functions is.infinite(), which checks for Inf/-Inf values, and is.finite(), which checks that a value is neither infinite nor missing.
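
The behaviour of these functions can be compared on a small vector. The vector v and the data frame df are made up for illustration.

```r
v <- c(1, NA, 3, NaN, Inf)

na_flag  <- is.na(v)        # FALSE TRUE FALSE TRUE FALSE (NaN also counts as NA)
nan_flag <- is.nan(v)       # FALSE FALSE FALSE TRUE FALSE
inf_flag <- is.infinite(v)  # FALSE FALSE FALSE FALSE TRUE
fin_flag <- is.finite(v)    # TRUE FALSE TRUE FALSE FALSE

# Keep only the usable values before computing a statistic
clean_mean <- mean(v[is.finite(v)])   # mean of 1 and 3

# na.omit() drops every row of a data frame that contains at least one NA
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
df_complete <- na.omit(df)            # only the first row remains
```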

Probability distributions in R

R allows us to access many probability distributions through specific functions contained in the stats package, which is loaded by default when R starts.

We can access the distributions through functions that recall their names: for example, norm, binom and pois give access to the normal, binomial and Poisson distributions, just to name a few. A prefix is added to this name to specify the density function (d), the cumulative distribution function (p), the function that returns the quantiles (q), or the one that generates random numbers according to the distribution (r).
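
The four prefixes can be tried as follows. The parameter values are arbitrary and chosen only for illustration.

```r
# Normal distribution (mean = 0 and sd = 1 by default)
d0 <- dnorm(0)           # density at 0, equal to 1 / sqrt(2 * pi)
p  <- pnorm(1.96)        # P(X <= 1.96), about 0.975
q  <- qnorm(0.975)       # quantile of order 0.975, about 1.96

# Random numbers; set.seed() makes the draw reproducible
set.seed(1)
draws <- rnorm(5, mean = 10, sd = 2)

# The same prefixes work for the other distributions
b  <- dbinom(3, size = 10, prob = 0.5)  # P(X = 3) = 120 / 1024
pp <- ppois(2, lambda = 3)              # P(X <= 2) for a Poisson with mean 3
```

Note that p and q are inverse operations: qnorm(pnorm(x)) returns x.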

You can access the list of all the distributions available in R by searching for distribution in the help (for example, with ?Distributions).

Let’s do some practice

It’s always time for pizza!

Download the dataset at the link, which contains the reviews of pizza restaurants in NY. This exercise is inspired by this video.

Explore and prepare the dataset

How many records are collected in the dataset? What is the information?
How many restaurants and records are in the dataset? How many variables?
What is the maximum number of votes? And the minimum?
According to the last digit of your matricola (university registration number), select the data of the following pizzerias:

Last number	Pizzeria1	Pizzeria2
1	Kiss My Slice	Big Slice Pizza
2	Mariella	Spunto
3	Rivoli Pizza	Joe’s Pizza
4	Pizza Italia	New York Pizza Suprema
5	Williamsburg Pizza	John’s of Bleecker
6	Bleecker Street Pizza	Vinny Vincenz
7	Highline Pizza	Famous Original Ray’s
8	Cavallo’s Pizza	Pizza 33
9	Luna Pizza	Roio’s
0	Bella Napoli	Ben’s of SoHo 14th Street

Your new dataset must contain only the data about these 2 restaurants. You should perform the following analyses on this new dataset.

Manipulate the dataset

Add a new column with the number of total votes of each restaurant.
Create a new column with the relative frequency of the answers. Round the value to the third decimal.
Create a new column where you transform the review into numbers, where 0 is Never Again and 4 is Excellent.
Compute the mean review for the two pizzerias in your dataset. Which one is the best?
Plot the number of votes as a function of the score.
Choose one of the two restaurants. If we approximate its distribution of votes with a Poisson distribution with parameter equal to the mean rate, are we close to the data distribution?

Solution

Titanic

Consider the Titanic dataset (description).

Read the dataset and analyze the variables. Are all the variable types correct?
What is the mean age of the passengers? And the quantiles?
What is the mean price of the ticket? And the standard deviation?
Who is the passenger with the most children aboard? How many?
How many passengers survived? What is the survival probability?
What is the survival probability in the various classes?