Lesson 3

Data visualization (and manipulation)

In this lesson, we will look at the basic commands for plotting data and at how to retrieve information from these plots. In addition, we will go through other manipulation commands and the basic syntax for writing our own R functions.

Data visualization

Data visualization helps to identify trends and connections, and to retrieve information from data that is not obvious in tabular form. All the commands that follow support this kind of investigation, and all of them can be customized, for example by changing colors and axis labels or by placing plots side by side.

Box plot

In R, it is very easy to produce box plots with the boxplot() command: we only need to specify the variable we want to visualize. If we are interested in exploring the distribution of a variable according to another factor variable, we can use the notation boxplot(ColumnName ~ FactorColumnName, data = dataset).
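As a minimal sketch of the two forms, the lines below assume the built-in ToothGrowth dataset as example data (it is not otherwise used in this lesson); any data frame with a numeric column and a factor column would do.

```r
# ToothGrowth ships with R: len is numeric, supp is a factor with two levels
data(ToothGrowth)

# Box plot of a single variable
boxplot(ToothGrowth$len, main = "Tooth length", ylab = "len")

# Box plot of len split by the factor supp, using the formula notation
bp <- boxplot(len ~ supp, data = ToothGrowth,
              main = "Tooth length by supplement",
              xlab = "supp", ylab = "len")

# boxplot() invisibly returns the statistics it draws;
# bp$names holds the group labels, bp$stats the five-number summaries
print(bp$names)
print(bp$stats)
```

Storing the return value is optional; it is shown here only because it gives access to the numbers behind the picture.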

Histograms

Using the hist() command, it is possible to get histograms of the variables. It is important to notice that we can specify the optional argument freq. If freq is TRUE, the histogram shows the counts in each bin; if FALSE, it shows probability densities, that is, the relative frequencies divided by the bin width, so that the total area of the bars is one. The latter is useful if we are comparing variables with different numbers of observations. The default value for freq is TRUE when the bins are equally spaced, which is the usual case. Note that hist() does not support the ~ notation.
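The effect of freq can be checked on simulated data. The following is only a sketch: the sample x and the seed are arbitrary choices made for illustration.

```r
set.seed(1)        # arbitrary seed, only for reproducibility
x <- rnorm(200)    # simulated data, used purely for illustration

# Counts on the vertical axis (the default for equally spaced bins)
h_counts <- hist(x, main = "Counts")

# Densities on the vertical axis: the areas of the bars sum to one
h_dens <- hist(x, freq = FALSE, main = "Densities")

# hist() invisibly returns the bin information it used
n_obs <- sum(h_counts$counts)                           # number of observations
total_area <- sum(h_dens$density * diff(h_dens$breaks)) # area under the bars
print(n_obs)
print(total_area)
```

The counts add up to the sample size, while the densities multiplied by the bin widths add up to one, which is what makes histograms of samples of different sizes comparable.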

Scatter plots

With the command plot(x, y), or with scatterplot(x, y) from the car package, it is possible to represent the data as points. We have already used plot() and we noticed that it can be very useful for exploring the relationships between variables.

There are other commands that can be combined with all the previous ones, for example abline(), text(), points() and lines(). They are very useful and we will use them in the examples.
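A small sketch of how these commands layer on top of an existing plot, assuming the built-in cars dataset as example data. The choice of what to highlight (the mean, the maximum, a smoothed trend) is arbitrary.

```r
data(cars)  # built-in dataset: speed and stopping distance (dist)

# Scatter plot of the raw data
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")

# abline(): add a horizontal line at the mean stopping distance
abline(h = mean(cars$dist), lty = 2)

# points(): highlight the observation with the largest distance
i_max <- which.max(cars$dist)
points(cars$speed[i_max], cars$dist[i_max], col = "red", pch = 19)

# text(): label that observation, placing the label to its left
text(cars$speed[i_max], cars$dist[i_max], labels = "max", pos = 2)

# lines(): overlay a smoothed trend computed by lowess()
lines(lowess(cars$speed, cars$dist), col = "blue")
```

Each of these calls draws on the plot opened by plot(), so the order matters: the base plot must exist before anything is added to it.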

It should be pointed out that, at least for the normal distribution, there are the qqnorm() and qqline() functions, which graphically compare the quantiles of the data with those of a normal distribution. This is a graphical (and therefore not formal) way to check whether your data follow a normal distribution. Normality and independence of the measurements are two hypotheses that we have seen very often, especially in the use of limit theorems.
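To see what such a comparison looks like, the sketch below draws Q-Q plots for two simulated samples, one normal and one exponential. Both samples and their parameters are assumptions made only for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
normal_sample <- rnorm(100, mean = 5, sd = 2)  # should follow the line
skewed_sample <- rexp(100, rate = 1)           # should bend away from it

# Place the two plots side by side
old_par <- par(mfrow = c(1, 2))

qq <- qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample)

qqnorm(skewed_sample, main = "Exponential sample")
qqline(skewed_sample)

# Restore the previous graphical settings
par(old_par)
```

Points lying close to the reference line suggest approximate normality; systematic curvature, as expected for the exponential sample, suggests a departure from it.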

Programming in R

In R, we can iterate commands, evaluate conditional expressions and define our own functions.

Conditional statements (if)

if (condition) command1 else command2
# OR
ifelse(condition, command1, command2)

The first expression checks a single condition, that is, a single TRUE or FALSE value. In contrast, the ifelse command vectorizes the check. If the condition is evaluated on a vector, ifelse evaluates it on all entries and returns, entry by entry, the value of command1 where the condition is satisfied and the value of command2 otherwise. If a vector is passed as the condition of if, older versions of R use only its first element (with a warning), while R 4.2.0 and later stop with an error.
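The difference between the two forms can be sketched on a short vector. The vector x and the labels are arbitrary examples.

```r
x <- c(-2, 0, 3)

# ifelse() checks every entry and returns a vector of the same length
labels <- ifelse(x > 0, "positive", "non-positive")
print(labels)

# if () expects a single TRUE or FALSE, so we test one element at a time
if (x[1] > 0) sign_first <- "positive" else sign_first <- "non-positive"
print(sign_first)

# To use a whole vector inside if (), reduce it to one value with any() or all()
if (any(x > 0)) has_positive <- TRUE else has_positive <- FALSE
print(has_positive)
```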

You can group multiple commands using braces, separating them with new lines or semicolons. In this way, an if can perform several operations in each case.
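A minimal sketch of grouping with braces; the variable temperature and its value are hypothetical.

```r
temperature <- 31  # hypothetical value, chosen only for illustration

if (temperature > 30) {
  # several commands on separate lines inside the braces
  status <- "hot"
  advice <- "drink water"
} else {
  # the same grouping, with commands separated by a semicolon
  status <- "mild"; advice <- "nothing special"
}

print(paste(status, advice, sep = ": "))
```

Note that else must appear on the same line as the closing brace of the if branch when the code is typed at the top level.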

Iterations (for - while - repeat)

We can iterate commands using for, while or repeat.

for (i in sequence) command1

for repeats command1 once for each value taken by a variable, in this case i, as it runs over sequence. This expression is very useful for visiting vectors, applying command1 to each entry in turn. Using braces, you can indicate multiple commands.
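For instance, a loop that visits a vector and stores the square of each entry could be sketched as follows (the vector values is an arbitrary example).

```r
values <- c(2, 5, 9)
squares <- numeric(length(values))  # preallocate the result vector

# seq_along(values) generates the indices 1, 2, 3
for (i in seq_along(values)) {
  squares[i] <- values[i]^2
  cat("entry", i, "squared is", squares[i], "\n")
}

print(squares)
```

In this particular case the vectorized expression values^2 gives the same result without a loop; the loop form is shown only to illustrate the syntax.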

while (condition) command1

while repeats command1 as long as the condition is true, that is, until it becomes false. Using while can be risky: if the condition is always satisfied, the code gets stuck in an infinite loop.
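A small sketch: halve a number until it is no longer greater than one, counting the steps. The starting value is arbitrary.

```r
n <- 100
steps <- 0

# The condition is checked before every pass through the body
while (n > 1) {
  n <- n / 2          # this update is what eventually makes the condition false
  steps <- steps + 1
}

print(steps)
print(n)
```

The loop ends because n is modified inside the body; forgetting that update would leave the condition true forever.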

repeat simply repeats a command, with no stopping condition of its own. break allows you to interrupt any of these loops and is how a repeat loop is normally stopped.
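As a sketch, the loop below keeps adding consecutive integers and leaves the loop with break as soon as the running total exceeds 20 (an arbitrary threshold).

```r
total <- 0
k <- 0

repeat {
  k <- k + 1
  total <- total + k
  # Without this check the loop would never end
  if (total > 20) break
}

print(k)
print(total)
```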

Define functions

R allows the user to define functions, through the function command.

FunName <- function(arguments) {
  command1
  return(value)
}

The previous syntax allows you to define a function called FunName that will evaluate command1 and return value.

User-defined functions allow you to call the same code several times in different parts of a program, passing different arguments each time. Also in this case, using braces, you can specify multiple commands.

my_fun <- function(a, b, c) {
  return(a * b + c)
}
my_fun2 <- function(a, b, c) {
  y <- a^b
  return(y + a * b + c)
}
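Once defined, a function is called like any built-in command. The sketch below repeats the definition of my_fun so that it runs on its own, and shows positional, named and vector arguments.

```r
# Same definition as above, repeated so that this snippet is self-contained
my_fun <- function(a, b, c) {
  return(a * b + c)
}

# Arguments matched by position: a = 2, b = 3, c = 4
r1 <- my_fun(2, 3, 4)

# Arguments matched by name, in any order
r2 <- my_fun(c = 4, a = 2, b = 3)

# Arithmetic in R is vectorized, so a can also be a vector
r3 <- my_fun(1:3, 2, 1)

print(r1)  # 10
print(r2)  # 10
print(r3)  # 3 5 7
```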

You can save functions and scripts in files with the .R extension, which can then be executed using source("function_name.R"). To load these functions, the file must be in the working directory; otherwise, the path to the file must be given.
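The mechanism can be tried without leaving the R session. In this sketch the file name my_functions.R and the function add_one are hypothetical, and the file is written to a temporary directory instead of the working directory, so the full path is passed to source().

```r
# Write a one-line script to a temporary file (a stand-in for your own .R file)
script_path <- file.path(tempdir(), "my_functions.R")
writeLines("add_one <- function(x) x + 1", script_path)

# Execute the script: this defines add_one() in the current session
source(script_path)
result <- add_one(41)
print(result)  # 42

# getwd() shows the working directory; files located there
# can be sourced by file name only, without the path
print(getwd())
```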

The tidyverse

We cannot ignore the growing attention towards some packages oriented to data manipulation and visualization, such as dplyr, ggplot2, tidyr, tibble and others, which are part of the so-called tidyverse. These packages allow the user to easily manipulate data and perform non-trivial selections using specific commands. For example, we can select columns or rows using commands such as select and filter.

The following lines are an example of the use of dplyr.

library(dplyr)

starwars %>% 
  filter(species == "Droid")

starwars %>% 
  select(name, ends_with("color"))

starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)

starwars %>% 
  arrange(desc(mass))

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1,
         mass > 50)

The following lines are an example of the use of ggplot2.

library(ggplot2)

ggplot(data = diamonds, aes(x = depth, fill = cut)) +
  geom_histogram(binwidth = 0.2)

ggplot(data = diamonds, aes(x = depth)) +
    geom_histogram(binwidth = 0.2) +
    facet_wrap(~ cut)

# The combination of the two packages

library(dplyr)

fair_diamonds <- diamonds %>%
  filter(cut == "Fair")
ggplot(data = fair_diamonds, aes(x = price, y = carat)) +
  geom_point()
ggplot(data = fair_diamonds, aes(x = price, y = carat)) +
  geom_point(position = "jitter")

In this course, we will not explicitly use these packages, since they require learning their own syntax and grammar. However, you can use them if you like! More information can be found in the dplyr and ggplot2 documentation.

The Pima.tr2 dataset

Let’s work through an example together, using the dataset Pima.tr2 from the MASS library.

  1. Save a copy of the dataset and explore its structure.
  2. Identify and count the missing data.
  3. Compute the summary of the dataset.
  4. Replace the missing values with the median of the same variable. Recompute the summary. Are there any changes?
  5. Create two vectors containing the values of the variable glu for each value of type. Visually compare the distributions of the two vectors and comment on the results.
  6. Repeat the previous comparison using a box plot, without using the two vectors. Comment on the results.
  7. Compare the distribution of the measurements of the variable glu with a Gaussian distribution with mean and standard deviation equal to the sample ones. Are the two distributions comparable?
  8. Compute the mean and standard deviation of the number of pregnancies npreg. If we compare the data distribution with a Poisson distribution with lambda equal to the mean of the data, are the two distributions comparable? Can you think of a better distribution for reproducing the data?
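One possible starting point for steps 1 to 4 is sketched below. It assumes that the MASS package is installed (it normally ships with R); the names pima, pima_filled and na_per_column are arbitrary, and other approaches are equally valid.

```r
library(MASS)

# 1. Work on a copy, so that the original dataset is left untouched
pima <- Pima.tr2
str(pima)

# 2. Count the missing values in each column
na_per_column <- colSums(is.na(pima))
print(na_per_column)

# 3. Summary of the dataset (NA counts are reported per variable)
summary(pima)

# 4. Replace the missing values of each numeric column with its median
pima_filled <- pima
for (col in names(pima_filled)) {
  if (is.numeric(pima_filled[[col]])) {
    missing <- is.na(pima_filled[[col]])
    pima_filled[[col]][missing] <- median(pima_filled[[col]], na.rm = TRUE)
  }
}
summary(pima_filled)
```

Comparing the two summaries column by column shows which statistics move after the replacement; the remaining steps follow the same pattern with hist(), boxplot() and the distribution functions.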

© 2017-2019 Federico Reali