Lesson 1

Originally posted here.

R basic syntax

Folders (where are we now?)

When we work with datasets and read or write files, it is essential to be sure that we are working in the right directory. We can use the following R command to check which folder we are working in.

getwd()

The previous command (get working directory) returns the path of the current working directory. If we want to move the working directory somewhere else, we can set it by using the command setwd() (set working directory). For example, if our working folder is Statistical Learning (StatLearn), we can move the working directory there using the following code.

setwd('your-path/StatLearn/')
# or
setwd('your-path/StatLearn/R/Lesson1')

From the working directory, we can easily navigate and access files by using relative paths. Without specifying the full path, we use the dot (./) to refer to the current folder and the double dot (../) to refer to the parent folder. We can repeat the double dot (../../ and so on) until we reach the folder we are interested in.

(Do not try this at home, or anywhere else!)
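
As a minimal sketch of the relative path syntax described above, the following lines resolve the dot and the double dot against the current working directory. The names data and measurements.csv are hypothetical and serve only as an illustration; no file is read.

```r
# Where are we now?
here <- getwd()

# "." is the current folder and ".." is its parent;
# normalizePath() turns a relative path into an absolute one
current_folder <- normalizePath(".")
parent_folder  <- normalizePath("..")

# file.path() joins path components with the separator "/".
# The names "data" and "measurements.csv" are hypothetical.
relative_file <- file.path("..", "data", "measurements.csv")

current_folder
parent_folder
relative_file
```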

Once we know where we are, we can use the RStudio interface to create and run scripts.

Arithmetical Operations

Using the standard notation, we can use R to perform the usual arithmetical operations.

For example,

2+2 
2-2
2*2
2/2

Note that / returns the ordinary division, whereas %/% returns the result of the integer division and %% returns the remainder.
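
A short illustration of the three operators, with values chosen arbitrarily for this sketch:

```r
ordinary  <- 7 / 2    # ordinary division: 3.5
quotient  <- 7 %/% 2  # integer division: 3
remainder <- 7 %% 2   # remainder: 1

ordinary
quotient
remainder

# the results are linked by: dividend = divisor * quotient + remainder
2 * quotient + remainder  # 7
```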

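With ^, we can compute powers, and by using fractional exponents we can also compute roots. For the square root, there is the command sqrt(), which also accepts complex numbers. Note that a negative real input such as sqrt(-1) returns NaN with a warning, so it must be passed as a complex number, e.g. sqrt(as.complex(-1)).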

Another useful command is abs(), which computes the absolute value (for a complex number, it returns the modulus).
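
The commands above can be tried on a few arbitrary values, as in this small sketch:

```r
2^10                  # power: 1024
27^(1/3)              # cube root via a fractional exponent (approximately 3)
sqrt(16)              # 4
sqrt(as.complex(-4))  # 0+2i: the input must be complex to get a complex root
abs(-3.5)             # 3.5
abs(3 + 4i)           # 5, the modulus of the complex number
```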

Exponential, logarithms, and trigonometric functions

We can compute the value of the exponential function using the command exp(), whereas we can compute logarithms using log(), log10(), and log2(), which compute the logarithm in the natural base, base 10, and base 2, respectively. With the command log(), we can specify a second argument to select a base different from e. For instance, log(5, base = 3) computes the base 3 logarithm of 5.
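
A brief sketch of these functions on arbitrary values; the last line checks the base argument against the change-of-base formula.

```r
exp(1)            # Euler's number e, about 2.718282
log(exp(2))       # natural logarithm: 2
log10(1000)       # 3
log2(8)           # 3
log(5, base = 3)  # base 3 logarithm of 5, about 1.464974
log(5) / log(3)   # the same value via the change-of-base formula
```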

The main trigonometric functions are implemented in R and can be run using the following commands.

sin() 
cos() 
tan() 

Note that the expected input is the angle in radians, not degrees. If we express the input as a multiple of pi, we can use the variants whose names end with pi (sinpi(), cospi(), and tanpi()); for example, sinpi(x) computes sin(pi * x).

The inverse trigonometric functions are also implemented and can be called by adding the prefix a (acos(), asin(), atan()).
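
The following sketch illustrates the radian convention and the pi variants; the variable names degrees and radians are introduced here only for the example.

```r
sin(pi / 2)      # 1
cos(pi)          # -1
sin(pi)          # a tiny nonzero value (about 1.2e-16) due to rounding
sinpi(1)         # exactly 0, since the input is given as a multiple of pi

# converting degrees to radians before calling the function
degrees <- 30
radians <- degrees * pi / 180
sin(radians)     # approximately 0.5

acos(-1)         # pi: the inverse functions return an angle in radians
```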

Assignment, memory, and types

Until now, we just computed single values or numbers using R in the way we would use a calculator. However, with R, we can do much more.

As the first step, we can store values to recall them more than once or to modify them in successive steps. You can find more resources about this topic here and here.

We can store a value in a variable by assigning the value to a name using the arrows <- or ->. We can also use = or assign() for assignments; however, they are less common. At the link, you can find more information on the difference between using the arrows and the equals sign.

x <- 6
# or
6 -> x

x

The previous command stores the value 6 in the variable x. The name variable suggests that we can vary the content of x, in contrast with built-in constants such as pi, which are always available and are not meant to be modified (R does not prevent us from creating a variable called pi that masks the constant, but this is best avoided).

The use of variables is particularly useful when we write functions or scripts, to successively modify or update a value.

When we set a variable, R allocates a finite amount of physical memory to store its content and recognizes the type of the variable as character, logical, or numeric. We will see in the next paragraphs how to store and manipulate these types.
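
As a small sketch of the three types and of the alternative assignment forms mentioned above (the variable names n, s, b, m, and k are arbitrary):

```r
n <- 6          # numeric
s <- "six"      # character
b <- TRUE       # logical

class(n)        # "numeric"
class(s)        # "character"
class(b)        # "logical"

# = and assign() also create variables
m = 7
assign("k", 8)
m + k           # 15

# memory used by the object (the exact value depends on the platform)
object.size(n)
```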

Logical and relational operators

R allows us to evaluate logical and relational expressions, which will result in a TRUE or FALSE value.

6 > 10
6 <= 10
is_bigger <- 6 > 10
is_bigger
as.integer(is_bigger)

In addition to the comparisons in the previous example, R can check whether two values are equal (==) or different (!=). These checks are useful for defining and verifying whether certain conditions hold.

The logical operators in R are & (and), | (or), xor() (exclusive or), and ! (not). The following truth table summarizes how they work. A and B denote variables whose logical values are either T (true) or F (false).

A  B  A&B  A|B  xor(A,B)  !A
T  T  T    T    F         F
T  F  F    T    T         F
F  T  F    T    T         T
F  F  F    F    F         T

We can evaluate the logical operators on both logical and numerical variables. In the latter case, however, every zero is counted as FALSE and anything different from 0 is counted as TRUE.
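
The truth table and the rule for numerical values can be checked directly, as in this short sketch:

```r
A <- TRUE
B <- FALSE

A & B       # FALSE
A | B       # TRUE
xor(A, B)   # TRUE
!A          # FALSE

# numerical values: 0 counts as FALSE, anything else as TRUE
5 & 0       # FALSE
5 | 0       # TRUE
!0          # TRUE
xor(2, -1)  # FALSE, since both values are nonzero (TRUE)
```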

Vectors

Until now, we only considered how to assign a single value to a variable. However, with R, we can store more than one value by creating vectors.

y <- c(1,2,3,4)
y[1]

In the previous example, we create a vector called y using the command c() (combine). The command combines the values into a vector. A plain R vector has no dimensions: it is displayed as a row, but functions such as t() and as.matrix() treat it as a column. We can access the vector entries by using the square brackets []. We can also perform all the operations seen so far on vectors.

By default, R operates on vectors element-wise (the operation is applied to each entry). If we try to sum two vectors of different lengths, R repeats (recycles) the shorter one until it reaches the length of the longer one and then sums the two vectors. If the longer length is not a multiple of the shorter one, as in the following example, a warning message appears.

a1 <- c(5,6,7)
y+a1
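
As a complementary sketch, the following lines show element-wise operations and a case of recycling that raises no warning, because the shorter length divides the longer one. The vector u is introduced only for this example.

```r
u <- c(1, 2, 3, 4)

u * 2          # the single value 2 is recycled: 2 4 6 8
u + c(10, 20)  # length 2 divides length 4, so no warning: 11 22 13 24
u > 2          # relational operators are element-wise too: FALSE FALSE TRUE TRUE
u[u > 2]       # a logical vector can be used to select entries: 3 4
```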

A useful command when dealing with vectors is length(), which returns the number of elements.

Another useful command is t(), which transposes its argument. Applied to a vector, t() returns an object that contains the same values but has explicit dimensions (a matrix with a single row). Nevertheless, R can perform operations between the two objects without errors or warnings.

z <- t(y)
y
print(z)
z

When we type the name of a variable without any operation, R displays its content. Similarly, if we wrap a variable, an assignment, or an operation in parentheses, R displays the result.

Using the command print(), we can display the results as well. In contrast with the previous two approaches, print() shows the output even if it is called inside a function or a script. For this reason, print() is the go-to way to debug.
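
The difference between automatic display and print() can be seen in the following sketch. The function show_steps() is a hypothetical helper written only for this illustration.

```r
# show_steps() is a hypothetical helper used only for this illustration
show_steps <- function(v) {
  v * 2          # evaluated, but nothing is displayed inside a function
  print(v + 1)   # displayed, thanks to print()
  invisible(v)   # return v without displaying it
}

result <- show_steps(c(1, 2, 3))  # prints: [1] 2 3 4

(w <- 10)   # the parentheses make R display the assigned value
```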

Sequences

If we want to indicate a succession of values that are a regular sequence, it is possible to do so without expressly indicating all of them.

For instance, using the expression a:b, R provides the sequence of values from a to b with step 1. To indicate a different step, we can use the command seq(a, b, step). If we do not indicate the step, the default value is 1.

If we want to create a vector of repeated values, we can use the command rep(). For example, rep(1,5) will create a vector of length 5 with all entries equal to 1.

For every function, you can access the R help by using the command help(function_name). Reading the help of seq(), we see, for example, that we can also pass the arguments in a different order, provided that we specify them using the names indicated in the help. For example, seq(to = b, by = step, from = a) returns the same output as seq(a, b, step).
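
A few concrete sequences, with arbitrary values, illustrate the commands of this section:

```r
1:5                               # 1 2 3 4 5
seq(0, 1, 0.25)                   # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)               # a negative step counts down: 10 7 4 1
seq(to = 1, by = 0.25, from = 0)  # named arguments can be given in any order
rep(1, 5)                         # 1 1 1 1 1
rep(c("a", "b"), times = 2)       # "a" "b" "a" "b"
```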

Operations on vectors

In addition to the operations we have already seen, there are other R functions specifically designed for vectors. For example, the min() and max() functions return the minimum and maximum values contained in a vector, respectively. To get the position of these values (the first one, in case of ties), use which.min() and which.max(). For example,

x <- c(3, 5, 7, 1, 3, 3, 9, 8)
min(x)
which.min(x)
max(x)
which.max(x)

Other very useful functions for working with vectors are sum() and diff(). The former calculates the sum of all the elements of a vector; the latter calculates the difference between each element and a previous one. By default this is the immediately preceding element; a second argument, the lag, lets us compare each element with one further back.

sum(x)
diff(x)
diff(x,2)
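
To make the meaning of the lag explicit, this sketch recomputes the two differences by hand using indexing. The vector v repeats the values used above, and n is its length.

```r
v <- c(3, 5, 7, 1, 3, 3, 9, 8)
n <- length(v)

# lag 1: each element minus the one just before it
v[2:n] - v[1:(n - 1)]   # 2 2 -6 2 0 6 -1
diff(v)                 # same result

# lag 2: each element minus the one two positions before it
v[3:n] - v[1:(n - 2)]   # 4 -4 -4 2 6 5
diff(v, lag = 2)        # same result
```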

Matrices

As we have seen, the c() command always returns a plain vector without dimensions, so it is not possible to generate a matrix using c() alone.

However, matrices exist and can be defined using the cbind() function, which joins the vectors passed as arguments as the columns of a matrix. Analogously, the rbind() function returns a matrix in which the vectors are the rows. In both cases, the vectors passed as arguments must have the same length (and the same orientation: not one row and one column). The dim() function returns the dimensions of the object. We can also use the length() function here, which returns the product of the dimensions (i.e., the total number of elements).

a <- cbind(c(1, 2, 3), c(4, 5, 6))
dim(a)
b <- rbind(c(1, 2, 3), c(4, 5, 6))
dim(b)
a
b

We can change the dimensions of an object by using length() and dim() on the left-hand side of an assignment.

dim(a) <- c(2,3)
a

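The previous command changes the shape of the matrix according to the indicated size. However, these commands must be used carefully: the elements are not rearranged by rows and columns but are simply refilled column by column in their stored order, so the result may not be what you expect (for instance, it is not the transpose).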

Another useful function for creating matrices is array(), which asks you to specify the elements and the dimensions. An analogous result can be obtained with matrix(elements, number_of_rows, number_of_columns). Note that with array() it is possible to define structures with more than two dimensions.
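
A minimal sketch of matrix() and array(); the object names m_col, m_row, arr2, and arr3 are chosen only for this example.

```r
# matrix() fills column by column unless byrow = TRUE is given
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_col   # first row: 1 3 5, second row: 2 4 6
m_row   # first row: 1 2 3, second row: 4 5 6

# array() with two dimensions holds the same entries as m_col
arr2 <- array(1:6, dim = c(2, 3))
all(arr2 == m_col)   # TRUE

# array() can also build objects with more than two dimensions
arr3 <- array(1:24, dim = c(2, 3, 4))
dim(arr3)      # 2 3 4
length(arr3)   # 24
```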

Elements in a matrix are accessed using square brackets, specifying the row and column position of the element.

a[2,3]

If you want to select an entire row or column, just leave the corresponding index blank. For example, a[1, ] returns the first row.
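
The indexing rules can be sketched on a small matrix m, defined here only for the example:

```r
m <- rbind(c(1, 2, 3), c(4, 5, 6))

m[2, 3]                # single element: 6
m[1, ]                 # first row: 1 2 3
m[, 2]                 # second column: 2 5
m[, c(1, 3)]           # first and third columns, still a matrix
m[1, , drop = FALSE]   # keep the result as a 1 x 3 matrix instead of a vector
```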

Some functions, such as contour(), persp(), and image(), can plot matrices (see the help).

Strings (character vectors)

R also lets you manipulate strings, which are stored in character vectors. Strings are delimited by double quotes ("...") or single quotes ('...').

names <- c("Francesco", "Sofia", "Alessandro")
names[1]

names_numbers <- c("Francesco", "Sofia", "Alessandro", 45)

In R, characters and numbers cannot coexist in the same vector or array. If, for example, we combine names and numbers in the same assignment, the numbers will be converted into characters.
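
The conversion can be observed directly. In this sketch the vector mixed is a shortened version of the example above.

```r
mixed <- c("Francesco", "Sofia", 45)
mixed             # "Francesco" "Sofia" "45"
class(mixed)      # "character"

# the number can be recovered by converting the entry back
as.numeric(mixed[3]) + 1   # 46

# logical values combined with numbers become numbers
c(TRUE, FALSE, 3)          # 1 0 3
```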

A very useful function for handling strings, and other values as well, is paste(), which concatenates vectors after converting them into strings (paste0() does the same without inserting a separator).

paste(1:12) 

(nth <- paste0(1:12, c("st", "nd", "rd", rep("th", 9))))

This function can be very useful, for example, if you need to create vectors of names for variables or to print your results in a more readable way.

To find out the type of data contained in a vector, we can use the command typeof() or class().

Lists and data frames

We can think of a matrix as an effective method for storing information of a single type (typically numerical), and that’s it! However, if we have to store mixed type information, for example, numbers and characters, we should prefer other methods. To circumvent this limitation, we will see lists and data frames.

A list is an ordered set of objects. You can define lists using the list() command.

c <- list(destination = c("London", "Madrid"), airline = c("Ryanair", "EasyJet"), cost = c(60, 80), currency = c("£", "€"))

We can access the contents of a list either by position, using double square brackets, c[[2]], or by name, c[['airline']] or c$airline. All the previous commands return the contents of the list component named airline. If we want to access a single element directly, we can use c[[2]][2], c[['airline']][2], or c$airline[2] interchangeably.
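
The access rules can be tried on a reduced version of the list above. In this sketch the list is called trips, a name chosen here to avoid reusing the name of the c() function.

```r
trips <- list(destination = c("London", "Madrid"),
              airline     = c("Ryanair", "EasyJet"),
              cost        = c(60, 80))

trips[[2]]           # by position: "Ryanair" "EasyJet"
trips[["airline"]]   # by name, same content
trips$airline        # by name with $, same content
trips$airline[2]     # a single element: "EasyJet"

# single brackets return a (shorter) list, not the content itself
class(trips[2])      # "list"
class(trips[[2]])    # "character"
```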

Data frames

Another data structure that allows us to store mixed data is the data frame. This structure is by far the most widely used to read and manipulate data in R.

Data frames are lists of class "data.frame" whose components are variables (columns) with the same number of rows; each row has a unique identifier (the row name).

L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
data.frame(1, 1:10, fac)
## The "same" with automatic column names:
data.frame(rep(1,10), 1:10, fac)
data.frame(1, 1:10, sample(L3, 10, replace = TRUE))
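
Once a data frame exists, its columns and rows can be accessed by combining the list and matrix syntax seen earlier. The following sketch uses a small data frame named scores with made-up values.

```r
# a small data frame with made-up values
scores <- data.frame(id = 1:5,
                     group = c("A", "B", "A", "C", "B"),
                     score = c(2.5, 3.0, 4.5, 1.0, 5.0),
                     stringsAsFactors = FALSE)

dim(scores)                    # 5 rows and 3 columns
str(scores)                    # compact overview of the column types
scores$score                   # a column, extracted as in a list
scores[2, "group"]             # row 2 of the column group: "B"
scores[scores$score > 2.8, ]   # the rows satisfying a condition
sum(scores$score) / nrow(scores)   # mean score from its definition: 3.2
```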

Tidy data, please!

A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.

Tidy data - Hadley Wickham

Given the importance of data in the modern world and the undeniable importance of time in our society, we should avoid spending too much time rearranging our data: it is better to follow the tidy data principles from the moment we create a dataset. This will save us a lot of time and improve the repeatability of our analyses!

Packages

R, like many other programming languages, has a core of predefined functions that are always available to the user, such as those we have seen so far. However, more and more functions are being developed to solve new problems and extend the potential of R, and it is neither possible nor desirable to add all of them to the R core. Indeed, this would make R slower and more memory-hungry, and it would fill it with features that the average user would never use!

This philosophy is shared by many programming languages, which offer a fundamental set of functions accessible to all (base) and the possibility of installing or loading extensions to perform other, more specific tasks.

In R, these extensions are called packages. The list of installed packages can be obtained using the library() command without any argument. If the name of a package is specified as an argument, R loads the package and its contents become available to the user.

For example, we can load the MASS package, which will be useful later, with the command library(MASS).

Not all packages are already installed in R; for example, the hflights package is not.

library(hflights)
# The command returns an error, since the package is not installed;
# we can install it using the command
install.packages('hflights')
library(hflights)
# now it is loaded

Some packages may require other packages to be installed. We can tell R to install these dependencies as well by adding the argument dependencies = TRUE after the package name. The complete statement is install.packages(package_name, dependencies = TRUE).

The search() command returns the list of the packages (and other environments) attached in the current session.

Any user can prepare and release an R package. Released packages are usually installed using the command shown above. By default, RStudio relies on the CRAN repository to find the packages.

However, if the desired package is not present, it is possible to specify a different server. For example, the following command installs a package from the Bioconductor servers.

# biocLite() has been retired; recent Bioconductor releases use BiocManager
install.packages("BiocManager")
BiocManager::install("RnaSeqTutorial")

Similarly, RStudio allows you to install packages manually from the menu Tools | Install Packages… | Install from: | Package Archive File.

Using the package name followed by :: gives access to the functions and datasets contained in a package. This is very useful when functions in different packages share the same name: by specifying the package, you can be sure to use the correct function.
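
A minimal sketch of the :: syntax, using only packages that ship with R (stats, utils, datasets). The user-defined sd() below is a deliberately artificial function, created only to simulate a name clash.

```r
# a user-defined function that hides the sd() function of the stats package
sd <- function(x) "my own sd"

sd(c(1, 2, 3))          # calls our function: "my own sd"
stats::sd(c(1, 2, 3))   # explicitly calls the stats version: 1

rm(sd)                  # remove our function; sd() is the stats version again

# datasets can be reached in the same way
utils::head(datasets::mtcars, 3)
```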

Let’s play with the data

Base R and the pre-installed packages contain several datasets that are available without the need for reading external files. Now, we will practice with those contained in the MASS and hflights packages.

Exercises

Exercise 1

Let’s consider the dataset mtcars. Use the help to get information on the content of the dataset. Analyze the types of the variables. Which are the most fuel-efficient cars? And the most powerful? And the lightest? Which car has the highest power-to-weight ratio?

Exercise 2

Let’s consider the dataset mtcars. Using arithmetic operations, convert the fuel consumption from miles per gallon to km per liter and store the result in a new vector.

Exercise 3

From the MASS package, select the beav2 dataset. Use the help to get information on the content of the dataset. Without using the mean function (use its definition instead), compute the mean temperature of the measurements for each value of the variable activ.

Exercise 4

Consider the package hflights and the dataset it contains. Use the help to get information on the content of the dataset. After selecting the destination LGA (LaGuardia airport, New York), create a new vector containing the values of the variable ArrDelay. What is the sum of this vector? How can we interpret this result?

© 2017-2019 Federico Reali