Data manipulation exercises
Originally posted here.
Exercise 1
Consider the dataset (description) that contains information about the hourly traffic volume on Interstate I-94.
- Import the dataset into R and verify the correctness of the content. Replace any errors with the median of the same variable.
- Define an R function that converts a temperature from kelvins to degrees Celsius. Using this function, create a new variable `TempC` that contains the temperatures in Celsius.
- Create a new variable `rain_YN` that is 0 if `rain_1h` is 0, and 1 otherwise.
- For each value of `rain_YN`, compute the minimum, maximum, mean and median of the temperature (in Celsius) and of the traffic volume. Using box plots, visualize the distribution of the traffic volume for each value of `rain_YN`.
- Using the previous results, discuss the common belief that there is more traffic on rainy days.
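The steps of Exercise 1 can be sketched on a small toy data frame. This is only an illustration under stated assumptions, not a reference solution: the rows are invented, the column names `temp` and `traffic_volume` are assumed (only `rain_1h` appears in the text), and treating a temperature of 0 K as an error is an assumed rule.

```r
# Toy rows standing in for the traffic dataset (invented values;
# the column names temp and traffic_volume are assumptions).
traffic <- data.frame(
  temp           = c(288.28, 289.36, 0, 275.15, 293.15, 281.15),
  rain_1h        = c(0, 0.25, 0, 0, 1.5, 0),
  traffic_volume = c(5545, 4516, 4767, 5026, 4918, 5181)
)

# Assumed error rule: a temperature of 0 K is implausible, so replace it
# with the median of the remaining (valid) temperatures.
bad <- traffic$temp <= 0
traffic$temp[bad] <- median(traffic$temp[!bad])

# Convert a temperature from kelvins to degrees Celsius.
kelvin_to_celsius <- function(k) k - 273.15

traffic$TempC   <- kelvin_to_celsius(traffic$temp)
traffic$rain_YN <- ifelse(traffic$rain_1h == 0, 0, 1)

# Minimum, maximum, mean and median for each value of rain_YN.
stats <- function(x) c(min = min(x), max = max(x), mean = mean(x), median = median(x))
summary_by_rain <- aggregate(cbind(TempC, traffic_volume) ~ rain_YN,
                             data = traffic, FUN = stats)
print(summary_by_rain)

# Distribution of the traffic volume for each value of rain_YN.
boxplot(traffic_volume ~ rain_YN, data = traffic,
        xlab = "rain_YN", ylab = "Traffic volume")
```

On the real dataset, what counts as an error has to be decided by inspecting each variable first (for example with `summary()`), so the rule above should be adapted accordingly.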
Exercise 2
Import the dataset that contains the data related to the Florida Gators football team representing the University of Florida.
- Import the file into R. Suggestion: convert the file to .csv.
- From the original dataset select the columns `Year`, `YearGame`, `Date`, `Opponent`, `OppWins_1A` and `GatorOutcm`. All the following analyses refer to this reduced dataset.
- Analyze the data structure and extract the average, standard deviation, quartiles, minimum and maximum (for all variables).
- Determine the team against which the Florida Gators competed most often. How many times have they won, lost and tied with that team?
- For the games against this team, calculate the mean and standard deviation of the variable `OppWins_1A`.
- Using box plots, visualize the distribution of the variable `OppWins_1A` for each game result.
- From the data, against which team among “Alabama”, “Kentucky” and “Tennessee” do the Gators have the highest chance of winning?
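A minimal sketch of the opponent analysis in Exercise 2, run on invented rows. The outcome coding `"W"`, `"L"`, `"T"` in `GatorOutcm` is an assumption and should be checked against the actual file; the opponents and numbers below are made up for illustration.

```r
# Toy rows standing in for the reduced Gators dataset (invented values).
games <- data.frame(
  Opponent   = c("Georgia", "Georgia", "Kentucky", "Georgia", "Tennessee", "Kentucky"),
  OppWins_1A = c(8, 10, 4, 7, 9, 5),
  GatorOutcm = c("W", "L", "W", "T", "L", "W"),  # assumed coding: win, loss, tie
  stringsAsFactors = FALSE
)

# Opponent faced most often.
counts <- table(games$Opponent)
top_opponent <- names(counts)[which.max(counts)]

# Wins, losses and ties against that opponent.
vs_top <- games[games$Opponent == top_opponent, ]
outcomes_vs_top <- table(vs_top$GatorOutcm)

# Mean and standard deviation of OppWins_1A in those games.
mean_opp_wins <- mean(vs_top$OppWins_1A)
sd_opp_wins   <- sd(vs_top$OppWins_1A)

# Distribution of OppWins_1A for each game result.
boxplot(OppWins_1A ~ GatorOutcm, data = games,
        xlab = "Game result", ylab = "OppWins_1A")

# Empirical win rate per opponent: the share of games coded "W".
win_rate <- tapply(games$GatorOutcm == "W", games$Opponent, mean)
print(win_rate)
```

Using the empirical win rate as the "chance of winning" is one simple choice; with few games against a team the estimate is noisy, which is worth mentioning in the discussion.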
Exercise 3
- Import the dataset about the Roman emperors and check that the variables are read correctly.
- Convert the columns `birth`, `death`, `reign_start` and `reign_end` to the appropriate type.
- Define a new column `reign_length` that contains the length of each emperor's reign. Note that some dates are B.C.
- Define a new column that converts `reign_length` to years and rounds the result. Which emperor had the longest reign? And the shortest?
- Visualize and compare the distribution of the variable `reign_length` in years using a plot, a histogram and a box plot. Which one is the most informative? Which is the least?
- For each dynasty, compute the mean reign length and the number of emperors of the dynasty.
- For each dynasty, compute the percentage of emperors killed by assassination.
- Suppose that the number of deaths by assassination in the Gordian dynasty can be modeled with a binomial distribution with parameter p. Compute and plot the likelihood as a function of p. Estimate the value of p that maximizes the likelihood.
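The likelihood step in the last item can be sketched as follows. The counts `n` (emperors in the dynasty) and `k` (emperors assassinated) are hypothetical placeholders, not values taken from the dataset; replace them with the counts obtained in the previous item.

```r
# Hypothetical counts, for illustration only.
n <- 5  # number of emperors in the dynasty (assumed)
k <- 2  # number of them killed by assassination (assumed)

# Binomial likelihood of p evaluated on a grid over [0, 1].
p_grid     <- seq(0, 1, by = 0.01)
likelihood <- dbinom(k, size = n, prob = p_grid)

plot(p_grid, likelihood, type = "l", xlab = "p", ylab = "Likelihood")

# Maximizer on the grid.
p_hat_grid <- p_grid[which.max(likelihood)]

# Numerical maximizer over the whole interval.
p_hat_opt <- optimize(function(p) dbinom(k, size = n, prob = p),
                      interval = c(0, 1), maximum = TRUE)$maximum

# For a binomial model the maximum is also available in closed form: k / n.
p_hat_exact <- k / n
```

The grid search, the call to `optimize()` and the closed form `k / n` should agree up to the grid resolution and numerical tolerance, which gives a quick check on the plot.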
© 2017-2019 Federico Reali