General comments:

  • All the plots should be labelled appropriately (axes, legends, titles). There will be marks allocated for this.
  • Please submit both your .Rmd, and the generated output file .html or .pdf on Canvas before the due date/time.
  • Please make sure that the .Rmd file compiles without any errors. The marker will not spend time fixing the bugs in your code.
  • Please avoid specifying absolute paths.
  • Your submission must be original, and if we recognize that you have copied answers from another student in the course, we will deduct your marks.
  • You must use tidyverse packages to answer the questions in this assignment. Please use dplyr for data wrangling/manipulation, ggplot2 for data visualisation, and lubridate for dates/times. Some parts of Problem 2 will use plots from the fpp3 packages.
  • IMPORTANT NOTE: There are some questions that are for STATS 786 only. Students taking STATS 326, while you are welcome to attempt these questions, please do not submit answers to them.

Due: Friday 15 March 2023 at 16:00 PM (NZ time)

Problem 1: Kobe Bryant

The lakers data set (in the lubridate package) contains play-by-play statistics of each Los Angeles Lakers basketball game in the 2008-2009 regular season. It contains the following variables:

Variable Description
date Date of the game
opponent Name of the opposition team
game_type Home or away game
time Time remaining on the game clock in a given period (counting down from 12 minutes)
period The period of play (most games have four quarters, each 12 minutes in duration, noting that some games go into a 5-minute duration overtime if tied at the end of regular play)
etype The type of play made (e.g., shot, turnover, rebound)
team Name of the NBA team the player who made the play belongs to
player Name of the player that the play was made by
result Whether they won or lost the game
type A more detailed description of the type of play made
x The \(x\)-coordinate on the field of play (in ft)
y The \(y\)-coordinate on the field of play (in ft)
  1. 6 Marks
  • Read in the lakers data set and convert this into a tibble object.
  • Keep only the rows relating to Kobe Bryant. Name this object kobe.
  • Transform the date variable into a lubridate date format (noting that it is currently in integer format).
  • Shot location is given by x and y. The center of the hoop is located at the coordinates \((25, 5.25)\). Center the x and y variables to \((0, 0)\); you will want to overwrite x and y in your kobe data set.
lakers <- as_tibble(lakers)

kobe <- lakers %>%
  filter(player == "Kobe Bryant")

kobe$date <- ymd(kobe$date)

kobe$x <- kobe$x - 25
kobe$y <- kobe$y - 5.25
  1. 6 Marks
  • Subset the kobe data set by only considering plays that are shot attempts (i.e., where etype is equal to shot). Name this new data set kobe.shot.
  • Make a scatter plot of the centered shot location, colouring the points by result. You should use the geom_point layer.
  • Set the transparency of the points to alpha = 0.5.
  • Use the default colour scheme, but reverse the colour order so that shots made is green(ish) and shots missed is red(ish). Hint: You can use scale_colour_discrete with an additional argument.
kobe.shot <- kobe %>%
  filter(etype == "shot")

kobe.shot %>%
  ggplot(mapping = aes(x = x, y = y, colour = result)) +
  geom_point(alpha = 0.5) +
  scale_colour_discrete(direction = -1) + 
  theme_minimal() +
  labs(x = "loc_x(ft)",
       y = "loc_y(ft)",
       colour = "result",
       title = "Scatter Plot of the centered shot location")

  1. 6 Marks
  • Using the kobe.shot data set, produce a 2-dimensional density plot (with contours) of Kobe Bryant’s shot locations. You will want to use both geom_density_2d_filled and geom_density_2d. Do not colour by result.
  • Remove the legend using legend.position argument in the theme layer.
kobe.shot %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_density_2d_filled() + 
  geom_density_2d() +
  theme(legend.position = "none") +
  labs(x = "loc_x(ft)",
       y = "loc_y(ft)",
       title = "2-dimensional density plot of the centered shot location")

  1. 9 Marks
  • Within the kobe.shot data set, create a variable called distance that calculates the distance a shot was taken from hoop. You will need to use Pythagoras’ theorem, i.e., \(\text{distance} = \sqrt{x^2 + y^2}\).
  • Then create another variable within your kobe.shot data set called indicator that concatenates the values of result with game_type. Hint: You can use the paste function. You should end up with a variable in your data set that takes on the four values: “made home”, “made away”, “missed home”, “missed away”.
  • Plot histograms showing the distribution of distance using geom_histogram. Use facet_wrap to create seperate panels for all values of indicator. (You should end up with four panels on the same figure).
  • Fill the histograms by indicator such that the interior of the bars are different colours for the four different groups.
  • Remove the legend.
kobe.shot %>%
  mutate(distance = sqrt(x^2 + y^2),
         indicator = paste(result, game_type)) %>%
  ggplot(mapping = aes(x = distance, fill = indicator)) +
  geom_histogram(bins = 30) +
  theme(legend.position = "none") +
  facet_wrap(~indicator) +
  labs(x = "Distance(ft)",
       y = "Frequency",
       fill = "indicator",
       title = "Histograms of distance")

  1. 11 Marks
  • Subset the original kobe data set (not the kobe.shots data set) by considering plays that are only free throws (i.e., where etype is equal to free throw). Call this new data set kobe.free.
  • Within the kobe.free data set, calculate the total number of points from free throws per game as well as the free throw percentage per game. You will want to use the group_by, summarise, sum, and n functions.
  • Plot Kobe Bryant’s free throw percentage per game using geom_segment to create vertical line segments from 0 to the free throw percentage. Your \(x\)-axis should be date and your \(y\)-axis should be free throw percentage.
  • Add transparency proportional to the total number of points per game. (i.e., a larger number of points should have darker line segments).
kobe.free <- kobe %>%
  filter(etype == "free throw")

kobe.free.points <- kobe.free %>%
  group_by(date) %>%
  summarise(total_points = sum(points),
            percentage = total_points / n()) 

kobe.free.points %>%
  ggplot(mapping = aes(x = date, y = percentage)) +
  geom_segment(aes(xend = date,y = 0, yend = percentage, alpha = total_points)) +
  theme_minimal() +
  labs(title = "Histograms of Kobe Bryant's free throw percentage per game")

  1. 7 Marks
  • Using the kobe data set, find the unique dates that Kobe Bryant played in the 2008-2009 regular season. Hint: You will want to use the group_by, summarise, and n_distinct functions. You should end up with a data set with 78 rows. Name this data set kobe.week.
  • Then create a variable that tells you the day of the week the game was played. You will need an appropriate lubridate function.
  • Plot a bar chart that shows the frequency of games played on each of the seven days of the week.
  • Comment on the most common and least common game days.

The most common day is Tuesday, and the least common day is Monday and Saturday.

kobe.week <- kobe %>%
  group_by(date) %>%
  summarise(n_distinct(date))

kobe.week %>%
  mutate(week = wday(date, label = TRUE)) %>%
  ggplot(mapping = aes(x = week)) +
  geom_bar() +
  labs(x = "day",
       y = "frequency",
       title = "barplot")

Total possible marks for Problem 1: 45 Marks

Problem 2: Monthly average temperatures in Auckland

The data set auckland_temps.csv contains the monthly average temperatures in Auckland from July 1994 until January 2024. [Data source: https://cliflo.niwa.co.nz]

  1. 5 Marks
  • Read in the data using read_csv (don’t use read.csv).
  • Convert the Month variable into the correct date format. Hint: You will need to use a function from the tsibble package.
  • Coerce your tibble to a tsibble object with Month as the index.
data_csv <- read_csv("auckland_temps.csv") 
## Rows: 355 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Temperature
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_tsibble <- data_csv %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
  1. 7 Marks
  • Create a time plot, seasonal plot, and subseries plot of the data.
  • Comment on the seasonality in the plots. Which month has the highest average temperatures, and which month has the lowest?
  • Comment on whether there is a trend in the data and if so, in what direction.

Febrary has the highest average temperatures, July has the lowest. Yes. The temperature undergoes seasonal changes, being low in winter, gradually rising in spring, reaching its peak in summer, and then declining in autumn.

data_tsibble %>%
  autoplot(Temperature) +
  labs(x = "time",
       y = "temperature",
       title = "time plot")

data_tsibble %>%
  gg_season(Temperature, labels = "none") +
  labs(y = "temperature",
       title = "seasonal plot")

data_tsibble %>%
  gg_subseries(Temperature) +
  labs(x = "time",
       y = "temperature",
       title = "subseries plot")

  1. 8 Marks
  • Create a lag plot for the first 12 lags. Use the point geometry and set alpha = 0.5.
  • Write a sentence or two explaining what autocorrelation is.
  • Comment on the patterns you observe in the lag plot, explaining why we see these specific autocorrelation patterns for lags 1, 6, and 12.

Autocorrelation assesses how well the values of a variable correlate with their own previous values

The temperatures in adjacent months are similar, showing a significant positive correlation. In Figure 6, with a time difference of six months, the seasons are opposite; winter and summer temperatures are opposites, situated on the outermost sides, while autumn and spring temperatures are similar and closer. In Figure 12, with a time difference of 12 months, representing a complete cycle of the four seasons, the seasons are the same, and the temperatures are almost identical, displaying a strong positive correlation.

data_tsibble %>%
  gg_lag(Temperature, lags = 1:12, geom = "point", alpha = 0.5) +
  labs(title = "Lag plot")

  1. STATS 786 only 10 Marks
  • In this question, you will recreate the lag plot from (3).
  • Instead of using gg_lag, you will use functions from the dplyr and ggplot2 packages to create your own lag plot. You may also need functions from lubridate, forcats, and tidyr.
  • Try to get your plot as close to the gg_lag plot you produced in (3). You will get full marks if your plot is exactly the same as what you get with gg_lag. Marks will be deducted for inconsistencies.
  • Note: There are many ways to solve this problem, but here are some things you may want to consider: how to create lagged variables in your data set, how to name them, how to convert your data set into a long format, and how to facet your plot.
data_tsibble_lag <- data_tsibble %>%
  mutate(month_name = month(Month,label = TRUE)) %>%
  mutate(Value_lag1 = lag(Temperature, 1)) %>%
  mutate(Value_lag2 = lag(Temperature, 2)) %>%
  mutate(Value_lag3 = lag(Temperature, 3)) %>%
  mutate(Value_lag4 = lag(Temperature, 4)) %>%
  mutate(Value_lag5 = lag(Temperature, 5)) %>%
  mutate(Value_lag6 = lag(Temperature, 6)) %>%
  mutate(Value_lag7 = lag(Temperature, 7)) %>%
  mutate(Value_lag8 = lag(Temperature, 8)) %>%    
  mutate(Value_lag9 = lag(Temperature, 9)) %>%
  mutate(Value_lag10 = lag(Temperature, 10)) %>%
  mutate(Value_lag11 = lag(Temperature, 11)) %>%
  mutate(Value_lag12 = lag(Temperature, 12))

data_tsibble_lag <- pivot_longer(data_tsibble_lag, 
             cols = Value_lag1 :Value_lag12, 
             names_to = "Lag", 
             values_to = "Lag_value",
             values_drop_na = TRUE)

data_tsibble_lag$Lag <- factor(data_tsibble_lag$Lag, 
                               levels = c("Value_lag1","Value_lag2","Value_lag3", "Value_lag4","Value_lag5","Value_lag6","Value_lag7","Value_lag8","Value_lag9","Value_lag10","Value_lag11","Value_lag12"),
                               labels = c("lag 1","lag 2","lag 3", "lag 4","lag 5","lag 6","lag 7","lag 8","lag 9","lag 10","lag 11","lag 12"))


data_tsibble_lag %>%
  ggplot(aes(x = Lag_value, y = Temperature, color = month_name)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~Lag) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
  labs(title = "Lag plot", 
       x = "lag(Temperature, n)", 
       y = "Temperature",
       color = "season") +
  theme(aspect.ratio = 1)

Total possible marks for Problem 2: 20 Marks for 326 30 Marks for 786

Total possible marks for Assignment 1: 65 Marks for 326 75 Marks for 786