General comments:
All the plots should be labelled appropriately (axes, legends, titles).
Please submit both your .Rmd file and the generated output file (.html or .pdf) on Canvas before the due date/time.
Please make sure that the .Rmd file compiles without any errors. The marker will not spend time fixing bugs in your code.
Please avoid specifying absolute paths.
Your submission must be original, and if we recognize that you have copied answers from another student in the course, we will deduct your marks.
You will need to use the tidyverse and fpp3 libraries for this assignment.
IMPORTANT NOTE: Some questions are for STATS 786 only. Students taking STATS 326 are welcome to attempt these questions but should not submit answers to them.
Due: Friday 3 May 2024 at 16:00 (NZ time)
In Assignments 1 and 2, you investigated monthly average temperatures in Auckland. In this problem, you will do some further analysis. The data set auckland_temps.csv contains the monthly average temperatures in Auckland from July 1994 until January 2024. The time series plot is given below. For this question, do not Box-Cox transform the data.
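The answers below assume the data have already been read in and converted to a tsibble called data. A minimal setup sketch is given here; the file path and the index column name Month are assumptions (the Temperature column name comes from the code that follows).

library(tidyverse)
library(fpp3)

# Read the monthly temperature data and convert it to a tsibble
# (the column names Month and Temperature are assumed here)
data <- read_csv("auckland_temps.csv") %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)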
- 12 Marks
- Recall from class that you can forecast with time series decomposition by forecasting the seasonal component and seasonally-adjusted series separately. The seasonal component will automatically be forecasted using the default SNAIVE method, but you will need to specify which benchmark methods you are using on the seasonally-adjusted series.
- Create a training set by filtering out the most recent 2 years from the temperature data. The training set should contain dates 1994 Jul until 2022 Jan.
- Fit an STL model (with robust = TRUE) to the training set, and using the seasonally-adjusted series from this STL decomposition, fit the naive (NAIVE), the average (MEAN), and the random walk with drift (RW) benchmark forecast methods. The decomposition_model function will be helpful. Also fit a seasonal naive (SNAIVE) model to the training set (without doing an STL decomposition).
- Forecast 2 years up until 2024 Jan so that you can compare what actually happened in the real data to what you predict would have happened with your forecast models.
- Plot the point-forecasts with the original data with autoplot. Do not plot the prediction intervals (this can be achieved with the argument level = NULL).
- Evaluate the point-forecasts by computing and comparing the forecast accuracy measures discussed in class.
- Discuss which forecast method is better for monthly average Auckland temperatures, and why.
# Create a training set containing 1994 Jul - 2022 Jan
# (the first column of the tsibble is the month index)
training_set <- filter(data, data[[1]] < yearmonth("2022 Feb"))
# Fit benchmark methods to the seasonally-adjusted series from an STL decomposition
naive_fit <- training_set %>%
model(naive = decomposition_model(
STL(Temperature ~ trend(window = 7), robust = TRUE),
NAIVE(season_adjust)))
mean_fit <- training_set %>%
model(mean = decomposition_model(
STL(Temperature ~ trend(window = 7), robust = TRUE),
MEAN(season_adjust)))
drift_fit <- training_set %>%
model(drift = decomposition_model(
STL(Temperature ~ trend(window = 7), robust = TRUE),
RW(season_adjust ~ drift())))
# Fit seasonal naive method to the training set
snaive_fit <- training_set %>%
model(snaive = SNAIVE(Temperature))
# Forecast
naive_fc <- naive_fit %>%
forecast(h = "2 years")
mean_fc <- mean_fit %>%
forecast(h = "2 years")
drift_fc <- drift_fit %>%
forecast(h = "2 years")
snaive_fc <- snaive_fit %>%
forecast(h = "2 years")
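As an optional, more compact alternative (not part of the answer above), the four methods can be fitted in a single model() call, which returns one mable and one fable and avoids binding the individual forecasts together later:

# Optional: fit all four benchmark methods at once
fits <- training_set %>%
  model(
    naive = decomposition_model(
      STL(Temperature ~ trend(window = 7), robust = TRUE),
      NAIVE(season_adjust)),
    mean = decomposition_model(
      STL(Temperature ~ trend(window = 7), robust = TRUE),
      MEAN(season_adjust)),
    drift = decomposition_model(
      STL(Temperature ~ trend(window = 7), robust = TRUE),
      RW(season_adjust ~ drift())),
    snaive = SNAIVE(Temperature))

# A single fable containing the forecasts from all four methods
all_fc_alt <- fits %>% forecast(h = "2 years")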
# Plot the point-forecasts together with the original data
all_fc <- bind_rows(naive_fc, mean_fc, drift_fc, snaive_fc)
all_fc %>%
  autoplot(data, level = NULL) +
  labs(title = "Point-forecasts of monthly average Auckland temperatures",
       y = "Temperature") +
  theme_minimal()
# Evaluate the point-forecasts by computing accuracy measures on the test set
accuracy(all_fc, data)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift Test -1.11 1.37 1.12 -7.26 7.34 1.25 1.18 0.0667
## 2 mean Test 0.753 1.09 0.988 4.51 6.13 1.10 0.939 0.0343
## 3 naive Test -1.02 1.29 1.03 -6.65 6.76 1.15 1.11 0.0343
## 4 snaive Test -0.0250 0.976 0.85 -0.124 5.32 0.944 0.843 0.194
## The seasonal naive (snaive) method has the smallest RMSE, MAE, MAPE, and MASE, so it produces the most accurate point-forecasts and is the better method for monthly average Auckland temperatures. This is not surprising: the series is dominated by a strong, stable seasonal pattern with very little trend, so simply repeating the value from the same month in the previous year is hard to beat.
Total marks for Problem 1: 12 Marks
The data set tsr.csv contains quarterly measurements of an anonymised variable from 2000 Q1 until 2023 Q4. The time series is plotted below. You will notice the following features in this time series: an increasing trend, quarterly seasonality, outliers at three quarters, and a level shift where the series moves up.
- 4 Marks Using any approach you want, find the three quarters that the outliers occur at. Create a dummy variable that is 1 at these three quarters and 0 otherwise. Note: Usually when dealing with outliers, you would create a dummy variable for each individual outlier. However, here you may assume the outliers are due to the same phenomenon, so only create one dummy variable for all three outliers.
# From the time series plot, the three outlying measurements all exceed 20,
# so flag any quarter with Measurement > 20
data <- data %>%
  mutate(Outlier = ifelse(Measurement > 20, 1, 0))
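The question asks for the three quarters at which the outliers occur, so it helps to print them explicitly. A short sketch using the dummy variable just created:

# Report the quarters flagged as outliers
data %>%
  filter(Outlier == 1) %>%
  select(Quarter, Measurement)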
- 4 Marks Using any approach you want, find the quarter at which the level-shift occurs (i.e., where the time series level moves up). Create a dummy variable for the level-shift that takes the value 1 from that quarter onwards and 0 otherwise.
# Find the first quarter-to-quarter increase of between 5 and 10 units;
# the bounds (chosen from the time series plot) exclude the larger jumps
# associated with the outliers. diff() compares consecutive observations,
# so the shift starts at observation indice + 1.
indice <- which(diff(data$Measurement) > 5 & diff(data$Measurement) < 10)[1]
data <- data %>%
  mutate(LevelShift = ifelse(row_number() > indice, 1, 0))
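To report the quarter at which the level shift occurs, the index found above can be converted back to a date:

# Quarter at which the level shift occurs
data$Quarter[indice + 1]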
- 6 Marks Fit the following three different time series regression models.
- Model 1: Linear trend with level shift and seasonal factors.
- Model 2: Linear trend with level shift, seasonal factors, and outliers.
- Model 3: Linear trend with level shift, \(K = 1\) Fourier terms, and outliers.
# Model 1
model_1 <- data %>%
model(TSLM(Measurement ~ trend() + LevelShift + season()))
# Model 2
model_2 <- data %>%
model(TSLM(Measurement ~ trend() + LevelShift + season() + Outlier))
# Model 3
model_3 <- data %>%
model(TSLM(Measurement ~ trend() + LevelShift + fourier(K = 1) + Outlier))
- 6 Marks Plot the original time series with the three sets of fitted values and comment on how well each model fits the data.
all_fit <- bind_rows(
augment(model_1) %>% mutate(Model = "Model 1"),
augment(model_2) %>% mutate(Model = "Model 2"),
augment(model_3) %>% mutate(Model = "Model 3"))
all_fit %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Measurement)) +
geom_line(aes(y = .fitted, colour = Model)) +
guides(colour = guide_legend(title = NULL)) +
  labs(title = "Observed data and fitted values from the three regression models",
       y = "Measurement") +
  theme_minimal()
## Model 1 has no Outlier term, so it cannot capture the three outlying quarters.
## Model 3 includes the outliers, but with only K = 1 Fourier terms it cannot fully
## represent the quarterly seasonal pattern. Model 2 follows the data most closely.
- 6 Marks Compare the three models using AICc and the CV-statistic and conclude which model has the best predictive ability. This will be your selected model for the questions to follow. Report the estimates for your selected model.
rbind(glance(model_1) %>% select(.model, AICc, CV),
glance(model_2) %>% select(.model, AICc, CV),
glance(model_3) %>% select(.model, AICc, CV))
## # A tibble: 3 × 3
## .model AICc CV
## <chr> <dbl> <dbl>
## 1 TSLM(Measurement ~ trend() + LevelShift + season()) 102. 2.83
## 2 TSLM(Measurement ~ trend() + LevelShift + season() + Outlier) -124. 0.275
## 3 TSLM(Measurement ~ trend() + LevelShift + fourier(K = 1) + Outli… 77.2 2.14
## Model 2 has the smallest AICc and CV values, so it has the best predictive ability and is the selected model for the questions to follow.
report(model_2)
## Series: Measurement
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36831 -0.33043 -0.02308 0.31677 1.41538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.877660 0.148145 12.675 < 2e-16 ***
## trend() 0.052125 0.003496 14.911 < 2e-16 ***
## LevelShift 5.280669 0.196289 26.902 < 2e-16 ***
## season()year2 3.047059 0.143211 21.277 < 2e-16 ***
## season()year3 -1.105161 0.143805 -7.685 1.91e-11 ***
## season()year4 1.040672 0.143216 7.266 1.35e-10 ***
## Outlier 8.914841 0.301250 29.593 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4957 on 89 degrees of freedom
## Multiple R-squared: 0.9903, Adjusted R-squared: 0.9896
## F-statistic: 1507 on 6 and 89 DF, p-value: < 2.22e-16
- 10 Marks Check the model assumptions for your chosen model. You will need to assess linearity, independence, normality, zero mean, and equality of variance. Note 1: Linearity can be checked by comparing the observed versus fitted values.
Note 2: For the Ljung-Box test, you can find the number of estimated parameters in the report you produced in (5), and because this is a seasonal time series, use \(2m\) lags.
model_2 %>% gg_tsresiduals()
# Ljung-Box test with 2m = 8 lags (quarterly data, m = 4) and dof = 7,
# the number of estimated coefficients in Model 2
augment(model_2) %>%
  features(.resid, ljung_box, lag = 8, dof = 7)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 TSLM(Measurement ~ trend() + LevelShift + season() + Outlie… 16.4 0.0000501
# Linearity can be assessed from the observed-versus-fitted plot (a sketch is given below). From the residual plots, the residuals have a mean close to zero and a roughly constant variance over time, and the histogram suggests they are approximately normally distributed.
# However, the Ljung-Box p-value (about 0.00005) is well below 0.05, so we reject the null hypothesis that the residuals are uncorrelated: there is significant autocorrelation remaining, and the independence assumption is not satisfied. (Note that the Ljung-Box test assesses autocorrelation, not equality of variance.)
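Note 1 asks for a linearity check via observed versus fitted values; this was not shown above, so here is a minimal sketch:

# Linearity check: observed versus fitted values should lie close to a straight line
augment(model_2) %>%
  ggplot(aes(x = .fitted, y = Measurement)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour = "grey50") +
  labs(title = "Observed versus fitted values (Model 2)",
       x = "Fitted values", y = "Observed measurement") +
  theme_minimal()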
- 8 Marks Regardless of your assumption check, forecast four years into the future for your chosen model. Plot your forecasts with the 90% and 99% prediction intervals. Hint 1: The new_data function may be helpful. Hint 2: You will need to provide values for the two dummy predictors you created. Assume that no outliers will occur in these four years.
# Future values of the dummy predictors over the next 16 quarters (4 years):
# assume no outliers occur and the level shift persists
future_scenarios <- scenarios(
  new_data(data, 16) %>%
    mutate(Outlier = 0, LevelShift = 1))
fc <- model_2 %>%
forecast(new_data = future_scenarios)
data %>%
autoplot(Measurement) +
autolayer(fc, level = c(90, 99)) +
  labs(title = "Four-year forecasts from Model 2 with 90% and 99% prediction intervals",
       y = "Measurement") +
theme_minimal()
- 4 Marks Output and interpret the 90% prediction interval for the 1-step forecast of your chosen model (i.e., for 2024 Q1). Hint: The hilo function may be helpful. Note: You may need to knit your document to see the output.
fc[1,] %>%
hilo(level = 90) %>%
select('90%')
## # A tsibble: 1 x 2 [1Q]
## `90%` Quarter
## <hilo> <qtr>
## 1 [11.3659, 13.06303]90 2024 Q1
# The 90% prediction interval for 2024 Q1 is approximately (11.37, 13.06): the model predicts that, with 90% probability, the measurement in 2024 Q1 will lie between about 11.4 and 13.1.
- STATS 786 only 10 Marks For this question, you will manually calculate the cross-validation statistic for your selected time series regression model. Recall that there is a shortcut to performing cross-validation that does not require computation over many training/test sets. If \(X\) is the design matrix (matrix of predictor variables), then \(H = X (X^T X)^{-1} X^T\) is called the hat matrix. The cross-validation statistic can be computed as \(\text{CV} = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{1 - h_t}\right)^2\), where \(e_t\) are the residuals from the fitted model and \(h_t\) are the diagonal elements of the hat matrix. You will need to create your own design matrix \(X\) that contains only the predictor variables in your chosen model. You will have to create variables for your trend and seasonal predictors, as well as the intercept. Make sure the matrix object does not contain the dates or measurements. Useful matrix functions include %*%, t, crossprod, solve, and diag. Verify your answer by comparing to the CV-statistic you produced in (5).
# Build the columns of the design matrix X
intercept <- rep(1, nrow(data))
# trend() in TSLM uses the time index; rescaling this column leaves the
# hat matrix (and hence CV) unchanged because the intercept is also in X
trend <- 1:nrow(data)
# Quarterly seasonal dummies (Q1 is the baseline level)
season_year2 <- ifelse(quarter(data$Quarter) == 2, 1, 0)
season_year3 <- ifelse(quarter(data$Quarter) == 3, 1, 0)
season_year4 <- ifelse(quarter(data$Quarter) == 4, 1, 0)
# Create design matrix X (predictors only: no dates or measurements)
X <- cbind(intercept, trend, data$LevelShift,
           season_year2, season_year3, season_year4, data$Outlier)
# Calculate the hat matrix H
H <- X %*% solve(crossprod(X)) %*% t(X)
# Extract the diagonal elements of H
h <- diag(H)
# Extract the residuals from the fitted model
residuals <- augment(model_2)$.resid
# Calculate the cross-validation statistic CV
CV <- mean((residuals / (1 - h))^2)
print(CV)
## [1] 0.2750324
glance(model_2) %>% select(CV)
## # A tibble: 1 × 1
## CV
## <dbl>
## 1 0.275
Total marks for Problem 2: 48 Marks for STATS 326; 58 Marks for STATS 786
Total possible marks for Assignment 3: 60 Marks for STATS 326; 70 Marks for STATS 786