Case study / 2026

Equity Forecasting

A reproducible R analysis package for daily equity closing-price forecasting. ARIMA, ETS, and a naive baseline behind a model registry, with formal stationarity tests and residual diagnostics.

3 min read

Equity Forecasting thumbnail with time series line and forecast cone

Problem.

Building a defensible short-horizon point forecast from a noisy, non-stationary financial series is a textbook problem with a lot of folklore around it. Most public examples skip the stationarity tests, skip the residual diagnostics, and report results for a single model. This project builds the end-to-end pipeline properly: load, validate, diagnose, transform, fit candidates, evaluate, report.

The committed working example uses 5,124 daily observations of NYSE ticker A from December 1999 to May 2023, with closing prices ranging from $7.76 to $113.70.

My contribution.

Solo, end to end:

  • Schema validation on data load. Expected columns, types, and chronological order enforced at the boundary.
  • EDA: distributional summary, quantiles, full-sample series plot, and STL decomposition with trend and seasonal strength measures.
  • Transformations: optional log, integer-order differencing, all configurable.
  • Stationarity testing. ADF and KPSS combined into a single decision rule rather than running each in isolation and getting contradictory verdicts.
  • Model candidates registered through a MODEL_REGISTRY pattern. ARIMA, seasonal ARIMA via auto.arima, ETS, naive baseline. New candidates can be registered by adding a single function rather than editing the pipeline.
  • Residual diagnostics. Ljung-Box test for residual autocorrelation, Shapiro-Wilk test for normality, plus summary checks on residual mean and variance.
  • Forecast evaluation against the naive baseline as the floor. RMSE, MAE, MAPE.
  • testthat suite with a synthetic series fixture, R-CMD-check CI matrix across multiple R versions, lintr config gating CI.
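The combined stationarity rule works because ADF and KPSS have opposite null hypotheses, so agreement between them is informative and disagreement is worth flagging rather than silently picking one answer. A minimal base-R sketch of such a rule (the function name and messages are illustrative, not the package's actual API; in the real pipeline the p-values would come from tseries::adf.test and tseries::kpss.test):

```r
# Combined ADF + KPSS decision rule (illustrative sketch).
# ADF:  null hypothesis = unit root (non-stationary).
# KPSS: null hypothesis = stationarity.
stationarity_verdict <- function(adf_p, kpss_p, alpha = 0.05) {
  adf_rejects  <- adf_p < alpha    # evidence against a unit root
  kpss_rejects <- kpss_p < alpha   # evidence against stationarity
  if (adf_rejects && !kpss_rejects) {
    return("stationary")
  }
  if (!adf_rejects && kpss_rejects) {
    return("non-stationary: difference the series")
  }
  if (adf_rejects && kpss_rejects) {
    return("conflict: possible trend-stationarity or structural break")
  }
  "inconclusive: neither test rejects; consider more data"
}

stationarity_verdict(adf_p = 0.01, kpss_p = 0.30)  # "stationary"
stationarity_verdict(adf_p = 0.60, kpss_p = 0.01)  # "non-stationary: difference the series"
```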

Why this build matters.

Financial time series fail silently. A model with a good RMSE on training data can be quietly broken at the residual level (autocorrelation in the residuals means the model has not captured the structure) and you will not notice until live data starts behaving badly. The Ljung-Box plus Shapiro-Wilk check at the residual stage is the diagnostic that catches this before you ship the forecast.
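A minimal sketch of that residual-stage check using only base R's stats package, run here on a simulated AR(1) series rather than the project's data:

```r
# Residual diagnostics on a simulated series (stand-in for the pipeline's fit).
set.seed(42)
y   <- arima.sim(model = list(ar = 0.6), n = 500)  # synthetic AR(1)
fit <- arima(y, order = c(1, 0, 0))
res <- residuals(fit)

# Ljung-Box: null = residuals are white noise up to the chosen lag.
# fitdf discounts the one AR parameter estimated from the data.
lb <- Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 1)

# Shapiro-Wilk: null = residuals are normally distributed.
sw <- shapiro.test(res)

# A well-specified model should fail to reject both nulls.
c(ljung_box_p = lb$p.value, shapiro_p = sw$p.value)
```

A rejected Ljung-Box null is exactly the "quietly broken" case described above: leftover autocorrelation the point metrics never see.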

The MODEL_REGISTRY pattern is the architectural choice that makes the pipeline genuinely reusable. Adding a new candidate (Prophet, TBATS, neural state-space) is a single registration call. The evaluation, diagnostics, and reporting code do not change.
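A base-R sketch of the registry idea (the names and signatures here are illustrative assumptions, not the package's actual API; a fitting function takes a series and returns a fitted object):

```r
# Registry of candidate fitting functions, keyed by name.
MODEL_REGISTRY <- new.env()

register_model <- function(name, fit_fn) {
  assign(name, fit_fn, envir = MODEL_REGISTRY)
  invisible(name)
}

# Two built-in candidates: stats::arima and a last-value naive baseline.
register_model("arima", function(y) arima(y, order = c(1, 1, 0)))
register_model("naive", function(y) {
  structure(list(last = tail(as.numeric(y), 1)), class = "naive_fit")
})

# Downstream code never names models; it just iterates the registry,
# so adding a candidate later is one register_model() call.
fits <- lapply(ls(MODEL_REGISTRY), function(nm) {
  get(nm, envir = MODEL_REGISTRY)(AirPassengers)
})
```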

Architecture.

data-raw/A.csv
   |
   v
load_share_prices -> validate -> to_ts (daily freq, chronological)
                                         |
                                         v
                                   eda + STL decomposition
                                         |
                                         v
                       transform (log, difference) if needed
                                         |
                                         v
                       stationarity (ADF + KPSS combined verdict)
                                         |
                                         v
                       MODEL_REGISTRY = {arima, sarima, ets, naive}
                                         |
                                         v
                          fit each on train, forecast horizon
                                         |
                                         v
                       diagnostics (Ljung-Box, Shapiro-Wilk)
                                         |
                                         v
                       evaluate vs naive baseline (RMSE, MAE, MAPE)
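The final evaluation step can be sketched with the standard metric definitions, using toy numbers rather than the project's actual results; the naive forecast is the floor any serious candidate must beat:

```r
# Standard point-forecast accuracy metrics.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100

# Toy hold-out window and two sets of point forecasts.
actual <- c(100, 102, 101, 105)
naive  <- rep(99, 4)                 # last observed value carried forward
model  <- c(100.5, 101.2, 102.0, 104.1)

data.frame(
  model = c("naive", "candidate"),
  RMSE  = c(rmse(actual, naive), rmse(actual, model)),
  MAE   = c(mae(actual, naive),  mae(actual, model)),
  MAPE  = c(mape(actual, naive), mape(actual, model))
)
```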

Stack.

R · forecast · tseries · ggplot2 · lubridate · testthat · lintr · GitHub Actions (R-CMD-check CI matrix across multiple R versions)

View on GitHub