Problem.
Predicting heating and cooling loads from building geometry is a classic energy-engineering problem. Architects making early-stage design decisions (wall area, glazing area, roof area, orientation) need to know how those choices ripple into HVAC sizing and annual energy cost before the building is built.
The dataset contains 768 simulated building configurations described by eight design parameters, along with the resulting heating and cooling loads. The question this project asks: which design choices matter most, and can a small linear model defensibly predict load at the design phase?
My contribution.
End-to-end R analysis package, owned solo:
- Data loader with schema validation. Expected columns, expected types, NA handling enforced at the boundary.
- Descriptive statistics across all eight predictors and both response variables (heating load, cooling load).
- Correlation analysis to surface multicollinearity. Relative compactness, surface area, and wall area cluster strongly.
- Stepwise linear regression for variable selection on heating load and cooling load separately. Implemented with base stats::step and bidirectional AIC.
- Variance Inflation Factor checks via car::vif on the selected models, plus normality (Shapiro-Wilk) and heteroscedasticity (Breusch-Pagan style) tests on residuals.
- Standard diagnostic plots: residuals vs fitted, Q-Q normality, scale-location, leverage.
- testthat suite with a synthetic 200-row fixture, lintr config, R-CMD-check CI matrix across multiple R versions.
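The core of the workflow above can be sketched in a few lines of R. This is a minimal illustration on synthetic data, not the package's actual code: the column names, coefficients, and the validate_schema helper are hypothetical, and lmtest::bptest is my assumption for the Breusch-Pagan style test, since the package behind it is not named above.

```r
# Illustrative sketch on synthetic data; column names and coefficients are
# assumptions, not the project's real schema or results.
set.seed(42)
n <- 200
buildings <- data.frame(
  surface_area = runif(n, 500, 800),
  wall_area    = runif(n, 250, 400),
  glazing_area = sample(c(0, 0.1, 0.25, 0.4), n, replace = TRUE),
  orientation  = sample(2:5, n, replace = TRUE)
)
buildings$heating_load <- with(
  buildings,
  0.03 * surface_area + 0.05 * wall_area + 20 * glazing_area + rnorm(n, sd = 2)
)

# Hypothetical boundary check: required columns present, numeric, no NAs.
validate_schema <- function(df, expected_cols) {
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  for (col in expected_cols) {
    if (!is.numeric(df[[col]])) stop("Column is not numeric: ", col)
    if (anyNA(df[[col]])) stop("Column contains NA: ", col)
  }
  invisible(df)
}

expected_cols <- c("surface_area", "wall_area", "glazing_area",
                   "orientation", "heating_load")
validate_schema(buildings, expected_cols)

# Start from the full model and let step() drop or re-add terms.
# The default penalty k = 2 corresponds to AIC.
full_model <- lm(
  heating_load ~ surface_area + wall_area + glazing_area + orientation,
  data = buildings
)
selected <- step(full_model, direction = "both", trace = 0)

# Residual normality check with base R.
shapiro_result <- shapiro.test(residuals(selected))

# VIF and Breusch-Pagan need add-on packages; run them only if installed.
n_terms <- length(attr(terms(selected), "term.labels"))
if (requireNamespace("car", quietly = TRUE) && n_terms >= 2) {
  print(car::vif(selected))
}
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(selected))
}

print(formula(selected))
print(shapiro_result$p.value)
```

Calling plot(selected, which = c(1, 2, 3, 5)) on the fitted model draws the four diagnostic plots listed above: residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage.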
What the analysis surfaced.
Glazing area and roof type carry more signal than I expected at the start. Relative compactness drops out under stepwise selection because surface area carries most of the same information, and the model is easier to interpret with only one of the two.
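The redundancy between two near-duplicate predictors can be illustrated with a hand-rolled VIF, computed as 1 / (1 - R^2) from regressing each predictor on the others. This is a sketch on synthetic data in which relative compactness is constructed to be almost a linear function of surface area. That construction, the coefficients, and the vif_manual helper are all assumptions for illustration, not values taken from the project.

```r
# Synthetic illustration: relative_compactness is built to be nearly a
# linear function of surface_area. All numbers here are made up.
set.seed(1)
n <- 200
d <- data.frame(surface_area = runif(n, 500, 800),
                glazing_area = runif(n, 0, 0.4))
d$relative_compactness <- 1.4 - 0.001 * d$surface_area + rnorm(n, sd = 0.01)
d$heating_load <- 60 - 0.05 * d$surface_area + 20 * d$glazing_area +
  rnorm(n, sd = 2)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

predictors <- c("surface_area", "relative_compactness", "glazing_area")
vifs <- vif_manual(d, predictors)
print(round(vifs, 1))

# Dropping one of the two collinear predictors should barely change the fit.
both_fit <- lm(heating_load ~ surface_area + relative_compactness + glazing_area,
               data = d)
one_fit <- lm(heating_load ~ surface_area + glazing_area, data = d)
adj_r2 <- c(both = summary(both_fit)$adj.r.squared,
            surface_only = summary(one_fit)$adj.r.squared)
print(round(adj_r2, 3))
```

In this constructed case the two collinear predictors show very large VIFs while glazing area stays near 1, and adjusted R-squared is almost unchanged when relative compactness is removed.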
The cooling load model has noticeably wider residuals than the heating load model. That is consistent with the underlying physics: cooling demand depends on solar gain, which is driven by orientation and glazing in ways that interact non-linearly. A linear model captures the heating side well and gives a useful first pass on the cooling side, but cooling deserves a tree-based or interaction-rich model for production use.
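One way to probe for that kind of interaction is to compare an additive model against one with a glazing-by-orientation term. The sketch below uses synthetic data in which the effect of glazing is made to depend on orientation. The solar_gain values, the orientation labels, and every coefficient are invented for illustration and say nothing about the real dataset.

```r
# Synthetic illustration: the effect of glazing on cooling load is made
# to depend on orientation. The solar_gain slopes are invented numbers.
set.seed(7)
n <- 400
d <- data.frame(
  surface_area = runif(n, 500, 800),
  glazing_area = sample(c(0, 0.1, 0.25, 0.4), n, replace = TRUE),
  orientation  = factor(sample(c("north", "east", "south", "west"),
                               n, replace = TRUE))
)
solar_gain <- c(north = 5, east = 15, south = 30, west = 15)
slope <- unname(solar_gain[as.character(d$orientation)])
d$cooling_load <- 40 - 0.03 * d$surface_area + slope * d$glazing_area +
  rnorm(n, sd = 1)

# Additive model: one glazing slope shared by every orientation.
additive <- lm(cooling_load ~ surface_area + glazing_area + orientation,
               data = d)

# Interaction model: a separate glazing slope for each orientation.
interacting <- lm(cooling_load ~ surface_area + glazing_area * orientation,
                  data = d)

# Compare residual standard errors, then run a nested-model F-test.
residual_se <- c(additive = sigma(additive), interaction = sigma(interacting))
print(round(residual_se, 2))

comparison <- anova(additive, interacting)
print(comparison)
```

If the interaction model shows a clearly smaller residual standard error and the F-test rejects the additive model, that is a signal to move to an interaction-rich or tree-based model for that response.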
Why this project sits alongside the bigger work.
This is a small, focused statistical modelling exercise. It is not a production system. What it demonstrates is the diagnostic discipline: every model gets its residuals checked, every variable gets its correlation cluster examined, every claim about importance is defended against the alternative explanations.
The same discipline shows up in the bigger pipelines. Lineage and tests are not glamorous, but they are the difference between a model that ships and a model that breaks the first time real data hits it.
Stack.
R · stats::step (stepwise selection) · car (VIF and diagnostics) · ggplot2 · readxl · testthat · lintr · GitHub Actions (R-CMD-check)