Case study / 2026

Pharmaceutical Side Effect Classification

A production-grade Python package that classifies free-text adverse-event descriptions into a MedDRA-style taxonomy of ten clinical categories.

3 min read

[Thumbnail: stacked classification bars]

Problem.

Manual triage of pharmaceutical adverse-event reports is slow, inconsistent, and expensive at scale. The pharmacovigilance teams that classify these reports against standardised taxonomies are bottlenecked by volume, and classification quality varies from analyst to analyst.

The question this project asked: can a multi-class classifier match human triage quality closely enough to be worth deploying as a decision-support tool? And, since pharmacovigilance is a regulated domain, can the explainable baseline get close enough to the more complex model that you can ship the explainable one?

My contribution.

Solo, end to end:

  • Built the feature pipeline. TF-IDF vectorisation of medicine name, composition, indications, and side-effect text, plus manufacturer ordinal encoding (with safe handling of unseen manufacturers at inference) and review-percentage features, all behind a single sklearn ColumnTransformer + FeatureUnion (sketched after this list).
  • Trained two complementary models on 11,825 marketed medicines mapped to a MedDRA-style taxonomy of ten clinical categories. Random Forest as the primary model, Logistic Regression as the interpretable comparison.
  • Wrapped both in a serialisable sklearn Pipeline so joblib.dump produces one end-to-end artifact and inference does not require manual feature prep.
  • Externalised every hyperparameter to a YAML config loaded into frozen dataclasses. No magic numbers in code.
  • Wrote the CLI (train, predict) and the pytest suite (synthetic 60-row fixture so tests do not depend on the 4 MB xls).
  • Set up the GitHub Actions matrix CI on Python 3.10, 3.11, and 3.12 with ruff and black gating.
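
To make the feature pipeline concrete, here is a minimal sketch of the shape it takes. The column names, max_features values, and n_estimators setting are illustrative stand-ins rather than the values in the repo (those live in the YAML config), and the real pipeline also composes a FeatureUnion; this keeps only the ColumnTransformer core.

  from sklearn.compose import ColumnTransformer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import OrdinalEncoder, StandardScaler

  # One TF-IDF vectoriser per free-text column; manufacturer gets an ordinal
  # encoding that maps unseen values to -1 instead of raising at inference;
  # review percentages are scaled.
  features = ColumnTransformer(
      transformers=[
          ("name_tfidf", TfidfVectorizer(max_features=500), "name"),
          ("composition_tfidf", TfidfVectorizer(max_features=1000), "composition"),
          ("uses_tfidf", TfidfVectorizer(max_features=1000), "uses"),
          ("side_effects_tfidf", TfidfVectorizer(max_features=1000), "side_effects"),
          ("manufacturer_ordinal",
           OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
           ["manufacturer"]),
          ("review_scaling", StandardScaler(),
           ["excellent_review_pct", "average_review_pct", "poor_review_pct"]),
      ]
  )

  # One end-to-end object: raw dataframe in, category label out.
  model = Pipeline(steps=[
      ("features", features),
      ("classifier", RandomForestClassifier(n_estimators=300, random_state=42)),
  ])

Swapping in the interpretable baseline is a one-step change: the registry hands the Pipeline a LogisticRegression instead of a RandomForestClassifier for the classifier step.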

Architecture.

+------------------+    +-----------------------+    +------------------+
| Medicine_Details | -> | Category mapping      | -> | Stratified split |
| (.xls, 11,825)   |    | (10 MedDRA-aligned    |    | (80/20, seed 42) |
+------------------+    |  categories)          |    +------------------+
                        +-----------------------+              |
                                                               v
                                              +----------------------------+
                                              | sklearn Pipeline           |
                                              |   ColumnTransformer        |
                                              |     TF-IDF (name, comp,    |
                                              |             uses, side fx) |
                                              |     OrdinalEncoder (mfr)   |
                                              |     StandardScaler (rev %) |
                                              |   Classifier               |
                                              |     RF or LR (registry)    |
                                              +----------------------------+
                                                               |
                                                               v
                                              joblib artifact + metrics + plots
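
The last box in the diagram is literally one object. A sketch of the train-then-serialise flow, assuming the model pipeline above plus a features dataframe X and a label series y produced by the category mapping; the artifact path is hypothetical:

  import joblib
  from sklearn.model_selection import train_test_split

  # Stratified 80/20 split with the fixed seed from the config.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42
  )

  # Fit the end-to-end pipeline and persist a single artifact.
  model.fit(X_train, y_train)
  joblib.dump(model, "artifacts/side_effect_classifier.joblib")

  # Inference elsewhere: load and predict on raw columns, no manual feature prep.
  loaded = joblib.load("artifacts/side_effect_classifier.joblib")
  predictions = loaded.predict(X_test)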

Results.

The training pipeline emits accuracy, macro F1, weighted F1, per-class precision and recall, and a confusion matrix figure. Both models perform well on the held-out stratified 20% test set; the interpretable Logistic Regression baseline lands close enough to the Random Forest that the comparison itself becomes the finding.
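
A sketch of that evaluation step, using y_test and predictions from the split above; the exact report layout in the repo may differ:

  from sklearn.metrics import (
      ConfusionMatrixDisplay,
      accuracy_score,
      classification_report,
      f1_score,
  )

  # Headline metrics: accuracy plus macro and weighted F1.
  metrics = {
      "accuracy": accuracy_score(y_test, predictions),
      "macro_f1": f1_score(y_test, predictions, average="macro"),
      "weighted_f1": f1_score(y_test, predictions, average="weighted"),
  }
  print(metrics)

  # Per-class precision and recall for all ten categories.
  print(classification_report(y_test, predictions))

  # Confusion matrix figure, saved next to the joblib artifact.
  display = ConfusionMatrixDisplay.from_predictions(
      y_test, predictions, xticks_rotation=45
  )
  display.figure_.savefig("artifacts/confusion_matrix.png", bbox_inches="tight")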

Specific metric values are not pinned in this write-up because the repo is configured to regenerate them deterministically via make train. Anyone cloning the repo runs one command and gets the current numbers from the current code, not a stale screenshot.
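
That determinism comes from the config, not the code: every hyperparameter, including the seed, lives in the YAML file and is loaded into a frozen dataclass before anything else runs. A sketch of that pattern, with hypothetical field names and a hypothetical top-level "model" key:

  from dataclasses import dataclass

  import yaml  # PyYAML

  @dataclass(frozen=True)
  class ModelConfig:
      # Illustrative fields only; the real config carries the full hyperparameter set.
      model_type: str        # "random_forest" or "logistic_regression"
      n_estimators: int
      test_size: float
      random_state: int

  def load_config(path: str) -> ModelConfig:
      with open(path) as f:
          raw = yaml.safe_load(f)
      return ModelConfig(**raw["model"])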

Lessons.

The most useful finding was not the headline accuracy number. It was that the much simpler model came within a couple of percentage points of the more complex one. That tells you something specific about the data: the signal in the engineered features is strong enough that you do not need the more complex model to extract it, and a shallow, interpretable model is genuinely competitive. In a regulated domain like pharmacovigilance, where every model decision needs to be explainable, that finding changes the deployment calculus. You ship the explainable model unless you absolutely need the marginal extra points.

The second lesson was about scope discipline. A first version of this had a Streamlit demo, a Hugging Face transformer baseline, and a deployment workflow. None of it was finished. I cut all three and shipped the production-grade core: feature pipeline, two models, evaluation, inference CLI, tests, CI matrix. Less surface area, more depth.

The third lesson was about leaks. The original prototype encoded manufacturer with a one-shot LabelEncoder outside the pipeline, which would crash at inference on any unseen manufacturer. Moving the encoding inside the pipeline as OrdinalEncoder(handle_unknown="use_encoded_value") is the kind of fix that does not show up in the metrics table but is the difference between a model that ships and a model that breaks the first time it sees production data.
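
A minimal before-and-after of that fix, with a hypothetical train_df; note that sklearn requires unknown_value to be set alongside handle_unknown="use_encoded_value":

  from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

  # Before: fitted once, outside the pipeline. Any manufacturer missing from
  # the training data raises at transform time, i.e. at inference.
  label_enc = LabelEncoder().fit(train_df["manufacturer"])
  # label_enc.transform(["Unseen Pharma Ltd"])  # ValueError: unseen label

  # After: lives inside the ColumnTransformer, so it is fitted and serialised
  # with the rest of the pipeline, and unseen manufacturers map to -1.
  ordinal_enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

LabelEncoder is also documented for target labels rather than features, which is one more reason the encoding belonged inside the transformer.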

Stack.

Python · scikit-learn · pandas · numpy · TF-IDF · Random Forest · Logistic Regression · joblib · pytest · ruff · black · GitHub Actions

View on GitHub