Before we move to the next chapter—describing what the data scientist did after being placed on leave without pay—we need another short excursus.

This one is not about psychology.

It is about data science itself.

Understanding what happened next requires understanding how difficult it is to extract truth from official reports, even when those reports are technically “public.”

Open Data by Default — But Visible to Whom?

Canada has an Open Government policy.

In principle, data produced by the government should be public by default.

That sounds transparent.

But here is the critical question:

What is the value of data if no one can meaningfully see it?

Publishing data does not equal accessibility.

Accessibility does not equal interpretability.

And interpretability does not equal understanding.

Most datasets are not useful in raw form. Especially large datasets.

CSV vs. Reality

In the best-case scenario, data is published in CSV format (comma-separated values). That means it can be opened in Excel or imported into statistical software easily.

Even then, problems remain:

Large datasets cannot be fully opened in Excel, which caps a worksheet at 1,048,576 rows.

Weekly comparisons across multiple years require reshaping.

Grouped classifications must be reconstructed.

Time-series alignment must be coded.

This already requires programming.
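To give a sense of what that programming looks like, here is a minimal Python sketch that streams a CSV row by row (so file size is not a constraint) and lines up weekly counts across years by ISO week number. The column names week_ending and deaths, and the sample values, are invented for illustration; they are not the layout of any Statistics Canada file, and this is not the team's actual code.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def weekly_counts_by_year(csv_file):
    """Stream rows and return {iso_week: {iso_year: deaths}}.

    Assumes hypothetical columns 'week_ending' (YYYY-MM-DD) and 'deaths'.
    """
    table = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        iso_year, iso_week, _ = date.fromisoformat(row["week_ending"]).isocalendar()
        by_year = table[iso_week]
        by_year[iso_year] = by_year.get(iso_year, 0) + int(row["deaths"])
    return table


# Invented sample data standing in for a much larger file.
sample = io.StringIO(
    "week_ending,deaths\n"
    "2020-01-04,10\n"
    "2020-01-11,12\n"
    "2021-01-09,11\n"
    "2021-01-16,15\n"
)

aligned = weekly_counts_by_year(sample)
for week in sorted(aligned):
    # Each line shows one week of the year with its count in every year.
    print(week, aligned[week])
```

Keying on the ISO week rather than the calendar date is one possible alignment choice; it makes week 1 of one year directly comparable with week 1 of another.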

When Dr. Albert and his colleagues began building weekly visualizations of vital statistics, they were working with official Statistics Canada data.

They built applications to show:

deaths by cause

deaths by age

week-by-week comparisons

year-over-year alignment

This was serious work. It required scripting, automation, and statistical validation.

And this is exactly why Dr. Albert had been recognized inside the government—he could build these tools.

The R99 Signal

When they visualized causes of death, one category drew attention:

R99 — “Ill-defined and unknown cause of mortality.”

It is not unusual for some deaths to fall temporarily into this category before classification.

But over time, something appeared: the weekly counts in R99 seemed to move in parallel with vaccination rollout phases.

This did not prove causation.

But it raised a legitimate scientific question:

Is this a reporting delay artifact?

A classification shift?

Or something else?

When two time series move together, a data scientist does not jump to a conclusion.

He investigates.
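One common first step in that kind of investigation can be sketched in Python. Two series that both trend upward will correlate strongly for that reason alone, so a standard check is to correlate the week-to-week changes as well as the raw levels. The numbers below are invented to make the point; they are not R99 counts or vaccination figures, and this is not necessarily the check the team ran.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def diff(series):
    """Week-to-week changes: series[t+1] - series[t]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Invented illustrative series: both rise over time.
series_a = [1, 3, 4, 6, 7, 9]
series_b = [2, 3, 5, 6, 8, 9]

r_levels = pearson(series_a, series_b)                # driven by the shared trend
r_changes = pearson(diff(series_a), diff(series_b))   # trend removed

print(round(r_levels, 3), round(r_changes, 3))
```

In this made-up example the levels correlate almost perfectly while the weekly changes move in opposite directions, which is exactly why a parallel-looking chart is a question and not an answer.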

So the next logical question emerged:

If vaccination is protective, can we observe strong protection in weekly COVID mortality comparisons?

The Public Health Agency Dataset Problem

This is where the technical challenge intensified.

The relevant data was published by the Public Health Agency of Canada (PHAC).

But unlike Statistics Canada’s CSV files, PHAC’s reports were published primarily as:

HTML tables

PDF documents

Not structured CSV.

This dramatically complicates analysis.

PDF is not data.

PDF is presentation.

Extracting structured numbers from PDF requires:

parsing algorithms

text recognition

table reconstruction

classification mapping

It is not trivial. It is not beginner work.
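The HTML half of the problem can be sketched with nothing but the Python standard library. The parser below collects the text of every table cell, row by row, and a small helper turns formatted counts such as "1,234" into integers. The table content is invented and does not reproduce any PHAC page. PDF extraction is harder still and typically needs a third-party library, so it is not shown here.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_count(text):
    """Turn a formatted count such as '1,234' or '1 234' into an int."""
    return int(text.replace(",", "").replace("\u00a0", "").replace(" ", ""))


# Invented table; real pages have nested markup, footnotes, and merged cells.
html = """
<table>
  <tr><th>Status</th><th>Deaths</th></tr>
  <tr><td>Unvaccinated</td><td>1,234</td></tr>
  <tr><td>Fully vaccinated</td><td>56</td></tr>
</table>
"""

extractor = TableExtractor()
extractor.feed(html)
counts = {label: parse_count(value) for label, value in extractor.rows[1:]}
print(counts)
```

Even this toy version shows where the effort goes: the numbers arrive as display strings, and every formatting quirk has to be undone in code before any analysis can start.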

And the reporting format introduced another major complication:

The data was often presented as cumulative totals since vaccination began.

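From a statistical standpoint, cumulative totals on their own are poorly suited to dynamic evaluation: each new figure is dominated by everything that came before it, so recent changes barely register.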

What matters is:

Week-by-week rates

Age-adjusted rates

Vaccination status categories

Changes over time

If you only report cumulative totals, you obscure performance variation.

For a system evaluator, that is equivalent to evaluating a vendor’s product using lifetime cumulative output rather than current weekly performance metrics.

It hides degradation.

It hides variation.

It hides change.
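Recovering weekly figures from cumulative ones is, in principle, simple differencing, provided consecutive reports are available. A minimal Python sketch, using invented numbers, also flags negative increments, which in a cumulative series usually point to a retroactive correction rather than a real event:

```python
def weekly_from_cumulative(cumulative):
    """Week-over-week increments of a cumulative series.

    The first report has no predecessor, so the result is one item shorter.
    """
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def negative_weeks(weekly):
    """Indexes of increments below zero (likely retroactive corrections)."""
    return [i for i, value in enumerate(weekly) if value < 0]


# Invented cumulative totals from five consecutive weekly reports.
cumulative_totals = [100, 130, 170, 165, 200]

weekly = weekly_from_cumulative(cumulative_totals)
print(weekly)                  # [30, 40, -5, 35]
print(negative_weeks(weekly))  # [2]
```

The arithmetic is trivial. The hard part is that every weekly report has to be captured and parsed first, because a single cumulative snapshot contains no weekly information at all.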

Classification Complexity

The PHAC data also included multiple vaccination categories:

Unvaccinated

Partially vaccinated

Fully vaccinated

Fully vaccinated with additional doses

Extracting all these numbers week by week required algorithmic scraping and restructuring.
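Part of that restructuring is making sure the same category always gets the same name, whatever spacing or capitalization a given report used. A minimal Python sketch follows; the canonical keys on the right-hand side are hypothetical names chosen for this example, not identifiers from PHAC or from the team's libraries.

```python
import re

# Labels as listed above, mapped to hypothetical canonical keys.
CANONICAL_STATUS = {
    "unvaccinated": "unvaccinated",
    "partially vaccinated": "partially_vaccinated",
    "fully vaccinated": "fully_vaccinated",
    "fully vaccinated with additional doses": "fully_vaccinated_additional",
}


def normalize_status(label):
    """Map a raw table label to its canonical key, or fail loudly."""
    key = re.sub(r"\s+", " ", label).strip().lower()
    if key not in CANONICAL_STATUS:
        raise ValueError(f"unrecognized vaccination status: {label!r}")
    return CANONICAL_STATUS[key]


print(normalize_status("  Fully   Vaccinated "))
print(normalize_status("Fully vaccinated with additional doses"))
```

Raising an error on an unknown label, instead of guessing, is a deliberate choice in this sketch: if a report renames or splits a category, the pipeline should stop and ask a human.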

This was not a political exercise.

From a data science perspective, it was fascinating.

If you can master automated extraction from PDF and HTML, you become a very strong data scientist.

So between October and December 2021, this became a technical challenge inside the community of practice.

How do we automate extraction?

How do we standardize weekly comparison?

How do we build reproducible pipelines?

For almost two months, tools and libraries were built.

Ontario: A Contrast in Design

Then came a contrast.

Ontario’s open data portal was, in many ways, exemplary.

It provided:

structured datasets

machine-readable formats

API access (Application Programming Interface)

direct programmatic retrieval

With API access, large quantities of data can be queried directly from code.
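What "queried directly from code" means can be sketched in Python. The example assumes the portal exposes a CKAN-style datastore_search endpoint that returns JSON with a "success" flag and a "result.records" list; the base URL and the resource ID are placeholders to be checked against the portal's own documentation, and none of this is taken from the team's libraries.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed CKAN-style endpoint; verify against the portal's documentation.
BASE_URL = "https://data.ontario.ca/api/3/action/datastore_search"


def build_query_url(resource_id, limit=100, offset=0, base_url=BASE_URL):
    """Build the request URL for one page of records."""
    params = {"resource_id": resource_id, "limit": limit, "offset": offset}
    return base_url + "?" + urlencode(params)


def records_from_response(body):
    """Extract the record list from a CKAN-style JSON response body."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError("API reported failure")
    return payload["result"]["records"]


def fetch_records(resource_id, limit=100, offset=0):
    """Download and parse one page of records (performs a network call)."""
    with urlopen(build_query_url(resource_id, limit, offset)) as response:
        return records_from_response(response.read().decode("utf-8"))


# Offline demonstration with a hypothetical resource ID and a canned response.
url = build_query_url("RESOURCE-ID-PLACEHOLDER", limit=5)
canned = '{"success": true, "result": {"records": [{"week": 1, "value": 3}]}}'
records = records_from_response(canned)
print(url)
print(records)
```

Compared with scraping, the difference is that the structure arrives already intact: field names, types, and pagination are part of the response, so no table reconstruction is needed.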

This is how open data should function.

Libraries were built for that too.

Code was uploaded to GitHub.

Documentation began forming.

At the same time, the team began writing a book: R4GC, named after their community of practice within the Government of Canada.

From a purely technical perspective, this was an exciting period.

No one initially expected to discover anything alarming.

It was curiosity-driven.

Skill-building.

Community training.

When Curiosity Meets Correlation

By Christmas 2021, extraction pipelines were functional.

Over the holiday period, Dr. Albert invited academic colleagues from the University of Ottawa to help draft a formal analysis for peer review.

The goal was not activism.

The goal was methodological clarity:

Is there a correlation between vaccination rollout phases and changes in certain mortality categories?

Is the R99 signal explainable?

Are weekly COVID mortality rates behaving as expected across vaccination classes?

Correlation is not causation.

But correlation is not nothing.

It is a signal that demands investigation.

Why This Chapter Matters

For a non-data scientist, this may sound abstract.

But here is the core point:

Publishing data is not the same as making it analyzable.

And formatting choices matter.

CSV vs PDF

Weekly vs cumulative

Time-aligned vs aggregated

Accessible API vs manual extraction

Each of these choices influences who can realistically verify the narrative.

If meaningful visualization requires advanced programming and months of effort, then only a small technical minority will ever see the underlying structure.

And if interpretation depends on extraction pipelines that few possess, then public discourse rests on trust—not verification.

That is where the tension began.

Not in ideology.

Not in politics.

In formatting.

And that technical tension would soon become existential.

What Comes Next

In the next chapter, we return to the human story:

What happens when a data scientist who has built the tools, seen the signals, and asked the questions is placed on leave without pay?

How does one survive when professional integrity and institutional alignment diverge?

Disclaimer

This article reflects the author’s technical and personal observations. It does not assert misconduct but highlights methodological challenges in data extraction and visualization.

Acknowledgment

This article was written with assistance from ChatGPT using the prompt:

“Write an autobiographical chapter explaining the technical challenges of extracting and visualizing official health data (Statistics Canada, PHAC, Ontario Open Data), including CSV vs PDF, cumulative vs weekly reporting, API access, R99 category correlation, and the building of R4GC tools, in a scientific yet accessible tone.”