3rd Party Data Integration at Scale – A Use Case with StatCan 2021 Population Data

  • Data
  • Data Science
  • dbt
  • geospatial
  • Open data
  • PostGIS
  • By Zachary Deziel, February 10, 2022

    Introduction

    Getting 3rd party data to fit the requirements of a specific context is no small feat. There is a belief that an untapped and ever-growing pool of data is just waiting for us to use. Leveraging external data requires more than just understanding the options available and selecting which one is relevant. 

    Data is only as valuable as the problem it helps solve. Therefore, by definition, the transformations and uses of external data will change depending on the specific verticals and issues we want to solve.

    At Anagraph, we have developed principles and a flexible approach for working with 3rd party data that combines best-of-breed open source technologies. These tools are attractive because of their performance and the disciplined structure they enable. Countless variables come into play when building data pipelines, and imposing limits on those variables is vital to efficiently building, scaling, and maintaining pipelines over time.

    Statistics Canada Census Data

    Statistics Canada census data provides a unique portrait of Canada’s population every five years. The census covers a broad range of topics such as “the population, age, sex at birth and gender, type of dwelling, families, households and marital status, Canadian military experience, income, language, Indigenous peoples, housing, immigration, place of birth and citizenship, ethnocultural and religious diversity, mobility and migration, education, labour, and commuting” (StatCan, 2022). So far, the 2021 census release, available since Wednesday, February 9th, covers the administrative boundaries and some population attribute data. The official schedule lists the upcoming releases.

    StatCan and other researchers have shared some initial results in the first days since the release. For example, StatCan published an article on how the principal urban areas have continued expanding since 2016.

    Although the census data is unique and allows for incredible insights, it is not as easy to use as one would hope. A few of the issues keeping non-technical users from interpreting the datasets are the following:

    • Little non-tabular and easily digestible information is available.
    • Row identifiers follow a complex structure and type: some identifiers are numerical, others are textual.
    • Multiple geographic units exist, reflecting the needs of different government operations.
    • Every administrative geography is stored in a separate table.
    • All the attribute information is stored in a single table, regardless of geographic unit.
    • Each attribute is stored as an individual row instead of a single row per identifier containing all the attributes.

    We can overcome any of these issues, but they raise the barrier to entry considerably for non-technical users. Even technical users would enjoy having the data in a different format.

    We want to make unique datasets, like the StatCan census data, more accessible and fun to use. So we created Geometric to share the early results of our experiments and data transformations with everyone. The Geometric Data Viewer is live here.

    Geometric Data Viewer.

    The following sections overview the principles and techniques we use to overcome issues integrating 3rd party data.

    The Data Pipeline Principles

    There are even more ways to transform datasets than there are datasets. We have identified some guiding principles for designing data pipelines. The principles are not hard-coded rules, but they tackle essential dimensions. We divided them into the four categories Marty Cagan uses to categorize risk in Inspired.

    Technical Principles

    Data pipelines should be reproducible. Every step of the data pipeline should be reproducible, from the very first step of downloading the data to generating specific problem insights. Errors can stem just as much from the environment as from the code executed on it. Therefore, the first step in building a data pipeline is ensuring we have a stable and reproducible environment on which it can run. Ideally, our development and production environments should match as closely as possible to avoid mismatches between them.

    The data pipeline should have documentation that lives as close as possible to the application code to avoid drift. The best tools provide dynamically generated documentation and minimize the risk of outdated documentation. 

    Data pipelines should be testable. Every step of the data pipeline should have the possibility of being automatically tested with a minimal amount of overhead. Manual testing can not scale or catch every breaking change. 

    Data pipelines should be monitored. Any data pipeline, mainly but not limited to pipelines in production, should be monitored to detect any drift or anomaly. If data is such a key ingredient to our information systems, it is in our best interest to ensure the correctness of our data.

    Business Principles

    Data pipelines should offer rich representations of the process at different levels of detail. In addition, we need stakeholders to support our data pipelines, and abstractions are a necessary tool to share the bigger picture behind the processes.

    Data pipelines should be designed with privacy and user rights in mind from the start. Beyond the damage caused by not respecting our users, we developers should consider the moral responsibility of our micro-decisions.

    Usability Principles

    Data pipelines should convert datasets into an understandable format. Far too often, external constraints define the characteristics of our generated datasets, and these constraints make the datasets less understandable to our users. We should design our pipelines to alleviate any conditions limiting the comprehension of our datasets. 

    Data pipelines should be fully integrated. For the data to be used, it must be directly integrated within the users’ processes. Therefore, our pipelines should transform datasets to make them plug-and-play in the user’s ecosystem.

    Value Principles

    Data pipelines should create actionable datasets. Again, for data to have value, it must be actionable and help solve a problem. Therefore, data pipelines should have a specific objective. 

    Data pipelines should make the quality and uncertainty of our datasets explicit. Given that the ability to derive value from a dataset is not solely a function of it having the correct information, we should establish the accuracy of our datasets and any other factor that could change a user’s interpretation of the dataset.

    Data pipelines should adapt to changing or unknown requirements. Data requirements are not static but current best guesses of the needs for a problem. Our requirements should grow from our experimentation with the data. The corollary is that data pipelines should be flexible and easy to adapt to new conditions.

    The (Current) Stack

    Our current stack changes continually to fit the principles presented above. However, value principles outweigh most of the others, meaning our technological choices evolve mainly based on answering specific value hypotheses. Sure, we do market research and keep our eyes open to new trends. But when considering a new tool, we focus on its ability to help us discover what someone in a specific context finds useful.

    An overview of the stack can help transition from the principles to the more technical details. For example, we have gone from working with raw SQL scripts (still version controlled) to alembic migrations, and finally, more recently, to our current stack orchestrated with simple shell scripts and dbt. For every raw data source, we create the following pipeline:

    Diagram showing the flow of the raw dataset to the final output dataset of the data pipeline. Data is first downloaded and imported to the database. Afterwards, the data is transformed with dbt models and exported.
    Data pipeline structure typically used at Anagraph
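
    In code, the whole pipeline is typically tied together by a thin shell wrapper along these lines (script names are illustrative):

    #!/bin/bash
    set -euo pipefail

    ./download.sh   # fetch the raw files
    ./import.sh     # load them into the local Postgres/PostGIS instance
    dbt run         # build the models
    dbt test        # run the schema and data tests
    ./export.sh     # ship the final tables to the warehouse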

    The process is relatively simple. We ensure that any developer can reproduce the output dataset from their machine. The download script can vary from simple wget commands to cloud provider-specific copy commands and custom scripting for cases like scraping.

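    For the census files, such a download script can be as simple as the following sketch (URL and filename are placeholders):

    #!/bin/bash
    set -euo pipefail

    # fetch and unzip the 2021 dissemination area boundary file
    wget -N "https://<statcan-download-url>/<boundary-file>.zip" -P data/raw
    unzip -o "data/raw/<boundary-file>.zip" -d data/raw
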
    The import script usually sends the downloaded data into a local Postgres instance. We always have a local docker-compose file that a developer and our CI can use. The compose file’s primary and usually only service is our database service that uses a Postgres image with some of our favorite extensions (PostGIS, pgrouting, postgis_raster, pgh3). 

    An example compose file could look something like this:

    version: "3.7"
    services:
      db:
        image: postgis-image
        restart: always
        environment:
          POSTGRES_DB: postgres
          POSTGRES_PASSWORD: <password>
        ports:
          - "host_port:5432"
        volumes:
          - host_path:/var/lib/postgresql/data

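    The import script itself is often a single ogr2ogr call against that local database. A minimal sketch, with file and connection details assumed, could look like this:

    #!/bin/bash
    set -euo pipefail

    # load the dissemination area boundaries into Postgres; the OGR PostgreSQL driver
    # creates the ogc_fid and wkb_geometry columns the dbt models rely on later
    ogr2ogr -f PostgreSQL \
      PG:"host=localhost port=5432 dbname=postgres user=postgres" \
      data/raw/<boundary-file>.shp \
      -nln aire_diffusion_2021
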
    Sometimes, we also include a processing service that bundles FOSS4G tools like OGR and GDAL to help developers run the pipeline locally. Most of our import scripts make use of command-line utilities like ogr2ogr or a handful of the gdal_X equivalents. The two-stage build Dockerfile below provides an example of how we package all the dependencies that can be hard to replicate locally:

    FROM osgeo/gdal:ubuntu-full-3.2.2 AS builder-image
    
    ARG DEBIAN_FRONTEND=noninteractive
    
    RUN apt update && apt-get install --no-install-recommends -y \
        python3.9 python3.9-dev python3.9-venv python3-pip python3-wheel build-essential && \
        apt-get clean && rm -rf /var/lib/apt/lists/*
    
    RUN python3.9 -m venv /home/myuser/venv
    ENV PATH="/home/myuser/venv/bin:$PATH"
    
    COPY requirements.txt .
    RUN pip3 install --no-cache-dir wheel
    RUN pip3 install --no-cache-dir -r requirements.txt
    
    WORKDIR /home/myuser/app
    COPY src src
    COPY setup.py .
    RUN python setup.py develop
    
    
    FROM osgeo/gdal:ubuntu-full-3.2.2 AS runner-image
    
    RUN apt update && apt-get install --no-install-recommends -y \
        python3.9 python3.9-dev python3.9-venv python3-pip python3-wheel build-essential && \
        apt-get clean && rm -rf /var/lib/apt/lists/*
    
    ENV PYTHONUNBUFFERED=1
    ENV VIRTUAL_ENV=/home/myuser/venv
    ENV PATH="/home/myuser/venv/bin:$PATH"
    
    RUN useradd --create-home myuser
    COPY --from=builder-image /home/myuser/venv /home/myuser/venv
    COPY --from=builder-image /home/myuser/app  /home/myuser/app
    RUN chmod +x $VIRTUAL_ENV/bin/activate
    
    USER myuser
    WORKDIR /home/myuser/app
    
    COPY tests tests/
    COPY settings.toml .
    
    CMD $VIRTUAL_ENV/bin/activate && pytest -p no:cacheprovider

    We usually keep the import scripts strictly focused on importing, not processing, our data. We do so because dbt enables us to implement pipelines in SQL quickly, test every step of the pipeline, and provide dynamic documentation. Combined with the power of PostGIS and our other extensions, dbt helps us develop clear, tested, and documented geospatial data products.

    Without fully introducing dbt, we should present some of its key distinctive features; you can learn more by visiting its well-organized official documentation. dbt is built around creating data pipeline steps directly from SQL queries. After configuring your dbt project, creating a file models/my_model.sql will generate a new table called my_model in the database when the project is executed. The process becomes interesting when we decide to chain different queries together.

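    As a trivial illustration (model and column names are hypothetical), a file models/my_model.sql can select from another model through dbt’s ref function, and dbt infers the execution order from that reference:

    -- models/my_model.sql
    select *
    from {{ ref('my_upstream_model') }}
    where population is not null
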
    For example, with the 2021 census data, we created separate models for the administrative boundaries and the attribute data. We can use dbt to pivot the ill-formatted attribute data, transform the geometries of the administrative boundaries to match our web application’s requirements and, finally, join the attributes to our administrative boundary layer.

    Here is the model for the dissemination area boundary with some additional dbt configurations directly within the file:

    {{ config(
        materialized = 'table',
        indexes=[
          {'columns': ['id', 'adidu']},
          {'columns': ['geom'], 'type': 'gist'},
        ],
        tags="aire_diffusion_2021"
    )}}
    
    select ogc_fid id, dauid adidu, dguid, landarea, pruid, st_transform(wkb_geometry, 3857) geom
    from aire_diffusion_2021

    We used what seems to us the most efficient way to pivot a table in SQL: jsonb. Here is the model for the external attribute data that we want to join:

    {{ config(
        materialized = 'table',
        indexes=[
          {'columns': ['adidu']},
        ]
    )}}
    
    with adidu_attr as (
        select "nom_gÉo" adidu, "id_caractÉristique" id_caracteristic, c1_chiffre_total val from population_ad_2021
    ),
         jsonb_magic as (
             select adidu, jsonb_object_agg(id_caracteristic, val) jsonb_attr
             from adidu_attr
             group by adidu
         )
    select adidu,
           jsonb_attr ->> '1' population_2021,
           jsonb_attr ->> '2' population_2016,
           jsonb_attr ->> '3' population_variation_2021_2016,
           jsonb_attr ->> '4' total_logements_prives,
           jsonb_attr ->> '5' logements_prives_occupes,
           jsonb_attr ->> '6' densite_population_2021,
           jsonb_attr ->> '7' superficie_terre
    from jsonb_magic

    And finally, the last model joins them together:

    {{ config(
        materialized = 'table',
        indexes=[
          {'columns': ['id']},
          {'columns': ['geom'], 'type': 'gist'},
        ]
    )}}
    
    select id, ad.adidu da_adidu, ad.geom, ad.landarea, pruid, pop_attr.*
    from {{ref('aire_diffusion_2021_3857')}} ad
    left join {{ref('population_aire_diffusion_raw')}} pop_attr on ad.adidu=pop_attr.adidu

    We can generate any and all of our tables with dbt run. The execution order and dependencies are all taken care of by dbt’s ref function.

    Still don’t see the benefits? Beyond imagining the chaos of handling hundreds, if not thousands, of similar tasks by hand, let’s look at the dynamic documentation generated from the reference syntax. We can generate the documentation with dbt docs generate and dbt docs serve.

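    In practice, this boils down to a handful of commands (assuming a recent dbt CLI; the tag selector matches the tag set in the model configuration above):

    dbt run                                     # build every model
    dbt run --select tag:aire_diffusion_2021    # or only the tagged models
    dbt docs generate                           # compile the documentation site
    dbt docs serve                              # browse it locally
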
    dbt lineage graph model for the population of the StatCan 2021 census data

    Moreover, we can populate the table and column descriptions by filling in a simple YAML file to obtain the output below:

    version: 2
    
    models:
      - name: aire_diffusion_population_2021_3857
        description: Integrating Census Profile, 2021
        docs:
          show: true
        columns:
          - name: 'id'
            description: Refers to the ID of each dissemination area
            tests:
              - not_null
          - name: 'adidu'
            description: Refers to the ID (ADIDU) of each dissemination area
            tests:
              - not_null
          - name: 'da_adidu'
            description: Refers to the ID (ADIDU) of each dissemination area
            tests:
              - not_null
          - name: 'population_2016'
            description: Refers to Canadian Demographic Data for 2016
          - name: 'population_2021'
            description: Refers to Canadian Demographic Data for 2021
          - name: 'population_variation_2021_2016'
            description: Refers to The rate of change in population dispersion during the years 2016 to 2021 (percentage)
            tests:
              - not_null
          - name: 'total_logements_prives'
            description: Refers to a separate set of living quarters with a private entrance
          - name: 'logements_prives_occupes'
            description: Refers to a private dwelling in which a person or a group of persons is permanently residing
          - name: 'densite_population_2021'
            description: Refers to density of population living per square kilometer
          - name: 'superficie_terre'
            description: Refers to the number of square kilometres of land in a given geographic area
            tests:
              - not_null
          - name: 'landarea'
            description: Refers to the number of square kilometres of land in a given geographic area
            tests:
              - not_null
          - name: 'geom'
            description: Refers to EPSG:3857 (a Spherical Mercator projection coordinate system)
            tests:
          - name: 'pruid'
    GIF showing the dbt documentation generated with the references to construct the graph lineage and the YAML schema file.

    Potentially the greatest value dbt has to offer is the simplicity of writing tests. Two types of tests exist: (1) schema tests and (2) data tests. Schema tests verify that constraints are well defined for the table. Data tests, on the other hand, run a query against the table: if the query returns zero rows, the test passes; if not, it fails. There is a learning curve in redesigning queries around this 0-row constraint, but we usually just need to add a HAVING clause to our existing debugging SQL for it to become a valid dbt data test. The schema tests are already visible directly in the last YAML file.

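    For instance, a data test checking that each dissemination area appears only once in the final table could be the following sketch (file name is illustrative); dbt fails the test if the query returns any rows:

    -- tests/assert_unique_adidu.sql
    select adidu, count(*)
    from {{ ref('aire_diffusion_population_2021_3857') }}
    group by adidu
    having count(*) > 1
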
    There are other exciting features, like macros, which can automatically create indexes, define custom tests, and automate almost any SQL process you can imagine. For example, we have a growing number of macros that test PostGIS geometries, such as checking for valid geometries:

    
    {% test is_valid(model, column_name) %}
    
    SELECT *
    FROM {{ model }}
    WHERE (NOT st_isvalid( {{ column_name }}))
    
    {% endtest %}

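    Once defined, the generic test can be attached to any geometry column in a schema file, for example:

    columns:
      - name: 'geom'
        tests:
          - is_valid
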
    The final step of our pipeline is writing a shell script that exports our final dbt model output to our data warehouse used for our front-end applications. This script usually includes exporting a backup to a cloud provider. Again, we typically use the OGR and GDAL command-line utilities or simple psql commands here.
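
    A sketch of such an export script, with connection strings, bucket, and cloud CLI left as assumptions:

    #!/bin/bash
    set -euo pipefail

    # push the final table to the warehouse database
    ogr2ogr -f PostgreSQL PG:"host=<warehouse-host> dbname=<warehouse-db> user=<user>" \
      PG:"host=localhost dbname=postgres user=postgres" aire_diffusion_population_2021_3857

    # keep a GeoPackage backup and copy it to object storage (the copy command varies by provider)
    ogr2ogr -f GPKG aire_diffusion_population_2021_3857.gpkg \
      PG:"host=localhost dbname=postgres user=postgres" aire_diffusion_population_2021_3857
    aws s3 cp aire_diffusion_population_2021_3857.gpkg "s3://<backup-bucket>/census-2021/"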

    The Big Data Version

    We rarely encounter datasets the previous stack cannot scale to, but it has happened. The geospatial big data ecosystem is still immature, maybe five years behind its non-spatial equivalents. We are still experimenting with technologies that help us implement the same principles presented above.

    Some noteworthy combinations seem to be emerging. Combining dbt with Spark and Sedona was tremendously challenging, but we can now run simple SQL commands directly from any database query editor, like DBeaver.

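    For illustration, a simple query of that kind through Sedona’s spatial SQL functions might look like this (table and column names are assumed):

    -- area and centroid of each dissemination area, computed with Sedona functions
    SELECT dauid,
           ST_Area(geometry)                AS area,
           ST_AsText(ST_Centroid(geometry)) AS centroid_wkt
    FROM aire_diffusion_2021
    LIMIT 10;
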
    We have also used the CARTO spatial extension on BigQuery without any tremendous success: user-defined functions (UDFs) can only take you so far. Perhaps we will have the chance to dabble in the integration of Trino with dbt; we have not seen anyone test out the spatial capabilities of that last option.

    Conclusion

    Complexity is hard to manage, and designing simple systems in a third-party data ecosystem is very complex. It is easy to lose sight of the big-picture objectives when working on technical challenges. Instead, we should strive to make our users the primary concern of our development. The data pipeline principles can help us re-center our pipelines on the actual value they provide.

    References

    dbt: https://docs.getdbt.com/docs/introduction

    GDAL/OGR: https://gdal.org/

    Geometric Data Viewer: https://geometric-data-viewer.anagraph.io/

    Inspired by Marty Cagan: https://svpg.com/inspired-how-to-create-products-customers-love/

    StatCan 2021 Census Data Release Schedule: https://www.census.gc.ca/census-recensement/2021/dp-pd/index-eng.cfm