Coalesce — A Data Engineering Journey Part 2

Matt Florian
Mar 6, 2022 · 5 min read

Staging the raw data vault

Recap

In part 1 of this series, we walked through what it takes to set up Coalesce for a data engineering project. The ease with which we got started spoke volumes about the user experience focus of the Coalesce.io service. However, this is only the start of the journey. Being able to build a sustainable data pipeline is what matters.

This next stop on our journey stages the source data for the raw data vault.

Why data vault?

The data vault is a data architecture that stores data in a reliable, auditable form. Its structures are hyper-normalized, so the database physically stores a record for every change that occurs. Capturing the historical changes to a record starts filling in data gaps, creating a complete historical record.

The data vault model is ideal for an ERP implementation like SAP. The model gives data teams the structure to unify legacy data systems with the new SAP structures. The output of the data vault is then a single pane of glass for historical and SAP data. Systems integrators often overlook this unified data, but the organization needs it. We will cover that in another article.

The Coalesce team built the features of Coalesce.io with the data vault and the data warehouse in mind. It is clear from the product features that the team understands the needs of data engineers within this scope. For this reason, we begin the review by building the data vault.

The data setup

For this next part of the evaluation of Coalesce.io, we will make the following assumptions.

  1. Data lands in Snowflake through an ingress process
  2. Data exists as it would in the source system of record
  3. Data is relational, and those relationships are known (in practice, they are not always)

The data could have arrived through any number of ingress processes. It may have been loaded with Snowflake Snowpipe, read through an external table as if it were a native table, or replicated by a third-party tool such as Fivetran HVR.
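For illustration, a minimal Snowpipe-style ingress might look like the sketch below. The database, schema, table, and stage names are placeholders invented for this example, and a real external stage would also need credentials or a storage integration.

    -- Hypothetical landing table for the fantasy football sample data.
    CREATE TABLE IF NOT EXISTS RAW.NFL.PLAYER_STATS (
        PLAYER_ID      NUMBER,
        TEAM           VARCHAR,
        GAME_WEEK      NUMBER,
        FANTASY_POINTS FLOAT
    );

    -- Placeholder external stage pointing at the files to be loaded.
    CREATE STAGE IF NOT EXISTS RAW.NFL.LANDING_STAGE
        URL = 's3://example-bucket/nfl/'
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- The pipe copies new files from the stage into the landing table as they arrive.
    CREATE PIPE IF NOT EXISTS RAW.NFL.PLAYER_STATS_PIPE AUTO_INGEST = TRUE AS
        COPY INTO RAW.NFL.PLAYER_STATS
        FROM @RAW.NFL.LANDING_STAGE;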

In this scenario, we know that the data we are using came from a shared data set provided by Snowflake. It is a sample of fantasy football stats from the 2020/2021 football season. The Bengals’ season went 4–11–1, and that was a good year.

Add Source Tables

The inclusion of source tables into the collection of nodes is very straightforward. Using the definition of a Source and an Environment from the setup phase, the data engineer begins the import process by navigating to the top-left node panel and clicking on the + symbol next to the node search. A context menu pops up with two options:

  • Add Sources
  • Create New Node >

For our purposes, we will start with a bulk import of new source nodes. When a data engineer clicks Add Sources, a modal window opens. The left pane lists the available tables. The data engineer can select a single table or bulk-select all of them within any combination of sources.

If a specific table is selected, a 100-row sample of that data appears in the data preview pane on the right. The ability to preview data before ingesting the metadata model should be a standard feature of any data transformation tool on the market today.

In this case, choosing to import objects from Snowflake into Coalesce is an import of the metadata. Coalesce stores no data as part of this process. This dataless state for transformation definition is often a critical checkbox for data and cybersecurity professionals within an organization.

Once Coalesce.io imports the metadata, the tables appear in the Nodes window on the left of the screen. The user can add these objects to a new or existing subgraph to keep them organized, as shown in the image below.

Creating Stage Nodes

In Coalesce, a Stage node is an intermediate node used to perform transformations and quality checks within the pipeline. When creating a data vault, the stage node enriches the data with additional metadata that downstream nodes will use. Examples of standard metadata columns include:

  • HUB_HASH_KEY
  • LINK_HASH_KEY
  • SOURCE_SYSTEM
  • LOAD_DATE
  • EFFECTIVE_DATE
  • VERSION_NUMBER
  • CHANGE_HASH

Creating these attributes once in a stage node lets downstream nodes use them consistently; a sketch of the kind of SQL such a stage might produce follows below.
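To make the enrichment concrete, the staging SQL behind a node like this might resemble the sketch below. The schema, table, and business key columns are assumptions for this example, and the template Coalesce actually generates will differ.

    -- Hypothetical staging view adding data vault metadata to the raw columns.
    CREATE OR REPLACE VIEW STAGE.NFL.STG_PLAYER_STATS AS
    SELECT
        SRC.PLAYER_ID,
        SRC.TEAM,
        SRC.GAME_WEEK,
        SRC.FANTASY_POINTS,
        -- Hash of the business key for the hub; delimiter and casing are a convention.
        MD5(UPPER(TRIM(TO_VARCHAR(SRC.PLAYER_ID)))) AS HUB_HASH_KEY,
        -- Hash of the combined keys that participate in the relationship.
        MD5(UPPER(TRIM(TO_VARCHAR(SRC.PLAYER_ID))) || '||' || UPPER(TRIM(SRC.TEAM))) AS LINK_HASH_KEY,
        'NFL_FANTASY'        AS SOURCE_SYSTEM,
        CURRENT_TIMESTAMP()  AS LOAD_DATE,
        CURRENT_DATE()       AS EFFECTIVE_DATE,
        -- Hash of the descriptive attributes, used to detect changed records.
        MD5(TO_VARCHAR(SRC.GAME_WEEK) || '||' || TO_VARCHAR(SRC.FANTASY_POINTS)) AS CHANGE_HASH
    FROM RAW.NFL.PLAYER_STATS AS SRC;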

Creating a stage within Coalesce is simple: a data engineer right-clicks on the source table to bring up a context menu.

The Stage node automatically maps each source column to a target column; the column definitions from the Source node carry forward to the Stage node. A data engineer can add new columns that use column-level functions from Snowflake. As Snowflake's catalog of functions continues to grow, so will the capabilities of Coalesce.

This is also where data engineers define hash keys. To create a hash key, the data engineer multi-selects the columns to include in the hash, right-clicks on the selection, and chooses the Generate Hash Column menu option. There are three hash types to select from; a sketch of the underlying expressions follows the list.

  • SHA1
  • SHA256
  • MD5
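Under the hood, each option maps to a Snowflake hash function applied over the concatenated column values. The expressions below are a hedged sketch of that idea; the columns, delimiter, and null handling are assumptions, not the exact code Coalesce generates.

    -- Illustrative hash expressions over a two-column business key.
    SELECT
        MD5(TO_VARCHAR(PLAYER_ID)  || '||' || TEAM)       AS HASH_MD5,     -- 128-bit digest
        SHA1(TO_VARCHAR(PLAYER_ID) || '||' || TEAM)       AS HASH_SHA1,    -- 160-bit digest
        SHA2(TO_VARCHAR(PLAYER_ID) || '||' || TEAM, 256)  AS HASH_SHA256   -- 256-bit digest
    FROM RAW.NFL.PLAYER_STATS;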

The Stage node is where all data engineering occurs. A data pipeline may be a simple one-to-one node map with essential data enrichment, or it may involve two or more nodes joining into a single target node. The multi-source node is a typical pattern for building a complex target node. For instance, if a source table stores transactional data that will become a Link Satellite in the raw data vault, the hub keys that participate in the transaction can be inferred from the related hubs. The data engineer creates the object relationship “joins” in the Join tab of the Node editor.

The join uses a Coalesce.io macro function called ref to create a flexible join statement within the data pipeline. Referencing objects this way simplifies code promotion into higher deployment environments. Aliased tables appear just as they would in a SQL join in Snowflake. The same window lets a data engineer specify any other SQL clause that follows the FROM clause in Snowflake.
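As an illustration, the contents of a Join tab for a two-source stage might look roughly like the snippet below. The schema and node names are placeholders, and the exact ref syntax and templating Coalesce uses may differ from this sketch.

    -- Hypothetical Join tab contents: ref() resolves each node to the correct
    -- database and schema for the current environment, so the same definition
    -- promotes cleanly from development to higher environments.
    FROM {{ ref('STAGE', 'STG_PLAYER_STATS') }} AS STATS
    INNER JOIN {{ ref('STAGE', 'STG_TEAMS') }} AS TEAMS
        ON STATS.TEAM = TEAMS.TEAM_CODE
    WHERE TEAMS.ACTIVE_FLAG = TRUE   -- any clause that follows FROM can be added here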

Next Up

With all core nodes established, the data engineer is ready to move to the next step in the pipeline: building the raw data vault. In the next installment of this series, we will create the data vault hub, link, and satellite objects consistent with a Data Vault 2.0 model, and conclude with a complete build of the data vault and the features that perform data loads.


Matt Florian

Pragmatic Data Architect + Vendor Curmudgeon + Shiny Object Skeptic + Fishing Guide