Lineage and Data Flows
The lineage section for an asset shows how data is transformed and used across different assets. It can show e.g. how multiple source tables are joined to create a core table that is then used by multiple BI reports.
Each box represents an assets (e.g. a table, view or report) and each arrow represents a data flow (e.g. transformation or consumption). A box with a solid border is a persisted asset (e.g. table), while a dashed border is a virtual asset (e.g. view). Boxes are color coded based on the schema they are a part of. Arrows that are transformations can be clicked to show more information about the SQL statement that defines the data flow. Gray arrows are past transformations, while a black arrow is a data flow that ran in the last 24 hours.
Understanding all downstream dependencies before making a change can help avoid corrupting assets downstream.
A high-level view of data pipelines can help see anti-patterns, e.g.:
- Identify duplicated data assets: often core tables are used to create derived data sets that are redundant. Combining derived assets aligns data quality efforts and boosts productivity
- Spot circular data flows: a circle happens if a data asset is computed from another asset that uses information from the original asset. This setup leads to complicated update logic that easily leads to data inconsistencies.
- Understand if data flows follow your architecture guidelines. E.g.: a good practice is to create marts for data consumers. These marts should not contain heavy data pipelines.
Every interaction with data in Snowflake happens with SQL queries, e.g.:
CREATE TABLE orders AS
SELECT *
FROM raw_orders
WHERE status != 'canceled'
Snowboard parses statements like these to analyze data flows and connects
raw_orders
with orders
in the lineage graph.- Data transformations that happen outside of Snowflake are not captured. E.g. a python script that pulls data from view
customers
, enriches the data and writes it back to another tablecustomers_enriched
. In this case no sql statement is used and no connection will be shown. - Both target and source table need to be indexed in Snowboard. E.g. if the Snowboard user does not have privileges to access the source table
raw_orders
. Snowboard will ignore the connection withorders
until it gets privileges to accessraw_orders
.
Last modified 1yr ago