This past year at Eventbrite we’ve been moving the infrastructure that supports our analytical workloads from a bevy of open-source tools (Presto, Hive, Luigi, Superset) to more managed-services (Snowflake, dbt). Working on the Analytics Engineering team, we’ve been tasked with migrating all AE-owned legacy transformation code written in Spark and scheduled through Luigi to dbt and Airflow.
Because of changes in business logic, old bugs, previous limitations on compute that made it so that changes weren’t applied to transformed tables after a certain time period, and choices we are making about how to run our transformations in our new infrastructure (full overwrites everyday vs. a lookback period) we cannot promise the transformed data tables that currently sit in our old environment will exactly match the updated transformed data tables, but we still want to try to reach parity as best as we possibly can.
If you are going through a migration from a legacy environment to a new one on Snowflake, here’s a handy query you can use that uses the information_schema
table available to you through Snowflake to look at the percent difference in your total table row counts between your old transformed table and your newly transformed table. Replace {OLD_DATABASE_NAME}
with the database where your old transformed table sits, {NEW_DATABASE_NAME}
with the database where your new transformed table sits, and {YOUR_TABLE_SCHEMA}
with the table which you are comparing total row counts. In our case, we have decided that we want to achieve no more than a 1% difference, but the level of acceptable differing total row count will be determined wholly on your own use case and migration completion criteria.
select replication.table_name, replication.row_count as replicated_table_row_count, migration.row_count as migrated_table_row_count, migration.row_count - migration.row_count as difference_in_row_count, abs(((migration.row_count - replication.row_count) / replication.row_count) * 100) as pct_difference_in_row_count, current_date() as date_ran
from {OLD_DATABASE_NAME}.information_schema.tables as replication join {NEW_DATABASE_NAME}.information_schema.tables as migration on replication.table_name = migration.table_name
where lower(migration.table_schema) = {YOUR_TABLE_SCHEMA}
order by pct_difference_in_row_count