3 Common dbt Mistakes and How to Avoid Them
Simple Guide for Avoiding the Most Common dbt Mistakes in Data Transformation and Modeling
You’ve gathered your data, created your models, and are ready to present those insights you’ve been working hard on. But then, boom. Things start falling apart — data gets corrupted, pipelines break, and the frustration hits you like a tidal wave.
If that sounds familiar, don’t worry — you are not alone.
In this guide, we will break down the three most common mistakes people make when using dbt, and more importantly, how to avoid them. Whether you are just getting started or you’ve been at it for a while, these tips will help you smooth out your data transformation process and make things run like clockwork.
So, let’s jump right in.
1. Unclear Naming Conventions:
Imagine stepping into a room filled with random boxes, none of them labeled. You need a tool, but all you find are random bits and pieces. Annoying, right? That’s what it feels like when you are working with unclear names in dbt. It’s a headache you don’t need.
Instead of using simple, clear names for your models, sources, or columns, you might encounter random abbreviations or vague terms. For example, what does emp_type even mean? Is it “employee type”? “Employment type”? Who knows. These unclear names just slow everything down and make life harder than it needs to be.
So what should you do?
- Be specific. Use names that actually describe what something is.
-- Instead of:
SELECT id, name, salary
FROM employee_data
WHERE emp_type = 'ft';
-- Use:
SELECT id, name, salary
FROM employee_data
WHERE employment_type = 'full_time';
- Keep it consistent. Use the same readable aliases across your models instead of one-off single letters.
-- Avoid
SELECT name, city, order_type
FROM users AS u
JOIN orders AS o ON u.id = o.user_id;
-- Go with
SELECT name, city, order_type
FROM users AS usr
JOIN orders AS ord ON usr.id = ord.user_id;
- And don’t forget to document your naming conventions so everyone on the team knows what’s up.
## Naming Conventions
- **Sources:** Use snake_case with a consistent prefix (e.g., 'my_') for custom sources.
- **Columns:** Opt for snake_case for consistency and clarity, using descriptive names to convey content.
- **Models:** Choose clear, meaningful names reflecting the transformation's purpose.
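To make this concrete, here is a rough sketch of what a schema file following these conventions might look like (the my_hr source and stg_hr__employees model names are just illustrative):
# models/staging/schema.yml (illustrative names)
version: 2

sources:
  - name: my_hr                 # prefixed, snake_case source name
    tables:
      - name: employees

models:
  - name: stg_hr__employees     # model name reflects its purpose: staged HR employees
    columns:
      - name: employment_type   # descriptive, snake_case column
        description: "full_time or part_time"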
By adopting clear, consistent naming conventions and proper documentation, you turn your dbt project into an organized, well-oiled machine where everything is easy to find.
Collaboration becomes smoother, and everyone is more productive.
2. Unoptimized SQL Queries:
If you’ve ever felt like your dbt models are moving at a snail’s pace, chances are poorly optimized SQL is to blame.
Think of SQL queries as instructions for your computer to fetch data. If the instructions are vague or inefficient, you’re basically sending it on a wild goose chase.
For example, let’s say you have a table called orders with columns like id, status, and more. To get the id and status, you might be tempted to do this:
-- Selecting all columns when only a few are needed
SELECT *
FROM orders
WHERE status = 'pending';
What’s wrong with that?
Well, it’s slow and inefficient. You are asking for all the columns when you only need two, which can lead to long wait times, overloaded resources, and generally poor performance.
Instead, try this:
-- Selecting only the required columns
SELECT id, status
FROM orders
WHERE status = 'pending';
Being specific with your queries means faster results and less strain on your system.
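The same principle carries over to your dbt models. As a quick sketch, assuming an upstream model (or staging table) called orders, a model that keeps only what downstream users need might look like this:
-- models/pending_orders.sql (illustrative model name)
-- Pull only the columns downstream models actually use
SELECT
    id,
    status
FROM {{ ref('orders') }}
WHERE status = 'pending'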
3. Overlooking Testing:
No matter how skilled you are with dbt, mistakes happen. That’s why testing is your secret weapon.
Testing acts as your safety net to catch errors before they mess up your data flow. By testing early, you can spot issues before they become a big deal, keeping everything running smoothly and saving you from future headaches.
How can you test in dbt?
There are two ways to approach testing in dbt:
- Generic Tests: Quick and simple tests that dbt provides right out of the box.
- Custom Tests: Tailored tests that you can write to fit your specific data needs.
Generic Tests in dbt
dbt comes with several handy tests that you can implement without much hassle. These cover common checks like ensuring values are not duplicated or null, or making sure data falls within an expected range.
These are super easy to set up, and here is how you can do it.
Not Null Test: This test makes sure a column doesn’t contain any null values.
Let’s say you want to make sure that the id column in your users table always has a value.
Here is how you’d write it:
# users.yml
version: 2
models:
  - name: users
    columns:
      - name: id
        tests:
          - not_null
Unique Test: This test ensures no duplicates exist in a column. To ensure every id in the orders table is unique, just add this:
# orders.yml
version: 2
models:
  - name: orders
    columns:
      - name: id
        tests:
          - unique
Accepted Values Test: This checks that a column’s values match a specific list (e.g., status values like ‘pending’ or ‘completed’).
To check that the status column in the orders table only includes acceptable values like pending, completed, or canceled, add this:
# orders.yml
version: 2
models:
  - name: orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'canceled']
Relationships Test: This verifies that foreign key relationships between tables are intact.
Let’s say you have two tables: orders and users. The orders table has a user_id column, and you want to ensure that every user_id in the orders table exists as an id in the users table.
# orders.yml
version: 2
models:
  - name: orders
    columns:
      - name: user_id
        tests:
          - relationships:
              to: ref('users')   # the table you're linking to
              field: id          # the column in 'users' that every user_id must match
Custom Tests in dbt
If you need more than the generic tests, dbt allows you to write custom ones using SQL. This comes in handy for more advanced checks or specific business rules.
Custom Test for Row Count Consistency
Let’s say you want to make sure your users table always has at least 100 rows:
-- tests/custom_test_row_count.sql
WITH validation AS (
SELECT COUNT(*) AS row_count
FROM {{ ref('users') }}
)
SELECT *
FROM validation
WHERE row_count < 100;
Because this file lives in your tests/ directory, dbt treats it as a singular test and picks it up automatically; there is nothing to declare in schema.yml. A singular test fails when its query returns any rows, so this one fails whenever users has fewer than 100 rows.
Custom Test for Null Values
Say you want to make sure the email column in the users table doesn’t have any missing values.
Here is how you’d write that test:
-- tests/check_email_not_null.sql
SELECT *
FROM {{ ref('users') }}
WHERE email IS NULL;
Since this file also sits in the tests/ directory, dbt discovers it automatically; again, there is nothing extra to add to your schema file.
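If you do want a reusable, declarable custom check, recent dbt versions also let you write a custom generic test as a test macro. Here is a minimal sketch (the not_empty_string name is just an example):
-- tests/generic/not_empty_string.sql
-- Fails for any row where the given column is an empty string
{% test not_empty_string(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} = ''

{% endtest %}
Once defined, you attach it to a column in schema.yml exactly like the built-in tests:
version: 2
models:
  - name: users
    columns:
      - name: email
        tests:
          - not_empty_string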
After you’ve set up your tests, running them is simple. Just run the following command.
dbt test
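If you only want to run the tests attached to a single model, dbt’s node selection syntax works here too, for example:
dbt test --select users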
Why Testing Matters in dbt
- Tests help you catch issues in your data models before they mess up your analyses.
- By validating your data at each step, you can be confident that your results are trustworthy.
- Everyone on your team knows the data is solid, so they can focus on analysis without second-guessing the models.
Conclusion:
Avoiding these common dbt mistakes is easier than you think, and by doing so, you will not only boost the reliability of your workflows but also save yourself a ton of frustration.
Stick to clear naming conventions (trust me, your future self will thank you), write optimized SQL that gets straight to the point, and always test your work to catch problems before they become disasters.
These simple tweaks can go a long way in making your data transformation smoother and more efficient.
Thanks for reading.