dbt training

One of the tools I am hoping to get to grips with is dbt. It appears to be very popular at the moment. With the trend of moving to ELT, having a good tool to perform your transformations is important, and from what I hear dbt is a good one.

I have signed up for the free dbt Cloud developer account and connected it to my Snowflake instance, but after that I was not quite sure what I was meant to be doing. dbt has its own training, so I am starting with the dbt Fundamentals course. The training is supposed to take several hours, with a few more hours implementing the hands-on project, and gives you a badge for LinkedIn or something. I am more interested in trying out the tool and seeing what it can do, for free, for this project. I have looked into quite a few training courses over the last few months, covering the tools I am using for this as well as things like AWS, and when it comes to actually being useful the dbt training is at the top so far. I skipped some of it as it was basic for someone with 10 years' experience, but for someone just starting out it is a really good introduction, not only to the processes but also to best practices around testing and version control.

The training has already linked to an interesting article on the dbt website.

Working my way through the training, it has already been very easy to set up a connection to Snowflake and perform a transformation. I am planning on using it to create a couple of mock roll-up tables on the exercise data. Given the volumes I am dealing with this is completely pointless, but it gives me something to create. Initially I am going to remove the timestamp entry and group by day.
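To make that concrete, here is a minimal sketch of the kind of model I mean; the source, table, and column names (raw, exercise_events, event_timestamp, exercise_type) are placeholders rather than my actual schema:

```sql
-- models/marts/daily_exercise_rollup.sql
-- Hypothetical daily roll-up: truncate the timestamp down to a date
-- and aggregate the raw events per day and exercise type.
-- Table and column names are placeholders, not my real schema.

select
    date_trunc('day', event_timestamp) as event_date,
    exercise_type,
    count(*)                           as event_count
from {{ source('raw', 'exercise_events') }}
group by 1, 2
```

dbt materialises this as a table or view depending on the project config, so there is no DDL to write by hand.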

Going through the training I have been nodding along to a lot of the functionality that the tool provides: 

  • documentation built in 
  • using .yml files to configure a table so it only has to be changed in a single place 
  • automatic lineage using the ref function, which is also used to automatically order the jobs to run (a sketch follows the next paragraph)
I have used bespoke ETL tools where some of the above is either not done or is done with great effort, so the fact that I don't have to worry about any of this for my little free play ETL is great.
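To show what I mean by the ref function, here is a sketch of a second model built on the hypothetical roll-up above; because it selects via ref rather than a hard-coded table name, dbt orders the runs and draws the lineage automatically:

```sql
-- models/marts/weekly_exercise_rollup.sql
-- Hypothetical follow-on model: because it selects from
-- ref('daily_exercise_rollup'), dbt knows to run it after the daily
-- roll-up and adds the edge to the lineage graph automatically.

select
    date_trunc('week', event_date) as event_week,
    exercise_type,
    sum(event_count)               as event_count
from {{ ref('daily_exercise_rollup') }}
group by 1, 2
```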

By the end of the training I had: 
  • Connected everything to a Git repository.
  • Configured a transformation in my Snowflake DB (the roll-up mentioned above).
  • Configured the data sources so they are referred to via config, so changes only have to be made in a single place (see the sources sketch after this list).
    • This also gave me automatic lineage on the transformations, though I will have to try something more complicated.
  • Configured source freshness, which gives a warning if data is stale.
  • Produced documentation of the sources, processes, etc. utilising the .yml files and the auto-documentation process.
  • Used their testing as per this video.
    • Great that it comes with some built-in tests that run quickly and are easy to configure (see the test sketch after this list).
  • Scheduled a daily job to refresh the roll-up tables modelled above.
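For anyone curious what the sources config looks like, here is a minimal sketch of a sources .yml; the schema, table name, loaded_at column, and freshness thresholds are all placeholder assumptions:

```yaml
# models/staging/sources.yml
# Hypothetical sources file: declares the raw table once so models can
# use source('raw', 'exercise_events'), and adds freshness checks.

version: 2

sources:
  - name: raw
    schema: raw
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: exercise_events
        description: "One row per recorded exercise event."
```

Running dbt source freshness then checks the newest _loaded_at value against those thresholds and warns or errors accordingly.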
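And a sketch of the matching schema .yml for the tests and documentation; the descriptions feed the auto-generated docs, the tests are dbt built-ins, and the accepted values are made up:

```yaml
# models/marts/schema.yml
# Hypothetical schema file for the roll-up model above: descriptions
# feed the generated docs site, tests are dbt's built-ins
# (not_null, unique, accepted_values, relationships).

version: 2

models:
  - name: daily_exercise_rollup
    description: "Exercise events rolled up to one row per day and type."
    columns:
      - name: event_date
        description: "Day the events occurred."
        tests:
          - not_null
      - name: exercise_type
        description: "Type of exercise."
        tests:
          - not_null
          - accepted_values:
              values: ['run', 'ride', 'swim']
```

Running dbt test executes these as queries against the warehouse, and dbt docs generate turns the descriptions into the browsable documentation site.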

I plan on moving on to the more advanced training, as I am interested to see what can be done and I really like the tool.


