
Posts

Showing posts from May, 2022

streamlit

Over the last few days I have been playing with Streamlit. This is a really useful library that effectively lets you build an interactive dashboard using Python. In theory, and somewhat in practice, I have now handled extracting data from APIs and databases, transforming it with pandas, loading it into my final database and building a dashboard, all using Python. From what I have read Streamlit can also be set up as a form, and if I could make it private it might allow direct data entry into my source database as well; otherwise I can stick with Google Sheets, but I might implement it for fun. The dashboard below is embedded, so it should update as I progress with my code. The code below, however, is from a one-off gist, as you don't seem to be able to embed from the GitHub repo, but you can find the latest version of the code here. Apparently the embedding of the code failed, so just check out the dashboard for now and I will try to get that working.
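
As a flavour of what a Streamlit dashboard looks like, here is a minimal sketch in the same spirit; the connection string, table and column names are placeholders rather than my actual setup.

```python
# Minimal Streamlit dashboard sketch: query a table and plot it.
# Connection string, table and column names are placeholders.
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

st.title("Training Dashboard")

@st.cache_data  # newer Streamlit caching; older versions used st.cache
def load_data() -> pd.DataFrame:
    engine = create_engine("mysql+pymysql://user:password@host/fitness")
    return pd.read_sql(
        "SELECT activity_date, activity_type, duration_mins FROM activities",
        engine,
    )

df = load_data()

# A simple interactive filter driving a chart and a table
activity = st.selectbox("Activity type", sorted(df["activity_type"].unique()))
filtered = df[df["activity_type"] == activity]

st.line_chart(filtered.set_index("activity_date")["duration_mins"])
st.dataframe(filtered)
```

You run a script like this locally with `streamlit run dashboard.py` and it serves the interactive page in the browser.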

prefect ETL tool in python

Having spent a lot of my time playing with Keboola and dbt to load and transform my data, I wanted to have a look at doing it all in pure Python. I previously built the full ETL pipeline for a company in Python but haven't really had a need to touch it in over four years. Most of the work I did before was just using pandas with a few connectors to various databases and producing reports in Excel using xlwings. It wasn't pretty, but it was effective and everyone was happy with the job it did. This time I ended up using the Prefect library. Well, I built it all and then integrated it into Prefect once I found it. I found it OK and it has some useful features, but it is not brilliant, though that may just be down to my lack of use. It does allow you to produce DAGs and lots of other useful functionality. Script below.
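
For a sense of the shape of it, here is a hedged sketch of a small extract/transform/load flow using Prefect 2-style decorators (Prefect 1.x used `from prefect import task, Flow` with a Flow context manager instead); the API URL, table and connection details are placeholders, not my actual pipeline.

```python
# Sketch of a simple ETL flow using Prefect's flow/task decorators.
# All URLs, credentials and table names are placeholders.
import pandas as pd
import requests
from prefect import flow, task
from sqlalchemy import create_engine


@task(retries=2)
def extract() -> list[dict]:
    # Pull raw records from an API (placeholder endpoint)
    response = requests.get("https://example.com/api/records", timeout=30)
    response.raise_for_status()
    return response.json()


@task
def transform(records: list[dict]) -> pd.DataFrame:
    # Flatten the JSON into a tabular shape
    return pd.json_normalize(records)


@task
def load(df: pd.DataFrame) -> None:
    engine = create_engine("mysql+pymysql://user:password@host/warehouse")
    df.to_sql("records", engine, if_exists="append", index=False)


@flow(name="simple-etl")
def etl() -> None:
    load(transform(extract()))


if __name__ == "__main__":
    etl()
```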

Loading my Strava Data using Python

I have wanted to load my Strava data into my data platform since I started loading the strength data. I found some really useful instructions that I used as my base here. I basically use the procedure shown to load my last 200 Strava activities. I load these into MySQL, find the new entries, which then get loaded into the main MySQL table, and then bulk load into Snowflake. My next step will be to process this into a more meaningful table, either using dbt or seeing if I can do something smart with Python and a view in Snowflake.
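
As a rough sketch of the extraction step, this is the kind of call involved; it assumes a valid OAuth access token already exists (the refresh-token steps from the linked guide are omitted), and the column subset, table and connection details are placeholders.

```python
# Sketch: pull the most recent activities from the Strava API into a MySQL
# staging table. Token, table and connection details are placeholders.
import pandas as pd
import requests
from sqlalchemy import create_engine

ACCESS_TOKEN = "your-strava-access-token"

# Strava's v3 endpoint for the authenticated athlete's activities
resp = requests.get(
    "https://www.strava.com/api/v3/athlete/activities",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    params={"per_page": 200, "page": 1},
    timeout=30,
)
resp.raise_for_status()

activities = pd.json_normalize(resp.json())

# Keep a manageable subset of columns before loading
cols = ["id", "name", "type", "start_date", "distance", "moving_time"]
activities = activities[[c for c in cols if c in activities.columns]]

engine = create_engine("mysql+pymysql://user:password@host/fitness")
# Land into a staging table; new rows can then be merged into the main table
activities.to_sql("strava_activities_stage", engine, if_exists="replace", index=False)
```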

Python Pipeline API to MySQL to Snowflake

I decided that I liked doing some of my coding in Python, even if I have to kick it off manually at the moment. I might go with the safe option of a $5-a-month PythonAnywhere package to run it on a schedule in the cloud, or I could put it in as an AWS Lambda function or in Azure, but I don't want to accidentally rack up a bill, so I might wait until I am further into training on them. So in this code I have:
- Used dotenv to store all parameters and passwords as environment variables, so I can post my scripts without modification and store them in git (with the .env file set to gitignore).
- Retrieved those values and called the weather API.
- Flattened the JSON to get all the columns.
- Put the new rows into the table in MySQL.
- Retrieved the table from MySQL and done a drop and replace into Snowflake.
My code:
The table:
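
As an illustration of the steps listed above, here is a hedged sketch of the same shape of script; the environment variable names, API URL and table names are placeholders rather than necessarily what my actual code uses.

```python
# Sketch of the steps described above: read secrets with dotenv, call a
# weather API, flatten the JSON, append to MySQL, then drop-and-replace
# the table in Snowflake. All names and URLs are placeholders.
import os

import pandas as pd
import requests
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # pulls values from a .env file kept out of git via .gitignore

# 1. Call the weather API using values from the environment
resp = requests.get(
    os.environ["WEATHER_API_URL"],
    params={"q": os.environ["WEATHER_LOCATION"], "appid": os.environ["WEATHER_API_KEY"]},
    timeout=30,
)
resp.raise_for_status()

# 2. Flatten the nested JSON into columns
weather = pd.json_normalize(resp.json())

# 3. Append the new rows to MySQL
mysql_engine = create_engine(os.environ["MYSQL_URI"])
weather.to_sql("weather_raw", mysql_engine, if_exists="append", index=False)

# 4. Read the full table back and drop-and-replace it in Snowflake
full_table = pd.read_sql("SELECT * FROM weather_raw", mysql_engine)
snowflake_engine = create_engine(os.environ["SNOWFLAKE_URI"])
full_table.to_sql("weather_raw", snowflake_engine, if_exists="replace", index=False)
```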

Creating a custom python data pipeline

Having pushed the data into MySQL using a Google Apps Script, I wanted to see whether I could then push it to Snowflake without using one of the automated tools. I decided, initially at least, to use Python. Python is a very easy language to use and, to achieve the basics for this sort of project at least, you can plug and play with the packages you need. This is not going to be the most performant approach and probably won't fly in a proper enterprise environment, though I have previously used complex scripts to generate BI reports for the majority of a company's (a start-up's) reporting. Below is my script minus the creation of the connection engines, because for the purposes of this I did not want to go to the trouble of masking them and it is very well documented how to create them. Currently I am running this manually, but for £5 a month I can get it scheduled on PythonAnywhere, though I am hoping I will pluck up the courage to run it on the Azure free tier.
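
Since the script omits the connection engines, here is a hedged sketch of how they can be created with SQLAlchemy and the snowflake-sqlalchemy package, followed by the copy itself; the account, credentials, database and table names are obviously placeholders.

```python
# Sketch of creating the two connection engines the script above omits,
# then copying a table from MySQL to Snowflake. Credentials are placeholders.
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

# MySQL engine (requires the pymysql driver)
mysql_engine = create_engine("mysql+pymysql://user:password@host:3306/fitness")

# Snowflake engine via the snowflake-sqlalchemy URL helper
snowflake_engine = create_engine(
    URL(
        account="my_account",
        user="my_user",
        password="my_password",
        database="FITNESS",
        schema="PUBLIC",
        warehouse="COMPUTE_WH",
    )
)

# Copy a table across: read from MySQL, drop and replace in Snowflake
df = pd.read_sql("SELECT * FROM strength_sessions", mysql_engine)
df.to_sql("strength_sessions", snowflake_engine, if_exists="replace", index=False)
```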

Getting Weather Data

I decided that I wanted to ingest some weather data through a different means, this time combining a Google Apps Script to retrieve the data from the API with another script to connect to the MySQL database and deposit the data. Both steps are set up on a schedule. Setting up the API call: I followed the instructions from the following Medium post to create a function in Google Apps Script that would call an API for me. I use it to call the weather API website and retrieve the weather data for where I live. These two things combined import the data into my Google Sheet as per: I have then set this up to run on a four-hour schedule within Google Sheets using one of their triggers. Getting the data into MySQL: So now that I have the data in Google Sheets I want to regularly import it into a database, as only a single row is stored in the Google Sheet, though I could probably get it to persist there as well. Looking at my options with the JDBC drivers available, MySQL could work and Snowflake…

Creating a date dimension in dbt and Snowflake

A lot of the data that I am working with doesn't lend itself to creating complex star schemas (more on data modelling in a later post), however I want to at least go to some effort. The one thing that pretty much all my tables have in common is date, so I want to build a date dimension. My strength training happens on a day; my running and walking, parkrun and hopefully the weather data I intend to integrate will all have a date on them. This means that if I have a date dimension and I aggregate my various facts at date level, I can use them together in a single visualisation. There are various methods I could have used to create the date dimension; I went with the dbt date spine, which was really simple using the utils package, and then used the date functions in Snowflake to extract relevant fields such as year, month, day in words, year-month and so on.
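
Purely as an illustration of the kind of table this produces, here is a small pandas sketch generating the same sort of columns; the actual build uses the dbt date spine and Snowflake's date functions as described above, and the date range and column names here are just examples.

```python
# Illustration only: the post builds this with a dbt date spine and Snowflake
# date functions; this pandas version just shows the shape of the dimension.
import pandas as pd

dates = pd.DataFrame({"date_day": pd.date_range("2020-01-01", "2025-12-31", freq="D")})

dates["year"] = dates["date_day"].dt.year
dates["month_number"] = dates["date_day"].dt.month
dates["month_name"] = dates["date_day"].dt.month_name()
dates["day_of_month"] = dates["date_day"].dt.day
dates["day_name"] = dates["date_day"].dt.day_name()           # e.g. "Monday"
dates["year_month"] = dates["date_day"].dt.strftime("%Y-%m")  # e.g. "2022-05"

print(dates.head())
```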

Zapier

As much as I have enjoyed using Keboola, there are some connections that it doesn't have or that just haven't worked for one reason or another. I actually came across Zapier as a solution for bringing in e-mails from parkrun to load my results every week. Honestly I have not found it to be as robust as Keboola, but that might just be me archiving my e-mails before it completes its 15-minute poll. The second use case I am working on is pulling in Strava data; for a fitness dashboard, the fact it has a built-in connector for Strava is great, though I am worried that, given the activities I do, I might reach the limit. I won't go into details on how to set things up, but you can set up 5 Zaps that can run for a combined 100 runs a month for free. In my data platform / solution I am using Zaps to load harder-to-get / harder-to-automate data. It doesn't add much from a technical point of view, as it is just signing into a few accounts to get the data into Google Sheets for downstream…

AWS training cloud academy free course

One of the things I like about this course is that the instructors are really clear, but also that it provides free labs that let you actually sign into AWS and create and do things without worrying that you are going to incur a cost. Today I completed one of the hands-on labs. This was to create a Lambda function, in this case a very basic Python script that searched a website for a keyword. I then placed this on a schedule and used CloudWatch to create a dashboard that monitored the running of the function. Overall it was a very simple use case, but it was also a very simple process to set up. I don't have much to add other than that it is well worth signing up to Cloud Academy for the free training if nothing else; I am tempted, once I have done some more training, to give the paid-for option a go to get the full sandboxes.
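
The lab's own script isn't reproduced here, but a Lambda function of that shape looks roughly like this; the URL and keyword are placeholders.

```python
# Rough sketch of the kind of Lambda used in the lab: fetch a page and
# check whether a keyword appears. URL and keyword are placeholders.
import urllib.request


def lambda_handler(event, context):
    url = "https://example.com"
    keyword = "data"

    with urllib.request.urlopen(url, timeout=10) as response:
        page = response.read().decode("utf-8", errors="ignore")

    found = keyword.lower() in page.lower()

    # CloudWatch picks these logs up, which is what the dashboard monitors
    print(f"Keyword '{keyword}' found: {found}")
    return {"statusCode": 200, "keyword_found": found}
```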

Pulling Data from Google Fit

So the next project for me will be to integrate step counts from Google Fit. Given my improved knowledge and understanding of the tools and infrastructure I am using, I can work out how I am going to do this, and how I will use the data, before I start. The first step was connecting to the Google Fit API and extracting the relevant data. I will admit that I did the standard developer trick and followed an online guide / Stack Overflow to get this done; my main source was the link attached. My next steps will be to:
- Use Keboola to connect to and load the Google Sheet as an In job and, as an Out job, deposit the data into Snowflake.
- Use dbt to transform and load the data into the final star schema.
At the very least, the number of steps per day (in thousands) would be good to have in my exercise fact tables. If I integrate this early enough and in enough places it will test a lot of my dbt understanding if nothing else. As part of this I am also going to create a proper date dimension and integrate another new sheet with ca…
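
As a hedged sketch of the extraction step: this assumes an OAuth access token with the fitness scope already exists (the linked guide covers getting one), and the endpoint and payload below follow the standard Google Fit aggregate API rather than necessarily matching the guide exactly.

```python
# Sketch of pulling daily step counts from the Google Fit REST API.
# Assumes a valid OAuth access token; endpoint/payload per the standard
# Fit aggregate API, which may differ in detail from the linked guide.
import datetime as dt

import requests

ACCESS_TOKEN = "your-google-oauth-token"

end = dt.datetime.now()
start = end - dt.timedelta(days=7)

body = {
    # Aggregate step deltas into one bucket per day
    "aggregateBy": [{"dataTypeName": "com.google.step_count.delta"}],
    "bucketByTime": {"durationMillis": 86_400_000},
    "startTimeMillis": int(start.timestamp() * 1000),
    "endTimeMillis": int(end.timestamp() * 1000),
}

resp = requests.post(
    "https://www.googleapis.com/fitness/v1/users/me/dataset:aggregate",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json=body,
    timeout=30,
)
resp.raise_for_status()

for bucket in resp.json()["bucket"]:
    day = dt.datetime.fromtimestamp(int(bucket["startTimeMillis"]) / 1000).date()
    points = bucket["dataset"][0]["point"]
    steps = sum(p["value"][0]["intVal"] for p in points)
    print(day, steps)
```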

Creating SCD2 tables in dbt

I don't want this blog to become the dbt blog, so I have taken my time to post about this, but I do fully intend to do some more posts on dbt and its cool built-in functionality. Equally, I am quite happy with where my model is at the moment, so until I find some new tool to use or a new data source I am going to look to expand the section on the free training available. dbt has the ability to cater for creating SCD2-style tables, called snapshots; details are included in the advanced materialization training. I set up my first snapshot model by creating an SCD2 table for the activity type dimension, so that if a new exercise type is added it will create a new row; equally, if I delete or modify one of the old columns it will end-date the old row and insert the new row. The preference is to do this off a date column that records the change time; however, I don't have one, so I do the comparison against all columns. Snapshots sit in their own folder and have a fairly simple modelling structure…

dbt - pivoting on list of values

Say you have a massive table, but for ease of reporting you need to split numbers out. For example, say you are a service provider offering numerous different services: each person gets a single form with a mark for each service provided, and the transactional system stores this as a row per form per service. For reporting you may well want to pivot these out and have a separate column for each service, with a 1/0 flag and a count for each. In traditional databases and tools you have to create all your views in advance and you cannot pivot on a list of values. The joy I am finding with dbt and jinja coding is that you can create the views dynamically. And whilst that may pose some risk, it means that in this scenario a new service code is catered for automatically, as you can dynamically generate the columns by looping through a list of values. I have applied this logic to my strength exercise data, pivoting it from rows to columns. Below is the jinja code and the resulting SQL. The other…
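
To illustrate the idea outside dbt, here is a small sketch using the jinja2 Python package to render the same kind of pivot from a list of values; the table and column names are placeholders, and in dbt itself the template would live in a model file (with the list potentially coming from a query).

```python
# Illustration of the templating idea using the jinja2 package directly;
# in dbt the same template sits in a model file. Names are placeholders.
from jinja2 import Template

services = ["bench_press", "squat", "deadlift"]  # list of values to pivot on

template = Template(
    """
select
    session_id
    {%- for svc in services %},
    sum(case when exercise = '{{ svc }}' then 1 else 0 end) as {{ svc }}_count
    {%- endfor %}
from strength_exercises
group by session_id
"""
)

print(template.render(services=services))
```

Rendering produces one `sum(case when ...)` column per entry in the list, so adding a new value to the list (or generating the list dynamically) adds the column without rewriting the SQL.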

dbt - more stuff

The more I use dbt the more I like it. I am finding many of its features really useful, and I haven't even done the training on macros and packages yet, so I feel there is more to come. In the meantime I have now started to, just for the fun of it, create some downstream views with dependencies on other steps and a function in SQL. Happy to say it is all working really well, and using jinja (and my Snowflake function) has saved me a heap of time coding. Sources yml: View using the source function (resulting SQL): View that references the output from previous steps, allowing them to be linked: Assuming you create your sources in the yml file and reference previous steps using the ref function rather than calling the resulting table directly (dbt handles that for you, as shown above), it will automatically work out the dependencies, run things in the right order and produce a lovely lineage graph like so. I am hoping to stop playing with what I know of dbt and might make some visuals…

Snowflake Functions

So, as part of some stuff I have been doing in my day job, I have built some useful functions and decided I should build a basic function in Snowflake before moving on to something more complicated with table functions and the like. My basic function is one that accepts parameters and uses them to determine whether an activity is within ±1 standard deviation of the all-time average for that activity. I actually call the code above using dbt to create the views and tables, and will show that in my next post. For now, here is a screenshot of me calling that function using jinja scripting to create the different function calls.
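
Purely to illustrate the logic such a function implements, here is a tiny Python sketch of the same kind of check; the real thing is a Snowflake SQL function as described above, and the example numbers are made up.

```python
# Illustration of the logic only; the post implements this as a Snowflake function.
import statistics


def outside_one_stddev(value: float, history: list[float]) -> bool:
    """Return True if value is more than 1 standard deviation from the mean."""
    mean = statistics.mean(history)
    stddev = statistics.stdev(history)
    return abs(value - mean) > stddev


# Example: was today's activity unusually far from its all-time average?
times = [25.1, 24.8, 25.5, 26.0, 24.9]
print(outside_one_stddev(28.0, times))  # True: well outside the usual range
```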

dbt training - the next part

So I enjoyed the first dbt training enough to give the next bunch of training a go. The first part of this training introduces you to jinja and using it to, for example, loop through creating case statements to pivot rows out to columns. Jinja code: I can see there are lots of really useful purposes for this, and I am hoping I will find out later that, rather than using a hard-coded list, you can use a query to generate the list of entries to loop through. This would be a great feature; one of the limitations of a lot of SQL systems is that you can't pivot on an unknown list of values. Now on to learning about macros and packages in dbt. Generated SQL code:

DBT training

One of the tools I am hoping to get to grips with is dbt. It appears to be a very popular tool at the moment. I think with the trend of moving to ELT, having a good tool to perform your transformations is important, and from what I hear dbt is good. I have signed up for the free dbt Cloud developer account and connected it to my Snowflake instance, but after that I am not quite sure what I am meant to be doing. dbt has its own training, so I am starting with the dbt Fundamentals course. The training is supposed to take several hours, with a few more hours implementing the hands-on project, and gives you a badge for LinkedIn or something. I am more interested in trying out the tool and seeing what it can do, for free, for this project. I have looked into quite a few training courses over the last few months, covering all the tools I am using for this and things like AWS, and when it comes to actually being useful the dbt training is at the top so far. I skipped some as it was basic for someone…

Zoho Analytics

Have I finally found my BI tool, one that lets me import data from Snowflake and share it for free? I know, no sooner have I posted about how hard it was to find a tool that could do anything with Snowflake than I come across Zoho. You can check out my dashboard on the following page. Below is a diagram that outlines the processes I have used to obtain this data. In summary, my parkrun e-mail is pushed to Google Sheets every week by Zapier, and forms I submit every day are used to track the strength training I do. Keboola is then used to ingest this data into MySQL and/or Snowflake, where I then use views or the built-in transformation processes in Keboola to shift the data into a format for reporting. Google Data Studio then connects to MySQL, and Zoho to Snowflake, to visualise the data.

Data Visualisation Tools

In my quest to share the end result of my project I have been looking for a visualisation tool that works with Snowflake, is free and can be shared. I seem to be able to get two of the three quite easily, but finding one where I can share my data has so far proven impossible. Here are some of the options I have tried; all are fine for creating visualisations and all have a way of getting them for free. I am using Google Data Studio, but I cannot freely connect this to Snowflake, only MySQL, and the free MySQL DB has very limited connections.
- Snowflake's built-in dashboard tool - cannot share publicly. As you can see from that post, the link to my dashboard does not work.
- Power BI - can connect to Snowflake but cannot share with the free version.
- Klipfolio - can connect to Snowflake but cannot share with the free version.
- Retool - can connect to Snowflake but cannot share with the free version.
- Google Data Studio - as said above, it does not connect to Snowflake.
If anyone knows of any…