Posts

Showing posts with the label cloud

Creating a custom python data pipeline

Having pushed the data into MySQL using a Google Apps Script, I wanted to see whether I could then push the data to Snowflake without using one of the automated tools. I decided, initially at least, to use Python. Python is a very easy language to use and, for this sort of project at least, you can plug and play with the packages you need to achieve the basics. This is not going to be the most performant approach and probably won't fly in a proper enterprise environment, though I have previously used complex scripts to generate BI reports for the majority of a company's (a start-up's) reporting. Below is my script minus the creation of the connection engines, because for the purposes of this I did not want to go to the trouble of masking them and it is very well documented how to create them. Currently I am running this manually, but for £5 a month I can get it scheduled on PythonAnywhere, and I am hoping I will pluck up the courage to run it on the Azure free tier.
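The script itself isn't reproduced in this excerpt, but a minimal sketch of the approach is below. The connection strings and table names are placeholders rather than my real setup; pandas and SQLAlchemy do the heavy lifting of reading from MySQL and writing to Snowflake.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection strings - the real engine creation is deliberately
    # omitted, as in the original post.
    mysql_engine = create_engine("mysql+pymysql://user:password@host/dbname")
    snowflake_engine = create_engine(
        "snowflake://user:password@account/database/schema?warehouse=wh"
    )

    def copy_table(table_name: str) -> None:
        """Read a whole table from MySQL and load it into Snowflake."""
        # Pull the source table into a DataFrame.
        df = pd.read_sql(f"select * from {table_name}", mysql_engine)
        # Write it to Snowflake, replacing any existing copy.
        df.to_sql(table_name, snowflake_engine, if_exists="replace", index=False)
        print(f"Loaded {len(df)} rows into {table_name}")

    if __name__ == "__main__":
        for table in ["strength_training", "parkrun_results"]:  # hypothetical names
            copy_table(table)

This is not fast for big tables (everything goes through a DataFrame in memory), but for a small personal dataset it keeps the whole pipeline to a handful of lines.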

Creating a date dimension in dbt and Snowflake

A lot of the data that I am working with doesn't lend itself to creating complex star schemas (more on data modelling in a later post), however I want to at least go to some effort. The one thing that pretty much all my tables have in common is a date, so I want to build a date dimension. My strength training happens on a day, as do my running, walking and parkrun results, and hopefully the weather data I intend to integrate will have a date on it too. This means that if I have a date dimension and I aggregate my various facts at date level, I can use them together in a single visualisation. There are various methods I could have used to create the date dimension; I went with the dbt date spine, which was really simple using the utilities package, and then used the date functions in Snowflake to extract relevant fields such as year, month, day name and year-month.
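The model isn't shown in this excerpt, but a sketch of the idea, using the dbt_utils date_spine macro with a few Snowflake date functions layered on top, looks roughly like this (the model name, date range and column names are my own assumptions):

    -- models/dim_date.sql (hypothetical name)
    with spine as (
        {{ dbt_utils.date_spine(
            datepart="day",
            start_date="cast('2015-01-01' as date)",
            end_date="cast('2030-01-01' as date)"
        ) }}
    )

    select
        date_day,
        year(date_day)               as year_number,
        month(date_day)              as month_number,
        dayname(date_day)            as day_name,
        to_char(date_day, 'YYYY-MM') as year_month
    from spine

Each fact table can then join to this on its own date column, which is what makes the single combined visualisation possible.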

AWS training - Cloud Academy free course

One of the things I like about this course is that the instructors are really clear, but also that it provides free labs that let you actually sign into AWS and create and do things without worrying that you are going to incur a cost. Today I completed one of the hands-on labs. This was to create a Lambda function, in this case a very basic Python script that searched a website for a keyword. I then placed this on a schedule and used CloudWatch to create a dashboard that monitored the running of the function. Overall it was a very simple use case, but it was also a very simple process to set up. I don't have much to add other than that it is well worth signing up to Cloud Academy for the free training if nothing else; I am tempted, once I have done some more training, to give the paid-for option a go to get the full sandboxes.
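The lab's script isn't reproduced here, but a Lambda function that checks a page for a keyword can be sketched in a few lines of standard-library Python; the URL and keyword below are placeholders rather than what the lab used.

    import urllib.request

    # Placeholders - the lab had its own target site and search term.
    URL = "https://example.com"
    KEYWORD = "cloud"

    def lambda_handler(event, context):
        """Fetch a web page and report whether it contains a keyword."""
        with urllib.request.urlopen(URL, timeout=10) as response:
            page = response.read().decode("utf-8", errors="ignore")
        found = KEYWORD.lower() in page.lower()
        # Anything printed lands in CloudWatch Logs, which the dashboard can chart.
        print(f"Keyword '{KEYWORD}' found: {found}")
        return {"keyword": KEYWORD, "found": found}

Scheduling it is then just a CloudWatch Events / EventBridge rule pointing at the function.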

dbt - pivoting on list of values

Say you have a massive table, but for ease of use in reporting you need to split numbers out. For example, say you are an organisation that provides numerous different services; each person gets a single form with a mark for each service provided, and the transactional system stores this as a row per form per service. For reporting you may well want to pivot these out and have a separate column for each service, with a 1/0 flag and a count for each. In traditional databases and tools you have to create all your views in advance and you cannot pivot on a list of values. The joy I am finding with dbt and jinja coding is that you can create the views dynamically, and whilst that may pose some risk, it means that in this scenario a new service code is catered for automatically, as you can dynamically generate the columns by looping through a list of values. I have applied this logic to my strength exercise data, pivoting it from rows to columns. Below is the jinja code and the resulting SQL,...
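The code from the post is cut off in this excerpt, but the pattern is roughly as follows; here the list of values is derived with dbt_utils.get_column_values rather than hard-coded, and the model and column names are assumptions for illustration.

    -- models/strength_exercise_pivot.sql (hypothetical names throughout)
    {% set exercise_names = dbt_utils.get_column_values(
        table=ref('stg_strength_training'),
        column='exercise_name'
    ) %}

    select
        training_date,
        {% for exercise in exercise_names %}
        max(case when exercise_name = '{{ exercise }}' then 1 else 0 end)
            as {{ exercise | replace(' ', '_') | lower }}{% if not loop.last %},{% endif %}
        {% endfor %}
    from {{ ref('stg_strength_training') }}
    group by training_date

Because the list is queried each time dbt compiles the model, a brand-new exercise (or service code) simply becomes another column the next time it runs.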

dbt - more stuff

The more I use dbt the more I like it. I am finding many of its features really useful and I haven't even done the training on macros and packages yet, so I feel there is more to come. In the meantime I have now started to, just for the fun of it, create some downstream views with dependencies on other steps and on a function in SQL. Happy to say it is all working really well, and using jinja (and my Snowflake function) has saved me a heap of time coding. Sources yml: View using the source function (and the resulting SQL): View that references the output from previous steps, allowing them to be linked: Assuming you create your sources in the yml file and reference previous steps using the ref function rather than calling the resulting table directly (dbt handles that for you), as shown above, it will automatically work out the dependencies, run things in the right order and produce a lovely lineage graph like so. I am hoping to stop playing with what I know of dbt and might make...
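The screenshots those labels refer to aren't in this excerpt, but the shape of each piece is roughly as below; the schema, table and model names are placeholders rather than my actual project.

    # models/sources.yml (hypothetical names)
    version: 2

    sources:
      - name: raw_fitness
        database: MY_DB
        schema: RAW
        tables:
          - name: strength_training

    -- models/stg_strength_training.sql - reads from the declared source
    select *
    from {{ source('raw_fitness', 'strength_training') }}

    -- models/fct_strength_daily.sql - depends on the previous model via ref(),
    -- which is what lets dbt work out the run order and draw the lineage graph
    select
        training_date,
        count(*) as exercises_logged
    from {{ ref('stg_strength_training') }}
    group by training_date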

Snowflake Functions

So as part of some stuff I have been doing in my day job I have built some useful functions, and I decided I should build a basic function in Snowflake before moving on to something more complicated with table functions and the like. So my basic function is one that accepts parameters and uses them to determine whether an activity is more than one standard deviation above or below the all-time average for that activity. I actually call the code above using dbt to create the views and tables and will show that in my next post. For now, here is a screenshot of me calling that function using jinja scripting to create the different function calls.
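The function itself isn't reproduced in this excerpt, but a Snowflake SQL UDF along those lines could be sketched as follows; the function and parameter names are placeholders, with the average and standard deviation calculated elsewhere (for example in a dbt model) and passed in.

    -- Hypothetical sketch of the idea, not the original function.
    create or replace function flag_vs_one_stddev(
        activity_value  float,
        activity_avg    float,
        activity_stddev float
    )
    returns varchar
    as
    $$
        case
            when activity_value > activity_avg + activity_stddev then 'above'
            when activity_value < activity_avg - activity_stddev then 'below'
            else 'within'
        end
    $$;

    -- Example call (column names are placeholders):
    -- select flag_vs_one_stddev(distance_km, avg_distance_km, stddev_distance_km) from my_activity_stats;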

dbt training - the next part

So I enjoyed the first dbt training enough to give the next batch of training a go. The first part of this training introduces you to jinja and using it to, for example, loop through creating case statements to pivot rows out to columns. I can see there are lots of really useful purposes for this, and I am hoping I will find out later that, rather than using a hard-coded list, you can use a query to generate the list of entries to loop through. That would be a great feature; one of the limitations of a lot of SQL systems is that you can't pivot on an unknown list of values. Now onto learning about macros and packages in dbt. Jinja code, and the SQL it generates:
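The training's own example isn't included in this excerpt, so the snippet below is just along the same lines, using placeholder payment columns; the first block is the jinja in the model file and the second is roughly what dbt compiles it to.

    -- Jinja in the model file (hypothetical columns and values):
    select
        order_id,
        {% for method in ["bank_transfer", "credit_card", "gift_card"] %}
        sum(case when payment_method = '{{ method }}' then amount else 0 end)
            as {{ method }}_amount{% if not loop.last %},{% endif %}
        {% endfor %}
    from {{ ref('stg_payments') }}
    group by order_id

    -- Roughly the SQL it generates (the compiled table name is a placeholder):
    select
        order_id,
        sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount,
        sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
        sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount
    from my_db.analytics.stg_payments
    group by order_id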

DBT training

One of the tools I am hoping to get to grips with is dbt. It appears to be a very popular tool at the moment. I think with the trend of moving to ELT, having a good tool to perform your transformations is important, and from what I hear dbt is good. I have signed up for the free dbt Cloud developer account and connected it to my Snowflake instance, but after that I am not quite sure what I am meant to be doing. dbt has its own training, so I am starting with the dbt Fundamentals course. The training is supposed to take several hours, with a few more hours implementing the hands-on project, and gives you a badge for LinkedIn or something. I am more interested in trying out the tool and seeing what it can do, for free, for this project. I have looked into quite a few training courses over the last few months, covering all the tools I am using for this and things like AWS, and when it comes to actually being useful the dbt training is at the top so far. I skipped some as it was basic for s...

Zoho Analytics

Have I finally found my BI tool, one that lets me import data from Snowflake and share it for free? I know, no sooner had I posted about how hard it was to find a tool that could do anything with Snowflake than I came across Zoho. You can check out my dashboard on the following page. Below is a diagram that outlines the processes I have used to obtain this data. In summary, my parkrun e-mail is pushed to Google Sheets every week by Zapier, and Forms I submit every day are used to track the strength training I do. Keboola is then used to ingest this data into MySQL and/or Snowflake, where I then use views or the built-in transformation processes in Keboola to shift the data into a format for reporting. Google Data Studio then connects to MySQL, and Zoho to Snowflake, to visualise the data.

Data Cleansing View in MySQL

I discussed before how I picked up my parkrun data from my e-mails; parkrun doesn't have an API, as their system was never designed to cope with the millions of people that now take part. I only want my own data, so this works just fine for me. I use a Zap to pick up the e-mail and plonk it in a Google Sheet, and then Keboola to process the data into MySQL and maybe soon Snowflake. Actually, given the setup I have, it would only take five minutes in Keboola to add a step to the Flow to take the output from the view below and put it into Snowflake as a table. I am leaning more towards using Snowflake, as long as Retool stays free enough for me to use, as the free MySQL database has a very limited session pool and therefore limits the visualisations I can do. Anyway, the raw data from the e-mail is useless for visuals, so I processed the data in MySQL. There might be more elegant solutions, but for me it was some experience in how to code this in MySQL and what functions it has. Being primarily used to ...
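The view itself is cut off in this excerpt, and the layout of the parkrun e-mail text isn't shown either, so the sketch below is purely illustrative: it assumes a hypothetical raw table with one comma-separated line per result and just shows the sort of MySQL string functions involved.

    -- Hypothetical sketch only: the table, column and delimiters are assumptions.
    CREATE OR REPLACE VIEW parkrun_results_clean AS
    SELECT
        STR_TO_DATE(TRIM(SUBSTRING_INDEX(raw_line, ',', 1)), '%d/%m/%Y') AS run_date,
        CAST(TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(raw_line, ',', 2), ',', -1)) AS UNSIGNED) AS finish_position,
        TRIM(SUBSTRING_INDEX(raw_line, ',', -1)) AS finish_time
    FROM parkrun_email_raw;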

Keboola Flows

Really, finding Keboola was the thing that kickstarted this project; otherwise I would be trying to build custom code on a Python cloud server and building everything from scratch. In Keboola you build your data sources and destinations using connection details, which is fairly simple and something I will likely cover in another post; the same goes for transformations etc. Here though I am going to discuss Flows, which is where you bring everything together. On my free account there are some limitations. My easiest flow is very basic: pull the parkrun results e-mail from Gmail to Google Sheets (actually done by a Zap, not Keboola); Keboola will, as often as I like, in this case once a week, pull the data from the sheet into its storage; it will then transfer this to the target database, currently a MySQL database, though I can and might expand that to the Snowflake instance within Keboola. I then, outside of Keboola, connect to the MySQL database f...

Free AWS Training

Whilst a lot of the information I am posting on here is about how to actually build a free cloud data platform for yourself, there is also some good training out there. Whilst you will usually need to pay to get access, a lot of places want to entice you in with certain bits for free. In the data engineering space a popular option for training is providers with hands-on labs. With the rise of these cloud platforms, training providers are able to spin up mini instances with lots of restrictions and allow you to do hands-on training without fear of using something outside of the free tier. As I said, most of this training is not free. I did a trial month at Whizlabs and got myself a certification on Snowflake and AWS with hands-on experience using their sandbox areas. Honestly, despite finishing that course I don't feel like I learnt a huge amount; the labs were too regimented and the trainers were not overly engaging. If money were no object I would give A Cloud Guru a go but I am looking for free ma...