
Data Cleansing View in MySQL

I discussed before how I pick up parkrun data from my e-mails. parkrun don't have an API, as their system was never designed to cope with the millions of people that now take part. I only want my own data, so this works just fine for me. I use a Zap to pick up the e-mail and plonk it in a Google Sheet, and then Keboola to process the data into MySQL and maybe soon Snowflake. Actually, given the setup I have, it would only take 5 minutes in Keboola to add a step to the Flow that takes the output from the view below and puts it into Snowflake as a table. I am leaning more towards using Snowflake, as long as Retool stays free enough for me to use, because the free MySQL database has a very limited session pool, which limits the visualisations I can do.

Anyway, the raw data from the e-mail is useless for visuals, so I processed the data in MySQL. There might be more elegant solutions, but for me it was good experience in how to code this in MySQL and what functions it has. Being primarily used to Oracle, it has been interesting doing this in MySQL and Snowflake. The view uses the fact that the e-mail arrives in a standard format to pull out the details I am interested in, such as location, time, parkrun number etc., and allows me to then report on these.
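As a quick illustration of the locate/substr pattern the view relies on, here is a minimal sketch that can be run on its own in MySQL. The sample text is made up for the example and is not a real parkrun e-mail; only the marker strings and offsets are taken from the view.

```sql
-- Hypothetical sample text (an assumption for illustration, not a real e-mail)
set @mailtext = 'Your time was 00:25:30. You finished in age category VM40-44.';

select
    -- 'Your time was' is 13 characters; +14 also skips the space after it
    substr(@mailtext,
           locate('Your time was', @mailtext) + 14,
           8) as event_time,        -- '00:25:30'
    -- 'category ' is 9 characters, so +9 lands on the 'V' of the category
    substr(@mailtext,
           locate('category VM', @mailtext) + 9,
           7) as age_category;      -- 'VM40-44'
```

One thing to keep in mind with this approach: locate returns 0 when the marker is not found, so if the e-mail wording ever changes, the offsets will quietly point at unrelated text rather than raise an error.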

The code for my view can be seen here: 

create or replace view v_parkrun_result as
select
    /* For MySQL
       29-04-2022 - Gary Manley - Initial Version
    */
    -- get finish time
    substr(mailtext,
           locate('Your time was',mailtext)+14,
           8) event_time,
    -- get parkrun number
    cast(substr(mailtext,
                locate('Congratulations on completing your ',mailtext)+35,
                3) as DECIMAL) parkrun_number,
    -- get place
    trim(substr(mailtext,
                locate('Hello Gary',mailtext)+18,
                locate('results for event ',mailtext)
                - (locate('Hello Gary',mailtext)+18)
    )) parkrun_place,
    -- get position
    cast(substr(mailtext,
                locate('today. You finished in',mailtext)+22,
                3) as DECIMAL) parkrun_position,
    -- get total participants
    cast(substr(mailtext,
                locate('out of a field of ',mailtext)+18,
                4) as DECIMAL) total_field,
    -- age category
    substr(mailtext,
           locate('category VM',mailtext)+9,
           7) age_category,
    -- age grading (as percentage)
    cast(substr(mailtext,
                locate('You achieved an age-graded score of ',mailtext)+36,
                5) as DECIMAL(4,2)) age_rating,
    -- event date, with any time portion dropped
    date(cast(date as datetime)) event_date
from ext_tab_parkrun_email;
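
Once the view exists, reporting on it is just a normal query. The following is only a sketch of the kind of summary I mean, using the column names from the view above; the choice of aggregates is an example, not something from my actual reports.

```sql
-- Summary per location from the view's columns.
-- Assumes event_time is always the fixed-width HH:MM:SS text the view extracts,
-- so the smallest string is also the fastest time.
select parkrun_place,
       count(*)                  as runs,
       min(event_time)           as best_time,
       round(avg(age_rating), 2) as avg_age_rating
from v_parkrun_result
group by parkrun_place
order by runs desc;
```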
