Using Gen AI to write a fairly simple SQL query

So I wanted to test the different Gen AI models out there and see whether they could write a relatively simple SQL query. Basically: select against my table, as detailed in the prompts below, and produce a list of the fastest 1000 times at an event (one that takes place weekly), with the times and names of the athletes that ran them. Note that although I say "view" a lot, I really mean "query" (what are views if not stored queries anyway?), and I am using this in my DB as a view.

Winner: Copilot

The original view can be seen below: 


create or replace view stg_top_100_ag as
with w_fastest as (
    select "parkrunID", "Name", "Gender", "AgeBracket", min(TOTALSEC) as Fastest_time
    from stg_parkrun_all_v
    where "AgeBracket" is not null
    group by "parkrunID", "Name", "AgeBracket", "Gender"
)
select rank() over (partition by "AgeBracket" order by Fastest_time) as "Age Bracket Rank",
       "Name",
       "Gender",
       "AgeBracket",
       floor(Fastest_time / 60) as TimeMin,
       mod(Fastest_time, 60) as TimeSec,
       floor(Fastest_time / 60) || ':' || lpad(mod(Fastest_time, 60), 2, '0') as "Time"
from w_fastest
qualify "Age Bracket Rank" < 51
order by "AgeBracket", "Age Bracket Rank";
So it is a fairly simple view, with just enough logic in it to throw some spanners in the works. The question is: given the table definition and some explanation, can the Gen AI platforms recreate a working version of the above view?
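
One of those spanners is the QUALIFY clause, which is Snowflake-specific: it filters on the result of a window function directly, where standard SQL would need a subquery or CTE. For anyone unfamiliar with it, here is a minimal sketch of the equivalent subquery form (reusing the w_fastest CTE from the view above, purely for illustration):

-- Equivalent to the QUALIFY filter in the view, written the standard-SQL way
with w_fastest as (
    select "parkrunID", "Name", "Gender", "AgeBracket", min(TOTALSEC) as Fastest_time
    from stg_parkrun_all_v
    where "AgeBracket" is not null
    group by "parkrunID", "Name", "AgeBracket", "Gender"
)
select *
from (
    select "Name",
           "AgeBracket",
           rank() over (partition by "AgeBracket" order by Fastest_time) as "Age Bracket Rank"
    from w_fastest
) ranked
where "Age Bracket Rank" < 51;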

The initial Prompt: 

I can't find a good way to format and embed my full chats with the AI tools, so I will work with what I have. Here is the original prompt I used to get a starting point.


You have a table in a Snowflake database called
stg_parkrun_all_v
The table has the following columns, where event date and position form a unique key:
EVENTDATE - Date the event took place
EVENTNUM - The event number where 1 is the first event and so on
TOTALATHLETES - The total number of people that ran the event
Position - The position of the athlete in this row within the event on the date shown
parkrunID - The unique ID of the athlete
Name - The name of the athlete - may not be unique
Gender - The gender of the athlete
AgeBracket - The 5 year age bracket of the athlete
Club - The running club of the athlete
TimeMin - the time they took in the event on the eventdate in minutes
TimeSec - The second portion of their time as per timeMin
TOTALSEC - The total seconds the athlete took in the event
DISPLAYTIME - A string representation of the athletes time on that date
MINS_5_BUCKET - An ID to place each time in a 5 minute bucket
Please write a query to get the best (minimum) time of each athlete over all events and then return just the top 1000 athletes with their ranking.
Please note that the parkrunID and Name columns need double quotes round them to work. I would also like the athletes' names to be returned in the query.
Please add a comment at the start of the query explaining what the query is doing that the source was a Gen AI model with your name included.
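A quick aside on why the prompt insists on double quotes: Snowflake folds unquoted identifiers to upper case, so a column created as "parkrunID" can only be referenced with an exact, quoted match. A small illustration (the failing line is commented out):

-- Unquoted identifiers resolve to upper case in Snowflake
select "parkrunID", "Name"                   -- works: exact quoted match
from stg_parkrun_all_v;
-- select parkrunID from stg_parkrun_all_v;  -- fails: resolves to PARKRUNID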

Gemini: 

Overall this chat isn't very long; the first attempt was pretty close, it just missed the floor function around the minutes column and some double quotes, so I asked it to add those.

"Please add a floor function to the fastest_time / 60 AS mins column Also the cte needs the double quotes around theparkrunID and Name columns. "

And here is the Gemini view, which I tested and it produced the right results.

The Gemini View:
-- This query, generated by Bard, a large language model from Google AI,
-- finds the fastest time for each athlete in the 'stg_parkrun_all_v' table
-- and ranks them based on their fastest time. It then returns the top 1000
-- ranked athletes with their ranking, parkrunID, name, fastest time in
-- whole minutes, and fastest time in seconds.
with ranked as (
    SELECT
        "parkrunID",
        "Name", -- replaced "name" with "Name" as requested
        MIN(TOTALSEC) as fastest_time,
        ROW_NUMBER() OVER (ORDER BY MIN(TOTALSEC) ASC) AS ranking
    FROM stg_parkrun_all_v
    GROUP BY "parkrunID", "Name"
)
SELECT
    ranking,
    "parkrunID",
    "Name",
    FLOOR(fastest_time / 60) AS mins,
    fastest_time % 60 AS secs
FROM ranked
WHERE ranking <= 1000
ORDER BY ranking;
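
One subtle difference from my original: Gemini ranked with ROW_NUMBER rather than RANK, so two athletes with identical fastest times get arbitrary distinct rankings instead of sharing one. A self-contained sketch with made-up times shows the distinction:

-- Made-up times purely to show the tie behaviour
with t as (
    select * from values (1, 1200), (2, 1200), (3, 1250) as v(id, total)
)
select id,
       row_number() over (order by total) as row_num,  -- 1, 2, 3 (ties broken arbitrarily)
       rank()       over (order by total) as rnk       -- 1, 1, 3 (ties share a rank)
from t;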


Copilot:

Next up I tried Copilot. It gave me a working piece of SQL out of the gate; I did have to ask it to expand the comments at the top and to add minutes / seconds to the output, but I like the way it did it. Overall I was very impressed with it. No problems at all.


Copilot View:
-- Query to get the best time for each athlete over all events with ranking
-- Generated by the **Copilot** AI model
WITH RankedTimes AS (
    SELECT
        "parkrunID" AS "Athlete ID",
        "Name" AS "Athlete Name",
        MIN("TOTALSEC") AS "Best Time (Seconds)",
        RANK() OVER (ORDER BY MIN("TOTALSEC")) AS "Rank"
    FROM stg_parkrun_all_v
    GROUP BY "parkrunID", "Name"
)
SELECT
    "Athlete ID",
    "Athlete Name",
    "Best Time (Seconds)",
    CONCAT(
        FLOOR("Best Time (Seconds)" / 60), ' min ',
        MOD("Best Time (Seconds)", 60), ' sec'
    ) AS "Best Time (Minutes:Seconds)",
    "Rank"
FROM RankedTimes
WHERE "Rank" <= 1000
ORDER BY "Rank";



ChatGPT:

Almost an identical story to Copilot, with slightly different output. I got a working view first time, but it didn't have the useful additional columns, and it didn't get the comments at the top right; when I asked it to add them, I think it went overboard!

ChatGPT View:

-- Query generated by ChatGPT
-- By OpenAI's ChatGPT
/*
View stg_parkrun_all_v:
- EVENTDATE: Date the event took place
- EVENTNUM: The event number where 1 is the first event and so on
- TOTALATHLETES: The total number of people that ran the event
- Position: The position of the athlete in this row within the event on the date shown
- "parkrunID": The unique ID of the athlete
- "Name": The name of the athlete - may not be unique
- Gender: The gender of the athlete
- AgeBracket: The 5-year age bracket of the athlete
- Club: The running club of the athlete
- TimeMin: the time they took in the event on the eventdate in minutes
- TimeSec: The second portion of their time as per TimeMin
- TOTALSEC: The total seconds the athlete took in the event
- DISPLAYTIME: A string representation of the athlete's time on that date
- MINS_5_BUCKET: An ID to place each time in a 5-minute bucket
*/
SELECT
    ROW_NUMBER() OVER (ORDER BY MIN(TOTALSEC)) AS ranking,
    "parkrunID",
    "Name",
    MIN(TOTALSEC) AS best_time,
    FLOOR(MIN(TOTALSEC) / 60) AS best_time_minutes,
    MOD(MIN(TOTALSEC), 60) AS best_time_seconds
FROM stg_parkrun_all_v
GROUP BY "parkrunID", "Name"
ORDER BY best_time
LIMIT 1000;
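
One nice-to-have none of the models offered unprompted was the zero-padded MM:SS display string from my original view. Bolting it onto ChatGPT's output is a one-liner, using the same floor / mod / lpad trick as the original (a sketch only):

-- e.g. a best time of 1145 seconds becomes '19:05'
select "parkrunID",
       "Name",
       floor(min(TOTALSEC) / 60) || ':' || lpad(mod(min(TOTALSEC), 60), 2, '0') as display_time
from stg_parkrun_all_v
group by "parkrunID", "Name";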


Summary:

Gemini - Original prompt plus one correction gave a working view. It also provided some nice-to-haves that the others didn't.
Copilot - Original prompt worked; a couple of minor requested additions gave a working view.
ChatGPT - As with Copilot.

I think that overall the winner is Copilot: the final solution it gave me worked, and it put in nice comments. Second I would go with Gemini; even though it required a correction to add the floor that the others didn't, it gave the view most closely matching my original and got the comments right. This leaves ChatGPT in last. Honestly, however, the experience with all three was very similar, and they all produced more or less what I wanted within a handful of prompts. I wonder how they will cope with more complex data models.


