Skip to main content

Using Gen AI to write a fairly simple SQL query

So I wanted to see if I could test the different Gen AI models that are out there and get them to write a relatively simple SQL query. Basically select against my table, as detailed in the prompts to Gen AI, and produce a list of the fastest 1000 times at an event (that takes place weekly) and provide the times and names of the athletes that ran said times. Note that although I say view a lot I mean query because what are views if not stored queries anyway and I am using this in my DB as a view. 

Winner: Copilot

The original view can be seen below: 


create or replace view stg_top_100_ag as
with w_fastest as (
select "parkrunID", "Name", "Gender", "AgeBracket", min(TOTALSEC) Fastest_time
from stg_parkrun_all_v
where "AgeBracket" is not null
group by "parkrunID", "Name", "AgeBracket", "Gender"
)
select RANK() OVER (partition by "AgeBracket" ORDER by Fastest_time) "Age Bracket Rank" ,
"Name",
"Gender",
"AgeBracket" as "AgeBracket",
floor(Fastest_time/60) TimeMin,
mod(Fastest_time,60)TimeSec,
floor(Fastest_time/60)||':'||LPAD(mod(Fastest_time,60),2,'0') "Time"
from w_fastest
QUALIFY "Age Bracket Rank" <51
order by "AgeBracket", "Age Bracket Rank"
;
view raw Top100.sql hosted with ❤ by GitHub
So it is a fairly simple view with some logic in it to through some spanners in the works. The question is with the table definition and some explanation can the Gen AI platforms recreate a working version of the above view? 

The initial Prompt: 

I can't find a good way to format and embed my whole chats with the AI tools so I will work with what I have. Here is my original prompt that I used to get a starting point. 


You have a table in a Snowflake database called
stg_parkrun_all_v
The tale has the following columns where event date and position are a unique key:
EVENTDATE - Date the event took place
EVENTNUM - The event number where 1 is the first event and so on
TOTALATHLETES - The total of number of people that ran the event`
Position - The position of the athlete in this row within the event on the date shown
parkrunID - The unique ID of the athlete
Name - The name of the athlete - may not be unique
Gender - The gender of the athlete
AgeBracket - There 5 year age bracket of the athlete
Club - The running club of the athlete
TimeMin - the time they took in the event on the eventdate in minutes
TimeSec - The second portion of their time as per timeMin
TOTALSEC - The total seconds the athlete took in the event
DISPLAYTIME - A string representation of the athletes time on that date
MINS_5_BUCKET - An ID to place each time in a 5 minute bucket
Please write a query to get the best (minimum) time of each athlete over all events and then return just the top 1000 athletes with their ranking.
Please note that parkrunID and Name columns need double quotes round it to work. I would also like the athletes names to be returned in the query.
Please add a comment at the start of the query explaining what the query is doing that the source was a Gen AI model with your name included.
view raw Prompt1.txt hosted with ❤ by GitHub

Gemini: 

Overall this chat isn't very long and the first attempt was pretty close, it just missed the floor function around the minutes column and some quotes so I asked it to add those. 

"Please add a floor function to the fastest_time / 60 AS mins column Also the cte needs the double quotes around theparkrunID and Name columns. "

And here is the Gemini view, which I tested and it produced the right results.

The Gemini View:
-- This query, generated by Bard, a large language model from Google AI,
-- finds the fastest time for each athlete in the 'stg_parkrun_all_v' table
-- and ranks them based on their fastest time. It then returns the top 1000
-- ranked athletes with their ranking, parkrunID, name, fastest time in
-- whole minutes, and fastest time in seconds.
with ranked as (
SELECT
"parkrunID",
"Name", -- replaced "name" with "Name" as requested
MIN(TOTALSEC) as fastest_time,
ROW_NUMBER() OVER (ORDER BY MIN(TOTALSEC) ASC) AS ranking
FROM stg_parkrun_all_v
GROUP BY "parkrunID", "Name"
)
SELECT
ranking,
"parkrunID",
"Name",
FLOOR(fastest_time / 60) AS mins,
fastest_time % 60 AS secs
FROM ranked
WHERE ranking <= 1000
ORDER BY ranking;

Here

Copilot:

Next up I tried CoPilot. Copilot gave me a working piece of SQL out of the date, I did have to ask to expand the comments at the top and to add minute / seconds to the output but I like the way it did it. Overall was very impressed with it. No problems at all.


Copilot View:
-- Query to get the best time for each athlete over all events with ranking
-- Generated by the **Copilot** AI model
WITH RankedTimes AS (
SELECT
"parkrunID" AS "Athlete ID",
"Name" AS "Athlete Name",
MIN("TOTALSEC") AS "Best Time (Seconds)",
RANK() OVER (ORDER BY MIN("TOTALSEC")) AS "Rank"
FROM
stg_parkrun_all_v
GROUP BY
"parkrunID", "Name"
)
SELECT
"Athlete ID",
"Athlete Name",
"Best Time (Seconds)",
CONCAT(
FLOOR("Best Time (Seconds)" / 60), ' min ',
MOD("Best Time (Seconds)", 60), ' sec'
) AS "Best Time (Minutes:Seconds)",
"Rank"
FROM
RankedTimes
WHERE
"Rank" <= 1000
ORDER BY
"Rank";



ChatGPT:

Literally identical story to CoPilot with a slightly different output. I got a working view first time but didn't have the additional columns that were useful and it didn't get the comments right at the top and when I asked it to add them I think it went overboard! Sorry for the formatting it won't let me fix it!

ChatGPT View:

-- Query generated by ChatGPT
-- By OpenAI's ChatGPT
/*
View stg_parkrun_all_v:
- EVENTDATE: Date the event took place
- EVENTNUM: The event number where 1 is the first event and so on
- TOTALATHLETES: The total number of people that ran the event
- Position: The position of the athlete in this row within the event on the date shown
- "parkrunID": The unique ID of the athlete
- "Name": The name of the athlete - may not be unique
- Gender: The gender of the athlete
- AgeBracket: The 5-year age bracket of the athlete
- Club: The running club of the athlete
- TimeMin: the time they took in the event on the eventdate in minutes
- TimeSec: The second portion of their time as per TimeMin
- TOTALSEC: The total seconds the athlete took in the event
- DISPLAYTIME: A string representation of the athlete's time on that date
- MINS_5_BUCKET: An ID to place each time in a 5-minute bucket
*/
SELECT
ROW_NUMBER() OVER (ORDER BY MIN(TOTALSEC)) AS ranking,
"parkrunID",
"Name",
MIN(TOTALSEC) AS best_time,
FLOOR(MIN(TOTALSEC) / 60) AS best_time_minutes,
MOD(MIN(TOTALSEC), 60) AS best_time_seconds
FROM
stg_parkrun_all_v
GROUP BY
"parkrunID",
"Name"
ORDER BY
best_time
LIMIT 1000;
view raw ChatTop1000.sql hosted with ❤ by GitHub


Summary:

Gemini - Original Prompt and 1 correction and gave a working view. Provided me with some nice to haves that the others didn't.
Copilot - Original Prompt worked and minor additions asked for and gave a working view.
ChatGPT - as with CoPilot

Think that overall the winner is Copilot, the final solution it gave me worked and it put in nice comments. I personally would then go with Gemini, even though it require a correction to add the floor that the others didn't it gave the view most closely matching my original and got the comments right. This leaves ChatGPT in last. Honestly however the experience with all 3 was very similar and they all produced more or less what I wanted within a handful of prompts. I wonder how they will cope with more complex data models.



Comments

Popular posts from this blog

AWS training cloud academy free course

One of the things I like about this course are the instructors are really clear but also that it provides free labs that allow you to actually sign into AWS and perform some actions to actually create and do things without worrying that you are going to incur a cost.  Today I complete one of the hands on labs.  This was to create a lambda function, in this case it was a very basic python script that was searching a website for a keyword. I then placed this into a schedule and used cloudwatch to create a dashboard that monitored the running of this function. Overall it was a very simple use case but it was also a very simple process to setup.  I don't have much to add to this other than it is well worth signing up to cloud academy for the free training if nothing else, I am tempted, once i have done some more training, to give the paid for option a go to get the full sandboxes. 

Gen AI news 29-04-2024

Here are some recent updates and insights related to Generative AI (gen AI) : Enterprise Hits and Misses - Robotics and Gen AI Converge : This article discusses the convergence of robotics and generative AI. It explores breakthroughs needed in the field, the FTC’s policy change regarding non-competes, and the impact on AI model sizes for enterprises 1 . Read more All You Need To Know About The Upcoming AI-Powered OLED iPad Pro : This piece provides a summary of rumors surrounding the next-gen AI-fused OLED iPad Pro, powered by the new Apple M4 chip 2 . Read more Delivering on the Promise of Gen AI : New Electronics reflects on NVIDIA GTC and key announcements that contribute to delivering on the promises made for generative AI 3 . Read more The Future of Generative AI - An Early View in 15 Charts (McKinsey): Since the release of ChatGPT in November 2022, generative AI has been making headlines. McKinsey research estimates that gen AI features could add up to $4.4 trillion to the globa...