Blog Barista: Josh Hull | July 18, 2019 | Databases | Brew time: 8 min
If you’ve spent a lot of time working with data, you’ve probably experienced the phenomenon of being stuck with a troublesome query. It’s particularly frustrating when you know you should be able to do something with SQL, but you can’t figure out how. Have you ever been there? Have you had that feeling of being five levels deep into mental math about what a query does and what you need it to do, and you just can’t quite reach the next level?
After struggling through many difficult and time intensive SQL queries, I found a strategy that has proven over and over to effectively bust through mental blocks and convoluted queries. If you want to adopt this strategy it will require some humility, grammar skills, and patience. But, what I’m about to tell you to do, will feel like you’re back in kindergarten.
Here’s the strategy: get really good at writing what you want from your data in very plain english. Try to compose a simple, structured expression of the idea that anyone could understand. Usually when I suggest that to a developer, I get a look that says “I believe you know what you’re talking about, but I don’t think you understand what I’m saying.” In fact, I do.
Here’s why it works: SQL is a declarative language. A declarative statement will say what you want. Often the reason we get stuck writing SQL is because we’re trying write imperative statements rather than declarative. Imperative statements will express how to get it. The difference between the two is telling you that I want a peanut butter and jelly sandwich, versus me walking you through all the steps to make it.
In a Neo4j article, the author walks readers through a helpful example to further explain the difference. Imagine two children, one of whom you must make requests declaratively, and one imperatively.
You will tell the declarative child that you want the bed made with new sheets, and you want the pattern on the pillow showing.
With the imperative child, you will express each step of the process to accomplish this. You will tell the child which sheets, where to get the new sheets, how to carry them, how to spread them on the bed, along with many other instructions. You will explain exactly how to perform each task.
The declarative child may end up doing things in an order you don’t think is right, and might do weird things like carry the old sheets the whole time in a backpack or bring them to the laundry room. But at the end of the process, you’ll have a bed made just like you declared.
The imperative child will do exactly as you said, and you will end up with a bed that is the product of your instructions. It may look exactly like the declarative child’s bed in the end, depending on how complete your long list of instructions are.
The problem with writing an imperative SQL statement is that’s not how SQL works. The database will always do what it wants in order to fulfill your request. It will get very muddy if you start writing like a helicopter parent.
We can easily get into imperative thinking as we get knee-deep into a complicated query. Once we realize there’s a roadblock, us developers go back to what we feel comfortable with, thinking through the steps required to get the information we need.
Writing in plain english is a very effective way of getting back to a declarative mindset, and also identifying the core concepts of a request you’re trying to make. It becomes much clearer that you’re muddying the SQL by trying to tell the database what to do when you’re looking at a sentence that has a list of instructions, instead of description of what you want from the database. You may find that you’ll have to force yourself back out of this way of thinking in order to express your needs effectively in SQL.
Another good reason to write your thoughts in plain english is that it translates very well into SQL. In fact, sometimes you can just start replacing parts of the sentence structure and end up with the SQL that you need.
Let’s go through some examples:
I want the total amount paid to each medical provider for each quarter in 2018.
It’s pretty simple to SQL’ize the first half. I know what I mean by “total” is that I want to add up all the amounts paid. The SQL way to say that is “SUM.” And I know that I need the information from the transactions_table, since that’s where this information exists.
select sum(amount_paid)
from transactions_table
Next: I can recognize from my sentence that I will also need to know the provider and the quarter it belongs to.
select sum(amount_paid), provider, {quarter?}
from transactions_table
Realizing I don’t have quarter on my table, but I can derive that since I have the transaction date. So, I do a little math on the date (for example, with Oracle’s to_char(date, ‘Q’) function).
select sum(amount_paid), provider, to_char(transaction_date,'Q')
from transactions_table
Now, I can add a few more things. In my sentence I said I want 2018. I can clearly see that is a WHERE filter. I also know that since I want the sum by quarter, I need to group it by quarter. Here’s my finished query:
select sum(amount_paid), provider, to_char(transaction_date,'Q')
from transactions_table
where to_char(transaction_date,'YYYY') = '2018'
group by to_char(transaction_date,'Q'), provider
Let’s get a little more complicated. This strategy becomes really fun once you need some joins, correlated sub-queries, and other interesting advanced functions. One very common way to get stuck is to need “the most recent” of something with a qualifier. Here’s an example:
Get me the most recent transaction amount for each provider that is over $500.
Start with the easy stuff:
Select amount_paid, provider, transaction_date
from transaction_table
That will get us all the transactions, so now we need to filter it down to the most recent transaction that’s over $500.
Select amount_paid, provider, transaction_date
from transaction_table
where amount_paid > 500
and transaction_date {is the most recent date
for all the transactions over 500
for this provider}
We’ve left the most complicated part in english, but its much easier to figure out now. “The most recent date” can be translated to MAX(transaction_date). We know we want the date specific to the provider we’re looking at, and one where the transaction amount is over $500. We’re going to use a correlated sub query, which means we need to name the tables so we can reference properly. Let’s write that in:
Select amount_paid, provider, transaction_date
from transaction_table outer
where amount_paid > 500
and transaction_date =(select max(transaction_date)
from transaction_table as inner
where inner.amount_paid > 500
and inner.provider = outer.provider)
So easy, right!?
Here’s a real example I was involved in recently. A co-worker wanted to generate a table with some data aggregated from a single table. I challenged him to come up with a plain english description. Here’s what he came up with:
For all the issues that have a given status, I want to count how many of those issues have each type of priority and then collect them into one line for each status, that shows all the counts of priorities for that status.
Can you spot the imperative language in there? There’s just a little, but it’s muddying things up. The imperative language is “collect them into one line” and “count how many.” We discussed a little further and came up with this:
I want to get all of the statuses in my data and a count of the priorities for each status.
Then, we started interpreting it into SQL:
select status, count(major), count(minor), count(trivial)
from my_data
We don’t actually have three columns for major, minor and trivial, we only have one column called “priority” that can contain one of those three values. We have to work that out a little. We can get the number of “major” if priority is only counted when it is equal to “major.” This can be declared as “sum the number of times priority is major for this status.”
sum(case when priority = 'major' then 1 else 0 end)
We can put this little snippet into the query three different times:
select status
, sum(case when priority = 'major' then 1 else 0) as MAJOR_COUNT
, sum(case when priority = 'minor' then 1 else 0) as MINOR_COUNT
, sum(case when priority = 'trivial' then 1 else 0) as TRIVIAL_COUNT
from my_data
If we run this query, we’ll see a problem because it will issue an error. I don’t actually want that sum for EACH row, I want the sum for ALL the rows for each status. In SQL terms, that means I want to group it by status. Why won’t it automatically group by status? One of SQL’s mysteries. Let’s add it in there.
select status
, sum(case when priority = 'major' then 1 else 0) as MAJOR_COUNT
, sum(case when priority = 'minor' then 1 else 0) as MINOR_COUNT
, sum(case when priority = 'trivial' then 1 else 0) as TRIVIAL_COUNT
from my_data
group by status
And that’s it! Just for fun, let’s re-work that simple sentence again:
I want to see all of the unique statuses represented in my data and for each status a count of each of the three kinds of priority “major, minor, trivial.”
From this example, it’s easy to see that there is a lot of room for variation in expressing what you want in different ways. We can also see that some ways are going to be clearer and easier to translate to SQL. To accomplish this succinct way of expressing an idea, practice makes perfect.
I’d also like to point out that since these are very simple example queries it may not be immediately clear why you’d use this strategy. It becomes much more useful the more complicated your query gets.
My parting challenge: take the time to write clear english descriptions of what you want, and develop the ability to express it in a way that leads to easy SQL translation. If you put some time in and pay attention, you can level up your SQL skills with this simple technique!
1 Comment
Other recent posts:
Kunz, Leigh and Associates (KL&A) Announces the Retirement of KL&A Co-Founder John Leigh
OKEMOS, MI, April 11, 2023 – Kunz, Leigh and Associates (KL&A) announces the recent retirement of KL&A Co-Founder John Leigh as of March 31, 2023. Mr. Leigh began his career as a developer working on large mainframe systems before moving…
Team Building in a Remote Environment
Blog Barista: Dana Graham | June 15th, 2022 | Culture | Brew time: 5 min
Let me start by saying I don’t care for the term “work family.” I have a family I love, and they have absolutely nothing to do with my career. I want my work life to be its own entity. I like boundaries (and the George Costanza Worlds Theory). Certainly, I want to enjoy and trust my coworkers, and I want to feel supported and cared for…
Best advice I ever received in SQL!