This is a compilation of all the questions and answers on Alisdair Owen's PostgreSQL Exercises . Don't
forget that actually solving these problems will make you go further than just skimming through this guide,
so make sure to pay PostgreSQL Exercises a visit.
Getting Started
It's pretty simple to get going with the exercises: all you have to do is open the exercises , take a look at
the questions, and try to answer them!
The dataset for these exercises is for a newly created country club, with a set of members, facilities such as
tennis courts, and booking history for those facilities. Amongst other things, the club wants to understand
how they can use their information to analyse facility usage/demand. Please note: this dataset is designed
purely for supporting an interesting array of exercises, and the database schema is flawed in several
aspects - please don't take it as an example of good design. We'll start off with a look at the Members
table:
Each member has an ID (not guaranteed to be sequential), basic address information, a reference to the
member that recommended them (if any), and a timestamp for when they joined. The addresses in the
dataset are entirely (and unrealistically) fabricated.
The facilities table lists all the bookable facilities that the country club possesses. The club stores id/name
information, the cost to book both members and guests, the initial cost to build the facility, and estimated
monthly upkeep costs. They hope to use this information to track how financially worthwhile each facility
is.
Okay, that should be all the information you need. You can select a category of query to try from the menu
above, or alternatively start from the beginning .
No problem! Getting up and running isn't too hard. First, you'll need an install of PostgreSQL, which you
can get from here . Once you have it started, download the SQL .
When you're running queries, you may find psql a little clunky. If so, I recommend trying out pgAdmin or
the Eclipse database development tools.
Schema
This category deals with the basics of SQL. It covers select and where clauses, case expressions, unions,
and a few other odds and ends. If you're already educated in SQL you will probably find these exercises
fairly easy. If not, you should find them a good point to start learning for the more difficult categories
ahead!
If you struggle with these questions, I strongly recommend Learning SQL , by Alan Beaulieu, as a concise
and well-written book on the subject. If you're interested in the fundamentals of database systems (as
opposed to just how to use them), you should also investigate An Introduction to Database Systems by C.J.
Date.
Expected results:
facid name membercost guestcost initialoutlay monthlymaintenance
Answer:
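A query along the following lines retrieves everything from the facilities table:

select * from cd.facilities;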
The SELECT statement is the basic starting block for queries that read information out of the database. A
minimal select statement is generally comprised of select [some set of columns] from [some
table or group of tables] .
In this case, we want all of the information from the facilities table. The from section is easy - we just need
to specify the cd.facilities table. 'cd' is the table's schema - a term used for a logical grouping of
related information in the database.
Next, we need to specify that we want all the columns. Conveniently, there's a shorthand for 'all columns' -
*. We can use this instead of laboriously specifying all the column names.
Expected results:
name membercost
Tennis Court 1 5
Tennis Court 2 5
Badminton Court 0
Table Tennis 0
Massage Room 1 35
Massage Room 2 35
Snooker Table 0
Pool Table 0
Answer:
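A query like this, listing just the two columns we're interested in, fits the bill:

select name, membercost
from cd.facilities;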
For this question, we need to specify the columns that we want. We can do that with a simple comma-
delimited list of column names specified to the select statement. All the database does is look at the
columns available in the FROM clause, and return the ones we asked for, as illustrated below
Generally speaking, for non-throwaway queries it's considered desirable to specify the names of the
columns you want in your queries rather than using *. This is because your application might not be able
to cope if more columns get added into the table.
Expected results:
facid name membercost guestcost initialoutlay monthlymaintenance
4 Massage Room 1 35 80 4000 3000
5 Massage Room 2 35 80 4000 3000
Answer:
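Since the expected output includes every column, a query along these lines does the job:

select *
from cd.facilities
where membercost > 0;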
The FROM clause is used to build up a set of candidate rows to read results from. In our examples so far,
this set of rows has simply been the contents of a table. In future we will explore joining, which allows us
to create much more interesting candidates.
Once we've built up our set of candidate rows, the WHERE clause allows us to filter for the rows we're
interested in - in this case, those with a membercost of more than zero. As you will see in later exercises,
WHERE clauses can have multiple components combined with boolean logic - it's possible to, for
instance, search for facilities with a cost greater than 0 and less than 10. The filtering action of the WHERE
clause on the facilities table is illustrated below:
Expected results:
Answer:
select facid, name, membercost, monthlymaintenance
from cd.facilities
where
membercost > 0 and
(membercost < monthlymaintenance/50.0);
The WHERE clause allows us to filter for the rows we're interested in - in this case, those with a
membercost of more than zero, and less than 1/50th of the monthly maintenance cost. As you can see, the
massage rooms are very expensive to run thanks to staffing costs!
When we want to test for two or more conditions, we use AND to combine them. We can, as you might
expect, use OR to test whether either of a pair of conditions is true.
You might have noticed that this is our first query that combines a WHERE clause with selecting specific
columns. You can see in the image below the effect of this: the intersection of the selected columns and
the selected rows gives us the data to return. This may not seem too interesting now, but as we add in
more complex operations like joins later, you'll see the simple elegance of this behaviour.
Expected results:
Answer:
select *
from cd.facilities
where
name like '%Tennis%';
SQL's LIKE operator provides simple pattern matching on strings. It's pretty much universally
implemented, and is nice and simple to use - it just takes a string with the % character matching any string,
and _ matching any single character. In this case, we're looking for names containing the word 'Tennis', so
putting a % on either side fits the bill.
There's other ways to accomplish this task: Postgres supports regular expressions with the ~ operator, for
example. Use whatever makes you feel comfortable, but do be aware that the LIKE operator is much
more portable between systems.
Expected results:
5 Massage Room 2 35 80 4000 3000
Answer:
select *
from cd.facilities
where
facid in (1,5);
The obvious answer to this question is to use a WHERE clause that looks like where facid = 1 or
facid = 5 . An alternative that is easier with large numbers of possible matches is the IN operator. The
IN operator takes a list of possible values, and matches them against (in this case) the facid. If one of the
values matches, the where clause is true for that row, and the row is returned.
The IN operator is a good early demonstrator of the elegance of the relational model. The argument it
takes is not just a list of values - it's actually a table with a single column. Since queries also return tables,
if you create a query that returns a single column, you can feed those results into an IN operator. To give
a toy example:
select *
from cd.facilities
where
facid in (
select facid from cd.facilities
);
This example is functionally equivalent to just selecting all the facilities, but shows you how to feed the
results of one query into another. The inner query is called a subquery .
How can you produce a list of facilities, with each labelled as 'cheap' or 'expensive' depending on if their
monthly maintenance cost is more than $100? Return the name and monthly maintenance of the facilities
in question.
Expected results:
name cost
Answer:
select name,
case when (monthlymaintenance > 100) then
'expensive'
else
'cheap'
end as cost
from cd.facilities;
This exercise contains a few new concepts. The first is the fact that we're doing computation in the area of
the query between SELECT and FROM . Previously we've only used this to select columns that we want
to return, but you can put anything in here that will produce a single result per returned row - including
subqueries.
The second new concept is the CASE statement itself. CASE is effectively like if/switch statements in
other languages, with a form as shown in the query. To add a 'middling' option, we would simply insert
another when then section.
Finally, there's the AS operator. This is simply used to label columns or expressions, to make them
display more nicely or to make them easier to reference when used as part of a subquery.
Expected results:
memid surname firstname joindate
Answer:
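Based on the explanation below, a query along these lines returns the expected columns for members who joined from the start of September 2012:

select memid, surname, firstname, joindate
from cd.members
where joindate >= '2012-09-01';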
This is our first look at SQL timestamps. They're formatted in descending order of magnitude: YYYY-MM-DD
HH:MM:SS.nnnnnn . We can compare them just like we might a unix timestamp, although getting the
differences between dates is a little more involved (and powerful!). In this case, we've just specified the
date portion of the timestamp. This gets automatically cast by postgres into the full timestamp 2012-09-01
00:00:00 .
Expected results:
surname
Bader
Baker
Boothe
Butters
Coplin
Crumpet
Dare
Farrell
GUEST
Genting
Answer:
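Putting the three concepts described below together, a query like this produces the expected list:

select distinct surname
from cd.members
order by surname
limit 10;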
There's three new concepts here, but they're all pretty simple.
Specifying DISTINCT after SELECT removes duplicate rows from the result set. Note that this
applies to rows : if row A has multiple columns, row B is only equal to it if the values in all columns are
the same. As a general rule, don't use DISTINCT in a willy-nilly fashion - it's not free to remove
duplicates from large query result sets, so do it as-needed.
Specifying ORDER BY (after the FROM and WHERE clauses, near the end of the query) allows results
to be ordered by a column or set of columns (comma separated).
The LIMIT keyword allows you to limit the number of results retrieved. This is useful for getting
results a page at a time, and can be combined with the OFFSET keyword to get following pages. This
is the same approach used by MySQL and is very convenient - you may, unfortunately, find that this
process is a little more complicated in other DBs.
Expected results:
surname
Tennis Court 2
Worthington-Smyth
Badminton Court
Pinker
Dare
Bader
Mackenzie
Crumpet
Massage Room 1
Squash Court
Answer:
select surname
from cd.members
union
select name
from cd.facilities;
The UNION operator does what you might expect: combines the results of two SQL queries into a single
table. The caveat is that both results from the two queries must have the same number of columns and
compatible data types.
UNION removes duplicate rows, while UNION ALL does not. Use UNION ALL by default, unless you
care about duplicate results.
Simple aggregation
You'd like to get the signup date of your last member. How can you retrieve this information?
Expected results:
latest
2012-09-26 18:08:45
Answer:
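A query along these lines returns the most recent signup date:

select max(joindate) as latest
from cd.members;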
This is our first foray into SQL's aggregate functions. They're used to extract information about whole
groups of rows, and allow us to easily ask questions like the one posed in this exercise.
The MAX aggregate function here is very simple: it receives all the possible values for joindate, and outputs
the one that's biggest. There's a lot more power to aggregate functions, which you will come across in
future exercises.
More aggregation
You'd like to get the first and last name of the last member(s) who signed up - not just the date. How can
you do that?
Expected results:
Answer:
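A sketch of the subquery approach described below:

select firstname, surname, joindate
from cd.members
where joindate = (select max(joindate) from cd.members);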
In the suggested approach above, you use a subquery to find out what the most recent joindate is. This
subquery returns a scalar table - that is, a table with a single column and a single row. Since we have just
a single value, we can substitute the subquery anywhere we might put a single constant value. In this case,
we use it to complete the WHERE clause of a query to find a given member.
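You might hope that something simpler would also work - for instance, a query along these lines:

select firstname, surname, max(joindate)
from cd.members;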
Unfortunately, this doesn't work. The MAX function doesn't restrict rows like the WHERE clause does - it
simply takes in a bunch of values and returns the biggest one. The database is then left wondering how to
pair up a long list of names with the single join date that's come out of the max function, and fails.
Instead, you're left having to say 'find me the row(s) which have a join date that's the same as the
maximum join date'.
As mentioned by the hint, there's other ways to get this job done - one example is below. In this approach,
rather than explicitly finding out what the last joined date is, we simply order our members table in
descending order of join date, and pick off the first one. Note that this approach does not cover the
extremely unlikely eventuality of two people joining at the exact same time :-).
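A sketch of that ordering approach:

select firstname, surname, joindate
from cd.members
order by joindate desc
limit 1;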
This topic covers inner, outer, and self joins, as well as spending a little time on subqueries (queries within
queries). If you struggle with these questions, I strongly recommend Learning SQL , by Alan Beaulieu, as a
concise and well-written book on the subject.
Expected results:
starttime
2012-09-18 09:00:00
2012-09-18 17:30:00
2012-09-18 13:30:00
2012-09-18 20:00:00
2012-09-19 09:30:00
2012-09-19 15:00:00
2012-09-19 12:00:00
2012-09-20 15:30:00
2012-09-20 11:30:00
2012-09-20 14:00:00
Answer:
select bks.starttime
from
cd.bookings bks
inner join cd.members mems
on mems.memid = bks.memid
where
mems.firstname='David'
and mems.surname='Farrell';
The most commonly used kind of join is the INNER JOIN . What this does is combine two tables based on
a join expression - in this case, for each member id in the members table, we're looking for matching
values in the bookings table. Where we find a match, a row combining the values for each table is
returned. Note that we've given each table an alias (bks and mems). This is used for two reasons: firstly,
it's convenient, and secondly we might join to the same table several times, requiring us to distinguish
between columns from each different time the table was joined in.
Let's ignore our select and where clauses for now, and focus on what the FROM statement produces. In all
our previous examples, FROM has just been a simple table. What is it now? Another table! This time, it's
produced as a composite of bookings and members. You can see a subset of the output of the join below:
For each member in the members table, the join has found all the matching member ids in the bookings
table. For each match, it's then produced a row combining the row from the members table, and the row
from the bookings table.
Obviously, this is too much information on its own, and any useful question will want to filter it down. In
our query, we use the start of the SELECT clause to pick columns, and the WHERE clause to pick rows,
as illustrated below:
That's all we need to find David's bookings! In general, I encourage you to remember that the output of
the FROM clause is essentially one big table that you then filter information out of. This may sound
inefficient - but don't worry, under the covers the DB will be behaving much more intelligently :-).
One final note: there's two different syntaxes for inner joins. I've shown you the one I prefer, that I find
more consistent with other join types. You'll commonly see a different syntax, shown below:
select bks.starttime
from
cd.bookings bks,
cd.members mems
where
mems.firstname='David'
and mems.surname='Farrell'
and mems.memid = bks.memid;
This is functionally exactly the same as the approved answer. If you feel more comfortable with this syntax,
feel free to use it!
Expected results:
start name
Answer:
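A sketch of the query described below, assuming the day in question is 2012-09-21:

select bks.starttime as start, facs.name
from cd.bookings bks
inner join cd.facilities facs
    on bks.facid = facs.facid
where
    facs.facid in (0,1) -- the tennis courts
    and bks.starttime >= '2012-09-21' -- date assumed from the question
    and bks.starttime < '2012-09-22'
order by bks.starttime;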
This is another INNER JOIN query, although it has a fair bit more complexity in it! The FROM part of the
query is easy - we're simply joining facilities and bookings tables together on the facid. This produces a
table where, for each row in bookings, we've attached detailed information about the facility being
booked.
On to the WHERE component of the query. The checks on starttime are fairly self explanatory - we're
making sure that all the bookings start between the specified dates. Since we're only interested in tennis
courts, we're also using the IN operator to tell the database system to only give us back facility IDs 0 or 1
- the IDs of the courts. There's other ways to express this: We could have used where facs.facid = 0
or facs.facid = 1 , or even where facs.name like 'Tennis%' .
The rest is pretty simple: we SELECT the columns we're interested in, and ORDER BY the start time.
Expected results:
firstname surname
Florence Bader
Timothy Baker
Gerald Butters
Jemima Farrell
Matthew Genting
David Jones
Janice Joplette
Millicent Purview
Tim Rownam
Darren Smith
Tracy Smith
Ponder Stibbons
Burton Tracy
Answer:
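A self join along these lines produces the distinct list of recommenders:

select distinct recs.firstname, recs.surname
from cd.members mems
inner join cd.members recs
    on recs.memid = mems.recommendedby
order by recs.surname, recs.firstname;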
Here's a concept that some people find confusing: you can join a table to itself! This is really useful if you
have columns that reference data in the same table, like we do with recommendedby in cd.members.
If you're having trouble visualising this, remember that this works just the same as any other inner join.
Our join takes each row in members that has a recommendedby value, and looks in members again for the
row which has a matching member id. It then generates an output row combining the two members
entries. This looks like the diagram below:
Note that while we might have two 'surname' columns in the output set, they can be distinguished by their
table aliases. Once we've selected the columns that we want, we simply use DISTINCT to ensure that
there are no duplicates.
Expected results:
memfname memsname recfname recsname
David Farrell
Jemima Farrell
GUEST GUEST
Tim Rownam
Darren Smith
Darren Smith
Tracy Smith
Burton Tracy
Hyacinth Tupperware
Answer:
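A left outer join along these lines produces the expected columns:

select mems.firstname as memfname, mems.surname as memsname,
    recs.firstname as recfname, recs.surname as recsname
from cd.members mems
left outer join cd.members recs
    on recs.memid = mems.recommendedby
order by memsname, memfname;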
Let's introduce another new concept: the LEFT OUTER JOIN . These are best explained by the way in
which they differ from inner joins. Inner joins take a left and a right table, and look for matching rows
based on a join condition ( ON ). When the condition is satisfied, a joined row is produced. A LEFT OUTER
JOIN operates similarly, except that if a given row on the left hand table doesn't match anything, it still
produces an output row. That output row consists of the left hand table row, and a bunch of NULLS in
place of the right hand table row.
This is useful in situations like this question, where we want to produce output with optional data. We
want the names of all members, and the name of their recommender if that person exists . You can't
express that properly with an inner join.
As you may have guessed, there's other outer joins too. The RIGHT OUTER JOIN is much like the LEFT
OUTER JOIN , except that the left hand side of the expression is the one that contains the optional data.
The rarely-used FULL OUTER JOIN treats both sides of the expression as optional.
Expected results:
member facility
Answer:
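A sketch of the double join described below. The filter on facility names is an assumption here (the original exercise restricts the output to the tennis courts), so adjust it to whatever the question asks for:

select distinct mems.firstname || ' ' || mems.surname as member, facs.name as facility
from cd.members mems
inner join cd.bookings bks
    on mems.memid = bks.memid
inner join cd.facilities facs
    on bks.facid = facs.facid
where facs.name like 'Tennis%' -- assumed restriction
order by member, facility;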
This exercise is largely a more complex application of what you've learned in prior questions. It's also the
first time we've used more than one join, which may be a little confusing for some. When reading join
expressions, remember that a join is effectively a function that takes two tables, one labelled the left table,
and the other the right. This is easy to visualise with just one join in the query, but a little more confusing
with two.
Our second INNER JOIN in this query has a right hand side of cd.facilities. That's easy enough to grasp.
The left hand side, however, is the table returned by joining cd.members to cd.bookings. It's important to
emphasise this: the relational model is all about tables. The output of any join is another table. The output
of a query is a table. Single columned lists are tables. Once you grasp that, you've grasped the
fundamental beauty of the model.
As a final note, we do introduce one new thing here: the || operator is used to concatenate strings.
Expected results:
Answer:
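A sketch of the join-based approach, using the question's $30 threshold and the date 2012-09-14:

select mems.firstname || ' ' || mems.surname as member, facs.name as facility,
    case
        when mems.memid = 0 then bks.slots * facs.guestcost
        else bks.slots * facs.membercost
    end as cost
from cd.members mems
inner join cd.bookings bks
    on mems.memid = bks.memid
inner join cd.facilities facs
    on bks.facid = facs.facid
where
    bks.starttime >= '2012-09-14'
    and bks.starttime < '2012-09-15'
    and (
        (mems.memid = 0 and bks.slots * facs.guestcost > 30)
        or (mems.memid != 0 and bks.slots * facs.membercost > 30)
    )
order by cost desc;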
This is a bit of a complicated one! While its more complex logic than we've used previously, there's not an
awful lot to remark upon. The WHERE clause restricts our output to sufficiently costly rows on 2012-09-14,
remembering to distinguish between guests and others. We then use a CASE statement in the column
selections to output the correct cost for the member or guest.
Produce a list of all members, along with their recommender, using no joins
How can you output a list of all members, including the individual who recommended them (if any),
without using any joins? Ensure that there are no duplicates in the list, and that each firstname + surname
pairing is formatted as a column and ordered.
Expected results:
member recommender
Burton Tracy
Darren Smith
David Farrell
GUEST GUEST
Hyacinth Tupperware
Jemima Farrell
Tim Rownam
Tracy Smith
Answer:
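A sketch of the correlated subquery approach described below:

select distinct mems.firstname || ' ' || mems.surname as member,
    (select recs.firstname || ' ' || recs.surname
        from cd.members recs
        where recs.memid = mems.recommendedby
    ) as recommender
from cd.members mems
order by member;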
This exercise marks the introduction of subqueries. Subqueries are, as the name implies, queries within a
query. They're commonly used with aggregates, to answer questions like 'get me all the details of the
member who has spent the most hours on Tennis Court 1'.
In this case, we're simply using the subquery to emulate an outer join. For every value of member, the
subquery is run once to find the name of the individual who recommended them (if any). A subquery that
uses information from the outer query in this way (and thus has to be run for each row in the result set) is
known as a correlated subquery .
How can you produce a list of bookings on the day of 2012-09-14 which will cost the member (or guest)
more than $30? Remember that guests have different costs to members (the listed costs are per half-hour
'slot'), and the guest user is always ID 0. Include in your output the name of the facility, the name of the
member formatted as a single column, and the cost. Order by descending cost.
Expected results:
member facility cost
Answer:
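A sketch of the subquery version, with the cost calculated in an inline view:

select member, facility, cost from (
    select
        mems.firstname || ' ' || mems.surname as member,
        facs.name as facility,
        case
            when mems.memid = 0 then bks.slots * facs.guestcost
            else bks.slots * facs.membercost
        end as cost
    from cd.members mems
    inner join cd.bookings bks
        on mems.memid = bks.memid
    inner join cd.facilities facs
        on bks.facid = facs.facid
    where
        bks.starttime >= '2012-09-14'
        and bks.starttime < '2012-09-15'
) as bookings
where cost > 30
order by cost desc;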
This answer provides a mild simplification to the previous iteration: in the no-subquery version, we had to
calculate the member or guest's cost in both the WHERE clause and the CASE statement. In our new
version, we produce an inline query that calculates the total booking cost for us, allowing the outer query
to simply select the bookings it's looking for. For reference, you may also see subqueries in the FROM
clause referred to as inline views .
Modifying Data
Querying data is all well and good, but at some point you're probably going to want to put data into your
database! This section deals with inserting, updating, and deleting information. Operations that alter your
data like this are collectively known as Data Manipulation Language, or DML.
In previous sections, we returned to you the results of the query you've performed. Since modifications
like the ones we're making in this section don't return any query results, we instead show you the updated
content of the table you're supposed to be working on. You can compare this with the table shown in
'Expected Results' to see how you've done.
If you struggle with these questions, I strongly recommend Learning SQL , by Alan Beaulieu.
facid: 9, Name: 'Spa', membercost: 20, guestcost: 30, initialoutlay: 100000, monthlymaintenance: 800.
Expected results:
Answer:
insert into cd.facilities
(facid, name, membercost, guestcost, initialoutlay, monthlymaintenance)
values (9, 'Spa', 20, 30, 100000, 800);
INSERT INTO VALUES is the simplest way to insert data into a table. There's not a whole lot to
discuss here: VALUES is used to construct a row of data, which the INSERT statement inserts into the
table. It's as simple as that.
You can see that there's two sections in parentheses. The first is part of the INSERT statement, and
specifies the columns that we're providing data for. The second is part of VALUES , and specifies the
actual data we want to insert into each column.
If we're inserting data into every column of the table, as in this example, explicitly specifying the column
names is optional. As long as you fill in data for all columns of the table, in the order they were defined
when you created the table, you can do something like the following:
insert into cd.facilities values (9, 'Spa', 20, 30, 100000, 800);
Generally speaking, for SQL that's going to be reused I tend to prefer being explicit and specifying the
column names.
facid: 9, Name: 'Spa', membercost: 20, guestcost: 30, initialoutlay: 100000, monthlymaintenance: 800.
facid: 10, Name: 'Squash Court 2', membercost: 3.5, guestcost: 17.5, initialoutlay: 5000,
monthlymaintenance: 80.
Expected results:
Answer:
insert into cd.facilities
(facid, name, membercost, guestcost, initialoutlay, monthlymaintenance)
values
(9, 'Spa', 20, 30, 100000, 800),
(10, 'Squash Court 2', 3.5, 17.5, 5000, 80);
VALUES can be used to generate more than one row to insert into a table, as seen in this example.
Hopefully it's clear what's going on here: the output of VALUES is a table, and that table is copied into
cd.facilities, the table specified in the INSERT command.
While you'll most commonly see VALUES when inserting data, Postgres allows you to use VALUES
wherever you might use a SELECT . This makes sense: the output of both commands is a table, it's just
that VALUES is a bit more ergonomic when working with constant data.
Similarly, it's possible to use SELECT wherever you see a VALUES . This means that you can INSERT
the results of a SELECT . For example:
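For instance, this is equivalent to the VALUES-based insert above:

insert into cd.facilities
    (facid, name, membercost, guestcost, initialoutlay, monthlymaintenance)
    select 9, 'Spa', 20, 30, 100000, 800;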
In later exercises you'll see us using INSERT SELECT to generate data to insert based on the
information already in the database.
Name: 'Spa', membercost: 20, guestcost: 30, initialoutlay: 100000, monthlymaintenance: 800.
Expected results:
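Answer:
A query along these lines, computing the next facid from the current maximum, matches the explanation below:

insert into cd.facilities
    (facid, name, membercost, guestcost, initialoutlay, monthlymaintenance)
    select (select max(facid) from cd.facilities) + 1,
        'Spa', 20, 30, 100000, 800;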
In the previous exercises we used VALUES to insert constant data into the facilities table. Here, though,
we have a new requirement: a dynamically generated ID. This gives us a real quality of life improvement,
as we don't have to manually work out what the current largest ID is: the SQL command does it for us.
Since the VALUES clause is only used to supply constant data, we need to replace it with a query instead.
The SELECT statement is fairly simple: there's an inner subquery that works out the next facid based on
the largest current id, and the rest is just constant data. The output of the statement is a row that we insert
into the facilities table.
While this works fine in our simple example, it's not how you would generally implement an incrementing
ID in the real world. Postgres provides SERIAL types that are auto-filled with the next ID when you insert
a row. As well as saving us effort, these types are also safer: unlike the answer given in this exercise,
there's no need to worry about concurrent operations generating the same ID.
Expected results:
Answer:
update cd.facilities
set initialoutlay = 10000
where facid = 1;
The UPDATE statement is used to alter existing data. If you're familiar with SELECT queries, it's pretty
easy to read: the WHERE clause works in exactly the same fashion, allowing us to filter the set of rows we
want to work with. These rows are then modified according to the specifications of the SET clause: in this
case, setting the initial outlay.
The WHERE clause is extremely important. It's easy to get it wrong or even omit it, with disastrous results.
Consider the following command:
update cd.facilities
set initialoutlay = 10000;
There's no WHERE clause to filter for the rows we're interested in. The result of this is that the update
runs on every row in the table! This is rarely what we want to happen.
Answer:
update cd.facilities
set
membercost = 6,
guestcost = 30
where facid in (0,1);
The SET clause accepts a comma separated list of values that you want to update.
Expected results:
Answer:
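A sketch of the subquery approach described below - assuming, as in the original exercise, that Tennis Court 2 (facid 1) should cost 10% more than Tennis Court 1 (facid 0):

update cd.facilities
set
    membercost = (select membercost * 1.1 from cd.facilities where facid = 0),
    guestcost = (select guestcost * 1.1 from cd.facilities where facid = 0)
where facid = 1; -- facility ids assumed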
Updating columns based on calculated data is not too intrinsically difficult: we can do so pretty easily using
subqueries. You can see this approach in our selected answer.
As the number of columns we want to update increases, standard SQL can start to get pretty awkward: you
don't want to be specifying a separate subquery for each of 15 different column updates. Postgres
provides a nonstandard extension to SQL called UPDATE FROM that addresses this: it allows you to
supply a FROM clause to generate values for use in the SET clause. Example below:
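A sketch of the UPDATE ... FROM form, under the same assumption about the two tennis courts:

update cd.facilities facs
set
    membercost = facs2.membercost * 1.1,
    guestcost = facs2.guestcost * 1.1
from (select * from cd.facilities where facid = 0) facs2 -- facility ids assumed
where facs.facid = 1;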
Expected results:
bookid facid memid starttime slots
Answer:
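The unqualified form looks like this:

delete from cd.bookings;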
The DELETE statement does what it says on the tin: deletes rows from the table. Here, we show the
command in its simplest form, with no qualifiers. In this case, it deletes everything from the table.
Obviously, you should be careful with your deletes and make sure they're always limited - we'll see how to
do that in the next exercise.
truncate cd.bookings;
TRUNCATE also deletes everything in the table, but does so using a quicker underlying mechanism. It's
not perfectly safe in all circumstances , though, so use judiciously. When in doubt, use DELETE .
Expected results:
memid surname firstname address zipcode telephone recommendedby joindate
0 GUEST GUEST GUEST 0 (000) 000-0000  2012-07-01 00:00:00
1 Smith Darren 8 Bloomsbury Close, Boston 4321 555-555-5555  2012-07-02 12:02:05
2 Smith Tracy 8 Bloomsbury Close, New York 4321 555-555-5555  2012-07-02 12:08:23
3 Rownam Tim 23 Highway Way, Boston 23423 (844) 693-0723  2012-07-03 09:32:15
4 Joplette Janice 20 Crossing Road, New York 234 (833) 942-4710 1 2012-07-03 10:25:05
5 Butters Gerald 1065 Huntingdon Avenue, Boston 56754 (844) 078-4130 1 2012-07-09 10:44:09
6 Tracy Burton 3 Tunisia Drive, Boston 45678 (822) 354-9973  2012-07-15 08:52:55
7 Dare Nancy 6 Hunting Lodge Way, Boston 10383 (833) 776-4001 4 2012-07-25 08:59:12
8 Boothe Tim 3 Bloomsbury Close, Reading, 00234 234 (811) 433-2547 3 2012-07-25 16:02:35
9 Stibbons Ponder 5 Dragons Way, Winchester 87630 (833) 160-3900 6 2012-07-25 17:09:05
10 Owen Charles 52 Cheshire Grove, Winchester, 28563 28563 (855) 542-5251 1 2012-08-03 19:42:37
11 Jones David 976 Gnats Close, Reading 33862 (844) 536-8036 4 2012-08-06 16:32:55
12 Baker Anne 55 Powdery Street, Boston 80743 844-076-5141 9 2012-08-10 14:23:22
13 Farrell Jemima 103 Firth Avenue, North Reading 57392 (855) 016-0163  2012-08-10 14:28:01
14 Smith Jack 252 Binkington Way, Boston 69302 (822) 163-3254 1 2012-08-10 16:22:05
15 Bader Florence 264 Ursula Drive, Westford 84923 (833) 499-3527 9 2012-08-10 17:52:03
16 Baker Timothy 329 James Street, Reading 58393 833-941-0824 13 2012-08-15 10:34:25
17 Pinker David 5 Impreza Road, Boston 65332 811 409-6734 13 2012-08-16 11:32:47
20 Genting Matthew 4 Nunnington Place, Wingfield, Boston 52365 (811) 972-1377 5 2012-08-19 14:55:55
21 Mackenzie Anna 64 Perkington Lane, Reading 64577 (822) 661-2898 1 2012-08-26 09:32:05
24 Sarwin Ramnaresh 12 Bullington Lane, Boston 65464 (822) 413-1470 15 2012-09-01 08:44:42
26 Jones Douglas 976 Gnats Close, Reading 11986 844 536-8036 11 2012-09-02 18:43:05
27 Rumney Henrietta 3 Burkington Plaza, Boston 78533 (822) 989-8876 20 2012-09-05 08:42:35
28 Farrell David 437 Granite Farm Road, Westford 43532 (855) 755-9876  2012-09-15 08:22:05
29 Worthington-Smyth Henry 55 Jagbi Way, North Reading 97676 (855) 894-3758 2 2012-09-17 12:27:15
30 Purview Millicent 641 Drudgery Close, Burnington, Boston 34232 (855) 941-9786 2 2012-09-18 19:04:01
35 Hunt John 5 Bullington Lane, Boston 54333 (899) 720-6978 30 2012-09-19 11:32:45
36 Crumpet Erica Crimson Road, North Reading 75655 (811) 732-4816 2 2012-09-22 08:36:38
Answer:
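Assuming, as in the original exercise, that the member to remove has id 37:

delete from cd.members where memid = 37; -- member id assumed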
This exercise is a small increment on our previous one. Instead of deleting all bookings, this time we want
to be a bit more targeted, and delete a single member that has never made a booking. To do this, we
simply have to add a WHERE clause to our command, specifying the member we want to delete. You can
see the parallels with SELECT and UPDATE statements here.
There's one interesting wrinkle here. Try this command out, but substituting in member id 0 instead. This
member has made many bookings, and you'll find that the delete fails with an error about a foreign key
constraint violation. This is an important concept in relational databases, so let's explore a little further.
Foreign keys are a mechanism for defining relationships between columns of different tables. In our case
we use them to specify that the memid column of the bookings table is related to the memid column of
the members table. The relationship (or 'constraint') specifies that for a given booking, the member
specified in the booking must exist in the members table. It's useful to have this guarantee enforced by
the database: it means that code using the database can rely on the presence of the member. It's hard
(even impossible) to enforce this at higher levels: concurrent operations can interfere and leave your
database in a broken state.
PostgreSQL supports various different kinds of constraints that allow you to enforce structure upon your
data. For more information on constraints, check out the PostgreSQL documentation on foreign keys
Expected results:
memid surname firstname address zipcode telephone recommendedby joindate
0 GUEST GUEST GUEST 0 (000) 000-0000  2012-07-01 00:00:00
1 Smith Darren 8 Bloomsbury Close, Boston 4321 555-555-5555  2012-07-02 12:02:05
2 Smith Tracy 8 Bloomsbury Close, New York 4321 555-555-5555  2012-07-02 12:08:23
3 Rownam Tim 23 Highway Way, Boston 23423 (844) 693-0723  2012-07-03 09:32:15
4 Joplette Janice 20 Crossing Road, New York 234 (833) 942-4710 1 2012-07-03 10:25:05
5 Butters Gerald 1065 Huntingdon Avenue, Boston 56754 (844) 078-4130 1 2012-07-09 10:44:09
6 Tracy Burton 3 Tunisia Drive, Boston 45678 (822) 354-9973  2012-07-15 08:52:55
7 Dare Nancy 6 Hunting Lodge Way, Boston 10383 (833) 776-4001 4 2012-07-25 08:59:12
8 Boothe Tim 3 Bloomsbury Close, Reading, 00234 234 (811) 433-2547 3 2012-07-25 16:02:35
9 Stibbons Ponder 5 Dragons Way, Winchester 87630 (833) 160-3900 6 2012-07-25 17:09:05
10 Owen Charles 52 Cheshire Grove, Winchester, 28563 28563 (855) 542-5251 1 2012-08-03 19:42:37
11 Jones David 976 Gnats Close, Reading 33862 (844) 536-8036 4 2012-08-06 16:32:55
12 Baker Anne 55 Powdery Street, Boston 80743 844-076-5141 9 2012-08-10 14:23:22
13 Farrell Jemima 103 Firth Avenue, North Reading 57392 (855) 016-0163  2012-08-10 14:28:01
14 Smith Jack 252 Binkington Way, Boston 69302 (822) 163-3254 1 2012-08-10 16:22:05
15 Bader Florence 264 Ursula Drive, Westford 84923 (833) 499-3527 9 2012-08-10 17:52:03
16 Baker Timothy 329 James Street, Reading 58393 833-941-0824 13 2012-08-15 10:34:25
17 Pinker David 5 Impreza Road, Boston 65332 811 409-6734 13 2012-08-16 11:32:47
20 Genting Matthew 4 Nunnington Place, Wingfield, Boston 52365 (811) 972-1377 5 2012-08-19 14:55:55
21 Mackenzie Anna 64 Perkington Lane, Reading 64577 (822) 661-2898 1 2012-08-26 09:32:05
24 Sarwin Ramnaresh 12 Bullington Lane, Boston 65464 (822) 413-1470 15 2012-09-01 08:44:42
26 Jones Douglas 976 Gnats Close, Reading 11986 844 536-8036 11 2012-09-02 18:43:05
27 Rumney Henrietta 3 Burkington Plaza, Boston 78533 (822) 989-8876 20 2012-09-05 08:42:35
28 Farrell David 437 Granite Farm Road, Westford 43532 (855) 755-9876  2012-09-15 08:22:05
29 Worthington-Smyth Henry 55 Jagbi Way, North Reading 97676 (855) 894-3758 2 2012-09-17 12:27:15
30 Purview Millicent 641 Drudgery Close, Burnington, Boston 34232 (855) 941-9786 2 2012-09-18 19:04:01
35 Hunt John 5 Bullington Lane, Boston 54333 (899) 720-6978 30 2012-09-19 11:32:45
36 Crumpet Erica Crimson Road, North Reading 75655 (811) 732-4816 2 2012-09-22 08:36:38
Answer:
delete from cd.members where memid not in (select memid from cd.bookings);
We can use subqueries to determine whether a row should be deleted or not. There's a couple of standard
ways to do this. In our featured answer, the subquery produces a list of all the different member ids in the
cd.bookings table. If a row in the table isn't in the list generated by the subquery, it gets deleted.
An alternative is to use a correlated subquery . Where our previous example runs a large subquery once,
the correlated approach instead specifies a smaller subquery to run against every row.
delete from cd.members mems where not exists (select 1 from cd.bookings where memid
= mems.memid);
The two different forms can have different performance characteristics. Under the hood, your database
engine is free to transform your query to execute it in a correlated or uncorrelated fashion, though, so
things can be a little hard to predict.
Aggregation
Aggregation is one of those capabilities that really make you appreciate the power of relational database
systems. It allows you to move beyond merely persisting your data, into the realm of asking truly
interesting questions that can be used to inform decision making. This category covers aggregation at
length, making use of standard grouping as well as more recent window functions.
If you struggle with these questions, I strongly recommend Learning SQL , by Alan Beaulieu and SQL
Cookbook by Anthony Molinaro. In fact, get the latter anyway - it'll take you beyond anything you find on
this site, and on multiple different database systems to boot.
Expected results:
count
9
Answer:
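A query along these lines produces the count:

select count(*) from cd.facilities;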
Aggregation starts out pretty simply! The SQL above selects everything from our facilities table, and then
counts the number of rows in the result set. The count function has a variety of uses: count(*) counts every row, count(somecolumn) counts only the rows where that column is non-null, and count(distinct somecolumn) counts the number of different values in that column.
The basic idea of an aggregate function is that it takes in a column of data, performs some function upon it,
and outputs a scalar (single) value. There are a bunch more aggregation functions, including MAX , MIN ,
SUM , and AVG . These all do pretty much what you'd expect from their names :-).
One aspect of aggregate functions that people often find confusing is in queries like the below:
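For instance, something like this (which will fail):

select facid, count(*) from cd.facilities;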
Try it out, and you'll find that it doesn't work. This is because count(*) wants to collapse the facilities table
into a single value - unfortunately, it can't do that, because there's a lot of different facids in cd.facilities -
Postgres doesn't know which facid to pair the count with.
Instead, if you wanted a query that returns all the facids along with a count on each row, you can break
the aggregation out into a subquery as below:
select facid,
(select count(*) from cd.facilities)
from cd.facilities
When we have a subquery that returns a scalar value like this, Postgres knows to simply repeat the value
for every row in cd.facilities.
Expected results:
count
Answer:
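A sketch of the query, assuming the question's threshold is a guest cost of 10 or more:

select count(*)
from cd.facilities
where guestcost >= 10; -- threshold assumed from the original question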
This one is only a simple modification to the previous question: we need to weed out the inexpensive
facilities. This is easy to do using a WHERE clause. Our aggregation can now only see the expensive
facilities.
Count the number of recommendations each member makes
Produce a count of the number of recommendations each member has made. Order by member ID.
Expected results:
recommendedby count
1 5
2 3
3 1
4 2
5 1
6 1
9 2
11 1
13 2
15 1
16 1
20 1
30 1
Answer:
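A query along these lines produces the expected counts:

select recommendedby, count(*)
from cd.members
where recommendedby is not null
group by recommendedby
order by recommendedby;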
Previously, we've seen that aggregation functions are applied to a column of values, and convert them into
an aggregated scalar value. This is useful, but we often find that we don't want just a single aggregated
result: for example, instead of knowing the total amount of money the club has made this month, I might
want to know how much money each different facility has made, or which times of day were most
lucrative.
In order to support this kind of behaviour, SQL has the GROUP BY construct. What this does is batch the
data together into groups, and run the aggregation function separately for each group. When you specify a
GROUP BY , the database produces an aggregated value for each distinct value in the supplied columns.
In this case, we're saying 'for each distinct value of recommendedby, get me the number of times that
value appears'.
Expected results:
facid Total Slots
0 1320
1 1278
2 1209
3 830
4 1404
5 228
6 1104
7 908
8 911
Answer:
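A query along these lines produces the totals:

select facid, sum(slots) as "Total Slots"
from cd.bookings
group by facid
order by facid;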
Other than the fact that we've introduced the SUM aggregate function, there's not a great deal to say
about this exercise. For each distinct facility id, the SUM function adds together everything in the slots
column.
Expected results:
facid Total Slots
5 122
3 422
7 426
8 471
6 540
2 570
1 588
0 591
4 648
Answer:
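A sketch of the query, assuming the month in question is September 2012 (which matches the expected totals):

select facid, sum(slots) as "Total Slots"
from cd.bookings
where
    starttime >= '2012-09-01'
    and starttime < '2012-10-01'
group by facid
order by sum(slots);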
This is only a minor alteration of our previous example. Remember that aggregation happens after the
WHERE clause is evaluated: we thus use the WHERE to restrict the data we aggregate over, and our
aggregation only sees data from a single month.
Expected results:
facid month Total Slots
0 7 270
0 8 459
0 9 591
1 7 207
1 8 483
1 9 588
2 7 180
2 8 459
2 9 570
3 7 104
3 8 304
3 9 422
4 7 264
4 8 492
4 9 648
5 7 24
5 8 82
5 9 122
6 7 164
6 8 400
6 9 540
7 7 156
7 8 326
7 9 426
8 7 117
8 8 322
8 9 471
Answer:
select facid, extract(month from starttime) as month, sum(slots) as "Total Slots"
from cd.bookings
where
starttime >= '2012-01-01'
and starttime < '2013-01-01'
group by facid, month
order by facid, month;
The main piece of new functionality in this question is the EXTRACT function. EXTRACT allows you to
get individual components of a timestamp, like day, month, year, etc. We group by the output of this
function to provide per-month values. An alternative, if we needed to distinguish between the same
month in different years, is to make use of the DATE_TRUNC function, which truncates a date to a given
granularity.
It's also worth noting that this is the first time we've truly made use of the ability to group by more than
one column.
Find the count of members who have made at least one booking
Find the total number of members who have made at least one booking.
Expected results:
count
30
Answer:
Your first instinct may be to go for a subquery here. Something like the below:
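For instance:

select count(*) from
    (select distinct memid from cd.bookings) as mems;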
This does work perfectly well, but we can simplify a touch with the help of a little extra knowledge in the
form of COUNT DISTINCT . This does what you might expect, counting the distinct values in the passed
column.
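That gives us something along these lines:

select count(distinct memid) from cd.bookings;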
Expected results:
facid Total Slots
0 1320
1 1278
2 1209
4 1404
6 1104
Answer:
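A query along these lines produces the expected list:

select facid, sum(slots) as "Total Slots"
from cd.bookings
group by facid
having sum(slots) > 1000
order by facid;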
It turns out that there's actually an SQL keyword designed to help with the filtering of output from
aggregate functions. This keyword is HAVING .
The behaviour of HAVING is easily confused with that of WHERE . The best way to think about it is that in
the context of a query with an aggregate function, WHERE is used to filter what data gets input into the
aggregate function, while HAVING is used to filter the data once it is output from the function. Try
experimenting to explore this difference!
Expected results:
name revenue
Answer:
select facs.name, sum(slots * case
when memid = 0 then facs.guestcost
else facs.membercost
end) as revenue
from cd.bookings bks
inner join cd.facilities facs
on bks.facid = facs.facid
group by facs.name
order by revenue;
The only real complexity in this query is that guests (member ID 0) have a different cost to everyone else.
We use a case statement to produce the cost for each session, and then sum each of those sessions,
grouped by facility.
Expected results:
name revenue
Answer:
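A sketch of the subquery-based approach described below:

select name, revenue from (
    select facs.name, sum(case
            when memid = 0 then slots * facs.guestcost
            else slots * facs.membercost
        end) as revenue
    from cd.bookings bks
    inner join cd.facilities facs
        on bks.facid = facs.facid
    group by facs.name
) as agg
where revenue < 1000
order by revenue;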
You may well have tried to use the HAVING keyword we introduced in an earlier exercise, producing
something like below:
select facs.name, sum(case
when memid = 0 then slots * facs.guestcost
else slots * membercost
end) as revenue
from cd.bookings bks
inner join cd.facilities facs
on bks.facid = facs.facid
group by facs.name
having revenue < 1000
order by revenue;
Unfortunately, this doesn't work! You'll get an error along the lines of ERROR: column "revenue"
does not exist . Postgres, unlike some other RDBMSs like SQL Server and MySQL, doesn't support
putting column names in the HAVING clause. This means that for this query to work, you'd have to
produce something like below:
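That is, repeating the whole CASE expression inside the HAVING clause - something like:

select facs.name, sum(case
        when memid = 0 then slots * facs.guestcost
        else slots * facs.membercost
    end) as revenue
from cd.bookings bks
inner join cd.facilities facs
    on bks.facid = facs.facid
group by facs.name
having sum(case
        when memid = 0 then slots * facs.guestcost
        else slots * facs.membercost
    end) < 1000
order by revenue;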
Having to repeat significant calculation code like this is messy, so our anointed solution instead just wraps
the main query body as a subquery, and selects from it using a WHERE clause. In general, I recommend
using HAVING for simple queries, as it increases clarity. Otherwise, this subquery approach is often easier
to use.
Output the facility id that has the highest number of slots booked
Output the facility id that has the highest number of slots booked. For bonus points, try a version without a
LIMIT clause. This version will probably look messy!
Expected results:
4 1404
Answer:
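A query along these lines, ordering the totals and keeping only the first row, does the job:

select facid, sum(slots) as "Total Slots"
from cd.bookings
group by facid
order by sum(slots) desc
limit 1;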
It's worth realising, though, that this method has a significant weakness. In the event of a tie, we will still
only get one result! To get all the relevant results, we might try using the MAX aggregate function,
something like below:
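For instance (this will fail):

select facid, max(total) from (
    select facid, sum(slots) as total
    from cd.bookings
    group by facid
) as sub;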
The intent of this query is to get the highest totalslots value and its associated facid(s). Unfortunately, this
just won't work! In the event of multiple facids having the same number of slots booked, it would be
ambiguous which facid should be paired up with the single (or scalar ) value coming out of the MAX
function. This means that Postgres will tell you that facid ought to be in a GROUP BY section, which won't
produce the results we're looking for.
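A working LIMIT-free version, as described below, looks something like this:

select facid, sum(slots) as totalslots
from cd.bookings
group by facid
having sum(slots) = (
    select max(sum2.totalslots) from (
        select sum(slots) as totalslots
        from cd.bookings
        group by facid
    ) as sum2
);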
The query produces a list of facility IDs and number of slots used, and then uses a HAVING clause that
works out the maximum totalslots value. We're essentially saying: 'produce a list of facids and their
number of slots booked, and filter out all the ones that don't have a number of slots booked equal to the
maximum.'
Useful as HAVING is, however, our query is pretty ugly. To improve on that, let's introduce another new
concept: Common Table Expressions (CTEs). CTEs can be thought of as allowing you to define a database
view inline in your query. It's really helpful in situations like this, where you're having to repeat yourself a
lot.
CTEs are declared in the form WITH CTEName as (SQL-Expression) . You can see our query
redefined to use a CTE below:
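A sketch of the CTE-based version:

with sum as (
    select facid, sum(slots) as totalslots
    from cd.bookings
    group by facid
)
select facid, totalslots
from sum
where totalslots = (select max(totalslots) from sum);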
You can see that we've factored out our repeated selections from cd.bookings into a single CTE, and made
the query a lot simpler to read in the process!
BUT WAIT. There's more. It's also possible to complete this problem using Window Functions. We'll leave
these until later, but even better solutions to problems like these are available.
That's a lot of information for a single exercise. Don't worry too much if you don't get it all right now - we'll
reuse these concepts in later exercises.
List the total slots booked per facility per month, Part 2
Produce a list of the total number of slots booked per facility per month in the year of 2012. In this version,
include output rows containing totals for all months per facility, and a total for all months for all facilities.
The output table should consist of facility id, month and slots, sorted by the id and month. When
calculating the aggregated values for all months and all facids, return null values in the month and facid
columns.
Expected results:
facid month slots
0 7 270
0 8 459
0 9 591
0 1320
1 7 207
1 8 483
1 9 588
1 1278
2 7 180
2 8 459
2 9 570
2 1209
3 7 104
3 8 304
3 9 422
3 830
4 7 264
4 8 492
4 9 648
4 1404
5 7 24
5 8 82
5 9 122
5 228
6 7 164
6 8 400
6 9 540
6 1104
7 7 156
7 8 326
7 9 426
7 908
8 7 117
8 8 322
8 9 471
8 910
9191
Answer:
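A sketch of the ROLLUP-based answer described at the end of this explanation:

select facid, extract(month from starttime) as month, sum(slots) as slots
from cd.bookings
where
    starttime >= '2012-01-01'
    and starttime < '2013-01-01'
group by rollup(facid, extract(month from starttime))
order by facid, month;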
When we are doing data analysis, we sometimes want to perform multiple levels of aggregation to allow
ourselves to 'zoom' in and out to different depths. In this case, we might be looking at each facility's
overall usage, but then want to dive in to see how they've performed on a per-month basis. Using the SQL
we know so far, it's quite cumbersome to produce a single query that does what we want - we effectively
have to resort to concatenating multiple queries using UNION ALL :
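Something along these lines:

select facid, extract(month from starttime) as month, sum(slots) as slots
from cd.bookings
where starttime >= '2012-01-01' and starttime < '2013-01-01'
group by facid, month
union all
select facid, null, sum(slots)
from cd.bookings
where starttime >= '2012-01-01' and starttime < '2013-01-01'
group by facid
union all
select null, null, sum(slots)
from cd.bookings
where starttime >= '2012-01-01' and starttime < '2013-01-01'
order by facid, month;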
As you can see, each subquery performs a different level of aggregation, and we just combine the results.
We can clean this up a lot by factoring out commonalities using a CTE:
with bookings as (
select facid, extract(month from starttime) as month, slots
from cd.bookings
where
starttime >= '2012-01-01'
and starttime < '2013-01-01'
)
select facid, month, sum(slots) from bookings group by facid, month
union all
select facid, null, sum(slots) from bookings group by facid
union all
select null, null, sum(slots) from bookings
order by facid, month;
This version is not excessively hard on the eyes, but it becomes cumbersome as the number of aggregation
columns increases. Fortunately, PostgreSQL 9.5 introduced support for the ROLLUP operator, which we've
used to simplify our accepted answer.
ROLLUP produces a hierarchy of aggregations in the order passed into it: for example, ROLLUP(facid,
month) outputs aggregations on (facid, month), (facid), and (). If we wanted an aggregation of all facilities
for a month (instead of all months for a facility) we'd have to reverse the order, using ROLLUP(month,
facid) . Alternatively, if we instead want all possible permutations of the columns we pass in, we can use
CUBE rather than ROLLUP . This will produce (facid, month), (month), (facid), and ().
ROLLUP and CUBE are special cases of GROUPING SETS . GROUPING SETS allow you to specify the
exact aggregation permutations you want: you could, for example, ask for just (facid, month) and (facid),
skipping the top-level aggregation.
Expected results:
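Answer:
A sketch matching the explanation below; the output column name and exact format string are assumptions:

select facs.facid, facs.name,
    trim(to_char(sum(bks.slots)/2.0, '9999999999999999D99')) as "Total Hours"
from cd.bookings bks
inner join cd.facilities facs
    on facs.facid = bks.facid
group by facs.facid, facs.name
order by facs.facid;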
There's a few little pieces of interest in this question. Firstly, you can see that our aggregation works just
fine when we join to another table on a 1:1 basis. Also note that we group by both facs.facid and
facs.name. This might seem odd: after all, since facid is the primary key of the facilities table, each
facid has exactly one name, and grouping by both fields is the same as grouping by facid alone. In fact,
you'll find that if you remove facs.name from the GROUP BY clause, the query works just fine:
Postgres works out that this 1:1 mapping exists, and doesn't insist that we group by both columns.
Unfortunately, depending on which database system we use, validation might not be so smart, and may
not realise that the mapping is strictly 1:1. That being the case, if there were multiple names for each
facid and we hadn't grouped by name , the DBMS would have to choose between multiple (equally
valid) choices for the name . Since this is invalid, the database system will insist that we group by both
fields. In general, I recommend grouping by all columns you don't have an aggregate function on: this will
ensure better cross-platform compatibility.
Next up is the division. Those of you familiar with MySQL may be aware that integer divisions are
automatically cast to floats. Postgres is a little more traditional in this respect, and expects you to tell it if
you want a floating point division. You can do that easily in this case by dividing by 2.0 rather than 2.
Finally, let's take a look at formatting. The TO_CHAR function converts values to character strings. It takes
a formatting string, which we specify as (up to) lots of numbers before the decimal place, decimal place,
and two numbers after the decimal place. The output of this function can be prepended with a space,
which is why we include the outer TRIM function.
Expected results:
surname firstname memid starttime
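Answer:
A query along these lines produces each member's first booking on or after 1 September 2012:

select mems.surname, mems.firstname, mems.memid, min(bks.starttime) as starttime
from cd.bookings bks
inner join cd.members mems
    on mems.memid = bks.memid
where bks.starttime >= '2012-09-01'
group by mems.surname, mems.firstname, mems.memid
order by mems.memid;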
This answer demonstrates the use of aggregate functions on dates. MIN works exactly as you'd expect,
pulling out the lowest possible date in the result set. To make this work, we need to ensure that the result
set only contains dates from September onwards. We do this using the WHERE clause.
You might typically use a query like this to find a customer's next booking. You can use this by replacing
the date '2012-09-01' with the function now()
Produce a list of member names, with each row containing the total
member count
Produce a list of member names, with each row containing the total member count. Order by join date.
Expected results:
count firstname surname
31 GUEST GUEST
31 Darren Smith
31 Tracy Smith
31 Tim Rownam
31 Janice Joplette
31 Gerald Butters
31 Burton Tracy
31 Nancy Dare
31 Tim Boothe
31 Ponder Stibbons
31 Charles Owen
31 David Jones
31 Anne Baker
31 Jemima Farrell
31 Jack Smith
31 Florence Bader
31 Timothy Baker
31 David Pinker
31 Matthew Genting
31 Anna Mackenzie
31 Joan Coplin
31 Ramnaresh Sarwin
31 Douglas Jones
31 Henrietta Rumney
31 David Farrell
31 Henry Worthington-Smyth
31 Millicent Purview
31 Hyacinth Tupperware
31 John Hunt
31 Erica Crumpet
31 Darren Smith
Answer:
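The window-function version discussed below looks something like this:

select count(*) over(), firstname, surname
from cd.members
order by joindate;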
Using the knowledge we've built up so far, the most obvious answer to this is below. We use a subquery
because otherwise SQL will require us to group by firstname and surname, producing a different result to
what we're looking for.
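For instance:

select (select count(*) from cd.members) as count, firstname, surname
from cd.members
order by joindate;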
There's nothing at all wrong with this answer, but we've chosen a different approach to introduce a new
concept called window functions. Window functions provide enormously powerful capabilities, in a form
often more convenient than the standard aggregation functions. While this exercise is only a toy, we'll be
working on more complicated examples in the near future.
Window functions operate on the result set of your (sub-)query, after the WHERE clause and all standard
aggregation. They operate on a window of data. By default this is unrestricted: the entire result set, but it
can be restricted to provide more useful results. For example, suppose instead of wanting the count of all
members, we want the count of all members who joined in the same month as that member:
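Something along these lines:

select count(*) over(partition by date_trunc('month', joindate)),
    firstname, surname
from cd.members
order by joindate;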
In this example, we partition the data by month. For each row the window function operates over, the
window is any rows that have a joindate in the same month. The window function thus produces a count
of the number of members who joined in that month.
You can go further. Imagine if, instead of the total number of members who joined that month, you want
to know what number joinee they were that month. You can do this by adding in an ORDER BY to the
window function:
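For example:

select count(*) over(partition by date_trunc('month', joindate) order by joindate),
    firstname, surname
from cd.members
order by joindate;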
The ORDER BY changes the window again. Instead of the window for each row being the entire partition,
the window goes from the start of the partition to the current row, and not beyond. Thus, for the first
member who joins in a given month, the count is 1. For the second, the count is 2, and so on.
One final thing that's worth mentioning about window functions: you can have multiple unrelated ones in
the same query. Try out the query below for an example - you'll see the numbers for the members going in
opposite directions! This flexibility can lead to more concise, readable, and maintainable queries.
select count(*) over(partition by date_trunc('month',joindate) order by joindate asc),
    count(*) over(partition by date_trunc('month',joindate) order by joindate desc),
    firstname, surname
from cd.members
order by joindate
Window functions are extraordinarily powerful, and they will change the way you write and think about
SQL. Make good use of them!
Expected results:
row_number firstname surname
1 GUEST GUEST
2 Darren Smith
3 Tracy Smith
4 Tim Rownam
5 Janice Joplette
6 Gerald Butters
7 Burton Tracy
8 Nancy Dare
9 Tim Boothe
10 Ponder Stibbons
11 Charles Owen
12 David Jones
13 Anne Baker
14 Jemima Farrell
15 Jack Smith
16 Florence Bader
17 Timothy Baker
18 David Pinker
19 Matthew Genting
20 Anna Mackenzie
21 Joan Coplin
22 Ramnaresh Sarwin
23 Douglas Jones
24 Henrietta Rumney
25 David Farrell
26 Henry Worthington-Smyth
27 Millicent Purview
28 Hyacinth Tupperware
29 John Hunt
30 Erica Crumpet
31 Darren Smith
Answer:
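A query along these lines produces the numbered list:

select row_number() over(order by joindate), firstname, surname
from cd.members
order by joindate;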
This exercise is a simple bit of window function practise! You could just as easily use count(*)
over(order by joindate) here, so don't worry if you used that instead.
In this query, we don't define a partition, meaning that the partition is the entire dataset. Since we define
an order for the window function, for any given row the window is: start of the dataset -> current row.
Output the facility id that has the highest number of slots booked, again
Output the facility id that has the highest number of slots booked. Ensure that in the event of a tie, all
tieing results get output.
Expected results:
facid total
4 1404
Answer:
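A sketch of the window-function approach described below:

select facid, total from (
    select facid, sum(slots) as total,
        rank() over (order by sum(slots) desc) as rank
    from cd.bookings
    group by facid
) as ranked
where rank = 1;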
You may recall that this is a problem we've already solved in an earlier exercise. We came up with an
answer using a HAVING clause to keep only the facilities whose total matches the maximum, which we then cut down using CTEs.
Once we've cleaned it up, this solution is perfectly adequate. Explaining how the query works makes it
seem a little odd, though - 'find the number of slots booked by the best facility. Calculate the total slots
booked for each facility, and return only the rows where the slots booked are the same as for the best'.
Wouldn't it be nicer to be able to say 'calculate the number of slots booked for each facility, rank them, and
pick out any at rank 1'?
Fortunately, window functions allow us to do this - although it's fair to say that doing so is not trivial to the
untrained eye. The first key piece of information is the existence of the RANK function. This ranks values based
on the ORDER BY that is passed to it. If there's a tie for (say) second place, the next entry gets ranked at
position 4. So, what we need to do is get the number of slots for each facility, rank them, and pick off the
ones at the top rank. A first pass at this might look something like the below:
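Something along these lines - a sketch rather than the exercise's exact text:
select facid, total from (
	select facid, total, rank() over (order by total desc) as rank from (
		select facid, sum(slots) as total
			from cd.bookings
			group by facid
		) as sumslots
	) as ranked
where rank = 1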
The inner query calculates the total slots booked, the middle one ranks them, and the outer one creams
off the top ranked. We can actually tidy this up a little: recall that window functions get applied pretty late
in the SELECT statement, after aggregation. That being the case, we can move the aggregation into the
ORDER BY part of the window function, as shown in the approved answer.
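That tidied-up form looks roughly like this (the approved answer may differ in minor details):
select facid, total from (
	select facid, sum(slots) as total,
		rank() over (order by sum(slots) desc) as rank
		from cd.bookings
		group by facid
	) as ranked
where rank = 1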
While the window function approach isn't massively simpler in terms of lines of code, it arguably makes
more semantic sense.
Expected results:
firstname surname hours rank
Jemima Farrell 90 18
David Pinker 80 19
Ramnaresh Sarwin 80 19
Matthew Genting 70 21
Joan Coplin 50 22
David Farrell 30 23
Henry Worthington-Smyth 30 23
John Hunt 20 25
Douglas Jones 20 25
Millicent Purview 20 25
Henrietta Rumney 20 25
Erica Crumpet 10 29
Hyacinth Tupperware 10 29
Answer:
This answer isn't a great stretch over our previous exercise, although it does illustrate the function of
RANK better. You can see that some of the clubgoers have an equal rounded number of hours booked in,
and their rank is the same. If position 2 is shared between two members, the next one along gets position
4. There's a different function, DENSE_RANK , that would assign that member position 3 instead.
It's worth noting the technique we use to do rounding here. Adding 5, dividing by 10, and multiplying by
10 has the effect (thanks to integer arithmetic cutting off fractions) of rounding a number to the nearest
10. In our case, because slots are half an hour, we need to add 10, divide by 20, and multiply by 10. One
could certainly make the argument that we should do the slots -> hours conversion independently of the
rounding, which would increase clarity.
Talking of clarity, this rounding malarky is starting to introduce a noticeable amount of code repetition. At
this point it's a judgement call, but you may wish to factor it out using a subquery as below:
select firstname, surname, hours, rank() over (order by hours desc) as rank from
	(select mems.firstname, mems.surname, ((sum(bks.slots)+10)/20)*10 as hours
		from cd.bookings bks
		inner join cd.members mems on bks.memid = mems.memid
		group by mems.memid) as subq
order by rank, surname, firstname;
Expected results:
name rank
Massage Room 1 1
Massage Room 2 2
Tennis Court 2 3
Answer:
select name, rank from (
select facs.name as name, rank() over (order by sum(case
when memid = 0 then slots * facs.guestcost
else slots * membercost
end) desc) as rank
from cd.bookings bks
inner join cd.facilities facs
on bks.facid = facs.facid
group by facs.name
) as subq
where rank <= 3
order by rank;
This question doesn't introduce any new concepts, and is just intended to give you the opportunity to
practise what you already know. We use the CASE statement to calculate the revenue for each slot, and
aggregate that on a per-facility basis using SUM . We then use the RANK window function to produce a
ranking, wrap it all up in a subquery, and extract everything with a rank less than or equal to 3.
Expected results:
name revenue
Answer:
This exercise should mostly use familiar concepts, although we do introduce the NTILE window function.
NTILE groups values into a passed-in number of groups, as evenly as possible. It outputs a number from
1 to the number of groups. We then use a CASE statement to turn that number into a label!
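A sketch of that approach - the 'high'/'average'/'low' labels are assumed from the exercise's classification:
select name, case when class = 1 then 'high'
		when class = 2 then 'average'
		else 'low' end as revenue
	from (
		select facs.name as name,
			ntile(3) over (order by sum(case
				when memid = 0 then slots * facs.guestcost
				else slots * membercost
				end) desc) as class
			from cd.bookings bks
			inner join cd.facilities facs
				on bks.facid = facs.facid
			group by facs.name
	) as subq
order by class, name;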
Expected results:
name months
Answer:
select name,
initialoutlay / (monthlyrevenue - monthlymaintenance) as repaytime
from
(select facs.name as name,
facs.initialoutlay as initialoutlay,
facs.monthlymaintenance as monthlymaintenance,
sum(case
when memid = 0 then slots * facs.guestcost
else slots * membercost
end)/3 as monthlyrevenue
from cd.bookings bks
inner join cd.facilities facs
on bks.facid = facs.facid
group by facs.facid
) as subq
order by name;
But, I hear you ask, what would an automatic version of this look like? One that didn't need to have a
hard-coded number of months in it? That's a little more complicated, and involves some date arithmetic.
I've factored that out into a CTE to make it a little more clear.
with monthdata as (
select mincompletemonth,
maxcompletemonth,
(extract(year from maxcompletemonth)*12) +
extract(month from maxcompletemonth) -
(extract(year from mincompletemonth)*12) -
extract(month from mincompletemonth) as nummonths
from (
select date_trunc('month',
(select max(starttime) from cd.bookings)) as maxcompletemonth,
date_trunc('month',
(select min(starttime) from cd.bookings)) as mincompletemonth
) as subq
)
select name,
initialoutlay / (monthlyrevenue - monthlymaintenance) as repaytime
from
(select facs.name as name,
facs.initialoutlay as initialoutlay,
facs.monthlymaintenance as monthlymaintenance,
sum(case
when memid = 0 then slots * facs.guestcost
else slots * membercost
end)/(select nummonths from monthdata) as monthlyrevenue
		from cd.bookings bks
		inner join cd.facilities facs
			on bks.facid = facs.facid
		where bks.starttime < (select maxcompletemonth from monthdata)
		group by facs.facid
	) as subq
order by name;
This code restricts the data that goes in to complete months. It does this by selecting the maximum date,
rounding down to the month, and stripping out all dates larger than that. Even this code is not completely
complete. It doesn't handle the case of a facility making a loss. Fixing that is not too hard, and is left as
(another) exercise for the reader!
Expected results:
date revenue
2012-08-01 1126.8333333333333333
2012-08-02 1153.0000000000000000
2012-08-03 1162.9000000000000000
2012-08-04 1177.3666666666666667
2012-08-05 1160.9333333333333333
2012-08-06 1185.4000000000000000
2012-08-07 1182.8666666666666667
2012-08-08 1172.6000000000000000
2012-08-09 1152.4666666666666667
2012-08-10 1175.0333333333333333
2012-08-11 1176.6333333333333333
2012-08-12 1195.6666666666666667
2012-08-13 1218.0000000000000000
2012-08-14 1247.4666666666666667
2012-08-15 1274.1000000000000000
2012-08-16 1281.2333333333333333
2012-08-17 1324.4666666666666667
2012-08-18 1373.7333333333333333
2012-08-19 1406.0666666666666667
2012-08-20 1427.0666666666666667
2012-08-21 1450.3333333333333333
2012-08-22 1539.7000000000000000
2012-08-23 1567.3000000000000000
2012-08-24 1592.3333333333333333
2012-08-25 1615.0333333333333333
2012-08-26 1631.2000000000000000
2012-08-27 1659.4333333333333333
2012-08-28 1687.0000000000000000
2012-08-29 1684.6333333333333333
2012-08-30 1657.9333333333333333
2012-08-31 1703.4000000000000000
Answer:
select dategen.date,
	( -- correlated subquery that, for each day fed into it,
	  -- finds the average revenue for the last 15 days
	select sum(case
		when memid = 0 then slots * facs.guestcost
		else slots * membercost
		end) as rev
		from cd.bookings bks
		inner join cd.facilities facs on bks.facid = facs.facid
		where bks.starttime > dategen.date - interval '14 days'
			and bks.starttime < dategen.date + interval '1 day'
	)/15 as revenue
	from (select cast(generate_series(timestamp '2012-08-01',
		'2012-08-31', '1 day') as date) as date) as dategen
order by dategen.date;
There are at least two equally good solutions to this question. I've put the simplest to write as the answer,
but there's also a more flexible solution that uses window functions.
Let's look at the selected answer first. When I read SQL queries, I tend to read the SELECT part of the
statement last - the FROM and WHERE parts tend to be more interesting. So, what do we have in our
FROM ? A call to the GENERATE_SERIES function. This does pretty much what it says on the tin -
generates a series of values. You can specify a start value, a stop value, and an increment. It works for
integer types and dates - although, as you can see, we need to be explicit about what types are going into
and out of the function. Try removing the casts, and seeing the result!
So, we've generated a timestamp for each day in August. Now, for each day, we need to generate our
average. We can do this using a correlated subquery . If you remember, a correlated subquery is a
subquery that uses values from the outer query. This means that it gets executed once for each result row
in the outer query. This is in contrast to an uncorrelated subquery, which only has to be executed once.
If we look at our correlated subquery, we can see that it's correlated on the dategen.date field. It produces
a sum of revenue for this day and the 14 days prior to it, and then divides that sum by 15. This produces
the output we're looking for!
I mentioned that there's a window function-based solution for this problem as well - you can see it below.
The approach we use for this is generating a list of revenue for each day, and then using window function
aggregation over that list. The nice thing about this method is that once you have the per-day revenue, you
can produce a wide range of results quite easily - you might, for example, want rolling averages for the
previous month, 15 days, and 5 days. This is easy to do using this method, and rather harder using
conventional aggregation.
You'll note that we've been wanting to work out daily revenue quite frequently. Rather than inserting that
calculation into all our queries, which is rather messy (and will cause us a big headache if we ever change
our schema), we probably want to store that information somewhere. Your first thought might be to
calculate information and store it somewhere for later use. This is a common tactic for large data
warehouses, but it can cause us some problems - if we ever go back and edit our data, we need to
remember to recalculate. For non-enormous-scale data like we're looking at here, we can just create a
view instead. A view is essentially a stored query that looks exactly like a table. Under the covers, the
DBMS just substitutes in the relevant portion of the view definition when you select data from it. They're
very easy to create, as you can see below:
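A sketch of such a view - the name cd.dailyrevenue and the column names date and rev are chosen to match the query further down:
create or replace view cd.dailyrevenue as
	select cast(bks.starttime as date) as date,
		sum(case
			when memid = 0 then slots * facs.guestcost
			else slots * membercost
			end) as rev
		from cd.bookings bks
		inner join cd.facilities facs
			on bks.facid = facs.facid
		group by cast(bks.starttime as date);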
You can see that this makes our query an awful lot simpler!
select date, avgrev from (
select dategen.date as date,
avg(revdata.rev) over(order by dategen.date rows 14 preceding) as avgrev
from
(select
cast(generate_series(timestamp '2012-07-10', '2012-08-31','1 day') as
date) as date
) as dategen
left outer join
cd.dailyrevenue as revdata on dategen.date = revdata.date
) as subq
where date >= '2012-08-01'
order by date;
As well as storing frequently-used query fragments, views can be used for a variety of purposes, including
restricting access to certain columns of a table.
Dates/Times in SQL are a complex topic, deserving of a category of their own. They're also fantastically
powerful, making it easier to work with variable-length concepts like 'months' than many programming
languages.
Before getting started on this category, it's probably worth taking a look over the PostgreSQL docs page on
date/time functions. You might also want to complete the aggregate functions category, since we'll use
some of those capabilities in this section.
Expected results:
timestamp
2012-08-31 01:00:00
Answer:
Here's a pretty easy question to start off with! SQL has a bunch of different date and time types, which you
can peruse at your leisure over at the excellent Postgres documentation . These basically allow you to
store dates, times, or timestamps (date+time).
The approved answer is the best way to create a timestamp under normal circumstances. You can also use
casts to change a correctly formatted string into a timestamp, for example:
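For instance, either of the following (illustrative) forms will work:
select cast('2012-08-31 01:00:00' as timestamp);
select '2012-08-31 01:00:00'::timestamp; -- Postgres' shorthand cast syntax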
Timestamps can be stored with or without time zone information. We've chosen not to here, but if you like
you could format the timestamp like "2012-08-31 01:00:00 +00:00", assuming UTC. Note that timestamp
with time zone is a different type to timestamp - when you're declaring it, you should use TIMESTAMP
WITH TIME ZONE '2012-08-31 01:00:00 +00:00'.
Finally, have a bit of a play around with some of the different date/time serialisations described in the
Postgres docs. You'll find that Postgres is extremely flexible with the formats it accepts, although my
recommendation to you would be to use the standard serialisation we've used here - you'll find it
unambiguous and easy to port to other DBs.
Expected results:
interval
32 days
Answer:
Subtracting timestamps produces an INTERVAL data type. INTERVAL s are a special data type for
representing the difference between two TIMESTAMP types. When subtracting timestamps, Postgres will
typically give an interval in terms of days, hours, minutes, seconds, without venturing into months. This
generally makes life easier, since months are of variable lengths.
One of the useful things about intervals, though, is the fact that they can encode months. Let's imagine
that I want to schedule something to occur in exactly one month's time, regardless of the length of my
month. To do this, I could use [timestamp] + interval '1 month' .
Intervals stand in contrast to SQL's treatment of DATE types. Dates don't use intervals - instead,
subtracting two dates will return an integer representing the number of days between the two dates. You
can also add integer values to dates. This is sometimes more convenient, depending on how much
intelligence you require in the handling of your dates!
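As a quick illustration of the difference (not part of the approved answer):
select timestamp '2012-08-31 01:00:00' + interval '1 month'; -- 2012-09-30 01:00:00, clamped to the month end
select date '2012-08-31' - date '2012-07-30';                -- 32, an integer number of days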
Expected results:
ts
2012-10-01 00:00:00
2012-10-02 00:00:00
2012-10-03 00:00:00
2012-10-04 00:00:00
2012-10-05 00:00:00
2012-10-06 00:00:00
2012-10-07 00:00:00
2012-10-08 00:00:00
2012-10-09 00:00:00
2012-10-10 00:00:00
2012-10-11 00:00:00
2012-10-12 00:00:00
2012-10-13 00:00:00
2012-10-14 00:00:00
2012-10-15 00:00:00
2012-10-16 00:00:00
2012-10-17 00:00:00
2012-10-18 00:00:00
2012-10-19 00:00:00
2012-10-20 00:00:00
2012-10-21 00:00:00
2012-10-22 00:00:00
2012-10-23 00:00:00
2012-10-24 00:00:00
2012-10-25 00:00:00
2012-10-26 00:00:00
2012-10-27 00:00:00
2012-10-28 00:00:00
2012-10-29 00:00:00
2012-10-30 00:00:00
2012-10-31 00:00:00
Answer:
One of the best features of Postgres over other DBs is a simple function called GENERATE_SERIES . This
function allows you to generate a list of dates or numbers, specifying a start, an end, and an increment
value. It's extremely useful for situations where you want to output, say, sales per day over the course of a
month. A typical way to do that on a table containing a list of sales might be to use a SUM aggregation,
grouping by the date and product type. Unfortunately, this approach has a flaw: if there are no sales for a
given day, it won't show up! To make it work properly, you need to left join from a sequential list of
timestamps to the aggregated data to fill in the blank spaces.
On other database systems, it's not uncommon to keep a 'calendar table' full of dates, with which you can
perform these joins. Alternatively, on some systems you can write an analogue to generate_series using
recursive CTEs. Fortunately for us, Postgres makes our lives a lot easier!
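A sketch that produces the expected October timestamps above:
select generate_series(timestamp '2012-10-01', timestamp '2012-10-31',
	interval '1 day') as ts;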
Expected results:
date_part
31
Answer:
The EXTRACT function is used for getting sections of a timestamp or interval. You can get the value of
any field in the timestamp as an integer.
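For example, assuming the exercise's timestamp of 2012-08-31, the day of the month comes out as:
select extract(day from timestamp '2012-08-31');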
Expected results:
date_part
169200
Answer:
If you want to write more portable code, you will unfortunately find that you cannot use EXTRACT(EPOCH
FROM ...). Instead you will need to use something like:
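One portable formulation - a sketch, assuming the exercise's timestamps of '2012-08-31 01:00:00' and '2012-09-02 00:00:00' (consistent with the 169200 seconds above) - is to pull the interval apart field by field:
select extract(day from ts.diff)*24*60*60
	+ extract(hour from ts.diff)*60*60
	+ extract(minute from ts.diff)*60
	+ extract(second from ts.diff)
	from (select timestamp '2012-09-02 00:00:00'
		- timestamp '2012-08-31 01:00:00' as diff) ts;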
This is, as you can observe, rather awful. If you're planning to write cross platform SQL, I would consider
having a library of common user defined functions for each DBMS, allowing you to normalise any common
requirements like this. This keeps your main codebase a lot cleaner.
Expected results:
month length
1 31 days
2 29 days
3 31 days
4 30 days
5 31 days
6 30 days
7 31 days
8 31 days
9 30 days
10 31 days
11 30 days
12 31 days
Answer:
select extract(month from cal.month) as month,
(cal.month + interval '1 month') - cal.month as length
from
(
select generate_series(timestamp '2012-01-01', timestamp '2012-12-01',
interval '1 month') as month
) cal
order by month;
This answer shows several of the concepts we've learned. We use the GENERATE_SERIES function to
produce a year's worth of timestamps, incrementing a month at a time. We then use the EXTRACT
function to get the month number. Finally, we subtract each timestamp from itself plus one month, which
gives us an interval equal to the length of that month.
It's worth noting that subtracting two timestamps will always produce an interval in terms of days (or
portions of a day). You won't just get an answer in terms of months or years, because the length of those
time periods is variable.
Expected results:
remaining
19 days
Answer:
The star of this particular show is the DATE_TRUNC function. It does pretty much what you'd expect -
truncates a date to a given minute, hour, day, month, and so on. The way we've solved this problem is to
truncate our timestamp to find the month we're in, add a month to that, and subtract our timestamp. To
ensure partial days get treated as whole days, the timestamp we subtract is truncated to the nearest day.
Note the way we've put the timestamp into a subquery. This isn't required, but it does mean you can give
the timestamp a name, rather than having to list the literal repeatedly.
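A sketch of that approach, assuming the exercise's example timestamp of '2012-02-11 01:00:00' (consistent with the 19 days shown above):
select (date_trunc('month', ts.testts) + interval '1 month')
	- date_trunc('day', ts.testts) as remaining
	from (select timestamp '2012-02-11 01:00:00' as testts) ts;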
Expected results:
starttime endtime
Answer:
This question simply returns the start time for a booking, and a calculated end time which is equal to
start time + (30 minutes * slots) . Note that it's perfectly okay to multiply an interval by a number.
The other thing you'll notice is the use of order by and limit to get the last ten bookings. All this does is
order the bookings by the (descending) time at which they end, and pick off the top ten.
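A sketch of such a query:
select starttime, starttime + slots * (interval '30 minutes') as endtime
	from cd.bookings
	order by endtime desc, starttime desc
	limit 10;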
Expected results:
month count
2013-01-01 00:00:00 1
Answer:
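A sketch consistent with the month/count output above - grouping bookings by the month they start in - would be:
select date_trunc('month', starttime) as month, count(*)
	from cd.bookings
	group by month
	order by month;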
Expected results:
name month utilisation
Answer:
select name, month,
round((100*slots)/
cast(
25*(cast((month + interval '1 month') as date)
- cast (month as date)) as numeric),1) as utilisation
from (
select facs.name as name, date_trunc('month', starttime) as month,
sum(slots) as slots
from cd.bookings bks
inner join cd.facilities facs
on bks.facid = facs.facid
group by facs.facid, month
) as inn
order by name, month
The meat of this query (the inner subquery) is really quite simple: an aggregation to work out the total
number of slots used per facility per month. If you've covered the rest of this section and the category on
aggregates, you likely didn't find this bit too challenging.
This query does, unfortunately, have some other complexity in it: working out the number of days in each
month. We can calculate the number of days between two months by subtracting two timestamps with a
month between them. This, unfortunately, gives us back an interval datatype, which we can't use to do
mathematics. In this case we've worked around that limitation by converting our timestamps into dates
before subtracting. Subtracting date types gives us an integer number of days.
An alternative to this workaround is to convert the interval into an epoch value: that is, a number of
seconds. To do this use EXTRACT(EPOCH FROM month)/(24*60*60) . This is arguably a much nicer
way to do things, but is much less portable to other database systems.
String Operations
String operations in most RDBMSs are, arguably, needlessly painful. Fortunately, Postgres is better than
most in this regard, providing strong regular expression support. This section covers basic string
manipulation, use of the LIKE operator, and use of regular expressions. I also make an effort to show you
some alternative approaches that work reliably in most RDBMSs. Be sure to check out Postgres' string
function docs page if you're not confident about these exercises.
Anthony Molinaro's SQL Cookbook provides some excellent documentation of (difficult) cross-DBMS
compliant SQL string manipulation. I'd strongly recommend his book, particularly as it's published by
O'Reilly, whose ethical policy of DRM-free ebook distribution deserves rich rewards.
Expected results:
name
GUEST, GUEST
Smith, Darren
Smith, Tracy
Rownam, Tim
Joplette, Janice
Butters, Gerald
Tracy, Burton
Dare, Nancy
Boothe, Tim
Stibbons, Ponder
Owen, Charles
Jones, David
Baker, Anne
Farrell, Jemima
Smith, Jack
Bader, Florence
Baker, Timothy
Pinker, David
Genting, Matthew
Mackenzie, Anna
Coplin, Joan
Sarwin, Ramnaresh
Jones, Douglas
Rumney, Henrietta
Farrell, David
Worthington-Smyth, Henry
Purview, Millicent
Tupperware, Hyacinth
Hunt, John
Crumpet, Erica
Smith, Darren
Answer:
Building strings in sql is similar to other languages, with the exception of the concatenation operator: ||.
Some systems (like SQL Server) use +, but || is the SQL standard.
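A sketch producing the 'Surname, Firstname' output shown above:
select surname || ', ' || firstname as name
	from cd.members;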
Expected results:
Answer:
The SQL LIKE operator is a highly standard way of searching for a string using basic matching. The %
character matches any string, while _ matches any single character.
One point that's worth considering when you use LIKE is how it uses indexes. If you're using the 'C'
locale , any LIKE string with a fixed beginning (as in our example here) can use an index. If you're using
any other locale, LIKE will not use any index by default. See here for details on how to change that.
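A sketch of the sort of query in question - the 'Tennis' prefix is assumed from the exercise:
select * from cd.facilities
	where name like 'Tennis%';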
Expected results:
Answer:
There's no direct operator for case-insensitive comparison in standard SQL. Fortunately, we can take a
page from many other languages' books, and simply force all values into upper case when we do our
comparison. This renders case irrelevant, and gives us our result.
Alternatively, Postgres does provide the ILIKE operator, which performs case insensitive searches. This
isn't standard SQL, but it's arguably more clear.
You should realise that running a function like UPPER over a column value prevents Postgres from making
use of any indexes on the column (the same is true for ILIKE ). Fortunately, Postgres has got your back:
rather than simply creating indexes over columns, you can also create indexes over expressions . If you
created an index over UPPER(name) , this query could use it quite happily.
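Sketches of both approaches, again assuming a 'tennis' prefix:
select * from cd.facilities where upper(name) like 'TENNIS%';
select * from cd.facilities where name ilike 'tennis%';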
Expected results:
memid telephone
0 (000) 000-0000
3 (844) 693-0723
4 (833) 942-4710
5 (844) 078-4130
6 (822) 354-9973
7 (833) 776-4001
8 (811) 433-2547
9 (833) 160-3900
10 (855) 542-5251
11 (844) 536-8036
13 (855) 016-0163
14 (822) 163-3254
15 (833) 499-3527
20 (811) 972-1377
21 (822) 661-2898
22 (822) 499-2232
24 (822) 413-1470
27 (822) 989-8876
28 (855) 755-9876
29 (855) 894-3758
30 (855) 941-9786
33 (822) 665-5327
35 (899) 720-6978
36 (811) 732-4816
37 (822) 577-3541
Answer:
We've chosen to answer this using regular expressions, although Postgres does provide other string
functions like POSITION that would do the job at least as well. Postgres implements POSIX regular
expression matching via the ~ operator. If you've used regular expressions before, the functionality of the
operator will be very familiar to you.
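A sketch using the ~ operator to pick out members whose telephone number contains a parenthesis:
select memid, telephone from cd.members
	where telephone ~ '[()]'
order by memid;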
As an alternative, you can use the SQL standard SIMILAR TO operator. The regular expressions for this
have similarities to the POSIX standard, but a lot of differences as well. Some of the most notable
differences are:
As in the LIKE operator, SIMILAR TO uses the '_' character to mean 'any character', and the '%'
character to mean 'any string'.
A SIMILAR TO expression must match the whole string, not just a substring as in posix regular
expressions. This means that you'll typically end up bracketing an expression in '%' characters.
The '.' character does not mean 'any character' in SIMILAR TO regexes: it's just a plain character.
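With those caveats in mind, a SIMILAR TO sketch equivalent to the query above looks like:
select memid, telephone from cd.members
	where telephone similar to '%[()]%'
order by memid;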
Finally, it's worth noting that regular expressions usually don't use indexes. Generally you don't want your
regex to be responsible for doing heavy lifting in your query, because it will be slow. If you need fuzzy
matching that works fast, consider working out if your needs can be met by full text search .
Expected results:
zip
00000
00234
00234
04321
04321
10383
11986
23423
28563
33862
34232
43532
43533
45678
52365
54333
56754
57392
58393
64577
65332
65464
66796
68666
69302
75655
78533
80743
84923
87630
97676
Answer:
Postgres' LPAD function is the star of this particular show. It does basically what you'd expect: it allows us to
produce a padded string. We need to remember to cast the zipcode to a string for it to be accepted by the
LPAD function.
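A sketch of the padded-zip query, assuming the zipcode column described in the schema:
select lpad(cast(zipcode as char(5)), 5, '0') as zip
	from cd.members
order by zip;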
When inheriting an old database, it's not that unusual to find wonky decisions having been made over
data types. You may wish to fix mistakes like these, but have a lot of code that would break if you changed
datatypes. In that case, one option (depending on performance requirements) is to create a view over
your table which presents the data in a fixed-up manner, and gradually migrate.
Count the number of members whose surname starts with each letter of the
alphabet
You'd like to produce a count of how many members you have whose surname starts with each letter of the
alphabet. Sort by the letter, and don't worry about printing out a letter if the count is 0.
Expected results:
letter count
B 5
C 2
D 1
F 2
G 2
H 1
J 3
M 1
O 1
P 2
R 2
S 6
T 2
W 1
Answer:
This exercise is fairly straightforward. You simply need to retrieve the first letter of the member's surname,
and do some basic aggregation to achieve a count. We use the SUBSTR function here, but there's a
variety of other ways you can achieve the same thing. The LEFT function, for example, returns you the
first n characters from the left of the string. Alternatively, you could use the SUBSTRING function, which
allows you to use regular expressions to extract a portion of the string.
One point worth noting: as you can see, string functions in SQL are based on 1-indexing, not the 0-indexing
that you're probably used to. This will likely trip you up once or twice before you get used to it :-)
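A sketch using SUBSTR and a simple aggregate:
select substr(mems.surname, 1, 1) as letter, count(*) as count
	from cd.members mems
	group by letter
	order by letter;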
Expected results:
memid telephone
0 0000000000
1 5555555555
2 5555555555
3 8446930723
4 8339424710
5 8440784130
6 8223549973
7 8337764001
8 8114332547
9 8331603900
10 8555425251
11 8445368036
12 8440765141
13 8550160163
14 8221633254
15 8334993527
16 8339410824
17 8114096734
20 8119721377
21 8226612898
22 8224992232
24 8224131470
26 8445368036
27 8229898876
28 8557559876
29 8558943758
30 8559419786
33 8226655327
35 8997206978
36 8117324816
37 8225773541
Answer:
The most direct solution is probably the TRANSLATE function, which can be used to replace characters in
a string. You pass it three strings: the value you want altered, the characters to replace, and the characters
you want them replaced with. In our case, we want all the characters deleted, so our third parameter is an
empty string.
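A sketch of the TRANSLATE approach - the set of characters to strip (dashes, brackets, and spaces) is assumed from the data shown above:
select memid, translate(telephone, '-() ', '') as telephone
	from cd.members
order by memid;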
As is often the way with strings, we can also use regular expressions to solve our problem. The
REGEXP_REPLACE function provides what we're looking for: we simply pass a regex that matches all
non-digit characters, and replace them with nothing, as shown below. The 'g' flag tells the function to
replace as many instances of the pattern as it can find. This solution is perhaps more robust, as it cleans
out more bad formatting.
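And a sketch of the REGEXP_REPLACE version:
select memid, regexp_replace(telephone, '[^0-9]', '', 'g') as telephone
	from cd.members
order by memid;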
Making automated use of free-formatted text data can be a chore. Ideally you want to avoid having to
constantly write code to clean up the data before using it, so you should consider having your database
enforce correct formatting for you. You can do this using a CHECK constraint on your column, which allows
you to reject any poorly-formatted entry. It's tempting to perform this kind of validation in the application
layer, and this is certainly a valid approach. As a general rule, if your database is getting used by multiple
applications, favour pushing more of your checks down into the database to ensure consistent behaviour
between the apps.
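As a purely hypothetical illustration - the constraint name and pattern are invented, and this exact rule wouldn't validate the formatted numbers already in the exercise dataset - a CHECK constraint might look like:
alter table cd.members
	add constraint telephone_digits_only
	check (telephone ~ '^[0-9]{10}$'); -- reject anything that isn't exactly ten digits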
Occasionally, adding a constraint isn't feasible. You may, for example, have two different legacy
applications asserting differently formatted information. If you're unable to alter the applications, you
have a couple of options to consider. Firstly, you can define a trigger on your table. This allows you to
intercept data before (or after) it gets asserted to your table, and normalise it into a single format.
Alternatively, you could build a view over your table that cleans up information on the fly, as it's read out.
Newer applications can read from the view and benefit from more reliably formatted information.
Recursive Queries
Common Table Expressions allow us to, effectively, create our own temporary tables for the duration of a
query - they're largely a convenience to help us make more readable SQL. Using the WITH RECURSIVE
modifier, however, it's possible for us to create recursive queries. This is enormously advantageous for
working with tree and graph-structured data - imagine retrieving all of the relations of a graph node to a
given depth, for example.
This category shows you some basic recursive queries that are possible using our dataset.
Find the upward recommendation chain for member ID 27
Find the upward recommendation chain for member ID 27: that is, the member who recommended them,
and the member who recommended that member, and so on. Return member ID, first name, and surname.
Order by descending member id.
Expected results:
20 Matthew Genting
5 Gerald Butters
1 Darren Smith
Answer:
WITH RECURSIVE is a fantastically useful piece of functionality that many developers are unaware of. It
allows you to perform queries over hierarchies of data, which is very difficult by other means in SQL. Such
scenarios often leave developers resorting to multiple round trips to the database system.
You've seen WITH before. The Common Table Expressions (CTEs) defined by WITH give you the ability to
produce inline views over your data. This is normally just a syntactic convenience, but the RECURSIVE
modifier adds the ability to join against results already produced to produce even more. A recursive WITH
takes the basic form of:
The initial statement populates the initial data, and then the recursive statement runs repeatedly to
produce more. Each step of the recursion can access the CTE, but it sees within it only the data produced
by the previous iteration. It repeats until an iteration produces no additional data.
The most simple example of a recursive WITH might look something like this:
with recursive increment(num) as (
select 1
union all
select increment.num + 1 from increment where increment.num < 5
)
select * from increment;
The initial statement produces '1'. The first iteration of the recursive statement sees this as the content of
increment , and produces '2'. The next iteration sees the content of increment as '2', and so on.
Execution terminates when the recursive statement produces no additional data.
With the basics out of the way, it's fairly easy to explain our answer here. The initial statement gets the ID
of the person who recommended the member we're interested in. The recursive statement takes the
results of the initial statement, and finds the ID of the person who recommended them. This value gets
forwarded on to the next iteration, and so on.
Now that we've constructed the recommenders CTE, all our main SELECT statement has to do is get the
member IDs from recommenders, and join to the members table to find out their names.
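A sketch of that answer (the CTE and column names here are illustrative):
with recursive recommenders(recommender) as (
	select recommendedby from cd.members where memid = 27
	union all
	select mems.recommendedby
		from recommenders recs
		inner join cd.members mems
			on mems.memid = recs.recommender
)
select recs.recommender, mems.firstname, mems.surname
	from recommenders recs
	inner join cd.members mems
		on recs.recommender = mems.memid
order by recs.recommender desc;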
Expected results:
4 Janice Joplette
5 Gerald Butters
7 Nancy Dare
10 Charles Owen
11 David Jones
14 Jack Smith
20 Matthew Genting
21 Anna Mackenzie
26 Douglas Jones
27 Henrietta Rumney
Answer:
with recursive recommendeds(memid) as (
select memid from cd.members where recommendedby = 1
union all
select mems.memid
from recommendeds recs
inner join cd.members mems
on mems.recommendedby = recs.memid
)
select recs.memid, mems.firstname, mems.surname
from recommendeds recs
inner join cd.members mems
on recs.memid = mems.memid
order by memid
This is a pretty minor variation on the previous question. The essential difference is that we're now
heading in the opposite direction. One interesting point to note is that unlike the previous example, this
CTE produces multiple rows per iteration, by virtue of the fact that we're heading down the
recommendation tree (following all branches) rather than up it.
Produce a CTE that can return the upward recommendation chain for any
member
Produce a CTE that can return the upward recommendation chain for any member. You should be able to
select recommender from recommenders where member=x. Demonstrate it by getting the chains for
members 12 and 22. Results table should have member and recommender, ordered by member ascending,
recommender descending.
Expected results:
12 9 Ponder Stibbons
12 6 Burton Tracy
22 16 Timothy Baker
22 13 Jemima Farrell
Answer:
This question requires us to produce a CTE that can calculate the upward recommendation chain for any
user. Most of the complexity of working out the answer is in realising that we now need our CTE to produce
two columns: one to contain the member we're asking about, and another to contain the members in
their recommendation tree. Essentially what we're doing is producing a table that flattens out the
recommendation hierarchy.
Since we're looking to produce the chain for every user, our initial statement needs to select data for each
user: their ID and who recommended them. Subsequently, we want to pass the member field through each
iteration without changing it, while getting the next recommender. You can see that the recursive part of
our statement hasn't really changed, except to pass through the 'member' field.
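A sketch of that CTE, along with a query demonstrating it for members 12 and 22:
with recursive recommenders(recommender, member) as (
	select recommendedby, memid from cd.members
	union all
	select mems.recommendedby, recs.member
		from recommenders recs
		inner join cd.members mems
			on mems.memid = recs.recommender
)
select recs.member, recs.recommender, mems.firstname, mems.surname
	from recommenders recs
	inner join cd.members mems
		on recs.recommender = mems.memid
	where recs.member = 22 or recs.member = 12
order by recs.member asc, recs.recommender desc;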