This post serves as a reference for SQL syntax. We cover the most frequently used commands, including “case when”, “join” and functions.
- Download a database
- Download a client
- Select Data
- Calculating statistics with aggregate functions
- Functions
- Examples
- I/O with Python
A database is a software for storing data. There are many many databases available. They can be broadly classified into two categories: one is relational database, including MySQL and PostgreSQL. They other category is non-relational database, including MongoDB, Cassandra, neo4j, Dgraph and many others. Here we are concerned with SQL, the common syntax for querying relational databases.
Download a database
To practice writing SQL queries, you need to install a database on your computer first. You can download either MySQL or PostgreSQL. Both are open sourced, but MySQL is more widely used than PostgreSQL. Notably, PostgreSQL has more data types and functions, so it is more powerful, but it is also more complex and more difficult to learn. Query syntax for the two databases are largely similar, but not entirely the same, especially for functions.
- To install MySQL, download an installer on the official website. Then follow the installation guide and install all the components step by step.
- To install PostgreSQL, go to the official website and download an installer.
Download a client
You might want a GUI client software to interact with your database. With MySQL installation it comes with a default one called MySQL Workbench. For PostgreSQL the default client that comes with the installation is pgAdmin. I find pgAdmin is ok, but Workbench is hard to use. I personally find
to be the best one.
To play with MySQL on Mac, there is also Sequel Pro available.
Finally, there are also built-in command line clients. To start MySQL client, type
mysql --user=user_name --password db_name
in your shell. To add mysql
to path in windows, add C:\Program Files\MySQL\MySQL Server x.x\bin
to PATH in system environment variable, where x.x.
is the version number. For PostgreSQL, the command line client is psql
. To add it to path in windows, add C:\Program Files\PostgreSQL\xx\bin
to the system PATH variable, where xx
is the version number.
Select Data
Select columns with “select … from …”
1
2
select c1, c2, ...
from tb
Filter rows with “where”
1
2
3
select c1, c2, ...
from tb
where condition1 and/or condition2
We can use operators on columns in where
conditions.
Operator | Description | Example |
---|---|---|
= | equal to | col = 'salary' |
<> / != | not equal to | dept != 'decision' |
> | greater than | date > 2022-02-15 |
< | less than | price < 500.00 |
>= | greater than or equal | number >= 2 |
<= | less than or equal | number <= 2 |
and | and | state = 'GA' and amount > 1000 |
or | or | state = 'GA' or amount > 1000 |
[not] between | between an inclusive range | price between 100.00 and 500.00 |
[not] like | string pattern | name like 'Luca%' |
ilike | string pattern, ignoring capitalization | name ilike '%cafe%' |
[not] in | set membership test | col in ('col1', 'col2', 'col3') |
is [not] null | compare to null | address is not null |
is [not] true | boolean true value test | full_time is true |
is not distinct from | is equal to value or both are nulls | name is not distinct from full_name |
as | Rename table or column name. Can be omitted | select result as r |
Filter rows with “distinct”, “order by” and “limit”
1
2
3
4
5
select distinct c1, c2, ...
from tb
where condition
order by col asc/desc
limit [num] offset [num]
note:
distinct
can have slow performance.
Categorize column values with “case when”
1
2
3
4
5
6
7
8
select
...
case
when condition1 then label1
when condition2 then label2
when condition3 then label3
else label4
end (as col_name)
Select columns from different tables with “join”
1
2
3
4
5
select c1, c2, ...
from tb1 (inner/left/right/full) join tb2
on tb1.id = tb2.id
where condition
...
We illustrate inner, left, right and full join with the following two tables.
This is tb1.
id | c1 |
---|---|
a | value |
b | value |
c | value |
d | value |
e | value |
This is tb2.
id | c2 |
---|---|
a | value |
b | value |
x | value |
y | value |
select common column values with “inner join”
inner join
is tb1 $\cap$ tb2.
id | c1 | c2 |
---|---|---|
a | value | value |
b | value | value |
select rows from one table with “left join” or “right join”
left join
selects all values from tb1’s id column.
id | c1 | c2 |
---|---|---|
a | value | value |
b | value | value |
c | value | null |
d | value | null |
e | value | null |
right join
selects all values from tb2’s id column.
id | c1 | c2 |
---|---|---|
a | value | value |
b | value | value |
x | null | value |
y | null | value |
select row union with full
full
means tb $\cup$ tb2.
id | c1 | c2 |
---|---|---|
a | value | value |
b | value | value |
c | value | null |
d | value | null |
e | value | null |
x | null | value |
y | null | value |
The
outer
keyword is deprecated. It is kept for backward compatibility with SQL-92 only.
left join
may be less efficient than (inner)join
.
Unions
1
2
3
select col from tb1
union all
select col from tb2
Sometimes we need to combine rows from two different tables. We can use union
or union all
for that purpose. union
will only retain unique rows, while union all
will retain all rows.
Union is also different from where ... or ...
clause. This query
1
2
3
select device_id, gender, age, gpa from user_profile where university = "Shandong University"
union all
select device_id, gender, age, gpa from user_profile where gender = 'male'
outputs the set of users that come from Shandong University, then users that are male, in that order. Instead, in the output of this query
1
2
3
select device_id, gender, age, gpa
from user_profile
where university = "Shandong University" or gender = 'male'
users from Shandong University and users that are male are randomly mixed.
Calculating statistics with aggregate functions
1
2
3
select agg(col) as ...
from tb
where condition
function | description |
---|---|
count(col) | counts the number of rows |
min(col) | min for all rows |
max(col) | max for all rows |
avg(col) | average for all rows |
sum(col) | sum over all rows |
Apply aggregate functions to different groups of rows with “group by”
Instead of aggregating across all the rows, you can also apply the aggregate functions to different groups of rows with the group by
keyword.
1
2
3
4
select c, agg(col) as col_name
from tb
where condition
group by c
Rows are classified by values in c
. In other words, rows that correspond to the same value in c
belong to the same group.
Filter calculated columns with “having”
The having
keyword is used to filter rows in calculated columns.
1
2
3
select id, fun(col) as col_name
(group by col)
having col_name...
Functions
Date operators & functions
Below is a list of some date operators in PostgreSQL.
operator | meaning | example | result |
---|---|---|---|
+ int | add a number of days to a date | select date '2021-08-14' + 1 | 202-08-15 |
+ interval | add an interval to a date | select date '2001-09-28' + interval '1 hour' | 2001-09-28 01:00:00 |
+ time | add a time-of-day to a date | ||
- int | subtract a number of days from a date | ||
- date | subtract days, producing the number of days elapsed | select date '2001-10-01' - date '2001-09-28' | 3 |
- time | subtract times |
Some date functions in PostgreSQL and MySQL:
function | PostgreSQL | MySQL | description |
---|---|---|---|
now() | ✅ | ✅ | date and time now |
current_date | ✅ | date now | |
current_time | ✅ | time now | |
curdate() | ✅ | date now | |
curtime() | ✅ | time now | |
date() | ✅ | extract date | |
day() | ✅ | extract day | |
month() | ✅ | extract month | |
year() | ✅ | extract year | |
age(timestamp, timestap) | ✅ | subtract two dates | |
datediff(d1, d2) | ✅ | subtract two dates | |
extract(unit from timestamp) | ✅ | ✅ | extract unit from date |
A useful function might be the extract
function. It can be used to extract year, month or date from datetime record, which is useful for analysis based on different dates.
1
extract(unit from date)
unit can be: microsecond
, second
, minute
, hour
, day
, week
, month
, quarter
, year
.
Different databases may have different date functions. Refer to official documentations for complete list of date functions:
String operators & functions
Here we list some of the string functions in PostgreSQL. Refer to the documentation for a complete list.
function | description | example | result |
---|---|---|---|
s || s | string concatenation | 'Post' || 'greSQL' | PostgreSQL |
length(s) | length of string | length('jose') | 4 |
lower(s) | convert string to lower case | lower('FEMALE') | female |
upper(s) | convert string to upper case | upper('female') | FEMALE |
position(sub in s) | location of specified substring | position('z' in 'wxyz') | 4 |
substring(s from [int] for [int]) | extract substring | substring('abcde' from 2 for 3) | bcd |
substring(s from [re]) | extract substring with regular expression | substring('LxxxR xyz' from 'L.*R') | LxxxR |
left(s, n) | return fist n characters in the string. When $n$ is negative, return all but last $\vert n\vert$ characters. | left('abcde', 2) | ab |
right(s, n) | return last $n$ characters in the string. | right('vwxyz', 2) | yz |
replace(s, sub, repl) | replace all occurences of substring | replace('WakqSnake', 'ak', 'xx') | WxxqSnxxe |
repeat(s, n) | repeat string $n$ times | repeat('Pg', 4) | PgPgPgPg |
string_to_array(s, delimiter) | split string to array | select (string_to_array('165cm,55kg,26,female', ','))[4] | female |
Related: PostgreSQL documentation on array functions
Window functions
Window functions are useful when we need to calculate something within a column. We partition a column into windows, and apply a function to each window. The calculated column can then be selected alongside with other columns.
1
2
select id, fun(col) over (partition by col2 order by col3) as col_name
from tb
for fun
can use
aggregate functions:
max
,min
,sum
, etc.- ranking functions:
row_number
,rank
,dense_rank
,ntile
, etc. They are useful for selecting the top (min or max) N records per category.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
mysql> select val, row_number() over w as 'row_number', rank() over w as 'rank', dense_rank() over w as 'dense_rank' from numbers window w as (order by val); +------+------------+------+------------+ | val | row_number | rank | dense_rank | +------+------------+------+------------+ | 1 | 1 | 1 | 1 | | 1 | 2 | 1 | 1 | | 2 | 3 | 3 | 2 | | 3 | 4 | 4 | 3 | | 3 | 5 | 4 | 3 | | 3 | 6 | 4 | 3 | | 4 | 7 | 7 | 4 | | 4 | 8 | 7 | 4 | | 5 | 9 | 9 | 5 | +------+------------+------+------------+
- analytic functions:
lag
,lead
. They are useful for comparing multiple rows and calculate differences between rows.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
mysql> select t, val, lag(val) over w as 'lag', lead(val) over w as 'lead', val - lag(val) over w as 'lag diff', val - lead(val) over w as 'lead diff' from series window w as (order by t); +----------+------+------+------+----------+-----------+ | t | val | lag | lead | lag diff | lead diff | +----------+------+------+------+----------+-----------+ | 12:00:00 | 100 | NULL | 125 | NULL | -25 | | 13:00:00 | 125 | 100 | 132 | 25 | -7 | | 14:00:00 | 132 | 125 | 145 | 7 | -13 | | 15:00:00 | 145 | 132 | 140 | 13 | 5 | | 16:00:00 | 140 | 145 | 150 | -5 | -10 | | 17:00:00 | 150 | 140 | 200 | 10 | -50 | | 18:00:00 | 200 | 150 | NULL | 50 | NULL | +----------+------+------+------+----------+-----------+
window functions may store information on a temporary database on disk, so their performance can be very slow. In many situations the old fashioned
group by
is way faster than window functions.
Examples
See https://github.com/lifeitech/sql-nowcoder.
I/O with Python
To read from and write data to database with Python is easy. Here is an example showing how to convert SQL tables to pandas DataFrames and vice versa (using PostgreSQL).
First, install the following two packages
1
2
pip install psycopg2-binary # mysql would be: mysqldb or pymysql etc.
pip install sqlalchemy
We can connect to database with the create_engine()
function provided by the sqlalchemy
package, with
1
dialect+driver://username:password@host:port/database
as input url, where host
is the server address, and database
is the name of the database. For more details refer to sqlalchemy docs.
1
2
3
4
5
6
7
8
9
10
11
12
13
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://username:password@server:port/database')
con = engine.connect()
# read table 'user/info' into pandas dataframe
profile = pd.read_sql('select * from user.info', con=con)
# ....
# write dataframe to a new table in db named 'analysis'
df.to_sql('analysis', con=engine, schema='user', if_exists='replace')
Here, the name “schema” is somewhat confusing, in practice it basically means the “folder” that contains the table.
Cite as:
1
2
3
4
5
6
7
@article{lifei2022sql,
title = "All about SQL Queries",
author = "Li, Fei",
journal = "https://lifeitech.github.io",
year = "2022",
url= "https://lifeitech.github.io/posts/sql/"
}
Comments powered by Disqus.