lifeitech.github.io

This post serves as a reference for SQL syntax. We cover the most frequently used commands, including “case when”, “join” and functions.

Download a database
Download a client
Select Data
Calculating statistics with aggregate functions
- Apply aggregate functions to different groups of rows with “group by”
- Filter calculated columns with “having”
Functions
Examples
I/O with Python

A database is a software for storing data. There are many many databases available. They can be broadly classified into two categories: one is relational database, including MySQL and PostgreSQL. They other category is non-relational database, including MongoDB, Cassandra, neo4j, Dgraph and many others. Here we are concerned with SQL, the common syntax for querying relational databases.

Download a database

PostgreSQL vs MySQL

To practice writing SQL queries, you need to install a database on your computer first. You can download either MySQL or PostgreSQL. Both are open sourced, but MySQL is more widely used than PostgreSQL. Notably, PostgreSQL has more data types and functions, so it is more powerful, but it is also more complex and more difficult to learn. Query syntax for the two databases are largely similar, but not entirely the same, especially for functions.

To install MySQL, download an installer on the official website. Then follow the installation guide and install all the components step by step.
To install PostgreSQL, go to the official website and download an installer.

Download a client

You might want a GUI client software to interact with your database. With MySQL installation it comes with a default one called MySQL Workbench. For PostgreSQL the default client that comes with the installation is pgAdmin. I find pgAdmin is ok, but Workbench is hard to use. I personally find

Beekeeper Studio

to be the best one.

To play with MySQL on Mac, there is also Sequel Pro available.

Finally, there are also built-in command line clients. To start MySQL client, type

mysql --user=user_name --password db_name

in your shell. To add mysql to path in windows, add C:\Program Files\MySQL\MySQL Server x.x\bin to PATH in system environment variable, where x.x. is the version number. For PostgreSQL, the command line client is psql. To add it to path in windows, add C:\Program Files\PostgreSQL\xx\bin to the system PATH variable, where xx is the version number.

Select Data

Select columns with “select … from …”

  
select c1, c2, ...
from tb

Filter rows with “where”

  
select c1, c2, ...
from tb
where condition1 and/or condition2

We can use operators on columns in where conditions.

Operator	Description	Example
`=`	equal to	`col = 'salary'`
`<>` / `!=`	not equal to	`dept != 'decision'`
`>`	greater than	`date > 2022-02-15`
`<`	less than	`price < 500.00`
`>=`	greater than or equal	`number >= 2`
`<=`	less than or equal	`number <= 2`
`and`	and	`state = 'GA' and amount > 1000`
`or`	or	`state = 'GA' or amount > 1000`
`[not] between`	between an inclusive range	`price between 100.00 and 500.00`
`[not] like`	string pattern	`name like 'Luca%'`
`ilike`	string pattern, ignoring capitalization	`name ilike '%cafe%'`
`[not] in`	set membership test	`col in ('col1', 'col2', 'col3')`
`is [not] null`	compare to null	`address is not null`
`is [not] true`	boolean true value test	`full_time is true`
`is not distinct from`	is equal to value or both are nulls	`name is not distinct from full_name`
`as`	Rename table or column name. Can be omitted	`select result as r`

Filter rows with “distinct”, “order by” and “limit”

  
select distinct c1, c2, ... 
from tb 
where condition
order by col asc/desc
limit [num] offset [num]

note: distinct can have slow performance.

Categorize column values with “case when”

  
select 
...
case 
when condition1 then label1
when condition2 then label2
when condition3 then label3
else label4
end (as col_name)

Select columns from different tables with “join”

  
select c1, c2, ...
from tb1 (inner/left/right/full) join tb2 
on tb1.id = tb2.id
where condition
...

We illustrate inner, left, right and full join with the following two tables.

This is tb1.

id	c1
a	value
b	value
c	value
d	value
e	value

This is tb2.

id	c2
a	value
b	value
x	value
y	value

select common column values with “inner join”

inner join is tb1 $\cap$ tb2.

id	c1	c2
a	value	value
b	value	value

select rows from one table with “left join” or “right join”

left join selects all values from tb1’s id column.

id	c1	c2
a	value	value
b	value	value
c	value	`null`
d	value	`null`
e	value	`null`

right join selects all values from tb2’s id column.

id	c1	c2
a	value	value
b	value	value
x	`null`	value
y	`null`	value

select row union with full

full means tb $\cup$ tb2.

id	c1	c2
a	value	value
b	value	value
c	value	`null`
d	value	`null`
e	value	`null`
x	`null`	value
y	`null`	value

The outer keyword is deprecated. It is kept for backward compatibility with SQL-92 only.

left join may be less efficient than (inner) join.

Unions

  
select col from tb1
union all
select col from tb2

Sometimes we need to combine rows from two different tables. We can use union or union all for that purpose. union will only retain unique rows, while union all will retain all rows.

Union is also different from where ... or ... clause. This query

  
select device_id, gender, age, gpa from user_profile where university = "Shandong University" 
union all
select device_id, gender, age, gpa from user_profile where gender = 'male' 

outputs the set of users that come from Shandong University, then users that are male, in that order. Instead, in the output of this query

  
select device_id, gender, age, gpa 
from user_profile 
where university = "Shandong University" or gender = 'male'

users from Shandong University and users that are male are randomly mixed.

Calculating statistics with aggregate functions

  
select agg(col) as ...
from tb
where condition

function	description
`count(col)`	counts the number of rows
`min(col)`	min for all rows
`max(col)`	max for all rows
`avg(col)`	average for all rows
`sum(col)`	sum over all rows

Apply aggregate functions to different groups of rows with “group by”

Instead of aggregating across all the rows, you can also apply the aggregate functions to different groups of rows with the group by keyword.

  
select c, agg(col) as col_name
from tb
where condition
group by c

Rows are classified by values in c. In other words, rows that correspond to the same value in c belong to the same group.

Filter calculated columns with “having”

The having keyword is used to filter rows in calculated columns.

  
select id, fun(col) as col_name
(group by col)
having col_name...

Functions

Date operators & functions

Below is a list of some date operators in PostgreSQL.

operator	meaning	example	result
`+ int`	add a number of days to a date	`select date '2021-08-14' + 1`	202-08-15
`+ interval`	add an interval to a date	`select date '2001-09-28' + interval '1 hour'`	2001-09-28 01:00:00
`+ time`	add a time-of-day to a date
`- int`	subtract a number of days from a date
`- date`	subtract days, producing the number of days elapsed	`select date '2001-10-01' - date '2001-09-28'`	3
`- time`	subtract times

Some date functions in PostgreSQL and MySQL:

function	PostgreSQL	MySQL	description
`now()`	✅	✅	date and time now
`current_date`	✅		date now
`current_time`	✅		time now
`curdate()`		✅	date now
`curtime()`		✅	time now
`date()`		✅	extract date
`day()`		✅	extract day
`month()`		✅	extract month
`year()`		✅	extract year
`age(timestamp, timestap)`	✅		subtract two dates
`datediff(d1, d2)`		✅	subtract two dates
`extract(unit from timestamp)`	✅	✅	extract unit from date

A useful function might be the extract function. It can be used to extract year, month or date from datetime record, which is useful for analysis based on different dates.

  
extract(unit from date)

unit can be: microsecond, second, minute, hour, day, week, month, quarter, year.

Different databases may have different date functions. Refer to official documentations for complete list of date functions:

String operators & functions

Here we list some of the string functions in PostgreSQL. Refer to the documentation for a complete list.

function	description	example	result
`s \|\| s`	string concatenation	`'Post' \|\| 'greSQL'`	PostgreSQL
`length(s)`	length of string	`length('jose')`	4
`lower(s)`	convert string to lower case	`lower('FEMALE')`	female
`upper(s)`	convert string to upper case	`upper('female')`	FEMALE
`position(sub in s)`	location of specified substring	`position('z' in 'wxyz')`	4
`substring(s from [int] for [int])`	extract substring	`substring('abcde' from 2 for 3)`	bcd
`substring(s from [re])`	extract substring with regular expression	`substring('LxxxR xyz' from 'L.*R')`	LxxxR
`left(s, n)`	return fist n characters in the string. When $n$ is negative, return all but last $\vert n\vert$ characters.	`left('abcde', 2)`	ab
`right(s, n)`	return last $n$ characters in the string.	`right('vwxyz', 2)`	yz
`replace(s, sub, repl)`	replace all occurences of substring	`replace('WakqSnake', 'ak', 'xx')`	WxxqSnxxe
`repeat(s, n)`	repeat string $n$ times	`repeat('Pg', 4)`	PgPgPgPg
`string_to_array(s, delimiter)`	split string to array	`select (string_to_array('165cm,55kg,26,female', ','))[4]`	female

Window functions

Window functions are useful when we need to calculate something within a column. We partition a column into windows, and apply a function to each window. The calculated column can then be selected alongside with other columns.

  
select id, fun(col) over (partition by col2 order by col3) as col_name
from tb

for fun can use

aggregate functions: max, min, sum, etc.

ranking functions: row_number, rank, dense_rank, ntile, etc. They are useful for selecting the top (min or max) N records per category.

  
mysql> select
       val,
       row_number() over w as 'row_number',
       rank()       over w as 'rank',
       dense_rank() over w as 'dense_rank'
     from numbers
     window w as (order by val);
+------+------------+------+------------+
| val  | row_number | rank | dense_rank |
+------+------------+------+------------+
|    1 |          1 |    1 |          1 |
|    1 |          2 |    1 |          1 |
|    2 |          3 |    3 |          2 |
|    3 |          4 |    4 |          3 |
|    3 |          5 |    4 |          3 |
|    3 |          6 |    4 |          3 |
|    4 |          7 |    7 |          4 |
|    4 |          8 |    7 |          4 |
|    5 |          9 |    9 |          5 |
+------+------------+------+------------+

analytic functions: lag, lead. They are useful for comparing multiple rows and calculate differences between rows.

  
mysql> select
       t, val,
       lag(val)        over w as 'lag',
       lead(val)       over w as 'lead',
       val - lag(val)  over w as 'lag diff',
       val - lead(val) over w as 'lead diff'
     from series
     window w as (order by t);
+----------+------+------+------+----------+-----------+
| t        | val  | lag  | lead | lag diff | lead diff |
+----------+------+------+------+----------+-----------+
| 12:00:00 |  100 | NULL |  125 |     NULL |       -25 |
| 13:00:00 |  125 |  100 |  132 |       25 |        -7 |
| 14:00:00 |  132 |  125 |  145 |        7 |       -13 |
| 15:00:00 |  145 |  132 |  140 |       13 |         5 |
| 16:00:00 |  140 |  145 |  150 |       -5 |       -10 |
| 17:00:00 |  150 |  140 |  200 |       10 |       -50 |
| 18:00:00 |  200 |  150 | NULL |       50 |      NULL |
+----------+------+------+------+----------+-----------+

window functions may store information on a temporary database on disk, so their performance can be very slow. In many situations the old fashionedgroup by is way faster than window functions.

Examples

See https://github.com/lifeitech/sql-nowcoder.

I/O with Python

To read from and write data to database with Python is easy. Here is an example showing how to convert SQL tables to pandas DataFrames and vice versa (using PostgreSQL).

First, install the following two packages

  
pip install psycopg2-binary  # mysql would be: mysqldb or pymysql etc.
pip install sqlalchemy

We can connect to database with the create_engine() function provided by the sqlalchemy package, with

dialect+driver://username:password@host:port/database

as input url, where host is the server address, and database is the name of the database. For more details refer to sqlalchemy docs.

  
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://username:password@server:port/database')
con  = engine.connect()

# read table 'user/info' into pandas dataframe
profile = pd.read_sql('select * from user.info', con=con)

# ....

# write dataframe to a new table in db named 'analysis'
df.to_sql('analysis', con=engine, schema='user', if_exists='replace')

Here, the name “schema” is somewhat confusing, in practice it basically means the “folder” that contains the table.

Cite as:

  
@article{lifei2022sql,
  title   = "All about SQL Queries",
  author  = "Li, Fei",
  journal = "https://lifeitech.github.io",
  year    = "2022",
  url= "https://lifeitech.github.io/posts/sql/"
}

All about SQL Queries