Home All about SQL Queries
Post
Cancel
All about SQL Queries

All about SQL Queries

This post serves as a reference for SQL syntax. We cover the most frequently used commands, including “case when”, “join” and functions.

A database is a software for storing data. There are many many databases available. They can be broadly classified into two categories: one is relational database, including MySQL and PostgreSQL. They other category is non-relational database, including MongoDB, Cassandra, neo4j, Dgraph and many others. Here we are concerned with SQL, the common syntax for querying relational databases.

Download a database

PostgreSQL vs MySQL

To practice writing SQL queries, you need to install a database on your computer first. You can download either MySQL or PostgreSQL. Both are open sourced, but MySQL is more widely used than PostgreSQL. Notably, PostgreSQL has more data types and functions, so it is more powerful, but it is also more complex and more difficult to learn. Query syntax for the two databases are largely similar, but not entirely the same, especially for functions.

  • To install MySQL, download an installer on the official website. Then follow the installation guide and install all the components step by step.
  • To install PostgreSQL, go to the official website and download an installer.

Download a client

You might want a GUI client software to interact with your database. With MySQL installation it comes with a default one called MySQL Workbench. For PostgreSQL the default client that comes with the installation is pgAdmin. I find pgAdmin is ok, but Workbench is hard to use. I personally find

beekeeper studio

Beekeeper Studio

to be the best one.

To play with MySQL on Mac, there is also Sequel Pro available.

Finally, there are also built-in command line clients. To start MySQL client, type

mysql --user=user_name --password db_name

in your shell. To add mysql to path in windows, add C:\Program Files\MySQL\MySQL Server x.x\bin to PATH in system environment variable, where x.x. is the version number. For PostgreSQL, the command line client is psql. To add it to path in windows, add C:\Program Files\PostgreSQL\xx\bin to the system PATH variable, where xx is the version number.

Select Data

Select columns with “select … from …”

1
2
select c1, c2, ...
from tb

Filter rows with “where”

1
2
3
select c1, c2, ...
from tb
where condition1 and/or condition2

We can use operators on columns in where conditions.

OperatorDescriptionExample
=equal tocol = 'salary'
<> / !=not equal todept != 'decision'
>greater thandate > 2022-02-15
<less thanprice < 500.00
>=greater than or equalnumber >= 2
<=less than or equalnumber <= 2
andandstate = 'GA' and amount > 1000
ororstate = 'GA' or amount > 1000
[not] betweenbetween an inclusive rangeprice between 100.00 and 500.00
[not] likestring patternname like 'Luca%'
ilikestring pattern, ignoring capitalizationname ilike '%cafe%'
[not] inset membership testcol in ('col1', 'col2', 'col3')
is [not] nullcompare to nulladdress is not null
is [not] trueboolean true value testfull_time is true
is not distinct fromis equal to value or both are nullsname is not distinct from full_name
asRename table or column name. Can be omittedselect result as r

Filter rows with “distinct”, “order by” and “limit”

1
2
3
4
5
select distinct c1, c2, ... 
from tb 
where condition
order by col asc/desc
limit [num] offset [num]

note: distinct can have slow performance.

Categorize column values with “case when”

1
2
3
4
5
6
7
8
select 
...
case 
when condition1 then label1
when condition2 then label2
when condition3 then label3
else label4
end (as col_name)

Select columns from different tables with “join”

1
2
3
4
5
select c1, c2, ...
from tb1 (inner/left/right/full) join tb2 
on tb1.id = tb2.id
where condition
...

We illustrate inner, left, right and full join with the following two tables.

This is tb1.

idc1
avalue
bvalue
cvalue
dvalue
evalue

This is tb2.

idc2
avalue
bvalue
xvalue
yvalue

select common column values with “inner join”

inner join is tb1 $\cap$ tb2.

idc1c2
avaluevalue
bvaluevalue

select rows from one table with “left join” or “right join”

left join selects all values from tb1’s id column.

idc1c2
avaluevalue
bvaluevalue
cvaluenull
dvaluenull
evaluenull

right join selects all values from tb2’s id column.

idc1c2
avaluevalue
bvaluevalue
xnullvalue
ynullvalue

select row union with full

full means tb $\cup$ tb2.

idc1c2
avaluevalue
bvaluevalue
cvaluenull
dvaluenull
evaluenull
xnullvalue
ynullvalue

The outer keyword is deprecated. It is kept for backward compatibility with SQL-92 only.

left join may be less efficient than (inner) join.

Unions

1
2
3
select col from tb1
union all
select col from tb2

Sometimes we need to combine rows from two different tables. We can use union or union all for that purpose. union will only retain unique rows, while union all will retain all rows.

Union is also different from where ... or ... clause. This query

1
2
3
select device_id, gender, age, gpa from user_profile where university = "Shandong University" 
union all
select device_id, gender, age, gpa from user_profile where gender = 'male' 

outputs the set of users that come from Shandong University, then users that are male, in that order. Instead, in the output of this query

1
2
3
select device_id, gender, age, gpa 
from user_profile 
where university = "Shandong University" or gender = 'male'

users from Shandong University and users that are male are randomly mixed.

Calculating statistics with aggregate functions

1
2
3
select agg(col) as ...
from tb
where condition
functiondescription
count(col)counts the number of rows
min(col)min for all rows
max(col)max for all rows
avg(col)average for all rows
sum(col)sum over all rows

Apply aggregate functions to different groups of rows with “group by”

Instead of aggregating across all the rows, you can also apply the aggregate functions to different groups of rows with the group by keyword.

1
2
3
4
select c, agg(col) as col_name
from tb
where condition
group by c

Rows are classified by values in c. In other words, rows that correspond to the same value in c belong to the same group.

Filter calculated columns with “having”

The having keyword is used to filter rows in calculated columns.

1
2
3
select id, fun(col) as col_name
(group by col)
having col_name...

Functions

Date operators & functions

Below is a list of some date operators in PostgreSQL.

operatormeaningexampleresult
+ intadd a number of days to a dateselect date '2021-08-14' + 1202-08-15
+ intervaladd an interval to a dateselect date '2001-09-28' + interval '1 hour'2001-09-28 01:00:00
+ timeadd a time-of-day to a date  
- intsubtract a number of days from a date  
- datesubtract days, producing the number of days elapsedselect date '2001-10-01' - date '2001-09-28'3
- timesubtract times  

Some date functions in PostgreSQL and MySQL:

functionPostgreSQLMySQLdescription
now()date and time now
current_date date now
current_time time now
curdate() date now
curtime() time now
date() extract date
day() extract day
month() extract month
year() extract year
age(timestamp, timestap) subtract two dates
datediff(d1, d2) subtract two dates
extract(unit from timestamp)extract unit from date

A useful function might be the extract function. It can be used to extract year, month or date from datetime record, which is useful for analysis based on different dates.

1
extract(unit from date)

unit can be: microsecond, second, minute, hour, day, week, month, quarter, year.

Different databases may have different date functions. Refer to official documentations for complete list of date functions:

String operators & functions

Here we list some of the string functions in PostgreSQL. Refer to the documentation for a complete list.

functiondescriptionexampleresult
s || sstring concatenation'Post' || 'greSQL'PostgreSQL
length(s)length of stringlength('jose')4
lower(s)convert string to lower caselower('FEMALE')female
upper(s)convert string to upper caseupper('female')FEMALE
position(sub in s)location of specified substringposition('z' in 'wxyz')4
substring(s from [int] for [int])extract substringsubstring('abcde' from 2 for 3)bcd
substring(s from [re])extract substring with regular expressionsubstring('LxxxR xyz' from 'L.*R')LxxxR
left(s, n)return fist n characters in the string. When $n$ is negative, return all but last $\vert n\vert$ characters.left('abcde', 2)ab
right(s, n)return last $n$ characters in the string.right('vwxyz', 2)yz
replace(s, sub, repl)replace all occurences of substringreplace('WakqSnake', 'ak', 'xx')WxxqSnxxe
repeat(s, n)repeat string $n$ timesrepeat('Pg', 4)PgPgPgPg
string_to_array(s, delimiter)split string to arrayselect (string_to_array('165cm,55kg,26,female', ','))[4]female

Related: PostgreSQL documentation on array functions

Window functions

Window functions are useful when we need to calculate something within a column. We partition a column into windows, and apply a function to each window. The calculated column can then be selected alongside with other columns.

1
2
select id, fun(col) over (partition by col2 order by col3) as col_name
from tb

for fun can use

  • aggregate functions: max, min, sum, etc.

  • ranking functions: row_number, rank, dense_rank, ntile, etc. They are useful for selecting the top (min or max) N records per category.
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    
    mysql> select
           val,
           row_number() over w as 'row_number',
           rank()       over w as 'rank',
           dense_rank() over w as 'dense_rank'
         from numbers
         window w as (order by val);
    +------+------------+------+------------+
    | val  | row_number | rank | dense_rank |
    +------+------------+------+------------+
    |    1 |          1 |    1 |          1 |
    |    1 |          2 |    1 |          1 |
    |    2 |          3 |    3 |          2 |
    |    3 |          4 |    4 |          3 |
    |    3 |          5 |    4 |          3 |
    |    3 |          6 |    4 |          3 |
    |    4 |          7 |    7 |          4 |
    |    4 |          8 |    7 |          4 |
    |    5 |          9 |    9 |          5 |
    +------+------------+------+------------+
    
  • analytic functions: lag, lead. They are useful for comparing multiple rows and calculate differences between rows.
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    
    mysql> select
           t, val,
           lag(val)        over w as 'lag',
           lead(val)       over w as 'lead',
           val - lag(val)  over w as 'lag diff',
           val - lead(val) over w as 'lead diff'
         from series
         window w as (order by t);
    +----------+------+------+------+----------+-----------+
    | t        | val  | lag  | lead | lag diff | lead diff |
    +----------+------+------+------+----------+-----------+
    | 12:00:00 |  100 | NULL |  125 |     NULL |       -25 |
    | 13:00:00 |  125 |  100 |  132 |       25 |        -7 |
    | 14:00:00 |  132 |  125 |  145 |        7 |       -13 |
    | 15:00:00 |  145 |  132 |  140 |       13 |         5 |
    | 16:00:00 |  140 |  145 |  150 |       -5 |       -10 |
    | 17:00:00 |  150 |  140 |  200 |       10 |       -50 |
    | 18:00:00 |  200 |  150 | NULL |       50 |      NULL |
    +----------+------+------+------+----------+-----------+
    

window functions may store information on a temporary database on disk, so their performance can be very slow. In many situations the old fashionedgroup by is way faster than window functions.

Examples

See https://github.com/lifeitech/sql-nowcoder.

I/O with Python

To read from and write data to database with Python is easy. Here is an example showing how to convert SQL tables to pandas DataFrames and vice versa (using PostgreSQL).

First, install the following two packages

1
2
pip install psycopg2-binary  # mysql would be: mysqldb or pymysql etc.
pip install sqlalchemy

We can connect to database with the create_engine() function provided by the sqlalchemy package, with

1
dialect+driver://username:password@host:port/database

as input url, where host is the server address, and database is the name of the database. For more details refer to sqlalchemy docs.

1
2
3
4
5
6
7
8
9
10
11
12
13
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://username:password@server:port/database')
con  = engine.connect()

# read table 'user/info' into pandas dataframe
profile = pd.read_sql('select * from user.info', con=con)

# ....

# write dataframe to a new table in db named 'analysis'
df.to_sql('analysis', con=engine, schema='user', if_exists='replace')

Here, the name “schema” is somewhat confusing, in practice it basically means the “folder” that contains the table.


Cite as:

1
2
3
4
5
6
7
@article{lifei2022sql,
  title   = "All about SQL Queries",
  author  = "Li, Fei",
  journal = "https://lifeitech.github.io",
  year    = "2022",
  url= "https://lifeitech.github.io/posts/sql/"
}
This post is licensed under CC BY 4.0 by the author.
Contents

Google's three big data technologies: GFS, Bigtable and MapReduce

Building Cross-platform Mobile Apps with React Native

Comments powered by Disqus.