In our previous lecture we introduced the basic parts of a query giving filtering only a single paragraph. Filtering is a bigger topic and this lesson will teach you some more ways to slice and dice your tables so you can extract useful knowledge from them. We will learn about conditions, propositional logic and ways to filter our data on columns which contain textual information.
Everything of the form
column_name <operator> <value> or
column_name <operator> other_column_name is called a condition. Here are some conditions:
survived = 1
pclass = 1
sex = 'male'
sex are columns,
= is an operator and
'male' are values.
Any column can be used as a condition, as long as it part of a table of course. The list of operators is fixed while the kind of values you can use depends on the data type. For now we will learn about number and text data types while more advanced cases will be explained in a future lecture.
The most basic operator is equals (
=), which checks if left and right side are equal. This operator can be used for both numbers and text.
> mean strictly less than and strictly more than. The behavior is same as in mathematics. These operators make sense on numbers. They also make sense on text but for now assume
> don’t work for text even if they do. The behavior on textual data is explained in a future lecture.
You can also use
>= which are less or equal and more or equal.
The final operator is
<> which is not equal. It might seem weird and it is, but because keyboards don’t have an easy way to type the ≠ character the people making SQL thought that
<> is the best way to represent the not equals intent.
There are a few rules to remember when dealing with numbers and text:
Here is an example where you see how we can compare values from two column.
SELECT * FROM datasets.titanic WHERE survived = pclass
What do you think is the result of this and why?
We will teach you the 3 keywords which give every
WHERE statement the power it needs to shine bright. We will start with
AND, go over
ON, continue with
NOT, and finalize on the ways to combine them.
AND has the same meaning as in English language. When we say that a passenger survived, had a 1rd class cabin and was a male we have three conditions of whom all 3 must be true for our output table to include a passenger. The following query gives us all entities which satisfy all these 3 conditions.
SELECT * FROM datasets.titanic WHERE survived = 1 AND pclass = 1 AND sex = 'male'
We can have as many conditions and string together as many
AND as we want.
The behavior of
AND is described with the following table:
In essence you will get a row back only if both conditions are true. When you have more than 2 conditions like in our previous example you will get the row back only when all conditions are true, that is passenger survived and passenger was in first class and passenger was a male.
In the world of SQL
OR follows the same meaning as in regular life and mathematics. Some entity is considered fit for being part of our output table if one of the conditions are true. For example if we have the question “We want just 1st class and 3rd class passengers” our conditions would look like:
pclass = 1
pclass = 3
Our query will then look like:
SELECT * FROM datasets.titanic WHERE pclass = 1 OR pclass = 3
What do you think would happen if we were to use
AND instead of
OR in the query?
For reference here is the truth table for
Thus the only way an entity is not considered fit to be part of the output table is if both conditions are not satisfied.
The meaning of
IN is essentially the same as the meaning of a few conditions linked with
OR except that it is easier to write queries with IN.
An example is worthy of a thousand definitions so here is one:
SELECT * FROM datasets.titanic WHERE pclass IN (1, 3)
The query above and the query below are absolutely identical in the results they will produce.
SELECT * FROM datasets.titanic WHERE pclass = 1 OR pclass = 3
Here is another example where using
OR will end in a lot of typing but
IN handles the issue well.
Assume the question: “Find all passengers whose age is exactly a multiple of 10, that is 10, 20, 30, 40, 50, …”
SELECT * FROM datasets.titanic WHERE age IN (10, 20, 30, 40, 50, 60, 70, 80, 90)
The alternative is to do:
age = 10 OR age = 20 OR age = 30 OR age = 40 etc.
This operation is similar to
IN with the main exception that
IN takes a discrete set of values to check against while
BETWEEN takes a continous range. Perhaps an example will clarify this best:
“Who are the passengers who have paid a fare between $10 and $20?”
SELECT * FROM datasets.titanic WHERE fare BETWEEN 10 AND 20
As you can see the translation from English language went really smooth with our only change being that we don’t use the dollar sign in our query. Keep in mind that this
AND here is related to the
AND from before by the relation:
fare BETWEEN 10 AND 20 is same as
fare >= 10 AND fare <= 20
It is prefered to use
BETWEEN because it is easier to read the query that way.
Negation of True is False and negation of False is True. Negating constants is not very useful but negating variables and expressions can be very useful. Take the following example: “Find all passengers who were not of age 24”.
There are two ways to think about this condition:
NOT age = 20(
age <> 20)
age = 1 OR age = 2 OR age = 3 ... OR age = 23 OR age = 25 ... OR age = 100(notice we skiped 24)
It is obvious which one is easier to work with. In this simple example it might not seem that important but as you write more complex queries a clever use of
NOT might make your
WHERE blocks much shorter while still giving correct results.
Logical negation is the way to link
OR with the following formula holding true:
x OR y = ((NOT x) AND (NOT y))
All questions we were able to ask ourselves had to be based on numbers or on simple equality tests. This limitation stops right now when we learn about
LIKE and it’s closely related cousin
The beauty of, or horror, of text is that unlike numbers it comes in rather nonuniform structure. Take for example the number 1. When you write as a number it is always 1 but when written as text it can be ‘ONE’, ‘one’, ‘ace’, ‘AcE’ or many other possibilities. Take also names, is it John or Jon?
To combat these issues
ILIKE were developed.
ILIKE is the case insensitive brother of
LIKE so everything we know about
LIKE applies to
In it’s most basic form
LIKE acts exactly as the equals operator (
=). Here is an example query showing that:
SELECT * FROM datasets.titanic WHERE sex LIKE 'male'
The power of like comes with the characters ‘_’ and ‘%’.
The ‘%’ wild card is used much more so we will focus on it. Here is an illustrative example: “Find all people named John”.
We first need to reformulate this question as: “Find all people whose
name column contains the word ‘John’”. The probably with just writing the query as
name = 'John' is that we might miss results where there is text before or after John. For example, your data might have results like
John Doe or
D. John. So how do we comb through the data and find any instance where a value has the word John?
We can use the wildcards –
'%John%'. This means whatever text comes before John and whatever text comes after John we only care that the word John exists in there somewhere. When combining all of this we get the following query:
SELECT * FROM datasets.titanic WHERE name LIKE '%John%'
Consider another example: “Find all married passengers?”
First we define maried as having their
name column contain either ‘Mr.’ or ‘Mrs.’. Then we define two patterns ‘%Mr.%’ and ‘%Mrs.%’ and we
OR them together.
SELECT * FROM datasets.titanic WHERE name LIKE '%Mr.%' OR name LIKE '%Mrs.%'
Here is an problem for you to solve: “Find all females with two names.” Use only LIKE and a single pattern. Hint: The second name is wrapped in brackets like ‘Nasser, Mrs. Nicholas (Adele Achem)’
Data collection is not a flawless process and very often certain information is missing be that for privacy reasons, laws, faulty procedures, or failures in capturing the full data. Missing information is represented by a special NULL value which applies to both numbers and text.
In our titanic dataset we have missing data in almost all columns, most notably in
We can ignore NULL values in our queries using the
IS NOT NULL operation. Here is an example query where we ignore all passengers whose age is not known.
SELECT * FROM datasets.titanic WHERE age IS NOT NULL
SELECT * FROM datasets.titanic WHERE name LIKE '______,%'
We check for this by using six consecutive ‘_’ symbols.
SELECT name, age FROM datasets.titanic WHERE name ILIKE '%Miss%' AND age IS NOT NULL ORDER BY age DESC LIMIT 1
We ignore missing ages here because when ordering NULL is always both lowest and highest at the same time.
SELECT * FROM datasets.titanic WHERE age BETWEEN 18 AND 30 AND name ILIKE '%Henry%' AND pclass IN (2, 3) AND (survived <> 0 OR fare > 10)
This is a made up query which uses almost all operators in a single query.
The question it answers would go along the lines: “Find all passengers named/surnamed Henry older than 18, younger than 30 years, who were in either second or third class and who either survived or paid a fare greater than 10$”
You can solve this by checking if it is of length 1 and containing no spaces, is of length 2 and containing no spaces or something else.
Find all cabins which are not null and start with a ‘B’. What do you notice of passenger class to cabin number relation?
Find all males whose age is unknown and who don’t have a middle name. (Hint: check that there are not 2 or not 3 spaces in name)