Understanding JOINs in SQL
Understanding JOINs in SQL
by Boxplot Sep 1, 2019
Joins in SQL
Joins are one of the most important (if not THE most important) concepts in SQL. If you take the time to solidly understand how joins work, you’ll be in an excellent place for writing queries. We also have an in-depth article about performing multiple joins. So, let’s dive in!
Join Definitions
Joining tables in SQL is a way of combining them. It is directly comparable to a VLOOKUP in Excel, so if you are familiar wtih VLOOKUPs you already understand how joins work. For example, let’s say you have a table that contains orders placed by your customers (each row in that table represents a unique order). Then you have a second table that is only items that have been returned by your customers (each row in that table represents a unique order too, but not all orders are returned so the table has fewer rows).
Orders Table
order_id | product | cost_per_unit | total_amount | profit | shipping_method |
---|---|---|---|---|---|
1 | A | 5 | 70 | 5 | Air |
2 | B | 10 | 120 | 97 | Air |
3 | C | 15 | 150 | 135 | Air |
4 | A | 5 | 20 | 10 | Air |
5 | B | 10 | 110 | 96 | Air |
6 | C | 15 | 30 | 5 | Air |
7 | A | 5 | 45 | 9 | Air |
8 | B | 10 | 150 | 64 | Air |
9 | C | 15 | 210 | 63 | Air |
10 | A | 5 | 65 | 44 | Air |
11 | B | 10 | 120 | 76 | Air |
12 | C | 15 | 30 | 18 | Air |
13 | A | 5 | 20 | 19 | Ground |
14 | B | 10 | 100 | 59 | Ground |
15 | C | 15 | 30 | 14 | Ground |
16 | A | 5 | 55 | 15 | Ground |
17 | B | 10 | 10 | 1 | Ground |
18 | C | 15 | 105 | 28 | Ground |
19 | A | 5 | 75 | 37 | Ground |
20 | B | 10 | 50 | 41 | Ground |
21 | C | 15 | 15 | 9 | Ground |
22 | A | 5 | 30 | 22 | Ground |
23 | B | 10 | 120 | 117 | Ground |
24 | C | 15 | 195 | 166 | Ground |
25 | A | 5 | 20 | 1 | Ground |
Returns Table
id | amount_returned | Reason |
---|---|---|
15 | 10 | Broken |
2 | 115 | No Reason Given |
24 | 85 | Unsatisfied |
7 | 15 | Unsatisfied |
12 | 20 | Unsatisfied |
9 | 130 | Unsatisfied |
17 | 10 | No Reason Given |
21 | 15 | No Reason Given |
18 | 70 | Broken |
17 | 10 | Broken |
You can’t just copy and paste these tables next to each other to figure out what was returned, because the rows won’t match up. You need a way of comparing each row in the orders table to the returns table to see if a match exists, and then bring back the information from the returns table that you’re interested in. Enter joins!
Let’s take a look at join syntax first:
SELECT *
FROM table_1 INNER JOIN table_2
ON table_1.id = table_2.id
Notice the ON statement, which is required for every join. You need to pick a column from each table that you want SQL to compare in order to join them together, and that’s what the ON statement is specifying. This example says compare the id column in table_1 to the id column in table_2 when trying to match up rows between the two tables during the join.
How do you choose which colunns to use for the ON statement? Generally you want to pick a column that uniquely identifies each row, like an id number. The two columns don’t have to have the exact same name, but they need to be the same type (for example, integer or text) and represent the same thing. The order_id columnn from the orders table above is the same thing as the id column in the returns table – they both represent the id of the order.
There are four types of joins: left, right, inner and outer. The right join functionally does the same thing as the left join, so many SQL programs are choosing not even to support it anymore. For that reason, we won’t be covering it in this article.
As usual, we’ll start with a basic example. Say you have a database with the following two tables in it:
nutrition1
id | name | calcium | serving_size | weight |
---|---|---|---|---|
1 | apples | 10 | 1 ea | 138 |
2 | asparagus | 22 | 1/2 cup | 90 |
3 | avocado | 19 | 1 ea | 173 |
4 | bamboo shoots | 10 | 1 cup | 131 |
5 | banana | 7 | 1 ea | 114 |
6 | beets | 9 | 1/2 cup | 85 |
7 | blackberries | 46 | 1 cup | 144 |
8 | Blueberries raw | 9 | 1 cup | 145 |
9 | broccoli | 205 | 1 spear | 180 |
10 | brussels sprouts | 56 | 1 cup | 156 |
nutrition2
id | percent_water | energy | protein | food_type |
---|---|---|---|---|
1 | 84 | 80 | 0.3 | FRUITS |
2 | 92 | 22 | 2.3 | VEGETABLES AND LEGUMES |
3 | 73 | 305 | 4 | FRUITS |
4 | 94 | 25 | 2.3 | VEGETABLES AND LEGUMES |
5 | 74 | 105 | 1.2 | FRUITS |
11 | 90 | 94 | 2.4 | FRUITS |
12 | 88 | 31 | 0.7 | VEGETABLES AND LEGUMES |
13 | 92 | 15 | 1.2 | VEGETABLES AND LEGUMES |
14 | 95 | 6 | 0.3 | VEGETABLES AND LEGUMES |
15 | 81 | 49 | 0.8 | FRUITS |
The nutrition1
table contains information about foods, and the nutrition2
table contains different information about foods. We’re going to join on the id column (this will be our ON statement) because it’s the unique identifier for each row in both of the tables. So, take note that nutrition1
has id numbers 1-10, but nutrition2
has id numbers 1-5 and then skips to 11-15.
Inner Joins
When you inner join two different tables together, SQL will compare the columns you specify in the ON statement and only keep rows that those columns have in common. Looking at our nutrition tables, the only id numbers that the two have in common are 1 through 5. So when we run this query:
SELECT *
FROM nutrition1
INNER JOIN nutrition2
ON nutrition1
.id = nutrition2
.id
we’ll get this result:
id | name | calcium | serving_size | weight | id | percent_water | energy | protein | food_type |
---|---|---|---|---|---|---|---|---|---|
1 | apples | 10 | 1 ea | 138 | 1 | 84 | 80 | 0.3 | FRUITS |
2 | asparagus | 22 | 1/2 cup | 90 | 2 | 92 | 22 | 2.3 | VEGETABLES AND LEGUMES |
3 | avocado | 19 | 1 ea | 173 | 3 | 73 | 305 | 4 | FRUITS |
4 | bamboo shoots | 10 | 1 cup | 131 | 4 | 94 | 25 | 2.3 | VEGETABLES AND LEGUMES |
5 | banana | 7 | 1 ea | 114 | 5 | 74 | 105 | 1.2 | FRUITS |
Cool! Now we have more information than we previously did about apples, asparagus, etc. Notice that the result displays 1-5 from nutrition1
, lined up next to those columns 1-5 from nutrition2
, and everything else is gone. No other rows appear in the result because inner joins only keep what the two tables have in common, and nothing else.
Left Joins
When you perform a left join, SQL keeps everything from the left table, no matter what, and then only pulls in information from the right table that matches the left based on the columns you specify in the ON statement. So when we run this query:
SELECT *
FROM nutrition1
INNER JOIN nutrition2
ON nutrition1
.id = nutrition2
.id
we’ll get this result:
id | name | calcium | serving_size | weight | id | percent_water | energy | protein | food_type |
---|---|---|---|---|---|---|---|---|---|
1 | apples | 10 | 1 ea | 138 | 1 | 84 | 80 | 0.3 | FRUITS |
2 | asparagus | 22 | 1/2 cup | 90 | 2 | 92 | 22 | 2.3 | VEGETABLES AND LEGUMES |
3 | avocado | 19 | 1 ea | 173 | 3 | 73 | 305 | 4 | FRUITS |
4 | bamboo shoots | 10 | 1 cup | 131 | 4 | 94 | 25 | 2.3 | VEGETABLES AND LEGUMES |
5 | banana | 7 | 1 ea | 114 | 5 | 74 | 105 | 1.2 | FRUITS |
6 | beets | 9 | 1/2 cup | 85 | NULL | NULL | NULL | NULL | NULL |
7 | blackberries | 46 | 1 cup | 144 | NULL | NULL | NULL | NULL | NULL |
8 | Blueberries raw | 9 | 1 cup | 145 | NULL | NULL | NULL | NULL | NULL |
9 | broccoli | 205 | 1 spear | 180 | NULL | NULL | NULL | NULL | NULL |
10 | brussels sprouts | 56 | 1 cup | 156 | NULL | NULL | NULL | NULL | NULL |
For this result, we have the entire nutrition 1 table (id numbers 1-10) and lined up next to it the rows from the nutrition2
table that matched base on the id (just 1-5). Notice the NULL values. The nutrition2
table didn’t have rows for id numnbers 6-10, but we had to keep those rows because the definition of a left join tells us to keep everything from the LEFT table, so the nutrition2
columns for id numbers 6-10 get filled in with NULLs.
Outer Joins
Finally, let’s take a look at what an outer join does. Here’s the query:
SELECT *
FROM nutrition1
INNER JOIN nutrition2
ON nutrition1
.id = nutrition2
.id
and the result:
id | name | calcium | serving_size | weight | id | percent_water | energy | protein | food_type |
---|---|---|---|---|---|---|---|---|---|
1 | apples | 10 | 1 ea | 138 | 1 | 84 | 80 | 0.3 | FRUITS |
2 | asparagus | 22 | 1/2 cup | 90 | 2 | 92 | 22 | 2.3 | VEGETABLES AND LEGUMES |
3 | avocado | 19 | 1 ea | 173 | 3 | 73 | 305 | 4 | FRUITS |
4 | bamboo shoots | 10 | 1 cup | 131 | 4 | 94 | 25 | 2.3 | VEGETABLES AND LEGUMES |
5 | banana | 7 | 1 ea | 114 | 5 | 74 | 105 | 1.2 | FRUITS |
6 | beets | 9 | 1/2 cup | 85 | NULL | NULL | NULL | NULL | NULL |
7 | blackberries | 46 | 1 cup | 144 | NULL | NULL | NULL | NULL | NULL |
8 | Blueberries raw | 9 | 1 cup | 145 | NULL | NULL | NULL | NULL | NULL |
9 | broccoli | 205 | 1 spear | 180 | NULL | NULL | NULL | NULL | NULL |
10 | brussels sprouts | 56 | 1 cup | 156 | NULL | NULL | NULL | NULL | NULL |
NULL | NULL | NULL | NULL | NULL | 11 | 90 | 94 | 2.4 | FRUITS |
NULL | NULL | NULL | NULL | NULL | 12 | 88 | 31 | 0.7 | VEGETABLES AND LEGUMES |
NULL | NULL | NULL | NULL | NULL | 13 | 92 | 15 | 1.2 | VEGETABLES AND LEGUMES |
NULL | NULL | NULL | NULL | NULL | 14 | 95 | 6 | 0.3 | VEGETABLES AND LEGUMES |
NULL | NULL | NULL | NULL | NULL | 15 | 81 | 49 | 0.8 | FRUITS |
Outer joins keep everything, no matter what, and line up rows that match. Here, we have the entire nutrition1
table, and next to it the nutrition2
table but broken up so that it matches the id numbers of nutrition1
properly.
Takeaways
Understanding the definitions of the joins and how each fundamentally works is very important in SQL. Usually, you’ll be working with so much data that you have to join the tables together in SQL and can’t just pull everything into Excel and do a VLOOKUP. Excel can only handle so many rows (a little over a million) and so many columns (a little over 16,000). That sounds like a lot but SQL databases can hold millions and with more and more data being generated and available every day, there’s a good chance that even at a small company you’ll be dealing with more information than can fit into Excel. Sometimes too, your data will fit into Excel but the VLOOKUP will be so computationally taxing that Excel will simply crash or go so slow it isn’t worth it. SQL is built for stuff like this, so it will compute joins much faster. Check out my understanding SQL blog post for a more in-depth discussino of what SQL is and when to use it.