Understanding Joins
Joins
are statements that retrieve data from more than one table. A join is
characterized by multiple tables in the FROM clause, and the relationship
between the tables is defined through the existence of a join condition in the WHERE clause. In a
join, one row set is called inner, and the other is called outer.
This
section discusses:
· Nested
Loop Joins
· Hash
Joins
· Sort
Merge Joins
Nested
Loop Joins
Nested loop joins are useful when small
subsets of data are being joined and if the join condition is an efficient way
of accessing the second table.
It
is very important to ensure that the inner table is driven from (dependent on)
the outer table. If the inner table's access path is independent of the outer
table, then the same rows are retrieved for every iteration of the outer loop,
degrading performance considerably.
Internally
Nested loop works this way
In this algorithm, an
outer loop is formed which consists of few entries and then for each entry, and
inner loop is processed.
Ex:
Select tab1.*, tab2.* from tabl, tab2 where tabl.col1=tab2.col2;
It is processed like:
For i in (select *
from tab1) loop
For j in (select * from tab2 where col2=i.col1) loop
Display results;
End loop;
End loop;
For j in (select * from tab2 where col2=i.col1) loop
Display results;
End loop;
End loop;
The Steps involved in doing nested loop are
a) Identify outer (driving) table
b) Assign inner (driven) table to outer table.
c) For every row of outer table, access the rows of inner
table.
NESTED
LOOPS
outer_loop
inner_loop
outer_loop
inner_loop
When
optimizer uses nested loops?
Optimizer
uses nested loop when we are joining tables containing small number of rows
with an efficient driving condition. It is important to have an index on column
of inner join table as this table is probed every time for a new value from
outer table.
The optimizer uses nested loop joins when joining small number of rows,
with a good driving condition between the two tables. You drive from the outer
loop to the inner loop, so the order of tables in the execution plan is
important.
The outer loop is the driving row source. It produces a set of rows for
driving the join condition. The row source can be a table accessed using an
index scan or a full table scan. Also, the rows can be produced from any other
operation. For example, the output from a nested loop join can be used as a row
source for another nested loop join.
Optimizer may not use nested loop in case:
- No of rows of both the table is quite high
- Inner query always results in same set of records
- The access path of inner table is independent of data coming from
outer table.
Note: You will see more use of nested loop when using FIRST_ROWS optimizer
mode as it works on model of showing instantaneous results to user as they are
fetched. There is no need for selecting caching any data before it is returned
to user.
To check how optimizer decides the nested loop. Follow test plan. All test plan are tested on Oracle version 10.2.
To check how optimizer decides the nested loop. Follow test plan. All test plan are tested on Oracle version 10.2.
SQL> Create table e as Select * from employees;
Table created.
SQL> Create table d as select * from
departments;
Table created.
SQL> Create index e_deptno on
e(department_id);
Index created
For better execution plan -gather stats
on tables and index.
SQL> exec
dbms_stats.gather_table_stats('HR','D')
PL/SQL procedure successfully completed
One may set it artificially as wellJ
Set artificial stats for E:
SQL> exec
dbms_stats.set_table_stats(ownname => 'HR', tabname => 'E', numrows =>
100,
numblks => 100, avgrlen => 124);
PL/SQL procedure successfully completed.
Set artificial stats for E_DEPTNO index
SQL> exec
dbms_stats.set_index_stats(ownname => 'HR', indname => 'E_DEPTNO',
numrows =>100, numlblks => 10);
PL/SQL procedure successfully completed.
With less number of rows we will see
optimizer probes nested loop
SQL>explain plan for select
e.first_name,d.department_name from e, d where e.department_id=d.department_id;
SQL> select * from
table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3204653704
--------------------------------------------------------------------------------------------------------------------------
| Id |
Operation |
Name | Rows | Bytes | Cost (%CPU)|
Time |
---------------------------------------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | 100
| 4100 | 6 (0)|
00:00:01 |
| 1 | TABLE ACCESS BY INDEX
ROWID|
E | 4
| 100 | 1 (0)|
00:00:01 |
| 2 | NESTED
LOOPS | | 100
| 4100 | 6 (0)|
00:00:01 |
| 3
| TABLE ACCESS
FULL |
D | 27
| 432 | 3 (0)|
00:00:01 |
PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------------------
|* 4
| INDEX RANGE
SCAN | E_DEPTNO
| 4
| | 0 (0)|
00:00:01 |
---------------------------------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------
4 -
access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
Looking at execution plan its clear the query
is using nested loop.
Hash Joins
Hash joins are used for joining large data
sets. The optimizer uses the smaller of two tables or data sources to build a
hash table on the join key in memory. It then scans the larger table, probing
the hash table to find the joined rows.
This method is best used when the smaller
table fits in available memory. The cost is then limited to a single read pass
over the data for the two tables.
Build phase
For each row RW1 in small (left/build) table
loop
Calculate hash value on RW1 join key
Insert RW1 in appropriate hash bucket.
End loop;
Calculate hash value on RW1 join key
Insert RW1 in appropriate hash bucket.
End loop;
Probe Phase
For each row RW2 in big (right/probe) table
loop
Calculate the hash value on RW2 join key
For each row RW1 in hash table loop
If RW1 joins with RW2
Return RW1, RW2
End loop;
End loop;
Calculate the hash value on RW2 join key
For each row RW1 in hash table loop
If RW1 joins with RW2
Return RW1, RW2
End loop;
End loop;
When the Optimizer Uses Hash Joins
The optimizer uses a hash join to join two
tables if they are joined using an equijoin and if either of the
following
conditions are true:
· A
large amount of data needs to be joined.
· A
large fraction of a small table needs to be joined.
Note: You may see more hash joins used with ALL_ROWS optimizer mode, because
it works on model of showing results after all the rows of at least one of the
tables are hashed in hash table.
To check how optimizer decides the hash loop. Go through the test plan.
Set
artificial stats for E , D table and index(setting higher number of rows
1000000)
SQL>
exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'E', numrows
=> 1000000, numblks => 10000, avgrlen => 124);
PL/SQL
procedure successfully completed.
SQL>
exec dbms_stats.set_index_stats(ownname => 'HR', indname => 'E_DEPTNO',
numrows => 1000000, numlblks => 1000);
PL/SQL
procedure successfully completed.
SQL>
exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'D', numrows
=> 1000000,numblks => 10000 , avgrlen => 124);
PL/SQL
procedure successfully completed.
Now
we have 1000000 number of rows in E and D table both and index on E(DEPTNO) reflects
the same.
SQL>explain
plan for select e.first_name,d.department_name from e, d where
e.department_id=d.department_id;
SQL>
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------
Plan
hash value: 1127375450
--------------------------------------------------------------------------------------------------------------------
|
Id |
Operation | Name | Rows | Bytes |TempSpc|
Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 37G| 1414G| | 663K (99)| 02:12:45 |
|* 1 | HASH JOIN | | 37G| 1414G| 26M| 663K (99)| 02:12:45 |
| 2 | TABLE ACCESS FULL|
D | 1000K| 15M| | 2232 (2)| 00:00:27 |
| 3 | TABLE ACCESS FULL|
E | 1000K| 23M| | 2264 (4)| 00:00:28 |
---------------------------------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------
Predicate
Information (identified by operation id):
-------------------------------------------------------------------------------------------------------------------
1 -
access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
Clearly
we can see it is Using Hash Join.
Sort Merge Joins
Sort merge joins can
be used to join rows from two independent sources. Hash joins generally perform
better than sort merge joins. On the other hand, sort merge joins can perform
better than hash joins if both of the following conditions exist:
· The row sources are sorted already.
· A sort operation does not have to be done.
However, if a sort-merge join involves choosing a slower access method
(an index scan as opposed to a full table scan), then the benefit of using a
sort merge might be lost.
Sort merge joins are useful when the join condition between two tables
is an inequality condition (but not a non-equality) like <, <=, >, or
>=. Sort merge joins perform better than nested loop joins for large data
sets.
You cannot use hash joins unless there is an equality condition.
In a merge join, there is no concept of a driving table. The join
consists of two steps:
1. Sort join operation: Both the inputs are sorted on the
join key.
2. Merge join operation: The sorted lists are merged
together.
The full operation is done in two parts:
get first row RW1 from input1
get first row RW2 from input2.
get first row RW2 from input2.
while not at the end of either input loop
if RW1 joins with RW2
get next row R2 from input 2
return (RW1, RW2)
else if RW1 < style=""> get next row RW1 from input 1
else
get next row RW2 from input 2
end loop
if RW1 joins with RW2
get next row R2 from input 2
return (RW1, RW2)
else if RW1 < style=""> get next row RW1 from input 1
else
get next row RW2 from input 2
end loop
When
the Optimizer Uses Sort Merge Joins
The optimizer can choose a sort merge join over a hash join for joining
large amounts of data if any of the following conditions are true:
· The join condition between two tables is not an
equi-join.
· Because of sorts already required by other operations,
the optimizer finds it is cheaper to use a sort merge than a hash join.
Note: Sort merge join can be seen with
both ALL_ROWS and FIRST_ROWS optimizer hint because it works on a model of
first sorting both the data sources and then start returning the results. So if
the data set is large and you have FIRST_ROWS as optimizer goal, optimizer may
prefer sort merge join over nested loop because of large data. And if you have
ALL_ROWS as optimizer goal and if any inequality condition is used the SQL,
optimizer may use sort-merge join over hash join
Now
to test merge join, set subside number of rows and perform some sorting.
Num
of rows say e=10000 and d=1000
SQL>
exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'E', numrows
=> 10000, numblks => 1000, avgrlen => 124);
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_index_stats(ownname => 'HR', indname => 'E_DEPTNO', numrows => 10000, numlblks => 100);
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'D', numrows => 1000, numblks => 100, avgrlen => 124);
PL/SQL procedure successfully completed.
SQL>explain plan for select e.first_name,d.department_name from e, d where e.department_id=d.department_id order by e.department_id;
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_index_stats(ownname => 'HR', indname => 'E_DEPTNO', numrows => 10000, numlblks => 100);
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'D', numrows => 1000, numblks => 100, avgrlen => 124);
PL/SQL procedure successfully completed.
SQL>explain plan for select e.first_name,d.department_name from e, d where e.department_id=d.department_id order by e.department_id;
SQL>
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------
Plan
hash value: 915894881
---------------------------------------------------------------------------------------------------------------------------
|
Id |
Operation |
Name | Rows | Bytes | Cost
(%CPU)| Time |
----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | 370K| 14M| 136 (7)| 00:00:02 |
| 1 | MERGE
JOIN | | 370K| 14M| 136 (7)| 00:00:02 |
| 2 | TABLE ACCESS BY INDEX
ROWID| E | 10000 | 244K| 104 (1)| 00:00:02 |
| 3 | INDEX FULL
SCAN | E_DEPTNO | 10000
| | 100 (0)| 00:00:02 |
|* 4 | SORT
JOIN | | 1000 | 16000
| 25 (4)| 00:00:01 |
| 5 | TABLE ACCESS
FULL |
D | 1000 | 16000
| 24 (0)| 00:00:01 |
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------
Predicate
Information (identified by operation id):
--------------------------------------------------------------------------------------------------------------------------------
4 -
access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
filter("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")So
above query uses Merger Join.
When
I discussed nested loop above, I used a term called driving table. So, what is
driving table all about?
The
'driving' table is the table we will join FROM -- that is JOIN TO other
tables. For example, lets say you have the query:
SQL>select
* from emp, dept where emp.deptno = dept.deptno;
In
this case the driving table might be DEPT, we would fetch rows from DEPT and
then find the rows in EMP that match. DEPT is the driving table.
The
choice of a driving table made using many factors:
Suppose
PUBLICATIONS has PUB_ID (the primay key) and PUB_TITLE, and there are 100 publications. Now, suppose DOCUMENTS has DOC_ID (the primary key),DOC_TITLE, and PUB_ID (foreign key to PUBLICATIONS.PUB_ID). Suppose there are 1,000,000 documents.
So, the PUBLICATIONS table is much smaller, it should be the driving table, right?
So, the PUBLICATIONS table is much smaller, it should be the driving table, right?
No,
not necessarily. What if the query is something like:
SQL>select pub.pub_title, doc.doc_id, doc.doc_title from documents doc,
publications pub
where doc.pub_id = pub.pub_id and doc.doc_id = :doc_id;
So, in this case, do you want to drive from the PUBLICATIONS table? No, there are 100 rows that would be returned, based on the predicates in the where clause. If you drive from DOCUMENTS, how many rows will be returned? Either 1 or 0, due to the 'doc.doc_id = :doc_id' predicate. So, in this case, DOCUMENTS is your driving table, even though it has 1,000,000 rows and PUBLICATIONS only has 100. In the context of this query, DOCUMENTS is much more selective!
In general, you want to look at the tables in the join, look at filtering predicates that will go against each table, consider if there is an index driving that filter, and determine which table will produce the smallest row-source with the least amount of work.
where doc.pub_id = pub.pub_id and doc.doc_id = :doc_id;
So, in this case, do you want to drive from the PUBLICATIONS table? No, there are 100 rows that would be returned, based on the predicates in the where clause. If you drive from DOCUMENTS, how many rows will be returned? Either 1 or 0, due to the 'doc.doc_id = :doc_id' predicate. So, in this case, DOCUMENTS is your driving table, even though it has 1,000,000 rows and PUBLICATIONS only has 100. In the context of this query, DOCUMENTS is much more selective!
In general, you want to look at the tables in the join, look at filtering predicates that will go against each table, consider if there is an index driving that filter, and determine which table will produce the smallest row-source with the least amount of work.