Sunday 16 October 2011

Understanding Joins



Understanding Joins


Joins are statements that retrieve data from more than one table. A join is characterized by multiple tables in the FROM clause, and the relationship between the tables is defined through the existence of a join condition in the WHERE clause. In a join, one row set is called inner, and the other is called outer.

This section discusses:
·         Nested Loop Joins
·         Hash Joins
·         Sort Merge Joins

Nested Loop Joins

Nested loop joins are useful when small subsets of data are being joined and if the join condition is an efficient way of accessing the second table.

It is very important to ensure that the inner table is driven from (dependent on) the outer table. If the inner table's access path is independent of the outer table, then the same rows are retrieved for every iteration of the outer loop, degrading performance considerably.

Internally Nested loop works this way
In this algorithm, an outer loop is formed which consists of few entries and then for each entry, and inner loop is processed.
Ex:
Select tab1.*, tab2.* from tabl, tab2 where tabl.col1=tab2.col2;
It is processed like:
For i in (select * from tab1) loop
For j in (select * from tab2 where col2=i.col1) loop
Display results;
End loop;
End loop;


The Steps involved in doing nested loop are


a) Identify outer (driving) table
b) Assign inner (driven) table to outer table.
c) For every row of outer table, access the rows of inner table.

NESTED LOOPS
outer_loop
inner_loop
When optimizer uses nested loops?
Optimizer uses nested loop when we are joining tables containing small number of rows with an efficient driving condition. It is important to have an index on column of inner join table as this table is probed every time for a new value from outer table.
The optimizer uses nested loop joins when joining small number of rows, with a good driving condition between the two tables. You drive from the outer loop to the inner loop, so the order of tables in the execution plan is important.
The outer loop is the driving row source. It produces a set of rows for driving the join condition. The row source can be a table accessed using an index scan or a full table scan. Also, the rows can be produced from any other operation. For example, the output from a nested loop join can be used as a row source for another nested loop join.


Optimizer may not use nested loop in case:
  1. No of rows of both the table is quite high
  2. Inner query always results in same set of records
  3. The access path of inner table is independent of data coming from outer table.
Note: You will see more use of nested loop when using FIRST_ROWS optimizer mode as it works on model of showing instantaneous results to user as they are fetched. There is no need for selecting caching any data before it is returned to user.


To check how optimizer decides the nested loop. Follow test plan. All test plan are tested on Oracle version 10.2.

SQL> Create table e as Select * from employees;

Table created.

SQL> Create table d as select * from departments;

Table created.

SQL> Create index e_deptno on e(department_id);

Index created

For better execution plan -gather stats on  tables and index.

SQL> exec dbms_stats.gather_table_stats('HR','D') 

PL/SQL procedure successfully completed

One may set it artificially as wellJ

Set artificial stats for E:

SQL> exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'E', numrows => 100,
numblks => 100, avgrlen => 124);

PL/SQL procedure successfully completed.

Set artificial stats for E_DEPTNO index

SQL> exec dbms_stats.set_index_stats(ownname => 'HR', indname => 'E_DEPTNO', numrows =>100, numlblks => 10);

PL/SQL procedure successfully completed.

With less number of rows we will see optimizer probes nested loop

SQL>explain plan for select e.first_name,d.department_name from e, d where e.department_id=d.department_id;

SQL>   select * from table(dbms_xplan.display);

 PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3204653704
--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |   100 |  4100 |     6   (0)| 00:00:01 |
|   1 | TABLE ACCESS BY INDEX ROWID| E        |     4 |   100 |     1   (0)| 00:00:01 |
|   2 |   NESTED LOOPS              |          |   100 |  4100 |     6   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL        | D        |    27 |   432 |     3   (0)| 00:00:01 |
PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------------------
|*  4 |    INDEX RANGE SCAN         | E_DEPTNO |     4 |       |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------
   4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")

Looking at execution plan its clear the query is using nested loop.

Hash Joins 

Hash joins are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows.

This method is best used when the smaller table fits in available memory. The cost is then limited to a single read pass over the data for the two tables.

In simpler terms it works like

Build phase

For each row RW1 in small (left/build) table loop
Calculate hash value on RW1 join key
Insert RW1 in appropriate hash bucket.
End loop;

Probe Phase

For each row RW2 in big (right/probe) table loop
Calculate the hash value on RW2 join key
For each row RW1 in hash table loop
If RW1 joins with RW2
Return RW1, RW2
End loop;
End loop;

When the Optimizer Uses Hash Joins

The optimizer uses a hash join to join two tables if they are joined using an equijoin and if either of the 
following conditions are true:

·         A large amount of data needs to be joined.
·         A large fraction of a small table needs to be joined.

Note: You may see more hash joins used with ALL_ROWS optimizer mode, because it works on model of showing results after all the rows of at least one of the tables are hashed in hash table.

To check how optimizer decides the hash loop. Go through the test plan.

Set artificial stats for E , D table and index(setting higher number of rows 1000000)

SQL> exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'E', numrows => 1000000, numblks => 10000, avgrlen => 124);

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.set_index_stats(ownname => 'HR', indname => 'E_DEPTNO', numrows => 1000000, numlblks => 1000);
  
PL/SQL procedure successfully completed.

SQL> exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'D', numrows => 1000000,numblks => 10000 , avgrlen => 124);

PL/SQL procedure successfully completed.

Now we have 1000000 number of rows in E and D table both and index on E(DEPTNO)  reflects the same.

SQL>explain plan for select e.first_name,d.department_name from e, d where e.department_id=d.department_id;

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------
Plan hash value: 1127375450
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    37G|  1414G|       |   663K (99)| 02:12:45 |
|*  1 |  HASH JOIN         |      |    37G|  1414G|    26M|   663K (99)| 02:12:45 |
|   2 |   TABLE ACCESS FULL| D     1000K|    15M|        2232   (2)| 00:00:27 |
|   3 |   TABLE ACCESS FULL| E     1000K|    23M|        2264   (4)| 00:00:28 |
---------------------------------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
-------------------------------------------------------------------------------------------------------------------
   1 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")

Clearly we can see it is Using Hash Join.

Sort Merge Joins

Sort merge joins can be used to join rows from two independent sources. Hash joins generally perform better than sort merge joins. On the other hand, sort merge joins can perform better than hash joins if both of the following conditions exist:
·         The row sources are sorted already.
·         A sort operation does not have to be done.

However, if a sort-merge join involves choosing a slower access method (an index scan as opposed to a full table scan), then the benefit of using a sort merge might be lost.

Sort merge joins are useful when the join condition between two tables is an inequality condition (but not a non-equality) like <, <=, >, or >=. Sort merge joins perform better than nested loop joins for large data sets. 

You cannot use hash joins unless there is an equality condition.

In a merge join, there is no concept of a driving table. The join consists of two steps:

1.      Sort join operation: Both the inputs are sorted on the join key.
2.      Merge join operation: The sorted lists are merged together.

If the input is already sorted by the join column, then a sort join operation is not performed for that row source.

The full operation is done in two parts:

Sort join operation
get first row RW1 from input1
get first row RW2 from input2.


Merge join operation
while not at the end of either input loop
if RW1 joins with RW2
get next row R2 from input 2
return (RW1, RW2)
else if RW1 < style=""> get next row RW1 from input 1
else
get next row RW2 from input 2
end loop

 When the Optimizer Uses Sort Merge Joins

The optimizer can choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:

·         The join condition between two tables is not an equi-join.
·         Because of sorts already required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join.

Note: Sort merge join can be seen with both ALL_ROWS and FIRST_ROWS optimizer hint because it works on a model of first sorting both the data sources and then start returning the results. So if the data set is large and you have FIRST_ROWS as optimizer goal, optimizer may prefer sort merge join over nested loop because of large data. And if you have ALL_ROWS as optimizer goal and if any inequality condition is used the SQL, optimizer may use sort-merge join over hash join

Now to test merge join, set subside number of rows and perform some sorting.

Num of rows say e=10000 and d=1000

SQL> exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'E', numrows => 10000, numblks => 1000, avgrlen => 124);

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.set_index_stats(ownname => 'HR', indname => 'E_DEPTNO', numrows => 10000, numlblks => 100);

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.set_table_stats(ownname => 'HR', tabname => 'D', numrows => 1000, numblks => 100, avgrlen => 124);

PL/SQL procedure successfully completed.

SQL>explain plan for select e.first_name,d.department_name from e, d where e.department_id=d.department_id order by e.department_id;

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------
Plan hash value: 915894881
---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   370K|    14M|   136   (7)| 00:00:02 |
|   1 | MERGE JOIN                  |          |   370K|    14M|   136   (7)| 00:00:02 |
|   2 |   TABLE ACCESS BY INDEX ROWID| E        | 10000 |   244K|   104   (1)| 00:00:02 |
|   3 |    INDEX FULL SCAN           | E_DEPTNO | 10000 |       |   100   (0)| 00:00:02 |
|*  4 |   SORT JOIN                  |           1000 | 16000 |    25   (4)| 00:00:01 |
|   5 |    TABLE ACCESS FULL         | D         1000 | 16000 |    24   (0)| 00:00:01 |
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
--------------------------------------------------------------------------------------------------------------------------------
   4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
       filter("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")So above query uses Merger Join.

When I discussed nested loop above, I used a term called driving table. So, what is driving table all about?

The 'driving' table is the table we will join FROM -- that is JOIN TO other tables.  For example, lets say you have the query:

SQL>select * from emp, dept where emp.deptno = dept.deptno;

In this case the driving table might be DEPT, we would fetch rows from DEPT and then find the rows in EMP that match.  DEPT is the driving table.

The choice of a driving table made using many factors:

Suppose PUBLICATIONS has PUB_ID (the primay key) and PUB_TITLE, and there are 100 publications. Now, suppose DOCUMENTS has DOC_ID (the primary key),DOC_TITLE, and PUB_ID (foreign key to PUBLICATIONS.PUB_ID). Suppose there are 1,000,000 documents.

So, the PUBLICATIONS table is much smaller, it should be the driving table, right?
No, not necessarily. What if the query is something like:
SQL>select pub.pub_title, doc.doc_id, doc.doc_title  from documents doc, publications pub
where doc.pub_id = pub.pub_id  and doc.doc_id = :doc_id;

So, in this case, do you want to drive from the PUBLICATIONS table? No, there are 100 rows that would be returned, based on the predicates in the where clause. If you drive from DOCUMENTS, how many rows will be returned? Either 1 or 0, due to the 'doc.doc_id = :doc_id' predicate. So, in this case, DOCUMENTS is your driving table, even though it has 1,000,000 rows and PUBLICATIONS only has 100. In the context of this query, DOCUMENTS is much more selective!

In general, you want to look at the tables in the join, look at filtering predicates that will go against each table, consider if there is an index driving that filter, and determine which table will produce the smallest row-source with the least amount of work.