Saturday 17 September 2011

IN V/S EXISTS


IN v/s Exists:  Which one is Appropriate/Faster?


Background:

IN: It’s a multiple-row operator used in a query.


Select * From table_name Where column_name IN( value1,Value2)

EXISTS: Operator is frequently used with correlated sub-queries to test whether a value retrieved by the outer query exists in the results set of the values retrieved by the inner query.


Select column1, column2, FROM   table1 outer
 Where EXISTS (Select ‘X’ FROM    table2 Where   column1=outer. column2);


Note that the inner SELECT query does not need to return a specific value, so a constant can be selected. From a performance standpoint, it is faster to select a constant than a column.

Correlated Sub Queries: Correlated sub-queries are used for row-by-row processing. Each sub-query is executed once for every row of the outer query.





Before proceeding further with the topic I would like to show you “How operators (IN/EXISTS) are processed internally”.

IN Operator

Select * from t1 Where x IN (Select y from t2 )

is typically processed as:

Select * From t1, (Select distinct y From T2 ) t2
 Where t1.x = t2.y;

The sub-query is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table -- typically.


Exists Operator :

Select * from t1 Where exists ( Select null from t2 Where y = x )

That is processed more like:

   for x in ( Select * From t1 )
   loop
      if ( EXISTS ( Select null From t2 Where y = x.x )
      then
      output the records
      end if
   end loop

Note here: Whenever Exists is used, always Full Scan of outer table is performed but the first query can make use of an index.

So which is more appropriate/faster and when?

To deal with this I created one test plan as below( One table is big and other is relatively small)

Create Table test As Select * From all_objects;

Create Index test_idx On test(object_id);

Create Table test1 As Select * From all_objects Where Rownum < 1000;

Create Index test1_idx On test2(object_id);

Analyze Table test Compute Statistics For Table For All Indexes For All Indexed Columns;

Analyze Table test1 Compute Statistics For Table For All Indexes For All indexed Columns;

SQL> Select Count(*) From test;

              COUNT(*)
           ------------------
              1306432


 SQL> Select Count(*) From test1;

                COUNT(*)
               ----------------
                    999

So, I executed my test plan.

SQL> Set Timing On

SQL>Select Count (subobject_name) From test Where EXISTS (Select null From test1 where test1.object_id = test.object_id);

COUNT (SUBOBJECT_NAME)
--------------------------------------------
                    0

Elapsed: 00:00:02.94

SQL>Select Count (subobject_name) From test Where object_id IN (Select object_id From test)


COUNT(SUBOBJECT_NAME)
-----------------------------------------------
                    0

Elapsed: 00:00:02.34

Clearly a 60 micro second’s difference between the queries.

In 1st statement: (Outer query is using test table (which is bigger than test1 ) Here each sub-query is executed once for every row of the outer querySo a full table scan is probing a table scan in the inner query and it clearly shows it is inefficient to use Exist operator here

Whereas,

In 2nd query the inner query is accessing small table test1 and executing it without bothering about the outer query.

 So, above result clearly shows if the outer query is "big" and the inner query is "small, IN operator is generally more efficient then EXISTS.


SQL>Select Count (subobject_name) From test1 Where EXISTS (Select null From test Where test1.object_id = test.object_id)

COUNT(SUBOBJECT_NAME)
--------------------------------------------
                    0

Elapsed: 00:00:02.34


SQL>Select Count (subobject_name) From test1 Where object_id IN (Select object_id From test)

COUNT(SUBOBJECT_NAME)
---------------------------------------------
                    0

Elapsed: 00:00:02.84

Difference of 50 micro seconds.

In 1st statement: (Outer query is using test1 table (which is smaller than test table) Here each sub-query is executed once for every row of the outer query. the table test1 is relatively small and executing (Select null From test Where test.object_id = test1.object_id) is fast (nice index on test(object_id)). Then the EXISTS will be faster as the time to full scan test and do the index probe into test could be less than the time to simply full scan test to build the sub-query we need to distinct on.

Whereas,

In 2nd query the inner query is accessing bigger table test and executing it without bothering about the outer query.Clearly we can see IN is inappropriate in this condition.

So, above result clearly shows if the outer query is "small" and the inner query is "big”, EXISTS operator is generally more efficient then IN.

A very important point to note here:

Where EXISTS will find the first row faster in general then the IN.

The IN will get the last row (all rows) faster than the where exists.  If your optimizer mode is set to FIRST_ROW then EXISTS might totally out cast IN.

If you find this article interesting then please provide feedback through your comments. 


2 comments:

Gokul said...

Great!!!.... thanks a ton!! Now I am more clear about this things... :)

Deviprasad Tummidi said...

Wow.. good info dude.. thank u..

Post a Comment