Sunday 15 January 2017

Oracle Exadata

Oracle Exadata & Offloading 


New Year: - New Chapter, New Aspirations. Wishing you all a very Happy New Year.

Lastly I got a chance to explore Exadata and I went over documentations to understand how this machine actually works and i must say i was amazed by it’s designed. So I thought why not to share with you all.

So what is Exadata all about and what it offers?? Here is a brief about it.

Exadata precisely “The Database Machine” is high end Oracle system built using Industry standard Oracle hardware. It primarily consists 3 grids i.e. DB Server/compute grid, network grid and a storage grid. Basically it’s a great combination of Intelligent Hardware and smart Software.


Exadata Architecture
Exadata Architecture

Let’s take each component one by one in brief:

DB Server/Compute Grid: This grid is actually a Database layer; this layer is basically made up of multiple SUN servers with running Oracle Db software. RAC is supported. The database servers uses ASM to map the storage and ASM is key component of this software stack. Db Server does have one more component LIBCELL and is linked with Oracle kernel. This allows Oracle to talk to storage tier via netwrok based calls instead of reads/writes at OS level. LIBCELL is designed in such a way that its code know how to interact or requests  via iDB.

InfiniBand <Network grid>: It provide a low latency,high throughput communication link. This layer is responsible for the communication between the two tiers<Storage and DB> and here it is done via iDB< Intelligent Database protocol>, which is a network based protocol implemented using InfiniBand. iDB is used to send requests for data along with metadata about the request (including predicates) to cellsrv. iDB is built on Reliable Datagram Sockets (RDS ) protocol and runs over InfiniBand ZDP (Zero-loss Zero-copy Datagram Protocol). The objective of ZDP is to eliminate unnecessary copying of blocks.

Storage Grid: Storage consists of multiple servers and the intelligent part integrated here is the process< Cell Services(Cellsrv)> that runs as a part of storage server. Cellsrv is a multi-threaded program that services I/O requests from a database server. Those requests can be handled by returning processed data or by returning complete blocks depending in the request. Cellsrv is able to use the metadata to process the data prior sending result back to client. Such type of scans are called as Smart Scans.It is not always that Smart scan is used for each and every Db operation; there are certain situations only for which Smart Scans are triggered.

Apart from what I mentioned above there is also a term called Flash Cache; Storage server also equipped with flash based storage. Oracle point this as an Exadata Smart Flash Cache. The whole idea here is to gain on sequential read i.e. single blocks.

There are lot more per component wise; for more I request you to visit Oracle Docs.

Now lets take what Exadata offers at sql level:

One of the Key feature of Exadata is Offloading and this is basically a game changer in my view; I am really impressed in the way Oracle handles the data volume in Exadata . Offloading refers to the key concept of moving the processing from the database servers to the storage layer. By doing this Oracle gains an advantage on reducing the volume of data that must be returned to the DB server. Less the volume at DB layer less will the processing and more we can count on improvement.Smart Scan is another name of offloading and these terms are more or less interchangeable.

But why offloading is important: Large volume tends to be a problem for any system and moving the data from Hardware to DB and processing such large volume is expected to take time. Let say take this via e.g. With a block size of  8K  if you want say one row then storage end sends whole 8K block and from that block Oracle software fetches 1 row. Now 8K block can have N number of Odd rows; so from N number of rows Oracle fetched 1 row.

Now suppose you have millions of rows then you can imagine how many blocks will get transferred to satify large no of rows and this certainly becomes a bottleneck  and this is where Exadata pinch in; Exadata offloading is designed to eliminate unnecessary data between the two layers.

With Exadata Offloading we usually achieve:

(i)                  Less volume of data gets transferred to Db server from Storage End
(ii)                Less physical IO i.e less disk access.
(iii)               CPU utilization is less.

Offloading: Can be achieved by

(i)                  Predicate Filtering
(ii)                Storage Indexes
(iii)               Column Projection

But prior going over each; let see how components interacts when SQL initiates an IO in Exadata machine; here I try to come up with block diagram to show its end to end communication.

EXADATA SQL PROCESSING
Exadata SQL Processing
To understand offloading behaviour lets do a practical to see how this offloading works and whether sql’s are getting benefited or not not.

In Exadata there are few Oracle parameters that deals with offloading; which when tweaked will give you license to play in terms of Offloading. Here I am concentrating on only on cell_offload_processing which is TRUE by default and is session modifiable.

NAME_COL_PLUS_SHOW_PARAM                                                         TYPE        VALUE_COL_PLUS_SHOW_PARAM
-------------------------------------------------------------------------------- ----------- ----------------------------------------
cell_offload_compaction                                                          string      ADAPTIVE
cell_offload_decryption                                                          boolean     TRUE
cell_offload_parameters                                                          string
cell_offload_plan_display                                                        string      AUTO
cell_offload_processing                                                          boolean     TRUE
cell_offloadgroup_name                                                           string

So lets see with < cell_offload_processing =TRUE> how much time below query takes; table is having 485M rows.

/* Formatted on 2017/01/11 18:56 (Formatter Plus v4.8.8) */
SELECT COUNT (*)
  FROM smart_test
WHERE (crnt_f = 'Y');

Elapsed: 00:00:26.33 ----------------à excellent elapsed time; in non exadata machine query is expected to take more.

                                                               SQL
SQL_PLAN_LINE_ID BLOCKING_SESSION BLOCKING_SESSION_SERIAL# BLOCKING_SE Id              SAMPLE_TIME                    EVT                       CURRENT_OBJ#
---------------- ---------------- ------------------------ ----------- --------------- ------------------------------ ------------------------- ------------
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.33.240 AM      cell smart table scan           114988
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.34.250 AM      cell smart table scan           114989
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.35.250 AM      cell smart table scan           114991
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.36.250 AM      cell smart table scan           114992
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.37.250 AM      cell smart table scan           114994
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.38.250 AM      cell smart table scan           114995
               3                                           NOT IN WAIT 47ztspbv68hu0   11-JAN-17 08.23.39.250 AM      ON CPU                          114996
               3                                           NOT IN WAIT 47ztspbv68hu0   11-JAN-17 08.23.40.250 AM      ON CPU                          114998
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.41.250 AM      cell smart table scan           115000
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.42.250 AM      cell smart table scan           115001
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.43.250 AM      cell smart table scan           115003
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.44.250 AM      cell smart table scan           115005
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.45.260 AM      cell smart table scan           115006
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.46.260 AM      cell smart table scan           115008
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.47.260 AM      cell smart table scan           115010
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.48.260 AM      cell smart table scan           115011
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.49.290 AM      cell smart table scan           115013
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.50.290 AM      cell smart table scan           115015
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.51.290 AM      cell smart table scan           115016
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.52.290 AM      cell smart table scan           115018
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.53.290 AM      cell smart table scan           115020
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.54.300 AM      cell smart table scan           115021
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.55.316 AM      cell smart table scan           115023
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.56.316 AM      cell smart table scan           115025
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.57.316 AM      cell smart table scan           115027
               3                                           UNKNOWN     47ztspbv68hu0   11-JAN-17 08.23.58.316 AM      cell smart table scan           115028


You might have noticed by now; query is waited on new event ”cell smart table scan” which shows offloading is being done. Oracle uses this event to account for time spent waiting for Full Table Scans that are Offloaded. Its occurrence can be used to verify whether a statement benefited from Offloading or not. Also to notice here that query visited table in a restricted manner i.e not to much visit.

Lets take a look on execution plan; you can see ID 3 operation you can see new access method i.e. TABLE ACCESS STORAGE FULL’ which is very specific to Exadata only. It is worth to note here that this doesn’t mean smart scan is used; Storage keyword here means that this execution plan is Exadata storage aware.

--------------------------------------------------------------------------------------------------------------------------------------
                      | Id  | Operation                   | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |  OMem |  1Mem | Used-Mem |
                      --------------------------------------------------------------------------------------------------------------------------------------
                      |   0 | SELECT STATEMENT            |            |        |       |  2796K(100)|          |       |       |       |       |          |
                      |   1 |  SORT AGGREGATE             |            |      1 |     2 |            |          |       |       |       |       |          |
                      |   2 |   PARTITION RANGE ALL       |            |     61M|   117M|  2796K  (1)| 00:01:50 |     1 |    67 |       |       |          |
                      |*  3 |    TABLE ACCESS STORAGE FULL| SMART_TEST |     61M|   117M|  2796K  (1)| 00:01:50 |     1 |    67 |  1025K|  1025K|   14M (0)|
                      --------------------------------------------------------------------------------------------------------------------------------------

                      Predicate Information (identified by operation id):
                      ---------------------------------------------------

                         3 - storage("CRNT_F"='Y')
                             filter("CRNT_F"='Y')

But you can see storage clause in predicate information< storage("CRNT_F"='Y')> which shows smart scan is used and Storage grid handled all the pain i.e. major filtering is done at the storage level and DB layer had to do minimal activity<only those blocks which are not handled by cells>. This is basically Predicate Filtering.

Lets see how it works when offloading process is disabled. Exadata will behave like non exadata machine.

SQL>  alter session set cell_offload_processing=false; -------------------à disable offloading process;


/* Formatted on 2017/01/11 18:56 (Formatter Plus v4.8.8) */
SELECT COUNT (*)
  FROM smart_test
WHERE (crnt_f = 'Y');


Elapsed: 00:06:08.59--------àsee after disabling; elapsed time is in minutes from few seconds. 

-----------------------------------------------------------------------------------------------------------
                      | Id  | Operation                   | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |
                      -----------------------------------------------------------------------------------------------------------
                      |   0 | SELECT STATEMENT            |            |        |       |  2796K(100)|          |       |       |
                      |   1 |  SORT AGGREGATE             |            |      1 |     2 |            |          |       |       |
                      |   2 |   PARTITION RANGE ALL       |            |     61M|   117M|  2796K  (1)| 00:01:50 |     1 |    67 |
                      |*  3 |    TABLE ACCESS STORAGE FULL| SMART_TEST |     61M|   117M|  2796K  (1)| 00:01:50 |     1 |    67 |
                      -----------------------------------------------------------------------------------------------------------

                      Predicate Information (identified by operation id):
                      ---------------------------------------------------

                         3 - filter("CRNT_F"='Y')

Look at the predicate storage filter disappeared and that it means no smart scan and no offloading; here all the blocks are shipped to Oracle Db layer from storage layer and all the major filtering is done at Db layer instead of storage layer.

You can also see it from wait events; that query waited on direct path read and visited the table many times as compared to smart scan offloading as I mentioned above.

                                                                SQL
SQL_PLAN_LINE_ID BLOCKING_SESSION BLOCKING_SESSION_SERIAL# BLOCKING_SE  Id              SAMPLE_TIME                    EVT                       CURRENT_OBJ#
---------------- ---------------- ------------------------ ----------- --------------- ------------------------------ ------------------------- ------------
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.29.53.916 AM      direct path read                114992
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.29.54.916 AM      ON CPU                          114992
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.29.55.933 AM      ON CPU                          114993
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.29.56.933 AM      direct path read                114993
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.29.57.933 AM      ON CPU                          114993
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.29.58.933 AM      ON CPU                          114993
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.29.59.933 AM      direct path read                114993
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.00.933 AM      ON CPU                          114994
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.01.933 AM      direct path read                114994
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.02.943 AM      ON CPU                          114994
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.03.943 AM      direct path read                114994
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.04.943 AM      ON CPU                          114995
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.05.943 AM      ON CPU                          114995
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.06.943 AM      direct path read                114995
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.07.943 AM      direct path read                114996
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.08.943 AM      direct path read                114996
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.09.953 AM      direct path read                114996
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.10.953 AM      ON CPU                          114996
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.11.953 AM      ON CPU                          114997
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.12.953 AM      direct path read                114997
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.13.953 AM      direct path read                114997
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.14.953 AM      direct path read                114997
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.15.953 AM      direct path read                114998
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.16.953 AM      direct path read                114998
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.17.953 AM      ON CPU                          114998
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.18.953 AM      direct path read                114999
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.19.963 AM      direct path read                114999
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.20.963 AM      direct path read                114999
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.21.963 AM      ON CPU                          114999
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.22.963 AM      direct path read                115000
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.23.963 AM      direct path read                115000
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.24.963 AM      direct path read                115000
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.25.963 AM      direct path read                115001
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.26.963 AM      direct path read                115001
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.27.963 AM      ON CPU                          115001
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.28.973 AM      ON CPU                          115001
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.29.973 AM      direct path read                115002
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.30.973 AM      direct path read                115002
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.31.973 AM      direct path read                115002
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.32.973 AM      ON CPU                          115002
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.33.973 AM      direct path read                115003
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.34.973 AM      direct path read                115003
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.35.973 AM      direct path read                115003
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.36.973 AM      ON CPU                          115003
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.37.983 AM      ON CPU                          115004
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.38.983 AM      direct path read                115004
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.39.983 AM      direct path read                115004
               3                                           NOT IN WAIT 6fxj1rub84bsq   11-JAN-17 08.30.40.983 AM      ON CPU                          115004
               3                                           UNKNOWN     6fxj1rub84bsq   11-JAN-17 08.30.41.983 AM      direct path read                115005


Storage Indexes: Storage Indexes are not like B Tree indexes; infact there is no relationship between them.Storage Index in my view is like anti index which is designed to eliminate the disk IO as they are here to identify the location where requested data is not present rather than B tree index which are designed to pinpoint the location of requested data. Storage Index structure are of 1MB and are stored in memory and never written to disk. Storage index stores minimum and maximum column values for disk storage units; there is one FLAG sort of column which denotes whether any of the records in a storage region contain nulls.

Since Smart Scans pass the query predicates to the storage servers, and Storage Indexes contain a map of values in each 1MB storage region, any region that can’t possibly contain a matching row can be eliminated without ever being read.

Let me show you via diagram what I am trying to say

Figure 1: Denotes employee table; Figure 2 is Storage Index created on Salary predicate; you can see 3 columns minimum,maximum and FLAG column.

Suppose you executed query { SELECT EMPID from EMPLOYEE WHERE SALARY > 5000 }on exadata machine; smart scan will pass the query predicate to storage server and storage index here will help the storage to skip the first 2 part of region as they do not have maximum values that are high enough to contain any records that will satisfy the query predicate. Therefore, those storage regions will not be read from disk only last row<region> is satisfying the requested data so only that will be read.
EXADATA STORAGE INDEX
Storage index Map

cell physical IO bytes saved by storage index: is the statistics which shows how much bytes were saved. For initial run there wont be any saving as storage index map will be created on that execution; subsequent execution will save and shows how much it saved.

SQL> select avg(id) from smart_test where id2 is null;

NAME                                              VALUE
--------------------------------------------- ---------------
cell physical IO bytes saved by storage index      0

SQL> select avg(id) from smart_test where id2 is null;

NAME                                              VALUE
--------------------------------------------- ---------------
cell physical IO bytes saved by storage index   5984949248


Column Projection:    Column projection also helps in terms of achieving good performance. Columns that are part of select query are not returned only; there are join columns as well which gets returned.But again this is not something which is very unique to exadata. In non exadata machines also column projection is famous.


Ø  select sum(s. ID2),count((s. ID3)) from smart_test s, smart_test s2
where s. ID = s2. ID and s. ID4 > 0 ;


                      Column Projection Information (identified by operation id):
                      -----------------------------------------------------------

                         1 - (#keys=0) COUNT("S"." ID3")[22], SUM("S"." ID2")[22]
                         2 - (#keys=1; rowset=200) "S"." ID2"[NUMBER,22], "S"." ID3"[NUMBER,22]
                         3 - "S2"." ID"[NUMBER,22]
                         4 - "S2"." ID"[NUMBER,22]
                         5 - (rowset=200) "S"." ID"[NUMBER,22], "S"." ID2"[NUMBER,22],
                             "S"." ID3"[NUMBER,22]
                         6 - (rowset=200) "S"." ID"[NUMBER,22], "S"." ID2"[NUMBER,22],
                             "S"." ID3"[NUMBER,22]

You can see join columns and select columns are part of projection but not all columns selected i.e. ID4 column is not part of projection so this is excellent in terms of performance gain.

I hope you like this brief cover on Exadata. For more I request you to go over documentation as this machine has lot to offer.

Enjoy Learning!!!!