Tuning Distributed Queries in Oracle

When it comes to SQL statements and optimizing queries on relational databases, probably the first thing developers (ought to) look at is the execution plan. The execution plan shows you what the database engine thinks is the best way to execute a query and it gives estimates of relevant runtime indicators that influenced the optimizer’s decision.
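
For reference, you can inspect the plan of any statement with EXPLAIN PLAN and the DBMS_XPLAN package; the statement and table name below are mere placeholders:

EXPLAIN PLAN FOR
  SELECT *
  FROM   some_tab;

SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY);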

When a query involves calls to remote databases, you may not always get the best execution plan available: by default Oracle runs the query from the local database, and it has no reliable way of estimating the cost of the network traffic, hence no way of weighing the pros and cons of running your query remotely versus locally. Many tips and tricks have been published by gurus and, of course, by Oracle itself, but I was recently asked to tune a query that involved more than the textbook cases typically shown online.

Background

On an 11gR2 Oracle database that I managed, a particular query ran in about 120 seconds. The statement was meant to run continuously, as it pulled in data for a data warehouse that had to operate in near real time; a lag of about five minutes was considered acceptable. Since many other things had to happen between runs of that one query, two minutes was far from ideal.

Our query pulled in data from three tables (tab1, tab2, and tab3) on a remote database through a database link and combined it with data from the local database (local_vw). On the remote source (remote_db) we had no privileges to create (materialized) views, indexes, functions, or procedures. We were merely allowed read-only access to the data through a database link. In essence, the query was our only chance of getting the data we needed. I have outlined the basic structure of the query below.

Table tab1 had about 150,000 rows, the combination of tab2 and local_vw (view) fewer than 10,000, and tab3 approximately 300,000; the whole query returned roughly 145,000 rows. There were no indexes on tab1 that were relevant to the query, and all the join columns of tab2 and tab3 were in separate indexes; there were no covering indexes that we could use.

SELECT
  ...
, some_plsql_replacement_function(...)
FROM
  (
    SELECT
      ...
    , ROW_NUMBER() OVER ( PARTITION BY ... ORDER BY ... ) AS rnum
    FROM
      (
        SELECT
          ...
        , some_plsql_extraction_function(...)
        FROM
          (
            SELECT
              ...
            FROM
              tab1@remote_db rt1

            UNION ALL

            SELECT
              ...
            FROM
              tab2@remote_db rt2
            JOIN
              local_vw lv
            USING (...)
          )
      ) tb
    JOIN tab3@remote_db rt3
    USING (...)
  )
WHERE
  rnum = 1;

Observations and Tweaks

As always, let’s take a look at the execution plan. The original execution plan looks schematically as follows:

----------------------------------------------
| Id  | Operation                 | Name     |
----------------------------------------------
|   0 | SELECT STATEMENT          |          |
|   1 |  VIEW                     |          |
|   2 |   WINDOW SORT PUSHED RANK |          |
|   3 |    NESTED LOOPS OUTER     |          |
|   4 |     VIEW                  |          |
|   5 |      UNION-ALL            |          |
|   6 |       REMOTE              | tab1     |
|   7 |       NESTED LOOPS        |          |
|   8 |        VIEW               | local_vw |
| ... |                           |          |
|  26 |        REMOTE             | tab2     |
|  27 |     VIEW                  |          |
|  28 |      REMOTE               | tab3     |
----------------------------------------------

The operations with IDs 9 through 25 pertain to the view local_vw; they have been collapsed into an ellipsis because they are not relevant to the situation at hand.

From the plan table I got the distinct impression that the nested loops (operations 3 and 7) would be the bottlenecks. Nested loops can be inefficient, but the common ‘wisdom’ that they are always bad is nonsense; the non-leading table of the topmost nested loop (i.e. tab3) had all the appropriate indexes to keep the nested-loop operation happy (for more details I refer to Tony Hasler’s Expert Oracle SQL, p. 266). Adding the USE_HASH and LEADING hints made no difference either, which confirmed that the nested loops were not the real problem, so I was not too fazed by their appearance in the execution plan.
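
For the record, the hints went into the query block that joins the unioned data to tab3, roughly like this, with the placeholder column lists as in the outline above:

    SELECT /*+ leading(tb) use_hash(rt3) */
      ...
    , ROW_NUMBER() OVER ( PARTITION BY ... ORDER BY ... ) AS rnum
    FROM
      ( ... ) tb
    JOIN tab3@remote_db rt3
    USING (...)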

I then commented out the parts belonging to the local database (i.e. everything below and including the UNION ALL in the innermost subquery) and removed the PL/SQL functions, since they were only available on the local database. When I ran the resulting query directly on the remote database, it took slightly more than 10 seconds.
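
Schematically, that stripped-down test looked roughly like this (no database link suffixes, no local_vw branch, no PL/SQL calls):

SELECT
  ...
FROM
  (
    SELECT
      ...
    , ROW_NUMBER() OVER ( PARTITION BY ... ORDER BY ... ) AS rnum
    FROM
      tab1 rt1
    JOIN tab3 rt3
    USING (...)
  )
WHERE
  rnum = 1;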

My suspicion was thus that the data transfer was responsible for the problem. I added the DRIVING_SITE hint to the query and, voilà, it ran in about 95% of the original runtime. Still not exactly lightning fast…
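
For those unfamiliar with it, the DRIVING_SITE hint names the alias of a table accessed over a database link and tells Oracle to execute the query at that table’s site rather than locally. In its general form it looks roughly like this (the table names are placeholders):

SELECT /*+ driving_site(rt) */
  ...
FROM
  local_tab lt
JOIN
  remote_tab@remote_db rt
USING (...);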

At that point I remembered something I had done to make the query run faster on the remote database: take out the PL/SQL functions. Whenever a SQL statement calls a PL/SQL user-defined function (UDF), a context switch occurs: the SQL engine passes the baton to the PL/SQL engine, which in turn comes up with an answer before surrendering control to the SQL engine. Most of the time you only incur a negligible penalty, but perhaps there was more to it than that.

Thankfully, the PL/SQL functions consisted of nothing but built-in functions, including multiple levels of nested calls to REPLACE, SUBSTR and INSTR, so I could write whatever they did as part of the SQL statement itself. Sure, it kills modularity, but performance rules all, so I tested my suspicion. Indeed, the modified query needed only 90% of the original runtime. Again, about 110 seconds is not exactly stellar, but it is an improvement.
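
As an illustration (the actual functions are not reproduced here), a hypothetical extraction UDF that merely wraps built-ins can be replaced by its own body inside the SQL statement:

-- Hypothetical UDF: pull out the text between the first pair of brackets.
CREATE OR REPLACE FUNCTION extract_code (p_text IN VARCHAR2)
  RETURN VARCHAR2
  DETERMINISTIC
IS
BEGIN
  RETURN SUBSTR(p_text,
                INSTR(p_text, '[') + 1,
                INSTR(p_text, ']') - INSTR(p_text, '[') - 1);
END;
/

-- Inlined equivalent: no PL/SQL call, hence no context switch.
SELECT SUBSTR(some_col,
              INSTR(some_col, '[') + 1,
              INSTR(some_col, ']') - INSTR(some_col, '[') - 1) AS extracted_code
FROM   some_tab;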

In summary, adding DRIVING_SITE shaves off 5 percentage points and replacing PL/SQL with built-in functions shaves off 10 percentage points. By the way, rewriting the inline views as factored subqueries did nothing for the performance. I looked at the analytic function ROW_NUMBER too but found nothing that indicated a performance problem associated with it.

Actually, queries in which we only want one row per group can often be tuned by using MIN/MAX with KEEP ( DENSE_RANK FIRST/LAST ORDER BY ... ) and a GROUP BY clause that lists whatever is in the original ROW_NUMBER’s PARTITION BY clause. When many columns are involved, the repeated calls to the FIRST/LAST aggregate function can hurt the legibility of your SQL statement, so it is often best to first obtain the ROWID of the first or last row you want from each group and then inner-join your original table with the (temporary) table or factored subquery that contains those ROWIDs; both variants are sketched below.

The reason you can sometimes see a performance gain from such a rewrite is that the analytic ROW_NUMBER function relies on an insertion-based sort that is quite memory-intensive, whereas the FIRST/LAST aggregate only retains one row per group and therefore consumes relatively little memory. In some instances the rewrite is straightforward; in others it is a pain in the tush that requires extensive validation afterwards, and the ROWID trick then cuts down on the typing considerably. In my particular case the rewrite did not do much for performance because memory was not an issue with the table sizes in question.
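
A minimal sketch of both variants, with hypothetical names (some_tab, grp_col, ord_col, val_col) standing in for the real objects:

-- KEEP (DENSE_RANK LAST) rewrite: one aggregate call per column you want to keep.
SELECT grp_col
,      MAX(val_col) KEEP ( DENSE_RANK LAST ORDER BY ord_col ) AS val_col
FROM   some_tab
GROUP BY grp_col;

-- ROWID variant: fetch only the ROWID of the last row per group, then join back.
SELECT t.*
FROM   some_tab t
JOIN   (
         SELECT MAX(ROWID) KEEP ( DENSE_RANK LAST ORDER BY ord_col ) AS rid
         FROM   some_tab
         GROUP BY grp_col
       ) last_rows
ON     t.ROWID = last_rows.rid;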

Solution

And now for the killer: when I combined both tweaks the query ran in 10% of the original runtime, in just 12 seconds! The outline of the final query is shown below.

SELECT
  ...
, builtin_replacement_function(...)
FROM
  (
    SELECT
      ...
    , ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) AS rnum
    FROM
      (
        SELECT
          ...
        , builtin_extraction_function(...)
        FROM
          (
            SELECT
              /*+ driving_site(rt1) */
              ...
            FROM
              tab1@remote_db rt1

            UNION ALL

            SELECT
              ...
            FROM
              tab2@remote_db rt2
            JOIN
              local_vw lv
            USING (...)
          )
      ) tb
    JOIN tab3@remote_db rt3
    USING (...)
  )
WHERE
  rnum = 1;

The associated, simplified execution plan is shown below; the VIEW operation (ID 9) has two REMOTE children that are omitted for brevity.

------------------------------------------------
| Id  | Operation                       | Name |
------------------------------------------------
|   0 | SELECT STATEMENT REMOTE         |      |
|   1 |  VIEW                           |      |
|   2 |   WINDOW SORT PUSHED RANK       |      |
|   3 |    HASH JOIN RIGHT OUTER        |      |
|   4 |     TABLE ACCESS FULL           | tab3 |
|   5 |     VIEW                        |      |
|   6 |      UNION-ALL                  |      |
|   7 |       TABLE ACCESS FULL         | tab1 |
|   8 |       NESTED LOOPS              |      |
|   9 |        VIEW                     |      |
| ... |                                 |      |
|  21 |        PARTITION RANGE ITERATOR |      |
|  22 |         TABLE ACCESS FULL       | tab2 |
------------------------------------------------

So, why would the combination deliver such a massive performance boost? It is really quite simple once you think about it. The DRIVING_SITE hint makes sense on its own: the remote database holds the bulk of the data, so forcing it to do the heavy lifting, instead of dragging all those rows across the network to the local database, is clearly sensible.

Now imagine what happens when, with the DRIVING_SITE hint in place, we call a UDF that lives in the local database. We push our local data to the remote database, which then has to call a PL/SQL function on the local instance: the local database switches contexts, the PL/SQL function is evaluated, the context is switched back, and the result is thrown over the fence to the remote database. Only when the remote database is done with everything does it fax the result set back to the local database.

Context switches typically cause small, almost negligible performance hits. When larger amounts of data on a remote database are involved, however, they can become significant: since the function does not exist on the remote database and must be called locally, Oracle has to go back and forth between the two sites, even when you add an appropriate /*+ driving_site(...) */ hint to your query.

The built-in functions, on the other hand, are available on every Oracle database and do not cause context switches. By combining both tweaks we get the best of both worlds: the remote database bears the brunt of the work and does not have to ask the local database for function evaluations it can handle itself. By adding /*+ no_merge */ to the subquery with the extraction function I was able to knock off another half a second; this hint stops Oracle from merging the inline view into the surrounding query block, so the inline view is evaluated as a unit before the rest of the query (see the sketch below). All in all, the main benefit clearly lies in the combination of the DRIVING_SITE hint and the use of built-in functions rather than user-defined functions that exist only on the local database.
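
Schematically, the hint sits at the top of the inline view that contains the extraction logic, with the same placeholders as before:

    (
      SELECT /*+ no_merge */
        ...
      , builtin_extraction_function(...)
      FROM
        ( ... )
    ) tb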