Friday, August 17, 2012

When good JDE queries go bad

It's 4 a.m. and your cell phone is ringing again.  You groan because you already know why.  Your JDE job queue is backed up again, and batch jobs are running forever.

You just don't understand it.  These little jobs that normally run in under a minute sometimes take as long as an hour to complete.  And worse yet, your users in Europe have started complaining about system performance when running web applications at the same time your job queue gets backed up.  It just doesn't make any sense.  Or does it?

Unfortunately, the above scenario is fairly common when you use an Oracle database with JDE.

Here's an explanation of what can happen to throw off the Oracle optimizer and cause SQL queries to start behaving very badly.  

By default, Oracle statistics are updated during the nightly maintenance window that begins at 10 p.m.  Any table that has had enough changes since its statistics were last gathered will get fresh stats.  Also by default, Oracle uses an internal algorithm to decide when to invalidate cached SQL cursors.  (This setting can be changed to force immediate invalidation if you prefer -- DBConnect, by the way, typically does force immediate invalidation.)
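For reference, the cursor-invalidation behavior is controlled by the DBMS_STATS NO_INVALIDATE preference.  A sketch of how you might check or change it (verify against your Oracle release, and test before changing anything in production):

```sql
-- Check the current setting; DBMS_STATS.AUTO_INVALIDATE is the default,
-- meaning Oracle decides on its own when to invalidate dependent cursors
SELECT DBMS_STATS.GET_PREFS('NO_INVALIDATE') FROM dual;

-- Force immediate invalidation whenever stats are gathered
EXEC DBMS_STATS.SET_GLOBAL_PREFS('NO_INVALIDATE', 'FALSE');
```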

Once new statistics are available for a table, and any cached cursors have been invalidated, new SQL plans will be calculated the next time an SQL statement references that table.  This is a key moment, because whatever plan the Oracle engine calculates on the very first execution of the SQL statement will then be used for all subsequent executions (as long as the statement remains cached).

If, therefore, a BAD execution plan happens to be calculated, the performance will obviously be bad.

But how can this happen?

This can happen because of something called bind peeking.  Bind peeking occurs when Oracle "peeks" into the actual values used within an SQL statement's bind variables, and makes a decision on an execution plan based on these values.  That's usually a very good thing because it means that Oracle is using the ACTUAL values used in the query to provide the best possible result.

But what if the data in the underlying tables is skewed?

If the underlying data is skewed (a very common issue in JDE databases - think of your companies, divisions, doc types, etc.), then Oracle might make a great choice when an SQL statement is first executed, but that same cached plan might be a terrible choice for the next execution.

Let's look at this example on your Sales Order Detail File, F4211:

Let's say you have approximately 1 million records in this table, and the breakdown of your doc types (SDDCTO) is as follows:

S1 - 10 records
S2 - 20,000 records
S3 - 20,000 records
S4 - 10,000 records
S5 - 10,000 records
S6 - 10,000 records
S7 - 20,000 records
S8 - 10,000 records
S9 - 900,000 records
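You can see this kind of skew for yourself with a simple group-by.  (The PRODDTA schema name is an assumption here -- adjust for your own environment.)

```sql
-- Count rows per document type in Sales Order Detail
SELECT sddcto, COUNT(*) AS row_count
  FROM proddta.f4211
 GROUP BY sddcto
 ORDER BY row_count DESC;
```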

Now, let's say stats are updated on F4211 at 10:00 p.m., and then a UBE runs at, say, 11 p.m., executing an SQL statement that looks like this:
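(The exact JDE-generated statement isn't reproduced here, but it would take roughly this shape -- a select against F4211 with the doc type passed as a bind variable.  The column list is abbreviated and the bind name :KY1 is illustrative.)

```sql
-- Doc type arrives as a bind variable, so Oracle "peeks" at its value
-- the first time the statement is hard parsed
SELECT sddoco, sddcto, sdlnid, sdlitm, sduorg
  FROM proddta.f4211
 WHERE sddcto = :KY1;
```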


If the very first execution of the SQL statement specifies a value of S1 for doc type, then Oracle will very likely choose an index access path, using a composite index that includes the SDDCTO field.  This is because there are only 10 records with a value of S1, and Oracle knows it because stats were just collected.

The SQL statement will run very fast.

Now, let's say the UBE continues and now specifies a value of S9 for doc type.  The execution plan is already cached, so it will use the index on SDDCTO.  This means it will need to do an index range scan to locate 900,000 entries, and then take the resulting 900,000 ROWIDs, and go retrieve those records from the table!  This is extremely inefficient, and it would be far more efficient to do a full scan of the F4211 table.
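If you want to confirm which plan is being reused, you can pull it straight from the cursor cache.  This is a sketch using standard Oracle views and DBMS_XPLAN; the SQL_ID you plug in would come from the first query.

```sql
-- Locate the cached statement and its child cursors
SELECT sql_id, child_number, executions, buffer_gets
  FROM v$sql
 WHERE sql_text LIKE '%F4211%';

-- Display the execution plan for a given cached cursor
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL));
```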

So what can you do at 4 a.m.?

Force dependent SQL statements to be reparsed!

A very easy way to do this is to issue a null comment on the table.  Since this is a DDL statement, it will force all dependent cursors to be invalidated:
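A null comment is just an empty COMMENT ON statement.  (Again, the PRODDTA schema name is an assumption for your data schema.)

```sql
-- DDL against the table invalidates all dependent cached cursors,
-- forcing a hard parse -- and a fresh plan -- on the next execution
COMMENT ON TABLE proddta.f4211 IS '';
```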


The down side is that the currently-running query will have to complete before Oracle will start using a new plan.  Once the statement has completed, though, Oracle will reparse the SQL statement upon its next execution, and you might just find that the batch jobs take right off, and you can go right back to sleep.

Disclaimer:  We don't consider this a FIX to this type of issue, as much as a troubleshooting tactic when you get into a critical situation.  There are strategies for preventing issues like this from ever presenting themselves, and that's what we recommend our customers do when these types of issues occur.
