Thinking Out Loud

June 11, 2011

Instrumentation versus Monitoring

Filed under: oracle — mdinh @ 2:47 pm

Please help share advice for my constant battle with instrumentation versus monitoring.

The requirements:

  • The response time analysis and alerting should be automated such that when a process has executed, it should know the timing of the previous execution.
  • If the response time for the current execution exceeds the prior by a defined percentage threshold, then send alert.

Example:

There are many java processes with 10,000+ executions against Oracle database with a normal response of 1 sec.

If the process normally completes in 1 seconds and now takes 5 seconds (exceeding threshold of of 4x), then alert needs to be sent.

Would instrumentation be better?

If there is no available resource to instrument the code, then would monitoring suffice?

My preference would be instrumentation and I would appreciate a different point of view.

Operating System provides method for instrumentation:

> time du -sh /oracle/
 3.6T   /oracle

real    0m0.456s
user    0m0.006s
sys     0m0.065s

Oracle provides method for instrumentation and here is my test case:

SQL> declare
  2   l_sla_warning  constant NUMBER := 10; -- response time in hsecs
  3   l_sla_failed   constant NUMBER := 20; -- response time in hsecs
  4   l_start  NUMBER := dbms_utility.get_time;
  5   l_date   DATE;
  6   l_timing NUMBER;
  7  begin
  8    for x in 1..10000 loop
    select sysdate into l_date from dual;
  9   10    end loop;
 11    l_timing := dbms_utility.get_time - l_start;
 12    case
 13     when l_timing > l_sla_failed then
          dbms_output.put_line('FAILED! Elapsed time in hsecs '||l_timing||' exceeds sla_failed of '||l_sla_failed);
 14   15      when l_timing > l_sla_warning then
 16       dbms_output.put_line('WARNING! Elapsed time in hsecs '||l_timing||' exceeds sla_warning of '||l_sla_warning);
        else null;
 17   18    end case;
end;
 19   20  /
++++++++++ FAILED! Elapsed time in hsecs 34 exceeds sla_failed of 20 ++++++++++ (manually added + for emphasis)

PL/SQL procedure successfully completed.

SQL> declare
  2   l_sla_warning  constant NUMBER := 10; -- response time in hsecs
  3   l_sla_failed   constant NUMBER := 20; -- response time in hsecs
  4   l_start  NUMBER := dbms_utility.get_time;
  5   l_date   DATE;
  6   l_timing NUMBER := 0;
  7  begin
  8    for x in 1..100 loop
  9      select sysdate into l_date from dual;
 10    end loop;
  l_timing := dbms_utility.get_time - l_start;
 11   12    case
 13     when l_timing > l_sla_failed then
 14       dbms_output.put_line('FAILED! Elapsed time in hsecs '||l_timing||' exceeds sla_failed of '||l_sla_failed);
 15      when l_timing > l_sla_warning then
 16       dbms_output.put_line('WARNING! Elapsed time in hsecs '||l_timing||' exceeds sla_warning of '||l_sla_warning);
 17     else null;
 18    end case;
 19  end;
 20  /

PL/SQL procedure successfully completed.

SQL> declare
  2   l_sla_warning  constant NUMBER := 10; -- response time in hsecs
  3   l_sla_failed   constant NUMBER := 20; -- response time in hsecs
  4   l_start  NUMBER := dbms_utility.get_time;
  5   l_date   DATE;
  6   l_timing NUMBER;
  7  begin
  8    for x in 1..5000 loop
    select sysdate into l_date from dual;
  9   10    end loop;
 11    l_timing := dbms_utility.get_time - l_start;
 12    case
        when l_timing > l_sla_failed then
 13   14          dbms_output.put_line('FAILED! Elapsed time in hsecs '||l_timing||' exceeds sla_failed of '||l_sla_failed);
 15      when l_timing > l_sla_warning then
 16       dbms_output.put_line('WARNING! Elapsed time in hsecs '||l_timing||' exceeds sla_warning of '||l_sla_warning);
 17     else null;
 18    end case;
end;
 19  / 20
++++++++++ WARNING! Elapsed time in hsecs 18 exceeds sla_warning of 10  ++++++++++ (manually added + for emphasis)

PL/SQL procedure successfully completed.

SQL>

I have received a recommendation to use ASH (Oracle’s Diagnostic & Tuning Pack feature)

Advertisements

6 Comments »

  1. Unless I totally misunderstand something then instrumentation gives you the means to detect SLA violations but you still need to monitor for the violation. In your testcase, Oracle didn’t do the detection for you, you had to code that in.
    For instrumentation have a look at the ILO package from HOTSOS. AFAIK they have put it on sourceforge ( google for ILO ). I am not sure how applicable it is for Java or how easy, or not, it is to retrofit it to existing code.

    For myself, I have for years run a perl script which monitors active sessions and extracts as much information as it can ( and the current Oracle version provides ) about the sql in each active session ( and some inactive ones ). It has been invaluable to detect, and especially show/prove, performance regression particularly due to an upgrade ( of Oracle or the application ) or even “simply” due to a patch.

    Just over the past week or so I have, at the suggestion of a client, added an alert which mails sql_ids which exceed a certain elapsed time threshold, at the moment every 3 hours and by program type but that could easily be modified to an array of thresholds by sql_id and check and e-mail more frequently.

    Comment by Wolfgang Breitling — June 11, 2011 @ 3:48 pm | Reply

    • @Wolfgang

      I absolutely agree with your statement.
      In the simple test case I created, the SLA requirements were hard coded.
      However, in actual implementation, I would think to keep the information in the database.

      There as to be a baseline and a defined SLA to compare against the baseline.

      Comment by mdinh — June 12, 2011 @ 2:47 pm | Reply

  2. I see instrumentation as the ability to show what is happening and monitoring as the ability to see what is happening. Without application instrumentation, you can fall back to Oracle’s built-in instrumentation. That doesn’t allow you to understand what is happening at the application level.

    One question is what does the ‘5 second rule’ apply to. A session may call a single procedure that consists of 5 SQLs that each run for 2 seconds, or that same session may call the 5 SQLs directly. Either mechanism may mean the operation takes 10 seconds to complete. Alternatively, an SELECT may be returning rows in batches of 50, and might send a total of 500 rows. Does the ‘5 second rule’ apply to returning each batch, or the full data set. What if the process has multiple nested cursors ?

    I’d probably start with monitoring v$transaction based on start_date and maybe last_call_et on v$session ( see http://hemantoracledba.blogspot.com/2010/08/trying-to-understand-lastcallet-1.html )

    Ultimately you may be in the position of stating “This is what I can monitor. If you want any more, then you’ll need to instrument YOUR code.”.

    Comment by Gary — June 12, 2011 @ 7:09 am | Reply

    • @Gary

      I will correct the post to provide more clarity.
      I used 5 seconds as arbitrary maximum threshold. It could be that 2 secs is too slow since 2 secs response time is double the normal 1 sec response time for the same process.
      The analysis and alerting should be automated such that when a process has executed, it should know the timing of the last execution.
      If the response time for the next execution exceeds the prior by a defined percentage threshold, then send alert.

      Comment by mdinh — June 12, 2011 @ 2:46 pm | Reply

  3. I’d say that you instrument code, and you monitor your instrumentation.
    That could be active monitoring (driving alerts) or passive (ad-hoc diagnostics).

    I would be reluctant to put alerting in the application code, because you are only going to get an alert after the event. Maybe in the example of 1 second vs 5 seconds that’s OK, but what about 1 hour vs 5 hours. Decoupling your alerting where possible is going to be better.

    Comment by rnm1978 — June 14, 2011 @ 12:34 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: