Thinking Out Loud

November 15, 2012

SAN Insanity

Filed under: performance,solaris — mdinh @ 5:22 am

There was a migration to a new storage array at the DR site, which brought down all the standby databases.

After three days of downtime, the DR site was available again; media recovery was started and was painfully slow.

At first, I figured it could be slow since so many databases were performing media recovery.

After executing opatch apply for the 11.2.0.3.4 PSU on top of 11.2.0.3, which took 20 minutes, I realized it had to be the I/O subsystem.

I was not able to prove it, but I trusted my gut instinct, since nothing on the database side had changed.

It seemed pretty obvious. Of course, the system group denied all this, wanted proof, and did not bother to look at the issue.

I googled and found a way to perform a simple I/O benchmark.

time sh -c "dd if=/dev/zero of=dd-test-file bs=8k count=1000000 && sync"
1000000+0 records in
1000000+0 records out

real 14m34.625s
user 0m1.555s
sys 0m37.942s

More than 14 minutes to create an ~8 GB file. The throughput is less than 9 MB/s. This time I got the system group's attention.
Throughout the investigative process, Oracle was blamed for the I/O problem. After several hours, it was determined to be the SAN.
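
To spell out the arithmetic behind that figure (my own back-of-the-envelope numbers, not part of the original output):

1,000,000 blocks x 8 KB = 8,192,000,000 bytes, roughly 7.6 GiB written
7.6 GiB / 874.6 seconds (14m34.6s) is approximately 8.9 MiB/s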

Admittedly, I have become rusty at using system tools and wanted to know how I could have determined this myself.

In my search, I came across the following two sites and a comment:

http://prefetch.net/blog/index.php/2005/06/26/finding-busy-disks-with-iostat/

http://serverfault.com/questions/26286/what-does-it-mean-if-iostat-shows-100-busy-wait-but-the-cpu-is-completely-idle

IO Stat shouldn’t be 100%. If you have a heavy amount of disk IO, it may be high, above 50%, but exactly 100% usually means something is wrong with your IO subsystem. This has happened to me when I’ve had a hard drive in the early stages of failure, when the disk takes longer and longer to respond to requests, but still responds.
Or it could just be a very badly written application. A simple DTrace script should tell you which it is. – Craig Lewis
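
The comment mentions a simple DTrace script without showing one. A minimal sketch of such a script for Solaris 10 (my own rough attempt, not from the comment) could use the io provider to report average service time per device, so a disk that is slow but still responding stands out:

#!/usr/sbin/dtrace -s
/* Rough sketch, not from the original post: average I/O service time per device. */
#pragma D option quiet

/* Remember when each I/O was issued, keyed by the buf pointer. */
io:::start
{
        start[arg0] = timestamp;
}

/* On completion, add the elapsed time (in microseconds) to a per-device average. */
io:::done
/start[arg0]/
{
        @avg_us[args[1]->dev_statname] = avg((timestamp - start[arg0]) / 1000);
        start[arg0] = 0;
}

/* Print and reset the averages every 5 seconds. */
tick-5sec
{
        printf("\n--- average service time (us) per device ---\n");
        printa(@avg_us);
        trunc(@avg_us);
}

Run as root with dtrace -s iolat.d (iolat.d is just an arbitrary file name) while recovery is crawling; consistently high averages on the md/ssd devices would point at the storage rather than the application.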

Here is the output using: iostat -zxnM 5

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.2    0.0    0.0  0.0  0.0    0.0   15.1   0   0 ssd10
    0.0    0.2    0.0    0.0  0.0  0.0    0.0   10.7   0   0 ssd11
   11.4  145.0    0.3    1.1  0.0  8.8    0.0   56.4   0 100 7/md101
   11.4  145.0    0.3    1.1  0.0  8.8    0.0   56.4   0 100 7/md1
    0.6   75.6    0.0    0.6  0.0  2.3    0.0   29.6   0 100 6/md100
    0.6   75.6    0.0    0.6  0.0  2.3    0.0   29.6   0 100 6/md1
    5.0  143.2    0.0    1.1  0.0  5.8    0.0   38.8   0 100 4/md101
    5.0  143.2    0.0    1.1  0.0  5.8    0.0   38.8   0 100 4/md1
    0.0    0.8    0.0    0.0  0.0  0.0    0.0   53.9   0   2 1/md100
   11.4   75.8    0.1    0.6  0.0  5.2    0.0   60.0   0  99 ssd81
   13.0   69.2    0.1    0.5  0.0  3.8    0.0   46.0   0  92 ssd82
    0.0    0.8    0.0    0.0  0.0  0.0    0.0   53.9   0   2 ssd91
    5.0  143.2    0.0    1.1  0.0  5.8    0.0   38.8   0 100 ssd114
    0.0   67.4    0.0    0.5  0.0  2.0    0.0   29.5   0 100 ssd121
    0.6    8.2    0.0    0.1  0.0  0.3    0.0   29.7   0  19 ssd122
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   63.6   75.0    2.2    0.6  0.0  4.4    0.0   31.9   0 100 7/md101
   63.6   75.0    2.2    0.6  0.0  4.4    0.0   31.9   0 100 7/md1
    1.4   97.6    0.0    0.8  0.0  2.2    0.0   21.9   0 100 6/md100
    1.4   97.6    0.0    0.8  0.0  2.2    0.0   21.9   0 100 6/md1
   12.6  169.0    0.5    1.3  0.0  4.0    0.0   22.2   0 100 4/md101
   12.6  169.0    0.5    1.3  0.0  4.0    0.0   22.3   0 100 4/md1
    0.0    1.6    0.0    0.0  0.0  0.1    0.0   49.0   0   5 1/md100
   85.0   34.6    1.1    0.3  0.0  2.9    0.0   24.1   0  82 ssd81
   89.6   40.4    1.1    0.3  0.0  2.8    0.0   21.6   0  85 ssd82
    0.0    1.6    0.0    0.0  0.0  0.1    0.0   49.0   0   5 ssd91
   12.6  169.0    0.5    1.3  0.0  4.0    0.0   22.2   0 100 ssd114
    0.4   67.2    0.0    0.5  0.0  1.2    0.0   18.0   0  83 ssd121
    1.0   30.4    0.0    0.2  0.0  0.9    0.0   30.1   0  70 ssd122

Notice all the disks at 100% busy? Are there any storage experts out there who can confirm this?

UPDATE:
As a DBA, I am on a need-to-know basis; hence, I did not have all the required information.
From my understanding now, the migration was to a new server connected to the existing storage.
The system admin believes this could be a cabling issue, and the migration to the new storage is yet to come.
This is on Solaris 10.
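
If a cabling or transport problem is suspected, Solaris keeps per-device error counters that can help confirm it. A rough sketch of what to look at (not something from the original investigation; the grep pattern is just illustrative):

# soft/hard/transport error counters per device; climbing Transport Errors often point at cabling or path problems
iostat -En | grep Errors

# for fibre-channel attached storage, check HBA port state and link speed
fcinfo hba-port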

For comparison, here is the same iostat output from a healthy system:

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.2    0.0    0.0  0.0  0.0    0.0   10.1   0   0 md0
    0.0    0.2    0.0    0.0  0.0  0.0    0.0    9.6   0   0 md4
    0.0    0.2    0.0    0.0  0.0  0.0    0.0    0.3   0   0 md5
    0.0    0.2    0.0    0.0  0.0  0.0    0.0    9.5   0   0 md6
    0.0    0.2    0.0    0.0  0.0  0.0    0.0    0.5   0   0 md7
    0.0    0.2    0.0    0.0  0.0  0.0    0.0   10.0   0   0 md8
    0.0    0.8    0.0    0.0  0.0  0.0    0.0    0.3   0   0 sd1
    0.0    0.8    0.0    0.0  0.0  0.0    0.0    8.2   0   1 sd2
    7.7    0.0    5.6    0.0  0.0  0.1    0.0   10.0   0   8 ssd41
    6.5    0.2    1.6    0.0  0.0  0.0    0.0    4.8   0   3 ssd42
    0.0    0.8    0.0    0.0  0.0  0.0    0.0    0.7   0   0 ssd43
    0.2    0.0    0.0    0.0  0.0  0.0    0.0    1.7   0   0 ssd44
   10.7    0.2    6.4    0.0  0.0  0.1    0.0    7.1   0   8 ssd45
   35.1    0.2    0.3    0.0  0.0  0.2    0.0    4.7   0  17 ssd46
   27.2    0.2    0.2    0.0  0.0  0.1    0.0    2.0   0   5 ssd49
    1.4    0.2    0.0    0.0  0.0  0.0    0.0    3.4   0   1 ssd50
    5.7    1.0    0.0    0.0  0.0  0.0    0.0    2.6   0   2 ssd51
    0.0    0.8    0.0    0.0  0.0  0.0    0.0    0.6   0   0 ssd52
    3.0    6.7    0.8    0.2  0.0  0.0    0.0    2.3   0   2 ssd53
    0.6    1.0    0.0    0.0  0.0  0.0    0.0    0.9   0   0 ssd58
    0.8    1.4    0.0    0.0  0.0  0.0    0.0    1.0   0   0 ssd61
    0.0    1.8    0.0    0.0  0.0  0.0    0.0    0.6   0   0 ssd63
    0.0    0.2    0.0    0.0  0.0  0.0    0.0    0.5   0   0 ssd65
    0.4    0.0    0.0    0.0  0.0  0.0    0.0    3.5   0   0 ssd66
    4.7    0.0    4.6    0.0  0.0  0.0    0.0    7.5   0   3 ssd68
    0.6    0.0    0.6    0.0  0.0  0.0    0.0    7.2   0   0 ssd69
    6.1    0.8    4.4    0.0  0.0  0.0    0.0    4.5   0   3 ssd70
    7.5    0.2    6.2    0.0  0.0  0.0    0.0    6.2   0   5 ssd71
    0.2    0.0    0.0    0.0  0.0  0.0    0.0   10.8   0   0 ssd72
    6.3    0.6    6.2    0.0  0.0  0.0    0.0    6.4   0   4 ssd73
    8.9    0.0    5.8    0.0  0.0  0.1    0.0    8.1   0   7 ssd121
    5.9    0.4    1.9    0.0  0.0  0.0    0.0    4.1   0   3 ssd122
    0.0    1.6    0.0    0.0  0.0  0.0    0.0    0.8   0   0 ssd123
   10.1    0.2    6.3    0.0  0.0  0.1    0.0    7.5   0   8 ssd125
   38.5    0.0    0.5    0.0  0.0  0.2    0.0    5.1   0  20 ssd126

2 Comments

  1. There is no activity on your healthy system, so of course it shows no performance issue.
    On the other one, the important metric is not %busy but the service time, the IOPS, and the queuing.
    Your system does not seem to be driving a lot of I/O: around 145 writes per second on some devices, which is roughly the performance of a single disk for random I/O.
    Are you using DiskSuite on this system? Do you know the underlying configuration?

    Comment by Yc — November 16, 2012 @ 6:52 pm

    • Thanks for the information. We are not using DiskSuite, and I don’t have the underlying configuration.

      Comment by mdinh — November 29, 2012 @ 2:31 pm

