A migration to a new storage array for the DR site brought down all of the standby databases.
After three days of downtime, the DR site was available again; media recovery was started and was painfully slow.
At first, I figured it could be slow since so many databases were performing media recovery.
After running opatch apply for the 11.2.0.3.4 PSU on top of 11.2.0.3, which took 20 minutes, I realized it had to be the I/O subsystem.
I was not able to prove it, but I trusted my gut instinct: nothing on the database side had changed.
It seemed pretty obvious. Of course, the system group denied all of this, wanted proof, and did not bother to look into the issue.
I googled and found a way to perform a simple I/O benchmark.
time sh -c "dd if=/dev/zero of=dd-test-file bs=8k count=1000000 && sync"
1000000+0 records in
1000000+0 records out
real 14m34.625s
user 0m1.555s
sys 0m37.942s
More than 14 minutes to create an ~8 GB file. The throughput works out to less than 9 MB/s. This time I got the system group's attention.
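For the record, the read side can be checked the same way. A minimal sketch, reading back the test file created above (keep in mind the result may be inflated by the filesystem cache unless the file is flushed out first):

time sh -c "dd if=dd-test-file of=/dev/null bs=8k"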
Throughout the investigative process, Oracle was getting blamed for the I/O. After several hours, the problem was determined to be the SAN.
Admittedly, I have become rusty with system tools and wanted to know how I could have determined this myself.
I came across the following two sites and a comment in my search:
http://prefetch.net/blog/index.php/2005/06/26/finding-busy-disks-with-iostat/
http://serverfault.com/questions/26286/what-does-it-mean-if-iostat-shows-100-busy-wait-but-the-cpu-is-completely-idle
IO Stat shouldn’t be 100%. If you have a heavy amount of disk IO, it may be high, above 50%, but exactly 100% usually means something is wrong [with] your IO subsystem. This has happened to me when I’ve had a hard drive in the early stages of failure, when the disk takes longer and longer to respond to requests, but still responds.
Or it could just be a very badly written application. A simple DTrace script should tell you which it is. – Craig Lewis
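In that spirit, here is a minimal sketch of the kind of DTrace check the comment alludes to, using the stock io provider (run as root; press Ctrl-C after a few seconds). It counts block I/O requests by process name, which is enough to tell whether Oracle or something else is driving the load:

dtrace -n 'io:::start { @[execname] = count(); }'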
Here is the output from iostat -zxnM 5:
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.2 0.0 0.0 0.0 0.0 0.0 15.1 0 0 ssd10
0.0 0.2 0.0 0.0 0.0 0.0 0.0 10.7 0 0 ssd11
11.4 145.0 0.3 1.1 0.0 8.8 0.0 56.4 0 100 7/md101
11.4 145.0 0.3 1.1 0.0 8.8 0.0 56.4 0 100 7/md1
0.6 75.6 0.0 0.6 0.0 2.3 0.0 29.6 0 100 6/md100
0.6 75.6 0.0 0.6 0.0 2.3 0.0 29.6 0 100 6/md1
5.0 143.2 0.0 1.1 0.0 5.8 0.0 38.8 0 100 4/md101
5.0 143.2 0.0 1.1 0.0 5.8 0.0 38.8 0 100 4/md1
0.0 0.8 0.0 0.0 0.0 0.0 0.0 53.9 0 2 1/md100
11.4 75.8 0.1 0.6 0.0 5.2 0.0 60.0 0 99 ssd81
13.0 69.2 0.1 0.5 0.0 3.8 0.0 46.0 0 92 ssd82
0.0 0.8 0.0 0.0 0.0 0.0 0.0 53.9 0 2 ssd91
5.0 143.2 0.0 1.1 0.0 5.8 0.0 38.8 0 100 ssd114
0.0 67.4 0.0 0.5 0.0 2.0 0.0 29.5 0 100 ssd121
0.6 8.2 0.0 0.1 0.0 0.3 0.0 29.7 0 19 ssd122
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
63.6 75.0 2.2 0.6 0.0 4.4 0.0 31.9 0 100 7/md101
63.6 75.0 2.2 0.6 0.0 4.4 0.0 31.9 0 100 7/md1
1.4 97.6 0.0 0.8 0.0 2.2 0.0 21.9 0 100 6/md100
1.4 97.6 0.0 0.8 0.0 2.2 0.0 21.9 0 100 6/md1
12.6 169.0 0.5 1.3 0.0 4.0 0.0 22.2 0 100 4/md101
12.6 169.0 0.5 1.3 0.0 4.0 0.0 22.3 0 100 4/md1
0.0 1.6 0.0 0.0 0.0 0.1 0.0 49.0 0 5 1/md100
85.0 34.6 1.1 0.3 0.0 2.9 0.0 24.1 0 82 ssd81
89.6 40.4 1.1 0.3 0.0 2.8 0.0 21.6 0 85 ssd82
0.0 1.6 0.0 0.0 0.0 0.1 0.0 49.0 0 5 ssd91
12.6 169.0 0.5 1.3 0.0 4.0 0.0 22.2 0 100 ssd114
0.4 67.2 0.0 0.5 0.0 1.2 0.0 18.0 0 83 ssd121
1.0 30.4 0.0 0.2 0.0 0.9 0.0 30.1 0 70 ssd122
Notice all the disks at 100% busy? Are there any storage experts out there who can confirm this?
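As an aside, picking the saturated devices out of that wall of output can be scripted. A rough sketch against the 11-column -zxnM layout shown above (the 90% busy threshold is an arbitrary choice):

iostat -zxnM 5 | nawk 'NF == 11 && $10 + 0 >= 90 { print $11, $10 "% busy,", $8, "ms avg service time" }'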
UPDATE:
As a DBA, I am on a need-to-know basis; hence, I did not have all the required information.
From my understanding now, the migration was to a new server connected to the existing storage.
The system admin believes this could be a cabling issue, and the migration to the new storage is yet to come.
This is on Solaris 10.
For comparison, here is the same iostat output from a healthy system:
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.2 0.0 0.0 0.0 0.0 0.0 10.1 0 0 md0
0.0 0.2 0.0 0.0 0.0 0.0 0.0 9.6 0 0 md4
0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.3 0 0 md5
0.0 0.2 0.0 0.0 0.0 0.0 0.0 9.5 0 0 md6
0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.5 0 0 md7
0.0 0.2 0.0 0.0 0.0 0.0 0.0 10.0 0 0 md8
0.0 0.8 0.0 0.0 0.0 0.0 0.0 0.3 0 0 sd1
0.0 0.8 0.0 0.0 0.0 0.0 0.0 8.2 0 1 sd2
7.7 0.0 5.6 0.0 0.0 0.1 0.0 10.0 0 8 ssd41
6.5 0.2 1.6 0.0 0.0 0.0 0.0 4.8 0 3 ssd42
0.0 0.8 0.0 0.0 0.0 0.0 0.0 0.7 0 0 ssd43
0.2 0.0 0.0 0.0 0.0 0.0 0.0 1.7 0 0 ssd44
10.7 0.2 6.4 0.0 0.0 0.1 0.0 7.1 0 8 ssd45
35.1 0.2 0.3 0.0 0.0 0.2 0.0 4.7 0 17 ssd46
27.2 0.2 0.2 0.0 0.0 0.1 0.0 2.0 0 5 ssd49
1.4 0.2 0.0 0.0 0.0 0.0 0.0 3.4 0 1 ssd50
5.7 1.0 0.0 0.0 0.0 0.0 0.0 2.6 0 2 ssd51
0.0 0.8 0.0 0.0 0.0 0.0 0.0 0.6 0 0 ssd52
3.0 6.7 0.8 0.2 0.0 0.0 0.0 2.3 0 2 ssd53
0.6 1.0 0.0 0.0 0.0 0.0 0.0 0.9 0 0 ssd58
0.8 1.4 0.0 0.0 0.0 0.0 0.0 1.0 0 0 ssd61
0.0 1.8 0.0 0.0 0.0 0.0 0.0 0.6 0 0 ssd63
0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.5 0 0 ssd65
0.4 0.0 0.0 0.0 0.0 0.0 0.0 3.5 0 0 ssd66
4.7 0.0 4.6 0.0 0.0 0.0 0.0 7.5 0 3 ssd68
0.6 0.0 0.6 0.0 0.0 0.0 0.0 7.2 0 0 ssd69
6.1 0.8 4.4 0.0 0.0 0.0 0.0 4.5 0 3 ssd70
7.5 0.2 6.2 0.0 0.0 0.0 0.0 6.2 0 5 ssd71
0.2 0.0 0.0 0.0 0.0 0.0 0.0 10.8 0 0 ssd72
6.3 0.6 6.2 0.0 0.0 0.0 0.0 6.4 0 4 ssd73
8.9 0.0 5.8 0.0 0.0 0.1 0.0 8.1 0 7 ssd121
5.9 0.4 1.9 0.0 0.0 0.0 0.0 4.1 0 3 ssd122
0.0 1.6 0.0 0.0 0.0 0.0 0.0 0.8 0 0 ssd123
10.1 0.2 6.3 0.0 0.0 0.1 0.0 7.5 0 8 ssd125
38.5 0.0 0.5 0.0 0.0 0.2 0.0 5.1 0 20 ssd126