Thinking Out Loud

December 19, 2010

Troubleshooting OEM 11g Grid Agent – root suid enabled

Filed under: 11g,Grid Control — mdinh @ 10:50 pm

OEM 11g Grid Control on Solaris 10.

Not sure what I did wrong in response file, but I never saw the prompt to execute root.sh

Here is the error from the database page that displays IO through put – snmhsutl.c:executable nmhs should have root suid enabled

This is resolved by executing root.sh.

oracle@proddb01:agent11g:/u01/app/oracle/product/agent11g
> ls -l ./bin/nmhs*
-rwxr-xr-x   1 oracle   dba        95912 Dec 16 13:41 ./bin/nmhs
-rwxr-xr-x   1 oracle   dba        95912 Dec 16 13:41 ./bin/nmhs.0
-rwxr-xr-x   1 oracle   dba            0 May  4  2010 ./bin/nmhs0

oracle@proddb01:agent11g:/u01/app/oracle/product/agent11g
> ll root.sh
-rwxr-x---   1 oracle   dba        12237 Dec 16 13:41 root.sh

oracle@proddb01:agent11g:/u01/app/oracle/product/agent11g
> ls -l ./bin/nmhs*
-rws--x---   1 root     dba        95912 Dec 16 13:41 ./bin/nmhs
-rwxr-xr-x   1 oracle   dba        95912 Dec 16 13:41 ./bin/nmhs.0
-rwxr-xr-x   1 oracle   dba            0 May  4  2010 ./bin/nmhs0

oracle@proddb01:agent11g:/u01/app/oracle/product/agent11g

Reference: Grid Agent Configuration: storage_report_metrics.pl Reports Error “snmhsutl.c:executable nmhs should have root suid enabled” [ID 435793.1]

Advertisements

December 9, 2010

Troubleshooting OEM 11g Grid Crontrol – (Too many open files)

Filed under: 11g,Grid Control — mdinh @ 3:58 am

I am running OEM 11g Grid Control on Solaris 10.

Instead of seeing charts on the OEM pages, I was seeing red X.

First, the location of the logs can be found under $EM_INSTANCE_HOME

> echo $EM_INSTANCE_HOME
/u01/app/oracle/gc_inst/em/EMGC_OMS1
/u01/app/oracle/gc_inst/em/EMGC_OMS1/sysman/log
> ll
total 11216
drwxr-----   2 oracle   dba           96 Aug 31 12:57 pafLogs
-rw-r-----   1 oracle   dba            0 Aug 31 13:02 emovm.log.0
drwxr-xr-x   9 oracle   dba         8192 Aug 31 14:39 ..
-rw-r--r--   1 oracle   dba        15264 Sep 30 06:53 secure.log
drwxr-xr-x   3 oracle   dba         8192 Dec  6 11:29 .
-rw-r-----   1 oracle   dba            0 Dec  6 11:29 emovm.log.0.lck
-rw-r-----   1 oracle   dba       387329 Dec  6 11:32 emctl.log
-rw-r-----   1 oracle   dba      2640001 Dec  8 00:00 emoms.log
-rw-r-----   1 oracle   dba      2640001 Dec  8 00:00 emoms.trc

The following errors were in the emoms.log:

2010-12-06 07:55:04,717 [[ACTIVE] ExecuteThread: '5' for queue: 'weblogic.kernel.Default (self-tuning)']
WARN  chart.EmChartBean getChartImage.5105 - EXCEPTION while generating chart image [id]: D3B6716582D334A0D8F9F9D82AAD42C8.gif

java.io.FileNotFoundException:
/u01/app/oracle/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/tmp/_WL_user/emgc/n47f6s/public/images/chartCache/chartD3B6716582D334A0D8F9F9D82AAD42C8.gif.gif
(Too many open files)

This was puzzling since I remembered changing open files from 1024 to 4096 and restarting oms.

> ulimit -an
core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
open files                    (-n) 4096
pipe size          (512 bytes, -p) 10
stack size            (kbytes, -s) 8192
cpu time             (seconds, -t) unlimited
max user processes            (-u) 29995
virtual memory        (kbytes, -v) unlimited

Let’s find WLS processes:

> ps -afe|grep oracle|grep java|grep -v grep
  oracle  7813  7754   0   Nov 02 ?         389:43 /usr/jdk/instances/jdk1.6.0_20/bin/sparcv9/java -client -Xms256m -Xmx1024m -XX:
  oracle  7191  3335   0   Nov 02 ?         127:43 /usr/jdk/instances/jdk1.6.0_20/jre/bin/sparcv9/java -classpath /usr/jdk/instanc
  oracle 26182 26159   0   Nov 12 ?         698:36 /usr/jdk/instances/jdk1.6.0_22/bin/sparcv9/java -client -Xms256m -Xmx1024m -XX:

Let’s look at the open files associated with each process:

> pfiles 7813|head
7813:   /usr/jdk/instances/jdk1.6.0_20/bin/sparcv9/java -client -Xms256m -Xmx1
  Current rlimit: 1024 file descriptors
   0: S_IFCHR mode:0666 dev:340,66 ino:36640 uid:0 gid:3 rdev:13,2
      O_RDONLY|O_LARGEFILE
      /dev/null
   1: S_IFREG mode:0640 dev:286,1000 ino:1599591 uid:15001 gid:15000 size:30922
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   2: S_IFREG mode:0640 dev:286,1000 ino:1599591 uid:15001 gid:15000 size:30922
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   3: S_IFCHR mode:0666 dev:340,66 ino:36699 uid:0 gid:3 rdev:13,12

> pfiles 7191|head
7191:   /usr/jdk/instances/jdk1.6.0_20/jre/bin/sparcv9/java -classpath /usr/jd
  Current rlimit: 1024 file descriptors
   0: S_IFIFO mode:0000 dev:322,0 ino:93191797 uid:15001 gid:15000 size:0
      O_RDWR
   1: S_IFIFO mode:0000 dev:322,0 ino:93191798 uid:15001 gid:15000 size:0
      O_RDWR
   2: S_IFIFO mode:0000 dev:322,0 ino:93191799 uid:15001 gid:15000 size:0
      O_RDWR
   3: S_IFCHR mode:0666 dev:340,66 ino:36699 uid:0 gid:3 rdev:13,12
      O_RDWR|O_LARGEFILE FD_CLOEXEC

> pfiles 26182|head
26182:  /usr/jdk/instances/jdk1.6.0_22/bin/sparcv9/java -client -Xms256m -Xmx1
  Current rlimit: 1024 file descriptors
   0: S_IFCHR mode:0666 dev:340,66 ino:36640 uid:0 gid:3 rdev:13,2
      O_RDONLY|O_LARGEFILE
      /dev/null
   1: S_IFREG mode:0640 dev:286,1000 ino:3232523 uid:15001 gid:15000 size:8965325
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   2: S_IFREG mode:0640 dev:286,1000 ino:3232523 uid:15001 gid:15000 size:8965325
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   3: S_IFCHR mode:0666 dev:340,66 ino:36699 uid:0 gid:3 rdev:13,12

It looks like WLS was started with only 1024 file descriptors.

Restart WLS after confirming open files (-n) 4096 and verify the process:

> ps -afe|grep oracle|grep java|grep -v grep
  oracle 11942 11766   0 11:27:20 ?           6:32 /usr/jdk/instances/jdk1.6.0_22/bin/sparcv9/java -client -Xms256m -Xmx1024m -XX:
  oracle 18722  3335   0 11:25:39 pts/4       0:12 /usr/jdk/instances/jdk1.6.0_22/jre/bin/sparcv9/java -classpath /usr/jdk/instanc
  oracle 20361 20266   0 11:25:51 ?           2:45 /usr/jdk/instances/jdk1.6.0_22/bin/sparcv9/java -client -Xms256m -Xmx1024m -XX:
 
> pfiles 11942|head
11942:  /usr/jdk/instances/jdk1.6.0_22/bin/sparcv9/java -client -Xms256m -Xmx1
  Current rlimit: 4096 file descriptors
   0: S_IFCHR mode:0666 dev:340,66 ino:36640 uid:0 gid:3 rdev:13,2
      O_RDONLY|O_LARGEFILE
      /dev/null
   1: S_IFREG mode:0640 dev:286,1000 ino:1609620 uid:15001 gid:15000 size:62670
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   2: S_IFREG mode:0640 dev:286,1000 ino:1609620 uid:15001 gid:15000 size:62670
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   3: S_IFCHR mode:0666 dev:340,66 ino:36699 uid:0 gid:3 rdev:13,12
 
> pfiles 18722|head
18722:  /usr/jdk/instances/jdk1.6.0_22/jre/bin/sparcv9/java -classpath /usr/jd
  Current rlimit: 4096 file descriptors
   0: S_IFIFO mode:0000 dev:322,0 ino:415173812 uid:15001 gid:15000 size:0
      O_RDWR
   1: S_IFIFO mode:0000 dev:322,0 ino:415173813 uid:15001 gid:15000 size:0
      O_RDWR
   2: S_IFIFO mode:0000 dev:322,0 ino:415173814 uid:15001 gid:15000 size:0
      O_RDWR
   3: S_IFCHR mode:0666 dev:340,66 ino:36699 uid:0 gid:3 rdev:13,12
      O_RDWR|O_LARGEFILE FD_CLOEXEC
 
> pfiles 20361|head
20361:  /usr/jdk/instances/jdk1.6.0_22/bin/sparcv9/java -client -Xms256m -Xmx1
  Current rlimit: 4096 file descriptors
   0: S_IFCHR mode:0666 dev:340,66 ino:36640 uid:0 gid:3 rdev:13,2
      O_RDONLY|O_LARGEFILE
      /dev/null
   1: S_IFREG mode:0640 dev:286,1000 ino:1610388 uid:15001 gid:15000 size:25426
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   2: S_IFREG mode:0640 dev:286,1000 ino:1610388 uid:15001 gid:15000 size:25426
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   3: S_IFCHR mode:0666 dev:340,66 ino:36699 uid:0 gid:3 rdev:13,12

December 1, 2010

Partial Database Recovery

Filed under: 11g,Recovery — mdinh @ 6:10 am

I had a request to restore a user schema from a ~2TB data warehouse with 1,137 data files.

GIVEN:  RMAN is not being used for backup, backup is being performed using split mirror technology, and the restore server does not have 2TB space available.

The OS was restored with only the required data files for the database.

I started by taking the unnecessary data files offline using the script below:

set serverout on
declare
 l_sql varchar2(4000);
begin
 for x in (
 select name from v$datafile where not regexp_like(name,'sys|undo|user_data','i')
 ) loop 
 l_sql :='alter database datafile '''||x.name||''' offline';
 execute immediate l_sql;
 dbms_output.put_line(l_sql);
 end loop;
exception
when others then
 dbms_output.put_line ('******* error: ' || l_sql);
 raise;
end;
/

After executing:

alter database end backup;

I proceeded to open the database.

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01113: file 1 needs media recovery
ORA-01110: data file 1: '/oracle/oradata/dw01/system01.dbf'

That did not work. (Because I forgot to recover database). After recover database as shown below, database was able to open.

Just to be safe, I moved the online redo logs to a backup directory, hearing never to backup online redo logs.

oracle@dw01:/oracle/oradata/dw01
> mkdir backup
 
oracle@dw01:/oracle/oradata/dw01
> mv redo0* ./backup/

Next attempt was to recover the database:

SQL> recover database;
ORA-00283: recovery session canceled due to errors
ORA-00313: open failed for members of log group 3 of thread 1
ORA-00312: online log 3 thread 1: '/oracle/oradata/dw01/redo03b.dbf'
ORA-17503: ksfdopn:4 Failed to open file /oracle/oradata/dw01/redo03b.dbf
ORA-17500: ODM err:File does not exist
ORA-00312: online log 3 thread 1: '/oracle/oradata/dw01/redo03a.dbf'
ORA-17503: ksfdopn:4 Failed to open file /oracle/oradata/dw01/redo03a.dbf
ORA-17500: ODM err:File does not exist

Looks like the online redo logs were needed after all.

Move the redo back to the original location and attempt to recover the database again:

oracle@dw01:/oracle/oradata/dw01
> mv backup/* .
SQL> recover database;
Media recovery complete.
SQL> alter database open;
Database altered.
SQL>

Success!

Query showing status of data files:

SQL> select status, count(*) from v$datafile_header group by status;
 
STATUS    COUNT(*)
------- ----------
ONLINE          21
OFFLINE       1116
 
SQL> select status, count(*) from v$datafile group by status;
 
STATUS    COUNT(*)
------- ----------
ONLINE          20
RECOVER       1116
SYSTEM           1

Blog at WordPress.com.