HP的查看打开文件数工具:crashinfo
一天,在一个业务高峰期,某省的一个数据库突然宕机了,根据alertlog中发现的报错:
ARC0: Evaluating archive log 5 thread 1 sequence 142032
ARC0: Beginning to archive log 5 thread 1 sequence 142032
Creating archive destination LOG_ARCHIVE_DEST_1: '/oraarchlog/arch_1_142032.arc'
Mon Jan 12 09:49:56 2009
ARC0: Completed archiving log 5 thread 1 sequence 142032
Mon Jan 12 09:57:57 2009
Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora1/rredo_256m_01'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
Mon Jan 12 09:57:57 2009
Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
看来是打开文件数太多,导致了此次数据库宕机。
通过监控打开文件数发现,其实在前一天的时候,打开文件数已经到98%,在上午8点多的时候,打开文件数已经到99%,但是一直都没有引起监控人员的注意,终于导致打开文件数被撑爆。
我们通过sar -v可以发现打开文件数的总数:
01:08:52 text-sz ov proc-sz ov inod-sz ov file-sz ov
01:08:54 N/A N/A 1139/4096 0 2678/34816 0 113494/200010 0
01:08:56 N/A N/A 1139/4096 0 2678/34816 0 113494/200010 0
01:08:58 N/A N/A 1135/4096 0 2675/34816 0 113487/200010 0
01:09:00 N/A N/A 1135/4096 0 2675/34816 0 113487/200010 0
01:09:02 N/A N/A 1135/4096 0 2675/34816 0 113487/200010 0
其中的file-sz就是指打开文件数。其中20多万的那个值是打开文件数的阈值,前者是当前实际打开的文件数。这阈值是有核心参数nfile决定的。我们在设置nfile的时候,建议是(15 * NPROC + 2048);提醒值是oracle的process数乘以datafile数再加2048;极限值,在存数据库环境应该是os的极限进程数乘以数据文件数,即每个进程都打开所有的数据文件。
Max Number of simultaneously Open
files system-wide at any given time. Total
number of slots it the file descriptor table,
default=16*(nproc+16+maxusers)/10+32+ 2*(npty+nstrpty)
recommended:
to use default.
>= Oracle 9i = (15 * NPROC + 2048)
For Oracle installations with a high number of data files this might be not enough, than use the
>= ((number of oracle processes) * (number of Oracle data files) + 2048)
Nproc
Max Number of Processes that can exist
simultaneously in the system,
default=(20+8*MAXUSERS),
influences ninode, nfile.
recommended:
>= Oracle 9i = 4096
根据计算,如果根据建议值,那么nfile将设置成63488,如果根提醒值,将有141万,如果根据极限值,将有386万:
oracle@my_db01:/oracle > kmtune |grep nproc
nproc 4096 - 4096
oracle@my_db01:/oracle > exit
SQL> select 15*4096+2048 from dual;
15*4096+2048
------------
63488
SQL> show parameter process
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
processes integer 1500
SQL>
SQL>
SQL> select count(*) from v$datafile;
COUNT(*)
----------
944
SQL> select 1500*944+2048 from dual;
1500*944+2048
-------------
1418048
SQL> select 4096*944 from dual;
4096*944
----------
3866624
SQL>
对于设置6万,肯定是完全不适合我们这个系统;如果设置141万或者设置386万,这似乎又太大了。
到底是什么进程需要打开那么多文件数?在sar -v的命令中,我们只看到了一个总数,能具体的去看什么进程打开多少文件数吗?
在这里,HP提供了一个工具:crashinfo。可以查看各个进程的打开文件数:
用法是:
my_db01#[/tmp]chmod +x crashinfo
my_db01#[/tmp]./crashinfo -ofiles
此时能看到各个进程的打开文件数情况:
crashinfo (4.74)
Note: HP CONFIDENTIAL
libp4 (9.305): Opening /stand/vmunix /dev/kmem
Loading symbols from /stand/vmunix
Kernel TEXT pages not requested in crashconf
Will use an artificial mapping from a.out TEXT pages
Command line: ./crashinfo -ofiles
crashinfo (4.74)
pid p_maxof ofiles p_comm
----- ------- ------ ------
1 4096 2 init
2011 4096 4 getty
94 4096 4 evmmirr
62 4096 4 emcpdaemon
63 4096 4 emcpProcd
64 4096 4 emcpd
65 4096 4 emcpdfd
66 4096 4 emcpwdd
67 4096 4 emcpstratd
68 4096 4 MpAsyncIoDaemo
69 4096 4 MpTestDaemon
70 4096 4 MpPirpNotDoneD
71 4096 4 MpPeriodicDaem
72 4096 4 MpDeferredDaem
73 4096 4 MpGeneralReque
74 4096 4 MpcAsyncIoDaem
75 4096 4 MpcResumeIoDae
76 4096 4 MpcPeriodicCal
12986 4096 671 oracle
13005 4096 671 oracle
13291 4096 10 oracle
12991 4096 671 oracle
……
22466 4096 51 oracle
12300 4096 847 oracle
12898 4096 9 oracle
12302 4096 979 oracle
13003 4096 671 oracle
13009 4096 670 oracle
13012 4096 459 oracle
13014 4096 49 oracle
Total = 116775
my_db01#[/tmp]ps -ef |grep 12298
oracle 12298 1 0 Jan 15 ? 9:31 ora_dbw0_orasid
root 12469 11867 1 02:06:39 pts/te 0:00 grep 12298
my_db01#[/tmp]ps -ef |grep 12300
oracle 12300 1 0 Jan 15 ? 9:32 ora_dbw1_orasid
root 13363 11867 0 02:16:09 pts/te 0:00 grep 12300
my_db01#[/tmp]
通过ps -ef |grep 打开文件数最多的几个进程号,发现打开文件数最多的进程是dbwr,smon,lgwr,chkpt。每个进程基本上800~900个打开文件数,且dbwr进程有8个,打开文件数那就更加多了。其他打开文件数较多的进程是LOCAL=NO的进程,通过观察spid关联sql_text,发现是对历史表的操作。历史表是大表,表空间有300多个数据文件,因此,这些进程的打开文件数,也在每个600左右。
此时忽然记起,前段时间该省做优化时,当时DBA将dbwr进程数从2调整到了8,dbwr数增加了4倍,导致打开文件数也相应的增加,这应该就是造成nfile在近期慢慢增长到撑爆为止的原因了吧。
找到了原因,问题就比较好解决了。找了一个停机的时间,把db_writer_processes调整回2,且为了以防万一,再将nfile值改为40万。
附:crashinfo命令的其他用法:
Default: coredir="." if "INDEX" file present else
kernel="/stand/vmunix" core="/dev/kmem"
Options:
-h | -help [flag,flag...]
flags: detail
-u | -update
-v | -verbose
-c | -continue
-H | -Html
-e | -email <mail_addr>[,flag,flag...]
flags: file=<file>
from=<from>
callid=<callid>
-t | -trace [flag,flag...]
flags: args
regs (PA Only)
Rregs (PA Only)
locals (IA64 Only)
frame (IA64 Only)
mems (IA64 Only)
bsp (IA64 Only)
ss (IA64 Only)
-s | -syscall
-f | -full_comm
-S | -Sleep
-l | -listonly [flag,flag...]
flags: pid=<pid>
-top [count]
-p | -proctree [flag,flag...]
flags: tids
-i | -ioscan
-ofiles [pid][,flag,flag...]
flags: tids
-signals [pid]
-seminfo
-unlinked [all]
-m | -msginfo [flag,flag...]
flags: skip
id=<id>
msg=<msg>
type=<type>
-flocks [flag,flag...]
flags: sleep
vnode=<vnode>
inode=<inode>
-sgfr [version]
-fwlog [cpu]
-vmtrace [flag,flag...]
flags: bucket=<bucket>
arena=<arena>
count=<num>
leak
cor
log
parse
如:
my_db01#[/tmp]ps -ef |grep ckpt
oracle 12304 1 0 Jan 15 ? 3:29 ora_ckpt_orasid
root 13509 11867 0 02:18:35 pts/te 0:00 grep ckpt
my_db01#[/tmp]
my_db01#[/tmp]./crashinfo -ofiles 12304
crashinfo (4.74)
Note: HP CONFIDENTIAL
libp4 (9.305): Opening /stand/vmunix /dev/kmem
Loading symbols from /stand/vmunix
Kernel TEXT pages not requested in crashconf
Will use an artificial mapping from a.out TEXT pages
Command line: ./crashinfo -ofiles 12304
crashinfo (4.74)
pid p_maxof ofiles p_comm
----- ------- ------ ------
12304 4096 961 oracle
my_db01#[/tmp]
· 【文章发布信息】发表于: 2009-01-17 @ 02:22:05 · ||分类: ..experience, Working case



CopyRight ©
David.Guo 于 2009-01-17 @ 19:15:27 留言 :
nfile的使用要监控起来,我们这边一般到85就开始要走应急流程处理了。
85,是生命线呀
小荷 于 2009-01-17 @ 23:40:53 留言 :
re David.Guo:我们也是有监控的,不过没引起监控人员的注意。唉,失败呀~~~
不过更失败的是SA的service guard没配好,没monitor到oracle进程,数据库宕机后竟然没自动切换到B机。
dhhb 于 2009-01-19 @ 12:50:56 留言 :
我们这里nfile是100W.呵呵
我记得lsof可以看打开文件数的。
oracle.dba 于 2009-02-25 @ 21:51:17 留言 :
最近也遇到这个问题了,但是没有工具进行分析查看。
能否把这个crashinfo的工具共享下,网站上下载不下来,
发我邮箱也可以
ora315@gmail.com
谢谢!
make 于 2009-05-14 @ 17:06:27 留言 :
我也遇到这个问题,按照metalink的提示增加了
nproc = Max Number of Processes
nfile = Max Number of Open Files
nflocks = Max Number of File Locks
的值,但是file-sz 的值一直在涨(要是股票就好了),最后又down了,再增大,再dwon,我晕了。请何大师指点