修改hosts后,rac通讯失败


今天收到个告警某省的一个数据库的一个节点down了,重启后,只能到started状态,数据库无法open,登录上去后,看到alertlog中:

Mon Jun 15 15:38:28 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_26950.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:gethostbyname failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: sskgxpmyip2
ORA-27303: additional information: nodename FJ_DB02
 Mon Jun 15 15:41:19 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27062.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:gethostbyname failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: sskgxpmyip2
ORA-27303: additional information: nodename FJ_DB02
Mon Jun 15 15:42:37 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27147.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:gethostbyname failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: sskgxpmyip2
ORA-27303: additional information: nodename FJ_DB02
 Mon Jun 15 15:46:19 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27362.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context

当时第一个反映是gethostbyname failed 应该是主机名解析有问题了。但是ping的时候发现还是能ping通:

oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump > ping FJ_DB01
PING FJ_DB01: 64 byte packets
64 bytes from 10.203.17.1: icmp_seq=0. time=0. ms
64 bytes from 10.203.17.1: icmp_seq=1. time=0. ms
64 bytes from 10.203.17.1: icmp_seq=2. time=1. ms
64 bytes from 10.203.17.1: icmp_seq=3. time=3. ms
64 bytes from 10.203.17.1: icmp_seq=4. time=0. ms
 
----FJ_DB01 PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/1/3
oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump >

奇怪了,但是能ping通为啥报这个错?

询问了当地的同事,说修改了hosts文件,但是现在已经改回去了。进一步检查hosts文件,发现内容虽然已经改回去,但是权限没改回去:

drwxr-xr-x   2 bin        bin             96 Dec 26  2006 gss
-rw-r--r--   1 bin        bin            489 Dec 27  2006 hba.conf
-rw-------   1 bin        bin            847 Jun 15 15:08 hosts
-r--r--r--   1 root       sys            847 Jun 15 14:57 hosts.bak
-r--r--r--   1 root       sys            626 Dec 28  2006 hosts.old

oracle用户cat hosts文件是cat不出来的。
于是改回这个文件的权限为444,重启数据库(注意先用abort shutdown数据库),问题解决:

Mon Jun 15 15:52:09 2009
Shutting down instance (abort)
License high water mark = 2
Instance terminated by USER, pid = 28002
Mon Jun 15 15:52:34 2009
Starting ORACLE instance (normal)
Mon Jun 15 15:52:34 2009
Global Enqueue Service Resources = 36504, pool = 8
Mon Jun 15 15:52:34 2009
Global Enqueue Service Enqueues = 54768
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Mon Jun 15 15:52:50 2009
SCN scheme 1
Mon Jun 15 15:53:16 2009
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.6.0.
System parameters with non-default values:
  processes                = 1500
  timed_statistics         = TRUE
  shared_pool_size         = 536870912
  large_pool_size          = 218103808
  java_pool_size           = 218103808
  spfile                   = /dev/vg_rac3/rG3_spfile_128m_01
  control_files            = /dev/vg_rac1/rG1_ctrl_256m_01, /dev/vg_rac2/rG2_ctrl_256m_02, /dev/vg_rac3/rG3_ctrl_256m_03
  db_block_size            = 8192
  db_cache_size            = 8589934592
  compatible               = 9.2.0.0.0
  log_archive_start        = TRUE
  log_archive_dest_1       = location=/archlog2
  log_archive_format       = fjmisc2_%t_%s.arc
  log_buffer               = 4194304
  db_files                 = 2000
  db_file_multiblock_read_count= 16
  cluster_database         = TRUE
  cluster_database_instances= 2
  thread                   = 2
  fast_start_mttr_target   = 800
  instance_number          = 2
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS2
  undo_suppress_errors     = TRUE
  undo_retention           = 21600
  remote_login_passwordfile= EXCLUSIVE
  db_domain                =
  instance_name            = fjmisc2
  local_listener           = fjmisc2
  remote_listener          = fjmisc
  job_queue_processes      = 10
  hash_join_enabled        = TRUE
  background_dump_dest     = /oracle/app/oracle/admin/fjmisc/bdump
  user_dump_dest           = /oracle/app/oracle/admin/fjmisc/udump
  max_dump_file_size       = 5120
  core_dump_dest           = /oracle/app/oracle/admin/fjmisc/cdump
  sort_area_size           = 524288
  db_name                  = fjmisc
  open_cursors             = 400
  star_transformation_enabled= FALSE
  query_rewrite_enabled    = TRUE
  pga_aggregate_target     = 838860800
Mon Jun 15 15:53:16 2009
cluster interconnect IPC version:Oracle UDP/IP
IPC Vendor 1 proto 2 Version 1.0
PMON started with pid=2
DIAG started with pid=3
LMON started with pid=4
LMD0 started with pid=5
LMS0 started with pid=6
LMS1 started with pid=7
DBW0 started with pid=8
LGWR started with pid=9
CKPT started with pid=10
SMON started with pid=11
RECO started with pid=12
CJQ0 started with pid=13
Mon Jun 15 15:53:20 2009
ARCH: STARTING ARCH PROCESSES
ARC0 started with pid=14
ARC0: Archival started
ARC1 started with pid=15
Mon Jun 15 15:53:20 2009
ARCH: STARTING ARCH PROCESSES COMPLETE
Mon Jun 15 15:53:20 2009
ARC1: Archival started
Mon Jun 15 15:53:20 2009
ARC1: Thread not mounted
Mon Jun 15 15:53:20 2009
ARC0: Thread not mounted
Mon Jun 15 15:53:20 2009
ALTER DATABASE   MOUNT
Mon Jun 15 15:53:21 2009
lmon registered with NM - instance id 2 (internal mem no 1)
Mon Jun 15 15:53:21 2009
Reconfiguration started (old inc 0, new inc 4)
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS remote-cache requests
 0 write requests issued in 480118 GCS resources
 116 PIs marked suspect, 0 flush PI msgs
Mon Jun 15 15:53:29 2009
Reconfiguration complete
LCK0 started with pid=17
Mon Jun 15 15:53:35 2009
Successful mount of redo thread 2, with mount id 248173721
Mon Jun 15 15:53:35 2009
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE).
Completed: ALTER DATABASE   MOUNT
Mon Jun 15 15:53:35 2009
ALTER DATABASE OPEN
Picked Lamport scheme to generate SCNs
Mon Jun 15 15:53:39 2009
LGWR: Primary database is in CLUSTER CONSISTENT mode
Thread 2 advanced to log sequence 18067
Thread 2 opened at log sequence 18067
  Current log# 8 seq# 18067 mem# 0: /dev/vg_rac1/rG1_redo_512m_204
  Current log# 8 seq# 18067 mem# 1: /dev/vg_rac2/rG2_redo_512m_214
  Current log# 8 seq# 18067 mem# 2: /dev/vg_rac3/rG3_redo_512m_224
Successful open of redo thread 2
Mon Jun 15 15:53:39 2009
SMON: enabling cache recovery
Mon Jun 15 15:53:39 2009
ARC0: Evaluating archive   log 6 thread 2 sequence 18066
ARC0: Beginning to archive log 6 thread 2 sequence 18066
Creating archive destination LOG_ARCHIVE_DEST_1: '/archlog2/fjmisc2_2_18066.arc'
Mon Jun 15 15:53:40 2009
Successfully onlined Undo Tablespace 3.
Mon Jun 15 15:53:40 2009
SMON: enabling tx recovery
Mon Jun 15 15:53:40 2009
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Completed: ALTER DATABASE OPEN

在这里,我们注意一下,在hpux的环境下,哪怕某用户没有cat /etc/hosts的权限,他也是能ping通主机名的,这个和linux还是有点区别的。但是rac是需要读取hosts文件的,因此oracle用户必须有读hosts的权限。

· 【文章发布信息】发表于: 2009-06-15 @ 16:35:47 · ||分类: ..experience, Working case

留条评论