Wednesday, March 26, 2008

Disk marked as "stale"

If physical access to a majority of the voting disks is removed, Oracle predicts the following:

CSS will mark the disks as stale *. Since there is no quorum, the clusterware will stop.

CSS will evict the nodes because there is no quorum.

On reboot, the system should restart and wait without CSS, also because there is no quorum.

After quorum is restored, crsctl start crs should restore the clusterware stacks. Reboot should not be required.
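The quorum rule behind these predictions can be sketched in a few lines of Python. This is only an illustration, assuming the usual Clusterware rule that a node needs access to strictly more than half of the configured voting disks; the function name has_quorum is hypothetical and not part of any Oracle tool.

```python
def has_quorum(total_disks: int, accessible_disks: int) -> bool:
    """Return True when strictly more than half of the voting disks are accessible.

    Assumption: quorum means a strict majority of the configured voting disks.
    """
    if total_disks <= 0:
        raise ValueError("total_disks must be positive")
    if not 0 <= accessible_disks <= total_disks:
        raise ValueError("accessible_disks must be between 0 and total_disks")
    # Integer comparison avoids any floating-point rounding.
    return accessible_disks * 2 > total_disks


if __name__ == "__main__":
    # Single voting disk that became unavailable: no quorum.
    print(has_quorum(1, 0))  # False
    # Three voting disks, one lost: quorum is kept.
    print(has_quorum(3, 2))  # True
```

With a single voting disk, as in the trace below, losing that one disk is enough to lose quorum.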

* This is reported in the CSSD trace files. See the following example from Solaris 8, taken after a RAC node reboot under a (very) heavy CPU and memory load. The trace file contains:

[ CSSD]2008-01-03 18:49:04.418 [11] >WARNING: clssnmDiskPMT: sltscvtimewait timeout (282535)
[ CSSD]2008-01-03 18:49:04.428 [11] >TRACE: clssnmDiskPMT: stale disk (282815 ms) (0//dev/vx/rdsk/racdg/vote_vol)
[ CSSD]2008-01-03 18:49:04.428 [11] >ERROR: clssnmDiskPMT: 1 of 1 voting disk unavailable (0/0/1)

This is a timeout after 282,535 ms (roughly 283 seconds) of polling the voting disk. I/O errors were reported neither in /var/adm/messages nor by the storage array.
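To spot such entries quickly, the "stale disk" lines can be pulled out of a trace file with a short script. The sketch below is an assumption-laden illustration: the regular expression is derived only from the three lines quoted above, the number before the first slash is assumed to be a disk index, and the function name stale_disks is hypothetical. Other Clusterware versions may format these messages differently.

```python
import re

# Pattern derived from the quoted line:
#   clssnmDiskPMT: stale disk (282815 ms) (0//dev/vx/rdsk/racdg/vote_vol)
# Assumption: the number before the first "/" is a disk index.
STALE_RE = re.compile(r"clssnmDiskPMT: stale disk \((\d+) ms\) \((\d+)/(.+)\)")


def stale_disks(lines):
    """Yield (disk_path, seconds_stale) for each 'stale disk' trace line."""
    for line in lines:
        match = STALE_RE.search(line)
        if match:
            millis = int(match.group(1))
            yield match.group(3), millis / 1000.0


if __name__ == "__main__":
    sample = [
        "[ CSSD]2008-01-03 18:49:04.418 [11] >WARNING: clssnmDiskPMT: sltscvtimewait timeout (282535)",
        "[ CSSD]2008-01-03 18:49:04.428 [11] >TRACE: clssnmDiskPMT: stale disk (282815 ms) (0//dev/vx/rdsk/racdg/vote_vol)",
        "[ CSSD]2008-01-03 18:49:04.428 [11] >ERROR: clssnmDiskPMT: 1 of 1 voting disk unavailable (0/0/1)",
    ]
    for path, seconds in stale_disks(sample):
        print(f"{path} stale for {seconds:.1f} s")
```

In practice the list would be replaced by the lines of the actual CSSD trace file on the node.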

