bioctl "intermittently" reports RAID 1 array as degraded


Theodore Wynnychenko
Hello

I am trying to understand what I may be missing (I have been noticing this issue
for a year or so).

I have a machine running -current that is set up with two SSDs.

Each SSD is partitioned with fdisk into a single OpenBSD partition:

# fdisk sd0
Disk: sd0       geometry: 19457/255/63 [312581808 Sectors]
Offset: 0       Signature: 0xAA55
            Starting         Ending         LBA Info:
 #: id      C   H   S -      C   H   S [       start:        size ]
-------------------------------------------------------------------------------
 0: 00      0   0   0 -      0   0   0 [           0:           0 ] unused
 1: 00      0   0   0 -      0   0   0 [           0:           0 ] unused
 2: 00      0   0   0 -      0   0   0 [           0:           0 ] unused
*3: A6      0   1   2 -  19456 254  63 [          64:   312576641 ] OpenBSD

The disklabel on each disk has an "a" 4.2BSD partition, a "b" swap partition,
and an "m" RAID partition:

# disklabel sd0
# /dev/rsd0c:
type: SCSI
disk: SCSI disk
label: INTEL SSDSA2BW16
duid: 43d094716532e926
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 19457
total sectors: 312581808
boundstart: 64
boundend: 312576705
drivedata: 0

16 partitions:
#                size           offset  fstype [fsize bsize   cpg]
  a:          2104448               64  4.2BSD   2048 16384     1 # /
  b:         18860313          2104512    swap                    # none
  c:        312581808                0  unused
  m:        291611880         20964825    RAID
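
To rule out an overlapping or out-of-bounds partition as a cause, the layout
can be checked with a little arithmetic.  This is only a sketch using the
numbers copied from the disklabel above; the variable names are made up for
the example.

```shell
#!/bin/sh
# Offsets and sizes in 512-byte sectors, copied from the disklabel above.
# The variable names are hypothetical, chosen only for this check.
a_off=64
a_size=2104448
b_off=2104512
b_size=18860313
m_off=20964825
m_size=291611880
boundend=312576705

# Each partition ends at offset + size.
a_end=$((a_off + a_size))
b_end=$((b_off + b_size))
m_end=$((m_off + m_size))

# Report whether each partition ends exactly where the next region starts.
[ "$a_end" -eq "$b_off" ] && echo "a ends where b starts ($a_end)"
[ "$b_end" -eq "$m_off" ] && echo "b ends where m starts ($b_end)"
[ "$m_end" -eq "$boundend" ] && echo "m ends at boundend ($m_end)"
```

All three checks hold for these numbers, so the partitions are contiguous and
the RAID partition stays inside the OpenBSD area of the disk.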

Most of the time, everything is fine:

# bioctl -i sd2
Volume      Status               Size Device
softraid0 0 Online       149305012224 sd2     RAID1
          0 Online       149305012224 0:0.0   noencl <sd0m>
          1 Online       149305012224 0:1.0   noencl <sd1m>
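
The size bioctl reports is slightly smaller than the "m" partition itself.
The sketch below just works out the difference from the numbers shown above.
My assumption (not something the output states) is that the gap is space
softraid reserves on each chunk for its own metadata.

```shell
#!/bin/sh
# Numbers copied from the disklabel and bioctl output above.
# Variable names are hypothetical, chosen only for this check.
m_sectors=291611880          # size of the "m" RAID partition, in sectors
reported_bytes=149305012224  # volume size reported by bioctl, in bytes

# Convert the partition size to bytes (512 bytes/sector per the disklabel).
partition_bytes=$((m_sectors * 512))

# Difference between the raw partition and the usable volume.
diff_bytes=$((partition_bytes - reported_bytes))
diff_sectors=$((diff_bytes / 512))

echo "$diff_bytes bytes = $diff_sectors sectors"
```

For these values the difference is 270336 bytes, or 528 sectors, and it is the
same for both chunks since they report the same size.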


BUT, every once in a while (sometimes after a couple of weeks, sometimes after
a couple of months), all of a sudden the array will report as being degraded.

However, other than the notice that the array is degraded and that a mirror is
offline, I can find nothing in any log, and no change in the dmesg, to suggest
what may have happened.

I have changed the drive cables.  I have replaced the SSDs.

But, it still happens every so often.

When the array is degraded, I can still run fdisk and disklabel on the
"offline" disk without a problem.  I can rebuild the degraded array onto the
"offline" disk:

# bioctl -R /dev/sd1m sd2

The rebuild completes without a problem, and the array is then stable for
weeks or months until, seemingly at random, it happens again.

Is there anything I should be looking at, or for, to help figure out what the
issue is?

As I said, I have already swapped out the hardware at least once.  If it is a
hardware issue, I can keep swapping out hardware, but at this point the
probability that multiple drives share the same intermittent problem seems
really low (though, obviously, not zero).

I would appreciate any advice on how to track down what the problem may be the
next time it happens.
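
One thing that might help narrow down when the change happens is to poll
bioctl and record a timestamp whenever the volume status changes.  What
follows is a minimal sketch, assuming the bioctl output keeps the column
layout shown above.  The function names, the "raidwatch" tag, and the file
paths are all made up for the example; only bioctl, awk, date and logger are
real commands.

```shell
#!/bin/sh
# Sketch only: volume_status, bad_chunks and record_change are hypothetical
# helper names, not part of OpenBSD.

# Print the Status column of the volume line (first field starts with
# "softraid") from bioctl output read on stdin.
volume_status() {
    awk '$1 ~ /^softraid/ { print $3; exit }'
}

# Print "<device> <status>" for every chunk line (first field is the chunk
# number) whose status is not Online.
bad_chunks() {
    awk '$1 ~ /^[0-9]+$/ && $2 != "Online" { print $NF, $2 }'
}

# Read bioctl output on stdin.  If the volume status differs from the one
# saved in the state file, append a timestamped record plus the full bioctl
# output to the log file, and send a one-line note to syslog.
# usage: bioctl sd2 | record_change sd2 STATEFILE LOGFILE
record_change() {
    vol=$1
    state=$2
    log=$3
    out=$(cat)
    now=$(printf '%s\n' "$out" | volume_status)
    prev=$(cat "$state" 2>/dev/null)
    [ "$now" = "$prev" ] && return 0
    {
        printf '%s %s: %s -> %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
            "$vol" "${prev:-unknown}" "${now:-unknown}"
        printf '%s\n' "$out"
    } >> "$log"
    logger -t raidwatch "$vol: ${prev:-unknown} -> ${now:-unknown}"
    printf '%s\n' "$now" > "$state"
}
```

Saved as a script with a final line such as
`bioctl sd2 | record_change sd2 /var/db/raidwatch.state /var/log/raidwatch`
and run from root's crontab every minute, this would at least pin the
transition down to a one-minute window that can be compared against other
logs.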

Thanks
Ted