Ticket #568 (new defect)

Opened 8 months ago

Last modified 8 months ago

Controller failover does not work when EntityLocations is not 1 or 2

Reported by: troyh
Owned by:
Priority: blocker
Milestone: PL 3.0.1
Component: AvSv
Version: 3.0.0-RC1
Keywords:
Cc:
patch waiting for maintainer: no

Description

Controller failover does not work when the controllers are not physically in EntityLocations 1 and 2.

On my c-class cluster I have 4 physical slots: 9, 10, 11 and 12. If I configure the cluster using the default slot_ids 1-4 and update the EntityLocations to 9-12 respectively, the cluster starts and mostly works as expected, except that active controller failover does not work. If I look at the log on the standby controller when the active controller fails, I see:

opensaf_immnd: Director Service in NOACTIVE state
kernel: TIPC: Resetting link <1.1.47:eth0-1.1.31:eth0>, peer not responding
kernel: TIPC: Lost link <1.1.47:eth0-1.1.31:eth0> on network plane A
kernel: TIPC: Lost contact with <1.1.31>
opensaf_immd: IMMND DOWN on active controller f1 detected at standby immd!! f2. Possible failover
opensaf_immd: Resend of fevs message 20, will not mbcp to peer IMMD
opensaf_immd: Resend of fevs message 21, will not mbcp to peer IMMD
opensaf_immnd: DISCARD DUPLICATE FEVS message:20
opensaf_immnd: DISCARD DUPLICATE FEVS message:21
opensaf_immnd: Global discard node received for nodeId:2010f pid:22233
ncs_scap: AVD: Heart Beat missed with active director on 2010f
opensaf_fmsd: Role: STANDBY, FM_EVT_HB_LOSS: for slot_id: 9, subslot_id: 15

The STANDBY clearly knows that the ACTIVE went away, yet it doesn't seem to think it needs to become ACTIVE.

If I simply move the blades to physical slots 1-4 and update the EntityLocations to match, everything works as expected.

There seems to be a disconnect between the slot/subslot ID and the actual EntityLocation.
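
To make the mismatch concrete, here is a rough sketch that decodes the identifiers from the log above. The bit layouts are assumptions inferred from the values in the log, not something this ticket confirms: node ID as (chassis << 16) | (slot << 8) | subslot, and the node part of the TIPC address as (slot << 4) | subslot. The helper names are hypothetical.

```python
# ASSUMED layouts (inferred from the log values, not confirmed):
#   node ID        = (chassis << 16) | (slot << 8) | subslot
#   TIPC node part = (slot << 4) | subslot

def decode_node_id(node_id):
    """Split a node ID into (chassis, slot, subslot) under the assumed layout."""
    return (node_id >> 16) & 0xFF, (node_id >> 8) & 0xFF, node_id & 0xFF

def decode_tipc_node(node):
    """Split the node part of a TIPC <zone.cluster.node> address into (slot, subslot)."""
    return node >> 4, node & 0x0F

chassis, slot, subslot = decode_node_id(0x2010F)  # nodeId in the immnd and AVD lines
tipc_slot, tipc_subslot = decode_tipc_node(31)    # node part of <1.1.31>
fm_slot = 9                                       # slot_id in the fmsd line

# Compare the slot each identifier points at.
print("node ID slot:", slot, "TIPC slot:", tipc_slot, "FM slot:", fm_slot)
```

If those layouts hold, the node ID and the TIPC address both refer to the failed active controller as slot 1, while the fmsd heartbeat-loss event reports slot 9 (the EntityLocation). That would be consistent with the standby failing to match the lost heartbeat to its active peer.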

Attachments

Change History

Changed 8 months ago by anders

The standby IMMD successfully detects possible failover and acts on that,
within the IMMSv domain.

The IMMD does not actually know that there is a failover yet.
The failover decision is taken by the AMF and is not based on TIPC.

[In my opinion, non-HPI-enabled systems would probably work better (more reliably and faster)
if the failover decision of the AMF were simply based on loss of the TIPC connection with the active.
I assume here that the network is redundant, so that we can "ignore" the network partition case
as a double failure.]
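
A minimal sketch of that suggestion, for illustration only. It is not how the AMF decides today; the function name and the idea of matching on the kernel log line are hypothetical. The only input taken from this ticket is the "TIPC: Lost contact with <...>" message format.

```python
import re

# Matches the kernel message seen in the log in this ticket.
LOST_CONTACT = re.compile(r"TIPC: Lost contact with <(\d+\.\d+\.\d+)>")

def should_take_over(role, active_addr, log_line):
    """Return True if a standby should promote itself after seeing log_line.

    Hypothetical policy: promote as soon as TIPC reports lost contact with
    the active controller's address. Assumes a redundant network, so a
    partition is treated as a double failure and ignored.
    """
    match = LOST_CONTACT.search(log_line)
    return (
        role == "STANDBY"
        and match is not None
        and match.group(1) == active_addr
    )

line = "kernel: TIPC: Lost contact with <1.1.31>"
print(should_take_over("STANDBY", "1.1.31", line))
```

Under this policy the decision does not depend on slot or EntityLocation values at all, only on the TIPC address of the active peer.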

Changed 8 months ago by anders

  • component changed from IMMSv to AvSv

Changed 8 months ago by scon

  • milestone changed from 3.0.0-GA to PL 3.0.1
