Ticket #601 (new defect)

Last modified 7 months ago

FM/AVSv: SCAP failure can cause duplicate active for 2N redundant model

Reported by:	anders	Owned by:
Priority:	major	Milestone:	PL 3.0.1
Component:	FM	Version:	3.0.0-GA
Keywords:	AVSv	Cc:
patch waiting for maintainer:	no

Description

See discussion thread in:

http://list.opensaf.org/archives/devel/2009-May/004096.html

If the SCAP process crashes, this should lead to an IMMEDIATE node
restart (or at least a restart of the middleware and SAF applications
on that node).

The current solution allows applications to continue executing
(for 10 seconds), then the standby is promoted to active in parallel
with an order from the FM at the standby to the FM at the "active in
demise" to restart.

This solution is unreliable (we don't know whether the FM at
the old active will comply), dangerous (since we allow a node
with extremely serious AVSv problems to continue executing) and
defective (since it tends to cause duplicate execution in the
2N redundancy model).
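
To make the timing concrete, here is a minimal sketch (Python, not
OpenSAF code) of the overlap described above. It assumes that all times
are measured in seconds from the SCAP crash and that the standby is
promoted exactly when the 10 second wait expires. The function name and
its parameters are hypothetical.

```python
import math

FAILOVER_WAIT_S = 10.0  # waiting time quoted in this ticket


def duplicate_active_window(old_active_stops_at, promote_at=FAILOVER_WAIT_S):
    """Seconds during which both nodes act as active (hypothetical model).

    old_active_stops_at: time at which the old active really stops
    executing, or None if its FM never complies with the restart order.
    """
    if old_active_stops_at is None:
        return math.inf
    # No overlap if the old active is already down when the standby takes over.
    return max(0.0, old_active_stops_at - promote_at)
```

In this model, an old active that needs 12.5 s to go down leaves a
2.5 s window with two actives, while an old active that stops
immediately leaves none.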

The only reason I don't classify the ticket as critical is that the
problem should be rare in a real system. We have only seen the
problem when testing by manually killing SCAP.

I have provided a simple illustrative patch that shows approximately
what should be done. In essence, when an AVA detects loss of contact
with (the local) AVND, it should terminate its hosting process.
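
As a rough illustration of that rule (this is not the patch itself, and
not the AVA API), the handler could look like the sketch below. The
callback name is hypothetical, and how the AVA actually notices the
loss of contact is left out. Using os._exit instead of a normal exit is
an assumption: it skips cleanup handlers that might try to talk to the
dead AVND.

```python
import os
import sys


def on_avnd_contact_lost(terminate=os._exit):
    """Hypothetical callback, run when contact with the local AVND is lost."""
    sys.stderr.write("lost contact with local AVND, terminating process\n")
    # Do not try to continue: exit at once with a non-zero status.
    terminate(1)
```

The terminate parameter only exists so the handler can be exercised
without killing the caller.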

In addition, one of the processes/AVAs should order the node
restart AND send a message to the peer FM that it is going down,
which will cut short the 10-second waiting time for failover.
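
The proposed sequence could be sketched roughly as follows (Python
model, not OpenSAF code). All names are hypothetical. Two points are
assumptions that the ticket does not specify: how the one process that
orders the restart is chosen (here a plain flag), and that the notice to
the peer FM is sent before the restart is ordered.

```python
from dataclasses import dataclass

FAILOVER_WAIT_S = 10.0  # waiting time quoted in this ticket


@dataclass
class StandbyFM:
    """Hypothetical model of the FM on the standby node."""
    promote_at: float = FAILOVER_WAIT_S  # default: wait out the full timer

    def on_peer_going_down(self, now: float) -> None:
        # An explicit notice from the peer removes the need to wait.
        self.promote_at = min(self.promote_at, now)


def handle_avnd_loss(now, is_designated, standby_fm, actions):
    """Steps taken by one AVA-hosting process after losing the local AVND."""
    if is_designated:
        # Assumption: notify first, then order the restart, so that the
        # notice is sent before the node starts going down.
        standby_fm.on_peer_going_down(now)
        actions.append("order_node_restart")
    # Every process terminates, designated or not.
    actions.append("terminate_process")
```

With a notice sent 0.2 s after the crash, the standby in this model can
take over at 0.2 s instead of at 10 s.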
