Ticket #601 (new defect)
FM/AVSv: SCAP failure can cause duplicate active for 2N redundant model
| Reported by: | anders | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | PL 3.0.1 |
| Component: | FM | Version: | 3.0.0-GA |
| Keywords: | AVSv | Cc: | |
| patch waiting for maintainer: | no |
Description
See discussion thread in:
If the SCAP process crashes then this should lead to an IMMEDIATE node
restart (or at least a restart of the middleware and SAF applications
on that node).
The current solution allows applications to continue executing
(for 10 seconds); the standby is then promoted to active in parallel
with an order from the FM at the standby to the FM at the "active in
demise" to restart.
This solution is unreliable (we don't know whether the FM at
the old active will comply), dangerous (since we allow a node
with extremely serious AVSv problems to continue executing), and
defective (since it has a tendency to cause duplicate execution
of the 2N redundancy model).
The only reason I don't classify the ticket as critical is that the
problem should be rare in a real system. We have only seen the
problem when testing by manually killing SCAP.
I have provided a simple illustrative patch that shows approximately
what should be done. In essence, when AVA detects loss of contact
with (the local) AVND, it should terminate its hosting process.
In addition, one of the processes/AVAs should order the node
restart AND send a message to the peer FM that it is going down,
which will cut short the 10 second waiting time for failover.
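
As a rough sketch of the decision logic described above (this is not the
attached patch, and none of the names below are real OpenSAF APIs; they
are hypothetical and chosen for illustration only), the behaviour could
look something like this. It assumes that exactly one process per node is
designated to order the restart and notify the peer FM, and that the peer
FM drops the wait to zero once it has received the "going down" message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All identifiers here are hypothetical; they do not exist in OpenSAF. */

enum ava_action {
    AVA_CONTINUE,               /* local AVND reachable: keep running */
    AVA_EXIT_PROCESS,           /* contact lost: terminate the hosting process */
    AVA_NOTIFY_PEER_AND_RESTART /* contact lost: tell peer FM, order node restart */
};

/*
 * Decide what the process hosting an AVA should do after checking the
 * link to the local AVND. Only the designated process (an assumption of
 * this sketch) notifies the peer FM and orders the node restart; every
 * other process simply terminates.
 */
enum ava_action ava_on_avnd_check(bool avnd_reachable, bool is_designated)
{
    if (avnd_reachable)
        return AVA_CONTINUE;
    return is_designated ? AVA_NOTIFY_PEER_AND_RESTART : AVA_EXIT_PROCESS;
}

/* The 10 second failover wait mentioned in this ticket. */
#define FM_FAILOVER_WAIT_MS 10000u

/* Minimal view of the standby FM's state, for illustration only. */
struct standby_fm {
    bool peer_going_down; /* set when the peer FM announced its shutdown */
};

/*
 * How long the standby FM waits before promoting itself to active.
 * Assumption: an explicit "going down" message removes the wait entirely.
 */
unsigned fm_failover_wait_ms(const struct standby_fm *fm)
{
    assert(fm != NULL);
    return fm->peer_going_down ? 0u : FM_FAILOVER_WAIT_MS;
}
```

The point of splitting it this way is that no process on the failing
node keeps running after losing AVND, regardless of whether the old
active FM later complies with a restart order from the standby.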
