Handling AROVA Failures
Keeping the AROVA healthy is essential, although its status is completely independent of the production VMs.
- If an AROVA fails for any reason (see below), it should be restarted immediately.
- Protection of production VMs is not affected. However, while the AROVA is down it stops handling configuration updates, which can affect failover if a failure occurs during that time.
- If an AROVA fails to run in the currently active replica zone (i.e., the compute and/or disk replica is down), a notification is sent out.
- The cause of the failure should be resolved and the AROVA should be redeployed using the replica in the second zone.
- The AROVA can be restarted by running the arova-cli.py script with the recovery command. For example:
python3 ./arova-cli.py recovery \
--src-pri-zone us-east1-b \
--aro-disk-name jet-aro-data-us-central1-us-east1 \
--sa [email protected] \
--project arova-project
Important: The above command is an illustrative example only and should not be used as is.
- The recovery script parameters can be generated using the Recover Helper from the Management Site.
- The recovery procedure assumes that all previous instances of the AROVA for the same region pair are down. If such a VM is still available, specifying the --force command-line option deletes all unexpected AROVA instances before proceeding.
- The same option can be used when the replication state of the ACD cannot be determined.
- If the active or passive replica of the ACD is down, the AROVA may still be able to run, depending on the scope of the incident.
- In that case, a notification is sent out but no user action is necessary.
- If the AROVA loses its disks in both zones (R1Z1 and R1Z2), a notification is sent out.
- The cause of the failure should be resolved and the AROVA should be redeployed in the secondary region using the arova-cli.py script described above.
- The script requires the user to delete any stale AROVA instances that are present.
Note: AROVA and protected VMs may run in different regions. AROVA failover does not impact ongoing asynchronous replication.
- If the AROVA fails due to a software error, a notification is sent out.
- The cause of the failure should be resolved and then the AROVA VM should be restarted.
- A support bundle that includes the AROVA process event log and core dump should be collected and forwarded to the support team.
- If the AROVA is inadvertently deleted (together with the ACD), a notification is sent out.
- Once this issue is discovered, the AROVA should be redeployed in the secondary region using the arova-cli.py script described above.
- In the case of a primary region failure, the resulting issues may appear unrelated to ARO.
- If deemed appropriate, a failover should be conducted, with the AROVA redeployed in the secondary region using the arova-cli.py script described above.
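The recovery invocation described above can be assembled programmatically, which helps avoid typing mistakes when the parameters come from the Recover Helper. The following is a minimal sketch only: the function name build_recovery_command and the placeholder values are hypothetical, the flags are the ones shown in the example command in this section, and placing --force at the end of the command line is an assumption. The sketch prints the command for review and does not run it.

```python
import shlex


def build_recovery_command(src_pri_zone, aro_disk_name, service_account,
                           project, force=False):
    """Return the argument list for an `arova-cli.py recovery` invocation.

    Hypothetical helper: only the flags shown in the example command of this
    section are used. The position of --force is an assumption.
    """
    argv = [
        "python3", "./arova-cli.py", "recovery",
        "--src-pri-zone", src_pri_zone,
        "--aro-disk-name", aro_disk_name,
        "--sa", service_account,
        "--project", project,
    ]
    if force:
        # Per the text above, --force deletes unexpected AROVA instances
        # before the recovery proceeds, so it is opt-in here.
        argv.append("--force")
    return argv


if __name__ == "__main__":
    # Placeholder values only; take the real ones from the Recover Helper.
    argv = build_recovery_command(
        src_pri_zone="EXAMPLE-ZONE",
        aro_disk_name="EXAMPLE-ARO-DISK",
        service_account="EXAMPLE-SERVICE-ACCOUNT",
        project="EXAMPLE-PROJECT",
    )
    # Print a shell-quoted command line for review instead of executing it.
    print(shlex.join(argv))
```

Building the command as a list and quoting it with shlex.join keeps each parameter a separate argument, so a value containing spaces or shell metacharacters cannot change the meaning of the command.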
Also see: