While Log Insight strives to make administration easy, issues can come up from time to time. On rare occasions, I have heard of a Log Insight upgrade failing. So what should you do if the upgrade fails? Read on to learn more!
Prevention
The best way to ensure a Log Insight upgrade goes smoothly is to prevent any issues in the first place. Before you upgrade your production environment, please validate the following:
- You are running a supported configuration
- You have read and understand the release notes
- You have read and understand the documentation around upgrading
- You have tested the upgrade in a non-production environment
- You have taken a snapshot of your Log Insight instance/cluster before attempting to upgrade
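If you want to make the checklist above harder to skip, it can be expressed as a simple gate in a runbook script. This is only a minimal sketch: the names `PRE_UPGRADE_CHECKS`, `unmet_checks`, and `ready_to_upgrade` are made up for illustration and are not part of Log Insight. Each item is something an operator confirms by hand, and nothing here talks to the product.

```python
# Hypothetical pre-upgrade gate; the items mirror the checklist above.
PRE_UPGRADE_CHECKS = [
    "Running a supported configuration",
    "Read and understood the release notes",
    "Read and understood the upgrade documentation",
    "Tested the upgrade in a non-production environment",
    "Took a snapshot of the Log Insight instance/cluster",
]


def unmet_checks(confirmed):
    """Return the checklist items not yet confirmed, in checklist order."""
    return [item for item in PRE_UPGRADE_CHECKS if item not in confirmed]


def ready_to_upgrade(confirmed):
    """True only when every checklist item has been confirmed."""
    return not unmet_checks(confirmed)


if __name__ == "__main__":
    # Example: everything confirmed except the snapshot.
    done = set(PRE_UPGRADE_CHECKS[:-1])
    for item in unmet_checks(done):
        print("Still to do:", item)
    print("Ready to upgrade:", ready_to_upgrade(done))
```

The point of the gate is that production upgrades only proceed when the list of unmet items is empty.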
Success!
The default behavior you should experience when performing a Log Insight upgrade is success. This, of course, assumes you took care of all the prevention steps above.
Failed, now what?
If the upgrade fails, Log Insight should give you an error message to point you in the right direction. In addition, if the upgrade had started and failed at some stage, Log Insight should roll back the upgrade and get the cluster back online in a healthy state. As for what to do, I would recommend:
- Ensure Log Insight is online and accepting traffic: assuming your production environment was being upgraded, the first goal should be to restore service
  - If not, determine if a roll-back is underway by checking the /admin/cluster page, and if so, wait
  - If down and not rolling back (or finished rolling back), see if anything obvious can be done to get you back online
  - If down, not rolling back, and nothing obvious can be done, then restore from backup
- Determine what step of the upgrade failed, what the current state of the cluster is, and what the error message indicates
  - Prevalidation: if the upgrade failed here, the cluster will automatically recover; address the reported error and try again
  - Failed on a node and rolled back: see the error message to determine what can be done to address the reported issue, then try the upgrade again
  - Failed on a node and did not roll back: see the error message to determine what can be done to address the reported issue, then try to upgrade the failed node again
- If unable to resolve the issue, check for KBs
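The recommendations above amount to a small decision tree, which can be sketched in code for a runbook. This is purely illustrative: `UpgradeState`, `Stage`, `restore_service_action`, and `retry_action` are hypothetical names, and every field is something the operator observes by hand (for example on the /admin/cluster page). It does not call any Log Insight API.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Where the upgrade failed (hypothetical labels for this sketch)."""
    PREVALIDATION = "prevalidation"
    NODE_UPGRADE = "node upgrade"


@dataclass
class UpgradeState:
    # All fields are filled in manually by the operator.
    online: bool          # is Log Insight online and accepting traffic?
    rolling_back: bool    # does /admin/cluster show a roll-back underway?
    obvious_fix: bool     # is there something obvious that restores service?
    failed_stage: Stage   # which step of the upgrade failed
    rolled_back: bool     # did the failed node roll back?


def restore_service_action(state):
    """First goal: get production back online."""
    if state.online:
        return "service is up; move on to diagnosing the failure"
    if state.rolling_back:
        return "wait for the roll-back to finish"
    if state.obvious_fix:
        return "apply the obvious fix to get back online"
    return "restore from backup"


def retry_action(state):
    """Second goal: decide how to retry once the reported error is addressed."""
    if state.failed_stage is Stage.PREVALIDATION or state.rolled_back:
        return "address the reported error and retry the upgrade"
    return "address the reported error and retry the upgrade on the failed node"


if __name__ == "__main__":
    state = UpgradeState(online=False, rolling_back=True, obvious_fix=False,
                         failed_stage=Stage.NODE_UPGRADE, rolled_back=False)
    print(restore_service_action(state))
    print(retry_action(state))
```

Keeping the two questions separate matches the order of the list: restore service first, then work out what failed and how to retry.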
Failed, what NOT to do…
If the upgrade failed and did not automatically recover (i.e., roll back), then you have entered an error state. You should attempt to get out of the error state as quickly as possible, as the system may not behave as expected. With that said, you do not want to make the situation worse. Below is a list of things I strongly advise you DO NOT DO in an error state.
- Make any configuration changes: this includes any operation in the Administration UI or on /internal/config
  - Any change made will likely be lost upon addressing the underlying issue
  - Any change may make the situation worse
- Make any changes to content: this includes content packs as well as user content
  - Any change made will likely be lost upon addressing the underlying issue
The first recommendation should not be a surprise, while the second one may be. To reiterate the first one: DO NOT attempt to remove or replace nodes in the cluster while in this error state. Check the release notes, check the KBs, check the logs, and open a support case with GSS, but DO NOT under any circumstances attempt to remove or replace nodes.
© 2017, Steve Flanders. All rights reserved.