I sat in on an interesting conversation recently where the question was raised whether upgrades for products should be done in-place versus via a migration. In this post, I would like to talk about the different upgrade types, discuss typical customer expectations and provide some insight into my opinions on the topic.
- In-place: An in-place upgrade is one where all changes required for the upgrade are made in place. For clustered systems, this means upgrading one node does not impact other nodes in the cluster, and data (other than possibly the upgrade package itself) is not transferred between nodes.
- Migration: A migration upgrade is one where data is moved from one place to another (e.g. node to node). In many cases, migration upgrades mean moving data over the network and between systems. Typically, a migration upgrade means moving (i.e. upgrading) to a brand new node and decommissioning the old node.
- Hybrid: A hybrid upgrade is one that features parts of both in-place and migration upgrades. A common example is an upgrade where a migration happens in-place, such as moving from one database system to another.
- Customers: In most cases, customers prefer in-place upgrades. The reasons for this are numerous, but typically include things such as ease-of-use, time to upgrade and previous biases.
- Developers: In most cases, developers prefer migration upgrades. The reasons for this are numerous, but typically include things such as starting with a clean slate, stateless coding and fewer corner-case issues.
Given these competing expectations, what is the right solution?
At the end of the day, it does not matter if an upgrade is done in-place or via a migration. In the development world this is known as a design decision. What matters are the (customer) requirements. For upgrades, the requirements in my opinion are:
- Fast: People do not want to wait long periods of time for upgrades to complete. Using Log Insight as an example, let’s assume you have 2TB of data on your LI instance and the upgrade requires migrating 2TB of data from one virtual appliance to another. Such an operation would not be fast and should be avoided at all costs.
- Cheap: Speaking of costs, upgrades should be cheap. What I mean is, you should not need excessive additional hardware in order to perform an upgrade. Going back to the LI example, let’s assume I have a 6-node large cluster (96 vCPU, 192GB memory, 12TB disk) and I need another complete 6-node large cluster to perform an upgrade. Such a requirement would not be cheap. Even requiring only a small number of additional nodes should be avoided, as resiliency is built into production solutions and should be leveraged during upgrade operations.
- Transparent: The user should not need to know what is going on behind the scenes. If the user is expected to set up another system manually, enter credential information, copy certificate information, etc. (e.g. VCSA upgrade), then the upgrade procedure is not transparent. Whether the upgrade behind the scenes is done in-place or via a migration, the user should not know and should not need to care.
- Secure: Sometimes it is necessary to copy information during an upgrade. While this operation should be fast, cheap and transparent, it also needs to be secure. Leaking sensitive information during an upgrade is unacceptable.
- Revertable: Inevitably something will go wrong from time to time, and when it does the system needs to self-heal. This means it is critical that the system be able to roll back to the state it was in before the upgrade. This can be challenging, as upgrades can fundamentally change the structure of data, but the requirement is the ability to roll back, and ideally this procedure is done automatically. Note that this does not rule out a small amount of data loss: to meet some of the other upgrade requirements, a rollback may lose a small amount of data, as long as this is transparent to the user.
- Online: An upgrade needs to be done while the system is online and operational. It is OK to bring down parts of the system, but the system as a whole needs to remain operational during the upgrade operation. Note that if a rollback is required, the system does not need to be online to complete the rollback operation, but it should be fully back online once the rollback completes.
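To put the Fast requirement in numbers, here is a rough back-of-the-envelope sketch of how long it would take to copy the 2TB from the example above between two appliances. The link speeds and the 70% usable-throughput figure are my own assumptions for illustration, not measurements from any product, and the estimate ignores disk speed and any re-indexing on the target.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to copy data_tb terabytes over the network.

    efficiency is an assumed fraction of the nominal link speed that is
    actually usable for the copy (protocol overhead, contention, and so on).
    """
    bits_to_move = data_tb * 1e12 * 8              # decimal terabytes -> bits
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_sec / 3600


if __name__ == "__main__":
    # 2TB comes from the example in the post; 1 and 10 Gbps are assumed links.
    for gbps in (1, 10):
        print(f"{gbps} Gbps link: about {transfer_hours(2, gbps):.1f} hours")
    # 1 Gbps link: about 6.3 hours
    # 10 Gbps link: about 0.6 hours
```

Even under these fairly generous assumptions, a migration on a 1 Gbps network adds hours to the upgrade before any actual upgrade work starts, and that is for a single node.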
Upgrading, especially of complex systems, can be challenging. In this post, I talked about the common types of upgrades, customer versus developer expectations and my opinions on upgrade requirements. I suspect that many of my requirements are missing in many of the products you are using today. Perhaps you have heard the saying, “fast, cheap, good: pick two!” People may argue that many of my requirements can be solved outside of a product. For example, security can be handled by firewalls, revertability can be handled by taking a manual backup before the upgrade, and why does fast matter if the upgrade happens online? Understanding that workarounds exist, the key upgrade points to take away from this post are:
- Online: Downtime during an upgrade is unacceptable. Having parts of the system offline during the upgrade is OK, but the system as a whole needs to function.
- Transparent: Whether an upgrade is done in-place or via migration is a design decision. The user should not be aware of what is happening behind the scenes.
- Cheap: Software comes with a cost, which includes CapEx (the hardware and software required to run the product, as well as support) and OpEx (the people and additional material required to run a product). Upgrades should come with the minimum cost required to use the product and should not require temporary resources to complete an often infrequent operation.
- User Experience: For the best user experience, workarounds should be avoided and the product should offload the complexities of an upgrade operation. This means that while you could rely on things such as firewalls and manual backups, the product should really handle this for the user.
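As a minimal sketch of how the Online and Revertable points can fit together, the following simulates a rolling upgrade that touches one node at a time and automatically restores every touched node if a health check fails. The `Node` class, the `healthy_after_upgrade` flag and the version strings are hypothetical stand-ins for illustration and do not reflect how any particular product implements upgrades.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    healthy_after_upgrade: bool = True  # simulation knob (hypothetical)


def rolling_upgrade(nodes: list[Node], new_version: str) -> bool:
    """Upgrade nodes one at a time; roll back all touched nodes on failure.

    Returns True if every node was upgraded, False if a rollback happened.
    Only one node is being changed at any moment, so the remaining nodes
    can keep serving requests while the upgrade runs.
    """
    previous: dict[str, str] = {}  # node name -> version before the upgrade
    for node in nodes:
        previous[node.name] = node.version
        node.version = new_version  # stands in for the real upgrade step
        if not node.healthy_after_upgrade:
            # Automatic rollback: restore every node touched so far.
            for touched in nodes:
                if touched.name in previous:
                    touched.version = previous[touched.name]
            return False
    return True


if __name__ == "__main__":
    cluster = [
        Node("node1", "1.0"),
        Node("node2", "1.0"),
        Node("node3", "1.0", healthy_after_upgrade=False),
    ]
    succeeded = rolling_upgrade(cluster, "2.0")
    print(succeeded, [n.version for n in cluster])
    # False ['1.0', '1.0', '1.0']
```

The user only sees a success or failure result; whether each node was upgraded in-place or via a migration stays hidden behind the upgrade step.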
© 2015, Steve Flanders. All rights reserved.