Migrating to AWS requires careful planning, preparation and design. Migration tools help as well, and there is a variety of them out there. CloudEndure is one of the widely used tools for AWS migrations. In this article, I share my experiences of using CloudEndure to perform migrations for many enterprise clients and tens of thousands of servers. These experiences will hopefully enable and accelerate your AWS migrations, and possibly help with common issues you may come across while using CloudEndure.
CloudEndure, at first glance, makes it look very easy to migrate an on-premise server to AWS. In many cases this is true. However, when an issue arises with CloudEndure, it can make or break a successful migration.
When issues arise, it is important to open a ticket with the CloudEndure team as soon as possible. However, instead of sitting back and waiting on a response from support, here are some workarounds to keep your migration moving:
- Not enough kernel memory – When a server suffers from low kernel memory, it can wreak havoc. The CE instance in the dashboard may show syncing, and then one day the syncing stops. The server has to be rebooted to free up the kernel memory, and you should keep a close eye on it until migration day because the issue can recur. In one situation, the low kernel memory issue put an on-premise database server a day behind on syncing. When we started the migration the next day, I noticed the server was about 24 hours behind and was not syncing at all. We rebooted the server, and CE support estimated that the sync would catch up in 3 hours. We were able to stall the migration, but after several conference calls with the client, and with the sync still running, the migration was canceled. The sync took 18 hours. This one server was critical to the migration, as about 5 other servers worked with this database server. We have since put into practice, and we strongly recommend that you follow, the steps below for checking the sync process leading up to migration day.
Check the sync process on the schedule:
48 hours prior to migration day
24 hours prior to migration day
12 hours prior to migration start time
6 hours prior to migration start time
3 hours prior to migration start time
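The schedule above can be turned into concrete check times once the migration start is known. The following is a minimal sketch that assumes all five offsets are measured from the migration start time; the function name and the example start date are hypothetical.

```python
from datetime import datetime, timedelta

# Hours before migration start at which to check the CE sync (from the schedule above).
CHECK_OFFSETS_HOURS = [48, 24, 12, 6, 3]


def sync_check_times(migration_start):
    """Return the datetimes at which the sync status should be checked."""
    return [migration_start - timedelta(hours=h) for h in CHECK_OFFSETS_HOURS]


if __name__ == "__main__":
    # Hypothetical migration start, used only for illustration.
    start = datetime(2024, 6, 15, 22, 0)
    for hours, when in zip(CHECK_OFFSETS_HOURS, sync_check_times(start)):
        print(f"T-{hours:>2}h  {when:%a %Y-%m-%d %H:%M}")
```

Putting these times on the team calendar makes it harder to miss a check on a server that has quietly fallen behind.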
- Not enough space on the drive to install CloudEndure – This one is very straightforward. The first goal in the CE replication process is to get the on-premise server replicated. If you notice very low disk space but are still able to get CE installed, take note of the low space and notify the person responsible for the server. Some solutions are to free up space or move files to another volume attached to the server. On the AWS side, offer to increase the disk size during the migration.
- Low disk space on any drive – Keep an eye on disk space leading up to the migration, as I have seen some servers run low on or out of disk space beforehand. Sometimes a large database dump lands on a volume and eats up most, if not all, of the space. This can cause the CE replica to fall out of sync. Depending on the size of the file, you will have to wait for the sync to complete, which can be time consuming and can therefore impact the migration deliverable, its completeness, or the migration start time. Monitoring disk space on 30-40 servers all the time can be very time consuming, so if you don’t monitor them continuously, check the disk space after the server is replicated on migration day. When doing your smoke testing or functionality testing, and definitely before handing the server over to the client, check the disks for space issues.
- Multiple NICs – Some on-premise NIC configurations with unique settings, such as NIC clustering, can cause issues.
- Reboot of server while it is replicating in CE – An accidental or required reboot of the server. I have experienced issues where the server is syncing and the client restarts it while CE is replicating. This can cause CE to check all the disk blocks again. On a server with more than one or two drives, the sync process can take a long time: 3 to 18 hours or longer.
- Different types of volumes – Check for spanned volumes, LVM (logical volumes), or EMC PowerPath multipath devices. Knowing about these before CE replication starts is helpful, but looking for them during discovery may be time consuming. If you run into an issue during replication, check the volumes first.
- Windows 2003 servers NIC driver issue – This is an issue with the drivers used for the CE replication. The recommendation is that after you create the replica in AWS for the migration, you attach a second ENI, log in to the IP of that second ENI, and then delete eth0 from Device Manager. You may get disconnected. AWS will insert a new eth0 device. This fixes the issue of not being able to log in to the server in AWS on the eth0 device. This is only needed on Windows 2003 servers.
- Firewall/AntiVirus software blocking CE install – There are many products on the market, and depending on the configuration, the installation of CE can be blocked. A past project I worked on used a product called Bit9 that blocked the installation of CE. Another project used Symantec, which did not block the installation. If CloudEndure does not install correctly, check the logs. If the logs do not reveal the blockage, check for firewall and antivirus software. If such software exists, which is the most common case, reach out to the administrator of that product and ask for their assistance, not only in confirming the blockage but also in implementing an exception that allows CE to install.
- Python 2.4 on Linux – Make sure the server is running at least version 2.4 of Python. The install won’t be successful without it.
- DNS not implemented on server – I have come across several on-premise servers that do not have DNS configured. This causes an issue when installing CloudEndure, because the agent connects back to its server using DNS. As stated, it is always best to get CloudEndure installed and replicating and to reach a 100% replication of the server. When DNS is not configured, it may take a day or longer to get a request out and a ticket opened just to have DNS turned on for this one replication step. What I have done previously is hardcode the CloudEndure IP address and CE DNS hostname into /etc/hosts on Linux or, on Windows, into the /windows/system32/drivers/etc/hosts file. This step has allowed me to get CE installed and ultimately led to the successful migration of the server. During the smoke test/functionality test that we perform before handing off the server, we normally check the hosts file, so I remove the manual entry at that point. Be careful with manual hosts entries, because the CE vendor may change the IP address for their site while your migration is in progress.
- .NET 3.5 SP1 requirement – On Windows servers, make sure .NET 3.5 is installed or CE won’t install. We recommend .NET 3.5 SP1 because of another requirement that is not related to CE installation and replication. The main goal is to get the box replicated successfully, and .NET 3.5 is the minimum needed to accomplish that.
- FSTAB entries Linux – There have been many cases where entries in the fstab file have caused the server to fail to replicate or to not stand up successfully in AWS. There is more than one type of fstab issue, so it is best practice to always open a ticket with CloudEndure Support to confirm changes to a production fstab file. If the syncing is not working due to an fstab issue, work the issue with CloudEndure. Usually the CE support rep will tell you to make the changes to the fstab and wait about 10 minutes for CE to catch up on the syncing; then you can create the replica for AWS. Once the sync has caught up, and right after you kick off the replica, go back to the on-premise server and restore the fstab to the way it was. I usually make a copy of the fstab before any changes, and then I comment out the entries instead of deleting them. One indicator of an fstab issue is when you look at the AWS log for the started instance and see that the status is stuck at disk-check.
- Custom SSH configs – I have had issues with a few servers that were replicated and stood up in AWS and passed both status checks, yet I was not able to log in to the instance. It turned out that SSH on the on-premise server was configured to listen on a port other than the standard one. In another case, SSH was configured to allow only a handful of IPs to connect. The fix for the port issue is to configure the AWS security group to allow the custom port. However, discuss the issue with the client so they are aware of it, and see whether they want to use the typical SSH port or stick with the custom port. To fix the locked-down IPs, simply add your IP to the config file on the server.
- Many volumes – AWS has a limit of 50 volumes per server that it can support. The CE installer will usually tell you when there are more volumes than AWS supports. I have come across cases where CE can help with many volumes, and we have replicated servers with 30 to 40 volumes. One such server had EMC PowerPath installed. You do have to work directly with CE support to change the devices that are included in the replication. It is a command line option that CE Support will give you once they identify all the volumes.
- Windows 2003 servers that do not have SP1 or SP2 – AWS will not support Windows 2003 servers that are not running at least SP1. If you have to install a service pack, make it SP2. If you try to install CloudEndure onto a Windows 2003 server without one of those two service packs, you may get the error: “This application has failed to start because the application configuration is incorrect.”
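For the fstab item above, commenting entries out rather than deleting them makes it easy to restore the file afterwards. The following is a minimal sketch that works on the fstab contents as text and never touches /etc/fstab itself. Which entries to disable should come from CloudEndure Support; the choice of NFS types here, the marker string, and the function names are assumptions for illustration only.

```python
# Hypothetical marker so that only lines disabled for the migration are restored later.
MARKER = "#CE-MIGRATION# "


def comment_out_fstab_entries(fstab_text, fs_types=("nfs", "nfs4")):
    """Return fstab text with entries of the given filesystem types commented out."""
    result = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        # fstab fields: device, mount point, fs type, options, dump, pass
        fields = stripped.split()
        is_entry = bool(stripped) and not stripped.startswith("#") and len(fields) >= 3
        if is_entry and fields[2] in fs_types:
            result.append(MARKER + line)
        else:
            result.append(line)
    return "\n".join(result) + "\n"


def restore_fstab_entries(fstab_text):
    """Return fstab text with the migration marker removed again."""
    lines = []
    for line in fstab_text.splitlines():
        lines.append(line[len(MARKER):] if line.startswith(MARKER) else line)
    return "\n".join(lines) + "\n"
```

In practice you would still keep a separate copy of the original file before writing any changes, and run the restore step right after the replica has been kicked off.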
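The disk space checks recommended in the low disk space item can be scripted for the smoke test. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the 10% threshold, the function name, and the example mount point are assumptions, not values from CloudEndure or AWS.

```python
import shutil


def low_space_mounts(mount_points, min_free_fraction=0.10):
    """Return (mount, free_fraction) pairs for mounts below the free-space threshold."""
    low = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)  # named tuple: total, used, free (bytes)
        free_fraction = usage.free / usage.total if usage.total else 0.0
        if free_fraction < min_free_fraction:
            low.append((mount, free_fraction))
    return low


if __name__ == "__main__":
    # Hypothetical mount list; replace with the volumes of the server under test.
    for mount, fraction in low_space_mounts(["/"]):
        print(f"{mount}: only {fraction:.0%} free")
```

Running this on the replicated instance before handoff gives a quick list of volumes to raise with the client.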
When the first replication has completed, meaning all volumes have done a first pass of block-by-block replication, you will want to test the server startup in AWS. You do not want to wait until the day/night/weekend of migration to find out that the server will not start up. We call this “Server not going green in AWS”. There are times when the replica you create to stand up an instance in AWS will not pass the status checks (not go green). Before opening a ticket, I stop the instance, wait about 5 or 10 minutes, and then start it again. If the AWS instance still won’t go green, immediately open a ticket with CloudEndure support so they can engage.
As recommended previously, you will want to test startup of the replica in AWS. If the server does start up successfully, shut it down in AWS so as not to incur costs, and let it sit there until the day of migration. However, it has happened that on migration day you recreate the replica and it will not pass the status checks. On one such migration, it was discovered that someone had added NFS entries to the fstab, which kept the AWS instance from booting. The fix is discussed in the FSTAB entries item above.
Sometimes the CloudEndure dashboard will show that none of the servers have synced. Don’t panic yet; immediately open a ticket with CloudEndure support. The dashboard interface gets out of sync at times and erroneously reports an invalid sync.
Hopefully, these pointers will help you with your AWS migration planning and execution.