I have to say, I am quite shocked that I am on the tail end of waiting 1.5 hours for an ESXi 5.0 upgrade to complete booting. Seriously… 1.5 hours.
I have been waiting for some time to get some ESXi 5.0 awesomeness going on in my environment. vCenter has been sitting on v5 for some time and I have been deploying ESXi 5 in a couple stand-alone situations without any issues. So, now that I have more compute capacity in the data center, it is time to start rolling the remaining hosts to ESXi 5… or so I thought!
I downloaded ESXi 5.0.0 Kernel 469512 a while back and have been using that on my deployments. So far, so good… until today. Update Manager configured with a baseline –> Attach –> Scan –> Remediate –> back to business. Surely, Update Manager processes should take more time than the actual upgrade. About 30 minutes after starting the process, vCenter was showing that the remediation progress was a mere 22% complete and the host was unavailable. I used my RSA (IBM’s version of HP ILO or Dell DRAC) to connect to the console. Sure enough, it was stuck at loading some kernel modules. About 20 minutes later IT WAS STILL THERE!
Restarting the host did not resolve the issue. During the ESXi 5 load screen, pressing Alt + F12 loads the kernel messages. It turns out that iSCSI was having issues loading the datastores in an acceptable amount of time. I was seeing messages similar to:
A little research turned me onto the following knowledgebase article in VMware’s KB: ESXi 5.x boot delays when configured for Software iSCSI (KB2007108)
This issue occurs because ESXi 5.0 attempts to connect to all configured or known targets from all configured software iSCSI portals. If a connection fails, ESXi 5.0 retries the connection 9 times. This can lead to a lengthy iSCSI discovery process, which increases the amount of time it takes to boot an ESXi 5.0 host.
So, I have 13 iSCSI stores on that specific host and multiple iSCSI VMkernel Ports (5). So, calling the iSCSI lengthy is quite the understatement.
The knowledgebase states that the resolution is applying ESXi 5.0 Express Patch 01. Fine. I can do that. And… there is a work around described in the article that states you can reduce the number of targets and network portals. I guess that is a workaround… after you have already dealt with the issue and the ridiculously long boot.
Finally, to help mitigate the issue going forward, VMware has released a new .ISO to download that includes the patch. However, this is currently available in parallel with the buggy .ISO ON THE SAME PAGE! Seriously. Get this… the only way to determine which one to download is:
As a virtualization admin, I know that I am using the Software iSCSI initiator in ESXi. But, why should that even matter at all?! There is a serious flaw in the boot process in version 469512 and that should be taken offline. Just because someone is not using Software iSCSI at the current time does not mean they are not going to in the future. So, if they download the faulty .ISO, they are hosed in the future. Sounds pretty crummy to me!
I am quite shocked that this made it out of the Q/A process at VMware in the first place. My environment is far from complex and I expect that my usage of the ESXi 5.0 hypervisor would be within any standard testing procedure. I try to keep my environment as vanilla as possible and as close to best practices as possible. 1.5 hours for a boot definitely should have been caught before release to the general public.
Additionally, providing the option to download the faulty ISO and the fixed ISO is a complete FAIL! As mentioned on the download page, this is a special circumstance due to the nature of the issue. I would expect that if this issue is as serious as the download page makes it out to be, the faulty ISO should no longer be available. There has to be a better way!
I have since patched the faulty ESXi 5.0 host to the latest/safest version, 504890, and boot times are back to acceptable. I will proceed with the remainder of the upgrades using the new .ISO and have deleted all references to the old version from my environment.
I have never run into an issue like this with a VMware product in my environment and I still have all the confidence in the world that VMware products are excellent. In the scheme of things, this is a bump in the road.