Locked files with VMFS 6

You probably know the error message: operation impossible since file is locked.
If a VMDK is locked you cannot start a VM that uses it and you cannot copy it.
The Knowledgebase basically tells you to stop all processes that access the file and reboot.
This is supposed to fix the problem.
This documentation is not good enough.
With VMFS 6, following this advice may not be sufficient to release the lock.
This means that you have effectively lost the file, as you cannot read it anymore.
Reading it from Linux is also impossible at the moment.
Recent third-party recovery tools like UFS Explorer also failed to read the VMDKs in such a case.
Here is a recent case for your reference ….
Problem with locked virtual machine after esxi host crash
I have seen a couple of these cases recently, so I started to investigate the issue.
I have now found a way to recover a VMDK that would otherwise be a case for Ontrack or VMware Support.
My procedure requires a direct hexedit of the VMFS heartbeat-section, so please do not ask for details at the moment.
I will post details once I know this approach is safe.
Anyway, I believe this is a bug in the way ESXi 6.5 handles the heartbeat-section of a VMFS-volume.
This issue should never affect single-host environments.
Feel free to contact me via Skype if you run into this yourself.

Symptoms:

None of the following commands that are available on ESXi will work once a flat.vmdk is locked:
vmkfstools -i name.vmdk new.vmdk
vmkfstools -p 0 name-flat.vmdk > mapping.txt
hexdump -C name-flat.vmdk | less
dd if=name-flat.vmdk of=new-flat.vmdk bs=1M
Starting a VM which uses the locked VMDK will fail as well.

Consequences:

Effectively the VM / VMDK is lost, as you cannot read it even one more time to copy the data.

Plan A:

Try this first:
If possible, isolate the VMFS-volume so that it is exposed to a single ESXi-host.
Then follow the troubleshooting steps from the VMware Knowledgebase:
https://kb.vmware.com/s/article/10051
If you follow the steps you are supposed to get rid of the stale lock.
Apparently this is no longer a 100% reliable procedure with VMFS 6.

Should I try VOMA?
At the moment my answer to that question is NO.
We need to know the IP / MAC of the ESXi-host that holds the lock.
To acquire that info good old vmkfstools is enough, and that works without silencing the VMFS-volume first.
In other words: in this case VOMA is most likely a waste of time.
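
The post does not say which vmkfstools call reports the lock holder. A minimal sketch, assuming it is the lock dump from "vmkfstools -D name-flat.vmdk" and assuming the usual convention that the last 12 hex digits of the reported owner ID are the MAC address of the host holding the lock. The function name owner_to_mac and the sample ID are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn the tail of a lock-owner ID into a
# colon-separated MAC address.
# Assumption: the last dash-separated field of the owner ID is the MAC.
owner_to_mac() {
    # Keep only the part after the last dash, e.g. 001a4b3c5d6e
    tail_part=${1##*-}
    # Put a colon after every two hex digits, then drop the trailing colon
    printf '%s\n' "$tail_part" | sed 's/../&:/g; s/:$//'
}
```

Called as owner_to_mac 5a1b2c3d-4e5f6a7b-8c9d-001a4b3c5d6e, it prints a MAC that can be matched against the NICs of the hosts that see the volume.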

Plan B:

This often used to work with earlier VMFS-versions:
– use Linux vmfs-tools to read the VMDK and clone it to a new location (unfortunately even the latest builds of vmfs-tools that I am aware of do not support VMFS 6)
– use an ESXi-LiveCD to read the VMFS-volume and clone the VMDK to a new location
– do a fresh install of the same ESXi-build to a temporary USB-flash drive

Plan C:

One essential requirement for a Cluster-filesystem is to allow a single host inside the Cluster to access a VMDK and to prevent access by all other hosts.
In order to do this in an efficient and fast way, every VMFS-volume uses a small section of the VMFS-metadata for so-called heartbeats.
For this “heartbeat section” VMFS 6 uses an area of just one block, which equals 1 MB.
This “heartbeat section” can be accessed in 2 ways:
1. dd if=.vh.sf bs=1M count=1 skip=3 of=heartbeat-section.bin
2. dd if=vmfs-partition bs=1M count=1 skip=20 of=heartbeat-section.bin
Option 1 appears to be the more reliable one, as it does not depend on the actual location of the .vh.sf file on the partition.
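
To see whether both options return the same block on a given volume, the two reads can be compared. This is only a sketch: same_block is a made-up name, the offsets are the ones quoted above, and it assumes a dd that understands bs=1M plus mktemp, wc and cmp.

```shell
#!/bin/sh
# Hypothetical check: is the 1 MB block at <skip_a> MB in <src_a> identical
# to the 1 MB block at <skip_b> MB in <src_b>? Returns 0 when it is.
same_block() {
    src_a=$1; skip_a=$2; src_b=$3; skip_b=$4
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }

    # The size test guards against short reads (offset past the end),
    # which would otherwise compare as two equal empty files.
    if dd if="$src_a" of="$tmp_a" bs=1M count=1 skip="$skip_a" 2>/dev/null &&
       dd if="$src_b" of="$tmp_b" bs=1M count=1 skip="$skip_b" 2>/dev/null &&
       [ "$(($(wc -c < "$tmp_a")))" -eq 1048576 ] &&
       cmp -s "$tmp_a" "$tmp_b"
    then rc=0; else rc=1; fi

    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}
```

With the placeholders used above the call would be: same_block .vh.sf 3 vmfs-partition 20. A non-zero result suggests that .vh.sf does not sit at the expected offset on this partition, so only option 1 should be trusted there.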

Quick-diagnosis:

Use a Linux-system or a Windows-system that has the commandline tool strings(.exe).
A Windows version is available here: https://docs.microsoft.com/en-us/sysinternals/downloads/strings
Run
strings heartbeat-section.bin > strings.txt
In strings.txt you should see the IP-address or MAC-address that you already know from the error message of the locked file.
If you find no such reference you can stop reading here – I assume you have a different problem.
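
The search can also be scripted instead of reading strings.txt by eye. A rough sketch: dump_mentions is a made-up name, and because the layout of the section is undocumented it is worth trying the address in more than one spelling (for example a MAC with and without colons).

```shell
#!/bin/sh
# Hypothetical helper: report whether a heartbeat dump mentions a given
# IP or MAC address. Returns 0 when found, 1 otherwise.
dump_mentions() {
    dump=$1
    needle=$2
    if command -v strings >/dev/null 2>&1; then
        strings "$dump"
    else
        # Crude fallback where strings is missing:
        # turn every non-printable byte into a newline
        LC_ALL=C tr -c '[:print:]' '\n' < "$dump"
    fi | grep -i -q -F -- "$needle"
}
```

Example: dump_mentions heartbeat-section.bin 00:1a:4b:3c:5d:6e && echo "lock holder found in heartbeat-section". The match is case-insensitive and treats the address as a fixed string, not as a pattern.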

Safety first:

As far as I know, there is no detailed public documentation on the exact layout of the heartbeat-section in a VMFS 6-partition.
That means that as long as I do not definitely know all the fine details, I have to take care that my instructions are as failsafe as possible.
When we edit such a critical section of the VMFS-metafiles, we should avoid hexeditors and other tools that are a risk in the hands of an inexperienced user.
Instead I prefer a way that makes it easy to create a backup of the relevant 1 MB block first.
The command
dd if=.vh.sf bs=1M count=1 skip=3 of=heartbeat-section.bin
does that.
Then I inject a clean-heartbeat-section.bin that I created on a new, freshly formatted VMFS 6-volume (created by the same ESXi-build).
This completely clears any stale locks that may exist and appears to have the desired effect.
If something still goes wrong, you can easily reinject the original section.

To inject a clean heartbeat-section use
dd of=.vh.sf bs=1M count=1 seek=3 if=clean-heartbeat-section.bin conv=notrunc
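
The backup and the injection can be wrapped into one guarded step. This is a sketch only, not a tested recovery tool: swap_heartbeat is a made-up name, the 3 MB offset and 1 MB size are the values quoted above, and it assumes a dd that understands bs=1M and conv=notrunc plus wc and cmp.

```shell
#!/bin/sh
# Hypothetical wrapper around the two dd commands: back up the 1 MB block
# at offset 3 MB, inject a replacement, then read it back for comparison.
swap_heartbeat() {
    target=$1      # e.g. .vh.sf
    clean=$2       # e.g. clean-heartbeat-section.bin
    backup=$3      # e.g. heartbeat-section.bin

    # Never overwrite an existing backup
    [ ! -e "$backup" ] || { echo "backup $backup already exists" >&2; return 1; }

    # The replacement must be exactly one 1 MB block
    [ "$(($(wc -c < "$clean")))" -eq 1048576 ] ||
        { echo "replacement is not 1 MB" >&2; return 1; }

    # Back up first and make sure the backup is complete
    dd if="$target" of="$backup" bs=1M count=1 skip=3 2>/dev/null || return 1
    [ "$(($(wc -c < "$backup")))" -eq 1048576 ] ||
        { echo "backup is incomplete" >&2; return 1; }

    # Inject without truncating the rest of the file
    dd if="$clean" of="$target" bs=1M count=1 seek=3 conv=notrunc 2>/dev/null || return 1

    # Read the block back and compare it with the replacement
    dd if="$target" bs=1M count=1 skip=3 2>/dev/null | cmp -s - "$clean"
}
```

A rollback is the same call with the backup as the replacement and a new backup name, for example: swap_heartbeat .vh.sf heartbeat-section.bin heartbeat-section-2.bin. Refusing to overwrite an existing backup keeps the original section safe if the function is run twice by mistake.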

In my experience so far, this injection takes effect almost immediately.
If you see no change in the behaviour, try a reboot of the ESXi-host.

Please help …

I definitely need to see more cases of this defect before I consider offering downloads of premade fixed sections.
So if you run into this problem in the near future, please contact me.
skype: sanbarrow
I will then create a fixed section and help you to safely inject it.

Ulli

You can hire me on a “per-incident-level” – my help is most useful with recovery-problems.
