Slow or failed live migrations are observed for large volume-backed instances, with Nova compute logs indicating that the memory copy never completes.
INFO nova.compute.manager [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Took 5.12 seconds for pre_live_migration on destination host xyzxyzxy-abd5-48d3-990b-399cae5ae911.
INFO nova.virt.libvirt.migration [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Increasing downtime to 50 ms after 0 sec elapsed time
INFO nova.virt.libvirt.driver [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Migration running for 60 secs, memory 52% remaining; (bytes processed=7190652899, remaining=9001529344, total=17184923648)
INFO nova.virt.libvirt.driver [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Migration running for 12180 secs, memory 5% remaining; (bytes processed=1463919506985, remaining=831111168, total=17184923648)
INFO nova.virt.libvirt.driver [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Data remaining 831111168 bytes, low watermark 2117632 bytes 12622 seconds ago
WARNING nova.virt.libvirt.migration [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Live migration not completed after 12800 sec
- Platform9 Managed OpenStack - All Versions
This scenario, where the ostackhost logs show memory being continuously transferred for the instance while the migration never completes, is recognized in this upstream commit.
If the guest is dirtying memory quicker than the network can transfer it, it is entirely possible that a migration will never complete. In such a case it is highly undesirable to leave it running forever since it wastes valuable CPU and network resources.
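The log excerpt above illustrates this: after 12180 seconds the migration has pushed roughly 1.46 TB over the wire for an instance with only ~16 GiB of RAM, meaning the guest has re-dirtied its memory faster than it could be copied. A quick back-of-the-envelope check using the figures from the logs:

```shell
# Figures taken from the nova.virt.libvirt.driver log lines above
processed=1463919506985   # bytes transferred after 12180 seconds
total=17184923648         # instance RAM in bytes (~16 GiB)
elapsed=12180             # seconds elapsed

# How many full passes over guest RAM the migration has already made
echo "passes over RAM: $((processed / total))"

# Sustained transfer rate in MiB/s
echo "transfer rate:   $((processed / elapsed / 1024 / 1024)) MiB/s"
```

Roughly 85 full passes over the instance's RAM at ~114 MiB/s: the guest is dirtying pages at least as fast as the network can move them, so the remaining-data figure never reaches the low watermark.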
As part of that commit, the live_migration_progress_timeout option was introduced to abort a migration that made no progress. However, the option was later deprecated (re: 1644248) because the data_remaining parameter it relied on was found to be unreliable. It is explicitly stated in the Openstack Commit that this option should not be changed and that it may be re-enabled in a future release.
One option to work around the issue is to cold-migrate the instance in a shutoff state. Another option is to enable auto-convergence in the Nova configuration files, as described in the Nova Documentation. When pre-enabled on the hypervisors, this option slows down the instance's CPU whenever the migration is unlikely to complete, until the memory copy process becomes faster than the instance's memory writes.
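As a sketch of the cold-migration workaround, the instance can be stopped and migrated with the standard OpenStack CLI. The server name `my-instance` below is a placeholder; note that on some client versions the final confirmation step is `openstack server migrate confirm` instead of `openstack server resize confirm`:

```shell
$ openstack server stop my-instance
$ openstack server migrate my-instance
$ openstack server resize confirm my-instance
$ openstack server start my-instance
```

Because the instance is shut off, no memory needs to be copied at all, so the dirty-page problem does not arise; the trade-off is instance downtime for the duration of the migration.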
To enable auto-convergence on the hypervisor:
- Set live_migration_permit_auto_converge = true in the nova_override.conf file.
$ cat /opt/pf9/etc/nova/conf.d/nova_override.conf | grep auto_converge
live_migration_permit_auto_converge = True
- Restart pf9-ostackhost service for the changes to take effect.
$ systemctl restart pf9-ostackhost
Note: This change will have to be performed on all the hypervisors. Also, keep in mind that a possible downside of auto-convergence is that throttling the instance's CPU will impact the performance of the application running inside the instance at the time of live migration.
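To check whether auto-convergence is actually being applied to an in-flight migration, the libvirt migration job can be inspected on the source hypervisor; when throttling kicks in, virsh domjobinfo reports an auto-converge throttle percentage. The domain name below is a placeholder (it can be found via virsh list), and the reported throttle value will vary:

```shell
$ virsh domjobinfo instance-0000abcd | grep -i "auto converge"
Auto converge throttle: 20
```

If the throttle line is absent while the migration is running, the guest is not being slowed down and the permit_auto_converge setting on the hypervisor should be re-verified.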