Category published:  VMWare   Click on the Category button to get more articles regarding that product.

VMWARE, ESX, Netapp Storage AFF-C190, NFS 3, Server OS freeze/stall during Veeam Backup or high load

Posted by admin on 19.01.2023

 

SOLUTION (performance): Changing the share from NFS3 to NFS4 made the performance problem and the freezes go away. Alternatively, fine-tune the settings of the existing NFS3 share.

SOLUTION (NFS 4.1 failover): NetApp ONTAP failover used to fail with NFS 4.1 datastores. According to a NetApp partner, this is solved with ESX 7.0 GA and 6.7P04.

SOLUTION (Veeam): If you have sensitive high-end server OSes and apps, keep them on the same ESX machine as the VEEAM Backup Proxy. If all your servers are sensitive high-end machines, tune everything to perfection regarding NFS and/or install a VEEAM Backup Proxy on EACH ESX cluster (as we understood it).

 

You will find this in some older NetApp KB articles: NFS.nfsvers: This setting controls the version of NFS that the storage system uses. Using NFSv3 is recommended for optimal performance when using VMware vSphere.

 

Why we wrote this collection and braindump of the last day:

Although I am not responsible for storage planning or installation, I frequently become involved in resolving performance issues that, based on experience with other AV products, end up being attributed to anti-virus software (one of my areas of expertise). Recently, we encountered a case where teams from multiple departments spent weeks investigating the cause of the issue. Initially, as so often, they suspected that McAfee or Trellix ENS virus protection was to blame, since these are common targets in such situations. However, in most cases the underlying cause of performance issues is much more complex and cannot be easily explained. Factors such as certificate revocation checks and the code signing cache, as well as APIs between different systems and processes that interact with one another, can all contribute to performance problems.

Modern anti-virus software like McAfee ENS is designed to be comprehensive and can sometimes cause performance issues due to its complexity, and also because it is half a client firewall these days. This was particularly evident in our case, where the impact was mainly seen on Server 2016 with Citrix Terminal Server and ERP solutions. Despite upgrading the operating system, optimizing and trimming McAfee, and thoroughly investigating every process and signature, the source of the problem remained and tickets went back to dispatch several times.

In the end we found out that during the storage migration someone forgot to optimize the parameters on an NFS V3 share on the NetApp. After creating a blank new share with NFS4, everything is fine, including performance. However, there seems to be discussion in other companies about whether to use NFS 3, 4 or 4.1 with NetApp.

For some time there seemed to be a bug when upgrading ONTAP while NFS V4.1 was in use.

 

With NFS 4.1 and the correct ONTAP version we see the following:

VEEAM full and incremental backups got faster, and the freezes in the OS are gone.

 

The tips below may help some of you. We fully understand that changing these parameters can be risky and is a balance between reliability and performance. In the past, V4.1 has led to problems, but ONLY when the storage people upgraded the backend (about once a year).

But if things are so slow that a VEEAM backup can take down such an expensive storage system, there must be something wrong at the point where things come together.

 

Main argument other people told us:

Veeam Backup for VMware can be used to backup virtual machines running on VMware vSphere using the Network File System (NFS) protocol. However, if the NFS version being used is version 3 and the storage array is a NetApp device, there is a known issue where the hosts running the virtual machines may freeze or become unresponsive during the backup process.

This can occur because NFS v3 has a number of limitations, including poor performance when dealing with large numbers of small files and poor scalability. The solution is to upgrade to NFS v4, which addresses these limitations and improves performance. Additionally, Veeam recommends checking whether there is any specific configuration on the NetApp that can help to mitigate this issue.
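As a rough illustration of the "new share with NFS 4.1" route, the sketch below only builds the esxcli command line for mounting a datastore; it does not run anything. The helper name, the IP addresses, the export path and the datastore name are made up for the example. The `esxcli storage nfs add` and `esxcli storage nfs41 add` namespaces with `-H`, `-s` and `-v` are what we know from ESXi, but please verify them with `--help` on your own ESXi build before using them.

```python
from typing import List


def build_nfs_mount_command(version: str, servers: List[str],
                            export_path: str, datastore_name: str) -> List[str]:
    """Return the esxcli argument list for mounting an NFS datastore.

    version must be "3" or "4.1". This is a hypothetical helper that only
    assembles the arguments; it never executes esxcli.
    """
    if not servers:
        raise ValueError("at least one NFS server address is required")
    if version == "3":
        # An NFS 3 datastore is mounted against a single server address.
        if len(servers) != 1:
            raise ValueError("NFS 3 datastores take exactly one server address")
        namespace = "nfs"
    elif version == "4.1":
        # NFS 4.1 accepts a comma-separated list of server addresses.
        namespace = "nfs41"
    else:
        raise ValueError(f"unsupported NFS version: {version!r}")
    return [
        "esxcli", "storage", namespace, "add",
        "-H", ",".join(servers),
        "-s", export_path,
        "-v", datastore_name,
    ]


if __name__ == "__main__":
    # Example values only (documentation IP range, invented names).
    cmd = build_nfs_mount_command("4.1", ["192.0.2.10", "192.0.2.11"],
                                  "/vol_vmware_nfs41", "ds_nfs41_new")
    print(" ".join(cmd))
```

Note that this mounts a new, separate export. Mounting the same export as NFS 3 on one host and NFS 4.1 on another is exactly what the old warning quoted further down tells you to avoid.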

 

If you want to stick with NFS3:

There are several performance settings for NFS version 3 (NFSv3) on a NetApp storage system that can be configured to optimize performance when used with VMware vSphere. Some of these settings include:

 

NFS.tcp.enable: This setting controls whether the storage system uses the TCP protocol for NFS communication. Enabling this setting can improve performance in environments with high network latency.

NFS.udp.enable: This setting controls whether the storage system uses the UDP protocol for NFS communication. Enabling this setting can improve performance in environments with low network latency.

NFS.mount.options: This setting controls the mount options used when mounting the NFS export on the VMware host. Some options that can be used to improve performance include “rsize” and “wsize”, which control the maximum read and write size of the NFS packet.

NFS.nfsvers: This setting controls the version of NFS that the storage system uses. Using NFSv3 is recommended for optimal performance when using VMware vSphere.

NFS.nolargefiles.enable: This setting controls whether or not large files are supported by the storage system. Enabling this setting can improve performance when dealing with large files.

NFS.readahead.enable: This setting controls whether or not the storage system uses readahead when reading data from an NFS export. Enabling this setting can improve performance when dealing with sequential data access.

It’s important to note that these settings and their optimal values may vary depending on the specific use case and the environment in which the storage system is being used. Therefore, it is recommended to check with the NetApp documentation or a NetApp support team for specific guidance on performance tuning for your environment.
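To make the rsize/wsize and TCP/UDP points a bit more concrete, here is a minimal sketch that reviews a mount option string. It assumes the comma-separated `key=value` format used by Linux NFS clients (for example in /proc/mounts); how you obtain that string in your environment is up to you. The function names and the 65536 threshold are our own illustration, not a NetApp or VMware recommendation.

```python
from typing import Dict, List


def parse_mount_options(options: str) -> Dict[str, str]:
    """Split "key=value,flag" mount options into a dict (flags map to "")."""
    parsed: Dict[str, str] = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def review_nfs_options(options: str, min_io_size: int = 65536) -> List[str]:
    """Return findings for one NFS mount option string.

    min_io_size is a hypothetical threshold for this example only.
    """
    opts = parse_mount_options(options)
    findings: List[str] = []
    for key in ("rsize", "wsize"):
        if key not in opts:
            findings.append(f"{key} not set explicitly")
        elif int(opts[key]) < min_io_size:
            findings.append(f"{key}={opts[key]} is below {min_io_size}")
    if opts.get("proto") == "udp":
        findings.append("transport is UDP")
    return findings


if __name__ == "__main__":
    for line in review_nfs_options("rw,vers=3,proto=tcp,rsize=32768,wsize=65536"):
        print(line)
```

An empty result only means that nothing in this small checklist stood out; it says nothing about the storage side.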

 

External Links used

Lenovo has an extremely good whitepaper:

 

https://download.lenovo.com/storage/n_f_s_in_lenovo_ontap_best_practice_and_implementation_guide.pdf

Some external Links pro/contra NFS V4.1 on Netapp

https://www.rootusers.com/performance-difference-between-nfs-versions/

https://www.reddit.com/r/vmware/comments/lys8hq/nfsv4_disruption_on_fail_over_events/

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/VMware_NFSv4.1_datastores_see_disruption_during_failover_events_for_ONTAP_9

Set up NFS 3 with ONTAP

https://docs.netapp.com/de-de/netapp-solutions/virtualization/vsphere_ontap_auto_file_nfs.html

Set up NFS 4.1 with ONTAP

https://docs.netapp.com/de-de/netapp-solutions/virtualization/vsphere_ontap_auto_file_nfs41.html#%C3%BCber-diese-aufgabe

 

VMware

 

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.resmgmt.doc/GUID-7297C302-378F-4AF2-9BD6-6EDB1E0A850A.html#:~:text=Procedure%201%20Browse%20to%20the%20cluster%20in%20the,rule%20will%20apply%20and%20click%20OK.%20Weitere%20Elemente

Veeam Proxy

https://helpcenter.veeam.com/docs/backup/vsphere/backup_proxy.html?ver=110

https://helpcenter.veeam.com/docs/backup/vsphere/backup_proxy_requirements.html?ver=110

 

Screenshot showing on which ONTAP OS the problem with NFS 4.1 is solved:

 

2019: warning from Ingram Europe after some customers lost data after mixing NFS 3 and 4 during migration. That is the problem which should be solved in 2023.

 

(Old warning, so you can see why so many did not change to NFS 4 even though they knew V3 had performance issues on NetApp)

In light of recent events, we would like to point out a current problem in connection with VMware (ESX 6.x+) and datastores using NFS v4.1.

NFS is still the first choice and offers a number of advantages over iSCSI or FC LUNs, but there is a bug that rules out the productive use of NFS v4.1 datastores.

This bug causes access from ESX hosts to NFSv4.1 datastores to be interrupted during operations that should be non-disruptive. These operations include:

ONTAP Upgrade

LIF Migration

Storage Takeover

Storage Giveback

MetroCluster Switchover

 

We recommend that customers who already use NFS v4.1 datastores in production downgrade to NFS v3. It is essential to note that NFS v3 and NFS v4 use fundamentally different locking mechanisms.

Therefore, avoid mounting one and the same datastore on different ESX hosts simultaneously as an NFS v3 and an NFS v4 datastore, as this endangers data integrity. The safe method is to mount an NFS v3 datastore in parallel to the NFS v4 datastore and move the VMs using Storage vMotion.
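The rule in this warning (never have the same datastore mounted as NFS v3 on one host and NFS v4 on another) can be checked against an inventory list. The following is a minimal sketch under our own assumptions: the `Mount` record, the host names and the way you collect the inventory are hypothetical, and the export is identified simply by server address plus path.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Mount(NamedTuple):
    """One datastore mount as seen on one ESX host (hypothetical record)."""
    host: str
    server: str
    export_path: str
    nfs_version: str


def find_mixed_version_exports(mounts: Iterable[Mount]) -> Dict[str, List[str]]:
    """Return exports that are mounted with more than one NFS version."""
    versions: Dict[str, Set[str]] = defaultdict(set)
    for m in mounts:
        versions[f"{m.server}:{m.export_path}"].add(m.nfs_version)
    return {export: sorted(v) for export, v in versions.items() if len(v) > 1}


if __name__ == "__main__":
    inventory = [
        Mount("esx01", "192.0.2.10", "/vol_a", "3"),
        Mount("esx02", "192.0.2.10", "/vol_a", "4.1"),
        Mount("esx01", "192.0.2.10", "/vol_b", "3"),
        Mount("esx02", "192.0.2.10", "/vol_b", "3"),
    ]
    for export, found in find_mixed_version_exports(inventory).items():
        print(f"WARNING: {export} is mounted as NFS {' and '.join(found)}")
```

Keep in mind that the same volume can be reachable through several LIF addresses, so a comparison by server address and path alone can miss a mixed mount. Treat a clean result as a hint, not as proof.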

ESXi6.0U3 and ESXi6.5 NFS v4.1 datastores might become inaccessible during failover

VMWare NFSv4.1 Datastores see Disruption During Failover Events Including Upgrade

https://kb.netapp.com/app/answers/answer_view/a_id/109077

 

 

 

