Troubleshooting

This page describes common issues and steps to resolve them.

  1. python3 -c 'import socket; print(socket.getfqdn())' does not show the FQDN:

    If the command above does not print the correct FQDN, your system may be affected by this Python bug. The workaround is to hardcode the correct FQDN in /etc/hosts. To do so, follow the steps in the Debian manual for modifying /etc/hosts.
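    As a rough illustration, the check above can be wrapped in a short script that flags a reported name with no domain part and prints a candidate /etc/hosts line. The test is purely syntactic. The helper names and the host name worker-node.example.com are hypothetical, and the 127.0.1.1 address is assumed from the usual Debian convention, so confirm it against the manual for your release.

```python
import socket


def looks_like_fqdn(name: str) -> bool:
    """Return True if name has a host part and at least one domain label."""
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def hosts_entry(fqdn: str, ip: str = "127.0.1.1") -> str:
    """Build an /etc/hosts line: IP address, FQDN first, then the short alias."""
    short_name = fqdn.split(".", 1)[0]
    return f"{ip}\t{fqdn}\t{short_name}"


if __name__ == "__main__":
    reported = socket.getfqdn()
    if looks_like_fqdn(reported):
        print(f"FQDN looks plausible: {reported}")
    else:
        print(f"getfqdn() returned {reported!r}, which has no domain part")
        # Placeholder host name; substitute the real FQDN of the machine.
        print("Candidate /etc/hosts line:")
        print(hosts_entry("worker-node.example.com"))
```

    Listing the FQDN before the short alias matters because the first name on the line is treated as the canonical one.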

  2. Create VFs for <pcie_addr> task fails with echo 2 > /sys/bus/pci/devices/<pcie_addr>/sriov_numvfs ... echo: I/O error:

    CRA is most easily deployed with exclusive control over a PF. Ensure that no other applications, such as DPDK or QEMU, are using the PF, and that the PF is bound to its default kernel driver, such as i40e or mlx5_core.

    1. Check whether the NIC port is in use by other applications such as QEMU or DPDK, and stop them.

    2. Check whether the NIC is bound to the vfio-pci driver. This can be done with /usr/local/bin/dpdk-devbind.py -s.

    3. Bind the NIC port to its default driver. Taking i40e as an example, run /usr/local/bin/dpdk-devbind.py -b i40e <pcie_addr>.

    4. Re-deploy the CRA solution.

    CRA can also be deployed onto specific VFs on a machine. This can be used to share underlying PFs with other applications like DPDK or QEMU. To leverage this deployment model, specify the VF PCIe addresses in the pcie_addr for the worker node. See the user guide for more information.
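    The driver check in steps 2 and 3 above can also be done without dpdk-devbind.py by reading the driver symlink that the kernel exposes under sysfs. The following is a minimal sketch, not part of the CRA tooling: the function names and the PCIe address 0000:18:00.0 are hypothetical, and the sysfs_root parameter exists only so the logic can be exercised against a test directory.

```python
import os


def bound_driver(pcie_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, pcie_addr, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or the device does not exist
    # The symlink target ends with the driver name, e.g. .../drivers/i40e
    return os.path.basename(os.readlink(link))


def needs_rebind(pcie_addr, expected_driver, sysfs_root="/sys/bus/pci/devices"):
    """True if the device is not currently bound to expected_driver."""
    return bound_driver(pcie_addr, sysfs_root) != expected_driver


if __name__ == "__main__":
    addr = "0000:18:00.0"  # placeholder PCIe address
    print(addr, "is bound to", bound_driver(addr))
    if needs_rebind(addr, "i40e"):
        print("Rebind to the default driver before re-deploying")
```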

  3. Tasks fail with the error Timeout (12s) waiting for privilege escalation prompt:

    This error occurs when the latency of task execution is high. Append the parameter -T 120 to the ansible-playbook command to raise the timeout to 120 seconds.
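    If the playbook is launched from a wrapper script, the longer timeout can be added when the command line is assembled. This is only a sketch under assumed names: playbook_command is a hypothetical helper, and inventory.ini and deploy.yml are placeholders for whatever inventory and playbook your deployment actually uses.

```python
import shlex


def playbook_command(playbook, inventory, timeout_s=120):
    """Build an ansible-playbook argument list with the -T timeout appended."""
    if timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return ["ansible-playbook", "-i", inventory, playbook, "-T", str(timeout_s)]


if __name__ == "__main__":
    # Placeholder file names; substitute the real inventory and playbook.
    cmd = playbook_command("deploy.yml", "inventory.ini")
    print(shlex.join(cmd))
    # To run it for real: subprocess.run(cmd, check=True)
```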

  4. Playbook fails due to PCIe address having active routes:

    TASK [bind_pcie_addrs : Fail if any PCIe address has active routes] ******************************************************
    fatal: [worker-node]: FAILED! => {"changed": false, "msg": "At least one PCIe address has an active route. Binding this to igb_uio may cause a loss in connectivity."}
    

    This error happens when a dataplane interface (PF/VF/ENI) on a worker node has an active route. In this step, the dataplane interface is being rebound to a different driver, which would remove any routes. Since this may result in a loss of connectivity, the solution will not automatically remove the routes. To use that dataplane interface, remove its routes and re-run the playbook.
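    To see which routes are involved before removing them, the kernel's IPv4 routing table can be inspected per interface. The sketch below parses /proc/net/route. It assumes a little-endian host (such as x86), covers IPv4 only, and uses the hypothetical function name ipv4_routes and placeholder interface name eth1. On a typical Linux system, the interface name for a kernel-bound device appears under /sys/bus/pci/devices/<pcie_addr>/net/.

```python
import os
import socket
import struct


def ipv4_routes(route_table, iface):
    """Return (destination, gateway) pairs for iface from /proc/net/route text."""
    routes = []
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3 or fields[0] != iface:
            continue
        # Addresses are hex in host byte order; "<L" assumes little-endian.
        dest, gateway = (
            socket.inet_ntoa(struct.pack("<L", int(value, 16)))
            for value in fields[1:3]
        )
        routes.append((dest, gateway))
    return routes


if __name__ == "__main__":
    iface = "eth1"  # placeholder interface name
    if os.path.exists("/proc/net/route"):
        with open("/proc/net/route") as table:
            for dest, gateway in ipv4_routes(table.read(), iface):
                print(f"{iface}: {dest} via {gateway}")
```

    An empty result suggests the interface has no IPv4 routes left. Any remaining routes can then be removed with the usual tools, for example ip route del, before re-running the playbook.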