Solution Design

Introduction

Welcome to the CNF reference architecture solution design documentation. To get the most out of this document, familiarity with Kubernetes and network virtualization concepts (e.g., CNI, SR-IOV, VFs) is recommended.

This document details the design aspects that tailor the cloud-native environment to networking workloads. These fall into two main areas:

  • Dedicating isolated CPUs to networking workload Pods

  • Enabling the Kubernetes control plane to manage dataplane interfaces, such as NIC virtual functions (VFs), NIC physical functions (PFs), or Elastic Network Interfaces (ENIs).

In the cloud-native paradigm, Pods declare the resources they need, and Kubernetes provides them. This means the Kubernetes cluster has to be configured to understand and provision the dataplane interfaces Pods will use. In turn, Pods must explicitly request those resources (i.e., dedicated CPUs and dataplane interfaces).

Finally, this document discusses how the local Docker registry is created and used.

Cluster Configuration

Dedicated Isolated CPUs

Networking workloads expect to run on dedicated CPUs that are isolated from the Linux kernel scheduler. Isolation is provided by the isolcpus kernel command line parameter.

First, Kubernetes has to be configured to provide Pods with dedicated CPUs. For this, the kubelet on each worker has to be configured to use the static CPU manager policy. This policy manages a pool of CPUs that Pods can share. When a Pod is granted a dedicated CPU, that CPU is removed from this pool and placed in that Pod’s cpuset, so only that one Pod runs on the CPU.

Once Pods are able to utilize dedicated CPUs, Kubernetes has to be configured to use only isolated CPUs for Pods. This is achieved by setting the reservedSystemCPUs field of the kubelet configuration file. This field defines which CPUs are reserved for host-level system threads and Kubernetes-related threads. If it is set to all non-isolated CPUs, then those threads run only on non-isolated CPUs, leaving the isolated CPUs for Pods.

Combining these enables Pods to obtain exclusive, dedicated CPUs isolated from the rest of the system.
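
For illustration, the following is a minimal KubeletConfiguration sketch combining both settings. It assumes a 4-CPU worker booted with isolcpus=2,3; the CPU numbering is an assumption for this example and must match the actual isolated CPU list of each worker.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# "static" allows Guaranteed Pods to be granted exclusive CPUs.
cpuManagerPolicy: static
# Reserving the non-isolated CPUs (0-1) keeps host-level and Kubernetes
# threads off the isolated CPUs (2-3), leaving them for Pods.
reservedSystemCPUs: "0,1"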

Enabling Kubernetes to Manage Dataplane Interfaces

Kubernetes manages devices via Device Plugins. They allow the Kubernetes control plane to learn which devices are available on which nodes and to assign them to specific Pods.

This solution utilizes the SR-IOV Network Device Plugin to make PFs/VFs allocatable to Pods. ENIs are treated exactly like PFs. The SR-IOV Network Device Plugin runs on every worker node to discover all available PFs/VFs. Based upon the configured selectors, PFs/VFs are grouped under resource names. This solution groups all PFs/VFs intended for its use under the arm.com/dpdk resource name.
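
For illustration, a sketch of the SR-IOV Network Device Plugin configuration that would produce this grouping is shown below. The ConfigMap name, namespace, and the vfio-pci driver selector are assumptions for this example; the actual selectors depend on the NICs and drivers present in the deployment.

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourcePrefix": "arm.com",
          "resourceName": "dpdk",
          "selectors": {
            "drivers": ["vfio-pci"]
          }
        }
      ]
    }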

Once the PFs/VFs are grouped together on each node, the SR-IOV Network Device Plugin advertises them to the Kubernetes control plane through each node’s kubelet. The control plane then knows which PFs/VFs are available on which nodes, and can dedicate specific PCIe addresses to specific Pods.

Enabling Dataplane Interfaces to be Added to Pods

The SR-IOV Network Device Plugin by itself is not sufficient to give Pods access to PFs/VFs/ENIs. The SR-IOV CNI must be used to add VFs as separate network interfaces to Pods, and the host-device CNI must be used for PFs and ENIs.

The SR-IOV and host-device CNIs are unable to provide a default Pod network, and thus rely upon a meta CNI such as Multus CNI. Multus CNI enables creation of “multi-homed” Pods, or Pods with multiple network interfaces. This lets standard CNIs (e.g., Calico, Cilium, Flannel) provide the Kubernetes-aware networking (Pod to Pod, Pod to Service, etc.) while the SR-IOV/host-device CNI focuses on adding the respective dataplane interface as another network interface to the Pod.

To summarize, the SR-IOV Network Device Plugin enables Kubernetes to understand and allocate dataplane interfaces to Pods. The SR-IOV CNI is needed to add the VF as a secondary network interface to a Pod, and the host-device CNI is needed to add a PF or ENI as a secondary network interface. To enable a multi-homed Pod to use these CNIs, a meta CNI like Multus is needed.

Multus CNI is configured using NetworkAttachmentDefinitions. The name of a NetworkAttachmentDefinition is used by Pods to request additional network interfaces. A single NetworkAttachmentDefinition invokes a single CNI to provide an additional interface, and a Pod can ask Multus to invoke any number of CNIs any number of times for any number of interfaces.

This solution needs Multus CNI to add additional interfaces for VFs using the SR-IOV CNI and for PFs/ENIs using the host-device CNI. To accomplish this, the k8s.v1.cni.cncf.io/resourceName annotation is added to the NetworkAttachmentDefinition metadata. This is needed so Multus will provide the SR-IOV/host-device CNI with the necessary device information. This solution names the NetworkAttachmentDefinitions sriov-dpdk for VFs and pf-dpdk for PFs/ENIs.
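
As an illustrative sketch, the VF NetworkAttachmentDefinition could look like the following. The CNI config body is kept minimal and the cniVersion value is an assumption; the pf-dpdk definition follows the same pattern with the CNI type set to host-device.

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-dpdk
  annotations:
    # Ties this network to the device plugin resource so Multus can pass the
    # allocated device information to the SR-IOV CNI.
    k8s.v1.cni.cncf.io/resourceName: arm.com/dpdk
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "name": "sriov-dpdk"
    }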

How Pods Can Utilize These Resources

Dedicated and Exclusive CPUs

Pods must have the Guaranteed Quality of Service class to be allocated exclusive, dedicated CPUs. This means:

  • Every container in the Pod must have a memory limit and a memory request

  • For every container in the Pod, the memory limit must equal the memory request

  • The same must hold for each container’s CPU limit and request

Dataplane Interface Allocation and Use

Pods first have to declare they need a dataplane interface. This is done in the same requests and limits fields as the CPU and memory resources. Since this solution puts every VF/PF/ENI into the arm.com/dpdk resource name, requesting one resource is done by putting arm.com/dpdk: 1 into both limits and requests. For example, the following snippet is used in the DPDK sample application deployment:

resources:
  # Limits and requests must be equal for cpu and memory for the container to be pinned to CPUs.
  limits:
    hugepages-2Mi: 1Gi
    cpu: 2
    memory: 2Gi
    arm.com/dpdk: 1
  requests:
    cpu: 2
    memory: 2Gi
    arm.com/dpdk: 1

Inside the Pod, an environment variable is set to inform the Pod which resource it has been allocated. In the case of this solution, the arm.com/dpdk resource name means the environment variable is called PCIDEVICE_ARM_COM_DPDK. See examples/dpdk-testpmd/dpdk-launch.sh for examples using this variable.
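
As a minimal sketch, assuming a single requested device and a hypothetical container name and image, a container could print the PCIe address it was allocated like this:

containers:
  - name: dpdk-app
    image: example.com/dpdk-sample:latest
    # PCIDEVICE_ARM_COM_DPDK holds the allocated PCIe address (a comma-separated
    # list if more than one arm.com/dpdk resource was requested).
    command: ["sh", "-c", "echo Allocated device: ${PCIDEVICE_ARM_COM_DPDK}"]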

In addition to requesting the dataplane interface, the Pod must have Multus add the VF via the SR-IOV CNI or the PF/ENI via the host-device CNI. This solution has configured the SR-IOV CNI to be invoked with the Multus network name sriov-dpdk and the host-device CNI to be invoked with the Multus network name pf-dpdk. So, the Pod must add the k8s.v1.cni.cncf.io/networks: sriov-dpdk or k8s.v1.cni.cncf.io/networks: pf-dpdk annotation to its metadata.
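
For example, a Pod requesting a VF attachment carries the following in its metadata; the PF/ENI case substitutes pf-dpdk:

metadata:
  annotations:
    # Ask Multus to invoke the SR-IOV CNI configured by the sriov-dpdk
    # NetworkAttachmentDefinition, adding the VF as a secondary interface.
    k8s.v1.cni.cncf.io/networks: sriov-dpdk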

Local Docker Registry

The controller node sets up a Docker registry for this solution to use. This registry holds the container images for the AArch64 build of the SR-IOV CNI and the sample application.

To set up the registry and make it usable by the other nodes in the cluster, the following steps are followed:

  1. Create a self-signed certificate for the FQDN of the controller node. For EC2 deployments, the controller’s primary private IP address is used instead.

  2. Trust the certificate on every node in the cluster

  3. Launch the registry using the self-signed certificate

  4. Nodes interact with the registry using the controller’s FQDN/EC2 private IP.

If additional worker nodes are added to the cluster at a later time, they will need to trust the self-signed certificate. Otherwise, the additional nodes cannot pull the SR-IOV CNI or sample application images from the registry.