February 3, 2025

Intel® Converged Edge Media Platform & Varnish Enterprise

About the Project

This article explores how Intel®'s Converged Edge Media Platform (Reference Architecture), combined with Varnish Enterprise and Kubernetes, leverages Intel® hardware optimizations to maximize efficiency and performance. Specifically, we demonstrate how using SR-IOV can achieve identical high-performance results with reduced CPU usage. Through performance testing, we compare baseline and optimized setups to highlight the tangible benefits of Intel® hardware and Kubernetes optimizations.

Intel®’s Converged Edge Media Platform Architecture is a media-focused implementation of Intel®’s Edge Platform. It provides container-based, cloud-native capabilities that enable Communication Service Providers (CoSPs) to quickly, efficiently, and cost-effectively deploy multiple media services. This architecture capitalizes on the fast-growing edge computing market by offering a platform tailored to media workloads. Built on Kubernetes (K8s) and leveraging Intel® hardware, it is optimized for the network edge and capable of hosting multiple services.

Kubernetes Cluster Configuration

The Kubernetes cluster is enhanced with the following key features:

  1. Device Manager: Enables direct access to Intel® GPUs from Pods.
  2. SR-IOV Plugin: Enhances network performance by enabling virtual NICs on a single physical NIC; a sketch of a Pod requesting these resources follows below.
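
To make this concrete, here is a minimal sketch of a Pod that requests both resource types. The Pod name, image, and network attachment name are hypothetical; gpu.intel.com/i915 is the resource name exposed by the Intel® GPU device plugin, and openshift.io/cempsriov is the SR-IOV resource name used later in this article.

  # Sketch only: Pod name, image, and network name are illustrative.
  apiVersion: v1
  kind: Pod
  metadata:
    name: media-worker
    annotations:
      # Attaches the SR-IOV VF as a secondary network (assumed name, defined below)
      k8s.v1.cni.cncf.io/networks: sriov-net
  spec:
    containers:
    - name: transcoder
      image: example.com/media-transcoder:1.0
      resources:
        limits:
          gpu.intel.com/i915: 1        # Intel GPU exposed by the device plugin
          openshift.io/cempsriov: 1    # SR-IOV virtual function (resource name from this article)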

What is SR-IOV?

SR-IOV (Single Root I/O Virtualization) is a technology that enhances network performance in virtualized environments. It allows a single physical NIC to appear as multiple virtual NICs, giving virtual machines or containers direct access to NIC hardware. This reduces latency and increases throughput, making it ideal for high-performance workloads.
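
On Kubernetes, an SR-IOV virtual function is typically attached to a Pod as a secondary network through Multus. Below is a minimal, illustrative NetworkAttachmentDefinition; the name and IPAM settings are assumptions (the subnet simply matches the 192.168.100.0/24 addresses that appear in the test commands later), not Intel®'s exact configuration.

  apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    name: sriov-net   # assumed name, matching the Pod annotation above
    annotations:
      # Ties this network to the SR-IOV resource pool used later in this article
      k8s.v1.cni.cncf.io/resourceName: openshift.io/cempsriov
  spec:
    config: |
      {
        "cniVersion": "0.3.1",
        "type": "sriov",
        "ipam": {
          "type": "host-local",
          "subnet": "192.168.100.0/24"
        }
      }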

Cluster Specifications

  • Kubernetes Cluster
    • cluster1-master
      • CPU: Intel® Xeon® Platinum 8358 (32c/64t)
      • NIC: Intel® Ethernet E810-C QSFP (100Gbps)
      • SSD: Kingston 240GB SSD
    • cluster1-node
      • CPU: Intel® Xeon® Platinum 8380 (40c/80t)
      • NIC: Intel® Ethernet E810-C QSFP (100Gbps)
      • SSD: Kingston 240GB SSD
      • SSD: 3x Intel® NVMe 1.8TB
    • cluster1-gpu
      • CPU: Intel® Xeon® Gold 6346 (16c/32t)
      • NIC: Intel® Ethernet E810-C QSFP (100Gbps)
      • SSD: 4x Intel® NVMe 1.8TB
      • GPU: 2x Intel® Data Center GPU Flex 170

[Diagram 1: Kubernetes clusters]

In a Kubernetes (K8s) cluster, master nodes and worker nodes serve distinct roles, each critical to the operation of the cluster.

Master Node: Responsible for managing the Kubernetes cluster and orchestrating all activities across the worker nodes.

Worker Node: Executes workloads (pods) assigned by the master node.
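
As a small illustration (the Pod name and image are hypothetical), a workload can be pinned to the worker node from the specification above with a nodeSelector; the control plane on cluster1-master then schedules it onto cluster1-node:

  apiVersion: v1
  kind: Pod
  metadata:
    name: pinned-example
  spec:
    nodeSelector:
      kubernetes.io/hostname: cluster1-node   # worker node from the cluster specification
    containers:
    - name: app
      image: nginx:stable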

The Tests

Test Setup

The testing process is divided into two phases:

  1. Baseline Phase: Tests without acceleration to establish baseline performance metrics.
  2. Optimized Phase: Tests with SR-IOV and other optimizations to measure performance improvements.

Key elements shared across both phases:

  • Varnish Configuration: Origin servers in the cluster are configured for performance testing.
  • Load Testing: Conducted using wrk from a high-performance node provided by Intel®.

Phase 1: Baseline

  • Varnish Enterprise is deployed on the cluster using a Helm chart without any acceleration.

Phase 2: Optimized

  • Intel® configures SR-IOV to provide the Varnish Pod with direct NIC access.
  • NIC affinity is set to align with the NUMA node closest to the NIC for optimal performance, via Intel edge media plugins; a generic Kubernetes equivalent is sketched after this list.
  • Varnish Enterprise is redeployed using a Helm chart, now with SR-IOV annotations.
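
We don't have the exact configuration of Intel's edge media plugins; on plain Kubernetes, the closest standard mechanism for this kind of NUMA alignment is the kubelet's CPU Manager and Topology Manager. A minimal sketch, offered as an assumption rather than Intel's actual setup:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  cpuManagerPolicy: static                  # give Guaranteed pods exclusive CPU cores
  topologyManagerPolicy: single-numa-node   # co-locate those cores with devices such as the SR-IOV VF

With single-numa-node, the kubelet only admits a pod if its CPUs and requested devices can be placed on the same NUMA node.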

Origin Setup

For the tests, the prng VMOD is used to generate 1MB synthetic responses. The backend default none declaration means no real origin is contacted; instead, every backend fetch is pointed at a synthetic backend that returns 1MB of random data, so the test measures the Varnish delivery path rather than origin performance. Here is the VCL configuration:

  vcl 4.1;

  import prng;
  import utils;

  backend default none;

  sub vcl_backend_fetch {
      set bereq.backend = prng.fast_random_backend(1024 * 1024);
  }

Test Plan

Load generation is performed using wrk to simulate traffic and measure performance with and without SR-IOV. We limited the number of CPUs available to Varnish in the values.yaml file to see how few CPUs are needed to saturate the NIC.

The values.yaml file in a Helm chart is a YAML file that contains default configuration values for the chart. It allows you to customize the behavior and resources of a Kubernetes application when deploying it with Helm. These values can be overridden by providing your own values.yaml file or by using --set flags during the helm install or helm upgrade commands.
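
As a generic illustration of that override mechanism (the file name is hypothetical; the resources block mirrors the one we set below), a small override file capping Varnish at 4 CPUs could be applied with helm upgrade -f my-values.yaml:

  # my-values.yaml (hypothetical override file)
  resources:
    limits:
      cpu: "4"   # cap the Varnish container at 4 CPUs, as in the test runs below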

To learn more about the Varnish Helm chart, refer to the Varnish documentation.

In the values.yaml, we added:

  resources:
    requests:
      openshift.io/cempsriov: 1   # SR-IOV VF; for extended resources, requests must equal limits
    limits:
      openshift.io/cempsriov: 1
      cpu: "8"                    # CPU cap for the Varnish container

Test Results Summary

Phase 1: Without SR-IOV

Using 8 CPUs, the NIC throughput reached its maximum capacity (100Gbps).

  wrk -t6 -c65 -d100 -s close_connection.lua http://192.168.100.2:31042
  Running 2m test @ http://IP:31042
    6 threads and 65 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     5.14ms    4.40ms  211.38ms   88.80%
      Req/Sec     1.76k    192.59     2.34k    72.49%
    1049314 requests in 1.67m, 1.00TB read
  Requests/sec:  10485.96
  Transfer/sec:     10.24GB

To ensure a fair comparison between the two phases, we are now limiting the setup to 4 CPUs.

  Threads | Connections | Req/sec | Transfer/sec (GB/s) | 90% Latency (ms) | 99% Latency (ms) | Hit Rate | Errors (Timeout)
  ------- | ----------- | ------- | ------------------- | ---------------- | ---------------- | -------- | ----------------
  4       | 20          | 5849.92 | 5.71                | 11.42            | 23.57            | 100%     | 0
  8       | 40          | 5393.02 | 5.27                | 40.76            | 53.63            | 100%     | 0
  16      | 60          | 4692.99 | 4.69                | 53.30            | 68.73            | 100%     | 0
  32      | 120         | 3416.72 | 3.34                | 84.26            | 107.04           | 100%     | 0
  64      | 240         | 1593.56 | 1.56                | 290.28           | 485.39           | 100%     | 0
  128     | 480         | 1716.83 | 1.68                | 514.99           | 929.11           | 100%     | 0

We observe a decline in requests per second and a rise in latency as CPU congestion builds, leading to client performance degradation. 

Phase 2: With SR-IOV

With SR-IOV enabled, the same NIC throughput of 10.94GB/s and a request rate of 11195.59 req/sec are achieved, comparable to the Phase 1 results (10.24GB/s and 10485.96 req/sec) but with significantly fewer CPUs (5 versus 8). SR-IOV achieves the same throughput with less CPU.

  wrk -t17 -c65 -d100 --latency -s close_connection.lua http://192.168.100.61:6081
  Running 2m test @ http://IP:6081
    17 threads and 65 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     4.69ms    3.91ms  215.24ms   89.11%
      Req/Sec   662.08     155.84     1.44k    69.32%
    1120681 requests in 1.67m, 1.07TB read
  Requests/sec:  11195.59
  Transfer/sec:     10.94GB

Limiting the setup to 4 CPUs:

  Threads | Connections | Req/sec | Transfer/sec (GB/s) | 90% Latency (ms) | 99% Latency (ms) | Hit Rate | Errors (Timeout)
  ------- | ----------- | ------- | ------------------- | ---------------- | ---------------- | -------- | ----------------
  4       | 20          | 7566.83 | 7.39                | 3.99             | 8.31             | 100%     | 0
  8       | 40          | 8232.02 | 8.04                | 21.25            | 33.79            | 100%     | 0
  16      | 60          | 8386.16 | 8.19                | 22.39            | 36.26            | 100%     | 0
  32      | 120         | 7700.70 | 7.52                | 40.69            | 62.71            | 100%     | 0
  64      | 240         | 6393.45 | 6.25                | 73.59            | 127.94           | 100%     | 0
  128     | 480         | 5557.95 | 5.43                | 124.02           | 265.08           | 100%     | 0

Observations and Comparison

  • Performance Efficiency: With SR-IOV, all metrics (requests per second, transfer rate, and latency) improve significantly across all configurations.
  • Latency Reduction: 90% and 99% latency metrics are consistently lower in the SR-IOV setup.
  • Throughput Improvement: Transfer rates increased by roughly 30% at the lowest concurrency level and by 3x to 4x at the highest concurrency levels, compared to the baseline.

Conclusion

The results underscore the substantial performance gains achieved by integrating SR-IOV and Intel® hardware optimizations into the Kubernetes infrastructure. This approach not only boosts throughput but also significantly reduces CPU usage, making it an optimal solution for edge computing workloads. It is particularly suited for use cases that demand low latency or involve processing large datasets that are unsuitable for public cloud environments. When tailored for the network edge, these use cases are often video-related. With Varnish, the CPU savings can be repurposed to further enhance edge computing capabilities.


Authors

Gang Shen, Software Architect, Intel
Brandon Gavino, Product Solution Architect, Intel
Arindam Saha, Product Owner, Intel