2022-01-06 by Tao Xu
All top cloud service providers (CSPs) have announced 3rd gen Intel® Xeon® CPUs (code name Ice Lake-SP). They come with a built-in feature, Crypto-NI, which accelerates crypto workloads with new AVX512 instructions. Let’s use bcc/bpftrace to analyze its usage in TLS acceleration and see how it works.
1. Background
Crypto workloads are very CPU intensive, especially public-key RSA (Elliptic-Curve is cheaper). Hardware vendors have developed many solutions to offload or accelerate crypto tasks. For example:
- A QAT offload device, as an add-in PCIe card or integrated in the chipset.
- ARMv8 Cryptography Extensions.
- KAE (Kunpeng Accelerator Engine), a CPU on-die PCIe accelerator.
- FPGA solutions for cloud-based hardware security modules (HSM).
A typical use case is the SSL/TLS handshake in an HTTPS web server; the crypto overhead can be observed with the bcc/bpftrace tools described below. Some applications use the openssl speed command as a quick way to measure or saturate the CPU.
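For example, a single-core run reports signs and verifies per second for RSA 2048 (the taskset pinning mirrors the per-core measurements later in this post):

# taskset -c 1 openssl speed rsa2048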
2. CPU hardware
Ice Lake added new AVX512 instructions (plus the SHA extension, or SHA-NI) for crypto acceleration. Normally, application software needs changes to use the SIMD intrinsics, or it can use the multi-buffer library, which provides batch submission of multiple requests and parallel asynchronous processing based on the new instruction set.
To the best of my knowledge, the next gen CPU (code name Sapphire Rapids, or SPR) is planned with Crypto-NI to succeed Ice Lake, so we can expect this built-in crypto acceleration on future Xeon CPUs. Furthermore, SPR will provide several types of on-die accelerators, including a new QAT.
Interestingly, I also tried this on my laptop (under Ubuntu WSL) with a Tiger Lake CPU, which succeeded Ice Lake (ICL) in the consumer market, and it works the same as the server-market Ice Lake (ICX). I guess that’s because they share the same microarchitecture features (VAES, GFNI, IFMA, VPCLMULQDQ), though these are not highlighted in the product spec.
3. Software stacks
Ice Lake crypto acceleration uses the same software stack as QAT (qat_hw) and is called “multi-buffer” (qat_sw) in QAT_Engine. This extends the usage from dedicated hardware to generally available CPUs, which is great. They also share the same use cases already supported in OpenSSL (BoringSSL or BabaSSL), Nginx (Tengine), DPDK Cryptodev, k8s ingress, Istio Envoy, ZFS (QZFS), CDN, etc.
BTW, a similar optimization was proposed for the Linux kernel crypto subsystem as well. The patch has been in RFC for some time, while the kernel crypto algorithms already use AVX2. Per the official documentation, the Crypto-NI AVX512 instructions run at a light power level, which improves the downclocking problem. Anyway, our focus here is userland: CPU-intensive crypto workloads in OpenSSL and the TLS handshake.
Ice Lake CPUs are officially supported in CentOS/RHEL 8.2 and in some Linux distributions developed by CSPs. Thanks to previous offload solutions, the system default OpenSSL 1.1.1 already supports async SSL and engines without changing the OpenSSL binary. This is very helpful, as changing the system OpenSSL causes big compatibility issues.
There are three approaches for the HTTPS TLS handshake, which will be analyzed below. The general CPU path is the most common, with no special hardware help, though it still benefits from a more powerful CPU. The Ice Lake acceleration path is the approach described above using the new CPU instructions. The QAT offload path is the original TLS offload approach using dedicated hardware.
4. Scenario setup
Refer to the official Crypto-NI doc or this blog post of mine in Chinese. Basically, it sets up an HTTPS server with Nginx (or Tengine) using the Ice Lake acceleration approach described above for TLS acceleration.
Below, a bare metal instance from one CSP is used (a VM works the same, as this only relies on new CPU instructions) to test TLS over localhost with no network overhead. I also tried two machines to simulate client-server over a private network; the result is the same as localhost, except that more client machines are needed when the HTTPS server uses more CPU cores. Let’s use localhost to simplify the scenario as a clean reference.
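For reference, here is a minimal sketch of the nginx-side configuration, with directive names taken from the asynch_mode_nginx/QAT_Engine documentation (they may differ across versions, so verify against the official doc; certificate paths are placeholders):

load_module modules/ngx_ssl_engine_qat_module.so;
ssl_engine {
    use_engine qatengine;
    default_algorithms ALL;
    qat_engine {
        qat_offload_mode async;
        qat_notify_mode poll;
        qat_poll_mode heuristic;   # the polling scheme compared in section 5.b
    }
}
# plus, inside the http{} server block:
#     ssl_asynch on;              # enable async SSL processing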
5. bcc/bpftrace analysis
Use the newly added bcc/bpftrace tools to compare the approaches and verify the performance gain. The common steps are:
- Use openssl s_client to issue a single TLS handshake.
- Trace handshake functions with bcc/sslsniff and crypto functions with bpftrace/sslsnoop.
- Run a wrk stress test referencing the nginx doc: open and close a connection for each zero-byte request to measure SSL/TLS transactions per second (TPS).
- Use bpftrace/ssllatency to compare the latency distributions of the handshake and crypto functions.
A minimal sketch of the uprobe/uretprobe pattern shared by these tools follows the commands below.
# echo | openssl s_client -connect localhost:443
# taskset -c 1 wrk -t 1 -c 1000 -d 10s -H 'Connection: close' https://localhost:443/0kb.bin
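Both the verbose handshake tracing in sslsniff and the crypto tracing in sslsnoop/ssllatency build on the same bpftrace timing pattern: a uprobe records a per-thread start timestamp and a uretprobe computes the latency. A minimal sketch of the idea, assuming an unstripped nginx binary at /usr/local/nginx/sbin/nginx (the path is an assumption):

# bpftrace -e '
uprobe:/usr/local/nginx/sbin/nginx:ngx_ssl_handshake { @start[tid] = nsecs; }
uretprobe:/usr/local/nginx/sbin/nginx:ngx_ssl_handshake /@start[tid]/ {
    // per-call latency in us, aggregated into a histogram as ssllatency does
    @hist[probe] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'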
a. general CPU approach
This is the default approach and the baseline for comparison; TLS performance depends on libcrypto.so from OpenSSL. According to the nginx doc: “You can use the openssl speed rsa command to measure how many RSA 2048‑bit signs a single core can sustain per second. As a rough estimate, the number of full SSL handshakes (using ECDHE‑RSA) that a single core can sustain is between 45% and 60% of that value.”
The existing bcc/sslsniff tool can sniff the plaintext and ciphertext in SSL_read/write@libssl.so. A verbose option is added to trace the additional SSL_do_handshake() before the initial SSL_read/write, which is the crypto overhead in the TLS handshake.
Use openssl s_client to issue a single handshake to the nginx server with an RSA cipher. SSL_do_handshake() latency is 0.7ms as shown below, and it takes 1.3ms before the secure connection is ready for SSL_read/write. The TLS cipher suite can also be changed to ECDHE-RSA or ECDHE-ECDSA (an example follows the trace output below).
# ./sslsniff.py --hexdump -v
FUNC TIME(s) COMM PID LEN LATms
WRITE/SEND 0.000000000 openssl 1265677 0 N/A
HANDSHAKE 0.001296888 nginx 1261684 1 0.687
WRITE/SEND 0.001317746 openssl 1265677 0 1.318
WRITE/SEND 0.001431656 openssl 1265677 1 N/A
----- DATA -----
0a
----- END DATA -----
WRITE/SEND 0.001444130 openssl 1265677 1 0.012
READ/RECV 0.001447699 nginx 1261684 1 0.003
----- DATA -----
0a
----- END DATA -----
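To pin the single-handshake test to one suite, openssl s_client accepts a cipher selection flag, e.g.:

# echo | openssl s_client -connect localhost:443 -cipher ECDHE-RSA-AES256-GCM-SHA384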
A new tool, bpftrace/sslsnoop, is added for further investigation. Below, rsa_ossl_private_decrypt()@libcrypto.so takes 0.7ms to complete, and the overall handshake takes 1.1ms to reach NGX_OK.
# ./sslsnoop.bt
Attaching 60 probes...
Tracing SSL/TLS handshake... Hit Ctrl-C to end.
TIME(us) TID COMM PROBE LAT(us) RET
1921854 1261684 nginx ngx_ssl_handshake 84 NGX_AGAIN
1922111 1276282 openssl rsa_ossl_public_encrypt 39 256
1922801 1261684 nginx rsa_ossl_private_decrypt 619 256
1922870 1261684 nginx ngx_ssl_handshake 700 NGX_OK
^C
Set the nginx worker to use one Ice Lake core, use wrk to saturate that worker, then trace with the new tool bpftrace/ssllatency for the latency distribution.
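Worker pinning only needs the standard nginx directives, e.g. (choosing CPU 0 here, away from the wrk client pinned to CPU 1):

worker_processes 1;
worker_cpu_affinity 0001;   # bitmask: bind the single worker to CPU core 0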
Below are two runs, with and without the tracing tool. Quoting the tool’s man page: “SSL/TLS handshake usually contains network latency and the traced crypto functions are CPU intensive tasks, so call frequency should be low and the overhead of this tool is expected to be negligible.”
# taskset -c 1 wrk -t 1 -c 1000 -d 10s -H 'Connection: close' https://localhost:443/0kb.bin
Running 10s test @ https://localhost:443/0kb.bin
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 36.56ms 105.29ms 599.25ms 93.41%
Req/Sec 1.44k 779.63 8.62k 96.77%
13465 requests in 10.02s, 3.89MB read
Non-2xx or 3xx responses: 13465
Requests/sec: 1343.43
Transfer/sec: 397.52KB
# taskset -c 1 wrk -t 1 -c 1000 -d 10s -H 'Connection: close' https://localhost:443/0kb.bin
Running 10s test @ https://localhost:443/0kb.bin
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 37.29ms 107.51ms 607.23ms 92.89%
Req/Sec 1.34k 97.36 1.62k 67.03%
13257 requests in 10.03s, 3.83MB read
Non-2xx or 3xx responses: 13257
Requests/sec: 1322.02
Transfer/sec: 391.18KB
Below is the tracing output of ssllatency for the 2nd run. wrk ran 10s with 13257 requests; ssllatency traced 13289 handshake calls that returned NGX_OK and 14257 that returned NGX_AGAIN. The average latencies of rsa_ossl_private_decrypt() and successful ngx_ssl_handshake() are the same 0.7ms as in sslsnoop.
# ./ssllatency.bt
Attaching 59 probes...
Tracing SSL/TLS handshake in us... Hit Ctrl-C to end.
^C
@hist[ngx_ssl_handshake]:
[600, 800) 12870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[800, 1000) 136 | |
[1000, ...) 283 |@ |
@hist[rsa_ossl_private_decrypt]:
[600, 800) 12870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[800, 1000) 419 |@ |
@histF[ngx_ssl_handshake]:
[0, 200) 14255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 400) 0 | |
[400, 600) 0 | |
[600, 800) 0 | |
[800, 1000) 0 | |
[1000, ...) 2 | |
@stat[rsa_ossl_private_decrypt]: count 13289, average 631, total 8395839
@stat[ngx_ssl_handshake]: count 13289, average 690, total 9180173
@statF[ngx_ssl_handshake]: count 14257, average 27, total 390248
We can now confirm the libcrypto.so bottleneck: one Ice Lake core can perform at most roughly ~1400 (1s / 0.7ms) blocking rsa_ossl_private_decrypt() calls per second, and we’re getting 1322/s HTTPS RPS from wrk.
Actually, 1 thread with 10 connections in the wrk client is enough to saturate the one Ice Lake core used by the nginx worker. We use 1000 connections here to apply the same workload as in the CPU-accelerated approach.
b. Ice Lake acceleration approach
This is the CPU acceleration approach: it uses async SSL and qatengine in OpenSSL, originally designed for QAT offload, to accelerate the CPU-intensive crypto tasks in the TLS handshake. Multi-buffer (MB) style functions in libcrypto_mb.so take advantage of the new AVX512 instructions on Ice Lake.
Following the same steps as the baseline sync approach: first use bcc/sslsniff to trace a single handshake, then use bpftrace/sslsnoop to break down the latency, and finally use bpftrace/ssllatency to compare the statistical summary under the wrk benchmark.
SSL_do_handshake() now collects the NGX_OK state faster, without a blocking call into the crypto functions. But overall it takes 2.5ms before SSL_read/write, which is slower than the default 1.3ms baseline. This is somewhat expected, as the deeper async stack has more overhead than a sync call, and the performance gain normally shows up with batched async requests. That’s how async generally works: better throughput, but possibly worse latency.
# ./sslsniff.py --hexdump -v
FUNC TIME(s) COMM PID LEN LATms
WRITE/SEND 0.000000000 openssl 1396141 0 N/A
HANDSHAKE 0.002518047 nginx 1366117 1 0.084
WRITE/SEND 0.002537873 openssl 1396141 0 2.538
WRITE/SEND 0.002642700 openssl 1396141 1 N/A
----- DATA -----
0a
----- END DATA -----
WRITE/SEND 0.002653655 openssl 1396141 1 0.011
READ/RECV 0.002657928 nginx 1366117 1 0.004
----- DATA -----
0a
----- END DATA -----
Investigating further below: there’s one more NGX_AGAIN transaction than in the baseline sync approach; nginx then created another async thread (note the different TID) to process the crypto request using the MB style functions in libcrypto_mb.so, and finally collected the result.
# ./sslsnoop.bt
Attaching 60 probes...
Tracing SSL/TLS handshake... Hit Ctrl-C to end.
TIME(us) TID COMM PROBE LAT(us) RET
1522984 1366117 nginx ngx_ssl_handshake 80 NGX_AGAIN
1523280 1404482 openssl rsa_ossl_public_encrypt 46 256
1523363 1366117 nginx ngx_ssl_handshake 20 NGX_AGAIN
1534433 1366118 nginx mbx_rsa_private_crt_ssl_mb8 1004 void
1534486 1366118 nginx mbx_rsa_public_ssl_mb8 45 void
1534494 1366118 nginx process_RSA_priv_reqs 1070 void
1534595 1366117 nginx ngx_ssl_handshake 97 NGX_OK
^C
Before moving to the wrk benchmark, let’s try to optimize the single handshake case. Thread creation can be slow and non-deterministic due to the Linux scheduler. The timestamps above show 11ms between TID 1366117 and 1366118, including the 1ms MB function latency. There’s a heuristic polling scheme, also described in this QTLS paper, that can avoid the thread creation. Heuristic polling reduced the overall time from 2.5ms to 1.7ms:
# ./sslsniff.py --hexdump -v
FUNC TIME(s) COMM PID LEN LATms
WRITE/SEND 0.000000000 openssl 1423517 0 N/A
HANDSHAKE 0.001694604 nginx 1423095 1 0.072
WRITE/SEND 0.001711605 openssl 1423517 0 1.712
WRITE/SEND 0.001814036 openssl 1423517 1 N/A
----- DATA -----
0a
----- END DATA -----
WRITE/SEND 0.001824460 openssl 1423517 1 0.010
READ/RECV 0.001830783 nginx 1423095 1 0.005
----- DATA -----
0a
----- END DATA -----
We can see from the output below that the libcrypto_mb.so calculation is done in the extra NGX_AGAIN routine without involving a new thread. The overall handshake latency is reduced and more stable.
# ./sslsnoop.bt
Attaching 60 probes...
Tracing SSL/TLS handshake... Hit Ctrl-C to end.
TIME(us) TID COMM PROBE LAT(us) RET
1271355 1423095 nginx ngx_ssl_handshake 58 NGX_AGAIN
1271605 1423296 openssl rsa_ossl_public_encrypt 40 256
1272683 1423095 nginx mbx_rsa_private_crt_ssl_mb8 992 void
1272733 1423095 nginx mbx_rsa_public_ssl_mb8 46 void
1272737 1423095 nginx process_RSA_priv_reqs 1050 void
1272739 1423095 nginx multibuff_poll 1059 void
1272741 1423095 nginx ngx_ssl_handshake 1072 NGX_AGAIN
1272811 1423095 nginx ngx_ssl_handshake 68 NGX_OK
^C
If a new thread is created, it can run on another CPU core without a careful affinity setting, so using the same number of nginx worker processes may result in double the CPU consumption. Heuristic polling is preferred for better and more stable latency, and for a strict comparison under the same CPU consumption.
Run the wrk benchmark twice again to verify the tracing tool overhead. The ssllatency overhead increased a bit compared to the baseline because performance is better (higher call frequency), but it is still acceptable (5%). TLS TPS is ~4000/s per core, 3x better than the ~1300/s baseline. Latency is also better, because the same workload already overloads the baseline sync approach.
# taskset -c 1 wrk -t 1 -c 1000 -d 10s -H 'Connection: close' https://localhost:443/0kb.bin
Running 10s test @ https://localhost:443/0kb.bin
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 16.02ms 12.00ms 87.37ms 77.90%
Req/Sec 4.22k 848.83 6.12k 67.02%
41780 requests in 10.09s, 12.07MB read
Non-2xx or 3xx responses: 41780
Requests/sec: 4140.90
Transfer/sec: 1.20MB
# taskset -c 1 wrk -t 1 -c 1000 -d 10s -H 'Connection: close' https://localhost:443/0kb.bin
Running 10s test @ https://localhost:443/0kb.bin
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 13.78ms 11.57ms 90.25ms 87.56%
Req/Sec 4.02k 0.86k 8.87k 80.00%
39727 requests in 10.08s, 11.48MB read
Non-2xx or 3xx responses: 39727
Requests/sec: 3941.74
Transfer/sec: 1.14MB
Look at the tracing output below for the 2nd run, and compare with the baseline sync approach.
# ./ssllatency.bt
Attaching 59 probes...
Tracing SSL/TLS handshake in us... Hit Ctrl-C to end.
^C
@hist[process_RSA_priv_reqs]:
[800, 1000) 3587 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1000, ...) 1417 |@@@@@@@@@@@@@@@@@@@@ |
@hist[mbx_rsa_private_crt_ssl_mb8]:
[800, 1000) 4486 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1000, ...) 518 |@@@@@@ |
@hist[mbx_rsa_public_ssl_mb8]:
[0, 200) 5004 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@hist[multibuff_poll]:
[0, 200) 2165 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[200, 400) 0 | |
[400, 600) 0 | |
[600, 800) 0 | |
[800, 1000) 3422 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1000, ...) 1582 |@@@@@@@@@@@@@@@@@@@@@@@@ |
@hist[ngx_ssl_handshake]:
[0, 200) 39753 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 400) 9 | |
[400, 600) 2 | |
@histF[ngx_ssl_handshake]:
[0, 200) 75508 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 400) 42 | |
[400, 600) 32 | |
[600, 800) 8 | |
[800, 1000) 3 | |
[1000, ...) 4932 |@@@ |
@stat[mbx_rsa_public_ssl_mb8]: count 5004, average 40, total 204391
@stat[ngx_ssl_handshake]: count 39764, average 50, total 1996767
@stat[multibuff_poll]: count 7169, average 710, total 5092790
@stat[mbx_rsa_private_crt_ssl_mb8]: count 5004, average 962, total 4817354
@stat[process_RSA_priv_reqs]: count 5004, average 1013, total 5071608
@statF[ngx_ssl_handshake]: count 80525, average 79, total 6436227
wrk ran 10s with 39727 requests; ssllatency traced 39764 handshake calls that returned NGX_OK and 80525 that returned NGX_AGAIN. The NGX_AGAIN calls are doubled, as expected from the async routines. All the NGX_OK calls returned very quickly, averaging 50us latency. Among all the NGX_AGAIN calls (shown in histF above), ~5k took >1ms to complete while the others returned very quickly. Those ~5k NGX_AGAIN calls are the ones triggered by heuristic polling to process multi-buffer requests using AVX512 in one go.
The average latencies of process_RSA_priv_reqs() and the MB style RSA functions in libcrypto_mb.so are the same 1ms as in sslsnoop. But the number of calls is no longer 1-to-1 with the TLS handshake requests; it’s just 1/8 (5k/40k) of them. This is the key benefit from async and batch processing with AVX512. The buffer queue size is currently hardcoded at 8, as the function name suffix mb8 already tells us.
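As a quick sanity check of that 1/8 ratio, both sides can be counted during a run with a bpftrace one-liner (the binary and library paths are assumptions; NGX_OK is 0):

# bpftrace -e '
uretprobe:/usr/local/nginx/sbin/nginx:ngx_ssl_handshake /retval == 0/ { @handshake_ok = count(); }
uprobe:/usr/local/lib/libcrypto_mb.so:mbx_rsa_private_crt_ssl_mb8 { @mb8_batches = count(); }'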
This is a good example of bcc/bpftrace in-kernel processing, which is extremely helpful for histograms and statistical summaries. We can also use sslsnoop to trace during the wrk benchmark and see how NGX_OK results are returned in batches. Note that the event timestamps may not appear in order, as thread-local data is not synced; post-process the output if needed, as shown below. The text output will be huge, though, so it’s more convenient to just use ssllatency.
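For example, redirect the output and sort numerically on the TIME(us) column:

# ./sslsnoop.bt > snoop.txt; sort -n snoop.txt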
c. QAT offload approach
This is the original SSL offload approach from the 1st gen QAT on Skylake. I couldn’t find a machine with QAT hardware, so I’ll skip the data analysis here (see the official perf data reference).
It shares the same upper layer with the Ice Lake multi-buffer approach, so the bcc/bpftrace tools can also be useful. More overhead is expected, as it needs an in-kernel PCIe driver with extra CPU and memory resources for ioctl and polling. As dedicated hardware (not on-die until SPR), there’s a PCIe bandwidth limit, and it needs maintenance such as falling back and forth to the CPU approach. While QAT can co-exist with the built-in AVX512 acceleration, it’s maybe more useful on an IPU with a limited processor, or for CPU compression offload as in QZFS.
Thanks to previous QAT offload solutions (again), OpenSSL added support in version 1.1.0. The nginx fork Tengine has supported async SSL for a long time and has been running in many products for years. Note that the nginx we tested above is asynch_mode_nginx developed by Intel.
6. Performance results
Use the same bare metal instances as in the tracing above for multi-core data. The Ice Lake instance is 2x Xeon 8369B with 64 cores or 128 threads, and the Cascade Lake instance is 2x Xeon 8269CY with 52 cores or 104 threads. As the crypto workload is CPU intensive, using HyperThreading (HT) or not makes no difference for raw crypto performance.
a. OpenSSL
Compare raw crypto algorithm performance using the openssl speed command.
# openssl speed -multi [1-N] -evp [aes-256-gcm]
# openssl speed -multi [1-N] -engine qatengine -evp [aes-256-gcm]
# openssl speed -multi [1-N] [rsa2048|ecdhx25519|ecdsap256]
# openssl speed -multi [1-N] -engine qatengine -async_jobs 8 [rsa2048|ecdhx25519|ecdsap256]
Below is a raw crypto performance comparison. Take AES-256 as an example: without acceleration, Ice Lake is 50% better per core than Cascade Lake, and 70% better system-wide as it has more cores. Accelerated Ice Lake is over 3x better (3.4x per core to 3.8x system-wide) with software-hardware collaboration.
For public-key RSA and EC, Ice Lake without acceleration is almost the same as Cascade Lake. With Ice Lake crypto acceleration, RSA and EC can be 2x~5x better. We can also observe the RSA sign/s bottleneck: just switching to ECDSA can be more than 10x better without any hardware change (example below).
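For instance, generating a P-256 ECDSA key and a self-signed certificate for the test server only takes standard openssl commands (file names are placeholders):

# openssl ecparam -genkey -name prime256v1 -out ecdsa.key
# openssl req -new -x509 -key ecdsa.key -out ecdsa.crt -days 365 -subj "/CN=localhost"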
b. Nginx/Tengine
Below is the comparison for a typical user application TLS handshake: Nginx (asynch_mode_nginx v0.4.6) on the left, Tengine (v2.3.3) on the right; the results are close. As mentioned earlier, we simplify the scenario by using localhost, binding nginx/tengine to one CPU socket, and using the other socket for the wrk client. The TLS cipher suites are AES256-GCM-SHA384, ECDHE-RSA-AES256-GCM-SHA384, and ECDHE-ECDSA-AES256-GCM-SHA384.
For RSA-based TLS, Ice Lake without acceleration is almost the same as Cascade Lake, with only a ~10% per-core improvement, while accelerated Ice Lake gets 3x better. This makes sense, as RSA is generally the bottleneck. The nginx doc mentions a 45%~60% correlation ratio between raw openssl RSA speed and TLS handshake performance, based on empirical data from older CPUs, and the bcc/bpftrace tools showed libcrypto.so RSA taking 0.7ms of the total 1.1ms handshake time. By accelerating RSA asynchronously, per-core TLS performance improved 3x.
For ECDSA-based TLS, the ECDSA algorithm alone is a huge optimization over RSA, more than 10x better as measured in the openssl raw speed test above; the crypto task is then no longer a bottleneck, and the rest of the TLS handshake becomes more significant. The Ice Lake acceleration gain is now less obvious (~30%), which is also consistent with the official QAT perf data reference. Even the wrk client becomes insufficient when the HTTPS server uses more CPU cores. The overall result is reasonable.
7. Summary
Crypto workloads are CPU intensive, especially RSA, and generally a bottleneck in HTTPS servers. Ice Lake added new AVX512 instructions for crypto acceleration and shares the same software stack (and use cases) as previous offload solutions built on dedicated hardware.
Using the new bcc/bpftrace tools, the async SSL software stack was analyzed with latency breakdowns and statistical summaries. RSA-based TLS is verified to be 3x better per core with Ice Lake crypto acceleration on an HTTPS server, while ECDSA-based TLS shows a less obvious gain at ~30%, as the crypto task is no longer the bottleneck.