Discussion: UDP protocol performance optimization #194

Closed
zonyitoo opened this issue Mar 8, 2022 · 76 comments

Comments

@zonyitoo
Contributor

zonyitoo commented Mar 8, 2022

Motivation

QUIC, a UDP-based protocol, is becoming widely used on the Internet, so UDP packet relay performance deserves more attention in shadowsocks.

As we all know, the shadowsocks UDP protocol (AEAD) creates a packet in the following steps:

  1. Generate a random IV or salt
  2. Derive a key with HKDF-SHA1
  3. Encrypt the whole data payload with the chosen AEAD cipher

Recently I ran some benchmarks to measure the cost of each step. All of these tests are written in Rust.

1. Generate a random IV or salt

#[bench]
fn bench_random_32b_iv(b: &mut Bencher) {
    let mut iv = [0u8; 32];
    let mut rng = rand::thread_rng();
    b.iter(|| {
        rng.fill(&mut iv);
    });
}

#[bench]
fn bench_random_32b_iv_shadowsocks(b: &mut Bencher) {
    let mut iv = [0u8; 32];
    b.iter(|| {
        random_iv_or_salt(&mut iv);
    });
}

The test result is:

test bench_random_32b_iv             ... bench:          17 ns/iter (+/- 0)
test bench_random_32b_iv_shadowsocks ... bench:          21 ns/iter (+/- 1)

The random_iv_or_salt takes 4 ns more because it needs to verify the generated iv.

2. Derive a key with HKDF-SHA1

#[bench]
fn bench_hkdf_32b_key(b: &mut Bencher) {
    const SUBKEY_INFO: &'static [u8] = b"ss-subkey";
    const KEY: &[u8] = b"12345678901234567890123456789012";

    let mut iv = [0u8; 32];
    let mut key = [0u8; 32];
    let mut rng = rand::thread_rng();
    b.iter(|| {
        rng.fill(&mut iv);

        let hk = Hkdf::<Sha1>::new(Some(&iv), KEY);
        hk.expand(SUBKEY_INFO, &mut key).expect("hkdf-sha1");
    });
}

The test result is:

test bench_hkdf_32b_key ... bench:       1,117 ns/iter (+/- 83)

Compared with the result of step 1, we can see that the HKDF-SHA1 algorithm takes most of the time.

3. Encrypt the whole data payload with the chosen AEAD cipher

#[bench]
fn bench_udp_packet_1500(b: &mut Bencher) {
    const KEY: &[u8] = b"12345678901234567890123456789012";

    let mut iv = [0u8; 32];

    let mut input_packet = [0u8; 1500];
    rand::thread_rng().fill_bytes(&mut input_packet);

    b.iter(|| {
        random_iv_or_salt(&mut iv);
        let mut cipher = Cipher::new(CipherKind::AES_256_GCM, KEY, &iv);
        cipher.encrypt_packet(&mut input_packet);
    });
}

The Cipher that I am using here is from shadowsocks-crypto, which is the actual library used in shadowsocks-rust. The test result is:

test bench_udp_packet_1500 ... bench:       2,218 ns/iter (+/- 328)

Analysis

Please ignore the absolute numbers in each test. Comparing the results of 3. and 2., we can easily draw a conclusion: the key derivation process takes roughly 50% of the time spent creating a UDP packet. So if we can optimize this process, we may get at most a 50% performance improvement!
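
As a rough worked example of that comparison (a sketch that reuses only the two benchmark figures quoted above; the function names are made up for illustration), the share of key derivation and the best-case gain from removing it entirely could be computed like this:

```rust
/// Fraction of the per-packet cost that is spent in key derivation.
fn kdf_share(kdf_ns: f64, total_ns: f64) -> f64 {
    kdf_ns / total_ns
}

/// Upper bound on the throughput speedup if key derivation cost dropped to zero.
fn max_speedup(kdf_ns: f64, total_ns: f64) -> f64 {
    total_ns / (total_ns - kdf_ns)
}
```

With 1,117 ns for the HKDF-SHA1 benchmark and 2,218 ns for the full packet, `kdf_share` is about 0.50 and `max_speedup` is about 2.0. In other words, per-packet time could shrink by at most about half. Note that the 1,117 ns figure also includes the salt generation, so this is only an estimate.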

There should be no way to optimize it without changing the protocol. Here are some possible designs:

  1. UDP Session. A client should first handshake with a server to create a session with a new derived key, and then encrypt packets with that derived key.
  2. ... (discussion)
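
The session idea in design 1 might look roughly like the sketch below. It is only an illustration: `SessionKeys`, the 64-bit session ID and the `derive` closure are hypothetical names, and the closure stands in for whatever handshake or KDF a real protocol would choose. The point is that derivation runs once per session instead of once per packet.

```rust
use std::collections::HashMap;

/// Caches one derived key per UDP session (hypothetical type).
pub struct SessionKeys {
    keys: HashMap<u64, [u8; 32]>,
}

impl SessionKeys {
    pub fn new() -> Self {
        SessionKeys { keys: HashMap::new() }
    }

    /// Returns the key for `session_id`, calling `derive` only the first
    /// time this session is seen. Later packets reuse the cached key.
    pub fn get_or_derive<F>(&mut self, session_id: u64, derive: F) -> [u8; 32]
    where
        F: FnOnce() -> [u8; 32],
    {
        *self.keys.entry(session_id).or_insert_with(derive)
    }
}
```

Every packet after the first then pays only for a hash map lookup plus the AEAD operation itself.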
@madeye
Contributor

madeye commented Mar 8, 2022

Generate derived keys in background, and save them into a cache for future usage?

@dev4u

dev4u commented Mar 9, 2022

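I am also looking for a way to implement a ping-pong exchange between the client and the server, so that a temporary session key can be passed between them.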

@zonyitoo
Contributor Author

zonyitoo commented Mar 9, 2022

Generate derived keys in background, and save them into a cache for future usage?

Well, then we may have to add synchronization between the derive key task and UDP association tasks. The lock may become a new bottleneck.

Another solution may be using a faster KDF, for example, HMAC-Blake3:

#[bench]
fn bench_blake3(b: &mut Bencher) {
    const SUBKEY_INFO: &[u8] = b"ss-subkey";
    const KEY: &[u8] = b"12345678901234567890123456789012";

    let mut iv = [0u8; 32];
    let mut key = [0u8; 32];
    let mut rng = rand::thread_rng();
    b.iter(|| {
        rng.fill(&mut iv);

        let hk = SimpleHkdf::<blake3::Hasher>::new(Some(&iv), KEY);
        hk.expand(SUBKEY_INFO, &mut key).expect("hkdf-blake3");
    });
}

The result shows:

test bench_blake3       ... bench:         915 ns/iter (+/- 79)
test bench_hkdf_32b_key ... bench:       1,115 ns/iter (+/- 56)

The HMAC algorithm should be the key for optimization.

@zonyitoo
Contributor Author

zonyitoo commented Mar 10, 2022

Generate derived keys in background, and save them into a cache for future usage?

There are several obvious flaws in this method:

  1. Since the bottleneck is CPU consumption, pre-generating IVs and keys in the background won't save any CPU time, so the overall performance doesn't change.
  2. Key derivation still has to be done synchronously when receiving packets from the remote side (in both sslocal and ssserver).

So I don't think this will solve this problem effectively.

@dev4u

dev4u commented Mar 10, 2022

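My opinion is not very professional, so please bear with me.
I have already combined ss rust with the Google 2FA algorithm to implement a time-dependent "random session key".
The advantage of Google 2FA is that the same time factor and secret produce the same random number, and the algorithm is simple. So once both sides have agreed on a secret, no key needs to be transmitted when a session is established; each side computes the same temporary key on its own...

But this approach has a problem: the time factor. Every device has its own clock. Although I use NTP to keep the server's time as closely synchronized as possible (within ±2 seconds of the phone's time), near a boundary the clock error can still cause the two sides to generate different keys.

I have tried some optimizations, but the result is still not ideal. The main problem shows up when a person is moving at high speed: the phone switches base stations frequently, and connections that were already established are not closed, so traffic keeps using the old key, the server fails to decrypt it, and the quality of service suffers. Of course, with UDP the impact would be much smaller.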
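
To illustrate the boundary problem described above, here is a minimal sketch of a TOTP-style time step (the counter is the Unix time divided by the step length). The function names and the 30-second period are assumptions for illustration, not taken from the implementation mentioned in this comment. Trying the neighbouring steps is one common way to tolerate small clock skew; it does not address long-lived connections that keep an old key.

```rust
/// Time-step counter in the style of TOTP: floor(unix_secs / period_secs).
fn time_step(unix_secs: u64, period_secs: u64) -> u64 {
    unix_secs / period_secs
}

/// Steps a receiver could try (previous, current, next) so that a small
/// clock skew near a step boundary still matches the sender's step.
fn candidate_steps(unix_secs: u64, period_secs: u64) -> [u64; 3] {
    let t = time_step(unix_secs, period_secs);
    [t.saturating_sub(1), t, t + 1]
}
```

For example, with a 30-second period a sender whose clock reads 59 s and a receiver whose clock reads 61 s land in different steps, even though they are only 2 seconds apart; the receiver's candidate list still contains the sender's step.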

@dev4u

dev4u commented Mar 10, 2022

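What I am working on now is how to securely transmit the time factor used to generate the key. The tentative solution is to append it directly to the end of the ciphertext, but since that changes the packet format, I am not inclined toward this approach.
The timing of the transmission also needs to be random, so that it does not create a recognizable pattern.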

@madeye
Contributor

madeye commented Mar 12, 2022

I did a quick calculation based on your benchmark:

10^6us / 2.2us * 1500byte = 681 MB/s bandwidth per CPU core

So, it looks to me that, for most of our users, the bottleneck is their internet speed...
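
The same calculation as a small sketch (the function name is made up; the 2.2 µs and 1500-byte inputs are the figures used in this comment):

```rust
/// Bytes per second one core can process, given the cost of one packet.
fn bytes_per_sec(ns_per_packet: f64, packet_bytes: f64) -> f64 {
    // 1e9 ns per second divided by the per-packet cost gives packets per second.
    (1e9 / ns_per_packet) * packet_bytes
}
```

`bytes_per_sec(2200.0, 1500.0)` is about 681.8 million bytes per second, which is roughly 5.45 Gbit/s per core.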

@zonyitoo
Contributor Author

zonyitoo commented Mar 12, 2022

Yes, it's true. I made some simple speedtests locally with iperf3:

iperf3 -c => sslocal (tunnel 5202) => ssserver => iperf3 -s

On my laptop (i7-9750H), when testing with iperf3 -c 127.0.0.1 -p 5202 -Rub 5.0G -t 300, result shows:

Connecting to host 127.0.0.1, port 5202
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 56587 connected to 127.0.0.1 port 5202
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   590 MBytes  4.95 Gbits/sec  0.037 ms  357/38225 (0.93%)
[  5]   1.00-2.00   sec   594 MBytes  4.98 Gbits/sec  0.016 ms  141/38285 (0.37%)
[  5]   2.00-3.00   sec   587 MBytes  4.92 Gbits/sec  0.028 ms  565/38259 (1.5%)
[  5]   3.00-4.00   sec   595 MBytes  5.00 Gbits/sec  0.055 ms  46/38279 (0.12%)
[  5]   4.00-5.00   sec   594 MBytes  4.98 Gbits/sec  0.018 ms  137/38259 (0.36%)
[  5]   5.00-6.00   sec   579 MBytes  4.85 Gbits/sec  0.020 ms  1123/38274 (2.9%)
[  5]   6.00-7.00   sec   554 MBytes  4.65 Gbits/sec  0.016 ms  2697/38274 (7%)
[  5]   7.00-8.00   sec   589 MBytes  4.94 Gbits/sec  0.023 ms  454/38262 (1.2%)
[  5]   8.00-9.00   sec   557 MBytes  4.68 Gbits/sec  0.023 ms  2478/38266 (6.5%)

and both sslocal and ssserver CPU usage reached above 87%. But the TCP test (iperf3 -c 127.0.0.1 -p 5202 -R -t 300) shows:

Connecting to host 127.0.0.1, port 5202
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 60416 connected to 127.0.0.1 port 5202
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   806 MBytes  6.76 Gbits/sec
[  5]   1.00-2.00   sec   843 MBytes  7.07 Gbits/sec
[  5]   2.00-3.00   sec   839 MBytes  7.04 Gbits/sec
[  5]   3.00-4.00   sec   835 MBytes  7.00 Gbits/sec
[  5]   4.00-5.00   sec   759 MBytes  6.37 Gbits/sec
[  5]   5.00-6.00   sec   806 MBytes  6.76 Gbits/sec
[  5]   6.00-7.00   sec   831 MBytes  6.97 Gbits/sec
[  5]   7.00-8.00   sec   822 MBytes  6.90 Gbits/sec
[  5]   8.00-9.00   sec   817 MBytes  6.85 Gbits/sec
[  5]   9.00-10.00  sec   773 MBytes  6.49 Gbits/sec
[  5]  10.00-11.00  sec   765 MBytes  6.42 Gbits/sec
[  5]  11.00-12.00  sec   732 MBytes  6.14 Gbits/sec
[  5]  12.00-13.00  sec   810 MBytes  6.79 Gbits/sec
[  5]  13.00-14.00  sec   712 MBytes  5.97 Gbits/sec
[  5]  14.00-15.00  sec   798 MBytes  6.69 Gbits/sec

CPU consumption is 128% for sslocal and 82% for ssserver. TCP channels can reach a higher bandwidth limit than UDP associations.

You are correct that average users won't reach these extreme cases, so we can set these numbers aside and focus on how to make UDP associations perform as well as TCP channels. Or maybe UDP associations should perform better than TCP channels, because UDP tests don't have protocol overheads like acks, ...

@database64128
Contributor

database64128 commented Mar 12, 2022

I have been doing experiments and benchmarks of changing the current Shadowsocks AEAD protocol. If standardized, the spec will be maintained by the Shadowsocks.NET organization.

Shadowsocks 2022 Edition

Goals

Non-Goals

  • Forward secrecy.

PSK

The popular VPN protocol WireGuard provides a simple userspace program wg (usually packaged in wireguard-tools by distributions) to generate cryptographically-secure 32-byte private keys and PSKs. The keys are encoded in base64 for convenience.

Instead of asking the user to provide a password, Shadowsocks 2022 is taking the same approach. A user can use wg genkey or wg genpsk (genkey and genpsk point to the same underlying implementation.) to generate a base64-encoded 32-byte PSK. This change drops dependency on EVP_BytesToKey.

Subkey Derivation

HKDF_SHA1 is replaced by BLAKE3's key derivation mode. A randomly generated 32-byte salt is appended to the PSK to be used as key material.

session_subkey := blake3::derive_key(context: "shadowsocks 2022 session subkey", key_material: key + salt)

I believe BLAKE3's key derivation mode alone without HKDF is secure enough for this purpose.

Required Method

Method 2022-blake3-chacha20-poly1305 MUST be implemented by all implementations. 2022 reflects the fast-changing and flexible nature of the protocol.

TCP

ChaCha20-Poly1305

2022-blake3-chacha20-poly1305 uses ChaCha20-Poly1305 with derived subkeys for TCP sessions. The construction is the same as Shadowsocks AEAD.

Protocol

tcp request header := 1B type + 64-bit unix epoch timestamp [+ atyp + address + port] + 2B padding length [+ padding]
tcp response header := 1B type + 64-bit unix epoch timestamp [+ request salt]
tcp stream := 32B salt [+ AEAD(length) + AEAD(message)] [+...]

HeaderTypeClientStream = 0
HeaderTypeServerStream = 1
MinPaddingLength = 0
MaxPaddingLength = 900

The first message MUST be the header or start with the header.

Replay Protection

Both server and client MUST record all incoming salts during the last 30 seconds. When a new TCP session is started, the first received message is decrypted and its timestamp MUST be checked against system time. If the time difference is within 30 seconds, then the salt is checked against all stored salts. If no repeated salt is discovered, then the salt is added to the pool and the session is successfully established.
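
A minimal sketch of this check, assuming timestamps are Unix seconds and that decryption of the first message has already succeeded. `SaltPool` and its method are hypothetical names, not part of the draft.

```rust
use std::collections::HashMap;

const WINDOW_SECS: u64 = 30;

/// Remembers salts received within the last 30 seconds (hypothetical type).
pub struct SaltPool {
    seen: HashMap<[u8; 32], u64>, // salt -> local time at which it was received
}

impl SaltPool {
    pub fn new() -> Self {
        SaltPool { seen: HashMap::new() }
    }

    /// Returns true if the session may be established.
    /// `now` is the local Unix time, `header_ts` the decrypted header timestamp.
    pub fn check_and_insert(&mut self, now: u64, header_ts: u64, salt: [u8; 32]) -> bool {
        // Forget salts that were received more than 30 seconds ago.
        self.seen
            .retain(|_, received| now.saturating_sub(*received) <= WINDOW_SECS);

        // The header timestamp must be within 30 seconds of system time.
        let diff = if now >= header_ts { now - header_ts } else { header_ts - now };
        if diff > WINDOW_SECS {
            return false;
        }

        // A repeated salt inside the window is treated as a replay.
        if self.seen.contains_key(&salt) {
            return false;
        }
        self.seen.insert(salt, now);
        true
    }
}
```

The sketch follows the 30-second figure in the text. An implementation that wants to cover senders whose clocks run ahead might keep salts somewhat longer than the timestamp window; that is an assumption on my part, not something the draft states.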

UDP

XChaCha20-Poly1305

The XChaCha20-Poly1305 construction can safely encrypt a practically unlimited number of messages with the same key, without any practical limit to the size of a message (up to ~ 2^64 bytes).

As an alternative to counters, its large nonce size (192-bit) allows random nonces to be safely used.

The official Go implementation of ChaCha20-Poly1305 provides XChaCha20-poly1305. RustCrypto's chacha20poly1305 crate also provides XChaCha20-Poly1305.

Protocol

udp client header := 8B client session id + 8B packet id + 1B type + 64-bit unix epoch timestamp + 2B padding length [+ padding] [+ atyp + address + port]
udp server header := 8B server session id + 8B packet id + 1B type + 64-bit unix epoch timestamp + 8B client session id + 2B padding length [+ padding] [+ atyp + address + port]
udp packet := 24B nonce + AEAD(message)

HeaderTypeClientPacket = 0
HeaderTypeServerPacket = 1

2022-blake3-chacha20-poly1305 uses XChaCha20-Poly1305 with PSK as the key and a random nonce for each message. The session ID identifies a UDP session. The packet ID is a counter for sliding window replay protection.
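
As an illustration of the client header layout, the fixed-size fields could be serialized as sketched below. Big-endian byte order and a timestamp in seconds are assumptions, since the draft text above does not state them. The struct name is hypothetical, and the variable-length padding and address fields are left out.

```rust
/// Fixed-size part of the draft's UDP client header (hypothetical struct).
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct UdpClientHeader {
    pub session_id: u64,
    pub packet_id: u64,
    pub header_type: u8, // HeaderTypeClientPacket = 0
    pub timestamp: u64,  // Unix epoch time (seconds assumed)
    pub padding_len: u16,
}

/// 8B session id + 8B packet id + 1B type + 8B timestamp + 2B padding length.
pub const FIXED_LEN: usize = 8 + 8 + 1 + 8 + 2;

impl UdpClientHeader {
    /// Serializes the fixed fields. Big-endian is an assumption of this sketch.
    pub fn encode(&self) -> [u8; FIXED_LEN] {
        let mut out = [0u8; FIXED_LEN];
        out[0..8].copy_from_slice(&self.session_id.to_be_bytes());
        out[8..16].copy_from_slice(&self.packet_id.to_be_bytes());
        out[16] = self.header_type;
        out[17..25].copy_from_slice(&self.timestamp.to_be_bytes());
        out[25..27].copy_from_slice(&self.padding_len.to_be_bytes());
        out
    }

    /// Parses the fixed fields; returns None if the buffer is too short.
    pub fn decode(buf: &[u8]) -> Option<Self> {
        if buf.len() < FIXED_LEN {
            return None;
        }
        let u64_at = |i: usize| {
            let mut b = [0u8; 8];
            b.copy_from_slice(&buf[i..i + 8]);
            u64::from_be_bytes(b)
        };
        Some(UdpClientHeader {
            session_id: u64_at(0),
            packet_id: u64_at(8),
            header_type: buf[16],
            timestamp: u64_at(17),
            padding_len: u16::from_be_bytes([buf[25], buf[26]]),
        })
    }
}
```

The fixed part is 27 bytes. The server header would add the 8-byte client session ID after the timestamp.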

Session ID based Routing and Sliding Window Replay Protection

An implementation SHOULD implement a NAT table using session ID as the key. A NAT entry SHOULD at least store the following information:

  • Peer last seen time: Manages the lifetime of the NAT entry.
  • Peer last seen address: Return packets are sent to this address.
  • Outgoing socket/connection: Stores reference to the outgoing socket/connection abstraction where packets to target address are sent from.
  • Sliding window filter: After decryption, the packet ID MUST be passed to this filter for replay protection.

Upon receiving an encrypted packet, the packet is decrypted using the first 24 bytes as nonce. The header is verified by checking the timestamp against system time. Then the session ID in the header is used to look up its NAT entry. If the lookup is successful, pass the packet ID to the sliding window filter to complete verification.

The last seen time and address MUST be updated after packet verification. Updating last seen address ensures that, when the client changes network, the session won't be interrupted.
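
The NAT table and sliding window filter described above could be sketched as follows. The type names, the 64-packet window size, and the choice to create an entry on a failed lookup are all assumptions for illustration. The outgoing socket is omitted, and decryption plus the timestamp check are assumed to have happened before `on_packet` is called.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::Instant;

/// 64-packet sliding window over packet IDs (the window size is an assumption).
#[derive(Default)]
pub struct SlidingWindow {
    started: bool,
    highest: u64, // highest packet ID accepted so far
    bitmap: u64,  // bit i set => packet (highest - i) has been seen
}

impl SlidingWindow {
    /// Returns true and records `id` if it is new; false for replays or too-old IDs.
    pub fn check_and_update(&mut self, id: u64) -> bool {
        if !self.started {
            self.started = true;
            self.highest = id;
            self.bitmap = 1;
            return true;
        }
        if id > self.highest {
            // Slide the window forward so that bit 0 refers to the new highest ID.
            let shift = id - self.highest;
            self.bitmap = if shift >= 64 { 0 } else { self.bitmap << shift };
            self.bitmap |= 1;
            self.highest = id;
            return true;
        }
        let offset = self.highest - id;
        if offset >= 64 {
            return false; // older than the window
        }
        let bit = 1u64 << offset;
        if self.bitmap & bit != 0 {
            return false; // already seen
        }
        self.bitmap |= bit;
        true
    }
}

pub struct NatEntry {
    pub last_seen: Instant,
    pub peer_addr: SocketAddr,
    pub filter: SlidingWindow,
}

#[derive(Default)]
pub struct NatTable {
    entries: HashMap<u64, NatEntry>,
}

impl NatTable {
    /// Call after decryption and the timestamp check have succeeded.
    /// Returns false if the packet ID is rejected by the replay filter.
    pub fn on_packet(&mut self, session_id: u64, packet_id: u64, from: SocketAddr) -> bool {
        let entry = self.entries.entry(session_id).or_insert_with(|| NatEntry {
            last_seen: Instant::now(),
            peer_addr: from,
            filter: SlidingWindow::default(),
        });
        if !entry.filter.check_and_update(packet_id) {
            return false;
        }
        // Only verified packets refresh the lifetime and the return address.
        entry.last_seen = Instant::now();
        entry.peer_addr = from;
        true
    }

    pub fn peer_addr(&self, session_id: u64) -> Option<SocketAddr> {
        self.entries.get(&session_id).map(|e| e.peer_addr)
    }
}
```

Because the peer address is updated only after the filter accepts the packet, a replayed packet sent from another address cannot redirect return traffic in this sketch.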

Optional Methods

Reduced-round variants of ChaCha20-Poly1305

2022-blake3-chacha8-poly1305 and 2022-blake3-chacha12-poly1305 are optional methods that use reduced-round variants of ChaCha20-Poly1305 and XChaCha20-Poly1305.

According to this paper, 8 rounds of ChaCha should be secure, while yielding a 2.5x speedup. I asked @zonyitoo to benchmark these reduced-round variants provided by RustCrypto. Results:

test bench_crypto_chacha12_ietf_poly1305_encrypt  ... bench:       1,064 ns/iter (+/- 151)
test bench_crypto_chacha20_ietf_poly1305_encrypt  ... bench:       1,376 ns/iter (+/- 146)
test bench_crypto_chacha8_ietf_poly1305_encrypt   ... bench:         993 ns/iter (+/- 209)
test bench_crypto_xchacha12_ietf_poly1305_encrypt ... bench:       1,116 ns/iter (+/- 117)
test bench_crypto_xchacha20_ietf_poly1305_encrypt ... bench:       1,436 ns/iter (+/- 181)
test bench_crypto_xchacha8_ietf_poly1305_encrypt  ... bench:         962 ns/iter (+/- 93)
test bench_ring_aes_128_gcm_encrypt               ... bench:         346 ns/iter (+/- 42)
test bench_ring_aes_256_gcm_encrypt               ... bench:         455 ns/iter (+/- 49)
test bench_ring_chacha20_ietf_poly1305_encrypt    ... bench:         801 ns/iter (+/- 92)

Unfortunately, RustCrypto still hasn't quite caught up with ring in terms of performance. And ring does not currently provide an implementation of these reduced-round variants.

I have not been able to find any well-maintained Go implementations of these reduced-round variants.

AES

2022-blake3-aes-256-gcm is an optional method that replaces ChaCha20-Poly1305 with AES-256-GCM for TCP. For UDP, unfortunately, there's no AES counterpart of XChaCha20-Poly1305. We have two options:

  1. Simply use AES-256-GCM in place of XChaCha20-Poly1305 and accept the requirement/risk that the same PSK cannot be used to encrypt more than 2^32 messages.
  2. Use a separate header for session ID and packet ID. These two happen to fit in a single AES block. We can simply AES-encrypt the separate header. Since the session ID is unique for each session, and the packet ID is used as a counter, this method should be secure.
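
To get a feel for the 2^32 message limit in option 1, here is a small worked example. The function name is made up, and the packet rates are hypothetical inputs; the first one reuses the 2.2 µs per packet figure from the benchmark discussion earlier in the thread.

```rust
/// Seconds until `limit` messages have been sent at a constant packet rate.
fn seconds_until_limit(packets_per_sec: f64, limit: f64) -> f64 {
    limit / packets_per_sec
}
```

At one packet every 2.2 µs (about 454,545 packets per second), 2^32 messages are reached after roughly 9,449 seconds, or about 2.6 hours. At a more modest 10,000 packets per second it takes about 5 days.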

Separate Header Design

packet header := 8B session id + 8B packet id
message header := 1B type + 64-bit unix epoch timestamp + 2B padding length [+ padding] [+ atyp + address + port]
AEAD key (session subkey) := blake3::derive_key(context: "shadowsocks 2022 session subkey", key_material: key + session_id)
AEAD nonce := packet_header[4:16]
udp_packet := aes-ecb(packet header) + AEAD(message header + payload)

This design gives a 2.1x speedup, as shown in the benchmarks below. Note that the benchmarks were run on a processor without vaes and vpclmulqdq instructions. On newer processors with these instructions, AES-GCM is expected to be twice as fast.
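
The header and nonce layout of the separate header design can be sketched as below. Big-endian byte order is an assumption, the function names are hypothetical, and the AES encryption of the header and the AEAD step are not shown.

```rust
/// Builds the 16-byte packet header: 8B session id + 8B packet id.
/// Big-endian byte order is an assumption of this sketch.
fn packet_header(session_id: u64, packet_id: u64) -> [u8; 16] {
    let mut h = [0u8; 16];
    h[..8].copy_from_slice(&session_id.to_be_bytes());
    h[8..].copy_from_slice(&packet_id.to_be_bytes());
    h
}

/// AEAD nonce := packet_header[4:16], i.e. the last 12 bytes of the header.
fn aead_nonce(header: &[u8; 16]) -> [u8; 12] {
    let mut n = [0u8; 12];
    n.copy_from_slice(&header[4..16]);
    n
}
```

The header is exactly one AES block (16 bytes). The nonce contains the full packet ID, so within one session, where the subkey is fixed, it stays unique as long as the packet ID counter never repeats.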

Benchmarks

1. Header benchmarks

https://github.com/database64128/cubic-go-playground/blob/main/shadowsocks/udpheader/udpheader_test.go

BenchmarkGenSaltHkdfSha1-8          	  372712	      3283 ns/op	    1264 B/op	      18 allocs/op
BenchmarkGenSaltBlake3-8            	  770426	      1554 ns/op	       0 B/op	       0 allocs/op
BenchmarkAesEcbHeaderEncryption-8   	99224395	        12.84 ns/op	       0 B/op	       0 allocs/op
BenchmarkAesEcbHeaderDecryption-8   	100000000	        11.79 ns/op	       0 B/op	       0 allocs/op

2. Full UDP packet construction benchmarks

https://github.com/database64128/cubic-go-playground/blob/main/shadowsocks/udp_test.go

BenchmarkShadowsocksAEADAes256GcmEncryption-8             	  252819	      4520 ns/op	    2160 B/op	      23 allocs/op
BenchmarkShadowsocksAEADAes256GcmWithBlake3Encryption-8   	  320718	      3367 ns/op	     896 B/op	       5 allocs/op
BenchmarkDraftSeparateHeaderAes256GcmEncryption-8         	 2059383	       590.5 ns/op	       0 B/op	       0 allocs/op
BenchmarkDraftXChaCha20Poly1305Encryption-8               	  993336	      1266 ns/op	       0 B/op	       0 allocs/op

Acknowledgement

I would like to thank @zonyitoo, @xiaokangwang, and @nekohasekai for their input on the design of the protocol.

@database64128
Contributor

database64128 commented Mar 12, 2022

Open Questions

  1. @zonyitoo Are you willing to trade complexity (adding a separate header) for a fairly significant boost (2.1x) of performance?

Reference Implementations

  1. Shadowsocks-NET/outline-ss-server Progress: All done, other than UDP multiplexing.

/cc @madeye @Mygod @riobard from ss org
/cc @fortuna @bemasc from Outline

@riobard
Contributor

riobard commented Mar 12, 2022

@zonyitoo

how to make UDP associations to have the same performance as TCP channels. Or maybe, UDP associations should have better performance than the TCP channels because UDP tests don't have protocol overheads like acks

I'm sorry but UDP tunnel in userspace won't reach the same performance as TCP because stream-based abstraction enjoys extensive optimization from the kernel while packet-based abstraction doesn't.

@zonyitoo
Contributor Author

Ah, thanks for your work.

Are you willing to trade complexity (adding a separate header) for a fairly significant boost (2.1x) of performance?

It doesn't seem to add too much complexity on implementation, so I am Ok with it.

I wouldn't suggest using xchacha20-ietf-poly1305 separately for the UDP protocol because there is currently no fast implementation in Rust (or Go, or other languages except C).

In terms of UDP session, how to generate a globally unique session ID?

@zonyitoo
Contributor Author

zonyitoo commented Mar 12, 2022

@zonyitoo

how to make UDP associations to have the same performance as TCP channels. Or maybe, UDP associations should have better performance than the TCP channels because UDP tests don't have protocol overheads like acks

I'm sorry but UDP tunnel in userspace won't reach the same performance as TCP because stream-based abstraction enjoys extensive optimization from the kernel while packet-based abstraction doesn't.

Well yes, I did some tests with iperf3 directly on the lo interface:

TCP, iperf3 -c 127.0.0.1 -p 5201 -R

Connecting to host 127.0.0.1, port 5201
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 54944 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  7.47 GBytes  64.1 Gbits/sec
[  5]   1.00-2.00   sec  7.73 GBytes  66.4 Gbits/sec
[  5]   2.00-3.00   sec  7.46 GBytes  64.0 Gbits/sec
[  5]   3.00-4.00   sec  7.84 GBytes  67.3 Gbits/sec
[  5]   4.00-5.00   sec  7.84 GBytes  67.3 Gbits/sec
[  5]   5.00-6.00   sec  7.87 GBytes  67.6 Gbits/sec
[  5]   6.00-7.00   sec  7.77 GBytes  66.8 Gbits/sec
[  5]   7.00-8.00   sec  7.83 GBytes  67.2 Gbits/sec
[  5]   8.00-9.00   sec  7.60 GBytes  65.3 Gbits/sec
[  5]   9.00-10.00  sec  7.87 GBytes  67.6 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  77.3 GBytes  66.4 Gbits/sec                  sender
[  5]   0.00-10.00  sec  77.3 GBytes  66.4 Gbits/sec                  receiver

UDP, iperf3 -c 127.0.0.1 -p 5201 -Rub 0

Connecting to host 127.0.0.1, port 5201
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 62256 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  3.22 GBytes  27.7 Gbits/sec  0.001 ms  0/211709 (0%)
[  5]   1.00-2.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214696 (0%)
[  5]   2.00-3.00   sec  3.28 GBytes  28.2 Gbits/sec  0.001 ms  0/215589 (0%)
[  5]   3.00-4.00   sec  3.21 GBytes  27.6 Gbits/sec  0.001 ms  178/211124 (0.084%)
[  5]   4.00-5.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214772 (0%)
[  5]   5.00-6.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214951 (0%)
[  5]   6.00-7.00   sec  3.25 GBytes  27.9 Gbits/sec  0.001 ms  0/213482 (0%)
[  5]   7.00-8.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214980 (0%)
[  5]   8.00-9.00   sec  3.26 GBytes  28.0 Gbits/sec  0.001 ms  0/214555 (0%)
[  5]   9.00-10.00  sec  3.19 GBytes  27.4 Gbits/sec  0.001 ms  905/210659 (0.43%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  32.5 GBytes  27.9 Gbits/sec  0.000 ms  0/2136530 (0%)  sender
[  5]   0.00-10.00  sec  32.5 GBytes  27.9 Gbits/sec  0.001 ms  1083/2136517 (0.051%)  receiver

Conclusion: UDP can only have 50% bandwidth of TCP. Hmm....

But @riobard, the lo device can actually handle 27.9 Gbps of UDP bandwidth, so the UDP tunnel (or association) implementation we have here, which reaches only 3 Gbps (without loss) or 5 Gbps (with 5% packet loss), is far from the limit of the network device.

If we can lower the cost of each UDP packet, the bandwidth should at least match the current TCP implementation (7 Gbps).

how to make UDP associations to have the same performance as TCP channels.

The UDP associations could have the same performance (7Gbps on my laptop) as the TCP channels.

@riobard
Contributor

riobard commented Mar 12, 2022

@zonyitoo Don't test lo. Test real ethernet devices, preferably over WiFi, and see what's the real-world impact.

@database64128
Contributor

I wouldn't suggest using xchacha20-ietf-poly1305 separately for the UDP protocol because there is currently no fast implementation in Rust (or Go, or other languages except C).

XChaCha20-Poly1305 is not an IETF standard. And it should be very easy to wrap a fast ChaCha20-Poly1305 implementation into XChaCha20-Poly1305 by adding an HChaCha20 layer.
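
To show what such an HChaCha20 layer involves, here is a minimal std-only sketch. It follows the construction in the XChaCha draft as I understand it: run the 20 ChaCha rounds over constants, key and the first 16 nonce bytes, output the first and last state rows as a subkey, then use the remaining 8 nonce bytes (prefixed with 4 zero bytes) as the 96-bit nonce of an ordinary ChaCha20-Poly1305. The function names are hypothetical, and the code is neither constant-time reviewed nor optimized.

```rust
fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]); s[d] = (s[d] ^ s[a]).rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]); s[b] = (s[b] ^ s[c]).rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]); s[d] = (s[d] ^ s[a]).rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]); s[b] = (s[b] ^ s[c]).rotate_left(7);
}

fn le32(b: &[u8], i: usize) -> u32 {
    u32::from_le_bytes([b[i], b[i + 1], b[i + 2], b[i + 3]])
}

/// HChaCha20: derives a 32-byte subkey from a key and a 16-byte nonce prefix.
pub fn hchacha20(key: &[u8; 32], nonce16: &[u8; 16]) -> [u8; 32] {
    let mut s = [0u32; 16];
    // "expand 32-byte k" constants, then 8 key words, then 4 nonce words.
    s[0] = 0x6170_7865; s[1] = 0x3320_646e; s[2] = 0x7962_2d32; s[3] = 0x6b20_6574;
    for i in 0..8 { s[4 + i] = le32(key, 4 * i); }
    for i in 0..4 { s[12 + i] = le32(nonce16, 4 * i); }
    for _ in 0..10 {
        // Column rounds.
        quarter_round(&mut s, 0, 4, 8, 12);
        quarter_round(&mut s, 1, 5, 9, 13);
        quarter_round(&mut s, 2, 6, 10, 14);
        quarter_round(&mut s, 3, 7, 11, 15);
        // Diagonal rounds.
        quarter_round(&mut s, 0, 5, 10, 15);
        quarter_round(&mut s, 1, 6, 11, 12);
        quarter_round(&mut s, 2, 7, 8, 13);
        quarter_round(&mut s, 3, 4, 9, 14);
    }
    // Unlike the ChaCha20 block function, the input state is not added back;
    // the output is the first and last rows of the state.
    let mut out = [0u8; 32];
    for i in 0..4 {
        out[4 * i..4 * i + 4].copy_from_slice(&s[i].to_le_bytes());
        out[16 + 4 * i..20 + 4 * i].copy_from_slice(&s[12 + i].to_le_bytes());
    }
    out
}

/// Splits an XChaCha20 (key, 24-byte nonce) pair into the subkey and 96-bit
/// nonce that would be handed to an ordinary ChaCha20-Poly1305 implementation.
pub fn xchacha_split(key: &[u8; 32], nonce24: &[u8; 24]) -> ([u8; 32], [u8; 12]) {
    let mut prefix = [0u8; 16];
    prefix.copy_from_slice(&nonce24[..16]);
    let mut nonce12 = [0u8; 12]; // first 4 bytes stay zero
    nonce12[4..].copy_from_slice(&nonce24[16..]);
    (hchacha20(key, &prefix), nonce12)
}
```

The per-packet overhead of this layer is one extra 20-round permutation, which is comparable to encrypting one additional 64-byte ChaCha20 block.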

In terms of UDP session, how to generate a globally unique session ID?

The session ID is 64-bit long. I wouldn't worry about the probability of collision of two randomly generated 64-bit integers. If you don't want random numbers, it's still safe to use a counter as session ID, as long as you stick with 2022-blake3-xchacha20-poly1305. With 2022-blake3-aes-256-gcm, however, we don't ever want two pieces of identical plaintext encrypted by plain non-AEAD AES.
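
The collision risk for random session IDs can be estimated with the usual birthday approximation. The sketch below is only a back-of-the-envelope aid; the function name and the example session counts are hypothetical.

```rust
/// Approximate probability that at least two of `n` uniformly random
/// `bits`-bit IDs collide: 1 - exp(-n(n-1) / 2^(bits+1)).
fn collision_probability(n: f64, bits: i32) -> f64 {
    let pairs = n * (n - 1.0) / 2.0;
    // -exp_m1(-x) equals 1 - e^(-x) and stays accurate when x is tiny.
    -(-pairs / 2f64.powi(bits)).exp_m1()
}
```

For one million sessions with 64-bit IDs the estimate is about 2.7e-8. It only approaches 39% once the number of sessions is around 2^32.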

@riobard
Contributor

riobard commented Mar 12, 2022

@database64128 Is IETF C20P1305 slow and/or broken? If not, please refrain from adding even more choices/confusions to the current mess. We're already having a hard time explaining which cipher is the best choice for the average users. The mistakes from the horribly long list of stream ciphers must be avoided.

@database64128
Contributor

@riobard I think in the spec we can advise implementations to gate optional ciphers behind optional features. Advanced users and developers can build their own binary by enabling the optional features.

The spec only requires implementing one method 2022-blake3-chacha20-poly1305, which uses IETF ChaCha20-Poly1305 for TCP, and XChaCha20-Poly1305 for UDP. We cannot use ChaCha20-Poly1305 for UDP, because each packet uses a randomly generated nonce.

@zonyitoo
Contributor Author

zonyitoo commented Mar 12, 2022

@zonyitoo Don't test lo. Test real ethernet devices, preferably over WiFi, and see what's the real-world impact.

BTW, we should focus more on the protocol itself and on how to lower the CPU cost of the UDP protocol. Devices with low computing power, like mobile phones or routers, will benefit a lot if we can make the protocol faster.

@riobard
Contributor

riobard commented Mar 12, 2022

@database64128 No, the intended goal is to reduce the number of optional ciphers, ideally leaving only one mandatory cipher (which is IETF C20P1305), so there's no need for average users to even think about the choice. If you read carefully the current spec, that's exactly what is being written:

Compliant Shadowsocks implementations must support AEAD_CHACHA20_POLY1305. Implementations for devices with hardware AES acceleration should also implement AEAD_AES_128_GCM and AEAD_AES_256_GCM.

The fact that various implementations decide to add other AEAD ciphers is very unfortunate, as it creates more confusion for little benefit. You failed to explain why it is necessary to create another cipher, and I don't see any benefits changing the status quo.

@riobard
Contributor

riobard commented Mar 12, 2022

@zonyitoo I'd like to but there's only so much you could do with UDP in userspace.

@zonyitoo
Contributor Author

zonyitoo commented Mar 12, 2022

@database64128 No, the intended goal is to reduce the number of optional ciphers, ideally leaving only one mandatory cipher (which is IETF C20P1305), so there's no need for average users to even think about the choice. If you read carefully the current spec, that's exactly what is being written:

Compliant Shadowsocks implementations must support AEAD_CHACHA20_POLY1305. Implementations for devices with hardware AES acceleration should also implement AEAD_AES_128_GCM and AEAD_AES_256_GCM.

The fact that various implementations decide to add other AEAD ciphers is very unfortunate, as it creates more confusion for little benefit. You failed to explain why it is necessary to create another cipher, and I don't see any benefits changing the status quo.

Well, I think what @database64128's proposal says is that the TCP protocol uses only chacha20-ietf-poly1305 and the UDP protocol uses only xchacha20-ietf-poly1305. So users won't need to choose a method; they are mandatory and fixed in this proposal.

@database64128
Contributor

We're already having a hard time explaining which cipher is the best choice for the average users. The mistakes from the horribly long list of stream ciphers must be avoided.

No, the intended goal is to reduce the number of optional ciphers, ideally leaving only one mandatory cipher (which is IETF C20P1305), so there's no need for average users to even think about the choice.

@riobard I don't see any problem with letting the user select from existing Shadowsocks AEAD ciphers. The only practical difference between existing Shadowsocks AEAD ciphers is probably performance.

My spec only has ONE mandatory method: 2022-blake3-chacha20-poly1305. The use of ChaCha20-Poly1305 for TCP and XChaCha20-Poly1305 for UDP is an implementation detail, and is transparent to users.

To give users some flexibility, some optional methods are suggested, only because they are just as secure, and can yield significant performance boosts. For example, switching from ChaCha20-Poly1305 to AES-256-GCM increases the maximum TCP throughput by 25%, the separate header proposal for UDP is twice as fast as the mandatory XChaCha20-Poly1305 construction. Please keep in mind that performance boosts translate to less energy consumption on resource-constrained devices.

@riobard
Contributor

riobard commented Mar 12, 2022

@zonyitoo No, he's also changing key derivation procedure which breaks backward compatibility for no benefits. To the end users, they'll just see one more entry in the (already long enough) list of available ciphers. The technical details are irrelevant to the discussion of reducing optional choices (and thus complexity).

@database64128 Two points:

I don't see any problem with letting the user select from existing Shadowsocks AEAD ciphers.

That's your opinion and I disagree. Complexity is the root of all evil in software. That's why TLS 1.3 is getting rid of most options and there's not even a choice of cipher in WireGuard.

switching from ChaCha20-Poly1305 to AES-256-GCM increases the maximum TCP throughput by 25%

This only happens on devices with hardware-accelerated AES instructions, and even on those devices the software needs to be carefully designed to properly use those instructions. On iOS (arguably one of the easier platforms to support), it was very late (IIRC in 2021 at earliest) when client apps (e.g. Surge) could figure out how to correctly make use of AES acceleration.

And the performance boost isn't really worth the additional complexity and potential implementation defects (AES/GCM is notoriously difficult to get right). Optimized implementation of C20P1305 is more than enough on the majority of modern devices.

So no, I just don't see the benefits of the proposal.

@database64128
Contributor

No, he's also changing key derivation procedure which breaks backward compatibility for no benefits.

What backward compatibility are you talking about? The ability for a user to choose an arbitrary password? I don't see how maintaining such "compatibility" could provide any benefits.

That's your opinion and I disagree. Complexity is the root of all evil in software. That's why TLS 1.3 is getting rid of most options and there's not even a choice of cipher in WireGuard.

And here I am, getting rid of the uncertainty of user-provided password, so we don't have to worry about choosing a secure key derivation algorithm for passwords. HKDF_SHA1 is replaced by BLAKE3's key derivation mode because HKDF_SHA1 is slow, and because I don't want to use anything marked as "obsolete" in the year of 2022.

This only happens on devices with hardware-accelerated AES instructions, and even on those devices the software needs to be carefully designed to properly use those instructions.

This is a very cynical take on the issue. And if you are not confident about using AES, just stick to the default ChaCha20-Poly1305.

@zonyitoo
Contributor Author

zonyitoo commented Mar 12, 2022

This only happens on devices with hardware-accelerated AES instructions, and even on those devices the software needs to be carefully designed to properly use those instructions. On iOS (arguably one of the easier platforms to support), it was very late (IIRC in 2021 at earliest) when client apps (e.g. Surge) could figure out how to correctly make use of AES acceleration.

Err.. Since we are talking about a new protocol for the future, I think we can assume that end users are using devices with hardware-accelerated AES instructions (mostly aarch64, x86_64).

test bench_ring_aes_128_gcm_encrypt               ... bench:         346 ns/iter (+/- 42)
test bench_ring_aes_256_gcm_encrypt               ... bench:         455 ns/iter (+/- 49)
test bench_ring_chacha20_ietf_poly1305_encrypt    ... bench:         801 ns/iter (+/- 92)

These tests are done on my laptop (x86_64) and you can see the difference.

Optimized implementation of C20P1305 is more than enough on the majority of modern devices.

Well, from the test shown above, the optimized C20P1305 still takes about 1.76x as long per iteration as aes-256-gcm (801 ns vs. 455 ns).

No, he's also changing key derivation procedure which breaks backward compatibility for no benefits.

Hmm? I don't think so, the current AEAD protocol should remain unchanged. All the current discussion applies only to the new protocol. Version 1 (stream protocol) and version 3 (AEAD protocol) will not be changed. @riobard

Since SHA1 is marked as cryptographically broken, this is a good opportunity to replace it with a modern hash function. I am OK with the proposal to choose BLAKE3, because blake3::derive_key is actually much faster than HKDF-SHA1 in tests.

test bench_blake3            ... bench:         148 ns/iter (+/- 14)
test bench_ring_hkdf_32b_key ... bench:         839 ns/iter (+/- 47)
test bench_hkdf_blake3       ... bench:         857 ns/iter (+/- 91)

The 1st test is blake3::derive_key with [32-byte key] + [32-byte randomly generated IV]. The 2nd test is the current key derivation method, HKDF-SHA1. The 3rd test is HKDF-BLAKE3.

As for EVP_BytesToKey, maybe we can keep it unchanged. There is no obvious benefit in forcing users to provide a key of exactly the length the cipher requires. If users generate their key with secure tools like genpsk, we can provide a "key" field to read it directly without EVP_BytesToKey. @database64128

@zonyitoo
Contributor Author

zonyitoo commented Mar 12, 2022

Maybe we should set aside the topic about using exactly 1 chosen cipher or keep the 3 selected ones in version 3 (AEAD protocol). We should discuss more about the design of the protocol itself.

@database64128
Contributor

database64128 commented Mar 12, 2022

As for EVP_BytesToKey, maybe we can keep it unchanged. There is no obvious benefit in forcing users to provide a key of exactly the length the cipher requires. If users generate their key with secure tools like genpsk, we can provide a "key" field to read it directly without EVP_BytesToKey.

I disagree.

  • Directly asking for a base64-encoded 32-byte key keeps it simple and straightforward, just like WireGuard.
  • New implementations won't have to implement EVP_BytesToKey by hand.
  • EVP_BytesToKey is not exactly a good way to derive keys from passwords. And it uses MD5.
  • It also acts as a distinction between new and existing protocols.

@zonyitoo
Contributor Author

Ah, it uses MD5, I just remembered.

Alright, how about using another KDF to replace it? Generating one by hand is not user friendly in most cases.

@database64128
Contributor

Generating one by hand is not user friendly in most cases.

It's actually much more user-friendly than the current best practice of generating a password in your password manager GUI and then copy-pasting it into your config files.

Now all you need is a cryptographically-secure 32-byte key. You can generate one with a one-liner in your favorite shell, which anyone running a Shadowsocks server should be familiar with. Or you can use existing userspace programs like wg, which does nothing more than a simple getrandom(2) system call. Or if you are feeling generous, add a quick command to sslocal and ssserver.

When we ask a user to input a password, they may not bother to actually generate a secure one. But when we ask that they must provide a base64-encoded 32-byte key, it's very unlikely that any weak key gets used, unless the user very much intends to do so.
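
As a rough illustration of the "quick command" idea, here is a minimal Rust sketch that reads 32 random bytes and prints them as standard base64. It is only a sketch under stated assumptions: `generate_psk` and `base64_encode` are hypothetical names, not part of sslocal or ssserver, reading `/dev/urandom` assumes a Unix-like system, and base64 is hand-rolled only to keep the example free of external crates.

```rust
use std::fs::File;
use std::io::Read;

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with '=' padding (hand-rolled to avoid external crates).
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Reads 32 bytes from the OS random source and returns them base64-encoded.
/// Assumption: a Unix-like system that exposes /dev/urandom.
fn generate_psk() -> std::io::Result<String> {
    let mut key = [0u8; 32];
    File::open("/dev/urandom")?.read_exact(&mut key)?;
    Ok(base64_encode(&key))
}
```

Calling `generate_psk()` yields a 44-character string (43 base64 characters plus one `=`), which could then be pasted into a config file.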

@zonyitoo
Contributor Author

zonyitoo commented Mar 12, 2022

When we ask a user to input a password, they may not bother to actually generate a secure one. But when we ask that they must provide a base64-encoded 32-byte key, it's very unlikely that any weak key gets used, unless the user very much intends to do so.

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=, :P

Now all you need is a cryptographically-secure 32-byte key. You can generate one with a one-liner in your favorite shell, which anyone running a Shadowsocks server should be familiar with.

I do agree with you, but we cannot tell users what to do. Users can generate a 32-byte key with any tool.

Users who know exactly what they are doing can generate a Base64-encoded key and pass it to the "key" field, and other, average users can choose to provide a "password" made of printable characters.

Supporting a password KDF takes little effort, and it only runs once when the process starts.

How about Argon2, which has been proven to be a cryptographically secure KDF (password hash)?

@zonyitoo
Contributor Author

zonyitoo commented Apr 4, 2022

From my point of view, AES-256-GCM on modern devices is fast enough in this age.

@riobard
Contributor

riobard commented Apr 5, 2022

Well then C20P1305 is fast enough for the vast majority 😎

@database64128
Contributor

Just saw an HN submission about using kTLS in NGINX. Some interesting comments I'd like to quote:

We've been running kTLS + SSL sendfile on FreeBSD at Netflix for the last 6 or 7 years. (We had local patches to nginx, before nginx did them "right", and 2 versions of kTLS before the 2nd version was upstreamed to FreeBSD). The savings in terms of CPU use and memory BW are pretty substantial. Especially when you use a NIC which can do in-line kTLS offload, then things basically go back to pre-TLS costs because the buffers are not touched at all by the CPU.

BTW, FreeBSD 14 supports cha-cha poly. But is far more CPU intensive than GCM, so I'd advise against using it.

Random question: Do you force Netflix clients onto the ciphers which are most efficient for the Netflix servers, or are there cases (I'm thinking mobile devices particularly) where it makes sense to use the ciphers which are most efficient for the clients?

Pretty sure the best choice is basically always AES GCM because most modern chipsets can hardware offload that. Curious to hear the answer, though.

There are cases where it makes sense, but I'm not sure that mobile devices that are likely to be playing video is it. Chances are, they'll be burning more power on the screen backlight than the CPU to do AES vs something else (assuming they're not accelerated for AES). There's two sides to the argument, but easing the burden on servers lets one server serve more clients; it's easier to justify 2x the cost for crypto on the client than on the server, because most clients aren't bottlenecked on crypto and some servers are. (Of course, I'm usually a server engineer, so of course I want my servers to have less work ;)

Choosing ciphers for ease of the client makes more sense, IMHO, when the client is really constrained, like a feature phone or tiny IoT things.

@riobard
Contributor

riobard commented Apr 5, 2022

Don't be silly: Netflix edge devices stream at 50~100Gbps. We can talk about CPU usage when 10Gbps fiber is as common as 100Mbps.

@database64128
Contributor

Don't be silly: Netflix edge devices stream at 50~100Gbps. We can talk about CPU usage when 10Gbps fiber is as common as 100Mbps.

We should talk about CPU usage, because not everyone can afford beefy machines like Netflix's edge devices. My dual-core AMD EPYC $18/mo Digital Ocean VPS runs at 50% (out of 100%) CPU utilization when I download over WireGuard over Shadowsocks 2022 2022-blake3-aes-256-gcm at 120Mbps. As far as I know, most people use single-core $5/mo VPSes for this purpose.

@riobard
Contributor

riobard commented Apr 5, 2022

50% (out of 100%) CPU utilization when I download over WireGuard over Shadowsocks 2022 2022-blake3-aes-256-gcm at 120Mbps.

Yeah… go figure which part eats your CPU budget.

@database64128
Contributor

As outlined in this blog post by Cloudflare, using sendmmsg(2) + UDP GSO should be an effective way to increase UDP throughput by minimizing syscall overhead. Since shadowsocks-rust already has a send queue for outgoing UDP packets, it's possible to implement this:

  • Determine gso_size for the next message in the upcoming sendmmsg(2) call by selecting up to 63 same-length packets, followed by one same-length or smaller packet.
  • Repeat until all packets in the send queue are processed. Then call sendmmsg(2) to send the processed packets. On return, filter out successfully sent messages to prepare for the next sendmmsg(2) call.
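
The grouping rule in the two steps above can be sketched as follows. This is a minimal illustration, not the shadowsocks-rust implementation: `GsoBatch` and `plan_gso_batches` are hypothetical names, the function works on packet lengths only, and it does not model any other limits the kernel may impose (for example on total bytes per message) or the actual sendmmsg(2) call.

```rust
/// Maximum packets per message: up to 63 same-length packets plus one
/// final packet of the same or smaller length.
const MAX_SEGMENTS: usize = 64;

/// One message for an upcoming sendmmsg(2) call (hypothetical type).
#[derive(Debug, PartialEq)]
struct GsoBatch {
    /// Segment size to announce for this message.
    gso_size: usize,
    /// Index of the first queued packet in this batch.
    start: usize,
    /// Number of queued packets coalesced into this batch.
    count: usize,
}

/// Groups queued packet lengths, in queue order, into GSO batches.
fn plan_gso_batches(lens: &[usize]) -> Vec<GsoBatch> {
    let mut batches = Vec::new();
    let mut i = 0;
    while i < lens.len() {
        // The first packet of a batch fixes the segment size.
        let gso_size = lens[i];
        let mut count = 1;
        while count < MAX_SEGMENTS && i + count < lens.len() {
            let next = lens[i + count];
            if next > gso_size {
                break; // A larger packet must start its own batch.
            }
            count += 1;
            if next < gso_size {
                break; // A shorter packet can only be the last segment.
            }
        }
        batches.push(GsoBatch { gso_size, start: i, count });
        i += count;
    }
    batches
}
```

For a queue with lengths `[1400, 1400, 1400, 1200, 1400, 500]`, this plans two messages: the first four packets with `gso_size` 1400, then the last two with `gso_size` 1400. Packets are never reordered, so a long packet that follows a short one always opens a new message.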

zonyitoo added a commit to shadowsocks/shadowsocks-crypto that referenced this issue Apr 18, 2022
zonyitoo added a commit to shadowsocks/shadowsocks-rust that referenced this issue Apr 18, 2022
@database64128
Contributor

database64128 commented Apr 19, 2022

shadowsocks-rust's Shadowsocks 2022 implementation is not 100% complete, but it's ready for benchmarks.

shadowsocks-rust                 TCP        UDP
2022-blake3-aes-128-gcm          12.2Gbps   14.2Gbps
2022-blake3-aes-256-gcm          10.9Gbps   12.5Gbps
2022-blake3-chacha20-poly1305    8.05Gbps   2.35Gbps
2022-blake3-chacha8-poly1305     8.36Gbps   2.60Gbps
aes-128-gcm                      8.99Gbps   13.5Gbps
aes-256-gcm                      8.21Gbps   11.9Gbps
chacha20-poly1305                6.55Gbps   8.66Gbps

2022-blake3-aes-256-gcm          TCP        UDP
outline-ss-{server,client}       10.9Gbps   11.5Gbps
shadowsocks-rust                 10.9Gbps   12.5Gbps

@database64128
Contributor

Many people rely on domestic relays with better international connectivity for their Shadowsocks servers. Most servers serve more than one person, and using more than one port is not always an option. With legacy Shadowsocks, Outline supports multiple passwords on a single port using trial decryption, and mmp-go relays from one port to multiple servers based on the password used. These solutions seek to maintain backward compatibility by resorting to brute force, which has performance and security implications. We need a solution that's built into the protocol, fast, and secure by default.

Shadowsocks 2022 Extensible Identity Headers

Identity headers are one or more additional layers of headers, each consisting of the next layer's PSK hash. The next layer of an identity header is the next identity header, or the protocol header if it's the last identity header. Identity headers are encrypted with the current layer's identity PSK using an AES block cipher.

Identity headers are implemented in such a way that's fully backward compatible with current Shadowsocks 2022 implementations. Each identity processor is fully transparent to the next.

  • iPSKn: The nth identity PSK that identifies the current layer.
  • uPSKn: The nth user PSK that identifies a user on the server.

TCP

In TCP requests, identity headers are located between salt and AEAD chunks.

identity_subkey := blake3::derive_key(context: "shadowsocks 2022 identity subkey", key_material: iPSKn + salt)
plaintext := blake3::hash(iPSKn+1)[0..16] // Take the first 16 bytes of the next iPSK's hash.
identity_header := aes_encrypt(key: identity_subkey, plaintext: plaintext)

UDP

In UDP packets, identity headers are located between the separate header (session ID, packet ID) and AEAD ciphertext.

plaintext := blake3::hash(iPSKn+1)[0..16] ^ session_id_packet_id // XOR to make it different for each packet.
identity_header := aes_encrypt(key: iPSKn, plaintext: plaintext)

When iPSKs are used, the separate header MUST be encrypted with the first iPSK. Each identity processor MUST decrypt and re-encrypt the separate header with the next layer's PSK.
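
The XOR step in the UDP construction can be sketched in Rust as below. This is an illustration only: the BLAKE3 hash and the AES block encryption are left out because they need external crates, the function names are hypothetical, and the recovery step on the processor side is inferred from the construction (XOR is its own inverse) rather than spelled out in the text above.

```rust
/// XORs two 16-byte blocks byte by byte.
fn xor16(a: &[u8; 16], b: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a[i] ^ b[i];
    }
    out
}

/// Sender side: builds the block that is then AES-encrypted with iPSKn.
/// `next_psk_hash_prefix` is the first 16 bytes of the next PSK's hash and
/// `separate_header` is the 16-byte plaintext (session ID, packet ID).
fn identity_plaintext(
    next_psk_hash_prefix: &[u8; 16],
    separate_header: &[u8; 16],
) -> [u8; 16] {
    xor16(next_psk_hash_prefix, separate_header)
}

/// Processor side (inferred): after AES-decrypting the identity header,
/// XOR with the same separate header to get the hash prefix back.
fn recover_psk_hash_prefix(
    decrypted_identity_header: &[u8; 16],
    separate_header: &[u8; 16],
) -> [u8; 16] {
    xor16(decrypted_identity_header, separate_header)
}
```

Because the separate header carries the packet ID, two packets of the same session produce different plaintext blocks even though the PSK hash never changes.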

Scenarios

      client0       >---+
(iPSK0:iPSK1:uPSK0)      \
                          \
      client1       >------\                        +--->    server0 [iPSK1]
(iPSK0:iPSK1:uPSK1)         \                      /      [uPSK0, uPSK1, uPSK2]
                             >-> relay0 [iPSK0] >-<
      client2               /    [iPSK1, uPSK3]    \
(iPSK0:iPSK1:uPSK2) >------/                        +--->    server1 [uPSK3]
                          /
      client3            /
   (iPSK0:uPSK3)    >---+

A set of PSKs, delimited by :, are assigned to each client. To send a request, a client MUST generate one identity header for each iPSK.

A relay decrypts the first identity header with its identity key, looks up the PSK hash table to find the target server, and relays the remainder of the request.

A single-port-multi-user-capable server decrypts the identity header with its identity key, looks up the user PSK hash table to find the cipher for the user PSK, and processes the remainder of the request.
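
The PSK hash table lookup described above could look roughly like this in Rust. It is a sketch under assumptions: `PskTable` and `Upstream` are hypothetical types, the table is keyed by the first 16 bytes of each known PSK's hash, and the decryption that produces the lookup key is assumed to have happened already.

```rust
use std::collections::HashMap;

/// Where a request is handed next: a target server on a relay, or a
/// per-user cipher on a server (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct Upstream {
    name: &'static str,
}

/// Maps the first 16 bytes of each known PSK's hash to its upstream.
struct PskTable {
    by_hash: HashMap<[u8; 16], Upstream>,
}

impl PskTable {
    fn new() -> Self {
        PskTable { by_hash: HashMap::new() }
    }

    /// Registers a PSK by its precomputed 16-byte hash prefix.
    fn insert(&mut self, hash_prefix: [u8; 16], upstream: Upstream) {
        self.by_hash.insert(hash_prefix, upstream);
    }

    /// Looks up the decrypted identity header contents. `None` means the
    /// header did not decrypt to any known PSK hash.
    fn lookup(&self, decrypted_identity: &[u8; 16]) -> Option<&Upstream> {
        self.by_hash.get(decrypted_identity)
    }
}
```

For relay0 in the graph above, the table would hold two entries: the hash prefix of iPSK1 pointing at server0 and the hash prefix of uPSK3 pointing at server1. A lookup is a single hash map access, in contrast to the trial decryption used with legacy Shadowsocks.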

In the above graph, client0, client1, client2 are users of server0, which is relayed through relay0. server1 is a simple server without identity header support. client3 connects to server1 via relay0.

To start a TCP session, client0 generates a random salt, encrypts iPSK1's hash with iPSK0-derived subkey as the 1st identity header, encrypts uPSK0's hash with iPSK1-derived subkey as the 2nd identity header, and finishes the remainder of the request following the original spec.

To process the TCP request, relay0 decrypts the 1st identity header with iPSK0-derived subkey, looks up the PSK hash table, and writes the salt and remainder of the request (without the processed identity header) to server0.

To send a UDP packet, client0 encrypts the separate header with iPSK0, encrypts (iPSK1's hash XOR session_id_packet_id) with iPSK0 as the 1st identity header, encrypts (uPSK0's hash XOR session_id_packet_id) with iPSK1 as the 2nd identity header, and finishes off following the original spec.

To process the UDP packet, relay0 decrypts the separate header in-place with iPSK0, decrypts the 1st identity header with iPSK0, looks up the PSK hash table, re-encrypts the separate header with iPSK1 into the place of the first identity header, and sends the packet (starting at the re-encrypted separate header) to server0.

@madeye
Contributor

madeye commented Apr 20, 2022

Please open new issues for any SIP.

@zonyitoo I think you can close this issue and open a new one.

@database64128
Contributor

@madeye The spec has not been finalized and is still in design phase. I think we should keep discussions in one place, under this thread. New issues will be opened when the spec is ready.

@zonyitoo
Contributor Author

Please open new issues for any SIP.

@zonyitoo I think you can close this issue and open a new one.

Agree. Expecting a formal SIP issue about AEAD-2022.

@database64128
Contributor

database64128 commented Apr 23, 2022

I just finished implementing sendmmsg(2) with a send channel in database64128/swgp-go. Upload speed saw an increase of 32%.

Test method: iperf3 -c ::1 -p 30001 -ub 0 -l 1452

swgp-go     1     2     3     4     5     6     avg.
sendmsg     690   714   704   723   702   720   709Mbps
sendmmsg    917   914   945   956   945   951   938Mbps

Note that iperf3 does not currently support sendmmsg(2). There's an open PR though: esnet/iperf#1034.

Next step would be to evaluate whether downlink can benefit from recvmmsg(2) and sendmmsg(2) used together.

@database64128
Contributor

Downlink now uses recvmmsg(2) and sendmmsg(2) together. Download speed is now 51% faster.

Test method: iperf3 -c ::1 -p 30001 -Rub 0 -l 1452

swgp-go                1      2      3      4      5      6      avg.
recvmsg + sendmsg      845    829    841    862    844    849    845Mbps
recvmmsg + sendmmsg    1.28   1.26   1.28   1.26   1.27   1.28   1.27Gbps

@wangjian1009

shadowsocks-rust's Shadowsocks 2022 implementation is not 100% complete, but it's ready for benchmarks.

shadowsocks-rust                 TCP        UDP
2022-blake3-aes-128-gcm          12.2Gbps   14.2Gbps
2022-blake3-aes-256-gcm          10.9Gbps   12.5Gbps
2022-blake3-chacha20-poly1305    8.05Gbps   2.35Gbps
2022-blake3-chacha8-poly1305     8.36Gbps   2.60Gbps
aes-128-gcm                      8.99Gbps   13.5Gbps
aes-256-gcm                      8.21Gbps   11.9Gbps
chacha20-poly1305                6.55Gbps   8.66Gbps

2022-blake3-aes-256-gcm          TCP        UDP
outline-ss-{server,client}       10.9Gbps   11.5Gbps
shadowsocks-rust                 10.9Gbps   12.5Gbps

How did you run the test? I can't reproduce this result.

@database64128
Contributor

database64128 commented Jun 29, 2022

@wangjian1009

  • TCP: iperf3 -c ::1 -p 30001 -R
  • UDP: iperf3 -c ::1 -p 30001 -Rub 0

Port 30001 is forwarded to iperf3 server's port by sslocal and ssserver.

@wangjian1009

@wangjian1009

  • TCP: iperf3 -c ::1 -p 30001 -R
  • UDP: iperf3 -c ::1 -p 30001 -Rub 0

Port 30001 is forwarded to iperf3 server's port by sslocal and ssserver.

Why do you use -R to test only server -> client traffic?
I removed -R and got a very different result.

@database64128
Contributor

Why do you use -R to test only server -> client traffic?

Because that's what matters.

I removed -R and got a very different result.

I also did upload tests and the results are similar.

@database64128
Contributor

shadowsocks-go v1.0.0 has been released as the reference Go implementation of Shadowsocks 2022.

  • TCP: iperf3 -c ::1 -p 30001 -R
  • UDP 🔽: iperf3 -c ::1 -p 30001 -Rub 0 -l 1382
  • UDP 🔼: iperf3 -c ::1 -p 30001 -ub 0 -l 1390

                    TCP        UDP 🔽     UDP 🔼
shadowsocks-go      8.40Gbps   1.25Gbps   973Mbps
shadowsocks-rust    8.06Gbps   990Mbps    744Mbps

Thanks to recvmmsg(2) and sendmmsg(2), shadowsocks-go's UDP throughput is 26% (downlink) and 31% (uplink) higher than shadowsocks-rust's.

@CCCAUCHY

shadowsocks-go v1.0.0 has been released as the reference Go implementation of Shadowsocks 2022.

  • TCP: iperf3 -c ::1 -p 30001 -R
  • UDP 🔽: iperf3 -c ::1 -p 30001 -Rub 0 -l 1382
  • UDP 🔼: iperf3 -c ::1 -p 30001 -ub 0 -l 1390

                    TCP        UDP 🔽     UDP 🔼
shadowsocks-go      8.40Gbps   1.25Gbps   973Mbps
shadowsocks-rust    8.06Gbps   990Mbps    744Mbps

Thanks to recvmmsg(2) and sendmmsg(2), shadowsocks-go's UDP throughput is 26% and 31% higher than shadowsocks-rust.

why is the udp throughput less than this?

@database64128
Contributor

why is the udp throughput less than this?

@CCCAUCHY The old benchmark saturates the loopback interface's 65535-byte MTU. It's much faster because it's much more efficient to send bigger packets. Newer benchmarks use a much smaller packet size that fits in the typical ethernet MTU of 1500 bytes.
