The reason for the unsafe code was presumably to avoid copying or
zero-initializing. This achieves the same thing using only safe functions.
Note: there is no zero-initializing here because Cursor is "trusted" to
not read from the buffer and so skips the initialization:
https://github.com/rust-lang/rust/blob/master/src/libstd/io/cursor.rs#L241
(the Take wrapper delegates to its inner reader).
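A minimal sketch of that pattern, with illustrative names rather than
the crate's actual API: copying bytes out of a Cursor via Take and
read_to_end stays entirely in safe code and, per the note above, still
skips the zeroing.

    use std::io::{Cursor, Read};

    // Copy up to `size` bytes out of the cursor into a fresh Vec using
    // only safe calls. The zeroing that read_to_end() performs for
    // untrusted readers is skipped here, because Cursor is trusted not
    // to look at the destination buffer and Take delegates to its inner
    // reader.
    fn take_bytes(cursor: &mut Cursor<Vec<u8>>, size: u64) -> std::io::Result<Vec<u8>> {
        let mut out = Vec::with_capacity(size as usize);
        cursor.by_ref().take(size).read_to_end(&mut out)?;
        Ok(out)
    }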
The current implementation uses the `remaining_mut()` function from the
bytes::BufMut implementation for Vec<u8>.
In terms of the BufMut trait, a Vec buffer has infinite capacity - you
can always write more to the buffer, since the Vec grows as needed.
Hence, `remaining_mut` here returns what amounts to +∞ (concretely,
`usize::MAX - len`).
So the first `if` is always true, and the calls to `reserve` never
actually allocate the appropriate space. What happens instead is that
the `bytes_mut()` call in `read_from` picks up the slack, but it merely
grows the buffer 64 bytes at a time, which slows things down.
This changes the check to use the Vec capacity instead of
`remaining_mut`. In my profile (sending 100,000 10KiB messages back and
forth), this reduces `__memmove_avx_unaligned_erms`'s share of the
total runtime from ~77% to ~53%.
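A minimal sketch of the capacity-based check (assumed shape and names,
not necessarily the exact code in this change):

    // Make sure the Vec can take `size` more bytes without growing in
    // small steps during the read. Unlike BufMut::remaining_mut(), which
    // is effectively unbounded for Vec<u8>, Vec::capacity() reflects the
    // memory that is actually allocated.
    fn ensure_capacity(buf: &mut Vec<u8>, size: usize) {
        if buf.capacity() - buf.len() < size {
            // Vec::reserve(n) guarantees room for at least `len + n` bytes.
            buf.reserve(size);
        }
    }

With the space reserved up front, the subsequent read can fill the
buffer in one go instead of growing it 64 bytes at a time.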
Suggested by clippy:
warning: use of `ok_or` followed by a function call
--> src/handshake/server.rs:20:19
|
20 | let key = self.headers.find_first("Sec-WebSocket-Key")
| ___________________^ starting here...
21 | | .ok_or(Error::Protocol("Missing Sec-WebSocket-Key".into()))?;
| |_______________________________________________________________________^ ...ending here
|
= note: #[warn(or_fun_call)] on by default
help: try this
| let key = self.headers.find_first("Sec-WebSocket-Key").ok_or_else(|| Error::Protocol("Missing Sec-WebSocket-Key".into()))?;
= help: for further information visit https://github.com/Manishearth/rust-clippy/wiki#or_fun_call
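The lint exists because the argument to `ok_or` is evaluated eagerly:
the `Error::Protocol(...)` value is constructed even when the header is
present. `ok_or_else` takes a closure that only runs on the `None` path,
so the error value is built only when it is actually needed.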
This uses some unsafe code and assumptions about alignment to speed up
apply_mask() further.
In profiles, this optimization reduces apply_mask()'s share of the
total runtime from ~55% to ~7%.
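A sketch of the idea, not necessarily the code in this commit: mask the
unaligned head and tail byte by byte, and XOR the aligned middle four
bytes at a time as u32 words.

    fn apply_mask_fast32(buf: &mut [u8], mask: [u8; 4]) {
        // Split into an unaligned prefix, aligned u32 words, and a tail.
        // align_to_mut() is unsafe only because callers must not rely on
        // a particular split; reinterpreting u8 as u32 is sound here.
        let (prefix, words, suffix) = unsafe { buf.align_to_mut::<u32>() };

        // The prefix starts at payload offset 0, so its mask phase is 0.
        for (i, byte) in prefix.iter_mut().enumerate() {
            *byte ^= mask[i & 3];
        }

        // Rotate the mask word so it lines up with where the prefix ended
        // (the prefix is at most 3 bytes for a 4-byte-aligned type).
        let head = prefix.len() as u32;
        let mask_word = u32::from_ne_bytes(mask);
        let mask_word = if cfg!(target_endian = "little") {
            mask_word.rotate_right(8 * head)
        } else {
            mask_word.rotate_left(8 * head)
        };
        for word in words.iter_mut() {
            *word ^= mask_word;
        }

        // The tail continues at the same mask phase as the prefix end,
        // since the word section covers a whole number of 4-byte groups.
        let offset = prefix.len() & 3;
        for (i, byte) in suffix.iter_mut().enumerate() {
            *byte ^= mask[(offset + i) & 3];
        }
    }

A production version would typically keep the plain per-byte loop as a
fallback for very short payloads, where the setup cost dominates.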
This function is the most speed-critical in the library. In profiles,
this optimization reduces its share of the profile from ~75% to ~55%.
I have tried several approaches but didn't manage to improve on this
one (LLVM already unrolls the loop here), though I'm sure it is possible.
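The commit does not show the code, but the loop in question is
presumably a cyclic XOR of roughly this shape, which LLVM can unroll
and auto-vectorize on its own:

    fn apply_mask(buf: &mut [u8], mask: [u8; 4]) {
        // XOR every payload byte with the masking key, cycling through
        // the four key bytes.
        for (i, byte) in buf.iter_mut().enumerate() {
            *byte ^= mask[i & 3];
        }
    }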