freebsd-dev

d/freebsd-dev

Fork 0

Commit Graph

Author	SHA1	Message	Date
Conrad Meyer	179f62805c	random(4): Fortuna: allow increased concurrency Add experimental feature to increase concurrency in Fortuna. As this diverges slightly from canonical Fortuna, and due to the security sensitivity of random(4), it is off by default. To enable it, set the tunable kern.random.fortuna.concurrent_read="1". The rest of this commit message describes the behavior when enabled. Readers continue to update shared Fortuna state under global mutex, as they do in the status quo implementation of the algorithm, but shift the actual PRF generation out from under the global lock. This massively reduces the CPU time readers spend holding the global lock, allowing for increased concurrency on SMP systems and less bullying of the harvestq kthread. It is somewhat of a deviation from FS&K. I think the primary difference is that the specific sequence of AES keys will differ if READ_RANDOM_UIO is accessed concurrently (as the 2nd thread to take the mutex will no longer receive a key derived from rekeying the first thread). However, I believe the goals of rekeying AES are maintained: trivially, we continue to rekey every 1MB for the statistical property; and each consumer gets a forward-secret, independent AES key for their PRF. Since Chacha doesn't need to rekey for sequences of any length, this change makes no difference to the sequence of Chacha keys and PRF generated when Chacha is used in place of AES. On a GENERIC 4-thread VM (so, INVARIANTS/WITNESS, numbers not necessarily representative), 3x concurrent AES performance jumped from ~55 MiB/s per thread to ~197 MB/s per thread. Concurrent Chacha20 at 3 threads went from roughly ~113 MB/s per thread to ~430 MB/s per thread. Prior to this change, the system was extremely unresponsive with 3-4 concurrent random readers; each thread had high variance in latency and throughput, depending on who got lucky and won the lock. "rand_harvestq" thread CPU use was high (double digits), seemingly due to spinning on the global lock. After the change, concurrent random readers and the system in general are much more responsive, and rand_harvestq CPU use dropped to basically zero. Tests are added to the devrandom suite to ensure the uint128_add64 primitive utilized by unlocked read functions to specification. Reviewed by: markm Approved by: secteam(delphij) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20313	2019-06-17 20:29:13 +00:00
Conrad Meyer	d0d71d818c	random(4): Generalize algorithm-independent APIs At a basic level, remove assumptions about the underlying algorithm (such as output block size and reseeding requirements) from the algorithm-independent logic in randomdev.c. Chacha20 does not have many of the restrictions that AES-ICM does as a PRF (Pseudo-Random Function), because it has a cipher block size of 512 bits. The motivation is that by generalizing the API, Chacha is not penalized by the limitations of AES. In READ_RANDOM_UIO, first attempt to NOWAIT allocate a large enough buffer for the entire user request, or the maximal input we'll accept between signal checking, whichever is smaller. The idea is that the implementation of any randomdev algorithm is then free to divide up large requests in whatever fashion it sees fit. As part of this, two responsibilities from the "algorithm-generic" randomdev code are pushed down into the Fortuna ra_read implementation (and any other future or out-of-tree ra_read implementations): 1. If an algorithm needs to rekey every N bytes, it is responsible for handling that in ra_read(). (I.e., Fortuna's 1MB rekey interval for AES block generation.) 2. If an algorithm uses a block cipher that doesn't tolerate partial-block requests (again, e.g., AES), it is also responsible for handling that in ra_read(). Several APIs are changed from u_int buffer length to the more canonical size_t. Several APIs are changed from taking a blockcount to a bytecount, to permit PRFs like Chacha20 to directly generate quantities of output that are not multiples of RANDOM_BLOCKSIZE (AES block size). The Fortuna algorithm is changed to NOT rekey every 1MiB when in Chacha20 mode (kern.random.use_chacha20_cipher="1"). This is explicitly supported by the math in FS&K §9.4 (Ferguson, Schneier, and Kohno; "Cryptography Engineering"), as well as by their conclusion: "If we had a block cipher with a 256-bit [or greater] block size, then the collisions would not have been an issue at all." For now, continue to break up reads into PAGE_SIZE chunks, as they were before. So, no functional change, mostly. Reviewed by: markm Approved by: secteam(delphij) Differential Revision: https://reviews.freebsd.org/D20312	2019-06-17 15:09:12 +00:00
Conrad Meyer	403c041316	random(4): Add regression tests for uint128 implementation, Chacha CTR Add some basic regression tests to verify behavior of both uint128 implementations at typical boundary conditions, to run on all architectures. Test uint128 increment behavior of Chacha in keystream mode, as used by 'kern.random.use_chacha20_cipher=1' (r344913) to verify assumptions at edge cases. These assumptions are critical to the safety of using Chacha as a PRF in Fortuna (as implemented). (Chacha's use in arc4random is safe regardless of these tests, as it is limited to far less than 4 billion blocks of output in that API.) Reviewed by: markm Approved by: secteam(gordon) Differential Revision: https://reviews.freebsd.org/D20392	2019-06-17 14:59:45 +00:00

Author

SHA1

Message

Date

Conrad Meyer

179f62805c

random(4): Fortuna: allow increased concurrency

Add experimental feature to increase concurrency in Fortuna.  As this
diverges slightly from canonical Fortuna, and due to the security
sensitivity of random(4), it is off by default.  To enable it, set the
tunable kern.random.fortuna.concurrent_read="1".  The rest of this commit
message describes the behavior when enabled.

Readers continue to update shared Fortuna state under global mutex, as they
do in the status quo implementation of the algorithm, but shift the actual
PRF generation out from under the global lock.  This massively reduces the
CPU time readers spend holding the global lock, allowing for increased
concurrency on SMP systems and less bullying of the harvestq kthread.

It is somewhat of a deviation from FS&K.  I think the primary difference is
that the specific sequence of AES keys will differ if READ_RANDOM_UIO is
accessed concurrently (as the 2nd thread to take the mutex will no longer
receive a key derived from rekeying the first thread).  However, I believe
the goals of rekeying AES are maintained: trivially, we continue to rekey
every 1MB for the statistical property; and each consumer gets a
forward-secret, independent AES key for their PRF.

Since Chacha doesn't need to rekey for sequences of any length, this change
makes no difference to the sequence of Chacha keys and PRF generated when
Chacha is used in place of AES.

On a GENERIC 4-thread VM (so, INVARIANTS/WITNESS, numbers not necessarily
representative), 3x concurrent AES performance jumped from ~55 MiB/s per
thread to ~197 MB/s per thread.  Concurrent Chacha20 at 3 threads went from
roughly ~113 MB/s per thread to ~430 MB/s per thread.

Prior to this change, the system was extremely unresponsive with 3-4
concurrent random readers; each thread had high variance in latency and
throughput, depending on who got lucky and won the lock.  "rand_harvestq"
thread CPU use was high (double digits), seemingly due to spinning on the
global lock.

After the change, concurrent random readers and the system in general are
much more responsive, and rand_harvestq CPU use dropped to basically zero.

Tests are added to the devrandom suite to ensure the uint128_add64 primitive
utilized by unlocked read functions to specification.

Reviewed by:	markm
Approved by:	secteam(delphij)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D20313

2019-06-17 20:29:13 +00:00

Conrad Meyer

d0d71d818c

random(4): Generalize algorithm-independent APIs

At a basic level, remove assumptions about the underlying algorithm (such as
output block size and reseeding requirements) from the algorithm-independent
logic in randomdev.c.  Chacha20 does not have many of the restrictions that
AES-ICM does as a PRF (Pseudo-Random Function), because it has a cipher
block size of 512 bits.  The motivation is that by generalizing the API,
Chacha is not penalized by the limitations of AES.

In READ_RANDOM_UIO, first attempt to NOWAIT allocate a large enough buffer
for the entire user request, or the maximal input we'll accept between
signal checking, whichever is smaller.  The idea is that the implementation
of any randomdev algorithm is then free to divide up large requests in
whatever fashion it sees fit.

As part of this, two responsibilities from the "algorithm-generic" randomdev
code are pushed down into the Fortuna ra_read implementation (and any other
future or out-of-tree ra_read implementations):

  1. If an algorithm needs to rekey every N bytes, it is responsible for
  handling that in ra_read(). (I.e., Fortuna's 1MB rekey interval for AES
  block generation.)

  2. If an algorithm uses a block cipher that doesn't tolerate partial-block
  requests (again, e.g., AES), it is also responsible for handling that in
  ra_read().

Several APIs are changed from u_int buffer length to the more canonical
size_t.  Several APIs are changed from taking a blockcount to a bytecount,
to permit PRFs like Chacha20 to directly generate quantities of output that
are not multiples of RANDOM_BLOCKSIZE (AES block size).

The Fortuna algorithm is changed to NOT rekey every 1MiB when in Chacha20
mode (kern.random.use_chacha20_cipher="1").  This is explicitly supported by
the math in FS&K §9.4 (Ferguson, Schneier, and Kohno; "Cryptography
Engineering"), as well as by their conclusion: "If we had a block cipher
with a 256-bit [or greater] block size, then the collisions would not
have been an issue at all."

For now, continue to break up reads into PAGE_SIZE chunks, as they were
before.  So, no functional change, mostly.

Reviewed by:	markm
Approved by:	secteam(delphij)
Differential Revision:	https://reviews.freebsd.org/D20312

2019-06-17 15:09:12 +00:00

Conrad Meyer

403c041316

random(4): Add regression tests for uint128 implementation, Chacha CTR

Add some basic regression tests to verify behavior of both uint128
implementations at typical boundary conditions, to run on all architectures.

Test uint128 increment behavior of Chacha in keystream mode, as used by
'kern.random.use_chacha20_cipher=1' (r344913) to verify assumptions at edge
cases.  These assumptions are critical to the safety of using Chacha as a
PRF in Fortuna (as implemented).

(Chacha's use in arc4random is safe regardless of these tests, as it is
limited to far less than 4 billion blocks of output in that API.)

Reviewed by:	markm
Approved by:	secteam(gordon)
Differential Revision:	https://reviews.freebsd.org/D20392

2019-06-17 14:59:45 +00:00

3 Commits