old/broken hardware. Unfortunately, it adds cache pressure and possible
mispredicted branches to the fast path of the bus_dmamap_load collection of
functions. Since it's meant for slow path exception processing, de-inline
it and allow its conditions to be pre-computed at tag_create time and thus
short-circuited at runtime.
While here, cut down on the size of _bus_dmamap_load_buffer() by pushing the
bounce page logic into a non-inlined function. Again, this helps with
cache pressure and mispredicted branches.
According to the TSC, this shaves off a few cycles on average. Unfortunately,
the data varies quite a bit due to interrupts and preemption, so it's hard to
get a good measurement. Real world measurements of network PPS are welcomed.
A merge to amd64 and other arches is pending more testing.