freebsd-nq/sys/amd64
Peter Wemm f001eabf3a First pass at (possibly futile) microoptimizing of cpu_switch. Results
are mixed.  Some pure context switch microbenchmarks show up to 29%
improvement.  Pipe based context switch microbenchmarks show up to 7%
improvement.  Real world tests are far less impressive as they are
dominated more by actual work than switch overheads, but depending on
the machine in question, workload, kernel options, phase of moon, etc, a
few percent gain might be seen.

Summary of changes:
- don't reload MSR_[FG]SBASE registers when context switching between
  non-threaded userland apps.  These typically cost 120 clock cycles each
  on an AMD cpu (less on Barcelona/Phenom).  Intel cores are probably no
  faster on this.
- The above change only helps unthreaded userland apps that tend to use
  the same value for gsbase.  Threaded apps will get no benefit from this.
- reorder things like accessing the pcb to be in memory order, to give
  prefetching a better chance of working.  Operations are now in increasing
  memory address order, rather than reverse or random.
- Push some lesser used code out of the main code paths.  Hopefully
  allowing better code density in cache lines.  This is probably futile.
- (part 2 of previous item) Reorder code so that branches have a more
  realistic static branch prediction hint.  Both Intel and AMD cpus
  default to predicting branches to lower memory addresses as being
  taken, and to higher memory addresses as not being taken.  This is
  overridden by the limited dynamic branch prediction subsystem.  A trip
  through userland might overflow this.
- Futule attempt at spreading the use of the results of previous operations
  in new operations.  Hopefully this will allow the cpus to execute in
  parallel better.
- stop wasting 16 bytes at the top of kernel stack, below the PCB.
- Never load the userland fs/gsbase registers for kthreads, but preserve
  curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)

Microbenchmarking this code seems to be really sensitive to things like
scheduling luck, timing, cache behavior, tlb behavior, kernel options,
other random code changes, etc.

While it doesn't help heavy userland workloads much, it does help high
context switch loads a little, and should help those that involve
switching via kthreads a bit more.

A special thanks to Kris for the testing and reality checks, and Jeff for
tormenting me into doing this. :)

This is still work-in-progress.
2008-03-23 23:09:06 +00:00
..
acpica In keeping with style(9)'s recommendations on macros, use a ';' 2008-03-16 10:58:09 +00:00
amd64 First pass at (possibly futile) microoptimizing of cpu_switch. Results 2008-03-23 23:09:06 +00:00
compile
conf Remove kernel support for M:N threading. 2008-03-12 10:12:01 +00:00
ia32 Protect the setting of the fsbase/gsbase MSR registers and the 2008-03-23 22:44:56 +00:00
include Move pcb_flags to make trivially better use of cache lines. 2008-03-23 22:45:51 +00:00
isa Explicitly use spinlock_enter/exit rather than locking the icu_lock spin 2008-03-20 21:53:27 +00:00
linux32 Regen. 2008-03-16 16:29:37 +00:00
pci Adjust the code to probe for the PCI config mechanism to use. 2007-11-28 22:20:08 +00:00
Makefile