API function 'vie_calculate_gla()'.
While the current implementation is simplistic it forms the basis of doing
segmentation checks if the guest is in 32-bit protected mode.
of the guest linear address space. These APIs in turn use a new ioctl
'VM_GLA2GPA' to convert the guest linear address to guest physical.
Use the new copyin/copyout APIs when emulating ins/outs instruction in
bhyve(8).
taskqueue worker thread(s) to.
For now it isn't a taskqueue/taskthread error to fail to pin
to the given cpuid.
Thanks to rpaulo@, kib@ and jhb@ for feedback.
Tested:
* igb(4), with local RSS patches to pin taskqueues.
TODO:
* ask the doc team for help in documenting the new API call.
* add a taskqueue_start_threads_cpuset() method which takes
a cpuset_t - but this may require a bunch of surgery to
bring cpuset_t into scope.
'struct vm_guest_paging'.
Check for canonical addressing in vmm_gla2gpa() and inject a protection
fault into the guest if a violation is detected.
If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to
point to the root of the page tables.
indicate the faulting linear address.
If the guest PML4 entry has the PG_PS bit set then inject a page fault into
the guest with the PGEX_RSV bit set in the error_code.
Get rid of redundant checks for the PG_RW violations when walking the page
tables.
memory ordering model allows writes to different devices to complete out
of order, leading to a situation where the write that clears an interrupt
source at a device can complete after a write that unmasks and EOIs the
interrupt at the interrupt controller, leading to a spurious re-interrupt.
This adds a generic barrier function specific to the needs of interrupt
controllers, and calls that function from the GIC and TI AINTC controllers.
There may still be other soc-specific controllers that need to make the call.
Reviewed by: cognet, Svatopluk Kraus <onwahe@gmail.com>
MFC after: 3 days
Idle priority is not even time-share, so if system is busy in any way,
those events may never be executed. Since in some cases system waits
for events processed by that thread, that may cause deadlocks.
to be consistent with mutex destruction in ipf_log_soft_destroy(). As a
result mutex destruction in ipf_log_soft_fini() is redundant.
Approved by: glebius (mentor)
Obtained from: darrenr (author)
the kmem object lock is held. Do the pmap_remove() before acquiring the
kmem object lock.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
The CUSE library is a wrapper for the devfs kernel functionality which
is exposed through /dev/cuse . In order to function the CUSE kernel
code must either be enabled in the kernel configuration file or loaded
separately as a module. Currently none of the committed items are
connected to the default builds, except for installing the needed
header files. The CUSE code will be connected to the default world and
kernel builds in a follow-up commit.
The CUSE module was written by Hans Petter Selasky, somewhat inspired
by similar functionality found in FUSE. The CUSE library can be used
for many purposes. Currently CUSE is used when running Linux kernel
drivers in user-space, which need to create a character device node to
communicate with its applications. CUSE has full support for almost
all devfs functionality found in the kernel:
- kevents
- read
- write
- ioctl
- poll
- open
- close
- mmap
- private per file handle data
Requested by several people. Also see "multimedia/cuse4bsd-kmod" in
ports.
the UART FIFO.
The emulation is constrained in a number of ways: 64-bit only, doesn't check
for all exception conditions, limited to i/o ports emulated in userspace.
Some of these constraints will be relaxed in followup commits.
Requested by: grehan
Reviewed by: tychon (partially and a much earlier version)
the proper ICWx initialization sequence. It assumes, probably correctly, that
the boot firmware has done the 8259 initialization.
Since grub-bhyve does not initialize the 8259 this write to the mask register
takes a code path in which 'error' remains uninitialized (ready=0,icw_num=0).
Fix this by initializing 'error' at the start of the function.
shared flag is set on normal-memory mappings made via pmap_kenter() for SMP.
The "shared flag" part of this change isn't obvious from the diff, here's
the deal... by using the array of preformatted page table entry templates
instead of constructing the PTE from scratch, we automatically get the
right attribute bits set for both caching and shared.
MFC after: 1 week
These masks are documented in the Intel Architecture Instruction Set
Extensions Programming Reference (March 2014).
Reviewed by: kib
MFC after: 1 month
"fatal firmware error" happens. Previously it was neccessary to reset
it manually, using "/etc/rc.d/netif restart".
Approved by: adrian@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
"fatal firmware error" happens. Previously it was neccessary to reset
it manually, using "/etc/rc.d/netif restart".
Approved by: adrian@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
direction isochronous transfers.
- Remove setting of fields which does not belong to the respective
TRBs. These fields are currently set as zero and this is more a
cosmetic change.
MFC after: 3 days
Submitted by: Horse Ma <HMa@wyse.com>
Quite often it can be just packet reorder, and killing link in such case
is inconvenient. Add few sysctl's to control that behavior.
PR: kern/182212
Submitted by: Eugene Grosbein <egrosbein@rdtc.ru>
MFC after: 2 weeks
- Make sure TX/RX lists don't leak and are only allocated once.
- Fix off-by one transfer index computation.
- Give firmware loading more time.
MFC after: 3 days
be a race when using a single active queue for all transmit types.
- Last argument of usb_pause_mtx() is ticks and not milliseconds.
- Remove unused watchdog.
- Remove some unused fields from the RSU softc structure.
- Workaround usbd_transfer_start() recursion from inside of completion
callback.
MFC after: 3 days
- Need to set the pre-fetch memory address when reading the host memory.
- We currently assume that no endianness conversion is needed.
Sponsored by: DARPA, AFRL
Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop(). This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.
Reviewed by: kib
Approved by: hrs (mentor)
boundary. This was addressed several years ago by creating a parent
tag hierarchy for the root buses that set the boundary restriction
for appropriate buses and allowed child deviced to inherit it.
Somewhere along the way, this restriction was turned into a case for
marking the tag as a candidate for needing bounce buffers, instead
of just splitting the segment along the boundary line. This flag
also causes all maps associated with this tag to be non-NULL, which
in turn causes bus_dmamap_sync() to take the slow path of function
pointer indirection to discover that there's no bouncing work to
do. The end result is a lot of pages set aside in bounce pools
that will never be used, and a slow path for data buffers in nearly
every DMA-capable PCIe device. For example, our workload at Netflix
was spending nearly 1% of all CPU time going through this slow path.
Fix this problem by being more selective about when to set the
COULD_BOUNCE flag. Only set it when the boundary restriction
exists and the consumer cannot do more than a single DMA segment
at once. This fixes the case of dynamic buffers (mbufs, bio's)
but doesn't address static buffers allocated from bus_dmamem_alloc().
That case will be addressed in the future.
For those interested, this was discovered thanks to Dtrace Flame
Graphs.
Discussed with: jhb, kib
Obtained from: Netflix, Inc.
MFC after: 3 days
Set the accessed and dirty bits in the page table entry. If it fails then
restart the page table walk from the beginning. This might happen if another
vcpu modifies the page tables simultaneously.
Reviewed by: alc, kib
ismt(4) supports the SMBus Message Transport controller found on Intel
C2000 series (Avoton) and S1200 series (Briarwood) Atom SoCs.
Sponsored by: Intel
on execve(2), it calls vmspace_exec(), which frees the current
vmspace. The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process. The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.
If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec. Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.
To fix, postpone the free of old vmspace until the threads are resumed
and exited. To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.
http://bugs.debian.org/743141
Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
birthday-paradox style address collisions when
bhyve VMs are connected to the same broadcoast
domain and are using pseudo-random allocations.
Reviewed by: gnn
MFC after: 1 week
g_orphan_register and g_resize_provider_event. Both are called from the
event queue. Also we have GEOM_DEV class, which does deferred destroy
for its consumers via g_dev_destroy (also called from the event queue).
So it is possible, that for some consumers an orphan method will be
called twice. This triggers panic in g_dev_orphan.
Check that consumer isn't already orphaned before call orphan method.
MFC after: 2 weeks
Replace with the existing loop termination test with a similar
condition from the nested "if" that may terminate the loop a bit
sooner, but still not too early. This condition can then be removed
from the nested "if". Relocate an operator to be style(9) compliant.
MFC after: 3 days
to a guest physical address.
PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now
checked only in non-terminal paging entries.
Ignore the upper 32-bits of the CR3 for PAE paging.
Intel 40G Ethernet Controller XL710 Family. This is
the core driver, a VF driver called i40evf, will be
following soon. Questions or comments to myself or
my co-developer Eric Joyner. Cheers!
lookup for the inp flowid/flowtype to destination CPU.
This only modifies the case where RSS is enabled and the per-cpu tcp
timer option is enabled. Otherwise the behaviour should be the same
as before.
This is intended to be used by various places that wish to hash some
information about a TCP/UDP/IP flow but don't necessarily have a
live mbuf to do it with.
Refactor rss_m2cpuid() to use the refactored function.
- Implement support for interrupt filters in the DWC OTG driver, to
reduce the amount of CPU task switching when only feeding the FIFOs.
- Add common spinlock to the USB bus structure.
MFC after: 2 weeks
- inserting frame enter/leave sequences
- restructuring the vmx_enter_guest routine so that it subsumes
the vm_exit_guest block, which was the #vmexit RIP and not a
callable routine.
Reviewed by: neel
MFC after: 3 weeks
src/sys and the rest of the tree for builds.
o eliminate including bsd.mkopts.mk for the moment in kern.opts.mk
o No need to include src.opts.mk at all anymore. The reasons for it
are now coverted in sys.mk and src.sys.mk.
Add `flags` u16 field to the hole in ipfw_table_xentry structure.
Kernel has been guessing address family for supplied record based
on xent length size.
Userland, however, has been getting fixed-size ipfw_table_xentry structures
guessing address family by checking address by IN6_IS_ADDR_V4COMPAT().
Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records.
PR: bin/189471
Submitted by: Dennis Yusupoff <dyr@smartspb.net>
MFC after: 2 weeks
!(PFRULE_FRAGCROP|PFRULE_FRAGDROP) case.
o In the (PFRULE_FRAGCROP|PFRULE_FRAGDROP) case we should allocate mtag
if we don't find any.
Tested by: Ian FREISLICH <ianf cloudseed.co.za>
platform code, it is expected these will be merged in the future when the
ARM code is more complete.
Until more boards can be tested only use this with the Raspberry Pi and
rrename the functions on the other SoCs.
Reviewed by: ian@
it doesn't leak through when the command structure is reused for a user
command without a data buffer.
PR: amd64/189668
Tested by: Pete Long <pete@nrth.org>
MFC after: 1 week
near-term future use.
These are intended to fetch the current flow id, flow hash type
(M_HASHTYPE_* from the sys/mbuf.h) and if RSS is enabled, the
RSS destined CPU ID for the receive path.
XSAVE Extended Features for AVX512 and MPX (Memory Protection Extensions).
Obtained from: Intel's Instruction Set Extensions Programming Reference
(March 2014)
possible and do not clear IN6_IFF_TENTATIVE. If IFDISABLED was accidentally
set after a DAD started, TENTATIVE could be cleared because no NA was
received due to IFDISABLED, and as a result it could prevent DAD when
manually clearing IFDISABLED after that.
eight years. The original concept was to improve the
corner case where you run out of ephemeral ports, but it
was causing performance problems and the mechanism
of limiting the number of time_wait sockets serves
the same purpose in the end.
Reviewed by: bz
the legacy 8259A PICs.
- Implement an ICH-comptabile PCI interrupt router on the lpc device with
8 steerable pins configured via config space access to byte-wide
registers at 0x60-63 and 0x68-6b.
- For each configured PCI INTx interrupt, route it to both an I/O APIC
pin and a PCI interrupt router pin. When a PCI INTx interrupt is
asserted, ensure that both pins are asserted.
- Provide an initial routing of PCI interrupt router (PIRQ) pins to
8259A pins (ISA IRQs) and initialize the interrupt line config register
for the corresponding PCI function with the ISA IRQ as this matches
existing hardware.
- Add a global _PIC method for OSPM to select the desired interrupt routing
configuration.
- Update the _PRT methods for PCI bridges to provide both APIC and legacy
PRT tables and return the appropriate table based on the configured
routing configuration. Note that if the lpc device is not configured, no
routing information is provided.
- When the lpc device is enabled, provide ACPI PCI link devices corresponding
to each PIRQ pin.
- Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A
pins via the ELCR.
- Mark the power management SCI as level triggered.
- Don't hardcode the number of elements in Packages in the source for
the DSDT. iasl(8) will fill in the actual number of elements, and
this makes it simpler to generate a Package with a variable number of
elements.
Reviewed by: tycho
This includes decodes of recent Intel instructions, in particular
VT-x and related instructions. This allows the FBT provider to
locate the exit points of routines that include these new
instructions.
Illumos issues:
3414 Need a new word of AT_SUN_HWCAP bits
3415 Add isainfo support for f16c and rdrand
3416 Need disassembler support for rdrand and f16c
3413 isainfo -v overflows 80 columns
3417 mdb disassembler confuses rdtscp for invlpg
1518 dis should support AMD SVM/AMD-V/Pacifica instructions
1096 i386 disassembler should understand complex nops
1362 add kvmstat for monitoring of KVM statistics
1363 add vmregs[] variable to DTrace
1364 need disassembler support for VMX instructions
1365 mdb needs 16-bit disassembler support
This corresponds to Illumos-gate (github) version
eb23829ff08a873c612ac45d191d559394b4b408
Reviewed by: markj
MFC after: 1 week
with all bits set to 1 beyond the I/O permission bitmap.
Prior to this change accessing I/O ports [0xFFF8-0xFFFF] would trigger a
#GP fault even though the I/O bitmap allowed access to those ports.
For more details see section "I/O Permission Bit Map" in the Intel SDM, Vol 1.
Reviewed by: kib
Here, "suitably endowed" means that the System Control Coprocessor
(#15) has Performance Monitoring Registers, including a CCNT (Cycle
Count) register.
The CCNT register is used in a way similar to the TSC register in
x86 processors by the get_cyclecount(9) function. The entropy-harvesting
thread is a heavy user of this function, and will benefit from not
having to call binuptime(9) instead.
One problem with the CCNT register is that it is 32-bit only, so
the upper 32-bits of the returned number are always 0. The entropy
harvester does not care, but in case any one else does, follow-up
work may include an interrup trap to increment an upper-32-bit
counter on CCNT overflow.
Another problem is that the CCNT register is not readable in user-mode
code; in can be made readable by userland, but then it is also
writable, and so is a good chunk of the PMU system. For that reason,
the CCNT is not enabled for user-mode access in this commit.
Like the x86, there is one CCNT per core, so they don't all run in
perfect sync.
Reviewed by: ian@ (an earlier version)
Tested by: ian@ (same earlier version)
Committed from: WANDBOARD-QUAD
audio device driver is detached first and not its children. This fixes
a panic in some cases when unloading "snd_uaudio" while a USB device
is plugged. The linking order affects the order in which the module
dependencies are registered.
MFC after: 1 week
parent and child processes. Previously, we copied these pages even though
they are read only. However, the reason for copying them is historical and
no longer exists. In recent times, vm_map_protect() has developed the
ability to copy pages when write access is added to wired copy-on-write
pages. So, in this case, copy-on-write sharing of wired pages is not to be
feared. It is not going to lead to copy-on-write faults on wired memory.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Previously only TX IP checksum offloading was disabled but it's
reported that TX checksum offloading for UDP datagrams with IP
options also generates corrupted frames. Reporter's controller is
RTL8168CP but I guess RTL8168C also have the same issue since it
shall share the same core.
Reported and tested by: tuexen
that and the need to be in a critical section when switching to idleclock
mode for event timers, use spinlock_enter()/exit() to achieve both needs.
The ARM WFI (wait for interrupt) instruction blocks until an interrupt is
asserted, and it will unblock even if interrupts are masked, and it will
unblock immediately if an interrupt is already pending. It is necessary
to execute it with interrupts disabled, otherwise the interrupt that
should unblock it may occur and be serviced just prior to executing the
instruction. At that point the system is inappropriately asleep until
the next timer tick or some other random interrupt happens.
In general, interrupts need to be disabled continuously from the time the
decision is made that there is no work to be done and sleeping is needed
until actually going to sleep, to avoid a race where handling a new
interrupt changes the basis for deciding there is no work to be done.
Submitted by: hps@ (in slightly different form)
1. Make sure IPI mask is set before sending the IPI
2. Operate atomically on PS3 PIC outstanding interrupt list
3. Make sure IPIs are EOI'ed before, not after, processing. Without this,
a second IPI could be sent partway through processing the first one,
get erroneously acknowledge by the EOI to the first, and be lost. In
particular in the case of smp_rendezvous(), this can be fatal.
In combination, this makes the PS3 boot SMP again. It probably also fixes
some latent bugs elsewhere.
MFC after: 2 weeks
does an out-of-tree build without setting MAKESYSPATH) and recently
added requirements (JIRA's building the modules in a non-standard
layout). So, when MAKESYSPATH is defined, trust that it will do the
right thing (to catch the JIRA use case). When it isn't defined,
assume a standard FreeBSD tree and reach over to grab bsd.mkopt.mk (to
fix the /usr/ports use case). Both camps cannot be appeased otherwise,
so we have this kludge until it can be sorted out.
If the underlying protocol reported an error (e.g. because a connection was
closed while waiting in the queue), this error was also indicated by
returning a zero-length address. For all other kinds of errors (e.g.
[EAGAIN], [ENFILE], [EMFILE]), *addrlen is unmodified and there are
successful cases where a zero-length address is returned (e.g. a connection
from an unbound Unix-domain socket), so this error indication is not
reliable.
As reported in Austin Group bug #836, modifying *addrlen on error may cause
subtle bugs if applications retry the call without resetting *addrlen.
avoid soft page faults when adding write access to user wired entries in
vm_map_protect(). Previously, we only avoided the soft page fault when
the underlying pages were copy-on-write. In other words, we avoided the
pages faults that might sleep on page allocation, but not the trivial
page faults to update the physical map.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
multiuser again (this commit comes from the PS3 itself). Some problems
still exist with SMP, apparently, as I had to boot a non-SMP kernel to
get here.
or __POSIX_VISIBLE.
Whenever <sys/cdefs.h> sets __BSD_VISIBLE to non-zero, it also sets
__POSIX_VISIBLE and __XSI_VISIBLE to the newest version supported.
No functional change is intended.
running at, guess the nearest value instead of looking for a value within
25 MHz of the observed frequency.
Prior to this change, if a system booted with Intel Turbo Boost enabled,
the dev.cpu.0.freq sysctl is nonfunctional, since the ACPI-reported
frequency for Turbo Boost states does not match the actual clock frequency
(and thus no levels are within 25 MHz of the observed frequency) and the
current performance level is read before a new level is set.
MFC after: 3 days
Relnotes: Bug fix in power management on CPUs with Intel Turbo Boost
the main processing queue, clear the NAK counter for any associated
BULK or CONTROL transfers and poll the endpoint(s) for 1 millisecond
at 125us rate interval, before going into slow, 10ms, NAK polling mode
again. This has the effect that typical ping-ping protocols respond
quicker when initiated from the USB host.
MFC after: 2 weeks
GENERIC64 for PowerPC to use vt with it.
Much to my chagrin, PS3 support seems to have bitrotted somewhat since the
last time I tried it. ehci panics on attach and interrupt handling seems
to be faulty. This should be fixed soon...
On modern ARM SoCs the L2 cache controller sits between the CPU and the
AXI bus, and most on-chip memory-mapped devices are on the AXI bus. We
map the device registers using the 'Device' memory attribute, which means
the memory is not cached, but writes to it are buffered. Ensuring that a
write has made it all the way to a device may require that the L2
controller take some action.
There is currently only one implementation of the new function, for the
PL310 cache controller. It invokes a function that the controller
manual calls "cache sync" but it actually has nothing to do with cache at
all, it triggers a drain of all pending store buffer writes and it blocks
until they complete.
The sheeva and xscale L2 controllers (which predate the concept of Device
memory) don't seem to have a corresponding function. It appears that the
standard armv5 drain_writebuf function includes draining all the way
through the L2 controller.
The last obstacle to switching PowerPC entirely to vt is that the Playstation 3
framebuffer driver needs to be ported over. This only applies for powerpc64,
however.
on my G4 iBook by more than half. Still 10% slower than syscons, but that's
much better than a factor of 2.
The slowness had to do with pathological write performance on 8-bit
framebuffers, which are almost universally used on Open Firmware systems.
Writing 1 byte at a time, potentially nonconsecutively, resulted in many
extra PCI write cycles. This patch, in the common case where it's writing
one or several characters in an 8x8 font, gangs the writes together into
a set of 32-bit writes. This is a port of r143830 to vt(4).
The EFI framebuffer is also extremely slow, probably for the same reason,
and the same patch will likely help there.
On armv4 these are defined as synonyms right now, but it's a bit ambiguous
what NOCACHE means (is buffering/write-combining also enabled or not?); this
is a first step towards replacing PTE_NOCACHE with a less ambiguous name.
need COW and is writeable (i.e. becoming writeable due to the
mprotect(2) operation), do not create a new backing object for the
entry. The caller of the function is vm_map_protect(), the call is
made to ensure that wired entry has all pages resident and wired in
the top level object and to enable the write. We might need to copy
read-only page from some backing objects into the top object or remap
the page with the write allowed.
This fixes the issue with mishandling of the swap accounting when
read-only wired mapping is upgraded to write-enabled after fork. The
previous code path did not accounted the new object, but it creation
is redundand anyway and the change provides an optimization for the
non-common situation.
Reported by: markj
Suggested and reviewed by: alc (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
o KMODDEPS warning is 15 years stale. Remove it.
o MK_CTF will always be defined now, so no need to test to see if it
is defined.
o no need to define MK_FORMAT_EXTENTIONS if undefined anymore.
page queues for the backing objects. The queues are huge and clutter
the display, when mostly the map entries and its backing storage is
interesting.
The page queues can be seen with ddb 'show object' command.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the GPIO pin is connected to a push button (or other devices).
Instead keep the boot loader settings.
Calling ar71xx_gpio_pin_configure() with DEFAULT_CAPS was probably a
mistake and was causing all the pins to be set as outputs.
at that time, but AFAIK it is only used on routerboards.
Enabling GPIO_FUNC_SPI_CS[1|2]_EN will claim the use of gpio pins 0 and 1
respectivelly for use as SPI CS pins.
When really needed, this can still be enabled on kernel hints using the
function_set and function_clear knobs.
This driver supports the low and high precision models (9 and 11 bits) and
it will auto-detect the both variants.
The driver expose the temperature registers (actual temperature, shutdown
and hysteresys temperature) and also the configuration register.
It was tested on FDT systems: RPi, BBB and on non-FDT systems: AR71xx, with
both, hardware i2c controllers (when available) and gpioiic(4).
This provides a simple and cheap way for verifying the i2c bus on embedded
systems.
- For non-periodic traffic we only need to wait two SOFs before
disabling the channel.
- Make sure we release the TX FIFO tracking level after the host
channel is disabled.
- Make sure the host channel state gets reset/disabled initially.
- Two minor code style changes.
MFC after: 2 weeks
Under enough load, the swi's can actually be preempted and migrated
to other currently free cores. When doing RSS experiments, this lead
to the per-CPU TCP timers not lining up any more with the RX CPU said
flows were ending up on, leading to increased lock contention.
Since there was a little pushback on flipping them on by default,
I've left the default at "don't pin."
The other less obvious problem here is that the default swi
is also the same as the destination swi for CPU #0. So if one
pins the swi on CPU #0, there's no default floating swi.
A nice future project would be to create a separate swi for
the "default" floating swi, as well as per-CPU swis that are
(optionally) pinned.
Tested:
* parallel TCP tests (2 x 1g unfortunately for now);
CPU: Intel(R) Xeon(R) CPU E5-2650
Note:
This is based on some initial investigation into RSS/TCP stack lock
contention on FreeBSD-HEAD whilst at Netflix in January 2014.
- Rework how we allocate and free USB host channels, so that we only
allocate a channel if there is a real packet going out on the USB
cable.
- Use BULK type for control data and status, due to instabilities in
the HW it appears.
- Split FIFO TX levels into one for the periodic FIFO and one for the
non-periodic FIFO.
- Use correct HFNUM mask when scheduling host transactions. The HFNUM
register does not count the full 16-bit range.
- Correct START/COMPLETION slot for TT transactions. For INTERRUPT and
ISOCHRONOUS type transactions the hardware always respects the ODDFRM
bit, which means we need to allocate multiple host channels when
processing such endpoints, to not miss any so-called complete split
opportunities.
- When doing ISOCHRONOUS OUT transfers through a TT send all data
payload in a single ALL-burst. This deacreases the likelyhood for
isochronous data underruns.
- Fixed unbalanced unlock in case of "dwc_otg_init_fifo()" failure.
- Increase interrupt priority.
MFC after: 2 weeks