- Return appropriate error code instead of ENOMEM when sosend() fails in
send_mpa_req.
- Fix for problematic race during destroy_qp.
- Abortive close in the failure of send_mpa_reject() instead of normal close.
- Remove the unnecessary doorbell flowcontrol logic.
Sponsored by: Chelsio Communications
The other CPU might resume and see a still-empty runq and go back to
sleep before sched_add() adds the thread to the runq. This results
in a lost wakeup and a potential hang if the system is otherwise
completely idle.
The race originated due to a micro-optimization (my fault) in 4BSD in
that it avoided putting a thread on the run queue if the scheduler was
going to preempt to the new thread. To avoid complexity while fixing
this race, just drop this optimization. 4BSD now always sets the
"owepreempt" flag when a preemption is warranted and defers the actual
preemption to the thread_unlock of the caller the same as ULE.
Use the same logic to calculate the nominal CPU frequency from the P-state
MSRs on family 0x12, 0x15, and 0x16 CPUs as is used for family 0x10.
Family 0x14 was included in the original patch in the PR but I left that
out as the BIOS writer's guide for family 0x14 CPUs show a different layout
for the relevant MSR and include a different formulate for calculating the
frequency.
While here, simplify a few expressions and print out the family of
unsupported CPUs in hex rather than decimal.
PR: 212020
303753:
Don't permit mappings of invalid physical addresses on amd64 via /dev/mem.
308004:
MFamd64: Add bounds checks on addresses used with /dev/mem.
Reject attempts to read from or memory map offsets in /dev/mem that are
beyond the maximum-supported physical address of the current CPU.
bhyve: stability and performance improvement for dbgport
The TCP server implementation in dbgport does not track clients, so it
may try to write to a disconected socket resulting in SIGPIPE.
Avoid that by setting SO_NOSIGPIPE socket option.
Because dbgport emulates an I/O port to guest, the communication is done
byte by byte. Reduce latency of the TCP/IP transfers by using
TCP_NODELAY option. In my tests that change improves performance of
kgdb commands with lots of output (e.g. info threads) by two orders of
magnitude.
A general note. Since we have a uart emulation in bhyve, that can be
used for the console and gdb access to guests. So, bvmconsole and bvmdebug
could be de-orbited now. But there are many existing deployments that
still dependend on those.
Discussed with: julian, jhb
Sponsored by: Panzura
slite style changes. There is an incoming patch that rewrites a
lot of this module and I want to get the style and whitespace changes in
a separate commit (or maybe more).
PR: 206185
Submitted by: Dmitry Vagin
r308696:
New driver for Broadcom NetXtreme-C and NetXtreme-E devices.
r308729:
Add bnxt(4) to the hardware notes.
r308787:
Add missing newline in error mesage
r308813:
Check link status after init
r309028:
Add missing break to switch statement
r309073:
Fix version string
r309078:
Add new device IDs
Approved by: sbruno
Relnotes: yes
Sponsored by: Broadcom Limited
r308898:
[bytgpio] Fix USB disconnect event after listsing pins on gpioc2
- Do not set input flag when reading value from GPIO pin, it is not
required and for gpioc2(S5 bank) setting both input and output flags
leads to some kind of electric interference (curren drop?) that
causes USB devices to disconnect
- Check pad configuration when attaching device and provide IN/OUT
capabilities only for pads that are configured as GPIO. Do not let
user code to configure or change value of non-GPIO pads. There is
no information for NC bank in intel's datasheet so for now function
check is ignored for pins in it
Reported by: Frank H.
MFC after: 3 days
r308940:
[bytgpio] prepare bytgpio(4) for modularization
- Add detach method
- module should depend on gpiobus, not gpio
r308942:
[bytgpio] Add module for bytgpio(4)
MFC after: 3 days
r308944 by hiren@:
r308942 broke kernel build.
Add acpi_if.h to module makefile to fix it.
Submitted by: peter
r309112:
[bytgpio] Fix pc98 build by disabling bytgpio module for this platform
Reported by: dim
When symbol versioning was added to rtld, the boolean 'in_plt' argument
to find_symdef() was converted to a bitmask of flags. The first flag
added was 'SYMLOOK_IN_PLT' which replaced the 'in_plt' bool. This
happened to still work by accident as SYMLOOK_IN_PLT had the value of 1
which is the same as 'true', so there should be no functional change.
Add GARP retransmit capability
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost. This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.
To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero. The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).
Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity. This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.
Update arp(4) to document the net.link.ether.inet.garp_rexmit_count sysctl.
Submitted by: dab
Relnotes: yes
Sponsored by: Dell EMC
Fix error reporting from wcstof()
When wcstof() skipped initial space and then parsing failed, it set
endptr to the first non-space character. Fix it to correctly report
failure by setting endptr to the beginning of the input string.
The fix is from theraven@, who fixed this bug in wcstod() and
wcstold() in r227753.
While I'm here:
Move assignments out of declarations in wcstod() and wcstold().
This is against my personal preference, but it is our agreed style(9).
Set endptr correctly on malloc() failure in all three functions.
Remove an incorrect comment: This is pointer arithmetic,
so the code was not actually making that assumption.
wcstold() advanced the wcp pointer beyond leading whitespace
and then reset it back to the beginning of the string.
Do not reset it. This seems to have no functional effect,
since strtold_l() also skips leading whitespace. I'm making
the change to keep this function consistent with wcstof() and
wcstod(), and because the C11 spec prescribes the use of iswspace()
to skip leading space.
Reported by: libc++ unit test for std::stof(std::wstring)
Sponsored by: Dell EMC
locale: fix display of "grouping" and "mon_grouping" values
The "grouping" and "mon_grouping" values are arrays of one-byte
integers, not arrays of ASCII characters. Display them in a format
similar to GNU and MacOS.
Sponsored by: Dell EMC
When issuing a PXE dhcp request, always issue a param request (DHCP option 55)
with all dhcp parameters we might be interested in.
Some DHCP server like the new kea (by ISC) expect it.
This makes pxeboot functional with ISC kea.
Submitted by: Vincent Legout <vincent.legout@gandi.net>
Sponsored by: Gandi.net
r308797
update the hv_vmbus(4) manual by adding a dependency on pci
We enhanced the vmbus driver to support PCIe pass-through recently.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
r308798
remove the hv_ata_pci_disengage(4) manual
A few months ago, we removed the driver, which was not necessary any longer.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
r308799
fix share/man/man4/Makefile for hv_ata_pci_disengage.4
We need to remove the line since we removed the related manual just now.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
r309082
share/man/man4/Makefile: Only install Hyper-V man pages on amd64 and i386
We shouldn't install them on the architectures not supported by Hyper-V.
And, hv_ata_pci_disengage.4.gz should be removed from all architectures:
1) It should have only applied to Hyper-V;
2) For Hyper-V platforms (amd64 and i386), the related driver was removed by
r306426 | sephe | 2016-09-29 09:41:52 +0800 (Thu, 29 Sep 2016),
because now we have a better mechanism to disble the ata driver for hard
disks when the VM runs on Hyper-V.
Reviewed by: sephe, andrew, jhb
Approved by: sephe (mentor)
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8572
Approved by: sephe (mentor)
r308723
hyperv/vmbus: add a new method to get vcpu_id
vcpu_id is host's representation of guest CPU.
We get the mapping between vcpu_id and FreeBSD kernel's cpu id when VMBus
driver is loaded. Later, when a driver, like the coming pcib driver, talks
to the host and needs to refer to a guest CPU, the driver must use the
vcpu_id.
Reviewed by: jhb, sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8410
r308724
hyperv/vmbus: add new vmbus methods to support PCIe pass-through
The new methods will be used by the coming pcib driver.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8409
r308725
hyperv/pcib: enable PCIe pass-through (a.k.a. Discrete Device Assignment)
The feature enables us to pass through physical PCIe devices to FreeBSD VM
running on Hyper-V (Windows Server 2016) to get near-native performance with
low CPU utilization.
The patch implements a PCI bridge driver to support the feature:
1) The pcib driver talks to the host to discover device(s) and presents
the device(s) to FreeBSD's pci driver via PCI configuration space (note:
to access the configuration space, we don't use the standard I/O port
0xCF8/CFC method; instead, we use an MMIO-based method supplied by Hyper-V,
which is very similar to the 0xCF8/CFC method).
2) The pcib driver allocates resources for the device(s) and initialize
the related BARs, when the device driver's attach method is invoked;
3) The pcib driver talks to the host to create MSI/MSI-X interrupt
remapping between the guest and the host;
4) The pcib driver supports device hot add/remove.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8332
r308793
hyperv/pcib: Fix the build for some kernel configs
Add the dependency on pci explicitly for the pcib and vmbus drivers.
The related Makefiles are updated accordingly too.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
r308794
hyperv/vmbus,pcib: Add MODULE_DEPEND on pci
We'd better add this dependency explicitly, though usually the pci
driver is built into the kernel by default.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
r308795
hyperv/pcib: change the file name: pcib.c -> vmbus_pcib.c
This makes the file name and the variable naming in the file consistent.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
r309127
hyperv/vmbus,pcib: unbreak build in case NEW_PCIB is undefined
vmbus_pcib requires NEW_PCIB, but in case that's not defined, we at
least shouldn't break build.
Reviewed by: sephe
Approved by: sephe (mentor)
Sponsored by: Microsoft
Allocate a struct ifreq rather than using a (wrong) computed size for
the BIOCSETIF ioctl.
The kernel always copies an entire struct ifreq and IPv4 addresses will
always fit in an ifreq.
On systems with pointers larger than 64-bits, the computed size will be
less than the size of struct ifreq, potentially resulting in the kernel
attempting to copyin memory from outside the allocation.
Reviewed by: jhb
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D8445
Fix "camcontrol rescan" with SATA drives behind a SAS controller
A bug in CAM's serial number hash logic resulted in SATA drives behind a SAS
controller getting removed and readded anytime the drive was rescanned for
any reason
libc++'s stddef.h includes an existing definition of max_align_t for
C++11, but it is only defined for C++, not for C. In addition, GCC and
clang both define an alternate version of max_align_t that uses a
union of multiple types rather than a plain long double as in libc++.
This adds a __max_align_t to <sys/_types.h> that matches the GCC and
clang definition that is mapped to max_align_t in <stddef.h>.
PR: 210890
Make sure MAC address is reprogrammed when if_init() callback is
invoked. Else promiscious mode must be used to pass traffic. While at
it fix a debug print macro.
r307003: Instead:
1) provide fts_open() with a comparison function to process directories
and files in a deterministic order
2) in addition to the existing hash, insert pages into a linked list
which will be sorted (by virtue of 1)
3) iterate over pages by the list in 2, instead of hash order
Idea from: des
r307564: makewhatis: avoid skipping another page after one with no mlinks
Submitted by: Ingo Schwarze
Sponsored by: The FreeBSD Foundation
Don't read if_counters with if_addr_lock held
Calling into an ifnet implementation with the if_addr_lock already
held can cause a LOR and potentially a deadlock, as ifnet
implementations typically can take the if_addr_lock after their
own locks during configuration. Refactor a sysctl handler that
was violating this to read if_counter data in a temporary buffer
before the if_addr_lock is taken, and then copying the data
in its final location later, when the if_addr_lock is held.
PR: 194109
Reported by: Jean-Sebastien Pedron
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8498
Reviewed by: sbruno
All I/O APIC pins are masked when an I/O APIC is first probed. The
APIC enumerator (MP Table or MADT) then parses its associated tables to
configure individual pins to set custom delivery modes or alternate
routing (e.g. routing IRQ 0 to intpin 2). Pins for regular interrupt
pins are left masked until the first interrupt is assigned. However,
pins with unusual settings (e.g. NMI or SMI) are never assigned an
interrupt and thus never re-programmed. The I/O APIC code used to
reprogram all interrupt pins during registration but this was lost in
r151979.
In theory, this is mostly a no-op as the ACPI APIC table does not
include a way to enumerate NMI or SMI pins for the I/O APIC, so only
systems using an MP Table would be affected.
r308443:
Add -d flag that prints domain only.
PR: 212875
Submitted by: Ben RUBSON <ben.rubson@gmail.com>
Reviewed by: pi
r308459:
Fix missing '-' for the flags -s and -d on both manpage and usage.
Reported by: garga, bde
r308462:
Add flag -B which does the same like batch mode but without exiting after
print. Also add a new flag -s that add blocks size to statistics.
PR: 198347, 212726
Submitted by: Ben RUBSON <ben.rubson@gmail.com>
Tested by: pi
MFC After: 2 weeks.
r308478:
We can't use protect(1) inside a jail(8)!
To avoid have warning for services that are using oomprotect, oomprotect
will only be applied on services that won't run inside jails.
Reported by: allanjude
MFC after: 2 weeks.
r308786:
rc.subr: Swap checks so we only fork sysctl if *_oomprotect is set.
Previously a command like "strings f1 f2 f3" reported the exit status
based only on processing the last file.
As with GNU strings, report an error exit status if an error was
encountered processing any of the files. While here simplify the
exit status handling to just success (0) / failure (1).
Some tools produce objects with a combined strtab and shstrtab.
These objects are not supported by crunchide since it rewrites the
symtab and strtab to "hide" symbols. This invalidates section header
offsets into a combined strtab/shstrtab.
In the future we could support these objects (by ensuring that we retain
unmodified section name strings in the output .strtab, and then rewriting
each section header's sh_name).
Specifically, use .Ta instead of tabs to separate column entries. While
here fix a few other things:
- Use .Sy for all column headers (previously only the first column header
was bold)
- Use .Dv to markup constants used for MIB names.
- Use "1234" and "4321" for the byte order descriptions without
thousands separators.
- Mark up header files in the first table with .In.
EFER_NXE is set in the EFER MSR by initializecpu() and must be set on all
CPUs in the system. When PG_NX support was added to PAE on i386, the
block to enable EFER_NXE was placed in a section of initializecpu() that
only runs if 'cpu == CPU_686'. During early boot, locore does an
initial pass to set cpu that sets it to CPU_686 on all CPUs later than
a Pentium. Later, printcpuinfo() adjusts the 'cpu' variable on
PII and later CPUs to one of CPU_PII, CPU_PIII, or CPU_P4. However,
printcpuinfo() is called after initializecpu() on the BSP, so the BSP
would enable EFER_NXE and pg_nx. The APs execute initializecpu() much
later after printcpuinfo() has run. The end result on a modern CPU was
that cpu was set to CPU_PIII when the APs invoked initializecpu(), so
they did not enable EFER_NXE. As a result, the APs would fault when
trying to access any pages marked with PG_NX set.
When booting a 2 CPU PAE kernel in bhyve this manifested as a hang before
single user mode. The attempt to execute /bin/init tried to copy out
the exec strings (argv, etc.) to a non-executable mapping while running
on the AP. The instruction kept faulting due to invalid bits in the PTE
in an infinite loop.
Fix this by moving the code to enable EFER_NXE out of the switch statement
on 'cpu' and always doing it if 'amd_feature' supports AMDID_NX.