with a reference count of zero can't possibly be mapped, so there is never a
need for vm_page_set_invalid() to call pmap_remove_all() on them.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
hold the kernel, modules, and any other loaded data. This memory block
is relocated to the kernel's expected location during the transfer of
control from the loader to the kernel.
The GENERIC kernel on amd64 has recently grown such that a kernel + zfs.ko
no longer fits in the default staging size. Bump the default size from
32MB to 48MB to provide more breathing room.
PR: 201679
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3666
* the tx descriptor TID is priority, not TID.
* the tx descriptor queue id mapping is separate from the
TID/priority; rather than just "BE".
TODO:
* go and re-re-re-verify the queue mappings; the linux and openbsd
mappings aren't exactly the same. I need to verify all of this
before I try to flip on 11n RX.
* Do 1T1R for now, until we read the config out of ROM and use it.
* Disable turbo mode, I dunno what this is, but the linux drivers
have this disabled.
* Set the firmware endpoints to what we read from USB.
Tested:
* RTL8712 cut 3, STA mode
data queues.
This is similar to the openbsd and rtlwifi/r92su drivers.
Note: this driver still assumes it's a 4-endpoint device; I'll enforce
that in a follow-up commit.
Due to the use of int's for file offsets in the VOP_WRITE_(PRE|POST)
macros, kqueue write events for files greater 2GB where never fired.
This caused tail -f on a file greater 2GB to never see updates.
MFC after: 1 week
Relnotes: YES
Sponsored by: Multiplay
Currently FreeBSD supports only single PIC controller. Some systems
that have more than one (like ThunderX dual-socket) fails to boot.
Disable other PICes until proper handling is implemented in the
generic interrupt code.
Reviewed by: imp
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3682
cpu_init_fdt will now release memory allocated for structures
serving CPUs that have failed to init.
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3297
When the system has more than a single PCI domain, the bus numbers
are not unique, thus they cannot be used for "pci" device numbering.
Change bus numbers to -1 (i.e. to-be-determined automatically)
wherever the code did not care about domains.
Reviewed by: jhb
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3406
TDB_USERWR flag may still be set after a debugger detaches from a
process via PT_DETACH. Previously the flag would never be cleared
forcing a double fetch of the system call arguments for each system
call. Note that the flag cannot be cleared at PT_DETACH time in case
one of the threads in the process is currently stopped in
syscallenter() and the debugger has modified the arguments for that
pending system call before detaching.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3678
sent SIGHUP and SIGCONT if any of the processes are stopped. Currently this
behavior is triggered for any type of process stop including ptrace() stops
and transient stops for single threading during exit() and execve().
Thus, if a debugger is attached to a process in a group when the leader
exits, the entire group can be HUPed. Instead, only send the signals if a
process in the group is stopped due to SIGSTOP.
PR: 201149
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3681
Problem description:
How do we currently perform layer 2 resolution and header imposition:
For IPv4 we have the following chain:
ip_output() -> (ether|atm|whatever)_output() -> arpresolve()
Lookup is done in proper place (link-layer output routine) and it is possible
to provide cached lle data.
For IPv6 situation is more complex:
ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
nd6_storelladdr()
We have ip6_ouput() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
* checks if lle exists, creates it if needed (similar to arpresolve())
* performes lle state transitions (similar to arpresolve())
* calls nd6_output_ifp() which pushes packets to link output routine along
with running SeND/MAC hooks regardless of lle state
(e.g. works as run-hooks placeholder).
After that, iface output routine like ether_output() calls nd6_storelladdr()
which performs lle lookup once again.
As a result, we perform lookup twice for each outgoing packet for most types
of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
interfaces (see nd6_need_cache()).
Fix this behavior by eliminating first ND lookup. To be more specific:
* make all nd6_output() consumers use nd6_output_ifp() instead
* rename nd6_output[_slow]() to nd6_resolve_[slow]()
* convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
e.g. copy L2 address to buffer instead of pushing packet towards lower
layers
* Make all nd6_storelladdr() users use nd6_resolve()
* eliminate nd6_storelladdr()
The resulting callchain is the following:
ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()
Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
-> nd6_output() -> nd6_output _lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
nd6_resolve() which will return EWOULDBLOCK, and that result
will be converted to 0.
(And EWOULDBLOCK is actually used by IB/TOE code).
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D1469
* prepare gateway before insertion
* use RTM_CHANGE instead of explicit find/change route
* Remove fib argument from ifa_switch_loopback_route added in r264887:
if old ifp fib differes from new one, that the caller
is doing something wrong
* Make ifa_*_loopback_route call single ifa_maintain_loopback_route().
that was recently added for Lenovo laptops.
This is a prime candidate for conversion into a table and also
checking other fields like "product".
Tested:
* ASUS UX31E
before proceeding. Otherwise, nothing prevents it from running after the
MAD agent struct has been been freed, and this results in a use-after-free
when the task's ta_pending count is incremented in the callout handler.
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
to sleep. The rmlock implementation enforces this by disabling
sleeping when a read lock is acquired. To simplify the implementation,
sleeping is disabled for most of the duration of rm_rlock. However,
it doesn't need to be disabled until the lock is acquired. If a
sleepable rm lock is contested, then rm_rlock may need to acquire the
backing sx lock. This tripped the overly-broad assertion. Fix by
relaxing the assertion around the call to sx_xlock().
Reported by: mjg
Reviewed by: kib, mjg
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3324
In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)
Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)
Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3675
it possible to "simulate" 4K media, to eg test alignment handling.
Reviewed by: mav@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3664
* Move isRouter calculation code to separate nd6_is_router() function.
* Make nd6_cache_lladdr() return void: its return value hasn't been used
since r53541 KAME import in 1999.
Sponsored by: Yandex LLC
There is an issue with interrupts at the moment, but it works with
polling mode set (hw.usb.xhci.use_polling=1).
Reviewed by: hselasky
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3665
chdone(). Previously, the retry could clear the CAM_DEV_QFRZN bit in the
CCB status, leaving the queue frozen.
Submitted by: Jeff Miller <Jeff.Miller@isilon.com>
Reviewed by: ken
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
This allows for arbitrary channel info to be placed in the input call rather
than the totally gross hack of overriding ic_curchan.
Without this I'm sure ic_curchan setting was racing with the scan code
setting the channel itself..
On receipt of a redirect message, install an interface route for the
redirected destination. On removal of the corresponding Neighbor Cache
entry, remove the interface route.
This requires changes in rtredirect_fib() to cope with an AF_LINK
address for the gateway and with the absence of RTF_GATEWAY.
This fixes the "Redirected On-Link" test cases in the Tahi IPv6 Ready Logo
Phase 2 test suite.
Unrelated to the above, fix a recursion on the radix node head lock
triggered by the Tahi Redirected to Alternate Router test cases.
When I first wrote this patch in October 2012, all Section 2
(Neighbor Discovery) test cases passed on 10-CURRENT, 9-STABLE,
and 8-STABLE. cem@ recently rebased the 10.x patch onto head and reported
that it passes Tahi. (Thanks!)
These other test cases also passed in 2012:
* the RTF_MODIFIED case, with IPv4 and IPv6 (using a
RTF_HOST|RTF_GATEWAY route for the destination)
* the redirected-to-self case, with IPv4 and IPv6
* a valid IPv4 redirect
All testing in 2012 was done with WITNESS and INVARIANTS.
Tested by: EMC / Isilon Storage Division via Conrad Meyer (cem) in 2015,
Mark Kelley <mark_kelley@dell.com> in 2012,
TC Telkamp <terence_telkamp@dell.com> in 2012
PR: 152791
Reviewed by: melifaro (current rev), bz (earlier rev)
Approved by: kib (mentor)
MFC after: 1 month
Relnotes: yes
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D3602
without holding afdata wlock
* convert per-af delete_address callback to global lltable_delete_entry() and
more low-level "delete this lle" per-af callback
* fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3573
branch.
This function is used to drain a callout via a callback instead of
blocking the caller until the drain is complete. Refer to the
callout_drain_async() manual page for a detailed description.
Limitation: If a lock is used with the callout, the callout can only
be drained asynchronously one time unless the callout_init_mtx()
function is called again. This limitation is not present in
projects/hps_head and will require more invasive changes to the
timeout code, which was not in the scope of this patch.
Differential Revision: https://reviews.freebsd.org/D3521
Reviewed by: wblock
MFC after: 1 month
To make driver programming easier the TSO limits are changed to
reflect the values used in the BUSDMA tag a network adapter driver is
using. The TCP/IP network stack will subtract space for all linklevel
and protocol level headers and ensure that the full mbuf chain passed
to the network adapter fits within the given limits.
Implementation notes:
If a network adapter driver needs to fixup the first mbuf in order to
support VLAN tag insertion, the size of the VLAN tag should be
subtracted from the TSO limit. Else not.
Network adapters which typically inline the complete header mbuf could
technically transmit one more segment. This patch does not implement a
mechanism to recover the last segment for data transmission. It is
believed when sufficiently large mbuf clusters are used, the segment
limit will not be reached and recovering the last segment will not
have any effect.
The current TSO algorithm tries to send MTU-sized packets, where the
MTU typically is 1500 bytes, which gives 1448 bytes of TCP data
payload per packet for IPv4. That means if the TSO length limitiation
is set to 65536 bytes, there will be a data payload remainder of
(65536 - 1500) mod 1448 bytes which is equal to 324 bytes. Trying to
recover total TSO length due to inlining mbuf header data will not
have any effect, because adding or removing the ETH/IP/TCP headers
to or from 324 bytes will not cause more or less TCP payload to be
TSO'ed.
Existing network adapter limits will be updated separately.
Differential Revision: https://reviews.freebsd.org/D3458
Reviewed by: rmacklem
MFC after: 2 weeks
of PCI-EBus-bridges actually match the BARs as specified in and
required by [1, p. 113 f.]. Doing so earlier would have simplified
diagnosing a bug in QEMU/OpenBIOS getting the mapping of child
addresses wrong, which still needs to be fixed there.
In theory, we could try to change the BARs accordingly if we hit
this problem. However, at least with real machines changing the
decoding likely won't work, especially if the PCI-EBus-bridge is
beneath an APB one. So implementing such functionality generally
is rather pointless.
- Actually change the allocation type of EBus resources if they
change from SYS_RES_MEMORY to SYS_RES_IOPORT when mapping them
to PCI ranges in ebus_alloc_resource() and passing them up to
bus_activate_resource(9). This may happen with the QEMU/OpenBIOS
PCI-EBus-bridge but not real ones. Still, this is only cleans up
the code and the result of resource allocation and activation is
unchanged.
- Change the remainder of printf(9) to device_printf(9) calls and
canonicalize their wording.
MFC after: 1 week
Peripheral Component Interconnect Input Output Controller,
Part No.: 802-7837-01, Sun Microelectronics, March 1997 [1]
The firmware in this NIC sends management frames. So far I'm not sure which
ones it handles and which ones it doesn't handle - but this is what openbsd
does.
The association messages are handled by the firmware; the key negotiation
for 802.1x and WPA are done as raw frames, not management frames.
This successfully allows it to associate to my home networks whereas it didn't
work beforehand.
Tested:
* RTL8712, cut 3, STA mode
TODO:
* The firmware does send a join response with a status code; that should be
logged in a more obvious way to assist with debugging. Ie, the firmware
is the thing that is saying "couldn't join, sorry!", not net80211.
to attach with the last version of this commit. This commit fixes
attach failures on "ICH8" class devices via modifications to
e1000_init_nvm_params_ich8lan()
- Fix compiler warning in 80003es2lan.c
- Add return value handler for e1000_*_kmrn_reg_80003es2lan
- Fix usage of DEBUGOUT
- Remove unnecessary variable initializations.
- Removed unused variables (complaints from gcc).
- Edit defines in 82571.h.
- Add workaround for igb hw errata.
- Shared code changes for Skylake/I219 support.
- Remove unused OBFF and LTR functions.
Tested by some of the folks that reported breakage in previous incarnation.
Thanks to AllanJude, gjb, gnn, tijl for tempting fate with their machines.
Submitted by: erj@freebsd.org
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3162
to provide the TCPDEBUG functionality with pure DTrace.
Reviewed by: rwatson
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: D3530
updated
ZFS already supports storing the vdev FRU in a vdev property. There
is code in libzfs to work with this property, and there is code in
the zfs-retire FMA module that looks for that information. But there
is no code actually setting or updating the FRU.
To address this, ZFS is changed to send a handful of new events
whenever a vdev is added, attached, cleared, or onlined, as well
as when a pool is created or imported.
Note that syseventd is not currently available on FreeBSD and thus
some work is needed to actually support the new ZFS events (e.g. in
zfsd) to actually use this capability, this changeset is mostly a
diff reduction from upstream.
illumos/illumos-gate@1437283407
Illumos issues:
5997 FRU field not set during pool creation and never updated
https://www.illumos.org/issues/5997
* yes, when a "sta disconnect" message comes through we should, like,
disconnect things. We're not currently generating beacon miss messages,
and net80211 isn't disconnecting things via software beacon miss receive.
Tested:
* RTL8712, cut 3, STA mode
Formally pair store_rel(&smp_started) with load_acq(&smp_started).
Similarly to x86, this change is mostly a NOP due to the kernel
being run in total store order.
MFC after: 1 week
* use an ath/iwn style debug bitmap - it's still global rather than per-device,
but it's better than debug levels
* disable bgscan - it just makes things unstable/unpredictable for now.
Tested:
* if_rsu - RTL8712 cut 3, STA mode
drivers into the revived sys/sparc64/pci/ofw_pci.c, previously already
serving a similar purpose. This has been done with sun4v in mind, which
explains a) the otherwise not that obvious scheme employed and b) why
reusing sys/powerpc/ofw/ofw_pci.c was even lesser an option.
- Add a workaround for QEMU once again not emulating real machines, in
this case by not providing the OFW_PCI_CS_MEM64 range. [1]
Submitted by: jhb [1]
MFC after: 1 week
In r286570 (MFV of r277426) an unprotected write to b_flags to
set the compression mode was introduced. This would open a race
window where data is partially decompressed, modified, checksummed
and written to the pool, resulting in pool corruption due to the
partial decompression.
Prevent this by reintroducing b_compress
illumos/illumos-gate@d4cd038c92
Illumos issues:
6214 zpools going south
https://www.illumos.org/issues/6214
Rewrite the ZFS prefetch code to detect only forward, sequential
streams.
The following kstats have been added:
kstat.zfs.misc.arcstats.sync_wait_for_async
How many sync reads have waited for async read
to complete. (less is better)
kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch
How many demand read didn't have to wait for I/O
because of predictive prefetch. (more is better)
zfetch kstats have been similified to hits, misses, and max_streams,
with max_streams representing times when we were not able to create
new stream because we already have the maximum number of sequences
for a file.
The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap have been
replaced by vfs.zfs.zfetch.max_distance, which controls maximum bytes
to prefetch per stream.
illumos/illumos-gate@cf6106c8a0
Illumos ZFS issues:
5987 zfs prefetch code needs work
https://www.illumos.org/issues/5987
The pci bus driver handles the power state, it also manages
configuration state saving and restoring for its child devices. Thus a
PCI device driver does not have to worry about those things. In fact, I
observe a hard system hang when trying to suspend a system with active
radeonkms driver where both the bus driver and radeonkms driver try to
do the same thing. I suspect that it could be because of an access to a
PCI configuration register after the device is placed into D3 state.
Reviewed by: dumbbell, jhb
MFC after: 13 days
Differential Revision: https://reviews.freebsd.org/D3561
All requests arriving for processing after OFFLINE flag set are rejected
with BUSY status. Races around OFFLINE flag setting are closed by calling
taskqueue_drain_all().
in the routine, which queues an ERROR chunk, instead on relyinh
on the callers to do so. Since one caller missed this, this actially
fixes a bug.
MFC after: 1 week
running thread.
It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.
This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.
Reviewed by: bdrewery, jhb, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3256
since on amd64 the first argument to a function is generally not on the
stack.
Revert an old DTrace bug fix to some code that assumed that
sizeof(struct amd64_frame) == 16.
Reviewed by: jhb, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3255
5930 fasttrap_pid_enable() panics when prfind() fails in forking process
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bryan Cantrill <bryan@joyent.com>
illumos/illumos-gate@9df7e4e12e
The only operation which is prevented by the hold is the kernel stack
swapout for the faulted thread, which should be fine to allow.
Remove useless checks for NULL curproc or curproc->p_vmspace from the
trap_pfault() wrappers on x86 and powerpc.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
CTL HA functionality was originally implemented by Copan many years ago,
but large part of the sources was never published. This change includes
clean room implementation of the missing code and fixes for many bugs.
This code supports dual-node HA with ALUA in four modes:
- Active/Unavailable without interlink between nodes;
- Active/Standby with second node handling only basic LUN discovery and
reservation, synchronizing with the first node through the interlink;
- Active/Active with both nodes processing commands and accessing the
backing storage, synchronizing with the first node through the interlink;
- Active/Active with second node working as proxy, transfering all
commands to the first node for execution through the interlink.
Unlike original Copan's implementation, depending on specific hardware,
this code uses simple custom TCP-based protocol for interlink. It has
no authentication, so it should never be enabled on public interfaces.
The code may still need some polishing, but generally it is functional.
Relnotes: yes
Sponsored by: iXsystems, Inc.
Auto-tuning threshold discussions aside, it turns out that if you want
to lower this on say, rather memory-packed machines, you either set maxusers
or kern.maxfiles, or you set it in sysctl. The former is a non-exact
way to tune this; the latter doesn't actually affect anything in the
startup scripts.
This first occured because I wondered why the hell screen would take upwards
of 10 seconds to spawn a new screen. I then found python doing the same
thing during fork/exec of child processes - it calls close() on each FD
up to the current openfiles limit. On a 1TB machine this is like, 26 million
FDs per process. Ugh.
So:
* This allows it to be set early in /boot/loader.conf;
* It can be used to work around the ridiculous situation of
screen, python, etc doing a close() on potentially millions of FDs
even though you only have four open.
Tested:
* 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring
screen and python forking doesn't result in some pretty hilariously
bad behaviour.
TODO:
* Note that the default login.conf sets openfiles-cur to unlimited,
effectively obeying kern.maxfilesperproc. Perhaps we should fix
this.
* .. and even if we do, we need to also ensure that daemons get
a soft limit of something reasonable and capped - they can request
more FDs themselves.
MFC after: 1 week
Sponsored by: Norse Corp, Inc.
named node, open(2) cannot create directories. But do allow the flag
combination to succeed if the directory already exists.
Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.
Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL. The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.
Reported by: Tom Ridge <freebsd@tom-ridge.com>
Reviewed by: rwatson
PR: 202892
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
* Fail when the length passed in is 0
* Remove an unneeded increment of the count on success
* Return ENAMETOOLONG when the input pointer is too long
Sponsored by: ABT Systems Ltd
mapped address without valid pte installed, when parallel wiring of
the entry happen. The entry must be copy on write. If entry is COW
but was already copied, and parallel wiring set
MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the
MAP_ENTRY_IN_TRANSITION flag to clear. After that, the fault handler
is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over
the check. Note that this is race, if the address is accessed after
the wiring is done, the entry does not fault at all.
There is no reason in the current kernel to disallow write access to
the COW wired entry if the entry permissions allow it. Initially this
was done in r24666, since that kernel did not supported proper
copy-on-write for wired text, which was fixed in r199869. The r251901
revision re-introduced the r24666 fix for the current VM.
Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by
performing COW. In reverse, when MAP_ENTRY_NEEDS_COPY is set in
vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared. Put the
assert stating the invariant, instead of returning the error.
Reported and debugging help by: peter
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week