is compiled in or not.
This fixes issues with people running -HEAD but who build modules
without doing a "make buildkernel KERNCONF=XXX", thus picking up
opt_*.h. The resulting module wouldn't have 11n enabled and the
chainmask configuration would just be plain wrong.
This allows mapping a tape drive in a changer (as reported by
'chio status') to a sa(4) driver instance by comparing the
serial numbers.
The designators can be ASCII (which is printed out directly), binary
(which is printed in hex format) or UTF-8, which is printed in either
native UTF-8 format if the terminal can support it, or in %XX notation
for non-ASCII characters. Thanks to Hiroki Sato <hrs@> for the
explaining UTF-8 printing and example UTF-8 printing code.
chio.h: Modify the changer_element_status structure to add new
fields and definitions from the SMC3r16 spec.
Rename the original CHIOGSTATUS ioctl to OCHIOGTATUS and
define a new CHIOGSTATUS ioctl.
Clean up some tab/space issues.
chio.c: For the 'status' subcommand, print the designator field
if it is supplied by a device.
scsi_ch.h: Add new flags for DVCID and CURDATA to the READ
ELEMENT STATUS command structure.
Add a read_element_status_device_id structure
for the data fields in the new standard. Add new
unions, dt_or_obsolete and voltage_devid, to hold
and address data from either SCSI-2 or newer devices.
scsi_ch.c: Implement support for fetching device IDs with READ
ELEMENT STATUS data.
Add new arguments to scsi_read_element_status() to
allow the user to request the DVCID and CURDATA bits.
This isn't compiled into libcam (it's only an internal
kernel interface), so we don't need any special
handling for the API change.
If the user issues the new CHIOGSTATUS ioctl, copy all of
the available element status data out. If he issues the
OCHIOGSTATUS ioctl, we don't copy the new fields in the
structure.
Fix a bug in chopen() that would result in the peripheral
never getting unheld if chgetparams() failed.
Sponsored by: Spectra Logic
Submitted by: Po-Li Soong
MFC After: 1 week
This is intended to be used as a stop-gap for switch devices
which expose multiple ethernet PHYs but we don't have a driver
for - here, etherswitchcfg and the general switch configuration
API can be used to interface to said PHYs.
Submitted by: Luiz Otavio O Souza <loos.br@gmail.com>
Programs often do not expect an [EINTR] return from sem_wait() and POSIX
only allows it if the signal was installed without SA_RESTART. The timeout
in sem_timedwait() is absolute so it can be restarted normally.
The umtx call can be invoked with a relative timeout and in that case
[ERESTART] must be changed to [EINTR]. However, libc does not do this.
The old POSIX semaphore implementation did this correctly (before r249566),
unlike the new umtx one.
It may be desirable to avoid [EINTR] completely, which matches the pthread
functions and is explicitly permitted by POSIX. However, the kernel must
return [EINTR] at least for signals with SA_RESTART clear, otherwise pthread
cancellation will not abort a semaphore wait. In this commit, only restore
the 8.x behaviour which is also permitted by POSIX.
Discussed with: jhb
MFC after: 1 week
buffer for the last vnode on the mount back to the server, it
returns. At that point, the code continues with the unmount,
including freeing up the nfs specific part of the mount structure.
It is possible that an nfsiod thread will try to check for an
empty I/O queue in the nfs specific part of the mount structure
after it has been free'd by the unmount. This patch avoids this problem by
setting the iodmount entries for the mount back to NULL while holding the
mutex in the unmount and checking the appropriate entry is non-NULL after
acquiring the mutex in the nfsiod thread.
Reported and tested by: pho
Reviewed by: kib
MFC after: 2 weeks
respective functionality, allowing to synchronize TSC on APs to match BSP's
during boot. It may be unsafe in general case due to theoretical chance of
later drift if CPUs are using different clock rate or source, but it allows
to use TSC in some cases when difference caused by some initialization bug,
while TSCs are known to increment synchronously.
Reviewed by: jimharris, kib
MFC after: 1 month
option. This can occur when an nfsiod thread that already holds
a buffer lock attempts to acquire a vnode lock on an entry in
the directory (a LOR) when another thread holding the vnode lock
is waiting on an nfsiod thread. This patch avoids the deadlock by disabling
readahead for this case, so the nfsiod threads never do readdirplus.
Since readaheads for directories need the directory offset cookie
from the previous read, they cannot normally happen in parallel.
As such, testing by jhb@ and myself didn't find any performance
degredation when this patch is applied. If there is a case where
this results in a significant performance degradation, mounting
without the "rdirplus" option can be done to re-enable readahead
for directories.
Reported and tested by: jhb
Reviewed by: jhb
MFC after: 2 weeks
it will work with either the old or new server.
The FHA code keeps a cache of currently active file handles for
NFSv2 and v3 requests, so that read and write requests for the same
file are directed to the same group of threads (reads) or thread
(writes). It does not currently work for NFSv4 requests. They are
more complex, and will take more work to support.
This improves read-ahead performance, especially with ZFS, if the
FHA tuning parameters are configured appropriately. Without the
FHA code, concurrent reads that are part of a sequential read from
a file will be directed to separate NFS threads. This has the
effect of confusing the ZFS zfetch (prefetch) code and makes
sequential reads significantly slower with clients like Linux that
do a lot of prefetching.
The FHA code has also been updated to direct write requests to nearby
file offsets to the same thread in the same way it batches reads,
and the FHA code will now also send writes to multiple threads when
needed.
This improves sequential write performance in ZFS, because writes
to a file are now more ordered. Since NFS writes (generally
less than 64K) are smaller than the typical ZFS record size
(usually 128K), out of order NFS writes to the same block can
trigger a read in ZFS. Sending them down the same thread increases
the odds of their being in order.
In order for multiple write threads per file in the FHA code to be
useful, writes in the NFS server have been changed to use a LK_SHARED
vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem
doesn't allow multiple writers to a file at once. ZFS is currently
the only filesystem that allows multiple writers to a file, because
it has internal file range locking. This change does not affect the
NFSv4 code.
This improves random write performance to a single file in ZFS, since
we can now have multiple writers inside ZFS at one time.
I have changed the default tuning parameters to a 22 bit (4MB)
window size (from 256K) and unlimited commands per thread as a
result of my benchmarking with ZFS.
The FHA code has been updated to allow configuring the tuning
parameters from loader tunable variables in addition to sysctl
variables. The read offset window calculation has been slightly
modified as well. Instead of having separate bins, each file
handle has a rolling window of bin_shift size. This minimizes
glitches in throughput when shifting from one bin to another.
sys/conf/files:
Add nfs_fha_new.c and nfs_fha_old.c. Compile nfs_fha.c
when either the old or the new NFS server is built.
sys/fs/nfs/nfsport.h,
sys/fs/nfs/nfs_commonport.c:
Bring in changes from Rick Macklem to newnfs_realign that
allow it to operate in blocking (M_WAITOK) or non-blocking
(M_NOWAIT) mode.
sys/fs/nfs/nfs_commonsubs.c,
sys/fs/nfs/nfs_var.h:
Bring in a change from Rick Macklem to allow telling
nfsm_dissect() whether or not to wait for mallocs.
sys/fs/nfs/nfsm_subs.h:
Bring in changes from Rick Macklem to create a new
nfsm_dissect_nonblock() inline function and
NFSM_DISSECT_NONBLOCK() macro.
sys/fs/nfs/nfs_commonkrpc.c,
sys/fs/nfsclient/nfs_clkrpc.c:
Add the malloc wait flag to a newnfs_realign() call.
sys/fs/nfsserver/nfs_nfsdkrpc.c:
Setup the new NFS server's RPC thread pool so that it will
call the FHA code.
Add the malloc flag argument to newnfs_realign().
Unstaticize newnfs_nfsv3_procid[] so that we can use it in
the FHA code.
sys/fs/nfsserver/nfs_nfsdsocket.c:
In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types
that use the LK_SHARED lock type.
sys/fs/nfsserver/nfs_nfsdport.c:
In nfsd_fhtovp(), if we're starting a write, check to see
whether the underlying filesystem supports shared writes.
If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE.
sys/nfsserver/nfs_fha.c:
Remove all code that is specific to the NFS server
implementation. Anything that is server-specific is now
accessed through a callback supplied by that server's FHA
shim in the new softc.
There are now separate sysctls and tunables for the FHA
implementations for the old and new NFS servers. The new
NFS server has its tunables under vfs.nfsd.fha, the old
NFS server's tunables are under vfs.nfsrv.fha as before.
In fha_extract_info(), use callouts for all server-specific
code. Getting file handles and offsets is now done in the
individual server's shim module.
In fha_hash_entry_choose_thread(), change the way we decide
whether two reads are in proximity to each other.
Previously, the calculation was a simple shift operation to
see whether the offsets were in the same power of 2 bucket.
The issue was that there would be a bucket (and therefore
thread) transition, even if the reads were in close
proximity. When there is a thread transition, reads wind
up going somewhat out of order, and ZFS gets confused.
The new calculation simply tries to see whether the offsets
are within 1 << bin_shift of each other. If they are, the
reads will be sent to the same thread.
The effect of this change is that for sequential reads, if
the client doesn't exceed the max_reqs_per_nfsd parameter
and the bin_shift is set to a reasonable value (22, or
4MB works well in my tests), the reads in any sequential
stream will largely be confined to a single thread.
Change fha_assign() so that it takes a softc argument. It
is now called from the individual server's shim code, which
will pass in the softc.
Change fhe_stats_sysctl() so that it takes a softc
parameter. It is now called from the individual server's
shim code. Add the current offset to the list of things
printed out about each active thread.
Change the num_reads and num_writes counters in the
fha_hash_entry structure to 32-bit values, and rename them
num_rw and num_exclusive, respectively, to reflect their
changed usage.
Add an enable sysctl and tunable that allows the user to
disable the FHA code (when vfs.XXX.fha.enable = 0). This
is useful for before/after performance comparisons.
nfs_fha.h:
Move most structure definitions out of nfs_fha.c and into
the header file, so that the individual server shims can
see them.
Change the default bin_shift to 22 (4MB) instead of 18
(256K). Allow unlimited commands per thread.
sys/nfsserver/nfs_fha_old.c,
sys/nfsserver/nfs_fha_old.h,
sys/fs/nfsserver/nfs_fha_new.c,
sys/fs/nfsserver/nfs_fha_new.h:
Add shims for the old and new NFS servers to interface with
the FHA code, and callbacks for the
The shims contain all of the code and definitions that are
specific to the NFS servers.
They setup the server-specific callbacks and set the server
name for the sysctl and loader tunable variables.
sys/nfsserver/nfs_srvkrpc.c:
Configure the RPC code to call fhaold_assign() instead of
fha_assign().
sys/modules/nfsd/Makefile:
Add nfs_fha.c and nfs_fha_new.c.
sys/modules/nfsserver/Makefile:
Add nfs_fha_old.c.
Reviewed by: rmacklem
Sponsored by: Spectra Logic
MFC after: 2 weeks
prevents us from creating UMA_ZONE_PCPU zones earlier.
As bandaid shift initialization of counter(9) zone later.
Reviewed by: kib
Reported & tested by: Lytochkin Boris <lytboris gmail.com>
(Wasting 4k just as a temporary placeholder for a boot environment seems
a bit ridiculous, but hey.)
Tested: gxemul:
$ gxemul -e malta -d i:/home/adrian/work/freebsd/svn/mfsroot-rspro.img -C 4Kc /tftpboot/kernel.MALTA
* Add ah_ratesArray[] to the ar5416 HAL state - this stores the maximum
values permissable per rate.
* Since different chip EEPROM formats store this value in a different place,
store the HT40 power detector increment value in the ar5416 HAL state.
* Modify the target power setup code to store the maximum values in the
ar5416 HAL state rather than using a local variable.
* Add ar5416RateToRateTable() - to convert a hardware rate code to the
ratesArray enum / index.
* Add ar5416GetTxRatePower() - which goes through the gymnastics required
to correctly calculate the target TX power:
+ Add the power detector increment for ht40;
+ Take the power offset into account for AR9280 and later;
+ Offset the TX power correctly when doing open-loop TX power control;
+ Enforce the per-rate maximum value allowable.
Note - setting a TPC value of 0x0 in the TX descriptor on (at least)
the AR9160 resulted in the TX power being very high indeed. This didn't
happen on the AR9220. I'm guessing it's a chip bug that was fixed at
some point. So for now, just assume the AR5416/AR5418 and AR9130 are
also suspect and clamp the minimum value here at 1.
Tested:
* AR5416, AR9160, AR9220 hostap, verified using (2GHz) spectrum analyser
* Looked at target TX power in TX descriptor (using athalq) as well as TX
power on the spectrum analyser.
TODO:
* The TX descriptor code sets the target TX power to 0 for AR9285 chips.
I'm not yet sure why. Disable this for TPC and ensure that the TPC
TX power is set.
* AR9280, AR9285, AR9227, AR9287 testing!
* 5GHz testing!
Quirks:
* The per-packet TPC code is only exercised when the tpc sysctl is set
to 1. (dev.ath.X.tpc=1.) This needs to be done before you bring the
interface up.
* When TPC is enabled, setting the TX power doesn't end up with a call
through to the HAL to update the maximum TX power. So ensure that
you set the TPC sysctl before you bring the interface up and configure
a lower TX power or the hardware will be clamped by the lower TX
power (at least until the next channel change.)
Thanks to Qualcomm Atheros for all the hardware, and Sam Leffler for use
of his spectrum analyser to verify the TX channel power.
ath_tx_rate_fill_rcflags(). Include setting up the TX power cap in the
rate scenario setup code being passed to the HAL.
Other things:
* add a tx power cap field in ath_rc.
* Add a three-stream flag in ath_rc.
* Delete the LDPC flag from ath_rc - it's not a per-rate flag, it's a
global flag for the transmission.
The following change from illumos brought caused DTrace to
pause in an interactive environment:
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider
This was not detected during testing because it doesn't
affect scripts.
We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.
Reference:
https://www.illumos.org/issues/3026
Reported by: Navdeep Parhar and Mark Johnston
directly referencing ni->ni_txpower.
This provides the hardware with a slightly more accurate idea of
the maximum TX power to be using.
This is part of a series to get per-packet TPC to work (better).
Tested:
* AR5416, hostap mode
the given node.
This takes into account the per-node cap, the ic cap and the
per-channel regulatory caps.
This is designed to replace references to ni_txpower in various net80211
drivers - ni_txpower doesn't necessarily reflect the actual cap for
the given node (eg if the node has the default value of 50dBm (100) and
the administrator has manually configured a lower TX power.)
signal.
- Fix the old ksem implementation for POSIX semaphores to not restart
sem_wait() or sem_timedwait() if interrupted by a signal.
MFC after: 1 week
Pointy-hat to: me, for not realizing snprintf() is available in kernel.
Thanks to: jh, for bringing me the good news of snprintf(), Pawel Worach, for
noting that the panic can be provoked in i386 and not in amd64
The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.
The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.
PR: kern/173723
Suggested by: jhb
Discussed with: jhb, kib
Reviewed by: jhb (initial version), kib
MFC after: 1 month
Only look for FDT partitions if our potential parent is a DISK device.
Excluding direct recursion on the flashmap geoms was insufficient
because it did not prevent the underlying device from being retrieved if
flashmap geoms were further partitioned.
Reviewed by: imp
Sponsored by: DARPA, AFRL
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.
Reported and tested by: pho
MFC after: 2 weeks
for the set of IPv6 addresses. Now each attempt goes into IPv6 statistics,
even if given rule did not won. Change this and take into account only
those rules, that won. Also add accounting for cases, when algorithm
fails to select an address.
to unique values.
There's some confusion about what the n32 assembler API really is
(since on page 9 of the spec they say that t0-t3 don't exist, then
turn around on page 22 and say that t4-t7 don't exist), and this
doesn't touch that.
NetBSD's version of this file follows the convention I used here, and
is likely to be correct.
This should fix gdb/ptrace.
is starting. This is in line with practice in OpenSolaris.
Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.
PR: 177706
Submitted by: Tiwei Bie
MFC after: 1 month
well as enabled when necessary. And simplify the checksum routine
itself, adding UDP bit to the test. Thanks to Kevin Lo for pointing
out the problems and code suggestions.
implementation, error on the side of conservatism and only create labels
for GEOMs of classes DISK and MULTIPATH.
Discussed with: trasz
Approved by: silence from freebsd-geom@
The lagg(4) is often used to bond high speed links, so basic per-packet +=
on statistics cause cache misses and statistics loss.
Perfect solution would be to convert ifnet(9) to counters(9), but this
requires much more work, and unfortunately ABI change, so temporarily
patch lagg(4) manually.
We store counters in the softc, and once per second push their values
to legacy ifnet counters.
Sponsored by: Nginx, Inc.
the number of interior nodes, we have previously created a level zero
interior node at the root of every non-empty trie, even when that node is
not strictly necessary, i.e., it has only one child. This change is the
second (and final) step in eliminating those unnecessary level zero interior
nodes. Specifically, it updates the deletion and insertion functions so
that they do not require a level zero interior node at the root of the trie.
For a "buildworld" workload, this change results in a 16.8% reduction in the
number of interior nodes allocated and a similar reduction in the average
execution time for lookup functions. For example, the average execution
time for a call to vm_radix_lookup_ge() is reduced by 22.9%.
Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.
The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.
Reviewed by: kib
MFC after: 3 weeks
function is provided, which is used either to calculate the note size
or output it to sbuf. On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputing notes to sbuf. For the sbuf a drain routine is
provided that writes data to a core file.
The main goal of the change is to make coredump to write notes
directly to the core file, without preliminary preparing them all in a
memory buffer. Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
becomes an issue when we start to store procstat notes.
Reviewed by: jhb (initial version), kib
Discussed with: jhb, kib
MFC after: 3 weeks
device which makes the request for dma tag, instead of some descendant
of the PCI device, by creating a pass-through trampoline for vga_pci
and ata_pci buses.
Sponsored by: The FreeBSD Foundation
Suggested by: jhb
Discussed with: jhb, mav
MFC after: 1 week
Stop abusing xpt_periph in random plases that really have no periph related
to CCB, for example, bus scanning. NULL value is fine in such cases and it
is correctly logged in debug messages as "noperiph". If at some point we
need some real XPT periphs (alike to pmpX now), quite likely they will be
per-bus, and not a single global instance as xpt_periph now.
r248917, r248918, r248978, r249001, r249014, r249030:
Remove multilevel freezing mechanism, implemented to handle specifics of
the ATA/SATA error recovery, when post-reset recovery commands should be
allocated when queues are already full of payload requests. Instead of
removing frozen CCBs with specified range of priorities from the queue
to provide free openings, use simple hack, allowing explicit CCBs over-
allocation for requests with priority higher (numerically lower) then
CAM_PRIORITY_OOB threshold.
Simplify CCB allocation logic by removing SIM-level allocation queue.
After that SIM-level queue manages only CCBs execution, while allocation
logic is localized within each single device.
Suggested by: gibbs
sys/arm and sys/mips), squelching the clang 3.3 warnings about this.
Noticed by: tinderbox and many irate spectators
Submitted by: Luiz Otavio O Souza <loos.br@gmail.com>
PR: kern/177759
MFC after: 3 days
In some cases, kern_envp is set by the architecture code and env_pos does
not contain the length of the static kernel environment. In these cases
r249408 causes the kernel to discard the environment.
Fix this by updating the check for empty static env to *kern_envp != '\0'
Reported by: np@
the number of interior nodes, we always create a level zero interior node at
the root of every non-empty trie, even when that node is not strictly
necessary, i.e., it has only one child. This change is the first step in
eliminating those unnecessary level zero interior nodes. Specifically, it
updates all of the lookup functions so that they do not require a level zero
interior node at the root.
Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division
This includes a new IOCTL to support a generic method for nvmecontrol(8) to pass
IDENTIFY, GET_LOG_PAGE, GET_FEATURES and other commands to the controller, rather than
separate IOCTLs for each.
Sponsored by: Intel
These were added early on for benchmarking purposes to avoid the mapped I/O
penalties incurred in kern_physio. Now that FreeBSD (including kern_physio)
supports unmapped I/O, the need for these NVMe-specific routines no longer exists.
Sponsored by: Intel