Commit Graph

180401 Commits

Author SHA1 Message Date
Kenneth D. Merry
d96b98a360 Revamp the old NFS server's File Handle Affinity (FHA) code so that
it will work with either the old or new server.

The FHA code keeps a cache of currently active file handles for
NFSv2 and v3 requests, so that read and write requests for the same
file are directed to the same group of threads (reads) or thread
(writes).  It does not currently work for NFSv4 requests.  They are
more complex, and will take more work to support.

This improves read-ahead performance, especially with ZFS, if the
FHA tuning parameters are configured appropriately.  Without the
FHA code, concurrent reads that are part of a sequential read from
a file will be directed to separate NFS threads.  This has the
effect of confusing the ZFS zfetch (prefetch) code and makes
sequential reads significantly slower with clients like Linux that
do a lot of prefetching.

The FHA code has also been updated to direct write requests to nearby
file offsets to the same thread in the same way it batches reads,
and the FHA code will now also send writes to multiple threads when
needed.

This improves sequential write performance in ZFS, because writes
to a file are now more ordered.  Since NFS writes (generally
less than 64K) are smaller than the typical ZFS record size
(usually 128K), out of order NFS writes to the same block can
trigger a read in ZFS.  Sending them down the same thread increases
the odds of their being in order.

In order for multiple write threads per file in the FHA code to be
useful, writes in the NFS server have been changed to use a LK_SHARED
vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem
doesn't allow multiple writers to a file at once.  ZFS is currently
the only filesystem that allows multiple writers to a file, because
it has internal file range locking.  This change does not affect the
NFSv4 code.

This improves random write performance to a single file in ZFS, since
we can now have multiple writers inside ZFS at one time.

I have changed the default tuning parameters to a 22 bit (4MB)
window size (from 256K) and unlimited commands per thread as a
result of my benchmarking with ZFS.

The FHA code has been updated to allow configuring the tuning
parameters from loader tunable variables in addition to sysctl
variables.  The read offset window calculation has been slightly
modified as well.  Instead of having separate bins, each file
handle has a rolling window of bin_shift size.  This minimizes
glitches in throughput when shifting from one bin to another.

sys/conf/files:
	Add nfs_fha_new.c and nfs_fha_old.c.  Compile nfs_fha.c
	when either the old or the new NFS server is built.

sys/fs/nfs/nfsport.h,
sys/fs/nfs/nfs_commonport.c:
	Bring in changes from Rick Macklem to newnfs_realign that
	allow it to operate in blocking (M_WAITOK) or non-blocking
	(M_NOWAIT) mode.

sys/fs/nfs/nfs_commonsubs.c,
sys/fs/nfs/nfs_var.h:
	Bring in a change from Rick Macklem to allow telling
	nfsm_dissect() whether or not to wait for mallocs.

sys/fs/nfs/nfsm_subs.h:
	Bring in changes from Rick Macklem to create a new
	nfsm_dissect_nonblock() inline function and
	NFSM_DISSECT_NONBLOCK() macro.

sys/fs/nfs/nfs_commonkrpc.c,
sys/fs/nfsclient/nfs_clkrpc.c:
	Add the malloc wait flag to a newnfs_realign() call.

sys/fs/nfsserver/nfs_nfsdkrpc.c:
	Setup the new NFS server's RPC thread pool so that it will
	call the FHA code.

	Add the malloc flag argument to newnfs_realign().

	Unstaticize newnfs_nfsv3_procid[] so that we can use it in
	the FHA code.

sys/fs/nfsserver/nfs_nfsdsocket.c:
	In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types
	that use the LK_SHARED lock type.

sys/fs/nfsserver/nfs_nfsdport.c:
	In nfsd_fhtovp(), if we're starting a write, check to see
	whether the underlying filesystem supports shared writes.
	If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE.

sys/nfsserver/nfs_fha.c:
	Remove all code that is specific to the NFS server
	implementation.  Anything that is server-specific is now
	accessed through a callback supplied by that server's FHA
	shim in the new softc.

	There are now separate sysctls and tunables for the FHA
	implementations for the old and new NFS servers.  The new
	NFS server has its tunables under vfs.nfsd.fha, the old
	NFS server's tunables are under vfs.nfsrv.fha as before.

	In fha_extract_info(), use callouts for all server-specific
	code.  Getting file handles and offsets is now done in the
	individual server's shim module.

	In fha_hash_entry_choose_thread(), change the way we decide
	whether two reads are in proximity to each other.
	Previously, the calculation was a simple shift operation to
	see whether the offsets were in the same power of 2 bucket.
	The issue was that there would be a bucket (and therefore
	thread) transition, even if the reads were in close
	proximity.  When there is a thread transition, reads wind
	up going somewhat out of order, and ZFS gets confused.

	The new calculation simply tries to see whether the offsets
	are within 1 << bin_shift of each other.  If they are, the
	reads will be sent to the same thread.

	The effect of this change is that for sequential reads, if
	the client doesn't exceed the max_reqs_per_nfsd parameter
	and the bin_shift is set to a reasonable value (22, or
	4MB works well in my tests), the reads in any sequential
	stream will largely be confined to a single thread.

	Change fha_assign() so that it takes a softc argument.  It
	is now called from the individual server's shim code, which
	will pass in the softc.

	Change fhe_stats_sysctl() so that it takes a softc
	parameter.  It is now called from the individual server's
	shim code.  Add the current offset to the list of things
	printed out about each active thread.

	Change the num_reads and num_writes counters in the
	fha_hash_entry structure to 32-bit values, and rename them
	num_rw and num_exclusive, respectively, to reflect their
	changed usage.

	Add an enable sysctl and tunable that allows the user to
	disable the FHA code (when vfs.XXX.fha.enable = 0).  This
	is useful for before/after performance comparisons.

nfs_fha.h:
	Move most structure definitions out of nfs_fha.c and into
	the header file, so that the individual server shims can
	see them.

	Change the default bin_shift to 22 (4MB) instead of 18
	(256K).  Allow unlimited commands per thread.

sys/nfsserver/nfs_fha_old.c,
sys/nfsserver/nfs_fha_old.h,
sys/fs/nfsserver/nfs_fha_new.c,
sys/fs/nfsserver/nfs_fha_new.h:
	Add shims for the old and new NFS servers to interface with
	the FHA code, and callbacks for the

	The shims contain all of the code and definitions that are
	specific to the NFS servers.

	They setup the server-specific callbacks and set the server
	name for the sysctl and loader tunable variables.

sys/nfsserver/nfs_srvkrpc.c:
	Configure the RPC code to call fhaold_assign() instead of
	fha_assign().

sys/modules/nfsd/Makefile:
	Add nfs_fha.c and nfs_fha_new.c.

sys/modules/nfsserver/Makefile:
	Add nfs_fha_old.c.

Reviewed by:	rmacklem
Sponsored by:	Spectra Logic
MFC after:	2 weeks
2013-04-17 21:00:22 +00:00
Jeremie Le Hen
6272779b2f Document jail_<jname>_parameters option.
The description explains why we should not configure "path",
"host.hostname", "command", "ip4.addr" and ip6.addr" parameters with
this, but rather use the historical rc.conf(5) options.

MFC after:	3 days
2013-04-17 20:19:32 +00:00
Gleb Smirnoff
8f779cc541 On non-ACPI i386 mp_ncpus is initialized at SI_SUB_CPU, and this
prevents us from creating UMA_ZONE_PCPU zones earlier.

As bandaid shift initialization of counter(9) zone later.

Reviewed by:		kib
Reported & tested by:	Lytochkin Boris <lytboris gmail.com>
2013-04-17 18:43:33 +00:00
Adrian Chadd
d22e3b024e Add the static kernel boot environment, needed to actually boot this thing.
(Wasting 4k just as a temporary placeholder for a boot environment seems
a bit ridiculous, but hey.)

Tested: gxemul:

$ gxemul -e malta -d i:/home/adrian/work/freebsd/svn/mfsroot-rspro.img -C 4Kc /tftpboot/kernel.MALTA
2013-04-17 18:26:01 +00:00
Gabor Kovesdan
a8b5c2a0aa - Correct spelling in comments
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:56:11 +00:00
Gabor Kovesdan
84a17a97ce - Correct mispellings of word and
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:48:46 +00:00
Gabor Kovesdan
b78540b1c7 - Correct mispellings of word resource
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de>
2013-04-17 11:47:32 +00:00
Gabor Kovesdan
8fb3bbe770 - Corrrect mispellings of word useful
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:45:15 +00:00
Gabor Kovesdan
f0d0985ee9 - Correct mispellings of word miscellaneous
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:43:46 +00:00
Gabor Kovesdan
a2098fea6d - Correct mispellings of the word necessary
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:42:40 +00:00
Gabor Kovesdan
ab3f6b347e - Correct mispellings of the word occurrence
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:40:10 +00:00
Ivan Voras
7ceaf939d6 Link g_label_disk_ident when building geom_label as a module 2013-04-17 09:19:29 +00:00
Adrian Chadd
91046e9c5f Setup needed tables for TPC on AR5416->AR9287 chips.
* Add ah_ratesArray[] to the ar5416 HAL state - this stores the maximum
  values permissable per rate.
* Since different chip EEPROM formats store this value in a different place,
  store the HT40 power detector increment value in the ar5416 HAL state.
* Modify the target power setup code to store the maximum values in the
  ar5416 HAL state rather than using a local variable.
* Add ar5416RateToRateTable() - to convert a hardware rate code to the
  ratesArray enum / index.
* Add ar5416GetTxRatePower() - which goes through the gymnastics required
  to correctly calculate the target TX power:
  + Add the power detector increment for ht40;
  + Take the power offset into account for AR9280 and later;
  + Offset the TX power correctly when doing open-loop TX power control;
  + Enforce the per-rate maximum value allowable.

Note - setting a TPC value of 0x0 in the TX descriptor on (at least)
the AR9160 resulted in the TX power being very high indeed.  This didn't
happen on the AR9220.  I'm guessing it's a chip bug that was fixed at
some point.  So for now, just assume the AR5416/AR5418 and AR9130 are
also suspect and clamp the minimum value here at 1.

Tested:

* AR5416, AR9160, AR9220 hostap, verified using (2GHz) spectrum analyser
* Looked at target TX power in TX descriptor (using athalq) as well as TX
  power on the spectrum analyser.

TODO:

* The TX descriptor code sets the target TX power to 0 for AR9285 chips.
  I'm not yet sure why.  Disable this for TPC and ensure that the TPC
  TX power is set.
* AR9280, AR9285, AR9227, AR9287 testing!
* 5GHz testing!

Quirks:

* The per-packet TPC code is only exercised when the tpc sysctl is set
  to 1. (dev.ath.X.tpc=1.) This needs to be done before you bring the
  interface up.
* When TPC is enabled, setting the TX power doesn't end up with a call
  through to the HAL to update the maximum TX power.  So ensure that
  you set the TPC sysctl before you bring the interface up and configure
  a lower TX power or the hardware will be clamped by the lower TX
  power (at least until the next channel change.)

Thanks to Qualcomm Atheros for all the hardware, and Sam Leffler for use
of his spectrum analyser to verify the TX channel power.
2013-04-17 07:31:53 +00:00
Adrian Chadd
8b470f6f71 Use the TPC bank by default for AR9160.
Tested:

* AR9160, hostap, verified TX power using (2GHz) spectrum analyser

TODO:

* 5GHz verification!
2013-04-17 07:22:23 +00:00
Adrian Chadd
de00e5cb54 Update the rate series setup code to use the decisions already made in
ath_tx_rate_fill_rcflags().  Include setting up the TX power cap in the
rate scenario setup code being passed to the HAL.

Other things:

* add a tx power cap field in ath_rc.
* Add a three-stream flag in ath_rc.
* Delete the LDPC flag from ath_rc - it's not a per-rate flag, it's a
  global flag for the transmission.
2013-04-17 07:21:30 +00:00
Rui Paulo
d1dcd93145 Print more bits from the standard extended features CPUID which will be
available in the Haswell architecture (c.f. Intel Document #319433-012A).
2013-04-17 06:51:17 +00:00
Pedro F. Giffuni
2e654ff9df DTrace: Revert r249426
This change actually depends on r249367 which had to be reverted

Pointy Hat:	pfg
2013-04-17 02:40:07 +00:00
Neel Natu
5685b37212 Correct misleading bootverbose output: ahc_isa_probe -> ahc_isa_identify 2013-04-17 02:33:56 +00:00
Pedro F. Giffuni
03836978be DTrace: Revert r249367
The following change from illumos brought caused DTrace to
pause in an interactive environment:

3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This was not detected during testing because it doesn't
affect scripts.

We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.

Reference:
https://www.illumos.org/issues/3026

Reported by:	Navdeep Parhar and Mark Johnston
2013-04-17 02:20:17 +00:00
Neel Natu
9f08548d20 Setup accesses to the memory hole below 4GB to return all 1's on read and
consume all writes without any side effects.

Obtained from:	NetApp
2013-04-17 02:03:12 +00:00
Ivan Voras
8e9405e8a7 Comment typo fix.
Is aware of the importance of comments: dim
2013-04-16 22:42:40 +00:00
Warner Losh
11c601447d r249408 and r249436 cause a NULL pointer dereference on the CUBIEBOARD
since it doesn't set the kernel envrionment at all. Work around this
by making sure kern_envp is non-NULL before dereferencing it.
2013-04-16 22:09:08 +00:00
Adrian Chadd
12087a0769 Use the new net80211 method to fetch the node TX power, rather than
directly referencing ni->ni_txpower.

This provides the hardware with a slightly more accurate idea of
the maximum TX power to be using.

This is part of a series to get per-packet TPC to work (better).

Tested:

* AR5416, hostap mode
2013-04-16 21:26:44 +00:00
Adrian Chadd
ebe15a7b25 Implement a utility function to return the current TX power cap for
the given node.

This takes into account the per-node cap, the ic cap and the
per-channel regulatory caps.

This is designed to replace references to ni_txpower in various net80211
drivers - ni_txpower doesn't necessarily reflect the actual cap for
the given node (eg if the node has the default value of 50dBm (100) and
the administrator has manually configured a lower TX power.)
2013-04-16 20:36:32 +00:00
Joel Dahl
9adbae037d mdoc: remove superfluous paragraph macro. 2013-04-16 20:31:15 +00:00
John Baldwin
8916af883c - Document that sem_wait() can fail with EINTR if it is interrupted by a
signal.
- Fix the old ksem implementation for POSIX semaphores to not restart
  sem_wait() or sem_timedwait() if interrupted by a signal.

MFC after:	1 week
2013-04-16 20:26:31 +00:00
Adrian Chadd
5d4dedadb6 Use a per-RX-queue deferred list, rather than a single deferred list for
both queues.

Since ath_rx_pkt() does multi-mbuf frame recombining based on the RX queue,
this needs to occur.

Tested:

* AR9380 (XB112), hostap mode
2013-04-16 20:21:02 +00:00
Ivan Voras
9a796b22f6 Fix the buffer-overflow-fixing fixes.
Pointy-hat to: me, for not realizing snprintf() is available in kernel.
Thanks to: jh, for bringing me the good news of snprintf(), Pawel Worach, for
           noting that the panic can be provoked in i386 and not in amd64
2013-04-16 19:58:24 +00:00
Pedro F. Giffuni
acc929508b DTrace: print() should try to resolve function pointers
Merge changes from illumos:

3675 DTrace print() should try to resolve function pointers
3676 dt_print_enum hardcodes a value of zero

Illumos Revision:	b1fa6326238973aeaf12c34fcda75985b6c06be1

Reference:
https://www.illumos.org/issues/3675
https://www.illumos.org/issues/3676

Obtained from:	Illumos
MFC after:	1 month
2013-04-16 19:39:27 +00:00
Xin LI
f2297451fe Fix incomplete printf.
PR:		kern/177889
Submitted by:	Sven-Thorsten Dietrich <sven vyatta com>
MFC after:	1 week
2013-04-16 19:32:12 +00:00
Xin LI
c1031303f0 Don't leak lock when returning.
PR:		kern/177888
Submitted by:	Sven-Thorsten Dietrich <sven vyatta com>
MFC after:	1 week
2013-04-16 19:25:41 +00:00
Mikolaj Golub
f1fca82ed5 Add a new set of notes to a process core dump to store procstat data.
The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.

The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.

PR:		kern/173723
Suggested by:	jhb
Discussed with:	jhb, kib
Reviewed by:	jhb (initial version), kib
MFC after:	1 month
2013-04-16 19:19:14 +00:00
Brooks Davis
b7b63db789 Partial MFP4 of 222836:
Only look for FDT partitions if our potential parent is a DISK device.

Excluding direct recursion on the flashmap geoms was insufficient
because it did not prevent the underlying device from being retrieved if
flashmap geoms were further partitioned.

Reviewed by:	imp
Sponsored by:	DARPA, AFRL
2013-04-16 17:47:13 +00:00
Bryan Drewery
d0f41f0f27 Also call configtest before reload to ensure working config.
Approved by:	jhb
MFC after:	1 week
X-MFC-With:	r249489
2013-04-16 17:30:13 +00:00
Robert Watson
a49ab93ce2 Use a suitable code generation when building libstand for MIPS.
Reviewed by:	imp
Sponsored by:	DARPA, AFRL
MFC after:	3 days
2013-04-16 17:20:52 +00:00
Robert Watson
909735d975 Adapt libstand's setjmp/longjmp MIPS support to be portable across 32-bit
and 64-bit MIPS.  Don't use the floating-point coprocessor in the libstand
context for MIPS.

Reviewed by:	imp
MFC after:	3 days
Sponsored by:	DARPA, AFRL
2013-04-16 17:03:35 +00:00
Tijl Coosemans
6c81895dab Fix build after r249543. 2013-04-16 16:59:29 +00:00
Warner Losh
dd65664bbc Point to regdef.h. May need to dig up references to the N32 standard
that support this usage (which may be a bit rough, since different
parts of the standard say mutually contradictory things).
2013-04-16 16:54:37 +00:00
Rick Macklem
64fa8df6e0 Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by:	pho
MFC after:	2 weeks
2013-04-16 14:22:16 +00:00
Pawel Jakub Dawidek
fe3fcf7b3a Correct error message.
Reported by:	Dirk Engling <erdgeist@erdgeist.org>
2013-04-16 12:31:16 +00:00
Andrey V. Elsukov
4ff7c740fe Fix accounting after the r249528, also add several another counters to
the statistics.
2013-04-16 11:31:26 +00:00
Andrey V. Elsukov
b543f3b167 Replace hardcoded numbers. Also use interface-local scope name instead
of node-local.
2013-04-16 11:25:45 +00:00
Andrey V. Elsukov
eca4d72003 Use IP6S_M2MMAX macro. 2013-04-16 11:19:13 +00:00
Andrey V. Elsukov
43851aae9a Replace hardcoded numbers. 2013-04-16 11:12:58 +00:00
Konstantin Belousov
44d95698ba Some compilers issue a warning when wider integer is casted to narrow
pointer.  Supposedly shut down the warning by casting through
uintptr_t.

Reported by:	ian
2013-04-16 07:11:52 +00:00
Andrey V. Elsukov
e7a87117d3 The source address selection algorithm tries to apply several rules
for the set of IPv6 addresses. Now each attempt goes into IPv6 statistics,
even if given rule did not won. Change this and take into account only
those rules, that won. Also add accounting for cases, when algorithm
fails to select an address.
2013-04-15 21:02:40 +00:00
Pedro F. Giffuni
98aff789ce DTrace: NFS translators should be split into client/server pieces
Merge change from illumos:

1731 DTrace NFS translators should be split into client/server pieces

Illumos Revision:	13523:6763769941d2

This code seems to be currently unused on FreeBSD.

Reference:
https://www.illumos.org/issues/1731

Obtained from:	Illumos
MFC after:	1 week
2013-04-15 20:16:31 +00:00
Konstantin Belousov
32e1d8010b The origin_subst_one() function limits the length of the string to
PATH_MAX after the token substitution.  This is wrong, because
origin_subst_one() performs the substitution on the whole rpath and
similar strings, which contain several pathes separated by colon.  As
result, long (but correct) rpath consisting of many path elements is
rejected by the function.

Correct the problem by rewriting the origin_subst_one() to perform two
passes, first pass to calculate the number of substitutions to be
performed, and second pass to generate the resulting string.  Second
pass allocates the memory for the result based on the count from the
first pass, without enforcing a limit.

Reported and tested by:	pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-04-15 20:06:56 +00:00
Warner Losh
e4be3d3ddd Fix N32/N64 register saving by ensuring that all registers resolve
to unique values.

There's some confusion about what the n32 assembler API really is
(since on page 9 of the spec they say that t0-t3 don't exist, then
turn around on page 22 and say that t4-t7 don't exist), and this
doesn't touch that.

NetBSD's version of this file follows the convention I used here, and
is likely to be correct.

This should fix gdb/ptrace.
2013-04-15 19:32:14 +00:00
Xin LI
3719c1c85c Reflect version update.
MFC after:	13 days
2013-04-15 18:35:09 +00:00