Commit Graph

15262 Commits

Author SHA1 Message Date
avg
96691e46d1 fix a thread preemption regression in schedulers introduced in r270423
Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes.  Unfortunately, at the same time it introduced a
new regression.  The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag.  So, (flags &
SWT_RELINQUISH) is true in cases where that was not really intended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).

A straightforward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.

So, I decided that it would be better to check for SW_PREEMPT flag
instead.  That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler
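
The difference between the three checks can be illustrated with a small
userland sketch. The three SWT_* values are the ones quoted above; the
SW_TYPE_MASK and SW_PREEMPT values and all helper names are assumptions
made for illustration and are not copied from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Switch types are small enumerated values, not bit flags. */
#define SWT_OWEPREEMPT		2
#define SWT_RELINQUISH		6
#define SWT_REMOTEPREEMPT	11

/* Assumed layout: the low byte holds the type, higher bits hold flags. */
#define SW_TYPE_MASK		0xff
#define SW_PREEMPT		0x0400

/* Buggy test: true whenever the type shares any bit with 6 (binary 0110). */
static bool
relinquish_buggy(int flags)
{
	return ((flags & SWT_RELINQUISH) != 0);
}

/* Exact test: compare the whole type field against SWT_RELINQUISH. */
static bool
relinquish_exact(int flags)
{
	return ((flags & SW_TYPE_MASK) == SWT_RELINQUISH);
}

/* Flag test: SW_PREEMPT is a real bit flag, so masking is valid. */
static bool
is_preemption(int flags)
{
	return ((flags & SW_PREEMPT) != 0);
}

/* Self-check of the two false positives named in the message. */
static void
check_cases(void)
{
	assert(relinquish_buggy(SWT_OWEPREEMPT));	/* 2 & 6 == 2 */
	assert(relinquish_buggy(SWT_REMOTEPREEMPT));	/* 11 & 6 == 2 */
	assert(!relinquish_exact(SWT_OWEPREEMPT));
	assert(!relinquish_exact(SWT_REMOTEPREEMPT));
}
```

Both the exact comparison and the flag test avoid the false positives;
the sketch only shows why the bitwise test on a non-flag constant
misfires.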

Reviewed by:	kib, mav
MFC after:	4 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9230
2017-01-19 18:46:41 +00:00
mjg
7de7dabf52 sx: reduce lock accesses similarly to r311172
Discussed with:	jhb
Tested by:	pho (previous version)
2017-01-18 17:55:08 +00:00
mjg
fe25936c7a rwlock: reduce lock accesses similarly to r311172
Discussed with:     jhb
Tested by:	pho (previous version)
2017-01-18 17:53:57 +00:00
hselasky
efa6326974 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the
traffic in that sendqueue is actually rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon the next ip_output() call, allocation of a new "snd_tag" will be attempted.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
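
The tag life cycle in steps 1, 3 and 4 can be modelled in userland as
follows. This is only a sketch: every model_* type and function is a
hypothetical stand-in, not the kernel's "struct ifnet", "struct
m_snd_tag" or ip_output(), and mbuf handling is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a network interface. */
struct model_ifnet {
	int tags_allocated;		/* live send tags on this interface */
};

/* Hypothetical stand-in for a send tag. */
struct model_snd_tag {
	struct model_ifnet *ifp;	/* interface that owns the tag */
};

/* Socket-side state: at most one tag, bound to one interface. */
struct model_pcb {
	struct model_snd_tag tag;
	int has_tag;
};

/* Step 1: allocate a tag on the current output interface. */
static void
model_tag_alloc(struct model_pcb *pcb, struct model_ifnet *ifp)
{
	assert(!pcb->has_tag);
	pcb->tag.ifp = ifp;
	pcb->has_tag = 1;
	ifp->tags_allocated++;
}

/* Step 4, also used for recovery in step 3: release the tag. */
static void
model_tag_free(struct model_pcb *pcb)
{
	if (pcb->has_tag) {
		pcb->tag.ifp->tags_allocated--;
		pcb->has_tag = 0;
	}
}

/* Step 3, driver side: reject a tag that belongs to another interface. */
static int
model_if_transmit(struct model_ifnet *ifp, const struct model_snd_tag *tag)
{
	if (tag != NULL && tag->ifp != ifp)
		return (EAGAIN);
	return (0);
}

/* Output path: allocate lazily, drop the tag when the driver says EAGAIN. */
static int
model_output(struct model_pcb *pcb, struct model_ifnet *ifp)
{
	int error;

	if (!pcb->has_tag)
		model_tag_alloc(pcb, ifp);
	error = model_if_transmit(ifp, &pcb->tag);
	if (error == EAGAIN)
		model_tag_free(pcb);
	return (error);
}
```

In this model a route change shows up as one failed model_output() call
on the new interface; the call after that allocates a fresh tag there.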

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
emaste
58cfb7eb21 disambiguate msleep KASSERT diagnostics
Previously "panic: msleep" could happen for a few different reasons.
Break the KASSERTs out into individual cases to identify the failing
condition. Found during the investigation that resulted in r308288.

Reviewed by:	kib, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D8604
2017-01-16 20:34:42 +00:00
sbruno
089c279c96 Remove Assert that seems to be hit in various configurations during
normal operations.
2017-01-16 19:01:41 +00:00
sobomax
701697521c Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides a unified interface to get bintime
(equivalent of using SO_BINTIME), except it is also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to deprecate
SO_BINTIME and move everything to using SO_TS_CLOCK.

The idea for this enhancement was briefly discussed at the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implements a totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by:    adrian, gnn
Differential Revision:  https://reviews.freebsd.org/D9171
2017-01-16 17:46:38 +00:00
sbruno
bfc26a0c94 Change startup order for the no EARLY_AP_STARTUP case to initialize
gtaskqueue bits at SI_SUB_INIT_IF instead of waiting until SI_SUB_SMP
which is far too late.

Add an assertion in taskqgroup_attach() to catch startup initialization
failures in the future.

Reported by:	kib bde
2017-01-16 16:58:12 +00:00
hiren
275c6b6b14 Add kevent EVFILT_EMPTY for notification when a client has received all data
i.e. everything outstanding has been acked.

Reviewed by:	bz, gnn (previous version)
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D9150
2017-01-16 08:25:33 +00:00
cem
b2000e56f9 "Buses" is the preferred plural of "bus"
Replace archaic "busses" with modern form "buses."

Intentionally excluded:
* Old/random drivers I didn't recognize
  * Old hardware in general
* Use of "busses" in code as identifiers

No functional change.

http://grammarist.com/spelling/buses-busses/

PR:		216099
Reported by:	bltsrc at mail.ru
Sponsored by:	Dell EMC Isilon
2017-01-15 17:54:01 +00:00
ngie
aa30e382a0 Revert r312119 and reword the intent to fix -Wshadow issues
between exp(3) and `exp` var.

The approach taken previously was not ideal for multiple
functional and stylistic reasons.

Add to existing sed call in Makefile to replace `exp` with
`exponent` instead.

MFC after:	13 days
Requested by:	bde
2017-01-15 09:25:33 +00:00
markj
a2e0d3fa71 Suppress a warning about m_assertbuf being unused.
MFC after:	1 week
2017-01-15 03:53:20 +00:00
sbruno
c23696ed2d Fix hangs in a uniprocessor configuration (qemu, virtualbox, real hw).
sys/net/iflib.c:
  Add ctx to filter_info and don't skip interrupt early on unless we're on an
  SMP system

sys/kern/subr_gtaskqueue.c:
  Skip smp check if we're running UP

Submitted by:	Matt Macy <mmacy@nextbsd.org>
Reported by:	emaste bde
2017-01-15 00:50:10 +00:00
markj
4028dfd4f0 Stop the scheduler upon panic even in non-SMP kernels.
This is needed for kernel dumps to work, as the panicking thread will call
into code that makes use of kernel locks.

Reported and tested by:	Eugene Grosbein
MFC after:	1 week
2017-01-14 22:16:03 +00:00
ngie
b63e35e8c6 encode_long, encode_timeval: mechanically replace exp with exponent
This helps fix a -Wshadow issue with exp(3) with tests/sys/acct/acct_test,
which includes math.h, which in turn defines exp(3)

MFC after:	2 weeks
Tested with:	clang, gcc 4.2.1, gcc 4.9
Sponsored by:	Dell EMC Isilon
2017-01-14 05:06:14 +00:00
ngie
30749c4615 Clean up trailing whitespace
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2017-01-14 04:16:13 +00:00
ngie
bce93161ba Fix -Wunused on gcc 4.9 (x was set but not used)
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2017-01-14 04:13:28 +00:00
glebius
eaea1f53fc Remove deprecated fgetsock() and fputsock().
2017-01-13 22:16:41 +00:00
ian
9cc22de69f Correct the comments about how much buffer is allocated.
2017-01-13 17:03:23 +00:00
ian
b5d7eeda53 Check tty_gone() after allocating IO buffers. The tty lock has to be
dropped then reacquired due to using M_WAITOK, which opens a window in
which the tty device can disappear.  Check for this and return ENXIO
back up the call chain so that callers can cope.

This closes a race where TF_GONE would get set while buffers were being
allocated as part of ttydev_open(), causing a subsequent call to
ttydevsw_modem() later in ttydev_open() to assert.

Reported by:	pho
Reviewed by:	kib
2017-01-13 16:37:38 +00:00
ian
bbdc02abf8 Restructure the tty_drain loop so that device-busy is checked one more time
after tty_timedwait() returns an error only if the error is EWOULDBLOCK;
other errors cause an immediate return.  This fixes the case of the tty
disappearing while in tty_drain().

Reported by:	pho
2017-01-12 21:18:43 +00:00
rpokala
07d67037b6 Remove writability requirement for single-mbuf, contiguous-range
m_pulldown()

m_pulldown() only needs to determine if a mbuf is writable if it is going to
copy data into the data region of an existing mbuf. It does this to create a
contiguous data region in a single mbuf from multiple mbufs in the chain. If
the requested memory region is already contiguous and nothing needs to
change, the mbuf does not need to be writeable.

Submitted by:	Brian Mueller <bmueller@panasas.com>
Reviewed by:	bz
MFC after:	1 week
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D9053
2017-01-12 06:38:03 +00:00
ian
cd73df196a Rework tty_drain() to poll the hardware for completion, and restore
drain timeout handling to historical freebsd behavior.

The primary reason for these changes is the need to have tty_drain() call
ttydevsw_busy() at some reasonable sub-second rate, to poll hardware that
doesn't signal an interrupt when the transmit shift register becomes empty
(which includes virtually all USB serial hardware).  Such hardware hangs
in a ttyout wait, because it never gets an opportunity to trigger a wakeup
from the sleep in tty_drain() by calling ttydisc_getc() again, after
handing the last of the buffered data to the hardware.

While researching the history of changes to tty_drain() I stumbled across
some email describing the historical BSD behavior of tcdrain() and close()
on serial ports, and the ability of comcontrol(1) to control timeout
behavior.  Using that and some advice from Bruce Evans as a guide, I've
put together these changes to implement the hardware polling and restore
the historical timeout behaviors...

 - tty_drain() now calls ttydevsw_busy() in a loop at 10 Hz to accommodate
   hardware that requires polling for busy state.

 - The "new historical" behavior for draining during close(2) is retained:
   the drain timeout is "1 second without making any progress".  When the
   1-second timeout expires, if the count of bytes remaining in the tty
   layer buffer is smaller than last time, the timeout is extended for
   another second.  Unfortunately, the same logic cannot be extended all
   the way down to the hardware, because the interface to that layer is a
   simple busy/not-busy indication.

 - Due to the previous point, an application that needs a guarantee that
   all data has been transmitted must use TIOCDRAIN/tcdrain(3) before
   calling close(2).

 - The historical behavior of honoring the drainwait setting for TIOCDRAIN
   (used by tcdrain(3)) is restored.

 - The historical kern.drainwait sysctl to control the global default
   drainwait time is restored, but is now named kern.tty_drainwait.

 - The historical default drainwait timeout of 300 seconds is restored.

 - Handling of TIOCGDRAINWAIT and TIOCSDRAINWAIT ioctls is restored
   (this also makes the comcontrol(1) drainwait verb work again).

 - Manpages are updated to document these behaviors.
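
The "1 second without making any progress" rule for close(2) can be
sketched as a pure function over a series of buffer counts sampled at
the 10 Hz polling rate. The function name, the sample array and the
choice to report EWOULDBLOCK when the samples run out are assumptions
for illustration; the hardware busy check is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define POLLS_PER_SECOND	10	/* 10 Hz polling, as described above */

/*
 * samples[i] is the number of bytes still buffered at poll i.
 * Returns 0 once the buffer is empty, or EWOULDBLOCK if a full second
 * passes without the count shrinking (or if the samples run out).
 */
static int
model_drain(const size_t *samples, size_t nsamples)
{
	size_t i, last, polls;

	assert(samples != NULL && nsamples > 0);
	last = samples[0];	/* count at the start of the current second */
	polls = 0;
	for (i = 0; i < nsamples; i++) {
		if (samples[i] == 0)
			return (0);
		if (++polls == POLLS_PER_SECOND) {
			if (samples[i] >= last)
				return (EWOULDBLOCK);	/* no progress */
			last = samples[i];	/* progress: one more second */
			polls = 0;
		}
	}
	return (EWOULDBLOCK);
}
```

A buffer that keeps shrinking, however slowly, is never timed out by
this rule, which is why an explicit drain before close(2) is still
needed when delivery to the hardware must be guaranteed.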

Reviewed by:	bde (prior version)
2017-01-12 00:48:06 +00:00
markj
83d67f5f72 Do not set BIO_DONE if the BIO specifies a completion handler.
biowait() will otherwise race with completions of such BIOs. In-tree code
only calls biowait() on BIOs that do not specify a handler, so this change
should not have any functional impact.

Reviewed by:	mav
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9070
2017-01-10 21:41:28 +00:00
jhb
cb5debd34f Set MORETOCOME for AIO write requests on a socket.
Add a MSG_MORETOCOME message flag. When this flag is set, sosend*
set PRUS_MORETOCOME when invoking the protocol send method. The aio
worker tasks for sending on a socket set this flag when there are
additional write jobs waiting on the socket buffer.

Reviewed by:	adrian
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D8955
2017-01-06 23:41:45 +00:00
kib
d7b44ae42b Explicitly add "opt_compat.h" to kern_exec.c: fix powerpc LINT builds.
sys/ptrace.h includes sys/signal.h, which includes sys/_sigset.h.
Note that sys/_sigset.h only defines osigset_t if COMPAT_43 was defined.

Two lines later, sys/ptrace.h includes machine/reg.h, which in case of
powerpc, includes opt_compat.h.

After the include headers reordering in r311345, we have sys/ptrace.h
included before sys/sysproto.h.

If COMPAT_43 was requested in the kernel config, the result is that
sys/_sigset.h does not define osigset_t, but sys/sysproto.h sees
COMPAT_43 and uses osigset_t.

Fix this by explicitly including opt_compat.h to cover the whole
kern/kern_exec.c scope.

Sponsored by:	The FreeBSD Foundation
2017-01-06 16:56:24 +00:00
kib
5def9fa2c2 Do not allocate struct statfs on kernel stack.
Right now the size of the structure is 472 bytes on amd64, which is
already large and stack allocations are undesirable.  With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.

Extracted from:	ino64 work by gleb
Discussed with:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:19:26 +00:00
kib
6d69bbcc31 Some style fixes for getfstat(2)-related code.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:03:35 +00:00
markj
d0b3c7cb46 Add a small allocator for exec_map entries.
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.

The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8921
2017-01-05 01:44:12 +00:00
markj
1c83f7347b Sort includes in kern_exec.c.
MFC after:	1 week
2017-01-05 01:28:08 +00:00
glebius
01e1e94c27 Move bogus_page declaration to vm_page.h and initialization to vm_page.c.
Reviewed by:	kib
2017-01-04 22:27:19 +00:00
kib
e1aff3b457 The callers of kern_getfsstat(UIO_SYSSPACE) expect that *buf always
returns memory which must be freed, regardless of the error.  Assign
NULL to *buf in case we are not going to allocate any memory due to
invalid mode.
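
The caller contract described here can be sketched in userland as
follows. The function name, the mode constants and the 64-byte record
size are hypothetical placeholders, not the kernel's kern_getfsstat()
interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODE_WAIT	1	/* hypothetical valid mode values */
#define MODE_NOWAIT	2

/*
 * On every return path *buf is either NULL or memory the caller owns,
 * so the caller can free it without inspecting the error.
 */
static int
model_getfsstat(int mode, size_t count, void **buf)
{
	assert(buf != NULL);
	if (mode != MODE_WAIT && mode != MODE_NOWAIT) {
		*buf = NULL;	/* nothing allocated: free() becomes a no-op */
		return (EINVAL);
	}
	*buf = calloc(count, 64);	/* 64: placeholder record size */
	if (*buf == NULL)
		return (ENOMEM);
	return (0);
}
```

Without the NULL assignment on the EINVAL path, a caller that frees
*buf unconditionally would free whatever stale pointer it passed in.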

Reported and tested by:	pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks (together with r310638)
Differential revision:	https://reviews.freebsd.org/D9042
2017-01-04 16:09:45 +00:00
trasz
5166e57c9a Fix bug that would result in a kernel crash in some cases involving
a symlink and an autofs mount request.  The crash was caused by namei()
calling bcopy() with a negative length, caused by numeric underflow:
in lookup(), in the relookup path, the ni_pathlen was decremented too
many times.  The bug was introduced in r296715.

Big thanks to Alex Deiter for his help with debugging this.

Reviewed by:	kib@
Tested by:	Alex Deiter <alex.deiter at gmail.com>
MFC after:	1 month
2017-01-04 14:43:57 +00:00
mjg
f62b14bceb mtx: plug open-coded mtx_lock access missed in r311172
2017-01-04 02:25:31 +00:00
mjg
15754c8600 Reduce lock accesses in thread lock similarly to r311172.
2017-01-03 23:08:11 +00:00
mjg
232bc14718 mtx: reduce lock accesses
Instead of spuriously re-reading the lock value, read it once.

This change also has a side effect of fixing a performance bug:
on failed _mtx_obtain_lock, it was possible that re-read would find
the lock is unowned, but in this case the primitive would make a trip
through turnstile code.

This is diff reduction to a variant which uses atomic_fcmpset.
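
The "read it once" idea can be shown with C11 atomics, whose
compare-exchange writes the observed value back into its "expected"
argument on failure. This is a userland analogue under that assumption,
not the kernel's mutex code; the names and the LOCK_UNOWNED value are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_UNOWNED	((uintptr_t)0)	/* hypothetical "free" value */

/*
 * Try to take the lock for "tid".  The lock word is loaded once; if the
 * compare-exchange then fails, "v" already holds the value that was
 * observed, so it is reported through *owner without another load.
 * Returns 1 on success and 0 if the lock is owned by someone else.
 */
static int
model_trylock(_Atomic uintptr_t *word, uintptr_t tid, uintptr_t *owner)
{
	uintptr_t v;

	assert(tid != LOCK_UNOWNED);
	v = atomic_load_explicit(word, memory_order_relaxed);
	for (;;) {
		if (v != LOCK_UNOWNED) {
			/* Owned: a real lock would block or spin here. */
			*owner = v;
			return (0);
		}
		if (atomic_compare_exchange_strong_explicit(word, &v, tid,
		    memory_order_acquire, memory_order_relaxed))
			return (1);
		/* Exchange failed: v was updated to the current owner. */
	}
}
```

Because the slow-path decision is made on the value the failed exchange
returned, the code never acts on a second, possibly different, read of
the lock word.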

Discussed with:	jhb (previous version)
Tested by:	pho (previous version)
2017-01-03 21:36:15 +00:00
kib
778442cef7 There is no need to use temporary statfs buffer for fsid obliteration
and prison enforcement.  Do it on the caller buffer directly.

Besides eliminating memory copies, this change also removes a large
structure from the kernel stack.

Extracted from:	ino64 work by gleb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:59:23 +00:00
kib
9a7682de00 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:49:48 +00:00
kib
577892f66a Move common code from kern_statfs() and kern_fstatfs() into a new helper.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:20:22 +00:00
markj
21bee8f0b6 Factor out instances of a knote detach followed by a knote_drop() call.
Reviewed by:	kib (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9015
2017-01-02 01:23:21 +00:00
sbruno
8746ee3dab 2017 IFLIB updates in preparation for commits to e1000 and ixgbe.
- iflib - add checksum in place support (mmacy)
- iflib - initialize IP for TSO (going to be needed for e1000) (mmacy)
- iflib - move isc_txrx from shared context to softc context (mmacy)
- iflib - Normalize checks in TXQ drainage. (shurd)
- iflib - Fix queue capping checks (mmacy)
- iflib - Fix invalid assert, em can need 2 sentinels (mmacy)
- iflib - let the driver determine what capabilities are set and what
          tx csum flags are used (mmacy)
- add INVARIANTS debugging hooks to gtaskqueue enqueue (mmacy)
- update bnxt(4) to support the changes to iflib (shurd)

Some other various, sundry updates.  Slightly more verbose changelog:

Submitted by:	mmacy@nextbsd.org
Reviewed by:	shurd
mFC after:
Sponsored by:	LimeLight Networks and Dell EMC Isilon
2017-01-02 00:56:33 +00:00
mjg
290ab10d4d fd: access openfiles once in falloc_noinstall
This is similar to what's done with nprocs.

Note this is only a band aid.
2017-01-01 08:55:28 +00:00
mjg
9412199098 vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9)
Reviewed by:	kib
2016-12-31 19:59:31 +00:00
mjg
f4dcd1882e Remove cpu_spinwait after seq_consistent.
It does not add any benefit as the read routine will do it as necessary.
2016-12-30 06:26:17 +00:00
mjg
4e793cf052 cache: sprinkle __predict_false
2016-12-29 16:35:49 +00:00
mjg
be2842cb28 cache: move shrink lock init to nchinit
This gets rid of unnecessary sysinit usage.

While here also rename the lock to be consistent with the rest.
2016-12-29 12:01:54 +00:00
mjg
20712de956 cache: depessimize hashing macros/inlines
All hash sizes are power-of-2, but the compiler does not know that for sure
and 'foo % size' forces doing a division.

Store the size - 1 and use 'foo & hash' instead which allows mere shift.
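
A minimal sketch of the two bucket computations, assuming a 32-bit hash
and a hypothetical table descriptor (none of these names come from the
name cache code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table descriptor; the mask is size - 1, stored once. */
struct model_hash {
	uint32_t mask;
};

static void
model_hash_init(struct model_hash *h, uint32_t size)
{
	/* The trick only works when size is a nonzero power of two. */
	assert(size != 0 && (size & (size - 1)) == 0);
	h->mask = size - 1;
}

/* Bucket selection with a modulo: needs a division for unknown sizes. */
static uint32_t
bucket_mod(uint32_t hash, uint32_t size)
{
	return (hash % size);
}

/* Bucket selection with the stored mask: a single AND. */
static uint32_t
bucket_mask(const struct model_hash *h, uint32_t hash)
{
	return (hash & h->mask);
}
```

For any power-of-two size both functions pick the same bucket; only the
masked form avoids the division when the size is not a compile-time
constant.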
2016-12-29 08:41:25 +00:00
mjg
fd6a71fa0c cache: drop the NULL check from VP2VNODELOCK
Now that negative entries are annotated with a dedicated flag, NULL vnodes
are no longer passed.
2016-12-29 08:34:50 +00:00
jhb
a8a4d60efe Regen after r310638.
Differential Revision:	https://reviews.freebsd.org/D8854
2016-12-27 20:22:17 +00:00
jhb
9105298a66 Rename the 'flags' argument to getfsstat() to 'mode' and validate it.
This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'flag'.  Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.

This is a followup to r308088.

Reviewed by:	kib
MFC after:	1 month
2016-12-27 20:21:11 +00:00