15275 Commits

Author SHA1 Message Date
Hartmut Brandt
4b481ba0ed Merge filt_soread and filt_solisten and decide what to do when checking
for EVFILT_READ at the point of the check, not when the event is registered.
This fixes a problem with asio when accepting a connection.

Reviewed by:	kib@, Scott Mitchell
2017-02-01 13:12:07 +00:00
Jason A. Harmening
65ed483615 Implement get_pcpu() for the remaining architectures and use it to
replace pcpu_find(curcpu) in MI code.
2017-02-01 03:32:49 +00:00
Edward Tomasz Napierala
b38b22b0b2 Add kern_pread() and kern_pwrite(), and use them in compats instead
of their sys_*() counterparts. The svr4 is left unchanged.

Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9379
2017-01-31 15:35:18 +00:00
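
The kern_*() / sys_*() split used in b38b22b0b2 and the related commits in this log can be illustrated with a minimal userland sketch. Every type and function name ending in _sketch is a hypothetical stand-in and not the real FreeBSD definition; the point is only that a sys_*() entry point unpacks the native argument structure, while a compat layer can call the kern_*() function directly with arguments it has already converted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's per-thread state. */
struct thread_sketch {
    int64_t retval;       /* value the "syscall" would return to userland */
    int64_t last_offset;  /* recorded only so the sketch can be checked */
};

/* Hypothetical stand-in for the native syscall argument structure. */
struct pread_args_sketch {
    int fd;
    void *buf;
    size_t nbyte;
    int64_t offset;
};

/* kern_*() style: plain arguments, callable from any ABI layer. */
int
kern_pread_sketch(struct thread_sketch *td, int fd, void *buf, size_t nbyte,
    int64_t offset)
{
    (void)fd;
    (void)buf;
    td->last_offset = offset;
    td->retval = (int64_t)nbyte;  /* pretend the whole read succeeded */
    return (0);
}

/* sys_*() style: unpack the native argument structure and delegate. */
int
sys_pread_sketch(struct thread_sketch *td, struct pread_args_sketch *uap)
{
    return (kern_pread_sketch(td, uap->fd, uap->buf, uap->nbyte,
        uap->offset));
}

/*
 * Compat caller: a 32-bit ABI passes the offset as two halves, so it
 * combines them and calls kern_*() directly instead of building a fake
 * native argument structure for sys_*().
 */
int
compat32_pread_sketch(struct thread_sketch *td, int fd, void *buf,
    size_t nbyte, uint32_t off_lo, uint32_t off_hi)
{
    int64_t offset = (int64_t)(((uint64_t)off_hi << 32) | off_lo);

    return (kern_pread_sketch(td, fd, buf, nbyte, offset));
}
```

With this shape the compat code never has to fabricate a native argument structure, which is presumably the motivation for the series of kern_*() additions.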
Edward Tomasz Napierala
fc8bde8ffe Replace calls to sys_truncate() with kern_truncate().
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9371
2017-01-31 15:19:44 +00:00
Edward Tomasz Napierala
ea2ebdc19e Add kern_cpuset_getid() and kern_cpuset_setid(), and use them
in compat32 instead of their sys_*() counterparts.

Reviewed by:	jhb@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9382
2017-01-31 15:11:23 +00:00
Andriy Gapon
826b3d3187 put very expensive sanity checks of advisory locks under DIAGNOSTIC
The checks have quadratic complexity over a number of advisory locks
active for a file and that could be a lot.  What's worse is that the
checks are done while holding ls_lock.  That could lead to a very
long backlog and performance degradation even if all requested locks are
compatible (e.g. all shared locks).

The checks used to be under INVARIANTS.

Discussed with:	kib
MFC after:	2 weeks
Sponsored by:	Panzura
2017-01-30 15:20:13 +00:00
Edward Tomasz Napierala
d293f35c09 Add kern_listen(), kern_shutdown(), and kern_socket(), and use them
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.

Reviewed by:	ed@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9367
2017-01-30 12:57:22 +00:00
Edward Tomasz Napierala
f67d6b5f12 Add kern_lseek() and use it instead of sys_lseek() in various compats.
I didn't touch svr4/, there's no point.

Reviewed by:	ed@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9366
2017-01-30 12:24:47 +00:00
Edward Tomasz Napierala
ae6b6ef6cb Replace sys_ftruncate() with kern_ftruncate() in various compats.
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9368
2017-01-30 11:50:54 +00:00
Mateusz Guzik
dfecf51dd0 cache: use vrefact for '.' lookups and refing the rdir in fullpath
2017-01-30 03:20:05 +00:00
Mateusz Guzik
3071469d57 fd: sprinkle __read_mostly and __exclusive_cache_line
2017-01-30 03:07:32 +00:00
Baptiste Daroussin
b4b4b5304b Revert crap accidentally committed
2017-01-28 16:31:23 +00:00
Baptiste Daroussin
814aaaa7da Revert r312923 a better approach will be taken later
2017-01-28 16:30:14 +00:00
Mateusz Guzik
95839d3d25 hwpmc: annotate pmc_hook and pmc_intr as __read_mostly
MFC after:	1 month
2017-01-27 22:14:42 +00:00
Mateusz Guzik
f1f7f1cb29 hwpmc: partially depessimize mmap handling if the module is not loaded
In particular this means the pmc sx lock is no longer taken when an
executable mapping succeeds.

MFC after:	1 week
2017-01-27 22:13:15 +00:00
Mateusz Guzik
290511163d Sprinkle __read_mostly on backoff and lock profiling code.
MFC after:	1 month
2017-01-27 15:03:51 +00:00
Mateusz Guzik
17071ff298 cache: annotate with __read_mostly and __exclusive_cache_line
MFC after:	1 month
2017-01-27 14:56:36 +00:00
Sean Bruno
de414cfe14 A few more style bugs lying around in here.
Submitted by:	bde
2017-01-26 13:48:45 +00:00
Gleb Smirnoff
beb4b31200 For non-listening AF_UNIX sockets return error code EOPNOTSUPP to match
documentation and SUS.
2017-01-25 22:26:45 +00:00
Ed Maste
f27ac8e297 ANSIfy kern_ntptime.c
2017-01-25 20:22:32 +00:00
Sean Bruno
06bb7c507a Replace overlooked smp_started checks and variable use in a print
with the now used tqg_smp_started.

Submitted by:	bde
2017-01-25 15:54:44 +00:00
Ed Maste
77ebe276ba imgact_elf: refactor et_dyn_addr calculation
This simplifies the logic somewhat. It is extracted from the change in
review in D5603.

Differential Revision:	https://reviews.freebsd.org/D9321
2017-01-24 22:46:43 +00:00
Mateusz Guzik
543b2f425d proc: perform a lockless check in sys_issetugid
Discussed with:	kib
MFC after:	1 week
2017-01-24 21:48:57 +00:00
Conrad Meyer
90a79ac576 Use time_t for intermediate values to avoid overflow in clock_ts_to_ct
Add additional safety and overflow checks to clock_ts_to_ct and the
BCD routines while we're here.

Perform a safety check in sys_clock_settime() first to avoid easy local
root panic, without having to propagate an error value back through
dozens of APIs currently lacking error returns.

PR:		211960, 214300
Submitted by:	Justin McOmie <justin.mcomie at gmail.com>, kib@
Reported by:	Tim Newsham <tim.newsham at nccgroup.trust>
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon, FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D9279
2017-01-24 18:05:29 +00:00
Sean Bruno
bd84f70044 iflib:
Add internal tracking of smp startup status to reliably figure out
     what methods are to be used to get gtaskqueue up and running.

e1000:
     Calculating this pointer gives undefined behaviour when (last == -1)
     (it is before the buffer).  The pointer is always followed.  Panics
     occurred when it points to an unmapped page.  Otherwise, the pointed-to
     garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
     broken case the loop was usually null and the function just returned, and
     this was accidentally correct.

Submitted by:	bde
Reported by:	Matt Macy <mmacy@nextbsd.org>
2017-01-24 16:05:42 +00:00
Sean Bruno
36fa5d5b64 Revert 312696 due to build tests.
2017-01-24 15:55:52 +00:00
Sean Bruno
562a3182f6 iflib:
Add internal tracking of smp startup status to reliably figure out
   what methods are to be used to get gtaskqueue up and running.

e1000:
   Calculating this pointer gives undefined behaviour when (last == -1)
   (it is before the buffer).  The pointer is always followed.  Panics
   occurred when it points to an unmapped page.  Otherwise, the pointed-to
   garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
   broken case the loop was usually null and the function just returned, and
   this was accidentally correct.

Submitted by:	bde
Reviewed by:	Matt Macy <mmacy@nextbsd.org>
2017-01-24 14:48:32 +00:00
Konstantin Belousov
3467f88cd6 Add comments explaining unobvious td_critnest adjustments in
critical_exit().

Based on the discussion with:	jhb
Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	D9276
MFC after:	1 week
2017-01-22 19:41:42 +00:00
Konstantin Belousov
25c6816845 More style cleanup. Use ANSI C definition for vn_closefile(). Switch
to VNASSERT in _vn_lock(), simplify messages.

Sponsored by:	The FreeBSD Foundation
X-MFC with:	r312600, r312601, r312602, r312606
2017-01-22 19:38:45 +00:00
Konstantin Belousov
aec8391d46 Provide fallback VOP methods for crossmp vnode.
In particular, crossmp vnode might leak into rename code.

PR:	216380
Reported by:	fnacl@protonmail.com
Sponsored by:	The FreeBSD Foundation
X-MFC with:	r309425
2017-01-22 19:36:02 +00:00
Edward Tomasz Napierala
5c93966020 Remove redundant KASSERT.
2017-01-22 15:35:51 +00:00
Edward Tomasz Napierala
8acac5a9f5 Improve debugging printf.
2017-01-22 15:27:14 +00:00
Mateusz Guzik
eaf0969bda vfs: fix LK_RETRY logic braino in r312600
2017-01-21 20:34:20 +00:00
Mateusz Guzik
829857c893 vfs: __predict_false the need to handle F_HASLOCK
Also reorder the check with DTYPE_VNODE. Passed files are vnodes the vast
majority of the time, so it is typically true.
2017-01-21 19:01:42 +00:00
Mateusz Guzik
abbc538d9a vfs: fix whitespace damage in r312600
While here wrap the previously overly long line so that it fits 80 chars.
2017-01-21 18:56:58 +00:00
Mateusz Guzik
1091fb52c1 vfs: refactor _vn_lock
Stop testing for LK_RETRY and error multiple times. Also postpone the
VI_DOOMED check until after LK_RETRY was seen as it reads from the vnode.

No functional changes.
2017-01-21 18:38:16 +00:00
Mateusz Guzik
067115e050 vfs: hide the getvnode NULL mp message behind DIAGNOSTIC
Since crossmp vnode changes the message was being printed on each boot.

Reported by:	trasz
Discussed with:	kib
2017-01-21 16:59:50 +00:00
Hans Petter Selasky
10c8755706 Fix for race leading to endless timer interrupts related to
configtimer().

During normal operation "state->nextcallopt" will always be less than
or equal to "state->nextcall" and checking only "state->nextcallopt"
before calling "callout_process()" is sufficient. However when
"configtimer()" is called a race might happen requiring both of these
binary times to be checked.

Short description of race:

1) A configtimer() call will reset both "state->nextcall" and
"state->nextcallopt" to the same binary time.

2) If a "callout_reset()" call happens between "configtimer()" and the
next "callout_process()" call, "state->nextcallopt" will get updated
and "state->nextcall" will remain at the current time. Refer to logic
inside cpu_new_callout().

3) getnextcpuevent() only respects "state->nextcall" and returns this
value over and over again, even if it is in the past, until "now >=
state->nextcallopt" becomes true. Then these two time variables are
corrected by a "callout_process()" call and the situation goes back to
normal.

The problem manifests itself in different ways. The common factor is
the timer process(es) consume all CPU on one or more CPU cores for a
long time, blocking other kernel processes from getting execution
time. This can be seen by very high interrupt counts as displayed by
"vmstat -i | grep timer" right after boot.

When EARLY_AP_STARTUP was enabled in r310177 the likelihood of hitting
this bug apparently increased.

Example output from "vmstat -i" before patch:
cpu0:timer                          7591         69
cpu9:timer                      39031773     358089
cpu4:timer                          9359         85
cpu3:timer                          9100         83
cpu2:timer                          9620         88

Example output from "vmstat -i" after patch:
cpu0:timer                          4242         34
cpu6:timer                          5531         44
cpu3:timer                          6450         52
cpu1:timer                          4545         36
cpu9:timer                          7153         58

Before the patch, cpu9 in the example above was spinning in a loop in
order to reach 39 million interrupts just a few seconds after
bootup. After the patch the timer interrupt counts are more or less
consistent.

Discussed with:		mav @
Reported by:		several people
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2017-01-20 17:40:31 +00:00
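
The race described in 10c8755706 comes down to which of the two times is consulted before callout_process() is called. The following is a minimal sketch of that decision only, assuming plain 64-bit integers in place of the kernel's binary time type; the struct and function names are hypothetical and the real patch may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU timer state. */
struct timer_state_sketch {
    int64_t nextcall;     /* time returned by getnextcpuevent() */
    int64_t nextcallopt;  /* normally <= nextcall */
};

/* Pre-patch behaviour as described: only nextcallopt is consulted. */
bool
should_process_before(const struct timer_state_sketch *s, int64_t now)
{
    return (now >= s->nextcallopt);
}

/* Post-patch behaviour as described: both times are consulted. */
bool
should_process_after(const struct timer_state_sketch *s, int64_t now)
{
    return (now >= s->nextcallopt || now >= s->nextcall);
}
```

In the racy state from step 2 (nextcall left in the past, nextcallopt moved into the future), only the second predicate asks for callout_process() to run, which is what lets the two times be corrected instead of the timer firing repeatedly.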
Ed Maste
039644eca9 ANSIfy kern_ktrace.c and remove archaic register keyword
Sponsored by:	The FreeBSD Foundation
2017-01-20 14:59:56 +00:00
Andriy Gapon
c468ff880a don't abort writing of a core dump after EFAULT
It's possible to get EFAULT when writing a segment backed by a file
if the segment extends beyond the file.
The core dump could still be useful if we skip the rest of the segment
and proceed to other segments.
The skipped segment (or a portion of it) will be zero-filled.

While there, use 'const' to signify that core_write() only reads the
buffer and use __DECONST before calling vn_rdwr_inchunks() because it
can be used for both reading and writing.

Before the change:
kernel: Failed to write core file for process mmap_trunc_core (error 14)
kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6

After the change:
kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core
kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped)

Reviewed by:	julian, kib
Obtained from:	Panzura (older version of the change)
MFC after:	5 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9233
2017-01-20 13:39:07 +00:00
Andriy Gapon
ad9dadc437 fix a thread preemption regression in schedulers introduced in r270423
Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes.  Unfortunately, at the same time it introduced a
new regression.  The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag.  So, (flags &
SWT_RELINQUISH) is true in cases where that was not really intended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).

A straightforward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.

So, I decided that it would be better to check for SW_PREEMPT flag
instead.  That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler

Reviewed by:	kib, mav
MFC after:	4 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9230
2017-01-19 18:46:41 +00:00
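
The bit-test problem described in ad9dadc437 can be reproduced in isolation. In this sketch the three SWT_* values are the ones quoted in the commit message; the SW_TYPE_MASK and SW_PREEMPT values are assumptions chosen for illustration, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Switch type values as quoted in the commit message. */
#define SWT_OWEPREEMPT      2
#define SWT_RELINQUISH      6
#define SWT_REMOTEPREEMPT  11

/* Assumed values for illustration; check sys/proc.h for the real ones. */
#define SW_TYPE_MASK   0xff
#define SW_PREEMPT     0x0400

/* The regressed test: treats an enumerated value as if it were a bit flag. */
bool
relinquish_by_and(int flags)
{
    return ((flags & SWT_RELINQUISH) != 0);
}

/* The straightforward fix: compare the masked type for equality. */
bool
relinquish_by_mask(int flags)
{
    return ((flags & SW_TYPE_MASK) == SWT_RELINQUISH);
}

/* The approach the commit chose: test the real bit flag instead. */
bool
is_preemption(int flags)
{
    return ((flags & SW_PREEMPT) != 0);
}
```

Because 6 is binary 110, any switch type with bit 1 or bit 2 set passes the first test; 2 (010) and 11 (1011) both do, which matches the two examples given in the commit message.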
Mateusz Guzik
c5f61e6f96 sx: reduce lock accesses similarly to r311172
Discussed with:	jhb
Tested by:	pho (previous version)
2017-01-18 17:55:08 +00:00
Mateusz Guzik
3f0a0612e8 rwlock: reduce lock accesses similarly to r311172
Discussed with:     jhb
Tested by:	pho (previous version)
2017-01-18 17:53:57 +00:00
Hans Petter Selasky
f3e7afe2d7 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon the next ip_output(), allocation of a new "snd_tag" will be attempted.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
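
Steps 2 and 3 of the rate limiting description in f3e7afe2d7 can be sketched as a driver-side decision. All types below are hypothetical, heavily simplified stand-ins for struct ifnet, struct m_snd_tag and struct mbuf, and a real driver would also free the mbuf in the mismatch case; only the control flow described in the commit message is shown.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct ifnet_sketch {
    int unit;
};

struct snd_tag_sketch {
    struct ifnet_sketch *ifp;  /* interface that allocated the tag */
};

struct mbuf_sketch {
    struct snd_tag_sketch *snd_tag;  /* NULL when not rate limited */
};

enum tx_queue_sketch {
    TX_QUEUE_DEFAULT,
    TX_QUEUE_RATE_LIMITED
};

/*
 * Pick a send queue for an outgoing packet.  Returns 0 and sets *queue
 * on success, or EAGAIN when the tag belongs to a different interface,
 * which is how a route change is reported back to the caller.
 */
int
transmit_sketch(struct ifnet_sketch *ifp, struct mbuf_sketch *m,
    enum tx_queue_sketch *queue)
{
    if (m->snd_tag == NULL) {
        *queue = TX_QUEUE_DEFAULT;
        return (0);
    }
    if (m->snd_tag->ifp != ifp)
        return (EAGAIN);  /* caller releases the tag and retries */
    *queue = TX_QUEUE_RATE_LIMITED;
    return (0);
}
```

Returning EAGAIN rather than silently using the default queue keeps the stale tag from lingering: the caller learns that the tag no longer matches the route and can allocate a new one on the next transmission.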
Ed Maste
bf9ebe74e2 disambiguate msleep KASSERT diagnostics
Previously "panic: msleep" could happen for a few different reasons.
Break the KASSERTs out into individual cases to identify the failing
condition. Found during the investigation that resulted in r308288.

Reviewed by:	kib, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D8604
2017-01-16 20:34:42 +00:00
Sean Bruno
374f3e042c Remove Assert that seems to be hit in various configurations during
normal operations.
2017-01-16 19:01:41 +00:00
Maxim Sobolev
339efd75a4 Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides a unified interface to get bintime
(equivalent of using SO_BINTIME), except it is also supported with IPv6, where
SO_BINTIME has never been supported. The long term plan is to deprecate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implements a totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by:    adrian, gnn
Differential Revision:  https://reviews.freebsd.org/D9171
2017-01-16 17:46:38 +00:00
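
A userland sketch of selecting the monotonic clock source added in 339efd75a4 might look as follows. The commit message does not name the clock constants; SO_TS_MONOTONIC is used here on the assumption that it is the name exposed by FreeBSD's <sys/socket.h>, and the helper name is hypothetical. On systems without SO_TS_CLOCK the helper fails with ENOPROTOOPT instead of failing to compile.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

/*
 * Enable receive timestamps on socket s and ask for the nanosecond
 * resolution monotonic clock.  Returns 0 on success, -1 with errno set
 * on failure.
 */
int
enable_monotonic_timestamps(int s)
{
#if defined(SO_TS_CLOCK) && defined(SO_TS_MONOTONIC)
    int on = 1;
    int clock = SO_TS_MONOTONIC;  /* assumed constant name, see above */

    /* SO_TS_CLOCK only takes effect when SO_TIMESTAMP is enabled. */
    if (setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on)) == -1)
        return (-1);
    return (setsockopt(s, SOL_SOCKET, SO_TS_CLOCK, &clock, sizeof(clock)));
#else
    (void)s;
    errno = ENOPROTOOPT;  /* option not available on this platform */
    return (-1);
#endif
}
```

The timestamp itself would then arrive as ancillary data on recvmsg(2); parsing that control message is left out because the commit message does not spell out the SCM_XXX type names.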
Sean Bruno
227743cad4 Change startup order for the no EARLY_AP_STARTUP case to initialize
gtaskqueue bits at SI_SUB_INIT_IF instead of waiting until SI_SUB_SMP
which is far too late.

Add an assertion in taskqgroup_attach() to catch startup initialization
failures in the future.

Reported by:	kib, bde
2017-01-16 16:58:12 +00:00
Hiren Panchasara
7d03ff1fe9 Add kevent EVFILT_EMPTY for notification when a client has received all data
i.e. everything outstanding has been acked.

Reviewed by:	bz, gnn (previous version)
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D9150
2017-01-16 08:25:33 +00:00
Conrad Meyer
db4fcadf52 "Buses" is the preferred plural of "bus"
Replace archaic "busses" with modern form "buses."

Intentionally excluded:
* Old/random drivers I didn't recognize
  * Old hardware in general
* Use of "busses" in code as identifiers

No functional change.

http://grammarist.com/spelling/buses-busses/

PR:		216099
Reported by:	bltsrc at mail.ru
Sponsored by:	Dell EMC Isilon
2017-01-15 17:54:01 +00:00