freebsd-dev

Author	SHA1	Message	Date
Sean Bruno	bd84f70044	iflib: Add internal tracking of smp startup status to reliably figure out what methods are to be used to get gtaskqueue up and running. e1000: Calculating this pointer gives undefined behaviour when (last == -1) (it is before the buffer). The pointer is always followed. Panics occurred when it points to an unmapped page. Otherwise, the pointed-to garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the broken case the loop was usually null and the function just returned, and this was accidentally correct. Submitted by: bde Reported by: Matt Macy <mmacy@nextbsd.org>	2017-01-24 16:05:42 +00:00
Sean Bruno	36fa5d5b64	Revert 312696 due to build tests.	2017-01-24 15:55:52 +00:00
Sean Bruno	562a3182f6	iflib: Add internal tracking of smp startup status to reliably figure out what methods are to be used to get gtaskqueue up and running. e1000: Calculating this pointer gives undefined behaviour when (last == -1) (it is before the buffer). The pointer is always followed. Panics occurred when it points to an unmapped page. Otherwise, the pointed-to garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the broken case the loop was usually null and the function just returned, and this was accidentally correct. Submitted by: bde Reviewed by: Matt Macy <mmacy@nextbsd.org>	2017-01-24 14:48:32 +00:00
Konstantin Belousov	3467f88cd6	Add comments explaining unobvious td_critnest adjustments in critical_exit(). Based on the discussion with: jhb Reviewed by: imp Sponsored by: The FreeBSD Foundation Differential revision: D9276 MFC after: 1 week	2017-01-22 19:41:42 +00:00
Konstantin Belousov	25c6816845	More style cleanup. Use ANSI C definition for vn_closefile(). Switch to VNASSERT in _vn_lock(), simplify messages. Sponsored by: The FreeBSD Foundation X-MFC with: r312600, r312601, r312602, r312606	2017-01-22 19:38:45 +00:00
Konstantin Belousov	aec8391d46	Provide fallback VOP methods for crossmp vnode. In particular, crossmp vnode might leak into rename code. PR: 216380 Reported by: fnacl@protonmail.com Sponsored by: The FreeBSD Foundation X-MFC with: r309425	2017-01-22 19:36:02 +00:00
Edward Tomasz Napierala	5c93966020	Remove redundant KASSERT.	2017-01-22 15:35:51 +00:00
Edward Tomasz Napierala	8acac5a9f5	Improve debugging printf.	2017-01-22 15:27:14 +00:00
Mateusz Guzik	eaf0969bda	vfs: fix LK_RETRY logic braino in r312600	2017-01-21 20:34:20 +00:00
Mateusz Guzik	829857c893	vfs: __predict_false the need to handle F_HASLOCK Also reorder the check with DTYPE_VNODE. Passed files are vnodes the vast majority of the time, so it is typically true.	2017-01-21 19:01:42 +00:00
Mateusz Guzik	abbc538d9a	vfs: fix whitespace damage in r312600 While here wrap the previously overly long line so that it fits 80 chars.	2017-01-21 18:56:58 +00:00
Mateusz Guzik	1091fb52c1	vfs: refactor _vn_lock Stop testing for LK_RETRY and error multiple times. Also postpone the VI_DOOMED until after LK_RETRY was seen as it reads from the vnode. No functional changes.	2017-01-21 18:38:16 +00:00
Mateusz Guzik	067115e050	vfs: hide the getvnode NULL mp message behind DIAGNOSTIC Since crossmp vnode changes the message was being printed on each boot. Reported by: trasz Discussed with: kib	2017-01-21 16:59:50 +00:00
Hans Petter Selasky	10c8755706	Fix for race leading to endless timer interrupts related to configtimer(). During normal operation "state->nextcallopt" will always be less than or equal to "state->nextcall" and checking only "state->nextcallopt" before calling "callout_process()" is sufficient. However when "configtimer()" is called a race might happen requiring both of these binary times to be checked. Short description of race: 1) A configtimer() call will reset both "state->nextcall" and "state->nextcallopt" to the same binary time. 2) If a "callout_reset()" call happens between "configtimer()" and the next "callout_process()" call, "state->nextcallopt" will get updated and "state->nextcall" will remain at the current time. Refer to logic inside cpu_new_callout(). 3) getnextcpuevent() only respects "state->nextcall" and returns this value over and over again, even if it is in the past, until "now >= state->nextcallopt" becomes true. Then these two time variables are corrected by a "callout_process()" call and the situation goes back to normal. The problem manifests itself in different ways. The common factor is the timer process(es) consume all CPU on one or more CPU cores for a long time, blocking other kernel processes from getting execution time. This can be seen by very high interrupt counts as displayed by "vmstat -i | grep timer" right after boot. When EARLY_AP_STARTUP was enabled in r310177 the likelihood of hitting this bug apparently increased. Example output from "vmstat -i" before patch: cpu0:timer 7591 69 cpu9:timer 39031773 358089 cpu4:timer 9359 85 cpu3:timer 9100 83 cpu2:timer 9620 88 Example output from "vmstat -i" after patch: cpu0:timer 4242 34 cpu6:timer 5531 44 cpu3:timer 6450 52 cpu1:timer 4545 36 cpu9:timer 7153 58 Before the patch cpu9 in the example above, was spinning in a loop in order to reach 39 million interrupts just a few seconds after bootup. After the patch the timer interrupt counts are more or less consistent. Discussed with: mav @ Reported by: several people MFC after: 1 week Sponsored by: Mellanox Technologies	2017-01-20 17:40:31 +00:00
Ed Maste	039644eca9	ANSIfy kern_ktrace.c and remove archaic register keyword Sponsored by: The FreeBSD Foundation	2017-01-20 14:59:56 +00:00
Andriy Gapon	c468ff880a	don't abort writing of a core dump after EFAULT It's possible to get EFAULT when writing a segment backed by a file if the segment extends beyond the file. The core dump could still be useful if we skip the rest of the segment and proceed to other segments. The skipped segment (or a portion of it) will be zero-filled. While there, use 'const' to signify that core_write() only reads the buffer and use __DECONST before calling vn_rdwr_inchunks() because it can be used for both reading and writing. Before the change: kernel: Failed to write core file for process mmap_trunc_core (error 14) kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6 After the change: kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped) Reviewed by: julian, kib Obtained from: Panzura (older version of the change) MFC after: 5 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9233	2017-01-20 13:39:07 +00:00
Andriy Gapon	ad9dadc437	fix a thread preemption regression in schedulers introduced in r270423 Commit r270423 fixed a regression in sched_yield() that was introduced in earlier changes. Unfortunately, at the same time it introduced a new regression. The problem is that SWT_RELINQUISH (6), like all other SWT_* constants and unlike SW_* flags, is not a bit flag. So, (flags & SWT_RELINQUISH) is true in cases where that was not really intended, for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11). A straightforward fix would be to use (flags & SW_TYPE_MASK) == SWT_RELINQUISH, but my impression is that the switch types are designed mostly for gathering statistics, not for influencing scheduling decisions. So, I decided that it would be better to check for SW_PREEMPT flag instead. That's also the same flag that was checked before r239157. I double-checked how that flag is used and I am confident that the flag is set only in the places where we really have the preemption: - critical_exit + td_owepreempt - sched_preempt in the ULE scheduler - sched_preempt in the 4BSD scheduler Reviewed by: kib, mav MFC after: 4 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9230	2017-01-19 18:46:41 +00:00
Mateusz Guzik	c5f61e6f96	sx: reduce lock accesses similarly to r311172 Discussed with: jhb Tested by: pho (previous version)	2017-01-18 17:55:08 +00:00
Mateusz Guzik	3f0a0612e8	rwlock: reduce lock accesses similarly to r311172 Discussed with: jhb Tested by: pho (previous version)	2017-01-18 17:53:57 +00:00
Hans Petter Selasky	f3e7afe2d7	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API supports rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon the next ip_output() call, allocation of a new "snd_tag" will be attempted. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months	2017-01-18 13:31:17 +00:00
Ed Maste	bf9ebe74e2	disambiguate msleep KASSERT diagnostics Previously "panic: msleep" could happen for a few different reasons. Break the KASSERTs out into individual cases to identify the failing condition. Found during the investigation that resulted in r308288. Reviewed by: kib, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D8604	2017-01-16 20:34:42 +00:00
Sean Bruno	374f3e042c	Remove Assert that seems to be hit in various configurations during normal operations.	2017-01-16 19:01:41 +00:00
Maxim Sobolev	339efd75a4	Add a new socket option SO_TS_CLOCK to pick from several different clock sources to return timestamps when SO_TIMESTAMP is enabled. Two additional clock sources are: o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME); o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC). In addition to this, this option provides unified interface to get bintime (equivalent of using SO_BINTIME), except it is also supported with IPv6 where SO_BINTIME has never been supported. The long term plan is to deprecate SO_BINTIME and move everything to using SO_TS_CLOCK. Idea for this enhancement has been briefly discussed on the Net session during dev summit in Ottawa last June and the general input was positive. This change is believed to benefit network benchmarks/profiling as well as other scenarios where precise time of arrival measurement is necessary. There are two regression test cases as part of this commit: one extends unix domain test code (unix_cmsg) to test new SCM_XXX types and another one implements totally new test case which exchanges UDP packets between two processes using both conventional methods (i.e. calling clock_gettime(2) before recv(2) and after send(2)), as well as using setsockopt()+recv() in receive path. The resulting delays are checked for sanity for all supported clock types. Reviewed by: adrian, gnn Differential Revision: https://reviews.freebsd.org/D9171	2017-01-16 17:46:38 +00:00
Sean Bruno	227743cad4	Change startup order for the no EARLY_AP_STARTUP case to initialize gtaskqueue bits at SI_SUB_INIT_IF instead of waiting until SI_SUB_SMP which is far too late. Add an assertion in taskqgroup_attach() to catch startup initialization failures in the future. Reported by: kib bde	2017-01-16 16:58:12 +00:00
Hiren Panchasara	7d03ff1fe9	Add kevent EVFILT_EMPTY for notification when a client has received all data i.e. everything outstanding has been acked. Reviewed by: bz, gnn (previous version) MFC after: 3 days Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D9150	2017-01-16 08:25:33 +00:00
Conrad Meyer	db4fcadf52	"Buses" is the preferred plural of "bus" Replace archaic "busses" with modern form "buses." Intentionally excluded: * Old/random drivers I didn't recognize * Old hardware in general * Use of "busses" in code as identifiers No functional change. http://grammarist.com/spelling/buses-busses/ PR: 216099 Reported by: bltsrc at mail.ru Sponsored by: Dell EMC Isilon	2017-01-15 17:54:01 +00:00
Enji Cooper	d75a788085	Revert r312119 and reword the intent to fix -Wshadow issues between exp(3) and `exp` var. The approach taken previously was not ideal for multiple functional and stylistic reasons. Add to existing sed call in Makefile to replace `exp` with `exponent` instead. MFC after: 13 days Requested by: bde	2017-01-15 09:25:33 +00:00
Mark Johnston	d53d6fa9a8	Suppress a warning about m_assertbuf being unused. MFC after: 1 week	2017-01-15 03:53:20 +00:00
Sean Bruno	4ecb427a49	Fix hangs in a uniprocessor configuration (qemu, virtualbox, real hw). sys/net/iflib.c: Add ctx to filter_info and don't skip interrupt early on unless we're on an SMP system sys/kern/subr_gtaskqueue.c: Skip smp check if we're running UP Submitted by: Matt Macy <mmacy@nextbsd.org> Reported by: emaste bde	2017-01-15 00:50:10 +00:00
Mark Johnston	42d33c1f4d	Stop the scheduler upon panic even in non-SMP kernels. This is needed for kernel dumps to work, as the panicking thread will call into code that makes use of kernel locks. Reported and tested by: Eugene Grosbein MFC after: 1 week	2017-01-14 22:16:03 +00:00
Enji Cooper	d467b2ee0c	encode_long, encode_timeval: mechanically replace `exp` with `exponent` This helps fix a -Wshadow issue with exp(3) with tests/sys/acct/acct_test, which includes math.h, which in turn defines exp(3) MFC after: 2 weeks Tested with: clang, gcc 4.2.1, gcc 4.9 Sponsored by: Dell EMC Isilon	2017-01-14 05:06:14 +00:00
Enji Cooper	66db8cca1a	Clean up trailing whitespace MFC after: 3 days Sponsored by: Dell EMC Isilon	2017-01-14 04:16:13 +00:00
Enji Cooper	5e8fcdfe1b	Fix -Wunused on gcc 4.9 (x was set but not used) MFC after: 3 days Sponsored by: Dell EMC Isilon	2017-01-14 04:13:28 +00:00
Gleb Smirnoff	4fce19da8d	Remove deprecated fgetsock() and fputsock().	2017-01-13 22:16:41 +00:00
Ian Lepore	d5b937680c	Correct the comments about how much buffer is allocated.	2017-01-13 17:03:23 +00:00
Ian Lepore	a6f63533a7	Check tty_gone() after allocating IO buffers. The tty lock has to be dropped then reacquired due to using M_WAITOK, which opens a window in which the tty device can disappear. Check for this and return ENXIO back up the call chain so that callers can cope. This closes a race where TF_GONE would get set while buffers were being allocated as part of ttydev_open(), causing a subsequent call to ttydevsw_modem() later in ttydev_open() to assert. Reported by: pho Reviewed by: kib	2017-01-13 16:37:38 +00:00
Ian Lepore	e046e8e680	Restructure the tty_drain loop so that device-busy is checked one more time after tty_timedwait() returns an error only if the error is EWOULDBLOCK; other errors cause an immediate return. This fixes the case of the tty disappearing while in tty_drain(). Reported by: pho	2017-01-12 21:18:43 +00:00
Ravi Pokala	8e712af70b	Remove writability requirement for single-mbuf, contiguous-range m_pulldown() m_pulldown() only needs to determine if a mbuf is writable if it is going to copy data into the data region of an existing mbuf. It does this to create a contiguous data region in a single mbuf from multiple mbufs in the chain. If the requested memory region is already contiguous and nothing needs to change, the mbuf does not need to be writeable. Submitted by: Brian Mueller <bmueller@panasas.com> Reviewed by: bz MFC after: 1 week Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D9053	2017-01-12 06:38:03 +00:00
Ian Lepore	f64342e354	Rework tty_drain() to poll the hardware for completion, and restore drain timeout handling to historical freebsd behavior. The primary reason for these changes is the need to have tty_drain() call ttydevsw_busy() at some reasonable sub-second rate, to poll hardware that doesn't signal an interrupt when the transmit shift register becomes empty (which includes virtually all USB serial hardware). Such hardware hangs in a ttyout wait, because it never gets an opportunity to trigger a wakeup from the sleep in tty_drain() by calling ttydisc_getc() again, after handing the last of the buffered data to the hardware. While researching the history of changes to tty_drain() I stumbled across some email describing the historical BSD behavior of tcdrain() and close() on serial ports, and the ability of comcontrol(1) to control timeout behavior. Using that and some advice from Bruce Evans as a guide, I've put together these changes to implement the hardware polling and restore the historical timeout behaviors... - tty_drain() now calls ttydevsw_busy() in a loop at 10 Hz to accommodate hardware that requires polling for busy state. - The "new historical" behavior for draining during close(2) is retained: the drain timeout is "1 second without making any progress". When the 1-second timeout expires, if the count of bytes remaining in the tty layer buffer is smaller than last time, the timeout is extended for another second. Unfortunately, the same logic cannot be extended all the way down to the hardware, because the interface to that layer is a simple busy/not-busy indication. - Due to the previous point, an application that needs a guarantee that all data has been transmitted must use TIOCDRAIN/tcdrain(3) before calling close(2). - The historical behavior of honoring the drainwait setting for TIOCDRAIN (used by tcdrain(3)) is restored. - The historical kern.drainwait sysctl to control the global default drainwait time is restored, but is now named kern.tty_drainwait. - The historical default drainwait timeout of 300 seconds is restored. - Handling of TIOCGDRAINWAIT and TIOCSDRAINWAIT ioctls is restored (this also makes the comcontrol(1) drainwait verb work again). - Manpages are updated to document these behaviors. Reviewed by: bde (prior version)	2017-01-12 00:48:06 +00:00
Mark Johnston	90e17792c8	Do not set BIO_DONE if the BIO specifies a completion handler. biowait() will otherwise race with completions of such BIOs. In-tree code only calls biowait() on BIOs that do not specify a handler, so this change should not have any functional impact. Reviewed by: mav MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9070	2017-01-10 21:41:28 +00:00
John Baldwin	14da48cbe4	Set MORETOCOME for AIO write requests on a socket. Add a MSG_MORETOCOME message flag. When this flag is set, sosend* set PRUS_MORETOCOME when invoking the protocol send method. The aio worker tasks for sending on a socket set this flag when there are additional write jobs waiting on the socket buffer. Reviewed by: adrian MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D8955	2017-01-06 23:41:45 +00:00
Konstantin Belousov	6e89d383c7	Explicitly add "opt_compat.h" to kern_exec.c: fix powerpc LINT builds. sys/ptrace.h includes sys/signal.h, which includes sys/_sigset.h. Note that sys/_sigset.h only defines osigset_t if COMPAT_43 was defined. Two lines later, sys/ptrace.h includes machine/reg.h, which in case of powerpc, includes opt_compat.h. After the include headers reordering in r311345, we have sys/ptrace.h included before sys/sysproto.h. If COMPAT_43 was requested in the kernel config, the result is that sys/_sigset.h does not define osigset_t, but sys/sysproto.h sees COMPAT_43 and uses osigset_t. Fix this by explicitly including opt_compat.h to cover the whole kern/kern_exec.c scope. Sponsored by: The FreeBSD Foundation	2017-01-06 16:56:24 +00:00
Konstantin Belousov	2f304845e2	Do not allocate struct statfs on kernel stack. Right now size of the structure is 472 bytes on amd64, which is already large and stack allocations are undesirable. With the ino64 work, MNAMELEN is increased to 1024, which will make it impossible to have struct statfs on the stack. Extracted from: ino64 work by gleb Discussed with: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-05 17:19:26 +00:00
Konstantin Belousov	607fa849d2	Some style fixes for getfstat(2)-related code. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-05 17:03:35 +00:00
Mark Johnston	ec492b13f1	Add a small allocator for exec_map entries. Upon each execve, we allocate a KVA range for use in copying data to the new image. Pages must be faulted into the range, and when the range is freed, the backing pages are freed and their mappings are destroyed. This is a lot of needless overhead, and the exec_map management becomes a bottleneck when many CPUs are executing execve concurrently. Moreover, the number of available ranges is fixed at 16, which is insufficient on large systems and potentially excessive on 32-bit systems. The new allocator reduces overhead by making exec_map allocations persistent. When a range is freed, pages backing the range are marked clean and made easy to reclaim. With this change, the exec_map is sized based on the number of CPUs. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8921	2017-01-05 01:44:12 +00:00
Mark Johnston	eeeaa7ba22	Sort includes in kern_exec.c. MFC after: 1 week	2017-01-05 01:28:08 +00:00
Gleb Smirnoff	bfc8c24c73	Move bogus_page declaration to vm_page.h and initialization to vm_page.c. Reviewed by: kib	2017-01-04 22:27:19 +00:00
Konstantin Belousov	6c4338f2ef	The callers of kern_getfsstat(UIO_SYSSPACE) expect that buf always returns memory which must be freed, regardless of the error. Assign NULL to buf in case we are not going to allocate any memory due to invalid mode. Reported and tested by: pho Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 weeks (together with r310638) Differential revision: https://reviews.freebsd.org/D9042	2017-01-04 16:09:45 +00:00
Edward Tomasz Napierala	5ec7cde488	Fix bug that would result in a kernel crash in some cases involving a symlink and an autofs mount request. The crash was caused by namei() calling bcopy() with a negative length, caused by numeric underflow: in lookup(), in the relookup path, the ni_pathlen was decremented too many times. The bug was introduced in r296715. Big thanks to Alex Deiter for his help with debugging this. Reviewed by: kib@ Tested by: Alex Deiter <alex.deiter at gmail.com> MFC after: 1 month	2017-01-04 14:43:57 +00:00
Mateusz Guzik	391df78ad4	mtx: plug open-coded mtx_lock access missed in r311172	2017-01-04 02:25:31 +00:00
Mateusz Guzik	5e5ad162ad	Reduce lock accesses in thread lock similarly to r311172.	2017-01-03 23:08:11 +00:00
Mateusz Guzik	2604eb9e17	mtx: reduce lock accesses Instead of spuriously re-reading the lock value, read it once. This change also has a side effect of fixing a performance bug: on failed _mtx_obtain_lock, it was possible that re-read would find the lock is unowned, but in this case the primitive would make a trip through turnstile code. This is diff reduction to a variant which uses atomic_fcmpset. Discussed with: jhb (previous version) Tested by: pho (previous version)	2017-01-03 21:36:15 +00:00
Konstantin Belousov	7ee34a31fd	There is no need to use temporary statfs buffer for fsid obliteration and prison enforcement. Do it on the caller buffer directly. Besides eliminating memory copies, this change also removes large structure from the kernel stack. Extracted from: ino64 work by gleb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-02 18:59:23 +00:00
Konstantin Belousov	b961dc3193	Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-02 18:49:48 +00:00
Konstantin Belousov	f2af4041fa	Move common code from kern_statfs() and kern_fstatfs() into a new helper. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-02 18:20:22 +00:00
Mark Johnston	b5442eba5c	Factor out instances of a knote detach followed by a knote_drop() call. Reviewed by: kib (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9015	2017-01-02 01:23:21 +00:00
Sean Bruno	1248952a50	2017 IFLIB updates in preparation for commits to e1000 and ixgbe. - iflib - add checksum in place support (mmacy) - iflib - initialize IP for TSO (going to be needed for e1000) (mmacy) - iflib - move isc_txrx from shared context to softc context (mmacy) - iflib - Normalize checks in TXQ drainage. (shurd) - iflib - Fix queue capping checks (mmacy) - iflib - Fix invalid assert, em can need 2 sentinels (mmacy) - iflib - let the driver determine what capabilities are set and what tx csum flags are used (mmacy) - add INVARIANTS debugging hooks to gtaskqueue enqueue (mmacy) - update bnxt(4) to support the changes to iflib (shurd) Some other various, sundry updates. Slightly more verbose changelog: Submitted by: mmacy@nextbsd.org Reviewed by: shurd MFC after: Sponsored by: LimeLight Networks and Dell EMC Isilon	2017-01-02 00:56:33 +00:00
Mateusz Guzik	d4db49c4c7	fd: access openfiles once in falloc_noinstall This is similar to what's done with nprocs. Note this is only a band aid.	2017-01-01 08:55:28 +00:00
Mateusz Guzik	41b0046a4d	vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9) Reviewed by: kib	2016-12-31 19:59:31 +00:00
Mateusz Guzik	0b3b55a0f2	Remove cpu_spinwait after seq_consistent. It does not add any benefit as the read routine will do it as necessary.	2016-12-30 06:26:17 +00:00
Mateusz Guzik	4938d86764	cache: sprinkle __predict_false	2016-12-29 16:35:49 +00:00
Mateusz Guzik	b37707533e	cache: move shrink lock init to nchinit This gets rid of unnecessary sysinit usage. While here also rename the lock to be consistent with the rest.	2016-12-29 12:01:54 +00:00
Mateusz Guzik	0569bc9ca9	cache: depessimize hashing macros/inlines All hash sizes are power-of-2, but the compiler does not know that for sure and 'foo % size' forces doing a division. Store the size - 1 and use 'foo & hash' instead which allows mere shift.	2016-12-29 08:41:25 +00:00
Mateusz Guzik	6dd9661b77	cache: drop the NULL check from VP2VNODELOCK Now that negative entries are annotated with a dedicated flag, NULL vnodes are no longer passed.	2016-12-29 08:34:50 +00:00
John Baldwin	1fabda45c3	Regen after r310638. Differential Revision: https://reviews.freebsd.org/D8854	2016-12-27 20:22:17 +00:00
John Baldwin	34ed0c63c8	Rename the 'flags' argument to getfsstat() to 'mode' and validate it. This argument is not a bitmask of flags, but only accepts a single value. Fail with EINVAL if an invalid value is passed to 'flag'. Rename the 'flags' argument to getmntinfo(3) to 'mode' as well to match. This is a followup to r308088. Reviewed by: kib MFC after: 1 month	2016-12-27 20:21:11 +00:00
Konstantin Belousov	fd30dd7c26	Make knote KN_INFLUX state counted. This is final fix for the issue closed by r310302 for knote(). If KN_INFLUX | KN_SCAN flags are set for the note passed to knote() or knote_fork(), i.e. the knote is scanned, we might erroneously clear INFLUX when finishing notification. For normal knote() it was fixed in r310302 simply by remembering the fact that we do not own KN_INFLUX, since there we own knlist lock and scan thread cannot clear KN_INFLUX until we drop the lock. For knote_fork(), the situation is more complicated, we must drop knlist lock AKA the process lock, since we need to register new knotes. Change KN_INFLUX into counter and allow shared ownership of the in-flux state between scan and knote_fork() or knote(). Both in-flux setters need to ensure that knote is not dropped in parallel. Added assert about kn_influx == 1 in knote_drop() verifies that in-flux state is not shared when knote is destroyed. Since KBI of the struct knote is changed by addition of the int kn_influx field, reorder kn_hook and kn_hookid to fill pad on LP64 arches [1]. This keeps sizeof(struct knote) to same 128 bytes as it was before addition of kn_influx, on amd64. Reviewed by: markj Suggested by: markj [1] Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D8898	2016-12-26 19:33:40 +00:00
Konstantin Belousov	5c36b2e8cb	Change knlist_destroy() to assert that knlist is empty instead of accepting the wrong state and printing warning. Do not obliterate kl_lock and kl_unlock pointers, they are often useful for post-mortem analysis. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D8898	2016-12-26 19:28:10 +00:00
Konstantin Belousov	34311568dc	Style. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8898	2016-12-26 19:26:40 +00:00
Konstantin Belousov	fc05543fa7	Some optimizations for kqueue timers. There is no need to do two allocations per kqueue timer. Gather all data needed by the timer callout into the structure and allocate it at once. Use the structure to preserve the result of timer2sbintime(), to not perform repeated 64bit calculations in callout. Remove tautological casts. Remove now unused p_nexttime [1]. Noted by: markj [1] Reviewed by: markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week X-MFC note: do not remove p_nexttime Differential revision: https://reviews.freebsd.org/D8901	2016-12-25 19:49:35 +00:00
Konstantin Belousov	7611b72816	Some style. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8901	2016-12-25 19:38:07 +00:00
Mark Johnston	eab80d9276	Add a comment explaining the race fixed by r310423. Suggested and reviewed by: jhb X-MFC With: r310423	2016-12-23 05:02:17 +00:00
Mark Johnston	aa3c544349	Revert part of r300109. The removal of TAILQ_FOREACH_SAFE introduced a small race: when the last thread on a sleepqueue is awoken, it reclaims the sleepqueue and may begin executing on a different CPU before sleepq_resume_thread() returns. This leaves a window during which it may go back to sleep and incorrectly be awoken again by the caller of sleepq_broadcast(). Reported and tested by: pho MFC after: 3 days Sponsored by: Dell EMC Isilon	2016-12-22 17:51:44 +00:00
John Baldwin	99bc7e4123	Don't spin in pause() during early boot for kthreads other than thread0. pause() uses a spin loop to simulate a sleep during early boot. However, we only need this for thread0 to get far enough in the boot process to enable timers (at which point pause() can sleep). For other kthreads, sleeping in pause() is ok as the callout will be scheduled and will eventually fire once thread0 initializes timers. Tested by: Steven Kargl Sleuthing by: markj MFC after: 1 week Sponsored by: Netflix	2016-12-20 19:44:44 +00:00
Konstantin Belousov	4afd808be7	Do not clear KN_INFLUX when not owning influx state. For notes in KN_INFLUX\|KN_SCAN state, the influx bit is set by a parallel scan. When knote() reports event for the vnode filters, which require kqueue unlocked, it unconditionally sets and then clears influx to keep note around kqueue unlock. There, do not clear influx flag if a scan set it, since we do not own it, instead we prevent scan from executing by holding knlist lock. The knote_fork() function has somewhat similar problem, it might set KN_INFLUX for scanned note, drop kqueue and list locks, and then clear the flag after relock. A solution there would be different enough, as well as the test program, so close the reported issue first. Reported and test case provided by: yjh0502@gmail.com PR: 214923 Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-19 22:18:36 +00:00
Konstantin Belousov	69baec3619	Switch from stdatomic.h to atomic.h for kernel. Apparently stdatomic.h implementation for gcc 4.2 on sparc64 does not work properly. This effectively reverts r251803. Reported and tested by: lidl Discussed with: ed Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-16 17:41:20 +00:00
Ed Schouten	669a25b50d	Document the existence of the {0, 6, ...} sysctl.	2016-12-15 15:45:11 +00:00
Jilles Tjoelker	b9a6fb9343	reaper: Make REAPER_KILL_SUBTREE actually work. MFC after: 2 weeks	2016-12-14 22:49:20 +00:00
Ed Schouten	ae15715360	Add a "device_index" label to all sysctls under dev.$driver.$index. This way it becomes possible to graph a property for all instances of a single driver. For example, graphing the number of packets across all USB controllers, the amount of dropped packets on all NICs, etc. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D8775	2016-12-14 13:03:01 +00:00
Ed Schouten	fd0f59709d	Add labels to sysctls related to clocks. Sysctls like kern.eventtimer.et.*.quality currently embed the name of the clock device. This is problematic for the Prometheus metrics exporter for two reasons: - Some of those clocks have dashes in their names, which Prometheus doesn't allow to be used in metric names. - It doesn't allow for extracting the same property of all clocks on the system from within a single query. Attach a label to these nodes, so that the Prometheus metrics exporter gives these metrics a uniform name with the name of the clock attached as a label. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D8775	2016-12-14 12:56:58 +00:00
Ed Schouten	1e1f3941e4	Add support for attaching aggregation labels to sysctl objects. I'm currently working on writing a metrics exporter for the Prometheus monitoring system to provide access to sysctl metrics. Prometheus and sysctl have some structural differences: - sysctl is a tree of string component names. - Prometheus uses a flat namespace for its metrics, but allows you to attach labels with values to them, so that you can do aggregation. An initial version of my exporter simply translated hw.acpi.thermal.tz1.temperature to sysctl_hw_acpi_thermal_tz1_temperature_celcius while we should ideally have sysctl_hw_acpi_thermal_temperature_celcius{thermal_zone="tz1"} allowing you to graph all thermal zones on a system in one go. The change presented in this commit adds support for accomplishing this, by providing the ability to attach labels to nodes. In the example I gave above, the label "thermal_zone" would be attached to "tz1". As this is a feature that will only be used very rarely, I decided to not change the KPI too aggressively. Discussed on: hackers@ Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D8775	2016-12-14 12:47:34 +00:00
Gleb Smirnoff	1276a8363c	Zero return value when counter_rate() switches over to next second and value is positive, but below the limit.	2016-12-13 20:11:45 +00:00
Mateusz Guzik	25e578de55	vfs: use vrefact in getcwd and fchdir	2016-12-12 19:16:35 +00:00
Edward Tomasz Napierala	e3d4c4dcde	Undo r309891. Konstantin is right in that this condition normally cannot happen - the um_dev field is assigned at mount and never written to afterwards.	2016-12-12 19:11:04 +00:00
Mateusz Guzik	5afb134c32	vfs: add vrefact, to be used when the vnode has to be already active. This allows blind increment of relevant counters which under contention is cheaper than inc-not-zero loops at least on amd64. Use it in some of the places which are guaranteed to see already active vnodes. Reviewed by: kib (previous version)	2016-12-12 15:37:11 +00:00
Edward Tomasz Napierala	223cb0e434	Avoid dereferencing NULL pointers in devtoname(). I've seen it panic, called from ufs_print() in DDB. MFC after: 1 month	2016-12-12 15:22:21 +00:00
Konstantin Belousov	778aa66a68	Enable lookup_cap_dotdot and lookup_cap_dotdot_nonlocal. Requested and reviewed by: cem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8746	2016-12-12 11:12:04 +00:00
Konstantin Belousov	545d312293	When a zombie gets reparented due to the parent exit, send SIGCHLD to the reaper. The traditional reaper init(8) is aware of zombies silently reparented to it after the parents exit, it loops around waitpid(2) to collect them. For other reapers, the silent reparenting is surprising and collecting zombies requires a thread blocking in waitpid(2) just for that purpose. It seems that sending second SIGCHLD is a better workaround than forcing all reapers to obey the setup. Reported by: Michael Zuo <muh.muhten@gmail.com>, jilles PR: 213928 Reviewed by: jilles (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-12 11:11:50 +00:00
Alan Cox	2d612d2dd2	When tmpfs and POSIX shm pagein a page for the sole purpose of performing truncation, immediately queue the page for asynchronous laundering rather than making the page pass through inactive queue first. Reviewed by: kib, markj	2016-12-11 19:24:41 +00:00
Konrad Witaszczyk	480f31c214	Add support for encrypted kernel crash dumps. Changes include modifications in kernel crash dump routines, dumpon(8) and savecore(8). A new tool called decryptcore(8) was added. A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump configuration in the diocskerneldump_arg structure to the kernel. The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for backward ABI compatibility. dumpon(8) generates a one-time random symmetric key and encrypts it using an RSA public key in capability mode. Currently only AES-256-CBC is supported but EKCD was designed to implement support for other algorithms in the future. The public key is chosen using the -k flag. The dumpon rc(8) script can do this automatically during startup using the dumppubkey rc.conf(5) variable. Once the keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O control. When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random IV and sets up the key schedule for the specified algorithm. Each time the kernel tries to write a crash dump to the dump device, the IV is replaced by a SHA-256 hash of the previous value. This is intended to make a possible differential cryptanalysis harder since it is possible to write multiple crash dumps without reboot by repeating the following commands: # sysctl debug.kdb.enter=1 db> call doadump(0) db> continue # savecore A kernel dump key consists of an algorithm identifier, an IV and an encrypted symmetric key. The kernel dump key size is included in a kernel dump header. The size is an unsigned 32-bit integer and it is aligned to a block size. The header structure has 512 bytes to match the block size so it was required to make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the kernel dump key is placed after the first header on the dump device and the core dump is encrypted. Separate functions were implemented to write the kernel dump header and the kernel dump key as they need to be unencrypted. The dump_write function encrypts data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps are not supported due to the way they are constructed which makes it impossible to use the CBC mode for encryption. It should be also noted that textdumps don't contain sensitive data by design as a user decides what information should be dumped. savecore(8) writes the kernel dump key to a key.# file if its size in the header is nonzero. # is the number of the current core dump. decryptcore(8) decrypts the core dump using a private RSA key and the kernel dump key. This is performed by a child process in capability mode. If the decryption was not successful the parent process removes a partially decrypted core dump. Description on how to encrypt crash dumps was added to the decryptcore(8), dumpon(8), rc.conf(5) and savecore(8) manual pages. EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU. The feature still has to be tested on arm and arm64 as it wasn't possible to run FreeBSD due to the problems with QEMU emulation and lack of hardware. Designed by: def, pjd Reviewed by: cem, oshogbo, pjd Partial review: delphij, emaste, jhb, kib Approved by: pjd (mentor) Differential Revision: https://reviews.freebsd.org/D4712	2016-12-10 16:20:39 +00:00
Mark Johnston	02315a6759	Use a consistent snapshot of the lock state in owner_mtx(). MFC after: 2 weeks	2016-12-10 02:59:34 +00:00
Mark Johnston	c365a2934e	Return a non-NULL owner only if the lock is exclusively held in owner_sx(). Fix some whitespace bugs while here. MFC after: 2 weeks	2016-12-10 02:56:44 +00:00
Gleb Smirnoff	5040da77c1	Use acquire write to cr_lock to complement with release write at end of locked region. Submitted by: kib	2016-12-09 19:07:31 +00:00
Gleb Smirnoff	169170209c	Provide counter_ratecheck(), a MP-friendly substitution to ppsratecheck(). When rated event happens at a very quick rate, the ppsratecheck() is not only racy, but also becomes a performance bottleneck. Together with: rrs, jtl	2016-12-09 17:58:34 +00:00
Robert Watson	52b42f6287	Regenerate system-call definitions following r309677 correcting a whitespace glitch in syscalls.master.	2016-12-07 16:12:27 +00:00
Robert Watson	82d8d2b8bc	Replace spaces with tabs in definition of SCTP system calls, for consistency with the remainder of the syscalls.master file. This problem does not occur in the freebsd32 version of the same system calls.	2016-12-07 16:11:55 +00:00
Eric van Gyzen	3d32d4a7c9	Export the whole thread name in kinfo_proc. kinfo_proc::ki_tdname is three characters shorter than thread::td_name. Add a ki_moretdname field for these three extra characters. Add the new field to kinfo_proc32, as well. Update all in-tree consumers to read the new field and assemble the full name, except for lldb's HostThreadFreeBSD.cpp, which I will handle separately. Bump __FreeBSD_version. Reviewed by: kib MFC after: 1 week Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D8722	2016-12-07 15:04:22 +00:00
Konstantin Belousov	435da98564	Restructure the code to handle reporting of non-exited processes from wait(2). - Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED options were passed [1]. - Extract the code to report alive process into a new helper report_alive_proc() and use it for trapped, stopped and continued children. Note that the process spinlock is required around the WTRAPPED and WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are set before other threads are stopped at the suspension point, and that threads increment p_suspcount while owning only the process spinlock, the process lock is dropped by them. If the spinlock is not taken for tests, the syscall thread might miss both p_suspcount increment and wakeup in thread_suspend_switch(). Based on the submission by: mjg [1] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-04 20:44:58 +00:00
Eric van Gyzen	ff07dd913e	thr_set_name(): silently truncate the given name as needed. Instead of failing with ENAMETOOLONG, which is swallowed by pthread_set_name_np() anyway, truncate the given name to MAXCOMLEN+1 bytes. This is more likely what the user wants, and saves the caller from truncating it before the call (which was the only recourse). Polish pthread_set_name_np(3) and add a .Xr to thr_set_name(2) so the user might find the documentation for this behavior. Reviewed by: jilles MFC after: 3 days Sponsored by: Dell EMC	2016-12-03 01:14:21 +00:00
Mateusz Guzik	a2d3554542	vfs: provide fake locking primitives for the crossmp vnode. Since the vnode is only expected to be shared locked, we can save a little overhead by only pretending we are locking in the first place. Reviewed by: kib Tested by: pho	2016-12-02 18:03:15 +00:00


15301 Commits