Commit Graph

16513 Commits

Author SHA1 Message Date
brooks
4e4796faba Regen after r342190.
Differential Revision:	https://reviews.freebsd.org/D18444
2019-01-31 22:58:17 +00:00
glebius
c9bbcb39b2 Add new m_ext type for data for M_NOFREE mbufs, which doesn't actually do
anything except several assertions.  This type is going to be used for
temporary on stack mbufs, that point into data in receive ring of a NIC,
that shall not be freed.  Such mbuf can not be stored or reallocated, its
life time is current context.
2019-01-31 22:37:28 +00:00
markj
4a4880bc37 Prevent some kobj memory allocation failures from panicking the system.
Parts of the kobj(9) KPI assume a non-sleepable context for the purpose
of internal memory allocations, but currently have no way to signal an
allocation failure to the caller, so they just panic in this case.  This
can occur even when kobj_create() is called with M_WAITOK.  Fix some
instances of the problem by plumbing wait flags from kobj_create() through
internal subroutines.  Change kobj_class_compile() to assume a sleepable
context when called externally, since all existing callers use it in a
sleepable context.

To fix the problem fully the kobj_init() KPI must be changed.

Reported and tested by:	pho
Reviewed by:	kib (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19023
2019-01-31 22:27:39 +00:00
mav
87871ab1ce Only sort requests of types that have concept of offset.
Other types, such as BIO_FLUSH or BIO_ZONE, or especially new/unknown ones,
may imply some degree of ordering even if strict ordering is not requested
explicitly.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-01-30 17:24:50 +00:00
andrew
be20ad6345 Extract the coverage sanitizer KPI to a new file.
This will allow multiple consumers of the coverage data to be compiled
into the kernel together. The only requirement is only one can be
registered at a given point in time, however it is expected they will
only register when the coverage data is needed.

A new kernel conflig option COVERAGE is added. This will allow kcov to
become a module that can be loaded as needed, or compiled into the
kernel.

While here clean up the #include style a little.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18955
2019-01-29 11:04:17 +00:00
avos
1cdb04b397 m_getm2: correct a comment.
The comment states that function always return a top of allocated mbuf;
however, the function actually return the overall mbuf chain top pointer.

Since there are already existing users of it (via m_getm(4) macro),
rephrase the comment and leave behavior unchanged.

PR:		134335
MFC after:	12 days
2019-01-27 16:44:27 +00:00
kib
19ca89f12f Bump SPECNAMELEN to MAXNAMLEN.
This includes the bump for cdevsw d_version.  Otherwise, the impact on
the ABI (not KBI) is surprisingly low.  The most important affected
interface is devname(3) and ttyname(3) which already correctly handle
long names (and ttyname(3) should not be affected at all).

Still, due to the d_version bump, I argue that the change is not MFC-able.

Requested by:	mmacy
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18932
2019-01-27 00:46:06 +00:00
mckusick
5f49ddd0a0 Add printing of b_ioflags to DDB `show buffer' command.
Sponsored by: Netflix
2019-01-25 21:24:09 +00:00
kib
23ffc0a0c1 Re-wrap long line after r341827.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-01-17 04:51:05 +00:00
glebius
35140f62be Fix mistake in r343030: move nswbuf calculation back to
kern_vfs_bio_buffer_alloc(), because in init_param2() nbuf
isn't really initialized yet.

Pointed out by:	bde
2019-01-16 20:20:38 +00:00
kib
f34dbcf7d0 Implement shmat(2) flag SHM_REMAP.
Based on the description in Linux man page.

Reviewed by:	markj, ngie (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18837
2019-01-16 05:15:57 +00:00
delphij
74be0aa2e9 Use TD_IS_IDLETHREAD instead of unrolled version.
MFC after:	2 weeks
2019-01-15 06:44:37 +00:00
glebius
7ee1aa34d4 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
glebius
55f760dc93 Add flag LK_NEW for lockinit() that is converted to LO_NEW and passed
down to lock_init().  This allows for lockinit() on a not prezeroed
memory.
2019-01-15 00:35:19 +00:00
kib
db6bffc60a Handle overflow in calculating max kmem size.
vm_kmem_size is u_long, and it might be not capable of holding page
count times PAGE_SIZE, even when scaled down by VM_KMEM_SIZE_SCALE.  As
bde reported, 12G PAE config ends up with zero for kmem size.

Explicitly check for overflow and clamp kmem size at vm_kmem_size_max.
If we end up at zero size because VM_KMEM_SIZE_MAX is not defined,
panic with clear explanation rather then failing in a way which is
hard to relate.

Reported by:	bde, pho
Tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18767
2019-01-14 07:31:19 +00:00
jah
23cce14a7d Handle SIGIO for listening sockets
r319722 separated struct socket and parts of the socket I/O path into
listening-socket-specific and dataflow-socket-specific pieces.  Listening
socket connection notifications are now handled by solisten_wakeup() instead
of sowakeup(), but solisten_wakeup() does not currently post SIGIO to the
owning process.

PR:	234258
Reported by:	Kenneth Adelman
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18664
2019-01-13 20:33:54 +00:00
cognet
a7c2a2d8c6 Instead of using an incomplete list of platforms that uses 64bits time_t
in 32bits mode, special case amd64, as i386 is the only arch that still
uses 32bits time_t.
2019-01-13 00:19:15 +00:00
andrew
32f8cf00d6 Fix the check for the offset of td_frame and td_emuldata in struct thread.
Pointy hat:	andrew
Sponsored by:	DARPA, AFRL
2019-01-12 20:41:57 +00:00
andrew
5e0e456d9f Add support for the Clang Coverage Sanitizer in the kernel (KCOV).
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).

Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.

The first item in the buffer is the number of trace event that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.

The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses on entry per trace event, while the later uses 4. As
such they are incompatible so only a single mode may be enabled.

KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.

A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.

[1] https://github.com/google/syzkaller

Submitted by:	Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by:	kib
Testing by:	tuexen
Sponsored by:	DARPA, AFRL
Sponsored by:	The FreeBSD Foundation (Mitchell Horne)
Differential Revision:	https://reviews.freebsd.org/D14599
2019-01-12 11:21:28 +00:00
glebius
22a41cecb2 Simplify sosetopt() so that function has single return point. No
functional change.
2019-01-10 00:25:12 +00:00
brooks
48665600ce style(9): fix the indent of a return. 2019-01-09 17:23:59 +00:00
tuexen
e8a9c3693f Avoid overfow in vtruncbuf()
Using daddr_t instead of int avoids trunclbn to become negative when it
shouldn't.
This isssue was found by running syzkaller.

Reviewed by:		mckusick, kib, markj
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18763
2019-01-08 09:04:27 +00:00
kp
39b0b1bf8e Remove unneeded NULL check for td_ucred
td_ucred is always set, so we don't need the ternary expression to check for
it.
2019-01-04 21:12:17 +00:00
cem
d6edca0f6c Expose threads-per-core and physical core count information
With new sysctls (to the best of our ability do detect them).  Restructured
smp.4 slightly for clarity (keep relevant stuff closer to the top) while
documenting.

Reviewed by:	markj, jhibbits (ppc parts)
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D18322
2019-01-04 18:31:17 +00:00
markj
85037a09f3 Support MSG_DONTWAIT in send*(2).
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead.  Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.

Submitted by:	Greg V <greg@unrelenting.technology>
Reviewed by:	tuexen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18728
2019-01-04 17:31:50 +00:00
kp
59252de0d7 Simplify jail ID printing on process exit
As suggested by kib@, we don't need to check p_ucred, because that's only NULL
during process creation, and cr_prison is never NULL.
2018-12-29 21:36:02 +00:00
cem
432d4753f2 Update to Zstandard 1.3.8
This merge brings in a couple new files, which needed to be attached to the
build; a new dependency on <limits.h>, which must be stubbed; and a name
change in the Context parameter constants, from ZSTD_p_foo to ZSTD_c_foo.

Significantly, it fixes a kernel build error with GCC where floating-point
functions were included in the kernel build, by hiding them under the same
compile-time #ifdef that already covered their invocation.  That issue was
introduced to FreeBSD in the 1.3.7 update and tracked upstream here:

  https://github.com/facebook/zstd/issues/1386

The full 1.3.8 release notes can be found on Github:

  https://github.com/facebook/zstd/releases/tag/v1.3.8

Relnotes:	yes
2018-12-29 21:18:01 +00:00
kib
0bf5594c41 For hw.{physmem,realmem,usermem} MIBs, clamp instead truncating.
If the memory size does not fit into u_long, current code truncates
the returned value and returns complete nonsense.  Make the result
slightly more useful by clamping it at ULONG_MAX.

Reported and tested :	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-29 15:55:44 +00:00
kp
7f6b1eda35 Make kernel print jail ID when logging a process exit
Kernel now includes jail ID when logging a process exit. jid is 0 for unjailed
processes.

Submitted by:	Marie Helene Kvello-Aune <freebsd@mhka.no>
Relnotes:	yes
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D18618
2018-12-29 14:48:51 +00:00
jilles
9cebe88602 pfind, pfind_any: Correct zombie logic
SVN r340744 erroneously changed pfind() to return any process including
zombies and pfind_any() to return only non-zombie processes.

In particular, this caused kill() on a zombie process to fail with [ESRCH].
There is no direct test case for this but /usr/tests/bin/sh/builtins/kill1.0
occasionally triggers it (as reported by lwhsu).

Conversely, returning zombies from pfind() seems likely to violate
invariants and cause panics, but I have not looked at this.

PR:		233646
Reviewed by:	mjg, kib, ngie
Differential Revision:	https://reviews.freebsd.org/D18665
2018-12-28 13:32:14 +00:00
mckusick
51739d277b When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by:  Christopher Krah <krah@protonmail.com>
Reported as:  FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by:  kib
MFC after:    1 week
Sponsored by: Netflix

M    sys/fs/ext2fs/ext2_vnops.c
M    sys/kern/vfs_subr.c
M    sys/ufs/ffs/ffs_snapshot.c
M    sys/ufs/ufs/ufs_vnops.c
2018-12-27 07:18:53 +00:00
mav
3f27e240ec Increase MTX_POOL_SLEEP_SIZE from 128 to 1024.
This value remained unchanged for 15 years, and now this bump reduces
lock spinning in GEOM and BIO layers while doing ~1.6M IOPS to 4 NVMe
on 72-core system from ~25% to ~5% by the cost of additional 28KB RAM.

While there, align struct mtx_pool fields to cache lines.

MFC after:	1 month
2018-12-24 23:52:35 +00:00
kib
f31f9d5413 Properly test for vmio buffer in bnoreuselist().
The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind.  Buffer might has been allocated before the
object created, then the buffer is malloced.  Although we try to avoid
such situation, it seems to be still legitimate.

Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-23 18:52:02 +00:00
bde
6702ac5ced Oops, rounddown() for the start was misspelled roundup() in r342295,
so only aligned starts worked.  This broke releasing caches in most
cases where the i/o size is smaller than the fs block size.
2018-12-22 09:31:55 +00:00
bde
97d496ecdd Fix rounding in vop_stdadvise() for POSIX_FADV_NOREUSE (really
POSIX_FADV_DONTNEED).  The most broken case was for applications that
advise for the whole file and then do block-aligned i/o's 1 block at
a time.  Then advice is sent to VOP_ADVISE() 1 block at a time, but
in vop_stdadvise() the 1-block advice was turned into 0-block advice
for the buffer cache part.

The bugs were caused partly by callers representing the region as
(a_start, a_end), where a_end is actually the maximum, and everything
else representing the region as (start, end) where 'end' is actually
the end (1 after the maximum).  The maximum a_end must be rounded up,
but was rounded down.  Also, rounding to page boundaries was inconsistent.

The bugs and fixes have no effect for zfs and other file systems that
don't use the buffer cache or the page cache.  Most or all file systems
currently use the default VOP_FADVISE(), but it finds a null buffer cache
and a null page cache for file systems that don't use normal methods.

Reviewed by:	kib
2018-12-21 04:57:59 +00:00
mckusick
b9ea8013d1 Some filesystems (like cd9660 and ext3) require that VFS_STATFS()
be called before VFS_ROOT() is called. Move the call for VFS_STATFS()
so that it is done after VFS_MOUNT(), but before VFS_ROOT().
This change actually improves the robustness of the mount system
call because it returns an error rather than failing silently
when VFS_STATFS() returns failure.

Reported by:  Rebecca Cran <rebecca@bluestop.org>
Sponsored by: Netflix
2018-12-21 01:09:25 +00:00
mjg
286549b24a Check for probes enabled in priv_check_cred before evaluting the error.
Sponsored by:	The FreeBSD Foundation
2018-12-19 23:28:29 +00:00
mjg
c39e5a0486 Remove iBCS2, part2: general kernel
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
2018-12-19 21:57:58 +00:00
mjg
655bf8e58a Microoptimize corner case of ID bitmap handling.
Prior to the change we would avoidably test more possibly used IDs.

While here update the comment: there is no pidchecked variable anymore.
2018-12-19 20:29:52 +00:00
mjg
94437ac557 Deinline vfork handling out of the syscall return path.
vfork is rarely called (comparatively to other syscalls) and it avoidably
pollutes the fast path.

Sponsored by:	The FreeBSD Foundation
2018-12-19 20:27:26 +00:00
markj
b0758adada Fix DDB's "show malloc" after r338899.
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-12-19 00:17:22 +00:00
brooks
a3c153e5af const poison the new pointer of __sysctl.
Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18444
2018-12-18 12:44:38 +00:00
avg
260cc02954 add support for marking interrupt handlers as suspended
The goal of this change is to fix a problem with PCI shared interrupts
during suspend and resume.

I have observed a couple of variations of the following scenario.
Devices A and B are on the same PCI bus and share the same interrupt.
Device A's driver is suspended first and the device is powered down.
Device B generates an interrupt. Interrupt handlers of both drivers are
called. Device A's interrupt handler accesses registers of the powered
down device and gets back bogus values (I assume all 0xff). That data is
interpreted as interrupt status bits, etc. So, the interrupt handler
gets confused and may produce some noise or enter an infinite loop, etc.

This change affects only PCI devices.  The pci(4) bus driver marks a
child's interrupt handler as suspended after the child's suspend method
is called and before the device is powered down.  This is done only for
traditional PCI interrupts, because only they can be shared.

At the moment the change is only for x86.

Notable changes in core subsystems / interfaces:
- BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus
  interface along with convenience functions bus_suspend_intr and
  bus_resume_intr;
- rman_set_irq_cookie and rman_get_irq_cookie functions are added to
  provide a way to associate an interrupt resource with an interrupt
  cookie;
- intr_event_suspend_handler and intr_event_resume_handler functions
  are added to the MI interrupt handler interface.

I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to
implement the new intr_event functions.  IH_SUSP marks a suspended
interrupt handler.  IH_CHANGED is used to implement a barrier that
ensures that a change to the interrupt handler's state is visible
to future interrupts.
While there, I fixed some whitespace issues in comments and changed a
couple of logically boolean variables to be bool.

MFC after:	1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D15755
2018-12-17 17:11:00 +00:00
mckusick
fae3bd79a5 Clarify panic in set_rootvnode().
Check for panic in vfs_mountroot_shuffle().

Sponsored by: Netflix
2018-12-15 19:18:58 +00:00
mckusick
afd5ac8b62 Under UFS/FFS the VFS_ROOT() function will return an error if the inode
check-hash fails. Panic'ing is not an appropriate response. So, check
for an error return from VFS_ROOT() and when an error is reported,
unwind and return the error.

Reported by:  Gary Jennejohn (gj)
Sponsored by: Netflix
2018-12-15 19:04:50 +00:00
mjg
8598ea893e vfs: mostly depessimize NDINIT_ALL
1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init

Sponsored by:	The FreeBSD Foundation
2018-12-14 03:55:08 +00:00
mjg
7e31d1de7e Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
mjg
ba8523cc7c fd: dedup code in sys_getdtablesize
Sponsored by:	The FreeBSD Foundation
2018-12-11 12:08:18 +00:00
mjg
59363837d9 Make lim_cur inline if possible.
It is a function call only to accomodate *some* ABIs which install a hook.
They only care for 3 types of limits: DATA, STACK, VMEM

Instead of always calling the func, see at compilation time if the requested
limit is something else and just do the read if so.

Sponsored by:	The FreeBSD Foundation
2018-12-11 12:01:46 +00:00
mjg
45f96abb72 fd: tidy up closing a fd
- avoid a call to knote_close in the common case
- annotate mqueue as unlikely

Sponsored by:	The FreeBSD Foundation
2018-12-11 11:58:44 +00:00