Commit Graph

16480 Commits

Author SHA1 Message Date
Konstantin Belousov
e5ac304989 Bump SPECNAMELEN to MAXNAMLEN.
This includes the bump for cdevsw d_version.  Otherwise, the impact on
the ABI (not KBI) is surprisingly low.  The most important affected
interface is devname(3) and ttyname(3) which already correctly handle
long names (and ttyname(3) should not be affected at all).

Still, due to the d_version bump, I argue that the change is not MFC-able.

Requested by:	mmacy
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18932
2019-01-27 00:46:06 +00:00
Kirk McKusick
dab83bd1e8 Add printing of b_ioflags to DDB `show buffer' command.
Sponsored by: Netflix
2019-01-25 21:24:09 +00:00
Konstantin Belousov
be8dd1428e Re-wrap long line after r341827.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-01-17 04:51:05 +00:00
Gleb Smirnoff
d1bb5d7d50 Fix mistake in r343030: move nswbuf calculation back to
kern_vfs_bio_buffer_alloc(), because in init_param2() nbuf
isn't really initialized yet.

Pointed out by:	bde
2019-01-16 20:20:38 +00:00
Konstantin Belousov
ea7e7006db Implement shmat(2) flag SHM_REMAP.
Based on the description in Linux man page.

Reviewed by:	markj, ngie (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18837
2019-01-16 05:15:57 +00:00
Xin LI
305bb04ee4 Use TD_IS_IDLETHREAD instead of unrolled version.
MFC after:	2 weeks
2019-01-15 06:44:37 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Gleb Smirnoff
46713135ae Add flag LK_NEW for lockinit() that is converted to LO_NEW and passed
down to lock_init().  This allows for lockinit() on a not prezeroed
memory.
2019-01-15 00:35:19 +00:00
Konstantin Belousov
28b740da38 Handle overflow in calculating max kmem size.
vm_kmem_size is u_long, and it might be not capable of holding page
count times PAGE_SIZE, even when scaled down by VM_KMEM_SIZE_SCALE.  As
bde reported, 12G PAE config ends up with zero for kmem size.

Explicitly check for overflow and clamp kmem size at vm_kmem_size_max.
If we end up at zero size because VM_KMEM_SIZE_MAX is not defined,
panic with clear explanation rather then failing in a way which is
hard to relate.

Reported by:	bde, pho
Tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18767
2019-01-14 07:31:19 +00:00
Jason A. Harmening
7dff7eda1a Handle SIGIO for listening sockets
r319722 separated struct socket and parts of the socket I/O path into
listening-socket-specific and dataflow-socket-specific pieces.  Listening
socket connection notifications are now handled by solisten_wakeup() instead
of sowakeup(), but solisten_wakeup() does not currently post SIGIO to the
owning process.

PR:	234258
Reported by:	Kenneth Adelman
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18664
2019-01-13 20:33:54 +00:00
Olivier Houchard
7045ac437b Instead of using an incomplete list of platforms that uses 64bits time_t
in 32bits mode, special case amd64, as i386 is the only arch that still
uses 32bits time_t.
2019-01-13 00:19:15 +00:00
Andrew Turner
be860eae0f Fix the check for the offset of td_frame and td_emuldata in struct thread.
Pointy hat:	andrew
Sponsored by:	DARPA, AFRL
2019-01-12 20:41:57 +00:00
Andrew Turner
b3c0d957a2 Add support for the Clang Coverage Sanitizer in the kernel (KCOV).
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).

Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.

The first item in the buffer is the number of trace event that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.

The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses on entry per trace event, while the later uses 4. As
such they are incompatible so only a single mode may be enabled.

KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.

A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.

[1] https://github.com/google/syzkaller

Submitted by:	Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by:	kib
Testing by:	tuexen
Sponsored by:	DARPA, AFRL
Sponsored by:	The FreeBSD Foundation (Mitchell Horne)
Differential Revision:	https://reviews.freebsd.org/D14599
2019-01-12 11:21:28 +00:00
Gleb Smirnoff
bcc3cec43c Simplify sosetopt() so that function has single return point. No
functional change.
2019-01-10 00:25:12 +00:00
Brooks Davis
4f4ef03f5f style(9): fix the indent of a return. 2019-01-09 17:23:59 +00:00
Michael Tuexen
735835ed5c Avoid overfow in vtruncbuf()
Using daddr_t instead of int avoids trunclbn to become negative when it
shouldn't.
This isssue was found by running syzkaller.

Reviewed by:		mckusick, kib, markj
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18763
2019-01-08 09:04:27 +00:00
Kristof Provost
de2d0d297a Remove unneeded NULL check for td_ucred
td_ucred is always set, so we don't need the ternary expression to check for
it.
2019-01-04 21:12:17 +00:00
Conrad Meyer
6b83069e05 Expose threads-per-core and physical core count information
With new sysctls (to the best of our ability do detect them).  Restructured
smp.4 slightly for clarity (keep relevant stuff closer to the top) while
documenting.

Reviewed by:	markj, jhibbits (ppc parts)
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D18322
2019-01-04 18:31:17 +00:00
Mark Johnston
2f2ddd68a5 Support MSG_DONTWAIT in send*(2).
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead.  Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.

Submitted by:	Greg V <greg@unrelenting.technology>
Reviewed by:	tuexen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18728
2019-01-04 17:31:50 +00:00
Kristof Provost
4ae4822d6a Simplify jail ID printing on process exit
As suggested by kib@, we don't need to check p_ucred, because that's only NULL
during process creation, and cr_prison is never NULL.
2018-12-29 21:36:02 +00:00
Conrad Meyer
a0483764f3 Update to Zstandard 1.3.8
This merge brings in a couple new files, which needed to be attached to the
build; a new dependency on <limits.h>, which must be stubbed; and a name
change in the Context parameter constants, from ZSTD_p_foo to ZSTD_c_foo.

Significantly, it fixes a kernel build error with GCC where floating-point
functions were included in the kernel build, by hiding them under the same
compile-time #ifdef that already covered their invocation.  That issue was
introduced to FreeBSD in the 1.3.7 update and tracked upstream here:

  https://github.com/facebook/zstd/issues/1386

The full 1.3.8 release notes can be found on Github:

  https://github.com/facebook/zstd/releases/tag/v1.3.8

Relnotes:	yes
2018-12-29 21:18:01 +00:00
Konstantin Belousov
7a6322e10d For hw.{physmem,realmem,usermem} MIBs, clamp instead truncating.
If the memory size does not fit into u_long, current code truncates
the returned value and returns complete nonsense.  Make the result
slightly more useful by clamping it at ULONG_MAX.

Reported and tested :	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-29 15:55:44 +00:00
Kristof Provost
af8becca15 Make kernel print jail ID when logging a process exit
Kernel now includes jail ID when logging a process exit. jid is 0 for unjailed
processes.

Submitted by:	Marie Helene Kvello-Aune <freebsd@mhka.no>
Relnotes:	yes
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D18618
2018-12-29 14:48:51 +00:00
Jilles Tjoelker
0abc7e41ba pfind, pfind_any: Correct zombie logic
SVN r340744 erroneously changed pfind() to return any process including
zombies and pfind_any() to return only non-zombie processes.

In particular, this caused kill() on a zombie process to fail with [ESRCH].
There is no direct test case for this but /usr/tests/bin/sh/builtins/kill1.0
occasionally triggers it (as reported by lwhsu).

Conversely, returning zombies from pfind() seems likely to violate
invariants and cause panics, but I have not looked at this.

PR:		233646
Reviewed by:	mjg, kib, ngie
Differential Revision:	https://reviews.freebsd.org/D18665
2018-12-28 13:32:14 +00:00
Kirk McKusick
c0029546f8 When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by:  Christopher Krah <krah@protonmail.com>
Reported as:  FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by:  kib
MFC after:    1 week
Sponsored by: Netflix

M    sys/fs/ext2fs/ext2_vnops.c
M    sys/kern/vfs_subr.c
M    sys/ufs/ffs/ffs_snapshot.c
M    sys/ufs/ufs/ufs_vnops.c
2018-12-27 07:18:53 +00:00
Alexander Motin
abeb9f61f9 Increase MTX_POOL_SLEEP_SIZE from 128 to 1024.
This value remained unchanged for 15 years, and now this bump reduces
lock spinning in GEOM and BIO layers while doing ~1.6M IOPS to 4 NVMe
on 72-core system from ~25% to ~5% by the cost of additional 28KB RAM.

While there, align struct mtx_pool fields to cache lines.

MFC after:	1 month
2018-12-24 23:52:35 +00:00
Konstantin Belousov
6c59824b31 Properly test for vmio buffer in bnoreuselist().
The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind.  Buffer might has been allocated before the
object created, then the buffer is malloced.  Although we try to avoid
such situation, it seems to be still legitimate.

Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-23 18:52:02 +00:00
Bruce Evans
5ef4f86d7a Oops, rounddown() for the start was misspelled roundup() in r342295,
so only aligned starts worked.  This broke releasing caches in most
cases where the i/o size is smaller than the fs block size.
2018-12-22 09:31:55 +00:00
Bruce Evans
2c0434acb0 Fix rounding in vop_stdadvise() for POSIX_FADV_NOREUSE (really
POSIX_FADV_DONTNEED).  The most broken case was for applications that
advise for the whole file and then do block-aligned i/o's 1 block at
a time.  Then advice is sent to VOP_ADVISE() 1 block at a time, but
in vop_stdadvise() the 1-block advice was turned into 0-block advice
for the buffer cache part.

The bugs were caused partly by callers representing the region as
(a_start, a_end), where a_end is actually the maximum, and everything
else representing the region as (start, end) where 'end' is actually
the end (1 after the maximum).  The maximum a_end must be rounded up,
but was rounded down.  Also, rounding to page boundaries was inconsistent.

The bugs and fixes have no effect for zfs and other file systems that
don't use the buffer cache or the page cache.  Most or all file systems
currently use the default VOP_FADVISE(), but it finds a null buffer cache
and a null page cache for file systems that don't use normal methods.

Reviewed by:	kib
2018-12-21 04:57:59 +00:00
Kirk McKusick
13c31c29ca Some filesystems (like cd9660 and ext3) require that VFS_STATFS()
be called before VFS_ROOT() is called. Move the call for VFS_STATFS()
so that it is done after VFS_MOUNT(), but before VFS_ROOT().
This change actually improves the robustness of the mount system
call because it returns an error rather than failing silently
when VFS_STATFS() returns failure.

Reported by:  Rebecca Cran <rebecca@bluestop.org>
Sponsored by: Netflix
2018-12-21 01:09:25 +00:00
Mateusz Guzik
3e0178fb94 Check for probes enabled in priv_check_cred before evaluting the error.
Sponsored by:	The FreeBSD Foundation
2018-12-19 23:28:29 +00:00
Mateusz Guzik
628888f0e0 Remove iBCS2, part2: general kernel
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
2018-12-19 21:57:58 +00:00
Mateusz Guzik
19b75ef59a Microoptimize corner case of ID bitmap handling.
Prior to the change we would avoidably test more possibly used IDs.

While here update the comment: there is no pidchecked variable anymore.
2018-12-19 20:29:52 +00:00
Mateusz Guzik
7d065d876e Deinline vfork handling out of the syscall return path.
vfork is rarely called (comparatively to other syscalls) and it avoidably
pollutes the fast path.

Sponsored by:	The FreeBSD Foundation
2018-12-19 20:27:26 +00:00
Mark Johnston
26e9d9b01f Fix DDB's "show malloc" after r338899.
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-12-19 00:17:22 +00:00
Brooks Davis
10f7b12c13 const poison the new pointer of __sysctl.
Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18444
2018-12-18 12:44:38 +00:00
Andriy Gapon
82a5a27527 add support for marking interrupt handlers as suspended
The goal of this change is to fix a problem with PCI shared interrupts
during suspend and resume.

I have observed a couple of variations of the following scenario.
Devices A and B are on the same PCI bus and share the same interrupt.
Device A's driver is suspended first and the device is powered down.
Device B generates an interrupt. Interrupt handlers of both drivers are
called. Device A's interrupt handler accesses registers of the powered
down device and gets back bogus values (I assume all 0xff). That data is
interpreted as interrupt status bits, etc. So, the interrupt handler
gets confused and may produce some noise or enter an infinite loop, etc.

This change affects only PCI devices.  The pci(4) bus driver marks a
child's interrupt handler as suspended after the child's suspend method
is called and before the device is powered down.  This is done only for
traditional PCI interrupts, because only they can be shared.

At the moment the change is only for x86.

Notable changes in core subsystems / interfaces:
- BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus
  interface along with convenience functions bus_suspend_intr and
  bus_resume_intr;
- rman_set_irq_cookie and rman_get_irq_cookie functions are added to
  provide a way to associate an interrupt resource with an interrupt
  cookie;
- intr_event_suspend_handler and intr_event_resume_handler functions
  are added to the MI interrupt handler interface.

I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to
implement the new intr_event functions.  IH_SUSP marks a suspended
interrupt handler.  IH_CHANGED is used to implement a barrier that
ensures that a change to the interrupt handler's state is visible
to future interrupts.
While there, I fixed some whitespace issues in comments and changed a
couple of logically boolean variables to be bool.

MFC after:	1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D15755
2018-12-17 17:11:00 +00:00
Kirk McKusick
17ca94cfc0 Clarify panic in set_rootvnode().
Check for panic in vfs_mountroot_shuffle().

Sponsored by: Netflix
2018-12-15 19:18:58 +00:00
Kirk McKusick
e04d2a3c5a Under UFS/FFS the VFS_ROOT() function will return an error if the inode
check-hash fails. Panic'ing is not an appropriate response. So, check
for an error return from VFS_ROOT() and when an error is reported,
unwind and return the error.

Reported by:  Gary Jennejohn (gj)
Sponsored by: Netflix
2018-12-15 19:04:50 +00:00
Mateusz Guzik
24d64be4c5 vfs: mostly depessimize NDINIT_ALL
1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init

Sponsored by:	The FreeBSD Foundation
2018-12-14 03:55:08 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mateusz Guzik
6b2d61136f fd: dedup code in sys_getdtablesize
Sponsored by:	The FreeBSD Foundation
2018-12-11 12:08:18 +00:00
Mateusz Guzik
73e62bc9bb Make lim_cur inline if possible.
It is a function call only to accomodate *some* ABIs which install a hook.
They only care for 3 types of limits: DATA, STACK, VMEM

Instead of always calling the func, see at compilation time if the requested
limit is something else and just do the read if so.

Sponsored by:	The FreeBSD Foundation
2018-12-11 12:01:46 +00:00
Mateusz Guzik
86db4d40ac fd: tidy up closing a fd
- avoid a call to knote_close in the common case
- annotate mqueue as unlikely

Sponsored by:	The FreeBSD Foundation
2018-12-11 11:58:44 +00:00
Mateusz Guzik
663de8167e fd: stop looking for exact freefile after allocation
If a lower fd is closed later, the lookup goes to waste. Allocation
always performs the lookup anyway.

Sponsored by:	The FreeBSD Foundation
2018-12-11 11:57:12 +00:00
Konstantin Belousov
94dd54b9a2 Free bootstacks after AP startup.
Bootstacks are unused after APs executed sched_throw() in
init_secondary_tail() and started executing on proper idle thread
stack.  Add sysinit that detects that the idle thread for each CPU was
scheduled at least once, and free corresponding bootstack.

Slight addition of the code (~200 bytes) is compensated by the saving,
because even on typical small modern desktop CPU we leak 128K of
memory otherwise (4 pages x 8 threads).

Reviewed by:	jhb
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18486
2018-12-11 02:54:36 +00:00
Konstantin Belousov
eba8ab0e3e Remove special case handling for getfhat(fd, NULL, handle).
There is no reason for it to behave differently from openat(fd, NULL).
Also the handling did not worked because the substituted path was from
the system address space, causing EFAULT.

Submitted by:	Jack Halford <jack@gandi.net>
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18501
2018-12-11 02:48:49 +00:00
John Baldwin
c5786670ac Don't report stale signal information for non-signal events in ptrace_lwpinfo.
Once a signal's siginfo was copied to 'td_si' as part of the signal
exchange in issignal(), it was never cleared.  This caused future
thread events that are reported as SIGTRAP events without signal
information to report the stale siginfo in 'td_si'.  For example, if a
debugger created a new process and used SIGSTOP to stop it after
PT_ATTACH, future system call entry / exit events would set PL_FLAG_SI
with the SIGSTOP siginfo in pl_siginfo.  This broke 'catch syscall' in
current versions of gdb as it assumed PL_FLAG_SI with SIGTRAP
indicates a breakpoint or single step trap.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18487
2018-12-10 19:39:24 +00:00
Alan Cox
2905d1ceaf blst_leaf_alloc updates bighint for a leaf when an allocation is successful
and includes the last block represented by the leaf.  The reasoning is that,
if the last block is included, then there must be no solution before that
one in the leaf, so the leaf cannot provide an allocation that big again;
indeed, the leaf cannot provide a solution bigger than range1.

Which is all correct, except that if the value of blk passed in did not
represent the first block of the leaf, because the cursor was pointing to
the middle of the leaf, then a possible solution before the cursor may have
been ignored, and bighint cannot be updated.

Consider the sequence allocate 63 (returning address 0), free 0,63 (freeing
that same block, and allocate 1 (returning 63).  The result is that one
block is allocated from the first leaf, and the value of bighint is 0, so
that nothing can be allocated from that leaf until the only block allocated
from that leaf is freed.  This change detects that skipped-over solution,
and when there is one it makes sure that the value of bighint is not changed
when the last block is allocated.

Submitted by:	Doug Moore <dougm@rice.edu>
Tested by:	pho
X-MFC with:	r340402
Differential Revision:	https://reviews.freebsd.org/D18474
2018-12-09 17:55:10 +00:00
Mateusz Guzik
6017827676 umtx: avoid umtxshm locking on object termination if possible
Sample build world result on tmpfs:
kern.ipc.umtx_terminate_notempty: 0
kern.ipc.umtx_terminate_empty: 2891815

Sponsored by:	The FreeBSD Foundation
2018-12-08 14:04:57 +00:00