Commit Graph

16605 Commits

Author SHA1 Message Date
Alexander Motin
abeb9f61f9 Increase MTX_POOL_SLEEP_SIZE from 128 to 1024.
This value remained unchanged for 15 years, and now this bump reduces
lock spinning in GEOM and BIO layers while doing ~1.6M IOPS to 4 NVMe
on 72-core system from ~25% to ~5% by the cost of additional 28KB RAM.

While there, align struct mtx_pool fields to cache lines.

MFC after:	1 month
2018-12-24 23:52:35 +00:00
Konstantin Belousov
6c59824b31 Properly test for vmio buffer in bnoreuselist().
The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind.  Buffer might has been allocated before the
object created, then the buffer is malloced.  Although we try to avoid
such situation, it seems to be still legitimate.

Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-23 18:52:02 +00:00
Bruce Evans
5ef4f86d7a Oops, rounddown() for the start was misspelled roundup() in r342295,
so only aligned starts worked.  This broke releasing caches in most
cases where the i/o size is smaller than the fs block size.
2018-12-22 09:31:55 +00:00
Bruce Evans
2c0434acb0 Fix rounding in vop_stdadvise() for POSIX_FADV_NOREUSE (really
POSIX_FADV_DONTNEED).  The most broken case was for applications that
advise for the whole file and then do block-aligned i/o's 1 block at
a time.  Then advice is sent to VOP_ADVISE() 1 block at a time, but
in vop_stdadvise() the 1-block advice was turned into 0-block advice
for the buffer cache part.

The bugs were caused partly by callers representing the region as
(a_start, a_end), where a_end is actually the maximum, and everything
else representing the region as (start, end) where 'end' is actually
the end (1 after the maximum).  The maximum a_end must be rounded up,
but was rounded down.  Also, rounding to page boundaries was inconsistent.

The bugs and fixes have no effect for zfs and other file systems that
don't use the buffer cache or the page cache.  Most or all file systems
currently use the default VOP_FADVISE(), but it finds a null buffer cache
and a null page cache for file systems that don't use normal methods.

Reviewed by:	kib
2018-12-21 04:57:59 +00:00
Kirk McKusick
13c31c29ca Some filesystems (like cd9660 and ext3) require that VFS_STATFS()
be called before VFS_ROOT() is called. Move the call for VFS_STATFS()
so that it is done after VFS_MOUNT(), but before VFS_ROOT().
This change actually improves the robustness of the mount system
call because it returns an error rather than failing silently
when VFS_STATFS() returns failure.

Reported by:  Rebecca Cran <rebecca@bluestop.org>
Sponsored by: Netflix
2018-12-21 01:09:25 +00:00
Mateusz Guzik
3e0178fb94 Check for probes enabled in priv_check_cred before evaluting the error.
Sponsored by:	The FreeBSD Foundation
2018-12-19 23:28:29 +00:00
Mateusz Guzik
628888f0e0 Remove iBCS2, part2: general kernel
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
2018-12-19 21:57:58 +00:00
Mateusz Guzik
19b75ef59a Microoptimize corner case of ID bitmap handling.
Prior to the change we would avoidably test more possibly used IDs.

While here update the comment: there is no pidchecked variable anymore.
2018-12-19 20:29:52 +00:00
Mateusz Guzik
7d065d876e Deinline vfork handling out of the syscall return path.
vfork is rarely called (comparatively to other syscalls) and it avoidably
pollutes the fast path.

Sponsored by:	The FreeBSD Foundation
2018-12-19 20:27:26 +00:00
Mark Johnston
26e9d9b01f Fix DDB's "show malloc" after r338899.
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-12-19 00:17:22 +00:00
Brooks Davis
10f7b12c13 const poison the new pointer of __sysctl.
Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18444
2018-12-18 12:44:38 +00:00
Andriy Gapon
82a5a27527 add support for marking interrupt handlers as suspended
The goal of this change is to fix a problem with PCI shared interrupts
during suspend and resume.

I have observed a couple of variations of the following scenario.
Devices A and B are on the same PCI bus and share the same interrupt.
Device A's driver is suspended first and the device is powered down.
Device B generates an interrupt. Interrupt handlers of both drivers are
called. Device A's interrupt handler accesses registers of the powered
down device and gets back bogus values (I assume all 0xff). That data is
interpreted as interrupt status bits, etc. So, the interrupt handler
gets confused and may produce some noise or enter an infinite loop, etc.

This change affects only PCI devices.  The pci(4) bus driver marks a
child's interrupt handler as suspended after the child's suspend method
is called and before the device is powered down.  This is done only for
traditional PCI interrupts, because only they can be shared.

At the moment the change is only for x86.

Notable changes in core subsystems / interfaces:
- BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus
  interface along with convenience functions bus_suspend_intr and
  bus_resume_intr;
- rman_set_irq_cookie and rman_get_irq_cookie functions are added to
  provide a way to associate an interrupt resource with an interrupt
  cookie;
- intr_event_suspend_handler and intr_event_resume_handler functions
  are added to the MI interrupt handler interface.

I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to
implement the new intr_event functions.  IH_SUSP marks a suspended
interrupt handler.  IH_CHANGED is used to implement a barrier that
ensures that a change to the interrupt handler's state is visible
to future interrupts.
While there, I fixed some whitespace issues in comments and changed a
couple of logically boolean variables to be bool.

MFC after:	1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D15755
2018-12-17 17:11:00 +00:00
Kirk McKusick
17ca94cfc0 Clarify panic in set_rootvnode().
Check for panic in vfs_mountroot_shuffle().

Sponsored by: Netflix
2018-12-15 19:18:58 +00:00
Kirk McKusick
e04d2a3c5a Under UFS/FFS the VFS_ROOT() function will return an error if the inode
check-hash fails. Panic'ing is not an appropriate response. So, check
for an error return from VFS_ROOT() and when an error is reported,
unwind and return the error.

Reported by:  Gary Jennejohn (gj)
Sponsored by: Netflix
2018-12-15 19:04:50 +00:00
Mateusz Guzik
24d64be4c5 vfs: mostly depessimize NDINIT_ALL
1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init

Sponsored by:	The FreeBSD Foundation
2018-12-14 03:55:08 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mateusz Guzik
6b2d61136f fd: dedup code in sys_getdtablesize
Sponsored by:	The FreeBSD Foundation
2018-12-11 12:08:18 +00:00
Mateusz Guzik
73e62bc9bb Make lim_cur inline if possible.
It is a function call only to accomodate *some* ABIs which install a hook.
They only care for 3 types of limits: DATA, STACK, VMEM

Instead of always calling the func, see at compilation time if the requested
limit is something else and just do the read if so.

Sponsored by:	The FreeBSD Foundation
2018-12-11 12:01:46 +00:00
Mateusz Guzik
86db4d40ac fd: tidy up closing a fd
- avoid a call to knote_close in the common case
- annotate mqueue as unlikely

Sponsored by:	The FreeBSD Foundation
2018-12-11 11:58:44 +00:00
Mateusz Guzik
663de8167e fd: stop looking for exact freefile after allocation
If a lower fd is closed later, the lookup goes to waste. Allocation
always performs the lookup anyway.

Sponsored by:	The FreeBSD Foundation
2018-12-11 11:57:12 +00:00
Konstantin Belousov
94dd54b9a2 Free bootstacks after AP startup.
Bootstacks are unused after APs executed sched_throw() in
init_secondary_tail() and started executing on proper idle thread
stack.  Add sysinit that detects that the idle thread for each CPU was
scheduled at least once, and free corresponding bootstack.

Slight addition of the code (~200 bytes) is compensated by the saving,
because even on typical small modern desktop CPU we leak 128K of
memory otherwise (4 pages x 8 threads).

Reviewed by:	jhb
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18486
2018-12-11 02:54:36 +00:00
Konstantin Belousov
eba8ab0e3e Remove special case handling for getfhat(fd, NULL, handle).
There is no reason for it to behave differently from openat(fd, NULL).
Also the handling did not worked because the substituted path was from
the system address space, causing EFAULT.

Submitted by:	Jack Halford <jack@gandi.net>
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18501
2018-12-11 02:48:49 +00:00
John Baldwin
c5786670ac Don't report stale signal information for non-signal events in ptrace_lwpinfo.
Once a signal's siginfo was copied to 'td_si' as part of the signal
exchange in issignal(), it was never cleared.  This caused future
thread events that are reported as SIGTRAP events without signal
information to report the stale siginfo in 'td_si'.  For example, if a
debugger created a new process and used SIGSTOP to stop it after
PT_ATTACH, future system call entry / exit events would set PL_FLAG_SI
with the SIGSTOP siginfo in pl_siginfo.  This broke 'catch syscall' in
current versions of gdb as it assumed PL_FLAG_SI with SIGTRAP
indicates a breakpoint or single step trap.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18487
2018-12-10 19:39:24 +00:00
Alan Cox
2905d1ceaf blst_leaf_alloc updates bighint for a leaf when an allocation is successful
and includes the last block represented by the leaf.  The reasoning is that,
if the last block is included, then there must be no solution before that
one in the leaf, so the leaf cannot provide an allocation that big again;
indeed, the leaf cannot provide a solution bigger than range1.

Which is all correct, except that if the value of blk passed in did not
represent the first block of the leaf, because the cursor was pointing to
the middle of the leaf, then a possible solution before the cursor may have
been ignored, and bighint cannot be updated.

Consider the sequence allocate 63 (returning address 0), free 0,63 (freeing
that same block, and allocate 1 (returning 63).  The result is that one
block is allocated from the first leaf, and the value of bighint is 0, so
that nothing can be allocated from that leaf until the only block allocated
from that leaf is freed.  This change detects that skipped-over solution,
and when there is one it makes sure that the value of bighint is not changed
when the last block is allocated.

Submitted by:	Doug Moore <dougm@rice.edu>
Tested by:	pho
X-MFC with:	r340402
Differential Revision:	https://reviews.freebsd.org/D18474
2018-12-09 17:55:10 +00:00
Mateusz Guzik
6017827676 umtx: avoid umtxshm locking on object termination if possible
Sample build world result on tmpfs:
kern.ipc.umtx_terminate_notempty: 0
kern.ipc.umtx_terminate_empty: 2891815

Sponsored by:	The FreeBSD Foundation
2018-12-08 14:04:57 +00:00
Mateusz Guzik
b0b246b0ba Remove proctree acquire from note_procstat_proc
It is not needed since r340482 ("proc: always store parent pid in p_oppid")

Sponsored by:	The FreeBSD Foundation
2018-12-08 11:38:39 +00:00
Mateusz Guzik
eab2132ad9 Fix a corner case in ID bitmap management.
If all IDs from trypid to pid_max were used as pids, the code would enter
a loop which would be infinite if none of the IDs could become free (e.g.
they all belong to processes which did not transitioned to zombie).

Fixes:	r341684 ("Manage process-related IDs with bitmaps")

Sponsored by:	The FreeBSD Foundation
2018-12-08 10:22:12 +00:00
Mateusz Guzik
e52327e3c5 proc: postpone proc unlock until after reporting with kqueue
kqueue would always relock immediately afterwards.

While here drop the NULL check for list itself. The list is
always allocated.

Sponsored by:	The FreeBSD Foundation
2018-12-08 06:34:12 +00:00
Mateusz Guzik
eadb1dcb71 proc: handle sdt exit probe before taking the proc lock
Sponsored by:	The FreeBSD Foundation
2018-12-08 06:31:43 +00:00
Mateusz Guzik
13a45e4b14 Provide SDT_PROBES_ENABLED macro.
Sponsored by:	The FreeBSD Foundation
2018-12-08 06:30:41 +00:00
Konstantin Belousov
18519f1583 Simplify kern_readlink_vp().
When we detected that the vnode is not symlink, return immediately.
This moves the readlink code out of else branch and unindents it.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-12-07 23:07:51 +00:00
Konstantin Belousov
978f879483 Fix expression evaluation.
Braces were put in the wrong place, causing failing EAGAIN check to
return zero result.  Remove the problematic assignment from the
conditional expression at all.

While there, remove used once variable vp, and wrap too long line.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-12-07 23:05:12 +00:00
Mateusz Guzik
08d005e6a3 fd: use racct_set_unlocked
Sponsored by:	The FreeBSD Foundation
2018-12-07 16:51:38 +00:00
Mateusz Guzik
448db4f761 racct: add RACCT_ENABLED macro and racct_set_unlocked
This allows to remove PROC_LOCK/UNLOCK pairs spread thorought the kernel
only used to appease racct_set.

Sponsored by:	The FreeBSD Foundation
2018-12-07 16:47:34 +00:00
Mateusz Guzik
82f4b82634 fd: try do less work with the lock in dup
Sponsored by:	The FreeBSD Foundation
2018-12-07 16:44:52 +00:00
Mateusz Guzik
6ff4688b09 Replace hand-rolled unrefs if > 1 with refcount_release_if_not_last
Sponsored by:	The FreeBSD Foundation
2018-12-07 16:11:45 +00:00
Konstantin Belousov
fd52edaf70 Regen. 2018-12-07 15:19:00 +00:00
Konstantin Belousov
d1fd400a80 Add new file handle system calls.
Namely, getfhat(2), fhlink(2), fhlinkat(2), fhreadlink(2).  The
syscalls are provided for a NFS userspace server (nfs-ganesha).

Submitted by:	Jack Halford <jack@gandi.net>
Sponsored by:	Gandi.net
Tested by:	pho
Feedback from:	brooks, markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18359
2018-12-07 15:17:29 +00:00
Mateusz Guzik
b1fbffe73c proc: when exiting move to zombproc before taking proctree
The kernel was already doing this prior to r329615. It was changed
to reduce contention on allproc. However, introduction of pidhash
locks and removal of proctree -> allproc ordering from fork thanks
to bitmaps fixed things enough to make this change pessimal.

waitpid takes proctree on each call and this change (now) causes
avoidable stalls if allproc is held.

Sponsored by:	The FreeBSD Foundation
2018-12-07 12:32:25 +00:00
Mateusz Guzik
34ebdceac0 Manage process-related IDs with bitmaps
Currently unique pid allocation on fork often requires a full walk of
process, group, session lists to make sure it is not used by anything.
This has a side effect of requiring proctree to be held along with allproc,
which adds more contention in poudriere -j 128.

The patch below implements trivial bitmaps which gets rid of the problem.
Dedicated lock is introduced to manage IDs.

While here a bug was discovered: all processes would inherit reap id from
the first process spawned by init. This had a side effect of keeping the
ID used and when allocation rolls over to the beginning it keeps being
skipped.

The patch is loosely based on initial work by mjoras@.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
2018-12-07 12:22:32 +00:00
Mateusz Guzik
6e8c1ccbe2 Annotate Giant drop/pickup macros with __predict_false
They are used in important places of the kernel with the lock not being held
majority of the time.

Sponsored by:	The FreeBSD Foundation
2018-12-07 12:06:03 +00:00
Mark Johnston
afde86eba3 Let kern.trap_enotcap be set as a tunable.
This is handy for testing programs that are run by rc.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-06 17:29:37 +00:00
Brooks Davis
827c3852fe Further simplify arguments to init.
With the removal of BOOTCDROM and fastboot support, this code always
passed "-s" or "--". The latter simply terminates getopt(3) processing
in init so we only need to pass "-s" in the single user case, or nothing
in other cases.

The passing of "--" seems to have been done to ensure that the number of
arguments passed to init was always the same and thus that argc was the
same.

Also GC the write-only variable pathlen (not in reviewed version).

Reviewed by:	kib, jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18441
2018-12-05 19:18:16 +00:00
Alan Cox
749cdf6f3b Terminate a blist_alloc search when a blst_meta_alloc call fails with
cursor == 0.

Every call to blst_meta_alloc but the one at the root is made only when the
meta-node is known to include a free block, so that either the allocation
will succeed, the node hint will be updated, or the last block of the meta-
node range is, and remains, free.  But the call at the root is made without
checking that there is a free block, so in the case that every block is
allocated, there is no hint update to prevent the current code from looping
forever.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	pho
Reviewed by:	pho
Tested by:	pho
X-MFC with:	r340402
Differential Revision:	https://reviews.freebsd.org/D17999
2018-12-05 18:26:40 +00:00
Brooks Davis
68ea829fe7 Remove never enabled support for "fastboot".
This has been ifdef notyet since the import of BSD 4.4 Lite Kernel
Sources in r1541.

Sponsored by:	DARPA, AFRL
2018-12-05 17:35:15 +00:00
Brooks Davis
7a5db3a770 Remove ifdef BOOTCDROM option to start init.
When BOOTCDROM is defined (via CFLAGS as there is no config option)
it causes -C to be passed to init, but our init and the version of
sysinstall I glanced at in 6.x don't support -C. The last plausibly
related support was removed from the tree in 1995.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18431
2018-12-05 17:29:14 +00:00
Mateusz Guzik
f26db6948d sx: retire SX_NOADAPTIVE
The flag is not used by anything for years and supporting it requires an
explicit read from the lock when entering slow path.

Flag value is left unused on purpose.

Sponsored by:	The FreeBSD Foundation
2018-12-05 16:43:03 +00:00
Brooks Davis
41f7b25317 Remove NOARGS from oaccept.
This was in the orignal patch, but lost in a rebase.

Reported by:	andrew
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15816
2018-12-04 21:56:45 +00:00
Brooks Davis
63de13cfee Regen after r341474: Normalize COMPAT_43 syscall declarations. 2018-12-04 16:49:14 +00:00
Brooks Davis
d48719bd96 Normalize COMPAT_43 syscall declarations.
Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases this adds only one new declaration.

No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15816
2018-12-04 16:48:47 +00:00
Brooks Davis
3a325dec32 Remove a needlessly clever hack to start init with sys_exec().
Construct a struct image_args with the help of new exec_args_*() helper
functions and call kern_execve().

The previous code mapped a page in userspace, copied arguments out
to it one at a time, and then constructed a struct execve_args all so
that sys_execve() can call exec_copyin_args() to copy the data back in
to a struct image_args.

Opencode the part of pre_execve()/post_execve() that releases a
reference to the initial vmspace. We don't need to stop threads like
they do.

Reviewed by:	kib, jhb (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15469
2018-12-04 00:15:47 +00:00
Mark Johnston
02164d3603 Add a missing definition for the !COMPAT_FREEBSD32 case.
Reported by:	jenkins
MFC with:	r341442
Sponsored by:	The FreeBSD Foundation
2018-12-03 21:07:10 +00:00
Mark Johnston
352aaa5122 Plug memory disclosures via ptrace(2).
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory.  The same
issue existed with the register files in procfs.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Security:	kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18421
2018-12-03 20:54:17 +00:00
Konstantin Belousov
200bf72793 Correct accuracy of the barrier writes accounting.
Discussed with:	mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-02 12:53:39 +00:00
Eric van Gyzen
5e38e3f5eb Include path for tmpfs objects in vm.objects sysctl
This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.

Reviewed by:	kib, dab
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:59:43 +00:00
Brooks Davis
f373437a01 Add helper functions to copy strings into struct image_args.
Given a zeroed struct image_args with an allocated buf member,
exec_args_add_fname() must be called to install a file name (or NULL).
Then zero or more calls to exec_args_add_env() followed by zero or
more calls to exec_args_add_env(). exec_args_adjust_args() may be
called after args and/or env to allow an interpreter to be prepended to
the argument list.

To allow code reuse when adding arg and env variables, begin_envv
should be accessed with the accessor exec_args_get_begin_envv()
which handles the case when no environment entries have been added.

Use these functions to simplify exec_copyin_args() and
freebsd32_exec_copyin_args().

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15468
2018-11-29 21:00:56 +00:00
Konstantin Belousov
7d2b0bd7d7 If BENEATH is specified, always latch the topping directory vnode.
It is possible that we started with a relative path but during the
lookup, found an absolute symlink.  In this case, BENEATH handling
code needs the latch, but it is too late to calculate it.

While there, somewhat improve the assertions.  Clear the NI_LCF_LATCH
flag when the latch vnode is released, so that asserts know the state.
Assert that there is a latch if we entered beneath+abs path mode,
after the starting point is processed.

Reported by:	wulf
With more input from:	pho
Sponsored by:	The FreeBSD Foundation
2018-11-29 19:13:10 +00:00
Mateusz Guzik
1f6ad48c76 vfs: fix i386 build after r341220 2018-11-29 09:54:27 +00:00
Mateusz Guzik
22443809ff cache: retire cache_enter compat schim
It was added over 6 years ago for binary compat. cache_enter macro remains
as it expands to cache_enter_time.

Sponsored by:	The FreeBSD Foundation
2018-11-29 09:32:59 +00:00
Mateusz Guzik
712775843f vfs: drop spurious memcpy in stat
Sponsored by:	The FreeBSD Foundation
2018-11-29 09:04:10 +00:00
Mateusz Guzik
d47f3fdb0a fd: unify fd range check across the routines
While here annotate out of range as unlikely.

Sponsored by:	The FreeBSD Foundation
2018-11-29 08:53:39 +00:00
Mateusz Guzik
eec8d0a378 Convert racct_enable to bool and annotate as __read_frequently
Sponsored by:	The FreeBSD Foundation
2018-11-29 05:17:16 +00:00
Mateusz Guzik
64cf6a62d4 Deinline racct throttling out of syscall exit path.
racct is not enabled by default and even when it is enabled processes are
typically not throttled. The order of checks is left unchanged since
racct_enable will be annotated as __read_frequently, while checking for the
flag in the processes would probably require an extra fetch.

Sponsored by:	The FreeBSD Foundation
2018-11-29 05:08:46 +00:00
Mateusz Guzik
e272bf479b Annotate td_cowgen check as unlikely.
Sponsored by:	The FreeBSD Foundation
2018-11-29 04:48:22 +00:00
Mateusz Guzik
3277792bde Tidy up hardclock.
- use fcmpset for updating ticks
- move (rarely used) itimer handling to a dedicated function

Sponsored by:	The FreeBSD Foundation
2018-11-29 03:44:02 +00:00
Mateusz Guzik
1e9a1bf589 proc: create a dedicated lock for zombproc to ligthen the load on allproc_lock
waitpid always takes proctree to evaluate the list, but only takes allproc
if it can reap. With this patch allproc is no longer taken, which helps during
poudriere -j 128.

Discussed with: kib
Sponsored by:	The FreeBSD Foundation
2018-11-29 02:52:08 +00:00
Konstantin Belousov
affd918514 Improve sigonstack().
Avoid relying on unsigned overflow for the test.
Simplify expressions to avoid duplicate check for the range.
Style.
Add herald comment.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18361
2018-11-27 19:50:58 +00:00
Jamie Gritton
b307954481 In hardened systems, where the security.bsd.unprivileged_proc_debug sysctl
node is set, allow setting security.bsd.unprivileged_proc_debug per-jail.
In part, this is needed to create jails in which the Address Sanitizer
(ASAN) fully works as ASAN utilizes libkvm to inspect the virtual address
space. Instead of having to allow unprivileged process debugging for the
entire system, allow setting it on a per-jail basis.

The sysctl node is still security.bsd.unprivileged_proc_debug and the
jail(8) param is allow.unprivileged_proc_debug. The sysctl code is now a
sysctl proc rather than a sysctl int. This allows us to determine setting
the flag for the corresponding jail (or prison0).

As part of the change, the dynamic allow.* API needed to be modified to
take into account pr_allow flags which may now be disabled in prison0.
This prevents conflicts with new pr_allow flags (like that of vmm(4)) that
are added (and removed) dynamically.

Also teach the jail creation KPI to allow differences for certain pr_allow
flags between the parent and child jail. This can happen when unprivileged
process debugging is disabled in the parent prison, but enabled in the
child.

Submitted by:	Shawn Webb <lattera at gmail.com>
Obtained from:	HardenedBSD (45b3625edba0f73b3e3890b1ec3d0d1e95fd47e1, deba0b5078cef0faae43cbdafed3035b16587afc, ab21eeb3b4c72f2500987c96ff603ccf3b6e7de8)
Relnotes:	yes
Sponsored by:	HardenedBSD and G2, Inc
Differential Revision:	https://reviews.freebsd.org/D18319
2018-11-27 17:51:50 +00:00
Eric van Gyzen
607a0eb2f1 Remove superfluous bzero in getcontext/swapcontext/sendsig
We zero the whole structure; we don't need to zero the __spare__ field again.

Remove trailing whitespace.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2018-11-26 20:56:05 +00:00
Alan Somers
72bce9fff6 vfs_aio.c: rename "physio" symbols to "bio".
aio has two paths: an asynchronous "physio" path and a synchronous path.
Confusingly, physio(9) isn't actually used by the "physio" path, and never
has been.  In fact, it may even be called by the synchronous path!  Rename
the "physio" path to the "bio" path to reflect what it actually does:
directly compose BIOs and send them to character devices.

MFC after:	2 weeks
2018-11-26 18:31:00 +00:00
Alan Cox
ee73fef96e blist_meta_alloc assumes that mask=scan->bm_bitmap is nonzero. But if the
cursor lies in the middle of the space that the meta node represents, then
blanking the low bits of mask may make it zero, and break later code that
expects a nonzero value.  Add a test that returns failure if the mask has
been cleared.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	pho
Tested by:	pho
X-MFC with:	r340402
Differential Revision:	https://reviews.freebsd.org/D18058
2018-11-24 21:52:10 +00:00
Mark Johnston
792843c38f Pass malloc flags directly through kevent(2) subroutines.
Some kevent functions have a boolean "waitok" parameter for use when
calling malloc(9).  Replace them with the corresponding malloc() flags:
the desired behaviour is known at compile-time, so this eliminates a
couple of conditional branches, and makes the code easier to read.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18318
2018-11-24 17:06:01 +00:00
Mark Johnston
36c4960ef8 Plug some kernel memory disclosures via kevent(2).
The kernel may register for events on behalf of a userspace process,
in which case it must be careful to zero the kevent struct that will be
copied out to userspace.

Reviewed by:	kib
MFC after:	3 days
Security:	kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18317
2018-11-24 17:02:31 +00:00
Mark Johnston
a2afae524a Ensure that knotes do not get registered when KQ_CLOSING is set.
KQ_CLOSING is set before draining the knotes associated with a kqueue,
so we must ensure that new knotes are not added after that point.  In
particular, some kernel facilities may register for events on behalf
of a userspace process and race with a close of the kqueue.

PR:		228858
Reviewed by:	kib
Tested by:	pho
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18316
2018-11-24 16:58:34 +00:00
Mark Johnston
1eeab857a3 Lock the knlist before releasing the in-flux state in knote_fork().
Otherwise there is a window, before iteration is resumed, during which
the knote may be freed.  The in-flux state ensures that the knote will
not be removed from the knlist while locks are dropped.

PR:		228858
Reviewed by:	kib
Tested by:	pho
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18316
2018-11-24 16:41:29 +00:00
Konstantin Belousov
cefb93f253 Parse FreeBSD Feature Control note on the ELF image activation.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:33:55 +00:00
Konstantin Belousov
92328a3251 Generalize ELF parse_notes().
Remove the knowledge of the ABI note type and brandnote from it,
instead provide it with a callback to do note-specific matching and
data fetching.  Implement callback to match against ELF brand.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:29:14 +00:00
Konstantin Belousov
eda8fe63c9 Trivial reduction of the code duplication, reuse the return FALSE code.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:16:01 +00:00
Mark Johnston
96fdfb3649 Honour the waitok parameter in kevent_expand().
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18316
2018-11-23 23:10:03 +00:00
Konstantin Belousov
f5cf758998 Provide storage for the process feature control flags in struct proc.
The flags are cleared on exec, it is up to the image activator to set
them.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:07:57 +00:00
Mark Johnston
6d2e2df764 Ensure that directory entry padding bytes are zeroed.
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace.  With the ino64 work there
are also some explicit pad fields in struct dirent.  Add a subroutine
to clear these bytes and use it in the in-tree filesystems.  The
NFS client is omitted for now as it was fixed separately in r340787.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-23 22:24:59 +00:00
Mateusz Guzik
e3d3e8289b Revert "fork: fix use-after-free with vfork"
This unreliably breaks libc handling of vfork where forking succeded,
but execve did not.

vfork code in libc performs waitpid with WNOHANG in case of failed exec.
With the fix exit codepath was waking up the parent before the child
fully transitioned to a zombie. Woken up parent would waitpid, which
could find a not-yet-zombie child and fail to reap it due to the WNOHANG
flag.

While removing the flag fixes the problem, it is not an option due to older
releases which would still suffer from the kernel change.

Revert the fix until a solution can be worked out.

Note that while use-after-free which gets back due to the revert is a real
bug, it's side-effects are limited due to the fact that struct proc memory
is never released by UMA.
2018-11-23 04:38:50 +00:00
Mateusz Guzik
adce241981 Annotate TDP_RFPPWAIT as unlikely.
The flag is only set on vfork, but is tested for *all* syscalls.
On amd64 this shortens common-case (not vfork) code.
2018-11-22 21:38:24 +00:00
Mateusz Guzik
a5ac8272c0 fork: remove avoidable proc lock/unlock pair
We don't have to access the process after making it runnable, so there
is no need to hold it either.

Sponsored by:	The FreeBSD Foundation
2018-11-22 21:29:36 +00:00
Mateusz Guzik
b00b27e925 fork: fix use-after-free with vfork
The pointer to the child is stored without any reference held. Then it is
blindly used to wait until P_PPWAIT is cleared. However, if the child is
autoreaped it could have exited and get freed before the parent started
waiting.

Use the existing hold mechanism to mitigate the problem. Most common case
of doing exec remains unchanged. The corner case of doing exit performs
wake up before waiting for holds to clear.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18295
2018-11-22 21:08:37 +00:00
Mark Johnston
79db6fe7aa Plug some networking sysctl leaks.
Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland.  Fix a number of such handlers.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	tuexen
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18301
2018-11-22 20:49:41 +00:00
Mateusz Guzik
f218ac5087 uipc_usrreq: fix inode number assignment
The code was incrementing a global variable in an unsafe manner.
Two different threads stating two different sockets could have resulted
in the same inode numbers assigned to both.

Creation is protected with a global lock, move the assigment there.
Since inode numbers are 64-bit now drop the check for overflows.

Sponsored by:	The FreeBSD Foundation
2018-11-21 22:25:05 +00:00
Mateusz Guzik
a627b4629d proc: update list manipulation comment on process exit
Processes stay in the hash until they get reaped.

This code does not unlink the child from the parent, so remove
the claim that it does.

Sponsored by:	The FreeBSD Foundation
2018-11-21 22:16:10 +00:00
Mateusz Guzik
7883ce1f26 uipc_shm: use unr64 for inode numbers
Sponsored by:	The FreeBSD Foundation
2018-11-21 22:01:06 +00:00
Mateusz Guzik
53011553fa proc: convert pfind & friends to use pidhash locks and other cleanup
pfind_locked is retired as it relied on allproc which unnecessarily
restricts locking of the hash.

Sponsored by:	The FreeBSD Foundation
2018-11-21 20:15:56 +00:00
Mateusz Guzik
3d3e6793f6 proc: implement pid hash locks and an iterator
forks, exits and waits are frequently stalled during poudriere -j 128 runs
due to killpg and process list exports performed for each package.

Both uses take the allproc lock. The latter case can be modified to iterate
over the hash with finer grained locking instead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17817
2018-11-21 18:56:15 +00:00
Mark Johnston
d5e494fee4 Avoid unsynchronized updates to kn_status.
kn_status is protected by the kqueue's lock, but we were updating it
without the kqueue lock held.  For EVFILT_TIMER knotes, there is no
knlist lock, so the knote activation could occur during the kn_status
update and result in KN_QUEUED being lost, in which case we'd enqueue
an already-enqueued knote, corrupting the queue.

Fix the problem by setting or clearing KN_DISABLED before dropping the
kqueue lock to call into the filter.  KN_DISABLED is used only by the
core kevent code, so there is no side effect from setting it earlier.

Reported and tested by:	Sylvain GALLIANO <sg@efficientip.com>
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18060
2018-11-21 17:32:09 +00:00
Mark Johnston
45aecd0422 Remove KN_HASKQLOCK.
It is a write-only flag whose last use was removed in r302235.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18059
2018-11-21 17:28:10 +00:00
Mark Johnston
bb58b5d670 Add a taskqueue_quiesce(9) KPI.
This is similar to taskqueue_drain_all(9) but will wait for the queue
to become idle before returning instead of only waiting for
already-enqueued tasks to finish.  This will be used in the opensolaris
compat layer.

PR:		227784
Reviewed by:	cem
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17975
2018-11-21 17:18:27 +00:00
Mark Johnston
c7dc361d6f Clear pad bytes in the struct exported by kern.ntp_pll.gettime.
Reported by:	Thomas Barabosch, Fraunhofer FKIE
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-20 20:32:10 +00:00
Mateusz Guzik
737037f6c0 pipe: use unr64
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18054
2018-11-20 14:59:27 +00:00
Mateusz Guzik
435bef7a2f Implement unr64
Important users of unr like tmpfs or pipes can get away with just
ever-increasing counters, making the overhead of managing the state
for 32 bit counters a pessimization.

Change it to an atomic variable. This can be further sped up by making
the counts variable "allocate" ranges and store them per-cpu.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18054
2018-11-20 14:58:41 +00:00
Ben Widawsky
1a305bda15 acpi: fix acpi_ec_probe to only check EC devices
This patch utilizes the fixed_devclass attribute in order to make sure
other acpi devices with params don't get confused for an EC device.

The existing code assumes that acpi_ec_probe is only ever called with a
dereferencable acpi param. Aside from being incorrect because other
devices of ACPI_TYPE_DEVICE may be probed here which aren't ec devices,
(and they may have set acpi private data), it is even more nefarious if
another ACPI driver uses private data which is not dereferancable. This
will result in a pointer deref during boot and therefore boot failure.

On X86, as it stands today, no other devices actually do this (acpi_cpu
checks for PROCESSOR type devices) and so there is no issue. I ran into
this because I am adding such a device which gets probed before
acpi_ec_probe and sets private data. If ARM ever has an EC, I think
they'd run into this issue as well.

There have been several iterations of this patch. Earlier
iterations had ECDT enumerated ECs not call into the probe/attach
functions of this driver. This change was Suggested by: jhb@.

Reviewed by:    jhb
Approved by:	emaste (mentor)
Differential Revision:  https://reviews.freebsd.org/D16635
2018-11-19 18:29:03 +00:00
Hans Petter Selasky
90acd1d139 Minor code factoring. No functional change.
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-11-19 09:36:09 +00:00
Hans Petter Selasky
2205f61a31 Be more verbose when a sysctl fails to unregister.
Print name of sysctl in question.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-11-19 09:35:16 +00:00
Kevin Bowling
2a24f4d911 Retire sbsndptr() KPI
As of r340465 all consumers use sbsndptr_adv and sbsndptr_noadv

Reviewed by:	gallatin
Approved by:	krion (mentor)
Differential Revision:	https://reviews.freebsd.org/D17998
2018-11-19 00:54:31 +00:00
Mateusz Guzik
2c054ce924 proc: always store parent pid in p_oppid
Doing so removes the dependency on proctree lock from sysctl process list
export which further reduces contention during poudriere -j 128 runs.

Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17825
2018-11-16 17:07:54 +00:00
Mark Johnston
aeb7a84ee1 Remove mostly-useless proc provider probes.
For some reason the proc UMA zone's ctor, dtor and init functions are
instrumented, but these functions are always available through FBT.
Moreover, the probes are not part of the original Solaris proc
provider, aren't documented, have no uses (e.g., in dwatch(8)) and
have no clear use to begin with.  Therefore, remove them.

Reviewed by:	rpaulo
Differential Revision:	https://reviews.freebsd.org/D2169
2018-11-15 23:02:59 +00:00
Warner Losh
36173f6976 Do proper conversion to/from sbt.
Doh! sbttoX and Xtosbt were backwards. While they ran, they produced
bogus results.

Pointy hat to: imp@
2018-11-15 16:02:24 +00:00
Gleb Smirnoff
905837ebe7 Initialize compatibility epoch tracker for thread0. Fixes
panics for drivers that call if_maddr_lock() during startup.

Reported by:	cy
2018-11-14 19:10:35 +00:00
Brooks Davis
5b1df30051 Use the main capabilities.conf for freebsd32.
Allow the location of capabilities.conf to be configured.

Also allow a per-abi syscall prefix to be configured with the
abi_func_prefix syscalls.conf variable and check syscalls against
entries in capabilities.conf with and without the prefix amended.

Take advantage of these two features to allow use shared capabilities.conf
between the default syscall vector and the freebsd32 compatability
layer.  We've been inconsistent about keeping the two in sync as
evidenced by the bugs fixed in r340294.  This eliminates that problem
going forward.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17932
2018-11-14 00:46:02 +00:00
Gleb Smirnoff
6febf18036 Fix build on some architectures after r340413. On amd64 epoch.h
appeared to be included implicitly.
2018-11-14 00:33:03 +00:00
Matt Macy
91cf497515 epoch(9) revert r340097 - no longer a need for multiple sections per cpu
I spoke with Samy Bahra and recent changes to CK to make ck_epoch_call and
ck_epoch_poll not modify the record have eliminated the need for this.
2018-11-14 00:12:04 +00:00
Gleb Smirnoff
635c18840a style(9), mostly adjusting overly long lines. 2018-11-13 23:57:34 +00:00
Gleb Smirnoff
a760c50c9e With epoch not inlined, there is no point in using _lite KPI. While here,
remove some unnecessary casts.
2018-11-13 23:45:38 +00:00
Gleb Smirnoff
9f360eecf9 The dualism between epoch_tracker and epoch_thread is fragile and
unnecessary. So, expose CK types to kernel and use a single normal
structure for epoch_tracker.

Reviewed by:	jtl, gallatin
2018-11-13 23:20:55 +00:00
Gleb Smirnoff
b79aa45e0e For compatibility KPI functions like if_addr_rlock() that used to have
mutexes but now are converted to epoch(9) use thread-private epoch_tracker.
Embedding tracker into ifnet(9) or ifnet derived structures creates a non
reentrable function, that will fail miserably if called simultaneously from
two different contexts.
A thread private tracker will provide a single tracker that would allow to
call these functions safely. It doesn't allow nested call, but this is not
expected from compatibility KPIs.

Reviewed by:	markj
2018-11-13 22:58:38 +00:00
Mateusz Guzik
f183fb162c locks: plug warnings about unitialized variables
They only showed up after I redefined LOCKSTAT_ENABLED to 0.

doing_lockprof in mutex.c is a real (but harmless) bug. Should the
value be non-zero it will do checks for lock profiling which would
otherwise be skipped.

state in rwlock.c is a wart from the compiler, the value can't be
used if lock profiling is not enabled.

Sponsored by:	The FreeBSD Foundation
2018-11-13 21:29:56 +00:00
Eric van Gyzen
d54474e63b Make no assertions about lock state when the scheduler is stopped.
Change the assert paths in rm, rw, and sx locks to match the lock
and unlock paths.  I did this for mutexes in r306346.

Reported by:	Travis Lane <tlane@isilon.com>
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2018-11-13 20:48:05 +00:00
Gleb Smirnoff
a82296c2df Uninline epoch(9) entrance and exit. There is no proof that modern
processors would benefit from avoiding a function call, but bloating
code. In fact, clang created an uninlined real function for many
object files in the network stack.

- Move epoch_private.h into subr_epoch.c. Code copied exactly, avoiding
  any changes, including style(9).
- Remove private copies of critical_enter/exit.

Reviewed by:	kib, jtl
Differential Revision:	https://reviews.freebsd.org/D17879
2018-11-13 19:02:11 +00:00
Mark Johnston
bb4a27f927 Allow allocations across meta boundaries.
Remove restrictions that prevent allocation requests to cross the
boundary between two meta nodes.

Replace the bmu_avail field in meta nodes with a bitmap that identifies
which subtrees have some free memory, and iterate over the nonempty
subtrees only in blst_meta_alloc.  If free memory is scarce, this should
make searching for it faster.

Put the code for handling the next-leaf allocation in a separate
function.  When taking blocks from the next leaf empties the leaf, be
sure to clear the appropriate bit in its parent, and so on, up to the
least-common ancestor of this leaf and the next.

Eliminate special terminator nodes, and rely instead on the fact that
there is a 0-bit at the end of the bitmask at the root of the tree that
will stop a meta_alloc search, or a next-leaf search, before the search
falls off the end of the tree. Make sure that the tree is big enough to
have space for that 0-bit.

Eliminate special all-free indicators.  Lazy initialization of subtrees
stands in the way of having an allocation span a meta-node boundary, so
a subtree of all free blocks is not treated specially.  Subtrees of
all-allocated blocks are still recognized by looking at the bitmask at
the root and finding 0.

Don't print all-allocated subtrees.  Do print the bitmasks for meta
nodes, when tree-printing.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12635
2018-11-13 18:40:01 +00:00
Kyle Evans
75beb4d46a Add dynamic_kenv assertion to init_static_kenv
Both to formally document the requirement that this not be called after the
dynamic kenv is setup, and to perhaps help static analyzers figure out
what's going on. While calling init_static_kenv this late isn't fatal, there
are some caveats that the caller should be aware of:

- Late calls are effectively a no-op, as far as default FreeBSD is
concerned, as everything will switch to searching the dynamic kenv once it's
available.

- Each of the kern_getenv calls will leak memory, as it's assumed that
these are searching static environment and allocations will not be made.

As such, this usage is not sensible and should be detected.
2018-11-13 04:34:30 +00:00
Konstantin Belousov
389474c122 Allow set ether/vlan PCP operation from the VNET jails.
The vlan interfaces can be created from vnet jails, it seems, so it
sounds logical to allow pcp configuration as well.

Reviewed by:	bz, hselasky (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17777
2018-11-12 15:59:32 +00:00
Conrad Meyer
0d1467b199 netdump: Fix netdumping with INVARIANTS kernels
Correct boneheaded assertion I added in r339501.  Mea culpa.

The intent is to notice when an M_WAITOK zone allocation would fail during
netdump, not to prevent all use of mbufs during netdump.

Reviewed by:	markj
X-MFC-With:	r339501
Differential Revision:	https://reviews.freebsd.org/D17957
2018-11-12 05:24:20 +00:00
Konstantin Belousov
8782eef46f Remove one-use variable.
This also removes a lot of #ifdefs and cleans up a warning when the
AUDIT kernel option is defined, but neither KDTRACE_HOOKS nor MAC are.

Reported and tested by:	danger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-11-11 00:21:28 +00:00
Konstantin Belousov
ade85c5eec Allow absolute paths for O_BENEATH.
The path must have a tail which does not escape starting/topping
directory.  The documentation will come shortly, see the man pages
commit message for the reason of separate commit.

Reviewed by:	jilles (previous version)
Discussed with:	emaste
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17714
2018-11-11 00:04:36 +00:00
Brooks Davis
9a38df59e9 Fix freebsd32 mknod(at).
As dev_t is now a 64-bit integer, it requires special handling as a
system call argument.  64-bit arguments are split between two 64-bit
integers due to the way arguments are promoted to allow reuse of most
system call implementations.  They must be reassembled before use.
Further, 64-bit arguments at an odd offset (counting from zero) are
padded and slid to the next slot on powerpc and mips.  Fix the
non-COMPAT11 system call by adding a freebsd32_mknodat() and
appropriately padded declerations.

The COMPAT11 system calls are fully compatible with the 64-bit
implementations so remove the freebsd32_ versions.

Use uint32_t consistently as the type of the old dev_t.  This matches
the old definition.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17928
2018-11-09 21:01:16 +00:00
Brooks Davis
b34f4419fb Make freebsd32_umtx_op follow the freebsd32_foo convention.
Sponsored by:	DARPA, AFRL
2018-11-09 00:46:10 +00:00
John Baldwin
4bf4b0f139 Enable non-executable stacks by default on RISC-V.
Reviewed by:	markj
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D17878
2018-11-07 18:32:02 +00:00
Brooks Davis
5577e44bf4 Regen after r340221: allow pointer return types.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17873
2018-11-07 16:56:07 +00:00
Brooks Davis
e56ec0e519 makesyscalls.sh: allow pointer return types.
The previous code required that the return type be a single word.  This
allows it to be a pointer without using a typedef.

Update the return types of break, mmap, and shmat to be void * as
declared.  This only effects systrace output in-tree, but can aid in
generating system call wrappers from syscalls.master.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17873
2018-11-07 16:55:04 +00:00
Mark Johnston
f8a222010f Avoid fixing the tty_info() buffer size in tty.h.
Different compilation units may otherwise get a different view of the
layout of struct tty depending on whether they include opt_printf.h.
This caused a blowup in the number of types defined in the kernel's
CTF file after r339468; thanks to dim@ for bisecting down to that
revision.

PR:		232675
Reported by:	dim
Reviewed by:	cem (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17877
2018-11-06 23:41:44 +00:00
Mark Johnston
07702f72e5 Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map.
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.

Reported by:	jhb
Reviewed by:	alc, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17827
2018-11-06 21:57:03 +00:00
Mark Johnston
6741ea083f We need opt_stack.h after r339605.
Reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
2018-11-06 21:47:22 +00:00
Brooks Davis
dd4d2f216f Update some comments made obsolete by recent commits. 2018-11-06 20:45:15 +00:00
Brooks Davis
938e8dcf60 Regen after r340199: Use declared types for caddr_t arguments.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17852
2018-11-06 18:47:29 +00:00
Brooks Davis
318f0d7720 Use declared types for caddr_t arguments.
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17852
2018-11-06 18:46:38 +00:00
Mariusz Zaborski
f4a035b8df Regenerate after r340129.
Pointed out by:	brooks
2018-11-06 18:03:04 +00:00
Mark Johnston
f71ef9b686 Use plain atomic_{add,subtract} when that's sufficient.
CID:		1386920
MFC after:	2 weeks
2018-11-06 17:32:25 +00:00
Andrew Turner
4ea56599e8 Port the NetBSD ubsan runtime to the FreeBSD kernel.
This allows us to build the ubsan code added in r340189 into the kernel
with the KUBSAN option. This will report when undefined behaviour is
detected in the currently running kernel.

As it can be large, the kernel is 65MB on arm64, loader may not be able to
load the kernel on all architectures so is disabled by default for now.

Sponsored by:	DARPA, AFRL
2018-11-06 17:32:07 +00:00
Andrew Turner
0645126fae Import the NetBSD micro ubsan code for the kernel.
This imports revision 1.3 of common/lib/libc/misc/ubsan.c from NetBSD, the
micro-ubsan code. It is an implementation of the Undefined Behavior
Sanitizer runtime for use with recent clang and gcc.

The uubsan code will be used in a later commit to implement kubsan to help
find undefined behavior in the kernel.

Sponsored by:	DARPA, AFRL
2018-11-06 16:56:49 +00:00
Brooks Davis
44cbc1c2b7 Fix a couple indentation errors in r339958. 2018-11-06 00:09:43 +00:00
John Baldwin
4cbbb74888 Add a KPI for the delay while spinning on a spin lock.
Replace a call to DELAY(1) with a new cpu_lock_delay() KPI.  Currently
cpu_lock_delay() is defined to DELAY(1) on all platforms.  However,
platforms with a DELAY() implementation that uses spin locks should
implement a custom cpu_lock_delay() doesn't use locks.

Reviewed by:	kib
MFC after:	3 days
2018-11-05 21:34:17 +00:00
Mariusz Zaborski
82560231d3 capsicum: allow ppoll(2) in capability mode
We already allow to use poll(2). There is no reason to disallow ppoll(2).

PR:		232495
Submitted by:	Stefan Grundmann <sg2342@googlemail.com>
Reviewed by:	cem, oshogbo
MFC after:	2 weeks
2018-11-04 17:12:53 +00:00
Matt Macy
10f42d244b Convert epoch to read / write records per cpu
In discussing D17503 "Run epoch calls sooner and more reliably" with
sbahra@ we came to the conclusion that epoch is currently misusing the
ck_epoch API. It isn't safe to do a "write side" operation (ck_epoch_call
or ck_epoch_poll) in the middle of a "read side" section. Since, by definition,
it's possible to be preempted during the middle of an EPOCH_PREEMPT
epoch the GC task might call ck_epoch_poll or another thread might call
ck_epoch_call on the same section. The right solution is ultimately to change
the way that ck_epoch works for this use case. However, as a stopgap for
12 we agreed to simply have separate records for each use case.

Tested by: pho@

MFC after:	3 days
2018-11-03 03:43:32 +00:00
Brooks Davis
4e8c73eb20 Regen after r340080: Add const to input-only char * arguments.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17812
2018-11-02 20:56:19 +00:00
Brooks Davis
12e69f96a2 Add const to input-only char * arguments.
These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declerations of system calls.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17812
2018-11-02 20:50:22 +00:00
Warner Losh
003ffd57fe Add sysctl_usec_to_sbintime and sysctl_msec_to_sbintime.
These functions are used to present a sbintime_t as either a number of
microseconds or a number of milliseconds respectively.

Sponsored by: Netflix
2018-11-02 17:50:57 +00:00
Mark Johnston
2203c46d87 Initialize the eflags field of vm_map headers.
Initializing the eflags field of the map->header entry to a value with a
unique new bit set makes a few comparisons to &map->header unnecessary.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14005
2018-11-02 16:26:44 +00:00
Brooks Davis
1493c2ee62 Make vop_symlink take a const target path.
This will enable callers to take const paths as part of syscall
decleration improvements.

Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.

Bump __FreeBSD_version for external API consumers.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17805
2018-11-02 14:42:36 +00:00
Conrad Meyer
78c2a9806e kern_poll: Restore explanatory comment removed in r177374
The comment isn't stale.  The check is bogus in the sense that poll(2)
does not require pollfd entries to be unique in fd space, so there is no
reason there cannot be more pollfd entries than open or even allowed
fds.  The check is mostly a seatbelt against accidental misuse or
abuse.  FD_SETSIZE, while usually unrelated to poll, is used as an
arbitrary floor for systems with very low kern.maxfilesperproc.

Additionally, document this possible EINVAL condition in the poll.2
manual.

No functional change.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17671
2018-11-01 23:46:23 +00:00
Brooks Davis
f7e5ce325f Regent after r340034: Use mode_t when the documented signature does.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17784
2018-11-01 23:10:53 +00:00
Brooks Davis
2105ac07d7 Use mode_t when the documented signature does.
This is more clear and produces better results when generating function
stubs from syscalls.master.

Reviewed by:	kib, emaste
Obtained from:	CheribSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17784
2018-11-01 23:06:50 +00:00
John Baldwin
b317cfd4c0 Don't enter DDB for fatal traps before panic by default.
Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic'
and make the calls to kdb_trap() in MD fatal trap handlers prior to
calling panic() conditional on this new knob instead of
'debugger_on_panic'.  Disable the new knob by default.  Developers who
wish to recover from a fatal fault by adjusting saved register state
and retrying the faulting instruction can still do so by enabling the
new knob.  However, for the more common case this makes the user
experience for panics due to a fatal fault match the user experience
for other panics, e.g. 'c' in DDB will generate a crash dump and
reboot the system rather than being stuck in an infinite loop of fatal
fault messages and DDB prompts.

Reviewed by:	kib, avg
MFC after:	2 months
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D17768
2018-11-01 21:34:17 +00:00
Brooks Davis
e3e5481326 Reformat syscalls.master for better readability.
This takes advantage of two recents changes to makesyscalls.sh:
r328598: Permit a range of syscall numbers for UNIMPL
r339624: Remove the need for backslashes in syscalls.master

Syscall declerations are now split across multiple lines with the
syscall name and variables each on seperate lines (with an exception for
syscalls taking no arguments.)

Reviewed by:	imp
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17706
2018-10-31 16:17:45 +00:00
Bjoern A. Zeeb
9afc56849a Fix mips build after r339931.
I erroneously thought that it was two 64bit platforms which use link_elf_obj.c.

PR:		228854
Reported by:	ci.f.o.
MFC after:	3 days
X-MFC with:	r339931
Pointyhat to:	bz
2018-10-30 21:35:56 +00:00
Bjoern A. Zeeb
0f823b6497 As a follow-up to r339930 and various reports implement logging in case
we fail during module load because the pcpu or vnet module sections are
full.  We did return a proper error but not leaving any indication to
the user as to what the actual problem was.

Even worse, on 12/13 currently we are seeing an unrelated error (ENOSYS
instead of ENOSPC, which gets skipped over in kern_linker.c) to be
printed which made problem diagnostics even harder.

PR:		228854
MFC after:	3 days
2018-10-30 20:51:03 +00:00
Mark Johnston
9978bd996b Add malloc_domainset(9) and _domainset variants to other allocator KPIs.
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain.  The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted.  Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with:	gallatin, jeff
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17418
2018-10-30 18:26:34 +00:00
Mark Johnston
920239efde Fix some problems that manifest when NUMA domain 0 is empty.
- In uma_prealloc(), we need to check for an empty domain before the
  first allocation attempt, not after.  Fix this by switching
  uma_prealloc() to use a vm_domainset iterator, which addresses the
  secondary issue of using a signed domain identifier in round-robin
  iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
  excluding empty domains; otherwise we may frequently specify an empty
  domain when calling in to the page allocator, wasting CPU time.
  Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
  total page counts for now: some vm_phys segments are created before
  the SRAT is parsed and are thus always identified as being in domain 0
  even when they are not.  Then, when bootstrap pages are freed, they
  are added to a domain that we had previously thought was empty.  Until
  this is corrected, we simply exclude them from the per-domain page
  count.

Reported and tested by:	Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17704
2018-10-30 17:57:40 +00:00
Eric van Gyzen
fcbb889fdb Always stop the scheduler when entering kdb
Set curthread->td_stopsched when entering kdb via any vector.
Previously, it was only set when entering via panic, so when
entering kdb another way, mutexes and such were still "live",
and an attempt to lock an already locked mutex would panic.

Reviewed by:	kib, cem
Discussed with:	jhb
Tested by:	pho
MFC after:	2 months
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17687
2018-10-30 14:54:15 +00:00
Stephen Hurd
5201e0f110 Drain grouptaskqueue of the gtask before detaching it.
taskqgroup_detach() would remove the task even if it was running or
enqueued, which could lead to panics (see D17404). With this change,
taskqgroup_detach() drains the task and sets a new flag which prevents the
task from being scheduled again.

I've added grouptask_block() and grouptask_unblock() to allow control
over the flag from other locations as well.

Reviewed by:	Jeffrey Pieper <jeffrey.e.pieper@intel.com>
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17674
2018-10-29 14:36:03 +00:00
Mark Johnston
4aed5937db Use M_WAITOK in init_hwpmc().
No functional change intended.

MFC after:	2 weeks
2018-10-27 18:48:49 +00:00
Conrad Meyer
2384981b61 poll: Unify userspace pollfd pointer name
Some of the poll code used 'fds' and some used 'ufds' to refer to the
uap->fds userspace pointer that was passed around to subroutines.  Some of
the poll code used 'fds' to refer to the kernel memory pollfd arrays, which
seemed unnecessarily confusing.

Unify on 'ufds' to refer to the userspace pollfd array.

Additionally, 'bits' is not an accurate description of the kernel pollfd
array in kern_poll, so rename that to 'kfds'.  Finally, clean up some logic
with mallocarray() and nitems().

No functional change.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D17670
2018-10-26 20:07:46 +00:00
Brooks Davis
ed34a7fcf2 Move 32-bit compat support for FIODGNAME to the right place.
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.

The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.

Unlike r339174 this change supports both places FIODGNAME is handled.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17475
2018-10-26 17:59:25 +00:00
Konstantin Belousov
4f77f48884 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
Mark Johnston
ad054101eb Remove a dead store.
CID:		1304878
MFC after:	1 week
2018-10-25 17:36:28 +00:00
Mark Johnston
970a174f3b Add FALLTHROUGH comments to appease Coverity.
CID:		1017862-1017864, 1017866-1017868
MFC after:	2 weeks
2018-10-25 15:43:21 +00:00
Mark Johnston
a0a18fd46b Remove a redundant check.
CID:		1042100
MFC after:	2 weeks
2018-10-25 15:40:59 +00:00
Konstantin Belousov
4fceda6206 Correct condition to detect mount(2) support by a filesystem.
Reported and tested by:	cy
Sponsored by:	The FreeBSD Foundation
Approved by:	re (rgrimes)
2018-10-24 19:40:09 +00:00
Konstantin Belousov
8ff7fad1d7 Only call sigdeferstop() for NFS.
Use bypass to catch any NFS VOP dispatch and route it through the
wrapper which does sigdeferstop() and then dispatches original
VOP. NFS does not need a bypass below it, which is not supported.

The vop offset in the vop_vector is added since otherwise it is
impossible to get vop_op_t from the internal table, and I did not
wanted to create the layered fs only to wrap NFS VOPs.

VFS_OP()s wrap is straightforward.

Requested and reviewed by:	mjg (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17658
2018-10-23 21:43:41 +00:00
Eric Joyner
46fa0c2552 Revert r339634.
That commit is causing kernel panics in em(4), so this will be reverted
until those are fixed.

Reported by:	ae@, pho@, et al
Sponsored by:	Intel Corporation
2018-10-23 17:06:36 +00:00
Eric Joyner
940f62d616 iflib: drain enqueued tasks before detaching from taskqgroup
The taskqgroup_detach function does not check if task is already enqueued when
detaching it. This may lead to kernel panic if enqueued task starts after
context state lock is destroyed. Ensure that the already enqueued admin tasks
are executed before detaching them.

The issue was discovered during validation of D16429. Unloading of if_ixlv
followed by immediate removal of VFs with iovctl -D may lead to panic on
NODEBUG kernel.

As well, check if iflib is in detach before enqueueing new admin or iov
tasks, to prevent new tasks from executing while the taskqgroup tasks
are being drained.

Submitted by:	Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by:	shurd@, erj@
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D17404
2018-10-23 04:37:29 +00:00
Brooks Davis
9d7051d920 Remove the need for backslashes in syscalls.master.
Join non-special lines together until we hit a line containing a '}'
character. This allows the function declaration body to be split
across multiple lines without backslash continuation characters.

Continue to join lines ending with backslashes to allow gradual
migration and to support out-of-tree syscall vectors

Reviewed by:	emaste, kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17488
2018-10-22 22:13:00 +00:00
Brooks Davis
22c0c9a481 Remove __restrict qualifiers from syscalls.master.
The restruct qualifier is intended to aid code generation in the
compiler, but the only access to storage through these pointers is via
structs using copyin/copyout and the like which can not be written in C
or C++ and thus the compiler gains nothing from the qualifiers.

As such, the qualifiers add no value in current usage.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17574
2018-10-22 21:50:43 +00:00
Mark Johnston
b61f314290 Make it possible to disable NUMA support with a tunable.
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration.  With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR:		231460
Reviewed by:	alc, gallatin, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17439
2018-10-22 20:13:51 +00:00
Conrad Meyer
93940996fd Conditionalize kern.tty_info_kstacks feature on STACKS option
Fix tinderbox (mips XLPN32) after r339471.

Reported by:	tinderbox
X-MFC-With:	r339471
Sponsored by:	Dell EMC Isilon
2018-10-22 17:42:57 +00:00
Mark Johnston
21744c825f Don't import 0 into vmem quantum caches.
vmem uses UMA cache zones to implement the quantum cache.  Since
uma_zalloc() returns 0 (NULL) to signal an allocation failure, UMA
should not be used to cache resource 0.  Fix this by ensuring that 0 is
never cached in UMA in the first place, and by modifying vmem_alloc()
to fall back to a search of the free lists if the cache is depleted,
rather than blocking in qc_import().

Reported by and discussed with:	Brett Gutstein <bgutstein@rice.edu>
Reviewed by:	alc
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D17483
2018-10-22 16:16:42 +00:00
Warner Losh
e9b5375b04 Retire dpt(4)
Marked as gone in 12 and not relevant since the early 90s. No
sightings in nycbug's dmesg database.

Relnotes: yes
2018-10-22 02:35:12 +00:00
Warner Losh
a1db7455b7 Remove bt(4) driver
The buslogic scsi driver has been tagged as gone in 12 for some time
now. Remove it. The nycbug dmesg database shows only one sighting in 6
for this driver. It was very popular in the early days of the project,
but that popularity seems to have died by 2004 when the nycbug
database started up.

Relnotes: yes
2018-10-22 02:34:59 +00:00
Warner Losh
43b16da804 Remove adv(4) and adw(4)
Remove the advanssy drivers (both adv and adw). They were tagged as
gone in 12 a while qgo. The nycbug dmesg database shows this was last
seen in 6 and there were only a few adv sightings then (none for adw).

Relnotes: yes
2018-10-22 02:34:47 +00:00
Warner Losh
39c362e0b0 Remove aha(4) from the tree.
We tagged aha as gone in 12 a while ago. Proceed with its removal.
Data from nycbug's database shows the last sighting of this driver in
6, with the prior one in 4.x show its popularity had died prior to
4.x.

Relnotes: yes
2018-10-22 02:34:25 +00:00
Warner Losh
c4b23051af Remove stray refernce to pdq. Like the infamous twenty first of Johan
Sebastian Bach's twenty children, it hasn't been seen in many years.
2018-10-21 16:49:49 +00:00
Conrad Meyer
3937ee7557 netdump: Zone mbufs should be allocated before dump
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17306
2018-10-20 22:24:58 +00:00
Conrad Meyer
767bc248de ZSTDIO: Correctly initialize zstd context with provided 'level'
Prior to this revision, we allocated sufficient context space for 'level'
but never actually set the compress level parameter, so we would always get
the default '3'.

Reviewed by:	markj, vangyzen
MFC after:	12 hours
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17144
2018-10-20 21:49:44 +00:00
Conrad Meyer
ec86f8b28b dev_refthread: Do not initialize *ref when reference was not acquired
Like the companion API devvn_refthread, leave *ref uninitialized when a
reference was not acquired.  Initializing to 1 provides a vaguely
correct-looking but bogus value for broken callers to (mistakenly) pass to
dev_relthread() when refthread fails.

Make it even more clear to consumers that dev_relthread is only valid when
dev_refthread succeeds.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16885
2018-10-20 19:42:38 +00:00
Conrad Meyer
d7aa89c363 tty info (^T): Add optional kernel stack(9) traces
It is often useful for developers and administrators to determine a running
thread's stack for debugging purposes.  With this feature, using ^T will
print that information

For now, the feature is disabled by default.  Enable with sysctl
kern.tty_info_kstacks=1.

Discussed with:	markj
Reviewed by:	oshogbo
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17621
2018-10-20 18:42:28 +00:00
Conrad Meyer
6858c2cc8f Replace ttyprintf with sbuf_printf and tty drain routine
Add string variants of cnputc and tty_putchar, and use them from the tty
sbuf drain routine.

Suggested by:	ed@
Sponsored by:	Dell EMC Isilon
2018-10-20 18:31:36 +00:00
Conrad Meyer
d158fa4ade Add flags variants to linker_files / stack(9) symbol resolution
Some best-effort consumers may find trylock behavior for stack(9) symbol
resolution acceptable.  Expose that behavior to such consumers.

This API is ugly.  If in the future the modules and linker file list locking
is cleaned up such that the linker_files list can be iterated safely without
acquiring a sleepable lock, this API should be removed.  However, most of
the time nothing will be holding the linker files lock exclusive and the
acquisition can proceed.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17620
2018-10-20 18:08:43 +00:00
Mark Johnston
662e7fa8d9 Create some global domainsets and refactor NUMA registration.
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by:	alc, gallatin, jeff, kib
Tested by:	pho (part of a larger patch)
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17416
2018-10-20 17:36:00 +00:00
Bjoern A. Zeeb
9ae7bc39c2 In r78161 the lookup_set linker method was introduced which optionally
returns the section start and stop locations as well as a count if the
caller asks for them.
There was only one out-of-file consumer of count which did not actually
use it and hence was eliminated in r339407.
In r194784 parse_dpcpu(), and in r195699 parse_vnet() (a copy of the
former) started to use the link_elf_lookup_set() interface internally
also asking for the count.

count is computed as the difference of the void **stop - void **start
locations and as such, if the absoulte numbers
	(stop - start) % sizeof(void *) != 0
a round-down happens, e.g., **stop 0x1003 - **start 0x1000 => count 0.

To get the section size instead of "count is the number of pointer
elements in the section", the parse_*() functions do a
	count *= sizeof(void *).
They use the result to allocate memory and copy the section data
into the "master" and per-instance memory regions with a size of
count.

As a result of count possibly round-down this can miss the last
bytes of the section.  The good news is that we do not touch
out of bounds memory during these operations (we may at a later stage
if the last bytes would overflow the master sections).
Given relocation in elf_relocaddr() works based on the absolute
numbers of start and stop, this means that we can possibly try to
access relocated data which was never copied and hence we get
random garbage or at best zeroed memory.

Stop the two (last) consumers of count (the parse_*() functions)
from using count as well, and calculate the section size based on
the absolute numbers of stop and start and use the proper size for
the memory allocation and data copies.  This will make the symbols
in the last bytes of the pcpu or vnet sections be presented as
expected.

PR:			232289
Approved by:		re (gjb)
MFC after:		2 weeks
2018-10-18 20:20:41 +00:00
Jamie Gritton
4520f617c9 Fix typos from r339409.
Reported by:	maxim
Approved by:	re (gjb)
2018-10-18 15:02:57 +00:00
Jonathan T. Looney
e77f0bdcb5 r334853 added a "socket destructor" callback. However, as implemented, it
was really a "socket close" callback.

Update the socket destructor functionality to run when a socket is
destroyed (rather than when it is closed). The original submitter has
confirmed that this change satisfies the intended use case.

Suggested by:	rwatson
Submitted by:	Michio Honda <micchie at sfc.wide.ad.jp>
Tested by:	Michio Honda <micchie at sfc.wide.ad.jp>
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17590
2018-10-18 14:20:15 +00:00
Jamie Gritton
b19d66fd5a Add a new jail permission, allow.read_msgbuf. When true, jailed processes
can see the dmesg buffer (this is the current behavior).  When false (the
new default), dmesg will be unavailable to jailed users, whether root or
not.

The security.bsd.unprivileged_read_msgbuf sysctl still works as before,
controlling system-wide whether non-root users can see the buffer.

PR:		211580
Submitted by:	bz
Approved by:	re@ (kib@)
MFC after:	3 days
2018-10-17 16:11:43 +00:00
Bjoern A. Zeeb
0455a92bcb The countp argument passed to linker_file_lookup_set() in
linker_load_dependencies() is unused, so no need to ask for the
value in first place.  Remove the unused "count" variable.

Approved by:	re (kib)
2018-10-17 10:31:08 +00:00
Mark Johnston
ddab8c351a Reparent a child of pdfork(2) to its reaper when the procdesc is closed.
Unconditionally reparenting to PID 1 breaks the procctl(2) reaper
functionality.

Add a regression test for this case.

Reviewed by:	kib
Approved by:	re (gjb)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17589
2018-10-16 20:06:56 +00:00
Gleb Smirnoff
47f1ea5109 Plug sendfile(2) on a listening socket with proper error code.
Reported by:	ngie
Reviewed by:	ngie
Approved by:	re (delphij)
2018-10-16 15:57:16 +00:00
Kyle Evans
29bf3a7ba8 Correct COMPAT* macro names in syscalls.master
Both ^/sys/compat/freebsd32/syscalls.master and ^/sys/kern/syscalls.master
cited "COMPAT[n] #ifdef" instead of "COMPAT_FREEBSD[n] #ifdef" in places.

Approved by:	re (glebius)
2018-10-15 21:35:57 +00:00
Mateusz Guzik
98fca94d22 capsicum: provide cap_rights_fde_inline
Reading caps is in the hot path (on each successful fd lookup), but
completely unnecessarily requires a function call.

Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
2018-10-12 23:48:10 +00:00
Mateusz Guzik
c9964045a0 Add a file missed in r339321
Approved by:	re (implicit)
Sad face: mjg
2018-10-12 00:32:45 +00:00
Mateusz Guzik
3f102f5881 Provide string functions for use before ifuncs get resolved.
The change is a no-op for architectures which don't ifunc memset,
memcpy nor memmove.

Convert places which need them. Xen bits by royger.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17487
2018-10-11 23:28:04 +00:00
Jamie Gritton
08b4333399 Fix the test prohibiting jails from sharing IP addresses.
It's not supposed to be legal for two jails to contain the same IP address,
unless both jails contain only that one address.  This is the behavior
documented in jail(8), and is there to prevent confusion when multiple
jails are listening on IADDR_ANY.

VIMAGE jails (now the default for GENERIC kernels) test this correctly,
but non-VIMAGE jails have been performing an incomplete test when nested
jails are used.

Approved by:	re@ (kib@)
MFC after:	5 days
2018-10-06 02:10:32 +00:00
Matt Macy
e8bb589d56 eliminate locking surrounding ui_vmsize and swap reserve by using atomics
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.

Results in mmap speed up and a 70% increase in brk calls / second.

Reviewed by:	alc@, markj@, kib@
Approved by:	re (delphij@)
Differential Revision:	https://reviews.freebsd.org/D16273
2018-10-05 05:50:56 +00:00
Gleb Smirnoff
ad7eb8cad5 In PR 227259, a user is reporting that they have code which is using
shutdown() to wakeup another thread blocked on a stream listen socket.
This code is failing, while it used to work on FreeBSD 10 and still
works on Linux.

It seems reasonable to add another exception to support something users are
actually doing, which used to work on FreeBSD 10, and still works on Linux.
And, it seems like it should be acceptable to POSIX, as we still return
ENOTCONN.

This patch is different to what had been committed to stable/11, since
code around listening sockets is different. Patch in D15019 is written
by jtl@, slightly modified by me.

PR:		227259
Obtained from:	jtl
Approved by:	re (kib)
Differential Revision:  D15019
2018-10-03 17:40:04 +00:00
Andrew Turner
8696dcdacf Add kernel ifunc support on arm64.
Tested with ifunc resolvers in the kernel and module with calls from
kernel to kernel, module to kernel, and module to module.

Reviewed by:	kib (previous version)
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17370
2018-10-01 18:51:08 +00:00
Andrew Gallatin
30c5525b3c Allow empty NUMA memory domains to support Threadripper2
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2  memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
    rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
    Thanks to Jeff for suggesting this strategy.

Reviewed by:	alc, markj
Approved by:	re (gjb@)
Differential Revision:	https://reviews.freebsd.org/D1683
2018-10-01 14:14:21 +00:00