Commit Graph

16517 Commits

Author SHA1 Message Date
Edward Tomasz Napierala
1699546def Remove sv_pagesize, originally introduced with r100384.
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.

Reviewed by:	imp (who also contributed the commit message)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D19280
2019-03-01 16:16:38 +00:00
Marcin Wojtas
e3431664df Prevent detaching driver if the attach is not finished
When the device is in attaching state, detach should return
EBUSY instead of success. In other case, there could be race
between attach and detach during the driver unloading.

If driver goes sleep and releases GIANT lock during attaching,
unloading module could start. In such case when attach continues
after module unload, page fault "supervisor read instruction,
page not present" occurred.

This patch works around the real issue, which is a locking
deficiency of the busses.

Submitted by: Rafal Kozik <rk@semihalf.com>
Reviewed by: imp
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D19375
2019-03-01 01:18:39 +00:00
Mark Johnston
2b64ab22e8 Allow FIONBIO and FIOASYNC ioctls on POSIX shm descriptors.
They have no effect, as with filesystem file descriptors.
This improves compatibility with some existing userspace code.

Submitted by:	Greg V <greg@unrelenting.technology>
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D19330
2019-02-28 22:00:36 +00:00
Mateusz Guzik
55fda58146 Rename seq to seqc to avoid namespace clashes with Linux
Linux generates the content of procfs files using a mechanism prefixed with
seq_*. This in particular came up with recent gcov import.

Sponsored by:	The FreeBSD Foundation
2019-02-27 22:56:55 +00:00
Mark Johnston
2b6010705c Improve vmem tuning for platforms without a direct map.
On platforms without a direct map (i.e., platforms without
UMA_MD_SMALL_ALLOC defined), the boundary tag allocator reserves a
number of tags for use when allocating a new slab of boundary tags,
as such platforms require free boundary tags in order to allocate
boundary tags.  r327899 increased the number of boundary tags required
for a KVA allocation in the worst case, and the aforementioned
reservation was not updated accordingly.  In some cases, this could
lead to a system hang.  Fix the problem by increasing this reservation.

Also reduce KVA_QUANTUM on systems lacking superpage support.
The previous import quantum (4MB with a 4KB page size) was quite large
for systems with limited KVA, and fragmentation in kernel_arena could
cause kernel memory allocation failures even with a substantial amount
of free KVA.

Reported and tested by:	jhibbits
Reviewed by:	alc, kib
No objections:	jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19337
2019-02-25 19:22:13 +00:00
Andrew Turner
feb2cc805f Check the index hasn't changed after writing the cmp entry.
If an interrupt fires while writing the cmp entry we may have a partial
entry. Work around this by using atomic_cmpset to set the new index. If it
fails we need to set the previous index value and try again as the entry
may be in an inconsistent state.

This fixes messages similar to the following from syzkaller:
bad comp 224 type 2163727253

Reviewed by:	tuexen
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D19287
2019-02-25 13:15:34 +00:00
Matt Macy
ebe0b35a18 Change seq_read to seq_load to avoid namespace conflicts with lkpi
MFC after:	1 week
Sponsored by:	iX Systems
2019-02-23 21:04:48 +00:00
Matt Macy
983ed4f9f1 lkpi: allow late binding of linux_alloc_current
Some consumers may be loosely coupled with the lkpi.
This allows them to call linux_alloc_current without
having a static dependency.

Reviewed by:	hps@
MFC after:	1 week
Sponsored by:	iX Systems
Differential Revision:	https://reviews.freebsd.org/D19257
2019-02-22 23:15:32 +00:00
Andrew Turner
bdffe3b5bf Allow the kcov buffer to be mmaped multiple times.
After r344391 this restriction is no longer needed.

Sponsored by:	DARPA, AFRL
2019-02-21 10:11:15 +00:00
Andrew Turner
01ffedf593 Unwire the kcov buffer when freeing the info struct.
Without this the physical memory will not be returned to the kernel.

While here call vm_object_reference on the object when mmapping the buffer.
This removed the need for buggy tracking of if it has been mapped or not.

This fixes issues where kcov could use all the system memory.

Reported by:	tuexen
Reviewed by:	kib
Sponsored by:	DARPA, AFTL
Differential Revision:	https://reviews.freebsd.org/D19252
2019-02-20 22:41:14 +00:00
Andrew Turner
a759a0a001 Call pmap_qenter for each page when creating the kcov buffer.
This removes the need to allocate a buffer to hold the vm_page_t objects
at the cost of extra IPIs on some architectures.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D19252
2019-02-20 22:32:28 +00:00
Mark Johnston
093295ae49 Remove an obsolete comment.
MFC after:	3 days
2019-02-20 18:29:52 +00:00
Konstantin Belousov
1809ef7836 Implement rangesets.
The data structure implements non-intersecting intervals over the [0,
UINT64_MAX] range, and supports fast insert, predicated clearing of
subrange, and lookup of an interval containing the specified address.
Internally it is a pctrie over the interval start addresses.

Implementation provides additional guarantees over the structure state
in case of memory allocation failures.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:38:19 +00:00
Andrew Turner
72b66398fa Create a common function to handle freeing the kcov info struct.
Both places that may free the kcov info struct are identical. Create a new
common function to hold the code.

Sponsored by:	DARPA, AFRL
2019-02-19 17:03:34 +00:00
Mark Johnston
18a7de663b Move a racy assertion in filt_pipewrite().
EVFILT_WRITE knotes for pipes live on the knlist for the other end of the
pipe.  Since they do not hold a reference on the corresponding file
structure, they may be removed from the knlist by pipeclose() while still
remaining active.  In this case, there is no knlist lock acquired before
filt_pipewrite() is called, so the assertion fails.

Fix the problem by first checking whether that end of the pipe has been
closed.  These checks are memory safe since the knote holds a reference
on one end of the pipe, and the pipe structure is not freed until both
ends are closed.  The checks are not racy since PIPE_EOF is never cleared
after being set, and pipe_present is never set back to PIPE_ACTIVE after
pipeclose() has been called.

PR:		235640
Reported and tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19224
2019-02-19 15:46:43 +00:00
Mark Johnston
648890835c Remove a write-only variable orphaned by r340677. 2019-02-17 16:56:41 +00:00
Bruce Evans
23e5e43ccd Finish the fix for overflow in calcru1().
The previous fix was unnecessarily very slow up to 105 hours where the
simple formula used previously worked, and unnecessarily slow by a factor
of about 5/3 up to 388 days, and didn't work above 388 days.  388 days is
not a long time, since it is a reasonable uptime, and for processes the
times being calculated are aggregated over all threads, so with N CPUs
running the same thread a runtime of 388 days is reachable after only
388 / N physical days.

The PRs document overflow at 388 days, but don't try to fix it.

Use the simple formula up to 76 hours.  Then use a complicated general
method that reduces to the simple formula up to a bit less than 105
hours, then reduces to the previous method without its extra work up
to almost 388 days, then does more complicated reductions, usually
many bits at a time so that this is not slow.  This works up to half
of maximum representable time (292271 years), with accumulated rounding
errors of at most 32 usec.

amd64 can do all this with no avoidable rounding errors in an inline
asm with 2 instructions, but this is too special to use.  __uint128_t
can do the same with 100's of instructions on 64-bit arches.  Long
doubles with at least 64 bits of precision are the easiest method to
use on i386 userland, but are hard to use in the kernel.

PR:		76972 and duplicates
Reviewed by:	kib
2019-02-14 19:07:08 +00:00
Marius Strobl
f855ec814d Make taskqgroup_attach{,_cpu}(9) work across architectures
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.

While at it:
- Change the gt_cpu member in the grouptask structure to be of type
  int as used elswhere for specifying CPUs (an int16_t may be too
  narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
  the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
  argument given that it actually operates on a struct gtask rather
  than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
  names.

Reported by:	mmel
Reviewed by:	mmel
Differential Revision:	https://reviews.freebsd.org/D19139
2019-02-12 21:23:59 +00:00
Conrad Meyer
e0d164c7a6 Prevent overflow for usertime/systime in caclru1
PR:		76972 and duplicates
Reported by:	Dr. Christopher Landauer <cal AT aero.org>,
		Steinar Haug <sthaug AT nethelp.no>
Submitted by:	Andrey Zonov <andrey AT zonov.org> (earlier version)
MFC after:	2 weeks
2019-02-10 23:07:46 +00:00
Konstantin Belousov
fa50a3552d Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings.  It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory.  This
implementation is relatively small and happens at the correct
architectural level.  Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection.  It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation.  The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in.  The initial location for the anon group range
is, of course, randomized.  This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386.  This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs.  Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
  to force ASLR off for the given binary.  (A tool to edit the feature
  control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR:	208580 (exp runs)
Exp-runs done by:	antoine
Reviewed by:	markj (previous version)
Discussed with:	emaste
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
Andrew Turner
c50c26aa07 Fix the spelling of cov_unregister_pc.
When unregistering kcov from the coverage interface we should use the
unregister function, not the register function.

Sponsored by:	DARPA, AFRL
2019-02-08 16:18:17 +00:00
Konstantin Belousov
7cdb0b9d82 Fix renameat(2) for CAPABILITIES kernels.
When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but
corresponding namei ni_filecap is not initialized at all because the
lookup is absolute.  As result, the check was done against empty filecap
and syscall fails erronously.

Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.

PR:	222258
Reviewed by:	jilles, ngie
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC-note:	KBI breakage
Differential revision:	https://reviews.freebsd.org/D19096
2019-02-08 04:18:17 +00:00
Konstantin Belousov
6f26dd50c3 do_execve(): lock vnode when needed.
Code after exec_fail_dealloc label expects that the image vnode is
locked if present.  When copyout() of the strings or auxv vectors fails,
goto to the error handling did not relocked the vnode as required.

The copyout() can be made failing e.g. by creating an ELF image with
PT_GNU_STACK segment disabling the write.

Reported by:	Jonathan Stuart <n0t.jcs@gmail.com> (found by fuzzing)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-02-08 04:06:48 +00:00
Konstantin Belousov
eb785fab3b Port sysctl kern.elf32.read_exec from amd64 to i386.
Make it more comprehensive on i386, by not setting nx bit for any
mapping, not just adding PF_X to all kernel-loaded ELF segments.  This
is needed for the compatibility with older i386 programs that assume
that read access implies exec, e.g. old X servers with hand-rolled
module loader.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-07 02:17:34 +00:00
Mark Johnston
401ca034cd Avoid leaking fp references when truncating SCM_RIGHTS control messages.
Reported by:	pho
Approved by:	so
MFC after:	0 minutes
Security:	CVE-2019-5596
Sponsored by:	The FreeBSD Foundation
2019-02-05 17:55:08 +00:00
Bruce Evans
6fd2dcd428 Fix zapping of static hints and env in init_static_kenv(). Environments
are terminated by 2 NULs, but only 1 NUL was zapped.  Zapping only 1
NUL just splits the first string into an empty string and a corrupted
string.  All other strings in static hints and env remained live early
in the boot when they were supposed to be disabled.

Support calling init_static_kenv() very early in the boot, so as to
use the env very early in the boot.  Then the pointer to the loader
env may change after the first call due to enabling paging or otherwise
remapping the pointer.  Another call is needed to register the change.
Don't use the previous pointer in this (or any) later call.

Reviewed by:	kib
2019-02-05 15:34:55 +00:00
Conrad Meyer
e682df5397 extattr_list_vp: Narrow locked section somewhat
Suggested by:	mjg
Reviewed by:	kib, mjg
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19083
2019-02-05 04:47:21 +00:00
Conrad Meyer
c3eb848ce3 extattr_list_vp: Only take shared vnode lock
List is a 'read'-type operation that does not modify shared state; it's safe
for multiple thread to proceed concurrently.  This is reflected in the vnode
operation LISTEXTATTR locking protocol specification, which only requires a
shared lock.

(Similar to previous r248933.)

Reported by:	Case van Rij <case.vanrij AT isilon.com>
Reviewed by:	kib, mjg
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19082
2019-02-05 03:32:58 +00:00
Warner Losh
52467047aa Regularize the Netflix copyright
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
   piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.

Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
2019-02-04 21:28:25 +00:00
Konstantin Belousov
f02bc51c09 Do not call PHOLD() while owning the allproc_lock sx.
Otherwise the lock might recurse in faultin() if the process is
swapped out.

Reported by:	zeising
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-03 21:31:40 +00:00
Brooks Davis
12aec82c09 Remove iBCS2: also remove xenix syscall function support.
Missed in r342243.
2019-01-31 23:01:12 +00:00
Brooks Davis
90f2d5012a Regen after r342190.
Differential Revision:	https://reviews.freebsd.org/D18444
2019-01-31 22:58:17 +00:00
Gleb Smirnoff
eec189c70b Add new m_ext type for data for M_NOFREE mbufs, which doesn't actually do
anything except several assertions.  This type is going to be used for
temporary on stack mbufs, that point into data in receive ring of a NIC,
that shall not be freed.  Such mbuf can not be stored or reallocated, its
life time is current context.
2019-01-31 22:37:28 +00:00
Mark Johnston
919e7b5359 Prevent some kobj memory allocation failures from panicking the system.
Parts of the kobj(9) KPI assume a non-sleepable context for the purpose
of internal memory allocations, but currently have no way to signal an
allocation failure to the caller, so they just panic in this case.  This
can occur even when kobj_create() is called with M_WAITOK.  Fix some
instances of the problem by plumbing wait flags from kobj_create() through
internal subroutines.  Change kobj_class_compile() to assume a sleepable
context when called externally, since all existing callers use it in a
sleepable context.

To fix the problem fully the kobj_init() KPI must be changed.

Reported and tested by:	pho
Reviewed by:	kib (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19023
2019-01-31 22:27:39 +00:00
Alexander Motin
6afd921090 Only sort requests of types that have concept of offset.
Other types, such as BIO_FLUSH or BIO_ZONE, or especially new/unknown ones,
may imply some degree of ordering even if strict ordering is not requested
explicitly.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-01-30 17:24:50 +00:00
Andrew Turner
524553f56d Extract the coverage sanitizer KPI to a new file.
This will allow multiple consumers of the coverage data to be compiled
into the kernel together. The only requirement is only one can be
registered at a given point in time, however it is expected they will
only register when the coverage data is needed.

A new kernel conflig option COVERAGE is added. This will allow kcov to
become a module that can be loaded as needed, or compiled into the
kernel.

While here clean up the #include style a little.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18955
2019-01-29 11:04:17 +00:00
Andriy Voskoboinyk
58c43838f7 m_getm2: correct a comment.
The comment states that function always return a top of allocated mbuf;
however, the function actually return the overall mbuf chain top pointer.

Since there are already existing users of it (via m_getm(4) macro),
rephrase the comment and leave behavior unchanged.

PR:		134335
MFC after:	12 days
2019-01-27 16:44:27 +00:00
Konstantin Belousov
e5ac304989 Bump SPECNAMELEN to MAXNAMLEN.
This includes the bump for cdevsw d_version.  Otherwise, the impact on
the ABI (not KBI) is surprisingly low.  The most important affected
interface is devname(3) and ttyname(3) which already correctly handle
long names (and ttyname(3) should not be affected at all).

Still, due to the d_version bump, I argue that the change is not MFC-able.

Requested by:	mmacy
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18932
2019-01-27 00:46:06 +00:00
Kirk McKusick
dab83bd1e8 Add printing of b_ioflags to DDB `show buffer' command.
Sponsored by: Netflix
2019-01-25 21:24:09 +00:00
Konstantin Belousov
be8dd1428e Re-wrap long line after r341827.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-01-17 04:51:05 +00:00
Gleb Smirnoff
d1bb5d7d50 Fix mistake in r343030: move nswbuf calculation back to
kern_vfs_bio_buffer_alloc(), because in init_param2() nbuf
isn't really initialized yet.

Pointed out by:	bde
2019-01-16 20:20:38 +00:00
Konstantin Belousov
ea7e7006db Implement shmat(2) flag SHM_REMAP.
Based on the description in Linux man page.

Reviewed by:	markj, ngie (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18837
2019-01-16 05:15:57 +00:00
Xin LI
305bb04ee4 Use TD_IS_IDLETHREAD instead of unrolled version.
MFC after:	2 weeks
2019-01-15 06:44:37 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Gleb Smirnoff
46713135ae Add flag LK_NEW for lockinit() that is converted to LO_NEW and passed
down to lock_init().  This allows for lockinit() on a not prezeroed
memory.
2019-01-15 00:35:19 +00:00
Konstantin Belousov
28b740da38 Handle overflow in calculating max kmem size.
vm_kmem_size is u_long, and it might be not capable of holding page
count times PAGE_SIZE, even when scaled down by VM_KMEM_SIZE_SCALE.  As
bde reported, 12G PAE config ends up with zero for kmem size.

Explicitly check for overflow and clamp kmem size at vm_kmem_size_max.
If we end up at zero size because VM_KMEM_SIZE_MAX is not defined,
panic with clear explanation rather then failing in a way which is
hard to relate.

Reported by:	bde, pho
Tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18767
2019-01-14 07:31:19 +00:00
Jason A. Harmening
7dff7eda1a Handle SIGIO for listening sockets
r319722 separated struct socket and parts of the socket I/O path into
listening-socket-specific and dataflow-socket-specific pieces.  Listening
socket connection notifications are now handled by solisten_wakeup() instead
of sowakeup(), but solisten_wakeup() does not currently post SIGIO to the
owning process.

PR:	234258
Reported by:	Kenneth Adelman
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18664
2019-01-13 20:33:54 +00:00
Olivier Houchard
7045ac437b Instead of using an incomplete list of platforms that uses 64bits time_t
in 32bits mode, special case amd64, as i386 is the only arch that still
uses 32bits time_t.
2019-01-13 00:19:15 +00:00
Andrew Turner
be860eae0f Fix the check for the offset of td_frame and td_emuldata in struct thread.
Pointy hat:	andrew
Sponsored by:	DARPA, AFRL
2019-01-12 20:41:57 +00:00
Andrew Turner
b3c0d957a2 Add support for the Clang Coverage Sanitizer in the kernel (KCOV).
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).

Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.

The first item in the buffer is the number of trace event that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.

The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses on entry per trace event, while the later uses 4. As
such they are incompatible so only a single mode may be enabled.

KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.

A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.

[1] https://github.com/google/syzkaller

Submitted by:	Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by:	kib
Testing by:	tuexen
Sponsored by:	DARPA, AFRL
Sponsored by:	The FreeBSD Foundation (Mitchell Horne)
Differential Revision:	https://reviews.freebsd.org/D14599
2019-01-12 11:21:28 +00:00