15180 Commits

Author SHA1 Message Date
Mark Johnston
eab80d9276 Add a comment explaining the race fixed by r310423.
Suggested and reviewed by: jhb
X-MFC With:	r310423
2016-12-23 05:02:17 +00:00
Mark Johnston
aa3c544349 Revert part of r300109.
The removal of TAILQ_FOREACH_SAFE introduced a small race: when the last
thread on a sleepqueue is awoken, it reclaims the sleepqueue and may begin
executing on a different CPU before sleepq_resume_thread() returns. This
leaves a window during which it may go back to sleep and incorrectly be
awoken again by the caller of sleepq_broadcast().

Reported and tested by:	pho
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2016-12-22 17:51:44 +00:00
John Baldwin
99bc7e4123 Don't spin in pause() during early boot for kthreads other than thread0.
pause() uses a spin loop to simulate a sleep during early boot.  However,
we only need this for thread0 to get far enough in the boot process to
enable timers (at which point pause() can sleep).  For other kthreads,
sleeping in pause() is ok as the callout will be scheduled and will
eventually fire once thread0 initializes timers.

Tested by: 	Steven Kargl
Sleuthing by:	markj
MFC after:	1 week
Sponsored by:	Netflix
2016-12-20 19:44:44 +00:00
Konstantin Belousov
4afd808be7 Do not clear KN_INFLUX when not owning influx state.
For notes in KN_INFLUX|KN_SCAN state, the influx bit is set by a
parallel scan.  When knote() reports event for the vnode filters,
which require kqueue unlocked, it unconditionally sets and then clears
influx to keep note around kqueue unlock.  There, do not clear influx
flag if a scan set it, since we do not own it, instead we prevent scan
from executing by holding knlist lock.

The knote_fork() function has somewhat similar problem, it might set
KN_INFLUX for scanned note, drop kqueue and list locks, and then clear
the flag after relock.  A solution there would be different enough, as
well as the test program, so close the reported issue first.

Reported and test case provided by:	yjh0502@gmail.com
PR:	214923
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-19 22:18:36 +00:00
Konstantin Belousov
69baec3619 Switch from stdatomic.h to atomic.h for kernel.
Apparently stdatomic.h implementation for gcc 4.2 on sparc64 does not
work properly.  This effectively reverts r251803.

Reported and tested by:	lidl
Discussed with:	ed
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-16 17:41:20 +00:00
Ed Schouten
669a25b50d Document the existence of the {0, 6, ...} sysctl. 2016-12-15 15:45:11 +00:00
Jilles Tjoelker
b9a6fb9343 reaper: Make REAPER_KILL_SUBTREE actually work.
MFC after:	2 weeks
2016-12-14 22:49:20 +00:00
Ed Schouten
ae15715360 Add a "device_index" label to all sysctls under dev.$driver.$index.
This way it becomes possible to graph a property for all instances of a
single driver. For example, graphing the number of packets across all
USB controllers, the amount of dropped packets on all NICs, etc.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D8775
2016-12-14 13:03:01 +00:00
Ed Schouten
fd0f59709d Add labels to sysctls related to clocks.
Sysctls like kern.eventtimer.et.*.quality currently embed the name of
the clock device. This is problematic for the Prometheus metrics
exporter for two reasons:

- Some of those clocks have dashes in their names, which Prometheus
  doesn't allow to be used in metric names.
- It doesn't allow for extracting the same property of all clocks on the
  system from within a single query.

Attach a label to these nodes, so that the Prometheus metrics
exporter gives these metrics a uniform name with the name of the clock
attached as a label.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D8775
2016-12-14 12:56:58 +00:00
Ed Schouten
1e1f3941e4 Add support for attaching aggregation labels to sysctl objects.
I'm currently working on writing a metrics exporter for the Prometheus
monitoring system to provide access to sysctl metrics. Prometheus and
sysctl have some structural differences:

- sysctl is a tree of string component names.
- Prometheus uses a flat namespace for its metrics, but allows you to
  attach labels with values to them, so that you can do aggregation.

An initial version of my exporter simply translated

    hw.acpi.thermal.tz1.temperature

to

    sysctl_hw_acpi_thermal_tz1_temperature_celcius

while we should ideally have

    sysctl_hw_acpi_thermal_temperature_celcius{thermal_zone="tz1"}

allowing you to graph all thermal zones on a system in one go.

The change presented in this commit adds support for accomplishing this,
by providing the ability to attach labels to nodes. In the example I
gave above, the label "thermal_zone" would be attached to "tz1". As this
is a feature that will only be used very rarely, I decided to not change
the KPI too aggressively.

Discussed on:	hackers@
Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D8775
2016-12-14 12:47:34 +00:00
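As an illustration of the name translation described in 1e1f3941e4, here is a minimal userland sketch. It is not the exporter's code: format_metric(), its parameters and the buffer sizes are hypothetical, the unit suffix shown in the commit message is left out, and label values are not escaped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a Prometheus-style metric name from sysctl
 * name components.  labels[i] is NULL for an ordinary component, or the
 * label name attached to component i.
 * Returns 0 on success, -1 if a buffer is too small.
 */
static int
format_metric(char *out, size_t outsz, const char *const *comps,
    const char *const *labels, size_t n)
{
    char name[256] = "sysctl", lbl[256] = "";
    size_t nlen = strlen(name), llen = 0;
    int r;

    assert(out != NULL && outsz > 0);
    for (size_t i = 0; i < n; i++) {
        if (labels[i] == NULL) {
            /* Unlabeled component: becomes part of the flat name. */
            r = snprintf(name + nlen, sizeof(name) - nlen, "_%s",
                comps[i]);
            if (r < 0 || (size_t)r >= sizeof(name) - nlen)
                return (-1);
            nlen += (size_t)r;
        } else {
            /* Labeled component: moves into the {label="value"} set. */
            r = snprintf(lbl + llen, sizeof(lbl) - llen, "%s%s=\"%s\"",
                llen > 0 ? "," : "", labels[i], comps[i]);
            if (r < 0 || (size_t)r >= sizeof(lbl) - llen)
                return (-1);
            llen += (size_t)r;
        }
    }
    if (llen > 0)
        r = snprintf(out, outsz, "%s{%s}", name, lbl);
    else
        r = snprintf(out, outsz, "%s", name);
    return (r < 0 || (size_t)r >= outsz ? -1 : 0);
}
```

With the components hw, acpi, thermal, tz1, temperature and the label "thermal_zone" attached to tz1, the sketch yields sysctl_hw_acpi_thermal_temperature{thermal_zone="tz1"}.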
Gleb Smirnoff
1276a8363c Zero return value when counter_rate() switches over to next second and
value is positive, but below the limit.
2016-12-13 20:11:45 +00:00
Mateusz Guzik
25e578de55 vfs: use vrefact in getcwd and fchdir 2016-12-12 19:16:35 +00:00
Edward Tomasz Napierala
e3d4c4dcde Undo r309891. Konstantin is right in that this condition normally
cannot happen - the um_dev field is assigned at mount and never written
to afterwards.
2016-12-12 19:11:04 +00:00
Mateusz Guzik
5afb134c32 vfs: add vrefact, to be used when the vnode has to be already active
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by:	kib (previous version)
2016-12-12 15:37:11 +00:00
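The difference between a blind increment and an inc-not-zero loop, as mentioned in 5afb134c32, can be sketched in userland with C11 atomics. This is only an illustration under assumptions: struct obj, ref_if_active() and ref_active() are hypothetical names, and the kernel uses its own atomic(9) primitives and vnode counters rather than anything shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object; not the real struct vnode. */
struct obj {
    atomic_uint refs;
};

/* Take a reference only if the object is still active (count > 0). */
static bool
ref_if_active(struct obj *o)
{
    unsigned int old = atomic_load(&o->refs);

    while (old != 0) {
        /* On failure, old is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
            return (true);
    }
    return (false);
}

/* Take a reference when the caller guarantees the count is already > 0. */
static void
ref_active(struct obj *o)
{
    unsigned int old = atomic_fetch_add(&o->refs, 1);

    assert(old > 0);    /* the caller's guarantee */
    (void)old;
}
```

The second form is a single unconditional read-modify-write, so it never retries under contention; it is only correct where the caller already holds something that keeps the count above zero.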
Edward Tomasz Napierala
223cb0e434 Avoid dereferencing NULL pointers in devtoname(). I've seen it panic,
called from ufs_print() in DDB.

MFC after:	1 month
2016-12-12 15:22:21 +00:00
Konstantin Belousov
778aa66a68 Enable lookup_cap_dotdot and lookup_cap_dotdot_nonlocal.
Requested and reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8746
2016-12-12 11:12:04 +00:00
Konstantin Belousov
545d312293 When a zombie gets reparented due to the parent exit, send SIGCHLD to
the reaper.

The traditional reaper init(8) is aware of zombies silently reparented
to it after the parents exit, it loops around waitpid(2) to collect
them.  For other reapers, the silent reparenting is surprising and
collecting zombies requires a thread blocking in waitpid(2) just for
that purpose.  It seems that sending second SIGCHLD is a better
workaround than forcing all reapers to obey the setup.

Reported by:	 Michael Zuo <muh.muhten@gmail.com>, jilles
PR:	213928
Reviewed by:	jilles (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-12 11:11:50 +00:00
Alan Cox
2d612d2dd2 When tmpfs and POSIX shm pagein a page for the sole purpose of performing
truncation, immediately queue the page for asynchronous laundering rather
than making the page pass through inactive queue first.

Reviewed by:	kib, markj
2016-12-11 19:24:41 +00:00
Konrad Witaszczyk
480f31c214 Add support for encrypted kernel crash dumps.
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.

A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.

dumpon(8) generates a one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable.  Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.

When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore

A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.

Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should also be noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.

savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.

decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.

A description of how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.

EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.

Designed by:	def, pjd
Reviewed by:	cem, oshogbo, pjd
Partial review:	delphij, emaste, jhb, kib
Approved by:	pjd (mentor)
Differential Revision:	https://reviews.freebsd.org/D4712
2016-12-10 16:20:39 +00:00
Mark Johnston
02315a6759 Use a consistent snapshot of the lock state in owner_mtx().
MFC after:	2 weeks
2016-12-10 02:59:34 +00:00
Mark Johnston
c365a2934e Return a non-NULL owner only if the lock is exclusively held in owner_sx().
Fix some whitespace bugs while here.

MFC after:	2 weeks
2016-12-10 02:56:44 +00:00
Gleb Smirnoff
5040da77c1 Use acquire write to cr_lock to complement with release write at end
of locked region.

Submitted by:	kib
2016-12-09 19:07:31 +00:00
Gleb Smirnoff
169170209c Provide counter_ratecheck(), an MP-friendly substitute for ppsratecheck().
When a rate-limited event happens at a very high rate, ppsratecheck() is not
only racy, but also becomes a performance bottleneck.

Together with:	rrs, jtl
2016-12-09 17:58:34 +00:00
Robert Watson
52b42f6287 Regenerate system-call definitions following r309677 correcting a whitespace
glitch in syscalls.master.
2016-12-07 16:12:27 +00:00
Robert Watson
82d8d2b8bc Replace spaces with tabs in definition of SCTP system calls, for consistency
with the remainder of the syscalls.master file.  This problem does not occur
in the freebsd32 version of the same system calls.
2016-12-07 16:11:55 +00:00
Eric van Gyzen
3d32d4a7c9 Export the whole thread name in kinfo_proc
kinfo_proc::ki_tdname is three characters shorter than
thread::td_name.  Add a ki_moretdname field for these three
extra characters.  Add the new field to kinfo_proc32, as well.
Update all in-tree consumers to read the new field and assemble
the full name, except for lldb's HostThreadFreeBSD.cpp, which
I will handle separately.  Bump __FreeBSD_version.

Reviewed by:	kib
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D8722
2016-12-07 15:04:22 +00:00
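How a consumer might assemble the full thread name from the two fields described in 3d32d4a7c9 is sketched below. The struct is a reduced, hypothetical stand-in for kinfo_proc, and the SK_ constants are assumptions chosen so that the second field holds the three extra characters; the sketch does not use the real headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sizes are assumptions for this sketch, chosen so the split is 16 + 3. */
#define SK_TDNAMLEN     16
#define SK_MAXCOMLEN    19

/* Hypothetical reduced record; the real kinfo_proc has many more fields. */
struct sk_kinfo {
    char ki_tdname[SK_TDNAMLEN + 1];
    char ki_moretdname[SK_MAXCOMLEN - SK_TDNAMLEN + 1];
};

/* Producer side: split a full thread name across the two fields. */
static void
sk_export_name(struct sk_kinfo *ki, const char *td_name)
{
    size_t len = strlen(td_name);

    assert(len <= SK_MAXCOMLEN);
    memset(ki, 0, sizeof(*ki));     /* both fields start NUL-terminated */
    if (len <= SK_TDNAMLEN) {
        memcpy(ki->ki_tdname, td_name, len);
    } else {
        memcpy(ki->ki_tdname, td_name, SK_TDNAMLEN);
        memcpy(ki->ki_moretdname, td_name + SK_TDNAMLEN,
            len - SK_TDNAMLEN);
    }
}

/* Consumer side: reassemble the full name from both fields. */
static void
sk_full_name(char *out, size_t outsz, const struct sk_kinfo *ki)
{
    snprintf(out, outsz, "%s%s", ki->ki_tdname, ki->ki_moretdname);
}
```

Keeping the original field at its old size and appending a second one lets consumers that only know ki_tdname keep working with the shortened name.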
Konstantin Belousov
435da98564 Restructure the code to handle reporting of non-exited processes from
wait(2).
- Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED
  options were passed [1].
- Extract the code to report an alive process into a new helper
  report_alive_proc() and use it for trapped, stopped and continued
  children.

Note that the process spinlock is required around the WTRAPPED and
WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are
set before other threads are stopped at the suspension point, and that
threads increment p_suspcount while owning only the process spinlock,
the process lock is dropped by them.  If the spinlock is not taken for
tests, the syscall thread might miss both the p_suspcount increment and
the wakeup in thread_suspend_switch().

Based on the submission by:	mjg [1]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-04 20:44:58 +00:00
Eric van Gyzen
ff07dd913e thr_set_name(): silently truncate the given name as needed
Instead of failing with ENAMETOOLONG, which is swallowed by
pthread_set_name_np() anyway, truncate the given name to MAXCOMLEN+1
bytes.  This is more likely what the user wants, and saves the
caller from truncating it before the call (which was the only
recourse).

Polish pthread_set_name_np(3) and add a .Xr to thr_set_name(2)
so the user might find the documentation for this behavior.

Reviewed by:	jilles
MFC after:	3 days
Sponsored by:	Dell EMC
2016-12-03 01:14:21 +00:00
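The silent truncation described in ff07dd913e can be sketched in userland as below. This is not the thr_set_name() implementation: sk_set_name() is a hypothetical helper, SK_MAXCOMLEN is an assumed value, and the sketch assumes the MAXCOMLEN+1 byte limit includes the terminating NUL.

```c
#include <assert.h>
#include <string.h>

#define SK_MAXCOMLEN    19  /* assumption for this sketch */

/*
 * Copy name into dst, silently truncating it so that the result,
 * including its terminating NUL, fits in SK_MAXCOMLEN + 1 bytes.
 */
static void
sk_set_name(char dst[SK_MAXCOMLEN + 1], const char *name)
{
    size_t len;

    assert(dst != NULL && name != NULL);
    len = strlen(name);
    if (len > SK_MAXCOMLEN)
        len = SK_MAXCOMLEN;     /* truncate instead of failing */
    memcpy(dst, name, len);
    dst[len] = '\0';
}
```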
Mateusz Guzik
a2d3554542 vfs: provide fake locking primitives for the crossmp vnode
Since the vnode is only expected to be shared locked, we can save a
little overhead by only pretending we are locking in the first place.

Reviewed by:	kib
Tested by:	pho
2016-12-02 18:03:15 +00:00
Mateusz Guzik
a4ce25b5b0 vfs: fix a whitespace nit in r309307 2016-11-30 02:17:03 +00:00
Mateusz Guzik
1babea0341 vfs: avoid VOP_ISLOCKED in the common case in lookup 2016-11-30 02:14:53 +00:00
Mark Johnston
64910ddbff Launder VPO_NOSYNC pages upon vnode deactivation.
As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:

1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
   reclaimed, and this work may be unfairly deferred to the vnlru process
   or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
   the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
   the laundry thread needs to launder pages from an unreferenced such
   vnode, it will reactivate and deactivate the vnode with each laundering,
   potentially resulting in a large number of expensive scans.

Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.

Reviewed by:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8641
2016-11-26 21:00:27 +00:00
John Baldwin
9f3aabb9eb Permit timed sleeps for threads other than thread0 before timers are working.
The callout subsystem already handles early callouts and schedules
the first clock interrupt appropriately based on the currently pending
callouts.  The one nit to fix was that callouts scheduled via C_HARDCLOCK
during early boot could fire too early once timers were enabled as the
per-CPU base time is always zero until timers are initialized.  The change
in callout_when() handles this case by using the current uptime as the
base time of the callout during bootup if the per-CPU base time is zero.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-25 18:02:43 +00:00
Mateusz Guzik
746b6e8176 wait: avoid relocking the child if proc_to_reap returns 1
proc_to_reap would always unlock. However, if it returned 1, kern_wait6
would immediately lock it again. Save the dance.

Reviewed by:	kib
2016-11-24 18:21:48 +00:00
Mateusz Guzik
8b0e0c91e0 cache: ensure that the number of bucket locks does not exceed hash size
The size can be changed by side effect of modifying kern.maxvnodes.

Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.

Force the number of buckets to be not smaller.

Note this should not matter for real world cases.

Reported and tested by:	pho
2016-11-23 19:50:12 +00:00
Mark Johnston
99e6e1930c Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D8589
2016-11-23 17:53:07 +00:00
Ruslan Bukin
dd7d4f199e Revert r306186 ("Adjust the sopt_val pointer on bigendian systems").
This logic doesn't work with bigger sopt_valsize (e.g. when ipfw
passes a 2048-byte rule).

Reported by:	adrian
Sponsored by:	DARPA, AFRL
2016-11-22 18:31:43 +00:00
Konstantin Belousov
eb962424ba Restore vnode pager statistic for buffer pagers.
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8585
2016-11-22 10:06:39 +00:00
John Baldwin
5d8cce1764 Initialize 'ticks' earlier in boot after 'hz' is set.
This avoids the time-warp after kthreads have started running and the
required fixup to td_slptick and td_blktick in the EARLY_AP_STARTUP
case.  Now, 'ticks' is initialized before any kthreads are created or
any context switches are performed.

Tested by:	gavin
MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-22 01:02:59 +00:00
Robert Watson
1279fdafce Audit 'fd' and 'cmd' arguments to fcntl(2), and when generating BSM,
always audit the file-descriptor number and vnode information for all
fcntl(2) commands, not just locking-related ones.  This was likely an
oversight in the original adaptation of this code from XNU.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-11-22 00:41:24 +00:00
Gleb Smirnoff
00b5ffde8e Add flag SF_USER_READAHEAD to sendfile(2). When specified, the syscall won't
do any speculations about readahead, and use exactly the amount of readahead
specified by user.  E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee
that no readahead at all will be performed.
2016-11-17 21:36:18 +00:00
Gleb Smirnoff
5dba303d01 Use bogus_page to properly reduce number of I/Os in sendfile(2). The new
sendfile_swapin() loop works this way:

- Find first invalid page in the request.
- Do vm_pager_has_page() and get count of pages, that can be taken in
  single I/O.
- Trim valid pages from the end of the request.
- Cycle through the request and substitute bogus_page for all valid
  pages that are in the middle of the request.
- After the I/O is launched (the pager copies the array of pages into
  buf(9)), it is important to restore the proper page pointers with the
  help of vm_page_lookup().

Count bogus pages used and report them in sendfile stats.
2016-11-17 21:02:55 +00:00
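The planning steps listed in 5dba303d01 can be sketched over a plain array of per-page validity flags. This is an illustration under assumptions, not the sendfile_swapin() code: sk_plan_io() and struct sk_io are hypothetical, maxrun stands in for the page count a pager would report for a single I/O, and the later restoration of page pointers is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_io {
    size_t first;   /* index of the first page in the I/O */
    size_t count;   /* pages covered by the I/O; 0 means nothing to read */
    size_t bogus;   /* valid pages inside the run replaced by a bogus page */
};

/*
 * Plan one I/O over pages [0, n).  valid[i] says whether page i already
 * holds valid data; bogus[i] is set for valid pages that fall inside the
 * planned run and would be substituted by the bogus page.
 */
static struct sk_io
sk_plan_io(const bool *valid, bool *bogus, size_t n, size_t maxrun)
{
    struct sk_io io = { 0, 0, 0 };
    size_t i = 0;

    assert(maxrun > 0);
    /* Find the first invalid page in the request. */
    while (i < n && valid[i])
        i++;
    if (i == n)
        return (io);
    io.first = i;
    /* Limit the run to what a single I/O can cover. */
    io.count = (n - i < maxrun) ? n - i : maxrun;
    /* Trim valid pages from the end; page io.first is invalid, so this stops. */
    while (valid[io.first + io.count - 1])
        io.count--;
    /* Valid pages left in the middle are covered by the bogus page. */
    for (size_t j = io.first + 1; j < io.first + io.count; j++) {
        if (valid[j]) {
            bogus[j] = true;
            io.bogus++;
        }
    }
    return (io);
}
```

For the validity pattern valid, invalid, valid, valid, invalid, valid, valid and a large maxrun, the sketch plans one I/O covering pages 1 through 4 with two bogus substitutions, instead of two separate single-page reads.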
Ruslan Bukin
6e18247a3d Fix build when no INET and INET6 in kernel config.
Submitted by:	kan
Sponsored by:	DARPA, AFRL
2016-11-17 16:13:30 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Mateusz Guzik
6ce45c6ac3 cache: plug a write-only variable in cache_negative_zap_one 2016-11-15 03:43:10 +00:00
Mateusz Guzik
317cac6d5a cache: fix a race between entry removal and demotion
The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand, entry removal may set NCF_DVDROP without
the aforementioned locks held prior to detaching the entry from the respective
neglist, which can lose the update made by the shrinker.

Reported and tested by:	truckman
2016-11-15 03:38:05 +00:00
Adrian Chadd
8ffa01a061 [mips] enable relbuf on mips for now to work around page aliasing in mips hardware.
Although the higher end MIPS hardware handles cache aliasing issues in
hardware, the older cores (r4k, etc) and some compile versions of the
newer cores (mips24k, mips34k, mips74k) don't have this feature.
This means we end up with some very unfortunate behaviour that was
made very obvious by some recent changes to the FFS pager by kib.

So, flip this off until we get our MIPS pmap/cache code upgraded to
handle aliased pages in software.

Discussed with: kib, bsdimp, juli
2016-11-15 01:41:45 +00:00
Adrian Chadd
0046bef85a [mips] make UMTX_CHAINS configurable at compile time.
The default (512) wastes quite a bit of space which doesn't really buy
us much on highly embedded systems which don't take a lot of locks in
parallel.

This makes it at least build time configurable so people can experiment.
2016-11-15 01:34:38 +00:00
Konstantin Belousov
ae44bb0146 Initialize reserved bytes in struct mq_attr and its 32compat
counterpart, to avoid kernel stack content leak in kmq_setattr(2)
syscall.  Also slightly simplify the checks around copyout()s.

Reported by:	Vlad Tsyrklevich <vlad902+spam@gmail.com>
PR:	214488
MFC after:	1 week
2016-11-14 13:20:10 +00:00
Konstantin Belousov
714b7df502 Provide simple mutual exclusion between mount point update and unmount.
Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call.  This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.

Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected.  Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.

In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-11-13 21:49:51 +00:00