Commit Graph

15463 Commits

Author SHA1 Message Date
Konstantin Belousov
3aeacc55a5 A followup to r315749, two more places where brand->interp_path was
accessed unconditionally.

Reported by:	se
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-30 04:21:02 +00:00
Robert Watson
b783025921 When handling msgsys(2), semsys(2), and shmsys(2) multiplex system calls,
map the 'which' argument into a suitable audit event identifier for the
specific operation requested.

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-03-29 23:31:35 +00:00
Robert Watson
d8ca0a2b70 Hook up new audit event identifiers for various non-Orange Book/CAPP
system calls supported by OpenBSM 1.2-alpha5.

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-03-29 22:33:56 +00:00
Bruce Evans
1370fa3380 Oops, my fix for bright colors broke bright black some more (in cases
that used to work via the bold hack).

Fix the table entry for bright black.  Fix spelling of plain black in
nearby table entries (use the macro for black everywhere everywhere).
Fix the currently-unused non-bright color table to not have bright
colors in entries 9-15.

Improve nearby comments.  Start converting to the xterm terminology
and default rendering of "bright" instead of "light" for bright
colors.

Syscons wasn't affected by the bug since I optimized it a little by
converting colors 0-15 directly.  This also fixes the layering of
the conversion for these colors.

Apply the same optimization to vt (actually the layer above it).  This
also moves the conversion 1 closer to the correct layer for colors
0-15.

The optimization of just avoiding 2 calls to a trivial function is worth
about 10% for simple output to the virtual buffer with occasional
rendering.  The optimization is so large because the 2 calls are done
on every character, so although there are too many other calls and
other instructions per character, there are only about 10 times as
many.  Old versions of syscons were about 10 times faster for simple
output, by using a fast path with about 12 instructions per character.
Rendering to even slow hardware takes relatively little time provided
it is rarely actually done.
2017-03-27 10:48:28 +00:00
Andriy Gapon
20c69e76b4 dtrace sched:::preempt should fire only when there is preemption
The probe fire on any thread switch before.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Panzura
2017-03-25 19:08:51 +00:00
Gleb Smirnoff
9e3c8bd3e2 Make sendfile(2) more robust against file change. This fixes a possible
crash when the file shrinks.  This also fixes sendfile(2) not sending more
data in a case when the file grows, and the request is open-ended or
specifies a size that is greater than old file size.

PR:		217789
Reviewed by:	gallatin
MFC after:	10 days
2017-03-24 16:01:19 +00:00
Ed Schouten
0fe9832013 Don't require the presence of the compat_3_brand.
The existing ELF image activator requires the brandinfo to provide such
a string unconditionally, even if the executable format in question
doesn't use this type of branding. Skip matching when it's a null
pointer.

Reviewed by:	kib
MFC after:	2 weeks
2017-03-23 14:09:45 +00:00
Andriy Gapon
afa0a46cfd move thread switch tracing from mi_switch to sched_switch
This is done so that the thread state changes during the switch
are not confused with the thread state changes reported when the thread
spins on a lock.

Here is an example, three consecutive entries for the same thread (from top to
bottom):

  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)"
  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1"
  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none

The above trace could leave an impression that the final state of
the thread was "running".
After this change the sleep state will be reported after the "spinning"
and "running" states reported for the sched lock.

Reviewed by:	jhb, markj
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9961
2017-03-23 08:57:04 +00:00
Konstantin Belousov
2274ab3d7b Update r315753 with the proper flag name.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:28:13 +00:00
Konstantin Belousov
1438fe3cf2 Add a flag BI_BRAND_ONLY_STATIC to specify that the brand only
matches static binaries.

Interpretation of the 'static' there is that the binary must not
specify an interpreter.  In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.

This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.

Revert r315701.

Discussed with and tested by:	ed (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:23:01 +00:00
Konstantin Belousov
7aab7a80e2 Adjust r314851 to not require every brand to specify interpreter path.
Reported and tested by:	ed
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:06:48 +00:00
Enji Cooper
8b69c3e79f Print out name of non-dynamic sysctl in sysctl_remove_oid_locked
This will provide a slightly better smoking gun than just stating
"can't remove non-dynamic nodes!" when calling sysctl_ctx_free(9)
and sysctl_remove_{name,oid}(9) with a non-dynamic (likely
static) sysctl.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-03-22 05:27:20 +00:00
Conrad Meyer
5abb3b74d3 kern_fail: Allow sleeping for more than 2147483/hz seconds
Because of integer types, the timeout calculation result was limited to
INT_MAX / (1000 * hz) seconds.  For systems with hz=10000, this is only 215
seconds.  Perform the calculation with 64-bit math to allow sleeping for the
full INT_MAX / hz interval (215000 seconds on such hz=10000 systems).

Submitted by:	Scott Ferris <sferris at isilon.com>
Sponsored by:	Dell EMC Isilon
2017-03-21 22:41:37 +00:00
Ed Maste
26af611582 tighten buffer bounds in imgact_binmisc_populate_interp
We must ensure there's space for the terminating null in the temporary
buffer in imgact_binmisc_populate_interp().

Note that there's no buffer overflow here because xbe->xbe_interpreter's
length and null termination is checked in imgact_binmisc_add_entry()
before imgact_binmisc_populate_interp() is called. However, the latter
should correctly enforce its own bounds.

Reviewed by:	sbruno
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10042
2017-03-21 18:02:14 +00:00
Alan Cox
2a016de1a5 Use IDX_TO_OFF(), not ptoa(), when converting the difference between two
vm_pindex_t's into a vm_ooffset_t.

The length given to shm_dotruncate() must never be negative.  Assert this.

Tidy up a comment.

Reviewed by:	kib
MFC after:	1 week
2017-03-20 05:15:55 +00:00
Alan Cox
ac46d38655 Style fixes. In particular, the variable "bogus" is used like a Boolean.
Define it as such.

Reviewed by:	kib
MFC after:	1 week
2017-03-19 23:06:11 +00:00
Eric van Gyzen
26f86ab732 Regenerate syscall files for r315526
Sponsored by:	Dell EMC
2017-03-19 00:54:24 +00:00
Eric van Gyzen
3f8455b090 Add clock_nanosleep()
Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.

Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)

Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.

Reviewed by:	kib, ngie, jilles
MFC after:	3 weeks
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D10020
2017-03-19 00:51:12 +00:00
Alan Cox
c547cbb49c Avoid unnecessary calls to vm_map_protect() in elf_load_section().
Typically, when elf_load_section() unconditionally passed VM_PROT_ALL to
elf_map_insert(), it was needlessly enabling execute access on the
mapping, and it would later have to call vm_map_protect() to correct the
mapping's access rights.  Now, instead, elf_load_section() always passes
its parameter "prot" to elf_map_insert().  So, elf_load_section() must
only call vm_map_protect() if it needs to remove the write access that
was temporarily granted to perform a copyout().

Reviewed by:	kib
MFC after:	1 week
2017-03-18 23:37:00 +00:00
Eric van Gyzen
4cf66812ea nanosleep: plug a kernel memory disclosure
nanosleep() updates rmtp on EINVAL.  In that case, kern_nanosleep()
has not updated rmt, so sys_nanosleep() updates the user-space rmtp
by copying garbage from its stack frame.  This is not only a kernel
memory disclosure, it's also not POSIX-compliant.  Fix it to update
rmtp only on EINTR.

Reviewed by:	jilles (via D10020), dchagin
MFC after:	3 days
Security:	possibly
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D10044
2017-03-18 20:16:23 +00:00
Bruce Evans
4eb235fb4f Fix bright colors for syscons, and make them work for the first time
for vt.  Restore syscons' rendering of background (bg) brightness as
foreground (fg) blinking and vice versa, and add rendering of blinking
as background brightness to vt.

Bright/saturated is conflated with light/white in the implementation
and in this description.

Bright colors were broken in all cases, but appeared to work in the
only case shown by "vidcontrol show".  A boldness hack was applied
only in 1 layering-violation place (for some syscons sequences) where
it made some cases seem to work but was undone by clearing bold using
ANSI sequences, and more seriously was not undone when setting
ANSI/xterm dark colors so left them bright.  Move this hack to drivers.

The boldness hack is only for fg brightness.  Restore/add a similar hack
for bg brightness rendered as fg blinking and vice versa.  This works
even better for vt, since vt changes the default text mode to give the
more useful bg brightness instead of fg blinking.

The brightness bit in colors was unnecessarily removed by the boldness
hack.  In other cases, it was lost later by teken_256to8().  Use
teken_256to16() to not lose it.  teken_256to8() was intended to be
used for bg colors to allow finer or bg-specific control for the more
difficult reduction to 8; however, since 16 bg colors actually work
on VGA except in syscons text mode and the conversion isn't subtle
enough to significantly in that mode, teken_256to8() is not used now.

There are still bugs, especially in vidcontrol, if bright/blinking
background colors are set.

Restore XOR logic for bold/bright fg in syscons (don't change OR
logic for vt).  Remove broken ifdef on FG_UNDERLINE and its wrong
or missing bit and restore the correct hard-coded bit.  FG_UNDERLINE
is only for mono mode which is not really supported.

Restore XOR logic for blinking/bright bg in syscons (in vt, add
OR logic and render as bright bg).  Remove related broken ifdef
on BG_BLINKING and its missing bit and restore the correct
hard-coded bit.  The same bit means blinking or bright bg depending
on the mode, and we want to ignore the difference everywhere.

Simplify conversions of attributes in syscons.  Don't pretend to
support bold fonts.  Don't support unusual encodings of brightness.
It is as good as possible to map 16 VGA colors to 16 xterm-16
colors.  E.g., VGA brown -> xterm-16 Olive will be converted back
to VGA brown, so we don't need to convert to xterm-256 Brown.  Teken
cons25 compatibility code already does the same, and duplicates some
small tables.  This is mostly for the sc -> te direction.  The other
direction uses teken_256to16() which is too generic.
2017-03-18 11:13:54 +00:00
Konstantin Belousov
469ec1eb6a When clearing altsigstack settings on exec, do it to the right thread.
Diagnosed by:	smh
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-17 13:37:37 +00:00
Eric Badger
b4d3325975 Don't clear p_ptevents on normal SIGKILL delivery
The ptrace() user has the option of discarding the signal. In such a
case, p_ptevents should not be modified. If the ptrace() user decides to
send a SIGKILL, ptevents will be cleared in ptracestop(). procfs events
do not have the capability to discard the signal, so continue to clear
the mask in that case.

Reviewed by:	jhb (initial revision)
MFC after:	1 week
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9939
2017-03-16 13:03:31 +00:00
John Baldwin
03f7f17878 Use UMA_ALIGN_PTR instead of sizeof(void *) for zone alignment.
uma_zcreate()'s alignment argument is supposed to be sizeof(foo) - 1,
and uma.h provides a set of helper macros for common types.  Passing
sizeof(void *) results in all of the members being misaligned triggering
unaligned access faults on certain architectures (notably MIPS).

Reported by:	brooks
Obtained from:	CheriBSD
MFC after:	3 days
Sponsored by:	DARPA / AFRL
2017-03-15 18:23:32 +00:00
Alan Cox
52d1addaa1 Relax the locking requirements for vm_object_page_noreuse(). While
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.

Reviewed by:	kib, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D10011
2017-03-15 17:43:45 +00:00
Mark Johnston
7d88be4c03 When draining a callout, don't clear CALLOUT_ACTIVE while it is running.
The callout may reschedule itself and execute again before callout_drain()
returns, but we should not clear CALLOUT_ACTIVE until the callout is
stopped.

Tested by:	pho
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2017-03-15 00:29:27 +00:00
Eric van Gyzen
8addc72b3e Add missing pieces of r315280
I moved this branch from github to a private server, and pulled from the
wrong one when committing r315280, so I failed to include two recent commits.
Thankfully, they were only cosmetic and were included in the review.
Specifically:

Add documentation, polish comments, and improve style(9).

Tested by:	pho (r315280)
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9791
2017-03-14 22:02:02 +00:00
Konstantin Belousov
d1780e8dac Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
Eric van Gyzen
9dbdf2a169 When the RTC is adjusted, reevaluate absolute sleep times based on the RTC
POSIX 2008 says this about clock_settime(2):

    If the value of the CLOCK_REALTIME clock is set via clock_settime(),
    the new value of the clock shall be used to determine the time
    of expiration for absolute time services based upon the
    CLOCK_REALTIME clock.  This applies to the time at which armed
    absolute timers expire.  If the absolute time requested at the
    invocation of such a time service is before the new value of
    the clock, the time service shall expire immediately as if the
    clock had reached the requested time normally.

    Setting the value of the CLOCK_REALTIME clock via clock_settime()
    shall have no effect on threads that are blocked waiting for
    a relative time service based upon this clock, including the
    nanosleep() function; nor on the expiration of relative timers
    based upon this clock.  Consequently, these time services shall
    expire when the requested relative interval elapses, independently
    of the new or old value of the clock.

When the real-time clock is adjusted, such as by clock_settime(3),
wake any threads sleeping until an absolute real-clock time.
Such a sleep is indicated by a non-zero td_rtcgen.  The sleep functions
will set that field to zero and return zero to tell the caller
to reevaluate its sleep duration based on the new value of the clock.

At present, this affects the following functions:

    pthread_cond_timedwait(3)
    pthread_mutex_timedlock(3)
    pthread_rwlock_timedrdlock(3)
    pthread_rwlock_timedwrlock(3)
    sem_timedwait(3)
    sem_clockwait_np(3)

I'm working on adding clock_nanosleep(2), which will also be affected.

Reported by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
Reviewed by:	jhb, kib
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9791
2017-03-14 19:06:44 +00:00
Konstantin Belousov
01feb4c3d4 Use designated initializers for kevent_copyops.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-14 09:25:01 +00:00
Konstantin Belousov
67d0b0ea60 Hide kev_iovlen() definition under #ifdef KTRACE, fixing build of
kernel configs without KTRACE.

Reported by:	rpokala
Sponsored by:	The FreeBSD Foundation
MFC after:	4 days
2017-03-14 08:45:52 +00:00
Ian Lepore
e3f87f6c70 Change 'Hz' back to 'HZ'... it's referring to the kernel config option
named HZ, not being used as an abbreviation of the unit of measure.
2017-03-12 18:07:03 +00:00
Ian Lepore
8a3966405e Correct the abbreviations for microseconds (us, not ms), and for Hz (not HZ). 2017-03-12 17:43:45 +00:00
Konstantin Belousov
9a2dde8013 Avoid reusing p_ksi while it is on queue.
When sending SIGCHLD informing reaper that a zombie was reparented to
it, we might race with the situation where the previous parent still
not finished delivering SIGCHLD and having its p_ksi structure on the
signal queue.  While on queue, the ksi should not be used for another
send.

Fix this by copying p_ksi into newly allocated ksi, which is directly
put onto reaper sigqueue.  The later ensures that siginfo for reaper
SIGCHLD is always present, similar to guarantees for siginfo of child.

Reported by:	bdrewery
Discussed with:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:58:51 +00:00
Konstantin Belousov
9bcf2f2da0 Accept linkers representation for ELF segments with zero on-disk length.
For such segments, GNU bfd linker writes knowingly incorrect value
into the the file offset field of the program header entry, with the
motivation that file should not be mapped for creation of this segment
at all.

Relax checks for the ELF structure validity when on-disk segment
length is zero, and explicitely set mapping length to zero for such
segments to avoid validating rounding arithmetic.

PR:	217610
Reported by:	Robert Clausecker <fuz@fuz.su>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:51:13 +00:00
Konstantin Belousov
973d67c407 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:49:42 +00:00
Konstantin Belousov
1e4296c919 Ktracing kevent(2) calls with unusual arguments might leads to an
overly large allocation requests.

When ktrace-ing io, sys_kevent() allocates memory to copy the
requested changes and reported events.  Allocations are sized by the
incoming syscall lengths arguments, which are user-controlled, and
might cause overflow in calculations or too large allocations.

Since io trace chunks are limited by ktr_geniosize, there is no sense
it even trying to satisfy unbounded allocations.  Export ktr_geniosize
and clamp the buffers sizes in advance.

PR:	217435
Reported by:	Tim Newsham <tim.newsham@nccgroup.trust>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:48:24 +00:00
Alan Cox
e383e820d3 Simplify the control flow and tidy up a comment in map_insert.
In collaboration with:	kib
MFC after:	1 week
2017-03-11 18:57:13 +00:00
Andriy Gapon
28ef18b8c1 trace thread running state when a thread is run for the first time
This applies to both KTR_SCHED and DTrace sched:::on-cpu tracing.

MFC after:	10 days
2017-03-11 15:57:36 +00:00
Andriy Gapon
6c9271a918 actually implement proc:::lwp-exit probe
MFC after:	4 days
2017-03-11 15:47:27 +00:00
Mahdi Mokhtari
32a1fb0d3d Fix NULL pointer dereference and panic with shm file pread/pwrite.
PR:		217429
Reported by:	Tim Newsham <tim.newsham@nccgroup.trust>
Reviewed by:	kib
Approved by:	dchagin
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9844
2017-03-10 10:09:44 +00:00
Gleb Smirnoff
d0147e10ca In linker_load_file() print name of a file that failed to load.
Discussed with:	kib
2017-03-09 00:56:07 +00:00
Gleb Smirnoff
a344babf78 Reduce stack usage in link_elf_load_file(), allocating struct nameidata.
This function may be called recursively, when a module pulls its dependencies.
Under certain circumstances, e.g. quad chain of dependencies and presence
of dtrace we may run out of stack.
2017-03-09 00:45:15 +00:00
Gleb Smirnoff
14984031b7 m_mbuftouio() doesn't modify the mbuf. 2017-03-07 19:00:50 +00:00
Eric Badger
b38bd91f4f don't stop in issignal() if P_SINGLE_EXIT is set
Suppose a traced process is stopped in ptracestop() due to receipt of a
SIGSTOP signal, and is awaiting orders from the tracing process on how
to handle the signal. Before sending any such orders, the tracing
process exits. This should kill the traced process. But suppose a second
thread handles the SIGKILL and proceeds to exit1(), calling
thread_single(). The first thread will now awaken and will have a chance
to check once more if it should go to sleep due to the SIGSTOP.  It must
not sleep after P_SINGLE_EXIT has been set; this would prevent the
SIGKILL from taking effect, leaving a stopped orphan behind after the
tracing process dies.

Also add new tests for this condition.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9890
2017-03-07 13:41:01 +00:00
Konstantin Belousov
15a9aedfa1 When selecting brand based on old Elf branding, prefer the brand which
interpreter exactly matches the one requested by the activated image.

This change applies r295277, which did the same for note branding, to
the old brand selection, with the same reasoning of fixing compat32
interpreter substitution.

PR:	211837
Reported by:	kenji@kens.fm
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-07 13:38:25 +00:00
Konstantin Belousov
3d560b4be2 Require whole brand string matching for old Elf branding.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-07 13:37:35 +00:00
Konstantin Belousov
0bbee4cd3f Consistently use vm_ooffset_t type for the vm object offset in
elf_load_section.

The values passed currently as vm_offset_t are phdr.p_offset, which
have the native Elf word size.  Since elf_load_section interprets them
as the file offset, use vm object offset type.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-07 13:36:43 +00:00
Hiren Panchasara
f41b2de716 Fix the KASSERT check from r314813.
len being 0 is valid.

Submitted by:	ngie
Reported by:	ngie (via jenkins test run)
Sponsored by:	Limelight Networks
2017-03-07 06:46:38 +00:00
Hiren Panchasara
b5b023b91e We've found a recurring problem where some userland process would be
stuck spinning at 100% cpu around sbcut_internal(). Inside
sbflush_internal(), sb_ccc reached to about 4GB and before passing it
to sbcut_internal(), we type-cast it from uint to int making it -ve.

The root cause of sockbuf growing this large is unknown. Correct fix
is also not clear but based on mailing list discussions, adding
KASSERTs to panic instead of looping endlessly.

Reviewed by:		glebius
Sponsored by:		Limelight Networks
2017-03-07 00:20:01 +00:00
Gleb Smirnoff
6cf0c1db55 Fix compilation of r314784 on 32 bit. 2017-03-06 22:32:56 +00:00
Gleb Smirnoff
f2498877c9 In panic() print current timestamp, which matches timestamp in the dump
header.  This will help to correlate console server logs with dump files,
no matter how precise is clock on a console server appliance, and how
buggy the appliance is.
2017-03-06 19:14:08 +00:00
Konstantin Belousov
aaadc41f6c Instead of direct use of vm_map_insert(), call vm_map_fixed(MAP_CHECK_EXCL).
This KPI explicitely indicates the intent of creating the mapping at
the fixed address, and incorporates the map locking into the callee.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-06 14:09:54 +00:00
Alan Cox
28e8da6517 Style and punctuation fixes.
Reviewed by:	kib
MFC after:	3 days
2017-03-05 23:59:04 +00:00
Emmanuel Vadot
c1b014c51c Export a sysctl dev.<clkdom>.<unit>.clocks for each clock domain containing
all the clocks that they provide.
Each clocks are exported under the node 'clock.<clkname>' and have the following
children nodes :
- frequency
- parent (The selected parent, if any)
- parents (The list of parents, if any)
- childrens (The list of childrens, if any)
- enable_cnt (The enabled counter)

This give us the possibility to examine clocks at runtime and make graph of
the clock flow.

Reviewed by:	mmel
MFC after:	2 month
Differential Revision:	https://reviews.freebsd.org/D9833
2017-03-05 07:13:29 +00:00
Eric van Gyzen
8a8bea603c Fix grammar in some comments in subr_sleepqueue.c
While I'm here, remove trailing whitespace.

Reviewed by:	kib, mostly, as part of a larger review
MFC after:	3 days
2017-03-03 21:03:28 +00:00
Mark Johnston
7813302434 Fix a ticks comparison in sched_pctcpu_update().
We may fail to reset the %CPU tracking window if a thread does not run
for over half of the ticks rollover period, resulting in a bogus %CPU
value for the thread until ticks fully rolls over. Handle this by comparing
the unsigned difference ticks - ts_ltick with SCHED_TICK_TARG instead.

Reviewed by:	cem, jeff
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-03-03 20:57:40 +00:00
Ed Maste
e052a8b932 kern_sig.c: ANSIfy and remove archaic register keyword
Sponsored by:	The FreeBSD Foundation
2017-03-02 22:17:53 +00:00
Konstantin Belousov
fe0a8a3994 Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-03-02 17:35:13 +00:00
Hans Petter Selasky
403f4a31ab Implement taskqueue_poll_is_busy() for use by the LinuxKPI.
Refer to comment above function for a detailed description.

Discussed with:		kib @
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2017-03-02 12:20:23 +00:00
Sean Bruno
d945ed6472 Make gtaskqueue compatible with drm-next such that they can be used with the
linuxkpi tasklets.

Submitted by:	mmacy@nextbsd.org
Reported by:	hps
2017-03-01 18:37:35 +00:00
Konstantin Belousov
55b985b43b Use vm_map_insert() instead of vm_map_find() in elf_map_insert().
Elf_map_insert() needs to create mapping at the known fixed address.
Usage of vm_map_find() assumes, on the other hand, that any suitable
address space range above or equal the specified hint, is acceptable.
Due to operating on the fresh or cleared address space, vm_map_find()
usually creates mapping starting exactly at hint.

Switch to vm_map_insert() use to clearly request fixed mapping from
the VM.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-01 10:28:15 +00:00
Konstantin Belousov
e3d8f8fed4 When deallocating the vm object in elf_map_insert() due to
vm_map_insert() failure, drop the vnode lock around the call to
vm_object_deallocate().

Since the deallocated object is the vm object of the vnode, we might
get the vnode lock recursion there.  In fact, it is almost impossible
to make vm_map_insert() failing there on stock kernel.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-01 10:22:07 +00:00
Mateusz Guzik
a21018063b locks: ensure proper barriers are used with atomic ops when necessary
Unclear how, but the locking routine for mutexes was using the *release*
barrier instead of acquire. This must have been either a copy-pasto or bad
completion.

Going through other uses of atomics shows no barriers in:
- upgrade routines (addressed in this patch)
- sections protected with turnstile locks - this should be fine as necessary
  barriers are in the worst case provided by turnstile unlock

I would like to thank Mark Millard and andreast@ for reporting the problem and
testing previous patches before the issue got identified.

ps.
  .-'---`-.
,'          `.
|             \
|              \
\           _  \
,\  _    ,'-,/-)\
( * \ \,' ,' ,'-)
 `._,)     -',-')
   \/         ''/
    )        / /
   /       ,'-'

Hardware provided by: IBM LTC
2017-03-01 05:06:21 +00:00
Scott Long
38e41e66e5 Provide a comment on why stdio.h needs to be included. 2017-02-28 21:27:51 +00:00
Jung-uk Kim
c4e929946c Include stdio.h to fix libsbuf build.
Reviewed by:	scottl
2017-02-28 21:18:45 +00:00
Scott Long
388f3ce6c3 Implement sbuf_prf(), which takes an sbuf and outputs it
to stdout in the non-kernel case and to the console+log
in the kernel case.  For the kernel case it hooks the
putbuf() machinery underneath printf(9) so that the buffer
is written completely atomically and without a copy into
another temporary buffer.  This is useful for fixing
compound console/log messages that become broken and
interleaved when multiple threads are competing for the
console.

Reviewed by:	ken, imp
Sponsored by:	Netflix
2017-02-28 18:25:06 +00:00
Gleb Smirnoff
efe3b0de14 Remove SVR4 (System V Release 4) binary compatibility support.
UNIX System V Release 4 is operating system released in 1988. It ceased
to exist in early 2000-s.
2017-02-28 05:14:42 +00:00
Konstantin Belousov
aca4bb9112 Do not leak mount references for dying threads.
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit.  In this case, td_su reference is not dropped and mount
point cannot be freed.

Handle the situation by clearing td_su also in the thread destructor
and in exit1().  softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-25 10:38:18 +00:00
Konstantin Belousov
8cd5962571 Remove cpu_deepest_sleep variable.
On Core2 and older Intel CPUs, where TSC stops in C2, system does not
allow C2 entrance if timecounter hardware is TSC.  This is done by
tc_windup() which tests for TC_FLAGS_C2STOP flag of the new
timecounter and increases cpu_disable_c2_sleep if flag is set.  Right
now init_TSC_tc() only sets the flag if cpu_deepest_sleep >= 2, but
TSC is initialized too early for this variable to be set by
acpi_cpu.c.

There is no reason to require that ACPI reported C2 and deeper states
to set TC_FLAGS_C2STOP, so remove cpu_deepest_sleep test from
init_TSC_tc() condition.  And since this is the only use of the
variable, remove it at all.

Reported and submitted by:	Jia-Shiun Li <jiashiun@gmail.com>
Suggested by:	jhb
MFC after:	2 weeks
2017-02-24 16:11:55 +00:00
Warner Losh
bbf6e5144e Cast values to (int) before comparing them to the range of the
enum. This ensures they are in range w/o the warnings.
2017-02-24 01:39:12 +00:00
Warner Losh
df1c30f6bd KDTRACE_HOOKS isn't guaranteed to be defined. Change to check to see
if it is defined or not rather than if it is non-zero.

Sponsored by: Netflix, Inc
2017-02-24 01:39:08 +00:00
Mateusz Guzik
dfaa7859d6 mtx: microoptimize lockstat handling in spin mutexes and thread lock
While here make the code compilablle on kernels with LOCK_PROFILING but without
KDTRACE_HOOKS.
2017-02-23 22:46:01 +00:00
Eric van Gyzen
b215ceaaec Add sem_clockwait_np()
This function allows the caller to specify the reference clock
and choose between absolute and relative mode.  In relative mode,
the remaining time can be returned.

The API is similar to clock_nanosleep(3).  Thanks to Ed Schouten
for that suggestion.

While I'm here, reduce the sleep time in the semaphore "child"
test to greatly reduce its runtime.  Also add a reasonable timeout.

Reviewed by:	ed (userland)
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9656
2017-02-23 19:36:38 +00:00
Jonathan T. Looney
c9cde8251c Fix a panic during boot caused by inadequate locking of some vt(4) driver
data structures.

vt_change_font() calls vtbuf_grow() to change some vt driver data
structures. It uses TF_MUTE to prevent the console from trying to use those
data structures while it changes them.

During the early stage of the boot process, the vt driver's tc_done routine
uses those data structures; however, it is currently called outside the
TF_MUTE check.

Move the tc_done routine inside the locked TF_MUTE check.

PR:		217282
Reviewed by:	ed, ray
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D9709
2017-02-23 01:18:47 +00:00
Warner Losh
6fec662c86 Make the code match the comments: If we have ANY buf's that failed
then return EAGAIN. The current code just returns that if the LAST buf
failed.

Reviewed by: kib@, trasz@
Differential Revision: https://reviews.freebsd.org/D9677
2017-02-21 18:56:06 +00:00
John Baldwin
150599be12 Consolidate statements to initialize files.
Previously, the first lines of various generated files from system call
tables were generated in two sections.  Some of the initialization was
done in BEGIN, and the rest was done when the first line was encountered.
The main reason for this split before r313564 was that most of the
initialization done in the second section depended on the $FreeBSD$ tag
extracted from the system call table.  Now that the $FreeBSD$ tag is no
longer used, consolidate all of the file initialization in the BEGIN
section.

This change was tested by confirming that the content of generated files
did not change.
2017-02-20 20:37:25 +00:00
Mateusz Guzik
13d2ef0f3a mtx: fix spin mutexes interaction with failed fcmpset
While doing so move recursion support down to the fallback routine.
2017-02-20 19:08:36 +00:00
Eric Badger
82a4538f31 Defer ptracestop() signals that cannot be delivered immediately
When a thread is stopped in ptracestop(), the ptrace(2) user may request
a signal be delivered upon resumption of the thread. Heretofore, those signals
were discarded unless ptracestop()'s caller was issignal(). Fix this by
modifying ptracestop() to queue up signals requested by the ptrace user that
will be delivered when possible. Take special care when the signal is SIGKILL
(usually generated from a PT_KILL request); no new stop events should be
triggered after a PT_KILL.

Add a number of tests for the new functionality. Several tests were authored
by jhb.

PR:		212607
Reviewed by:	kib
Approved by:	kib (mentor)
MFC after:	2 weeks
Sponsored by:	Dell EMC
In collaboration with:	jhb
Differential Revision:	https://reviews.freebsd.org/D9260
2017-02-20 15:53:16 +00:00
Konstantin Belousov
ecc6c515ab Apply noexec mount option for mmap(PROT_EXEC).
Right now the noexec mount option disallows image activators to try
execve the files on the mount point.  Also, after r127187, noexec
also limits max_prot map entries permissions for mappings of files
from such mounts, but not the actual mapping permissions.

As result, the API behaviour is inconsistent.  The files from noexec
mount can be mapped with PROT_EXEC, but if mprotect(2) drops execution
permission, it cannot be re-enabled later.  Make this consistent
logically and aligned with behaviour of other systems, by disallowing
PROT_EXEC for mmap(2).

Note that this change only ensures aligned results from mmap(2) and
mprotect(2), it does not prevent actual code execution from files
coming from noexec mount.  Such files can always be read into
anonymous executable memory and executed from there.

Reported by:	shamaz.mazum@gmail.com
PR:	217062
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-19 20:51:04 +00:00
Mateusz Guzik
b247fd395d locks: make trylock routines check for 'unowned' value
Since fcmpset can fail without lock contention e.g. on arm, it was possible
to get spurious failures when the caller was expecting the primitive to succeed.

Reported by:	mmel
2017-02-19 16:28:46 +00:00
Hans Petter Selasky
316e092a77 Make sure the thread constructor and destructor eventhandlers are
called for all threads belonging to a procedure. Currently the first
thread in a procedure is kept around as an optimisation step and is
never freed. Because the first thread in a procedure is never freed
nor allocated, its destructor and constructor callbacks are never
called which means per thread structures allocated by dtrace and the
Linux emulation layers for example, might be present for threads which
don't need these structures.

This patch adds a thread construction and destruction call for the
first thread in a procedure.

Tested:			dtrace, linux emulation
Reviewed by:		kib @
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2017-02-19 13:15:33 +00:00
Jason A. Harmening
e2a8d17887 Bring back r313037, with fixes for mips:
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.

Reviewed by:	andreast, kan, lidl
Tested by:	lidl(mips, sparc64), andreast(powerpc)
Differential Revision:	https://reviews.freebsd.org/D9587
2017-02-19 02:03:09 +00:00
Mateusz Guzik
5c5df0d99b locks: clean up trylock primitives
In particular thius reduces accesses of the lock itself.
2017-02-18 22:06:03 +00:00
Bryan Drewery
8e31b510b0 Fix panic with unlocked vnode to vrecycle().
MFC after:	2 weeks
2017-02-18 05:07:53 +00:00
Mateusz Guzik
a24c8eb847 mtx: plug the 'opts' argument when not used 2017-02-18 01:52:10 +00:00
Mateusz Guzik
cbebea4e67 mtx: get rid of file/line args from slow paths if they are unused
This denotes changes which went in by accident in r313877.

On most production kernels both said parameters are zeroed and have nothing
reading them in either __mtx_lock_sleep or __mtx_unlock_sleep. Thus this change
stops passing them by internal consumers which this is the case.

Kernel modules use _flags variants which are not affected kbi-wise.
2017-02-17 15:40:24 +00:00
Mateusz Guzik
09f1319acd mtx: restrict r313875 to kernels without LOCK_PROFILING 2017-02-17 15:34:40 +00:00
Mateusz Guzik
7640beb920 mtx: microoptimize lockstat handling in __mtx_lock_sleep
This saves a function call and multiple branches after the lock is acquired.
2017-02-17 14:55:59 +00:00
Mateusz Guzik
0108a98012 sx: fix compilation on UP kernels after r313855
sx primitives use inlines as opposed to macros. Change the tested condition
to LOCK_DEBUG which covers the case, but is slightly overzelaous.

Reported by:	kib
2017-02-17 10:58:12 +00:00
Mateusz Guzik
91fa47076d Introduce SCHEDULER_STOPPED_TD for use when the thread pointer was already read
Sprinkle in few places.
2017-02-17 06:45:04 +00:00
Mateusz Guzik
ffd5c94c4f locks: let primitives for modules unlock without always goging to the slsow path
It is only needed if the LOCK_PROFILING is enabled. It has to always check if
the lock is about to be released which requires an avoidable read if the option
is not specified..
2017-02-17 05:39:40 +00:00
Mateusz Guzik
afa39f7a32 locks: remove SCHEDULER_STOPPED checks from primitives for modules
They all fallback to the slow path if necessary and the check is there.

This means a panicked kernel executing code from modules will be able to
succeed doing actual lock/unlock, but this was already the case for core code
which has said primitives inlined.
2017-02-17 05:09:51 +00:00
Ryan Stone
27ee18ad33 Revert r313814 and r313816
Something evidently got mangled in my git tree in between testing and
review, as an old and broken version of the patch was apparently submitted
to svn.  Revert this while I work out what went wrong.

Reported by:	tuexen
Pointy hat to:	rstone
2017-02-16 21:18:31 +00:00
Eric van Gyzen
8144690af4 Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.

Suggested by:	glebius, emaste
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:47:41 +00:00
Ryan Stone
3600f4ba35 Fix a typo in my previous commit
Somehow in the late stages of testing my sched_ule patch, a character was
accidentally deleted from the file.  Correct this.

While I'm committing anyway, the previous commit message requires some
clarification: in the normal case of unlending priority after releasing
a mutex, the thread that was doing the lending will be woken up and
immediately become the highest-priority thread, and in that case no
priority inversion would take place.  However, if that thread is pinned
to a different CPU, then the currently running thread that just had its
priority lowered will not be preempted and then priority inversion can
occur.

Reported by:	O. Hartmann (typo), jhb (scheduler clarification)
MFC after:	1 month
Pointy hat to:	rstone
2017-02-16 20:06:21 +00:00
Ryan Stone
09ae7c4814 Check for preemption after lowering a thread's priority
When a high-priority thread is waiting for a mutex held by a
low-priority thread, it temporarily lends its priority to the
low-priority thread to prevent priority inversion.  When the mutex
is released, the lent priority is revoked and the low-priority
thread goes back to its original priority.

When the priority of that thread is lowered (through a call to
sched_priority()), the schedule was not checking whether
there is now a high-priority thread in the run queue.  This can
cause threads with real-time priority to be starved in the run
queue while the low-priority thread finishes its quantum.

Fix this by explicitly checking whether preemption is necessary
when a thread's priority is lowered.

Sponsored by: Dell EMC Isilon
Obtained from: Sandvine Inc
Differential Revision:	https://reviews.freebsd.org/D9518
Reviewed by: Jeff Roberson (ule)
MFC after: 1 month
2017-02-16 19:41:13 +00:00
Mark Johnston
c6a4ba5a38 Apply MADV_FREE to exec_map entries only after a lowmem event.
This effectively provides the same benefit as applying MADV_FREE inline
upon every execve, since the page daemon invokes lowmem handlers prior to
scanning the inactive queue. It also has less overhead; the cost of
applying MADV_FREE is very noticeable on many-CPU systems since it includes
that of a TLB shootdown of global PTEs. For instance, this change nearly
halves the system CPU usage during a buildkernel on a 128-vCPU EC2
instance (with some other patches applied).

Benchmarked by:	cperciva (earlier version)
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9586
2017-02-15 01:50:58 +00:00
Eric Badger
28d2efa983 sleepq_catch_signals: do thread suspension before signal check
Since locks are dropped when a thread suspends, it's possible for another
thread to deliver a signal to the suspended thread. If the thread awakens from
suspension without checking for signals, it may go to sleep despite having
a pending signal that should wake it up. Therefore the suspension check is
done first, so any signals sent while suspended will be caught in the
subsequent signal check.

Reviewed by:	kib
Approved by:	kib (mentor)
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9530
2017-02-14 17:13:23 +00:00
Andriy Gapon
937c1b0757 try to fix RACCT_RSS accounting
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero.  In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.

Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state.  Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.

PR:		210315
Reviewed by:	trasz
Differential Revision: https://reviews.freebsd.org/D9464
2017-02-14 13:54:05 +00:00