Commit Graph

14595 Commits

Author SHA1 Message Date
Marius Strobl
7888a51f8d - Unbreak dumpsys(9) on sparc64 after r276772
- While at it, arrange #ifndefs in kern_dump.c more intelligently; it's
  rather confusing to have multiple competing and/or unused functions in
  the kernel.
2015-11-16 23:02:33 +00:00
Edward Tomasz Napierala
15db3c0738 Speed up rctl operation with large rulesets, by holding the lock
during iteration instead of relocking it for each traversed rule.

Reviewed by:	mjg@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D4110
2015-11-15 12:10:51 +00:00
Randall Stewart
7c4676ddee This fixes several places where callout_stops return is examined. The
new return codes of -1 were mistakenly being considered "true". Callout_stop
now returns -1 to indicate the callout had either already completed or
was not running and 0 to indicate it could not be stopped.  Also update
the manual page to make it more consistent no non-zero in the callout_stop
or callout_reset descriptions.

MFC after:	1 Month with associated callout change.
2015-11-13 22:51:35 +00:00
John Baldwin
645743ea99 Export various helper variables describing the layout and size of
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.

One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.

A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.

The 'pcb_size' variable is used to index into the stoppcbs[] array.

The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.

While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved.  Annotations for
other platforms will be added in the future.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3773
2015-11-12 22:00:59 +00:00
Randall Stewart
18b4fd62e0 Add new async_drain to the callout system. This is so-far not used but
should be used by TCP for sure in its cleanup of the IN-PCB (will be coming shortly).

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D4076
2015-11-10 14:49:32 +00:00
Josh Paetzel
09766dd52a Fix a bug in the CPU % limiting code
If you attempt to set a pcpu limit that is higher than
110% using rctl (for instance, you want a jail to be
able to use 2 cores on your system so you set pcpu to
200%) the thing you are trying to limit becomes unthrottled.

PR:	189870
Submitted by:	dustinwenz@ebureau.com
Reviewed by:	trasz
MFC after:	1 week
2015-11-10 14:14:32 +00:00
Edward Tomasz Napierala
ea228b482e Make naming more consistent; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-11-08 18:11:24 +00:00
Edward Tomasz Napierala
2b4035eeb6 Speed up rctl(8) rule retrieval; the difference shows mostly in "rctl -n",
as otherwise most of the time is spent resolving UIDs to names.

Reviewed by:	mjg@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D4059
2015-11-08 18:08:31 +00:00
Tijl Coosemans
27f38a8d69 Since r289279 bufinit() uses mp_ncpus, but some architectures set this
variable during mp_start() which is too late.  Move this to mp_setmaxid()
where other architectures set it and move x86 assertions to MI code.

Reviewed by:	kib (x86 part)
2015-11-08 14:26:50 +00:00
Mark Johnston
d28713378a - Consistently use PROC_ASSERT_HELD() to verify that a process' hold count
is non-zero.
- Include the process address in the PROC_ASSERT_HELD() and
  PROC_ASSERT_NOT_HELD() assertion messages so that the corresponding
  process can be found easily when debugging.

MFC after:	1 week
2015-11-08 01:38:56 +00:00
Conrad Meyer
e072f955ff Flesh out sysctl types further (follow-up of r290475)
Use the right intmax_t type instead of intptr_t in a few remaining
places.

Add support for CTLFLAG_TUN for the new fixed with types.  Bruce will be
upset that the new handlers silently truncate tuned quad-sized inputs,
but so do all of the existing handlers.

Add the new types to debug_dump_node, for whatever use that is.

Bump FreeBSD_version again, for good measure.  We are changing
SYSCTL_HANDLER_ARGS and a member of struct sysctl_oid to intmax_t.

Correct the sysctl typed NULL values for the fixed-width types.  (Hat
tip: hps@.)

Suggested by:	hps (partial)
Sponsored by:	EMC / Isilon Storage Division
2015-11-07 18:26:32 +00:00
Adrian Chadd
2eb936acb9 Add a sched_yield() to work around low memory conditions in the current code.
Things seem to get stuck in low memory conditions where no bufs are available,
the reclamation path is called to wakeup the daemon, but no sleeping is done.
Because of this, we are stuck in a tight loop in the current process and
never run said reclamation path.

This was introduced in r289279 . This is only a temporary workaround
to restore system usefulness until the more permanent solutions can be
found.

Tested:

* Carambola2, 64MB (and 32MB by manual config.)
2015-11-07 04:04:00 +00:00
Conrad Meyer
be87839e56 Round out SYSCTL macros to the full set of fixed-width types
Add S8, S16, S32, and U32 types;  add SYSCTL*() macros for them, as well
as for the existing 64-bit types.  (While SYSCTL*QUAD and UQUAD macros
already exist, they do not take the same sort of 'val' parameter that
the other macros do.)

Clean up the documented "types" in the sysctl.9 document.  (These are
macros and thus not real types, but the manual page documents intent.)

The sysctl_add_oid(9) arg2 has been bumped from intptr_t to intmax_t to
accommodate 64-bit types on 32-bit pointer architectures.

This is just the kernel support piece; the userspace sysctl(1) support
will follow in a later patch.

Submitted by:	Ravi Pokala <rpokala@panasas.com>
Reviewed by:	cem
Relnotes:	no
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D4091
2015-11-07 01:43:01 +00:00
Mateusz Guzik
b577e693aa fd: implement kern.proc.nfds sysctl
Intended purpose is to provide an equivalent of OpenBSD's getdtablecount
syscall for the compat library..
2015-11-07 00:18:14 +00:00
John Baldwin
db41d262d3 When dumping an rman in DDB, include the RID of each resource.
Submitted by:	Ravi Pokala (rpokala@panasas.com)
Reviewed by:	imp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D4086
2015-11-05 23:12:23 +00:00
Mark Johnston
e8e0fac552 Have elf_lookup() return an error if the specified non-weak symbol could
not be found. Otherwise, relocations against such symbols will be silently
ignored instead of causing an error to be raised.

Reviewed by:	kib
MFC after:	1 week
2015-11-03 03:29:35 +00:00
Enji Cooper
aaca704590 Define fhard in pps_event(..) only when PPS_SYNC is defined to mute
an -Wunused-but-set-variable warning

Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job
Sponsored by: EMC / Isilon Storage Division
2015-11-02 03:14:37 +00:00
Enji Cooper
9a12e28212 Define compress in __elfN(coredump) when #ifdef GZIO is true to mute
an -Wunused-but-set-variable warning

Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job
Sponsored by: EMC / Isilon Storage Division
2015-11-02 01:47:26 +00:00
Warner Losh
a1f26210f3 The error classification from lower layers is a poor indicator of
whether an error is recoverable. Always re-dirty the buffer on errors
from write requests. The invalidation we used to do for errors not EIO
doesn't need to be done for a device that's really gone, since that's
done in a different path.

Reviewed by: mckusick@, kib@
2015-10-31 04:53:07 +00:00
Konstantin Belousov
50657fd342 Minor (and incomplete) style cleanup.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-30 20:47:42 +00:00
Konstantin Belousov
2936e0013c Also mark compat32 umtx op table as constant.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-30 19:32:30 +00:00
Konstantin Belousov
c539e87014 Use C99 array initialization, which also makes the code
self-documented, and eases addition of new ops.

For the similar reasons, eliminate UMTX_OP_MAX.  nitems() handles the
only use of the symbol.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-30 19:20:40 +00:00
Edward Tomasz Napierala
665aea9323 After r290196, the kernel won't wait for stuff like gmirror nodes
if they are not required for mounting rootfs.  However, it's possible
that some setups try to mount them in mountcritlocal (ie from fstab).

Export the list of current root mount holds using a new sysctl,
vfs.root_mount_hold, and make mountcritlocal retry if "mount -a" fails
and the list is not empty.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3709
2015-10-30 15:52:10 +00:00
Edward Tomasz Napierala
a3ba3d09c2 Make root mount wait mechanism smarter, by making it wait only if the root
device doesn't yet exist.

Reviewed by:	kib@, marcel@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3709
2015-10-30 15:35:04 +00:00
Bryan Drewery
156c04f793 getnewbuf: Initialize bp to avoid uninitialized pointer dereference and brelse().
This came in recently in r289279.

Coverity CID:	1331561
2015-10-29 19:02:24 +00:00
Hans Petter Selasky
cb3450e26e Add missing NULL check in physio().
When destroying a character device the si_devsw field is set to NULL
before all references are gone, to indicate the character device is
going away. This can cause a NULL-dereference fault inside physio().

The callers of physio() should own a thread reference on the cdev and
if si_devsw is seen as non-NULL, it is usable during the execution of
the function. Else an ENXIO error code is returned.

Reviewed by:	kib
MFC after:	2 weeks
2015-10-29 13:53:37 +00:00
Warner Losh
83a283cf62 Add a note to the effect that BUS_ADD_CHILD calls
device_add_child_ordered to add the child. device_add_child_ordered
doesn't call BUS_ADD_CHILD.
2015-10-28 18:53:18 +00:00
Kirk McKusick
a57418a761 Bring the tags and links entries for amd64 up to date.
Based on how out of date it is, I doubt that anyone
other than me and my code-reading students still use it.
2015-10-27 22:59:24 +00:00
Pawel Jakub Dawidek
38d68e2d42 The aio_waitcomplete(2) syscall should not sleep when the given timeout
is 0. Without this change it was sleeping for one tick. Maybe not a big
deal, but it makes share/dtrace/blocking script to report that.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D3814
Sponsored by:	Wheel Systems, http://wheelsystems.com
2015-10-25 18:48:09 +00:00
Conrad Meyer
c3220d0b6d Sysctl: Add common support for U8, U16 types
Sponsored by:	EMC / Isilon Storage Division
2015-10-22 23:03:06 +00:00
John Baldwin
ea7b054e99 Missing regen after last change to sys/kern/syscalls.master. 2015-10-22 21:30:39 +00:00
John Baldwin
2f99bcce1e Rename remaining linux32 symbols such as linux_sysent[] and
linux_syscallnames[] from linux_* to linux32_* to avoid conflicts with
linux64.ko.  While here, add support for linux64 binaries to systrace.
- Update NOPROTO entries in amd64/linux/syscalls.master to match the
  main table to fix systrace build.
- Add a special case for union l_semun arguments to the systrace
  generation.
- The systrace_linux32 module now only builds the systrace_linux32.ko.
  module on amd64.
- Add a new systrace_linux module that builds on both i386 and amd64.
  For i386 it builds the existing systrace_linux.ko.  For amd64 it
  builds a systrace_linux.ko for 64-bit binaries.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D3954
2015-10-22 21:28:20 +00:00
Ed Schouten
aff5735784 Add a way to distinguish between forking and thread creation in schedtail.
For CloudABI we need to initialize the registers of new threads
differently based on whether the thread got created through a fork or
through simple thread creation.

Add a flag, TDP_FORKING, that is set by do_fork() and cleared by
fork_exit(). This can be tested against in schedtail.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D3973
2015-10-22 09:33:34 +00:00
Konstantin Belousov
8fc3db038e Trim spaces at end of line to record the proper commit message for
r289660:

Do not allow to execute ptrace(PT_TRACE_ME) when the process is
already traced.

Do not allow to execute ptrace(PT_TRACE_ME) when there is no parent
which can trace the process, i.e. when the parent is already init.
Note that after the PT_TRACE_ME request the process is unkillable and
non-continuable until a debugger is attached, or parent is killed, the
later clears P_TRACED state.  Since init clearly would not debug the
caller, and cannot be killed, disallow creation of unkillable
processes.

Reviewed by:	jhb, pho
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3908
2015-10-20 20:38:20 +00:00
Konstantin Belousov
1b253694f4 Mark struct thread zone as type-stable.
When establishing the locking state for several lock types (including
blockable mutexes and sx) failed, locking primitives try to spin while
the owner thread is running.  The spinning loop performs the test for
running condition by dereferencing the owner->td_state field of the
owner thread.  If the owner thread exited while spinner was put off
the processor, it is harmless to access reused struct thread owner,
since in some near future the current processor would notice the owner
change and make appropriate progress.  But it could be that the page
which carried the freed struct thread was unmapped, then we fault
(this cannot happen on amd64).

For now, disallowing free of the struct thread seems to be good
enough, and tests which create a lot of threads once, did not
demonstrated regressions.

Reviewed by:	jhb, pho
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3908
2015-10-20 20:29:21 +00:00
Konstantin Belousov
77b9bec37b Reviewed by: jhb, pho
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3908
2015-10-20 20:22:57 +00:00
Konstantin Belousov
1ed2e49b55 No need to dereference struct proc to pids when comparing processes
for equality.

Reviewed by:	jhb, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-10-20 20:12:42 +00:00
John Baldwin
c814b86843 Switch pl_child_pid from int to pid_t.
Reviewed by:	emaste, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3857
2015-10-20 17:58:21 +00:00
Ian Lepore
11ead989eb Fix printf format to allow for bus_size_t not being u_long on all platforms. 2015-10-20 03:25:17 +00:00
Enji Cooper
3f18b7fa12 Replace /dev/acd0 with /dev/cd1
atapicd(4) has been removed since r249083, and if a system has more than one
optical drive, it will likely be /dev/cd1

Update mount.conf(8) to reflect the change in behavior

MFC after: never
Sponsored by: EMC / Isilon Storage Division
2015-10-17 08:51:10 +00:00
Konstantin Belousov
d8f3dc7871 If falloc_caps() failed, cleanup needs to be performed. This is a bug
in r289026.

Sponsored by:	The FreeBSD Foundation
2015-10-16 14:55:39 +00:00
Konstantin Belousov
6c775eb64e Allow PT_INTERP and PT_NOTES segments to be located anywhere in the
executable image.  Keep one page (arbitrary) limit on the max allowed
size of the PT_NOTES.

The ELF image activators still require that program headers of the
executable are fully contained in the first page of the image file.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3871
2015-10-14 18:27:35 +00:00
Jeff Roberson
21fae96123 Parallelize the buffer cache and rewrite getnewbuf(). This results in a
8x performance improvement in a micro benchmark on a 4 socket machine.

 - Get buffer headers from a per-cpu uma cache that sits in from of the
   free queue.
 - Use a per-cpu quantum cache in vmem to eliminate contention for kva.
 - Use multiple clean queues according to buffer cache size to eliminate
   clean queue lock contention.
 - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
   from blocking or doing direct recycling.
 - Close some bufspace allocation races that could lead to endless
   recycling.
 - Further the transition to a more modern style of small functions grouped
   by prefix in order to improve growing complexity.

Sponsored by:	EMC / Isilon
Reviewed by:	kib
Tested by:	pho
2015-10-14 02:10:07 +00:00
Hiren Panchasara
86a996e6bd There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision:	D3100
Submitted by:		Jonathan Looney <jlooney at juniper dot net>
Reviewed by:		gnn, hiren
2015-10-14 00:35:37 +00:00
Edward Tomasz Napierala
92001b9497 Change the default setting of kern.ipc.shm_allow_removed from 0 to 1.
This removes the need for manually changing this flag for Google Chrome
users. It also improves compatibility with Linux applications running under
Linuxulator compatibility layer, and possibly also helps in porting software
from Linux.

Generally speaking, the flag allows applications to create the shared memory
segment, attach it, remove it, and then continue to use it and to reattach it
later. This means that the kernel will automatically "clean up" after the
application exits.

It could be argued that it's against POSIX. However, SUSv3 says this
about IPC_RMID: "Remove the shared memory identifier specified by shmid from
the system and destroy the shared memory segment and shmid_ds data structure
associated with it." From my reading, we break it in any case by deferring
removal of the segment until it's detached; we won't break it any more
by also deferring removal of the identifier.

This is the behaviour exhibited by Linux since... probably always, and
also by OpenBSD since the following commit:

revision 1.54
date: 2011/10/27 07:56:28; author: robert; state: Exp; lines: +3 -8;
Allow segments to be used even after they were marked for deletion with
the IPC_RMID flag.
This is permitted as an extension beyond the standards and this is similar
to what other operating systems like linux do.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3603
2015-10-10 09:29:47 +00:00
Edward Tomasz Napierala
b9a5c7b595 Provide better debug message on kernel module name clash.
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-10-10 09:21:55 +00:00
Edward Tomasz Napierala
8d90e66066 Remove root_mount_wait(). It's not used anywhere.
Reviewed by:	bapt@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3787
2015-10-09 12:11:37 +00:00
Konstantin Belousov
4b48959f9f Enforce the maxproc limitation before allocating struct proc, initial
struct thread and kernel stack for the thread.  Otherwise, a load
similar to a fork bomb would exhaust KVA and possibly kmem, mostly due
to the struct proc being type-stable.

The nprocs counter is changed from being protected by allproc_lock sx
to be an atomic variable.  Note that ddb/db_ps.c:db_ps() use of nprocs
was unsafe before, and is still unsafe, but it seems that the only
possible undesired consequence is the harmless warning printed when
allproc linked list length does not match nprocs.

Diagnosed by:	Svatopluk Kraus <onwahe@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-08 11:07:09 +00:00
Fabien Thomas
78e79434d2 Fix r283998 that broke mapin events for hwpmc.
Reviewed by:	jhb
Sponsored by:	Stormshield
2015-10-08 09:54:33 +00:00
Gleb Smirnoff
e40e8705db Fix regression from r248371. We need to copy packet header to new
mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set
only on a first mbuf.

Reported & tested by:	Andriy Voskoboinyk <s3erios gmail.com>
Sponsored by:		Nginx, Inc.
2015-10-07 12:40:00 +00:00
John Baldwin
189ac973de Fix various edge cases related to system call tracing.
- Always set td_dbg_sc_* when P_TRACED is set on system call entry
  even if the debugger is not tracing system call entries.  This
  ensures the fields are valid when reporting other stops that
  occur at system call boundaries such as for PT_FOLLOW_FORKS or
  when only tracing system call exits.
- Set TDB_SCX when reporting the stop for a new child process in
  fork_return().  This causes the event to be reported as a system
  call exit.
- Report a system call exit event in fork_return() for new threads in
  a traced process.
- Copy td_dbg_sc_* to new threads instead of zeroing.  This ensures
  that td_dbg_sc_code in particular will report the system call that
  created the new thread or process when it reports a system call
  exit event in fork_return().
- Add new ptrace tests to verify that new child processes and threads
  report system call exit events with a valid pl_syscall_code via
  PT_LWPINFO.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D3822
2015-10-06 19:29:05 +00:00
Conrad Meyer
e6b95927f3 Fix core corruption caused by race in note_procstat_vmmap
This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath.  As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
  kinfo packing for PROCSTAT_VMMAP notes.  This avoids VMMAP corruption
  and truncation, even if names change, at the cost of up to PATH_MAX
  bytes per mapped object.  The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass.  This
  addresses corruption, at the cost of sometimes producing a truncated
  result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
  to grok the new zero padding.

Reported by:	pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3824
2015-10-06 18:07:00 +00:00
Gleb Smirnoff
640082d498 Remove debugging variable from r143761. 2015-10-06 09:43:49 +00:00
John Baldwin
3edd0ffffe Include additional info in ptrace(2) KTR traces:
- The new PC value and signal passed to PT_CONTINUE, PT_DETACH, PT_SYSCALL,
  and PT_TO_SC[EX].
- The system call code returned via PT_LWPINFO.

MFC after:	1 week
2015-10-05 21:36:53 +00:00
Mark Johnston
403ec61cbb Revert r288628 and instead fix a discrepancy between the posix_fadvise(2)
man page and POSIX: posix_fadvise(2) returns an error number on failure.

Reported by:	jilles
MFC after:	1 week
2015-10-03 22:27:14 +00:00
Mark Johnston
a7713f7631 The return value of posix_fadvise(2) is just an error status, so
sys_posix_fadvise() should simply return the errno (or 0) to syscallenter()
rather than setting a return value.

MFC after:	1 week
2015-10-03 19:37:41 +00:00
Alan Cox
acada7aef0 Perform a single batched update to the object's paging-in-progress count
rather than updating it for each page.
2015-10-03 17:04:52 +00:00
Poul-Henning Kamp
d58b610faa Fail the sbuf if vsnprintf(3) fails. 2015-10-02 09:23:14 +00:00
Mark Johnston
0a19cfd454 Ensure that vop_stdadvise() does not call getblk() on vnodes that have an
empty bufobj. Otherwise, vnodes belonging to filesystems that do not use the
buffer cache may trigger assertion failures.

Reported by:	Fabien Keil
2015-10-01 16:34:53 +00:00
Colin Percival
2eb0015ab7 Disable suspend when we're shutting down. This solves the "tell FreeBSD
to shut down; close laptop lid" scenario which otherwise tended to end
with a laptop overheating or the battery dying.

The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets
this while rc.suspend runs, and the ACPI sleep code ignores requests while
the sysctl is set.

Discussed on:	freebsd-acpi (35 emails)
MFC after:	1 week
2015-10-01 10:52:26 +00:00
Mark Johnston
3138cd3670 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
Andriy Gapon
2f2f522b5d save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
Jeff Roberson
4615830db2 - Collapse vfs_vmio_truncate & vfs_vmio_release into a single function.
- Allow vfs_vmio_invalidate() to free the pages, leaving us with a
   single loop and bufobj lock when B_NOCACHE/B_INVAL is used.
 - Eliminate the special B_ASYNC handling on free that has not been
   relevant for some time.
 - Remove the extraneous page busy from vfs_vmio_truncate().

Reviewed by:	kib
Tested by:	pho
Sponsored by:	EMC / Isilon storage division
2015-09-27 05:16:06 +00:00
Mark Johnston
0a805de6f3 Remove a check for a condition that is always false by a preceding KASSERT
that was added in r144704.
2015-09-26 22:26:55 +00:00
Mark Johnston
d925c2e800 Fix argument ordering in vn_printf().
MFC after:	3 days
2015-09-26 22:16:54 +00:00
Conrad Meyer
2f1c4e0ebf sbuf: Process more than one char at a time
Revamp sbuf_put_byte() to sbuf_put_bytes() in the obvious fashion and
fixup callers.

Add a thin shim around sbuf_put_bytes() with the old ABI to avoid ugly
changes to some callers.

Reviewed by:	jhb, markj
Obtained from:	Dan Sledz
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3717
2015-09-25 18:37:14 +00:00
Konstantin Belousov
b2557db607 Use per-cpu values for base and last in tc_cpu_ticks(). The values
are updated lockess, different CPUs write its own view of timecounter
state.  The critical section is done for safety, callers of
tc_cpu_ticks() are supposed to already enter critical section, or to
own a spinlock.

The change fixes sporadical reports of too high values reported for
the (W)CPU on platforms that do not provide cpu ticker and use
tc_cpu_ticks(), in particular, arm*.

Diagnosed and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-25 13:03:57 +00:00
Mateusz Guzik
3c44a3495f kqueue: simplify kern_kqueue by not refing/unrefing creds too early
No functional changes.
2015-09-23 12:45:08 +00:00
Jeff Roberson
589c956a5a - Fix a nonsense reordering that somehow slipped into my last diff.
Reported by:	pho
2015-09-23 07:44:07 +00:00
Jeff Roberson
8264830c95 Some refactoring of the buf/vm interface.
- Eliminate bogus page replacement that is inconsistently applied in the
   invalidation loop in brelse.  This has been a no-op in modern times as
   biodone() is responsible for cleaning up after bogus pages.  This
   would've spammed the console with printfs at a minimum.
 - Allow the compiler and human readers alike to reason about allocbuf()
   by splitting it into constituent parts.
 - Separate the VM manipulating and buf manipulating code in brelse() and
   bufdone() so that the intentions are clear.  This makes it evident that
   there are several duplicated buf pages loops that will be consolidated
   at a later time.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 23:57:52 +00:00
Alan Cox
15aaea7892 Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 18:16:52 +00:00
Hans Petter Selasky
c55f4c9445 Revert r287780 until more developers have their say.
Differential Revision:	https://reviews.freebsd.org/D3521
Requested by:		gnn
2015-09-22 06:51:55 +00:00
Bryan Drewery
6c5c24c98c vfs_mountroot_shuffle() never returns non-zero. 2015-09-22 03:34:07 +00:00
Konstantin Belousov
1f57d8c66b Ensure that maxproc does not exceed pid_max, at the time of boot.
Changes to kern.pid_max mib after the boot can break this relation.

The maxfiles value was calculated by the MAXFILES formula based on
maxproc value, but this change decouples them, and MAXFILES now
references maxusers.  Without manual tuning, the maxfiles default
value remains as it was prior to this commit.  But for systems which
have tuned maxproc and rely on maxfiles to adjust, additional
reconfiguration is needed.

Reported by:	rwatson
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-21 15:02:59 +00:00
Konstantin Belousov
cff8c6f2d1 Add support for weak symbols to the kernel linkers. It means that
linkers no longer raise an error when undefined weak symbols are
found, but relocate as if the symbol value was 0.  Note that we do not
repeat the mistake of userspace dynamic linker of making the symbol
lookup prefer non-weak symbol definition over the weak one, if both
are available.  In fact, kernel linker uses the first definition
found, and ignores duplicates.

Signature of the elf_lookup() and elf_obj_lookup() functions changed
to split result/error code and the symbol address returned.
Otherwise, it is impossible to return zero address as the symbol
value, to MD relocation code.  This explains the mechanical changes in
elf_machdep.c sources.

The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the
lookup() call, the patch leaves the code as is (untested).

Reported by:	glebius
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-20 01:27:59 +00:00
Edward Tomasz Napierala
0d3d0cc358 Kernel part of reroot support - a way to change rootfs without reboot.
Note that the mountlist manipulations are somewhat fragile, and not very
pretty.  The reason for this is to avoid changing vfs_mountroot(), which
is (obviously) rather mission-critical, but not very well documented,
and thus hard to test properly.  It might be possible to rework it to use
its own simple root mount mechanism instead of vfs_mountroot().

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D2698
2015-09-18 17:32:22 +00:00
John Baldwin
bdd64116b0 Always clear TDB_USERWR before fetching system call arguments. The
TDB_USERWR flag may still be set after a debugger detaches from a
process via PT_DETACH.  Previously the flag would never be cleared
forcing a double fetch of the system call arguments for each system
call.  Note that the flag cannot be cleared at PT_DETACH time in case
one of the threads in the process is currently stopped in
syscallenter() and the debugger has modified the arguments for that
pending system call before detaching.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3678
2015-09-16 20:55:00 +00:00
John Baldwin
4295bec16a When a process group leader exits, all of the processes in the group are
sent SIGHUP and SIGCONT if any of the processes are stopped.  Currently this
behavior is triggered for any type of process stop including ptrace() stops
and transient stops for single threading during exit() and execve().
Thus, if a debugger is attached to a process in a group when the leader
exits, the entire group can be HUPed.  Instead, only send the signals if a
process in the group is stopped due to SIGSTOP.

PR:		201149
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3681
2015-09-16 16:40:07 +00:00
Mateusz Guzik
7665e341ca sysctl: switch sysctllock to a sleepable rmlock, take 2
This restores r285125. Previous attempt was reverted due to a bug in rmlocks,
which is fixed since r287833.
2015-09-15 23:06:56 +00:00
John Baldwin
e89d5f43da Threads holding a read lock of a sleepable rm lock are not permitted
to sleep.  The rmlock implementation enforces this by disabling
sleeping when a read lock is acquired. To simplify the implementation,
sleeping is disabled for most of the duration of rm_rlock.  However,
it doesn't need to be disabled until the lock is acquired.  If a
sleepable rm lock is contested, then rm_rlock may need to acquire the
backing sx lock.  This tripped the overly-broad assertion.  Fix by
relaxing the assertion around the call to sx_xlock().

Reported by:	mjg
Reviewed by:	kib, mjg
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3324
2015-09-15 22:16:21 +00:00
Conrad Meyer
55d33667ee kevent(2): Note DOOMED vnodes with NOTE_REVOKE
In poll mode, check for and wake VBAD vnodes.  (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation.  (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)

Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3675
2015-09-15 20:22:30 +00:00
Hans Petter Selasky
9acc0eafd7 Implement callout_drain_async(), inspired by the projects/hps_head
branch.

This function is used to drain a callout via a callback instead of
blocking the caller until the drain is complete. Refer to the
callout_drain_async() manual page for a detailed description.

Limitation: If a lock is used with the callout, the callout can only
be drained asynchronously one time unless the callout_init_mtx()
function is called again. This limitation is not present in
projects/hps_head and will require more invasive changes to the
timeout code, which was not in the scope of this patch.

Differential Revision:	https://reviews.freebsd.org/D3521
Reviewed by:		wblock
MFC after:		1 month
2015-09-14 10:52:26 +00:00
Warner Losh
7297c5e535 bufdonebio is now unused. Retire it too. 2015-09-11 04:20:04 +00:00
Mark Johnston
610141cebb Add stack_save_td_running(), a function to trace the kernel stack of a
running thread.

It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.

This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.

Reviewed by:	bdrewery, jhb, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3256
2015-09-11 03:54:37 +00:00
Warner Losh
ad8d57a99d dev_strategy and dev_strategy_csw are unused since r281825. Remove
them.

Differential Revision: https://reviews.freebsd.org/D3620
2015-09-11 00:38:58 +00:00
Adrian Chadd
32766cd281 Also make kern.maxfilesperproc a boot time tunable.
Auto-tuning threshold discussions aside, it turns out that if you want
to lower this on say, rather memory-packed machines, you either set maxusers
or kern.maxfiles, or you set it in sysctl.  The former is a non-exact
way to tune this; the latter doesn't actually affect anything in the
startup scripts.

This first occured because I wondered why the hell screen would take upwards
of 10 seconds to spawn a new screen.  I then found python doing the same
thing during fork/exec of child processes - it calls close() on each FD
up to the current openfiles limit.  On a 1TB machine this is like, 26 million
FDs per process.  Ugh.

So:

* This allows it to be set early in /boot/loader.conf;
* It can be used to work around the ridiculous situation of
  screen, python, etc doing a close() on potentially millions of FDs
  even though you only have four open.

Tested:

* 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring
  screen and python forking doesn't result in some pretty hilariously
  bad behaviour.

TODO:

* Note that the default login.conf sets openfiles-cur to unlimited,
  effectively obeying kern.maxfilesperproc.  Perhaps we should fix
  this.

* .. and even if we do, we need to also ensure that daemons get
  a soft limit of something reasonable and capped - they can request
  more FDs themselves.

MFC after:	1 week
Sponsored by:	Norse Corp, Inc.
2015-09-10 04:05:58 +00:00
Konstantin Belousov
9e18c9eb27 For open("name", O_DIRECTORY | O_CREAT), do not try to create the
named node, open(2) cannot create directories.  But do allow the flag
combination to succeed if the directory already exists.

Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.

Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL.  The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.

Reported by:	Tom Ridge <freebsd@tom-ridge.com>
Reviewed by:	rwatson
PR:	 202892
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-09 19:31:08 +00:00
Mateusz Guzik
9af8c8b72b fd: make rights a mandatory argument to fgetvp_rights
The only caller already always passes rights.
2015-09-07 20:05:56 +00:00
Mateusz Guzik
d7832811a7 fd: make the common case in filecaps_copy work lockless
The filedesc lock is only needed if ioctls caps are present, which is a
rare situation. This is a step towards reducing the scope of the filedesc
lock.
2015-09-07 20:02:56 +00:00
Conrad Meyer
bcb60d52e6 Follow-up to r287442: Move sysctl to compiled-once file
Avoid duplicate sysctl nodes.

Found by:	tijl
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3586
2015-09-07 16:44:28 +00:00
Kirk McKusick
17518b1a2b Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by:   Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after:   2 weeks
2015-09-06 05:50:51 +00:00
Xin LI
28ffe927c2 Expose an interface to determine if an ACE is inherited.
Submitted by:	sef
Reviewed by:	trasz
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3540
2015-09-04 00:14:20 +00:00
Conrad Meyer
14bdbaf2e4 Detect badly behaved coredump note helpers
Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath.  As vnodes may move around during dump, this is racy.

So:

 - Detect badly behaved notes in putnote() and pad underfilled notes.

 - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
   exercise the NT_PROCSTAT_FILES corruption.  It simply picks random
   lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

 - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
   disable kinfo packing for PROCSTAT_FILES notes.  This should avoid
   both FILES note corruption and truncation, even if filenames change,
   at the cost of about 1 kiB in padding bloat per open fd.  Document
   the new sysctl in core.5.

 - Fix note_procstat_files to self-limit in the 2nd pass.  Since
   sometimes this will result in a short write, pad up to our advertised
   size.  This addresses note corruption, at the risk of sometimes
   truncating the last several fd info entries.

 - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
   zero padding.

With suggestions from:	bjk, jhb, kib, wblock
Approved by:	markj (mentor)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3548
2015-09-03 20:32:10 +00:00
Mateusz Guzik
7e8f566c0c fd: remove UMA_ZONE_ZINIT argument from Files zone
Originally it was added in order to prevent trashing of objects with
INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE.

This reverts r286921.

Discussed with:		kib
2015-09-02 23:14:39 +00:00
John Baldwin
ded3e7f08e The 'sa' argument to syscallret() is not unused. 2015-09-01 22:28:23 +00:00
John Baldwin
183b68f74f Export current system call code and argument count for system call entry
and exit events. procfs stop events for system call tracing report these
values (argument count for system call entry and code for system call exit),
but ptrace() does not provide this information. (Note that while the system
call code can be determined in an ABI-specific manner during system call
entry, it is not generally available during system call exit.)

The values are exported via new fields at the end of struct ptrace_lwpinfo
available via PT_LWPINFO.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3536
2015-09-01 22:24:54 +00:00
Konstantin Belousov
6ae26d06dc Exit notification for EVFILT_PROC removes knote from the knlist. In
particular, this invalidates the knote kn_link linkage, making the
SLIST_FOREACH() loop accessing undefined values (e.g. trashed by
QUEUE_MACRO_DEBUG).  If the knote is freed by other thread when kq
lock is released or when influx is cleared, e.g. by knote_scan() for
kqueue owning the knote, the iteration step would access freed memory.

Use SLIST_FOREACH_SAFE() to fix iteration.

Diagnosed by:	avg
Tested by:	avg, lstewart, pawel
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 14:05:29 +00:00
Konstantin Belousov
78b9afe121 Clean up the kqueue use of the uma KPI.
Explain why it is fine to not check for M_NOWAIT failures in
kqueue_register().  Remove unneeded check for NULL result from
waitable allocation in kqueue_scan().  uma_free(9) handles NULL
argument correctly, remove checks for NULL.  Remove useless cast and
adjust style in knote_alloc().

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 13:21:32 +00:00
Andriy Gapon
378d5c6c89 callout_reset: fix a reversed check for cc_exec_cancel
The typo was introduced in r278469 / 344ecf88af.

As a result of the bug there was a timing window where callout_reset()
would fail to cancel a concurrent execution of a callout that is about
to start and would schedule the callout again.
The callout would fire more times than it is scheduled.
That would happen even if the callout is initialized with a lock.

For example, the bug triggered the "Stray timeout" assertion in
taskqueue_timeout_func().

MFC after:	5 days
2015-09-01 09:27:14 +00:00
Konstantin Belousov
8d830e02f5 Use P1B_PRIO_MAX to designate max posix priority for the RR/FIFO
scheduler types.  It was intended to be used there, compare with the
min value, and with the test for correctness in ksched_setscheduler().

Note that P1B_PRIO_MAX and RTP_PRIO_MAX do have the same numerical
values, the change is cosmetical.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-08-30 18:02:57 +00:00