Commit Graph

14297 Commits

Author SHA1 Message Date
Davide Italiano
a76d4388e1 Don't access sockbuf fields directly, use accessor functions instead.
It is safe to move the call to socantsendmore_locked() after
sbdrop_locked() as long as we hold the sockbuf lock across the two
calls.

CR:	D1805
Reviewed by:	adrian, kmacy, julian, rwatson
2015-02-14 20:00:57 +00:00
John Baldwin
bc411bc2d0 Include OBJT_PHYS VM objects in ELF core dumps. In particular this
includes the shared page allowing debuggers to use the signal trampoline
code to identify signal frames in core dumps.

Differential Revision:	https://reviews.freebsd.org/D1828
Reviewed by:	alc, kib
MFC after:	1 week
2015-02-14 17:12:31 +00:00
John Baldwin
1b76e0b732 Add two new counters for vnode life cycle events:
- vfs.recycles counts the number of vnodes forcefully recycled to avoid
  exceeding kern.maxvnodes.
- vfs.vnodes_created counts the number of vnodes created by successful
  calls to getnewvnode().

Differential Revision:	https://reviews.freebsd.org/D1671
Reviewed by:	kib
MFC after:	1 week
2015-02-14 17:02:51 +00:00
Alan Cox
c2d5d3ee0e Preset the object's color, or alignment, to maximize superpage usage.
MFC after:	5 days
2015-02-13 19:58:53 +00:00
Randall Stewart
66525b2d16 This fixes a bug I in-advertantly inserted when I updated the callout
code in my last commit. The cc_exec_next is used to track the next
when a direct call is being made from callout. It is *never* used
in the in-direct method. When macro-izing I made it so that it
would separate out direct/vs/non-direct. This is incorrect and can
cause panics as Peter Holm has found for me (Thanks so much Peter for
all your help in this). What this change does is restore that behavior
but also get rid of the cc_next from the array and instead make it
be part of the base callout structure. This way no one else will get
confused since we will never use it for non-direct.

Reviewed by:	Peter Holm and more importantly tested by him ;-)
MFC after:	3 days.
Sponsored by:	Netflix Inc.
2015-02-12 13:31:08 +00:00
Rui Paulo
b5263b26db Remove check against NULL after M_WAITOK.
Submitted by:	Oliver Pinter
2015-02-11 19:07:05 +00:00
Rui Paulo
6fbc0f7d98 Restore the data array in coredump(), but use a different style to
calculate the length.

Requested by:	kib
2015-02-11 00:58:15 +00:00
Rui Paulo
624157bb5e Remove a printf and an strlen() from the coredump code. 2015-02-10 18:35:46 +00:00
Konstantin Belousov
5d6f5b24ca Mountd iterating over the mount points may race with the parallel
unmount, which causes error from nmount(2) call when performing
MNT_DELEXPORT over the directory which ceased to be a mount point.

The race is legitimate and innocent, but results in the chatty mountd.
Silence it by providing an distinguished error code for the situation,
and ignoring the error in mountd loop.

Based on the patch by:	Andreas Longwitz <longwitz@incore.de>
Prodded and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-02-10 18:00:32 +00:00
Rui Paulo
eb6368d4f8 Sanitise the coredump file names sent to devd.
While there, add a sysctl to turn this feature off as requested by
kib@.
2015-02-10 04:34:39 +00:00
Rui Paulo
842ab62b05 Notify devd(8) when a process crashed.
This change implements a notification (via devctl) to userland when
the kernel produces coredumps after a process has crashed.
devd can then run a specific command to produce a human readable crash
report.  The command is most usually a helper that runs gdb/lldb
commands on the file/coredump pair.  It's possible to use this
functionality for implementing automatic generation of crash reports.

devd(8) will be notified of the full path of the binary that crashed and
the full path of the coredump file.
2015-02-09 23:13:50 +00:00
Randall Stewart
d2854fa488 This fixes two conditions that can incur when migration
is being done in the callout code and harmonizes the macro
use.:
1) The callout_active() will lie. Basically if a migration
   is occuring and the callout is about to expire and the
   migration has been deferred, the callout_active will no
   longer return true until after the migration. This confuses
   and breaks callers that are doing callout_init(&c, 1); such
   as TCP.
2) The migration code had a bug in it where when migrating, if
   a two calls to callout_reset came in and they both collided with
   the callout on the wheel about to run, then the second call to
   callout_reset would corrupt the list the callout wheel uses
   putting the callout thread into a endless loop.
3) Per imp, I have fixed all the macro occurance in the code that
   were for the most part being ignored.

Phabricator D1711 and looked at by lstewart and jhb and sbruno.
Reviewed by:	kostikbel, imp, adrian, hselasky
MFC after:	3 days
Sponsored by:	Netflix Inc.
2015-02-09 19:19:44 +00:00
Alan Cox
f4c6aea395 Preset the object's color, or alignment, to maximize superpage usage.
MFC after:	5 days
2015-02-08 21:00:51 +00:00
John Baldwin
64de80195b Add a new device control utility for new-bus devices called devctl. This
allows the user to request administrative changes to individual devices
such as attach or detaching drivers or disabling and re-enabling devices.
- Add a new /dev/devctl2 character device which uses ioctls for device
  requests.  The ioctls use a common 'struct devreq' which is somewhat
  similar to 'struct ifreq'.
- The ioctls identify the device to operate on via a string.  This
  string can either by the device's name, or it can be a bus-specific
  address.  (For unattached devices, a bus address is the only way to
  locate a device.)  Bus drivers register an eventhandler to claim
  unrecognized device names that the driver recognizes as a valid address.
  Two buses currently support addresses: ACPI recognizes any device
  in the ACPI namespace via its full path starting with "\" and
  the PCI bus driver recognizes an address specification of
  'pci[<domain>:]<bus>:<slot>:<func>' (identical to the PCI selector
  strings supported by pciconf).
- To make it easier to cut and paste, change the PnP location string
  in the PCI bus driver to output a full PCI selector string rather
  than 'slot=<slot> function=<func>'.
- Add a devctl(3) interface in libdevctl which provides a wrapper around
  the ioctls and is the preferred interface for other userland code.
- Add a devctl(8) program which is a simple wrapper around the requests
  supported by devctl(3).
- Add a device_is_suspended() function to check DF_SUSPENDED.
- Add a resource_unset_value() function that can be used to remove a
  hint from the kernel environment.  This is used to clear a
  hint.<driver>.<unit>.disabled hint when re-enabling a boot-time
  disabled device.

Reviewed by:	imp (parts)
Requested by:	imp (changing PCI location string)
Relnotes:	yes
2015-02-06 16:09:01 +00:00
John Baldwin
94f0eafcd2 Expose the constants for internal new-bus device flags to userland. The
flag value is already exposed via dv_flags, just not the meaning of the
flags themselves.  Use these constants to annotate devices that are
disabled or suspended in devinfo output.
2015-02-05 22:42:44 +00:00
John Baldwin
a1324315e3 Set and clear the DF_SUSPENDED flag on the child device being manipulated
rather than on the parent.
2015-02-05 22:24:22 +00:00
John-Mark Gurney
9541307fb7 turn GEOM_UNCOMPRESS_DEBUG into a proper option so it can be specified
in kernel config files..

put VERBOSE_SYSINIT in it's own option header so the one file,
init_main.c, can use it instead of requiring an entire kernel recompile
to change one file..
2015-02-05 07:51:38 +00:00
Peter Wemm
0c56c4f1ab Initialize ticks so that it wraps 10 minutes after boot to increase the
chances of finding problems related to wraparound sooner.

This comes from P4 change 167856 on 2009/08/26 around when we had problems
with the TCP stack with ticks after 24 days of uptime.
2015-02-05 01:43:21 +00:00
Konstantin Belousov
6e3bf5392d Add ddb command 'show clocksource' to display state of the per-cpu
clock events.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-02-04 14:49:47 +00:00
Konstantin Belousov
ff5ba73987 Fix use after free in pipe_dtor(). PIPE_NAMED flag must be tested
before pipeclose() is called, since for !PIPE_NAMED case, when peer is
already closed, the pipe pair memory is freed.

Submitted by:	luke.tw@gmail.com
PR:	197246
Tested by:	pho
MFC after:	3 days
2015-02-03 10:29:40 +00:00
Konstantin Belousov
6690381ef1 The dependency chain for priority-inheritance mutexes could be
subverted by userspace into cycle.  Both umtx_propagate_priority() and
umtx_repropagate_priority() would then loop infinitely, owning the
spinlock.

Check for the cycle using standard Floyd' algorithm before doing the
pass in the affected functions.  Add simple check for condition of
tricking the thread into a wait for itself, which could be easily
simulated by usermode without race.

Found by:	Eric van Gyzen <eric@vangyzen.net>
In collaboration with:	Eric van Gyzen <eric@vangyzen.net>
Tested by:	pho
MFC after:	1 week
2015-01-31 12:27:40 +00:00
Jamie Gritton
464aad1407 Add allow.mount.fdescfs jail flag.
PR:		192951
Submitted by:	ruben@verweg.com
MFC after:	3 days
2015-01-28 21:08:09 +00:00
John Baldwin
4f621933a5 Fix a couple of panics when detaching from a cxgbe/cxl interface that was
never brought up:
- Allow NULL to be passed to sglist_free().
- Don't try to stop an interface that was never fully initialized.

Reviewed by:	np
2015-01-26 16:26:28 +00:00
Adrian Chadd
9500dd9f0b Call WITNESS_WARN() in callout_drain() to check whether any locks are
being held before sleeping.

This has bitten me (in ath(4)) once before and I'd like to see this
not bite anyone else.

Differential Revision:	D1638
Reviewed by:	jhb, hselasky
MFC after:	1 week
2015-01-26 04:04:57 +00:00
John Baldwin
a77a12340a Change the default VFS timestamp precision from seconds to microseconds.
Discussed on:	arch@
MFC after:	2 weeks
2015-01-25 19:56:45 +00:00
Jilles Tjoelker
2b35e6a9f2 Run make sysent. 2015-01-23 21:08:24 +00:00
Jilles Tjoelker
2205e0d1bd Add futimens and utimensat system calls.
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision:	https://reviews.freebsd.org/D1426
Submitted by:	pluknet (partially)
Reviewed by:	delphij, pluknet, rwatson
Relnotes:	yes
2015-01-23 21:07:08 +00:00
Alexey Dokuchaev
c5f282daad Fix usage example in kvprintf(9) and its copy in libstand(3): trailing '\n'
in bitfield argument is wrong, as it will be treated as bit 10, causing any
code printing >=10 bits with bit 10 on as having a trailing comma.

Newline (intended one) should be part of the format string (already present
in the examples).

Also fix grammar and kill EOL whitespace in comment while here.

PR:		195005
Approved by:	bdrewery
2015-01-23 07:30:57 +00:00
Hans Petter Selasky
a115fb62ed Revert for r277213:
FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.

Bump the __FreeBSD_version instead of reverting it.

Suggested by:		kmacy, adrian, glebius and kib
Differential Revision:	https://reviews.freebsd.org/D1438
2015-01-22 11:12:42 +00:00
Mateusz Guzik
5e7cd3ec22 filedesc: avoid spurious copying of capabilities in fget_unlocked
We obtain a stable copy and store it in local 'fde' variable. Storing another
copy (based on aforementioned variable) does not serve any purpose.

No functional changes.
2015-01-21 18:32:53 +00:00
Mateusz Guzik
f9051b0e02 filedesc: return 0 from badfo_close
The only potential in-tree consumer (_fdrop) special-cased it and returns 0
0 on its own instead of calling badfo_close.

Remove the special case since it is not needed and very unlikely to encounter
anyway.

No objections from:	kib
2015-01-21 18:05:42 +00:00
Mateusz Guzik
5751146497 filedesc: fix whitespace nits in fget and fget_read
No functional changes.
2015-01-21 18:02:28 +00:00
Konstantin Belousov
fe63170115 Do not assert that the new pipepair mutex is not initialized. The
backing memory contains garbage and might trigger the assertion.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-21 16:32:54 +00:00
Mateusz Guzik
c31c057957 filedesc: plug a test for impossible condition in _fget 2015-01-21 01:06:14 +00:00
Neel Natu
d1b1b60065 Update the vdso timehands only via tc_windup().
Prior to this change CLOCK_MONOTONIC could go backwards when the timecounter
hardware was changed via 'sysctl kern.timecounter.hardware'. This happened
because the vdso timehands update was missing the special treatment in
tc_windup() when changing timecounters.

Reviewed by:	kib
2015-01-20 03:54:30 +00:00
Konstantin Belousov
3b50dff506 Stop enforcing additional reference on all cdevs, which was introduced
in r277199.  Acquire the neccessary reference in delist_dev_locked()
and inform destroy_devl() about it using CDP_UNREF_DTR flag.

Fix some style nits, add asserts.

Discussed with:	hselasky
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-19 17:36:52 +00:00
Konstantin Belousov
677258f7e7 Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process.  Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with:	jilles, rwatson
Man page fixes by:	rwatson
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-18 15:13:11 +00:00
Konstantin Belousov
e3612a4c1f Make SIGSTOP working for sleeps done while waiting for fifo readers or
writers in open(2), when the fifo is located on an NFS mount.

Reported by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-18 15:03:26 +00:00
Konstantin Belousov
271ab2406f For sigaction(2), ignore possible garbage in sa_flags for sa_handler
== SIG_DFL or SIG_IGN.  Sloppy code does not fully initialize struct
sigaction for such cases, and being too demanding in the case of
default handler does not catch anything.

Reported and tested by:	Alex Tutubalin <lexa@lexa.ru>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-16 07:06:58 +00:00
Hans Petter Selasky
1a26c3c047 Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
  CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
  spinlocks.
- Switching the callout CPU number cannot always be done on a
  per-callout basis. See the updated timeout(9) manual page for more
  information.
- The timeout(9) manual page has been updated to reflect how all the
  functions inside the callout API are working. The manual page has
  been made function oriented to make it easier to deduce how each of
  the functions making up the callout API are working without having
  to first read the whole manual page. Group all functions into a
  handful of sections which should give a quick top-level overview
  when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
  to reduce the complexity in the callout code and to avoid problems
  about atomically stopping callouts via callout_stop(). If someone
  needs it, it can be re-added. From my quick grep there are no
  CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
  added. See the updated timeout(9) manual page for a complete
  description.
- Update the callout clients in the "kern/" folder to use the callout
  API properly, like cv_timedwait(). Previously there was some custom
  sleepqueue code in the callout subsystem, which has been removed,
  because we now allow callouts to be protected by spinlocks. This
  allows us to tear down the callout like done with regular mutexes,
  and a "td_slpmutex" has been added to "struct thread" to atomically
  teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
  "SWT_SLEEPQTIMO" states can now be completely removed. Currently
  they are marked as available and will be cleaned up in a follow up
  commit.
- Bump the __FreeBSD_version to indicate kernel modules need
  recompilation.
- There has been several reports that this patch "seems to squash a
  serious bug leading to a callout timeout and panic".

Kernel build testing:	all architectures were built
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D1438
Sponsored by:		Mellanox Technologies
Reviewed by:		jhb, adrian, sbruno and emaste
2015-01-15 15:32:30 +00:00
Robert Watson
3d1a9ed34e In order to support ongoing work to implement variable-size mbufs, and
more generally make it easier to extend 'struct mbuf in the future', make
a number of changes to the data structure:

- As we anticipate embedding mbufs headers within variable-size regions of
  memory in the future, change the definitions of byte arrays embedded in
  mbufs to be of size [0] rather than [MLEN] and [MHLEN].  In fact, the
  cxgbe driver already uses 'struct mbuf' on the front of other storage
  sizes, but we would like the global mbuf allocator do be able to do this
  as well.

- Fold 'struct m_hdr' into 'struct mbuf' itself, eliminating a set of
  macros that aliased 'mh_foo' field names to 'm_foo' names such as
  'm_next'.  These present a particular problem as we would like to add
  new mbuf-header fields -- e.g., 'm_size' -- that, if similarly named via
  macros, would introduce collisions with many other variable names in the
  kernel.

- Rename 'struct m_ext' to 'struct struct_m_ext' so that we can add
  compile-time assertions without bumping into the still-extant 'm_ext'
  macro.

- Remove the MSIZE compile-time assertion for 'struct mbuf', but add new
  assertions for alignment of embedded data arrays (64-bit alignment even
  on 32-bit platforms), and for the sizes the mbuf header, packet header,
  and m_ext structure.

- Document that these assertions exist in comments in mbuf.h.

This change is not intended to cause (non-trivial) behavioural
differences, but is a precursor to further mbuf-allocator work.

Differential Revision:	https://reviews.freebsd.org/D1483
Reviewed by:	bz, gnn, np, glebius ("go ahead, I trust you")
Sponsored by:	EMC / Isilon Storage Division
2015-01-14 23:44:00 +00:00
Hans Petter Selasky
d2955419cd Avoid race with "dev_rel()" when using the recently added
"delist_dev()" function. Make sure the character device structure
doesn't go away until the end of the "destroy_dev()" function due to
concurrently running cleanup code inside "devfs_populate()".

MFC after:	1 week
Reported by:	dchagin@
2015-01-14 22:07:13 +00:00
Hans Petter Selasky
07dbde6777 Add a kernel function to delist our kernel character devices, so that
the device name can be re-used right away in case we are destroying
the character devices in the background.

MFC after:	4 days
Reported by:	dchagin@
2015-01-14 14:04:29 +00:00
Jamie Gritton
6a3f277901 Remove the prison flags PR_IP4_DISABLE and PR_IP6_DISABLE, which have been
write-only for as long as they've existed.
2015-01-14 04:50:28 +00:00
Jamie Gritton
0e5e396ede Don't set prison's pr_ip4s or pr_ip6s to -1.
PR:		196474
MFC after:	3 days
2015-01-14 03:52:41 +00:00
Konstantin Belousov
18cc2ff047 Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 08:58:07 +00:00
Ian Lepore
f8f02928d3 Fix an off by one in ppsratecheck(). If you asked for N=1 you'd get one,
but for any N>1 you'd get N-1 packets/events per second.
2015-01-11 20:48:29 +00:00
Robert Watson
a5e9a6d56f Garbage collect m_copymdata(), an mbuf utility routine introduced
in FreeBSD 7 that has not been used since.  It contains a number
of unresolved bugs including an inverted bcopy() and incorrect
handling of read-only mbufs using internal storage.  Removing this
unused code is substantially essier than fixing it in order to
update it to the coming mbuf world order -- but it can always be
restored from revision history if it turns out to prove useful for
future work.

Pointed out by:	jmallett
Sponsored by:	EMC / Isilon Storage Division
2015-01-10 10:41:23 +00:00
Dmitry Chagin
47ce3e5326 Allow clock_getcpuclockid() on the CPU-time clock for zombie process.
Posix does not prohibit this.

Differential Revision:	https://reviews.freebsd.org/D1470

Reviewed by:	kib
MFC after:	1 week
2015-01-10 07:22:38 +00:00
Xin LI
6e19f0def0 Improve style and fix a possible use-after-free case introduced in r268384
by reinitializing the 'freestate' pointer after freeing the memory.

Obtained from:	HardenedBSD (71fab80c5dd3034b71a29a61064625018671bbeb)
PR:		194525
Submitted by:	Oliver Pinter <oliver.pinter@hardenedbsd.org>
MFC after:	2 weeks
2015-01-10 06:48:35 +00:00
Robert Watson
3df42f654e Remove a 'This is dumb' comment that has been incorrect for at least a
decade: m_pulldown() is willing to consider ordinary mbufs writable.
Retain another, related, and also outdated comment, but with a caveat
that it is partially stale.  Do not, for now, address the problem that
it raises (that only EXT_CLUSTER external storage is considered
writable, regardless of the results of M_WRITABLE() on the mbuf).

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2015-01-09 12:08:51 +00:00
John Baldwin
7a635f0d70 Change the default method for device_quiesce() to return 0 instead of
EOPNOTSUPP.  The current behavior can mask real quiesce errors since
devclass_quiesce_driver() stops iterating over drivers as soon as it
gets an error (incluiding EOPNOTSUPP), but the caller it returns the
error to explicitly ignores EOPNOTSUPP.

Reviewed by:	imp
2015-01-08 21:46:28 +00:00
John Baldwin
bbf686ed60 Reject attempts to read the cpuset mask of a negative domain ID. 2015-01-08 19:11:14 +00:00
John Baldwin
c0ae66888b Create a cpuset mask for each NUMA domain that is available in the
kernel via the global cpuset_domain[] array. To export these to userland,
add a CPU_WHICH_DOMAIN level that can be used to fetch the mask for a
specific domain. Add a -d flag to cpuset(1) that can be used to fetch
the mask for a given domain.

Differential Revision:	https://reviews.freebsd.org/D1232
Submitted by:	jeff (kernel bits)
Reviewed by:	adrian, jeff
2015-01-08 15:53:13 +00:00
Robert Watson
b66f2a48e6 Replace hand-crafted versions of M_SIZE() and M_START() in uipc_mbuf.c
with calls to the centralised macros, reducing direct use of MLEN and
MHLEN.

Differential Revision:	https://reviews.freebsd.org/D1444
Reviewed by:	bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-08 11:16:21 +00:00
Mark Johnston
bdb9ab0dd9 Factor out duplicated code from dumpsys() on each architecture into generic
code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly
identical and simply redefine a number of constants and helper subroutines;
a generic implementation will make it easier to implement features around
kernel core dumps. This change does not alter any minidump code and should
have no functional impact.

PR:		193873
Differential Revision:	https://reviews.freebsd.org/D904
Submitted by:	Conrad Meyer <conrad.meyer@isilon.com>
Reviewed by:	jhibbits (earlier version)
Sponsored by:	EMC / Isilon Storage Division
2015-01-07 01:01:39 +00:00
Mark Johnston
bbd685e3a5 Use crcopysafe(9) to make a copy of a process' credential struct. crcopy(9)
may perform a blocking memory allocation, which is unsafe when holding a
mutex.

Differential Revision:	https://reviews.freebsd.org/D1443
Reviewed by:	rwatson
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 23:07:22 +00:00
John Baldwin
531d65e139 Trim trailing whitespace. 2015-01-05 20:50:44 +00:00
John Baldwin
92597e064b On some Intel CPUs with a P-state but not C-state invariant TSC the TSC
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST).  Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.

PR:		192316
Differential Revision:	https://reviews.freebsd.org/D1441
No objection from:	jkim
MFC after:	1 week
2015-01-05 20:44:44 +00:00
Robert Watson
ed6a66ca6c To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision:	https://reviews.freebsd.org/D1436
Reviewed by:	glebius, trasz, gnn, bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 09:58:32 +00:00
Justin T. Gibbs
5b326a32ce Prevent live-lock and access of destroyed data in taskqueue_drain_all().
Phabric:	https://reviews.freebsd.org/D1247
Reviewed by:	jhb, avg
Sponsored by:	Spectra Logic Corporation

sys/kern_subr_taskqueue.c:
	Modify taskqueue_drain_all() processing to use a temporary
	"barrier task", rather than rely on a user task that may
	be destroyed during taskqueue_drain_all()'s execution.  The
	barrier task is queued behind all previously queued tasks
	and then has its priority elevated so that future tasks
	cannot pass it in the queue.

	Use a similar barrier scheme to drain threads processing
	current tasks.  This requires taskqueue_run_locked() to
	insert and remove the taskqueue_busy object for the running
	thread for every task processed.

share/man/man9/taskqueue.9:
	Remove warning about live-lock issues with taskqueue_drain_all()
	and indicate that it does not wait for tasks queued after
	it begins processing.
2015-01-04 19:55:44 +00:00
Dmitry Chagin
1beb1a8e13 Regen for r276654 (__getcwd()). 2015-01-04 10:40:23 +00:00
Dmitry Chagin
9f7a06f27e Indeed, instead of hiding the kern___getcwd() bug by bogus cast
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.

These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.

While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.

Pointed out by:	bde

MFC after:	1 week
2015-01-04 10:34:02 +00:00
Hans Petter Selasky
04a8159ddf Rework r276532 a bit. Always avoid recursing into the console drivers
clients, hence they might not handle it very well. This change allows
debugging mutex problems with kernel console drivers when
"debug.witness.skipspin=0" is set in the boot environment.

MFC after:	1 week
2015-01-03 17:21:19 +00:00
Hans Petter Selasky
2029b6c9e6 The "cnputs_mtx" mutex must be allowed to recurse. Debug prints and/or
witness printouts in the console driver clients can cause this mutex
to recurse by calls to "printf()" from witness for example. In
particular this can happen if "debug.witness.skipspin=0" is set in the
boot environment.

MFC after:	1 week
2015-01-02 13:10:33 +00:00
Mateusz Guzik
af77c1a620 Convert vfs hash lock from a mutex to an rwlock. 2014-12-30 21:40:45 +00:00
Warner Losh
12d7eaa009 Turns out, this isn't only called from i386... 2014-12-30 02:39:47 +00:00
Mateusz Guzik
84267cacf4 sysctl: don't modify oid_running for static nodes
It is necessary to prevent nodes from being destroyed while used, but static
ones cannot be destroyed.
2014-12-28 19:24:01 +00:00
Rick Macklem
46b34e5adc Fix the comment introduced in r276192 so that it clearly
states that the change is needed to avoid a deadlock.

Suggested by:	kib
MFC after:	1 week
2014-12-25 14:44:04 +00:00
Rick Macklem
65cb225c5e Modify vop_stdadvlock{async}() so that it only
locks/unlocks the vnode and does a VOP_GETATTR()
for the SEEK_END case. This is safe to do, since
lf_advlock{async}() only uses the size argument
for the SEEK_END case.
The NFSv4 server needs this when
vfs.nfsd.enable_locallocks!=0 since locking the
vnode results in a LOR that can cause a deadlock
for the nfsd threads.

Reviewed by:	kib
MFC after:	1 week
2014-12-24 22:58:08 +00:00
Gleb Smirnoff
53b680caa2 In sbappend*() family of functions clear M_PROTO flags of incoming
mbufs. sbappendstream() already does this in m_demote().

PR:		196174
Sponsored by:	Nginx, Inc.
2014-12-22 15:39:24 +00:00
Konstantin Belousov
8ee9765a9d Add VN_OPEN_NAMECACHE flag for vn_open_cred(9), which requests that
the created file name was cached.  Use the flag for core dumps.

Requested by:	rpaulo
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-21 13:32:07 +00:00
Warner Losh
61f26cae7d Where appropriate, use the modern terms for the one true time base
(UTC) rather than the archaic (GMT) in comments. Except where the
comments are making fun of people doing this (and pedants who insist
on the new terms).
2014-12-21 05:07:11 +00:00
Gleb Smirnoff
e834a84026 Revert r274494, r274712, r275955 and provide extra comments explaining
why there could appear a zero-sized mbufs in socket buffers.

A proper fix would be to divorce record socket buffers and stream
socket buffers, and divorce pru_send that accepts normal data from
pru_send that accepts control data.
2014-12-20 22:12:04 +00:00
Gleb Smirnoff
b7413232f6 Add to sbappendstream_locked() a check against NULL mbuf, like it is done
in sbappend_locked() and sbappendrecord_locked().

This is a quick fix to the panic introduced by r274712.

A proper solution should be to make sosend_generic() avoid calling
pru_send() with NULL mbuf for the protocols that do not understand
control messages. Those protocols that understand control messages,
should be able to receive NULL mbuf, if control is non-NULL.
2014-12-20 14:19:46 +00:00
Konstantin Belousov
6c21f6edb8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00
Gleb Kurtsou
dde58752db Adjust printf format specifiers for dev_t and ino_t in kernel.
ino_t and dev_t are about to become uint64_t.

Reviewed by:	kib, mckusick
2014-12-17 07:27:19 +00:00
Konstantin Belousov
5b73811feb Add missed break.
CID:	1258587
Sponsored by:	The FreeBSD Foundation
MFC after:	20 days
2014-12-16 09:49:07 +00:00
Konstantin Belousov
917dd39084 Add missed break.
CID:	1258586
Sponsored by:	The FreeBSD Foundation
MFC after:	4 days
2014-12-16 09:48:23 +00:00
John Baldwin
5ad25ceb41 Check for SS_NBIO in so->so_state instead of sb->sb_flags in
soreceive_stream().

Differential Revision:	https://reviews.freebsd.org/D1299
Reviewed by:	bz, gnn
MFC after:	1 week
2014-12-15 17:52:08 +00:00
Konstantin Belousov
237623b028 Add a facility for non-init process to declare itself the reaper of
the orphaned descendants.  Base of the API is modelled after the same
feature from the DragonFlyBSD.

Requested by:	bapt
Reviewed by:	jilles (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2014-12-15 12:01:42 +00:00
Konstantin Belousov
83d7eb544c Fix gcc build.
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2014-12-14 08:43:13 +00:00
Dmitry Chagin
fd07ddcf6f Add _NEW flag to mtx(9), sx(9), rmlock(9) and rwlock(9).
A _NEW flag passed to _init_flags() to avoid check for double-init.

Differential Revision:	https://reviews.freebsd.org/D1208
Reviewed by:	jhb, wblock
MFC after:	1 Month
2014-12-13 21:00:10 +00:00
Konstantin Belousov
6ddcc23386 Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC.  SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process.  Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume.  It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with:	pho
Discussed with:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-13 16:18:29 +00:00
Konstantin Belousov
0061ddb3ed Only sleep interruptible while waiting for suspension end when
filesystem specified VFCF_SBDRY flag, i.e. for NFS.

There are two issues with the sleeps.  First, applications may get
unexpected EINTR from the disk i/o syscalls.  Second, interruptible
sleep allows the stop of the process, and since mount point is
referenced while thread sleeps, unmount cannot free mount point
structure' memory, blocking unmount indefinitely.

Even for NFS, it is probably only reasonable to enable PCATCH for intr
mounts, but this information is currently not available at VFS level.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-13 16:07:01 +00:00
Konstantin Belousov
ea117d1735 The vinactive() call in vgonel() may start writes for the dirty pages,
creating delayed write buffers belonging to the reclaimed vnode.  Put
the buffer cleanup code after inactivation.

Add asserts that ensure that buffer queues are empty and add BO_DEAD
flag for bufobj to check that no buffers are added after the cleanup.
BO_DEAD is only used by INVARIANTS-enabled kernels.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-13 16:02:37 +00:00
Konstantin Belousov
fe21241ee0 For architectures where time_t is wide enough, in particular, 64bit
platforms, avoid overflow after year 2038 in clock_ct_to_ts().

PR:	195868
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-12 09:37:18 +00:00
Konstantin Belousov
b2344ab5ff Do not call VFS_SYNC() before VFS_UNMOUNT() for forced unmount.
Since VFS does not/cannot stop writes, sync might run indefinitely, or
be a wrong thing to do at all.  E. g. NFS ignores VFS_SYNC() for
forced unmounts, since non-responding server does not allow sync to
finish.  On the other hand, filesystems can and do stop writes using
fs-specific facilities, and should already fully flush caches in
VFS_UNMOUNT() due to the race.

Adjust msdosfs tp sync in unmount for forced call, to accomodate the
new behaviour.  Note that it is still racy, since writes are not
stopped.

Discussed with:	avg, bjk, mckusick
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2014-12-09 10:00:47 +00:00
Konstantin Belousov
a77c72f5ae Apply chunk forgotten in r275620. Remove local variable for real.
CID:	1257462
Sponsored by:	The FreeBSD Foundation
2014-12-09 09:36:28 +00:00
Konstantin Belousov
a25100c539 Add functions syncer_suspend() and syncer_resume(), which are supposed
to be called before suspension and after resume, correspondingly.  The
syncer_suspend() ensures that all filesystems dirty data and metadata
are saved to the permanent storage, and stops kernel threads which
might modify filesystems.  The syncer_resume() restores stopped
threads.

For now, only syncer is stopped.  This is needed, because each sync
loop causes superblock updates for UFS.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:48:57 +00:00
Konstantin Belousov
904ed548bb When getnewbuf_reuse_bp() is called to reclaim some (clean) buffer,
the vnode owning the buffer is not locked.  More, it cannot be locked
safely, since getnewbuf_reuse_bp() is called from newbuf(), and some
other vnode is already locked, for which reused buffer will be
reassigned.

As the consequence, reclamation of the owning vnode could go in
parallel, in particular, the call to vnode_destroy_vobject(), which
deallocates the vm object and zeroes the v_bufobj->bo_object.  Note
that the pages wired by the buffer are left wired and can be safely
freed by the vfs_vmio_release() without the need for the vm object
lock.  Also, seeing stale pointer to the v_object is safe due to vm
object type stability.

Check for bo_bufobj != NULL and cache the value in local variable to
avoid trying to lock NULL vm object.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:42:34 +00:00
Konstantin Belousov
07a9368a48 Do some refactoring and minor cleanups of the thread_single() code in
preparation for the global stop commit.

Move the code to weed suspended or sleeping threads into the
appropriate state, into the helper weed_inhib().  Current code already
has deep nesting and hard to follow [1].

Add currently useless helper remain_for_mode(), which returns the
count of threads which are allowed to run, according to the
single-threading mode.

In thread_single_end(), do not save curthread into local variable, it
is unused after, except to find curproc.

Remove stray empty line.

Requested by:	avg [1]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:27:43 +00:00
Konstantin Belousov
8638fe7bea Thread waiting for the vfork(2)-ed child to exec or exit, must allow
for the suspension.

Currently, the loop performs uninterruptible cv_wait(9) call, which
prevents suspension until child allows further execution of parent.
If child is stopped, suspension or single-threading is delayed
indefinitely.

Create a helper thread_suspend_check_needed() to identify the need for
a call to thread_suspend_check().  It is required since call to the
thread_suspend_check() cannot be safely done while owning the child
(p2) process lock.  Only when suspension is needed, drop p2 lock and
call thread_suspend_check().  Perform wait for cv with timeout, in
case suspend is requested after wait started; I do not see a better
way to interrupt the wait.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:18:05 +00:00
Konstantin Belousov
aba1ca528e When process is exiting, check for suspension regardless of
multithreaded status of the process.

The stopped state must be cleared before P_WEXIT is set.  A stop
signal delivered just before first PROC_LOCK() block in exit1(9) would
put the process into pending stop with P_WEXIT set or assertion
triggered.  Also recheck for the suspension after failed
thread_single(9) call, since process lock could be dropped.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:02:02 +00:00
Andriy Gapon
036a8c5dac remove opensolaris cyclic code, replace with high-precision callouts
In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior.  Now we can directly use callout(9) in the very few
places where cyclic was used.

Differential Revision:	https://reviews.freebsd.org/D1161
Reviewed by:	gnn, jhb, markj
MFC after:	2 weeks
2014-12-07 11:21:41 +00:00
Warner Losh
d0b6da086f Const poison in a few places to ensure we don't modify things
through the module data pointer.
2014-12-03 22:14:13 +00:00
John Baldwin
b10c08a52b Revert device_getenv_int() for now as it duplicates resource_int_value().
We should perhaps implement a device_getenv_*() and device_setenv_*() API
as a convenience wrapper on top of resource_*_value() and resource_set_*().
2014-12-03 15:29:53 +00:00
Konstantin Belousov
6afb32fc67 Disable recursion for the process spinlock.
Tested by:	pho
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2014-12-01 17:36:10 +00:00
Justin T. Gibbs
2c6bf3d90b Remove trailing whitespace. 2014-11-30 19:32:00 +00:00
Gleb Smirnoff
c80ea19b38 Merge from projects/sendfile:
Provide pru_ready for AF_LOCAL sockets.  Local sockets sendsdata directly
to the receive buffer of the peer, thus pru_ready also works on the peer
socket.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-30 13:40:58 +00:00
Gleb Smirnoff
651e4e6a30 Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:24:21 +00:00
Gleb Smirnoff
0f9d0a73a4 Merge from projects/sendfile:
o Introduce a notion of "not ready" mbufs in socket buffers.  These
mbufs are now being populated by some I/O in background and are
referenced outside.  This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
  freed.

o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
  The sb_ccc stands for ""claimed character count", or "committed
  character count".  And the sb_acc is "available character count".
  Consumers of socket buffer API shouldn't already access them directly,
  but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
  with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
  search.
o New function sbready() is provided to activate certain amount of mbufs
  in a socket buffer.

A special note on SCTP:
  SCTP has its own sockbufs.  Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs.  Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them.  The new
notion of "not ready" data isn't supported by SCTP.  Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
  A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does.  This was discussed with rrs@.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-30 12:52:33 +00:00
Gleb Smirnoff
57f43a45a3 - Move sbcheck() declaration under SOCKBUF_DEBUG.
- Improve SOCKBUF_DEBUG macros.
- Improve sbcheck().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-30 11:22:39 +00:00
Gleb Smirnoff
8967b220a3 Make sballoc() and sbfree() functions. Ideally, they could be marked
as static, but unfortunately Infiniband (ab)uses them.

Sponsored by:	Nginx, Inc.
2014-11-30 11:02:07 +00:00
Warner Losh
fac92ae126 The current limit of 100k for the linker hints file is getting a bit
crowded as we now are at about 70k. Bump the limit to 1MB instead
which is still quite a reasonable limit and allows for future growth
of this file and possible future expansion to additional data.

MFC After: 2 weeks
2014-11-29 17:29:30 +00:00
Konstantin Belousov
6762091ea4 Remove lock recursion for the pipe pair mutex, and disable the
recursion on mutex initialization.

The only places where the recursive acquire is performed are read and
write filters, since knlist, which uses the pipe pair mutex as lock,
is locked when filter is called.

The recursion was added in r93296, and consistent locking for
kn_fop->f_event() introduced in r133741.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2014-11-29 17:18:20 +00:00
Konstantin Belousov
70778bba03 Assert the state of the process lock and sigact mutex in
kern_sigprocmask() and reschedule_signals().

Discussed with:	rea
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-28 10:20:00 +00:00
Hans Petter Selasky
50ae6690fc Style changes:
- Move two IOCTL related defines to the top of the C-file
- Add more comments describing the recently added IOCTL small size and
small align macros
2014-11-28 09:32:07 +00:00
Alfred Perlstein
56c14bca7e Make igb and ixgbe check tunables at probe time.
This allows one to make a kernel module to tune the
number of queues before the driver loads.

This is needed so that a module at SI_SUB_CPU can set
tunables for these drivers to take.  Otherwise getenv
is called too early by the TUNABLE macros.

Reviewed by: smh
Phabric: https://reviews.freebsd.org/D1149
2014-11-26 20:19:36 +00:00
Konstantin Belousov
5c7bebf961 The process spin lock currently has the following distinct uses:
- Threads lifetime cycle, in particular, counting of the threads in
  the process, and interlocking with process mutex and thread lock.
  The main reason of this is that turnstile locks are after thread
  locks, so you e.g. cannot unlock blockable mutex (think process
  mutex) while owning thread lock.

- Virtual and profiling itimers, since the timers activation is done
  from the clock interrupt context.  Replace the p_slock by p_itimmtx
  and PROC_ITIMLOCK().

- Profiling code (profil(2)), for similar reason.  Replace the p_slock
  by p_profmtx and PROC_PROFLOCK().

- Resource usage accounting.  Need for the spinlock there is subtle,
  my understanding is that spinlock blocks context switching for the
  current thread, which prevents td_runtime and similar fields from
  changing (updates are done at the mi_switch()).  Replace the p_slock
  by p_statmtx and PROC_STATLOCK().

The split is done mostly for code clarity, and should not affect
scalability.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-26 14:10:00 +00:00
Konstantin Belousov
e442f29f08 Fix SA_SIGINFO | SA_RESETHAND handling. The sysent' sv_sendsig()
method needs pre-reset state of the ps_siginfo to correctly construct
signal frame.

Move sigdflt() call after the sv_sendsig() invocation in postsig().
Simultaneously extract common code from trapsignal() and postsig()
into new helper postsig_done().

Submitted by:	rea
MFC after:	1 week
2014-11-26 14:09:04 +00:00
John Baldwin
a2d751936b Add a bus_get_domain() wrapper around BUS_GET_DOMAIN(). Use this to add
a new per-device '%domain' sysctl node that returns the NUMA domain a
device is associated with if it is associated with one.

Note that this API is still a WIP and might change before 11.0 actually
ships.

Differential Revision:	https://reviews.freebsd.org/D930
Reviewed by:	kib, adrian
2014-11-24 19:55:45 +00:00
John Baldwin
20abb66ede Properly initialize the capability rights for vnodes exported to procstat
that aren't for file descriptors (cwd, jdir, tracevp, etc.).

Submitted by:	Mikhail <mp@lenta.ru>
2014-11-24 18:34:11 +00:00
Gleb Smirnoff
90effb2341 Merge from projects/sendfile:
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
  doesn't sleep. It returns immediately, and will execute the I/O done handler
  function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
  method for vnode_pager.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-23 12:01:52 +00:00
Mateusz Guzik
dff9862c0e ifdef RACCT ui_racct_foreach and struct uidinfo's ui_racct
Change racct_ create and destroy to macros evaluating to nothing without RACCT
so that their callers passing ui_racct don't have to be ifdefed.
2014-11-23 08:25:44 +00:00
Mateusz Guzik
0c0d16e8ac filedesc: plug a test for impossible condition in fgetvp_rights 2014-11-23 00:12:27 +00:00
Konstantin Belousov
64779280c9 The size value should be asserted when it is known.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2014-11-22 18:15:02 +00:00
John Baldwin
180e57e5c7 Improve support for XSAVE with debuggers.
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
  to match what Linux does in that 1) it dumps the entire XSAVE area
  including the fxsave state, and 2) it stashes a copy of the current
  xsave mask in the unused padding between the fxsave state and the
  xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
  only the extra portion. This avoids having to always make two
  ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
  XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
  and the current XSAVE mask (%xcr0).

Differential Revision:	https://reviews.freebsd.org/D1193
Reviewed by:	kib
MFC after:	2 weeks
2014-11-21 20:53:17 +00:00
Gleb Smirnoff
67af272bcf Do not allocate zero-length mbuf in sosend_generic().
Found by:	pho
Sponsored by:	Nginx, Inc.
2014-11-19 14:27:38 +00:00
Zbigniew Bodek
dc61566f95 Stop using early_putc immediately after configuring console with cninit()
Early UART should be released right after system console initialization is
completed. Otherwise, after cninit() both early and system console coexist
what may lead to various issues (i.a. writing to unmapped early
UART address). This cannot be done in cninit_finish() since it can be
called late at the end of MI configuration.

Obtained from:   Semihalf
Reviewed by:     andrew
Sponsored by:    The FreeBSD Foundation
2014-11-19 14:23:29 +00:00
Warner Losh
40e6bdaf1e opt_global.h is included automatically in the build. No need to
explicitly include it in these places.

Sponsored by: Netflix
2014-11-18 17:06:56 +00:00
John-Mark Gurney
2c30bc1fcf prevent doing filter ops locking for staticly compiled filter ops...
This significantly reduces lock contention when adding/removing knotes
on busy multi-kq system...  Next step is to cache these references per
kq.. i.e. kq refs it once and keeps a local ref count so that the same
refs don't get accessed by many cpus...

only allocate a knote when we might use it...

Add a new flag, _FORCEONESHOT..  This allows a thread to force the
delivery of another event in a safe manner, say waking up an idle http
connection to force it to be reaped...

If we are _DISABLE'ing a knote, don't bother to call f_event on it, it's
disabled, so won't be delivered anyways..

Tested by:	adrian
2014-11-16 01:18:41 +00:00
Gleb Smirnoff
8146bcfea1 - Use NULL to compare a pointer.
- Use KASSERT() instead of panic.
- Remove useless 'continue', no need to restart cycle here.

Sponsored by:	Nginx, Inc.
2014-11-14 15:44:19 +00:00
Gleb Smirnoff
6bf6b25e88 Merge from projects/sendfile:
Use sbcut_locked() instead of manually editing a sockbuf.

Sponsored by:	Nginx, Inc.
2014-11-14 15:33:40 +00:00
Konstantin Belousov
5fab60a071 In vfs_write_suspend_umnt(), if suspension cannot be established, do
not forget to restore write ops count when returning the error.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-14 11:31:10 +00:00
Gleb Smirnoff
f274e25659 There should not be zero length mbufs in socket buffers. The code comes
from r1451, and thus can't be explained.  A patch with explicit panic()
here survived all tests.

Tested by:	pho
Sponsored by:	Nginx, Inc.
2014-11-14 06:02:29 +00:00
Jung-uk Kim
db1ec81edd Correct a typo to fix chown(2). It was broken since r274476.
Pointy hat to:	kib
X-MFC-With:	r274476
2014-11-13 23:51:13 +00:00
Mateusz Guzik
eb48fbd963 filedesc: fixup fdinit to lock fdp and preapare files conditinally
Not all consumers providing fdp to copy from want files.

Perhaps these functions should be reorganized to better express the outcome.

This fixes up panics after r273895 .

Reported by:	markj
2014-11-13 21:15:09 +00:00
Konstantin Belousov
416be7a1c6 Fix assertion, &uc->uc_busy is never zero, the intent is to test the
uc_busy value, and not its address [1].

Remove the single use of the macro, write KASSERT() explicitely in the
code of umtxq_sleep_pi().

Submitted by:	Eric van Gyzen <eric@vangyzen.net> [1]
MFC after:	1 week
2014-11-13 18:51:09 +00:00
Konstantin Belousov
6e646651d3 Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 18:01:51 +00:00
Konstantin Belousov
e64b4fa858 Do not try to dereference thread pointer when the value is not a pointer.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 17:44:35 +00:00
Konstantin Belousov
f2c1a52afb Remove fossil. It has been present in 4.4Lite2, but its use was
removed for some time.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 17:43:37 +00:00
Dmitry Chagin
c28d9d0f9f Regen for r274462. 2014-11-13 05:28:06 +00:00
Dmitry Chagin
186d9c3473 Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision:	https://reviews.freebsd.org/D1133
Reviewed by:	kib, wblock
MFC after:	1 month
2014-11-13 05:26:14 +00:00
Konstantin Belousov
389a25c716 For posix_fallocate(2) and posix_fadvise(2), return ESPIPE when
underlying file does not have DFLAG_SEEKABLE set [1].

For posix_fallocate(2), simplify error handling logic.  Do return when
fp is not yet referenced.

Noted by:	bde [1]
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-12 17:31:38 +00:00
Gleb Smirnoff
2b21d0e883 Merge from projects/sendfile:
- Use KASSERT()s instead of panic().
- Use sbavail() instead of sb_cc.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-12 10:17:46 +00:00
Gleb Smirnoff
cfa6009e36 In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-12 09:57:15 +00:00
Gleb Smirnoff
efe28398f5 Fix build. 2014-11-11 22:08:18 +00:00
Gleb Smirnoff
0e87b36eaa Remove SF_KQUEUE code. This code was developed at Netflix, but was not
ever used.  It didn't go into stable/10, neither was documented.
It might be useful, but we collectively decided to remove it, rather
leave it abandoned and unmaintained.  It is removed in one single
commit, so restoring it should be easy, if anyone wants to reopen
this idea.

Sponsored by:	Netflix
2014-11-11 20:32:46 +00:00
Pawel Jakub Dawidek
5ebb15b942 Add missing privilege check when setting the dump device. Before that change it
was possible for a regular user to setup the dump device if he had write access
to the given device. In theory it is a security issue as user might get access
to kernel's memory after provoking kernel crash, but in practise it is not
recommended to give regular users direct access to storage devices.

Rework the code so that we do privileges check within the set_dumper() function
to avoid similar problems in the future.

Discussed with:	secteam
2014-11-11 04:48:09 +00:00
Konstantin Belousov
0436fcb809 When sleeping waiting for the profiling stop, always set P_STOPPROF
before dropping process lock.  Clear P_STOPPROF when doing wakeup.

Both issues caused thread to hang in stopprofclock() "stopprof" sleep.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-10 14:11:17 +00:00
Alexander V. Chernikov
5e11eb847e Finish r274118#2: commit forgotten uipc_debug.c 2014-11-06 15:17:04 +00:00
Bjoern A. Zeeb
763f2e7844 After the changes in r274118 make NOIP kernels compile by hiding an
otherwise unused variable declaration behind INET6 || INET.

MFC after:	27 days
X-MFS with:	r274118
2014-11-06 12:19:39 +00:00
Mateusz Guzik
bfda9935bd Add sysctl kern.proc.cwd
It returns only current working directory of given process which saves a lot of
overhead over kern.proc.filedesc if given proc has a lot of open fds.

Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn> (slightly modified)
X-Additional:	JuniorJobs project
2014-11-06 08:12:34 +00:00
Mateusz Guzik
3ae366de58 filedesc: avoid taking fdesc_mtx when not necessary in fddrop
No functional changes.
2014-11-06 07:44:10 +00:00
Mateusz Guzik
eb6021fb96 filedesc: just free old tables without altering the list which is freed anyway
No functional changes.
2014-11-06 07:37:31 +00:00
Mateusz Guzik
a99500a912 Extend struct ucred with group table.
This saves one malloc + free with typical cases and better utilizes
memory.

Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn> (slightly modified)
X-Additional:	JuniorJobs project
2014-11-05 02:08:37 +00:00
Alexander V. Chernikov
9f25cbe45e Remove old hack abusing domattach from NFS code.
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.

Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.

While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-depended code, it is not possible anymore to
rely on dom* code.

While here, change terrifying "Invalid radix node head, rn:" message,
to different non-understandable "netcred already exists for given addr/mask",
but less terrifying. Since we know that rn_addaddr() returns NULL if
the same record already exists, we should provide more friendly error.

MFC after:	1 month
2014-11-05 00:58:01 +00:00
Dag-Erling Smørgrav
bccb6d5aa1 [SA-14:25] Fix kernel stack disclosure in setlogin(2) / getlogin(2).
[SA-14:26] Fix remote command execution in ftp(1).

Approved by:	so (des)
2014-11-04 23:29:29 +00:00
John Baldwin
2cba8dd301 Add a new thread state "spinning" to schedgraph and add tracepoints at the
start and stop of spinning waits in lock primitives.
2014-11-04 16:35:56 +00:00
Hans Petter Selasky
0ecd606b24 Simplify logic a bit. Ensure data buffer is properly aligned,
especially for platforms where unaligned access is not allowed. Make
it possible to override the small buffer size.

A simple continuous read string test using libusb showed a reduction
in CPU usage from roughly 10% to less than 1% using a dual-core GHz
CPU, when the malloc() operation was skipped for small buffers.

MFC after:	2 weeks
2014-11-04 11:29:49 +00:00
Jean-Sébastien Pédron
2d6f6d6373 Enable vt(4) by default
vt(4) is a new console driver which brings features such as:
    o  Support for Unicode and double-width characters
    o  Integration with the KMS kernel video drivers
    o  Support for UEFI

You may need to update your console settings in /etc/rc.conf, most
probably the keymap. During boot, /etc/rc.d/syscons will indicate what
you need to do.

vt(4) still has issues and lacks some features compared to syscons(4).
See the wiki for up-to-date information:
    https://wiki.freebsd.org/Newcons

If you want to keep using syscons(4), you can do so by adding the
following line to /boot/loader.conf:
    kern.vty=sc

Differential Revision:	https://reviews.freebsd.org/D1005
Discussed with:	emaste@, nwhitehorn@, ray@
Relnotes:	yes
2014-11-04 10:18:03 +00:00
Konstantin Belousov
74d5b4af82 Clean up confusing comment. Move it to the place of code which is
talked about.  Explain where the mentioned trampoline located
(usermode), and the fact that attempt to exit last thread is denied in
kernel (by delegating the work to usermode).

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-03 11:29:08 +00:00
Konstantin Belousov
ab57474c83 When other end of the pipe closed during the write, but some bytes
were written, return short write instead of EPIPE.

Update comment.

Discussed with:	bde (long time ago)
MFC after:	2 weeks
2014-11-03 10:01:56 +00:00
Mateusz Guzik
5cbf44bf89 Provide an on-stack temporary buffer for small ioctl requests. 2014-11-03 07:46:51 +00:00
Mateusz Guzik
324a7026f1 filedesc: plus sys/kdb.h include which crept in with r274007 2014-11-03 06:24:43 +00:00
Mateusz Guzik
1d29258ac2 filedesc: plug unnecessary fdp NULL checks in fdescfreee and fdcopy
Anything reaching these functions has fd table.
2014-11-03 05:12:17 +00:00
Mateusz Guzik
32417098f0 filedesc: create a dedicated zone for struct filedesc0
Currently sizeof(struct filedesc0) is 1096 bytes, which means allocations from
malloc use 2048 bytes.

There is no easy way to shrink the structure <= 1024 an it is likely to grow in
the future.
2014-11-03 04:16:04 +00:00
Konstantin Belousov
cc24666735 Followup to r273966. Fix the build with ADAPTIVE_LOCKMGRS kernel option.
Note that the option is currently not used in any in-tree kernel
configs, including LINTs.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-02 19:51:33 +00:00
Mateusz Guzik
3dca54ab98 filedesc: move freeing old tables to fdescfree
They cannot be accessed by anyone and hold count only protects the structure
from being freed.
2014-11-02 14:12:03 +00:00
Mateusz Guzik
3dc85312b2 filedesc: factor out some code out of fdescfree
Previously it had a huge self-contained chunk dedicated to dealing with shared
tables.

No functional changes.
2014-11-02 13:43:04 +00:00
Konstantin Belousov
72ba3c0822 Fix two issues with lockmgr(9) LK_CAN_SHARE() test, which determines
whether the shared request for already shared-locked lock could be
granted.  Both problems result in the exclusive locker starvation.

The concurrent exclusive request is indicated by either
LK_EXCLUSIVE_WAITERS or LK_EXCLUSIVE_SPINNERS flags.  The reverse
condition, i.e. no exclusive waiters, must check that both flags are
cleared.

Add a flag LK_NODDLKTREAT for shared lock request to indicate that
current thread guarantees that it does not own the lock in shared
mode.  This turns back the exclusive lock starvation avoidance code;
see man page update for detailed description.

Use LK_NODDLKTREAT when doing lookup(9).

Reported and tested by:	pho
No objections from:	attilio
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-02 13:10:31 +00:00
Mateusz Guzik
080fdefc28 filedesc: tidy up fdcheckstd
No functional changes.
2014-11-02 02:32:33 +00:00
Mateusz Guzik
d3f3e12a4f filedesc: lock filedesc lock in fdcloseexec only when needed 2014-11-02 01:13:11 +00:00
Mateusz Guzik
cdcf242896 Fix up module unload for syscall_module_handler consumers.
After r273707 it was registering syscalls as static.

This fixes hwpmc module unload.

Reported by: markj
2014-11-01 22:36:40 +00:00
Jean-Sébastien Pédron
da49f6bcc3 vt(4): Adjust the cursor position after changing the window size
A new terminal_set_cursor() is added: it wraps the existing
teken_set_cursor() function.

In vtbuf_grow(), the cursor position is adjusted at the end of the
function. In vt_change_font(), we call terminal_set_cursor() just after
terminal_set_winsize_blank(), while the terminal is mute.

This fixes a bug where, after loading a kernel video driver which
increases the terminal window size, the cursor remains at its old
position, in other words, in the middle of the display content.

PR:		194421
MFC after:	1 week
2014-11-01 17:05:15 +00:00
Konstantin Belousov
2361c6d135 Add type qualifier volatile to the base (userspace) address argument
of fuword(9) and suword(9).  This makes the functions type-compatible
with volatile objects and does not require devolatile force, e.g. in
kern_umtx.c.

Requested by:	bde
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2014-10-31 17:43:21 +00:00
Mateusz Guzik
2534d8eeb6 filedesc: drop retval argument from do_dup
It was almost always td_retval anyway.

For the one case where it is not, preserve the old value across the call.
2014-10-31 10:35:01 +00:00
Mateusz Guzik
8a5177cca3 filedesc: fix missed comments about fdsetugidsafety
While here just note that both fdsetugidsafety and fdcheckstd take sleepable
locks.
2014-10-31 09:56:00 +00:00
Mateusz Guzik
f652d856ab filedesc: make fdinit return with source filedesc locked and new one sized
appropriately

Assert FILEDESC_XLOCK_ASSERT only for already used tables in fdgrowtable.
We don't have to call it with the lock held if we are just creating new
filedesc.

As a side note, strictly speaking processes can have fdtables with
fd_lastfile = -1, but then they cannot enter fdgrowtable. Very first file
descriptor they get will be 0 and the only syscall allowing to choose fd number
requires an active file descriptor. Should this ever change, we can add an 'init'
(or similar) parameter to fdgrowtable.
2014-10-31 09:25:28 +00:00
Mateusz Guzik
ffeb890592 filedesc: iterate over fd table only once in fdcopy
While here add 'fdused_init' which does not perform unnecessary work.

Drop FILEDESC_LOCK_ASSERT from fdisused and rely on callers to hold
it when appropriate. This function is only used with INVARIANTS.

No functional changes intended.
2014-10-31 09:19:46 +00:00
Mateusz Guzik
1a0c80a3df filedesc: tidy up fdfree
Implement fdefree_last variant and get rid of 'last' parameter.

No functional changes.
2014-10-31 09:15:59 +00:00
Mateusz Guzik
b97a758ffc filedesc: tidy up fdcopy a little bit
Test for file availability by fde_file != NULL instead of fdisused, this is
consistent with similar checks later.

Drop badfileops check. badfileops don't have DFLAG_PASSABLE set, so it was never
reached in practice.

fdiused is now only used in some KASSERTS, so ifdef it under INVARIANTS.

No functional changes.
2014-10-31 05:41:27 +00:00
Mark Murray
10cb24248a This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by:	trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by:	so(des)
2014-10-30 21:21:53 +00:00
Mateusz Guzik
f55cf4b0d1 filedesc: make sure to force table reload in fget_unlocked when count == 0
This is a fixup to r273843.
2014-10-30 07:21:38 +00:00
Mateusz Guzik
29c85772bb filedesc: microoptimize fget_unlocked by retrying obtaining reference count
without restarting whole lookup

Restart is only needed when fp was closed by current process, which is a much
rarer event than ref/deref by some other thread.
2014-10-30 05:21:12 +00:00
Mateusz Guzik
aa77d52800 filedesc: get rid of atomic_load_acq_int from fget_unlocked
A read barrier was necessary because fd table pointer and table size were
updated separately, opening a window where fget_unlocked could read new size
and old pointer.

This patch puts both these fields into one dedicated structure, pointer to which
is later atomically updated. As such, fget_unlocked only needs data a dependency
barrier which is a noop on all supported architectures.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-30 05:10:33 +00:00
John Baldwin
01e1933dcc Rework virtual machine hypervisor detection.
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
  the 0x40000000 leaf to determine the hypervisor vendor ID.  Export the
  vendor ID and the highest supported hypervisor CPUID leaf via
  hv_vendor[] and hv_high variables, respectively.  The hv_vendor[]
  array is also exported via the hw.hv_vendor sysctl.
- Merge the VMWare detection code from tsc.c into the new probe in
  identcpu.c.  Add a VM_GUEST_VMWARE to identify vmware and use that in
  the TSC code to identify VMWare.

Differential Revision:	https://reviews.freebsd.org/D1010
Reviewed by:	delphij, jkim, neel
2014-10-28 19:17:44 +00:00
Konstantin Belousov
f7e91c288a Convert kern_umtx.c to use fueword() and casueword().
Also fix some mishandling of suword(9) errors as errno, which resulted
in spurious ERESTART.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:30:33 +00:00
Konstantin Belousov
0a2c94b86e Replace some calls to fuword() by fueword() with proper error checking.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:28:20 +00:00
Konstantin Belousov
4f3dc90023 Add fueword(9) and casueword(9) functions. They are like fuword(9)
and casuword(9), but do not mix value read and indication of fault.

I know (or remember) enough assembly to handle x86 and powerpc.  For
arm, mips and sparc64, implement fueword() and casueword() as wrappers
around fuword() and casuword(), which means that the functions cannot
distinguish between -1 and fault.

On architectures where fueword() and casueword() are native, implement
fuword() and casuword() using fueword() and casuword(), to reduce
assembly code duplication.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks (ia64 needs treating)
2014-10-28 15:22:13 +00:00
Hans Petter Selasky
0e1152fcc2 The SYSCTL data pointers can come from userspace and must not be
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.

Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-28 12:00:39 +00:00
Mateusz Guzik
a7963b9758 Simplify sys_getloginclass.
Just use current thread credentials as they have the same accuracy as the
ones obtained from proc..
2014-10-28 04:59:33 +00:00
Mateusz Guzik
b720dc9749 Change loginclass mutex to an rwlock.
While here reduce nesting in loginclass_free.

Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn>
X-Additional:	JuniorJobs project
MFC after:	2 weeks
2014-10-28 04:33:57 +00:00
Mateusz Guzik
8d866b10fc Tidy up functions related to uidinfo management.
- reference found uidinfo in uilookup
- reduce nesting by handling shorter cases first
2014-10-27 20:20:05 +00:00
Mateusz Guzik
8101958c4f De-k&r-ify function definitions in kern/kern_resource.c
No functional changes.
2014-10-27 20:18:30 +00:00
Mateusz Guzik
e015b1ab0a Avoid dynamic syscall overhead for statically compiled modules.
The kernel tracks syscall users so that modules can safely unregister them.

But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.

Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-26 19:42:44 +00:00
Mateusz Guzik
b90638866e Fix up an assertion in kern_setgroups, it should compare with ngroups_max + 1
Bug introdued in r273685.

Noted by: Tiwei Bie <btw mail.ustc.edu.cn>
2014-10-26 14:25:42 +00:00
Mateusz Guzik
7e9a456a53 Tidy up sys_setgroups and kern_setgroups.
- 'groups' initialization to NULL is always ovewrwriten before use, so plug it
- get rid of 'goto out'
- kern_setgroups's callers already validate ngrp, so only assert the condition
- ngrp  is an u_int, so 'ngrp < 1' is more readable as 'ngrp == 0'

No functional changes.
2014-10-26 06:04:09 +00:00
Mateusz Guzik
92b064f43d Use a temporary buffer in sys_setgroups for requests with <= XU_NGROUPS groups.
Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn>
X-Additional: JuniorJobs project
MFC after:	2 weeks
2014-10-26 05:39:42 +00:00
Mateusz Guzik
f84f8f9468 Now that sysctl_root is only called with sysctl lock in shared mode, update
its assertion to require that.

Update comment missed in r273400: sysctl_xlock/unlock -> sysctl_xlock/xunlock

Noted by: jhb
2014-10-26 01:47:55 +00:00
John Baldwin
1bc9ea1caa Use correct type in __DEVOLATILE(). 2014-10-25 20:42:47 +00:00
Alexander Motin
ccf8a5688a Revert somewhat hackish geom_disk optimization, committed as part of r256880,
and the following r273143 commit, supposed to workaround introduced issue by
quite innocent-looking change.

While there is no clear understanding why, but r273143 is accused in data
corruption in some environments with high I/O load.  I personally don't see
any problem in that commit, and possibly it is just a trigger to some other
bug somewhere, but better safe then sorry for now.

Requested by:	scottl@
MFC after:	3 days
2014-10-25 15:16:19 +00:00
Mateusz Guzik
675c3507d4 rlimit: plug duplicate assertion
counter sanity is already checked by refcount_release.
2014-10-25 05:56:21 +00:00
Xin LI
6362e06b42 Fix build. 2014-10-25 00:16:36 +00:00
John Baldwin
53e1ffbbce The current POSIX semaphore implementation stores the _has_waiters flag
in a separate word from the _count.  This does not permit both items to
be updated atomically in a portable manner.  As a result, sem_post()
must always perform a system call to safely clear _has_waiters.

This change removes the _has_waiters field and instead uses the high bit
of _count as the _has_waiters flag.  A new umtx object type (_usem2) and
two new umtx operations are added (SEM_WAIT2 and SEM_WAKE2) to implement
these semantics.  The older operations are still supported under the
COMPAT_FREEBSD9/10 options.  The POSIX semaphore API in libc has
been updated to use the new implementation.  Note that the new
implementation is not compatible with the previous implementation.
However, this only affects static binaries (which cannot be helped by
symbol versioning).  Binaries using a dynamic libc will continue to work
fine.  SEM_MAGIC has been bumped so that mismatched binaries will error
rather than corrupting a shared semaphore.  In addition, a padding field
has been added to sem_t so that it remains the same size.

Differential Revision:	https://reviews.freebsd.org/D961
Reported by:	adrian
Reviewed by:	kib, jilles (earlier version)
Sponsored by:	Norse
2014-10-24 20:02:44 +00:00
Dag-Erling Smørgrav
b0d69dfad9 In all cases except CTLTYPE_STRING, penv is NULL here, so passing it
indiscriminately to printf() and freeenv() is incorrect.  Add a NULL
check before freeenv(); as for printf(), we could use req.newptr
instead, but we'd have to select the correct format string based on
the type, and that's too much work for an error message, so just
remove it.
2014-10-23 22:42:56 +00:00
Mateusz Guzik
ffc5ce7b75 In selfdfree re-evaulate sf_si after takin the lock.
Otherwise we can race with doselwakeup.

This is a fixup to r273549

Reviewed by:	jhb
Reported by:	everyone and their dog
2014-10-23 19:06:08 +00:00
Xin LI
2735a91d93 Test if 'env' is NULL before doing memset() and strlen(),
the caller may pass NULL to freeenv().
2014-10-23 18:23:50 +00:00
Mateusz Guzik
73f2e5f759 Avoid taking the lock in selfdfree when not needed. 2014-10-23 15:35:47 +00:00
Colin Percival
b9f6af45a5 Avoid leaking data from the kernel environment: When we convert the
initial static environment to a dynamic one, zero the static environment
buffer, and zero individual values when kern_unsetenv and freeenv are
called.

Tested by:	kmoore (VM memory dump + grep)
Tested by:	cperciva (kernel panic dump + grep)
2014-10-22 23:35:32 +00:00
Mateusz Guzik
58a3dcb229 filedesc assert that table size is at least 3 in fdsetugidsafety
Requested by: kib
2014-10-22 08:56:57 +00:00
Mateusz Guzik
4bc68ed7bc Plug unnecessary PRS_NEW check in kern_procctl.
pfind does not return processes in such state.
2014-10-22 04:16:09 +00:00
Mateusz Guzik
a39d200bb9 Reduce nesting in vn_access.
No functional changes.
2014-10-22 01:53:00 +00:00
Mateusz Guzik
eac9678110 Avoid crdup when possible in kern_accessat.
While here tidy up a little.
2014-10-22 01:09:07 +00:00
Mateusz Guzik
11888da8d9 filedesc: cleanup setugidsafety a little
Rename it to fdsetugidsafety for consistency with other functions.

There is no need to take filedesc lock if not closing any files.

The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.

While here tidy up is_unsafe.
2014-10-22 00:23:43 +00:00
Mateusz Guzik
07b384cbe2 Eliminate unnecessary memory allocation in sys_getgroups and its ibcs2 counterpart. 2014-10-21 23:08:46 +00:00
Mateusz Guzik
2afec8edfc Take the lock shared in linker_search_symbol_name.
This helps sysctl kern.proc.stack.
2014-10-21 21:29:20 +00:00
Mateusz Guzik
fca7732078 Mark some more sysctl stuff shared-locked and MPSAFE. 2014-10-21 21:08:45 +00:00
Mateusz Guzik
b564c5d6aa Make sysctl name2oid shared-locked as well.
This is a follow-up to r273401.
2014-10-21 19:45:08 +00:00
Mateusz Guzik
efe0abddf5 Implement shared locking for sysctl. 2014-10-21 19:05:44 +00:00
Mateusz Guzik
580a011762 Rename sysctl_lock and _unlock to sysctl_xlock and _xunlock. 2014-10-21 19:02:26 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
Mateusz Guzik
5c37b305fd Plug unnecessary binvp NULL initialization and test.
Reported by: Coverity
CID: 1018889
2014-10-20 22:52:15 +00:00
Mateusz Guzik
966ee9f25f filedesc: plug 2 write-only variables
Reported by: Coverity
CID: 1245745, 1245746
2014-10-20 21:57:24 +00:00
Mark Johnston
4fd6ca7275 Fix a typo from r189544, which replaced unp_global_rwlock with unp_list_lock
and unp_link_rwlock.

MFC after:	3 days
2014-10-20 20:21:40 +00:00
Mateusz Guzik
4fce16e4c9 Provide vfs suspension support only for filesystems which need it, take
two.

nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.

This is a fixup for 273271.

No strong objections from: kib
Pointy hat to: mjg
MFC after:	2 weeks
2014-10-20 18:00:50 +00:00
Marcel Moolenaar
0067051fe7 Fully support constructors for the purpose of code coverage analysis.
This involves:
1.  Have the loader pass the start and size of the .ctors section to the
    kernel in 2 new metadata elements.
2.  Have the linker backends look for and record the start and size of
    the .ctors section in dynamically loaded modules.
3.  Have the linker backends call the constructors as part of the final
    work of initializing preloaded or dynamically loaded modules.

Note that LLVM appends the priority of the constructors to the name of
the .ctors section. Not so when compiling with GCC. The code currently
works for GCC and not for LLVM.

Submitted by:	Dmitry Mikulin <dmitrym@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-10-20 17:04:03 +00:00
Mateusz Guzik
020b8f17a0 Provide vfs suspension support only for filesystems which need it.
Need is expressed by providing vfs_susp_clean function in vfsops.

Differential Revision:	D952
Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-19 06:59:33 +00:00
Adrian Chadd
3fe93b946f Convert a missed u_char cpu -> int cpu.
This was caught by a gcc build.

Reported by:	luigi
Sponsored by:	Norse Corp, Inc.
2014-10-19 04:38:02 +00:00
Adrian Chadd
e77f9fed15 Update the ULE scheduler + thread and kinfo structs to use int for cpuid
rather than u_char.

To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254.  The new fields now contain the full CPU ID, or -1 for no cpu.

Differential Revision:	D955
Reviewed by:	jhb, kib
Sponsored by:	Norse Corp, Inc.
2014-10-18 19:36:11 +00:00
Davide Italiano
2be111bf7d Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
Alexander Motin
99b9076c21 Remove setting BIO_DONE flag for BIOs that have done() method.
This fixes use-after-free, caused by geom_disk, completing same BIO twice
to save extra allocation, and getting BIO_DONE set after the first.

MFC after:	1 week
2014-10-15 18:36:34 +00:00
Konstantin Belousov
f821fad417 Implement FIODTYPE for master ptys.
Requested and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-10-15 12:38:26 +00:00
Mateusz Guzik
32e7f8e4d5 Don't take devmtx unnecessarily in vn_isdisk.
MFC after:	1 week
2014-10-15 05:17:36 +00:00
Mateusz Guzik
55056be254 filedesc: plug 2 assignments to M_ZERO-ed pointers in falloc_noinstall
No functional changes.
2014-10-15 01:16:11 +00:00
Marcel Moolenaar
ddbe5b951f Fix nits in previous commit:
1.  Remove initializer for badstack_sbuf_size; it gets set unconditionally.
2.  Remove meaningless comment.
3.  Group witness_count and its sysctl together.
4.  Fix spacing in for statements (space after for and within condition).
5.  Change *all* M_NOWAIT usages in witness_initialize() to M_WAITOK; not
    just those that were newly introduced -- the allocation is assumed to
    succeed for all allocations.
6.  Avoid using uint8_t as the base type in sizeof() expressions; Use the
    variable name (w_rmatrix) as much as possible.

Pointed out by: jhb@ (thanks!)
2014-10-11 16:34:01 +00:00
Marcel Moolenaar
90a5222f14 Turn WITNESS_COUNT into a tunable and sysctl. This allows adjusting
the value without recompiling the kernel.  This is useful when
recompiling is not possible as an immediate solution. When we run out
of witness objects, witness is completely disabled. Not having an
immediate solution can therefore be problematic.

Submitted by:	Sreekanth Rupavatharam <rupavath@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-10-11 02:02:58 +00:00
Marcel Moolenaar
2e7634503e Regenerate after r272823:
Move the SCTP syscalls to netinet with the rest of the SCTP code.

Submitted by:	Steve Kiernan <stevek@juniper.net>
Reviewed by:	tuexen, rrs
Obtained from:	Juniper Networks, Inc.
2014-10-09 15:19:35 +00:00
Marcel Moolenaar
80b47aefa1 Move the SCTP syscalls to netinet with the rest of the SCTP code. The
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.

The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
  sys_sctp_peeloff
  sys_sctp_generic_sendmsg
  sys_sctp_generic_sendmsg_iov
  sys_sctp_generic_recvmsg

The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.

As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file.  A proper
prototype has been added to the sys/socketvar.h header file.

API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)

Submitted by:	Steve Kiernan <stevek@juniper.net>
Reviewed by:	tuexen, rrs
Obtained from:	Juniper Networks, Inc.
2014-10-09 15:16:52 +00:00
Adrian Chadd
ffcf962dab Add a bus method to fetch the VM domain for the given device/bus.
* Add a bus_if.m method - get_domain() - returning the VM domain or
  ENOENT if the device isn't in a VM domain;
* Add bus methods to print out the domain of the device if appropriate;
* Add code in srat.c to save the PXM -> VM domain mapping that's done and
  expose a function to translate VM domain -> PXM;
* Add ACPI and ACPI PCI methods to check if the bus has a _PXM attribute
  and if so map it to the VM domain;
* (.. yes, this works recursively.)
* Have the pci bus glue print out the device VM domain if present.

Note: this is just the plumbing to start enumerating information -
it doesn't at all modify behaviour.

Differential Revision:	D906
Reviewed by:	jhb
Sponsored by:	Norse Corp
2014-10-09 05:33:25 +00:00
Marcel Moolenaar
383f423be1 Fix draining in ttydev_leave():
1.  ERESTART is not only returned when the revoke count changed. It
    is also returned when a signal is received. While a change in
    the revoke count should be ignored, a signal should not.
2.  Waiting until the output queue is entirely drained can cause a
    hang when the underlying device is stuck or broken.

Have tty_drain() take care of this by telling it when we're leaving.
When leaving, tty_drain() will use a timed wait to address point 2
above and it will check the revoke count to handle point 1 above.
The timeout is set to 1 second, which is arbitrary and long enough
to expect a change in the output queue.

Discussed with: jilles@
Reported by: Yamagi Burmeister <lists@yamagi.org>
2014-10-09 02:30:38 +00:00
Marcel Moolenaar
75c2b79df8 Apply r269126 to tty_timedwait():
Don't return ERESTART when the device is gone.
2014-10-09 01:59:25 +00:00
John Baldwin
232e8b52b0 Add schedgraph traces for callout handlers. Specifically, a callwheel logs
a running event each time it executes a callout function.  The event
includes the function pointer, argument, and whether or not it was run from
hardware interrupt context.  The callwheel is marked idle when each handler
completes.  This effectively logs the duration of each callout routine in
the graph.
2014-10-08 16:22:59 +00:00
Jung-uk Kim
37417245bf Make kern.nswbuf tunable from loader.
MFC after:	1 week
2014-10-07 20:13:47 +00:00
Mateusz Guzik
dd2390be68 Convert racct stubs to inline functions.
This saves some symbols and function calls for kernel without RACCT.

MFC after:	1 week
2014-10-06 02:31:33 +00:00
Mateusz Guzik
2b4a2528d7 filedesc: fix up breakage introduced in 272505
Include sequence counter supports incoditionally [1]. This fixes reprted build
problems with e.g. nvidia driver due to missing opt_capsicum.h.

Replace fishy looking sizeof with offsetof. Make fde_seq the last member in
order to simplify calculations.

Suggested by:	kib [1]
X-MFC:		with 272505
2014-10-05 19:40:29 +00:00
Konstantin Belousov
57c2505e65 On error, sbuf_bcat() returns -1. Some callers returned this -1 to
the upper layers, which interpret it as errno value, which happens to
be ERESTART.  The result was spurious restarts of the sysctls in loop,
e.g. kern.proc.proc, instead of returning ENOMEM to caller.

Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers
expecting errno.

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2014-10-05 17:35:59 +00:00
Mateusz Guzik
bad2520a2b Avoid unnecessary ppeers_lock acquisition in exit1.
MFC after:	1 week
2014-10-05 07:21:41 +00:00
Mateusz Guzik
25108069ec Get rid of crshared. 2014-10-05 02:16:53 +00:00
Konstantin Belousov
4142462eeb Slightly reword comment. Move code, which is described by the
comment, after it.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-10-04 18:51:55 +00:00
Konstantin Belousov
b76278407d Add kernel option KSTACK_USAGE_PROF to sample the stack depth on
interrupts and report the largest value seen as sysctl
debug.max_kstack_used.  Useful to estimate how close the kernel stack
size is to overflow.

In collaboration with:	Larry Baird <lab@gta.com>
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2014-10-04 18:38:14 +00:00
Konstantin Belousov
539c9eef12 Fixes for i/o during coredumping:
- Do not dump into system files.
- Do not acquire write reference to the mount point where img.core is
  written, in the coredump().  The vn_rdwr() calls from ELF imgact
  request the write ref from vn_rdwr().  Recursive acqusition of the
  write ref deadlocks with the unmount.
- Instead, take the range lock for the whole core file.  This prevents
  parallel dumping from two processes executing the same image,
  converting the useless interleaved dump into sequential dumping,
  with second core overwriting the first.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-10-04 18:35:00 +00:00
Konstantin Belousov
e3d6feceb1 Add IO_RANGELOCKED flag for vn_rdwr(9), which specifies that vnode is
not locked, but range is.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-10-04 18:28:27 +00:00
Ian Lepore
41e8f7efbe Make kevent(2) periodic timer events more reliably periodic. The event
callout is now scheduled using the C_ABSOLUTE flag, and the absolute time
of each event is calculated as the time the previous event was scheduled
for plus the interval.  This ensures that latency in processing a given
event doesn't perturb the arrival time of any subsequent events.

Reviewed by:	jhb
2014-10-04 15:59:15 +00:00
Mateusz Guzik
ee3fd7bbb1 Plug capability races.
fp and appropriate capability lookups were not atomic, which could result in
improper capabilities being checked.

This could result either in protection bypass or in a spurious ENOTCAPABLE.

Make fp + capability check atomic with the help of sequence counters.

Reviewed by:	kib
MFC after:	3 weeks
2014-10-04 08:08:56 +00:00
John Baldwin
7aa1071e48 Require p_cansched() for changing a process' protection status via
procctl() rather than p_cansee().

Submitted by:	rwatson
MFC after:	3 days
2014-10-02 21:18:16 +00:00
Will Andrews
9832a24d27 In the syncer, drop the sync mutex while patting the watchdog.
Some watchdog drivers (like ipmi) need to sleep while patting the watchdog.
See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK).

Submitted by:	asomers
MFC after:	1 month
Sponsored by:	Spectra Logic
MFSpectraBSD:	637548 on 2012/10/04
2014-10-01 15:32:28 +00:00
Navdeep Parhar
a9fa76f27e Test for absence of M_NOFREE before attempting to purge the mbuf's tags.
This will leave more state intact should the assertion go off.

MFC after:	1 month
2014-09-30 23:16:26 +00:00
Mateusz Guzik
8e572983d3 Use bzero instead of explicitly zeroing stuff in do_execve.
While strictly speaking this is not correct since some fields are pointers,
it makes no difference on all supported archs and we already rely on it doing
the right thing in other places.

No functional changes.
2014-09-29 23:59:19 +00:00