Commit Graph

14208 Commits

Author SHA1 Message Date
markj
25e41e0131 Reimplement support for userland core dump compression using a new interface
in kern_gzio.c. The old gzio interface was somewhat inflexible and has not
worked properly since r272535: currently, the gzio functions are called with
a range lock held on the output vnode, but kern_gzio.c does not pass the
IO_RANGELOCKED flag to vn_rdwr() calls, resulting in deadlock when vn_rdwr()
attempts to reacquire the range lock. Moreover, the new gzio interface can
be used to implement kernel core compression.

This change also modifies the kernel configuration options needed to enable
userland core dump compression support: gzio is now an option rather than a
device, and the COMPRESS_USER_CORES option is removed. Core dump compression
is enabled using the kern.compress_user_cores sysctl/tunable.

Differential Revision:	https://reviews.freebsd.org/D1832
Reviewed by:	rpaulo
Discussed with:	kib
2015-03-09 03:50:53 +00:00
nwhitehorn
fd67077071 Make 32-bit PowerPC kernels, like 64-bit PowerPC kernels, position-independent
executables. The goal here, not yet accomplished, is to let the e500 kernel
run under QEMU by setting KERNBASE to something that fits in low memory and
then having the kernel relocate itself at runtime.
2015-03-07 20:14:46 +00:00
hselasky
f8d33b7a49 Add mutex support to the pps_ioctl() API in the kernel.
Bump kernel version to reflect structure change.

PR:		196897
MFC after:	1 week
2015-03-07 18:23:32 +00:00
rstone
eb3820a352 Move libnv into the kernel and hook it into the kernel build
Differential Revision:		https://reviews.freebsd.org/D1883
Reviewed by:			jfv
MFC after:			1 month
Sponsored by:			Sandvine Inc.
2015-03-01 00:34:27 +00:00
rstone
9ce855b54f Correct the use of an unitialized variable in sendfind_getobj()
When sendfile_getobj() is called on a DTYPE_SHM file, it never
initializes error, which is eventually returned to the caller.

Differential Revision:		https://reviews.freebsd.org/D1989
Reviewed by:			kib
Reported by: 			Brainy Code Scanner, by Maxime Villard.
2015-02-28 21:49:59 +00:00
ian
0b0c9cce2f Format the line properly (wrap before column 80). 2015-02-28 17:44:31 +00:00
ian
57c4ae8cb3 Export the new osreldate and osrelease jail parms in jail_get(2). 2015-02-28 17:32:31 +00:00
kib
8e556fe98a The umtx_lock mutex is used by top-half of the kernel, but is
currently a spin lock.  Apparently, the only reason for this is that
umtx_thread_exit() is called under the process spinlock, which put the
requirement on the umtx_lock.  Note that the witness static order list
is wrong for the umtx_lock, umtx_lock is explicitely before any thread
lock, so it is also before sleepq locks.

Change umtx_lock to be the sleepable mutex.  For the reason above, the
calls to umtx_thread_exit() are moved from thread_exit() earlier in
each caller, when the process spin lock is not yet taken.

Discussed with:	jhb
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-02-28 04:19:02 +00:00
imp
25b5dc1900 Put back Andy's void for gcc happiness.
Submitted by:	jchandra@
2015-02-27 23:14:08 +00:00
imp
c20b5fa0f2 Make sched_random() return an unsigned number, and use uint32_t
consistently. This also matches the per-cpu pointer declaration
anyway.

This changes the tweak we give to the load from -32..31 to be 0..31
which seems more inline with the rest of the code (- rnd and the -=
64). It should also provide the randomness we need, and may fix a
signedness bug in the old code (it isn't clear that the effect was
intentional as opposed to sloppy, and the right shift of a signed
value is undefined to boot).

This stores sched_balance() behavior when it used random().

Differential Revision: https://reviews.freebsd.org/D1981
2015-02-27 21:15:12 +00:00
kib
3bc9cbc06a The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
prevent errors from yanking devices out from under filesystems.  Only
care about special vnodes on devfs, special nodes on other kinds of
filesystems do not have special properties.

Sponsored by:  EMC / Isilon Storage Division
Submitted by:   Conrad Meyer
MFC after:	1 week
2015-02-27 16:43:50 +00:00
ian
1df855e5be Allow the kern.osrelease and kern.osreldate sysctl values to be set in a
jail's creation parameters.  This allows the kernel version to be reliably
spoofed within the jail whether examined directly with sysctl or
indirectly with the uname -r and -K options.

The values can only be set at jail creation time, to eliminate the need
for any locking when accessing the values via sysctl.

The overridden values are inherited by nested jails (unless the config for
the nested jails also overrides the values).

There is no sanity or range checking, other than disallowing an empty
release string or a zero release date, by design.  The system
administrator is trusted to set sane values.  Setting values that are
newer than the actual running kernel will likely cause compatibility
problems.

Differential Revision:	https://reviews.freebsd.org/D1948
Relnotes:	yes
2015-02-27 16:28:55 +00:00
andrew
078d1e1d3a Fix sched_ule on sparc64, gcc complains sched_random is not a correct
prototype.

Sponsored by:	The FreeBSD Foundation
2015-02-27 15:05:20 +00:00
andrew
7df93e0ecc sched_random is only called for SMP, only define it there.
Sponsored by:	The FreeBSD Foundation
2015-02-27 12:38:24 +00:00
imp
ace438b760 Create sched_rand() and move the LCG code into that. Call this when
we need randomness in ULE. This removes random() call from the
rebalance interval code.

Submitted by: Harrison Grundy
Differential Revision: https://reviews.freebsd.org/D1968
2015-02-27 02:56:58 +00:00
adrian
72abec5bf2 Remove taskqueue_start_threads_pinned(); there's noa generic cpuset version of this.
Sponsored by:	Norse Corp, Inc.
2015-02-25 21:59:03 +00:00
kib
1d0219737e When failing to claim ownership of a umtx_pi, restore the umutex owner
to its previous, unowned state.  This avoids compounding an existing
problem of inconsistent ownership.

Submitted by:	Eric van Gyzen <eric_van_gyzen@dell.com>
Obtained from:	Dell Inc.
PR:	198914
MFC after:	1 week
2015-02-25 16:17:16 +00:00
kib
2c8adee420 When unlocking a contested PI pthread mutex, if the queue of waiters
is empty, look up the umtx_pi and disown it if the current thread owns it.
This can happen if a signal or timeout removed the last waiter from
the queue, but there is still a thread in do_lock_pi() holding a reference
on the umtx_pi.  The unlocking thread might not own the umtx_pi in this case,
but if it does, it must disown it to keep the ownership consistent between
the umtx_pi and the umutex.

Submitted by:	Eric van Gyzen <eric_van_gyzen@dell.com>
	with advice from: Elliott Rabe and Jim Muchow, also at Dell Inc.
Obtained from:	Dell Inc.
PR:	198914
2015-02-25 16:12:56 +00:00
kib
c3462c63fb Keep a reference on the coredump vnode for vn_fullpath() call. Do it
by moving vn_close() after the point where notification is sent.

Reported by:	sbruno
Tested by:	pho, sbruno
Sponsored by:	The FreeBSD Foundation
2015-02-24 13:07:31 +00:00
ae
c6a9e35096 soreceive_generic() still has similar KASSERT(), therefore instead of
remove KASSERT(), change it to check mbuf isn't NULL.

Suggested by:	kib
MFC after:	1 week
2015-02-23 15:24:43 +00:00
ae
92ce4d2d91 In some cases soreceive_dgram() can return no data, but has control
message. This can happen when application is sending packets too big
for the path MTU and recvmsg() will return zero (indicating no data)
but there will be a cmsghdr with cmsg_type set to IPV6_PATHMTU.
Remove KASSERT() which does NULL pointer dereference in such case.
Also call m_freem() only when m isn't NULL.

PR:		197882
MFC after:	1 week
Sponsored by:	Yandex LLC
2015-02-23 13:41:35 +00:00
nwhitehorn
03bb9e5889 Make kernel ELF image parsing not crash for kernels running at locations
other than their link address.
2015-02-21 23:20:05 +00:00
markj
06baf2f090 Don't specify a resid parameter if we're just going to ignore it. Instead,
let vn_rdwr() check for short reads.

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2015-02-20 20:49:00 +00:00
markj
5b69f35cbc Remove unnecessary checks for a return value of NULL from M_WAITOK
allocations.

MFC after:	3 days
2015-02-19 03:32:48 +00:00
markj
8aabcd8ae5 Free the zlib stream after expanding a compressed CTF section.
Note that this memory would only be leaked once, since CTF info for a kld
file is cached after the first access.

MFC after:	3 days
2015-02-19 03:29:46 +00:00
kib
634138588e If malloc() sleeps, Giant is dropped. Recheck for another thread
doing our work.

Remove unneeded check for failed M_WAITOK allocation.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-02-18 18:12:06 +00:00
mjg
2f36a26464 filedesc: obtain a stable copy of credentials in fget_unlocked
This was broken in r278930.

While here tidy up fget_mmap to use fdp from local var instead of obtaining
the same pointer from td.
2015-02-18 13:37:28 +00:00
mjg
0a219ba739 filedesc: simplify fget_unlocked & friends
Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.

Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.

Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
otained from stable & current state.

Reviewed by:	silence on -hackers
2015-02-17 23:54:06 +00:00
glebius
5a69729fb8 Use anonymous unions and structs to organize shared space in mbuf(9),
instead of preprocessor macros.
  This will make debugger output of 'print *m' exactly match the names
we use in code, making life of a kernel hacker way more pleasant. And
this also allows to rename struct_m_ext back to m_ext.
2015-02-17 20:52:51 +00:00
glebius
8de94edb0d Use anonymous unions to add possibility to put mbufs into queue(3)
STAILQs and SLISTs using the same structure field as good old m_next
and m_nextpkt linkage occupy.

New code is encouraged to use queue(3) macros, instead of implementing
the wheel. However, better not to have a mixture of old style and
queue(3) in one file or subsystem.

Reviewed by:		rwatson, rrs, rpaulo
Differential Revision:	D1499
2015-02-17 19:32:11 +00:00
ngie
994a2af400 Add the mnt_lockref field to the ddb(4) 'show mount' command
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D1688
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC / Isilon Storage Division
2015-02-17 09:31:58 +00:00
adrian
f9f1541cd2 Implement taskqueue_start_threads_cpuset().
This is a more generic version of taskqueue_start_threads_pinned()
which only supports a single cpuid.

This originally came from John Baldwin <jhb@> who implemented it
as part of a push towards NUMA awareness in drivers.  I started implementing
something similar for RSS and NUMA, then found he already did it.

I'd like to axe taskqueue_start_threads_pinned() so it doesn't become
part of a longer-term API.  (Read: hps@ wants to MFC things, and
if I don't do this soon, he'll MFC what's here. :-)

I have a follow-up commit which converts the intel drivers over
to using the cpuset version of this function, so we can eventually
nuke the the pinned version.

Tested:

* igb, ixgbe

Obtained from:	jhbbsd
2015-02-17 02:35:06 +00:00
kib
76636ae592 Reparenting done by debugger attach can leave reaper without direct
children.  Handle the situation instead asserting that it is
impossible.

Reported and tested by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-02-15 08:44:30 +00:00
kib
105218e4a7 Return with the process locked, caller expects p still locked after
the call.

Reported and tested by:	bapt
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-02-15 08:43:19 +00:00
davide
7aab309aae Don't access sockbuf fields directly, use accessor functions instead.
It is safe to move the call to socantsendmore_locked() after
sbdrop_locked() as long as we hold the sockbuf lock across the two
calls.

CR:	D1805
Reviewed by:	adrian, kmacy, julian, rwatson
2015-02-14 20:00:57 +00:00
jhb
de820a9105 Include OBJT_PHYS VM objects in ELF core dumps. In particular this
includes the shared page allowing debuggers to use the signal trampoline
code to identify signal frames in core dumps.

Differential Revision:	https://reviews.freebsd.org/D1828
Reviewed by:	alc, kib
MFC after:	1 week
2015-02-14 17:12:31 +00:00
jhb
4247c4fbb3 Add two new counters for vnode life cycle events:
- vfs.recycles counts the number of vnodes forcefully recycled to avoid
  exceeding kern.maxvnodes.
- vfs.vnodes_created counts the number of vnodes created by successful
  calls to getnewvnode().

Differential Revision:	https://reviews.freebsd.org/D1671
Reviewed by:	kib
MFC after:	1 week
2015-02-14 17:02:51 +00:00
alc
c8abad000b Preset the object's color, or alignment, to maximize superpage usage.
MFC after:	5 days
2015-02-13 19:58:53 +00:00
rrs
e878a76f46 This fixes a bug I in-advertantly inserted when I updated the callout
code in my last commit. The cc_exec_next is used to track the next
when a direct call is being made from callout. It is *never* used
in the in-direct method. When macro-izing I made it so that it
would separate out direct/vs/non-direct. This is incorrect and can
cause panics as Peter Holm has found for me (Thanks so much Peter for
all your help in this). What this change does is restore that behavior
but also get rid of the cc_next from the array and instead make it
be part of the base callout structure. This way no one else will get
confused since we will never use it for non-direct.

Reviewed by:	Peter Holm and more importantly tested by him ;-)
MFC after:	3 days.
Sponsored by:	Netflix Inc.
2015-02-12 13:31:08 +00:00
rpaulo
a306f93a85 Remove check against NULL after M_WAITOK.
Submitted by:	Oliver Pinter
2015-02-11 19:07:05 +00:00
rpaulo
7a55949e1e Restore the data array in coredump(), but use a different style to
calculate the length.

Requested by:	kib
2015-02-11 00:58:15 +00:00
rpaulo
2549cc669a Remove a printf and an strlen() from the coredump code. 2015-02-10 18:35:46 +00:00
kib
7a1bb0de5f Mountd iterating over the mount points may race with the parallel
unmount, which causes error from nmount(2) call when performing
MNT_DELEXPORT over the directory which ceased to be a mount point.

The race is legitimate and innocent, but results in the chatty mountd.
Silence it by providing an distinguished error code for the situation,
and ignoring the error in mountd loop.

Based on the patch by:	Andreas Longwitz <longwitz@incore.de>
Prodded and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-02-10 18:00:32 +00:00
rpaulo
dc0c223646 Sanitise the coredump file names sent to devd.
While there, add a sysctl to turn this feature off as requested by
kib@.
2015-02-10 04:34:39 +00:00
rpaulo
7a7642565c Notify devd(8) when a process crashed.
This change implements a notification (via devctl) to userland when
the kernel produces coredumps after a process has crashed.
devd can then run a specific command to produce a human readable crash
report.  The command is most usually a helper that runs gdb/lldb
commands on the file/coredump pair.  It's possible to use this
functionality for implementing automatic generation of crash reports.

devd(8) will be notified of the full path of the binary that crashed and
the full path of the coredump file.
2015-02-09 23:13:50 +00:00
rrs
344ecf88af This fixes two conditions that can incur when migration
is being done in the callout code and harmonizes the macro
use.:
1) The callout_active() will lie. Basically if a migration
   is occuring and the callout is about to expire and the
   migration has been deferred, the callout_active will no
   longer return true until after the migration. This confuses
   and breaks callers that are doing callout_init(&c, 1); such
   as TCP.
2) The migration code had a bug in it where when migrating, if
   a two calls to callout_reset came in and they both collided with
   the callout on the wheel about to run, then the second call to
   callout_reset would corrupt the list the callout wheel uses
   putting the callout thread into a endless loop.
3) Per imp, I have fixed all the macro occurance in the code that
   were for the most part being ignored.

Phabricator D1711 and looked at by lstewart and jhb and sbruno.
Reviewed by:	kostikbel, imp, adrian, hselasky
MFC after:	3 days
Sponsored by:	Netflix Inc.
2015-02-09 19:19:44 +00:00
alc
c8a99957e2 Preset the object's color, or alignment, to maximize superpage usage.
MFC after:	5 days
2015-02-08 21:00:51 +00:00
jhb
571edab7e4 Add a new device control utility for new-bus devices called devctl. This
allows the user to request administrative changes to individual devices
such as attach or detaching drivers or disabling and re-enabling devices.
- Add a new /dev/devctl2 character device which uses ioctls for device
  requests.  The ioctls use a common 'struct devreq' which is somewhat
  similar to 'struct ifreq'.
- The ioctls identify the device to operate on via a string.  This
  string can either by the device's name, or it can be a bus-specific
  address.  (For unattached devices, a bus address is the only way to
  locate a device.)  Bus drivers register an eventhandler to claim
  unrecognized device names that the driver recognizes as a valid address.
  Two buses currently support addresses: ACPI recognizes any device
  in the ACPI namespace via its full path starting with "\" and
  the PCI bus driver recognizes an address specification of
  'pci[<domain>:]<bus>:<slot>:<func>' (identical to the PCI selector
  strings supported by pciconf).
- To make it easier to cut and paste, change the PnP location string
  in the PCI bus driver to output a full PCI selector string rather
  than 'slot=<slot> function=<func>'.
- Add a devctl(3) interface in libdevctl which provides a wrapper around
  the ioctls and is the preferred interface for other userland code.
- Add a devctl(8) program which is a simple wrapper around the requests
  supported by devctl(3).
- Add a device_is_suspended() function to check DF_SUSPENDED.
- Add a resource_unset_value() function that can be used to remove a
  hint from the kernel environment.  This is used to clear a
  hint.<driver>.<unit>.disabled hint when re-enabling a boot-time
  disabled device.

Reviewed by:	imp (parts)
Requested by:	imp (changing PCI location string)
Relnotes:	yes
2015-02-06 16:09:01 +00:00
jhb
5114ec1b59 Expose the constants for internal new-bus device flags to userland. The
flag value is already exposed via dv_flags, just not the meaning of the
flags themselves.  Use these constants to annotate devices that are
disabled or suspended in devinfo output.
2015-02-05 22:42:44 +00:00
jhb
d99174b27c Set and clear the DF_SUSPENDED flag on the child device being manipulated
rather than on the parent.
2015-02-05 22:24:22 +00:00
jmg
8a06c15bc0 turn GEOM_UNCOMPRESS_DEBUG into a proper option so it can be specified
in kernel config files..

put VERBOSE_SYSINIT in it's own option header so the one file,
init_main.c, can use it instead of requiring an entire kernel recompile
to change one file..
2015-02-05 07:51:38 +00:00
peter
bf625afa44 Initialize ticks so that it wraps 10 minutes after boot to increase the
chances of finding problems related to wraparound sooner.

This comes from P4 change 167856 on 2009/08/26 around when we had problems
with the TCP stack with ticks after 24 days of uptime.
2015-02-05 01:43:21 +00:00
kib
c87d139b47 Add ddb command 'show clocksource' to display state of the per-cpu
clock events.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-02-04 14:49:47 +00:00
kib
04052a24ca Fix use after free in pipe_dtor(). PIPE_NAMED flag must be tested
before pipeclose() is called, since for !PIPE_NAMED case, when peer is
already closed, the pipe pair memory is freed.

Submitted by:	luke.tw@gmail.com
PR:	197246
Tested by:	pho
MFC after:	3 days
2015-02-03 10:29:40 +00:00
kib
871996ee6e The dependency chain for priority-inheritance mutexes could be
subverted by userspace into cycle.  Both umtx_propagate_priority() and
umtx_repropagate_priority() would then loop infinitely, owning the
spinlock.

Check for the cycle using standard Floyd' algorithm before doing the
pass in the affected functions.  Add simple check for condition of
tricking the thread into a wait for itself, which could be easily
simulated by usermode without race.

Found by:	Eric van Gyzen <eric@vangyzen.net>
In collaboration with:	Eric van Gyzen <eric@vangyzen.net>
Tested by:	pho
MFC after:	1 week
2015-01-31 12:27:40 +00:00
jamie
c7d0935d11 Add allow.mount.fdescfs jail flag.
PR:		192951
Submitted by:	ruben@verweg.com
MFC after:	3 days
2015-01-28 21:08:09 +00:00
jhb
c5ac2eb628 Fix a couple of panics when detaching from a cxgbe/cxl interface that was
never brought up:
- Allow NULL to be passed to sglist_free().
- Don't try to stop an interface that was never fully initialized.

Reviewed by:	np
2015-01-26 16:26:28 +00:00
adrian
70dc4fad7a Call WITNESS_WARN() in callout_drain() to check whether any locks are
being held before sleeping.

This has bitten me (in ath(4)) once before and I'd like to see this
not bite anyone else.

Differential Revision:	D1638
Reviewed by:	jhb, hselasky
MFC after:	1 week
2015-01-26 04:04:57 +00:00
jhb
71715e274d Change the default VFS timestamp precision from seconds to microseconds.
Discussed on:	arch@
MFC after:	2 weeks
2015-01-25 19:56:45 +00:00
jilles
6ad32c1c79 Run make sysent. 2015-01-23 21:08:24 +00:00
jilles
67db24d0f2 Add futimens and utimensat system calls.
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision:	https://reviews.freebsd.org/D1426
Submitted by:	pluknet (partially)
Reviewed by:	delphij, pluknet, rwatson
Relnotes:	yes
2015-01-23 21:07:08 +00:00
danfe
d20a416d73 Fix usage example in kvprintf(9) and its copy in libstand(3): trailing '\n'
in bitfield argument is wrong, as it will be treated as bit 10, causing any
code printing >=10 bits with bit 10 on as having a trailing comma.

Newline (intended one) should be part of the format string (already present
in the examples).

Also fix grammar and kill EOL whitespace in comment while here.

PR:		195005
Approved by:	bdrewery
2015-01-23 07:30:57 +00:00
hselasky
c0aba3b50d Revert for r277213:
FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.

Bump the __FreeBSD_version instead of reverting it.

Suggested by:		kmacy, adrian, glebius and kib
Differential Revision:	https://reviews.freebsd.org/D1438
2015-01-22 11:12:42 +00:00
mjg
9bc86796d3 filedesc: avoid spurious copying of capabilities in fget_unlocked
We obtain a stable copy and store it in local 'fde' variable. Storing another
copy (based on aforementioned variable) does not serve any purpose.

No functional changes.
2015-01-21 18:32:53 +00:00
mjg
e15a87cc6a filedesc: return 0 from badfo_close
The only potential in-tree consumer (_fdrop) special-cased it and returns 0
0 on its own instead of calling badfo_close.

Remove the special case since it is not needed and very unlikely to encounter
anyway.

No objections from:	kib
2015-01-21 18:05:42 +00:00
mjg
4b90cc79ee filedesc: fix whitespace nits in fget and fget_read
No functional changes.
2015-01-21 18:02:28 +00:00
kib
ef36961ce1 Do not assert that the new pipepair mutex is not initialized. The
backing memory contains garbage and might trigger the assertion.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-21 16:32:54 +00:00
mjg
03fe27a773 filedesc: plug a test for impossible condition in _fget 2015-01-21 01:06:14 +00:00
neel
1fe4c6403b Update the vdso timehands only via tc_windup().
Prior to this change CLOCK_MONOTONIC could go backwards when the timecounter
hardware was changed via 'sysctl kern.timecounter.hardware'. This happened
because the vdso timehands update was missing the special treatment in
tc_windup() when changing timecounters.

Reviewed by:	kib
2015-01-20 03:54:30 +00:00
kib
f748dc7ade Stop enforcing additional reference on all cdevs, which was introduced
in r277199.  Acquire the neccessary reference in delist_dev_locked()
and inform destroy_devl() about it using CDP_UNREF_DTR flag.

Fix some style nits, add asserts.

Discussed with:	hselasky
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-19 17:36:52 +00:00
kib
aa0ac99391 Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process.  Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with:	jilles, rwatson
Man page fixes by:	rwatson
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-18 15:13:11 +00:00
kib
53832db395 Make SIGSTOP working for sleeps done while waiting for fifo readers or
writers in open(2), when the fifo is located on an NFS mount.

Reported by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-18 15:03:26 +00:00
kib
f4ea6035eb For sigaction(2), ignore possible garbage in sa_flags for sa_handler
== SIG_DFL or SIG_IGN.  Sloppy code does not fully initialize struct
sigaction for such cases, and being too demanding in the case of
default handler does not catch anything.

Reported and tested by:	Alex Tutubalin <lexa@lexa.ru>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-16 07:06:58 +00:00
hselasky
a9eed96a6b Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
  CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
  spinlocks.
- Switching the callout CPU number cannot always be done on a
  per-callout basis. See the updated timeout(9) manual page for more
  information.
- The timeout(9) manual page has been updated to reflect how all the
  functions inside the callout API are working. The manual page has
  been made function oriented to make it easier to deduce how each of
  the functions making up the callout API are working without having
  to first read the whole manual page. Group all functions into a
  handful of sections which should give a quick top-level overview
  when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
  to reduce the complexity in the callout code and to avoid problems
  about atomically stopping callouts via callout_stop(). If someone
  needs it, it can be re-added. From my quick grep there are no
  CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
  added. See the updated timeout(9) manual page for a complete
  description.
- Update the callout clients in the "kern/" folder to use the callout
  API properly, like cv_timedwait(). Previously there was some custom
  sleepqueue code in the callout subsystem, which has been removed,
  because we now allow callouts to be protected by spinlocks. This
  allows us to tear down the callout like done with regular mutexes,
  and a "td_slpmutex" has been added to "struct thread" to atomically
  teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
  "SWT_SLEEPQTIMO" states can now be completely removed. Currently
  they are marked as available and will be cleaned up in a follow up
  commit.
- Bump the __FreeBSD_version to indicate kernel modules need
  recompilation.
- There has been several reports that this patch "seems to squash a
  serious bug leading to a callout timeout and panic".

Kernel build testing:	all architectures were built
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D1438
Sponsored by:		Mellanox Technologies
Reviewed by:		jhb, adrian, sbruno and emaste
2015-01-15 15:32:30 +00:00
rwatson
08b6ceecbb In order to support ongoing work to implement variable-size mbufs, and
more generally make it easier to extend 'struct mbuf in the future', make
a number of changes to the data structure:

- As we anticipate embedding mbufs headers within variable-size regions of
  memory in the future, change the definitions of byte arrays embedded in
  mbufs to be of size [0] rather than [MLEN] and [MHLEN].  In fact, the
  cxgbe driver already uses 'struct mbuf' on the front of other storage
  sizes, but we would like the global mbuf allocator do be able to do this
  as well.

- Fold 'struct m_hdr' into 'struct mbuf' itself, eliminating a set of
  macros that aliased 'mh_foo' field names to 'm_foo' names such as
  'm_next'.  These present a particular problem as we would like to add
  new mbuf-header fields -- e.g., 'm_size' -- that, if similarly named via
  macros, would introduce collisions with many other variable names in the
  kernel.

- Rename 'struct m_ext' to 'struct struct_m_ext' so that we can add
  compile-time assertions without bumping into the still-extant 'm_ext'
  macro.

- Remove the MSIZE compile-time assertion for 'struct mbuf', but add new
  assertions for alignment of embedded data arrays (64-bit alignment even
  on 32-bit platforms), and for the sizes the mbuf header, packet header,
  and m_ext structure.

- Document that these assertions exist in comments in mbuf.h.

This change is not intended to cause (non-trivial) behavioural
differences, but is a precursor to further mbuf-allocator work.

Differential Revision:	https://reviews.freebsd.org/D1483
Reviewed by:	bz, gnn, np, glebius ("go ahead, I trust you")
Sponsored by:	EMC / Isilon Storage Division
2015-01-14 23:44:00 +00:00
hselasky
b04cbf0c36 Avoid race with "dev_rel()" when using the recently added
"delist_dev()" function. Make sure the character device structure
doesn't go away until the end of the "destroy_dev()" function due to
concurrently running cleanup code inside "devfs_populate()".

MFC after:	1 week
Reported by:	dchagin@
2015-01-14 22:07:13 +00:00
hselasky
99b9110513 Add a kernel function to delist our kernel character devices, so that
the device name can be re-used right away in case we are destroying
the character devices in the background.

MFC after:	4 days
Reported by:	dchagin@
2015-01-14 14:04:29 +00:00
jamie
15f3ae0c52 Remove the prison flags PR_IP4_DISABLE and PR_IP6_DISABLE, which have been
write-only for as long as they've existed.
2015-01-14 04:50:28 +00:00
jamie
20db074137 Don't set prison's pr_ip4s or pr_ip6s to -1.
PR:		196474
MFC after:	3 days
2015-01-14 03:52:41 +00:00
kib
79db3369f9 Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 08:58:07 +00:00
ian
f0876ef9ac Fix an off by one in ppsratecheck(). If you asked for N=1 you'd get one,
but for any N>1 you'd get N-1 packets/events per second.
2015-01-11 20:48:29 +00:00
rwatson
ac02dc452c Garbage collect m_copymdata(), an mbuf utility routine introduced
in FreeBSD 7 that has not been used since.  It contains a number
of unresolved bugs including an inverted bcopy() and incorrect
handling of read-only mbufs using internal storage.  Removing this
unused code is substantially essier than fixing it in order to
update it to the coming mbuf world order -- but it can always be
restored from revision history if it turns out to prove useful for
future work.

Pointed out by:	jmallett
Sponsored by:	EMC / Isilon Storage Division
2015-01-10 10:41:23 +00:00
dchagin
990c8c285b Allow clock_getcpuclockid() on the CPU-time clock for zombie process.
Posix does not prohibit this.

Differential Revision:	https://reviews.freebsd.org/D1470

Reviewed by:	kib
MFC after:	1 week
2015-01-10 07:22:38 +00:00
delphij
27945e456a Improve style and fix a possible use-after-free case introduced in r268384
by reinitializing the 'freestate' pointer after freeing the memory.

Obtained from:	HardenedBSD (71fab80c5dd3034b71a29a61064625018671bbeb)
PR:		194525
Submitted by:	Oliver Pinter <oliver.pinter@hardenedbsd.org>
MFC after:	2 weeks
2015-01-10 06:48:35 +00:00
rwatson
48e613fbd8 Remove a 'This is dumb' comment that has been incorrect for at least a
decade: m_pulldown() is willing to consider ordinary mbufs writable.
Retain another, related, and also outdated comment, but with a caveat
that it is partially stale.  Do not, for now, address the problem that
it raises (that only EXT_CLUSTER external storage is considered
writable, regardless of the results of M_WRITABLE() on the mbuf).

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2015-01-09 12:08:51 +00:00
jhb
69eebc007b Change the default method for device_quiesce() to return 0 instead of
EOPNOTSUPP.  The current behavior can mask real quiesce errors since
devclass_quiesce_driver() stops iterating over drivers as soon as it
gets an error (incluiding EOPNOTSUPP), but the caller it returns the
error to explicitly ignores EOPNOTSUPP.

Reviewed by:	imp
2015-01-08 21:46:28 +00:00
jhb
8189659be8 Reject attempts to read the cpuset mask of a negative domain ID. 2015-01-08 19:11:14 +00:00
jhb
06e75f0dba Create a cpuset mask for each NUMA domain that is available in the
kernel via the global cpuset_domain[] array. To export these to userland,
add a CPU_WHICH_DOMAIN level that can be used to fetch the mask for a
specific domain. Add a -d flag to cpuset(1) that can be used to fetch
the mask for a given domain.

Differential Revision:	https://reviews.freebsd.org/D1232
Submitted by:	jeff (kernel bits)
Reviewed by:	adrian, jeff
2015-01-08 15:53:13 +00:00
rwatson
a546fbcd7c Replace hand-crafted versions of M_SIZE() and M_START() in uipc_mbuf.c
with calls to the centralised macros, reducing direct use of MLEN and
MHLEN.

Differential Revision:	https://reviews.freebsd.org/D1444
Reviewed by:	bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-08 11:16:21 +00:00
markj
7e7e145818 Factor out duplicated code from dumpsys() on each architecture into generic
code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly
identical and simply redefine a number of constants and helper subroutines;
a generic implementation will make it easier to implement features around
kernel core dumps. This change does not alter any minidump code and should
have no functional impact.

PR:		193873
Differential Revision:	https://reviews.freebsd.org/D904
Submitted by:	Conrad Meyer <conrad.meyer@isilon.com>
Reviewed by:	jhibbits (earlier version)
Sponsored by:	EMC / Isilon Storage Division
2015-01-07 01:01:39 +00:00
markj
457a8d982b Use crcopysafe(9) to make a copy of a process' credential struct. crcopy(9)
may perform a blocking memory allocation, which is unsafe when holding a
mutex.

Differential Revision:	https://reviews.freebsd.org/D1443
Reviewed by:	rwatson
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 23:07:22 +00:00
jhb
b607899981 Trim trailing whitespace. 2015-01-05 20:50:44 +00:00
jhb
55d0376a65 On some Intel CPUs with a P-state but not C-state invariant TSC the TSC
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST).  Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.

PR:		192316
Differential Revision:	https://reviews.freebsd.org/D1441
No objection from:	jkim
MFC after:	1 week
2015-01-05 20:44:44 +00:00
rwatson
1c44e71143 To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision:	https://reviews.freebsd.org/D1436
Reviewed by:	glebius, trasz, gnn, bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 09:58:32 +00:00
gibbs
7646916ff0 Prevent live-lock and access of destroyed data in taskqueue_drain_all().
Phabric:	https://reviews.freebsd.org/D1247
Reviewed by:	jhb, avg
Sponsored by:	Spectra Logic Corporation

sys/kern_subr_taskqueue.c:
	Modify taskqueue_drain_all() processing to use a temporary
	"barrier task", rather than rely on a user task that may
	be destroyed during taskqueue_drain_all()'s execution.  The
	barrier task is queued behind all previously queued tasks
	and then has its priority elevated so that future tasks
	cannot pass it in the queue.

	Use a similar barrier scheme to drain threads processing
	current tasks.  This requires taskqueue_run_locked() to
	insert and remove the taskqueue_busy object for the running
	thread for every task processed.

share/man/man9/taskqueue.9:
	Remove warning about live-lock issues with taskqueue_drain_all()
	and indicate that it does not wait for tasks queued after
	it begins processing.
2015-01-04 19:55:44 +00:00
dchagin
6996603344 Regen for r276654 (__getcwd()). 2015-01-04 10:40:23 +00:00
dchagin
e777b160fa Indeed, instead of hiding the kern___getcwd() bug by bogus cast
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.

These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.

While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.

Pointed out by:	bde

MFC after:	1 week
2015-01-04 10:34:02 +00:00
hselasky
dd66ca979a Rework r276532 a bit. Always avoid recursing into the console drivers
clients, hence they might not handle it very well. This change allows
debugging mutex problems with kernel console drivers when
"debug.witness.skipspin=0" is set in the boot environment.

MFC after:	1 week
2015-01-03 17:21:19 +00:00
hselasky
c81bb57ba0 The "cnputs_mtx" mutex must be allowed to recurse. Debug prints and/or
witness printouts in the console driver clients can cause this mutex
to recurse by calls to "printf()" from witness for example. In
particular this can happen if "debug.witness.skipspin=0" is set in the
boot environment.

MFC after:	1 week
2015-01-02 13:10:33 +00:00
mjg
09f7a46ec7 Convert vfs hash lock from a mutex to an rwlock. 2014-12-30 21:40:45 +00:00