13479 Commits

Author SHA1 Message Date
Konstantin Belousov
eda6009c04 Add a sysctl kern.disallow_high_osrel which disables executing the
images compiled on the world with higher major version number than the
high version number of the booted kernel.  Default to disable.

Sponsored by:	The FreeBSD Foundation
Discussed with:	bapt
MFC after:	1 week
2013-10-15 06:38:40 +00:00
Konstantin Belousov
cd4dd444dd By default, allow up to SSIZE_MAX i/o for non-devfs files.
Sponsored by:	The FreeBSD Foundation
Reminded by:	Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after:	1 month
X-MFC-note:	stable/10 only
2013-10-15 06:35:22 +00:00
Konstantin Belousov
bf3e483b44 Similar to debug.iosize_max_clamp sysctl, introduce
devfs_iosize_max_clamp sysctl, which allows/disables SSIZE_MAX-sized
i/o requests on the devfs files.

Sponsored by:	The FreeBSD Foundation
Reminded by:	Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after:	1 week
2013-10-15 06:33:10 +00:00
Mark Murray
cc4d059c03 Merge from project branch. Uninteresting commits are trimmed.
Refactor of /dev/random device. Main points include:

* Userland seeding is no longer used. This auto-seeds at boot time
on PC/Desktop setups; this may need some tweaking and intelligence
from those folks setting up embedded boxes, but the work is believed
to be minimal.

* An entropy cache is written to /entropy (even during installation)
and the kernel uses this at next boot.

* An entropy file written to /boot/entropy can be loaded by loader(8)

* Hardware sources such as rdrand are fed into Yarrow, and are no
longer available raw.

------------------------------------------------------------------------
r256240 | des | 2013-10-09 21:14:16 +0100 (Wed, 09 Oct 2013) | 4 lines

Add a RANDOM_RWFILE option and hide the entropy cache code behind it.
Rename YARROW_RNG and FORTUNA_RNG to RANDOM_YARROW and RANDOM_FORTUNA.
Add the RANDOM_* options to LINT.

------------------------------------------------------------------------
r256239 | des | 2013-10-09 21:12:59 +0100 (Wed, 09 Oct 2013) | 2 lines

Define RANDOM_PURE_RNDTEST for rndtest(4).

------------------------------------------------------------------------
r256204 | des | 2013-10-09 18:51:38 +0100 (Wed, 09 Oct 2013) | 2 lines

staticize struct random_hardware_source

------------------------------------------------------------------------
r256203 | markm | 2013-10-09 18:50:36 +0100 (Wed, 09 Oct 2013) | 2 lines

Wrap some policy-rich code in 'if NOTYET' until we can thresh out
what it really needs to do.

------------------------------------------------------------------------
r256184 | des | 2013-10-09 10:13:12 +0100 (Wed, 09 Oct 2013) | 2 lines

Re-add /dev/urandom for compatibility purposes.

------------------------------------------------------------------------
r256182 | des | 2013-10-09 10:11:14 +0100 (Wed, 09 Oct 2013) | 3 lines

Add missing include guards and move the existing ones out of the
implementation namespace.

------------------------------------------------------------------------
r256168 | markm | 2013-10-08 23:14:07 +0100 (Tue, 08 Oct 2013) | 10 lines

Fix some just-noticed problems:

o Allow this to work with "nodevice random" by fixing where the
MALLOC pool is defined.

o Fix the explicit reseed code. This was correct as submitted, but
in the project branch doesn't need to set the "seeded" bit as this
is done correctly in the "unblock" function.

o Remove some debug ifdeffing.

o Adjust comments.

------------------------------------------------------------------------
r256159 | markm | 2013-10-08 19:48:11 +0100 (Tue, 08 Oct 2013) | 6 lines

Time to eat crow for me.

I replaced the sx_* locks that Arthur used with regular mutexes;
this turned out to be the wrong thing to do as the locks need to
be sleepable. Revert this folly.

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (In original diff)

------------------------------------------------------------------------
r256138 | des | 2013-10-08 12:05:26 +0100 (Tue, 08 Oct 2013) | 10 lines

Add YARROW_RNG and FORTUNA_RNG to sys/conf/options.

Add a SYSINIT that forces a reseed during proc0 setup, which happens
fairly late in the boot process.

Add a RANDOM_DEBUG option which enables some debugging printf()s.

Add a new RANDOM_ATTACH entropy source which harvests entropy from the
get_cyclecount() delta across each call to a device attach method.

------------------------------------------------------------------------
r256135 | markm | 2013-10-08 07:54:52 +0100 (Tue, 08 Oct 2013) | 8 lines

Debugging. My attempt at EVENTHANDLER(multiuser) was a failure; use
EVENTHANDLER(mountroot) instead.

This means we can't count on /var being present, so something will
need to be done about harvesting /var/db/entropy/... .

Some policy now needs to be sorted out, and a pre-sync cache needs
to be written, but apart from that we are now ready to go.

Over to review.

------------------------------------------------------------------------
r256094 | markm | 2013-10-06 23:45:02 +0100 (Sun, 06 Oct 2013) | 8 lines

Snapshot.

Looking pretty good; this mostly works now. New code includes:

* Read cached entropy at startup, both from files and from loader(8)
preloaded entropy. Failures are soft, but announced. Untested.

* Use EVENTHANDLER to do above just before we go multiuser. Untested.

------------------------------------------------------------------------
r256088 | markm | 2013-10-06 14:01:42 +0100 (Sun, 06 Oct 2013) | 2 lines

Fix up the man page for random(4). This mainly removes no-longer-relevant
details about HW RNGs, reseeding explicitly and user-supplied
entropy.

------------------------------------------------------------------------
r256087 | markm | 2013-10-06 13:43:42 +0100 (Sun, 06 Oct 2013) | 6 lines

As userland writing to /dev/random is no more, remove the "better
than nothing" bootstrap mode.

Add SWI harvesting to the mix.

My box seeds Yarrow by itself in a few seconds! YMMV; more to follow.

------------------------------------------------------------------------
r256086 | markm | 2013-10-06 13:40:32 +0100 (Sun, 06 Oct 2013) | 11 lines

Debug run. This now works, except that the "live" sources haven't
been tested. With all sources turned on, this unlocks itself in
a couple of seconds! That is on my box, and there is no guarantee
that this will be the case everywhere.

* Cut debug prints.

* Use the same locks/mutexes all the way through.

* Be a tad more conservative about entropy estimates.

------------------------------------------------------------------------
r256084 | markm | 2013-10-06 13:35:29 +0100 (Sun, 06 Oct 2013) | 5 lines

Don't use the "real" assembler mnemonics; older compilers may not
understand them (like when building CURRENT on 9.x).

# Submitted by:	Konstantin Belousov <kostikbel@gmail.com>

------------------------------------------------------------------------
r256081 | markm | 2013-10-06 10:55:28 +0100 (Sun, 06 Oct 2013) | 12 lines

SNAPSHOT.

Simplify the malloc pools; We only need one for this device.

Simplify the harvest queue.

Marginally improve the entropy pool hashing, making it a bit faster
in the process.

Connect up the hardware "live" source harvesting. This is simplistic
for now, and will need to be made rate-adaptive.

All of the above passes a compile test but needs to be debugged.

------------------------------------------------------------------------
r256042 | markm | 2013-10-04 07:55:06 +0100 (Fri, 04 Oct 2013) | 25 lines

Snapshot. This passes the build test, but has not yet been finished or debugged.

Contains:

* Refactor the hardware RNG CPU instruction sources to feed into
the software mixer. This is unfinished. The actual harvesting needs
to be sorted out. Modified by me (see below).

* Remove 'frac' parameter from random_harvest(). This was never
used and adds extra code for no good reason.

* Remove device write entropy harvesting. This provided a weak
attack vector, and was not very good at bootstrapping the device. To
follow will be a replacement explicit reseed knob.

* Separate out all the RANDOM_PURE sources into separate harvest
entities. This adds some security in the case where more than one
is present.

* Review all the code and fix anything obviously messy or inconsistent.
Address some review concerns while I'm here, like rename the pseudo-rng
to 'dummy'.

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (the first item)

------------------------------------------------------------------------
r255319 | markm | 2013-09-06 18:51:52 +0100 (Fri, 06 Sep 2013) | 4 lines

Yarrow wants entropy estimations to be conservative; the usual idea
is that if you are certain you have N bits of entropy, you declare
N/2.

------------------------------------------------------------------------
r255075 | markm | 2013-08-30 18:47:53 +0100 (Fri, 30 Aug 2013) | 4 lines

Remove short-lived idea; thread to harvest (eg) RDRAND entropy into the
usual harvest queues. It was a nifty idea, but too heavyweight.

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com>

------------------------------------------------------------------------
r255071 | markm | 2013-08-30 12:42:57 +0100 (Fri, 30 Aug 2013) | 4 lines

Separate out the Software RNG entropy harvesting queue and thread
into its own files.

# Submitted by:	 Arthur Mesh <arthurmesh@gmail.com>

------------------------------------------------------------------------
r254934 | markm | 2013-08-26 20:07:03 +0100 (Mon, 26 Aug 2013) | 2 lines

Remove the short-lived namei experiment.

------------------------------------------------------------------------
r254928 | markm | 2013-08-26 19:35:21 +0100 (Mon, 26 Aug 2013) | 2 lines

Snapshot; Do some running repairs on entropy harvesting. More needs
to follow.

------------------------------------------------------------------------
r254927 | markm | 2013-08-26 19:29:51 +0100 (Mon, 26 Aug 2013) | 15 lines

Snapshot of current work;

1) Clean up namespace; only use "Yarrow" where it is Yarrow-specific
or close enough to the Yarrow algorithm. For the rest use a neutral
name.

2) Tidy up headers; put private stuff in private places. More could
be done here.

3) Streamline the hashing/encryption; no need for a 256-bit counter;
128 bits will last for long enough.

There are bits of debug code lying around; these will be removed
at a later stage.

------------------------------------------------------------------------
r254784 | markm | 2013-08-24 14:54:56 +0100 (Sat, 24 Aug 2013) | 39 lines

1) example (partially humorous random_adaptor, that I call "EXAMPLE")
 * It's not meant to be used in a real system, it's there to show
   the basics of how to create interfaces for random_adaptors. Perhaps
   it should belong in a manual page

2) Move probe.c's functionality in to random_adaptors.c
 * rename random_ident_hardware() to random_adaptor_choose()

3) Introduce a new way to choose (or select) random_adaptors via tunable
"rngs_want" It's a list of comma separated names of adaptors, ordered
by preferences. I.e.:
rngs_want="yarrow,rdrand"

Such setting would cause yarrow to be preferred to rdrand. If neither of
them are available (or registered), then system will default to
something reasonable (currently yarrow). If yarrow is not present, then
we fall back to the adaptor that's first on the list of registered
adaptors.

4) Introduce a way where RNGs can play a role of entropy source. This is
mostly useful for HW rngs.

The way I envision this is that every HW RNG will use this
functionality by default. Functionality to disable this is also present.
I have an example of how to use this in random_adaptor_example.c (see
modload event, and init function)

5) fix kern.random.adaptors from
kern.random.adaptors: yarrowpanicblock
to
kern.random.adaptors: yarrow,panic,block

6) add kern.random.active_adaptor to indicate currently selected
adaptor:
root@freebsd04:~ # sysctl kern.random.active_adaptor
kern.random.active_adaptor: yarrow

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com>

Submitted by:	Dag-Erling Smørgrav <des@FreeBSD.org>, Arthur Mesh <arthurmesh@gmail.com>
Reviewed by:	des@FreeBSD.org
Approved by:	re (delphij)
Approved by:	secteam (des,delphij)
2013-10-12 12:57:57 +00:00
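The selection order described for the "rngs_want" tunable in r254784 can be sketched in userland C. This is only an illustration of the stated policy (first wanted name that is registered, else yarrow, else the first registered adaptor), not the kernel implementation: choose_adaptor() and find_registered() are hypothetical names, and the registered adaptors are modelled as a plain array of strings.

```c
#include <stddef.h>
#include <string.h>

/* Return the registered entry equal to name[0..n), or NULL if absent. */
static const char *
find_registered(const char *const *reg, size_t nreg, const char *name, size_t n)
{
	for (size_t i = 0; i < nreg; i++)
		if (strlen(reg[i]) == n && memcmp(reg[i], name, n) == 0)
			return (reg[i]);
	return (NULL);
}

/*
 * Pick an adaptor from a comma-separated preference list (hypothetical
 * helper).  Falls back to "yarrow", then to the first registered entry.
 */
const char *
choose_adaptor(const char *want, const char *const *reg, size_t nreg)
{
	const char *hit;

	while (want != NULL && *want != '\0') {
		size_t n = strcspn(want, ",");	/* length of current name */

		if ((hit = find_registered(reg, nreg, want, n)) != NULL)
			return (hit);
		want += n;
		if (*want == ',')
			want++;			/* skip the separator */
	}
	if ((hit = find_registered(reg, nreg, "yarrow", 6)) != NULL)
		return (hit);
	return (nreg > 0 ? reg[0] : NULL);
}
```

With "yarrow" and "rdrand" both registered, a preference string of "yarrow,rdrand" selects yarrow and "rdrand,yarrow" selects rdrand.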
John Baldwin
d251e7006b Ignore attempts to set the nmbcluster sysctls to their current value
rather than failing with an error.

Reviewed by:	andre
Approved by:	re (delphij)
MFC after:	2 weeks
2013-10-10 16:11:34 +00:00
Mark Murray
72acff0f07 MFC - tracking commit. 2013-10-09 21:03:34 +00:00
Konstantin Belousov
acb9d2c7f0 The device vnodes are often unlocked when bread() or bwrite() is
called.  This probably should be fixed eventually, but for now it is
not needed to try to flush such vnodes from the buffer allocation
context.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-09 18:45:01 +00:00
Konstantin Belousov
2c1531e746 Do not flush buffers when the v_object of the passed vnode does not
really belong to it. Such vnodes, with the pointers to other vnodes
v_objects, are typically instantiated by the bypass filesystems.
Invalidating mappings of other vnode pages and the pages is wrong,
since reclamation of the upper vnode does not imply that lower vnode
is reclaimed too.

One of the consequences of the improper reclamation was destruction of
the wired mappings of the lower vnode pages, triggering miscellaneous
assertions in the VM system.

Reported by:    John Marshall <john.marshall@riverwillow.com.au>
Tested by:      John Marshall <john.marshall@riverwillow.com.au>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-09 18:43:29 +00:00
Konstantin Belousov
1744fe5048 When growing the file descriptor table, new larger memory chunk is
allocated, but the old table is kept around to handle the case of
threads still performing unlocked accesses to it.

Grow the table exponentially instead of increasing its size by
sizeof(long) * 8 chunks when overflowing. This mode significantly
reduces the total memory use for the processes consuming large numbers
of the file descriptors which open them one by one.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-10-09 18:41:35 +00:00
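The memory trade-off described in 1744fe5048 can be modelled outside the kernel. The sketch below is not the kernel code: retained_linear() and retained_doubling() are hypothetical helpers that add up the size of every table allocated along the way, assuming (as the commit states) that old tables are kept around rather than freed. The chunk size is a parameter here; the commit gives it as sizeof(long) * 8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Total entries held across all tables when the table grows by a fixed
 * chunk each time and every superseded table stays allocated.
 */
size_t
retained_linear(size_t start, size_t chunk, size_t want)
{
	size_t size = start, total = start;

	assert(chunk > 0);
	while (size < want) {
		size += chunk;
		total += size;
	}
	return (total);
}

/* Same accounting when the table doubles on each overflow. */
size_t
retained_doubling(size_t start, size_t want)
{
	size_t size = start, total = start;

	assert(start > 0);
	while (size < want) {
		size *= 2;
		total += size;
	}
	return (total);
}
```

Starting from 64 entries with a chunk of 64 and growing until 1024 descriptors fit, the linear policy retains 8704 entries in total and doubling retains 1984. The model ignores per-table overhead.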
Konstantin Belousov
3625bde45d Reduce code duplication, introduce the getmaxfd() helper to calculate
the max filedescriptor index.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-10-09 18:39:44 +00:00
Mark Murray
371cbaafa8 MFC - tracking commit 2013-10-09 17:41:47 +00:00
Gleb Smirnoff
1d2df300e9 - Substitute sbdrop_internal() with sbcut_internal(). The latter doesn't free
mbufs, but return chain of free mbufs to a caller. Caller can either reuse
  them or return to allocator in a batch manner.
- Implement sbdrop()/sbdrop_locked() as a wrapper around sbcut_internal().
- Expose sbcut_locked() for outside usage.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Approved by:	re (marius)
2013-10-09 11:57:53 +00:00
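The "cut instead of drop" idea in 1d2df300e9 can be shown on a toy buffer chain. This is a simplified sketch, not the socket buffer code: struct buf and chain_cut() are hypothetical stand-ins, and real mbuf handling (accounting, records, control data) is left out. The point is only that fully consumed buffers are unlinked and handed back to the caller, who can reuse them or free them in one batch.

```c
#include <stddef.h>

/* Toy stand-in for an mbuf: 'off' and 'len' describe the live data. */
struct buf {
	struct buf *next;
	size_t len;
	size_t off;
};

/*
 * Remove 'len' bytes from the front of the chain at *head.  Buffers that
 * are consumed entirely are unlinked and returned as a chain; nothing is
 * freed here.  A partially consumed buffer is trimmed in place.
 */
struct buf *
chain_cut(struct buf **head, size_t len)
{
	struct buf *cut = NULL, **tail = &cut;

	while (len > 0 && *head != NULL) {
		struct buf *b = *head;

		if (b->len > len) {
			/* Partial: advance the data offset and stop. */
			b->off += len;
			b->len -= len;
			break;
		}
		len -= b->len;
		*head = b->next;	/* unlink from the live chain */
		b->next = NULL;
		*tail = b;		/* append to the returned chain */
		tail = &b->next;
	}
	return (cut);
}
```

A drop-style wrapper, in this model, would call chain_cut() and then release the returned chain itself.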
Dag-Erling Smørgrav
db3fcaf970 Add YARROW_RNG and FORTUNA_RNG to sys/conf/options.
Add a SYSINIT that forces a reseed during proc0 setup, which happens
fairly late in the boot process.

Add a RANDOM_DEBUG option which enables some debugging printf()s.

Add a new RANDOM_ATTACH entropy source which harvests entropy from the
get_cyclecount() delta across each call to a device attach method.
2013-10-08 11:05:26 +00:00
Mark Murray
6e818c871f Debugging. My attempt at EVENTHANDLER(multiuser) was a failure; use EVENTHANDLER(mountroot) instead.
This means we can't count on /var being present, so something will need to be done about harvesting /var/db/entropy/... .

Some policy now needs to be sorted out, and a pre-sync cache needs to be written, but apart from that we are now ready to go.

Over to review.
2013-10-08 06:54:52 +00:00
Mark Murray
1a3c1f06dd Snapshot.
Looking pretty good; this mostly works now. New code includes:

* Read cached entropy at startup, both from files and from loader(8) preloaded entropy. Failures are soft, but announced. Untested.

* Use EVENTHANDLER to do above just before we go multiuser. Untested.
2013-10-06 22:45:02 +00:00
Mark Murray
12babbf219 MFC - tracking commit 2013-10-06 09:37:57 +00:00
Konstantin Belousov
505cdd82bf Remove the uipc_cow.c file, which is not used since the zero copy
sockets removal.

Noted by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-10-06 06:57:28 +00:00
Alan Cox
61083fcc61 Tidy up kmeminit(): Since r245575, 'nmbclusters' is calculated after
kmeminit() runs, so it contributes nothing to 'vm_kmem_size'; update a
comment to reflect that r254025 replaced the kmem submap with the kmem
arena.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	EMC / Isilon Storage Division
2013-10-05 18:53:03 +00:00
Mark Murray
3c59587daa MFC - tracking commit. 2013-10-04 07:00:59 +00:00
Mark Murray
f02e47dc1e Snapshot. This passes the build test, but has not yet been finished or debugged.
Contains:

* Refactor the hardware RNG CPU instruction sources to feed into
the software mixer. This is unfinished. The actual harvesting needs
to be sorted out. Modified by me (see below).

* Remove 'frac' parameter from random_harvest(). This was never
used and adds extra code for no good reason.

* Remove device write entropy harvesting. This provided a weak
attack vector, and was not very good at bootstrapping the device. To
follow will be a replacement explicit reseed knob.

* Separate out all the RANDOM_PURE sources into separate harvest
entities. This adds some security in the case where more than one
is present.

* Review all the code and fix anything obviously messy or inconsistent.
Address some review concerns while I'm here, like rename the pseudo-rng
to 'dummy'.

Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (the first item)
2013-10-04 06:55:06 +00:00
Sean Bruno
d3baefa809 Change len checks for fstypelen and fspathlen to be against absolute len
not strlen as they are *not* strings.

Discovered by GSOC student, Mike Ma <mikemandarine@gmail.com> during his
fuse.glusterfs port to FreeBSD.

Final patch from mckusick@

Submitted by:	mckusick@
Approved by:	re (hrs)
MFC after:	2 weeks
2013-10-03 22:52:03 +00:00
Konstantin Belousov
432e79fc33 When helping the bufdaemon from the buffer allocation context, there
is no sense to walk the whole dirty buffer queue.  We are only
interested in, and can operate on, the buffers owned by the current
vnode [1].  Instead of calling generic queue flush routine, do
VOP_FSYNC() if possible.

Holding the dirty buffer queue lock in the bufdaemon, without dropping
it, can cause starvation of buffer writes from other threads. This is
esp. easy to reproduce on the big memory machines, where large files
are written, causing almost all dirty buffers accumulating in several
big files, which vnodes are locked by writers. Bufdaemon cannot flush
any buffer, but is iterating over the whole dirty queue
continuously. Since dirty queue mutex is not dropped, bufdone() in
g_up thread is starved, usually deadlocking the machine [2]. Mitigate
this by dropping the queue lock after the vnode is locked, allowing
other queue lock contenders to make a progress.

Discussed with:	Jeff [1]
Reported by:	pho [2]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (hrs)
2013-10-02 06:00:34 +00:00
Konstantin Belousov
d6498b153e When printing the vnode information from ddb, print the lengths of the
dirty and clean buffer queues.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-01 20:18:33 +00:00
Konstantin Belousov
fe39412e99 For vunref(), try to upgrade the vnode lock if the function was called
with the vnode shared-locked.  If upgrade succeeded, the inactivation
can be done immediately, instead of being postponed.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-29 18:07:14 +00:00
Konstantin Belousov
ac34145005 Reimplement r255797 using LK_TRYUPGRADE.
r255797 was:
Increase the chance of the buffer write from the bufdaemon helper
context to succeed.  If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR:	kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-29 18:04:57 +00:00
Konstantin Belousov
7c6fe80353 Add LK_TRYUPGRADE operation for lockmgr(9), which attempts to
atomically upgrade shared lock to exclusive.  On failure, error is
returned and lock is not dropped in the process.

Tested by:      pho (previous version)
No objections from:     attilio
Sponsored by:   The FreeBSD Foundation
MFC after:      1 week
Approved by:	re (glebius)
2013-09-29 18:02:23 +00:00
John-Mark Gurney
da9442ef43 it must be the last member, not might...
Reviewed by:	attilio
Approved by:	re (delphij, gjb)
2013-09-26 17:55:04 +00:00
Konstantin Belousov
9d2abcd01a Do not allow negative timeouts for kqueue timers, check for the
negative timeout both before and after the conversion to sbintime_t.

For periodic kqueue timer, convert zero timeout into 1ms, to avoid
interrupt storm on fast event timers.

Reported and tested by:	pho
Discussed with:	mav
Reviewed by:	davide
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-26 13:17:31 +00:00
Konstantin Belousov
27884e3bd1 Acquire a hold reference on the vnode when a knote is instantiated.
Otherwise, knote keeps a pointer to a vnode which could become invalid
any time.

Reported by:	many
Tested by:	Patrick Lamaiziere <patfbsd@davenulle.org>
Discussed with:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-09-26 13:14:51 +00:00
Davide Italiano
1b0c144fc2 Make the callout arithmetic more robust adding checks for overflow.
Without these, if the timeout value passed is "large enough", the
value of the sum of it and other factors (e.g. current time as
returned by sbinuptime() or 'precision' argument) might result in a
negative number. This negative number is then passed to
eventtimers(4), which causes et_start() routine to load et_min_period
into eventtimer, making the CPU where the thread is stuck forever in
timer interrupt handler routine. This is now avoided rounding to
INT64_MAX the timeout period in case of overflow.

Reported by:	kib, pho
Discussed with:	kib, mav
Tested by:	pho (stress2 suite, kevent7.sh scenario)
Approved by:	re (kib)
2013-09-26 10:06:50 +00:00
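The overflow handling described in 1b0c144fc2 amounts to a saturating addition. The following is a minimal sketch under the assumption that both operands are non-negative 64-bit time values; sat_add_time() is a hypothetical name, not the function used in the callout code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Add two non-negative 64-bit time values.  If the true sum would exceed
 * INT64_MAX, return INT64_MAX instead of letting the result wrap to a
 * negative number.
 */
int64_t
sat_add_time(int64_t a, int64_t b)
{
	assert(a >= 0 && b >= 0);	/* assumption of this sketch */
	if (a > INT64_MAX - b)		/* a + b would overflow */
		return (INT64_MAX);
	return (a + b);
}
```

The comparison is done by subtraction so that the check itself cannot overflow.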
Attilio Rao
57a9eeb4ed Avoid memory accesses reordering which can result in fget_unlocked()
seeing a stale fd_ofiles table once fd_nfiles is already updated,
resulting in OOB accesses.

Approved by:	re (kib)
Sponsored by:	EMC / Isilon storage division
Reported and tested by:	pho
Reviewed by:	benno
2013-09-25 13:37:52 +00:00
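The ordering problem fixed in 57a9eeb4ed can be illustrated with C11 atomics. This is a sketch of the general publish/observe pattern, not the kernel change itself: struct fdtable_demo and the table_*() functions are hypothetical, the kernel uses its own atomic primitives rather than <stdatomic.h>, and the sketch assumes the table only grows and that the new table already contains the old entries.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for the descriptor table pointer and its size. */
struct fdtable_demo {
	_Atomic(int *) files;
	atomic_size_t nfiles;
};

void
table_init(struct fdtable_demo *t)
{
	atomic_init(&t->files, (int *)0);
	atomic_init(&t->nfiles, 0);
}

/*
 * Writer: install the larger table first, then publish the new size with
 * release semantics so the pointer store cannot be observed after it.
 */
void
table_publish(struct fdtable_demo *t, int *newfiles, size_t n)
{
	atomic_store_explicit(&t->files, newfiles, memory_order_relaxed);
	atomic_store_explicit(&t->nfiles, n, memory_order_release);
}

/*
 * Lock-free reader: load the size with acquire semantics before loading
 * the pointer, so a reader that sees the new size also sees a table at
 * least that large.  Returns 0 and stores the entry, or -1 if out of range.
 */
int
table_lookup(struct fdtable_demo *t, size_t idx, int *out)
{
	size_t n = atomic_load_explicit(&t->nfiles, memory_order_acquire);
	int *files;

	if (idx >= n)
		return (-1);
	files = atomic_load_explicit(&t->files, memory_order_relaxed);
	*out = files[idx];
	return (0);
}
```

A reader that sees the old size with the new pointer is still safe under the stated assumptions, because every index valid for the old size is also valid in the larger table.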
Alexander Motin
ea4af9c09a Make load average sampling asynchronous to hardclock ticks. This improves
measurement of load caused by time-related events still using hardclock.
For example, without this change dummynet, scheduling events each hardclock
tick, was always miscounted as load of 1.

There is still aliasing with events delayed by the new precision mechanism,
but it probably can't be avoided without moving this sampling from using
callout to some lower-level code or handling it in some other special way.

Reviewed by:	davide
Approved by:	re (marius)
2013-09-24 07:03:16 +00:00
Dag-Erling Smørgrav
0f7bc112c0 Always request zeroed memory, in case we're dumb enough to leak it later.
Approved by:	re (gjb)
2013-09-22 23:47:56 +00:00
Konstantin Belousov
12af71a69f Revert r255797. The LK_UPGRADE | LK_NOWAIT drops the lock.
Approved by:	re (marius, implicit)
2013-09-22 20:29:03 +00:00
Konstantin Belousov
19f6a6a1ca Pre-acquire the filedesc sx when a possibility exists that the later
code could need to remove a kqueue from the filedesc list.  Global
lock is already locked, which causes sleepable after non-sleepable
lock acquisition.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2013-09-22 19:54:47 +00:00
Konstantin Belousov
d1f8ca485d Increase the chance of the buffer write from the bufdaemon helper
context to succeed.  If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR:	kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-09-22 19:23:48 +00:00
Davide Italiano
cf6b879fad Consistently use the same value to indicate exclusively-held and
shared-held locks for all the primitives in lc_lock/lc_unlock routines.
This fixes the problems introduced in r255747, which indeed introduced an
inversion in the logic.

Reported by:	many
Tested by:	bdrewery, pho, lme, Adam McDougall, O. Hartmann
Approved by:	re (glebius)
2013-09-22 14:09:07 +00:00
Gleb Smirnoff
255c1caae3 - Create kern.ipc.sendfile namespace, and put the new "readhead" OID
there as "kern.ipc.sendfile.readahead".
- Push all nsfbuf related tunables into MD code. Don't move them
  to new namespace in favor of POLA.

Reviewed by:	scottl
Approved by:	re (gjb)
2013-09-22 13:36:52 +00:00
Justin T. Gibbs
255424ddb7 Fix ia64 and mips kernel builds due to XENHVM=>GENERIC integration in
revision 255744.

sys/kern/subr_smp.c:
	IPI_SUSPEND is only available on amd64 and i386.  Protect
	new uses of this constant with #ifdefs to avoid impacting
	other platforms.

Approved by:	re (blanket Xen)
2013-09-22 02:46:13 +00:00
Mark Johnston
8d305ba0dc Regenerate syscall argument strings after r255777.
Approved by:	re (gjb)
MFC after:	1 week
2013-09-21 23:06:36 +00:00
Mark Johnston
f17f2ffcdd Omit "__restrict" when generating syscall argument strings. DTrace doesn't
handle it and cannot determine the argument type when it's present.

Approved by:	re (gjb)
MFC after:	1 week
2013-09-21 23:05:44 +00:00
Davide Italiano
1f96759fb1 Fix callout_init_rm() in the shared case, allocating storage for 'struct
rm_priotracker' directly in the softclock thread. Now consumers can
pass CALLOUT_SHAREDLOCK flag to callout initialization routine safely.
The choice of the already existing flags instead of special casing
shared rmlocks is done to prevent consumer footshooting.

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	re (delphij)
2013-09-20 23:16:15 +00:00
Davide Italiano
7faf4d90e8 Fix lc_lock/lc_unlock() support for rmlocks held in shared mode. With
current lock classes KPI it was really difficult because there was no
way to pass an rmtracker object to the lock/unlock routines. In order
to accomplish the task, modify the aforementioned functions so that
they can return (or pass as argument) a uintptr_t, which is in the rm
case used to hold a pointer to struct rm_priotracker for current
thread. As an added bonus, this fixes rm_sleep() in the rm shared
case, which right now can communicate priotracker structure between
lc_unlock()/lc_lock().

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	re (delphij)
2013-09-20 23:06:21 +00:00
Justin T. Gibbs
566a5f5020 Merge Xen PVHVM support into the GENERIC kernel config for both
amd64 and i386.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
	- Introduce two new CPU hooks for initialization and resume
	  purposes. This allows us to get rid of the XENHVM ifdefs in
	  mp_machdep, and also sets some hooks into common code that can be
	  used by other hypervisor implementations.

sys/amd64/conf/XENHVM:
sys/i386/conf/XENHVM:
	- Remove these configs now that GENERIC has builtin support for Xen
	  HVM.

sys/kern/subr_smp.c:
	- Make sure there are no pending IPIs when suspending a system.

sys/x86/xen/hvm.c:
	- Add cpu init and resume vectors that are called from mp_machdep
	  using the new hooks.
	- Only clear the vcpu_info mapping data on resume.  It is already
	  clear for the BSP on a cold boot and is set correctly as APs
	  are started.
	- Gate xen_hvm_init_cpu only to systems running under Xen.

sys/x86/xen/xen_intr.c:
	 - Gate the setup of event channels only to systems running under Xen.
2013-09-20 22:59:22 +00:00
Justin T. Gibbs
428b7ca290 Add support for suspend/resume/migration operations when running as a
Xen PVHVM guest.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
	- Make sure that there are no MMU related IPIs pending on migration.
	- Reset pending IPI_BITMAP on resume.
	- Init vcpu_info on resume.

sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
	- Add a "suspend_cancelled" parameter to pic_resume().  For the
	  Xen PIC, restoration of interrupt services differs between
	  the aborted suspend and normal resume cases, so we must provide
	  this information.

sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
	- Don't swap out "suspend safe" timers across a suspend/resume
	  cycle.  This includes the Xen PV and ACPI timers.

sys/dev/xen/control/control.c:
	- Perform proper suspend/resume process for PVHVM:
		- Suspend all APs before going into suspension, this allows us
		  to reset the vcpu_info on resume for each AP.
		- Reset shared info page and callback on resume.

sys/dev/xen/timer/timer.c:
	- Implement suspend/resume support for the PV timer. Since FreeBSD
	  doesn't perform a per-cpu resume of the timer, we need to call
	  smp_rendezvous in order to correctly resume the timer on each CPU.

sys/dev/xen/xenpci/xenpci.c:
	- Don't reset the PCI interrupt on each suspend/resume.

sys/kern/subr_smp.c:
	- When suspending a PVHVM domain make sure there are no MMU IPIs
	  in-flight, or we will get a lockup on resume due to the fact that
	  pending event channels are not carried over on migration.
	- Implement a generic version of restart_cpus that can be used by
	  suspended and stopped cpus.

sys/x86/xen/hvm.c:
	- Implement resume support for the hypercall page and shared info.
	- Clear vcpu_info so it can be reset by APs when resuming from
	  suspension.

sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
	- Support UP kernel configurations.

sys/x86/xen/xen_intr.c:
	- Properly rebind per-cpus VIRQs and IPIs on resume.
2013-09-20 05:06:03 +00:00
John Baldwin
a566e8e3c5 Regen.
Approved by:	re (delphij)
2013-09-19 18:56:00 +00:00
John Baldwin
55648840de Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
Pawel Jakub Dawidek
3fded357af Fix panic in ktrcapfail() when no capability rights are passed.
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.

Reported by:	Daniel Peyrolon
Tested by:	Daniel Peyrolon
Approved by:	re (marius)
2013-09-18 19:26:08 +00:00
Roman Divacky
b12698e1a1 Revert r255672, it has some serious flaws, leaking file references etc.
Approved by:	re (delphij)
2013-09-18 18:48:33 +00:00
Roman Divacky
253c75c0de Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by:    re (delphij)
Sponsored by:   Google Summer of Code
Submitted by:   Yuri Victorovich <yuri at rawbw dot com>
Tested by:      Yuri Victorovich <yuri at rawbw dot com>
2013-09-18 17:56:04 +00:00