Commit Graph

13696 Commits

Author SHA1 Message Date
attilio
7ee4e910ce - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
kib
3c8b0e8428 Revert back to use int for the page counts. In vn_io_fault(), the i/o
is chunked to pieces limited by integer io_hold_cnt tunable, while
vm_fault_quick_hold_pages() takes integer max_count as the upper bound.

Rearrange the checks to correctly handle overflowing address arithmetic.

Submitted by:	bde
Tested by:	pho
Discussed with:	alc
MFC after:	1 week
2013-11-20 08:45:26 +00:00
avg
b757e82b43 taskqueue_cancel: garbage collect a write-only variable
MFC after:	3 days
2013-11-19 18:45:29 +00:00
jilles
10685f105c Fix siginfo_t.si_status for wait6/waitid/SIGCHLD.
Per POSIX, si_status should contain the value passed to exit() for
si_code==CLD_EXITED and the signal number for other si_code. This was
incorrect for CLD_EXITED and CLD_DUMPED.

This is still not fully POSIX-compliant (Austin group issue #594 says that
the full value passed to exit() shall be returned via si_status, not just
the low 8 bits) but is sufficient for a si_status-related test in libnih
(upstart, Debian/kFreeBSD).

PR:		kern/184002
Reported by:	Dmitrijs Ledkovs
Tested by:	Dmitrijs Ledkovs
2013-11-17 22:31:23 +00:00
pjd
6175f0915f Replace CAP_POLL_EVENT and CAP_POST_EVENT capability rights (which I had
a very hard time to fully understand) with much more intuitive rights:

	CAP_EVENT - when set on descriptor, the descriptor can be monitored
		with syscalls like select(2), poll(2), kevent(2).

	CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2)
		syscall can be called on this kqueue to with the eventlist
		argument set to non-NULL value; in other words the given
		kqueue descriptor can be used to monitor other descriptors.
	CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2)
		syscall can be called on this kqueue to with the changelist
		argument set to non-NULL value; in other words it allows to
		modify events monitored with the given kqueue descriptor.

Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and
CAP_KQUEUE_CHANGE.

Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2013-11-15 19:55:35 +00:00
jhb
32cc455aa9 Don't allow vfs.lorunningspace or vfs.hirunningspace to be set such
that lorunningspace is greater than hirunningspace as the system
performs terribly if it is mistuned in this fashion.

MFC after:	1 week
2013-11-15 15:29:53 +00:00
pjd
19aec859d4 Change cap_rights_merge(3) and cap_rights_remove(3) to return pointer
to the destination cap_rights_t structure.

This already matches manual page.

MFC after:	3 days
2013-11-14 22:59:20 +00:00
pjd
0ace53fcb3 Add a note that this file is compiled as part of the kernel and libc.
Requested by:	kib
MFC after:	3 days
2013-11-14 22:57:07 +00:00
glebius
fd01c5677d Fix a very bad typo from r248887.
Submitted by:	art
2013-11-14 09:45:33 +00:00
ray
7317ada90c MFC @r258091. 2013-11-13 13:41:36 +00:00
pluknet
ab417c0aca Add VM_LAST, a special last element in enum VM_GUEST and use it in CTASSERT
to ensure that vm_guest range is covered by vm_guest_sysctl_names.

Suggested by:	mjg
2013-11-12 20:13:10 +00:00
alc
4f2ec918f5 Eliminate the gratuitous use of mmap(2) flags from the implementation
of kern_shmat().  Use a simpler approach to determine whether to pass
VMFS_NO_SPACE or VMFS_OPTIMAL_SPACE to vm_map_find().
2013-11-12 17:46:11 +00:00
kib
3cdb6ad1ff Avoid overflow for the page counts.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-12 08:47:58 +00:00
pluknet
587aa2d7ea Set description string for VM_GUEST_HV (HyperV guest).
This fixes fallout from r256425.

Reported by:	Pavel Timofeev <timp87@gmail com>
Tested by:	Pavel Timofeev <timp87@gmail com>
Reviewed by:	Roger Pau Monnц╘
MFC after:	3 days
2013-11-11 16:14:33 +00:00
kib
9b734f0e00 If filesystem declares that it supports shared locking for writes, use
shared vnode lock for VOP_PUTPAGES() as well.  The only such
filesystem in the tree is ZFS, and it uses
vnode_pager_generic_putpages(), which performs the pageout with
VOP_WRITE().

Reviewed by:	alc
Discussed with:	avg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-11-09 20:36:29 +00:00
kib
f9f4aa68f7 Both vn_close() and VFS_PROLOGUE() evaluate vp->v_mount twice, without
holding the vnode lock; vp->v_mount is checked first for NULL
equiality, and then dereferenced if not NULL.  If vnode is reclaimed
meantime, second dereference would still give NULL.  Change
VFS_PROLOGUE() to evaluate the mp once, convert MNTK_SHARED_WRITES and
MNTK_EXTENDED_SHARED tests into inline functions.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-11-09 20:30:13 +00:00
hiren
6b902f8b12 Fix typo in a comment. 2013-11-08 20:11:15 +00:00
alc
6c60c88452 As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other
words, every architecture is now auto-sizing the kmem arena.  This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.

Replace or eliminate all existing definitions of VM_KMEM_SIZE.  With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures.  Use VM_KMEM_SIZE_MIN for
clarity.

Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size.  Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct.  In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not.  Remedy this inconsistency.  Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.

Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena.  Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.

Reviewed by:	kib (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-11-08 16:25:00 +00:00
ray
b8525cc352 MFC @r257740. 2013-11-06 11:16:05 +00:00
pjd
b88f8186c6 - Remove mac_get_fd/mac_set_fd - those are not syscalls. The __mac_get_fd() and
__mac_set_fd() syscalls are listed earlier.
- Correct typo in syscall name. It should be sched_rr_get_interval,
  not sched_rr_getinterval.

Submitted by:	David Drysdale <drysdale@google.com>
MFC after:	3 days
2013-11-06 07:46:10 +00:00
ray
c28e9771d6 MFC @r257698. 2013-11-05 12:55:28 +00:00
jilles
83816cd1d5 kqueue: Change error for kqueues rlimit from EMFILE to ENOMEM and document
this error condition in the kqueue(2) manual page.

Discussed with:	kib
2013-11-03 23:06:24 +00:00
mav
b8e600624b Make getenv_*() functions and respectively TUNABLE_*_FETCH() macros not
allocate memory and so not require sleepable environment.  getenv() has
already used on-stack temporary storage, so just use it more rationally.
getenv_string() receives buffer as argument, so don't need another one.
2013-11-01 10:32:33 +00:00
glebius
93563d8269 prison_check_ip4() can take const arguments. 2013-11-01 10:01:57 +00:00
emax
6faa9004b4 Rate limit (to once per minute) "Listen queue overflow" message in
sonewconn().

Reviewed by:	scottl, lstewart
Obtained from:	Netflix, Inc
MFC after:	2 weeks
2013-10-31 20:33:21 +00:00
ray
a2f43a3339 Add teken_subr_do_resize new method, to update taken sizes w/o reset positions
and use it in case we update terminal size not touching existing data.

Sponsored by:	The FreeBSD Foundation
2013-10-31 09:44:48 +00:00
ray
4c2f1ff059 Do not reset terminal state if it is not cleared on resize.
Sponsored by:	The FreeBSD Foundation
2013-10-28 14:00:06 +00:00
kib
79afbd5fdd Add bus_dmamap_load_ma() function to load map with the array of
vm_pages.  Provide trivial implementation which forwards the load to
_bus_dmamap_load_phys() page by page.  Right now all architectures use
bus_dmamap_load_ma_triv().

Tested by:	pho (as part of the functional patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2013-10-27 21:39:16 +00:00
kib
977e19e068 Fix typo.
MFC after:	3 days
2013-10-27 18:52:09 +00:00
kib
b385bc7f31 When reentering kdb, typically due to a bug causing trap or assert in
the code executed in the context of debugger, do not be ashamed to
inform loudly about the re-entry.  Also, print the backtrace before
obliterating current stack with longjmp, allowing the operator to see
a place which caused the bug.

The change should make it less mysterious debugging the ddb itself.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-10-27 16:20:52 +00:00
glebius
ff6e113f1b The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
ray
f690470e36 MFC @r257107. 2013-10-25 08:41:36 +00:00
markj
76ca16ff0f Redefine the io provider using the SDT(9) macros instead of doing everything
manually. This change has no functional impact.

Discussed with:	gnn
2013-10-24 02:39:07 +00:00
ray
3d72b52090 MFC @r256953 2013-10-23 09:21:14 +00:00
brooks
9e59bf1911 MFP4:
Change 221669 by bz@bz_zenith on 2013/02/01 12:26:04

        Run the initialization for polling earlier along with INTRs
        so that we can put network interface into polling mode by default
        if DEVICE_POLLING is compiled in and no interrupts are available.

MFC after:	3 days
Sponsored by:	DARPA/AFRL
2013-10-22 22:03:01 +00:00
ray
c3dff1dbba Add new terminal method terminal_set_winsize_blank. Same as terminal_set_winsize,
but with optional blank. That will allow us to see early messages after attach
more specific driver.

Sponsored by:	The FreeBSD Foundation
2013-10-22 14:00:46 +00:00
mav
f1e553e2f1 Remove global device lock acquisition from dev_relthread(), replacing it
with atomics on per-device data.
2013-10-22 10:40:26 +00:00
mav
4219fc0074 Merge GEOM direct dispatch changes from the projects/camlock branch.
When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.

The defined now safety requirements are:
 - caller should not hold any locks and should be reenterable;
 - callee should not depend on GEOM dual-threaded concurency semantics;
 - on the way down, if request is unmapped while callee doesn't support it,
   the context should be sleepable;
 - kernel thread stack usage should be below 50%.

To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
 - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
 - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
 - G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
 - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe.  If any of requirements are not met, request is queued to
g_up or g_down thread same as before.

Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).

To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION.  da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.

This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).

Sponsored by:	iXsystems, Inc.
MFC after:	2 months
2013-10-22 08:22:19 +00:00
mav
2470f9f00d Add comments that taskqueue_enqueue_locked() returns without the lock. 2013-10-21 21:16:50 +00:00
kib
80e669df9e Add a resource limit for the total number of kqueues available to the
user.  Kqueue now saves the ucred of the allocating thread, to
correctly decrement the counter on close.

Under some specific and not real-world use scenario for kqueue, it is
possible for the kqueues to consume memory proportional to the square
of the number of the filedescriptors available to the process.  Limit
allows administrator to prevent the abuse.

This is kernel-mode side of the change, with the user-mode enabling
commit following.

Reported and tested by:	pho
Discussed with:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-10-21 16:44:53 +00:00
kib
bbb34fefa1 Print more useful information about the transfer that trigger the assertion.
Other data is available with ddb command 'show pginfo'.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-10-21 16:17:46 +00:00
mav
a3e56e6934 MFprojects/camlock r256619:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer.  That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).

Move mapping removal before calling bio_done() method.  I believe that it is
very wrong to do anything to BIO after reporting completion.  kib@ thinks
it was done for some forgotten now case when bio_done() method needed mapped
buffer. But 1) if BIO was sent as unmapped, then IMO done() should be called
in the same way; 2) IMO there is no guatantee that buffer will be mapped at
this point at all, for example, if all underlying stack supports unmapped
I/O, so bio_done() handler can not expect that.
2013-10-21 06:44:55 +00:00
glebius
15bf763e72 Revert r256587.
Requested by:	zec
2013-10-18 11:26:40 +00:00
emaste
404c6c2644 Error out on failure to open specified config file 2013-10-16 17:03:46 +00:00
mav
73e769cd28 MFprojects/camlock r256370:
- Take BIO lock in biodone() only when there is no completion callback set
and so we should wake up thread waiting in biowait().
 - Remove msleep() timeout from biowait().  It was added 11 years ago, when
there was no locks used, and it should not be needed any more.
2013-10-16 09:56:40 +00:00
mav
26edd33dd3 MFprojects/camlock r254763:
Move tq_enqueue() call out of the queue lock for known handlers (actually
I have found no others in the base system).  This reduces queue lock hold
time and congestion spinning under active multithreaded enqueuing.
2013-10-16 09:52:59 +00:00
mav
39830ee862 MFprojects/camlock r254685:
Remove TQ_FLAGS_PENDING flag, softly duplicating queue emptiness status.
2013-10-16 09:48:23 +00:00
mav
2080607889 MFprojects/camlock r254905:
Introduce new function devstat_end_transaction_bio_bt(), adding new argument
to specify present time.  Use this function to move binuptime() out of lock,
substantially reducing lock congestion when slow timecounter is used.
2013-10-16 09:12:40 +00:00
glebius
d000cac124 For VIMAGE kernels store vnet in the struct task, and set vnet context
during task processing.

Reported & tested by:	mm
2013-10-16 05:02:01 +00:00
kib
45444cbe66 Add a sysctl kern.disallow_high_osrel which disables executing the
images compiled on the world with higher major version number than the
high version number of the booted kernel.  Default to disable.

Sponsored by:	The FreeBSD Foundation
Discussed with:	bapt
MFC after:	1 week
2013-10-15 06:38:40 +00:00
kib
5e67d6314a By default, allow up to SSIZE_MAX i/o for non-devfs files.
Sponsored by:	The FreeBSD Foundation
Reminded by:	Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after:	1 month
X-MFC-note:	stable/10 only
2013-10-15 06:35:22 +00:00
kib
3dc87905fa Similar to debug.iosize_max_clamp sysctl, introduce
devfs_iosize_max_clamp sysctl, which allows/disables SSIZE_MAX-sized
i/o requests on the devfs files.

Sponsored by:	The FreeBSD Foundation
Reminded by:	Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after:	1 week
2013-10-15 06:33:10 +00:00
markm
9c2f444a0a Merge from project branch. Uninteresting commits are trimmed.
Refactor of /dev/random device. Main points include:

* Userland seeding is no longer used. This auto-seeds at boot time
on PC/Desktop setups; this may need some tweeking and intelligence
from those folks setting up embedded boxes, but the work is believed
to be minimal.

* An entropy cache is written to /entropy (even during installation)
and the kernel uses this at next boot.

* An entropy file written to /boot/entropy can be loaded by loader(8)

* Hardware sources such as rdrand are fed into Yarrow, and are no
longer available raw.

------------------------------------------------------------------------
r256240 | des | 2013-10-09 21:14:16 +0100 (Wed, 09 Oct 2013) | 4 lines

Add a RANDOM_RWFILE option and hide the entropy cache code behind it.
Rename YARROW_RNG and FORTUNA_RNG to RANDOM_YARROW and RANDOM_FORTUNA.
Add the RANDOM_* options to LINT.

------------------------------------------------------------------------
r256239 | des | 2013-10-09 21:12:59 +0100 (Wed, 09 Oct 2013) | 2 lines

Define RANDOM_PURE_RNDTEST for rndtest(4).

------------------------------------------------------------------------
r256204 | des | 2013-10-09 18:51:38 +0100 (Wed, 09 Oct 2013) | 2 lines

staticize struct random_hardware_source

------------------------------------------------------------------------
r256203 | markm | 2013-10-09 18:50:36 +0100 (Wed, 09 Oct 2013) | 2 lines

Wrap some policy-rich code in 'if NOTYET' until we can thresh out
what it really needs to do.

------------------------------------------------------------------------
r256184 | des | 2013-10-09 10:13:12 +0100 (Wed, 09 Oct 2013) | 2 lines

Re-add /dev/urandom for compatibility purposes.

------------------------------------------------------------------------
r256182 | des | 2013-10-09 10:11:14 +0100 (Wed, 09 Oct 2013) | 3 lines

Add missing include guards and move the existing ones out of the
implementation namespace.

------------------------------------------------------------------------
r256168 | markm | 2013-10-08 23:14:07 +0100 (Tue, 08 Oct 2013) | 10 lines

Fix some just-noticed problems:

o Allow this to work with "nodevice random" by fixing where the
MALLOC pool is defined.

o Fix the explicit reseed code. This was correct as submitted, but
in the project branch doesn't need to set the "seeded" bit as this
is done correctly in the "unblock" function.

o Remove some debug ifdeffing.

o Adjust comments.

------------------------------------------------------------------------
r256159 | markm | 2013-10-08 19:48:11 +0100 (Tue, 08 Oct 2013) | 6 lines

Time to eat crow for me.

I replaced the sx_* locks that Arthur used with regular mutexes;
this turned out the be the wrong thing to do as the locks need to
be sleepable. Revert this folly.

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (In original diff)

------------------------------------------------------------------------
r256138 | des | 2013-10-08 12:05:26 +0100 (Tue, 08 Oct 2013) | 10 lines

Add YARROW_RNG and FORTUNA_RNG to sys/conf/options.

Add a SYSINIT that forces a reseed during proc0 setup, which happens
fairly late in the boot process.

Add a RANDOM_DEBUG option which enables some debugging printf()s.

Add a new RANDOM_ATTACH entropy source which harvests entropy from the
get_cyclecount() delta across each call to a device attach method.

------------------------------------------------------------------------
r256135 | markm | 2013-10-08 07:54:52 +0100 (Tue, 08 Oct 2013) | 8 lines

Debugging. My attempt at EVENTHANDLER(multiuser) was a failure; use
EVENTHANDLER(mountroot) instead.

This means we can't count on /var being present, so something will
need to be done about harvesting /var/db/entropy/... .

Some policy now needs to be sorted out, and a pre-sync cache needs
to be written, but apart from that we are now ready to go.

Over to review.

------------------------------------------------------------------------
r256094 | markm | 2013-10-06 23:45:02 +0100 (Sun, 06 Oct 2013) | 8 lines

Snapshot.

Looking pretty good; this mostly works now. New code includes:

* Read cached entropy at startup, both from files and from loader(8)
preloaded entropy. Failures are soft, but announced. Untested.

* Use EVENTHANDLER to do above just before we go multiuser. Untested.

------------------------------------------------------------------------
r256088 | markm | 2013-10-06 14:01:42 +0100 (Sun, 06 Oct 2013) | 2 lines

Fix up the man page for random(4). This mainly removes no-longer-relevant
details about HW RNGs, reseeding explicitly and user-supplied
entropy.

------------------------------------------------------------------------
r256087 | markm | 2013-10-06 13:43:42 +0100 (Sun, 06 Oct 2013) | 6 lines

As userland writing to /dev/random is no more, remove the "better
than nothing" bootstrap mode.

Add SWI harvesting to the mix.

My box seeds Yarrow by itself in a few seconds! YMMV; more to follow.

------------------------------------------------------------------------
r256086 | markm | 2013-10-06 13:40:32 +0100 (Sun, 06 Oct 2013) | 11 lines

Debug run. This now works, except that the "live" sources haven't
been tested. With all sources turned on, this unlocks itself in
a couple of seconds! That is no my box, and there is no guarantee
that this will be the case everywhere.

* Cut debug prints.

* Use the same locks/mutexes all the way through.

* Be a tad more conservative about entropy estimates.

------------------------------------------------------------------------
r256084 | markm | 2013-10-06 13:35:29 +0100 (Sun, 06 Oct 2013) | 5 lines

Don't use the "real" assembler mnemonics; older compilers may not
understand them (like when building CURRENT on 9.x).

# Submitted by:	Konstantin Belousov <kostikbel@gmail.com>

------------------------------------------------------------------------
r256081 | markm | 2013-10-06 10:55:28 +0100 (Sun, 06 Oct 2013) | 12 lines

SNAPSHOT.

Simplify the malloc pools; We only need one for this device.

Simplify the harvest queue.

Marginally improve the entropy pool hashing, making it a bit faster
in the process.

Connect up the hardware "live" source harvesting. This is simplistic
for now, and will need to be made rate-adaptive.

All of the above passes a compile test but needs to be debugged.

------------------------------------------------------------------------
r256042 | markm | 2013-10-04 07:55:06 +0100 (Fri, 04 Oct 2013) | 25 lines

Snapshot. This passes the build test, but has not yet been finished or debugged.

Contains:

* Refactor the hardware RNG CPU instruction sources to feed into
the software mixer. This is unfinished. The actual harvesting needs
to be sorted out. Modified by me (see below).

* Remove 'frac' parameter from random_harvest(). This was never
used and adds extra code for no good reason.

* Remove device write entropy harvesting. This provided a weak
attack vector, was not very good at bootstrapping the device. To
follow will be a replacement explicit reseed knob.

* Separate out all the RANDOM_PURE sources into separate harvest
entities. This adds some secuity in the case where more than one
is present.

* Review all the code and fix anything obviously messy or inconsistent.
Address som review concerns while I'm here, like rename the pseudo-rng
to 'dummy'.

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (the first item)

------------------------------------------------------------------------
r255319 | markm | 2013-09-06 18:51:52 +0100 (Fri, 06 Sep 2013) | 4 lines

Yarrow wants entropy estimations to be conservative; the usual idea
is that if you are certain you have N bits of entropy, you declare
N/2.

------------------------------------------------------------------------
r255075 | markm | 2013-08-30 18:47:53 +0100 (Fri, 30 Aug 2013) | 4 lines

Remove short-lived idea; thread to harvest (eg) RDRAND enropy into the
usual harvest queues. It was a nifty idea, but too heavyweight.

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com>

------------------------------------------------------------------------
r255071 | markm | 2013-08-30 12:42:57 +0100 (Fri, 30 Aug 2013) | 4 lines

Separate out the Software RNG entropy harvesting queue and thread
into its own files.

# Submitted by:	 Arthur Mesh <arthurmesh@gmail.com>

------------------------------------------------------------------------
r254934 | markm | 2013-08-26 20:07:03 +0100 (Mon, 26 Aug 2013) | 2 lines

Remove the short-lived namei experiment.

------------------------------------------------------------------------
r254928 | markm | 2013-08-26 19:35:21 +0100 (Mon, 26 Aug 2013) | 2 lines

Snapshot; Do some running repairs on entropy harvesting. More needs
to follow.

------------------------------------------------------------------------
r254927 | markm | 2013-08-26 19:29:51 +0100 (Mon, 26 Aug 2013) | 15 lines

Snapshot of current work;

1) Clean up namespace; only use "Yarrow" where it is Yarrow-specific
or close enough to the Yarrow algorithm. For the rest use a neutral
name.

2) Tidy up headers; put private stuff in private places. More could
be done here.

3) Streamline the hashing/encryption; no need for a 256-bit counter;
128 bits will last for long enough.

There are bits of debug code lying around; these will be removed
at a later stage.

------------------------------------------------------------------------
r254784 | markm | 2013-08-24 14:54:56 +0100 (Sat, 24 Aug 2013) | 39 lines

1) example (partially humorous random_adaptor, that I call "EXAMPLE")
 * It's not meant to be used in a real system, it's there to show how
   the basics of how to create interfaces for random_adaptors. Perhaps
   it should belong in a manual page

2) Move probe.c's functionality in to random_adaptors.c
 * rename random_ident_hardware() to random_adaptor_choose()

3) Introduce a new way to choose (or select) random_adaptors via tunable
"rngs_want" It's a list of comma separated names of adaptors, ordered
by preferences. I.e.:
rngs_want="yarrow,rdrand"

Such setting would cause yarrow to be preferred to rdrand. If neither of
them are available (or registered), then system will default to
something reasonable (currently yarrow). If yarrow is not present, then
we fall back to the adaptor that's first on the list of registered
adaptors.

4) Introduce a way where RNGs can play a role of entropy source. This is
mostly useful for HW rngs.

The way I envision this is that every HW RNG will use this
functionality by default. Functionality to disable this is also present.
I have an example of how to use this in random_adaptor_example.c (see
modload event, and init function)

5) fix kern.random.adaptors from
kern.random.adaptors: yarrowpanicblock
to
kern.random.adaptors: yarrow,panic,block

6) add kern.random.active_adaptor to indicate currently selected
adaptor:
root@freebsd04:~ # sysctl kern.random.active_adaptor
kern.random.active_adaptor: yarrow

# Submitted by:	Arthur Mesh <arthurmesh@gmail.com>

Submitted by:	Dag-Erling Smørgrav <des@FreeBSD.org>, Arthur Mesh <arthurmesh@gmail.com>
Reviewed by:	des@FreeBSD.org
Approved by:	re (delphij)
Approved by:	secteam (des,delphij)
2013-10-12 12:57:57 +00:00
jhb
5a13cd57b4 Ignore attempts to set the nmbcluster sysctls to their current value
rather than failing with an error.

Reviewed by:	andre
Approved by:	re (delphij)
MFC after:	2 weeks
2013-10-10 16:11:34 +00:00
markm
a9d73bef64 MFC - tracking commit. 2013-10-09 21:03:34 +00:00
kib
57b9540060 The device vnodes are often unlocked when bread() or bwrite() is
called.  This probably should be fixed eventually, but for now it is
not needed to try to flush such vnodes from the buffer allocation
context.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-09 18:45:01 +00:00
kib
d973ab2c23 Do not flush buffers when the v_object of the passed vnode does not
really belong to it. Such vnodes, with the pointers to other vnodes
v_objects, are typically instantiated by the bypass filesystems.
Invalidating mappings of other vnode pages and the pages is wrong,
since reclamation of the upper vnode does not imply that lower vnode
is reclaimed too.

One of the consequences of the improper reclamation was destruction of
the wired mappings of the lower vnode pages, triggering miscellaneous
assertions in the VM system.

Reported by:    John Marshall <john.marshall@riverwillow.com.au>
Tested by:      John Marshall <john.marshall@riverwillow.com.au>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-09 18:43:29 +00:00
kib
7ff487b3a2 When growing the file descriptor table, new larger memory chunk is
allocated, but the old table is kept around to handle the case of
threads still performing unlocked accesses to it.

Grow the table exponentially instead of increasing its size by
sizeof(long) * 8 chunks when overflowing. This mode significantly
reduces the total memory use for the processes consuming large numbers
of the file descriptors which open them one by one.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-10-09 18:41:35 +00:00
kib
9375280e4a Reduce code duplication, introduce the getmaxfd() helper to calculate
the max filedescriptor index.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-10-09 18:39:44 +00:00
markm
cb606506a3 MFC - tracking commit 2013-10-09 17:41:47 +00:00
glebius
0216035e66 - Substitute sbdrop_internal() with sbcut_internal(). The latter doesn't free
mbufs, but return chain of free mbufs to a caller. Caller can either reuse
  them or return to allocator in a batch manner.
- Implement sbdrop()/sbdrop_locked() as a wrapper around sbcut_internal().
- Expose sbcut_locked() for outside usage.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Approved by:	re (marius)
2013-10-09 11:57:53 +00:00
ray
da824c4a2b MFC @r256148. 2013-10-08 14:02:35 +00:00
ray
5f5f17c829 o Rename methods according to "consdev style".
o Add cngrab/cnungrab methods.
o Allow later console attach with termcn_cnregister().

Sponsored by:	The FreeBSD Foundation
2013-10-08 12:01:17 +00:00
des
7dad8b80f6 Add YARROW_RNG and FORTUNA_RNG to sys/conf/options.
Add a SYSINIT that forces a reseed during proc0 setup, which happens
fairly late in the boot process.

Add a RANDOM_DEBUG option which enables some debugging printf()s.

Add a new RANDOM_ATTACH entropy source which harvests entropy from the
get_cyclecount() delta across each call to a device attach method.
2013-10-08 11:05:26 +00:00
markm
04741fa764 Debugging. My attempt at EVENTHANDLER(multiuser) was a failure; use EVENTHANDLER(mountroot) instead.
This means we can't count on /var being present, so something will need to be done about harvesting /var/db/entropy/... .

Some policy now needs to be sorted out, and a pre-sync cache needs to be written, but apart from that we are now ready to go.

Over to review.
2013-10-08 06:54:52 +00:00
markm
6492773aa9 Snapshot.
Looking pretty good; this mostly works now. New code includes:

* Read cached entropy at startup, both from files and from loader(8) preloaded entropy. Failures are soft, but announced. Untested.

* Use EVENTHANDLER to do above just before we go multiuser. Untested.
2013-10-06 22:45:02 +00:00
markm
21998ad688 MFC - tracking commit 2013-10-06 09:37:57 +00:00
kib
1a04bcfbb6 Remove the uipc_cow.c file, which is not used since the zero copy
sockets removal.

Noted by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-10-06 06:57:28 +00:00
alc
4c5162818b Tidy up kmeminit(): Since r245575, 'nmbclusters' is calculated after
kmeminit() runs, so it contributes nothing to 'vm_kmem_size'; update a
comment to reflect that r254025 replaced the kmem submap with the kmem
arena.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	EMC / Isilon Storage Division
2013-10-05 18:53:03 +00:00
markm
3d04a78b78 MFC - tracking commit. 2013-10-04 07:00:59 +00:00
markm
b28953010e Snapshot. This passes the build test, but has not yet been finished or debugged.
Contains:

* Refactor the hardware RNG CPU instruction sources to feed into
the software mixer. This is unfinished. The actual harvesting needs
to be sorted out. Modified by me (see below).

* Remove 'frac' parameter from random_harvest(). This was never
used and adds extra code for no good reason.

* Remove device write entropy harvesting. This provided a weak
attack vector, was not very good at bootstrapping the device. To
follow will be a replacement explicit reseed knob.

* Separate out all the RANDOM_PURE sources into separate harvest
entities. This adds some secuity in the case where more than one
is present.

* Review all the code and fix anything obviously messy or inconsistent.
Address som review concerns while I'm here, like rename the pseudo-rng
to 'dummy'.

Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (the first item)
2013-10-04 06:55:06 +00:00
sbruno
6b64ef7266 Change len checks for fstypelen and fspathlen to be against absolute len
not strlen as they are *not* strings.

Discovered by GSOC student, Mike Ma <mikemandarine@gmail.com> during his
fuse.glusterfs port to FreeBSD.

Final patch from mckusick@

Submitted by:	mckusick@
Approved by:	re (hrs)
MFC after:	2 weeks
2013-10-03 22:52:03 +00:00
markm
e7fc623a18 MFC - tracking update. 2013-10-02 18:12:18 +00:00
kib
031d824a49 When helping the bufdaemon from the buffer allocation context, there
is no sense to walk the whole dirty buffer queue.  We are only
interested in, and can operate on, the buffers owned by the current
vnode [1].  Instead of calling generic queue flush routine, do
VOP_FSYNC() if possible.

Holding the dirty buffer queue lock in the bufdaemon, without dropping
it, can cause starvation of buffer writes from other threads. This is
esp. easy to reproduce on the big memory machines, where large files
are written, causing almost all dirty buffers accumulating in several
big files, which vnodes are locked by writers. Bufdaemon cannot flush
any buffer, but is iterating over the whole dirty queue
continuously. Since dirty queue mutex is not dropped, bufdone() in
g_up thread is starved, usually deadlocking the machine [2]. Mitigate
this by dropping the queue lock after the vnode is locked, allowing
other queue lock contenders to make a progress.

Discussed with:	Jeff [1]
Reported by:	pho [2]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (hrs)
2013-10-02 06:00:34 +00:00
kib
2f645d31a9 When printing the vnode information from ddb, print the lengths of the
dirty and clean buffer queues.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-01 20:18:33 +00:00
kib
26571b5c44 For vunref(), try to upgrade the vnode lock if the function was called
with the vnode shared-locked.  If upgrade succeeded, the inactivation
can be done immediately, instead of being postponed.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-29 18:07:14 +00:00
kib
e309fe770f Reimplement r255797 using LK_TRYUPGRADE.
The  r255797 was:
Increase the chance of the buffer write from the bufdaemon helper
context to succeed.  If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR:	kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-29 18:04:57 +00:00
kib
632f1d29f3 Add LK_TRYUPGRADE operation for lockmgr(9), which attempts to
atomically upgrade shared lock to exclusive.  On failure, error is
returned and lock is not dropped in the process.

Tested by:      pho (previous version)
No objections from:     attilio
Sponsored by:   The FreeBSD Foundation
MFC after:      1 week
Approved by:	re (glebius)
2013-09-29 18:02:23 +00:00
jmg
1dc8591054 it must be the last member, not might...
Reviewed by:	attilio
Approved by:	re (delphij, gjb)
2013-09-26 17:55:04 +00:00
kib
9a6c86c297 Do not allow negative timeouts for kqueue timers, check for the
negative timeout both before and after the conversion to sbintime_t.

For periodic kqueue timer, convert zero timeout into 1ms, to avoid
interrupt storm on fast event timers.

Reported and tested by:	pho
Discussed with:	mav
Reviewed by:	davide
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-26 13:17:31 +00:00
kib
c58dbf73e0 Acquire a hold reference on the vnode when a knote is instantiated.
Otherwise, knote keeps a pointer to a vnode which could become invalid
any time.

Reported by:	many
Tested by:	Patrick Lamaiziere <patfbsd@davenulle.org>
Discussed with:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-09-26 13:14:51 +00:00
davide
5c31dfc658 Make the callout arithmetic more robust adding checks for overflow.
Without these, if the timeout value passed is "large enough", the
value of the sum of it and other factors (e.g. current time as
returned by sbinuptime() or 'precision' argument) might result in a
negative number. This negative number is then passed to
eventtimers(4), which causes et_start() routine to load et_min_period
into eventtimer, making the CPU where the thread is stuck forever in
timer interrupt handler routine. This is now avoided rounding to
INT64_MAX the timeout period in case of overflow.

Reported by:	kib, pho
Discussed with:	kib, mav
Tested by:	pho (stress2 suite, kevent7.sh scenario)
Approved by:	re (kib)
2013-09-26 10:06:50 +00:00
attilio
29d161240e Avoid memory accesses reordering which can result in fget_unlocked()
seeing a stale fd_ofiles table once fd_nfiles is already updated,
resulting in OOB accesses.

Approved by:	re (kib)
Sponsored by:	EMC / Isilon storage division
Reported and tested by:	pho
Reviewed by:	benno
2013-09-25 13:37:52 +00:00
mav
a836e0c536 Make load average sampling asynchronous to hardclock ticks. This improves
measurement of load caused by time-related events still using hardclock.
For example, without this change dummynet, scheduling events each hardclock
tick, was always miscounted as load of 1.

There is still aliasing with events delayed by the new precision mechanism,
but it probably can't be avoided without moving this sampling from using
callout to some lower-level code or handling it in some other special way.

Reviewed by:	davide
Approved by:	re (marius)
2013-09-24 07:03:16 +00:00
des
65c545bff1 Always request zeroed memory, in case we're dumb enough to leak it later.
Approved by:	re (gjb)
2013-09-22 23:47:56 +00:00
kib
752a7a786f Revert r255797. The LK_UPGRADE | LK_NOWAIT drops the lock.
Approved by:	re (marius, implicit)
2013-09-22 20:29:03 +00:00
kib
5252c8b0bf Pre-acquire the filedesc sx when a possibility exists that the later
code could need to remove a kqueue from the filedesc list.  Global
lock is already locked, which causes sleepable after non-sleepable
lock acquisition.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2013-09-22 19:54:47 +00:00
kib
46dd93739b Increase the chance of the buffer write from the bufdaemon helper
context to succeed.  If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR:	kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-09-22 19:23:48 +00:00
davide
30f0365ae9 Consistently use the same value to indicate exclusively-held and
shared-held locks for all the primitives in lc_lock/lc_unlock routines.
This fixes the problems introduced in r255747, which indeed introduced an
inversion in the logic.

Reported by:	many
Tested by:	bdrewery, pho, lme, Adam McDougall, O. Hartmann
Approved by:	re (glebius)
2013-09-22 14:09:07 +00:00
glebius
12fb397079 - Create kern.ipc.sendfile namespace, and put the new "readhead" OID
there as "kern.ipc.sendfile.readahead".
- Push all nsfbuf related tunables into MD code. Don't move them
  to new namespace in favor of POLA.

Reviewed by:	scottl
Approved by:	re (gjb)
2013-09-22 13:36:52 +00:00
gibbs
f749b57e89 Fix ia64 and mips kernel builds due to XENHVM=>GENERIC integration in
revision 255744.

sys/kern/subr_smp.c:
	IPI_SUSPEND is only available on amd64 and i386.  Protect
	new uses of this constant with #ifdefs to avoid impacting
	other platforms.

Approved by:	re (blanket Xen)
2013-09-22 02:46:13 +00:00
markj
d3af58e0a0 Regenerate syscall argument strings after r255777.
Approved by:	re (gjb)
MFC after:	1 week
2013-09-21 23:06:36 +00:00
markj
885bf4020f Omit "__restrict" when generating syscall argument strings. DTrace doesn't
handle it and cannot determine the argument type when it's present.

Approved by:	re (gjb)
MFC after:	1 week
2013-09-21 23:05:44 +00:00
davide
defa3b8b85 Fix callout_init_rm() in the shared case, allocating storage for 'struct
rm_priotracker' directly in the softclock thread. Now consumers can
pass CALLOUT_SHAREDLOCK flag to callout initialization routine safely.
The choice of the already existing flags  instead of special casing
shared rmlocks is done to prevent consumer footshooting.

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	re (delphij)
2013-09-20 23:16:15 +00:00
davide
5273d359fd Fix lc_lock/lc_unlock() support for rmlocks held in shared mode. With
current lock classes KPI it was really difficult because there was no
way to pass an rmtracker object to the lock/unlock routines. In order
to accomplish the task, modify the aforementioned functions so that
they can return (or pass as argument) an uinptr_t, which is in the rm
case used to hold a pointer to struct rm_priotracker for current
thread. As an added bonus, this fixes rm_sleep() in the rm shared
case, which right now can communicate priotracker structure between
lc_unlock()/lc_lock().

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	re (delphij)
2013-09-20 23:06:21 +00:00
gibbs
7ed30adae7 Merge Xen PVHVM support into the GENERIC kernel config for both
amd64 and i386.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
	- Introduce two new CPU hooks for initialization and resume
	  purposes. This allows us to get rid of the XENHVM ifdefs in
	  mp_machdep, and also sets some hooks into common code that can be
	  used by other hypervisor implementations.

sys/amd64/conf/XENHVM:
sys/i386/conf/XENHVM:
	- Remove these configs now that GENERIC has builtin support for Xen
	  HVM.

sys/kern/subr_smp.c:
	- Make sure there are no pending IPIs when suspending a system.

sys/x86/xen/hvm.c:
	- Add cpu init and resume vectors that are called from mp_machdep
	  using the new hooks.
	- Only clear the vcpu_info mapping data on resume.  It is already
	  clear for the BSP on a cold boot and is set correctly as APs
	  are started.
	- Gate xen_hvm_init_cpu only to systems running under Xen.

sys/x86/xen/xen_intr.c:
	 - Gate the setup of event channels only to systems running under Xen.
2013-09-20 22:59:22 +00:00
gibbs
a9c07a6f67 Add support for suspend/resume/migration operations when running as a
Xen PVHVM guest.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
	- Make sure that are no MMU related IPIs pending on migration.
	- Reset pending IPI_BITMAP on resume.
	- Init vcpu_info on resume.

sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
	- Add a "suspend_cancelled" parameter to pic_resume().  For the
	  Xen PIC, restoration of interrupt services differs between
	  the aborted suspend and normal resume cases, so we must provide
	  this information.

sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
	- Don't swap out "suspend safe" timers across a suspend/resume
	  cycle.  This includes the Xen PV and ACPI timers.

sys/dev/xen/control/control.c:
	- Perform proper suspend/resume process for PVHVM:
		- Suspend all APs before going into suspension, this allows us
		  to reset the vcpu_info on resume for each AP.
		- Reset shared info page and callback on resume.

sys/dev/xen/timer/timer.c:
	- Implement suspend/resume support for the PV timer. Since FreeBSD
	  doesn't perform a per-cpu resume of the timer, we need to call
	  smp_rendezvous in order to correctly resume the timer on each CPU.

sys/dev/xen/xenpci/xenpci.c:
	- Don't reset the PCI interrupt on each suspend/resume.

sys/kern/subr_smp.c:
	- When suspending a PVHVM domain make sure there are no MMU IPIs
	  in-flight, or we will get a lockup on resume due to the fact that
	  pending event channels are not carried over on migration.
	- Implement a generic version of restart_cpus that can be used by
	  suspended and stopped cpus.

sys/x86/xen/hvm.c:
	- Implement resume support for the hypercall page and shared info.
	- Clear vcpu_info so it can be reset by APs when resuming from
	  suspension.

sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
	- Support UP kernel configurations.

sys/x86/xen/xen_intr.c:
	- Properly rebind per-cpus VIRQs and IPIs on resume.
2013-09-20 05:06:03 +00:00
jhb
c6151e30b1 Regen.
Approved by:	re (delphij)
2013-09-19 18:56:00 +00:00
jhb
d3ef75b6c7 Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
pjd
667d7255be Fix panic in ktrcapfail() when no capability rights are passed.
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.

Reported by:	Daniel Peyrolon
Tested by:	Daniel Peyrolon
Approved by:	re (marius)
2013-09-18 19:26:08 +00:00
rdivacky
fae003a069 Revert r255672, it has some serious flaws, leaking file references etc.
Approved by:	re (delphij)
2013-09-18 18:48:33 +00:00
rdivacky
d57db3eead Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by:    re (delphij)
Sponsored by:   Google Summer of Code
Submitted by:   Yuri Victorovich <yuri at rawbw dot com>
Tested by:      Yuri Victorovich <yuri at rawbw dot com>
2013-09-18 17:56:04 +00:00
glebius
ab92040a7b Fix assertion in sendfile_readpage() to assert only the validity
of requested amount of data in a page. Move assertion down below
object unlock.

Approved by:	re (kib)
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-09-17 06:37:21 +00:00
kib
6796656333 Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
kib
36c84de38f Use TAILQ instead of STAILQ for kqeueue filedescriptors to ensure constant
time removal on kqueue close.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (delphij)
2013-09-13 19:50:50 +00:00
kib
4c87f20f3a When opening or closing fifo, ensure that the vnode is locked
exclusively.  Filesystems are assumed to disable shared locking for
the fifo vnode locks, but some do not.

Reported and tested by:	olgeni
Discussed with:	avg
Sponsored by:   The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-13 06:52:23 +00:00
kib
b3080da236 Reduce the scope of the proctree_lock. If several processes cause
continuous calls to the uprintf(9), the proctree_lock could be
shared-locked for indefinite amount of time, starving exclusive
requests. Since proctree_lock is needed for fork() and exit(), this
effectively stops the machine.

While there, do the similar reduction for tprintf(9).

Reported and tested by: pho
Reviewed by:    ed
Sponsored by:   The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-13 06:39:10 +00:00
jhb
5dbaab99c0 Regen.
Approved by:	re (kib)
2013-09-12 18:03:51 +00:00
jhb
e0689d5d63 Fix the type of the idtype argument to wait6() in syscalls.master.
(Accidentally missed this in the previous commit)

Approved by:	re (kib)
MFC after:	1 week
2013-09-12 18:01:13 +00:00
jhb
d31f6d6b64 Fix the type of the idtype argument to wait6() in syscalls.master.
Approved by:	re (kib)
MFC after:	1 week
2013-09-12 17:52:18 +00:00
glebius
3d45194f2c Provide pr_ctloutput method for AF_LOCAL/SOCK_SEQPACKET sockets.
This makes setsockopt() on them working.

Reported by:	Yuri <yuri rawbw.com>
Approved by:	re (kib)
2013-09-11 18:22:30 +00:00
kib
4d92de31b2 Fix build with gcc.
Build-tested by:	gjb
Approved by:	re (glebius)
2013-09-11 17:31:22 +00:00
kib
55cc853a4d Implement sendfile(2) for the posix shared memory segment file descriptor,
in addition to the regular files.

Requested by:	alc
Discussed with:	emaste
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-11 06:41:15 +00:00
des
21e6fd796b Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory.  [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks.  [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem.  [SA-13:13]

Security:	CVE-2013-5666
Security:	FreeBSD-SA-13:11.sendfile
Security:	CVE-2013-5691
Security:	FreeBSD-SA-13:12.ifioctl
Security:	CVE-2013-5710
Security:	FreeBSD-SA-13:13.nullfs
Approved by:	re
2013-09-10 10:05:59 +00:00
jhb
04bb6e10cd Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
delphij
7917506a10 In r243868, the error message buffer errmsg have been changed from
an on-stack array to a pointer and therefore sizeof(errmsg) would
become 4 or 8 bytes depending on the architecture.

Fix this by using ERRMSGL in place of sizeof().

Submitted by:	J David <j.david.lists@gmail.com>
MFC after:	3 days
Approved by:	re (kib)
2013-09-09 05:01:18 +00:00
kib
56cc686058 Drain for the xbusy state for two places which potentially do
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-08 17:51:22 +00:00
pjd
f16777c9dc Sort properly. 2013-09-07 19:16:02 +00:00
pjd
e781fc782f Fix panic in cap_rights_is_valid() when invalid rights are provided -
the right_to_index() function should assert correctness in this case.

Improve other assertions.

Reported by:	pho
Tested by:	pho
2013-09-07 19:03:16 +00:00
mav
e372a1314d Micro-optimize cpu_search(), allowing compiler to use more efficient inline
ffsl() implementation, when it is available, instead of homegrown iteration.

On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
2013-09-07 15:16:30 +00:00
markm
b41e1125b0 MFC 2013-09-07 07:58:29 +00:00
np
24e0bcea20 Add a vtprintf. It is to tprintf what vprintf is to printf.
Reviewed by:	kib
2013-09-07 07:53:21 +00:00
markm
9d67aa8bff MFC 2013-09-06 17:42:12 +00:00
jamie
d13d69ef17 Keep PRIV_KMEM_READ permitted inside jails as it is on the outside. 2013-09-06 17:32:29 +00:00
hiren
0493244548 Fixing a small typo.
Reviewed by:	gjb
Approved by:	sbruno (mentor)
2013-09-05 18:18:23 +00:00
kib
a08e5f97bc The vm_pageout_flush() functions sbusies pages in the passed pages
run.  After that, the pager put method is called, usually translated
to VOP_WRITE().  For the filesystems which use buffer cache,
bufwrite() sbusies the buffer pages again, waiting for the xbusy state
to drain.  The later is done in vfs_drain_busy_pages(), which is
called with the buffer pages already sbusied (by vm_pageout_flush()).

Since vfs_drain_busy_pages() can only wait for one page at the time,
and during the wait, the object lock is dropped, previous pages in the
buffer must be protected from other threads busying them.  Up to the
moment, it was done by xbusying the pages, that is incompatible with
the sbusy state in the new implementation of busy.  Switch to sbusy.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-09-05 12:56:08 +00:00
pjd
e6563595d7 The fget() function now takes pointer to cap_rights_t, so change 0 to NULL. 2013-09-05 11:59:23 +00:00
pjd
1c7defb76e Handle cases where capability rights are not provided.
Reported by:	kib
2013-09-05 11:58:12 +00:00
glebius
12cfeea19a Fix !CAPABILITIES build. 2013-09-05 10:24:09 +00:00
pjd
ad93dafc6a Correct the logic broken in my last commit.
Reported by:	tijl
2013-09-05 09:36:19 +00:00
sbruno
bf3232d8bb Restore builds on architectures that don't support CAPABILITIES (mips). 2013-09-05 03:46:44 +00:00
sbruno
c1754372ea This looks like a typo that breaks the build. Yell at me if this isn't the
intended declaration.
2013-09-05 03:36:57 +00:00
pjd
add7315b85 Style fixes. 2013-09-05 00:19:30 +00:00
pjd
36e3a50584 Style fixes. Most fixes are about not treating integers and pointers as
booleans.
2013-09-05 00:17:38 +00:00
pjd
d1a65cb7ef Regenerate after r255219.
Sponsored by:	The FreeBSD Foundation
2013-09-05 00:11:59 +00:00
pjd
029a6f5d92 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
jhb
00a02e74da Trim a couple of panic messages. 2013-09-04 11:52:28 +00:00
davide
3e4464bbe3 Fix socket buffer timeouts precision using the new sbintime_t KPI instead
of relying on the tvtohz() workaround. The latter has been introduced
lately by jhb@ (r254699) in order to have a fix that can be backported
to STABLE.

Reported by:	Vitja Makarov <vitja.makarov at gmail dot com>
Reviewed by:	jhb (earlier version)
2013-09-01 23:34:53 +00:00
ray
c555be1512 MFC @r255128. 2013-09-01 23:06:28 +00:00
rmacklem
8d06f831a7 Forced dismounts of NFS mounts can fail when thread(s) are stuck
waiting for an RPC reply from the server while holding the mount
point busy (mnt_lockref incremented). This happens because dounmount()
msleep()s waiting for mnt_lockref to become 0, before calling
VFS_UNMOUNT(). This patch adds a new VFS operation called VFS_PURGE(),
which the NFS client implements as purging RPCs in progress. Making
this call before checking mnt_lockref fixes the problem, by ensuring
that the VOP_xxx() calls will fail and unbusy the mount point.

Reported by:	sbruno
Reviewed by:	kib
MFC after:	2 weeks
2013-09-01 23:02:59 +00:00
markm
ff7909302f MFC 2013-08-30 11:38:34 +00:00
hselasky
5a68ee8757 Simplify pause_sbt() logic. Don't call DELAY() if remainder is less
than or equal to zero.
2013-08-30 10:39:56 +00:00
kib
dc9c173247 Move the definition of the struct unrhdr into a separate header file,
to allow embedding the struct.  Add init_unrhdr(9) initializer, which
sets up preallocated unrhdr.

Reviewed by:	alc
Tested by:	pho, bf
2013-08-30 07:37:45 +00:00
ken
6165f30980 Fix some issues in change 254760 pointed out by Bruce Evans:
- Remove excessive parenthesis
 - Use KNF continuation indentation
 - Cut down on excessive continuation lines
 - More consistent style in messages
 - Use uprintf() instead of printf()

Submitted by:	bde
2013-08-29 16:41:40 +00:00
jhb
5875a10cf6 Don't return an error for socket timeouts that are too large. Just
cap them to INT_MAX ticks instead.

PR:		kern/181416 (r254699 really)
Requested by:	bde
MFC after:	2 weeks
2013-08-29 15:59:05 +00:00
andre
73f239a63e Pad m_hdr on 32bit architectures to to prevent alignment and padding
problems with the way MLEN, MHLEN, and struct mbuf are set up.

CTASSERT's are provided to detect such issues at compile time in the
future.

The #define MLEN and MHLEN calculation do not take actual compiler-
induced alignment and padding inside the complete struct mbuf into
account.  Accordingly appropriate attention is required when changing
members of struct mbuf.

Ideally one would calculate MLEN as (MSIZE - sizeof(((struct mbuf *)0)->m_hdr)
but that doesn't work as the compiler refuses to operate on an as of
yet incomplete structure.

In particular ARM 32bit has more strict alignment requirements which
caused 4 bytes of padding between m_hdr and pkthdr in struct mbuf
because of the 64bit members in pkthdr.  This wasn't picked up by MLEN
and MHLEN causing an overflow of the mbuf provided data storage by
overestimating its size.

I386 didn't show this problem because it handles unaligned access just
fine, albeit at a small performance penalty.

On 64bit architectures the struct mbuf layout is 64bit aligned in all
places.

Reported by:	Thomas Skibo <ThomasSkibo-at-sbcglobal-dot-net>
Tested by:	tuexen, ian, Thomas Skibo (extended patch)
Sponsored by:	The FreeBSD Foundation
2013-08-27 20:52:02 +00:00
kib
9426e190e7 When allocating a pbuf for the cluster write, do not sleep waiting
for the available pbuf when passed vnode is backing md(4). Other i/o
directed to the same md device might already hold pbufs, and then we
could deadlock since only our progress can free a pbuf needed for
wakeup.

Obtained from:	projects/vm6
Reminded and tested by:	pho
MFC after:	1 week
2013-08-27 01:31:12 +00:00
will
a52b9ca1d3 Add the ability to display the default FIB number for a process to the
ps(1) utility, e.g. "ps -O fib".

bin/ps/keyword.c:
	Add the "fib" keyword and default its column name to "FIB".

bin/ps/ps.1:
	Add "fib" as a supported keyword.

sys/compat/freebsd32/freebsd32.h:
sys/kern/kern_proc.c:
sys/sys/user.h:
	Add the default fib number for a process (p->p_fibnum)
	to the user land accessible process data of struct kinfo_proc.

Submitted by:	Oliver Fromme <olli@fromme.com>, gibbs
2013-08-26 23:48:21 +00:00
jmg
f4ae1f5e6d fix up some comments and a white space issue...
MFC after:	3 days
2013-08-26 18:53:19 +00:00
markm
8a3bb03c25 Snapshot; Do some running repairs on entropy harvesting. More needs to follow. 2013-08-26 18:35:21 +00:00
andre
6c0efad132 Give (*ext_free) an int return value allowing for very sophisticated
external mbuf buffer management capabilities in the future.

For now only EXT_FREE_OK is defined with current legacy behavior.

Sponsored by:	The FreeBSD Foundation
2013-08-25 10:57:09 +00:00
andre
46cd7f807a Adjust socow_iodone() after r254799.
Sponsored by:	The FreeBSD Foundation
2013-08-25 09:40:15 +00:00
andre
e02967a3c7 After r254779 "error" must always be present in mb_ctor_pack(),
not only when MAC is defined.

Reported by:	gjb / tinderbox
Sponsored by:	The FreeBSD Foundation
2013-08-24 21:25:53 +00:00
markj
3541d8b143 Rename the kld_unload event handler to kld_unload_try, and add a new
kld_unload event handler which gets invoked after a linker file has been
successfully unloaded. The kld_unload and kld_load event handlers are now
invoked with the shared linker lock held, while kld_unload_try is invoked
with the lock exclusively held.

Convert hwpmc(4) to use these event handlers instead of having
kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are
loaded or unloaded. This has no functional effect, but simplifes the linker
code somewhat.

Reviewed by:	jhb
2013-08-24 21:13:38 +00:00
andre
6ad648e223 Remove unused m_free_fast(). The difference to m_free() is only
2 predictable branches nowadays.  However as a pre-condition the
caller had to ensure that the mbuf pkthdr did not have any mtags
attached to it, costing some potential branches again.

Sponsored by:	The FreeBSD Foundation
2013-08-24 21:09:57 +00:00
markj
351bab29d0 Set things up so that linker_file_lookup_set() is always called with the
linker lock held. This makes it possible to call it from a kld event handler
with the shared lock held.

Reviewed by:	jhb
2013-08-24 21:08:55 +00:00
markj
6eda1f0daf Remove the kld lock macros and just use the sx(9) API. Add locking in
linker_init_kernel_modules() and linker_preload() in order to remove most
of the checks for !cold before asserting that the kld lock is held. These
routines are invoked by SYSINIT(9), so there's no harm in them taking the
kld lock.
2013-08-24 21:07:04 +00:00
markj
86598e9098 Remove some code that has been commented out since it was added in 2000. 2013-08-24 21:00:39 +00:00
andre
e3737c33e7 Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
ken
32eeb1af5c Fix a printf format warning on 32-bit mips and powerpc.
Reported by:	bde, gjb
Pointy hat to:	ken
2013-08-24 19:02:36 +00:00
andre
b148bf45e0 Add an mbuf pointer parameter to (*ext_free) to give the external
free function access to the mbuf the external memory was attached
to.

Mechanically adjust all users to include the mbuf parameter.

This fixes a long standing annoyance for external free functions.
Before one had to sacrifice one of the argument pointers for this.

Sponsored by:	The FreeBSD Foundation
2013-08-24 16:57:44 +00:00
mav
8d8ac3d3c9 MFprojects/camlock r254460:
Remove locking from taskqueue_member().  The list of threads is static
during the taskqueue life cycle, so there is no need to protect it,
taking quite congested lock several more times for each ZFS I/O.
2013-08-24 14:41:49 +00:00
andre
817be10f1c dd a 24 bits wide ext_flags field to m_ext by reducing ext_type
to 8 bits.  ext_type is an enumerator and the number of types we
have is a mere dozen.

A couple of ext_types are renumbered to fit within 8 bits.

EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and
experimental local mapping.

The ext_flags field is currently unused but has a couple of flags
already defined for future use.  Again vendor and experimental
flags are provided for local mapping.

EXT_FLAG_BITS is provided for the printf(9) %b identifier.

Initialize and copy ext_flags in the relevant mbuf functions.

Improve alignment and packing of struct m_ext on 32 and 64 archs
by carefully sorting the fields.
2013-08-24 13:15:42 +00:00
andre
b9a332b6b2 Avoid code duplication for mbuf initialization and use m_init() instead
in mb_ctor_mbuf() and mb_ctor_pack().
2013-08-24 12:24:58 +00:00
ken
281a193b53 Add support to physio(9) for devices that don't want I/O split and
configure sa(4) to request no I/O splitting by default.

For tape devices, the user needs to be able to clearly understand
what blocksize is actually being used when writing to a tape
device.  The previous behavior of physio(9) was that it would split
up any I/O that was too large for the device, or too large to fit
into MAXPHYS.  This means that if, for instance, the user wrote a
1MB block to a tape device, and MAXPHYS was 128KB, the 1MB write
would be split into 8 128K chunks.  This would be done without
informing the user.

This has suboptimal effects, especially when trying to communicate
status to the user.  In the event of an error writing to a tape
(e.g. physical end of tape) in the middle of a 1MB block that has
been split into 8 pieces, the user could have the first two 128K
pieces written successfully, the third returned with an error, and
the last 5 returned with 0 bytes written.  If the user is using
a standard write(2) system call, all he will see is the ENOSPC
error.  He won't have a clue how much actually got written.  (With
a writev(2) system call, he should be able to determine how much
got written in addition to the error.)

The solution is to prevent physio(9) from splitting the I/O.  The
new cdev flag, SI_NOSPLIT, tells physio that the driver does not
want I/O to be split beforehand.

Although the sa(4) driver now enables SI_NOSPLIT by default,
that can be disabled by two loader tunables for now.  It will not
be configurable starting in FreeBSD 11.0.  kern.cam.sa.allow_io_split
allows the user to configure I/O splitting for all sa(4) driver
instances.  kern.cam.sa.%d.allow_io_split allows the user to
configure I/O splitting for a specific sa(4) instance.

There are also now three sa(4) driver sysctl variables that let the
users see some sa(4) driver values.  kern.cam.sa.%d.allow_io_split
shows whether I/O splitting is turned on.  kern.cam.sa.%d.maxio shows
the maximum I/O size allowed by kernel configuration parameters
(e.g. MAXPHYS, DFLTPHYS) and the capabilities of the controller.
kern.cam.sa.%d.cpi_maxio shows the maximum I/O size supported by
the controller.

Note that a better long term solution would be to implement support
for chaining buffers, so that that MAXPHYS is no longer a limiting
factor for I/O size to tape and disk devices.  At that point, the
controller and the tape drive would become the limiting factors.

sys/conf.h:	Add a new cdev flag, SI_NOSPLIT, that allows a
		driver to tell physio not to split up I/O.

sys/param.h:	Bump __FreeBSD_version to 1000049 for the addition
		of the SI_NOSPLIT cdev flag.

kern_physio.c:	If the SI_NOSPLIT flag is set on the cdev, return
		any I/O that is larger than si_iosize_max or
		MAXPHYS, has more than one segment, or would have
		to be split because of misalignment with EFBIG.
		(File too large).

		In the event of an error, print a console message to
		give the user a clue about what happened.

scsi_sa.c:	Set the SI_NOSPLIT cdev flag on the devices created
		for the sa(4) driver by default.

		Add tunables to control whether we allow I/O splitting
		in physio(9).

		Explain in the comments that allowing I/O splitting
		will be deprecated for the sa(4) driver in FreeBSD
		11.0.

		Add sysctl variables to display the maximum I/O
		size we can do (which could be further limited by
		read block limits) and the maximum I/O size that
		the controller can do.

		Limit our maximum I/O size (recorded in the cdev's
		si_iosize_max) by MAXPHYS.  This isn't strictly
		necessary, because physio(9) will limit it to
		MAXPHYS, but it will provide some clarity for the
		application.

		Record the controller's maximum I/O size reported
		in the Path Inquiry CCB.

sa.4:		Document the block size behavior, and explain that
		the option of allowing physio(9) to split the I/O
		will disappear in FreeBSD 11.0.

Sponsored by:	Spectra Logic
2013-08-24 04:52:22 +00:00
delphij
b93cf73204 Allow tmpfs be mounted inside jail. 2013-08-23 22:52:20 +00:00
jkim
e9e439afa6 Fix a whitespace. 2013-08-23 16:54:38 +00:00
kib
4731d4c0e7 Since the 253927, which removed the soft busy call for the sf page, it
does not make sense to wait for the soft busy state of the page to
drain.  The vm object lock is dropped immediately after, so the result
of the wait is invalidated.

It might make sense to not wait for the hard busy state as well,
esp. for the fully valid page, but this is postponed for now.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-23 14:50:03 +00:00
jhb
6d81dda61e Use tvtohz() to convert a socket buffer timeout to a tick value rather
than using a home-rolled version.  The home-rolled version could result
in shorter-than-requested sleeps.

Reported by:	Vitja Makarov <vitja.makarov@gmail.com>
MFC after:	2 weeks
2013-08-23 13:47:41 +00:00
kib
e25f6a560e Both cluster_rbuild() and cluster_wbuild() sometimes set the pages
shared busy without first draining the hard busy state.  Previously it
went unnoticed since VPO_BUSY and m->busy fields were distinct, and
vm_page_io_start() did not verified that the passed page has VPO_BUSY
flag cleared, but such page state is wrong.  New implementation is
more strict and catched this case.

Drain the busy state as needed, before calling vm_page_sbusy().

Tested by:	pho, jkim
Sponsored by:	The FreeBSD Foundation
2013-08-22 18:26:45 +00:00
kib
ba12eedccd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
andre
d8dc6ade5b Revert r254520 and resurrect the M_NOFREE mbuf flag and functionality.
Requested by:	np, grehan
2013-08-21 18:12:04 +00:00
kib
d11c4f9c32 Implement read(2)/write(2) and neccessary lseek(2) for posix shmfd.
Add MAC framework entries for posix shm read and write.

Do not allow implicit extension of the underlying memory segment past
the limit set by ftruncate(2) by either of the syscalls.  Read and
write returns short i/o, lseek(2) fails with EINVAL when resulting
offset does not fit into the limit.

Discussed with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-21 17:45:00 +00:00
kib
6a459eb27c Make the seek a method of the struct fileops.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-21 17:36:01 +00:00
kib
a3e8f2c6dc Extract the general-purpose code from tmpfs to perform uiomove from
the page queue of some vm object.

Discussed with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-21 17:23:24 +00:00
pho
967f62efad Added sysctl to turn off calls to vmem_check().
Sponsored by:	EMC / Isilon storage division
Discussed with:	 jeff
2013-08-20 11:06:56 +00:00
jeff
ed90d4ba3f - Use an arbitrary but reasonably large import size for kva on architectures
that don't support superpages.  This keeps the number of spans and internal
   fragmentation lower.
 - When the user asks for alignment from vmem_xalloc adjust the imported size
   by 2*align to be certain we can satisfy the allocation.  This comes at
   the expense of potential failures when the backend can't supply enough
   memory but could supply the requested size and alignment.

Sponsored by:	EMC / Isilon Storage Division
2013-08-19 23:02:39 +00:00
andre
e1092223ba Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users
for a very long time, if ever.

Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING external mbuf
type together with a dedicated *ext_free function.

Discussed with:	trociny, glebius
2013-08-19 11:16:53 +00:00
jilles
1d1d0428d2 Disallow opening a POSIX message queue for execute.
O_EXEC was formerly ignored, so equivalent to O_RDONLY.

Reject O_EXEC with [EINVAL] like the invalid mode 3.
2013-08-18 13:27:04 +00:00
pjd
3014e000ae Implement 32bit versions of the cap_ioctls_limit(2) and cap_ioctls_get(2)
system calls as unsigned longs have different size on i386 and amd64.

Reported by:	jilles
Sponsored by:	The FreeBSD Foundation
2013-08-18 10:30:41 +00:00
bryanv
1b4affdd25 Do not use potentially stale thread in kthread_add()
When an existing process is provided, the thread selected to use
to initialize the new thread could have exited and be reaped.
Acquire the proc lock earlier to ensure the thread remains valid.

Reviewed by:	jhb, julian (previous version)
MFC after:	3 days
2013-08-17 17:02:43 +00:00
pjd
635a029a89 In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated.
Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().
2013-08-17 14:13:45 +00:00
delphij
a1c410b061 Fix build. 2013-08-17 00:25:11 +00:00
kib
408a640438 Restore the previous sendfile(2) behaviour on the block devices.
Provide valid .fo_sendfile method for several missed struct fileops.

Reviewed by:	glebius
Sponsored by:	The FreeBSD Foundation
2013-08-16 14:22:20 +00:00
markj
e461168640 Use strdup(9) instead of reimplementing it. 2013-08-16 03:41:41 +00:00
ken
5591de079d Change the way that unmapped I/O capability is advertised.
The previous method was to set the D_UNMAPPED_IO flag in the cdevsw
for the driver.  The problem with this is that in many cases (e.g.
sa(4)) there may be some instances of the driver that can handle
unmapped I/O and some that can't.  The isp(4) driver can handle
unmapped I/O, but the esp(4) driver currently cannot.  The cdevsw
is shared among all driver instances.

So instead of setting a flag on the cdevsw, set a flag on the cdev.
This allows drivers to indicate support for unmapped I/O on a
per-instance basis.

sys/conf.h:	Remove the D_UNMAPPED_IO cdevsw flag and replace it
		with an SI_UNMAPPED cdev flag.

kern_physio.c:	Look at the cdev SI_UNMAPPED flag to determine
		whether or not a particular driver can handle
		unmapped I/O.

geom_dev.c:	Set the SI_UNMAPPED flag for all GEOM cdevs.
		Since GEOM will create a temporary mapping when
		needed, setting SI_UNMAPPED unconditionally will
		work.

		Remove the D_UNMAPPED_IO flag.

nvme_ns.c:	Set the SI_UNMAPPED flag on cdevs created here
		if NVME_UNMAPPED_BIO_SUPPORT is enabled.

vfs_aio.c:	In aio_qphysio(), check the SI_UNMAPPED flag on a
		cdev instead of the D_UNMAPPED_IO flag on the cdevsw.

sys/param.h:	Bump __FreeBSD_version to 1000045 for the switch from
		setting the D_UNMAPPED_IO flag in the cdevsw to setting
		SI_UNMAPPED in the cdev.

Reviewed by:	kib, jimharris
MFC after:	1 week
Sponsored by:	Spectra Logic
2013-08-15 22:52:39 +00:00
cperciva
8544609e64 Change the queue of locks in kern_rangelock.c from holding lock requests in
the order that they arrive, to holding
(a) granted write lock requests, followed by
(b) granted read lock requests, followed by
(c) ungranted requests, in order of arrival.

This changes the stopping condition for iterating through granted locks to
see if a new request can be granted: When considering a read lock request,
we can stop iterating as soon as we see a read lock request, since anything
after that point is either a granted read lock request or a request which
has not yet been granted.  (For write lock requests, we must still compare
against all granted lock requests.)

For workloads with R parallel reads and W parallel writes, this improves
the time spent from O((R+W)^2) to O(W*(R+W)); i.e., heavy parallel-read
workloads become significantly more scalable.

No statistically significant change in buildworld time has been measured,
but synthetic tests of parallel 'dd > /dev/null' and 'openssl enc >/dev/null'
with the input file cached yield dramatic (up to 10x) improvement with high
(up to 128 processes) levels of parallelism.

Reviewed by:	kib
2013-08-15 20:19:17 +00:00
ray
3f79da49d7 MFC @r254374. 2013-08-15 19:32:08 +00:00
glebius
722a1a5e5d Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
markj
6c24c9fb32 Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after:	2 weeks
2013-08-15 04:08:55 +00:00
markj
cee1e037da Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.

Reported by:	jhb
X-MFC with:	r254266
2013-08-14 00:42:21 +00:00
jeff
ca13e69c2f - Disable quantum caches on the kmem_arena. This can make fragmentation
worse on small KVA systems.  I had intended to only enable it for
   debugging.

Sponsored by:	EMC / Isilon Storage Division
2013-08-13 22:41:24 +00:00
jeff
d330a11545 - Add a statically allocated memguard arena since it is needed very early
on.
 - Pass the appropriate flags to vmem_xalloc() when allocating space for
   the arena from kmem_arena.

Sponsored by:	EMC / Isilon Storage Division
2013-08-13 22:40:43 +00:00
jhb
7120845712 Some small cleanups to the fixes in r180340:
- Set NOTE_TRACKERR before running filt_proc().  If the knote did not
  have NOTE_FORK set in fflags when registered, then the TRACKERR event
  could miss being posted.
- Don't pass the pid in to filt_proc() for NOTE_FORK events.  The special
  handling for pids is done knote_fork() directly and no longer in
  filt_proc().

MFC after:	2 weeks
2013-08-13 18:45:58 +00:00
glebius
33afc0388b - Minor style(9) fix.
- Bring a comment up to date.
2013-08-13 13:40:31 +00:00
markj
5a3f78714c FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
  trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

PR:		166927, 166926, 166928
Reported by:	davide [1]
Reviewed by:	avg, trociny (earlier version)
MFC after:	1 month
2013-08-13 03:10:39 +00:00
markj
5423ffaa89 Remove some unused fields from struct linker_file. They were added in
r172862 for use by the DTrace SDT framework but don't seem to have ever
been used.

MFC after:	2 weeks
2013-08-13 03:09:00 +00:00
markj
80dd3f5e73 Add event handlers for module load and unload events. The load handlers are
called after the module has been loaded, and the unload handlers are called
before the module is unloaded. Moreover, the module unload handlers may
return an error to prevent the unload from proceeding.

Reviewed by:	avg
MFC after:	2 weeks
2013-08-13 03:07:49 +00:00
kib
58a6f3bcbe The r254167 moved initialization of the sleepqueues before the witness
is operational.  init_sleepqueues() initializes 256 mutexes, which,
due to witness still being cold, started to overflow the pending_locks
array.

As stated in the reported panic message, increase WITNESS_PENDLIST
from 768 to 1024, which provides space for additional 256 locks.

Reported by:	many
Tested by:	rakuco, bdrewery
2013-08-10 21:42:14 +00:00
cognet
333a884980 Don't call sleepinit() from proc0_init(), make it a SYSINIT instead.
vmem needs the sleepq locks to be initialized when free'ing kva, so we want it
called as early as possible.
2013-08-09 23:13:52 +00:00