Commit Graph

14495 Commits

Author SHA1 Message Date
Mateusz Guzik
5fe97c20dc fd: split kern_dup flags argument into actual flags and a mode
Tidy up the code inside to switch on the mode.
2015-07-10 11:01:30 +00:00
Konstantin Belousov
e8677f3885 Change the mb() use in the sched_ult tdq_notify() and sched_idletd()
to more C11-ish atomic_thread_fence_seq_cst().

Note that on PowerPC, which currently uses lwsync for mb(), the change
actually fixes the missed store/load barrier, intended by r271604 [*].

Reviewed by:	alc
Noted by:	alc [*]
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-07-10 08:54:12 +00:00
Ed Schouten
47a84387ad Let listen() return EDESTADDRREQ when not bound.
We currently return EINVAL when calling listen() on a UNIX socket that
has not been bound to a pathname. If my interpretation of POSIX is
correct, we should return EDESTADDRREQ: "The socket is not bound to a
local address, and the protocol does not support listening on an unbound
socket."

Return EDESTADDRREQ instead when not bound and not connected.

Differential Revision:	https://reviews.freebsd.org/D3038
Reviewed by:	gnn, network
2015-07-10 06:47:14 +00:00
Mateusz Guzik
318b946321 vfs: cosmetic changes to namei and namei_handle_root
- don't initialize cnp during declaration
- don't test error/!error, compare to 0 instead
2015-07-09 17:17:26 +00:00
Mateusz Guzik
d177f49f6f vfs: simplify error handling in namei
The logic is reorganised so that there is one exit point prior to the
lookup loop. This is an intermediate step to making audit logging
functions use found vnode instead of translating ni_dirfd on their own.

ni_startdir validation is removed. The only in-tree consumer is nfs
which already makes sure it is a directory.

Reviewed by:	kib
2015-07-09 16:32:58 +00:00
Ed Schouten
2491302a04 Add implementations for some of the CloudABI file descriptor system calls.
All of the CloudABI system calls that operate on file descriptors of an
arbitrary type are prefixed with fd_. This change adds wrappers for
most of these system calls around their FreeBSD equivalents.

The dup2() system call present on CloudABI deviates from POSIX, in the
sense that it can only be used to replace existing file descriptor. It
cannot be used to create new ones. The reason for this is that this is
inherently thread-unsafe. Furthermore, there is no need on CloudABI to
use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no
special meaning.

This change exposes the kern_dup() through <sys/syscallsubr.h> and puts
the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag,
FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not
allocated.

Differential Revision:	https://reviews.freebsd.org/D3035
Reviewed by:	mjg
2015-07-09 16:07:01 +00:00
Mateusz Guzik
efdc25304c fd: prepare do_dup for being exported
- rename it to kern_dup.
- prefix flags with FD
- assert that correct flags were passed
2015-07-09 15:19:45 +00:00
Mateusz Guzik
d19ba50e12 vfs: avoid spurious vref/vrele for absolute lookups
namei used to vref fd_cdir, which was immediatley vrele'd on entry to
the loop.

Check for absolute lookup and vref the right vnode the first time.

Reviewed by:	kib
2015-07-09 15:06:58 +00:00
Mateusz Guzik
a03f1b2970 vfs: plug a use-after-free of fd_rdir in namei
fd_rdir vnode was stored in ni_rootdir without refing it in any way,
after which the filedsc lock was being dropped.

The vnode could have been freed by mountcheckdirs or another thread doing
chroot.

VREF the vnode while the lock is held.

Reviewed by:	kib
MFC after:	1 week
2015-07-09 15:06:24 +00:00
Ed Schouten
3a41ec6af7 Don't clobber td->td_retval[0] in proc_reap().
While writing tests for CloudABI, I noticed that close() on process
descriptors returns the process ID of the child process. This is
interesting, as close() is only allowed to return 0 or -1. It turns out
that we clobber td->td_retval[0] in proc_reap(), so that wait*()
properly returns the process ID.

Change proc_reap() to leave td->td_retval[0] alone. Set the return value
in kern_wait6() instead, by keeping track of the PID before we
(potentially) reap the process.

Differential Revision:	https://reviews.freebsd.org/D3032
Reviewed by:	kib
2015-07-09 12:04:45 +00:00
Konstantin Belousov
fcb5b3a419 Cover a race between doselwakeup() and selfdfree(). If doselwakeup()
loop finds the selfd entry and clears its sf_si pointer, which is
handled by selfdfree() in parallel, NULL sf_si makes selfdfree() free
the memory.  The result is the race and accesses to the freed memory.

Refcount the selfd ownership.  One reference is for the sf_link
linkage, which is unconditionally dereferenced by selfdfree().
Another reference is for sf_threads, both selfdfree() and
doselwakeup() race to deref it, the winner unlinks and than frees the
selfd entry.

Reported by:	Larry Rosenman <ler@lerctr.org>
Tested by:	Larry Rosenman <ler@lerctr.org>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-09 09:22:21 +00:00
Ed Schouten
6d338f9a81 Import the CloudABI datatypes and create a system call table.
CloudABI is a pure capability-based runtime environment for UNIX. It
works similar to Capsicum, except that processes already run in
capabilities mode on startup. All functionality that conflicts with this
model has been omitted, making it a compact binary interface that can be
supported by other operating systems without too much effort.

CloudABI is 'secure by default'; the idea is that it should be safe to
run arbitrary third-party binaries without requiring any explicit
hardware virtualization (Bhyve) or namespace virtualization (Jails). The
rights of an application are purely determined by the set of file
descriptors that you grant it on startup.

The datatypes and constants used by CloudABI's C library (cloudlibc) are
defined in separate files called syscalldefs_mi.h (pointer size
independent) and syscalldefs_md.h (pointer size dependent). We import
these files in sys/contrib/cloudabi and wrap around them in
cloudabi*_syscalldefs.h.

We then add stubs for all of the system calls in sys/compat/cloudabi or
sys/compat/cloudabi64, depending on whether the system call depends on
the pointer size. We only have nine system calls that depend on the
pointer size. If we ever want to support 32-bit binaries, we can simply
add sys/compat/cloudabi32 and implement these nine system calls again.

The next step is to send in code reviews for the individual system call
implementations, but also add a sysentvec, to allow CloudABI executabled
to be started through execve().

More information about CloudABI:
- GitHub: https://github.com/NuxiNL/cloudlibc
- Talk at BSDCan: https://www.youtube.com/watch?v=SVdF84x1EdA

Differential Revision:	https://reviews.freebsd.org/D2848
Reviewed by:	emaste, brooks
Obtained from:	https://github.com/NuxiNL/freebsd
2015-07-09 07:20:15 +00:00
Konstantin Belousov
f4b5a9725a Reimplement the ordering requirements for the timehands updates, and
for timehands consumers, by using fences.

Ensure that the timehands->th_generation reset to zero is visible
before the data update is visible [*].  tc_setget() allowed data update
writes to become visible before generation (but not on TSO
architectures).

Remove tc_setgen(), tc_getgen() helpers, use atomics inline [**].

Noted by:	alc [*]
Requested by:	bde [**]
Reviewed by:	alc, bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-07-08 18:42:08 +00:00
Konstantin Belousov
69d11def74 Handle copyout for the fcntl(F_OGETLK) using oflock structure.
Otherwise, kernel overwrites a word past the destination.

Submitted by:	walter@pelissero.de
PR:	196718
MFC after:	1 week
2015-07-08 13:19:13 +00:00
Mark Johnston
620711e033 Fix an incorrect assertion in witness.
The number of available lock list entries for a thread is LOCK_CHILDCOUNT,
and each entry can record up to LOCK_NCHILDREN locks. When iterating over
the locks held by a thread, a bound on the loop index is therefore given
by LOCK_CHILDCOUNT * LOCK_NCHILDREN; WITNESS_COUNT is an unrelated
constant.

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D2974
2015-07-07 19:29:18 +00:00
Pedro F. Giffuni
9129dd59be Relocate sched_random() within the SMP section.
Place sched_random nearer to where it's first used: moving the
code nearer to where it  is used makes the code easier to read
and we can reduce the initial "#ifdef SMP" island.

Reword a little the comment and clean some whitespaces
while here.
2015-07-07 15:22:29 +00:00
Mateusz Guzik
aa0e2887f4 tty: replace several curthread->td_proc with stored curproc
No functional changes.
2015-07-06 18:53:56 +00:00
Patrick Kelsey
6f99ea0520 Don't acquire sysctlmemlock in userland_sysctl() when the old value
pointer is NULL, as in that case there are no userland pages that
could potentially be wired.  It is common for old to be NULL and
oldlenp to be non-NULL in calls to userland_sysctl(), as this is used
to probe for the length of a variable-length sysctl entry before
retrieving a value.  Note that it is typical for such calls to be made
with an uninitialized value in *oldlenp, so sysctlmemlock was
essentially being acquired at random (depending on the uninitialized
value in *oldlenp being > PAGE_SIZE or not) for these calls prior to
this patch.

Differential Revision: https://reviews.freebsd.org/D2987
Reviewed by: mjg, kib
Approved by: jmallett (mentor)
MFC after: 1 month
2015-07-06 16:07:21 +00:00
Konstantin Belousov
9889bbac23 Mutex memory is not zeroed, add MTX_NEW.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-06 14:09:00 +00:00
Mark Johnston
947401dd50 Move the comment describing namei(9) back to namei()'s definition.
MFC after:	3 days
2015-07-05 22:56:41 +00:00
Mark Johnston
8bbd1f25b1 Remove a stale descriptive comment for gbincore().
The splay trees referenced in the comment were converted to
path-compressed tries in r250551.

MFC after:	3 days
2015-07-05 22:44:41 +00:00
Mark Johnston
5f34e93c58 Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D2937
2015-07-05 22:37:33 +00:00
Mateusz Guzik
f131759f54 fd: make 'rights' a manadatory argument to fget* functions 2015-07-05 19:05:16 +00:00
Mariusz Zaborski
54f98da930 Move the nvlist source and private includes from sys/kern to seperate
directory sys/contrib/libnv.

The goal of this operation is to NOT install header files which shouldn't
be used outside the nvlist library.

Approved by:	pjd (mentor)
2015-07-04 16:33:37 +00:00
Mateusz Guzik
9ca30b0e06 vfs: use shared vnode locking when looking up ".." in vop_stdvptocnp
Briefly discussed with: kib
2015-07-04 15:46:39 +00:00
Mateusz Guzik
dba0bec2bb fd: de-k&r-ify functions + some whitespace fixes
No functional changes.
2015-07-04 15:42:03 +00:00
Mateusz Guzik
ee5f66f820 sysctl: get rid of sysctl_lock/unlock
Inline their contents into the only consumer.
2015-07-04 14:44:39 +00:00
Mateusz Guzik
d5fc115a1a sysctl: remove a debugging printf which crept in with r285125 2015-07-04 07:01:43 +00:00
Mateusz Guzik
b8633775a8 sysctl: switch sysctllock to a sleepable rmlock
The lock is almost never taken for writing.
2015-07-04 06:54:15 +00:00
Mateusz Guzik
e2f5418e73 sysvshm: fix up some whitespace issues and spurious initialisation 2015-07-02 19:14:30 +00:00
Mateusz Guzik
77a26248a3 sysvshm: don't lock proc when calculating attach_va
vm_daddr is constant and RLIMIT_DATA can be obtained from thread's copy of
rlimits.
2015-07-02 19:03:44 +00:00
Mateusz Guzik
0be3a191a4 sysvshm: fix shmrealloc
The code was supposed to initialize new segs in newsegs array, but used the old
pointer.
2015-07-02 19:00:22 +00:00
Konstantin Belousov
1965f86c72 Vnode is not referenced by the vfs_domount() at the point where
asserts are made.  Remove them, since we might dereference freed
memory.  Leaked locks are asserted by the syscall return code anyway.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-02 14:31:47 +00:00
Navdeep Parhar
9523d1bfc3 Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough, any tags
hanging off the header need to be freed too.

Differential Revision:	https://reviews.freebsd.org/D2708
Reviewed by:	ae@, hiren@
2015-06-30 17:19:58 +00:00
Mark Murray
d1b06863fb Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
Konstantin Belousov
6ef120027f Do not calculate the stack's bottom address twice.
Submitted by:	Olivц╘r Pintц╘r
Review:	https://reviews.freebsd.org/D2953
MFC after:	1 week
2015-06-30 15:22:47 +00:00
Mark Murray
6687b6720b Ansify another function. This is the last in the file, I hope. 2015-06-28 10:51:08 +00:00
Mark Murray
7233d3094d ANSIfy the only function that uses K&R definition in this file. 2015-06-28 09:44:58 +00:00
Konstantin Belousov
b2c3df842b Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer.  Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed.  Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock.  Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with:	Conrad Meyer <cse.cem@gmail.com>
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation (kib),
	  EMC/Isilon storage division (Conrad)
MFC after:	2 weeks
2015-06-27 09:44:14 +00:00
Adrian Chadd
5bbb2169d2 Un-static cpuset_which() - it's useful in other contexts, such as some
CPU set operations in my upcoming NUMA work.

Tested/compiled:

* i386 (run)
* amd64 (run)
* mips (run)
* mips64 (run)
* armv6 (built)

Sponsored by:	Norse Corp, Inc.
2015-06-26 04:14:05 +00:00
Mateusz Guzik
7150ce743a rlimit: deduplicate code in chg* functions 2015-06-25 00:15:37 +00:00
Sean Bruno
4e83b32a80 At the suggestion of jhb, replace atomic_set/clear calls with use of
exclusive locks in the enable/disable interpreter path.

Tested with WITNESS/INVARIANTS on and off.

Reviewed by:	sson davide
2015-06-24 15:52:26 +00:00
John-Mark Gurney
1977bd233a zero this struct as it depends upon it...
Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D2890
2015-06-23 18:40:20 +00:00
Konstantin Belousov
b05c401ff6 Only take previous buffer queue lock (olock) when needed for REMFREE
in binsfree().

Submitted by:	Conrad Meyer
Sponsored by:	EMC / Isilon Storage Division
Review:	https://reviews.freebsd.org/D2882
MFC after:	1 week
2015-06-23 06:12:14 +00:00
Sean Bruno
945afa7c25 Make imgact_binmisc_exec() static.
Submitted by:	kib
Reviewed by:	sson
2015-06-22 17:04:24 +00:00
Sean Bruno
602ec83516 Remove uneeded NULL check since malloc the malloc is now M_WAITOK
Submitted by:	mjg
2015-06-19 20:35:17 +00:00
Sean Bruno
e0ae213f63 Must have one of either M_WAITOK or M_NOWAIT, read the man page bruno.
Submitted by:	mjg
2015-06-19 19:57:39 +00:00
Sean Bruno
a7647ec444 Feedback from commit r284535
davide:  imgact_binmisc_clear_entry() needs to use atomic ops to remove
the enable bit.

kib:  M_NOWAIT is not warranted and comment is invalid.
2015-06-19 18:57:36 +00:00
Sean Bruno
5f98711d51 This change replaces the mutex with a sx lock for the interpreter list to
avoid the problem of holding a non-sleep lock during a page fault as
reported by witness. It also uses atomics where possible to avoid having
to acquire the exclusive lock. In addition, it consistently uses
memset()/memcpy() instead of bzero()/bcopy().

Differential Revision:	https://reviews.freebsd.org/D1971
Submitted by:	sson
Reviewed by:	jhb
2015-06-18 02:04:20 +00:00
Bjoern A. Zeeb
af10bf055f Initialise pr_enforce_statfs from the "default" sysctl value and
not from the compile time constant.  The sysctl value is seeded
from the compile time constant.

MFC after:	2 weeks
2015-06-17 13:15:54 +00:00
Konstantin Belousov
1eabd96728 vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active.  This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list.  Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation.  In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list.  When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-17 04:46:58 +00:00
Pedro F. Giffuni
708694f704 Use nitems() macro instead of __arraycount() 2015-06-16 20:19:00 +00:00
Mateusz Guzik
4da8456f0a Replace struct filedesc argument in getvnode with struct thread
This is is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
Mateusz Guzik
9ef8328d52 fd: make rights a mandatory argument to fget_unlocked 2015-06-16 09:52:36 +00:00
Mateusz Guzik
80f3623f2f fd: don't unnecessary copy capabilities in _fget 2015-06-16 09:08:30 +00:00
Mateusz Guzik
cedab3c72c fd: reduce excessive zeroing on fd close
fde_file as NULL is already an indicator of an unused fd. All other
fields are populated when fp is installed.
2015-06-14 14:10:05 +00:00
Mateusz Guzik
ea31808c3b fd: move out actual fp installation to _finstall
Use it in fd passing functions as the first step towards fd code cleanup.
2015-06-14 14:08:52 +00:00
Jeremie Le Hen
3768a5dfb5 nit: Rename racct_alloc_resource to racct_adjust_resource.
This is more accurate as the amount can be negative.

MFC after:	2 weeks
2015-06-14 08:33:14 +00:00
Gleb Smirnoff
093c7f396d Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
Andriy Gapon
076dd8eb2e several lockstat improvements
0. For spin events report time spent spinning, not a loop count.
While loop count is much easier and cheaper to obtain it is hard
to reason about the reported numbers, espcially for adaptive locks
where both spinning and sleeping can happen.
So, it's better to compare apples and apples.

1. Teach lockstat about FreeBSD rw locks.
This is done in part by changing the corresponding probes
and in part by changing what probes lockstat should expect.

2. Teach lockstat that rw locks are adaptive and can spin on FreeBSD.

3. Report lock acquisition events for successful rw try-lock operations.

4. Teach lockstat about FreeBSD sx locks.
Reporting of events for those locks completely mirrors
rw locks.

5. Report spin and block events before acquisition event.
This is behavior documented for the upstream, so it makes sense to stick
to it.  Note that because of FreeBSD adaptive lock implementations
both the spin and block events may be reported for the same acquisition
while the upstream reports only one of them.

Differential Revision:	https://reviews.freebsd.org/D2727
Reviewed by:	markj
MFC after:	17 days
Relnotes:	yes
Sponsored by:	ClusterHQ
2015-06-12 10:01:24 +00:00
Mateusz Guzik
3331a33a42 ussreq: use saved fdp pointer insted of td->td_proc->p_fd
No functional changes.
2015-06-12 06:28:22 +00:00
Konstantin Belousov
529c97886b Tweaks for r284178:
Do not include machine/atomic.h explicitely, the header is already included
by sys/systm.h.

Force inlining of tc_getgen() and tc_setgen().  The functions are used
more than once, which causes compilers with non-aggressive inlining
policies to generate calls.

Suggested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-11 04:41:54 +00:00
Mateusz Guzik
21de5aea6c Fixup the build after r284215.
Submitted by:	Ivan Klymenko <fidaj ukr.net> [slighly modified]
2015-06-10 12:39:01 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Mateusz Guzik
4ea6a9a28f Generalised support for copy-on-write structures shared by threads.
Thread credentials are maintained as follows: each thread has a pointer to
creds and a reference on them. The pointer is compared with proc's creds on
userspace<->kernel boundary and updated if needed.

This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.
2015-06-10 10:43:59 +00:00
Mateusz Guzik
3b3eb22ab6 fd: remove fdesc_mtx 2015-06-10 09:40:07 +00:00
Mateusz Guzik
153cc61b54 fd: use atomics to manage fd_refcnt and fd_holcnt
This gets rid of fdesc_mtx.
2015-06-10 09:34:50 +00:00
Kenneth D. Merry
5672fac935 Add support for reading MAM attributes to camcontrol(8) and libcam(3).
MAM is Medium Auxiliary Memory and is most commonly found as flash
chips on tapes.

This includes support for reading attributes and decoding most
known attributes, but does not yet include support for writing
attributes or reporting attributes in XML format.

libsbuf/Makefile:
	Add subr_prf.c for the new sbuf_hexdump() function.  This
	function is essentially the same function.

libsbuf/Symbol.map:
	Add a new shared library minor version, and include the
	sbuf_hexdump() function.

libsbuf/Version.def:
	Add version 1.4 of the libsbuf library.

libutil/hexdump.3:
	Document sbuf_hexdump() alongside hexdump(3), since it is
	essentially the same function.

camcontrol/Makefile:
	Add attrib.c.

camcontrol/attrib.c:
	Implementation of READ ATTRIBUTE support for camcontrol(8).

camcontrol/camcontrol.8:
	Document the new 'camcontrol attrib' subcommand.

camcontrol/camcontrol.c:
	Add the new 'camcontrol attrib' subcommand.

camcontrol/camcontrol.h:
	Add a function prototype for scsiattrib().

share/man/man9/sbuf.9:
	Document the existence of sbuf_hexdump() and point users to
	the hexdump(3) man page for more details.

sys/cam/scsi/scsi_all.c:
	Add a table of known attributes, text descriptions and
	handler functions.

	Add a new scsi_attrib_sbuf() function along with a number
	of other related functions that help decode attributes.

	scsi_attrib_ascii_sbuf() decodes ASCII format attributes.

	scsi_attrib_int_sbuf() decodes binary format attributes, and
	will pass them off to scsi_attrib_hexdump_sbuf() if they're
	bigger than 8 bytes.

	scsi_attrib_vendser_sbuf() decodes the vendor and drive
	serial number attribute.

	scsi_attrib_volcoh_sbuf() decodes the Volume Coherency
	Information attribute that LTFS writes out.

sys/cam/scsi/scsi_all.h:
	Add a number of attribute-related structure definitions and
	other defines.

	Add function prototypes for all of the functions added in
	scsi_all.c.

sys/kern/subr_prf.c:
	Add a new function, sbuf_hexdump().  This is the same as
	the existing hexdump(9) function, except that it puts the
	result in an sbuf.

	This also changes subr_prf.c so that it can be compiled in
	userland for includsion in libsbuf.

	We should work to change this so that the kernel hexdump
	implementation is a wrapper around sbuf_hexdump() with a
	statically allocated sbuf with a drain.  That will require
	a drain function that goes to the kernel printf() buffer
	that can take a non-NUL terminated string as input.
	That is because an sbuf isn't NUL-terminated until it is
	finished, and we don't want to finish it while we're still
	using it.

	We should also work to consolidate the userland hexdump and
	kernel hexdump implemenatations, which are currently
	separate.  This would also mean making applications that
	currently link in libutil link in libsbuf.

sys/sys/sbuf.h:
	Add the prototype for sbuf_hexdump(), and add another copy
	of the hexdump flag values if they aren't already defined.

	Ideally the flags should be defined in one place but the
	implemenation makes it difficult to do properly.  (See
	above.)

Sponsored by:	Spectra Logic Corporation
MFC after:	1 week
2015-06-09 21:39:38 +00:00
Konstantin Belousov
2c6946dca2 When updating/accessing the timehands, barriers are needed to ensure
that:
- th_generation update is visible after the parameters update is
  visible;
- the read of parameters is not reordered before initial read of
  th_generation.

On UP kernels, compiler barriers are enough.  For SMP machines, CPU
barriers must be used too, as was confirmed by submitter by testing on
the Freescale T4240 platform with 24 PowerPC processors.

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2015-06-09 11:49:56 +00:00
John Baldwin
15c2b30155 Revert r284153, as I believe it breaks the dtrace sdt module. I will
fix the original issue a different way.
2015-06-08 18:06:00 +00:00
Ed Maste
6b16d66497 Add user facing errors for exceeding process memory limits
Previously the process terminating with SIGABRT at startup was the
only notification.

PR:		200617
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D2731
2015-06-08 16:07:07 +00:00
John Baldwin
69c5c774fe Add an internal "locked" variant of linker_file_lookup_set() and change
the public function to acquire the global linker lock directly.  This
permits linker_file_lookup_set() to be safely used from other modules.
2015-06-08 14:06:47 +00:00
Mark Johnston
8b84d791a0 witness: don't warn about matrix inconsistencies without holding the mutex
Lock order checking is done without the witness mutex held, so multiple
threads that are racing to establish a new lock order may read matrix
entries that are in an inconsistent state. Don't print a warning in this
case, but instead just redo the check after taking the witness lock.

Differential Revision:	https://reviews.freebsd.org/D2713
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2015-06-07 18:59:47 +00:00
Sean Bruno
280b716943 Revert 284029, update imgact_binmisctl.c change mtx to reader count, at the
request of the submitter.

Will attempt to use an sx_lock for this fix to WITNESS crashes in a later
revision.

Submitted by:	sson
2015-06-05 18:16:10 +00:00
Sean Bruno
8c8613a14f This change uses a reader count instead of holding the mutex for the
interpreter list to avoid the problem of holding a non-sleep lock during
a page fault as reported by witness.  In addition, it consistently uses
memset()/memcpy() instead of bzero()/bcopy() except in the case where
bcopy() is required (i.e. overlapping copy).

Differential Revision:	https://reviews.freebsd.org/D2123
Submitted by:	sson
MFC after:	2 weeks
Relnotes:	Yes
2015-06-05 16:21:43 +00:00
John Baldwin
7077c42623 Add a new file operations hook for mmap operations. File type-specific
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c.  This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.

The vm_mmap() function is now split up into two functions.  A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
a VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.

The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings.  For mappings using a file descriptor, the
descriptors fo_mmap() hook is invoked instead.  The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset.  The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.

The fo_mmap() hook is optional.  If it is not set, then fo_mmap() will
fail with ENODEV.  A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).

While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead.  While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.

Differential Revision:	https://reviews.freebsd.org/D2658
Reviewed by:	alc (glanced over), kib
MFC after:	1 month
Sponsored by:	Chelsio
2015-06-04 19:41:15 +00:00
Eric van Gyzen
63e4c6cdf9 Provide vnode in memory map info for files on tmpfs
When providing memory map information to userland, populate the vnode pointer
for tmpfs files.  Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by:   Eric Badger <eric@badgerio.us> (initial revision)
Obtained from:  Dell Inc.
PR:             198431
MFC after:      2 weeks
Reviewed by:    jhb
Approved by:    kib (mentor)
2015-06-02 18:37:04 +00:00
Xin LI
4c372ca254 Clear p_stops when doing PT_DETACH.
Without this, if a process was being traced by truss(1), which
uses different p_stops bits than gdb(1), the latter would
misbehave because of the unexpected bits.

Reported by:	jceel
Submitted by:	sef
Sponsored by:	iXsystems, Inc.
MFC after:	2 weeks
2015-06-01 18:15:45 +00:00
Konstantin Belousov
aef68c961a When delivering a signal with default disposition to the thread,
tdsigwakeup() increases the priority of the low-priority threads, to
give them a chance to be terminated timely.  Also, kernel allows user
to signal kernel processes.  The combined effect is that signalling
idle process bump a priority of the selected delivery thread, which
starts eating CPU.

Check for the delivery thread be an idle thread and do not raise its
priority then.

The signal delivery to the kernel threads must be opt-in feature.
Kernel thread should explicitely declare the ability to handle signals
directed to it.  E.g., nfsd threads check for signal as an indication
of exit request.

Most threads do not handle signals at all, and queuing the signal to
them causes odd side-effects.  Most innocent consequence is the memory
leak due to queued ksiginfo, which is never deleted from the sigqueue.
Code to prevent even queuing signals to the kernel threads is trivial,
but it requires careful examination of each call to kproc/kthread
creation to decide should the signalling be allowed.  The commit is a
stop-gap measure which fixes the immediate case for now.

PR:	200493
Reported and tested by:	trasz
Discussed with:	trasz, emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-29 16:26:08 +00:00
Konstantin Belousov
69baeadc31 Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros.  It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review:	https://reviews.freebsd.org/D2665
Reviewed by:	rodrigc
Looked at by:	bjk
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-29 13:24:17 +00:00
Konstantin Belousov
780dca1b1e Right now, dounmount() is called with unreferenced mount point.
Nothing stops a parallel unmount to suceed before the given call to
dounmount() checks and locks the covered vnode.  Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced.  dounmount()
consumes the reference on return, regardless of the sucessfull or
erronous result.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:22:50 +00:00
Konstantin Belousov
2db0e1f50d Add V_MNTREF flag to the vn_start_write(9) and
vn_start_secondary_write(9) functions.  The flag indicates that the
caller already owns a reference on the mount point, and the functions
can consume it.  The reference is released by vn_finished_write(9) and
vn_finished_secondary_write(9) in due course.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:21:47 +00:00
Konstantin Belousov
1bc93bb7b9 Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks.  Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker'
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode.  A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads.  NFS server loop
explicitely checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag.  This is needed because nfsd may be the sole cause of the SU
workqueue overflow.  But, to not cause nsfd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by:	jhb, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:20:42 +00:00
John Baldwin
57c74f5b64 Do not allow a process to reap an orphan (a child currently being
traced by another process such as a debugger). The parent process does
need to check for matching orphan pids to avoid returning ECHILD if an
orphan has exited, but it should not return the exited status for the
child until after the debugger has detached from the orphan process
either explicitly or implicitly via wait().

Add two tests for for this case: one where the debugger is the direct
child (thus the parent has a non-empty children list) and one where
the debugger is not a direct child (so the only "child" of the parent
is the orphan).

Differential Revision:	https://reviews.freebsd.org/D2644
Reviewed by:	kib
MFC after:	2 weeks
2015-05-26 10:29:37 +00:00
Xin LI
eb3d0c5d8c MFuser/delphij/zfs-arc-rebase@r281754:
In r256613, taskqueue_enqueue_locked() have been modified to release the
task queue lock before returning.  In r276665, taskqueue_drain_all() will
call taskqueue_enqueue_locked() to insert the barrier task into the queue,
but did not reacquire the lock after it but later code expects the lock
still being held (e.g. TQ_SLEEP()).

The barrier task is special and if we release then reacquire the lock,
there would be a small race window where a high priority task could sneak
into the queue.  Looking more closely, the race seems to be tolerable but
is undesirable from semantics standpoint.

To solve this, in taskqueue_drain_tq_queue(), instead of directly calling
taskqueue_enqueue_locked(), insert the barrier task directly without
releasing the lock.
2015-05-26 01:40:33 +00:00
John Baldwin
515b7a0b97 Add KTR tracing for some MI ptrace events.
Differential Revision:	https://reviews.freebsd.org/D2643
Reviewed by:	kib
2015-05-25 22:13:22 +00:00
Dmitry Chagin
7236f2c220 For future use in the Linuxulator:
1. Add a kern_kqueue() counterpart for kqueue() with flags parameter.

2. Be a bit secure. To avoid a double fp lookup add a kern_kevent_fp()
counterpart for kern_kevent() with file pointer parameter instead
of file descriptor an pass the buck to it.

Suggested by: mjg [2]

Differential Revision:	https://reviews.freebsd.org/D1091
Reviewed by:	trasz
2015-05-24 16:36:29 +00:00
Dmitry Chagin
91d1786f65 In preparation for switching linuxulator to the use the native 1:1
threads add a hook for cleaning thread resources before the thread die.

Differential Revision:	https://reviews.freebsd.org/D1038
2015-05-24 14:51:29 +00:00
Dmitry Chagin
a93e83c8d7 In preparation for switching linuxulator to the use the native 1:1
threads split sys_sched_getparam(), sys_sched_setparam(),
sys_sched_getscheduler(), sys_sched_setscheduler() to their kern_*
counterparts and add targettd parameter to allow specify the target
thread directly by callee.

Differential Revision:	https://reviews.freebsd.org/D1034
Reviewed by:	trasz
2015-05-24 14:44:06 +00:00
Dmitry Chagin
1aa90eca33 In preparation for switching linuxulator to the use the native 1:1
threads refactor kern_sched_rr_get_interval() and sys_sched_rr_get_interval().
Add a kern_sched_rr_get_interval() counterpart which takes a targettd
parameter to allow specify target thread directly by callee (new Linuxulator).

Linuxulator temporarily uses first thread in proc.

Move linux_sched_rr_get_interval() to the MI part.

Differential Revision:	https://reviews.freebsd.org/D1032
Reviewed by:	trasz
2015-05-24 14:39:26 +00:00
Dmitry Chagin
09baafb471 In preparation for switching linuxulator to the use the native 1:1
threads introduce kern_thr_alloc() which will be used later in the
linux_clone().

Differential Revision:	https://reviews.freebsd.org/D1029
Reviewed by:	trasz
2015-05-24 14:37:45 +00:00
Dmitry Chagin
95be6d2b1f In preparation for switching linuxulator to the use the native 1:1
threads split sys_thr_exit() up into sys_thr_exit() and kern_thr_exit().
Move
Where the second will be used in linux_exit() system call later.

Differential Revision:	https://reviews.freebsd.org/D1028
Reviewed by:	trasz
2015-05-24 14:36:33 +00:00
Konstantin Belousov
3077f938b4 If thread requested to not stop on non-boundary, then not only
stopping signals should obey, but also all forms of single-threading.
Otherwise, thread might sleep interruptible while owning some
resources, and single-threading thread could try to access them.
An example is owning vnode lock while dumping core.

Submitted by:	Conrad Meyer
Review:	https://reviews.freebsd.org/D2612
Tested by:	pho
MFC after:	1 week
2015-05-23 19:09:04 +00:00
Warner Losh
ee960398a0 Fix typo in symbol name. It helps to hit save in all your buffers
before committing.
2015-05-22 21:10:14 +00:00
Warner Losh
d36eec691a Export the eflags field from the elf header. This allows better
discrimination between different subarch binaries, at least for mips
and arm. Arm is implemented, mips is still tbd, so not currently
exported. aarch64 does not export this because aarch64 binaries use
different tags and flags than arm.

Differential Revision: https://reviews.freebsd.org/D2611
2015-05-22 20:50:35 +00:00
Jung-uk Kim
fd90e2ed54 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
John Baldwin
21119d0641 Expand ktr_mask to be a 64-bit unsigned integer.
The mask does not really need to be updated with atomic operations and
the downside of losing races during transitions is not great (it is
not marked volatile, so those races are pretty wide open as it is).

Differential Revision:	https://reviews.freebsd.org/D2595
Reviewed by:	emaste, neel, rpaulo
MFC after:	2 weeks
2015-05-22 11:09:41 +00:00
John Baldwin
c209e3e2e6 Only reparent a traced process to its old parent if the tracing process is
not the old parent. Otherwise, proc_reap() will leave the zombie in place
resulting in the process' status being returned twice to its parent.

Add test cases for PT_TRACE_ME and PT_ATTACH which are fixed by
this change.

Differential Revision:	https://reviews.freebsd.org/D2594
Reviewed by:	kib
MFC after:	2 weeks
2015-05-22 11:04:54 +00:00
John Baldwin
c636f94bd2 Revert r282971. It depends on condvar consumers not destroying condvars
until all threads sleeping on a condvar have resumed execution after being
awakened.  However, there are cases where that guarantee is very hard to
provide.
2015-05-21 16:43:26 +00:00
Pedro F. Giffuni
cd508278c1 ddb: finish converting boolean values.
The replacement started at r283088 was necessarily incomplete without
replacing boolean_t with bool.  This also involved cleaning some type
mismatches and ansifying old C function declarations.

Pointed out by:	bde
Discussed with:	bde, ian, jhb
2015-05-21 15:16:18 +00:00
Mariusz Zaborski
963bc7a03f Fix memory leak.
Approved by:	pjd (mentor)
2015-05-20 17:48:22 +00:00
Mariusz Zaborski
1acd888ff5 Style.
Approved by:	pjd (mentor)
2015-05-20 17:47:01 +00:00
Mariusz Zaborski
823870acb8 Always use the nv_free function.
Approved by:	pjd (mentor)
2015-05-20 17:44:58 +00:00
Alan Somers
7a9c38e681 Properly null-terminate strings in a kernel dump header. A version string
longer than 192 bytes will cause the version field of a dump header to
overflow. strncpy doesn't null terminate it, so savecore will print a
corrupted info file. Using strlcpy fixes the bug.

Differential Revision:	https://reviews.freebsd.org/D2560
Reviewed by:		markj
MFC after:		3 weeks
Sponsored by:		Spectra Logic
2015-05-19 16:23:47 +00:00
Mateusz Guzik
747c0dd67c fd: fix imbalanced fdp unlock in F_SETLK and F_GETLK
MFC after:	3 days
2015-05-18 14:27:04 +00:00
Mateusz Guzik
c3293b83c4 Tidy up sys_umask a little bit
Consistently use saved fdp pointer as it cannot change. If it could change the
code would be already incorrect.

No functional changes.
2015-05-18 13:43:33 +00:00
John Baldwin
5c894ee2ef Previously, cv_waiters was only updated by cv_signal or cv_wait. If a
thread awakened due to a time out, then cv_waiters was not decremented.
If INT_MAX threads timed out on a cv without an intervening cv_broadcast,
then cv_waiters could overflow. To fix this, have each sleeping thread
decrement cv_waiters when it resumes.

Note that previously cv_waiters was protected by the sleepq chain lock.
However, that lock is not held when threads resume from sleep. In
addition, the interlock is also not always reacquired after resuming
(cv_wait_unlock), nor is it always held by callers of cv_signal() or
cv_broadcast(). Instead, use atomic ops to update cv_waiters. Since
the sleepq chain lock is still held on every increment, it should
still be safe to compare cv_waiters against zero while holding the
lock in the wakeup routines as the only way the race should be lost
would result in extra calls to sleepq_signal() or sleepq_broadcast().

Differential Revision:	https://reviews.freebsd.org/D2427
Reviewed by:	benno
Reported by:	benno (wrap of cv_waiters in the field)
MFC after:	2 weeks
2015-05-15 13:50:37 +00:00
Konstantin Belousov
100ac78be1 On amd64, make proc0 pmap initialization slightly more correct. In
particular, switch to the proc0 pmap to have expected %cr3 and PCID
for the thread0 during initialization, and the up to date pm_active
mask.

pmap_pinit0() should be done after proc0->p_vmspace is assigned so
that the amd64 pmap_activate() find the correct curproc pmap.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-05-15 08:30:29 +00:00
Konstantin Belousov
84cdea97e5 Right now, the process' p_boundary_count counter is decremented by the
suspended thread itself, on the return path from
thread_suspend_check().  A consequence is that return from
thread_single_end(SINGLE_BOUNDARY) may leave p_boundary_count
non-zero, it might be even equal to the threads count.

Now, assume that we have two threads in the process, both calling
execve(2).  Suppose that the first thread won the race to be the
suspension thread, and that afterward its exec failed for any reason.
After the first thread did thread_single_end(SINGLE_BOUNDARY), second
thread becomes the process suspension thread and checks
p_boundary_count.  The non-zero value of the count allows the
suspension loop to finish without actually suspending some threads.
In other words, we enter exec code with some threads not suspended.

Fix this by decrementing p_boundary_count in the
thread_single_end()->thread_unsuspend_one() during marking the thread
as runnable.  This way, a return from thread_single_end() guarantees
that the counter is cleared.  We do not care whether the unsuspended
thread has a chance to run.

Add some asserts to ensure the state of the process when single
boundary suspension is lifted.  Also make thread_unuspend_one()
static.

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-15 07:54:31 +00:00
Jonathan Anderson
60aa2c85fa Allow sizeof(cpuset_t) to be queried in capability mode.
This allows functions that retrieve and inspect pthread_attr_t objects to
work correctly: querying the cpuset_t size is part of querying CPU
affinity information, which is part of creating a complete pthread_attr_t.

Approved by: rwatson (mentor)
Reviewed by: pjd
Sponsored by: NSERC
2015-05-14 15:14:03 +00:00
Edward Tomasz Napierala
ba8f0eb8fc Build GENERIC with RACCT/RCTL support by default. Note that it still
needs to be enabled by adding "kern.racct.enable=1" to /boot/loader.conf.

Differential Revision:	https://reviews.freebsd.org/D2407
Reviewed by:	emaste@, wblock@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-05-14 14:03:55 +00:00
Konstantin Belousov
7b445033ff On exec, single-threading must be enforced before arguments space is
allocated from exec_map.  If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptible waiting for the map space.  Then, the thread which won
the race for the space allocation, cannot single-thread the process,
causing deadlock.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-10 09:00:40 +00:00
Konstantin Belousov
44ec2b63c5 The vmem callback to reclaim kmem arena address space on low or
fragmented conditions currently just wakes up the pagedaemon.  The
kmem arena is significantly smaller then the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted, while there is a lot of pages available still.  The
woken up pagedaemon sees vm_pages_needed != 0, verifies the condition
vm_paging_needed() which is false, clears the pass and returns back to
sleep, not calling neither uma_reclaim() nor lowmem handler.

To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly.  The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-09 20:08:36 +00:00
Konstantin Belousov
ac437c0754 Do not return from thread_single(SINGLE_BOUNDARY) until all stopped
thread are guarenteed to be removed from the processors.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-09 18:32:13 +00:00
Andrey V. Elsukov
089bb672c4 m_dup() is supposed to give a writable copy of an mbuf chain. It uses
m_dup_pkthdr(), that uses M_COPYFLAGS mask to copy m_flags field.
If original mbuf chain has M_RDONLY flag, its copy also will have it.
Reset this flag explicitly.

MFC after:	2 weeks
2015-05-07 18:35:01 +00:00
Mateusz Guzik
edf1796d3e Fix up panics when fork fails due to hitting proc limit
The function clearning credentials on failure asserts the process is a
zombie, which is not true when fork fails.

Changing creds to NULL is unnecessary, but is still being done for
consistency with other code.

Pointy hat: mjg
Reported by: pho
2015-05-06 21:03:19 +00:00
Ian Lepore
28315e27a7 Implement a mechanism for making changes in the kernel<->driver PPS
interface without breaking ABI or API compatibility with existing drivers.

The existing data structures used to communicate between the kernel and
driver portions of PPS processing contain no spare/padding fields and no
flags field or other straightforward mechanism for communicating changes
in the structures or behaviors of the code.  This makes it difficult to
MFC new features added to the PPS facility.  ABI compatibility is
important; out-of-tree drivers in module form are known to exist.  (Note
that the existing api_version field in the pps_params structure must
contain the value mandated by RFC 2783 and any RFCs that come along after.)

These changes introduce a pair of abi-version fields which are filled in
by the driver and the kernel respectively to indicate the interface
version.  The driver sets its version field before calling the new
pps_init_abi() function.  That lets the kernel know how much of the
pps_state structure is understood by the driver and it can avoid using
newer fields at the end of the structure that it knows about if the driver
is a lower version.  The kernel fills in its version field during the init
call, letting the driver know what features and data the kernel supports.

To implement the new version information in a way that is backwards
compatible with code from before these changes, the high bit of the
lightly-used 'kcmode' field is repurposed as a flag bit that indicates the
driver is aware of the abi versioning scheme.  Basically if this bit is
clear that indicates a "version 0" driver and if it is set the driver_abi
field indicates the version.

These changes also move the recently-added 'mtx' field of pps_state from
the middle to the end of the structure, and make the kernel code that uses
this field conditional on the driver being abi version 1 or higher.  It
changes the only driver currently supplying the mtx field, usb_serial, to
use pps_init_abi().

Reviewed by:	hselasky@
2015-05-04 17:59:39 +00:00
Mariusz Zaborski
a523fa069f nv_malloc can fail in userland.
Add check to prevent a NULL pointer dereference.

Pointed out by:	mjg
Approved by:	pjd (mentor)
2015-05-02 18:12:34 +00:00
Mariusz Zaborski
da20d06f84 Remove duplicated code using macro template for the nvlist_add_.* functions.
Approved by:	pjd (mentor)
2015-05-02 18:10:45 +00:00
Mariusz Zaborski
169c153b59 Introduce the NV_FLAG_NO_UNIQUE flag. When set, it allows to store
multiple values using the same key in a nvlist.

Approved by:	pjd (mentor)
Obtained from:	WHEEL Systems (http://www.wheelsystems.com)

Update man page.

Reviewed by:	AllanJude
Approved by:	pjd (mentor)
2015-05-02 18:03:47 +00:00
Mariusz Zaborski
bd1da0a002 Approved, oprócz użycie RESTORE_ERRNO() do ustawiania errno.
Change the nvlist_recv() function to take additional argument that
specifies flags expected on the received nvlist. Receiving a nvlist with
different set of flags than the ones we expect might lead to undefined
behaviour, which might be potentially dangerous.

Update consumers of this and related functions and update the tests.

Approved by:	pjd (mentor)

Update man page for nvlist_unpack, nvlist_recv, nvlist_xfer, cap_recv_nvlist
and cap_xfer_nvlist.

Reviewed by:	AllanJude
Approved by:	pjd (mentor)
2015-05-02 17:45:52 +00:00
Bjoern A. Zeeb
44aa151e1b Fix an off-by-one bug in string/array handling which lead to memory overwrite
and follow-up assertion errors on at least ARM after r282257,
with nvp_magic being 0x6e7600:
Assertion failed: ((nvp)->nvp_magic == 0x6e7670), function nvpair_name, file .../subr_nvpair.c, line 713.

Sponsored by:	DARPA/AFRL
2015-05-02 08:31:16 +00:00
Mark Johnston
9ad64f27be Remove a stale reference to the stop_scheduler_on_panic tunable, which
itself was removed in r243515.

MFC after:	1 week
2015-05-02 00:27:58 +00:00
Mariusz Zaborski
24f93ee714 Add nvlist_flags() function, which returns nvlist's public flags.
Approved by:	pjd (mentor)
2015-05-01 17:50:24 +00:00
Mariusz Zaborski
b7be86aca6 Mark local function as static as a result of removing recursion.
Approved by:	pjd (mentor)
2015-04-30 20:50:42 +00:00
Mariusz Zaborski
45fd5ced7b Rename macros to use prefix ERRNO. Add macro ERRNO_SET. Now
ERRNO_{RESTORE/SAVE} must by used together, additional variable is not
needed. Always use ERRNO_{SAVE/RESTORE/SET} macros.

Approved by:	pjd (mentor)
2015-04-30 20:47:33 +00:00
John Baldwin
ed95805e90 Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains.  Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead.  Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
  components for PV kernels.  These components are now standard as they
  are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
  etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision:	https://reviews.freebsd.org/D2362
Reviewed by:	royger
Tested by:	royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes:	yes
2015-04-30 15:48:48 +00:00
Mariusz Zaborski
be73922fcd Save errno from close override.
Approved by:	pjd (mentor)
2015-04-29 22:59:44 +00:00
Mariusz Zaborski
0bb5e6ef80 Remove the nvlist_.*[fv] functions.
Those functions are problematic, because there is no way to report
memory allocation problems without complicating the API, so we can
either abort or potentially return invalid results. None of which is
acceptable.

In most cases the caller knows the size of the name, so he can allocate
buffer on the stack and use snprintf(3) to prepare the name.

After some discussion the conclusion is to removed those functions,
which also simplifies the API.

Discussed with: pjd, rstone
Approved by:	pjd (mentor)
2015-04-29 22:57:04 +00:00
Mariusz Zaborski
3cfb71c186 Remove recursion from descriptor-related functions.
Approved by:	pjd (mentor)
2015-04-29 22:15:02 +00:00
Mariusz Zaborski
003e3ea15b Always use the nv_malloc macro instead of malloc(3).
Approved by:	pjd (mentor)
2015-04-29 21:54:34 +00:00
Mariusz Zaborski
906289c2ec Style fixes.
Approved by:	pjd (mentor)
2015-04-29 21:50:04 +00:00
Edward Tomasz Napierala
4b5c9cf62f Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
Edward Tomasz Napierala
aa32b5e076 Make setproctitle(3) work in Capsicum capability mode. This makes
ctld(8) child processes to indicate initiator address and name in
their titles, similar to what iscsid(8) child processes do.

PR:		181352
Differential Revision:	https://reviews.freebsd.org/D2363
Reviewed by:	rwatson@, mjg@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-04-27 11:18:16 +00:00
Konstantin Belousov
9a2c85350a Partially revert r255986: do not call VOP_FSYNC() when helping
bufdaemon in getnewbuf(), do use buf_flush().  The difference is that
bufdaemon uses TRYLOCK to get buffer locks, which allows calls to
getnewbuf() while another buffer is locked.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-27 11:13:19 +00:00
Konstantin Belousov
f16f8610bb Fix locking for oshmctl() and shmsys().
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-27 11:12:51 +00:00
Mateusz Guzik
8d0a4ab212 fd: plug an always overwritten initialization in fdalloc 2015-04-26 17:27:55 +00:00
Mateusz Guzik
203322f966 Consistently use p instead of td->td_proc in create_thread
No functional changes.
2015-04-26 17:22:59 +00:00
Rick Macklem
7cfdc2a7bc MAXBSIZE defines both the largest UFS block size and the
largest size for a buffer in the buffer cache. This patch
defines a new constant MAXBCACHEBUF, which is the largest
size for a buffer in the buffer cache. Having a separate
constant allows MAXBCACHEBUF to be set larger than MAXBSIZE
on a per-architecture basis, so that NFS can do larger read/writes
for these architectures. It modifies sys/param.h so that BKVASIZE
can also be set on a per-architecture basis.
A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE
is fixed as well.

Differential Revision:	https://reviews.freebsd.org/D2330
Reviewed by:	mav, kib
MFC after:	2 weeks
2015-04-25 00:52:01 +00:00
Konstantin Belousov
b9062c934a Use correct length for sparse uiomove(). It must be the clipped to
the page size, len is the total transfer length, which may be larger
than zero_region.

Reported and tested by:	clusteradm (gjb)
Sponsored by:	The FreeBSD Foundation
X-MFC-With:	r281442
2015-04-24 22:05:12 +00:00
Mark Johnston
da10a60340 Make vpanic() externally visible so that it can be called as part of the
DTrace panic() action.

Differential Revision:	https://reviews.freebsd.org/D2349
Reviewed by:	avg
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2015-04-24 03:17:21 +00:00
Konstantin Belousov
fe8a824ca6 Handle incorrect ELF images specifying size for PT_GNU_STACK not being
multiple of page size.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-23 11:27:21 +00:00
Alexander Motin
f743d981f3 Make AIO to not allocate pbufs for unmapped I/O like r281825.
While there, make few more performance optimizations.

On 40-core system doing many 512-byte AIO reads from array of raw SSDs
this change removes lock congestions inside pbuf allocator and devfs,
and bottleneck on single AIO completion taskqueue thread.  It improves
peak AIO performance from ~600K to ~1.3M IOPS.

MFC after:	2 weeks
2015-04-22 18:11:34 +00:00
Craig Rodrigues
d9db52256e Move zlib.c from net to libkern.
It is not network-specific code and would
be better as part of libkern instead.
Move zlib.h and zutil.h from net/ to sys/
Update includes to use sys/zlib.h and sys/zutil.h instead of net/

Submitted by:		Steve Kiernan stevek@juniper.net
Obtained from:		Juniper Networks, Inc.
GitHub Pull Request:	https://github.com/freebsd/freebsd/pull/28
Relnotes:		yes
2015-04-22 14:38:58 +00:00
Craig Rodrigues
d5fec48956 Support file verification in MAC.
* Add VCREAT flag to indicate when a new file is being created
* Add VVERIFY to indicate verification is required
* Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open
  and are removed from the accmode after
* Add O_VERIFY flag to rtld open of objects
* Add 'v' flag to __sflags to set O_VERIFY flag.

Submitted by:		Steve Kiernan <stevek@juniper.net>
Obtained from:		Juniper Networks, Inc.
GitHub Pull Request:	https://github.com/freebsd/freebsd/pull/27
Relnotes:		yes
2015-04-22 01:54:25 +00:00
Edward Tomasz Napierala
6289b482ec Modify kern___getcwd() to take max pathlen limit as an additional
argument.  This will be used for the Linux emulation layer - for Linux,
PATH_MAX is 4096 and not 1024.

Differential Revision:	https://reviews.freebsd.org/D2335
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-04-21 13:55:24 +00:00
Alexander Motin
869fd29a7b Rewrite physio() to not allocate pbufs for unmapped I/O.
pbufs is a limited resource, and their allocator is not SMP-scalable.
So instead of always allocating pbuf to immediately convert it to bio,
allocate bio just here.  If buffer needs kernel mapping, then pbuf is
still allocated, but used only as a source of KVA and storage for a list
of held pages.

On 40-core system doing many 512-byte reads from user level to array of
raw SSDs this change removes huge lock congestion inside pbuf allocator.
It improves peak performance from ~300K to ~1.2M IOPS.  On my previous
24-core system this problem also existed, but was less serious.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-21 10:55:53 +00:00
Eric van Gyzen
c207ff9319 Always send log(9) messages to the message buffer.
It is truer to the semantics of logging for messages to *always*
go to the message buffer, where they can eventually be collected
and, in fact, be put into a log file.

This restores the behavior prior to r70239, which seems to have
changed it inadvertently.

Submitted by:	Eric Badger <eric@badgerio.us>
Reviewed by:	jhb
Approved by:	kib (mentor)
Obtained from:	Dell Inc.
MFC after:	1 week
2015-04-20 20:03:26 +00:00
Konstantin Belousov
8103a8f608 Regen. 2015-04-18 21:50:53 +00:00
Konstantin Belousov
0538aafc41 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
Mark Johnston
38563f7c92 Remove unimplemented sched provider probes.
They were added for compatibility with the sched provider in Solaris and
illumos, but our sched provider is already incompatible since it uses native
types, so there isn't much point in keeping them around.

Differential Revision:	https://reviews.freebsd.org/D2167
Reviewed by:	rpaulo
2015-04-18 20:36:58 +00:00
Konstantin Belousov
ad8b1d857d Initialize td_sel in the thread_init(). Struct thread is not zeroed
on the initial allocation, but seltdinit() assumes that td_sel is NULL
or a valid pointer.  Note that thread_fini()/seltdfini() also relies
on this, but correctly resets td_sel to NULL.

Submitted by:	luke.tw@gmail.com
PR:	199518
MFC after:	1 week
2015-04-18 17:21:12 +00:00
Kirk McKusick
f351915514 More accurately collect name-cache statistics in sysctl functions
sysctl_debug_hashstat_nchash() and sysctl_debug_hashstat_rawnchash().
These changes are in preparation for allowing changes in the size
of the vnode hash tables driven by increases and decreases in the
maximum number of vnodes in the system.

Reviewed by: kib@
Phabric:     D2265
2015-04-18 00:59:03 +00:00
Devin Teske
43d4f8c4c6 Add "GELI Passphrase:" prompt to boot loader.
A new loader.conf(5) option of geom_eli_passphrase_prompt="YES" will now
allow you to enter your geli(8) root-mount credentials prior to invoking
the kernel.

See check-password.4th(8) for details.

Differential Revision:	https://reviews.freebsd.org/D2105
Reviewed by:	imp, kmoore
Discussed on:	-current
MFC after:	3 days
X-MFC-to:	stable/10
Relnotes:	yes
2015-04-16 20:53:15 +00:00
Rick Macklem
dda11d4ab9 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
Neel Natu
1a688aa53e Fix handling of BUS_PROBE_NOWILDCARD in 'device_probe_child()'.
Device probe value of BUS_PROBE_NOWILDCARD should be treated specially only
if the device has a fixed devclass. Otherwise it should be interpreted just
as if the driver doesn't want to claim the device.

Prior to this change a device that was not claimed explicitly by its driver
would remain "attached" to the driver that returned BUS_PROBE_NOWILDCARD.
This would bump up the reference on 'driver->refs' and its 'dev->ops' would
point to the 'driver->ops'. When the driver is subsequently unloaded the
'dev->ops->cls' is left pointing to freed memory.

This fixes an easily reproducible #GP fault caused by loading and unloading
vmm.ko multiple times.

Differential Revision:	https://reviews.freebsd.org/D2294
Reviewed by:	imp, jhb
Discussed with:	rstone
Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-15 16:22:05 +00:00
Edward Tomasz Napierala
1c73bcab8e Rewrite linprocfs_domtab() as a wrapper around kern_getfsstat(). This
adds missing jail and MAC checks.

Differential Revision:	https://reviews.freebsd.org/D2193
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-04-15 09:13:11 +00:00
Konstantin Belousov
316b384343 Implement support for binary to requesting specific stack size for the
initial thread.  It is read by the ELF image activator as the virtual
size of the PT_GNU_STACK program header entry, and can be specified by
the linker option -z stack-size in newer binutils.

The soft RLIMIT_STACK is auto-increased if possible, to satisfy the
binary' request.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-15 08:13:53 +00:00
George V. Neville-Neil
63bf240482 When a kernel has DEVICE_POLLING turned on but no drivers have
the capability do not try to take the mutex at all.

Replaces misbegotten attempt from reverted commit 281276

Pointed out by: glebius
Sponsored by: Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D2262
2015-04-14 14:22:34 +00:00
Randall Stewart
b132edb56f Fix my stupid restoral of old code.. must be c_iflags now.
Thanks jhb for catching my stupidity...
MFC after:	3 days
2015-04-14 00:02:39 +00:00
Randall Stewart
07a2df5d83 Restore the two lines accidentally deleted that allow CALLOUT_DIRECT to be
specifed in the flags.

Thanks Mark Johnston for noticing this ;-o

MFC after:	3 days
2015-04-13 23:06:13 +00:00
Will Andrews
6311d7aaf0 uiomove_object_page(): Avoid instantiating pages in sparse regions on reads.
Check whether the page being requested is either resident or on swap.  If
not, read from the zero_region instead of instantiating an unnecessary page.

This avoids consuming memory for sparse files on tmpfs, when they are read
by applications that do not use SEEK_HOLE/SEEK_DATA (which is most of them).

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Spectra Logic
2015-04-11 18:51:41 +00:00
Mateusz Guzik
2574218578 Replace struct filedesc argument in getsock_cap with struct thread
This is is a step towards removal of spurious arguments.
2015-04-11 16:00:33 +00:00
Mateusz Guzik
90f54cbfeb fd: remove filedesc argument from fdclose
Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.
2015-04-11 15:40:28 +00:00
Alexander Motin
cdd09fea28 Add vmem locking to r281026.
While races there are not fatal, they cause result underestimation, that
cause unneeded ARC reclaims.

MFC after:	1 month
2015-04-05 14:17:26 +00:00
Konstantin Belousov
4cfc037c30 Restore proper error from oshmctl(2), used by COMPAT_43, when the
segment cannot be found.  Broken by r280323.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-04 23:56:38 +00:00
Jilles Tjoelker
78d75aba77 utimensat: Correct Capsicum required capability rights. 2015-04-04 21:47:54 +00:00
Konstantin Belousov
0122d251bf Remove useless initialization.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-04 08:44:20 +00:00
Alexander Motin
2e9ccb32a1 Make ZFS ARC track both KVA usage and fragmentation.
Even on Illumos, with its much larger KVA, ZFS ARC steps back if KVA usage
reaches certain threshold (3/4 on i386 or 16/17 otherwise).  FreeBSD has
even less KVA, but had no such limit on archs with direct map as amd64.
As result, on machines with a lot of RAM, during load with very small user-
space memory pressure, such as `zfs send`, it was possible to reach state,
when there is enough both physical RAM and KVA (I've seen up to 25-30%),
but no continuous KVA range to allocate even single 128KB I/O request.

Address this situation from two sides:
 - restore KVA usage limitations in a way the most close to Illumos;
 - introduce new requirement for KVA fragmentation, specifying that we
should have at least one sequential KVA range of zfs_max_recordsize bytes.

Experiments show that first limitation done alone is not sufficient.  On
machine with 64GB of RAM it is sometimes needed to drop up to half of ARC
size to get at leats one 1MB KVA chunk.  Statically limiting ARC to half
of KVA/RAM is too strict, so second limitation makes it to work in cycles:
accumulate trash up to certain critical mass, do massive spring-cleaning,
and then start littering again. :)

MFC after:	1 month
2015-04-03 14:45:48 +00:00
Konstantin Belousov
2832cd544f Speed up symbol lookup for the amd64 kernel modules.
Amd64 uses relocatable object files as the modules format.  It is good
WRT not having unneeded overhead for PIC code, in particular, due to
absence of useless GOT and PLT.  But the cost is that the module
linking process cannot use hash to speed up the symbol lookup, and
that each reference to the symbol requiring a relocation, instead of
single-place relocation in GOT.

Cache the successfull symbol lookup results in the module symbol
table, using the newly allocated SHN_FBSD_CACHED value from
SHN_LOOS-HIOS range as an indicator.  The SHN_FBSD_CACHED together
with the non-existent definition of the found symbol are reverted
after successfull relocations, which is done under kld_sx lock, so it
should not be visible to other consumers of the symbol table.

Submitted by:	Conrad Meyer
Differential Revision:  https://reviews.freebsd.org/D1718
MFC after:	3 weeks
2015-04-02 20:14:51 +00:00
Ryan Stone
f2c2231e0c Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
Randall Stewart
403df7a672 Adopt jhb's suggested changes, updated comments and callout_migration() moving
to kern/kern_timeout.c

This does *not* address his -1 -> NOCPU comment.

Sponsored by:	Netflix Inc.
2015-03-31 00:18:00 +00:00
Gleb Smirnoff
f6d6b5e262 Catch up on r271387 and remove unused parameter from
VOP_GETPAGES_ASYNC().
2015-03-30 22:49:26 +00:00
Alexander Motin
43329ffcc8 Periodically wake up threads waiting for vmem(9) resources, so they could
ask for resource reclamation again.

This is kind of dirty hack, but as last resort this is better then stuck
indefinitely because of KVA fragmentation, waiting until some random event
free something sufficient.  OpenSolaris also has this hack in its vmem(9).

MFC after:	2 weeks
2015-03-30 13:30:53 +00:00
Alexander Motin
b308aaed27 Add four new DDB commands to display vmem(9) statistics.
In particular, such DDB commands were added:
        show vmem <addr>
        show all vmem
        show vmemdump <addr>
        show all vmemdump

As possible usage, that allows to see KVA usage and fragmentation.
2015-03-29 10:02:29 +00:00
Konstantin Belousov
eeb697c8e9 Make debug.vmem_check a tunable. It is useful to set it early.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-03-28 23:30:51 +00:00
Eric van Gyzen
e858b027db Clean up some cosmetic nits in kern_umtx.c, found during recent work
in this area and by the Clang static analyzer.

Remove some dead assignments.

Fix a typo in a panic string.

Use umtx_pi_disown() instead of duplicate code.

Use an existing variable instead of curthread.

Approved by:	kib (mentor)
MFC after:	3 days
Sponsored by:	Dell Inc
2015-03-28 21:21:40 +00:00
Bjoern A. Zeeb
a04d412295 Try to unbreak !SMP kernels broken in r280785 by using the proper macros
to access cc_cpu.
2015-03-28 15:07:19 +00:00
Randall Stewart
15b1eb142c Change the callout to supply -1 to indicate we are not changing
CPU, also add protection against invalid CPU's as well as
split c_flags and c_iflags so that if a user plays with the active
flag (the one expected to be played with by callers in MPSAFE) without
a lock, it won't adversely affect the callout system by causing a corrupt
list. This also means that all callers need to use the macros and *not*
play with the falgs directly (like netgraph used to).

Differential Revision: htts://reviews.freebsd.org/D1894
Reviewed by: .. timed out but looked at by jhb, imp, adrian hselasky
             tested by hiren and netflix.
Sponsored by:	Netflix Inc.
2015-03-28 12:50:24 +00:00
Hans Petter Selasky
38668c6044 Implement a simple OID number garbage collector. Given the increasing
number of dynamically created and destroyed SYSCTLs during runtime it
is very likely that the current new OID number limit of 0x7fffffff can
be reached. Especially if dynamic OID creation and destruction results
from automatic tests. Additional changes:

- Optimize the typical use case by decrementing the next automatic OID
sequence number instead of incrementing it. This saves searching time
when inserting new OIDs into a fresh parent OID node.

- Add simple check for duplicate non-automatic OID numbers.

MFC after:  1 week
2015-03-25 08:55:34 +00:00
Hans Petter Selasky
502702c644 Make sure tunable sysctls are only fetched once. The existing code can
re-register sysctls when destroying sysctl contexts or when moving
sysctls from one tree to another.
2015-03-24 17:42:53 +00:00
Gleb Smirnoff
a2d4a7e456 Do not include if_var.h and in6_var.h into kern_jail.c. It is now possible
after r280444.

Sponsored by:	Nginx, Inc.
2015-03-24 16:46:40 +00:00
Hans Petter Selasky
ab91c9a743 Correct string pointer offset for error printout. 2015-03-24 16:37:19 +00:00
Rui Paulo
0da9e11b7e Disable coredump_devctl because it could lead to leaking paths to
jails.
2015-03-24 02:17:17 +00:00
Mateusz Guzik
ea926658ff filedesc: microoptimize fget_unlocked by getting rid of fd < 0 branch
Casting fd to an unsigned type simplifies fd range coparison to mere checking
if the result is bigger than the table.
2015-03-24 00:10:11 +00:00
Ian Lepore
296f235de0 The sysctls that return process argv and envv return binary data, so clear
the SBUF_INCLUDENUL flag.

Pointed out by:	    tijl@
2015-03-22 21:18:44 +00:00
Hans Petter Selasky
2793ea13aa Fix for out of order device destruction notifications when using the
delist_dev() function. In addition to this change:
- add a proper description of this function
- add a proper witness assert inside this function
- switch a nearby line to use the "cdp" pointer instead of cdev2priv()

MFC after:	3 days
2015-03-22 13:11:56 +00:00
Mateusz Guzik
f97af9706b proc: use MTX_NEW flag in proc_init
This allows us to get rid of bzero which was added specifically to make
mtx_init on p_mtx reliable.

This also fixes a potential problem where mtx_init on other mutexes
could trip over on unitialized memory and fire an assertion.

Reviewed by:	kib
2015-03-21 20:25:34 +00:00
Mateusz Guzik
ffb34484ee cred: add proc_set_cred_init helper
proc_set_cred_init can be used to set first credentials of a new
process.

Update proc_set_cred assertions so that it only expects already used
processes.

This fixes panics where p_ucred of a new process happens to be non-NULL.

Reviewed by:	kib
2015-03-21 20:24:54 +00:00
Mateusz Guzik
12cec311e6 fork: assign refed credentials earlier
Prior to this change the kernel would take p1's credentials and assign
them tempororarily to p2. But p1 could change credentials at that time
and in effect give us a use-after-free.

No objections from: kib
2015-03-21 20:24:03 +00:00
Alan Cox
3d653db063 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00
Olivier Houchard
d8d2f47629 error is only used if MAC is defined, so make its declaration conditional
as well.
2015-03-21 16:16:17 +00:00
Konstantin Belousov
0555fb3523 Somewhat modernize the SysV shm code:
- Use real locking, replace Giant with global sx protecting the
  subsystem.  Since the subsystem' lock is no longer dropped during
  the sleepsk, remove not needed SHMSEG_WANTED segment flag, and
  revert r278963.
- To do proper code simplification possible after the change of the
  lock, restructure several functions into _locked body and
  originally-named wrapper which calls into _locked variant.  This
  allows to eliminate the 'goto done2' spread over the code.
- Merge shm_find_segment_by_shmid() and shm_find_segment_by_shmidx().
- Consistently change all function prototypes to ANSI C.

Reviewed by:	mjg (who has earlier version of the similar patch to
	 introduce real locking)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-21 15:01:19 +00:00
Mateusz Guzik
5bc0ff888a coredump: protect corefilename access with a lock
Previously format string traversal could happen while the string itself was
being modified.

Use allproc_lock as coredumping is a rare operation and as such we don't
have to create a dedicated lock.

Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn>
Reviewed by:	kib
X-Additional:	JuniorJobs project
2015-03-21 04:39:33 +00:00
Ian Lepore
612d9391a4 The minimum sbuf buffer size is 2 bytes (a byte plus a nulterm), assert that.
Values smaller than two lead to strange asserts that have nothing to do
with the actual problem (in the case of size=0), or to writing beyond the
end of the allocated buffer in sbuf_finish() (in the case of size=1).
2015-03-17 21:00:31 +00:00
Ian Lepore
2834924513 In sbuf_new_for_sysctl(), default the buffer size to 64 bytes if the
passed-in pointer is NULL and the length is zero.
2015-03-17 20:56:24 +00:00
Gleb Smirnoff
8c4df6296b Reduce header pollution. 2015-03-17 14:16:50 +00:00
Benno Rice
43348dc2ad Reset bp->bio_done to unmapped_buf when removing a transient map in biodone.
Submitted by:	Scott Ferris <scott.ferris@isilon.com>
Sponsored by:	EMC / Isilon Storage Division
Reviewed by:	kib
2015-03-16 20:00:09 +00:00
Ian Lepore
f62fbd30cb Trivial change / forced-commit to document prior change that slipped in
without a commit message...

Use sbuf_new() + SYSCTL_OUT() instead of wiring the userland buffer and
using sbuf_new_for_sysctl().  The preallocated 256 byte buffer is always
going to be big enough to hold these results, and this should be more
efficient than wiring the old buffer.
2015-03-16 19:29:19 +00:00
Ian Lepore
ff352d8978 2015-03-16 19:25:03 +00:00