Commit Graph

12948 Commits

Author SHA1 Message Date
kib
76522f4cab Fix several reads beyond the mapped first page of the binary in the
ELF parser. Specifically, do not allow note reader and interpreter
path comparison in the brandelf code to read past end of the page.
This may happen if a specially crafted ELF image is activated.

Submitted by:	Lukasz Wojcik <lukasz.wojcik zoho com>
MFC after:	3 days
2012-07-19 11:15:53 +00:00
kib
fb0ee769bd Implement F_DUPFD_CLOEXEC command for fcntl(2), specified by SUSv4.
PR:	  standards/169962
Submitted by:	Jukka A. Ukkonen <jau iki fi>
MFC after:	1 week
2012-07-19 10:22:54 +00:00
gnn
fb2c54cc14 Add support for walltimestamp in DTrace.
Submitted by:	Fabian Keil
MFC after:	2 weeks
2012-07-16 20:17:19 +00:00
pgj
f939e5a15a - Add support for displaying process stack memory regions.
Approved by:	rwatson
MFC after:	3 days
2012-07-16 09:38:19 +00:00
mdf
a42ef9b109 Fix a bug with memguard(9) on 32-bit architectures without a
VM_KMEM_MAX_SIZE.

The code was not taking into account the size of the kernel_map, which
the kmem_map is allocated from, so it could produce a sub-map size too
large to fit.  The simplest solution is to ignore VM_KMEM_MAX entirely
and base the memguard map's size off the kernel_map's size, since this
is always relevant and always smaller.

Found by:	Justin Hibbits
2012-07-15 20:29:48 +00:00
jhb
e4bb5cb3f0 Make the interval timings for EVFILT_TIMER more accurate. tvtohz() always
adds an extra tick to account for the current partial clock tick.  However,
that is not appropriate for a repeating timer when the exact tvtohz() value
should be used for subsequent intervals.  Fix repeating callouts for
EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the
fix used in realitexpire() for interval timers.

While here, update a few comments to note that if the EVFILT_TIMER code
were to move out of kern_event.c, it should move to kern_time.c (where the
interval timer code it mimics lives) rather than kern_timeout.c.

MFC after:	1 month
2012-07-13 13:24:33 +00:00
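
The tick arithmetic described above can be sketched roughly as follows. This is a simplified model, not the kernel code: model_tvtohz() is a hypothetical stand-in for tvtohz() that ignores overflow clamping, and the two wrapper functions are invented for illustration.

```c
#include <stdint.h>

/*
 * Simplified model of tvtohz(): round the interval up to whole ticks,
 * then add one tick to cover the partially elapsed current tick.
 * 'hz' is the clock tick rate (assumed to divide 1000000 evenly).
 */
int64_t
model_tvtohz(int64_t usec, int hz)
{
	int64_t tick_usec = 1000000 / hz;
	int64_t ticks = (usec + tick_usec - 1) / tick_usec;

	return (ticks + 1);
}

/* First expiry: armed mid-tick, so the extra tick is kept. */
int64_t
first_timeout_ticks(int64_t usec, int hz)
{
	return (model_tvtohz(usec, hz));
}

/*
 * Subsequent expiries: re-armed from the previous expiry, which falls
 * on a tick boundary, so the extra tick is subtracted again.
 */
int64_t
repeat_timeout_ticks(int64_t usec, int hz)
{
	return (model_tvtohz(usec, hz) - 1);
}
```

With hz = 1000 and a 10 ms period, the model arms the first expiry 11 ticks out and every later one 10 ticks out, so the period no longer drifts by one tick per interval.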
kib
9845f2214c Fix build for kernels with dtrace hooks.
MFC after:	1 month
2012-07-11 18:50:50 +00:00
gnn
786ac82553 Initial commit of an I/O provider for DTrace on FreeBSD.
These probes are most useful when looking into the structures
they provide, which are listed in io.d.  For example:

dtrace -n 'io:genunix::start { printf("%d\n", args[0]->bio_bcount); }'

Note that the I/O systems in FreeBSD and Solaris/Illumos are sufficiently
different that there is not a 1:1 mapping from scripts that work
with one to the other.
MFC after:	1 month
2012-07-11 16:27:02 +00:00
davidxu
affa007492 Always clear p_xthread if the current thread no longer needs it. In theory,
if the debugger exits without calling ptrace(PT_DETACH), there is a time window
in which p_xthread may point to a non-existent thread. In practice,
this is not a problem because the child process will soon be killed by the
parent process.
2012-07-10 05:45:13 +00:00
davidxu
be2413da6d If you press CTRL+Z to suspend a process and then attach to it with
gdb, the process is surprisingly resumed without any gdb commands being
entered. However, the ptrace manual says:
  The tracing process will see the newly-traced process stop and may
  then control it as if it had been traced all along.
The current code does not work this way: unless the traced process
receives a signal later, it continues to run as a background task.
To fix this problem, send SIGSTOP to the traced process after resuming
it, so that it behaves as if we had attached to a running process. This
is not perfect, but it is better than nothing.
2012-07-09 09:24:46 +00:00
mjg
24d9f5c7d6 Follow-up commit to r238220:
Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for
!FWRITE anyway and may as well know about FREAD.

Make _fget code a bit more readable by converting permission checking from if()
to switch(). Assert that correct permission flags are passed.

In collaboration with:	kib
Approved by:	trasz (mentor)
MFC after:	6 days
X-MFC: with r238220
2012-07-09 05:39:31 +00:00
mjg
1e0d3fae01 Unbreak handling of descriptors opened with O_EXEC by fexecve(2).
While here return EBADF for descriptors opened for writing (previously it was ETXTBSY).

Add fgetvp_exec function which performs appropriate checks.

PR:		kern/169651
In collaboration with:	kib
Approved by:	trasz (mentor)
MFC after:	1 week
2012-07-08 00:51:38 +00:00
trociny
3ef0ae6cd1 Fix KASSERT message.
MFC after:	3 days
2012-07-03 19:08:02 +00:00
kib
53224f018a Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is a 64bit quantity, is
always read and written under the mtxpool protection. This fixes an
apparently easy to trigger race where parallel lseek()s, or lseek() and
read/write, could corrupt the file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by:	pho
No objections from:	jhb
MFC after:    3 weeks
2012-07-02 21:01:03 +00:00
jhb
ab100847da Honor db_pager_quit in 'show uma' and 'show malloc'.
MFC after:	1 month
2012-07-02 16:14:52 +00:00
imp
492254ade0 Remove an old hack I noticed years ago, but never committed.
2012-06-28 07:33:43 +00:00
alc
c5e6daff9d Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
kevlo
8473fac955 Correct sizeof usage
Obtained from:	DragonFly
2012-06-25 05:41:16 +00:00
kib
c763bb1500 Move the code dealing with shared page into a dedicated
kern_sharedpage.c source file from kern_exec.c.

MFC after:	  29 days
2012-06-23 10:15:23 +00:00
kib
497817697c Stop updating the struct vdso_timehands from the event handler executed in
the scheduled task from tc_windup(). Do it directly from tc_windup() in
interrupt context [1].

Establish the permanent mapping of the shared page into the kernel
address space, avoiding the potential need to sleep waiting for
allocation of sf buffer during vdso_timehands update. As a
consequence, shared_page_write_start() and shared_page_write_end()
functions are not needed anymore.

Guess and memorize the pointers to native host and compat32 sysentvec
during initialization, to avoid the need to get shared_page_alloc_sx
lock during the update.

In tc_fill_vdso_timehands(), do not loop waiting for timehands
generation to stabilize, since vdso_timehands is written in the same
interrupt context which wrote timehands.

Requested by:	  mav [1]
MFC after:	  29 days
2012-06-23 09:33:06 +00:00
kib
7b36a08108 Implement mechanism to export some kernel timekeeping data to
usermode, using shared page.  The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs necessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
kib
4109c3e1ac Enhance the shared page chunk allocator.
Do not rely on the busy state of the page from which we allocate the
chunk, to protect allocator state. Use statically allocated sx lock
instead.

Provide more flexible KPI. In particular, allow to allocate chunk
without providing initial data, and allow writes into existing
allocation. Allow to get an sf buf which temporarily maps the chunk, to
allow sequential updates to shared page content without unmapping in
between.

Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 06:39:28 +00:00
kib
df9f3d2faa Fix locking for f_offset, vn_read() and vn_write() cases only, for now.
It seems that intended locking protocol for struct file f_offset field
was as follows: f_offset should always be changed under the vnode lock
(except fcntl(2) and lseek(2) did not follow the rules). Since
read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally
taken to serialize shared vnode lock owners.

This was broken first by enabling shared lock on writes, then by
fadvise changes, which moved the f_offset assignment out from under the vnode
lock, and last by vn_io_fault() doing chunked i/o. Moreover, due to uio_offset
not yet valid in vn_io_fault(), the range lock for reads was taken on
the wrong region.

Change the locking for f_offset to always use FOFFSET_LOCKED block,
which is placed before rangelocks in the lock order.

Extract foffset_lock() and foffset_unlock() functions which implement
FOFFSET_LOCKED lock, and consistently lock f_offset with it in the
vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is
not set for the vnode mount. Indicate that f_offset is already valid
for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET
flag, and assert that all callers of vn_read() and vn_write() follow
this protocol.

Extract get_advice() function to calculate the POSIX_FADV_XXX value
for the i/o region, and use it where appropriate.

Reviewed by:	jhb
Tested by:	pho
MFC after:	2 weeks
2012-06-21 09:19:41 +00:00
pjd
8f9f9f3c91 Check proper flag (PDF_DAEMON, not PD_DAEMON) when deciding if the process
should be killed or not.

This fixes killing pdfork(2)ed process on last close of the corresponding
process descriptor.

Reviewed by:	rwatson
MFC after:	1 month
2012-06-19 22:23:59 +00:00
pjd
81ad62d5c5 The falloc() function obtains two references to newly created 'fp'.
On success we have to drop one after procdesc_finit() and on failure
we have to close allocated slot with fdclose(), which also drops one
reference for us and drop the remaining reference with fdrop().

Without this change closing process descriptor didn't result in killing
pdfork(2)ed child.

Reviewed by:	rwatson
MFC after:	1 month
2012-06-19 22:21:59 +00:00
jhb
571562fffb Further refine the implementation of POSIX_FADV_NOREUSE.
First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads.  A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read.  If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read.  The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read.  This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.

Second, apply the same changes made to read(2) by r230782 and this
change to writes.  This provides much better performance in the
sequential write case as it allows writes to still be clustered.  It
also provides much better performance for misaligned writes.  It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.

MFC after:	2 weeks
2012-06-19 18:42:24 +00:00
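
The range tracking described above might look roughly like this. It is a sketch under stated assumptions, not the committed code: struct noreuse_state and noreuse_access() are hypothetical names, ranges are half-open byte ranges, and an empty range (start == end) stands for "no previous request".

```c
/* Last implicit DONTNEED range, [start, end); start == end means none. */
struct noreuse_state {
	long long start;
	long long end;
};

/*
 * Record an access of 'len' bytes at 'off' and compute the range that
 * the implicit DONTNEED request should cover. If the access is adjacent
 * to the previously recorded range, the request is widened to span both.
 */
void
noreuse_access(struct noreuse_state *st, long long off, long long len,
    long long *dn_start, long long *dn_end)
{
	long long start = off;
	long long end = off + len;

	if (st->start != st->end) {
		if (st->end == start)
			start = st->start;	/* continues the previous range */
		else if (st->start == end)
			end = st->end;		/* directly precedes it */
	}
	st->start = start;
	st->end = end;
	*dn_start = start;
	*dn_end = end;
}
```

For a sequential reader, each call therefore reissues DONTNEED over everything read so far, which is what lets a buffer straddled by two misaligned reads be released once its second half has been consumed.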
pho
85a3f61dca In tty_makedev() the following construction:
dev = make_dev_cred();
dev->si_drv1 = tp;

leaves a small window where the newly created device may be opened
and si_drv1 is NULL.

As this is a very rare situation, using a lock to close the window
seems overkill. Instead just wait for the assignment of si_drv1.

Suggested by:	kib
MFC after:	1 week
2012-06-18 07:34:38 +00:00
pjd
0118c86062 Don't check for race with close on advisory unlock (there is nothing smart we
can do when such a race occurs). This saves lock/unlock cycle for the filedesc
lock for every advisory unlock operation.

MFC after:	1 month
2012-06-17 21:04:22 +00:00
pjd
32ff81e94f Extend the comment about checking for a race with close to explain why
it is done and why we don't return an error in such case.

Discussed with:	kib
MFC after:	1 month
2012-06-17 16:59:37 +00:00
pjd
9a81d01ee0 If VOP_ADVLOCK() call or earlier checks failed don't check for a race with
close, because even if we had a race there is nothing to unlock.

Discussed with:	kib
MFC after:	1 month
2012-06-17 16:32:32 +00:00
davide
163c370e14 The variable 'error' in sys_poll() is initialized in declaration to value
zero but in any case is overwritten by successive copyin(), making the
previous initialization useless. Remove this.
As an added bonus this fixes a style(9) bug.

Discussed with:		kib
Approved by:		gnn (mentor)
MFC after:		3 days
2012-06-17 13:03:50 +00:00
pjd
9719a38d39 Revert r237073. 'td' can be NULL here.
MFC after:	1 month
2012-06-16 12:56:36 +00:00
pjd
144a7f643e One more attempt to make prototypes formatted according to style(9), which
hopefully recovers from the "worse than useless" state.

Reported by:	bde
MFC after:	1 month
2012-06-15 10:00:29 +00:00
pjd
2ede2f9ae2 Update comment.
MFC after:	1 month
2012-06-14 17:32:58 +00:00
pjd
c2fe03ba67 Remove fdtofp() function and use fget_locked(), which works exactly the same.
MFC after:	1 month
2012-06-14 16:25:10 +00:00
pjd
0984458a79 Assert that the filedesc lock is being held when the fdunwrap() function
is called.

MFC after:	1 month
2012-06-14 16:23:16 +00:00
pjd
f84f6132c8 Simplify the code by making more use of the fdtofp() function.
MFC after:	1 month
2012-06-14 15:37:15 +00:00
pjd
4a9c37500e - Assert that the filedesc lock is being held when fdisused() is called.
- Fix white spaces.

MFC after:	1 month
2012-06-14 15:35:14 +00:00
pjd
7b02ff9171 Style fixes and assertions improvements.
MFC after:	1 month
2012-06-14 15:34:10 +00:00
pjd
32b7d4b149 Assert that the filedesc lock is not held when closef() is called.
MFC after:	1 month
2012-06-14 15:26:23 +00:00
pjd
e1c12932a7 Style fixes.
Reported by:	bde
MFC after:	1 month
2012-06-14 15:21:57 +00:00
pjd
2014b8defb Remove code duplication from fdclosexec(), which was the reason of the bug
fixed in r237065.

MFC after:	1 month
2012-06-14 12:43:37 +00:00
pjd
6634e42976 When we are closing capabilities during exec, we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Similar bug was fixed in r236853.

MFC after:	1 month
2012-06-14 12:41:21 +00:00
pjd
841890f62a Style.
MFC after:	1 month
2012-06-14 12:37:41 +00:00
pjd
0ca632f7e9 When checking if file descriptor number is valid, explicitly check for 'fd'
being less than 0 instead of using cast-to-unsigned hack.

Today's commit was brought to you by the letters 'B', 'D' and 'E' :)
2012-06-13 22:12:10 +00:00
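
The two styles of range check mentioned above can be compared side by side. The function names are hypothetical, and 'nfiles' is assumed to be a non-negative table size; the sketch only illustrates why the two forms are equivalent.

```c
#include <stdbool.h>

/*
 * Cast-to-unsigned hack: a negative 'fd' converts to a huge unsigned
 * value, so a single comparison rejects both out-of-range cases.
 */
bool
fd_valid_cast(int fd, int nfiles)
{
	return ((unsigned int)fd < (unsigned int)nfiles);
}

/* Explicit form: states both bounds directly. */
bool
fd_valid_explicit(int fd, int nfiles)
{
	return (fd >= 0 && fd < nfiles);
}
```

Both functions return the same result for every 'fd' as long as 'nfiles' is non-negative; the explicit form simply makes the intent visible to the reader.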
pjd
0123f7ed5a Now that dupfdopen() doesn't depend on finstall() being called earlier,
indx will never be -1 on error, as none of dupfdopen(), finstall() and
kern_capwrap() modifies it on error, but what is more important none of
those functions install and leave file at indx descriptor on error.

Leave an assert to prove my words.

MFC after:	1 month
2012-06-13 21:38:07 +00:00
pjd
f695b590b4 Allocate descriptor number in dupfdopen() itself instead of depending on
the caller using finstall().
This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes
a race between finstall() and dupfdopen().

MFC after:	1 month
2012-06-13 21:32:35 +00:00
pjd
f7e18321ef - Remove nfp variable that is not really needed.
- Update comment.
- Style nits.

MFC after:	1 month
2012-06-13 21:22:35 +00:00
pjd
219cd5caaa Remove duplicated code.
MFC after:	1 month
2012-06-13 21:15:01 +00:00
pjd
5d3532ce69 Add missing {.
MFC after:	1 month
2012-06-13 21:13:18 +00:00
pjd
c745de62f2 Style.
MFC after:	1 month
2012-06-13 21:11:58 +00:00
pjd
54a86dc320 There is no need to set td->td_retval[0] to -1 on error.
Confirmed by:	jhb
MFC after:	1 month
2012-06-13 21:10:00 +00:00
pjd
b836448bf3 There is only one caller of the dupfdopen() function, so we can simplify
it a bit:
- We can assert that only ENODEV and ENXIO errors are passed instead of
  handling other errors.
- The caller always call finstall() for indx descriptor, so we can assume
  it is set. Actually the filedesc lock is dropped between finstall() and
  dupfdopen(), so there is a window there for another thread to close the
  indx descriptor, but it will be closed in next commit.

Reviewed by:	mjg
MFC after:	1 month
2012-06-13 19:00:29 +00:00
mjg
29bd2f6d46 Remove 'low' argument from fd_last_used().
This function is static and the only caller always passes 0 as low.

While here update note about return values in comment.

Reviewed by:	pjd
Approved by:	trasz (mentor)
MFC after:	1 month
2012-06-13 17:18:16 +00:00
mjg
1ca4c8cbf9 Re-apply reverted parts of r236935 by pjd with some changes.
If fdalloc() decides to grow fdtable it does it once and at most doubles
the size. This still may be not enough for sufficiently large fd. Use fd
in calculations of new size in order to fix this.

When growing the table, fd is already equal to first free descriptor >= minfd,
also fdgrowtable() no longer drops the filedesc lock. As a result of this there
is no need to retry allocation nor lookup.

Fix description of fd_first_free to note all return values.

In co-operation with:	pjd
Approved by:	trasz (mentor)
MFC after:	1 month
2012-06-13 17:12:53 +00:00
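
To illustrate why basing the new size on fd matters, the following sketch contrasts the two sizing policies. The formulas and function names are assumptions made for illustration; the commit message does not spell out the exact calculation.

```c
/* Smaller of two ints. */
static int
imin_sketch(int a, int b)
{
	return (a < b ? a : b);
}

/*
 * Old policy (as described above): grow once, at most doubling the
 * current table size 'nfiles', capped at 'maxfd'.
 */
int
newsize_by_doubling(int nfiles, int maxfd)
{
	return (imin_sketch(nfiles * 2, maxfd));
}

/*
 * Sketch of the fix: derive the new size from the requested descriptor
 * 'fd' (assumed to be >= the current table size, hence > 0), so that a
 * single growth step is always enough when 'fd' is below 'maxfd'.
 */
int
newsize_for_fd(int fd, int maxfd)
{
	return (imin_sketch(fd * 2, maxfd));
}
```

With a 20-entry table and a request for descriptor 100, doubling yields 40 slots, which still cannot hold descriptor 100, while sizing from fd yields 200.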
pjd
bcf3f4263d Revert part of the r236935 for now, until I figure out why it doesn't
work properly.

Reported by:	davidxu
2012-06-12 10:25:11 +00:00
pjd
ea4cd345da fdgrowtable() no longer drops the filedesc lock so it is enough to
retry finding free file descriptor only once after fdgrowtable().

Spotted by:	pluknet
MFC after:	1 month
2012-06-11 22:05:26 +00:00
pjd
b7902b949c Use consistent way of checking if descriptor number is valid.
MFC after:	1 month
2012-06-11 20:17:20 +00:00
pjd
00ef5a8d82 Be consistent with white spaces.
MFC after:	1 month
2012-06-11 20:01:50 +00:00
pjd
d698b8f852 Remove code duplicated in kern_close() and do_dup() and use closefp() function
introduced a minute ago.

This code duplication was responsible for the bug fixed in r236853.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 20:00:44 +00:00
pjd
c8465e01a1 Introduce closefp() function that we will be able to use to eliminate
code duplication in kern_close() and do_dup().

This is committed separately from the actual removal of the duplicated
code, as the combined diff was very hard to read.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:57:31 +00:00
pjd
cab8c2dc3a Merge two ifs into one to make the code almost identical to the code in
kern_close().

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:53:41 +00:00
pjd
b903b5753d Move the code around a bit to move two parts of code duplicated from
kern_close() close together.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:51:27 +00:00
pjd
2042e99ed8 Now that fdgrowtable() doesn't drop the filedesc lock we don't need to
check if descriptor changed from under us. Replace the check with an
assert.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:48:55 +00:00
iwasaki
e6018a6dfd Another fix for r236772.
- Adjust correct cpuset (stopped_cpus/suspended_cpus) for
  cpu_spinwait() in generic_stop_cpus().
2012-06-11 18:47:26 +00:00
pjd
859bb04daa Style fixes and simplifications.
MFC after:	1 month
2012-06-11 16:08:03 +00:00
pjd
f4a0f109b8 Remove redundant include.
MFC after:	1 month
2012-06-10 20:24:01 +00:00
pjd
5081787525 Style: move opt_*.h includes in the proper place.
MFC after:	1 month
2012-06-10 20:22:10 +00:00
pjd
23c7c80ef5 When we are closing capability during dup2(), we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Discussed with:	rwatson
Sponsored by:	FreeBSD Foundation
MFC after:	1 month
2012-06-10 14:57:18 +00:00
pjd
0da1a67419 Merge two ifs into one. Other minor style fixes.
MFC after:	1 month
2012-06-10 13:10:21 +00:00
pjd
67f6f356fc Simplify fdtofp().
MFC after:	1 month
2012-06-10 06:31:54 +00:00
mckusick
070b3c0414 When synchronously syncing a device (MNT_WAIT), wait for buffers
to become available. Otherwise we may excessively spin and fail
with ``fsync: giving up on dirty''.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   1 week
2012-06-09 22:26:53 +00:00
pjd
0311d1f4cc There is no need to drop the FILEDESC lock around malloc(M_WAITOK) anymore, as
we now use sx lock for filedesc structure protection.

Reviewed by:	kib
MFC after:	1 month
2012-06-09 18:50:32 +00:00
pjd
468d011a0d Remove now unused variable.
MFC after:	1 month
MFC with:	r236820
2012-06-09 18:48:06 +00:00
pjd
b9def82bd7 Make some of the loops more readable.
Reviewed by:	tegge
MFC after:	1 month
2012-06-09 18:03:23 +00:00
pjd
b1dc458d22 Correct panic message.
MFC after:	1 month
MFC with:	r236731
2012-06-09 12:27:30 +00:00
iwasaki
861bb3822c Add x86/acpica/acpi_wakeup.c for amd64 and i386. Difference of
suspend/resume procedures are minimized among them.

common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
  among amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().

amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().

i386:
- Merge(and remove) suspendctx() into savectx() in order to match with
  amd64 code.

Reviewed by:	attilio@, acpi@
2012-06-09 00:37:26 +00:00
jhb
176ddf31c3 Split the second half of vn_open_cred() (after a vnode has been found via
a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.

Tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
2012-06-08 18:32:09 +00:00
mjg
1d2ca7b8d8 Plug socket refcount leak on error in sys_sctp_peeloff.
Reviewed by:	tuexen
Approved by:	trasz (mentor)
MFC after:	3 days
2012-06-08 08:04:51 +00:00
pjd
b738e3d524 In fdalloc() f_ofileflags for the newly allocated descriptor has to be 0.
Assert that instead of setting it to 0.

Sponsored by:	FreeBSD Foundation
MFC after:	1 month
2012-06-07 23:33:10 +00:00
pjd
576bf7639b Eliminate redundant variable.
Sponsored by:	FreeBSD Foundation
MFC after:	1 week
2012-06-07 23:08:18 +00:00
pjd
858973e30c Plug file reference leak in capability failure case.
Sponsored by:	FreeBSD Foundation
MFC after:	3 days
2012-06-07 22:49:09 +00:00
glebius
2bdbc6913f style(9) for r236563.
2012-06-05 05:16:04 +00:00
glebius
b0d113b96e Microoptimisation of code from r236560, also coming from Nginx Inc.
Submitted by:	ru
2012-06-04 14:18:13 +00:00
glebius
df2b290f0d Optimise kern_sendfile(): skip cycling through the entire mbuf chain in
m_cat(), storing pointer to last mbuf in chain in local variable and
attaching new mbuf to the end of chain.

Submitter reports that CPU load dropped by more than 10% on a web server
serving large files with this optimisation.

Submitted by:	Sergey Budnevitch <sb nginx.com>
2012-06-04 12:49:21 +00:00
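
The optimisation above is the familiar cached-tail-pointer technique. The sketch below uses a hypothetical struct buf instead of the real struct mbuf, and the function names are invented; it only shows the difference between walking the chain on every append and remembering the last element.

```c
#include <stddef.h>

/* Hypothetical stand-in for a buffer in a singly linked chain. */
struct buf {
	struct buf *next;
	int len;
};

/* Walk the whole chain to find its end: O(n) work for every append. */
void
append_walk(struct buf **head, struct buf *nb)
{
	struct buf **pp = head;

	while (*pp != NULL)
		pp = &(*pp)->next;
	nb->next = NULL;
	*pp = nb;
}

/* Keep a pointer to the last buffer in '*tail': O(1) work per append. */
void
append_tail(struct buf **head, struct buf **tail, struct buf *nb)
{
	nb->next = NULL;
	if (*tail == NULL)
		*head = nb;
	else
		(*tail)->next = nb;
	*tail = nb;
}
```

Both functions build the same chain; for a large file sent in many pieces, the second avoids re-traversing an ever longer chain, which is consistent with the CPU saving the submitter reports.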
kib
472101ae88 Add a knob to disable vn_io_fault.
MFC after:	1 month
2012-06-03 16:19:37 +00:00
kib
d977144831 Count and export the number of prefaults that happen.
MFC after:	 1 month
2012-06-03 16:06:56 +00:00
avg
85a02186bc free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG
Those calls are useful with hardware watchdog drivers too.

MFC after:	3 weeks
2012-06-03 08:01:12 +00:00
kib
3c3d727c68 Fix typo [1]. Use commas to separate flag printouts, in style with
other parts of function.

Submitted by: bf [1]
MFC after:   1 week
2012-06-02 19:39:12 +00:00
kib
e32a51888c Update the print mask for decoding b_flags. Add print masks for
b_vflags and b_xflags_t and print them as well.

MFC after:   1 week
2012-06-02 18:44:40 +00:00
jhb
65701cddc9 Extend VERBOSE_SYSINIT to also print out the name of variables passed
to SYSINIT routines if they can be resolved via symbol look up in DDB.
To avoid false positives, only honor a name if the symbol resolves
exactly to the pointer value (no offset).

MFC after:	1 week
2012-06-01 15:42:37 +00:00
pjd
fa432bf52c Regenerate after r236361.
MFC after:	3 days
2012-05-31 19:34:53 +00:00
pjd
b01a263416 Add missing system calls.
MFC after:	3 days
2012-05-31 19:32:37 +00:00
pjd
cc9a75903e There is no rmdirat system call. Weird, I know.
MFC after:	3 days
2012-05-31 19:31:28 +00:00
imp
ce8d6b964c Unlock in the error path to prevent a lock leak.
PR:		162174
Submitted by:	Ian Lepore
MFC after:	2 weeks
2012-05-31 17:27:05 +00:00
kib
080f2e89d9 vn_io_fault() is a facility to prevent page faults while filesystems
perform copyin/copyout of the file data into the usermode
buffer. A typical filesystem holds the vnode lock and some buffer locks over
the VOP_READ() and VOP_WRITE() operations, and since page fault
handler may need to recurse into VFS to get the page content, a
deadlock is possible.

The facility works by disabling page faults handling for the current
thread and attempting to execute i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successful and the request is finished. If EFAULT
is returned from uiomove(), the pages backing i/o buffer are faulted
in and held, and the copyin/out is performed using uiomove_fromphys()
over the held pages for the second attempt of VOP call.

Since pages are held in chunks to prevent large i/o requests from
starving the free pages pool, and since the vnode lock is only taken for
i/o over the current chunk, the vnode lock no longer protects atomicity
of the whole i/o request. Use newly added rangelocks to provide the
required atomicity of i/o regarding other i/o and truncations.

Filesystems need to explicitly opt in to the scheme, by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using
vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or
converting uio into request for uiomove_fromphys().

Reviewed by:	bf (comments), mdf, pjd (previous version)
Tested by:	pho
Tested by:	flo, Gustau Pérez <gperez entel upc edu> (previous version)
MFC after:	2 months
2012-05-30 16:42:08 +00:00
kib
6f4e16f833 Add a rangelock implementation, intended to be used to range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after:     2 months
2012-05-30 16:06:38 +00:00
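
The fair-FIFO behaviour described above can be modelled in a few lines. This is a deliberately simplified sketch, not the rangelock code itself: struct rl_req and rl_can_grant() are hypothetical, ranges are half-open, and the queue is a plain array ordered by arrival time.

```c
#include <stdbool.h>
#include <stddef.h>

/* One lock request over the byte range [start, end). */
struct rl_req {
	long start;
	long end;
	bool write;
};

/* True if the two ranges share at least one byte. */
static bool
rl_overlap(const struct rl_req *a, const struct rl_req *b)
{
	return (a->start < b->end && b->start < a->end);
}

/*
 * 'q' is ordered by arrival. Request q[i] may proceed only if no earlier
 * request overlaps it while at least one of the two is a write. Readers
 * of the same region can therefore share, but nobody overtakes an
 * earlier conflicting request.
 */
bool
rl_can_grant(const struct rl_req *q, size_t i)
{
	size_t j;

	for (j = 0; j < i; j++) {
		if (rl_overlap(&q[j], &q[i]) && (q[j].write || q[i].write))
			return (false);
	}
	return (true);
}
```

In this model two overlapping readers are granted together, a writer queued behind them waits, and a writer on a disjoint region proceeds immediately.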
kib
7638868334 Assert that TDP_NOFAULTING and TDP_NOSLEEPING thread flags do not leak
when thread returns from a syscall to usermode.

Tested by:	pho
MFC after:	1 week
2012-05-30 13:44:42 +00:00
raj
7136f7f893 Let us manage differences of Book-E PowerPC variations i.e. vendor /
implementation specific vs. the common architecture definition.

Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under
BOOKE_PPC4XX are not used in the code yet.

This change set is not supposed to affect existing E500 support, it's just
another reorg step before bringing support for E500mc, E5500 and PPC465.

Obtained from:	AppliedMicro, Freescale, Semihalf
2012-05-27 10:25:20 +00:00
kib
cae6484163 Fix ki_cow for compat32 binaries.
MFC after:	3 days
2012-05-27 05:24:53 +00:00
kib
dcb105721a Stop treating td_sigmask specially for the purposes of new thread
creation. Move it into the copied region of the struct thread.

Update some comments.

Requested by:	bde
X-MFC after:	never
2012-05-26 20:03:47 +00:00
kib
08dbe8fa01 Add a vn_bmap_seekhole(9) vnode helper which can be used by any
filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA
commands for lseek(2).

MFC after:	2 weeks
2012-05-26 05:28:47 +00:00
ed
0d9131d0d0 Regenerate system call tables.
2012-05-25 21:52:57 +00:00
ed
55e4d6365d Remove use of non-ISO-C integer types from system call tables.
These files already use ISO-C-style integer types, so make them less
inconsistent by preferring the standard types.
2012-05-25 21:50:48 +00:00
avg
aa1a7122dc device_add_child: protect against child device with no driver but fixed unit number
This combination doesn't make sense, unit numbers should be hardwired
only in context of a known driver.  The wildcard devices should have
wildcard unit numbers.

Reviewed by:	jhb
MFC after:	2 weeks
2012-05-25 07:32:26 +00:00
mav
8f3c5562d6 MFprojects/zfsd:
Hide warning behind bootverbose. Average user has nothing to do about it.
2012-05-24 11:24:44 +00:00
gleb
3c7243df78 Add kern_fhstat(), adjust sys_fhstat() to use it.
Extend kern_getdirentries() to accept uio segflag and optionally return
buffer residue.

Sponsored by:	Google Summer of Code 2011
2012-05-24 08:00:26 +00:00
kib
187a8c5cd6 Calculate the count of per-process cow faults. Export the count to
userspace using the obscure spare int field in struct kinfo_proc.

Submitted by:	Andrey Zonov <andrey zonov org>
MFC after:	1 week
2012-05-23 18:10:54 +00:00
trasz
b2747e472e Fix use-after-free in kern_jail_set() triggered e.g. by attempts
to clear "persist" flag from empty persistent jail, like this:

jail -c persist=1
jail -n 1 -m persist=0

Submitted by:	Mateusz Guzik <mjguzik at gmail dot com>
MFC after:	2 weeks
2012-05-22 19:43:20 +00:00
trasz
a25d879040 Don't leak locks in prison_racct_modify().
Submitted by:	Mateusz Guzik <mjguzik at gmail dot com>
MFC after:	2 weeks
2012-05-22 17:30:02 +00:00
trasz
3a811deac7 Fix panic with RACCT that could occur in low memory (or out of swap)
situations, due to fork1() calling racct_proc_exit() without calling
racct_proc_fork() first.

Submitted by:	Mateusz Guzik <mjguzik at gmail dot com> (earlier version)
Reviewed by:	Mateusz Guzik <mjguzik at gmail dot com>
2012-05-22 15:58:27 +00:00
harti
c7e30562ca Make dumptid non-static. It is used by libkvm to detect whether
this is a VNET-kernel or not. gcc used to put the static symbol into
the symbol table, clang does not. This fixes the 'netstat: no namelist'
error seen on clang+VNET systems.
2012-05-22 07:23:41 +00:00
melifaro
34ec5c8650 Fix old panic when BPF consumer attaches to destroying interface.
'flags' field is added to the end of bpf_if structure. Currently the only
flag is BPFIF_FLAG_DYING which is set on bpf detach and checked by bpf_attachd()
Problem can be easily triggered on SMP stable/[89] by the following command (sort of):
'while true; do ifconfig vlan222 create vlan 222 vlandev em0 up ; tcpdump -pi vlan222 & ; ifconfig vlan222 destroy ; done'

Fix possible use-after-free when BPF detaches itself from interface, freeing bpf_bif memory,
while interface is still UP and there can be routes via this interface.
Freeing is now delayed till ifnet_departure_event is received via eventhandler(9) api.

Convert bpfd rwlock back to mutex due to lack of performance gain (currently checking if packet
matches filter is done without holding bpfd lock and we have to acquire write lock if packet matches)

Approved by:      kib(mentor)
MFC in:            4 weeks
2012-05-21 22:17:29 +00:00
iwasaki
31eddd58e3 Add SMP/i386 suspend/resume support.
Most of it is merged from amd64.

- i386/acpica/acpi_wakecode.S
Replaced with amd64 code (from realmode to paging enabling code).

- i386/acpica/acpi_wakeup.c
Replaced with amd64 code (except for wakeup_pagetables stuff).

- i386/include/pcb.h
- i386/i386/genassym.c
Added new PCB members (CR0, CR2, CR4, DS, ES, FS, SS, GDT, IDT, LDT
and TR) needed for suspend/resume, not for context switch.

- i386/i386/swtch.s
Added suspendctx() and resumectx().
Note that savectx() was not changed and is not used for suspending (while
the amd64 code uses it).
BSP and AP execute the same sequence, suspendctx(), acpi_wakecode()
and resumectx() for suspend/resume (in case of UP system also).

- i386/i386/apic_vector.s
Added cpususpend().

- i386/i386/mp_machdep.c
- i386/include/smp.h
Added cpususpend_handler().

- i386/include/apicvar.h
- kern/subr_smp.c
- sys/smp.h
Added IPI_SUSPEND and suspend_cpus().

- i386/i386/initcpu.c
- i386/i386/machdep.c
- i386/include/md_var.h
- pc98/pc98/machdep.c
Moved initializecpu() declarations to md_var.h.

MFC after:	3 days
2012-05-18 18:55:58 +00:00
gleb
3288f283ff Skip directory entries with zero inode number during traversal.
Entries with zero inode number are considered placeholders by libc and
UFS.  Fix remaining uses of VOP_READDIR in kernel: vop_stdvptocnp,
unionfs.

Sponsored by:	Google Summer of Code 2011
2012-05-16 10:44:09 +00:00
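
The rule in 3288f283ff can be illustrated outside the kernel. The following is a minimal userland sketch, not the committed code: struct toy_dirent and count_live_entries() are hypothetical names, and the only thing taken from the commit message is that an entry with inode number zero is a placeholder to be skipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified directory entry; not the kernel's struct dirent. */
struct toy_dirent {
	uint32_t	 ino;	/* 0 marks a placeholder entry */
	const char	*name;
};

/* Count the entries a traversal should visit, skipping placeholders. */
size_t
count_live_entries(const struct toy_dirent *ents, size_t n)
{
	size_t i, live = 0;

	for (i = 0; i < n; i++) {
		if (ents[i].ino == 0)
			continue;	/* placeholder: ignore it */
		live++;
	}
	return (live);
}
```

A caller that forgets the check would treat the placeholder slots as real files, which is the class of bug the commit message describes for the remaining VOP_READDIR users.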
pluknet
7aab7d56be Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP.
2012-05-15 10:58:17 +00:00
gber
112a2e964f Do not call bremfree for managed buffers.
Calling bremfree for these buffers results in panic:
"bremfree: buffer %p not on a queue."

Approved by: kib
2012-05-15 09:55:15 +00:00
rstone
a059a0e086 Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives.  Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
several probes are defined that will never fire.  These probes are defined
to fire when Solaris-specific features perform certain actions.  As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added two probes that are not defined in Solaris, lend-pri
and load-change.  These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument.  As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after:	2 weeks
2012-05-15 01:30:25 +00:00
delphij
53e510d1ef Revert previous revision, misunderstood the code :(
2012-05-11 23:43:32 +00:00
delphij
f7e33a4a67 Release proc lock after setting signal queue.
PR:		kern/167727
Submitted by:	Jinjun Gao <gjinjun gmail com>
MFC after:	2 weeks
2012-05-11 23:41:52 +00:00
kib
c5f120d09b Move the code to call the callout callback into the helper function
softclock_call_cc(). While there, move some common code to callout_cc_del().

Requested by:	avg, jhb
Reviewed by:	jhb
MFC after:    1 week
2012-05-03 20:00:30 +00:00
kib
9e5fca0368 When callout_reset_on() cannot immediately migrate a callout since it
is running on other cpu, the CALLOUT_PENDING flag is temporarily
cleared. Then, callout_stop() on this, in fact active, callout fails
because CALLOUT_PENDING is not set, and callout_stop() returns 0.

Now, in sleepq_check_timeout(), the failed callout_stop() causes the
sleepq code to execute mi_switch() without even setting the wmesg,
since the switch-out is supposed to be transient. In fact, the thread
is put off the CPU for full timeout interval, instead of being put on
runq immediately.  Until timeout fires, the process is unkillable for
obvious reasons.

Fix this by marking the migrating callouts with CALLOUT_DFRMIGRATION
flag. The flag is cleared by callout_stop_safe() when the function
detects a migration, besides returning the success. The softclock()
rechecks the flag for migrating callout and cancels its execution if
the flag was cleared meantime.

PR:	 misc/166340
Reported, debugging traces provided and tested by:
	Christian Esken <christian.esken trivago com>
Reviewed by:	 avg, jhb
MFC after:	 1 week
2012-05-03 10:38:02 +00:00
jhb
c96b8c07a4 - Don't log messages saying that accounting is being disabled and enabled
if the accounting log file is atomically replaced with a new file
  (such as during log rotation).
- Simplify accounting log rotation a bit.  There is no need to re-run
  accton(8) after renaming the new log file to its real name.

PR:		kern/167321
Tested by:	Jeremy Chadwick
2012-05-02 14:25:39 +00:00
kib
0e86d1558c Allow for the process information sysctls to accept a thread id in addition
to the process id.  It follows the ptrace(2) interface and allows debugging
libraries to use thread ids directly, without slow and verbose conversion
of thread id into pid.

The PGET_NOTID flag is provided to allow a specific sysctl to disallow
this behaviour.  All current callers of pget(9) have useful semantics to
operate on tid and do not need this flag.

Reviewed by:	jhb, trociny
MFC after:	1 week
2012-04-23 20:56:05 +00:00
trasz
023bd7c6bf Remove unused thread argument to vrecycle().
Reviewed by:	kib
2012-04-23 14:10:34 +00:00
trasz
baac623cd9 Remove unused thread argument from vtruncbuf().
Reviewed by:	kib
2012-04-23 13:21:28 +00:00
jhb
aa85973504 Include the associated wait channel message for context switch ktrace
records.  kdump supports both the old and new messages.

Submitted by:	Andrey Zonov  andrey zonov org
MFC after:	1 week
2012-04-20 15:32:36 +00:00
jh
433fc8eeff The value of flags matching VNOVAL can't be supported. Return EOPNOTSUPP
from setfflags() in this case. This fixes the return value of
chflags(path, -1).

Discussed with:	bde
MFC after:	2 weeks
2012-04-20 10:08:30 +00:00
mckusick
d9895ac1fe This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_sync_lazy and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_sync_lazy routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 07:00:28 +00:00
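
The arithmetic in d9895ac1fe can be checked with a few lines of C. This is only a worked calculation under the figures quoted in the message (250,000 vnodes, fewer than 1000 active, one pass every 30 seconds, three routines on UFS/FFS); scans_per_minute() is a hypothetical helper, not kernel code.

```c
#include <stdint.h>

/*
 * Vnodes locked and inspected per minute when `routines` routines each
 * walk `vnodes_visited` vnodes once every 30 seconds.
 */
uint64_t
scans_per_minute(uint64_t vnodes_visited, unsigned routines)
{
	const unsigned passes_per_minute = 2;	/* one pass every 30 seconds */

	return (vnodes_visited * passes_per_minute * routines);
}
```

With these inputs, scans_per_minute(250000, 3) gives 1,500,000 inspections per minute for the old all-vnodes walk on UFS/FFS, while scans_per_minute(1000, 3) gives an upper bound of 6,000 for the active-list walk.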
mckusick
5b7b29e35b This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 06:50:44 +00:00
mckusick
a9a210460f Delete a no longer useful VNASSERT missed during changes in 234400.
Suggested by: kib
2012-04-18 19:34:20 +00:00
mckusick
be8731298f Fix a memory leak of M_VNODE_MARKER introduced in 234386.
Found by:  Peter Holm
2012-04-18 19:30:22 +00:00
mckusick
841f20af50 Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after:      2 weeks
2012-04-17 21:46:59 +00:00
mckusick
ffee40eeff Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if it is DOOMED or other possible end-of-life scenarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-17 16:28:22 +00:00
trasz
7f09aee7a1 Fix bug where NFSv4 ACL enforcement code wouldn't unconditionally
allow the owner to read and write ACL and file attributes when there
was no entry with subject matching the owner.  In other words,
'getfacl meh' shouldn't fail for the owner if the ACL looks like this:

# file: meh
# owner: trasz
# group: wheel
         user:root:------a-------:------:allow

Reported by:	kientzle
2012-04-17 14:54:00 +00:00
trasz
29ba0a35f6 Stop treating system processes as special. This fixes panics
like the one triggered by this:

# kldload geom_vinum
# pwait `pgrep -S gv_worker` &
# kldunload geom_vinum

or this:

GEOM_JOURNAL: Shutting down geom gjournal 3464572051.
panic: destroying non-empty racct: 1 allocated for resource 6

which were tracked by jh@ to be caused by checking p->p_flag,
while it wasn't initialised yet.  Basically, during fork, the code
checked p_flag, concluded the process isn't marked as P_SYSTEM,
incremented the counter, and later on, when exiting, checked that
the process was marked as P_SYSTEM, and thus didn't decrement it.

Also, I believe there wasn't any good reason for checking P_SYSTEM
in the first place.

Tested by:	jh
2012-04-17 14:31:02 +00:00
trasz
a41bb18a29 Fix panic, triggered like this: "int main() { thr_exit(); }"
Submitted by:	Mateusz Guzik
2012-04-17 13:44:40 +00:00
trasz
c37ffba90a Enforce upper bound on the input buffer length.
Reported by:	Mateusz Guzik
2012-04-17 13:28:14 +00:00
jkim
e210f689a8 - Implement pipe2 syscall for Linuxulator. This syscall appeared in 2.6.27
but GNU libc used it without checking its kernel version, e.g., Fedora 10.
- Move pipe(2) implementation for Linuxulator from MD files to MI file,
sys/compat/linux/linux_file.c.  There is no MD code for this syscall at all.
- Correct an argument type for pipe() from l_ulong * to l_int *.  Probably
this was the source of MI/MD confusion.

Reviewed by:	emulation
2012-04-16 21:22:02 +00:00
davide
63cc567af5 Fix a typo.
Approved by:	gnn (mentor)
MFC after:	2 days
2012-04-14 23:59:58 +00:00
davide
ff8b0a29f3 Fix some style bugs introduced in a previous commit (r233045)
Reported by:	glebius, jmallet
Reviewed by:	jmallet
Approved by:	gnn (mentor)
MFC after:	2 days
2012-04-14 23:53:31 +00:00
marius
6f1427f0e6 Fix !DDB build after r234190.
2012-04-14 11:21:24 +00:00
adrian
2c73480574 Use strdup() on the name (and free it when it's done) so non-static names
can be used in firmware_register().
2012-04-13 04:22:42 +00:00
jhb
20ac4e4f81 - Extend the KDB interface to add a per-debugger callback to print a
backtrace for an arbitrary thread (rather than the calling thread).
  A kdb_backtrace_thread() wrapper function uses the configured debugger
  if possible, otherwise it falls back to using stack(9) if that is
  available.
- Replace a direct call to db_trace_thread() in propagate_priority()
  with a call to kdb_backtrace_thread() instead.

MFC after:	1 week
2012-04-12 17:43:59 +00:00
jhb
51ec6999bb If a linker file contains at least one module, but all of the modules
fail to load (the MOD_LOAD event fails) during a kldload(2), unload the
linker file and fail the kldload(2) with ENOEXEC.

Reported by:	gcooper
MFC after:	1 week
2012-04-12 14:49:25 +00:00
kib
319ab382ef Add thread-private flag to indicate that error value is already placed
in td_errno. Flag is supposed to be used by syscalls returning
EJUSTRETURN because errno was already placed into the usermode frame
by a call to set_syscall_retval(9). Both ktrace and dtrace get errno
value from td_errno if the flag is set.

Use the flag to fix sigsuspend(2) error return ktrace records.

Requested by:	bde
MFC after:	1 week
2012-04-12 10:48:43 +00:00
mckusick
7901256b30 Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after:   2 weeks
2012-04-11 23:01:11 +00:00
jhb
294ae9574d Allow device_busy() and device_unbusy() to be invoked while a device is
being attached.  This is implemented by adding a new DS_ATTACHING state
while a device's DEVICE_ATTACH() method is being invoked.  A driver is
required to not fail an attach of a busy device.  The device's state will
be promoted to DS_BUSY rather than DS_ATTACHED if the device was marked
busy during DEVICE_ATTACH().

Reviewed by:	kib
MFC after:	1 week
2012-04-11 20:57:41 +00:00
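
The state handling described in 294ae9574d can be pictured with a toy model. This is a sketch under assumptions, not the newbus implementation: the toy_* names, the enum values, and the busy counter are all hypothetical and exist only to show the "promote to busy if marked busy during attach" rule.

```c
/* Hypothetical, simplified device states; illustrative only. */
enum toy_state { TOY_ALIVE, TOY_ATTACHING, TOY_ATTACHED, TOY_BUSY };

struct toy_device {
	enum toy_state	state;
	int		busy;	/* busy reference count */
};

/* Mark the device busy; permitted while attaching or once attached. */
void
toy_busy(struct toy_device *dev)
{
	dev->busy++;
	if (dev->state == TOY_ATTACHED)
		dev->state = TOY_BUSY;
}

/* Example attach methods: one leaves the device idle, one marks it busy. */
void
toy_attach_plain(struct toy_device *dev)
{
	(void)dev;
}

void
toy_attach_marks_busy(struct toy_device *dev)
{
	toy_busy(dev);
}

/* Run an attach method and settle on the final state afterwards. */
void
toy_attach(struct toy_device *dev, void (*attach)(struct toy_device *))
{
	dev->state = TOY_ATTACHING;
	attach(dev);
	/* Promote straight to busy if attach() took a busy reference. */
	dev->state = (dev->busy > 0) ? TOY_BUSY : TOY_ATTACHED;
}
```

In this model a device attached with toy_attach_plain() ends up in TOY_ATTACHED, while one attached with toy_attach_marks_busy() ends up in TOY_BUSY without passing through the attached state.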
eadler
2a42c5c4e9 Return EBADF instead of EMFILE from dup2 when the second argument is
outside the range of valid file descriptors

PR:		kern/164970
Submitted by:	Peter Jeremy <peterjeremy@acm.org>
Reviewed by:	jilles
Approved by:	cperciva
MFC after:	1 week
2012-04-11 14:08:09 +00:00
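
The behaviour described in 2a42c5c4e9 can be observed from userland. The sketch below is an assumption-laden probe rather than a regression test from the tree: dup2_out_of_range_errno() is a hypothetical helper that picks a target descriptor at the RLIMIT_NOFILE soft limit, which is one way of being "outside the range of valid file descriptors".

```c
#include <errno.h>
#include <limits.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Call dup2() with a target descriptor at or above the process limit and
 * return the resulting errno. Returns 0 if dup2() unexpectedly succeeds,
 * or the getrlimit() errno if the limit cannot be read.
 */
int
dup2_out_of_range_errno(void)
{
	struct rlimit rl;
	int newfd;

	if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
		return (errno);
	newfd = (rl.rlim_cur > (rlim_t)INT_MAX) ? INT_MAX : (int)rl.rlim_cur;
	errno = 0;
	if (dup2(STDIN_FILENO, newfd) == -1)
		return (errno);
	close(newfd);
	return (0);
}
```

On a kernel with this change the helper is expected to return EBADF; a kernel without it would be expected to return EMFILE instead.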
jilles
4360dc9ca8 Remove unused and wrong SA_PROC internal signal property.
The SA_PROC signal property indicated whether each signal number is directed
at a specific thread or at the process in general. However, that depends on
how the signal was generated and not on the signal number. SA_PROC was not
used.
2012-04-09 21:58:58 +00:00
mav
e1ffe54fb7 Microoptimize cpu_search().
According to profiling, cpu_search() now takes 6% of CPU time on hackbench
with its millions of context switches per second, instead of 8% before.
2012-04-09 18:24:58 +00:00
gleb
fb452e77b0 Add vfs_getopt_size. Support human readable file system options in tmpfs.
Increase maximum tmpfs file system size to 4GB*PAGE_SIZE on 32 bit archs.

Discussed with:	delphij
MFC after:	2 weeks
2012-04-07 15:27:34 +00:00
melifaro
8b1d10268c - Improve BPF locking model.
Interface locks and descriptor locks are converted from mutex(9) to rwlock(9).
This greatly improves performance: in the most common case we need to acquire 1
reader lock instead of 2 mutexes.

- Remove filter(descriptor) (reader) lock in bpf_mtap[2]
This was suggested by glebius@. We protect filter by requesting interface
writer lock on filter change.

- Cover struct bpf_if under BPF_INTERNAL define. This permits including bpf.h
without including rwlock stuff. However, this is a temporary solution,
struct bpf_if should be made opaque for any external caller.

Found by:       Dmitrij Tejblum <tejblum@yandex-team.ru>
Sponsored by:   Yandex LLC

Reviewed by:    glebius (previous version)
Reviewed by:    silence on -net@
Approved by:    (mentor)

MFC after:      3 weeks
2012-04-06 06:53:58 +00:00
jhb
5829de48d9 Add new ktrace records for the start and end of VM faults. This gives
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take.  The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.

Reviewed by:	kib
MFC after:	2 weeks
2012-04-05 17:13:14 +00:00
davidxu
cc55f4943b In sem_post, the field _has_waiters is no longer used, because some
applications destroy the semaphore after sem_wait returns. Just enter the
kernel to wake up sleeping threads, and only update _has_waiters if
it is safe. While here, check if the value exceeds SEM_VALUE_MAX and
return EOVERFLOW if this is true.
2012-04-05 03:05:02 +00:00
davidxu
8c31e244f2 umtx operation UMTX_OP_MUTEX_WAKE has a side-effect: it accesses
a mutex after a thread has unlocked it, and it even writes data to the mutex
memory to clear the contention bit. There is a race in which other threads
can lock it and unlock it, then destroy it, so it should not write
data to the mutex memory if there isn't any waiter.
The new operation UMTX_OP_MUTEX_WAKE2 tries to fix the problem. It
requires the thread library to clear the lock word entirely, then
call the WAKE2 operation to check if there is any waiter in the kernel
and try to wake up a thread; if necessary, the contention bit is set again
by the operation. This also mitigates the chance that other threads find
the contention bit and try to enter the kernel to compete with each other
to wake up a sleeping thread, which is unnecessary. With this change, the
mutex owner no longer holds the mutex until it reaches a point
where the kernel umtx queue is locked; it releases the mutex as soon as
possible.
Performance is improved when the mutex is heavily contested.  On Intel
i3-2310M, the runtime of a benchmark program is reduced from 26.87 seconds
to 2.39 seconds, which is even better than UMTX_OP_MUTEX_WAKE, which is
deprecated now. http://people.freebsd.org/~davidxu/bench/mutex_perf.c
2012-04-05 02:24:08 +00:00
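
The unlock ordering described in 8c31e244f2 (clear the lock word first, enter the kernel afterwards) can be sketched with C11 atomics. This is a toy model, not libthr or the umtx code: TOY_CONTESTED, toy_kernel_wake() and toy_mutex_unlock() are hypothetical names, and the "kernel" is reduced to a counter.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TOY_CONTESTED	0x80000000u	/* hypothetical "waiters exist" bit */

static int toy_wake_calls;	/* counts simulated kernel entries */

/* Stand-in for the kernel wake operation; here it only counts calls. */
static void
toy_kernel_wake(_Atomic uint32_t *word)
{
	(void)word;
	toy_wake_calls++;
}

/*
 * Unlock: clear the whole lock word in one atomic step, then enter the
 * "kernel" only if the old value said someone might be waiting.
 */
void
toy_mutex_unlock(_Atomic uint32_t *word)
{
	uint32_t old = atomic_exchange(word, 0);

	if (old & TOY_CONTESTED)
		toy_kernel_wake(word);
}
```

The point of the ordering is that the unlocking thread never writes to the lock word after the exchange, so another thread that locks, unlocks and destroys the mutex in the meantime cannot be disturbed by a late store from the previous owner.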
np
307ef13f94 - Remove redundant call to pr_ctloutput from code that handles SO_SETFIB.
- Add a check for errors during copyin while here.

Reviewed by:	julian, bz
MFC after:	2 weeks
2012-04-03 18:38:00 +00:00
kib
ff6239a557 When a process exits, not only the children shall be reparented to
init, but also the orphans shall be removed from the orphan list,
because the list header is destroyed.

Reported and tested by:	pho
MFC after:	3 days
2012-04-02 19:35:36 +00:00
kib
9ad701f91f Add helper function to remove the process from the orphans list and
use it instead of inlined code.

Tested by:	pho
MFC after:	3 days
2012-04-02 19:34:56 +00:00
jhb
506e2f15b9 Export some more useful info about shared memory objects to userland
via procstat(1) and fstat(1):
- Change shm file descriptors to track the pathname they are associated
  with and add a shm_path() method to copy the path out to a caller-supplied
  buffer.
- Use the fo_stat() method of shared memory objects and shm_path() to
  export the path, mode, and size of a shared memory object via
  struct kinfo_file.
- Add a struct shmstat to the libprocstat(3) interface along with a
  procstat_get_shm_info() to export the mode and size of a shared memory
  object.
- Change procstat to always print out the path for a given object if it
  is valid.
- Teach fstat about shared memory objects and to display their path,
  mode, and size.

MFC after:	2 weeks
2012-04-01 18:22:48 +00:00
davidxu
42d5de0c66 Remove stale comments.
2012-03-31 06:48:41 +00:00
davidxu
0bd3403eb7 Remove trailing semicolon, it is a typo.
2012-03-30 12:57:14 +00:00
davidxu
febc18f31b Fix COMPAT_FREEBSD32 build.
Submitted by: Andreas Tobler < andreast at fgznet dot ch >
2012-03-30 09:03:53 +00:00
davidxu
f7f769bc6d Remove trailing space.
2012-03-30 05:49:32 +00:00
davidxu
5faf75d34c Merge umtxq_sleep and umtxq_nanosleep into a single function by using
an abs_timeout structure which describes timeout info.
2012-03-30 05:40:26 +00:00
davidxu
362bad78ca Reduce code size by creating common timed sleeping function.
2012-03-29 02:46:43 +00:00
fabient
5edfb77dd3 Add software PMC support.
New kernel events can be added at various location for sampling or counting.
This will for example allow easy system profiling whatever the processor is
with known tools like pmcstat(8).

Simultaneous usage of software PMC and hardware PMC is possible, for example
looking at the lock acquire failure, page fault while sampling on
instructions.

Sponsored by: NETASQ
MFC after:	1 month
2012-03-28 20:58:30 +00:00
rstone
0ee65aa24e Instead of only iterating over the set of known SDT probes when sdt.ko is
loaded and unloaded, also have sdt.ko register callbacks with kern_sdt.c
that will be called when a newly loaded KLD module adds more probes or
a module with probes is unloaded.

This fixes two issues: first, if a module with SDT probes was loaded after
sdt.ko was loaded, those new probes would not be available in DTrace.
Second, if a module with SDT probes was unloaded while sdt.ko was loaded,
the kernel would panic the next time DTrace had cause to try and do
anything with the no-longer-existent probes.

This makes it possible to create SDT probes in KLD modules, although there
are still two caveats: first, any SDT probes in a KLD module must be part
of a DTrace provider that is defined in that module.  At present DTrace
only destroys probes when the provider is destroyed, so you can still
panic the system if a KLD module creates new probes in a provider from a
different module (including the kernel) and then unload the first module.

Second, the system will panic if you unload a module containing SDT probes
while there is an active D script that has enabled those probes.

MFC after:	1 month
2012-03-27 15:07:43 +00:00
melifaro
fd561480db - Add knlist_init_rw_reader() function to kqueue(9).
Function acquired reader lock if needed.
Assert check for reader or writer lock (RA_LOCKED / RA_UNLOCKED)
- While here, add knlist_init_mtx.9 to MLINKS and fix some style(9) issues

Reviewed by:    glebius
Approved by:    ae(mentor)

MFC after:      2 weeks
2012-03-26 09:34:17 +00:00
trociny
0079b1f6c5 Add a sysctl to set and retrieve binary osreldate of another process.
Suggested by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2012-03-23 20:05:41 +00:00
ae
bb8b607479 Correct debug message.
2012-03-22 09:29:07 +00:00
alc
e02fd6b842 Handle spurious page faults that may occur in no-fault sections of the
kernel.

When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB.  In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB.  This is exactly as
recommended in AMD's documentation.  In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry.  In short, this saves us from
having to perform potentially costly TLB flushes.  In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry.  Usually, such spurious page faults are handled
by vm_fault() without incident.  However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault().  This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.

In collaboration with:	kib
Tested by:		gibbs (an earlier version)

I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.

MFC after:	1 week
2012-03-22 04:52:51 +00:00
ae
f0e7ec67c0 Acquire modules lock before call module_getname() in the KLD_DEBUG case.
MFC after:	1 week
2012-03-21 09:48:32 +00:00
eadler
169b46c915 - Clean up timestamps in msgbuf code. The timestamps should now be
inserted after the priority token thus cleaning up the output.
- Remove the needless double internal do_add_char function.
- Resolve a possible deadlock if interrupts are
    disabled and getnanotime is called

Reviewed by:	bde, kmacy, avg, sbruno (various versions)
Approved by:	cperciva
MFC after:	2 weeks
2012-03-19 00:36:32 +00:00
jh
683a986c03 Cast wallclock.tv_sec to uint64_t to avoid overflow in the calculation.
PR:		kern/161552
Reviewed by:	trasz
Tested by:	Nikos Vassiliadis
MFC after:	1 week
2012-03-18 19:13:32 +00:00
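
The commit message for 683a986c03 does not show the expression that overflowed, so the following is only a generic sketch of the same fix pattern: widening a seconds value before scaling it. to_usecs() and the microsecond scale factor are assumptions made for illustration, and the inputs are assumed to be non-negative.

```c
#include <stdint.h>

/* Convert seconds plus microseconds to total microseconds without overflow. */
uint64_t
to_usecs(int32_t sec, int32_t usec)
{
	/* Widen first: sec * 1000000 would overflow 32-bit arithmetic. */
	return ((uint64_t)sec * 1000000 + (uint64_t)usec);
}
```

Without the cast, a 32-bit multiplication overflows once sec exceeds 2147, which is only about 36 minutes.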
davide
cd0c342e57 Add rudimentary profiling of the hash table used in the umtx code to
hold active lock queues.

Reviewed by:	attilio
Approved by:	davidxu, gnn (mentor)
MFC after:	3 weeks
2012-03-16 20:32:11 +00:00
tuexen
b8b34b6ecf Fix bugs which can result in a panic when a non-SCTP socket is
used with an sctp_ system-call which expects an SCTP socket.

MFC after: 3 days.
2012-03-15 14:13:38 +00:00
ae
894c8dc15b Add CTLFLAG_TUN to the sysctl definition and fix style.
Pointed by:	Garrett Cooper
MFC after:	2 weeks
2012-03-15 06:01:21 +00:00
ae
9be115302d Add debug.kld_debug loader tunable.
MFC after:	2 weeks
2012-03-15 05:11:29 +00:00
jh
59d9d84ca4 Add an assert for proctree_lock to proc_to_reap().
Discussed with:	kib
MFC after:	1 week
2012-03-14 15:52:23 +00:00
kib
6e85340add Lock the process around manipulations with p_flag.
Reported and reviewed by:	jh
MFC after:	3 days
2012-03-13 22:00:46 +00:00
adrian
f2bb6a85d7 Add module load/unload stubs.
2012-03-13 20:27:48 +00:00
mav
5b5fc4e585 Add kern.eventtimer.activetick tunable/sysctl, specifying whether each
hardclock() tick should be run on every active CPU, or on only one.

In my tests, avoiding extra interrupts because of this on 8-CPU Core i7
system with HZ=10000 saves about 2% of performance. At this moment the option
is implemented only for global timers, as reprogramming per-CPU timers is
too expensive now to be compensated by this benefit, especially since we
still have to regularly run hardclock() on at least one active CPU to
update system uptime. For global timer it is quite trivial: timer runs
always, but we just skip IPIs to other CPUs when possible.

Option is enabled by default now, keeping previous behavior, as periodic
hardclock() calls are still used at least to implement setitimer(2) with
ITIMER_VIRTUAL and ITIMER_PROF arguments. But since default schedulers don't
depend on it since r232917, we are much more free to experiment with it.

MFC after:	1 month
2012-03-13 10:21:08 +00:00
mav
ffaa080e67 Rewrite thread CPU usage percentage math to not depend on periodic calls
with HZ rate through the sched_tick() calls from hardclock().

Potentially it can be used to improve precision, but now it is just minus
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS existing in patches out of the tree depends on it.

MFC after:	1 month
2012-03-13 08:18:54 +00:00
pho
e35bb21f2c Always call fdrop().
2012-03-12 11:56:57 +00:00
kib
4e790f9b2b ELF image can have several PT_NOTE program headers. Look for the ELF
brand note in each header, instead of using only first one.

Reviewed by:	kan
Tested by:	andrew (arm), flo (sparc64)
MFC after:	3 weeks
2012-03-11 19:38:49 +00:00
kib
8adabb0356 Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by:	gianni
2012-03-11 12:19:58 +00:00
mav
4be9351f8b Revert r175376 and tune cpufreq(4) frequency comparison logic instead.
Instead of using 25MHz equality threshold, look for the nearest value when
handling dev.cpu.0.freq sysctl and for exact match when it is expected.

ACPI may report extra level with frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.

I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default and without this change it was absolutely invisible and hard
to control.

MFC after:	2 weeks
2012-03-10 18:56:16 +00:00
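
The "look for the nearest value" rule in 4be9351f8b can be sketched as a small search over a level table. This is not the cpufreq(4) code: nearest_level() is a hypothetical helper, and the tie-breaking rule (the earlier entry wins) is an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Return the index of the level closest to target_mhz, or -1 if the
 * table is empty. Ties go to the earlier entry.
 */
int
nearest_level(const int *levels_mhz, size_t n, int target_mhz)
{
	size_t i;
	int best = -1, best_diff = 0;

	for (i = 0; i < n; i++) {
		int diff = abs(levels_mhz[i] - target_mhz);

		if (best == -1 || diff < best_diff) {
			best = (int)i;
			best_diff = diff;
		}
	}
	return (best);
}
```

With the levels quoted in the message (2934, 2933, 2800), a request for 2934 selects the Turbo Boost entry and a request for 2933 selects the nominal one, which a 25MHz equality threshold could not tell apart.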
mav
1324baa4eb Idle ticks optimization:
- Pass number of events to the statclock() and profclock() functions
   same as to hardclock() before to not call them many times in a loop.
 - Rename them into statclock_cnt() and profclock_cnt().
 - Turn statclock() and profclock() into compatibility wrappers,
   still needed for arm.
 - Rename hardclock_anycpu() into hardclock_cnt() for unification.

MFC after:	1 week
2012-03-10 14:57:21 +00:00
trasz
a0d48d6f11 Remove useless thread_{lock,unlock}() in racctd.
2012-03-10 14:38:49 +00:00
jmallett
d25fa497f7 Export intrcnt correctly when running under 32-bit compatibility.
Reviewed by:	gonzo, nwhitehorn
2012-03-09 22:30:54 +00:00
pho
c84e05a07c Perform the parameter validation before assigning it to a signed int
variable. This fixes the problem seen with readdir(3) fuzzing.

Submitted by:	bde
MFC after:	1 week
2012-03-09 21:31:12 +00:00
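
The ordering that c84e05a07c describes (validate first, narrow to a signed int afterwards) looks roughly like this in isolation. The sketch is generic C rather than the patched kernel function; checked_narrow() and its EINVAL return are hypothetical choices.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Validate an unsigned request size while it is still unsigned, and only
 * then store it in a signed int. Returns 0 on success or EINVAL.
 */
int
checked_narrow(size_t count, int *out)
{
	if (count > (size_t)INT_MAX)
		return (EINVAL);	/* would not fit in a signed int */
	*out = (int)count;
	return (0);
}
```

Doing the assignment first and the check second is the failure mode: an oversized value can turn negative in the int and slip past a later "too large" comparison.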
mav
d6e827162d Make kern.sched.idlespinthresh default value adaptive depending on HZ.
Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.
2012-03-09 19:09:08 +00:00
mav
fb50c869a4 Be more polite when setting state->nextevent inside cpu_new_callout().
Hardclock is not the only one that wakes an idle CPU since the kdtrace cyclic addition.

MFC after:	2 weeks
2012-03-09 07:30:48 +00:00
kib
5abd2bb7cb Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after:	1 week
2012-03-09 00:12:05 +00:00
pho
81cae127b0 Free up allocated memory used by posix_fadvise(2).
2012-03-08 20:34:13 +00:00
jhb
19feaba08b Add KTR_VFS traces to track modifications to a vnode's writecount.
2012-03-08 20:27:20 +00:00
jhb
4fea355eb2 Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after:	2 weeks
2012-03-08 19:41:05 +00:00
kib
9d4d411642 The pipe_poll() performs lockless access to the vnode to test
fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed
by fifo_cleanup().

Precalculate EOF at the places where fo_wgen is changed, and cache the
state in a new pipe state flag PIPE_SAMEWGEN.

Reported and tested by:	bf
Submitted by:	gianni
MFC after:	1 week (a backport)
2012-03-07 07:31:50 +00:00
trasz
613d36617d Make racct and rctl correctly handle jail renaming. Previously
they would continue using old name, the one jail was created with.

PR:		bin/165207
2012-03-06 11:05:50 +00:00
ivoras
5fe0ebe46a Print out process name and thread id in the debugging message.
This is useful because the message can end up in system logs in
non-debugging operation.

Reviewed by:	attilio (earlier version)
2012-03-05 14:19:43 +00:00
kib
6f473d5748 pipe_read(): change the type of size to int, and remove signed clamp.
pipe_write(): change the type of desiredsize back to int, its value fits.

Requested by: bde
MFC after:    3 weeks
2012-03-04 15:09:01 +00:00
kib
273d08b6bc Instead of incomplete handling of read(2)/write(2) return values that
do not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianness-unsafe code to split return value
into td_retval.

While there, change the style of the sysctl debug.iosize_max_clamp
definition.

Requested by:	bde
MFC after:	3 weeks
2012-03-04 14:55:37 +00:00
trociny
f331516921 Make kern.proc.umask sysctl readonly.
Requested by:	src
MFC after:	1 week
2012-03-03 11:53:35 +00:00
mav
1504832681 Fix bug of r232207, when cpu_search() could prefer CPU group with best
load, but with no CPU matching given limitations. It caused kernel panics
in some cases when thread was bound to specific CPUs with cpuset(1).
2012-03-03 11:50:48 +00:00
jmallett
50c253779f o) Add COMPAT_FREEBSD32 support for MIPS kernels using the n64 ABI with userlands
using the o32 ABI.  This mostly follows nwhitehorn's lead in implementing
   COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
   32-bit ABI being used.  Since the MIPS port is relatively-new, even the 32-bit
   ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
   with 32-bit compatibility, disable some code which wrongly assumes otherwise
   when built for MIPS.  A more general macro to check in this case would
   seem like a good idea eventually.  If someone adds support for using n32
   userland with n64 kernels on MIPS, then they will have to add a variety of
   flags related to each piece of the ABI that can vary.  That's probably the
   right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
   freebsd32 compat code.  Probably this should be generalized at some point.

Reviewed by:	gonzo
2012-03-03 08:19:18 +00:00
rmacklem
f633984c25 Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by:	bde
Discussed with:	kib
Reviewed by:	jhb
MFC after:	2 weeks
2012-03-03 01:06:54 +00:00
jhb
5013ab31bd - Change contigmalloc() to use the vm_paddr_t type instead of an unsigned
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
  constraints.

These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t).  Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.

Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.

Reviewed by:	scottl
2012-03-01 19:58:34 +00:00
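A small sketch of why the type change matters, assuming PAE-like widths (32-bit sizes, 64-bit physical addresses). The `sketch_*` types and the helper functions are hypothetical and only mimic the width mismatch between bus_size_t and bus_addr_t; they are not the bus_dma implementation. The sketch also assumes that a boundary of 0 means "no constraint".

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PAE-like widths; stand-ins for bus_size_t and bus_addr_t. */
typedef uint32_t sketch_size_t;
typedef uint64_t sketch_addr_t;

#define	BOUNDARY_4GB	((uint64_t)1 << 32)

/* A 4GB boundary stored in a 32-bit size type truncates. */
static sketch_size_t
boundary_as_size(uint64_t boundary)
{
	return ((sketch_size_t)boundary);
}

/* The same boundary stored in a 64-bit address type is preserved. */
static sketch_addr_t
boundary_as_addr(uint64_t boundary)
{
	return ((sketch_addr_t)boundary);
}

/*
 * Report whether [start, start + len) crosses a multiple of boundary.
 * boundary must be 0 (assumed to mean no constraint) or a power of two.
 */
static int
crosses_boundary(sketch_addr_t start, sketch_addr_t len,
    sketch_addr_t boundary)
{
	sketch_addr_t mask;

	if (boundary == 0 || len == 0)
		return (0);
	mask = ~(boundary - 1);
	return ((start & mask) != ((start + len - 1) & mask));
}
```

In this sketch the truncated value becomes 0, so the 4GB constraint silently disappears, while the address-typed value keeps it.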
mckusick
b23f922edf This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by:    Jeff Roberson
Tested by:             Peter Holm
MFC (to 9 only) after: 2 weeks
2012-03-01 18:45:25 +00:00
trociny
1aad0004ee Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH()
operations for setting and accessing vnode's v_socket field.

The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).

This change fixes the long-standing issue with nullfs(5) in which
unix sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is that one can connect to both the lower
and the upper paths regardless of which layer path one binds to.

PR:		kern/51583, kern/159663
Suggested by:	kib
Reviewed by:	arch
MFC after:	2 weeks
2012-02-29 21:38:31 +00:00
davidxu
49fb0a40aa initialize clock ID and flags only when copying timespec, a _umtx_time
copy already contains these fields.
2012-02-29 02:01:48 +00:00
mm
77766742e1 Add procfs to jail-mountable filesystems.
Reviewed by:	jamie
MFC after:	1 week
2012-02-29 00:30:18 +00:00
dim
e045194768 Change definition of pipe_chmod() from K&R to C99, to avoid the
following clang warning:

sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t'
      (aka 'unsigned short') declared in a previous prototype [-Werror]
        mode_t mode;
               ^
sys/kern/sys_pipe.c:155:19: note: previous declaration is here
static fo_chmod_t       pipe_chmod;
                        ^
2012-02-28 21:45:21 +00:00
jhb
22eaf01bc1 Properly clear a device's devclass if DEVICE_ATTACH() fails if the device
does not have a fixed devclass.

Reviewed by:	imp
MFC after:	2 weeks
2012-02-28 19:16:02 +00:00
kib
0c3998cc9e Currently, the debugger attached to the process executing vfork() does
not get syscall exit notification until the child performed exec or
exit.  Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.

Reported, tested and reviewed by:	Dmitry Mikulin <dmitrym juniper net>
MFC after:	 2 weeks
2012-02-27 21:10:10 +00:00
jhb
4110bb206b Clear a device's description string anytime its driver changes.
Descriptions are specific to drivers and we don't change drivers on attached
devices.  This fixes a few places where we were not clearing the description
when detaching a driver (e.g. when device_attach() failed).  While here, fix
a few other nits:
- Remove spurious call to remove a device's driver from
  devclass_driver_deleted().  device_detach() removes it already.
- Fix a typo.
2012-02-27 16:08:18 +00:00
davidxu
96aacc2279 Follow changes made in revision 232144, pass absolute timeout to kernel,
this eliminates a clock_gettime() syscall.
2012-02-27 13:38:52 +00:00
mav
8a35bc6c4f Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
 - In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
 - Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows differentiating 1+1 and 2+0 loads.
 - Make cpu_search() prefer the specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
 - Randomize CPU selection if above factors are equal. Previous code tended
to prefer CPUs with lower IDs, causing unneeded collisions.
 - Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make the balancing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.

All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, when the number of threads is less than the
number of logical CPUs. In some tests it also has a positive effect on
systems without SMT.

Reviewed by:	jeff
Tested by:	flo, hackers@
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2012-02-27 10:31:54 +00:00
phk
c1bac816f5 Also call the low-level driver if ->c_iflag & (IXON|IXOFF|IXANY) changes.
Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hw XON-XOFF support.
This is obviously no good, if changes to those bits are not communicated
down the stack.
2012-02-26 20:56:49 +00:00
alc
351ef48158 Fix typo.
MFC after:	1 week
2012-02-26 19:10:14 +00:00
mm
d974ef7be1 Analogous to r232059, add a parameter for the ZFS file system:
allow.mount.zfs:
	allow mounting the zfs filesystem inside a jail

This way the permissions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled via allow.mount.* jail parameters.

Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.

TODO:	document the connection of allow.mount.* and VFCF_JAIL for kernel
	developers

MFC after:	10 days
2012-02-26 16:30:39 +00:00
jilles
06d1ac3238 Fix fchmod() and fchown() on fifos.
The new fifo implementation in r232055 broke fchmod() and fchown() on fifos.
Postfix needs this.

Submitted by:	gianni
Reported by:	dougb
2012-02-26 15:14:29 +00:00
trociny
a9902044d1 Add sysctl to retrieve or set umask of another process.
Submitted by:	Dmitry Banschikov <me ubique spb ru>
Discussed with:	kib, rwatson
Reviewed by:	kib
MFC after:	2 weeks
2012-02-26 14:25:48 +00:00
kib
79e65f67d4 Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the
socket protocol number.  This is useful since the socket type can
be implemented by different protocols in the same protocol family,
e.g. SOCK_STREAM may be provided by both TCP and SCTP.

Submitted by:	Jukka A. Ukkonen <jau iki fi>
PR:	  kern/162352
Discussed with:	bz
Reviewed by:	glebius
MFC after:	2 weeks
2012-02-26 13:55:43 +00:00
kib
18b7fd6ba3 Remove apparently redundant checks for socket so_proto being non-NULL
from sosetopt() and sogetopt().  No exposed sockets may have so_proto
invalid.

Discussed with:	bz, rwatson
Reviewed by:	glebius
MFC after:	2 weeks
2012-02-26 13:51:05 +00:00
maxim
07e034748a o Reduce chances for integer overflow.
o More verbose sysctl description added.

MFC after:	2 weeks
Sponsored by:	Nginx, Inc.
2012-02-25 12:06:40 +00:00
trociny
87f7f0cfe8 When detaching a unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(bound to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2012-02-25 10:15:41 +00:00
davidxu
61033245ae In revision 231989, we pass a 16-bit clock ID into the kernel; however,
according to the POSIX document, the clock ID may be dynamically allocated,
so it is unlikely to stay within 64K forever. To make it future compatible, we
pack all timeout information into a new structure called _umtx_time, and
use the fourth argument as a size indication. Zero means it is old code
using timespec as the timeout value; the new structure also includes flags
and a clock ID, so its size argument differs from before and is
non-zero. With this change, it is possible that a thread can sleep
on any supported clock, though current kernel code does not have such a
POSIX clock driver system.
2012-02-25 02:12:17 +00:00
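The size-based dispatch described in this commit could look roughly like the sketch below. It is an illustration only: `struct sketch_umtx_time`, its field names, `SKETCH_DEFAULT_CLOCK`, and `decode_timeout()` are all hypothetical, and plain memcpy() stands in for the kernel's copy from user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define	SKETCH_DEFAULT_CLOCK	0	/* hypothetical default clock ID */

/* Hypothetical layout: a timeout plus flags and a clock ID. */
struct sketch_umtx_time {
	struct timespec	timeout;
	uint32_t	flags;
	uint32_t	clockid;
};

/*
 * Decode a timeout argument.  size == 0 marks an old caller passing a
 * bare timespec, so flags and clock ID get defaults; otherwise the
 * caller must pass the whole new structure.  Returns 0, or -1 on a
 * size mismatch.
 */
static int
decode_timeout(const void *uaddr, size_t size, struct sketch_umtx_time *out)
{
	if (size == 0) {
		memcpy(&out->timeout, uaddr, sizeof(out->timeout));
		out->flags = 0;
		out->clockid = SKETCH_DEFAULT_CLOCK;
		return (0);
	}
	if (size != sizeof(*out))
		return (-1);
	memcpy(out, uaddr, sizeof(*out));
	return (0);
}
```

Defaults are filled in only on the old-style path, since a copy of the full structure already carries its own flags and clock ID.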
kib
a0f40cddfb Restore the return statement erroneously removed in r232048.
Submitted by:	cognet
Pointy hat to:	kib (reuse the one I already got today)
MFC after:	13 days
2012-02-24 11:02:35 +00:00
mm
4825085ea4 To improve control over the use of mount(8) inside a jail(8), introduce
a new jail parameter node with the following parameters:

allow.mount.devfs:
	allow mounting the devfs filesystem inside a jail

allow.mount.nullfs:
	allow mounting the nullfs filesystem inside a jail

Both parameters are disabled by default (equals the behavior before
devfs and nullfs in jails). Administrators have to explicitly allow
mounting devfs and nullfs for each jail. The value "-1" of the
devfs_ruleset parameter is removed in favor of the new allow setting.

Reviewed by:	jamie
Suggested by:	pjd
MFC after:	2 weeks
2012-02-23 18:51:24 +00:00
kmacy
a5c27d1ee0 merge pipe and fifo implementations
Also reviewed by: jhb, jilles (initial revision)
Tested by: pho, jilles

Submitted by:	gianni
Reviewed by:	bde
2012-02-23 18:37:30 +00:00
brueffer
8b76671a80 Catch up with r195837 (2.5 years ago) which renamed net_add_domain() to domain_add().
PR:		165424
Submitted by:	Lachlan Kang
MFC after:	1 week
2012-02-23 17:47:19 +00:00
kib
bfe47eb1df Allow the parent to gather the exit status of the children reparented
to the debugger.  When reparenting for debugging, keep the child in
the new orphan list of old parent.  When looping over the children in
kern_wait(), iterate over both children list and orphan list to search
for the process by pid.

Submitted by:	Dmitry Mikulin <dmitrym juniper.net>
MFC after:	2 weeks
2012-02-23 11:50:23 +00:00
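The two-list search described in this commit can be sketched with plain singly linked lists. The `sketch_*` structures and `find_waitable()` are hypothetical simplifications; the kernel uses its own process structures, list macros, and locking, none of which are modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified process record. */
struct sketch_proc {
	pid_t			 pid;
	struct sketch_proc	*sibling;	/* next on the children list */
	struct sketch_proc	*orphan;	/* next on the orphan list */
};

/* Hypothetical parent holding both lists. */
struct sketch_parent {
	struct sketch_proc	*children;	/* current children */
	struct sketch_proc	*orphans;	/* reparented to a debugger */
};

/*
 * Look for pid among the real children first, then among the children
 * that were reparented to a debugger but kept on the orphan list.
 */
static struct sketch_proc *
find_waitable(struct sketch_parent *pp, pid_t pid)
{
	struct sketch_proc *p;

	for (p = pp->children; p != NULL; p = p->sibling)
		if (p->pid == pid)
			return (p);
	for (p = pp->orphans; p != NULL; p = p->orphan)
		if (p->pid == pid)
			return (p);
	return (NULL);
}
```

Keeping the orphan list on the old parent is what lets it still find, and collect the exit status of, a child that is currently owned by the debugger.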
davidxu
79308ead48 Fix typo. 2012-02-22 07:34:23 +00:00
davidxu
d177303078 Use the unused fourth argument of umtx_op to pass flags to the kernel
for operation UMTX_OP_WAIT. The upper 16 bits are enough to hold a clock id,
and the lower 16 bits are used to pass flags. The change saves a
clock_gettime() syscall from libthr.
2012-02-22 03:22:49 +00:00
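The bit layout described here (clock id in the upper 16 bits, flags in the lower 16) can be sketched as follows. The helper names are hypothetical, and the sketch shows only the packing arithmetic, not the actual libthr or kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a clock id into the upper 16 bits and flags into the lower 16. */
static uint32_t
pack_wait_arg(uint16_t clock_id, uint16_t flags)
{
	return (((uint32_t)clock_id << 16) | (uint32_t)flags);
}

/* Recover the clock id from the upper 16 bits. */
static uint16_t
unpack_clock_id(uint32_t arg)
{
	return ((uint16_t)(arg >> 16));
}

/* Recover the flags from the lower 16 bits. */
static uint16_t
unpack_flags(uint32_t arg)
{
	return ((uint16_t)(arg & 0xffff));
}
```

The 16-bit ceiling on the clock id is the limitation that the later _umtx_time change removes.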
trociny
05b0baa9d7 unp_connect() may use a shared lock on the vnode to fetch the socket.
Suggested by:	jhb
Reviewed by:	jhb, kib, rwatson
MFC after:	2 weeks
2012-02-21 19:40:13 +00:00
kib
80ae8fe82c Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows SSIZE_MAX-sized i/o requests to be performed from
usermode.

Discussed with:	bde, das (previous versions)
MFC after:	1 month
2012-02-21 01:05:12 +00:00
delphij
3e8e765569 Revert r231923 for now. Further work is needed to make sure that the
behavior is consistent.
2012-02-20 09:32:32 +00:00
delphij
c86b0fb582 Use uprintf instead of printf for the reason why a kernel module can not
be loaded.  This way, the administrator can get response immediately from
the shell session rather than relying on dmesg.

MFC after:	1 month
2012-02-20 01:05:17 +00:00
alc
d953613252 Close a race due to dropping of the map lock between creating a map entry
for a shared mapping and marking the entry for inheritance.

Reviewed by:	kib
X-MFC after:	r231526
2012-02-19 00:28:49 +00:00
kib
abd1094f17 Fix misuse of the kernel map in miscellaneous image activators.
Vnode-backed mappings cannot be put into the kernel map, since it is a
system map.

Use exec_map for transient mappings, and remove the mappings with
kmem_free_wakeup() to notify the waiters on available map space.

Do not map the whole executable into KVA at all to copy it out into
usermode.  Directly use vn_rdwr() for the case of not page aligned
binary.

There is one place left where the potentially unbounded amount of data
is mapped into exec_map, namely, in the COFF image activator
enumeration of the needed shared libraries.

Reviewed by:   alc
MFC after:     2 weeks
2012-02-17 23:47:16 +00:00
bz
dcdb23291f Merge multi-FIB IPv6 support from projects/multi-fibv6/head/:
Extend the so far IPv4-only support for multiple routing tables (FIBs)
introduced in r178888 to IPv6 providing feature parity.

This includes an extended rtalloc(9) KPI for IPv6, the necessary
adjustments to the network stack, and user land support as in netstat.

Sponsored by:	Cisco Systems, Inc.
Reviewed by:	melifaro (basically)
MFC after:	10 days
2012-02-17 02:39:58 +00:00
eadler
c7937266c4 Add a timestamp to the msgbuf output in order to determine when
messages were printed.

This can be enabled with the kern.msgbuf_show_timestamp sysctl

PR:		kern/161553
Reviewed by:	avg
Submitted by:	Arnaud Lacombe <lacombar@gmail.com>
Approved by:	cperciva
MFC after:	1 month
2012-02-16 05:11:35 +00:00
kib
4658c8a871 The PTRACESTOP() macro is used only once. Inline the only use and remove
the macro.

MFC after:	1 week
2012-02-11 14:49:25 +00:00
ed
ec8ff22f84 Remove unneeded newline. It fits in 80 columns now.
Pointed out by:	jh
2012-02-10 14:55:47 +00:00
ed
2b4f7a9e8a Merge si_name and __si_namebuf.
The si_name pointer always points to the __si_namebuf member inside the
same object. Remove it and rename __si_namebuf to si_name.
2012-02-10 12:40:50 +00:00
kevlo
c935e4e242 Add a missing break. This bug was introduced in r228856. 2012-02-10 06:30:52 +00:00
kib
956d09353b Mark the automatically attached child with PL_FLAG_CHILD in struct
lwpinfo flags, for PT_FOLLOWFORK auto-attachment.

In collaboration with:	Dmitry Mikulin <dmitrym juniper net>
MFC after:	 1 week
2012-02-10 00:02:13 +00:00
mm
1626913ed1 Add support for mounting devfs inside jails.
A new jail(8) option "devfs_ruleset" defines the ruleset enforcement for
mounting devfs inside jails. A value of -1 disables mounting devfs in
jails, a value of zero means no restrictions. Nested jails can only
have mounting devfs disabled or inherit parent's enforcement as jails are
not allowed to view or manipulate devfs(8) rules.

Utilizes new functions introduced in r231265.

Reviewed by:	jamie
MFC after:	1 month
2012-02-09 10:22:08 +00:00
kib
5515381ae2 Unbreak detection of the async mode for clustered writes after r231075.
Submitted by:	bde
MFC after:	12 days
2012-02-08 15:07:19 +00:00