Commit Graph

12886 Commits

Author SHA1 Message Date
Warner Losh
f4dc9a4015 Remove an old hack I noticed years ago, but never committed. 2012-06-28 07:33:43 +00:00
Alan Cox
e30df26e7b Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
Kevin Lo
c61325d009 Correct sizeof usage
Obtained from:	DragonFly
2012-06-25 05:41:16 +00:00
Konstantin Belousov
a665ed986c Move the code dealing with shared page into a dedicated
kern_sharedpage.c source file from kern_exec.c.

MFC after:	  29 days
2012-06-23 10:15:23 +00:00
Konstantin Belousov
21c295ef88 Stop updating the struct vdso_timehands from even handler executed in
the scheduled task from tc_windup(). Do it directly from tc_windup in
interrupt context [1].

Establish the permanent mapping of the shared page into the kernel
address space, avoiding the potential need to sleep waiting for
allocation of sf buffer during vdso_timehands update. As a
consequence, shared_page_write_start() and shared_page_write_end()
functions are not needed anymore.

Guess and memorize the pointers to native host and compat32 sysentvec
during initialization, to avoid the need to get shared_page_alloc_sx
lock during the update.

In tc_fill_vdso_timehands(), do not loop waiting for timehands
generation to stabilize, since vdso_timehands is written in the same
interrupt context which wrote timehands.

Requested by:	  mav [1]
MFC after:	  29 days
2012-06-23 09:33:06 +00:00
Konstantin Belousov
aea810386d Implement mechanism to export some kernel timekeeping data to
usermode, using shared page.  The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs neccessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
Konstantin Belousov
a9d8437c6d Enchance the shared page chunk allocator.
Do not rely on the busy state of the page from which we allocate the
chunk, to protect allocator state. Use statically allocated sx lock
instead.

Provide more flexible KPI. In particular, allow to allocate chunk
without providing initial data, and allow writes into existing
allocation. Allow to get an sf buf which temporary maps the chunk, to
allow sequential updates to shared page content without unmapping in
between.

Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 06:39:28 +00:00
Konstantin Belousov
854c3ce7ac Fix locking for f_offset, vn_read() and vn_write() cases only, for now.
It seems that intended locking protocol for struct file f_offset field
was as follows: f_offset should always be changed under the vnode lock
(except fcntl(2) and lseek(2) did not followed the rules). Since
read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally
taken to serialize shared vnode lock owners.

This was broken first by enabling shared lock on writes, then by
fadvise changes, which moved f_offset assigned from under vnode lock,
and last by vn_io_fault() doing chunked i/o. More, due to uio_offset
not yet valid in vn_io_fault(), the range lock for reads was taken on
the wrong region.

Change the locking for f_offset to always use FOFFSET_LOCKED block,
which is placed before rangelocks in the lock order.

Extract foffset_lock() and foffset_unlock() functions which implements
FOFFSET_LOCKED lock, and consistently lock f_offset with it in the
vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is
not set for the vnode mount. Indicate that f_offset is already valid
for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET
flag, and assert that all callers of vn_read() and vn_write() follow
this protocol.

Extract get_advice() function to calculate the POSIX_FADV_XXX value
for the i/o region, and use it were appropriate.

Reviewed by:	jhb
Tested by:	pho
MFC after:	2 weeks
2012-06-21 09:19:41 +00:00
Pawel Jakub Dawidek
53e1646325 Check proper flag (PDF_DAEMON, not PD_DAEMON) when deciding if the process
should be killed or not.

This fixes killing pdfork(2)ed process on last close of the corresponding
process descriptor.

Reviewed by:	rwatson
MFC after:	1 month
2012-06-19 22:23:59 +00:00
Pawel Jakub Dawidek
0a7007b98f The falloc() function obtains two references to newly created 'fp'.
On success we have to drop one after procdesc_finit() and on failure
we have to close allocated slot with fdclose(), which also drops one
reference for us and drop the remaining reference with fdrop().

Without this change closing process descriptor didn't result in killing
pdfork(2)ed child.

Reviewed by:	rwatson
MFC after:	1 month
2012-06-19 22:21:59 +00:00
John Baldwin
cd4ecf3cd2 Further refine the implementation of POSIX_FADV_NOREUSE.
First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads.  A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read.  If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read.  The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read.  This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.

Second, apply the same changes made to read(2) by r230782 and this
change to writes.  This provides much better performance in the
sequential write case as it allows writes to still be clustered.  It
also provides much better performance for misaligned writes.  It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.

MFC after:	2 weeks
2012-06-19 18:42:24 +00:00
Peter Holm
e84a11e7ff In tty_makedev() the following construction:
dev = make_dev_cred();
dev->si_drv1 = tp;

leaves a small window where the newly created device may be opened
and si_drv1 is NULL.

As this is a vary rare situation, using a lock to close the window
seems overkill. Instead just wait for the assignment of si_drv1.

Suggested by:	kib
MFC after:	1 week
2012-06-18 07:34:38 +00:00
Pawel Jakub Dawidek
d99e1d5fd6 Don't check for race with close on advisory unlock (there is nothing smart we
can do when such a race occurs). This saves lock/unlock cycle for the filedesc
lock for every advisory unlock operation.

MFC after:	1 month
2012-06-17 21:04:22 +00:00
Pawel Jakub Dawidek
604a7c2f00 Extend the comment about checking for a race with close to explain why
it is done and why we don't return an error in such case.

Discussed with:	kib
MFC after:	1 month
2012-06-17 16:59:37 +00:00
Pawel Jakub Dawidek
fd6049b186 If VOP_ADVLOCK() call or earlier checks failed don't check for a race with
close, because even if we had a race there is nothing to unlock.

Discussed with:	kib
MFC after:	1 month
2012-06-17 16:32:32 +00:00
Davide Italiano
68cefd64ad The variable 'error' in sys_poll() is initialized in declaration to value
zero but in any case is overwritten by successive copyin(), making the
previous initialization useless. Remove this.
As an added bonus this fixes a style(9) bug.

Discussed with:		kib
Approved by:		gnn (mentor)
MFC after:		3 days
2012-06-17 13:03:50 +00:00
Pawel Jakub Dawidek
cff2dcd10d Revert r237073. 'td' can be NULL here.
MFC after:	1 month
2012-06-16 12:56:36 +00:00
Pawel Jakub Dawidek
3cde71cb25 One more attempt to make prototypes formated according to style(9), which
holefully recovers from the "worse than useless" state.

Reported by:	bde
MFC after:	1 month
2012-06-15 10:00:29 +00:00
Pawel Jakub Dawidek
a79de683f5 Update comment.
MFC after:	1 month
2012-06-14 17:32:58 +00:00
Pawel Jakub Dawidek
19a8f6748e Remove fdtofp() function and use fget_locked(), which works exactly the same.
MFC after:	1 month
2012-06-14 16:25:10 +00:00
Pawel Jakub Dawidek
b7fc69ca89 Assert that the filedesc lock is being held when the fdunwrap() function
is called.

MFC after:	1 month
2012-06-14 16:23:16 +00:00
Pawel Jakub Dawidek
1a94dc8581 Simplify the code by making more use of the fdtofp() function.
MFC after:	1 month
2012-06-14 15:37:15 +00:00
Pawel Jakub Dawidek
215aeba939 - Assert that the filedesc lock is being held when fdisused() is called.
- Fix white spaces.

MFC after:	1 month
2012-06-14 15:35:14 +00:00
Pawel Jakub Dawidek
7aef754274 Style fixes and assertions improvements.
MFC after:	1 month
2012-06-14 15:34:10 +00:00
Pawel Jakub Dawidek
8d169d9ff0 Assert that the filedesc lock is not held when closef() is called.
MFC after:	1 month
2012-06-14 15:26:23 +00:00
Pawel Jakub Dawidek
eb273c01f3 Style fixes.
Reported by:	bde
MFC after:	1 month
2012-06-14 15:21:57 +00:00
Pawel Jakub Dawidek
c7e9a659ca Remove code duplication from fdclosexec(), which was the reason of the bug
fixed in r237065.

MFC after:	1 month
2012-06-14 12:43:37 +00:00
Pawel Jakub Dawidek
8f59e9fddc When we are closing capabilities during exec, we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Similar bug was fixed in r236853.

MFC after:	1 month
2012-06-14 12:41:21 +00:00
Pawel Jakub Dawidek
5570ae7d87 Style.
MFC after:	1 month
2012-06-14 12:37:41 +00:00
Pawel Jakub Dawidek
620216725a When checking if file descriptor number is valid, explicitely check for 'fd'
being less than 0 instead of using cast-to-unsigned hack.

Today's commit was brought to you by the letters 'B', 'D' and 'E' :)
2012-06-13 22:12:10 +00:00
Pawel Jakub Dawidek
7080f124d2 Now that dupfdopen() doesn't depend on finstall() being called earlier,
indx will never be -1 on error, as none of dupfdopen(), finstall() and
kern_capwrap() modifies it on error, but what is more important none of
those functions install and leave file at indx descriptor on error.

Leave an assert to prove my words.

MFC after:	1 month
2012-06-13 21:38:07 +00:00
Pawel Jakub Dawidek
3812dcd3de Allocate descriptor number in dupfdopen() itself instead of depending on
the caller using finstall().
This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes
a race between finstall() and dupfdopen().

MFC after:	1 month
2012-06-13 21:32:35 +00:00
Pawel Jakub Dawidek
7f35af0110 - Remove nfp variable that is not really needed.
- Update comment.
- Style nits.

MFC after:	1 month
2012-06-13 21:22:35 +00:00
Pawel Jakub Dawidek
c64dd3bae1 Remove duplicated code.
MFC after:	1 month
2012-06-13 21:15:01 +00:00
Pawel Jakub Dawidek
81424ab705 Add missing {.
MFC after:	1 month
2012-06-13 21:13:18 +00:00
Pawel Jakub Dawidek
85c1550d63 Style.
MFC after:	1 month
2012-06-13 21:11:58 +00:00
Pawel Jakub Dawidek
baf946221d There is no need to set td->td_retval[0] to -1 on error.
Confirmed by:	jhb
MFC after:	1 month
2012-06-13 21:10:00 +00:00
Pawel Jakub Dawidek
6195bfebcc There is only one caller of the dupfdopen() function, so we can simplify
it a bit:
- We can assert that only ENODEV and ENXIO errors are passed instead of
  handling other errors.
- The caller always call finstall() for indx descriptor, so we can assume
  it is set. Actually the filedesc lock is dropped between finstall() and
  dupfdopen(), so there is a window there for another thread to close the
  indx descriptor, but it will be closed in next commit.

Reviewed by:	mjg
MFC after:	1 month
2012-06-13 19:00:29 +00:00
Mateusz Guzik
2ca63f0a90 Remove 'low' argument from fd_last_used().
This function is static and the only caller always passes 0 as low.

While here update note about return values in comment.

Reviewed by:	pjd
Approved by:	trasz (mentor)
MFC after:	1 month
2012-06-13 17:18:16 +00:00
Mateusz Guzik
02efb9a8b1 Re-apply reverted parts of r236935 by pjd with some changes.
If fdalloc() decides to grow fdtable it does it once and at most doubles
the size. This still may be not enough for sufficiently large fd. Use fd
in calculations of new size in order to fix this.

When growing the table, fd is already equal to first free descriptor >= minfd,
also fdgrowtable() no longer drops the filedesc lock. As a result of this there
is no need to retry allocation nor lookup.

Fix description of fd_first_free to note all return values.

In co-operation with:	pjd
Approved by:	trasz (mentor)
MFC after:	1 month
2012-06-13 17:12:53 +00:00
Pawel Jakub Dawidek
faf0db351d Revert part of the r236935 for now, until I figure out why it doesn't
work properly.

Reported by:	davidxu
2012-06-12 10:25:11 +00:00
Pawel Jakub Dawidek
039dc89f0d fdgrowtable() no longer drops the filedesc lock so it is enough to
retry finding free file descriptor only once after fdgrowtable().

Spotted by:	pluknet
MFC after:	1 month
2012-06-11 22:05:26 +00:00
Pawel Jakub Dawidek
d3ec30e525 Use consistent way of checking if descriptor number is valid.
MFC after:	1 month
2012-06-11 20:17:20 +00:00
Pawel Jakub Dawidek
fd45a47ba6 Be consistent with white spaces.
MFC after:	1 month
2012-06-11 20:01:50 +00:00
Pawel Jakub Dawidek
19d9c0e11e Remove code duplicated in kern_close() and do_dup() and use closefp() function
introduced a minute ago.

This code duplication was responsible for the bug fixed in r236853.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 20:00:44 +00:00
Pawel Jakub Dawidek
642db963ab Introduce closefp() function that we will be able to use to eliminate
code duplication in kern_close() and do_dup().

This is committed separately from the actual removal of the duplicated
code, as the combined diff was very hard to read.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:57:31 +00:00
Pawel Jakub Dawidek
129c87eb7d Merge two ifs into one to make the code almost identical to the code in
kern_close().

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:53:41 +00:00
Pawel Jakub Dawidek
d327cee241 Move the code around a bit to move two parts of code duplicated from
kern_close() close together.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:51:27 +00:00
Pawel Jakub Dawidek
8b40793150 Now that fdgrowtable() doesn't drop the filedesc lock we don't need to
check if descriptor changed from under us. Replace the check with an
assert.

Discussed with:	kib
Tested by:	pho
MFC after:	1 month
2012-06-11 19:48:55 +00:00
Mitsuru IWASAKI
c1b0dc80b5 Another fixe for r236772.
- Adjust correct cpuset (stopped_cpus/suspended_cpus) for
  cpu_spinwait() in generic_stop_cpus().
2012-06-11 18:47:26 +00:00
Pawel Jakub Dawidek
f3cd980557 Style fixes and simplifications.
MFC after:	1 month
2012-06-11 16:08:03 +00:00
Pawel Jakub Dawidek
effb6326a1 Remove redundant include.
MFC after:	1 month
2012-06-10 20:24:01 +00:00
Pawel Jakub Dawidek
297f11037f Style: move opt_*.h includes in the proper place.
MFC after:	1 month
2012-06-10 20:22:10 +00:00
Pawel Jakub Dawidek
69d7614850 When we are closing capability during dup2(), we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Discussed with:	rwatson
Sponsored by:	FreeBSD Foundation
MFC after:	1 month
2012-06-10 14:57:18 +00:00
Pawel Jakub Dawidek
1b693d7494 Merge two ifs into one. Other minor style fixes.
MFC after:	1 month
2012-06-10 13:10:21 +00:00
Pawel Jakub Dawidek
8849ae7256 Simplify fdtofp().
MFC after:	1 month
2012-06-10 06:31:54 +00:00
Kirk McKusick
75c898f2a4 When synchronously syncing a device (MNT_WAIT), wait for buffers
to become available. Otherwise we may excessively spin and fail
with ``fsync: giving up on dirty''.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   1 week
2012-06-09 22:26:53 +00:00
Pawel Jakub Dawidek
e59a97362d There is no need to drop the FILEDESC lock around malloc(M_WAITOK) anymore, as
we now use sx lock for filedesc structure protection.

Reviewed by:	kib
MFC after:	1 month
2012-06-09 18:50:32 +00:00
Pawel Jakub Dawidek
68abac4337 Remove now unused variable.
MFC after:	1 month
MFC with:	r236820
2012-06-09 18:48:06 +00:00
Pawel Jakub Dawidek
380513aaae Make some of the loops more readable.
Reviewed by:	tegge
MFC after:	1 month
2012-06-09 18:03:23 +00:00
Pawel Jakub Dawidek
5d02ed91e9 Correct panic message.
MFC after:	1 month
MFC with:	r236731
2012-06-09 12:27:30 +00:00
Mitsuru IWASAKI
fb864578af Add x86/acpica/acpi_wakeup.c for amd64 and i386. Difference of
suspend/resume procedures are minimized among them.

common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
  among amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().

amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().

i386:
- Merge(and remove) suspendctx() into savectx() in order to match with
  amd64 code.

Reviewed by:	attilio@, acpi@
2012-06-09 00:37:26 +00:00
John Baldwin
7ac1b61aac Split the second half of vn_open_cred() (after a vnode has been found via
a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.

Tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
2012-06-08 18:32:09 +00:00
Mateusz Guzik
3b5da8d609 Plug socket refcount leak on error in sys_sctp_peeloff.
Reviewed by:	tuexen
Approved by:	trasz (mentor)
MFC after:	3 days
2012-06-08 08:04:51 +00:00
Pawel Jakub Dawidek
bf3e37ef15 In fdalloc() f_ofileflags for the newly allocated descriptor has to be 0.
Assert that instead of setting it to 0.

Sponsored by:	FreeBSD Foundation
MFC after:	1 month
2012-06-07 23:33:10 +00:00
Pawel Jakub Dawidek
d3644b04ca Eliminate redundant variable.
Sponsored by:	FreeBSD Foundation
MFC after:	1 week
2012-06-07 23:08:18 +00:00
Pawel Jakub Dawidek
f6ed2ff79d Plug file reference leak in capability failure case.
Sponsored by:	FreeBSD Foundation
MFC after:	3 days
2012-06-07 22:49:09 +00:00
Gleb Smirnoff
36eeafa0e5 style(9) for r236563. 2012-06-05 05:16:04 +00:00
Gleb Smirnoff
8955d2720f Microoptimisation of code from r236560, also coming from Nginx Inc.
Submitted by:	ru
2012-06-04 14:18:13 +00:00
Gleb Smirnoff
835d890042 Optimise kern_sendfile(): skip cycling through the entire mbuf chain in
m_cat(), storing pointer to last mbuf in chain in local variable and
attaching new mbuf to the end of chain.

Submitter reports that CPU load dropped for > 10% on a web server
serving large files with this optimisation.

Submitted by:	Sergey Budnevitch <sb nginx.com>
2012-06-04 12:49:21 +00:00
Konstantin Belousov
bba080854d Add a knob to disable vn_io_fault.
MFC after:	1 month
2012-06-03 16:19:37 +00:00
Konstantin Belousov
bb2f52a61d Count and export the number of prefaulting happen.
MFC after:	 1 month
2012-06-03 16:06:56 +00:00
Andriy Gapon
7adc598a15 free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG
Those calls are useful with hardware watchdog drivers too.

MFC after:	3 weeks
2012-06-03 08:01:12 +00:00
Konstantin Belousov
d1b07fd498 Fix typo [1]. Use commas to separate flag printouts, in style with
other parts of function.

Submitted by: bf [1]
MFC after:   1 week
2012-06-02 19:39:12 +00:00
Konstantin Belousov
705de7c19e Update the print mask for decoding b_flags. Add print masks for
b_vflags and b_xflags_t and print them as well.

MFC after:   1 week
2012-06-02 18:44:40 +00:00
John Baldwin
b871e6613b Extend VERBOSE_SYSINIT to also print out the name of variables passed
to SYSINIT routines if they can be resolved via symbol look up in DDB.
To avoid false positives, only honor a name if the symbol resolves
exactly to the pointer value (no offset).

MFC after:	1 week
2012-06-01 15:42:37 +00:00
Pawel Jakub Dawidek
5edfa04b94 Regenerate after r236361.
MFC after:	3 days
2012-05-31 19:34:53 +00:00
Pawel Jakub Dawidek
6ba7e8178a Add missing system calls.
MFC after:	3 days
2012-05-31 19:32:37 +00:00
Pawel Jakub Dawidek
243f67938e There is no rmdirat system call. Weird, I know.
MFC after:	3 days
2012-05-31 19:31:28 +00:00
Warner Losh
a241707e7a Unlock in the error path to prevent a lock leak.
PR:		162174
Submitted by:	Ian Lepore
MFC after:	2 weeks
2012-05-31 17:27:05 +00:00
Konstantin Belousov
41014d996a vn_io_fault() is a facility to prevent page faults while filesystems
perform copyin/copyout of the file data into the usermode
buffer. Typical filesystem hold vnode lock and some buffer locks over
the VOP_READ() and VOP_WRITE() operations, and since page fault
handler may need to recurse into VFS to get the page content, a
deadlock is possible.

The facility works by disabling page faults handling for the current
thread and attempting to execute i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successfull and request is finished. If EFAULT
is returned from uiomove(), the pages backing i/o buffer are faulted
in and held, and the copyin/out is performed using uiomove_fromphys()
over the held pages for the second attempt of VOP call.

Since pages are hold in chunks to prevent large i/o requests from
starving free pages pool, and since vnode lock is only taken for
i/o over the current chunk, the vnode lock no longer protect atomicity
of the whole i/o request. Use newly added rangelocks to provide the
required atomicity of i/o regardind other i/o and truncations.

Filesystems need to explicitely opt-in into the scheme, by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using
vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or
converting uio into request for uiomove_fromphys().

Reviewed by:	bf (comments), mdf, pjd (previous version)
Tested by:	pho
Tested by:	flo, Gustau P?rez <gperez entel upc edu> (previous version)
MFC after:	2 months
2012-05-30 16:42:08 +00:00
Konstantin Belousov
8f0e91308a Add a rangelock implementation, intended to be used to range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after:     2 month
2012-05-30 16:06:38 +00:00
Konstantin Belousov
6c5d7af158 Assert that TDP_NOFAULTING and TDP_NOSPEEPING thread flags do not leak
when thread returns from a syscall to usermode.

Tested by:	pho
MFC after:	1 week
2012-05-30 13:44:42 +00:00
Rafal Jaworowski
17f4cae4a5 Let us manage differences of Book-E PowerPC variations i.e. vendor /
implementation specific vs. the common architecture definition.

Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under
BOOKE_PPC4XX are not used in the code yet.

This change set is not supposed to affect existing E500 support, it's just
another reorg step before bringing support for E500mc, E5500 and PPC465.

Obtained from:	AppliedMicro, Freescale, Semihalf
2012-05-27 10:25:20 +00:00
Konstantin Belousov
371778a333 Fix ki_cow for compat32 binaries.
MFC after:	3 days
2012-05-27 05:24:53 +00:00
Konstantin Belousov
9768156746 Stop treating td_sigmask specially for the purposes of new thread
creation. Move it into the copied region of the struct thread.

Update some comments.

Requested by:	bde
X-MFC after:	never
2012-05-26 20:03:47 +00:00
Konstantin Belousov
292520f710 Add a vn_bmap_seekhole(9) vnode helper which can be used by any
filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA
commands for lseek(2).

MFC after:	2 weeks
2012-05-26 05:28:47 +00:00
Ed Schouten
4412ad4887 Regenerate system call tables. 2012-05-25 21:52:57 +00:00
Ed Schouten
520b6a84f6 Remove use of non-ISO-C integer types from system call tables.
These files already use ISO-C-style integer types, so make them less
inconsistent by preferring the standard types.
2012-05-25 21:50:48 +00:00
Andriy Gapon
fa44da0995 device_add_child: protect against child device with no driver but fixed unit number
This combination doesn't make sense, unit numbers should be hardwired
only in context of a known driver.  The wildcard devices should have
wildcard unit numbers.

Reviewed by:	jhb
MFC after:	2 weeks
2012-05-25 07:32:26 +00:00
Alexander Motin
20654f4ef4 MFprojects/zfsd:
Hide warning behind bootverbose. Average user has nothing to do about it.
2012-05-24 11:24:44 +00:00
Gleb Kurtsou
76dcec5d09 Add kern_fhstat(), adjust sys_fhstat() to use it.
Extend kern_getdirentries() to accept uio segflag and optionally return
buffer residue.

Sponsored by:	Google Summer of Code 2011
2012-05-24 08:00:26 +00:00
Konstantin Belousov
4d34e019c4 Calculate the count of per-process cow faults. Export the count to
userspace using the obscure spare int field in struct kinfo_proc.

Submitted by:	Andrey Zonov <andrey zonov org>
MFC after:	1 week
2012-05-23 18:10:54 +00:00
Edward Tomasz Napierala
1fb2497499 Fix use-after-free in kern_jail_set() triggered e.g. by attempts
to clear "persist" flag from empty persistent jail, like this:

jail -c persist=1
jail -n 1 -m persist=0

Submitted by:	Mateusz Guzik <mjguzik at gmail dot com>
MFC after:	2 weeks
2012-05-22 19:43:20 +00:00
Edward Tomasz Napierala
e30345e790 Don't leak locks in prison_racct_modify().
Submitted by:	Mateusz Guzik <mjguzik at gmail dot com>
MFC after:	2 weeks
2012-05-22 17:30:02 +00:00
Edward Tomasz Napierala
ab27d5d88a Fix panic with RACCT that could occur in low memory (or out of swap)
situations, due to fork1() calling racct_proc_exit() without calling
racct_proc_fork() first.

Submitted by:	Mateusz Guzik <mjguzik at gmail dot com> (earlier version)
Reviewed by:	Mateusz Guzik <mjguzik at gmail dot com>
2012-05-22 15:58:27 +00:00
Hartmut Brandt
ac6e25ec7d Make dumptid non-static. It is used by libkvm to detect whether
this is a VNET-kernel or not. gcc used to put the static symbol into
the symbol table, clang does not. This fixes the 'netstat: no namelist'
error seen on clang+VNET systems.
2012-05-22 07:23:41 +00:00
Alexander V. Chernikov
afa85850e7 Fix old panic when BPF consumer attaches to destroying interface.
'flags' field is added to the end of bpf_if structure. Currently the only
flag is BPFIF_FLAG_DYING which is set on bpf detach and checked by bpf_attachd()
Problem can be easily triggered on SMP stable/[89] by the following command (sort of):
'while true; do ifconfig vlan222 create vlan 222 vlandev em0 up ; tcpdump -pi vlan222 & ; ifconfig vlan222 destroy ; done'

Fix possible use-after-free when BPF detaches itself from interface, freeing bpf_bif memory,
while interface is still UP and there can be routes via this interface.
Freeing is now delayed till ifnet_departure_event is received via eventhandler(9) api.

Convert bpfd rwlock back to mutex due lack of performance gain (currently checking if packet
matches filter is done without holding bpfd lock and we have to acquire write lock if packet matches)

Approved by:      kib(mentor)
MFC in:            4 weeks
2012-05-21 22:17:29 +00:00
Mitsuru IWASAKI
e3fd0bc1b2 Add SMP/i386 suspend/resume support.
Most part is merged from amd64.

- i386/acpica/acpi_wakecode.S
Replaced with amd64 code (from realmode to paging enabling code).

- i386/acpica/acpi_wakeup.c
Replaced with amd64 code (except for wakeup_pagetables stuff).

- i386/include/pcb.h
- i386/i386/genassym.c
Added PCB new members (CR0, CR2, CR4, DS, ED, FS, SS, GDT, IDT, LDT
and TR) needed for suspend/resume, not for context switch.

- i386/i386/swtch.s
Added suspendctx() and resumectx().
Note that savectx() was not changed and used for suspending (while
amd64 code uses it).
BSP and AP execute the same sequence, suspendctx(), acpi_wakecode()
and resumectx() for suspend/resume (in case of UP system also).

- i386/i386/apic_vector.s
Added cpususpend().

- i386/i386/mp_machdep.c
- i386/include/smp.h
Added cpususpend_handler().

- i386/include/apicvar.h
- kern/subr_smp.c
- sys/smp.h
Added IPI_SUSPEND and suspend_cpus().

- i386/i386/initcpu.c
- i386/i386/machdep.c
- i386/include/md_var.h
- pc98/pc98/machdep.c
Moved initializecpu() declarations to md_var.h.

MFC after:	3 days
2012-05-18 18:55:58 +00:00
Gleb Kurtsou
ac13a90c4b Skip directory entries with zero inode number during traversal.
Entries with zero inode number are considered placeholders by libc and
UFS.  Fix remaining uses of VOP_READDIR in kernel: vop_stdvptocnp,
unionfs.

Sponsored by:	Google Summer of Code 2011
2012-05-16 10:44:09 +00:00
Sergey Kandaurov
2aaae99d96 Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP. 2012-05-15 10:58:17 +00:00
Grzegorz Bernacki
823c83e842 Do not call bremfree for managed buffers.
Calling bremfree for these buffers results in panic:
"bremfree: buffer %p not on a queue."

Approved by: kib
2012-05-15 09:55:15 +00:00
Ryan Stone
b3e9e682cf Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives.  Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
serveral probes are defined that will never fire.  These probes are defined
to fire when Solaris-specific features perform certain actions.  As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added a two probes that are not defined in Solaris, lend-pri
and load-change.  These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument.  As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after:	2 weeks
2012-05-15 01:30:25 +00:00
Xin LI
9aa97da69e Revert previous revision, misunderstood the code :( 2012-05-11 23:43:32 +00:00
Xin LI
259e101831 Release proc lock after setting signal queue.
PR:		kern/167727
Submitted by:	Jinjun Gao <gjinjun gmail com>
MFC after:	2 weeks
2012-05-11 23:41:52 +00:00
Konstantin Belousov
6098e7acff Move the code to call the callout callback into the helper function
softclock_call_cc(). While there, move some common code to callout_cc_del().

Requested by:	avg, jhb
Reviewed by:	jhb
MFC after:    1 week
2012-05-03 20:00:30 +00:00
Konstantin Belousov
57d07ca9f0 When callout_reset_on() cannot immediately migrate a callout since it
is running on other cpu, the CALLOUT_PENDING flag is temporarily
cleared. Then, callout_stop() on this, in fact active, callout fails
because CALLOUT_PENDING is not set, and callout_stop() returns 0.

Now, in sleepq_check_timeout(), the failed callout_stop() causes the
sleepq code to execute mi_switch() without even setting the wmesg,
since the switch-out is supposed to be transient. In fact, the thread
is put off the CPU for full timeout interval, instead of being put on
runq immediately.  Until timeout fires, the process is unkillable for
obvious reasons.

Fix this by marking the migrating callouts with CALLOUT_DFRMIGRATION
flag. The flag is cleared by callout_stop_safe() when the function
detects a migration, besides returning the success. The softclock()
rechecks the flag for migrating callout and cancels its execution if
the flag was cleared meantime.

PR:	 misc/166340
Reported, debugging traces provided and tested by:
	Christian Esken <christian.esken trivago com>
Reviewed by:	 avg, jhb
MFC after:	 1 week
2012-05-03 10:38:02 +00:00
John Baldwin
b8cb2346fc - Don't log messages saying that accounting is being disabled and enabled
if the accounting log file is atomically replaced with a new file
  (such as during log rotation).
- Simplify accounting log rotation a bit.  There is no need to re-run
  accton(8) after renaming the new log file to it's real name.

PR:		kern/167321
Tested by:	Jeremy Chadwick
2012-05-02 14:25:39 +00:00
Konstantin Belousov
b3bfb267cb Allow for the process information sysctls to accept a thread id in addition
to the process id.  It follows the ptrace(2) interface and allows debugging
libraries to use thread ids directly, without slow and verbose conversion
of thread id into pid.

The PGET_NOTID flag is provided to allow a specific sysctl to disallow
this behaviour.  All current callers of pget(9) have useful semantic to
operate on tid and do not need this flag.

Reviewed by:	jhb, trocini
MFC after:	1 week
2012-04-23 20:56:05 +00:00
Edward Tomasz Napierala
af6e6b87ad Remove unused thread argument to vrecycle().
Reviewed by:	kib
2012-04-23 14:10:34 +00:00
Edward Tomasz Napierala
c52fd858ae Remove unused thread argument from vtruncbuf().
Reviewed by:	kib
2012-04-23 13:21:28 +00:00
John Baldwin
88bf5036fc Include the associated wait channel message for context switch ktrace
records.  kdump supports both the old and new messages.

Submitted by:	Andrey Zonov  andrey zonov org
MFC after:	1 week
2012-04-20 15:32:36 +00:00
Jaakko Heinonen
dd952f80bc The value of flags matching VNOVAL can't be supported. Return EOPNOTSUPP
from setfflags() in this case. This fixes the return value of
chflags(path, -1).

Discussed with:	bde
MFC after:	2 weeks
2012-04-20 10:08:30 +00:00
Kirk McKusick
dca5e0ec50 This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 07:00:28 +00:00
Kirk McKusick
f257ebbb2e This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 06:50:44 +00:00
Kirk McKusick
16165feec4 Delete a no longer useful VNASSERT missed during changes in 234400.
Suggested by: kib
2012-04-18 19:34:20 +00:00
Kirk McKusick
60005d66ab Fix a memory leak of M_VNODE_MARKER introduced in 234386.
Found by:  Peter Holm
2012-04-18 19:30:22 +00:00
Kirk McKusick
73305eb826 Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after:      2 weeks
2012-04-17 21:46:59 +00:00
Kirk McKusick
71469bb38f Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-17 16:28:22 +00:00
Edward Tomasz Napierala
9e21ef395a Fix bug where NFSv4 ACL enforcement code wouldn't unconditionally
allow the owner to read and write ACL and file attributes when there
was no entry with subject matching the owner.  In other words,
'getfacl meh' shouldn't fail for the owner if the ACL looks like this:

# file: meh
# owner: trasz
# group: wheel
         user:root:------a-------:------:allow

Reported by:	kientzle
2012-04-17 14:54:00 +00:00
Edward Tomasz Napierala
0b18eb6d74 Stop treating system processes as special. This fixes panics
like the one triggered by this:

# kldload geom_vinum
# pwait `pgrep -S gv_worker` &
# kldunload geom_vinum

or this:

GEOM_JOURNAL: Shutting down geom gjournal 3464572051.
panic: destroying non-empty racct: 1 allocated for resource 6

which were tracked by jh@ to be caused by checking p->p_flag,
while it wasn't initialised yet.  Basically, during fork, the code
checked p_flag, concluded the process isn't marked as P_SYSTEM,
incremented the counter, and later on, when exiting, checked that
the process was marked as P_SYSTEM, and thus didn't decrement it.

Also, I believe there wasn't any good reason for checking P_SYSTEM
in the first place.

Tested by:	jh
2012-04-17 14:31:02 +00:00
Edward Tomasz Napierala
47f6635cc1 Fix panic, triggered like this: "int main() { thr_exit(); }"
Submitted by:	Mateusz Guzik
2012-04-17 13:44:40 +00:00
Edward Tomasz Napierala
786813aa1f Enforce upper bound on the input buffer length.
Reported by:	Mateusz Guzik
2012-04-17 13:28:14 +00:00
Jung-uk Kim
d69a426fce - Implement pipe2 syscall for Linuxulator. This syscall appeared in 2.6.27
but GNU libc used it without checking its kernel version, e. g., Fedora 10.
- Move pipe(2) implementation for Linuxulator from MD files to MI file,
sys/compat/linux/linux_file.c.  There is no MD code for this syscall at all.
- Correct an argument type for pipe() from l_ulong * to l_int *.  Probably
this was the source of MI/MD confusion.

Reviewed by:	emulation
2012-04-16 21:22:02 +00:00
Davide Italiano
99006d44f8 Fix a typo.
Approved by:	gnn (mentor)
MFC after:	2 days
2012-04-14 23:59:58 +00:00
Davide Italiano
331805a5d3 Fix some style bugs introduced in a previous commit (r233045)
Reported by:	glebius, jmallet
Reviewed by:	jmallet
Approved by:	gnn (mentor)
MFC after:	2 days
2012-04-14 23:53:31 +00:00
Marius Strobl
91849f349c Fix !DDB build after r234190. 2012-04-14 11:21:24 +00:00
Adrian Chadd
676c1784cb Use strdup() on the name (and free it when it's done) so non-static names
can be used in firmware_register().
2012-04-13 04:22:42 +00:00
John Baldwin
0cc457b000 - Extend the KDB interface to add a per-debugger callback to print a
backtrace for an arbitrary thread (rather than the calling thread).
  A kdb_backtrace_thread() wrapper function uses the configured debugger
  if possible, otherwise it falls back to using stack(9) if that is
  available.
- Replace a direct call to db_trace_thread() in propagate_priority()
  with a call to kdb_backtrace_thread() instead.

MFC after:	1 week
2012-04-12 17:43:59 +00:00
John Baldwin
7582954e34 If a linker file contains at least one module, but all of the modules
fail to load (the MOD_LOAD event fails) during a kldload(2), unload the
linker file and fail the kldload(2) with ENOEXEC.

Reported by:	gcooper
MFC after:	1 week
2012-04-12 14:49:25 +00:00
Konstantin Belousov
2dd9ea6f70 Add thread-private flag to indicate that error value is already placed
in td_errno. Flag is supposed to be used by syscalls returning
EJUSTRETURN because errno was already placed into the usermode frame
by a call to set_syscall_retval(9). Both ktrace and dtrace get errno
value from td_errno if the flag is set.

Use the flag to fix sigsuspend(2) error return ktrace records.

Requested by:	bde
MFC after:	1 week
2012-04-12 10:48:43 +00:00
Kirk McKusick
ecb6e528c5 Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after:   2 weeks
2012-04-11 23:01:11 +00:00
John Baldwin
77b479e644 Allow device_busy() and device_unbusy() to be invoked while a device is
being attached.  This is implemented by adding a new DS_ATTACHING state
while a device's DEVICE_ATTACH() method is being invoked.  A driver is
required to not fail an attach of a busy device.  The device's state will
be promoted to DS_BUSY rather than DS_ACTIVE() if the device was marked
busy during DEVICE_ATTACH().

Reviewed by:	kib
MFC after:	1 week
2012-04-11 20:57:41 +00:00
Eitan Adler
847d0034e3 Return EBADF instead of EMFILE from dup2 when the second argument is
outside the range of valid file descriptors

PR:		kern/164970
Submitted by:	Peter Jeremy <peterjeremy@acm.org>
Reviewed by:	jilles
Approved by:	cperciva
MFC after:	1 week
2012-04-11 14:08:09 +00:00
Jilles Tjoelker
8a8be77610 Remove unused and wrong SA_PROC internal signal property.
The SA_PROC signal property indicated whether each signal number is directed
at a specific thread or at the process in general. However, that depends on
how the signal was generated and not on the signal number. SA_PROC was not
used.
2012-04-09 21:58:58 +00:00
Alexander Motin
70801abe8f Microoptimize cpu_search().
According to profiling, it makes one take 6% of CPU time on hackbench
with its million of context switches per second, instead of 8% before.
2012-04-09 18:24:58 +00:00
Gleb Kurtsou
0ff93c48da Add vfs_getopt_size. Support human readable file system options in tmpfs.
Increase maximum tmpfs file system size to 4GB*PAGE_SIZE on 32 bit archs.

Discussed with:	delphij
MFC after:	2 weeks
2012-04-07 15:27:34 +00:00
Alexander V. Chernikov
e4b3229aa5 - Improve BPF locking model.
Interface locks and descriptor locks are converted from mutex(9) to rwlock(9).
This greately improves performance: in most common case we need to acquire 1
reader lock instead of 2 mutexes.

- Remove filter(descriptor) (reader) lock in bpf_mtap[2]
This was suggested by glebius@. We protect filter by requesting interface
writer lock on filter change.

- Cover struct bpf_if under BPF_INTERNAL define. This permits including bpf.h
without including rwlock stuff. However, this is is temporary solution,
struct bpf_if should be made opaque for any external caller.

Found by:       Dmitrij Tejblum <tejblum@yandex-team.ru>
Sponsored by:   Yandex LLC

Reviewed by:    glebius (previous version)
Reviewed by:    silence on -net@
Approved by:    (mentor)

MFC after:      3 weeks
2012-04-06 06:53:58 +00:00
John Baldwin
35818d2e94 Add new ktrace records for the start and end of VM faults. This gives
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take.  The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.

Reviewed by:	kib
MFC after:	2 weeks
2012-04-05 17:13:14 +00:00
David Xu
8931e524bf In sem_post, the field _has_waiters is no longer used, because some
application destroys semaphore after sem_wait returns. Just enter
kernel to wake up sleeping threads, only update _has_waiters if
it is safe. While here, check if the value exceed SEM_VALUE_MAX and
return EOVERFLOW if this is true.
2012-04-05 03:05:02 +00:00
David Xu
17ce606321 umtx operation UMTX_OP_MUTEX_WAKE has a side-effect that it accesses
a mutex after a thread has unlocked it, it event writes data to the mutex
memory to clear contention bit, there is a race that other threads
can lock it and unlock it, then destroy it, so it should not write
data to the mutex memory if there isn't any waiter.
The new operation UMTX_OP_MUTEX_WAKE2 try to fix the problem. It
requires thread library to clear the lock word entirely, then
call the WAKE2 operation to check if there is any waiter in kernel,
and try to wake up a thread, if necessary, the contention bit is set again
by the operation. This also mitgates the chance that other threads find
the contention bit and try to enter kernel to compete with each other
to wake up sleeping thread, this is unnecessary. With this change, the
mutex owner is no longer holding the mutex until it reaches a point
where kernel umtx queue is locked, it releases the mutex as soon as
possible.
Performance is improved when the mutex is contensted heavily.  On Intel
i3-2310M, the runtime of a benchmark program is reduced from 26.87 seconds
to 2.39 seconds, it even is better than UMTX_OP_MUTEX_WAKE which is
deprecated now. http://people.freebsd.org/~davidxu/bench/mutex_perf.c
2012-04-05 02:24:08 +00:00
Navdeep Parhar
60a305887a - Remove redundant call to pr_ctloutput from code that handles SO_SETFIB.
- Add a check for errors during copyin while here.

Reviewed by:	julian, bz
MFC after:	2 weeks
2012-04-03 18:38:00 +00:00
Konstantin Belousov
5085ecb75a When process exists, not only the children shall be reparented to
init, but also the orphans shall be removed from the orphan list,
because the list header is destroyed.

Reported and tested by:	pho
MFC after:	3 days
2012-04-02 19:35:36 +00:00
Konstantin Belousov
2e39e24f64 Add helper function to remove the process from the orphans list and
use it instead of inlined code.

Tested by:	pho
MFC after:	3 days
2012-04-02 19:34:56 +00:00
John Baldwin
e506e182dd Export some more useful info about shared memory objects to userland
via procstat(1) and fstat(1):
- Change shm file descriptors to track the pathname they are associated
  with and add a shm_path() method to copy the path out to a caller-supplied
  buffer.
- Use the fo_stat() method of shared memory objects and shm_path() to
  export the path, mode, and size of a shared memory object via
  struct kinfo_file.
- Add a struct shmstat to the libprocstat(3) interface along with a
  procstat_get_shm_info() to export the mode and size of a shared memory
  object.
- Change procstat to always print out the path for a given object if it
  is valid.
- Teach fstat about shared memory objects and to display their path,
  mode, and size.

MFC after:	2 weeks
2012-04-01 18:22:48 +00:00
David Xu
8b1eafa723 Remove stale comments. 2012-03-31 06:48:41 +00:00
David Xu
b29d7d9b60 Remove trailing semicolon, it is a typo. 2012-03-30 12:57:14 +00:00
David Xu
0cf573e989 Fix COMPAT_FREEBSD32 build.
Submitted by: Andreas Tobler < andreast at fgznet dot ch >
2012-03-30 09:03:53 +00:00
David Xu
4ed8858df0 Remove trailing space. 2012-03-30 05:49:32 +00:00
David Xu
e05171d939 Merge umtxq_sleep and umtxq_nanosleep into a single function by using
an abs_timeout structure which describes timeout info.
2012-03-30 05:40:26 +00:00
David Xu
d31f470d15 Reduce code size by creating common timed sleeping function. 2012-03-29 02:46:43 +00:00
Fabien Thomas
f5f9340b98 Add software PMC support.
New kernel events can be added at various location for sampling or counting.
This will for example allow easy system profiling whatever the processor is
with known tools like pmcstat(8).

Simultaneous usage of software PMC and hardware PMC is possible, for example
looking at the lock acquire failure, page fault while sampling on
instructions.

Sponsored by: NETASQ
MFC after:	1 month
2012-03-28 20:58:30 +00:00
Ryan Stone
9742410797 Instead of only iterating over the set of known SDT probes when sdt.ko is
loaded and unloaded, also have sdt.ko register callbacks with kern_sdt.c
that will be called when a newly loaded KLD module adds more probes or
a module with probes is unloaded.

This fixes two issues: first, if a module with SDT probes was loaded after
sdt.ko was loaded, those new probes would not be available in DTrace.
Second, if a module with SDT probes was unloaded while sdt.ko was loaded,
the kernel would panic the next time DTrace had cause to try and do
anything with the no-longer-existent probes.

This makes it possible to create SDT probes in KLD modules, although there
are still two caveats: first, any SDT probes in a KLD module must be part
of a DTrace provider that is defined in that module.  At present DTrace
only destroys probes when the provider is destroyed, so you can still
panic the system if a KLD module creates new probes in a provider from a
different module(including the kernel) and then unload the the first module.

Second, the system will panic if you unload a module containing SDT probes
while there is an active D script that has enabled those probes.

MFC after:	1 month
2012-03-27 15:07:43 +00:00
Alexander V. Chernikov
b25711e6b0 - Add knlist_init_rw_reader() function to kqueue(9).
Function acquired reader lock if needed.
Assert check for reader or writer lock (RA_LOCKED / RA_UNLOCKED)
- While here, add knlist_init_mtx.9 to MLINKS and fix some style(9) issues

Reviewed by:    glebius
Approved by:    ae(mentor)

MFC after:      2 weeks
2012-03-26 09:34:17 +00:00
Mikolaj Golub
903712c99c Add a sysctl to set and retrieve binary osreldate of another process.
Suggested by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2012-03-23 20:05:41 +00:00
Andrey V. Elsukov
5b0da85a41 Correct debug message. 2012-03-22 09:29:07 +00:00
Alan Cox
5730afc9b6 Handle spurious page faults that may occur in no-fault sections of the
kernel.

When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB.  In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB.  This is exactly as
recommended in AMD's documentation.  In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry.  In short, this saves us from
having to perform potentially costly TLB flushes.  In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry.  Usually, such spurious page faults are handled
by vm_fault() without incident.  However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault().  This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.

In collaboration with:	kib
Tested by:		gibbs (an earlier version)

I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.

MFC after:	1 week
2012-03-22 04:52:51 +00:00
Andrey V. Elsukov
c5e7f0649a Acquire modules lock before call module_getname() in the KLD_DEBUG case.
MFC after:	1 week
2012-03-21 09:48:32 +00:00
Eitan Adler
24c10828e4 - Clean up timestamps in msgbuf code. The timestamps should now be
inserted after the priority token thus cleaning up the output.
- Remove the needless double internal do_add_char function.
- Resolve a possible deadlock if interrupts are
    disabled and getnanotime is called

Reviewed by:	bde  kmacy, avg, sbruno (various versions)
Approved by:	cperciva
MFC after:	2 weeks
2012-03-19 00:36:32 +00:00
Jaakko Heinonen
59f513cd09 Cast wallclock.tv_sec to uint64_t to avoid overflow in the calculation.
PR:		kern/161552
Reviewed by:	trasz
Tested by:	Nikos Vassiliadis
MFC after:	1 week
2012-03-18 19:13:32 +00:00
Davide Italiano
c6111de55d Add rudimentary profiling of the hash table used in the in the umtx code to
hold active lock queues.

Reviewed by:	attilio
Approved by:	davidxu, gnn (mentor)
MFC after:	3 weeks
2012-03-16 20:32:11 +00:00
Michael Tuexen
99f293a20e Fix bugs which can result in a panic when an non-SCTP socket it
used with an sctp_ system-call which expects an SCTP socket.

MFC after: 3 days.
2012-03-15 14:13:38 +00:00
Andrey V. Elsukov
b26a09848a Add CTLFLAG_TUN to the sysctl definition and fix style.
Pointed by:	Garrett Cooper
MFC after:	2 weeks
2012-03-15 06:01:21 +00:00
Andrey V. Elsukov
199aa9756b Add debug.kld_debug loader tunable.
MFC after:	2 weeks
2012-03-15 05:11:29 +00:00
Jaakko Heinonen
db62ced238 Add an assert for proctree_lock to proc_to_reap().
Discussed with:	kib
MFC after:	1 week
2012-03-14 15:52:23 +00:00
Konstantin Belousov
7335ed90a0 Lock the process around manipulations with p_flag.
Reported and reviewed by:	jh
MFC after:	3 days
2012-03-13 22:00:46 +00:00
Adrian Chadd
a9a282f672 Add module load/unload stubs. 2012-03-13 20:27:48 +00:00
Alexander Motin
fd053fae73 Add kern.eventtimer.activetick tunable/sysctl, specifying whether each
hardclock() tick should be run on every active CPU, or on only one.

On my tests, avoiding extra interrupts because of this on 8-CPU Core i7
system with HZ=10000 saves about 2% of performance. At this moment option
implemented only for global timers, as reprogramming per-CPU timers is
too expensive now to be compensated by this benefit, especially since we
still have to regularly run hardclock() on at least one active CPU to
update system uptime. For global timer it is quite trivial: timer runs
always, but we just skip IPIs to other CPUs when possible.

Option is enabled by default now, keeping previous behavior, as periodic
hardclock() calls are still used at least to implement setitimer(2) with
ITIMER_VIRTUAL and ITIMER_PROF arguments. But since default schedulers don't
depend on it since r232917, we are much more free to experiment with it.

MFC after:	1 month
2012-03-13 10:21:08 +00:00
Alexander Motin
7295465e33 Rewrite thread CPU usage percentage math to not depend on periodic calls
with HZ rate through the sched_tick() calls from hardclock().

Potentially it can be used to improve precision, but now it is just minus
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS existing in patches out of the tree depends on it.

MFC after:	1 month
2012-03-13 08:18:54 +00:00
Peter Holm
62a9fc76df Allways call fdrop(). 2012-03-12 11:56:57 +00:00
Konstantin Belousov
1a9c7dec1f ELF image can have several PT_NOTE program headers. Look for the ELF
brand note in each header, instead of using only first one.

Reviewed by:	kan
Tested by:	andrew (arm), flo (sparc64)
MFC after:	3 weeks
2012-03-11 19:38:49 +00:00
Konstantin Belousov
b80dcb55aa Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by:	gianni
2012-03-11 12:19:58 +00:00
Alexander Motin
5f3818a56e Revert r175376 and tune cpufreq(4) frequency comparison logic instead.
Instead of using 25MHz equality threshold, look for the nearest value when
handling dev.cpu.0.freq sysctl and for exact match when it is expected.

ACPI may report extra level with frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.

I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default and without this change it was absolutely invisible and hard
to control.

MFC after:	2 weeks
2012-03-10 18:56:16 +00:00
Alexander Motin
bcfd016cff Idle ticks optimization:
- Pass number of events to the statclock() and profclock() functions
   same as to hardclock() before to not call them many times in a loop.
 - Rename them into statclock_cnt() and profclock_cnt().
 - Turn statclock() and profclock() into compatibility wrappers,
   still needed for arm.
 - Rename hardclock_anycpu() into hardclock_cnt() for unification.

MFC after:	1 week
2012-03-10 14:57:21 +00:00
Edward Tomasz Napierala
0a53cd5742 Remove useless thread_{lock,unlock}() in raccd. 2012-03-10 14:38:49 +00:00
Juli Mallett
85729c2c44 Export intrcnt correctly when running under 32-bit compatibility.
Reviewed by:	gonzo, nwhitehorn
2012-03-09 22:30:54 +00:00
Peter Holm
39e77c4c50 Perform the parameter validation before assigning it to a signed int
variable. This fixes the problem seen with readdir(3) fuzzing.

Submitted by:	bde
MFC after:	1 week
2012-03-09 21:31:12 +00:00
Alexander Motin
b3f40a4107 Make kern.sched.idlespinthresh default value adaptive depending of HZ.
Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.
2012-03-09 19:09:08 +00:00
Alexander Motin
55c71d634f Be more polite when setting state->nextevent inside cpu_new_callout().
Hardclock is not the only who wakes idle CPU since kdtrace cyclic addition.

MFC after:	2 weeks
2012-03-09 07:30:48 +00:00
Konstantin Belousov
38ddb5725b Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after:	1 week
2012-03-09 00:12:05 +00:00
Peter Holm
ffae9d4d7c Free up allocated memory used by posix_fadvise(2). 2012-03-08 20:34:13 +00:00
John Baldwin
b47f624183 Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
John Baldwin
44ad547522 Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after:	2 weeks
2012-03-08 19:41:05 +00:00
Konstantin Belousov
f950879e16 The pipe_poll() performs lockless access to the vnode to test
fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed
by fifo_cleanup().

Precalculate EOF at the places were fo_wgen is changed, and cache the
state in a new pipe state flag PIPE_SAMEWGEN.

Reported and tested by:	bf
Submitted by:	gianni
MFC after:	1 week (a backport)
2012-03-07 07:31:50 +00:00
Edward Tomasz Napierala
c34bbd2ada Make racct and rctl correctly handle jail renaming. Previously
they would continue using old name, the one jail was created with.

PR:		bin/165207
2012-03-06 11:05:50 +00:00
Ivan Voras
2573ea5f76 Print out process name and thread id in the debugging message.
This is useful because the message can end up in system logs in
non-debugging operation.

Reviewed by:	attilio (earlier version)
2012-03-05 14:19:43 +00:00
Konstantin Belousov
e7f19c3d81 pipe_read(): change the type of size to int, and remove signed clamp.
pipe_write(): change the type of desiredsize back to int, its value fits.

Requested by: bde
MFC after:    3 weeks
2012-03-04 15:09:01 +00:00
Konstantin Belousov
8bb9a904d5 Instead of incomplete handling of read(2)/write(2) return values that
does not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianess-unsafe code to split return value
into td_retval.

While there, change the style of the sysctl debug.iosize_max_clamp
definition.

Requested by:	bde
MFC after:	3 weeks
2012-03-04 14:55:37 +00:00
Mikolaj Golub
e0fcf639d2 Make kern.proc.umask sysctl readonly.
Requested by:	src
MFC after:	1 week
2012-03-03 11:53:35 +00:00
Alexander Motin
6022f0bcb3 Fix bug of r232207, when cpu_search() could prefer CPU group with best
load, but with no CPU matching given limitations. It caused kernel panics
in some cases when thread was bound to specific CPUs with cpuset(1).
2012-03-03 11:50:48 +00:00
Juli Mallett
9624d94701 o) Add COMPAT_FREEBSD32 support for MIPS kernels using the n64 ABI with userlands
using the o32 ABI.  This mostly follows nwhitehorn's lead in implementing
   COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
   32-bit ABI being used.  Since the MIPS port is relatively-new, even the 32-bit
   ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
   with 32-bit compatibility, then, disable some code which assumes otherwise
   wrongly when built for MIPS.  A more general macro to check in this case would
   seem like a good idea eventually.  If someone adds support for using n32
   userland with n64 kernels on MIPS, then they will have to add a variety of
   flags related to each piece of the ABI that can vary.  That's probably the
   right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
   freebsd32 compat code.  Probably this should be generalized at some point.

Reviewed by:	gonzo
2012-03-03 08:19:18 +00:00
Rick Macklem
5e99212d36 Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by:	bde
Discussed with:	kib
Reviewed by:	jhb
MFC after:	2 weeks
2012-03-03 01:06:54 +00:00
John Baldwin
831ce4cb3d - Change contigmalloc() to use the vm_paddr_t type instead of an unsigned
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
  constraints.

These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t).  Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.

Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.

Reviewed by:	scottl
2012-03-01 19:58:34 +00:00
Kirk McKusick
35338e6091 This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by:    Jeff Roberson
Tested by:             Peter Holm
MFC (to 9 only) after: 2 weeks
2012-03-01 18:45:25 +00:00
Mikolaj Golub
c7e41c8b50 Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH()
operations for setting and accessing vnode's v_socket field.

The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).

This change fixes the long standing issue with nullfs(5) being in that
unix sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is one can connect to both the lower and
the upper paths regardless what layer path one binds to.

PR:		kern/51583, kern/159663
Suggested by:	kib
Reviewed by:	arch
MFC after:	2 weeks
2012-02-29 21:38:31 +00:00
David Xu
c9b01ed581 initialize clock ID and flags only when copying timespec, a _umtx_time
copy already contains these fields.
2012-02-29 02:01:48 +00:00
Martin Matuska
41c0675e6e Add procfs to jail-mountable filesystems.
Reviewed by:	jamie
MFC after:	1 week
2012-02-29 00:30:18 +00:00
Dimitry Andric
ac382cb7f1 Change definition of pipe_chmod() from K&R to C99, to avoid the
following clang warning:

sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t'
      (aka 'unsigned short') declared in a previous prototype [-Werror]
        mode_t mode;
               ^
sys/kern/sys_pipe.c:155:19: note: previous declaration is here
static fo_chmod_t       pipe_chmod;
                        ^
2012-02-28 21:45:21 +00:00
John Baldwin
0428cbe4df Properly clear a device's devclass if DEVICE_ATTACH() fails if the device
does not have a fixed devclass.

Reviewed by:	imp
MFC after:	2 weeks
2012-02-28 19:16:02 +00:00
Konstantin Belousov
1d7ca9bb8e Currently, the debugger attached to the process executing vfork() does
not get syscall exit notification until the child performed exec of
exit.  Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.

Reported, tested and reviewed by:	Dmitry Mikulin <dmitrym juniper net>
MFC after:	 2 weeks
2012-02-27 21:10:10 +00:00