Commit Graph

11896 Commits

Author SHA1 Message Date
trasz
dd1ffe6ba1 Remove outdated comment and move part of it into more applicable place. 2010-07-18 19:29:12 +00:00
ivoras
56cd1257b0 In keeping with the Age-of-the-fruitbat theme, scale up hirunningspace on
machines which can clearly afford the memory.

This is a somewhat conservative version of the patch - more fine tuning may be
necessary.

Idea from: Thread on hackers@
Discussed with: alc
2010-07-18 10:15:33 +00:00
jhb
96d598c33f Retire td_syscalls now that it is no longer needed. 2010-07-15 20:24:37 +00:00
ivoras
3fb9f87a34 A cosmetic change - don't output empty <flags>. 2010-07-15 13:46:30 +00:00
mav
bd622e7c20 Rename timeevents.c to kern_clocksource.c.
Suggested by:	jhb@
2010-07-14 18:43:27 +00:00
jhb
fb1e0aa66f - Document layout of KTR_STRUCT payload in a comment.
- Simplify ktrstruct() calling convention by having ktrstruct() use
  strlen() rather than requiring the caller to hand-code the length of
  constant strings.

MFC after:	1 month
2010-07-14 17:38:01 +00:00
mav
b8b00841c9 Move timeevents.c to MI code, as it is not x86-specific. I already have
it working on Marvell ARM SoCs, and it would be nice to unify timer code
between more platforms.
2010-07-14 13:31:27 +00:00
cperciva
14d1adbf2c Correctly copy the M_RDONLY flag when duplicating a reference
to an mbuf external buffer.

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-10:07.mbuf
2010-07-13 02:45:17 +00:00
jkim
06b6c2769b Use type-specific inline function imax() instead of deprecated macro MAX().
Prodded by:	bde
2010-07-12 15:32:45 +00:00
alc
db4ca9f5c2 Change the implementation of vm_hold_free_pages() so that it performs at
most one call to pmap_qremove(), and thus one TLB shootdown, instead of one
call and TLB shootdown per page.

Simplify the interface to vm_hold_free_pages().

MFC after:	3 weeks
2010-07-11 20:11:44 +00:00
mav
d760bd51fb Remove interval validation from cpu_tick_calibrate(). As I found, check
was needed at preliminary version of the patch, where number of CPU ticks
was divided strictly on 16 seconds. Final code instead uses real interval
duration, so precise interval should not be important. Same time aliasing
issues around second boundary causes false positives, periodically logging
useless "t_delta ... too long/short" messages when HZ set below 256.
2010-07-11 16:47:45 +00:00
alc
7c09dc242c Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit.  Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.

Reviewed by:	kib
2010-07-09 19:38:30 +00:00
jhb
f338f6d0f8 Accidentally committed an older version of this comment rather than the
final one.
2010-07-09 13:59:53 +00:00
jhb
7e3b216a37 Refine a comment.
Reviewed by:	bde
2010-07-09 13:53:25 +00:00
jh
d171161918 Remove redundant high >= 0.
Reported by:	rstone
2010-07-09 10:57:55 +00:00
jkim
93b88a93da Implement optional 'precision' for numbers. Previously, it was parsed but
ignored.  Some third-party modules (e.g., APCICA) prefer this format over
zero padding flag '0'.
2010-07-08 22:13:23 +00:00
jhb
1f4cf66ed2 - Various style and whitespace fixes.
- Make sugid_coredump and kern_logsigexit private to kern_sig.c.

Submitted by:	bde (partially)
MFC after:	1 month
2010-07-08 19:15:26 +00:00
jh
f673b7098a Assert that low and high are >= 0. The allocator doesn't support the
negative range.
2010-07-08 16:53:19 +00:00
attilio
865de58a04 - Simplify logic in handling ticks wrap-up
- Fix a bug where thread may be in sleeping state but the wchan won't
  be set, leading to an empty container for sleepq_type(). [0]

Sponsored by:		Sandvine Incorporated
[0] Submitted by:	Bryan Venteicher
			<bryanv at daemoninthecloset dot org>
MFC after:		3 days
X-MFC:			209577
2010-07-07 12:00:11 +00:00
kib
15d16124c2 In revoke(), verify that VCHR vnode indeed belongs to devfs.
Found and tested by:	pho
MFC after:	1 week
2010-07-06 18:20:49 +00:00
ed
1075ceb3e2 Fix a race condition, where a TTY could be destroyed twice.
There are special cases where tty_rel_free() can be called twice in a
row, namely when closing and revoking the TTY at the same moment. Only
call destroy_dev_sched_cb() once.

Reported by:	Jeremie Le Hen
MFC after:	1 week
2010-07-06 08:56:34 +00:00
kib
15a394fbba Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by:	alc
MFC after:	2 weeks
2010-07-05 21:13:32 +00:00
jh
b0744cfb8d Extend the kernel unit number allocator for allocating specific unit
numbers. This change adds a new function alloc_unr_specific() which
returns the requested unit number if it is free. If the number is
already allocated or out of the range, -1 is returned.

Update alloc_unr(9) manual page accordingly and add a MLINK for
alloc_unr_specific(9).

Discussed on:	freebsd-hackers
2010-07-05 16:23:55 +00:00
kib
4de7ec3dbb Obey sv_syscallnames bounds in syscallname().
Reported and tested by:	pho
2010-07-04 18:16:17 +00:00
kib
22a31bdc6e Extend ptrace(PT_LWPINFO) to report siginfo for the signal that caused
debugee stop. The change should keep the ABI. Take care of compat32.

Discussed with:	davidxu, jhb
MFC after:	2 weeks
2010-07-04 11:48:30 +00:00
alc
afd002fb75 Use vm_page_next() instead of vm_page_lookup() in exec_map_first_page()
because vm_page_next() is faster.
2010-07-02 15:50:30 +00:00
jhb
de324e256c Move prototypes for kern_sigtimedwait() and kern_sigprocmask() to
<sys/syscallsubr.h> where all other kern_<syscall> prototypes live.
2010-06-30 18:03:42 +00:00
jhb
738cd61a3d Update comment for tdsignal() -> tdsendsignal() rename. Forgot to include
this in 209592.
2010-06-30 18:00:45 +00:00
alc
df23299909 Improve bufdone_finish()'s handling of the bogus page. Specifically, if
one or more mappings to the bogus page must be replaced, call pmap_qenter()
just once.  Previously, pmap_qenter() was called for each mapping to the
bogus page.

MFC after:	3 weeks
2010-06-30 04:52:42 +00:00
jhb
44b49a3eaa Send SIGPIPE to the thread that issued the offending system call
rather than to the entire process.

Reported by:	Anit Chakraborty
Reviewed by:	kib, deischen (concept)
MFC after:	1 week
2010-06-29 20:44:19 +00:00
jhb
df7979cf76 Tweak the in-kernel API for sending signals to threads:
- Rename tdsignal() to tdsendsignal() and make it private to kern_sig.c.
- Add tdsignal() and tdksignal() routines that mirror psignal() and
  pksignal() except that they accept a thread as an argument instead of
  a process.  They send a signal to a specific thread rather than to an
  individual process.

Reviewed by:	kib
2010-06-29 20:41:52 +00:00
dougb
ebed8715b6 If i is going to be used in the loop unconditionally the declaration
has to be unconditional as well.

Conical head covering to:	kib
2010-06-29 01:04:24 +00:00
kib
180cca1c2d Regenerate 2010-06-28 18:17:21 +00:00
kib
2ab2a361d3 Despite system call deregistration drains the threads executing System V
shm syscalls, and initial check for the number of allocated segments
in the module deinitialization code, the following might happen:
   after the check for active segment, while waiting for threads to
   leave some other syscall, shmget(2) is called. Then, we can end
   up with the shared segment that cannot be detached since sysvshm
   module is unloaded.

Prevent the leak by rechecking and disclaiming a reference to the vm
object owned by sysvshm module, that might have grown during the drain.

Tested by:	pho
Reviewed by:	jhb
MFC after:	1 month
2010-06-28 18:12:42 +00:00
kib
b6d8416eac Count number of threads that enter and leave dynamically registered
syscalls. On the dynamic syscall deregistration, wait until all
threads leave the syscall code. This somewhat increases the safety
of the loadable modules unloading.

Reviewed by:	jhb
Tested by:	pho
MFC after:	1 month
2010-06-28 18:06:46 +00:00
attilio
f818dc9368 Fix a lock leak in the deadlock resolver in case the ticks counter
wrapped up.

Sponsored by:	Sandvine Incorporated
Submitted by:	pluknet <pluknet at gmail dot com>
Reported by:	Anton Yuzhaninov <citrin at citrin dot ru>
Reviewed by:	jhb
MFC after:	3 days
2010-06-28 17:45:00 +00:00
jh
0a8e6bb738 Correct a comment typo. 2010-06-27 12:19:09 +00:00
pjd
6ff3cc04b0 Correct arguments order. 2010-06-26 21:44:45 +00:00
tuexen
d27c0f60a0 * Do not dereference a NULL pointer when calling an SCTP send syscall
not providing a destination address and using ktrace.
* Do not copy out kernel memory when providing sinfo for sctp_recvmsg().
Both bug where reported by Valentin Nechayev.
The first bug results in a kernel panic.
MFC after: 3 days.
2010-06-26 19:26:20 +00:00
nwhitehorn
ecf1995ac7 Reverse the logic of the if statement that sets the default value of
HZ; the list of 1000 Hz platforms was getting unwieldy.

Suggested by:	marcel
2010-06-24 00:27:20 +00:00
nwhitehorn
da5a28c706 Move default HZ from 100 to 1000 on powerpc.
Reviewed by:	marcel
MFC after:	2 weeks
2010-06-23 23:26:14 +00:00
kib
6375d4e4db Remove the support for int13 FPU exception reporting on i386. It is
believed that all 486-class CPUs FreeBSD is capable to run on, either
have no FPU and cannot use external coprocessor, or have FPU on the
package and can use #MF.

Reviewed by:	bde
Tested by:	pho (previous version)
2010-06-23 11:12:58 +00:00
mav
a21b0b9d72 "time lock" is no longer a spin-lock since r209371.
Reported by:	kib@
2010-06-21 21:15:51 +00:00
ed
76489ac1ea Use ISO C99 integer types in sys/kern where possible.
There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.
2010-06-21 09:55:56 +00:00
kib
107ec73aad Do not report a stack garbage as the old value for debug.ncores sysctl.
Reported by:	brucec
2010-06-21 09:51:25 +00:00
mav
d1175426d7 Implement new event timers infrastructure. It provides unified APIs for
writing event timer drivers, for choosing best possible drivers by machine
independent code and for operating them to supply kernel with hardclock(),
statclock() and profclock() events in unified fashion on various hardware.

Infrastructure provides support for both per-CPU (independent for every CPU
core) and global timers in periodic and one-shot modes. MI management code
at this moment uses only periodic mode, but one-shot mode use planned for
later, as part of tickless kernel project.

For this moment infrastructure used on i386 and amd64 architectures. Other
archs are welcome to follow, while their current operation should not be
affected.

This patch updates existing drivers (i8254, RTC and LAPIC) for the new
order, and adds event timers support into the HPET driver. These drivers
have different capabilities:
 LAPIC - per-CPU timer, supports periodic and one-shot operation, may
freeze in C3 state, calibrated on first use, so may be not exactly precise.
 HPET - depending on hardware can work as per-CPU or global, supports
periodic and one-shot operation, usually provides several event timers.
 i8254 - global, limited to periodic mode, because same hardware used also
as time counter.
 RTC - global, supports only periodic mode, set of frequencies in Hz
limited by powers of 2.

Depending on hardware capabilities, drivers preferred in following orders,
either LAPIC, HPETs, i8254, RTC or HPETs, LAPIC, i8254, RTC.
User may explicitly specify wanted timers via loader tunables or sysctls:
kern.eventtimer.timer1 and kern.eventtimer.timer2.
If requested driver is unavailable or unoperational, system will try to
replace it. If no more timers available or "NONE" specified for second,
system will operate using only one timer, multiplying it's frequency by few
times and uing respective dividers to honor hz, stathz and profhz values,
set during initial setup.
2010-06-20 21:33:29 +00:00
pjd
b3024a4af9 Backout r207970 for now, it can lead to deadlocks.
Reported by:	kan
MFC after:	3 days
2010-06-17 17:39:51 +00:00
rpaulo
a8c5bafed5 Make DTrace syscall provider work again by including opt_kdtrace.h here. 2010-06-17 17:34:45 +00:00
jh
8a203f841c - Fix compilation of the subr_unit.c user space test program.
- Use %zu for size_t in a few format strings.
2010-06-17 16:12:06 +00:00
avg
9f2d4c3357 lock_profile_release_lock: do not compare unsigned with zero
Found by:	Coverity Prevent
CID:		3660
Reviewed by:	jhb
MFC after:	2 weeks
2010-06-17 10:15:13 +00:00
ed
70171ee94e Remove the unit argument from the recently added make_dev_p().
New code that creates character devices shouldn't use device unit
numbers, but only si_drv[12] to hold pointer to per-device data. Make
this function more future proof by removing the unit number argument.

Discussed with:	kib
2010-06-17 08:49:31 +00:00
jh
1c0174e29a Correct the function name in a KASSERT. 2010-06-16 16:02:17 +00:00
jkim
14f08fd627 Implement flexible BPF timestamping framework.
- Allow setting format, resolution and accuracy of BPF time stamps per
listener.  Previously, we were only able to use microtime(9).  Now we can
set various resolutions and accuracies with ioctl(2) BIOCSTSTAMP command.
Similarly, we can get the current resolution and accuracy with BIOCGTSTAMP
command.  Document all supported options in bpf(4) and their uses.

- Introduce new time stamp 'struct bpf_ts' and header 'struct bpf_xhdr'.
The new time stamp has both 64-bit second and fractional parts.  bpf_xhdr
has this time stamp instead of 'struct timeval' for bh_tstamp.  The new
structures let us use bh_tstamp of same size on both 32-bit and 64-bit
platforms without adding additional shims for 32-bit binaries.  On 64-bit
platforms, size of BPF header does not change compared to bpf_hdr as its
members are already all 64-bit long.  On 32-bit platforms, the size may
increase by 8 bytes.  For backward compatibility, struct bpf_hdr with
struct timeval is still the default header unless new time stamp format is
explicitly requested.  However, the behaviour may change in the future and
all relevant code is wrapped around "#ifdef BURN_BRIDGES" for now.

- Add experimental support for tagging mbufs with time stamps from a lower
layer, e.g., device driver.  Currently, mbuf_tags(9) is used to tag mbufs.
The time stamps must be uptime in 'struct bintime' format as binuptime(9)
and getbinuptime(9) do.

Reviewed by:	net@
2010-06-15 19:28:44 +00:00
mav
ea954fa396 Virtualize pci_remap_msi_irq() call from general MSI code. It allows MSI
(FSB interrupts) to be used by non-PCI devices, such as HPET.
2010-06-14 07:10:37 +00:00
kib
bbe91d0e0f Add another variation of make_dev(9), make_dev_p(9), that is allowed
to fail and can return useful error code.

Requested by:	jh
Reviewed by:	imp, jh
MFC after:	3 weeks
2010-06-12 13:22:39 +00:00
kib
9e98593ebc When make_dev_credf(MAKEDEV_WAITOK) is called, use
devctl_notify_f(M_WAITOK) for devfs notifications.

Suggested by:	jh
Reviewed by:	imp, jh
MFC after:	3 weeks
2010-06-12 13:21:25 +00:00
kib
2605a178f6 Add modifications of devctl_notify(9) functions that take flags. Use
flags to specify M_WAITOK/M_NOWAIT. M_WAITOK allows devctl to sleep for
the memory allocation.

As Warner noted, allowing the functions to sleep might cause
reordering of the queued notifications.

Reviewed by:	imp, jh
MFC after:	3 weeks
2010-06-12 13:20:38 +00:00
avg
324886002f fix a few cases where a string is passed via format argument instead of
via %s

Most of the cases looked harmless, but this is done for the sake of
correctness.  In one case it even allowed to drop an intermediate buffer.

Found by:	clang
MFC after:	2 week
2010-06-11 19:27:21 +00:00
jhb
9b74a62d73 Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
mdf
09830f0c6f Add INVARIANTS checking that numfreebufs values are sane. Also add a
per-buf flag to catch if a buf is double-counted in the free count.
This code was useful to debug an instance where a local patch at Isilon
was incorrectly managing numfreebufs for a new buf state.

Reviewed by:	jeff
Approved by:	zml (mentor)
2010-06-11 17:03:26 +00:00
ivoras
5a89fd1114 In another move to join with the age of the Fruitbat, increase SYSV
shared resources defaults beyond absolute minimums.

The new values are chosen mostly by magic. They are still fairly
small and will need increasing for large installations (especially
SHMMAX). However, they are now enough to e.g. start PostgreSQL
installations with ~~300 users and nearly 512 MB of shared buffers.

Reviewed by:	A short discussion on hackers@
2010-06-11 09:27:33 +00:00
mav
b8bbab8130 Store interrupt trap frame into struct thread. It allows interrupt handler
to obtain both trap frame and opaque argument submitted on registrction.
After kernel and all drivers get used to it, legacy hack can be removed.

Reviewed by:	jhb@
2010-06-10 16:14:05 +00:00
ivoras
04624ee0ea Unconfuse THREAD and SMT flags 2010-06-10 11:48:14 +00:00
ivoras
7937017072 Cosmetic change to XML - less ugly newlines 2010-06-10 11:01:17 +00:00
kib
317abde372 Reorganize the code in bdwrite() which handles move of dirtiness
from the buffer pages to buffer. Combine the code to set buffer
dirty range (previously in vfs_setdirty()) and to clean the pages
(vfs_clean_pages()) into new function vfs_clean_pages_dirty_buf(). Now
the vm object lock is acquired only once.

Drain the VPO_BUSY bit of the buffer pages before setting valid
and clean bits in vfs_clean_pages_dirty_buf() with new helper
vfs_drain_busy_pages(). pmap_clear_modify() asserts that page is not
busy.

In vfs_busy_pages(), move the wait for draining of VPO_BUSY before
the dirtyness handling, to follow the structure of
vfs_clean_pages_dirty_buf().

Reported and tested by:	pho
Suggested and reviewed by:	alc
MFC after:	2 weeks
2010-06-08 17:54:28 +00:00
jhb
72cdd6ef99 Fix a sign bug that caused adaptive spinning in sx_xlock() to not work
properly.  Among other things it did not drop Giant while spinning
leading to livelocks.

Reviewed by:	rookie, kib, jmallett
MFC after:	3 days
2010-06-08 16:17:47 +00:00
mav
4363e5b2ce Call BUS_PROBE_NOMATCH() when device detached due to driver unload.
This allows bus to power-down device when driver unloaded on-flight.
2010-06-07 18:47:53 +00:00
cperciva
4adc6d09d8 Declare ip6 as (struct in6_addr *) instead of (struct in_addr *). This is
a harmless bug since we never actually use ip6 as anything other than an
opaque pointer.

Found with:	Coverty Prevent(tm)
CID:		4319
MFC after:	1 month
2010-06-04 14:38:24 +00:00
jhb
16dab63fe9 Assert that the thread lock is held in sched_pctcpu() instead of
recursively acquiring it.  All of the current callers already hold the
lock.

MFC after:	1 month
2010-06-03 16:02:11 +00:00
trasz
253bf0319d The 'acl_cnt' field is unsigned; no point in checking if it's >= 0.
Found with:	Coverity Prevent
CID:		3688
2010-06-03 13:45:27 +00:00
trasz
9985f972fd The 'acl_cnt' field is unsigned; no point in checking if it's >= 0.
Found with:	Coverity Prevent
CID:		3684
2010-06-03 13:43:58 +00:00
trasz
cbfca8b888 The acl_cnt field is unsigned; no point in checking if it's >= 0.
Found with:	Coverity Prevent
CID:		3683
2010-06-03 13:41:55 +00:00
kib
2ba33ab98e Sometimes vnodes share the lock despite being different vnodes on
different mount points, e.g. the nullfs vnode and the covered vnode
from the lower filesystem. In this case, existing assertion in
vop_rename_pre() may be triggered.

Check for vnode locks equiality instead of the vnodes itself to
not trip over the situation.

Submitted by:	Mikolaj Golub <to.my.trociny@gmail.com>
Tested by:	pho
MFC after:	2 weeks
2010-06-03 10:20:08 +00:00
alc
24ac89cf14 Minimize the use of the page queues lock for synchronizing access to the
page's dirty field.  With the exception of one case, access to this field
is now synchronized by the object lock.
2010-06-02 15:46:37 +00:00
kib
5e1e617f5e Add a facility to dynamically adjust or unconfigure p1003_1b mib.
Use it to allow to tune sem_nsem_max at runtime, only when sem.ko
module is present in kernel.

Requested and tested by:	amdmi3
Reviewed by:	jhb
MFC after:	3 days
2010-06-02 09:59:05 +00:00
zml
7f5d6a35d6 Revert taskqueue(9) related commits until mdf@ is approved and can
resolve issues.

This reverts commits r207439, r208623, r208624
2010-06-01 16:04:01 +00:00
zml
cadeb05108 Avoid a wakeup(9) if we can be sure no one is waiting on the task.
Submitted by:       Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by:        zml, jhb
2010-05-28 18:15:34 +00:00
zml
f1e0737c28 Revert r207439 and solve the problem differently. The task handler
ta_func may free the task structure, so no references to its members
are valid after the handler has been called. Using a per-queue member
and having waits longer than strictly necessary was suggested by jhb.

Submitted by:       Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by:        zml, jhb
2010-05-28 18:15:28 +00:00
rwatson
c7e8976175 When close() is called on a connected socket pair, SO_ISCONNECTED might be
set but be cleared before the call to sodisconnect().  In this case,
ENOTCONN is returned: suppress this error rather than returning it to
userspace so that close() doesn't report an error improperly.

PR:		kern/144061
Reported by:	Matt Reimer <mreimer at vpop.net>,
		Nikolay Denev <ndenev at gmail.com>,
		Mikolaj Golub <to.my.trociny at gmail.com>
MFC after:	3 days
2010-05-27 15:27:31 +00:00
attilio
e56433dd50 Add the support for reporting the NOCOREDUMP flag from
sysctl_kern_proc_vmmap().

Sponsored by:	Sandvine Incorporated
Reviewed by:	kib, emaste
MFC after:	1 week
2010-05-27 08:10:12 +00:00
kib
4f460f2f9a Allow to use syscallname(9) outside subr_trap.c.
MFC after:	1 month
2010-05-26 15:39:43 +00:00
jhb
6caceffefa Ignore the 'addr' argument passed to PT_STEP (it is required to be '1'
for PT_STEP which means "ignore") and PT_DETACH.

PR:		kern/146167
MFC after:	1 week
2010-05-25 21:32:37 +00:00
alc
54739180f5 Eliminate the acquisition and release of the page queues lock from
vfs_busy_pages().  It is no longer needed.

Submitted by:	kib
2010-05-25 02:26:25 +00:00
alc
32b13ee957 Roughly half of a typical pmap_mincore() implementation is machine-
independent code.  Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified().  Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization.  Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean().  Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by:	kib (an earlier version)
2010-05-24 14:26:57 +00:00
mav
48198e3ddd - Implement MI helper functions, dividing one or two timer interrupts with
arbitrary frequencies into hardclock(), statclock() and profclock() calls.
Same code with minor variations duplicated several times over the tree for
different timer drivers and architectures.
- Switch all x86 archs to new functions, simplifying the code and removing
extra logic from timer drivers. Other archs are also welcome.
2010-05-24 11:40:49 +00:00
kib
70f08890fc Fix the double counting of the last process thread td_incruntime
on exit, that is done once in thread_exit() and the second time in
proc_reap(), by clearing td_incruntime.

Use the opportunity to revert to the pre-RUSAGE_THREAD exporting of ruxagg()
instead of ruxagg_locked() and use it from thread_exit().

Diagnosed and tested by:	neel
MFC after:	3 days
2010-05-24 10:23:49 +00:00
kib
4208ccbe79 Reorganize syscall entry and leave handling.
Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
  usermode into struct syscall_args. The structure is machine-depended
  (this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
  from the syscall. It is a generalization of
  cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
  return value.
sv_syscallnames - the table of syscall names.

Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().

The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.

Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().

Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively.  The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.

The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.

Reviewed by:	jhb, marcel, marius, nwhitehorn, stas
Tested by:	marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
	stas (mips)
MFC after:	1 month
2010-05-23 18:32:02 +00:00
jhb
cf780ce267 - Adjust the whitespace for the lines that output fields in 'show pcpu' in
DDB so that all the fields line up.
- Print out the tid of the per-CPU idlethread instead of the pid since
  the idle process is now shared across all idle threads.

MFC after:	1 month
2010-05-21 17:17:56 +00:00
jhb
ce208e1f41 Assert that the thread passed to sched_bind() and sched_unbind() is
curthread as those routines are only supported for curthread currently.

MFC after:	1 month
2010-05-21 17:15:56 +00:00
jhb
b7fc8e97f1 Allow a const char * to be passed as the process name to kproc_kthread_add()
without generating a warning.

MFC after:	1 month
2010-05-21 17:14:36 +00:00
kib
890c865dcf Remove PIOLLHUP from the flags used to test for to set exceptfsd
fd_set bits in select(2). It seems that historical behaviour is to not
reporting exception on EOF, and several applications are broken.

Reported by:	Yoshihiko Sarumaru <ysarumaru gmail com>
Discussed with:	bde
PR:	ports/140934
MFC after:	2 weeks
2010-05-21 10:36:29 +00:00
alc
f8bed5b288 The page queues lock is no longer required by vm_page_set_invalid(), so
eliminate it.

Assert that the object containing the page is locked in
vm_page_test_dirty().  Perform some style clean up while I'm here.

Reviewed by:	kib
2010-05-18 16:40:29 +00:00
rrs
8ea4ab29a0 This pushes all of JC's patches that I have in place. I
am now able to run 32 cores ok.. but I still will hang
on buildworld with a NFS problem. I suspect I am missing
a patch for the netlogic rge driver.

JC check and see if I am missing anything except your
core-mask changes

Obtained from:	JC
2010-05-16 19:43:48 +00:00
bz
c9d1ca826b Fix an issue with the dynamic pcpu/vnet data allocators.
We cannot expect that modspace is the last entry in the linker
set and thus that modspace + possible extra space up to PAGE_SIZE
would be contiguous.  For the moment do not support more than
*_MODMIN space and ignore the extra space (*).

(*) We know how to get it back but it'll need testing.

Discussed with:	jeff, rwatson (briefly)
Reviewed by:	jeff
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	4 days
2010-05-14 21:11:58 +00:00
zml
773cda6040 Add VOP_ADVLOCKPURGE so that the file system is called when purging
locks (in the case where the VFS impl isn't using lf_*)

Submitted by:       Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by:        zml, dfr
2010-05-12 21:24:46 +00:00
pjd
05f836c1c3 When there is no memory or KVA, try to help by reclaiming some vnodes.
This helps with 'kmem_map too small' panics.

No objections from:	kib
Tested by:		Alexander V. Ribchansky <shurik@zk.informjust.ua>
MFC after:		1 week
2010-05-12 16:42:28 +00:00
pjd
f1b200bbcc I added vfs_lowvnodes event, but it was only used for a short while and now
it is totally unused. Remove it.

MFC after:	3 days
2010-05-11 22:46:36 +00:00
attilio
4d95c325dd Right now, WITNESS just blindly pipes all the output to the
(TOCONS | TOLOG) mask even when called from DDB points.
That breaks several output, where the most notable is textdump output.
Fix this by having configurable callbacks passed to witness_list_locks()
and witness_display_spinlock() for printing out datas.

Reported by:	several broken textdump outputs
Tested by:	Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
MFC after:	7 days
X-MFC:		r207922
2010-05-11 18:24:22 +00:00
attilio
a6a1f012b7 There is not a good reason to have a different prototype for db_printf()
when compared to printf().
Unify it by returning the number of characters displayed for db_printf()
as well.

MFC after:	7 days
2010-05-11 17:01:14 +00:00
attilio
31c196b3b9 Fix a hang introduced in r206878 for kernel compiled with SMP support but
being not actual SMP and similar situations by always initializing the
smp ipi mutex.

Reported by:	marius
MFC after:	3 days
X-MFC:		r206878
2010-05-11 15:36:16 +00:00
alc
bc80981f79 Update a comment: It no longer makes sense to talk about the page queues
lock here.
2010-05-08 23:01:47 +00:00
alc
40b44f9713 Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free().  Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped().  (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.
2010-05-08 20:34:01 +00:00
kib
77dcee6926 Add MAKEDEV_NOWAIT flag to make_dev_credf(9), to create a device node
in a no-sleep context. If resource allocation cannot be done without
sleep, make_dev_credf() fails and returns NULL.

Reviewed by:	jh
MFC after:	2 weeks
2010-05-06 19:22:50 +00:00
alc
fecc56fac1 Eliminate page queues locking around most calls to vm_page_free(). 2010-05-06 18:58:32 +00:00
trasz
f26ccb52af Avoid overflow.
Submitted by:	bde@
2010-05-06 18:52:41 +00:00
trasz
e6f92048fa Style fixes and removal of unneeded variable.
Submitted by:	bde@
2010-05-06 18:43:19 +00:00
alc
e6c77ecaea Remove page queues locking from all sf_buf_mext()-like functions. The page
lock now suffices.

Fix a couple nearby style violations.
2010-05-06 17:43:41 +00:00
alc
a4eb017f2a Eliminate a small bit of unneeded code from kern_sendfile(): While
kern_sendfile() is running, the file's vm object can't be destroyed
because kern_sendfile() increments the vm object's reference count.  (Once
kern_sendfile() decrements the reference count and returns, the vm object
can, however, be destroyed.  So, sf_buf_mext() must handle the case where
the vm object is destroyed.)

Reviewed by:	kib
2010-05-06 15:52:08 +00:00
joel
c8dfd5c0cb Switch to our preferred 2-clause BSD license.
Approved by:	kmacy
2010-05-05 20:39:02 +00:00
alc
5c7ca3ee73 Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held.  (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements.  It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib
2010-05-05 18:16:06 +00:00
trasz
402e3baade Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().
Reviewed by:	kib
2010-05-05 16:44:25 +00:00
kib
a3da7d7e69 Fix a mistake in r207603. td_rux.rux_runtime still needs conversion.
Reported and tested by:	nwhitehorn
Pointy hat to:	kib
MFC after:	6 days
2010-05-05 16:05:51 +00:00
alc
ea7b6345be Push down the acquisition of the page queues lock into vm_page_unwire().
Update the comment describing which lock should be held on entry to
vm_page_wire().

Reviewed by:	kib
2010-05-05 03:45:46 +00:00
alc
c9aaa1e2a2 Add page locking to the vm_page_cow* functions.
Push down the acquisition and release of the page queues lock into
vm_page_wire().

Reviewed by:	kib
2010-05-04 15:55:41 +00:00
kib
26be0345aa Fix typo in comment.
MFC after:	3 days
2010-05-04 06:06:01 +00:00
kib
e5f4727bbf Remove a comment that merely repeats code.
Submitted by:	bde
MFC after:	1 week
2010-05-04 06:04:33 +00:00
kib
7ef4b25b49 Use td_rux.rux_runtime for ki_runtime instead of redoing calculation.
Submitted by:	bde
MFC after:	1 week
2010-05-04 06:00:39 +00:00
kib
b13e838a49 Implement RUSAGE_THREAD. Add td_rux to keep extended runtime and ticks
information for thread to allow calcru1() (re)use.

Rename ruxagg()->ruxagg_locked(), ruxagg_tlock()->ruxagg() [1].
The ruxagg_locked() function no longer clears thread ticks nor
td_incruntime.

Requested by:	attilio [1]
Discussed with:	attilio, bde
Reviewed by:	bde
Based on submission by:	Alexander Krizhanovsky <ak natsys-lab com>
MFC after:	1 week
X-MFC-Note:	td_rux shall be moved to the end of struct thread
2010-05-04 05:55:37 +00:00
alc
1923b6ded3 Acquire the page lock around vm_page_unwire() and vm_page_wire().
Reviewed by:	kib
2010-05-03 16:41:11 +00:00
alc
387e15c45a This is the first step in transitioning responsibility for synchronizing
access to the page's wire_count from the page queues lock to the page lock.

Submitted by:	kmacy
2010-05-03 05:41:50 +00:00
kib
9c4f2e9ab2 Lock the page around hold_count access.
Reviewed by:	alc
2010-05-02 19:25:22 +00:00
alc
299c89c6fb Properly synchronize access to the page's hold_count in vfs_vmio_release().
Reviewed by:	kib
2010-05-02 19:10:27 +00:00
alc
f35e97166b It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(),
to unconditionally set PG_REFERENCED on a page before sleeping.  In many
cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by
the page daemon, before the caller to vm_page_sleep() is reawakened.
Instead, we now explicitly set PG_REFERENCED in those cases where having
the page persist until the caller is awakened is clearly desirable.  Note,
however, that setting PG_REFERENCED on the page is still only a hint,
and not a guarantee that the page should persist.
2010-05-02 17:33:46 +00:00
zec
139551016d Remove a redundant variable assignment.
Reviewed by:	bz, rwatson
MFC after:	3 days
2010-05-01 18:34:50 +00:00
kib
64dab823a0 Extract thread_lock()/ruxagg()/thread_unlock() fragment into utility
function ruxagg_tlock().
Convert the definition of kern_getrusage() to ANSI C.

Submitted by:	Alexander Krizhanovsky <ak natsys-lab com>
MFC after:	1 week
2010-05-01 14:46:17 +00:00
zml
3eac0000f0 Handle taskqueue_drain(9) correctly on a threaded taskqueue:
taskqueue_drain(9) will not correctly detect whether a task is
currently running.  The check is against a field in the taskqueue
struct, but for a threaded queue with more than one thread, multiple
threads can simultaneously be running a task, thus stomping over the
tq_running field.

Submitted by:       Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by:        jhb
Approved by:        dfr (mentor)
2010-04-30 16:29:05 +00:00
alfred
12d5232340 Avoid allocating MAXHOSTNAMELEN bytes on the stack in expand_name(),
use the heap instead.

Obtained from: Juniper Networks

Reviewed by:	jhb
2010-04-30 03:15:00 +00:00
alfred
993bf6ff36 Don't leak core_buf or gzfile if doing a compressed core file and we
hit an error condition.

Obtained from: Juniper Networks
2010-04-30 03:13:24 +00:00
alfred
20fdc94b9e Do not set IO_NODELOCKED while writing to vnodes as our consumers
do not lock the vnodes.

Obtained from: Juniper Networks

Reviewed by:	jhb
2010-04-30 03:10:53 +00:00
kmacy
1dc1263413 On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib
2010-04-30 00:46:43 +00:00
kib
a22b32df4a Remove caddr_t casts.
Requested by:	bde
MFC after:	10 days
2010-04-29 09:55:51 +00:00
avg
2cfe78bdd9 kern_ntptime: drop a comment that became stale after r207359
MFC after:	1 week
X-MFC after:	r207359
2010-04-29 09:18:36 +00:00
avg
cce2a4186b periodically save system time to hardware time-of-day clock
This is done in kern_ntptime, perhaps not the best place.
This is done using resettodr().
Some features:
- make save period configurable via tunable and sysctl
- period of zero disables saving, setting a non-zero period re-enables
  it or reschedules it
- do saving only if system clock is ntp-synchronized
- save on shutdown

Discussed with:	des, Peter Jeremy <peterjeremy@acm.org>
X-Maybe:		save time near seconds boundary for better precision
MFC after:		2 weeks
2010-04-29 09:02:46 +00:00
avg
cbff4850b6 kern_ntptime: abstract time error check into a function
... to avoid code duplication

MFC after:	1 week
2010-04-29 09:02:21 +00:00
lstewart
bf49d6a9f9 - Rework the underlying ALQ storage to be a circular buffer, which amongst other
things allows variable length messages to be easily supported.

- Extend KPI with alq_writen() and alq_getn() to support variable length
  messages, which is enabled at ALQ creation time depending on the
  arguments passed to alq_open(). Also add variants of alq_open() and
  alq_post() that accept a flags argument. The KPI is still fully
  backwards compatible and shouldn't require any change in ALQ consumers
  unless they wish to utilise the new features.

- Introduce the ALQ_NOACTIVATE and ALQ_ORDERED flags to allow ALQ consumers
  to have more control over IO scheduling and resource acquisition
  respectively.

- Strengthen invariants checking.

- Document ALQ changes in ALQ(9) man page.

Sponsored by:	FreeBSD Foundation
Reviewed by:	gnn, jeff, rpaulo, rwatson
MFC after:	1 month
2010-04-26 13:48:22 +00:00
kib
e91c695f77 Move the constants specifying the size of struct kinfo_proc into
machine-specific header files. Add KINFO_PROC32_SIZE for struct
kinfo_proc32 for architectures providing COMPAT_FREEBSD32. Add
CTASSERT for the size of struct kinfo_proc32.

Submitted by:	pluknet
Reviewed by:	imp, jhb, nwhitehorn
MFC after:	2 weeks
2010-04-24 12:49:52 +00:00
jeff
a574495410 - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
   for background fsck on unclean shutdown.

Sponsored by:   iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
2010-04-24 07:05:35 +00:00
bz
8b9f8a6735 Remove one zero from the double-0.
This code doesn't have a license to kill.

MFC after:	3 days
2010-04-23 14:32:58 +00:00
kib
f504a7390f Fix typo.
Submitted by:	emaste
Pointy hat to:	kib (who needs much bigger wardrobe)
MFC after:	1 week
2010-04-21 20:04:42 +00:00
kib
1b4a81ab7e Provide compat32 shims for kinfo_proc sysctl. This allows 32bit ps(1) to
mostly work on 64bit host.

The work is based on an original patch submitted by emaste, obtained
from Sandvine's source tree.

Reviewed by:	jhb
MFC after:	1 week
2010-04-21 19:32:00 +00:00
imp
f6c103ea8b Make sure that we free the passed in data message if we don't actually
insert it onto the queue.  Also, fix a mtx leak if someone turns off
devctl while we're processing a messages.

MFC after:	5 days
2010-04-20 20:39:42 +00:00
attilio
6dda1433c8 Fix compilation in the !SMP case.
Keep the interrupts disabled in order to avoid preemption problems.

Reported by:	tinderbox, b.f. <bf1783 at googlemail dot com>
MFC:		2 weeks
X-MFC:		r206878
2010-04-20 12:22:06 +00:00
kib
d5b92466a9 The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporary unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.

Verify that dvp is not reclaimed before calling cache_enter().

Reported and tested by:	pho
Reviewed by:	kan
MFC after:	2 weeks
2010-04-20 10:19:27 +00:00
attilio
70405fc79f getblk lockmgr is mostly used as a msleep() and may lead too easilly to
false positives.
Whitelist it.

Reported by:	Erik Cederstrand <erik at cederstrand dot dk>
2010-04-19 23:40:46 +00:00
attilio
fca97c8d7a Fix a deadlock in the shutdown code:
When performing a smp_rendezvous() or more likely, on amd64 and i386,
a smp_tlb_shootdown() the caller will end up with the smp_ipi_mtx
spinlock held, busy-waiting for other CPUs to acknowledge the operation.
As long as CPUs are suspended (via cpu_reset()) between the active mask
read and IPI sending there can be a deadlock where the caller will wait
forever for a dead CPU to acknowledge the operation.
Please note that on CPU0 that is going to be someway heavier because of
the spinlocks being disabled earlier than quitting the machine.

Fix this bug by calling cpu_reset() with the smp_ipi_mtx held.
Note that it is very likely that a saner offline/online CPUs mechanism
will help heavilly in fixing similar cases as it is likely more bugs
of this type may arise in the future.

Reported by:	rwatson
Discussed with:	jhb
Tested by:	rnoland, Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
MFC:		2 weeks

Special deciation to:	anyone who made possible to have 16-ways machines
			in Netperf
2010-04-19 23:27:54 +00:00
kib
502beaff46 Fix typo.
MFC after:	3 days
2010-04-15 17:17:02 +00:00
julian
247d9af67e Change the semantics of the debug.ktr.alq_enable control so that when you
disable alq, it acts as if alq had not been enabled in the build.
in other words, the rest of ktr is still available for use.
If you really don't want that as well, set the mask to 0.

MFC after:3 weeks
2010-04-14 21:42:29 +00:00
kib
bf84540647 Handle a case in kern_openat() when vn_open() change file type from
DTYPE_VNODE.

Only acquire locks for O_EXLOCK/O_SHLOCK if file type is still vnode,
since we allow for fcntl(2) to process with advisory locks for
DTYPE_VNODE only. Another reason is that all fo_close() routines need to
check and release locks otherwise.

For O_TRUNC, call fo_truncate() instead of truncating the vnode.

Discussed with:	rwatson
MFC after:	2 week
2010-04-13 08:52:20 +00:00
kib
d5f342f2da Remove XXX comment. Add another comment, describing why f_vnode assignment
is useful.

MFC after:	3 days
2010-04-13 08:45:55 +00:00
alc
89e5d72c2b Initialize the virtual memory-related resource limits in a single place.
Previously, one of these limits was initialized in two places to a
different value in each place.  Moreover, because an unsigned int was used
to represent the amount of pageable physical memory, some of these limits
were incorrectly initialized on 64-bit architectures.  (Currently, this
error is masked by login.conf's default settings.)

Make vm_thread_swapin() and vm_thread_swapout() static.

Submitted by:	bde (an earlier version)
Reviewed by:	kib
2010-04-11 16:26:07 +00:00
attilio
02f2ab87a2 - Introduce a blessed list for sxlocks that prevents the deadlkres to
panic on those ones. [0]
- Fix ticks counter wrap-up

Sponsored by:		Sandvine Incorporated
[0] Reported by:	jilles
[0] Tested by:		jilles
MFC:			1 week
2010-04-11 16:06:09 +00:00
kib
231b3cddc2 Do not leak master pty or ptmx vnode.
Report and test case by:	Petr Salinger <Petr.Salinger seznam cz>
Reviewed by:	ed
MFC after:	1 week
2010-04-08 08:58:18 +00:00
kib
47feb6893a When OOM searches for a process to kill, ignore the processes already
killed by OOM. When killed process waits for a page allocation, try to
satisfy the request as fast as possible.

This removes the often encountered deadlock, where OOM continously
selects the same victim process, that sleeps uninterruptibly waiting
for a page. The killed process may still sleep if page cannot be
obtained immediately, but testing has shown that system has much
higher chance to survive in OOM situation with the patch.

In collaboration with:	pho
Reviewed by:	alc
MFC after:	4 weeks
2010-04-06 10:43:01 +00:00
jh
6b4bef1bca Add missing MNT_NFS4ACLS. 2010-04-04 14:48:43 +00:00
alc
7530e331f2 Make _vm_map_init() the one place where the vm map's pmap field is
initialized.

Reviewed by:	kib
2010-04-03 19:07:05 +00:00
pjd
bd6cec6aca Fix some whitespace nits. 2010-04-03 11:19:20 +00:00
pjd
f0663e1c41 Add missing mnt_kern_flag flags in 'show mount' output. 2010-04-03 11:15:55 +00:00
avg
91cd4478c2 vn_stat: take into account va_blocksize when setting st_blksize
As currently st_blksize is always PAGE_SIZE, it is playing safe to not
use any smaller value.  For some cases this might not be optimal, but
at least nothing should get broken.

Generally I don't expect this commit to change much for the following
reasons (in case of VREG, VDIR):
- application I/O and physical I/O are sufficiently decoupled by
  filesystem code, buffer cache code, cluster and read-ahead logic
- not all applications use st_blksize as a hint, some use f_iosize, some
  use fixed block sizes

I expect writes to the middle of files on ZFS to benefit the most from
this change.

Silence from:	fs@
MFC after:	2 weeks
2010-04-03 08:39:00 +00:00
avg
ad244906c7 bo_bsize: revert r205860 and take an alternative approch in getblk
In r205860 I missed the fact that there is code that strongly assumes
that devvp bo_bsize is equal to underlying provider's sectorsize.
In those places it is hard to obtain the sectorsize in an alternative
way if devvp bo_bsize is set to something else.
So, I am reverting bo_bsize assigment in g_vfs_open.
Instead, in getblk I use DEV_BSIZE block size for b_offset calculation
if vp is a disk vp as reported by vn_isdisk.  This should coinside with
vp being a devvp.

Reported by:	Mykola Dzham <i@levsha.me>
Tested by:	Mykola Dzham <i@levsha.me>
Pointyhat to:	avg
MFC after:	2 weeks
X-ToDo:		convert bread(devvp) in all fs to use bo_bsize-d blocks
2010-04-02 15:12:31 +00:00
kib
dbbce00a33 Supply default implementation of VOP_RENAME() that does neccessary
unlocks and unreferences for argument vnodes, as expected by
kern_renameat(9), and returns EOPNOTSUPP. This fixes locks and
reference leaks when rename is attempted on fs that does not
implement rename.

PR:	kern/107439
Based on submission by:	Mikolaj Golub <to.my.trociny gmail com>
Tested by:	Mikolaj Golub
MFC after:	1 week
2010-04-02 14:03:43 +00:00
kib
86c35b90b7 Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().

Tested by:	Mikolaj Golub
MFC after:	1 week
2010-04-02 14:03:01 +00:00
lstewart
c1a8ef630e The ALQ should not be considered drained until it has been made inactive.
Sponsored by:	FreeBSD Foundation
Reviewed by:	dwmalone, jeff, rpaulo, rwatson (as part of a larger patch)
Approved by:	kmacy (mentor)
MFC after:	1 month
2010-04-01 01:27:10 +00:00
lstewart
f1881c310c According to SLEEP(9), msleep() is deprecated in favour of mtx_sleep().
Sponsored by:	FreeBSD Foundation
Reviewed by:	dwmalone, jeff, rpaulo, rwatson (as part of a larger patch)
Approved by:	kmacy (mentor)
MFC after:	1 month
2010-04-01 01:23:36 +00:00
lstewart
4aab292692 - Factor code to destroy an ALQ out of alq_close() into a private alq_destroy().
- Use the new alq_destroy() to properly handle a failure case in alq_open().

Sponsored by:	FreeBSD Foundation
Reviewed by:	dwmalone, jeff, rpaulo, rwatson (as part of a larger patch)
Approved by:	kmacy (mentor)
MFC after:	1 month
2010-04-01 01:16:00 +00:00
lstewart
a8e85cee7a Add support for ALQ(9) to be compiled and loaded as a kernel module.
Sponsored by:	FreeBSD Foundation
Reviewed by:	dwmalone, jeff, rpaulo, rwatson
Approved by:	kmacy (mentor)
MFC after:	1 month
2010-03-31 03:58:57 +00:00
jhb
5bba6cc028 Defer freeing a kevent list until after dropping kqueue locks.
LOR:		185
Submitted by:	Matthew Fleming @ Isilon
MFC after:	1 week
2010-03-30 18:31:55 +00:00
ed
4f08ecd7ed Rename st_*timespec fields to st_*tim for POSIX 2008 compliance.
A nice thing about POSIX 2008 is that it finally standardizes a way to
obtain file access/modification/change times in sub-second precision,
namely using struct timespec, which we already have for a very long
time. Unfortunately POSIX uses different names.

This commit adds compatibility macros, so existing code should still
build properly. Also change all source code in the kernel to work
without any of the compatibility macros. This makes it all a less
ambiguous.

I am also renaming st_birthtime to st_birthtim, even though it was a
local extension anyway. It seems Cygwin also has a st_birthtim.
2010-03-28 13:13:22 +00:00
jh
7a3d363cfe Support only LOOKUP operation for "/" in relookup() because lookup()
can't succeed for CREATE, DELETE and RENAME.

Discussed with:	bde
2010-03-26 11:33:12 +00:00
nwhitehorn
ac2318460c Add the ELF relocation base to struct image_params. This will be
required to correctly relocate the executable entry point's function
descriptor on powerpc64.
2010-03-25 14:31:26 +00:00
nwhitehorn
d63c82a6ac Change the arguments of exec_setregs() so that it receives a pointer
to the image_params struct instead of several members of that struct
individually. This makes it easier to expand its arguments in the future
without touching all platforms.

Reviewed by:	jhb
2010-03-25 14:24:00 +00:00
nwhitehorn
b60f1f5349 Change the way text_addr and data_addr are computed to use the
executable status of segments instead of detecting the main text segment
by which segment contains the program entry point. This affects
obreak() and is required for correct operation of that function
on 64-bit PowerPC systems. The previous behavior was apparently
required only for the Alpha, which is no longer supported.

Reviewed by:	jhb
Tested on:	amd64, sparc64, powerpc
2010-03-25 14:21:22 +00:00
bz
8fb79807f2 Print the pointer to the lock with the panic message. The previous
panic: rw lock not unlocked
was not really helpful for debugging. Now one can at least call
	show lock <ptr>
form ddb to learn more about the lock.

MFC after:	3 days
2010-03-24 19:21:26 +00:00
nwhitehorn
32325226c5 The nargvstr and nenvstr properties of arginfo are ints, not longs,
so should be copied to userspace with suword32() instead of suword().
This alleviates problems on 64-bit big-endian architectures, and is a
no-op on all 32-bit architectures.

Tested on:	amd64, sparc64, powerpc64
2010-03-24 03:13:24 +00:00
ed
6156503467 Actually make O_DIRECTORY work.
According to POSIX open() must return ENOTDIR when the path name does
not refer to a path name. Change vn_open() to respect this flag. This
also simplifies the Linuxolator a bit.
2010-03-21 20:43:23 +00:00
bz
102a1f8933 Split eventhandler_register() into an internal part and a wrapper function
that provides the allocated and setup eventhandler entry.

Add a new wrapper for VIMAGE that allocates extra space to hold the
callback function and argument in addition to an extra wrapper function.
While the wrapper function goes as normal callback function the
argument points to the extra space allocated holding the original func
and arg that the wrapper function can then call.

Provide an iterator function for the virtual network stack (vnet) that
will call the callback function for each network stack.

Provide a new set of macros for VNET that in the non-VIMAGE case will
just call eventhandler_register() while in the VIMAGE case it will use
vimage_eventhandler_register() passing in the extra iterator function
but will only register once rather than per-vnet.
We need a special macro in case we are interested in the tag returned
as we must check for curvnet and can neither simply assign the
return value, nor not change it in the non-vnet0 case without that.

Sponsored by:	ISPsystem
Discussed with:	jhb
Reviewed by:	zec (earlier version), jhb
MFC after:	1 month
2010-03-19 19:51:03 +00:00
kib
1e468766f7 Convert aio syscall registration to SYSCALL_INIT_HELPER.
Reviewed by:	jhb
MFC after:	2 weeks
2010-03-19 11:11:34 +00:00
kib
06319cba03 Implement compat32 shims for mqueuefs.
Reviewed by:	jhb
MFC after:	2 weeks
2010-03-19 11:10:24 +00:00
kib
34d2655cb1 Implement compat32 shims for ksem syscalls.
Reviewed by:	jhb
MFC after:	2 weeks
2010-03-19 11:08:43 +00:00
kib
b27fa06f97 Move SysV IPC freebsd32 compat shims from freebsd32_misc.c to corresponding
sysv_{msg,sem,shm}.c files.

Mark SysV IPC freebsd32 syscalls as NOSTD and add required
SYSCALL_INIT_HELPER/SYSCALL32_INIT_HELPERs to provide auto
register/unregister on module load.

This makes COMPAT_FREEBSD32 functional with SysV IPC compiled and loaded
as modules.

Reviewed by:	jhb
MFC after:	2 weeks
2010-03-19 11:04:42 +00:00
kib
610214ed4c Move SysV IPC freebsd32 compat shims helpers from freebsd32_misc.c to
sysv_ipc.c.

Reviewed by:	jhb
MFC after:	2 weeks
2010-03-19 11:01:51 +00:00
kib
d19a162142 Introduce SYSCALL_INIT_HELPER and SYSCALL32_INIT_HELPER macros and
neccessary support functions to allow registering dynamically loaded
syscalls from the MOD_LOAD handlers. Helpers handle registration
failures semi-automatically.

Reviewed by:	jhb
MFC after:	2 weeks
2010-03-19 10:56:30 +00:00
kib
b145781d49 Properly handle compat32 calls to sctp generic sendmsd/recvmsg functions that
take iov.

Reviewed by:	tuexen
MFC after:	2 weeks
2010-03-19 10:46:54 +00:00
kib
0c73b78495 Remove dead statement.
Reviewed by:	tuexen
MFC after:	2 weeks
2010-03-19 10:44:02 +00:00
kib
08e36289a4 Fix two style issues.
MFC after:	2 weeks
2010-03-19 10:41:32 +00:00
jhb
5b4b1c75c0 Style fixes.
Submitted by:	bde
2010-03-11 15:13:55 +00:00
nwhitehorn
142a4d2993 Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by:	kib, jhb
2010-03-11 14:49:06 +00:00
jhb
3a7e251600 Fix a comment nit.
Submitted by:	Alexander Best
2010-03-11 13:16:06 +00:00
jhb
9167691ca3 Add descriptions for debug.ktr sysctl nodes. 2010-03-10 21:35:42 +00:00
imp
3ebe1a7826 Bump up the firmware_table from 30 to 50. bwn needs more than 30, it
seems.
2010-03-07 22:37:35 +00:00
alfred
88b3bf6496 put calls to gzclose() under ifdef COMPRESS_USER_CORES to prevent
undefined symbols on kernels without this option.

Reported by: Alexander Best
2010-03-04 21:53:45 +00:00
rrs
17cf51b0bb sched_getparam was just plain broke for time-share
processes. It did not return an error but instead
just let garbage be passed back. This I fix so
it actually properly translates the priority the
process is at to a posix's high means more priority.
I also fix it so that if the ULE scheduler has bumped
it up to a realtime process you get back a sane value
i.e. the highest priority (63 for time-share).

sched_setscheduler() had the setting of the
timeshare class priority disabled. With some notes
about rejecting the posix high numbers is greater
priority and use nice instead. This fix also
adjusts that to work, with the cavet that a t-s
process may well get bumped up or down i.e. the
setscheduler() will NOT change the nice value only
the current priority. I think this is reasonable
considering if the user wants to play with nice then
he can. At least all the posix'ish interfaces now
respond sanely.

MFC after:	3 weeks
2010-03-03 21:46:51 +00:00
jhb
f9290c7f6d Allow lseek(SEEK_END) to work on disk devices by using the DIOCGMEDIASIZE
to determine the media size.

Submitted by:	nox
MFC after:	1 week
2010-03-03 16:18:04 +00:00
ivoras
14f9175723 Document the VM detection type and sysctl a bit better. 2010-03-02 23:57:42 +00:00
alfred
f34ce3dd38 Merge projects/enhanced_coredumps (r204346) into HEAD:
Enhanced process coredump routines.

  This brings in the following features:
  1) Limit number of cores per process via the %I coredump formatter.
  Example:
    if corefilename is set to %N.%I.core AND num_cores = 3, then
    if a process "rpd" cores, then the corefile will be named
    "rpd.0.core", however if it cores again, then the kernel will
    generate "rpd.1.core" until we hit the limit of "num_cores".

    this is useful to get several corefiles, but also prevent filling
    the machine with corefiles.

  2) Encode machine hostname in core dump name via %H.

  3) Compress coredumps, useful for embedded platforms with limited space.
    A sysctl kern.compress_user_cores is made available if turned on.

    To enable compressed coredumps, the following config options need to be set:
    options COMPRESS_USER_CORES
    device zlib   # brings in the zlib requirements.
    device gzio   # brings in the kernel vnode gzip output module.

  4) Eventhandlers are fired to indicate coredumps in progress.

  5) The imgact sv_coredump routine has grown a flag to pass in more
  state, currently this is used only for passing a flag down to compress
  the coredump or not.

  Note that the gzio facility can be used for generic output of gzip'd
  streams via vnodes.

Obtained from: Juniper Networks
Reviewed by: kan
2010-03-02 06:58:58 +00:00
bruno
3bef33deb1 Deliver siginfo when signal is generated by thr_kill(2) (SI_USER with properly
filled si_uid and si_pid).

Reported by:	Joel Bertrand <joel.bertrand systella fr>
PR:		141956
Reviewed by:	kib
MFC after:	2 weeks
2010-03-01 14:27:16 +00:00
rwatson
fc045dee13 Remove stale comment about socket buffer accounting from access(2) code.
It is the case, however, that the uidinfo of the temporary credential
set up for access(2) is not properly updated when its effective uid is
changed.

MFC after:	3 days
2010-02-27 19:57:40 +00:00
alc
83149d5d10 When running as a guest operating system, the FreeBSD kernel must assume
that the virtual machine monitor has enabled machine check exceptions.
Unfortunately, on AMD Family 10h processors the machine check hardware
has a bug (Erratum 383) that can result in a false machine check exception
when a superpage promotion occurs.  Thus, I am disabling superpage
promotion when the FreeBSD kernel is running as a guest operating system
on an AMD Family 10h processor.

Reviewed by:	jhb, kib
MFC after:	3 days
2010-02-27 18:00:57 +00:00
kib
a8e7da4cf2 For kinfo_proc in kp->ki_siglist, return the set of the signals pending
in the process queue when gathering information for the process, and set
of signals pending for the thread, when gathering information for the
thread. Previously, the sysctl returned a union of the process and some
arbitrary thread pending set for the process, and union of the process
and the thread pending set for the thread.

MFC after:	1 week
2010-02-27 15:32:49 +00:00
kib
695f0b496c Fix several style issues.
Define make_dev_credv() as static to match declaration.

MFC after:	3 days
2010-02-27 15:26:36 +00:00
jilles
9a57b41b18 Include terminated threads in ps's process cpu time field.
MFC after:	2 weeks
2010-02-27 12:15:59 +00:00