Commit Graph

11532 Commits

Author SHA1 Message Date
marcel
bcd7bda0da Unbreak building kernels with COMPAT_32 enabled. The actual support
for the PT_VM_ENTRY request from 32-bit processes will follow.

Pointy hat: marcel
2010-02-09 17:20:00 +00:00
marcel
764ce56ace Add PT_VM_TIMESTAMP and PT_VM_ENTRY so that the tracing process can
obtain the memory map of the traced process. PT_VM_TIMESTAMP can be
used to check if the memory map changed since the last time to avoid
iterating over all the VM entries unnecessarily.

MFC after:	1 month
2010-02-09 05:52:35 +00:00
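
The timestamp check described in this commit can be modelled without ptrace(2) at all. The following is a minimal sketch, not the tracer API itself: `fake_vm_map`, `map_cache` and `map_cache_refresh` are hypothetical names, and the entry walk is reduced to copying a count where a real tracer would loop over PT_VM_ENTRY requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the traced process's memory map. */
struct fake_vm_map {
	int timestamp;      /* bumped whenever the map changes */
	size_t nentries;    /* entries a full walk would visit */
};

/* Tracer-side cache of the last full walk. */
struct map_cache {
	int seen_timestamp;
	size_t cached_entries;
	unsigned walks;     /* number of full walks performed so far */
};

/*
 * Re-walk the map only if its timestamp differs from the cached one.
 * Returns 1 if a walk was performed, 0 if the cache was still valid.
 */
static int
map_cache_refresh(struct map_cache *c, const struct fake_vm_map *m)
{
	assert(c != NULL && m != NULL);
	if (c->walks != 0 && c->seen_timestamp == m->timestamp)
		return (0);
	/* Placeholder for the (expensive) iteration over all VM entries. */
	c->cached_entries = m->nentries;
	c->seen_timestamp = m->timestamp;
	c->walks++;
	return (1);
}
```

The first call always walks; later calls walk again only after the timestamp has moved.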
ed
d40177139e Remove unused LIBCOMPAT keyword from syscalls.master. 2010-02-08 10:02:01 +00:00
davidxu
7d46cfed0a Set waiters flag before checking semaphore's counter,
otherwise we might lose a wakeup. Tested on postgresql database server.
2010-02-08 07:31:05 +00:00
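
The ordering this commit describes can be illustrated in portable C11. This is only a model under stated assumptions, not the kernel code: `sketch_sem`, `sketch_prepare_wait` and `sketch_post` are hypothetical names, and the actual sleeping and waking are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical semaphore state shared by waiter and poster. */
struct sketch_sem {
	atomic_uint count;
	atomic_bool has_waiters;
};

/*
 * Waiter side: announce the waiter first, then check the counter.
 * Returns true if the caller must go to sleep. A flag left set when
 * the count turns out to be non-zero only costs one extra wakeup.
 */
static bool
sketch_prepare_wait(struct sketch_sem *s)
{
	assert(s != NULL);
	atomic_store(&s->has_waiters, true);   /* step 1: set waiters flag */
	return (atomic_load(&s->count) == 0);  /* step 2: check the counter */
}

/*
 * Poster side: bump the counter, then look at the flag.
 * Returns true if a wakeup has to be issued.
 */
static bool
sketch_post(struct sketch_sem *s)
{
	assert(s != NULL);
	atomic_fetch_add(&s->count, 1);
	return (atomic_exchange(&s->has_waiters, false));
}
```

With sequentially consistent atomics, either the waiter sees the incremented count or the poster sees the flag. Checking the counter before setting the flag would open a window in which neither happens.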
gavin
c2c8b20929 Spelling nit 2010-02-07 18:00:13 +00:00
ed
760cf1fda3 Remove statistics from the TTY queues.
I added counters to see how often fast copying to userspace was actually
performed, which was only useful during development. Remove these
statistics now we know it to be effective.
2010-02-07 15:42:15 +00:00
mav
fab0a47626 MFp4:
Make CAM stop all attached devices on system shutdown.
This allows devices to park heads, reducing stress on power loss.
Add `kern.cam.power_down` tunable and sysctl to control it.
2010-02-03 08:42:08 +00:00
davidxu
47ff0c69ad Fix comments in do_sem_wait(). 2010-02-03 07:21:20 +00:00
davidxu
324cd07ff3 After busying the lock, re-read the state word before checking the waiters flag,
otherwise, the waiters bit may not be set and a wakeup is lost.

Submitted by:	justin.teller at gmail dot com
MFC after:	3 days
2010-02-03 03:56:32 +00:00
rwatson
ddf7a2e0a7 Only audit pathnames in namei(9) if copying the directory string completes
successfully.  Continue to do this before the empty path check so that the
ENOENT returned in that case gets an empty string token in the BSM record.

MFC after:	3 days
2010-02-02 23:10:27 +00:00
avg
8ad65aae5c KASSERT that the return value of an interrupt filter complies with the contract.
For example, a return value of zero could lead to a stuck level-triggered
interrupt line.

Reviewed by:	jhb (for INTR_FILTER case)
MFC after:	3 weeks
2010-01-27 09:59:08 +00:00
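
A check of this kind might look roughly like the sketch below. It is one plausible reading of the filter contract (either "stray" alone, or some combination of "handled" and "schedule thread"), not the asserted expression from the commit. The `SK_`-prefixed constants are local stand-ins whose numeric values are assumptions.

```c
#include <stdbool.h>

/* Stand-in flag values; the real definitions live in the kernel headers. */
#define SK_FILTER_STRAY            0x01
#define SK_FILTER_HANDLED          0x02
#define SK_FILTER_SCHEDULE_THREAD  0x04

/*
 * Returns true if 'ret' is an acceptable filter return value under the
 * assumed contract. Zero is rejected, which is the case the commit
 * message warns about.
 */
static bool
sketch_filter_ret_ok(int ret)
{
	const int ok = SK_FILTER_HANDLED | SK_FILTER_SCHEDULE_THREAD;

	return (ret == SK_FILTER_STRAY ||
	    ((ret & ok) != 0 && (ret & ~ok) == 0));
}
```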
delphij
d9a0cd0982 Revised revision 199201 (add interface description capability as inspired
by OpenBSD), based on comments from many, including rwatson, jhb, brooks
and others.

Sponsored by:	iXsystems, Inc.
MFC after:	1 month
2010-01-27 00:30:07 +00:00
attilio
953b6f2ced Split out an invariant in order to better check that newtd, when
provided, must be on a runqueue.

Tested by:	Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
MFC:		2 weeks
X-MFC:		r202889
2010-01-24 18:16:38 +00:00
attilio
a4281bb884 - Fix the kthread_{suspend, resume, suspend_check}() locking.
In the current code, the locking is completely broken and may lead
  easily to deadlocks. Fix it by using the proc_mtx, linked to the
  suspending thread, as lock for the operation.  Keep using the
  thread_lock for setting and reading the flag even if it is not entirely
  necessary (atomic ops may do it as well, but this way the code is more
  readable).
- Fix a deadlock within kthread_suspend().
  The suspender should not sleep on a different channel from the suspended
  thread; otherwise, the waker would have to wake up both. Make the
  interface uniform with what the kproc_* counterparts do (sleeping on the
  same channel).
- Change the kthread_suspend_check() prototype.
  kthread_suspend_check() always assumes curthread and must only refer to
  it, so skip the thread pointer as it may be easily mistaken.
  If curthread is not a kthread, the system will panic.

In collaboration with:	jhb
Tested by:		Giovanni Trematerra
			<giovanni dot trematerra at gmail dot com>
MFC:			2 weeks
2010-01-24 15:07:00 +00:00
attilio
9a7f4738f4 - Fix a race in sched_switch() of sched_4bsd.
In the case of the thread being on a sleepqueue or a turnstile, the
  sched_lock was acquired (without the aid of the td_lock interface) and
  the td_lock was dropped. This broke the locking rules for other
  threads wanting to access the thread (via the td_lock interface) and
  modify its flags (allowed as long as the container lock was different
  from the one used in sched_switch).
  In order to prevent this situation, while sched_lock is acquired there
  the td_lock gets blocked. [0]
- Merge the ULE's internal function thread_block_switch() into the global
  thread_lock_block() and make the former's semantics the default for
  thread_lock_block(). This means that thread_lock_block() will not
  disable interrupts when called (and consequently thread_unlock_block()
  will not re-enable them when called). This should be done manually
  when necessary.
  Note, however, that ULE's thread_unblock_switch() is not reaped
  because it reflects a difference in semantics in ULE (the
  td_lock may not necessarily still be blocked_lock when calling this).
  While asymmetric, it describes a notable difference in semantics
  that is good to keep in mind.

[0] Reported by:	Kohji Okuno
			<okuno dot kohji at jp dot panasonic dot com>
Tested by:		Giovanni Trematerra
			<giovanni dot trematerra at gmail dot com>
MFC:			2 weeks
2010-01-23 15:54:21 +00:00
kib
ddbe48bd56 For PT_TO_SCE stop that stops the ptraced process upon syscall entry,
syscall arguments are collected before ptracestop() is called. As a
consequence, debugger cannot modify syscall or its arguments.

For i386, amd64 and ia32 on amd64 MD syscall(), reread syscall number
and arguments after ptracestop(), if debugger modified anything in the
process environment. Since procfs stopevent requires number of syscall
arguments in p_xstat, this cannot be solved by moving stop/trace point
before argument fetching.

Move the code to read arguments into separate function
fetch_syscall_args() to avoid code duplication. Note that ktrace point
for modified syscall is intentionally recorded twice, once with original
arguments, and second time with the arguments set by debugger.

PT_TO_SCX stop is executed after cpu_syscall_set_retval() already.

Reported by:	Ali Polatel <alip exherbo org>
Briefly discussed with:	jhb
MFC after:	3 weeks
2010-01-23 11:45:35 +00:00
kib
a22fc66b6e Staticise sigqueue manipulation functions used only in kern_sig.c.
MFC after:	1 week
2010-01-23 11:43:30 +00:00
kib
604a9a1db3 When traced process is about to receive the signal, the process is
stopped and debugger may modify or drop the signal. After the changes to
keep process-targeted signals on the process sigqueue, another thread
may note the old signal on the queue and act before the thread removes
changed or dropped signal from the process queue. Since process is
traced, it usually gets stopped. Or, if the same signal is delivered
while process was stopped, the thread may erroneously remove it,
intending to remove the original signal.

Remove the signal from the queue before notifying the debugger. Restore
the siginfo to the head of sigqueue when signal is allowed to be
delivered to the debuggee, using newly introduced KSI_HEAD ksiginfo_t
flag. This preserves required order of delivery. Always restore the
unchanged signal on the curthread sigqueue, not to the process queue,
since the thread is about to get it anyway, because sigmask cannot be
changed.

Handle failure of reinserting the siginfo into the queue by falling
back to sq_kill method, calling sigqueue_add with NULL ksi.

If debugger changed the signal to be delivered, use sigqueue_add()
with NULL ksi instead of only setting sq_signals bit.

Reported by:	Gardner Bell <gbell72 rogers com>
Analyzed and first version of fix by:	Tijl Coosemans <tijl coosemans org>
PR:	142757
Reviewed by:	davidxu
MFC after:	2 weeks
2010-01-20 11:58:04 +00:00
ed
0c91591629 Remove a dead initialization.
Spotted by:	scan-build (uqs)
2010-01-18 18:58:03 +00:00
kib
1c4578d223 Add new function vunref(9) that decrements vnode use count (and hold
count) while vnode is exclusively locked.

The code for vput(9), vrele(9) and vunref(9) is merged.

In collaboration with:	pho
Reviewed by:	alc
MFC after:	3 weeks
2010-01-17 21:24:27 +00:00
bz
d80ba03e3c Add ip4.saddrsel/ip4.nosaddrsel (and equivalent for ip6) to control
whether to use source address selection (default) or the primary
jail address for unbound outgoing connections.

This is intended to be used by people upgrading from single-IP
jails to multi-IP jails but not having to change firewall rules,
application ACLs, ... but to force their connections (unless
otherwise changed) to the primary jail IP they had been using for
years, as well as for people preferring to implement similar policies.

Note that for IPv6, if configured incorrectly, this might lead to
scope violations, which single-IPv6 jails could cause as well, by the
design of jails. [1]

Reviewed by:	jamie, hrs (ipv6 part)
Pointed out by:	hrs [1]
MFC After:	2 weeks
Asked for by:	Jase Thew (bazerka beardz.net)
2010-01-17 12:57:11 +00:00
brooks
c67fa34ee5 Only allocate the space we need before calling kern_getgroups instead
of allocating whatever the user asks for up to "ngroups_max + 1".  On
systems with large values of kern.ngroups this will be more efficient.

The now redundant check that the array is large enough in
kern_getgroups() is deliberate to allow this change to be merged to
stable/8 without breaking potential third party consumers of the API.

Reported by:	bde
MFC after:	28 days
2010-01-15 07:18:46 +00:00
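
The commit above concerns the kernel side of the call. The same "size first, then allocate exactly that much" idea can be shown from userland with the standard getgroups(2) interface, as a rough analogue only. `fetch_groups` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fetch the supplementary group list into a buffer sized to what is
 * actually needed. On success returns the number of groups and stores
 * a malloc'ed array in *out (caller frees); returns -1 on failure.
 */
static int
fetch_groups(gid_t **out)
{
	gid_t *groups;
	int n;

	assert(out != NULL);
	n = getgroups(0, NULL);     /* size 0: only report the count */
	if (n < 0)
		return (-1);
	/* Allocate at least one slot so an empty list still gets a buffer. */
	groups = calloc(n > 0 ? (size_t)n : 1, sizeof(*groups));
	if (groups == NULL)
		return (-1);
	n = getgroups(n, groups);   /* fails if the list grew in between */
	if (n < 0) {
		free(groups);
		return (-1);
	}
	*out = groups;
	return (n);
}
```

The buffer scales with the process's real group count rather than with the system-wide maximum, which matters once kern.ngroups is tuned to a large value.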
ed
a6d91cab94 Remove the 1000 pseudo terminal limit from pts(4).
Even with the old utmp format, we could in fact go to pts/9999, because
ut_line wasn't guaranteed to be null terminated there.
2010-01-13 21:22:23 +00:00
brooks
093cb4c3ba Declare the kern.ngroups sysctl to be read-only, but tunable at boot for
better error reporting.

Submitted by:	Matthew Fleming <matthew dot fleming at isilon dot com>
MFC After:	1 month
2010-01-12 18:20:20 +00:00
brooks
a093b41daf Replace the static NGROUPS=NGROUPS_MAX+1=1024 with a dynamic
kern.ngroups+1.  kern.ngroups can range from NGROUPS_MAX=1023 to
INT_MAX-1.  Given that the Windows group limit is 1024, this range
should be sufficient for most applications.

MFC after:	1 month
2010-01-12 07:49:34 +00:00
bz
bdc1b5009d Change DDB show prison:
- name some columns more closely to the user space variables,
  as we do for host.* or allow.* (in the listing) already.
- print pr_childmax (children.max).
- prefix hex values with 0x.

MFC after:	3 weeks
2010-01-11 22:34:25 +00:00
bz
4ba08a5642 Adjust a comment to reflect reality, as we have proper source
address selection, even for IPv4, since r183571.

Pointed out by:	Jase Thew (bazerka beardz.net)
MFC after:	3 days
2010-01-11 21:21:30 +00:00
mckusick
0cddeb2cb4 Background:
When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
    filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
    in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
    with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate privilege.

Reported by:    	jeff
Security issues:	rwatson
2010-01-11 20:44:05 +00:00
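
The "verify the old value, then change it" behaviour described for setdotdot can be sketched as follows. This is a toy model with hypothetical names (`sketch_dir`, `sketch_setdotdot`) and an assumed error code. It says nothing about how the actual sysctl handler is written.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy directory: only the inode number stored for ".." is modelled. */
struct sketch_dir {
	unsigned long dotdot_ino;
};

/*
 * Change ".." to newvalue only if it currently equals oldvalue.
 * Returns 0 on success, or EINVAL (an assumed choice) if the directory
 * no longer matches what the caller observed.
 */
static int
sketch_setdotdot(struct sketch_dir *d, unsigned long oldvalue,
    unsigned long newvalue)
{
	assert(d != NULL);
	if (d->dotdot_ino != oldvalue)
		return (EINVAL);
	d->dotdot_ino = newvalue;
	return (0);
}
```

Passing the expected old value lets a background checker working from a snapshot avoid overwriting a directory that has changed since the snapshot was taken.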
imp
25563df1b0 Merge change r198561 from projects/mips to head:
r198561 | thompsa | 2009-10-28 15:25:22 -0600 (Wed, 28 Oct 2009) | 4 lines
Allow a scratch buffer to be set in order to be able to use setenv() while
booting, before dynamic kenv is running. A few platforms implement their own
scratch+sprintf handling to save data from the boot environment.
2010-01-10 22:34:18 +00:00
davidxu
5fb7f00d2f Make a chain a list of queues, and make threads waiting
for the same key coalesce into the same queue. This shortens the search
path and improves performance.
Also fix comments about shared PI-mutex.
2010-01-10 09:31:57 +00:00
brooks
5e7cdd35de Correct the explanation text for kern.ngroups. It reflects the
number of supplemental groups, not the total number of groups.

MFC after:	3 days
2010-01-09 23:22:31 +00:00
davidxu
871ba2b0e0 Use enum to define key types.
Suggested by:	jmallett
2010-01-09 06:30:40 +00:00
davidxu
715f123cec put semaphore waiter in long term list. 2010-01-09 06:12:44 +00:00
davidxu
b4d682588b Add key type TYPE_SEM. 2010-01-09 06:05:31 +00:00
attilio
fde84f320b Introduce the new kernel thread called "deadlock resolver".
While the name is pretentious, a good explanation of its targets is
reported in this 17 months old presentation e-mail:
http://lists.freebsd.org/pipermail/freebsd-arch/2008-August/008452.html

In order to implement it, the sq_type in sleepqueues is now mandatory and no longer
compiled only with the INVARIANTS option. Additionally, a new sleepqueue
function, sleepq_type() is added, returning the type of the sleepqueue
linked to a wchan.
Three new sysctls are added in order to configure the thread:
debug.deadlkres.slptime_threshold
debug.deadlkres.blktime_threshold
debug.deadlkres.sleepfreq

representing the thresholds for sleep and block time that will lead to
a deadlock match (when exceeded), while sleepfreq represents the
number of seconds between 2 consecutive runs of the thread.
In order to enable the deadlock resolver thread, recompile your kernel
with the option DEADLKRES.

Reviewed by:	jeff
Tested by:	pho, Giovanni Trematerra
Sponsored by:	Nokia Incorporated, Sandvine Incorporated
MFC after:	2 weeks
2010-01-09 01:46:38 +00:00
brueffer
69131979da Free allocated sbufs before returning ENOMEM.
PR:		128335
Submitted by:	Mateusz Guzik <mjguzik@gmail.com>
MFC after:	2 weeks
2010-01-08 22:58:50 +00:00
attilio
7fdd56a8c5 - Fix a bug in sched_4bsd where the timestamp for the sleeping operation
is not cleaned up on the wakeup but reset.
  This is harmless mostly because td_slptick (and ki_slptime from
  userland) should be analyzed only with the assumption that the thread
  is actually sleeping (thus while the td_slptick is correctly set) but
  without this invariant the number is no longer consistent.
- Move td_slptick from u_int to int in order to follow 'ticks' signedness
  and wrap up accordingly [0]

[0] Submitted by:	emaste
Sponsored by:		Sandvine Incorporated
MFC			1 week
2010-01-08 14:55:11 +00:00
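
Why a tick stamp benefits from following the signedness of the tick counter can be seen with a small wraparound calculation. The helper below is a sketch with a hypothetical name (`ticks_elapsed`). It assumes a two's-complement `int` and does the subtraction in unsigned arithmetic to avoid signed overflow.

```c
#include <limits.h>

/*
 * Number of ticks between 'then' and 'now', both samples of a signed
 * counter that wraps from INT_MAX to INT_MIN. The unsigned subtraction
 * wraps cleanly; converting back to int gives a small signed delta as
 * long as the two samples are less than INT_MAX ticks apart.
 */
static int
ticks_elapsed(int now, int then)
{
	return ((int)((unsigned int)now - (unsigned int)then));
}
```

For instance, a stamp taken 4 ticks before the counter wraps and read 5 ticks after the wrap still yields a delta of 10.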
mbr
7450f52a57 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
attilio
a576a41d1d Fix typos. 2010-01-07 01:24:09 +00:00
attilio
80498ec26c Tweak comments. 2010-01-07 01:19:01 +00:00
attilio
2e14be290b Exclusive waiters sleeping with LK_SLEEPFAIL on and using interruptible
sleeps/timeouts may have left spurious lk_exslpfail counts on, so clean
them up even when handling a shared queue acquisition, giving
lk_exslpfail the value of an 'upper limit'.
In the worst-case scenario (mixed
interruptible sleep / LK_SLEEPFAIL waiters), both
queues may be awakened even if that is not necessary, but this does no harm.

Reported by:	Lucius Windschuh <lwindschuh at googlemail dot com>
Reviewed by:	kib
Tested by:	pho, Lucius Windschuh <lwindschuh at googlemail dot com>
2010-01-07 00:47:50 +00:00
davidxu
87c8a1faf2 Use umtx to implement process-shareable semaphores. To make this work,
type sema_t is now a structure which can be put in a shared memory area,
and multiple processes can operate on it concurrently.
Users can either use mmap(MAP_SHARED) + sem_init(pshared=1) or use sem_open()
to initialize a shared semaphore.
Named semaphores use the file system and are located in the /tmp directory, with
file names prefixed with 'SEMD', so they are now chroot and jail friendly.
In the simplest cases, for both named and unnamed semaphores, userland code
does not have to enter the kernel to decrease/increase the semaphore's count.
The semaphore is designed to be crash-safe: even if an application
crashes in the middle of operating on a semaphore, the semaphore state is
still safely recovered by later use; there is no waiter counter maintained
by userland code.
The main semaphore code is in libc, and libthr only has some necessary stubs;
this makes it possible for a non-threaded application to use semaphores
without linking to the thread library.
The old semaphore implementation is kept in libc to maintain binary compatibility.
The kernel ksem API is no longer used in the new implementation.

Discussed on: threads@
2010-01-05 02:37:59 +00:00
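
The mmap(MAP_SHARED) + sem_init(pshared=1) route mentioned in this commit can be exercised with standard POSIX calls. The following is a minimal sketch: `shared_sem_demo` is a hypothetical helper, an anonymous mapping is assumed to be available, and some older systems need the thread library at link time.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Place a semaphore in shared anonymous memory, fork, let the child
 * post it and the parent wait on it. Returns 0 if the parent was
 * released by the child's post, -1 on any failure.
 */
static int
shared_sem_demo(void)
{
	sem_t *sem;
	pid_t pid;
	int rc;

	sem = mmap(NULL, sizeof(*sem), PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (sem == MAP_FAILED)
		return (-1);
	if (sem_init(sem, 1, 0) != 0) {   /* pshared = 1, initial count 0 */
		munmap(sem, sizeof(*sem));
		return (-1);
	}
	pid = fork();
	if (pid < 0) {
		sem_destroy(sem);
		munmap(sem, sizeof(*sem));
		return (-1);
	}
	if (pid == 0) {
		/* Child: the mapping is shared, so this post reaches the parent. */
		sem_post(sem);
		_exit(0);
	}
	/* Parent: retry if the wait is interrupted by a signal. */
	while ((rc = sem_wait(sem)) != 0 && errno == EINTR)
		continue;
	waitpid(pid, NULL, 0);
	sem_destroy(sem);
	munmap(sem, sizeof(*sem));
	return (rc == 0 ? 0 : -1);
}
```

Because the initial count is 0, the parent can only return from sem_wait() after the child's sem_post(), which demonstrates that both processes operate on the same object.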
ed
74b0526bbe Make TIOCSTI work again.
It looks like I didn't implement this when I imported MPSAFE TTY.
Applications like mail(1) still use this. I think it's conceptually bad.

Tested by:	Pete French <petefrench ticketswitch com>
MFC after:	2 weeks
2010-01-04 20:59:52 +00:00
trasz
85a258e1b1 Fix comments. 2010-01-04 12:39:42 +00:00
davidxu
bbf7e232ea Add a user-level semaphore synchronization type. This change allows multiple
processes to share a semaphore by using a shared memory area. In the simplest case,
only one atomic operation is needed in userland; the waiter flag is maintained by
the kernel and userland only checks the flag. If the flag is set, user code enters
the kernel and does a wakeup() call.
Move type definitions into file _umtx.h to minimize compile time.
Also, type names need to be prefixed with an underscore character, which would reduce
name conflicts (still in progress).
2010-01-04 05:27:49 +00:00
brooks
cfffc49d99 If a filter has already been added, actually return EEXIST when trying
to add it again.

MFC after:	1 week
2009-12-31 20:56:28 +00:00
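
The behaviour fixed here, refusing a duplicate registration with EEXIST, can be sketched with a small table. All names (`sketch_filters`, `sketch_add_filter`, `SKETCH_NFILTERS`) are hypothetical, and the table layout is an assumption made for illustration.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_NFILTERS 8

/* One slot per filter number; NULL means "not registered". */
static const void *sketch_filters[SKETCH_NFILTERS];

/*
 * Register 'ops' in 'slot'. Returns 0 on success, EINVAL for a bad
 * argument, and EEXIST if the slot is already occupied (the existing
 * registration is left untouched).
 */
static int
sketch_add_filter(int slot, const void *ops)
{
	if (slot < 0 || slot >= SKETCH_NFILTERS || ops == NULL)
		return (EINVAL);
	if (sketch_filters[slot] != NULL)
		return (EEXIST);
	sketch_filters[slot] = ops;
	return (0);
}
```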
brooks
a5cc24440b The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175.  Remove all definitions, documentation, and usage.

fifo_misc.c:
	Remove all kqueue tests as fifo_io.c performs all those that
	would have remained.

Reviewed by:	rwatson
MFC after:	3 weeks
X-MFC note:	don't change vlan_link_state() function signature
2009-12-31 20:29:58 +00:00
kib
fe41ad464e Allow swap out of the kernel stack for a thread with priority greater
than or equal to PSOCK, not less than or equal. Higher priority has a lesser
numerical value.

Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.

Tested by:	pho
Reviewed by:	jhb
MFC after:	1 week
2009-12-31 18:52:58 +00:00
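
Because a numerically larger value means a lower priority, the direction of the comparison is easy to get backwards. The sketch below shows the corrected test. `SKETCH_PSOCK` is a placeholder constant whose value is an assumption, and `sketch_may_swap_out` is a hypothetical name.

```c
#include <stdbool.h>

/* Placeholder threshold; the real PSOCK value comes from kernel headers. */
#define SKETCH_PSOCK 88

/*
 * A thread's kernel stack may be swapped out only if its priority value
 * is at or above the threshold, i.e. the thread is of equal or lower
 * priority than PSOCK.
 */
static bool
sketch_may_swap_out(int priority)
{
	return (priority >= SKETCH_PSOCK);
}
```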
jhb
9d473f39f1 Actually set RLE_ALLOCATED when allocating a reserved resource so that
resource_list_release() will later release the resource instead of failing.
2009-12-30 22:37:28 +00:00
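
The bug class fixed here is a flag that the release path checks but the allocation path never set. A toy model is shown below, with hypothetical names, assumed flag values (`SK_RLE_*`) and assumed error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_RLE_RESERVED   0x1   /* entry holds a reserved resource */
#define SK_RLE_ALLOCATED  0x2   /* reserved resource handed to a child */

struct sketch_rle {
	int flags;
};

/* Hand out the reserved resource and record that it is in use. */
static int
sketch_rl_alloc(struct sketch_rle *rle)
{
	assert(rle != NULL);
	if ((rle->flags & SK_RLE_ALLOCATED) != 0)
		return (EBUSY);
	rle->flags |= SK_RLE_ALLOCATED;   /* the step that was missing */
	return (0);
}

/* Release succeeds only for entries marked as allocated. */
static int
sketch_rl_release(struct sketch_rle *rle)
{
	assert(rle != NULL);
	if ((rle->flags & SK_RLE_ALLOCATED) == 0)
		return (ENOENT);
	rle->flags &= ~SK_RLE_ALLOCATED;
	return (0);
}
```

Without the flag being set in the allocation step, every later release of a reserved resource would take the failure branch.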
jhb
8d2a9e1d3c - Assert that a reserved resource returned via resource_list_alloc() is not
active.
- Fix bus_generic_rl_(alloc|release)_resource() to not attempt to fetch a
  resource list for grandchild devices, but just pass those requests up to
  the parent directly.  This worked by accident previously, but it is
  better to not let bus drivers try to operate on devices they do not
  manage.
2009-12-30 19:44:31 +00:00