Commit Graph

11494 Commits

Author SHA1 Message Date
Brooks Davis
5feedc2575 Correct the explination text for the kern.ngroups. It reflects the
number of supplemental groups, not the total number of groups.

MFC after:	3 days
2010-01-09 23:22:31 +00:00
David Xu
73532aa78c Use enum to define key types.
Suggested by:	jmallett
2010-01-09 06:30:40 +00:00
David Xu
4904f91fe0 put semaphore waiter in long term list. 2010-01-09 06:12:44 +00:00
David Xu
2c3b3fef36 Add key type TYPE_SEM. 2010-01-09 06:05:31 +00:00
Attilio Rao
f7829d0d5c Introduce the new kernel thread called "deadlock resolver".
While the name is pretentious, a good explanation of its targets is
reported in this 17 months old presentation e-mail:
http://lists.freebsd.org/pipermail/freebsd-arch/2008-August/008452.html

In order to implement it, the sq_type in sleepqueues is mandatory and not
only compiled along with INVARIANTS option. Additively, a new sleepqueue
function, sleepq_type() is added, returning the type of the sleepqueue
linked to a wchan.
Three new sysctls are added in order to configure the thread:
debug.deadlkres.slptime_threshold
debug.deadlkres.blktime_threshold
debug.deadlkres.sleepfreq

rappresenting the thresholds for sleep and block time that will lead to
a deadlock matching (when exceeded), while the sleepfreq rappresents the
number of seconds between 2 consecutive thread runnings.
In order to enable the deadlock resolver thread recompile your kernel
with the option DEADLKRES.

Reviewed by:	jeff
Tested by:	pho, Giovanni Trematerra
Sponsored by:	Nokia Incorporated, Sandvine Incorporated
MFC after:	2 weeks
2010-01-09 01:46:38 +00:00
Christian Brueffer
30215f483e Free allocated sbufs before returning ENOMEM.
PR:		128335
Submitted by:	Mateusz Guzik <mjguzik@gmail.com>
MFC after:	2 week
2010-01-08 22:58:50 +00:00
Attilio Rao
6eac7e5752 - Fix a bug in sched_4bsd where the timestamp for the sleeping operation
is not cleaned up on the wakeup but reset.
  This is harmless mostly because td_slptick (and ki_slptime from
  userland) should be analyzed only with the assumption that the thread
  is actually sleeping (thus while the td_slptick is correctly set) but
  without this invariant the number is nomore consistent.
- Move td_slptick from u_int to int in order to follow 'ticks' signedness
  and wrap up accordingly [0]

[0] Submitted by:	emaste
Sponsored by:		Sandvine Incorporated
MFC			1 week
2010-01-08 14:55:11 +00:00
Martin Blapp
c2ede4b379 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
Attilio Rao
aab9c8c28d Fix typos. 2010-01-07 01:24:09 +00:00
Attilio Rao
c636ba831a Tweak comments. 2010-01-07 01:19:01 +00:00
Attilio Rao
9dbf7a62f4 Exclusive waiters sleeping with LK_SLEEPFAIL on and using interruptible
sleeps/timeout may have left spourious lk_exslpfail counts on, so clean
it up even when accessing a shared queue acquisition, giving to
lk_exslpfail the value of 'upper limit'.
In the worst case scenario, infact (mixed
interruptible sleep / LK_SLEEPFAIL waiters) what may happen is that both
queues are awaken even if that's not necessary, but still no harm.

Reported by:	Lucius Windschuh <lwindschuh at googlemail dot com>
Reviewed by:	kib
Tested by:	pho, Lucius Windschuh <lwindschuh at googlemail dot com>
2010-01-07 00:47:50 +00:00
David Xu
9b0f1823b5 Use umtx to implement process sharable semaphore, to make this work,
now type sema_t is a structure which can be put in a shared memory area,
and multiple processes can operate it concurrently.
User can either use mmap(MAP_SHARED) + sem_init(pshared=1) or use sem_open()
to initialize a shared semaphore.
Named semaphore uses file system and is located in /tmp directory, and its
file name is prefixed with 'SEMD', so now it is chroot or jail friendly.
In simplist cases, both for named and un-named semaphore, userland code
does not have to enter kernel to reduce/increase semaphore's count.
The semaphore is designed to be crash-safe, it means even if an application
is crashed in the middle of operating semaphore, the semaphore state is
still safely recovered by later use, there is no waiter counter maintained
by userland code.
The main semaphore code is in libc and libthr only has some necessary stubs,
this makes it possible that a non-threaded application can use semaphore
without linking to thread library.
Old semaphore implementation is kept libc to maintain binary compatibility.
The kernel ksem API is no longer used in the new implemenation.

Discussed on: threads@
2010-01-05 02:37:59 +00:00
Ed Schouten
328d9d2c96 Make TIOCSTI work again.
It looks like I didn't implement this when I imported MPSAFE TTY.
Applications like mail(1) still use this. I think it's conceptually bad.

Tested by:	Pete French <petefrench ticketswitch com>
MFC after:	2 weeks
2010-01-04 20:59:52 +00:00
Edward Tomasz Napierala
922ec47140 Fix comments. 2010-01-04 12:39:42 +00:00
David Xu
79f8b61995 Add user-level semaphore synchronous type, this change allows multiple
processes to share semaphore by using shared memory area, in simplest case,
only one atomic operation is needed in userland, waiter flag is maintained by
kernel and userland only checks the flag, if the flag is set, user code enters
kernel and does a wakeup() call.
Move type definitions into file _umtx.h to minimize compiling time.
Also type names need to be prefixed with underline character, this would reduce
name conflict (still in progress).
2010-01-04 05:27:49 +00:00
Brooks Davis
7eb5db5179 If a filter has already been added, actually return EEXIST when trying
at add it again.

MFC after:	1 week
2009-12-31 20:56:28 +00:00
Brooks Davis
a6fffd6cb0 The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175.  Remove all definitions, documentation, and usage.

fifo_misc.c:
	Remove all kqueue tests as fifo_io.c performs all those that
	would have remained.

Reviewed by:	rwatson
MFC after:	3 weeks
X-MFC note:	don't change vlan_link_state() function signature
2009-12-31 20:29:58 +00:00
Konstantin Belousov
17c4c3563c Allow swap out of the kernel stack for the thread with priority greater
or equial then PSOCK, not less or equial. Higher priority has lesser
numerical value.

Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.

Tested by:	pho
Reviewed by:	jhb
MFC after:	1 week
2009-12-31 18:52:58 +00:00
John Baldwin
6a41dc1043 Actually set RLE_ALLOCATED when allocating a reserved resource so that
resource_list_release() will later release the resource instead of failing.
2009-12-30 22:37:28 +00:00
John Baldwin
0eb9893a80 - Assert that a reserved resource returned via resource_list_alloc() is not
active.
- Fix bus_generic_rl_(alloc|release)_resource() to not attempt to fetch a
  resource list for grandchild devices, but just pass those requests up to
  the parent directly.  This worked by accident previously, but it is
  better to not let bus drivers try to operate on devices they do not
  manage.
2009-12-30 19:44:31 +00:00
Robert Noland
cfd7bacef2 Update d_mmap() to accept vm_ooffset_t and vm_memattr_t.
This replaces d_mmap() with the d_mmap2() implementation and also
changes the type of offset to vm_ooffset_t.

Purge d_mmap2().

All driver modules will need to be rebuilt since D_VERSION is also
bumped.

Reviewed by:	jhb@
MFC after:	Not in this lifetime...
2009-12-29 21:51:28 +00:00
Edward Tomasz Napierala
6cb02977e2 SLIP is gone; remove its mutex from witness. 2009-12-29 08:45:27 +00:00
Ed Schouten
62375ca8c1 Don't forget to use `void' for sched_balance(). It has no arguments. 2009-12-28 23:12:12 +00:00
Antoine Brodin
13e403fdea (S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR:		137213
Submitted by:	Eygene Ryabinkin (initial version)
MFC after:	1 month
2009-12-28 22:56:30 +00:00
Konstantin Belousov
a411786576 Add a knob to allow reclaim of the directory vnodes that are source of
the namecache records. The reclamation is not enabled by default because
for typical workload it would make namecache unusable, but large nested
directory tree easily puts any process that accesses filesystem into 1
second wait for vlru.

Reported by:	yar (long time ago)
MFC after:	3 days
2009-12-28 15:35:39 +00:00
Edward Tomasz Napierala
558e9b5c95 Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND
is not being used improperly.
2009-12-26 11:36:10 +00:00
Bjoern A. Zeeb
e4ff598ec6 Remove extra spaces (no functional change).
MFC after:	3 days
2009-12-25 21:14:05 +00:00
Bjoern A. Zeeb
095809b084 Remove an unused global.
MFC after:	3 days
2009-12-25 20:03:03 +00:00
Robert Watson
c7ca33d138 Minor comment tweaks in rmlocks.
MFC after:	3 days
2009-12-25 01:16:24 +00:00
Konstantin Belousov
49e3050e6c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
Ed Schouten
907b48bc05 Fix indentation. 2009-12-20 22:55:27 +00:00
Ed Schouten
8dc9b4cf04 Let access overriding to TTYs depend on the cdev_priv, not the vnode.
Basically this commit changes two things, which improves access to TTYs
in exceptional conditions. Basically the problem was that when you ran
jexec(8) to attach to a jail, you couldn't use /dev/tty (well, also the
node of the actual TTY, e.g. /dev/pts/X). This is very inconvenient if
you want to attach to screens quickly, use ssh(1), etc.

The fixes:

- Cache the cdev_priv of the controlling TTY in struct session. Change
  devfs_access() to compare against the cdev_priv instead of the vnode.
  This allows you to bypass UNIX permissions, even across different
  mounts of devfs.

- Extend devfs_prison_check() to unconditionally expose the device node
  of the controlling TTY, even if normal prison nesting rules normally
  don't allow this. This actually allows you to interact with this
  device node.

To be honest, I'm not really happy with this solution. We now have to
store three pointers to a controlling TTY (s_ttyp, s_ttyvp, s_ttydp).
In an ideal world, we should just get rid of the latter two and only use
s_ttyp, but this makes certian pieces of code very impractical (e.g.
devfs, kern_exit.c).

Reported by:	Many people
2009-12-19 18:42:12 +00:00
Edward Tomasz Napierala
28d3fd007e Interpret VAPPEND correctly in vaccess_acl_nfs4(9). 2009-12-19 11:41:52 +00:00
Ed Schouten
e6d84d057a Make the wchan names of pts(4) fit in top(1).
Just like a similar change we made to the TTY code about half a year
ago, make these strings look similar.

Suggested by:	Jille Timmermans <jille@quis.cx>
2009-12-18 20:11:29 +00:00
Andrew Thompson
2b54315009 If the runcount is non-zero in eventhandler_deregister() then one or more
threads are executing the eventhandler, sleep in this case to make it safe for
module unload. If the runcount was up then an entry would have been marked
EHE_DEAD_PRIORITY so use this as a trigger to do the wakeup in
eventhandler_prune_list().

Reviewed by:	jhb
2009-12-17 21:17:13 +00:00
Matt Jacob
e7d829a46c Fix argument order in a call to mtx_init.
MFC after:	1 week
2009-12-17 00:22:56 +00:00
Luigi Rizzo
20c510f826 Properly fix callout handling by putting all the per-cpu info in
struct callout_cpu. From the comment in the file:

+ * There is one struct callout_cpu per cpu, holding all relevant
+ * state for the callout processing thread on the individual CPU.
+ * In particular:
+ *     cc_ticks is incremented once per tick in callout_cpu().
+ *     It tracks the global 'ticks' but in a way that the individual
+ *     threads should not worry about races in the order in which
+ *     hardclock() and hardclock_cpu() run on the various CPUs.
+ *     cc_softclock is advanced in callout_cpu() to point to the
+ *     first entry in cc_callwheel that may need handling. In turn,
+ *     a softclock() is scheduled so it can serve the various entries i
+ *     such that cc_softclock <= i <= cc_ticks .

Together with a smaller patch committed in september, this fixes a
bug that affects 8.0 with apps that rely on callouts to fire exactly
in the number of ticks specified (qemu among them).
Right now, callouts in 8.0 fire one tick late.

This was discussed in september with JeffR and jhb

MFC after:	3 days
2009-12-14 12:23:46 +00:00
Bjoern A. Zeeb
de0bd6f76b Throughout the network stack we have a few places of
if (jailed(cred))
left.  If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that,  as it is used
outside the network stack as well.

Discussed with:	julian, zec, jamie, rwatson (back in Sept)
MFC after:	5 days
2009-12-13 13:57:32 +00:00
Attilio Rao
2028867def In current code, threads performing an interruptible sleep (on both
sxlock, via the sx_{s, x}lock_sig() interface, or plain lockmgr), will
leave the waiters flag on forcing the owner to do a wakeup even when if
the waiter queue is empty.
That operation may lead to a deadlock in the case of doing a fake wakeup
on the "preferred" (based on the wakeup algorithm) queue while the other
queue has real waiters on it, because nobody is going to wakeup the 2nd
queue waiters and they will sleep indefinitively.

A similar bug, is present, for lockmgr in the case the waiters are
sleeping with LK_SLEEPFAIL on.  In this case, even if the waiters queue
is not empty, the waiters won't progress after being awake but they will
just fail, still not taking care of the 2nd queue waiters (as instead the
lock owned doing the wakeup would expect).

In order to fix this bug in a cheap way (without adding too much locking
and complicating too much the semantic) add a sleepqueue interface which
does report the actual number of waiters on a specified queue of a
waitchannel (sleepq_sleepcnt()) and use it in order to determine if the
exclusive waiters (or shared waiters) are actually present on the lockmgr
(or sx) before to give them precedence in the wakeup algorithm.
This fix alone, however doesn't solve the LK_SLEEPFAIL bug. In order to
cope with it, add the tracking of how many exclusive LK_SLEEPFAIL waiters
a lockmgr has and if all the waiters on the exclusive waiters queue are
LK_SLEEPFAIL just wake both queues.

The sleepq_sleepcnt() introduction and ABI breakage require
__FreeBSD_version bumping.

Reported by:	avg, kib, pho
Reviewed by:	kib
Tested by:	pho
2009-12-12 21:31:07 +00:00
John Baldwin
42a346fa63 For some buses, devices may have active resources assigned even though they
are not allocated by the device driver.  These resources should still appear
allocated from the system's perspective so that their assigned ranges are
not reused by other resource requests.  The PCI bus driver has used a hack
to effect this for a while now where it uses rman_set_device() to assign
devices to the PCI bus when they are first encountered and later assigns
them to the actual device when a driver allocates a BAR.  A few downsides of
this approach is that it results in somewhat confusing devinfo -r output as
well as not being very easily portable to other bus drivers.

This commit adds generic support for "reserved" resources to the resource
list API used by many bus drivers to manage the resources of child devices.
A resource may be reserved via resource_list_reserve().  This will allocate
the resource from the bus' parent without activating it.
resource_list_alloc() recognizes an attempt to allocate a reserved resource.
When this happens it activates the resource (if requested) and then returns
the reserved resource.  Similarly, when a reserved resource is released via
resource_list_release(), it is deactivated (if it is active) and the
resource is then marked reserved again, but is left allocated from the
bus' parent.  To completely remove a reserved resource, a bus driver may
use resource_list_unreserve().  A bus driver may use resource_list_busy()
to determine if a reserved resource is allocated by a child device or if
it can be unreserved.

The PCI bus driver has been changed to use this framework instead of
abusing rman_set_device() to keep track of reserved vs allocated resources.

Submitted by:	imp (an older version many moons ago)
MFC after:	1 month
2009-12-09 21:52:53 +00:00
Edward Tomasz Napierala
9d7031a6d6 Don't add VAPPEND if the file is not being opened for writing. Note that this
only affects cases where open(2) is being used improperly - i.e. when the user
specifies O_APPEND without O_WRONLY or O_RDWR.

Reviewed by:	rwatson
2009-12-08 20:47:10 +00:00
Konstantin Belousov
4f17d481ed Remove wrong assertion. Debugee is allowed to lose a signal.
Reported and tested by:	jh
MFC after:	2 weeks
2009-12-03 20:16:59 +00:00
Edward Tomasz Napierala
6bb58cdd0f Add change that was somehow missed in r192586. It could manifest by
incorrectly returning EINVAL from acl_valid(3) for applications linked
against pre-8.0 libc.
2009-12-03 13:29:24 +00:00
Ed Schouten
6eaf04022b Don't allocate an input buffer for a TTY when the receiver is turned off.
When the termios CREAD flag is not set, it makes little sense to
allocate an input buffer. Just set the size to 0 in this case to reduce
memory footprint.

Disallow CREAD to be disabled for pseudo-devices to prevent
foot-shooting.
2009-12-01 19:14:57 +00:00
Alan Cox
a6d42a0d62 Replace VM_PROT_OVERRIDE_WRITE by VM_PROT_COPY. VM_PROT_OVERRIDE_WRITE has
represented a write access that is allowed to override write protection.
Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into
text pages.  Text pages are not just write protected but they are also
copy-on-write.  VM_PROT_OVERRIDE_WRITE overrides the write protection on the
text page and triggers the replication of the page so that the breakpoint
will be written to a private copy.  However, here is where things become
confused.  It is the debugger, not the process being debugged that requires
write access to the copied page.  Nonetheless, the copied page is being
mapped into the process with write access enabled.  In other words, once the
debugger sets a breakpoint within a text page, the program can write to its
private copy of that text page.  Whereas prior to setting the breakpoint, a
SIGSEGV would have occurred upon a write access.  VM_PROT_COPY addresses
this problem.  The combination of VM_PROT_READ and VM_PROT_COPY forces the
replication of a copy-on-write page even though the access is only for read.
Moreover, the replicated page is only mapped into the process with read
access, and not write access.

Reviewed by:	kib
MFC after:	4 weeks
2009-11-26 05:16:07 +00:00
Ivan Voras
cbc4ea28e2 Make ULE process usage (%CPU) accounting usable again by keeping track
of the last tick we incremented on.

Submitted by:	matthew.fleming/at/isilon.com, is/at/rambler-co.ru
Reviewed by:	jeff (who thinks there should be a better way in the future)
Approved by:	gnn (mentor)
MFC after:	3 weeks
2009-11-24 19:57:41 +00:00
Konstantin Belousov
080136212f On the return path from F_RDAHEAD and F_READAHEAD fcntls, do not
unlock Giant twice.

While there, bring conditions in the do/while loops closer to style,
that also makes the lines fit into 80 columns.

Reported and tested by:	dougb
2009-11-20 22:22:53 +00:00
Jaakko Heinonen
10d843a446 Extend ddb(4) "show mount" command to print active string mount options.
Note that only option names are printed, not values.

Reviewed by:	pjd
Approved by:	trasz (mentor)
MFC after:	2 weeks
2009-11-19 14:33:03 +00:00
Oleksandr Tymoshenko
3ea6157e6b - Unbreak build with KLD_DEBUG defined
- Add debug.kld_debug sysctl to control KLD debugging level
- Print information about KLD dependencies with debug enabled
2009-11-17 21:56:12 +00:00
Konstantin Belousov
a3de221dbe Among signal generation syscalls, only sigqueue(2) is allowed by POSIX
to fail due to lack of resources to queue siginfo. Add KSI_SIGQ flag
that allows sigqueue_add() to fail while trying to allocate memory for
new siginfo. When the flag is not set, behaviour is the same as for
KSI_TRAP: if memory cannot be allocated, set bit in sq_kill. KSI_TRAP is
kept to preserve KBI.

Add SI_KERNEL si_code, to be used in siginfo.si_code when signal is
generated by kernel. Deliver siginfo when signal is generated by kill(2)
family of syscalls (SI_USER with properly filled si_uid and si_pid), or
by kernel (SI_KERNEL, mostly job control or SIGIO). Since KSI_SIGQ flag
is not set for the ksi, low memory condition cause old behaviour.

Keep psignal(9) KBI intact, but modify it to generate SI_KERNEL
si_code. Pgsignal(9) and gsignal(9) now take ksi explicitely. Add
pksignal(9) that behaves like psignal but takes ksi, and ddb kill
command implemented as pksignal(..., ksi = NULL) to not do allocation
while in debugger.

While there, remove some register specifiers and use ANSI C prototypes.

Reviewed by:	davidxu
MFC after:	1 month
2009-11-17 11:39:15 +00:00