Commit Graph

218 Commits

Author SHA1 Message Date
Stefan Eßer
774e8d906f Fix compilation on 32 bit architectures and use INT64_MAX instead of
LONG_MAX for the upper bound check.
2013-12-19 21:35:33 +00:00
Stefan Eßer
53d5cc255d Fix overflow for timeout values of more than 68 years, which is the maximum
covered by sbintime (LONG_MAX seconds).

Some programs use timeout values in excess of 1000 years. The conversion
to sbintime caused wrap-around on overflow, which resulted in short or
negative timeout values. This caused long delays on sockets opened by
affected programs (e.g. OpenSSH).

Kernels compiled without -fno-strict-overflow were not affected, apparently
because the compiler tested the sign of the timeout value before performing
the multiplication that lead to overflow.

When the -fno-strict-overflow option was added to CFLAGS, this optimization
was disabled and the test was performed on the result of the multiplication.
Negative products were caught and resulted in EINVAL being returned, but
wrap-around to positive values just shortened the timeout value to the
residue of the result that could be represented by sbintime.

The fix is to cap the timeout values at the maximum that can be represented
by sbintime, which is 2^31 - 1 seconds or more than 68 years.

After this change, the kernel can be compiled with -fno-strict-overflow
with no ill effects.

MFC after:	3 days
2013-12-19 09:01:46 +00:00
Pawel Jakub Dawidek
ed5848c835 Replace CAP_POLL_EVENT and CAP_POST_EVENT capability rights (which I had
a very hard time to fully understand) with much more intuitive rights:

	CAP_EVENT - when set on descriptor, the descriptor can be monitored
		with syscalls like select(2), poll(2), kevent(2).

	CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2)
		syscall can be called on this kqueue to with the eventlist
		argument set to non-NULL value; in other words the given
		kqueue descriptor can be used to monitor other descriptors.
	CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2)
		syscall can be called on this kqueue to with the changelist
		argument set to non-NULL value; in other words it allows to
		modify events monitored with the given kqueue descriptor.

Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and
CAP_KQUEUE_CHANGE.

Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2013-11-15 19:55:35 +00:00
Jilles Tjoelker
1947c8a6d1 kqueue: Change error for kqueues rlimit from EMFILE to ENOMEM and document
this error condition in the kqueue(2) manual page.

Discussed with:	kib
2013-11-03 23:06:24 +00:00
Konstantin Belousov
9110db818a Add a resource limit for the total number of kqueues available to the
user.  Kqueue now saves the ucred of the allocating thread, to
correctly decrement the counter on close.

Under some specific and not real-world use scenario for kqueue, it is
possible for the kqueues to consume memory proportional to the square
of the number of the filedescriptors available to the process.  Limit
allows administrator to prevent the abuse.

This is kernel-mode side of the change, with the user-mode enabling
commit following.

Reported and tested by:	pho
Discussed with:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-10-21 16:44:53 +00:00
Konstantin Belousov
9d2abcd01a Do not allow negative timeouts for kqueue timers, check for the
negative timeout both before and after the conversion to sbintime_t.

For periodic kqueue timer, convert zero timeout into 1ms, to avoid
interrupt storm on fast event timers.

Reported and tested by:	pho
Discussed with:	mav
Reviewed by:	davide
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-26 13:17:31 +00:00
Konstantin Belousov
19f6a6a1ca Pre-acquire the filedesc sx when a possibility exists that the later
code could need to remove a kqueue from the filedesc list.  Global
lock is already locked, which causes sleepable after non-sleepable
lock acquisition.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2013-09-22 19:54:47 +00:00
Roman Divacky
b12698e1a1 Revert r255672, it has some serious flaws, leaking file references etc.
Approved by:	re (delphij)
2013-09-18 18:48:33 +00:00
Roman Divacky
253c75c0de Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by:    re (delphij)
Sponsored by:   Google Summer of Code
Submitted by:   Yuri Victorovich <yuri at rawbw dot com>
Tested by:      Yuri Victorovich <yuri at rawbw dot com>
2013-09-18 17:56:04 +00:00
Konstantin Belousov
e8de242d3a Use TAILQ instead of STAILQ for kqeueue filedescriptors to ensure constant
time removal on kqueue close.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (delphij)
2013-09-13 19:50:50 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
John-Mark Gurney
57150ff69b fix up some comments and a white space issue...
MFC after:	3 days
2013-08-26 18:53:19 +00:00
Gleb Smirnoff
ca04d21d5f Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
John Baldwin
e05bf4cf95 Some small cleanups to the fixes in r180340:
- Set NOTE_TRACKERR before running filt_proc().  If the knote did not
  have NOTE_FORK set in fflags when registered, then the TRACKERR event
  could miss being posted.
- Don't pass the pid in to filt_proc() for NOTE_FORK events.  The special
  handling for pids is done knote_fork() directly and no longer in
  filt_proc().

MFC after:	2 weeks
2013-08-13 18:45:58 +00:00
John Baldwin
5b596f0f5f Don't emit a spurious EVFILT_PROC event with no fflags set on process exit
if NOTE_EXIT is not being monitored.  The rationale is that a listener
should only get an event for exit() if they registered interest via
NOTE_EXIT.  This matches the behavior on OS X.
- Don't save the exit status on process exit unless NOTE_EXIT is being
  monitored.
- Add an internal EV_DROP flag that requests kqueue_scan() to free the
  knote without signalling it to userland and use this when a process
  exits but the fflags in the knote is zero.

Reviewed by:	jmg
MFC after:	1 month
2013-08-07 19:56:35 +00:00
Ed Schouten
2381f6ef8c Change callout use counter to use C11 atomics.
In order to get some coverage of C11 atomics in kernelspace, switch at
least one piece of code in kernelspace to use C11 atomics instead of
<machine/atomic.h>.

While there, slightly improve the code by adding an assertion to prevent
the use count from going negative.
2013-06-16 09:30:35 +00:00
Alexander Motin
21a37a7196 Rework overflow checks of r247898 to not let too "intelligent" compiler to
optimize it out.

Submitted by:	bde
2013-03-09 09:07:13 +00:00
Alexander Motin
836972b877 Fix off-by-one error in nanoseconds validation.
Submitted by:	bde
2013-03-07 16:50:07 +00:00
Alexander Motin
980c545d76 Fix time math overflows and improve zero intervals handling in poll(),
select(), nanosleep() and kevent() functions after calloutng changes.

Reported by:	bde
2013-03-06 19:37:38 +00:00
Davide Italiano
40e794ab19 MFcalloutng:
- Rewrite kevent() timeout implementation to allow sub-tick precision.
- Make the interval timings for EVFILT_TIMER more accurate. This also
removes an hack introduced in r238424.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:55:16 +00:00
John Baldwin
2919668490 Make the interval timings for EVFILT_TIMER more accurate. tvtohz() always
adds an extra tick to account for the current partial clock tick.  However,
that is not appropriate for a repeating timer when the exact tvtohz() value
should be used for subsequent intervals.  Fix repeating callouts for
EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the
fix used in realitexpire() for interval timers.

While here, update a few comments to note that if the EVFILT_TIMER code
were to move out of kern_event.c, it should move to kern_time.c (where the
interval timer code it mimics lives) rather than kern_timeout.c.

MFC after:	1 month
2012-07-13 13:24:33 +00:00
Pawel Jakub Dawidek
a79de683f5 Update comment.
MFC after:	1 month
2012-06-14 17:32:58 +00:00
Alexander V. Chernikov
b25711e6b0 - Add knlist_init_rw_reader() function to kqueue(9).
Function acquired reader lock if needed.
Assert check for reader or writer lock (RA_LOCKED / RA_UNLOCKED)
- While here, add knlist_init_mtx.9 to MLINKS and fix some style(9) issues

Reviewed by:    glebius
Approved by:    ae(mentor)

MFC after:      2 weeks
2012-03-26 09:34:17 +00:00
Kip Macy
8451d0dd78 In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
Attilio Rao
6aba400a70 Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.

Sponsored by:	Sandvine Incorporated
Reported by:	rstone
Tested by:	pluknet
Reviewed by:	jhb, kib
Approved by:	re (bz)
MFC after:	3 weeks
2011-08-25 15:51:54 +00:00
Konstantin Belousov
9c00bb9190 Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by:	glebius
Reviewed by:	rwatson
Approved by:	re (bz)
2011-08-16 20:07:47 +00:00
Jonathan Anderson
d1b6899e83 Rename CAP_*_KEVENT to CAP_*_EVENT.
Change the names of a couple of capability rights to be less
FreeBSD-specific.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-12 14:26:47 +00:00
Robert Watson
a9d2f8d84f Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
Konstantin Belousov
1fe80828e7 After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by:	jhb
X-MFC-note:	do not
2011-04-01 13:28:34 +00:00
John Baldwin
7e3d78ae6d Defer freeing a kevent list until after dropping kqueue locks.
LOR:		185
Submitted by:	Matthew Fleming @ Isilon
MFC after:	1 week
2010-03-30 18:31:55 +00:00
Konstantin Belousov
d07dc8e3eb Do not leak process lock when current thread is not allowed to see target.
Bumped into by:	ed
MFC after:	3 days
2010-02-14 13:59:01 +00:00
Brooks Davis
7eb5db5179 If a filter has already been added, actually return EEXIST when trying
at add it again.

MFC after:	1 week
2009-12-31 20:56:28 +00:00
Brooks Davis
a6fffd6cb0 The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175.  Remove all definitions, documentation, and usage.

fifo_misc.c:
	Remove all kqueue tests as fifo_io.c performs all those that
	would have remained.

Reviewed by:	rwatson
MFC after:	3 weeks
X-MFC note:	don't change vlan_link_state() function signature
2009-12-31 20:29:58 +00:00
Konstantin Belousov
47d81c1b70 Postpone dropping fp till both kq_global and kqueue mutexes are
unlocked. fdrop() closes file descriptor when reference count goes to
zero. Close method for vnodes locks the vnode, resulting in "sleepable
after non-sleepable". For pipes, pipe mutex is before kqueue lock,
causing LOR.

Reported and tested by:	pho
MFC after:	2 weeks
2009-10-10 14:56:34 +00:00
Xin LI
13dcbd75c1 Use correct sizeof() object for klist 'list'. Currently, struct klist
contained only SLIST_HEAD as its member, thus sizeof(struct klist) would
equal to sizeof(struct klist *), so this change makes the code more
correct in terms of semantics, but should be a no-op to compiler at this
time.

Reported by:	MQ <antinvidia at gmail com>
2009-09-28 10:22:46 +00:00
Roman Divacky
1c2825bd80 Change unsigned foo to u_foo as required by style(9).
Requested by:	bde
Approved by:	ed (mentor)
2009-09-22 16:16:02 +00:00
Roman Divacky
6413e27b3a Fix the style of the previous commit.
Approved by:	ed (mentor, implicit)
2009-09-17 17:48:13 +00:00
Roman Divacky
abc8594d70 Make these argument/variable unsigned as the defines for them don't fit
into signed 32bit integer.

Approved by:	ed (mentor, implicit)
Approved by:	sson
2009-09-17 17:41:28 +00:00
Stacey Son
fdc1a1131e Add EV_RECEIPT to kevents.
EV_RECEIPT is useful to disambiguating error conditions when multiple
events structures are passed to kevent(2).  The error code is returned
in the data field and EV_ERROR is set.

Approved by:	rwatson (co-mentor)
2009-09-16 03:49:54 +00:00
Stacey Son
1a921c410a Add the EV_DISPATCH flag to kevents.
When the EV_DISPATCH flag is used the event source will be disabled
immediately after the delivery of an event.   This is similar to the
EV_ONESHOT flag but it doesn't delete the event.

Approved by:	rwatson (co-mentor)
2009-09-16 03:37:39 +00:00
Stacey Son
2c2e449905 Add EVFILT_USER to kevents.
Add user events support to kernel events which are not associated with any
kernel mechanism but are triggered by user level code.  This is useful for
adding user level events to an event handler that may also be monitoring
kernel events.

Approved by:	rwatson (co-mentor)
2009-09-16 03:30:12 +00:00
Stacey Son
95128e983f Add optional touch event filter hooks to kevents.
The touch event filter is called when a kernel event data is possibly
updated.  There are two hook points.  First, during a kevent() system
call.  Second, when an event has been triggered.

Approved by:	rwatson (co-mentor)
2009-09-16 03:15:57 +00:00
Robert Watson
e76d823b81 Use C99 initialization for struct filterops.
Obtained from:	Mac OS X
Sponsored by:	Apple Inc.
MFC after:	3 weeks
2009-09-12 20:03:45 +00:00
Stanislav Sedov
fe1d3f15f6 - Turn the third (islocked) argument of the knote call into flags parameter.
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
  without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag.  This is required
  for ZFS as its getattr implementation may sleep.

Approved by:	re (rwatson)
Reviewed by:	kib
MFC after:	2 weeks
2009-06-28 21:49:43 +00:00
Konstantin Belousov
d8b0556c6d Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by:	jhb [1]
Noted by:	jhb [2]
Reviewed by:	jhb
Tested by:	rnoland
2009-06-10 20:59:32 +00:00
Dag-Erling Smørgrav
e11e3f187d Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
2008-10-23 20:26:15 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Konstantin Belousov
7054ee4e38 The kqueue_register() function assumes that it is called from the top of
the syscall code and acquires various event subsystem locks as needed.
The handling of the NOTE_TRACK for EVFILT_PROC is currently done by
calling the kqueue_register() from filt_proc() filter, causing recursive
entrance of the kqueue code. This results in the LORs and recursive
acquisition of the locks.

Implement the variant of the knote() function designed to only handle
the fork() event. It mostly copies the knote() body, but also handles
the NOTE_TRACK, removing the handling from the filt_proc(), where it
causes problems described above. The function is called from the fork1()
instead of knote().

When encountering NOTE_TRACK knote, it marks the knote as influx
and drops the knlist and kqueue lock. In this context call to
kqueue_register is safe from the problems.

An error from the kqueue_register() is reported to the observer as
NOTE_TRACKERR fflag.

PR:	108201
Reviewed by:	jhb, Pramod Srinivasan <pramod juniper net> (previous version)
Discussed with:	jmg
Tested by:	pho
MFC after:	2 weeks
2008-07-07 09:30:11 +00:00
Konstantin Belousov
e1a32fd42b The r178914 I erronously put the setting of the KQ_FLUXWAIT flag before
KQ_FLUX_WAKEUP(). Since the later macro clears the KQ_FLUXWAIT, the
kqueue_scan() thread may be not woken up.

Move the setting of KQ_FLUXWAIT after wakeup to correct the issue.

Reported and tested by:	pho
MFC after:	3 days
2008-07-07 09:15:29 +00:00
Konstantin Belousov
e15864efd8 Kqueue_scan() may sleep when encountered the influx knotes. On the other
hand, it may cause other threads to sleep since kqueue_scan() may mark
some knotes as infux. This could lead to the deadlock.

Before kqueue_scan() sleeps, wakeup the threads that are waiting for the
influx knotes produced by this thread.

Tested by:	pho (previous version)
Reviewed by:	jmg
MFC after:	2 weeks
2008-05-10 11:37:05 +00:00
Konstantin Belousov
2e711e4d0d The kqueue_close() encountering the KN_INFLUX knotes on the kq being
closed is the legitimate situation. For instance, filedescriptor with
registered events may be closed in parallel with closing the kqueue.
Properly handle the case instead of asserting that this cannot happen.

Reported and tested by:	pho
Reviewed by:	jmg
MFC after:	2 weeks
2008-05-10 11:35:32 +00:00
Jeff Roberson
e8245292a7 - Convert two timeout users to the new callout_reset_curcpu() api.
Sponsored by:	Nokia
2008-04-02 11:21:42 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
John Baldwin
e46502943a Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
  a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
  object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
  vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by:	rwatson
2008-01-07 20:05:19 +00:00
Jeff Roberson
397c19d175 Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
   in such a way that the ops vector is only valid after the data, type,
   and flags are valid.
 - Protect f_flag and f_count with atomic operations.
 - Remove the global list of all files and associated accounting.
 - Rewrite the unp garbage collection such that it no longer requires
   the global list of all files and instead uses a list of all unp sockets.
 - Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by:	kris, pho
2007-12-30 01:42:15 +00:00
Jeff Roberson
ace8398da0 Refactor select to reduce contention and hide internal implementation
details from consumers.

 - Track individual selecters on a per-descriptor basis such that there
   are no longer collisions and after sleeping for events only those
   descriptors which triggered events must be rescaned.
 - Protect the selinfo (per descriptor) structure with a mtx pool mutex.
   mtx pool mutexes were chosen to preserve api compatibility with
   existing code which does nothing but bzero() to setup selinfo
   structures.
 - Use a per-thread wait channel rather than a global wait channel.
 - Hide select implementation details in a seltd structure which is
   opaque to the rest of the kernel.
 - Provide a 'selsocket' interface for those kernel consumers who wish to
   select on a socket when they have no fd so they no longer have to
   be aware of select implementation details.

Tested by:	kris
Reviewed on:	arch
2007-12-16 06:21:20 +00:00
Craig Rodrigues
d7f81adbd4 Revert previous commits which I committed by mistake.
Approved by:	re (implicit)
Pointy hat to:	me
2007-07-14 21:23:31 +00:00
Craig Rodrigues
d678780e60 The last entry in the ext2_opts array must be NULL,
otherwise the kernel with crash in vfs_filteropt() if an invalid
mount option is passed to ext2fs.

Approved by:	re (kensmith)
2007-07-14 21:18:19 +00:00
Robert Watson
dede2ab3b2 In kern_kevent(), unconditionally fdrop() fp once fget() has succeeded,
as we never have an opportunity to set it to NULL.

Found with:	Coverity Prevent(tm)
CID:		2161
2007-05-28 17:15:05 +00:00
Robert Watson
87066f04c6 Select a more appealing spelling for the word acquire. 2007-05-27 19:24:00 +00:00
Robert Watson
5e3f7694b1 Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock.  This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention.  All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently.  Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
  acquisisition of the filedesc lock; the plan is that they will now all
  be fast.  Change all locking instances to either shared or exclusive
  locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
  was called without the mutex held; sx_sleep() is now always called with
  the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
  rather than the filedesc lock or no lock.  Always update the f_ops
  field last. A further memory barrier is required here in the future
  (discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
  properly acquire vnode references before using vnode pointers.  Annotate
  improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by:	kris
Discussed with:	jhb, kris, attilio, jeff
2007-04-04 09:11:34 +00:00
Robert Watson
0c14ff0eb5 Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
2007-03-04 22:36:48 +00:00
John Baldwin
6600b45d88 Save exit status of an exiting process in kn_data in the knote.
Submitted by:	Jared Yanovich ^phirerunner at comcast.net^
MFC after:	2 weeks
2006-11-20 22:17:50 +00:00
John-Mark Gurney
33fabe46da remove unnecessary NULL check...
Coverity ID:	1545
2006-09-25 01:29:48 +00:00
John-Mark Gurney
4db71d27a1 hide kqueue_register from public view, and replace it w/ kqfd_register...
this eliminates a possible race in aio registering a kevent..
2006-09-24 04:47:47 +00:00
John-Mark Gurney
9edac6f3f9 add KTRACE hooks into kevent... This will help people debug their kqueue
programs to find out exactly which events were registered and which were
returned...  This should be lower in kern_kevent, but that would require
special munging due to locks and the functions used to copyin/copyout
kevents...

If someone wants to teach ktrace how to output pretty kevents, I have a
kevent prety printer that can be used...
2006-09-24 02:23:29 +00:00
John Baldwin
5c69ad8374 Use fget() in kqueue_register() instead of doing all the work by hand. 2006-06-12 21:46:23 +00:00
Pawel Jakub Dawidek
f420242b2b Don't forget to unlock kq lock in low memory situations.
OK'ed by:	jmg
2006-06-02 13:23:39 +00:00
Pawel Jakub Dawidek
8ebab14c70 Remove confusing done_noglobal label. The KQ_GLOBAL_UNLOCK() macro know
how to handle both situations - when kq_global lock is and is not held.

OK'ed by:	jmg
2006-06-02 13:21:21 +00:00
Pawel Jakub Dawidek
241321abc0 Use SLIST_FOREACH_SAFE() macro, because knote_drop() can free an element
which can be then used to find next element in the list.

OK'ed by:	jmg
2006-06-02 13:18:59 +00:00
John Baldwin
a29b4f6eec Drop the kqueue global mutex as soon as we are finished with it rather
than keeping it locked until we exit the function to optimize the case
where the lock would be dropped and later reacquired.  The optimization
was broken when kevent's were moved from UFS to VFS and the knote list
lock for a vnode kevent became the lockmgr vnode lock.  If one tried
to use a kqueue that contained events for a kqueue fd followed by a vnode,
then the kq global lock would end up being held when the vnode lock was
acquired which could result in sleeping with a mutex held (and subsequent
panics) if the vnode lock was contested.

Reviewed by:	jmg
Tested by:	ps (on 6.x)
MFC after:	3 days
2006-04-14 14:27:28 +00:00
John-Mark Gurney
1c4ca5e5fe spell unlock correctly, this is relatively minor as it's rare someone would
provide a lock method, and want the default unlock, but it is a bug...

PR:		95356
Submitted by:	Stephen Corteselli
MFC after:	3 days
2006-04-07 17:21:27 +00:00
John-Mark Gurney
5e6125891f mask out any action when copying the flags from the event to the knote..
Pointed out by:	Václav Haisman
Submitted by:	Dan Nelson (slightly modifed patch)
MFC after:	3 days
2006-04-01 20:15:39 +00:00
John-Mark Gurney
4e095bc045 hold the list lock over the f_event and KNOTE_ACTIVATE calls... This closes
a race where data could come in before we clear the INFLUX flag, and get
skipped over by knote (and hence never be activated, though it should of
been)...

Found by:	glebius & co.
Reviewed by:	glebius
MFC after:	3 days
2006-03-29 18:15:30 +00:00
Doug Ambrisko
69cd28dacb Add in kqueue support to LIO event notification and fix how it handled
notifications when LIO operations completed.  These were the problems
with LIO event complete notification:
      -	Move all LIO/AIO event notification into one general function
	so we don't have bugs in different data paths.  This unification
	got rid of several notification bugs one of which if kqueue was
	used a SIGILL could get sent to the process.
      -	Change the LIO event accounting to count all AIO request that
	could have been split across the fast path and daemon mode.
	The prior accounting only kept track of AIO op's in that
	mode and not the entire list of operations.  This could cause
	a bogus LIO event complete notification to occur when all of
	the fast path AIO op's completed and not the AIO op's that
	ended up queued for the daemon.

Suggestions from:	alc
2005-10-12 17:51:31 +00:00
Stephan Uphoff
19b2dff7b0 Fix race condition that caused activation of an event to
be ignored immediately after it was deactivated.

Found by: 	Yahoo!
MFC after:	3 days
2005-09-15 21:10:12 +00:00
Suleiman Souhlal
571dcd15e2 Fix the recent panics/LORs/hangs created by my kqueue commit by:
- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by:	jmg
Approved by:	re (scottl), scottl (mentor override)
Pointyhat to:	ssouhlal
Will be happy:	everyone
2005-07-01 16:28:32 +00:00
Paul Saab
efe5becafa Wrap copyin/copyout for kevent so the 32bit wrapper does not have
to malloc nchanges * sizeof(struct kevent) AND/OR nevents *
sizeof(struct kevent) on every syscall.

Glanced at by:	peter, jmg
Obtained from:	Yahoo!
MFC after:	2 weeks
2005-06-03 23:15:01 +00:00
John-Mark Gurney
b633f50dd8 make stat return an zero'd struct, and be a FIFO again... This is only
to fix libc_r since it requires stat to close fd's, and so commented in
the code...

PR:		threads/75795
Reviewed by:	ps
MFC after:	1 week
2005-05-24 23:42:50 +00:00
John-Mark Gurney
c4c44d2935 fix aio+kq... I've been running ambrisko's test program for much longer
w/o problems than I was before...  This simply brings back the knote_delete
as knlist_delete which will also drop the knote's, instead of just clearing
the list and seeing _ONESHOT...

Fix a race where if a note was _INFLUX and _DETACHED, it could end up being
modified... whoopse..

MFC after:	1 week
Prodded by:	ambrisko and dwhite
2005-03-18 01:11:39 +00:00
Paul Saab
b8a4edc17e Use kern_kevent instead of the stackgap for 32bit syscall wrapping.
Submitted by:	jhb
Tested on:	amd64
2005-03-01 17:45:55 +00:00
Robert Watson
4f7fd28ee1 When invoking callout_init(), spell '1' as "CALLOUT_MPSAFE".
MFC after:	3 days
2005-02-22 13:11:33 +00:00
Poul-Henning Kamp
c711aea6ca Make a bunch of malloc types static.
Found by:	src/tools/tools/kernxref
2005-02-10 12:02:37 +00:00
Poul-Henning Kamp
1b5cd47aa0 Move a FILEDESC_UNLOCK upwards to silence witness. 2004-11-16 14:41:31 +00:00
Poul-Henning Kamp
124e4c3be8 Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.
Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.
2004-11-13 11:53:02 +00:00
John-Mark Gurney
583ef6b6d2 /me gets the wrong patch out of the pr :(
/me had the write patch w/o comments on his test system.

Pointed out by:	kuriyama and ache
Pointy hat to:	jmg
2004-10-14 03:26:50 +00:00
John-Mark Gurney
d46316e8f9 fix a bug where signal events didn't set the flags for attach/detach..
PR:		72234
MFC after:	2 days
2004-10-13 20:55:19 +00:00
John-Mark Gurney
31580e6817 unlock global lock in kqueue_scan before msleep'ing to prevent dead
lock..  we didn't unlock global lock earlier to prevent just having
to reaquire it again..

Found by:	peter
Reviewed by:	ps
MFC after:	3 days
2004-09-14 18:38:16 +00:00
John-Mark Gurney
ca95b2de43 remove giant required from kqueue_close..
Reported by:	kuriyama
MFC after:	3 days
2004-09-10 03:14:32 +00:00
John-Mark Gurney
9b90387dcf don't call f_detach if the filter has alread removed the knote.. This
happens when a proc exits, but needs to inform the user that this has
happened..  This also means we can remove the check for detached from
proc and sig f_detach functions as this is doing in kqueue now...

MFC after:	5 days
2004-09-06 19:02:42 +00:00
Brian Feldman
1c0f9af5b5 Allocate the marker, when scanning a kqueue, from the "heap" instead of the
stack.  When swapped out, a process's kernel stack would be unavailable,
and we could get a page fault when scanning the same kqueue.

PR:	kern/61849
2004-08-16 03:08:38 +00:00
John-Mark Gurney
ad3b9257c2 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
John-Mark Gurney
7d5e45a391 looks like rwatson forgot tabs... :) 2004-08-13 07:38:58 +00:00
Robert Watson
44f31f7556 Trim trailing white space. 2004-08-12 18:06:21 +00:00
Robert Watson
a6719c82b1 Push Giant acquisition down into fo_stat() from most callers. Acquire
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant.  Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op.  Don't acquire Giant in fstat() system call.

Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.
2004-07-22 20:40:23 +00:00
Robert Watson
1c1ce9253f Push acquisition of Giant from fdrop_closed() into fo_close() so that
individual file object implementations can optionally acquire Giant if
they require it:

- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)

Notes:

  Giant is still acquired in close() even when closing MPSAFE objects
  due to kqueue requiring Giant in the calling closef() code.
  Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
  of pipe create/destroy pairs from user space with SMP compiled into
  the kernel.

  The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
  test it extensively and so have left Giant over fo_close().  It can
  probably be removed given some testing and review.
2004-07-22 18:35:43 +00:00
Alfred Perlstein
a88295bb83 Disable SIGIO for now, leave a comment as to why it's busted and hard
to fix.
2004-07-15 03:49:52 +00:00
Alfred Perlstein
67543ab1e3 Make FIOASYNC, FIOSETOWN and FIOGETOWN work on kqueues. 2004-07-14 07:02:03 +00:00
Alfred Perlstein
94ed9c8af5 Introduce a new kevent filter. EVFILT_FS that will be used to signal
generic filesystem events to userspace.  Currently only mount and unmount
of filesystems are signalled.  Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.
2004-07-04 10:52:54 +00:00
Robert Watson
948a4734ed Add GIANT_REQUIRED to kqueue_close(), since kqueue currently requires
Giant.
2004-06-01 18:05:41 +00:00