Commit Graph

349 Commits

Author SHA1 Message Date
Mateusz Guzik
ffb34484ee cred: add proc_set_cred_init helper
proc_set_cred_init can be used to set first credentials of a new
process.

Update proc_set_cred assertions so that it only expects already used
processes.

This fixes panics where p_ucred of a new process happens to be non-NULL.

Reviewed by:	kib
2015-03-21 20:24:54 +00:00
Mateusz Guzik
daf63fd2f9 cred: add proc_set_cred helper
The goal here is to provide one place altering process credentials.

This eases debugging and opens up posibilities to do additional work when such
an action is performed.
2015-03-16 00:10:03 +00:00
Ian Lepore
b96bd95b85 Allow the kern.osrelease and kern.osreldate sysctl values to be set in a
jail's creation parameters.  This allows the kernel version to be reliably
spoofed within the jail whether examined directly with sysctl or
indirectly with the uname -r and -K options.

The values can only be set at jail creation time, to eliminate the need
for any locking when accessing the values via sysctl.

The overridden values are inherited by nested jails (unless the config for
the nested jails also overrides the values).

There is no sanity or range checking, other than disallowing an empty
release string or a zero release date, by design.  The system
administrator is trusted to set sane values.  Setting values that are
newer than the actual running kernel will likely cause compatibility
problems.

Differential Revision:	https://reviews.freebsd.org/D1948
Relnotes:	yes
2015-02-27 16:28:55 +00:00
John-Mark Gurney
9541307fb7 turn GEOM_UNCOMPRESS_DEBUG into a proper option so it can be specified
in kernel config files..

put VERBOSE_SYSINIT in it's own option header so the one file,
init_main.c, can use it instead of requiring an entire kernel recompile
to change one file..
2015-02-05 07:51:38 +00:00
Hans Petter Selasky
a115fb62ed Revert for r277213:
FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.

Bump the __FreeBSD_version instead of reverting it.

Suggested by:		kmacy, adrian, glebius and kib
Differential Revision:	https://reviews.freebsd.org/D1438
2015-01-22 11:12:42 +00:00
Hans Petter Selasky
1a26c3c047 Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
  CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
  spinlocks.
- Switching the callout CPU number cannot always be done on a
  per-callout basis. See the updated timeout(9) manual page for more
  information.
- The timeout(9) manual page has been updated to reflect how all the
  functions inside the callout API are working. The manual page has
  been made function oriented to make it easier to deduce how each of
  the functions making up the callout API are working without having
  to first read the whole manual page. Group all functions into a
  handful of sections which should give a quick top-level overview
  when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
  to reduce the complexity in the callout code and to avoid problems
  about atomically stopping callouts via callout_stop(). If someone
  needs it, it can be re-added. From my quick grep there are no
  CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
  added. See the updated timeout(9) manual page for a complete
  description.
- Update the callout clients in the "kern/" folder to use the callout
  API properly, like cv_timedwait(). Previously there was some custom
  sleepqueue code in the callout subsystem, which has been removed,
  because we now allow callouts to be protected by spinlocks. This
  allows us to tear down the callout like done with regular mutexes,
  and a "td_slpmutex" has been added to "struct thread" to atomically
  teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
  "SWT_SLEEPQTIMO" states can now be completely removed. Currently
  they are marked as available and will be cleaned up in a follow up
  commit.
- Bump the __FreeBSD_version to indicate kernel modules need
  recompilation.
- There has been several reports that this patch "seems to squash a
  serious bug leading to a callout timeout and panic".

Kernel build testing:	all architectures were built
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D1438
Sponsored by:		Mellanox Technologies
Reviewed by:		jhb, adrian, sbruno and emaste
2015-01-15 15:32:30 +00:00
Konstantin Belousov
237623b028 Add a facility for non-init process to declare itself the reaper of
the orphaned descendants.  Base of the API is modelled after the same
feature from the DragonFlyBSD.

Requested by:	bapt
Reviewed by:	jilles (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2014-12-15 12:01:42 +00:00
Konstantin Belousov
5c7bebf961 The process spin lock currently has the following distinct uses:
- Threads lifetime cycle, in particular, counting of the threads in
  the process, and interlocking with process mutex and thread lock.
  The main reason of this is that turnstile locks are after thread
  locks, so you e.g. cannot unlock blockable mutex (think process
  mutex) while owning thread lock.

- Virtual and profiling itimers, since the timers activation is done
  from the clock interrupt context.  Replace the p_slock by p_itimmtx
  and PROC_ITIMLOCK().

- Profiling code (profil(2)), for similar reason.  Replace the p_slock
  by p_profmtx and PROC_PROFLOCK().

- Resource usage accounting.  Need for the spinlock there is subtle,
  my understanding is that spinlock blocks context switching for the
  current thread, which prevents td_runtime and similar fields from
  changing (updates are done at the mi_switch()).  Replace the p_slock
  by p_statmtx and PROC_STATLOCK().

The split is done mostly for code clarity, and should not affect
scalability.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-26 14:10:00 +00:00
Mateusz Guzik
eb48fbd963 filedesc: fixup fdinit to lock fdp and preapare files conditinally
Not all consumers providing fdp to copy from want files.

Perhaps these functions should be reorganized to better express the outcome.

This fixes up panics after r273895 .

Reported by:	markj
2014-11-13 21:15:09 +00:00
Mark Murray
10cb24248a This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by:	trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by:	so(des)
2014-10-30 21:21:53 +00:00
Davide Italiano
2be111bf7d Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
Bryan Drewery
44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
John Baldwin
55648840de Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
John Baldwin
edb572a38c Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
Olivier Houchard
662423aeaf Don't call sleepinit() from proc0_init(), make it a SYSINIT instead.
vmem needs the sleepq locks to be initialized when free'ing kva, so we want it
called as early as possible.
2013-08-09 23:13:52 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Andriy Gapon
785797c341 rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
  invoked depending on its relative order with scheduler; this was not obvious
  and the bug actually used to exist

Reviewed by:	kib (ealier version)
MFC after:	14 days
2013-07-24 09:45:31 +00:00
Brooks Davis
56fddc5d8c MFP4 change 210763
Allow boothowto and bootverbose to be set via kernel options, which
is useful on architectures that are unable to rely on a boot loader
to pass configuration variables to the kernel.

Submitted by:	rwatson
2013-04-03 22:24:36 +00:00
Andriy Gapon
bfdcb3bcba print compiler version in the kernel banner
And provide kernel compiler version as a sysctl as well.
This is useful while we have gcc and clang cohabitation.
This could be even more useful when we have support
for external toolchains.

In cooperation with:	mjg
MFC after:		13 days
2013-02-02 11:58:35 +00:00
Konstantin Belousov
f7e50ea722 Fix a race between kern_setitimer() and realitexpire(), where the
callout is started before kern_setitimer() acquires process mutex, but
looses a race and kern_setitimer() gets the process mutex before the
callout.  Then, assuming that new specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.

As the result of the race, both p_realtimer is zero, and the callout
is rescheduled. Then, in the exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized.  The consequence is the corrupted callwheel tailq.

Use process mutex to interlock the callout start, which fixes the race.

Reported and tested by:	pho
Reviewed by:	jhb
MFC after:	2 weeks
2012-12-04 20:49:39 +00:00
Konstantin Belousov
abce621c3a Fix grammar.
Submitted by:	jh
MFC after:	1 week
2012-08-16 13:01:56 +00:00
Konstantin Belousov
02c6fc2114 Add a sysctl kern.pid_max, which limits the maximum pid the system is
allowed to allocate, and corresponding tunable with the same
name. Note that existing processes with higher pids are left intact.

MFC after:	1 week
2012-08-15 15:56:21 +00:00
John Baldwin
b871e6613b Extend VERBOSE_SYSINIT to also print out the name of variables passed
to SYSINIT routines if they can be resolved via symbol look up in DDB.
To avoid false positives, only honor a name if the symbol resolves
exactly to the pointer value (no offset).

MFC after:	1 week
2012-06-01 15:42:37 +00:00
Pawel Jakub Dawidek
9b9a01792d TDF_* flags should be used with td_flags field and TDP_* flags should be used
with td_pflags field. Correct two places where it was not the case.

Discussed with:	kib
MFC after:	1 week
2012-01-22 11:01:36 +00:00
Sergey Kandaurov
3bedc94069 Remove the long reprecated ``/stand/sysinstall'' from the init_path.
It can be put back using the INIT_PATH config option or init_path
loader variable, if still needed (which I doubt).

MFC after:	1 week
2011-10-27 10:25:11 +00:00
Kip Macy
8451d0dd78 In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
Jonathan Anderson
cfb5f76865 Add experimental support for process descriptors
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-18 22:51:30 +00:00
Edward Tomasz Napierala
58c77a9d53 Enable accounting for RACCT_NPROC and RACCT_NTHR.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-03-31 19:22:11 +00:00
Edward Tomasz Napierala
097055e26d Add racct. It's an API to keep per-process, per-jail, per-loginclass
and per-loginclass resource accounting information, to be used by the new
resource limits code.  It's connected to the build, but the code that
actually calls the new functions will come later.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-03-29 17:47:25 +00:00
Dmitry Chagin
e5d81ef1b5 Extend struct sysvec with new method sv_schedtail, which is used for an
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.

Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.

While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.

Discussed with:	kib

MFC after:	2 Week
2011-03-08 19:01:45 +00:00
Edward Tomasz Napierala
2bfc50bc4f Add two new system calls, setloginclass(2) and getloginclass(2). This makes
it possible for the kernel to track login class the process is assigned to,
which is required for RCTL.  This change also make setusercontext(3) call
setloginclass(2) and makes it possible to retrieve current login class using
id(1).

Reviewed by:	kib (as part of a larger patch)
2011-03-05 12:40:35 +00:00
John Baldwin
fd05807822 - Properly initialize the base priority (td_base_pri) of thread0 to PVM
to match the desired priority in td_priority.  Otherwise the first time
  thread0 used a borrowed priority it would drop down to PUSER instead of
  PVM.
- Explicitly initialize the starting priority of new kprocs to PVM to
  avoid inheriting some random priority from thread0.

MFC after:	2 weeks
2011-01-06 22:26:00 +00:00
David Xu
acbe332a58 MFp4:
It is possible a lower priority thread lending priority to higher priority
thread, in old code, it is ignored, however the lending should always be
recorded, add field td_lend_user_pri to fix the problem, if a thread does
not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week
2010-12-09 02:42:02 +00:00
John Baldwin
b94e6f0ef6 Set bootverbose directly in mi_startup() rather than via a SYSINIT. This
ensures 'bootverbose' is in a valid state for all SYSINITs.

Reported by:	avg
MFC after:	1 week
2010-10-28 14:17:06 +00:00
David Xu
21ecd1e977 - Insert thread0 into correct thread hash link list.
- In thr_exit() and kthread_exit(), only remove thread from
  hash if it can directly exit, otherwise let exit1() do it.
- In thread_suspend_check(), fix cleanup code when thread needs
  to exit.
This change seems fixed the "Bad link elm " panic found by
Peter Holm.

Stress testing: pho
2010-10-17 11:01:52 +00:00
David Xu
cf7d9a8ca8 Create a global thread hash table to speed up thread lookup, use
rwlock to protect the table. In old code, thread lookup is done with
process lock held, to find a thread, kernel has to iterate through
process and thread list, this is quite inefficient.
With this change, test shows in extreme case performance is
dramatically improved.

Earlier patch was reviewed by: jhb, julian
2010-10-09 02:50:23 +00:00
Gavin Atkinson
a0c87b747c Add descriptions to a handful of sysctl nodes.
PR:		kern/148580
Submitted by:	Galimov Albert <wtfcrap mail.ru>
MFC after:	1 week
2010-08-09 14:48:31 +00:00
Edward Tomasz Napierala
175389cff2 Remove spurious '/*-' marks and fix some other style problems.
Submitted by:	bde@
2010-07-22 05:42:29 +00:00
Edward Tomasz Napierala
1a996ed1d8 Revert r210225 - turns out I was wrong; the "/*-" is not license-only
thing; it's also used to indicate that the comment should not be automatically
rewrapped.

Explained by:	cperciva@
2010-07-18 20:57:53 +00:00
Edward Tomasz Napierala
805cc58ac0 The "/*-" comment marker is supposed to denote copyrights. Remove non-copyright
occurences from sys/sys/ and sys/kern/.
2010-07-18 20:23:10 +00:00
Konstantin Belousov
afe1a68827 Reorganize syscall entry and leave handling.
Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
  usermode into struct syscall_args. The structure is machine-depended
  (this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
  from the syscall. It is a generalization of
  cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
  return value.
sv_syscallnames - the table of syscall names.

Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().

The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.

Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().

Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively.  The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.

The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.

Reviewed by:	jhb, marcel, marius, nwhitehorn, stas
Tested by:	marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
	stas (mips)
MFC after:	1 month
2010-05-23 18:32:02 +00:00
Alan Cox
ac45ee97c9 Initialize the virtual memory-related resource limits in a single place.
Previously, one of these limits was initialized in two places to a
different value in each place.  Moreover, because an unsigned int was used
to represent the amount of pageable physical memory, some of these limits
were incorrectly initialized on 64-bit architectures.  (Currently, this
error is masked by login.conf's default settings.)

Make vm_thread_swapin() and vm_thread_swapout() static.

Submitted by:	bde (an earlier version)
Reviewed by:	kib
2010-04-11 16:26:07 +00:00
Alan Cox
92351f162e Make _vm_map_init() the one place where the vm map's pmap field is
initialized.

Reviewed by:	kib
2010-04-03 19:07:05 +00:00
Ruslan Ermilov
e64585bdc2 Random number generator initialization cleanup:
- Introduce new SI_SUB_RANDOM point in boot sequence to make it
clear from where one may start using random(9).  It should be as
early as possible, so place it just after SI_SUB_CPU where we
have some randomness on most platforms via get_cyclecount().

- Move stack protector initialization to be after SI_SUB_RANDOM
as before this point we have no randomness at all.  This fixes
stack protector to actually protect stack with some random guard
value instead of a well-known one.

Note that this patch doesn't try to address arc4random(9) issues.
With current code, it will be implicitly seeded by stack protector
and hence will get the same entropy as random(9).  It will be
securely reseeded once /dev/random is feeded by some entropy from
userland.

Submitted by:	Maxim Dounin <mdounin@mdounin.ru>
MFC after:	3 days
2009-10-20 16:36:51 +00:00
Bjoern A. Zeeb
878adb8517 Add a mitigation feature that will prevent user mappings at
virtual address 0, limiting the ability to convert a kernel
NULL pointer dereference into a privilege escalation attack.

If the sysctl is set to 0 a newly started process will not be able
to map anything in the address range of the first page (0 to PAGE_SIZE).
This is the default. Already running processes are not affected by this.

You can either change the sysctl or the tunable from loader in case
you need to map at a virtual address of 0, for example when running
any of the extinct species of a set of a.out binaries, vm86 emulation, ..
In that case set security.bsd.map_at_zero="1".

Superseeds:		r197537
In collaboration with:	jhb, kib, alc
2009-10-02 17:48:51 +00:00
Andriy Gapon
7e936cef34 print machine in kernel boot version string
Discussed with:	gavin, kib, jhb
PR:		kern/126926
MFC after:	2 weeks
2009-10-01 10:53:12 +00:00
Andriy Gapon
78edb09e67 print_caddr_t: drop incorrect __unused attribute from parameter
seems like a purely cosmetic change

Reviewed by:	jhb, kib
MFC after:	1 week
2009-09-30 11:14:13 +00:00
Jamie Gritton
7afcbc18b3 Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by:	re (kib), bz (mentor)
2009-07-17 14:48:21 +00:00
Konstantin Belousov
d8b0556c6d Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by:	jhb [1]
Noted by:	jhb [2]
Reviewed by:	jhb
Tested by:	rnoland
2009-06-10 20:59:32 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00