Commit Graph

11441 Commits

Author SHA1 Message Date
Konstantin Belousov
d4c8e5ac7b MFC r197031:
Unlock the image vnode around the call of pmc PMC_FN_PROCESS_EXEC hook.
The hook calls vn_fullpath(9), that should not be executed with a vnode
lock held.

Approved by:	re (kensmith)
2009-09-12 18:05:57 +00:00
Konstantin Belousov
3c9d279b1d MFC r197030:
In vfs_mark_atime(9), be resistent against reclaimed vnodes.
Assert that neccessary locks are taken, since vop might not be called.

Approved by:	re (kensmith)
2009-09-12 18:02:57 +00:00
Konstantin Belousov
93566d2a83 MFC r196887:
In fhopen, vfs_ref() the mount point while vnode is unlocked, to prevent
vn_start_write(NULL, &mp) from operating on potentially freed or reused
struct mount *.

Remove unmatched vfs_rel() in cleanup.

Approved by:	re (kensmith)
2009-09-09 13:28:18 +00:00
Attilio Rao
c90c9ddddb Adaptive spinning for locking primitives, in read-mode, have some tuning
SYSCTLs which are inappropriate for a daily use of the machine (mostly
useful only by a developer which wants to run benchmarks on it).
Remove them before the release as long as we do not want to ship with
them in.

Now that the SYSCTLs are gone, instead than use static storage for some
constants, use real numeric constants in order to avoid eventual compiler
dumbiness and the risk to share a storage (and then a cache-line) among
CPUs when doing adaptive spinning together.

Pleasse note that the sys/linker_set.h inclusion in lockmgr and sx lock
support could have been gone, but re@ preferred them to be in order to
minimize the risk of problems on future merging.

Please note that this patch is not a MFC, but an 'edge case' as commit
directly to stable/8, which creates a diverging from HEAD.

Tested by:      Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Approved by:	re (kib)
2009-09-09 09:34:13 +00:00
Attilio Rao
db0c92ce82 MFC r196772:
fix adaptive spinning in lockmgr by using correctly GIANT_RESTORE and
continue statement and improve adaptive spinning for sx lock by just
doing once GIANT_SAVE.

Approved by:	re (kib)
2009-09-09 09:17:31 +00:00
Jamie Gritton
3c7562c77e MFC r196835:
Allow a jail's name to be the same as its jid (which is the default if
  no name is specified), and let a numeric name specify the jid for a new
  jail when the jid isn't otherwise set.  Still disallow other numeric
  names.

Reviewed by:	zec
Approved by:	re (kib), bz (mentor)
2009-09-08 19:18:02 +00:00
Konstantin Belousov
2af00decb8 MFC r196730:
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ.

Introduce separate kernel stack cache that keeps some limited amount of
preallocated kernel stacks to lower the latency of thread allocation.

Not a merge: instead of removing td_altkstack* members of struct thread,
replace them with placeholders to keep struct thread layout on the
stable branch.

Also, record r196640, r196644 and r196648 as merged.

Approved by:	re (kensmith)
2009-09-08 15:31:23 +00:00
Konstantin Belousov
c02280f542 MFC r196692:
Make the mnt_writeopcount and mnt_secondary_writes counters,
used by the suspension code, not greater then mnt_ref reference
counter value.

MFC r196733:
Fix mount reference leak when V_XSLEEP is specified to vn_start_write().

Approved by:	re (kensmith)
2009-09-08 14:43:42 +00:00
Warner Losh
45f395006c MFC r196529:
Rather than having enabled/disabled, implement a max queue depth.
  While usually not an issue, this firewalls bugs in the code that may
  run us out of memory.

  Fix a memory exhaustion in the case where devctl was disabled, but the
  link was bouncing.  The check to queue was in the wrong place.

  Implement a new sysctl hw.bus.devctl_queue to control the depth.  Make
  compatibility hacks for hw.bus.devctl_disable to ease transition.

  Reviewed by:	emaste@
  Approved by:	re@ (kib)
  MFC after:	asap
2009-09-05 08:03:29 +00:00
Bjoern A. Zeeb
914e5afefd MFC r196653:
Make sure FreeBSD binaries without .note.ABI-tag section work
  correctly and do not match a colliding Debian GNU/kFreeBSD
  brandinfo statements.
  For this mark the Debian GNU/kFreeBSD brandinfo that it must have
  an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
  when comparing a possibly colliding set of options.

  Due to SYSINIT we add the brandinfo in a non-deterministic order,
  so native FreeBSD is not always first. We may want to consider
  to force native FreeBSD to come first as well.

  The only way a problem could currently be noticed is when running an
  i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
  brandinfo  was matched first,  as the fallback to ld-elf32.so.1 does
  not exist in that case.

Reported and tested by:	ticso
In collaboration with:	kib
MFC after:		3 days
Approved by:		re (rwatson)
2009-09-02 10:39:46 +00:00
Jilles Tjoelker
33656b91cd MFC r196460
Fix the conformance of poll(2) for sockets after r195423 by
  returning POLLHUP instead of POLLIN for several cases. Now, the
  tools/regression/poll results for FreeBSD are closer to that of the
  Solaris and Linux.

  Also, improve the POSIX conformance by explicitely clearing POLLOUT
  when POLLHUP is reported in pollscan(), making the fix global.

  Submitted by:	bde
  Reviewed by:	rwatson

MFC r196556

  Fix poll() on half-closed sockets, while retaining POLLHUP for fifos.

  This reverts part of r196460, so that sockets only return POLLHUP if both
  directions are closed/error. Fifos get POLLHUP by closing the unused
  direction immediately after creating the sockets.

  The tools/regression/poll/*poll.c tests now pass except for two other
  things:
  - if POLLHUP is returned, POLLIN is always returned as well instead of
    only when there is data left in the buffer to be read
  - fifo old/new reader distinction does not work the way POSIX specs it

  Reviewed by:	kib, bde

MFC r196554

  Add some tests for poll(2)/shutdown(2) interaction.

Approved by:	re (kensmith)
2009-09-01 20:58:41 +00:00
Marius Strobl
e4e5ed252d Add a temporary workaround which just lets init die instead of
causing a panic if it is killed due to a unsolved stack overflow
seen very late during shutdown on sparc64 when the gmirror worker
process exists, which is a regression introduced in 8.0.

Reviewed by:	kib
Approved by:	re (rwatson)
2009-08-31 19:16:58 +00:00
Jamie Gritton
f37b0a3db5 MFC r196592:
Fix a LOR between allprison_lock and vnode locks by releasing
  allprison_lock before releasing a prison's root vnode.

PR:		kern/138004
Reviewed by:	kib
Approved by:	re (rwatson), bz (mentor)
2009-08-31 14:13:45 +00:00
Konstantin Belousov
e179d138ba MFC r196560:
Honor the vfs.timestamp_precision sysctl settings for utimes(path, NULL)
and similar calls.

Approved by:	re (rwatson)
2009-08-31 09:08:14 +00:00
Robert Watson
3ef94f2b72 Merge r196481 from head to stable/8:
Rework global locks for interface list and index management, correcting
  several critical bugs, including race conditions and lock order issues:

  Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
  sxlock.  Either can be held to stablize the lists and indexes, but both
  are required to write.  This allows the list to be held stable in both
  network interrupt contexts and sleepable user threads across sleeping
  memory allocations or device driver interactions.  As before, writes to
  the interface list must occur from sleepable contexts.

  Reviewed by:  bz, julian

Approved by:	re (kib)
2009-08-28 20:06:02 +00:00
Marko Zec
61268392e1 MFC r196505:
When "jail -c vnet" request fails, the current code actually creates and
  leaves behind an orphaned vnet.  This change ensures that such vnets get
  released.

  This change affects only options VIMAGE builds.

  Submitted by: jamie
  Discussed with:       bz
  Approved by:  re (rwatson), julian (mentor)

Approved by:	re (rwatson)
2009-08-28 19:15:17 +00:00
Marko Zec
939af5009a MFC r196501:
When registering a protocol to an existing protocol domain via
  pf_proto_register(), iterate over all existing vnets to call protosw_init()
  and thus the appropriate .pr_init() handler in the context of each vnet.
  NB in the future we probably want to separate pr_init() handlers into
  two, i.e. per-vnet and global, functions.

  This change has no impact on nooptions VIMAGE builds.

  Approved by:  re (rwatson), julian (mentor)

Approved by:	re (rwatson)
2009-08-28 19:08:56 +00:00
Bjoern A. Zeeb
ac63e409c2 MFC r196512:
Fix handling of .note.ABI-tag section for GNU systems [1].
  Handle GNU/Linux according to LSB Core Specification 4.0,
  Chapter 11. Object Format, 11.8. ABI note tag.

  Also check the first word of desc, not only name, according to
  glibc abi-tags specification to distinguish between Linux and
  kFreeBSD.

  Add explicit handling for Debian GNU/kFreeBSD, which runs
  on our kernels as well [2].

  In {amd64,i386}/trap.c, when checking osrel of the current process,
  also check the ABI to not change the signal behaviour for Linux
  binary processes, now that we save an osrel version for all three
  from the lists above in struct proc [2].

  These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
  and Linux binaries on the same machine again for at least i386 and
  amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).

PR:		kern/135468
Submitted by:	dchagin [1] (initial patch)
Suggested by:	kib [2]
Tested by:	Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by:	kib
Approved by:	re (kensmith)
2009-08-27 17:34:13 +00:00
John Baldwin
18fb1e9a44 MFC 196417:
This patch fixes two bugs in sglist(9) and improves robustness of the API via
better semantics if a request to append an address range to an existing list
fails.
- When cloning an sglist, properly set the length in the new sglist instead of
  leaving the new list empty.
- Properly compute the amount of data added to an sglist via
  _sglist_append_buf().  This allows sglist_consume_uio() to properly update
  uio_resid.
- When a request to append an address range to a scatter/gather list fails,
  restore the sglist to the state it had at the start of the function call
  instead of resetting it to an empty list.

Approved by:	re (kib)
2009-08-21 03:14:39 +00:00
Robert Watson
708b471c4b Merge r196267 from head to stable/8:
Rather than fix questionable ifnet list locking in the implementation of
  the kern.polling.enable sysctl, remove the sysctl.  It has been deprecated
  since FreeBSD 6 in favour of per-ifnet polling flags.

  Reviewed by:	luigi

Approved by:	re (kib)
2009-08-20 21:29:49 +00:00
John Baldwin
5c91164df2 MFC 196404:
Change the 'resid' parameter to sglist_consume_uio() from an int to a
size_t to match the recent type change of the uio_resid member of struct
uio.

Approved by:	re (kib)
2009-08-20 20:53:36 +00:00
John Baldwin
247db0748a MFC 196403: Temporarily revert the new-bus locking for 8.0 release.
Approved by:	re (kib)
2009-08-20 20:23:28 +00:00
Ed Schouten
e047c5fbb6 MFC r196378:
Small changes to the warning message generated by pty(4):

  - Only print the warning once, instead of filling up the screen.
  - Use the word "legacy" for the pty_warningcnt description, to prevent
    confusion.
  - Use log() instead of printf().

  Discussed with: rwatson, jhb
  Approved by:    re (kib)
2009-08-19 14:38:43 +00:00
Pawel Jakub Dawidek
65536ad653 MFC r196358:
Remove unused taskqueue_find() function.

Reviewed by:	dfr
Approved by:	re (kib)
2009-08-18 14:00:25 +00:00
Attilio Rao
7e2d0af9e0 MFC r196334:
* Change the scope of the ASSERT_ATOMIC_LOAD() from a generic check to
  a pointer-fetching specific operation check. Consequently, rename the
  operation ASSERT_ATOMIC_LOAD_PTR().
* Fix the implementation of ASSERT_ATOMIC_LOAD_PTR() by checking
  directly alignment on the word boundry, for all the given specific
  architectures. That's a bit too strict for some common case, but it
  assures safety.
* Add a comment explaining the scope of the macro
* Add a new stub in the lockmgr specific implementation

Tested by: marcel (initial version), marius
Reviewed by: rwatson, jhb (comment specific review)
Approved by: re (kib)
2009-08-17 16:33:53 +00:00
Pawel Jakub Dawidek
e43f173602 MFC r196295:
Remove OpenSolaris taskq port (it performs very poorly in our kernel) and
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.

Approved by:	re (kib)
2009-08-17 09:03:47 +00:00
Pawel Jakub Dawidek
ea5f504fed MFC r196293:
Because taskqueue_run() can drop tq_mutex, we need to check if the
TQ_FLAGS_ACTIVE flag wasn't removed in the meantime, which means we missed a
wakeup.

Approved by:	re (kib)
2009-08-17 08:46:47 +00:00
Ed Schouten
336b627671 MFC r196276:
Fix small style regression introduced by the MPSAFE newbus code.

Approved by:	re (rwatson)
2009-08-16 20:33:16 +00:00
Bjoern A. Zeeb
21845eba48 MFC r196226:
Add a new macro to test that a variable could be loaded atomically.
  Check that the given variable is at most uintptr_t in size and that
  it is aligned.

  Note: ASSERT_ATOMIC_LOAD() uses ALIGN() to check for adequate
        alignment -- however, the function of ALIGN() is to guarantee
        alignment, and therefore may lead to stronger alignment
        enforcement than necessary for types that are smaller than
        sizeof(uintptr_t).

  Add checks to mtx, rw and sx locks init functions to detect possible
  breakage. This was used during debugging of the problem fixed with
  r196118 where a pointer was on an un-aligned address in the dpcpu area.

  In collaboration with:  rwatson
  Reviewed by:            rwatson

Approved by:	re (kib)
2009-08-14 21:50:47 +00:00
Konstantin Belousov
106c3802ff MFC r196203:
Correctly handle unlock for !MAKEENTRY case.

Approved by:	re (rwatson)
2009-08-14 11:06:58 +00:00
Attilio Rao
be1057174e MFC r196196:
* Completely remove the option STOP_NMI from the kernel.  This option
  has proven to have a good effect when entering KDB by using a NMI,
  but it completely violates all the good rules about interrupts
  disabled while holding a spinlock in other occasions.  This can be the
  cause of deadlocks on events where a normal IPI_STOP is expected.
* Add an new IPI called IPI_STOP_HARD on all the supported architectures.
  This IPI is responsible for sending a stop message among CPUs using a
  privileged channel when disponible. In other cases it just does match a
  normal IPI_STOP.
  Right now the IPI_STOP_HARD functionality uses a NMI on ia32 and amd64
  architectures, while on the other has a normal IPI_STOP effect. It is
  responsibility of maintainers to eventually implement an hard stop
  when necessary and possible.
* Use the new IPI facility in order to implement a new userend SMP kernel
  function called stop_cpus_hard(). That is specular to stop_cpu() but
  it does use the privileged channel for the stopping facility.
* Let KDB use the newly introduced function stop_cpus_hard() and leave
  stop_cpus() for all the other cases
* Disable interrupts on CPU0 when starting the process of APs suspension.
* Style cleanup and comments adding

This patch should fix the reboot/shutdown deadlocks many users are
constantly reporting on mailing lists.

Please don't forget to update your config file with the STOP_NMI
option removal

Reviewed by:  jhb
Tested by:    pho, bz, rink
Approved by:  re (kib)
2009-08-13 17:54:11 +00:00
Bjoern A. Zeeb
da2a30fca1 MFC r196176:
Make it possible to change the vnet sysctl variables on jails
  with their own virtual network stack. Jails only inheriting a
  network stack cannot change anything that cannot be changed from
  within a prison.

  Reviewed by:  rwatson, zec

Approved by:	re (kib)
2009-08-13 10:31:02 +00:00
Bjoern A. Zeeb
abff5b8ad9 MFC r196135:
Make the kernel compile without IP networking by moving
  a variable under a proper #ifdef.

Approved by:	re (rwatson)
2009-08-12 12:14:30 +00:00
Bjoern A. Zeeb
537791e584 MFC r196132:
Add ddb show dpcpu_off command to ease dpcpu memory debugging.
  While show pcpu prints pc_dynamic this also prints the original
  memory address as well as the maths.

  Once dpcpu goes NUMA this is considered to help debugging as well.

  Reviewed by:  rwatson

Approved by:	re
2009-08-12 12:10:28 +00:00
Julian Elischer
9734411552 Stop uuidgen(2) from crashing in vimage kerenels.
make curvnet valid when needed.

Reviewed by:	bz@
Approved by:	re (kib)
2009-08-02 16:59:02 +00:00
Attilio Rao
444b91868b Make the newbus subsystem Giant free by adding the new newbus sxlock.
The newbus lock is responsible for protecting newbus internIal structures,
device states and devclass flags. It is necessary to hold it when all
such datas are accessed. For the other operations, softc locking should
ensure enough protection to avoid races.

Newbus lock is automatically held when virtual operations on the device
and bus are invoked when loading the driver or when the suspend/resume
take place. For other 'spourious' operations trying to access/modify
the newbus topology, newbus lock needs to be automatically acquired and
dropped.

For the moment Giant is also acquired in some key point (modules subsystem)
in order to avoid problems before the 8.0 release as module handlers could
make assumptions about it. This Giant locking should go just after
the release happens.

Please keep in mind that the public interface can be expanded in order
to provide more support, if there are really necessities at some point
and also some bugs could arise as long as the patch needs a bit of
further testing.

Bump __FreeBSD_version in order to reflect the newbus lock introduction.

Reviewed by:    ed, hps, jhb, imp, mav, scottl
No answer by:   ariff, thompsa, yongari
Tested by:      pho,
                G. Trematerra <giovanni dot trematerra at gmail dot com>,
                Brandon Gooch <jamesbrandongooch at gmail dot com>
Sponsored by:   Yahoo! Incorporated
Approved by:	re (ksmith)
2009-08-02 14:28:40 +00:00
Ed Schouten
d40b91cb13 Fix two bugs related to TTY input:
- fix write() on pseudo-terminal masters to return the amount of bytes
  passed to the TTY, not the amount of bytes read from user.

- fix ttydisc_rint_bypass() to set the high watermark when it cannot
  write all input, just like ttydisc_rint() itself.

Approved by:	re (kib)
2009-08-02 14:25:26 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Jamie Gritton
42bebd8205 Make the "enforce_statfs" default 2 (most restrictive) in jail_set(2),
instead of whatever the parent/system has (which is generally 0).  This
mirrors the old-style default used for jail(2) in conjunction with the
security.jail.enforce_statfs sysctl.

Approved by:	re (kib), bz (mentor)
2009-07-31 16:00:41 +00:00
John Baldwin
87eca70e0c Fix some LORs between vnode locks and filedescriptor table locks.
- Don't grab the filedesc lock just to read fd_cmask.
- Drop vnode locks earlier when mounting the root filesystem and before
  sanitizing stdin/out/err file descriptors during execve().

Submitted by:	kib
Approved by:	re (rwatson)
MFC after:	1 week
2009-07-31 13:40:06 +00:00
Jamie Gritton
2b0d6f815e Remove a LOR, where the the sleepable allprison_lock was being obtained
in prison_equal_ip4/6 while an inp mutex was held.  Locking allprison_lock
can be avoided by making a restriction on the IP addresses associated with
jails:

Don't allow the "ip4" and "ip6" parameters to be changed after a jail is
created.  Setting the "ip4.addr" and "ip6.addr" parameters is allowed,
but only if the jail was already created with either ip4/6=new or
ip4/6=disable.  With this restriction, the prison flags in question
(PR_IP4_USER and PR_IP6_USER) become read-only and can be checked
without locking.

This also allows the simplification of a messy code path that was needed
to handle an existing prison gaining an IP address list.

PR:		kern/136899
Reported by:	Dirk Meyer
Approved by:	re (kib), bz (mentor)
2009-07-30 14:28:56 +00:00
Jamie Gritton
bdfc8cc4cd Don't allow mixing the "vnet" and "ip4/6" jail parameters, since vnet
jails have their own IP stack and don't have access to the parent IP
addresses anyway.  Note that a virtual network stack forms a break
between prisons with regard to the list of allowed IP addresses.

Approved by:	re (kib), bz (mentor)
2009-07-29 16:46:59 +00:00
Jamie Gritton
8986e3a023 Change the default value of the "ip4" and "ip6" jail parameters to
"disable", which only allows access to the parent/physical system's
IP addresses when specifically directed.  Change the default value of
"host" to "new", and don't copy the parent host values, to insulate
jails from the parent hostname et al.

Approved by:	re (kib), bz (mentor)
2009-07-29 16:41:02 +00:00
Robert Watson
791b0ad2bf Eliminate ARG_UPATH[12] arguments to AUDIT_ARG_UPATH() and instead
provide specific macros, AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2()
to capture path information for audit records.  This allows us to
move the definitions of ARG_* out of the public audit header file,
as they are an implementation detail of our current kernel-internal
audit record, which may change.

Approved by:	re (kensmith)
Obtained from:	TrustedBSD Project
MFC after:	1 month
2009-07-29 07:44:43 +00:00
Robert Watson
b146fc1bf0 Rework vnode argument auditing to follow the same structure, in order
to avoid exposing ARG_ macros/flag values outside of the audit code in
order to name which one of two possible vnodes will be audited for a
system call.

Approved by:	re (kib)
Obtained from:	TrustedBSD Project
MFC after:	1 month
2009-07-28 21:52:24 +00:00
Robert Watson
e4b4bbb665 Audit file descriptors passed to fooat(2) system calls, which are used
instead of the root/current working directory as the starting point for
lookups.  Up to two such descriptors can be audited.  Add audit record
BSM encoding for fooat(2).

Note: due to an error in the OpenBSM 1.1p1 configuration file, a
further change is required to that file in order to fix openat(2)
auditing.

Approved by:	re (kib)
Reviewed by:	rdivacky (fooat(2) portions)
Obtained from:	TrustedBSD Project
MFC after:	1 month
2009-07-28 21:39:58 +00:00
Julian Elischer
7973fba3a4 Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by:	ambrisko
Approved by:	re (kib)
MFC after:	3 days
2009-07-28 19:43:27 +00:00
Bjoern A. Zeeb
be31e5e7b5 Make the in-kernel logic for the SIOCSIFVNET, SIOCSIFRVNET ioctls
(ifconfig ifN (-)vnet <jname|jid>) work correctly.

Move vi_if_move to if.c and split it up into two functions(*),
one for each ioctl.

In the reclaim case, correctly set the vnet before calling if_vmove.

Instead of silently allowing a move of an interface from the current
vnet to the current vnet, return an error. (*)

There is some duplicate interface name checking before actually moving
the interface between network stacks without locking and thus race
prone. Ideally if_vmove will correctly and automagically handle these
in the future.

Suggested by:	rwatson (*)
Approved by:	re (kib)
2009-07-26 11:29:26 +00:00
Jamie Gritton
7cbf72137f Some jail parameters (in particular, "ip4" and "ip6" for IP address
restrictions) were found to be inadequately described by a boolean.
Define a new parameter type with three values (disable, new, inherit)
to handle these and future cases.

Approved by:	re (kib), bz (mentor)
Discussed with:	rwatson
2009-07-25 14:48:57 +00:00
Brooks Davis
254d03c508 Introduce a new sysctl process mib, kern.proc.groups which adds the
ability to retrieve the group list of each process.

Modify procstat's -s option to query this mib when the kinfo_proc
reports that the field has been truncated.  If the mib does not exist,
fall back to the truncated list.

Reviewed by:	rwatson
Approved by:	re (kib)
MFC after:	2 weeks
2009-07-24 19:12:19 +00:00
Brooks Davis
1b5768be71 Revert the changes to struct kinfo_proc in r194498. Instead, fill
in up to 16 (KI_NGROUPS) values and steal a bit from ki_cr_flags
(all bits currently unused) to indicate overflow with the new flag
KI_CRF_GRP_OVERFLOW.

This fixes procstat -s.

Approved by: re (kib)
2009-07-24 15:03:10 +00:00
John Baldwin
013818111a Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses.  The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by:	alc
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-24 13:50:29 +00:00
Robert Watson
d0728d7174 Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
Bjoern A. Zeeb
a08362ce46 sysctl_msec_to_ticks is used with both virtualized and
non-vrtiualized sysctls so we cannot used one common function.

Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two over places to use the new macro.

Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-21 21:58:55 +00:00
Rui Paulo
5b4d5f9ee2 Improve the printf message when a module failed to load. This gives the
user some clue about the possibility of a __FreeBSD_version mismatch.

Discussed with:	rwatson, jhb
Approved by:	re (kib)
2009-07-21 14:18:25 +00:00
Robert Watson
17ef1feb8a Add macros VNET_SETNAME and VNET_SYMPREFIX, and expose to userspace if
_WANT_VNET is defined.  This way we don't need separate definitions in
libkvm.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 07:50:50 +00:00
Konstantin Belousov
7ac5806bfb When buffer write is failed, it is wrong for brelse() to invalidate
portion of the page that was written. Among other problems, this
page might be picked up by pagedaemon, with failed assertion in
vm_pageout_flush() about validity of the page.

Reported and tested by:	pho
Approved by:	re (kensmith)
MFC after:	3 weeks
2009-07-19 20:25:59 +00:00
Robert Watson
006e9db452 Normalize field naming for struct vnet, fix two debugging printfs that
print them.

Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-19 17:40:45 +00:00
Robert Watson
5ee847d3ac Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
Jamie Gritton
7afcbc18b3 Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by:	re (kib), bz (mentor)
2009-07-17 14:48:21 +00:00
Robert Watson
cd2b95a237 r195699 introduced an assertion regarding when progbits data in kernel
modules was present, which turns out to be false in some situations.
Back out the assertion.

Reported by:	Luiz Otavio O Souza <lists.br at gmail.com>,
		Florian Smeets <flo at kasimir.com>
Approved by:	re (kensmith) (implicit)
2009-07-15 09:19:01 +00:00
Konstantin Belousov
f33a947b56 Add new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.

Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.

Tested by:	pho
Reviewed by:	jhb
Approved by:	re (kensmith)
2009-07-14 22:52:46 +00:00
Konstantin Belousov
79799053a7 Move the repeated code to calculate the number of the threads in the
process that still need to be suspended or exited from thread_single
into the new function calc_remaining().

Tested by:	pho
Reviewed by:	jhb
Approved by:	re (kensmith)
2009-07-14 22:51:31 +00:00
Konstantin Belousov
c8167830f9 When wakeup(9) is going to notify swapper, assert that wait channel is not
equal to &proc0. It shall be not, since proc0 stack is not swappable, and
kick_proc0() is wakeup(&proc0).

Reviewed by:	jhb
Approved by:	re (kensmith)
2009-07-14 22:50:41 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Alan Cox
3153e878dd Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t.  The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes.  Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures.  The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map.  The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

  kmem_alloc_contig() can now be used to allocate kernel memory with
  non-default memory attributes on amd64 and i386.

  vm_page_alloc() and the device pager will set the memory attributes
  for the real or fictitious page according to the object's default
  memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386.  In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by:	re (kib)
2009-07-12 23:31:20 +00:00
Konstantin Belousov
7c6d401c75 The control terminal revocation at the session leader exit does not
correctly checks for reclaimed vnode, possibly calling VOP_REVOKE for
such vnode. If the terminal is already revoked, or devfs mount was
forcibly unmounted, the revocation of doomed ctty vnode causes panic.

Reported and tested by:	lstewart
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-09 18:54:38 +00:00
Jamie Gritton
9dce97d788 Remove crcopy call from seteuid now that it calls crcopysafe.
Reviewed by:	brooks
Approved by:	re (kib), bz (mentor)
2009-07-08 21:45:48 +00:00
Edward Tomasz Napierala
e2b881bf03 Regenerate after lpathconf(2) addition.
Approved by:	re (kib)
2009-07-08 15:25:27 +00:00
Edward Tomasz Napierala
c38898116a There is an optimization in chmod(1), that makes it not to call chmod(2)
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.

This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.

Reviewed by:	rwatson (earlier version)
Approved by:	re (kib)
2009-07-08 15:23:18 +00:00
Ed Schouten
6b53d5c0e7 Fix regressions in return events of poll() on TTYs.
As pointed out, POLLHUP should be generated, even if it hasn't been
specified on input. It is also not allowed to return both POLLOUT and
POLLHUP at the same time.

Reported by:	jilles
Approved by:	re (kib)
2009-07-08 10:21:52 +00:00
Mike Silbersack
347e22a68b Increase HZ_VM from 10 to 100. While 10 hz saves cpu time
under VM environments, it's too slow for FreeBSD to work
properly.  For example, ping at 10hz pings about every 600ms
instead of about every second.

Approved by:	re (kib)
2009-07-08 01:09:12 +00:00
Konstantin Belousov
7f5dff5064 Fix poll(2) and select(2) for named pipes to return "ready for read"
when all writers, observed by reader, exited. Use writer generation
counter for fifo, and store the snapshot of the fifo generation in the
f_seqcount field of struct file, that is otherwise unused for fifos.
Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is
equal to fifo' fi_wgen, and revert r89376.

Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them.
Note that the patch does not fix not returning POLLHUP for fifos.

PR:	kern/94772
Submitted by:	bde (original version)
Reviewed by:	rwatson, jilles
Approved by:	re (kensmith)
MFC after:	6 weeks (might be)
2009-07-07 09:43:44 +00:00
Konstantin Belousov
f1eccd05ec In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
Jamie Gritton
f0899a3460 Call prison_check from vfs_suser rather than re-implementing it.
Approved by:	re (kib), bz (mentor)
2009-07-02 14:19:33 +00:00
Robert Watson
64c9a4d9ab Audit file descriptor and command arguments to ioctl(2).
Approved by:	re (audit argument blanket)
MFC after:	1 week
2009-07-02 09:16:25 +00:00
Robert Watson
2a5658382a Clean up a number of aspects of token generation from audit arguments to
system calls:

- Centralize generation of argument tokens for VM addresses in a macro,
  ADDR_TOKEN(), and properly encode 64-bit addresses in 64-bit arguments.
- Fix up argument numbers across a large number of syscalls so that they
  match the numeric argument into the system call.
- Don't audit the address argument to ioctl(2) or ptrace(2), but do keep
  generating tokens for mmap(2), minherit(2), since they relate to passing
  object access across execve(2).

Approved by:	re (audit argument blanket)
Obtained from:	TrustedBSD Project
MFC after:	1 week
2009-07-02 09:15:30 +00:00
Robert Watson
03f7b00438 For access(2) and eaccess(2), audit the requested access mode.
Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 22:47:45 +00:00
Jeff Roberson
2141453eb4 - Use fd_lastfile + 1 as the upper bound on nd. This is more correct than
using the size of the descriptor array.
 - A lock is not needed to fetch fd_lastfile.  The results are stale the
   instant it is dropped.
 - Use a private mutex pool for select since the pool mutex is not used
   as a leaf.
 - Fetch the si_mtx pointer first before resorting to hashing to compute
   the mutex address.

Reviewed by:	McKusick
Approved by:	re (kib)
2009-07-01 20:43:46 +00:00
Robert Watson
15ca46f69d Audit file descriptor numbers for various socket-related system calls.
Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 19:55:11 +00:00
Robert Watson
9e4c1521d5 Define missing audit argument macro AUDIT_ARG_SOCKET(), and
capture the domain, type, and protocol arguments to socket(2)
and socketpair(2).

Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 18:54:49 +00:00
John Baldwin
cebc7fb16c Improve the handling of cpuset with interrupts.
- For x86, change the interrupt source method to assign an interrupt source
  to a specific CPU to return an error value instead of void, thus allowing
  it to fail.
- If moving an interrupt to a CPU fails due to a lack of IDT vectors in the
  destination CPU, fail the request with ENOSPC rather than panicing.
- For MSI interrupts on x86 (but not MSI-X), only allow cpuset to be used
  on the first interrupt in a group.  Moving the first interrupt in a group
  moves the entire group.
- Use the icu_lock to protect intr_next_cpu() on x86 instead of the
  intr_table_lock to fix a LOR introduced in the last set of MSI changes.
- Add a new privilege PRIV_SCHED_CPUSET_INTR for using cpuset with
  interrupts.  Previously, binding an interrupt to a CPU only performed a
  privilege check if the interrupt had an interrupt thread.  Interrupts
  without a thread could be bound by non-root users as a result.
- If an interrupt event's assign_cpu method fails, then restore the original
  cpuset mask for the associated interrupt thread.

Approved by:	re (kib)
2009-07-01 17:20:07 +00:00
Robert Watson
6d5a61563a When auditing unmount(2), capture FSID arguments as regular text strings
rather than as paths, which would lead to them being treated as relative
pathnames and hence confusingly converted into absolute pathnames.

Capture flags to unmount(2) via an argument token.

Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 16:56:56 +00:00
Robert Watson
422d786676 Audit the file descriptor number passed to lseek(2).
Approved by:	re (kib)
MFC after:	3 days
2009-07-01 15:37:23 +00:00
Robert Watson
c5957d6bba Fix link(2) auditing: use the second audit record path for the new object
name.

Approved by:	re (kib)
MFC after:	3 days
2009-07-01 13:22:08 +00:00
Robert Watson
2ef24dde7c udit the 'options' argument to wait4(2).
Approved by:	re (kib)
MFC after:	3 days
2009-07-01 12:36:10 +00:00
Alan Cox
9785747f87 Remove a stale comment. The very same revision (r85511) that introduced
this comment also implemented the proposed change to the code.

Approved by:	re (kib)
2009-06-30 19:39:17 +00:00
Ed Maste
2dafac3976 Add FIONSPACE from NetBSD. FIONSPACE is provided so that programs may
easily determine how much space is left in the send queue; they do not
need to know the send queue size.

NetBSD revisions:
  sys_socket.c r1.41, 1.42
  filio.h r1.9

Obtained from:	NetBSD
Approved by:	re (kensmith)
2009-06-30 13:38:49 +00:00
Konstantin Belousov
a73034ef7f Free struct ucreds allocated in vfs_hang_addrlist() when deleting
the export element.
While there, remove register storage-class specifiers.

Reported and tested by:	pho
Reviewed by:	kan
Approved by:	re (kensmith)
2009-06-29 18:09:07 +00:00
Attilio Rao
5257ff7ee1 Don't assume a default (currently 15) value for preloaded klds when
loading hwpmc, but calculate at runtime and allocate the necessary space.
Also the current logic is wrong as it can lead to an endless loop.

Sponsored by:   Sandvine Incorporated
Reported by:    Ryan Stone <rstone at sandvine dot com>
Tested by:      Giovanni Trematerra
                <giovanni dot trematerra at gmail dot com>
Approved by:	re (kib)
2009-06-29 16:03:18 +00:00
Stanislav Sedov
fe1d3f15f6 - Turn the third (islocked) argument of the knote call into flags parameter.
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
  without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag.  This is required
  for ZFS as its getattr implementation may sleep.

Approved by:	re (rwatson)
Reviewed by:	kib
MFC after:	2 weeks
2009-06-28 21:49:43 +00:00
Ed Schouten
f9bb1cf010 Add FIONWRITE support to TTYs.
TTYs already supported TIOCOUTQ, but FIONWRITE seems to be a more
generic name for this.

Approved by:	re (kib)
2009-06-28 12:02:15 +00:00
Poul-Henning Kamp
bb520069ca There are a number of ways an application can check if there are
inbound data waiting on a filedescriptor, such as a pipe or a socket,
for instance by using select(2), poll(2), kqueue(2), ioctl(FIONREAD)
etc.

But we have no way of finding out if written data have yet to be
disposed of, for instance, transmitted (and ack'ed!) to some remote
host, or read by the applicantion at the far end of the pipe.

The closest we get, is calling shutdown(2) on a TCP socket in
non-blocking mode, but this has the undesirable sideeffect of
preventing future communication.

Add a complement to FIONREAD, called FIONWRITE, which returns the
number of bytes not yet properly disposed of.  Implement it for
all sockets.

Background:

A HTTP server will want to time out connections, if no new request
arrives within a certain period after the last transmitted response
has actually been sent (and ack'ed).

For a busy HTTP server, this timeout can be subsecond duration.

In order to signal to a load-balancer that the connection is truly
dead, TCP_RST will be the preferred method, as this avoids the need
for a RTT delay for FIN handshaking, with a client which, surprisingly
often, no longer at the remote IP number.

If a slow, distant client is being served a response which is big
enough to fill the window, but small enough to fit in the socket
buffer, the write(2) call will return immediately.

If the session timeout is armed at that time, all bytes in the
response may not have been transmitted by the time it fires.

FIONWRITE allows the timeout to check that no data is outstanding
on the connection, before it TCP_RST's it.

Input & Idea from: rwatson
Approved by:	re (kib)
2009-06-28 11:28:14 +00:00
Alan Cox
de0c3e0895 Correct a long-standing performance bug in cluster_rbuild(). Specifically,
in the case of a file system with a block size that is less than the page
size, cluster_rbuild() looks at too many of the page's valid bits.
Consequently, it may terminate prematurely, resulting in poor performance.

Reported by:	bde
Reviewed by:	tegge
Approved by:	re (kib)
2009-06-27 21:37:36 +00:00
Robert Watson
14961ba789 Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type.  This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by:	brooks
Approved by:	re (kib)
Obtained from:	TrustedBSD Project
MFC after:	1 week
2009-06-27 13:58:44 +00:00
Alan Cox
e999111ae7 This change is the next step in implementing the cache control functionality
required by video card drivers.  Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures.  In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig().  These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.

In collaboration with:	jhb
2009-06-26 04:47:43 +00:00
Konstantin Belousov
28fe6a3f3e In lf_iteratelocks_vnode, increment state->ls_threads around iterating
of the vnode advisory lock list. This prevents deallocation of state
while inside the loop.

Reported and tested by:	pho
MFC after:	2 weeks
2009-06-25 18:54:56 +00:00
John Baldwin
9bd55acf5e Return errors from intr_event_bind() to the caller of intr_set_affinity().
Specifically, if a non-root user attempts to bind an interrupt the request
will now report failure with EPERM rather than silently failing with a
successful return code.

MFC after:	1 week
2009-06-25 18:35:19 +00:00
John Baldwin
af88b2c276 Use the correct cast for the arguments passed to freebsd_shmctl() in
oshmctl().

Submitted by:	kib
2009-06-25 17:11:27 +00:00
John Baldwin
ca99828420 Tweak the oshmctl() compile fix: convert the K&R definition to ANSI. 2009-06-25 13:36:57 +00:00
Robert Watson
893ef4d21c oshmctl() now requires a sysv_shm.c-local function prototype. 2009-06-25 07:16:10 +00:00
Jeff Roberson
791d9a6e5d - Use DPCPU for SCHED_STATS. This is somewhat awkward because the
offset of the stat is not known until link time so we must emit a
   function to call SYSCTL_ADD_PROC rather than using SYSCTL_PROC
   directly.
 - Eliminate the atomic from SCHED_STAT_INC now that it's using per-cpu
   variables.  Sched stats are always incremented while we're holding
   a spinlock so no further protection is required.

Reviewed by:	sam
2009-06-25 01:33:51 +00:00
Jeff Roberson
e6e06a8b47 - Add a sysctl_dpcpu_long to support long typed pcpu stats.
- Remove the #ifndef SMP case as the SMP code works on UP as well.

Reviewed by:	sam
2009-06-25 01:31:59 +00:00
Jamie Gritton
499650a033 Wrap a PR_VNET inside "#ifdef VIMAGE" since that the only place it applies.
bz wants the blame for this.

Noticed by:	rwatson
Approved by:	bz (mentor)
2009-06-24 22:06:56 +00:00
John Baldwin
3899e0dfe7 Regen. 2009-06-24 21:54:08 +00:00
Jamie Gritton
6bb795633c In case of prisons with their own network stack, permit
additional privileges as well as not restricting the type of
sockets a user can open.

Note: the VIMAGE/vnet fetaure of of jails is still considered
      experimental and cannot guarantee that privileged users
      can be kept imprisoned if enabled.

Reviewed by:	rwatson
Approved by:	bz (mentor)
2009-06-24 21:39:50 +00:00
John Baldwin
b648d4806b Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
  short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
  short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
  (this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
  removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
  int.  This removes the need for the shm_bsegsz member in struct
  shmid_kernel and should allow for complete support of SYSV SHM regions
  >= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
  short.
- The shm_internal member of struct shmid_ds is now gone.  The internal
  VM object pointer for SHM regions has been moved into struct
  shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
  now marked COMPAT7 and new versions of those system calls which support
  the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc.  The
  FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
  symbol versions has been added to libc.  Version tags are added to
  system calls by adding an appropriate __sym_compat() entry to
  src/lib/libc/incldue/compat.h. [1]

PR:		kern/16195 kern/113218 bin/129855
Reviewed by:	arch@, rwatson
Discussed with:	kan, kib [1]
2009-06-24 21:10:52 +00:00
John Baldwin
45f48220bb Deprecate the msgsys(), semsys(), and shmsys() system calls by moving
them under COMPAT_FREEBSD[4567].  Starting with FreeBSD 5.0 the SYSV IPC
API was implemented via direct system calls (e.g. msgctl(), msgget(), etc.)
rather than indirecting through the var-args *sys() system calls.  The
shmsys() system call was already effectively deprecated for all but
COMPAT_FREEBSD4 already as its implementation for the !COMPAT_FREEBSD4 case
was to simply invoke nosys().
2009-06-24 20:01:13 +00:00
Ulf Lilleengen
2dd43ecaec - Similar to the previous commit, but for CURRENT: Fix a bug where a FIFO vnode
use count was increased twice, but only decreased once.
2009-06-24 18:44:38 +00:00
Ulf Lilleengen
7083246d7c - Fix a bug where a FIFO vnode use count was increased twice, but only
decreased once.

MFC after:	1 week
2009-06-24 18:38:51 +00:00
Jamie Gritton
bf8c1f990d Fix a race in vi_if_move, where a vnet is used after the prison that
referred to it has been released.

Approved by:	bz (mentor)
2009-06-24 15:29:36 +00:00
John Baldwin
3c366f1f14 Add a new COMPAT7 flag for FreeBSD 7.x compatibility system calls. 2009-06-24 13:36:37 +00:00
John Baldwin
71361470b1 - Move syscall function argument structure types to be just above the
relevenat system call function.
- Whitespace fixes.
2009-06-24 13:35:38 +00:00
Robert Watson
36fecbf302 Add stack_print_short() and stack_print_short_ddb() interfaces to
stack(9), which generate a more compact rendition of a stack trace
via the kernel's printf.

MFC after:	1 week
2009-06-24 12:06:15 +00:00
Jeff Roberson
50c202c592 Implement a facility for dynamic per-cpu variables.
- Modules and kernel code alike may use DPCPU_DEFINE(),
   DPCPU_GET(), DPCPU_SET(), etc. akin to the statically defined
   PCPU_*.  Requires only one extra instruction more than PCPU_* and is
   virtually the same as __thread for builtin and much faster for shared
   objects.  DPCPU variables can be initialized when defined.
 - Modules are supported by relocating the module's per-cpu linker set
   over space reserved in the kernel.  Modules may fail to load if there
   is insufficient space available.
 - Track space available for modules with a one-off extent allocator.
   Free may block for memory to allocate space for an extent.

Reviewed by:    jhb, rwatson, kan, sam, grehan, marius, marcel, stas
2009-06-23 22:42:39 +00:00
Jeff Roberson
c76ee82799 - Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE
supports arbitrary numbers of cpus rather than being limited by
   cpumask_t to the number of bits in a long.
2009-06-23 22:12:37 +00:00
Ed Schouten
9801591468 Improve my last commit: use a separate condvar to serialize.
The advantage of using a separate condvar is that we can just use
cv_signal(9) instead of cv_broadcast(9). It makes no sense to wake up
multiple threads. It also makes the TTY code easier to understand.
t_dcdwait sounds totally unrelated.
2009-06-23 21:43:02 +00:00
Ed Schouten
2d41cf3a24 Use dcdwait to block threads to serialize writes.
I suspect the usage of bgwait causes a lot of spurious wakeups when
threads are blocked in the background, because they will be woken up
each time a write() call is performed.

Also wakeup dcdwait when the TTY is abandoned.
2009-06-23 21:33:26 +00:00
Konstantin Belousov
3364c323e6 Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with:	pho
Reviewed by:	alc
Approved by:	re (kensmith)
2009-06-23 20:45:22 +00:00
Jamie Gritton
b97457e2e6 Add a limit for child jails via the "children.cur" and "children.max"
parameters.  This replaces the simple "allow.jails" permission.

Approved by:	bz (mentor)
2009-06-23 20:35:51 +00:00
Bjoern A. Zeeb
5736e6fb9d After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.
2009-06-23 17:03:45 +00:00
Jamie Gritton
399645e1b9 Remove unnecessary/redundant includes.
Approved by:	bz (mentor)
2009-06-23 14:39:21 +00:00
Peter Holm
401679debe vn_open_cred() needs a non NULL ucred pointer
Reviewed by:	kib
2009-06-23 11:29:54 +00:00
Andre Oppermann
ef760e6ad2 Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
  the length of data to be moved into the uio compared to
  soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
  cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams.  Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets.  Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4:	r164817 //depot/user/andre/soreceive_stream/
2009-06-22 23:08:05 +00:00
Andre Oppermann
bc05b2f6fa Add m_mbuftouio() helper function to copy(out) an arbitrary
long mbuf chain into an arbitrary large uio in a single step.
It is a functional mirror image of m_uiotombuf().

This function is supposed to be used instead of hand rolled code
with the same purpose and to concentrate it into one place for
potential further optimization or hardware assistance.
2009-06-22 22:20:38 +00:00
Andre Oppermann
4ad1c464fe In sbappendstream_locked() demote all incoming packet mbufs (and
chains) to pure data mbufs using m_demote().  This removes the
packet header and all m_tag information as they are not meaningful
anymore on a stream socket where mbufs are linked through m->m_next.
Strictly speaking a packet header can be only ever valid on the first
mbuf in an m_next chain.

sbcompress() was doing this already when the mbuf chain layout lent
itself to it (e.g. header splitting or merge-append), just not
consistently.

This frees resources at socket buffer append time instead of at
sbdrop_internal() time after data has been read from the socket.

For MAC the per packet information has done its duty and during
socket buffer appending the policy of the socket itself takes over.
With the append the packet boundaries disappear naturally and with
it any context that was based on it.  None of the residual information
from mbuf headers in the socket buffer on stream sockets was looked at.
2009-06-22 21:46:40 +00:00
John Baldwin
e47a833f38 Regen. 2009-06-22 20:24:03 +00:00
John Baldwin
773fa740c9 Include definitions for the audit identifiers for compat system calls in
sysproto.h.  This makes it possible to use SYSCALL_MODULE() for compat
system calls that live in kernel modules.
2009-06-22 20:14:10 +00:00
John Baldwin
0b0fe06a5e Fix a typo in a comment. 2009-06-22 20:12:40 +00:00
Andre Oppermann
0cb6db1d21 Update m_demote:
- remove HT_HEADER test (MT_HEADER == MT_DATA for some time now)
- be more pedantic about m_nextpkt in other than first mbuf
- update m_flags to be retained
2009-06-22 19:35:39 +00:00
Konstantin Belousov
c808c9632d Add explicit struct ucred * argument for VOP_VPTOCNP, to be used by
vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.

This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.

Reviewed by:	rwatson
2009-06-21 19:21:01 +00:00
Konstantin Belousov
e0c161b89c Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson
2009-06-21 13:41:32 +00:00
Roman Divacky
f7490cfe01 In non-debugging mode make this define (void)0 instead of nothing. This
helps to catch bugs like the below with clang.

	if (cond);		<--- note the trailing ;
	   something();

Approved by:	ed (mentor)
Discussed on:	current@
2009-06-21 07:54:47 +00:00
Brooks Davis
7f92e57820 Change crsetgroups_locked() (called by crsetgroups()) to sort the
supplemental groups using insertion sort.  Use this property in
groupmember() to let us use a binary search instead of the previous
linear search.
2009-06-20 20:29:21 +00:00
Ed Schouten
f8f6146082 Improve nested jail awareness of devfs by handling credentials.
Now that we start to use credentials on character devices more often
(because of MPSAFE TTY), move the prison-checks that are in place in the
TTY code into devfs.

Instead of strictly comparing the prisons, use the more common
prison_check() function to compare credentials. This means that
pseudo-terminals are only visible in devfs by processes within the same
jail and parent jails.

Even though regular users in parent jails can now interact with
pseudo-terminals from child jails, this seems to be the right approach.
These processes are also capable of interacting with the jailed
processes anyway, through signals for example.

Reviewed by:	kib, rwatson (older version)
2009-06-20 14:50:32 +00:00
Kip Macy
5b204a113c define helper routines for deferred mbuf initialization 2009-06-19 21:14:39 +00:00
Brooks Davis
838d985825 Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
John Baldwin
afd9f91c63 Fix a deadlock in the getpeername() method for UNIX domain sockets.
Instead of locking the local unp followed by the remote unp, use the same
locking model as accept() and read lock the global link lock followed by
the remote unp while fetching the remote sockaddr.

Reported by:	Mel Flynn  mel.flynn of mailing.thruhere.net
Reviewed by:	rwatson
MFC after:	1 week
2009-06-18 20:56:22 +00:00
Alan Cox
c45c00344f Utilize the new function kmem_alloc_contig() to implement the UMA back-end
allocator for the jumbo frames zones.  This change has two benefits: (1) a
custom back-end deallocator is no longer required.  UMA's standard
deallocator suffices. (2) It eliminates a potentially confusing artifact
of using contigmalloc(): The malloc(9) statistics contain bogus information
about the usage of jumbo frames.  Specifically, the malloc(9) statistics
report all jumbo frames in use whereas the UMA zone statistics report the
"truth" about the number in use vs. the number free.
2009-06-18 17:59:04 +00:00
John Baldwin
4b7b144f63 Regen. 2009-06-17 19:53:47 +00:00
John Baldwin
21def99b51 - Add the ability to mix multiple flags seperated by pipe ('|') characters
in the type field of system call tables.  Specifically, one can now use
  the 'NO*' types as flags in addition to the 'COMPAT*' types.  For example,
  to tag 'COMPAT*' system calls as living in a KLD via NOSTD.  The COMPAT*
  type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
  makesyscalls.sh that return true if a requested flag is found in the
  type field ($3).  The flag() function checks all of the flags in the
  field, but type() only checks the first flag.  type() is meant to be
  used in the top-level "switch" statement and flag() should be used
  otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
  the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
  say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
2009-06-17 19:50:38 +00:00
John Baldwin
0ec0b41cf6 Remove the now-unused NOIMPL flag. It serves no useful purpose given the
existing UNIMPL and NOSTD types.
2009-06-17 18:46:14 +00:00
John Baldwin
f425839118 - NOSTD results in lkmressys being used instead of lkmssys.
- Mark nfsclnt as UNIMPL.  It should have been NOSTD instead of NOIMPL back
  when it lived in nfsclient.ko, but it was removed from that a long time
  ago.
2009-06-17 18:44:15 +00:00
Bjoern A. Zeeb
ebd8672cc3 Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.
2009-06-17 15:01:01 +00:00
Konstantin Belousov
f02c9d2858 Decrement state->ls_threads when vnode appeared to be doomed.
Reported and tested by:	pho
2009-06-17 12:43:04 +00:00
Attilio Rao
651175c9db Introduce support for adaptive spinning in lockmgr.
Actually, as it did receive few tuning, the support is disabled by
default, but it can opt-in with the option ADAPTIVE_LOCKMGRS.
Due to the nature of lockmgrs, adaptive spinning needs to be
selectively enabled for any interested lockmgr.
The support is bi-directional, or, in other ways, it will work in both
cases if the lock is held in read or write way.  In particular, the
read path is passible of further tunning using the sysctls
debug.lockmgr.retries and debug.lockmgr.loops .  Ideally, such sysctls
should be axed or compiled out before release.

Addictionally note that adaptive spinning doesn't cope well with
LK_SLEEPFAIL.  The reason is that many (and probabilly all) consumers
of LK_SLEEPFAIL are mainly interested in knowing if the interlock was
dropped or not in order to reacquire it and re-test initial conditions.
This directly interacts with adaptive spinning because lockmgr needs
to drop the interlock while spinning in order to avoid a deadlock
(further details in the comments inside the patch).

Final note: finding someone willing to help on tuning this with
relevant workloads would be either very important and appreciated.

Tested by:	jeff, pho
Requested by:	many
2009-06-17 01:55:42 +00:00
Konstantin Belousov
01ed174831 Do not use casts (int *)0 and (struct thread *)0 for the arguments of
vn_rdwr, use NULL.

Reviewed by:	jhb
MFC after:	1 week
2009-06-16 15:13:45 +00:00
Ed Schouten
eaaaf1906b Perform some more cleanups to in-kernel session handling.
The code that was in place in exit1() was mainly based on code from the
old TTY layer. The main reason behind this, was because at one moment I
ran a system that had two TTY layers in place at the same time. It is
now sufficient to do the following:

- Remove references from the session structure to the TTY vnode and the
  session leader.

- If we have a controlling TTY and the session used by the TTY is equal
  to our session, send the SIGHUP.

- If we have a vnode to the controlling TTY which has not been revoked,
  revoke it.

While there, change sys/kern/tty.c to use s_ttyp in the comparison
instead of s_ttyvp. It should not make any difference, because s_ttyvp
can only become null when the session leader already left, but it's
nicer to compare against the proper value.
2009-06-15 20:45:51 +00:00
John Baldwin
6653a58307 Regen. 2009-06-15 20:40:23 +00:00
John Baldwin
c4f16b69e1 Add a new 'void closefrom(int lowfd)' system call. When called, it closes
any open file descriptors >= 'lowfd'.  It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD.  One difference from other *BSD is that this closefrom() does not
fail with any errors.  In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR.  DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.

Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads.  As such, it is not
multithread safe.

Submitted by:	rwatson (initial version)
Reviewed by:	rwatson
MFC after:	2 weeks
2009-06-15 20:38:55 +00:00