11168 Commits

Author SHA1 Message Date
Bjoern A. Zeeb
8d8bc0182e After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
2009-06-08 19:57:35 +00:00
Marko Zec
bc29160df3 Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state.  The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers.  Many of such issues are
already known and will be fixed incrementally over the next weeks in
smaller commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels.  Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by:	bz, julian
Approved by:	rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
Robert Watson
324fb6be0f Move zombie-reaping code out of kern_wait() and into its own function,
proc_reap().
Reviewed by:	jhb
MFC after:	3 days
Sponsored by:	Google, Inc.
2009-06-08 15:26:09 +00:00
Konstantin Belousov
ce3fa4d5cf Do not dereference vp->v_rdev without holding any of dev_mtx or vnode
lock. Use code similar to devfs_fp_check(), but inlined to fit other
checks performed by ttyhook_register().

Reviewed by:	ed
2009-06-08 13:34:45 +00:00
Alan Cox
0a2e596a93 Eliminate unnecessary obfuscation when testing a page's valid bits.
2009-06-07 19:38:26 +00:00
Alan Cox
0a276edef9 Eliminate an unused variable from allocbuf().
Eliminate the unnecessary setting of page valid bits from a non-VMIO buffer
in vm_hold_load_pages().
2009-06-07 18:19:04 +00:00
Alan Cox
d1a6e42ddd If vm_pager_get_pages() returns VM_PAGER_OK, then there is no need to check
the page's valid bits.  The page is guaranteed to be fully valid.  (For the
record, this is documented in vm/vm_pager.h's comments.)
2009-06-06 20:13:14 +00:00
Dag-Erling Smørgrav
322ef7cc60 Eliminate trailing_slash, which was made redundant in r193028.
Remove a couple of 4-year-old "temporary" KASSERTs.
Improve comments.

MFC after:	1 week
2009-06-06 00:49:49 +00:00
Joe Marcus Clarke
8a4444049e Unlock the cache lock before returning when we run out of buffer space
trying to fill in the full path name.

Reported by:	David Naylor <naylor.b.david@gmail.com>
Approved by:	kib
2009-06-05 16:44:42 +00:00
Ed Schouten
8888f696d0 Remove clists from the kernel.
Clists were originally used by the TTY layer as a text buffer interface.
The advantage of clists was that they would allocate a small set of
additional buffers that could be shared between TTYs when needed. These
days we can just allocate some more KBs of memory to keep the
TTYs satisfied. The global cfreelist also requires synchronisation,
which may not be useful when trying to improve scalability.

The MPSAFE TTY layer uses its own text buffers (ttyinq and ttyoutq). We
had a small number of drivers in the tree that still used clists, like
the old USB stack and some keyboard drivers. With the old USB stack gone
and the keyboard drivers changed to use a circular buffer, we can safely
remove clists from the kernel.
2009-06-05 15:31:38 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
af4656042e Add mac_framework.h include missed when MAC code was (presumably) copied
from another file.
2009-06-05 14:23:24 +00:00
Brian Somers
f121baaa45 If we're passed garbage in malloc_init(), panic() rather than expecting
a KASSERT to handle it.  People are likely to turn off INVARIANTS RSN
and loading an old module can cause garbage-in here.

I saw the issue with an older nvidia driver (x11/nvidia-driver) loading
into a new kernel - a crash wasn't seen 'till sysctl_kern_malloc_stats().
I was lucky that mtp->ks_shortdesc was NULL and not something horrible.

While I'm here, KASSERT that malloc_uninit() isn't passed something that's
not in kmemstatistics.

MFC after:	3 weeks
2009-06-05 09:16:52 +00:00
Paul Saab
983b43c5de When checking for shared writes, use the struct mount returned from
vn_start_write.

Reviewed by:	jhb
2009-06-04 16:50:03 +00:00
Paul Saab
a6d545d8ed Support shared vnode locks for write operations when the offset is
provided on filesystems that support it.  This really improves mysql
+ innodb performance on ZFS.

Reviewed by:	jhb, kmacy, jeffr
2009-06-04 16:18:07 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
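The conditional locking described in f93bfb23dc can be sketched in userland C as follows. This is only an illustration, not the MAC Framework code: `policy_count`, `struct mock_socket`, and `mock_check_socket()` are hypothetical names, and a boolean plus a counter stand in for the real socket lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; lock_ops counts lock acquisitions. */
struct mock_socket {
	bool locked;
	int lock_ops;
};

/* Hypothetical counterpart of the registered-policy counter. */
static int policy_count = 0;

static void
mock_lock(struct mock_socket *so)
{
	so->locked = true;
	so->lock_ops++;
}

static void
mock_unlock(struct mock_socket *so)
{
	so->locked = false;
}

/*
 * Entry point sketch: take the lock only when at least one policy is
 * registered, so the no-policy case costs a single integer test.
 * Always returns 0 here; a real entry point would combine policy results.
 */
static int
mock_check_socket(struct mock_socket *so)
{
	assert(so != NULL);
	if (policy_count == 0)
		return (0);
	mock_lock(so);
	/* ... registered policies would be consulted here ... */
	mock_unlock(so);
	return (0);
}
```

With `policy_count` at zero the mock lock is never touched; once it is non-zero, each call acquires and releases the lock exactly once.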
Robert Watson
e499bd28dd Remove unneeded include.
MFC after:	3 days
2009-06-02 15:59:46 +00:00
Attilio Rao
f083018223 Handle lock recursion differently by always checking against LO_RECURSABLE
instead of the lock's own flag itself.

Tested by:	pho
2009-06-02 13:03:35 +00:00
Alan Cox
3c33df624c Correct a boundary case error in the management of a page's dirty bits by
shm_dotruncate() and vnode_pager_setsize().  Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared.  Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.
2009-06-02 08:02:27 +00:00
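The boundary case fixed in 3c33df624c can be illustrated with a small userland sketch. It assumes a 4096-byte page tracked by eight dirty bits, one per 512-byte block; the constants and the function name `dirty_after_truncate()` are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  4096u  /* assumed page size */
#define SKETCH_BLOCK_SIZE 512u   /* assumed span covered by one dirty bit */

/*
 * Given a page's dirty bits (bit i covers block i) and the new end of the
 * object within that page, return the dirty bits with only the entirely
 * truncated blocks cleared.  A block that still holds at least one valid
 * byte keeps its dirty bit.
 */
static uint8_t
dirty_after_truncate(uint8_t dirty, unsigned end)
{
	unsigned first;
	uint8_t keep;

	assert(end < SKETCH_PAGE_SIZE);
	/* First block lying wholly at or beyond the new end (round up). */
	first = (end + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;
	/* Mask of the blocks that survive, fully or partially. */
	keep = (uint8_t)((1u << first) - 1);
	return (dirty & keep);
}
```

For a new end at offset 100 within a fully dirty page, block 0 is only partially truncated, so its bit survives and the result is 0x01 rather than 0x00.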
Jeff Roberson
f4471727f3 - Use an acquire barrier to increment f_count in fget_unlocked and
remove the volatile cast.  Describe the reason in detail in a comment.

Discussed with:	bde, jhb
2009-06-02 06:55:32 +00:00
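A rough userland analogue of taking a reference with an acquire barrier, written with C11 atomics. The commit does not show the kernel code, so everything here is an assumption for illustration: `struct mock_file`, `mock_ref_acquire()`, and the choice to refuse a reference when the count is already zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a file object with a reference count. */
struct mock_file {
	atomic_uint count;
};

/*
 * Try to take a reference without holding a lock.  The increment uses
 * acquire ordering so that later reads of the object are not reordered
 * before the point where the reference was obtained.  Returns false if
 * the count was already zero (the object is being torn down).
 */
static bool
mock_ref_acquire(struct mock_file *fp)
{
	unsigned old;

	assert(fp != NULL);
	old = atomic_load_explicit(&fp->count, memory_order_relaxed);
	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(&fp->count, &old,
		    old + 1, memory_order_acquire, memory_order_relaxed))
			return (true);
		/* A failed exchange reloaded 'old'; loop and retry. */
	}
	return (false);
}
```

The atomic type makes a volatile cast unnecessary in this sketch, since every access goes through an explicit atomic operation.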
John Baldwin
64345f0b57 Add an extension to the character device interface that allows character
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
  added to cdevsw.  This function is called for each mmap() request.
  If it returns ENODEV, then the mmap() request will fall back to using
  the device's device pager object and d_mmap().  Otherwise, the method
  can return a VM object to satisfy this entire mmap() request via
  *object.  It can also modify the starting offset into this object via
  *foff.  This allows device drivers to use the file offset as a cookie
  to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
  mapping V_CHR vnodes.  This avoids duplicating all the cdev mmap
  handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02.  Older device drivers
  using D_VERSION_01 are still supported.

MFC after:	1 month
2009-06-01 21:32:52 +00:00
John Baldwin
74fb0ba732 Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
  locked.  It is not permissible to call soisconnected() with this lock
  held, however, so socket upcalls now return an integer value.  The two
  possible values are SU_OK and SU_ISCONNECTED.  If an upcall returns
  SU_ISCONNECTED, then soisconnected() will be invoked on the
  socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls.  The
  API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
  the receive socket buffer automatically.  Note that a SO_SND upcall
  should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
  instead of calling soisconnected() directly.  They also no longer need
  to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
  state machine, but other accept filters no longer have any explicit
  knowledge of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
  while invoking soreceive() as a temporary band-aid.  The plan for
  the future is to add a new flag to allow soreceive() to be called with
  the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
  buffer locked.  Previously sowakeup() would drop the socket buffer
  lock only to call aio_swake() which immediately re-acquired the socket
  buffer lock for the duration of the function call.

Discussed with:	rwatson, rmacklem
2009-06-01 21:17:03 +00:00
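The return-value protocol described in 74fb0ba732 can be sketched with a userland mock. None of the names below are the kernel's: `MOCK_SU_OK`, `MOCK_SU_ISCONNECTED`, `struct mock_sock`, `mock_sowakeup()`, and `mock_accept_filter()` are hypothetical, and a boolean stands in for the socket buffer lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_SU_OK          0
#define MOCK_SU_ISCONNECTED 1

/* Hypothetical socket buffer carrying its own upcall. */
struct mock_sockbuf {
	bool locked;
	int (*upcall)(struct mock_sockbuf *, void *);
	void *upcall_arg;
};

struct mock_sock {
	struct mock_sockbuf rcv;
	bool connected;
};

/* Must not run with the buffer lock held, mirroring the stated rule. */
static void
mock_soisconnected(struct mock_sock *so)
{
	assert(!so->rcv.locked);
	so->connected = true;
}

/* Invoke the upcall with the lock held; act on its result after unlock. */
static void
mock_sowakeup(struct mock_sock *so)
{
	int ret = MOCK_SU_OK;

	so->rcv.locked = true;
	if (so->rcv.upcall != NULL) {
		ret = so->rcv.upcall(&so->rcv, so->rcv.upcall_arg);
		if (ret == MOCK_SU_ISCONNECTED) {
			/* The upcall is cleared automatically. */
			so->rcv.upcall = NULL;
			so->rcv.upcall_arg = NULL;
		}
	}
	so->rcv.locked = false;
	if (ret == MOCK_SU_ISCONNECTED)
		mock_soisconnected(so);
}

/* An accept-filter-style upcall: report readiness via the return value. */
static int
mock_accept_filter(struct mock_sockbuf *sb, void *arg)
{
	(void)arg;
	assert(sb->locked);
	return (MOCK_SU_ISCONNECTED);
}
```

The point of the design is that the upcall never calls the connection routine itself; it only reports a result, and the caller acts on it once the lock has been dropped.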
John Baldwin
b36cfff75d Add a simple API to manage scatter/gather lists of physical addresses.
Each list describes a logical memory object that is backed by one or more
physical address ranges.  To minimize locking, the sglist objects
themselves are immutable once they are shared.

These objects may be used in the future to facilitate I/O requests using
physically-addressed buffers.  For the immediate future I plan to use them
to implement a new type of VM object and pager.

Reviewed by:	jeff, scottl
MFC after:	1 month
2009-06-01 20:35:39 +00:00
Robert Watson
820c518125 Add a flags field to struct ucred, and export that via kinfo_proc,
consuming one of its spare fields.  The cr_flags field is currently
unused, but will be used for features, including capability mode and
pay-as-you-go audit.

Discussed with:	jhb, sson
2009-06-01 20:26:51 +00:00
Robert Watson
33dd50646e Regenerate generated syscall files following changes to struct sysent in
r193234.
2009-06-01 16:14:38 +00:00
Robert Watson
46c10f272c Add 'sy_flags', a currently unused per-syscall entry flags field that will
see future use in 9-CURRENT and 8-STABLE for features such as the
capability-mode enable flag and pay-as-you-audit.

Discussed with:	jhb, sson
2009-06-01 16:13:06 +00:00
Robert Watson
d4b5cae49b Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processing its own
  workstream, or set of per-protocol queues.  Threads may be bound
  to specific CPUs, or allowed to migrate, based on a global policy.

  In the future it would be desirable to support topology-centric
  policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
  currently be one of:

  NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
    an implicit or explicit source (such as an interface or socket).

  NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
    as well as allowing protocols to provide a flow generation function
    for mbufs without flow identifiers (m2flow).  Falls back on
    NETISR_POLICY_SOURCE if no flow ID is available.

  NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
    each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
  being used, as well as a mapping function from workstream to CPU ID,
  which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
  get and clear drop counters, which query data or apply changes across
  all workstreams.

- Add a more extensible netisr registration interface, in which
  protocols declare 'struct netisr_handler' structures for each
  registered NETISR_ type.  These include name, handler function,
  optional mbuf to flow ID function, optional mbuf to CPU ID function,
  queue limit, and ordering policy.  Padding is present to allow these
  to be expanded in the future.  If no queue limit is declared, then
  a default is used.

- Queue limits are now per-workstream, and raised from the previous
  IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
  with the exception of netnatm, default queue limits.  Most protocols
  register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
  NETISR_POLICY_FLOW, and will therefore take advantage of driver-
  generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
  the netisr, rather than having polling pretend to be two protocols.
  Provide two explicit hooks in the netisr worker for start and end
  events for runs: netisr_poll() and netisr_pollmore(), as well as a
  function, netisr_sched_poll(), to allow the polling code to schedule
  netisr execution.  DEVICE_POLLING still embeds single-netisr
  assumptions in its implementation, so for now if it is compiled into
  the kernel, a single and un-bound netisr thread is enforced
  regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking.  It will be enabled once further rmlock
optimization has taken place.  However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by:	bz
2009-06-01 10:41:38 +00:00
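The fallback from flow-based to source-based placement in d4b5cae49b can be sketched as below. The commit does not say how an identifier is mapped to a workstream, so the modulo mapping is purely an assumption, as are the names `struct mock_pkt` and `pick_workstream()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet metadata. */
struct mock_pkt {
	bool has_flowid;     /* was a flow ID supplied or generated? */
	uint32_t flowid;
	uint32_t source_id;  /* e.g. receiving interface index (assumption) */
};

/*
 * Choose a workstream for a packet under a flow policy: use the flow ID
 * when one is available, otherwise fall back to the packet's source so
 * that ordering per source is still preserved.
 */
static unsigned
pick_workstream(const struct mock_pkt *pkt, unsigned nws)
{
	assert(pkt != NULL && nws > 0);
	if (pkt->has_flowid)
		return (pkt->flowid % nws);
	return (pkt->source_id % nws);
}
```

Either way, all packets sharing the same key land on the same workstream, which is what keeps them in order.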
Alan Cox
6864a18c41 Eliminate a comment describing code that was deleted over eight years ago.
Move another comment to its proper place.  Fix a typo in a third comment.
2009-06-01 06:12:08 +00:00
Craig Rodrigues
0c349f0856 sys/boot/common.c
=================
Extend the loader to parse the root file system mount options in /etc/fstab,
and set a new loader variable vfs.root.mountfrom.options with these options.
The root mount options must be a comma-delimited string, as specified in
/etc/fstab.
Only set the vfs.root.mountfrom.options variable if it has not been
set in the environment.

sys/kern/vfs_mount.c
====================
When mounting the root file system, pass the mount options
specified in vfs.root.mountfrom.options, but filter out "rw" and "noro",
since the initial mount of the root file system must be done as "ro".
While we are here, try to add a few hints to the mountroot prompt
to give users an idea what might have gone wrong during mounting
of the root file system.

Reviewed by:	jhb (an earlier patch)
2009-06-01 01:02:30 +00:00
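The option filtering described for sys/kern/vfs_mount.c in 0c349f0856 can be sketched in userland C. This is not the kernel implementation; `filter_root_opts()` is a hypothetical helper that drops "rw" and "noro" from a comma-delimited option string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the comma-delimited options in 'in' to 'out', dropping "rw" and
 * "noro".  'out' must have room for strlen(in) + 1 bytes.
 */
static void
filter_root_opts(const char *in, char *out)
{
	const char *p = in;

	assert(in != NULL && out != NULL);
	out[0] = '\0';
	while (*p != '\0') {
		/* Length of the current option, up to the next comma. */
		size_t len = strcspn(p, ",");

		if (len > 0 &&
		    !(len == 2 && strncmp(p, "rw", 2) == 0) &&
		    !(len == 4 && strncmp(p, "noro", 4) == 0)) {
			if (out[0] != '\0')
				strcat(out, ",");
			strncat(out, p, len);
		}
		p += len;
		if (*p == ',')
			p++;
	}
}
```

Given "noatime,noro,acls" the helper produces "noatime,acls", and an input of just "rw" yields an empty string.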
Alan Cox
1f17689408 nfs_write() can use the recently introduced vfs_bio_set_valid() instead of
vfs_bio_set_validclean(), thereby avoiding the page queues lock.

Garbage collect vfs_bio_set_validclean().  Nothing uses it any longer.
2009-05-31 20:18:02 +00:00
Konstantin Belousov
1358a7957d Unbreak the build. Add missed probes.
Reviewed by:	rwatson
Pointy hat to:	me
2009-05-31 20:16:06 +00:00
Konstantin Belousov
0449e6e1eb Eliminate code duplication in vn_fullpath1() around the cache lookups
and calls to vn_vptocnp() by moving more of the common code to
vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that
cache is locked around the call.

Do not track buffer position by both the pointer and offset, use only
buflen to record the start of the free space.

Export vn_vptocnp() for external consumers as a wrapper around
vn_vptocnp_locked() that locks the cache and handles hold counts.

Tested by:	pho
2009-05-31 14:57:43 +00:00
Dmitry Chagin
4202e1be20 Split native socketpair() syscall into kern_socketpair() which should
be used by kernel consumers and socketpair() itself.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-31 12:12:38 +00:00
Marko Zec
feb08d06b9 Introduce an interim userland-kernel API for creating vnets and
assigning ifnets from one vnet to another.  Deletion of vnets is not
yet supported.

The interface is implemented as an ioctl extension so that no syscalls
had to be introduced.  This should be acceptable given that the new
interface will be used for a short / interim period only, until the
new jail management framework gains the capability of managing vnets.
This method for managing vimages / vnets has been in use for the past
7 years without any observable issues.

The userland tool to be used in conjunction with the interim API can be
found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and
will most probably never get committed to svn.

While here, bump copyright notices in kern_vimage.c and vimage.h to
cover work done in year 2009.

Approved by:	julian (mentor)
Discussed with:	bz, rwatson
2009-05-31 12:10:04 +00:00
Nathan Whitehorn
f436f17508 Provide a new CPU device driver ivar to report the nominal speed of the
CPU, if available. This is meant to solve the issue of cpufreq misreporting
speeds on CPUs that boot in a reduced power mode and have only relative
speed control.
2009-05-31 08:59:15 +00:00
Attilio Rao
faef64cc39 Remove the now invalid (and possibly unused) debug.mpsafevfs
sysctl/tunable.

Reviewed by:	emaste
Sponsored by:	Sandvine Incorporated
2009-05-30 23:52:23 +00:00
Edward Tomasz Napierala
c97fcdba57 Add VOP_ACCESSX, which can be used to query for newly added V*
permissions, such as VWRITE_ACL.  For filesystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.

Reviewed by:	rwatson@
2009-05-30 13:59:05 +00:00
Jamie Gritton
76ca6f88da Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex.  Jails may
have their own host information, or they may inherit it from the
parent/system.  The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL.  The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by:	bz (mentor)
2009-05-29 21:27:12 +00:00
Alan Cox
623469c996 Modify vm_hold_load_pages() to allocate pages using VM_ALLOC_NOOBJ rather
than using the kernel object.  This allows the elimination of page queues
locking from vm_hold_free_pages().
2009-05-29 18:35:51 +00:00
Robert Watson
1191932a65 Minor style tweak.
2009-05-29 14:25:51 +00:00
Robert Watson
82b7a39cb3 Since sched_pin() and sched_unpin() are already inlined, don't manually
inline in rmlocks.
2009-05-29 14:20:10 +00:00
John Baldwin
a571ad41ae Remove extra cpu_spinwait() invocations. This should really only be used
in tight spin loops, not in these edge cases where we restart a much
larger loop only a few times.

Reviewed by:	attilio
2009-05-29 14:03:34 +00:00
John Baldwin
fa29f0236f Tweak a few comments on adaptive spinning.
2009-05-29 13:56:34 +00:00
Robert Watson
1a109c1cb0 Make the rmlock(9) interface a bit more like the rwlock(9) interface:
- Add rm_init_flags() and accept extended options only for that variation.
- Add a flags space specifically for rm_init_flags(), rather than borrowing
  the lock_init() flag space.
- Define flag RM_RECURSE to use instead of LO_RECURSABLE.
- Define flag RM_NOWITNESS to allow an rmlock to be exempt from WITNESS
  checking; this wasn't possible previously as rm_init() always passed
  LO_WITNESS when initializing an rmlock's struct lock.
- Add RM_SYSINIT_FLAGS().
- Rename embedded mutex in rmlocks to make it more obvious what it is.
- Update consumers.
- Update man page.
2009-05-29 10:52:37 +00:00
Dag-Erling Smørgrav
32bf7cdf5a Let vfs_lookup() return ENOTDIR if the path has a trailing slash and
the last component is a symlink to something that isn't a directory.

We introduce a new namei flag, TRAILINGSLASH, which is set by lookup()
if the last component is followed by a slash.  The trailing slash is
then stripped, as before.  If the final component is a symlink,
lookup() will return to namei(), which will expand the symlink and
call lookup() with the new path.  When all symlinks have been
resolved, lookup() checks if the TRAILINGSLASH flag is set, and if it
is, and the vnode it ended up with is not a directory, it returns
ENOTDIR.

PR:		kern/21768
Submitted by:	Eygene Ryabinkin <rea-fbsd@codelabs.ru>
MFC after:	3 weeks
2009-05-29 10:02:44 +00:00
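The trailing-slash handling in 32bf7cdf5a can be approximated in userland. The sketch below is not the namei()/lookup() code: `SKETCH_TRAILINGSLASH`, `strip_trailing_slash()`, and `final_check()` are hypothetical, and symlink resolution is reduced to a boolean saying whether the final object turned out to be a directory.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_TRAILINGSLASH 0x1  /* hypothetical flag value */

/*
 * Strip trailing slashes from 'path' in place and report whether any
 * were removed.  A lone "/" is left intact.
 */
static int
strip_trailing_slash(char *path)
{
	size_t len;
	int flags = 0;

	assert(path != NULL);
	len = strlen(path);
	while (len > 1 && path[len - 1] == '/') {
		path[--len] = '\0';
		flags |= SKETCH_TRAILINGSLASH;
	}
	return (flags);
}

/*
 * After all symlinks are resolved: a path that had a trailing slash
 * must name a directory, otherwise the lookup fails with ENOTDIR.
 */
static int
final_check(int flags, bool is_dir)
{
	if ((flags & SKETCH_TRAILINGSLASH) != 0 && !is_dir)
		return (ENOTDIR);
	return (0);
}
```

Remembering the slash in a flag is what lets the check happen after symlink expansion, when the type of the final object is finally known.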
Dag-Erling Smørgrav
b181c8aac6 Fix misleading comment.
MFC after:	1 week
2009-05-29 09:52:13 +00:00
Attilio Rao
e31d083357 The patch for r193011 was partially rejected when applied, complete it.
2009-05-29 08:01:48 +00:00
Ed Schouten
c5e30cc02b Last minute TTY API change: remove mutex argument from tty_alloc().
I don't want people to override the mutex when allocating a TTY. It has
to be there, to keep drivers like syscons happy. So I'm creating a
tty_alloc_mutex() which can be used in those cases. tty_alloc_mutex()
should eventually be removed.

The advantage of this approach, is that we can just remove a function,
without breaking the regular API in the future.
2009-05-29 06:41:23 +00:00
Attilio Rao
1ae1c2a3bd Reverse the logic for ADAPTIVE_SX option and enable it by default.
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.

Additionally, implement adaptive spinning for sx locks held in shared mode.
The spinning limit can be adjusted through sysctls so that it can be tuned
before the code reaches a release, after which they should probably
be dropped.

This change has been made necessary by recent benchmarks where it does
improve concurrency of workloads in presence of high contention
(ie. ZFS).

KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.

Requested by:	jeff, kmacy
Reviewed by:	jeff
Tested by:	pho
2009-05-29 01:49:27 +00:00
Zachary Loafman
cfeb7489c2 fail(9) support:
Add support for kernel fault injection using KFAIL_POINT_* macros and
fail_point_* infrastructure. Add example fail point in vfs_bio.c to
simulate VM buf pressure.

Approved by:        dfr (mentor)
2009-05-27 16:36:54 +00:00