Commit Graph

11134 Commits

Author SHA1 Message Date
Konstantin Belousov
e401a6a54e Do not advance req->oldidx when sysctl_old_user returning an
error due to copyout failure or short buffer.

The later breaks the usermode iterators of the sysctl results that pack
arbitrary number of variable-sized structures. Iterator expects that
kernel filled exactly oldlen bytes, and tries to interpret half-filled
or garbage structure at the end of the buffer. In particular,
kinfo_getfile(3) segfaulted.

Reported and tested by:	pho
MFC after:	3 weeks
2009-05-14 10:54:57 +00:00
Jeff Roberson
bf422e5f27 - Implement a lockless file descriptor lookup algorithm in
fget_unlocked().
 - Save old file descriptor tables created on expansion until
   the entire descriptor table is freed so that pointers may be
   followed without regard for expanders.
 - Mark the file zone as NOFREE so we may attempt to reference
   potentially freed files.
 - Convert several fget_locked() users to fget_unlocked().  This
   requires us to manage reference counts explicitly but reduces
   locking overhead in the common case.
2009-05-14 03:24:22 +00:00
Alan Cox
1c1b26f276 Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does.  Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies.  Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by:	tegge
2009-05-13 05:39:39 +00:00
Edward Tomasz Napierala
e5023dd9f6 Add missing 'break' statement.
Found with:	Coverity Prevent(tm)
CID:		3919
2009-05-12 17:05:40 +00:00
Konstantin Belousov
3b616faed5 Prevent overflow of uio_resid.
Noted by:	jhb
MFC after:	3 days
2009-05-11 19:58:03 +00:00
Attilio Rao
22d7ae67d4 Fix a kernel compilation error, introduced after r191990, by defining
thread with curthread in the AUDIT case.

Reported by:	dchagin
2009-05-11 16:32:58 +00:00
Attilio Rao
dfd233edd5 Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00
Alan Cox
c3d3fe6314 Revert CVS revision 1.94 (svn r16840). Current pmap implementations don't
suffer from the race condition that motivated revision 1.94.  Consequently,
the work-around that was implemented by revision 1.94 is no longer needed.
Moreover, reverting this work-around eliminates the need for
vfs_busy_pages() to acquire the page queues lock when preparing a buffer
for read.

Reviewed by:	tegge
2009-05-11 05:16:57 +00:00
Warner Losh
7cddab635b Spell NULL properly, use (void) rather than () for functions with no
parameters.  Mark two items as static that aren't used elsewhere...
2009-05-09 19:08:22 +00:00
Warner Losh
e678f09a15 Retire kern.vm.kmem.size. It was marked as obsolete prior to 5.2, so
it can go.
2009-05-09 19:00:47 +00:00
Alexander Kabaev
5679fe1957 Do not embed struct ucred into larger netcred parent structures.
Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use separate
reference-counted ucred allocated separately instead.

While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.

Reported by:	John Hickey
PR:		kern/133439
Reviewed by:	dfr, kib
2009-05-09 18:09:17 +00:00
Marko Zec
2114e063f0 A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by:	bz
Approved by:	julian (mentor) (an earlier version of the diff)
2009-05-08 14:34:25 +00:00
Marko Zec
29b02909eb Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg.  Struct vprocg is likely to become replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-05-08 14:11:06 +00:00
Jamie Gritton
7ae27ff49f Move the per-prison Linux MIB from a private one-off pointer to the new
OSD-based jail extensions.  This allows the Linux MIB to accessed via
jail_set and jail_get, and serves as a demonstration of adding jail support
to a module.

Reviewed by:	dchagin, kib
Approved by:	bz (mentor)
2009-05-07 18:36:47 +00:00
Konstantin Belousov
41b72e6e50 Eliminate the loop and the call to pause(9) in vfs_vget_ino(). If
vfs_busy(MBF_NOWAIT) failed, unlock the vnode and sleep in vfs_busy().

Suggested and reviewed by:	jeff
Tested by:	pho
MFC after:	3 weeks
2009-05-07 18:14:21 +00:00
Ed Schouten
14358b0fec If we have a regular rint handler, never go into rint_bypass mode.
It turns out if we called cfmakeraw() on a TTY with only a rint handler
in place, it could inject data into the TTY, even though it should be
redirected. Always take a look at the hooks before looking at the
termios flags.
2009-05-07 17:39:23 +00:00
Marko Zec
21ca7b57bd Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
Jamie Gritton
49939083a0 Add a constant PR_MAXMETHOD to better define the jail/OSD interface.
Reviewed by:	dchagin, kib
Approved by:	bz (mentor)
2009-05-05 05:49:08 +00:00
Ed Schouten
3382ac3233 Remove unneeded check for SESS_LEADER().
We perform the same check ~10 lines above.
2009-05-04 11:11:10 +00:00
Jamie Gritton
3dd4fac97c Don't call the OSD destructor if the data slot is NULL
(since it's already not done on unused slots, which are indistinguishable
to the caller).

Approved by:	bz (mentor)
2009-04-30 22:43:21 +00:00
Marko Zec
f6dfe47a14 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
Jeff Roberson
09c8a4cc21 - Fix non-SMP build by encapsulating idle spin logic in a macro.
Pointy hat to:	me
2009-04-29 23:04:31 +00:00
Jamie Gritton
fe2f3c651f Regen for new jail system calls in r191673.
Approved by:	bz (mentor)
2009-04-29 21:50:13 +00:00
Jamie Gritton
b38ff370e4 Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2).  Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
  This replaces jail(2).
* jail_get, to read the parameters of existing jails.  This replaces the
  security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes.  The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by:	bz (mentor)
2009-04-29 21:14:15 +00:00
Bruce M Simpson
33cde13046 Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev.  Summary of changes:

 * Connect netinet6/in6_mcast.c to build.
   The legacy KAME KPIs are mostly preserved.
 * Eliminate now dead code from ip6_output.c.
   Don't do mbuf bingo, we are not going to do RFC 2292 style
   CMSG tricks for multicast options as they are not required
   by any current IPv6 normative reference.
 * Refactor transports (UDP, raw_ip6) to do own mcast filtering.
   SCTP, TCP unaffected by this change.
 * Add ip6_msource, in6_msource structs to in6_var.h.
 * Hookup mld_ifinfo state to in6_ifextra, allocate from
   domifattach path.
 * Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
   Kernel consumers which need this should use in6m_lookup().
 * Refactor IPv6 socket group memberships to use a vector (like IPv4).
 * Update ifmcstat(8) for IPv6 SSM.
 * Add witness lock order for IN6_MULTI_LOCK.
 * Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
 * Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
 * Update carp(4) for new IPv6 SSM KPIs.
 * Virtualize ip6_mrouter socket.
   Changes mostly localized to IPv6 MROUTING.
 * Don't do a local group lookup in MROUTING.
 * Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
 * Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
 * Bump __FreeBSD_version to 800084.
 * Update UPDATING.

NOTE WELL:
 * This code hasn't been tested against real MLDv2 queriers
   (yet), although the on-wire protocol has been verified in Wireshark.
 * There are a few unresolved issues in the socket layer APIs to
   do with scope ID propagation.
 * There is a LOR present in ip6_output()'s use of
   in6_setscope() which needs to be resolved. See comments in mld6.c.
   This is believed to be benign and can't be avoided for the moment
   without re-introducing an indirect netisr.

This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
2009-04-29 19:19:13 +00:00
Jamie Gritton
af7bd9a4f4 Some non-functional changes: whitespace, KASSERT strings, declaration order.
Approved by:	bz (mentor)
2009-04-29 18:41:08 +00:00
Jeff Roberson
113dda8a7c - Fix the FBSDID line. 2009-04-29 03:26:30 +00:00
Jeff Roberson
7b55ab0534 - Remove the bogus idle thread state code. This may have a race in it
and it only optimized out an ipi or mwait in very few cases.
 - Skip the adaptive idle code when running on SMT or HTT cores.  This
   just wastes cpu time that could be used on a busy thread on the same
   core.
 - Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive.  Re-use
   CG_FLAG_THREAD to mean SMT or HTT.

Sponsored by:   Nokia
2009-04-29 03:15:43 +00:00
Bjoern A. Zeeb
6aaa0b3cf1 Prevent a superuser inside a jail from modifying the dedicated
root cpuset of that jail.
Processes inside the jail will still be able to change child sets.
A superuser outside of a jail will still be able to change the jail cpuset
and thus limit the number of cpus available to the jail.

Problem reported by: 000.fbsd@quip.cz (Miroslav Lachman)
PR:		kern/134050
Reviewed by:	jeff
MFC after:	3 weeks
X-MFC:		backout r191596
2009-04-28 21:00:50 +00:00
Robert Watson
d02add54ea Improve approximation of style(9). 2009-04-26 21:16:03 +00:00
Marko Zec
5624194730 Extend the vnet module registration / initialization framework
first introduced @ r190909 with a vnet module deregistration
service.

kldunloadable modules, which are currently using vnet_mod_register()
to attach their per-vnet initialization routines to the vnet
initialization framework, should call vnet_mod_deregister() before
acknowledging MOD_UNLOAD requests in their mod_event handlers.  Such
changes to the existing code base will follow in subsequent commits.

vnet_mod_deregister() does not check whether departing vnet modules
are registered as prerequisites for another module(s), so it should
be used with care.  Currently I'm only aware of vnet modules which
are leafs on module dependency graphs that are kldunloadable.

This change also introduces per-vnet module destructor handler, which
calls vnet's module cleanup function, which (if required) has to be
registered in vnet module's vnet_modinfo_t structure .vmi_idetach
field.  Once options VIMAGE becomes operational, the framework will
take care that module's cleanup function become invoked for each
active vnet instance, and that the memory allocated for each instance
gets freed.  Currently calls to destructor handlers must always
succeed.
2009-04-26 07:09:39 +00:00
Ed Schouten
ccfd3aab30 Turn MAXPTSDEVS into a sysctl tunable.
This allows users to increase the maximum amount of pseudo-terminals
without changing any source code. Users must increase UT_LINESIZE before
attempting to increase kern.pts_maxdev.
2009-04-25 10:05:55 +00:00
Bjoern A. Zeeb
47479a8ceb Correct a comment: the function name given had never existed in any
(relevant) version of this file orany of my patches.

MFC after:	1 month
2009-04-22 20:49:54 +00:00
Maksim Yevmenkin
e72a94adc3 Fix sbappendrecord_locked().
The main problem is that sbappendrecord_locked() relies on sbcompress()
to set sb_mbtail. This will not happen if sbappendrecord_locked() is
called with mbuf chain made of exactly one mbuf (i.e. m0->m_next == NULL).
In this case sbcompress() will be called with m == NULL and will do
nothing. I'm not entirely sure if m == NULL is a valid argument for
sbcompress(), and, it rather pointless to call it like that, but keep
calling it so it can do SBLASTMBUFCHK().

The problem is triggered by the SOCKBUF_DEBUG kernel option that
enables SBLASTRECORDCHK() and SBLASTMBUFCHK() checks.

PR:			kern/126742
Investigated by:	pluknet < pluknet -at- gmail -dot- com >
No response from:	freebsd-current@, freebsd-bluetooth@
MFC after:		3 days
2009-04-21 19:14:13 +00:00
Konstantin Belousov
6fae832ad7 Fix typo.
Noted by:	jhb
MFC after:	2 weeks
2009-04-20 15:10:03 +00:00
Konstantin Belousov
007abb3d0f On the exit of the child process which parent either set SA_NOCLDWAIT
or ignored SIGCHLD, unconditionally wake up the parent instead of doing
this only when the child is a last child.

This brings us in line with other U**xes that support SA_NOCLDWAIT. If
the parent called waitpid(childpid), then exit of the child should wake
up the parent immediately instead of forcing it to wait for all children
to exit.

Reported by:	Alan Ferrency <alan pair com>
Submitted by:	Jilles Tjoelker <jilles stack nl>
PR:	108390
MFC after:	2 weeks
2009-04-20 14:34:55 +00:00
Robert Watson
e5a9a8ead8 Lock the interface address list while iterating a network interface's
address list when searching for a link-layer address to use during uuid
generation.

MFC after:	2 weeks
2009-04-19 21:36:18 +00:00
Robert Watson
bb1c7df80f struct malloc_type has had a 'magic' field statically initialized to
M_MAGIC by MALLOC_DEFINE() for a long time; add assertions that
malloc_type's passed to malloc(), free(), etc have that magic set.

MFC after:	2 weeks
2009-04-19 12:41:37 +00:00
Edward Tomasz Napierala
e0ee758989 When allocating 'struct acl' instances, use malloc(9) instead of uma(9).
This struct will get much bigger soon, and we don't want to waste too much
memory on UMA caches.

Reviewed by:	rwatson
2009-04-19 09:56:30 +00:00
Edward Tomasz Napierala
b998d381f2 Use acl_alloc() and acl_free() instead of using uma(9) directly.
This will make switching to malloc(9) easier; also, it would be
neccessary to add these routines if/when we implement variable-size
ACLs.
2009-04-18 16:47:33 +00:00
Alexander Kabaev
8aeb69d0f2 Undo private changes that should never have been committed. 2009-04-17 18:34:11 +00:00
Alexander Kabaev
348496ad39 More fallout from negative dotdot caching. Negative entries should
be removed from and reinserted to proper ncneg list.

Reported by:  pho
Submitted by: kib
2009-04-17 18:11:11 +00:00
Konstantin Belousov
28a1b4eb37 In flushbufqueues(), do not allocate sentinel buffer on the stack,
struct buf is large. Use sleeping malloc(9) call, and zero the allocated
buf as a debugging feature.
2009-04-16 09:37:48 +00:00
Konstantin Belousov
949af70942 Export the number of times bufdaemon got help from the normal threads. 2009-04-16 09:33:52 +00:00
Ed Schouten
6672361085 Remove dead code from devtoname().
In the good old days it was possible to have dev_t's that referred to
nonexistent devices. In these cases devtoname() automatically generated
names. This is no longer possible, so remove this dead code.

Discussed with:	kib
2009-04-15 20:43:12 +00:00
Ed Schouten
bce79dbb29 Remove unneeded variable and casting from newdev().
Remove the `udev' variable, which has a different type than the original
function argument and si_drv0. The `udev' name is also misleading,
because it is not the number returned by dev2udev(). Rename this
argument to `unit'. It is the same number as returned by dev2unit().
2009-04-15 20:15:36 +00:00
Ed Schouten
d7cbfc1b18 Don't use si_drv0 directly.
We should still access si_drv0 using dev2unit(). Also change the
KASSERT() to really print the udev instead of the unit number. I suspect
it's still useful to print the unit number, especially for devices that
use clone lists, so keep the unit number in the panic string.
2009-04-15 20:08:26 +00:00
John Baldwin
3f11530b79 Update comment above _fget() for earlier change to FWRITE failures return
EBADF rather than EINVAL.

Submitted by:	Jaakko Heinonen  jh saunalahti fi
MFC after:	1 month
2009-04-15 19:10:37 +00:00
Alexander Kabaev
9cf6772211 Redo previous change using simpler patch that happens to be also
more correct.

Submitted by: tor
2009-04-14 23:56:48 +00:00
Alexander Kabaev
eed8a9edba Fix yet another negative dotodot entry fallout.
Reported by: pho
2009-04-14 23:46:57 +00:00
Kip Macy
5e6a926611 - use a shared lock for reads
- remove stale comment

Reviewed by:	jeffr
2009-04-13 23:09:44 +00:00
David Xu
945488297b Make UMTX_OP_WAIT_UINT actually wait for an unsigned integer on 64-bits
machine.

MFC after: 1 week
2009-04-13 05:21:17 +00:00
Kip Macy
f0b9868d3a sendfile doesn't modify the vnode - acquire vnode lock shared
Reviewed by:	ups, jeffr
2009-04-12 05:19:35 +00:00
Robert Watson
89f28b1b86 Remove conditionally compiled time counter statistics; tools like
DTrace, kernel profiling, etc, can provide this information without
the overhead.

MFC after:	3 days
Suggested by:	bde
2009-04-11 22:01:40 +00:00
Alexander Kabaev
9d75482f99 Fix v_cache_dd handling for negative entries. v_cache_dd pointer was
not populated in parent directory if negative entry was being
created, yet entry itself was added to the nc_neg list. It was
possible for parent vnode to get discarded later, leaving negative
entry pointing to now unused memory block.

Reported by:	dho
Revewed by:	kib
2009-04-11 20:23:08 +00:00
Konstantin Belousov
fd409594c6 When zapping v_cache_dd for !MAKEENTRY case in cache_lookup(), we shall
lock cache as writer.

Reviewed by:	kan
2009-04-11 16:12:20 +00:00
Marko Zec
bfe1aba468 Introduce vnet module registration / initialization framework with
dependency tracking and ordering enforcement.

With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized.  In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.

The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw).  Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.

The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module.  The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.

For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.

A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on.  Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first.  The framework will panic if it detects any unresolved
dependencies before completing system initialization.  Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.

Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run.  In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container.  IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.

The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-04-11 05:58:58 +00:00
Robert Watson
885868cd8f Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by:    jeff
Reviewed by:    jeff
Discussed with: rmacklem, zach loafman @ isilon
2009-04-10 10:52:19 +00:00
Konstantin Belousov
3f54086eba Cache_lookup() for DOTDOT drops dvp vnode lock, allowing dvp to be reclaimed.
Check the condition and return ENOENT then.

In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused
by dvp reclaim.

Reported and tested by:	pho
2009-04-10 10:22:44 +00:00
Andrew Thompson
853a10a581 Revert r190676,190677
The geom and CAM changes for root_hold are the wrong solution for USB design
quirks.

Requested by:	scottl
2009-04-10 04:08:34 +00:00
Ed Schouten
e3b0b98073 Fix tty_wait_background() to comply with standards.
It turns out my handling of SIGTTOU and SIGTTIN didn't entirely comply
to the standards. It is true that in the SIGTTOU case we should not
return EIO when the signal is ignored/blocked, but in the SIGTTIN case
we must.

See also:	POSIX issue 7 section 11.1.4
2009-04-08 15:56:50 +00:00
Robert Watson
5d5c174869 Nul-terminate strings in the VFS name cache, which negligibly change
the size and cost of name cache entries, but make adding debugging
and tracing easier.

Add SDT DTrace probes for various namecache events:

  vfs:namecache:enter:done - new entry in the name cache, passed parent
    directory vnode pointer, name added to the cache, and child vnode
    pointer.

  vfs:namecache:enter_negative:done - new negative entry in the name cache,
    passed parent vnode pointer, name added to the cache.

  vfs:namecache:fullpath:enter - call to vn_fullpath1() is made, passed
    the vnode to resolve to a name.

  vfs:namecache:fullpath:hit - vn_fullpath1() successfully resolved a
    search for the parent of an object using the namecache, passed the
    discovered parent directory vnode pointer, name, and child vnode
    pointer.

  vfs:namecache:fullpath:miss - vn_fullpath1() failed to resolve a search
    for the parent of an object using the namecache, passed the child
    vnode pointer.

  vfs:namecache:fullpath:return - vn_fullpath1() has completed, passed the
    error number, and if that is zero, the vnode to resolve, and the
    returned path.

  vfs:namecache:lookup:hit - postive name cache entry hit, passed the
    parent directory vnode pointer, name, and child vnode pointer.

  vfs:namecache:lookup:hit_negative - negative name cache entry hit,
    passed the parent directory vnode pointer and name.

  vfs:namecache:lookup:miss - name cache miss, passed the parent directory
    pointer and the full remaining component name (not terminated after the
    cache miss component).

  vfs:namecache:purge:done - name cache purge for a vnode, passed the vnode
    pointer to purge.

  vfs:namecache:purge_negative:done - name cache purge of negative entries
    for children of a vnode, passed the vnode pointer to purge.

  vfs:namecache:purgevfs - name cache purge for a mountpoint, passed the
    mount pointer.  Separate probes will also be invoked for each cache
    entry zapped.

  vfs:namecache:zap:done - name cache entry zapped, passed the parent
    directory vnode pointer, name, and child vnode pointer.

  vfs:namecache:zap_negative:done - negative name cache entry zapped,
    passed the parent directory vnode pointer and name.

For any probes involving an extant name cache entry (enter, hit, zapp),
we use the nul-terminated string for the name component.  For misses,
the remainder of the path, including later components, is provided as
an argument instead since there is no handy nul-terminated version of
the string around.  This is arguably a bug.

MFC after:      1 month
Sponsored by:   Google, Inc.
Reviewed by:	jhb, kan, kib (earlier version)
2009-04-07 20:58:56 +00:00
Robert Watson
4b4e58badf Add SDT DTrace probes for namei():
vfs:namei:lookup:entry takes parent directory vnode pointer, path to
    look up, and lookup flags.
  vfs:namei:lookup:return takes an error value, and if successful, the
    returned vnode pointer.

MFC after:      1 month
2009-04-06 10:32:40 +00:00
Dmitry Chagin
cd899aad76 Fix KBI breakage by r190520 which affects older linux.ko binaries:
1) Move the new field (brand_note) to the end of the Brandinfo structure.
2) Add a new flag BI_BRAND_NOTE that indicates that the brand_note pointer
   is valid.
3) Use the brand_note field if the flag BI_BRAND_NOTE is set and as old
   modules won't have the flag set, so the new field brand_note would be
   ignored.

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	kib (mentor)
MFC after:	6 days
2009-04-05 09:27:19 +00:00
Alexander Kabaev
bb6418cbe3 Revert change 190655 temporarily. It breaks many setups where nullfs is
used and needs to be revisited.
2009-04-04 17:48:38 +00:00
Marcel Moolenaar
27457a80e2 PowerPC, meet kernel core dumps. The support is based
on a generic dumper that creates an ELF core file and
uses PMAP functions to scan and iterate over memory
chunks, as well as handle memory mappings used during
dumping.
the PMAP layer can choose to return physical memory
chunks or virtual memory chunks. For minidumps, the
chunks should be virtual.

The default MMU I/F implementation for the scan_md()
method returns NULL. Thus, when a PMAP implementation
does not implement the required methods, an empty
core file is created. Here, empty means having an ELF
header only.

Obtained from:	Juniper Networks
2009-04-04 02:12:37 +00:00
Andrew Thompson
626fc9fe3d Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called
in situations where sleeping isnt allowed.
2009-04-03 19:46:12 +00:00
Peter Wemm
0e875ecafe vn_vptocnp() unlocks the name cache and forgets to re-lock it before
returning in one error case, and mistakenly unlocks it for the
umount -f case.
2009-04-02 21:16:20 +00:00
Christian Brueffer
1fa80eb15c Fix memory leak in semunload().
PR:		133064
Submitted by:	Mateusz Guzik <mjguzik@gmail.com>
MFC after:	1 week
2009-03-30 15:01:29 +00:00
Andrew Thompson
46b70f07bc Further rate limit the root wait status, it will be printed once per
root_mount_rel() wakeup.
2009-03-30 05:57:55 +00:00
Alexander Kabaev
607fc40b04 Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stoppped being type stable.

Reviewed by:	kib
2009-03-29 21:25:40 +00:00
Jamie Gritton
8571af59e5 Whitespace/spelling fixes in advance of upcoming functional changes.
Approved by:	bz (mentor)
2009-03-27 13:13:59 +00:00
Andrew Thompson
d24d45d9a9 Skip the allocation of the root hold token if the mount already happened. 2009-03-27 03:52:08 +00:00
John Baldwin
2401c73637 When looking up the parent devclass of a new devclass, create the parent
devclass if it doesn't already exist.
2009-03-25 17:02:05 +00:00
John Baldwin
049ce0934f When a file lookup fails due to encountering a doomed vnode from a forced
unmount, consistently return ENOENT rather than EBADF.

Reviewed by:	kib
MFC after:	1 month
2009-03-24 18:16:42 +00:00
Jung-uk Kim
eae44ae03d Clean up MI inittodr(9) and kill noop code.
It was derived from i386 version long ago but never resync'ed again.
Originally, i386 version compared the current time from realtime clock
with time_second (which was just `time' in the old days).  When this MI
version was written, it was wrongly compared against `base' AND never
used because of a bug (typo?) in the code.  This check was killed
in i386 version when home-rolled calendaric calculation was removed.
Now, we just remove the code here as well to make the code simpler.
2009-03-23 21:16:21 +00:00
John Baldwin
9b84ba1cbb Improve the description of a few sysctls.
Submitted by:	bde (partially)
MFC after:	3 days
2009-03-23 20:18:06 +00:00
Alexander Kabaev
9999864a87 Add safety check that does not allow empty strings to be queued
to the devctl notification queue. Empty strings cause devctl read
call to return 0 and result in devd exiting prematurely.

The actual offender (ugen notes for root hubs) will be fixed
by separate commit.
2009-03-23 01:13:34 +00:00
Colin Percival
3f935cf342 Correctly sanity-check timer IDs. [SA-09:06]
Limit the size of malloced buffer when dumping environment
variables. [EN-09:01]

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-09:06.ktimer
Errata:		FreeBSD-EN-09:01.kenv
2009-03-23 00:00:50 +00:00
Konstantin Belousov
267c52fc98 Fix several issues with parsing the notes for ELF objects.
Badly formed ELF note may cause the caclulated pointer to the next note
to point both after the note region, that was checked in the code, but
also to point before the region, that was not checked [1]. Remember the
first note location in note0 and leap out if the note is not between
note0 and note_end.

In the similar way, badly formed note may cause infinite loop by
pointing next note into the same or previous note. Guard against this by
limiting amount of loop iterations by arbitrary choosen big number.

For clarity, check the calculated note alignment in each iteration.

Reported by:	Chris Palmer <chris noncombatant org> [1]
PR:	kern/132886
Reviewed and tested by:	dchagin
MFC after:	3 days
2009-03-22 13:42:41 +00:00
Konstantin Belousov
15fb32c07d Do not underflow the buffer and then report the problem. Check for the
condition before the buffer write.
Also, since buflen is unsigned, previous check was ignored.

Reviewed by:	marcus
Tested by:	pho
2009-03-20 11:08:57 +00:00
Konstantin Belousov
83817ce3b1 Remove unneeded braces to reduce used vertical screen space.
The location was missed in r190140.
2009-03-20 11:03:55 +00:00
Konstantin Belousov
9194007261 Do not forget to adjust buflen for the first resolution of the path
from namecache.
While there, compare pointers for equiality.

Reviewed by:	marcus
Tested by:	pho
2009-03-20 11:00:39 +00:00
Konstantin Belousov
065fc451f8 The nc_nlen member of the struct namecache contains the length of the cached
name, not the length + 1.

PR:	132620, 132542
Reported by:	bf2006a yahoo com
Tested by:	bf2006a, pho
Reviewed by:	marcus
2009-03-20 10:59:06 +00:00
Konstantin Belousov
c4a8c2ee24 When ktracing namei operations, log a result of the __getcwd().
MFC after:	1 week
2009-03-20 10:47:16 +00:00
Konstantin Belousov
bf5c835e1c Remove unneeded braces to reduce used vertical screen space. 2009-03-20 10:04:00 +00:00
Attilio Rao
76ed3c71f1 Fix an old-standing bug that crept in along the several revisions:
B_DELWRI cleanup and vnode disassociation should happen just before to
assign the buffer to a queue.

Reported by:	miwi, Volker <volker at vwsoft dot com>,
		Ben Kaduk <minimarmot at gmail dot com>,
		Christopher Mallon <christoph dot mallon at gmx dot de>
Tested by:	lulf, miwi
2009-03-17 16:30:49 +00:00
Konstantin Belousov
3ff063577b Supply AT_EXECPATH auxinfo entry to the interpreter, both for native and
compat32 binaries.

Tested by:	pho
Reviewed by:	kan
2009-03-17 12:53:28 +00:00
Konstantin Belousov
429f5a589b Use the properly sized types for ELF object header and program headers.
This fixes osrel fetching from the FreeBSD branding note for the 64bit
platforms.

Reported by:	swell.k gmail com
Reviewed by:	dchagin
Tested by:	dchagin, swell.k gmail com
2009-03-17 09:50:40 +00:00
Jung-uk Kim
c66d2b38c8 Initial suspend/resume support for amd64.
This code is heavily inspired by Takanori Watanabe's experimental SMP patch
for i386 and large portion was shamelessly cut and pasted from Peter Wemm's
AP boot code.
2009-03-17 00:48:11 +00:00
Konstantin Belousov
c1d8b5e82c Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to buffers shortage, requesting the buffer deadlock bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf().  The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget.  Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.

In collaboration with:	pho
Reviewed by:	 tegge (previous version)
Tested by:	 glebius, yandex ...
MFC after:	 3 weeks
2009-03-16 15:39:46 +00:00
Robert Watson
e5adda3d51 Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free.  This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile.  They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

        if_ar
        if_axe
        if_aue
        if_cdce
        if_cue
        if_kue
        if_ray
        if_rue
        if_rum
        if_sr
        if_udav
        if_ural
        if_zyd

Drivers that were already disabled because of tty changes:

        if_ppp
        if_sl

Discussed on:	arch@
2009-03-15 14:21:05 +00:00
Jeff Roberson
1723a06485 - Wrap lock profiling state variables in #ifdef LOCK_PROFILING blocks. 2009-03-15 08:03:54 +00:00
Jeff Roberson
2e6b8de462 - Implement a new mechanism for resetting lock profiling. We now
guarantee that all cpus have acknowledged the cleared enable int by
   scheduling the resetting thread on each cpu in succession.  Since all
   lock profiling happens within a critical section this guarantees that
   all cpus have left lock profiling before we clear the datastructures.
 - Assert that the per-thread queue of locks lock profiling is aware of
   is clear on thread exit.  There were several cases where this was not
   true that slows lock profiling and leaks information.
 - Remove all objects from all lists before clearing any per-cpu
   information in reset.  Lock profiling objects can migrate between
   per-cpu caches and previously these migrated objects could be zero'd
   before they'd been removed

Discussed with:	attilio
Sponsored by:	Nokia
2009-03-15 06:41:47 +00:00
Jeff Roberson
d3df4af368 - When a mutex is destroyed while locked we need to inform lock profiling
that it has been released.
2009-03-14 11:43:38 +00:00
Jeff Roberson
04a2868980 - Call lock_profile_release when we're transitioning a lock to be owned by
LK_KERNPROC.

Discussed with:		attilio
2009-03-14 11:43:02 +00:00
Jeff Roberson
53a6c8b3ac - Fix an error that occurs when mp_ncpu is an odd number. steal_thresh
is calculated as 0 which causes errors elsewhere.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>

 - When sched_affinity() is called with a thread that is not curthread we
   need to handle the ON_RUNQ() case by adding the thread to the correct
   run queue.

Submitted by:	Justin Teller <justin.teller@gmail.com>

MFC after:	1 Week
2009-03-14 11:41:36 +00:00
Dmitry Chagin
32c01de21c Implement new way of branding ELF binaries by looking to a
".note.ABI-tag" section.

The search order of a brand is changed, now first of all the
".note.ABI-tag" is looked through.

Move code which fetch osreldate for ELF binary to check_note() handler.

PR:		118473
Approved by:	kib (mentor)
2009-03-13 16:40:51 +00:00
David Xu
326bf9493d 1) Check NULL pointer before calling umtx_pi_adjust_locked(), this avoids
a PANIC.
2) Rework locking for POSIX priority-mutex, this fixes a
   race where a thread may wait there forever even if the mutex is unlocked.
2009-03-13 06:06:20 +00:00
John Baldwin
42dd14bada Change the sysctls for maxbcache and maxswzone from int to long. I missed
this earlier since these sysctls don't exist in 7.x yet.
2009-03-12 17:23:02 +00:00