long mbuf chain into an arbitrary large uio in a single step.
It is a functional mirror image of m_uiotombuf().
This function is supposed to be used instead of hand rolled code
with the same purpose and to concentrate it into one place for
potential further optimization or hardware assistance.
chains) to pure data mbufs using m_demote(). This removes the
packet header and all m_tag information as they are not meaningful
anymore on a stream socket where mbufs are linked through m->m_next.
Strictly speaking a packet header can be only ever valid on the first
mbuf in an m_next chain.
sbcompress() was doing this already when the mbuf chain layout lent
itself to it (e.g. header splitting or merge-append), just not
consistently.
This frees resources at socket buffer append time instead of at
sbdrop_internal() time after data has been read from the socket.
For MAC the per packet information has done its duty and during
socket buffer appending the policy of the socket itself takes over.
With the append the packet boundaries disappear naturally and with
it any context that was based on it. None of the residual information
from mbuf headers in the socket buffer on stream sockets was looked at.
- remove HT_HEADER test (MT_HEADER == MT_DATA for some time now)
- be more pedantic about m_nextpkt in other than first mbuf
- update m_flags to be retained
vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.
This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.
Reviewed by: rwatson
vn_open_cred invocations shall not audit namei path.
In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.
Reported, reviewed and tested by: rwatson
Now that we start to use credentials on character devices more often
(because of MPSAFE TTY), move the prison-checks that are in place in the
TTY code into devfs.
Instead of strictly comparing the prisons, use the more common
prison_check() function to compare credentials. This means that
pseudo-terminals are only visible in devfs by processes within the same
jail and parent jails.
Even though regular users in parent jails can now interact with
pseudo-terminals from child jails, this seems to be the right approach.
These processes are also capable of interacting with the jailed
processes anyway, through signals for example.
Reviewed by: kib, rwatson (older version)
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.
Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
Instead of locking the local unp followed by the remote unp, use the same
locking model as accept() and read lock the global link lock followed by
the remote unp while fetching the remote sockaddr.
Reported by: Mel Flynn mel.flynn of mailing.thruhere.net
Reviewed by: rwatson
MFC after: 1 week
allocator for the jumbo frames zones. This change has two benefits: (1) a
custom back-end deallocator is no longer required. UMA's standard
deallocator suffices. (2) It eliminates a potentially confusing artifact
of using contigmalloc(): The malloc(9) statistics contain bogus information
about the usage of jumbo frames. Specifically, the malloc(9) statistics
report all jumbo frames in use whereas the UMA zone statistics report the
"truth" about the number in use vs. the number free.
in the type field of system call tables. Specifically, one can now use
the 'NO*' types as flags in addition to the 'COMPAT*' types. For example,
to tag 'COMPAT*' system calls as living in a KLD via NOSTD. The COMPAT*
type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
makesyscalls.sh that return true if a requested flag is found in the
type field ($3). The flag() function checks all of the flags in the
field, but type() only checks the first flag. type() is meant to be
used in the top-level "switch" statement and flag() should be used
otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
- Mark nfsclnt as UNIMPL. It should have been NOSTD instead of NOIMPL back
when it lived in nfsclient.ko, but it was removed from that a long time
ago.
Actually, as it did receive few tuning, the support is disabled by
default, but it can opt-in with the option ADAPTIVE_LOCKMGRS.
Due to the nature of lockmgrs, adaptive spinning needs to be
selectively enabled for any interested lockmgr.
The support is bi-directional, or, in other ways, it will work in both
cases if the lock is held in read or write way. In particular, the
read path is passible of further tunning using the sysctls
debug.lockmgr.retries and debug.lockmgr.loops . Ideally, such sysctls
should be axed or compiled out before release.
Addictionally note that adaptive spinning doesn't cope well with
LK_SLEEPFAIL. The reason is that many (and probabilly all) consumers
of LK_SLEEPFAIL are mainly interested in knowing if the interlock was
dropped or not in order to reacquire it and re-test initial conditions.
This directly interacts with adaptive spinning because lockmgr needs
to drop the interlock while spinning in order to avoid a deadlock
(further details in the comments inside the patch).
Final note: finding someone willing to help on tuning this with
relevant workloads would be either very important and appreciated.
Tested by: jeff, pho
Requested by: many
The code that was in place in exit1() was mainly based on code from the
old TTY layer. The main reason behind this, was because at one moment I
ran a system that had two TTY layers in place at the same time. It is
now sufficient to do the following:
- Remove references from the session structure to the TTY vnode and the
session leader.
- If we have a controlling TTY and the session used by the TTY is equal
to our session, send the SIGHUP.
- If we have a vnode to the controlling TTY which has not been revoked,
revoke it.
While there, change sys/kern/tty.c to use s_ttyp in the comparison
instead of s_ttyvp. It should not make any difference, because s_ttyvp
can only become null when the session leader already left, but it's
nicer to compare against the proper value.
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.
Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.
Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks
Right now the only way to make tcsetsid(3)/TIOCSCTTY work, is by
ensuring the session leader is dead. This means that an application that
catches SIGHUPs and performs a sleep prevents us from assigning a new
session leader.
Change the code to make it work on revoked TTYs as well. This allows us
to change init(8) to make the shutdown script run in a more clean
environment.
parameter "vnet" when it is created, a new vnet instance will be created
along with the jail. Networks interfaces can be moved between prisons
with an ioctl similar to the one that moves them between vimages.
For now vnets will co-exist under both jails and vimages, but soon
struct vimage will be going away.
Reviewed by: zec, julian
Approved by: bz (mentor)
There is an external use in the opensolaris code.
I am not sure how this ever worked but I have seen two reports of:
link_elf: symbol hardlink_check_uid undefined
lately.
Reported by: Scott Ullrich (sullrich gmail.com), pfsense
Reported by: Mister Olli (mister.olli googlemail.com)
Because our rc scripts also open the /etc/ttyv* nodes, it revokes the
console, preventing startup messages from being displayed.
I really have to think about this. Maybe we should just give the console
its own TTY and let it build on top of other TTYs. I'm still not sure
what to do with input handling there.
Even though I thought I fixed the staircase issue (and I was no longer
able to reproduce it), I got some reports of the issue still being
there. It turns out the staircase effect still occurred when
/dev/console was kept open while killing the getty on the same TTY
(ttyv0).
For some reason I can't figure out how the old TTY code dealt with that,
so I assume the issue has always been there. I only exposed it more by
merging consolectl with ttyv0, which means that the issue was present,
even on systems without a serial console.
I'm now marking the console device as being closed when closing the
regular TTY device node. This means that when the getty shuts down,
init(8) will open /dev/console, which means the termios attributes will
always be reset in this case.
in symtab_get method symtab parameter is made constant as this reflects
actual intention and usage of the method
Reviewed by: imp, current@
Approved by: jhb (mentor)
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.
Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.
Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.
Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.
Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland
So properly hide the already #ifdef SCTP code with
#if defined(INET) || defined(INET6) as well to get us
closer to a non-INET/INET6 kernel.
Discussed with: tuexen [1]
probe. The current device order is unchanged. This commit just adds the
infrastructure and ABI changes so that it is easier to merge later changes
into 8.x.
- Driver attachments now have an associated pass level. Attachments are
not allowed to probe or attach to drivers until the system-wide pass level
is >= the attachment's pass level. By default driver attachments use the
"last" pass level (BUS_PASS_DEFAULT). Driver's that wish to probe during
an earlier pass use EARLY_DRIVER_MODULE() instead of DRIVER_MODULE() which
accepts the pass level as an additional parameter.
- A new method BUS_NEW_PASS has been added to the bus interface. This
method is invoked when the system-wide pass level is changed to kick off
a rescan of the device tree so that drivers that have just been made
"eligible" can probe and attach.
- The bus_generic_new_pass() function provides a default implementation of
BUS_NEW_PASS(). It first allows drivers that were just made eligible for
this pass to identify new child devices. Then it propogates the rescan to
child devices that already have an attached driver by invoking their
BUS_NEW_PASS() method. It also reprobes devices without a driver.
- BUS_PROBE_NOMATCH() is only invoked for devices that do not have
an attached driver after being scanned during the final pass.
- The bus_set_pass() function is used during boot to raise the pass level.
Currently it is only called once during root_bus_configure() to raise
the pass level to BUS_PASS_DEFAULT. This has the effect of probing all
devices in a single pass identical to previous behavior.
Reviewed by: imp
Approved by: re (kib)
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.
Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.
While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.
Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.
Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)
Clists were originally used by the TTY layer as a text buffer interface.
The advantage of clists were that it would allocate a small set of
additional buffers that could be shared between TTYs when needed. In
the modern days we can just allocate some more KBs of memory to keep the
TTYs satisfied. The global cfreelist also requires synchronisation,
which may not be useful when trying to improve scalability.
The MPSAFE TTY layer uses its own text buffers (ttyinq and ttyoutq). We
had a small amount of drivers in the tree that still uses clists, like
the old USB stack and some keyboard drivers. With the old USB stack gone
and the keyboard drivers changed to use a circular buffer, we can safely
remove clists from the kernel.
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
a KASSERT to handle it. People are likely to turn off INVARIANTS RSN
and loading an old module can cause garbage-in here.
I saw the issue with an older nvidia driver (x11/nvidia-driver) loading
into a new kernel - a crash wasn't seen 'till sysctl_kern_malloc_stats().
I was lucky that mtp->ks_shortdesc was NULL and not something horrible.
While I'm here, KASSERT that malloc_uninit() isn't passed something that's
not in kmemstatistics.
MFC after: 3 weeks
count of the number of registered policies.
Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.
This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.
Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.
Obtained from: TrustedBSD Project
shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared. Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
added to cdevsw. This function is called for each mmap() request.
If it returns ENODEV, then the mmap() request will fall back to using
the device's device pager object and d_mmap(). Otherwise, the method
can return a VM object to satisfy this entire mmap() request via
*object. It can also modify the starting offset into this object via
*foff. This allows device drivers to use the file offset as a cookie
to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
mapping V_CHR vnodes. This avoids duplicating all the cdev mmap
handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02. Older device drivers
using D_VERSION_01 are still supported.
MFC after: 1 month
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.
Discussed with: rwatson, rmacklem
Each list describes a logical memory object that is backed by one or more
physical address ranges. To minimize locking, the sglist objects
themselves are immutable once they are shared.
These objects may be used in the future to facilitate I/O requests using
physically-addressed buffers. For the immediate future I plan to use them
to implement a new type of VM object and pager.
Reviewed by: jeff, scottl
MFC after: 1 month
consuming one of its spare fields. The cr_flags field is currently
unused, but will be used for features, including capability mode and
pay-as-you-go audit.
Discussed with: jhb, sson
threads:
- Support up to one netisr thread per CPU, each processings its own
workstream, or set of per-protocol queues. Threads may be bound
to specific CPUs, or allowed to migrate, based on a global policy.
In the future it would be desirable to support topology-centric
policies, such as "one netisr per package".
- Allow each protocol to advertise an ordering policy, which can
currently be one of:
NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
an implicit or explicit source (such as an interface or socket).
NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
as well as allowing protocols to provide a flow generation function
for mbufs without flow identifers (m2flow). Falls back on
NETISR_POLICY_SOURCE if now flow ID is available.
NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
each packet handled by netisr (m2cpuid).
- Provide utility functions for querying the number of workstreams
being used, as well as a mapping function from workstream to CPU ID,
which protocols may use in work placement decisions.
- Add explicit interfaces to get and set per-protocol queue limits, and
get and clear drop counters, which query data or apply changes across
all workstreams.
- Add a more extensible netisr registration interface, in which
protocols declare 'struct netisr_handler' structures for each
registered NETISR_ type. These include name, handler function,
optional mbuf to flow ID function, optional mbuf to CPU ID function,
queue limit, and ordering policy. Padding is present to allow these
to be expanded in the future. If no queue limit is declared, then
a default is used.
- Queue limits are now per-workstream, and raised from the previous
IFQ_MAXLEN default of 50 to 256.
- All protocols are updated to use the new registration interface, and
with the exception of netnatm, default queue limits. Most protocols
register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
NETISR_POLICY_FLOW, and will therefore take advantage of driver-
generated flow IDs if present.
- Formalize a non-packet based interface between interface polling and
the netisr, rather than having polling pretend to be two protocols.
Provide two explicit hooks in the netisr worker for start and end
events for runs: netisr_poll() and netisr_pollmore(), as well as a
function, netisr_sched_poll(), to allow the polling code to schedule
netisr execution. DEVICE_POLLING still embeds single-netisr
assumptions in its implementation, so for now if it is compiled into
the kernel, a single and un-bound netisr thread is enforced
regardless of tunable configuration.
In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.
Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.
An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking. It will be enabled once further rmlock
optimization has taken place. However, in practice, netisrs are
rarely registered or unregistered at runtime.
A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.
This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.
Bump __FreeBSD_version.
Reviewed by: bz
=================
Extend the loader to parse the root file system mount options in /etc/fstab,
and set a new loader variable vfs.root.mountfrom.options with these options.
The root mount options must be a comma-delimited string, as specified in
/etc/fstab.
Only set the vfs.root.mountfrom.options variable if it has not been
set in the environment.
sys/kern/vfs_mount.c
====================
When mounting the root file system, pass the mount options
specified in vfs.root.mountfrom.options, but filter out "rw" and "noro",
since the initial mount of the root file system must be done as "ro".
While we are here, try to add a few hints to the mountroot prompt
to give users and idea what might of gone wrong during mounting
of the root file system.
Reviewed by: jhb (an earlier patch)
and calls to vn_vptocnp() by moving more of the common code to
vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that
cache is locked around the call.
Do not track buffer position by both the pointer and offset, use only
buflen to record the start of the free space.
Export vn_vptocnp() for external consumers as a wrapper around
vn_vptocnp_locked() that locks the cache and handles hold counts.
Tested by: pho
assigning ifnets from one vnet to another. Deletion of vnets is not
yet supported.
The interface is implemented as an ioctl extension so that no syscalls
had to be introduced. This should be acceptable given that the new
interface will be used for a short / interim period only, until the
new jail management framwork gains the capability of managing vnets.
This method for managing vimages / vnets has been in use for the past
7 years without any observable issues.
The userland tool to be used in conjunction with the interim API can be
found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and
will most probably never get commited to svn.
While here, bump copyright notices in kern_vimage.c and vimage.h to
cover work done in year 2009.
Approved by: julian (mentor)
Discussed with: bz, rwatson
CPU, if available. This is meant to solve the issue of cpufreq misreporting
speeds on CPUs that boot in a reduced power mode and have only relative
speed control.
permissions, such as VWRITE_ACL. For a filsystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.
Reviewed by: rwatson@
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex. Jails may
have their own host information, or they may inherit it from the
parent/system. The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL. The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.
The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.
Approved by: bz (mentor)
- Add rm_init_flags() and accept extended options only for that variation.
- Add a flags space specifically for rm_init_flags(), rather than borrowing
the lock_init() flag space.
- Define flag RM_RECURSE to use instead of LO_RECURSABLE.
- Define flag RM_NOWITNESS to allow an rmlock to be exempt from WITNESS
checking; this wasn't possible previously as rm_init() always passed
LO_WITNESS when initializing an rmlock's struct lock.
- Add RM_SYSINIT_FLAGS().
- Rename embedded mutex in rmlocks to make it more obvious what it is.
- Update consumers.
- Update man page.
the last component is a symlink to something that isn't a directory.
We introduce a new namei flag, TRAILINGSLASH, which is set by lookup()
if the last component is followed by a slash. The trailing slash is
then stripped, as before. If the final component is a symlink,
lookup() will return to namei(), which will expand the symlink and
call lookup() with the new path. When all symlinks have been
resolved, lookup() checks if the TRAILINGSLASH flag is set, and if it
is, and the vnode it ended up with is not a directory, it returns
ENOTDIR.
PR: kern/21768
Submitted by: Eygene Ryabinkin <rea-fbsd@codelabs.ru>
MFC after: 3 weeks
I don't want people to override the mutex when allocating a TTY. It has
to be there, to keep drivers like syscons happy. So I'm creating a
tty_alloc_mutex() which can be used in those cases. tty_alloc_mutex()
should eventually be removed.
The advantage of this approach, is that we can just remove a function,
without breaking the regular API in the future.
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.
Additively implements adaptive spininning for sx held in shared mode.
The spinning limit can be handled through sysctls in order to be tuned
while the code doesn't reach the release, after which time they should
be dropped probabilly.
This change has made been necessary by recent benchmarks where it does
improve concurrency of workloads in presence of high contention
(ie. ZFS).
KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.
Requested by: jeff, kmacy
Reviewed by: jeff
Tested by: pho
Add support for kernel fault injection using KFAIL_POINT_* macros and
fail_point_* infrastructure. Add example fail point in vfs_bio.c to
simulate VM buf pressure.
Approved by: dfr (mentor)
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.
Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().
Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.
Approved by: bz (mentor)
get a quick snapshot of the kernel's symbol table including the symbols
from any loaded modules (the symbols are all merged into one symbol
table). Unlike like other implementations, this ksyms driver maps
memory in the process memory space to store the snapshot at the time
/dev/ksyms is opened. It also checks to see if the process has already
a snapshot open and won't allow it to open /dev/ksyms it again until it
closes first. This prevents kernel and process memory from being
exhausted. Note that /dev/ksyms is used by the lockstat(1) command.
Reviewed by: gallatin kib (freebsd-arch)
Approved by: gnn (mentor)
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.
Reviewed by: attilio jhb jb
Approved by: gnn (mentor)
sleep waiting for conditions when the lock may be granted.
To prevent lf_setlock() from accessing possibly freed memory, add reference
counting to the struct lockf_entry. Bump refcount around the sleep.
Make lf_free_lock() return non-zero when structure was freed, and use
this after the sleep to return EINTR to the caller. The error code might
need a clarification, but we cannot return success to usermode, since
the lock is not owned anymore.
Reviewed by: dfr
Tested by: pho
MFC after: 1 month
is acquired. In the lf_purgelocks(), assert that vnode is doomed and set
*statep to NULL before clearing ls_pending list. Otherwise, we allow for
the thread executing lf_advlockasync() to put new pending entry after
state->ls_lock is dropped in lf_purgelocks().
Reviewed by: dfr
Tested by: pho
MFC after: 1 month
In the original MPSAFE TTY code, I changed the behaviour by returning
EBUSY. I thought this made more sense, because it's basically a race to
see who gets the TTY first.
It turns out this is not a good change, because it also causes EBUSY to
be returned when another process is closing the TTY. This can happen
during startup, when /etc/rc (or one of its children) is still busy
draining its data and /sbin/init is attempting to open the TTY to spawn
a getty.
Reported by: bz
Tested by: bz
to optionally have overlapping unit numbers if attached in different
vnets.
At this stage if_loop is the only clonable ifnet class that has been
extended to allow for such overlapping allocation of unit numbers, i.e.
in each vnet it is possible to have a lo0 interface. Other clonable ifnet
classes remain to operate with traditional semantics, i.e. each instance
of a clonable ifnet will be assigned a globally unique unit number,
regardless in which vnet such an ifnet becomes instantiated.
While here, garbage collect unused _lo_list field in struct vnet_net,
as well as improve indentation for #defines in sys/net/vnet.h.
The layout of struct vnet_net has changed, therefore bump
__FreeBSD_version.
This change has no functional impact on nooptions VIMAGE kernel builds.
Reviewed by: bz, brooks
Approved by: julian (mentor)
for reassigning ifnets from one vnet to another.
if_vmove() works by calling a restricted subset of actions normally
executed by if_detach() on an ifnet in the current vnet, and then
switches to the target vnet and executes an appropriate subset of
if_attach() actions there.
if_attach() and if_detach() have become wrapper functions around
if_attach_internal() and if_detach_internal(), where the later
variants have an additional argument, a flag indicating whether a
full attach or detach sequence is to be executed, or only a
restricted subset suitable for moving an ifnet from one vnet to
another. Hence, if_vmove() will not call if_detach() and if_attach()
directly, but will call the if_detach_internal() and
if_attach_internal() variants instead, with the vmove flag set.
While here, staticize ifnet_setbyindex() since it is not referenced
from outside of sys/net/if.c.
Also rename ifccnt field in struct vimage to ifcnt, and do some minor
whitespace garbage collection where appropriate.
This change should have no functional impact on nooptions VIMAGE kernel
builds.
Reviewed by: bz, rwatson, brooks?
Approved by: julian (mentor)
When enabled all TTY input queue buffers are zeroed when flushing or
closing the TTY. Because TTY input queues are also used to store filled
in passwords, this may be an interesting switch to enable for security
minded people.
the behaviour alredy present with the further malloc() call in
devctl_notify().
This fixes a bug in the CAM layer where the camisr handler finished to
call camperiphfree() (and subsequently destroy_dev() resulting in a new
dev notify) while the xpt lock is held.
PR: kern/130330
Tested by: Riccardo Torrini <riccardo dot torrini at esaote dot com>
shared uses of a resource are recorded on a sub-list hanging off
a main resource object on a main resource list;
without this change a shared resource (e.g. irq) is reported only
once by devinfo -r/-u;
with this change the resource is reported for each driver that
allocates it (which is even more than what vmstat -i -a reports).
Approved by: jhb (mentor)
affinity for the interrupt thread, and requesting that underlying
hardware direct interrupts to the CPU. For software interrupt
threads, implement a no-op interrupt event binder that returns
success, so that the interrupt management code will just set the
ithread's affinity and succeed.
Reviewed by: jhb
MFC after: 1 week
These sysctls don't need any form of locking. At least cp_times is used
by powerd very often, which means I get 50% less calls to non-MPSAFE
sysctls on my system. The other 50% is consumed by dev.cpu.0.freq, but
this seems to need Giant for Newbus.
Provide a more descriptive comment.
Eliminate dead code. The page cannot possibly have PG_ZERO set.
Eliminate unnecessary blank lines.
Reviewed by: tegge
This makes siginfo output look a lot better when pressing it the first
time when in sh(1), for example:
$ load: 0.00 cmd: sh 1945 [ttyin] 3.94r 0.00u 0.00s 0% 1960k
load: 0.00 cmd: sh 1945 [ttyin] 4.19r 0.00u 0.00s 0% 1960k
will now become:
$
load: 0.00 cmd: sh 1945 [ttyin] 3.94r 0.00u 0.00s 0% 1960k
load: 0.00 cmd: sh 1945 [ttyin] 4.19r 0.00u 0.00s 0% 1960k
- Only pick up PROC_LOCK once, which means we can drop the PGRP_LOCK
right after picking up PROC_LOCK for the first time.
- Print the process real time, making it consistent with tools like
time(1).
- Use `p' and `td' to reference the process/thread we are going to
print. Only use pick-variables inside the loops. We already did this
for the threads, but not the processes.
that expect that oldlen is filled with required buffer length even when
supplied buffer is too short and returned error is ENOMEM.
Redo the fix for kern.proc.filedesc, by reverting the req->oldidx when
remaining buffer space is too short for the current kinfo_file structure.
Also, only ignore ENOMEM. We have to convert ENOMEM to no error condition
to keep existing interface for the sysctl, though.
Reported by: ed, Florian Smeets <flo kasimir com>
Tested by: pho
sysctl requests to avoid wiring too much user memory. Only grab this
lock if the user's old buffer is larger than a page as a tradeoff to
allow more concurrency for common small requests.
- Just use a shared lock on the sysctl tree for user sysctl requests now.
MFC after: 1 week
error due to copyout failure or short buffer.
The later breaks the usermode iterators of the sysctl results that pack
arbitrary number of variable-sized structures. Iterator expects that
kernel filled exactly oldlen bytes, and tries to interpret half-filled
or garbage structure at the end of the buffer. In particular,
kinfo_getfile(3) segfaulted.
Reported and tested by: pho
MFC after: 3 weeks
fget_unlocked().
- Save old file descriptor tables created on expansion until
the entire descriptor table is freed so that pointers may be
followed without regard for expanders.
- Mark the file zone as NOFREE so we may attempt to reference
potentially freed files.
- Convert several fget_locked() users to fget_unlocked(). This
requires us to manage reference counts explicitly but reduces
locking overhead in the common case.
following changes:
Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does. Suggested by: tegge
Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.
Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.
Introduce vm_page_set_valid().
Reviewed by: tegge
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.
In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.
While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.
VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.
suffer from the race condition that motivated revision 1.94. Consequently,
the work-around that was implemented by revision 1.94 is no longer needed.
Moreover, reverting this work-around eliminates the need for
vfs_busy_pages() to acquire the page queues lock when preparing a buffer
for read.
Reviewed by: tegge
Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use separate
reference-counted ucred allocated separately instead.
While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.
Reported by: John Hickey
PR: kern/133439
Reviewed by: dfr, kib
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg. Struct vprocg is likely to become replaced in the near future with
a new jail management API import.
As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.
Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.
Permit kldload / kldunload operations to be executed only from the default
vimage context.
This change should have no functional impact on nooptions VIMAGE kernel
builds.
Reviewed by: bz
Approved by: julian (mentor)
OSD-based jail extensions. This allows the Linux MIB to accessed via
jail_set and jail_get, and serves as a demonstration of adding jail support
to a module.
Reviewed by: dchagin, kib
Approved by: bz (mentor)
It turns out if we called cfmakeraw() on a TTY with only a rint handler
in place, it could inject data into the TTY, even though it should be
redirected. Always take a look at the hooks before looking at the
termios flags.
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.
This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all
vnet instances.
Approved by: julian (mentor)
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).
Approved by: bz (mentor)
import from p4 bms_netdev. Summary of changes:
* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.
NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.
This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
and it only optimized out an ipi or mwait in very few cases.
- Skip the adaptive idle code when running on SMT or HTT cores. This
just wastes cpu time that could be used on a busy thread on the same
core.
- Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
CG_FLAG_THREAD to mean SMT or HTT.
Sponsored by: Nokia
root cpuset of that jail.
Processes inside the jail will still be able to change child sets.
A superuser outside of a jail will still be able to change the jail cpuset
and thus limit the number of cpus available to the jail.
Problem reported by: 000.fbsd@quip.cz (Miroslav Lachman)
PR: kern/134050
Reviewed by: jeff
MFC after: 3 weeks
X-MFC: backout r191596
first introduced @ r190909 with a vnet module deregistration
service.
kldunloadable modules, which are currently using vnet_mod_register()
to attach their per-vnet initialization routines to the vnet
initialization framework, should call vnet_mod_deregister() before
acknowledging MOD_UNLOAD requests in their mod_event handlers. Such
changes to the existing code base will follow in subsequent commits.
vnet_mod_deregister() does not check whether departing vnet modules
are registered as prerequisites for another module(s), so it should
be used with care. Currently I'm only aware of vnet modules which
are leafs on module dependency graphs that are kldunloadable.
This change also introduces per-vnet module destructor handler, which
calls vnet's module cleanup function, which (if required) has to be
registered in vnet module's vnet_modinfo_t structure .vmi_idetach
field. Once options VIMAGE becomes operational, the framework will
take care that module's cleanup function become invoked for each
active vnet instance, and that the memory allocated for each instance
gets freed. Currently calls to destructor handlers must always
succeed.
This allows users to increase the maximum amount of pseudo-terminals
without changing any source code. Users must increase UT_LINESIZE before
attempting to increase kern.pts_maxdev.
The main problem is that sbappendrecord_locked() relies on sbcompress()
to set sb_mbtail. This will not happen if sbappendrecord_locked() is
called with mbuf chain made of exactly one mbuf (i.e. m0->m_next == NULL).
In this case sbcompress() will be called with m == NULL and will do
nothing. I'm not entirely sure if m == NULL is a valid argument for
sbcompress(), and, it rather pointless to call it like that, but keep
calling it so it can do SBLASTMBUFCHK().
The problem is triggered by the SOCKBUF_DEBUG kernel option that
enables SBLASTRECORDCHK() and SBLASTMBUFCHK() checks.
PR: kern/126742
Investigated by: pluknet < pluknet -at- gmail -dot- com >
No response from: freebsd-current@, freebsd-bluetooth@
MFC after: 3 days
or ignored SIGCHLD, unconditionally wake up the parent instead of doing
this only when the child is a last child.
This brings us in line with other U**xes that support SA_NOCLDWAIT. If
the parent called waitpid(childpid), then exit of the child should wake
up the parent immediately instead of forcing it to wait for all children
to exit.
Reported by: Alan Ferrency <alan pair com>
Submitted by: Jilles Tjoelker <jilles stack nl>
PR: 108390
MFC after: 2 weeks
In the good old days it was possible to have dev_t's that referred to
nonexistent devices. In these cases devtoname() automatically generated
names. This is no longer possible, so remove this dead code.
Discussed with: kib
Remove the `udev' variable, which has a different type than the original
function argument and si_drv0. The `udev' name is also misleading,
because it is not the number returned by dev2udev(). Rename this
argument to `unit'. It is the same number as returned by dev2unit().
We should still access si_drv0 using dev2unit(). Also change the
KASSERT() to really print the udev instead of the unit number. I suspect
it's still useful to print the unit number, especially for devices that
use clone lists, so keep the unit number in the panic string.
not populated in parent directory if negative entry was being
created, yet entry itself was added to the nc_neg list. It was
possible for parent vnode to get discarded later, leaving negative
entry pointing to now unused memory block.
Reported by: dho
Revewed by: kib
dependency tracking and ordering enforcement.
With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized. In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.
The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw). Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.
The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module. The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.
For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.
A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on. Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first. The framework will panic if it detects any unresolved
dependencies before completing system initialization. Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.
Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run. In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container. IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.
The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.
Reviewed by: bz
Approved by: julian (mentor)
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.
Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.
Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon
Check the condition and return ENOENT then.
In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused
by dvp reclaim.
Reported and tested by: pho
It turns out my handling of SIGTTOU and SIGTTIN didn't entirely comply
to the standards. It is true that in the SIGTTOU case we should not
return EIO when the signal is ignored/blocked, but in the SIGTTIN case
we must.
See also: POSIX issue 7 section 11.1.4
the size and cost of name cache entries, but make adding debugging
and tracing easier.
Add SDT DTrace probes for various namecache events:
vfs:namecache:enter:done - new entry in the name cache, passed parent
directory vnode pointer, name added to the cache, and child vnode
pointer.
vfs:namecache:enter_negative:done - new negative entry in the name cache,
passed parent vnode pointer, name added to the cache.
vfs:namecache:fullpath:enter - call to vn_fullpath1() is made, passed
the vnode to resolve to a name.
vfs:namecache:fullpath:hit - vn_fullpath1() successfully resolved a
search for the parent of an object using the namecache, passed the
discovered parent directory vnode pointer, name, and child vnode
pointer.
vfs:namecache:fullpath:miss - vn_fullpath1() failed to resolve a search
for the parent of an object using the namecache, passed the child
vnode pointer.
vfs:namecache:fullpath:return - vn_fullpath1() has completed, passed the
error number, and if that is zero, the vnode to resolve, and the
returned path.
vfs:namecache:lookup:hit - postive name cache entry hit, passed the
parent directory vnode pointer, name, and child vnode pointer.
vfs:namecache:lookup:hit_negative - negative name cache entry hit,
passed the parent directory vnode pointer and name.
vfs:namecache:lookup:miss - name cache miss, passed the parent directory
pointer and the full remaining component name (not terminated after the
cache miss component).
vfs:namecache:purge:done - name cache purge for a vnode, passed the vnode
pointer to purge.
vfs:namecache:purge_negative:done - name cache purge of negative entries
for children of a vnode, passed the vnode pointer to purge.
vfs:namecache:purgevfs - name cache purge for a mountpoint, passed the
mount pointer. Separate probes will also be invoked for each cache
entry zapped.
vfs:namecache:zap:done - name cache entry zapped, passed the parent
directory vnode pointer, name, and child vnode pointer.
vfs:namecache:zap_negative:done - negative name cache entry zapped,
passed the parent directory vnode pointer and name.
For any probes involving an extant name cache entry (enter, hit, zapp),
we use the nul-terminated string for the name component. For misses,
the remainder of the path, including later components, is provided as
an argument instead since there is no handy nul-terminated version of
the string around. This is arguably a bug.
MFC after: 1 month
Sponsored by: Google, Inc.
Reviewed by: jhb, kan, kib (earlier version)
vfs:namei:lookup:entry takes parent directory vnode pointer, path to
look up, and lookup flags.
vfs:namei:lookup:return takes an error value, and if successful, the
returned vnode pointer.
MFC after: 1 month
1) Move the new field (brand_note) to the end of the Brandinfo structure.
2) Add a new flag BI_BRAND_NOTE that indicates that the brand_note pointer
is valid.
3) Use the brand_note field if the flag BI_BRAND_NOTE is set and as old
modules won't have the flag set, so the new field brand_note would be
ignored.
Suggested by: jhb
Reviewed by: jhb
Approved by: kib (mentor)
MFC after: 6 days
on a generic dumper that creates an ELF core file and
uses PMAP functions to scan and iterate over memory
chunks, as well as handle memory mappings used during
dumping.
the PMAP layer can choose to return physical memory
chunks or virtual memory chunks. For minidumps, the
chunks should be virtual.
The default MMU I/F implementation for the scan_md()
method returns NULL. Thus, when a PMAP implementation
does not implement the required methods, an empty
core file is created. Here, empty means having an ELF
header only.
Obtained from: Juniper Networks
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.
This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stoppped being type stable.
Reviewed by: kib
It was derived from i386 version long ago but never resync'ed again.
Originally, i386 version compared the current time from realtime clock
with time_second (which was just `time' in the old days). When this MI
version was written, it was wrongly compared against `base' AND never
used because of a bug (typo?) in the code. This check was killed
in i386 version when home-rolled calendaric calculation was removed.
Now, we just remove the code here as well to make the code simpler.
to the devctl notification queue. Empty strings cause devctl read
call to return 0 and result in devd exiting prematurely.
The actual offender (ugen notes for root hubs) will be fixed
by separate commit.
Limit the size of malloced buffer when dumping environment
variables. [EN-09:01]
Approved by: so (cperciva)
Approved by: re (kensmith)
Security: FreeBSD-SA-09:06.ktimer
Errata: FreeBSD-EN-09:01.kenv
Badly formed ELF note may cause the caclulated pointer to the next note
to point both after the note region, that was checked in the code, but
also to point before the region, that was not checked [1]. Remember the
first note location in note0 and leap out if the note is not between
note0 and note_end.
In the similar way, badly formed note may cause infinite loop by
pointing next note into the same or previous note. Guard against this by
limiting amount of loop iterations by arbitrary choosen big number.
For clarity, check the calculated note alignment in each iteration.
Reported by: Chris Palmer <chris noncombatant org> [1]
PR: kern/132886
Reviewed and tested by: dchagin
MFC after: 3 days
B_DELWRI cleanup and vnode disassociation should happen just before to
assign the buffer to a queue.
Reported by: miwi, Volker <volker at vwsoft dot com>,
Ben Kaduk <minimarmot at gmail dot com>,
Christopher Mallon <christoph dot mallon at gmx dot de>
Tested by: lulf, miwi
This fixes osrel fetching from the FreeBSD branding note for the 64bit
platforms.
Reported by: swell.k gmail com
Reviewed by: dchagin
Tested by: dchagin, swell.k gmail com
This code is heavily inspired by Takanori Watanabe's experimental SMP patch
for i386 and large portion was shamelessly cut and pasted from Peter Wemm's
AP boot code.
the "nbufkv" sleep.
First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to buffers shortage, requesting the buffer deadlock bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.
Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.
Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.
In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free. This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.
Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile. They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:
if_ar
if_axe
if_aue
if_cdce
if_cue
if_kue
if_ray
if_rue
if_rum
if_sr
if_udav
if_ural
if_zyd
Drivers that were already disabled because of tty changes:
if_ppp
if_sl
Discussed on: arch@
guarantee that all cpus have acknowledged the cleared enable int by
scheduling the resetting thread on each cpu in succession. Since all
lock profiling happens within a critical section this guarantees that
all cpus have left lock profiling before we clear the datastructures.
- Assert that the per-thread queue of locks lock profiling is aware of
is clear on thread exit. There were several cases where this was not
true that slows lock profiling and leaks information.
- Remove all objects from all lists before clearing any per-cpu
information in reset. Lock profiling objects can migrate between
per-cpu caches and previously these migrated objects could be zero'd
before they'd been removed
Discussed with: attilio
Sponsored by: Nokia
is calculated as 0 which causes errors elsewhere.
Submitted by: KOIE Hidetaka <koie@suri.co.jp>
- When sched_affinity() is called with a thread that is not curthread we
need to handle the ON_RUNQ() case by adding the thread to the correct
run queue.
Submitted by: Justin Teller <justin.teller@gmail.com>
MFC after: 1 Week
".note.ABI-tag" section.
The search order of a brand is changed, now first of all the
".note.ABI-tag" is looked through.
Move code which fetch osreldate for ELF binary to check_note() handler.
PR: 118473
Approved by: kib (mentor)
poll(), only copy out the revents field, not the whole pollfd
structure. Otherwise, if the events field is updated
concurrently by another thread, that update may be lost.
This issue apparently causes problems for the JDK on FreeBSD,
which expects the Linux behavior of not updating all fields
(somewhat oddly, Solaris does not implement the required
behavior, but presumably our adaptation of the JDK is based
on the Linux port?).
MFC after: 2 weeks
PR: kern/130924
Submitted by: Kurt Miller <kurt @ intricatesoftware.com>
Discussed with: kib
internal sysctl_sysctl_name() handler to map the MIB array to a string
name and logs this name in the trace log. This can be useful to see
exactly which sysctls a thread is invoking.
MFC after: 1 month
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.
Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month
needed.
- Move the release of the sysctl sx lock after the vsunlock() in
userland_sysctl() to restore the original memlock behavior of
minimizing the amount of memory wired to handle sysctl requests.
MFC after: 1 week
address space sizes to be longs instead of ints. Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.
MFC after: 1 month
debug.hashstat.rawnchash sysctl in particular as taking 7 milliseconds on
a 3GHz Intel Xeon (4x2) running 7.1. It accounted for almost a quarter of
the total runtime of 'sysctl -a'. It also performs lots of copyout's while
holding the namecache lock (this does not attempt to fix that).
MFC after: 2 weeks
was introduced. If you have a bus, say cardbus, that is derived from
a base-bus (say PCI), then ordinarily all PCI drivers would attach to
cardbus devices. However, there had been one exception: kldload
wouldn't work.
The problem is in devclass_add_driver. In this routine, all we did
was call to the pci device's BUS_DRIVER_ADDED routine. However, since
cardbus bus instances had a different devclass, none of them were
called.
The solution is to call all subclass devclasses, recursively down the
tree, of the class that was loaded. Since we don't have a 'children
class' pointer, we search the whole list of devclasses for a class
whose parent matches. Since just done a kldload time, this isn't as
bad as it sounds. In addition, we short-circuit the whole process by
marking those classes with subclasses with a flag. We'll likely have
to reevaluate this method the number of devclasses with subclasses
gets large.
This means we can remove the "cardbus" lines from all the PCI drivers
since we have no cardbus specific attach device attachments in the
tree.
# Also: minor tweak to an error message
query functions in the kernel, as these effectively serialize
parallel calls to the gettimeofday(2) system call, as well as
other kernel services that use timestamps.
Use the NetBSD version of the fix (kern_tc.c:1.32 by ad@) as
they have picked up our timecounter code and also ran into the
same problem.
Reported by: kris
Obtained from: NetBSD
MFC after: 3 days
locks: a global list/counter/generation counter protected by a new
mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock,
which protects the connections between UNIX domain sockets.
This eliminates conditional lock acquisition that was previously a
property of the global lock being held over sonewconn() leading to a
call to uipc_attach(), which also required the global lock, but
couldn't rely on it as other paths existed to uipc_attach() that
didn't hold it: now uipc_attach() uses only the list lock, which
follows the linkage lock in the lock order. It may also reduce
contention on the global lock for some workloads.
Add global UNIX domain socket locks to hard-coded witness lock
order.
MFC after: 1 week
Discussed with: kris
directory of a vnode to find a dirent with a matching file number. The
name from that dirent is then used to provide the component name.
Note: if the initial vnode argument is not a directory itself, then
the default VOP_VPTOCNP(9) implementation still returns ENOENT.
Reviewed by: kib
Approved by: kib
Tested by: pho
extended attribute get/set; in the case of get an uninitialized user
buffer was passed before the EA was retrieved, making it of relatively
little use; the latter was simply unused by any policies.
Obtained from: TrustedBSD Project
Sponsored by: Google, Inc.
naming by renaming certain "proc" entry points to "cred" entry points,
reflecting their manipulation of credentials. For some entry points,
the process was passed into the framework but not into policies; in
these cases, stop passing in the process since we don't need it.
mac_proc_check_setaudit -> mac_cred_check_setaudit
mac_proc_check_setaudit_addr -> mac_cred_check_setaudit_addr
mac_proc_check_setauid -> mac_cred_check_setauid
mac_proc_check_setegid -> mac_cred_check_setegid
mac_proc_check_seteuid -> mac_cred_check_seteuid
mac_proc_check_setgid -> mac_cred_check_setgid
mac_proc_check_setgroups -> mac_cred_ceck_setgroups
mac_proc_check_setregid -> mac_cred_check_setregid
mac_proc_check_setresgid -> mac_cred_check_setresgid
mac_proc_check_setresuid -> mac_cred_check_setresuid
mac_proc_check_setreuid -> mac_cred_check_setreuid
mac_proc_check_setuid -> mac_cred_check_setuid
Obtained from: TrustedBSD Project
Sponsored by: Google, Inc.
poll_no_poll().
Return a poll_no_poll() result from devfs_poll_f() when
filedescriptor does not reference the live cdev, instead of ENXIO.
Noted and tested by: hps
MFC after: 1 week
Do not overload the local variable size in kern_shmat() due to vm_size_t
change.
Fix style bug by adding explicit comparision with 0.
Discussed with: bde
MFC after: 1 week
wrapper macros that allow trace points and arguments to be declared
using a single macro rather than several. This means a lot less
repetition and vertical space for each trace point.
Use these macros when defining privilege and MAC Framework trace points.
Reviewed by: jb
MFC after: 1 week
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.
Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.
While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.
Approved by: bz (mentor)
operation is known and to retry or fail accordingly to that
outcome. This fixes the problem with namespace traversing
programs failing with random ENOENT errors if someone just
happened to try to unmount that same filesystem at the same
time.
Reported by: dhw
Reviewed by: kib, attilio
Sponsored by: Juniper Networks, Inc.
memory from int to size_t. Implement a workaround for current ABI not
allowing to properly save size for and report more then 2Gb sized segment
of shared memory.
This makes it possible to use > 2 Gb shared memory segments on 64bit
architectures. Please note the new BUGS section in shmctl(2) and
UPDATING note for limitations of this temporal solution.
Reviewed by: csjp
Tested by: Nikolay Dzham <i levsha org ua>
MFC after: 2 weeks
It's better to just use internal language constructs, because it is
likely the compiler has a better opinion on whether to perform inlining,
which is very likely to happen to struct winsize.
Submitted by: Christoph Mallon <christoph mallon gmx de>
net/route.h.
Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.
We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.
This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
printf() and vprintf() are exactly the same, except the way arguments
are passed. Just like we see in other pieces of code (i.e. libc's
printf()), implement printf() using vprintf().
Submitted by: Christoph Mallon <christoph mallon gmx de>
As mentioned by bz and bde, the change I made wasn't the proper way to
fix. Inspired by bde's patch, perform some small cleanups to uprintf().
Reviewed by: bz
Inside do_execve(), we have a pointer `ndp', which always points to
`&nd'. I can imagine a primitive (non-optimizing) compiler to really
reserve space for such a pointer, so just remove the variable and use
`&nd' directly.
kern_time.c:
- Unused variable `p'.
kern_thr.c:
- Variable `error' is always caught immediately, so no reason to
initialize it. There is no way that error != 0 at the end of
create_thread().
kern_sig.c:
- Unused variable `code'.
kern_synch.c:
- `rval' is always assigned in all different cases.
kern_rwlock.c:
- `v' is always overwritten with RW_UNLOCKED further on.
kern_malloc.c:
- `size' is always initialized with the proper value before being used.
kern_exit.c:
- `error' is always caught and returned immediately. abort2() never
returns a non-zero value.
kern_exec.c:
- `len' is always assigned inside the if-statement right below it.
tty_info.c:
- `td' is always overwritten by FOREACH_THREAD_IN_PROC().
Found by: LLVM's scan-build
`p' is already initialized with `td->td_proc'. Because td is always
curthread, it is safe to initialize it without any locks.
Found by: LLVM's scan-build
priv:kernel:priv_check:priv_ok fires for granted privileges
priv:kernel:priv_check:priv_errr fires for denied privileges
The first argument is the requested privilege number. The naming
convention is a little different from the OpenSolaris equivilent
because we can't have '-' in probefunc names, and our privilege
namespace is different.
MFC after: 1 week
predefined set of methods, which are set in osd_register() and called
via osd_call(). Currently, no methods are defined, though prison
objects will have some in the future.
Expand the locking from a single per-type mutex to three different kinds
of locks (four if you include the requirement that the container
(e.g. prison) be locked when getting/setting data). This clears up one
existing issue, as well as others added by the method support.
Approved by: bz (mentor)
The existing code calls kern_open() to resolve the vnode of a pathname
right after a stat(). This is not correct, because it causes random
character devices to be opened in /dev. This means ls'ing a tape
streamer will cause it to rewind, for example. Changes I have made:
- Add kern_statat_vnhook() to allow binary emulators to `post-process'
struct stat, using the proper vnode.
- Remove unneeded printf's from stat() and statfs().
- Make the Linuxolator use kern_statat_vnhook(), replacing
translate_path_major_minor_at().
- Let translate_fd_major_minor() use vp->v_rdev instead of
vp->v_un.vu_cdev.
Result:
crw-rw-rw- 1 root root 0, 14 Feb 20 13:54 /dev/ptmx
crw--w---- 1 root adm 136, 0 Feb 20 14:03 /dev/pts/0
crw--w---- 1 root adm 136, 1 Feb 20 14:02 /dev/pts/1
crw--w---- 1 ed tty 136, 2 Feb 20 14:03 /dev/pts/2
Before this commit, ptmx also had a major number of 136, because it
silently allocated and deallocated a pseudo-terminal. Device nodes that
cannot be opened now have proper major/minor-numbers.
Reviewed by: kib, netchild, rdivacky (thanks!)
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.
Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans
- Don't return a negative errno when using an unknown ioctl() on a
pseudo-terminal master device. Be sure to convert ENOIOCTL to ENOTTY,
just like the TTY layer does.
- Even though we should return st_rdev of the master device node when
emulating pty(4) devices, FIODGNAME should still return the name of
the slave device. Otherwise ptsname(3) and ttyname(3) return an
invalid device name.
members for a kinfo entry on a process-wide system.
- Use the newly introduced function in order to fix cases like
KERN_PROC_PROC where aggregating stats are broken because they just
consider the first thread in the pool for each process.
(Note, additively, that KERN_PROC_PROC is rather inaccurate on
thread-wide informations like the 'state' of the process. Such
informations should maybe be invalidated and being forceably discarded
by the consumers?).
- Simplify the logic of sysctl_out_proc() and adjust the
fill_kinfo_thread() accordingly.
- Remove checks on the FIRST_THREAD_IN_PROC() being NULL but add
assertives.
This patch should fix aggregate statistics for KERN_PROC_PROC.
This is one of the reasons why top doesn't use this option and now it
can be use it safely.
ps, when launched in order to display just processes, now should report
correct cpu utilization percentages and times (as opposed by the old
code).
Reviewed by: jhb, emaste
Sponsored by: Sandvine Incorporated
If the file system backing a process' cwd is removed, and procstat -f PID
is called, then these messages would have been printed. The extra verbosity is
not required in this situation.
Requested by: kib
Approved by: kib
revealed that a process' current working directory can be VBAD if the
directory is removed. This can trigger a panic when procstat -f PID is
run.
Tested by: pho
Discovered by: phobot
Reviewed by: kib
Approved by: kib
the correct behaviour (sorting by distance from the current head position
in the scan direction) and bioq_insert_head() and bioq_insert_tail()
have a well defined (and useful) behaviour, especially when intermixed
with calls to bioq_disksort().
In particular:
- fix a bug in the existing bioq_disksort() that did not use the
current head position correctly;
- redefine semantics of bioq_insert_head() and bioq_insert_tail().
bioq_insert_tail() can now be used as a barrier
between previous and subsequent calls to bioq_disksort().
The code is heavily documented in the source code so please refer
to that for the details.
Much of this code comes from Fabio Checconi. Also thanks to Kirk
for feedback on the (re)definition of bioq_insert_tail().
NOTE: in the current tree there is only a handful of files which
intermix calls to bioq_disksort() with bioq_insert_head() and
bioq_insert_tail(). The ordering of the queue in these situation
was not specified (nor easy to figure out) before, so I doubt any
of that code could be affected by the specification of the API.
Also note that the current implementation is significantly simpler
than the previous one (also used in ata_sort_queue()).
It would be useful to reimplement ata_sort_queue() using
the same code used in bioq_disksort().
MFC after: 1 week
Just like the old TTY layer, the current MPSAFE TTY layer does not make
any attempt to serialize calls of write(). Data is copied into the
kernel in 256 (TTY_STACKBUF) byte chunks. If a write() call occurs at
the same time, the data may interleave. This is especially likely when
the TTY starts blocking, because the output queue reaches the high
watermark.
I've implemented this by adding a new flag, TTY_BUSY_OUT, which is used
to mark a TTY as having a thread stuck in write(). Because I don't want
non-blocking processes to be possibly blocked by a sleeping thread, I'm
still allowing it to bypass the protection. According to this message,
the Linux kernel returns EAGAIN in such cases, but I think that's a
little too restrictive:
http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2007/5/2/85418/thread
PR: kern/118287
from the parent to the child process if they have an operation vector
of &badfileops. This narrows a set of races involving system calls that
allocate a new file descriptor, potentially block for some extended
period, and then return the file descriptor, when invoked by a threaded
program that concurrently invokes fork(2). Similar approches are used
in both Solaris and Linux, and the wideness of this race was introduced
in FreeBSD when we moved to a more optimistic implementation of
accept(2) in order to simplify locking.
A small race necessarily remains because the fork(2) might occur after
the finit() in accept(2) but before the system call has returned, but
that appears unavoidable using current APIs. However, this race is
vastly narrower.
The fix can be validated using the newfileops_on_fork regression test.
PR: kern/130348
Reported by: Ivan Shcheklein <shcheklein at gmail dot com>
Reviewed by: jhb, kib
MFC after: 1 week
Attach call without devclass set crashes the system.
On resume AHCI driver sometimes tries to create duplicate adX device.
It is surely his own problem, but IMHO it is not a reason to crash here.
Other reasons are also possible.
result in errors for a format loading but subsequent correct recognizing
for another format.
File format loading functions should avoid printing any additional
informations but just returning appropriate (and different between each
other) error condition, characterizing different informations.
Additively, the linker should handle appropriately different format
loading errors.
While a general mechanism is desired, fix a simple and common case on
amd64: file type is not recognized for link elf and confuses the linker.
Printout an error if all the registered linker classes can't recognize
and load the module.
Reviewed by: jhb
Sponsored by: Sandvine Incorporated
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
from vn_printf().
Back in 1.1 of kern_sysctl.c the sysctl() routine wired the "old" userland
buffer for most sysctls (everything except kern.vnode.*). I think to prevent
issues with wiring too much memory it used a 'memlock' to serialize all
sysctl(2) invocations, meaning that only one user buffer could be wired at
a time. In 5.0 the 'memlock' was converted to an sx lock and renamed to
'sysctl lock'. However, it still only served the purpose of serializing
sysctls to avoid wiring too much memory and didn't actually protect the
sysctl tree as its name suggested. These changes expand the lock to actually
protect the tree.
Later on in 5.0, sysctl was changed to not wire buffers for requests by
default (sysctl_handle_opaque() will still wire buffers larger than a single
page, however). As a result, user buffers are no longer wired as often.
However, many sysctl handlers still wire user buffers, so it is still
desirable to serialize userland sysctl requests. Kernel sysctl requests
are allowed to run in parallel, however.
- Expose sysctl_lock()/sysctl_unlock() routines to exclusively lock the
sysctl tree for a few places outside of kern_sysctl.c that manipulate
the sysctl tree directly including the kernel linker and vfs_register().
- sysctl_register() and sysctl_unregister() require the caller to lock
the sysctl lock using sysctl_lock() and sysctl_unlock(). The rest of
the public sysctl API manage the locking internally.
- Add a locked variant of sysctl_remove_oid() for internal use so that
external uses of the API do not need to be aware of locking requirements.
- The kernel linker no longer needs Giant when manipulating the sysctl
tree.
- Add a missing break to the loop in vfs_register() so that we stop looking
at the sysctl MIB once we have changed it.
MFC after: 1 month
sysctls during a linker file unload. We drop the lock when doing similar
operations during a linker file load. To close races, clear the LINKED
flag before dropping the lock so that the linker file is no longer visible
to userland.
MFC after: 1 week