by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINT at SI_ORDER_ANY.
Reviewed by: rwatson, zec
Sponsored by: The FreeBSD Foundation
MFC after: 4 weeks
The /dev/console device node logs all strings that are written to it.
When the string does not contain a trailing newline, it appends one. I
can imagine this was useful a long time ago, but with our current
rc-scripts, it generates a whole bunch of messages that look like:
| Configuring syscons:
| blanktime
| .
By not appending the newlines, the output of `dmesg -a' is now (almost?)
exactly the same as what the user will see on the console device
(syscons, uart).
aio code and are registered via the recently added SYSCALL32_*() helpers.
- Since the aio code likes to invoke fuword and suword a lot down in the
"bowels" of system calls, add a structure holding a set of operations for
things like storing errors, copying in the aiocb structure, storing
status, etc. The 32-bit system calls use a separate operations vector to
handle fuword32 vs fuword, etc. Also, the oldsigevent handling is now
done by having seperate operation vectors with different aiocb copyin
routines.
- Split out kern_foo() functions for the various AIO system calls so the
32-bit front ends can manage things like copying in and converting
timespec structures, etc.
- For both the native and 32-bit aio_suspend() and lio_listio() calls,
just use copyin() to read the array of aiocb pointers instead of using
a for loop that iterated over fuword/fuword32. The error handling in
the old case was incomplete (lio_listio() just ignored any aiocb's that
it got an EFAULT trying to read rather than reporting an error), and
possibly slower.
MFC after: 1 month
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.
Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().
Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().
Reported and tested by: ganbold
Suggested and reviewed by: jhb
MFC after: 2 week
move that module to the head of the associated linker file's list of modules.
The end result is that once all the modules are loaded, they are sorted in
the reverse of their load order. This causes the kernel linker to invoke
the MOD_QUIESCE and MOD_UNLOAD events in the reverse of the order that
MOD_LOAD was invoked. This means that the ordering of MOD_LOAD events that
is set by the SI_* paramters to DECLARE_MODULE() are now honored in the same
order they would be for SYSUNINIT() for the MOD_QUIESCE and MOD_UNLOAD
events.
MFC after: 1 month
unloading any modules. As a result, if any module veto's an unload
request via MOD_QUIESCE, the entire set of modules for that linker
file will remain loaded and active now rather than leaving the kld
in a weird state where some modules are loaded and some are unloaded.
- This also moves the logic for handling the "forced" unload flag out of
kern_module.c and into kern_linker.c which is a bit cleaner.
- Add a module_name() routine that returns the name of a module and use that
instead of printing pointer values in debug messages when a module fails
MOD_QUIESCE or MOD_UNLOAD.
MFC after: 1 month
Close subtle but relatively unlikely race conditions when
propagating the vnode write error to other active sessions
tracing to the same vnode, without holding a reference on
the vnode anymore. [2]
PR: kern/126368 [1]
Submitted by: rwatson [2]
Reviewed by: kib, rwatson
MFC after: 4 weeks
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.
For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.
Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation
entries for one name. Then, creating inode with that name would remove
one entry, leaving others dormant. Reclaiming the vnode would uncover
negative entries, causing false return of ENOENT from the calls like
stat, that do not create inode.
Prevent creation of the duplicated negative entries.
Reported and debugged with: pho
Reviewed by: jhb
X-MFC: after shared lookup changes
This changes struct kinfo_filedesc and kinfo_vmentry such that they are
same on both 32 and 64 bit platforms like i386/amd64 and won't require
sysctl wrapping.
Two new OIDs are assigned. The old ones are available under
COMPAT_FREEBSD7 - but it isn't that simple. The superceded interface
was never actually released on 7.x.
The other main change is to pack the data passed to userland via the
sysctl. kf_structsize and kve_structsize are reduced for the copyout.
If you have a process with 100,000+ sockets open, the unpacked records
require a 132MB+ copyout. With packing, it is "only" ~35MB. (Still
seriously unpleasant, but not quite as devastating). A similar problem
exists for the vmentry structure - have lots and lots of shared libraries
and small mmaps and its copyout gets expensive too.
My immediate problem is valgrind. It traditionally achieves this
functionality by parsing procfs output, in a packed format. Secondly, when
tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled
32 bit binary which ran directly into the differing data structures in 32
vs 64 bit mode. (valgrind uses this to track file descriptor operations
and this therefore affected every single 32 bit binary)
I've added two utility functions to libutil to unpack the structures into
a fixed record length and to make it a little more convenient to use.
exact multiple of system page size should still be allowed to be mapped
in their entirety to match the regular vnode backed file behavior.
Reported by: ed
Reviewed by: jhb
1) to get the 32 and 64 bit versions in sync so that no shims are needed,
Valgrind in particular excercises this. and:
2) reduce the size of the copyout. On large processes this turns out to
be a huge problem. Valgrind also suffers from this since it needs to do
this in a context that can't malloc. I want to pack the records.
3) Add new types.. 'tell me about fd N' and 'tell me about addr N'.
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.
Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.
Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.
Noted by: tegge
Reviewed by: dfr
Tested by: pho
MFC after: 1 month
- Print flags in hex.
- Note that flags can be fine and panic can be due unexpected error condition.
- Remove redundant new line character.
Eventhough panic message excess 80 characters keep it in one line so it is
easier to grep.
The mqfs_search() routine uses strncmp() to match message queue objects
by name. This is because it can be called from environments where the
file name is not null terminated (the VFS for example).
Unfortunately it doesn't compare the lengths of the message queue names,
which means if a system has "Queue12345", the name "Queue" will also
match.
I noticed this when a student of mine handed in an exercise using
message queues with names "Queue2" and "Queue".
Reviewed by: rink
whitespace) macros from p4/vimage branch.
Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.
De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
advance of teaching vn_fullpath1() how to query file systems for
vnode-to-name mappings when cache lookups fail.
Thanks to kib for guidance and patience on this process.
Reviewed by: kib
Approved by: kib
configured, change the message to let people know this is a
possibility. I've slightly changed the message from the one
submitted by Pekka to keep the printf on one line.
Submitted by: Pekka Savola <pekkas@netcore.fi>
of the ABI of the currently executing image. Change some places to test
the flags instead of explicit comparing with address of known sysentvec
structures to determine ABI features.
Discussed with: dchagin, imp, jhb, peter
and ifnet functions
- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own
- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers
- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
allow drivers to efficiently manage multiple hardware queues
(i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues
This work was supported by Bitgravity Inc. and Chelsio Inc.
- An "at" hint now reserves a device name.
- A new BUS_HINT_DEVICE_UNIT method is added to the bus interface. When
determining the unit number of a device, this method is invoked to
let the bus driver specify the unit of a device given a specific
devclass. This is the only way a device can be given a name reserved
via an "at" hint.
- Implement BUS_HINT_DEVICE_UNIT() for the acpi(4) and isa(4) bus drivers.
Both of these busses implement this by comparing the resources for a
given hint device with the resources enumerated by ACPI/PnPBIOS and
wire a unit if the hint resources are a subset of the "real" resources.
- Use bus_hinted_children() for adding hinted devices on isa(4) busses
now instead of doing it by hand.
- Remove the unit kludging from sio(4) as it is no longer necessary.
Prodding from: peter, imp
OK'd by: marcel
MFC after: 1 month
This bring huge amount of changes, I'll enumerate only user-visible changes:
- Delegated Administration
Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.
- L2ARC
Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.
- slog
Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).
- vfs.zfs.super_owner
Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.
- chflags(2)
Not all the flags are supported. This still needs work.
- ZFSBoot
Support to boot off of ZFS pool. Not finished, AFAIK.
Submitted by: dfr
- Snapshot properties
- New failure modes
Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests
- Refquota, refreservation properties
Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.
- Sparse volumes
ZVOLs that don't reserve space in the pool.
- External attributes
Compatible with extattr(2).
- NFSv4-ACLs
Not sure about the status, might not be complete yet.
Submitted by: trasz
- Creation-time properties
- Regression tests for zpool(8) command.
Obtained from: OpenSolaris
kern_unlinkat(), that expects that vn_start_write() actually fills the mp
even when the call failed.
As Tor noted, that pattern relies on the the type stability of the mount
points, as well as that suspended mount points are never freed and
V_XSLEEP is always passed to vn_start_write() when called on a freed
mount point.
Reported by: stass
Reviewed by: tegge
PR: 123768
Peter Holm just discovered this funny bug inside the TTY code: if
uiomove() in ttydisc_write() returns an error, we forget to relock the
TTY before jumping out of ttydisc_write(). Fix it by placing
tty_unlock() and tty_lock() around uiomove().
Submitted by: pho
- Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2)
fills an array with two descriptors.
- Remove EFAULT from the manual page. Because of the current calling
convention, pipe(2) raises a segmentation fault when an invalid
address is passed.
- Introduce kern_pipe() to make it easier for binary emulations to
implement pipe(2).
- Make Linux binary emulation use kern_pipe(), which means we don't have
to recover td_retval after calling the FreeBSD system call.
Approved by: rdivacky
Discussed on: arch
This prevents a panic which occurs when a driver attempts to load
firmware at boot via firmware_get() when the firmware module has not
been preloaded. firmware_get() will enqueue a task using a struct
taskqueue allocated on the stack, and the machine will crash much
later in the firmware taskq thread when taskqs are started and the
struct taskqueue is garbage.
Not objected to by: sam
Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.
Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().
I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.
Reviewed by: rdivacky, kib
On RELENG_6 (and probably RELENG_7) we see our syscons windows and
pseudo-terminals have the following buffer sizes:
| LINE RAW CAN OUT IHIWT ILOWT OHWT LWT COL STATE SESS PGID DISC
| ttyv0 0 0 0 7680 6720 2052 256 7 OCcl 1146 1146 term
| ttyp0 0 0 0 7680 6720 1296 256 0 OCc 82033 82033 term
These buffer sizes make no sense, because we often have much more output
than input, but I guess having higher input buffer sizes improves
guarantees of the system.
On MPSAFE TTY I just sent both the input and output buffer sizes to 7
KB, which is pretty big on a standard FreeBSD install with 8 syscons
windows and some PTY's. Reduce the baud rate to 9600 baud, which means
we now have the following buffer sizes:
| LINE INQ CAN LIN LOW OUTQ USE LOW COL SESS PGID STATE
| ttyv0 1920 0 0 192 1984 0 199 7 2401 2401 Oil
| pts/0 1920 0 0 192 1984 0 199 5631 1305 2526 Oi
This is a lot smaller, but for pseudo-devices this should be good
enough. You need to do a lot of punching to fill up a 7.5 KB input
buffer. If it turns out things don't work out this way, we'll just
switch to 19200 baud.
vnode in question does not need to be held. All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.
In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).
For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.
MFC after: 1 month
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.
Submitted by: ups (mostly)
Tested by: pho
MFC after: 1 month
Really, the concept of holdcnt in the struct mount is rappresented by
the mnt_ref (which prevents the type-stable structure from being
"recycled) handled through vfs_ref() and vfs_rel().
On this optic, switch the holdcnt acquisition into an emulated vfs_ref()
(and subsequent release into vfs_rel()).
Discussed with: kib
Tested by: pho
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.
The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.
To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.
As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.
Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.
The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.
Sponsored by: Isilon Systems
MFC after: 1 month
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.
Approved by: gnn (mentor) (older version)
Pointed out by: rdivacky
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.
This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.
Discussed with: kib
Tested by: pho
We often run into these very high column numbers when we run curses
applications, because they don't print any newlines. This messes up the
table output of `pstat -t'. If these numbers get really high, they
aren't of any use to the reader anyway. Convert them to `99999' when
they run out of bounds.
One of the pieces of code that I had left alone during the development
of the MPSAFE TTY layer, was tty_cons.c. This file actually has two
different functions:
- It contains low-level console input/output routines (cnputc(), etc).
- It creates /dev/console and wraps all its cdevsw calls to the
appropriate TTY.
This commit reimplements the second set of functions by moving it
directly into the TTY layer. /dev/console is now a character device node
that's basically a regular TTY, but does a lookup of `si_drv1' each time
you open it. d_write has also been changed to call log_console().
d_close() is not present, because we must make sure we don't revoke the
TTY after writing a log message to it.
Even though I'm not convinced this is in line with the future directions
of our console code, it is a good move for now. It removes recursive
locking from the top half of the TTY layer. The previous implementation
called into the TTY layer with Giant held.
I'm renaming tty_cons.c to kern_cons.c now. The code hardly contains any
TTY related bits, so we'd better give it a less misleading name.
Tested by: Andrzej Tobola <ato iem pw edu pl>,
Carlos A.M. dos Santos <unixmania gmail com>,
Eygene Ryabinkin <rea-fbsd codelabs ru>
within an object that a mapping refers to. fileid and fsid are inode/dev
for vnodes. (Linux procfs has these and valgrind is really unhappy
without them.) I believe I didn't change the size of the struct.
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.
Approved by: rwatson (mentor)
vpollinfo with vnode interlock. Fully initialize vpollinfo before putting
pointer to it into vp->v_pollinfo.
Discussed with: dwhite
Tested by: pho
MFC after: 1 week
that they operate directly on credentials: mac_proc_create_swapper(),
mac_proc_create_init(), and mac_proc_associate_nfsd(). Update policies.
Obtained from: TrustedBSD Project
"ticks" goes negative. This breaks the signed comparison in softclock.
This causes sleep() to never wake up, tcp to stop, etc etc. This is
bad(TM). Use the SEQ_LT() method from tcp's sequence number comparisons.
Due to the nature of the beast it causes lot of unproductive overhead. This
is especially bad when running SMP kernel on VMWare with several virtual
processors - idle FreeBSD guest with SMP kernel takes 150% host CPU time on my
dual-core MacBook Pro when I am enabling two virtual CPUs, making even host
not very usable. Detect when we are running in the sandbox and reduce HZ
to 10 (can be adjusted via VM_HZ in the kernel config) in such cases. This
brings host CPU usage of idle FreeBSD/SMP on two virtual processors down
to 10%.
Detect most popular VM platforms out there - VMWare, Parallels, VirtualBox
and VirtualPC.
MFC after: 2 weeks
when thread is in kernel mode, it can cause dead loop, now unlock
process lock after acquired sleep queue lock and thread lock to
avoid the problem. This means TDF_NEEDSIGCHK and TDF_NEEDSUSPCHK must
be set with process lock and thread lock being hold at same time.
unnecessary, the normal process lock and thread lock are enough. The
spin lock is still needed for process and thread exiting to mimic
single sched_lock.
rest in kern_getdirentries(). Use kern_getdirentries() to implement
freebsd32_getdirentries(). This fixes a bug where calls to getdirentries()
in 32-bit binaries would trash the 4 bytes after the 'long base' in
userland.
Submitted by: ups
MFC after: 1 week
- If there aren't spinlocks held, but there are problems with old
sleeplocks, they are not reported.
- If the spinlock found is not the only one, problems are not reported.
Fix these 2 problems.
Reported by: tegge
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.
Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks
MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always
tries to obtain a shared lock on mnt_lock, the other user is unmount who
tries to drain it, setting MNTK_UNMOUNT before.
Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks
realtimer_expire() to not rearm the timer, otherwise there is a chance
that a callout will be left there and be tiggered in future unexpectly.
Bug reported by: tegge@
not the string formatted at the time of CTRX() call. Stack_ktr(9) uses
an on-stack buffer for the symbol name, that is supplied as an argument
to ktr. As result, stack_ktr() traces show garbage or cause page faults.
Fix stack_ktr() by using pointer to module symbol table that is supposed
to have a longer lifetime.
Tested by: pho
MFC after: 1 week
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.
Reviewed by: rwatson
MFC after: 3 months (set timer; decide then)
PCPU_PTR() curthread can migrate on another CPU and get incorrect
results.
- Fix a similar race into witness_warn().
- Fix the interlock's checks bypassing by correctly using the appropriate
children even when the lock_list chunk to be explored is not the first
one.
- Allow witness_warn() to work with spinlocks too.
Bugs found by: tegge
Submitted by: jhb, tegge
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
- Change the ddb(4) commands to be more useful (by thompsa@):
- `show ttys' is now called `show all ttys'. This command will now
also display the address where the TTY data structure resides.
- Add `show tty <addr>', which dumps the TTY in a readable form.
- Place an upper bound on the TTY buffer sizes. Some drivers do not want
to care about baud rates. Protect these drivers by preventing the TTY
buffers from getting enormous. Right now we'll just clamp it to 64K,
which is pretty high, taking into account that these buffers are only
used by the built-in discipline.
- Only call ttydev_leave() when needed. Back in April/May the TTY
reference counting mechanism was a little different, which required us
to call ttydev_leave() each time we finished a cdev operation.
Nowadays we only need to call ttydev_leave() when we really mark it as
being closed.
- Improve return codes of read() and write() on TTY device nodes.
- Make sure we really wake up all blocked threads when the driver calls
tty_rel_gone(). There were some possible code paths where we didn't
properly wake up any readers/writers.
- Add extra assertions to prevent sleeping on a TTY that has been
abandoned by the driver.
- Use ttydev_cdevsw as a more reliable method to figure out whether a
device node is a real TTY device node.
Obtained from: //depot/projects/mpsafetty/...
Reviewed by: thompsa
this eliminates some problems of locking, e.g, a thread lock is needed
but can not be used at that time. Only the process lock is needed now
for new field.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()
and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.
As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP
Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
designed drivers would never hit, but was exposed in diving into
another problem...
When expanding the devclass array, free the old memory after updating
the pointer to the new memory. For the following single race case,
this helps:
allocate new memory
copy to new memory
free old memory
<interrupt> read pointer to freed memory
update pointer to new memory
Now we do
allocate new memory
copy to new memory
update pointer to new memory
free old memory
Which closes this problem, but doesn't even begin to address the
multicpu races, which all should be covered by Giant at the moment,
but likely aren't completely.
Note: reviewers were ok with this fix, but suggested the use case
wasn't one we wanted to encourage.
Reviewed by: jhb, scottl.
have_interp to TRUE. This allows the code in image activator to try
/libexec/ld-elf.so.1 as interpreter when newinterp is not found to
execute.
Reviewed by: peter
MFC after: 2 weeks (together with r175105)
descriptor pointer in unp_freerights: we can no longer recurse into
unp_gc due to unp_gc being invoked in a deferred way, but it's still
a good idea.
MFC after: 3 days
no data is ready, return 0 rather than blocking or returning EAGAIN.
This is consistent with the behavior of soreceive_generic (soreceive)
in earlier versions of FreeBSD, and restores this behavior for UDP.
Discussed with: jhb, sam
MFC after: 3 days
improperly invoking sosend(), soreceive(), and sopoll() instead of
attach either specialized or _generic() versions of those functions
to their pru_sosend, pru_soreceive, and pru_sopoll protosw methods.
MFC after: 3 days
booting from an MFS root (e.g. from an install CD) firmware_mountroot
can be called twice with the second call happening before the task
callback occurs; this results in the task structure contents being
corrupted because it was declared static.
Submitted by: marius (original version)
- Staticize and locally prototype functions uipc_ctloutput(), unp_dispose(),
unp_init(), and unp_externalize(), none of which have been required
outside of uipc_usrreq.c since uipc_proto.c was removed.
- Remove stale prototype for uipc_usrreq(), which has not existed in the
code since 1997
- Forward declare and staticize uipc_usrreqs structure in uipc_usrreq.c and
not un.h.
- Comment on why uipc_connect2() is still non-static -- it is used directly
by fifofs.
- Remove stale comments, tidy up whitespace.
MFC after: 3 days (where applicable)
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
to store the socket address stored in the first mbuf in a packet chain.
This reduces contention on the lock and CPU system time in certain UDP
workloads.
Tested by: ps
Reviewed by: rwatson
MFC after: 1 week
- Update or remove comments that were left over from the original
soreceive_generic() implementation. Quite a few were misleading in the
context of the new code.
- Since soreceive_dgram() has a simpler structure, replace several gotos
with a while loop making the invariants more clear.
- In the blocking while loop, don't try to handle cases incompatible with
the loop invariant (since m is always NULL, don't check for and handle
non-NULL).
- Don't drop and re-acquire the socket buffer lock unnecessarily after
sbwait() returns, which may help reduce lock contention (etc).
- Assume PR_ATOMIC since we assert it at the top of the function.
MFC after: 3 days
setting TDF_INPANIC then it will never be rescheduled again. Wrap
setting the panic condition with the critical section.
Noted and reviewed by: tegge
MFC after: 1 week
The uminor() and umajor() functions have the same use in kernel space as
the minor() and major() functions in userspace. If we ever get rid of
the minor() function in kernel space, we could decide to just expose
minor() and major() to kernel space, making uminor() and umajor()
redundant.
There are two reasons why we want to have uminor() and umajor() in
<sys/types.h>:
- Having them close together prevents them from diverting. Even though
it's unlikely the definitions will change, it's a good habit to have
them at the same place.
- They don't really belong in kern_conf.c. kern_conf.c has been
liberated from dealing with device major and minor number handling.
The device_ids(9) manpage now lists the wrong #include's, because it
should only list <sys/types.h> now. I'm leaving it as it is now, because
I wonder if we should document them anyway. We're probably better off
documenting minor(3) and major(3).
After I removed all the unit2minor()/minor2unit() calls from the kernel
yesterday, I realised calling minor() everywhere is quite confusing.
Character devices now only have the ability to store a unit number, not
a minor number. Remove the confusion by using dev2unit() everywhere.
This commit could also be considered as a bug fix. A lot of drivers call
minor(), while they should actually be calling dev2unit(). In -CURRENT
this isn't a problem, but it turns out we never had any problem reports
related to that issue in the past. I suspect not many people connect
more than 256 pieces of the same hardware.
Reviewed by: kib
I've had some reports in the past that opening an already opened TTY
through, for example, /dev/tty can fail with random error codes. Looking
at ttydev_open(), I can see there is a way `error' is returned without
initialising it. Even though I haven't had any confirmation this fixes
the bug, I'll fix it anyway.
Reported by: Andrzej Tobola <ato iem pw edu pl>
To prevent any further confusion about device minor and unit numbers,
we'd better just refer to device unit numbers. Many people still think
the numbers we show inside devfs have any relation to the numbers passed
to make_dev(9), which is not the case.
Discussed with: kib
When I changed kern_conf.c three months ago I made device unit numbers
equal to (unneeded) device minor numbers. We used to require
bitshifting, because there were eight bits in the middle that were
reserved for a device major number. Not very long after I turned
dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
The unit2minor() and minor2unit() macro's were no-ops.
We'd better not remove these four macro's from the kernel, because there
is a lot of (external) code that may still depend on them. For now it's
harmless to remove all invocations of unit2minor() and minor2unit().
Reviewed by: kib
- Instead of using a syscall slot (370) just to get a function prototype
for lkmressys(), add an explicit function prototype to <sys/sysent.h>.
This also removes unused special case checks for 'lkmressys' from
makesyscalls.sh.
- Instead of having magic logic in makesyscalls.sh to only generate a
function prototype the first time 'lkmnosys' is seen, make 'NODEF'
always not generate a function prototype and include an explicit
prototype for 'lkmnosys' in <sys/sysent.h>.
- As a result of the fix in (2), update the LKM syscall entries in
the freebsd32 syscall table to use 'lkmnosys' rather than 'nosys'.
- Use NOPROTO for the __syscall() entry (198) in the native ABI. This
avoids the need for magic logic in makesyscalls.h to only generate
a function prototype the first time 'nosys' is encountered.
variable wait routines. DROP_GIANT() already manages that state in the
Giant interlock case.
- Assert that Giant is held when it is passed as a sleep interlock.
unmounts. When we upgrade a vnode lock from shared to exclusive during
a name cache lookup, fail the lookup with EBADF if the vnode is invalidated
while we are waiting for the exclusive lock.
Also, for correctness (though I'm not sure it can occur in practice),
downgrade an exclusively locked vnode if it should be share locked.
Tested by: pho
Yesterday I got two reports of potential crashes, related to TTY
deallocation during device closure. When a thread is in TF_OPENCLOSE,
draining its output upon closure, we should not allow calls to
tty_rel_free() to happen at the same time. This could cause the TTY to
be torn down twice.
PR: kern/127561
Reported by: KOIE Hidetaka <koie suri co jp>
Discussed with: thompsa
to the C99 style. At least, it is easier to read sysent definitions
that way, and search for the actual instances of sigcode etc.
Explicitely initialize sysentvec.sv_maxssiz that was missed in most
sysvecs.
No objection from: jhb
MFC after: 1 month
It turns out our old TTY layer (and other implementations) block when
you read() on a PTY master device of which the slave device node has not
been opened yet. Our new implementation just returned 0. This caused
applications like telnetd to die in a very subtle way (when child
processes would open the TTY later than the first call to select()).
Introduce a new flag called PTS_FINISHED, which indicates whether we
should block or bail out of a read() or write() occurs.
Reported by: Claude Buisson <clbuisson orange fr>
One of the features that prevented us from fixing some of the TTY
consumers to work once again, was an interface that allowed consumers to
do the following:
- `Sniff' incoming data, which is used by the snp(4) driver.
- Take direct control of the input and output paths of a TTY, which is
used by ng_tty(4), ppp(4), sl(4), etc.
There's no practical advantage in committing a hooks layer without
having any consumers. In P4 there is a preliminary port of snp(4) and
thompsa@ is busy porting ng_tty(4) to this interface. I already want to
have it in the tree, because this may stimulate others to work on the
remaining modules.
Discussed with: thompsa
Obtained from: //depot/projects/mpsafetty/...
According to style(9), function argument names should only be omitted
for prototypes that are exported to userspace. This means we should
document the function arguments in the TTY header files, because they
are only used in userspace.
While there, change the type of the buffer argument of
ttydisc_rint_bypass() to `const void *' instead of `char *'.
Requested by: attilio
Obtained from: //depot/projects/mpsafetty/...
Because pseudo-terminal master file descriptors no longer have a vnode
underneath, we have to fill in fstat() values ourselves. Make our
implementation somewhat sane by returning the timestamps of the TTY
device node that corresponds with our file descriptor.
Obtained from: //depot/projects/mpsafettty/...
the code to prevent useless waste of space.
- Remove support for quote bits. There is not a single driver that needs
these bits anymore. This means putc() now accepts a char instead of an
int.
- Remove the unneeded catq() and nextc() routines. They were only used
by the old TTY layer.
- Convert the clist code to use ANSI C prototypes.
initialize the vattr structure in VOP_GETATTR() with VATTR_NULL(),
vattr_null() or by zeroing it. Remove these to allow preinitialization
of fields work in vn_stat(). This is needed to get birthtime initialized
correctly.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Discussed on: freebsd-fs
MFC after: 1 month
NODEV is more appropriate when va_rdev doesn't have a meaningful value.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Suggested by: bde
Discussed on: freebsd-fs
MFC after: 1 month
VOP_GETATTR() call in vn_stat(). Thus if a file system doesn't
initialize those fields in VOP_GETATTR() they will have a sane default
value.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Discussed on: freebsd-fs
MFC after: 1 month
initialize va_vaflags and va_spare because they are not part of the
VOP_GETATTR() API. Also don't initialize birthtime to ctime or zero.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Reviewed by: bde
Discussed on: freebsd-fs
MFC after: 1 month
returning uninitialized birthtime. Most file systems don't initialize
birthtime properly in their VOP_GETTATTR().
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Reviewed by: bde
Discussed on: freebsd-fs
MFC after: 1 month
years by the priv_check(9) interface and just very few places are left.
Note that compatibility stub with older FreeBSD version
(all above the 8 limit though) are left in order to reduce diffs against
old versions. It is responsibility of the maintainers for any module, if
they think it is the case, to axe out such cases.
This patch breaks KPI so __FreeBSD_version will be bumped into a later
commit.
This patch needs to be credited 50-50 with rwatson@ as he found time to
explain me how the priv_check() works in detail and to review patches.
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Reviewed by: rwatson
Unlike tty_rel_gone() and tty_rel_sess(), the tty_rel_pgrp() routine
does not unlock the TTY. I once had the idea to make the code call
tty_rel_pgrp() and tty_rel_sess(), picking up the TTY lock once. This
turned out a little harder than I expected, so this is how it works now.
It's a lot easier if we just let tty_rel_pgrp() unlock the TTY, because
the other routines do this anyway.
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.
Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.
Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.
Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.
Reviewed by: tegge
Tested and used by: pho
MFC after: 1 month
the command set (only so long as the module is present):
o add db_command_register and db_command_unregister to add and remove
commands, respectively
o replace linker sets with SYSINIT's (and SYSUINIT's) that register
commands
o expose 3 list heads: db_cmd_table, db_show_table, and db_show_all_table
for registering top-level commands, show operands, and show all operands,
respectively
While here also:
o sort command lists
o add DB_ALIAS, DB_SHOW_ALIAS, and DB_SHOW_ALL_ALIAS to add aliases
for existing commands
o add "show all trace" as an alias for "show alltrace"
o add "show all locks" as an alias for "show alllocks"
Submitted by: Guillaume Ballet <gballet@gmail.com> (original version)
Reviewed by: jhb
MFC after: 1 month
all the non-filter handlers attached to an interrupt event. This can be
used by device drivers which multiplex their interrupt onto the interrupt
handlers for child devices.
the free list and in this way avoid contention on the w_mtx.
In order to make the code simple, we rely on the rule that when the head
has not a child it also doesn't have other subsequent entries.
Actually this assertion is broken because we can free all the head
children and quit witness_unlock() with the head still allocated, with no
children and subsequent entries present.
Fix this by shifting the head if other entries are present and still
freeing the object, but leaving always an head.
- Fix witness_thread_has_locks() in order to report, correctly, if the
lock list linked to a specific thread has children or not based on the
above explained rule.
- Fix a printout into DDB's "show alllocks" command in order to show,
correctly, the process name that is really what we want.
- Fix style(9) for a comment.
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Reported by: Marko Kiiskila <marko dot kiiskila at nokia dot com>
Sponsored by: Nokia
ttydevsw_outwakeup(). This should fix panics which occur after remote
login sessions timeout during moderate TTY activity. An example of
where this might occur is where a pending write to the terminal is
occurring while sshd(8) is shutting down the TTY after a TCP timeout.
Submitted by: ed
of spurious witness warnings since lockmgr grew witness support. Before
this, every time you passed an interlock to a lockmgr lock WITNESS treated
it as a LOR.
Reviewed by: attilio
- Set UMA_ZONE_NOFREE so that the per-turnstile spin locks are type stable
to avoid a race where one thread might dereference a lock in a free'd
turnstile that was previously used by another thread.
Theorized by: tegge (2)
MFC after: 1 week
it had been assigned to the last sleeping thread. That thread might have
started running on another CPU and have reused that sleep queue. Fix it
by just walking the thread queue using TAILQ_FOREACH_SAFE() rather than
a while loop.
PR: amd64/124200
Discovered by: tegge
Tested by: benjsc
MFC after: 1 week
somehow.
As a consequence we may now get an unexpected result(*).
Catch that error cases with a well defined panic giving appropriate
pointers to ease debugging.
(*) While the concensus was that the case should never happen unless
there was a bug, noone was definitively sure.
Discussed with: kmacy (about 8 months back)
Reviewed by: silby (as part of a larger patch in March)
MFC after: 2 months
As discussed with Robert on IRC, checking the permissions on
/dev/console to see if we can call TIOCCONS could be unreliable. When we
run a chroot() without a devfs instance mounted inside, it won't
actually check the permissions on the device node inside the devfs
instance.
Using the already existing PRIV_TTY_CONSOLE for this seems like a better
idea.
Approved by: rwatson
As reported by several users on the mailing lists, applications like
screen(1) fail to properly handle ^S and ^Q characters. This was because
MPSAFE TTY didn't implement packet mode (TIOCPKT) yet. Add basic packet
mode support to make these applications work again.
Obtained from: //depot/projects/mpsafetty/...
When I migrated tty_compat.c to MPSAFE TTY, I just hooked it up to the
build and fixed it until it compiled and somewhat worked. It turns out
this was not the smartest thing, because the old TTY layer also had a
field called t_flags, which contained a set of sgtty flags.
This means our current COMPAT_43TTY code overwrites the TTY flags,
causing all strange problems to occur. Fix this code to use a new struct
member called t_compatflags. This commit may cause kern/127054 to be
fixed, but this still has to be tested/confirmed by the originator. It
has to be fixed anyway.
PR: kern/127054
The ttydisc_getc() routine obtains a read length from ttyoutq_read().
For no valid reason, the current code stores this value in an int, and
returns a size_t. There is no need to perform this useless conversion.
Obtained from: //depot/projects/mpsafetty/...
lock tracking and checks, doing just the former ones.
- Fix a bug where sysctl utility was printing crazy values when setting a
new value for debug.witness.watch [0]
[0] Reported by: yongari
- In the current design, when a TTY decreases its baud rate, it tries to
shrink the queues. This may not always be possible, because it will
not free any blocks that are still filled with data.
Change the TTY queues to store a `quota' value as well, which means it
will not free any blocks when changing the baud rate, but when placing
blocks back into the queue. When the amount of blocks exceeds the
quota, they get freed.
It also fixes some edge cases, where TIOCSETA during read()/
write()-calls could actually make the queue a tiny bit bigger than in
normal cases.
- Don't leak blocks of memory when calling TIOCSETA when the device
driver abandons the TTY while allocating memory.
- Create ttyoutq_init() and ttyinq_init() to initialize the queues,
instead of initializing them by hand. The new TTY snoop driver also
creates an outq, so it's good to have a proper interface to do this.
Obtained from: //depot/projects/mpsafetty/...
1 means that witness is up and running.
0 means that witness is disabled but that it can be established later
again in effective way.
-1 means that witness is disabled permanently
- Fix a bug causing kernel to panic on witness disabling through
witness_watch. lock lists queues were still full of entries and this was
causing throubles with debugging stubs (like witness_thread_exit()).
Reported by: kris, yongari
Sponsored by: Nokia
- Implement IMAXBEL. It turned out the IMAXBEL termios switch was marked
as supported, while it had not been implemented.
- Don't go into the high watermark when in canonical mode, no data has
been canonicalized and the input buffer is full. This caused the
terminal to lock up. This prevented users from pressing
backspace/^U/etc in such cases.
This could easily be simulated by pasting a very big amount of data in
a shell with sh(1) in canonical mode.
Obtained from: //depot/projects/mpsafetty/...
A couple of months ago I was quite impressed, because when I was writing
code, I discovered that uiomove() would not allow any locks to be held,
while ureadc() did, mainly because ureadc() is implemented using the
same building blocks as uiomove().
Let's see if this triggers any aditional witness warnings on our source
tree.
Reviewed by: atillio
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.
Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.
Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.
Use the VV_FORCEINSMQ for the creation of the syncvnode.
Tested by: pho
Reviewed by: tegge
MFC after: 1 month
For some reason a return-statement crept into this code, where it
shouldn't belong. This means we didn't properly unlock the TTY before
returning to userspace.
Submitted by: Tor Egge <tor egge cvsup no freebsd org>
doing it on every CPU.
- Use CPU_ABSENT() rather than pcpu_find() to determine if a CPU is not
present.
- Count up to mp_maxid rather than MAXCPU when iterating over CPUs to
match the rest of the code in the kernel.
MFC after: 1 week
is returned shall be kept in the waitable state.
Add WSTOPPED as an alias for WUNTRACED.
Submitted by: Jukka Ukkonen <jau at iki fi>
PR: standards/116221
MFC after: 2 weeks
executed by fexecve(2), imgp->args->fname is NULL. Moreover, there is
no way to recover the path to the script being executed.
Do what some other U*ixes do unconditionally, namely supply /dev/fd/n
as the script path when called from fexecve(). Document requirement of
having fdescfs mounted as caveat.
allocated for posix_openpt(2). Unfortunately, that identifier
conflicts with other events already allocated to other systems in
OpenBSM. Assign a new globally unique identifier and conform
better to the AUE_ event naming scheme.
This is a stopgap until a new OpenBSM import is done with the
correct identifier, so we'll maintain this as a local diff in svn
until then.
Discussed with: ed
Obtained from: TrustedBSD Project
have NULL mount-points. This is the case for special vnodes, such as the
one used in nameiinit() which is used for crossing mount points in lookup()
to avoid lock ordering issues.
MFC after: 2 weeks
Discussed with: rwatson, kib
The pty(4) driver raises up to warnings when an old BSD-style PTY is
created. The reason why I added this warning, was to make it easier to
spot applications that allocate BSD-style PTY's, while they should just
use openpty() or posix_openpt().
Add a sysctl, which allows you to override the number of remaining
messages, making it possible to suppress the warnings.
Requested by: kib
Reviewed by: kib
(1) Abstract interpreter vnode labeling in execve(2) and mac_execve(2)
so that the general exec code isn't aware of the details of
allocating, copying, and freeing labels, rather, simply passes in
a void pointer to start and stop functions that will be used by
the framework. This change will be MFC'd.
(2) Introduce a new flags field to the MAC_POLICY_SET(9) interface
allowing policies to declare which types of objects require label
allocation, initialization, and destruction, and define a set of
flags covering various supported object types (MPC_OBJECT_PROC,
MPC_OBJECT_VNODE, MPC_OBJECT_INPCB, ...). This change reduces the
overhead of compiling the MAC Framework into the kernel if policies
aren't loaded, or if policies require labels on only a small number
or even no object types. Each time a policy is loaded or unloaded,
we recalculate a mask of labeled object types across all policies
present in the system. Eliminate MAC_ALWAYS_LABEL_MBUF option as it
is no longer required.
MFC after: 1 week ((1) only)
Reviewed by: csjp
Obtained from: TrustedBSD Project
Sponsored by: Apple, Inc.
not in the namecache when shared lookups are enabled (vfs.lookup_shared=1,
it is currently off by default) and the filesystem supports shared lookups
(e.g. NFS client). Specifically, if multiple concurrent LOOKUPs both miss
in the name cache in parallel, each of the lookups may each end up adding an
entry to the namecache resulting in duplicate entries in the namecache
for the same pathname. A subsequent removal of the mapping of that
pathname to that vnode (via remove or rename) would only evict one of the
entries from the name cache. As a result, subseqent lookups for that
pathname would still return the old vnode.
This race was observed with shared lookups over NFS where a file was updated
by writing a new file out to a temporary file name and then renaming that
temporary file to the "real" file to effect atomic updates of a file. Other
processes on the same client that were periodically reading the file would
occasionally receive an ESTALE error from open(2) because the VOP_GETATTR()
in nfs_open() would receive that error when given the stale vnode.
The fix here is to check for duplicates in cache_enter() and just return
if an entry for this same directory and leaf file name for this vnode is
already in the cache. The check for duplicates is done by walking the
per-vnode list of name cache entries. It is expected that this list should
be very small in the common case (usually 0 or 1 entries during a
cache_enter() since most files only have 1 "leaf" name).
Reviewed by: ups, scottl
MFC after: 2 months
When my earlier MPSAFE TTY prototypes still implemented line
disciplines, we needed a mechanism to abort read()'s on PTY master
devices when inside the line discipline. Because this is no longer the
case, these checks have become unneeded.
set the MNT_FORCE flag, but do not persist "force"
in the options list, since it is a command, not a persistent property
of a mount.
Similarly, when we see "reload", set MNT_RELOAD,
but delete "reload" from the options list.
MFC after: 1 week
- According to POSIX, tcsetattr() must not fail when any of the bits in
the structure are unsupported, but it must leave the unsupported flags
alone.
- The CIGNORE flag (set by TCSASOFT, extension) was not cleared from
c_cflag, which means using it would cause it to be applied during its
entire lifespan. Eventually make sure we clear the flag.
I don't really like CIGNORE, but I think we must keep it alive right
now. With our new TTY layer, we don't actually need this mechanism,
because if you leave c_cflag, c_ispeed and c_ospeed alone, we won't make
a call into the device driver anyway.
Reported by: naddy
Tested by: naddy
thread_unsuspend_one() needs to optionally wakeup the swapper. Since we
hold the thread lock for that entire function, however, we have to push
that requirement up into the caller.
Found by: rwatson
Unlike pre-MPSAFE TTY, the pts(4) driver always returned ENXIO when a
read() or write() was performed on a pseudo-terminal master device when
the slave device was not opened. The old implementation had different
semantics:
- When the slave device had not been opened yet, read() and write() just
blocked.
- When the slave device had been closed, a read() call would return 0
bytes length.
- When the slave device had been closed, a write() call would return
EIO.
Change the new implementation to return 0 and EIO as well. We don't
implement the first rule, but I suspect this is not needed, because
routines like openpty() also open the slave device node. posix_openpt()
users also do similar things.
Reported by: rink
Tested by: rink
It turned out we transmitted VSTART after each successful read on a TTY
when software flow control was turned on. This was because of a very
evil bug where we tested the TF_HIWAT_IN flag the other way around.
Reported by: Christian Weisgerber <naddy mips inka de>
During the import of the MPSAFE TTY layer (r181905), I changed
acct_process() to lock proctree_lock instead of SESS_LOCK, because
s_ttyp is now locked using proctree_lock. One of the things I forgot,
was to lock it before we PROC_LOCK.
Commit this patch, written by kib@. To ensure we hold proctree_lock as
short as possible, obtaining `ac_tty' has now been made the first step
of filling `acct'.
Reported by: Kevin <kevinxlinuz 163 com>
Solved by: kib
We used to have a single wait channel inside the kernel which could be
used by threads that just wanted to sleep for some time (the next
second). The old TTY layer was the only piece of code that still used
lbolt, because I already removed the use of lbolt from the NFS clients
and the VFS syncer.
Approved by: philip
The previous commit also included changes to all the system call lists,
but it is a tradition to update these lists in a second commit, so rerun
make sysent to update the $FreeBSD$ tags inside these files to refer to
the latest version of syscalls.master.
Requested by: rwatson
The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:
- Improved driver model:
The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.
If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.
- Improved hotplugging:
With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).
The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.
- Improved performance:
One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.
Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.
Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
- Speedup the lock orderings lookup modifying the witness graph from a
linked tree to a matrix. A table lookup caches the lock orderings in
order to make a O(1) access for them. Any witness object has an unique
index withing this lookup cache table.
- Reduce the lock contention on w_mtx acquiring it only when the LOR
actually happens and not in a sane case. In order to do this don't totally
flush lock lists (per-CPU spinlocks list and per-thread sleeplocks list)
but check for ll_count anytime we need to have to verify allocations sanity.
- Introduce the function witness_thread_exit() in the witness namespace which
should verify a thread doesn't hold any witness occurrence why exiting.
- Rename the sysctl debug.witness.graphs into debug.witness.fullgraph and
add debug.witness.badstacks which prints out stacks for LOR revealed.
This is implemented using the stack(9) support, which makes WITNESS to be
dependent by the STACK option or by the DDB (including STACK) option.
- Fix style(9) for src/sys/kern/subr_witness.c
The hash table approach has been developed by Ilya Maykov on the behalf of
Isilon Systems which kindly released the patch.
Jeff Roberson, ported the patch to -CURRENT and fixed w_mtx contention, on the
behalf of Nokia.
Submitted by: Ilya Maykov <ivmaykov at gmail dot com> (Isilon Systems), jeff
Sponsored by: Nokia
the various copyouts associated with initializing the process's
argv/env data in userspace. It is possible that these copyout
operations can fault under memory pressure, possibly resulting
in dead locks. This is believed to be safe since none of the
copyout_strings() operations need to interact with the vnode here.
Submitted by: Zhouyi Zhou
PR: kern/111260
Discussed with: kib
MFC after: 3 weeks
There is no reason the fdopen() routine needs Giant. It only sets
curthread->td_dupfd, based on the device unit number of the cdev.
I guess we won't get massive performance improvements here, but still, I
assume we eventually want to get rid of Giant.
msleep/mtx_sleep or the various cv_*wait*() routines. Currently, the
"unlock" behavior of PDROP and cv_wait_unlock() with Giant is not
permitted as it is will be confusing since Giant is fully unrecursed and
unlocked during a thread sleep.
This is handy for subsystems which wish to allow unlocked drivers to
continue to use Giant such as CAM, the new TTY layer, and the new USB
stack. CAM currently uses a hack that I told Scott to use because I
really didn't want to permit this behavior, and the TTY and USB patches
both have various patches to permit this.
MFC after: 2 weeks
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup(). When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).
With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup. However, this required
grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.
Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked. The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up. The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().
Discussed with: jeff
Glanced at by: sam
Tested by: Jurgen Weber jurgen - ish com au
MFC after: 2 weeks
lstat(2) is called on symlinks -- this code appears never to have
worked. The PR this addresses suggests that the intended
original behavior is the right one, but as bde points out in the
PR comments, we do actually support storing a mode on symlinks,
so returning it seems reasonable.
This is consistent with Mac OS X, which despite documentation to
the contrary does return the mode set on a symlink, but not some
other platforms. The Single Unix Spec requires only that the
returned bits be "meaningful", which seems at best unhelpful as
advice goes.
PR: 25018
MFC after: 3 days
vnode lock may cause a LOR between kld_sx lock and vnode lock.
linker_load_dependencies() drops kld_sx, and another thread may attempt
to load the same kld.
Reported and tested by: pjd
MFC after: 1 week
processes are not producing absolute pathname tokens. It is required
that audited pathnames are generated relative to the global root mount
point. This modification changes our implementation of audit_canon_path(9)
and introduces a new function: vn_fullpath_global(9) which performs a
vnode -> pathname translation relative to the global mount point based
on the contents of the name cache. Much like vn_fullpath,
vn_fullpath_global is a wrapper function which called vn_fullpath1.
Further, the string parsing routines have been converted to use the
sbuf(9) framework. This change also removes the conditional acquisition
of Giant, since the vn_fullpath1 method will not dip into file system
dependent code.
The vnode locking was modified to use vhold()/vdrop() instead the vref()
and vrele(). This will modify the hold count instead of modifying the
user count. This makes more sense since it's the kernel that requires
the reference to the vnode. This also makes sure that the vnode does not
get recycled we hold the reference to it. [1]
Discussed with: rwatson
Reviewed by: kib [1]
MFC after: 2 weeks
It seems we only use `lbolt' inside the VFS syncer and the TTY layer
now. Because I'm planning to replace the TTY layer next month, there's
no reason to keep `lbolt' if it's only used in a single thread inside
the kernel.
Because the syncer code wanted to wake up the syncer thread before the
timeout, it called sleepq_remove(). Because we now just use a condvar(9)
with a timeout value of `hz', we can wake it up using cv_broadcast()
without waking up any unrelated threads.
Reviewed by: phk
After the import of the new TTY layer, the TTY_QUOTE definition will not
be present anymore. To make sure clists will still work as expected,
introduce an internal definition called QUOTEMASK.
Maybe we can decide to remove the quote bits entirely, but we still have
to look into this. There may be drivers that still use the quote bits.
Obtained from: //depot/projects/mpsafetty
- When a cpuset is applied to a thread, walk the cpuset to see if it is a
"full" cpuset (includes all available CPUs). If not, set a new
TDS_AFFINITY flag to indicate that this thread can't run on all CPUs.
When inheriting a cpuset from another thread during thread creation, the
new thread also inherits this flag. It is in a new ts_flags field in
td_sched rather than using one of the TDF_SCHEDx flags because fork()
clears td_flags after invoking sched_fork().
- When placing a thread on a runqueue via sched_add(), if the thread is not
pinned or bound but has the TDS_AFFINITY flag set, then invoke a new
routine (sched_pickcpu()) to pick a CPU for the thread to run on next.
sched_pickcpu() walks the cpuset and picks the CPU with the shortest
per-CPU runqueue length. Note that the reason for the TDS_AFFINITY flag
is to avoid having to walk the cpuset and examine runq lengths in the
common case.
- To avoid walking the per-CPU runqueues in sched_pickcpu(), add an array
of counters to hold the length of the per-CPU runqueues and update them
when adding and removing threads to per-CPU runqueues.
MFC after: 2 weeks
- Check if panicstr isn't set, if it is ignore the lock. This helps to avoid
confusion, because lockmgr is a no-op when panicstr isn't NULL, so
asserting anything at this point doesn't make sense and can just race with
other panic.
Discussed with: kib
The ttyinfo() routine generates the fancy output when pressing ^T. Right
now it is stored in tty.c. In the MPSAFE TTY code it is already stored
in tty_info.c. To make integration of the MPSAFE TTY code a little
easier, take the same approach.
This makes the TTY code a little bit more readable, because having the
proc_*/thread_* routines in tty.c is very distractful.
Approved by: philip (mentor)
child process immediately after bulk bcopy() without dropping the
process lock.
Since process is not single-threaded when forking, dropping and
reacquiring the lock allows an other thread to change the process title
of the parent in between, and results in hold being done on the invalid
pointer. The problem manifested itself as the double free of the old
p_args.
Reported by: kris
Reviewed by: jhb
MFC after: 1 week
and there is no need to maintain it.
- Fix vn_get() in order to let it call vget(9) with a valid locking
request. vget(9) returns the vnode locked in order to prevent recycling,
but in this case internal XFS locks alredy prevent it from happening, so
it is safe to drop the vnode lock before to return by vn_get().
- Add a VNASSERT() in vget(9) in order to catch malformed locking requests.
Discussed with: kan, kib
Tested by: Lothar Braun <lothar at lobraun dot de>
interrupt-driven configuration handlers to complete, print out a
diagnostic message every 60 second indicating which handlers are
still running. Do this at most 5 times per run so as to avoid
scrolling out any useful information from the kernel message
buffer.
The interval of 60 seconds was selected based on a best guess as
to the nature of "long enough" and may want to be tuned higher
or lower depending on real-world tolerances.
MFC after: 3 days
Discussed with: scottl
for completion in run_interrupt_driven_config_hooks(). This is
helpful when trying to figure out which device drivers have gone
into la-la land during boot-time autoconfiguration.
MFC after: 3 days
- When a tick occurs on a cpu, iterate from cs_softticks until ticks.
The per-cpu tick processing happens asynchronously with the actual
adjustment of the 'ticks' variable. Sometimes the results may
be visible before the local call and sometimes after. Previously this
could cause a one tick window where we didn't evaluate the bucket.
- In softclock fetch curticks before incrementing cc_softticks so we
don't skip insertions which were made for the current time.
Sponsored by: Nokia
sched_tick() to prevent multiple increments for one tick. This pushes
the value out of range and breaks priority calculation.
Reviewed by: kib
Found by: pho/nokia
Sponsored by: Nokia
MFC after: 3 days
set MNT_UPDATE in fsflags, and delete the
"update" option from the global mount options.
MNT_UPDATE is a command, and not a property of a mount
that should persist after the command is executed.
We need to do similar things for MNT_FORCE and MNT_RELOAD.
All mount flags are prefixed by MNT_..... it would
be nice if flags which were commands were named differently
from flags which are persistent properties of a mount.
This was not such a big deal in the pre-nmount() days,
but with nmount() it is more important.
Requested by: yar
MFC after: 2 weeks
SI_ALIAS flag and initialization of the si_parent when alias is created.
Assert that supplied parent device is not NULL.
Both situations could cause NULL dereference in the
devfs_populate_loop() when creating a symlink for SI_ALIAS'ed device.
Namely, cdp->cdp_c.si_parent may be NULL.
Reported by: mav
MFC after: 2 weeks
the syscall code and acquires various event subsystem locks as needed.
The handling of the NOTE_TRACK for EVFILT_PROC is currently done by
calling the kqueue_register() from filt_proc() filter, causing recursive
entrance of the kqueue code. This results in the LORs and recursive
acquisition of the locks.
Implement the variant of the knote() function designed to only handle
the fork() event. It mostly copies the knote() body, but also handles
the NOTE_TRACK, removing the handling from the filt_proc(), where it
causes problems described above. The function is called from the fork1()
instead of knote().
When encountering NOTE_TRACK knote, it marks the knote as influx
and drops the knlist and kqueue lock. In this context call to
kqueue_register is safe from the problems.
An error from the kqueue_register() is reported to the observer as
NOTE_TRACKERR fflag.
PR: 108201
Reviewed by: jhb, Pramod Srinivasan <pramod juniper net> (previous version)
Discussed with: jmg
Tested by: pho
MFC after: 2 weeks
KQ_FLUX_WAKEUP(). Since the later macro clears the KQ_FLUXWAIT, the
kqueue_scan() thread may be not woken up.
Move the setting of KQ_FLUXWAIT after wakeup to correct the issue.
Reported and tested by: pho
MFC after: 3 days
to global hostname and domainname variables. Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout(). A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.
Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.
MFC after: 3 weeks
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred). Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.
Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired. This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acqusition of Giant and any calling
context.
It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.
Reviewed by: bz
MFC after: 1 month
X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
but the rest can go back.
soun->sun_path isn't a null-terminated string. As UNIX(4) states, "the
terminating NUL is not part of the address." Since strlcpy has to return
"the total length of the string [it] tried to create," it walks off the end
of soun->sun_path looking for a \0.
This reverts r105332.
Reported by: Ryan Stone
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.
This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.
Tested by: kris, ps
already commited but with a wrong msleep variant and then
backed out. Note that this changes the semantic a little
as msleep_spin does not let us to specify priority after
wakeup.
Approved by: wkoszek, cognet
Approved by: kib (mentor)
semaphores. Specifically, semaphores are now represented as new file
descriptor type that is set to close on exec. This removes the need for
all of the manual process reference counting (and fork, exec, and exit
event handlers) as the normal file descriptor operations handle all of
that for us nicely. It is also suggested as one possible implementation
in the spec and at least one other OS (OS X) uses this approach.
Some bugs that were fixed as a result include:
- References to a named semaphore whose name is removed still work after
the sem_unlink() operation. Prior to this patch, if a semaphore's name
was removed, valid handles from sem_open() would get EINVAL errors from
sem_getvalue(), sem_post(), etc. This fixes that.
- Unnamed semaphores created with sem_init() were not cleaned up when a
process exited or exec'd. They were only cleaned up if the process
did an explicit sem_destroy(). This could result in a leak of semaphore
objects that could never be cleaned up.
- On the other hand, if another process guessed the id (kernel pointer to
'struct ksem' of an unnamed semaphore (created via sem_init)) and had
write access to the semaphore based on UID/GID checks, then that other
process could manipulate the semaphore via sem_destroy(), sem_post(),
sem_wait(), etc.
- As part of the permission check (UID/GID), the umask of the proces
creating the semaphore was not honored. Thus if your umask denied group
read/write access but the explicit mode in the sem_init() call allowed
it, the semaphore would be readable/writable by other users in the
same group, for example. This includes access via the previous bug.
- If the module refused to unload because there were active semaphores,
then it might have deregistered one or more of the semaphore system
calls before it noticed that there was a problem. I'm not sure if
this actually happened as the order that modules are discovered by the
kernel linker depends on how the actual .ko file is linked. One can
make the order deterministic by using a single module with a mod_event
handler that explicitly registers syscalls (and deregisters during
unload after any checks). This also fixes a race where even if the
sem_module unloaded first it would have destroyed locks that the
syscalls might be trying to access if they are still executing when
they are unloaded.
XXX: By the way, deregistering system calls doesn't do any blocking
to drain any threads from the calls.
- Some minor fixes to errno values on error. For example, sem_init()
isn't documented to return ENFILE or EMFILE if we run out of semaphores
the way that sem_open() can. Instead, it should return ENOSPC in that
case.
Other changes:
- Kernel semaphores now use a hash table to manage the namespace of
named semaphores nearly in a similar fashion to the POSIX shared memory
object file descriptors. Kernel semaphores can now also have names
longer than 14 chars (up to MAXPATHLEN) and can include subdirectories
in their pathname.
- The UID/GID permission checks for access to a named semaphore are now
done via vaccess() rather than a home-rolled set of checks.
- Now that kernel semaphores have an associated file object, the various
MAC checks for POSIX semaphores accept both a file credential and an
active credential. There is also a new posixsem_check_stat() since it
is possible to fstat() a semaphore file descriptor.
- A small set of regression tests (using the ksem API directly) is present
in src/tools/regression/posixsem.
Reported by: kris (1)
Tested by: kris
Reviewed by: rwatson (lightly)
MFC after: 1 month
provides the correct semantics for flock(2) style locks which are used by the
lockf(1) command line tool and the pidfile(3) library. It also implements
recovery from server restarts and ensures that dirty cache blocks are written
to the server before obtaining locks (allowing multiple clients to use file
locking to safely share data).
Sponsored by: Isilon Systems
PR: 94256
MFC after: 2 weeks
so we cannot compile it with -fstack-protector[-all] flags (or
it will self-recurse); this is ensured in sys/conf/files. This
OTOH means that checking for defines __SSP__ and __SSP_ALL__ to
determine if we should be compiling the support is impossible
(which it was trying, resulting in an empty object file). Fix
this by always compiling the symbols in this files. It's good
because it allows us to always have SSP support, and then compile
with SSP selectively.
Repoted by: tinderbox
- It is opt-out for now so as to give it maximum testing, but it may be
turned opt-in for stable branches depending on the consensus. You
can turn it off with WITHOUT_SSP.
- WITHOUT_SSP was previously used to disable the build of GNU libssp.
It is harmless to steal the knob as SSP symbols have been provided
by libc for a long time, GNU libssp should not have been much used.
- SSP is disabled in a few corners such as system bootstrap programs
(sys/boot), process bootstrap code (rtld, csu) and SSP symbols themselves.
- It should be safe to use -fstack-protector-all to build world, however
libc will be automatically downgraded to -fstack-protector because it
breaks rtld otherwise.
- This option is unavailable on ia64.
Enable GCC stack protection (aka Propolice) for kernel:
- It is opt-out for now so as to give it maximum testing.
- Do not compile your kernel with -fstack-protector-all, it won't work.
Submitted by: Jeremie Le Hen <jeremie@le-hen.org>
locked and unlocked completely in userland. by locking and unlocking mutex
in userland, it reduces the total time a mutex is locked by a thread,
in some application code, a mutex only protects a small piece of code, the
code's execution time is less than a simple system call, if a lock contention
happens, however in current implemenation, the lock holder has to extend its
locking time and enter kernel to unlock it, the change avoids this disadvantage,
it first sets mutex to free state and then enters kernel and wake one waiter
up. This improves performance dramatically in some sysbench mutex tests.
Tested by: kris
Sounds great: jeff
FIFO, as required by SUSv3. No specific privilege check is performed
in this case, as FIFOs may be created by unprivileged processes
(subject to the normal file system name space restrictions that may be
in place).
Unlike the Apple implementation, we reject requests to create a FIFO
using mknod(2) if there is a non-zero dev argument to the system call,
which is permitted by the Open Group specification ("... undefined
..."). We might want to revise this if we find it causes
compatibility problems for applications in practice.
PR: kern/74242, kern/68459
Obtained from: Apple, Inc.
MFC after: 3 weeks
needed to promote cdev to cdev_priv, the si_priv pointer was followed.
Use member2struct() to calculate address of the wrapping cdev_priv.
Rename si_priv to __si_reserved.
Tested by: pho
Reviewed by: ed
MFC after: 2 weeks
Now that we got rid of the minor-to-unit conversion and the constraints
on device minor numbers, we can convert the functions that operate on
minor and unit numbers to simple macro's. The unit2minor() and
minor2unit() macro's are now no-ops.
The ZFS code als defined a macro named `minor'. Change the ZFS code to
use umajor() and uminor() here, as it is the correct approach to do
this. Also add $FreeBSD$ to keep SVN happy.
Approved by: philip (mentor), pjd
Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.
Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.
This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.
The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.
Approved by: philip (mentor)
allocated semaphores, so it's wrong to increase it conditionally,
in this case for every over-the-limit semaphore nsegs is decreased
without being previously increased.
PR: kern/123685
Approved by: cognet (mentor)
and nfs requests processing. Lockmgr lock provides the shared locking for
nfs requests, while exclusive mode is used for modifications. The writer
starvation is handled by lockmgr too.
Reported by: kris, pho, many
Based on the submission by: mohan
Tested by: pho
MFC after: 2 weeks
The Giant lock is acquired in two places in tty_tty.c. In both places,
it is unneeded.
There is no reason to specify D_NEEDGIANT on this device node. The
device node has only been designed to return ENXIO when opened. It
doesn't make any sense to lock/unlock Giant, just to return this error.
D_TTY is also unneeded. The unimplemented functions don't need to be
patched by devfs.
We don't need to lock Giant when we want to lookup the proper TTY vnode.
s_ttyvp is already protected by proctree_lock (see devfs_vnops.c).
Approved by: philip (mentor)
Even though we got rid of device major numbers some time ago, device
drivers still need to provide unique device minor numbers to make_dev().
These numbers are only used inside the kernel. They are not related to
device major and minor numbers which are visible in devfs. These are
actually based on the inode number of the device.
It would eventually be nice to remove minor numbers entirely, but we
don't want to be too agressive here.
Because the 8-15 bits of the device number field (si_drv0) are still
reserved for the major number, there is no 1:1 mapping of the device
minor and unit numbers. Because this is now unused, remove the
restrictions on these numbers.
The MAXMAJOR definition was actually used for two purposes. It was used
to convert both the userspace and kernelspace device numbers to their
major/minor pair, which is why it is now named UMINORMASK.
minor2unit() and unit2minor() have now become useless. Both minor() and
dev2unit() now serve the same purpose. We should eventually remove some
of them, at least turning them into macro's. If devfs would become
completely minor number unaware, we could consider using si_drv0 directly,
just like si_drv1 and si_drv2.
Approved by: philip (mentor)
Right now we perform some of the checks inside the fcntl()'s F_DUPFD
operation twice. We first validate the `fd' argument. When finished,
we validate the `arg' argument. These checks are also performed inside
do_dup().
The reason we need to do this, is because fcntl() should return different
errno's when the `arg' argument is out of bounds (EINVAL instead of
EBADF). To prevent the redundant locking of the PROC_LOCK and
FILEDESC_SLOCK, patch do_dup() to support the error semantics required
by fcntl().
Approved by: philip (mentor)
Because clists are also used outside the TTY layer, rename the file
containing the clist routines to something more accurate.
The mpsafetty TTY layer doesn't use clists. It uses its own buffers,
which also implement the unbuffered copying to userspace. We cannot
simply remove the clist routines then, because this would break various
drivers that are present within the kernel.
Approved by: philip (mentor)
NET_NEEDS_GIANT. netatm has been disconnected from the build for ten
months in HEAD/RELENG_7. Specifics:
- netatm include files
- netatm command line management tools
- libatm
- ATM parts in rescue and sysinstall
- sample configuration files and documents
- kernel support as a module or in NOTES
- netgraph wrapper nodes for netatm
- ctags data for netatm.
- netatm-specific device drivers.
MFC after: 3 weeks
Reviewed by: bz
Discussed with: bms, bz, harti
refcount interface.
It also introduces the correct usage of memory barriers, as sometimes
fdrop() and fhold() are used with shared locks, which don't use any
release barrier.
ttyfree(), freeing the tty. Since destroy_dev() may call d_purge()
cdevsw method, that is the ttypurge() for the tty, the code ends up
accessing freed tty structure.
Put the ttyrel() after destroy_dev() in the ttyfree. To prevent the
panic the rev. 1.274 provided fix for, check the TS_GONE in sysctl
handler and refuse to provide information on such tty.
Reported, debugging help and tested by: pho
DIscussed with and reviewed by: jhb
MFC after: 1 week
in the giant_trick routines after the dev_refthread increments the
si_threadcount. Remove assert, do not perform dev_relthread() for failed
dev_refthread(), and handle failure in the tty_gettp() callers (cdevsw
tty methods).
Before kern_conf.c 1.210 and 1.211, the kernel usually paniced in the
giant_trick routines dereferencing NULL cdevsw, not taking this fault.
Reported by: Vince Hoffman <jhary unsane co uk>
Debugging help and tested by: pho
Reviewed by: jhb
MFC after: 1 week
For some reason, the <sys/tty.h> header file also contains routines of the
clists and console that are used inside the TTY layer. Because the clists
are not only used by the TTY layer (example: various input drivers), we'd
better move the entire clist programming interface into <sys/clist.h>. Also
remove a declaration of nonexistent variable.
The <sys/tty.h> header also contains various definitions for the console
code (tty_cons.c). Also move these to <sys/cons.h>, because they are
not implemented inside the TTY layer.
While there, create separate malloc pools for the clist and console code.
Approved by: philip (mentor)
PIPE_MTX().
Since the pipe_present is cleared before (potentially) sleeping, the
second thread may enter the pipeclose() for the reciprocal pipe end.
The test at the end of the pipeclose() for the pipe_present == 0 would
succeed, allowing the second thread to free the pipe memory. First
threads then accesses the freed memory after being woken up.
Properly track the closing state of the pipe in the pipe_present.
Introduce the intermediate state that marks the pipe as mostly
dismantled but might be sleeping waiting for the knote list to be
cleared. Free the pipe pair memory only when both ends pass that point.
Debugging help and tested by: pho
Discussed with: jmg
MFC after: 2 weeks
monitoring the pipe. The code sets pipe_present = 0 and enters
knlist_cleardel(), where the PIPE_MTX might be dropped when knl->kl_list
cannot be cleared due to influx knotes.
If the following often encountered code fragment
if (!(kn->kn_status & KN_DETACHED))
kn->kn_fop->f_detach(kn);
knote_drop(kn, td); [1]
is executed while the knlist lock is dropped, then the knote memory is freed
by the knote_drop() without knote being removed from the knlist, since
the filt_pipedetach() contains the following:
if (kn->kn_filter == EVFILT_WRITE) {
if (!cpipe->pipe_peer->pipe_present) {
PIPE_UNLOCK(cpipe);
return;
Now, the memory may be reused in the zone, causing the access to the
freed memory. I got the panics caused by the marker knote appearing on
the knlist, that, I believe, manifestation of the issue. In the Peter
Holm test scenarious, we got unkillable processes too.
The pipe_peer that has the knote for write shall be present. Ignore the
pipe_present value for EVFILT_WRITE in filt_pipedetach().
Debugging help and tested by: pho
Discussed with: jmg
MFC after: 2 weeks
the elf files. This is complicated by the fact that the actual CTF
parsing has to be done in CDDL'd code, so the BSD licensed code only
knows about the opaque data which it must be able to free.
argument, call mac_socket_check_connect() on that address before
proceeding with the send. Otherwise policies instrumenting the
connect entry point for the purposes of checking destination
addresses will not have the opportunity to check implicit
connect requests.
MFC after: 3 weeks
Sponsored by: nCircle Network Security, Inc.
The patch does not change the cdevsw KBI. Management of the data is
provided by the functions
int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr);
int devfs_get_cdevpriv(void **datap);
void devfs_clear_cdevpriv(void);
All of the functions are supposed to be called from the cdevsw method
contexts.
- devfs_set_cdevpriv assigns the priv as private data for the file
descriptor which is used to initiate currently performed driver
operation. dtr is the function that will be called when either the
last refernce to the file goes away, the device is destroyed or
devfs_clear_cdevpriv is called.
- devfs_get_cdevpriv is the obvious accessor.
- devfs_clear_cdevpriv allows to clear the private data for the still
open file.
Implementation keeps the driver-supplied pointers in the struct
cdev_privdata, that is referenced both from the struct file and struct
cdev, and cannot outlive any of the referee.
Man pages will be provided after the KPI stabilizes.
Reviewed by: jhb
Useful suggestions from: jeff, antoine
Debugging help and tested by: pho
MFC after: 1 month
data via ctor and dtor event handlers.
The size of the extra data is allocated opaquely and this file
contains a function which the dtrace module can call to check
that the kernel supports at least the amount of data that it needs.
This file is optionally compiled into nthe kernel if the KDTRACE_HOOKS
kernel option is defined.
This is BSD licensed code written specifically for FreeBSD.
It initialises using SYSINIT so that the SDT provider, probe and
argument description linkage is done whenever a module is loaded,
regardless of whether the DTrace modules are loaded or not.
This file is optionally compiled into the kernel if the KDTRACE_HOOKS
option is defined.
devsoftc.async_proc != NULL because the latter might not be true
sometimes.
This way /etc/rc.suspend gets executed.
Reviwed by: njl
Submitted by: Mitsuru IWASAKI <iwasaki at jp.FreeBSD.org>
Tested also by: Andreas Wetzel <mickey242 at gmx.net>
MFC after: 1 week
(all types) used per socket buffer.
Add support to netstat to print out all of the socket buffer
statistics.
Update the netstat manual page to describe the new -x flag
which gives the extended output.
Reviewed by: rwatson, julian
lock_object, using an unified field called lo_data.
- Replace lo_type usage with the w_name usage and at init time pass the
lock "type" directly to witness_init() from the parent lock init
function. Handle delayed initialization before than
witness_initialize() is called through the witness_pendhelp structure.
- Axe out LO_ENROLLPEND as it is not really needed. The case where the
mutex init delayed wants to be destroyed can't happen because
witness_destroy() checks for witness_cold and panic in case.
- In enroll(), if we cannot allocate a new object from the freelist,
notify that to userspace through a printf().
- Modify the depart function in order to return nothing as in the current
CVS version it always returns true and adjust callers accordingly.
- Fix the witness_addgraph() argument name prototype.
- Remove unuseful code from itismychild().
This commit leads to a shrinked struct lock_object and so smaller locks,
in particular on amd64 where 2 uintptr_t (16 bytes per-primitive) are
gained.
Reviewed by: jhb
used to request superpage alignment for the submap.
Request superpage alignment for the kmem_map.
Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)
Remove a stale comment from kmem_malloc().
hand, it may cause other threads to sleep since kqueue_scan() may mark
some knotes as infux. This could lead to the deadlock.
Before kqueue_scan() sleeps, wakeup the threads that are waiting for the
influx knotes produced by this thread.
Tested by: pho (previous version)
Reviewed by: jmg
MFC after: 2 weeks
closed is the legitimate situation. For instance, filedescriptor with
registered events may be closed in parallel with closing the kqueue.
Properly handle the case instead of asserting that this cannot happen.
Reported and tested by: pho
Reviewed by: jmg
MFC after: 2 weeks
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from anay to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
and its children in the form:
"parent","child"
so that head and bottom of an oriented graph can be easilly detected and
various form of diagrams can be build.
The sysctl is called debug.witness.graphs and it is read-only; in order
to get the list of relations, a simple:
#sysctl debug.witness.graphs
will do the trick.
This approach has been choosen in order to support easilly things like
the DOT format and such. Soon, an auto-explicative awk script, which
filters simple informations returned by the sysctl and converts them into
a real DOT script, will be committed to the repository between examples.
Discussed with: rwatson
method:
- If the last of the child cpufreq drivers returns an error while trying to
fetch its list of supported frequencies but an earlier driver found the
requested frequency, don't return an error to the caller.
- If all of the child cpufreq drivers fail and the attempt to match the
frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than
returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN.
MFC after: 3 days
PR: kern/121433
Reported by: Eugene Grosbein eugen ! kuzbass dot ru
ALT_BREAK_TO_DEBUGGER. In addition to "Enter ~ ctrl-B" (to enter the
debugger), there is now "Enter ~ ctrl-P" (force panic) and
"Enter ~ ctrl-R" (request clean reboot, ala ctrl-alt-del on syscons).
We've used variations of this at work. The force panic sequence is
best used with KDB_UNATTENDED for when you just want it to dump and
get on with it.
The reboot request is a safer way of getting into single user than
a power cycle. eg: you've hosed the ability to log in (pam, rtld, etc).
It gives init the reboot signal, which causes an orderly reboot.
I've taken my best guess at what the !x86 and non-sio code changes
should be.
This also makes sio release its spinlock before calling KDB/DDB.
mount fs needing Giant to be held when processing bufobjs.
Use a different subqueue for pending workitems on filesystems requiring
Giant. This simplifies the code notably and also reduces the number of
Giant acquisitions (and the whole processing cost).
Suggested by: jeff
Reviewed by: kib
Tested by: pho
to profile outoing packets for a number of mbuf chain
related parameters
e.g. number of mbufs, wasted space.
probably will do with further work later.
Reviewed by: various
while holding the socket buffer lock. These leads to an
immediate panic due to recursing the socket buffer lock. This
bug was introduced in uipc_syscalls.c:1.240, but masked by
another bug until that was fixed in uipc_syscalls.c:1.269.
Note that the current fix isn't perfect, but better than
panicking: normally we guarantee that simultaneous invocations
of a system call to write on a stream socket won't be
interlaced, which is ensured by use of the socket buffer sleep
lock. This is guaranteed for the sendfile headers, but not
trailers. In practice, this is likely not a problem, but
should be fixed.
MFC after: 3 days
Pointy hat to: andre (1.240), cperciva (1.269)
info about all currently mounted file systems. When an address is given
as an argument, prints detailed info about the given mount point.
MFC after: 2 weeks
from idle over the next tick.
- Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
suspended in cpu specific states. This function can fail and cause the
scheduler to fall back to another mechanism (ipi).
- Implement support for mwait in cpu_idle() on i386/amd64 machines that
support it. mwait is a higher performance way to synchronize cpus
as compared to hlt & ipis.
- Allow selecting the idle routine by name via sysctl machdep.idle. This
replaces machdep.cpu_idle_hlt. Only idle routines supported by the
current machine are permitted.
Sponsored by: Nokia
for better structure.
Much of this is related to <sys/clock.h>, which should really have
been called <sys/calendar.h>, but unless and until we need the name,
the repocopy can wait.
In general the kernel does not know about minutes, hours, days,
timezones, daylight savings time, leap-years and such. All that
is theoretically a matter for userland only.
Parts of kernel code does however care: badly designed filesystems
store timestamps in local time and RTC chips almost universally
track time in a YY-MM-DD HH:MM:SS format, and sometimes in local
timezone instead of UTC. For this we have <sys/clock.h>
<sys/time.h> on the other hand, deals with time_t, timeval, timespec
and so on. These know only seconds and fractions thereof.
Move inittodr() and resettodr() prototypes to <sys/time.h>.
Retain the names as it is one of the few surviving PDP/VAX references.
Move startrtclock() to <machine/clock.h> on relevant platforms, it
is a MD call between machdep.c/clock.c. Remove references to it
elsewhere.
Remove a lot of unnecessary <sys/clock.h> includes.
Move the machdep.disable_rtc_set sysctl to subr_rtc.c where it belongs.
XXX: should be kern.disable_rtc_set really, it's not MD.
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.
This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.
MFC after: 3 months
Tested by: kris (superset of committered patch)
two ticks by counting the number of switches and the load when
sched_clock() is called.
- If the busy metric exceeds a threshold allow the idle thread to spin
waiting for new work for a brief period to avoid using IPIs. This
reduces the cost on the sender and receiver as well as reducing wakeup
latency considerably when it works.
Sponsored by: Nokia
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.
Sponsored by: Nokia
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.
Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.
The implementation of the lf_purgelocks() is submitted by dfr.
Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks
o Implement IPI_PREEMPT,
o Set td_lock for the thread being switched out,
o For ULE & SMP, loop while td_lock points to blocked_lock for
the thread being switched in,
o Enable ULE by default in GENERIC and SKI,
public namespace for WITNESS as they are only used internally so just
move them in the private namespace for the subsystem (with all related
supporting definitions).
under bootverbose.
Struct ct is used for setting/reading real time clocks and I'm about
to Do Things to some of those, so a bit of preemptive debugging is
in order.
Remove a pointless __inline.
the only one difference is that lockmgr*() functions now accept
LK_NOWITNESS flag which skips ordering for the instanced calling.
- Remove an unuseful stub in witness_checkorder() (because the above check
doesn't allow ever happening) and allow witness_upgrade() to accept
non-try operation too.
lookup hard interrupt events by number. Ignore the irq# for soft intrs.
- Add support to cpuset for binding hardware interrupts. This has the
side effect of binding any ithread associated with the hard interrupt.
As per restrictions imposed by MD code we can only bind interrupts to
a single cpu presently. Interrupts can be 'unbound' by binding them
to all cpus.
Reviewed by: jhb
Sponsored by: Nokia
no longer needed, but for now we still want to be consistent with other
similar checks in the tree.
- Call ASSERT_VOP_ELOCKED() only when vget() returns 0.
Reviewed by: jeff
o create a private task queue thread that sets up root and current
directories (hooking mountroot event as needed); this is necessary
because task queue threads are parented from proc0 and it does not
have a reference to rootvnode (lost when / mounting moved to init)
o bounce image load + unload requests through the private task q so
we can load images even when the request is made from a thread that
does not have sufficient context (e.g. task q thread)
o add a check in the task q thread to fail requests before root is
mounted (just in case)
Reviewed by: jhb, mlaier, luigi (glance)
MFC after: 1 month