maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsokets), but those would be addressed in a separate commit
if necessary.
Requested by: rwhatson, jeff
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
MFC after: 1 month
Discussed with: imp, rink
Otherwise the parameter is no-op, since zone by default limits number
of descriptors to some 12K entries. Attempt to allocate more ends up
sleeping on zonelimit.
MFC after: 2 weeks
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionnaly equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).
PR: 120233
Submitted by: Jukka Ukkonen
Approved by: rwaston (mentor)
MFC after: 1 month
consists of the null-terminated name and the contents of any structure
you wish to record. A new ktrstruct() function constructs and emits a
KTR_STRUCT record. It is accompanied by convenience macros for struct
stat and struct sockaddr.
In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding funtions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.
Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.
Derived from patches by Andrew Li <andrew2.li@citi.com>.
PR: kern/117836
MFC after: 3 weeks
directory, and jail directory within procstat. While this functionality
is available already in fstat, encapsulating it in the kern.proc.filedesc
sysctl makes it accessible without using kvm and thus without needing
elevated permissions.
The new procstat output looks like:
PID COMM FD T V FLAGS REF OFFSET PRO NAME
76792 tcsh cwd v d -------- - - - /usr/src
76792 tcsh root v d -------- - - - /
76792 tcsh 15 v c rw------ 16 9130 - -
76792 tcsh 16 v c rw------ 16 9130 - -
76792 tcsh 17 v c rw------ 16 9130 - -
76792 tcsh 18 v c rw------ 16 9130 - -
76792 tcsh 19 v c rw------ 16 9130 - -
I am also bumping __FreeBSD_version for this as this new feature will be
used in at least one port.
Reviewed by: rwatson
Approved by: rwatson
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.
KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.
Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.
Submitted by: rwatson
happen if there are no files open. Accounting for these can
eventually return a negative value for olenp causing sysctl to
crash with a bad malloc.
Reported by: Pawel Worach <pawel.worach@gmail.com>
- Introduce a finit() which is used to initailize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.
Tested by: kris, pho
its -f and -v arguments:
kern.proc.filedesc - dump file descriptor information for a process, if
debugging is permitted, including socket addresses, open flags, file
offsets, file paths, etc.
kern.proc.vmmap - dump virtual memory mapping information for a process,
if debugging is permitted, including layout and information on
underlying objects, such as the type of object and path.
These provide a superset of the information historically available
through the now-deprecated procfs(4), and are intended to be exported
in an ABI-robust form.
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.
While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.
Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)
can acquire shared filedescriptor locks in the appropriate cases.
- Remove Giant from calls that issue ioctls. The ioctl path has been
mpsafe for some time now.
- Only acquire giant for VOP_ADVLOCK when the filesystem requires giant.
advlock is now mpsafe.
Reviewed by: rwatson
Approved by: re
a privilege is checked against the real uid rather than the effective
uid, instead decide which uid to use in priv_check_cred() based on the
privilege passed in. We use the real uid for PRIV_MAXFILES,
PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of
SUSER_RUID; there are now no flags defined for priv_check_cred().
Obtained from: TrustedBSD Project
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
as UF_OPENING. Disable closing of that entries. This should fix the crashes
caused by devfs_open() (and fifo_open()) dereferencing struct file * by
index, while the filedescriptor is closed by parallel thread.
Idea by: tegge
Reviewed by: tegge (previous version of patch)
Tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.
Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
sysctl and socket teardown by adding a reference count to the UNIX domain
pcb object and fixing the sysctl that enumerates unpcbs to grab a
reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
file descriptor teardown (fdrop()) by adding a new garbage collection
flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers
in a UNIX domain socket looking for nested file descriptor references
and clears the flag when it is finished. fdrop() checks to see if the
flag is set on a file descriptor whose refcount just dropped to 0 and
waits for unp_gc() to clear the flag before completely destroying the
file descriptor.
MFC after: 1 week
Reviewed by: rwatson
Submitted by: ups
Hopefully makes the panics go away: mx1
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
the file descriptor reference, rather than paying additional lock
operations to acquire a socket reference from the file descriptor.
This will also help to ensure that file descriptor based socket
requests are not delivered to a socket after close. Most consumers
have already been converted to this model.
MFC after: 3 months
"fdinit() fails to initialize newfdp->fd_fd.fd_lastfile to -1. This breaks
fdcopy() which will incorrectly set newfdp->fd_freefile to 1 if no files are
open and the last file descriptor marked as unused for fdp was 0. This later
causes descriptor 0 to be unavailable in newfdp when the optimization is
enabled.
When the last file descriptor previously marked as used is nonzero and marked
as unused, fdunused() incorrectly sets fdp->fd_lastfile to fd - 1 due to
fd_last_used() returning (size - 1). This hides the problem that breaks the
optimization."
This allows us to keep the optimization, while un-breaking it.
This is a RELENG_6 candidate.
PR: kern/87208
MFC after: 1 week
Submitted by: tegge
really breaking things. Simple "close(0); dup(fd)" does not return descriptor
"0" in some cases. Further, this change also breaks some MAC interactions with
mac_execve_will_transition(). Under certain circumstances, fdcheckstd() can
be called in execve(2) causing an assertion that checks to make sure that
stdin, stdout and stderr reside at indexes 0, 1 and 2 in the process fd table
to fail, resulting in a kernel panic when INVARIANTS is on.
This should also kill the "dup(2) regression on 6.x" show stopper item on the
6.1-RELEASE TODO list.
This is a RELENG_6 candidate.
PR: kern/87208
Silence from: des
MFC after: 1 week
state about each open file, and identify the first process in the process
table that references the file. This is helpful in debugging leaks of
file descriptors.
MFC after: 1 week
with the file descriptor. When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate. Check the thread for NULL to
avoid this scenario. Expand an existing comment to say a bit more about
this.
MFC after: 1 week
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
- if minfd < fd_freefile (as is most often the case, since minfd is
usually 0), set it to fd_freefile.
- remove a call to fd_first_free() which duplicates work already done
by fdused().
This change results in a small but measurable speedup for processes
with large numbers (several thousands) of open files.
PR: kern/85176
Submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
MFC after: 3 weeks
opening a device, devfs_open needs the file descriptor to install its
own fileops. Failing to pass the file descriptor causes the vnode to
be returned with the regular vnops, which will cause a panic on the
first read or write because devfs_specops is not meant to support
those operations.
This bug caused a panic after exec'ing any set[ug]id program with
fds 0..2 closed (i.e., if any action had to be taken by fdcheckstd, we
would panic if the exec'd program ever tried to use any of those
descriptors).
Reviewed by: phk
Approved by: re (scottl)