freebsd-nq

Author	SHA1	Message	Date
Pawel Jakub Dawidek	3812dcd3de	Allocate descriptor number in dupfdopen() itself instead of depending on the caller using finstall(). This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes a race between finstall() and dupfdopen(). MFC after: 1 month	2012-06-13 21:32:35 +00:00
Pawel Jakub Dawidek	6195bfebcc	There is only one caller of the dupfdopen() function, so we can simplify it a bit: - We can assert that only ENODEV and ENXIO errors are passed instead of handling other errors. - The caller always call finstall() for indx descriptor, so we can assume it is set. Actually the filedesc lock is dropped between finstall() and dupfdopen(), so there is a window there for another thread to close the indx descriptor, but it will be closed in next commit. Reviewed by: mjg MFC after: 1 month	2012-06-13 19:00:29 +00:00
Mateusz Guzik	2ca63f0a90	Remove 'low' argument from fd_last_used(). This function is static and the only caller always passes 0 as low. While here update note about return values in comment. Reviewed by: pjd Approved by: trasz (mentor) MFC after: 1 month	2012-06-13 17:18:16 +00:00
Mateusz Guzik	02efb9a8b1	Re-apply reverted parts of r236935 by pjd with some changes. If fdalloc() decides to grow fdtable it does it once and at most doubles the size. This still may be not enough for sufficiently large fd. Use fd in calculations of new size in order to fix this. When growing the table, fd is already equal to first free descriptor >= minfd, also fdgrowtable() no longer drops the filedesc lock. As a result of this there is no need to retry allocation nor lookup. Fix description of fd_first_free to note all return values. In co-operation with: pjd Approved by: trasz (mentor) MFC after: 1 month	2012-06-13 17:12:53 +00:00
Pawel Jakub Dawidek	faf0db351d	Revert part of the r236935 for now, until I figure out why it doesn't work properly. Reported by: davidxu	2012-06-12 10:25:11 +00:00
Pawel Jakub Dawidek	039dc89f0d	fdgrowtable() no longer drops the filedesc lock so it is enough to retry finding free file descriptor only once after fdgrowtable(). Spotted by: pluknet MFC after: 1 month	2012-06-11 22:05:26 +00:00
Pawel Jakub Dawidek	d3ec30e525	Use consistent way of checking if descriptor number is valid. MFC after: 1 month	2012-06-11 20:17:20 +00:00
Pawel Jakub Dawidek	fd45a47ba6	Be consistent with white spaces. MFC after: 1 month	2012-06-11 20:01:50 +00:00
Pawel Jakub Dawidek	19d9c0e11e	Remove code duplicated in kern_close() and do_dup() and use closefp() function introduced a minute ago. This code duplication was responsible for the bug fixed in r236853. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 20:00:44 +00:00
Pawel Jakub Dawidek	642db963ab	Introduce closefp() function that we will be able to use to eliminate code duplication in kern_close() and do_dup(). This is committed separately from the actual removal of the duplicated code, as the combined diff was very hard to read. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:57:31 +00:00
Pawel Jakub Dawidek	129c87eb7d	Merge two ifs into one to make the code almost identical to the code in kern_close(). Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:53:41 +00:00
Pawel Jakub Dawidek	d327cee241	Move the code around a bit to move two parts of code duplicated from kern_close() close together. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:51:27 +00:00
Pawel Jakub Dawidek	8b40793150	Now that fdgrowtable() doesn't drop the filedesc lock we don't need to check if descriptor changed from under us. Replace the check with an assert. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:48:55 +00:00
Pawel Jakub Dawidek	69d7614850	When we are closing capability during dup2(), we want to call mq_fdclose() on the underlying object and not on the capability itself. Discussed with: rwatson Sponsored by: FreeBSD Foundation MFC after: 1 month	2012-06-10 14:57:18 +00:00
Pawel Jakub Dawidek	1b693d7494	Merge two ifs into one. Other minor style fixes. MFC after: 1 month	2012-06-10 13:10:21 +00:00
Pawel Jakub Dawidek	8849ae7256	Simplify fdtofp(). MFC after: 1 month	2012-06-10 06:31:54 +00:00
Pawel Jakub Dawidek	e59a97362d	There is no need to drop the FILEDESC lock around malloc(M_WAITOK) anymore, as we now use sx lock for filedesc structure protection. Reviewed by: kib MFC after: 1 month	2012-06-09 18:50:32 +00:00
Pawel Jakub Dawidek	68abac4337	Remove now unused variable. MFC after: 1 month MFC with: r236820	2012-06-09 18:48:06 +00:00
Pawel Jakub Dawidek	380513aaae	Make some of the loops more readable. Reviewed by: tegge MFC after: 1 month	2012-06-09 18:03:23 +00:00
Pawel Jakub Dawidek	5d02ed91e9	Correct panic message. MFC after: 1 month MFC with: r236731	2012-06-09 12:27:30 +00:00
Pawel Jakub Dawidek	bf3e37ef15	In fdalloc() f_ofileflags for the newly allocated descriptor has to be 0. Assert that instead of setting it to 0. Sponsored by: FreeBSD Foundation MFC after: 1 month	2012-06-07 23:33:10 +00:00
Eitan Adler	847d0034e3	Return EBADF instead of EMFILE from dup2 when the second argument is outside the range of valid file descriptors PR: kern/164970 Submitted by: Peter Jeremy <peterjeremy@acm.org> Reviewed by: jilles Approved by: cperciva MFC after: 1 week	2012-04-11 14:08:09 +00:00
John Baldwin	e506e182dd	Export some more useful info about shared memory objects to userland via procstat(1) and fstat(1): - Change shm file descriptors to track the pathname they are associated with and add a shm_path() method to copy the path out to a caller-supplied buffer. - Use the fo_stat() method of shared memory objects and shm_path() to export the path, mode, and size of a shared memory object via struct kinfo_file. - Add a struct shmstat to the libprocstat(3) interface along with a procstat_get_shm_info() to export the mode and size of a shared memory object. - Change procstat to always print out the path for a given object if it is valid. - Teach fstat about shared memory objects and to display their path, mode, and size. MFC after: 2 weeks	2012-04-01 18:22:48 +00:00
Peter Holm	ffae9d4d7c	Free up allocated memory used by posix_fadvise(2).	2012-03-08 20:34:13 +00:00
David E. O'Brien	0e31b3c15f	Reformat comment to be more readable in standard Xterm. (while I'm here, wrap other long lines)	2011-11-15 01:48:53 +00:00
John Baldwin	dccc45e4c0	Move the cleanup of f_cdevpriv when the reference count of a devfs file descriptor drops to zero out of _fdrop() and into devfs_close_f() as it is only relevant for devfs file descriptors. Reviewed by: kib MFC after: 1 week	2011-11-04 03:39:31 +00:00
Robert Watson	b160c14194	Correct a bug in export of capability-related information from the sysctls supporting procstat -f: properly provide capability rights information to userspace. The bug resulted from a merge-o during upstreaming (or rather, a failure to properly merge FreeBSD-side changed downstream). Spotted by: des, kibab MFC after: 3 days	2011-10-12 12:08:03 +00:00
Kip Macy	8451d0dd78	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
Jonathan Anderson	cfb5f76865	Add experimental support for process descriptors A "process descriptor" file descriptor is used to manage processes without using the PID namespace. This is required for Capsicum's Capability Mode, where the PID namespace is unavailable. New system calls pdfork(2) and pdkill(2) offer the functional equivalents of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote process for debugging purposes. The currently-unimplemented pdwait(2) will, in the future, allow querying rusage/exit status. In the interim, poll(2) may be used to check (and wait for) process termination. When a process is referenced by a process descriptor, it does not issue SIGCHLD to the parent, making it suitable for use in libraries---a common scenario when using library compartmentalisation from within large applications (such as web browsers). Some observers may note a similarity to Mach task ports; process descriptors provide a subset of this behaviour, but in a UNIX style. This feature is enabled by "options PROCDESC", but as with several other Capsicum kernel features, is not enabled by default in GENERIC 9.0. Reviewed by: jhb, kib Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-18 22:51:30 +00:00
Konstantin Belousov	9c00bb9190	Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz)	2011-08-16 20:07:47 +00:00
Jonathan Anderson	69d377fe1b	Allow Capsicum capabilities to delegate constrained access to file system subtrees to sandboxed processes. - Use of absolute paths and '..' are limited in capability mode. - Use of absolute paths and '..' are limited when looking up relative to a capability. - When a name lookup is performed, identify what operation is to be performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP. With these constraints, openat() and friends are now safe in capability mode, and can then be used by code such as the capability-mode runtime linker. Approved by: re (bz), mentor (rwatson) Sponsored by: Google Inc	2011-08-13 09:21:16 +00:00
Robert Watson	a9d2f8d84f	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc	2011-08-11 12:30:23 +00:00
Jonathan Anderson	c30b9b5169	Export capability information via sysctls. When reporting on a capability, flag the fact that it is a capability, but also unwrap to report all of the usual information about the underlying file. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-07-20 09:53:35 +00:00
Jonathan Anderson	745bae379d	Add implementation for capabilities. Code to actually implement Capsicum capabilities, including fileops and kern_capwrap(), which creates a capability to wrap an existing file descriptor. We also modify kern_close() and closef() to handle capabilities. Finally, remove cap_filelist from struct capability, since we don't actually need it. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-15 09:37:14 +00:00
Jonathan Anderson	5604e481b1	Fix the "passability" test in fdcopy(). Rather than checking to see if a descriptor is a kqueue, check to see if its fileops flags include DFLAG_PASSABLE. At the moment, these two tests are equivalent, but this will change with the addition of capabilities that wrap kqueues but are themselves of type DTYPE_CAPABILITY. We already have the DFLAG_PASSABLE abstraction, so let's use it. This change has been tested with [the newly improved] tools/regression/kqueue. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-08 12:19:25 +00:00
Edward Tomasz Napierala	afcc55f318	All the racct_*() calls need to happen with the proc locked. Fixing this won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".	2011-07-06 20:06:44 +00:00
Jonathan Anderson	9acdfe6549	Rework _fget to accept capability parameters. This new version of _fget() requires new parameters: - cap_rights_t needrights the rights that we expect the capability's rights mask to include (e.g. CAP_READ if we are going to read from the file) - cap_rights_t haverights used to return the capability's rights mask (ignored if NULL) - u_char maxprotp the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted (only used if we are going to mmap the file; ignored if NULL) - int fget_flags FGET_GETCAP if we want to return the capability itself, rather than the underlying object which it wraps Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-05 13:45:10 +00:00
Jonathan Anderson	c0467b5e6e	When Capsicum starts creating capabilities to wrap existing file descriptors, we will want to allocate a new descriptor without installing it in the FD array. Split falloc() into falloc_noinstall() and finstall(), and rewrite falloc() to call them with appropriate atomicity. Approved by: mentor (rwatson), re (bz)	2011-06-30 15:22:49 +00:00
Stanislav Sedov	ff6f41a472	- Do no try to drop a NULL filedesc pointer.	2011-05-12 10:56:33 +00:00
Stanislav Sedov	0daf62d9f5	- Commit work from libprocstat project. These patches add support for runtime file and processes information retrieval from the running kernel via sysctl in the form of new library, libprocstat. The library also supports KVM backend for analyzing memory crash dumps. Both procstat(1) and fstat(1) utilities have been modified to take advantage of the library (as the bonus point the fstat(1) utility no longer need superuser privileges to operate), and the procstat(1) utility is now able to display information from memory dumps as well. The newly introduced fuser(1) utility also uses this library and able to operate via sysctl and kvm backends. The library is by no means complete (e.g. KVM backend is missing vnode name resolution routines, and there're no manpages for the library itself) so I plan to improve it further. I'm commiting it so it will get wider exposure and review. We won't be able to MFC this work as it relies on changes in HEAD, which was introduced some time ago, that break kernel ABI. OTOH we may be able to merge the library with KVM backend if we really need it there. Discussed with: rwatson	2011-05-12 10:11:39 +00:00
Edward Tomasz Napierala	722581d9e6	Add RACCT_NOFILE accounting. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)	2011-04-06 19:13:04 +00:00
Konstantin Belousov	1fe80828e7	After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) and remove the falloc() version that lacks flag argument. This is done to reduce the KPI bloat. Requested by: jhb X-MFC-note: do not	2011-04-01 13:28:34 +00:00
Konstantin Belousov	246d35ec91	Add O_CLOEXEC flag to open(2) and fhopen(2). The new function fallocf(9), that is renamed falloc(9) with added flag argument, is provided to facilitate the merge to stable branch. Reviewed by: jhb MFC after: 1 week	2011-03-25 14:00:36 +00:00
John Baldwin	8e6fa660f2	Fix some locking nits with the p_state field of struct proc: - Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL in fork to honor the locking requirements. While here, expand the scope of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously the code was locking the new child process (p2) after it had locked the parent process (p1). However, when locking two processes, the safe order is to lock the child first, then the parent. - Fix various places that were checking p_state against PRS_NEW without having the process locked to use PROC_LOCK(). Every place was already locking the process, just after the PRS_NEW check. - Remove or reduce the use of PROC_SLOCK() for places that were checking p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading the current state. - Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once. MFC after: 1 week	2011-03-24 18:40:11 +00:00
Bjoern A. Zeeb	1fb51a12f2	Mfp4 CH=177274,177280,177284-177285,177297,177324-177325 VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147. While this reduces the number of vnet recursion in some places like NFS, POSIX local sockets and some netgraph, .. recursions are impossible to fix. The current expectations are documented at the beginning of uipc_socket.c along with the other information there. Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks	2011-02-16 21:29:13 +00:00
Jilles Tjoelker	90750179ec	Do not trip a KASSERT if /dev/null cannot be opened for a setuid program. The fdcheckstd() function makes sure fds 0, 1 and 2 are open by opening /dev/null. If this fails (e.g. missing devfs or wrong permissions), fdcheckstd() will return failure and the process will exit as if it received SIGABRT. The KASSERT is only to check that kern_open() returns the expected fd, given that it succeeded. Tripping the KASSERT is most likely if fd 0 is open but fd 1 or 2 are not. MFC after: 2 weeks	2011-01-28 15:29:35 +00:00
Konstantin Belousov	23b70c1ae2	Finish r210923, 210926. Mark some devices as eternal. MFC after: 2 weeks	2011-01-04 10:59:38 +00:00
Bjoern A. Zeeb	2f826fdf53	Remove one zero from the double-0. This code doesn't have a license to kill. MFC after: 3 days	2010-04-23 14:32:58 +00:00
Konstantin Belousov	080136212f	On the return path from F_RDAHEAD and F_READAHEAD fcntls, do not unlock Giant twice. While there, bring conditions in the do/while loops closer to style, that also makes the lines fit into 80 columns. Reported and tested by: dougb	2009-11-20 22:22:53 +00:00
Xin LI	82aebf697c	Add two new fcntls to enable/disable read-ahead: - F_READAHEAD: specify the amount for sequential access. The amount is specified in bytes and is rounded up to nearest block size. - F_RDAHEAD: Darwin compatible version that use 128KB as the sequential access size. A third argument of zero disables the read-ahead behavior. Please note that the read-ahead amount is also constrainted by sysctl variable, vfs.read_max, which may need to be raised in order to better utilize this feature. Thanks Igor Sysoev for proposing the feature and submitting the original version, and kib@ for his valuable comments. Submitted by: Igor Sysoev <is rambler-co ru> Reviewed by: kib@ MFC after: 1 month	2009-09-28 16:59:47 +00:00
Robert Watson	14961ba789	Replace AUDIT_ARG() with variable argument macros with a set more more specific macros for each audit argument type. This makes it easier to follow call-graphs, especially for automated analysis tools (such as fxr). In MFC, we should leave the existing AUDIT_ARG() macros as they may be used by third-party kernel modules. Suggested by: brooks Approved by: re (kib) Obtained from: TrustedBSD Project MFC after: 1 week	2009-06-27 13:58:44 +00:00
Ulf Lilleengen	2dd43ecaec	- Similar to the previous commit, but for CURRENT: Fix a bug where a FIFO vnode use count was increased twice, but only decreased once.	2009-06-24 18:44:38 +00:00
Ulf Lilleengen	7083246d7c	- Fix a bug where a FIFO vnode use count was increased twice, but only decreased once. MFC after: 1 week	2009-06-24 18:38:51 +00:00
John Baldwin	c4f16b69e1	Add a new 'void closefrom(int lowfd)' system call. When called, it closes any open file descriptors >= 'lowfd'. It is largely identical to the same function on other operating systems such as Solaris, DFly, NetBSD, and OpenBSD. One difference from other *BSD is that this closefrom() does not fail with any errors. In practice, while the manpages for NetBSD and OpenBSD claim that they return EINTR, they ignore internal errors from close() and never return EINTR. DFly does return EINTR, but for the common use case (closing fd's prior to execve()), the caller really wants all fd's closed and returning EINTR just forces callers to call closefrom() in a loop until it stops failing. Note that this implementation of closefrom(2) does not make any effort to resolve userland races with open(2) in other threads. As such, it is not multithread safe. Submitted by: rwatson (initial version) Reviewed by: rwatson MFC after: 2 weeks	2009-06-15 20:38:55 +00:00
Jeff Roberson	f4471727f3	- Use an acquire barrier to increment f_count in fget_unlocked and remove the volatile cast. Describe the reason in detail in a comment. Discussed with: bde, jhb	2009-06-02 06:55:32 +00:00
Jamie Gritton	0304c73163	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)	2009-05-27 14:11:23 +00:00
John Baldwin	6ca33ea345	Set the umask in a new file descriptor table earlier in fdcopy() to remove two lock operations.	2009-05-20 18:42:04 +00:00
Konstantin Belousov	6b72d8db47	Revert r192094. The revision caused problems for sysctl(3) consumers that expect that oldlen is filled with required buffer length even when supplied buffer is too short and returned error is ENOMEM. Redo the fix for kern.proc.filedesc, by reverting the req->oldidx when remaining buffer space is too short for the current kinfo_file structure. Also, only ignore ENOMEM. We have to convert ENOMEM to no error condition to keep existing interface for the sysctl, though. Reported by: ed, Florian Smeets <flo kasimir com> Tested by: pho	2009-05-15 14:41:44 +00:00
Jeff Roberson	bf422e5f27	- Implement a lockless file descriptor lookup algorithm in fget_unlocked(). - Save old file descriptor tables created on expansion until the entire descriptor table is freed so that pointers may be followed without regard for expanders. - Mark the file zone as NOFREE so we may attempt to reference potentially freed files. - Convert several fget_locked() users to fget_unlocked(). This requires us to manage reference counts explicitly but reduces locking overhead in the common case.	2009-05-14 03:24:22 +00:00
John Baldwin	3f11530b79	Update comment above _fget() for earlier change to FWRITE failures return EBADF rather than EINVAL. Submitted by: Jaakko Heinonen jh saunalahti fi MFC after: 1 month	2009-04-15 19:10:37 +00:00
Joe Marcus Clarke	0618630015	Remove the printf's when the vnode to be exported for procstat is not a VDIR. If the file system backing a process' cwd is removed, and procstat -f PID is called, then these messages would have been printed. The extra verbosity is not required in this situation. Requested by: kib Approved by: kib	2009-02-14 21:55:09 +00:00
Joe Marcus Clarke	03fd9c2092	Change two KASSERTS to printfs and simple returns. Stress testing has revealed that a process' current working directory can be VBAD if the directory is removed. This can trigger a panic when procstat -f PID is run. Tested by: pho Discovered by: phobot Reviewed by: kib Approved by: kib	2009-02-14 21:12:24 +00:00
Robert Watson	54fffe2d67	Modify fdcopy() so that, during fork(2), it won't copy file descriptors from the parent to the child process if they have an operation vector of &badfileops. This narrows a set of races involving system calls that allocate a new file descriptor, potentially block for some extended period, and then return the file descriptor, when invoked by a threaded program that concurrently invokes fork(2). Similar approches are used in both Solaris and Linux, and the wideness of this race was introduced in FreeBSD when we moved to a more optimistic implementation of accept(2) in order to simplify locking. A small race necessarily remains because the fork(2) might occur after the finit() in accept(2) but before the system call has returned, but that appears unavoidable using current APIs. However, this race is vastly narrower. The fix can be validated using the newfileops_on_fork regression test. PR: kern/130348 Reported by: Ivan Shcheklein <shcheklein at gmail dot com> Reviewed by: jhb, kib MFC after: 1 week	2009-02-11 15:22:01 +00:00
Konstantin Belousov	7efa697d80	Clear the pointers to the file in the struct filedesc before file is closed in fdfree. Otherwise, sysctl_kern_proc_filedesc may dereference stale struct file * values. Reported and tested by: pho MFC after: 1 month	2008-12-30 12:51:56 +00:00
Peter Wemm	43151ee6cf	Merge user/peter/kinfo branch as of r185547 into head. This changes struct kinfo_filedesc and kinfo_vmentry such that they are same on both 32 and 64 bit platforms like i386/amd64 and won't require sysctl wrapping. Two new OIDs are assigned. The old ones are available under COMPAT_FREEBSD7 - but it isn't that simple. The superceded interface was never actually released on 7.x. The other main change is to pack the data passed to userland via the sysctl. kf_structsize and kve_structsize are reduced for the copyout. If you have a process with 100,000+ sockets open, the unpacked records require a 132MB+ copyout. With packing, it is "only" ~35MB. (Still seriously unpleasant, but not quite as devastating). A similar problem exists for the vmentry structure - have lots and lots of shared libraries and small mmaps and its copyout gets expensive too. My immediate problem is valgrind. It traditionally achieves this functionality by parsing procfs output, in a packed format. Secondly, when tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled 32 bit binary which ran directly into the differing data structures in 32 vs 64 bit mode. (valgrind uses this to track file descriptor operations and this therefore affected every single 32 bit binary) I've added two utility functions to libutil to unpack the structures into a fixed record length and to make it a little more convenient to use.	2008-12-02 06:50:26 +00:00
John Baldwin	2ff47c5f18	Remove unnecessary locking around vn_fullpath(). The vnode lock for the vnode in question does not need to be held. All the data structures used during the name lookup are protected by the global name cache lock. Instead, the caller merely needs to ensure a reference is held on the vnode (such as vhold()) to keep it from being freed. In the case of procfs' <pid>/file entry, grab the process lock while we gain a new reference (via vhold()) on p_textvp to fully close races with execve(2). For the kern.proc.vmmap sysctl handler, use a shared vnode lock around the call to VOP_GETATTR() rather than an exclusive lock. MFC after: 1 month	2008-11-04 19:04:01 +00:00
John Baldwin	21fc02d271	Use shared vnode locks instead of exclusive vnode locks for the access(), chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek() (when figuring out the current size of the file in the SEEK_END case), pathconf(), readlink(), and statfs() system calls. Submitted by: ups (mostly) Tested by: pho MFC after: 1 month	2008-11-03 20:31:00 +00:00
Dag-Erling Smørgrav	e11e3f187d	Fix a number of style issues in the MALLOC / FREE commit. I've tried to be careful not to fix anything that was already broken; the NFSv4 code is particularly bad in this respect.	2008-10-23 20:26:15 +00:00
Dag-Erling Smørgrav	1ede983cc9	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months	2008-10-23 15:53:51 +00:00
Robert Watson	ac2456bfc3	Downgrade XXX to a Note for fgetsock() and fputsock(). MFC after: 3 days	2008-10-12 20:03:17 +00:00
Ed Schouten	bc093719ca	Integrate the new MPSAFE TTY layer to the FreeBSD operating system. The last half year I've been working on a replacement TTY layer for the FreeBSD kernel. The new TTY layer was designed to improve the following: - Improved driver model: The old TTY layer has a driver model that is not abstract enough to make it friendly to use. A good example is the output path, where the device drivers directly access the output buffers. This means that an in-kernel PPP implementation must always convert network buffers into TTY buffers. If a PPP implementation would be built on top of the new TTY layer (still needs a hooks layer, though), it would allow the PPP implementation to directly hand the data to the TTY driver. - Improved hotplugging: With the old TTY layer, it isn't entirely safe to destroy TTY's from the system. This implementation has a two-step destructing design, where the driver first abandons the TTY. After all threads have left the TTY, the TTY layer calls a routine in the driver, which can be used to free resources (unit numbers, etc). The pts(4) driver also implements this feature, which means posix_openpt() will now return PTY's that are created on the fly. - Improved performance: One of the major improvements is the per-TTY mutex, which is expected to improve scalability when compared to the old Giant locking. Another change is the unbuffered copying to userspace, which is both used on TTY device nodes and PTY masters. Upgrading should be quite straightforward. Unlike previous versions, existing kernel configuration files do not need to be changed, except when they reference device drivers that are listed in UPDATING. Obtained from: //depot/projects/mpsafetty/... Approved by: philip (ex-mentor) Discussed: on the lists, at BSDCan, at the DevSummit Sponsored by: Snow B.V., the Netherlands dcons(4) fixed by: kan	2008-08-20 08:31:58 +00:00
Ed Schouten	79da190c16	Remove unneeded D_NEEDGIANT from /dev/fd/{0,1,2}. There is no reason the fdopen() routine needs Giant. It only sets curthread->td_dupfd, based on the device unit number of the cdev. I guess we won't get massive performance improvements here, but still, I assume we eventually want to get rid of Giant.	2008-08-09 12:42:12 +00:00
John Baldwin	6bc1e9cd84	Rework the lifetime management of the kernel implementation of POSIX semaphores. Specifically, semaphores are now represented as new file descriptor type that is set to close on exec. This removes the need for all of the manual process reference counting (and fork, exec, and exit event handlers) as the normal file descriptor operations handle all of that for us nicely. It is also suggested as one possible implementation in the spec and at least one other OS (OS X) uses this approach. Some bugs that were fixed as a result include: - References to a named semaphore whose name is removed still work after the sem_unlink() operation. Prior to this patch, if a semaphore's name was removed, valid handles from sem_open() would get EINVAL errors from sem_getvalue(), sem_post(), etc. This fixes that. - Unnamed semaphores created with sem_init() were not cleaned up when a process exited or exec'd. They were only cleaned up if the process did an explicit sem_destroy(). This could result in a leak of semaphore objects that could never be cleaned up. - On the other hand, if another process guessed the id (kernel pointer to 'struct ksem' of an unnamed semaphore (created via sem_init)) and had write access to the semaphore based on UID/GID checks, then that other process could manipulate the semaphore via sem_destroy(), sem_post(), sem_wait(), etc. - As part of the permission check (UID/GID), the umask of the proces creating the semaphore was not honored. Thus if your umask denied group read/write access but the explicit mode in the sem_init() call allowed it, the semaphore would be readable/writable by other users in the same group, for example. This includes access via the previous bug. - If the module refused to unload because there were active semaphores, then it might have deregistered one or more of the semaphore system calls before it noticed that there was a problem. I'm not sure if this actually happened as the order that modules are discovered by the kernel linker depends on how the actual .ko file is linked. One can make the order deterministic by using a single module with a mod_event handler that explicitly registers syscalls (and deregisters during unload after any checks). This also fixes a race where even if the sem_module unloaded first it would have destroyed locks that the syscalls might be trying to access if they are still executing when they are unloaded. XXX: By the way, deregistering system calls doesn't do any blocking to drain any threads from the calls. - Some minor fixes to errno values on error. For example, sem_init() isn't documented to return ENFILE or EMFILE if we run out of semaphores the way that sem_open() can. Instead, it should return ENOSPC in that case. Other changes: - Kernel semaphores now use a hash table to manage the namespace of named semaphores nearly in a similar fashion to the POSIX shared memory object file descriptors. Kernel semaphores can now also have names longer than 14 chars (up to MAXPATHLEN) and can include subdirectories in their pathname. - The UID/GID permission checks for access to a named semaphore are now done via vaccess() rather than a home-rolled set of checks. - Now that kernel semaphores have an associated file object, the various MAC checks for POSIX semaphores accept both a file credential and an active credential. There is also a new posixsem_check_stat() since it is possible to fstat() a semaphore file descriptor. - A small set of regression tests (using the ksem API directly) is present in src/tools/regression/posixsem. Reported by: kris (1) Tested by: kris Reviewed by: rwatson (lightly) MFC after: 1 month	2008-06-27 05:39:04 +00:00
Ed Schouten	cc8945d204	Remove redundant checks from fcntl()'s F_DUPFD. Right now we perform some of the checks inside the fcntl()'s F_DUPFD operation twice. We first validate the `fd' argument. When finished, we validate the `arg' argument. These checks are also performed inside do_dup(). The reason we need to do this, is because fcntl() should return different errno's when the `arg' argument is out of bounds (EINVAL instead of EBADF). To prevent the redundant locking of the PROC_LOCK and FILEDESC_SLOCK, patch do_dup() to support the error semantics required by fcntl(). Approved by: philip (mentor)	2008-05-28 20:25:19 +00:00
Attilio Rao	258f4727f1	Replace direct atomic operation for the file refcount witht the refcount interface. It also introduces the correct usage of memory barriers, as sometimes fdrop() and fhold() are used with shared locks, which don't use any release barrier.	2008-05-25 14:57:43 +00:00
Konstantin Belousov	82f4d64035	Implement the per-open file data for the cdev. The patch does not change the cdevsw KBI. Management of the data is provided by the functions int devfs_set_cdevpriv(void priv, cdevpriv_dtr_t dtr); int devfs_get_cdevpriv(void *datap); void devfs_clear_cdevpriv(void); All of the functions are supposed to be called from the cdevsw method contexts. - devfs_set_cdevpriv assigns the priv as private data for the file descriptor which is used to initiate currently performed driver operation. dtr is the function that will be called when either the last refernce to the file goes away, the device is destroyed or devfs_clear_cdevpriv is called. - devfs_get_cdevpriv is the obvious accessor. - devfs_clear_cdevpriv allows to clear the private data for the still open file. Implementation keeps the driver-supplied pointers in the struct cdev_privdata, that is referenced both from the struct file and struct cdev, and cannot outlive any of the referee. Man pages will be provided after the KPI stabilizes. Reviewed by: jhb Useful suggestions from: jeff, antoine Debugging help and tested by: pho MFC after: 1 month	2008-05-21 09:31:44 +00:00
Kris Kennaway	5894445dad	* Correct a mis-merge that leaked the PROC_LOCK [1] * Return ENOENT on error instead of 0 [2] Submitted by: rdivacky [1], kib [2]	2008-04-26 13:16:55 +00:00
Kris Kennaway	b1ba81d948	fdhold can return NULL, so add the one remaining missing check for this condition. Reviewed by: attilio MFC after: 1 week	2008-04-24 22:08:36 +00:00
Doug Rabson	dfdcada31e	Add the new kernel-mode NFS Lock Manager. To use it instead of the user-mode lock manager, build a kernel with the NFSLOCKD option and add '-k' to 'rpc_lockd_flags' in rc.conf. Highlights include: * Thread-safe kernel RPC client - many threads can use the same RPC client handle safely with replies being de-multiplexed at the socket upcall (typically driven directly by the NIC interrupt) and handed off to whichever thread matches the reply. For UDP sockets, many RPC clients can share the same socket. This allows the use of a single privileged UDP port number to talk to an arbitrary number of remote hosts. * Single-threaded kernel RPC server. Adding support for multi-threaded server would be relatively straightforward and would follow approximately the Solaris KPI. A single thread should be sufficient for the NLM since it should rarely block in normal operation. * Kernel mode NLM server supporting cancel requests and granted callbacks. I've tested the NLM server reasonably extensively - it passes both my own tests and the NFS Connectathon locking tests running on Solaris, Mac OS X and Ubuntu Linux. * Userland NLM client supported. While the NLM server doesn't have support for the local NFS client's locking needs, it does have to field async replies and granted callbacks from remote NLMs that the local client has contacted. We relay these replies to the userland rpc.lockd over a local domain RPC socket. * Robust deadlock detection for the local lock manager. In particular it will detect deadlocks caused by a lock request that covers more than one blocking request. As required by the NLM protocol, all deadlock detection happens synchronously - a user is guaranteed that if a lock request isn't rejected immediately, the lock will eventually be granted. The old system allowed for a 'deferred deadlock' condition where a blocked lock request could wake up and find that some other deadlock-causing lock owner had beaten them to the lock. * Since both local and remote locks are managed by the same kernel locking code, local and remote processes can safely use file locks for mutual exclusion. Local processes have no fairness advantage compared to remote processes when contending to lock a region that has just been unlocked - the local lock manager enforces a strict first-come first-served model for both local and remote lockers. Sponsored by: Isilon Systems PR: 95247 107555 115524 116679 MFC after: 2 weeks	2008-03-26 15:23:12 +00:00
Maxim Sobolev	073d8ba485	Revert previous change - it appears that the limit I was hitting was a maxsockets limit, not maxfiles limit. The question remains why those limits are handled differently (with error code for maxfiles but with sleep for maxsokets), but those would be addressed in a separate commit if necessary. Requested by: rwhatson, jeff	2008-03-19 09:58:25 +00:00
Robert Watson	237fdd787b	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink	2008-03-16 10:58:09 +00:00
Maxim Sobolev	c9370ff4d0	Properly set size of the file_zone to match kern.maxfiles parameter. Otherwise the parameter is no-op, since zone by default limits number of descriptors to some 12K entries. Attempt to allocate more ends up sleeping on zonelimit. MFC after: 2 weeks	2008-03-16 06:21:30 +00:00
Antoine Brodin	e3ad7f6626	Introduce a new F_DUP2FD command to fcntl(2), for compatibility with Solaris and AIX. fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionnaly equivalent. Document it. Add some regression tests (identical to the dup2(2) regression tests). PR: 120233 Submitted by: Jukka Ukkonen Approved by: rwaston (mentor) MFC after: 1 month	2008-03-08 22:02:21 +00:00
Dag-Erling Smørgrav	60e15db992	This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload consists of the null-terminated name and the contents of any structure you wish to record. A new ktrstruct() function constructs and emits a KTR_STRUCT record. It is accompanied by convenience macros for struct stat and struct sockaddr. In kdump(1), KTR_STRUCT records are handled by a dispatcher function that runs stringent sanity checks on its contents before handing it over to individual decoding funtions for each type of structure. Currently supported structures are struct stat and struct sockaddr for the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK and AF_IPX is present but disabled, as I am unable to test it properly. Since 's' was already taken, the letter 't' is used by ktrace(1) to enable KTR_STRUCT trace points, and in kdump(1) to enable their decoding. Derived from patches by Andrew Li <andrew2.li@citi.com>. PR: kern/117836 MFC after: 3 weeks	2008-02-23 01:01:49 +00:00
Simon L. B. Nielsen	1b7089994c	Fix sendfile(2) write-only file permission bypass. Security: FreeBSD-SA-08:03.sendfile Submitted by: kib	2008-02-14 11:44:31 +00:00
Joe Marcus Clarke	f280594937	Add support for displaying a process' current working directory, root directory, and jail directory within procstat. While this functionality is available already in fstat, encapsulating it in the kern.proc.filedesc sysctl makes it accessible without using kvm and thus without needing elevated permissions. The new procstat output looks like: PID COMM FD T V FLAGS REF OFFSET PRO NAME 76792 tcsh cwd v d -------- - - - /usr/src 76792 tcsh root v d -------- - - - / 76792 tcsh 15 v c rw------ 16 9130 - - 76792 tcsh 16 v c rw------ 16 9130 - - 76792 tcsh 17 v c rw------ 16 9130 - - 76792 tcsh 18 v c rw------ 16 9130 - - 76792 tcsh 19 v c rw------ 16 9130 - - I am also bumping __FreeBSD_version for this as this new feature will be used in at least one port. Reviewed by: rwatson Approved by: rwatson	2008-02-09 05:16:26 +00:00
Robert Watson	07dd4a31b5	Export a type for POSIX SHM file descriptors via kern.proc.filedesc as used by procstat, or SHM descriptors will show up as type unknown in userspace.	2008-01-20 19:55:52 +00:00
Attilio Rao	22db15c06f	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjuction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>	2008-01-13 14:44:15 +00:00
Attilio Rao	cb05b60a89	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>	2008-01-10 01:10:58 +00:00
John Baldwin	8e38aeff17	Add a new file descriptor type for IPC shared memory objects and use it to implement shm_open(2) and shm_unlink(2) in the kernel: - Each shared memory file descriptor is associated with a swap-backed vm object which provides the backing store. Each descriptor starts off with a size of zero, but the size can be altered via ftruncate(2). The shared memory file descriptors also support fstat(2). read(2), write(2), ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared memory file descriptors. - shm_open(2) and shm_unlink(2) are now implemented as system calls that manage shared memory file descriptors. The virtual namespace that maps pathnames to shared memory file descriptors is implemented as a hash table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash of the pathname. - As an extension, the constant 'SHM_ANON' may be specified in place of the path argument to shm_open(2). In this case, an unnamed shared memory file descriptor will be created similar to the IPC_PRIVATE key for shmget(2). Note that the shared memory object can still be shared among processes by sharing the file descriptor via fork(2) or sendmsg(2), but it is unnamed. This effectively serves to implement the getmemfd() idea bandied about the lists several times over the years. - The backing store for shared memory file descriptors are garbage collected when they are not referenced by any open file descriptors or the shm_open(2) virtual namespace. Submitted by: dillon, peter (previous versions) Submitted by: rwatson (I based this on his version) Reviewed by: alc (suggested converting getmemfd() to shm_open())	2008-01-08 21:58:16 +00:00
John Baldwin	e46502943a	Make ftruncate a 'struct file' operation rather than a vnode operation. This makes it possible to support ftruncate() on non-vnode file types in the future. - 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on a given file descriptor. - ftruncate() moves to kern/sys_generic.c and now just fetches a file object and invokes fo_truncate(). - The vnode-specific portions of ftruncate() move to vn_truncate() in vfs_vnops.c which implements fo_truncate() for vnode file types. - Non-vnode file types return EINVAL in their fo_truncate() method. Submitted by: rwatson	2008-01-07 20:05:19 +00:00
Jeff Roberson	a57decdf32	- In sysctl_kern_file skip fdps with negative lastfiles. This can happen if there are no files open. Accounting for these can eventually return a negative value for olenp causing sysctl to crash with a bad malloc. Reported by: Pawel Worach <pawel.worach@gmail.com>	2008-01-03 01:26:59 +00:00
Jeff Roberson	397c19d175	Remove explicit locking of struct file. - Introduce a finit() which is used to initailize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid. - Protect f_flag and f_count with atomic operations. - Remove the global list of all files and associated accounting. - Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets. - Mark sockets in the accept queue so we don't incorrectly gc them. Tested by: kris, pho	2007-12-30 01:42:15 +00:00
Robert Watson	cc43c38c87	Add two new sysctls in support of the forthcoming procstat(1) to support its -f and -v arguments: kern.proc.filedesc - dump file descriptor information for a process, if debugging is permitted, including socket addresses, open flags, file offsets, file paths, etc. kern.proc.vmmap - dump virtual memory mapping information for a process, if debugging is permitted, including layout and information on underlying objects, such as the type of object and path. These provide a superset of the information historically available through the now-deprecated procfs(4), and are intended to be exported in an ABI-robust form.	2007-12-02 10:10:27 +00:00
Robert Watson	0bf686c125	Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which previously conditionally acquired Giant based on debug.mpsafenet. As that has now been removed, they are no longer required. Removing them significantly simplifies error-handling in the socket layer, eliminated quite a bit of unwinding of locking in error cases. While here clean up the now unneeded opt_net.h, which previously was used for the NET_WITH_GIANT kernel option. Clean up some related gotos for consistency. Reviewed by: bz, csjp Tested by: kris Approved by: re (kensmith)	2007-08-06 14:26:03 +00:00
Jeff Roberson	f6c1ecca50	- Use explicit locking in the various fcntl case statements so that we can acquire shared filedescriptor locks in the appropriate cases. - Remove Giant from calls that issue ioctls. The ioctl path has been mpsafe for some time now. - Only acquire giant for VOP_ADVLOCK when the filesystem requires giant. advlock is now mpsafe. Reviewed by: rwatson Approved by: re	2007-07-03 21:26:06 +00:00
Robert Watson	7251b7863c	Rather than passing SUSER_RUID into priv_check_cred() to specify when a privilege is checked against the real uid rather than the effective uid, instead decide which uid to use in priv_check_cred() based on the privilege passed in. We use the real uid for PRIV_MAXFILES, PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of SUSER_RUID; there are now no flags defined for priv_check_cred(). Obtained from: TrustedBSD Project	2007-06-16 23:41:43 +00:00
Konstantin Belousov	9e223287c0	Revert UF_OPENING workaround for CURRENT. Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation argument from being file descriptor index into the pointer to struct file. Proposed and reviewed by: jhb Reviewed by: daichi (unionfs) Approved by: re (kensmith)	2007-05-31 11:51:53 +00:00
Konstantin Belousov	5c76452f8f	Mark the filedescriptor table entries with VOP_OPEN being performed for them as UF_OPENING. Disable closing of that entries. This should fix the crashes caused by devfs_open() (and fifo_open()) dereferencing struct file * by index, while the filedescriptor is closed by parallel thread. Idea by: tegge Reviewed by: tegge (previous version of patch) Tested by: Peter Holm Approved by: re (kensmith) MFC after: 3 weeks	2007-05-04 14:23:29 +00:00
John Baldwin	06e043fb20	Avoid a lot of code duplication by using kern_open() to open /dev/null in fdcheckstd() instead of a stripped down version of kern_open()'s code. MFC after: 1 week Reviewed by: cperciva	2007-04-26 18:01:19 +00:00
Robert Watson	5e3f7694b1	Replace custom file descriptor array sleep lock constructed using a mutex and flags with an sxlock. This leads to a significant and measurable performance improvement as a result of access to shared locking for frequent lookup operations, reduced general overhead, and reduced overhead in the event of contention. All of these are imported for threaded applications where simultaneous access to a shared file descriptor array occurs frequently. Kris has reported 2x-4x transaction rate improvements on 8-core MySQL benchmarks; smaller improvements can be expected for many workloads as a result of reduced overhead. - Generally eliminate the distinction between "fast" and regular acquisisition of the filedesc lock; the plan is that they will now all be fast. Change all locking instances to either shared or exclusive locks. - Correct a bug (pointed out by kib) in fdfree() where previously msleep() was called without the mutex held; sx_sleep() is now always called with the sxlock held exclusively. - Universally hold the struct file lock over changes to struct file, rather than the filedesc lock or no lock. Always update the f_ops field last. A further memory barrier is required here in the future (discussed with jhb). - Improve locking and reference management in linux_at(), which fails to properly acquire vnode references before using vnode pointers. Annotate improper use of vn_fullpath(), which will be replaced at a future date. In fcntl(), we conservatively acquire an exclusive lock, even though in some cases a shared lock may be sufficient, which should be revisited. The dropping of the filedesc lock in fdgrowtable() is no longer required as the sxlock can be held over the sleep operation; we should consider removing that (pointed out by attilio). Tested by: kris Discussed with: jhb, kris, attilio, jeff	2007-04-04 09:11:34 +00:00
John Baldwin	3076ca6720	Just use 'fdrop()' instead of 'FILE_LOCK(); fdrop_locked()' in dupfdopen(). While I'm at it, move the second fdrop() out from under the filedesc lock.	2007-03-15 21:19:21 +00:00
Robert Watson	873fbcd776	Further system call comment cleanup: - Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde) - Remove extra blank lines in some cases. - Add extra blank lines in some cases. - Remove no-op comments consisting solely of the function name, the word "syscall", or the system call name. - Add punctuation. - Re-wrap some comments.	2007-03-05 13:10:58 +00:00
Robert Watson	0c14ff0eb5	Remove 'MPSAFE' annotations from the comments above most system calls: all system calls now enter without Giant held, and then in some cases, acquire Giant explicitly. Remove a number of other MPSAFE annotations in the credential code and tweak one or two other adjacent comments.	2007-03-04 22:36:48 +00:00
Robert Watson	780a98ad1f	Catch up file descriptor printing function in DDB to the addition of kqueues and POSIX message queues.	2007-02-15 10:55:43 +00:00
Robert Watson	442f65e958	Break file descriptor printing logic out of db_show_files() into db_print_file(), and add a new "show file <ptr>" DDB command, which can be used to print out file descriptors referenced in stack traces.	2007-02-15 10:50:48 +00:00
Xin LI	4f506694bb	Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.	2007-01-17 14:58:53 +00:00
John Baldwin	9ae328fc8f	- Close a race between enumerating UNIX domain socket pcb structures via sysctl and socket teardown by adding a reference count to the UNIX domain pcb object and fixing the sysctl that enumerates unpcbs to grab a reference on each unpcb while it builds the list to copy out to userland. - Close a race between UNIX domain pcb garbage collection (unp_gc()) and file descriptor teardown (fdrop()) by adding a new garbage collection flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers in a UNIX domain socket looking for nested file descriptor references and clears the flag when it is finished. fdrop() checks to see if the flag is set on a file descriptor whose refcount just dropped to 0 and waits for unp_gc() to clear the flag before completely destroying the file descriptor. MFC after: 1 week Reviewed by: rwatson Submitted by: ups Hopefully makes the panics go away: mx1	2007-01-05 19:59:46 +00:00
Robert Watson	acd3428b7d	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>	2006-11-06 13:42:10 +00:00
John-Mark Gurney	aeab19b21f	return EBADF instead of successfully attaching (and then panicing) when an fd is dieing.. Convinced by: jhb PR: 103127	2006-09-24 02:29:53 +00:00
John Baldwin	b04aff773e	Add a comment to explain what fdclose() does and what it's purpose is since the subtlety eluded me when I looked at it last week.	2006-07-21 20:24:00 +00:00
John Baldwin	c1cccebe8b	Add a kern_close() so that the ABIs can close a file descriptor w/o having to populate a close_args struct and change some of the places that do.	2006-07-08 20:03:39 +00:00
Pawel Jakub Dawidek	0bd645ae0c	Compress direct cr_ruid comparsion and jailed() call to suser_cred(9). Reviewed by: rwatson	2006-06-27 11:32:08 +00:00
Robert Watson	197b35d717	Mark fgetsock() and fputsock() as depcrecated: callers should rely on the file descriptor reference, rather than paying additional lock operations to acquire a socket reference from the file descriptor. This will also help to ensure that file descriptor based socket requests are not delivered to a socket after close. Most consumers have already been converted to this model. MFC after: 3 months	2006-04-01 11:09:54 +00:00
Christian S.J. Peron	2ed4894a26	Restore fd optimization with a few minor tweaks, to quote tegge: "fdinit() fails to initialize newfdp->fd_fd.fd_lastfile to -1. This breaks fdcopy() which will incorrectly set newfdp->fd_freefile to 1 if no files are open and the last file descriptor marked as unused for fdp was 0. This later causes descriptor 0 to be unavailable in newfdp when the optimization is enabled. When the last file descriptor previously marked as used is nonzero and marked as unused, fdunused() incorrectly sets fdp->fd_lastfile to fd - 1 due to fd_last_used() returning (size - 1). This hides the problem that breaks the optimization." This allows us to keep the optimization, while un-breaking it. This is a RELENG_6 candidate. PR: kern/87208 MFC after: 1 week Submitted by: tegge	2006-03-20 00:13:47 +00:00
Christian S.J. Peron	30bacc08e0	Back out fd optimization introduced in revision 1.280 as it appears to be really breaking things. Simple "close(0); dup(fd)" does not return descriptor "0" in some cases. Further, this change also breaks some MAC interactions with mac_execve_will_transition(). Under certain circumstances, fdcheckstd() can be called in execve(2) causing an assertion that checks to make sure that stdin, stdout and stderr reside at indexes 0, 1 and 2 in the process fd table to fail, resulting in a kernel panic when INVARIANTS is on. This should also kill the "dup(2) regression on 6.x" show stopper item on the 6.1-RELEASE TODO list. This is a RELENG_6 candidate. PR: kern/87208 Silence from: des MFC after: 1 week	2006-03-18 23:27:21 +00:00
Wayne Salamon	a750d0b2a2	Add auditing of arguments to the close() and fstat() system calls. Much more argument auditing yet to come, for remaining system calls in this file. Obtained from: TrustedBSD Project Approved by: rwatson (mentor)	2006-02-05 23:57:32 +00:00
John Baldwin	38f63f7e47	Return EBADF rather than EINVAL for FWRITE failure as per POSIX. MFC after: 1 week	2006-01-06 16:30:30 +00:00
David Xu	b2f92ef96b	Last step to make mq_notify conform to POSIX standard, If the process has successfully attached a notification request to the message queue via a queue descriptor, file closing should remove the attachment.	2005-11-30 05:12:03 +00:00
Robert Watson	742be7821c	Add the f_msgcount field to the set of struct file fields printed in show files. MFC after: 1 week	2005-11-10 13:26:29 +00:00
Robert Watson	2be165c93e	Expanet of details printed for each file descriptor to include it's garbage collection flags. Reformat generally to make this fit and leave some room for future expansion. MFC after: 1 week	2005-11-10 11:35:59 +00:00
Robert Watson	b4e507aafa	Add a DDB "show files" command to list the current open file list, some state about each open file, and identify the first process in the process table that references the file. This is helpful in debugging leaks of file descriptors. MFC after: 1 week	2005-11-10 10:42:50 +00:00
Robert Watson	f8a9ed1fa7	Fix typo in recent comment tweak. Submitted by: jkim MFC after: 1 week	2005-11-09 22:02:02 +00:00
Robert Watson	923633b4b5	In closef(), remove the assumption that there is a thread associated with the file descriptor. When a file descriptor is closed as a result of garbage collecting a UNIX domain socket, the file descriptor will not have any associated thread, so the logic to identify advisory locks held by that thread is not appropriate. Check the thread for NULL to avoid this scenario. Expand an existing comment to say a bit more about this. MFC after: 1 week	2005-11-09 20:54:25 +00:00
John Baldwin	68a17869c1	Push down Giant into fdfree() and remove it from two of the callers. Other callers such as some rfork() cases weren't locking Giant anyway. Reviewed by: csjp MFC after: 1 week	2005-11-01 17:13:05 +00:00
Robert Watson	5bb84bc84b	Normalize a significant number of kernel malloc type names: - Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat. - Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters. - Disambiguate some collisions by adding subsystem prefixes to some memory types. - Generally prefer lower case to upper case. - If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases. Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.	2005-10-31 15:41:29 +00:00
Roman Kurakin	826cf005ed	Use FILEDESC_UNLOCK(fdp) after FILE_UNLOCK(p), not before to avoid LOR. Slightly discussed on current@. LOR #055 MFC after: 14 days	2005-10-04 16:27:54 +00:00
Dag-Erling Smørgrav	d09dfa2bfd	Two minor optimizations of fdalloc(): - if minfd < fd_freefile (as is most often the case, since minfd is usually 0), set it to fd_freefile. - remove a call to fd_first_free() which duplicates work already done by fdused(). This change results in a small but measurable speedup for processes with large numbers (several thousands) of open files. PR: kern/85176 Submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz> MFC after: 3 weeks	2005-08-26 11:16:39 +00:00
Dima Dorfman	1ee6b74603	Fix fdcheckstd to pass the file descriptor along through vn_open. When opening a device, devfs_open needs the file descriptor to install its own fileops. Failing to pass the file descriptor causes the vnode to be returned with the regular vnops, which will cause a panic on the first read or write because devfs_specops is not meant to support those operations. This bug caused a panic after exec'ing any set[ug]id program with fds 0..2 closed (i.e., if any action had to be taken by fdcheckstd, we would panic if the exec'd program ever tried to use any of those descriptors). Reviewed by: phk Approved by: re (scottl)	2005-06-25 03:34:49 +00:00
Jeff Roberson	6de925e58b	- Use NAMEI to pickup Giant if we need it in fpcheckstd().	2005-05-03 10:52:22 +00:00
Giorgos Keramidas	0a11e99990	Remove redundant initialization that is repeated in the for() loop right below it. Approved by: jhb	2005-03-08 16:57:20 +00:00
Giorgos Keramidas	46da8bf8fb	Typo & grammar fixes in comments.	2005-03-08 00:58:50 +00:00
Poul-Henning Kamp	44dc16a986	Make some file/filedesc related functions static	2005-02-10 12:27:58 +00:00
John Baldwin	76951d21d1	- Tweak kern_msgctl() to return a copy of the requested message queue id structure in the struct pointed to by the 3rd argument for IPC_STAT and get rid of the 4th argument. The old way returned a pointer into the kernel array that the calling function would then access afterwards without holding the appropriate locks and doing non-lock-safe things like copyout() with the data anyways. This change removes that unsafeness and resulting race conditions as well as simplifying the interface. - Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(), fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of code duplication for freebsd4 and netbsd compatability system calls. - Add a new lookup function kern_alternate_path() that looks up a filename under an alternate prefix and determines which filename should be used. This is basically a more general version of linux_emul_convpath() that can be shared by all the ABIs thus allowing for further reduction of code duplication.	2005-02-07 18:44:55 +00:00
Poul-Henning Kamp	8516dd18e1	Don't use VOP_GETVOBJECT, use vp->v_object directly.	2005-01-25 00:40:01 +00:00
Jeff Roberson	66ca1b4878	- Use VFS_LOCK_GIANT() in place of mtx_lock(&giant), etc. Sponsored By: Isilon Systems, Inc.	2005-01-24 10:19:31 +00:00
Warner Losh	9454b2d864	/* -> /*- for copyright notices, minor format tweaks as necessary	2005-01-06 23:35:40 +00:00
Poul-Henning Kamp	662d80dc23	Fix a deadlock I introduced this morning. Mostly from: tegge	2004-12-14 20:48:40 +00:00
Poul-Henning Kamp	d986dbb448	Add a new kind of reference count (fd_holdcnt) to struct filedesc which holds on to just the data structure and the mutex. (The existing refcount (fd_refcnt) holds onto the open files in the descriptor.) The fd_holdcnt is protected by fdesc_mtx, fd_refcnt by FILEDESC_LOCK. Add fdhold(struct proc ) which gets a hold on the filedescriptors of the specified proc.. Add fddrop(struct filedesc ) which drops the fd_holdcnt and if zero destroys the mutex and frees the memory. Initialize the fd_holdcnt to one in fdinit(). Normal operations on the filedesc structure will not change it. In fdfree() use fddrop() to dispose of the mutex and structure. Hold the FILEDESC_LOCK() until we have cleaned out the contents and carefully set the fields to null values during cleanup. Use fdhold()/fddrop() in mountcheckdirs() and sysctl_kern_file().	2004-12-14 09:09:51 +00:00
Poul-Henning Kamp	30abaa53df	Make fdesc_mtx private to kern_descrip.c now that the flock has come home.	2004-12-14 08:44:51 +00:00
Poul-Henning Kamp	12b18fdab4	Move the checkdirs() function from vfs_mount.c to kern_descrip.c and call it mountcheckdirs().	2004-12-14 08:23:18 +00:00
Poul-Henning Kamp	c113083c5a	Add new function fdunshare() which encapsulates the necessary light magic for ensuring that a process' filedesc is not shared with anybody. Use it in the two places which previously had private implmentations. This collects all fd_refcnt handling in kern_descrip.c	2004-12-14 07:20:03 +00:00
Poul-Henning Kamp	9722743b9a	Sort and wash #includes.	2004-12-03 21:29:25 +00:00
Poul-Henning Kamp	355be4eeda	Drop ffree() as a separate function and incorporate the only place used.	2004-12-02 12:17:27 +00:00
Poul-Henning Kamp	20ddb405f8	Style polishing. Use grepable functions Other minor nitpickings.	2004-12-02 11:56:13 +00:00
Poul-Henning Kamp	d672e07541	We already have a lock initialization function, use that for fdesc_mtx also. Polish badfo stuff.	2004-12-01 09:42:35 +00:00
Poul-Henning Kamp	010b1e3fdc	Collect the stuff for the /dev/fd/{%d,std{in,out,err}} pseudo-device driver at the bottom of the file.	2004-12-01 09:29:31 +00:00
Poul-Henning Kamp	e4643c730a	"nfiles" is a bad name for a global variable. Call it "openfiles" instead as this is more correct and matches the sysctl variable.	2004-12-01 09:22:26 +00:00
Poul-Henning Kamp	cc2f51ef32	Style: move data to top of file.	2004-12-01 08:06:27 +00:00
Robert Watson	1a1238a112	Don't acquire Giant before calling closef() in close() (and elsewhere); instead acquire it conditionally in closef() if it is required for advisory locking. This removes Giant from the close() path of sockets and pipes (and any other objects that don't acquire Giant in their fo_close path, such as kqueues). Giant will still be acquired twice for vnodes -- once for advisory lock teardown, and a second time in the fo_close method. Both Poul-Henning and I believe that the advisory lock teardown code can be moved into the vn_closefile path shortly. This trims a percent or two off the cost of most non-vnode close operations on SMP, but has a fairly minimal impact on UP where the cost of a single mutex operation is pretty low.	2004-11-28 14:37:17 +00:00
Poul-Henning Kamp	f0775d7c7a	Fix LOR. Solution pointed out by: jhb	2004-11-26 06:14:04 +00:00
David Schultz	c17ff94938	Neither of the arguments to closef() can be NULL anymore, so don't check for that.	2004-11-21 11:06:24 +00:00
Poul-Henning Kamp	dc99052535	Move a FILEDESC_UNLOCK up to maintain correct nesting of FILEDESC/FILE locking.	2004-11-16 09:12:03 +00:00
Poul-Henning Kamp	970d8904d6	Make FILE_LOCK and FILEDESC_LOCK nest properly by postponing the the release of FILEDESC_LOCK a few more lines.	2004-11-15 16:10:55 +00:00
Poul-Henning Kamp	2e4fed7c56	Move #define up.	2004-11-14 09:21:01 +00:00
Poul-Henning Kamp	124e4c3be8	Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST. Use this in all the places where sleeping with the lock held is not an issue. The distinction will become significant once we finalize the exact lock-type to use for this kind of case.	2004-11-13 11:53:02 +00:00
Poul-Henning Kamp	598b7ec86b	Use more intuitive pointer for fdinit() and fdcopy(). Change fdcopy() to take unlocked filedesc.	2004-11-08 12:43:23 +00:00
Poul-Henning Kamp	ef11fbd7c4	Introduce fdclose() which will clean an entry in a filedesc. Replace homerolled versions with call to fdclose(). Make fdunused() static to kern_descrip.c	2004-11-07 22:16:07 +00:00
Poul-Henning Kamp	2f5a40aa3f	Move fdinit() related stuff from .h to .c	2004-11-07 15:34:45 +00:00
Poul-Henning Kamp	8ec21e3a68	Allow fdinit() to be called with a NULL fdp argument so we can use it when setting up init. Make fdinit() lock the fdp argument as needed.	2004-11-07 12:39:28 +00:00
Poul-Henning Kamp	3b19b5af3a	When we open /dev/null for stdin/out/err for safety reasons, do it right: we should preserve f_data and f_ops if they are already set.	2004-11-06 23:36:09 +00:00
Robert Watson	81158452be	Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>	2004-10-18 22:19:43 +00:00
Julian Elischer	c233d032d2	Another case where we need to guard against a partially constructed process. Submitted by: Stephan Uphoff ( ups at tree.com ) MFC after: 3 days	2004-10-04 06:45:48 +00:00
Robert Watson	16239786ca	Remove GIANT_REQUIRED from setugidsafety() as knote_fdclose() no longer requires Giant.	2004-08-19 14:59:51 +00:00
Brian Feldman	8912c44d9f	Add the missing knote_fdclose().	2004-08-16 03:09:01 +00:00
John-Mark Gurney	ad3b9257c2	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)	2004-08-15 06:24:42 +00:00
Robert Watson	b223d06425	We're not yet ready to assert !Giant in kern_fcntl(), as it's called with Giant from ABI wrappers such as Linux emulation. Foot shoot off: phk	2004-08-07 14:09:02 +00:00
Robert Watson	a0a819747c	Avoid acquiring Giant for some common light-weight or already MPSAFE fcntl() operations, including: F_DUPFD dup() alias F_GETFD retrieve close-on-exec flag F_SETFD set close-on-exec flag F_GETFL retrieve file descriptor flags For the remaining fcntl() operations, do acquire Giant, especially where we call into fo_ioctl() as a result. We're not yet ready to push Giant into fo_ioctl(). Once we do, this can all become quite a bit prettier.	2004-08-06 22:00:55 +00:00
Robert Watson	0be8ad5fbc	Assert Giant in the following file descriptor-related functions: Function Reason -------- ------ fdfree() VFS setugidsafety() KQueue fdcheckstd() VFS _fgetvp() VFS fgetsock() Conditional assertion based on debug.mpsafenet	2004-08-04 18:35:33 +00:00
Robert Watson	a6719c82b1	Push Giant acquisition down into fo_stat() from most callers. Acquire Giant conditional on debug.mpsafenet in the socket soo_stat() routine, unconditionally in vn_statfile() for VFS, and otherwise don't acquire Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is a no-op. Don't acquire Giant in fstat() system call. Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS stack sitting on top, and therefore there will still be Giant recursion in this case.	2004-07-22 20:40:23 +00:00
Robert Watson	1c1ce9253f	Push acquisition of Giant from fdrop_closed() into fo_close() so that individual file object implementations can optionally acquire Giant if they require it: - soo_close(): depends on debug.mpsafenet - pipe_close(): Giant not acquired - kqueue_close(): Giant required - vn_close(): Giant required - cryptof_close(): Giant required (conservative) Notes: Giant is still acquired in close() even when closing MPSAFE objects due to kqueue requiring Giant in the calling closef() code. Microbenchmarks indicate that this removal of Giant cuts 3%-3% off of pipe create/destroy pairs from user space with SMP compiled into the kernel. The cryptodev and opencrypto code appears MPSAFE, but I'm unable to test it extensively and so have left Giant over fo_close(). It can probably be removed given some testing and review.	2004-07-22 18:35:43 +00:00
Christian S.J. Peron	ed6c545cf0	In addition to the real user ID check, do an explicit jail check to ensure that the caller is not prison root. The intention is to fix file descriptor creation so that prison root can not use the last remaining file descriptors. This privilege should be reserved for non-jailed root users. Approved by: bmilekic (mentor)	2004-07-14 19:04:31 +00:00
Poul-Henning Kamp	a769355f9b	Explicitly initialize f_data and f_vnode to NULL. Report f_vnode to userland in struct xfile.	2004-06-19 11:40:08 +00:00
Poul-Henning Kamp	89c9c53da0	Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.	2004-06-16 09:47:26 +00:00
Robert Watson	395a08c904	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS	2004-06-12 20:47:32 +00:00
Poul-Henning Kamp	1930e303cf	Deorbit COMPAT_SUNOS. We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither a sparc32 port nor a SunOS4.x compatibility desire these days.	2004-06-11 11:16:26 +00:00
Robert Watson	63732dce22	Push the VOP_ADVLOCK() call to release advisory locks on vnode file descriptors out of fdrop_locked() and into vn_closefile(). This removes all knowledge of vnodes from fdrop_locked(), since the lock behavior was specific to vnodes. This also removes the specific requirement for Giant in fdrop_locked(), it's now only required by code that it calls into. Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.	2004-06-01 18:03:20 +00:00
Warner Losh	7f8a436ff2	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core	2004-04-05 21:03:37 +00:00
Robert Watson	a1288c786e	Conditionally assert Giant in fputsock() based on the value of debug.mpsafenet.	2004-03-29 00:33:02 +00:00
Don Lewis	47934cef8f	Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms	2004-02-26 00:27:04 +00:00
Poul-Henning Kamp	dc08ffec87	Device megapatch 4/6: Introduce d_version field in struct cdevsw, this must always be initialized to D_VERSION. Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.	2004-02-21 21:10:55 +00:00
Dag-Erling Smørgrav	44f4b94b38	Don't bother storing a result when all you need are the side effects.	2004-02-16 18:38:46 +00:00
David Malone	a82294d01c	In fdcheckstd the descriptor table should never be shared, so just KASSERT this rather than trying to deal with what happens when file descriptors change out from under us.	2004-02-15 21:14:48 +00:00
John Baldwin	91d5354a2c	Locking for the per-process resource limits structure. - struct plimit includes a mutex to protect a reference count. The plimit structure is treated similarly to struct ucred in that is is always copy on write, so having a reference to a structure is sufficient to read from it without needing a further lock. - The proc lock protects the p_limit pointer and must be held while reading limits from a process to keep the limit structure from changing out from under you while reading from it. - Various global limits that are ints are not protected by a lock since int writes are atomic on all the archs we support and thus a lock wouldn't buy us anything. - All accesses to individual resource limits from a process are abstracted behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return either an rlimit, or the current or max individual limit of the specified resource from a process. - dosetrlimit() was renamed to kern_setrlimit() to match existing style of other similar syscall helper functions. - The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit() (it didn't used the stackgap when it should have) but uses lim_rlimit() and kern_setrlimit() instead. - The svr4 compat no longer uses the stackgap for resource limits calls, but uses lim_rlimit() and kern_setrlimit() instead. - The ibcs2 compat no longer uses the stackgap for resource limits. It also no longer uses the stackgap for accessing sysctl's for the ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result, ibcs2_sysconf() no longer needs Giant. - The p_rlimit macro no longer exists. Submitted by: mtm (mostly, I only did a few cleanups and catchups) Tested on: i386 Compiled on: alpha, amd64	2004-02-04 21:52:57 +00:00
Dag-Erling Smørgrav	a6d4491c71	Restore correct semantics for F_DUPFD fcntl. This should fix the errors people have been getting with configure scripts.	2004-01-17 00:59:04 +00:00
Dag-Erling Smørgrav	56a9fc0e93	WITNESS won't let us hold two filedesc locks at the same time, so juggle fdp and newfdp around a bit.	2004-01-16 21:54:56 +00:00
Dag-Erling Smørgrav	ddce426f69	Remove two KASSERTs which were overly paranoid.	2004-01-16 08:45:56 +00:00
Dag-Erling Smørgrav	12d568c2b1	Take care to drop locks when calling malloc()	2004-01-15 18:50:11 +00:00
Dag-Erling Smørgrav	a2fe44e8cf	New file descriptor allocation code, derived from similar code introduced in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated file descriptors which is used to locate available descriptors when a new one is needed. It also moves the task of growing the file descriptor table out of fdalloc(), reducing complexity in both fdalloc() and do_dup(). Debts of gratitude are owed to tjr@ (who provided the original patch on which this work is based), grog@ (for the gdb(4) man page) and rwatson@ (for assistance with pxeboot(8)).	2004-01-15 10:15:04 +00:00
Dag-Erling Smørgrav	c9de31f55f	Mechanical whitespace cleanup.	2004-01-11 19:39:14 +00:00
Alan Cox	0e88a71798	Remove long dead code, specifically, code related to munmapfd(). (See also vm/vm_mmap.c revision 1.173.)	2004-01-11 06:59:21 +00:00
David Malone	70ad6c2190	Plug a leak of open files that happens when you exec a suid program with one of std{in,out,err} open. This helps with the file descriptor leaks reported on -current. This should probably be merged into 5.2. Reviewed by: ru Tested by: Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net>	2003-12-28 19:27:14 +00:00
David Malone	e1419c08e2	falloc allocates a file structure and adds it to the file descriptor table, acquiring the necessary locks as it works. It usually returns two references to the new descriptor: one in the descriptor table and one via a pointer argument. As falloc releases the FILEDESC lock before returning, there is a potential for a process to close the reference in the file descriptor table before falloc's caller gets to use the file. I don't think this can happen in practice at the moment, because Giant indirectly protects closes. To stop the file being completly closed in this situation, this change makes falloc set the refcount to two when both references are returned. This makes life easier for several of falloc's callers, because the first thing they previously did was grab an extra reference on the file. Reviewed by: iedowse Idea run past: jhb	2003-10-19 20:41:07 +00:00
Robert Watson	c142b0fcfe	Remove the global variable 'cmask', which was used to initialize the fd_cmask field in the file descriptor structure for the first process indirectly from CMASK, and when an fd structure is initialized before being filled in, and instead just use CMASK. This appears to be an artifact left over from the initial integration of quotas into BSD. Suggested by: peter	2003-10-02 03:57:59 +00:00
David Malone	d2cce3d6e8	Do some minor Giant pushdown made possible by copyin, fget, fdrop, malloc and mbuf allocation all not requiring Giant. 1) ostat, fstat and nfstat don't need Giant until they call fo_stat. 2) accept can copyin the address length without grabbing Giant. 3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit. 4) move Giant grabbing from each indivitual recv* syscall to recvit.	2003-08-04 21:28:57 +00:00
Alan Cox	fbe1bdddcc	Revision 1.51 of vm/uma_core.c modified uma_large_free() to acquire Giant when needed. So, don't do it here.	2003-07-29 05:23:19 +00:00
Robert Watson	2e4a71cdb1	When exporting file descriptor data for threads invoking the kern.file sysctl, don't return information about processes that fail p_cansee(td, p). This prevents sockstat and related programs from seeing file descriptors owned by processes not in the same jail as the thread, as well as having implications for MAC, etc. This is a partial solution: it permits an information leak about the number of descriptors in the sizing calculation (but this is not new information, you can also get it from kern.openfiles), and doesn't attempt to mask file descriptors based on the properties of the descriptor, only the process referencing it. However, it provides most of what you want under most circumstances, without complicating the locking. PR: 54211 Based on a patch submitted by: Pawel Jakub Dawidek <nick@garage.freebsd.pl>	2003-07-28 16:03:53 +00:00
Poul-Henning Kamp	7c89f162bc	Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.	2003-07-27 17:04:56 +00:00
Alan Cox	18e8d4e79c	revision 1.51 of vm/uma_core.c modified uma_large_malloc() to acquire Giant when needed.	2003-07-25 22:26:43 +00:00
Don Lewis	857d9c60d0	Extend the mutex pool implementation to permit the creation and use of multiple mutex pools with different options and sizes. Mutex pools can be created with either the default sleep mutexes or with spin mutexes. A dynamically created mutex pool can now be destroyed if it is no longer needed. Create two pools by default, one that matches the existing pool that uses the MTX_NOWITNESS option that should be used for building higher level locks, and a new pool with witness checking enabled. Modify the users of the existing mutex pool to use the appropriate pool in the new implementation. Reviewed by: jhb	2003-07-13 01:22:21 +00:00

... 2 3 4 5 6 ...

556 Commits