freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	3dca54ab98	filedesc: move freeing old tables to fdescfree They cannot be accessed by anyone and hold count only protects the structure from being freed.	2014-11-02 14:12:03 +00:00
Mateusz Guzik	3dc85312b2	filedesc: factor out some code out of fdescfree Previously it had a huge self-contained chunk dedicated to dealing with shared tables. No functional changes.	2014-11-02 13:43:04 +00:00
Mateusz Guzik	080fdefc28	filedesc: tidy up fdcheckstd No functional changes.	2014-11-02 02:32:33 +00:00
Mateusz Guzik	d3f3e12a4f	filedesc: lock filedesc lock in fdcloseexec only when needed	2014-11-02 01:13:11 +00:00
Mateusz Guzik	2534d8eeb6	filedesc: drop retval argument from do_dup It was almost always td_retval anyway. For the one case where it is not, preserve the old value across the call.	2014-10-31 10:35:01 +00:00
Mateusz Guzik	8a5177cca3	filedesc: fix missed comments about fdsetugidsafety While here just note that both fdsetugidsafety and fdcheckstd take sleepable locks.	2014-10-31 09:56:00 +00:00
Mateusz Guzik	f652d856ab	filedesc: make fdinit return with source filedesc locked and new one sized appropriately Assert FILEDESC_XLOCK_ASSERT only for already used tables in fdgrowtable. We don't have to call it with the lock held if we are just creating new filedesc. As a side note, strictly speaking processes can have fdtables with fd_lastfile = -1, but then they cannot enter fdgrowtable. Very first file descriptor they get will be 0 and the only syscall allowing to choose fd number requires an active file descriptor. Should this ever change, we can add an 'init' (or similar) parameter to fdgrowtable.	2014-10-31 09:25:28 +00:00
Mateusz Guzik	ffeb890592	filedesc: iterate over fd table only once in fdcopy While here add 'fdused_init' which does not perform unnecessary work. Drop FILEDESC_LOCK_ASSERT from fdisused and rely on callers to hold it when appropriate. This function is only used with INVARIANTS. No functional changes intended.	2014-10-31 09:19:46 +00:00
Mateusz Guzik	1a0c80a3df	filedesc: tidy up fdfree Implement fdefree_last variant and get rid of 'last' parameter. No functional changes.	2014-10-31 09:15:59 +00:00
Mateusz Guzik	b97a758ffc	filedesc: tidy up fdcopy a little bit Test for file availability by fde_file != NULL instead of fdisused, this is consistent with similar checks later. Drop badfileops check. badfileops don't have DFLAG_PASSABLE set, so it was never reached in practice. fdisused is now only used in some KASSERTS, so ifdef it under INVARIANTS. No functional changes.	2014-10-31 05:41:27 +00:00
Mateusz Guzik	f55cf4b0d1	filedesc: make sure to force table reload in fget_unlocked when count == 0 This is a fixup to r273843.	2014-10-30 07:21:38 +00:00
Mateusz Guzik	29c85772bb	filedesc: microoptimize fget_unlocked by retrying obtaining reference count without restarting whole lookup Restart is only needed when fp was closed by current process, which is a much rarer event than ref/deref by some other thread.	2014-10-30 05:21:12 +00:00
Mateusz Guzik	aa77d52800	filedesc: get rid of atomic_load_acq_int from fget_unlocked A read barrier was necessary because fd table pointer and table size were updated separately, opening a window where fget_unlocked could read new size and old pointer. This patch puts both these fields into one dedicated structure, pointer to which is later atomically updated. As such, fget_unlocked only needs a data dependency barrier, which is a no-op on all supported architectures. Reviewed by: kib (previous version) MFC after: 2 weeks	2014-10-30 05:10:33 +00:00
Mateusz Guzik	58a3dcb229	filedesc: assert that table size is at least 3 in fdsetugidsafety Requested by: kib	2014-10-22 08:56:57 +00:00
Mateusz Guzik	11888da8d9	filedesc: cleanup setugidsafety a little Rename it to fdsetugidsafety for consistency with other functions. There is no need to take filedesc lock if not closing any files. The loop has to verify each file and we are guaranteed fdtable has space for at least 20 fds. As such there is no need to check fd_lastfile. While here tidy up is_unsafe.	2014-10-22 00:23:43 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, hence there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
Mateusz Guzik	966ee9f25f	filedesc: plug 2 write-only variables Reported by: Coverity CID: 1245745, 1245746	2014-10-20 21:57:24 +00:00
Mateusz Guzik	55056be254	filedesc: plug 2 assignments to M_ZERO-ed pointers in falloc_noinstall No functional changes.	2014-10-15 01:16:11 +00:00
Mateusz Guzik	2b4a2528d7	filedesc: fix up breakage introduced in 272505 Include sequence counter support unconditionally [1]. This fixes reported build problems with e.g. nvidia driver due to missing opt_capsicum.h. Replace fishy looking sizeof with offsetof. Make fde_seq the last member in order to simplify calculations. Suggested by: kib [1] X-MFC: with 272505	2014-10-05 19:40:29 +00:00
Konstantin Belousov	57c2505e65	On error, sbuf_bcat() returns -1. Some callers returned this -1 to the upper layers, which interpret it as errno value, which happens to be ERESTART. The result was spurious restarts of the sysctls in loop, e.g. kern.proc.proc, instead of returning ENOMEM to caller. Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers expecting errno. In collaboration with: pho Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2014-10-05 17:35:59 +00:00
Mateusz Guzik	ee3fd7bbb1	Plug capability races. fp and appropriate capability lookups were not atomic, which could result in improper capabilities being checked. This could result either in protection bypass or in a spurious ENOTCAPABLE. Make fp + capability check atomic with the help of sequence counters. Reviewed by: kib MFC after: 3 weeks	2014-10-04 08:08:56 +00:00
Mateusz Guzik	0c4a09a378	Make do_dup() static and move relevant macros to kern_descrip.c No functional changes.	2014-09-26 19:48:47 +00:00
Konstantin Belousov	f69261f2f9	Fix fcntl(2) compat32 after r270691. The copyin and copyout of the struct flock are done in the sys_fcntl(), which means that compat32 used direct access to userland pointers. Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which performs necessary userland memory accesses, and use it from both native and compat32 fcntl syscalls. Reported by: jhibbits Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-09-25 21:07:19 +00:00
John Baldwin	9696feebe2	Add a new fo_fill_kinfo fileops method to add type-specific information to struct kinfo_file. - Move the various fill_*_info() methods out of kern_descrip.c and into the various file type implementations. - Rework the support for kinfo_ofile to generate a suitable kinfo_file object for each file and then convert that to a kinfo_ofile structure rather than keeping a second, different set of code that directly manipulates type-specific file information. - Remove the shm_path() and ksem_info() layering violations. Differential Revision: https://reviews.freebsd.org/D775 Reviewed by: kib, glebius (earlier version)	2014-09-22 16:20:47 +00:00
John Baldwin	2d69d0dcc2	Fix various issues with invalid file operations: - Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(), and invfo_kqfilter() for use by file types that do not support the respective operations. Home-grown versions of invfo_poll() were universally broken (they returned an errno value, invfo_poll() uses poll_no_poll() to return an appropriate event mask). Home-grown ioctl routines also tended to return an incorrect errno (invfo_ioctl returns ENOTTY). - Use the invfo_*() functions instead of local versions for unsupported file operations. - Reorder fileops members to match the order in the structure definition to make it easier to spot missing members. - Add several missing methods to linuxfileops used by the OFED shim layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most of these used invfo_*(), but a dummy fo_stat() implementation was added.	2014-09-12 21:29:10 +00:00
John Baldwin	0ed667f6e5	Simplify vntype_to_kinfo() by returning when the desired value is found instead of breaking out of the loop and then immediately checking the loop index so that if it was broken out of the proper value can be returned. While here, use nitems().	2014-09-12 20:56:09 +00:00
Mateusz Guzik	64196a9996	Plug unnecessary fp assignments in kern_fcntl. No functional changes.	2014-09-05 23:56:25 +00:00
Gleb Smirnoff	e86447ca44	- Remove socket file operations declaration from sys/file.h. - Make them static in sys_socket.c. - Provide generic invfo_truncate() instead of soo_truncate(). Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-08-26 14:44:08 +00:00
Mateusz Guzik	037755fd15	Fix up races with f_seqcount handling. It was possible that the kernel would overwrite user-supplied hint. Abuse vnode lock for this purpose. In collaboration with: kib MFC after: 1 week	2014-08-26 08:17:22 +00:00
Mateusz Guzik	a1bf811596	Prepare fget_unlocked for reading fd table only once. Some capsicum functions accept fdp + fd and lookup fde based on that. Add variants which accept fde. Reviewed by: pjd MFC after: 1 week	2014-07-23 19:33:49 +00:00
Mateusz Guzik	b23c40d7b1	Don't zero fd_nfiles during fdp destruction. Code trying to take a look has to check fd_refcnt and it is 0 by that time. This is a follow up to r268505, without this the code would leak memory for tables bigger than the default. MFC after: 1 week	2014-07-10 21:05:45 +00:00
Mateusz Guzik	e518baf8f9	Avoid relocking filedesc lock when closing fds during fdp destruction. Don't call bzero nor fdunused from fdfree for such cases. It would do unnecessary work and complain that the lock is not taken. MFC after: 1 week	2014-07-10 20:59:54 +00:00
Mateusz Guzik	b9d32c36fa	Make fdunshare accept only td parameter. Proc had to match the thread anyway and 2 parameters were inconsistent with the rest. MFC after: 1 week	2014-06-28 05:41:53 +00:00
Mateusz Guzik	35778d7aa9	Make sure to always clear p_fd for process getting rid of its filetable. Filetable can be shared with other processes. Previous code failed to clear the pointer for all but the last process getting rid of the table. This is mostly cosmetics. Get rid of 'This should happen earlier' comment. Clearing the pointer in this place is fine as consumers can reliably check for files availability by inspecting fd_refcnt and vnodes availability by NULL-checking them. MFC after: 1 week	2014-06-28 05:18:03 +00:00
Mateusz Guzik	450570a55e	Tidy up fd-related functions called by do_execve o assert in each one that fdp is not shared o remove unnecessary NULL checks - all userspace processes have fdtables and kernel processes cannot execve o remove comments about the danger of fd_ofiles getting reallocated - fdtable is not shared and fd_ofiles could be only reallocated if new fd was about to be added, but if that was possible the code would already be buggy as setugidsafety work could be undone MFC after: 1 week	2014-06-23 01:28:18 +00:00
Mateusz Guzik	158627616c	Don't take filedesc lock in fdunshare(). We can read refcnt safely and only care if it is equal to 1. If it could suddenly change from 1 to something bigger the code would be buggy even in the previous form and transitions from > 1 to 1 are equally racy and harmless (we copy even though there is no need). MFC after: 1 week	2014-06-22 21:37:27 +00:00
Mateusz Guzik	adf87ab01c	fd: replace fd_nfiles with fd_lastfile where appropriate fd_lastfile is guaranteed to be the biggest open fd, so when the intent is to iterate over active fds or lookup one, there is no point in looking beyond that limit. Few places are left unpatched for now. MFC after: 1 week	2014-06-22 01:31:55 +00:00
Mateusz Guzik	0f0b852c73	do_dup: plug redundant adjustment of fd_lastfile By that time it was already set by fdalloc, or was there in the first place if fd is replaced. MFC after: 1 week	2014-06-22 00:53:33 +00:00
Mateusz Guzik	f2b1eaec33	Request a non-exiting process in sysctl_kern_proc_{o,}filedesc This fixes a race with exit1 freeing p_textvp. Suggested by: kib MFC after: 1 week	2014-05-02 21:55:09 +00:00
Mateusz Guzik	210a5d1689	Garbage collect fdavail. It rarely returns an error and fdallocn handles the failure of fdalloc just fine.	2014-04-04 05:07:36 +00:00
Mateusz Guzik	f804336026	Mark the following sysctls as MPSAFE: kern.file kern.proc.filedesc kern.proc.ofiledesc MFC after: 7 days	2014-03-21 19:12:05 +00:00
Mateusz Guzik	4c73e705a5	Take filedesc lock only for reading when allocating new fdtable. Code populating the table does this already. MFC after: 1 week	2014-03-21 01:34:19 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Bryan Drewery	63d8fe5531	Fix style of comment blocks. Reported by: peter Approved by: bapt (mentor, implicit) X-MFC with: r262006	2014-02-22 04:28:49 +00:00
Mateusz Guzik	1f9e8f8ad9	Fix a race between kern_proc_{o,}filedesc_out and fdescfree leading to use-after-free. fdescfree proceeds to free file pointers once fd_refcnt reaches 0, but kern_proc_{o,}filedesc_out only checked for hold count. MFC after: 3 days	2014-02-21 22:29:09 +00:00
Bryan Drewery	70f82cfbaf	Fix M_FILEDESC leak in fdgrowtable() introduced in r244510. fdgrowtable() now only reallocates fd_map when necessary. This fixes fdgrowtable() to use the same logic as fdescfree() for when to free the fd_map. The logic in fdescfree() is intended to not free the initial static allocation, however the fd_map grows at a slower rate than the table does. The table is intended to hold 20 fd, but its initial map has many more slots than 20. The slot sizing causes NDSLOTS(20) through NDSLOTS(63) to be 1 which matches NDSLOTS(20), so fdescfree() was assuming that the fd_map was still the initial allocation and not freeing it. This partially reverts r244510 by reintroducing some of the logic it removed in fdgrowtable(). Reviewed by: mjg Approved by: bapt (mentor) MFC after: 2 weeks	2014-02-17 00:00:39 +00:00
Bryan Drewery	88812f91aa	Remove redundant memcpy of fd_ofiles in fdgrowtable() added in r247602 Discussed with: mjg Approved by: bapt (mentor) MFC after: 2 weeks	2014-02-16 23:10:46 +00:00
Mateusz Guzik	231a0fe857	Plug a memory leak in dup2 when both old and new fd have ioctl caps. Reviewed by: pjd MFC after: 3 days	2014-01-03 16:36:55 +00:00
Mateusz Guzik	0918d4b21f	Don't check for fd limits in fdgrowtable_exp. Callers do that already and additional check races with process decreasing limits and can result in not growing the table at all, which is currently not handled. MFC after: 3 days	2014-01-03 16:34:16 +00:00
Adrian Chadd	79750e3b36	Migrate the sendfile_sync structure into a public(ish) API in preparation for extending and reusing it. The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper, used to indicate that the backing store for a group of mbufs has completed. It's only being used by sendfile for now and it's only implementing a sleep/wakeup rendezvous. However, there are other potential signaling paths (kqueue) and other potential uses (socket zero-copy write) where the same mechanism would also be useful. So, with that in mind: * extract the sendfile_sync code out into sf_sync_*() methods * teach the sf_sync_alloc method about the current config flag - it will eventually know about kqueue. * move the sendfile_sync code out of do_sendfile() - the only thing it now knows about is the sfs pointer. The guts of the sync rendezvous (setup, rendezvous/wait, free) is now done in the syscall wrapper. * .. and teach the 32-bit compat sendfile call the same. This should be a no-op. It's primarily preparation work for teaching the sendfile_sync about kqueue notification. Tested: * Peter Holm's sendfile stress / regression scripts Sponsored by: Netflix, Inc.	2013-12-01 03:53:21 +00:00
Pawel Jakub Dawidek	f2b525e6b9	Make process descriptors standard part of the kernel. rwhod(8) already requires process descriptors to work and having PROCDESC in GENERIC seems not enough, especially that we hope to have more and more consumers in the base. MFC after: 3 days	2013-11-30 15:08:35 +00:00
Konstantin Belousov	1744fe5048	When growing the file descriptor table, new larger memory chunk is allocated, but the old table is kept around to handle the case of threads still performing unlocked accesses to it. Grow the table exponentially instead of increasing its size by sizeof(long) * 8 chunks when overflowing. This mode significantly reduces the total memory use for the processes consuming large numbers of the file descriptors which open them one by one. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)	2013-10-09 18:41:35 +00:00
Konstantin Belousov	3625bde45d	Reduce code duplication, introduce the getmaxfd() helper to calculate the max filedescriptor index. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)	2013-10-09 18:39:44 +00:00
John-Mark Gurney	da9442ef43	it must be the last member, not might... Reviewed by: attilio Approved by: re (delphij, gjb)	2013-09-26 17:55:04 +00:00
Attilio Rao	57a9eeb4ed	Avoid memory accesses reordering which can result in fget_unlocked() seeing a stale fd_ofiles table once fd_nfiles is already updated, resulting in OOB accesses. Approved by: re (kib) Sponsored by: EMC / Isilon storage division Reported and tested by: pho Reviewed by: benno	2013-09-25 13:37:52 +00:00
Pawel Jakub Dawidek	ab568de789	Handle cases where capability rights are not provided. Reported by: kib	2013-09-05 11:58:12 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Gleb Smirnoff	ca04d21d5f	Make sendfile() a method in the struct fileops. Currently only vnode backed file descriptors have this method implemented. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix	2013-08-15 07:54:31 +00:00
Mikolaj Golub	9e89077c65	Plug up the lock leakage when exporting to a short buffer. Reported by: Alexander Leidinger Submitted by: mjg MFC after: 1 week	2013-07-01 03:27:14 +00:00
Mateusz Guzik	07bd8bf929	Remove duplicate NULL check in kern_proc_filedesc_out. No functional changes. MFC after: 1 week	2013-06-28 18:32:46 +00:00
Mikolaj Golub	6359d169ef	Rework r252313: The filedesc lock may not be dropped unconditionally before exporting fd to sbuf: fd might go away during execution. While it is ok for DTYPE_VNODE and DTYPE_FIFO because the export is from a vrefed vnode here, for other types it is unsafe. Instead, drop the lock in export_fd_to_sb(), after preparing data in memory and before writing to sbuf. Spotted by: mjg Suggested by: kib Review by: kib MFC after: 1 week	2013-06-28 18:07:41 +00:00
Mikolaj Golub	bd973910c8	To avoid LOR, always drop the filedesc lock before exporting fd to sbuf. Reviewed by: kib MFC after: 3 days	2013-06-27 19:14:03 +00:00
John Baldwin	958aa57537	Similar to 233760 and 236717, export some more useful info about the kernel-based POSIX semaphore descriptors to userland via procstat(1) and fstat(1): - Change sem file descriptors to track the pathname they are associated with and add a ksem_info() method to copy the path out to a caller-supplied buffer. - Use the fo_stat() method of shared memory objects and ksem_info() to export the path, mode, and value of a semaphore via struct kinfo_file. - Add a struct semstat to the libprocstat(3) interface along with a procstat_get_sem_info() to export the mode and value of a semaphore. - Teach fstat about semaphores and to display their path, mode, and value. MFC after: 2 weeks	2013-05-03 21:11:57 +00:00
Mikolaj Golub	fe52cf5475	Re-factor the code to provide kern_proc_filedesc_out(), kern_proc_out(), and kern_proc_vmmap_out() functions to output process kinfo structures to sbuf, to make the code reusable. The functions are going to be used in the coredump routine to store procstat info in the core program header notes. Reviewed by: kib MFC after: 3 weeks	2013-04-14 20:01:36 +00:00
Mateusz Guzik	db8f33fd32	Add fdallocn function and use it when passing fds over unix socket. This gets rid of "unp_externalize fdalloc failed" panic. Reviewed by: pjd MFC after: 1 week	2013-04-14 17:08:34 +00:00
Mikolaj Golub	c9d59a63e3	Use pget(9) to reduce code duplication. MFC after: 1 week	2013-04-07 17:44:30 +00:00
Pawel Jakub Dawidek	5f39e56581	Use dedicated malloc type for filecaps-related data, so we can detect any memory leaks easier.	2013-03-03 23:25:45 +00:00
Pawel Jakub Dawidek	a6157c3d61	Plug memory leaks in file descriptors passing.	2013-03-03 23:23:35 +00:00
Pawel Jakub Dawidek	2609222ab4	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrieved with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrieve them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavily modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and even though I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ | PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ | PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE | PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNOD to CAP_MKNODAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convenient defines: #define CAP_PREAD (CAP_SEEK | CAP_READ) #define CAP_PWRITE (CAP_SEEK | CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ) #define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \ CAP_SETSOCKOPT | CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
Pawel Jakub Dawidek	1d59211b2e	Style. Suggested by: kib	2013-02-25 20:51:29 +00:00
Pawel Jakub Dawidek	893365e42d	After r237012, the fdgrowtable() doesn't drop the filedesc lock anymore, so update a stale comment. Reviewed by: kib, keramida	2013-02-25 20:50:08 +00:00
Pawel Jakub Dawidek	4881a5950e	Don't treat pointers as booleans.	2013-02-17 11:47:30 +00:00
Ian Lepore	74938cbb7f	Make the F_READAHEAD option to fcntl(2) work as documented: a value of zero now disables read-ahead. It used to effectively restore the system default readahead heuristic if it had been changed; a negative value now restores the default. Reviewed by: kib	2013-02-13 15:09:16 +00:00
Pawel Jakub Dawidek	a2c496ebb9	Remove label that was accidentally moved during Giant removal from VFS.	2013-01-31 22:14:16 +00:00
Dag-Erling Smørgrav	b5471c918f	Rewrite fdgrowtable() so common mortals can actually understand what it does and how, and add comments describing the data structures and explaining how they are managed.	2012-12-20 20:18:27 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Konstantin Belousov	d8c1da8b90	Add F_DUP2FD_CLOEXEC. Apparently Solaris 11 already did this. Submitted by: Jukka A. Ukkonen <jau iki fi> PR: standards/169962 MFC after: 1 week	2012-07-27 10:41:10 +00:00
Konstantin Belousov	a53cab2c6c	(Incomplete) fixes for symbols visibility issues and style in fcntl.h. Append '__' prefix to the tag of struct oflock, and put it under BSD namespace. Structure is needed both by libc and kernel, thus cannot be hidden under #ifdef _KERNEL. Move a set of non-standard F_* and O_* constants into BSD namespace. SUSv4 explicitly allows implementation to pollute F_* and O_* names after fcntl.h is included, but it costs us nothing to adhere to the specification if exact POSIX compliance level is requested by user code. Change some spaces after #define to tabs. Noted by and discussed with: bde MFC after: 1 week	2012-07-21 13:02:11 +00:00
Konstantin Belousov	eb3d975443	Remove line which was accidentally kept in r238614. Submitted by: pjd Pointy hat to: kib MFC after: 1 week	2012-07-19 20:38:03 +00:00
Konstantin Belousov	49d02b13bc	Implement F_DUPFD_CLOEXEC command for fcntl(2), specified by SUSv4. PR: standards/169962 Submitted by: Jukka A. Ukkonen <jau iki fi> MFC after: 1 week	2012-07-19 10:22:54 +00:00
Mateusz Guzik	4fd85c4b5d	Follow-up commit to r238220: Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for !FWRITE anyway and may as well know about FREAD. Make _fget code a bit more readable by converting permission checking from if() to switch(). Assert that correct permission flags are passed. In collaboration with: kib Approved by: trasz (mentor) MFC after: 6 days X-MFC: with r238220	2012-07-09 05:39:31 +00:00
Mateusz Guzik	28a7f60741	Unbreak handling of descriptors opened with O_EXEC by fexecve(2). While here return EBADF for descriptors opened for writing (previously it was ETXTBSY). Add fgetvp_exec function which performs appropriate checks. PR: kern/169651 In collaboration with: kib Approved by: trasz (mentor) MFC after: 1 week	2012-07-08 00:51:38 +00:00
Konstantin Belousov	c5c1199c83	Extend the KPI to lock and unlock f_offset member of struct file. It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is 64bit quantity, always read and written under the mtxpool protection. This fixes apparently easy to trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks	2012-07-02 21:01:03 +00:00
Pawel Jakub Dawidek	d99e1d5fd6	Don't check for race with close on advisory unlock (there is nothing smart we can do when such a race occurs). This saves lock/unlock cycle for the filedesc lock for every advisory unlock operation. MFC after: 1 month	2012-06-17 21:04:22 +00:00
Pawel Jakub Dawidek	604a7c2f00	Extend the comment about checking for a race with close to explain why it is done and why we don't return an error in such case. Discussed with: kib MFC after: 1 month	2012-06-17 16:59:37 +00:00
Pawel Jakub Dawidek	fd6049b186	If VOP_ADVLOCK() call or earlier checks failed don't check for a race with close, because even if we had a race there is nothing to unlock. Discussed with: kib MFC after: 1 month	2012-06-17 16:32:32 +00:00
Pawel Jakub Dawidek	cff2dcd10d	Revert r237073. 'td' can be NULL here. MFC after: 1 month	2012-06-16 12:56:36 +00:00
Pawel Jakub Dawidek	3cde71cb25	One more attempt to make prototypes formatted according to style(9), which hopefully recovers from the "worse than useless" state. Reported by: bde MFC after: 1 month	2012-06-15 10:00:29 +00:00
Pawel Jakub Dawidek	19a8f6748e	Remove fdtofp() function and use fget_locked(), which works exactly the same. MFC after: 1 month	2012-06-14 16:25:10 +00:00
Pawel Jakub Dawidek	b7fc69ca89	Assert that the filedesc lock is being held when the fdunwrap() function is called. MFC after: 1 month	2012-06-14 16:23:16 +00:00
Pawel Jakub Dawidek	1a94dc8581	Simplify the code by making more use of the fdtofp() function. MFC after: 1 month	2012-06-14 15:37:15 +00:00
Pawel Jakub Dawidek	215aeba939	- Assert that the filedesc lock is being held when fdisused() is called. - Fix white spaces. MFC after: 1 month	2012-06-14 15:35:14 +00:00
Pawel Jakub Dawidek	7aef754274	Style fixes and assertions improvements. MFC after: 1 month	2012-06-14 15:34:10 +00:00
Pawel Jakub Dawidek	8d169d9ff0	Assert that the filedesc lock is not held when closef() is called. MFC after: 1 month	2012-06-14 15:26:23 +00:00
Pawel Jakub Dawidek	eb273c01f3	Style fixes. Reported by: bde MFC after: 1 month	2012-06-14 15:21:57 +00:00
Pawel Jakub Dawidek	c7e9a659ca	Remove code duplication from fdclosexec(), which was the reason of the bug fixed in r237065. MFC after: 1 month	2012-06-14 12:43:37 +00:00
Pawel Jakub Dawidek	8f59e9fddc	When we are closing capabilities during exec, we want to call mq_fdclose() on the underlying object and not on the capability itself. Similar bug was fixed in r236853. MFC after: 1 month	2012-06-14 12:41:21 +00:00
Pawel Jakub Dawidek	5570ae7d87	Style. MFC after: 1 month	2012-06-14 12:37:41 +00:00
Pawel Jakub Dawidek	620216725a	When checking if file descriptor number is valid, explicitly check for 'fd' being less than 0 instead of using cast-to-unsigned hack. Today's commit was brought to you by the letters 'B', 'D' and 'E' :)	2012-06-13 22:12:10 +00:00
Pawel Jakub Dawidek	3812dcd3de	Allocate descriptor number in dupfdopen() itself instead of depending on the caller using finstall(). This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes a race between finstall() and dupfdopen(). MFC after: 1 month	2012-06-13 21:32:35 +00:00
Pawel Jakub Dawidek	6195bfebcc	There is only one caller of the dupfdopen() function, so we can simplify it a bit: - We can assert that only ENODEV and ENXIO errors are passed instead of handling other errors. - The caller always call finstall() for indx descriptor, so we can assume it is set. Actually the filedesc lock is dropped between finstall() and dupfdopen(), so there is a window there for another thread to close the indx descriptor, but it will be closed in next commit. Reviewed by: mjg MFC after: 1 month	2012-06-13 19:00:29 +00:00
Mateusz Guzik	2ca63f0a90	Remove 'low' argument from fd_last_used(). This function is static and the only caller always passes 0 as low. While here update note about return values in comment. Reviewed by: pjd Approved by: trasz (mentor) MFC after: 1 month	2012-06-13 17:18:16 +00:00
Mateusz Guzik	02efb9a8b1	Re-apply reverted parts of r236935 by pjd with some changes. If fdalloc() decides to grow fdtable it does it once and at most doubles the size. This still may be not enough for sufficiently large fd. Use fd in calculations of new size in order to fix this. When growing the table, fd is already equal to first free descriptor >= minfd, also fdgrowtable() no longer drops the filedesc lock. As a result of this there is no need to retry allocation nor lookup. Fix description of fd_first_free to note all return values. In co-operation with: pjd Approved by: trasz (mentor) MFC after: 1 month	2012-06-13 17:12:53 +00:00
Pawel Jakub Dawidek	faf0db351d	Revert part of the r236935 for now, until I figure out why it doesn't work properly. Reported by: davidxu	2012-06-12 10:25:11 +00:00
Pawel Jakub Dawidek	039dc89f0d	fdgrowtable() no longer drops the filedesc lock so it is enough to retry finding free file descriptor only once after fdgrowtable(). Spotted by: pluknet MFC after: 1 month	2012-06-11 22:05:26 +00:00
Pawel Jakub Dawidek	d3ec30e525	Use consistent way of checking if descriptor number is valid. MFC after: 1 month	2012-06-11 20:17:20 +00:00
Pawel Jakub Dawidek	fd45a47ba6	Be consistent with white spaces. MFC after: 1 month	2012-06-11 20:01:50 +00:00
Pawel Jakub Dawidek	19d9c0e11e	Remove code duplicated in kern_close() and do_dup() and use closefp() function introduced a minute ago. This code duplication was responsible for the bug fixed in r236853. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 20:00:44 +00:00
Pawel Jakub Dawidek	642db963ab	Introduce closefp() function that we will be able to use to eliminate code duplication in kern_close() and do_dup(). This is committed separately from the actual removal of the duplicated code, as the combined diff was very hard to read. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:57:31 +00:00
Pawel Jakub Dawidek	129c87eb7d	Merge two ifs into one to make the code almost identical to the code in kern_close(). Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:53:41 +00:00
Pawel Jakub Dawidek	d327cee241	Move the code around a bit to move two parts of code duplicated from kern_close() close together. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:51:27 +00:00
Pawel Jakub Dawidek	8b40793150	Now that fdgrowtable() doesn't drop the filedesc lock we don't need to check if descriptor changed from under us. Replace the check with an assert. Discussed with: kib Tested by: pho MFC after: 1 month	2012-06-11 19:48:55 +00:00
Pawel Jakub Dawidek	69d7614850	When we are closing capability during dup2(), we want to call mq_fdclose() on the underlying object and not on the capability itself. Discussed with: rwatson Sponsored by: FreeBSD Foundation MFC after: 1 month	2012-06-10 14:57:18 +00:00
Pawel Jakub Dawidek	1b693d7494	Merge two ifs into one. Other minor style fixes. MFC after: 1 month	2012-06-10 13:10:21 +00:00
Pawel Jakub Dawidek	8849ae7256	Simplify fdtofp(). MFC after: 1 month	2012-06-10 06:31:54 +00:00
Pawel Jakub Dawidek	e59a97362d	There is no need to drop the FILEDESC lock around malloc(M_WAITOK) anymore, as we now use sx lock for filedesc structure protection. Reviewed by: kib MFC after: 1 month	2012-06-09 18:50:32 +00:00
Pawel Jakub Dawidek	68abac4337	Remove now unused variable. MFC after: 1 month MFC with: r236820	2012-06-09 18:48:06 +00:00
Pawel Jakub Dawidek	380513aaae	Make some of the loops more readable. Reviewed by: tegge MFC after: 1 month	2012-06-09 18:03:23 +00:00
Pawel Jakub Dawidek	5d02ed91e9	Correct panic message. MFC after: 1 month MFC with: r236731	2012-06-09 12:27:30 +00:00
Pawel Jakub Dawidek	bf3e37ef15	In fdalloc() f_ofileflags for the newly allocated descriptor has to be 0. Assert that instead of setting it to 0. Sponsored by: FreeBSD Foundation MFC after: 1 month	2012-06-07 23:33:10 +00:00
Eitan Adler	847d0034e3	Return EBADF instead of EMFILE from dup2 when the second argument is outside the range of valid file descriptors PR: kern/164970 Submitted by: Peter Jeremy <peterjeremy@acm.org> Reviewed by: jilles Approved by: cperciva MFC after: 1 week	2012-04-11 14:08:09 +00:00
John Baldwin	e506e182dd	Export some more useful info about shared memory objects to userland via procstat(1) and fstat(1): - Change shm file descriptors to track the pathname they are associated with and add a shm_path() method to copy the path out to a caller-supplied buffer. - Use the fo_stat() method of shared memory objects and shm_path() to export the path, mode, and size of a shared memory object via struct kinfo_file. - Add a struct shmstat to the libprocstat(3) interface along with a procstat_get_shm_info() to export the mode and size of a shared memory object. - Change procstat to always print out the path for a given object if it is valid. - Teach fstat about shared memory objects and to display their path, mode, and size. MFC after: 2 weeks	2012-04-01 18:22:48 +00:00
Peter Holm	ffae9d4d7c	Free up allocated memory used by posix_fadvise(2).	2012-03-08 20:34:13 +00:00
David E. O'Brien	0e31b3c15f	Reformat comment to be more readable in standard Xterm. (while I'm here, wrap other long lines)	2011-11-15 01:48:53 +00:00
John Baldwin	dccc45e4c0	Move the cleanup of f_cdevpriv when the reference count of a devfs file descriptor drops to zero out of _fdrop() and into devfs_close_f() as it is only relevant for devfs file descriptors. Reviewed by: kib MFC after: 1 week	2011-11-04 03:39:31 +00:00
Robert Watson	b160c14194	Correct a bug in export of capability-related information from the sysctls supporting procstat -f: properly provide capability rights information to userspace. The bug resulted from a merge-o during upstreaming (or rather, a failure to properly merge FreeBSD-side changes downstream). Spotted by: des, kibab MFC after: 3 days	2011-10-12 12:08:03 +00:00
Kip Macy	8451d0dd78	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
Jonathan Anderson	cfb5f76865	Add experimental support for process descriptors A "process descriptor" file descriptor is used to manage processes without using the PID namespace. This is required for Capsicum's Capability Mode, where the PID namespace is unavailable. New system calls pdfork(2) and pdkill(2) offer the functional equivalents of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote process for debugging purposes. The currently-unimplemented pdwait(2) will, in the future, allow querying rusage/exit status. In the interim, poll(2) may be used to check (and wait for) process termination. When a process is referenced by a process descriptor, it does not issue SIGCHLD to the parent, making it suitable for use in libraries---a common scenario when using library compartmentalisation from within large applications (such as web browsers). Some observers may note a similarity to Mach task ports; process descriptors provide a subset of this behaviour, but in a UNIX style. This feature is enabled by "options PROCDESC", but as with several other Capsicum kernel features, is not enabled by default in GENERIC 9.0. Reviewed by: jhb, kib Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-18 22:51:30 +00:00
Konstantin Belousov	9c00bb9190	Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz)	2011-08-16 20:07:47 +00:00
Jonathan Anderson	69d377fe1b	Allow Capsicum capabilities to delegate constrained access to file system subtrees to sandboxed processes. - Use of absolute paths and '..' are limited in capability mode. - Use of absolute paths and '..' are limited when looking up relative to a capability. - When a name lookup is performed, identify what operation is to be performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP. With these constraints, openat() and friends are now safe in capability mode, and can then be used by code such as the capability-mode runtime linker. Approved by: re (bz), mentor (rwatson) Sponsored by: Google Inc	2011-08-13 09:21:16 +00:00
Robert Watson	a9d2f8d84f	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc	2011-08-11 12:30:23 +00:00
Jonathan Anderson	c30b9b5169	Export capability information via sysctls. When reporting on a capability, flag the fact that it is a capability, but also unwrap to report all of the usual information about the underlying file. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-07-20 09:53:35 +00:00
Jonathan Anderson	745bae379d	Add implementation for capabilities. Code to actually implement Capsicum capabilities, including fileops and kern_capwrap(), which creates a capability to wrap an existing file descriptor. We also modify kern_close() and closef() to handle capabilities. Finally, remove cap_filelist from struct capability, since we don't actually need it. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-15 09:37:14 +00:00
Jonathan Anderson	5604e481b1	Fix the "passability" test in fdcopy(). Rather than checking to see if a descriptor is a kqueue, check to see if its fileops flags include DFLAG_PASSABLE. At the moment, these two tests are equivalent, but this will change with the addition of capabilities that wrap kqueues but are themselves of type DTYPE_CAPABILITY. We already have the DFLAG_PASSABLE abstraction, so let's use it. This change has been tested with [the newly improved] tools/regression/kqueue. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-08 12:19:25 +00:00
Edward Tomasz Napierala	afcc55f318	All the racct_*() calls need to happen with the proc locked. Fixing this won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".	2011-07-06 20:06:44 +00:00
Jonathan Anderson	9acdfe6549	Rework _fget to accept capability parameters. This new version of _fget() requires new parameters: - cap_rights_t needrights the rights that we expect the capability's rights mask to include (e.g. CAP_READ if we are going to read from the file) - cap_rights_t haverights used to return the capability's rights mask (ignored if NULL) - u_char maxprotp the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted (only used if we are going to mmap the file; ignored if NULL) - int fget_flags FGET_GETCAP if we want to return the capability itself, rather than the underlying object which it wraps Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-05 13:45:10 +00:00
Jonathan Anderson	c0467b5e6e	When Capsicum starts creating capabilities to wrap existing file descriptors, we will want to allocate a new descriptor without installing it in the FD array. Split falloc() into falloc_noinstall() and finstall(), and rewrite falloc() to call them with appropriate atomicity. Approved by: mentor (rwatson), re (bz)	2011-06-30 15:22:49 +00:00
Stanislav Sedov	ff6f41a472	- Do not try to drop a NULL filedesc pointer.	2011-05-12 10:56:33 +00:00
Stanislav Sedov	0daf62d9f5	- Commit work from libprocstat project. These patches add support for runtime file and processes information retrieval from the running kernel via sysctl in the form of new library, libprocstat. The library also supports KVM backend for analyzing memory crash dumps. Both procstat(1) and fstat(1) utilities have been modified to take advantage of the library (as the bonus point the fstat(1) utility no longer needs superuser privileges to operate), and the procstat(1) utility is now able to display information from memory dumps as well. The newly introduced fuser(1) utility also uses this library and is able to operate via sysctl and kvm backends. The library is by no means complete (e.g. KVM backend is missing vnode name resolution routines, and there're no manpages for the library itself) so I plan to improve it further. I'm committing it so it will get wider exposure and review. We won't be able to MFC this work as it relies on changes in HEAD, which was introduced some time ago, that break kernel ABI. OTOH we may be able to merge the library with KVM backend if we really need it there. Discussed with: rwatson	2011-05-12 10:11:39 +00:00
Edward Tomasz Napierala	722581d9e6	Add RACCT_NOFILE accounting. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)	2011-04-06 19:13:04 +00:00
Konstantin Belousov	1fe80828e7	After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) and remove the falloc() version that lacks flag argument. This is done to reduce the KPI bloat. Requested by: jhb X-MFC-note: do not	2011-04-01 13:28:34 +00:00
Konstantin Belousov	246d35ec91	Add O_CLOEXEC flag to open(2) and fhopen(2). The new function fallocf(9), that is renamed falloc(9) with added flag argument, is provided to facilitate the merge to stable branch. Reviewed by: jhb MFC after: 1 week	2011-03-25 14:00:36 +00:00
John Baldwin	8e6fa660f2	Fix some locking nits with the p_state field of struct proc: - Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL in fork to honor the locking requirements. While here, expand the scope of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously the code was locking the new child process (p2) after it had locked the parent process (p1). However, when locking two processes, the safe order is to lock the child first, then the parent. - Fix various places that were checking p_state against PRS_NEW without having the process locked to use PROC_LOCK(). Every place was already locking the process, just after the PRS_NEW check. - Remove or reduce the use of PROC_SLOCK() for places that were checking p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading the current state. - Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once. MFC after: 1 week	2011-03-24 18:40:11 +00:00
Bjoern A. Zeeb	1fb51a12f2	Mfp4 CH=177274,177280,177284-177285,177297,177324-177325 VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147. While this reduces the number of vnet recursion in some places like NFS, POSIX local sockets and some netgraph, .. recursions are impossible to fix. The current expectations are documented at the beginning of uipc_socket.c along with the other information there. Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks	2011-02-16 21:29:13 +00:00
Jilles Tjoelker	90750179ec	Do not trip a KASSERT if /dev/null cannot be opened for a setuid program. The fdcheckstd() function makes sure fds 0, 1 and 2 are open by opening /dev/null. If this fails (e.g. missing devfs or wrong permissions), fdcheckstd() will return failure and the process will exit as if it received SIGABRT. The KASSERT is only to check that kern_open() returns the expected fd, given that it succeeded. Tripping the KASSERT is most likely if fd 0 is open but fd 1 or 2 are not. MFC after: 2 weeks	2011-01-28 15:29:35 +00:00
Konstantin Belousov	23b70c1ae2	Finish r210923, 210926. Mark some devices as eternal. MFC after: 2 weeks	2011-01-04 10:59:38 +00:00
Bjoern A. Zeeb	2f826fdf53	Remove one zero from the double-0. This code doesn't have a license to kill. MFC after: 3 days	2010-04-23 14:32:58 +00:00
Konstantin Belousov	080136212f	On the return path from F_RDAHEAD and F_READAHEAD fcntls, do not unlock Giant twice. While there, bring conditions in the do/while loops closer to style, that also makes the lines fit into 80 columns. Reported and tested by: dougb	2009-11-20 22:22:53 +00:00
Xin LI	82aebf697c	Add two new fcntls to enable/disable read-ahead: - F_READAHEAD: specify the amount for sequential access. The amount is specified in bytes and is rounded up to nearest block size. - F_RDAHEAD: Darwin compatible version that use 128KB as the sequential access size. A third argument of zero disables the read-ahead behavior. Please note that the read-ahead amount is also constrained by sysctl variable, vfs.read_max, which may need to be raised in order to better utilize this feature. Thanks Igor Sysoev for proposing the feature and submitting the original version, and kib@ for his valuable comments. Submitted by: Igor Sysoev <is rambler-co ru> Reviewed by: kib@ MFC after: 1 month	2009-09-28 16:59:47 +00:00
Robert Watson	14961ba789	Replace AUDIT_ARG() with variable argument macros with a set of more specific macros for each audit argument type. This makes it easier to follow call-graphs, especially for automated analysis tools (such as fxr). In MFC, we should leave the existing AUDIT_ARG() macros as they may be used by third-party kernel modules. Suggested by: brooks Approved by: re (kib) Obtained from: TrustedBSD Project MFC after: 1 week	2009-06-27 13:58:44 +00:00
Ulf Lilleengen	2dd43ecaec	- Similar to the previous commit, but for CURRENT: Fix a bug where a FIFO vnode use count was increased twice, but only decreased once.	2009-06-24 18:44:38 +00:00
Ulf Lilleengen	7083246d7c	- Fix a bug where a FIFO vnode use count was increased twice, but only decreased once. MFC after: 1 week	2009-06-24 18:38:51 +00:00
John Baldwin	c4f16b69e1	Add a new 'void closefrom(int lowfd)' system call. When called, it closes any open file descriptors >= 'lowfd'. It is largely identical to the same function on other operating systems such as Solaris, DFly, NetBSD, and OpenBSD. One difference from other *BSD is that this closefrom() does not fail with any errors. In practice, while the manpages for NetBSD and OpenBSD claim that they return EINTR, they ignore internal errors from close() and never return EINTR. DFly does return EINTR, but for the common use case (closing fd's prior to execve()), the caller really wants all fd's closed and returning EINTR just forces callers to call closefrom() in a loop until it stops failing. Note that this implementation of closefrom(2) does not make any effort to resolve userland races with open(2) in other threads. As such, it is not multithread safe. Submitted by: rwatson (initial version) Reviewed by: rwatson MFC after: 2 weeks	2009-06-15 20:38:55 +00:00
Jeff Roberson	f4471727f3	- Use an acquire barrier to increment f_count in fget_unlocked and remove the volatile cast. Describe the reason in detail in a comment. Discussed with: bde, jhb	2009-06-02 06:55:32 +00:00
Jamie Gritton	0304c73163	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)	2009-05-27 14:11:23 +00:00
John Baldwin	6ca33ea345	Set the umask in a new file descriptor table earlier in fdcopy() to remove two lock operations.	2009-05-20 18:42:04 +00:00
Konstantin Belousov	6b72d8db47	Revert r192094. The revision caused problems for sysctl(3) consumers that expect that oldlen is filled with required buffer length even when supplied buffer is too short and returned error is ENOMEM. Redo the fix for kern.proc.filedesc, by reverting the req->oldidx when remaining buffer space is too short for the current kinfo_file structure. Also, only ignore ENOMEM. We have to convert ENOMEM to no error condition to keep existing interface for the sysctl, though. Reported by: ed, Florian Smeets <flo kasimir com> Tested by: pho	2009-05-15 14:41:44 +00:00
Jeff Roberson	bf422e5f27	- Implement a lockless file descriptor lookup algorithm in fget_unlocked(). - Save old file descriptor tables created on expansion until the entire descriptor table is freed so that pointers may be followed without regard for expanders. - Mark the file zone as NOFREE so we may attempt to reference potentially freed files. - Convert several fget_locked() users to fget_unlocked(). This requires us to manage reference counts explicitly but reduces locking overhead in the common case.	2009-05-14 03:24:22 +00:00
John Baldwin	3f11530b79	Update comment above _fget() for earlier change to FWRITE failures return EBADF rather than EINVAL. Submitted by: Jaakko Heinonen jh saunalahti fi MFC after: 1 month	2009-04-15 19:10:37 +00:00
Joe Marcus Clarke	0618630015	Remove the printf's when the vnode to be exported for procstat is not a VDIR. If the file system backing a process' cwd is removed, and procstat -f PID is called, then these messages would have been printed. The extra verbosity is not required in this situation. Requested by: kib Approved by: kib	2009-02-14 21:55:09 +00:00
Joe Marcus Clarke	03fd9c2092	Change two KASSERTS to printfs and simple returns. Stress testing has revealed that a process' current working directory can be VBAD if the directory is removed. This can trigger a panic when procstat -f PID is run. Tested by: pho Discovered by: phobot Reviewed by: kib Approved by: kib	2009-02-14 21:12:24 +00:00
Robert Watson	54fffe2d67	Modify fdcopy() so that, during fork(2), it won't copy file descriptors from the parent to the child process if they have an operation vector of &badfileops. This narrows a set of races involving system calls that allocate a new file descriptor, potentially block for some extended period, and then return the file descriptor, when invoked by a threaded program that concurrently invokes fork(2). Similar approaches are used in both Solaris and Linux, and the wideness of this race was introduced in FreeBSD when we moved to a more optimistic implementation of accept(2) in order to simplify locking. A small race necessarily remains because the fork(2) might occur after the finit() in accept(2) but before the system call has returned, but that appears unavoidable using current APIs. However, this race is vastly narrower. The fix can be validated using the newfileops_on_fork regression test. PR: kern/130348 Reported by: Ivan Shcheklein <shcheklein at gmail dot com> Reviewed by: jhb, kib MFC after: 1 week	2009-02-11 15:22:01 +00:00
Konstantin Belousov	7efa697d80	Clear the pointers to the file in the struct filedesc before file is closed in fdfree. Otherwise, sysctl_kern_proc_filedesc may dereference stale struct file * values. Reported and tested by: pho MFC after: 1 month	2008-12-30 12:51:56 +00:00
Peter Wemm	43151ee6cf	Merge user/peter/kinfo branch as of r185547 into head. This changes struct kinfo_filedesc and kinfo_vmentry such that they are same on both 32 and 64 bit platforms like i386/amd64 and won't require sysctl wrapping. Two new OIDs are assigned. The old ones are available under COMPAT_FREEBSD7 - but it isn't that simple. The superseded interface was never actually released on 7.x. The other main change is to pack the data passed to userland via the sysctl. kf_structsize and kve_structsize are reduced for the copyout. If you have a process with 100,000+ sockets open, the unpacked records require a 132MB+ copyout. With packing, it is "only" ~35MB. (Still seriously unpleasant, but not quite as devastating). A similar problem exists for the vmentry structure - have lots and lots of shared libraries and small mmaps and its copyout gets expensive too. My immediate problem is valgrind. It traditionally achieves this functionality by parsing procfs output, in a packed format. Secondly, when tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled 32 bit binary which ran directly into the differing data structures in 32 vs 64 bit mode. (valgrind uses this to track file descriptor operations and this therefore affected every single 32 bit binary) I've added two utility functions to libutil to unpack the structures into a fixed record length and to make it a little more convenient to use.	2008-12-02 06:50:26 +00:00
John Baldwin	2ff47c5f18	Remove unnecessary locking around vn_fullpath(). The vnode lock for the vnode in question does not need to be held. All the data structures used during the name lookup are protected by the global name cache lock. Instead, the caller merely needs to ensure a reference is held on the vnode (such as vhold()) to keep it from being freed. In the case of procfs' <pid>/file entry, grab the process lock while we gain a new reference (via vhold()) on p_textvp to fully close races with execve(2). For the kern.proc.vmmap sysctl handler, use a shared vnode lock around the call to VOP_GETATTR() rather than an exclusive lock. MFC after: 1 month	2008-11-04 19:04:01 +00:00
John Baldwin	21fc02d271	Use shared vnode locks instead of exclusive vnode locks for the access(), chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek() (when figuring out the current size of the file in the SEEK_END case), pathconf(), readlink(), and statfs() system calls. Submitted by: ups (mostly) Tested by: pho MFC after: 1 month	2008-11-03 20:31:00 +00:00
Dag-Erling Smørgrav	e11e3f187d	Fix a number of style issues in the MALLOC / FREE commit. I've tried to be careful not to fix anything that was already broken; the NFSv4 code is particularly bad in this respect.	2008-10-23 20:26:15 +00:00
Dag-Erling Smørgrav	1ede983cc9	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months	2008-10-23 15:53:51 +00:00
Robert Watson	ac2456bfc3	Downgrade XXX to a Note for fgetsock() and fputsock(). MFC after: 3 days	2008-10-12 20:03:17 +00:00
Ed Schouten	bc093719ca	Integrate the new MPSAFE TTY layer to the FreeBSD operating system. The last half year I've been working on a replacement TTY layer for the FreeBSD kernel. The new TTY layer was designed to improve the following: - Improved driver model: The old TTY layer has a driver model that is not abstract enough to make it friendly to use. A good example is the output path, where the device drivers directly access the output buffers. This means that an in-kernel PPP implementation must always convert network buffers into TTY buffers. If a PPP implementation would be built on top of the new TTY layer (still needs a hooks layer, though), it would allow the PPP implementation to directly hand the data to the TTY driver. - Improved hotplugging: With the old TTY layer, it isn't entirely safe to destroy TTY's from the system. This implementation has a two-step destructing design, where the driver first abandons the TTY. After all threads have left the TTY, the TTY layer calls a routine in the driver, which can be used to free resources (unit numbers, etc). The pts(4) driver also implements this feature, which means posix_openpt() will now return PTY's that are created on the fly. - Improved performance: One of the major improvements is the per-TTY mutex, which is expected to improve scalability when compared to the old Giant locking. Another change is the unbuffered copying to userspace, which is both used on TTY device nodes and PTY masters. Upgrading should be quite straightforward. Unlike previous versions, existing kernel configuration files do not need to be changed, except when they reference device drivers that are listed in UPDATING. Obtained from: //depot/projects/mpsafetty/... Approved by: philip (ex-mentor) Discussed: on the lists, at BSDCan, at the DevSummit Sponsored by: Snow B.V., the Netherlands dcons(4) fixed by: kan	2008-08-20 08:31:58 +00:00
Ed Schouten	79da190c16	Remove unneeded D_NEEDGIANT from /dev/fd/{0,1,2}. There is no reason the fdopen() routine needs Giant. It only sets curthread->td_dupfd, based on the device unit number of the cdev. I guess we won't get massive performance improvements here, but still, I assume we eventually want to get rid of Giant.	2008-08-09 12:42:12 +00:00
John Baldwin	6bc1e9cd84	Rework the lifetime management of the kernel implementation of POSIX semaphores. Specifically, semaphores are now represented as a new file descriptor type that is set to close on exec. This removes the need for all of the manual process reference counting (and fork, exec, and exit event handlers) as the normal file descriptor operations handle all of that for us nicely. It is also suggested as one possible implementation in the spec and at least one other OS (OS X) uses this approach. Some bugs that were fixed as a result include: - References to a named semaphore whose name is removed still work after the sem_unlink() operation. Prior to this patch, if a semaphore's name was removed, valid handles from sem_open() would get EINVAL errors from sem_getvalue(), sem_post(), etc. This fixes that. - Unnamed semaphores created with sem_init() were not cleaned up when a process exited or exec'd. They were only cleaned up if the process did an explicit sem_destroy(). This could result in a leak of semaphore objects that could never be cleaned up. - On the other hand, if another process guessed the id (kernel pointer to 'struct ksem' of an unnamed semaphore (created via sem_init)) and had write access to the semaphore based on UID/GID checks, then that other process could manipulate the semaphore via sem_destroy(), sem_post(), sem_wait(), etc. - As part of the permission check (UID/GID), the umask of the process creating the semaphore was not honored. Thus if your umask denied group read/write access but the explicit mode in the sem_init() call allowed it, the semaphore would be readable/writable by other users in the same group, for example. This includes access via the previous bug. - If the module refused to unload because there were active semaphores, then it might have deregistered one or more of the semaphore system calls before it noticed that there was a problem.
I'm not sure if this actually happened as the order that modules are discovered by the kernel linker depends on how the actual .ko file is linked. One can make the order deterministic by using a single module with a mod_event handler that explicitly registers syscalls (and deregisters during unload after any checks). This also fixes a race where even if the sem_module unloaded first it would have destroyed locks that the syscalls might be trying to access if they are still executing when they are unloaded. XXX: By the way, deregistering system calls doesn't do any blocking to drain any threads from the calls. - Some minor fixes to errno values on error. For example, sem_init() isn't documented to return ENFILE or EMFILE if we run out of semaphores the way that sem_open() can. Instead, it should return ENOSPC in that case. Other changes: - Kernel semaphores now use a hash table to manage the namespace of named semaphores nearly in a similar fashion to the POSIX shared memory object file descriptors. Kernel semaphores can now also have names longer than 14 chars (up to MAXPATHLEN) and can include subdirectories in their pathname. - The UID/GID permission checks for access to a named semaphore are now done via vaccess() rather than a home-rolled set of checks. - Now that kernel semaphores have an associated file object, the various MAC checks for POSIX semaphores accept both a file credential and an active credential. There is also a new posixsem_check_stat() since it is possible to fstat() a semaphore file descriptor. - A small set of regression tests (using the ksem API directly) is present in src/tools/regression/posixsem. Reported by: kris (1) Tested by: kris Reviewed by: rwatson (lightly) MFC after: 1 month	2008-06-27 05:39:04 +00:00
Ed Schouten	cc8945d204	Remove redundant checks from fcntl()'s F_DUPFD. Right now we perform some of the checks inside the fcntl()'s F_DUPFD operation twice. We first validate the `fd' argument. When finished, we validate the `arg' argument. These checks are also performed inside do_dup(). The reason we need to do this, is because fcntl() should return different errno's when the `arg' argument is out of bounds (EINVAL instead of EBADF). To prevent the redundant locking of the PROC_LOCK and FILEDESC_SLOCK, patch do_dup() to support the error semantics required by fcntl(). Approved by: philip (mentor)	2008-05-28 20:25:19 +00:00
Attilio Rao	258f4727f1	Replace direct atomic operation for the file refcount with the refcount interface. It also introduces the correct usage of memory barriers, as sometimes fdrop() and fhold() are used with shared locks, which don't use any release barrier.	2008-05-25 14:57:43 +00:00
Konstantin Belousov	82f4d64035	Implement the per-open file data for the cdev. The patch does not change the cdevsw KBI. Management of the data is provided by the functions int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr); int devfs_get_cdevpriv(void **datap); void devfs_clear_cdevpriv(void); All of the functions are supposed to be called from the cdevsw method contexts. - devfs_set_cdevpriv assigns the priv as private data for the file descriptor which is used to initiate currently performed driver operation. dtr is the function that will be called when either the last reference to the file goes away, the device is destroyed or devfs_clear_cdevpriv is called. - devfs_get_cdevpriv is the obvious accessor. - devfs_clear_cdevpriv allows clearing the private data for the still open file. Implementation keeps the driver-supplied pointers in the struct cdev_privdata, that is referenced both from the struct file and struct cdev, and cannot outlive any of the referee. Man pages will be provided after the KPI stabilizes. Reviewed by: jhb Useful suggestions from: jeff, antoine Debugging help and tested by: pho MFC after: 1 month	2008-05-21 09:31:44 +00:00
Kris Kennaway	5894445dad	* Correct a mis-merge that leaked the PROC_LOCK [1] * Return ENOENT on error instead of 0 [2] Submitted by: rdivacky [1], kib [2]	2008-04-26 13:16:55 +00:00
Kris Kennaway	b1ba81d948	fdhold can return NULL, so add the one remaining missing check for this condition. Reviewed by: attilio MFC after: 1 week	2008-04-24 22:08:36 +00:00
Doug Rabson	dfdcada31e	Add the new kernel-mode NFS Lock Manager. To use it instead of the user-mode lock manager, build a kernel with the NFSLOCKD option and add '-k' to 'rpc_lockd_flags' in rc.conf. Highlights include: * Thread-safe kernel RPC client - many threads can use the same RPC client handle safely with replies being de-multiplexed at the socket upcall (typically driven directly by the NIC interrupt) and handed off to whichever thread matches the reply. For UDP sockets, many RPC clients can share the same socket. This allows the use of a single privileged UDP port number to talk to an arbitrary number of remote hosts. * Single-threaded kernel RPC server. Adding support for multi-threaded server would be relatively straightforward and would follow approximately the Solaris KPI. A single thread should be sufficient for the NLM since it should rarely block in normal operation. * Kernel mode NLM server supporting cancel requests and granted callbacks. I've tested the NLM server reasonably extensively - it passes both my own tests and the NFS Connectathon locking tests running on Solaris, Mac OS X and Ubuntu Linux. * Userland NLM client supported. While the NLM server doesn't have support for the local NFS client's locking needs, it does have to field async replies and granted callbacks from remote NLMs that the local client has contacted. We relay these replies to the userland rpc.lockd over a local domain RPC socket. * Robust deadlock detection for the local lock manager. In particular it will detect deadlocks caused by a lock request that covers more than one blocking request. As required by the NLM protocol, all deadlock detection happens synchronously - a user is guaranteed that if a lock request isn't rejected immediately, the lock will eventually be granted. The old system allowed for a 'deferred deadlock' condition where a blocked lock request could wake up and find that some other deadlock-causing lock owner had beaten them to the lock. 
* Since both local and remote locks are managed by the same kernel locking code, local and remote processes can safely use file locks for mutual exclusion. Local processes have no fairness advantage compared to remote processes when contending to lock a region that has just been unlocked - the local lock manager enforces a strict first-come first-served model for both local and remote lockers. Sponsored by: Isilon Systems PR: 95247 107555 115524 116679 MFC after: 2 weeks	2008-03-26 15:23:12 +00:00
Maxim Sobolev	073d8ba485	Revert previous change - it appears that the limit I was hitting was a maxsockets limit, not maxfiles limit. The question remains why those limits are handled differently (with error code for maxfiles but with sleep for maxsockets), but those would be addressed in a separate commit if necessary. Requested by: rwhatson, jeff	2008-03-19 09:58:25 +00:00
Robert Watson	237fdd787b	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink	2008-03-16 10:58:09 +00:00
Maxim Sobolev	c9370ff4d0	Properly set size of the file_zone to match kern.maxfiles parameter. Otherwise the parameter is no-op, since zone by default limits number of descriptors to some 12K entries. Attempt to allocate more ends up sleeping on zonelimit. MFC after: 2 weeks	2008-03-16 06:21:30 +00:00
Antoine Brodin	e3ad7f6626	Introduce a new F_DUP2FD command to fcntl(2), for compatibility with Solaris and AIX. fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionally equivalent. Document it. Add some regression tests (identical to the dup2(2) regression tests). PR: 120233 Submitted by: Jukka Ukkonen Approved by: rwaston (mentor) MFC after: 1 month	2008-03-08 22:02:21 +00:00
Dag-Erling Smørgrav	60e15db992	This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload consists of the null-terminated name and the contents of any structure you wish to record. A new ktrstruct() function constructs and emits a KTR_STRUCT record. It is accompanied by convenience macros for struct stat and struct sockaddr. In kdump(1), KTR_STRUCT records are handled by a dispatcher function that runs stringent sanity checks on its contents before handing it over to individual decoding functions for each type of structure. Currently supported structures are struct stat and struct sockaddr for the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK and AF_IPX is present but disabled, as I am unable to test it properly. Since 's' was already taken, the letter 't' is used by ktrace(1) to enable KTR_STRUCT trace points, and in kdump(1) to enable their decoding. Derived from patches by Andrew Li <andrew2.li@citi.com>. PR: kern/117836 MFC after: 3 weeks	2008-02-23 01:01:49 +00:00
Simon L. B. Nielsen	1b7089994c	Fix sendfile(2) write-only file permission bypass. Security: FreeBSD-SA-08:03.sendfile Submitted by: kib	2008-02-14 11:44:31 +00:00
Joe Marcus Clarke	f280594937	Add support for displaying a process' current working directory, root directory, and jail directory within procstat. While this functionality is available already in fstat, encapsulating it in the kern.proc.filedesc sysctl makes it accessible without using kvm and thus without needing elevated permissions. The new procstat output looks like: PID COMM FD T V FLAGS REF OFFSET PRO NAME 76792 tcsh cwd v d -------- - - - /usr/src 76792 tcsh root v d -------- - - - / 76792 tcsh 15 v c rw------ 16 9130 - - 76792 tcsh 16 v c rw------ 16 9130 - - 76792 tcsh 17 v c rw------ 16 9130 - - 76792 tcsh 18 v c rw------ 16 9130 - - 76792 tcsh 19 v c rw------ 16 9130 - - I am also bumping __FreeBSD_version for this as this new feature will be used in at least one port. Reviewed by: rwatson Approved by: rwatson	2008-02-09 05:16:26 +00:00
Robert Watson	07dd4a31b5	Export a type for POSIX SHM file descriptors via kern.proc.filedesc as used by procstat, or SHM descriptors will show up as type unknown in userspace.	2008-01-20 19:55:52 +00:00
Attilio Rao	22db15c06f	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the unnecessary extra argument and pass curthread explicitly to lower layer functions, when necessary. The KPI is broken by this change, which should affect several ports, so a version bump and manpage update will be committed separately. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>	2008-01-13 14:44:15 +00:00
Attilio Rao	cb05b60a89	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modification makes the code cleaner and, in particular, removes an annoying dependency, helping the next lockmgr() cleanup. The KPI, obviously, changes as a result. Manpage and FreeBSD_version will be updated through further commits. As a side note, it is worth saying that the next commits will address a similar cleanup of VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>	2008-01-10 01:10:58 +00:00
John Baldwin	8e38aeff17	Add a new file descriptor type for IPC shared memory objects and use it to implement shm_open(2) and shm_unlink(2) in the kernel: - Each shared memory file descriptor is associated with a swap-backed vm object which provides the backing store. Each descriptor starts off with a size of zero, but the size can be altered via ftruncate(2). The shared memory file descriptors also support fstat(2). read(2), write(2), ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared memory file descriptors. - shm_open(2) and shm_unlink(2) are now implemented as system calls that manage shared memory file descriptors. The virtual namespace that maps pathnames to shared memory file descriptors is implemented as a hash table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash of the pathname. - As an extension, the constant 'SHM_ANON' may be specified in place of the path argument to shm_open(2). In this case, an unnamed shared memory file descriptor will be created similar to the IPC_PRIVATE key for shmget(2). Note that the shared memory object can still be shared among processes by sharing the file descriptor via fork(2) or sendmsg(2), but it is unnamed. This effectively serves to implement the getmemfd() idea bandied about the lists several times over the years. - The backing store for shared memory file descriptors are garbage collected when they are not referenced by any open file descriptors or the shm_open(2) virtual namespace. Submitted by: dillon, peter (previous versions) Submitted by: rwatson (I based this on his version) Reviewed by: alc (suggested converting getmemfd() to shm_open())	2008-01-08 21:58:16 +00:00
John Baldwin	e46502943a	Make ftruncate a 'struct file' operation rather than a vnode operation. This makes it possible to support ftruncate() on non-vnode file types in the future. - 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on a given file descriptor. - ftruncate() moves to kern/sys_generic.c and now just fetches a file object and invokes fo_truncate(). - The vnode-specific portions of ftruncate() move to vn_truncate() in vfs_vnops.c which implements fo_truncate() for vnode file types. - Non-vnode file types return EINVAL in their fo_truncate() method. Submitted by: rwatson	2008-01-07 20:05:19 +00:00
Jeff Roberson	a57decdf32	- In sysctl_kern_file skip fdps with negative lastfiles. This can happen if there are no files open. Accounting for these can eventually return a negative value for olenp causing sysctl to crash with a bad malloc. Reported by: Pawel Worach <pawel.worach@gmail.com>	2008-01-03 01:26:59 +00:00
Jeff Roberson	397c19d175	Remove explicit locking of struct file. - Introduce a finit() which is used to initialize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid. - Protect f_flag and f_count with atomic operations. - Remove the global list of all files and associated accounting. - Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets. - Mark sockets in the accept queue so we don't incorrectly gc them. Tested by: kris, pho	2007-12-30 01:42:15 +00:00
Robert Watson	cc43c38c87	Add two new sysctls in support of the forthcoming procstat(1) to support its -f and -v arguments: kern.proc.filedesc - dump file descriptor information for a process, if debugging is permitted, including socket addresses, open flags, file offsets, file paths, etc. kern.proc.vmmap - dump virtual memory mapping information for a process, if debugging is permitted, including layout and information on underlying objects, such as the type of object and path. These provide a superset of the information historically available through the now-deprecated procfs(4), and are intended to be exported in an ABI-robust form.	2007-12-02 10:10:27 +00:00
Robert Watson	0bf686c125	Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which previously conditionally acquired Giant based on debug.mpsafenet. As that has now been removed, they are no longer required. Removing them significantly simplifies error-handling in the socket layer, eliminating quite a bit of unwinding of locking in error cases. While here clean up the now unneeded opt_net.h, which previously was used for the NET_WITH_GIANT kernel option. Clean up some related gotos for consistency. Reviewed by: bz, csjp Tested by: kris Approved by: re (kensmith)	2007-08-06 14:26:03 +00:00
Jeff Roberson	f6c1ecca50	- Use explicit locking in the various fcntl case statements so that we can acquire shared filedescriptor locks in the appropriate cases. - Remove Giant from calls that issue ioctls. The ioctl path has been mpsafe for some time now. - Only acquire giant for VOP_ADVLOCK when the filesystem requires giant. advlock is now mpsafe. Reviewed by: rwatson Approved by: re	2007-07-03 21:26:06 +00:00
Robert Watson	7251b7863c	Rather than passing SUSER_RUID into priv_check_cred() to specify when a privilege is checked against the real uid rather than the effective uid, instead decide which uid to use in priv_check_cred() based on the privilege passed in. We use the real uid for PRIV_MAXFILES, PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of SUSER_RUID; there are now no flags defined for priv_check_cred(). Obtained from: TrustedBSD Project	2007-06-16 23:41:43 +00:00
Konstantin Belousov	9e223287c0	Revert UF_OPENING workaround for CURRENT. Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation argument from being file descriptor index into the pointer to struct file. Proposed and reviewed by: jhb Reviewed by: daichi (unionfs) Approved by: re (kensmith)	2007-05-31 11:51:53 +00:00
Konstantin Belousov	5c76452f8f	Mark the filedescriptor table entries with VOP_OPEN being performed for them as UF_OPENING. Disable closing of those entries. This should fix the crashes caused by devfs_open() (and fifo_open()) dereferencing struct file * by index, while the filedescriptor is closed by a parallel thread. Idea by: tegge Reviewed by: tegge (previous version of patch) Tested by: Peter Holm Approved by: re (kensmith) MFC after: 3 weeks	2007-05-04 14:23:29 +00:00
John Baldwin	06e043fb20	Avoid a lot of code duplication by using kern_open() to open /dev/null in fdcheckstd() instead of a stripped down version of kern_open()'s code. MFC after: 1 week Reviewed by: cperciva	2007-04-26 18:01:19 +00:00
Robert Watson	5e3f7694b1	Replace custom file descriptor array sleep lock constructed using a mutex and flags with an sxlock. This leads to a significant and measurable performance improvement as a result of access to shared locking for frequent lookup operations, reduced general overhead, and reduced overhead in the event of contention. All of these are important for threaded applications where simultaneous access to a shared file descriptor array occurs frequently. Kris has reported 2x-4x transaction rate improvements on 8-core MySQL benchmarks; smaller improvements can be expected for many workloads as a result of reduced overhead. - Generally eliminate the distinction between "fast" and regular acquisition of the filedesc lock; the plan is that they will now all be fast. Change all locking instances to either shared or exclusive locks. - Correct a bug (pointed out by kib) in fdfree() where previously msleep() was called without the mutex held; sx_sleep() is now always called with the sxlock held exclusively. - Universally hold the struct file lock over changes to struct file, rather than the filedesc lock or no lock. Always update the f_ops field last. A further memory barrier is required here in the future (discussed with jhb). - Improve locking and reference management in linux_at(), which fails to properly acquire vnode references before using vnode pointers. Annotate improper use of vn_fullpath(), which will be replaced at a future date. In fcntl(), we conservatively acquire an exclusive lock, even though in some cases a shared lock may be sufficient, which should be revisited. The dropping of the filedesc lock in fdgrowtable() is no longer required as the sxlock can be held over the sleep operation; we should consider removing that (pointed out by attilio). Tested by: kris Discussed with: jhb, kris, attilio, jeff	2007-04-04 09:11:34 +00:00

655 Commits