freebsd-skq

Author	SHA1	Message	Date
kib	12e8232d4d	Most users of pipe(2) do not call fstat(2) on the returned pipe descriptors. Optimize for the case, by lazily allocating the pipe inode number at the fstat(2) time. If alloc_unr(9) returns failure, do not fail fstat(2), since uses of inode numbers are even rare then fstat(2), but provide zero inode forever. Note that alloc_unr() failure is unlikely due to total number of pipes in the system limited by the number of file descriptors. Based on the submission by: gianni MFC after: 2 weeks	2011-12-06 11:24:03 +00:00
kib	132ad7aa9b	If alloc_unr() call in the pipe_create() failed, then pipe->pipe_ino is -1. But, because ino_t is unsigned, this case was not covered by the test ino > 0 in pipeclose(), leading to the free_unr(-1). Fix it by explicitely comparing with 0 and -1. [1] Do no access freed memory, the inode number was cached to prevent access to cpipe after it possibly was freed, but I failed to commit the right patch. Noted by: gianni [1] Pointy hat to: kib MFC after: 3 days	2011-12-01 11:36:41 +00:00
kib	87f923aeea	Supply unique (st_dev, st_ino) value pair for the fstat(2) done on the pipes. Reviewed by: jhb, Peter Jeremy <peterjeremy acm org> MFC after: 2 weeks	2011-10-05 16:56:06 +00:00
kmacy	99851f359e	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
attilio	683d7a54ce	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before to destroy the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before to destroy the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which installs selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks	2011-08-25 15:51:54 +00:00
kib	011f42054d	Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz)	2011-08-16 20:07:47 +00:00
kib	eb730d92e4	After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) and remove the falloc() version that lacks flag argument. This is done to reduce the KPI bloat. Requested by: jhb X-MFC-note: do not	2011-04-01 13:28:34 +00:00
alc	e498ba0188	Update a comment. The sending process has not mapped the buffer pages since before r127501. Strictly speaking, the buffer pages are not "wired". They remain in the paging queues. However, they are pinned in memory using vm_page_hold().	2011-03-20 15:04:43 +00:00
alc	971b02b7bc	Introduce and use a new VM interface for temporarily pinning pages. This new interface replaces the combined use of vm_fault_quick() and pmap_extract_and_hold() throughout the kernel. In collaboration with: kib@	2010-12-25 21:26:56 +00:00
alc	303f816df2	Implement and use a single optimized function for unholding a set of pages. Reviewed by: kib@	2010-12-17 22:41:22 +00:00
alc	bc80981f79	Update a comment: It no longer makes sense to talk about the page queues lock here.	2010-05-08 23:01:47 +00:00
kmacy	1dc1263413	On Alan's advice, rather than do a wholesale conversion on a single architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps. Supported by: Bitgravity Inc. Discussed with: alc, jeffr, and kib	2010-04-30 00:46:43 +00:00
ed	4f08ecd7ed	Rename st_timespec fields to st_tim for POSIX 2008 compliance. A nice thing about POSIX 2008 is that it finally standardizes a way to obtain file access/modification/change times in sub-second precision, namely using struct timespec, which we already have for a very long time. Unfortunately POSIX uses different names. This commit adds compatibility macros, so existing code should still build properly. Also change all source code in the kernel to work without any of the compatibility macros. This makes it all a less ambiguous. I am also renaming st_birthtime to st_birthtim, even though it was a local extension anyway. It seems Cygwin also has a st_birthtim.	2010-03-28 13:13:22 +00:00
rwatson	53eaed07bc	Use C99 initialization for struct filterops. Obtained from: Mac OS X Sponsored by: Apple Inc. MFC after: 3 weeks	2009-09-12 20:03:45 +00:00
kib	6c6bda868d	Fix poll(2) and select(2) for named pipes to return "ready for read" when all writers, observed by reader, exited. Use writer generation counter for fifo, and store the snapshot of the fifo generation in the f_seqcount field of struct file, that is otherwise unused for fifos. Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is equal to fifo' fi_wgen, and revert r89376. Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them. Note that the patch does not fix not returning POLLHUP for fifos. PR: kern/94772 Submitted by: bde (original version) Reviewed by: rwatson, jilles Approved by: re (kensmith) MFC after: 6 weeks (might be)	2009-07-07 09:43:44 +00:00
kib	e1cb2941d4	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland	2009-06-10 20:59:32 +00:00
cperciva	632fa45574	Prevent integer overflow in direct pipe write code from circumventing virtual-to-physical page lookups. [09:09] Add missing permissions check for SIOCSIFINFO_IN6 ioctl. [09:10] Fix buffer overflow in "autokey" negotiation in ntpd(8). [09:11] Approved by: so (cperciva) Approved by: re (not really, but SVN wants this...) Security: FreeBSD-SA-09:09.pipe Security: FreeBSD-SA-09:10.ipv6 Security: FreeBSD-SA-09:11.ntpd	2009-06-10 10:31:11 +00:00
rwatson	f4934662e5	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd	2009-06-05 14:55:22 +00:00
jhb	50289fd1c1	- Make maxpipekva a signed long rather than an unsigned long as overflow is more likely to be noticed with signed types. - Make amountpipekva a long as well to match maxpipekva. Discussed with: bde	2009-03-10 21:28:43 +00:00
jhb	80d9458a56	Adjust some variables (mostly related to the buffer cache) that hold address space sizes to be longs instead of ints. Specifically, the follow values are now longs: runningbufspace, bufspace, maxbufspace, bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace, hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a relatively small number (~ 44000) of buffers set in kern.nbuf would result in integer overflows resulting either in hangs or bogus values of hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see such problems. There was a check for a nbuf setting that would cause overflows in the auto-tuning of nbuf. I've changed it to always check and cap nbuf but warn if a user-supplied tunable would cause overflow. Note that this changes the ABI of several sysctls that are used by things like top(1), etc., so any MFC would probably require a some gross shims to allow for that. MFC after: 1 month	2009-03-09 19:35:20 +00:00
ed	8d12469978	Several cleanups related to pipe(2). - Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2) fills an array with two descriptors. - Remove EFAULT from the manual page. Because of the current calling convention, pipe(2) raises a segmentation fault when an invalid address is passed. - Introduce kern_pipe() to make it easier for binary emulations to implement pipe(2). - Make Linux binary emulation use kern_pipe(), which means we don't have to recover td_retval after calling the FreeBSD system call. Approved by: rdivacky Discussed on: arch	2008-11-11 14:55:59 +00:00
kib	bb95365b8c	Another problem caused by the knlist_cleardel() potentially dropping PIPE_MTX(). Since the pipe_present is cleared before (potentially) sleeping, the second thread may enter the pipeclose() for the reciprocal pipe end. The test at the end of the pipeclose() for the pipe_present == 0 would succeed, allowing the second thread to free the pipe memory. First threads then accesses the freed memory after being woken up. Properly track the closing state of the pipe in the pipe_present. Introduce the intermediate state that marks the pipe as mostly dismantled but might be sleeping waiting for the knote list to be cleared. Free the pipe pair memory only when both ends pass that point. Debugging help and tested by: pho Discussed with: jmg MFC after: 2 weeks	2008-05-23 11:14:03 +00:00
kib	c106911b42	Destruction of the pipe calls knlist_cleardel() to remove the knotes monitoring the pipe. The code sets pipe_present = 0 and enters knlist_cleardel(), where the PIPE_MTX might be dropped when knl->kl_list cannot be cleared due to influx knotes. If the following often encountered code fragment if (!(kn->kn_status & KN_DETACHED)) kn->kn_fop->f_detach(kn); knote_drop(kn, td); [1] is executed while the knlist lock is dropped, then the knote memory is freed by the knote_drop() without knote being removed from the knlist, since the filt_pipedetach() contains the following: if (kn->kn_filter == EVFILT_WRITE) { if (!cpipe->pipe_peer->pipe_present) { PIPE_UNLOCK(cpipe); return; Now, the memory may be reused in the zone, causing the access to the freed memory. I got the panics caused by the marker knote appearing on the knlist, that, I believe, manifestation of the issue. In the Peter Holm test scenarious, we got unkillable processes too. The pipe_peer that has the knote for write shall be present. Ignore the pipe_present value for EVFILT_WRITE in filt_pipedetach(). Debugging help and tested by: pho Discussed with: jmg MFC after: 2 weeks	2008-05-23 11:09:50 +00:00
jhb	f8a246b979	Make ftruncate a 'struct file' operation rather than a vnode operation. This makes it possible to support ftruncate() on non-vnode file types in the future. - 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on a given file descriptor. - ftruncate() moves to kern/sys_generic.c and now just fetches a file object and invokes fo_truncate(). - The vnode-specific portions of ftruncate() move to vn_truncate() in vfs_vnops.c which implements fo_truncate() for vnode file types. - Non-vnode file types return EINVAL in their fo_truncate() method. Submitted by: rwatson	2008-01-07 20:05:19 +00:00
jeff	ce18638805	Remove explicit locking of struct file. - Introduce a finit() which is used to initailize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid. - Protect f_flag and f_count with atomic operations. - Remove the global list of all files and associated accounting. - Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets. - Mark sockets in the accept queue so we don't incorrectly gc them. Tested by: kris, pho	2007-12-30 01:42:15 +00:00
jeff	4ec9caf00c	Refactor select to reduce contention and hide internal implementation details from consumers. - Track individual selecters on a per-descriptor basis such that there are no longer collisions and after sleeping for events only those descriptors which triggered events must be rescaned. - Protect the selinfo (per descriptor) structure with a mtx pool mutex. mtx pool mutexes were chosen to preserve api compatibility with existing code which does nothing but bzero() to setup selinfo structures. - Use a per-thread wait channel rather than a global wait channel. - Hide select implementation details in a seltd structure which is opaque to the rest of the kernel. - Provide a 'selsocket' interface for those kernel consumers who wish to select on a socket when they have no fd so they no longer have to be aware of select implementation details. Tested by: kris Reviewed on: arch	2007-12-16 06:21:20 +00:00
dumbbell	fff090c011	The kernel uses two ways to write data on a pipe: o buffered write, for chunks smaller than PIPE_MINDIRECT bytes o direct write, for everything else A call to writev(2) may receive struct iov of various size and the kernel may have to switch from one solution to the other. Before doing this, it must wake reader processes and any select/poll/kqueue up. This commit fixes a bug where select/poll/kqueue are not triggered when switching from buffered write to direct write. It adds calls to pipeselwakeup(). I give more details on freebsd-arch@: http://lists.freebsd.org/pipermail/freebsd-arch/2007-September/006790.html This should fix issues with Erlang (lang/erlang) and kqueue. Reported by: Rickard Green (Erlang)	2007-11-19 15:05:20 +00:00
rwatson	60570a92bf	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer	2007-10-24 19:04:04 +00:00
rwatson	60d85d52f5	Remove amountpipes counter for pipes -- this replicates the function of existing UMA statistics for pipes, and allows us to get rid of both the per-pipe dtor and two atomic operations per pipe required to maintain the counter.	2007-05-27 17:33:10 +00:00
rwatson	69938bd196	Further system call comment cleanup: - Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde) - Remove extra blank lines in some cases. - Add extra blank lines in some cases. - Remove no-op comments consisting solely of the function name, the word "syscall", or the system call name. - Add punctuation. - Re-wrap some comments.	2007-03-05 13:10:58 +00:00
pjd	6cc6a8d100	Use pipe_direct_write() optimization only if the data is in process' memory. This fixes sending data through pipe from the kernel. Fix suggested by: rwatson	2006-12-19 12:52:22 +00:00
rwatson	7beaaf5cd2	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA	2006-10-22 11:52:19 +00:00
rwatson	120490c1a5	Move some functions and definitions from uipc_socket2.c to uipc_socket.c: - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocate code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Statisticize soalloc() and sodealloc() as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month	2006-06-10 14:34:07 +00:00
glebius	e8ec4c3a0a	- In pipe() return the error returned by pipe_create(), rather then hardcoded ENFILES, which is incorrect. pipe_create() can fail due to ENOMEM. - Update manual page, describing ENOMEM return code. Reviewed by: arch	2006-01-30 08:25:04 +00:00
delphij	4ea00e0984	In pipe_write(): when uiomove() fails, do not spin on it forever. Submitted by: Kostik Belousov <kostikbel at gmail.com> on -current@ Message-ID: <20051216151016.GE84442@deviant.zoral.local> MFC After: 3 weeks	2005-12-16 18:32:39 +00:00
ssouhlal	efe31cd3da	Fix the recent panics/LORs/hangs created by my kqueue commit by: - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone	2005-07-01 16:28:32 +00:00
silby	ce62b5450e	Rearrange the kninit calls for both directions of a pipe so that they both happen before pipe backing allocation occurs. Previously, a pipe memory shortage would cause a panic due to a KNOTE call on an uninitialized si_note. Reported by: Peter Holm MFC after: 1 week	2005-01-17 07:56:28 +00:00
imp	20280f1431	/* -> /*- for copyright notices, minor format tweaks as necessary	2005-01-06 23:35:40 +00:00
rwatson	278f52e7a7	Correct a bug introduced in sys_pipe.c:1.179: in pipe_ioctl(), release the pipe mutex before calling fsetown(), as fsetown() may block. The sigio code protects the pipe sigio data using its own mutex, and the pipe reference count held by the caller will prevent the pipe from being prematurely garbage-collected. Discovered by: imp	2004-11-23 22:15:08 +00:00
phk	b3f08741db	Add missing break.	2004-11-16 06:57:52 +00:00
phk	8f49227258	Straighten the ioctl function out to have only one exit point.	2004-11-15 21:51:28 +00:00
phk	52da2f8e34	Introduce fdclose() which will clean an entry in a filedesc. Replace homerolled versions with call to fdclose(). Make fdunused() static to kern_descrip.c	2004-11-07 22:16:07 +00:00
silby	e3f2e32958	Major enhancements to pipe memory usage: - pipespace is now able to resize non-empty pipes; this allows for many more resizing opportunities - Backing is no longer pre-allocated for the reverse direction of pipes. This direction is rarely (if ever) used, so this cuts the amount of map space allocated to a pipe in half. - Pipe growth is now much more dynamic; a pipe will now grow when the total amount of data it contains and the size of the write are larger than the size of pipe. Previously, only individual writes greater than the size of the pipe would cause growth. - In low memory situations, pipes will now shrink during both read and write operations, where possible. Once the memory shortage ends, the growth code will cause these pipes to grow back to an appropriate size. - If the full PIPE_SIZE allocation fails when a new pipe is created, the allocation will be retried with SMALL_PIPE_SIZE. This helps to deal with the situation of a fragmented map after a low memory period has ended. - Minor documentation + code changes to support the above. In total, these changes increase the total number of pipes that can be allocated simultaneously, drastically reducing the chances that pipe allocation will fail. Performance appears unchanged due to dynamic resizing.	2004-08-16 01:27:24 +00:00
jmg	bc1805c6e8	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)	2004-08-15 06:24:42 +00:00
silby	e327e6bd59	Standardize pipe locking, ensuring that everything is locked via pipelock(), not via a mixture of mutexes and pipelock(). Additionally, add a few KASSERTS, and change some statements that should have been KASSERTS into KASSERTS. As a result of these cleanups, some segments of code have become significantly shorter and/or easier to read.	2004-08-03 02:59:15 +00:00
green	9532ab7116	* Add a "how" argument to uma_zone constructors and initialization functions so that they know whether the allocation is supposed to be able to sleep or not. * Allow uma_zone constructors and initialation functions to return either success or error. Almost all of the ones in the tree currently return success unconditionally, but mbuf is a notable exception: the packet zone constructor wants to be able to fail if it cannot suballocate an mbuf cluster, and the mbuf allocators want to be able to fail in general in a MAC kernel if the MAC mbuf initializer fails. This fixes the panics people are seeing when they run out of memory for mbuf clusters. * Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing the default. Both bmilekic and jeff have reviewed the changes made to make failable zone allocations work.	2004-08-02 00:18:36 +00:00
rwatson	882656d16c	Don't perform pipe endpoint locking during pipe_create(), as the pipe can't yet be referenced by other threads. In microbenchmarks, this appears to reduce the cost of pipe();close();close() on UP by 10%, and SMP by 7%. The vast majority of the cost of allocating a pipe remains VM magic. Suggested by: silby	2004-07-23 14:11:04 +00:00
silby	9a2c448df9	Fix a minor error in pipe_stat - st_size was always reported as 0 when direct writes kicked in. Whether this affected any applications is unknown.	2004-07-20 07:06:43 +00:00
alc	521fa57364	Revise the direct or optimized case to use uiomove_fromphys() by the reader instead of ephemeral mappings using pmap_qenter() by the writer. The writer is still, however, responsible for wiring the pages, just not mapping them. Consequently, the allocation of KVA for the direct case is unnecessary. Remove it and the sysctls limiting it, i.e., kern.ipc.maxpipekvawired and kern.ipc.amountpipekvawired. The number of temporarily wired pages is still, however, limited by kern.ipc.maxpipekva. Note: On platforms lacking a direct virtual-to-physical mapping, uiomove_fromphys() uses sf_bufs to cache ephemeral mappings. Thus, the number of available sf_bufs can influence the performance of pipes on platforms such i386. Surprisingly, I saw the greatest gain from this change on such a machine: lmbench's pipe bandwidth result increased from ~1050MB/s to ~1850MB/s on my 2.4GHz, 400MHz FSB P4 Xeon.	2004-03-27 19:50:23 +00:00
rwatson	50cda16803	Assert pipe mutex in pipeselwakeup(), as we manipulate pipe_state in a non-atomic manner. It appears to always be called with the mutex (good).	2004-02-26 00:18:22 +00:00

1 2 3 4 5

219 Commits