freebsd-dev

Author	SHA1	Message	Date
Davide Italiano	68cefd64ad	The variable 'error' in sys_poll() is initialized in declaration to value zero but in any case is overwritten by successive copyin(), making the previous initialization useless. Remove this. As an added bonus this fixes a style(9) bug. Discussed with: kib Approved by: gnn (mentor) MFC after: 3 days	2012-06-17 13:03:50 +00:00
Konstantin Belousov	8bb9a904d5	Instead of incomplete handling of read(2)/write(2) return values that does not fit into registers, declare that we do not support this case using CTASSERT(), and remove endianess-unsafe code to split return value into td_retval. While there, change the style of the sysctl debug.iosize_max_clamp definition. Requested by: bde MFC after: 3 weeks	2012-03-04 14:55:37 +00:00
Konstantin Belousov	526d0bd547	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month	2012-02-21 01:05:12 +00:00
Konstantin Belousov	56be1b9a7a	To limit amount of the kernel memory allocated, and to optimize the iteration over the fdsets, kern_select() limits the length of the fdsets copied in by the last valid file descriptor index. If any bit is set in a mask above the limit, current implementation ignores the filedescriptor, instead of returning EBADF. Fix the issue by scanning the tails of fdset before entering the select loop and returning EBADF if any bit above last valid filedescriptor index is set. The performance impact of the additional check is only imposed on the (somewhat) buggy applications that pass bad file descriptors to select(2) or pselect(2). PR: kern/155606, kern/162379 Discussed with: cognet, glebius Tested by: andreast (powerpc, all 64/32bit ABI combinations, big-endian), marius (sparc64, big-endian) MFC after: 2 weeks	2011-11-13 10:28:01 +00:00
Kip Macy	8451d0dd78	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
Attilio Rao	6aba400a70	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before to destroy the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before to destroy the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which installs selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks	2011-08-25 15:51:54 +00:00
Jonathan Anderson	d6f7248983	poll(2) implementation for capabilities. When calling poll(2) on a capability, unwrap first and then poll the underlying object. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-16 14:14:56 +00:00
Jonathan Anderson	d1b6899e83	Rename CAP__KEVENT to CAP__EVENT. Change the names of a couple of capability rights to be less FreeBSD-specific. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-12 14:26:47 +00:00
Robert Watson	a9d2f8d84f	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc	2011-08-11 12:30:23 +00:00
Konstantin Belousov	6d8fedda2c	For some file types, select code registers two selfd structures. E.g., for socket, when specified POLLIN\|POLLOUT in events, you would have one selfd registered for receiving socket buffer, and one for sending. Now, if both events are not ready to fire at the time of the initial scan, but are simultaneously ready after the sleep, pollrescan() would iterate over the pollfd struct twice. Since both times revents is not zero, returned value would be off by one. Fix this by recalculating the return value in pollout(). PR: kern/143029 MFC after: 2 weeks	2010-08-28 17:42:08 +00:00
John Baldwin	7a6f3d7890	Send SIGPIPE to the thread that issued the offending system call rather than to the entire process. Reported by: Anit Chakraborty Reviewed by: kib, deischen (concept) MFC after: 1 week	2010-06-29 20:44:19 +00:00
Konstantin Belousov	61e53a389f	Remove PIOLLHUP from the flags used to test for to set exceptfsd fd_set bits in select(2). It seems that historical behaviour is to not reporting exception on EOF, and several applications are broken. Reported by: Yoshihiko Sarumaru <ysarumaru gmail com> Discussed with: bde PR: ports/140934 MFC after: 2 weeks	2010-05-21 10:36:29 +00:00
Nathan Whitehorn	841c0c7ec7	Provide groundwork for 32-bit binary compatibility on non-x86 platforms, for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32 option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts of the kernel and enhances the freebsd32 compatibility code to support big-endian platforms. Reviewed by: kib, jhb	2010-03-11 14:49:06 +00:00
Konstantin Belousov	066d836b02	Current pselect(3) is implemented in usermode and thus vulnerable to well-known race condition, which elimination was the reason for the function appearance in first place. If sigmask supplied as argument to pselect() enables a signal, the signal might be delivered before thread called select(2), causing lost wakeup. Reimplement pselect() in kernel, making change of sigmask and sleep atomic. Since signal shall be delivered to the usermode, but sigmask restored, set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK should be cleared by ast() in case signal was not gelivered during syscall execution. Reviewed by: davidxu Tested by: pho MFC after: 1 month	2009-10-27 10:55:34 +00:00
Konstantin Belousov	b55ef216fe	kern_select(9) copies fd_set in and out of userspace in quantities of longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in or out 4 bytes more then the process expected. Calculate the amount of bytes to copy taking into account size of fd_set for the current process ABI. Diagnosed and tested by: Peter Jeremy <peterjeremy acm org> Reviewed by: jhb MFC after: 1 week	2009-09-09 20:59:01 +00:00
Konstantin Belousov	f2159cc790	Fix the conformance of poll(2) for sockets after r195423 by returning POLLHUP instead of POLLIN for several cases. Now, the tools/regression/poll results for FreeBSD are closer to that of the Solaris and Linux. Also, improve the POSIX conformance by explicitely clearing POLLOUT when POLLHUP is reported in pollscan(), making the fix global. Submitted by: bde Reviewed by: rwatson MFC after: 1 week	2009-08-23 12:44:15 +00:00
Robert Watson	64c9a4d9ab	Audit file descriptor and command arguments to ioctl(2). Approved by: re (audit argument blanket) MFC after: 1 week	2009-07-02 09:16:25 +00:00
Jeff Roberson	2141453eb4	- Use fd_lastfile + 1 as the upper bound on nd. This is more correct than using the size of the descriptor array. - A lock is not needed to fetch fd_lastfile. The results are stale the instant it is dropped. - Use a private mutex pool for select since the pool mutex is not used as a leaf. - Fetch the si_mtx pointer first before resorting to hashing to compute the mutex address. Reviewed by: McKusick Approved by: re (kib)	2009-07-01 20:43:46 +00:00
Robert Watson	14961ba789	Replace AUDIT_ARG() with variable argument macros with a set more more specific macros for each audit argument type. This makes it easier to follow call-graphs, especially for automated analysis tools (such as fxr). In MFC, we should leave the existing AUDIT_ARG() macros as they may be used by third-party kernel modules. Suggested by: brooks Approved by: re (kib) Obtained from: TrustedBSD Project MFC after: 1 week	2009-06-27 13:58:44 +00:00
Jeff Roberson	bf422e5f27	- Implement a lockless file descriptor lookup algorithm in fget_unlocked(). - Save old file descriptor tables created on expansion until the entire descriptor table is freed so that pointers may be followed without regard for expanders. - Mark the file zone as NOFREE so we may attempt to reference potentially freed files. - Convert several fget_locked() users to fget_unlocked(). This requires us to manage reference counts explicitly but reduces locking overhead in the common case.	2009-05-14 03:24:22 +00:00
Robert Watson	ae81968fd1	When writing out updated pollfd records when returning from poll(), only copy out the revents field, not the whole pollfd structure. Otherwise, if the events field is updated concurrently by another thread, that update may be lost. This issue apparently causes problems for the JDK on FreeBSD, which expects the Linux behavior of not updating all fields (somewhat oddly, Solaris does not implement the required behavior, but presumably our adaptation of the JDK is based on the Linux port?). MFC after: 2 weeks PR: kern/130924 Submitted by: Kurt Miller <kurt @ intricatesoftware.com> Discussed with: kib	2009-03-11 22:00:03 +00:00
Konstantin Belousov	125dcf8c7d	Extract the no_poll() and vop_nopoll() code into the common routine poll_no_poll(). Return a poll_no_poll() result from devfs_poll_f() when filedescriptor does not reference the live cdev, instead of ENXIO. Noted and tested by: hps MFC after: 1 week	2009-03-06 15:35:37 +00:00
Stephane E. Potvin	60b7f468da	Fix select on platforms where sizeof(long) != sizeof(int). This used to work by accident before the cleanup done in revision 187693. Approved by: kan (mentor)	2009-02-02 03:34:40 +00:00
Jeff Roberson	9cdacff1d3	- bit has to be fd_mask to work properly on 64bit platforms. Constants must also be cast even though the result ultimately is promoted to 64bit. - Correct a loop index upper bound in selscan().	2009-01-25 18:38:42 +00:00
Jeff Roberson	748b9df687	- Correct a typo in a comment. Noticed by: danger	2009-01-25 09:17:16 +00:00
Jeff Roberson	11b763df19	Fix errors introduced when I rewrote select. - Restructure selscan() and selrescan() to avoid producing extra selfps when we have a fd in multiple sets. As described below multiple selfps may still exist for other reasons. - Make selrescan() tolerate multiple selfds for a given descriptor set since sockets use two selinfos per fd. If an event on each selinfo fires selrescan() will see the descriptor twice. This could result in select() returning 2x the number of fds actually existing in fd sets. Reported by: mgleason@ncftp.com	2009-01-25 07:24:34 +00:00
David E. O'Brien	715457f6f6	Reverse if() logic to improve readability. Reviewed by: ru	2008-09-23 14:25:38 +00:00
Jeff Roberson	bd4e153568	- Remove stale comment. - In the last revision the code was changed to use maxfilesperproc rather than the per-process file limit to restrict the size of the poll array. This eliminates a significant source of process lock contention in multithreaded programs and is cheaper. This had been committed with the wrong batch of changes.	2008-03-19 07:33:16 +00:00
Jeff Roberson	374ae2a393	- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from requiring the per-process spinlock to only requiring the process lock. - Reflect these changes in the proc.h documentation and consumers throughout the kernel. This is a substantial reduction in locking cost for these fields and was made possible by recent changes to threading support.	2008-03-19 06:19:01 +00:00
John Baldwin	e46502943a	Make ftruncate a 'struct file' operation rather than a vnode operation. This makes it possible to support ftruncate() on non-vnode file types in the future. - 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on a given file descriptor. - ftruncate() moves to kern/sys_generic.c and now just fetches a file object and invokes fo_truncate(). - The vnode-specific portions of ftruncate() move to vn_truncate() in vfs_vnops.c which implements fo_truncate() for vnode file types. - Non-vnode file types return EINVAL in their fo_truncate() method. Submitted by: rwatson	2008-01-07 20:05:19 +00:00
Jeff Roberson	397c19d175	Remove explicit locking of struct file. - Introduce a finit() which is used to initailize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid. - Protect f_flag and f_count with atomic operations. - Remove the global list of all files and associated accounting. - Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets. - Mark sockets in the accept queue so we don't incorrectly gc them. Tested by: kris, pho	2007-12-30 01:42:15 +00:00
Jeff Roberson	ace8398da0	Refactor select to reduce contention and hide internal implementation details from consumers. - Track individual selecters on a per-descriptor basis such that there are no longer collisions and after sleeping for events only those descriptors which triggered events must be rescaned. - Protect the selinfo (per descriptor) structure with a mtx pool mutex. mtx pool mutexes were chosen to preserve api compatibility with existing code which does nothing but bzero() to setup selinfo structures. - Use a per-thread wait channel rather than a global wait channel. - Hide select implementation details in a seltd structure which is opaque to the rest of the kernel. - Provide a 'selsocket' interface for those kernel consumers who wish to select on a socket when they have no fd so they no longer have to be aware of select implementation details. Tested by: kris Reviewed on: arch	2007-12-16 06:21:20 +00:00
Julian Elischer	431f890614	generally we are interested in what thread did something as opposed to what process. Since threads by default have teh name of the process unless over-written with more useful information, just print the thread name instead.	2007-11-14 06:21:24 +00:00
Peter Wemm	c2815ad564	Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate Approved by: re (kensmith)	2007-07-04 22:57:21 +00:00
Jeff Roberson	982d11f836	Commit 14/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling sychronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)	2007-06-05 00:00:57 +00:00
Alan Cox	fa75abb0d2	Remove unneeded include files.	2007-05-01 06:35:54 +00:00
Robert Watson	5e3f7694b1	Replace custom file descriptor array sleep lock constructed using a mutex and flags with an sxlock. This leads to a significant and measurable performance improvement as a result of access to shared locking for frequent lookup operations, reduced general overhead, and reduced overhead in the event of contention. All of these are imported for threaded applications where simultaneous access to a shared file descriptor array occurs frequently. Kris has reported 2x-4x transaction rate improvements on 8-core MySQL benchmarks; smaller improvements can be expected for many workloads as a result of reduced overhead. - Generally eliminate the distinction between "fast" and regular acquisisition of the filedesc lock; the plan is that they will now all be fast. Change all locking instances to either shared or exclusive locks. - Correct a bug (pointed out by kib) in fdfree() where previously msleep() was called without the mutex held; sx_sleep() is now always called with the sxlock held exclusively. - Universally hold the struct file lock over changes to struct file, rather than the filedesc lock or no lock. Always update the f_ops field last. A further memory barrier is required here in the future (discussed with jhb). - Improve locking and reference management in linux_at(), which fails to properly acquire vnode references before using vnode pointers. Annotate improper use of vn_fullpath(), which will be replaced at a future date. In fcntl(), we conservatively acquire an exclusive lock, even though in some cases a shared lock may be sufficient, which should be revisited. The dropping of the filedesc lock in fdgrowtable() is no longer required as the sxlock can be held over the sleep operation; we should consider removing that (pointed out by attilio). Tested by: kris Discussed with: jhb, kris, attilio, jeff	2007-04-04 09:11:34 +00:00
Robert Watson	873fbcd776	Further system call comment cleanup: - Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde) - Remove extra blank lines in some cases. - Add extra blank lines in some cases. - Remove no-op comments consisting solely of the function name, the word "syscall", or the system call name. - Add punctuation. - Re-wrap some comments.	2007-03-05 13:10:58 +00:00
Robert Watson	0c14ff0eb5	Remove 'MPSAFE' annotations from the comments above most system calls: all system calls now enter without Giant held, and then in some cases, acquire Giant explicitly. Remove a number of other MPSAFE annotations in the credential code and tweak one or two other adjacent comments.	2007-03-04 22:36:48 +00:00
Bruce M Simpson	6f7ca813c4	Do not dispatch SIGPIPE from the generic write path for a socket; with this patch the code behaves according to the comment on the line above. Without this patch, a socket could cause SIGPIPE to be delivered to its process, once with SO_NOSIGPIPE set, and twice without. With this patch, the kernel now passes the sigpipe regression test. Tested by: Anton Yuzhaninov MFC after: 1 week	2007-03-01 19:20:25 +00:00
Ruslan Ermilov	a1b0a18096	Prevent IOC_IN with zero size argument (this is only supported if backward copatibility options are present) from attempting to free memory that wasn't allocated. This is an old bug, and previously it would attempt to free a null pointer. I noticed this bug when working on the previous revision, but forgot to fix it. Security: local DoS Reported by: Peter Holm MFC after: 3 days	2006-10-14 19:01:55 +00:00
Ruslan Ermilov	9fddcc6661	Fix our ioctl(2) implementation when the argument is "int". New ioctls passing integer arguments should use the _IOWINT() macro. This fixes a lot of ioctl's not working on sparc64, most notable being keyboard/syscons ioctls. Full ABI compatibility is provided, with the bonus of fixing the handling of old ioctls on sparc64. Reviewed by: bde (with contributions) Tested by: emax, marius MFC after: 1 week	2006-09-27 19:57:02 +00:00
John Baldwin	d9f4623307	- Split ioctl() up into ioctl() and kern_ioctl(). The kern_ioctl() assumes that the 'data' pointer is already setup to point to a valid KVM buffer or contains the copied-in data from userland as appropriate (ioctl(2) still does this). kern_ioctl() takes care of looking up a file pointer, implementing FIONCLEX and FIOCLEX, and calling fi_ioctl(). - Use kern_ioctl() to implement xenix_rdchk() instead of using the stackgap and mark xenix_rdchk() MPSAFE.	2006-07-08 20:12:14 +00:00
John Baldwin	af56abaab5	Return error from fget_write() rather than hardcoding EBADF now that fget_write() DTRT. Requested by: bde	2006-01-06 16:34:22 +00:00
John Baldwin	e730167f16	Remove XXX comments complaining that write(2) on a read-only descriptor returns EBADF. That errno is correct and is mandated by POSIX. It also goes back to revision 1.1 of our CVS history (i.e. 4.4BSD). The _fget() function should probably also be upated as it currently returns EINVAL in that case rather than EBADF. (It does return EBADF for reads on a write-only descriptor without any XXX comments oddly enough.) Discussed with: scottl, grog, mjacob, bde	2006-01-05 22:20:31 +00:00
John Baldwin	bcd9e0dd20	- Add two new system calls: preadv() and pwritev() which are like readv() and writev() except that they take an additional offset argument and do not change the current file position. In SAT speak: preadv:readv::pread:read and pwritev:writev::pwrite:write. - Try to reduce code duplication some by merging most of the old kern_foov() and dofilefoo() functions into new dofilefoo() functions that are called by kern_foov() and kern_pfoov(). The non-v functions now all generate a simple uio on the stack from the passed in arguments and then call kern_foov(). For example, read() now just builds a uio and calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev(). PR: kern/80362 Submitted by: Marc Olzheim marcolz at stack dot nl (1) Approved by: re (scottl) MFC after: 1 week	2005-07-07 18:17:55 +00:00
Peter Wemm	2de92a386e	Conditionally weaken sys_generic.c rev 1.136 to allow certain dubious ioctl numbers in backwards compatability mode. eg: an IOC_IN ioctl with a size of zero. Traditionally this was what you did before IOC_VOID existed, and we had some established users of this in the tree, namely procfs. Certain 3rd party drivers with binary userland components also have this too. This is necessary to have 4.x and 5.x binaries use these ioctl's. We found this at work when trying to run 4.x binaries. Approved by: re	2005-06-30 00:19:08 +00:00
John Baldwin	b88ec951e1	Implement kern_adjtime(), kern_readv(), kern_sched_rr_get_interval(), kern_settimeofday(), and kern_writev() to allow for further stackgap reduction in the compat ABIs.	2005-03-31 22:51:18 +00:00
Colin Percival	e5e6a46460	Declare "cnt" (a number of bytes to read or write) as an "ssize_t", not as a "long" in dofileread() and dofilewrite(). Discussed with: jhb	2005-02-10 20:19:17 +00:00
Poul-Henning Kamp	4f8d23d662	Previously a read of zero bytes got handled in devfs:vop_read() but I missed that when the vnode bypass was introduced. Deal with zero length transfers before we even get to fo_ops->fo_read(). Found by: Slawa Olhovchenkov <slwzxy.spb.ru@zxy.spb.ru> PR: 75758	2005-01-25 09:15:32 +00:00

1 2 3 4

190 Commits