freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	8c629bdf05	The m_extadd() can fail due to memory allocation failure, thus: - Make it return int, not void. - Add wait parameter. - Update MEXTADD() macro appropriately, defaults to M_NOWAIT, as before this change. Sponsored by: Nginx, Inc.	2013-03-12 12:12:16 +00:00
Alexander Motin	0dbf17e6eb	Make kern_nanosleep() and pause_sbt() use per-CPU sleep queues. This removes significant sleep queue lock congestion on multithreaded microbenchmarks, making them scale to multiple CPUs almost linearly.	2013-03-12 06:58:49 +00:00
Pawel Jakub Dawidek	be26ba7cd3	Fix memory leak when one process sends a descriptor over a UNIX domain socket, but the other process exits before receiving it.	2013-03-11 22:59:07 +00:00
Michael Tuexen	fbb3471022	Return an error if sctp_peeloff() fails because a socket can't be allocated. MFC after: 3 days	2013-03-11 17:43:55 +00:00
Andre Oppermann	a7aea132cf	Bring back the comment on the sizing of the callout array that got lost in r248031. Requested by: alc, alfred	2013-03-10 22:55:35 +00:00
Davide Italiano	c5904471dc	Fixup r248032: Change size requested to malloc(9) now that callwheel buckets are callout_list and not callout_tailq anymore. This change was already there but it seems it got lost after code churn in r248032. Reported by: alc, kib	2013-03-09 20:03:10 +00:00
Attilio Rao	1fc8c346d5	Improve UMTX_PROFILING: - Use u_int values for length and max_length values - Add a way to reset the max_length heuristic in order to have the possibility to reuse the mechanism consecutively without rebooting the machine - Add a way to quickly display the top 5 contended buckets in the system for the max_length value. This should give a quick overview on the quality of the hash table distribution. Sponsored by: EMC / Isilon storage division Reviewed by: jeff, davide	2013-03-09 15:31:19 +00:00
Konstantin Belousov	7a61281f22	Correct the lock class for the vm object lock. Reported and tested by: joel	2013-03-09 10:16:08 +00:00
Alexander Motin	21a37a7196	Rework overflow checks of r247898 so that an overly "intelligent" compiler cannot optimize them out. Submitted by: bde	2013-03-09 09:07:13 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages is accessed for reading purposes. The change is mostly mechanical but a few notes are reported: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Andre Oppermann	15ae0c9af9	Move the callout subsystem initialization to its own SYSINIT() from being indirectly called via cpu_startup()+vm_ksubmap_init(). The boot order position remains the same at SI_SUB_CPU. Allocation of the callout array is changed to standard kernel malloc from a slightly obscure direct kernel_map allocation. kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init() to better describe its purpose. kern_timeout_callwheel_init() is removed simplifying the per-cpu initialization. Reviewed by: davide	2013-03-08 10:37:17 +00:00
Andre Oppermann	f8ccf82a4c	Move the auto-sizing of the callout array from init_param2() to kern_timeout_callwheel_alloc() where it is actually used. This is a mechanical move and no tuning parameters are changed. The pre-allocated callout array is only used for legacy timeout(9) calls and is only allocated and active on cpu0. Eventually all remaining users of timeout(9) should switch to the callout_* API. Reviewed by: davide	2013-03-08 10:14:58 +00:00
Alexander Motin	836972b877	Fix off-by-one error in nanoseconds validation. Submitted by: bde	2013-03-07 16:50:07 +00:00
Ian Lepore	9a2bff7ca6	Call sched_prio() to immediately change the priority of the thread in response to an rtprio_thread() call, when the priority is different than the old priority, and either the old or the new priority class is not RTP_PRIO_NORMAL (timeshare). The reasoning for the second half of the test is that if it's a change in timeshare priority, then the scheduler is going to adjust that priority in a way that completely wipes out the requested change anyway, so what's the point? (If that's not true, then allowing a thread to change its own timeshare priority would subvert the scheduler's adjustments and let a cpu-bound thread monopolize the cpu; if allowed at all, that should require privileges.) On the other hand, if either the old or new priority class is not timeshare, then the scheduler doesn't make automatic adjustments, so we should honor the request and make the priority change right away. The reason the old class gets caught up in this is the very reason for this change: when thread A changes the priority of its child thread B from idle back to timeshare, thread B never actually gets moved to a timeshare-range run queue unless there are some idle cycles available to allow it to first get scheduled again as an idle thread. Reviewed by: jhb@	2013-03-07 02:53:29 +00:00
Alexander Motin	b5ea3779da	Reduce minimal time intervals of setitimer(2) from 1/HZ to 1/(16*HZ) by using callout_reset_sbt() instead of callout_reset(). We can't remove the lower limit completely in this case because of significant processing overhead, caused by the inability to use direct callout execution due to using the process mutex in the callout handler for sending the SIGALRM signal. With support of periodic events that would allow an unprivileged user to abuse the system. Reviewed by: davide	2013-03-06 22:40:47 +00:00
Alexander Motin	980c545d76	Fix time math overflows and improve zero intervals handling in poll(), select(), nanosleep() and kevent() functions after calloutng changes. Reported by: bde	2013-03-06 19:37:38 +00:00
Fabien Thomas	d49302aead	Add a generic way to call per event allocate / release function. Reviewed by: mav MFC after: 1 month	2013-03-05 10:18:48 +00:00
Davide Italiano	ac42a1726a	Complete r247813: Use true/false instead of TRUE/FALSE. Reported by: attilio Requested by: jhb	2013-03-04 21:52:12 +00:00
Davide Italiano	a4a3ce9919	Use C99 'bool' rather than Machish 'boolean_t'. Requested by: jhb	2013-03-04 21:09:22 +00:00
Davide Italiano	40e794ab19	MFcalloutng: - Rewrite kevent() timeout implementation to allow sub-tick precision. - Make the interval timings for EVFILT_TIMER more accurate. This also removes a hack introduced in r238424. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 16:55:16 +00:00
Davide Italiano	cf5e4fe6bb	MFcalloutng: Fix kern_select() and sys_poll() so that they can handle sub-tick precision for timeouts (in the same fashion it was done for nanosleep() in r247797). Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 16:41:27 +00:00
Davide Italiano	4601bab1fb	MFcalloutng (r244251 with minor changes): Specify that precision of 0.5s is enough for resource limitation. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 16:25:12 +00:00
Davide Italiano	c38250c9b9	MFcalloutng (r244255 by mav, with minor changes): Specify that syslog doesn't need exactly 5 wakeups per second. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 16:07:55 +00:00
Davide Italiano	098176f0d0	MFcalloutng: kern_nanosleep() is now converted to use tsleep_sbt(). With this change nanosleep() and usleep() can handle sub-tick precision for timeouts. Also, try to help coalesce events by passing to tsleep_sbt() a precision value calculated as a percentage of the sleep time. This percentage is 5% by default, but it can be tuned according to users' needs via the sysctl interface. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 15:57:41 +00:00
Davide Italiano	037637812d	Fix build with DIAGNOSTIC/CALLOUT_PROFILING options turned on. Reported by: kib, David Wolfskill <david at catwhisker dot org> Pointy-hat to: davide	2013-03-04 15:03:52 +00:00
Davide Italiano	24e48c6d5b	MFcalloutng: Introduce sbt variants of msleep(), msleep_spin(), pause(), tsleep() in the KPI, allowing to specify timeout in 'sbintime_t' rather than ticks. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 12:48:41 +00:00
Davide Italiano	461537356a	MFcalloutng: Extend condvar(9) KPI introducing sbt variant of cv_timedwait. This relies on the previously committed sleepq_set_timeout_sbt(). Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 12:20:48 +00:00
Davide Italiano	965ac611ec	MFcalloutng: Convert sleepqueue(9) bits to the new callout KPI. Take advantage of the possibility to run callback directly from hw interrupt context. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil	2013-03-04 11:51:46 +00:00
Davide Italiano	dbd2e1677f	MFcalloutng (r244355): Make loadavg calculation callout direct. There are several reasons for it: - it is very simple and is not worth a context switch to SWI; - since SWI is no longer used here, we can remove a twelve-year-old hack, excluding this SWI from the loadavg statistics; - it fixes a problem when eventtimer (HPET) shares an interrupt with some other device, and that interrupt thread is counted as a permanent loadavg of 1; now loadavg is accounted before that interrupt thread is scheduled. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, Fabian Keil, markj	2013-03-04 11:22:19 +00:00
Davide Italiano	5b999a6be0	- Make callout(9) tickless, relying on eventtimers(4) as backend for precise time event generation. This greatly improves granularity of callouts which are not anymore constrained to wait next tick to be scheduled. - Extend the callout KPI introducing a set of callout_reset_sbt* functions, which take a sbintime_t as timeout argument. The new KPI also offers a way for consumers to specify precision tolerance they allow, so that callout can coalesce events and reduce number of interrupts as well as potentially avoid scheduling a SWI thread. - Introduce support for dispatching callouts directly from hardware interrupt context, specifying an additional flag. This feature should be used carefully, since interrupt context has some limitations (e.g. no sleeping locks can be held). - Enhance mechanisms to gather information about callwheel, introducing a new sysctl to obtain stats. This change breaks the KBI. struct callout fields have been changed, in particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t' (8 bytes) and another 'sbintime_t' field was added for precision. Together with: mav Reviewed by: attilio, bde, luigi, phk Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo (amd64, sparc64), marius (sparc64), ian (arm), markj (amd64), mav, Fabian Keil	2013-03-04 11:09:56 +00:00
Pawel Jakub Dawidek	8cb539f18f	For some reason when I started to pass filedescent structures instead of pointers to the file structure receiving descriptors stopped working when also at least a few kilobytes of data is being sent. In the kernel the soreceive_generic() function doesn't see control mbuf as the first mbuf and unp_externalize() is never called, first 6(?) kilobytes of data is missing as well on the receiving end. This breaks for example tmux. I don't know yet why going from 8 bytes to sizeof(struct filedescent) per descriptor (or even to 16 bytes per descriptor) breaks things, but to work-around it for now use 8 bytes per file descriptor at the cost of memory allocation. Reported by: flo, Diane Bruce, Jan Beich <jbeich@tormail.org> Simple testcase provided by: mjg	2013-03-03 23:39:30 +00:00
Pawel Jakub Dawidek	5f39e56581	Use dedicated malloc type for filecaps-related data, so we can detect any memory leaks easier.	2013-03-03 23:25:45 +00:00
Pawel Jakub Dawidek	a6157c3d61	Plug memory leaks in file descriptors passing.	2013-03-03 23:23:35 +00:00
Davide Italiano	3f555c45eb	callwheelmask and callwheelsize are always greater than zero. Switch their type to u_int.	2013-03-03 15:01:33 +00:00
Davide Italiano	0fb285b716	Remove a couple of unused include.	2013-03-03 14:47:02 +00:00
Alexander Motin	4514d6fa18	MFcalloutng: Some whitespace fixes.	2013-03-03 09:11:24 +00:00
Pawel Jakub Dawidek	378a73d1bd	Regen after r247667.	2013-03-02 21:12:54 +00:00
Pawel Jakub Dawidek	7493f24ee6	- Implement two new system calls: int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen); int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen); which allow to bind and connect respectively to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'. - Add manual pages for the new syscalls. - Make the new syscalls available for processes in capability mode sandbox. - Add capability rights CAP_BINDAT and CAP_CONNECTAT that have to be present on the directory descriptor for the syscalls to work. - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor. - Update procstat(1) to recognize the new capability rights. - Document the new capability rights in cap_rights_limit(2). Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des	2013-03-02 21:11:30 +00:00
Pawel Jakub Dawidek	6d4e99aaef	If the target file already exists, check for the CAP_UNLINKAT capability right on the target directory descriptor, but only if this is renameat(2) and real target directory descriptor is given (not AT_FDCWD). Without this fix regular rename(2) fails if the target file already exists. Reported by: Michael Butler <imb@protected-networks.net> Reported by: Larry Rosenman <ler@lerctr.org> Sponsored by: The FreeBSD Foundation	2013-03-02 09:58:47 +00:00
Pawel Jakub Dawidek	1dc31587bf	Regen after r247602.	2013-03-02 00:55:09 +00:00
Pawel Jakub Dawidek	2609222ab4	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrieved with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrieve them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavily modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and even though I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. 
- Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ | PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ | PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE | PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNOD to CAP_MKNODAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). 
Added convenient defines: #define CAP_PREAD (CAP_SEEK | CAP_READ) #define CAP_PWRITE (CAP_SEEK | CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ) #define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \ CAP_SETSOCKOPT | CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
John Baldwin	f9379dc411	Replace the TDP_NOSLEEPING flag with a counter so that the THREAD_NO_SLEEPING() and THREAD_SLEEPING_OK() macros can nest. Reviewed by: attilio	2013-03-01 22:03:31 +00:00
Pawel Jakub Dawidek	71ac38e896	Remove unnecessary variables.	2013-03-01 21:58:56 +00:00
Pawel Jakub Dawidek	f4d0191b22	Reduce lock scope a little.	2013-03-01 21:57:02 +00:00
Marius Strobl	db9066f798	- Use strdup(9) instead of reimplementing it. - Use __DECONST instead of strange casts. - Reduce code duplication and simplify name2oid(). PR: 176373 Submitted by: Christoph Mallon MFC after: 1 week	2013-03-01 18:49:14 +00:00
Konstantin Belousov	58248e57ab	Make the default implementation of the VOP_VPTOCNP() fail if the directory entry, matched by the inode number, is ".". NFSv4 client might instantiate the distinct vnodes which have the same inode number, since single v4 export can be combined from several filesystems on the server. For instance, a case when the nested server mount point is exactly one directory below the top of the export, causes directory and its parent to have the same inode number 2. The vop_stdvptocnp() algorithm then returns "." as the name of the lower directory. Filtering out the "." entry with ENOENT works around this behaviour, the error forces getcwd(3) to fall back to usermode implementation, which compares both st_dev and st_ino. Based on the submission by: rmacklem Tested by: rmacklem MFC after: 1 week	2013-03-01 18:40:14 +00:00
Davide Italiano	e234a588cb	MFcalloutng: Style fixes.	2013-02-28 16:22:49 +00:00
Alexander Motin	fdc5dd2d2f	MFcalloutng: Switch eventtimers(9) from using struct bintime to sbintime_t. Even before this not a single driver really supported full dynamic range of struct bintime even in theory, not speaking about practical inexpediency. This change legitimates the status quo and cleans up the code.	2013-02-28 13:46:03 +00:00
Davide Italiano	acccf7d8b4	MFcalloutng: When CPU becomes idle, cpu_idleclock() calculates time to the next timer event in order to reprogram hw timer. Return that time in sbintime_t to the caller and pass it to acpi_cpu_idle(), where it can be used as one more factor (quite precise) to estimate further sleep time and choose optimal sleep state. This is a preparatory change for further callout improvements that will be committed in the next days. The commit is not targeted for MFC.	2013-02-28 10:46:54 +00:00
Konstantin Belousov	20f4e3e158	Make recursive getblk() slightly more useful. Keep the buffer state intact if getblk() is done on the already owned buffer. Exit from brelse() early when the lock recursion is detected, otherwise brelse() might prematurely destroy the buffer under some circumstances. Sponsored by: The FreeBSD Foundation Noted by: mckusick Tested by: pho MFC after: 2 weeks	2013-02-27 07:34:09 +00:00
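
Note on the calloutng entries above (5b999a6be0, 098176f0d0): the messages describe timeouts expressed as sbintime_t plus a precision tolerance computed as a percentage of the sleep time. The following is only a userland sketch of that arithmetic, not the kernel code. It assumes sbintime_t is a signed 64-bit 32.32 fixed-point value; the helpers ns_to_sbt() and sleep_precision() are hypothetical names introduced here for illustration, and the percentage form simply follows the wording of the commit message.

```c
#include <stdint.h>

/* Userland stand-in for the kernel type: signed 64-bit, 32.32 fixed point. */
typedef int64_t sbintime_t;

/* One second in 32.32 fixed point. */
#define	SBT_1S	((sbintime_t)1 << 32)

/*
 * Hypothetical helper: convert non-negative nanoseconds to sbintime_t.
 * Whole seconds and the sub-second remainder are converted separately
 * so the intermediate product stays below 2^63. The result truncates.
 */
static sbintime_t
ns_to_sbt(int64_t ns)
{
	int64_t sec = ns / 1000000000;
	int64_t rem = ns % 1000000000;

	return (sec * SBT_1S + (rem * SBT_1S) / 1000000000);
}

/*
 * Hypothetical helper: tolerance allowed for a sleep, expressed as a
 * percentage of the requested time (the commit message cites 5% as the
 * default). Dividing first avoids overflow for long sleeps.
 */
static sbintime_t
sleep_precision(sbintime_t sbt, int percent)
{
	return (sbt / 100 * percent);
}
```

A caller would compute something like sleep_precision(ns_to_sbt(250000), 5) and hand both values to an sbt-style sleep primitive; a larger tolerance gives the timer code more room to coalesce nearby events into one interrupt.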


13106 Commits