freebsd-skq

Author	SHA1	Message	Date
Dag-Erling Smørgrav	008a09355b	If the user tries to set kern.randompid to 1 (which is meaningless), set it to a random value between 100 and 1123, rather than 0 as before. Submitted by: Marie Helene Kvello-Aune <marieheleneka@gmail.com> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D5336	2017-09-10 15:01:29 +00:00
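The mapping described in 008a09355b can be sketched in userland C as follows. This is a minimal illustration, not the kernel's sysctl handler: the function name `normalize_randompid` and the `rnd` parameter (a stand-in for a kernel random source) are hypothetical, and only the "1 becomes a value between 100 and 1123" rule comes from the commit message.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map a requested kern.randompid value to the value
 * that would be stored.  Only the special case for 1 is taken from the
 * commit message; every other value is passed through unchanged here,
 * which is a simplification of whatever range checks the kernel applies.
 */
static int
normalize_randompid(int requested, uint32_t rnd)
{
	if (requested == 1) {
		/* 100 + [0, 1023] yields a value in [100, 1123]. */
		return (100 + (int)(rnd % 1024));
	}
	return (requested);
}
```

Passing the random number in as an argument keeps the sketch deterministic and easy to check.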
Mateusz Guzik	0bbae6f364	namecache: clean up struct namecache_ts handling namecache_ts differs from mere namecache by a few fields placed mid struct. The access to the last element (the name) is thus special-cased. The standard solution is to put new fields at the very beginning and embed the original struct. The pointer shuffled around points to the embedded part. If needed, access to new fields can be gained through __containerof. MFC after: 1 week	2017-09-10 11:17:32 +00:00
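The embedding pattern from 0bbae6f364 can be illustrated with a small standalone sketch. The struct names, fields, and the `CONTAINER_OF` macro below are hypothetical stand-ins (the kernel uses `__containerof` and the real namecache structs); the point is only that the extra fields sit in front of the embedded original, and code holding a pointer to the embedded part can step back to the wrapper.

```c
#include <stddef.h>
#include <string.h>

/* Userland stand-in for the kernel's __containerof (assumption). */
#define CONTAINER_OF(ptr, type, member) \
	((type *)(void *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical "plain" entry; the name is its last member. */
struct entry {
	int	id;
	char	name[16];
};

/* Extended variant: new fields first, original struct embedded last. */
struct entry_ts {
	long		timestamp;
	struct entry	e;
};

/* Recover the extra field from a pointer to the embedded part. */
static long
entry_timestamp(struct entry *e)
{
	return (CONTAINER_OF(e, struct entry_ts, e)->timestamp);
}
```

Code that only needs the common fields keeps using `struct entry *` and never has to special-case the position of `name`.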
Mateusz Guzik	dad74ce924	namecache: fold the unlock label into the only consumer No functional changes. MFC after: 1 week	2017-09-08 06:57:11 +00:00
Mateusz Guzik	da8f32a7f1	namecache: factor out dot lookup into a dedicated function The intent is to move uncommon cases out of the way. MFC after: 1 week	2017-09-08 06:51:33 +00:00
Mateusz Guzik	6a569d3525	Annotate Giant with __exclusive_cache_line	2017-09-08 06:46:24 +00:00
Mateusz Guzik	3e72c8449b	Annotate global process locks with __exclusive_cache_line MFC after: 1 week	2017-09-08 06:46:02 +00:00
Mateusz Guzik	574adb65c8	Sprinkle __read_frequently on few obvious places. Note that some of annotated variables should probably change their types to something smaller, preferably bit-sized.	2017-09-06 20:33:33 +00:00
Mateusz Guzik	fe933c1d88	Start annotating global _padalign locks with __exclusive_cache_line While these locks are guaranteed to not share their respective cache lines, their current placement leaves unnecessary holes in lines which preceded them. For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour cachelines (previously separated by the lock) to be collapsed into 1. The annotation is only effective on architectures which have it implemented in their linker script (currently only amd64). Thus locks are not converted to their not-padaligned variants as to not affect the rest. MFC after: 1 week	2017-09-06 20:28:18 +00:00
Edward Tomasz Napierala	b0618cda03	Make root_mount_rel(9) ignore NULL arguments, like it used to before r313351. It would be better to fix API consumers to not pass NULL there - most of them, such as gmirror, already contain the necessary checks - but this is easier and much less error-prone. One known user-visible result is that it fixes panic on a failed "graid label". PR: 221846 MFC after: 2 weeks Sponsored by: DARPA, AFRL	2017-09-05 14:32:56 +00:00
Warner Losh	519772814d	Add CAM/NVMe support for CAM_DATA_SG This adds support in pass(4) for data to be described with a scatter-gather list (sglist) to augment the existing (single) virtual address. Differential Revision: https://reviews.freebsd.org/D11361 Submitted by: Chuck Tuffli Reviewed by: imp@, scottl@, kenm@	2017-08-29 15:29:57 +00:00
Bryan Drewery	8359a6b7b3	Allow vdrop() of a vnode not yet on the per-mount list after r306512. The old code allowed calling vdrop() before insmntque() to place the vnode back onto the freelist for later recycling. Some downstream consumers may rely on this support. Normally insmntque() failing is fine since it uses vgone() and immediately frees the vnode rather than attempting to add it to the freelist if vdrop() were used instead. Also assert that vhold() cannot be used on such a vnode. Reviewed by: kib, cem, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12126	2017-08-28 19:29:51 +00:00
Conrad Meyer	4ae2ade114	Enhance debuggability of sysctl leaf re-use warnings Print the full conflicting oid path, and include the function name in the warning so it is clear that the warnings are sysctl-related. PR: 221853 Submitted by: Fabian Keil <fk AT fabiankeil.de> (earlier version) Sponsored by: Dell EMC Isilon	2017-08-27 17:12:30 +00:00
Conrad Meyer	eee87314d3	Improve scheduler performance Improve scheduler performance by flattening nonsensical topology layers (layers with only one child don't serve any purpose). This is especially relevant on non-AMD Zen systems after r322776. On my dual core Intel laptop, this brings the kern.sched.topology_spec table down from three levels to two. Submitted by: jeff Reviewed by: attilio Sponsored by: Dell EMC Isilon	2017-08-27 05:14:48 +00:00
John Baldwin	12fb14f36d	Don't grab SOCK_LOCK for soref() when queuing an AIO request. The AIO job holds a reference on the associated file descriptor, so the socket's count should already be > 0. This fixes a LOR with the socket buffer lock after recent socket locking changes in HEAD. Sponsored by: Chelsio Communications	2017-08-25 23:10:27 +00:00
Alan Cox	a0ae476b7e	Correct a regression in the previous change, r322459. Specifically, the removal of the "blk" parameter from blst_meta_alloc() had the unintended effect of generating an out-of-range allocation when the cursor reaches the end of the tree if the number of managed blocks in the tree equals the so-called "radix" (which in the blist code is not the standard notion of what a radix is but rather the maximum number of leaves in a tree of the current height.) In other words, only certain swap configurations were affected, which is why earlier testing did not reveal the problem. Submitted by: Doug Moore <dougm@rice.edu> Reported by: pho, kib Tested by: pho X-MFC with: r322459 Differential Revision: https://reviews.freebsd.org/D12106	2017-08-25 18:47:23 +00:00
Gleb Smirnoff	555b3e2f2c	Third take on the r319685 and r320480. Actually allow for call soisconnected() via soisdisconnected(), and in the earlier unlock earlier to avoid lock recursion. This fixes a situation when a socket on accept queue is reset before being accepted. Reported by: Jason Eggleston <jeggleston llnw.com>	2017-08-24 20:49:19 +00:00
Conrad Meyer	d2e155a4f0	Remove unused declaration and update ddb.4 A follow-up to r322836. Warnings for the unused declaration were breaking some second tier architectures, but did not show up in Clang on x86. Reported by: markj (ddb.4), emaste (declaration) Sponsored by: Dell EMC Isilon	2017-08-24 19:16:25 +00:00
Conrad Meyer	0c1d923efb	Merge print_lockchain and print_sleepchain When debugging a deadlock, it is useful to follow the full chain of locks as far as possible. Reviewed by: jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12115	2017-08-24 15:12:16 +00:00
Jung-uk Kim	a1d0659ca9	Fix size to copyout(9) for cpuset_getid(2). MFC after: 3 days	2017-08-22 20:46:29 +00:00
Conrad Meyer	bb14d5643b	subr_smp: Clean up topology analysis, add additional layers Rather than repeatedly nesting loops, separate concerns with a single loop per call stack level. Use a table to drive the recursive routine. Handle missing topology layers more gracefully (infer a single unit). Analyze some additional optional layers which may be present on e.g. AMD Zen systems (groups, aka dies, per package; and cachegroups, aka CCXes, per group). Display that additional information in the boot-time topology information, when it is relevant (non-one). Reviewed by: markj@, mjoras@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12019	2017-08-22 00:10:15 +00:00
Konstantin Belousov	b59ea73029	Allow vinvalbuf() to operate with the shared vnode lock. This mode allows other clean buffers to arrive while we flush the buf lists for the vnode, which is fine for the targeted use. We only need that all buffers existed at the time of the function start were flushed. In fact, only one assert has to be relaxed. In collaboration with: pho Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D12083	2017-08-20 10:07:45 +00:00
Mark Johnston	e9666bf645	Remove some unneeded subroutines for padding writes to dump devices. Right now we only need to pad when writing kernel dump headers, so flatten three related subroutines into one. The encrypted kernel dump code already writes out its key in a dumper.blocksize-sized block. No functional change intended. Reviewed by: cem, def Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11647	2017-08-18 04:07:25 +00:00
Mark Johnston	01938d3666	Rename mkdumpheader() and group EKCD functions in kern_shutdown.c. This helps simplify the code in kern_shutdown.c and reduces the number of globally visible functions. No functional change intended. Reviewed by: cem, def Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11603	2017-08-18 04:04:09 +00:00
Mark Johnston	50ef60dabe	Factor out duplicated kernel dump code into dump_{start,finish}(). dump_start() and dump_finish() are responsible for writing kernel dump headers, optionally writing the key when encryption is enabled, and initializing the initial offset into the dump device. Also remove the unused dump_pad(), and make some functions static now that they're only called from kern_shutdown.c. No functional change intended. Reviewed by: cem, def Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11584	2017-08-18 03:52:35 +00:00
Lawrence Stewart	9a61faf67d	An off-by-one error exists in sbuf_vprintf()'s use of SBUF_HASROOM() when an sbuf is filled to capacity by vsnprintf(), the loop exits without error, and the sbuf is not marked as auto-extendable. SBUF_HASROOM() evaluates true if there is room for one or more non-NULL characters, but in the case that the sbuf was filled exactly to capacity, SBUF_HASROOM() evaluates false. Consequently, sbuf_vprintf() incorrectly assigns an ENOMEM error to the sbuf when in fact everything is fine, in turn poisoning the buffer for all subsequent operations. Correct by moving the ENOMEM assignment into the loop where it can be made unambiguously. As a related safety net change, explicitly check for the zero bytes drained case in sbuf_drain() and set EDEADLK as the error. This avoids an infinite loop in sbuf_vprintf() if a drain function were to inadvertently return a value of zero to sbuf_drain(). Reviewed by: cem, jtl, gallatin MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D8535	2017-08-18 02:06:28 +00:00
Lawrence Stewart	a8ec96af28	Implement simple record boundary tracking in sbuf(9) to avoid record splitting during drain operations. When an sbuf is configured to use this feature by way of the SBUF_DRAINTOEOR sbuf_new() flag, top-level sections started with sbuf_start_section() create a record boundary marker that is used to avoid flushing partial records. Reviewed by: cem,imp,wblock MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D8536	2017-08-17 07:20:09 +00:00
Ian Lepore	ce44a73667	Fix compile error with option DEBUG. This is fallout from some long-ago INTRNG refactoring that didn't get caught at the time because code in a debugf() statement isn't compiled unless DEBUG is defined. PR: 221557	2017-08-16 16:51:55 +00:00
Conrad Meyer	f3fed04372	Fix a couple of comment typos No functional change. Submitted by: Anton Rang <anton.rang AT isilon.com> Sponsored by: Dell EMC Isilon	2017-08-15 02:21:02 +00:00
Ian Lepore	2db14f97de	Add config_intrhook_oneshot(): schedule an intrhook function and unregister it automatically after it runs. The config_intrhook mechanism allows a driver to stall the boot process until device(s) required for booting are available, by not allowing system inits to proceed until all intrhook functions have been unregistered. Virtually all existing code simply unregisters from within the hook function when it gets called. This new function makes that common usage more convenient. Instead of allocating and filling in a struct, passing it to a function that might (in theory) fail, and checking the return code, now a driver can simply call this cannot-fail routine, passing just the intrhook function and its arg. Differential Revision: https://reviews.freebsd.org/D11963	2017-08-13 18:10:24 +00:00
Alan Cox	bee93d3cf0	The _meta_ functions include a radix parameter, a blk parameter, and another parameter that identifies a starting point in the memory address block. Radix is a power of two, blk is a multiple of radix, and the starting point is in the range [blk, blk+radix), so that blk can always be computed from the other two. This change drops the blk parameter from the meta functions and computes it instead. (On amd64, for example, this change reduces subr_blist.o's text size by 7%.) It also makes the radix parameters unsigned to address concerns that the calculation of '-radix' might overflow without the -fwrapv option. (See https://reviews.freebsd.org/D11819.) Submitted by: Doug Moore <dougm@rice.edu> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11964	2017-08-13 16:39:49 +00:00
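The observation in bee93d3cf0, that `blk` follows from the other two parameters, comes down to rounding the starting point down to a multiple of a power-of-two radix. A minimal sketch, assuming 64-bit unsigned block numbers; the function name `meta_base` is hypothetical and this is not the code from subr_blist.c:

```c
#include <stdint.h>

/*
 * Given a starting point in [blk, blk + radix), where radix is a power
 * of two and blk is a multiple of radix, recover blk.  With unsigned
 * arithmetic, -radix is well defined (it wraps modulo 2^64) and has all
 * bits at and above log2(radix) set, so the AND clears the low bits.
 */
static uint64_t
meta_base(uint64_t start, uint64_t radix)
{
	return (start & -radix);
}
```

Using an unsigned type is what sidesteps the signed-overflow concern about `-radix` that the commit message mentions.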
Mark Johnston	af0460beda	Have sendfile_swapin() use vm_page_grab_pages(). Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11942	2017-08-11 16:32:24 +00:00
Mark Johnston	9df950b35d	Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT. This will allow its use in sendfile_swapin(). Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11942	2017-08-11 16:29:22 +00:00
Alan Cox	6921451dab	An invalid page can't be dirty. Reviewed by: kib MFC after: 1 week	2017-08-11 16:27:54 +00:00
Andrew Turner	a92a2f00b1	Only return the current cpu if it's in the cpumask. When we restrict the cpumask it probably means we are unable to send interrupts to CPUs outside the map. As such only return the current CPU when it's within the mask otherwise return the first valid CPU. This is needed on ThunderX as, in a dual socket configuration, we are unable to send MSI/MSI-X interrupts between sockets. Reviewed by: mmel Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11957	2017-08-11 12:45:58 +00:00
Gleb Smirnoff	ef3266d58a	Plug uninitialized stack variable leak in sendfile(2). Reported by: Ilja Van Sprundel <ivansprundel ioactive.com> Submitted by: Domagoj Stolfa <domagoj.stolfa gmail.com> MFC after: 1 week Security: uninitialized stack variable leak	2017-08-09 17:48:38 +00:00
Alan Cox	5471caf6f1	Introduce vm_page_grab_pages(), which is intended to replace loops calling vm_page_grab() on consecutive page indices. Besides simplifying the code in the caller, vm_page_grab_pages() allows for batching optimizations. For example, the current implementation replaces calls to vm_page_lookup() on consecutive page indices by cheaper calls to vm_page_next(). Reviewed by: kib, markj Tested by: pho (an earlier version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11926	2017-08-09 04:23:04 +00:00
Alan Somers	c45796d54e	Make p1003_1b.aio_listio_max a tunable p1003_1b.aio_listio_max is now a tunable. Its value is reflected in the sysctl of the same name, and the sysconf(3) variable _SC_AIO_LISTIO_MAX. Its value will be bounded from below by the compile-time constant AIO_LISTIO_MAX and from above by the compile-time constant MAX_AIO_QUEUE_PER_PROC and the tunable vfs.aio.max_aio_queue. Reviewed by: jhb, kib MFC after: 3 weeks Relnotes: yes Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D11601	2017-08-08 16:14:31 +00:00
Ruslan Bukin	ca20f8ec29	o Replace __riscv__ with __riscv o Replace __riscv64 with (__riscv && __riscv_xlen == 64) This is required to support new GCC 7.1 compiler. This is compatible with current GCC 6.1 compiler. RISC-V is extensible ISA and the idea here is to have built-in define per each extension, so together with __riscv we will have some subset of these as well (depending on -march string passed to compiler): __riscv_compressed __riscv_atomic __riscv_mul __riscv_div __riscv_muldiv __riscv_fdiv __riscv_fsqrt __riscv_float_abi_soft __riscv_float_abi_single __riscv_float_abi_double __riscv_cmodel_medlow __riscv_cmodel_medany __riscv_cmodel_pic __riscv_xlen Reviewed by: ngie Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11901	2017-08-07 14:09:57 +00:00
Alan Cox	ba98e6a2d7	In case readers are misled by expressions that combine multiplication and division, add parentheses to make the precedence explicit. Submitted by: Doug Moore <dougm@rice.edu> Requested by: imp Reviewed by: imp MFC after: 1 week X-MFC after: r321840 Differential Revision: https://reviews.freebsd.org/D11815	2017-08-04 04:23:23 +00:00
Mark Johnston	526b5fe16c	Amend r321884 to check the refcount and update the class with w_mtx held. Reviewed by: jhb X-MFC with: r321884	2017-08-01 23:14:38 +00:00
Mark Johnston	57688b6e4e	Fix a witness assertion that fires when a lock type's class changes. When all instances of a lock type are destroyed (for example, after a module unload), the corresponding witness entry remains associated with that lock type. In this case, we shouldn't panic if a new instance of the lock type is created and its lock class does not match that recorded in the witness entry. Reviewed by: jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11788	2017-08-01 17:50:28 +00:00
Alan Cox	2ac0c7c37c	The blist_meta_* routines that process a subtree take arguments 'radix' and 'skip', which denote, respectively, the largest number of blocks that can be managed by a subtree of that height, and one less than the number of nodes in a subtree of that height. This change removes the 'skip' argument from those functions because 'skip' can be trivially computed from 'radix'. This change also redefines 'skip' so that it denotes the number of nodes in the subtree, and so changes loop upper bound tests from '<= skip' to '< skip' to account for the change. The 'skip' field is also removed from the blist struct. The self-test program is changed so that the print command includes the cursor value in the output. Submitted by: Doug Moore <dougm@rice.edu> MFC after: 1 week	2017-08-01 03:51:26 +00:00
Dmitry Chagin	77d3337c9f	Implement proper Linux /dev/fd and /proc/self/fd behavior by adding Linux specific things to the native fdescfs file system. Unlike FreeBSD, the Linux fdescfs is a directory containing symbolic links to the actual files which the process has open. A readlink(2) call on such a file returns a full path in case of a regular file, or a string in a special format (type:[inode], anon_inode:<file-type>, etc.). As in FreeBSD, opening a file in the Linux fdescfs directory is equivalent to duplicating the corresponding file descriptor. Here we have mutually exclusive requirements: - in case of a readlink(2) call, the fdescfs lookup() method should return a VLNK vnode, otherwise our kern_readlink() fails with an EINVAL error; - in the other calls, the fdescfs lookup() method should return a non-VLNK vnode. To handle this, a new vnode v_flag VV_READLINK was added, which is set if fdescfs has been mounted with the linrdlnk option, and kern_readlinkat() was modified to properly handle it. For now, for Linux ABI compatibility, mount the fdescfs volume with the linrdlnk option: mount -t fdescfs -o linrdlnk null /compat/linux/dev/fd Reviewed by: kib@ MFC after: 1 week Relnotes: yes	2017-08-01 03:40:19 +00:00
Mark Johnston	6c7ebc242b	Batch v_wire_count decrements in vm_hold_free_pages(). Atomic updates to v_wire_count are a significant source of contention, so combine multiple updates into one in this easy case. Also remove an old printf that gets executed if the page is shared-busied, which is a case that will lead to a panic anyway. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11791	2017-07-31 18:48:58 +00:00
Ian Lepore	cd9d9e5417	Add clock_schedule(), a feature that allows realtime clock drivers to request that their clock_settime() methods be called at a given offset from top-of-second. This adds a timeout_task to the rtc_instance so that each clock can be separately added to taskqueue_thread with the scheduling it prefers, instead of looping through all the clocks at once with a single task on taskqueue_thread. If a driver doesn't call clock_schedule() the default is the old behavior: clock_settime() is queued immediately. The motivation behind this is that I was on the path of adding identical code to a third RTC driver to figure out a delta to top-of-second and sleep for that amount of time because writing the RTC registers resets the hardware's concept of top-of-second. (Sometimes it's not top-of-second, some RTC clocks tick over a half second after you set their time registers.) Worst-case would be to sleep for almost a full second, which is a rude thing to do on a shared task queue thread.	2017-07-31 01:18:21 +00:00
Mark Johnston	d0f68f913b	Correct the predicates on which lockstat:::{thread,spin}-spin fire. In particular, they should fire only if the lock was owned by another thread when we first attempted to acquire that lock. MFC after: 1 week	2017-07-31 00:59:28 +00:00
Ian Lepore	f37b7fc2d4	Add taskqueue_enqueue_timeout_sbt(), because sometimes you want more control over the scheduling precision than 'ticks' can offer, and because sometimes you're already working with sbintime_t units and it's dumb to convert them to ticks just so they can get converted back to sbintime_t under the hood.	2017-07-31 00:54:50 +00:00
Conrad Meyer	ca3fec5042	kldstat: Use sizeof in place of named constants for sizing No functional change. This is handy for FreeBSD derivatives that want to modify the value of MAXPATHLEN, but not the kld_file_stat ABI. Submitted by: Siddhant Agarwal <sagarwal AT isilon.com> Sponsored by: Dell EMC Isilon	2017-07-29 23:31:21 +00:00
Konstantin Belousov	7ceeb35bd8	Make it possible to request nosys logging to console. New kern.lognosys values are 1 - log to ctty 2 - log to console 3 - log to both. Inspired by: eugen Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-07-27 20:45:41 +00:00
Konstantin Belousov	0948519ded	Make the number of children for pctrie node available outside subr_pctrie.c. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D11435	2017-07-27 16:40:14 +00:00
Alan Cox	a396c83a5e	Change the interactions of the interface functions with the "meta" and "leaf" functions for alloc, free, and fill. After the change, the interface functions call "meta" unconditionally, and the "meta" functions recur unconditionally in looping over their descendants. The "meta" functions start with a validity test, and then a test for the "leaf" case, before falling into the general recursive case. This simplifies and shrinks the code, and, for "free" and "fill" moves panic tests that check the same meta node repeatedly in a loop to a place that will have each node tested once. Remove irrelevant null checks from blist_free and blist_fill. Make the code that initializes a meta node the same in blist_meta_alloc and blist_meta_fill. Parenthesize return expressions in blst_meta_fill. Submitted by: Doug Moore <dougm@rice.edu> MFC after: 1 week	2017-07-24 17:23:53 +00:00
Ian Lepore	f1b21e2c92	Add common code to support realtime clocks that store year without century. Most realtime clocks store the year as 2 BCD digits. Some add a century bit to extend the range another hundred years. Every clock driver has its own code to determine the century and pass a full year value to clock_ct_to_ts(). Now clock drivers can just convert BCD to bin and store the result in the clocktime struct and let the common code figure out the century. Clocks with a century bit can just add 100 to year if the century bit is on.	2017-07-23 21:28:00 +00:00
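The division of labour described in f1b21e2c92 (driver converts BCD, common code picks the century) can be sketched as below. The function names are hypothetical, and the pivot on 70 is an assumption about how such common code might choose the century, not something the commit message states.

```c
#include <stdint.h>

/* Convert one packed-BCD byte (e.g. 0x17) to binary (17). */
static int
bcd_to_bin(uint8_t bcd)
{
	return ((bcd >> 4) * 10 + (bcd & 0x0f));
}

/*
 * Hypothetical century inference.  Assumed rule: values below 70 are
 * taken as 20xx; values from 70 to 199 are taken relative to 1900, which
 * also covers drivers that add 100 when a century bit is set; anything
 * larger is treated as a full year already.
 */
static int
full_year(int year)
{
	if (year < 70)
		return (year + 2000);
	if (year < 200)
		return (year + 1900);
	return (year);
}
```

A driver for a clock with a century bit would call something like `full_year(bcd_to_bin(reg) + (century_bit ? 100 : 0))`, where `reg` and `century_bit` are whatever it read from the hardware.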
Michael Tuexen	27d8bea898	Fix getsockopt() for listening sockets when using SO_SNDBUF, SO_RCVBUF, SO_SNDLOWAT, SO_RCVLOWAT. Since r31972 it only worked for non-listening sockets. Sponsored by: Netflix, Inc.	2017-07-21 07:44:43 +00:00
Enji Cooper	d700db78ad	Fix whitespace regression accidentally checked in via ^/head@r280149 MFC after: now	2017-07-18 06:51:27 +00:00
Alan Cox	b411efa49b	Tidy up before making another round of functional changes: Remove end-of-line whitespace, remove excessive whitespace and blank lines, remove dead code, follow our standard style for function definitions, and correct grammatical and factual errors in some of the comments. Submitted by: Doug Moore <dougm@rice.edu> MFC after: 1 week	2017-07-17 23:16:33 +00:00
John Baldwin	c7af789360	Set the current vnet pointer in the socket buffer AIO handler. This fixes panics when using AIO under VIMAGE. Reported by: kp MFC after: 3 days Sponsored by: Chelsio Communications	2017-07-17 16:59:22 +00:00
Ian Lepore	f7afe7679c	Minor optimization: instead of converting between days and years using loops that start in 1970, assume most conversions are going to be for recent dates and use a precomputed number of days through the end of 2016. This is a do-over of r320997, hopefully this time with 100% more workiness. The first attempt had an off-by-one error, but instead of just adding another mysterious +1 adjustment, this rearranges the relationship between recent_base_year and recent_base_days so that the latter is the number of days that occurred before the start of the associated year (instead of the count thru the end of that year). This makes the recent_base stuff work more like the original loop logic that didn't need any +1 adjustments.	2017-07-16 16:54:03 +00:00
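The shortcut in f7afe7679c can be sketched as follows, with the base defined as in the second attempt: the number of days that elapsed before the start of the base year. The names and the choice of 2017 as the base year are assumptions for illustration; 17167 is the number of days from 1970-01-01 to 2017-01-01.

```c
/* Gregorian leap-year rule. */
static int
is_leap(int year)
{
	return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0);
}

/* Hypothetical constants: days elapsed before the start of the base year. */
#define RECENT_BASE_YEAR	2017
#define RECENT_BASE_DAYS	17167

/* Days between 1970-01-01 and January 1 of 'year' (year >= 1970). */
static int
days_before_year(int year)
{
	int y, days;

	if (year >= RECENT_BASE_YEAR) {
		/* Recent date: skip the 1970..2016 portion of the loop. */
		y = RECENT_BASE_YEAR;
		days = RECENT_BASE_DAYS;
	} else {
		y = 1970;
		days = 0;
	}
	for (; y < year; y++)
		days += is_leap(y) ? 366 : 365;
	return (days);
}
```

Because the base counts days before the start of its year, both branches feed the same loop and no extra +1 adjustment is needed.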
Mark Johnston	ab384d75db	Revert r320918 and have mkdumpheader() handle version string truncation. Reported by: jhb MFC after: 1 week	2017-07-15 20:53:08 +00:00
Ian Lepore	cfcdbe4b52	Revert r320997. There are reports of it getting the wrong results, so clearly my testing was insufficient, and it's best to just revert it until I get it straightened out.	2017-07-15 00:45:22 +00:00
Brooks Davis	6aacc69847	Add 32-bit compat for kinfo_proc's ki_tdaddr. This appears to have been an oversight in r213536. Reviewed by: markj MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11521	2017-07-14 21:13:05 +00:00
Ian Lepore	32cd61b793	Minor optimization: instead of converting between days and years using loops that start in 1970, assume most conversions are going to be for recent dates and use a precomputed number of days through the end of 2016.	2017-07-14 18:36:15 +00:00
Ian Lepore	932d14c6e7	Allow setting debug.clocktime as a tunable. Print 64-bit time_t correctly on 32-bit systems.	2017-07-14 18:13:54 +00:00
Warner Losh	df4245150a	This adds CAM pass(4) support for NVMe IO's. Applications indicate the IO type (Admin or NVM) using XPT op-codes XPT_NVME_ADMIN or XPT_NVME_IO. Submitted by: Chuck Tuffli <chuck@tuffli.net> Differential Revision: https://reviews.freebsd.org/D10247	2017-07-14 14:52:20 +00:00
Konstantin Belousov	5cead59181	Correct sysent flags for dynamically loaded syscalls. Using the https://github.com/google/capsicum-test/ suite, the PosixMqueue.CapModeForked test was failing due to an ECAPMODE after calling kmq_notify(). On further inspection, the dynamically loaded syscall entry was initialized with sy_flags zeroed out, since SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value. Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an additional argument to specify the sy_flags value. Submitted by: Siva Mahadevan <smahadevan@freebsdfoundation.org> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D11576	2017-07-14 09:34:44 +00:00
Ryan Libby	808a9c8646	kvprintf %b enhancements Make the %b formatter accept number formatting flags. It will now accept alternate form, precision, and length modifiers. It also now partially supports field width (but forces left justification). Reviewed by: markj Approved by: markj (mentor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11284	2017-07-12 07:30:14 +00:00
Ian Lepore	a8d7b9d3bb	Support multiple realtime clocks, and remove locking/sleeping restrictions on clock drivers. This tracks multiple concurrent realtime clock drivers in a list sorted by clock resolution. When system time changes (and periodically) the clock_settime() methods of all registered clocks are invoked. To initialize system time, each driver is tried in turn from best to worst resolution, until one successfully returns a valid time. The code no longer holds a mutex while calling the clock_settime() and clock_gettime() methods of the registered clocks. This allows clock drivers to do whatever kind of locking or sleeping is necessary (this is especially important for i2c clock chips since i2c drivers often need to sleep). A new clock_register_flags() function allows the clock driver to pass flags. The flags currently defined help support drivers that use their own techniques to avoid roundoff errors (prevents the 4/5 rounding done by the subr_rtc code). A driver which may need to wait for resources (such as bus ownership) may pass a flag to indicate that it will obtain system time for itself after waiting for resources; this is merely an optimization to avoid the common code retrieving a timespec that will never get used. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11484	2017-07-12 02:53:54 +00:00
Andrew Gallatin	a5169546ee	Simplify UIO_SYSSPACE and UIO_NOCOPY paths in uiomove Uiomove can only block when the segflag is UIO_USERSPACE, otherwise we end up just doing a bcopy (or nothing) and moving cursors. So only emit witness warnings and set deadlock thread flags in the UIO_USERSPACE case. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D11489	2017-07-06 15:03:54 +00:00
Hans Petter Selasky	fe715b8090	After r319722 two fields were left uninitialized when transforming a socket structure into a listening socket. This resulted in an invalid instruction fault for all 32-bit platforms. When INVARIANTS is set the union where the two uninitialized fields reside gets properly zeroed. This patch ensures the two uninitialized fields are zeroed when INVARIANTS is undefined. For 64-bit platforms this issue was not visible because so->sol_upcall which is uninitialized overlaps with so->so_rcv.sb_state which is already zero during soalloc(); For 32-bit platforms this issue was visible and resulted in an invalid instruction fault, because so->sol_upcall overlaps with so->so_rcv.sb_sel which is always initialized to a valid data pointer during soalloc(). Verifying the offset locations mentioned above are identical is left as an exercise to the reader. PR: 220452 PR: 220358 Reviewed by: ae (network), gallatin Differential Revision: https://reviews.freebsd.org/D11475 Sponsored by: Mellanox Technologies	2017-07-04 18:23:17 +00:00
Konstantin Belousov	467b3975a3	Resolve confusion between different error code spaces. The vm_map_fixed() and vm_map_stack() VM functions return Mach error codes. Convert them into errno values before returning result from exec_new_vmspace(). While there, modernize the comment and do minor style adjustments. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-07-03 20:44:01 +00:00
Mateusz Guzik	3f7830a31e	rwlock: perform the typically false td_rw_rlocks check later Check if the lock is available first instead. MFC after: 1 week	2017-07-02 01:05:16 +00:00
Alan Cox	545414213d	Change blst_leaf_alloc() to handle a cursor argument, and to improve performance. To find in the leaf bitmap all ranges of sufficient length, use a doubling strategy with shift-and-and until each bit still set represents a bit sequence of length 'count', or until the bitmask is zero. In the latter case, update the hint based on the first bit sequence length not found to be available. For example, seeking an interval of length 12, the set bits of the bitmap would represent intervals of length 1, then 2, then 3, then 6, then 12. If no bits are set at the point when each bit represents an interval of length 6, then the hint can be updated to 5 and the search terminated. If long-enough intervals are found, discard those before the cursor. If any remain, use binary search to find the position of the first of them, and allocate that interval. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib, markj MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D11426	2017-07-01 05:27:40 +00:00
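The shift-and-and doubling idea from 545414213d can be sketched on a 64-bit bitmap. This is a simplified variant, not the blst_leaf_alloc() code: it grows the represented run length as 1, 2, 4, 8, ... and clamps the final step (so 12 is reached via 1, 2, 4, 8, 12 rather than the 1, 2, 3, 6, 12 sequence in the commit message), and it omits the hint update, the cursor, and the binary search. The function name `run_starts` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a mask in which bit i is set iff bits i .. i+count-1 are all
 * set in 'mask' (1 <= count <= 64).  Invariant: before each step, a set
 * bit i means the 'len' bits starting at i are all set in the original.
 * ANDing with a copy shifted right by s <= len extends that to len + s.
 */
static uint64_t
run_starts(uint64_t mask, unsigned count)
{
	unsigned len, shift;

	assert(count >= 1 && count <= 64);
	/* Stop early once no candidate run is left. */
	for (len = 1; len < count && mask != 0; len += shift) {
		shift = (count - len < len) ? count - len : len;
		mask &= mask >> shift;
	}
	return (mask);
}
```

The lowest set bit of the result gives the first position where an interval of `count` bits could be allocated.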
Konstantin Belousov	9fb8c888f1	Define ino64_trunc_error under same conditions as the code which uses the variable. Noted by: bde Sponsored by: The FreeBSD Foundation	2017-06-30 16:10:21 +00:00
Alan Cox	8056df6e25	Clear the MAP_WIREFUTURE flag on the vm map in exec_new_vmspace() when it recycles the current vm space. Otherwise, an mlockall(MCL_FUTURE) could still be in effect on the process after an execve(2), which violates the specification for mlockall(2). It's pointless for vm_map_stack() to check the MEMLOCK limit. It will never be asked to wire the stack. Moreover, it doesn't even implement wiring of the stack. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11421	2017-06-30 15:49:36 +00:00
John Baldwin	51645e836d	Store a 32-bit PT_LWPINFO struct for 32-bit process core dumps. Process core notes for a 32-bit process running on a 64-bit host need to use 32-bit structures so that the note layout matches the layout of notes of a core dump of a 32-bit process under a 32-bit kernel. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D11407	2017-06-29 21:31:13 +00:00
Navdeep Parhar	98c9236978	Adjust sowakeup post-r319685 so that it continues to make upcalls but still avoids calling soconnected during sodisconnected. Discussed with: glebius@ Sponsored by: Chelsio Communications	2017-06-29 19:43:27 +00:00
Konstantin Belousov	d137278838	Do not cast struct kevent_args or struct freebsd11_kevent_args to struct g_kevent_args. On some architectures, e.g. PowerPC, there is additional padding in uap. Reported and tested by: andreast Sponsored by: The FreeBSD Foundation	2017-06-29 14:40:33 +00:00
Konstantin Belousov	34d3e89f33	Do not ignore an error from vm_mmap_object(). Found and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-27 20:12:13 +00:00
Alan Cox	d37837249b	Address the remaining integer overflow issues with the "skip" parameters and "next_skip" variables. The "skip" value in struct blist has long been a 64-bit quantity but various functions have implicitly truncated this value to 32 bits. Now, all arithmetic involving the "skip" value is 64 bits wide. (This should allow us to relax the size limit on a swap device in the swap pager.) Maintain the ability to test this allocator as a user-space application by including <stdbool.h>. Remove an unused variable from blst_radix_print(). Reviewed by: kib, markj MFC after: 4 weeks Differential Revision: https://reviews.freebsd.org/D11358	2017-06-27 17:45:26 +00:00
Conrad Meyer	f5b7359a00	Fix one more place uio_resid is truncated to int A follow-up to r231949 and r194990. Reported by: pho@ Reviewed by: kib@, markj@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11373	2017-06-27 17:23:20 +00:00
Gleb Smirnoff	64290befc1	Provide sbsetopt() that handles socket buffer related socket options. It distinguishes between data flow sockets and listening sockets, and in case of the latter doesn't change resource limits, since listening sockets don't hold any buffers, they only carry values to be inherited by their children.	2017-06-25 01:41:07 +00:00
Mark Johnston	704cb42f2a	Fix the !TD_IS_IDLETHREAD(curthread) locking assertions. Most of the lock slowpaths assert that the calling thread isn't an idle thread. However, this may not be true if the system has panicked, and in some cases the assertion appears before a SCHEDULER_STOPPED() check. MFC after: 3 days Sponsored by: Dell EMC Isilon	2017-06-19 21:09:50 +00:00
Konstantin Belousov	711dba24d7	Allow negative aio_offset only for the read and write LIO ops on device nodes. Otherwise, the current check of aio_offset == -1LL makes it possible to pass negative file offsets down to the filesystems. This trips assertions and is even unsafe for e.g. FFS which keeps metadata at negative offsets. Reported and tested by: pho Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D11266	2017-06-19 15:17:17 +00:00
Alan Cox	d4e3484bd9	Change blist_alloc()'s allocation policy from first-fit to next-fit so that disk writes are more likely to be sequential. This change is beneficial on both the solid state and mechanical disks that I've tested. (A similar change in allocation policy was made by DragonFly BSD in 2013 to speed up Poudriere with "stressful memory parameters".) Increase the width of blst_meta_alloc()'s parameter "skip" and the local variables whose values are derived from it to 64 bits. (This matches the width of the field "skip" that is stored in the structure "blist" and passed to blst_meta_alloc().) Eliminate a pointless check for a NULL blist_t. Simplify blst_meta_alloc()'s handling of the ALL-FREE case. Address nearby style errors. Reviewed by: kib, markj MFC after: 5 weeks Differential Revision: https://reviews.freebsd.org/D11247	2017-06-18 18:23:39 +00:00
Rick Macklem	d1c5e240a8	Make MAXBCACHEBUF a tunable called vfs.maxbcachebuf. By making MAXBCACHEBUF a tunable, it can be increased to allow for larger read/write data sizes for the NFS client. The tunable is limited to MAXPHYS, which is currently 128K. Making MAXPHYS a tunable or increasing its value is being discussed, since it would be nice to support a read/write data size of 1Mbyte for the NFS client when mounting the AmazonEFS file service. Reviewed by: kib MFC after: 2 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D10991	2017-06-17 22:24:19 +00:00
Konstantin Belousov	eb84ca643c	Regen.	2017-06-17 00:58:19 +00:00
Konstantin Belousov	2b34e84335	Add abstime kqueue(2) timers and expand struct kevent members. This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which specifies that the data field contains absolute time to fire the event. To make this useful, data member of the struct kevent must be extended to 64bit. Using the opportunity, I also added ext members. This changes struct kevent almost to Apple struct kevent64, except I did not change the type of ident and udata; the latter would cause serious API incompatibilities. The type of ident was kept uintptr_t since EVFILT_AIO returns a pointer in this field, and e.g. CHERI is sensitive to the type (discussed with brooks, jhb). Unlike Apple kevent64, symbol versioning allows us to claim ABI compatibility and still name the new syscall kevent(2). Compat shims are provided for both host native and compat32. Requested by: bapt Reviewed by: bapt, brooks, ngie (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D11025	2017-06-17 00:57:26 +00:00
Konstantin Belousov	f2eb97b2cd	Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D11025	2017-06-16 23:41:13 +00:00
Gleb Smirnoff	2b8e036bfc	Plug read(2) and write(2) on listening sockets.	2017-06-15 20:11:29 +00:00
Ryan Libby	c74ae2ca93	ddb show socket debugging Display the mbuf/cluster count for a sockbuf and fix a couple whitespace issues in the output. Reviewed by: jhb, markj (both previous version) Approved by: markj (mentor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11062	2017-06-15 04:49:12 +00:00
Ryan Libby	76f2c27264	ddb show files: fix up file types and whitespace This makes ddb show files more descriptive and also adjusts the whitespace to align the columns for non-32-bit architectures. Reviewed by: cem (previous version), jhb Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D11061	2017-06-14 07:46:52 +00:00
Konstantin Belousov	81c3737d95	Remove stray return. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-13 19:02:12 +00:00
Alan Cox	4be4fd5d5f	Reduce the frequency of hint updates on allocation without incurring additional allocation overhead. Previously, blst_meta_alloc() updated the hint after every successful allocation. However, these "eager" hint updates are of no actual benefit if, instead, the "lazy" hint update at the start of blst_meta_alloc() is generalized to handle all cases where the number of available blocks is less than the requested allocation. Previously, the lazy hint update at the start of blst_meta_alloc() only handled the ALL-FULL case. (I would also note that this change provides consistency between blist_alloc() and blist_fill() in that their hint maintenance is now entirely lazy.) Eliminate unnecessary checks for terminators in blst_meta_alloc() and blst_meta_fill() when handling ALL-FREE meta nodes. Eliminate the field "bl_free" from struct blist. It is redundant. Unless the entire radix tree is a single leaf, the count of free blocks is stored in the root node. Instead, provide a function blist_avail() for obtaining the number of free blocks. In blst_meta_alloc(), perform a sanity check on the allocation once rather than repeating it in a loop over the meta node's children. In blst_leaf_fill(), use the optimized bitcount*() function instead of a loop to count the blocks being allocated. Add or improve several comments. Address some nearby style errors. Reviewed by: kib MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D11146	2017-06-13 17:49:49 +00:00
Mark Johnston	46514e7d76	Hint at the intended usage for the "ll" field of struct uuid_private. Discussed with: kib MFC after: 1 week	2017-06-13 15:37:04 +00:00
Konstantin Belousov	b43ce76c77	Add ptrace(PT_GET_SC_ARGS) command to return debuggee's current syscall arguments. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D11080	2017-06-12 21:15:43 +00:00
Konstantin Belousov	f5a077c390	Print unimplemented syscall number to the ctty on SIGSYS, if enabled by the knob kern.lognosys. Discussed with: imp Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 weeks X-Differential revision: https://reviews.freebsd.org/D11080	2017-06-12 21:11:11 +00:00
Konstantin Belousov	2d88da2f06	Move struct syscall_args syscall arguments parameters container into struct thread. For all architectures, the syscall trap handlers have to allocate the structure on the stack. The structure takes 88 bytes on 64bit arches which is not negligible. Also, it cannot be easily found by other code, which e.g. caused duplication of some members of the structure to struct thread already. The change removes td_dbg_sc_code and td_dbg_sc_nargs which were directly copied from syscall_args. The structure is put into the copied on fork part of the struct thread to make the syscall arguments information correct in the child after fork. This move will also allow several more uses shortly. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks X-Differential revision: https://reviews.freebsd.org/D11080	2017-06-12 21:03:23 +00:00
Mark Johnston	56060a373e	Add a helper function for comparing struct uuids. Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11138	2017-06-12 20:14:44 +00:00
Alan Cox	015d7db6b6	Remove an unnecessary field from struct blist. (The comment describing what this field represented was also inaccurate.) Suggested by: kib In r178792, blist_create() grew a malloc flag, allowing M_NOWAIT to be specified. However, blist_create() was not modified to handle the possibility that a malloc() call failed. Address this omission. Increase the width of the local variable "radix" to 64 bits. (This matches the width of the corresponding field in struct blist.) Reviewed by: kib MFC after: 6 weeks	2017-06-10 16:11:39 +00:00
Alan Cox	a7249a6c85	blist_fill()'s return type is too narrow. blist_fill() accepts a 64-bit quantity as the size of the range to fill, but returns a 32-bit quantity as the number of blocks that were allocated to fill that range. This revision corrects that mismatch. Currently, swaponsomething() limits the size of a swap area to prevent arithmetic overflow in other parts of the blist allocator. That limit has also prevented this type mismatch from causing problems. Reviewed by: kib, markj MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D11096	2017-06-09 16:19:24 +00:00
Gleb Smirnoff	438902edbd	Fix stat(2) on a listening socket.	2017-06-09 15:54:48 +00:00
Konstantin Belousov	7abe0df223	Enhance vfs.ino64_trunc_error sysctl. Provide a new mode "2" which returns a special overflow indicator in the non-representable field instead of the silent truncation (mode "0") or EOVERFLOW (mode "1"). In particular, the typical use of st_ino to detect hard links with mode "2" reports false positives, which might be more suitable for some uses. Discussed with: bde Sponsored by: The FreeBSD Foundation	2017-06-09 11:17:08 +00:00
Gleb Smirnoff	779f106aa1	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. 
To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parallelism in the UNIX sockets. - Now call into uipc_attach() may happen with unp_link_lock held if we are accepting, or without unp_link_lock if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basically did nothing for a listening socket. The vnode remained open for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770	2017-06-08 21:30:34 +00:00
Alan Cox	86dd278f03	When allocating swap blocks, if the available number of free blocks in a subtree is already zero, then setting the "largest contiguous free block" hint for that subtree to anything other than zero makes no sense. To be clear, assigning a value to the hint that is too large is not a correctness problem, only a pessimization. Dragonfly BSD has applied the same change to blst_meta_alloc() but not blst_meta_fill(). MFC after: 6 weeks	2017-06-08 15:48:54 +00:00
Gleb Smirnoff	8d40bada3e	Fix a degenerate case when soisdisconnected() would call soisconnected(). This happens when closing a socket with upcall, and trace is: soclose()-> ... protocol ... -> soisdisconnected() -> socantrcvmore_locked() -> sowakeup() -> soisconnected(). Right now this case is innocent for two reasons. First, soisconnected() doesn't clear SS_ISDISCONNECTED flag. Second, the mutex to lock the socket is the socket receive buffer mutex, and sodisconnected() first disables the receive buffer. But in future code, the mutex to lock socket is different to buffer mutex, and we would get undesired mutex recursion. The fix is to check SS_ISDISCONNECTED flag before calling upcall.	2017-06-08 06:16:47 +00:00
Marcelo Araujo	e0a6a23c6d	Allow sysctl kern.vm_guest to return bhyve when running under bhyve. Submitted by: Sean Fagan <sef@ixsystems.com> Reviewed by: grehan MFH: 4 weeks. Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D11090	2017-06-08 04:02:14 +00:00
Alan Cox	d90bf7d850	Originally, this file could be compiled as a user-space application for testing purposes. However, over the years, various changes to the kernel have broken this feature. This revision applies some fixes to get user-space compilation working again. There are no changes in this revision to code that is used by the kernel. MFC after: 3 days	2017-06-07 16:04:34 +00:00
Gleb Smirnoff	b3244df799	Provide typedef for socket upcall function. While here change so_gen_t type to modern uint64_t.	2017-06-07 01:48:11 +00:00
Gleb Smirnoff	b94f68dc52	Remove a piece of dead code.	2017-06-07 01:21:34 +00:00
Alan Cox	03bdd65f18	When the function blist_fill() was added to the kernel in r107913, the swap pager used a different scheme for striping the allocation of swap space across multiple devices. And, although blist_fill() was intended to support fill operations with large counts, the old striping scheme never performed a fill larger than the stripe size. Consequently, the misplacement of a sanity check in blst_meta_fill() went undetected. Now, moving forward in time to r118390, a new scheme for striping was introduced that maintained a blist allocator per device, but as noted in r318995, swapoff_one() was not fully and correctly converted to the new scheme. This change completes what was started in r318995 by fixing the underlying bug in blst_meta_fill() that stops swapoff_one() from simply performing a single blist_fill() operation. Reviewed by: kib MFC after: 5 days Differential Revision: https://reviews.freebsd.org/D11043	2017-06-06 03:32:17 +00:00
Allan Jude	e28f9b7d03	Jails: Optionally prevent jailed root from binding to privileged ports You may now optionally specify allow.noreserved_ports to prevent root inside a jail from using privileged ports (less than 1024) PR: 217728 Submitted by: Matt Miller <mattm916@pulsar.neomailbox.ch> Reviewed by: jamie, cem, smh Relnotes: yes Differential Revision: https://reviews.freebsd.org/D10202	2017-06-06 02:15:00 +00:00
Alan Cox	064650c180	Halve the memory being internally allocated by the blist allocator. In short, half of the memory that is allocated to implement the radix tree is wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int when we changed "daddr_t" to be a 64-bit (signed) int. (See r96849 and r96851.) Reviewed by: kib, markj Tested by: pho MFC after: 5 days Differential Revision: https://reviews.freebsd.org/D11028	2017-06-05 17:14:16 +00:00
Konstantin Belousov	3df7ebc4ed	Add sysctl vfs.ino64_trunc_error controlling action on truncating inode number or link count for the ABI compat binaries. Right now, and by default after the change, too large 64bit values are silently truncated to 32 bits. Enabling the knob causes the system to return EOVERFLOW for stat(2) family of compat syscalls when some values cannot be completely represented by the old structures. For getdirentries(2), knob skips the dirents which would cause non-trivial truncation of d_ino. EOVERFLOW error is specified by the X/Open 1996 LFS document ('Adding Support for Arbitrary File Sizes to the Single UNIX Specification'). Based on the discussion with: bde Sponsored by: The FreeBSD Foundation	2017-06-05 11:40:30 +00:00
Alan Cox	d712b799b5	The data type returned by vmoff() is too narrow in its range. This could break the transmission of files longer than 4 GB on 32-bit architectures. Reviewed by: glebius, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10019	2017-06-03 16:19:33 +00:00
Konstantin Belousov	a7ca2c6ad0	Ensure that cached struct thread does not keep spurious td_su reference on an UFS mount point. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-03 14:12:17 +00:00
Gleb Smirnoff	971af2a311	Rename accept filter getopt/setopt functions, so that they are prefixed with module name and match other functions in the module. There is no functional change.	2017-06-02 17:49:21 +00:00
Gleb Smirnoff	810951ddc9	Style: unwrap lines that don't have a good reason to be wrapped.	2017-06-02 17:43:47 +00:00
Gleb Smirnoff	bd617e3b98	Remove write only flag UNP_HAVEPCCACHED.	2017-06-02 17:39:05 +00:00
Gleb Smirnoff	0c3c207ffd	For UNIX sockets make vnode point not to the socket, but to the UNIX PCB, since the latter is the thing that links together VFS and sockets. While here, make the union in the struct vnode anonymous.	2017-06-02 17:31:25 +00:00
Mateusz Guzik	c7a6a1b325	mtx: fix whitespace damage in _mtx_trylock_flags_ MFC after: 3 days	2017-05-30 02:25:47 +00:00
Konstantin Belousov	03311f117b	Use whole mnt_stat.f_fsid bits for st_dev. Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this method of reporting st_dev by stat(2). Provide a new helper vn_fsid() to avoid duplicating code to copy f_fsid to va_fsid. Note that the change is mostly cosmetic. Its motivation is to avoid sign-extension of f_fsid[0] into 64bit dev_t value which happens after dev_t becomes 64bit. Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version) Sponsored by: The FreeBSD Foundation	2017-05-27 17:00:30 +00:00
Conrad Meyer	95b978955c	procstat(1): Add TCP socket send/recv buffer size Add TCP socket send and receive buffer size to procstat -f output. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10689	2017-05-26 22:17:44 +00:00
Allan Jude	c20feae640	Followup to r318765 (capsicumize cpuset_affinity) Update sysent files	2017-05-24 01:01:57 +00:00
Allan Jude	f299c47b52	Allow cpuset_{get,set}affinity in capabilities mode bhyve was recently sandboxed with capsicum, and needs to be able to control the CPU sets of its vcpu threads Reviewed by: emaste, oshogbo, rwatson MFC after: 2 weeks Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D10170	2017-05-24 00:58:30 +00:00
Steve Wills	a4aaba3b0a	Add security.bsd.see_jail_proc Add security.bsd.see_jail_proc sysctl to hide jail processes from non-root users Reviewed by: jamie Approved by: allanjude Relnotes: yes Differential Revision: https://reviews.freebsd.org/D10770	2017-05-23 16:59:24 +00:00
Konstantin Belousov	ec95c622ff	Regen.	2017-05-23 09:30:42 +00:00
Konstantin Belousov	6992112349	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). 
The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
Ed Maste	bd309b323a	Regen sysent after r318634, no open(2) in capability mode Sponsored by: The FreeBSD Foundation	2017-05-22 11:45:45 +00:00
Ed Maste	68fc8f3934	disallow open(2) in capability mode Previously open(2) was allowed in capability mode, with a comment that suggested this was likely the case to facilitate debugging. The system call would still fail later on, but it's better to disallow the syscall altogether. We now have the kern.trap_enotcap sysctl or PROC_TRAPCAP_CTL proccontrol to aid in debugging. In any case libc has translated open() to the openat syscall since r277032. Reviewed by: kib, rwatson Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10850	2017-05-22 11:43:19 +00:00
Mark Johnston	3bd485f968	Avoid open-coding PRI_UNCHANGED. MFC after: 1 week	2017-05-18 18:24:11 +00:00
Ed Maste	3e85b721d6	Remove register keyword from sys/ and ANSIfy prototypes A long long time ago the register keyword told the compiler to store the corresponding variable in a CPU register, but it is not relevant for any compiler used in the FreeBSD world today. ANSIfy related prototypes while here. Reviewed by: cem, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10193	2017-05-17 00:34:34 +00:00
John Baldwin	00f6cd3f56	Add sglist_append_sglist(). This function permits a range of one scatter/gather list to be appended to another sglist. This can be used to construct a scatter/gather list that reorders or duplicates ranges from one or more existing scatter/gather lists. Sponsored by: Chelsio Communications	2017-05-16 23:31:52 +00:00
Konstantin Belousov	391aba32e6	mnt_vnode_next_active: use conventional lock order when trylock fails. Previously, when the VI_TRYLOCK failed, we would spin under the mutex that protects the vnode active list until we either succeeded or noticed that we had hogged the CPU. Since we were violating the lock order, this would guarantee that we would become a hog under any deadlock condition (e.g. a race with vdrop(9) on the same vnode). In the presence of many concurrent threads in sync(2) or vdrop etc, the victim could hang for a long time. Now, avoid spinning by dropping and reacquiring the locks in the conventional lock order when the trylock fails. This requires a dance with the vnode hold count. Submitted by: Tom Rix <trix@juniper.net> Tested by: pho Differential revision: https://reviews.freebsd.org/D10692	2017-05-15 10:02:45 +00:00
Konstantin Belousov	396a0d4455	Do not wake up sleeping thread in reschedule_signals() if the signal is blocked. The spurious wakeup might result in spurious EINTR. The reschedule_signals() function is called when the calling thread has the signal mask changed. For each newly blocked signal, we try to find a thread which might have the signal not blocked. If no such thread exists, sigtd() returns a random thread, which must not be woken up. I decided that re-checking, as suggested by PR submitter, is a more reasonable change than to change sigtd() interface, due to other uses of sigtd(). signotify() already performs this check. Submitted by: Duane <parakleta@darkreality.org> PR: 219228 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-05-12 15:34:59 +00:00
Mark Johnston	0c46712ca1	Let ptracestop() suspend threads sleeping in an SBDRY section. When a thread enters ptracestop(), for example because it had received SIGSTOP from ptrace(PT_ATTACH), it attempts to suspend other threads in the same process. In the case of a thread sleeping interruptibly in an SBDRY section, sig_suspend_threads() must wake the thread and allow it to reach the user-mode boundary. However, sig_suspend_threads() would erroneously avoid waking up such threads, resulting in an apparent hang. Reviewed by: kib Tested by: pho MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2017-05-11 17:03:45 +00:00
Marius Strobl	26d877f5b8	- Also outside of the KOBJOPLOOKUP macro - which in turn is used by the code auto-generated for *.m - kobj_lookup_method(9) is useful; for example in back-ends or base class device drivers in order to determine whether a default method has been overridden. Thus, allow for the kobj_method_t pointer argument - used by KOBJOPLOOKUP in order to update the cache entry - of kobj_lookup_method(9), to be NULL. Actually, that pointer is redundant as it's just set to the same kobj_method_t that the kobj_lookup_method(9) function returns in the first place, but probably it serves to reduce the number of instructions generated for KOBJOPLOOKUP. - For the same reason, move updating kobj_lookup_{hits,misses} (if KOBJ_STATS is defined) from kobj_lookup_method(9) to KOBJOPLOOKUP. As a side-effect, this gets rid of the convoluted approach of always incrementing kobj_lookup_hits in KOBJOPLOOKUP and then in case of a cache miss, decrementing it in kobj_lookup_method(9) again.	2017-05-08 21:08:39 +00:00
Brooks Davis	f19351aad8	Provide a freebsd32 implementation of sigqueue() The previous misuse of sys_sigqueue() was sending random register or stack garbage to 64-bit targets. The freebsd32 implementation preserves the sival_int member of value when signaling a 64-bit process. Document the mixed ABI implementation of union sigval and the incompatibility of sival_ptr with pointer integrity schemes. Reviewed by: kib, wblock MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D10605	2017-05-05 18:49:39 +00:00
Mateusz Guzik	8066a14a3c	cache: stop holding the ncneg_hot lock across purging Only non-hot entries are purged so the lock is not needed in the first place. This saves one lock/unlock pair. MFC after: 1 week	2017-05-04 03:11:59 +00:00
Conrad Meyer	29dfb631d8	Extend cpuset_get/setaffinity() APIs Add IRQ placement-only and ithread-only API variants. intr_event_bind has been extended with sibling methods, as it has many more callsites in existing code. Reviewed by: kib@, adrian@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10586	2017-05-03 18:41:08 +00:00
Konstantin Belousov	acd9f51725	Add asserts to verify stability of struct proc and struct thread layouts. Some notes: - Only i386 and amd64 layouts are checked, other Tier-1 (or close to it) architectures would benefit from the same check. - Unconditional enabling of the asserts depends on the stability of locks memory layout. If locks are optimized to avoid bloat when some debugging or profiling features are turned off, it makes sense to only assert layout for production configs. Reviewed by: badger, emaste, jhb, vangyzen Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D10526	2017-04-27 21:24:50 +00:00
Patrick Kelsey	1431521236	Remove unnecessary check for NULL mbuf in soreceive_generic(). This check has been redundant since it was introduced in r162554. Reviewed by: emaste, glebius MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10322	2017-04-25 19:54:34 +00:00
Edward Tomasz Napierala	04005c2f92	Make it possible to terminate "show lockedbufs" by pressing "q". MFC after: 2 weeks	2017-04-23 22:20:25 +00:00
Edward Tomasz Napierala	10be945708	Improve BUF_TRACKING by not displaying NULL entries. Reviewed by: cem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D10443	2017-04-23 17:39:31 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Gleb Smirnoff	fef0991322	Typo!	2017-04-17 17:07:51 +00:00
Gleb Smirnoff	9ed01c32e0	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
Gleb Smirnoff	6286dc78d4	Remove unneeded include of vm_phys.h.	2017-04-17 16:51:04 +00:00
Edward Tomasz Napierala	b66f26e931	Don't try to write out bufs that have already failed with ENXIO. This fixes some panics after disconnecting mounted disks. Submitted by: imp (slightly different version, which I've then lost) Reviewed by: kib, imp, mckusick MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9674	2017-04-14 20:15:34 +00:00
Maxim Sobolev	63649db042	Restore ability to shutdown DGRAM sockets, still forcing ENOTCONN to be returned by the shutdown(2) system call. This ability has been lost as part of the svn revision 285910. Reviewed by: ed, rwatson, glebius, hiren MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D10351	2017-04-14 17:23:28 +00:00
Andrey V. Elsukov	57386f5dce	Fix the build. Reported by: lwhsu	2017-04-14 10:21:38 +00:00
Andrey V. Elsukov	c33a231337	Rework r316770 to make it protocol independent and general, like we do for streaming sockets. And do more cleanup in the sbappendaddr_locked_internal() to prevent leaking information from the existing mbuf to the one that will possibly be created later by netgraph. Suggested by: glebius Tested by: Irina Liakh <spell at itl ua> MFC after: 1 week	2017-04-14 09:00:48 +00:00
Andrew Turner	4e65501f13	Don't prefix zero with 0x in assym.s. The arm64 binutils only accepts 0 as an offset to the Load-Acquire Register instructions where llvm will accept both 0 and 0x0. The thread switching code uses these with SCHED_ULE to block waiting for a lock to be released. As the offset of the data to be loaded is zero this is safe, however it is useful to keep the offset in the instruction to document what is being loaded. To work around this issue in binutils only generate the 0x prefix for non-zero values. Reported by: kan Sponsored by: DARPA, AFRL	2017-04-13 15:43:44 +00:00
Patrick Kelsey	67d955aab4	Corrected misspelled versions of rendezvous. The MFC will include a compat definition of smp_no_rendevous_barrier() that calls smp_no_rendezvous_barrier(). Reviewed by: gnn, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10313	2017-04-09 02:00:03 +00:00
Conrad Meyer	69cfbe8851	kern_descrip: Move kinfo_ofile size assert under COMPAT_FREEBSD7 The size and structure are not used outside of FreeBSD 7 compatibility ABIs. Sponsored by: Dell EMC Isilon	2017-04-07 05:00:09 +00:00
Brooks Davis	a3b7d0fb60	Regen after r316594.	2017-04-06 23:40:51 +00:00
Brooks Davis	982519d10f	Change the size argument of __getcwd() to size_t. This matches the getcwd() definition. This is technically an ABI change, but that would only affect 64-bit big-endian platforms that pass arguments on the stack. We have none of those. Reviewed by: jhb Obtained from: CheriABI Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9428	2017-04-06 23:40:13 +00:00
Konstantin Belousov	0226f65940	Add V_VMIO flag for vinvalbuf(9) to indicate that the flush request was issued during VM-initiated i/o (pageout), so that the function does not try to flush or remove pages or wait for the vm object paging-in-progress counter. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D10241	2017-04-05 16:57:53 +00:00
Brooks Davis	db3625531e	Correct a kernel stack leak in 32-bit compat when vfc_name is short. Don't zero unused pointer members again. Per discussion with secteam we are not issuing an advisory for this issue as we have no current evidence it leaks exploitable information. Reviewed by: rwatson, glebius, delphij MFC after: 1 day Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D10227	2017-04-04 17:32:08 +00:00
Robert Watson	8e6be21a58	Audit arguments to posix_fallocate(2) and posix_fadvise(2) system calls. As posix_fadvise() does not lock the vnode argument, don't capture detailed vnode information for the time being. Obtained from: TrustedBSD Project MFC after: 3 weeks Sponsored by: DARPA, AFRL	2017-03-31 14:17:14 +00:00
Robert Watson	15bcf785ba	Audit arguments to POSIX message queues, semaphores, and shared memory. This requires minor changes to the audit framework to allow capturing paths that are not filesystem paths (i.e., will not be canonicalised relative to the process current working directory and/or filesystem root). Obtained from: TrustedBSD Project MFC after: 3 weeks Sponsored by: DARPA, AFRL	2017-03-31 13:43:00 +00:00
Robert Watson	1c2da02938	Audit arguments to System V IPC system calls implementing semaphores, message queues, and shared memory. Obtained from: TrustedBSD Project MFC after: 3 weeks Sponsored by: DARPA, AFRL	2017-03-30 22:26:15 +00:00
Robert Watson	f907080983	Add system-call argument auditing for ACL-related system calls. Obtained from: TrustedBSD Project MFC after: 3 weeks Sponsored by: DARPA, AFRL	2017-03-30 22:00:58 +00:00
Tycho Nightingale	86be94fca3	Add support for capturing 'struct ptrace_lwpinfo' for signals resulting in a process dumping core in the corefile. Also extend procstat to view select members of 'struct ptrace_lwpinfo' from the contents of the note. Sponsored by: Dell EMC Isilon	2017-03-30 18:21:36 +00:00
Konstantin Belousov	3aeacc55a5	A followup to r315749, two more places where brand->interp_path was accessed unconditionally. Reported by: se Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-30 04:21:02 +00:00
Robert Watson	b783025921	When handling msgsys(2), semsys(2), and shmsys(2) multiplex system calls, map the 'which' argument into a suitable audit event identifier for the specific operation requested. Obtained from: TrustedBSD Project MFC after: 3 weeks Sponsored by: DARPA, AFRL	2017-03-29 23:31:35 +00:00
Robert Watson	d8ca0a2b70	Hook up new audit event identifiers for various non-Orange Book/CAPP system calls supported by OpenBSM 1.2-alpha5. Obtained from: TrustedBSD Project MFC after: 3 weeks Sponsored by: DARPA, AFRL	2017-03-29 22:33:56 +00:00
Bruce Evans	1370fa3380	Oops, my fix for bright colors broke bright black some more (in cases that used to work via the bold hack). Fix the table entry for bright black. Fix spelling of plain black in nearby table entries (use the macro for black everywhere). Fix the currently-unused non-bright color table to not have bright colors in entries 9-15. Improve nearby comments. Start converting to the xterm terminology and default rendering of "bright" instead of "light" for bright colors. Syscons wasn't affected by the bug since I optimized it a little by converting colors 0-15 directly. This also fixes the layering of the conversion for these colors. Apply the same optimization to vt (actually the layer above it). This also moves the conversion 1 closer to the correct layer for colors 0-15. The optimization of just avoiding 2 calls to a trivial function is worth about 10% for simple output to the virtual buffer with occasional rendering. The optimization is so large because the 2 calls are done on every character, so although there are too many other calls and other instructions per character, there are only about 10 times as many. Old versions of syscons were about 10 times faster for simple output, by using a fast path with about 12 instructions per character. Rendering to even slow hardware takes relatively little time provided it is rarely actually done.	2017-03-27 10:48:28 +00:00
Andriy Gapon	20c69e76b4	dtrace sched:::preempt should fire only when there is preemption Previously the probe fired on any thread switch. Reviewed by: markj MFC after: 1 week Sponsored by: Panzura	2017-03-25 19:08:51 +00:00
Gleb Smirnoff	9e3c8bd3e2	Make sendfile(2) more robust against file change. This fixes a possible crash when the file shrinks. This also fixes sendfile(2) not sending more data in a case when the file grows, and the request is open-ended or specifies a size that is greater than old file size. PR: 217789 Reviewed by: gallatin MFC after: 10 days	2017-03-24 16:01:19 +00:00
Ed Schouten	0fe9832013	Don't require the presence of the compat_3_brand. The existing ELF image activator requires the brandinfo to provide such a string unconditionally, even if the executable format in question doesn't use this type of branding. Skip matching when it's a null pointer. Reviewed by: kib MFC after: 2 weeks	2017-03-23 14:09:45 +00:00
Andriy Gapon	afa0a46cfd	move thread switch tracing from mi_switch to sched_switch This is done so that the thread state changes during the switch are not confused with the thread state changes reported when the thread spins on a lock. Here is an example, three consecutive entries for the same thread (from top to bottom): KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)" KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1" KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none The above trace could leave an impression that the final state of the thread was "running". After this change the sleep state will be reported after the "spinning" and "running" states reported for the sched lock. Reviewed by: jhb, markj MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9961	2017-03-23 08:57:04 +00:00
Konstantin Belousov	2274ab3d7b	Update r315753 with the proper flag name. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-22 22:28:13 +00:00
Konstantin Belousov	1438fe3cf2	Add a flag BI_BRAND_ONLY_STATIC to specify that the brand only matches static binaries. Interpretation of the 'static' there is that the binary must not specify an interpreter. In particular, shared objects are matched by the brand if BI_CAN_EXEC_DYN is also set. This improves precision of the brand matching, which should eliminate surprises due to brand ordering. Revert r315701. Discussed with and tested by: ed (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-22 22:23:01 +00:00
Konstantin Belousov	7aab7a80e2	Adjust r314851 to not require every brand to specify interpreter path. Reported and tested by: ed Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-22 22:06:48 +00:00
Enji Cooper	8b69c3e79f	Print out name of non-dynamic sysctl in sysctl_remove_oid_locked This will provide a slightly better smoking gun than just stating "can't remove non-dynamic nodes!" when calling sysctl_ctx_free(9) and sysctl_remove_{name,oid}(9) with a non-dynamic (likely static) sysctl. MFC after: 1 week Sponsored by: Dell EMC Isilon	2017-03-22 05:27:20 +00:00
Conrad Meyer	5abb3b74d3	kern_fail: Allow sleeping for more than 2147483/hz seconds Because of integer types, the timeout calculation result was limited to INT_MAX / (1000 * hz) seconds. For systems with hz=10000, this is only 215 seconds. Perform the calculation with 64-bit math to allow sleeping for the full INT_MAX / hz interval (215000 seconds on such hz=10000 systems). Submitted by: Scott Ferris <sferris at isilon.com> Sponsored by: Dell EMC Isilon	2017-03-21 22:41:37 +00:00
Ed Maste	26af611582	tighten buffer bounds in imgact_binmisc_populate_interp We must ensure there's space for the terminating null in the temporary buffer in imgact_binmisc_populate_interp(). Note that there's no buffer overflow here because xbe->xbe_interpreter's length and null termination is checked in imgact_binmisc_add_entry() before imgact_binmisc_populate_interp() is called. However, the latter should correctly enforce its own bounds. Reviewed by: sbruno MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10042	2017-03-21 18:02:14 +00:00
Alan Cox	2a016de1a5	Use IDX_TO_OFF(), not ptoa(), when converting the difference between two vm_pindex_t's into a vm_ooffset_t. The length given to shm_dotruncate() must never be negative. Assert this. Tidy up a comment. Reviewed by: kib MFC after: 1 week	2017-03-20 05:15:55 +00:00
Alan Cox	ac46d38655	Style fixes. In particular, the variable "bogus" is used like a Boolean. Define it as such. Reviewed by: kib MFC after: 1 week	2017-03-19 23:06:11 +00:00
Eric van Gyzen	26f86ab732	Regenerate syscall files for r315526 Sponsored by: Dell EMC	2017-03-19 00:54:24 +00:00
Eric van Gyzen	3f8455b090	Add clock_nanosleep() Add a clock_nanosleep() syscall, as specified by POSIX. Make nanosleep() a wrapper around it. Attach the clock_nanosleep test from NetBSD. Adjust it for the FreeBSD behavior of updating rmtp only when interrupted by a signal. I believe this to be POSIX-compliant, since POSIX mentions the rmtp parameter only in the paragraph about EINTR. This is also what Linux does. (NetBSD updates rmtp unconditionally.) Copy the whole nanosleep.2 man page from NetBSD because it is complete and closely resembles the POSIX description. Edit, polish, and reword it a bit, being sure to keep any relevant text from the FreeBSD page. Reviewed by: kib, ngie, jilles MFC after: 3 weeks Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D10020	2017-03-19 00:51:12 +00:00
Alan Cox	c547cbb49c	Avoid unnecessary calls to vm_map_protect() in elf_load_section(). Typically, when elf_load_section() unconditionally passed VM_PROT_ALL to elf_map_insert(), it was needlessly enabling execute access on the mapping, and it would later have to call vm_map_protect() to correct the mapping's access rights. Now, instead, elf_load_section() always passes its parameter "prot" to elf_map_insert(). So, elf_load_section() must only call vm_map_protect() if it needs to remove the write access that was temporarily granted to perform a copyout(). Reviewed by: kib MFC after: 1 week	2017-03-18 23:37:00 +00:00
Eric van Gyzen	4cf66812ea	nanosleep: plug a kernel memory disclosure nanosleep() updates rmtp on EINVAL. In that case, kern_nanosleep() has not updated rmt, so sys_nanosleep() updates the user-space rmtp by copying garbage from its stack frame. This is not only a kernel memory disclosure, it's also not POSIX-compliant. Fix it to update rmtp only on EINTR. Reviewed by: jilles (via D10020), dchagin MFC after: 3 days Security: possibly Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D10044	2017-03-18 20:16:23 +00:00
Bruce Evans	4eb235fb4f	Fix bright colors for syscons, and make them work for the first time for vt. Restore syscons' rendering of background (bg) brightness as foreground (fg) blinking and vice versa, and add rendering of blinking as background brightness to vt. Bright/saturated is conflated with light/white in the implementation and in this description. Bright colors were broken in all cases, but appeared to work in the only case shown by "vidcontrol show". A boldness hack was applied only in 1 layering-violation place (for some syscons sequences) where it made some cases seem to work but was undone by clearing bold using ANSI sequences, and more seriously was not undone when setting ANSI/xterm dark colors so left them bright. Move this hack to drivers. The boldness hack is only for fg brightness. Restore/add a similar hack for bg brightness rendered as fg blinking and vice versa. This works even better for vt, since vt changes the default text mode to give the more useful bg brightness instead of fg blinking. The brightness bit in colors was unnecessarily removed by the boldness hack. In other cases, it was lost later by teken_256to8(). Use teken_256to16() to not lose it. teken_256to8() was intended to be used for bg colors to allow finer or bg-specific control for the more difficult reduction to 8; however, since 16 bg colors actually work on VGA except in syscons text mode and the conversion isn't subtle enough to significantly in that mode, teken_256to8() is not used now. There are still bugs, especially in vidcontrol, if bright/blinking background colors are set. Restore XOR logic for bold/bright fg in syscons (don't change OR logic for vt). Remove broken ifdef on FG_UNDERLINE and its wrong or missing bit and restore the correct hard-coded bit. FG_UNDERLINE is only for mono mode which is not really supported. Restore XOR logic for blinking/bright bg in syscons (in vt, add OR logic and render as bright bg). 
Remove related broken ifdef on BG_BLINKING and its missing bit and restore the correct hard-coded bit. The same bit means blinking or bright bg depending on the mode, and we want to ignore the difference everywhere. Simplify conversions of attributes in syscons. Don't pretend to support bold fonts. Don't support unusual encodings of brightness. It is as good as possible to map 16 VGA colors to 16 xterm-16 colors. E.g., VGA brown -> xterm-16 Olive will be converted back to VGA brown, so we don't need to convert to xterm-256 Brown. Teken cons25 compatibility code already does the same, and duplicates some small tables. This is mostly for the sc -> te direction. The other direction uses teken_256to16() which is too generic.	2017-03-18 11:13:54 +00:00
Konstantin Belousov	469ec1eb6a	When clearing altsigstack settings on exec, do it to the right thread. Diagnosed by: smh Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-17 13:37:37 +00:00
Eric Badger	b4d3325975	Don't clear p_ptevents on normal SIGKILL delivery The ptrace() user has the option of discarding the signal. In such a case, p_ptevents should not be modified. If the ptrace() user decides to send a SIGKILL, ptevents will be cleared in ptracestop(). procfs events do not have the capability to discard the signal, so continue to clear the mask in that case. Reviewed by: jhb (initial revision) MFC after: 1 week Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9939	2017-03-16 13:03:31 +00:00
John Baldwin	03f7f17878	Use UMA_ALIGN_PTR instead of sizeof(void *) for zone alignment. uma_zcreate()'s alignment argument is supposed to be sizeof(foo) - 1, and uma.h provides a set of helper macros for common types. Passing sizeof(void *) results in all of the members being misaligned triggering unaligned access faults on certain architectures (notably MIPS). Reported by: brooks Obtained from: CheriBSD MFC after: 3 days Sponsored by: DARPA / AFRL	2017-03-15 18:23:32 +00:00
Alan Cox	52d1addaa1	Relax the locking requirements for vm_object_page_noreuse(). While reviewing all uses of OFF_TO_IDX(), I observed that vm_object_page_noreuse() is requiring an exclusive lock on the object when, in fact, a shared lock suffices. Reviewed by: kib, markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D10011	2017-03-15 17:43:45 +00:00
Mark Johnston	7d88be4c03	When draining a callout, don't clear CALLOUT_ACTIVE while it is running. The callout may reschedule itself and execute again before callout_drain() returns, but we should not clear CALLOUT_ACTIVE until the callout is stopped. Tested by: pho MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2017-03-15 00:29:27 +00:00
Eric van Gyzen	8addc72b3e	Add missing pieces of r315280 I moved this branch from github to a private server, and pulled from the wrong one when committing r315280, so I failed to include two recent commits. Thankfully, they were only cosmetic and were included in the review. Specifically: Add documentation, polish comments, and improve style(9). Tested by: pho (r315280) MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9791	2017-03-14 22:02:02 +00:00
Konstantin Belousov	d1780e8dac	Use atop() instead of OFF_TO_IDX() for conversion of addresses or address offsets, as intended. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-03-14 19:39:17 +00:00
Eric van Gyzen	9dbdf2a169	When the RTC is adjusted, reevaluate absolute sleep times based on the RTC POSIX 2008 says this about clock_settime(2): If the value of the CLOCK_REALTIME clock is set via clock_settime(), the new value of the clock shall be used to determine the time of expiration for absolute time services based upon the CLOCK_REALTIME clock. This applies to the time at which armed absolute timers expire. If the absolute time requested at the invocation of such a time service is before the new value of the clock, the time service shall expire immediately as if the clock had reached the requested time normally. Setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock, including the nanosleep() function; nor on the expiration of relative timers based upon this clock. Consequently, these time services shall expire when the requested relative interval elapses, independently of the new or old value of the clock. When the real-time clock is adjusted, such as by clock_settime(3), wake any threads sleeping until an absolute real-clock time. Such a sleep is indicated by a non-zero td_rtcgen. The sleep functions will set that field to zero and return zero to tell the caller to reevaluate its sleep duration based on the new value of the clock. At present, this affects the following functions: pthread_cond_timedwait(3) pthread_mutex_timedlock(3) pthread_rwlock_timedrdlock(3) pthread_rwlock_timedwrlock(3) sem_timedwait(3) sem_clockwait_np(3) I'm working on adding clock_nanosleep(2), which will also be affected. Reported by: Sebastian Huber <sebastian.huber@embedded-brains.de> Reviewed by: jhb, kib MFC after: 2 weeks Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9791	2017-03-14 19:06:44 +00:00
Konstantin Belousov	01feb4c3d4	Use designated initializers for kevent_copyops. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-14 09:25:01 +00:00
Konstantin Belousov	67d0b0ea60	Hide kev_iovlen() definition under #ifdef KTRACE, fixing build of kernel configs without KTRACE. Reported by: rpokala Sponsored by: The FreeBSD Foundation MFC after: 4 days	2017-03-14 08:45:52 +00:00
Ian Lepore	e3f87f6c70	Change 'Hz' back to 'HZ'... it's referring to the kernel config option named HZ, not being used as an abbreviation of the unit of measure.	2017-03-12 18:07:03 +00:00
Ian Lepore	8a3966405e	Correct the abbreviations for microseconds (us, not ms), and for Hz (not HZ).	2017-03-12 17:43:45 +00:00
Konstantin Belousov	9a2dde8013	Avoid reusing p_ksi while it is on queue. When sending SIGCHLD informing reaper that a zombie was reparented to it, we might race with the situation where the previous parent has still not finished delivering SIGCHLD and has its p_ksi structure on the signal queue. While on queue, the ksi should not be used for another send. Fix this by copying p_ksi into newly allocated ksi, which is directly put onto reaper sigqueue. The latter ensures that siginfo for reaper SIGCHLD is always present, similar to guarantees for siginfo of child. Reported by: bdrewery Discussed with: jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-12 13:58:51 +00:00
Konstantin Belousov	9bcf2f2da0	Accept linker's representation for ELF segments with zero on-disk length. For such segments, GNU bfd linker writes a knowingly incorrect value into the file offset field of the program header entry, with the motivation that file should not be mapped for creation of this segment at all. Relax checks for the ELF structure validity when on-disk segment length is zero, and explicitly set mapping length to zero for such segments to avoid validating rounding arithmetic. PR: 217610 Reported by: Robert Clausecker <fuz@fuz.su> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-12 13:51:13 +00:00
Konstantin Belousov	973d67c407	Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-12 13:49:42 +00:00
Konstantin Belousov	1e4296c919	Ktracing kevent(2) calls with unusual arguments might lead to overly large allocation requests. When ktrace-ing io, sys_kevent() allocates memory to copy the requested changes and reported events. Allocations are sized by the incoming syscall length arguments, which are user-controlled, and might cause overflow in calculations or too large allocations. Since io trace chunks are limited by ktr_geniosize, there is no sense in even trying to satisfy unbounded allocations. Export ktr_geniosize and clamp the buffer sizes in advance. PR: 217435 Reported by: Tim Newsham <tim.newsham@nccgroup.trust> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-12 13:48:24 +00:00
Alan Cox	e383e820d3	Simplify the control flow and tidy up a comment in map_insert. In collaboration with: kib MFC after: 1 week	2017-03-11 18:57:13 +00:00
Andriy Gapon	28ef18b8c1	trace thread running state when a thread is run for the first time This applies to both KTR_SCHED and DTrace sched:::on-cpu tracing. MFC after: 10 days	2017-03-11 15:57:36 +00:00
Andriy Gapon	6c9271a918	actually implement proc:::lwp-exit probe MFC after: 4 days	2017-03-11 15:47:27 +00:00
Mahdi Mokhtari	32a1fb0d3d	Fix NULL pointer dereference and panic with shm file pread/pwrite. PR: 217429 Reported by: Tim Newsham <tim.newsham@nccgroup.trust> Reviewed by: kib Approved by: dchagin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D9844	2017-03-10 10:09:44 +00:00
Gleb Smirnoff	d0147e10ca	In linker_load_file() print name of a file that failed to load. Discussed with: kib	2017-03-09 00:56:07 +00:00
Gleb Smirnoff	a344babf78	Reduce stack usage in link_elf_load_file(), allocating struct nameidata. This function may be called recursively, when a module pulls its dependencies. Under certain circumstances, e.g. quad chain of dependencies and presence of dtrace we may run out of stack.	2017-03-09 00:45:15 +00:00
Gleb Smirnoff	14984031b7	m_mbuftouio() doesn't modify the mbuf.	2017-03-07 19:00:50 +00:00
Eric Badger	b38bd91f4f	don't stop in issignal() if P_SINGLE_EXIT is set Suppose a traced process is stopped in ptracestop() due to receipt of a SIGSTOP signal, and is awaiting orders from the tracing process on how to handle the signal. Before sending any such orders, the tracing process exits. This should kill the traced process. But suppose a second thread handles the SIGKILL and proceeds to exit1(), calling thread_single(). The first thread will now awaken and will have a chance to check once more if it should go to sleep due to the SIGSTOP. It must not sleep after P_SINGLE_EXIT has been set; this would prevent the SIGKILL from taking effect, leaving a stopped orphan behind after the tracing process dies. Also add new tests for this condition. Reviewed by: kib MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9890	2017-03-07 13:41:01 +00:00
Konstantin Belousov	15a9aedfa1	When selecting brand based on old Elf branding, prefer the brand which interpreter exactly matches the one requested by the activated image. This change applies r295277, which did the same for note branding, to the old brand selection, with the same reasoning of fixing compat32 interpreter substitution. PR: 211837 Reported by: kenji@kens.fm Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-07 13:38:25 +00:00
Konstantin Belousov	3d560b4be2	Require whole brand string matching for old Elf branding. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-07 13:37:35 +00:00
Konstantin Belousov	0bbee4cd3f	Consistently use vm_ooffset_t type for the vm object offset in elf_load_section. The values passed currently as vm_offset_t are phdr.p_offset, which have the native Elf word size. Since elf_load_section interprets them as the file offset, use vm object offset type. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-07 13:36:43 +00:00
Hiren Panchasara	f41b2de716	Fix the KASSERT check from r314813. len being 0 is valid. Submitted by: ngie Reported by: ngie (via jenkins test run) Sponsored by: Limelight Networks	2017-03-07 06:46:38 +00:00
Hiren Panchasara	b5b023b91e	We've found a recurring problem where some userland process would be stuck spinning at 100% cpu around sbcut_internal(). Inside sbflush_internal(), sb_ccc reached to about 4GB and before passing it to sbcut_internal(), we type-cast it from uint to int making it -ve. The root cause of sockbuf growing this large is unknown. Correct fix is also not clear but based on mailing list discussions, adding KASSERTs to panic instead of looping endlessly. Reviewed by: glebius Sponsored by: Limelight Networks	2017-03-07 00:20:01 +00:00
Gleb Smirnoff	6cf0c1db55	Fix compilation of r314784 on 32 bit.	2017-03-06 22:32:56 +00:00
Gleb Smirnoff	f2498877c9	In panic() print current timestamp, which matches timestamp in the dump header. This will help to correlate console server logs with dump files, no matter how precise is clock on a console server appliance, and how buggy the appliance is.	2017-03-06 19:14:08 +00:00
Konstantin Belousov	aaadc41f6c	Instead of direct use of vm_map_insert(), call vm_map_fixed(MAP_CHECK_EXCL). This KPI explicitly indicates the intent of creating the mapping at the fixed address, and incorporates the map locking into the callee. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-06 14:09:54 +00:00
Alan Cox	28e8da6517	Style and punctuation fixes. Reviewed by: kib MFC after: 3 days	2017-03-05 23:59:04 +00:00
Emmanuel Vadot	c1b014c51c	Export a sysctl dev.<clkdom>.<unit>.clocks for each clock domain containing all the clocks that they provide. Each clock is exported under the node 'clock.<clkname>' and has the following child nodes : - frequency - parent (The selected parent, if any) - parents (The list of parents, if any) - childrens (The list of children, if any) - enable_cnt (The enabled counter) This gives us the possibility to examine clocks at runtime and make a graph of the clock flow. Reviewed by: mmel MFC after: 2 month Differential Revision: https://reviews.freebsd.org/D9833	2017-03-05 07:13:29 +00:00
Eric van Gyzen	8a8bea603c	Fix grammar in some comments in subr_sleepqueue.c While I'm here, remove trailing whitespace. Reviewed by: kib, mostly, as part of a larger review MFC after: 3 days	2017-03-03 21:03:28 +00:00
Mark Johnston	7813302434	Fix a ticks comparison in sched_pctcpu_update(). We may fail to reset the %CPU tracking window if a thread does not run for over half of the ticks rollover period, resulting in a bogus %CPU value for the thread until ticks fully rolls over. Handle this by comparing the unsigned difference ticks - ts_ltick with SCHED_TICK_TARG instead. Reviewed by: cem, jeff MFC after: 1 week Sponsored by: Dell EMC Isilon	2017-03-03 20:57:40 +00:00
Ed Maste	e052a8b932	kern_sig.c: ANSIfy and remove archaic register keyword Sponsored by: The FreeBSD Foundation	2017-03-02 22:17:53 +00:00
Konstantin Belousov	fe0a8a3994	Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-03-02 17:35:13 +00:00
Hans Petter Selasky	403f4a31ab	Implement taskqueue_poll_is_busy() for use by the LinuxKPI. Refer to comment above function for a detailed description. Discussed with: kib @ MFC after: 1 week Sponsored by: Mellanox Technologies	2017-03-02 12:20:23 +00:00
Sean Bruno	d945ed6472	Make gtaskqueue compatible with drm-next such that they can be used with the linuxkpi tasklets. Submitted by: mmacy@nextbsd.org Reported by: hps	2017-03-01 18:37:35 +00:00
Konstantin Belousov	55b985b43b	Use vm_map_insert() instead of vm_map_find() in elf_map_insert(). Elf_map_insert() needs to create mapping at the known fixed address. Usage of vm_map_find() assumes, on the other hand, that any suitable address space range above or equal the specified hint, is acceptable. Due to operating on the fresh or cleared address space, vm_map_find() usually creates mapping starting exactly at hint. Switch to vm_map_insert() use to clearly request fixed mapping from the VM. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-03-01 10:28:15 +00:00
Konstantin Belousov	e3d8f8fed4	When deallocating the vm object in elf_map_insert() due to vm_map_insert() failure, drop the vnode lock around the call to vm_object_deallocate(). Since the deallocated object is the vm object of the vnode, we might get the vnode lock recursion there. In fact, it is almost impossible to make vm_map_insert() fail there on stock kernel. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-03-01 10:22:07 +00:00
Mateusz Guzik	a21018063b	locks: ensure proper barriers are used with atomic ops when necessary Unclear how, but the locking routine for mutexes was using the release barrier instead of acquire. This must have been either a copy-pasto or bad completion. Going through other uses of atomics shows no barriers in: - upgrade routines (addressed in this patch) - sections protected with turnstile locks - this should be fine as necessary barriers are in the worst case provided by turnstile unlock I would like to thank Mark Millard and andreast@ for reporting the problem and testing previous patches before the issue got identified. ps. .-'---`-. ,' `. \| \ \| \ \ _ \ ,\ _ ,'-,/-)\ ( * \ \,' ,' ,'-) `._,) -',-') \/ ''/ ) / / / ,'-' Hardware provided by: IBM LTC	2017-03-01 05:06:21 +00:00
Scott Long	38e41e66e5	Provide a comment on why stdio.h needs to be included.	2017-02-28 21:27:51 +00:00
Jung-uk Kim	c4e929946c	Include stdio.h to fix libsbuf build. Reviewed by: scottl	2017-02-28 21:18:45 +00:00
Scott Long	388f3ce6c3	Implement sbuf_prf(), which takes an sbuf and outputs it to stdout in the non-kernel case and to the console+log in the kernel case. For the kernel case it hooks the putbuf() machinery underneath printf(9) so that the buffer is written completely atomically and without a copy into another temporary buffer. This is useful for fixing compound console/log messages that become broken and interleaved when multiple threads are competing for the console. Reviewed by: ken, imp Sponsored by: Netflix	2017-02-28 18:25:06 +00:00
Gleb Smirnoff	efe3b0de14	Remove SVR4 (System V Release 4) binary compatibility support. UNIX System V Release 4 is an operating system released in 1988. It ceased to exist in the early 2000s.	2017-02-28 05:14:42 +00:00
Konstantin Belousov	aca4bb9112	Do not leak mount references for dying threads. Thread might create a condition for delayed SU cleanup, which creates a reference to the mount point in td_su, but exit without returning through userret(), e.g. when terminating due to single-threading or process exit. In this case, td_su reference is not dropped and mount point cannot be freed. Handle the situation by clearing td_su also in the thread destructor and in exit1(). softdep_ast_cleanup() has to receive the thread as argument, since e.g. thread destructor is executed in different context. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-25 10:38:18 +00:00
Konstantin Belousov	8cd5962571	Remove cpu_deepest_sleep variable. On Core2 and older Intel CPUs, where TSC stops in C2, system does not allow C2 entrance if timecounter hardware is TSC. This is done by tc_windup() which tests for TC_FLAGS_C2STOP flag of the new timecounter and increases cpu_disable_c2_sleep if flag is set. Right now init_TSC_tc() only sets the flag if cpu_deepest_sleep >= 2, but TSC is initialized too early for this variable to be set by acpi_cpu.c. There is no reason to require that ACPI reported C2 and deeper states to set TC_FLAGS_C2STOP, so remove cpu_deepest_sleep test from init_TSC_tc() condition. And since this is the only use of the variable, remove it at all. Reported and submitted by: Jia-Shiun Li <jiashiun@gmail.com> Suggested by: jhb MFC after: 2 weeks	2017-02-24 16:11:55 +00:00
Warner Losh	bbf6e5144e	Cast values to (int) before comparing them to the range of the enum. This ensures they are in range w/o the warnings.	2017-02-24 01:39:12 +00:00
Warner Losh	df1c30f6bd	KDTRACE_HOOKS isn't guaranteed to be defined. Change to check to see if it is defined or not rather than if it is non-zero. Sponsored by: Netflix, Inc	2017-02-24 01:39:08 +00:00
Mateusz Guzik	dfaa7859d6	mtx: microoptimize lockstat handling in spin mutexes and thread lock While here make the code compilable on kernels with LOCK_PROFILING but without KDTRACE_HOOKS.	2017-02-23 22:46:01 +00:00
Eric van Gyzen	b215ceaaec	Add sem_clockwait_np() This function allows the caller to specify the reference clock and choose between absolute and relative mode. In relative mode, the remaining time can be returned. The API is similar to clock_nanosleep(3). Thanks to Ed Schouten for that suggestion. While I'm here, reduce the sleep time in the semaphore "child" test to greatly reduce its runtime. Also add a reasonable timeout. Reviewed by: ed (userland) MFC after: 2 weeks Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9656	2017-02-23 19:36:38 +00:00
Jonathan T. Looney	c9cde8251c	Fix a panic during boot caused by inadequate locking of some vt(4) driver data structures. vt_change_font() calls vtbuf_grow() to change some vt driver data structures. It uses TF_MUTE to prevent the console from trying to use those data structures while it changes them. During the early stage of the boot process, the vt driver's tc_done routine uses those data structures; however, it is currently called outside the TF_MUTE check. Move the tc_done routine inside the locked TF_MUTE check. PR: 217282 Reviewed by: ed, ray Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9709	2017-02-23 01:18:47 +00:00
Warner Losh	6fec662c86	Make the code match the comments: If we have ANY buf's that failed then return EAGAIN. The current code just returns that if the LAST buf failed. Reviewed by: kib@, trasz@ Differential Revision: https://reviews.freebsd.org/D9677	2017-02-21 18:56:06 +00:00
John Baldwin	150599be12	Consolidate statements to initialize files. Previously, the first lines of various generated files from system call tables were generated in two sections. Some of the initialization was done in BEGIN, and the rest was done when the first line was encountered. The main reason for this split before r313564 was that most of the initialization done in the second section depended on the $FreeBSD$ tag extracted from the system call table. Now that the $FreeBSD$ tag is no longer used, consolidate all of the file initialization in the BEGIN section. This change was tested by confirming that the content of generated files did not change.	2017-02-20 20:37:25 +00:00
Mateusz Guzik	13d2ef0f3a	mtx: fix spin mutexes interaction with failed fcmpset While doing so move recursion support down to the fallback routine.	2017-02-20 19:08:36 +00:00
Eric Badger	82a4538f31	Defer ptracestop() signals that cannot be delivered immediately When a thread is stopped in ptracestop(), the ptrace(2) user may request a signal be delivered upon resumption of the thread. Heretofore, those signals were discarded unless ptracestop()'s caller was issignal(). Fix this by modifying ptracestop() to queue up signals requested by the ptrace user that will be delivered when possible. Take special care when the signal is SIGKILL (usually generated from a PT_KILL request); no new stop events should be triggered after a PT_KILL. Add a number of tests for the new functionality. Several tests were authored by jhb. PR: 212607 Reviewed by: kib Approved by: kib (mentor) MFC after: 2 weeks Sponsored by: Dell EMC In collaboration with: jhb Differential Revision: https://reviews.freebsd.org/D9260	2017-02-20 15:53:16 +00:00
Konstantin Belousov	ecc6c515ab	Apply noexec mount option for mmap(PROT_EXEC). Right now the noexec mount option disallows image activators to try execve the files on the mount point. Also, after r127187, noexec also limits max_prot map entries permissions for mappings of files from such mounts, but not the actual mapping permissions. As result, the API behaviour is inconsistent. The files from noexec mount can be mapped with PROT_EXEC, but if mprotect(2) drops execution permission, it cannot be re-enabled later. Make this consistent logically and aligned with behaviour of other systems, by disallowing PROT_EXEC for mmap(2). Note that this change only ensures aligned results from mmap(2) and mprotect(2), it does not prevent actual code execution from files coming from noexec mount. Such files can always be read into anonymous executable memory and executed from there. Reported by: shamaz.mazum@gmail.com PR: 217062 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-02-19 20:51:04 +00:00
Mateusz Guzik	b247fd395d	locks: make trylock routines check for 'unowned' value Since fcmpset can fail without lock contention e.g. on arm, it was possible to get spurious failures when the caller was expecting the primitive to succeed. Reported by: mmel	2017-02-19 16:28:46 +00:00
Hans Petter Selasky	316e092a77	Make sure the thread constructor and destructor eventhandlers are called for all threads belonging to a process. Currently the first thread in a process is kept around as an optimisation step and is never freed. Because the first thread in a process is never freed nor allocated, its destructor and constructor callbacks are never called which means per thread structures allocated by dtrace and the Linux emulation layers for example, might be present for threads which don't need these structures. This patch adds a thread construction and destruction call for the first thread in a process. Tested: dtrace, linux emulation Reviewed by: kib @ MFC after: 1 week Sponsored by: Mellanox Technologies	2017-02-19 13:15:33 +00:00
Jason A. Harmening	e2a8d17887	Bring back r313037, with fixes for mips: Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to replace pcpu_find(curcpu) in MI code. Reviewed by: andreast, kan, lidl Tested by: lidl(mips, sparc64), andreast(powerpc) Differential Revision: https://reviews.freebsd.org/D9587	2017-02-19 02:03:09 +00:00
Mateusz Guzik	5c5df0d99b	locks: clean up trylock primitives In particular this reduces accesses of the lock itself.	2017-02-18 22:06:03 +00:00
Bryan Drewery	8e31b510b0	Fix panic with unlocked vnode to vrecycle(). MFC after: 2 weeks	2017-02-18 05:07:53 +00:00
Mateusz Guzik	a24c8eb847	mtx: plug the 'opts' argument when not used	2017-02-18 01:52:10 +00:00
Mateusz Guzik	cbebea4e67	mtx: get rid of file/line args from slow paths if they are unused This denotes changes which went in by accident in r313877. On most production kernels both said parameters are zeroed and have nothing reading them in either __mtx_lock_sleep or __mtx_unlock_sleep. Thus this change stops passing them from internal consumers for which this is the case. Kernel modules use _flags variants which are not affected kbi-wise.	2017-02-17 15:40:24 +00:00
Mateusz Guzik	09f1319acd	mtx: restrict r313875 to kernels without LOCK_PROFILING	2017-02-17 15:34:40 +00:00
Mateusz Guzik	7640beb920	mtx: microoptimize lockstat handling in __mtx_lock_sleep This saves a function call and multiple branches after the lock is acquired.	2017-02-17 14:55:59 +00:00
Mateusz Guzik	0108a98012	sx: fix compilation on UP kernels after r313855 sx primitives use inlines as opposed to macros. Change the tested condition to LOCK_DEBUG which covers the case, but is slightly overzealous. Reported by: kib	2017-02-17 10:58:12 +00:00
Mateusz Guzik	91fa47076d	Introduce SCHEDULER_STOPPED_TD for use when the thread pointer was already read Sprinkle in few places.	2017-02-17 06:45:04 +00:00
Mateusz Guzik	ffd5c94c4f	locks: let primitives for modules unlock without always going to the slow path It is only needed if the LOCK_PROFILING is enabled. It has to always check if the lock is about to be released which requires an avoidable read if the option is not specified.	2017-02-17 05:39:40 +00:00
Mateusz Guzik	afa39f7a32	locks: remove SCHEDULER_STOPPED checks from primitives for modules They all fallback to the slow path if necessary and the check is there. This means a panicked kernel executing code from modules will be able to succeed doing actual lock/unlock, but this was already the case for core code which has said primitives inlined.	2017-02-17 05:09:51 +00:00
Ryan Stone	27ee18ad33	Revert r313814 and r313816 Something evidently got mangled in my git tree in between testing and review, as an old and broken version of the patch was apparently submitted to svn. Revert this while I work out what went wrong. Reported by: tuexen Pointy hat to: rstone	2017-02-16 21:18:31 +00:00
Eric van Gyzen	8144690af4	Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel inet_ntoa() cannot be used safely in a multithreaded environment because it uses a static local buffer. Instead, use inet_ntoa_r() with a buffer on the caller's stack. Suggested by: glebius, emaste Reviewed by: gnn MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9625	2017-02-16 20:47:41 +00:00
Ryan Stone	3600f4ba35	Fix a typo in my previous commit Somehow in the late stages of testing my sched_ule patch, a character was accidentally deleted from the file. Correct this. While I'm committing anyway, the previous commit message requires some clarification: in the normal case of unlending priority after releasing a mutex, the thread that was doing the lending will be woken up and immediately become the highest-priority thread, and in that case no priority inversion would take place. However, if that thread is pinned to a different CPU, then the currently running thread that just had its priority lowered will not be preempted and then priority inversion can occur. Reported by: O. Hartmann (typo), jhb (scheduler clarification) MFC after: 1 month Pointy hat to: rstone	2017-02-16 20:06:21 +00:00
Ryan Stone	09ae7c4814	Check for preemption after lowering a thread's priority When a high-priority thread is waiting for a mutex held by a low-priority thread, it temporarily lends its priority to the low-priority thread to prevent priority inversion. When the mutex is released, the lent priority is revoked and the low-priority thread goes back to its original priority. When the priority of that thread is lowered (through a call to sched_priority()), the scheduler was not checking whether there is now a high-priority thread in the run queue. This can cause threads with real-time priority to be starved in the run queue while the low-priority thread finishes its quantum. Fix this by explicitly checking whether preemption is necessary when a thread's priority is lowered. Sponsored by: Dell EMC Isilon Obtained from: Sandvine Inc Differential Revision: https://reviews.freebsd.org/D9518 Reviewed by: Jeff Roberson (ule) MFC after: 1 month	2017-02-16 19:41:13 +00:00
Mark Johnston	c6a4ba5a38	Apply MADV_FREE to exec_map entries only after a lowmem event. This effectively provides the same benefit as applying MADV_FREE inline upon every execve, since the page daemon invokes lowmem handlers prior to scanning the inactive queue. It also has less overhead; the cost of applying MADV_FREE is very noticeable on many-CPU systems since it includes that of a TLB shootdown of global PTEs. For instance, this change nearly halves the system CPU usage during a buildkernel on a 128-vCPU EC2 instance (with some other patches applied). Benchmarked by: cperciva (earlier version) Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9586	2017-02-15 01:50:58 +00:00
Eric Badger	28d2efa983	sleepq_catch_signals: do thread suspension before signal check Since locks are dropped when a thread suspends, it's possible for another thread to deliver a signal to the suspended thread. If the thread awakens from suspension without checking for signals, it may go to sleep despite having a pending signal that should wake it up. Therefore the suspension check is done first, so any signals sent while suspended will be caught in the subsequent signal check. Reviewed by: kib Approved by: kib (mentor) MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9530	2017-02-14 17:13:23 +00:00
Andriy Gapon	937c1b0757	try to fix RACCT_RSS accounting There could be a race between the vm daemon setting RACCT_RSS based on the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to zero. In that case we can get a zombie process with non-zero RACCT_RSS. If the process is jailed, that may break accounting for the jail. There could be other consequences. Fix this race in the vm daemon by updating RACCT_RSS only when a process is in the normal state. Also, make accounting a little bit more accurate by refreshing the page resident count after calling vm_pageout_map_deactivate_pages(). Finally, add an assert that the RSS is zero when a process is reaped. PR: 210315 Reviewed by: trasz Differential Revision: https://reviews.freebsd.org/D9464	2017-02-14 13:54:05 +00:00
Konstantin Belousov	496ab0532d	Rework r313352. Rename kern_vm_* functions to kern_*. Move the prototypes to syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as needed, to avoid headers pollution. Requested by: alc, jhb Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D9535	2017-02-13 09:04:38 +00:00
Konstantin Belousov	987ff18184	Consistently handle negative or wrapping offsets in the mmap(2) syscalls. For regular files and posix shared memory, POSIX requires that [offset, offset + size) range is legitimate. At the mapping time, check that offset is not negative. Allowing negative offsets might expose the data that filesystem put into vm_object for internal use, esp. due to OFF_TO_IDX() signedness treatment. Fault handler verifies that the mapped range is valid, assuming that mmap(2) checked that arithmetic gives no undefined results. For device mappings, leave the semantic of negative offsets to the driver. Correct object page index calculation to not erroneously propagate sign. In either case, disallow overflow of offset + size. Update mmap(2) man page to explain the requirement of the range validity, and behaviour when the range becomes invalid after mapping. Reported and tested by: royger (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-12 21:05:44 +00:00
Konstantin Belousov	f2277b64ec	Switch copyout_map() to use vm_mmap_object() instead of vm_mmap(). This is both a microoptimization and a move of the consumer to more commonly used vm function. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-12 20:54:31 +00:00
Mateusz Guzik	c4a48867f1	lockmgr: implement fast path The main lockmgr routine takes 8 arguments which makes it impossible to tail-call it by the intermediate vop_stdlock/unlock routines. The routine itself starts with an if-forest and reads from the lock itself several times. This slows things down both single- and multi-threaded. With the patch single-threaded fstats go 4% up and multithreaded up to ~27%. Note that there is still a lot of room for improvement. Reviewed by: kib Tested by: pho	2017-02-12 09:49:44 +00:00
John Baldwin	bb9b710477	Regenerate all the system call tables to drop "created from" lines. One of the ibcs2 files contains some actual changes (new headers) as it hasn't been regenerated after older changes to makesyscalls.sh.	2017-02-10 19:45:02 +00:00
John Baldwin	807a7231f2	Drop the "created from" line from files generated by makesyscalls.sh. This information is less useful when the generated files are included in source control along with the source. If needed it can be reconstructed from the $FreeBSD$ tag in the generated file. Removing this information from the generated output permits committing the generated files along with the change to the system call master list without having inconsistent metadata in the generated files. Reviewed by: emaste, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D9497	2017-02-10 19:25:52 +00:00
Konstantin Belousov	e83a71c656	Fix r313495. The file type DTYPE_VNODE can be assigned as a fallback if VOP_OPEN() did not initialize file type. This is a typical code path used by normal file systems. Also, change error returned for inappropriate file type used for O_EXLOCK to EOPNOTSUPP, as declared in the open(2) man page. Reported by: cy, dhw, Iblis Lin <iblis@hs.ntnu.edu.tw> Tested by: dhw Sponsored by: The FreeBSD Foundation MFC after: 13 days	2017-02-10 14:49:04 +00:00
Konstantin Belousov	e628e1b919	Increase a chance of devfs_close() calling d_close cdevsw method. If a file opened over a vnode has an advisory lock set at close, vn_closefile() acquires additional vnode use reference to prevent freeing the vnode in vn_close(). Side effect is that for device vnodes, devfs_close() sees that vnode reference count is greater than one and refuses to call d_close(). Create internal version of vn_close() which can avoid dropping the vnode reference if needed, and use this to execute VOP_CLOSE() without acquiring a new reference. Note that any parallel reference to the vnode would still prevent d_close call, if the reference is not from an opened file, e.g. due to stat(2). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-09 23:36:50 +00:00
Konstantin Belousov	7903b00087	Do not establish advisory locks when doing open(O_EXLOCK) or open(O_SHLOCK) for files which do not have DTYPE_VNODE type. Both flock(2) and fcntl(2) syscalls refuse to acquire advisory lock on a file which type is not DTYPE_VNODE. Do the same when lock is requested from open(2). Restructure the block in vn_open_vnode() which handles O_EXLOCK and O_SHLOCK open flags to make it easier to quit its execution earlier with an error. Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-09 23:35:57 +00:00
Mateusz Guzik	8eaaf58a5f	rwlock: fix r313454 The runlock slow path would update wrong variable before restarting the loop, in effect corrupting the state. Reported by: pho	2017-02-09 13:32:19 +00:00
Mateusz Guzik	3b3cf014fc	locks: tidy up unlock fallback paths Update comments to note these functions are reachable if lockstat is enabled. Check if the lock has any bits set before attempting unlock, which saves an unnecessary atomic operation.	2017-02-09 08:19:30 +00:00
Mateusz Guzik	834f70f32f	sx: implement slock/sunlock fast path See r313454.	2017-02-08 19:29:34 +00:00
Mateusz Guzik	b0a61642d4	rwlock: implement rlock/runlock fast path This improves singlethreaded throughput on my test machine from ~247 mln ops/s to ~328 mln. It is mostly about avoiding the setup cost of lockstat. Reviewed by: jhb (previous version)	2017-02-08 19:28:46 +00:00
John Baldwin	885f13dc96	Copy the e_machine and e_flags fields from the binary into an ELF core dump. In the kernel, cache the machine and flags fields from ELF header to use in the ELF header of a core dump. For gcore, copy these fields over from the ELF header in the binary. This matters for platforms which encode ABI information in the flags field (such as o32 vs n32 on MIPS). Reviewed by: kib Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D9392	2017-02-07 20:34:03 +00:00
Emmanuel Vadot	beaa6e1e64	subr_sfbus.c needs sys/proc.h for struct thread definition. This fixes kernel build for armv6. Discussed with: kib	2017-02-07 17:31:24 +00:00
Mateusz Guzik	dbccc8105c	rwlock: implement RW_LOCK_WRITER_RECURSED bit This moves recursion handling out of the inlined wunlock path and in particular saves a read and a branch. Discussed with:	2017-02-07 17:04:31 +00:00
Mateusz Guzik	f743ea9638	Bump struct thread alignment to 32. This gives additional bits to use in locking primitives which store the lock thread pointer in the lock value. Discussed with: kib	2017-02-07 17:03:22 +00:00
Mateusz Guzik	3c798b2b1f	locks: follow up r313386 Unfinished diff was committed by accident. The loop in lock_delay was changed to decrement, but the loop iterator was still incrementing.	2017-02-07 16:01:07 +00:00
Mateusz Guzik	8e5a3e9a9d	locks: change backoff to exponential Previous implementation would use a random factor to spread readers and reduce chances of starvation. This visibly reduces effectiveness of the mechanism. Switch to the more traditional exponential variant. Try to limit starvation by imposing an upper limit of spins after which spinning is half of what other threads get. Note the mechanism is turned off by default. Reviewed by: kib (previous version)	2017-02-07 14:49:36 +00:00
Edward Tomasz Napierala	1110d0029a	Make root_mount_hold() work after boot. This is important for two reasons. First is rerooting into USB-mounted device that happens to be not yet enumerated. The second is when mounting a (non-root) filesystem on USB device on a hub that's enumerated later than the root mount: the rc scripts explicitly wait for the root mount holds to be released, but each USB bus takes the hold asynchronously, and if that happens after root mount, it would just get ignored. Reviewed by: marcel MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9388	2017-02-06 20:44:34 +00:00
Edward Tomasz Napierala	4f9d7bad48	In r290196 the root mount hold mechanism was changed to make it not wait for mount hold release if the root device already exists. So, unless your rootdev is on USB - i.e. in the usual case - the root mount won't wait for USB. However, the old behaviour was sometimes used as "wait until USB is fully enumerated", and r290196 broke that. This commit adds vfs.root_mount_always_wait tunable, to force the kernel to always wait for root mount holds, even if the root is already there. Reviewed by: kib MFC after: 2 weeks Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9387	2017-02-06 20:36:59 +00:00
Andrew Turner	c0d5237034	Only allow the pic type to be either a PIC or MSI type. All interrupt controller drivers handle either MSI/MSI-X interrupts, or regular interrupts, as such enforce this in the interrupt handling framework. If a later driver was to handle both it would need to create one of each. This will allow future changes to allow the xref space to overlap, but refer to different drivers. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation X-Differential Revision: https://reviews.freebsd.org/D8616	2017-02-06 13:08:48 +00:00
Mateusz Guzik	c1aaf63cb5	locks: fix recursion support after recent changes When a relevant lockstat probe is enabled the fallback primitive is called with a constant signifying a free lock. This works fine for typical cases but breaks with recursion, since it checks if the passed value is that of the executing thread. Read the value if necessary.	2017-02-06 09:40:14 +00:00
Mateusz Guzik	993ddec44d	rwlock: move lockstat handling out of inline primitives See r313275 for details. One difference here is that recursion handling was removed from the fallback routine. As it is it was never supposed to see a recursed lock in the first place. Future changes will move it out of inline variants, but right now there is no easy way to test if the lock is recursed without reading additional words.	2017-02-05 13:37:23 +00:00
Edward Tomasz Napierala	96ee43103d	Add kern_cpuset_getaffinity() and kern_cpuset_setaffinity(), and use them in compats instead of their sys_*() counterparts. Reviewed by: kib, jhb, dchagin MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9383	2017-02-05 13:24:54 +00:00
Mateusz Guzik	6ebb77b6a6	sx: move lockstat handling out of inline primitives See r313275 for details.	2017-02-05 09:54:16 +00:00
Mateusz Guzik	dc0896512c	mtx: fixup r313278, the assignment was supposed to go inside the loop	2017-02-05 09:53:13 +00:00
Mateusz Guzik	cae4ab7f37	mtx: fix up _mtx_obtain_lock_fetch usage in thread lock Since _mtx_obtain_lock_fetch no longer sets the argument to MTX_UNOWNED, callers have to do it on their own.	2017-02-05 09:35:17 +00:00
Mateusz Guzik	08da267775	mtx: move lockstat handling out of inline primitives Lockstat requires checking if it is enabled and if so, calling a 6 argument function. Further, determining whether to call it on unlock requires pre-reading the lock value. This is problematic in at least 3 ways: - more branches in the hot path than necessary - additional cacheline ping pong under contention - bigger code Instead, check first if lockstat handling is necessary and if so, just fall back to regular locking routines. For this purpose a new macro is introduced (LOCKSTAT_PROFILE_ENABLED). LOCK_PROFILING uninlines all primitives. Fold in the current inline lock variant into the _mtx_lock_flags to retain the support. With this change the inline variants are not used when LOCK_PROFILING is defined and thus can ignore its existence. This results in: text data bss dec hex filename 22259667 1303208 4994976 28557851 1b3c21b kernel.orig 21797315 1303208 4994976 28095499 1acb40b kernel.patched i.e. about 2% reduction in text size. A remaining action is to remove spurious arguments for internal kernel consumers.	2017-02-05 08:04:11 +00:00
Mateusz Guzik	3ae56ce958	sx: add witness support missed in r313272	2017-02-05 06:51:45 +00:00
Mateusz Guzik	9d2e4290ff	sx: uninline slock/sunlock Shared locking routines explicitly read the value and test it. If the change attempt fails, they fall back to a regular function which would retry in a loop. The problem is that with many concurrent readers the risk of failure is pretty high and even the value returned by fcmpset is very likely going to be stale by the time the loop in the fallback routine is reached. Uninline said primitives. It gives a throughput increase when doing concurrent slocks/sunlocks with 80 hardware threads from ~50 mln/s to ~56 mln/s. Interestingly, rwlock primitives are already not inlined.	2017-02-05 05:20:29 +00:00
Mateusz Guzik	fa47404353	sx: switch to fcmpset Discussed with: jhb Tested by: pho (previous version)	2017-02-05 04:54:20 +00:00
Mateusz Guzik	c84f347985	rwlock: switch to fcmpset Discussed with: jhb Tested by: pho	2017-02-05 04:53:13 +00:00
Mateusz Guzik	90836c3270	mtx: switch to fcmpset The found value is passed to locking routines in order to reduce cacheline accesses. mtx_unlock grows an explicit check for regular unlock. On ll/sc architectures the routine can fail even if the lock could have been handled by the inline primitive. Discussed with: jhb Tested by: pho (previous version)	2017-02-05 03:26:34 +00:00
Mateusz Guzik	2d78a5531e	vfs: use atomic_fcmpset in vfs_refcount_*	2017-02-05 03:23:16 +00:00
Mark Johnston	69d2418faa	Make witness_warn() always print to the console. witness_warn() either breaks into the debugger or panics the system, so its output should go to the console regardless of the witness(4) output channel configuration. MFC after: 1 week Sponsored by: Dell EMC Isilon	2017-02-05 02:27:04 +00:00
Mateusz Guzik	3a2f282532	fd: switch fget_unlocked to atomic_fcmpset	2017-02-05 01:40:27 +00:00
Jason A. Harmening	ad62ba6e96	Revert r313037 The switch to get_pcpu() in MI code seems to cause hangs on MIPS. Back out until we can get a better idea of what's happening there. Reported by: kan, lidl	2017-02-04 06:24:49 +00:00
Hartmut Brandt	4b481ba0ed	Merge filt_soread and filt_solisten and decide what to do when checking for EVFILT_READ at the point of the check, not when the event is registered. This fixes a problem with asio when accepting a connection. Reviewed by: kib@, Scott Mitchell	2017-02-01 13:12:07 +00:00
Jason A. Harmening	65ed483615	Implement get_pcpu() for the remaining architectures and use it to replace pcpu_find(curcpu) in MI code.	2017-02-01 03:32:49 +00:00
Edward Tomasz Napierala	b38b22b0b2	Add kern_pread() and kern_pwrite(), and use it in compats instead of their sys_*() counterparts. The svr4 is left unchanged. Reviewed by: kib@ MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9379	2017-01-31 15:35:18 +00:00
Edward Tomasz Napierala	fc8bde8ffe	Replace calls to sys_truncate() with kern_truncate(). Reviewed by: kib@ MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9371	2017-01-31 15:19:44 +00:00
Edward Tomasz Napierala	ea2ebdc19e	Add kern_cpuset_getid() and kern_cpuset_setid(), and use them in compat32 instead of their sys_*() counterparts. Reviewed by: jhb@, kib@ MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9382	2017-01-31 15:11:23 +00:00
Andriy Gapon	826b3d3187	put very expensive sanity checks of advisory locks under DIAGNOSTIC The checks have quadratic complexity over the number of advisory locks active for a file, and that could be a lot. What's worse is that the checks are done while holding ls_lock. That could lead to a very long backlog and performance degradation even if all requested locks are compatible (e.g. all shared locks). The checks used to be under INVARIANTS. Discussed with: kib MFC after: 2 weeks Sponsored by: Panzura	2017-01-30 15:20:13 +00:00
Edward Tomasz Napierala	d293f35c09	Add kern_listen(), kern_shutdown(), and kern_socket(), and use them instead of their sys_*() counterparts in various compats. The svr4 is left untouched, because there's no point. Reviewed by: ed@, kib@ MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9367	2017-01-30 12:57:22 +00:00
Edward Tomasz Napierala	f67d6b5f12	Add kern_lseek() and use it instead of sys_lseek() in various compats. I didn't touch svr4/, there's no point. Reviewed by: ed@, kib@ MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9366	2017-01-30 12:24:47 +00:00
Edward Tomasz Napierala	ae6b6ef6cb	Replace sys_ftruncate() with kern_ftruncate() in various compats. Reviewed by: kib@ MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9368	2017-01-30 11:50:54 +00:00
Mateusz Guzik	dfecf51dd0	cache: use vrefact for '.' lookups and refing the rdir in fullpath	2017-01-30 03:20:05 +00:00
Mateusz Guzik	3071469d57	fd: sprinkle __read_mostly and __exclusive_cache_line	2017-01-30 03:07:32 +00:00
Baptiste Daroussin	b4b4b5304b	Revert crap accidentally committed	2017-01-28 16:31:23 +00:00
Baptiste Daroussin	814aaaa7da	Revert r312923 a better approach will be taken later	2017-01-28 16:30:14 +00:00
Mateusz Guzik	95839d3d25	hwpmc: annotate pmc_hook and pmc_intr as __read_mostly MFC after: 1 month	2017-01-27 22:14:42 +00:00
Mateusz Guzik	f1f7f1cb29	hwpmc: partially depessimize mmap handling if the module is not loaded In particular this means the pmc sx lock is no longer taken when an executable mapping succeeds. MFC after: 1 week	2017-01-27 22:13:15 +00:00
Mateusz Guzik	290511163d	Sprinkle __read_mostly on backoff and lock profiling code. MFC after: 1 month	2017-01-27 15:03:51 +00:00
Mateusz Guzik	17071ff298	cache: annotate with __read_mostly and __exclusive_cache_line MFC after: 1 month	2017-01-27 14:56:36 +00:00
Sean Bruno	de414cfe14	A few more style bugs lying around in here. Submitted by: bde	2017-01-26 13:48:45 +00:00
Gleb Smirnoff	beb4b31200	For non-listening AF_UNIX sockets return error code EOPNOTSUPP to match documentation and SUS.	2017-01-25 22:26:45 +00:00
Ed Maste	f27ac8e297	ANSIfy kern_ntptime.c	2017-01-25 20:22:32 +00:00
Sean Bruno	06bb7c507a	Replace overlooked smp_started checks and variable use in a print with the now used tqg_smp_started. Submitted by: bde	2017-01-25 15:54:44 +00:00
Ed Maste	77ebe276ba	imgact_elf: refactor et_dyn_addr calculation This simplifies the logic somewhat. It is extracted from the change in review in D5603. Differential Revision: https://reviews.freebsd.org/D9321	2017-01-24 22:46:43 +00:00
Mateusz Guzik	543b2f425d	proc: perform a lockless check in sys_issetugid Discussed with: kib MFC after: 1 week	2017-01-24 21:48:57 +00:00
Conrad Meyer	90a79ac576	Use time_t for intermediate values to avoid overflow in clock_ts_to_ct Add additional safety and overflow checks to clock_ts_to_ct and the BCD routines while we're here. Perform a safety check in sys_clock_settime() first to avoid easy local root panic, without having to propagate an error value back through dozens of APIs currently lacking error returns. PR: 211960, 214300 Submitted by: Justin McOmie <justin.mcomie at gmail.com>, kib@ Reported by: Tim Newsham <tim.newsham at nccgroup.trust> Reviewed by: kib@ Sponsored by: Dell EMC Isilon, FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D9279	2017-01-24 18:05:29 +00:00
Sean Bruno	bd84f70044	iflib: Add internal tracking of smp startup status to reliably figure out what methods are to be used to get gtaskqueue up and running. e1000: Calculating this pointer gives undefined behaviour when (last == -1) (it is before the buffer). The pointer is always followed. Panics occurred when it points to an unmapped page. Otherwise, the pointed-to garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the broken case the loop was usually null and the function just returned, and this was accidentally correct. Submitted by: bde Reported by: Matt Macy <mmacy@nextbsd.org>	2017-01-24 16:05:42 +00:00
Sean Bruno	36fa5d5b64	Revert 312696 due to build tests.	2017-01-24 15:55:52 +00:00
Sean Bruno	562a3182f6	iflib: Add internal tracking of smp startup status to reliably figure out what methods are to be used to get gtaskqueue up and running. e1000: Calculating this pointer gives undefined behaviour when (last == -1) (it is before the buffer). The pointer is always followed. Panics occurred when it points to an unmapped page. Otherwise, the pointed-to garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the broken case the loop was usually null and the function just returned, and this was accidentally correct. Submitted by: bde Reviewed by: Matt Macy <mmacy@nextbsd.org>	2017-01-24 14:48:32 +00:00
Konstantin Belousov	3467f88cd6	Add comments explaining unobvious td_critnest adjustments in critical_exit(). Based on the discussion with: jhb Reviewed by: imp Sponsored by: The FreeBSD Foundation Differential revision: D9276 MFC after: 1 week	2017-01-22 19:41:42 +00:00
Konstantin Belousov	25c6816845	More style cleanup. Use ANSI C definition for vn_closefile(). Switch to VNASSERT in _vn_lock(), simplify messages. Sponsored by: The FreeBSD Foundation X-MFC with: r312600, r312601, r312602, r312606	2017-01-22 19:38:45 +00:00
Konstantin Belousov	aec8391d46	Provide fallback VOP methods for crossmp vnode. In particular, crossmp vnode might leak into rename code. PR: 216380 Reported by: fnacl@protonmail.com Sponsored by: The FreeBSD Foundation X-MFC with: r309425	2017-01-22 19:36:02 +00:00
Edward Tomasz Napierala	5c93966020	Remove redundant KASSERT.	2017-01-22 15:35:51 +00:00
Edward Tomasz Napierala	8acac5a9f5	Improve debugging printf.	2017-01-22 15:27:14 +00:00
Mateusz Guzik	eaf0969bda	vfs: fix LK_RETRY logic braino in r312600	2017-01-21 20:34:20 +00:00
Mateusz Guzik	829857c893	vfs: __predict_false the need to handle F_HASLOCK Also reorder the check with DTYPE_VNODE. Passed files are vnodes the vast majority of the time, so it is typically true.	2017-01-21 19:01:42 +00:00
Mateusz Guzik	abbc538d9a	vfs: fix whitespace damage in r312600 While here wrap the previously overly long line so that it fits 80 chars.	2017-01-21 18:56:58 +00:00
Mateusz Guzik	1091fb52c1	vfs: refactor _vn_lock Stop testing for LK_RETRY and error multiple times. Also postpone the VI_DOOMED until after LK_RETRY was seen as it reads from the vnode. No functional changes.	2017-01-21 18:38:16 +00:00
Mateusz Guzik	067115e050	vfs: hide the getvnode NULL mp message behind DIAGNOSTIC Since crossmp vnode changes the message was being printed on each boot. Reported by: trasz Discussed with: kib	2017-01-21 16:59:50 +00:00
Hans Petter Selasky	10c8755706	Fix for race leading to endless timer interrupts related to configtimer(). During normal operation "state->nextcallopt" will always be less than or equal to "state->nextcall" and checking only "state->nextcallopt" before calling "callout_process()" is sufficient. However when "configtimer()" is called a race might happen requiring both of these binary times to be checked. Short description of race: 1) A configtimer() call will reset both "state->nextcall" and "state->nextcallopt" to the same binary time. 2) If a "callout_reset()" call happens between "configtimer()" and the next "callout_process()" call, "state->nextcallopt" will get updated and "state->nextcall" will remain at the current time. Refer to logic inside cpu_new_callout(). 3) getnextcpuevent() only respects "state->nextcall" and returns this value over and over again, even if it is in the past, until "now >= state->nextcallopt" becomes true. Then these two time variables are corrected by a "callout_process()" call and the situation goes back to normal. The problem manifests itself in different ways. The common factor is the timer process(es) consume all CPU on one or more CPU cores for a long time, blocking other kernel processes from getting execution time. This can be seen by very high interrupt counts as displayed by "vmstat -i | grep timer" right after boot. When EARLY_AP_STARTUP was enabled in r310177 the likelihood of hitting this bug apparently increased. Example output from "vmstat -i" before patch: cpu0:timer 7591 69 cpu9:timer 39031773 358089 cpu4:timer 9359 85 cpu3:timer 9100 83 cpu2:timer 9620 88 Example output from "vmstat -i" after patch: cpu0:timer 4242 34 cpu6:timer 5531 44 cpu3:timer 6450 52 cpu1:timer 4545 36 cpu9:timer 7153 58 Before the patch cpu9 in the example above, was spinning in a loop in order to reach 39 million interrupts just a few seconds after bootup. After the patch the timer interrupt counts are more or less consistent. Discussed with: mav @ Reported by: several people MFC after: 1 week Sponsored by: Mellanox Technologies	2017-01-20 17:40:31 +00:00
Ed Maste	039644eca9	ANSIfy kern_ktrace.c and remove archaic register keyword Sponsored by: The FreeBSD Foundation	2017-01-20 14:59:56 +00:00
Andriy Gapon	c468ff880a	don't abort writing of a core dump after EFAULT It's possible to get EFAULT when writing a segment backed by a file if the segment extends beyond the file. The core dump could still be useful if we skip the rest of the segment and proceed to other segments. The skipped segment (or a portion of it) will be zero-filled. While there, use 'const' to signify that core_write() only reads the buffer and use __DECONST before calling vn_rdwr_inchunks() because it can be used for both reading and writing. Before the change: kernel: Failed to write core file for process mmap_trunc_core (error 14) kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6 After the change: kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped) Reviewed by: julian, kib Obtained from: Panzura (older version of the change) MFC after: 5 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9233	2017-01-20 13:39:07 +00:00
Andriy Gapon	ad9dadc437	fix a thread preemption regression in schedulers introduced in r270423 Commit r270423 fixed a regression in sched_yield() that was introduced in earlier changes. Unfortunately, at the same time it introduced a new regression. The problem is that SWT_RELINQUISH (6), like all other SWT_* constants and unlike SW_* flags, is not a bit flag. So, (flags & SWT_RELINQUISH) is true in cases where that was not really intended, for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11). A straightforward fix would be to use (flags & SW_TYPE_MASK) == SWT_RELINQUISH, but my impression is that the switch types are designed mostly for gathering statistics, not for influencing scheduling decisions. So, I decided that it would be better to check for SW_PREEMPT flag instead. That's also the same flag that was checked before r239157. I double-checked how that flag is used and I am confident that the flag is set only in the places where we really have the preemption: - critical_exit + td_owepreempt - sched_preempt in the ULE scheduler - sched_preempt in the 4BSD scheduler Reviewed by: kib, mav MFC after: 4 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9230	2017-01-19 18:46:41 +00:00
Mateusz Guzik	c5f61e6f96	sx: reduce lock accesses similarly to r311172 Discussed with: jhb Tested by: pho (previous version)	2017-01-18 17:55:08 +00:00
Mateusz Guzik	3f0a0612e8	rwlock: reduce lock accesses similarly to r311172 Discussed with: jhb Tested by: pho (previous version)	2017-01-18 17:53:57 +00:00
Hans Petter Selasky	f3e7afe2d7	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API supports rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon the next ip_output() call, allocation of a new "snd_tag" will be attempted. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months	2017-01-18 13:31:17 +00:00
Ed Maste	bf9ebe74e2	disambiguate msleep KASSERT diagnostics Previously "panic: msleep" could happen for a few different reasons. Break the KASSERTs out into individual cases to identify the failing condition. Found during the investigation that resulted in r308288. Reviewed by: kib, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D8604	2017-01-16 20:34:42 +00:00
Sean Bruno	374f3e042c	Remove Assert that seems to be hit in various configurations during normal operations.	2017-01-16 19:01:41 +00:00
Maxim Sobolev	339efd75a4	Add a new socket option SO_TS_CLOCK to pick from several different clock sources to return timestamps when SO_TIMESTAMP is enabled. Two additional clock sources are: o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME); o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC). In addition to this, this option provides a unified interface to get bintime (equivalent of using SO_BINTIME), except it is also supported with IPv6 where SO_BINTIME has never been supported. The long term plan is to deprecate SO_BINTIME and move everything to using SO_TS_CLOCK. Idea for this enhancement has been briefly discussed on the Net session during dev summit in Ottawa last June and the general input was positive. This change is believed to benefit network benchmarks/profiling as well as other scenarios where precise time of arrival measurement is necessary. There are two regression test cases as part of this commit: one extends unix domain test code (unix_cmsg) to test new SCM_XXX types and another one implements a totally new test case which exchanges UDP packets between two processes using both conventional methods (i.e. calling clock_gettime(2) before recv(2) and after send(2)), as well as using setsockopt()+recv() in receive path. The resulting delays are checked for sanity for all supported clock types. Reviewed by: adrian, gnn Differential Revision: https://reviews.freebsd.org/D9171	2017-01-16 17:46:38 +00:00
Sean Bruno	227743cad4	Change startup order for the no EARLY_AP_STARTUP case to initialize gtaskqueue bits at SI_SUB_INIT_IF instead of waiting until SI_SUB_SMP which is far too late. Add an assertion in taskqgroup_attach() to catch startup initialization failures in the future. Reported by: kib bde	2017-01-16 16:58:12 +00:00
Hiren Panchasara	7d03ff1fe9	Add kevent EVFILT_EMPTY for notification when a client has received all data i.e. everything outstanding has been acked. Reviewed by: bz, gnn (previous version) MFC after: 3 days Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D9150	2017-01-16 08:25:33 +00:00
Conrad Meyer	db4fcadf52	"Buses" is the preferred plural of "bus" Replace archaic "busses" with modern form "buses." Intentionally excluded: * Old/random drivers I didn't recognize * Old hardware in general * Use of "busses" in code as identifiers No functional change. http://grammarist.com/spelling/buses-busses/ PR: 216099 Reported by: bltsrc at mail.ru Sponsored by: Dell EMC Isilon	2017-01-15 17:54:01 +00:00
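
Note on ad9dadc437 (scheduler preemption regression): the commit message explains that SWT_RELINQUISH (6) is an enumerated switch type rather than a bit flag, so testing it with a bitwise AND also matches other switch types. The following is a minimal userland sketch of that pitfall, not the kernel code itself. The SWT_* values are the ones quoted in the commit message; the SW_TYPE_MASK and SW_PREEMPT values and all three helper function names are assumptions made for illustration.

```c
#include <stdbool.h>

/* Switch types: small enumerated values, NOT bit flags (values from the commit message). */
#define SWT_OWEPREEMPT    2
#define SWT_RELINQUISH    6
#define SWT_REMOTEPREEMPT 11

/* Assumed values for illustration: the low byte holds the type, higher bits are flags. */
#define SW_TYPE_MASK 0x00ff
#define SW_PREEMPT   0x0400

/* Hypothetical helper: the buggy test, which treats an enumerated value as a bit flag. */
static bool
buggy_is_relinquish(int flags)
{
	return ((flags & SWT_RELINQUISH) != 0);
}

/* Hypothetical helper: the "straightforward fix" named in the commit message. */
static bool
masked_is_relinquish(int flags)
{
	return ((flags & SW_TYPE_MASK) == SWT_RELINQUISH);
}

/* Hypothetical helper: the approach the commit chose, testing the real bit flag. */
static bool
is_preempt(int flags)
{
	return ((flags & SW_PREEMPT) != 0);
}
```

With these definitions, buggy_is_relinquish() returns true for SWT_OWEPREEMPT and SWT_REMOTEPREEMPT because 2 & 6 and 11 & 6 are both nonzero, while masked_is_relinquish() returns true only for SWT_RELINQUISH. is_preempt() depends only on the SW_PREEMPT bit and ignores the switch type entirely.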


15875 Commits