freebsd-nq

Author	SHA1	Message	Date
Sean Bruno	009ad5724d	Revert r324405 at the request of the submitter pending better solution. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks	2017-10-10 00:32:21 +00:00
Gleb Smirnoff	9c82bec42d	Improvements to sendfile(2) mbuf free routine. o Fall back to default m_ext free mech, using function pointer in m_ext_free, and remove sf_ext_free() called directly from mbuf code. Testing on modern CPUs showed no regression. o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses SF_SYNC flag. Lack of the flag allows us not to dereference ext_arg2, saving from a cache line miss. o Create function sendfile_free_page() that later will be used, for multi-page mbufs. For now compiler will inline it into sendfile_free_mext(). In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 21:06:16 +00:00
Gleb Smirnoff	07e87a1d55	In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively, in mb_free_ext() always use fields from the original refcount holding mbuf (see. r296242) mbuf. Cuts another cache miss from mb_free_ext(). However, treat EXT_EXTREF mbufs differently, since they are different - they don't have a refcount holding mbuf. Provide longer comments in m_ext declaration to explain this change and change from r296242. In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 20:51:58 +00:00
Gleb Smirnoff	e8fd18f306	Shorten list of arguments to mbuf external storage freeing function. All of these arguments are stored in m_ext, so there is no reason to pass them in the argument list. Not all functions need the second argument, some don't even need the first one. The second argument lives in next cache line, so not dereferencing it is a performance gain. This was discovered in sendfile(2), which will be covered by next commits. The second goal of this commit is to bring even more flexibility to m_ext mbufs, allowing to create more fields in m_ext, opaque to the generic mbuf code, and potentially set and dereferenced by subsystems. Reviewed by: gallatin, kbowling Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 20:35:31 +00:00
Hans Petter Selasky	32b413d7f0	When showing the sleepqueues from the in-kernel debugger, properly dump all the sendqueues and not just the first one History: It appears that in the commit which introduced the code, r165272, the array indexes of "sq_blocked[0]" and "td_name[i]" were interchanged. In r180927 "td_name[i]" was corrected to "td_name[0]", but "sq_blocked[0]" was left unchanged. PR: 222624 Discussed with: kmacy @ MFC after: 1 week Sponsored by: Mellanox Technologies	2017-10-09 18:33:29 +00:00
Alan Cox	03ca213761	The recent change to initialization of blists (r324420) relied on '-1' appearing only where the code explicitly set it, but since much of the data was not initialized, '-1' appeared other places too, and led to panics. Clear the allocated data before initializing nonzero values by allocating with M_ZERO. Submitted by: Doug Moore <dougm@rice.edu> Reported by: Oleg V. Nauman <oleg@theweb.org.ua>, cy Tested by: Oleg V. Nauman <oleg@theweb.org.ua> MFC after: 1 week X-MFC with: r324420 Differential Revision: https://reviews.freebsd.org/D12627	2017-10-09 18:19:06 +00:00
Alan Cox	8eefcd407b	The blst_radix_init function has two purposes - to compute the number of nodes to allocate for the blist, and to initialize them. The computation can be done much more quickly by identifying the terminating node, if any, at every level of the tree and then summing the number of nodes at each level that precedes the topmost terminator. The initialization can also be done quickly, since settings at the root mark the tree as all-allocated, and only a few terminator nodes need to be marked in the rest of the tree. Eliminate blst_radix_init, and perform its two functions more simply in blist_create. The allocation of the blist takes places in two pieces, but there's no good reason to do so, when a single allocation is sufficient, and simpler. Allocate the blist struct, and the array of nodes associated with it, with a single allocation. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: markj (an earlier version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11968	2017-10-08 22:17:39 +00:00
Ian Lepore	7f92689427	Add eventhandler notifications for newbus device attach/detach. The detach case is slightly complicated by the fact that some in-kernel consumers may want to know before a device detaches (so they can release related resources, stop using the device, etc), but the detach can fail. So there are pre- and post-detach notifications for those consumers who need to handle all cases. A couple salient comments from the review, they amount to some helpful documentation about these events, but there's currently no good place for such documentation... Note that in the current newbus locking model, DETACH_BEGIN and DETACH_COMPLETE/FAILED sequence of event handler invocation might interweave with other attach/detach events arbitrarily. The handlers should be prepared for such situations. Also should note that detach may be called after the parent bus knows the hardware has left the building. In-kernel consumers have to be prepared to cope with this race. Differential Revision: https://reviews.freebsd.org/D12557	2017-10-08 17:33:49 +00:00
Ian Lepore	fc09164658	Restore the ability to deregister an eventhandler from within the callback. When the EVENTHANDLER(9) subsystem was created, it was a documented feature that an eventhandler callback function could safely deregister itself. In r200652 that feature was inadvertantly broken by adding drain-wait logic to eventhandler_deregister(), so that it would be safe to unload a module upon return from deregistering its event handlers. There are now 145 callers of EVENTHANDLER_DEREGISTER(), and it's likely many of them are depending on the drain-wait logic that has been in place for 8 years. So instead of creating a separate eventhandler_drain() and adding it to some or all of those 145 call sites, this creates a separate eventhandler_drain_nowait() function for the specific purpose of deregistering a callback from within the running callback. Differential Revision: https://reviews.freebsd.org/D12561	2017-10-08 17:21:16 +00:00
Sean Bruno	75c8dfb6ae	Check so_error early in sendfile() call. Prior to this patch, if a connection was reset by the remote end, sendfile() would just report ENOTCONN instead of ECONNRESET. Submitted by: Jason Eggleston <jason@eggnet.com> Reviewed by: glebius Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12575	2017-10-07 23:30:57 +00:00
Mateusz Guzik	709939a7b7	namecache: factor out ~MAKEENTRY lookups from the common path Lookups of the sort are rare compared to regular ones and succesfull ones result in removing entries from the cache. In the current code buckets are rlocked and a trylock dance is performed, which can fail and cause a restart. Fixing it will require a little bit of surgery and in order to keep the code maintaineable the 2 cases have to split. MFC after: 1 week	2017-10-06 23:05:55 +00:00
Mark Johnston	f38c0c46c5	Let stack_create(9) take a malloc flags argument. Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12614	2017-10-06 21:52:28 +00:00
Emmanuel Vadot	b1d51685ea	vfs_export_lookup: Fix r324054 When using the default address list nam is still valid, the code in r324054 assumed that is was NULL. Reported by: Guy Yur <guyyur@gmail.com> Tested by: Guy Yur <guyyur@gmail.com>	2017-10-06 09:02:36 +00:00
Mateusz Guzik	d07e22cdd8	locks: take the number of readers into account when waiting Previous code would always spin once before checking the lock. But a lock with e.g. 6 readers is not going to become free in the duration of once spin even if they start draining immediately. Conservatively perform one for each reader. Note that the total number of allowed spins is still extremely small and is subject to change later. MFC after: 1 week	2017-10-05 19:18:02 +00:00
Stephen Hurd	1c0054d261	Fix "taskqgroup_attach: setaffinity failed: 3" with iflib drivers Improved logging added in r323879 exposed an error during attach. We need the irq, not the rid to work correctly. em uses shared irqs, so it will use the same irq for TX as RX. bnxt does not use shared irqs, or TX irqs at all, so there's no need to set the TX irq affinity. Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12496	2017-10-05 14:43:30 +00:00
Mateusz Guzik	20a15d1752	locks: partially tidy up waiting on readers spin first instant of instantly re-readoing and don't re-read after spinning is finished - the state is already known. Note the code is subject to significant changes later. MFC after: 1 week	2017-10-05 13:01:18 +00:00
Andriy Gapon	693593b6f0	sysctl-s in a module should be accessible only when the module is initialized A sysctl can have a custom handler that may access data that is initialized via SYSINIT(9) or via a module event handler (also invoked via SYSINIT). Thus, it is not safe to allow access to the module's sysctl-s until the initialization is performed. Likewise, we should not allow access to teh sysctl-s after the module is uninitialized. The latter is easy to achieve by properly ordering linker_file_unregister_sysctls and linker_file_sysuninit. The former is not as easy for two reasons: - the initialization may depend on tunables which get set when sysctl-s are registered, so we need to set the tunables before running sysinit-s - the initialization may try to dynamically add more sysctl-s under statically defined sysctl nodes So, this change splits the sysctl setup into two phases. In the first phase the sysctl-s are registered as before but they are disabled and hidden from consumers. In the second phase, done after sysinit-s, normal access to the sysctl-s is enabled. The change should affect only dynamic module loading and unloading after the system boot-up. Nothing changes for sysctl-s compiled into the kernel and sysctl-s in preloaded modules. Discussed with: hselasky, ian, jhb Reviewed by: julian, kib MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D12545	2017-10-05 12:32:14 +00:00
Gleb Smirnoff	0e229f343f	Hide struct socket and struct unpcb from the userland. Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and are not guaranteed for stability of the structures. The violators list is the the usual one: libprocstat(3) and netstat(1) internally and lsof in ports. In struct xunpcb remove the inclusion of kernel structure and add a bunch of spare fields. The xsocket already has socket not included, but add there spares as well. Embed xsockbuf into xsocket. Sort declarations in sys/socketvar.h to separate kernel only from userland available ones. PR: 221820 (exp-run)	2017-10-02 23:29:56 +00:00
Alan Cox	0c0e1e96c6	Use vm_page_active() rather than directly accessing the page's queue field. Reviewed by: kib, markj MFC after: 2 weeks X-MFC with: r324146	2017-10-02 07:30:21 +00:00
Andriy Gapon	550374efe6	revert r324166, it has an unrelated change in it	2017-10-01 16:37:54 +00:00
Andriy Gapon	7d5c6491f0	MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best() illumos/illumos-gate@7d3000f774 `7d3000f774` https://www.illumos.org/issues/8521 Yuri reported this to the mailing list: doing a `reboot -d` on current illumos-gate HEAD gives the following ":: findleaks -dv" output: findleaks: maximum buffers => 301061 findleaks: actual buffers => 297587 findleaks: findleaks: potential pointers => 29289774 findleaks: dismissals => 26242305 (89.5%) findleaks: misses => 331153 ( 1.1%) findleaks: dups => 2419681 ( 8.2%) findleaks: follows => 296635 ( 1.0%) findleaks: findleaks: peak memory usage => 7353 kB findleaks: elapsed CPU time => 1.5 seconds findleaks: elapsed wall time => 2.0 seconds findleaks: CACHE LEAKED BUFCTL CALLER ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuripv@gmx.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com> MFC after: 5 weeks X-MFC after: r324163	2017-10-01 16:34:16 +00:00
Mark Johnston	0ffc7ed7e3	Have uiomove_object_page() keep accessed pages in the active queue. Previously, uiomove_object_page() would maintain LRU by requeuing the accessed page. This involves acquiring one of the heavily contended page queue locks. Moreover, it is unnecessarily expensive for pages in the active queue. As of r254304 the page daemon continually performs a slow scan of the active queue, with the effect that unreferenced pages are gradually moved to the inactive queue, from which they can be reclaimed. Prior to that revision, the active queue was scanned only during shortages of free and inactive pages, meaning that unreferenced pages could get "stuck" in the queue. Thus, tmpfs was required to use the inactive queue and requeue pages in order to maintain LRU. Now that this is no longer the case, tmpfs I/O operations can use the active queue and avoid the page queue locks in most cases, instead setting PGA_REFERENCED on referenced pages to provide pseudo-LRU. Reviewed by: alc (previous version) MFC after: 2 weeks	2017-09-30 23:41:28 +00:00
Konstantin Belousov	d3c968bf84	Revert r323722. A better fix will be committed shortly, as well as some still useful bits of the reverted revision. The problem with the committed fix is that there are still issues with returning from NMI, when NMI interrupted kernel in a moment where the kernel segments selectors were still not loaded into registers. If this happens, the NMI return would loose the userspace selectors because r323722 does not reload segment registers on return to kernel mode. Fixing the problem is complicated. Since an alternative approach to handle the original bug exists, it makes sence to stop adding more complexity. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-28 08:38:24 +00:00
John Baldwin	c2dc6d5db1	Use UMA_ALIGNOF() for name cache UMA zones. This fixes kernel crashes due to misaligned accesses to the 64-bit time_t embedded in struct namecache_ts in MIPS n32 kernels. MFC after: 1 week Sponsored by: DARPA / AFRL	2017-09-27 23:18:57 +00:00
Emmanuel Vadot	5e254379a8	vfs_export: Simplify vfs_export_lookup If the filesystem is not exported directly return NULL. If no address is given and filesystem is exported using some default one return it directly, if it doesn't have a default one directly return NULL. Reviewed by: kib, bapt MFC after: 1 week Sponsored by: Gandi.net Differential Revision: https://reviews.freebsd.org/D12505	2017-09-27 09:39:16 +00:00
Mateusz Guzik	a79d52d739	sysctl: remove target buffer read/write checks prior to calling the handler Said checks were inherently racy anyway as jokers could unmap target areas before the handler got around to accessing them. This saves time by avoiding locking the address space. MFC after: 1 week	2017-09-27 01:31:52 +00:00
Mateusz Guzik	956713cb74	Annotate sysctlmemlock with __exclusive_cache_line. MFC after: 1 week	2017-09-27 01:27:43 +00:00
Mateusz Guzik	2f1ddb89fc	mtx: drop the tid argument from _mtx_lock_sleep tid must be equal to curthread and the target routine was already reading it anyway, which is not a problem. Not passing it as a parameter allows for a little bit shorter code in callers. MFC after: 1 week	2017-09-27 00:57:05 +00:00
John Baldwin	09f3bb8756	Log signal number passed to PT_STEP requests in KTR_PTRACE traces. MFC after: 1 week	2017-09-25 20:38:55 +00:00
Conrad Meyer	f41b85a63c	ddb(4): Add 'show badstacks' command to show witness badstacks Add a DDB command that mirrors sysctl debug.witness.badstacks. Reapply r323935 after fixing trivial deficiency. I forgot to compile with WITNESS enabled. Thanks emaste@ for fixing the build while I was asleep. Reported by: rstone Reviewed by: rstone (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12468	2017-09-23 17:48:49 +00:00
Ed Maste	4c087f8a83	Revert r323935 as it broke the build subr_witness.c:2577:4: error: use of undeclared identifier 'req' req->oldidx = 0; ^	2017-09-23 12:35:46 +00:00
Stephen Hurd	d57a78580e	Make struct grouptask gt_name member a char array Previously, it was just a pointer which was copied, but some callers pass in a stack variable which will go out of scope. Add GROUPTASK_NAMELEN macro (32) and snprintf() the name into it, using "grouptask" if name is NULL. We can now safely include gtask->gt_name in console messages. Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12449	2017-09-23 01:39:16 +00:00
Conrad Meyer	6fec2d2cce	ddb(4): Add 'show badstacks' command to show witness badstacks Add a DDB command that mirrors sysctl debug.witness.badstacks. Reported by: rstone Reviewed by: rstone Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12468	2017-09-22 20:01:12 +00:00
Kirk McKusick	75e3597abb	Continuing efforts to provide hardening of FFS, this change adds a check hash to cylinder groups. If a check hash fails when a cylinder group is read, no further allocations are attempted in that cylinder group until it has been fixed by fsck. This avoids a class of filesystem panics related to corrupted cylinder group maps. The hash is done using crc32c. Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily used in embedded systems with small memories and low-powered processors which need as light-weight a filesystem as possible. Specifics of the changes: sys/sys/buf.h: Add BX_FSPRIV to reserve a set of eight b_xflags that may be used by individual filesystems for their own purpose. Their specific definitions are found in the header files for each filesystem that uses them. Also add fields to struct buf as noted below. sys/kern/vfs_bio.c: It is only necessary to compute a check hash for a cylinder group when it is actually read from disk. When calling bread, you do not know whether the buffer was found in the cache or read. So a new flag (GB_CKHASH) and a pointer to a function to perform the hash has been added to breadn_flags to say that the function should be called to calculate a hash if the data has been read. The check hash is placed in b_ckhash and the B_CKHASH flag is set to indicate that a read was done and a check hash calculated. Though a rather elaborate mechanism, it should also work for check hashing other metadata in the future. A kernel internal API change was to change breada into a static fucntion and add flags and a function pointer to a check-hash function. sys/ufs/ffs/fs.h: Add flags for types of check hashes; stored in a new word in the superblock. Define corresponding BX_ flags for the different types of check hashes. Add a check hash word in the cylinder group. sys/ufs/ffs/ffs_alloc.c: In ffs_getcg do the dance with breadn_flags to get a check hash and if one is provided, check it. sys/ufs/ffs/ffs_vfsops.c: Copy across the BX_FFSTYPES flags in background writes. Update the check hash when writing out buffers that need them. sys/ufs/ffs/ffs_snapshot.c: Recompute check hash when updating snapshot cylinder groups. sys/libkern/crc32.c: lib/libufs/Makefile: lib/libufs/libufs.h: lib/libufs/cgroup.c: Include libkern/crc32.c in libufs and use it to compute check hashes when updating cylinder groups. Four utilities are affected: sbin/newfs/mkfs.c: Add the check hashes when building the cylinder groups. sbin/fsck_ffs/fsck.h: sbin/fsck_ffs/fsutil.c: Verify and update check hashes when checking and writing cylinder groups. sbin/fsck_ffs/pass5.c: Offer to add check hashes to existing filesystems. Precompute check hashes when rebuilding cylinder group (although this will be done when it is written in fsutil.c it is necessary to do it early before comparing with the old cylinder group) sbin/dumpfs/dumpfs.c Print out the new check hash flag(s) sbin/fsdb/Makefile: Needs to add libufs now used by pass5.c imported from fsck_ffs. Reviewed by: kib Tested by: Peter Holm (pho)	2017-09-22 12:45:15 +00:00
Stephen Hurd	bf227542f3	Fix undeclared identifier error introduced in r323879 It doesn't appear to be safe to use gtask->gt_name. Reported by: Mark Johnston, Jenkins Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12448	2017-09-21 23:27:35 +00:00
John Baldwin	e1d15b892a	Only handle _PC_MAX_CANON, _PC_MAX_INPUT, and _PC_VDISABLE for TTY devices. Move handling of these three pathconf() variables out of vop_stdpathconf() and into devfs_pathconf() as TTY devices can only be devfs files. In addition, only return settings for these three variables for devfs devices whose device switch has the D_TTY flag set. Discussed with: bde, kib Sponsored by: Chelsio Communications	2017-09-21 23:05:32 +00:00
Stephen Hurd	326aacb0e3	Improved logging of gtaskqueue failues Check the return code of intr_setaffinity() and log any errors it returns. When a qid is not located, log an error before returning failure. Also, use __func__ rather than hardcoding the function name Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12436	2017-09-21 21:14:48 +00:00
Stephen Hurd	a0fcc37122	Fix M_GTASKQUEUE definition Previously had the same short and long description as taskqueues. This could cause problems with memguard(9) and vmstat -m which use the short description as a unique identifier. Reviewed by: sbruno Approved by: sbruno (mentor) MFC after: 3 days Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12438	2017-09-21 20:34:33 +00:00
Konstantin Belousov	9770475ce7	Do not vrele() covered vnode under the mp mutex. If vrele() changes the hold count to zero, it needs to acquire the vnode lock. Sponsored by: The FreeBSD Foundation Discussed with: avg X-MFC with: r323578	2017-09-19 16:49:45 +00:00
Konstantin Belousov	5bf949377e	For unlinked files, do not msync(2) or sync on the vnode deactivation. One consequence of the patch is that msyncing unlinked file mappings no longer reduces the amount of the dirty memory in the system, but I do not think that there are users of msync(2) that utilize it for such side-effect. Reported and tested by: tjil PR: 222356 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12411	2017-09-19 16:46:37 +00:00
Konstantin Belousov	5efe338f3d	Fix handling of the segment registers on i386. Suppose that userspace is executing with the non-standard segment descriptors. Then, until exception or interrupt handler executed SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs. If an interrupt occurs in this window, the interrupt handler is executed unsafely, relying on usability of the usermode registers. If the interrupt results in the context switch on return, the contamination of the kernel state spreads to the thread we switched to. As result, kernel data accesses might fault or, if only the base is changed, completely messed up. More, if the user segment was allocated in LDT, another thread might mark the descriptor as invalid before doreti code tried to reload them. In this case kernel panics. The issue exists for all exception entry points which use trap gate, and thus do not automatically disable interrupts on entry, and for lcall_handler. Fix is two-fold: first, we need to disable interrupts for all kernel entries, changing the IDT descriptor types from trap gate to interrupt gate. Interrupts are re-enabled not earlier than the kernel segments are loaded into the segment registers. Second, we only load the segment registers from the trap frame when returning to usermode. For the later, all interrupt return paths must happen through the doreti common code. There is no way to disable interrupts on call gate, which is the supposed mode of servicing for lcall $7,$0 syscalls. Change the LDT descriptor 0 into a code segment type and point it to the userspace trampoline which redirects the syscall to int $0x80. All the measures make the segment register handling similar to that of amd64. We do not apply amd64 optimizations of not reloading segment registers on return from the syscall. Reported by: Maxime Villard <max@m00nbsd.net> Tested by: pho (the non-lcall part) Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12402	2017-09-18 20:22:42 +00:00
Alan Cox	ec371b57e8	Modify blst_leaf_alloc to take only the cursor argument. Modify blst_leaf_alloc to find allocations that cross the boundary between one leaf node and the next when those two leaves descend from the same meta node. Update the hint field for leaves so that it represents a bound on how large an allocation can begin in that leaf, where it currently represents a bound on how large an allocation can be found within the boundaries of the leaf. The first phase of blst_leaf_alloc currently shrinks sequences of consecutive 1-bits in mask until each has been shrunken by count-1 bits, so that any bits remaining show where an allocation can begin, or until all the bits have disappeared, in which case the allocation fails. This change amends that so that the high-order bit is copied, as if, when the last block was free, it was followed by an endless stream of free blocks. It also amends the early stopping condition, so that the shrinking of 1-sequences stops early when there are none, or there is only one unbounded one remaining. The search for the first set bit is unchanged, and the code path thereafter is mostly unchanged unless the first set bit is in a position that makes some of those copied sign bits matter. In that case, we look for a next leaf, and at what blocks it can provide, to see if a cross-boundary allocation is possible. The hint is updated on a successful allocation that clears the last bit, but it not updated on a failed allocation that leaves the last bit set. So, as long as the last block is free, the hint value for the leaf is large. As long as the last block is free, and there's a next leaf, a large allocation can begin here, perhaps. A stricter rule than this would mean that allocations and frees in one leaf could require hint updates to the preceding leaf, and this change seeks to leave the freeing code unmodified. Define BLIST_BMAP_MASK, and use it for bit masking in blst_leaf_free and blist_leaf_fill, as well as in blst_leaf_alloc. Correct a panic message in blst_leaf_free. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: markj (an earlier version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11819	2017-09-16 18:12:15 +00:00
Stephen Hurd	ab2e3f7958	Revert r323516 (iflib rollup) This was really too big of a commit even if everything worked, but there are multiple new issues introduced in the one huge commit, so it's not worth keeping this until it's fixed. I'll work on splitting this up into logical chunks and introduce them one at a time over the next week or two. Approved by: sbruno (mentor) Sponsored by: Limelight Networks	2017-09-16 02:41:38 +00:00
Gleb Smirnoff	584ab65a75	Fix locking in soisconnected(). When a newborn socket moves from incomplete queue to complete one, we need to obtain the listening socket lock after the child, which is a wrong order. The old code did that in potentially endless loop of mtx_trylock(). The new one does only one attempt of mtx_trylock(), and in case of failure references listening socket, unlocks child and locks everything in right order. In case if listening socket shuts down during that, just bail out. Reported & tested by: Jason Eggleston <jeggleston llnw.com> Reported & tested by: Jason Wolfe <jason llnw.com>	2017-09-14 18:05:54 +00:00
John Baldwin	c2f37b9245	Add AT_HWCAP and AT_EHDRFLAGS on all platforms. A new 'u_long sv_hwcap' field is added to 'struct sysentvec'. A process ABI can set this field to point to a value holding a mask of architecture-specific CPU feature flags. If an ABI does not wish to supply AT_HWCAP to processes the field can be left as NULL. The support code for AT_EHDRFLAGS was already present on all systems, just the #define was not present. This is a step towards unifying the AT_ constants across platforms. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12290	2017-09-14 14:26:55 +00:00
Andriy Gapon	cbc785c293	dounmount: do not release the mount point's reference on the covered vnode As long as mnt_ref is not zero there can be a consumer that might try to access mnt_vnodecovered. For this reason the covered vnode must not be freed until mnt_ref goes to zero. So, move the release of the covered vnode to vfs_mount_destroy. Reviewed by: kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D12329	2017-09-14 08:47:06 +00:00
Gleb Smirnoff	d37aa3ccce	Use soref() in sendfile(2) instead fhold() to reference a socket. The problem is that fdrop() requires syscall context, as it may enter sleep in some cases. The reason to use it in the original non-blocking sendfile implementation, was to avoid use of global ACCEPT_LOCK() on every I/O completion. Now in head sorele() no longer requires this lock.	2017-09-13 22:11:05 +00:00
Gleb Smirnoff	100db364eb	Fix two issues with not ready data in sockets (read: sendfile) in UNIX sockets. o Check that socket is still connected in uipc_ready(). If not we are responsible to free mbufs. o In uipc_send() if socket appears to be disconnected, but we are sending data with pending I/Os, don't free mbufs. Reported by: Kevin Bowling <kbowling llnw.com> Tested by: Kevin Bowling <kbowling llnw.com> PR: 222259 Reported by: Mark Martinec <Mark.Martinec ijs.si> MFC after: 3 days	2017-09-13 16:47:23 +00:00
Stephen Hurd	d300df0182	Roll up iflib commits from github. This pulls in most of the work done by Matt Macy as well as other changes which he has accepted via pull request to his github repo at https://github.com/mattmacy/networking/ This should bring -CURRENT and the github repo into close enough sync to allow small feature branches rather than a large chain of interdependant patches being developed out of tree. The reset of the synchronization should be able to be completed on github by splitting the remaining changes that are not yet ready into short feature branches for later review as smaller commits. Here is a summary of changes included in this patch: 1) More checks when INVARIANTS are enabled for eariler problem detection 2) Group Task Queue cleanups - Fix use of duplicate shortdesc for gtaskqueue malloc type. Some interfaces such as memguard(9) use the short description to identify malloc types, so duplicates should be avoided. 3) Allow gtaskqueues to use ithreads in addition to taskqueues - In some cases, this can improve performance 4) Better logging when taskqgroup_attach*() fails to set interrupt affinity. 5) Do not start gtaskqueues until they're needed 6) Have mp_ring enqueue function enter the ABDICATED rather than BUSY state. This moves the TX to the gtaskq and allows processing to continue faster as well as make TX batching more likely. 7) Add an ift_txd_errata function to struct if_txrx. This allows drivers to inspect/modify mbufs before transmission. 8) Add a new IFLIB_NEED_ZERO_CSUM for drivers to indicate they need checksums zeroed for checksum offload to work. This avoids modifying packet data in the TX path when possible. 9) Use ithreads for iflib I/O instead of taskqueues 10) Clean up ioctl and support async ioctl functions 11) Prefetch two cachlines from each mbuf instead of one up to 128B. We often need to parse packet header info beyond 64B. 12) Fix potential memory corruption due to fence post error in bit_nclear() usage. 13) Improved hang detection and handling 14) If the packet is smaller than MTU, disable the TSO flags. This avoids extra packet parsing when not needed. 15) Move TCP header parsing inside the IS_TSO?() test. This avoids extra packet parsing when not needed. 16) Pass chains of mbufs that are not consumed by lro to if_input() rather call if_input() for each mbuf. 17) Re-arrange packet header loads to get as much work as possible done before a cache stall. 18) Lock the context when calling IFDI_ATTACH_PRE()/IFDI_ATTACH_POST()/ IFDI_DETACH(); 19) Attempt to distribute RX/TX tasks across cores more sensibly, especially when RX and TX share an interrupt. RX will attempt to take the first threads on a core, and TX will attempt to take successive threads. 20) Allow iflib_softirq_alloc_generic() to request affinity to the same cpus an interrupt has affinity with. This allows TX queues to ensure they are serviced by the socket the device is on. 21) Add new iflib sysctls to net.iflib: - timer_int - interval at which to run per-queue timers in ticks - force_busdma 22) Add new per-device iflib sysctls to dev.X.Y.iflib - rx_budget allows tuning the batch size on the RX path - watchdog_events Count of watchdog events seen since load 23) Fix error where netmap_rxq_init() could get called before IFDI_INIT() 24) e1000: Fixed version of r323008: post-cold sleep instead of DELAY when waiting for firmware - After interrupts are enabled, convert all waits to sleeps - Eliminates e1000 software/firmware synchronization busy waits after startup 25) e1000: Remove special case for budget=1 in em_txrx.c - Premature optimization which may actually be incorrect with multi-segment packets 26) e1000: Split out TX interrupt rather than share an interrupt for RX and TX. - Allows better performance by keeping RX and TX paths separate 27) e1000: Separate igb from em code where suitable Much easier to understand separate functions and "if (is_igb)" than previous tests like "if (reg_icr & (E1000_ICR_RXSEQ \| E1000_ICR_LSC))" #blamebruno Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12235	2017-09-13 01:18:42 +00:00
Alan Cox	d027ed2e7a	To analyze the allocation of swap blocks by blist functions, add a method for analyzing the radix tree structures and reporting on the number, and sizes, of maximal intervals of free blocks. The report includes the number of maximal intervals, and also the number of them in each of several size ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size boundaries defined by Fibonacci numbers. The report is written in the test tool with the 's' command, or in a running kernel by sysctl. The analysis of the radix tree frequently computes the position of the lone bit set in a u_daddr_t, a computation that also appears in leaf allocation. That computation has been moved into a function of its own, and optimized for cases where an inlined machine instruction can replace the usual binary search. Submitted by: Doug Moore <dougm@rice.edu> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11906	2017-09-10 17:46:03 +00:00

1 2 3 4 5 ...

15625 Commits