freebsd-skq

Author	SHA1	Message	Date
Konstantin Belousov	9a2c85350a	Partially revert r255986: do not call VOP_FSYNC() when helping bufdaemon in getnewbuf(), do use buf_flush(). The difference is that bufdaemon uses TRYLOCK to get buffer locks, which allows calls to getnewbuf() while another buffer is locked. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-27 11:13:19 +00:00
Rick Macklem	7cfdc2a7bc	MAXBSIZE defines both the largest UFS block size and the largest size for a buffer in the buffer cache. This patch defines a new constant MAXBCACHEBUF, which is the largest size for a buffer in the buffer cache. Having a separate constant allows MAXBCACHEBUF to be set larger than MAXBSIZE on a per-architecture basis, so that NFS can do larger read/writes for these architectures. It modifies sys/param.h so that BKVASIZE can also be set on a per-architecture basis. A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE is fixed as well. Differential Revision: https://reviews.freebsd.org/D2330 Reviewed by: mav, kib MFC after: 2 weeks	2015-04-25 00:52:01 +00:00
Benno Rice	43348dc2ad	Reset bp->bio_done to unmapped_buf when removing a transient map in biodone. Submitted by: Scott Ferris <scott.ferris@isilon.com> Sponsored by: EMC / Isilon Storage Division Reviewed by: kib	2015-03-16 20:00:09 +00:00
Konstantin Belousov	904ed548bb	When getnewbuf_reuse_bp() is called to reclaim some (clean) buffer, the vnode owning the buffer is not locked. More, it cannot be locked safely, since getnewbuf_reuse_bp() is called from newbuf(), and some other vnode is already locked, for which reused buffer will be reassigned. As the consequence, reclamation of the owning vnode could go in parallel, in particular, the call to vnode_destroy_vobject(), which deallocates the vm object and zeroes the v_bufobj->bo_object. Note that the pages wired by the buffer are left wired and can be safely freed by the vfs_vmio_release() without the need for the vm object lock. Also, seeing stale pointer to the v_object is safe due to vm object type stability. Check for bo_bufobj != NULL and cache the value in local variable to avoid trying to lock NULL vm object. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-08 16:42:34 +00:00
Alexander Motin	ccf8a5688a	Revert somewhat hackish geom_disk optimization, committed as part of r256880, and the following r273143 commit, supposed to work around an issue introduced by a quite innocent-looking change. While there is no clear understanding why, r273143 is accused of data corruption in some environments with high I/O load. I personally don't see any problem in that commit, and possibly it is just a trigger to some other bug somewhere, but better safe than sorry for now. Requested by: scottl@ MFC after: 3 days	2014-10-25 15:16:19 +00:00
Alexander Motin	99b9076c21	Remove setting BIO_DONE flag for BIOs that have done() method. This fixes use-after-free, caused by geom_disk, completing same BIO twice to save extra allocation, and getting BIO_DONE set after the first. MFC after: 1 week	2014-10-15 18:36:34 +00:00
Jung-uk Kim	37417245bf	Make kern.nswbuf tunable from loader. MFC after: 1 week	2014-10-07 20:13:47 +00:00
Benno Rice	c079e1c018	Add KASSERTs to catch the case where a developer may have forgotten to set bo_bsize on a bufobj. This is a slight modification of the patch provided. PR: 193146 Submitted by: Conrad Meyer <conrad.meyer@isilon.com> Sponsored by: EMC Isilon Storage Division	2014-09-04 00:10:06 +00:00
Kirk McKusick	5f9500c358	Add support for multi-threading of soft updates. Replace a single soft updates thread with a thread per FFS-filesystem mount point. The threads are associated with the bufdaemon process. Reviewed by: kib Tested by: Peter Holm and Scott Long MFC after: 2 weeks Sponsored by: Netflix	2014-08-04 22:03:58 +00:00
Attilio Rao	3ae10f7477	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2014-06-16 18:15:27 +00:00
Konstantin Belousov	a19c5d3716	Devolatile as needed. Sponsored by: The FreeBSD Foundation MFC after: 13 days	2014-06-09 09:10:31 +00:00
Konstantin Belousov	7f82c6c17f	Change the nblock mutex, protecting the needsbuffer buffer deficit flags, to rwlock. Lock it in read mode when used from subroutines called from buffer release code paths. The needsbuffer is now updated using atomics, while read lock of nblock prevents losing the wakeups from bufspacewakeup() and bufcountadd() in getnewbuf_bufd_help(). In several interesting loads, needsbuffer flags are never set, while buffers are reused quickly. This causes brelse() and bqrelse() from different threads to contend on the nblock. Now they take nblock in read mode, together with needsbuffer not needing an update, allowing higher parallelism. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-06-09 03:38:03 +00:00
Konstantin Belousov	23f6698fbd	Initialize the pbuf counter for directio using SYSINIT, instead of using a direct hook called from kern_vfs_bio_buffer_alloc(). Mark ffs_rawread.c as requiring both ffs and directio options to be compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko module' sources. In addition to stopping breaking the layering violation, it also allows to link kernel when FFS is configured as module and DIRECTIO is enabled. One consequence of the change is that ffs_rawread.o is always linked into the module regardless of the DIRECTIO option. This is similar to the option QUOTA and ufs_quota.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:55:06 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcpu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
John Baldwin	44afcdabf3	Fix a typo.	2014-01-21 03:24:52 +00:00
Konstantin Belousov	e136eac224	Revert r259200. There are geoms/drivers which do not update bio_completed, only manage bio_resid, e.g. sa(4). Reported and tested by: Manfred Antar <null@pozo.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-12-27 17:04:51 +00:00
Konstantin Belousov	b4aa4fed2b	Fix detection of EOF in kern_physio(). If bio_length was clipped by the excess code in g_io_check(), bio_resid is also truncated by g_io_deliver(). As result, bufdonebio() assigns truncated value to the buffer b_resid field. Use the residual bio_completed to calculate buffer b_resid from b_bcount in bufdonebio(), instead of bio_resid, calculated from bio_length in g_io_deliver(). The issue is seemingly caused by the code rearrange into g_io_check(), which is not present in stable/10. The change still looks as the useful change to have in 10 nevertheless. Reported by: Stefan Hegnauer <stefan.hegnauer@gmx.ch> Tested by: pho, Stefan Hegnauer <stefan.hegnauer@gmx.ch> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-12-10 21:15:18 +00:00
John Baldwin	ba9593f3bd	Don't allow vfs.lorunningspace or vfs.hirunningspace to be set such that lorunningspace is greater than hirunningspace as the system performs terribly if it is mistuned in this fashion. MFC after: 1 week	2013-11-15 15:29:53 +00:00
Alexander Motin	8160afdab1	MFprojects/camlock r256619: Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodone() when unmapping temporarily mapped buffer. That fixes double unmap if biodone() called twice for the same BIO (but with different done methods). Move mapping removal before calling bio_done() method. I believe that it is very wrong to do anything to BIO after reporting completion. kib@ thinks it was done for some forgotten now case when bio_done() method needed mapped buffer. But 1) if BIO was sent as unmapped, then IMO done() should be called in the same way; 2) IMO there is no guarantee that buffer will be mapped at this point at all, for example, if all underlying stack supports unmapped I/O, so bio_done() handler can not expect that.	2013-10-21 06:44:55 +00:00
Alexander Motin	77a30af6f8	MFprojects/camlock r256370: - Take BIO lock in biodone() only when there is no completion callback set and so we should wake up thread waiting in biowait(). - Remove msleep() timeout from biowait(). It was added 11 years ago, when there was no locks used, and it should not be needed any more.	2013-10-16 09:56:40 +00:00
Konstantin Belousov	acb9d2c7f0	The device vnodes are often unlocked when bread() or bwrite() is called. This probably should be fixed eventually, but for now it is not needed to try to flush such vnodes from the buffer allocation context. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)	2013-10-09 18:45:01 +00:00
Konstantin Belousov	432e79fc33	When helping the bufdaemon from the buffer allocation context, there is no sense to walk the whole dirty buffer queue. We are only interested in, and can operate on, the buffers owned by the current vnode [1]. Instead of calling generic queue flush routine, do VOP_FSYNC() if possible. Holding the dirty buffer queue lock in the bufdaemon, without dropping it, can cause starvation of buffer writes from other threads. This is esp. easy to reproduce on the big memory machines, where large files are written, causing almost all dirty buffers accumulating in several big files, which vnodes are locked by writers. Bufdaemon cannot flush any buffer, but is iterating over the whole dirty queue continuously. Since dirty queue mutex is not dropped, bufdone() in g_up thread is starved, usually deadlocking the machine [2]. Mitigate this by dropping the queue lock after the vnode is locked, allowing other queue lock contenders to make a progress. Discussed with: Jeff [1] Reported by: pho [2] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (hrs)	2013-10-02 06:00:34 +00:00
Konstantin Belousov	ac34145005	Reimplement r255797 using LK_TRYUPGRADE. The r255797 was: Increase the chance of the buffer write from the bufdaemon helper context to succeed. If the locked vnode which owns the buffer to be written is shared locked, try the non-blocking upgrade of the lock to exclusive. PR: kern/178997 Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-29 18:04:57 +00:00
Konstantin Belousov	12af71a69f	Revert r255797. The LK_UPGRADE \| LK_NOWAIT drops the lock. Approved by: re (marius, implicit)	2013-09-22 20:29:03 +00:00
Konstantin Belousov	d1f8ca485d	Increase the chance of the buffer write from the bufdaemon helper context to succeed. If the locked vnode which owns the buffer to be written is shared locked, try the non-blocking upgrade of the lock to exclusive. PR: kern/178997 Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)	2013-09-22 19:23:48 +00:00
Konstantin Belousov	3aaea6efd5	Drain for the xbusy state for two places which potentially do pmap_remove_all(). Not doing the drain allows the pmap_enter() to proceed in parallel, making the pmap_remove_all() effects void. The race results in an invalidated page mapped wired by usermode. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-08 17:51:22 +00:00
Konstantin Belousov	a677b31425	The vm_pageout_flush() function sbusies pages in the passed pages run. After that, the pager put method is called, usually translated to VOP_WRITE(). For the filesystems which use buffer cache, bufwrite() sbusies the buffer pages again, waiting for the xbusy state to drain. The latter is done in vfs_drain_busy_pages(), which is called with the buffer pages already sbusied (by vm_pageout_flush()). Since vfs_drain_busy_pages() can only wait for one page at a time, and during the wait, the object lock is dropped, previous pages in the buffer must be protected from other threads busying them. Up to the moment, it was done by xbusying the pages, that is incompatible with the sbusy state in the new implementation of busy. Switch to sbusy. Reported and tested by: pho Sponsored by: The FreeBSD Foundation	2013-09-05 12:56:08 +00:00
Konstantin Belousov	4f8cf6e59b	Both cluster_rbuild() and cluster_wbuild() sometimes set the pages shared busy without first draining the hard busy state. Previously it went unnoticed since VPO_BUSY and m->busy fields were distinct, and vm_page_io_start() did not verify that the passed page has VPO_BUSY flag cleared, but such page state is wrong. New implementation is more strict and catches this case. Drain the busy state as needed, before calling vm_page_sbusy(). Tested by: pho, jkim Sponsored by: The FreeBSD Foundation	2013-08-22 18:26:45 +00:00
Konstantin Belousov	5944de8ecd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the 2 concepts into a real, minimal, sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed, this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Konstantin Belousov	982d771242	Assert that runningbufspace does not underflow. Sponsored by: The FreeBSD Foundation	2013-07-13 19:36:18 +00:00
Konstantin Belousov	da4ca6c8ab	There is no need to count waiters for the runningbufspace. Sponsored by: The FreeBSD Foundation	2013-07-13 19:34:34 +00:00
Konstantin Belousov	92e5367354	Do not invalidate page of the B_NOCACHE buffer or buffer after an I/O error if any user wired mappings exist. Doing the invalidation destroys the user wiring. The change is a temporary measure to close the bug, the more proper fix is to delegate the invalidation of the page to upper layers always. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:36:26 +00:00
Alfred Perlstein	d7b5c50b92	Make kassert_printf use __printflike. Fix associated errors/warnings while I'm here. Requested by: avg	2013-07-07 21:39:37 +00:00
Jeff Roberson	5f51836645	- Add a general purpose resource allocator, vmem, from NetBSD. It was originally inspired by the Solaris vmem detailed in the proceedings of usenix 2001. The NetBSD version was heavily refactored for bugs and simplicity. - Use this resource allocator to allocate the buffer and transient maps. Buffer cache defrags are reduced by 25% when used by filesystems with mixed block sizes. Ultimately this may permit dynamic buffer cache sizing on low KVA machines. Discussed with: alc, kib, attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-06-28 03:51:20 +00:00
Jeff Roberson	ba39d89bc9	- Consolidate duplicate code into support functions. - Split the bqlock into bqclean and bqdirty locks. - Only acquire the wakeup synchronization locks when we cross a threshold requiring them. - Restructure the way flushbufqueues() targets work so they are more smp friendly and sane. Reviewed by: kib Discussed with: mckusick, attilio Sponsored by: EMC / Isilon Storage Division M vfs_bio.c	2013-06-05 23:53:00 +00:00
Konstantin Belousov	92fab43f7f	When auto-sizing the buffer cache, limit the amount of physical memory used as the estimation of size, to 32GB. This provides around 100K of buffer headers and corresponding KVA for buffer map at the peak. Sizing the cache larger is not useful, also resulting in the wasting and exhausting of KVA for large machines. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation	2013-06-03 04:16:48 +00:00
Alan Cox	39a4cd0cec	Reduce the scope of the VM object locking in brelse(). In my tests, this change reduced the total number of VM object lock acquisitions by brelse() by 74%. Sponsored by: EMC / Isilon Storage Division	2013-06-02 16:18:03 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Attilio Rao	bed927ee17	vm_object locking is not needed there as pages are already wired. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-05-21 20:54:03 +00:00
Attilio Rao	e3ed7ff03f	Use readlocking now that assertions on vm_page_lookup() are relaxed. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: flo, pho	2013-05-17 20:03:55 +00:00
Konstantin Belousov	d1e99f43ed	Add dev_strategy_csw() function, which is similar to dev_strategy() but assumes that a thread reference was already obtained on the passed device. Use the function from physio(), to avoid two extra dev_mtx lock and unlock. Note that physio() is always used as the cdevsw method, or is called from a cdevsw method, and the caller already owns the reference. dev_strategy() is left to keep KPI intact, but now it is implemented as a wrapper around dev_strategy_csw(). Do some style cleanup in physio(). Requested and reviewed by: kan (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-03-27 11:34:27 +00:00
Konstantin Belousov	88c8c0a70f	On i386, double the default size of the bio transient map. With the maxbcache size fixed, the auto-tuned transient map is too small for real-world load on i386. Tested by: David Wolfskill Sponsored by: The FreeBSD Foundation	2013-03-27 10:56:15 +00:00
Konstantin Belousov	7db07e1c85	Only size and create the bio_transient_map when unmapped buffers are enabled. Now, disabling the unmapped buffers should result in the kernel memory map identical to pre-r248550. Sponsored by: The FreeBSD Foundation	2013-03-21 07:28:15 +00:00
Konstantin Belousov	e3269b5096	In bufwrite(), a dirty buffer is moved to the clean queue before the bufobj counter of the writes in progress is incremented. Other thread inspecting the bufobj would consider it clean. For the regular vnodes, the vnode lock is typically held both by the thread performing the bufwrite() and an other thread doing syncing, which prevents the situation. On the other hand, writes to the VCHR vnodes are done without holding vnode lock. Increment the write ref counter for the buffer object before calling bundirty(). Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 2 weeks	2013-03-20 21:08:00 +00:00
Konstantin Belousov	e81ff91e62	Do not remap usermode pages into KVA for physio. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:43:57 +00:00
Konstantin Belousov	7d5365c70b	Add a helper function vfs_bio_bzero_buf() to zero the portion of the buffer, transparently handling mapped or unmapped buffers. Its intent is to replace the use of bzero(bp->b_data) in cases where the buffer might be unmapped, to avoid unneeded upgrades. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:27:14 +00:00
Konstantin Belousov	ee75e7de7b	Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminate the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads. The unmapped buffer should be explicitely requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation. Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitely specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap. The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests. Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested. 
In the rework, filesystem metadata is not the subject to maxbufspace limit anymore. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, are accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is forced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not work, because buffer_map fragmentation does not allow the limit to be reached. By Jeff Roberson request, the getnewbuf() function was split into smaller single-purpose functions. Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks	2013-03-19 14:13:12 +00:00
Konstantin Belousov	70e198dd07	Some style fixes. Sponsored by: The FreeBSD Foundation	2013-03-14 20:31:39 +00:00

672 Commits