freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	6c21f6edb8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache thrashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitly knows about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantics is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
Gleb Kurtsou	dde58752db	Adjust printf format specifiers for dev_t and ino_t in kernel. ino_t and dev_t are about to become uint64_t. Reviewed by: kib, mckusick	2014-12-17 07:27:19 +00:00
Gleb Smirnoff	90effb2341	Merge from projects/sendfile: o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument. o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager. o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-23 12:01:52 +00:00
Gleb Smirnoff	dbfd8ef2e4	buf.h is not needed here, and pollutes when ufsmount.h is included from userland code. Sponsored by: Nginx, Inc.	2014-11-23 01:02:19 +00:00
Gleb Smirnoff	8c7f0b92b4	Include required files directly instead of pollution via ufs/ufsmount.h. Sponsored by: Nginx, Inc.	2014-11-23 01:01:14 +00:00
Davide Italiano	0a4102656e	Use the correct variable name.	2014-11-22 00:42:30 +00:00
Davide Italiano	c38b12220f	Make ufs_dirhashreclaimperc a percentage for real and rename it to ufs_dirhashreclaimpercent, as suggested by jhb@. As an added bonus this avoids divide-by-zero errors. Requested by: jhb, markj Reviewed by: jhb, markj	2014-11-22 00:37:37 +00:00
Konstantin Belousov	ca109b01cf	When non-forced unmount or remount rw->ro is performed, writes on UFS are not suspended. In particular, on the SU-enabled volumes, there is no reason why, between the call to softdep_flushfiles() and softdep_waitidle(), SU work items cannot be queued. Correct the condition to trigger the panic by only checking when forced operation is done. Convert direct panic() call into KASSERT(), there are no invalid on-disk data structures directly involved, so follow the usual debugging vs. non-debugging approach. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-02 13:14:55 +00:00
Mateusz Guzik	4fce16e4c9	Provide vfs suspension support only for filesystems which need it, take two. nullfs and unionfs need to request suspension if underlying filesystem(s) use it. Utilize mnt_kern_flag for this purpose. This is a fixup for 273271. No strong objections from: kib Pointy hat to: mjg MFC after: 2 weeks	2014-10-20 18:00:50 +00:00
Mateusz Guzik	8f889ce7c8	Use lockless quota checks in qsync and qsyncvp. No strong objections from: kib, mckusick MFC after: 1 week	2014-10-16 12:41:14 +00:00
Konstantin Belousov	df5c9c0411	Do not set IN_ACCESS flag for read-only mounts. The IN_ACCESS survives remount in rw, also it is set for vnodes on rootfs before noatime can be set or clock is adjusted. All conditions result in wrong atime for accessed vnodes. Submitted by: bde MFC after: 1 week	2014-10-11 19:09:56 +00:00
Warner Losh	34eed6d3de	Restore the backed-out change, using __offsetof instead.	2014-10-10 00:35:08 +00:00
Baptiste Daroussin	16b2833f90	Back out r272825: every userland use of ufs/ufs/dir.h is now broken by that change.	2014-10-09 17:26:29 +00:00
Baptiste Daroussin	92a1d420ba	Use offsetof() from sys/types.h instead of a custom one This fixes build with recent gcc versions	2014-10-09 15:26:22 +00:00
Konstantin Belousov	d15b55c554	Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq(). Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)	2014-09-15 12:28:29 +00:00
Alan Cox	3e5c84e292	We don't need an exclusive object lock on the expected execution path through {ext2,ffs}_getpages(). Reviewed by: kib, pfg MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-13 18:26:13 +00:00
Konstantin Belousov	c6fef2d49a	Direct access to the quota files, in particular, lookup, causes lock conflict with the quota metadata access. Mark quota vnode lock as recursive and always exclusive to avoid the problem. Reported by: hrs Tested by: hrs, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-29 09:04:24 +00:00
Davide Italiano	266d427bc0	Rather than using a hardcoded reclaim age, rely on an LRU-like approach for dirhash cache, setting a target percent to reclaim (exposed via SYSCTL). This allows some amount of progress to always be made while keeping the maximum reclaim age dynamic. Tested by: pho Reviewed by: jhb	2014-08-25 17:06:18 +00:00
Konstantin Belousov	29d03af1e4	Do not busy the UFS mount point inside VOP_RENAME(). The kern_renameat() already starts write on the mp, which prevents parallel unmount from proceeding. Busying mp after vn_start_write() deadlocks the unmount. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:15:23 +00:00
Konstantin Belousov	4ce90426e0	Correct the test for condition to suspend UFS filesystem during unmount. There is no need to suspend read-only filesystem, while we need suspension on modifiable mount point. Reported by: rwatson Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:13:03 +00:00
Konstantin Belousov	a6b5e6e32f	Revision r269457 removed the Giant around mount and unmount code, but r269533, which was tested before r269457 was committed, implicitly relied on the Giant to protect the manipulations of the softdepmounts list. Use softdep global lock consistently to guarantee the list structure now. Insert the new struct mount_softdeps into the softdepmounts only after it is sufficiently initialized, to prevent softdep_speedup() from accessing bare memory. Similarly, remove struct mount_softdeps for the unmounted filesystem from the tailq before destroying structure rwlock. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-12 09:33:00 +00:00
Kirk McKusick	3da0b29d99	The SUJ journal is only prepared to handle full-size block numbers, so we have to adjust freeblk records to reflect the change to a full-size block. For example, suppose we have a block made up of fragments 8-15 and want to free its last two fragments. We are given a request that says: FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0 where frags are the number of fragments to free and oldfrags are the number of fragments to keep. To block align it, we have to change it to have a valid full-size blkno, so it becomes: FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6 Submitted by: Mikihito Takehara Tested by: Mikihito Takehara Reviewed by: Jeff Roberson MFC after: 1 week	2014-08-07 16:53:07 +00:00
Kirk McKusick	5f9500c358	Add support for multi-threading of soft updates. Replace a single soft updates thread with a thread per FFS-filesystem mount point. The threads are associated with the bufdaemon process. Reviewed by: kib Tested by: Peter Holm and Scott Long MFC after: 2 weeks Sponsored by: Netflix	2014-08-04 22:03:58 +00:00
Warner Losh	ddd812b850	Simplify comment to remove multiple negative and passive voice.	2014-07-23 16:18:54 +00:00
Konstantin Belousov	65589a29f4	Check for the cross-device cross-link attempt in the VFS, instead of forcing filesystem VOP_LINK() methods to repeat the code. In tmpfs_link(), remove redundant check for the type of the source, already done by VFS. Note that NFS server already performs this check before calling VOP_LINK(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-16 14:04:46 +00:00
Konstantin Belousov	895b3782c6	Extract the code to put a filesystem into the suspended state (at the unmount time) in the helper vfs_write_suspend_umnt(). Use it instead of two inline copies in FFS. Fix the bug in the FFS unmount, when suspension failed, the ufs extattrs were not reinitialized. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:10:00 +00:00
Konstantin Belousov	7b81a399a4	In msdosfs_setattr(), add a check for result of the utimes(2) permissions test, forgotten in r164033. Refactor the permission checks for utimes(2) into vnode helper function vn_utimes_perm(9), and simplify its code comparing with the UFS origin, by writing the call to VOP_ACCESSX only once. Use the helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5). Reported by: bde Reviewed by: bde, trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-17 07:11:00 +00:00
Konstantin Belousov	23f6698fbd	Initialize the pbuf counter for directio using SYSINIT, instead of using a direct hook called from kern_vfs_bio_buffer_alloc(). Mark ffs_rawread.c as requiring both ffs and directio options to be compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko module's sources. In addition to removing the layering violation, it also allows linking the kernel when FFS is configured as a module and DIRECTIO is enabled. One consequence of the change is that ffs_rawread.o is always linked into the module regardless of the DIRECTIO option. This is similar to the option QUOTA and ufs_quota.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:55:06 +00:00
John-Mark Gurney	e9838c1140	don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags.. If you had a UFS2 FS that didn't have its super block at SBLOCK_UFS2, you'll end up corrupting your FS as the superblock is updated and written to a different location... makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing this issue... Reviewed by: silence from mckusick MFC after: 1 week	2014-06-03 21:46:13 +00:00
Scott Long	7d155880ee	Due to reasons unknown at this time, the system can be forced to write a journal block even when there are no journal entries to be written. Until the root cause is found, handle this case by ensuring that a valid journal segment is always written. Second, the data buffer used for writing journal entries was never being scrubbed of old data. Fix this. Submitted by: Takehara Mikihito Obtained from: Netflix, Inc. MFC after: 3 days	2014-05-06 20:40:16 +00:00
Kirk McKusick	e24784d32a	Update comment to explain search order reverted to historical order in -r254996. Suggested by: Pedro Giffuni <pfg@FreeBSD.org> MFC: 3 days	2014-03-22 11:26:39 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Jeff Roberson	4803948fe2	- If we fail to do a non-blocking acquire of a buf lock while doing a waiting sync pass we need to do a blocking acquire and restart. Another thread, typically the buf daemon, may have this buf locked and if we don't wait we can fail to sync the file. This led to a great variety of softdep panics because we rely on all dependencies being flushed before proceeding in several cases. Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks	2014-03-06 00:13:21 +00:00
Jeff Roberson	c13a58b022	- Gracefully handle truncation failures when trying to shrink directories. This could cause dirhash panics since the dirhash state would be successfully truncated while the directory was not. Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks	2014-03-06 00:10:07 +00:00
Pedro F. Giffuni	4896af9f55	ufs: small formatting fixes. Cleanup some extra space. Use of tabs vs. spaces. No functional change. MFC after: 3 days Reviewed by: mckusick	2014-03-02 02:52:34 +00:00
Kirk McKusick	be1fd82346	Fine tune filesystem block allocations under low free-space conditions (-r254995) based on further operational experience. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko MFC after: 2 weeks	2013-12-30 17:04:24 +00:00
Kirk McKusick	9cd7cb1bf1	Properly handle unsigned comparison. MFC after: 2 weeks	2013-12-30 06:19:42 +00:00
Kirk McKusick	2e436d88a3	We needlessly panic when trying to flush MKDIR_PARENT dependencies. We had previously tried to flush all MKDIR_PARENT dependencies (and all the NEWBLOCK pagedeps) by calling ffs_update(). However this will only resolve these dependencies in direct blocks. So very large directories with MKDIR_PARENT dependencies in indirect blocks had not yet gotten flushed. As the directory is in the midst of doing a complete sync, we simply defer the checking of the MKDIR_PARENT dependencies until the indirect blocks have been sync'ed. Reported by: Shawn Wallbridge of imaginaryforces.com Tested by: John-Mark Gurney <jmg@funkthat.com> PR: 183424 MFC after: 2 weeks	2013-12-01 07:34:21 +00:00
John-Mark Gurney	e0ce310797	fix white space... MFC after: 1 week	2013-11-20 21:21:29 +00:00
John-Mark Gurney	b6ffc3b567	fix a use after free, jsegdep_merge will free wk, avoid the next check... CID: 1006098 Sponsored by: Imaginary Forces Reviewed by: mckusick MFC after: 1 week	2013-11-20 21:16:53 +00:00
Pedro F. Giffuni	4b367145f7	UFS2: make di_extsize unsigned. di_extsize is the EA size and as such it should be unsigned. Adjust related types for consistency. Reviewed by: mckusick (previous version) MFC after: 3 weeks	2013-10-24 00:33:29 +00:00
Brooks Davis	cf058082cd	Allow kernels without options SOFTUPDATES to build. This should fix the embedded tinderboxes. Reviewed by: emaste	2013-10-21 20:51:08 +00:00
Kirk McKusick	07599ccb28	Fix build problem on ARM (which defaults to building without soft updates). Reported by: Tinderbox Sponsored by: Netflix	2013-10-21 13:09:09 +00:00
Kirk McKusick	58941b9f15	Restructuring of the soft updates code to set it up so that the single kernel-wide soft update lock can be replaced with a per-filesystem soft-updates lock. This per-filesystem lock will allow each filesystem to have its own soft-updates flushing thread rather than being limited to a single soft-updates flushing thread for the entire kernel. Move soft update variables out of the ufsmount structure and into their own mount_softdeps structure referenced by ufsmount field um_softdep. Eventually the per-filesystem lock will be in this structure. For now there is simply a pointer to the kernel-wide soft updates lock. Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock pointer in the mount_softdeps structure instead of a pointer to the kernel-wide soft-updates lock. Replace the five hash tables used by soft updates with per-filesystem copies of these tables allocated in the mount_softdeps structure. Several functions that flush dependencies when too many are allocated in the kernel used to operate across all filesystems. They are now parameterized to flush dependencies from a specified filesystem. For now, we stick with the round-robin flushing strategy when the kernel as a whole has too many dependencies allocated. While there are many lines of changes, there should be no functional change in the operation of soft updates. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-21 00:28:02 +00:00
Kirk McKusick	cc76ac5a6c	Fourth of several cleanups to soft dependency implementation. Add KASSERTS that soft dependency functions only get called for filesystems running with soft dependencies. Calls to these functions when soft updates are not compiled into the system become panics. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 22:21:01 +00:00
Kirk McKusick	519e3c3b9f	Third of several cleanups to soft dependency implementation. Ensure that softdep_unmount() and softdep_setup_sbupdate() only get called for filesystems running with soft dependencies. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 21:11:40 +00:00
Kirk McKusick	90a306d8af	Second of several cleanups to soft dependency implementation. Delete two unused functions in ffs_softdep.c. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 20:52:07 +00:00
Kirk McKusick	8850120f48	First of several cleanups to soft dependency implementation. Convert three functions exported from ffs_softdep.c to static functions as they are not used outside of ffs_softdep.c. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 20:41:38 +00:00
Pedro F. Giffuni	b876c780d0	Make di_blocks unsigned in UFS1 as is the case already for UFS2. Most of the code between UFS1 and UFS2 is shared so this change is pretty safe. Not only this makes UFS1 and 2 consistent but it also matches what NetBSD and MacOS X have for some years now. Reviewed by: mckusick MFC after: 1 month	2013-10-14 18:17:09 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. 
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine a few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is a new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for alias definitions. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. 
This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Kirk McKusick	2ce451f089	In looking at block layouts as part of fixing filesystem block allocations under low free-space conditions (-r254995), determine that old block-preference search order used before -r249782 worked a bit better. This change reverts to that block-preference search order. MFC after: 2 weeks	2013-08-28 17:46:32 +00:00
Kirk McKusick	28702816d8	A performance problem was reported in PR kern/181226: I have 25TB Dell PERC 6 RAID5 array. When it becomes almost full (10-20GB free), processes which write data to it start eating 100% CPU and write speed drops below 1MB/sec (normally it gives 400MB/sec). The revision at which it first became apparent was http://svnweb.freebsd.org/changeset/base/249782. The offending change reserved an area in each cylinder group to store metadata. The new algorithm attempts to save this area for metadata and allows its use for non-metadata only after all the data areas have been exhausted. The size of the reserved area defaults to half of minfree, so the filesystem reports full before the data area can completely fill. However, in this report, the filesystem has had minfree reduced to 1% thus forcing the metadata area to be used for data. As the filesystem approached full, it had only metadata areas left to allocate. The result was that every block allocation had to scan summary data for 30,000 cylinder groups before falling back to searching up to 30,000 metadata areas. The fix is to give up on saving the metadata areas once the free space reserve drops below 2%. The effect of this change is to use the old algorithm of just accepting the first available block that we find. Since most filesystems use the default 5% minfree, this will have no effect on their operation. For those that want to push to the limit, they will get their crappy block placements quickly. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko PR: kern/181226 MFC after: 2 weeks	2013-08-28 17:38:05 +00:00
Ivan Voras	60dd37465a	Take a very small step toward the Century of the Anchovy by increasing the time dirhash entries stay in memory before being considered for eviction to 1 minute.	2013-08-28 10:06:20 +00:00
Kenneth D. Merry	7da1a731c6	Expand the use of stat(2) flags to allow storing some Windows/DOS and CIFS file attributes as BSD stat(2) flags. This work is intended to be compatible with ZFS, the Solaris CIFS server's interaction with ZFS, somewhat compatible with MacOS X, and of course compatible with Windows. The Windows attributes that are implemented were chosen based on the attributes that ZFS already supports. The summary of the flags is as follows: UF_SYSTEM: Command line name: "system" or "usystem" ZFS name: XAT_SYSTEM, ZFS_SYSTEM Windows: FILE_ATTRIBUTE_SYSTEM This flag means that the file is used by the operating system. FreeBSD does not enforce any special handling when this flag is set. UF_SPARSE: Command line name: "sparse" or "usparse" ZFS name: XAT_SPARSE, ZFS_SPARSE Windows: FILE_ATTRIBUTE_SPARSE_FILE This flag means that the file is sparse. Although ZFS may modify this in some situations, there is not generally any special handling for this flag. UF_OFFLINE: Command line name: "offline" or "uoffline" ZFS name: XAT_OFFLINE, ZFS_OFFLINE Windows: FILE_ATTRIBUTE_OFFLINE This flag means that the file has been moved to offline storage. FreeBSD does not have any special handling for this flag. UF_REPARSE: Command line name: "reparse" or "ureparse" ZFS name: XAT_REPARSE, ZFS_REPARSE Windows: FILE_ATTRIBUTE_REPARSE_POINT This flag means that the file is a Windows reparse point. ZFS has special handling code for reparse points, but we don't currently have the other supporting infrastructure for them. UF_HIDDEN: Command line name: "hidden" or "uhidden" ZFS name: XAT_HIDDEN, ZFS_HIDDEN Windows: FILE_ATTRIBUTE_HIDDEN This flag means that the file may be excluded from a directory listing if the application honors it. FreeBSD has no special handling for this flag. The name and bit definition for UF_HIDDEN are identical to the definition in MacOS X. 
UF_READONLY: Command line name: "urdonly", "rdonly", "readonly" ZFS name: XAT_READONLY, ZFS_READONLY Windows: FILE_ATTRIBUTE_READONLY This flag means that the file may not be written or appended, but its attributes may be changed. ZFS currently enforces this flag, but Illumos developers have discussed disabling enforcement. The behavior of this flag is different than MacOS X. MacOS X uses UF_IMMUTABLE to represent the DOS readonly permission, but that flag has a stronger meaning than the semantics of DOS readonly permissions. UF_ARCHIVE: Command line name: "uarch", "uarchive" ZFS name: XAT_ARCHIVE, ZFS_ARCHIVE Windows: FILE_ATTRIBUTE_ARCHIVE The UF_ARCHIVE flag means that the file has changed and needs to be archived. The meaning is the same as the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute. msdosfs and ZFS have special handling for this flag, i.e. they will set it when the file changes. sys/param.h: Bump __FreeBSD_version to 1000047 for the addition of new stat(2) flags. chflags.1: Document the new command line flag names (e.g. "system", "hidden") available to the user. ls.1: Reference chflags(1) for a list of file flags and their meanings. strtofflags.c: Implement the mapping between the new command line flag names and new stat(2) flags. chflags.2: Document all of the new stat(2) flags, and explain the intended behavior in a little more detail. Explain how they map to Windows file attributes. Different filesystems behave differently with respect to flags, so warn the application developer to take care when using them. zfs_vnops.c: Add support for getting and setting the UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN, UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags. All of these flags are implemented using attributes that ZFS already supports, so the on-disk format has not changed. ZFS currently doesn't allow setting the UF_REPARSE flag, and we don't really have the other infrastructure to support reparse points. 
msdosfs_denode.c, msdosfs_vnops.c: Add support for getting and setting UF_HIDDEN, UF_SYSTEM and UF_READONLY in MSDOSFS. It supported SF_ARCHIVED, but this has been changed to be UF_ARCHIVE, which has the same semantics as the DOS archive attribute instead of inverse semantics like SF_ARCHIVED. After discussion with Bruce Evans, change several things in the msdosfs behavior: Use UF_READONLY to indicate whether a file is writeable instead of file permissions, but don't actually enforce it. Refuse to change attributes on the root directory, because it is special in FAT filesystems, but allow most other attribute changes on directories. Don't set the archive attribute on a directory when its modification time is updated. Windows and DOS don't set the archive attribute in that scenario, so we are now bug-for-bug compatible. smbfs_node.c, smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM, UF_READONLY and UF_ARCHIVE in SMBFS. This is similar to changes that Apple has made in their version of SMBFS (as of smb-583.8, posted on opensource.apple.com), but not quite the same. We map SMB_FA_READONLY to UF_READONLY, because UF_READONLY is intended to match the semantics of the DOS readonly flag. The MacOS X code maps both UF_IMMUTABLE and SF_IMMUTABLE to SMB_FA_READONLY, but the immutable flags have stronger meaning than the DOS readonly bit. stat.h: Add definitions for UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY and UF_HIDDEN. The definition of UF_HIDDEN is the same as the MacOS X definition. Add commented-out definitions of UF_COMPRESSED and UF_TRACKED. They are defined in MacOS X (as of 10.8.2), but we do not implement them (yet). ufs_vnops.c: Add support for getting and setting UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY, UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS. Alphabetize the flags that are supported. These new flags are only stored, UFS does not take any action if the flag is set. 
Sponsored by: Spectra Logic Reviewed by: bde (earlier version)	2013-08-21 23:04:48 +00:00
Kirk McKusick	824009a16a	This bug fix is in a code path in rename taken when there is a collision between a rename and an open system call for the same target file. Here, rename releases its vnode references, waits for the open to finish, and then restarts by reacquiring its needed vnode locks. In this case, rename was unlocking but failing to release its reference to one of its held vnodes. The effect was that even after all the actual references to the vnode had gone, the vnode still showed active references. For files that had been removed, their space was not reclaimed until the filesystem was forcibly unmounted. This bug manifested itself in the Postgres server which would leak/lose hundreds of files per day amounting to many gigabytes of disk space. This bug required shutting down Postgres, forcibly unmounting its filesystem, remounting its filesystem and restarting Postgres every few days to recover the lost space. Reported by: Dan Thomas and Palle Girgensohn Bug-fix by: kib Tested by: Dan Thomas and Palle Girgensohn MFC after: 2 weeks	2013-08-06 16:50:05 +00:00
Kirk McKusick	8cf85cf292	With the addition of journalled soft updates, the "newblk" structures persist much longer than previously. Historically we had at most 100 entries; now the count may reach a million. With the increased count we spent far too much time looking them up in the grossly undersized newblk hash table. Configure the newblk hash table to accurately reflect the number of entries that it must index. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2013-08-05 22:02:45 +00:00
Kirk McKusick	57591d8e78	To better understand performance problems with journalled soft updates, we need to collect the highest level of allocation for each of the different soft update dependency structures. This change collects these statistics and makes them available using `sysctl debug.softdep.highuse'. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2013-08-05 22:01:16 +00:00
Kirk McKusick	fdd9a6f478	Update to comments describing block allocation policy. Submitted by: Bruce Evans	2013-07-14 18:44:33 +00:00
Konstantin Belousov	2aea094f65	Only copy as many bytes as there are in the superblock, instead of the full block copy, when copying the superblock into the snapshot. UFS1 does not align superblock on the block boundary, and bcopy runs off the end of the buffer. Reported by: Andre Albsmeier <Andre.Albsmeier@siemens.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-12 18:52:33 +00:00
Pedro F. Giffuni	53aa3d1a99	Change i_gen in UFS to an unsigned type. Missing type change from r252435. This fixes a "Stale NFS file handle" error. Reported by: Claude Bisson Tested by: Claude Bisson Pointed hat: pfg	2013-07-10 18:19:48 +00:00
Konstantin Belousov	cc3d8c35f5	There are several code sequences like vfs_busy(mp); vfs_write_suspend(mp); which are problematic if another thread starts unmount between the two calls. The unmount starts a write, while vfs_write_suspend() drains writers. On the other hand, unmount drains busy references, causing the deadlock. Add a flag argument to vfs_write_suspend and require the callers of it to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the mount path, i.e. the covered vnode is not locked. The suspension is not attempted if VS_SKIP_UNMOUNT is specified and unmount is in progress. Reported and tested by: Andreas Longwitz <longwitz@incore.de> Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2013-07-09 20:49:32 +00:00
Kirk McKusick	3f8db5b147	Make better use of metadata area by avoiding using it for data blocks that should no longer immediately follow their indirect blocks. MFC after: 2 weeks	2013-07-02 21:07:08 +00:00
Pedro F. Giffuni	5f3a9c4000	Style fix: spaces. Cleanup the incomplete revert. Reported by: bde MFC after: 4 weeks	2013-07-02 18:45:37 +00:00
Pedro F. Giffuni	9a0aea4625	Change i_gen in UFS to an unsigned type. Revert the simplification of the i_gen calculation. It is still a good idea to avoid zero values and for the case of old filesystems there is probably no advantage in using the complete 32 bits anyways. Discussed with: bde MFC after: 4 weeks	2013-07-01 21:43:40 +00:00
Pedro F. Giffuni	bcb2f550be	Change i_gen in UFS to an unsigned type. Further simplify the i_gen calculation for older disks. Having a zero here is not really a problem and this is more similar to what is done in newfs_random(). Reported by: Xin Li MFC after: 4 weeks	2013-07-01 14:49:23 +00:00
Gleb Kurtsou	e5a5b5b64e	Don't assume that UFS on-disk format of a directory is the same as defined by <sys/dirent.h> Always start parsing at DIRBLKSIZ aligned offset, skip first entries if uio_offset is not DIRBLKSIZ aligned. Return EINVAL if buffer is too small for single entry. Preallocate buffer for cookies. Cookies will be replaced with d_off field in struct dirent at later point. Skip entries with zero inode number. Stop mangling dirent in ufs_extattr_iterate_directory(). Reviewed by: kib Sponsored by: Google Summer Of Code 2011	2013-07-01 04:06:40 +00:00
Pedro F. Giffuni	60f7df4c1b	Change i_gen in UFS to an unsigned type. Missed format specifier. Reported by: mdf MFC after: 4 weeks	2013-07-01 03:31:19 +00:00
Pedro F. Giffuni	eee4072f13	Change i_gen in UFS to an unsigned type. In UFS, i_gen is a random generated value and there is no way for it to be negative. Actually, the value of i_gen is just used to match bit patterns and it is of no consequence if the values are signed or not. Following other filesystems, set it to unsigned and use it as such. Discussed by: mckusick Reviewed by: mckusick (previous version) MFC after: 4 weeks	2013-07-01 03:00:15 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Kirk McKusick	97371fa56d	Properly spell sentinel (missed in 250891) No functional changes. Spotted by: Navdeep Parhar and Alexey Dokuchaev MFC after: 2 weeks	2013-05-22 05:07:55 +00:00
Kirk McKusick	b1bd9340fa	Add missing buffer releases (brelse) after bread calls that return an error. One could argue that returning a buffer even when it is not valid is incorrect, but bread has always returned a buffer valid or not. Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:57:22 +00:00
Kirk McKusick	21844a3d5d	Add missing 28th element to softdep types name array. Found by: Coverity Scan, CID 1007621 Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:48:24 +00:00
Kirk McKusick	d80dbbdb4a	Null a pointer after it is freed so that when it is returned the return value is NULL. Based on the returned flags, the return value should never be inspected in the case where NULL is returned, but it is good coding practice not to return a pointer to freed memory. Found by: Coverity Scan, CID 1006096 Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:40:26 +00:00
Kirk McKusick	64e2b0887c	Remove a bogus check for a NULL buffer pointer. Add a KASSERT that it is not NULL. Found by: Coverity Scan, CID 1009114 Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:30:34 +00:00
Kirk McKusick	13e369a747	Properly spell sentinel (not sintenel or sentinal). No functional changes. Spotted by: kib MFC after: 2 weeks	2013-05-22 00:17:50 +00:00
Eitan Adler	a164074fc4	Fix several typos PR: kern/176054 Submitted by: Christoph Mallon <christoph.mallon@gmx.de> MFC after: 3 days	2013-05-12 16:43:26 +00:00
Gabor Kovesdan	ab3f6b347e	- Correct misspellings of the word occurrence Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)	2013-04-17 11:40:10 +00:00
Jeff Roberson	26089666b6	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
Kirk McKusick	cd861931e6	The code in clear_remove() and clear_inodedeps() skips one entry in the pagedep and inodedep hash tables. An entry in the table is skipped because 'pagedep_hash' and 'inodedep_hash' hold the size of the hash tables - 1. The chance that this would have any operational failure is extremely unlikely. These functions only need to find a single entry and are only called when there are too many entries. The chance that they would fail because all the entries are on the single skipped hash chain is remote. Submitted by: Pedro Martelletto Reviewed by: kib MFC after: 2 weeks	2013-04-03 19:26:32 +00:00
Kirk McKusick	baa12a84a7	The purpose of this change to the FFS layout policy is to reduce the running time for a full fsck. It also reduces the random access time for large files and speeds the traversal time for directory tree walks. The key idea is to reserve a small area in each cylinder group immediately following the inode blocks for the use of metadata, specifically indirect blocks and directory contents. The new policy is to preferentially place metadata in the metadata area and everything else in the blocks that follow the metadata area. The size of this area can be set when creating a filesystem using newfs(8) or changed in an existing filesystem using tunefs(8). Both utilities use the `-k held-for-metadata-blocks' option to specify the amount of space to be held for metadata blocks in each cylinder group. By default, newfs(8) sets this area to half of minfree (typically 4% of the data area). This work was inspired by a paper presented at Usenix's FAST '13: www.usenix.org/conference/fast13/ffsck-fast-file-system-checker Details of this implementation appear in the April 2013 issue of ;login: www.usenix.org/publications/login/april-2013-volume-38-number-2. A copy of the April 2013 ;login: paper can also be downloaded from: www.mckusick.com/publications/faster_fsck.pdf. Reviewed by: kib Tested by: Peter Holm MFC after: 4 weeks	2013-03-22 21:45:28 +00:00
Kirk McKusick	3289d5877a	When renaming a directory from one parent directory to another, we need to call ufs_checkpath() to walk from our new location to the root of the filesystem to ensure that we do not encounter ourselves along the way. Until now, we accomplished this by reading the ".." entries of each directory in our path until we reached the root (or encountered an error). This change tries to avoid the I/O of reading the ".." entries by first looking them up in the name cache and only doing the I/O when the name cache lookup fails. Reviewed by: kib Tested by: Peter Holm MFC after: 4 weeks	2013-03-20 17:57:00 +00:00
Konstantin Belousov	59a01b70af	UFS support of the unmapped i/o for the user data buffers. Sponsored by: The FreeBSD Foundation Tested by: pho, scottl, jhb, bf	2013-03-19 15:08:15 +00:00
Konstantin Belousov	e81ff91e62	Do not remap usermode pages into KVA for physio. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:43:57 +00:00
Konstantin Belousov	0d3bb4afa8	Remove negative name cache entry pointing to the target name, which could be instantiated while tdvp was unlocked. Reported by: Rick Miller <vmiller at hostileadmin com> Tested by: pho MFC after: 1 week	2013-03-17 15:11:37 +00:00
Konstantin Belousov	70e198dd07	Some style fixes. Sponsored by: The FreeBSD Foundation	2013-03-14 20:31:39 +00:00
Konstantin Belousov	c535690b33	Add currently unused flag argument to the cluster_read(), cluster_write() and cluster_wbuild() functions. The flags to be allowed are a subset of the GB_* flags for getblk(). Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-14 20:28:26 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but a few notes are reported: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Konstantin Belousov	ba05dec5a4	The softdep freeblks workitem might hold a reference on the dquot. Current dqflush() panics when a dquot with non-zero refcount is encountered. The situation is possible, because quotas are turned off before softdep workitem queue is flushed, because the quota file writes might create softdep workitems. Make encountering an active dquot in dqflush() not fatal, return the error from quotaoff() instead. Ignore the quotaoff() failures when ffs_flushfiles() is called in the course of softdep_flushfiles() loop, until the last iteration. At the last loop, the quotas must be closed, and because SU workitems should be already flushed, the references to dquot are gone. Sponsored by: The FreeBSD Foundation Reported and tested by: pho Reviewed by: mckusick MFC after: 2 weeks	2013-02-27 07:32:39 +00:00
Konstantin Belousov	94f4ac214c	An inode block must not be blockingly read while cg block is owned. The order is inode buffer lock -> snaplk -> cg buffer lock, reversing the order causes deadlocks. Inode block must not be written while cg block buffer is owned. The FFS copy on write needs to allocate a block to copy the content of the inode block, and the cylinder group selected for the allocation might be the same as the owned cg block. The reserved block detection code in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the situation, because the locked cg buffer is not exposed to it. In order to maintain the dependency between initialized inode block and the cg_initediblk pointer, look up the inode buffer in non-blocking mode. If succeeded, brelse cg block, initialize the inode block and write it. After the write is finished, reread cg block and update the cg_initediblk. If inode block is already locked by another thread, let the other thread initialize it. If another thread raced with us after we started writing inode block, the situation is detected by an update of cg_initediblk. Note that double-initialization of the inode block is harmless, the block cannot be used until cg_initediblk is incremented. Sponsored by: The FreeBSD Foundation In collaboration with: pho Reviewed by: mckusick MFC after: 1 month X-MFC-note: after r246877	2013-02-27 07:31:23 +00:00
Kirk McKusick	7839b23f00	The UFS2 filesystem allocates new blocks of inodes as they are needed. When a cylinder group runs short of inodes, a new block for inodes is allocated, zero'ed, and written to the disk. The zero'ed inodes must be on the disk before the cylinder group can be updated to claim them. If the cylinder group claiming the new inodes were written before the zero'ed block of inodes, the system could crash with the filesystem in an unrecoverable state. Rather than adding a soft updates dependency to ensure that the new inode block is written before it is claimed by the cylinder group map, we just do a barrier write of the zero'ed inode block to ensure that it will get written before the updated cylinder group map can be written. This change should only slow down bulk loading of newly created filesystems since that is the primary time that new inode blocks need to be created. Reported by: Robert Watson Reviewed by: kib Tested by: Peter Holm	2013-02-16 15:11:40 +00:00
Konstantin Belousov	9604a7f1b8	Fix several unsafe pointer dereferences in the buffered_write() function, implementing the sysctl vfs.ffs.set_bufoutput (not used in the tree yet). - The current directory vnode dereference is unsafe since fd_cdir could be changed and unreferenced, lock the filedesc around and vref the fd_cdir. - The VTOI() conversion of the fd_cdir is unsafe without first checking that the vnode is indeed from an FFS mount, otherwise the code dereferences random memory. - The cdir could be reclaimed from under us, lock it around the checks. - The type of the fp vnode might not be a disk, or it might have changed while the thread was in flight, check the type. Reviewed and tested by: mckusick MFC after: 2 weeks	2013-02-10 10:17:33 +00:00
Pedro F. Giffuni	a940ce65cd	Remove unused MAXSYMLINKLEN macro. Reviewed by: mckusick PR: kern/175794 MFC after: 1 week	2013-02-08 20:30:19 +00:00
Pedro F. Giffuni	be0c475e56	UFS: Remove dead assignment. Submitted by: Christoph Mallon MFC after: 3 days	2013-02-03 21:30:02 +00:00
Kirk McKusick	fe85d98a5b	For UFS2 i_blocks is unsigned. The current "sanity" check that it has gone below zero after the blocks in its inode are freed is a no-op which the compiler fails to warn about because of the use of the DIP macro. Change the sanity check to compare the number of blocks being freed against the value i_blocks. If the number of blocks being freed exceeds i_blocks, just set i_blocks to zero. Reported by: Pedro Giffuni (pfg@) MFC after: 2 weeks	2013-02-03 17:16:32 +00:00
Konstantin Belousov	ddd6b3fc33	Add flags argument to vfs_write_resume() and remove vfs_write_resume_flags(). Sponsored by: The FreeBSD Foundation	2013-01-11 06:08:32 +00:00
Konstantin Belousov	f99cb34c4f	The process_deferred_inactive() function locks the vnodes of the ufs mount, which means that it must not be called while the snaplock is owned. The vfs_write_resume(9) does call the function as the VFS_SUSP_CLEAN() method, which is too early and falls into the region still protected by snaplock. Add yet another flag for the vfs_write_resume_flags() to avoid calling suspension cleanup handler after the suspend is lifted, and use it in the ffs_snapshot() call to vfs_write_resume. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-01-01 16:14:48 +00:00
Konstantin Belousov	91e9474552	Make it possible to atomically resume writes on the mount and account the write start, by adding a variation of the vfs_write_resume(9) which accepts flags. Use the new function to prevent a deadlock between parallel suspension and snapshotting a UFS mount. The ffs_snapshot() code performed vfs_write_resume() followed by vn_start_write() while owning the snaplock. If the suspension intervenes between resume and vn_start_write(), the deadlock occurred after the suspending thread tried to lock the snaplock, most typically during the write in the ffs_copyonwrite(). Reported and tested by: Andreas Longwitz <longwitz@incore.de> Reviewed by: mckusick MFC after: 2 weeks X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC, in HEAD	2012-12-28 23:08:30 +00:00
Attilio Rao	b1308d72c2	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There is no evidence that consumers could bear such a change in semantics and this situation could finally lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week	2012-12-21 13:14:12 +00:00
Konstantin Belousov	9f37ee804a	Fix a typo, resulting in the NULL pointer dereference. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2012-12-15 02:03:59 +00:00
Attilio Rao	c6e0355cee	r16312 has not been relevant for many years (likely since VFS received granular locking) but the comment present in UFS has been incorrectly copied into other filesystems' code several times. Remove comments that make no sense now. Reviewed by: kib MFC after: 3 days	2012-11-19 22:43:45 +00:00
Edward Tomasz Napierala	213a71c68b	Fix build of kdump(1).	2012-11-18 22:03:31 +00:00
Edward Tomasz Napierala	1848286ada	Add UFS writesuspension mechanism, designed to allow userland processes to modify on-disk metadata for filesystems mounted for write. Reviewed by: kib, mckusick Sponsored by: FreeBSD Foundation	2012-11-18 18:57:19 +00:00
Jeff Roberson	ad9cdc05ba	- Fix a truncation bug with softdep journaling that could leak blocks on crash. When truncating a file that never made it to disk we use the canceled allocation dependencies to hold the journal records until the truncation completes. Previously allocdirect dependencies on the id_bufwait list were not considered and their journal space could expire before the bitmaps were written. Cancel them and attach them to the freeblks as we do for other allocdirects. - Add KTR traces that were used to debug this problem. - When adding jsegdeps, always use jwork_insert() so we don't have more than one segdep on a given jwork list. Sponsored by: EMC / Isilon Storage Division	2012-11-14 06:37:43 +00:00
Jeff Roberson	b2c29d39cd	- Fix a bug that has existed since the original softdep implementation. When a background copy of a cg is written we complete any work associated with that bmsafemap. If new work has been added to the non-background copy of the buffer it will be completed before the next write happens. The solution is to do the rollbacks when we make the copy so only those dependencies that were present at the time of writing will be completed when the background write completes. This would've resulted in various bitmap related corruptions and panics. It also would've expired journal entries early causing journal replay to miss some records. MFC after: 2 weeks	2012-11-12 19:53:55 +00:00
Attilio Rao	bc2258da88	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.	2012-11-09 18:02:25 +00:00
Jeff Roberson	53cc0bebb9	- Correct rev 242734, segments can sometimes get stuck. Be a bit more defensive with segment state. Reported by: b. f. <bf1783@googlemail.com>	2012-11-09 04:04:25 +00:00
Jeff Roberson	40b43503c0	- Implement BIO_FLUSH support around journal entries. This will not 100% solve power loss problems with dishonest write caches. However, it should improve the situation and force a full fsck when it is unable to resolve with the journal. - Resolve a case where the journal could wrap in an unsafe way causing us to prematurely lose journal entries in very specific scenarios. Discussed with: mckusick MFC after: 1 month	2012-11-08 01:41:04 +00:00
Kirk McKusick	aa7ddc85c7	When a file is first being written, the dynamic block reallocation (implemented by ffs_reallocblks_ufs[12]) relocates the file's blocks so as to cluster them together into a contiguous set of blocks on the disk. When the cluster crosses the boundary into the first indirect block, the first indirect block is initially allocated in a position immediately following the last direct block. Block reallocation would usually destroy locality by moving the indirect block out of the way to keep the data blocks contiguous. This change compensates for this problem by noting that the first indirect block should be left immediately following the last direct block. It then tries to start a new cluster of contiguous blocks (referenced by the indirect block) immediately following the indirect block. We should also do this for other indirect block boundaries, but it is only important for the first one. Suggested by: Bruce Evans MFC: 2 weeks	2012-11-03 18:55:55 +00:00
Jeff Roberson	6d95eb4c5f	- In cancel_mkdir_dotdot don't panic if the inodedep is not available. If the previous diradd had already finished it could have been reclaimed already. This would only happen under heavy dependency pressure. Reported by: Andrey Zonov <zont@FreeBSD.org> Discussed with: mckusick MFC after: 1 week	2012-11-02 21:04:06 +00:00
Konstantin Belousov	140dedb81c	The r241025 fixed the case when a binary, executed from nullfs mount, was still possible to open for write from the lower filesystem. There is a symmetric situation where the binary could already have file descriptors opened for write, but it can be executed from the nullfs overlay. Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks. Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Calling the VOPs instead of directly accessing v_writecount provides the fix described in the previous paragraph. Tested by: pho MFC after: 3 weeks	2012-11-02 13:56:36 +00:00
Edward Tomasz Napierala	549f62fa42	Fix problem with geom_label(4) not recognizing UFS labels on filesystems extended using growfs(8). The problem here is that geom_label checks if the filesystem size recorded in UFS superblock is equal to the provider (i.e. device) size. This check cannot be removed due to backward compatibility. On the other hand, in most cases growfs(8) cannot set fs_size in the superblock to match the provider size, because, differently from newfs(8), it cannot recompute cylinder group sizes. To fix this problem, add another superblock field, fs_providersize, used only for this purpose. The geom_label(4) will attach if either fs_size (filesystem created with newfs(8)) or fs_providersize (filesystem expanded using growfs(8)) matches the device size. PR: kern/165962 Reviewed by: mckusick Sponsored by: FreeBSD Foundation	2012-10-30 21:32:10 +00:00
Edward Tomasz Napierala	f1988d463c	Fix two problems that caused instant panic when the device mounted with softupdates went away. Note that this does not fix the problem entirely; I'm committing it now to make it easier for someone to pick up the work. Reviewed by: mckusick	2012-10-28 18:53:28 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Matthew D Fleming	fc8fdae0df	Fix up kernel sources to be ready for a 64-bit ino_t. Original code by: Gleb Kurtsou	2012-09-27 23:30:49 +00:00
Mateusz Guzik	1ec9bedabe	Remove unused member of struct indir (in_exists) from UFS and EXT2 code. Reviewed by: mckusick Approved by: trasz (mentor) MFC after: 1 week	2012-08-17 17:45:27 +00:00
Konstantin Belousov	1c771f9222	After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull vm_param.h was removed. The other big dependency of vm_page.h on vm_param.h is the PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks	2012-08-05 14:11:42 +00:00
Kevin Lo	f7a3729c91	Use NULL instead of 0 for pointers	2012-07-22 15:40:31 +00:00
Konstantin Belousov	c5c1199c83	Extend the KPI to lock and unlock f_offset member of struct file. It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is a 64bit quantity, is always read and written under the mtxpool protection. This fixes an apparently easy to trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks	2012-07-02 21:01:03 +00:00
Konstantin Belousov	7aac7bc18a	Fix unbounded-length malloc, controlled from usermode. The added check is performed before exact size of the buffer is calculated, but the buffer cannot have size greater than the total space allocated for extended attributes. The existing check is executing with precise size, but it is too late, since buffer needs to be allocated in advance. Also, adapt to uio_resid being of ssize_t type. Use lblktosize instead of multiplying by fs block size by hand as well. Reported and tested by: pho MFC after: 1 week	2012-06-21 09:20:07 +00:00
Kirk McKusick	aa445c9d7c	In softdep_setup_inomapdep() we may have to allocate both inodedep and bmsafemap dependency structures in inodedep_lookup() and bmsafemap_lookup() respectively. The setup of these structures must be done while holding the soft-dependency mutex. If the inodedep is allocated first, it may be freed in the I/O completion callback when the mutex is released to allocate the bmsafemap. If the bmsafemap is allocated first, it may be freed in the I/O completion callback when the mutex is released to allocate the inodedep. To resolve this problem, bmsafemap_lookup has had a parameter added that allows a pre-malloc'ed bmsafemap to be passed in so that it does not need to release the mutex to create a new bmsafemap. The softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency before acquiring the mutex and starting to build the inodedep with a call to inodedep_lookup(). The subsequent call to bmsafemap_lookup() is passed this pre-allocated bmsafemap entry so that it need not release the mutex if it needs to create a new one. Reported by: Peter Holm Tested by: Peter Holm MFC after: 1 week	2012-06-11 23:07:21 +00:00
Konstantin Belousov	b569050a78	Enable vn_io_fault() lock avoidance for UFS. Tested by: pho MFC after: 2 months	2012-05-30 16:45:41 +00:00
Konstantin Belousov	6ee10a96c0	Implement SEEK_HOLE/SEEK_DATA for UFS. MFC after: 2 weeks	2012-05-26 05:29:53 +00:00
Kirk McKusick	8b6207110d	Add missing `continue' statement at end of case. Found by: Kevin Lo (kevlo@) MFC after: 1 week	2012-05-18 15:20:21 +00:00
Edward Tomasz Napierala	06c65b6b41	Remove unused thread argument from ufs_extattr_uepm_lock()/ufs_extattr_uepm_unlock().	2012-04-23 17:56:35 +00:00
Edward Tomasz Napierala	05cc75de83	Fix build.	2012-04-23 17:54:49 +00:00
Edward Tomasz Napierala	26621e1f06	Remove unused thread argument from clear_inodedeps() and clear_remove().	2012-04-23 14:44:18 +00:00
Edward Tomasz Napierala	af6e6b87ad	Remove unused thread argument to vrecycle(). Reviewed by: kib	2012-04-23 14:10:34 +00:00
Edward Tomasz Napierala	c52fd858ae	Remove unused thread argument from vtruncbuf(). Reviewed by: kib	2012-04-23 13:21:28 +00:00
Edward Tomasz Napierala	72b8ff1c74	Fix use-after-free introduced in r234036. Reviewed by: mckusick Tested by: pho	2012-04-21 10:45:46 +00:00
Kirk McKusick	dca5e0ec50	This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point to replace MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync routines. The vfs_msync routine is run every 30 seconds for every writably mounted filesystem. It ensures that any files mmap'ed from the filesystem with modified pages have those pages queued to be written back to the file from which they are mapped. The ffs_sync_lazy and qsync routines are run every 30 seconds for every writably mounted UFS/FFS filesystem. The ffs_sync_lazy routine ensures that any files that have been accessed in the previous 30 seconds have had their access times queued for updating in the filesystem. The qsync routine ensures that any files with modified quotas have those quotas queued to be written back to their associated quota file. In a system configured with 250,000 vnodes, less than 1000 are typically active at any point in time. Prior to this change all 250,000 vnodes would be locked and inspected twice every minute by the syncer. For UFS/FFS filesystems they would be locked and inspected six times every minute (twice by each of these three routines since each of these routines does its own pass over the vnodes associated with a mount point). With this change the syncer now locks and inspects only the tiny set of vnodes that are active. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2012-04-20 07:00:28 +00:00
Jaakko Heinonen	808dd116cb	The part about exec atime no longer applies in the comment. Pointed out by: bde	2012-04-18 15:19:00 +00:00
Kirk McKusick	71469bb38f	Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL. The primary changes are that the user of the interface no longer needs to manage the mount-mutex locking and that the vnode that is returned has its mutex locked (thus avoiding the need to check to see if it is DOOMED or other possible end of life scenarios). To minimize compatibility issues for third-party developers, the old MNT_VNODE_FOREACH interface will remain available so that this change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH will be removed in head. The reason for this update is to prepare for the addition of the MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2012-04-17 16:28:22 +00:00
Kirk McKusick	ecb6e528c5	Export vinactive() from kern/vfs_subr.c (e.g., make it no longer static and declare its prototype in sys/vnode.h) so that it can be called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c) instead of the body of vinactive() being cut and pasted into process_deferred_inactive(). Reviewed by: kib MFC after: 2 weeks	2012-04-11 23:01:11 +00:00
Jaakko Heinonen	fce74feae1	- Return EPERM from ufs_setattr() when a user without PRIV_VFS_SYSFLAGS privilege attempts to toggle SF_SETTABLE flags. - Use the '^' operator in the SF_SNAPSHOT anti-toggling check. Flags are now stored to ip->i_flags in one place after all checks. Submitted by: bde	2012-04-10 15:59:37 +00:00
Edward Tomasz Napierala	2b028c25d3	Fix panic in ffs_reload(), which may happen when read-only filesystem gets resized and then reloaded. Reviewed by: kib, mckusick (earlier version) Sponsored by: The FreeBSD Foundation	2012-04-08 13:44:55 +00:00
Kirk McKusick	b73ffa31d4	Drop an unnecessary setting of si_mountpt when updating a UFS mount point. Clearly it must have been set when the mount was done. Reviewed by: kib	2012-04-08 06:14:49 +00:00
Jaakko Heinonen	63dfd849a1	Add a check for unsupported file flags to ufs_setattr(). Discussed with: bde MFC after: 2 weeks	2012-04-04 14:50:21 +00:00
Kirk McKusick	23d6e518da	A file cannot be deallocated until its last name has been removed and it is no longer referenced by a user process. The inode for a file whose name has been removed, but is still referenced at the time of a crash will still be allocated in the filesystem, but will have no references (i.e., they will have no names referencing them from any directory). With traditional soft updates these unreferenced inodes will be found and reclaimed when the background fsck is run. When using journaled soft updates, the kernel must keep track of these inodes so that it can find and reclaim them during the cleanup process. Their existence cannot be stored in the journal as the journal only handles short-term events, and they may persist for days. So, they are tracked by keeping them in a linked list whose head pointer is stored in the superblock. The journal tracks them only until their linked list pointers have been committed to disk. Part of the cleanup process involves traversing the list of unreferenced inodes and reclaiming them. This bug was triggered when confusion arose in the commit steps of keeping the unreferenced-inode linked list coherent on disk. Notably, a race between the link() system call adding a link-count to a file and the unlink() system call removing a link-count from the file. Here if the unlink() ran after link() had looked up the file but before link() had incremented the link-count of the file, the file's link-count would drop to zero before the link() incremented it back up to one. If the file was referenced by a user process, the first transition through zero made it appear that it should be added to the unreferenced-inode list when in fact it should not have been added. If the new name created by link() was deleted within a few seconds (with the file still referenced by a user process) it would legitimately be a candidate for addition to the unreferenced-inode list.
The result was that there were two attempts to add the same inode to the unreferenced-inode list which scrambled the unreferenced-inode list's pointers leading to a panic. The fix is to detect and avoid the false attempt at adding it to the unreferenced-inode list by having the link() system call check to see if the link count is zero before it increments it. If it is, the link() fails with ENOENT (showing that it has failed the link()/unlink() race). While tracking down this bug, we have added additional assertions to detect the problem sooner and also simplified some of the code. Reported by: Kirk Russell Fix submitted by: Jeff Roberson Tested by: Peter Holm PR: kern/159971 MFC (to 9 only): 2 weeks	2012-04-02 21:58:37 +00:00
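The check described in 23d6e518da can be modelled with a minimal user-space sketch. The struct and function names below are hypothetical and there is no locking; it only shows the rule that link() fails with ENOENT rather than raising a link count that has already reached zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; only the link count matters here. */
struct toy_inode {
	int nlink;	/* number of directory names referencing the file */
};

/*
 * Model of the fix: refuse to add a name to a file whose link count has
 * already dropped to zero. Returns 0 on success, or ENOENT when the
 * link()/unlink() race was lost.
 */
int
toy_link(struct toy_inode *ip)
{
	assert(ip != NULL && ip->nlink >= 0);
	if (ip->nlink == 0)
		return (ENOENT);
	ip->nlink++;
	return (0);
}

/* Model of unlink(): drop one name. */
void
toy_unlink(struct toy_inode *ip)
{
	assert(ip != NULL && ip->nlink > 0);
	ip->nlink--;
}
```

With this rule the count never goes 1 -> 0 -> 1, so a still-open file passes through zero at most once and is queued for the unreferenced-inode list at most once.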
Jaakko Heinonen	fccf286d24	- Use more natural ip->i_flags instead of vap->va_flags in the final flags check. - Add a comment for the immutable/append check done after handling of the flags. - Style improvements. No functional change intended. Submitted by: bde MFC after: 2 weeks	2012-04-02 16:33:21 +00:00
Kirk McKusick	6c09f4a27c	A refinement of change 232351 to avoid a race with a forcible unmount. While we have a snapshot vnode unlocked to avoid a deadlock with another inode in the same inode block being updated, the filesystem containing it may be forcibly unmounted. When that happens the snapshot vnode is revoked. We need to check for that condition and fail appropriately. This change will be included along with 232351 when it is MFC'ed to 9. Spotted by: kib Reviewed by: kib	2012-03-28 21:21:19 +00:00
Kirk McKusick	1faacf5d09	Keep track of the mount point associated with a special device to enable the collection of counts of synchronous and asynchronous reads and writes for its associated filesystem. The counts are displayed using `mount -v'. Ensure that buffers used for paging indicate the vnode from which they are operating so that counts of paging I/O operations from the filesystem are collected. This checkin only adds the setting of the mount point for the UFS/FFS filesystem, but it would be trivial to add the setting and clearing of the mount point at filesystem mount/unmount time for other filesystems too. Reviewed by: kib	2012-03-28 20:49:11 +00:00
Konstantin Belousov	ea573a50b3	Do trivial reformatting of the comment to record the missed commit message for r233609: Restore the writes of atimes, quotas and superblock from syncer vnode. Noted by: rdivacky	2012-03-28 14:16:15 +00:00
Konstantin Belousov	a988a5c609	Reviewed by: bde, mckusick Tested by: pho MFC after: 2 weeks	2012-03-28 14:06:47 +00:00
Konstantin Belousov	64c8ead942	Microoptimize: in qsync loop over mount vnodes, only unlock mount interlock after we committed to try to vget() the vnode. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 1 week	2012-03-28 13:56:18 +00:00
Konstantin Belousov	e0c1740853	Update comment. MFC after: 3 days	2012-03-28 13:47:07 +00:00
Kirk McKusick	75a5838904	Add a third flags argument to ffs_syncvnode to avoid a possible conflict with MNT_WAIT flags that are passed in its second argument. This will be MFC'ed together with r232351. Discussed with: kib	2012-03-25 00:02:37 +00:00
Konstantin Belousov	064f517d2b	Supply boolean as the second argument to ffs_update(), and not the MNT_[NO]WAIT constants, which in fact always caused sync operation. Based on the submission by: bde Reviewed by: mckusick MFC after: 2 weeks	2012-03-13 22:04:27 +00:00
Konstantin Belousov	92ccae0399	Remove superfluous brackets. Submitted by: alc MFC after: 2 weeks	2012-03-11 21:25:42 +00:00
Konstantin Belousov	dd522d76dc	Do schedule delayed writes for async mounts. While there, make some style adjustments, like missed () around return values. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks	2012-03-11 20:26:19 +00:00
Konstantin Belousov	2fd2c0b1e3	Do not fall back to slow synchronous i/o when low on memory or buffers. The bawrite() schedules the write to happen immediately, and its use frees the current thread to do more cleanups. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks	2012-03-11 20:23:46 +00:00
Konstantin Belousov	4cd74eecda	In ffs_syncvnode(), pass boolean false as second argument of ffs_update(). Synchronous inode block update is not needed for MNT_LAZY callers (syncer), and since waitfor values are not zero, code did unnecessary synchronous update. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks	2012-03-11 20:18:14 +00:00
Konstantin Belousov	18ef3670e5	Remove not needed ARGSUSED lint command. Submitted by: bde MFC after: 3 days	2012-03-11 20:15:12 +00:00
Konstantin Belousov	b80dcb55aa	Remove fifo.h. The only used function declaration from the header is migrated to sys/vnode.h. Submitted by: gianni	2012-03-11 12:19:58 +00:00
Peter Holm	e521b5288a	Revert r232692 as the correct place to fix this is at the syscall level.	2012-03-09 17:19:50 +00:00
Konstantin Belousov	38ddb5725b	Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which allows a filesystem to request VFS to not allow MNTK_ASYNC. MFC after: 1 week	2012-03-09 00:12:05 +00:00
John Baldwin	b47f624183	Add KTR_VFS traces to track modifications to a vnode's writecount.	2012-03-08 20:27:20 +00:00
Peter Holm	80042581a5	syscall() fuzzing can trigger this panic. Return EINVAL instead. MFC after: 1 week	2012-03-08 12:49:08 +00:00
John Baldwin	58d65e8031	Similar to the fixes in 226967 and 226987, purge any name cache entries associated with the previous vnode (if any) associated with the target of a rename(). Otherwise, a lookup of the target pathname concurrent with a rename() could re-add a name cache entry after the namei(RENAME) lookup in kern_renameat() had purged the target pathname. MFC after: 2 weeks	2012-03-02 18:55:19 +00:00
Kirk McKusick	35338e6091	This change avoids a kernel deadlock on "snaplk" when using snapshots on UFS filesystems running with journaled soft updates. This is the first of several bugs that need to be fixed before removing the restriction added in -r230250 to prevent the use of snapshots on filesystems running with journaled soft updates. The deadlock occurs when holding the snapshot lock (snaplk) and then trying to flush an inode via ffs_update(). We become blocked by another process trying to flush a different inode contained in the same inode block that we need. It holds the inode block for which we are waiting locked. When it tries to write the inode block, it gets blocked waiting for our snaplk when it calls ffs_copyonwrite() to see if the inode block needs to be copied in our snapshot. The most obvious place that this deadlock arises is in the ffs_copyonwrite() routine when it updates critical metadata in a snapshot and tries to write it out before proceeding. The fix here is to write the data and indirect block pointer for the snapshot, but to skip the call to ffs_update() to write the snapshot inode. To ensure that we will never have to update a pointer in the inode itself, the ffs_snapshot() routine that creates the snapshot has to ensure that all the direct blocks are allocated as part of the creation of the snapshot. A less obvious place that this deadlock occurs is when we hold the snaplk because we are deleting a snapshot. In the course of doing the deletion, we need to allocate various soft update dependency structures and allocate some journal space. If we hit a resource limit while doing this we decrease the resources in use by flushing out an existing dirty file to get it to give up the soft dependency resources that it holds.
The flush can cause an ffs_update() to be done on the inode for the file that we have selected to flush resulting in the same deadlock as described above when the inode that we have chosen to flush resides in the same inode block as the snapshot inode that we hold. The fix is to defer cleaning up any time that the inode on which we are operating is a snapshot. Help and review by: Jeff Roberson Tested by: Peter Holm MFC (to 9 only) after: 2 weeks	2012-03-01 18:45:25 +00:00
Konstantin Belousov	2b19466836	Properly lock DQREF() with dqhlock. Missed locking caused counter corruption. Assert that the dq reference value is sane before decrementing it. Reported and tested by: pho MFC after: 1 week	2012-02-22 20:03:51 +00:00
Konstantin Belousov	526d0bd547	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows SSIZE_MAX-sized i/o requests from usermode. Discussed with: bde, das (previous versions) MFC after: 1 month	2012-02-21 01:05:12 +00:00
Kirk McKusick	e8e848ef8e	Missing conditions in checking whether an inode has been written. Found and tested by: Peter Holm MFC after: 2 weeks (to 9 only)	2012-02-13 01:33:39 +00:00
Kirk McKusick	5ecba76999	Historically when an application wrote an entire block of a file, the kernel allocated a buffer but did not zero it as it was about to be completely filled by a uiomove() from the user's buffer. However, if the uiomove() failed, the old contents of the buffer could be exposed especially if the file was being mmap'ed. The fix was to always zero the buffer when it was allocated. This change first attempts the uiomove() to the newly allocated (and dirty) buffer and only zeros it if the uiomove() fails. The effect is to eliminate the gratuitous zeroing of the buffer in the usual case where the uiomove() successfully fills it. Reviewed by: kib Tested by: scottl MFC after: 2 weeks (to 9 only)	2012-02-09 22:34:16 +00:00
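The ordering described in 5ecba76999 (copy first, zero only on failure) can be sketched in user space. toy_copyin() is a hypothetical stand-in for uiomove() that can be told to fail part-way; nothing here is the kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for uiomove(): copies len bytes from src to dst,
 * but fails with EFAULT after fail_after bytes to simulate a bad user
 * address. Pass fail_after >= len for a copy that succeeds.
 */
static int
toy_copyin(unsigned char *dst, const unsigned char *src, size_t len,
    size_t fail_after)
{
	size_t n = len < fail_after ? len : fail_after;

	memcpy(dst, src, n);
	return (n < len ? EFAULT : 0);
}

/*
 * Fill a freshly allocated (and possibly stale) block buffer. The buffer
 * is zeroed only when the copy fails, so the common successful case pays
 * for no extra memset().
 */
int
toy_write_full_block(unsigned char *buf, size_t bufsize,
    const unsigned char *src, size_t fail_after)
{
	int error;

	error = toy_copyin(buf, src, bufsize, fail_after);
	if (error != 0)
		memset(buf, 0, bufsize);	/* hide stale contents */
	return (error);
}
```

In this model a failed copy leaves the buffer entirely zeroed, including any bytes that were partially copied, so no stale contents remain visible.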
Kirk McKusick	19c87af0fd	In the original days of BSD, a sync was issued on every filesystem every 30 seconds. This spike in I/O caused the system to pause every 30 seconds which was quite annoying. So, the way that sync worked was changed so that when a vnode was first dirtied, it was put on a 30-second cleaning queue (see the syncer_workitem_pending queues in kern/vfs_subr.c). If the file has not been written or deleted after 30 seconds, the syncer pushes it out. As the syncer runs once per second, dirty files are trickled out slowly over the 30-second period instead of all at once by a call to sync(2). The one drawback to this is that it does not cover the filesystem metadata. To handle the metadata, vfs_allocate_syncvnode() is called to create a "filesystem syncer vnode" at mount time which cycles around the cleaning queue being sync'ed every 30 seconds. In the original design, the only things it would sync for UFS were the filesystem metadata: inode blocks, cylinder group bitmaps, and the superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from which the filesystem is mounted). Somewhere in its path to integration with FreeBSD the flushing of the filesystem syncer vnode got changed to sync every vnode associated with the filesystem. The result of this change is to return to the old filesystem-wide flush every 30-seconds behavior and makes the whole 30-second delay per vnode useless. This change goes back to the originally intended trickle out sync behavior. Key to ensuring that all the intended semantics are preserved (e.g., that all inode updates get flushed within a bounded period of time) is that all inode modifications get pushed to their corresponding inode blocks so that the metadata flush by the filesystem syncer vnode gets them to the disk in a timely way. Thanks to Konstantin Belousov (kib@) for doing the audit and commit -r231122 which ensures that all of these updates are being made. 
Reviewed by: kib Tested by: scottl MFC after: 2 weeks	2012-02-07 20:43:28 +00:00
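The trickle-out behaviour described in 19c87af0fd can be pictured as a wheel of per-second slots. The sketch below is a toy model under that assumption; the names and the 32-slot size are invented for illustration and do not mirror the syncer_workitem_pending code in kern/vfs_subr.c.

```c
#include <assert.h>

#define TOY_SYNCER_SLOTS	32	/* must exceed the longest delay used */

struct toy_syncer {
	int pending[TOY_SYNCER_SLOTS];	/* dirty items due in each slot */
	int now;			/* slot processed most recently */
};

/* Queue one newly dirtied item to be flushed delay ticks from now. */
void
toy_mark_dirty(struct toy_syncer *sp, int delay)
{
	assert(delay > 0 && delay < TOY_SYNCER_SLOTS);
	sp->pending[(sp->now + delay) % TOY_SYNCER_SLOTS]++;
}

/*
 * Advance the wheel by one tick (one second in the commit's terms) and
 * return how many items fell due, i.e. would be pushed out now.
 */
int
toy_syncer_tick(struct toy_syncer *sp)
{
	int flushed;

	sp->now = (sp->now + 1) % TOY_SYNCER_SLOTS;
	flushed = sp->pending[sp->now];
	sp->pending[sp->now] = 0;
	return (flushed);
}
```

Items dirtied at different times land in different slots, so the flush work is spread over the period instead of arriving as one burst.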
Konstantin Belousov	36fc415d5d	Sprinkle missed calls to asynchronous UFS_UPDATE() in an attempt to guarantee that all UFS inode metadata changes result in the dirtiness of the inodeblock. Due to missed inodeblock updates, syncer was required to fsync each mount point's vnode to guarantee periodic metadata flush. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks	2012-02-07 09:51:41 +00:00
Konstantin Belousov	752a98b13e	Add missing opt_quota.h include to activate #ifdef QUOTA blocks, apparently a step in unbreaking QUOTA support. Reported and tested by: Adam Strohl <adams-freebsd ateamsystems com> MFC after: 1 week	2012-02-06 17:59:14 +00:00
Konstantin Belousov	b313a71044	JNEWBLK dependency may legitimately appear on the buf dependency list. If softdep_sync_buf() discovers such dependency, it should do nothing, which is safe as it is only waiting on the parent buffer to be written, so it can be removed. Committed on behalf of: jeff MFC after: 1 week	2012-02-06 11:47:24 +00:00
Konstantin Belousov	c480f781ea	Current implementations of sync(2) and syncer vnode fsync() VOP use the mnt_noasync counter to temporarily remove the MNTK_ASYNC mount option, which is needed to guarantee a synchronous completion of the initiated i/o before syscall or VOP return. Global removal of MNTK_ASYNC option is harmful because not only i/o started from corresponding thread becomes synchronous, but all i/o is synchronous on the filesystem which is initiated during sync(2) or syncer activity. Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local thread flag to disable async i/o for current thread only. Use the opportunity to move DOINGASYNC() macro into sys/vnode.h and consistently use it through places which tested for MNTK_ASYNC. Some testing demonstrated 60-70% improvements in run time for the metadata-intensive operations on async-mounted UFS volumes, but still with great deviation due to other reasons. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks	2012-02-06 11:04:36 +00:00
Kirk McKusick	86b571509a	There are several bugs/hangs when trying to take a snapshot on a UFS/FFS filesystem running with journaled soft updates. Until these problems have been tracked down, return ENOTSUPP when an attempt is made to take a snapshot on a filesystem running with journaled soft updates. MFC after: 2 weeks	2012-01-17 01:14:56 +00:00
Kirk McKusick	cc672d3599	Make sure all intermediate variables holding mount flags (mnt_flag) and that all internal kernel calls passing mount flags are declared as uint64_t so that flags in the top 32-bits are not lost. MFC after: 2 weeks	2012-01-17 01:08:01 +00:00
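The truncation that cc672d3599 guards against can be reproduced in a few lines. The flag name and bit position below are hypothetical; the point is only that a 64-bit flag word loses its upper half when it passes through a 32-bit intermediate variable.

```c
#include <stdint.h>

/* Hypothetical flag living in the top 32 bits of a 64-bit flag word. */
#define TOY_MNT_HIGHFLAG	(UINT64_C(1) << 40)

/* Buggy pattern: the intermediate variable is only 32 bits wide. */
uint64_t
toy_pass_narrow(uint64_t flags)
{
	uint32_t tmp = (uint32_t)flags;	/* upper 32 bits are discarded */

	return (tmp);
}

/* Fixed pattern: every intermediate is declared uint64_t. */
uint64_t
toy_pass_wide(uint64_t flags)
{
	uint64_t tmp = flags;

	return (tmp);
}
```

Passing TOY_MNT_HIGHFLAG through toy_pass_narrow() silently clears it, while toy_pass_wide() preserves it.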
Ivan Voras	d83064751a	Add a bit of verbosity to the comment.	2012-01-16 15:47:42 +00:00
Kirk McKusick	b60ee81e3d	Convert FFS mount error messages from kernel printf's to using the vfs_mount_error error message facility provided by the nmount interface. Clean up formatting of mount warnings which still need to use kernel printf's since they do not return errors. Requested by: Craig Rodrigues <rodrigc@crodrigues.org> MFC after: 2 weeks	2012-01-14 07:26:16 +00:00
Konstantin Belousov	3ab0160340	Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon(). The vfs_busy() is after covered vnode lock in the global lock order, but since quotaon() does recursive VFS call to open quota file, we usually end up locking covered vnode after mp is busied in sys_quotactl(). Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon, esp. if vfs_busy cannot succeed due to unmount being performed. Reported and tested by: pho MFC after: 1 week	2012-01-08 23:06:53 +00:00
Ed Schouten	8f8d30274a	Migrate ufs and ext2fs from skpc() to memcchr(). While there, remove a useless check from the code. memcchr() always returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff can never be equal to zero. Also, the fact that memcchr() returns a pointer instead of the number of bytes until the end, makes conversion to an offset far easier.	2012-01-01 20:47:33 +00:00
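The behaviour 8f8d30274a relies on can be shown with a user-space look-alike. toy_memcchr() below is a hypothetical reimplementation written for this example, assuming the semantics stated in the commit (return a pointer to the first byte that differs from c, or NULL if none does); it is not the kernel routine.

```c
#include <stddef.h>

/*
 * Return a pointer to the first byte in [begin, begin + n) that is NOT
 * equal to c, or NULL if every byte equals c.
 */
const void *
toy_memcchr(const void *begin, int c, size_t n)
{
	const unsigned char *p = begin;
	const unsigned char *end = p + n;

	for (; p < end; p++)
		if (*p != (unsigned char)c)
			return (p);
	return (NULL);
}

/*
 * Offset of the first bitmap byte with a free (zero) bit, or n if the
 * map is full. Because a pointer is returned, the offset is a plain
 * pointer subtraction.
 */
size_t
toy_first_nonfull(const unsigned char *map, size_t n)
{
	const unsigned char *hit = toy_memcchr(map, 0xff, n);

	return (hit == NULL ? n : (size_t)(hit - map));
}
```

Any byte returned by toy_memcchr(map, 0xff, n) differs from 0xff by construction, which is why a follow-up test of byte ^ 0xff against zero is redundant.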
Gleb Kurtsou	58b1333ae5	Use implementation independent inoNN_t scalars for on-disk UFS structures Approved by: mdf (mentor)	2011-11-09 07:48:48 +00:00
Ed Schouten	6472ac3d8a	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.	2011-11-07 15:43:11 +00:00
Ed Schouten	8f80f103b4	Remove MALLOC_DECLAREs of nonexisting malloc-pools. After careful grepping, it seems none of these pools can be found in our source tree. They are not in use, nor are they defined.	2011-11-06 20:16:50 +00:00
Peter Holm	b76b6150d5	Fix the wrong commit log message for r226967: "Added missing cache purge of from argument" and fix the comment.	2011-10-31 20:24:33 +00:00
Peter Holm	6890c3a990	The kern_renameat() looks up the fvp using the DELETE flag, which causes the removal of the name cache entry for fvp. Reported by: Anton Yuzhaninov <citrin citrin ru> In collaboration with: kib MFC after: 1 week	2011-10-31 15:01:47 +00:00
Kirk McKusick	8cd680f8c3	This update eliminates a lock-order reversal warning discovered while tracking down the system hang reported in kern/160662 and corrected in revision 225806. The LOR is not the cause of the system hang and indeed cannot cause an actual deadlock. However, it can be easily eliminated by deferring the acquisition of a buflock until after all the vnode locks have been acquired. Reported by: Hans Ottevanger PR: kern/160662	2011-09-27 17:41:48 +00:00
Kirk McKusick	6b3b8a2109	This update eliminates the system hang reported in kern/160662 when taking a snapshot on a filesystem running with journaled soft updates. Reported by: Hans Ottevanger Fix verified by: Hans Ottevanger PR: kern/160662	2011-09-27 17:34:02 +00:00
Konstantin Belousov	b296414c62	Use nowait sync request for a vnode when doing softdep cleanup. We possibly own the unrelated vnode lock, doing waiting sync causes deadlocks. Reported and tested by: pho Approved by: re (bz)	2011-09-20 21:53:26 +00:00
Martin Matuska	82378711f9	Generalize ffs_pages_remove() into vn_pages_remove(). Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F. PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week	2011-08-25 08:17:39 +00:00
Andrey V. Elsukov	5373cf4e34	Fix lock leak. Reported by: Alex Lyashkov Approved by: re (kib) MFC after: 1 week	2011-08-23 08:47:27 +00:00
Robert Watson	4b3a6fb933	Fix two cases involving opt_capsicum.h and module builds: (1) opt_capsicum.h is no longer required in ffs_alloc.c, so remove the #include. (2) portalfs depends on opt_capsicum.h, so have the Makefile generate one if required. These affect only modules built without a kernel (i.e., not buildkernel, but yes buildworld if the dubious MODULES_WITH_WORLD is used). Approved by: re (bz) Sponsored by: Google Inc	2011-08-15 07:32:44 +00:00
Robert Watson	a9d2f8d84f	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc	2011-08-11 12:30:23 +00:00
Kirk McKusick	fddf7baebe	Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP is set so that mount can revert back to using MNT_NOWAIT when doing getmntinfo. Approved by: re (kib)	2011-07-30 00:43:18 +00:00
Kirk McKusick	d716efa9f7	Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag so that it is visible to userland programs. This change enables the `mount' command with no arguments to be able to show if a filesystem is mounted using journaled soft updates as opposed to just normal soft updates. Approved by: re (bz)	2011-07-24 18:27:09 +00:00
Kirk McKusick	2621f2c43f	Default debugging error messages to off for journaled soft updates sysctls. Delete limiting on output of these sysctls. Approved by: re (kib)	2011-07-22 18:03:33 +00:00
Kirk McKusick	927a12ae16	Add an FFS specific mount option to allow a filesystem checker (typically fsck_ffs) to register that it wishes to use FFS specific sysctl's to update the filesystem. This ensures that two checkers cannot run on a given filesystem at the same time and that no other process accidentally or maliciously uses the filesystem updating sysctls inappropriately. This functionality is needed by the journaling soft-updates recovery code.	2011-07-15 16:20:33 +00:00
Kirk McKusick	b8ea56d7e4	Consistently check mount flag (MNTK_SUJ) rather than superblock flag (FS_SUJ) when determining whether to do journaling-based operations. The mount flag is set only when journaling is active while the superblock flag is set to indicate that journaling is to be used. For example, when the filesystem is mounted read-only, the journaling may be present (FS_SUJ) but not active (MNTK_SUJ). Inappropriate checking of the FS_SUJ flag was causing some journaling actions to be attempted at inappropriate times.	2011-07-14 18:06:13 +00:00
Kirk McKusick	17ff0cf70f	When first creating snapshots, we may free some blocks within it. These blocks should not have TRIM applied to them. Submitted by: Kostik Belousov	2011-07-10 05:34:49 +00:00
Kirk McKusick	8795189c98	Allow disk partitions associated with UFS read-only mounted filesystems to be opened for writing. This functionality used to be special-cased for just the root filesystem, but with this change is now available for all UFS filesystems. This change is needed for journaled soft updates recovery. Discussed with: Jeff Roberson	2011-07-10 00:41:31 +00:00
Konstantin Belousov	58f9394c50	Use 'curthread_pflags' instead of 'thread_pflags' to signify that only curthread can be operated upon. Requested by: attilio MFC after: 1 week	2011-07-09 15:16:07 +00:00
Konstantin Belousov	acf5d7101c	Use helper functions instead of manually managing TDP_INBDFLUSH. Sponsored by: The FreeBSD Foundation Reviewed by: alc (previous version) MFC after: 1 week	2011-07-09 14:42:45 +00:00
Jeff Roberson	e9b4d8327f	- Speed up pendingblock processing again. Having too much delay between ffs_blkfree() and the pending adjustment causes all kinds of space related problems.	2011-07-04 22:08:04 +00:00
Jeff Roberson	f2803e61fa	- Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now find themselves on snapshot vnodes. Reported by: pho	2011-07-04 21:04:25 +00:00
Jeff Roberson	8e4f5b70b0	- It is impossible to run request_cleanup() while doing a copyonwrite. This will most likely cause new block allocations which can recurse into request cleanup. - While here optimize the ufs locking slightly. We need only acquire and drop once. - process_removes() and process_truncates() are also needed only once. - Attempt to flush each item on the worklist once but do not loop forever if some can not be completed. Discussed with: mckusick	2011-07-04 20:53:55 +00:00
Jeff Roberson	c0f4e7afa4	- Fix an inode quota leak. We need to decrement the quota once and only once. Tested by: pho Reviewed by: mckusick	2011-07-04 20:52:23 +00:00
Kirk McKusick	08af0c8b8d	Handle the FREEDEP case in softdep_sync_buf(). This fix failed to get added in -r223325. Submitted by: Peter Holm	2011-06-29 22:12:43 +00:00
Alan Cox	6bbee8e28a	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib	2011-06-29 16:40:41 +00:00
Jeff Roberson	16f7d82285	- Fix directory count rollbacks by passing the mode to the journal dep earlier. - Add rollback/forward code for frag and cluster accounting. - Handle the FREEDEP case in softdep_sync_buf(). (submitted by pho)	2011-06-20 03:25:09 +00:00
Kirk McKusick	9957ac07b2	Fixed dereference of a NULL pointer. Reported by: Peter Holm	2011-06-18 21:10:03 +00:00
Kirk McKusick	ff13f23f84	Drop the include of <ufs/ffs/ffs_extern.h> from usr.sbin/makefs/ffs/ffs_bswap.c and usr.sbin/makefs/ffs/ffs_subr.c as they have no need of anything in that file. No other programs or libraries include <ufs/ffs/ffs_extern.h> (nor should they as it is totally in-kernel interfaces). For added protection I enclosed the entire contents of <ufs/ffs/ffs_extern.h> in ifdef _KERNEL. Feedback from: Bruce Evans and Tai-hwa Liang	2011-06-16 23:40:10 +00:00
Tai-hwa Liang	09108f76fa	Fixing compilation bustage by introducing another forward declaration.	2011-06-16 05:26:03 +00:00
Kirk McKusick	43a3cc7796	Ensure that filesystem metadata contained within persistent snapshots is always kept consistent. Suggested by: Jeff Roberson	2011-06-15 23:19:09 +00:00
Kirk McKusick	2191e465cc	With the restructuring of the block reclamation code, the notification messages for a filesystem being out of space need to be moved so that they do not print out until after a failed cleanup attempt. Suggested by: Jeff Roberson	2011-06-15 18:05:08 +00:00
Kirk McKusick	e34a713594	Missing cleanup case after completion of a snapshot vnode write claiming a released block. Submitted by: Jeff Roberson Tested by: Peter Holm	2011-06-15 06:13:08 +00:00
Dimitry Andric	222ef43340	Use alternative, less messy solution to avoid breakage after r223020: put the snapdata structure between #ifdef _KERNEL guards. Suggested by: kib	2011-06-13 16:05:41 +00:00
Kirk McKusick	9eb8728aa5	Update to soft updates journaling to properly track freed blocks that get claimed by snapshots. Submitted by: Jeff Roberson Tested by: Peter Holm	2011-06-12 19:27:05 +00:00
Kirk McKusick	9420dc62cd	Disable the soft updates journaling after a filesystem is successfully downgraded to read-only. It will be restarted if the filesystem is upgraded back to read-write.	2011-06-12 18:46:48 +00:00
Jeff Roberson	280e091a99	Implement fully asynchronous partial truncation with softupdates journaling to resolve errors which can cause corruption on recovery with the old synchronous mechanism. - Append partial truncation freework structures to indirdeps while truncation is proceeding. These prevent new block pointers from becoming valid until truncation completes and serialize truncations. - On completion of a partial truncate journal work waits for zeroed pointers to hit indirects. - softdep_journal_freeblocks() handles last frag allocation and last block zeroing. - vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it is only implemented in one place. - Block allocation failure handling moved up one level so it does not proceed with buf locks held. This permits us to do more extensive reclaims when filesystem space is exhausted. - softdep_sync_metadata() is broken into two parts, the first executes once at the start of ffs_syncvnode() and flushes truncations and inode dependencies. The second is called on each locked buf. This eliminates excessive looping and rollbacks. - Improve the mechanism in process_worklist_item() that handles acquiring vnode locks for handle_workitem_remove() so that it works more generally and does not loop excessively over the same worklist items on each call. - Don't corrupt directories by zeroing the tail in fsck. This is only done for regular files. - Push a fsync complete record for files that need it so the checker knows a truncation in the journal is no longer valid. Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts) Tested by: pho	2011-06-10 22:48:35 +00:00
Jeff Roberson	e84fa3ba71	- Add support for referencing quota structures without needing the inode pointer for softupdates. Submitted by: mckusick	2011-06-10 22:19:44 +00:00
Jeff Roberson	5aa336ed20	- If the fsync in ufs_direnter fails SUJ can later panic because we have partially added a name. Allow ufs_direnter() to continue in the hopes that it is a transient error. If it is not, the directory is corrupted already from IO errors and writing this new block is not likely to make things worse.	2011-06-10 22:18:25 +00:00
Kirk McKusick	9f62b10cb3	Grammar fix in comment. Eliminate one (of several) possible conflicting buffer locks when trying to reclaim blocks. Rest of fix to be incorporated as part of SUJ update by jeff. Pointed out by: Kostik Belousov	2011-06-05 22:36:30 +00:00
Kirk McKusick	1508294bb6	Due to a lag in updating the fs_pendinginodes count, we cannot depend on it to decide whether we should try to reclaim inodes when we run short. Discovered by: Peter Holm	2011-05-28 15:07:29 +00:00
Kirk McKusick	99f6ac66ad	The check for whether a block is going to be claimed by a snapshot needs to happen before we notify the underlying layer that it is being freed.	2011-05-26 23:56:58 +00:00
Rick Macklem	dbed8d1fc8	Fix the ufs/ffs file system so that it uses the lock flags argument added to VFS_FHTOVP() by r222167. Reviewed by: mckusick	2011-05-22 20:39:07 +00:00
Rick Macklem	694a586a43	Add a lock flags argument to the VFS_FHTOVP() file system method, so that callers can indicate the minimum vnode locking requirement. This will allow some file systems to choose to return a LK_SHARED locked vnode when LK_SHARED is specified for the flags argument. This patch only adds the flag. It does not change any file system to use it and all callers specify LK_EXCLUSIVE, so file system semantics are not changed. Reviewed by: kib	2011-05-22 01:07:54 +00:00
Matthew D Fleming	3d08a76bbc	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >	2011-05-13 05:27:58 +00:00
Konstantin Belousov	d3e4b05d20	Fix typos. Noted by: Fabian Keil <freebsd-listen fabiankeil de> Pointy hat to: kib MFC after: 1 week	2011-04-30 22:46:02 +00:00
Konstantin Belousov	4417ac326a	Clarify the comment. MFC after: 1 week	2011-04-30 13:49:03 +00:00
Konstantin Belousov	d9ca1af7ed	VFS sometimes is unable to inactivate a vnode when vnode use count goes to zero. E.g., the vnode might be only shared-locked at the time of vput() call. Such vnodes are kept in the hash, so they can be found later. If ffs_valloc() allocated an inode that has its vnode cached in hash, and still owing the inactivation, then vget() call from ffs_valloc() clears VI_OWEINACT, and then the vnode is reused for the newly allocated inode. The problem is, the vnode is not reclaimed before it is put to the new use. ffs_valloc() recycles vnode vm object, but this is not enough. In particular, at least v_vflag should be cleared, and several bits of UFS state need to be removed. It is very inconvenient to call vgone() at this point. Instead, move some parts of ufs_reclaim() into helper function ufs_prepare_reclaim(), and call the helper from VOP_RECLAIM and ffs_valloc(). Reviewed by: mckusick Tested by: pho MFC after: 3 weeks	2011-04-24 10:47:56 +00:00
Jeff Roberson	273ca85137	- Refactor softdep_setup_freeblocks() into a set of functions to prepare for a new journal specific partial truncate routine. - Use dep_current[] in place of specific dependency counts. This is automatically maintained when workitems are allocated and has less risk of becoming incorrect.	2011-04-11 01:43:59 +00:00
Jeff Roberson	4ac80906c3	Fix a long standing SUJ performance problem: - Keep a hash of indirect blocks that have recently been freed and are still referenced in the journal. - Lookup blocks in this hash before forcing a new block write to wait on the journal entry to hit the disk. This is only necessary to avoid confusion between old identities as indirects and new identities as file blocks. - Don't free jseg structures until the journal has written a record that invalidates it. This keeps the indirect block information around for as long as is required to be safe. - Force an empty journal block write when required to flush out stale journal data that is simply waiting for the oldest valid sequence number to advance beyond it.	2011-04-10 03:49:53 +00:00
Jeff Roberson	59343c7b98	- Don't invalidate jnewblks immediately upon discovering that the block will be removed. Permit the journal to proceed so that we don't leave a rollback in a cg for a very long time as this can cause terrible perf problems in low memory situations. Tested by: pho	2011-04-07 03:19:10 +00:00
Kirk McKusick	4c821a3978	Be far more persistent in reclaiming blocks and inodes before giving up and declaring a filesystem out of space. Especially necessary when running on a small filesystem. With this improvement, it should be possible to use soft updates on a small root filesystem. Kudos to: Peter Holm Testing by: Peter Holm MFC: 2 weeks	2011-04-05 21:26:05 +00:00
Jeff Roberson	f79d4144ab	Fix problems that manifested from filesystem full conditions: - In softdep_revert_mkdir() find the dotaddref before we attempt to cancel the jaddref so we can make assumptions about where the dotaddref is on the list. cancel_jaddref() does not always remove items from the list anymore. - Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE was never set. This ensures that dependencies will continue to be processed on the inowait/bufwait list and is more an artifact of the structure of the code than a pure ordering problem. - Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed appropriately. This normally occurs when the refs are added to the journal but if they are canceled before this point the state would never be set and the dependency could never be freed. Reported by: pho Tested by: pho	2011-04-02 21:52:58 +00:00
Konstantin Belousov	861ed1162b	Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case. Submitted by: Aleksandr Rybalko <ray dlink ua>	2011-03-28 12:39:48 +00:00
Kirk McKusick	0a809056ce	Add retry code analogous to the block allocation retry code to avoid running out of inodes. Reported by: Peter Holm	2011-03-23 05:13:54 +00:00
Konstantin Belousov	16b1f68d8c	Retire opt_ffs_broken_fixme.h. Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with usual layering. Requested by: bde MFC after: 1 week	2011-03-20 21:05:09 +00:00
Konstantin Belousov	ffda66c299	Remove the #if defined(FFS) || defined(IFS) braces around the calls to ffs_snapgone(). ufs.ko module is not built with FFS defined, causing snapshot inode number slots in superblock never to be freed, as well as a reference on the snapshot vnode. IFS was removed several years ago, and UFS/FFS separation was not maintained for real. Reported, analyzed and tested by: Yamagi Burmeister <lists yamagi org> MFC after: 3 days	2011-03-17 11:23:12 +00:00
Konstantin Belousov	0714775845	Simplify uses of the web of pointers. Reviewed by: mckusick MFC after: 1 week	2011-03-07 22:36:11 +00:00
John Baldwin	8587289fb8	The UFS dirhash code was attempting to update shared state in the dirhash from multiple threads while holding a shared lock during a lookup operation. This could result in incorrect ENOENT failures which could then be permanently stored in the name cache. Specifically, the dirhash code optimizes the case that a single thread is walking a directory sequentially opening (or stat'ing) each file. It uses state in the dirhash structure to determine if a given lookup is using the optimization. If the optimization fails, it disables it and restarts the lookup. The problem arises when two threads both attempt the optimization and fail. The first thread will restart the loop, but the second thread will incorrectly think that it did not try the optimization and will only examine a subset of the directory entries in its hash chain. As a result, it may fail to find its directory entry and incorrectly fail with ENOENT. To make this safe for use with shared locks, simplify the state stored in the dirhash and move some of the state (the part that determines if the current thread is trying the optimization) into a local variable. One result is that we will now try the optimization more often. We still update the value under the shared lock, but it is a single atomic store similar to i_diroff that is stored in UFS directory i-nodes for the non-dirhash lookup. Reviewed by: kib MFC after: 1 week	2011-03-07 18:33:29 +00:00
John Baldwin	96e1934a43	Use ffs() to locate free bits in the inode bitmap rather than a loop with bit shifts. Reviewed by: mckusick MFC after: 1 month	2011-03-04 22:26:41 +00:00
Konstantin Belousov	c30c6a2311	v_mountedhere is a member of the union. Check that the vnodes have proper type before using the member. Reported and tested by: Michael Butler <imb protected-networks net>	2011-02-19 07:47:25 +00:00
Konstantin Belousov	455a6e0ff3	Use the native sector size of the device backing the UFS volume for SU+J journal blocks, instead of hard coding 512 byte sector size. The journal needs to write each block atomically, which can only be guaranteed at the device sector size, not larger. Attempting to write less than the sector size results in driver errors. Note that this is the first structure in UFS that depends on the sector size. Other elements are written in the units of fragments. In collaboration with: pho Reviewed by: jeff Tested by: bz, pho	2011-02-12 12:52:12 +00:00
Alexander Leidinger	3502f01f20	Wrap long line. Noticed by: bz	2011-02-10 08:06:56 +00:00
Alexander Leidinger	3eb6e1317c	Add some FEATURE macros for some UFS features. SU+J is not included as a FEATURE macro: - it was not in the tree during the GSoC - I do not see an option to en-/disable it in NOTES Two minor changes were made during the review compared to what was developed during GSoC 2010. No FreeBSD version bump, the userland application to query the features will be committed last and can serve as an indication of the availability if needed. Sponsored by: Google Summer of Code 2010 Submitted by: kibab Reviewed by: kib X-MFC after: to be determined in last commit with code from this project	2011-02-09 15:33:13 +00:00
Matthew D Fleming	e7ceb1e99b	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yield() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.	2011-02-08 00:16:36 +00:00
Matthew D Fleming	08b163fa51	Put the general logic for being a CPU hog into a new function should_yield(). Use this in various places. Encapsulate the common case of check-and-yield into a new function maybe_yield(). Change several checks for a magic number of iterations to use should_yield() instead. MFC after: 1 week	2011-02-02 16:35:10 +00:00
Sergey Kandaurov	b04409f5fc	Embed a quota error message (C string) into uprintf() fmt. While here, fix whitespaces. Approved by: kib (mentor)	2011-01-13 16:29:27 +00:00
Matthew D Fleming	fbbb13f962	sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly. Commit the kernel changes.	2011-01-12 19:54:19 +00:00
Konstantin Belousov	ac32f1176b	Instead of incrementing freework reference counter in indir_trunc(), do it at the allocation time for journaled fs and indirect blocks, when the allocated object is not accessible outside. Requested and reviewed by: jeff Tested by: pho	2011-01-04 10:25:55 +00:00
Konstantin Belousov	465e3ccdbb	Handle missing jremrefs when a directory is renamed overtop of another, deleting it. If the directory is removed, UFS always need to remove the .. ref, even if the ultimate ref on the parent would not change. The new directory must have a new journal entry for that ref. Otherwise journal processing would not properly account for the parent's reference since it will belong to a removed directory entry. Change ufs_rename()'s dotdot rename section to always setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency allocated via setup_dotdot_link(). Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove(). Remove the isdirrem > 1 checks from newdirrem(). Reported by: many Submitted by: jeff Tested by: pho	2010-12-30 10:52:07 +00:00
Konstantin Belousov	42a6fc4385	In indir_trunc(), when processing jnewblk entries that are not written to the disk, recurse to handle indirect blocks of next level that are hidden by the corresponding entry. In collaboration with: pho Reviewed by: jeff, mckusick Tested by: mckusick, pho	2010-12-30 10:41:17 +00:00
Konstantin Belousov	8c2a54de80	Add kernel side support for BIO_DELETE/TRIM on UFS. The FS_TRIM fs flag indicates that administrator requested issuing of TRIM commands for the volume. UFS will only send the command to disk if the disk reports GEOM::candelete attribute. Since disk queue is reordered, data block is marked as free in the bitmap only after TRIM command completed. Due to need to sleep waiting for i/o to finish, TRIM bio_done routine schedules taskqueue to set the bitmap bit. Based on the patch by: mckusick Reviewed by: mckusick, pjd Tested by: pho MFC after: 1 month	2010-12-29 12:25:28 +00:00
Konstantin Belousov	d2d6c59245	Move the definition of mkdirlisthd from header to C file. Reviewed by: mckusick Tested by: pho	2010-12-29 12:16:06 +00:00
Konstantin Belousov	abf6c181e4	Use a proper type for the variable holding the summary size of the inode data. Otherwise, on 32bit systems, unlinked inode which size is the multiple of 4GB was not truncated, causing corruption. Reported by: brucec Reviewed by: mckusick Tested by: pho	2010-12-29 11:19:39 +00:00
Kirk McKusick	84ad0a66d0	This patch fixes a soft update panic while running perl 5.12 tests which produced: panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164 Reported by: Dimitry Andric Fixed by: Jeff Roberson	2010-12-23 00:38:57 +00:00
Konstantin Belousov	fddd463dc2	Journal start looks up .sujournal file by doing lookup on the root dvp. As result, failed softdep_mount() might leave up to two vnodes on the mp mountlist, preventing mnt_ref from going to zero. Call ffs_flushfiles() after failed softdep_mount() to clean mountlist. Initial report by: Garrett Cooper Reproduced and tested by: pho	2010-12-01 21:19:11 +00:00
Peter Holm	bcc5c95b6b	First step in fixing the handle_workitem_freeblocks panic. In collaboration with: kib	2010-11-27 20:27:07 +00:00
Kirk McKusick	18709a09ed	Delete /sys/ufs/ffs/README.snapshot as it is no longer relevant. Drop reference to it in mount(8). MFC: 3 days	2010-11-20 18:40:50 +00:00
Konstantin Belousov	730b63b0c2	Remove prtactive variable and related printf()s in the vop_inactive() and vop_reclaim() methods. They seem to be unused, and the reported situation is normal for the forced unmount. MFC after: 1 week X-MFC-note: keep prtactive symbol in vfs_subr.c	2010-11-19 21:17:34 +00:00
Konstantin Belousov	be913821af	The softdep_setup_freeblocks() adds worklist items before deallocate_dependencies() is done. This opens a race between softdep thread and the thread that does the truncation: A write of the indirect block causes the freeblks to become ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And then, softdep_disk_write_complete() would reassign the workitem to the mount point worklist, causing premature processing of the workitem, or journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does the same reassign. indir_trunc() then would find the indirect block that is locked (with lock owned by kernel) but without any dependencies, causing it to hang in getblk() waiting for buffer lock. Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies() finished. Analyzed, suggested and reviewed by: jeff Tested by: pho	2010-11-11 11:54:01 +00:00
Konstantin Belousov	496fd81362	Change #ifdef INVARIANTS panic into KASSERT, and print some useful information to diagnose the issue, in handle_complete_freeblocks(). Reviewed by: jeff Tested by: pho	2010-11-11 11:41:52 +00:00
Konstantin Belousov	d23c72cdb5	In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped. I believe there is a window otherwise where jblocks can be accessed without proper initialization. Reviewed by: jeff Tested by: pho	2010-11-11 11:38:57 +00:00
Konstantin Belousov	fae5c47dd4	Add function lbn_offset to calculate offset of the indirect block of given level. Reviewed by: jeff Tested by: pho	2010-11-11 11:35:42 +00:00
Konstantin Belousov	4e4ff01629	Fix typo. Function is called ffs_blkfree.	2010-11-11 11:26:59 +00:00
John Baldwin	b3e3402d3a	Remove unused includes of <sys/mutex.h> and <machine/mutex.h>.	2010-11-09 20:41:10 +00:00
Ivan Voras	8e431dd6f1	Bring vfs.ufs.dirhash_maxmem into the age of the fruitbat and make it autotuned. It is only an upper bound (the memory is not always allocated) and the system contains a vm_lowmem handler so nothing will crash and burn if it's tuned too high. Reviewed by: mckusick	2010-10-25 21:46:23 +00:00
Konstantin Belousov	d0cc54f3b4	The r184588 changed the layout of struct export_args, causing an ABI breakage for old mount(2) syscall, since most struct <filesystem>_args embed export_args. The mount(2) is supposed to provide ABI compatibility for pre-nmount mount(8) binaries, so restore ABI to pre-r184588. Requested and reviewed by: bde MFC after: 2 weeks	2010-10-10 07:05:47 +00:00
Alan Cox	a03e344a7f	M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that have no run-time effect.	2010-10-02 17:58:57 +00:00
Kirk McKusick	e69bed360f	Since local variable 'i' is used only in a KASSERT, declare and initialize it only if INVARIANTS is defined to avoid a declared but unused warning. Suggested by: Brian Somers <brian@FreeBSD.org>	2010-09-29 14:46:57 +00:00
Konstantin Belousov	063045a555	Fix typo in comment.	2010-09-29 07:40:11 +00:00
David E. O'Brien	59b3a4ebb5	Correct some non-code typos.	2010-09-17 09:14:40 +00:00
Kirk McKusick	c0b2efce9e	Update comments in soft updates code to more fully describe the addition of journalling. Only functional change is to tighten a KASSERT. Reviewed by: jeff Roberson	2010-09-14 18:04:05 +00:00
John Baldwin	3634d5b241	Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and LK_CANRECURSE after a lock is created. Use them to implement macros that otherwise manipulated the flags directly. Assert that the associated lockmgr lock is exclusively locked by the current thread when manipulating these flags to ensure the flag updates are safe. This last change required some minor shuffling in a few filesystems to exclusively lock a brand new vnode slightly earlier. Reviewed by: kib MFC after: 3 days	2010-08-20 19:46:50 +00:00
Konstantin Belousov	691401eef8	Softdep_process_worklist() should unsuspend not only before processing the worklist (in softdep_process_journal), but also after flushing the workitems. Might be, we should even do this before bwillwrite() too, but this seems to be not needed for now. Fs might be suspended during processing the queue, and then there is nobody around to unsuspend. In collaboration with: pho Tested by: bz Reviewed by: jeff	2010-08-12 08:35:24 +00:00
John Baldwin	61e1c19319	Revert the previous commit. The race is not applicable to the lockmgr implementation in 8.0 and later as its flags field does not hold dynamic state such as waiters flags, but is only modified in lockinit() aside from VN_LOCK_*(). Discussed with: attilio	2010-07-16 19:52:03 +00:00
John Baldwin	dbfcf8cfea	When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE in the vnode lock's flags) until after they had determined if the vnode was a FIFO. This occurs after the vnode has been inserted a VFS hash or some similar table, so it is possible for another thread to find this vnode via vget() on an i-node number and block on the vnode lock. If the lockmgr interlock (vnode interlock for vnode locks) is not held when clearing the LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result the thread blocked on the vnode lock may never get woken up. Fix this by holding the vnode interlock while modifying the lock flags in this case. MFC after: 3 days	2010-07-16 19:20:20 +00:00
Jeff Roberson	9f9c8c59ae	- Handle the truncation of an inode with an effective link count of 0 in the context of the process that reduced the effective count. Previously all truncation as a result of unlink happened in the softdep flush thread. This had the effect of being impossible to rate limit properly with the journal code. Now the process issuing unlinks is suspended when the journal fills. This has a side-effect of improving rm performance by allowing more concurrent work. - Handle two cases in inactive, one for effnlink == 0 and another when nlink finally reaches 0. - Eliminate the SPACECOUNTED related code since the truncation is no longer delayed. Discussed with: mckusick	2010-07-06 07:11:04 +00:00
Konstantin Belousov	427ef27ec7	Ensure that VOP_ACCESSX is called with exclusively locked vnode for the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp vnode lock from shared to exclusive to assign the dquot structure to the vnode, and ufs_delete_denied() is called when tvp is locked. Since upgrade drops shared lock when non-blocked upgrade failed, LOR is there. Reported and tested by: Dmitry Pryanishnikov <lynx.ripe gmail com> Tested by: pho PR: kern/147890 MFC after: 1 week	2010-06-20 13:35:16 +00:00
Andriy Gapon	d89c217f30	ffs_softdep: change K&R in function definitions to ANSI prototypes. Apparently it's bad when we first have an ANSI prototype in function declaration, but then use K&R in its definition. Complaint from: clang MFC after: 2 weeks	2010-06-11 18:26:53 +00:00
Konstantin Belousov	db875cbd74	Extend the scope of the lock on the quota file vnode in quotaon() to cover the initial read by dqopen(). Assert that vnode is locked in dqopen(). Remove VFS_LOCK_GIANT() from dqopen(), since quotaon() keeps Giant locked if needed around the call.	2010-06-03 10:24:53 +00:00
Andriy Gapon	0b9626482b	ffs_mount: accept and drop userland-only options that can be passed from loader(8) In r193192 loader(8) has grown an ability to pass root mount options from fstab via vfs.root.mountfrom.options. Unfortunately, some options that can be present in fstab are for userland only and lead to root mounting failure when seen by kernel. Rather than teaching loader about FFS-specific options that should be filtered out, ffs_mount recognizes those options as valid, but ignores and deletes[1] them. [1] is suggested by jh. PR: kern/141050 Reported by: many Reviewed by: jh, bde MFC after: 4 days	2010-05-19 09:32:11 +00:00
Jeff Roberson	f0268739c7	- Don't immediately re-run softdepflush if we didn't make any progress on the last iteration. This can lead to a deadlock when we have worklist items that cannot be immediately satisfied. Reported by: uqs, Dimitry Andric <dimitry@andric.com> - Remove some unnecessary debugging code and place some other under SUJ_DEBUG. - Examine the journal state in softdep_slowdown(). - Re-format some comments so I may more easily add flag descriptions.	2010-05-19 06:18:01 +00:00
Jeff Roberson	8ef48de888	- Call softdep_prealloc() before any of the balloc routines in the snapshot code. - Don't fsync() vnodes in prealloc if copy on write is in progress. It is not safe to recurse back into the write path here. Reported by: Vladimir Grebenschikov <vova@fbsd.ru>	2010-05-07 08:45:21 +00:00
Jeff Roberson	2c3ae115b6	- Use the correct flag mask when determining whether an inode has successfully made it to the free list yet or not. This fixes a deadlock that can occur with unlinked but referenced files. Journal space and inodedeps were not correctly reclaimed because the inode block was not left dirty. Tested/Reported by: lwindschuh@googlemail.com	2010-05-07 08:20:56 +00:00
Kirk McKusick	e27ed89aef	Merger of the quota64 project into head. This joint work of Dag-Erling Smørgrav and myself updates the FFS quota system to support both traditional 32-bit and new 64-bit quotas (for those of you who want to put 2+Tb quotas on your users). By default quotas are not compiled into the kernel. To include them in your kernel configuration you need to specify: options QUOTA # Enable FFS quotas If you are already running with the current 32-bit quotas, they should continue to work just as they have in the past. If you wish to convert to using 64-bit quotas, use `quotacheck -c 64'; if you wish to revert from 64-bit quotas back to 32-bit quotas, use `quotacheck -c 32'. There is a new library of functions to simplify the use of the quota system, do `man quotafile' for details. If your application is currently using the quotactl(2), it is highly recommended that you convert your application to use the quotafile interface. Note that existing binaries will continue to work. Special thanks to John Kozubik of rsync.net for getting me interested in pursuing 64-bit quota support and for funding part of my development time on this project.	2010-05-07 00:41:12 +00:00
Alan Cox	eb00b276ab	Eliminate page queues locking around most calls to vm_page_free().	2010-05-06 18:58:32 +00:00
Kirk McKusick	945f418ab8	Final update to current version of head in preparation for reintegration.	2010-05-06 17:37:23 +00:00
Alan Cox	5ac59343be	Acquire the page lock around all remaining calls to vm_page_free() on managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.) This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename(). Discussed with: kib	2010-05-05 18:16:06 +00:00
Edward Tomasz Napierala	b5f770bd86	Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize(). Reviewed by: kib	2010-05-05 16:44:25 +00:00
Andriy Gapon	deb3b115e2	ffs_vfsops: restore alphabetic order of options in ffs_opts The order was not correct only for nfsv4acls. ("no" prefix is ignored) MFC after: 1 week	2010-04-29 10:04:00 +00:00
Jeff Roberson	2bd20091e4	- When canceling jaddrefs they may not yet be in the journal if this is via a revert call. In this case don't attempt to remove something that has not yet been added. Otherwise this jaddref must hang around to prevent the bitmap write as normal.	2010-04-28 07:57:37 +00:00
Jeff Roberson	3b32573a9f	- Fix builds without SOFTUPDATES defined in the kernel config.	2010-04-28 07:26:41 +00:00
Kirk McKusick	a4bf5fb987	Update to current version of head.	2010-04-28 05:33:59 +00:00
Pawel Jakub Dawidek	a8750f2dca	Fix build for UFS without SOFTUPDATES.	2010-04-24 07:36:33 +00:00
Jeff Roberson	113db2dddb	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm	2010-04-24 07:05:35 +00:00
Konstantin Belousov	5673e3cb08	The cache_enter(9) function shall not be called for doomed dvp. Assert this. In the reported panic, vdestroy() fired the assertion "vp has namecache for ..", because pseudofs may end up doing cache_enter() with reclaimed dvp, after dotdot lookup temporary unlocked dvp. Similar problem exists in ufs_lookup() for "." lookup, when vnode lock needs to be upgraded. Verify that dvp is not reclaimed before calling cache_enter(). Reported and tested by: pho Reviewed by: kan MFC after: 2 weeks	2010-04-20 10:19:27 +00:00
Andriy Gapon	ecaf3257be	ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj The assignment is already done in g_vfs_open. Redundant assignment is harmless, but can become a problem if g_vfs_open logic is changed. MFC after: 1 week	2010-04-03 08:25:04 +00:00
Kirk McKusick	516ad57b74	Debugging nits found while testing the new 64-bit quota code.	2010-03-16 06:12:30 +00:00
Dag-Erling Smørgrav	1a0fda2b54	IFH@204581	2010-03-04 13:35:57 +00:00
Konstantin Belousov	2950ff259c	When ffs_realloccg() failed to allocate bigger fragment and, because pending blocks are scheduled for removal, goes to retry the (re)allocation, clear the bp pointer. It might happen that meantime free space is really exhausted and we are entering nospace: label without bread()ing buffer, causing stale bp value to be brelse()d again. Tested by: pho (Producing a scenario to reliably reproduce the race appeared to be much harder than fixing the bug) MFC after: 1 week	2010-02-13 10:34:50 +00:00
Kirk McKusick	81479e688b	One last pass to get all the unsigned comparisons correct.	2010-02-11 18:14:53 +00:00
Kirk McKusick	e870d1e6f9	This fix corrects a problem in the file system that treats large inode numbers as negative rather than unsigned. For a default (16K block) file system, this bug began to show up at a file system size above about 16Tb. To fully handle this problem, newfs must be updated to ensure that it will never create a filesystem with more than 2^32 inodes. That patch will be forthcoming soon. Reported by: Scott Burns, John Kilburg, Bruce Evans Followup by: Jeff Roberson PR: 133980 MFC after: 2 weeks	2010-02-10 20:10:35 +00:00
Edward Tomasz Napierala	619961810c	Remove unused variable.	2010-02-10 18:56:49 +00:00
Edward Tomasz Napierala	fa90da2be6	Return proper error code. Found with: clang	2010-01-25 16:09:50 +00:00
Edward Tomasz Napierala	780b61c45b	Move out code that does POSIX.1e ACL inheritance into separate routines. Reviewed by: rwatson	2010-01-24 15:12:27 +00:00

2188 Commits