freebsd-nq

Author	SHA1	Message	Date
Rick Macklem	de9a1a70ab	Add support for a "forced" pnfsdskill to the pNFS server kernel code. The pnfsdskill(8) command will normally fail if there is no valid mirror for the DS to be disabled. However, a system administrator may need to disable a DS which does not have a valid mirror so that the nfsd threads can be terminated. This patch adds the kernel code needed by pnfsdskill(8) to implement this "forced" case of disabling a DS. This patch only affects the pNFS server.	2018-07-09 19:58:01 +00:00
Rick Macklem	acc6e58def	Fix the kernel part of pnfsdscopymr() to handle holes in the file being copied. If a mirrored DS is being recovered that has a lot of large sparse files, pnfsdscopymr(8) would use a lot of space on the recovered mirror since it would write the "holes" in the file being mirrored. This patch adds code to check for a "hole" and skip doing the write. The check is done on a "per PNFSDS_COPYSIZ size block", which is currently 64K. I think that most file server file systems will be using a blocksize at least this large. If the file server is using a smaller blocksize and smaller holes need to be preserved, PNFSDS_COPYSIZ could be decreased. The block of 0s is malloc()d, since pnfsdcopymr(8) should be an infrequent occurrence.	2018-07-08 18:15:55 +00:00
Rick Macklem	ed66a76bca	Fix handling of the hybrid DS case for a pNFS server. After the addition of the "#mds_path" suffix for a DS specification on the "-p" nfsd option, it is possible to have a mix of DSs assigned to an MDS file system and DSs that store files for all DSs. This is what I referred to as "hybrid" above. At first, I didn't think this hybrid case would be useful, but I now believe that some system administrators may fine it useful. This patch modifies the file storage assignment algorithm so that it makes the "#mds_path" DSs take priority and the all file systems DSs are now only used for MDS file systems with no "#mds_path" DS servers. This only affects the pNFS server for this "hybrid" case.	2018-07-07 19:27:49 +00:00
Rick Macklem	5b500ea949	Change the pNFS server so that it does not disable a mirrored DS for an NFSERR_STALE error reported via a LayoutReturn. The current FreeBSD client can generate these errors for an operational DS while doing a recovery of a mirror after a mirrored DS has been repaired. I am not sure why these errors occur, but my best current guess is a race between the Layout Recall issued by the kernel code run from pnfsdscopymr(8) and a Read operation on the DS for the file bing copied. The errors are not fatal, since the client falls back on doing I/O through the MDS, which can do the I/O successfully as a proxy. (The fact that the MDS can do this indicates that the file does still exist on the functioning DS.) This change only affects the pNFS server and only when a client does a LayoutReturn with the NFSERR_STALE error report.	2018-07-06 19:18:45 +00:00
Rick Macklem	ff3b992f38	Fix the pNFS server so that it handles the "#mds_path" check for mirrors. The recently added feature of the pNFS server will set an fsid for the MDS file system to define the file system a DS should store files for. For a case where a DS handling all file systems has failed, it was possible for the code to check for a mirror with a specified fs, even though nfsdev_mdsisset was 0, possibly causing a false successful check for a mirror. This patch adds a check for nfsdev_mdsisset != 0 to avoid this. It only affects the pNFS server for a rare case. Found via code inspection.	2018-07-04 19:46:26 +00:00
Rick Macklem	2f32675c83	Add an optional feature to the pNFS server. Without this patch, the pNFS server distributes the data storage files across all of the specified DSs. A tester noted that it would be nice if a system administrator could control which DSs are used to store the file data for a given exported MDS file system. This patch adds the kernel support to do this. It also makes a slight semantic change to nfsv4_findmirror(), since some uses of it no longer require that the DS being searched for have a current mirror. A patch that will be committed in a few minutes will modify the nfsd daemon to support this feature. The patch should only affect sites using the pNFS server (specified via the "-p" command line option for nfsd. Suggested by: james.rose@framestore.com	2018-07-02 19:21:33 +00:00
Rick Macklem	1aabf3fd5e	Fix the pNFS server for a case where mirror level equals number of DSs. If a pNFS service was set up where the number of DSs equals the mirror level and then a DS was disabled, the service would create files with duplicate entries for the same DS. This bug occurred because I didn't realize that TAILQ_FOREACH_FROM() would start at the beginning of the list when the inital value of the variable was NULL. This patch also changes the pNFS server DS file creation code so that it creates entrie(s) with 0.0.0.0 IP address when it cannot create mirror level files due to lack of DSs. The patch only affects the pNFS service and only when it was created with a number of DSs equal to the mirror level and mirroring is enabled.	2018-06-29 12:41:36 +00:00
Rick Macklem	c16f407e31	Add a counter to limit the number of disabled DSs for a mirrored pNFS MDS. This patch adds a counter that limits the number of disabled mirrored DSs to mirror level - 1. It also makes a small change that keeps a Write that has failed with EACCES when attempted by a client to a DS from disabling the DS. This patch only affects the pNFS server.	2018-06-22 00:55:39 +00:00
Rick Macklem	90d2dfab19	Merge the pNFS server code from projects/pnfs-planb-server into head. This code merge adds a pNFS service to the NFSv4.1 server. Although it is a large commit it should not affect behaviour for a non-pNFS NFS server. Some documentation on how this works can be found at: http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt and will hopefully be turned into a proper document soon. This is a merge of the kernel code. Userland and man page changes will come soon, once the dust settles on this merge. It has passed a "make universe", so I hope it will not cause build problems. It also adds NFSv4.1 server support for the "current stateid". Here is a brief overview of the pNFS service: A pNFS service separates the Read/Write oeprations from all the other NFSv4.1 Metadata operations. It is hoped that this separation allows a pNFS service to be configured that exceeds the limits of a single NFS server for either storage capacity and/or I/O bandwidth. It is possible to configure mirroring within the data servers (DSs) so that the data storage file for an MDS file will be mirrored on two or more of the DSs. When this is used, failure of a DS will not stop the pNFS service and a failed DS can be recovered once repaired while the pNFS service continues to operate. Although two way mirroring would be the norm, it is possible to set a mirroring level of up to four or the number of DSs, whichever is less. The Metadata server will always be a single point of failure, just as a single NFS server is. A Plan B pNFS service consists of a single MetaData Server (MDS) and K Data Servers (DS), all of which are recent FreeBSD systems. Clients will mount the MDS as they would a single NFS server. When files are created, the MDS creates a file tree identical to what a single NFS server creates, except that all the regular (VREG) files will be empty. As such, if you look at the exported tree on the MDS directly on the MDS server (not via an NFS mount), the files will all be of size 0. Each of these files will also have two extended attributes in the system attribute name space: pnfsd.dsfile - This extended attrbute stores the information that the MDS needs to find the data storage file(s) on DS(s) for this file. pnfsd.dsattr - This extended attribute stores the Size, AccessTime, ModifyTime and Change attributes for the file, so that the MDS doesn't need to acquire the attributes from the DS for every Getattr operation. For each regular (VREG) file, the MDS creates a data storage file on one (or more if mirroring is enabled) of the DSs in one of the "dsNN" subdirectories. The name of this file is the file handle of the file on the MDS in hexadecimal so that the name is unique. The DSs use subdirectories named "ds0" to "dsN" so that no one directory gets too large. The value of "N" is set via the sysctl vfs.nfsd.dsdirsize on the MDS, with the default being 20. For production servers that will store a lot of files, this value should probably be much larger. It can be increased when the "nfsd" daemon is not running on the MDS, once the "dsK" directories are created. For pNFS aware NFSv4.1 clients, the FreeBSD server will return two pieces of information to the client that allows it to do I/O directly to the DS. DeviceInfo - This is relatively static information that defines what a DS is. The critical bits of information returned by the FreeBSD server is the IP address of the DS and, for the Flexible File layout, that NFSv4.1 is to be used and that it is "tightly coupled". There is a "deviceid" which identifies the DeviceInfo. Layout - This is per file and can be recalled by the server when it is no longer valid. For the FreeBSD server, there is support for two types of layout, call File and Flexible File layout. Both allow the client to do I/O on the DS via NFSv4.1 I/O operations. The Flexible File layout is a more recent variant that allows specification of mirrors, where the client is expected to do writes to all mirrors to maintain them in a consistent state. The Flexible File layout also allows the client to report I/O errors for a DS back to the MDS. The Flexible File layout supports two variants referred to as "tightly coupled" vs "loosely coupled". The FreeBSD server always uses the "tightly coupled" variant where the client uses the same credentials to do I/O on the DS as it would on the MDS. For the "loosely coupled" variant, the layout specifies a synthetic user/group that the client uses to do I/O on the DS. The FreeBSD server does not do striping and always returns layouts for the entire file. The critical information in a layout is Read vs Read/Writea and DeviceID(s) that identify which DS(s) the data is stored on. At this time, the MDS generates File Layout layouts to NFSv4.1 clients that know how to do pNFS for the non-mirrored DS case unless the sysctl vfs.nfsd.default_flexfile is set non-zero, in which case Flexible File layouts are generated. The mirrored DS configuration always generates Flexible File layouts. For NFS clients that do not support NFSv4.1 pNFS, all I/O operations are done against the MDS which acts as a proxy for the appropriate DS(s). When the MDS receives an I/O RPC, it will do the RPC on the DS as a proxy. If the DS is on the same machine, the MDS/DS will do the RPC on the DS as a proxy and so on, until the machine runs out of some resource, such as session slots or mbufs. As such, DSs must be separate systems from the MDS. Tested by: james.rose@framestore.com Relnotes: yes	2018-06-12 19:36:32 +00:00
Rick Macklem	8472f76005	Revert r334586 since I now think __unused is the better way to handle this.	2018-06-04 11:35:04 +00:00
Rick Macklem	12c7a494ad	Fix a gcc8 warning about a write only variable. gcc8 warns that "verf" was set but not used. This was because the code that uses it is disabled via a "#if 0". This patch adds a "#if 0" to the variable's declaration and assignment to get rid of the warning. This way the code could be re-enabled without difficulty. Requested by: mmacy MFC after: 2 weeks	2018-06-03 19:46:44 +00:00
Rick Macklem	9442a64e53	Add the BindConnectiontoSession operation to the NFSv4.1 server. Under some fairly unusual circumstances, the Linux NFSv4.1 client is doing a BindConnectiontoSession operation for TCP connections. It is also used by the ESXi6.5 NFSv4.1 client. This patch adds this operation to the NFSv4.1 server. Reported by: andreas.nagy@frequentis.com Tested by: andreas.nagy@frequentis.com MFC after: 2 weeks	2018-06-01 19:47:41 +00:00
Rick Macklem	440e2f9e91	Strengthen locking for the NFSv4.1 server DestroySession operation. If a client did a DestroySession on a session while it was still in use, the server might try to use the session structure after it is free'd. I think the client has violated RFC5661 if it does this, but this patch makes DestroySession block all other nfsd threads so no thread could be using the session when it is free'd. After the DestroySession, nfsd threads will not be able to find the session. The patch also adds a check for nd_sessionid being set, although if that was not the case it would have been all 0s and unlikely to have a false match. This might fix the crashes described in PR#228497 for the FreeNAS server. PR: 228497 MFC after: 1 week	2018-05-30 20:16:17 +00:00
Rick Macklem	04b1905584	Add a missing nfsrv_freesession() call for an unlikely failure case. Since NFSv4.1 clients normally create a single session which supports both fore and back channels, it is unlikely that a callback will fail due to a lack of a back channel. However, if this failure occurred, the session wasn't being dereferenced and would never be free'd. Found by inspection during pNFS server development. Tested by: andreas.nagy@frequentis.com MFC after: 2 months	2018-05-17 21:17:20 +00:00
Rick Macklem	0ebe2634be	End grace for the NFSv4 server if all mounts do ReclaimComplete. The NFSv4 protocol requires that the server only allow reclaim of state and not issue any new open/lock state for a grace period after booting. The NFSv4.0 protocol required this grace period to be greater than the lease duration (over 2minutes). For NFSv4.1, the client tells the server that it has done reclaiming state by doing a ReclaimComplete operation. If all NFSv4 clients are NFSv4.1, the grace period can end once all the clients have done ReclaimComplete, shortening the time period considerably. This patch does this. If there are any NFSv4.0 mounts, the grace period will still be over 2minutes. This change is only an optimization and does not affect correct operation. Tested by: andreas.nagy@frequentis.com MFC after: 2 months	2018-05-15 20:28:50 +00:00
Rick Macklem	8932a4835f	Fix the eir_server_scope reply argument for NFSv4.1 ExchangeID. In the reply to an ExchangeID operation, the NFSv4.1 server returns a "scope" value (eir_server_scope). If this value is the same, it indicates that two servers share state, which is never the case for FreeBSD servers. As such, the value needs to be unique and it was without this patch. However, I just found out that it is not supposed to change when the server reboots and without this patch, it did change. This patch fixes eir_server_scope so that it does not change when the server is rebooted. The only affect not having this patch has is that Linux clients don't reclaim opens and locks after a server reboot, which meant they lost any byte range locks held before the server rebooted. It only affects NFSv4.1 mounts and the FreeBSD NFSv4.1 client was not affected by this bug. MFC after: 1 week	2018-05-13 23:38:01 +00:00
Rick Macklem	0f13d146a0	Fix a slow leak of session structures in the NFSv4.1 server. For a fairly rare case of a client doing an ExchangeID after a hard reboot, the old confirmed clientid still exists, but some clients use a new co_verifier. For this case, the server was not freeing up the sessions on the old confirmed clientid. This patch fixes this case. It also adds two LIST_INIT() macros, which are actually no-ops, since the structure is malloc()d with M_ZERO so the pointer is already set to NULL. It should have minimal impact, since the only way I could exercise this code path was by doing a hard power cycle (pulling the plus) on a machine running Linux with a NFSv4.1 mount on the server. Originally spotted during testing of the ESXi 6.5 client. Tested by: andreas.nagy@frequentis.com MFC after: 2 months	2018-05-13 12:42:53 +00:00
Rick Macklem	bb3436966a	The NFSv4.1 server should return NFSERR_BACKCHANBUSY instead of NFS_OK. When an NFSv4.1 session is busy due to a callback being in progress, nfsrv_freesession() should return NFSERR_BACKCHANBUSY instead of NFS_OK. The only effect this has is that the DestroySession operation will report the failure for this case and this probably has little or no effect on a client. Spotted by inspection and no failures related to this have been reported. MFC after: 2 months	2018-05-13 12:29:09 +00:00
Rick Macklem	5d4835e4b7	Add support for the TestStateID operation to the NFSv4.1 server. The Linux client now uses the TestStateID operation, so this patch adds support for it to the NFSv4.1 server. The FreeBSD client never uses this operation, so it should not be affected. MFC after: 2 months	2018-05-11 22:16:23 +00:00
Rick Macklem	7427a9f138	Revert r333183, since I am not sure that just initializing the list is the correct thing to do and that is already done without this commit.	2018-05-02 21:29:42 +00:00
Rick Macklem	858bb2fc1a	Add two missing LIST_INIT()s. This patch adds two missing LIST_INIT()s. Found by inspection. In practice, these are currently no-ops, since the structure they are in is malloc'd with M_ZERO and all LIST_INIT does is set the pointer in the list head to NULL. (In other words, the M_ZERO has already correctly initialized it.) MFC after: 2 months	2018-05-02 20:36:11 +00:00
Rick Macklem	6269d66373	Fix OpenDowngrade for NFSv4.1 if a client sets the OPEN_SHARE_ACCESS_WANT* bits. The NFSv4.1 RFC specifies that the OPEN_SHARE_ACCESS_WANT bits can be set in the OpenDowngrade share_access argument and are basically ignored. I do not know of a extant NFSv4.1 client that does this, but this little patch fixes it just in case. It also changes the error from NFSERR_BADXDR to NFSERR_INVAL since the NFSv4.1 RFC specifies this as the error to be returned if bogus bits are set. (The NFSv4.0 RFC didn't specify any error for this, so the error reply can be changed for NFSv4.0 as well.) Found by inspection while looking at a problem with OpenDowngrade reported for the ESXi 6.5 NFSv4.1 client. Reported by: andreas.nagy@frequentis.com PR: 227214 MFC after: 1 week	2018-04-19 20:30:33 +00:00
Conrad Meyer	b97b91b547	nfs: Remove NFSSOCKADDRALLOC, NFSSOCKADDRFREE macros They were just thin wrappers over malloc(9) w/ M_ZERO and free(9). Discussed with: rmacklem, markj Sponsored by: Dell EMC Isilon	2018-01-25 22:38:39 +00:00
Conrad Meyer	222daa421f	style: Remove remaining deprecated MALLOC/FREE macros Mechanically replace uses of MALLOC/FREE with appropriate invocations of malloc(9) / free(9) (a series of sed expressions). Something like: * MALLOC(a, b, ... -> a = malloc(... * FREE( -> free( * free((caddr_t) -> free( No functional change. For now, punt on modifying contrib ipfilter code, leaving a definition of the macro in its KMALLOC(). Reported by: jhb Reviewed by: cy, imp, markj, rmacklem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14035	2018-01-25 22:25:13 +00:00
Emmanuel Vadot	becffb36cc	nfs: Do not printf each time a lock structure is freed during module unload There can be a lot of those structures and printing a line each time we free one on module unload. MFC after: 3 days	2018-01-18 15:28:49 +00:00
John Baldwin	b1288166e0	Use long for the last argument to VOP_PATHCONF rather than a register_t. pathconf(2) and fpathconf(2) both return a long. The kern_[f]pathconf() functions now accept a pointer to a long value rather than modifying td_retval directly. Instead, the system calls explicitly store the returned long value in td_retval[0]. Requested by: bde Reviewed by: kib Sponsored by: Chelsio Communications	2018-01-17 22:36:58 +00:00
Alexander Kabaev	151ba7933a	Do pass removing some write-only variables from the kernel. This reduces noise when kernel is compiled by newer GCC versions, such as one used by external toolchain ports. Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial) Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c) Differential Revision: https://reviews.freebsd.org/D10385	2017-12-25 04:48:39 +00:00
Rick Macklem	b07bc39a23	Avoid the overhead of acquiring a lock in nfsrv_checkgetattr() when there are no write delegations issued. manu@ reported on the freebsd-current@ mailing list that there was a significant performance hit in nfsrv_checkgetattr() caused by the acquisition/release of a state lock, even when there were no write delegations issued. This patch add a count of outstanding issued write delegations to the NFSv4 server. This count allows nfsrv_checkgetattr() to return without acquiring any lock when the count is 0, avoiding the performance hit for the case where no write delegations are issued. Reported by: manu Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13327	2017-12-04 21:50:27 +00:00
Pedro F. Giffuni	d63027b668	sys/fs: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:15:37 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Rick Macklem	57ef3db3a9	Fix the client IP address reported by nfsdumpstate for 64bit arch and NFSv4.1. The client IP address was not being reported for some NFSv4 mounts by nfsdumpstate. Upon investigation, two problems were found for mounts using IPv4. One was that the code (originally written and tested on i386) assumed that a "u_long" was a "uint32_t" and would exactly store an IPv4 host address. Not correct for 64bit arches. Also, for NFSv4.1 mounts, the field was not being filled in. This was basically correct, because NFSv4.1 does not use a callback address. However, it meant that nfsdumpstate could not report the client IP addr. This patch should fix both of these issues. For IPv6, the address will still not be reported. The original NFSv4 RFC only specified IPv4 callback addresses. I think this has changed and, if so, a future commit to fix reporting of IPv6 addresses will be needed. Reported by: manu PR: 223036 MFC after: 2 weeks	2017-10-15 22:22:27 +00:00
Rick Macklem	ce8d06fe87	Change a panic to an error return. There was a panic() in the NFS server's write operation that didn't need to be a panic() and could just be an error return. This patch makes that change. Found by code inspection during development of the pNFS service. MFC after: 2 weeks	2017-09-24 20:05:48 +00:00
Alexander Motin	e9c9826673	Improve FHA locality control for NFS read/write requests. This change adds two new tunables, allowing to control serialization for read and write NFS requests separately. It does not change the default behavior since there are too many factors to consider, but gives additional space for further experiments and tuning. The main motivation for this change is very low write speed in case of ZFS with sync=always or when NFS clients requests sychronous operation, when every separate request has to be written/flushed to ZIL, and requests are processed one at a time. Setting vfs.nfsd.fha.write=0 in that case allows to increase ZIL throughput by several times by coalescing writes and cache flushes. There is a worry that doing it may increase data fragmentation on disks, but I suppose it should not happen for pool with SLOG. MFC after: 1 week Sponsored by: iXsystems, Inc.	2017-07-31 15:23:19 +00:00
Edward Tomasz Napierala	1d2fef9b9a	Rename vfs.nfsd.enable_uidtostring to vfs.nfs.enable_uidtostring. It applies to both NFS client and NFS server, and is useful for both. This is different from vfs.nfsd.enable_stringtouid, which is specific to server side. Reviewed by: rmacklem@ MFC after: 2 weeks Sponsored by: DARPA, AFRL	2017-07-19 09:59:32 +00:00
Edward Tomasz Napierala	6a3450e178	Add vfs.nfsd.nfsd_enable_uidtostring, which works just like vfs.nfsd.nfsd_enable_stringtouid, but in reverse - when set to 1, it forces the NFSv4 server to return numeric UIDs and GIDs instead of "user@domain" strings. This helps with clients that can't translate returned identifiers, eg when rerooting. The same can be achieved by just never running nfsuserd(8), but the sysctl is useful to toggle the behaviour back and forth without rebooting. Reviewed by: rmacklem (earlier version) MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11326	2017-06-26 13:11:21 +00:00
Rick Macklem	95ac7f1a74	Fix the NFS client/server so that it actually uses the 64bit ino_t filenos. The code still doesn't use d_off. That will come in a future commit. The code also removes the checks for servers returning a fileno that doesn't fit in 32bits, since that should work ok now. Bump __FreeBSD_version since this patch changes the interface between the NFS kernel modules. Reviewed by: kib	2017-06-18 21:48:31 +00:00
Konstantin Belousov	6992112349	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
Rick Macklem	858f6fe327	Allow use of a write open stateid for reading in the NFSv4 server. The NFSv4 RFCs give a server the option of allowing the use of an open stateid for write access to be used for a Read operation. This patch enables this by default and adds a sysctl to disable it, for anyone who does not want this capability. Allowing this is particularily useful for a pNFS Data Server (DS), since they are not permitted to allow the use of special stateids. Discovered during recent testing of the pNFS server under development. MFC after: 2 weeks	2017-04-24 20:46:19 +00:00
Rick Macklem	dedec68c32	Fix the setting of atime for Linux client NFSv4 mounts. The FreeBSD NFSv4 server did not set the attribute bit for TimeAccess in the reply to an Open with exclusive_create, as required by the RFCs. (This is required since the FreeBSD NFS server stores the create_verifier in the va_atime attribute.) As such, the Linux NFSv4 client did not set the TimeAccess (atime) in the Setattr done in an RPC after the one with the Open/exclusive_create. This patch fixes the server to set the TimeAccess bit in the reply. I believe that storing the create_verifier in an extended attribute for file systems that support extended attributes might be a good idea, but I will wait for a discussion of this on the freebsd-fs@ email list before considering committing a patch to do this. Reported by: jim@ks.uiuc.edu Suggested by: dfr MFC after: 2 weeks	2017-04-21 00:17:47 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Konstantin Belousov	aca4bb9112	Do not leak mount references for dying threads. Thread might create a condition for delayed SU cleanup, which creates a reference to the mount point in td_su, but exit without returning through userret(), e.g. when terminating due to single-threading or process exit. In this case, td_su reference is not dropped and mount point cannot be freed. Handle the situation by clearing td_su also in the thread destructor and in exit1(). softdep_ast_cleanup() has to receive the thread as argument, since e.g. thread destructor is executed in different context. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-25 10:38:18 +00:00
Eric van Gyzen	8144690af4	Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel inet_ntoa() cannot be used safely in a multithreaded environment because it uses a static local buffer. Instead, use inet_ntoa_r() with a buffer on the caller's stack. Suggested by: glebius, emaste Reviewed by: gnn MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9625	2017-02-16 20:47:41 +00:00
Andriy Gapon	90f90687b3	add svcpool_close to handle killed nfsd threads This patch adds a new function to the server krpc called svcpool_close(). It is similar to svcpool_destroy(), but does not free the data structures, so that the pool can be used again. This function is then used instead of svcpool_destroy(), svcpool_create() when the nfsd threads are killed. PR: 204340 Reported by: Panzura Approved by: rmacklem Obtained from: rmacklem MFC after: 1 week	2017-02-14 17:49:08 +00:00
Baptiste Daroussin	b4b4b5304b	Revert crap accidentally committed	2017-01-28 16:31:23 +00:00
Baptiste Daroussin	814aaaa7da	Revert r312923 a better approach will be taken later	2017-01-28 16:30:14 +00:00
Konstantin Belousov	2f304845e2	Do not allocate struct statfs on kernel stack. Right now size of the structure is 472 bytes on amd64, which is already large and stack allocations are indesirable. With the ino64 work, MNAMELEN is increased to 1024, which will make it impossible to have struct statfs on the stack. Extracted from: ino64 work by gleb Discussed with: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-05 17:19:26 +00:00
Josh Paetzel	b5a8f340f1	Workaround NFS bug with readdirplus when there are greater than 1 billion files in a filesystem. Reviewed by kib MFC after: 2 weeks Sponsored by: iXsystems Differential Revision: D9009	2017-01-02 19:18:56 +00:00
Rick Macklem	a5d19b81b4	Fix the NFSv4.1 server for Open reclaim after a reboot. The NFSv4.1 server failed to update the nfs-stablerestart file for a client when the client was issued its first Open. As such, recovery of Opens after a server reboot failed with NFSERR_NOGRACE. This patch fixes this. It also changes the code so that it malloc()'s the 1024 byte array instead of allocating it on the kernel stack for both NFSv4.0 and NFSv4.1. Note that this bug only affected NFSv4.1 and only when clients attempted to reclaim Opens after a server reboot. MFC after: 2 weeks	2016-12-05 22:36:25 +00:00
Konstantin Belousov	7359fdcf5f	Allow some dotdot lookups in capability mode. If dotdot lookup does not escape from the file descriptor passed as the lookup root, we can allow the component traversal. Track the directories traversed, and check the result of dotdot lookup against the recorded list of the directory vnodes. Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently disabled by default until more verification of the approach is done. Disallow non-local filesystems for dotdot, since remote server might conspire with the local process to allow it to escape the namespace. This might be too cautious, provide the knob vfs.lookup_cap_dotdot_nonlocal to override as well. Idea by: rwatson Discussed with: emaste, jonathan, rwatson Reviewed by: mjg (previous version) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 week Differential revision: https://reviews.freebsd.org/D8110	2016-11-02 12:43:15 +00:00
Rick Macklem	dcb19c3886	A problem w.r.t. interoperation between the FreeBSD NFSv4.1 server with delegations enabled and the Linux NFSv4.1 client was reported in reviews.freebsd.org/D7891. I believe that the FreeBSD server behaviour conforms to the RFC and that the Linux client has a bug. Therefore, I do not think the proposed patch is appropriate. When nfsrv_writedelegifpos is non-zero, the FreeBSD server will issue a write delegation for a read open if possible. The Linux client then erroneously assumes that the credentials used for the read open can write the file. This patch reverses the default value for nfsrv_writedelegifpos to 0 so that the default behaviour is Linux compatible and adds a sysctl that can be used to set nfsrv_writedelegifpos. This change should only affect users that are mounting a FreeBSD server with delegations enabled (they are not enabled by default) with a Linux NFSv4.1 client mount. Reported by: fatih.acar@gandi.net Tested by: fatih.acar@gandi.net MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D7891	2016-10-20 23:53:16 +00:00
Rick Macklem	1b819cf265	Update the nfsstats structure to include the changes needed by the patch in D1626 plus changes so that it includes counts for NFSv4.1 (and the draft of NFSv4.2). Also, make all the counts uint64_t and add a vers field at the beginning, so that future revisions can easily be implemented. There is code in place to handle the old vesion of the nfsstats structure for backwards binary compatibility. Subsequent commits will update nfsstat(8) to use the new fields. Submitted by: will (earlier version) Reviewed by: ken MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D1626	2016-08-12 22:44:59 +00:00
Conrad Meyer	5ecc225fc5	nfsd: Fix use-after-free in NFS4 lock test service Trivial use-after-free where stp was freed too soon in the non-error path. To fix, simply move its release to the end of the routine. Reported by: Coverity CID: 1006105 Sponsored by: EMC / Isilon Storage Division	2016-05-12 05:03:12 +00:00
Rick Macklem	de2413b95e	Don't increment srvrpccnt[] for the NFSv4.1 operations. When support for NFSv4.1 was added to the NFS server, it broke the server rpc count stats, since newnfsstats.srvrpccnt[] doesn't have entries for the new NFSv4.1 operations. Without this patch, the code was incrementing bogus entries in newnfsstats for the new NFSv4.1 operations. This patch is an interim fix. The nfsstats structure needs to be updated and that will come in a future commit. Reported by: cem MFC after: 2 weeks	2016-05-07 22:45:08 +00:00
Pedro F. Giffuni	ee58b56452	nfsserver: minor spelling fix in comment. No functional change.	2016-05-06 23:40:37 +00:00
Rick Macklem	8eabbbe24b	Give mountd -S priority over outstanding RPC requests when suspending the nfsd. It was reported via email that under certain heavy RPC loads long delays before the exports would be updated was observed when using "mountd -S". This patch reverses the priority between the exclusive lock request to suspend the nfsd threads and the shared lock request for performing RPCs. As such, when mountd attempts to suspend the nfsd threads, it gets priority over outstanding RPC requests to do this. I suspect that the case reported was an artificial test load, but this patch did fix the problem for the reporter. Reported and Tested by: josephlai@qnap.com MFC after: 2 weeks	2016-05-06 23:26:17 +00:00
Pedro F. Giffuni	a96c9b30e2	NFS: spelling fixes on comments. No funcional change.	2016-04-29 16:07:25 +00:00
Rick Macklem	ae03cbd7f3	Allow the NFSv4 server to reply NFSERR_WRONGSEC for the SetClientID operation. It was reported via email that a Linux client couldn't do a Kerberized NFS mount when only "sec=krb5" was specified for the exports. The Linux client attempted a mount via krb5i and the server replied NFSERR_SERVERFAULT. Although NFSERR_WRONGSEC isn't listed as an error for SetClientID, I think it is the correct reply, so this patch enables that. I do not know if this fixes the mount attempt, but adding "krb5i" to the list of allowed security flavours does allow the mount to work. Reported by: joef@spectralogic.com MFC after: 2 weeks	2016-04-23 21:18:45 +00:00
Rick Macklem	0533d72612	Fix a LOR in the NFSv4.1 server. The ordering of acquisition of the state and session mutexes was reversed in two cases executed when an NFSv4.1 client created/freed a session. Since clients will typically do this only when mounting and dismounting, the likelyhood of causing a deadlock was low but possible. This can only occur for NFSv4.1 mounts, since the others do not use sessions. This was detected while testing the pNFS server/client where the client crashed during dismounting. The patch also reorders the unlocks, although that isn't necessary for correct operation. MFC after: 2 weeks	2016-04-23 01:22:04 +00:00
Rick Macklem	13c581fc54	If the VOP_SETATTR() call that saves the exclusive create verifier failed, the NFS server would leave the newly created vnode locked. This could result in a file system that would not unmount and processes wedged, waiting for the file to be unlocked. Since this VOP_SETATTR() never fails for most file systems, this bug doesn't normally manifest itself. I found it during testing of an exported GlusterFS file system, which can fail. This patch adds the vput() and changes the error to the correct NFS one. MFC after: 2 weeks	2016-04-12 20:23:09 +00:00
Pedro F. Giffuni	74b8d63dcc	Cleanup unnecessary semicolons from the kernel. Found with devel/coccinelle.	2016-04-10 23:07:00 +00:00
Rick Macklem	84be7e0952	Add kernel support to the NFS server for the "-manage-gids" option that will be added to the nfsuserd daemon in a future commit. It modifies the cache used by NFSv4 for name<-->id translation (both username/uid and group/gid) to support this. When "-manage-gids" is set, the server looks up each uid for the RPC and uses the list of groups cached in the server instead of the list of groups provided in the RPC request. The cached group list is acquired for the cache by the nfsuserd daemon via getgrouplist(3). This avoids the 16 groups limit for the list in the RPC request. Since the cache is now used for every RPC when "-manage-gids" is enabled, the code also modifies the cache to use a separate mutex for each hash list instead of a single global mutex. Suggested by: jpaetzel Tested by: jpaetzel MFC after: 2 weeks	2015-11-30 21:54:27 +00:00
Rick Macklem	a0962bf8bc	When the nfsd threads are terminated, the NFSv4 server state (opens, locks, etc) is retained, which I believe is correct behaviour. However, for NFSv4.1, the server also retained a reference to the xprt (RPC transport socket structure) for the backchannel. This caused svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed a socket upcall to occur after the mutexes in the svcpool were destroyed, causing a crash. This patch fixes the code so that the backchannel xprt structure is dereferenced just before svcpool_destroy() is called, so the code does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall. Tested by: g_amanakis@yahoo.com PR: 204340 MFC after: 2 weeks	2015-11-21 23:55:46 +00:00
Rick Macklem	29dc40b6be	For the case where an NFSv4.1 ExchangeID operation has the client identifier that already has a confirmed ClientID, the nfsrv_setclient() function would not fill in the clientidp being returned. As such, the value of ClientID returned would be whatever garbage was on the stack. An NFSv4.1 client would not normally do this, but it appears that it can happen for certain Linux clients. When it happens, the client persistently retries the ExchangeID and Create_session after Create_session fails when it uses the bogus clientid. With this patch, the correct clientid is replied. This problem was identified in a packet trace supplied by Ahmed Kamal via email. Reported by: email.ahmedkamal@googlemail.com MFC after: 2 weeks	2015-08-14 22:02:14 +00:00
Rick Macklem	25f37276e5	This patch fixes a problem where, if the NFSv4 server has a previous unconfirmed clientid structure for the same client on the last hash list, this old entry would not be removed/deleted. I do not think this bug would have caused serious problems, since the new entry would have been before the old one on the list. This old entry would have eventually been scavenged/removed. Detected while reading the code looking for another bug. MFC after: 3 days	2015-07-29 23:06:30 +00:00
Rick Macklem	0c419e226c	Make the NFS server use shared vnode locks for a few cases that are allowed by the VFS/VOP interface instead of using exclusive locks. MFC after: 2 weeks	2015-05-29 20:22:53 +00:00
Rick Macklem	1f54e596ad	Make the size of the hash tables used by the NFSv4 server tunable. No appreciable change in performance was observed after increasing the sizes of these tables and then testing with a single client. However, there was an email that indicated high CPU overheads for a heavily loaded NFSv4 and it is hoped that increasing the sizes of the hash tables via these tunables might help. The tables remain the same size by default. Differential Revision: https://reviews.freebsd.org/D2596 MFC after: 2 weeks	2015-05-27 22:00:05 +00:00
Konstantin Belousov	1bc93bb7b9	Currently, softupdate code detects overstepping on the workitems limits in the code which is deep in the call stack, and owns several critical system resources, like vnode locks. Attempt to wait while the per-mount softupdate thread cleans up the backlog may deadlock, because the thread might need to lock the same vnode which is owned by the waiting thread. Instead of synchronously waiting for the worker, perform the worker' tickle and pause until the backlog is cleaned, at the safe point during return from kernel to usermode. A new ast request to call softdep_ast_cleanup() is created, the SU code now only checks the size of queue and schedules ast. There is no ast delivery for the kernel threads, so they are exempted from the mechanism, except NFS daemon threads. NFS server loop explicitely checks for the request, and informs the schedule_cleanup() that it is capable of handling the requests by the process P2_AST_SU flag. This is needed because nfsd may be the sole cause of the SU workqueue overflow. But, to not cause nsfd to spawn additional threads just because we slow down existing workers, only tickle su threads, without waiting for the backlog cleanup. Reviewed by: jhb, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:20:42 +00:00
Rick Macklem	2fbb9d563b	Fix the NFS server's handling of a bogus NFSv2 ROOT RPC. The ROOT RPC is deprecated in the NFSv2 RFC, RFC-1094 and should never be used by a client. Tested by: thmu@freenet.de MFC after: 1 week	2015-04-25 00:58:24 +00:00
Edward Tomasz Napierala	50a220c699	Replace "new NFS" with just "NFS" in some sysctl description strings. Sponsored by: The FreeBSD Foundation	2015-04-19 06:18:41 +00:00
Rick Macklem	66e80f77d2	mav@ has found that NFS servers exporting ZFS file systems can perform better when using a 128K read/write data size. This patch changes NFS_MAXDATA from 64K to 128K so that clients can use 128K for NFS mounts to allow this. The patch also renames NFS_MAXDATA to NFS_SRVMAXIO so that it is clear that it applies to the NFS server side only. It also avoids a name conflict with the NFS_MAXDATA defined in rpcsvc/nfs_prot.h, that is used for userland RPC. Tested by: mav Reviewed by: mav MFC after: 2 weeks	2015-04-16 22:35:15 +00:00
Rick Macklem	dda11d4ab9	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks	2015-04-15 20:16:31 +00:00
Robert Watson	eae6da3db4	Use M_SIZE() instead of hand-crafted (and mostly correct) NFSMSIZ() macro in the NFS server; garbage collect now-unused NFSMSIZ() and M_HASCL() macros. Also garbage collect now-unused versions in headers for the removed previous NFS client and server. Reviewed by: rmacklem Sponsored by: EMC / Isilon Storage Division	2015-01-07 17:22:56 +00:00
Rick Macklem	52f1bb38c2	A deadlock in the NFSv4 server with vfs.nfsd.enable_locallocks=1 was reported via email. This was caused by a LOR between the sleep lock used to serialize the local locking (nfsrv_locklf()) and locking the vnode. I believe this patch fixes the problem by delaying relocking of the vnode until the sleep lock is unlocked (nfsrv_unlocklf()). To avoid nfsvno_advlock() having the side effect of unlocking the vnode, unlocking the vnode was moved to before the functions that call nfsvno_advlock(). It shouldn't affect the execution of the default case where vfs.nfsd.enable_locallocks=0. Reported by: loic.blot@unix-experience.fr Discussed with: kib MFC after: 1 week	2014-12-25 01:55:17 +00:00
Konstantin Belousov	6c21f6edb8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitely knowns about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
Konstantin Belousov	42ecb595f2	Allow the vfs.nfsd knobs to be set from loader.conf (or using kenv(8)). This is useful when nfsd is loaded as module. Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-10-27 07:47:13 +00:00
Marcelo Araujo	f9246664f5	Make the sysctl(8) for checkutf8 positively defined and improve the description of it. Submitted by: Ronald Klop <ronald-lists@klop.ws> Reviewed by: rmacklem Approved by: rmacklem Sponsored by: QNAP Systems Inc.	2014-10-17 02:11:09 +00:00
Marcelo Araujo	3dd6b7ff3d	Add two sysctl(8) to enable/disable NFSv4 server to check when setting user nobody and/or setting group nogroup as owner of a file or directory. Usually at the client side, if there is an username that is not in the client's passwd database, some clients will send 'nobody@<your.dns.domain>' in the wire and the NFSv4 server will treat it as an ERROR. However, if you have a valid user nobody in your passwd database, the NFSv4 server will treat it as a NFSERR_BADOWNER as its believes the client doesn't has the username mapped. Submitted by: Loic Blot <loic.blot@unix-experience.fr> Reviewed by: rmacklem Approved by: rmacklem MFC after: 2 weeks	2014-10-16 02:24:19 +00:00
Marcelo Araujo	d8a5961f88	Fix failures and warnings reported by newpynfs20090424 test tool. This fix addresses only issues with the pynfs reports, none of these issues are know to create problems for extant real clients. Submitted by: Bart Hsiao <bart.hsiao@gmail.com> Reworked by: myself Reviewed by: rmacklem Approved by: rmacklem Sponsored by: QNAP Systems Inc.	2014-10-03 02:24:41 +00:00
Rick Macklem	03738f6076	Change the NFS server's printf related to hitting the DRC cache's flood level so that it suggests increasing vfs.nfsd.tcphighwater. Suggested by: h.schmalzbauer@omnilan.de	2014-08-10 01:13:32 +00:00
Konstantin Belousov	e7375b6fa5	Do not generate 1000 unique lock names for nfsrc hash chain locks. It overflows witness. Shorten the names of some nfs mutexes. Reported and tested by: pho No objections from: rmacklem, mav Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-31 19:24:44 +00:00
Rick Macklem	6c7d2293d3	The new NFSv3 server did not generate directory postop attributes for the reply to ReaddirPlus when the server failed within the loop that calls VFS_VGET(). This failure is most likely an error return from VFS_VGET() caused by a bogus d_fileno that was truncated to 32bits. This patch fixes the server so that it will return directory postop attributes for the failure. It does not fix the underlying issue caused by d_fileno being uint32_t when a file system like ZFS generates a fileno that is greater than 32bits. Reported by: jpaetzel Reviewed by: jpaetzel MFC after: 1 month	2014-07-04 22:47:07 +00:00
Rick Macklem	c59e4cc34d	Merge the NFSv4.1 server code in projects/nfsv4.1-server over into head. The code is not believed to have any effect on the semantics of non-NFSv4.1 server behaviour. It is a rather large merge, but I am hoping that there will not be any regressions for the NFS server. MFC after: 1 month	2014-07-01 20:47:16 +00:00
Bryan Drewery	4fc0f18c20	Change NFS readdir() to only ignore cookies preceding the given offset for UFS rather than for all but ZFS. This code was assuming that offsets were monotonically increasing for all file systems except ZFS and that the cookies from a previous call may have been rewound to a block boundary. According to mckusick@ only UFS is known to do this, so only requests against UFS file systems should remove cookies smaller than the given offset. This fixes serving TMPFS over NFS as it too does not have monotonically increasing offsets. The comment around the code also indicated it was specific to UFS. Some of the code using 'not_zfs' is specific to ZFS snapshot handling, so add a 'is_zfs' variable for those cases. It's possible that 'is_zfs' check for VFS_VGET() support may not be specific to ZFS. This needs more research and testing. After this fix TMPFS and other file systems can be served over NFS. To test I compared the results of syncing a /usr/src tree into a tmpfs and serving that over NFS. Before the fix 3589 files were missing on the remote view. After the fix all files were successfully found. Reviewed by: rmacklem Discussed with: mckusick, rmacklem via fs@ Discussed at: http://lists.freebsd.org/pipermail/freebsd-fs/2014-April/019264.html MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2014-07-01 20:00:35 +00:00
Rick Macklem	ca4defd583	The new NFS server would not allow a hard link to be created to a symlink. This restriction (which was inherited from OpenBSD) is not required by the NFS RFCs. Since this is allowed by the old NFS server, it is a POLA violation to not allow it. This patch modifies the new NFS server to allow this. Reported by: jhb Reviewed by: jhb MFC after: 3 days	2014-06-06 21:38:49 +00:00
Rick Macklem	ca20bd924f	The new draft specification for NFSv4.0 specifies that a server should either accept owner and owner_group strings that are just the digits of the uid/gid or return NFS4ERR_BADOWNER. This patch adds a sysctl vfs.nfsd.enable_stringtouid, which can be set to enable the server w.r.t. accepting numeric string. It also ensures that NFS4ERR_BADOWNER is returned if numeric uid/gid strings are not enabled. This fixes the server for recent Linux nfs4 clients that use numeric uid/gid strings by default. Reported and tested by: craigyk@gmail.com MFC after: 2 weeks	2014-05-03 00:13:45 +00:00
Rick Macklem	3c53f923dc	The PR reported that the old NFS server did not set uio_td == NULL for the VOP_READ() call. This patch fixes both the old and new server for this case. PR: 185232 Submitted by: PR had patch for old server Reviewed by: kib MFC after: 2 weeks	2014-04-24 20:47:58 +00:00
Rick Macklem	ab7f24103e	Remove an unnecessary level of indirection for an argument. This simplifies the code and should avoid the clang sparc port from generating an abort() call. Requested by: rdivacky Submitted by: jhb MFC after: 2 weeks	2014-04-23 23:13:46 +00:00
Xin LI	25bfde79d6	Fix NFS deadlock vulnerability. [SA-14:05] Fix "Heartbleed" vulnerability and ECDSA Cache Side-channel Attack in OpenSSL. [SA-14:06]	2014-04-08 18:27:32 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Alexander Motin	6103bae6ae	Fix lock leak in purely hypothetical case of TCP connection without SVC_ACK method. This change should be NOP now, but it is better to be future safe. Reported by: rmacklem	2014-01-14 20:18:38 +00:00
Alexander Motin	45e18ea7ea	Fix off-by-one error in r260229. Coverity CID: 1148955	2014-01-07 11:43:51 +00:00
Alexander Motin	d473bac729	Rework NFS Duplicate Request Cache cleanup logic. - Introduce additional hash to group requests by hash of sockref. This allows to process TCP acknowledgements without looping though all the cache, and as result allows to do it every time. - Indroduce additional callbacks to notify application layer about sockets disconnection. Without this last few requests processed just before socket disconnection never processed their ACKs and stuck in cache for many hours. - Implement transport-specific method for tracking reply acknowledgements. New implementation does not cross multiple stack layers to get the data and does not have race conditions that previously made some requests stuck in cache. This could be done more efficiently at sockbuf layer, but that would broke some KBIs, while I don't know other consumers for it aside NFS. - Instead of traversing all DRC twice per request, run cleaning only once per request, and except in some conditions traverse only single hash slot at a time. Together this limits NFS DRC growth only to situations of real connectivity problems. If network is working well, and so all replies are acknowledged, cache remains almost empty even after hours of heavy load. Without this change on the same test cache was growing to many thousand requests even with perfectly working local network. As another result this reduces CPU time spent on the DRC handling during SPEC NFS benchmark from about 10% to 0.5%. Sponsored by: iXsystems, Inc.	2014-01-03 15:09:59 +00:00
Alexander Motin	1555cf04fc	Slightly simplify expiration logic introduced in r254337. - Do not update the histogram for items we are any way deleting from cache. - Do not update the histogram if nfsrc_tcphighwater is not set. - Remove some extra math operations.	2013-12-25 16:58:42 +00:00
Rick Macklem	43a213bb92	The NFSv4 server would call VOP_SETATTR() with a shared locked vnode when a Getattr for a file is done by a client other than the one that holds the file's delegation. This would only happen when delegations are enabled and the problem is fixed by this patch. MFC after: 1 week	2013-12-25 01:03:14 +00:00
Rick Macklem	0c695afb96	An intermittent problem with NFSv4 exporting of ZFS snapshots was reported to the freebsd-fs mailing list. I believe the problem was caused by the Readdir operation using VFS_VGET() for a snapshot file entry instead of VOP_LOOKUP(). This would not occur for NFSv3, since it will do a VFS_VGET() of "." which fails with ENOTSUPP at the beginning of the directory, whereas NFSv4 does not check "." or "..". This patch adds a call to VFS_VGET() for the directory being read to check for ENOTSUPP. I also observed that the mount_on_fileid and fsid attributes were not correct at the snapshot's auto mountpoints when looking at packet traces for the Readdir. This patch fixes the attributes by doing a check for different v_mount structure, even if the vnode v_mountedhere is not set. Reported by: jas@cse.yorku.ca Tested by: jas@cse.yorku.ca Reviewed by: asomers MFC after: 1 week	2013-12-24 22:24:17 +00:00
Alexander Motin	10f8f58d4a	Fix RPC server threads file handle affinity to work better with ZFS. Instead of taking 8 specific bytes of file handle to identify file during RPC thread affitinity handling, use trivial hash of the full file handle. ZFS's struct zfid_short does not have padding field after the length field, as result, originally picked 8 bytes are loosing lower 16 bits of object ID, causing many false matches and unneeded requests affinity to same thread. This fix substantially improves NFS server latency and scalability in SPEC NFS benchmark by more flexible use of multiple NFS threads. Sponsored by: iXsystems, Inc.	2013-12-23 08:43:16 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Rick Macklem	93c5875b24	Fix several performance related issues in the new NFS server's DRC for NFS over TCP. - Increase the size of the hash tables. - Create a separate mutex for each hash list of the TCP hash table. - Single thread the code that deletes stale cache entries. - Add a tunable called vfs.nfsd.tcphighwater, which can be increased to allow the cache to grow larger, avoiding the overhead of frequent scans to delete stale cache entries. (The default value will result in frequent scans to delete stale cache entries, analagous to what the pre-patched code does.) - Add a tunable called vfs.nfsd.cachetcp that can be used to disable DRC caching for NFS over TCP, since the old NFS server didn't DRC cache TCP. It also adjusts the size of nfsrc_floodlevel dynamically, so that it is always greater than vfs.nfsd.tcphighwater. For UDP the algorithm remains the same as the pre-patched code, but the tunable vfs.nfsd.udphighwater can be used to allow the cache to grow larger and reduce the overhead caused by frequent scans for stale entries. UDP also uses a larger hash table size than the pre-patched code. Reported by: wollman Tested by: wollman (earlier version of patch) Submitted by: ivoras (earlier patch) Reviewed by: jhb (earlier version of patch) MFC after: 1 month	2013-08-14 21:11:26 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Dag-Erling Smørgrav	72ccd4cc6b	Fix typo in comment. Submitted by: Alex Weber <alexwebr@gmail.com> MFC after: 1 week	2013-05-15 08:38:49 +00:00
Dag-Erling Smørgrav	c93c82f464	Fix a bug that allows NFS clients to issue READDIR on files. PR: kern/178016 Security: CVE-2013-3266 Security: FreeBSD-SA-13:05.nfsserver	2013-04-29 20:09:44 +00:00
Kenneth D. Merry	adb974068b	Move the NFS FHA (File Handle Affinity) code from sys/nfsserver to sys/nfs, since it is now shared by the two NFS servers. Suggested by: rmacklem Sponsored by: Spectra Logic MFC after: 2 weeks	2013-04-17 22:42:43 +00:00
Kenneth D. Merry	d96b98a360	Revamp the old NFS server's File Handle Affinity (FHA) code so that it will work with either the old or new server. The FHA code keeps a cache of currently active file handles for NFSv2 and v3 requests, so that read and write requests for the same file are directed to the same group of threads (reads) or thread (writes). It does not currently work for NFSv4 requests. They are more complex, and will take more work to support. This improves read-ahead performance, especially with ZFS, if the FHA tuning parameters are configured appropriately. Without the FHA code, concurrent reads that are part of a sequential read from a file will be directed to separate NFS threads. This has the effect of confusing the ZFS zfetch (prefetch) code and makes sequential reads significantly slower with clients like Linux that do a lot of prefetching. The FHA code has also been updated to direct write requests to nearby file offsets to the same thread in the same way it batches reads, and the FHA code will now also send writes to multiple threads when needed. This improves sequential write performance in ZFS, because writes to a file are now more ordered. Since NFS writes (generally less than 64K) are smaller than the typical ZFS record size (usually 128K), out of order NFS writes to the same block can trigger a read in ZFS. Sending them down the same thread increases the odds of their being in order. In order for multiple write threads per file in the FHA code to be useful, writes in the NFS server have been changed to use a LK_SHARED vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem doesn't allow multiple writers to a file at once. ZFS is currently the only filesystem that allows multiple writers to a file, because it has internal file range locking. This change does not affect the NFSv4 code. This improves random write performance to a single file in ZFS, since we can now have multiple writers inside ZFS at one time. I have changed the default tuning parameters to a 22 bit (4MB) window size (from 256K) and unlimited commands per thread as a result of my benchmarking with ZFS. The FHA code has been updated to allow configuring the tuning parameters from loader tunable variables in addition to sysctl variables. The read offset window calculation has been slightly modified as well. Instead of having separate bins, each file handle has a rolling window of bin_shift size. This minimizes glitches in throughput when shifting from one bin to another. sys/conf/files: Add nfs_fha_new.c and nfs_fha_old.c. Compile nfs_fha.c when either the old or the new NFS server is built. sys/fs/nfs/nfsport.h, sys/fs/nfs/nfs_commonport.c: Bring in changes from Rick Macklem to newnfs_realign that allow it to operate in blocking (M_WAITOK) or non-blocking (M_NOWAIT) mode. sys/fs/nfs/nfs_commonsubs.c, sys/fs/nfs/nfs_var.h: Bring in a change from Rick Macklem to allow telling nfsm_dissect() whether or not to wait for mallocs. sys/fs/nfs/nfsm_subs.h: Bring in changes from Rick Macklem to create a new nfsm_dissect_nonblock() inline function and NFSM_DISSECT_NONBLOCK() macro. sys/fs/nfs/nfs_commonkrpc.c, sys/fs/nfsclient/nfs_clkrpc.c: Add the malloc wait flag to a newnfs_realign() call. sys/fs/nfsserver/nfs_nfsdkrpc.c: Setup the new NFS server's RPC thread pool so that it will call the FHA code. Add the malloc flag argument to newnfs_realign(). Unstaticize newnfs_nfsv3_procid[] so that we can use it in the FHA code. sys/fs/nfsserver/nfs_nfsdsocket.c: In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types that use the LK_SHARED lock type. sys/fs/nfsserver/nfs_nfsdport.c: In nfsd_fhtovp(), if we're starting a write, check to see whether the underlying filesystem supports shared writes. If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE. sys/nfsserver/nfs_fha.c: Remove all code that is specific to the NFS server implementation. Anything that is server-specific is now accessed through a callback supplied by that server's FHA shim in the new softc. There are now separate sysctls and tunables for the FHA implementations for the old and new NFS servers. The new NFS server has its tunables under vfs.nfsd.fha, the old NFS server's tunables are under vfs.nfsrv.fha as before. In fha_extract_info(), use callouts for all server-specific code. Getting file handles and offsets is now done in the individual server's shim module. In fha_hash_entry_choose_thread(), change the way we decide whether two reads are in proximity to each other. Previously, the calculation was a simple shift operation to see whether the offsets were in the same power of 2 bucket. The issue was that there would be a bucket (and therefore thread) transition, even if the reads were in close proximity. When there is a thread transition, reads wind up going somewhat out of order, and ZFS gets confused. The new calculation simply tries to see whether the offsets are within 1 << bin_shift of each other. If they are, the reads will be sent to the same thread. The effect of this change is that for sequential reads, if the client doesn't exceed the max_reqs_per_nfsd parameter and the bin_shift is set to a reasonable value (22, or 4MB works well in my tests), the reads in any sequential stream will largely be confined to a single thread. Change fha_assign() so that it takes a softc argument. It is now called from the individual server's shim code, which will pass in the softc. Change fhe_stats_sysctl() so that it takes a softc parameter. It is now called from the individual server's shim code. Add the current offset to the list of things printed out about each active thread. Change the num_reads and num_writes counters in the fha_hash_entry structure to 32-bit values, and rename them num_rw and num_exclusive, respectively, to reflect their changed usage. Add an enable sysctl and tunable that allows the user to disable the FHA code (when vfs.XXX.fha.enable = 0). This is useful for before/after performance comparisons. nfs_fha.h: Move most structure definitions out of nfs_fha.c and into the header file, so that the individual server shims can see them. Change the default bin_shift to 22 (4MB) instead of 18 (256K). Allow unlimited commands per thread. sys/nfsserver/nfs_fha_old.c, sys/nfsserver/nfs_fha_old.h, sys/fs/nfsserver/nfs_fha_new.c, sys/fs/nfsserver/nfs_fha_new.h: Add shims for the old and new NFS servers to interface with the FHA code, and callbacks for the The shims contain all of the code and definitions that are specific to the NFS servers. They setup the server-specific callbacks and set the server name for the sysctl and loader tunable variables. sys/nfsserver/nfs_srvkrpc.c: Configure the RPC code to call fhaold_assign() instead of fha_assign(). sys/modules/nfsd/Makefile: Add nfs_fha.c and nfs_fha_new.c. sys/modules/nfsserver/Makefile: Add nfs_fha_old.c. Reviewed by: rmacklem Sponsored by: Spectra Logic MFC after: 2 weeks	2013-04-17 21:00:22 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Pawel Jakub Dawidek	2609222ab4	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
John Baldwin	a89a2c8ba4	Further cleanups to use of timestamps in NFS: - Use NFSD_MONOSEC (which maps to time_uptime) instead of the seconds portion of wall-time stamps to manage timeouts on events. - Remove unused nd_starttime from the per-request structure in the new NFS server. - Use nanotime() for the modification time on a delegation to get as precise a time as possible. - Use time_second instead of extracting the second from a call to getmicrotime(). Submitted by: bde (3) Reviewed by: bde, rmacklem MFC after: 2 weeks	2013-01-25 15:25:24 +00:00
Xin LI	e4558aacfc	Make it possible to force async at server side on new NFS server, similar to the old one's nfs.nfsrv.async. Please note that by enabling this option (default is disabled), the system could potentionally have silent data corruption if the server crashes before write is committed to non-volatile storage, as the client side have no way to tell if the data is already written. Submitted by: rmacklem MFC after: 2 weeks	2013-01-18 19:42:08 +00:00
John Baldwin	d177f14da9	Use vfs_timestamp() to set file timestamps rather than invoking getmicrotime() or getnanotime() directly in NFS. Reviewed by: rmacklem, bde MFC after: 1 week	2013-01-18 18:43:38 +00:00
Rick Macklem	1f60bfd822	Move the NFSv4.1 client patches over from projects/nfsv4.1-client to head. I don't think the NFS client behaviour will change unless the new "minorversion=1" mount option is used. It includes basic NFSv4.1 support plus support for pNFS using the Files Layout only. All problems detecting during an NFSv4.1 Bakeathon testing event in June 2012 have been resolved in this code and it has been tested against the NFSv4.1 server available to me. Although not reviewed, I believe that kib@ has looked at it.	2012-12-08 22:52:39 +00:00
Gleb Smirnoff	eb1b1807af	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually	2012-12-05 08:04:20 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Rick Macklem	6001db296e	Add two new options to the nfssvc(2) syscall that allow processes running as root to suspend/resume execution of the kernel nfsd threads. An earlier version of this patch was tested by Vincent Hoffman (vince at unsane.co.uk) and John Hickey (jh at deterlab.net). Reviewed by: kib MFC after: 2 weeks	2012-10-14 22:33:17 +00:00
Konstantin Belousov	877d24ac8a	Fix the mis-handling of the VV_TEXT on the nullfs vnodes. If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from lower filesystem, the lower vnode gets VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you still can open the lower vnode for write. Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode. Tested by: pho (previous version) MFC after: 2 weeks	2012-09-28 11:25:02 +00:00
Rick Macklem	c52005a31d	Modify the NFSv4 client so that it can handle owner and owner_group strings that consist entirely of digits, interpreting them as the uid/gid number. This change was needed since new (>= 3.3) Linux servers reply with these strings by default. This change is mandated by the rfc3530bis draft. Reported on freebsd-stable@ under the Subject heading "Problem with Linux >= 3.3 as NFSv4 server" by Norbert Aschendorff on Aug. 20, 2012. Tested by: norbert.aschendorff at yahoo.de Reviewed by: jhb MFC after: 2 weeks	2012-09-20 02:49:25 +00:00
Rick Macklem	2108487ead	Fix two cases in the new NFS server where a tsleep() is used, when the code should actually protect the tested variable with a mutex. Since the tsleep()s had a 10sec timeout, the race would have only delayed the allocation of a new clientid for a client. The sleeps will also rarely occur, since having a callback in progress when a client acquires a new clientid, is unlikely. in practice, since having a callback in progress when a fresh clientid is being acquired by a client is unlikely. MFC after: 1 month	2012-05-12 22:20:55 +00:00
John W. De Boskey	3676a0d890	Use the common api helper routine instead of freeing the namei buffer directly. Approved by: rmacklem (mentor) MFC after: 1 month	2012-05-08 03:39:44 +00:00
Rick Macklem	a607cc6d8e	Fix a leak of namei lookup path buffers that occurs when a ZFS volume is exported via the new NFS server. The leak occurred because the new NFS server code didn't handle the case where a file system sets the SAVENAME flag in its VOP_LOOKUP() and ZFS does this for the DELETE case. Tested by: Oliver Brandmueller (ob at gruft.de), hrs PR: kern/167266 MFC after: 1 month	2012-04-27 20:23:24 +00:00
Kirk McKusick	f257ebbb2e	This change creates a new list of active vnodes associated with a mount point. Active vnodes are those with a non-zero use or hold count, e.g., those vnodes that are not on the free list. Note that this list is in addition to the list of all the vnodes associated with a mount point. To avoid adding another set of linkage pointers to the vnode structure, the active list uses the existing linkage pointers used by the free list (previously named v_freelist, now renamed v_actfreelist). This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2012-04-20 06:50:44 +00:00
Rick Macklem	b76ec2db93	The name caching changes of r230394 exposed an intermittent bug in the new NFS server for NFSv4, where it would report ENOENT when the file actually existed on the server. This turned out to be caused by not initializing ni_topdir before calling lookup() and there was a rare case where the value on the stack location assigned to ni_topdir happened to be a pointer to a ".." entry, such that "dp == ndp->ni_topdir" succeeded in lookup(). This patch initializes ni_topdir to fix the problem. MFC after: 5 days	2012-03-03 16:13:20 +00:00
Rick Macklem	7cfce7cec7	hrs@ reported a panic to freebsd-stable@ under the subject line "panic in 8.3-PRERELEASE" on Feb. 22, 2012. This panic was caused by use of a mix of tsleep() and msleep() calls on the same event in the new NFS server DRC code. It did "mtx_unlock(); tsleep();" in two places, which kib@ noted introduced a slight risk that the wakeup() would occur before the tsleep(), resulting in a 10sec delay before waking up. This patch fixes the problem by replacing "mtx_unlock(); tsleep();" with mtx_sleep(..PDROP..). It also changes a nfsmsleep() call to mtx_sleep() so that the code uses mtx_sleep() consistently within the file. Tested by: hrs (in progress) Reviewed by: jhb MFC after: 5 days	2012-02-23 16:47:05 +00:00
Konstantin Belousov	526d0bd547	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month	2012-02-21 01:05:12 +00:00
Rick Macklem	13b2772f8e	Delete a couple of out of date comments that are no longer true in the new NFS client. Requested by: bde MFC after: 1 week	2012-02-16 02:19:53 +00:00
Rick Macklem	5b79362b47	Tai Horgan reported via email that there were two places in the new NFSv4 server where the code follows the wrong list. Fortunately, for these fairly rare cases, the lc_stateid[] lists are normally empty. This patch fixes the code to follow the correct list. Reported by: tai.horgan at isilon.com Discussed with: zack MFC after: 2 weeks	2012-01-14 04:04:58 +00:00
Rick Macklem	22ea9f58f0	Patch the new NFS server in a manner analagous to r228520 for the old NFS server, so that it correctly handles a count == 0 argument for Commit. PR: kern/118126 MFC after: 2 weeks	2011-12-16 00:58:41 +00:00
Rick Macklem	34f2e649d0	This patch adds a sysctl to the NFSv4 server which optionally disables the check for a UTF-8 compliant file name. Enabling this sysctl results in an NFSv4 server that is non-RFC3530 compliant, therefore it is not enabled by default. However, enabling this sysctl results in NFSv3 compatible behaviour and fixes the problem reported by "dan at sunsaturn.com" to freebsd-current@ on Nov. 14, 2011 under the subject "NFSV4 readlink_stat". Tested by: dan at sunsaturn.com Reviewed by: zack MFC after: 2 weeks	2011-12-04 16:33:04 +00:00
John Baldwin	574862c8ba	Enhance the sequential access heuristic used to perform readahead in the NFS server and reuse it for writes as well to allow writes to the backing store to be clustered. - Use a prime number for the size of the heuristic table (1017 is not prime). - Move the logic to locate a heuristic entry from the table and compute the sequential count out of VOP_READ() and into a separate routine. - Use the logic from sequential_heuristic() in vfs_vnops.c to update the seqcount when a sequential access is performed rather than just increasing seqcount by 1. This lets the clustering count ramp up faster. - Allow for some reordering of RPCs and if it is detected leave the current seqcount as-is rather than dropping back to a seqcount of 1. Also, when out of order access is encountered, cut seqcount in half rather than dropping it all the way back to 1 to further aid with reordering. - Fix the new NFS server to properly update the next offset after a successful VOP_READ() so that the readahead actually works. Some of these changes came from an earlier patch by Bjorn Gronwall that was forwarded to me by bde@. Discussed with: bde, rmacklem, fs@ Submitted by: Bjorn Gronwall (1, 4) MFC after: 2 weeks	2011-12-01 18:46:28 +00:00
Rick Macklem	6854d64811	This patch enables the new/default NFS server's use of shared vnode locking for read, readdir, readlink, getattr and access. It is hoped that this will improve server performance for these operations, since they will no longer be serialized for a given file/vnode.	2011-11-22 00:35:30 +00:00
Kip Macy	8451d0dd78	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
Rick Macklem	322c8d9e25	Fix the NFS servers so that they can do a Lookup of "..", which requires that ni_strictrelative be set to 0, post-r224810. Tested by: swills (earlier version), geo dot liaskos at gmail.com Approved by: re (kib)	2011-09-03 00:28:53 +00:00
Rick Macklem	de67b4966c	Fix the NFSv4 server so that it returns NFSERR_SYMLINK when an attempt to do an Open operation on any type of file other than VREG is done. A recent discussion on the IETF working group's mailing list (nfsv4@ietf.org) decided that NFSERR_SYMLINK should be returned for all non-regular files and not just symlinks, so that the Linux client would work correctly. This change does not affect the FreeBSD NFSv4 client and is not believed to have a negative effect on other NFSv4 clients. Reviewed by: zkirsch Approved by: re (kib) MFC after: 2 weeks	2011-08-20 21:26:35 +00:00
Jonathan Anderson	985a88e2a6	Fix a merge conflict. r224086 added "goto out"-style error handling to nfssvc_nfsd(), in order to reliably call NFSEXITCODE() before returning. Our Capsicum changes, based on the old "return (error)" model, did not merge nicely. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-16 14:23:16 +00:00
Robert Watson	a9d2f8d84f	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc	2011-08-11 12:30:23 +00:00
Zack Kirsch	06521fbb49	Fix an NFS server issue where it was not correctly setting the eof flag when a READ had hit the end of the file. Also, clean up some cruft in the code. Approved by: re (kib) Reviewed by: rmacklem MFC after: 2 weeks	2011-08-03 18:50:19 +00:00
Rick Macklem	6b3dfc6ab0	Fix rename in the new NFS server so that it does not require a recursive vnode lock on the directory for the case where the new file name is in the same directory as the old one. The patch handles this as a special case, recognized by the new directory having the same file handle as the old one and just VREF()s the old dir vnode for this case, instead of doing a second VFS_FHTOVP() to get it. This is required so that the server will work for file systems like msdosfs, that do not support recursive vnode locking. This problem was discovered during recent testing by pho@ when exporting an msdosfs file system via the new NFS server. Tested by: pho Reviewed by: zkirsch Approved by: re (kib) MFC after: 2 weeks	2011-07-31 20:06:11 +00:00
Zack Kirsch	a9285ae5c4	Add DEXITCODE plumbing to NFS. Isilon has the concept of an in-memory exit-code ring that saves the last exit code of a function and allows for stack tracing. This is very helpful when debugging tough issues. This patch is essentially a no-op for BSD at this point, until we upstream the dexitcode logic itself. The patch adds DEXITCODE calls to every NFS function that returns an errno error code. A number of code paths were also reorganized to have single exit paths, to reduce code duplication. Submitted by: David Kwan <dkwan@isilon.com> Reviewed by: rmacklem Approved by: zml (mentor) MFC after: 2 weeks	2011-07-16 08:51:09 +00:00
Zack Kirsch	68347a92db	Simple find/replace of VOP_ISLOCKED -> NFSVOPISLOCKED. This is done so that NFSVOPISLOCKED can be modified later to add enhanced logging and assertions. Reviewed by: rmacklem Approved by: zml (mentor) MFC after: 2 weeks	2011-07-16 08:05:41 +00:00
Zack Kirsch	a998963469	Simple find/replace of VOP_UNLOCK -> NFSVOPUNLOCK. This is done so that NFSVOPUNLOCK can be modified later to add enhanced logging and assertions. Reviewed by: rmacklem Approved by: zml (mentor) MFC after: 2 weeks	2011-07-16 08:05:36 +00:00
Zack Kirsch	98f234f338	Simple find/replace of vn_lock -> NFSVOPLOCK. This is done so that NFSVOPLOCK can be modified later to add enhanced logging and assertions. Reviewed by: rmacklem Approved by: zml (mentor) MFC after: 2 weeks	2011-07-16 08:05:31 +00:00
Zack Kirsch	c383087c0c	Remove unnecessary thread pointer from VOPLOCK macros and current users. Reviewed by: rmacklem Approved by: zml (mentor) MFC after: 2 weeks	2011-07-16 08:05:26 +00:00
Zack Kirsch	40435b74f4	Move nfsvno_pathconf to be accessible to sys/fs/nfs; no functionality change. Reviewed by: rmacklem Approved by: zml (mentor) MFC after: 2 weeks	2011-07-16 08:05:17 +00:00
Zack Kirsch	b008a72c86	Small acl patch to return the aclerror that comes back from nfsrv_dissectacl(). This fixes a problem where ATTRNOTSUPP was being returned instead of BADOWNER. Reviewed by: rmacklem Approved by: zml (mentor) MFC after: 2 weeks	2011-07-16 08:04:57 +00:00
Rick Macklem	53f476cab3	Fix the new NFSv4 server so that it checks for VREAD_ACL when a client does a Getattr for an ACL and not VREAD_ATTRIBUTES. This was found during the recent NFSv4 interoperability Bakeathon. MFC after: 2 weeks	2011-06-21 19:58:29 +00:00
Rick Macklem	37b88c2d51	Fix the new NFSv4 server so that it only allows Lookup of directories and symbolic links when traversing non-exported file systems. Found during the recent NFSv4 interoperability Bakeathon. MFC after: 2 weeks	2011-06-20 22:02:01 +00:00
Rick Macklem	5a55e04ffa	Fix the new NFSv4 server so that it allows Access and Readlink operations while traversing non-exported file systems. This is required for some non-FreeBSD clients to do NFSv4 mounts. Found during the recent NFSv4 interoperability Bakeathon. MFC after: 2 weeks	2011-06-20 21:57:26 +00:00
Rick Macklem	4e22c98a39	Fix a number of places where the new NFS server did not lock the mutex when manipulating rc_flag in the DRC cache. This is believed to fix a hung server that was reported to the freebsd-fs@ list on June 9 under the subject heading "New NFS server stress test hang", where all the threads were waiting for the RC_LOCKED flag to clear. Tested by: jwd at slowblink.com MFC after: 2 weeks	2011-06-19 23:54:01 +00:00
Rick Macklem	7e7fd7d177	Fix the kgssapi so that it can be loaded as a module. Currently the NFS subsystems use five of the rpcsec_gss/kgssapi entry points, but since it was not obvious which others might be useful, all nineteen were included. Basically the nineteen entry points are set in a structure called rpc_gss_entries and inline functions defined in sys/rpc/rpcsec_gss.h check for the entry points being non-NULL and then call them. A default value is returned otherwise. Requested by rwatson. Reviewed by: jhb MFC after: 2 weeks	2011-06-19 22:08:55 +00:00
Rick Macklem	c5c142f652	Modify the new NFS server so that the NFSv3 Pathconf RPC doesn't return an error when the underlying file system lacks support for any of the four _PC_xxx values used, by falling back to default values. Tested by: avg MFC after: 2 weeks	2011-06-04 01:13:09 +00:00
Rick Macklem	ff29f3b241	Fix the new NFS client so that it handles NFSv4 state correctly during a forced dismount. This required that the exclusive and shared (refcnt) sleep lock functions check for MNTK_UMOUNTF before sleeping, so that they won't block while nfscl_umount() is getting rid of the state. As such, a "struct mount *" argument was added to the locking functions. I believe the only remaining case where a forced dismount can get hung in the kernel is when a thread is already attempting to do a TCP connect to a dead server when the krpc client structure called nr_client is NULL. This will only happen just after a "mount -u" with options that force a new TCP connection is done, so it shouldn't be a problem in practice. MFC after: 2 weeks	2011-05-27 22:05:10 +00:00
Rick Macklem	694a586a43	Add a lock flags argument to the VFS_FHTOVP() file system method, so that callers can indicate the minimum vnode locking requirement. This will allow some file systems to choose to return a LK_SHARED locked vnode when LK_SHARED is specified for the flags argument. This patch only adds the flag. It does not change any file system to use it and all callers specify LK_EXCLUSIVE, so file system semantics are not changed. Reviewed by: kib	2011-05-22 01:07:54 +00:00
Rick Macklem	a0c2c3691c	Change the new NFS server so that it uses vfs.nfsd naming for its sysctls instead of vfs.newnfs. This separates the names from the ones used by the client.	2011-05-08 01:01:27 +00:00

1 2 3 4 5 ...

339 Commits