doing so adds more flexibility with less redundant code.
Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250
When the byte range for copy_file_range(2) doesn't go to EOF on the
output file and there is a hole in the input file, a hole must be
"punched" in the output file. This is done by writing a block of bytes
all set to 0.
Without this patch, the write is done unconditionally which means that,
if the output file already has a hole in that byte range, a unneeded data block
of all 0 bytes would be allocated.
This patch adds code to check for a hole in the output file, so that it can
skip doing the write if there is already a hole in that byte range of
the output file. This avoids unnecessary allocation of blocks to the
output file.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D21155
ensure that the subsequent mbuf contains the remainder of the bytes
the caller sought. If this is not the case, fall through to the code
which gathers the bytes in a new mbuf.
This fixes a bug where m_pulldown() could fail to gather all the desired
bytes into consecutive memory.
PR: 238787
Reported by: A reddit user
Discussed with: emaste
Obtained from: NetBSD
MFC after: 3 days
An earlier version of the patch had code that set "error" between
line#s 2797-2799. When that code was moved, the second check for "error != 0"
could never be true and the check became harmless cruft.
This patch removes the cruft, mainly to make Coverity happy.
Reported by: asomers, cem
Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the
vnode unlocked, it is possible for another thread to do:
- truncate(), lseek(), write()
between the two calls and create a hole where FIOSEEKDATA returned the
start of data.
For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for
the hole location. This could result in an infinite loop in the copy
code, since copylen is set to 0 and the copy doesn't advance.
Usually, this race is avoided because of the use of rangelocks, but the
NFS server does not do range locking and could do a sequence like the
above to create the hole.
This patch checks for this case and makes the hole search fail, to avoid
the infinite loop.
At this time, it is an open question as to whether or not the NFS server
should do range locking to avoid this race.
KDB is standard and the kdb_active variable is always available. So,
de-conditionalize inclusion of sys/kdb.h in kern_sysctl.c.
Reported by: Michael Butler <imb AT protected-networks.net>
X-MFC-With: r350713
Sponsored by: Dell EMC Isilon
Implement `sysctl` in `ddb` by overriding `SYSCTL_OUT`. When handling the
req, we install custom ddb in/out handlers. The out handler prints straight
to the debugger, while the in handler ignores all input. This is intended
to allow us to print just about any sysctl.
There is a known issue when used from ddb(4) entered via 'sysctl
debug.kdb.enter=1'. The DDB mode does not quite prevent all lock
interactions, and it is possible for the recursive Giant lock to be unlocked
when the ddb(4) 'sysctl' command is used. This may result in a panic on
return from ddb(4) via 'c' (continue). Obviously, this is not a problem
when debugging already-paniced systems.
Submitted by: Travis Lane (formerly: <travis.lane AT isilon.com>)
Reviewed by: vangyzen (earlier version), Don Morris <dgmorris AT earthlink.net>
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20219
The API is used to gracefully terminate text line(s) with a single \n. If
the formatted buffer was empty or already ended in \n, it is unmodified.
Otherwise, a newline character is appended to it. The API, like other
sbuf-modifying routines, is only valid while the sbuf is not FINISHED.
Reviewed by: rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21030
Code flow was somewhat difficult to read due to the combination of
multiple return sites and the 4x possible dynamic constructions of an
sbuf. (Future consideration: do we need all 4?) Refactored slightly to
improve legibility.
No functional change.
Sponsored by: Dell EMC Isilon
The goal is to avoid some kinds of low-memory deadlock when formatting
heap-allocated buffers.
Reviewed by: vangyzen
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21015
We don't need to check if the parent is already set.
This is done already in the proc_reparent.
No functional behaviour changes intended.
MFC after: 1 month
The process is reparented to the debugger while it is attached.
B B
/ ----> |
A A D
Every time when the process is reparented, it is added to the orphan list
of the previous parent:
A->orphan = B
D->orphan = NULL
When the A process will close the process descriptor to the B process,
the B process will be reparented to the init process.
B B - init
| ---->
A D A D
A->orphan = B
D->orphan = B
In this scenario, the B process is in the orphan list of A and D.
When the last process descriptor is closed instead of reparenting
it to the reaper let it stay with the debugger process and set
our previews parent to the reaper.
Add test case for this situation.
Notice that without this patch the kernel will crash with this test case:
panic: orphan 0xfffff8000e990530 of 0xfffff8000e990000 has unexpected oppid 1
Reviewed by: markj, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D20361
In case of the process being debugged. The P_TRACED is cleared very early,
which would make procdesc_close() not calling proc_clear_orphan().
That would result in the debugged process can not be able to collect
status of the process with process descriptor.
Reviewed by: markj, kib
Tested by: pho
MFC after: 1 month
PowerPC, and possibly other architectures, use different address ranges for
PCI space vs physical address space, which is only mapped at resource
activation time, when the BAR gets written. The DRM kernel modules do not
activate the rman resources, soas not to waste KVA, instead only mapping
parts of the PCI memory at a time. This introduces a
BUS_TRANSLATE_RESOURCE() method, implemented in the Open Firmware/FDT PCI
driver, to perform this necessary translation without activating the
resource.
In addition to system KPI changes, LinuxKPI is updated to handle a
big-endian host, by adding proper endian swaps to the I/O functions.
Submitted by: mmacy
Reported by: hselasky
Differential Revision: https://reviews.freebsd.org/D21096
These vnodes are explicitly opened via VOP_OPEN via
exec_check_permissions identical to the main exectuable image.
Setting ISOPEN allows filesystems to perform suitable checks in
VOP_LOOKUP (e.g. close-to-open consistency in the NFS client).
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D21129
Previously we would check for blessings before marking a given lock
pair as reversed, so each "reversed" lock acquisition would require
a linear scan of the table. Instead, check the table after marking
the pair as reversed but before generating a report.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21135
The check for P_SINGLE_EXIT was shadowed by the (P_SHOULDSTOP || traced) check.
Reported by: bdrewery (might be)
Reviewed by: markj
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21124
This effectively makes the stack base on the csu _start entry
randomized.
The gap is enabled if ASLR is for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specify the max percentage of the
initial stack size that can be wasted for gap. Setting it to zero
disables the gap, and max is capped at 50%.
Only amd64 for now.
Reviewed by: cem, markj
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21081
In particular, restart should be only done when the failure is
transient. For this, recheck the count1 value after the operation.
Note that do_sem_wait() is older usem interface.
Reported and tested by: bdrewery
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The motivation for this change is to allow wrappers around shm to be written
that don't set CLOEXEC. kern_shm_open currently accepts O_CLOEXEC but sets
it unconditionally. kern_shm_open is used by the shm_open(2) syscall, which
is mandated by POSIX to set CLOEXEC, and CloudABI's sys_fd_create1().
Presumably O_CLOEXEC is intended in the latter caller, but it's unclear from
the context.
sys_shm_open() now unconditionally sets O_CLOEXEC to meet POSIX
requirements, and a comment has been dropped in to kern_fd_open() to explain
the situation and add a pointer to where O_CLOEXEC setting is maintained for
shm_open(2) correctness. CloudABI's sys_fd_create1() also unconditionally
sets O_CLOEXEC to match previous behavior.
This also has the side-effect of making flags correctly reflect the
O_CLOEXEC status on this fd for the rest of kern_shm_open(), but a
glance-over leads me to believe that it didn't really matter.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21119
witness has long had a facility to "bless" designated lock pairs. Lock
order reversals between a pair of blessed locks are not reported upon.
We have a number of long-standing false positive LOR reports; start
marking well-understood LORs as blessed.
This change hides reports about UFS vnode locks and the UFS dirhash
lock, and UFS vnode locks and buffer locks, since those are the two that
I observe most often. In the long term it would be preferable to be
able to limit blessings to a specific site where a lock is acquired,
and/or extend witness to understand why some lock order reversals are
valid (for example, if code paths with conflicting lock orders are
serialized by a third lock), but in the meantime the false positives
frequently confuse users and generate bug reports.
Reviewed by: cem, kib, mckusick
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21039
copy_file_range() operates on a pair of file descriptors; it requires
CAP_READ for the source descriptor and CAP_WRITE for the destination
descriptor.
Reviewed by: kevans, oshogbo
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21113
The current implementation of gzipped a.out support was based
on a very old version of InfoZIP which ships with an ancient
modified version of zlib, and was removed from the GENERIC
kernel in 1999 when we moved to an ELF world.
PR: 205822
Reviewed by: imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp>
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D21099
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page. Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.
As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O. This is
consistent with the handling of pages by sendfile(SF_NOCACHE).
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20986
This is a partial merge of 350144 from projects/fuse2
PR: 236466
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21095
v_inval_buf_range invalidates all buffers within a certain LBA range of a
file. It will be used by fusefs(5). This commit is a partial merge of
r346162, r346606, and r346756 from projects/fuse2.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21032
This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also me able to use the VOP_COPY_FILE_RANGE() method.
vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.
Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.
Reviewed by: kib, asomers (plus comments by brooks, jilles)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584
turnstile_{lock,unlock}() were added for use in epoch. turnstile_lock()
returned NULL to indicate that the calling thread had lost a race and
the turnstile was no longer associated with the given lock, or the lock
owner. However, reader-writer locks may not have a designated owner,
in which case turnstile_lock() would return NULL and
epoch_block_handler_preempt() would leak spinlocks as a result.
Apply a minimal fix: return the lock owner as a separate return value.
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21048
fget_unlocked() and fhold().
On sufficiently large machine, f_count can be legitimately very large,
e.g. malicious code can dup same fd up to the per-process
filedescriptors limit, and then fork as much as it can.
On some smaller machine, I see
kern.maxfilesperproc: 939132
kern.maxprocperuid: 34203
which already overflows u_int. More, the malicious code can create
transient references by sending fds over unix sockets.
I realized that this check is missed after reading
https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html
Reviewed by: markj (previous version), mjg
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20947
When sendmsg(2) sucessfully internalized one SCM_RIGHTS control
message, but failed to process some other control message later, both
file references and filedescent memory needs to be freed. This was not
done, only mbuf chain was freed.
Noted, test case written, reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21000
I accidentally broke the main point of r349248 when making stylistic changes
in r349391. Restore the original behavior, and also fix an additional
overflow that was possible when uio->uio_resid was nearly SSIZE_MAX.
Reported by: cem
Reviewed by: bde
MFC after: 2 weeks
MFC-With: 349248
Sponsored by: The FreeBSD Foundation