Commit Graph

279 Commits

Author SHA1 Message Date
Robert Watson
4f2cbaf3cd Track pipe(2) reads and writes as rusage message receives and sends, a
feature misplaced during the transition from BSD 4.4's socket implementation
to the optimised FreeBSD pipe implementation.

MFC after:		1 week
Reviewed by:		arichardson, imp
Differential Revision:	https://reviews.freebsd.org/D27878
2021-01-10 12:16:39 +00:00
Mateusz Guzik
2e51c2bfd1 pipe: follow up cleanup to previous
The commited patch was incomplete.

- add back missing goto retry, noted by jhb
- 'if (error)'  -> 'if (error != 0)'
- consistently do:

if (error != 0)
    break;
continue;

instead of:

if (error != 0)
    break;
else
    continue;

This adds some 'continue' uses which are not needed, but line up with the
rest of pipe_write.
2020-11-25 22:53:21 +00:00
Mateusz Guzik
c8df8543fd pipe: drop spurious pipeunlock/pipelock cycle on write 2020-11-25 21:41:23 +00:00
Mateusz Guzik
f9fe7b28bc pipe: thundering herd problem in pipelock
All reads and writes are serialized with a hand-rolled lock, but unlocking it
always wakes up all waiters. Existing flag fields get resized to make room for
introduction of waiter counter without growing the struct.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27273
2020-11-19 19:25:47 +00:00
Mateusz Guzik
b8cb628534 pipe: tidy up pipelock 2020-11-19 08:16:45 +00:00
Mateusz Guzik
89744405e6 pipe: allow for lockless pipe_stat
pipes get stated all thet time and this avoidably contributed to contention.
The pipe lock is only held to accomodate MAC and to check the type.

Since normally there is no probe for pipe stat depessimize this by having the
flag.

The pipe_state field gets modified with locks held all the time and it's not
feasible to convert them to use atomic store. Move the type flag away to a
separate variable as a simple cleanup and to provide stable field to read.
Use short for both fields to avoid growing the struct.

While here short-circuit MAC for pipe_poll as well.
2020-11-19 06:30:25 +00:00
Mateusz Guzik
331c21dd5e pipe: whitespace nit in previous 2020-11-04 23:17:41 +00:00
Mateusz Guzik
c22ba7bb06 pipe: fix POLLHUP handling if no events were specified
Linux allows polling without any events specified and it happens to be the case
in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a conditon where pipe_poll can return 0 events and
neglect to selrecord, while kern_poll takes it as an indication it has to go to
sleep, but then there is nobody to wake it up.

While the problem seems systemic to *_poll handlers the least we can do is fix
it up for pipes.

Reported by:	Jeremie Galarneau <jeremie.galarneau at efficios.com>
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27094
2020-11-04 23:11:54 +00:00
Mateusz Guzik
6fed89b179 kern: clean up empty lines in .c and .h files 2020-09-01 22:12:32 +00:00
Mark Johnston
85232c2ff1 Rename the pipe_map field of struct pipe.
This is to avoid conflicts with a upcoming macro.  pipe_pages is a
more accurate name since the field tracks pages wired into the kernel as
part of a process-to-process copy operation.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-08-14 14:50:41 +00:00
Mateusz Guzik
4f00177887 pipe: reduce atime precision
The routine is called on successful write and read, which on pipes happens a
lot and for small sizes.

Precision provided by default seems way bigger than necessary and it causes
problems in vms on amd64 (it rdtscp's which vmexits). getnanotime seems to
provide the level roughly in lines of Linux so we should be good here.

Sample result from will-it-scale pipe1_processes -t 1 (ops/s):
before: 426464
after: 3247421

Note the that atime handling for named pipes is broken with and without the
patch. The filesystem code is never used for updating atime and never looks
at the updated field. Consequently, while there are no provisions added to
handle named pipes separately, the change is a nop for that case.

Differential Revision:	 https://reviews.freebsd.org/D23964
2020-08-05 19:15:59 +00:00
Mark Johnston
569eb766c5 Fix handling of EV_EOF for named pipes.
Contrary to the kevent man page, EV_EOF on a fifo is not cleared by
EV_CLEAR.  Modify the read and write filters to clear EV_EOF when the
fifo's PIPE_EOF flag is clear, and update the man page to document the
new behaviour.

Modify the write filter to return the amount of buffer space available
even if no readers are present.  This matches the behaviour for sockets.

When reading from a pipe, only call pipeselwakeup() if some data was
actually read.  This prevents the continuous re-triggering of a
EVFILT_READ event on EOF when in edge-triggered mode.

PR:		203366, 224615
Submitted by:	Jan Kokemüller <jan.kokemueller@gmail.com>
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24528
2020-04-27 15:59:19 +00:00
Mark Johnston
9b22722423 Call pipeselwakeup() after toggling PIPE_EOF.
This ensures that pipe_poll() and the pipe kqueue filters observe
PIPE_EOF and set EV_EOF accordingly.  As a result an extra call to
knote() after setting PIPE_EOF is unnecessary.

Submitted by:	Jan Kokemüller <jan.kokemueller@gmail.com>
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24528
2020-04-27 15:59:07 +00:00
Mark Johnston
9ab4355732 Avoid returning POLLIN if the pipe descriptor is not open for reading.
Submitted by:	Jan Kokemüller <jan.kokemueller@gmail.com>
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24528
2020-04-27 15:58:55 +00:00
Konstantin Belousov
2d3c083fd7 pipe: explain why not deallocating inode number is fine.
Suggested and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24009
2020-03-09 23:40:25 +00:00
Konstantin Belousov
c6d3d601c9 Preallocate pipe buffers on pipe creation.
Return ENOMEM if one of the buffer cannot be created even with the
minimal size.  This should avoid subsequent spurious ENOMEM errors
from write(2) when buffer cannot be allocated on the fly, after we
reported that the pipe was create succesfully.

Reported by:	Keno Fischer <keno@juliacomputing.com>
Reviewed by:	markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23993
2020-03-09 21:55:26 +00:00
Konstantin Belousov
1213de28f8 Style.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23993
2020-03-09 19:46:28 +00:00
Mateusz Guzik
3ff65f71cb Remove duplicated empty lines from kern/*.c
No functional changes.
2020-01-30 20:05:05 +00:00
Mark Johnston
1cbfe73da5 Fix handling of PIPE_EOF in the direct write path.
Suppose a writing thread has pinned its pages and gone to sleep with
pipe_map.cnt > 0.  Suppose that the thread is woken up by a signal (so
error != 0) and the other end of the pipe has simultaneously been
closed.  In this case, to satisfy the assertion about pipe_map.cnt in
pipe_destroy_write_buffer(), we must mark the buffer as empty.

Reported by:	syzbot+5cce271bf2cb1b1e1876@syzkaller.appspotmail.com
Reviewed by:	kib
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22261
2019-11-11 20:44:30 +00:00
Mark Johnston
6bc13e042f Modify pipe_poll() to properly check for pending direct writes.
With r349546, it is a responsibility of the writer to clear PIPE_DIRECTW
after pinned data has been read.  In particular, once a reader has
drained this data, there is a small window where the pipe is empty but
PIPE_DIRECTW is set.  pipe_poll() was using the presence of PIPE_DIRECTW
to determine whether to return POLLIN, so in this window it would
claim that data was available to read when this was not the case.

Fix this by modifying several checks for PIPE_DIRECTW to instead look
at the number of residual bytes in data pinned by a direct writer.  In
some cases we really do want to check for PIPE_DIRECTW, since the
presence of this flag indicates that any attempt to write to the pipe
will block on the existing direct writer.

Bisected and test case provided by:	mav
Tested by:	pho
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21333
2019-08-21 19:35:04 +00:00
Mark Johnston
02476c44c5 Fix mutual exclusion in pipe_direct_write().
We use PIPE_DIRECTW as a semaphore for direct writes to a pipe, where
the reader copies data directly from pages mapped into the writer.
However, when a reader finishes such a copy, it previously cleared
PIPE_DIRECTW, allowing multiple writers to race and corrupt the state
used to track wired pages belonging to the writer.

Fix this by having the writer clear PIPE_DIRECTW and instead use the
count of unread bytes to determine whether a write is finished.

Reported by:	syzbot+21811cc0a89b2a87a9e7@syzkaller.appspotmail.com
Reviewed by:	kib, mjg
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20784
2019-06-29 16:05:52 +00:00
Alan Somers
38b06f8ac4 fcntl: fix overflow when setting F_READAHEAD
VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.

Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936.  Formalize that by using a union type.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20710
2019-06-20 23:07:20 +00:00
Mark Johnston
18a7de663b Move a racy assertion in filt_pipewrite().
EVFILT_WRITE knotes for pipes live on the knlist for the other end of the
pipe.  Since they do not hold a reference on the corresponding file
structure, they may be removed from the knlist by pipeclose() while still
remaining active.  In this case, there is no knlist lock acquired before
filt_pipewrite() is called, so the assertion fails.

Fix the problem by first checking whether that end of the pipe has been
closed.  These checks are memory safe since the knote holds a reference
on one end of the pipe, and the pipe structure is not freed until both
ends are closed.  The checks are not racy since PIPE_EOF is never cleared
after being set, and pipe_present is never set back to PIPE_ACTIVE after
pipeclose() has been called.

PR:		235640
Reported and tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19224
2019-02-19 15:46:43 +00:00
Mark Johnston
648890835c Remove a write-only variable orphaned by r340677. 2019-02-17 16:56:41 +00:00
Mateusz Guzik
737037f6c0 pipe: use unr64
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18054
2018-11-20 14:59:27 +00:00
Mark Johnston
07702f72e5 Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map.
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.

Reported by:	jhb
Reviewed by:	alc, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17827
2018-11-06 21:57:03 +00:00
Ed Maste
b8d908b71e ANSIfy sys/kern 2018-06-01 13:26:45 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Pedro F. Giffuni
8a36da99de sys/kern: adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:20:12 +00:00
Brooks Davis
a72c64b0b6 Generate syscall tables and update pipe() implementation after r302094.
Mark the pipe() system call as COMPAT10.

As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D6816
2016-06-22 21:18:19 +00:00
Pedro F. Giffuni
55e0987aea sys: extend use of the howmany() macro when available.
We have a howmany() macro in the <sys/param.h> header that is
convenient to re-use as it makes things easier to read.
2016-04-26 15:38:17 +00:00
Ed Schouten
8328babdd0 Make pipes in CloudABI work.
Summary:
Pipes in CloudABI are unidirectional. The reason for this is that
CloudABI attempts to provide a uniform runtime environment across
different flavours of UNIX.

Instead of implementing a custom pipe that is unidirectional, we can
simply reuse Capsicum permission bits to support this. This is nice,
because CloudABI already attempts to restrict permission bits to
correspond with the operations that apply to a certain file descriptor.

Replace kern_pipe() and kern_pipe2() by a single kern_pipe() that takes
a pair of filecaps. These filecaps are passed to the newly introduced
falloc_caps() function that creates the descriptors with rights in
place.

Test Plan:
CloudABI pipes seem to be created with proper rights in place:

https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/unistd/pipe_test.c#L44

Reviewers: jilles, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3236
2015-07-29 17:18:27 +00:00
Conrad Meyer
c578e0fb48 pipe_direct_write: Fix mismatched pipelock/unlock
If a signal is caught in pipelock, causing it to fail, pipe_direct_write
should not try to pipeunlock.

Reported by:	pho
Differential Revision:	https://reviews.freebsd.org/D3069
Reviewed by:	kib
Approved by:	markj (mentor)
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-07-13 17:45:22 +00:00
Mateusz Guzik
90f54cbfeb fd: remove filedesc argument from fdclose
Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.
2015-04-11 15:40:28 +00:00
Konstantin Belousov
ff5ba73987 Fix use after free in pipe_dtor(). PIPE_NAMED flag must be tested
before pipeclose() is called, since for !PIPE_NAMED case, when peer is
already closed, the pipe pair memory is freed.

Submitted by:	luke.tw@gmail.com
PR:	197246
Tested by:	pho
MFC after:	3 days
2015-02-03 10:29:40 +00:00
Konstantin Belousov
fe63170115 Do not assert that the new pipepair mutex is not initialized. The
backing memory contains garbage and might trigger the assertion.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-21 16:32:54 +00:00
Konstantin Belousov
6762091ea4 Remove lock recursion for the pipe pair mutex, and disable the
recursion on mutex initialization.

The only places where the recursive acquire is performed are read and
write filters, since knlist, which uses the pipe pair mutex as lock,
is locked when filter is called.

The recursion was added in r93296, and consistent locking for
kn_fop->f_event() introduced in r133741.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2014-11-29 17:18:20 +00:00
Konstantin Belousov
ab57474c83 When other end of the pipe closed during the write, but some bytes
were written, return short write instead of EPIPE.

Update comment.

Discussed with:	bde (long time ago)
MFC after:	2 weeks
2014-11-03 10:01:56 +00:00
John Baldwin
9696feebe2 Add a new fo_fill_kinfo fileops method to add type-specific information to
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
  various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
  for each file and then convert that to a kinfo_ofile structure rather than
  keeping a second, different set of code that directly manipulates
  type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.

Differential Revision:	https://reviews.freebsd.org/D775
Reviewed by:	kib, glebius (earlier version)
2014-09-22 16:20:47 +00:00
John Baldwin
cd550b9b52 Tweak pipe_truncate() to more closely match pipe_chown() and pipe_chmod()
by checking PIPE_NAMED and using invfo_truncate() for unnamed pipes.
2014-09-12 21:20:36 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Mateusz Guzik
183870cf75 Ignore the error from pipespace_new when creating a pipe.
It can fail if pipe map is exhausted (as a result of too many pipes created),
but it is not fatal and could be provoked by unprivileged users. The only
consequence is worse performance with given pipe.

Reported by:	ivoras
Suggested by:	kib
MFC after:	1 week
2014-05-02 00:52:13 +00:00
John Baldwin
edb572a38c Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
Gleb Smirnoff
ca04d21d5f Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
Jilles Tjoelker
dc570d5e56 Add pipe2() system call.
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by:	kan, kib
2013-05-01 22:42:42 +00:00
Jilles Tjoelker
d289dc7b73 Rename do_pipe() to kern_pipe2() and declare it properly. 2013-03-31 17:42:54 +00:00
Pawel Jakub Dawidek
49549b1894 Remove redundant space. 2013-02-17 11:48:16 +00:00
David Xu
5ff2bb52cc I am comparing current pipe code with the one in 8.3-STABLE r236165,
I found 8.3 is a history BSD version using socket to implement FIFO
pipe, it uses per-file seqcount to compare with writer generation
stored in per-pipe object. The concept is after all writers are gone,
the pipe enters next generation, all old readers have not closed the
pipe should get the indication that the pipe is disconnected, result
is they should get EPIPE, SIGPIPE or get POLLHUP in poll().
But newcomer should not know that previous writters were gone, it
should treat it as a fresh session.
I am trying to bring back FIFO pipe to history behavior. It is still
unclear that if single EOF flag can represent SBS_CANTSENDMORE and
SBS_CANTRCVMORE which socket-based version is using, but I have run
the poll regression test in tool directory, output is same as the one
on 8.3-STABLE now.
I think the output "not ok 18 FIFO state 6b: poll result 0 expected 1.
expected POLLHUP; got 0" might be bogus, because newcomer should not
know that old writers were gone. I got the same behavior on Linux.
Our implementation always return POLLIN for disconnected pipe even it
should return POLLHUP, but I think it is not wise to remove POLLIN for
compatible reason, this is our history behavior.

Regression test: /usr/src/tools/regression/poll
2012-07-31 05:48:35 +00:00