The current implementation uses an off-by-one struct prison_ip pointer to
access the IPv[46] addresses. This is error prone, as shown by the
regression fixes 21ad3e27fa and ddbf879d79. Use a flexible array member so
that the compiler catches such errors and the code is easier to review.
No functional change intended.
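A minimal userland sketch of the flexible array member idea follows. The
struct ip_list type, its fields, and the helper names are hypothetical and
are not the kernel's struct prison_ip; the point is only that addrs[i] is
typed access, with no (pip + 1) pointer arithmetic or cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel structure; field names are assumed. */
struct ip_list {
	int ips;		/* number of addresses that follow */
	uint32_t addrs[];	/* flexible array member: must be last */
};

/* Allocate a list with room for n addresses in a single allocation. */
static struct ip_list *
ip_list_alloc(int n)
{
	struct ip_list *pip;

	assert(n >= 0);
	pip = malloc(sizeof(*pip) + (size_t)n * sizeof(pip->addrs[0]));
	if (pip != NULL)
		pip->ips = n;
	return (pip);
}

/* Typed access: the compiler knows the element type of addrs[]. */
static uint32_t
ip_list_get(const struct ip_list *pip, int i)
{
	assert(i >= 0 && i < pip->ips);
	return (pip->addrs[i]);
}
```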
Reviewed by: melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D37874
The comment above bintime2timespec() says:
  When converting between timestamps on parallel timescales of differing
  resolutions it is historical and scientific practice to round down.
However, the delta_nsec value is a time difference and not a timestamp. Also
the rounding errors accumulate in the frequency accumulator, see hardpps().
So, rounding to the closest integer is probably slightly better.
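The difference between the two rounding modes can be sketched as follows.
The helper names are hypothetical, and the code is not taken from the
change itself. It assumes a 64-bit binary fraction of a second as input and
uses only its upper 32 bits, in the style of bintime2timespec().

```c
#include <assert.h>
#include <stdint.h>

/* Round down: multiply the upper 32 fraction bits by 10^9, drop the rest. */
static uint32_t
frac_to_nsec_floor(uint64_t frac)
{
	return ((uint32_t)((UINT64_C(1000000000) *
	    (uint32_t)(frac >> 32)) >> 32));
}

/*
 * Round to the closest integer by adding one half (2^31) before the shift.
 * For fractions very close to one second this can yield 1000000000, which
 * is fine for a time difference but would not be a valid tv_nsec.
 */
static uint32_t
frac_to_nsec_nearest(uint64_t frac)
{
	return ((uint32_t)((UINT64_C(1000000000) *
	    (uint32_t)(frac >> 32) + (UINT64_C(1) << 31)) >> 32));
}
```

For a fraction of 920627651984 the first helper yields 49 and the second
yields 50.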
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
Let A be the current calculation of the frequency accumulator (pps_fcount)
update in pps_event():
  scale = (uint64_t)1 << 63;
  scale /= captc->tc_frequency;
  scale *= 2;
  bt.sec = 0;
  bt.frac = 0;
  bintime_addx(&bt, scale * tcount);
  bintime2timespec(&bt, &ts);
  hardpps(tsp, ts.tv_nsec + 1000000000 * ts.tv_sec);
and hardpps(..., delta_nsec):
  u_nsec = delta_nsec;
  if (u_nsec > (NANOSECOND >> 1))
          u_nsec -= NANOSECOND;
  else if (u_nsec < -(NANOSECOND >> 1))
          u_nsec += NANOSECOND;
  pps_fcount += u_nsec;
This change introduces a new calculation which is slightly simpler and more
straightforward. Name it B.
Consider the following sample values with a tcount of 2000000100 and a
tc_frequency of 2000000000 (2GHz).
For A, the scale is 9223372036. Then scale * tcount is 18446744994337203600,
which is larger than UINT64_MAX (= 18446744073709551615). The result is
920627651984 == 18446744994337203600 % (UINT64_MAX + 1). Since all operands
are unsigned, the result is well defined through modulo arithmetic. The
result of bintime2timespec(&bt, &ts) is 49. This is equal to the correct
result 1000000049 % NANOSECOND.
In hardpps(), neither conditional statement is executed and pps_fcount is
incremented by 49.
For the new calculation B, 1000000000 * tcount is 2000000100000000000,
which is less than UINT64_MAX. After division by tc_frequency, this yields
the correct result of 1000000050 for delta_nsec.
In hardpps(), the first conditional statement is executed and pps_fcount is
incremented by 50.
This shows that both methods yield roughly the same results. However, method B
is easier to understand and requires fewer conditional statements.
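The sample values above can be reproduced with the following userland
sketch. The function names are hypothetical. Method A is modelled with
plain 64-bit arithmetic instead of struct bintime, which is equivalent here
because bt starts at zero and the multiplication wraps before
bintime_addx() is called.

```c
#include <assert.h>
#include <stdint.h>

#define NANOSECOND 1000000000LL

/* Method A: scale by about 2^64 / frequency, then round the fraction down. */
static int64_t
delta_nsec_a(uint32_t tcount, uint64_t tc_frequency)
{
	uint64_t scale, frac;

	assert(tc_frequency != 0);
	scale = (uint64_t)1 << 63;
	scale /= tc_frequency;
	scale *= 2;
	frac = scale * tcount;	/* may wrap modulo 2^64 */
	return ((int64_t)((UINT64_C(1000000000) *
	    (uint32_t)(frac >> 32)) >> 32));
}

/* Method B: direct computation; the product fits for a 32-bit tcount. */
static int64_t
delta_nsec_b(uint32_t tcount, uint64_t tc_frequency)
{
	assert(tc_frequency != 0);
	return ((int64_t)(UINT64_C(1000000000) * tcount / tc_frequency));
}

/* The wrap step quoted from hardpps(): fold the value into +-0.5 s. */
static int64_t
fold_nsec(int64_t delta_nsec)
{
	int64_t u_nsec = delta_nsec;

	if (u_nsec > (NANOSECOND >> 1))
		u_nsec -= NANOSECOND;
	else if (u_nsec < -(NANOSECOND >> 1))
		u_nsec += NANOSECOND;
	return (u_nsec);
}
```

With tcount 2000000100 and tc_frequency 2000000000, A contributes 49 and B
contributes 50 to pps_fcount after folding.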
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
Use local variables for the captured timehand and timecounter in pps_event().
This fixes a potential issue in the nsec preparation for hardpps(). Here the
timecounter was accessed through the captured timehand after the generation was
checked.
Make a snapshot of the relevant timehand values early in pps_event(). Check
the timehand generation only once during the capture and event processing. Use
atomic_thread_fence_acq() similar to the other readers.
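The snapshot-then-check pattern can be sketched in userland C11 as follows.
The structures and names are hypothetical and much simpler than the kernel
timehands, and atomic_thread_fence(memory_order_acquire) stands in for
atomic_thread_fence_acq(). The sketch assumes that a generation of 0 marks
an update in progress.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a timehand; not the kernel layout. */
struct th_model {
	atomic_uint gen;	/* assumed: 0 while a writer updates fields */
	uint64_t frequency;
	uint32_t offset_count;
};

struct th_snapshot {
	uint64_t frequency;
	uint32_t offset_count;
};

/*
 * Copy the needed values first, then check the generation once. Returns
 * false if the source was being updated or changed during the copy. After
 * a successful capture only the local snapshot is used.
 */
static bool
th_capture(struct th_model *th, struct th_snapshot *snap)
{
	unsigned int gen;

	gen = atomic_load_explicit(&th->gen, memory_order_acquire);
	snap->frequency = th->frequency;
	snap->offset_count = th->offset_count;
	/* Order the field reads before the generation re-read. */
	atomic_thread_fence(memory_order_acquire);
	return (gen != 0 &&
	    gen == atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```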
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
The routine is used as a general event-limiting routine in places which
have nothing to do with packets.
Provide a define to keep everything happy.
Reviewed by: rew
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D38746
We short-circuit lockmgr functions in the face of a kernel panic. Other
lock implementations do this with a SCHEDULER_STOPPED() check, which
covers the additional case where the debugger is active but the system
has not panicked. Update this code to match that behaviour.
Reviewed by: mjg, kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D38655
If there are multiple instances of mountd(8) (in different
prisons), there will be confusion if they manipulate the
exports of the same file system. This patch adds mnt_exjail
to "struct mount" so that the credentials (and, therefore,
the prison) that did the exports for that file system can
be recorded. If another prison has already exported the
file system, vfs_export() will fail with an error.
If mnt_exjail == NULL, the file system has not been exported.
mnt_exjail is checked by the NFS server, so that exports done
from within a different prison will not be used.
The patch also implements vfs_exjail_destroy(), which is
called from prison_cleanup() to release all the mnt_exjail
credential references, so that the prison can be removed.
Mainly to avoid doing a scan of the mountlist for the case
where there were no exports done from within the prison,
a count of how many file systems have been exported from
within the prison is kept in pr_exportcnt.
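The bookkeeping described above can be modelled roughly as follows. This is
a simplified sketch: the types and function names are hypothetical, the
model stores a prison pointer where the patch records credentials, locking
is omitted, and the error value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified models; not the kernel definitions. */
struct prison_model {
	int pr_exportcnt;	/* file systems exported from this prison */
};

struct mount_model {
	struct prison_model *mnt_exjail;	/* NULL: not exported */
};

/* Record the exporting prison, or fail if another prison got there first. */
static int
export_model(struct mount_model *mp, struct prison_model *pr)
{
	if (mp->mnt_exjail == NULL) {
		mp->mnt_exjail = pr;
		pr->pr_exportcnt++;
		return (0);
	}
	if (mp->mnt_exjail != pr)
		return (EPERM);	/* error value is an assumption */
	return (0);
}

/*
 * Drop this prison's references. Returns the number of mounts scanned,
 * which is 0 when the prison never exported anything.
 */
static int
exjail_destroy_model(struct mount_model *mounts, size_t n,
    struct prison_model *pr)
{
	int scanned = 0;

	if (pr->pr_exportcnt == 0)
		return (scanned);
	for (size_t i = 0; i < n; i++) {
		scanned++;
		if (mounts[i].mnt_exjail == pr) {
			mounts[i].mnt_exjail = NULL;
			pr->pr_exportcnt--;
		}
	}
	return (scanned);
}
```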
Reviewed by: markj
Discussed with: jamie
Differential Revision: https://reviews.freebsd.org/D38371
MFC after: 3 months
This API change led to unexpected consequences with the Go runtime. The
Go runtime emulates blocking sockets over non-blocking sockets and
for that uses the event dispatcher available on the target OS, which is
kevent(2) if available, with an OS-independent layer on top. It expects
that if an O_NONBLOCK socket ever returns EAGAIN, the socket will later
be reported as writable by the event dispatcher. kevent(2) would never
report a unix/dgram socket, since such sockets never change their
state: they are always writable. The expectations of Go are not
literally specified by SUS, however they are in its spirit. The SUS
specifies EAGAIN for send(2) as "The socket's file descriptor is marked
O_NONBLOCK and the requested operation would block" [1]. This doesn't
apply to a FreeBSD unix/dgram socket, which never blocks on send(2).
Thus, changing the API to mimic Linux was a mistake. But what about
the problem we tried to fix? I discussed that with Max Dounin of nginx,
and we agreed that the log bomb described shall be fixed on the nginx
side. It actually isn't specific to FreeBSD and may happen with nginx on
any non-Linux system with a certain configuration.
[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/send.html
This reverts commit 65572cade3.
`prison_ip_restrict()` is called in the FOREACH_PRISON_DESCENDANT_LOCKED
loop. Under low memory, it is still possible that `prison_ip_restrict()`
succeeds in a subsequent round and `redo_ip[46]` flips from true to
false, leaving the IPv[46] addresses of some prisons unrestricted.
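The failure mode can be illustrated with a small model in which each array
element says whether one round succeeded. The function names are
hypothetical, and the sticky variant shows only one possible way to keep
the flag set, not necessarily how the fix is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Buggy pattern: a later success overwrites an earlier failure. */
static bool
redo_overwrite(const bool *ok, size_t n)
{
	bool redo = false;

	for (size_t i = 0; i < n; i++)
		redo = !ok[i];
	return (redo);
}

/* Sticky pattern: once any round fails, redo stays true. */
static bool
redo_sticky(const bool *ok, size_t n)
{
	bool redo = false;

	for (size_t i = 0; i < n; i++) {
		if (!ok[i])
			redo = true;
	}
	return (redo);
}
```

With rounds { success, failure, success }, the first variant ends with redo
false even though one round failed, while the second keeps redo true.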
Reviewed by: jamie
Fixes: 8bce8d28ab jail: Avoid multipurpose return value of function prison_ip_restrict()
Differential Revision: https://reviews.freebsd.org/D38697
The interrupt counts may have been valuable in the past, but now DDB can
readily provide them via 'show intrcnt'. This is one of the only
consumers of these counter arrays outside of the interrupt code itself,
and this should be avoided.
Reviewed by: mhorne, fuz
Differential Revision: https://reviews.freebsd.org/D37870
- Use atomic_store to set job->error. atomic_set performs an OR
operation, not an assignment.
- Use refcount_* to manage job->nbio.
This ensures proper memory barriers are present so that the last bio
won't see a possibly stale value of job->error.
- Don't re-read job->error after reading it via atomic_load.
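The three points can be sketched with C11 atomics standing in for the
kernel's atomic(9) and refcount(9) primitives. The job_model type and the
function names are hypothetical; atomic_fetch_or() plays the role of the OR
performed by atomic_set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical job model; field names follow the text. */
struct job_model {
	atomic_int error;
	atomic_uint nbio;	/* outstanding bios */
};

/* An OR merges bits into the old value; it is not an assignment. */
static int
set_error_or(struct job_model *job, int error)
{
	atomic_fetch_or(&job->error, error);
	return (atomic_load(&job->error));
}

/* A store replaces the old value. */
static int
set_error_store(struct job_model *job, int error)
{
	atomic_store(&job->error, error);
	return (atomic_load(&job->error));
}

/*
 * Drop one bio reference. Returns true for the last one; the acq_rel
 * ordering makes earlier error stores visible to that last caller.
 */
static bool
bio_done(struct job_model *job, int *errorp)
{
	if (atomic_fetch_sub_explicit(&job->nbio, 1,
	    memory_order_acq_rel) != 1)
		return (false);
	*errorp = atomic_load(&job->error);	/* read once, then reuse */
	return (true);
}
```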
Reported by: markj (1)
Reviewed by: mjg, markj
Differential Revision: https://reviews.freebsd.org/D38611
Use atomic_fetchadd in place of separate atomic_subtract / atomic_load.
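As an illustration, using C11 atomics in place of the kernel's atomic(9)
calls and hypothetical function names: the two-step form lets another
thread change the counter between the subtraction and the load, while the
fetch form derives the new value from a single atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Two steps: another thread may change the counter between them. */
static unsigned int
sub_then_load(atomic_uint *cnt, unsigned int n)
{
	atomic_fetch_sub(cnt, n);	/* return value ignored */
	return (atomic_load(cnt));
}

/* One step: the fetch returns the old value, so the new one is old - n. */
static unsigned int
fetch_sub_new(atomic_uint *cnt, unsigned int n)
{
	return (atomic_fetch_sub(cnt, n) - n);
}
```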
Reviewed by: markj
Sponsored by: HPE TidalScale
Differential Revision: https://reviews.freebsd.org/D38559
Crucially, this allows releasing counters, and interrupt sources by
extension. Where before we were incrementing intrcnt_index with atomics,
now we protect the bitmap using the existing isrc_table_lock mutex.
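A rough userland sketch of a mutex-protected bitmap allocator follows. The
table size, the names, and the use of a pthread mutex in place of
isrc_table_lock are all assumptions. Unlike an index that only ever
increments, a bitmap lets a released slot be handed out again.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NCOUNTERS 64	/* hypothetical table size */

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t counter_bitmap;	/* bit i set: counter i is in use */

/* Allocate the lowest free counter index, or -1 if none is free. */
static int
counter_alloc(void)
{
	int idx = -1;

	pthread_mutex_lock(&table_lock);
	for (int i = 0; i < NCOUNTERS; i++) {
		if ((counter_bitmap & (UINT64_C(1) << i)) == 0) {
			counter_bitmap |= UINT64_C(1) << i;
			idx = i;
			break;
		}
	}
	pthread_mutex_unlock(&table_lock);
	return (idx);
}

/* Release a counter so that its index can be reused. */
static void
counter_release(int idx)
{
	assert(idx >= 0 && idx < NCOUNTERS);
	pthread_mutex_lock(&table_lock);
	counter_bitmap &= ~(UINT64_C(1) << idx);
	pthread_mutex_unlock(&table_lock);
}
```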
Reviewed by: mmel
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D38437
soconnectat() tries to ensure that one cannot connect a connected
socket. However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.
Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().
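The check-under-lock idea can be sketched as follows. The sock_model type
is hypothetical, a pthread mutex stands in for the inpcb lock, and the
error value returned for the losing thread is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical socket model; the mutex stands in for the inpcb lock. */
struct sock_model {
	pthread_mutex_t lock;
	bool connecting;	/* set once a connect is in progress */
};

static int
connect_model(struct sock_model *so)
{
	int error = 0;

	/*
	 * An unlocked check before this point is racy: two threads could
	 * both pass it. Repeating the check with the lock held means that
	 * only one of them can mark the socket as connecting.
	 */
	pthread_mutex_lock(&so->lock);
	if (so->connecting)
		error = EISCONN;	/* error value is an assumption */
	else
		so->connecting = true;	/* stands in for soisconnecting() */
	pthread_mutex_unlock(&so->lock);
	return (error);
}
```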
Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38507
Also do minor style adjustments.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38549
For sched_relinquish(). This fixes the build for some kernel configs.
Reported by: Jenkins
Fixes: 1029dab634 ("mi_switch(): clean up switch types and their usage")
Address some last minute review feedback on c0e4090e3d
by fixing spacing around comments, and clarifying that the
newly added destroy_task is not related to tls 1.0.
No functional change intended.
Pointed out by: jhb
Sponsored by: Netflix
This allows us to avoid spurious calls to ktls_disable_ifnet().
When we implemented ifnet kTLS, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.
This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcpcb, we take a ref on the inp when creating a tx
ktls session.
While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.
This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.
Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380
Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.
1. Ensure that every caller provides a type. In most cases, we upgrade
the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
will be documented there.
Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38184
ULE uses the more specific SWT_REMOTEPREEMPT and SWT_REMOTEWAKEIDLE
switch types; let's do that here as well. SWT_PREEMPT is somewhat
redundant when we also have the SW_PREEMPT flag.
This only has an effect for kernels built with SCHED_STATS.
Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38183
Do this ahead of adding a man page that describes the function. No
functional change.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38181
The equivalent function is now named thread_create(). Mention
kthread_add() where it is also relevant.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38180
Its purpose is to reserve all I/O space belonging to physical memory
from nexus, preventing it from being handed out by bus_alloc_resource()
to callers such as xenpv_alloc_physmem(), which looks for the first
available free range it can get. This mimics the existing pseudo-driver
on x86.
If needed, the device can be disabled with hint.ram.0.disabled="1" in
/boot/device.hints.
Reviewed by: imp
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D32343
There's no reason to use one over the other here, so let's prefer the
interface that's used elsewhere in the kernel.
No functional change intended.
Reviewed by: mjg
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38360
Summary:
Add 2 new APIs for supporting recent mbuf changes:
* 36e0a362ac added the m_snd_tag_alloc() wrapper around
if_snd_tag_alloc(). Push this down to the ifnet level.
* 4d7a1361ef adds the m_rcvif_serialize()/m_rcvif_restore() KPIs to
serialize and restore an ifnet pointer. Add the necessary wrapper to
get the index generation for this.
Reviewed By: jhb
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38340
During testing of exporting file systems in jails, I
noticed that the export structures on a mount
were not being freed when the mount is dismounted.
This bug appears to have been in the system for a
very long time. It would have resulted in a slow memory
leak when exported file systems were dismounted.
Prior to r362158, freeing the structures during dismount
would not have been safe, since VFS_CHECKEXP() returned
a pointer into an export structure, which might still have been
used by the NFS server for an in-progress RPC when the file system
is dismounted. r362158 fixed this, so it should now be safe
to free the structures in vfs_mount_destroy(), which is what
this patch does.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D38385
Writes on UFS through a mapped region do not allocate disk blocks in
holes immediately. The blocks are allocated when the pages are paged out
for the first time.
This breaks the algorithm in vn_bmap_seekhole() and ufs_bmap_seekdata(),
because VOP_BMAP() reports a hole for a location that already contains
valid data.
Clean the pages before doing VOP_BMAP() in the affected functions. In
principle, we could clean less by only requesting cleaning from the
offset onward, but it is probably not very important.
PR: 269261
Reported by: asomers
Reviewed by: asomers, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38379