Different lagg protocols have different means and policies to process incoming
traffic. For example, for failover protocol, by default received traffic is only
accepted when they are received through the active port. For lacp protocol, LACP
control messages are tapped off, also traffic will be dropped if they are
received through the port which is not in collecting state or is not joined to
the active aggregator. It confuses if user dump and see inbound traffic on
lagg(4) interfaces but they are actually silently dropped and not passed into
the net stack.
Tap traffic after protocol processing so that user will have consistent view of
the inbound traffic, meanwhile mbuf is set with correct receiving interface and
bpf(4) will diagnose the right direction of inbound packets.
PR: 270417
Reviewed by: melifaro (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39225
From static code analysis, some device drivers (cxgbe, mlx4, mthca, and qlnx)
do not enter net epoch before lagg_input_infiniband(). If IPoIB interface is a
member of lagg(4) interface, and after returning from lagg_input_infiniband()
the receiving interface of mbuf is set to lagg(4) interface, then when
concurrently destroying the lagg(4) interface, there is a small window that the
interface gets destroyed and becomes invalid before infiniband_input() re-enter
net epoch, thus leading use-after-free.
Widen NET_EPOCH coverage to prevent use-after-free.
Thanks hselasky@ for testing with mlx5 devices.
Reviewed by: hselasky
Tested by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39275
NETLINK is going to replace rtsock and a number of other ioctl/sysctl interfaces.
In-base utilies such as route(8), netstat(8) and soon ifconfig(8)
are being converted to use netlink sockets as a transport between
kernel and userland.
In the current configuration, it still possible have the kernel
without NETLINK (`nooptions NETLINK`) and use the aforementioned
utilies by buidling the world with `WITHOUT_NETLINK` src.conf knob.
However, this approach does not cover the cases when person unintentionally
builds a custom kernel without netlink and tries to use the standard userland.
This change adds `option NETLINK` to the default options for each
architecture, fixing the custom kernel issue.
For arm, this change uses `std.armv6` and `std.armv7` (netlink already in)
instead of DEFAULTS.
Reviewed By: imp
Differential Revision: https://reviews.freebsd.org/D39339
Use already-existing RTM_F_PREFIX rtm_flag to indicate that the
request assumes exact-prefix lookup instead of the
longest-prefix-match.
MFC after: 2 weeks
This will be used later in the linsysfs module to filter out VNETs.
Reviewed by: des
Differential revision: https://reviews.freebsd.org/D39382
MFC after: 1 month
Since 81167243b the size of struct pfs_node is 280 bytes, so the kernel
memory allocator takes memory from 384 bytes sized bucket. However, the
length of the node name is mostly short, e.g., for Linux emulation layer
it is up to 16 bytes. The size of struct pfs_node w/o pfs_name is 152
bytes, i.e., we have 104 bytes left to fit the node name into the 256
bytes-sized bucket.
Reviewed by: des
Differential revision: https://reviews.freebsd.org/D39381
MFC after: 1 month
Each physical port has an associated loopback tx channel and anything
transmitted over that channel by the driver is looped back internally by
the hardware as if received on that physical port. This change allows
tracing filters to be installed in this loopback path.
MFC after: 1 week
Sponsored by: Chelsio Communications
NFSv4.1/4.2 uses operation bitmaps for various operations,
such as the SP4_MACH_CRED case for ExchangeID.
This patch adds support for operation bitmaps so that
support for SP4_MACH_CRED can be added to the NFSv4.1/4.2
server in a future commit.
This commit should not change any NFSv4.1/4.2 semantics.
MFC after: 3 months
Split up the lhw lock and the scan lock. The latter is a mtx
while the former changes from mtx to sx as mac80211 downcalls may
sleep (and the ic lock is not usable in that case either and a larger
project to fix).
This will also enforce some lookups under lock (mostly scan) as well
as general protection for more compat code and avoid a possible
deadlock with one of the upcoming callbacks from driver into the
compat code.
Sponsored by: The FreeBSD Foundation
MFC after: 7 days
- Mark assert dummy variables as __unused.
- Use a dummy (void) cast of the flags argument passed to
spin_unlock_irqrestore so it gets treated as used.
Reviewed by: manu, hselasky
Differential Revision: https://reviews.freebsd.org/D39349
Use the GICD_SIZE macro (0x10000), which is half the size of the current
fixed-sized mapping (128 * 1024 == 0x20000).
In ARM64 Hyper-V instances, it seems the Distributor's registers are
located immediately preceding a range of physical memory in the bus
address space. Thus, when ram0 is attaching and attempts to reserve
SYS_RES_MEMORY resources corresponding to its physmem ranges, it fails,
because the first 0x10000 bytes of this range are already owned by gic0.
PR: 270415
Reported by: whu
Tested by: whu
Differential Revision: https://reviews.freebsd.org/D39260
init_pagetables is mapped into the segment containing the BSS, but does
not get zeroed by locore. It is used for bootstrap page table pages.
It happens that the bootstrap kernel stack is also placed in that
section, but there's no reason it shouldn't live in the BSS, so move it
there. No functional change intended.
Reviewed by: andrew
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Juniper Networks
Differential Revision: https://reviews.freebsd.org/D39367
The ENTRY macro adds instructions to the start of a function but not
EENTRY. To use these instructions in both functions move the EENTRY
use before the ENTRY use.
Sponsored by: Arm Ltd
On arm64, the PCB is stored at the top of the thread stack. For thread0
this comes from the static "initstack" region, which is placed in the
.init_pagetable section, which is not part of the BSS and thus doesn't
get zeroed by locore. (See the comment in ldscript.arm64.) It is thus
possible for the pcb_flags field to be uninitialized, which can result
in PCB_SINGLE_STEP being set.
Fix this by simply initializing the field. A separate commit will move
initstack out of the .init_pagetable section, since it has no reason to
be there, but it is preferable to explicitly initialize PCB fields
anyway. In particular, regular kernel stacks are not zeroed upon
allocation, so we should be consistent here.
Reviewed by: andrew
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Juniper Networks
Differential Revision: https://reviews.freebsd.org/D39343
Add opt_netlink.h to the linux_common module, on i386, where we don't
uses linux_common module, move opt_netlink.h inclusion under
i386 condition.
MFC after: 2 weeks
Get/set commands can now choose to provide the interface name rather
than the interface index. This allows userspace to avoid a call to
if_nametoindex().
Suggested by: melifaro
Reviewed by: melifaro
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39359
Rather than falling through to the default case handle the unknown
exception with its own panic message. As ESR_EL1 is zero for this
exception stop printing it.
Sponsored by: Arm Ltd
Rather than printing ic_name ourselves (or not at all) use ic_printf()
as a common function from net80211 where possible.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
As in commit 2f1cfb7f63ca ("gmirror: Pre-allocate the timeout event
structure"), graid3 must avoid M_WAITOK allocations in callout handlers.
Reported by: graid3 regression tests
MFC after 2 weeks
Currently, sysctls which enable KDB in some way are flagged with
CTLFLAG_SECURE, meaning that you can't modify them if securelevel > 0.
This is so that KDB cannot be used to lower a running system's
securelevel, see commit 3d7618d8bf0b7. However, the newer mac_ddb(4)
restricts DDB operations which could be abused to lower securelevel
while retaining some ability to gather useful debugging information.
To enable the use of KDB (specifically, DDB) on systems with a raised
securelevel, change the KDB sysctl policy: rather than relying on
CTLFLAG_SECURE, add a check of the current securelevel to kdb_trap().
If the securelevel is raised, only pass control to the backend if MAC
specifically grants access; otherwise simply check to see if mac_ddb
vetoes the request, as before.
Add a new secure sysctl, debug.kdb.enter_securelevel, to override this
behaviour. That is, the sysctl lets one enter a KDB backend even with a
raised securelevel, so long as it is set before the securelevel is
raised.
Reviewed by: mhorne, stevek
MFC after: 1 month
Sponsored by: Juniper Networks
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37122
Use full-featured ifa_ifwithroute() to guess route ifa/ifp
instead of ifa_ifwithnet(). This change makes the route addition
logic closer to the rt_getifa_fib() used by rtsock.
Reported by: glebius
Tested by: glebius
Differential Revision: https://reviews.freebsd.org/D39335
MFC after: 2 weeks
This is a total hack/bare minimum which follows inet4.
Otherwise 2 threads removing the same address can easily crash.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39317
The algorithm for laying out new directories was devised in the 1980s
and markedly improved the performance of the filesystem. In those days
large disks had at most 100 cylinder groups and often as few as 10-20.
Modern multi-terrabyte disks have thousands of cylinder groups. The
original algorithm does not handle these large sizes well. This change
attempts to expand the scope of the original algorithm to work well
with these much larger disks while still retaining the properties
of the original algorithm for small disks.
The filesystem implementation is divided into policy routines and
implementation routines. The policy routines can be changed in any
way desired without risk of corrupting the filesystem. The policy
requests are handled by the implementation layer. If the policy
asks for an available resource, it is granted. But if it asks for
an already in-use resource, then the implementation will provide
an available one nearby the request. Thus it is impossible for a
policy to double allocate. This change is limited to the policy
implementation.
This change updates the ffs_dirpref() routine which is responsible
for selecting the cylinder group into which a new directory should
be placed. If we are near the root of the filesystem we aim to
spread them out as much as possible. As we descend deeper from the
root we cluster them closer together around their parent as we
expect them to be more closely interactive. Higher-level directories
like usr/src/sys and usr/src/bin should be separated while the
directories in these areas are more likely to be accessed together
so should be closer. And directories within commands or kernel
subsystems should be closer still.
We pick a range of cylinder groups around the cylinder group of the
directory in which we are being created. The size of the range for
our search is based on our depth from the root of our filesystem.
We then probe that range based on how many directories are already
present. The first new directory is at 1/2 (middle) of the range;
the second is in the first 1/4 of the range, then at 3/4, 1/8, 3/8,
5/8, 7/8, 1/16, 3/16, 5/16, etc.
It is desirable to store the depth of a directory in its on-disk
inode so that it is available when we need it. We add a new field
di_dirdepth to track the depth of each directory. Because there are
few spare fields left in the inode, we choose to share an existing
field in the inode rather than having one of our own. Specifically
we create a union with the di_freelink field. The di_freelink field
is used to track inodes that have been unlinked but remain referenced.
It is not needed until a rmdir(2) operation has been done on a
directory. At that point, the directory has no contents and even
if it is kept active as a current directory is no longer able to
have any new directories or files created in it. Thus the use of
di_dirdepth and di_freelink will never coincide.
Reported by: Timo Voelker
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39246
Those input routines are identical.
Also inline two fast paths.
No functional change intended.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39251
creds are not using the refcount API for a long time now, but this
previously failed to fail to compile because the type remained int.
Now it broke due to conversion to long.
The prior implementation of xen_intr_resume() was wiping
xen_intr_port_to_isrc[] and then rebuilding from the x86 interrupt
table. Rework to instead wipe the channel numbers (->xi_port) and then
scan the table for sources with invalid channels.
This will be slower due to scanning the whole table, but this removes
the dependency on the x86 interrupt code.
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D30599
[royger]
Split line over 80 characters.
The portions of xen_rebind_ipi() and xen_rebind_virq() were already
near-identical. While xen_rebind_ipi() should panic() on
single-processor, still having the functionality to invoke seems
harmless.
Meanwhile much of the loop from xen_intr_resume() seemed to want to be
closer to this same code. This pushes related bits closer together.
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D30598
Remove these no longer needed headers. Key for making xen_intr.c
machine-independent as they don't exist on other architectures.
Originally this was part of a much larger commit, but was broken off
for submission to the FreeBSD project.
Reviewed by: royger
Submitted by: Elliott Mitchell <ehem+freebsd@m5p.com>
Original implementation: Julien Grall <julien@xen.org>, 2015-10-20 09:14:56
MFC after: 1 week