Introduce a new function, lltable_get(), to retrieve lltable pointer
for the specified interface and family.
Use it to avoid all-iftable list traversal when adding or deleting
ARP/ND records.
Differential Revision: https://reviews.freebsd.org/D33660
MFC after: 2 weeks
Now struct prison has two pointers (IPv4 and IPv6) of struct
prison_ip type. Each points into epoch context, address count
and variable size array of addresses. These structures are
freed with network epoch deferred free and are not edited in
place, instead a new structure is allocated and set.
While here, the change also generalizes a lot (but not enough)
of IPv4 and IPv6 processing. E.g. address family agnostic helpers
for kern_jail_set() are provided, that reduce v4-v6 copy-paste.
The fast-path prison_check_ip[46]_locked() is also generalized
into prison_ip_check() that can be executed with network epoch
protection only.
Reviewed by: jamie
Differential revision: https://reviews.freebsd.org/D33339
When no mask is supplied to the ioctl adding an Internet interface
address, revert to using the historical class mask rather than a
single default. Similarly for the NFS bootp code.
MFC after: 3 weeks
Reviewed by: melifaro glebius
Differential Revision: https://reviews.freebsd.org/D32951
Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined;
define it for user level. Define IN_MULTICAST separately from IN_CLASSD,
and use it in pf instead of IN_CLASSD. Stop using class for setting
default masks when not specified; instead, define new default mask
(24 bits). Warn when an Internet address is set without a mask.
MFC after: 1 month
Reviewed by: cy
Differential Revision: https://reviews.freebsd.org/D32708
The modification to the hash are already naturally locked by
in_control_sx. Convert the hash lists to CK lists. Remove the
in_ifaddr_rmlock. Assert the network epoch where necessary.
Most cases when the hash lookup is done the epoch is already entered.
Cover a few cases, that need entering the epoch, which mostly is
initial configuration of tunnel interfaces and multicast addresses.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D32584
An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.
Next step would be to CK-ify the in_addr hash and get rid of the...
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D32434
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122. Until
now, we would broadcast the host zero address as well as the configured
address. Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it. Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.
See https:/datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316. Note, Linux
now implements this.
Reviewed by: rgrimes, tuexen; melifaro (previous version)
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D31861
Current logic always selects an IFA of the same family from the
outgoing interfaces. In IPv4 over IPv6 setup there can be just
single non-127.0.0.1 ifa, attached to the loopback interface.
Create a separate rt_getifa_family() to handle entire ifa selection
for the IPv4 over IPv6.
Differential Revision: https://reviews.freebsd.org/D31868
MFC after: 1 week
With the new FIB_ALGO infrastructure, nearly all subsystems use
fib[46]_lookup() functions, which provides lockless lookups.
A number of places remains that uses old-style lookup functions, that
still requires RIB read lock to return the result. One of such places
is arp processing code.
FIB_ALGO implementation makes some tradeoffs, resulting in (relatively)
prolonged periods of holding RIB_WLOCK. If the lock is held and datapath
competes for it, the RX ring may get blocked, ending in traffic delays and losses.
As currently arp processing is performed directly in the interrupt handler,
handling ARP replies triggers the problem descibed above when the amount of
ARP replies is high.
To be more specific, prior to creating new ARP entry, routing lookup for the entry
address in interface fib is executed. The following conditions are the verified:
1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable,
failure is returned. The only exception are host routes w/ gateway==address.
2. If the routing lookup returns different interface and non-host route,
we want to support the use case of having multiple interfaces with the same prefix.
In fact, the current code just checks if the returned prefix covers target address
(always true) and effectively allow allocating ARP entries for any directly-reachable prefix,
regardless of its interface.
Change the code to perform the following:
1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix.
2) Rewrite first condition check using nexthop flags (1:1 match)
3) Rewrite second condition to check for interface addresses matching target address on
the input interface.
Differential Revision: https://reviews.freebsd.org/D31824
Reviewed by: ae
MFC after: 1 week
PR: 257965
It appears that maliciously crafted ifaliasreq can lead to NULL
pointer dereference in in_aifaddr_ioctl(). In order to replicate
that, one needs to
1. Ensure that carp(4) is not loaded
2. Issue SIOCAIFADDR call setting ifra_vhid field of the request
to a negative value.
A repro code would look like this.
int main() {
struct ifaliasreq req;
struct sockaddr_in sin, mask;
int fd, error;
bzero(&sin, sizeof(struct sockaddr_in));
bzero(&mask, sizeof(struct sockaddr_in));
sin.sin_len = sizeof(struct sockaddr_in);
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = inet_addr("192.168.88.2");
mask.sin_len = sizeof(struct sockaddr_in);
mask.sin_family = AF_INET;
mask.sin_addr.s_addr = inet_addr("255.255.255.0");
fd = socket(AF_INET, SOCK_DGRAM, 0);
if (fd < 0)
return (-1);
memset(&req, 0, sizeof(struct ifaliasreq));
strlcpy(req.ifra_name, "lo0", sizeof(req.ifra_name));
memcpy(&req.ifra_addr, &sin, sin.sin_len);
memcpy(&req.ifra_mask, &mask, mask.sin_len);
req.ifra_vhid = -1;
return ioctl(fd, SIOCAIFADDR, (char *)&req);
}
To fix, discard both positive and negative vhid values in
in_aifaddr_ioctl, if carp(4) is not loaded. This prevents NULL pointer
dereference and kernel panic.
Reviewed by: imp@
Pull Request: https://github.com/freebsd/freebsd-src/pull/530
Use newly-create llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request datapatch usage check and fetch the results
in the same fashion both in IPv4 and IPv6.
While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31390
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.
Reported by: bapt, Greg V
Sponsored by: The FreeBSD Foundation
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.
Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.
Reviewed by: oshogbo
Discussed with: emaste
Reported by: manu
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29423
It fixes loopback route installation for the interfaces
in the different fibs using the same prefix.
Reviewed By: donner
PR: 189088
Differential Revision: https://reviews.freebsd.org/D28673
MFC after: 1 week
* Fix bug with /32 aliases introduced in 81728a538d.
* Explicitly document business logic for IPv4 ifa routes.
* Remove remnants of rtinit()
* Deduplicate ifa->route prefix code by moving it into ia_getrtprefix()
* Deduplicate conditional check for ifa_maintain_loopback_route() by
moving into ia_need_loopback_route()
* Remove now-unused flags argument from in_addprefix().
Reviewed by: donner
PR: 252883
Differential Revision: https://reviews.freebsd.org/D28246
rtinit[1]() is a function used to add or remove interface address prefix routes,
similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
in reality.
1) IPv6 code does not use it for the ifa routes. There is a separate layer,
nd6_prelist_(), providing interface for maintaining interface routes. Its part,
responsible for the actual route table interaction, mimics rtenty() code.
2) rtinit tries to combine multiple actions in the same function: constructing
proper route attributes and handling iterations over multiple fibs, for the
non-zero net.add_addr_allfibs use case. It notably increases the code complexity.
3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
for p2p connections, host routes and p2p routes are handled in the same way.
Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
aliases.
4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
complicating rib_action() implementation.
5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
ifa messages in certain corner cases.
To address all these points, the following has been done:
* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
responsible for the actual route notifications.
Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses
Differential revision: https://reviews.freebsd.org/D28186
It is lightweight way to check if an IPv4 address exists.
Submitted by: Roy Marples
Reviewed by: gnn, melifaro
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26636
When SIOCAIFADDR ioctl configures an IPv4 address that is already exist,
it removes old ifaddr. When this IPv4 address is only one configured on
the interface, this also leads to leaving from AllHosts multicast group.
Then an address is added again, but due to the bug, this doesn't lead
to joining to AllHosts multicast group.
Submitted by: yannis.planus_alstomgroup.com
Reviewed by: gnn
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26757
This change is based on the nexthop objects landed in D24232.
The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.
Simular to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.
With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presense of
NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.
User-visible changes:
The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.
All routes now comes with weight, default weight is 1, maximum is 2^24-1.
Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.
Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1
route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20
netstat -6On
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
1 ------- ------- ------- --------------------------------------- --------- 1
13 10 1 2001:db8::2 vlan2
14 20 2 2001:db8::3 vlan2
Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default
Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449
destroying a VNET or a network interface.
Else the inm release tasks, both IPv4 and IPv6 may cause a panic
accessing a freed VNET or network interface.
Reviewed by: jmg@
Discussed with: bz@
Differential Revision: https://reviews.freebsd.org/D24914
MFC after: 1 week
Sponsored by: Mellanox Technologies
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.
Reviewed by: karels, kristof
Approved by: bde (mentor)
MFC after: 3 weeks
Sponsored by: John Gilmore
Differential Revision: https://reviews.freebsd.org/D19317
After the afdata read lock was converted to epoch(9), readers could
observe a linked LLE and block on the LLE while a thread was
unlinking the LLE. The writer would then release the lock and schedule
the LLE for deferred free, allowing readers to continue and potentially
schedule the LLE timer. By the point the timer fires, the structure is
freed, typically resulting in a crash in the callout subsystem.
Fix the problem by modifying the lookup path to check for the LLE_LINKED
flag upon acquiring the LLE lock. If it's not set, the lookup fails.
PR: 234296
Reviewed by: bz
Tested by: sbruno, Victor <chernov_victor@list.ru>,
Mike Andrews <mandrews@bit0.com>
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18906
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17100
This deduplicates the code a bit, and also implicitly adds missing
callout_stop() to in[6]_lltable_delete_entry() functions.
PR: 209682, 225927
Submitted by: hselasky (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D4605
to sleep on commands to the NIC when updating multicast filters. More generally this permitted
driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a
a multicast update would still be queued for deletion when ifconfig deleted the interface
thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses
on the interface.
Synchronously remove all external references to a multicast address before enqueueing for delete.
Reported by: lwhsu
Approved by: sbruno
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.
Submitted by: mmacy <mmacy@mattmacy.io>
Reviewed by: shurd, sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14969
Current arp/nd code relies on the feedback from the datapath indicating
that the entry is still used. This mechanism is incorporated into the
arpresolve()/nd6_resolve() routines. After the inpcb route cache
introduction, the packet path for the locally-originated packets changed,
passing cached lle pointer to the ether_output() directly. This resulted
in the arp/ndp entry expire each time exactly after the configured max_age
interval. During the small window between the ARP/NDP request and reply
from the router, most of the packets got lost.
Fix this behaviour by plugging datapath notification code to the packet
path used by route cache. Unify the notification code by using single
inlined function with the per-AF callbacks.
Reported by: sthaug at nethelp.no
Reviewed by: ae
MFC after: 2 weeks
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.
ANSIfy related prototypes while here.
Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.
Suggested by: glebius, emaste
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9625
(intentionally) deleted first and then completely added again (so all the
events, announces and hooks are given a chance to run).
This cause an issue with CARP where the existing CARP data structure is
removed together with the last address for a given VHID, which will cause
a subsequent fail when the address is later re-added.
This change fixes this issue by adding a new flag to keep the CARP data
structure when an address is not being removed.
There was an additional issue with IPv6 CARP addresses, where the CARP data
structure would never be removed after a change and lead to VHIDs which
cannot be destroyed.
Reviewed by: glebius
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost. This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.
To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero. The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).
Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity. This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.
Submitted by: David A. Bright <david.a.bright@dell.com>
MFC after: 1 month
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D7695
in_broadcast() was iterating over the ifnet address list without
first taking an IF_ADDR_RLOCK. This could cause a panic if a
concurrent operation modified the list.
Reviewed by: bz
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7227
For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address. in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.
This is completely redundant as we have already performed the
route lookup, so the source IP is already known. Just use that
address to directly check whether the destination IP is a
broadcast address or not.
MFC after: 2 months
Sponsored By: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266