missing it.
Remove the "hidden" kernel only include of vimage.h from ip_var.h added
with the very first Vimage commit r181803 to avoid further kernel poisoning.
1) All bit disappears
2) The two sets of gaps (nr and non-nr) are
disjointed, you don't have gaps struck in
both places.
This adjusts us to coorespond to the new draft. Still
to-do, cleanup the code so that there are only one set
of sack routines (original NR-Sack done by E cloned all
sack code).
t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age,
t_badrxtwin.
- Change t_recent in struct timewait from u_long to u_int32_t to match
the type of the field it shadows from tcpcb: ts_recent.
- Change t_starttime in struct timewait from u_long to u_int to match
the t_starttime field in tcpcb.
Requested by: bde (1, 3)
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.
Reviewed by: rwatson
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...
X-MFC: never
Reviewed by: bz
Approved by: gnn(mentor)
Obtained from: NETASQ
instead of unsigned longs. This fixes a few overflow edge cases on 64-bit
platforms. Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.
Reviewed by: silby, gnn, lstewart
MFC after: 1 week
and use malloc() instead if/when it is necessary.
The problem is less relevant in previous versions because
the variable involved (tmp_pipe) is much smaller there.
Still worth fixing though.
Submitted by: Marta Carbone (GSOC)
MFC after: 3 days
In case of !INET we will not have a timestamp on the trace for now
but that might only affect spx debugging as long as INET6 requires
INET.
Reviewed by: rwatson (earlier version)
- clear the head pointer immediately before using it, so there is
no chance of mistakes;
- call reap_rules() unconditionally. The function can handle a NULL
argument just fine, and the cost of the extra call is hardly
significant given that we do it rarely and outside the lock.
MFC after: 3 days
If packet leaves ipfw to other kernel subsystem (dummynet, netgraph, etc)
it carries pointer to matching ipfw rule. If this packet then reinjected back
to ipfw, ruleset processing starts from that rule. If rule was deleted
meanwhile, due to existed race condition panic was possible (as well as
other odd effects like parsing rules in 'reap list').
P.S. this commit changes ABI so userland ipfw related binaries should be
recompiled.
MFC after: 1 month
Tested by: Mikolaj Golub
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.
Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.
While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.
Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.
Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)
version field sent via gif(4)+if_bridge(4). The EtherIP
implementation found on FreeBSD 6.1, 6.2, 6.3, 7.0, 7.1, and 7.2 had
an interoperability issue because it sent the incorrect EtherIP
packets and discarded the correct ones.
This change introduces the following two flags to gif(4):
accept_rev_ethip_ver: accepts both correct EtherIP packets and ones
with reversed version field, if enabled. If disabled, the gif
accepts the correct packets only. This flag is enabled by
default.
send_rev_ethip_ver: sends EtherIP packets with reversed version field
intentionally, if enabled. If disabled, the gif sends the correct
packets only. This flag is disabled by default.
These flags are stored in struct gif_softc and can be set by
ifconfig(8) on per-interface basis.
Note that this is an incompatible change of EtherIP with the older
FreeBSD releases. If you need to interoperate older FreeBSD boxes and
new versions after this commit, setting "send_rev_ethip_ver" is
needed.
Reviewed by: thompsa and rwatson
Spotted by: Shunsuke SHINOMIYA
PR: kern/125003
MFC after: 2 weeks
adjust conf/files and modules' Makefiles accordingly.
No code or ABI changes so this and most of previous related
changes can be easily MFC'ed
MFC after: 5 days
pipes, queues, tags, rule numbers and so on.
These are all different namespaces, and the only thing they have in
common is the fact they use a 16-bit slot to represent the argument.
There is some confusion in the code, mostly for historical reasons,
on how the values 0 and 65535 should be used. At the moment, 0 is
forbidden almost everywhere, while 65535 is used to represent a
'tablearg' argument, i.e. the result of the most recent table() lookup.
For now, try to use explicit constants for the min and max allowed
values, and do not overload the default rule number for that.
Also, make the MTAG_IPFW declaration only visible to the kernel.
NOTE: I think the issue needs to be revisited before 8.0 is out:
the 2^16 namespace limit for rule numbers and pipe/queue is
annoying, and we can easily bump the limit to 2^32 which gives
a lot more flexibility in partitioning the namespace.
MFC after: 5 days
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
structure contents are a bad idea in the kernel for binary
compatibility reasons, and this is a single pointer that is now included
in compiles by default anyway due to options MAC being in GENERIC.
+ move ipfw and dummynet hooks declarations to raw_ip.c (definitions
in ip_var.h) same as for most other global variables.
This removes some dependencies from ip_input.c;
+ remove the IPFW_LOADED macro, just test ip_fw_chk_ptr directly;
+ remove the DUMMYNET_LOADED macro, just test ip_dn_io_ptr directly;
+ move ip_dn_ruledel_ptr to ip_fw2.c which is the only file using it;
To be merged together with rev 193497
MFC after: 5 days
stuff to its own directory, and cleaning headers and dependencies:
In this commit:
+ remove one use of a typedef;
+ document dn_rule_delete();
+ replace one usage of the DUMMYNET_LOADED macro with its value;
No MFC planned until the cleanup is complete.
of the credit of a pipe. On passing, also use explicit
signed/unsigned types for two other fields.
Noticed by Oleg Bulyzhin and Maxim Ignatenko long ago,
i forgot to commit the fix.
Does not affect RELENG_7.
modules are loaded by avoiding mbuf label lookups when policies aren't
loaded, pushing further socket locking into MAC policy modules, and
avoiding locking MAC ifnet locks when no policies are loaded:
- Check mac_policies_count before looking for mbuf MAC label m_tags in MAC
Framework entry points. We will still pay label lookup costs if MAC
policies are present but don't require labels (typically a single mbuf
header field read, but perhaps further indirection if IPSEC or other
m_tag consumers are in use).
- Further push socket locking for socket-related access control checks and
events into MAC policies from the MAC Framework, so that sockets are
only locked if a policy specifically requires a lock to protect a label.
This resolves lock order issues during sonewconn() and also in local
domain socket cross-connect where multiple socket locks could not be
held at once for the purposes of propagatig MAC labels across multiple
sockets. Eliminate mac_policy_count check in some entry points where it
no longer avoids locking.
- Add mac_policy_count checking in some entry points relating to network
interfaces that otherwise lock a global MAC ifnet lock used to protect
ifnet labels.
Obtained from: TrustedBSD Project
count of the number of registered policies.
Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.
This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.
Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.
Obtained from: TrustedBSD Project
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.
Discussed with: rwatson, rmacklem
an accessor function to get the correct rnh pointer back.
Update netstat to get the correct pointer using kvm_read()
as well.
This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.
Reviewed by: julian, rwatson, zec
X-MFC: not possible
* Tighten v1 query input processing.
* Borrow changes from MLDv2 for how general queries are processed.
* Do address field validation upfront before accepting input.
* Do NOT switch protocol version if old querier present timer active.
* Always clear IGMPv3 state in igmp_v3_cancel_link_timers().
* Update comments.
Tested by: deeptech71 at gmail dot com
threads:
- Support up to one netisr thread per CPU, each processings its own
workstream, or set of per-protocol queues. Threads may be bound
to specific CPUs, or allowed to migrate, based on a global policy.
In the future it would be desirable to support topology-centric
policies, such as "one netisr per package".
- Allow each protocol to advertise an ordering policy, which can
currently be one of:
NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
an implicit or explicit source (such as an interface or socket).
NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
as well as allowing protocols to provide a flow generation function
for mbufs without flow identifers (m2flow). Falls back on
NETISR_POLICY_SOURCE if now flow ID is available.
NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
each packet handled by netisr (m2cpuid).
- Provide utility functions for querying the number of workstreams
being used, as well as a mapping function from workstream to CPU ID,
which protocols may use in work placement decisions.
- Add explicit interfaces to get and set per-protocol queue limits, and
get and clear drop counters, which query data or apply changes across
all workstreams.
- Add a more extensible netisr registration interface, in which
protocols declare 'struct netisr_handler' structures for each
registered NETISR_ type. These include name, handler function,
optional mbuf to flow ID function, optional mbuf to CPU ID function,
queue limit, and ordering policy. Padding is present to allow these
to be expanded in the future. If no queue limit is declared, then
a default is used.
- Queue limits are now per-workstream, and raised from the previous
IFQ_MAXLEN default of 50 to 256.
- All protocols are updated to use the new registration interface, and
with the exception of netnatm, default queue limits. Most protocols
register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
NETISR_POLICY_FLOW, and will therefore take advantage of driver-
generated flow IDs if present.
- Formalize a non-packet based interface between interface polling and
the netisr, rather than having polling pretend to be two protocols.
Provide two explicit hooks in the netisr worker for start and end
events for runs: netisr_poll() and netisr_pollmore(), as well as a
function, netisr_sched_poll(), to allow the polling code to schedule
netisr execution. DEVICE_POLLING still embeds single-netisr
assumptions in its implementation, so for now if it is compiled into
the kernel, a single and un-bound netisr thread is enforced
regardless of tunable configuration.
In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.
Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.
An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking. It will be enabled once further rmlock
optimization has taken place. However, in practice, netisrs are
rarely registered or unregistered at runtime.
A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.
This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.
Bump __FreeBSD_version.
Reviewed by: bz
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
functionality (currently only unjail root can use it).
Discussed with: julian, adrian, jhb, rwatson, kmacy
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.
Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().
Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.
Approved by: bz (mentor)
of ip_forward(), if the IPSEC is compiled in. It is possible that there
is an SPD that this packets will go through, even if there is no matching
route. If not, ICMP will be sent anyway, after ip_output().
This is somewhat similar in purpose to r191621, except that one was
for the packets sent from the host, while this one is for packets
being forwarded by the host.
Reviewed by: bz@
Sponsored by: Wheel Sp. z o.o. (http://www.wheel.pl)
request to arrive. Previously it would end up returning as soon as the
request length stored in the first two bytes had arrived.
Reviewed by: dwmalone
MFC after: 1 week
be named "udp_inpcb" to avoid a naming conflict with tcp[1].
For consistency rename the uma zone for TCP from "inpcb" to "tcp_inpcb".
Found by: rwatson [1]
Discussed with: rwatson
So far the udp_tun_func_t had been (ab)using inp_ppcb for udp in kernel
tunneling callbacks. Move that into the udpcb and add a field for flags
there to be used by upcoming changes instead of sticking udp only flags
into in_pcb flags2.
Bump __FreeBSD_version for ports to detect it and because of vnet* struct
size changes.
Submitted by: jhb (7.x version)
Reviewed by: rwatson
kernel option.
This also permits tuning of the option per virtual network stack, as
well as separately per inet, inet6.
The kernel option is left for a transition period, marked deprecated,
and will be removed soon.
Initially requested by: phk (1 year 1 day ago)
MFC after: 4 weeks
because struct vnet_net holds the rt_tables[][] for MRT and array size
is compile time dependent. If you had ROUTETABLES set to >1 after
r192011 V_loif was pointing into nonsense leading to strange results
or even panics for some people.
Reviewed by: mz
last year or two's work on routing:
- Combine iproute initialization and flowtable lookup blocks, eliminating
unnecessary tests for known-zero'd iproute fields.
- Add a comment indicating (a) why the route entry returned by the
flowtable is considered stable and (b) that the flowtable lookup must
occur after the setup of the mbuf flow ID.
- Assert the inpcb lock before any use of inpcb fields.
Reviewed by: kmacy
route is also being deleted, the link-layer address table
(arp or nd6) will flush those L2 llinfo entries that match
the removed prefix.
Reviewed by: kmacy
net.inet.ip.fw.one_pass is a classic ip_input.c variable and is used in
the pfil and bridge code as well. As ipfw is loadable we need to always
provide it. That is the reason why it lives in struct vnet_inet and
not in struct vnet_ipfw.
to a non loopback/ppp link types) through the loopback interface. Prior
to the new L2/L3 rewrite, this host route is implicitly added by the L2
code during RTM_RESOLVE of that interface address. This host route is
deleted when that interface is removed.
Reviewed by: kmacy
'net.inet.ip.fw.default_to_accept'. The current value can also be queried
via a read-only sysctl of the same name.
Requested by: plosher
MFC after: 1 week
(i.e. seems to be) already set.
This should reduce console noise due to curvnet recursion reports.
This change has no impact on nooptions VIMAGE builds.
Approved by: julian (mentor)
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.
This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all
vnet instances.
Approved by: julian (mentor)
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
import from p4 bms_netdev. Summary of changes:
* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.
NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.
This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
incorrectly output, if the RB-tree enumeration happened to reuse the
same chain for a mode switch: that is, both ALLOW and BLOCK records
were appended for the same group, in the same mbuf packet chain.
This was introduced during an mbuf chain layout bug fix involving
m_getptr(), which obviously cannot count from offset 0 on the
second pass through the RB-tree when serializing the IGMPv3
group records into the pending mbuf chain.
Cut over to KTR_INET for IGMPv3 CTR usage.
processed by ipfw once - avoid second ipfw_chk() call.
This saves us from unnecessary IPFW_RLOCK(), m_tag_find() calls and
ip/tcp/udp header parsing.
MFC after: 2 month
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.
Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)
(colloquially known as if_addrlist). Currently not acquired around
interface address loops that call out to the routing code due to
potential lock order issues.
MFC after: 3 weeks
lookup of 'ia' from if_addrhead through most use. Note that we
currently have to drop it prematurely in some cases due to calls out to
the routing and interface code while using 'ia', but this closes many
races. Annotate several potential races that persist after this change.
Move to using M_NOWAIT for allocating new interface addresses due to
lock(s) being held.
MFC after: 3 weeks
of the implementation of ioctls. This makes the mapping of ioctls to
specific privileges more explicit, and also simplifies the
implementation by reducing the use of FALLTHROUGH handling in switch.
While this is not intended to be a functional change, it does mean
that certain privilege checks are now performed earlier, so EPERM
might be returned in preference to EADDRNOTAVAIL for management
ioctls that could have failed for both reasons.
MFC after: 3 weeks
that it is easier to lock:
- Handle the unsupported ioctl case at the beginning of in_control(),
handing off to ifp->if_ioctl, rather than looking up interfaces and
addresses unnecessarily in this case.
- Make it an invariant that ifp is always non-NULL when running
in_control()-implemented ioctls, simplifying the code structure.
MFC after: 3 weeks
Match the bracketing in netstat.
Since the cleanup of MROUTING, ports have broken because they
expect to include <netinet/ip_mroute.h> without including
<sys/queue.h>. Fix breakage at source.
The real fix, of course, is to fix the MROUTING APIs by blowing them
away and replacing them with something else...
variable. Acquire the interface address list lock when iterating over
the interface address list searching for a matching received broadcast
address.
MFC after: 2 weeks
we recognize its a retransmit, ahead of the PR-SCTP
work. Without this fix, we end up NOT reducing flight
size and causing an miscalculation when PR-SCTP is active
and data is skipped.
Obtained from: Michael Tuexen.
and CARPSTATS_INC(), rather than directly manipulating the fields of
the structure. This will make it easier to change the implementation
of these statistics, such as using per-CPU versions of the data
structure.
MFC after: 3 days
and PIMSTAT_INC(), rather than directly manipulating the fields of
the structure. This will make it easier to change the
implementation of these statistics, such as using per-CPU versions
of the data structure.
MFC after: 3 days
and MRTSTAT_INC(), rather than directly manipulating the fields of
the structure. This will make it easier to change the
implementation of these statistics, such as using per-CPU versions
of the data structure.
MFC after: 3 days
IGMPSTAT_ADD() and IGMPSTAT_INC(), rather than directly
manipulating the fields of the structure. This will make it
easier to change the implementation of these statistics,
such as using per-CPU versions of the data structures.
MFC after: 3 days
macros: ICMPSTAT_ADD(), ICMPSTAT_INC(), ICMP6STAT_ADD(), and
ICMP6STAT_INC(), rather than directly manipulating the fields
of these structures across the kernel. This will make it
easier to change the implementation of these statistics,
such as using per-CPU versions of the data structures.
In on case, icmp6stat members are manipulated indirectly, by
icmp6_errcount(), and this will require further work to fix
for per-CPU stats.
MFC after: 3 days
and UDPSTAT_INC(), rather than directly manipulating the fields
across the kernel. This will make it easier to change the
implementation of these statistics, such as using per-CPU versions
of the data structures.
MFC after: 3 days
IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly
manipulating the fields across the kernel. This will make it easier
to change the implementation of these statistics, such as using
per-CPU versions of the data structures.
MFC after: 3 days
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.
MFC after: 3 days
-UdpAliasIn(): correctly check return code after modules ran.
-alias_nbt: in case of malformed packets (or some other unrecoverable
error), toss the packet.
dependency tracking and ordering enforcement.
With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized. In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.
The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw). Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.
The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module. The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.
For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.
A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on. Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first. The framework will panic if it detects any unresolved
dependencies before completing system initialization. Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.
Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run. In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container. IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.
The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.
Reviewed by: bz
Approved by: julian (mentor)
types of MAC overheads such as preambles, link level retransmissions
and more.
Note- this commit changes the userland/kernel ABI for pipes
(but not for ordinary firewall rules) so you need to rebuild
kernel and /sbin/ipfw to use dummynet features.
Please check the manpage for details on the new feature.
The MFC would be trivial but it breaks the ABI, so it will
be postponed until after 7.2 is released.
Interested users are welcome to apply the patch manually
to their RELENG_7 tree.
Work supported by the European Commission, Projects Onelab and
Onelab2 (contract 224263).
more adequate TCP performance with IPv6.
Changes for IPv4, r166403 and r172795, both ignored the
IPv6 counterpart and left it in the state of art of year 2000.
The same logic in syncache already shares code between v4 and v6 so
things do not need to be adapted there.
Reported by: Steinar Haug (sthaug nethelp.no)
Tested by: Steinar Haug (sthaug nethelp.no)
MFC after: 3 days
from existing functions for initializing global state.
At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.
Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).
While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.
Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.
Approved by: julian (mentor)
in the case where a single mbuf is allocated due to
m_getcl() returning NULL, we already call MH_ALIGN,
so do not increment m->m_data in this case.
Found during MLDv2 port.
- PR-SCTP had major issues when skipping through a multi-part message.
o Did not look at socket buffer.
o Did not properly handle the reassmebly queue.
o The MARKED segments could interfere and un-skip a chunk causing
a problem with the proper FWD-TSN.
o No FR of FWD-TSN's was being done.
- NR-Sack code was basically disabled. It needed fixes that
never got into the real code.
- CMT code had issues when the two paths were NOT the same b/w. We
found a few small bugs, but also the critcal one here was not
dividing the rwnd amongst the paths.
Obtained from: Michael Tuexen and myself at the IETF hack-fest ;-)
they were passed uninitialized to in6_pcblookup_hash. Instead, do as is done
for IPv4 and use the addresses within the sockaddr structure, which are
correctly populated.
This fixes tcpdrop(8) for IPv6 address pairs.
Reviewed by: bz
This is purely a forwarding plane cleanup; no control plane
code is involved.
Summary:
* Split IPv4 and IPv6 MROUTING support. The static compile-time
kernel option remains the same, however, the modules may now
be built for IPv4 and IPv6 separately as ip_mroute_mod and
ip6_mroute_mod.
* Clean up the IPv4 multicast forwarding code to use BSD queue
and hash table constructs. Don't build our own timer abstractions
when ratecheck() and timevalclear() etc will do.
* Expose the multicast forwarding cache (MFC) and virtual interface
table (VIF) as sysctls, to reduce netstat's dependence on libkvm
for this information for running kernels.
* bandwidth meters however still require libkvm.
* Make the MFC hash table size a boot/load-time tunable ULONG,
net.inet.ip.mfchashsize (defaults to 256).
* Remove unused members from struct vif and struct mfc.
* Kill RSVP support, as no current RSVP implementation uses it.
These stubs could be moved to raw_ip.c.
* Don't share locks or initialization between IPv4 and IPv6.
* Don't use a static struct route_in6 in ip6_mroute.c.
The v6 code is still using a cached struct route_in6, this is
moved to mif6 for the time being.
* More cleanup remains to be merged from ip_mroute.c to ip6_mroute.c.
v4 path tested using ports/net/mcast-tools.
v6 changes are mostly mechanical locking and *have not* been tested.
As these changes partially break some kernel ABIs, they will not
be MFCed. There is a lot more work to be done here.
Reviewed by: Pavlin Radoslavov
any IPv4 multicast operations which reference it.
There is a potential race because ifma_protospec is set to NULL
when we discover the underlying ifnet has gone away. This write
is not covered by the IF_ADDR_LOCK, and it's difficult to widen
its scope without making it a recursive lock. It isn't clear why
this manifests more quickly with 802.11 interfaces, but does not
seem to manifest at all with wired interfaces.
With this change, the 802.11 related panics reported by sam@
and cokane@ should go away. It is not the right fix, that requires
more thought before 8.0.
Idea from: sam
Tested by: cokane
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free. This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.
Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile. They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:
if_ar
if_axe
if_aue
if_cdce
if_cue
if_kue
if_ray
if_rue
if_rum
if_sr
if_udav
if_ural
if_zyd
Drivers that were already disabled because of tty changes:
if_ppp
if_sl
Discussed on: arch@
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):
INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED
Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.
MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz
- When sending large PR-SCTP messages over a
lossy link we would incorrectly calculate the fwd-tsn
- When receiving large multipart pr-sctp packets we would
incorrectly send back a SACK that would renege improperly
on already received packets thus causing unneeded retransmissions.
or not the inpcb is currenty on various hash lookup lists, rather
than using (lport != 0) to detect this. This means that the full
4-tuple of a connection can be retained after close, which should
lead to more sensible netstat output in the window between TCP
close and socket close.
MFC after: 2 weeks
in6p_ip6_nxt
in6p_vflag
in6p_flags
in6p_socket
in6p_lport
in6p_fport
in6p_ppcb
Remove unused v6 macro aliases for inpcb flags:
IN6P_HIGHPORT
IN6P_LOWPORT
IN6P_ANONPORT
IN6P_RECVIF
IN6P_MTUDISC
IN6P_FAITH
IN6P_CONTROLOPTS
References to in6p_lport and in6_fport in sockstat are also replaced with
normal inp_lport and inp_fport references.
MFC after: 3 days
Reviewed by: bz
IPv4 stack.
Diffs are minimized against p4.
PCS has been used for some protocol verification, more widespread
testing of recorded sources in Group-and-Source queries is needed.
sizeof(struct igmpstat) has changed.
__FreeBSD_version is bumped to 800070.
1) WP should never be marked unless flight size is 0
2) When recovering from wp if the peer ack's it we don't mark for retran
3) When recovering, we must assure a timer is still running.
into the advance_peer_ack point so we would incorrectly
send a wrong value in the FWD-TSN
- PR-SCTP bug, where an PR packet is used for a window
probe which could incorrectly get the packet moved
back into the send_queue, which will cause major issues and
should not happen.
- Fix a trace to use the proper macro.
and do not attempt to perform a group lookup.
This is a socket layer lock, and the bottom half of IP
really has no business taking it.
Use the value of the in_mcast_loop sysctl to determine
if we should loop back by default, in the absence of
any multicast socket options. Because the check on
group membership is now deferred to the input path,
an m_copym() is now required.
This should increase multicast send performance where the
source has not requested loopback, although this has not been
benchmarked or measured.
It is also a necessary change for IN_MULTI_LOCK to become
non-recursive, which is required in order to implement IGMPv3
in a thread-safe way.
IPv4 multicast sends are looped back to senders by default
on a stack-wide basis, rather than relying on the socket option.
Note that the sysctl only applies to newly created multicast sockets.
RH0 was deprecated by RFC 5095.
While most of the code had been disabled by #if 0 already, leave a
bit of infrastructure for possible RH2 code and a log message under
BURN_BRIDGES in case a user still tries to send RH0 packets.
Reviewed by: gnn (a bit back, earlier version)
as a handler.
The variable was exported only for debugging, but there is little reason
to do it now that the timekeeping is supported by various other variables.
For the time being just comment out the sysctl, but I think this
should go away.
of sysctl_variables.
I would also remove it from the VNET record but I am unsure if
there is any ABI issue -- so for the time being just mark it as
unused in ip_fw.h, and then we will collect the garbage at some
appropriate time in the future.
MFC after: 3 days
which are not in a module of their own like gif.
Single kernel compiles and universe will fail if the size of the struct
changes. Th expected values are given in sys/vimage.h.
See the comments where how to handle this.
Requested by: peter
- Fix the copy, we can't do a blind copy but must transfer
the data from the old to the new.
- Fix the ACK processing so we properly stop retransmitting
the thing.
- Fix it so if we get a retran we will properly reply with
the saved response without doing anything.
MFC after: 1 month
net/route.h.
Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.
We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.
This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
checks for the tcpcb, previously used to detect complete disconnection,
with INP_DROPPED checks. Correct that, preventing shutdown() from
improperly generating a TCP segment with destination IP and port of
0.0.0.0:0.
PR: kern/132050
Reported by: david gueluy <david.gueluy at netasq.com>
MFC after: 3 weeks
to in_rtrequest(); the radix head lock is already acquired before
rnh_walktree is called in in_rtqtimo_one(). This avoids a recursive
acquisition that is no longer permitted in 8.x due to use of an rwlock
for the radix head lock.
Reported by: dikshie <dikshie at gmail.com>
MFC after: 3 days
longer do we require SCTP to be in the kernel for the
lib to be able to handle SCTP. We do this by moving
the CRC32c checksum into libkern/crc32.c and then adjusting
all routines to use the common methods. Note that this
will improve the performance of iSCSI since they were
using the old single 256 bit table lookup versus the
slicing 8 algorithm (which gives a 4x speed up in
CRC32c calculation :-D)
Reviewed by:rwatson, gnn, scottl, paolo
MFC after: 4 week? (assuming we MFC the alias_sctp changes)
Add a note next to fields in network format.
The n_* types are not enough for compiler checks on endianness, and their
use often requires an otherwise unnecessary #include <netinet/in_systm.h>
The typedef in in_systm.h are still there.
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.
The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.
Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.
Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.
Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.
Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().
Reviewed by: sam
Discussed with: rwatson (renaming to *_internal)
MFC after: 26 days
X-MFC: keep wrapper functions for public symbols?
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.
Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.
Approved by: bz (mentor)
is messing with the UDP tunnel. This means
that if two users actually tried to change the
tunnel port at the same time interesting things COULD
result, but its probably very unlikely to happen :-)
- Prepare for CRC offloading, add MIB counters (RS/MT).
- Bugfix: Disable CRC computation for IPv6 addresses with local scope (MT).
- Bugfix: Handle close() with SO_LINGER correctly when notifications
are generated during the close() call(MT).
- Bugfix: Generate DRY event when sender is dry during subscription.
Only for 1-to-1 style sockets (RS/MT)
- Bugfix: Put vtags for the correct amount of time into time-wait (MT).
- Bugfix: Clear vtag entries correctly on expiration (MT).
- Bugfix: shutdown() indicates ENOTCONN when called for unconnected
1-to-1 style sockets (MT).
- Bugfix: In sctp Auth code (PL).
- Add support for devices that support SCTP csum offload (igb).
- Add missing sctp_associd to mib sysctl xsctp_tcb structure (RS)
Obtained from: With help from Peter Lei and Michael Tuexen
we, like TCP and UDP, move the checksum calculation
into the IP routines when there is no hardware support
we call into the normal SCTP checksum routine.
The next round of SCTP updates will use
this functionality. Of course the IGB driver needs
a few updates to support the new intel controller set
that actually does SCTP csum offload too.
Reviewed by: gnn, rwatson, kmacy
section included a lot of stuff that did not belong there.
So split the block in multiple components each around the relevant stuff.
This said, I wonder if building a kernel where SYSCTL_NODE is not
defined is supported at all.
Submitted by: Marta Carbone
The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.
The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.
Reviewed by: rpaulo, gnn
Approved by: gnn, kmacy (mentors)
Sponsored by: FreeBSD Foundation
read with libkvm) to the addresses of a prison, when inside a
jail. [1]
As the patch from the PR was pre-'new-arp', add checks to the
llt_dump handlers as well.
While touching RTM_GET in route_output(), consistently use
curthread credentials rather than the creds from the socket
there. [2]
PR: kern/68189
Submitted by: Mark Delany <sxcg2-fuwxj@qmda.emu.st> [1]
Discussed with: rwatson [2]
Reviewed by: rwatson
MFC after: 4 weeks