sctp_inpcb:
1) Make sure not to remove the flag on the PCB until
after the close() caller is back in control with the
lock. Otherwise a quickly freeing assoc could kill the
inpcb and cause a panic.
2) Make sure all calls to log_closing have not released
the locks before calling the log function, we don't
want the logging function to crash us due to a freed
inpcb.
3) Make sure that when we get to the end, we release all
locks (after removing them from view) and as long as
we are NOT the inp-kill timer removing the inp, call
the callout_drain() function so a racing timer won't
later call in and cause a racing crash.
MFC after: 1 week
1) Only use both mapping arrays when NR sack is off. This
way we can hold off moving the cumack (not the best but
workable) when NR-sack is on.
2) We must make sure to just return on the move of the
bit to the NR array if the cum-ack as already went
past the TSN. This prevents marking a bit behind the
array and hitting the invariant code that panic's us.
MFC after: 1 week
We were only paying attention to the nr-mapping-array. Which
seems to make sense on the surface, by definition things
up to the cum-ack should be deliverable thus in the nr-mapping-array.
However (there is always a gotcha) thats not true when it
comes to large messages. The stack may hold the message
while re-assembling it not not deliver it based on several
thresholds. If that happens (which it would for smaller
large messages) then the cum-ack is figured wrong. We
now properly use both arrays in the cum-ack calculation.
MFC after: 1 week.
the timer. This is done by considering the locks
we will destroy and if they are contended we consider
it the same as a reference count being up. Fixing this
appears to cleanup another crash that was appearing with
all the timers where the socket buf lock got corrupted.
2) Fix the sysctl code to take a lot more care when looking
at INP's that are in the GONE or ALLGONE state.
MFC after: 1 week
of apitesters.. Basically we end up with attempting
to destroy a lock thats contended on. A cookie echo
arrives at the same time that the close is happening.
The close gets the lock but the cookie echo has already
passed the check for the gone flag and is then locked
waiting on the create lock.. when we go to destroy it
bam. For now we do the timer destroy for all calls
to close.. We can probably optimize this later so that
we check whats being contended on and if there is contention
then do the timer thing. but this is probably safest since
the inp has been removed from all lists and references and
only the timer can find it.. once the locks are released all
other places will instantly see the GONE flag and bail (thats
what the change in sctp_input is one place that was lacking
the bail code).
MFC after: 1 week
held by checking the create and inp locks as well.
2) Fix a bug in that when a socket is closed an INIT-ACK
is returned, we do NOT unlock the locked_tcb unless its
different (an unlikely scenario). If we blindly unlock as
we were doing before we can end up unlocking the actual
stcb thats about to be sent down to the free function which
requires the lock be held.
MFC after: 1 week
was setup to do an abortive close an association that was
in the accept_queue could get stuck and never freed. Now
we properly start the kill timer on the socket and turn
off the flag (same thing we do for the graceful close method).
MFC after: 1 week
1) Fix the alignment of a comment.
2) Fix a BUG where we were NOT paying attention
to the RESEND marking on retransmitting control
chunks.. and worse we were not decrementing the
retran count that could cause us to loop forever.
3) Add in the valdiate_no_lock function on invariants
so that we will really check all ways out to be sure
a lock does not slip out locked.
MFC after: 1 week.
1) Makes it so that the INVARIANT function validate nolocks is
available anywhere.
2) Fixes a BUG where a close has been done on a collision socket
and the cookie processing would return leaving a lock held.
MFC after: 1 week
code base. We now properly have ONE thread
that services all VNET's. Also we purge out
the old timer based iterator code which had
multiple LOR's and other issues.
MFC after: 3 days
- Make sure that when you kick the streams you add correctly
using a 16 bit unsigned.
- Make sure when sending out you allow FWD-TSN to skip over
and list the ACKED chunks in the stream/seq list (so the
rcv will kick the stream)
MFC after: 3 days
- Slide the map at the proper place.
- Mark the bits in the nr_array ONLY if there
is no marking.
- When generating a FWD-TSN we allow us to skip past
ACKED chunks too.
MFC after: 1 weeks
user sets up a socket to a server sends data and closes
the socket before the server has called accept(). It used
to NOT work at all. Now we add a flag to the assoc and
defer assoc cleanup so that the accept will suceed.
"Whitspace" churn after the VIMAGE/VNET whirls.
Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.
Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.
This also removes some header file pollution for putatively
static global variables.
Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.
Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days
that we allow all possible jail IPs as source address rather than
forcing the "primary". While IPv6 naturally has source address
selection, for legacy IP we do not go through the pain in case
IP_HDRINCL was not set. People should bind(2) for that.
This will, for example, allow ping(|6) -S to work correctly for
non-primary addresses.
Reported by: (ten 211.ru)
Tested by: (ten 211.ru)
MFC after: 4 days
by mrinfo and mtrace, was dropped by the IGMP TTL check. IGMP control
traffic must always have a TTL of 1.
Submitted by: Matthew Luckie
MFC after: 3 days
* Fix delaying of SACK by taking out old optimization code
which does not optimize anymore.
* Fix fast retransmission of chunks abandoned by the
"number of retransmissions" policy.
MFC after: 3 days.
prevented the link-layer entry from being freed.
In both in.c and in6.c (though that code path seems to be basically dead)
plug a reference leak in case of a pending callout being drained.
In if_ether.c consistently add a reference before resetting the callout
and in case we canceled a pending one remove the reference for that.
In the final case in arptimer, before freeing the expired entry, remove
the reference again and explicitly call callout_stop() to clear the active
flag.
In nd6.c:nd6_free() we are only ever called from the callout function and
thus need to remove the reference there as well before calling into
llentry_free().
In if_llatbl.c when freeing entire tables make sure that in case we cancel
a pending callout to remove the reference as well.
Reviewed by: qingli (earlier version)
MFC after: 10 days
Problem observed, patch tested by: simon on ipv6gw.f.o,
Christian Kratzer (ck cksoft.de),
Evgenii Davidov (dado korolev-net.ru)
PR: kern/144564
Configurations still affected: with options FLOWTABLE
This adds the explicit include (so far probably included through one of the
few "hidden" includes in other header files) for vnet.h and adds a cast
to unbreak LINT-VIMAGE.
IPv4 addresses can and do change during normal operation. Testing by
pfSense developers exposed an issue where OpenOSPFD was using the IPv4
address to leave the OSPF link-scope multicast groups on a dynamic
OpenVPN tun interface, rather than using RFC 3678 with the interface
index, which won't be raced when the interface's addresses change.
In inp_join_group():
If we are already a member of an ASM group, and IP_ADD_MEMBERSHIP or
MCAST_JOIN_GROUP ioctls are re-issued, return EADDRINUSE as per the
legacy 4.4BSD multicast API. This bends RFC 3678 slightly, but does
not violate POLA for apps using the old API.
It also stops us falling through to kicking IGMP state transactions
in what is otherwise a no-op case.
[This has already been dealt with in HEAD, but make it explicit before
we MFC the change to 8.]
In inp_leave_group():
Fix a bogus conditional.
Move the ifp null check to ioctls MCAST_LEAVE* in the switch..case
where it actually belongs.
If an interface was specified, by primary IPv4 address, for ioctl
IP_DROP_MEMBERSHIP or MCAST_LEAVE_GROUP (an ASM full leave operation),
then and only then should we look up the ifp from the IPv4 address in
mreqs.imr_interface.
If not, we fall through to imo_match_group() as before, but only in
the IP_DROP_MEMBERSHIP case.
With these changes, the legacy 4.4BSD multicast API idempotence should
be mostly preserved in the SSM enabled IPv4 stack.
Found by: ermal (with pfSense)
MFC after: 3 days
compiled with "options VIMAGE".
As it is now, there is still a single instance of the pipes,
and it is only usable from vnet0 (the main instance).
Trying to use a pipe from a different vimage does not crash
the system as it did before, but the traffic coming out from
the pipe goes to the wrong place, and i still need to
figure out where.
Support for per-vimage pipes is almost there (just a matter of
uncommenting the VNET_* definitions for dn_cfg, plus putting into
the structure the remaining static variables), however i need
first to figure out how init/uninit work, and also to understand
where packets are ending up on exit from a pipe.
In summary: vimage support for dummynet is not complete yet,
but we are getting there.
* Fix handling of mapping arrays when draining mbufs or processing
FORWARD-TSN chunks.
* Cleanup code (no duplicate code anymore for SACKs and NR-SACKs).
Part of this code was developed together with rrs.
MFC after: 2 weeks.
This patch has two fixes for potential kernel panics (one wrong
index, one access to the wrong lock) and two fixes to wrong logic
in a conditional. The potential panics are also on stable/8,
so I am going to MFC the fix quickly.
enabled. Basically most of the operations were incorrect causing
bad sacks when you enabled nr-sack. The fixes range across
4 files and unifiy most of the processing so that we only test
nr_sack flags to decide which type of sack to generate.
Optimization left for this is to combine the sack generation
code and make it capable of generating either sack thus shrinking
out a routine.
Reviewed by: tuexen@freebsd.org
mapping_array expansion would break. Basically
once we expanded the array we no longer had both
mapping arrays in sync which the sack processing code depends on.
This would mean we were randomly referring to memory that was probably
not there. This mostly just gave us bad sack results going back to the peer.
If INVARIENTS was on of course we would hit the panic routine in the sack_check
call.
We also add a print routine for the place where one would panic in
invarients so one can see what the main mapping array holds.
Reviewed by: tuexen@freebsd.org
MFC after: 2 weeks
- increase flow cleaning frequency and decrease flow caching time
when near the flow limit
- stop allocating new flows when within 3% of maxflows don't start
allocating again until below 12.5%
MFC after: 7 days
dscp as a search key in table lookups;
+ (re)implement a sysctl variable to control the expire frequency of
pipes and queues when they become empty;
+ add 'queue number' as optional part of the flow_id. This can be
enabled with the command
queue X config mask queue ...
and makes it possible to support priority-based schedulers, where
packets should be grouped according to the priority and not some
fields in the 5-tuple.
This is implemented as follows:
- redefine a field in the ipfw_flow_id (in sys/netinet/ip_fw.h) but
without changing the size or shape of the structure, so there are
no ABI changes. On passing, also document how other fields are
used, and remove some useless assignments in ip_fw2.c
- implement small changes in the userland code to set/read the field;
- revise the functions in ip_dummynet.c to manipulate masks so they
also handle the additional field;
There are no ABI changes in this commit.
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot. As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.
MFC after: 1 month
Reviewed by: bz
Sponsored by: Juniper Networks
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
PR: 144529
MFC after: 2 weeks
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate
MFC after: 7 days
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.
The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.
Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.
MFC after: 5 days
to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet.
We will still leak pages (especially for zones marked NOFREE).
Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.
Sponsored by: ISPsystem
Reviewed by: rwatson
MFC after: 5 days
violated: so_pcb can never be NULL for a valid UDP socket, and it is
always SOCK_DGRAM. Use sotoinpcb() as the rest of the UDP code does.
MFC after: 1 week
Reviewed by: bz
Sponsored by: Juniper Networks
been required since FreeBSD 7.0 when the so_pcb pointer leading to inp was
guaranteed to be stable when a valid socket reference is held (as it is in
the output path).
MFC after: 1 week
Reviewed by: bz
Sponsored by: Juniper Networks
tcbinfo lock there: r175612, which re-added it, masked a race between
sonewconn(2) and accept(2) that could allow an incompletely initialized
address on a newly-created socket on a listen queue to be exposed. Full
details can be found in that commit message.
MFC after: 1 week
Sponsored by: Juniper Networks
to not leak them making the VM subsystem unhappy with every stoped vnet(*).
We will still leak pages (especially as zones are marked NOFREE).
(*) This will also keep vmstat -z more usable.
Sponsored by: ISPsystem
MFC after: 5 days
and tested over the past two months in the ipfw3-head branch. This
also happens to be the same code available in the Linux and Windows
ports of ipfw and dummynet.
The major enhancement is a completely restructured version of
dummynet, with support for different packet scheduling algorithms
(loadable at runtime), faster queue/pipe lookup, and a much cleaner
internal architecture and kernel/userland ABI which simplifies
future extensions.
In addition to the existing schedulers (FIFO and WF2Q+), we include
a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new,
very fast version of WF2Q+ called QFQ.
Some test code is also present (in sys/netinet/ipfw/test) that
lets you build and test schedulers in userland.
Also, we have added a compatibility layer that understands requests
from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries,
and replies correctly (at least, it does its best; sometimes you
just cannot tell who sent the request and how to answer).
The compatibility layer should make it possible to MFC this code in a
relatively short time.
Some minor glitches (e.g. handling of ipfw set enable/disable,
and a workaround for a bug in RELENG_7's /sbin/ipfw) will be
fixed with separate commits.
CREDITS:
This work has been partly supported by the ONELAB2 project, and
mostly developed by Riccardo Panicucci and myself.
The code for the qfq scheduler is mostly from Fabio Checconi,
and Marta Carbone and Francesco Magno have helped with testing,
debugging and some bug fixes.
a "locked" version that will only handle a single network stack
instance. The latter is called directly from ip_destroy().
Hook up an ip_destroy() function to release resources from the
legacy IP network layer upon virtual network stack teardown.
Sponsored by: ISPsystem
Reviewed by: rwatson
MFC After: 5 days
tearing down a network stack (in the VIMAGE jail+vnet case).
For that break out the logic from tcp_hc_purge() into an internal
function we can call from both, the sysctl handler and the
tcp_hc_destroy().
Sponsored by: ISPsystem
Reviewed by: silby, lstewart
MFC After: 8 days
the IP addresses of the tunnel end points to the same value. In
these cases the loopback route is not installed for the local
end.
Verified by: avg
MFC after: 5 days
For our compiler the two constructs are completely equivalent, but
some compilers (including MSC and tcc) use the base type for alignment,
which in the cases touched here result in aligning the bitfields
to 32 bit instead of the 8 bit that is meant here.
Note that almost all other headers where small bitfields
are used have u_int8_t instead of u_int.
MFC after: 3 days
freed the inpcb, it was possible to not set the
proper flags on the pcb (i.e. the socket is not there).
This is HIGHLY unlikely since no one else should be
able to find the socket.. but for consistency we
do the proper loop thing to make sure that we
mark the socket as gone on the PCB.
whether to use source address selection (default) or the primary
jail address for unbound outgoing connections.
This is intended to be used by people upgrading from single-IP
jails to multi-IP jails but not having to change firewall rules,
application ACLs, ... but to force their connections (unless
otherwise changed) to the primry jail IP they had been used for
years, as well as for people prefering to implement similar policies.
Note that for IPv6, if configured incorrectly, this might lead to
scope violations, which single-IPv6 jails could as well, as by the
design of jails. [1]
Reviewed by: jamie, hrs (ipv6 part)
Pointed out by: hrs [1]
MFC After: 2 weeks
Asked for by: Jase Thew (bazerka beardz.net)
ip_divert work as a client of pf(4),
make ip_divert not depend on ipfw.
This is achieved by moving to ip_var.h the struct ipfw_rule_ref
(which is part of the mtag for all reinjected packets) and other
declarations of global variables, and moving to raw_ip.c global
variables for filter and divert hooks.
Note that names and locations could be made more generic
(ipfw_rule_ref is really a generic reference robust to reconfigurations;
the packet filter is not necessarily ipfw; filters and their clients
are not necessarily limited to ipv4), but _right now_ most
of this stuff works on ipfw and ipv4, so i don't feel like
doing a gratuitous renaming, at least for the time being.
statically configured entry of the same host. This bug was
due to the expiration timer was not cancelled when installing
the static entry. Since there exist a potential race condition
with respect to timer cancellation, simply check for the
LLE_STATIC bit inside the expiration function instead of
cancelling the active timer.
MFC after: 5 days
- use a uniform mtag format for all packets that exit and re-enter
the firewall in the middle of a rulechain. On reentry, all tags
containing reinject info are renamed to MTAG_IPFW_RULE so the
processing is simpler.
- make ipfw and dummynet use ip_len and ip_off in network format
everywhere. Conversion is done only once instead of tracking
the format in every place.
- use a macro FREE_PKT to dispose of mbufs. This eases portability.
On passing i also removed a few typos, staticise or localise variables,
remove useless declarations and other minor things.
Overall the code shrinks a bit and is hopefully more readable.
I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr.
For ng_ipfw i am actually waiting for feedback from glebius@ because
we might have some small changes to make.
For if_bridge and if_ethersubr feedback would be welcome
(there are still some redundant parts in these two modules that
I would like to remove, but first i need to check functionality).
aliases were added or deleted. The announced route entry for
an address alias is no longer empty because this empty route
entry was causing some route daemon to fail and exit abnormally.
MFC after: 5 days
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.
MFC after: 5 days
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.
PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text
+ remove two unnecessary initializations in ip_output;
+ localize 'len';
+ introduce a temporary variable n to count the number of fragments,
the compiler seems unable to identify a common subexpression
(written 3 times, used twice);
+ document some assumptions on ip_len and ip_hl
r201011
- move most of ng_ipfw.h into ip_fw_private.h, as this code is
ipfw-specific. This removes a dependency on ng_ipfw.h from some files.
- move many equivalent definitions of direction (IN, OUT) for
reinjected packets into ip_fw_private.h
- document the structure of the packet tags used for dummynet
and netgraph;
r201049
- merge some common code to attach/detach hooks into
a single function.
r201055
- remove some duplicated code in ip_fw_pfil. The input
and output processing uses almost exactly the same code so
there is no need to use two separate hooks.
ip_fw_pfil.o goes from 2096 to 1382 bytes of .text
r201057 (see the svn log for full details)
- macros to make the conversion of ip_len and ip_off
between host and network format more explicit
r201113 (the remaining parts)
- readability fixes -- put braces around some large for() blocks,
localize variables so the compiler does not think they are uninitialized,
do not insist on precise allocation size if we have more than we need.
r201119
- when doing a lookup, keys must be in big endian format because
this is what the radix code expects (this fixes a bug in the
recently-introduced 'lookup' option)
No ABI changes in this commit.
MFC after: 1 week
or we create loops.
The divert cookie (that can be set from userland too)
contains the matching rule nr, so we must start from nr+1.
Reported by: Joe Marcus Clarke
reformatting to avoid unnecessary line breaks, small block
restructuring to avoid unnecessary nesting, replace macros
with function calls, etc.
As a side effect of code restructuring, this commit fixes one bug:
previously, if a realloc() failed, memory was leaked. Now, the
realloc is not there anymore, as we first count how much memory
we need and then do a single malloc.
and remove all O(N) sequences from kernel critical sections in ipfw.
In detail:
1. introduce a IPFW_UH_LOCK to arbitrate requests from
the upper half of the kernel. Some things, such as 'ipfw show',
can be done holding this lock in read mode, whereas insert and
delete require IPFW_UH_WLOCK.
2. introduce a mapping structure to keep rules together. This replaces
the 'next' chain currently used in ipfw rules. At the moment
the map is a simple array (sorted by rule number and then rule_id),
so we can find a rule quickly instead of having to scan the list.
This reduces many expensive lookups from O(N) to O(log N).
3. when an expensive operation (such as insert or delete) is done
by userland, we grab IPFW_UH_WLOCK, create a new copy of the map
without blocking the bottom half of the kernel, then acquire
IPFW_WLOCK and quickly update pointers to the map and related info.
After dropping IPFW_LOCK we can then continue the cleanup protected
by IPFW_UH_LOCK. So userland still costs O(N) but the kernel side
is only blocked for O(1).
4. do not pass pointers to rules through dummynet, netgraph, divert etc,
but rather pass a <slot, chain_id, rulenum, rule_id> tuple.
We validate the slot index (in the array of #2) with chain_id,
and if successful do a O(1) dereference; otherwise, we can find
the rule in O(log N) through <rulenum, rule_id>
All the above does not change the userland/kernel ABI, though there
are some disgusting casts between pointers and uint32_t
Operation costs now are as follows:
Function Old Now Planned
-------------------------------------------------------------------
+ skipto X, non cached O(N) O(log N)
+ skipto X, cached O(1) O(1)
XXX dynamic rule lookup O(1) O(log N) O(1)
+ skipto tablearg O(N) O(1)
+ reinject, non cached O(N) O(log N)
+ reinject, cached O(1) O(1)
+ kernel blocked during setsockopt() O(N) O(1)
-------------------------------------------------------------------
The only (very small) regression is on dynamic rule lookup and this will
be fixed in a day or two, without changing the userland/kernel ABI
Supported by: Valeria Paoli
MFC after: 1 month
the leading underscores since they are now implemented.
- Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info
structure.
Reviewed by: rwatson
MFC after: 2 weeks