for example:
fwd tablearg ip from any to table(1)
where table 1 has entries of the form:
1.1.1.0/24 10.2.3.4
208.23.2.0/24 router2
This allows trivial implementation of a secondary routing table implemented
in the firewall layer.
I expect more work (under discussion with Glebius) to follow this to clean
up some of the messy parts of ipfw related to tables.
Reviewed by: Glebius
MFC after: 1 month
in older versions of FreeBSD. This option is pointless as it is needed in just
about every interesting usage of forward that I have ever seen. It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x or 7.x
Reviewed by: glebius
MFC after: 1 week
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.
Longer term we may want to kill off if_name() entierly since all modern
BSDs have if_xname variables rendering it unnecessicary.
tcp_twstart(), but not to the other, tcp_detach(), as the socket is
already being torn down and therefore there are no listeners. This avoids
a panic if kqueue state is registered on the socket at close(), and
eliminates to XXX comments. There is one case remaining in which
tcp_discardcb() reaches up to the socket layer as part of the TCP host
cache, which would be good to avoid.
Reported by: Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.
This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.
Reviewed by: gnn
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)
I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)
Reviewed by: rwatson@, mohans@
MFC after: 1 week
parameter that can specify configuration parameters:
o rev cloner api's to add optional parameter block
o add SIOCCREATE2 that accepts parameter data
o rev vlan support to use new api (maintain old code)
Reviewed by: arch@
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures. Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.
Suggested by: glebius, jhb, ru
used since FreeBSD-SA-06:04.ipfw.
Adopt send_reject6 to what had been done for legacy IP: no longer
send or permit sending rejects for any but the first fragment.
Discussed with: oleg, csjp (some weeks ago)
o don't assign remote/local host/port information manually between provided
struct in_conninfo and struct syncache, bcopy() it instead
o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture
the purpose of this field
o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
o fix IPSEC error case printf's to report correct function name
o in syncache_socket() only transpose enhanced tcp options parameters to
struct tcpcb when the inpcb doesn't has TF_NOOPT set
o in syncache_respond() reorder stack variables
o in syncache_respond() remove bogus KASSERT()
No functional changes.
Sponsored by: TCP/IP Optimization Fundraise 2005
o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
usage accordingly
o update the comments to the tcp_dooptions() invocation in
tcp_input():after_listen to reflect reality
o move the logic checking the echoed timestamp out of tcp_dooptions() to the
only place that uses it next to the invocation described in the previous
item
o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
o add comments in to struct tcpopt.to_flags #defines
No functional changes.
Sponsored by: TCP/IP Optimization Fundraise 2005
infinite loop with net.inet6.ip6.fw.deny_unknown_exthdrs=0.
- Teach ipv6 and ipencap as they appear in an IPv4/IPv6 over IPv6
tunnel.
- Test the next extention header even when the routing header type
is unknown with net.inet6.ip6.fw.deny_unknown_exthdrs=0.
Found by: xcast-fan-club
MFC after: 1 week
memory location for already existing/initialized mutexes. With random
data in the memory location this fails (ie. after a soft reboot).
Reported by: brueffer, YAMAMOTO Shigeru
Submitted by: YAMAMOTO Shigeru <shigeru-at-iij.ad.jp>
ACK the SYN as required by RFC793, rather than ignoring it. NetBSD
have had a similar change since 1999.
PR: 93236
Submitted by: Grant Edwards <grante@visi.com>
MFC after: 1 month
as possible for the syncache_add() case. The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.
On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.
Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005
- 'tag' & 'untag' action parameters.
- 'tagged' & 'limit' rule options.
Rule examples:
pipe 1 tag tablearg ip from table(1) to any
allow ip from any to table(2) tagged tablearg
allow tcp from table(3) to any 25 setup limit src-addr tablearg
sbin/ipfw/ipfw2.c:
1) new macros
GET_UINT_ARG - support of 'tablearg' keyword, argument range checking.
PRINT_UINT_ARG - support of 'tablearg' keyword.
2) strtoport(): do not silently truncate/accept invalid port list expressions
like: '1,2-abc' or '1,2-3-4' or '1,2-3x4'. style(9) cleanup.
Approved by: glebius (mentor)
MFC after: 1 month
dropped. This prevents a bug introduced during the socket/pcb refcounting
work from occuring, in which occasionally the retransmit timer may fire
after a connection has been reset, resulting in the resulting R|A TCP
packet having a source port of 0, as the port reservation has been
released.
While here, fixing up some RUNLOCK->WUNLOCK bugs.
MFC after: 1 month
(1) bpf peer attaches to interface netif0
(2) Packet is received by netif0
(3) ifp->if_bpf pointer is checked and handed off to bpf
(4) bpf peer detaches from netif0 resulting in ifp->if_bpf being
initialized to NULL.
(5) ifp->if_bpf is dereferenced by bpf machinery
(6) Kaboom
This race condition likely explains the various different kernel panics
reported around sending SIGINT to tcpdump or dhclient processes. But really
this race can result in kernel panics anywhere you have frequent bpf attach
and detach operations with high packet per second load.
Summary of changes:
- Remove the bpf interface's "driverp" member
- When we attach bpf interfaces, we now set the ifp->if_bpf member to the
bpf interface structure. Once this is done, ifp->if_bpf should never be
NULL. [1]
- Introduce bpf_peers_present function, an inline operation which will do
a lockless read bpf peer list associated with the interface. It should
be noted that the bpf code will pickup the bpf_interface lock before adding
or removing bpf peers. This should serialize the access to the bpf descriptor
list, removing the race.
- Expose the bpf_if structure in bpf.h so that the bpf_peers_present function
can use it. This also removes the struct bpf_if; hack that was there.
- Adjust all consumers of the raw if_bpf structure to use bpf_peers_present
Now what happens is:
(1) Packet is received by netif0
(2) Check to see if bpf descriptor list is empty
(3) Pickup the bpf interface lock
(4) Hand packet off to process
From the attach/detach side:
(1) Pickup the bpf interface lock
(2) Add/remove from bpf descriptor list
Now that we are storing the bpf interface structure with the ifnet, there is
is no need to walk the bpf interface list to locate the correct bpf interface.
We now simply look up the interface, and initialize the pointer. This has a
nice side effect of changing a bpf interface attach operation from O(N) (where
N is the number of bpf interfaces), to O(1).
[1] From now on, we can no longer check ifp->if_bpf to tell us whether or
not we have any bpf peers that might be interested in receiving packets.
In collaboration with: sam@
MFC after: 1 month
Since tags are kept while packet resides in kernelspace, it's possible to
use other kernel facilities (like netgraph nodes) for altering those tags.
Submitted by: Andrey Elsukov <bu7cher at yandex dot ru>
Submitted by: Vadim Goncharov <vadimnuclight at tpu dot ru>
Approved by: glebius (mentor)
Idea from: OpenBSD PF
MFC after: 1 month
a defensive programming measure.
Note that whilst these members are not used by the ip_output()
path, we are passing an instance of struct ip_moptions here
which is declared on the stack (which could be considered a
bad thing).
ip_output() does not consume struct ip_moptions, but in case it
does in future, declare an in_multi vector on the stack too to
behave more like ip_findmoptions() does.
as not connected. In soclose() case rip_detach() will kill inpcb for
us later.
It makes rawconnect regression test do not panic a system.
Reviewed by: rwatson
X-MFC after: with all 1th April inpcb changes
connections and get rid of the flow_id as it is not guaranteed to be stable
some (most?) current implementations seem to just zero it out.
PR: kern/88664
Reported by: jylefort
Submitted by: Joost Bekkers (w/ changes)
Tested by "regisr" <regisrApoboxDcom>
By making the imo_membership array a dynamically allocated vector,
this minimizes disruption to existing IPv4 multicast code. This
change breaks the ABI for the kernel module ip_mroute.ko, and may
cause a small amount of churn for folks working on the IGMPv3 merge.
Previously, sockets were subject to a compile-time limitation on
the number of IPv4 group memberships, which was hard-coded to 20.
The imo_membership relationship, however, is 1:1 with regards to
a tuple of multicast group address and interface address. Users who
ran routing protocols such as OSPF ran into this limitation on machines
with a large system interface tree.
seperately. Also use pfil hook/unhook instead of keeping the check
functions in pfil just to return there based on the sysctl. While here fix
some whitespace on a nearby SYSCTL_ macro.
for signicantly optimized UDP socket I/O when using a single UDP
socket from many threads or processes that share it, by avoiding
significant locking and other overhead in the general sosend()
path that isn't necessary for simple datagram sockets. Specifically,
this change results in a significant performance improvement for
threaded name service in BIND9 under load.
Suggested by: Jinmei_Tatsuya at isc dot org
after ipsec4_output processing else KAME IPSec using the handbook
configuration with gif(4) will panic the kernel.
Problem reported by: t. patterson <tp lot.org>
Tested by: t. patterson <tp lot.org>
return NULL. In principle this shouldn't change the behavior, but
avoids returning a potentially invalid/inappropriate pointer to
the caller.
Found with: Coverity Prevent (tm)
Submitted by: pjd
MFC after: 3 months
the fact that the loop through inpcb's in udp_input() tracks the
last inpcb while looping. We keep that name in the calling loop
but not in the delivery routine itself.
MFC after: 3 months
into in_pcbdrop(). Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.
MFC after: 3 months
common pcb tear-down logic into tcp_detach(), which is called from
either. Invoke tcp_drop() from the tcp_usr_abort() path rather than
tcp_disconnect(), as we want to drop it immediately not perform a
FIN sequence. This is one reason why some people were experiencing
panics in sodealloc(), as the netisr and aborting thread were
simultaneously trying to tear down the socket. This bug could often
be reproduced using repeated runs of the listenclose regression test.
MFC after: 3 months
PR: 96090
Reported by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
Tested by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
number state, rather than re-using pcbinfo. This introduces some
additional mutex operations during isn query, but avoids hitting the TCP
pcbinfo lock out of yet another frequently firing TCP timer.
MFC after: 3 months
holding the inpcb lock is sufficient to prevent races in reading
the address and port, as both the inpcb lock and pcbinfo lock are
required to change the address/port.
Improve consistency of spelling in assertions about inp != NULL.
MFC after: 3 months
reference. For now, we allow the possibility that the in_ppcb
pointer in the inpcb may be NULL if a timewait socket has had its
tcptw structure recycled. This allows tcp_timewait() to
consistently unlock the inpcb.
Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months
(tcp_sack_output_debug checks cached hints aginst computed values by walking the
scoreboard and reports discrepancies). The sack hinting code has been stable for
many months now so it is time for the debug code to go. Leaving tcp_sack_output_debug
ifdef'ed out in case we need to resurrect it at a later point.
tcp_timewait(). This corrects a bug (or lack of fixing of a bug)
in tcp_input.c:1.295.
Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months
NULL. We currently do allow this to happen, but may want to remove that
possibility in the future. This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled. This will
be cleaned up in the future.
Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months
The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT
checks appear to have always been required, but not been there, which
is/was a bug. This avoids unconditionally casting of in_ppcb to a tcpcb,
when it may be a twtcb, which may have resulted in obscure ICMP-related
panics in earlier releases.
MFC after: 3 months
casts.
Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.
Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.
Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.
MFC after: 3 months
immediately rather than jumping to the normal output handling, which
assumes we've pulled out the inpcb, which hasn't happened at this
point (and isn't necessary).
Return ECONNABORTED instead of EINVAL when the inpcb has entered
INP_TIMEWAIT or INP_DROPPED, as this is the documented error value.
This may correct the panic seen by Ganbold.
MFC after: 1 month
Reported by: Ganbold <ganbold at micom dot mng dot net>
disconnect for fully connected sockets was dropped, meaning that if
the socket was closed while the connection was alive, it would be
leaked. Structure tcp_usr_detach() so that there are two clear
parts: initiating disconnect, and reclaiming state, and reintroduce
the tcp_disconnect() call in the first part.
MFC after: 3 months
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close(). In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).
MFC after: 3 months
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, in protocol
shutdown methods, and in raw IP send.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Invoke in_pcbfree() after in_pcbdetach() in order to free the
detached in_pcb structure for a socket.
MFC after: 3 months
- in_pcbdetach(), which removes the link between an inpcb and its
socket.
- in_pcbfree(), which frees a detached pcb.
Unlike the previous in_pcbdetach(), neither of these functions will
attempt to conditionally free the socket, as they are responsible only
for managing in_pcb memory. Mirror these changes into in6_pcbdetach()
by breaking it into in6_pcbdetach() and in6_pcbfree().
While here, eliminate undesired checks for NULL inpcb pointers in
sockets, as we will now have as an invariant that sockets will always
have valid so_pcb pointers.
MFC after: 3 months
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.
MFC after: 3 months
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.
This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.
MFC after: 3 months
reason, seems to be where new flags are getting defined:
INP_DROPPED - The protocol has terminated this connection and the socket
is not reusable: when the socket code enters the protocol,
an error is immediately returned. This will substitute for
NULLing the so_pcb socket field, helping to implement the
invariant that all valid sockets have valid pcb's in TCP.
INP_SOCKREF - The protocol has become the owner of the socket reference,
and will need to free it when freeing the pcb, which will
be used when a TCP socket is closed but still has queued
data.
MFC after: 1 month