memory location for already existing/initialized mutexes. With random
data in the memory location this fails (ie. after a soft reboot).
Reported by: brueffer, YAMAMOTO Shigeru
Submitted by: YAMAMOTO Shigeru <shigeru-at-iij.ad.jp>
ACK the SYN as required by RFC793, rather than ignoring it. NetBSD
have had a similar change since 1999.
PR: 93236
Submitted by: Grant Edwards <grante@visi.com>
MFC after: 1 month
as possible for the syncache_add() case. The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.
On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.
Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005
- 'tag' & 'untag' action parameters.
- 'tagged' & 'limit' rule options.
Rule examples:
pipe 1 tag tablearg ip from table(1) to any
allow ip from any to table(2) tagged tablearg
allow tcp from table(3) to any 25 setup limit src-addr tablearg
sbin/ipfw/ipfw2.c:
1) new macros
GET_UINT_ARG - support of 'tablearg' keyword, argument range checking.
PRINT_UINT_ARG - support of 'tablearg' keyword.
2) strtoport(): do not silently truncate/accept invalid port list expressions
like: '1,2-abc' or '1,2-3-4' or '1,2-3x4'. style(9) cleanup.
Approved by: glebius (mentor)
MFC after: 1 month
dropped. This prevents a bug introduced during the socket/pcb refcounting
work from occuring, in which occasionally the retransmit timer may fire
after a connection has been reset, resulting in the resulting R|A TCP
packet having a source port of 0, as the port reservation has been
released.
While here, fixing up some RUNLOCK->WUNLOCK bugs.
MFC after: 1 month
(1) bpf peer attaches to interface netif0
(2) Packet is received by netif0
(3) ifp->if_bpf pointer is checked and handed off to bpf
(4) bpf peer detaches from netif0 resulting in ifp->if_bpf being
initialized to NULL.
(5) ifp->if_bpf is dereferenced by bpf machinery
(6) Kaboom
This race condition likely explains the various different kernel panics
reported around sending SIGINT to tcpdump or dhclient processes. But really
this race can result in kernel panics anywhere you have frequent bpf attach
and detach operations with high packet per second load.
Summary of changes:
- Remove the bpf interface's "driverp" member
- When we attach bpf interfaces, we now set the ifp->if_bpf member to the
bpf interface structure. Once this is done, ifp->if_bpf should never be
NULL. [1]
- Introduce bpf_peers_present function, an inline operation which will do
a lockless read bpf peer list associated with the interface. It should
be noted that the bpf code will pickup the bpf_interface lock before adding
or removing bpf peers. This should serialize the access to the bpf descriptor
list, removing the race.
- Expose the bpf_if structure in bpf.h so that the bpf_peers_present function
can use it. This also removes the struct bpf_if; hack that was there.
- Adjust all consumers of the raw if_bpf structure to use bpf_peers_present
Now what happens is:
(1) Packet is received by netif0
(2) Check to see if bpf descriptor list is empty
(3) Pickup the bpf interface lock
(4) Hand packet off to process
From the attach/detach side:
(1) Pickup the bpf interface lock
(2) Add/remove from bpf descriptor list
Now that we are storing the bpf interface structure with the ifnet, there is
is no need to walk the bpf interface list to locate the correct bpf interface.
We now simply look up the interface, and initialize the pointer. This has a
nice side effect of changing a bpf interface attach operation from O(N) (where
N is the number of bpf interfaces), to O(1).
[1] From now on, we can no longer check ifp->if_bpf to tell us whether or
not we have any bpf peers that might be interested in receiving packets.
In collaboration with: sam@
MFC after: 1 month
Since tags are kept while packet resides in kernelspace, it's possible to
use other kernel facilities (like netgraph nodes) for altering those tags.
Submitted by: Andrey Elsukov <bu7cher at yandex dot ru>
Submitted by: Vadim Goncharov <vadimnuclight at tpu dot ru>
Approved by: glebius (mentor)
Idea from: OpenBSD PF
MFC after: 1 month
a defensive programming measure.
Note that whilst these members are not used by the ip_output()
path, we are passing an instance of struct ip_moptions here
which is declared on the stack (which could be considered a
bad thing).
ip_output() does not consume struct ip_moptions, but in case it
does in future, declare an in_multi vector on the stack too to
behave more like ip_findmoptions() does.
as not connected. In soclose() case rip_detach() will kill inpcb for
us later.
It makes rawconnect regression test do not panic a system.
Reviewed by: rwatson
X-MFC after: with all 1th April inpcb changes
connections and get rid of the flow_id as it is not guaranteed to be stable
some (most?) current implementations seem to just zero it out.
PR: kern/88664
Reported by: jylefort
Submitted by: Joost Bekkers (w/ changes)
Tested by "regisr" <regisrApoboxDcom>
By making the imo_membership array a dynamically allocated vector,
this minimizes disruption to existing IPv4 multicast code. This
change breaks the ABI for the kernel module ip_mroute.ko, and may
cause a small amount of churn for folks working on the IGMPv3 merge.
Previously, sockets were subject to a compile-time limitation on
the number of IPv4 group memberships, which was hard-coded to 20.
The imo_membership relationship, however, is 1:1 with regards to
a tuple of multicast group address and interface address. Users who
ran routing protocols such as OSPF ran into this limitation on machines
with a large system interface tree.
seperately. Also use pfil hook/unhook instead of keeping the check
functions in pfil just to return there based on the sysctl. While here fix
some whitespace on a nearby SYSCTL_ macro.
for signicantly optimized UDP socket I/O when using a single UDP
socket from many threads or processes that share it, by avoiding
significant locking and other overhead in the general sosend()
path that isn't necessary for simple datagram sockets. Specifically,
this change results in a significant performance improvement for
threaded name service in BIND9 under load.
Suggested by: Jinmei_Tatsuya at isc dot org
after ipsec4_output processing else KAME IPSec using the handbook
configuration with gif(4) will panic the kernel.
Problem reported by: t. patterson <tp lot.org>
Tested by: t. patterson <tp lot.org>
return NULL. In principle this shouldn't change the behavior, but
avoids returning a potentially invalid/inappropriate pointer to
the caller.
Found with: Coverity Prevent (tm)
Submitted by: pjd
MFC after: 3 months
the fact that the loop through inpcb's in udp_input() tracks the
last inpcb while looping. We keep that name in the calling loop
but not in the delivery routine itself.
MFC after: 3 months
into in_pcbdrop(). Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.
MFC after: 3 months
common pcb tear-down logic into tcp_detach(), which is called from
either. Invoke tcp_drop() from the tcp_usr_abort() path rather than
tcp_disconnect(), as we want to drop it immediately not perform a
FIN sequence. This is one reason why some people were experiencing
panics in sodealloc(), as the netisr and aborting thread were
simultaneously trying to tear down the socket. This bug could often
be reproduced using repeated runs of the listenclose regression test.
MFC after: 3 months
PR: 96090
Reported by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
Tested by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
number state, rather than re-using pcbinfo. This introduces some
additional mutex operations during isn query, but avoids hitting the TCP
pcbinfo lock out of yet another frequently firing TCP timer.
MFC after: 3 months
holding the inpcb lock is sufficient to prevent races in reading
the address and port, as both the inpcb lock and pcbinfo lock are
required to change the address/port.
Improve consistency of spelling in assertions about inp != NULL.
MFC after: 3 months
reference. For now, we allow the possibility that the in_ppcb
pointer in the inpcb may be NULL if a timewait socket has had its
tcptw structure recycled. This allows tcp_timewait() to
consistently unlock the inpcb.
Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months
(tcp_sack_output_debug checks cached hints aginst computed values by walking the
scoreboard and reports discrepancies). The sack hinting code has been stable for
many months now so it is time for the debug code to go. Leaving tcp_sack_output_debug
ifdef'ed out in case we need to resurrect it at a later point.
tcp_timewait(). This corrects a bug (or lack of fixing of a bug)
in tcp_input.c:1.295.
Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months