called without an inpcb pointer despite holding the tcbinfo global
lock, which led to a deadlock or panic when ipfw then tried to
acquire it recursively.
Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days
unconditionally drop the tcbinfo lock (after all, we assert it a few
lines before), but call tcp_dropwithreset() under both inpcb and
inpcbinfo locks only if we pass in a tcpcb. Otherwise, if the pointer is
NULL, firewall code may later recurse on the global tcbinfo lock trying
to look up an inpcb.
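Roughly, the resulting pattern in the dropwithreset path looks like the
sketch below (abbreviated and not verbatim from the diff; lock macro and
variable names follow tcp_input.c conventions):

	INP_INFO_WLOCK_ASSERT(&tcbinfo);
	if (tp != NULL) {
		/* With a tcpcb, send the reset while holding both locks. */
		tcp_dropwithreset(m, th, tp, tlen, rstreason);
		INP_WUNLOCK(inp);
		INP_INFO_WUNLOCK(&tcbinfo);
	} else {
		/*
		 * No tcpcb: drop the global lock first so that firewall
		 * code called from tcp_dropwithreset() can take it for
		 * its own inpcb lookup without recursing.
		 */
		INP_INFO_WUNLOCK(&tcbinfo);
		tcp_dropwithreset(m, th, NULL, tlen, rstreason);
	}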
This is an instance where a layering violation leads not only
potentially to code reentrance and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden. When these locks were mutexes, they simply recursed.
Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days
rt_check() in its original form proved to be sufficient and
rt_check_fib() can go away (as can its evil twin in_rt_check()).
I believe this does NOT address the crashes people have been seeing
in rt_check.
MFC after: 1 week
the same way it has been implemented for IPv4.
Reviewed by: bms (skimmed)
Tested by: Nick Hilliard (nick netability.ie) (with more changes)
MFC after: 2 months
congestion window not being incremented if cwnd > maxseg^2.
As suggested in RFC 2581, increment the cwnd by 1 in this case.
See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.
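The increment now looks roughly like the following (sketch based on the
ACK processing in tcp_input.c, abbreviated):

	u_int cw = tp->snd_cwnd;
	u_int incr = tp->t_maxseg;

	if (cw > tp->snd_ssthresh) {
		/*
		 * Congestion avoidance: grow cwnd by about
		 * maxseg*maxseg/cwnd per ACK.  Once cwnd > maxseg^2 the
		 * integer division yields 0, so clamp the increment to 1
		 * as suggested in RFC 2581.
		 */
		incr = max((incr * incr / cw), 1);
	}
	tp->snd_cwnd = min(cw + incr, TCP_MAXWIN << tp->snd_scale);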
Submitted by: Alana Huebner, Lawrence Stewart,
Grenville Armitage (caia.swin.edu.au)
Reviewed by: dwmalone, gnn, rpaulo
MFC after: 3 days
Payload Length) as set in tcpip_fillheaders().
ip6_output() will calculate it based on the length from the
mbuf packet header itself.
So initialize the value in tcpip_fillheaders() in the correct
(network) byte order.
With the above change, to my reading, all places calling tcp_trace()
pass in the ip6 header via ipgen as serialized in the mbuf and with
ip6_plen in network byte order.
Thus convert the IPv6 payload length to host byte order before printing.
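In sketch form the two halves of the change are (field names as in the
headers; exact context abbreviated):

	/* tcpip_fillheaders(): keep the template header in network byte order. */
	ip6->ip6_plen = htons(sizeof(struct tcphdr));

	/*
	 * tcp_trace() and friends: the header is passed in as serialized in
	 * the mbuf, so convert back to host byte order before printing.
	 */
	u_short plen = ntohs(ip6->ip6_plen);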
MFC after: 2 months
calls the latter.
Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.
This gives us one central place where we calculate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.
PR: kern/118455
Reviewed by: silby (back in March)
MFC after: 2 months
SYSCTL_PROCs and check that the default mss for neither v4 nor
v6 goes below the minimum MSS constant (216).
This prevents people from shooting themselves in the foot.
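The check is roughly of the following shape (handler and variable names
here are illustrative, not verbatim):

	static int
	sysctl_net_inet_tcp_mss_check(SYSCTL_HANDLER_ARGS)
	{
		int error, new;

		new = tcp_mssdflt;
		error = sysctl_handle_int(oidp, &new, 0, req);
		if (error == 0 && req->newptr != NULL) {
			if (new < TCP_MINMSS)	/* do not allow values below 216 */
				error = EINVAL;
			else
				tcp_mssdflt = new;
		}
		return (error);
	}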
PR: kern/118455 (remotely related)
Reviewed by: silby (as part of a larger patch in March)
MFC after: 2 months
This is different from the first one (as len gets updated between those
two) and would have caught, at a well-defined place, various edge cases
(read: bugs) that I had been debugging for the last months, instead of
having them trigger (random) panics further down the call graph.
MFC after: 2 months
the default rule number but also the maximum rule number. User space
software such as ipfw and natd should be aware of its value. The
software that already includes ip_fw.h should use the defined value. All
others are expected to use the sysctl (as discussed on net@).
MFC after: 5 days
Discussed on: net@
translation. It turns out this is useful for applications which require
source port randomization for security (e.g. DNS servers).
Discussed with: secteam
Requested by: mlaier
MFC after: 2 weeks
wind up with the incorrect checksum on the wire when transmitted via
devices that do checksum offloading.
PR: kern/119635
Reviewed by: rwatson
MFC after: 5 days
- Change it so that without INVARIANTS there are
no panics in SCTP.
- sctp_timer changes so that we have a recovery mechanism
when the sent list is out of order.
storage. We can safely remove the label copying operations since
M_MOVE_PKTHDR will move the mbuf tags (which contain MAC labels) to
the destination mbuf.
MFC after: 1 week
Discussed with: rwatson
we can be sure that it's valid.
In case we abort early, free it again; else put it into the syncache.
We need the cred in the syncache to be able to restrict what will be
exported by the sysctl helper function syncache_pcblist() (to netstat)
within jails.
PR: kern/126493
Reviewed by: rwatson (earlier versions)
MFC after: 3 days
the IP multicast input code from the output path; we don't allow
reentrance of the input path from the IP output path; it must use the
netisr due to potential lock recursion.
MFC after: 3 days
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
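As an illustration of the mapping (tcp_mssdflt stands in for any of the
virtualized globals; the actual macro set is much larger):

	extern int	tcp_mssdflt;

	/*
	 * For now the V_ name maps straight back to the global, so this
	 * change is a NOP; later it can be redirected into a per-vimage
	 * instance without touching the call sites again.
	 */
	#define	V_tcp_mssdflt	tcp_mssdflt

	/* call sites are converted to the V_ name */
	tp->t_maxseg = V_tcp_mssdflt;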
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
into v4-only vs. v6-only inp_flags processing.
When ip6_savecontrol_v4() is called from ip6_savecontrol() we
were not passing back the **mp, thus the information would be missing
in userland.
Instead of going with a *** as suggested in the PR we are returning
**mp now and passing in the v4only flag as a pointer argument.
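The resulting interface is roughly (argument names illustrative, not
verbatim from the commit):

	struct mbuf **
	ip6_savecontrol_v4(struct inpcb *inp, struct mbuf *m, struct mbuf **mp,
	    int *v4only);

	/* caller in ip6_savecontrol(): */
	mp = ip6_savecontrol_v4(in6p, m, mp, &v4only);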
PR: kern/126349
Reviewed by: rwatson, dwmalone
keyword. But it doesn't work. Two options.. make it no longer accept it,
or actually make it work.. I chose the 2nd..
Allow the tablearg to be used to specify a skipto destination.
This is actually a very powerful construct if used correctly, or a sink
of cpu cycles if used badly.
Changes to the man page will follow.
This gives significant performance improvements when many raw sockets are used.
Benchmarks of mpd handling 1000 simultaneous PPTP connections show up to a 50%
performance boost. With a higher number of connections the benefit becomes even
bigger. PopTop and others should also get some benefits.
- removing 'const' qualifier from an input parameter to conform to the type
required by rw_assert();
- using in_addr->s_addr to retrieve the 32-bit address value.
Observed by: tinderbox
information from rip_input() to rip_append(). Instead, pass the source
address for an IP datagram to rip_append() using a stack-allocated
sockaddr_in, similar to udp_input() and udp_append().
Prior to the move to rwlocks for inpcbinfo, this was not a problem, as
use of the global was synchronized using the ripcbinfo mutex, but with
read-locking there is the potential for a race during concurrent
receive.
This problem is not present in the IPv6 raw IP socket code, which
already used a stack variable for the address.
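The pattern now is roughly the following (abbreviated; the inpcb lookup
loop is left out):

	struct sockaddr_in ripsrc;

	bzero(&ripsrc, sizeof(ripsrc));
	ripsrc.sin_len = sizeof(ripsrc);
	ripsrc.sin_family = AF_INET;
	ripsrc.sin_addr = ip->ip_src;
	/* ... walk the matching inpcbs ... */
	rip_append(last, ip, m, &ripsrc);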
Spotted by: mav
MFC after: 1 week (before inpcbinfo rwlock changes)
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:
- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an inpcb read lock in udp_output(), which stabilizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and we are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock (see the sketch after this list).
- Otherwise, no further locking is required (common case).
- Update comments.
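A rough sketch of the try-lock fallback (macro names follow the in_pcb.h
conventions but are not verbatim, and need_global stands in for the
bound-but-explicit-address test above):

	INP_RLOCK(inp);
	if (need_global) {
		if (INP_INFO_TRY_WLOCK(&udbinfo) == 0) {
			/* Lock order is inpcbinfo before inpcb: back off. */
			INP_RUNLOCK(inp);
			INP_INFO_WLOCK(&udbinfo);
			INP_RLOCK(inp);
		}
	}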
In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.
The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a few weeks in the tree.
Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
udp_output() so that argument validation occurs before jail processing.
Add additional comments explaining what's going on when we process
addresses and binding during udp_output().
MFC after: 3 weeks
2) Adds some __UserSpace__ on some of the common defines that
the user space code needs
3) Fixes a bug in the case where sending data up to a user failed. We
need to a) trim off the data chunk headers, if present, and
b) make sure the frag bit is communicated properly for the
msgs coming off the stream queues... i.e. we see if some
of the msg has been taken.
Obtained from: jeli contributed the VIMAGE changes on this pass. Thanks Julian!
inpcb. When directly invoking udp_notify() from udp_ctlinput(), acquire
only a read lock; we may still see write locks in udp_notify() as the
in_pcbnotifyall() routine is shared with TCP and always uses a write lock
on the inpcb being notified.
MFC after: 1 month
some code paths, global or inpcb write locks are required, but for other
code paths, read locks or no locking at all are sufficient for the data
structures.
MFC after: 1 month
source or a specific destination address is requested as part of a send
on a UDP socket, read lock the inpcb rather than write lock it. This
will allow fully parallel transmit down to the IP layer when sending
simultaneously from multiple threads on a connected UDP socket.
Parallel transmit for more complex cases, such as when sendto(2) is
invoked with an address and there's already a local binding, will
follow.
MFC after: 1 month
possible to exhaust and garble the stack with a packet that contains a
couple of hundred nested encapsulation levels.
Submitted by: Ming Fu <fming@borderware.com>
Reviewed by: rwatson
PR: kern/85320
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred). Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.
Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired. This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acquisition of Giant and any calling
context.
It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.
Reviewed by: bz
MFC after: 1 month
X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
but the rest can go back.
generating an RTM_MISS for every IP packet forwarded, making user space
routing daemons unhappy.
PR: kern/123621, kern/124540, kern/122338
Reported by: Paul <paul gtcomm.net>, Mike Tancsa <mike sentex.net> on net@
Tested by: Paul and Mike
Reviewed by: andre
MFC after: 3 days
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.
This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.
Tested by: kris, ps
rather than write locking: while we need to maintain a valid reference
to the inpcb and fix its state, no protocol layer state is modified
during an IPv4 UDP receive -- there are only changes at the socket
layer, which is separately protected by socket locking.
While parallel concurrent receive on a single UDP socket is currently
relatively unusual, introducing read locking in the transmit path,
allowing concurrent receive and transmit, will significantly improve
performance for loads such as BIND, memcached, etc.
MFC after: 2 months
Tested by: gnn, kris, ps
in_ifaddrhashtbl in in_ifinit because the error handler in in_control removes
entries only for AF_INET addresses. If in_ifinit is called for a cloned
interface that has just been created, its address family is not AF_INET and
therefore LIST_REMOVE is not called for the respective LIST_INSERT_HEAD; the
freed entries remain in in_ifaddrhashtbl and lead to memory corruption.
PR: kern/124384
This is needed for correct behavior when packets are lost or reordered.
PR: kern/123950
Reviewed by: andre@, silby@
Reported by: Yahoo!, Wang Jin
MFC after: 1 week
- only one function to destroy an SCTP stack: sctp_finish()
- Make it so this function also arranges for any threads
created by the image to do a kthread_exit()
- Vimage prep - these are major restructures to move
all global variables to be accessed via a macro or two.
The variables all go into a single structure.
- Asconf address addition tweaks (add_or_del Interfaces)
- Fix rwnd calculation to be more conservative.
- Support SACK_IMMEDIATE flag to skip delayed sack
by demand of peer.
- Comment updates in the sack mapping calculations
- Invariants panic added.
- Pre-support for UDP tunneling (we can do this on
MAC but will need added support from UDP to
get a "pipe" of UDP packets in.
- clear trace buffer sysctl added when local tracing on.
Note the majority of this huge patch is all the vimage prep stuff :-)