from the incoming SYN handling section of tcp_input().
Enforcement of the accept queue limits is done by sonewconn() after the
3WHS is completed. It is not necessary to have an earlier check before a
connection request enters the SYN cache awaiting the full handshake. It
rather limits the effectiveness of the syncache by preventing legit and
illegit connections from entering it and having them shaken out before we
hit the real limit which may have vanished by then.
Change return value of syncache_add() to void. No status communication
is required.
when the ACK is invalid and doesn't belong to any registered connection,
either in syncache or through SYN cookies. True but a NULL struct socket
is returned when the 3WHS completed but the socket could not be created
due to insufficient resources or limits reached.
For both cases an RST is sent back in tcp_input().
A logic error leading to a panic is fixed where syncache_expand() would
free the mbuf on socket allocation failure but tcp_input() later supplies
it to tcp_dropwithreset() to issue a RST to the peer.
Reported by: kris (the panic)
to free the oldest entry in the current bucket row. The global
entry limit may be smaller than the bucket rows and their limit
combined however. Thus only try to free a syncache entry if we
found one in this bucket row.
Reported by: kris
directly to a merged model where only one callout, the next to fire,
is registered.
Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.
The single new callout is a mutex callout on inpcb simplifying the
locking a bit.
tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.
Reviewed by: rwatson (earlier version)
tcp_input() to syncache_socket() where it belongs and the majority
of it already happens.
The "tp->snd_up = tp->snd_una" is removed as it is done with the
tcp_sendseqinit() macro a few lines earlier.
- *ip is not initialized in the case of inet6 connection, but ip->ip_len is
being changed anyway
Now the question is, why does it think an ipv4 connection is an ipv6 connection?
xemacs still doesn't work over X11 forwarding, but the kernel no longer panics.
and syncache_respond() into its own generic function tcp_addoptions().
tcp_addoptions() is alignment agnostic and does optimal packing in all cases.
In struct tcpopt rename to_requested_s_scale to just to_wscale.
Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."
Reviewed by: silby, mohans, julian
Sponsored by: TCP/IP Optimization Fundraise 2005
upper-bounding it to the size of the initial socket buffer lower-bound it
to the smallest MSS we accept. Ideally we'd use the actual MSS information
here but it is not available yet.
For socket buffer auto sizing to be effective we need room to grow the
receive window. The window scale shift is determined at connection setup
and can't be changed afterwards. The previous, original, method effectively
just did a power of two roundup of the socket buffer size at connection
setup severely limiting the headroom for larger socket buffers.
Tested by: many (as part of the socket buffer auto sizing patch)
MFC after: 1 month
kernel. This LOR snuck in with some of the recent syncache changes. To
fix this, the inpcb handling was changed:
- Hang a MAC label off the syncache object
- When the syncache entry is initially created, we pickup the PCB lock
is held because we extract information from it while initializing the
syncache entry. While we do this, copy the MAC label associated with
the PCB and use it for the syncache entry.
- When the packet is transmitted, copy the label from the syncache entry
to the mbuf so it can be processed by security policies which analyze
mbuf labels.
This change required that the MAC framework be extended to support the
label copy operations from the PCB to the syncache entry, and then from
the syncache entry to the mbuf.
These functions really should be referencing the syncache structure instead
of the label. However, due to some of the complexities associated with
exposing this syncache structure we operate directly on it's label pointer.
This should be OK since we aren't making any access control decisions within
this code directly, we are merely allocating and copying label storage so
we can properly initialize mbuf labels for any packets the syncache code
might create.
This also has a nice side effect of caching. Prior to this change, the
PCB would be looked up/locked for each packet transmitted. Now the label
is cached at the time the syncache entry is initialized.
Submitted by: andre [1]
Discussed with: rwatson
[1] andre submitted the tcp_syncache.c changes
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
functionality:
- Remove a rwlock aquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for and stored per syncache buck row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.
This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.
The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.
A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.
Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005
o don't assign remote/local host/port information manually between provided
struct in_conninfo and struct syncache, bcopy() it instead
o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture
the purpose of this field
o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
o fix IPSEC error case printf's to report correct function name
o in syncache_socket() only transpose enhanced tcp options parameters to
struct tcpcb when the inpcb doesn't has TF_NOOPT set
o in syncache_respond() reorder stack variables
o in syncache_respond() remove bogus KASSERT()
No functional changes.
Sponsored by: TCP/IP Optimization Fundraise 2005
memory location for already existing/initialized mutexes. With random
data in the memory location this fails (ie. after a soft reboot).
Reported by: brueffer, YAMAMOTO Shigeru
Submitted by: YAMAMOTO Shigeru <shigeru-at-iij.ad.jp>
as possible for the syncache_add() case. The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.
On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.
Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).
Further changes to this code to come. This is a first incremental step
to a general overhaul and streamlining of the TCP code.
PR: kern/15095
PR: kern/92690 (partly)
Reviewed by: qingli (and tested with ANVL)
Sponsored by: TCP/IP Optimization Fundraise 2005
in syncache_lookup() is not cleared and may lead to an arbitrary and
bogus rtentry pointer which later gets free'd.
Reviewed by: andre
MFC after: 3 days
that currently can't be triggered. But better be safe than sorry
later on. Additionally it properly silences Coverity Prevent for
future tests.
Found by: Coverity Prevent(tm)
Coverity ID: CID802
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days
include ip_options.h into all files making use of IP Options functions.
From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)
From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)
No functional changes in this commit.
Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.
Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.
Sponsored by: TCP/IP Optimization Fundraise 2005
If TCP Signatures are enabled, the maximum allowed sack blocks aren't
going to fit. The fix is to compute how many sack blocks fit and tack
these on last. Also on SYNs, defer padding until after the SACK
PERMITTED option has been added.
Found by: Mohan Srinivasan.
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.
- If the peer sends the Signature option in the SYN, use of Timestamps
and Window Scaling were disabled (even if the peer supports them).
- The sender must not disable signatures if the option is absent in
the received SYN. (See comment in syncache_add()).
Found, Submitted by: Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>.
Reviewed by: Mohan Srinivasan <mohans at yahoo-inc dot com>.
A complete rationale and discussion is given in this message
and the resulting discussion:
http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706
Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.
Discussed on: -arch
it travels through the IP stack. This wasn't much of a problem because IP
source routing is disabled by default but when enabled together with SMP and
preemption it would have very likely cross-corrupted the IP options in transit.
The IP source route options of a packet are now stored in a mtag instead of the
global variable.
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs
This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.
Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>
have already done this, so I have styled the patch on their work:
1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.
2) named the sysctl net.inet.ip.random_id
3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.
The sysctl defaults to 0 (sequential IP IDs).
Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months
for structures with timers in them. It might be that a timer might fire
even when the associated structure has already been free'd. Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.
Discussed with: bosko, tegge, rwatson
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabeled) and by stashing the flowlabel in syncache.
Tested and Discovered by: Orla McGann <orly@cnri.dit.ie>
Approved by: ume, silby
MFC after: 1 month
originated on RELENG_4 and was ported to -CURRENT.
The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.
You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.
Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
SOCK_LOCK(so):
- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.
- Assert socket lock in MAC entry point implementations.
- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.
labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed. In
particular:
- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().
While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.
It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.
tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.
It removes significant locking requirements from the tcp stack with
regard to the routing table.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.
Tested by: nork
Obtained from: KAME
segments are lost for the application. This broke, for example,
ports/benchmarks/dbs which needs the SYN segment to filter the
contents of the trace buffer for the connection it is interested in.
This patch makes the SYN segments available again. Unfortunately they
are now associated with the listening socket instead of the new one, so
a change to applications is required, but without this patch it wouldn't
work altogether.
PR: kern/45966
Security improvements:
- Increase the size of each syncookie secret from 32 to 128 bits
in order to make brute force attacks on the secrets much more
difficult.
- Always return the lowest order dword from the MD5 hash; this
allows us to expose 2 more bits of the cookie and makes ACK
floods which seek to guess the cookie value more difficult.
Performance improvements:
- Increase the lifetime of each syncookie from 4 seconds to 16
seconds. This increases the usefulness of syncookies during
an attack.
- From Yahoo!: Reduce the number of calls to MD5Update; this
results in a ~17% increase in cookie generation time here.
Reviewed by: hsu, jayanth, jlemon, nectar
MFC After: 15 seconds
initialized until after a syncookie was generated. As a result,
all connections resulting from a returned cookie would end up using
a MSS of ~512 bytes. Now larger packets will be used where possible.
MFC after: 5 days
associated with the syncache entry: in case tcp_close() has been
called on the corresponding listening socket, the lock has been
destroyed as a side effect of in_pcbdetach(), causing a panic when
we attempt to lock on it.
Reviewed by: hsu
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).
As noted previously, don't use FAST_IPSEC with INET6 at the moment.
Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks
o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version
Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month
kernel access control.
Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
syncache_respond(A), ip_output(), ip_input(), tcp_input(), syncache_badack(B)
Which winds up deleting a different entry from the syncache. Handle
this by not utilizing the next entry in the timer chain until after
syncache_respond() completes. The case of A == B should not be possible.
Problem found by: Don Bowman <don@sandvine.com>
Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.
PR: 39141
MFC after: 2 weeks
This time, make sure that ipv4 specific code (aka all of the above)
is only run in the ipv4 case.
results in the syncache entry being turned into a socket. While it's
not used in the main tree, this is required in the MAC tree so that
labels can be propagated from the mbuf to the socket. This is also
useful if you're doing things like transparent IP connection hijacking
and you want to use the syncache/cookie mechanism, but we won't go
there.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
All TCP ISNs that are sent out are valid cookies, which allows entries
in the syncache to be dropped and still have the ACK accepted later.
As all entries pass through the syncache, there is no sudden switchover
from cache -> cookies when the cache is full; instead, syncache entries
simply have a reduced lifetime. More details may be found in the
"Resisting DoS attacks with a SYN cache" paper in the Usenix BSDCon 2002
conference proceedings.
Sponsored by: DARPA, NAI Labs
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.
Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).
Some cleanup. Identify additonal issues in comments.
MFC after: 1 day
to be followed by nfsnodehashtbl, so bzeroing callouts beyond the end of
tcp_syncache soon caused a null pointer panic when nfsnodehashtbl was
accessed.