Compute the payload checksum for a locally originated IP multicast where
God intended, in ip_mloopback(), rather than doing it in ip_output() and
only when multicast router is active. This is more correct as we do not
fool ip_input() that the packet has the correct payload checksum when in
fact it does not (when multicast router is inactive). This is also more
efficient if we don't join the multicast group we send to, thus allowing
the hardware to checksum the payload.
- Add gre_mtx to protect global softc list.
- Hold gre_mtx over various list operations (insert, delete).
- Centralize if_gre interface teardown in gre_destroy(), and call this
from modevent unload and gre_clone_destroy().
- Export gre_mtx to ip_gre.c, which walks the gre list to look up gre
interfaces during encapsulation. Add a wonking comment on how we need
some sort of drain/reference count mechanism to keep gre references
alive while in use and simultaneous destroy.
This commit does not lockdown softc data, which follows in a future
commit.
- Add encapmtx to protect ip_encap.c global variables (encapsulation
list).
- Unifdef #ifdef 0 pieces of encap_init() which was (and now really
is) basically a no-op.
- Lock encapmtx when walking encaptab, modifying it, comparing
entries, etc.
- Remove spl's.
Note that currently there's no facilite to make sure outstanding
use of encapsulation methods on a table entry have drained bfore
we allow a table entry to be removed. As such, it's currently the
caller's responsibility to make sure that draining takes place.
Reviewed by: mlaier
ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.
Noticed by: rwatson
Approved by: bms(mentor)
to NET_UNLOCK_GIANT(). While they are used in similar ways, the
semantics are quite different -- NET_LOCK_GIANT() and NET_UNLOCK_GIANT()
directly wrap mutex lock and unlock operations, whereas drop/pickup
special case the handling of Giant recursion. Add a comment saying
as much.
Add NET_ASSERT_GIANT(), which conditionally asserts Giant based
on the value of debug_mpsafenet.
This enables pf to track dynamic address changes on interfaces (dailup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.
Approved by: bms(mentor)
pf/pflog/pfsync as modules. Do not list them in NOTES or modules/Makefile
(i.e. do not connect it to any (automatic) builds - yet).
Approved by: bms(mentor)
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly. Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.
Approved by: jlemon (years ago, for a more invasive fix)
amount of segments it will hold.
The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:
net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).
net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).
net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.
net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.
Tested by: bms, silby
Reviewed by: bms, silby
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).
This is (mostly) work from: sam
Silence from: -arch
Approved by: bms(mentor), sam, rwatson
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
used for the ICMP reply source in reponse to packets which are not
directly addressed to us. By default continue with with normal
source selection.
Reviewed by: bms
at packet arrival.
For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.
This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
ifconfig(8) flag since header for version 2 is the same but IP payload
is prepended with additional 4-bytes field.
Inspired by: Roman Synyuk <roman@univ.kiev.ua>
MFC after: 2 weeks
recwin and sendwin. This removes a big source of confusion and makes
following the code much easier.
Reviewed by: sam (mentor)
Obtained from: DragonFlyBSD rev 1.6 (hsu)
Makes it possible to have multiple packet aliasing instances in a
single process by moving all static and global variables into an
instance structure called "struct libalias".
Redefine a new API based on s/PacketAlias/LibAlias/g
Add new "instance" argument to all functions in the new API.
Implement old API in terms of the new API.
tcp6_usr_bind(), tcp_usr_connect(), and tcp6_usr_connect() before checking
to see whether the address is multicast so that the proper errno value
will be returned if sa_len is incorrect. The checks are identical to the
ones in in_pcbbind_setup(), in6_pcbbind(), and in6_pcbladdr(), which are
called after the multicast address check passes.
MFC after: 30 days
resource exhaustion attacks.
For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.
The resource exhaustion works in two ways:
o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.
For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.
This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).
We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.
o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.
For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).
This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.
We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.
MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day
restore the general pre-randomid behaviour.
Setting the ip_id to zero causes several problems with
packet reassembly when a device along the path removes
the DF bit for some reason.
Other BSD and Linux have found and fixed the same issues.
PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)
rfc3042 Limited retransmit
rfc3390 Increasing TCP's initial congestion Window
inflight TCP inflight bandwidth limiting
All my production server have it enabled and there have been no
issues. I am confident about having them on by default and it gives
us better overall TCP performance.
Reviewed by: sam (mentor)
are acting as router (ipforwarding enabled).
This doesn't fix the problem that host routes from ICMP redirects
are never removed from the kernel routing table but removes the
problem for machines doing packet forwarding.
Reviewed by: sam (mentor)
if_gre.c rev.1.41-1.49
o Spell output with two ts.
o Remove assigned-to but not used variable.
o fix grammatical error in a diagnostic message.
o u_short -> u_int16_t.
o gi_len is ip_len, so it has to be network byteorder.
if_gre.h rev.1.11-1.13
o prototype must not have variable name.
o u_short -> u_int16_t.
o Spell address with two d's.
ip_gre.c rev.1.22-1.29
o KNF - return is not a function.
o The "osrc" variable in gre_mobile_input() is only ever set but not
referenced; remove it.
o correct (false) assumptions on mbuf chain. not sure if it really helps, but
anyways, it is necessary to perform m_pullup.
o correct arg to m_pullup (need to count IP header size as well).
o remove redundant adjustment of m->m_pkthdr.len.
o clear m_flags just for safety.
o tabify.
o u_short -> u_int16_t.
MFC after: 2 weeks
a new bpf_mtap2 routine that does the right thing for an mbuf
and a variable-length chunk of data that should be prepended.
o while we're sweeping the drivers, use u_int32_t uniformly when
when prepending the address family (several places were assuming
sizeof(int) was 4)
o return M_ASSERTVALID to BPF_MTAP* now that all stack-allocated
mbufs have been eliminated; this may better be moved to the bpf
routines
Reviewed by: arch@ and several others
otherwise they are initialized twice when the code is statically
configured in the kernel because the module load method gets
invoked before the user application calls ip_mrouter_init
o add a mutex to synchronize the module init/done operations; this
sort of was done using the value of ip_mroute but X_ip_mrouter_done
sets it to NULL very early on which can lead to a race against
ip_mrouter_init--using the additional mutex means this is safe now
o don't call ip_mrouter_reset from ip_mrouter_init; this now happens
once at module load and X_ip_mrouter_done does the appropriate
cleanup work to insure the data structures are in a consistent
state so that a subsequent init operation inherits good state
Reviewed by: juli
wait, rather than the socket label. This avoids reaching up to
the socket layer during connection close, which requires locking
changes. To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer(). Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf. Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Before committing the initial tcp_hostcache I changed them from memcpy()
to conform with FreeBSD style without realizing the difference in argument
definition.
This fixes hostcache operation for IPv6 (in general and explicitly IPv6
path mtu discovery) and T/TCP (RFC1644).
Submitted by: Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Approved by: re (rwatson)
code is compiled in to support the O_IPSEC operator. Previously no
support was included and ipsec rules were always matching. Note that
we do not return an error when an ipsec rule is added and the kernel
does not have IPsec support compiled in; this is done intentionally
but we may want to revisit this (document this in the man page).
PR: 58899
Submitted by: Bjoern A. Zeeb
Approved by: re (rwatson)
When the hostcache bucket limit is reached the last bucket wasn't
removed from the bucket row but inserted a few lines later at the
bucket row head again. This leads to infinite loop when the same
bucket row is accessed the next time for a lookup/insert or purge
action.
Tested by: imp, Matt Smith
Approved by: re (rwatson)
zeroed. Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.
Reviewed by: sam (mentor)
Approved by: re (scottl)
for ipfw processing w/o an indication the packets were generated
by ipfw--and so should not be processed (this manifested itself
as a LOR.) The flag bit in the mbuf that was used to mark the
packets was not listed in M_COPYFLAGS so if a packet had a header
prepended (as done by IPsec) the flag was lost. Correct this by
defining a new M_PROTO6 flag and use it to mark packets that need
this processing.
Reviewed by: bms
Approved by: re (rwatson)
MFC after: 2 weeks
rtalloc_ign() in in_pcbconnect_setup() before it is filled out.
Otherwise, stack junk would be left in sin_zero, which could
cause host routes to be ignored because they failed the comparison
in rn_match().
This should fix the wrong source address selection for connect() to
127.0.0.1, among other things.
Reviewed by: sam
Approved by: re (rwatson)
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.
It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.
tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.
It removes significant locking requirements from the tcp stack with
regard to the routing table.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
accordingly. The define is left intact for ABI compatibility
with userland.
This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
do not have mh_nextpkt initialized. Somtimes what's there is "1", and the
ip_input() code pukes trying to m_free() it, rendering divert sockets and
such broken.
This really underscores the need to get rid of MT_TAG.
Reviewed by: rwatson
and remove two unneccessary variable initializations.
Make the introduction comment more clear with regard which parts of
the packet are touched.
Requested by: luigi
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adopt all its callers to this and make ip_output() callable
with NULL rt pointer.
Reviewed by: sam (mentor)
Short description of ip_fastforward:
o adds full direct process-to-completion IPv4 forwarding code
o handles ip fragmentation incl. hw support (ip_flow did not)
o sends icmp needfrag to source if DF is set (ip_flow did not)
o supports ipfw and ipfilter (ip_flow did not)
o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
o returns anything it can't handle back to normal ip_input
Enable with sysctl -w net.inet.ip.fastforwarding=1
Reviewed by: sam (mentor)
o correct a read-lock assert in in_pcblookup_local that should be
a write-lock assert (since time wait close cleanups may alter state)
Supported by: FreeBSD Foundation
preemption two CPUs can be in the same function at the same time
and clobber each others variables. Remove register declaration
from local variables.
Reviewed by: sam (mentor)
This switch toggles between strict multicast delivery, and traditional
multicast delivery.
The traditional (default) behaviour is to deliver multicast datagrams to all
sockets which are members of that group, regardless of the network interface
where the datagrams were received.
The strict behaviour is to deliver multicast datagrams received on a
particular interface only to sockets whose membership is bound to that
interface.
Note that as a matter of course, multicast consumers specifying INADDR_ANY
for their interface get joined on the interface where the default route
happens to be bound. This switch has no effect if the interface which the
consumer specifies for IP_ADD_MEMBERSHIP is not UP and RUNNING.
The original patch has been cleaned up somewhat from that submitted. It has
been tested on a multihomed machine with multiple QuickTime RTP streams
running over the local switch, which doesn't do IGMP snooping.
PR: kern/58359
Submitted by: William A. Carrel
Reviewed by: rwatson
MFC after: 1 week
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.
This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.
While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.
NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.
Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
it is marked as RTF_UP. This appears to fix a crash that was sometimes
triggered when dhclient(8) tried to send a packet after an interface
had been detatched.
Reviewed by: sam
o pickup Giant in divert_packet to protect sbappendaddr since it
can be entered through MPSAFE callouts or through ip_input when
mpsafenet is 1
o add missing locking on output
o add locking to abort and shutdown
o add a ctlinput handler to invalidate held routing table references
on an ICMP redirect (may not be needed)
Supported by: FreeBSD Foundation
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond
Supported by: FreeBSD Foundation
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive
Supported by: FreeBSD Foundation
has a LOR between IPFW inpcb locks but I'm committing it now as the
lesser of two evils (the other being unlocked use of in_pcblookup).
Supported by: FreeBSD Foundation
to a routing table entry w/o bumping the reference count or locking
against the entry being free'd. This caused major havoc (for some
reason it appeared most frequently for folks running natd). Fix
is to bump the reference count whenever we copy the route cache
contents into a private copy so the entry cannot be reclaimed out
from under us. This is a short term fix as the forthcoming routing
table changes will eliminate this cache entirely.
Supported by: FreeBSD Foundation
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.
Tested by: nork
Obtained from: KAME
header copy made on input path: this is now handled differently.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
directly on the radix tree and does not hold any routing table refernces.
This fixes the reference counting problems that manifested itself as a
panic during unmount of filesystems that were mounted by NFS over an
interface that had been removed.
Supported by: FreeBSD Foundation
previously only considered the send sequence space. Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.
The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.
- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.
This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.
This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.
Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).
Supported by: FreeBSD Foundation
the module. Previously we grabbed the mutex used by the callouts,
then stopped the callout with callout_stop, but if the callout
was already active and blocked by the mutex then it would continue
later and reference the mutex after it was destroyed. Instead
stop the callout first then lock.
Supported by: FreeBSD Foundation
o add some more debugging help for figuring out why folks are
getting complaints about releasing routing table entries with
a zero refcnt
o fix comment that talked about spl's
o remove duplicate define of DUMMYNET_DEBUG
Supported by: FreeBSD Foundation
- implement the tunnel egress rule in ip_ecn_egress() in ip_ecn.c.
make ip{,6}_ecn_egress() return integer to tell the caller that
this packet should be dropped.
- handle ECN at fragment reassembly in ip_input.c and frag6.c.
Obtained from: KAME
with an mbuf until it is reclaimed. This is in contrast to tags that
vanish when an mbuf chain passes through an interface. Persistent tags
are used, for example, by MAC labels.
Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp. This fixes problems
with "tag leakage".
Pointed out by: Jonathan Stone
Reviewed by: Robert Watson
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility againt existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.
Obtained from: KAME
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.
No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.
This should reduce the need for sysadmins to lower the default msl on
busy servers.
when loaded as a module
o cleanup data structures on module unload when no application has
been started (i.e. kldload, kldunload w/o mrtd)
o remove extraneous unlocks immediately prior to destroying them
Supported by: FreeBSD Foundation
address has been changed when PFIL_HOOKS is enabled and, if it has,
arrange for the proper action by ip*_forward.
Supported by: FreeBSD Foundation
Submitted by: Pyun YongHyeon
trashed after being freed. This has caused several panics including
kern/42277 related to soft updates. Jim Kuhn tracked the problem
down to ipfw limit rule processing. In the expiry of dynamic rules,
it is possible for an O_LIMIT_PARENT rule to be removed when it still
has live children. When the children eventually do expire, a pointer
to the (long gone) parent is dereferenced and a count decremented.
Since this memory can, and is, allocated for other purposes (in the
case of kern/42277 an inodedep structure), chaos ensues. The offset
in question in inodedep is the offset of the 16 bit count field in
the ipfw2 ipfw_dyn_rule.
Submitted by: Jim Kuhn <jkuhn@sandvine.com>
Reviewed by: "Evgueni V. Gavrilov" <aquatique@rusunix.org>
Reviewed by: Ben Pfountz <netprince@vt.edu>
MFC after: 1 week
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.
Other/related changes:
o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts
Notes:
1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)
RTF_STATIC routes. Do not check for RTF_HOST so as to avoid being DoSed
when an RTF_GENMASK route exists in the table.
Add a more verbose comment about exactly what this code does.
Submitted by: ru
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules
Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)
attached network could exhaust kernel memory, and cause a system
panic, by sending a flood of spoofed ARP requests.
Approved by: jake (mentor)
Reported by: Apple Product Security <product-security@apple.com>
Skinny is the protocol used by Cisco IP phones to talk to Cisco Call
Managers. With this code, one can use a Cisco IP phone behind a FreeBSD
NAT gateway.
Currently, having the Call Manager behind the NAT gateway is not supported.
More information on enabling Skinny support in libalias, natd, and ppp
can be found in those applications' manpages.
PR: 55843
Reviewed by: ru
Approved by: ru
MFC after: 30 days
sending an ICMP packet doesn't cause a panic. A better solution is needed;
possibly defering the transmit to a dedicated thread.
Observed by: "Aaron Wohl" <freebsd@soith.com>
o change timeout to MPSAFE callout
o restructure rule deletion to deal with locking requirements
o replace static buffer used for ipfw control operations with malloc'd storage
Sponsored by: FreeBSD Foundation
o change time to MPSAFE callout
o make debug printfs conditional on DUMMYNET_DEBUG and runtime controllable
by net.inet.ip.dummynet.debug
o make boot-time printf dependent on bootverbose
Sponsored by: FreeBSD Foundation
Special thanks to Pavlin Radoslavov <pavlin@icir.org> for testing and
fixing numerous problems.
Sponsored by: FreeBSD Foundation
Reviewed by: Pavlin Radoslavov <pavlin@icir.org>
Changes from the original implementation:
- Fragmentation is handled by the function m_fragment, which can
be called from whereever fragmentation is needed. Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.
- m_fragment works slightly differently from the old fragmentation
code in that it allocates a seperate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments. While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.
- Add two modes of random fragmentation. Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.
NB: There is a known LOR on the forwarding path; this needs to be resolved
together with a similar issue in the bridge. For the moment it is
believed to be benign.
Sponsored by: FreeBSD Fondation
returned mbuf can be NULL. Check for NULL in rip_output() when
prepending an IP header. This prevents mbuf exhaustion from
causing a local kernel panic when sending raw IP packets.
PR: kern/55886
Reported by: Pawel Malachowski <pawmal-posting@freebsd.lublin.pl>
MFC after: 3 days
mac_reflect_mbuf_icmp()
mac_reflect_mbuf_tcp()
These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket. For example, in respond to a ping or a RST
packet to a SYN on a closed port.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
from queue(3).
Improve vertical compactness by using a IGMP_PRINTF() macro rather
than #ifdefing IGMP_DEBUG a large number of debugging printfs.
Reviewed by: mdodd (SLIST changes)
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.
Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)
of bw_meter entries were processed up to one second ahead.
After an unappropriate rescheduling of some of the bw_meter
entries, the upcalls weren't delivered.
* pim_register_prepare() uses the appropriate sw_csum flag to
call ip_fragment() so the IP checksum is computed properly.
* Modify pim_register_prepare() to take care of IP packets that
don't need fragmentation.
* Add-back in_delayed_cksum() to encap_send(), because it seems it
should be there.
Submitted by: Pavlin Radoslavov <pavlin@icir.org>