do not have mh_nextpkt initialized. Somtimes what's there is "1", and the
ip_input() code pukes trying to m_free() it, rendering divert sockets and
such broken.
This really underscores the need to get rid of MT_TAG.
Reviewed by: rwatson
and remove two unneccessary variable initializations.
Make the introduction comment more clear with regard which parts of
the packet are touched.
Requested by: luigi
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adopt all its callers to this and make ip_output() callable
with NULL rt pointer.
Reviewed by: sam (mentor)
Short description of ip_fastforward:
o adds full direct process-to-completion IPv4 forwarding code
o handles ip fragmentation incl. hw support (ip_flow did not)
o sends icmp needfrag to source if DF is set (ip_flow did not)
o supports ipfw and ipfilter (ip_flow did not)
o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
o returns anything it can't handle back to normal ip_input
Enable with sysctl -w net.inet.ip.fastforwarding=1
Reviewed by: sam (mentor)
o correct a read-lock assert in in_pcblookup_local that should be
a write-lock assert (since time wait close cleanups may alter state)
Supported by: FreeBSD Foundation
preemption two CPUs can be in the same function at the same time
and clobber each others variables. Remove register declaration
from local variables.
Reviewed by: sam (mentor)
This switch toggles between strict multicast delivery, and traditional
multicast delivery.
The traditional (default) behaviour is to deliver multicast datagrams to all
sockets which are members of that group, regardless of the network interface
where the datagrams were received.
The strict behaviour is to deliver multicast datagrams received on a
particular interface only to sockets whose membership is bound to that
interface.
Note that as a matter of course, multicast consumers specifying INADDR_ANY
for their interface get joined on the interface where the default route
happens to be bound. This switch has no effect if the interface which the
consumer specifies for IP_ADD_MEMBERSHIP is not UP and RUNNING.
The original patch has been cleaned up somewhat from that submitted. It has
been tested on a multihomed machine with multiple QuickTime RTP streams
running over the local switch, which doesn't do IGMP snooping.
PR: kern/58359
Submitted by: William A. Carrel
Reviewed by: rwatson
MFC after: 1 week
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.
This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.
While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.
NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.
Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
it is marked as RTF_UP. This appears to fix a crash that was sometimes
triggered when dhclient(8) tried to send a packet after an interface
had been detatched.
Reviewed by: sam
o pickup Giant in divert_packet to protect sbappendaddr since it
can be entered through MPSAFE callouts or through ip_input when
mpsafenet is 1
o add missing locking on output
o add locking to abort and shutdown
o add a ctlinput handler to invalidate held routing table references
on an ICMP redirect (may not be needed)
Supported by: FreeBSD Foundation
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond
Supported by: FreeBSD Foundation
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive
Supported by: FreeBSD Foundation
has a LOR between IPFW inpcb locks but I'm committing it now as the
lesser of two evils (the other being unlocked use of in_pcblookup).
Supported by: FreeBSD Foundation
to a routing table entry w/o bumping the reference count or locking
against the entry being free'd. This caused major havoc (for some
reason it appeared most frequently for folks running natd). Fix
is to bump the reference count whenever we copy the route cache
contents into a private copy so the entry cannot be reclaimed out
from under us. This is a short term fix as the forthcoming routing
table changes will eliminate this cache entirely.
Supported by: FreeBSD Foundation
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.
Tested by: nork
Obtained from: KAME
header copy made on input path: this is now handled differently.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
directly on the radix tree and does not hold any routing table refernces.
This fixes the reference counting problems that manifested itself as a
panic during unmount of filesystems that were mounted by NFS over an
interface that had been removed.
Supported by: FreeBSD Foundation
previously only considered the send sequence space. Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.
The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.
- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.
This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.
This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.
Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).
Supported by: FreeBSD Foundation
the module. Previously we grabbed the mutex used by the callouts,
then stopped the callout with callout_stop, but if the callout
was already active and blocked by the mutex then it would continue
later and reference the mutex after it was destroyed. Instead
stop the callout first then lock.
Supported by: FreeBSD Foundation
o add some more debugging help for figuring out why folks are
getting complaints about releasing routing table entries with
a zero refcnt
o fix comment that talked about spl's
o remove duplicate define of DUMMYNET_DEBUG
Supported by: FreeBSD Foundation
- implement the tunnel egress rule in ip_ecn_egress() in ip_ecn.c.
make ip{,6}_ecn_egress() return integer to tell the caller that
this packet should be dropped.
- handle ECN at fragment reassembly in ip_input.c and frag6.c.
Obtained from: KAME
with an mbuf until it is reclaimed. This is in contrast to tags that
vanish when an mbuf chain passes through an interface. Persistent tags
are used, for example, by MAC labels.
Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp. This fixes problems
with "tag leakage".
Pointed out by: Jonathan Stone
Reviewed by: Robert Watson
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility againt existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.
Obtained from: KAME
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.
No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.
This should reduce the need for sysadmins to lower the default msl on
busy servers.
when loaded as a module
o cleanup data structures on module unload when no application has
been started (i.e. kldload, kldunload w/o mrtd)
o remove extraneous unlocks immediately prior to destroying them
Supported by: FreeBSD Foundation
address has been changed when PFIL_HOOKS is enabled and, if it has,
arrange for the proper action by ip*_forward.
Supported by: FreeBSD Foundation
Submitted by: Pyun YongHyeon
trashed after being freed. This has caused several panics including
kern/42277 related to soft updates. Jim Kuhn tracked the problem
down to ipfw limit rule processing. In the expiry of dynamic rules,
it is possible for an O_LIMIT_PARENT rule to be removed when it still
has live children. When the children eventually do expire, a pointer
to the (long gone) parent is dereferenced and a count decremented.
Since this memory can, and is, allocated for other purposes (in the
case of kern/42277 an inodedep structure), chaos ensues. The offset
in question in inodedep is the offset of the 16 bit count field in
the ipfw2 ipfw_dyn_rule.
Submitted by: Jim Kuhn <jkuhn@sandvine.com>
Reviewed by: "Evgueni V. Gavrilov" <aquatique@rusunix.org>
Reviewed by: Ben Pfountz <netprince@vt.edu>
MFC after: 1 week
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.
Other/related changes:
o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts
Notes:
1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)
RTF_STATIC routes. Do not check for RTF_HOST so as to avoid being DoSed
when an RTF_GENMASK route exists in the table.
Add a more verbose comment about exactly what this code does.
Submitted by: ru
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules
Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)
attached network could exhaust kernel memory, and cause a system
panic, by sending a flood of spoofed ARP requests.
Approved by: jake (mentor)
Reported by: Apple Product Security <product-security@apple.com>
Skinny is the protocol used by Cisco IP phones to talk to Cisco Call
Managers. With this code, one can use a Cisco IP phone behind a FreeBSD
NAT gateway.
Currently, having the Call Manager behind the NAT gateway is not supported.
More information on enabling Skinny support in libalias, natd, and ppp
can be found in those applications' manpages.
PR: 55843
Reviewed by: ru
Approved by: ru
MFC after: 30 days
sending an ICMP packet doesn't cause a panic. A better solution is needed;
possibly defering the transmit to a dedicated thread.
Observed by: "Aaron Wohl" <freebsd@soith.com>
o change timeout to MPSAFE callout
o restructure rule deletion to deal with locking requirements
o replace static buffer used for ipfw control operations with malloc'd storage
Sponsored by: FreeBSD Foundation
o change time to MPSAFE callout
o make debug printfs conditional on DUMMYNET_DEBUG and runtime controllable
by net.inet.ip.dummynet.debug
o make boot-time printf dependent on bootverbose
Sponsored by: FreeBSD Foundation
Special thanks to Pavlin Radoslavov <pavlin@icir.org> for testing and
fixing numerous problems.
Sponsored by: FreeBSD Foundation
Reviewed by: Pavlin Radoslavov <pavlin@icir.org>
Changes from the original implementation:
- Fragmentation is handled by the function m_fragment, which can
be called from whereever fragmentation is needed. Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.
- m_fragment works slightly differently from the old fragmentation
code in that it allocates a seperate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments. While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.
- Add two modes of random fragmentation. Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.
NB: There is a known LOR on the forwarding path; this needs to be resolved
together with a similar issue in the bridge. For the moment it is
believed to be benign.
Sponsored by: FreeBSD Fondation
returned mbuf can be NULL. Check for NULL in rip_output() when
prepending an IP header. This prevents mbuf exhaustion from
causing a local kernel panic when sending raw IP packets.
PR: kern/55886
Reported by: Pawel Malachowski <pawmal-posting@freebsd.lublin.pl>
MFC after: 3 days
mac_reflect_mbuf_icmp()
mac_reflect_mbuf_tcp()
These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket. For example, in respond to a ping or a RST
packet to a SYN on a closed port.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
from queue(3).
Improve vertical compactness by using a IGMP_PRINTF() macro rather
than #ifdefing IGMP_DEBUG a large number of debugging printfs.
Reviewed by: mdodd (SLIST changes)
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.
Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)
of bw_meter entries were processed up to one second ahead.
After an unappropriate rescheduling of some of the bw_meter
entries, the upcalls weren't delivered.
* pim_register_prepare() uses the appropriate sw_csum flag to
call ip_fragment() so the IP checksum is computed properly.
* Modify pim_register_prepare() to take care of IP packets that
don't need fragmentation.
* Add-back in_delayed_cksum() to encap_send(), because it seems it
should be there.
Submitted by: Pavlin Radoslavov <pavlin@icir.org>
binaries in /bin and /sbin installed in /lib. Only the versioned files
reside in /lib, the .so symlink continues to live /usr/lib so the
toolchain doesn't need to be modified.
segments are lost for the application. This broke, for example,
ports/benchmarks/dbs which needs the SYN segment to filter the
contents of the trace buffer for the connection it is interested in.
This patch makes the SYN segments available again. Unfortunately they
are now associated with the listening socket instead of the new one, so
a change to applications is required, but without this patch it wouldn't
work altogether.
PR: kern/45966
code has rotten a bit so that the header length is not correct at
the point when tcp_trace is called. Temporarily compute the correct
value before the call and restore the old value after. This makes
ports/benchmarks/dbs to almost work.
This is a NOP unless you compile with TCPDEBUG.
in tcp_input that leave the function before hitting the tcp_trace
function call for the TCPDEBUG option. This has made TCPDEBUG mostly
useless (and tools like ports/benchmarks/dbs not working). Add
tcp_trace calls to the return paths that could be identified in this
maze.
This is a NOP unless you compile with TCPDEBUG.
new ATMIOCOPENVCC/CLOSEVCC. This allows us to not only use UBR channels
for IP over ATM, but also CBR, VBR and ABR. Change the format of the
link layer address to specify the channel characteristics. The old
format is still supported and opens UBR channels.
Disabled by default. To enable it, the new "options PIM" must be
added to the kernel configuration file (in addition to MROUTING):
options MROUTING # Multicast routing
options PIM # Protocol Independent Multicast
2. Add support for advanced multicast API setup/configuration and
extensibility.
3. Add support for kernel-level PIM Register encapsulation.
Disabled by default. Can be enabled by the advanced multicast API.
4. Implement a mechanism for "multicast bandwidth monitoring and upcalls".
Submitted by: Pavlin Radoslavov <pavlin@icir.org>
This change allows one to specify almost the complete traffic parameters
for IPoverATM channels through the routing table. Up to now we used
4 byte DL addresses (flag, vpi, vciH, vciL). This format is still allowed.
If the address is longer, however, the 5th byte is interpreted as the
traffic class (UBR, CBR, VBR or ABR) and the remaining bytes are the
parameters for this traffic class:
UBR: 0 byte or 3 byte PCR
CBR: 3 byte PCR
VBR: 3 byte PCR, 3 byte SCR, 3 byte MBS
ABR: 3 byte PCR, 3 byte MCR, 3 byte ICR, 3 byte TBE, 1 byte NRM,
1 byte TRM, 2 bytes ADTF, 1 byte RIF, 1 byte RDF and 1 byte CDF
A script to generate the corresponding 'route add' arguments will follow soon.
sysctl:
- sysctlbyname("net.inet.ip.mfctable", ...)
- sysctlbyname("net.inet.ip.viftable", ...)
This change is needed so netstat can use sysctlbyname() to read
the data from those tables.
Otherwise, in some cases "netstat -g" may fail to report the
multicast forwarding information (e.g., if we run a multicast
router on PicoBSD).
* Bug fix: when sending IGMPMSG_WRONGVIF upcall to the multicast
routing daemon, set properly "im->im_vif" to the receiving
incoming interface of the packet that triggered that upcall
rather than to the expected incoming interface of that packet.
* Bug fix: add missing increment of counter "mrtstat.mrts_upcalls"
* Few formatting nits (e.g., replace extra spaces with TABs)
Submitted by: Pavlin Radoslavov <pavlin@icir.org>
code used to call rtrequest(RTM_DELETE, ...). This is a problem, because
the function that just has called us (route_output)
is not really happy with the route it just is creating beeing ripped out
from under it. Unfortunately we also cannot return an error from
ifa_rtrequest. Therefore mark the route just as RTF_REJECT.
check for raw IP system management operations is often (although
not always) implicit due to the namespacing of raw IP sockets. I.e.,
you have to have privilege to get a raw IP socket, so much of the
management code sitting on raw IP sockets assumes that any requests
on the socket should be granted privilege.
Obtained from: TrustedBSD Project
Product of: France
Set 31 is still special because rules belonging to it are not deleted
by the "ipfw flush" command, but must be deleted explicitly with
"ipfw delete set 31" or by individual rule numbers.
This implement a flexible form of "persistent rules" which you might
want to have available even after an "ipfw flush".
Note that this change does not violate POLA, because you could not
use set 31 in a ruleset before this change.
sbin/ipfw changes to allow manipulation of set 31 will follow shortly.
Suggested by: Paul Richards
lastest rev of the spec. Use an explicit flag for Fast Recovery. [1]
Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]
Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>,
Sally Floyd <floyd@acm.org> [1]
Since we already had 'O_NOP' instructions which always match, all
I needed to do is allow the NOP command to have arbitrary length
(i.e. move its label in a different part of the switch() which
validates instructions).
The kernel must know nothing about comments, everything else is
done in userland (which will be described in the upcoming ipfw2.c
commit).
support matching a list of addr/mask pairs so one can write
more efficient rulesets which were not possible before e.g.
add 100 skipto 1000 not src-ip 10.0.0.0/8,127.0.0.1/8,192.168.0.0/16
The change is fully backward compatible.
ipfw2 and manpage commit to follow.
MFC after: 3 days
Should work with both regular and fast ipsec (mutually exclusive).
See manpage for more details.
Submitted by: Ari Suutari (ari.suutari@syncrontech.com)
Revised by: sam
MFC after: 1 week
"ipid" options. This feature has been requested by several users.
On passing, fix some minor bugs in the parser. This change is fully
backward compatible so if you have an old /sbin/ipfw and a new
kernel you are not in trouble (but you need to update /sbin/ipfw
if you want to use the new features).
Document the changes in the manpage.
Now you can write things like
ipfw add skipto 1000 iplen 0-500
which some people were asking to give preferential treatment to
short packets.
The 'MFC after' is just set as a reminder, because I still need
to merge the Alpha/Sparc64 fixes for ipfw2 (which unfortunately
change the size of certain kernel structures; not that it matters
a lot since ipfw2 is entirely optional and not the default...)
PR: bin/48015
MFC after: 1 week
this makes connect act more sensibly in these cases.
PR: 50839
Submitted by: Barney Wolff <barney@pit.databus.com>
Patch delayed by laziness of: silby
MFC after: 1 week
is not called, and no static rules match an outgoing packet, the
latter retains its source IP address. This is in support of the
"static NAT only" mode.
only meaningful for fragments. Also don't bother to byte-swap the
ip_id when we do generate it; it is only used at the receiver as a
nonce. I tried several different permutations of this code with no
measurable difference to each other or to the unmodified version, so
I've settled on the one for which gcc seems to generate the best code.
(If anyone cares to microoptimize this differently for an architecture
where it actually matters, feel free.)
Suggested by: Steve Bellovin's paper in IMW'02
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.
Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.
Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
copying for mbuf headers now works properly in m_dup_pkthdr(), so
we don't need to do an explicit copy.
Approved by: re (jhb)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
state. Those changed attempted to work around the changed invariant
that inp->in_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->in_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use. Attempt to clarify and correct the logic.
Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system. We'll need to address
this at some point in the future.
Pointed out by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
of asserting that an mbuf has a packet header. Use it instead of hand-
rolled versions wherever applicable.
Submitted by: Hiten Pandya <hiten@unixdaemons.com>
doing Limited Transmit. Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.
Approved in principle by: Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
(See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt)
This fulfills the host requirements for userland support by
way of the setsockopt() IP_EVIL_INTENT message.
There are three sysctl tunables provided to govern system behavior.
net.inet.ip.rfc3514:
Enables support for rfc3514. As this is an
Informational RFC and support is not yet widespread
this option is disabled by default.
net.inet.ip.hear_no_evil
If set the host will discard all received evil packets.
net.inet.ip.speak_no_evil
If set the host will discard all transmitted evil packets.
The IP statistics counter 'ips_evil' (available via 'netstat') provides
information on the number of 'evil' packets recieved.
For reference, the '-E' option to 'ping' has been provided to demonstrate
and test the implementation.
Quote from kern/37573:
There is an obvious race in netinet/ip_dummynet.c:config_pipe().
Interrupts are not blocked when changing the params of an
existing pipe. The specific crash observed:
... -> config_pipe -> set_fs_parms -> config_red
malloc a new w_q_lookup table but take an interrupt before
intializing it, interrupt handler does:
... -> dummynet_io -> red_drops
red_drops dereferences the uninitialized (zeroed) w_q_lookup
table.
o Flush accumulated credits for idle pipes.
o Flush accumulated credits when change pipe characteristics.
o Change dn_flow_queue.numbytes type to unsigned long.
Overlapping dn_flow_queue->numbytes in ready_event() leads to
numbytes becomes negative and SET_TICKS() macro returns a very
big value. heap_insert() overlaps dn_key again and inserts a
queue to a ready heap with a sched_time points to the past.
That leads to an "infinity" loop.
PR: kern/33234, kern/37573, misc/42459, kern/43133,
kern/44045, kern/48099
Submitted by: Mike Hibler <mike@cs.utah.edu> (kern/37573)
MFC after: 6 weeks
additional flags argument to indicate blocking disposition, and
pass in M_NOWAIT from the IP reassembly code to indicate that
blocking is not OK when labeling a new IP fragment reassembly
queue. This should eliminate some of the WITNESS warnings that
have started popping up since fine-grained IP stack locking
started going in; if memory allocation fails, the creation of
the fragment queue will be aborted.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
- Don't try to fragment the packet if it's smaller than mbuf_frag_size.
- Preserve the size of the mbuf chain which is modified by m_split().
- Check that m_split() didn't return NULL.
- Make it so we don't end up with two M_PKTHDR mbuf in the chain.
- Use m->m_pkthdr.len instead of m->m_len so that we fragment the whole
chain and not just the first mbuf.
- Fix a nearby style bug and rework the logic of the loops so that it's
more clear.
This is still not quite right, because we're clearly abusing m_split() to
do something it was not designed for, but at least it works now. We
should probably move this code into a m_fragment() function when it's
correct.
allows you to tell ip_output to fragment all outgoing packets
into mbuf fragments of size net.inet.ip.mbuf_frag_size bytes.
This is an excellent way to test if network drivers can properly
handle long mbuf chains being passed to them.
net.inet.ip.mbuf_frag_size defaults to 0 (no fragmentation)
so that you can at least boot before your network driver dies. :)
comes in on is the same interface that we would route out of to get to
the packet's source address. Essentially automates an anti-spoofing
check using the information in the routing table.
Experimental. The usage and rule format for the feature may still be
subject to change.
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.
Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs
that matches snd_max, then do not respond with an ack, just drop the
segment. This fixes a problem where a simultaneous close results in
an ack loop between two time-wait states.
Test case supplied by: Tim Robbins <tjr@FreeBSD.ORG>
Sponsored by: DARPA, NAI Labs
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.
Tested by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Security improvements:
- Increase the size of each syncookie secret from 32 to 128 bits
in order to make brute force attacks on the secrets much more
difficult.
- Always return the lowest order dword from the MD5 hash; this
allows us to expose 2 more bits of the cookie and makes ACK
floods which seek to guess the cookie value more difficult.
Performance improvements:
- Increase the lifetime of each syncookie from 4 seconds to 16
seconds. This increases the usefulness of syncookies during
an attack.
- From Yahoo!: Reduce the number of calls to MD5Update; this
results in a ~17% increase in cookie generation time here.
Reviewed by: hsu, jayanth, jlemon, nectar
MFC After: 15 seconds
packets coming out of a GIF tunnel are re-processed by ipfw, et. al.
By default they are not reprocessed. With the option they are.
This reverts 1.214. Prior to that change packets were not re-processed.
After they were which caused problems because packets do not have
distinguishing characteristics (like a special network if) that allows
them to be filtered specially.
This is really a stopgap measure designed for immediate MFC so that
4.8 has consistent handling to what was in 4.7.
PR: 48159
Reviewed by: Guido van Rooij <guido@gvr.org>
MFC after: 1 day
and enable it by default, with a limit of 16.
At the same time, tweak maxfragpackets downward so that in the worst
possible case, IP reassembly can use only 1/2 of all mbuf clusters.
MFC after: 3 days
Reviewed by: hsu
Liked by: bmah
OSes has probably caused more problems than it ever solved. Allow the
user to retire the old behavior by specifying their own privileged
range with,
net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1
net.inet.ip.portrange.reservedlo default = 0
Now you can run that webserver without ever needing root at all. Or
just imagine, an ftpd that can really drop privileges, rather than
just set the euid, and still do PORT data transfers from 20/tcp.
Two edge cases to note,
# sysctl net.inet.ip.portrange.reservedhigh=0
Opens all ports to everyone, and,
# sysctl net.inet.ip.portrange.reservedhigh=65535
Locks all network activity to root only (which could actually have
been achieved before with ipfw(8), but is somewhat more
complicated).
For those who stick to the old religion that 0-1023 belong to root and
root alone, don't touch the knobs (or even lock them by raising
securelevel(8)), and nothing changes.
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.
Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.
Sponsored by: DARPA, NAI Labs
- delay acks for T/TCP regardless of delack setting
- fix bug where a single pass through tcp_input might not delay acks
- use callout_active() instead of callout_pending()
Sponsored by: DARPA, NAI Labs
cr_uid.
Note: we do not have socheckuid() in RELENG_4, ip_fw2.c uses its
own macro for a similar purpose that is why ipfw2 in RELENG_4 processes
uid rules correctly. I will MFC the diff for code consistency.
Reported by: Oleg Baranov <ol@csa.ru>
Reviewed by: luigi
MFC after: 1 month
ipsec4_process_packet; they happen when a packet is dropped because
an SA acquire is initiated
Submitted by: Doug Ambrisko <ambrisko@verniernetworks.com>
you still don't want to use the two together, but it's ok to have
them in the same kernel (the problem that initiated this bandaid
has long since been fixed)
initialized until after a syncookie was generated. As a result,
all connections resulting from a returned cookie would end up using
a MSS of ~512 bytes. Now larger packets will be used where possible.
MFC after: 5 days
- Honor the previous behavior of maxfragpackets = 0 or -1
- Take a better stab at fragment statistics
- Move / correct a comment
Suggested by: maxim@
MFC after: 7 days
functions implemented approximately the same limits on fragment memory
usage, but in different fashions.)
End user visible changes:
- Fragment reassembly queues are freed in a FIFO manner when maxfragpackets
has been reached, rather than all reassembly stopping.
MFC after: 5 days
in addition to secure level 1. The mask supports up to a secure level of 8
but only add defines through CTLFLAG_SECURE3 for now.
As per the missif in the log entry for 1.11 of ip_fw2.c which added the
secure flag to the IPFW sysctl's in the first place, change the secure
level requirement from 1 to 3 now that we have support for it.
Reviewed by: imp
With Design Suggestions by: imp
were sometimes propagated using M_COPY_PKTHDR which actually did
something between a "move" and a "copy" operation. This is replaced
by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it
from the source mbuf) and m_dup_pkthdr which copies the packet
header contents including any m_tag chain. This corrects numerous
problems whereby mbuf tags could be lost during packet manipulations.
These changes also introduce arguments to m_tag_copy and m_tag_copy_chain
to specify if the tag copy work should potentially block. This
introduces an incompatibility with openbsd which we may want to revisit.
Note that move/dup of packet headers does not handle target mbufs
that have a cluster bound to them. We may want to support this;
for now we watch for it with an assert.
Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG|M_LASTFRAG.
Supported by: Vernier Networks
Reviewed by: Robert Watson <rwatson@FreeBSD.org>
Note that the original RFC 1323 (PAWS) says in 4.2.1 that the out of
order / reverse-time-indexed packet should be acknowledged as specified
in RFC-793 page 69 then dropped. The original PAWS code in FreeBSD (1994)
simply acknowledged the segment unconditionally, which is incorrect, and
was fixed in 1.183 (2002). At the moment we do not do checks for SYN or FIN
in addition to (tlen != 0), which may or may not be correct, but the
worst that ought to happen should be a retry by the sender.
in network byte order, but icmp_error() expects the IP header to
be in host order and the code here did not perform the necessary
swapping for the bridged case. This bug causes an "icmp_error: bad
length" panic when certain length IP packets (e.g. ip_len == 0x100)
are rejected by the firewall with an ICMP response.
MFC after: 3 days