panics, which occur when stale ifnet pointers are left in struct
moptions hung off of inpcbs:
- Add in_ifdetach(), which matches in6_ifdetach(), and allows the
protocol to perform early tear-down on the interface early in
if_detach().
- Annotate that if_detach() needs careful consideration.
- Remove calls to in_pcbpurgeif0() in the handling of SIOCDIFADDR --
this is not the place to detect interface removal! This also
removes what is basically a nasty (and now unnecessary) hack.
- Invoke in_pcbpurgeif0() from in_ifdetach(), in both raw and UDP
IPv4 sockets.
It is now possible to run the msocket_ifnet_remove regression test
using HEAD without panicking.
MFC after: 3 days
has been done in icmp_input() already.
This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was
proposed in the ICMP reply.
PR: kern/81813
Submitted by: Vitezslav Novy <vita at fio.cz>
MFC after: 3 days
first interface is detached from parent and then bpfdetach() is called.
If the interface was the last carp(4) interface attached to parent, then
the mutex on parent is destroyed. When bpfdetach() calls if_setflags()
we panic on destroyed mutex.
To prevent the above scenario, clear pointer to parent, when we detach
ourselves from parent.
ARP requests only on the network where this IP address belong, to.
Before this change we did replied on all interfaces. This could
lead to an IP address conflict with host we are doing ARP proxy
for.
PR: kern/75634
Reviewed by: andre
between sack and a bug in the "bad retransmit recovery" logic. This is
a workaround, the underlying bug will be fixed later.
Submitted by: Mohan Srinivasan, Noritoshi Demizu
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.
This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.
Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".
MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005
cluster if needed.
Fixes the TCP issues raised in I-D draft-gont-icmp-payload-00.txt.
This aids in-the-wild debugging a lot and allows the receiver to do
more elaborate checks on the validity of the response.
MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005
packet in an ICMP reply. The minimum of 8 bytes is internally
enforced. The maximum quotation is the remaining space in the
reply mbuf.
This option is added in response to the issues raised in I-D
draft-gont-icmp-payload-00.txt.
MFC after: 2 weeks
Spnsored by: TCP/IP Optimizations Fundraise 2005
the IP address the packet came through in. This is useful for routers
to show in traceroutes the actual path a packet has taken instead of
the possibly different return path.
The new sysctl is named net.inet.icmp.reply_from_interface and defaults
to off.
MFC after: 2 weeks
than one interface in one subnet. However, some userland apps rely on
the believe that this configuration is impossible.
Add a sysctl switch net.inet.ip.same_prefix_carp_only. If the switch
is on, then kernel will refuse to add an additional interface to
already connected subnet unless the interface is CARP. Default
value is off.
PR: bin/82306
In collaboration with: mlaier
* Correct handling of IPv6 Extension Headers.
* Add unreach6 code.
* Add logging for IPv6.
Submitted by: sysctl handling derived from patch from ume needed for ip6fw
Obtained from: is_icmp6_query and send_reject6 derived from similar
functions of netinet6,ip6fw
Reviewed by: ume, gnn; silence on ipfw@
Test setup provided by: CK Software GmbH
MFC after: 6 days
incoming ARP packet and route request adding/removing
ARP entries. The root of the problem is that
struct llinfo_arp was accessed without any locks.
To close race we will use locking provided by
rtentry, that references this llinfo_arp:
- Make arplookup() return a locked rtentry.
- In arpresolve() hold the lock provided by
rt_check()/arplookup() until the end of function,
covering all accesses to the rtentry itself and
llinfo_arp it refers to.
- In in_arpinput() do not drop lock provided by
arplookup() during first part of the function.
- Simplify logic in the first part of in_arpinput(),
removing one level of indentation.
- In the second part of in_arpinput() hold rtentry
lock while copying address.
o Fix a condition when route entry is destroyed, while
another thread is contested on its lock:
- When storing a pointer to rtentry in llinfo_arp list,
always add a reference to this rtentry, to prevent
rtentry being destroyed via RTM_DELETE request.
- Remove this reference when removing entry from
llinfo_arp list.
o Further cleanup of arptimer():
- Inline arptfree() into arptimer().
- Use official queue(3) way to pass LIST.
- Hold rtentry lock while reading its structure.
- Do not check that sdl_family is AF_LINK, but
assert this.
Reviewed by: sam
Stress test: http://www.holm.cc/stress/log/cons141.html
Stress test: http://people.freebsd.org/~pho/stress/log/cons144.html
to atomically return either an existing set of IP multicast options for the
PCB, or a newlly allocated set with default values. The inpcb is returned
locked. This function may sleep.
Call ip_moptions() to acquire a reference to a PCB's socket options, and
perform the update of the options while holding the PCB lock. Release the
lock before returning.
Remove garbage collection of multicast options when values return to the
default, as this complicates locking substantially. Most applications
allocate a socket either to be multicast, or not, and don't tend to keep
around sockets that have previously been used for multicast, then used for
unicast.
This closes a number of race conditions involving multiple threads or
processes modifying the IP multicast state of a socket simultaenously.
MFC after: 7 days
IFF_DRV_RUNNING, as well as the move from ifnet.if_flags to
ifnet.if_drv_flags. Device drivers are now responsible for
synchronizing access to these flags, as they are in if_drv_flags. This
helps prevent races between the network stack and device driver in
maintaining the interface flags field.
Many __FreeBSD__ and __FreeBSD_version checks maintained and continued;
some less so.
Reviewed by: pjd, bz
MFC after: 7 days
lists, as well as accessor macros. For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.
For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.
Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().
Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.
Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.
Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.
Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.
Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 10 days
FreeBSD specific ip_newid() changes NetBSD does not have.
Correct handling of non AF_INET packets passed to bpf [2].
PR: kern/80340[1], NetBSD PRs 29150[1], 30844[2]
Obtained from: NetBSD ip_gre.c rev. 1.34,1.35, if_gre.c rev. 1.56
Submitted by: Gert Doering <gert at greenie.muc.de>[2]
MFC after: 4 days
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application do:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.
Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME
redundant with respect to existing mbuf copy label routines. Expose
a new mac_copy_mbuf() routine at the top end of the Framework and
use that; use the existing mpo_copy_mbuf_label() routine on the
bottom end.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA, SPAWAR
Approved by: re (scottl)
packet filter. This would cause a panic on architectures that require strict
alignment such as sparc64 (tier1) and ia64/ppc (tier2).
This adds two new macros that check the alignment, these are compile time
dependent on __NO_STRICT_ALIGNMENT which is set for i386 and amd64 where
alignment isn't need so the cost is avoided.
IP_HDR_ALIGNED_P()
IP6_HDR_ALIGNED_P()
Move bridge_ip_checkbasic()/bridge_ip6_checkbasic() up so that the alignment
is checked for ipfw and dummynet too.
PR: ia64/81284
Obtained from: NetBSD
Approved by: re (dwhite), mlaier (mentor)
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.
Found by: Andrey Chernov.
Submitted by: Noritoshi Demizu, Raja Mukerji.
Approved by: re
does not clear tlen and frees the mbuf (leaving th pointing at
freed memory), if the data segment is a complete duplicate.
This change works around that bug. A fix for the tcp_reass() bug
will appear later (that bug is benign for now, as neither th nor
tlen is referenced in tcp_input() after the call to tcp_reass()).
Found by: Pawel Jakub Dawidek.
Submitted by: Raja Mukerji, Noritoshi Demizu.
Approved by: re
so residue of division for all hosts on net is the same, and thus only
one VHID answers. Change source IP in host byte order.
Reviewed by: mlaier
Approved by: re (scottl)
The ipfw tables lookup code caches the result of the last query. The
kernel may process multiple packets concurrently, performing several
concurrent table lookups. Due to an insufficient locking, a cached
result can become corrupted that could cause some addresses to be
incorrectly matched against a lookup table.
Submitted by: ru
Reviewed by: csjp, mlaier
Security: CAN-2005-2019
Security: FreeBSD-SA-05:13.ipfw
Correct bzip2 permission race condition vulnerability.
Obtained from: Steve Grubb via RedHat
Security: CAN-2005-0953
Security: FreeBSD-SA-05:14.bzip2
Approved by: obrien
Correct TCP connection stall denial of service vulnerability.
A TCP packets with the SYN flag set is accepted for established
connections, allowing an attacker to overwrite certain TCP options.
Submitted by: Noritoshi Demizu
Reviewed by: andre, Mohan Srinivasan
Security: CAN-2005-2068
Security: FreeBSD-SA-05:15.tcp
Approved by: re (security blanket), cperciva
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.
Submitted by: Noritoshi Demizu
Reveiewed by: Mohan Srinivasan, Raja Mukerji
Approved by: re
kernel module. LibAlias is not aware about checksum offloading,
so the caller should provide checksum calculation. (The only
current consumer is ng_nat(4)). When TCP packet internals has
been changed and it requires checksum recalculation, a cookie
is set in th_x2 field of TCP packet, to inform caller that it
needs to recalculate checksum. This ugly hack would be removed
when LibAlias is made more kernel friendly.
Incremental checksum updates are left as is, since they don't
conflict with offloading.
Approved by: re (scottl)
a DLT_NULL interface. In particular:
1) Consistently use type u_int32_t for the header of a
DLT_NULL device - it continues to represent the address
family as always.
2) In the DLT_NULL case get bpf_movein to store the u_int32_t
in a sockaddr rather than in the mbuf, to be consistent
with all the DLT types.
3) Consequently fix a bug in bpf_movein/bpfwrite which
only permitted packets up to 4 bytes less than the MTU
to be written.
4) Fix all DLT_NULL devices to have the code required to
allow writing to their bpf devices.
5) Move the code to allow writing to if_lo from if_simloop
to looutput, because it only applies to DLT_NULL devices
but was being applied to other devices that use if_simloop
possibly incorrectly.
PR: 82157
Submitted by: Matthew Luckie <mjl@luckie.org.nz>
Approved by: re (scottl)
to the statement in ip_mroute.h, as well as being the same as what
OpenBSD has done with this file. It matches the copyright in NetBSD's
1.1 through 1.14 versions of the file as well, which they subsequently
added back.
It appears to have been lost in the 4.4-lite1 import for FreeBSD 2.0,
but where and why I've not investigated further. OpenBSD had the same
problem. NetBSD had a copyright notice until Multicast 3.5 was
integrated verbatim back in 1995. This appears to be the version that
made it into 4.4-lite1.
Approved by: re (scottl)
MFC after: 3 days
- do not use static memory as we are under a shared lock only
- properly rtfree routes allocated with rtalloc
- rename to verify_path6()
- implement the full functionality of the IPv4 version
Also make O_ANTISPOOF work with IPv6.
Reviewed by: gnn
Approved by: re (blanket)
on an IPv4 packet as these variables are uninitialized if not. This used to
allow arbitrary IPv6 packets depending on the value in the uninitialized
variables.
Some opcodes (most noteably O_REJECT) do not support IPv6 at all right now.
Reviewed by: brooks, glebius
Security: IPFW might pass IPv6 packets depending on stack contents.
Approved by: re (blanket)
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.
This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.
Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.
Reviewed by: sobomax, sam
do the subsequent ip_output() in IPFW. In ipfw_tick(), the keep-alive
packets must be generated from the data that resides under the
stateful lock, but they must not be sent at that time, as this would
cause a lock order reversal with the normal ordering (interface's
lock, then locks belonging to the pfil hooks).
In practice, this caused deadlocks when using IPFW and if_bridge(4)
together to do stateful transparent filtering.
MFC after: 1 week
the tail (in tcp_sack_option()). The bug was caused by incorrect
accounting of the retransmitted bytes in the sackhint.
Reported by: Kris Kennaway.
Submitted by: Noritoshi Demizu.
policy. It may be used to provide more detailed classification of
traffic without actually having to decide its fate at the time of
classification.
MFC after: 1 week
- Walks the scoreboard backwards from the tail to reduce the number of
comparisons for each sack option received.
- Introduce functions to add/remove sack scoreboard elements, making
the code more readable.
Submitted by: Noritoshi Demizu
Reviewed by: Raja Mukerji, Mohan Srinivasan
This is the last requirement before we can retire ip6fw.
Reviewed by: dwhite, brooks(earlier version)
Submitted by: dwhite (manpage)
Silence from: -ipfw
if_ioctl routine. This should fix a number of code paths through
soo_ioctl() that could call into Giant-locked network drivers without
first acquiring Giant.
Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach()
Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop()
MFC after: 7 days
that if we sort the incoming SACK blocks, we can update the scoreboard
in one pass of the scoreboard. The added overhead of sorting upto 4
sack blocks is much lower than traversing (potentially) large
scoreboards multiple times. The code was updating the scoreboard with
multiple passes over it (once for each sack option). The rewrite fixes
that, reducing the complexity of the main loop from O(n^2) to O(n).
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.
real next hole to retransmit from the scoreboard, caused by a bug
which did not update the "nexthole" hint in one case in
tcp_sack_option().
Reported by: Daniel Eriksson
Submitted by: Mohan Srinivasan
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.
This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.
The debug code that sanity checks the hints against the computed
value will be removed eventually.
Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.
Security: FreeBSD-SA-05:08.kmem
- kernel module declarations and handler.
- macros to map malloc(3) calls to malloc(9) ones.
- malloc(9) declarations.
- call finishoff() from module handler MOD_UNLOAD case
instead of atexit(3).
- use panic(9) instead of abort(3)
- take time from time_second instead of gettimeofday(2)
- define INADDR_NONE
look up the packet size of the packet that generated the
response, step down the MTU by one step through ip_next_mtu()
and try again.
Suggested by: dwmalone
needed only for implicit connect cases. Under load, especially on SMP,
this can greatly reduce contention on the tcbinfo lock.
NB: Ambiguities about the state of so_pcb need to be resolved so that
all use of the tcbinfo lock in non-implicit connection cases can be
eliminated.
Submited by: Kazuaki Oda <kaakun at highway dot ne dot jp>
fields of an ICMP packet.
Use this to allow ipfw to pullup only these values since it does not use
the rest of the packet and it was failed on ICMP packets because they
were not long enough.
struct icmp should probably be modified to use these at some point, but
that will break a fair bit of code so it can wait for another day.
On the off chance that adding this struct breaks something in ports,
bump __FreeBSD_version.
Reported by: Randy Bush <randy at psg dot com>
Tested by: Randy Bush <randy at psg dot com>
If TCP Signatures are enabled, the maximum allowed sack blocks aren't
going to fit. The fix is to compute how many sack blocks fit and tack
these on last. Also on SYNs, defer padding until after the SACK
PERMITTED option has been added.
Found by: Mohan Srinivasan.
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.
Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.
- If the peer sends the Signature option in the SYN, use of Timestamps
and Window Scaling were disabled (even if the peer supports them).
- The sender must not disable signatures if the option is absent in
the received SYN. (See comment in syncache_add()).
Found, Submitted by: Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>.
Reviewed by: Mohan Srinivasan <mohans at yahoo-inc dot com>.
tcp_ctlinput() and subject it to active tcpcb and sequence
number checking. Previously any ICMP unreachable/needfrag
message would cause an update to the TCP hostcache. Now only
ICMP PMTU messages belonging to an active TCP session with
the correct src/dst/port and sequence number will update the
hostcache and complete the path MTU discovery process.
Note that we don't entirely implement the recommended counter
measures of Section 7.2 of the paper. However we close down
the possible degradation vector from trivially easy to really
complex and resource intensive. In addition we have limited
the smallest acceptable MTU with net.inet.tcp.minmss sysctl
for some time already, further reducing the effect of any
degradation due to an attack.
Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.2
MFC after: 3 days
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.
Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.
Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days
IPv6 support. The header in IPv6 is more complex then in IPv4 so we
want to handle skipping over it in one location.
Submitted by: Mariano Tortoriello and Raffaele De Lorenzo (via luigi)
in flight in SACK recovery.
Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>
setting ts_recent to an arbitrary value, stopping further
communication between the two hosts.
- If the Echoed Timestamp is greater than the current time,
fall back to the non RFC 1323 RTT calculation.
Submitted by: Raja Mukerji (raja at moselle dot com)
Reviewed by: Noritoshi Demizu, Mohan Srinivasan
a reassembly queue state structure, don't update (receiver) sack
report.
- Similarly, if tcp_drain() is called, freeing up all items on the
reassembly queue, clean the sack report.
Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).
(Fix for kern/78226).
Submitted by : Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by : Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).
try to reasseble the packet from the fragments queue with the only
fragment, finish with the first fragment as soon as we create a queue.
Spotted by: Vijay Singh
o Drop the fragment if maxfragsperpacket == 0, no chances we
will be able to reassemble the packet in future.
Reviewed by: silby
ip.portrange.last and there is the only port for that because:
a) it is not wise; b) it leads to a panic in the random ip port
allocation code. In general we need to disable ip port allocation
randomization if the last - first delta is ridiculous small.
PR: kern/79342
Spotted by: Anjali Kulkarni
Glanced at by: silby
MFC after: 2 weeks
we have a non-NULL args.rule. If the same packet later is subject to "tee"
rule, its original is sent again into ipfw_chk() and it reenters at the same
rule. This leads to infinite loop and frozen router.
Assign args.rule to NULL, any time we are going to send packet back to
ipfw_chk() after a tee rule. This is a temporary workaround, which we
will leave for RELENG_5. In HEAD we are going to make divert(4) save
next rule the same way as dummynet(4) does.
PR: kern/79546
Submitted by: Oleg Bulyzhin
Reviewed by: maxim, andre
MFC after: 3 days
libalias.
In /usr/src/lib/libalias/alias.c, the functions LibAliasIn and
LibAliasOutTry call the legacy PacketAliasIn/PacketAliasOut instead
of LibAliasIn/LibAliasOut when the PKT_ALIAS_REVERSE option is set.
In this case, the context variable "la" gets lost because the legacy
compatibility routines expect "la" to be global. This was obviously
an oversight when rewriting the PacketAlias* functions to the
LibAlias* functions.
The fix (as shown in the patch below) is to remove the legacy
subroutine calls and replace with the new ones using the "la" struct
as the first arg.
Submitted by: Gil Kloepfer <fgil@kloepfer.org>
Confirmed by: <nicolai@catpipe.net>
PR: 76839
MFC after: 3 days
carp_carpdev_state_locked() is called every time carp interface is attached.
The first call backs up flags of the first interface, and the second
call backs up them again, erasing correct values.
To solve this, a carp_sc_state_locked() function is introduced. It is
called when interface is attached to parent, instead of calling
carp_carpdev_state_locked. carp_carpdev_state_locked() calls
carp_sc_state_locked() for each sc in chain.
Reported by: Yuriy N. Shkandybin, sem
returns error. In this case mbuf has already been freed. [1]
- Remove redundant declaration.
PR: kern/78893 [1]
Submitted by: Liang Yi [1]
Reviewed by: sam
MFC after: 1 day
Add two another workarounds for carp(4) interfaces:
- do not add connected route when address is assigned to carp(4) interface
- do not add connected route when other interface goes down
Embrace workarounds with #ifdef DEV_CARP
per-connection and globally. This eliminates potential DoS attacks
where SACK scoreboard elements tie up too much memory.
Submitted by: Raja Mukerji (raja at moselle dot com).
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).
a libalias application (e.g. natd, ppp, etc.) to crash. Note: Skinny support
is not enabled in natd or ppp by default.
Approved by: secteam (nectar)
MFC after: 1 day
Secuiryt: This fixes a remote DoS exploit
attached to a parent interface we use its mutex to lock the softc. This
means that in several places like carp_ioctl() we lock softc conditionaly.
This should be redesigned.
To avoid LORs when MII announces us a link state change, we schedule
a quick callout and call carp_carpdev_state_locked() from it.
Initialize callouts using NET_CALLOUT_MPSAFE.
Sponsored by: Rambler
Reviewed by: mlaier
- In carp_send_ad_all() walk through list of all carp interfaces
instead of walking through list of all interfaces.
Sponsored by: Rambler
Reviewed by: mlaier
- Use our loop DLT type, not OpenBSD. [1]
- The fields that are converted to network byte order are not 32-bit
fields but 16-bit fields, so htons should be used in htonl. [1]
- Secondly, ip_input changes ip->ip_len into its value without
the ip-header length. So, restore the length to make bpf happy. [1]
- Use bpf_mtap2(), use temporary af1, since bpf_mtap2 doesn't
understand uint8_t af identifier.
Submitted by: Frank Volf [1]
ignore the sack options in that segment. Else we'd end up
corrupting the scoreboard.
Found by: Raja Mukerji (raja at moselle dot com)
Submitted by: Mohan Srinivasan
- Simplify CARP_LOG() and making it working (we don't have addlog in FreeBSD).
- Introduce CARP_DEBUG() which logs with LOG_DEBUG severity when
net.inet.carp.log > 1
- Use CARP_DEBUG to log state changes of carp interfaces.
After CARP_LOG() cleanup it appeared that carp_input_c() does not need sc
argument. Remove it.
Sponsored by: Rambler
mastering on all other interfaces:
- call carp_carpdev_state() on initialize instead of just setting to INIT
- in carp_carpdev_state() check that interface is UP, instead of checking
that it is not DOWN, because a rebooted machine may have interface in
UNKNOWN state.
Sponsored by: Rambler
Obtained from: OpenBSD (partially)
with the kernel compile time option:
options IPFIREWALL_FORWARD_EXTENDED
This option has to be specified in addition to IPFIRWALL_FORWARD.
With this option even packets targeted for an IP address local
to the host can be redirected. All restrictions to ensure proper
behaviour for locally generated packets are turned off. Firewall
rules have to be carefully crafted to make sure that things like
PMTU discovery do not break.
Document the two kernel options.
PR: kern/71910
PR: kern/73129
MFC after: 1 week
so that parent interface is not left in promiscous mode after carp
interface is destroyed.
This is not perfect, since promisc counter is added when carp
interface is assigned an IP address. However, when address is removed
parent interface is still in promiscuous mode. Only removal of
carp interface removes promisc from parent. Same way in OpenBSD.
Sponsored by: Rambler
hosts to share an IP address, providing high availability and load
balancing.
Original work on CARP done by Michael Shalayeff, with many
additions by Marco Pfatschbacher and Ryan McBride.
FreeBSD port done solely by Max Laier.
Patch by: mlaier
Obtained from: OpenBSD (mickey, mcbride)
address is not supplied, then jail IP is choosed and in_pcbbind() is called.
Since udp_output() does not save local addr after call to in_pcbconnect_setup(),
in_pcbbind() is called for each packet, and this is incorrect.
So, we shall treat jailed sockets specially in udp_output(), we will save
their local address.
This fixes a long standing bug with broken sendto() system call in jails.
PR: kern/26506
Reviewed by: rwatson
MFC after: 2 weeks
loopback interface. Nobody have explained me sense of this check.
It breaks connect() system call to a destination address which is
loopback routed (e.g. blackholed).
Reviewed by: silence on net@
MFC after: 2 weeks
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
reported to the sender - in the case where the sender sends data
outside the window (as WinXP does :().
Reported by: Sam Jensen <sam at wand dot net dot nz>
Submitted by: Mohan Srinivasan
Remove the SACK "initburst" sysctl.
- Fix bugs in SACK dupack and partialack handling that can cause
large bursts while in SACK recovery.
Submitted by: Mohan Srinivasan
o Use SYSCTL_IN() macro instead of direct call of copyin(9).
Submitted by: ume
o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.
Suggested by: ume
utility:
The tcpdrop command drops the TCP connection specified by the
local address laddr, port lport and the foreign address faddr,
port fport.
Obtained from: OpenBSD
Reviewed by: rwatson (locking), ru (man page), -current
MFC after: 1 month
a UMA zone instead. This should eliminate a bit of the locking
overhead associated with with malloc and reduce the memory
consumption associated with each new state.
Reviewed by: rwatson, andre
Silence on: ipfw@
MFC after: 1 week
second; since the default hz has changed to 1000 times a second,
this resulted in unecessary work being performed.
MFC after: 2 weeks
Discussed with: phk, cperciva
General head nod: silby
- ip_fw_chk() returns action as function return value. Field retval is
removed from args structure. Action is not flag any more. It is one
of integer constants.
- Any action-specific cookies are returned either in new "cookie" field
in args structure (dummynet, future netgraph glue), or in mbuf tag
attached to packet (divert, tee, some future action).
o Convert parsing of return value from ip_fw_chk() in ipfw_check_{in,out}()
to a switch structure, so that the functions are more readable, and a future
actions can be added with less modifications.
Approved by: andre
MFC after: 2 months
of len in tcp_output(), in the case where the FIN has already been
transmitted. The mis-computation of len is because of a gcc
optimization issue, which this change works around.
Submitted by: Mohan Srinivasan
that the RFC 793 specification for accepting RST packets should be
following. When followed, this makes one vulnerable to the attacks
described in "slipping in the window", but it may be necessary in
some odd circumstances.
connection rates, which is causing problems for some users.
To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.
Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds. These thresholds may be tuned via sysctl.
Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.
MFC After: 20 seconds
cases for tcp_input():
While it is true that the pcbinfo lock provides a pseudo-reference to
inpcbs, both the inpcb and pcbinfo locks are required to free an
un-referenced inpcb. As such, we can release the pcbinfo lock as
long as the inpcb remains locked with the confidence that it will not
be garbage-collected. This leads to a less conservative locking
strategy that should reduce contention on the TCP pcbinfo lock.
Discussed with: sam
multiple MIB entries using sysctl in short order, which might
result in unexpected values for tcp_maxidle being generated by
tcp_slowtimo. In practice, this will not happen, or at least,
doesn't require an explicit comment.
MFC after: 2 weeks
Andre:
First lets get major new features into the kernel in a clean and nice way,
and then start optimizing. In this case we don't have any obfusication that
makes later profiling and/or optimizing difficult in any way.
Requested by: csjp, sam
mechanism used by pfil. This shared locking mechanism will remove
a nasty lock order reversal which occurs when ucred based rules
are used which results in hard locks while mpsafenet=1.
So this removes the debug.mpsafenet=0 requirement when using
ucred based rules with IPFW.
It should be noted that this locking mechanism does not guarantee
fairness between read and write locks, and that it will favor
firewall chain readers over writers. This seemed acceptable since
write operations to firewall chains protected by this lock tend to
be less frequent than reads.
Reviewed by: andre, rwatson
Tested by: myself, seanc
Silence on: ipfw@
MFC after: 1 month
tcpip_fillheaders()
tcp_discardcb()
tcp_close()
tcp_notify()
tcp_new_isn()
tcp_xmit_bandwidth_limit()
Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).
MFC after: 2 weeks
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.
Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().
MFC after: 2 weeks
modifications to the inpcb IP options mbuf:
- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
simulatenous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
pointer in the inpcb.
- Assert the inpcb lock in ip_pcbots.
- Convert one or two uses of a pointer as a boolean or an integer
comparison to a comparison with NULL for readability.
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until until the pointer update has been
performed.
MFC after: 2 weeks
This socket option allows processes query a TCP socket for some low
level transmission details, such as the current send, bandwidth, and
congestion windows. Linux provides a 'struct tcpinfo' structure
containing various variables, rather than separate socket options;
this makes the API somewhat fragile as it makes it dificult to add
new entries of interest as requirements and implementation evolve.
As such, I've included a large pad at the end of the structure.
Right now, relatively few of the Linux API fields are filled in, and
some contain no logical equivilent on FreeBSD. I've include __'d
entries in the structure to make it easier to figure ou what is and
isn't omitted. This API/ABI should be considered unstable for the
time being.
window was 0 bytes in size. This may have been the cause of unsolved
"connection not closing" reports over the years.
Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.
Submitted by: Michiel Boland
MFC after: 2 weeks
contents of the tcpcb are read and modified in volume.
In tcp_input(), replace th comparison with 0 with a comparison with
NULL.
At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'. Try to improve consistency between
various assertions regarding headlocked to be more informative.
MFC after: 2 weeks
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it. Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.
In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.
In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.
Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).
In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.
In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.
In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.
MFC after: 2 weeks
but unlikely races that could be corrected by having tcp_keepcnt
and tcp_keepintvl modifications go through handler functions via
sysctl, but probably is not worth doing. Updates to multiple
sysctls within evaluation of a single addition are unlikely.
Annotate that tcp_canceltimers() is currently unused.
De-spl tcp_timer_delack().
De-spl tcp_timer_2msl().
MFC after: 2 weeks
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().
Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller. This is a bug.
Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.
Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.
MFC after: 2 weeks
protects access to the ISN state variables.
Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.
Staticize internal ISN variables since they're not used outside of
tcp_subr.c.
MFC after: 2 weeks
from divert sockets.
- Remove div_disconnect() method, since it shouldn't be called now.
- Remove div_abort() method. It was never called directly, since protocol
doesn't have listen queue. It was called only from div_disconnect(),
which is removed now.
Reviewed by: rwatson, maxim
Approved by: julian (mentor)
MT5 after: 1 week
MT4 after: 1 month
after allowing more than one address with the same prefix.
Reported by: Vladimir Grebenschikov <vova NO fbsd SPAM ru>
Submitted by: ru (also NetBSD rev. 1.83)
Pointyhat to: mlaier