and has been broken twice:
- in the beginning of div_output() replace KASSERT with assignment, as
it was in rev. 1.83. [1] [to be MFCed]
- refactor changes introduced in rev. 1.100: do not prepend a new tag
unconditionally. Before doing this check whether we have one. [2]
A small note for all hacking in this area:
when divert socket is not a real userland, but ng_ksocket(4), we receive
_the same_ mbufs, that we transmitted to socket. These mbufs have rcvif,
the tags we've put on them. And we should treat them correctly.
Discussed with: mlaier [1]
Silence from: green [2]
Reviewed by: maxim
Approved by: julian (mentor)
MFC after: 1 week
This makes it possible to have more than one address with the same prefix.
The first address added is used for the route. On deletion of an address
with IFA_ROUTE set, we try to find a "fallback" address and hand over the
route if possible.
I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to
in_ifscrub as it must be considered KAPI as it is not static in in.c. I will
clean this after the MFC.
Discussed on: arch, net
Tested by: many testers of the CARP patches
Nits from: ru, Andrea Campi <andrea+freebsd_arch webcom it>
Obtained from: WIDE via OpenBSD
MFC after: 1 month
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled. Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed. Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.
MFC after: 3 weeks
destined for a blackhole route.
This also means that blackhole routes do not need to be bound to lo(4)
or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case.
Submitted by: james at towardex dot com
Sponsored by: eXtensible Open Router Project <URL:http://www.xorp.org/>
MFC after: 3 weeks
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked(). While file in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port. To correct this and
related races:
- Eliminate udp_ip6, which is believed to be generated but then never
used. Eliminate ip_2_ip6_hdr() as it is now unneeded.
- Eliminate setting, testing, and existence of 'init' status fields
for the IPv6 structures. While with multiple UDP delivery this
could lead to amortization of IPv4 -> IPv6 conversion when
delivering an IPv4 UDP packet to an IPv6 socket, it added
substantial complexity and side effects.
- Move global structures into the stack, declaring udp_in in
udp_input(), and udp_in6 in udp_append() to be used if a conversion
is required. Pass &udp_in into udp_append().
- Re-annotate comments to reflect updates.
With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism. This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.
Discovered by: kris (Bug Magnet)
MFC after: 3 weeks
A complete rationale and discussion is given in this message
and the resulting discussion:
http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706
Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.
Discussed on: -arch
under high load: only set function state to loop and continuing sending
if there is no data left to send.
RELENG_5_3 candidate.
Feet provided: Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by: Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by: mohan <mohans at yahoo-inc dot com>
protocols: it is possible for sockets to be created and attached
to the divert protocol between the test for sockets present and
successful unload of the registration handler. We will need to
explore more mature APIs for unregistering the protocol and then
draining consumers, or an atomic test-and-unregister mechanism.
of protocols. The call to divert_packet() is done through a function pointer. All
semantics of IPDIVERT remain intact. If IPDIVERT is not loaded ipfw will refuse to
install divert rules and natd will complain about 'protocol not supported'. Once
it is loaded both will work and accept rules and open the divert socket. The module
can only be unloaded if no divert sockets are open. It does not close any divert
sockets when an unload is requested but will return EBUSY instead.
protocols in inetsw[] and define initially eight spacer slots.
Remove conflicting declaration 'struct pr_usrreqs nousrreqs'. It is
now declared and initialized in kern/uipc_domain.c.
With pr_proto_register() it has become possible to dynamically load protocols
within the PF_INET domain. However the PF_INET domain has a second important
structure called ip_protox[] that is derived from the 'struct protosw inetsw[]'
and takes care of the de-multiplexing of the various protocols that ride on
top of IP packets.
The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[]
array mux in a consistent and easy way. To register a protocol within
ip_protox[] the existence of a corresponding and matching protocol definition
in inetsw[] is required. The function does not allow to overwrite an already
registered protocol. The unregister function simply replaces the mux slot with
the default index pointer to IPPROTO_RAW as it was previously.
(sorele()/sotryfree()):
- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.
- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.
This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.
RELENG_5_3 candidate.
MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.
RELENG_5 candidate.
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.
processes in jail could create raw sockets, additional access control
checks were added to raw IP sockets to limit the ways in which those
sockets could be used. Specifically, only the socket option IP_HDRINCL
was permitted in rip_ctloutput(). Other socket options were protected
by a call to suser(). This change was required to prevent processes
in a Jail from modifying system properties such as multicast routing
and firewall rule sets.
However, it also introduced a regression: processes that create a raw
socket with root privilege, but then downgraded credential (i.e., a
daemon giving up root, or a setuid process switching back to the real
uid) could no longer issue other unprivileged generic IP socket option
operations, such as IP_TOS, IP_TTL, and the multicast group membership
options, which prevented multicast routing daemons (and some other
tools) from operating correctly.
This change pushes the access control decision down to the granularity
of individual socket options, rather than all socket options, on raw
IP sockets. When rip_ctloutput() doesn't implement an option, it will
now pass the request directly to in_control() without an access
control check. This should restore the functionality of the generic
IP socket options for raw sockets in the above-described scenarios,
which may be confirmed with the ipsockopt regression test.
RELENG_5 candidate.
Reviewed by: csjp
reaching into the socket buffer. This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it). Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although haven't
seen it reported).
RELENG_5 candidate.
callouts as non-CALLOUT_MPSAFE. Otherwise, they may trigger an
assertion regarding Giant if they enter other parts of the stack from
the callout.
MFC after: 3 days
Reported by: Dikshie < dikshie at ppk dot itb dot ac dot id >
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.
Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.
This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.
Suggested by: rwatson
A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by: rwatson, csjp
Tested by: -pf, -ipfw, LINT, csjp and myself
MFC after: 3 days
LOR IDs: 14 - 17 (not fixed yet)
When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in
rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to
0 (accidently?) in rev. 1.130.2.20 in RELENG_4 only. Among with the
fact this knob is not documented it breaks POLA especially in bridge
environment.
OK'ed by: andre
Reviewed by: -current
family to the ip_protox[] array. The protocol number of IPPROTO_DIVERT is
larger than IPPROTO_MAX and was initializing memory beyond the array.
Catch all these kinds of errors by ignoring protocols that are higher than
IPPROTO_MAX or 0 (zero).
Add more comments ip_init().
it travels through the IP stack. This wasn't much of a problem because IP
source routing is disabled by default but when enabled together with SMP and
preemption it would have very likely cross-corrupted the IP options in transit.
The IP source route options of a packet are now stored in a mtag instead of the
global variable.