For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.
Clean up some layering violations (access to tcp internals from in_pcb)
This piece of code has not been referenced since it was put there
in 1995. Also done a codebased search on popular networking libraries
and third-party applications. This is an orphan.
Reviewed by: jesper
connection, but send it immediately. Prior to this change, it was possible
to delay a delayed-ack for multiple times, resulting in degraded TCP
behavior in certain corner cases.
error will be passed up to the user, who will close the connection, so
it does not appear to make a sense to leave the connection open.
This also fixes a bug with kqueue, where the filter does not set EOF
on the connection, because the connection is still open.
Also remove calls to so{rw}wakeup, as we aren't doing anything with
them at the moment anyway.
Reviewed by: alfred, jesper
reset TCP connections which are in the SYN_SENT state, if the sequence
number in the echoed ICMP reply is correct. This behavior can be
controlled by the sysctl net.inet.tcp.icmp_may_rst.
Currently, only subtypes 2,3,10,11,12 are treated as such
(port, protocol and administrative unreachables).
Assocaiate an error code with these resets which is reported to the
user application: ENETRESET.
Disallow resetting TCP sessions which are not in a SYN_SENT state.
Reviewed by: jesper, -net
and 1.84 of src/sys/netinet/udp_usrreq.c
The changes broken down:
- remove 0 as a wildcard for addresses and port numbers in
src/sys/netinet/in_pcb.c:in_pcbnotify()
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
all sessions with the specific remote address.
- change
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
to use in_pcbnotifyall() to notify multiple sessions, instead of
using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
udp_ctlinput() and tcp_ctlinput(), so they will call
in_pcbnotifyall() to notify all sessions with the specific
remote address.
Approved by: jlemon
Inspired by: NetBSD
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.
Notes:
o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.
Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project
treat 0 as a wildcard in src/sys/in_pbc.c:in_pcbnotify()
It's sufficient to check for src|local port, as we'll have no
sessions with src|local port == 0
Without this a attacker sending ICMP messages, where the attached
IP header (+ 8 bytes) has the address and port numbers == 0, would
have the ICMP message applied to all sessions.
PR: kern/25195
Submitted by: originally by jesper, reimplimented by jlemon's advice
Reviewed by: jlemon
Approved by: jlemon
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).
This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.
Reviewed by: bde
Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h
Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input
In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB
or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG
Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst
to reflect the fact that we also react on ICMP unreachables that
are not administrative prohibited. Also update the comments to
reflect this.
In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat
PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different.
PR: 23986
Submitted by: Jesper Skriver <jesper@skriver.dk>
address is configured on a interface. This is useful for routers with
dynamic interfaces. It is now possible to say:
0100 allow tcp from any to any established
0200 skipto 1000 tcp from any to any
0300 allow ip from any to any
1000 allow tcp from 1.2.3.4 to me 22
1010 deny tcp from any to me 22
1020 allow tcp from any to any
and not have to worry about the behaviour if dynamic interfaces configure
new IP numbers later on.
The check is semi expensive (traverses the interface address list)
so it should be protected as in the above example if high performance
is a requirement.
were performed to determine if the received packet should be reset. This
created erroneous ratelimiting and false alarms in some cases. The code
has now been reorganized so that the checks for validity come before
the call to badport_bandlim. Additionally, a few changes in the symbolic
names of the bandlim types have been made, as well as a clarification of
exactly which type each RST case falls under.
Submitted by: Mike Silbersack <silby@silby.com>
turned on, and the case of it not being defined at all.
i.e. Disabling bridging re-enables some of the checks it disables.
Submitted by: "Rogier R. Mulhuijzen" <drwilco@drwilco.net>
available, the error return should be EADDRNOTAVAIL rather than
EAGAIN.
PR: 14181
Submitted by: Dima Dorfman <dima@unixfreak.org>
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
splimp() -- we need it because dummynet can be invoked by the
bridging code at splimp().
This should cure the pipe "stalls" that several people have been
reporting on -stable while using bridging+dummynet (the problem
would not affect routers using dummynet).
reserved and now allocated TCP flags in incoming packets. This patch
stops overloading those bits in the IP firewall rules, and moves
colliding flags to a seperate field, ipflg. The IPFW userland
management tool, ipfw(8), is updated to reflect this change. New TCP
flags related to ECN are now included in tcp.h for reference, although
we don't currently implement TCP+ECN.
o To use this fix without completely rebuilding, it is sufficient to copy
ip_fw.h and tcp.h into your appropriate include directory, then rebuild
the ipfw kernel module, and ipfw tool, and install both. Note that a
mismatch between module and userland tool will result in incorrect
installation of firewall rules that may have unexpected effects. This
is an MFC candidate, following shakedown. This bug does not appear
to affect ipfilter.
Reviewed by: security-officer, billf
Reported by: Aragon Gouveia <aragon@phat.za.net>
to supress logging when ARP replies arrive on the wrong interface:
"/kernel: arp: 1.2.3.4 is on dc0 but got reply from 00:00:c5:79:d0:0c on dc1"
the default is to log just to give notice about possibly incorrectly
configured networks.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.
* Fix a typo in a comment in mbuf.h
* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.
messages send by routers when they deny our traffic, this causes
a timeout when trying to connect to TCP ports/services on a remote
host, which is blocked by routers or firewalls.
rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requi re that we treat such a message for a TCP session, that we
treat it like if we had recieved a RST.
quote begin.
A Destination Unreachable message that is received MUST be
reported to the transport layer. The transport layer SHOULD
use the information appropriately; for example, see Sections
4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol
that has its own mechanism for notifying the sender that a
port is unreachable (e.g., TCP, which sends RST segments)
MUST nevertheless accept an ICMP Port Unreachable for the
same purpose.
quote end.
I've written a small extension that implement this, it also create
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.
When it's activated (set to 1) we'll treat a ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
sessions, as if we recived a TCP RST, but only if the TCP session
is in SYN_SENT state.
The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.
I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually have the sysctl set to 0 by
default.
PR: 23086
Submitted by: Jesper Skriver <jesper@skriver.dk>
1. ICMP ECHO and TSTAMP replies are now rate limited.
2. RSTs generated due to packets sent to open and unopen ports
are now limited by seperate counters.
3. Each rate limiting queue now has its own description, as
follows:
Limiting icmp unreach response from 439 to 200 packets per second
Limiting closed port RST response from 283 to 200 packets per second
Limiting open port RST response from 18724 to 200 packets per second
Limiting icmp ping response from 211 to 200 packets per second
Limiting icmp tstamp response from 394 to 200 packets per second
Submitted by: Mike Silbersack <silby@silby.com>
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.
IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.
Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.
* Some dummynet code incorrectly handled a malloc()-allocated pseudo-mbuf
header structure, called "pkt," and could consequently pollute the mbuf
free list if it was ever passed to m_freem(). The fix involved passing not
pkt, but essentially pkt->m_next (which is a real mbuf) to the mbuf
utility routines.
* Also, for dummynet, in bdg_forward(), made the code copy the ethernet header
back into the mbuf (prepended) because the dummynet code that follows expects
it to be there but it is, unfortunately for dummynet, passed to bdg_forward
as a seperate argument.
PRs: kern/19551 ; misc/21534 ; kern/23010
Submitted by: Thomas Moestl <tmoestl@gmx.net>
Reviewed by: bmilekic
Approved by: luigi
instead.
Also, fix a small set of "avail." If we're setting `avail,' we shouldn't
be re-checking whether m_flags is M_EXT, because we know that it is, as if
it wasn't, we would have already returned several lines above.
Reviewed by: jlemon
only be checked if the system is currently performing New Reno style
fast recovery. However, this value was being checked regardless of the
NR state, with the end result being that the congestion window was never
opened.
Change the logic to check t_dupack instead; the only code path that
allows it to be nonzero at this point is NewReno, so if it is nonzero,
we are in fast recovery mode and should not touch the congestion window.
Tested by: phk
PPTP links are no longer dropped by simple (and inappropriate in this
case) "inactivity timeout" procedure, only when requested through the
control connection.
It is now possible to have multiple PPTP servers running behind NAT.
Just redirect the incoming TCP traffic to port 1723, everything else
is done transparently.
Problems were reported and the fix was tested by:
Michael Adler <Michael.Adler@compaq.com>,
David Andersen <dga@lcs.mit.edu>
<sys/proc.h> to <sys/systm.h>.
Correctly document the #includes needed in the manpage.
Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.
Define __offsetof() in <machine/ansi.h>
Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>
Remove myriad of local offsetof() definitions.
Remove includes of <stddef.h> in kernel code.
NB: Kernelcode should *never* include from /usr/include !
Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.
Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.
Paritials reviews by: various.
Significant brucifications by: bde
of IP datagram. This fixes the problem when firewall denied fragmented
packets whose last fragment was less than minimum protocol header size.
Found by: Harti Brandt <brandt@fokus.gmd.de>
PR: kern/22309
in the code enforces this. So, do not check for and attempt a
false reassembly if only IP_RF is set.
Also, removed the dead code, since we no longer use dtom() on
return from ip_reass().
statistics on a per network address basis.
Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.
Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is evoked, instead of displaying the per-interface
stats.
and instead reapply the revision 1.49 of mbuf.h, i.e.
Fixed regression of the type of the `header' member of struct pkthdr from
`void *' to caddr_t in rev.1.51. This mainly caused an annoying warning
for compiling ip_input.c.
Requested by: bde
enough into the mbuf data area. Solve this problem once and for all
by pulling up the entire (standard) header for TCP and UDP, and four
bytes of header for ICMP (enough for type, code and cksum fields).
the IP_FW_IF_IPID rule. (We have recently decided to keep the
ip_id field in network byte order inside the kernel, see revision
1.140 of src/sys/netinet/ip_input.c).
I did not like to have the conversion happen in userland, and I
think that the similar conversions for fw_tcp(seq|ack|win) should
be moved out of userland (src/sbin/ipfw/ipfw.c) into the kernel.
in favor of the new-style per-vif socket.
this does not affect the behavior of the ISI rsvpd but allows
another rsvp implementation (e.g., KOM rsvp) to take advantage
of the new style for particular sockets while using the old style
for others.
in the future, rsvp supporn should be replaced by more generic
router-alert support.
PR: kern/20984
Submitted by: Martin Karsten <Martin.Karsten@KOM.tu-darmstadt.de>
Reviewed by: kjc
but have a network interrupt arrive and deactivate the timeout before
the callout routine runs. Check for this case in the callout routine;
it should only run if the callout is active and not on the wheel.
It causes a panic when/if snd_una is incremented elsewhere (this
is a conservative change, because originally no rollback occurred
for any packets at all).
Submitted by: Vivek Sadananda Pai <vivek@imimic.com>
Update copyrights.
Introduce a new sysctl node:
net.inet.accf
Although acceptfilters need refcounting to be properly (safely) unloaded
as a temporary hack allow them to be unloaded if the sysctl
net.inet.accf.unloadable is set, this is really for developers who want
to work on thier own filters.
A near complete re-write of the accf_http filter:
1) Parse check if the request is HTTP/1.0 or HTTP/1.1 if not dump
to the application.
Because of the performance implications of this there is a sysctl
'net.inet.accf.http.parsehttpversion' that when set to non-zero
parses the HTTP version.
The default is to parse the version.
2) Check if a socket has filled and dump to the listener
3) optimize the way that mbuf boundries are handled using some voodoo
4) even though you'd expect accept filters to only be used on TCP
connections that don't use m_nextpkt I've fixed the accept filter
for socket connections that use this.
This rewrite of accf_http should allow someone to use them and maintain
full HTTP compliance as long as net.inet.accf.http.parsehttpversion is
set.
for them does not belong in the IP_FW_F_COMMAND switch, that mask doesn't even
apply to them(!).
2. You cannot add a uid/gid rule to something that isn't TCP, UDP, or IP.
XXX - this should be handled in ipfw(8) as well (for more diagnostic output),
but this at least protects bogus rules from being added.
Pointy hat: green
datagram embedded into ICMP error message, not with protocol
field of ICMP message itself (which is always IPPROTO_ICMP).
Pointed by: Erik Salander <erik@whistle.com>
fields between host and network byte order. The details:
o icmp_error() now does not add IP header length. This fixes the problem
when icmp_error() is called from ip_forward(). In this case the ip_len
of the original IP datagram returned with ICMP error was wrong.
o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host
byte order, so DTRT and convert these fields back to network byte order
before sending a message. This fixes the problem described in PR 16240
and PR 20877 (ip_id field was returned in host byte order).
o ip_ttl decrement operation in ip_forward() was moved down to make sure
that it does not corrupt the copy of original IP datagram passed later
to icmp_error().
o A copy of original IP datagram in ip_forward() was made a read-write,
independent copy. This fixes the problem I first reported to Garrett
Wollman and Bill Fenner and later put in audit trail of PR 16240:
ip_output() (not always) converts fields of original datagram to network
byte order, but because copy (mcopy) and its original (m) most likely
share the same mbuf cluster, ip_output()'s manipulations on original
also corrupted the copy.
o ip_output() now expects all three fields, ip_len, ip_off and (what is
significant) ip_id in host byte order. It was a headache for years that
ip_id was handled differently. The only compatibility issue here is the
raw IP socket interface with IP_HDRINCL socket option set and a non-zero
ip_id field, but ip.4 manual page was unclear on whether in this case
ip_id field should be in host or network byte order.
not alias `ip_src' unless it comes from the host an original
datagram that triggered this error message was destined for.
PR: 20712
Reviewed by: brian, Charles Mott <cmott@scientech.com>
When this happens, we know for sure that the packet data was not
received by the peer. Therefore, back out any advancing of the
transmit sequence number so that we send the same data the next
time we transmit a packet, avoiding a guaranteed missed packet and
its resulting TCP transmit slowdown.
In most systems ip_output() probably never returns an error, and
so this problem is never seen. However, it is more likely to occur
with device drivers having short output queues (causing ENOBUFS to
be returned when they are full), not to mention low memory situations.
Moreover, because of this problem writers of slow devices were
required to make an unfortunate choice between (a) having a relatively
short output queue (with low latency but low TCP bandwidth because
of this problem) or (b) a long output queue (with high latency and
high TCP bandwidth). In my particular application (ISDN) it took
an output queue equal to ~5 seconds of transmission to avoid ENOBUFS.
A more reasonable output queue of 0.5 seconds resulted in only about
50% TCP throughput. With this patch full throughput was restored in
the latter case.
Reviewed by: freebsd-net
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.
Reviewed by: Peter Wemm,asmodai
reply if the requesting machine isn't on the interface we believe
it should be. Prevents arp wars when you plug cables in the wrong
way around.
PR: 9848
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
Not objected to by: wollman
- Multiple PPTP clients behind NAT to the same or different servers.
- Single PPTP server behind NAT -- you just need to redirect TCP
port 1723 to a local machine. Multiple servers behind NAT is
possible but would require a simple API change.
- No API changes!
For more information on how this works see comments at the start of
the alias_pptp.c.
PacketAliasPptp() is no longer necessary and will be removed soon.
Submitted by: Erik Salander <erik@whistle.com>
Reviewed by: ru
Rewritten by: ru
Reviewed by: Erik Salander <erik@whistle.com>
accept filters are now loadable as well as able to be compiled into
the kernel.
two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)
Reviewed by: jmg
It does mean that it is now possible to run passive-mode FTP
server behind NAT.
- SECURITY: FTP aliasing engine now ensures that:
o the segment preceding a PORT/227 segment terminates with a \r\n;
o the IP address in the PORT/227 matches the source IP address of
the packet;
o the port number in the PORT command or 277 reply is greater than
or equal to 1024.
Submitted by: Erik Salander <erik@whistle.com>
Reviewed by: ru
It also squashes 99% of packet kiddie synflood orgies. For example, to
rate syn packets without MSS,
ipfw pipe 10 config 56Kbit/s queue 10Packets
ipfw add pipe 10 tcp from any to any in setup tcpoptions !mss
Submitted by: Richard A. Steenbergen <ras@e-gerbil.net>
a mbuf, it may return without setting any timers. If no more data is
scheduled to be transmitted (this was a FIN) the system will sit in
LAST_ACK state forever.
Thus, when mbuf allocation fails, set the retransmit timer if neither
the retransmit or persist timer is already pending.
Problem discovered by: Mike Silbersack (silby@silby.com)
Pushed for a fix by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: jayanth
down as a result of a reset. Returning EINVAL in that case makes no
sense at all and just confuses people as to what happened. It could be
argued that we should save the original address somewhere so that
getsockname() etc can tell us what it used to be so we know where the
problem connection attempts are coming from.
integer expression. Otherwise the sizeof() call will force the expression
to be evaluated as unsigned, which is not the intended behavior.
Obtained from: NetBSD (in a different form)
code retransmitting data from the wrong offset.
As a footnote, the newreno code was partially derived from NetBSD
and Tom Henderson <tomh@cs.berkeley.edu>
of the individual drivers and into the common routine ether_input().
Also, remove the (incomplete) hack for matching ethernet headers
in the ip_fw code.
The good news: net result of 1016 lines removed, and this should make
bridging now work with *all* Ethernet drivers.
The bad news: it's nearly impossible to test every driver, especially
for bridging, and I was unable to get much testing help on the mailing
lists.
Reviewed by: freebsd-net
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".
Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
calling in_pcbbind so that in_pcbbind sees a valid address if no
address was specified (since divert sockets ignore them).
PR: 17552
Reviewed by: Brian