With the second (and last) part of my previous Summer of Code work, we get:
-ipfw's in kernel nat
-redirect_* and LSNAT support
General information about nat syntax and some examples are available
in the ipfw (8) man page. The redirect and LSNAT syntax are identical
to natd, so please refer to natd (8) man page.
To enable in kernel nat in rc.conf, two options were added:
o firewall_nat_enable: equivalent to natd_enable
o firewall_nat_interface: equivalent to natd_interface
Remember to set net.inet.ip.fw.one_pass to 0, if you want the packet
to continue being checked by the firewall ruleset after being
(de)aliased.
NOTA BENE: due to some problems with libalias architecture, in kernel
nat won't work with TSO enabled nic, thus you have to disable TSO via
ifconfig (ifconfig foo0 -tso).
Approved by: glebius (mentor)
access plus timers. This makes the code
more portable and able to change out the
mbuf or timer system used more easily ;-)
b) removal of all use of pkt-hdr's until only
the places we need them (before ip_output routines).
c) remove a bunch of code not needed due to <b> aka
worrying about pkthdr's :-)
d) There was one last reorder problem it looks where
if a restart occur's and we release and relock (at
the point where we setup our alias vtag) we would
end up possibly getting the wrong TSN in place. The
code that fixed the TSN's just needed to be shifted
around BEFORE the release of the lock.. also code that
set the state (since this also could contribute).
Approved by: gnn
o fixed a comment
o made in kernel libalias a bit less verbose (disabled automatic
logging everytime a new link is added or deleted)
Approved by: glebius (mentor)
2) Fix all "magic numbers" to be constants.
3) A collision case that would generate two associations to
the same peer due to a missing lock is fixed.
4) Added tracking of where timers are stopped.
Approved by: gnn
kernel. This LOR snuck in with some of the recent syncache changes. To
fix this, the inpcb handling was changed:
- Hang a MAC label off the syncache object
- When the syncache entry is initially created, we pickup the PCB lock
is held because we extract information from it while initializing the
syncache entry. While we do this, copy the MAC label associated with
the PCB and use it for the syncache entry.
- When the packet is transmitted, copy the label from the syncache entry
to the mbuf so it can be processed by security policies which analyze
mbuf labels.
This change required that the MAC framework be extended to support the
label copy operations from the PCB to the syncache entry, and then from
the syncache entry to the mbuf.
These functions really should be referencing the syncache structure instead
of the label. However, due to some of the complexities associated with
exposing this syncache structure we operate directly on it's label pointer.
This should be OK since we aren't making any access control decisions within
this code directly, we are merely allocating and copying label storage so
we can properly initialize mbuf labels for any packets the syncache code
might create.
This also has a nice side effect of caching. Prior to this change, the
PCB would be looked up/locked for each packet transmitted. Now the label
is cached at the time the syncache entry is initialized.
Submitted by: andre [1]
Discussed with: rwatson
[1] andre submitted the tcp_syncache.c changes
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.
This is the "+ one more change" missed in the original commit.
Noticed by: tinderbox
Pointy hat to: me (#1)
In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.
Fixing the IP accounting issue, if we plan to do so, needs to be better
thought out; the 'fix' introduces a hash lookup and a possible kernel panic.
Reported by: Mark Tinguely
marked INP_DROPPED or INP_TIMEWAIT:
o return ECONNRESET instead of EINVAL for close, disconnect, shutdown,
rcvd, rcvoob, and send operations
o return ECONNABORTED instead of EINVAL for accept
These changes should reduce confusion in applications since EINVAL is
normally interpreted to mean an invalid file descriptor. This change
does not conflict with POSIX or other standards I checked. The return
of EINVAL has always been possible but rare; it's become more common
with recent changes to the socket/inpcb handling and with finer-grained
locking and preemption.
Note: there are other instances of EINVAL for this state that were
left unchanged; they should be reviewed.
Reviewed by: rwatson, andre, ru
MFC after: 1 month
We are not yet aware of the protocol internals but this way
SCTP traffic over v6 will not be discarded.
Reported by: Peter Lei via rrs
Tested by: Peter Lei <peterlei cisco.com>
not being aquired. This meant that when we cleanup
the outbound we may have one in transit to be
added with the old sequence number. This is bad
since then we loose a message :(
Also the report_outbound needed to have the right
lock when its called which it did not.. I added
the lock with of course a flag since we want to
have the lock before we call it in the restart
case.
This also fixed the FIX ME case where, in the cookie
collision case, we mark for retransmit any that
were bundled with the cookie that was dropped.
This also means changes to the output routine
so we can assure getting the COOKIE-ACK sent
BEFORE we retransmit the Data.
Approved by: gnn
a colliding INIT. This if fine except when we have
data outstanding... we basically reset it to the
previous value it was.. so then we end up assigning
the same TSN to two different data chunks.
This patch:
1) Finds a missing lock for when we change the stream
numbers during COOKIE and INIT-ACK processing.. we
were NOT locking the send_buffer.. which COULD cause
problems (found by inspection looking for <2>)
2) Fixes a case during a colliding INIT where we incorrectly
reset the sending Sequence thus in some cases duplicately
assigning a TSN.
3) Additional enhancments to logging so we can see strm/tsn in
the receiver AND new tracking to watch what the sender
is doing with TSN and STRM seq's.
Approved by: gnn
We were calling select_a_tag() inside sctp_send_initate_ack().
During collision cases we have a stcb and thus a SCTP_LOCK. When
we call select_a_tag it (below it) locks the INFO lock. We now
1) pre-select the nonce-tie-tags in sctputil.c during setup of
a tcb.
2) In the other case where we have to select tags, we unlock after
incr the ref cnt (so assoc won't go away0 and then do the
tag selection followed by a relock and decr the refcnt.
Approved by: gnn
reset comes in we need to calculate the length and
therefore the number of listed streams (if any) based
on the TLV type. Otherwise if we get a retran we could
in theory panic by sending a notification to a user with
a incorrect list and thus no memory listing the streams.
Found in IOS by devtest :-)
Approved by: gnn
copy's were incorrect and so was the locking.
-A bug was also found that would create a race and
panic when an abort arrived on a socket being read
from.
-Also fix the reader to get MSG_TRUNC when a partial
delivery is aborted.
-Also addresses a couple of coverity caught error path
memory leaks and a couple of other valid complaints
Approved by: gnn
patch was prepared and committed to priv(9) calls. Add XXX comments
as, in each case, the semantics appear to differ from the TCP/UDP
versions of the calls with respect to jail, and because cr_canseecred()
is not used to validate the query.
Obtained from: TrustedBSD Project
to wakeup on close of the sender. It basically moves
the return (when the asoc has a reader/writer) further
down and gets the wakeup and assoc appending (of the
PD-API event) moved up before the return. It also
moves the flag set right before the return so we can
assure only once adding the PD-API events.
Approved by: gnn
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
This also moves two 16 bit int's to become 32 bit
values so we do not have to use atomic_add_16.
Most of the changes are %p, casts and other various
nasty's that were in the orignal code base. With this
commit my machine will now do a build universe.. however
I as yet have not tested on a 64bit machine .. it may not work :-(
inserted a few to the new files.. but I falied to
add the #include <sys/cdef.h>
Which causes a compile error.. sorry about that... got it
now :-)
Approved by:gnn
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.comtuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0
So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.
I will see about comitting some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)
There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but thats another beast too..
If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)
Reviewed by: gnn
Approved by: gnn
- Pay respect to net.isr.direct: use netisr_dispatch() instead of ip_input()
Reviewed by: glebius, rwatson
- purge_flow_set():
- Do not leak memory while purging queues which are not bound to pipe.
- style(9) cleanup
MFC after: 2 months
net.inet.ip.dummynet.curr_time
net.inet.ip.dummynet.searches
net.inet.ip.dummynet.search_steps
to SYSCTL_LONG nodes. It will prevent frequent wrap around on 64bit archs.
- Implement simple mechanics for dummynet(4) internal time correction.
Under certain circumstances (system high load, dummynet lock contention, etc)
dummynet's tick counter can be significantly slower than it should be.
(I've observed up to 25% difference on one of my production servers).
Since this counter used for packet scheduling, it's accuracy is vital for
precise bandwidth limitation.
Introduce new sysctl nodes:
net.inet.ip.dummynet.
tick_lost - number of ticks coalesced by taskqueue thread.
tick_adjustment - number of time corrections done.
tick_diff - adjusted vs non-adjusted tick counter difference
tick_delta - last vs 'standard' tick differnece (usec).
tick_delta_sum - accumulated (and not corrected yet) time
difference (usec).
Reviewed by: glebius
MFC after: 2 month
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
entries' by src:port and dst:port pairs. IPv6 part is non-functional
as ``limit'' does not support IPv6 flows.
PR: kern/103967
Submitted by: based on Bruce Campbell patch
MFC after: 1 month
in ip6_output. In case this fails handle the error directly and log it[1].
In addition permit CARP over v6 in ip_fw2.
PR: kern/98622
Similar patch by: suz
Discussed with: glebius [1]
Tested by: Paul.Dekkers surfnet.nl, Philippe.Pegon crc.u-strasbg.fr
MFC after: 3 days
scale it to min(ephemeral port range / 2, maxsockets / 5) so that people
with large gobs of memory and/or large maxsockets settings will not
exhaust their entire ephemeral port range with sockets in the TIME_WAIT
state during periods of heavy load.
Those who wish to tweak the size of the TIME_WAIT zone can still do so with
net.inet.tcp.maxtcptw.
Reviewed by: glebius, ru
of its internal state to ignore the failed send and try again a bit later.
If the error is EPERM the packet got blocked by the local firewall and the
revert may cause the session to get stuck and retry indefinitely. This way
we treat it like a packet loss and let the retransmit timer and timeouts
do their work over time.
The correct behavior is to drop a connection that gets an EPERM error.
However this _may_ introduce some POLA problems and a two commit approach
was chosen.
Discussed with: glebius
PR: kern/25986
PR: kern/102653
the MROUTER is running, the system would panic as described in the PR.
The fix in the PR is a good start, however, the other state associated
with the multicast forwarding cache has to be freed in order to avoid
leaking memory and other possible panics.
More care and attention is needed in this area.
PR: kern/82882
MFC after: 1 week
goes away. Without this change, it leaks in_multi (and often ether_multi
state) if many clonable interfaces are created and destroyed in quick
succession.
The concept of this fix is borrowed from KAME. Detailed information about
this behaviour, as well as test cases, are available in the PR.
PR: kern/78227
MFC after: 1 week
With the first part of my previous Summer of Code work, we get:
-made libalias modular:
-support for 'particular' protocols (like ftp/irc/etcetc) is no more
hardcoded inside libalias, but it's available through external
modules loadable at runtime
-modules are available both in kernel (/boot/kernel/alias_*.ko) and
user land (/lib/libalias_*)
-protocols/applications modularized are: cuseeme, ftp, irc, nbt, pptp,
skinny and smedia
-added logging support for kernel side
-cleanup
After a buildworld, do a 'mergemaster -i' to install the file libalias.conf
in /etc or manually copy it.
During startup (and after every HUP signal) user land applications running
the new libalias will try to read a file in /etc called libalias.conf:
that file contains the list of modules to load.
User land applications affected by this commit are ppp and natd:
if libalias.conf is present in /etc you won't notice any difference.
The only kernel land bit affected by this commit is ng_nat:
if you are using ng_nat, and it doesn't correctly handle
ftp/irc/etcetc sessions anymore, remember to kldload
the correspondent module (i.e. kldload alias_ftp).
General information and details about the inner working are available
in the libalias man page under the section 'MODULAR ARCHITECTURE
(AND ipfw(4) SUPPORT)'.
NOTA BENE: this commit affects _ONLY_ libalias, ipfw in-kernel nat
support will be part of the next libalias-related commit.
Approved by: glebius
Reviewed by: glebius, ru
the VRRPv2 advertisements will originate from the wrong source address.
This only affects kernels compiled with MROUTING and after the MRT_INIT
ioctl() has been issued.
Set imo_multicast_vif in carp's softc to the invalid value -1 after it is
zeroed by softc allocation, to stop the ip_output() path looking up the
incorrect source address thinking a vif is set.
PR: kern/100532
Submitted by: Bohus Plucinsky
MFC after: 1 week
the header field for possible later IPSEC SPD lookup, even
when the kernel is built without 'options INET6'.
PR: kern/57760
MFC after: 1 week
Submitted by: Joachim Schueth
functionality:
- Remove a rwlock aquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for and stored per syncache buck row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.
This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.
The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.
A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.
Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005
exists to allow the mandatory access control policy to properly initialize
mbufs generated by the firewall. An example where this might happen is keep
alive packets, or ICMP error packets in response to other packets.
This takes care of kernel panics associated with un-initialize mbuf labels
when the firewall generates packets.
[1] I modified this patch from it's original version, the initial patch
introduced a number of entry points which were programmatically
equivalent. So I introduced only one. Instead, we should leverage
mac_create_mbuf_netlayer() which is used for similar situations,
an example being icmp_error()
This will minimize the impact associated with the MFC
Submitted by: mlaier [1]
MFC after: 1 week
This is a RELENG_6 candidate
validity of ro->ro_rt first. This prevents crashing on any non-normally
routed IP packet.
Coverity CID: 162 (incorrectly, it was re-introduced by previous commit)
support a network w/ split mtu's by assigning each host route the correct
mtu. an aspiring programmer could write a daemon to probe hosts and find
out if they support a larger mtu.
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).
Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).
TSO is only used if we are in a pure bulk sending state. The presence of
TCP-MD5, SACK retransmits, SACK advertizements, IPSEC and IP options prevent
using TSO. With TSO the TCP header is the same (except for the sequence number)
for all generated packets. This makes it impossible to transmit any options
which vary per generated segment or packet.
The length of TSO bursts is limited to TCP_MAXWIN.
The sysctl net.inet.tcp.tso globally controls the use of TSO and is enabled.
TSO enabled sends originating from tcp_output() have the CSUM_TCP and CSUM_TSO
flags set, m_pkthdr.csum_data filled with the header pseudo-checksum and
m_pkthdr.tso_segsz set to the segment size (net payload size, not counting
IP+TCP headers or TCP options).
IPv6 currently lacks a pseudo-header checksum function and thus doesn't support
TSO yet.
Tested by: Jack Vogel <jfvogel-at-gmail.com>
Sponsored by: TCP/IP Optimization Fundraise 2005
o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly
Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005
and skip over the normal IP processing.
Add a supporting function ifa_ifwithbroadaddr() to verify and validate the
supplied subnet broadcast address.
PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.
Reviewed by: ru
for example:
fwd tablearg ip from any to table(1)
where table 1 has entries of the form:
1.1.1.0/24 10.2.3.4
208.23.2.0/24 router2
This allows trivial implementation of a secondary routing table implemented
in the firewall layer.
I expect more work (under discussion with Glebius) to follow this to clean
up some of the messy parts of ipfw related to tables.
Reviewed by: Glebius
MFC after: 1 month
in older versions of FreeBSD. This option is pointless as it is needed in just
about every interesting usage of forward that I have ever seen. It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x or 7.x
Reviewed by: glebius
MFC after: 1 week
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.
Longer term we may want to kill off if_name() entierly since all modern
BSDs have if_xname variables rendering it unnecessicary.
tcp_twstart(), but not to the other, tcp_detach(), as the socket is
already being torn down and therefore there are no listeners. This avoids
a panic if kqueue state is registered on the socket at close(), and
eliminates to XXX comments. There is one case remaining in which
tcp_discardcb() reaches up to the socket layer as part of the TCP host
cache, which would be good to avoid.
Reported by: Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>