data structures pick up security and synchronization primitives, it
becomes increasingly desirable not to arbitrarily export them via
include files to userland, as the userland applications pick up new
#include dependencies.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
net.inet.tcp.rexmit_min (default 3 ticks equiv)
This sysctl is the retransmit timer RTO minimum,
specified in milliseconds. This value is
designed for algorithmic stability only.
net.inet.tcp.rexmit_slop (default 200ms)
This sysctl is the retransmit timer RTO slop
which is added to every retransmit timeout and
is designed to handle protocol stack overheads
and delayed ack issues.
Note that the *original* code applied a 1-second
RTO minimum but never applied real slop to the RTO
calculation, so any RTO calculation over one second
would have no slop and thus not account for
protocol stack overheads (TCP timestamps are not
a measure of protocol turnaround!). Essentially,
the original code made the RTO calculation almost
completely irrelevant.
Please note that the 200ms slop is debateable.
This commit is not meant to be a line in the sand,
and if the community winds up deciding that increasing
it is the correct solution then it's easy to do.
Note that larger values will destroy performance
on lossy networks while smaller values may result in
a greater number of unnecessary retransmits.
one second but it badly breaks throughput on networks with minor packet
loss.
Complaints by: at least two people tracked down to this.
MFC after: 3 days
just because you leave your session idle.
Also, put in a fix for 64-bit architectures (to be revised).
In detail:
ip_fw.h
* Reorder fields in struct ip_fw to avoid alignment problems on
64-bit machines. This only masks the problem, I am still not
sure whether I am doing something wrong in the code or there
is a problem elsewhere (e.g. different aligmnent of structures
between userland and kernel because of pragmas etc.)
* added fields in dyn_rule to store ack numbers, so we can
generate keepalives when the dynamic rule is about to expire
ip_fw2.c
* use a local function, send_pkt(), to generate TCP RST for Reset rules;
* save about 250 bytes by cleaning up the various snprintf()
in ipfw_log() ...
* ... and use twice as many bytes to implement keepalives
(this seems to be working, but i have not tested it extensively).
Keepalives are generated once every 5 seconds for the last 20 seconds
of the lifetime of a dynamic rule for an established TCP flow. The
packets are sent to both sides, so if at least one of the endpoints
is responding, the timeout is refreshed and the rule will not expire.
You can disable this feature with
sysctl net.inet.ip.fw.dyn_keepalive=0
(the default is 1, to have them enabled).
MFC after: 1 day
(just kidding... I will supply an updated version of ipfw2 for
RELENG_4 tomorrow).
This was always broken in HEAD (the offending statement was introduced
in rev. 1.123 for HEAD, while RELENG_4 included this fix (in rev.
1.99.2.12 for RELENG_4) and I inadvertently deleted it in 1.99.2.30.
So I am also restoring these two lines in RELENG_4 now.
We might need another few things from 1.99.2.30.
no punch_fw was used.
Fix another couple of bugs which prevented rules from being
installed properly.
On passing, use IPFW2 instead of NEW_IPFW to compile the new code,
and slightly simplify the instruction generation code.
Following Darren's suggestion, make Dijkstra happy and rewrite the
ipfw_chk() main loop removing a lot of goto's and using instead a
variable to store match status.
Add a lot of comments to explain what instructions are supposed to
do and how -- this should ease auditing of the code and make people
more confident with it.
In terms of code size: the entire file takes about 12700 bytes of text,
about 3K of which are for the main function, ipfw_chk(), and 2K (ouch!)
for ipfw_log().
now it should support all the instructions of the old ipfw.
Fix some bugs in the user interface, /sbin/ipfw.
Please check this code against your rulesets, so i can fix the
remaining bugs (if any, i think they will be mostly in /sbin/ipfw).
Once we have done a bit of testing, this code is ready to be MFC'ed,
together with a bunch of other changes (glue to ipfw, and also the
removal of some global variables) which have been in -current for
a couple of weeks now.
MFC after: 7 days
so that, if we recieve a ICMP "time to live exceeded in transit",
(type 11, code 0) for a TCP connection on SYN-SENT state, close
the connection.
MFC after: 2 weeks
syncache_respond(A), ip_output(), ip_input(), tcp_input(), syncache_badack(B)
Which winds up deleting a different entry from the syncache. Handle
this by not utilizing the next entry in the timer chain until after
syncache_respond() completes. The case of A == B should not be possible.
Problem found by: Don Bowman <don@sandvine.com>
This code makes use of variable-size kernel representation of rules
(exactly the same concept of BPF instructions, as used in the BSDI's
firewall), which makes firewall operation a lot faster, and the
code more readable and easier to extend and debug.
The interface with the rest of the system is unchanged, as witnessed
by this commit. The only extra kernel files that I am touching
are if_fw.h and ip_dummynet.c, which is quite tied to ipfw. In
userland I only had to touch those programs which manipulate the
internal representation of firewall rules).
The code is almost entirely new (and I believe I have written the
vast majority of those sections which were taken from the former
ip_fw.c), so rather than modifying the old ip_fw.c I decided to
create a new file, sys/netinet/ip_fw2.c . Same for the user
interface, which is in sbin/ipfw/ipfw2.c (it still compiles to
/sbin/ipfw). The old files are still there, and will be removed
in due time.
I have not renamed the header file because it would have required
touching a one-line change to a number of kernel files.
In terms of user interface, the new "ipfw" is supposed to accepts
the old syntax for ipfw rules (and produce the same output with
"ipfw show". Only a couple of the old options (out of some 30 of
them) has not been implemented, but they will be soon.
On the other hand, the new code has some very powerful extensions.
First, you can put "or" connectives between match fields (and soon
also between options), and write things like
ipfw add allow ip from { 1.2.3.4/27 or 5.6.7.8/30 } 10-23,25,1024-3000 to any
This should make rulesets slightly more compact (and lines longer!),
by condensing 2 or more of the old rules into single ones.
Also, as an example of how easy the rules can be extended, I have
implemented an 'address set' match pattern, where you can specify
an IP address in a format like this:
10.20.30.0/26{18,44,33,22,9}
which will match the set of hosts listed in braces belonging to the
subnet 10.20.30.0/26 . The match is done using a bitmap, so it is
essentially a constant time operation requiring a handful of CPU
instructions (and a very small amount of memmory -- for a full /24
subnet, the instruction only consumes 40 bytes).
Again, in this commit I have focused on functionality and tried
to minimize changes to the other parts of the system. Some performance
improvement can be achieved with minor changes to the interface of
ip_fw_chk_t. This will be done later when this code is settled.
The code is meant to compile unmodified on RELENG_4 (once the
PACKET_TAG_* changes have been merged), for this reason
you will see #ifdef __FreeBSD_version in a couple of places.
This should minimize errors when (hopefully soon) it will be time
to do the MFC.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
Add XXX comments to mark places which need to be taken care of
if we want to remove this part of the kernel from Giant.
Add a comment on a potential performance problem with ip_forward()
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.
The variables removed by this change are:
ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet
Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().
On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.
Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.
option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.
NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.
* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary
* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.
* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).
MFC after: 10 days
Register the ISR early, but do not actually kick off the timer until we
see some activity. This still saves us from running the arp timers on
a system with no network cards.
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.
Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)
Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.
PR: 39141
MFC after: 2 weeks
This time, make sure that ipv4 specific code (aka all of the above)
is only run in the ipv4 case.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.
o Determine the lock strategy for each members in struct socket.
o Lock down the following members:
- so_count
- so_options
- so_linger
- so_state
o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:
- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()
Reviewed by: alfred
results in the syncache entry being turned into a socket. While it's
not used in the main tree, this is required in the MAC tree so that
labels can be propagated from the mbuf to the socket. This is also
useful if you're doing things like transparent IP connection hijacking
and you want to use the syncache/cookie mechanism, but we won't go
there.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Ipfw processing of frames at layer 2 can be enabled by the sysctl variable
net.link.ether.ipfw=1
Consider this feature experimental, because right now, the firewall
is invoked in the places indicated below, and controlled by the
sysctl variables listed on the right. As a consequence, a packet
can be filtered from 1 to 4 times depending on the path it follows,
which might make a ruleset a bit hard to follow.
I will add an ipfw option to tell if we want a given rule to apply
to ether_demux() and ether_output_frame(), but we have run out of
flags in the struct ip_fw so i need to think a bit on how to implement
this.
to upper layers
| |
+----------->-----------+
^ V
[ip_input] [ip_output] net.inet.ip.fw.enable=1
| |
^ V
[ether_demux] [ether_output_frame] net.link.ether.ipfw=1
| |
+->- [bdg_forward]-->---+ net.link.ether.bridge_ipfw=1
^ V
| |
to devices
bridged packets only, soon to come also for packets on ordinary
ether_input() and ether_output() paths. The syntax is
ipfw add <action> MAC dst src type
where dst and src can be "any" or a MAC address optionallyfollowed
by a mask, e.g.
10:20:30:40:50
10:20:30:40:50/32
10:20:30:40:50&ff:ff:ff:f0:ff:0f
and type can be a single ethernet type, a range, or a type followed by
a mask (values are always in hexadecimal) e.g.
0800
0800-0806
0800/8
0800&03ff
Note, I am still uncertain on what is the best format for inputting
these values, having the values in hexadecimal is convenient in most
cases but can be confusing sometimes. Suggestions welcome.
Implement suggestion from PR 37778 to allow "not me" on destination
and source IP. The code in the PR was slightly wrong and interfered
with the normal handling of IP addresses. This version hopefully is
correct.
Minor cleanup of the code, in some places moving the indentation to 4
spaces because the code was becoming too deep. Eventually, in a
separate commit, I will move the whole file to 4 space indent.
were totally useless and have been removed.
ip_input.c, ip_output.c:
Properly initialize the "ip" pointer in case the firewall does an
m_pullup() on the packet.
Remove some debugging code forgotten long ago.
ip_fw.[ch], bridge.c:
Prepare the grounds for matching MAC header fields in bridged packets,
so we can have 'etherfw' functionality without a lot of kernel and
userland bloat.
field. This returns the sdl_data field to a variable-length field. More
importantly, this prevents a easily-reproduceable data-corruption bug when
the interface name plus the hardware address exceed the sdl_data field's
original 12 byte limit. However, token-ring interfaces may still overflow
the new sdl_data field's 46 byte limit if the interface name exceeds 6
characters (since 6 characters for interface name plus 6 for hardware
address plus 34 for source routing = the size of sdl_data). Further
refinements could overcome this limitation but would break binary
compatibility; this commit only addresses fixing the bug for
commonly-occuring cases without breaking binary compatibility with the
intention that the functionality can be MFC'ed to -stable.
See message ID's (both send to -arch):
20020421013332.F87395-100000@gateway.posi.net20020430181359.G11009-300000@gateway.posi.net
for a more thorough description of the bug addressed and how to
reproduce it.
Approved by: silence on -arch and -net
Sponsored by: NTT Multimedia Communications Labs
MFC after: 1 week
- Used mld_xxx and MLD_xxx instead of mld6_xxx and MLD6_xxx according
to the official defintions in rfc2292bis
(macro definitions for backward compatibility were provided)
- Changed the first member of mld_hdr{} from mld_hdr to mld_icmp6_hdr
to avoid name space conflict in C++
This change makes ports/net/pchar compilable again under -CURRENT.
Obtained from: KAME
Turn the sigio sx into a mutex.
Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.
In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.
more on how ipfw(8) deals with tiny fragments. While we're at it, add
a quick log message to even let people know we dropped a packet. (Note
that the second FINE POINT is somewhat redundant given the first, but
since the code is there, leave the docs for it.)
MFC after: 1 day
Requested by: bde
Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.
While I am here, sort include files alphabetically, where possible.
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().
sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.
sections for various standards. Conditionalize sections for various
standards. Use standards conforming spelling for types in the
sockaddr_in structure.
calling ioctl(SIOC[AS]IFADDR).
This allows the following:
ifconfig xx0 inet 1.2.3.1 netmask 0xffffff00
ifconfig xx0 inet 1.2.3.17 netmask 0xfffffff0 alias
ifconfig xx0 inet 1.2.3.25 netmask 0xfffffff8 alias
ifconfig xx0 inet 1.2.3.26 netmask 0xffffffff alias
but would (given the above) reject this:
ifconfig xx0 inet 1.2.3.27 netmask 0xfffffff8 alias
due to the conflicting netmasks. I would assert that it's wrong
to mask the EEXIST returned from rtinit() as in the above scenario, the
deletion of the 1.2.3.25 address will leave the 1.2.3.27 address
as unroutable as it was in the first place.
Offered for review on: -arch, -net
Discussed with: stephen macmanus <stephenm@bayarea.net>
MFC after: 3 weeks
This change allows bootp to work with more than one interface, at the
expense of some rather ``wrong'' looking code. I plan to MFC this in
place of luigi's recent #ifdef BOOTP stuff that was committed to this
file in -stable, as that's slightly more wrong that this is.
Offered for review on: -arch, -net
MFC after: 2 weeks
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
MI, not required to be a fixed size, and used in multiple headers.
This will grow in time, as more things move here from <sys/types.h>
and <machine/ansi.h>.
o Add missing type definitions (uint16_t and uint32_t) to
<arpa/inet.h> and <netinet/in.h>.
o Reduce pollution in <sys/types.h> by using `#if _FOO_T_DECLARED'
widgets to avoid including <sys/stdint.h>.
o Add some missing type definitions to <unistd.h> and note the ones
that still need to be added.
o Make use of <sys/_types.h> primitives in <grp.h> and <sys/types.h>.
Reviewed by: bde
Move the network code from using cr_cansee() to check whether a
socket is visible to a requesting credential to using a new
function, cr_canseesocket(), which accepts a subject credential
and object socket. Implement cr_canseesocket() so that it does a
prison check, a uid check, and add a comment where shortly a MAC
hook will go. This will allow MAC policies to seperately
instrument the visibility of sockets from the visibility of
processes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
pointer which will then result in the allocated route's reference
count never being decremented. Just flood ping the localhost and
watch refcnt of the 127.0.0.1 route with netstat(1).
Submitted by: jayanth
Back out ip_output.c,v 1.143 and ip_mroute.c,v 1.69 that allowed
ip_output() to be called with a NULL route pointer. The previous
paragraph shows why this was a bad idea in the first place.
MFC after: 0 days
This increases the number of concurrent outgoing connections from ~4000
to ~16000. Other OSes (Solaris, OS X, NetBSD) and many other NAT
products have already made this change without ill effects, so we
should not run into any problems.
MFC after: 1 week
to are about to expire. This prevents high packet rate flows from
experiencing packet drops at the sender following ARP cache entry
timeout.
PR: kern/25517
Reviewed by: luigi
MFC after: 7 days
for POSIX.1-2001 conformance.
o Add magic to <netinet/in.h> and <netinet6/in6.h> to prevent
redefining INET_ADDRSTRLEN and INET6_ADDRSTRLEN.
o Add a note about missing typedefs in <arpa/inet.h>.
o In i386's <machine/endian.h>, macros have some advantages over
inlines, so change some inlines to macros.
o In i386's <machine/endian.h>, ungarbage collect word_swap_int()
(previously __uint16_swap_uint32), it has some uses on i386's with
PDP endianness.
Submitted by: bde
o Move a comment up in <machine/endian.h> that was accidentially moved
down a few revisions ago.
o Reenable userland's use of optimized inline-asm versions of
byteorder(3) functions.
o Fix ordering of prototypes vs. redefinition of byteorder(3)
functions, so that the non-GCC (libc asm) case has proper
prototypes.
o Add proper prototypes for byteorder(3) functions in <sys/param.h>.
o Prevent redundant duplicate prototypes by making use of the
_BYTEORDER_PROTOTYPED define.
o Move the bswap16(), bswap32(), bswap64() C functions into MD space
for platforms in which asm versions don't exist. This significantly
reduces the complexity of some things at the cost of duplicate code.
Reviewed by: bde
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant sideaffect of removing some
duplicate code.
Reviewed by: rwatson
were destined for a broadcast IP address. All TCP packets with a
broadcast destination must be ignored. The system only ignored packets
that were _link-layer_ broadcasts or multicast. We need to check the
IP address too since it is quite possible for a broadcast IP address
to come in with a unicast link-layer address.
Note that the check existed prior to CSRG revision 7.35, but was
removed. This commit effectively backs out that nine-year-old change.
PR: misc/35022
so that after the first time we can follow the pointer instead
of having to scan the list.
This was the intended behaviour from day one.
PR: 34639
MFC-after: 3 days
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated marcros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.
Tested on: alpha, i386
Reviewed by: bde, jake, tmm
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
some time. _All_ packets, regardless of destination, were accepted by
the machine as if addressed to it.
Jump back to 'pass' processing for a teed packet instead of falling
through as if it was ours.
PR: kern/31130
Reviewed by: -net, luigi
MFC after: 2 weeks
- Clear the cached destination before getting another cached route.
Otherwise, garbage in the padding space (which might be filled in if it was
used for IPv4) could annoy rtalloc.
Obtained from: KAME
(We should be able to handle locally originated IP packets, and
these do not have m_pkthdr.rcvif set.)
PR: kern/32806, kern/33766
Reviewed by: luigi
Fix tested by: Maxim Konovalov <maxim@macomnet.ru>,
Erwin Lansing <erwin@lansing.dk>
The following steps are involved:
a) the IP options related to routing (LSRR and SSRR) are processed
as though the router were a host,
b) the other IP options are processed as usual only if the packet
is destined for the router; otherwise they are ignored.
PR: kern/23123
Discussed in: freebsd-hackers
An old route will be NULL at that point if a packet were initially
routed to an interface (using the IP_ROUTETOIF flag.)
Submitted by: Igor Timkin <ivt@gamma.ru>
All TCP ISNs that are sent out are valid cookies, which allows entries
in the syncache to be dropped and still have the ACK accepted later.
As all entries pass through the syncache, there is no sudden switchover
from cache -> cookies when the cache is full; instead, syncache entries
simply have a reduced lifetime. More details may be found in the
"Resisting DoS attacks with a SYN cache" paper in the Usenix BSDCon 2002
conference proceedings.
Sponsored by: DARPA, NAI Labs
Now that we've increased the size of our send / receive buffers, bursting
an entire window onto the network may cause congestion. As a result,
we will slow start beginning with a flightsize of 4 packets.
Problem reported by: Thomas Zenker <thz@Lennartz-electronic.de>
MFC after: 3 days
Easily exploitable by flood pinging the target
host over an interface with the IFF_NOARP flag
set (all you need to know is the target host's
MAC address).
MFC after: 0 days
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protection sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.
Reviewed by: jhb
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.
Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).
Some cleanup. Identify additonal issues in comments.
MFC after: 1 day
o Hide nonstandard functions and types in <netinet/in.h> when
_POSIX_SOURCE is defined.
o Add some missing types (required by POSIX.1-200x) to <netinet/in.h>.
o Restore vendor ID from Rev 1.1 in <netinet/in.h> and make use of new
__FBSDID() macro.
o Fix some miscellaneous issues in <arpa/inet.h>.
o Correct final argument for the inet_ntop() function (POSIX.1-200x).
o Get rid of the namespace pollution from <sys/types.h> in
<arpa/inet.h>.
Reviewed by: fenner
Partially submitted by: bde
interface address, blow the address away again before returning the
error.
In in_ifinit(), if we get an error from rtinit() and we've also got
a destination address, return the error rather than masking EEXISTS.
Failing to create a host route when configuring an interface should
be treated as an error.
received on an interface without an IP address, try to find a
non-loopback AF_INET address to use. If that fails, drop it.
Previously, we used the address at the top of the in_ifaddrhead list,
which didn't make much sense, and would cause a panic if there were no
AF_INET addresses configured on the system.
PR: 29337, 30524
Reviewed by: ru, jlemon
Obtained from: NetBSD
for passive mode data connections (PASV/EPSV -> 227/229). Well,
the actual punching happens a bit later, when the aliasing link
becomes fully specified.
Prodded by: Danny Carroll <dannycarroll@hotmail.com>
MFC after: 1 week
to be followed by nfsnodehashtbl, so bzeroing callouts beyond the end of
tcp_syncache soon caused a null pointer panic when nfsnodehashtbl was
accessed.
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).
o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.
Reviewed by: julian
Obtained from: TrustedBSD Project
"[...] and removes the hostcache code from standard kernels---the
code that depends on it is not going to happen any time soon,
I'm afraid."
Time to clean up.
called and ip_output() encounters an error and bails (i.e. host
unreachable), we will leak an mbuf. This is because the code calls
m_freem(m0) after jumping to the bad: label at the end of the function,
when it should be calling m_freem(m). (m0 is the original mbuf list
_without_ the options mbuf prepended.)
Obtained from: NetBSD
ifconfig, which expects the address returned by the SIOCGIFNETMASK ioctl
to have a valid sa_family. Similar changes may be necessary for IPv6.
While we're here, get rid of an unnecessary temp variable.
MFC after: 2 weeks
is calculated. this caused some trouble in the code which the ip header
is not modified. for example, inbound policy lookup failed.
Obtained from: KAME
MFC after: 1 week
Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).
Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0
Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.
Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().
Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.
Reviewed by: ps, billf
Obtained from: TrustedBSD Project
to send all its data, especially when the data is less than one MSS.
This fixes an issue where the stack was delaying the sending
of data, eventhough there was enough window to send all the data and
the sending of data was emptying the socket buffer.
Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp)
Submitted by: Jayanth Vijayaraghavan
kern.ipc.showallsockets is set to 0.
Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks
+ implement "limit" rules, which permit to limit the number of sessions
between certain host pairs (according to masks). These are a special
type of stateful rules, which might be of interest in some cases.
See the ipfw manpage for details.
+ merge the list pointers and ipfw rule descriptors in the kernel, so
the code is smaller, faster and more readable. This patch basically
consists in replacing "foo->rule->bar" with "rule->bar" all over
the place.
I have been willing to do this for ages!
MFC after: 1 week
not referenced in Stevens, and does not compile with g++.
There is an equivalent structure, struct ipoption in ip_var.h
which is actually used in various parts of the kernel, and also referenced
in Stevens.
Bill Fenner also says:
... if you want the trivia, struct ip_opts was introduced
in in.h SCCS revision 7.9, on 6/28/1990, by Mike Karels.
struct ipoption was introduced in ip_var.h SCCS revision 6.5,
on 9/16/1985, by... Mike Karels.
MFC-after: 3 days
NAT in extended passive mode if the server's public IP address was
different from the main NAT address. This caused a wrong aliasing
link to be created that did not route the incoming packets back to
the original IP address of the server.
natd -v -n pub0 -redirect_address localFTP publicFTP
Note that even if localFTP == publicFTP, one still needs to supply
the -redirect_address directive. It is needed as a helper because
extended passive mode's 229 reply does not contain the IP address.
MFC after: 1 week
and speed. No new functionality added (yet) apart from a bugfix.
MFC will occur in due time and probably in stages.
BUGFIX: fix a problem in old code which prevented reallocation of
the hash table for dynamic rules (there is a PR on this).
OTHER CHANGES: minor changes to the internal struct for static and dynamic rules.
Requires rebuild of ipfw binary.
Add comments to show how data structures are linked together.
(It probably makes no sense to keep the chain pointers separate
from actual rule descriptors. They will be hopefully merged soon.
keep a (sysctl-readable) counter for the number of static rules,
to speed up IP_FW_GET operations
initial support for a "grace time" for expired connections, so we
can set timeouts for closing connections to much shorter times.
merge zero_entry() and resetlog_entry(), they use basically the
same code.
clean up and reduce replication of code for removing rules,
both for readability and code size.
introduce a separate lifetime for dynamic UDP rules.
fix a problem in old code which prevented reallocation of
the hash table for dynamic rules (PR ...)
restructure dynamic rule descriptors
introduce some local variables to avoid multiple dereferencing of
pointer chains (reduces code size and hopefully increases speed).
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
new data is acknowledged, reset the dupacks to 0.
The problem was spotted when a connection had its send buffer full
because the congestion window was only 1 MSS and was not being incremented
because dupacks was not reset to 0.
Obtained from: Yahoo!
to the application as a RST would, this way we're compatible with the most
applications.
MFC candidate.
Submitted by: Scott Renfro <scott@renfro.org>
Reviewed by: Mike Silbersack <silby@silby.com>
about rules and dynamic rules. it later fills this buffer with these
rules.
it also takes the opporunity to compare the expiration of the dynamic
rules with the current time and either marks them for deletion or simply
charges the countdown.
unfortunatly it does this all (the sizing, the buffer copying, and the
expiration GC) with no spl protection whatsoever. it was possible for
the dynamic rule(s) to be ripped out from under the request before it
had completed, resulting in corrupt memory dereferencing.
Reviewed by: ps
MFC before: 4.4-RELEASE, hopefully.
In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.
Reviewed by: jesper
cdevsw entries have been for a long time.
Discover that we now have two version sof the same structure.
I will shoot one of them shortly when I figure out why someone thinks
they need it. (And I can prove they don't)
(netinet/ipprotosw.h should GO AWAY)
Avoid using parenthesis enclosure macros (.Pq and .Po/.Pc) with plain text.
Not only this slows down the mdoc(7) processing significantly, but it also
has an undesired (in this case) effect of disabling hyphenation within the
entire enclosed block.
making pcbs available to the outside world. otherwise, we will see
inpcb without ipsec security policy attached (-> panic() in ipsec.c).
Obtained from: KAME
MFC after: 3 days
- Use sysctl to export stats
- Use ip_encap.c's encapsulation support
- Update lkm to kld (is 6 years a record for a broken module?)
- Remove some unused cruft
This macro was supposed to only match local IP addresses of
interfaces, and all consumers of this macro assume this as
well. (See IP_MULTICAST_IF and IP_ADD_MEMBERSHIP socket
options in the ip(4) manpage.)
This fixes a major security breach in IPFW-based firewalls
where the `me' keyword would match the other end of a P2P
link.
PR: kern/28567
This should help us in nieve benchmark "tests".
It seems a wide number of people think 32k buffers would not cause major
issues, and is in fact in use by many other OS's at this time. The
receive buffers can be bumped higher as buffers are hardly used and several
research papers indicate that receive buffers rarely use much space at all.
Submitted by: Leo Bicknell <bicknell@ufp.org>
<20010713101107.B9559@ussenterprise.ufp.org>
Agreed to in principle by: dillon (at the 32k level)
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.
While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.
To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.
Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.
Reviewed by: jlemon
Tested by: numerous subscribers of -net