IPsec change in r185366 only differed in two additonal IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.
Reviewed by: rwatson (as part of a larger changset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
Ignoring different names because of macros (in6pcb, in6p_sp) and
inp vs. in6p variable name both functions were entirely identical.
Reviewed by: rwatson (as part of a larger changeset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrappers in 7 to keep the symbols.
whitespace) macros from p4/vimage branch.
Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.
De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
for virtualization.
Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.
Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
already (but probably had been way above as the code was there twice)
and describe what was last changed in rev. 1.199 there (which now is
in sync with in6_src.c r184096).
Pointed at by: mlaier
MFC after: 2 mmonths
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.
Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred
For the jail case we are already looping over the interface addresses
before falling back to the only IP address of a jail in case of no
match. This is in preparation for the upcoming multi-IPv4/v6/no-IP
jail patch this change was developed with initially.
This also changes the semantics of selecting the IP for processes within
a jail as it now uses the same logic as outside the jail (with additional
checks) but no longer is on a mutually exclusive code path.
Benchmarks had shown no difference at 95.0% confidence for neither the
plain nor the jail case (even with the additional overhead). See:
http://lists.freebsd.org/pipermail/freebsd-net/2008-September/019531.html
Inpsired by a patch from: Yahoo! (partially)
Tested by: latest multi-IP jail patch users (implictly)
Discussed with: rwatson (general things around this)
Reviewed by: mostly silence (feedback from bms)
Help with benchmarking from: kris
MFC after: 2 months
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
- removing 'const' qualifier from an input parameter to conform to the type
required by rw_assert();
- using in_addr->s_addr to retrive 32 bits address value.
Observed by: tinderbox
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from anay to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.
This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.
MFC after: 3 months
Tested by: kris (superset of committered patch)
IPPORT_EPHEMERALFIRST and IPPORT_EPHEMERALLAST with values
10000 and 65535 respectively.
The rationale behind is that it makes the attacker's life more
difficult if he/she wants to guess the ephemeral port range and
also lowers the probability of a port colision (described in
draft-ietf-tsvwg-port-randomization-01.txt).
While there, remove code duplication in in_pcbbind_setup().
Submitted by: Fernando Gont <fernando at gont.com.ar>
Approved by: njl (mentor)
Reviewed by: silby, bms
Discussed on: freebsd-net
free the MAC label on the inpcb before freeing the inpcb.
MFC after: 3 days
Submitted by: tanyong <tanyong at ercist dot iscas dot ac dot cn>,
zhouzhouyi
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
This commit includes only the kernel files, the rest of the files
will follow in a second commit.
Reviewed by: bz
Approved by: re
Supported by: Secure Computing
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.
This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.
The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html
Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.
This work was financially supported by another FreeBSD committer.
Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
protocol entry points using functions named proto_getsockaddr and
proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr.
While it's true that sockaddrs are allocated and set, the net effect is
to retrieve (get) the socket address or peer address from a socket, not
set it, so align names to that intent.
consistent with the naming of other structure field members, and
reducing improper grep matches. Clean up and comment structure
fields in structure definition.
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.
Reviewed by: ru
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)
I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)
Reviewed by: rwatson@, mohans@
MFC after: 1 week
into in_pcbdrop(). Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.
MFC after: 3 months
holding the inpcb lock is sufficient to prevent races in reading
the address and port, as both the inpcb lock and pcbinfo lock are
required to change the address/port.
Improve consistency of spelling in assertions about inp != NULL.
MFC after: 3 months
NULL. We currently do allow this to happen, but may want to remove that
possibility in the future. This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled. This will
be cleaned up in the future.
Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months
casts.
Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.
Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.
Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.
MFC after: 3 months
- in_pcbdetach(), which removes the link between an inpcb and its
socket.
- in_pcbfree(), which frees a detached pcb.
Unlike the previous in_pcbdetach(), neither of these functions will
attempt to conditionally free the socket, as they are responsible only
for managing in_pcb memory. Mirror these changes into in6_pcbdetach()
by breaking it into in6_pcbdetach() and in6_pcbfree().
While here, eliminate undesired checks for NULL inpcb pointers in
sockets, as we will now have as an invariant that sockets will always
have valid so_pcb pointers.
MFC after: 3 months
we have another PCB which is bound to 0.0.0.0. If a PCB has the
INP_IPV6 flag, then we set its cost higher than IPv4 only PCBs.
Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME
MFC after: 1 week
ip.portrange.last and there is the only port for that because:
a) it is not wise; b) it leads to a panic in the random ip port
allocation code. In general we need to disable ip port allocation
randomization if the last - first delta is ridiculous small.
PR: kern/79342
Spotted by: Anjali Kulkarni
Glanced at by: silby
MFC after: 2 weeks
loopback interface. Nobody have explained me sense of this check.
It breaks connect() system call to a destination address which is
loopback routed (e.g. blackholed).
Reviewed by: silence on net@
MFC after: 2 weeks
connection rates, which is causing problems for some users.
To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.
Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds. These thresholds may be tuned via sysctl.
Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.
MFC After: 20 seconds
(sorele()/sotryfree()):
- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.
- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.
This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.
RELENG_5_3 candidate.
MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>
Alice is too lazy to write a server application in PF-independent
manner. Therefore she knocks up the server using PF_INET6 only
and allows the IPv6 socket to accept mapped IPv4 as well. An evil
hacker known on IRC as cheshire_cat has an account in the same
system. He starts a process listening on the same port as used
by Alice's server, but in PF_INET. As a consequence, cheshire_cat
will distract all IPv4 traffic supposed to go to Alice's server.
Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections. After this
change, the above scenario will be impossible. In the same setting,
the user who attempts to start his server last will get EADDRINUSE.
Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.
This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code. It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.
MFC after: 1 month
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
do not pick up the first local ip address for the source
ip address, return ENETUNREACH instead.
Submitted by: Gleb Smirnoff
Reviewed by: -current (silence)
SOCK_LOCK(so):
- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.
- Assert socket lock in MAC entry point implementations.
- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.
reference count:
- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().
- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().
- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.
- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.
- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
unless it's in the closed or listening state (remote address
== INADDR_ANY).
If a TCP inpcb is in any other state, it's impossible to steal
its local port or use it for port theft. And if there are
both closed/listening and connected TCP inpcbs on the same
localIP:port couple, the call to in_pcblookup_local() will
find the former due to the design of that function.
No objections raised in: -net, -arch
MFC after: 1 month
implementation taken directly from OpenBSD.
I've resisted committing this for quite some time because of concern over
TIME_WAIT recycling breakage (sequential allocation ensures that there is a
long time before ports are recycled), but recent testing has shown me that
my fears were unwarranted.
to implement this mistake.
Fixed some nearby style bugs (initialization in declaration, misformatting
of this initialization, missing blank line after the declaration, and
comparision of the non-boolean result of the initialization with 0 using
"!". In KNF, "!" is not even used to compare booleans with 0).
zeroed. Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.
Reviewed by: sam (mentor)
Approved by: re (scottl)
rtalloc_ign() in in_pcbconnect_setup() before it is filled out.
Otherwise, stack junk would be left in sin_zero, which could
cause host routes to be ignored because they failed the comparison
in rn_match().
This should fix the wrong source address selection for connect() to
127.0.0.1, among other things.
Reviewed by: sam
Approved by: re (rwatson)
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.
It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.
tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.
It removes significant locking requirements from the tcp stack with
regard to the routing table.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
o correct a read-lock assert in in_pcblookup_local that should be
a write-lock assert (since time wait close cleanups may alter state)
Supported by: FreeBSD Foundation
it is marked as RTF_UP. This appears to fix a crash that was sometimes
triggered when dhclient(8) tried to send a packet after an interface
had been detatched.
Reviewed by: sam
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.
Tested by: nork
Obtained from: KAME
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.
- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.
This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).
Supported by: FreeBSD Foundation
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.
Other/related changes:
o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts
Notes:
1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)
OSes has probably caused more problems than it ever solved. Allow the
user to retire the old behavior by specifying their own privileged
range with,
net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1
net.inet.ip.portrange.reservedlo default = 0
Now you can run that webserver without ever needing root at all. Or
just imagine, an ftpd that can really drop privileges, rather than
just set the euid, and still do PORT data transfers from 20/tcp.
Two edge cases to note,
# sysctl net.inet.ip.portrange.reservedhigh=0
Opens all ports to everyone, and,
# sysctl net.inet.ip.portrange.reservedhigh=65535
Locks all network activity to root only (which could actually have
been achieved before with ipfw(8), but is somewhat more
complicated).
For those who stick to the old religion that 0-1023 belong to root and
root alone, don't touch the knobs (or even lock them by raising
securelevel(8)), and nothing changes.
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.
Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs
you still don't want to use the two together, but it's ok to have
them in the same kernel (the problem that initiated this bandaid
has long since been fixed)
in_pcbconnect() called in_pcbconnect_setup(). This version performs
all of the functions of in_pcbconnect() except for the final
committing of changes to the PCB. In the case of an EADDRINUSE error
it can also provide to the caller the PCB of the duplicate connection,
avoiding an extra in_pcblookup_hash() lookup in tcp_connect().
This change will allow the "temporary connect" hack in udp_output()
to be removed and is part of the preparation for adding the
IP_SENDSRCADDR control message.
Discussed on: -net
Approved by: re
called in_pcbbind_setup() that does everything except commit the
changes to the PCB. There should be no functional change here, but
in_pcbbind_setup() will be used by the soon-to-appear IP_SENDSRCADDR
control message implementation to check or allocate the source
address and port.
Discussed on: -net
Approved by: re
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).
As noted previously, don't use FAST_IPSEC with INET6 at the moment.
Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks
in6_v4mapsin6_sockaddr() which allocate the appropriate sockaddr_in*
structure and initialize it with the address and port information passed
as arguments. Use calls to these new functions to replace code that is
replicated multiple times in in_setsockaddr(), in_setpeeraddr(),
in6_setsockaddr(), in6_setpeeraddr(), in6_mapped_sockaddr(), and
in6_mapped_peeraddr(). Inline COMMON_END in tcp_usr_accept() so that
we can call in_sockaddr() with temporary copies of the address and port
after the PCB is unlocked.
Fix the lock violation in tcp6_usr_accept() (caused by calling MALLOC()
inside in6_mapped_peeraddr() while the PCB is locked) by changing
the implementation of tcp6_usr_accept() to match tcp_usr_accept().
Reviewed by: suz
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.
Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.
o Determine the lock strategy for each members in struct socket.
o Lock down the following members:
- so_count
- so_options
- so_linger
- so_state
o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:
- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()
Reviewed by: alfred
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
This increases the number of concurrent outgoing connections from ~4000
to ~16000. Other OSes (Solaris, OS X, NetBSD) and many other NAT
products have already made this change without ill effects, so we
should not run into any problems.
MFC after: 1 week
- Clear the cached destination before getting another cached route.
Otherwise, garbage in the padding space (which might be filled in if it was
used for IPv4) could annoy rtalloc.
Obtained from: KAME
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protection sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.
Reviewed by: jhb
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).
Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0
Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.
Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
making pcbs available to the outside world. otherwise, we will see
inpcb without ipsec security policy attached (-> panic() in ipsec.c).
Obtained from: KAME
MFC after: 3 days
RTF_DYNAMIC route, it got freed twice). I am not sure what was
the actual problem in 1992, but the current behavior is memory
leak if PCB holds a reference to a dynamically created/modified
routing table entry. (rt_refcnt>0 and we don't call rtfree().)
My test bed was:
1. Set net.inet.tcp.msl to a low value (for test purposes), e.g.,
5 seconds, to speed up the transition of TCP connection to a
"closed" state.
2. Add a network route which causes ICMP redirect from the gateway.
3. ping(8) host H that matches this route; this creates RTF_DYNAMIC
RTF_HOST route to H. (I was forced to use ICMP to cause gateway
to generate ICMP host redirect, because gateway in question is a
4.2-STABLE system vulnerable to a problem that was fixed later in
ip_icmp.c,v 1.39.2.6, and TCP packets with DF bit set were
triggering this bug.)
4. telnet(1) to H
5. Block access to H with ipfw(8)
6. Send something in telnet(1) session; this causes EPERM, followed
by an in_losing() call in a few seconds.
7. Delete ipfw(8) rule blocking access to H, and wait for TCP
connection moving to a CLOSED state; PCB is freed.
8. Delete host route to H.
9. Watch with netstat(1) that `rttrash' increased.
10. Repeat steps 3-9, and watch `rttrash' increases.
PR: kern/25421
MFC after: 2 weeks
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.
TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.
Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks
However, if the RTF_DELCLONE and RTF_WASCLONED condition passes, but the ref
count is > 1, we won't decrement the count at all. This could lead to
route entries never being deleted.
Here, we call rtfree() not only if the initial two conditions fail, but
also if the ref count is > 1 (and we therefore don't immediately delete
the route, but let rtfree() handle it).
This is an urgent MFC candidate. Thanks go to Mike Silbersack for the
fix, once again. :-)
Submitted by: Mike Silbersack <silby@silby.com>
For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.
Clean up some layering violations (access to tcp internals from in_pcb)
and 1.84 of src/sys/netinet/udp_usrreq.c
The changes broken down:
- remove 0 as a wildcard for addresses and port numbers in
src/sys/netinet/in_pcb.c:in_pcbnotify()
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
all sessions with the specific remote address.
- change
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
to use in_pcbnotifyall() to notify multiple sessions, instead of
using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
udp_ctlinput() and tcp_ctlinput(), so they will call
in_pcbnotifyall() to notify all sessions with the specific
remote address.
Approved by: jlemon
Inspired by: NetBSD
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.
Notes:
o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.
Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project
available, the error return should be EADDRNOTAVAIL rather than
EAGAIN.
PR: 14181
Submitted by: Dima Dorfman <dima@unixfreak.org>
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.
Reviewed by: Peter Wemm,asmodai
down as a result of a reset. Returning EINVAL in that case makes no
sense at all and just confuses people as to what happened. It could be
argued that we should save the original address somewhere so that
getsockname() etc can tell us what it used to be so we know where the
problem connection attempts are coming from.
Remove a bogus (redundant, just weird, etc.) key_freeso(so).
There are no consumers of it now, nor does it seem there
ever will be.
in6?_pcb.c:
Add an if (inp->in6?p_sp != NULL) before the call to
ipsec[46]_delete_pcbpolicy(inp). In low-memory conditions
this can cause a crash because in6?_sp can be NULL...
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
packet divert at kernel for IPv6/IPv4 translater daemon
This includes queue related patch submitted by jburkhol@home.com.
Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
for IPv6 yet)
With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.
would make a difference. However, my previous diff _did_ change the
behavior in some way (not necessarily break it), so I'm fixing it.
Found by: bde
Submitted by: bde
exit on errors.
If we don't, in_pcbrehash() is called without a preceeding
in_pcbinshash(), causing a crash.
There are apparently several conditions that could cause the crash;
PR misc/12256 is only one of these.
PR: misc/12256
This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no privisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1:
s/suser/suser_xxx/
2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.
3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/
The remaining suser_xxx() calls will be scrutinized and dealt with
later.
There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.
More changes to the suser() API will come along with the "jail" code.
- Don't bother checking for conflicting sockets if we're binding to a
multicast address.
- Don't return an error if we're binding to INADDR_ANY, the conflicting
socket is bound to INADDR_ANY, and the conflicting socket has SO_REUSEPORT
set.
PR: kern/7713
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for infomation about socket structures and use
it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
ftp://ftp.isi.edu/in-notes/iana/assignments/port-numbers
port numbers are divided into three ranges:
0 - 1023 Well Known Ports
1024 - 49151 Registered Ports
49152 - 65535 Dynamic and/or Private Ports
This patch changes the "local port range" from 40000-44999
to the range shown above (plus fix the comment in in_pcb.c).
WARNING: This may have an impact on firewall configurations!
PR: 5402
Reviewed by: phk
Submitted by: Stephen J. Roznowski <sjr@home.net>
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
fix PR#3618 weren't sufficient since malloc() can block - allowing the
net interrupts in and leading to the same problem mentioned in the
PR (a panic). The order of operations has been changed so that this
is no longer a problem.
Needs to be brought into the 2.2.x branch.
PR: 3618
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
in_setsockaddr and in_setpeeraddr.
Handle the case where the socket was disconnected before the network
interrupts were disabled.
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.
2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.
3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.
4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.
5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.
As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).
Discussed-with: wollman