Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
underneath #ifdef VIMAGE blocks.
This change introduces some churn in #include ordering and nesting
throughout the network stack and drivers but is not expected to cause
any additional issues.
In the next step this will allow us to instantiate the virtualization
container structures and switch from using global variables to their
"containerized" counterparts.
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
IPv6 socket by comparing a constant inp vflag.
This is expected to help to reduce extra locking.
Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
IPsec change in r185366 only differed in two additonal IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.
Reviewed by: rwatson (as part of a larger changset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
Ignoring different names because of macros (in6pcb, in6p_sp) and
inp vs. in6p variable name both functions were entirely identical.
Reviewed by: rwatson (as part of a larger changeset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrappers in 7 to keep the symbols.
whitespace) macros from p4/vimage branch.
Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.
De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
as in_pcbdetach() and we don't need the code twice.
Reviewed by: rwatson
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
for virtualization.
Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.
Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
-Improvement: panic() on INVARIANTS kernels if memory allocation
fails for a tagblock in sctp_add_vtag_to_timewait().
-Bugfix: Protect code in sctp_is_in_timewait() by
SCTP_INP_INFO_WLOCK/SCTP_INP_INFO_WUNLOCK.
-Cleanup: Get rid of unused variable now in sctp_init_asoc().
-Bugfix: Reuse the correct vtag in sctp_add_vtag_to_timewait().
-Cleanup: Get rid of unused constant SCTP_TIME_WAIT_SHORT
in sctp_constants.h.
-Improvement: Use all hash buckets of the vtag hash table.
-Cleanup: Get rid of then unused constant SCTP_STACK_VTAG_HASH_SIZE_A.
-Bugfix: Handle SHUTDOWN;SACK packet correctly.
-Bugfix: Last TSN in a gap ack block was not being "ack'd"
in the internal scoreboard.
Obtained from: (with help from Michael Tuexen)
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.
Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.
Reported by: kmacy
MFC after: 2 months (along with r182851)
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.
In case we return early and got a metricptr to pass the hostcache
info back to the caller we need to initialize the data to a defined
state (zero it) as tcp_hc_get() would do if there was no hit.
Without that the caller would check on random stack garbage which
could lead to undefined results.
This only affected tcp_mss() if there was no routing entry for the peer,
tcp_mtudisc() was not affected.
MFC after: 2 months (along with r182851)
This should fix q_time overflow, which happens after 2^32/(86400*hz) days of
uptime (~50days for hz = 1000).
q_time overflow cause following:
- traffic shaping may not work in 'fast' mode (not enabled by default).
- incorrect average queue length calculation in RED/GRED algorithm.
NB: due to ABI change this change is not applicable to stable.
PR: kern/128401
a) Need for EEOR mode to take the min of the socket buffer size and the
add more threshold, otherwise if you are so silly as to set a send
buf size less than the add-more you could block forever in eeor mode.
b) We were incorrectly using the sysctl vs the calculated value. This
causes us to block forever if the addmore theshold is larger than
then the socket buffer size.
- If we send EXACTLY the size left in the send buffer
and then send again, we end up with exactly 0 bytes and
don't hit the pre-block code to wait for more space.
- If we fall into the loop with our max_len == 0 (the bug
above) we then call in to copy out the data, setup the length
of the waiting to transmit data to 0 and call the mbuf copy routine
which 0 indicates copy all the data to the mbuf chain.. which it
does. This then leaves a "stuck" message on the stream queue with
its size exactly 0 bytes but all the data there and thus nothing
left in the uio structure. We then reach a stuck forever state
never being able to send data.
sooner to decomplicate locking and eliminate the need for a rather
chatty comment about why we have to handle the global lock in a
special way for the benefit of ipfw and pf cred rules.
MFC after: 3 days
- Consistently add parentheses to return statements.
- Use NULL instead of 0 when comparing pointers, also avoiding
unnecessary casts.
- Do not use pointers as booleans.
Reviewed by: rwatson (earlier version)
MFC after: 2 months
already (but probably had been way above as the code was there twice)
and describe what was last changed in rev. 1.199 there (which now is
in sync with in6_src.c r184096).
Pointed at by: mlaier
MFC after: 2 mmonths
ephemeral port allocation as implemented in netinet/in_pcb.c rev. 1.143
(initially from OpenBSD) and follow-up commits during the last four and
a half years including rev. 1.157, 1.162 and 1.199.
This now is relying on the same infrastructure as has been implemented
in in_pcb.c since rev. 1.199.
Reviewed by: silby, rpaulo, mlaier
MFC after: 2 months
be given when the user has enabled it). (Michael Tuexen)
- Sack Immediately was not being set properly on the actual chunk, it
was only put in the rcvd_flags which is incorrect. (Michael Tuexen)
- added an ifndef userspace to one of the already present macro's for
inet (Brad Penoff)
Obtained from: Michael Tuexen and Brad Penoff
MFC after: 4 weeks
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.
Reviewed by: rwatson
MFC after: 3 months (set timer; decide then)
netisr or ithread's socket buffer size limit is not the right limit to
use. Instead, pass NULL as the other two calls to sbreserve_locked()
in the TCP input path (tcp_mss()) do.
In practice, this is a no-op, as ithreads and the netisr run without a
process limit on socket buffer use, and a NULL thread pointer leads to
not using the process's limit, if any. However, if tcp_input() is
called in other contexts that do have limits, this may prevent the
incorrect limit from being used.
MFC after: 3 days
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.
Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred
For the jail case we are already looping over the interface addresses
before falling back to the only IP address of a jail in case of no
match. This is in preparation for the upcoming multi-IPv4/v6/no-IP
jail patch this change was developed with initially.
This also changes the semantics of selecting the IP for processes within
a jail as it now uses the same logic as outside the jail (with additional
checks) but no longer is on a mutually exclusive code path.
Benchmarks had shown no difference at 95.0% confidence for neither the
plain nor the jail case (even with the additional overhead). See:
http://lists.freebsd.org/pipermail/freebsd-net/2008-September/019531.html
Inpsired by a patch from: Yahoo! (partially)
Tested by: latest multi-IP jail patch users (implictly)
Discussed with: rwatson (general things around this)
Reviewed by: mostly silence (feedback from bms)
Help with benchmarking from: kris
MFC after: 2 months
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
in the transmit path, such as TCPS_TIMEWAIT, fail the credential
extraction immediately rather than acquiring locks and looking up
the inpcb on the global lists in order to reach the conclusion that
the credential extraction has failed.
This is more efficient, but more importantly, it avoids lock
recursion on the inpcbinfo, which is no longer allowed with rwlocks.
This appears to have been responsible for at least two reported
panics.
MFC after: 3 days
Reported by: ganbold
called without an inpcb pointer despite holding the tcbinfo global
lock, which lead to a deadlock or panic when ipfw tried to further
acquire it recursively.
Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days
unconditionally drop the tcbinfo lock (after all, we assert it lines
before), but call tcp_dropwithreset() under both inpcb and inpcbinfo
locks only if we pass in an tcpcb. Otherwise, if the pointer is NULL,
firewall code may later recurse the global tcbinfo lock trying to look
up an inpcb.
This is an instance where a layering violation leads not only
potentially to code reentrace and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden. When these locks were mutexes, they simply recursed.
Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days
rt_check() in its original form proved to be sufficient and
rt_check_fib() can go away (as can its evil twin in_rt_check()).
I believe this does NOT address the crashes people have been seeing
in rt_check.
MFC after: 1 week
the same way it has been implemented for IPv4.
Reviewed by: bms (skimmed)
Tested by: Nick Hilliard (nick netability.ie) (with more changes)
MFC after: 2 months
congestion window not being incremented, if cwnd > maxseg^2.
As suggested in RFC2581 increment the cwnd by 1 in this case.
See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.
Submitted by: Alana Huebner, Lawrence Stewart,
Grenville Armitage (caia.swin.edu.au)
Reviewed by: dwmalone, gnn, rpaulo
MFC After: 3 days
Payload Length) as set in tcpip_fillheaders().
ip6_output() will calculate it based of the length from the
mbuf packet header itself.
So initialize the value in tcpip_fillheaders() in correct
(network) byte order.
With the above change, to my reading, all places calling tcp_trace()
pass in the ip6 header via ipgen as serialized in the mbuf and with
ip6_plen in network byte order.
Thus convert the IPv6 payload length to host byte order before printing.
MFC after: 2 months
calls the latter.
Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.
This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.
PR: kern/118455
Reviewed by: silby (back in March)
MFC after: 2 months
SYSCTL_PROCs and check that the default mss for neither v4 nor
v6 goes below the minimum MSS constant (216).
This prevents people from shooting themselves in the foot.
PR: kern/118455 (remotely related)
Reviewed by: silby (as part of a larger patch in March)
MFC after: 2 months
This is different to the first one (as len gets updated between those
two) and would have caught various edge cases (read bugs) at a well
defined place I had been debugging the last months instead of
triggering (random) panics further down the call graph.
MFC after: 2 months
the default rule number but also the maximum rule number. User space
software such as ipfw and natd should be aware of its value. The
software that already includes ip_fw.h should use the defined value. All
other a expected to use sysctl (as discussed on net@).
MFC after: 5 days.
Discussed on: net@
translation. It turns out this is useful for applications which require
source port randomization for security (i.e. dns servers).
Discussed with: secteam
Requested by: mlaier
MFC after: 2 weeks
wind up with the incorrect checksum on the wire when transmitted via
devices that do checksum offloading.
PR: kern/119635
Reviewed by: rwatson
MFC after: 5 days
- Change it so that without INVARIANTs there are
no panics in SCTP.
- sctp_timer changes so that we have a recovery mechanism
when the sent list is out of order.
storage. We can safely remove the label copying operations since
M_MOVE_PKTHDR will move the mbuf tags (which contain MAC labels) to
the destination mbuf.
MFC after: 1 week
Discussed with: rwatson
we can be sure that it's valid.
In case we abort early free it again else put it into the syncache.
We need the cred in the syncache to be able to restrict what will be
exportet by the sysctl helper function syncache_pcblist() (to netstat)
within jails.
PR: kern/126493
Reviewed by: rwatson (earlier versions)
MFC after: 3 days
the IP multicast input code from the output path; we don't allow
reentrance of the input path from the IP output path, it must use the
netisr due to potential lock recursion.
MFC after: 3 days
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
into v4-only vs. v6-only inp_flags processing.
When ip6_savecontrol_v4() is called from ip6_savecontrol() we
were not passing back the **mp thus the information will be missing
in userland.
Istead of going with a *** as suggested in the PR we are returning
**mp now and passing in the v4only flag as a pointer argument.
PR: kern/126349
Reviewed by: rwatson, dwmalone
keyword. But it doesn't work. Two options.. make it no longer accept it,
or actually make it work.. I chose the 2nd..
Allow the tablearg to be used to specify a skipto destination.
This is actually a very powerful construct if used correctly, or a sink
of cpu cycles if used badly.
changes t teh man page will follow.