explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.
This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.
MFC after: 3 months
Tested by: kris (superset of committered patch)
- Rename output routines tcp_gen_* -> tcp_output_*.
- Rename notification routines that turn in to no-ops in the absence of TOE
from tcp_gen_* -> tcp_offload_*.
- Fix some minor comment nits.
- Add a /* FALLTHROUGH */
Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack
the inpcb when there's an inpcb without associated timewait state, and
not unlocking when the inpcb has been freed. This avoids a kernel panic
when tcpdrop(8) is run on a socket in the TIMEWAIT state.
MFC after: 3 days
Reported by: Rako <rako29 at gmail dot com>
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
problems with the syncache, it produces a lot of console noise and has led
to quite a few false positive bug reports. It can be selectively
re-enabled when debugging specific problems by frobbing the same sysctl.
Discussed with: silby
Approved by: re (gnn)
- Reintegrate the ANSI C function declaration change
from tcp_timer.c rev 1.92
- Reorganize the tcpcb structure so that it has a single
pointer to the "tcp_timer" structure which contains all
of the tcp timer callouts. This change means that when
the single tcp timer change is reintegrated, tcpcb will
not change in size, and therefore the ABI between
netstat and the kernel will not change.
Neither of these changes should have any functional
impact.
Reviewed by: bmah, rrs
Approved by: re (bmah)
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change. This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).
In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI. The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.
This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.
Approved by: re (kensmith)
Submitted by: silby
Tested by: rwatson, kris
Problems reported by: peter, kris, others
projected_offset against isn_offset to account for
wrap around.
Reviewed by: gnn, kmacy, silby
Submitted by: yusheng.huang@bluecoat.com
Approved by: re
MFC: 3 days
be in ticks "for algorithm stability" when originally committed, it turns
out that it has a significant impact in timing out connections. When we
changed HZ from 100 to 1000, this had a big effect on reducing the time
before dropping connections.
To demonstrate, boot with kern.hz=100. ssh to a box on local ethernet
and establish a reliable round-trip-time (ie: type a few commands).
Then unplug the ethernet and press a key. Time how long it takes to
drop the connection.
The old behavior (with hz=100) caused the connection to typically drop
between 90 and 110 seconds of getting no response.
Now boot with kern.hz=1000 (default). The same test causes the ssh session
to drop after just 9-10 seconds. This is a big deal on a wifi connection.
With kern.hz=1000, change sysctl net.inet.tcp.rexmit_min from 3 to 30.
Note how it behaves the same as when HZ was 100. Also, note that when
booting with hz=100, net.inet.tcp.rexmit_min *used* to be 30.
This commit changes TCPTV_MIN to be scaled with hz. rexmit_min should
always be about 30. If you set hz to Really Slow(TM), there is a safety
feature to prevent a value of 0 being used.
This may be revised in the future, but for the time being, it restores the
old, pre-hz=1000 behavior, which is significantly less annoying.
As a workaround, to avoid rebooting or rebuilding a kernel, you can run
"sysctl net.inet.tcp.rexmit_min=30" and add "net.inet.tcp.rexmit_min=30"
to /etc/sysctl.conf. This is safe to run from 6.0 onwards.
Approved by: re (rwatson)
Reviewed by: andre, silby
sys.net.inet.tcp.log_debug = 1
It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.
It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug. Enabling of the former
also causes the latter to engage, but not vice versa.
Use consistent terminology in tcp log messages:
"ignored" means a segment contains invalid flags/information and
is dropped without changing state or issuing a reply.
"rejected" means a segments contains invalid flags/information but
is causing a reply (usually RST) and may cause a state change.
Approved by: re (rwatson)
This commit includes only the kernel files, the rest of the files
will follow in a second commit.
Reviewed by: bz
Approved by: re
Supported by: Secure Computing
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
o add the hex output of the th_flags field to the example log
line in comments
o simplify the log line length calculation and make it less
evil
o correct the test for the length panic; the line isn't on
the stack but malloc'ed
for use thoughout the tcp subsystem.
It is IPv4 and IPv6 aware creates a line in the following format:
"TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"
A "\n" is not included at the end. The caller is supposed to add
further information after the standard tcp log header.
The function returns a NUL terminated string which the caller has
to free(s, M_TCPLOG) after use. All memory allocation is done
with M_NOWAIT and the return value may be NULL in memory shortage
situations.
Either struct in_conninfo || (struct tcphdr && (struct ip || struct
ip6_hdr) have to be supplied.
Due to ip[6].h header inclusion limitations and ordering issues the
struct ip and struct ip6_hdr parameters have to be casted and passed
as void * pointers.
tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr,
void *ip6hdr)
Usage example:
struct ip *ip;
char *tcplog;
if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) {
log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
tcplog, __func__);
free(s, M_TCPLOG);
}
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:
tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()
tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()
This is a mechanical move with appropriate renames and making
them static if used only locally.
The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.
functions from their origininal place to their own files.
TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait from tcp_subr.c -> tcp_timewait.c
consistent with the naming of other structure field members, and
reducing improper grep matches. Clean up and comment structure
fields in structure definition.
directly to a merged model where only one callout, the next to fire,
is registered.
Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.
The single new callout is a mutex callout on inpcb simplifying the
locking a bit.
tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.
Reviewed by: rwatson (earlier version)
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.
Reviewed by: gnn, silby.
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
scale it to min(ephemeral port range / 2, maxsockets / 5) so that people
with large gobs of memory and/or large maxsockets settings will not
exhaust their entire ephemeral port range with sockets in the TIME_WAIT
state during periods of heavy load.
Those who wish to tweak the size of the TIME_WAIT zone can still do so with
net.inet.tcp.maxtcptw.
Reviewed by: glebius, ru
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).
Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).
o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly
Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.
Reviewed by: ru