vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().
Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.
Reviewed by: ps, billf
Obtained from: TrustedBSD Project
kern.ipc.showallsockets is set to 0.
Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks
1) Allow the sending of more than one control message at a time
over a unix domain socket. This should cover the PR 29499.
2) This requires that unp_{ex,in}ternalize and unp_scan understand
mbufs with more than one control message at a time.
3) Internalize and externalize used to work on the mbuf in-place.
This made life quite complicated and the code for sizeof(int) <
sizeof(file *) could end up doing the wrong thing. The patch always
create a new mbuf/cluster now. This resulted in the change of the
prototype for the domain externalise function.
4) You can now send SCM_TIMESTAMP messages.
5) Always use CMSG_DATA(cm) to determine the start where the data
in unp_{ex,in}ternalize. It was using ((struct cmsghdr *)cm + 1)
in some places, which gives the wrong alignment on the alpha.
(NetBSD made this fix some time ago).
This results in an ABI change for discriptor passing and creds
passing on the alpha. (Probably on the IA64 and Spare ports too).
6) Fix userland programs to use CMSG_* macros too.
7) Be more careful about freeing mbufs containing (file *)s.
This is made possible by the prototype change of externalise.
PR: 29499
MFC after: 6 weeks
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
the socket buffer size, the receive is done in sections. After completing
a read, call pru_rcvd on the underlying protocol before blocking again.
This allows the the protocol to take appropriate action, such as
sending a TCP window update to the peer, if the window happened to
close because the socket buffer was filled. If the protocol is not
notified, a TCP transfer may stall until the remote end sends a window
probe.
allow the watermark to be passed in via the data field during the EV_ADD
operation.
Hook this up to the socket read/write filters; if specified, it overrides
the so_{rcv|snd}.sb_lowat values in the filter.
Inspired by: "Ronald F. Guilmette" <rfg@monkeys.com>
the current socket error in fflags. This may be useful for determining
why a connect() request fails.
Inspired by: "Jonathan Graehl" <jonathan@graehl.org>
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.
Notes:
o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.
Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.
* Fix a typo in a comment in mbuf.h
* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().
the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and
a u_long target to set it to. The whole thing is splnet().
This fixes a problem that jdp has been able to provoke.
1) while allocating a uidinfo struct malloc is called with M_WAITOK,
it's possible that while asleep another process by the same user
could have woken up earlier and inserted an entry into the uid
hash table. Having redundant entries causes inconsistancies that
we can't handle.
fix: do a non-waiting malloc, and if that fails then do a blocking
malloc, after waking up check that no one else has inserted an entry
for us already.
2) Because many checks for sbsize were done as "test then set" in a non
atomic manner it was possible to exceed the limits put up via races.
fix: instead of querying the count then setting, we just attempt to
set the count and leave it up to the function to return success or
failure.
3) The uidinfo code was inlining and repeating, lookups and insertions
and deletions needed to be in their own functions for clarity.
Reviewed by: green
accept filters are now loadable as well as able to be compiled into
the kernel.
two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)
Reviewed by: jmg
until the incoming connection has either data waiting or what looks like a
HTTP request header already in the socketbuffer. This ought to reduce
the context switch time and overhead for processing requests.
The initial idea and code for HTTPACCEPT came from Yahoo engineers and has
been cleaned up and a more lightweight DELAYACCEPT for non-http servers
has been added
Reviewed by: silence on hackers.
TCP/IP (v4) sockets, and routing sockets. Previously, interaction
with IPv6 was not well-defined, and might be inappropriate for some
environments. Similarly, sysctl MIB entries providing interface
information also give out only addresses from those protocol domains.
For the time being, this functionality is enabled by default, and
toggleable using the sysctl variable jail.socket_unixiproute_only.
In the future, protocol domains will be able to determine whether or
not they are ``jail aware''.
o Further limitations on process use of getpriority() and setpriority()
by jailed processes. Addresses problem described in kern/17878.
Reviewed by: phk, jmg
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.
PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>
means that running out of mbuf space isn't a panic anymore, and code
which runs out of network memory will sleep to wait for it.
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: green, wollman
for IPv6 yet)
With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.
This batch of changes compile to the same object files.
Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.
This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.
to either enqueue or free their mbuf chains, but tcp_usr_send() was
dropping them on the floor if the tcpcb/inpcb has been torn down in the
middle of a send/write attempt. This has been responsible for a wide
variety of mbuf leak patterns, ranging from slow gradual leakage to rather
rapid exhaustion. This has been a problem since before 2.2 was branched
and appears to have been fixed in rev 1.16 and lost in 1.23/1.28.
Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking
(extensively) into this on a live production 2.2.x system and that it
was the actual cause of the leak and looks like it fixes it. The machine
in question was loosing (from memory) about 150 mbufs per hour under
load and a change similar to this stopped it. (Don't blame Jayanth
for this patch though)
An alternative approach to this would be to recheck SS_CANTSENDMORE etc
inside the splnet() right before calling pru_send() after all the potential
sleeps, interrupts and delays have happened. However, this would mean
exposing knowledge of the tcp stack's reset handling and removal of the
pcb to the generic code. There are other things that call pru_send()
directly though.
Problem originally noted by: John Plevyak <jplevyak@inktomi.com>
This makes it possible to change the sysctl tree at runtime.
* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.
Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
code when examining their fix, which caused my code (in rev 1.52) to:
- panic("soaccept: !NOFDREF")
- fatal trap 12, with tracebacks going thru soclose and soaccept
where select(2) can return that a listening socket has a connected socket
queued, the connection is broken, and the user calls accept(2), which then
blocks because there are no connections queued.
Reviewed by: wollman
Obtained from: NetBSD
(ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.
Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.
This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.
Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.
PR: kern/7899
Reviewed by: bde, elvind
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
writes of size (100,208]+N*MCLBYTES.
The bug:
sosend() hands each mbuf off to the protocol output routine as soon as it
has copied it, in the hopes of increasing parallelism (see
http://www.kohala.com/~rstevens/vanj.88jul20.txt ). This works well for
TCP as long as the first mbuf handed off is at least the MSS. However,
when doing small writes (between MHLEN and MINCLSIZE), the transaction is
split into 2 small MBUF's and each is individually handed off to TCP.
TCP assumes that the first small mbuf is the whole transaction, so sends
a small packet. When the second small mbuf arrives, Nagle prevents TCP
from sending it so it must wait for a (potentially delayed) ACK. This
sends throughput down the toilet.
The workaround:
Set the "atomic" flag when we're doing small writes. The "atomic" flag
has two meanings:
1. Copy all of the data into a chain of mbufs before handing off to the
protocol.
2. Leave room for a datagram header in said mbuf chain.
TCP wants the first but doesn't want the second. However, the second
simply results in some memory wastage (but is why the workaround is a
hack and not a fix).
The real fix:
The real fix for this problem is to introduce something like a "requested
transfer size" variable in the socket->protocol interface. sosend()
would then accumulate an mbuf chain until it exceeded the "requested
transfer size". TCP could set it to the TCP MSS (note that the
current interface causes strange TCP behaviors when the MSS > MCLBYTES;
nobody notices because MCLBYTES > ethernet's MTU).
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for infomation about socket structures and use
it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
so_error is set, clear it before returning it. The behavior
introduced in 4.3-Reno (to not clear so_error) causes potentially
transient errors (e.g. ECONNREFUSED if the other end hasn't opened
its socket yet) to be permanent on connected datagram sockets that
are only used for writing.
(soreceive() clears so_error before returning it, as does
getsockopt(...,SO_ERROR,...).)
Submitted by: Van Jacobson <van@ee.lbl.gov>, via a comment in the vat sources.
Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.
A couple of finer points by: bde
detail is passed back and forwards). This mostly came from NetBSD, except
that our interfaces have changed a lot and this funciton is in a different
part of the kernel.
Obtained from: NetBSD
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
Specifically, don't allow a value < 1 for any of them (it doesn't make
sense), and don't let the low water mark be greater than the corresponding
high water mark.
Pre-Approved by: wollman
Obtained from: NetBSD
This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.
2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.
3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.
4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.
5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.
As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
all of the configurables and instrumentation related to
inter-process communication mechanisms. Some variables,
like mbuf statistics, are instrumented here for the first
time.
For mbuf statistics: also keep track of m_copym() and
m_pullup() failures, and provide for the user's inspection
the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.
(PR #1178).
Define a new SO_TIMESTAMP socket option for datagram sockets to return
packet-arrival timestamps as control information (PR #1179).
Submitted by: Louis Mamakos <loiue@TransSys.com>
the obsolete soqinsque and soqremque functions as well as collapsing
so_q0len and so_qlen into a single queue length of unaccepted connections.
Now the queue of unaccepted & complete connections is checked directly
for queued sockets. The new code should be functionally equivilent to
the old while being substantially faster - especially in cases where
large numbers of connections are often queued for accept (e.g. http).
(maximum size of a socket buffer) tunable.
Permit callers of listen(2) to specify a negative backlog, which
is translated into somaxconn. Previously, a negative backlog was
silently translated into 0.
argument is now more than just a single flag. (kern_malloc.c)
Used new M_KERNEL value for socket allocations that previous were
"M_NOWAIT". Note that this will change when we clean up the M_ namespace
mess.
Submitted by: John Dyson