This is valueable for library code which needs to be able to find out
whether the current process is or *was* set[ug]id at some point in the
past, and may have a "tainted" execution environment. This is especially
a problem with the trend to immediately revoke privs at startup and regain
them for critical sections. One problem with this is that if a cracker
is able to compromise the program while it's still got a saved id, the
cracker can direct the program to regain the privs. Another problem is
that the user may be able to affect the program in some other way (eg:
setting resolver host aliases) and the library code needs to know when it
should disable these sorts of features.
Reviewed by: ache
Inspired by: OpenBSD (but with a different implementation)
that allows traditional BSD setuid/setgid behavior.
The only visible difference should be that a non-root setuid program
(eg: inn's "rnews" program) that is setuid to news, can completely
"become" uid news. (ie: setuid(geteuid()) This was allowed in
traditional 4.2/4.3BSD and is now "blessed" by Posix as a special
case of "appropriate privilige".
Also, be much more careful with the P_SUGID flag so that we can use it
for issetugid() - only set it if something changed.
Reviewed by: ache
vector except for the egid in groups[0]. There is a risk that programs
that come from SYSV/Linux that expect this to work and don't check for
error returns may accidently pass root's groups on to child processes.
We now do what is least suprising (to non BSD programs/programmers) in
this scenario, and nothing is changed for programs written with BSD groups
rules in mind.
Reviewed by: ache
to removing the connection from the queue. The problem here is that
falloc() may block and this would allow another process to accept the
connection instead. If this happens to leave the queue empty, then the
system will panic with an "accept: nothing queued".
Also changed a wakeup() to a wakeup_one() to avoid the "thundering herd"
problem on new connections in Apache (or any other application that has
multiple processes blocked in accept() for the same socket).
as shadows of their containing directory. This should solve the problem
of users not being able to delete their symlinks from /tmp once and for
all.
Symlinks do not have modes though, they are accessable to everything that
can read the directory (as before). They are made to show this fact at
lstat time (they appear as mode 0777 always, since that's how the the
lookup routines in the kernel treat them).
More commits will follow, eg: add a real lchown() syscall and man pages.
centric rather than VM-centric to fix a problem with errors not being
detectable when the header is read.
Killed exech_map as a result of these changes.
There appears to be no performance difference with this change.
they were created later on. This is not the case when processing
syscalls.isc in the ibcs2 area. (It generates no declarations, it's
all either hidden (already prototyped elsewhere) or unimplemented).
<sys/ioctl_compat.h> and sometimes <sys/filio.h> instead of
<sys/ioctl.h> in tty-related files. <sys/ttycom.h> is still
usually imported bogusly via <sys/termios.h>.
<sys/ttycom.h> and sometimes <sys/filio.h> instead of <sys/ioctl.h>
in miscellaneous files. Most of these files have nothing to do
with ttys but need to include <sys/ttycom.h> to get the definitions
of TIOC[SG]PGRP which are (ab)used to convert F[SG]ETOWN fcntls into
ioctls.
automatically have random generation numbers. The kenel way of handling those
also changed. Further it is advised to run fsirand on all your nfs exported
filesystems. the code is mostly copied from OpenBSD, with the randomization
chanegd to use /dev/urandom
Reviewed by: Garrett
Obtained from: OpenBSD
null casts. `time' is nonvolatile for accesses within a region locked
by splclock()/splx(). Accesses outside such a region are invalid, and
splx() must have the side effect of potentially changing all global
variables (since there are hundreds of sort of volatile variables like
`time'), so declaring `time' as volatile didn't have any real benefits.
form `tv = time'. Use a new function gettime(). The current version
just forces atomicicity without fixing precision or efficiency bugs.
Simplified some related valid accesses by using the central function.
processes using AF_LOCAL sockets. This hack is going to be used with
Secure RPC to duplicate a feature of STREAMS which has no real counterpart
in sockets (with STREAMS/TLI, you can apparently use t_getinfo() to learn
UID of a local process on the other side of a transport endpoint).
What happens is this: the client sets up a sendmsg() call with ancillary
data using the SCM_CREDS socket-level control message type. It does not
need to fill in the structure. When the kernel notices the data,
unp_internalize() fills in the cmesgcred structure with the sending
process' credentials (UID, EUID, GID, and ancillary groups). This data
is later delivered to the receiving process. The receiver can then
perform the follwing tests:
- Did the client send ancillary data?
o Yes, proceed.
o No, refuse to authenticate the client.
- The the client send data of type SCM_CREDS?
o Yes, proceed.
o No, refuse to authenticate the client.
- Is the cmsgcred structure the right size?
o Yes, proceed.
o No, signal a possible error.
The receiver can now inspect the credential information and use it to
authenticate the client.
devtotty(). devtotty() must check its arg carefully since the arg is
supplied as ioctl data. This should fix PR3004.
Renamed devtotty() to snpdevtotty().
formula uses `& nchash'. This is very broken when nchash is a prime
number instead of 1 less than a power of 2, but the Lite2 formula was
merged in.
Merged some cosmetic changes from Lite2, rev.1.21 and Lite1. The merge
was difficult because the Lite2 code is essentially ours (phk's) except
where Lite2 improved or broke it.
Summary of the Lite2 changes:
- in the copyright, phk's rights have been transferred to the Regents.
This change should be reviewed.
- nchENOENT went away; the "no" vnode is now simply 0.
- comments were improved.
- style was "improved".
- goto instead of Fanatism (sic) was considered bad :-).
- there are some small changes to support whiteouts.
- new cache entries are added in more cases. More work is required
near here to change the hash table size if kern.desiredvnodes is
changed using sysctl.
- rescanning of the hash bucket in cache_purgevfs() was removed. This
change should be reviewed.
(phk's) sysctl framework, and I needed special code to disambiguate
the VFS_GENERIC node from the VFS_VFSCONF leaf, so I only converted
the leaves to the FreeBSD framework. The error handling isn't quite
right. CSRGS's sysctls seem to return ENOTDIR too much and FreeBSD's
sysctls don't agree with the man page.
and getvfsbyname() interfaces. The new interfaces are now hidden from
applications unless _NEW_VFSCONF is defined. The new vfsconf interfaces
don't work yet.
cruft and resulted in loading usually following a null pointer. Use
something closer to the pre-Lite2 code, including not making a copy of
the new filesystem's config info. Not making a copy also fixes a race
for loading and a memory leak for unloading.
Fixed unloading of vfs's. maxvfsconf wasn't maintained.
Look up the vfs to unload by name instead of by number. The numbers
should go away as soon as all mount utilities are converted.
- getnewvnode() and vref() were missing one simple_unlock() each.
- the Lite2 locking changes weren't merged at all in
printlockedvnodes() or sysctl_vnode(). Merging these undid
some KNF style regressions.
all of the configurables and instrumentation related to
inter-process communication mechanisms. Some variables,
like mbuf statistics, are instrumented here for the first
time.
For mbuf statistics: also keep track of m_copym() and
m_pullup() failures, and provide for the user's inspection
the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.
- avoid malloc() if the number of fds is small.
- pack the bits better so that `small' is quite large.
- don't waste time generating zero bits for null fd_set pointers or
scanning these bits.
Possibly improved select():
- free malloc()ed storage before returning. This is simpler and I
think huge select()s aren't worth optimizing since they are rare,
relative gain would be small and there would be tiny costs for all
selects().
Reviewed by: ache (first version by him too)
execve() clears the P_SUGID process flag in execve() if the binary
executed does not have suid or sgid permission bits set.
This also happens when the effective uid is different from the real
uid or the effective gid is different from the real gid. Under
these circumstances, the process still has set id privileges and
the P_SUGID flag should not be cleared.
Submitted by: Tor Egge <Tor.Egge@idt.ntnu.no>
Successful lstat()s purged an existing entry as well as not caching the
result.
This bug was introduced in Lite1 by setting the LOCKPARENT flag for
[o]lstat() in order to support the inherit-attributes-from-parent-
directory misfeature for symlinks. LOCKPARENT was previously only set
for CREATEs and DELETEs. It is now set for LOOKUPs, but only for
[o]lstat(), so the problem wasn't very noticeable.
the old VFS_VFSCONF sysctl is enabled by default.
Initialize the vfc_vfsops field to non-NULL in sysctl_ovfs_conf()
so that the old VFS_VFSCONF sysctl actually works. The old (still
current) getvfsent.c uses this "kernel-only" field to decide which
vfs's are configured (the old implementation returned null entries
for unconfigured vfs's).
to coredump previously since it (somewhat uniquely) is setuid and forks
without execing, and thus without passing P_SUGID the child could
coredump and possibly divulge sensitive information (such as encrypted
passwords from the passwd database).
clusters greater than one page in length by calling contigmalloc1().
This uses a helper process `mclalloc' to do the allocation if
the system runs out at interrupt time to avoid calling contigmalloc
at high spl. It is not yet clear to me whether this works.
sb_max * MCLBYTES / (MSIZE + MCLBYTES)
used in sbreserve() to overflow, causing all socket creation attempts
to fail. Force the calculation to use u_quad_t's, which makes overflow
less likely.
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.
Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>
The limit is now only used by init, so it may as well be "infinite".
Don't use RLIM_INFINITY, since setrlimit() doesn't allow setting
that value. Use maxfiles instead of RLIM_INFINITY for the hard
limit for the same reason.
Similarly for the maxprocesses limits (use the "infinite" value of
maxproc instead if MAXUPRC and RLIM_INFINITY).
NOFILES, MAXUPRC, CHILD_MAX and OPEN_MAX are no longer used in
/usr/src and should go away. Their values are almost guaranteed to
be wrong now that login.conf exists, so anything that uses the values
is broken. Unfortunately, there are probably a lot of ports that
depend on them being defined.
The global limits maxfilesperproc and maxprocperuid should go away
too.
on it.
makesyscalls.sh:
This parsed $Id$. Fixed(?) to parse $FreeBSD$. The output is wrong when
the id is not expanded in the source file.
syscalls.master:
Fixed declaration of sigsuspend(). There are still some bogons and
spam involving sigset_t.
Use `struct foo *' instead of the equivalent `foo_t *' for some nfs and
lfs syscalls so that <sys/sysproto.h> doesn't depend on <sys/mount.h>.
variable `kern.maxvnodes' which gives much better control over vnode
allocation than EXTRAVNODES (except in -current between 1995/10/28 and
1996/11/12, kern.maxvnodes was read-only and thus useless).
when allocating memory for network buffers at interrupt time. This is due
to inadequate checking for the new mcl_map. Fixed by merging mb_map and
mcl_map into a single mb_map.
Reviewed by: wollman
rev.1.10 two years ago. Children continued to run at splhigh()
after returning from vm_fork(). This mainly affected kernel
processes and init. For ordinary processes, interrupts are normally
unmasked a few instructions later after fork() returns (it may be
important for syscall() not to reschedule the child processes).
Kernel processes had workarounds for the problem. Init manages to
start because some routines "know" that it is safe to go to sleep
despite their caller starting them at a high ipl. Then its ipl
gets fixed on its first normal return from a syscall.
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).
NOTE: THAT LKMS MUST BE REBUILT!!!
Broke locking on named pipes in the same way as locking on non-vnodes
(wrong errno). This will be fixed later.
The fix involves negative logic. Named pipes are now distinguished from
other types of files with vnodes, and there is additional code to handle
vnodes and named pipes in the same way only where that makes sense (not
for lseek, locking or TIOCSCTTY).
This makes unexpected faults (in an interrupt handler) more likely
to crash properly. It could be done even better (more robustly and
more efficiently) using lazy fault handling.
modules sort of works now. Their devswitch entries aren't cleaned
up, so accessing them after they have been unloaded causes a panic
in spec_open().
Submitted by: durian@plutotech.com (Mike Durian), IIRC
Most of the standard utilities that depended on (or were broken in
a different way by) the old behaviour of interpreting "" as "."
were fixed a year or two ago. There is still a fairly harmless
bug in tar and a harmless bug in gzip. Tar apparently replaces
"/" by "" when it strips leading slashes.
decrease the size of buffer_map to approx 2/3 of what it used to be
(buffer_map can be smaller now.) The original commit of these changes
increased the size of buffer_map to the point where the system would
not boot on large systems -- now large systems with large caches will
have even less problems than before.
the sd & od drivers. There is also slight changes to fdisk & newfs
in order to comply with different sectorsizes.
Currently sectors of size 512, 1024 & 2048 are supported, the only
restriction beeing in fdisk, which hunts for the sectorsize of
the device.
This is based on patches to od.c and the other system files by
John Gumb & Barry Scott, minor changes and the sd.c patches by
me.
There also exist some patches for the msdos filesys code, but I
havn't been able to test those (yet).
John Gumb (john@talisker.demon.co.uk)
Barry Scott (barry@scottb.demon.co.uk)
scheme. Additionally, add the capability for checking for unexpected
kernel page faults. The maximum amount of kva space for buffers hasn't
been decreased from where it is, but it will now be possible to do so.
This scheme manages the kva space similar to the buffers themselves. If
there isn't enough kva space because of usage or fragementation, buffers
will be reclaimed until a buffer allocation is successful. This scheme
should be very resistant to fragmentation problems until/if the LFS code
is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS
is not likely to use such a scheme.
Now there should be NO problem allocating buffers up to MAXPHYS.
succeeds. Writing an action now succeeds iff the handler isn't changed.
(POSIX allows attempts to change the handler to be ignored or cause an
error. Changing other parts of the action is allowed (except attempts
to mask unmaskable signals are silently ignored as usual).)
Found by: NIST-PCTS
the queues and generate a SIGINT. Previously, this wasn't done if ISIG
was clear or the VINTR character was disabled, and it was done by
converting the BREAK to a VINTR character and sometimes bogusly echoing
this character.
Found by: NIST-PCTS
larger than the vfs layer can provide. We now automatically support
32K clusters if MSDOSFS is installed, and panic if a filesystem tries
to allocate a buffer larger than MAXBSIZE.
This commit is a result of some "prodding" by BDE.
substantially increasing buffer space. Specifically, we double
the number of buffers, but allocate only half the amount of memory
per buffer. Note that VDIR files aren't cached unless instantiated
in a buffer. This will significantly improve caching.
using a sockaddr_dl.
Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).
(1) deleted #if 0
pc98/pc98/mse.c
(2) hold per-unit I/O ports in ed_softc
pc98/pc98/if_ed.c
pc98/pc98/if_ed98.h
(3) merge more files by segregating changes into headers.
new file (moved from pc98/pc98):
i386/isa/aic_98.h
deleted:
well, it's already in the commit message so I won't repeat the
long list here ;)
Submitted by: The FreeBSD(98) Development Team
If DEVFS is configured, create devfs devices for previously invisible
partitions on the slices.
Fixed an old aliasing bug which caused E=17 errors from DEVFS for
DIOCSDINFO when there were no real slices.
I decided to do this for every hardclock() call instead of lazily
in microtime(). The lazy method is simpler but has more overhead
if microtime() is called a lot.
CPU_THISTICKLEN() is now a no-op and should probably go away.
Previously it did nothing directly but had the side effect of
setting i586_last_tick for CPU_CLOCKUPDATE() and i586_avg_tick for
debugging. CPU_CLOCKUPDATE() now uses a better method and
i586_avg_tick is too much trouble to maintain.
Reduced nesting of #includes in the usual case.
Increased nesting of #includes when CLOCK_HAIR is defined. This
is a kludge to get typedefs for inline functions only when the
inline functions are used. Normally only kern_clock.c defines
this. kern_clock.c can't include the i386 headers directly.
Removed unused LOCORE support.
- use a more accurate and more efficient method of compensating for
overheads. The old method counted too much time against leaf
functions.
- normally use the Pentium timestamp counter if available.
On Pentiums, the times are now accurate to within a couple of cpu
clock cycles per function call in the (unlikely) event that there
are no cache misses in or caused by the profiling code.
- optionally use an arbitrary Pentium event counter if available.
- optionally regress to using the i8254 counter.
- scaled the i8254 counter by a factor of 128. Now the i8254 counters
overflow slightly faster than the TSC counters for a 150MHz Pentium :-)
(after about 16 seconds). This is to avoid fractional overheads.
files.i386:
permon.c temporarily has to be classified as a profiling-routine
because a couple of functions in it may be called from profiling code.
options.i386:
- I586_CTR_GUPROF is currently unused (oops).
- I586_PMC_GUPROF should be something like 0x70000 to enable (but not
use unless prof_machdep.c is changed) support for Pentium event
counters. 7 is a control mode and the counter number 0 is somewhere
in the 0000 bits (see perfmon.h for the encoding).
profile.h:
- added declarations.
- cleaned up separation of user mode declarations.
prof_machdep.c:
Mostly clock-select changes. The default clock can be changed by
editing kmem. There should be a sysctl for this.
subr_prof.c:
- added copyright.
- calibrate overheads for the new method.
- documented new method.
- fixed races and and machine dependencies in start/stop code.
mcount.c:
Use the new overhead compensation method.
gmon.h:
- changed GPROF4 counter type from unsigned to int. Oops, this should
be machine-dependent and/or int32_t.
- reorganized overhead counters.
Submitted by: Pentium event counter changes mostly by wollman
add free vnodes back to the freelist. They must do their own vnode
management. Anyway, this change is *only* activated with their filesystem
and doesn't affect anyone else. Whoops, forgot the submitted-by lines
in my previous commits too.. :-(
Submitted-By: Tony Ardolino <tony@netcon.com>
The heuristic for managment of memory backing the buffer cache was
nice, but didn't work due to some architectural problems. Simplify
and improve the algorithm.
capable of being used for things other than swap space allocation,
and splvm would have been appropriate for only swap space allocation
and other VM things. My commit broke that (and was actually a mistake.)
previous snap. Specifically, kern_exit and kern_exec now makes a
call into the pmap module to do a very fast removal of pages from the
address space. Additionally, the pmap module now updates the PG_MAPPED
and PG_WRITABLE flags. This is an optional optimization, but helpful
on the X86.
(yes I had tested the hell out of this).
I've also temporarily disabled the code so that it behaves as it previously
did (tail drop's the syns) pending discussion with fenner about some socket
state flags that I don't fully understand.
Submitted by: fenner