supposed to be checked by the firewall rules twice. However, because the
various ipsec handlers never call ip_input(), this never happens anyway.
This fixes the situation where a gif tunnel is encrypted with IPsec. In
such a case, after IPsec processing, the unencrypted contents from the
GIF tunnel are fed back to the ipintrq and subsequently handeld by
ip_input(). Yet, since there still is IPSec history attached, the
packets coming out from the gif device are never fed into the filtering
code.
This fix was sent to Itojun, and he pointed towartds
http://www.netbsd.org/Documentation/network/ipsec/#ipf-interaction.
This patch actually implements what is stated there (specifically:
Packet came from tunnel devices (gif(4) and ipip(4)) will still
go through ipf(4). You may need to identify these packets by
using interface name directive in ipf.conf(5).
Reviewed by: rwatson
MFC after: 3 weeks
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).
As noted previously, don't use FAST_IPSEC with INET6 at the moment.
Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks
from the KAME IPsec implementation, but with heavy borrowing and influence
of openbsd. A key feature of this implementation is that it uses the kernel
crypto framework to do all crypto work so when h/w crypto support is present
IPsec operation is automatically accelerated. Otherwise the protocol
implementations are rather differet while the SADB and policy management
code is very similar to KAME (for the moment).
Note that this implementation is enabled with a FAST_IPSEC option. With this
you get all protocols; i.e. there is no FAST_IPSEC_ESP option.
FAST_IPSEC and IPSEC are mutually exclusive; you cannot build both into a
single system.
This software is well tested with IPv4 but should be considered very
experimental (i.e. do not deploy in production environments). This software
does NOT currently support IPv6. In fact do not configure FAST_IPSEC and
INET6 in the same system.
Obtained from: KAME + openbsd
Supported by: Vernier Networks
o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version
Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month
a common lock. This change avoids a deadlock between snapshots when
separate requests cause them to deadlock checking each other for a
need to copy blocks that are close enough together that they fall
into the same indirect block. Although I had anticipated a slowdown
from contention for the single lock, my filesystem benchmarks show
no measurable change in throughput on a uniprocessor system with
three active snapshots. I conjecture that this result is because
every copy-on-write fault must check all the active snapshots, so
the process was inherently serial already. This change removes the
last of the deadlocks of which I am aware in snapshots.
Sponsored by: DARPA & NAI Labs.
of KBDIO_DEBUG which may be defined in the kernel config (as it is in NOTES).
This kind of bug is a _really_ horribly thing as we end up with one bit
of code thinking a particular structure is 136 bytes and another that it
is only 112 bytes.
Ideally all places would remember to #include the right "opt_foo.h" file,
but I think in practice file containing the variable sized struct should
#include it explicitly as a precaution.
Detected by: FlexeLint
to be administratively disabled as needed on UFS/UFS2 file systems. This
also has the effect of preventing the slightly more expensive ACL code
from running on non-ACL file systems, avoiding storage allocation for
ACLs that may be read from disk. MNT_ACLS may be set at mount-time
using mount -o acls, or implicitly by setting the FS_ACLS flag using
tunefs. On UFS1, you may also have to configure ACL store.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
automatically set MNT_MULTILABEL in the mount flags.
If FS_ACLS is set in a UFS or UFS2 superblock, automatically
set MNT_ACLS in the mount flags.
If either of these flags is set, but the appropriate kernel option
to support the features associated with the flag isn't available,
then print a warning at mount-time.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Ignoring a NULL dev in device_set_ivars() sounds wrong, KASSERT it to
non-NULL instead.
Do the same for device_get_ivars() for reasons of symmetry, though
it probably would have yielded a panic anyway, this gives more precise
diagnostics.
Absentmindedly nodded OK to by: jhb
were improperly relocated due to faulty logic in lookup_fdesc()
in elf_machdep.c. The symbol index (symidx) was bogusly used for
load modules other than the one the relocation applied to. This
resulted in bogus bindings and consequently runtime failures.
The fix is to use the symbol index only for the module being
relocated and to use the symbol name for look-ups in the
modules in the dependent list. As such, we need a function to
return the symbol name given the linker file and symbol index.
processes forked with RFTHREAD.
- Use a goto to a label for common code when exiting from fork1() in case
of an error.
- Move the RFTHREAD linkage setup code later in fork since the ppeers_lock
cannot be locked while holding a proc lock. Handle the race of a task
leader exiting and killing its peers while a peer is forking a new child.
In that case, go ahead and let the peer process proceed normally as the
parent is about to kill it. However, the task leader may have already
gone to sleep to wait for the peers to die, so the new child process may
not receive a SIGKILL from the task leader. Rather than try to destruct
the new child process, just go ahead and send it a SIGKILL directly and
add it to the p_peers list. This ensures that the task leader will wait
until both the peer process doing the fork() and the new child process
have received their KILL signals and exited.
Discussed with: truckman (earlier versions)
homerolling our own version.
- Rename the enum for memsize from ISA_IVAR_MSIZE to ISA_IVAR_MEMSIZE
since using 'MSIZE' in the macro invocation of ISA_ACCESSOR() conflicts
with the 'MSIZE' kernel option. The accessor function is still
isa_get_msize().
or fifo in UFS2, the normal ufs_strategy routine needs to be used
rather than the spec_strategy or fifo_strategy routine. Thus the
ffsext_strategy routine is interposed in the ffs_vnops vectors for
special devices and fifo's to pick off this special case. Otherwise
it simply falls through to the usual spec_strategy or fifo_strategy
routine.
Submitted by: Robert Watson <rwatson@FreeBSD.org>
Sponsored by: DARPA & NAI Labs.
It must be removed because it is done without the pipe being locked
via pipelock() and therefore is vulnerable to races with pipespace()
erroneously triggering it by temporarily zero'ing out the structure
backing the pipe.
It looks as if this assertion is not needed because all manipulation
of the data changed by pipespace() _is_ protected by pipelock().
Reported by: kris, mckusick
if failures occur, make sure that we release both the default ACL
and access ACL storage during new object creation.
Spotted by: phk and his pet flexelint
Sponsored by: DARPA, Network Associates Laboratories
be sure to exit the loop with vp == NULL if no candidates are found.
Formerly, this bug would cause the last vnode inspected to be used,
even if it was not available. The result was a panic "vn_finished_write:
neg cnt".
Sponsored by: DARPA & NAI Labs.
vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather
than requiring filesystems that use alternate locks to do so
in their vop_reclaim functions. This change is a further cleanup
of the vop_stdlock interface.
Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sponsored by: DARPA & NAI Labs.
This is most beneficial for vmware client os installs.
Reviewed by: jmallet, iedowse, tlambert2@mindspring.com
MFC After: never, -STABLE does not currently use this instruction
used by UFS to administratively enable support for extended ACLs.
While I'm here, remove MNT_MULTILABEL from the list of file system
flags we permit to be updated after the initial mount.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
FS_ACLS Administrative enable/disable of extended ACL support
FS_MULTILABEL Administrative flag to indicate to the MAC Framework
that objects in the file system are individually
labeled using extended attributes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Reviewed by: (in principal) mckusick, phk
constants in scope, C90 does not; so, add namespace visibility
conditionals to SIG*.
2) Define the extended __sighandler_t type only in BSD namespace.
3) Don't forward declare a struct for a prototype in <signal.h>.
4) Move location of SIG_* constants.
5) Move a forward declare into the correct namespace conditional.
Requested by: bde (1)
Submitted by: bde (2 thru 5)
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).
In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.
Sponsored by: DARPA & NAI Labs.
with the exception of indirect function calls, are assumed to be
intra load module and thus that GP will be the same. This avoids
saving, setting and restoring GP for each function call and
reduces the kernel with ~320KB. There's obviously a performance
benefit as well.
Note that since we generally don't know if calls will be intra or
inter load module when we're compiling kernel modules, -mconstant-gp
cannot be used for modules.
Fix the "@gprel relocation against dynamic symbol xxx" linker error.
Variables defined in the link unit and small enough to be put in the
short data section will have a gp-relative access sequence (using the
@gprel relocation). It is invalid to have @gprel relocations in shared
libraries, because they are to be resolved by the static linker and
not the dynamic linker. The -fpic option will cause @ltoff relocations
for @gprel relocations, but the side-effects are untested (if any).
Instead, disable/eliminate the short data section to achieve the same.
middle of this header.
o Remove unneeded conditionals to hide SIG* in the POSIX case.
(C allows implementations to define additional SIG* constants.)
o Add comments about missing features.
o Move the location of the sigset_t typedef.
o Update standards visibility conditionals.
o Fix some assumptions about what pid_t and uid_t are defined as.
o Remove size_t typedef and use __size_t in struct sigaltstack
instead.
the predicate registers. Even though the ITLB and DTLB interrupts
happen often enough, this bug didn't do much harm. The reason
is that the interrupt handlers only modify p1 and since this is
a preserved (callee-saved) register it is hardly used in code
generated by the compiler. Compilers use scratch registers by
default. Changing the interrupt handlers to use p6 (ie a scratch
register) proved that the bug was in fact fatal.
o Replace KSTACK_PAGES with pages on panic() in pmap_new_thread(),
o Fix style bugs in adjacent code,
o Use NULL instead of 0 for pointers,
o Save the virtual kstack address if we create an alternate
kstack because 1) we can derive the physical (RR7) address
from it and 2) we need the virtual address for contigfree()
in pmap_dispose_thread(). Thus td_altkstack saves
td_md.md_kstackvirt.
as a trivial function that only calls ia64_tpa() and hence requires
the prototype of ia64_tpa(), but by defining pmap_kextract as
ia64_tpa. This solves the inclusion ordering issue in ddb/db_watch.c
o Add typedefs for gid_t, off_t, pid_t, and uid_t in the non-standards
case.
o Add struct iovec (also defined in <sys/uio.h>).
o Add visibility conditionals to avoid defining non-standard
extentions in the standards case.
o Change spelling of some types so they work without including
<sys/types.h> (u_char -> unsigned char, u_short -> unsigned short,
int64 -> __int64, caddr_t -> char *)
o Add comments about missing restrict type-qualifiers and missing
function.
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c
Reviewed by: -arch
the locking of the proc lock after the goto to done1 to avoid locking
the lock in an error case just so we can turn around and unlock it.
- Move the exec_setregs() stuff out from under the proc lock and after
the p_args stuff. This allows exec_setregs() to be able to sleep or
write things out to userland, etc. which ia64 does.
Tested by: peter
So do GEOM. Not a pretty sight.
Take all the interesting stuff out of GEOM::disk_create(), and leave just
the creation of the fake dev_t. Schedule the topology munging to happen
in the g_event thread with g_call_me().
This makes disk_create() pretty lock-agnostic, almost lock-atheist.
Tripped over by: peter
Sponsored by: DARPA & NAI Labs
pci_cvt_to_bwx.
This doesn't necessarily make bge(4) now actually *work* on an alpha.
It loads, configures, and then about 30 seconds later, my XP1000 hard
freezes. But, hey, it's a start.
Obtained from: gallatin@freebsd.org
One bug fixed: Use getmicrouptime() to trigger reseeds so that we
cannot be tricked by a clock being stepped backwards.
Express parameters in natural units and with natural names.
Don't use struct timeval more than we need to.
Various stylistic and readability polishing.
Introduce arc4rand(void *ptr, u_int len, int reseed) function which
returns a stream of pseudo-random bytes, observing the automatic
reseed criteria as well as allowing forced reseeds.
Rewrite arc4random() in terms of arc4rand().
Sponsored by: DARPA & NAI Labs.
- add wrappers for mmap2(2) and ftruncate64(2) system calls;
- don't spam console with printf's when VFAT_READDIR_BOTH ioctl(2) is invoked;
- add support for SOUND_MIXER_READ_STEREODEVS ioctl(2);
- make msgctl(IPC_STAT) and IPC_SET actually working by converting from
BSD msqid_ds to Linux and vice versa;
- properly return EINVAL if semget(2) is called with nsems being negative.
Reviewed by: marcel
Approved by: marcel
Tested with: LSB runtime test
clear the bit. This allows ata driver to attach its children because
it needs the interrupts enabled to succeed.
Submitted by: iwasaki-san
o Spell CardBus as CardBus, not Cardbus or CardBUS while I'm here.
vcanrecycle to check a free vnode's availability. If it is
available, vcanrecycle returns an error code of zero and the
vnode in question locked. The getnewvnode routine then used
to call vn_start_write with the V_NOWAIT flag. If the filesystem
was suspended while taking a snapshot, the vn_start_write would
fail but getnewvnode would fail to unlock the vnode, instead
leaving it locked on the freelist. The result would be that the
vnode would be locked forever and would eventually hang the
system with a race to the root when it was attempted to recycle
it. This fix moves the vn_start_write check into vcanrecycle
where it will properly unlock the vnode if it is unavailable
for recycling due to filesystem suspension.
Sponsored by: DARPA & NAI Labs.
allow us to avoid nasty by-hand string parsing stuff in a number of
places in the kernel, reducing the risk of unexpected consequences
for kernel correctness.
on the _file() theme that do not follow symlinks. Sync to MAC tree.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
sched_lock. This means that we no longer access p_limit in mi_switch()
and the p_limit pointer can be protected by the proc lock.
- Remove PRS_ZOMBIE check from CPU limit test in mi_switch(). PRS_ZOMBIE
processes don't call mi_switch(), and even if they did there is no longer
the danger of p_limit being NULL (which is what the original zombie check
was added for).
- When we bump the current processes soft CPU limit in ast(), just bump the
private p_cpulimit instead of the shared rlimit. This fixes an XXX for
some value of fix. There is still a (probably benign) bug in that this
code doesn't check that the new soft limit exceeds the hard limit.
Inspired by: bde (2)
that add an instance of themselves. The npx(4) driver doesn't even check
the npx 'port' hint but hardcodes IO_NPX instead. The npx(4) driver also
will use isa IRQ 13 (on x86, 8 on pc98) by default if no 'irq' hint is
specified, so we don't need that hint either.
Whenever doing a copy-on-write check, first look in the list of
initially allocated blocks to see if it is there. If so, no further
check is needed. If not, fall through and do the full check. This
change eliminates one of two known deadlocks caused by snapshots.
Handling the second deadlock will be the subject of another check-in.
This change also reduces the cost of the copy-on-write check by
speeding up the verification of frequently checked blocks.
Sponsored by: DARPA & NAI Labs.
Whenever doing a copy-on-write check, first look in the list of
initially allocated blocks to see if it is there. If so, no further
check is needed. If not, fall through and do the full check. This
change eliminates one of two known deadlocks caused by snapshots.
Handling the second deadlock will be the subject of another check-in.
This change also reduces the cost of the copy-on-write check by
speeding up the verification of frequently checked blocks.
Sponsored by: DARPA & NAI Labs.
even when the underlying device has a larger sector size. Therefore,
the filesystem code should not (and with this patch does not) try to
use the underlying sector size when doing disk block address calculations.
This patch fixes problems in -current when using the swap-based
memory-disk device (mdconfig -a -t swap ...). This bugfix is not
relevant to -stable as -stable does not have the memory-disk device.
Sponsored by: DARPA & NAI Labs.
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
teh owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is consceivable that the
borrower should inherit the priority of the owner too.
that's another discussion and would be simple to do.
Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".
DDB now shows a lot mor info for threaded proceses though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)
Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.
The problem is that the code does a check for the granparent of
the Promise chip, if this is a bridge of the right type, we have
a TX4 on our hands, and need to handle that ones "issues".
Now the grandparent check cause subtle bugs in the newbus system,
mainly that pci_get_devid doesn't return an error value.
This patch works around the issue by using BUS_READ_IVAR() instead.
structure. This has been broken since 1998, but probably hasn't been
noticed because it takes a read/write of 64K blocks (32MB with 512 byte
blocks) to trigger using the 12 byte read/write CDB in scsi_read_write().
Submitted by: "Moore, Eric Dean" <emoore@lsil.com>
MFC after: 3 days
divide/remainder calls. For reasons not resolved, compiling the
relevant routines from libkern into boot2 results in stack corruption.
Do the simple thing: Don't use 64bit divide/remainder operations.
Sponsored by: DARPA & NAI Labs
among other things, the DEVFS rule subsystem to match nodes against a
path pattern supplied by the user.
fnmatch.c was repo-copied from src/lib/libc/gen/fnmatch.c, and the
only changes to it are those necessary to make it compile in the
kernel. The relevant parts of fnmatch.h were imported into libkern.h.
Approved by: -arch
o Implement the thread killing interlock as described by jhb in arch@
while talking to markm.
o Hold Giant around cbb_insert()/cbb_remove(). Deep in the belly of
the vm code we panic if we don't hold this when we activate the memory
for reading the CIS.
o If we had to do the kludge alloc, then do a kludge free.
configuration device hierarchy. Device arrival, departure and not
matched are presently reported. This will be the basis for devd, which
I still need to polish a little more before I commit it. If you don't
use /dev/devctl, it will be a noop.
o Allow the bus_debug variable to be set via the bus.debug tunable.
o Return pnpinfo and location info via the devinfo interface to userland.
devinfo(8) needs to be updated to print it.
o Better resume code. Move the comments around. Force the socket state to
be querried. Ack the interrupts properly.
o Intercept the interrupt requests and keep a list of interrupts to service
ourselves. When the card attaches, set its OK bit. When we get a card
status change interrupt for that card, clear the OK bit. Don't call the
ISR if the OK bit is cleared. Iwasaki-san and yamamoto-san have both
sent me patches that fix the same problem this fixes, but at the pccard
level.
o Try to get the signalling of the thread to actually die. This might not be
100% right, but it is less wrong than before.
o Add a SIC next to a TI type that looks like it could be wrong, but isn't.
the card.
o Add comments about how we're doing the CIS activation.
o Add location and pnp info functions.
o Add better code to hopefully deal with ata cards better (and other drivers
that allocate resources that we didn't preallocate from the CIS). OLDCARD
used to allow it, but NEWCARD was pickier. I'm not 100% sure this works,
but it doesn't break anything.
give us slightly better error checking than before and interpret what
default bits mean better. See the NetBSD CVS tree for the authors of
these changes (revs 1.10 .. 1.17).
Note, we return the PCI pnp info, but in fact that's wrong to do
since that data is not defined for CardBus cards. CardBus says that
these registers are undefined and one should use the CIS to do
device matching. To date, all CardBus cards have had these
registered defined, no doubt because they are using common silicon
to produce both the PCI cards and the CardBus cards. However, it isn't
any worse than the rest of the system, so just note it in passing and
move on.
o Also sort prototypes while I'm here.
Conditionalize the "XX bytes left" checks reference on UFS1/UFS12.
Conditionally build the necessary 64bit math for boot2 if UFS12.
Sponsored by: DARPA & NAI Labs.
revision 1.218. This bug caused a "struct file" reference to be
leaked if VOP_ADVLOCK(), vn_start_write(), or mac_check_vnode_write()
failed during the open operation.
PR: kern/43739
Reported by: Arne Woerner <woerner@mediabase-gmbh.de>
currently disabled):
o Don't use constants for the output parameter, use the iparam count as a
pointer to the first result location.
o Fix bits vs bytes counting problems.
o Split out the hardware and software normalization versions of modexp.
o Enable hardware normalization for chips that support it.
o On reset, disable hardware normalization for 582x and make sure the
chip is in little endian mode.
o Since sw normalization is now the only option, simplify normalization
handling.
Also fix RNG harvesting: disabling PK support (for the moment) had disabled
the MCR2 interrupt; consider both KEY support and RNG support when deciding
whether or not to enable it.
Obtained from: openbsd
It seems that the existence of a "depend" target in src/sys/boot is not
to be taken as an indication that it actually does what one would expect,
at least it clearly threw my testing off.
Apologies to: jhb
for processing callbacks. This closes race conditions caused by locking
too many things with a single mutex.
o reclaim crypto requests under certain (impossible) failure conditions
Load 4 sectors more than we used to. This is harmless overhead for
the UFS1_ONLY case, but sufficient for boot2(UFS1+2).
Sponsored by: DARPA & NAI Labs
Don't use snprintf where strlcpy() will do the job.
Also, a NUL is '\0' not 0 in our style (C doesn't care), so spell it like.
Remove useless {} and () in the general area of this change.
and therefore we need a way for ioctl handlers to run in that thread
in GEOM. Rather than invent a complicated registration system to
recognize which ioctl handler to use for a given ioctl, we still
schedule all ioctls down the tree as bio transactions but add a
special return code that means "call me directly" and have the
geom_dev layer do that.
Use this for all ioctls that make it as far as a diskdriver to
avoid any backwards compatibility problems.
Requested by: scottl
Sponsored by: DARPA & NAI Labs