ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.
Noticed by: rwatson
Approved by: bms(mentor)
vm_page_free() is called. The problem with holding this lock is that it is
a spin lock and vm_page_free() may attempt the acquisition of a different
default-type lock.
for user copyinout down to 12, and keeping segments 13/14 for
kernel VA.
It would be nice to have more available, but segments lower than
this are reserved for either memory or 1:1 mapped device i/o,
and seg 15 is OpenFirmware ROM. Also, the effort to keep OpenFirmware
available for callbacks limits the use of VA-mapped segments.
Fortunately UMA_MD_SMALL_ALLOC takes away a lot of VM pressure.
Obtained from: NetBSD
include/ucontext.h
- remove trapframe and switch over to 'generic' description of machine
state. Include version field to help with future modifications.
Include floating point and altivec state, and hopefully align
correctly
powerpc/copyinout.c
- fill out casuptr() sync primitive, required by kern_umtx.c
powerpc/machdep.c
- shifted proc0/thread0/pcpu setup to before cninit, since
syscons -> make_dev -> devlock requires a valid curthread
- implemented get_mcontext/set_mcontext
- recast sendsig/sigreturn to use get/set_mcontext and new
ucontext struct. floating point now saved
- TODO: save/restore altivec state
powerpc/vm_machdep.c
- implemented cpu_thread_setup/cpu_set_upcall/cpu_set_upcall_kse
- eliminated trailing whitespace
Submitted by: Suleiman Souhlal <refugee@segfaulted.com>, ucontext by grehan
race in between sleepq_add() and sleepq_catch_signals() in that setting
td_wchan and TDF_SINTR is not atomic to sched_lock but only to the sleepq
lock. This band-aid will stop assertion failures, but there is perhaps a
larger problem with the sleepq_add/sleepq_catch_signals race that I am not
sure how to solve. For the signals case the race is harmless because we
always call cursig() after setting TDF_SINTR. However, KSE doesn't do
anything in sleepq_catch_signals() to check that this race was lost, so I
am unsure if this race is harmful for this specific abort.
to NET_UNLOCK_GIANT(). While they are used in similar ways, the
semantics are quite different -- NET_LOCK_GIANT() and NET_UNLOCK_GIANT()
directly wrap mutex lock and unlock operations, whereas drop/pickup
special case the handling of Giant recursion. Add a comment saying
as much.
Add NET_ASSERT_GIANT(), which conditionally asserts Giant based
on the value of debug_mpsafenet.
checking and freeing a different pointer that may or may not have been
assigned the same value. This should fix panics under load that were
recently reported.
serial console connections but not graphical consoles. This fixes the
graphical console machines. It leaves the initial promcons console
driver in place until a bit later in the boot sequence, delaying the
switch to the device drivers more appropriate for the machine's real
console setup. Note we still need the delayed make_dev() for promcons,
it does not have a proper bus interface so unlike other console drivers
it will not be found later during normal device discovery.
Tested by: sepotvin <at> videotron <dot> ca
Root cause explained by: grehan (-current)
Approved by: rwatson (mentor)
but a bit more reamins to be done. For now, it is usable.
PR:
Submitted by: Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Reviewed by:
Approved by:
Obtained from:
MFC after:
functions in kern_socket.c.
Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".
Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.
Submitted by: sam
than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1. This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.
Submitted by: sam
swap-backed memory disks. This reduces filesystem allocation overhead
and makes swap-backed memory disks compatible with broken code (dd,
for example) which expects to see 512 byte sectors. The size of a
swap-backed memory disk must still be a multiple of the page size.
When performing page-aligned operations, this change has zero
performance impact.
Reviewed by: phk
Approved by: rwatson (mentor)
Assert the BPF descriptor lock in the MAC calls referencing live
BPF descriptors.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research
BPF. Grab the BPF descriptor lock before entering MAC since the MAC
Framework references BPF descriptor fields, including the BPF
descriptor label.
Submitted by: sam
on it in hopes of making sure that the waitq was empty before going on.
This wasn't needed and probably never would have worked as intended. Now
that cv_waitq_empty() and friends are gone, the code in these drivers that
spins on it can go away too. This should unbreak LINT.
Discussed with: kan
- use correct rid when allocating PCI mem resource
- ATA taskfile registers are indeed spaced 0x10 apart just like
the Macio ATA cell. Adjust offsets in ATA channel struct.
Tested by: Suleiman Souhlal <ssouhlal@vt.edu>
generic watchdoc(9) interface.
Make watchdogd(8) perform as watchdog(8) as well, and make it
possible to specify a check command to run, timeout and sleep
periods.
Update watchdog(4) to talk about the generic interface and add
new watchdog(8) page.
rid of the MTX_DUPOK flag on channel mutexes, which allows witness to
do a better job of lock order checking. Nuke snd_chnmtxcreate() since
it is no longer needed.
Tested by: matk
channel at a time unless it is actually necessary to lock both.
This avoids problems with lock order reversal and malloc() calls
with a mutex held when lower level code unlocks a channel, calls malloc(),
and relocks the channel. This also avoids the cost of some unnecessary
locking and unlocking.
Tested by: matk
Testing on cluster ref machine with just delaying make_dev() seems
to work, and results in printf() output appearing sooner in boot
cycle instead of going to /dev/null.
Caught by: bde
Pointy hat: kensmith
Approved by: rwatson (mentor)
patterns. (These lines are correct the other two times they appear.)
Reported by: "Ted Unangst" <tedu@coverity.com>
Approved by: rwatson (mentor), ken (scsi)
remove unused pid field of file context struct
map nfs4 error codes to errnos
eliminate redundant code from nfs4_request
use zero stateid on setattr that doesn't set file size
use same clientid on all mounts until reboot
invalidate dirty bufs in nfs4_close, to play it safe
open file for writing if truncating and it's not already open
Approved by: alfred
subtle problems with how alpha was handling the promcons device. This
moves the call to make_dev() for the promcons device to a later point of
the boot-up sequence than where promcons initially gets attached, make_dev()
called during the first attach crashes due to kernel stack issues.
Reviewed by: gallatin, marcel, phk
Discussed on: -current@, -alpha@
Approved by: rwatson (mentor)
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
the process state to zombie when a process exits to avoid a lock order
reversal with the sleepqueue locks. This appears to be the only place
that we call wakeup() with sched_lock held.
to queue threads sleeping on a wait channel similar to how turnstiles are
used to queue threads waiting for a lock. This subsystem will be used as
the backend for sleep/wakeup and condition variables initially. Eventually
it will also be used to replace the ithread-specific iwait thread
inhibitor.
Sleep queues are also not locked by sched_lock, so this splits sched_lock
up a bit further increasing concurrency within the scheduler. Sleep queues
also natively support timeouts on sleeps and interruptible sleeps allowing
for the reduction of a lot of duplicated code between the sleep/wakeup and
condition variable implementations. For more details on the sleep queue
implementation, check the comments in sys/sleepqueue.h and
kern/subr_sleepqueue.c.
statements and nowhere else in the kernel seems to use them for single
statements. Also, all other users of do { } while(0) use multiple lines
rather than cramming it all onto one line.
work. This is odd because loader(8) doesn't suffer from this problem.
Perhaps pxeboot bootstrap can be fixed to handle this better.
Anyway, PXE booting should work again.
are employed in entry points later in the same include file.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Air Force Research Laboratory, McAfee Research
struct vattr in mac_policy.h. This permits policies not
implementing entry points using these types to compile without
including include files with these types.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Air Force Research Laboratory
This enables pf to track dynamic address changes on interfaces (dailup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.
Approved by: bms(mentor)
pf/pflog/pfsync as modules. Do not list them in NOTES or modules/Makefile
(i.e. do not connect it to any (automatic) builds - yet).
Approved by: bms(mentor)
to a new mac_inet.c. This code is now conditionally compiled based
on inet support being compiled into the kernel.
Move socket related MAC Framework entry points from mac_net.c to a new
mac_socket.c.
To do this, some additional _enforce MIB variables are now non-static.
In addition, mbuf_to_label() is now mac_mbuf_to_label() and non-static.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research
for a long time and is run in production use. This is the code present in
portversion 2.03 with some additional tweaks.
The rather extensive diff accounts for:
- locking (to enable pf to work with a giant-free netstack)
- byte order difference between OpenBSD and FreeBSD for ip_len/ip_off
- conversion from pool(9) to zone(9)
- api differences etc.
Approved by: bms(mentor) (in general)
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly. Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.
Approved by: jlemon (years ago, for a more invasive fix)
also prints the actual numerical value of the symbol in question.
Users of addr2line(1) will be less proficient in hex arithmetic as a
consequence.
This amongst other things means that traceback lines change from:
siointr1(c4016800,c073bda0,0,c06b699c,69f) at siointr1+0xc5
to
siointr1(c4016800,c073bda0,0,c06b699c,69f) at 0xc062b0bd = siointr1+0xc5
I made this an option to avoid bikesheds.
~
~
~
amount of segments it will hold.
The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:
net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).
net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).
net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.
net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.
Tested by: bms, silby
Reviewed by: bms, silby
address, even if we subsequently ignore its value by applying a >>8
to it.
Reported by: "Ted Unangst" <tedu@coverity.com>
Approved by: rwatson (mentor), {ume, suz} (KAME)
The nonstandard formatting made my mega-patch scripts miss it.
Retire the static major number while we're here anyway.
Reported by: Niels Chr. Bank-Pedersen <ncbp@bank-pedersen.dk>
AFTER the call to vn_start_write(), not before it. Otherwise, it is
possible to unlock it multiple times if the vn_start_write() fails.
Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
In ufs_lock, check for attempts to acquire shared locks on
snapshot files and change them to be exclusive locks. This
change eliminates deadlocks and machine lockups reported in
-current since most read requests started using shared lock
requests.
Submitted by: Jun Kuriyama <kuriyama@imgsrc.co.jp>
swap_pager_putpages()'s buffer completion code. Note: the only
difference between swp_pager_sync_iodone() and bdone(), aside from
the locking in the latter, was the unnecessary clearing of B_ASYNC.
- Remove an unnecessary pmap_page_protect() from
swp_pager_async_iodone().
Reviewed by: tegge
of all, PIPE_EOF is not checked pervasively after everything that can drop
the pipe mutex and msleep(), so fix. Additionally, though it might not
harm anything, pipelock() and pipeunlock() are not used consistently.
Third, the kqueue support functions do not use the pipe mutex correctly.
Last, but absolutely not least, is a race: if pipe_busy is not set on
the closing side of the pipe, the other side that is trying to write to
that will crash BECAUSE PIPE_EOF IS NOT SET! Unconditionally set
PIPE_EOF, and get rid of all the lockups/crashes I have seen trying
to build ports.
an integer type and the a cast to (void *) was added in the
definition of NULL for the kernel, we need to use 0 here instead.
Partly submitted by: cperciva
Now I believe it is done in the right way.
Removed some XXMAC cases, we now assume 'high' integrity level for all
sysctls, except those with CTLFLAG_ANYBODY flag set. No more magic.
Reviewed by: rwatson
Approved by: rwatson, scottl (mentor)
Tested with: LINT (compilation), mac_biba(4) (functionality)
with a memory mapped I/O range that's immediately before it and is
not 256MB aligned. As a result, when an address is accessed in the
memory mapped range and a direct mapping is added for it, it overlaps
with the pre-mapped I/O port space and causes a machine check.
Based on a patch from: arun@
to use the "year1-year3" format, as opposed to "year1, year2, year3".
This seems to make lawyers more happy, but also prevents the
lines from getting excessively long as the years start to add up.
Suggested by: imp
idmap_add failure case (found by Ted Unangst via Colin Percival)
also convert idmap_hashf to return void, since it can't fail
also change some panics to error returns
by 1 u_int if the number of clusters was 1 more than a multiple of
(8 * sizeof(u_int)). The bitmap is malloced and large (often huge), so
fatal overrun probably only occurred if the number of clusters was 1
more than 1 multiple of PAGE_SIZE/8.
This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:
Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.
Hold a dev_t reference while the device is open.
KASSERT that we do not enter the driver on a non-referenced dev_t.
Remove old D_NAG code, anonymous dev_t's are not a problem now.
When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.
Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
Remove the unused second argument from udev2dev().
Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.
Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.
Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
Add missing D_TTY flags to various drivers.
Complete asserts that dev_t's passed to ttyread(), ttywrite(),
ttypoll() and ttykqwrite() have (d_flags & D_TTY) and a struct tty
pointer.
Make ttyread(), ttywrite(), ttypoll() and ttykqwrite() the default
cdevsw methods for D_TTY drivers and remove the explicit initializations
in various drivers cdevsw structures.
This commit adds a couple of functions for pseudodrivers to use for
implementing cloning in a manner we will be able to lock down (shortly).
Basically what happens is that pseudo drivers get a way to ask for
"give me the dev_t with this unit number" or alternatively "give
me a dev_t with the lowest guaranteed free unit number" (there is
unfortunately a lot of non-POLA in the exact numeric value of this
number, just live with it for now)
Managing the unit number space this way removes the need to use
rman(9) to do so in the drivers this greatly simplifies the code in
the drivers because even using rman(9) they still needed to manage
their dev_t's anyway.
I have taken the if_tun, if_tap, snp and nmdm drivers through the
mill, partly because they (ab)used makedev(), but mostly because
together they represent three different problems for device-cloning:
if_tun and snp is the plain case: just give me a device.
if_tap has two kinds of devices, with a flag for device type.
nmdm has paired devices (ala pty) can you can clone either of them.
Free approx 86 major numbers with a mostly automatically generated patch.
A number of strategic drivers have been left behind by caution, and a few
because they still (ab)use their major number.
This removes the packet header in certain cases which later on
will give panic. Clarify what the atm_intr expects in the comment
and de-obscurify the code a little bit by replacing the portability
macros with the BSD names. The code isn't maintained externally anymore
so there's no point in keeping the extra level of obscurity.
- allow for ifp->if_ioctl being NULL, as the rest of ifioctl() does;
- give the interface driver a chance to report a error to the caller;
- don't forget to update ifp->if_lastchange upon successful modification
of interface operation parameters.
layering violation. As pointed out, there is much better way to do this.
Sorry guys, I need to find a better way to force reviews.
Requested by: harti, julian, scottl (mentor)
Pointy hat to: pjd
Instead of creating a mutex that we msleep on but don't actually lock when
doing the corresponding wakeup(), in the kthread, lock the mutex associated
with our taskqueue and msleep while the queue is empty. Assert that the
queue is locked when the callback function is called to wake the kthread.
return events on the fixed handler even after defining a duplicate in the
AML. While this violates the spec, hopefully we can get by with leaving
both installed.
It works as follows:
In every 'interval' seconds defined links are checked.
If they are non-active they will not be used by to data transfer.
No response from: julian, archie
Silent on: net@
Approved by: scottl (mentor)
mode is applied, since tunneled packets are considered to be
generated packets from a tunnel encapsulating node.
- tunnel mode may not be applied if SA mode is ANY and policy
does not say "tunnel it". check if we have extra IPv6 header
on the packet after ipsec6_output_tunnel() and call ip6_output()
only if additional IPv6 header is added.
- free the copyed packet before returning.
Obtained from: KAME
It returns 1 is process is inside of jail and 0 if it is not.
Information if we are in jail or not is not a secret, there is plenty of
ways to discover it. Many people are using own hack to check this and
this will be a legal way from now on.
It will be great if our starting scripts will take advantage of this sysctl
to allow clean "boot" inside jail.
Approved by: rwatson, scottl (mentor)
to size_t *, which is incorrect because they may have different widths.
This caused some subtle forms of corruption, the mostly frequently
reported one being that the last character of a filename was sometimes
duplicated on amd64.
stopped returning events. Don't disable the event when removing
the handler because it still needs to be enabled for the other
handler. Also, remove duplicate AcpiEnableEvent calls since the
install function now does this for us.
into its own file:
- All of the $PIR interrupt routing is now done in a link-centric fashion.
When a host-PCI bridge that uses the $PIR attaches, it calls pir_parse()
to parse the table. This scans for link devices and merges all the masks
for each link device from the table entries. It then looks at the intline
register of PCI devices connected to a link to figure out if the BIOS has
routed this link and if so to which IRQ.
- The IRQ for any given link can be overridden via a hint like so:
'hw.pci.link.0x62.irq=10' Any IRQ set in this matter is treated as if it
were set that way by the BIOS.
- We only call the BIOS to route each link device once.
- When a PCI device wants to route an interrupt, we look it up in the $PIR
to find the associated link. If the link is routed, we simply return the
IRQ it is using. If it is not routed, we have to pick one. This uses a
different algorithm from the old code. First off, when we try to pick
an interrupt from a mask of possible interrupts, we try to pick the one
that is least loaded as far as PCI devices. We maintain this weight based
on the number of devices attached to each link device. When choosing an
IRQ, we first attempt to route using any PCI only interrupts (the old
code did this as well). If that doesn't work, we try to use the list of
IRQs that the BIOS has used. This is a new step that the new code didn't
do and avoids using IRQ 3 or 4 for every virgin interrupt routing. If
none of the IRQs that the BIOS used worked, then we fall back to trying
anything.
- The fallback mask for !PC98 was fixed to include IRQ 3 and not allow IRQ
2.
- We don't use the $PIR to route interrupts on a PCI-PCI bridge unless it
has already been used to route on at least one Host-PCI bridge. This
helps to avoid mixing and matching x86 firmware PCI interrupt routing
methods (which is a Bad Thing(tm)).
Silence on: current@
Previously the "struct disk" were owned by the device driver and this
gave us problems when the device disappared and the users of that device
were not immediately disappearing.
Now the struct disk is allocate with a new call, disk_alloc() and owned
by geom_disk and just abandonned by the device driver when disk_create()
is called.
Unfortunately, this results in a ton of "s/\./->/" changes to device
drivers.
Since I'm doing the sweep anyway, a couple of other API improvements
have been carried out at the same time:
The Giant awareness flag has been flipped from DISKFLAG_NOGIANT to
DISKFLAG_NEEDSGIANT
A version number have been added to disk_create() so that we can detect,
report and ignore binary drivers with old ABI in the future.
Manual page update to follow shortly.
support is partial in that it will refuse to create large files on
filesystems that haven't been upgraded to EXT2_DYN_REV or that don't
have the EXT2_FEATURE_RO_COMPAT_LARGE_FILE flag set in the superblock.
MFC after: 2 weeks
kernel. I'm not happy with it yet - refinements are to come.
This hack allows the kern.ps_strings and kern.usrstack sysctls to respond
to a 32 bit request, such as those coming from emulated i386 binaries.
value for MSGBUF_SIZE is configured. MSGBUF_SIZE =
(32768 * bootverbose ? 2 : 1) is always 1 or 2, so there is not enough space
in the buffer for metadata, and blindly using the nonexistent space tends
to cause fatal pagefaults. I think
MSGBUF_SIZE = (32768 * (bootverbose ? 2 : 1)) would be always 32768 since
bootverbose is only statically initialized to 0 early when MSGBUF_SIZE is
used. MSGBUF_SIZE = (32768 * ((boothowto & RB_VERBOSE) ? 2 : 1)) should
work, but this belongs in <sys/msgbuf.h> even less than previous versions.
MSGBUF_SIZE shouldn't be a macro.
it means that the correct value is unknown. Since this value is just
a hint to improve performance, initially assume that the first non-reserved
cluster is free, then correct this assumption if necessary before writing
the FSInfo block back to disk.
PR: 62826
MFC after: 2 weeks