the point where it being a macro is no longer sensible, and it will
only be more so in days to come.
BIO_STRATEGY() is now only used from DEV_STRATEGY() and should not
be used directly anymore.
Put the contents of both in the new function dev_strategy() and
make DEV_STRATEGY() call that function.
In addition, this allows us to make the rather magic bufdonebio()
helper function static.
This alse saves hunderedandsome bytes of code in a typical kernel.
definition structure. Define one flag, CN_FLAG_NODEBUG, which
indicates the console driver cannot be used in the context of the
debugger. This may be used, for example, if the console device
interacts with kernel services that cannot be used from the
debugger context, such as the network stack. These drivers are
skipped over for calls to cn_checkc() and cn_putc(), and the
calling function simply moves on to the next available console.
a fair bit of difference to the power consumption and lets my cpu cool
down enough for the temperature sensitive fan controller to completely
stop the cpu fan at times.
halt state that minimizes power consumption while still preserving
cache and TLB coherency. Halting the processor is not conditional at
this time. Tested with UP and SMP kernels.
classes and if a method is not found in a given class, its base classes
are searched (in the order they were declared). This search is recursive,
i.e. a method may be define in a base class of a base class.
* Change the kobj method lookup algorithm to one which is SMP-safe. This
relies only on the constraint that an observer of a sequence of writes
of pointer-sized values will see exactly one of those values, not a
mixture of two or more values. This assumption holds for all processors
which FreeBSD supports.
* Add locking to kobj class initialisation.
* Add a simpler form of 'inheritance' for devclasses. Each devclass can
have a parent devclass. Searches for drivers continue up the chain of
devclasses until either a matching driver is found or a devclass is
reached which has no parent. This can allow, for instance, pci drivers
to match cardbus devices (assuming that cardbus declares pci as its
parent devclass).
* Increment __FreeBSD_version.
This preserves the driver API entirely except for one minor feature used
by the ISA compatibility shims. A workaround for ISA compatibility will
be committed separately. The kobj and newbus ABI has changed - all modules
must be recompiled.
rounding errors. This was the source of the majority of the
interactivity problems. Reintroduce the old algorithm and its XXX.
- Up the interactivity threshold to 30. It really could stand to be even
a tiny bit higher.
- Let the sleep and run time accumulate up to 5 seconds of history rather
than two. This helps stop XFree86 from becoming non-interactive during
bursts of activity.
elevated either due to priority propagation or because we're in the
kernel in either case, put us on the current queue so that we dont
stop others from using important resources. At some point the priority
elevations from sleeping in the kernel should go away.
- Remove an optimization in sched_userret(). Before we would only set
NEEDRESCHED if there was something of a higher priority available. This
is a trivial optimization and it breaks priority propagation because it
doesn't take threads which we may be blocking into account. Notice that
the thread which is blocking others gets up to one tick of cpu time before
we honor this NEEDRESCHED in sched_clock().
accesses softc after it is freed. Use a different malloc type for
softc than the rest of the bus code to make it more clear when these
things happen that it is the driver that's at fault, not the bus code.
Suggested by: sam and/or phk (I think)
you on the current queue. In the future, it would be nice if priority
propagation could deterministicly pluck a thread off of the next queue
and put it on the current queue. Until then this hack stops us from
holding up our entire current queue, including interrupt handlers, while
a thread on the next queue is blocked while holding Giant.
- Inherit our pctcpu information from our parent.
kqueue write events on a socket and you regularly create tons of pipes
which overwrites the structure causing a panic when removing the knote
from the list. If the peer has gone away (and it's a write knote), then
don't bother trying to remove the knote from the list.
Submitted by: Brian Buchanan and myself
Obtained from: nCircle
caused snapshot related problems.
- The vp can not be NULL here or we would panic in vfs_bio_awrite(). Stop
confusing the logic by checking for it in several places.
Submitted by: kirk and then rototilled by me to remove vp == NULL checks.
Use pre-emption detection to avoid the need for wiring a userland buffer
when copying opaque data structures.
sysctl_wire_old_buffer() is now a no-op. Other consumers of this
API should use pre-emption detection to notice update collisions.
vslock() and vsunlock() should no longer be called by any code
and should be retired in subsequent commits.
Discussed with: pete, phk
MFC after: 1 week
go away in due course. Involuntary pre-emption means that we can't count
on wiring of pages alone for consistency when performing a SYSCTL_OUT()
bigger than PAGE_SIZE.
Discussed with: pete, phk
LK_RETRY either, we don't want this vnode if it turns into another.
- Remove the code that checks the mount point after acquiring the lock
we are guaranteed to either fail or get the vnode that we wanted.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
passed. We won't vgonel if someone has either acquired a hold or usecount
or started the vgone process elsewhere. This is because we may have been
removed from the free list while we were inspecting the vnode for
recycling.
- The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
the same vnode. To further reduce the likelyhood of this event, requeue
the vnode on the tail of the list prior to calling vtryrecycle(). We can
not actually remove the vnode from the list until we know that it's
going to be recycled because other interlock holders may see the VI_FREE
flag and try to remove it from the free list.
- Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it
regardless of MNT_WAIT because the vnode does not actually belong to
this filesystem.
purge, the purge in vclean, and the filesystems purge, we had 3 purges
per vnode.
- Move the insmntque(vp, 0) to vclean() so that we may remove it from the
two vgone() functions and reduce the number of lock operations required.
whether or not the sync failed. This could potentially get set between
the time that we VOP_UNLOCK and VI_LOCK() but the race would harmelssly
lead to the sync being delayed by an extra 30 seconds. If we do not move
the vnode it could cause an endless loop if it continues to fail to sync.
- Use vhold and vdrop to stop the vnode from changing identities while we
have it unlocked. Other internal vfs lists are likely to follow this
scheme.
- Create a new function, vgonechrl(), which performs vgone for an in-use
character device. Move the code from vflush() that did this into
vgonechrl().
- Hold the xlock across the entirety of vgonel() and vgonechrl() so that
at no point will an invalid vnode exist on any list without XLOCK set.
- Move the xlock code out of vclean() now that it is in the vgone*()
functions.
This is so that we may grab the interlock while still holding the
sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock
order runs the other way.
- If we don't meet any of the preconditions, reinsert the vp into the
list for the next second.
- We don't need to panic if we fail to sync here because each FSYNC
function handles this case. Removing this redundant code also
simplifies locking.
fail. Remove the panic from that case and document why it might fail.
- Document the reason for calling cache_purge() on a newly created vnode.
- In insmntque() order the operations so that we can call mtx_unlock()
one fewer times. This makes the code somewhat clearer as well.
- Add XXX comments in sched_sync() and vflush().
- In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
set.
- In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
call. It's ok if we race here, the vinvalbuf will just do nothing.
- Increase the scope of the lock in vgonel() to reduce the number of lock
operations that are performed.
validating the offset within a given memory buffer before handing the
real work off to uiomove(9).
Use uiomove_frombuf in procfs to correct several issues with
integer arithmetic that could result in underflows/overflows. As a
side-effect, the code is significantly simplified.
Add additional sanity checks when computing a memory allocation size
in pfs_read.
Submitted by: rwatson (original uiomove_frombuf -- bugs are mine :-)
Reported by: Joost Pol <joost@pine.nl> (integer underflows/overflows)
fd_cmask field in the file descriptor structure for the first process
indirectly from CMASK, and when an fd structure is initialized before
being filled in, and instead just use CMASK. This appears to be an
artifact left over from the initial integration of quotas into BSD.
Suggested by: peter
the TLB and ~1600 if it is not. Therefore, it is more effecient to
invalidate the TLB after operations that use CMAP rather than before.
- So that the tlb is invalidated prior to switching off of a processor, we
must change the switchin functions to switchout functions.
- Remove td_switchout from the thread and move it to the x86 pcb.
- Move the code that calls switchout into swtch.s. These changes make this
optimization truely x86 specific.
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.
Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. we need to include or cover the faulting
address.
Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.
Tested on: alpha, i386, ia64, sparc64
do exactly the same as vop_nopoll() for consistency and put a
comment in the two pointing at each other.
Retire seltrue() in favour of no_poll().
Create private default functions in kern_conf.c instead of public
ones.
Change default strategy to return the bio with ENODEV instead of
doing nothing which would lead the bio stranded.
Retire public nullopen() and nullclose() as well as the entire band
of public no{read,write,ioctl,mmap,kqfilter,strategy,poll,dump}
funtions, they are the default actions now.
Move the final two trivial functions from subr_xxx.c to kern_conf.c
and retire the now empty subr_xxx.c
provide no methods does not make any sense, and is not used by any
driver.
It is a pretty hard to come up with even a theoretical concept of
a device driver which would always fail open and close with ENODEV.
Change the defaults to be nullopen() and nullclose() which simply
does nothing.
Remove explicit initializations to these from the drivers which
already used them.
consdev structure.
If the consdev name is not set and we have a cn_dev, set the name
from there. Try to issue a printf about this, even though it may
not have a place to go.
Modify the sysctl related code to pick up the name from the consdev
instead.
systems where the data/stack/etc limits are too big for a 32 bit process.
Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.
Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 heirarchy.
Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.
Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.
Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.
freed belong to the kernel object.)
- Increase the granularity of the vm object locking in vm_hold_load_pages()
in order to reduce the number of times that we acquire and release the
same lock.
Doing so creates a race where the buf is on neither list.
- Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we
should.
- Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this
really should not happen and is fast to check.
special signal-delivery protections for setugid processes. In the
event that a system is relying on "unusual" signal delivery to
processes that change their credentials, this can be used to work
around application problems.
Also, add SIGALRM to the set of signals permitted to be delivered to
setugid processes by unprivileged subjects.
Reported by: Joe Greco <jgreco@ns.sol.net>
pmap_extract_and_hold(). Note, however, that GIANT_REQUIRED should not be
removed until all platforms fully implement the "prot" parameter to
pmap_extract_and_hold().
Reviewed by: tegge
about interrupt trigger mode and interrupt polarity. This allows ACPI
for example to pass interrupt resource information up the hierarchy.
The default implementation of the method therefore is to pass the
request to the parent.
Reviewed by: jhb, njl
specified directory is not found in the mount list. Before the
MNT_BYFSID changes, unmount(2) used to return ENOENT for a nonexistent
path and EINVAL for a non-mountpoint, but we can no longer distinguish
between these cases. Of the two error codes, EINVAL was more likely
to occur in practice, and it was the only one of the two that was
documented.
Update the manual page to match the current behaviour.
Suggested by: tjr
Reviewed by: tjr
and/or INTR_FAST. This belongs elsehwere and perhaps under bootverbose;
I'm committing it for now as it's uesful to know which drivers have
been converted and which have not.
out of cdregister() and daregister(), which are run from interrupt context.
The sysctl code does blocking mallocs (M_WAITOK), which causes problems
if malloc(9) actually needs to sleep.
The eventual fix for this issue will involve moving the CAM probe process
inside a kernel thread. For now, though, I have fixed the issue by moving
dynamic sysctl variable creation for these two drivers to a task queue
running in a kernel thread.
The existing task queues (taskqueue_swi and taskqueue_swi_giant) run in
software interrupt handlers, which wouldn't fix the problem at hand. So I
have created a new task queue, taskqueue_thread, that runs inside a kernel
thread. (It also runs outside of Giant -- clients must explicitly acquire
and release Giant in their taskqueue functions.)
scsi_cd.c: Remove sysctl variable creation code from cdregister(), and
move it to a new function, cdsysctlinit(). Queue
cdsysctlinit() to the taskqueue_thread taskqueue once we
have fully registered the cd(4) driver instance.
scsi_da.c: Remove sysctl variable creation code from daregister(), and
move it to move it to a new function, dasysctlinit().
Queue dasysctlinit() to the taskqueue_thread taskqueue once
we have fully registered the da(4) instance.
taskqueue.h: Declare the new taskqueue_thread taskqueue, update some
comments.
subr_taskqueue.c:
Create the new kernel thread taskqueue. This taskqueue
runs outside of Giant, so any functions queued to it would
need to explicitly acquire/release Giant if they need it.
cd.4: Update the cd(4) man page to talk about the minimum command
size sysctl/loader tunable. Also note that the changer
variables are available as loader tunables as well.
da.4: Update the da(4) man page to cover the retry_count,
default_timeout and minimum_cmd_size sysctl variables/loader
tunables. Remove references to /dev/r???, they aren't used
any longer.
cd.9: Update the cd(9) man page to describe the CD_Q_10_BYTE_ONLY
quirk.
taskqueue.9: Update the taskqueue(9) man page to describe the new thread
task queue, and the taskqueue_swi_giant queue.
MFC after: 3 days
Changes from the original implementation:
- Fragmentation is handled by the function m_fragment, which can
be called from whereever fragmentation is needed. Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.
- m_fragment works slightly differently from the old fragmentation
code in that it allocates a seperate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments. While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.
- Add two modes of random fragmentation. Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.
o remove irrlevant spl
Notes:
1. We don't lock domain list traversals as this is safe until we start
removing domains.
2. The calculation of max_datalen in net_init_domain appears safe as
noone depends on max_hdr and max_datalen having consistent values.
3. Giant is still held for fast and slow timeouts; this must stay until
each timeout routine is properly locked (coming soon).
Sponsored by: FreeBSD Fondation
bail out if the buffer is not already present.
- The buffer returned by incore() is not locked and should not be sent to
brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the
desired semantics.
sockets into machine-dependent files. The rationale for this
migration is illustrated by the modified amd64 allocator. It uses the
amd64's direct map to avoid emphemeral mappings in the kernel's
address space. On an SMP, the emphemeral mappings result in an IPI
for TLB shootdown for each transmitted page. Yuck.
Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
interlock.
- Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
buffers. Check for the BKGRDINPROG flag before recycling or throwing
away a buffer. We do this instead because it is not safe for us to move
the original buffer to a new queue from the callback on the background
write buffer.
- Remove the B_LOCKED flag and the locked buffer queue. They are no longer
used.
- The vnode interlock is used around checks for BKGRDINPROG where it may
not be strictly necessary. If we hold the buf lock the a back-ground
write will not be started without our knowledge, one may only be
completed while we're not looking. Rather than remove the code, Document
two of the places where this extra locking is done. A pass should be
done to verify and minimize the locking later.
mac_reflect_mbuf_icmp()
mac_reflect_mbuf_tcp()
These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket. For example, in respond to a ping or a RST
packet to a SYN on a closed port.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
explicit access control checks to delete and list extended attributes
on a vnode, rather than implicitly combining with the setextattr and
getextattr checks. This reflects EA API changes in the kernel made
recently, including the move to explicit VOP's for both of these
operations.
Obtained from: TrustedBSD PRoject
Sponsored by: DARPA, Network Associates Laboratories
MAC_DEBUG_COUNTER_INC() and MAC_DEBUG_COUNTER_DEC() to maintain
debugging counter values rather than #ifdef'ing the atomic
operations to MAC_DEBUG.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
represents the pruely stylistic changes and should have no net impact
on the rest of the code.
bde's more substantive changes will follow in a separate commit once
we've come to closure on them.
Submitted by: bde
ntp_update_second twice when we have a large step in case that step
goes across a scheduled leap second. The only way this could happen
would be if we didn't call tc_windup over the end of day on the day of
a leap second, which would only happen if timeouts were delayed for
seconds. While it is an edge case, it is an important one to get
right for my employer.
Sponsored by: Timing Solutions Corporation
prototypes of cpu_halt(), cpu_reset() and swi_vm() from md_var.h to
cpu.h. This affects db_command.c and kern_shutdown.c.
ia64: move all MD prototypes from cpu.h to md_var.h. This affects
madt.c, interrupt.c and mp_machdep.c. Remove is_physical_memory().
It's not used (vm_machdep.c).
alpha: the MD prototypes have been left in cpu.h with a comment
that they should be there. Moving them is left for later. It was
expected that the impact would be significant enough to be done in
a seperate commit.
powerpc: MD prototypes left in cpu.h. Comment added.
Suggested by: bde
Tested with: make universe (pc98 incomplete)
A timecounter will be selected when registered if its quality is
not negative and no less than the current timecounters.
Add a sysctl to report all available timecounters and their qualities.
Give the dummy timecounter a solid negative quality of minus a million.
Give the i8254 zero and the ACPI 1000.
The TSC gets 800, unless APM or SMP forces it negative.
Other timecounters default to zero quality and thereby retain current
selection behaviour.
- Update some stale comments.
- Sort a couple of includes.
- Only set 'newcpu' in updatepri() if we use it.
- No functional changes.
Obtained from: bde (via an old diff I got a long time ago)
check for permissions, do it for all requests, not the known requests.
Later when we actually service the request we deal with the invalid
requests we previously caught earlier.
This commit changes the behaviour of the ptrace(2) interface for
boundary cases such as an unknown request without proper permissions.
Previously we would return EINVAL. Now we return EBUSY or EPERM.
Platforms need to define __HAVE_PTRACE_MACHDEP when they have MD
requests. This makes the prototype of cpu_ptrace() visible and
introduces a call to this function for all requests greater or
equal to PT_FIRSTMACH.
Silence on: audit
kobj global method table; also kassert that the table has not overflowed
when defining a new method.
there are indications that the table is being overflowed in certain
situations as we gain more kobj consumers- this will allow us to check
whether kobj is at fault. symptoms would be incorrect methods being called.
cpu_switch() where both the old and new threads are passed in as
arguments. Only powerpc uses the old conventions now.
- Update comments in the Alpha swtch.s to reflect KSE changes.
Tested by: obrien, marcel
- All those diffs to syscalls.master for each architecture *are*
necessary. This needed clarification; the stub code generation for
mlockall() was disabled, which would prevent applications from
linking to this API (suggested by mux)
- Giant has been quoshed. It is no longer held by the code, as
the required locking has been pushed down within vm_map.c.
- Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES
to express their intention explicitly.
- Inspected at the vmstat, top and vm pager sysctl stats level.
Paging-in activity is occurring correctly, using a test harness.
- The RES size for a process may appear to be greater than its SIZE.
This is believed to be due to mappings of the same shared library
page being wired twice. Further exploration is needed.
- Believed to back out of allocations and locks correctly
(tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).
PR: kern/43426, standards/54223
Reviewed by: jake, alc
Approved by: jake (mentor)
MFC after: 2 weeks
From alc:
Move pageable pipe memory to a seperate kernel submap to avoid awkward
vm map interlocking issues. (Bad explanation provided by me.)
From me:
Rework pipespace accounting code to handle this new layout, and adjust
our default values to account for the fact that we now have a solid
limit on allocations.
Also, remove the "maxpipes" limit, as it no longer has a purpose.
(The limit on kva usage solves the problem of having two many pipes.)
application could cause a wired page to be freed. In general,
vm_page_hold() should be preferred for ephemeral kernel mappings of pages
borrowed from a user-level address space. (vm_page_wire() should really be
reserved for indefinite duration pinning by the "owner" of the page.)
Discussed with: silby
Submitted by: tegge
psignal()/tdsignal(). The test was historically in psignal(). It was
changed into a KASSERT, and then later moved to tdsignal() when the
latter was introduced.
Reviewed by: iedowse, jhb
ioctls.
In the particular case of ptrace(), this commit more-or-less reverts
revision 1.53 of sys_process.c, which appears to have been erroneous.
Reviewed by: iedowse, jhb
a reference to the containing object. The purpose of the reference
being to prevent the destruction of the object and an attempt to free
the wired page. (Wired pages can't be freed.) Unfortunately, this
approach does not work. Some operations, like fork(2) that call
vm_object_split(), can move the wired page to a difference object,
thereby making the reference pointless and opening the possibility
of the wired page being freed.
A solution is to use vm_page_hold() in place of vm_page_wire(). Held
pages can be freed. They are moved to a special hold queue until the
hold is released.
Submitted by: tegge
semaphore and doing so can lead to a possible reversal. WITNESS would have
caught this if semaphores were used more often in the kernel.
Submitted by: Ted Unangst <tedu@stanford.edu>, Dawson Engler
connection is to be established asynchronously, behave as in the
case of non-blocking mode:
- keep the SS_ISCONNECTING bit set thus indicating that
the connection establishment is in progress, which is the case
(clearing the bit in this case was just a bug);
- return EALREADY, instead of the confusing and unreasonable
EADDRINUSE, upon further connect(2) attempts on this socket
until the connection is established (this also brings our
connect(2) into accord with IEEE Std 1003.1.)
than i386 or AMD64, TP register points to thread mailbox, and they can not
atomically clear km_curthread in kse mailbox, in this case, thread retrieves
its thread pointer from TP register and sets flag TMF_NOUPCALL in its thread
mailbox to indicate a critical region.
use vrele() instead of vput() on the parent directory vnode returned
by namei() in the case where it is equal to the target vnode. This
handles namei()'s somewhat strange (but documented) behaviour of
not locking either vnode when the two vnodes are equal and LOCKPARENT
but not LOCKLEAF is specified.
Note that since a vnode double-unlock is not currently fatal, these
coding errors were effectively harmless.
Spotted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
Reviewed by: mckusick
malloc and mbuf allocation all not requiring Giant.
1) ostat, fstat and nfstat don't need Giant until they call fo_stat.
2) accept can copyin the address length without grabbing Giant.
3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit.
4) move Giant grabbing from each indivitual recv* syscall to recvit.
support routines in kern_acl.c:
- Define ACL_OVERRIDE_MASK and ACL_PRESERVE_MASK centrally in acl.h: the
mode bits that are (and aren't) stored in the ACL.
- Add acl_posix1e_acl_to_mode(): given a POSIX.1e extended ACL, generate
a compatibility mode (only the bits supported by the POSIX.1e ACL).
- acl_posix1e_newfilemode(): Given a requested creation mode and default
ACL, calculate the mode for the new file system object (only the bits
supported by the POSIX.1e ACL).
PR: 50148
Reported by: Ritz, Bruno <bruno_ritz@gmx.ch>
Obtained from: TrustedBSD Project
another thread. We use the td_oncpu member of the other field to locate
it's associated CPU and then search the that CPU's list of spin locks
contained in its per-CPU data. This is not always safe and may in fact
panic or just not work, but it is useful in at least one case.
on a non-blocking pipe in cases where select(2) returns the file
descriptor as ready for write. This in turns causes libc_r, for
one, to busy wait in such cases.
Note: it is a quick performance fix, a more complex fix might be
required in case this turns out to have unexpected side effects.
Reviewed by: silby
MFC after: 3 days
thread's pid to make debugging easier for people who don't want to have to
use the intended tool for these panics (witness).
Indirectly prodded by: kris
a long-standing mistake in the way a portion of a pipe's KVA is
allocated. Specifically, kmem_alloc_pageable() is inappropriate
for use in the "direct" case because it allows a preceding vm map entry
and vm object to be extended to support the new KVA allocation.
However, the direct case KVA allocation should not have a backing
vm object. This is corrected by using kmem_alloc_nofault().
Submitted by: tegge (with the above explanation by me)
kernel ACL interfaces and system call names.
Break out UFS2 and FFS extattr delete and list vnode operations from
setextattr and getextattr to deleteextattr and listextattr, which
cleans up the implementations, and makes the results more readable,
and makes the APIs more clear.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
kern.file sysctl, don't return information about processes that
fail p_cansee(td, p). This prevents sockstat and related
programs from seeing file descriptors owned by processes not
in the same jail as the thread, as well as having implications
for MAC, etc.
This is a partial solution: it permits an information leak about
the number of descriptors in the sizing calculation (but this is
not new information, you can also get it from kern.openfiles),
and doesn't attempt to mask file descriptors based on the
properties of the descriptor, only the process referencing it.
However, it provides most of what you want under most
circumstances, without complicating the locking.
PR: 54211
Based on a patch submitted by: Pawel Jakub Dawidek <nick@garage.freebsd.pl>
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
comes across it, it will turn into a core dump in userland instead of
a kernel panic. I had also inverted the sense of the test, so
Double pointy hat to: mtm
- Make m_prepend use m_gethdr instead of m_get where
appropriate
- Make m_copym use m_gethdr instead of m_get where
appropriate
- Add a call to m_fixhdr in m_defrag; m_defrag can't
deal with corrupted pkthdr.len counts.
MFC after: 3 days
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().
The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it. knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).
PR: kern/54331
When a signal is being delivered to process, first find a sigwait
thread to deliver, POSIX's argument is speed of delivering signal
to sigwait thread is faster than other ways. A signal in its wait
set will cause sigwait to return the signal number, a signal not
in its wait set but in not blocked by the thread also causes sigwait
to return, but sigwait returns EINTR, sigwait is oneshot operation,
only one signal can be delivered to its wait set, when a signal is
delivered to the sigwait thread, the thread's sigwait state is canceled.
an appropriate error number after a failure condition.
In particular, three of the changed statements return ESRCH for a
failed pfind(), and in also three places a non-zero return
from p_cansee() will be passed back,
Also noticed by: rwatson
1. There was a race condition between a thread unlocking
a umtx and the thread contesting it. If the unlocking
thread won the race it may try to wakeup a thread that
was not yet in msleep(). The contesting thread would then
go to sleep to await a wakeup that would never come. It's
not possible to close the race by using a lock because
calls to casuptr() may have to fault a page in from swap.
Instead, the race was closed by introducing a flag that
the unlocking thread will set when waking up a thread.
The contesting thread will check for this flag before
going to sleep. For now the flag is kept in td_flags,
but it may be better to use some other member or create
a new one because of the possible performance/contention
issues of having to own sched_lock. Thanks to jhb for
pointing me in the right direction on this one.
2. Once a umtx was contested all future locks and unlocks
were happening in the kernel, regardless of whether it
was contested or not. To prevent this from happening,
when a thread locks a umtx it checks the queue for that
umtx and unsets the contested bit if there are no other
threads waiting on it. Again, this is slightly more
complicated than it needs to be because we can't hold
a lock across casuptr(). So, the thread has to check
the queue again after unseting the bit, and reset the
contested bit if it finds that another thread has put
itself on the queue in the mean time.
3. Remove the if... block for unlocking an uncontested
umtx, and replace it with a KASSERT. The _only_ time
a thread should be unlocking a umtx in the kernel is
if it is contested.
large to huge amounts of small or medium sized receive buffers. The problem
with these situations is that they eat up the available DMA address space
very quickly when using mbufs or even mbuf clusters. Additionally this
facility provides a direct mapping between 32-bit integers and these buffers.
This is needed for devices originally designed for 32-bit systems. Ususally
the virtual address of the buffer is used as a handle to find the buffer as
soon as it is returned by the card. This does not work for 64-bit machines
and hence this mapping is needed.
multiple mutex pools with different options and sizes. Mutex pools can
be created with either the default sleep mutexes or with spin mutexes.
A dynamically created mutex pool can now be destroyed if it is no longer
needed.
Create two pools by default, one that matches the existing pool that
uses the MTX_NOWITNESS option that should be used for building higher
level locks, and a new pool with witness checking enabled.
Modify the users of the existing mutex pool to use the appropriate pool
in the new implementation.
Reviewed by: jhb
immediately after the kernel map has been sized, and is
the optimal place for the autosizing of memory allocations
which occur within the kernel map to occur.
Suggested by: bde
- Use atomic ops to update the bigpipe count
- Make the bigpipe count sysctl readable
- Remove a duplicate comparison in an if statement
- Comment two SYSCTLs.
than the shortcircuited version I had been using, which only worked
properly on i386 & amd64.
Also, change an autoscale constant to account for the more correct
kmem_map size.
Problem noticed by: mux
- Limit the total number of pipes so that we do not
exhaust all vm objects in the kernel map. When
this limit is reached, a ratelimited message will
be printed to the console.
- Put a soft limit on the amount of memory consumable
by pipes. Once the limit has been reached, all new
pipes will be limited to 4K in size, rather than the
default of 16K.
- Put a limit on the number of pages that may be used
for high speed page flipping in order to reduce the
amount of wired memory. Pipe writes that occur
while this limit is exceeded will fall back to
non-page flipping mode.
The above values are auto-tuned in subr_param.c and
are scaled to take into account both the size of
physical memory and the size of the kernel map.
These limits help to reduce the "kernel resources exhausted"
panics that could be caused by opening a large
number of pipes. (Pipes alone are no longer able
to exhaust all resources, but other kernel memory hogs
in league with pipes may still be able to do so.)
PR: 53627
Ideas / comments from: hsu, tjr, dillon@apollo.backplane.com
MFC after: 1 week
notice another typo in the same line. This typo makes libthr unuseable,
but it's effects where counter-balanced by the extra semicolon, which
made libthr remarkably useable for the past several months.
- Associate logical CPUs on the same physical core with the same kseq.
- Adjust code that assumed there would only be one running thread in any
kseq.
- Wrap the HTT code with a ULE_HTT_EXPERIMENTAL ifdef. This is a start
towards HyperThreading support but it isn't quite there yet.
as the target process' pid, it may exist if the process forked before leaving
the pgrp.
Thix fixes a panic that happens when calling setpgid to make a process
re-enter the pgrp with the same pgid as its pid if the pgrp still exists.
be delivered to that thread, regardless of whether it
has it masked or not.
Previously, if the targeted thread had the signal masked,
it would be put on the processes' siglist. If
another thread has the signal umasked or unmasks it before
the target, then the thread it was intended for would never
receive it.
This patch attempts to solve the problem by requiring callers
of tdsignal() to say whether the signal is for the thread or
for the process. If it is for the process, then normal processing
occurs and any thread that has it unmasked can receive it.
But if it is destined for a specific thread, it is put on
that thread's pending list regardless of whether it is currently
masked or not.
The new behaviour still needs more work, though. If the signal
is reposted for some reason it is always posted back to the
thread that handled it because the information regarding the
target of the signal has been lost by then.
Reviewed by: jdp, jeff, bde (style)
locks held by each thread.
- Fix a bug in the original BSD/OS code where a contested lock was not
properly handed off from the old thread to the new thread when a
contested lock with more than one blocked thread was transferred from
one thread to another.
- Don't use an atomic operation to write the MTX_CONTESTED value to
mtx_lock in the aforementioned special case. The memory barriers and
exclusion provided by sched_lock are sufficient.
Spotted by: alc (2)
system by specifying the file system ID instead of a path. Use this
by default in umount(8). This avoids the need to perform any vnode
operations to look up the mount point, so it makes it possible to
unmount a file system whose root vnode cannot be looked up (e.g.
due to a dead NFS server, or a file system that has become detached
from the hierarchy because an underlying file system was unmounted).
It also provides an unambiguous way to specify which file system is
to be unmunted.
Since the ability to unmount using a path name is retained only for
compatibility, that case now just uses a simple string comparison
of the supplied path against f_mntonname of each mounted file system.
Discussed on: freebsd-arch
mdoc help from: ru
happens to work on 32-bit platforms as sizeof(long)=sizeof(int), but
wrecks all kinds of havoc (garbage reads, corrupting writes and
misaligned loads/stores) on 64-bit architectures.
The fix for now is to use fuword32() and suword32() and change the
type of the applicable int fields to int32. This is to make it
explicit that we depend on these fields being 32-bit. We may want
to revisit this later.
Reviewed by: deischen
or unblock a thread in kernel, and allow UTS to specify whether syscall
should be restarted.
o Add ability for UTS to monitor signal comes in and removed from process,
the flag PS_SIGEVENT is used to indicate the events.
o Add a KMF_WAITSIGEVENT for KSE mailbox flag, UTS call kse_release with
this flag set to wait for above signal event.
o For SA based thread, kernel masks all signal in its signal mask, let
UTS to use kse_thr_interrupt interrupt a thread, and install a signal
frame in userland for the thread.
o Add a tm_syncsig in thread mailbox, when a hardware trap occurs,
it is used to deliver synchronous signal to userland, and upcall
is schedule, so UTS can process the synchronous signal for the thread.
Reviewed by: julian (mentor)
POSIX says siginfo pointer parameter can be NULL and if the
function success, it should return signal number but not zero.
The waitset it past should be negatived before it can be
used as thread signal mask.
nice distribution without significantly impacting interactive response.
As a side effect it should also allow batch processes to run for a
slightly longer period which will positively impact their performance.
This commit has two pieces. One half is the watchdog kernel code which lives
primarily in hardclock() in sys/kern/kern_clock.c. The other half is a userland
daemon which, when run, will keep the watchdog from firing while the userland
is intact and functioning.
Approved by: jeff (mentor)
Before, we would add/subtract the leap second when the system had been
up for an even multiple of days, rather than at the end of the day, as
a leap second is defined (at least wrt ntp). We do this by
calculating the notion of UTC earlier in the loop, and passing that to
get it adjusted. Any adjustments that ntp_update_second makes to this
time are then transferred to boot time. We can't pass it either the
boot time or the uptime because their sum is what determines when a
leap second is needed. This code adds an extra assignment and two
extra compare in the typical case, which is as cheap as I could made
it.
I have confirmed with this code the kernel time does the correct thing
for both positive and negative leap seconds. Since the ntp interface
doesn't allow for +2 or -2, those cases can't be tested (and the folks
in the know here say there will never be a +2s or -2s leap event, but
rather two +1s or -1s leap events).
There will very likely be no leap seconds for a while, given how the
earth is speeding up and slowing down, so there will be plenty of time
for this fix to propigate. UT1-UTC is currently at "about -0.4s" and
decrementing by .1s every 8 months or so. 6 * 8 is 48 months, or 4
years.
-stable has different code, but a similar bug that was introduced
about the time of the last leap second, which is why nobody has
noticed until now.
MFC After: 3 weeks
Reviewed by: phk
"Furthermore, leap seconds must die." -- Cato the Elder
incremented at the start of the leap second, not after the leap second
has been inserted. This is because at the start of the leap second,
we set the time back one second. This setting back one second is the
moment that the offset changes. The old code set it back after the
leap second, but that's one second too late. The negative leap second
case is handled correctly.
Reviewed by: phk