- Trade off granularity to reduce overhead, since the current model
doesn't appear to reduce contention substantially: move to a single
harvest mutex protecting harvesting queues, rather than one mutex
per source plus a mutex for the free list.
- Reduce mutex operations in a harvesting event to 2 from 4, and
maintain lockless read to avoid mutex operations if the queue is
full.
- When reaping harvested entries from the queue, move all entries from
the queue at once, and when done with them, insert them all into a
thread-local queue for processing; then insert them all into the
empty fifo at once. This reduces O(4n) mutex operations to O(2)
mutex operations per wakeup.
In the future, we may want to look at re-introducing granularity,
although perhaps at the granularity of the source rather than the
source class; both the new and old strategies would cause contention
between different instances of the same source (i.e., multiple
network interfaces).
Reviewed by: markm
reaching into the socket buffer. This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it). Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although haven't
seen it reported).
RELENG_5 candidate.
and make it visible (same way as in OpenBSD). Describe usage in manpage.
This change is useful for creating custom free methods, which
call default free method at their end.
While here, make malloc declaration for mbuf tags more informative.
Approved by: julian (mentor), sam
MFC after: 1 month
when the spin lock in question isn't -- it's the critical_enter() that
KDB set. No more panic in DDB for console -> syscons -> tty -> knote
operations.
be used to announce various system activity.
The auxio device provides auxiliary I/O functions and is found on various
SBus/EBus UltraSPARC models. At present, only front panel LED is
controlled by this driver.
Approved by: jake (mentor)
Reviewed by: joerg
Tested by: joerg
state management corruption, mbuf leaks, general mbuf corruption,
and at least on i386 a first level splash damage radius that
encompasses up to about half a megabyte of the memory after
an mbuf cluster's allocation slab. In short, this has caused
instability nightmares anywhere the right kind of network traffic
is present.
When the polymorphic refcount slabs were added to UMA, the new types
were not used pervasively. In particular, the slab management
structure was turned into one for refcounts, and one for non-refcounts
(supposed to be mostly like the old slab management structure),
but the latter was almost always used through out. In general, every
access to zones with UMA_ZONE_REFCNT turned on corrupted the
"next free" slab offset offset and the refcount with each other and
with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).
Fix things so that the right type is used to access refcounted zones
where it was not before. There are additional errors in gross
overestimation of padding, it seems, that would cause a large kegs
(nee zones) to be allocated when small ones would do. Unless I have
analyzed this incorrectly, it is not directly harmful.
with acpi but the timer runs twice as fast. Note that the main problem
(system doesn't work properly with acpi disabled) should be fixed separately.
Changes:
* Add a quirk to disable the timer
* Merge the P5A and P5A-B quirks since they appear to be based on the
same ASL.
PR: i386/72450
Tested by: Kevin Oberman <oberman es.net>
MFC after: 3 days
We return ENOBUF to indicate the problem, which is an errno that should be
handled well everywhere.
Requested & Submitted by: green
Silently okay'ed by: The rest of the firewall gang
MFC after: 3 days
Restructure pmap_enter() to prevent the loss of a page modified (PG_M) bit
in a race between processors. (This restructuring assumes the newly atomic
pte_load_store() for correct operation.)
Reviewed by: tegge@
PR: i386/61852
problem I had but it's happening in code that is messing around with
register windows - I'm willing to live with that piece being sensitive
to this and it looks like the other problems we had reported lately
are not fixed by using -O instead of -O2.
Sorry for the churn. Looks like I need a second pointy hat. Someone
tells me they stack well. :-))))
This also adds support for bigger disks on the controller I have access to,
and maybe others if I understood the adhoc methods used on those.
Those with more PC98 bigdrive controllers it is hereby invited to add/fix
support for those in geom_pc98.c and not using #ifdef PC98 all over the place.
callouts as non-CALLOUT_MPSAFE. Otherwise, they may trigger an
assertion regarding Giant if they enter other parts of the stack from
the callout.
MFC after: 3 days
Reported by: Dikshie < dikshie at ppk dot itb dot ac dot id >
-O2 on kernel compiles after all. While working on adding a KASSERT
to sparc64/sparc64/rwindow.c I found that it was "position sensitive",
putting it above a call to flushw() instead of below caused corruption
of processes on the system. jake and jhb have both confirmed there is
no obvious explanation for that. The exact same kernel code does not
have the process corruption problem if compiled with -O instead of -O2.
There have been signs of similar issues floated on the sparc64@ mailing
list, lets see if this helps make them go away.
Note this isn't an optimal fix as far as the file format goes, if this
disgusts too many people I'll fix it the right way. Since compiling
with something other than -O is a known problem this format would prevent
a change to the default causing grief. And this may also help motivate
finding out what the compiler is doing wrong so we can shift back to
using -O2. :-)
My turn for the pointy hat... One of the florescent ones...
MFC after: 2 days
After some discussion the best option seems to be to signal the thread's
death from within the kernel. This requires that thr_exit() take an
argument.
Discussed with: davidxu, deischen, marcel
MFC after: 3 days
allocate unallocated memory resources from the top 32MB of the address
space rather than the top 2GB. While the latter works on some
chipsets, it fails badly on others. 32MB is more conservative and
matches what cheap harware from this era is hardwired to pass.
RAM. Many older, legacy bridges only allow allocation from this
range. This only appies to devices who don't have their memory
assigned by the BIOS (since we allocate the ranges so assigned
exactly), so should have minimal impact.
Hoewver, for CardBus bridges (cbb), they rarely get the resources
allocated by the BIOS, and this patch helps them greatly. Typically
the 'bad Vcc' messages are caused by this problem.
(and panic). To try to finish making BPF safe, at the very least,
the BPF descriptor lock really needs to change into a reader/writer
lock that controls access to "settings," and a mutex that controls
access to the selinfo/knote/callout. Also, use of callout_drain()
instead of callout_stop() (which is really a much more widespread
issue).
Discussion: this panic (or waning) only occurs when the kernel is
compiled with INVARIANTS. Otherwise the problem (which means that
the vp->v_data field isn't NULL, and represents a coding error and
possibly a memory leak) is silently ignored by setting it to NULL
later on.
Panicking here isn't very helpful: by this time, we can only find
the symptoms. The panic occurs long after the reason for "not
cleaning" has been forgotten; in the case in point, it was the
result of severe file system corruption which left the v_type field
set to VBAD. That issue will be addressed by a separate commit.
all other threads to suicide, problem is execve() could be failed, and
a failed execve() would change threaded process to unthreaded, this side
effect is unexpected.
The new code introduces a new single threading mode SINGLE_BOUNDARY, in
the mode, all threads should suspend themself at user boundary except
the singler. we can not use SINGLE_NO_EXIT because we want to start from
a clean state if execve() is successful, suspending other threads at unknown
point and later resuming them from there and forcing them to exit at user
boundary may cause the process to start from a dirty state. If execve() is
successful, current thread upgrades to SINGLE_EXIT mode and forces other
threads to suicide at user boundary, otherwise, other threads will be resumed
and their interrupted syscall will be restarted.
Reviewed by: julian
table. acpidump(8) concatenates the body of the DSDT and SSDTs so an
edited ASL will contain all the necessary information. We can't use a
completely empty table since ACPI-CA reports this as a problem.
MFC after: 3 days
This really doesn't belong here but is preferred (for the moment) over
adding yet another mechanism for sending msgs from the kernel to user apps.
Reviewed by: imp
the raw values including for child process statistics and only compute the
system and user timevals on demand.
- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.
Submitted by: bde (mostly)
MFC after: 1 month
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.
Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
turnstile chain lock until after making all the awakened threads
runnable. First, this fixes a priority inversion race. Second, this
attempts to finish waking up all of the threads waiting on a turnstile
before doing a preemption.
Reviewed by: Stephan Uphoff (who found the priority inversion race)
+ * 9: 0x3f0-0x3f3,0x3f4-0x3f5,0x3f7
This requires only one change to support. Rather than keying on the
size of the resource being 2, instead key off the end & 7 being 3.
This covers the same cases that the size of 2 would catch, but also
covers the new above case.
In addition, I think it is clearer to use the end in preference to the
size and start for case #8 as well. Turns two tests into one, and
catches no other cases.
Make minor commentary changes to deal with new case #9.
# This change is specifically minimal to allow easy MFC. A more
# extensive change will go into current once I've had a chance to test
# it on a lot of hardware...
with different file systems. This may cause ill things
with my previous fix. Now it translate fsid of direct child of
mount point directory only.
Pointed out by: Uwe Doering
that is no longer required. (In fact, it is not clear that it was ever
required in HEAD or RELENG_4, only RELENG_3 required a work-around.) Now,
as before revision 1.251, if the preexisting PTE is invalid, pmap_enter()
does not call pmap_invalidate_page() to update the TLB(s).
Note: Even with this change, the handling of a copy-on-write fault is
inefficient, in such cases pmap_enter() calls pmap_invalidate_page() twice.
Discussed with: bde@
PR: kern/16568
so would cause kernel to produce an unkillable process in some cases,
especially, P_STOPPED_SINGLE has a singling thread, turning off the
bit would mess the state.
need to mask off the page offset bits. (This operation made some sense
prior to i386/i386/pmap.c revision 1.254 when we passed a physical address
rather than a vm_page pointer to pmap_enter().)
Discussed extensively with KAME. The API author's intent isn't clear at this
point, so rather than remove the code entirely, #if 0 out and put a big
comment in for now. The IPV6_RECVPATHMTU sockopt is available if the
application wants to be notified of the path MTU to optimize packet sizes.
Thanks to JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp> for putting up
with my incessant badgering on this issue, and fenner for pointing out
the API issue and suggesting solutions.
data endpoints. The control endpoint doesn't need read/write/poll
operations, and more importantly, the thread counts should be
separate so that the control endpoint can properly reference itself
while deleting and recreating the data endpoints.
* Add some macros that handle referencing/releasing devices, and use them
for sleeping/woken-up and open/close operations as apppropriate.
* Use d_purge for FreeBSD, and a loop testing the open status for all
the endpoints for NetBSD and OpenBSD, so that when the device is
detached, the right thing always happens.
restart the current waiting transfer. If this isn't done, the device's
next transfer (that we would like to do a short read on) is going to
return an error -- for short transfer.
* For bulk transfer endpoints, restore the maximum transfer length each
time a transfer is done, or the first short transfer will make all the
rest that size or smaller.
* Remove impossibilities (malloc(M_WAITOK) == NULL, &var == NULL).
New device names are "{tty|cua}A$(card)$(port)[.init|.lock]"
Put a portname in the port structure if SI_DEBUG is defined to avoid
need to inspect minor number to construct name..
Constify some strings.
Remove duplicated DBG_ #defines.
uses predate the change in the pmap_enter() interface that replaced the
page's physical address by the address of its vm_page structure. The
PHYS_TO_VM_PAGE() was being used to compute the address of the same vm_page
structure that was being passed in.
is a no-op on little endian architectures, but fixes getting the MAC
address for some dc(4) cards on big endian architectures.
This is a RELENG_5 candidate.
Tested by: gallatin (powerpc), marius (sparc64)
First version of the patch written by: gallatin
1. Process p1 is currently being swapped in.
2. Process p2 calls linux_ptrace(PTRACE_GETFPXREGS, p1_pid, ...)
3. After acquiring a reference to FIRST_THREAD_IN_PROC(p1),
p2 blocks in faultin() while p1 finishes being swapped in.
This means p2 won't get back the lock on p1 until after p1's
threads are runnable.
4. After p1 is swapped in, the first thread in p1 exits.
5. p2 now uses its dangling reference to p1's first thread.
after loading such a kernel, "module_path" will be set to
an insane value. Fixed example by providing an equivalent
setting. For the record, when automatically loading a
kernel (commands "boot" and "boot-conf"), the following is
tried, in this order:
path=/boot/${kernel} file=${bootfile}
path=/boot/${kernel} file=${kernel}
path=${kernel} file=${bootfile}
path=${kernel} file=${kernel}
path=${module_path} file=${kernel}
MP machines (hopefully). CPU timers are OK on UP machines but we
don't keep the timers in sync on MP machines so if the CPU's timer
is chosen as the primary timecounter it's possible for time to
not be monotonically increasing because different CPU's counters
may be used at different times. But the CPU's counters are otherwise
one of the higher quality counters available. So, on UP machines
we'll use a relatively high quality value but on MP machines we'll
use a quality that should prevent the CPU's counters from being chosen.
Requested by: green (who did the first version of the patch)
Reviewed by: marius, green
MFC after: 1 week
the structure space had been obtained from malloc() its contents is
random garbage. The choice of value being set is part of a larger effort
to solve some timecounter issues on MP machines (while working on that
we noticed this problem).
Noticed by: marius
Reviewed by: marius, green
MFC after: 3 days