the interface as IFF_NEEDSGIANT so if_start is run holding Giant.
Note: this driver does declare and occasionally reference mutexes,
but I believe not nearly enough to provide safety.
code was adjusting twice for the instruction pointer indicating
the *next* instruction to execute. The aic79xx driver had a similar
bug, but was fixed some time ago.
check whether p_ucred is NULL or not in pfs_getattr() before
dereferencing the credential, and return ENOENT if there wasn't one.
This is a symptom of a larger problem, wherein pfind() can return
references to incompletely initialized processes, and we instead ought
to not return them, or check the process state before acting on the
process.
Reported by: kris
Discussed with: tjr, others
algorithm built into the map entry splay tree. This replaces the
first_free hint in struct vm_map with two fields in vm_map_entry:
adj_free, the amount of free space following a map entry, and
max_free, the maximum amount of free space in the entry's subtree.
These fields make it possible to find a first-fit free region of a
given size in one pass down the tree, so O(log n) amortized using
splay trees.
This significantly reduces the overhead in vm_map_findspace() for
applications that mmap() many hundreds or thousands of regions, and
has a negligible slowdown (0.1%) on buildworld. See, for example, the
discussion of a micro-benchmark titled "Some mmap observations
compared to Linux 2.6/OpenBSD" on -hackers in late October 2003.
OpenBSD adopted this approach in March 2002, and NetBSD added it in
November 2003, both with Red-Black trees.
Submitted by: Mark W. Krentel
* Serialize access to the sysctl routines and the notify handler
* Assert that the sx lock is held in any functions they call.
* Note that recursively calling to re-enable the hotkeys is sub-optimal.
* Remove the interrupt wrapper that locked Giant and call the handler
directly. Mark the handler as MPSAFE.
* Don't attempt to detect if a handler is installed. Leave that to the
bus_alloc_resource() function.
* Serialize operations in acpi_video_bind_outputs(), acpi_video_detach(),
acpi_video_notify_handler(), acpi_video_power_profile(), and the sysctls.
The main goal is to protect the shared device list and prevent conflicting
settings.
* Add assertions that the sx lock is held in the leaf functions.
* Restructure the event handling path. acpi_tz_thread() now calls
acpi_tz_timeout() any time an event occurs. acpi_tz_timeout() checks
the flags and calls acpi_tz_power_profile(), acpi_tz_establish(), and
acpi_tz_monitor() as appropriate. Notifies only do a wakeup and let
acpi_tz_thread() do the actual work. This path is cleaner and allows
locking since the call path is now always a D.A.G.
* Add the acpi_tz_signal() function to set flags and wake the thread.
* Remove the tz_tmp_updating flag since calls are serialized by
acpi_tz_thread().
* Remove Giant locking.
* Serialize acpi_pwr_switch_consumer() and acpi_pwr_wake_enable().
* Make acpi_pwr_switch_consumer() have a single exit point.
* Add assertions to the leaf functions they call.
* Fix a memory leak in acpi_pwr_deregister_consumer(). However, it is
currently ifdefed out so this code was unused.
* Serialize access to acpi_pci_link_config(), acpi_pci_find_prt(),
acpi_pci_link_route(), and acpi_pci_link_resume().
* Add lock assertions to all functions called by them.
* Serialize notifying the user in acpi_lid_notify_status_changed(). This
way multiple lid events occur in order.
* Add an initialization pass to get the lid status at boot-time. This
pass does not notify any apps but gets the initial status.
* Use the common serialization macros instead of rolling our own.
* Increase the coverage of the lock in EcSpaceHandler() to cover the entire
loop to avoid dropping the lock when reading more than one byte.
* Serialize ops in acpi_cmbat_notify_handler(), acpi_cmbat_ioctl(),
acpi_cmbat_init_battery(), and acpi_cmbat_get_battinfo().
* Get the softc directly in acpi_cmbat_get_total_battinfo() rather than
build an array of them.
* Don't queue a _BIF query after receiving a notify. Since we clear the
timespec, a _BIF query will be done in the context of the next caller.
* Add asserts to leaf functions that operate on shared data.
* Remove the bst/bif updating flags now that we hold the lock over the
full query.
* Explain various comments in more detail.
* Serialize acpi_battery_get_battdesc(), acpi_battery_register(), and
acpi_battery_remove().
* Assert that the sx lock is held in acpi_batteries_init().
* Remove check for device_get_softc() returning NULL.
* Serialize notification of acline changes in acpi_acad_get_status().
* Remove the initializing flag. With the locking, we don't need to
push off requests for the acline before initialization is done.
* Don't check device_get_softc(), it can't return NULL.
* Serialize calls to acpi_alloc_resource(), acpi_release_resource(),
acpi_Enable(), acpi_Disable(), and acpi_debug_sysctl().
* Acquire the ACPI mutex in acpi_register_ioctl(), acpi_deregister_ioctl(),
and acpiioctl().
* Acquire the mutex while disabling subsequent requests to enter a
sleep state in acpi_SetSleepState().
* Be sure to re-enable sleep requests and don't run resume methods when
the current request fails.
* Don't check if sleep requests are disabled in the ACPIIO_SETSLPSTATE
ioctl. acpi_SetSleepState() does this for us.
* Remove the acquisition of Giant from the struct cdevsw.
* Remove the ACPI_USE_THREADS option.
* Add and comment our locking primitives. The mutex primitives use a
a static mutex and the serialization ones use a static sx lock. A global
acpi_mutex is used for access to global resources (i.e., writes to the
SMI_CMD register.)
* Remove 4.x compat defines.
Since the only thing truly unique about a prison is it's ID, I figured
this would be the most granular way of handling this.
This commit makes the following changes:
- Adds tokenizing and parsing for the ``jail'' command line option
to the ipfw(8) userspace utility.
- Append the ipfw opcode list with O_JAIL.
- While Iam here, add a comment informing others that if they
want to add additional opcodes, they should append them to the end
of the list to avoid ABI breakage.
- Add ``fw_prid'' to the ipfw ucred cache structure.
- When initializing ucred cache, if the process is jailed,
set fw_prid to the prison ID, otherwise set it to -1.
- Update man page to reflect these changes.
This change was a strong motivator behind the ucred caching
mechanism in ipfw.
A sample usage of this new functionality could be:
ipfw add count ip from any to any jail 2
It should be noted that because ucred based constraints
are only implemented for TCP and UDP packets, the same
applies for jail associations.
Conceptual head nod by: pjd
Reviewed by: rwatson
Approved by: bmilekic (mentor)
to avoid later changes before pmap_enter() and vm_fault_prefault()
has completed.
Simplify deadlock avoidance by not blocking on vm map relookup.
In collaboration with: alc
subset ("compatible", "device_type", "model" and "name") of the standard
properties in drivers for devices on Open Firmware supported busses. The
standard properties "reg", "interrupts" und "address" are not covered by
this interface because they are only of interest in the respective bridge
code. There's a remaining standard property "status" which is unclear how
to support properly but which also isn't used in FreeBSD at present.
This ofw_bus kobj-interface allows to replace the various (ebus_get_node(),
ofw_pci_get_node(), etc.) and partially inconsistent (central_get_type()
vs. sbus_get_device_type(), etc.) existing IVAR ones with a common one.
This in turn allows to simplify and remove code-duplication in drivers for
devices that can hang off of more than one OFW supported bus.
- Convert the sparc64 Central, EBus, FHC, PCI and SBus bus drivers and the
drivers for their children to use the ofw_bus kobj-interface. The IVAR-
interfaces of the Central, EBus and FHC are entirely replaced by this. The
PCI bus driver used its own kobj-interface and now also uses the ofw_bus
one. The IVARs special to the SBus, e.g. for retrieving the burst size,
remain.
Beware: this causes an ABI-breakage for modules of drivers which used the
IVAR-interfaces, i.e. esp(4), hme(4), isp(4) and uart(4), which need to be
recompiled.
The style-inconsistencies introduced in some of the bus drivers will be
fixed by tmm@ in a generic clean-up of the respective drivers later (he
requested to add the changes in the "new" style).
- Convert the powerpc MacIO bus driver and the drivers for its children to
use the ofw_bus kobj-interface. This invloves removing the IVARs related
to the "reg" property which were unused and a leftover from the NetBSD
origini of the code. There's no ABI-breakage caused by this because none
of these driver are currently built as modules.
There are other powerpc bus drivers which can be converted to the ofw_bus
kobj-interface, e.g. the PCI bus driver, which should be done together
with converting powerpc to use the OFW PCI code from sparc64.
- Make the SBus and FHC front-end of zs(4) and the sparc64 eeprom(4) take
advantage of the ofw_bus kobj-interface and simplify them a bit.
Reviewed by: grehan, tmm
Approved by: re (scottl)
Discussed with: tmm
Tested with: Sun AX1105, AXe, Ultra 2, Ultra 60; PPC cross-build on i386
pf_cksum_fixup() was called without last argument from
normalization, also fixup checksum when random-id modifies ip_id.
This would previously lead to incorrect checksums for packets
modified by scrub random-id.
(Originally) Submitted by: yongari
The first one was going to 'dropfrag', which unlocks the IPQ, before the lock
was aquired; The second one doing a unlock and then a 'goto dropfrag' which
led to a double-unlock.
Tripped over by: des
migration. Use this in sched_prio() and sched_switch() to stop us from
migrating threads that are in short term sleeps or are runnable. These
extra migrations were added in the patches to support KSE.
- Only set NEEDRESCHED if the thread we're adding in sched_add() is a
lower priority and is being placed on the current queue.
- Fix some minor whitespace problems.
there is no irq link. Since we now use the stored copy of PRT, not the
one that used to be passed into acpi_pcib_route_interrupt(), we need it in
the list. [1]
Fix a bug in acpi_pci_find_prt() where we weren't checking the bus, thus
choosing the wrong PRT entry to use for routing the link. Also, add a
printf for the case where the PRT entry is not found as this should not
happen.
Tested by: marcel [1]
- Remove kern.geom.mirror.sync_block_size sysctl. It is quite obvious that we
want to use the biggest size possible.
- Do not use UMA zone for sync data allocations. There could be only one
synchronization request per synchronized disk at a time, so allocate memory
for one request on whole synchronization process related to one disk.
Tested by synchronizing one component (out of three) and by synchronizing
two components (out of three) in parallel.
- Remove __RMAN_RESORUCE_VISIBLE again. It's no longer required either
because of the above change or because struct rman is no longer hidden.
Reviewed by: grehan
Tested by: cross-compile on i386
for structures with timers in them. It might be that a timer might fire
even when the associated structure has already been free'd. Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.
Discussed with: bosko, tegge, rwatson
called "rtentry".
This saves a considerable amount of kernel memory. R_Zmalloc previously
used 256 byte blocks (plus kmalloc overhead) whereas UMA only needs 132
bytes.
Idea from: OpenBSD
incomplete in that the PRT routing was not aware of link programming.
Fix this by doing all routing through the link devices. The new algorithm
for setting up links is:
1. Read _CRS to get current setting. If invalid (not in _PRS), then set
to 0.
2. Attempt to call _DIS on the link. If successful, mark the link as not
routed. Otherwise, assume it still is.
Then when a routing request occurs:
3. Update weights for all IRQs
4. Attempt to route the initial IRQ if valid
5. If that fails, walk through the sorted list, attempting to route IRQs.
6. Configure the trigger/polarity based on _PRS.
Other changes:
* Add acpi_pci_find_prt() to look up the PRT entry for a given device and
acpi_pci_link_route() to select/route the best IRQ for it.
* Remove duplicated code in acpi_pcib_route_interrupt() that picked the
first IRQ from _PRS.
* Remove unneeded arguments from acpi_pcib_resume() and friends.
* Ignore _STA on link devices but report if it seems strange.
* Add a prt_source handle to the PRT structure since the ACPI struct
ACPI_PCI_ROUTING_TABLE uses a fixed-size entry for it. We'll need to
dynamically size this object if we want to use it the same way ACPI-CA
does. Null-terminate the source.
Tested by: Luo Hong <luohong99_at_mails.tsinghua.edu.cn>,
Jeffrey Katcher <jmkatcher_at_yahoo.com>
Info from: jhb, Len Brown (Intel)
and bio_inbed fields to 0. Without this change we can end up with
I/O leakage in some rare situations.
I tested this change by putting failure probability mechanism simlar
to this used in NOP class into g_clone_bio(9) function, so it was
able to return NULL with the given probability.
Discussed with: phk
we update the registers. That way we don't have any dirty registers to
worry about and also know that bsp=bspstore, which makes updating the
RSE related registers predictable.
This is not the end of it. We need more validity checks, but for now
this allows us to complete the gdb testsuite without crashing the
kernel.
if_start routines cannot currently be entered without Giant. When
the kernel is running with debug.mpsafenet != 0, this will defer
if_start execution to a task queue thread holding Giant, which may
introduce additional latency, but avoid incorrect execution.
Suggested by: dfr
full, avoiding the cost of mutex operations if it is. We re-test
once the mutex is acquired to make sure it's still true before doing
the -modify-write part of the read-modify-write. Note that due to
the maximum fifo depth being pretty deep, this is unlikely to improve
harvesting performance yet.
Approved by: markm
to allow dumping per-thread machine specific notes. On ia64 we use this
function to flush the dirty registers onto the backingstore before we
write out the PRSTATUS notes.
Tested on: alpha, amd64, i386, ia64 & sparc64
Not tested on: arm, powerpc
a standard configuration similar to [NO_]ADAPTIVE_MUTEXES. This
feature causes Giant to be included in the set of mutexes adaptively
spun on. It appears to have a positive effect on performance on SMP
across several workloads, including measurements of a 16% improvement
on buildworld, and 30%+ improvement for MySQL using the supersmack
benchmark with Giant over the network stack; a 6% improvement without
Giant on the network stack (as a result of less giant contention).
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL. Otherwise,
accept that another thread got there first, and release the local storage.
Discussed with: jmg
Implement the protection check required by the pmap_extract_and_hold()
specification.
Remove the acquisition and release of Giant from pmap_extract_and_hold() and
pmap_protect().
Many thanks to Ken Smith for resolving a sparc64-specific initialization
problem in my original patch.
Tested by: kensmith@
bio_driver1 (as all the rest).
This introduced a small memory leak, but it wasn't really critical,
because maximum memory for g_stripe_zone is always set, so after few
requests gstripe was working in "economic" mode.
are resevered, they can be written with anything, but they always read
as zero, we should simulate it in set_regs() as we are reading/writting
real hardware %rflags register.
contributed to the transferable load count. This prevents any potential
problems with sched_pin() being used around calls to setrunqueue().
- Change the sched_add() load balancing algorithm to try to migrate on
wakeup. This attempts to place threads that communicate with each other
on the same CPU.
- Don't clear the idle counts in kseq_transfer(), let the cpus do that when
they call sched_add() from kseq_assign().
- Correct a few out of date comments.
- Make sure the ke_cpu field is correct when we preempt.
- Call kseq_assign() from sched_clock() to catch any assignments that were
done without IPI. Presently all assignments are done with an IPI, but I'm
trying a patch that limits that.
- Don't migrate a thread if it is still runnable in sched_add(). Previously,
this could only happen for KSE threads, but due to changes to
sched_switch() all threads went through this path.
- Remove some code that was added with preemption but is not necessary.
umich copyright is asserting.
Clarify that the copyright I'm asserting is the standard Berkeley
license.
Remove Giant assertions from AARP and DDP input routines.
is here so that we can gather stats on the nature of the recent rash of
hard lockups, and in this particular case panic the machine instead of
letting it deadlock forever.
becauses some syscalls using set_mcontext can sneakily change
parameters and later when those syscalls references parameters,
they will wrongly use register values in mcontext_t.
Approved by: peter
chipsets, based on Linux's via-agp.c. On boot, the system selects which AGP
version to use based on the inserted card. If v2 was chosen, the chipset
needs to be programmed with the v2 registers still. Also included in kern/69953
are changes to make the programming of the v3 registers match linux, but that
will be left out until the need to do so is confirmed (want specs or a tester).
PR: kern/69953
Submitted by: Oleg Sharoiko <os@rsu.ru>
Tested by: Oleg Sharoiko <os@rsu.ru>, Geoff Speicher <geoff@speicher.org>
(full version from PR)
The hardware always gives read access for privilege level 0, which
means that we cannot use the hardware access rights and privilege
level in the PTE to test whether there's a change in protection. So,
we save the original vm_prot_t in the PTE as well.
Add pmap_pte_prot() to set the proper access rights and privilege
level on the PTE given a pmap and the requested protection.
The above allows us to compare the protection in pmap_extract_and_hold()
which was missing. While in pmap_extract_and_hold(), add pmap locking.
While here, clean up most (i.e. all but one) PTE macros we inherited
from alpha. They were either unused, used inconsistently, badly named
or simply weren't beneficial. We save the wired and managed state of
the PTE in distinct (bit) fields.
While in pte.h, s/u_int64_t/uint64_t/g
pmap locking obtained from: alc@
feedback & review by: alc@
* Allow no-fault wiring/unwiring to succeed for consistency;
however, the wired count remains at zero, so it's a special case.
* Fix issues inside vm_map_wire() and vm_map_unwire() where the
exact state of user wiring (one or zero) and system wiring
(zero or more) could be confused; for example, system unwiring
could succeed in removing a user wire, instead of being an
error.
* Require all mappings to be unwired before they are deleted.
When VM space is still wired upon deletion, it will be waited
upon for the following unwire. This makes vslock(9) work
rather than allowing kernel-locked memory to be deleted
out from underneath of its consumer as it would before.
1. Move a comment to its proper place, updating it. (Except for white-
space, this comment had been unchanged since revision 1.1!)
2. Remove spl calls.
fix the obvious bugs, nastier ones reside below the surfac), and having
it commented out here just encourages people to try it.
# I'm not removing it from the base system, yet.
For incoming packets, the packet's source address is checked if it
belongs to a directly connected network. If the network is directly
connected, then the interface the packet came on in is compared to
the interface the network is connected to. When incoming interface
and directly connected interface are not the same, the packet does
not match.
Usage example:
ipfw add deny ip from any to any not antispoof in
Manpage education by: ru
It allows to fix problems when last provider's sector is shared between few
providers.
- Bump version number for CONCAT and STRIPE and add code for backward
compatibility.
- Do not bump version number of MIRROR, as it wasn't officially introduced yet.
Even if someone started to play with it, there is no big deal, because
wrong MD5 sum of metadata will deny those providers.
- Update manual pages.
- Add version history to g_(stripe|concat).h files.
thread, after the bound thread leaves critical region, the thread should
check debug flag may suspend itself by using the command.
2.Schedule upcall after thread is suspended by debugger
3.Wakeup upcall thread after process suspension.
Reviewed by: deischen
understood. This makes room for additional binary compatibility in the
future.
Put fields in the class for the geom's methods and initialize the methods
of a new geom from these fields. This saves some code in all classes.
o Change the motion calculation to result in
a more reasonable speed of motion
This should fix the 'aiming' problems people have reported. It also
mitigates (but doesn't completely solve) the 'stalling' problems at
very low speeds.
Tested by: many subscribers to -current
Approved by: njl
o Catch 'taps' as button presses
o One finger sends button1, two fingers send button3,
three fingers send button2 (double-click)
Tested by: many subscribers to -current
Approved by: njl
o Handle the 'up/down' buttons some touchpads have as
a z-axis (scrollwheel) as recommended by the specs
o Report the buttons as button4 and button5 instead
of button2 and button4, button2 can be emulated by
pressing button1 and button3 simultaneously. This
allows one to use the two extra buttons for other
purposes if one so desires.
Tested by: many subscribers to -current
Approved by: njl
o Clean up whitespace and comments in the
enable_synaptics() probing function
o Only use (and rely on) the extended capability
bits when we are told they actually exist
o Partly ignore the (possibly dated?) part of the
specification about the mode byte so that we
can support 'guest devices' too.
Tested by: many subscribers to -current
Approved by: njl
path. The basic problem is that we cannot set the single stepping flag
directly, because we don't leave the kernel via an interrupt return. So,
we need another way to set the single stepping flag.
The way we do this is by enabling the lower-privilege transfer trap, which
gets raised when we drop the privilege level. However, since we're still
running in kernel space (sec), we're not yet done. We clear the lower-
privilege transfer trap, enable the taken-branch trap and continue exiting
the kernel until we branch into user space.
Given the current code, there's a total of two traps this way before
we can raise SIGTRAP.
after a fork(2) in fork_trampoline(). By moving the epc_syscall_return
label immediately before the call to do_ast() in epc_syscall(), we not
only achieve that but also handle the detour through exception_return
when the frame corresponds to an asynchronous kernel entry. Hence, we
simplified fork_trampoline() as a side-effect.
related to breakpoints and single stepping into SIGTRAP so gdb(1) knows
why the remote target has stopped. In particular, gdb(1) needs to know
if the reason is something of its own doing.
catch leaking into VFS without Giant.
Inch Giant a little lower in several file descriptor operations on
vnodes to cover only VFS operations that need it, rather than file
flag reading, etc.
was being unconditionally dereferenced but was NULL for PIO requests.
Check the request flags for a DMA transaction before dereferencing.
Reported by: ceri
Tested by: Radek Kozlowski <radek -at- raadradd.com>
fcntl() operations, including:
F_DUPFD dup() alias
F_GETFD retrieve close-on-exec flag
F_SETFD set close-on-exec flag
F_GETFL retrieve file descriptor flags
For the remaining fcntl() operations, do acquire Giant, especially
where we call into fo_ioctl() as a result. We're not yet ready to
push Giant into fo_ioctl(). Once we do, this can all become quite a
bit prettier.
calls. Note that the information included is a bit different from the
existing KTR traces generated on powerpc, as I'm primarily interested
in kernel context (thread, syscall #, proc, etc), not the user
arguments to the system call. Some convergence would be useful here.
with it that need to be understood better before they can be resolved.
This takes time and time is already in short supply.
Reported & tested by: glebius@
will prepend the current kernel booting... This prevents a problem of
loading /boot/kernel's modules when a different kernel has no modules,
but you left your module_load="YES" in loader.conf...
Reviewed by: dcs (minus the help part)
something goes wrong while running in "fast" mode, we free all bios and
falling back to "economic" mode. Freeing bios, doesn't mean decrease
bio_children, so bio_inbed couldn't be equal to bio_children and request
was never finished.
Decrease bio_children manually when destroying bios.
Reported by: Sam Lawrance <boris@brooknet.com.au>, simon
message if they are incorrect. Also, remove the hack of allowing the
initial irq setting to not be in _PRS. As before, the old behavior can be
regained by defining ACPI_OLD_PCI_LINK.
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events. This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.
Reported by: kuriyama
Tested by: kuriyama
a result of scheduling an ithread, cut a KTR_INTR trace record so
that it's clear in tracing interrupt activity where and when the
entropy harvesting code is invoked.
callout_reset rather than calling callout_stop. This results in a few
lines of code duplication, but it provides a significant performance
improvement because it avoids recursing on callout_lock.
Requested by: rwatson
or multicast packet, we don't need to acquire the inpcb mutex
unless we are actually using inpcb fields other than the bound port
and address. Since we hold the pcbinfo lock already, these can't
change. Defer acquiring the inpcb mutex until we have a high
chance of a match. This avoids about 120 mutex operations per UDP
broadcast packet received on one of my work systems.
Reviewed by: sam
want a splash screen.
There seems to be some confusion in the syscons code as to the meaning of
the SC_KERNEL_CONSOLE flag. Its absence is sometimes interpreted to mean
"I am not the system console", and sometimes to mean "I am not the only
VGA console" (see the font loading code for an example of the latter).
Someone with better syscons fu than myself should take a closer look.
(that does not compile with !gcc). Moreover we get the benefit for all archs
that have a hand optimized in_cksum_skip().
Submitted by: yongari
Tested by: me (i386, extensivly), pf4freebsd ML (various)
better check for 'adjacent'. The old code assumed that if two resources
were adjacent in the linked list that they were also adjacent range wise.
This is not true when a resource manager has to manage disparate regions.
For example, the current interrupt code on i386/amd64 will instruct
irq_rman to manage two disjoint regions: 0-1 and 3-15 for the non-APIC
case. If IRQs 1 and 3 were allocated and then released, the old code
would coalesce across the 1 to 3 boundary because the resources were
adjacent in the linked list thus adding 2 to the area of resources that
irq_rman managed as a side effect. The fix adds extra checks so that
adjacent unallocated resources are only merged with the resource being
freed if the start and end values of the resources also match up. The
patch also consolidates the checks for adjacent resources being allocated.
following behavior:
* Link devices return invalid status (_STA) values. The results are very
unreliable -- sometimes never present. Just ignore the status and pick
the best configuration from _PRS.
* Link devices return invalid current settings (_CRS). Even after setting
the link value, many systems still return a different setting for _CRS.
When setting an IRQ, don't bother to check _CRS to see if we succeeded.
Note that we still check _CRS before routing and this should be addressed
as well.
Since this is a sensitive area, leave the old behavior accessible via
uncommenting the define for ACPI_OLD_PCI_LINK at the top of the file. Once
this has been thoroughly tested, this option and the code it covers will
be removed.
Thanks to Len Brown at Intel for informing us of these issues as he worked
around them in Linux.
location (for the wake code). It should not be needed since we don't
map other pages at the same location and if there was an old mapping, it
would be restored by a fault. The old code had serious problems, namely
that it was restoring the new page it had just removed (not opage) and
it could only guess at the right protection (since there's no
pmap_extract_protect function). Thanks to Alan Cox for explaining much
of this to me.
Also, remove a commented-out initializecpu() call since it is not needed.
Restoring the cpu context is better than attempting to init from scratch.
Reviewed by: alc (earlier version)
have clear idea on boot2 BSS size and leaves portion of it not zeroed out.
btxcsu.s is in much better position for this job.
Obtained from: DragonflyBSD (with minor adjustments)
Since HME doesn't compensate the checksum for UDP datagram which
can yield to 0x0, UDP transmit checksum offload is disabled by
default. The UDP Transmit checksum offload can be reactivated
by setting special link option link0 with ifconfig(8).
Approved by: jake (mentor)
Reviewed by: tmm
Tested by: Herve Boulouis <amon@sockar.homeip.net>
before grabbing BPF locks to see if there are any entries in order to
avoid the cost of locking if there aren't any. Avoids a mutex lock/
unlock for each packet received if there are no BPF listeners.
consumer and 'bio_pflags' which can be used by provider.
- Remove BIO_FLAG1 and BIO_FLAG2 flags. From now on new fields should be
used for internal flags.
- Update g_bio(9) manual page.
- Update some comments.
- Update GEOM_MIRROR, which was the only one using BIO_FLAGs.
Idea from: phk
Reviewed by: phk
spin-wait code to use the same spin mutex (smp_tlb_mtx) as the TLB ipi
and spin-wait code snippets so that you can't get into the situation of
one CPU doing a TLB shootdown to another CPU that is doing a lazy pmap
shootdown each of which are waiting on each other. With this change, only
one of the CPUs would do an IPI and spin-wait at a time.
the immediate awakening of proc0 (scheduler kproc, controls swapping
processes in and out). The scheduler process periodically awakens already,
so this will not result in processes not being swapped in, there will just
be more latency in between a thread being made runnable and the scheduler
waking up to swap the affected process back in.
macros and pass the value to the associated _mtx_*() functions to avoid
more curthread dereferences in the function implementations. This provided
a very modest perf improvement in some benchmarks.
Suggested by: rwatson
Tested by: scottl
text/data are covered on APs. This enables the kernel to boot on
a 4 way Intel Itanium-2 platform. This has a secondary effect of
keeping the TRs identical on BP and the APs.
reviewed by: marcel@
a sleep() call waking up in namei(), a later assertion triggers that
Giant is not held. By asserting Giant at the start of namei(), we can
know that if that assertion triggers, Giant is lost during the call to
namei(), and not before.
lock assertions even if IPv6 is compiled into the kernel. Previously,
inclusion of IPv6 and locking assertions would result in a rapid
assertion failure as IPv6 was not properly locking inpcbs.
- In ntoskrnl_var.h, I had defined compat macros for
ntoskrnl_acquire_spinlock() and ntoskrnl_release_spinlock() but
never used them. This is fortunate since they were stale. Fix them
to work properly. (In Windows/x86 KeAcquireSpinLock() is a macro that
calls KefAcquireSpinLock(), which lives in HAL.dll. To imitate this,
ntoskrnl_acquire_spinlock() is just a macro that calls hal_lock(),
which lives in subr_hal.o.)
- Add macros for ntoskrnl_raise_irql() and ntoskrnl_lower_irql() that
call hal_raise_irql() and hal_lower_irql().
- Use these macros in kern_ndis.c, subr_ndis.c and subr_ntoskrnl.c.
- Along the way, I realised subr_ndis.c:ndis_lock() was not calling
hal_lock() correctly (it was using the FASTCALL2() wrapper when
in reality this routine is FASTCALL1()). Using the
ntoskrnl_acquire_spinlock() fixes this. Not sure if this actually
caused any bugs since hal_lock() would have just ignored what
was in %edx, but it was still bogus.
This hides many of the uses of the FASTCALLx() macros which makes the
code a little cleaner. Should not have any effect on generated object
code, other than the one fix in ndis_lock().
vm_page_sleep_if_busy() and the page table page's busy flag as a
synchronization mechanism on page table pages.
Also, relocate the inline pmap_unwire_pte_hold() so that it can be used
to shorten _pmap_unwire_pte_hold() on alpha and amd64. This places
pmap_unwire_pte_hold() next to a comment that more accurately describes
it than _pmap_unwire_pte_hold().
by a transaction performing a driver handled message sequence (an
scb with the MK_MESSAGE flag set).
SCBs that perform host managed messaging must always be
at the head of their per-target selection queue so that
the firmware knows to manually assert ATN if the current
negotiation agreement is packetized. In the past we
guaranteed this by queuing these SCBs separarately in
the execution queue. This exposes the system to potential
command reordering in two cases:
1) Another SCB for the same ITL nexus is queued that does
not have the MK_MESSAGE flag set. This SCB will be
queued to the per-target list which can be serviced
before the MK_MESSAGE scb that preceeded it.
2) If the target cannot accept all of the commands in the
per-target selection queue in one selection, the remainder
is queued to the tail of the selection queues so as to
effect round-robin scheduling. This could allow the
MK_MESSAGE scb to be sent to the target before the
requeued commands.
This commit changes the firmware policy to defer queuing
MK_MESSAGE SCBs into the selection queues until this can
be done without affecting order. This means that the
target's selection queue is either empty, or the last
SCB on the execution queue is also a MK_MESSAGE SCB.
During any wait, the firmware halts the download of new
SCBs so only a single "holding location" is required.
Luckily, MK_MESSAGE SCBs are rare and typically occur only
during CAM's bus probe where only one command is outstanding
at a time. However, during some recovery scenarios, the
reordering *could* occur.
aic79xx.c:
Update ahd_search_qinfifo() and helper routines to
search for pending MK_MESSAGE scbs and properly
restitch the execution queue if either the MK_MESSAGE
SCB is being aborted, or the MK_MESSAGE SCB can be
queued due to the execution queue draining due to
aborts.
Enable LQOBUSFREE status to assert an interrupt.
This should be redundant since a BUSFREE interrupt
should always occur along with an LQOBUSFREE event,
but on the Rev A, this doesn't seem to be guaranteed.
When a PPR request is rejected when a previously
existing packetized agreement is in place, assume
that the target has been reset without our knowledge
and revert to async/narrow transfers. This corrects
two issues: the stale ENATNO setting that was used
to send the PPR is cleared so the firmware is not
confused by a future packetized selection with
ATN asserted but no MK_MESSAGE flag in the SCB and
it speeds up recovery by aborting any pending
packetized transactions that by definition are now
dead.
When re-queueing SCBs after a failed negotiation
attempt, ensure command ordering by freezing the
device queue first.
Traverse the list of pending SCBs rather than the
whole SCB array on the controller when pushing
MK_MESSAGE flag changes out to the controller.
The original code was optimized for the aic7xxx
controllers where there are fewer controller slots
then pending SCBs and the firmware picks SCB
slots. For the U320 controller, the hope is
that we have fewer pending SCBs then the 512
slots on the controller.
Enhance some diagnostics.
Factor out some common code.
aic79xx.h:
Add prototype for new ahd_done_with_status() that is
used to factor out some commone code.
aic79xx.reg:
Add definisions for the pending MK_MESSAGE SCB.
aic79xx.seq:
Defer MK_MESSAGE SCB queing to the execution queue
so as to preserve command ordering. Re-arrange some
of the selection processing code so the above change
had no performance impact on the common code path.
Close a few critical section holes.
When entering a non-packetized phase, manually enable
busfree interrupts, since the controller hardware
does not do this automatically.
aic79xx_inline.h:
Enhance logging for queued SCBs.
aic79xx_osm.c:
Add new a new DDB ahd command, ahd_dump, which
invokes the ahd_dump_card_state() routine on the
unit specified with the ahd_sunit DDB command.
aic79xx_pci.c:
Turn on the BUSFREEREV bug for the Rev B. controller.
This is required to close the busfree during non-packetized
phase hole.
functions. Basically, the ip_next() function was used to get the PPTP and
Skinny headers when tcp_next() should have been used instead. Symptoms of
this included a segfault in natd when trying to process a PPTP or Skinny
packet.
Approved by: des
should be set to VM_PAGE_BITS_ALL before returning, to ensure that
neither vm_pager_get_pages nor vm_fault calls vm_page_zero_invalid
after dev_pager_getpages has returned.
Submitted by: tegge
into single-user mode (as seen on sparc64 and PPC). Problems were due
to a minor oversight in the changes committed in revision 1.25.
Submitted by: grehan
Tested by: gad & yongari
being defined, define and use a new MD macro, cpu_spinwait(). It only
expands to something on i386 and amd64, so the compiled code should be
identical.
Name of the macro found by: jhb
Reviewed by: jhb
make it fully self-contained.
o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len
including the IP header.
o Computation of the delayed checksum is moved into divert_packet().
Reviewed by: silby
- according to RFC2661 an offset size of 0 is allowed.
- when skipping offset padding do not forget to also skip
the 2 octets of the offset size field.
Reviewed by: archie
Approved by: pjd (mentor)
link[n].latency calculated from user supplied value.
This prevents repeated NGM_PPP_SET_CONFIG/NGM_PPP_GET_CONFIG
from failing because of link[n].conf.latency being out of range.
Reviewed by: archie
Approved by: pjd (mentor)
and setting MSR. This was most evident with the idle proc running
with interrupts disabled and causing a lockup. Switch over to the
i386 style which does things in the right order.
debug assisted by: gallatin, and the invaluable KTR option.
pipelock(), not via a mixture of mutexes and pipelock(). Additionally,
add a few KASSERTS, and change some statements that should have been
KASSERTS into KASSERTS.
As a result of these cleanups, some segments of code have become
significantly shorter and/or easier to read.
unconditionally, stop after the first one (system board) if no EISA hardware
is detected. This fixes a boot hang (i.e. Thinkpad) when ACPI is disabled.
Also, split the probe code into a separate function and do some style cleanup.
Note that the Adaptec 2842 VLB controller probe is broken by this change
and will fail to probe. It should be fixed separately.
in case of a CHECK CONDITION.
- Make this driver return SCSI status information.
- While here, factor out the clearing of the CAM status from every
element of the switch statement to only once before the switch.
This fixes burning CDs with recent cdrecord 2.01 alpha versions and
burners attached to asr(4) controllers but there could have been
other applications and da(4) etc. also affected.
Reviewed by: gibbs, scottl
MFC after: 2 weeks
UFS2 was here. It so happened that UFS2 did not need a seperate
partition type. Keep the definition as a comment for documentation
purposes. If there is a benefit for UFS2 file systems to have a
seperate partition type under GPT, then this definition should be
restored as that was the intention of the definition.
Currently one cannot load the mem.ko module without panicing if mem is
compiled into the kernel and one cannot build a kernel w/o "device mem"
right now either. Thus it is too dangerous to install mem.ko right now
because if one puts 'mem_load="YES"' in /etc/loader.conf they cannot
boot an "old" kernel (at the time that a kernel doesn't have to be built
with "device mem).
pic_eoi_source() into one call. This halves the number of spinlock operations
and indirect function calls in the normal case of handling a normal (ithread)
interrupt. Optimize the atpic and ioapic drivers to use inlines where
appropriate in supporting the intr_execute_handlers() change.
This knocks 900ns, or roughly 1350 cycles, off of the time spent servicing an
interrupt in the common case on my 1.5GHz P4 uniprocessor system. SMP systems
likely won't see as much of a gain due to the ioapic being more efficient than
the atpic. I'll investigate porting this to amd64 soon.
Reviewed by: jhb
skip blocks that are too big by a factor of two or greater. This
avoids some cases of extremely inefficient memory use that can occur
when large (e.g. 64k) blocks on the free list get used when allocating
a 4k chunk of 64-byte fragments. Because fragments have their own
free list, the 60k difference got lost forever every time.
system BIOS to disable legacy device emulation as per the "EHCI
Extended Capability: Pre-OS to OS Handoff Synchronisation" section
of the EHCI spec. BIOSes that implement legacy emulation using SMIs
are supposed to disable the emulation when this procedure is performed.
set gp->softc to NULL and return ENXIO when it is NULL, so GEOM
will not panic or hang, but unload one device on every 'unload'.
This make 'unload' command usable, but it have to be executed
<number of devices> + 1 times.
- Made use of 'pp' variable.
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialation functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.
Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
now, but it's possible for ndis_reset_nic() to sleep (sometimes the
MiniportReset() method returns NDIS_STATUS_PENDING and we have
to wait for completion). To get around this, execute the ndis_reset_nic()
routine in the NDIS_TASKQUEUE thread.
- Give ndiscvt(8) the ability to process a .SYS file directly into
a .o file so that we don't have to emit big messy char arrays into
the ndis_driver_data.h file. This behavior is currently optional, but
may become the default some day.
- Give ndiscvt(8) the ability to turn arbitrary files into .ko files
so that they can be pre-loaded or kldloaded. (Both this and the
previous change involve using objcopy(1)).
- Give NdisOpenFile() the ability to 'read' files out of kernel memory
that have been kldloaded or pre-loaded, and disallow the use of
the normal vn_open() file opening method during bootstrap (when no
filesystems have been mounted yet). Some people have reported that
kldloading if_ndis.ko works fine when the system is running multiuser
but causes a panic when the modile is pre-loaded by /boot/loader. This
happens with drivers that need to use NdisOpenFile() to access
external files (i.e. firmware images). NdisOpenFile() won't work
during kernel bootstrapping because no filesystems have been mounted.
To get around this, you can now do the following:
o Say you have a firmware file called firmware.img
o Do: ndiscvt -f firmware.img -- this creates firmware.img.ko
o Put the firmware.img.ko in /boot/kernel
o add firmware.img_load="YES" in /boot/loader.conf
o add if_ndis_load="YES" and ndis_load="YES" as well
Now the loader will suck the additional file into memory as a .ko. The
phony .ko has two symbols in it: filename_start and filename_end, which
are generated by objcopy(1). ndis_open_file() will traverse each module
in the module list looking for these symbols and, if it finds them, it'll
use them to generate the file mapping address and length values that
the caller of NdisOpenFile() wants.
As a bonus, this will even work if the file has been statically linked
into the kernel itself, since the "kernel" module is searched too.
(ndiscvt(8) will generate both filename.o and filename.ko for you).
- Modify the mechanism used to provide make-pretend FASTCALL support.
Rather than using inline assembly to yank the first two arguments
out of %ecx and %edx, we now use the __regparm__(3) attribute (and
the __stdcall__ attribute) and use some macro magic to re-order
the arguments and provide dummy arguments as needed so that the
arguments passed in registers end up in the right place. Change
taken from DragonflyBSD version of the NDISulator.
to be particularly correct or optimal, but it seems to be enough
to allow the attachment of USB2 hubs and USB2 devices connected via
USB2 hubs. None of the split transaction support is implemented in
our USB stack, so USB1 peripherals will definitely not work when
connected via USB2 hubs.
their own directory and module, leaving the MD parts in the MD
area (the MD parts _are_ part of the modules). /dev/mem and /dev/io
are now loadable modules, thus taking us one step further towards
a kernel created entirely out of modules. Of course, there is nothing
preventing the kernel from having these statically compiled.
The different between the new function and g_mirror_orphan() (which was
used previously) is that syncid is bumped immediately, instead of on
first write, because when consumer was spoiled, it means, that its
provider was opened for writing, so we can't trust that its data
will be valid when it will be connected again.
features. The gmirror(8) utility should be used for control of this class.
There is no manual page yet, but I'm working on it with keramida@.
Many useful tests provided by: simon (thank you!)
Some ideas from: scottl, simon, phk
and refuse initializing filesystems with a wrong version. This will
aid maintenance activites on the 5-stable branch.
s/vfs_mount/vfs_omount/
s/vfs_nmount/vfs_mount/
Name our filesystems mount function consistently.
Eliminate the namiedata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space. A few places abused
it to get hold of some credentials to pass around. Effectively
it is unused.
Reorganize the root filesystem selection code.
those architectures without pmap locking.
- Eliminate the acquisition and release of Giant from vm_map_protect().
(Translation: mprotect(2) runs to completion without touching Giant on
alpha, amd64, i386 and ia64.)
brings ia64 to parity with alpha, amd64, and i386 in this area.)
- Prevent a race in pmap_find_pte(): If pmap_find_pte() sleeps in
uma_zalloc(), another thread could allocate a pte at the same address.
Instead, sleep at a higher level and retry the lookup before retrying
the allocation.
Reviewed and tested by: marcel@
This is really ugly way to do this, but there is no other way for now.
It allows to mount root file system from providers which belong to
those classes.
Approved by: phk
maps. We always acquire the sx lock exclusively here, but we can't
use a mutex because we want to be able to sleep while holding the
lock. This is completely equivalent to what we were doing with the
lockmgr(9) locks before.
Approved by: alc
provider.
- Bump version number.
This allows for a quite interesting trick. One can setup a stripe with
stripe size of 512 bytes and create transparent provider on top of it
with sector size equal to <ndisks> * 512. The result will be something
like RAID3 without parity disk (every access will touch all disks).
submitted version with style cleanups and changes to comments. I also
modified the ioctl interface. This version only has one ioctl (to get
the Synaptics-specific config parameters) since this is the only
information a user might want.
Submitted by: Arne Schwabe <arne -at- rfc2549.org>
- Enable recursion on the page queues lock. This allows calls to
vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page
queues lock held. Such calls are made to allocate page table pages
and pv entries.
- The previous change enables a partial reversion of vm/vm_page.c
revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault()
now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT.
- Add partial locking to pmap_copy(). (As a side-effect, pmap_copy()
should now be faster on i386 SMP because it no longer generates IPIs
for TLB shootdown on the other processors.)
- Complete the locking of pmap_enter() and pmap_enter_quick(). (As of now,
all changes to a user-level pmap on alpha, amd64, and i386 are performed
with appropriate locking.)
- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportioned to objectsize. This implies that
for objects up to or equal a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers we first find the largest objectsize for which
we are guaranteed to not go offpage for and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), then we are guaranteed that the
ipers computed is always >= what we will ever need in offpage
slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.
This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).
Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
* Some systems have _FDE and child floppy devices, but no _FDI. This seems
to be compatible with the standard. Don't error out if there is no _FDI.
Instead, continue on to the next device. The normal fd probe will take
care of this device.
* Some systems have _FDE but no child devices in AML. For these, add a
second pass that compares the results of _FDE to the presence of devices.
If not present, add the missing device.
* Some BIOS authors didn't read the spec. They use tape drive values for
all fdc(4) devices. Since this isn't grossly incompatible with the
required boolean value, use them. They also define the _FDE items as a
package instead of buffer. Regenerate the buffer from the package if it
is present.
Tested by: tjr, marcel
Add local rootvp variables as needed.
Remove checks for miniroot's in the swappartition. We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.
vm/vm_object.c revision 1.88) and vm_object_sync() (originating in
vm/vm_map.c revision 1.36): When descending a chain of backing objects,
both use the wrong object's backing offset. Consequently, both may
operate on the wrong pages.
Quoting Matt, "This could be responsible for all of the sporatic madvise
oddness that has been reported over the years."
Reviewed by: Matt Dillon
Alice is too lazy to write a server application in PF-independent
manner. Therefore she knocks up the server using PF_INET6 only
and allows the IPv6 socket to accept mapped IPv4 as well. An evil
hacker known on IRC as cheshire_cat has an account in the same
system. He starts a process listening on the same port as used
by Alice's server, but in PF_INET. As a consequence, cheshire_cat
will distract all IPv4 traffic supposed to go to Alice's server.
Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections. After this
change, the above scenario will be impossible. In the same setting,
the user who attempts to start his server last will get EADDRINUSE.
Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.
This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code. It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.
MFC after: 1 month
this file from userland. Since we export struct ifnet to userland, and
that struct ifnet now contains a struct task, userland needs to know
what struct task looks like.
We need to consider having a pointer to a struct task here instead and
forward declare struct task in the !_KERNEL case.
static symbols. This wasn't a problem with previous GCC releases, but
unit-at-a-time mode of GCC 3.4.2 prevents linker set components from
being emitted at all.
"__FreeBSD_version should only ever increment. It is a historial record
of events in the system. Decrementing it is akin to trying to go back
in time and change history."
Reminded by: kuriyama, scottl
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.
Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.
synchronizing IPv6 protocol control blocks and lists. These changes
are modeled on the inpcb locking for IPv4, submitted by Jennifer Yang,
and committed by Jeffrey Hsu. With these locking changes, IPv6 use of
inpcbs is now substantially more MPSAFE, and permits IPv4 inpcb locking
assertions to be run in the presence of IPv6 compiled into the kernel.
device drivers to declare that the ifp->if_start() method implemented
by the driver requires Giant in order to operate correctly.
Add a 'struct task' to 'struct ifnet' that can be used to execute a
deferred ifp->if_start() in the event that if_start needs to be called
in a Giant-free environment. To do this, introduce if_start(), a
wrapper function for ifp->if_start(). If the interface can run MPSAFE,
it directly dispatches into the interface start routine. If it can't
run MPSAFE, we're running with debug.mpsafenet != 0, and Giant isn't
currently held, the task is queued to execute in a swi holding Giant
via if_start_deferred().
Modify if_handoff() to use if_start() instead of direct dispatch.
Modify 802.11 to use if_start() instead of direct dispatch.
This is intended to provide increased compatibility for non-MPSAFE
network device drivers in the presence of Giant-free operation via
asynchronous dispatch. However, this commit does not mark any network
interfaces as IFF_NEEDSGIANT.
using linker_load_module(). This works OK if NGM_MKPEER message came
from userland and we have process associated with thread. But when
NGM_MKPEER was queued because target node was busy, linker_load_module()
is called from netisr thread leading to panic.
To workaround that we do not load modules by framework, instead ng_socket
loads module (if this is required) before sending NGM_MKPEER.
However, the race condition between return from NgSendMsg() and actual
creation of node still exist and needs to be solved.
PR: kern/62789
Approved by: julian
clients simultaneously. When node is client its mode is configured
with a control message.
sysctl net.graph.nonstandard_pppoe is deprecated but kept for
backward compatibility for some time.
Approved by: julian
dereference curthread. It is called only from critical_{enter,exit}(),
which already dereferences curthread. This doesn't seem to affect SMP
performance in my benchmarks, but improves MySQL transaction throughput
by about 1% on UP on my Xeon.
Head nodding: jhb, bmilekic
(i.e. with the foreign address being not wildcard) when checking
for possible port theft since such connections cannot be stolen.
The port theft check is FreeBSD-specific and isn't in the KAME tree.
PR: bin/65928 (in the audit trail)
Reviewed by: -net, -hackers (silence)
Tested by: Nick Leuta <skynick at mail.sc.ru>
MFC after: 1 month
an adaptive fashion when adaptive mutexes are enabled. The theory
behind non-adaptive Giant is that Giant will be held for long periods
of time, and therefore spinning waiting on it is wasteful. However,
in MySQL benchmarks which are relatively Giant-free, running Giant
adaptive makes an observable difference on SMP (5% transaction rate
improvement). As such, make adaptive behavior on Giant an option so
it can be more widely benchmarked.
- Push down Giant into shmexit(). (Giant is acquired only if the vmspace
contains shm segments.)
- Eliminate the acquisition of Giant from proc_rwmem().
- Reduce the scope of Giant in exit1(), uncovering the destruction of the
address space.
switch in fork_exit() to before anything else is done (but keep
schedlock for the deadthread check). This means one less
nasty bug if ever in the future whatever might have been called
before the update played with schedlock or critical sections.
Discussed with: tjr
when inpcb is NULL, this is no longer invalid since jlemon added the
tcp_twstart function... this prevents close "failing" w/ EINVAL when it
really was successful...
Reviewed by: jeremy (NetBSD)
Now it is user-controlled through ifconfig(8).
The former ``automagic'' way of operation created more
trouble than good. First, VLAN_MTU consumers other than
vlan(4) had appeared, e.g., ng_vlan(4). Second, there was
no way to disable VLAN_MTU manually if it were causing
trouble, e.g., data corruption.
Dropping the ``automagic'' should be completely invisible
to the user since
a) all the drivers supporting VLAN_MTU
have it enabled by default, and in the first place
b) there is only one driver that can really toggle VLAN_MTU
in the hardware under its control (it's fxp(4), to which
I added VLAN_MTU controls to illustrate the principle.)
the system" resource limit code: When checking if the caller has superuser
privileges, we should be checking the *real* user, not the *effective*
user. (In general, resource limiting is done based on the real user, in
order to avoid resource-exhaustion-by-setuid-program attacks.)
Now that a SUSER_RUID flag to suser_cred exists, use it here to return
this code to its correct behaviour.
Pointed out by: rwatson
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
ACPI_DEBUG. This upset the ordering that acpi_probe_order() was meant to
provide, causing devices to attach before the sysresource object. This
debugging feature has been unnecessary for a while so just remove it.
Testing by: marcel
the license from /usr/src/COPYRIGHT. Since cvs annotate shows that
this was written by jasone, julian, jhb, peter, bmilekic and obrien.
cvs log shows that many others may have contributed to this file. As
such, go ahead and use the author of 'FreeBSD Project' for this file.
If this is a problem, please notify me.
# this eliminates the last file in the kernel with an indirect reference
# to /usr/src/COPYRIGHT in the kernel. A few more in userland remain.
vinumdrive geom with an exclusive bit. This should fix the problem
when underlying partitions overlap (i.e. the 'a' partition is at
the same offset as the 'c' partition).
Ideas borrowed from pjd@, quite a bit of testing by
Matthias Schuendehuette <msch@snafu.de>.
Properly wait for not busy and introduce a timeout for devices not
setting busy (as they should).
Leave a printf in there that states how long the wait was, as I'd like
to get an idea of the variations here. The time needed seems also to be
affected by whether a medium is present or not.
FOREACH_SAFE. Remove bad cast of retp and instead use an additional
arg to pass back the number of valid outputs. Use the package convenience
functions for parsing packages.
o Separate out local (ports) scripts that use rc.d, and the old style
startup/shutdown scripts and execute them separately. On startup the
rc.d style scripts are executed first and then the old-style scripts.
On shutdown, exactly the reverse happens.
o The rc.d ports scripts should now behave more like base system scripts.
Scripts ending in .sh will be sourced into the current shell, while the
rest will be executed in a subshell. Previously, all ports scripts,
regardless of the .sh suffix, were executed in a subshell.
o The parent script, /etc/rc.d/localpkg, passes its command line arguments
straight to the rc.d ports scripts. This means they should now honor
faststop and faststart commands as well. Old style scripts, should not see
any differences. They will still get either a start or stop command.
o The initial phrase shown during shutdown has been changed to use
"local packages" instead of "daemon processes" to be more inline with the
phrase used during local package startup. The phrases are also used only for
old-style ports script startup/shutdown, whereas previously they were being
used for both rc.d and old-style scripts. This should make startup/shutdown
output a bit less ugly.
Discussed with: portmgr
Has Reservations: eik
(in particular, bge(4) hasn't supported rxcsum since if_bge.c#1.5)
Clean up some aspects of capabilities usage, i.e. stop using
if_hwassist to see whether we are doing offload now because if_hwassist
is for TCP/IP layer and it is subordinate to if_capenable.
Thanks to: Aled Morris for donating a nice bge(4) NIC to me
Reviewed by: -net, -hackers (silence)
vmspace to the new vmspace in vmspace_exec() is mostly wasted effort. With
one exception, vm_swrss, the copied fields are immediately overwritten.
Instead, initialize these fields to zero in vmspace_alloc(), eliminating a
bcopy() from vmspace_exec() and a bzero() from vmspace_fork().
can't yet be referenced by other threads.
In microbenchmarks, this appears to reduce the cost of
pipe();close();close() on UP by 10%, and SMP by 7%. The vast majority
of the cost of allocating a pipe remains VM magic.
Suggested by: silby
calls further down the stack. If we find the cksum to be okay we pretend
that the hardware did all the work and hence keep the upper layers from
checking again.
Submitted by: Pyun YongHyeon
addend of 0. This isn't correct, and was quite easy to break by
referring to the address of an element within a structure.
However, fixing this exposed the fact that symbol lookups for
local variables were returning the base of the section they
were contained in. This case is detected by comparing the return
value from elf_lookup() to the relocbase+addend value: if it is
lesser, but greater than relocbase, then relocbase+addend is
taken to be the authoritative value.
bug reported by: gallatin
happens because the sio device was never opened and com->tp is
therefore NULL. ttygone can't swallow a NULL, so guard against that
possibility. Other places in this function make similar checks, so I
believe this is correct.
Improve child_detached a little and make it conform better to
style(9). Also, improve comment about what we'll be doing in the
future about driver_added. Soon it will be possible to kldload usb
drivers and have them attach w/o a need to disconnect/reconnect them.
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op. Don't acquire Giant in fstat() system call.
Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.
the data sheets leads me to believe these will just work. Those parts
with the various media readers on them may not have the required
FreeBSD drivers that will attach to the subdevices that will be seen
on some of these parts.
PCI 1515, 1530, 1620, 4520, 6411, 6420, 7410, 7510, 7610
Prompted by: Havard Eidnes
These are from the datasheets downloaded from TI's web site.
They describe the PCI[67]x[12]1 and PCI[67]x20 parts, with and without
the smartcard enabled.
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().
The following changes serve to eliminate the following lock-order
reversal reported by witness:
1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931
There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:
- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)
individual file object implementations can optionally acquire Giant if
they require it:
- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)
Notes:
Giant is still acquired in close() even when closing MPSAFE objects
due to kqueue requiring Giant in the calling closef() code.
Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
of pipe create/destroy pairs from user space with SMP compiled into
the kernel.
The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
test it extensively and so have left Giant over fo_close(). It can
probably be removed given some testing and review.
thread-local pointer, in practice that thread needs to be curthread. If
we're running with INVARIANTS, generate a warning if not. If we have
KDB compiled in, generate a stack trace. This doesn't fire at all in my
local test environment, but could be irritating if it fires frequently
for someone, so there will be motivation to fix things quickly when it
does.
the caller passes in a td that is curthread, and consistently pass 'td'
into vget(). Remove some bogus logic that passed in td or curthread
conditional on td being non-NULL, which seems redundant in the face of
the earlier assignment of td to curthread if td is NULL.
In devfs_symlink(), cache the passed thread in 'td' so we don't have
to keep retrieving it from the 'ap' structure, and assert that td is
curthread (since we dereference it to get thread-local td_ucred). Use
'td' in preference to curthread for later lockmgr calls, since they are
equal.
and saved link register as per the ABI call sequence. Update code
that uses this (fork_trampoline etc) to use the correct genassym'd
offsets.
This fixes the 'invalid LR' message when backtracing kernel
threads in DDB.
being incomplete, it currently has to know how to drop and pick back
up the vm_object's mutex if it has to sleep and drop the page queue
mutex. The problem with this is that if the page is busy, while we
are sleeping, the page can be freed and object disappear. When trying
to lock m->object, we'd get a stale or NULL pointer and crash.
The object is now cached, but this makes the assumption that
the object is referenced in some manner and will not itself
disappear while it is unlocked. Since this only happens if
the object is locked, I had to remove an assumption earlier in
contigmalloc() that reversed the order of locking the object and
doing vm_page_sleep_if_busy(), not the normal order.
(WITNESS) for code paths that always call uma_zalloc_arg() shortly
after where the check was, because uma_zalloc_arg() already does
a similar check.
No objections from Alfred. Thanks Alfred.
RTF_BLACKHOLE as well.
To quote the submitter:
The uRPF loose-check implementation by the industry vendors, at least on Cisco
and possibly Juniper, will fail the check if the route of the source address
is pointed to Null0 (on Juniper, discard or reject route). What this means is,
even if uRPF Loose-check finds the route, if the route is pointed to blackhole,
uRPF loose-check must fail. This allows people to utilize uRPF loose-check mode
as a pseudo-packet-firewall without using any manual filtering configuration --
one can simply inject a IGP or BGP prefix with next-hop set to a static route
that directs to null/discard facility. This results in uRPF Loose-check failing
on all packets with source addresses that are within the range of the nullroute.
Submitted by: James Jun <james@towardex.com>
work very infrequently, and often results in a compound panic which
confuses debugging; locking/SMP have made the layering violation (and
risks) of this more obvious over time.
Discussed with: green, bde, et al.
the thread ID and call db_trace_thread().
Since arm has all the logic in db_stack_trace_cmd(), rename the
new DB_COMMAND function to db_stack_trace to avoid conflicts on
arm.
While here, have db_stack_trace parse its own arguments so that
we can use a more natural radix for IDs. If the ID is not a thread
ID, or more precisely when no thread exists with the ID, try if
there's a process with that ID and return the first thread in it.
This makes it easier to print stack traces from the ps output.
requested by: rwatson@
tested on: amd64, i386, ia64
init and fini handlers. Our vm system removes all userland mappings at
exit prior to calling pmap_release. It just so happens that we might
as well reuse the pmap for the next process since the userland slate
has already been wiped clean.
However. There is a functional benefit to this as well. For platforms
that share userland and kernel context in the same pmap, it means that
the kernel portion of a pmap remains valid after the vmspace has been
freed (process exit) and while it is in uma's cache. This is significant
for i386 SMP systems with kernel context borrowing because it avoids
a LOT of IPIs from the pmap_lazyfix() cleanup in the usual case.
Tested on: amd64, i386, sparc64, alpha
Glanced at by: alc