o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly
Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005
and skip over the normal IP processing.
Add a supporting function ifa_ifwithbroadaddr() to verify and validate the
supplied subnet broadcast address.
PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days
from both the acpi module build directory and a kernel build directory.
The latter didn't work when one attempted to build a kernel which had
"device acpi" with the "make kernel-toolchain buildkernel" command
because a cross-compiler couldn't find anything in the standard system
include path (it's empty in the kernel-toolchain case).
Fix this by passing a better root path to kernel headers (src/sys)
which works for both cases, kernel and module (-I@ only worked for
module).
Also, while here, pass -nostdinc (and a different spelling for icc) --
it's a feature that the kernel source tree is self-contained, and this
change enforces this.
Reported by: glebius
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.
Reviewed by: ru
- hold/release device in start/done routines, this will probably slow
down things a bit, but previous code was racy;
- only release device if g_gate_destroy() failed - if it succeeded device
is dead and there is nothing to release;
- various other changes which makes forcible destruction reliable.
MFC after: 3 days
- Make the PROBE_KEYBOARD option better resemble the -P option in
boot2, i.e., if keyboard isn't present then boot with both
RB_SERIAL and RB_MULTIPLE set.
Reviewed by: jhb
any threads to them. However, it still counts those cores as "active but
permanently idle" when calculating system-wide CPUs statistics. It is
incorrect, since it skews statistics quite a bit and creates real problems
for certain types of applications (monitoring applications for example),
by making them believe that the system does have enough idle CPU resources,
while in fact it does not.
Correct the problem by not calling performance counting routines on "disabled"
cores. The cleaner solution would be to just disable APIC timer interrupts on
those cores completely, but ENOTIME here and it is not clear if the
additional complexity really worth minor performance gain.
Reviewed by: ssouhlal
Sponsored by: Sippy Software, Inc.
MFC after: 2 weeks
the "device isa" presence out of the opt_isa.h in the kernel
build directory, rather than always assuming its presence.
sparc64 is still special cased and is not affected by this
change.
Noticed by: bde
This helps systems that don't actually have atkbd controllers, such as the Intel
SBX82 blade, boot without device.hints hacks.
Hardware for this fix provided by iXsystems.
PR: 94822
Submitted by: Devon H. O'Dell <devon.odell@coyotepoint.com>
MFC After: 3 days
for a bit before retrying to resend an IPI in order to avoid
deadlocks if the other CPU is also trying to send one.
OpenSolaris uses a delay of 1 microsecond here but waiting 2
microseconds with interrupts enabled like Linux does shouldn't
hurt but is a bit safer.
MFC after: 1 day
required by arches like sparc64 (not yet implemented) and sun4v where there
are seperate IOMMU's for each PCI bus... For all other arches, it will
end up returning NULL, which makes it a no-op...
Convert a few drivers (the ones we've been working w/ on sun4v) to the
new convection... Eventually all drivers will need to replace the parent
tag of NULL, w/ bus_get_dma_tag(dev), though dev is usually different for
each driver, and will require hand inspection...
Reviewed by: scottl (earlier version)
svr4 code: this code would call centralized sysctl code that does
these checks also.
MFC after: 1 week
Obtained from: TrustedBSD Project
Sponsored by: nCircle Network Security, Inc.
register. This really shouldn't be using pci_enable_io() directly as
bus_alloc_resource() does it already, but the cached copy of the
command word needs to be correct so the enable/disable mwi functions
work properly.
- Use pci bus accessors to read revision ID and subvendor IDs.
Reviewed by: jvogel
Add the argument auditing functions for argv and env.
Add kernel-specific versions of the tokenizer functions for the
arg and env represented as a char array.
Implement the AUDIT_ARGV and AUDIT_ARGE audit policy commands to
enable/disable argv/env auditing.
Call the argument auditing from the exec system calls.
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
INVARIANT_SUPPORT is defined so you can build a kernel with
INVARIANT_SUPPORT, but build a module with just INVARIANTS on.
MFC after: 3 days
Reported by: kuriyama
mutex structure is added as following:
struct umutex {
__lwpid_t m_owner;
uint32_t m_flags;
uint32_t m_ceilings[2];
uint32_t m_spare[4];
};
The m_owner represents owner thread, it is a thread id, in non-contested
case, userland can simply use atomic_cmpset_int to lock the mutex, if the
mutex is contested, high order bit will be set, and userland should do locking
and unlocking via kernel syscall. Flag UMUTEX_PRIO_INHERIT represents
pthread's PTHREAD_PRIO_INHERIT mutex, which when contention happens, kernel
should do priority propagating. Flag UMUTEX_PRIO_PROTECT indicates it is
pthread's PTHREAD_PRIO_PROTECT mutex, userland should initialize m_owner
to contested state UMUTEX_CONTESTED, then atomic_cmpset_int will be failure
and kernel syscall should be invoked to do locking, this becauses
for such a mutex, kernel should always boost the thread's priority before
it can lock the mutex, m_ceilings is used by PTHREAD_PRIO_PROTECT mutex,
the first element is used to boost thread's priority when it locked the mutex,
second element is used when the mutex is unlocked, the PTHREAD_PRIO_PROTECT
mutex's link list is kept in userland, the m_ceiling[1] is managed by thread
library so kernel needn't allocate memory to keep the link list, when such
a mutex is unlocked, kernel reset m_owner to UMUTEX_CONTESTED.
Flag USYNC_PROCESS_SHARED indicate if the synchronization object is process
shared, if the flag is not set, it saves a vm_map_lookup() call.
The umtx chain is still used as a sleep queue, when a thread is blocked on
PTHREAD_PRIO_INHERIT mutex, a umtx_pi is allocated to support priority
propagating, it is dynamically allocated and reference count is used,
it is not optimized but works well in my tests, while the umtx chain has
its own locking protocol, the priority propagating protocol are all protected
by sched_lock because priority propagating function is called with sched_lock
held from scheduler.
No visible performance degradation is found which these changes. Some parameter
names in _umtx_op syscall are renamed.
inlined and a procedure call is made in the rare case, i.e., when it is
necessary to sleep. In this case, inlining the test actually makes the
kernel smaller.
other stuff) in the osrelease=2.6.16 case:
- implement CLONE_PARENT semantic
- fix TLS loading in clone CLONE_SETTLS
- lock proc in the currently disabled part of CLONE_THREAD
I suggest to not unload the linux module after testing this, there are
some "<defunct>" processes hanging around after exiting (they aren't
with osrelease=2.4.2) and they may panic your kernel when unloading the
linux module. They are in state Z and some of them consume CPU according
to ps. But I don't trust the CPU part, the idle threads gets too much CPU
that this may be possible (accumulating idle, X and 2 defunct processes
results in 104.7%, this looks to much to be a rounding error).
Noticed by: Intron <mag@intron.ac>
Submitted by: rdivacky (in collaboration with Intron)
Tested by: Intron, netchild
Reviewed by: jhb (previous version)
but further on -current (still not successful, but a step into the right
direction).
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Tested by: Paul Mather <paul@gromit.dlib.vt.edu>
is loaded. This problem stems from the fact that the policy is not properly
initializing the mac label associated with the NFS daemon.
Obtained from: TrustedBSD Project
Discussed with: rwatson
audit record size at run-time, which can be used by the user
process to size the user space buffer it reads into from the audit
pipe.
Perforce change: 105098
Obtained from: TrustedBSD Project
the 'vfs_getopt(optlist, "errmsg", (void **)&errmsg, &errmsg_len)'
call fails, 'errmsg' is left uninitialized, making the later tests
against NULL meaningless, and the uses bogus. Thus initialize
'errmsg' to NULL beforehand. [1]
While at it, remove the superfluous assignment of 0 to 'errmsg_len'
if the above mentioned call fails as it's already initialized to 0.
Submitted by: Michael Plass [1]
- Fix a couple of LORs and panics;
- Temporarily remove the code that tries to cleanup sockets that stuck
on accepting queues (both complete and incomplete). I'm taking an ostrich
approach here until I find a better way to deal with sockets that were
disconnected before accepting (i.e. while socket was on complete or
incomplete accept queue).
we can do the stuff we need to do with linux processes at fork and
don't panic the kernel at exit of the child.
Submitted by: rdivacky
Tested with: tst-vfork* (glibc regression tests)
Tested by: netchild
progress the kernel audit code in CVS is considered authoritative.
This will ease $P4$-related merging issues during the CVS loopback.
Obtained from: TrustedBSD Project
LCDs to blink in the V_DISPLAY_ON case, at least in combination with
some 13W3-VGA-adaptors (what's exactly going on is unclear though,
as it happens when all of H-sync, V-sync and video output are enabled
and not touching the sync bits from the preset fixes it). Thus
creator_blank_display() now is reduced to turning the video output
on/off.
Although that DPMS code did what the XFree86/Xorg sunffb(4x) does,
it was questionable in the first place, as both implementations
also turn(ed) off the video output on standby and suspend, thus most
likely causing the monitor to turn off instead of entering standby
or suspend as intended (at least my monitors don't).
Reported and tested by: Patrick Reich
MFC after: 3 days
copyout(9) instead of copystr(9) for copying the errmsg from
kernel- to user-space. This fixes a panic on sparc64 when
using the nmount(2)-converted mountd(8).
While at it, use bcopy(3) instead of strncpy(3) in the kernel-
to kernel-space case for consistency with vfs_buildopts() and
between kernel- to user-space and kernel- to kernel-space case.
Following issues should be resolved:
- random watchdog timeouts (caused by concurrent phy access)
- some link state issues
- non working TX if media type was set explicitly
PR: kern/98738
Approved by: glebius (mentor)
MFC after: 2 weeks
returned by md_reserve(). Space reserved by mb_reserve() is
byte aligned and need to be used in conjunction with le16enc()
and le32enc().
Tested on: ia64
conditions. The cause of missing Tx completion interrupts comes from
Tx interrupt moderation mechanism(delayed interrupts) or chipset bug.
If Tx interrupt moderation mechanism is the cause of false watchdog
timeout error we should have to fix all device drivers that have Tx
interrupt moderation capability. We may need more investigation
for this issue. Anyway, the fix is the same for both cases.
This should fix occasional watchdog timeout errors seen on a few
systems.
Reported by: -net, Patrick M. Hausen < hausen AT punkt DOT de >
Tested by: Patrick M. Hausen < hausen AT punkt DOT de >
misc. control registers correctly and it is inconsistent with north bridge.
In fact, there are too many broken BIOS implementations out there and we
cannot fix every possible combination but at least it is consistent with
what we advertise with ioctl(2).
first filter out metadata update. Otherwise, devfs vnode could be
erronously interpreted as ufs one, causing further check of i_flags
to use random memory.
PR: kern/100365
Debugged and fix described by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks
REPORT LUNS command to a device.
camcontrol.[c8]: Implement reportluns. This tries to print the LUNs
out in a reasonable format. Only the periph
addressing method has been tested, since very little
hardware that I know of supports the other methods.
scsi_all.[ch]: Revamp the report luns CDB structure and helper
functions. This constitutes a little bit of an API
change, but since the old CDB length was 10 bytes,
and the REPORT LUNS CDB length is actually 12 bytes,
it's clear that no one was using this API in the
first place.
MFC After: 1 week
it ended up defaulting to ISP_ROLE_NONE. My testing hadn't caught it
because I was deliberatly setting role via ioctl.
Thanks to user Toni for lending me an alpha to test this on.
MFC after: 0 days
handling for amd64 in the common code. The MD parts for amd64 are still
outstanding, but at least this fixes some panics on amd64.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Tested by: bsam
Originally it wasn't enabled since the hardware wasn't commonplace,
but as 10GE hardware is becoming more widely used, building the module
by default should be beneficial.
Approved by: rwatson (mentor)
MFC after: 2 weeks
for example:
fwd tablearg ip from any to table(1)
where table 1 has entries of the form:
1.1.1.0/24 10.2.3.4
208.23.2.0/24 router2
This allows trivial implementation of a secondary routing table implemented
in the firewall layer.
I expect more work (under discussion with Glebius) to follow this to clean
up some of the messy parts of ipfw related to tables.
Reviewed by: Glebius
MFC after: 1 month
- protect td->td_proc->p_pid with the proc lock in linux_getpid
in the amd64 (= non i386) case [1]
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Noticed by: netchild [1]
in older versions of FreeBSD. This option is pointless as it is needed in just
about every interesting usage of forward that I have ever seen. It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x or 7.x
Reviewed by: glebius
MFC after: 1 week
has in its procfs (do a readlink of /proc/self/fd/<nn> to find the pathname
that corresponds to a given file descriptor). Valgrind-3.x needs this
functionality. This is a placeholder only at this time.
match up with reality and the prototype definitions.
Register the sem_exechook as the "process_exec" event handler, not
sem_exithook.
Submitted by: rdivacky
Sponsored by: SoC 2006
that it operates on lockmgr and sx locks. This can be useful for tracking
down vnode deadlocks in VFS for example. Note that this command is a bit
more fragile than 'show lockchain' as we have to poke around at the
wait channel of a thread to see if it points to either a struct lock or
a condition variable inside of a struct sx. If td_wchan points to
something unmapped, then this command will terminate early due to a fault,
but no harm will be done.
panic, go ahead and do the longer DELAY(1) spin wait.
- If we panic due to spinning too long, print out a few more details
including the pointer to the mutex in question and the tid of the owning
thread.
for syscalls in kld's, even when compiled into the kernel statically.
Note that since this hardcodes the SYS_ prefix SYSCALL_MODULE_HELPER() now
only works for native ABI system calls. Those are the only ones that
used the macro anyway, and I chose to not require a second argument to the
macro to specify the prefix or audit event directly.
- Send the systrace_args files for all the compat ABIs to /dev/null for
now. Right now makesyscalls.sh generates a file with a hardcoded
function name, so it wouldn't work for any of the ABIs anyway. Probably
the function name should be configurable via a 'systracename' variable
and the functions should be stored in a function pointer in the sysvec
structure.
by restoring the ifv_proto field in the vlan softc and putting it to use
this time. It's a good companion for ifv_encaplen, which has already been
used throughout this driver.
'show lockchain'. The churn is because I'm about to add a new
'show sleepchain' similar to 'show lockchain' for sleep locks (lockmgr and
sx) and 'show threadchain' was a bit ambiguous as both commands show
a chain of thread dependencies, 'lockchain' is for non-sleepable locks
(mtx and rw) and 'sleepchain' is for sleepable locks.
KERNBASE and VM_MAXUSER_ADDRESS.
Remove the useless include of opt_global.h, as noticed by netchild@ (the one
in arm/elf_trampoline.c is legit, because this file is compiled outside the
kernel, and doesn't use the standard CFLAGS).
line switch. Other files which may make the same mistake (according to
fxr.watson.org) but aren't fixed in this commit (people with more clue
about those files should fix this):
- i386/xbox/xbox.c
- arm/arm/elf_trampoline.c
- arm/arm/mem.c
Noticed by: cognet
- Prepare the modules for build on amd64, but don't build them there as
part of the kernel build yet. The code for the missing symbols on amd64
isn't committed and it may be solved differently.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
- TLS - complete
- pid/tid mangling - complete
- thread area - complete
- futexes - complete with issues
- clone() extension - complete with some possible minor issues
- mq*/timer*/clock* stuff - complete but untested and the mq* stuff is
disabled when not build as part of the kernel with native FreeBSD mq*
support (module support for this will come later)
Tested with:
- linux-firefox - works, tested
- linux-opera - works, tested
- linux-realplay - doesnt work, issue with futexes
- linux-skype - doesnt work, issue with futexes
- linux-rt2-demo - works, tested
- linux-acroread - doesnt work, unknown reason (coredump) and sometimes
issue with futexes
- various unix utilities in linux-base-gentoo3 and linux-base-fc4:
everything tried worked
On amd64 not everything is supported like on i386, the catchup is planned for
later when the remaining bugs in the new functions are fixed.
To test this new stuff, you have to run
sysctl compat.linux.osrelease=2.6.16
to switch back use
sysctl compat.linux.osrelease=2.4.2
Don't switch while running a linux program, strange things may or may not
happen.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Some suggestions/help by: jhb, kib, manu@NetBSD.org, netchild
compat.linux.osrelease is changed to "2.6.16" or similar).
On amd64 not everything is supported like on i386, the catchup is planned for
later when the remaining bugs in the new functions are fixed.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Please don't style(9) the NetBSD code, we want to stay in sync. Not imported
on a vendor branch since we need local changes.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
With help from: manu@NetBSD.org
Obtained from: NetBSD (linux_{futex,time}.*)
image_params arg.
- Change struct image_params to include struct sysentvec pointer and
initialize it.
- Change all consumers of process_exit/process_exec eventhandlers to
new prototypes (includes splitting up into distinct exec/exit functions).
- Add eventhandler to userret.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Parts suggested by: jhb (on hackers@)
Reported by: Nick Withers < nick AT nickwithers DOT com >
Tested by: Nick Withers < nick AT nickwithers DOT com >
No objection from: ariff
MFC after: 1 week
serial line usb drivers that depend on it. Instead, let the compile
fail rather than silently not including the driver. This is more in
line with how we handle things like mii.
# I'll note: a better system for coping with missing depends is needed,
# but this dependency is clearly backwards given our current flawed
# depend system.
aren't mapped via pmap_enter() (KVA). We will eventually support PAT bits
on user pages, but those will require some sort of MI caching mode stored
in the vm_page.
Reviewed by: alc
i386 (I don't know) but on amd64 at hand here, it paniced early at
boot.
(I'm pretty sure that PAGE_SIZE here was miscopied from another place
during porting, where in OpenBSD bus_dmamem_alloc() is used, but there
PAGE_SIZE means completely different thing.)
options field in register 10 will be deterministic, not random.
Correct the number of input bits for EXECUTE_FIRMWARE 0..1 to
0..2- the 2322 and 24XX cards use mailbox register 2 to specify
whether the f/w being executed is freshly loaded or not.
Correct the number of input bits for {READ,WRITE}_RAM_WORD_EXTENDED
so that register 8 gets picked up.
Fix the indexing and offset for the 2322 f/w download so that it
correctly puts the different code segments where they belong.
Move VERIFY_CHECKSUM to be the 'else' clause to 2322 f/w downloads-
the EXECUTE_FIRMWARE command for 2322 and 24XX cards will tell you
if the f/w checksum is incorrect and VERIFY_CHECKSUM only works for
RISC SRAM address < 64K so you can only do a VERIFY_CHECKSUM on the
first of the 3 f/w segments for the 2322.
Shorten the delay for the continuation mailbox commands- 1ms is
ridiculous (100us is more likely).
All of the more or less is really only for the 2322/6322 cards.
Previously em(4) requeued the failed mbuf chains from
bus_dmamap_load_mbuf_sg(9) failure to resend it later. However,
bus_dmamap_load_mbuf_sg(9) may never complete its request as the
fragmented frames can have more than EM_MAX_SCATTER segments.
To handle the above EFBIG case, defragment the frame with m_defrag(9)
and free the mbuf chain if it can't deframent the chain due to
resource shortage.
Reviewed by glebius (with improvements)
o Create one more spare DMA map for Rx handler to recover from
bus_dmamap_load_mbuf_sg(9) failure.
o Make sure to update status bit in Rx descriptors even if we failed
to allocate a new buffer. Previously it resulted in stuck condition
and em_handle_rxtx task took up all available CPU cycles.
o Don't blindly unload DMA map. Reuse loaded DMA map if received
packet has errors. This would speed up Rx processing a bit under
heavy load as it does not need to reload DMA map in case of error.
(bus_dmamap_load_mbuf_sg(9) is the most expensive call in driver
context.)
o Update if_iqdrops counter if it can't allocate a mbuf cluster.
With this change it's now possible to see queue dropped packets
with netstat(1).
o Update mbuf_cluster_failed counter if fixup code failed to
allocate mbuf header.
o Return ENOBUFS instead of ENOMEM in case of Rx fixup failure.
o Make adapter->lmp NULL in case of Rx fixup failure. Strictly
specking it's not necessary for correct operation but it makes
the intention clear.
o Remove now unused dropped_pkts member in softc.
With these changes em(4) should survive mbuf cluster allocation
failure on Rx path.
Reviewed by: pdeuskar, glebius (with improvements)
MCLBYTES - ETHER_ALIGN. Previously it applied the alignment fixup code
for oversized frames which would result in reduced performance on
strict alignment archs.
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.
Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock. Reviewed by: tegge
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().