while doing a copyout. That can cause a panic, because copyout
can trigger VM faults, and we can't handle VM faults while holding
a mutex.
The solution here is to malloc a separate buffer to hold the OOA
queue entries, so that we don't risk a VM fault while filling up
the buffer and we don't have to drop the lock. The other solution
would be to wire the user's memory while filling their buffer with
copyout, but that would have been a little more complex.
Also fix a debugging parenthesis issue in ctl_abort_task() pointed
out by Chuck Tuffli.
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
drivers.
The bug occurrs when a userland process has the driver instance
open and the underlying device goes away. We get the devfs
callback that the device node has been destroyed, but not all of
the closes necessary to fully decrement the reference count on the
CAM peripheral.
The reason is that once devfs calls back and says the device has
been destroyed, it is moved off to deadfs, and devfs guarantees
that there will be no more open or close calls. So the solution
is to keep track of how many outstanding open calls there are on
the device, and just release that many references when we get the
callback from devfs.
scsi_pass.c,
scsi_enc.c,
scsi_enc_internal.h: Add an open count to the softc in these
drivers. Increment it on open and
decrement it on close.
When we get a devfs callback to say that
the device node has gone away, decrement
the peripheral reference count by the
number of still outstanding opens.
Make sure we don't access the peripheral
with cam_periph_unlock() after what might
be the final call to
cam_periph_release_locked(). The
peripheral might have been freed, and we
will be dereferencing freed memory.
scsi_ch.c,
scsi_sg.c: For the ch(4) and sg(4) drivers, add the
same changes described above, and in
addition, fix another bug that was
previously fixed in the pass(4) and enc(4)
drivers.
These drivers were calling destroy_dev()
from their cleanup routine, but that could
cause a deadlock because the cleanup
routine could be indirectly called from
the driver's close routine. This would
cause a deadlock, because the device node
is being held open by the active close
call, and can't be destroyed.
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
are used by NFSv4.1 for callbacks. A backchannel is a connection
established by the client, but used for RPCs done by the server
on the client (callbacks). As a result, this patch mixes some
client side calls in the server side and vice versa. Some
definitions in the .c files were extracted out into a file called
krpc.h, so that they could be included in multiple .c files.
This code has been in projects/nfsv4.1-client for some time.
Although no one has given it a formal review, I believe kib@
has taken a look at it.
The problem was a race condition between the EDT traversal used by
things like 'camcontrol devlist', and CAM peripheral driver
removal.
The EDT traversal code holds the CAM topology lock, and wants
to show devices that have been invalidated. It acquires a
reference to the peripheral to make sure the peripheral it is
examining doesn't go away.
However, because the peripheral removal code in camperiphfree()
drops the CAM topology lock to call the peripheral's destructor
routine, we can run into a situation where the EDT traversal
increments the peripheral reference count after free process is
already in progress. At that point, the reference count is
ignored, because it was 0 when we started the process.
Fix this race by setting a flag, CAM_PERIPH_FREE, that I previously
added and checked in xptperiphtraverse() and xptpdperiphtravsere(),
but failed to use. If the EDT traversal code sees that flag,
it will know that the peripheral free process has already started,
and that it should not access that peripheral.
Also, fix an inconsistency in the locking between
xptpdperiphtraverse() and xptperiphtraverse(). They now both
hold the CAM topology lock while calling the peripheral traversal
function.
cam_xpt.c: Change xptperiphtraverse() to hold the CAM topology
lock across calls to the traversal function.
Take out the comment in xptpdperiphtraverse() that
referenced the locking inconsistency.
cam_periph.c: Set the CAM_PERIPH_FREE flag when we are in the
process of freeing a peripheral driver.
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached
Note that those warnings are printed not often than every five minutes and can
be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.
Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.
All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.
Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
This is to allow debug images to be used without taking down the
system when non-fatal asserts are hit.
The following sysctls are added:
debug.kassert.warn_only: 1 = log, 0 = panic
debug.kassert.do_ktr: set to a ktr mask for logging via KTR
debug.kassert.do_log: 1 = log, 0 = quiet
debug.kassert.warnings: stats, number of kasserts hit
debug.kassert.log_panic_at:
number of kasserts before we actually panic, 0 = never
debug.kassert.log_pps_limit: pps limit for log messages
debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop
debug.kassert.kassert: set this sysctl to trigger a kassert
Discussed with: scottl, gnn, marcel
Sponsored by: iXsystems
The XC900M acts as a Ubiquiti XR9 (and I _think_ SR9) by default;
it uses the same 900MHz<->2.4GHz downconverter mapping.
However it has an alternative frequency mapping which squeezes in a couple
more half/quarter rate channels. Since the default HAL doesn't support
fractional tuning (sub-1MHz) in 2.4GHz mode on the AR5413/AR5414, they
implement it using a jumper.
Datasheet: http://www.xagyl.com/download/XC900M_Datasheet.pdf
Thankyou to Xagyl Communications for the XC900M NICs and Edgar Martinez
for organising the donation.
Tested:
* XC900M <-> XC900M
* Ubiquiti XR9 <-> XC900M
TODO:
* Test against SR9 and GZ901 if possible (the IEEE channel<->frequency
mapping may not match up, thanks to the slightly different channels
involved)
EPROTONOSUPPORT if the address family is not supported.
- introduce pffinddomain() to find a domain by family and use it as
appropriate.
Reviewed by: glebius
id hash. If a state has been disconnected from id hash, its rule pointers
can no longer be dereferenced, and referenced memory can't be modified.
Thus, move rule statistics from pf_free_rule() to pf_unlink_rule() and
update them prior to releasing id hash slot lock.
Reported by: Ian FREISLICH <ianf cloudseed.co.za>
from pfsync:
- Call into pfsync_delete_state() holding the state lock.
- Set the state timeout to PFTM_UNLINKED after state has been moved
to the PFSYNC_S_DEL queue in pfsync.
Reported by: Ian FREISLICH <ianf cloudseed.co.za>
- As the comment report, CALLOUT_LOCAL_ALLOC cannot be checked
directly from the callout flags but might be checked by a cached
value. Hence, do so before to actually remove the callout, when
needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
migration case explaining that the dereference should be safe
because of the migration dereference invariants.
Additively:
- In softclock_call_cc(), for the deferred migration case, move all the
accesses to callout structure after the comment stating the callout
must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
KASSERT() in the deferred migration case. It is not strictly necessary
but this way all the callout accesses happen after the above mentioned
comment, improving consistency.
Pointy hat to: me
Sponsored by: Isilon Systems / EMC Corporation
Reviewed by: kib
MFC after: 2 weeks
X-MFC: 243901
to map, and technically this isn't allowed.
Functionally, it works OK (at least on x86) to call bus_dmamap_load with
a NULL data pointer and zero length, so this is primarily for correctness
and consistency with other drivers.
While here, remove check in isci_io_request_construct for nseg==0.
Previously, bus_dmamap_load would pass nseg==1, even for case where
buffer is NULL and length = 0, which allowed CAM_DIR_NONE CCBs
to get processed. This check is not correct though, and needed to be
removed both for the changes elsewhere in this patch, as well as jeff's
preliminary bus_dmamap_load_ccb patch (which uncovered all of this in
the first place).
MFC after: 3 days
- Deembed scope id in L3 address in in6_lltable_dump().
- Simplify scope id recovery in rtsock routines.
- Remove embedded scope id handling in ndp(8) and route(8) completely.
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed the invalid tailq links. After
this, make softclock_call_cc() return void, since it always return
cc->cc_next, which is immediately available to the softclock()
anyway. This also allows to eliminate a label under #ifdef SMP.
Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from callwheel.
If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.
Postpone the free of the timeout(9) allocated callouts after the
migration checks are done.
Add some more strict asserts about the state of the callout in
callout_call_cc().
Reviewed by: attilio
Reported and tested by: pho (previous version)
MFC after: 2 weeks
callout is started before kern_setitimer() acquires process mutex, but
looses a race and kern_setitimer() gets the process mutex before the
callout. Then, assuming that new specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.
As the result of the race, both p_realtimer is zero, and the callout
is rescheduled. Then, in the exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized. The consequence is the corrupted callwheel tailq.
Use process mutex to interlock the callout start, which fixes the race.
Reported and tested by: pho
Reviewed by: jhb
MFC after: 2 weeks
- Check V_deembed_scopeid before checking if sa_family == AF_INET6.
- Fix scope id handing in route(8)[2] and ifconfig(8).
Reported by: rpaulo[1], Mateusz Guzik[1], peter[2]
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active lists iterators helper functions
did not provided the neccessary stability for the list, allowing the
iterators to pick garbage.
This was uncovered after the r243599 made the active list iterators
non-nop.
Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.
Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.
Reported and tested by: pho
MFC after: 1 week
thought I've decided its overkill,a simple tuneable for
each RX and TX limit, and then init sets the ring values
based on that, should be sufficient.
More importantly, fix a bug causing a panic, when changing
the define style to IXGBE_LEGACY_TX a taskqueue init was
inadvertently set #ifdef when it should be #ifndef.
I couldn't think of a way to maintain the hardware TXQ locks _and_ layer
on top of that per-TXQ software queuing and any other kind of fine-grained
locks (eg per-TID, or per-node locks.)
So for now, to facilitate some further code refactoring and development
as part of the final push to get software queue ps-poll and u-apsd handling
into this driver, just do away with them entirely.
I may eventually bring them back at some point, when it looks slightly more
architectually cleaner to do so. But as it stands at the present, it's
not really buying us much:
* in order to properly serialise things and not get bitten by scheduling
and locking interactions with things higher up in the stack, we need to
wrap the whole TX path in a long held lock. Otherwise we can end up
being pre-empted during frame handling, resulting in some out of order
frame handling between sequence number allocation and encryption handling
(ie, the seqno and the CCMP IV get out of sequence);
* .. so whilst that's the case, holding the lock for that long means that
we're acquiring and releasing the TXQ lock _inside_ that context;
* And we also acquire it per-frame during frame completion, but we currently
can't hold the lock for the duration of the TX completion as we need
to call net80211 layer things with the locks _unheld_ to avoid LOR.
* .. the other places were grab that lock are reset/flush, which don't happen
often.
My eventual aim is to change the TX path so all rejected frame transmissions
and all frame completions result in any ieee80211_free_node() calls to occur
outside of the TX lock; then I can cut back on the amount of locking that
goes on here.
There may be some LORs that occur when ieee80211_free_node() is called when
the TX queue path fails; I'll begin to address these in follow-up commits.
which dumps out the actual options being used by an NFS mount.
This will be used to implement a "-m" option for nfsstat(1).
Reviewed by: alfred
MFC after: 2 weeks
This brand of controllers expects that the number of
contexts specified in the input slot context points
to an active endpoint context, else it refuses to
operate.
- Ring the correct doorbell when streams mode is used.
- Wrap one or two long lines.
Tested by: Markus Pfeiffer (DragonFlyBSD)
MFC after: 1 week
Programming the low bits has a side-effect if unmasking the pin if it is
not disabled. So if an interrupt was pending then it would be delivered
with the correct new vector but to the incorrect old LAPIC.
This fix could be made clearer by preserving the mask bit while
programming the low bits and then explicitly resetting the mask bit
after all the programming is done.
Probability to trip over the fixed bug could be increased by bootverbose
because printing of the interrupt information in ioapic_assign_cpu
lengthened the time window during which an interrupt could arrive while
a pin is masked.
Reported by: Andreas Longwitz <longwitz@incore.de>
Tested by: Andreas Longwitz <longwitz@incore.de>
MFC after: 12 days
Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.
The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.
The ugly code employs potentially unsafe locking tricks.
Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.
PR: kern/151111
Reviewed by: kib
MFC after: 13 days
... to avoid any races or inconsistencies.
This should fix a regression introduced in r243404.
Also, remove a stale comment that has not been true for quite a while
now.
Pointyhat to: avg
Teested by: trociny, emaste, dumbbell (earlier version)
MFC after: 1 week
src/sys/{bsm,security/audit}. There are a few tweaks to help with the
FreeBSD build environment that will be merged back to OpenBSM. No
significant functional changes appear on the kernel side.
Obtained from: TrustedBSD Project
Sponsored by: The FreeBSD Foundation (auditdistd)
enforcing the TXOP and TBTT limits:
* Frames which will overlap with TBTT will not TX;
* Frames which will exceed TXOP will be filtered.
This is not enabled by default; it's intended to be enabled by the
TDMA code on 802.11n capable chipsets.
the revamped sysctl code did not work, and needed a change. This
makes the limit get set at the time that all sysctl stats are
created and is actually more elegant imho anyway.
TX hot path by getting rid of index calculations and simply
managing pointers. Much of the creative code is due to my
coworker here at Intel, Alex Duyck, thanks Alex!
Also, this whole series of patches was given the critical
eye of Gleb Smirnoff and is all the better for it, thanks
Gleb!
- add a limit for both RX and TX, change the default to 256
- change the sysctl usage to be common, and now to be called
during init for each ring.
- the TX limit is not yet used, but the changes in the last
patch in this series uses the value.
- the motivation behind these changes is to improve data
locality in the final code.
- rxeof interface changes since it now gets limit from the
ring struct
Fix path handling for *at() syscalls.
Before the change directory descriptor was totally ignored,
so the relative path argument was appended to current working
directory path and not to the path provided by descriptor, thus
wrong paths were stored in audit logs.
Now that we use directory descriptor in vfs_lookup, move
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold file descriptors table lock, so we are sure paths will
be resolved according to the same directory in audit record and
in actual operation.
Sponsored by: FreeBSD Foundation (auditdistd)
Reviewed by: rwatson
MFC after: 2 weeks
defines (at Gleb's request). Also, change the defines around
the old transmit code to IXGBE_LEGACY_TX, I do this to make
it possible to define this regardless of the OS level (it is
not defined by default). There are also a couple changed
comments for clarity.
Currently when we discover that trail file is greater than configured
limit we send AUDIT_TRIGGER_ROTATE_KERNEL trigger to the auditd daemon
once. If for some reason auditd didn't rotate trail file it will never
be rotated.
Change it by sending the trigger when trail file size grows by the
configured limit. For example if the limit is 1MB, we will send trigger
on 1MB, 2MB, 3MB, etc.
This is also needed for the auditd change that will be committed soon
where auditd may ignore the trigger - it might be ignored if kernel
requests the trail file to be rotated too quickly (often than once a second)
which would result in overwriting previous trail file.
Sponsored by: FreeBSD Foundation (auditdistd)
MFC after: 2 weeks
Currently on each record write we call VFS_STATFS() to get available space
on the file system as well as VOP_GETATTR() to get trail file size.
We can assume that trail file is only updated by the audit worker, so instead
of asking for file size on every write, get file size on trail switch only
(it should be zero, but it's not expensive) and use global variable audit_size
protected by the audit worker lock to keep track of trail file's size.
This eliminates VOP_GETATTR() call for every write. VFS_STATFS() is satisfied
from in-memory data (mount->mnt_stat), so shouldn't be expensive.
Sponsored by: FreeBSD Foundation (auditdistd)
MFC after: 2 weeks
these are FCOE stats (fiber channel over ethernet), something that
FreeBSD does not yet have, they were mistaken for flow control by
the implementor I believe. Secondly, the real flow control stats
are oddly named with a 'link' tag on the front, it was requested
by my validation engineer to make these stats have the same name as
the igb driver for clarity and that seemed reasonable to me.
Remove redundant call to AUDIT_ARG_UPATH1().
Path will be remembered by the following NDINIT(AUDITVNODE1) call.
Sponsored by: FreeBSD Foundation (auditdistd)
MFC after: 2 weeks
multiqueue code, this functionality has proven to be more
trouble than it was worth. Thanks to Gleb for a second
critical look over my code and help in the patches!
* Global IPFW_DYN_LOCK() is changed to per-bucket mutex.
* State expiration is done in ipfw_tick every second.
* No expiration is done on forwarding path.
* hash table resize is done automatically and does not flush all states.
* Dynamic UMA zone is now allocated per each VNET
* State limiting is now done via UMA(9) api.
Discussed with: ipfw
MFC after: 3 weeks
Sponsored by: Yandex LLC
- Add "fdt addr" subcommand that lets you specify preloaded blob address
- Do not pre-initialize blob for "fdt addr"
- Do not try to load dtb every time fdt subcommand is issued,
do it only once
- Change the way DTB is passed to kernel. With introduction of "fdt addr"
actual blob address can be not virtual but physical or reside in
area higher then 64Mb. ubldr should create copy of it in kernel area
and pass pointer to this newly allocated buffer which is guaranteed to work
in kernel after switching on MMU.
- Convert memreserv FDT info to "memreserv" property of root node
FDT uses /memreserve/ data to notify OS about reserved memory areas.
Technically it's not real property, it's just data blob, sequence
of <start, size> pairs where both start and size are 64-bit integers.
It doesn't fit nicely with OF API we use in kernel, so in order to unify
thing ubldr converts this data to "memreserve" property using the same
format for addresses and sizes as /memory node.
It returns memory regions restricted from being used by kernel. These
regions are dfined in "memreserve" property of root node in the same
format as "reg" property of /memory node
embryonic connection has been setup and never attempt to abort a tid
before this is done. This fixes a bad race where a listening socket is
closed when the driver is in the middle of step (b) here. The symptom
of this were "ARP miss" errors from the driver followed by tid leaks.
A hardware-offloaded passive open works this way:
a) A SYN "hits" the TCAM entry for a server tid and the chip delivers it
to the queue associated with the server tid (say, queue A). It waits
for a response from the driver telling it what to do.
b) The driver decides it is ok to proceed. It adds the new tid to the
list of embryonic connections associated with the server tid and then
hands off the SYN to the kernel's syncache to make sure that the kernel
okays it too. If it does then the driver provides an L2 table entry,
queue id (say, queue B), etc. and instructs the chip to send the SYN/ACK
response.
c) The chip delivers a status to queue B depending on how the third step
of the 3-way handshake goes. The driver removes the tid from its list
of embryonic connections and either expands the syncache entry or
destroys the tid. In any case all subsequent messages for the new tid
will be delivered to queue B, not queue A. Anything running in queue B
knows that the L2 entry has long been setup and the new flag is of no
interest from here on. If the listener is closed it will deal with
so_comp as normal.
MFC after: 1 week
for bridge interface.
- If we found a collision we can break the loop - only one collision is
possible and one is exactly enough to need to renegerate.
Obtained from: WHEEL Systems
MFC after: 1 week
variable as they may overflow on i386/PAE and i386 with > 2GB RAM.
Use 64bit quad_t instead. It has broader kernel infrastructure support
with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types.
Pointed out by: alc, bde
but LDFLAGS is not (yet) passed on to the linker (via SYSTEM_LD et al).
Do so now. As such, any kernel configuration can now define linker
flags by setting LDFLAGS as normal and not have to revert to hacks
like setting DEBUG for flags that do not relate to debugging (see
sys/powerpc/conf/MPC85XX).
Make the following interface changes to my beastie boot menu:
+ Move boot options to a submenu
+ Add a new "Boot Single" menu item
+ Make "Boot" item and new "Boot Single" item reverse when boot_single is set
+ Add new "Load Defaults" item (in new "Boot Options" submenu) for overridding
loader.conf(5) provided values with system defaults.
Reviewed by: adrian (co-mentor)
Approved by: adrian (co-mentor)
Bring several definitions required for newer ext4 features.
Rename EXT2F_COMPAT_HTREE to EXT2F_COMPAT_DIRHASHINDEX since it
is not being used yet and the new name is more compatible with
NetBSD and Linux.
This change is purely cosmetic and has no effect on the real
code.
Obtained from: NetBSD
MFC after: 3 days
When a file is first being written, the dynamic block reallocation
(implemented by ext2_reallocblks) relocates the file's blocks
so as to cluster them together into a contiguous set of blocks on
the disk.
When the cluster crosses the boundary into the first indirect block,
the first indirect block is initially allocated in a position
immediately following the last direct block. Block reallocation
would usually destroy locality by moving the indirect block out of
the way to keep the data blocks contiguous.
The issue was diagnosed long ago by Bruce Evans on ffs and surfaced
on ext2fs when block reallocaton was ported. This is only a partial
solution based on the similarities with FFS. We still require more
review of the allocation details that vary in ext2fs.
Reported by: bde
MFC after: 1 week
kernel memory, whichever is lower. The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.
The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline. In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily. Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.
At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers. This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.
Tidy up ordering in init_param2() and check up on some users of
those values calculated here.
Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters to get 1/4 each because these are the most heavily used mbuf
sizes. 2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets. The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used
these large clusters will end up only on the inbound path. They are
not used on outbound, there it's still 4K. Yes, that will stay that
way because otherwise we run into lots of complications in the
stack. And it really isn't a problem, so don't make a scene.
Normal mbufs (256B) weren't limited at all previously. This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.
The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster. Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.
NB: Every cluster also has an mbuf associated with it.
Two examples on the revised mbuf sizing limits:
1GB KVM:
512MB limit for mbufs
419,430 mbufs
65,536 2K mbuf clusters
32,768 4K mbuf clusters
9,709 9K mbuf clusters
5,461 16K mbuf clusters
16GB RAM:
8GB limit for mbufs
33,554,432 mbufs
1,048,576 2K mbuf clusters
524,288 4K mbuf clusters
155,344 9K mbuf clusters
87,381 16K mbuf clusters
These defaults should be sufficient for even the most demanding
network loads.
MFC after: 1 month
accept queues a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.
The submitted patch is slightly changed in comments, teardown
and locking order and extended with KASSERT's.
Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by: His team.
MFC after: 1 week
now this works for non-debug and debug builds.
* Add a comment reminding me (or someone) to audit all of the relevant
math to ensure there's no weird wrapping issues still lurking about.
But yes, this does seem to be mostly working.
Pointy-hat-to: adrian, yet again
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
NOCAPCHECK namei flag.
This functionality will be used to enable core dumps for sandboxed processes.
Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks
to himself. For example abort(3) at first tries to do kill(getpid(), SIGABRT)
which was failing in capability mode, so the code was failing back to exit(1).
Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks
While here, also make the code that enforces power-of-two more
forgiving, instead of just resetting to 512, graciously round-down
to the next lower power of two.
* add some further debugging prints, which are quite nice to have
* add in ALQ hooks (optional!) to allow for the TDMA information to be
logged in-line with the TX and RX descriptor information.
The existing logic wrapped programming nexttbtt at 65535 TU.
This is not good enough for the 11n chips, whose nexttbtt register
(GENERIC_TIMER_0) has an initial value from 0..2^31-1 TSF.
So converting the TU to TSF had the counter wrap at (65535 << 10) TSF.
Once this wrap occured, the nexttbtt value was very very low, much
lower than the current TSF value. At this point, the nexttbtt timer
would constantly fire, leading to the TX queue being constantly gated
open.. and when this occured, the sender was not correctly transmitting
in its slot but just able to continuously transmit. The master would
then delay transmitting its beacon until after the air became free
(which I guess would be after the burst interval, before the next burst
interval would quickly follow) and that big delta in master beacon TX
would start causing big swings in the slot timing adjustment.
With this change, the nexttbtt value is allowed to go all the way up
to the maximum value permissable by the 32 bit representation.
I haven't yet tested it to that point; I really should. The AR5212
HAL now filters out values above 65535 TU for the beacon configuration
(and the relevant legal values for SWBA, DBA and NEXTATIM) and the
AR5416 HAL just dutifully programs in what it should.
With this, TDMA is now useful on the 802.11n chips.
Tested:
* AR5416, AR9280 TDMA slave
* AR5413 TDMA slave
what the maximum legal values are.
The current beacon timer configuration from TDMA wraps things at
HAL_BEACON_PERIOD-1 TU. For the 11a chips this is fine, but for
the 11n chips it's not enough resolution. Since the 11a chips have a
limit on what's "valid", just enforce this so when I do write larger
values in, they get suitably wrapped before programming.
Tested:
* AR5413, TDMA slave
Todo:
* Run it for a (lot) longer on a clear channel, ensure that no strange
slippages occur.
* Re-validate this on STA configurations, just to be sure.
much all the union of all the kernel configuration files, including all
the CPU types, Marvell SOC types and at91 board types. Any device not
supported (read: does not compile) has been removed, which is a fairly
small set actually. As such, LINT gives us very good coverage without
having to build a zillion kernels.
expand to uncompilable code when the kernel configuration contains
"options DEBUG", such as it is for LINT. The toolchain is often a
better approach to figure this out, as it doesn't require one to
boot the kernel.
interfere with structure fields of the same name in drivers, like
the intr_disable function pointer in struct cphy_ops in cxgb(4).
Instead define intr_disable and intr_restore as inline functions.
With intr_disable() an inline function, the I32_bit and F32_bit
macros now need to be visible in MI code and given the rather
poor names, this is not at all good. Define ARM_CPSR_F32 and
ARM_CPSR_I32 and use that instead of F32_bit and I32_bit (resp)
for now.
The device reports support for SATA Asynchronous Notification in its
IDENTIFY data, but returns error on attempt to enable that feature.
Make SATA XPT of CAM only report these errors, but not fail the device.
MFC after: 1 week
fail or not. The mbuf pointer is no longer valid, so
can't be reused after.
Fix igb_mq_start() where mbuf pointer was used after
drbr_enqueue().
This eventually leads us to all invocations of
igb_mq_start_locked() called with third argument as NULL.
This allows us to simplify this function.
Submitted by: Karim Fodil-Lemelin <fodillemlinkarim gmail.com>
Reviewed by: jfv
Introduce a new dataset aclmode setting "restricted" to protect ACL's
being destroyed or corrupted by a drive-by chmod.
illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted
References:
https://www.illumos.org/issues/3254
MFC after: 2 weeks
the vnode use count, and this might cause the kernel to panic if compiled
with WITNESS enable.
- Be sure to put the '\0' terminator to the rpath string.
Sponsored by: iXsystems inc.
detailed information under the sound debug. To make it easier accessible,
export that information through the set of sysctls like dev.hdaa.X.nidY.
Also tune some output to make it both more compact and informative.
Add detail to the comment describing this function. In particular,
describe what MAP_PREFAULT_PARTIAL does.
Eliminate the abrupt change in behavior when the specified address range
grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages. Instead of
doing nothing, i.e., preloading no mappings whatsoever, map any resident
pages that fall within the start of the specified address range, i.e.,
[addr, addr + ulmin(size, ptoa(MAX_INIT_PT))).
Long ago, the vm object's list of resident pages was not ordered, so
this function had to choose between probing the global hash table of
all resident pages and iterating over the vm object's unordered list of
resident pages. Now, the list is ordered, so there is no reason for
MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of
resident changes.
MFC after: 14 days
Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.
It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.
IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write
References:
https://www.illumos.org/issues/3236
MFC after: 2 weeks
* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.
* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model. All destructive actions are done
in zfs_freebsd_reclaim.
This allows to simplify zfs_zget logic.
* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.
* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.
* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows to simplify zfs_zget logic further.
The above changes fix at least two known/reported problems:
o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit. This type of deadlock has not been
reported as of now.
o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim. This would happen if the znode is unlinked
while its vnode is still active.
To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start. This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested. Or to avoid blocking
when it is not allowed (see zil_commit example above).
ffs_vgetf seems like a good source of inspiration.
Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 6 weeks
... otherwise zfs_getpages would mostly be called with one page at a time.
It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes and also because we don't
really know if any given blocks are consecutive, we can not really report
any additional blocks behind or ahead of a given block. Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and thus pass back blk = lblk. The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed. vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of a variable block size, which
vnode_pager_haspage doesn't really know - it only knows max block size in
a filesystem. So pages from multiple blocks can be passed to zfs_getpages,
but that is expected and correctly handled.
vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements VOP_PUTPAGES and thus vnode_pager_generic_getpages is not used.
vfs_cluster code vfs_bio code should not be called for ZFS, because ZFS does
not use buffer cache layer.
Also, ZFS does not use vn_bmap_seekhole, it has its prviate mechanism for
working with holes.
The above list should cover all the current calls to VOP_BMAP.
Reviewed by: kib
MFC after: 6 weeks
There has not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.
Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win stop_cpus_hard(other_cpus) race and proceed past that
call.
MFC after: 1 month
Illumos 13886:e3261d03efbf
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
References:
https://www.illumos.org/issues/3349
MFC after: 1 week
... because the latter makes some decision based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.
This is not an issue for upstream because they do not seem to support
using raidz as a root pool.
Reported by: Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by: Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after: 6 days
The call is a NOP, because pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version is properly set then the call leads to a NULL pointer
dereference because the spa object is still under-constructed.
The same change was independently made in the upstream as a part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).
MFC after: 6 days
Executive code where similar invariant knobs exist.
o) Make the Simple Executive's warning function print "WARNING: " on the same
line as the warning it is displaying, rather than on a separate line.
After chatting with the MAC team, the TSF writes (at least on the 11n
MACs, I don't know about pre-11n MACs) are done as 64 bit writes that
can take some time. So, doing a 32 bit TSF write is definitely not
supported. Leave a comment here which explains that.
Whilst here, add a comment which outlines that after a reset or TSF
write, the TSF write may take a while (up to 50uS) to update.
A write or reset shouldn't be done whilst the previous one is in
flight. Also (and this isn't currently done) a read shouldn't
occur until the SLEEP32_TSF_WRITE_STAT is clear. Right now we're
not doing that, mostly because we haven't been doing lots of TSF
resets/writes until recently.
reducing the number of runtime checks done by the SDK code.
o) Group board/CPU information at early startup by subject matter, so that e.g.
CPU information is adjacent to CPU information and board information is
adjacent to board information.
TSF write.
The TSF_L32 update is fine for the AR5413 (and later, I guess) 11abg NICs
however on the 11n NICs this didn't work. The TSF writes were causing
a much larger time to be skipped, leading to the timing to never
converge.
I've tested this 64 bit TSF read, adjust and write on both the
11n NICs and the AR5413 NIC I've been using for testing. It works
fine on each.
This patch allows the AR5416/AR9280 to be used as a TDMA member.
I don't yet know why the AR9280 is ~7uS accurate rather than ~3uS;
I'll look into it soon.
Tested:
* AR5413, TDMA slave (~ 3us accuracy)
* AR5416, TDMA slave (~ 3us accuracy)
* AR9280, TDMA slave (~ 7us accuracy)
on the 802.11n NICs.
The 802.11n NICs return a TBTT value that continues far past the 16 bit
HAL_BEACON_PERIOD time (in TU.) The code would constrain nextslot to
HAL_BEACON_PERIOD, but it wasn't constraining nexttbtt - the pre-11n
NICs would only return TU values from 0 -> HAL_BEACON_PERIOD. Thus,
when nexttbtt exceeded 64 milliseconds, it would not wrap (but nextslot
did) which lead to a huge tsfdelta.
So until the slot calculation is converted to work in TSF rather than
a mix of TSF and TU, "make" the nexttbtt values match the TU assumptions
for pre-11n NICs.
This fixes the crazy deltatsf calculations but it doesn't fix the
non-convergent tsfdelta issue. That'll be fixed in a subsequent commit.
Rasperry Pi firmware has a set of hardcoded pathes it uses to fill
FDT with system-specific information like display resolution, memory
size, UART and SDHCI clocks, ethernet MAC address. Handle two of them:
- Add placeholder for ethernet MAC address
- Move display node out of "axi" node
... instead of the ever increasing ones.
Also, do free old resources when allocating new ones when cx states
change.
Tested by: Tom Lislegaard <Tom.Lislegaard@proact.no>
Obtained from: jkim
MFC after: 1 week
useful and has the side effect of obfuscating the code a bit.
- Remove spurious references to simple_lock.
Reported by: attilio [1]
Sponsored by: iXsystems inc.
vnode and following back the chain of n_parent pointers up to the root,
without acquiring the locks of the n_parent vnodes analyzed during the
computation. This is immediately wrong because if the vnode lock is not
held there's no guarantee on the validity of the vnode pointer or the data.
In order to fix, store the whole path in the smbnode structure so that
smbfs_fullpath() can use this information.
Discussed with: kib
Reported and tested by: pho
Sponsored by: iXsystems inc.
- The feature is dangerous because the kernel code didn't check
validity of the memory address provided from user space.
- It seems that mdconfig(8) never really supported attaching preloaded
memory disks.
- Preloaded memory disks are automatically attached during md(4)
initialization. Thus there shouldn't be much use for the feature.
PR: kern/169683
Discussed on: freebsd-hackers
This is the missing piece for FreeBSD/Wii, but there's still a lot of
work ahead. We have to reset the MMU in locore before continuing
the boot process because we don't know how the boot loaders might
have setup the BATs. We also disable the PCI BAT because there's no PCI
bus on the Wii.
Thanks to Nathan Whitehorn and Peter Grenhan for their help.
Submitted by: Margarida Gouveia
the unix domain sockets to the next tick, coalescing the serial calls
until the collection fires. The thought is that more work for the
collector could arise in the near time, allowing to clean more and not
spend too much CPU on repeated collection when there is no garbage.
Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which caused excessive
CPU usage and too long blocking of the threads waiting for
unp_list_lock and unp_link_rwlock in write mode.
Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly. E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.
Reported and tested by: Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by: rwatson
MFC after: 2 weeks
taskqueue_enqueue_timeout(). Do not rearm the callout if it is
already armed and the ticks is negative. Otherwise rearm it to fire
in abs(ticks) ticks in the future.
The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument. As result, the
task is scheduled to execute not further than abs(ticks) ticks in
future, and the consequent enqueues are coalesced until the already
scheduled task is finished.
Reviewed by: rwatson
Tested by: Markus Gebert <markus.gebert@hostpoint.ch>
MFC after: 2 weeks
encryption types.
The AR5210 only has four WEP key slots, in contrast to what the
later MACs have (ie, the keycache.) So there's no way to store a "clear"
key.
Even if the driver is taught to not allocate CLR key entries for
the AR5210, the hardware will actually attempt to decode the encrypted
frames with the (likely all 0!) WEP keys.
So for now, disable the hardware encryption entirely and just so it
all in software. That allows both WEP -and- WPA to actually work.
If someone wishes to try and make hardware WEP _but_ software WPA work,
they'll have to create a HAL capability to enable/disable hardware
encryption based on the current STA/Hostap mode. However, making
multi-vap work with one WEP and one WPA VAP will require hardware
encryption to be disabled anyway.
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.
Removes comments that makes no sense now.
Reviewed by: kib
MFC after: 3 days
then:
- assume the lock is held in exclusive mode and remove a moot check
about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.
Reviewed by: kib
MFC after: 2 weeks
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).
Discussed with: kib
MFC after: 10 days
other than UART 0 from the outset.
o) Print board information from sysinfo after consoles have been initialized
rather than doing it during boot descriptor parsing.
o) Use cvmx_safe_printf and platform_reset rather than panic when doing very
early boot descriptor parsing before the console is set up.
o) Get rid of the global octeon_bootinfo.