Calculate the amount of bytes to copy for select filedescriptor masks
taking into account size of fd_set for the current process ABI.
Approved by: re (kensmith)
Clean up Marvell platform code.
Introduce SheevaPlug support.
- The device is based on Marvell 88F6281 system on chip.
- More info about the platform at http://www.plugcomputer.org
- To build the FreeBSD kernel:
make buildkernel TARGET_ARCH=arm KERNCONF=SHEEVAPLUG
- Installation notes at: http://wiki.freebsd.org/FreeBSDMarvell
Submitted by: Michal Hajduk
Approved by: re (kib)
Obtained from: Semihalf
Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.
Reviewed by: bz
Approved by: re
The bootp code installs an interface address and the nfs client
module tries to install the same address again. This extra code
is removed, which was discovered by the removal of a call to
in_ifscrub() in r196714. This call to in_ifscrub is put back here
because the SIOCAIFADDR command can be used to change the prefix
length of an existing alias.
r197235 reverts file nfs_vfsops.c
Reviewed by: kmacy
Approved by: re
This patch fixes the following issues:
- Routing messages are not generated when adding and removing
interface address aliases.
- Loopback route installed for an interface address alias is
not deleted from the routing table when that address alias
is removed from the associated interface.
- Function in_ifscrub() is called extraneously.
Reviewed by: gnn, kmacy, sam
Approved by: re
Previously local end of point-to-point interface is not reachable
within the system that owns the interface. Packets destined to
the local end point leak to the wire towards the default gateway
if one exists. This behavior is changed as part of the L2/L3
rewrite efforts. The local end point is now reachable within the
system. The inpcb code needs to consider this fact during the
address selection process.
Reviewed by: bz
Approved by: re
Use explicit int values for the device states in order to allow, if
necessary, in the future, adds of new states without breaking ABI
between revisions.
Please note that this is a special condition as we want this fix in
before RC1 as we assume it is critical and so it has been handled
as an instant-merge.
Approved by: re (kib)
Fix sched_switch_migrate() by assuming locks cannot be shared and a
deadlock between 3 different threads by acquiring both runqueue locks
when doing the migration.
Please note that this is a special condition as we want this fix in
before RC1 as we assume it is critical and so it has been handled
as an instant-merge. For the STABLE_7 branch, 1 week before the MFC
is assumed.
Approved by: re (kib)
The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.
Busy mountpoint before dropping softdep lk.
Approved by: re (kensmith)
Remove 'ad:' prefix from disk serial number. We don't want serial number
to change when we reconnect the disk in a way that it is accessible through
CAM for example.
Discussed with: trasz
Simplify g_disk_ident_adjust() function and allow any printable character
in serial number.
Discussed with: trasz
Obtained from: Wheel Sp. z o.o. (http://www.wheel.pl)
Make serial numbers of daX disks visible by GEOM.
No objections from: scottl
Obtained from: Wheel Sp. z o.o. (http://www.wheel.pl)
Approved by: re (kib)
r196943,r196944,r196947,r196950,r196953,r196954,r196965,r196978,r196979,
r196980,r196982,r196985,r196992,r197131,r197133,r197150,r197151,r197152,
r197153,r197167,r197172,r197177,r197200,r197201:
r196456:
- Give minclsyspri and maxclsyspri real values (consulted with kmacy).
- Honour 'pri' argument for thread_create().
r196457:
Set priority of vdev_geom threads and zvol threads to PRIBIO.
r196458:
- Hide ZFS kernel threads under zfskern process.
- Use better (shorter) threads names:
'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
'vdev:worker da0' -> 'vdev da0'
r196662:
Add missing mountpoint vnode locking.
This fixes panic on assertion with DEBUG_VFS_LOCKS and vfs.usermount=1 when
regular user tries to mount dataset owned by him.
r196702:
Remove empty directory.
r196703:
Backport the 'dirtying dbuf' panic fix from newer ZFS version.
Reported by: Thomas Backman <serenity@exscape.org>
r196919:
bzero() on-stack argument, so mutex_init() won't misinterpret that the
lock is already initialized if we have some garbage on the stack.
PR: kern/135480
Reported by: Emil Mikulic <emikulic@gmail.com>
r196927:
Changing provider size is not really supported by GEOM, but doing so when
provider is closed should be ok.
When administrator requests to change ZVOL size do it immediately if ZVOL
is closed or do it on last ZVOL close.
PR: kern/136942
Requested by: Bernard Buri <bsd@ask-us.at>
r196928:
Teach zdb(8) how to obtain GEOM provider size.
PR: kern/133134
Reported by: Philipp Wuensche <cryx-freebsd@h3q.com>
r196943:
- Avoid holding mutex around M_WAITOK allocations.
- Add locking for mnt_opt field.
r196944:
Don't recheck ownership on update mount. This will eliminate LOR between
vfs_busy() and mount mutex. We check ownership in vfs_domount() anyway.
Noticed by: kib
Reviewed by: kib
r196947:
Defer thread start until we set priority.
Reviewed by: kib
r196950:
Fix detection of file system being shared. Now zfs unshare/destroy/rename
command will properly remove exported file systems.
r196953:
When snapshot mount point is busy (for example we are still in it)
we will fail to unmount it, but it won't be removed from the tree,
so in that case there is no need to reinsert it.
Reported by: trasz
r196954:
If we have to use avl_find(), optimize a bit and use avl_insert() instead of
avl_add() (the latter is actually a wrapper around avl_find() + avl_insert()).
Fix similar case in the code that is currently commented out.
r196965:
Fix reference count leak for a case where snapshot's mount point is updated.
r196978:
Call ZFS_EXIT() after locking the vnode.
r196979:
On FreeBSD we don't have to look for snapshot's mount point,
because fhtovp method is already called with proper mount point.
r196980:
When we automatically mount snapshot we want to return vnode of the mount point
from the lookup and not covered vnode. This is one of the fixes for using .zfs/
over NFS.
r196982:
We don't export individual snapshots, so mnt_export field in snapshot's
mount point is NULL. That's why when we try to access snapshots over NFS
use mnt_export field from the parent file system.
r196985:
Only log successful commands! Without this fix we log even unsuccessful
commands executed by unprivileged users. Action is not really taken, but it is
logged to pool history, which might be confusing.
Reported by: Denis Ahrens <denis@h3q.com>
r196992:
Implement __assert() for Solaris-specific code. Until now Solaris code was
using Solaris prototype for __assert(), but FreeBSD's implementation.
Both take different arguments, so we were either core-dumping in assert()
or printing garbage.
Reported by: avg
r197131:
Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
NULL, but also can point to dead vnode, take that into account.
PR: kern/132068
Reported by: Edward Fisk <7ogcg7g02@sneakemail.com>, kris
Fix based on patch from: Jaakko Heinonen <jh@saunalahti.fi>
r197133:
- Protect reclaim with z_teardown_inactive_lock.
- Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
z_dbuf field is NULL - this might happen in case of rollback or forced
unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
- On forced unmount wait for all znodes to be destroyed - destruction can be
done asynchronously via zfs_reclaim_complete().
r197150:
There is a bug where mze_insert() can trigger an assert() of inserting
the same entry twice. This bug is not fixed yet, but leads to situation
where when try to access corrupted directory the kernel will panic.
Until the bug is properly fixed, try to recover from it and log that it
happened.
Reported by: marck
OpenSolaris bug: 6709336
r197151:
Be sure not to overflow struct fid.
r197152:
Extend scope of the z_teardown_lock lock for consistency and "just in case".
r197153:
When zfs.ko is compiled with debug, make sure that znode and vnode point at
each other.
r197167:
Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
by just returning EOPNOTSUPP. This will allow NFS server to fall back to
regular READDIR.
Note that converting inode number to snapshot's vnode is expensive operation.
Snapshots are stored in AVL tree, but based on their names, not inode numbers,
so to convert inode to snapshot vnode we have to interate over all snalshots.
This is not a problem in OpenSolaris, because in their READDIRPLUS
implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
d_fileno as we do.
PR: kern/125149
Reported by: Weldon Godfrey <wgodfrey@ena.com>
Analysis by: Jaakko Heinonen <jh@saunalahti.fi>
r197172:
Add missing \n.
Reported by: marck
r197177:
Support both case: when snapshot is already mounted and when it is not yet
mounted.
r197200:
Modify mount(8) to skip MNT_IGNORE file systems by default, just like df(1)
does. This is not POLA violation, because there is no single file system in the
base that use MNT_IGNORE currently, although ZFS snapshots will be mounted with
MNT_IGNORE after next commit.
Reviewed by: kib
r197201:
- Mount ZFS snapshots with MNT_IGNORE flag, so they are not visible in regular
df(1) and mount(8) output. This is a bit smilar to OpenSolaris and follows
ZFS route of not listing snapshots by default with 'zfs list' command.
- Add UPDATING entry to note that ZFS snapshots are no longer visible in
mount(8) and df(1) output by default.
Reviewed by: kib
Approved by: re (bz)
Don't malloc a buffer while holding the prison0 mutex. Instead, use a loop
where we figure out the hostname length under the lock, malloc the buffer
with the lock dropped, then recheck the length under the lock and loop again
if the buffer is now too small.
Approved by: re (kib)
Add LK_NOWITNESS to the vn_lock() calls done on newly created nfs
vnodes, since these nodes are not linked into the mount queue and,
as such, the vn_lock() cannot cause a deadlock so LORs are harmless.
Suggested by: kib
Approved by: re (kensmith), kib (mentor)
Change 'dev.cpu.N.temperature', sysctl I (degC) to IK (Kelvin),
to match acpi_thermal(4) and amdtemp(4).
Approved by: re (rwatson)
Reviewed by: rpaulo
Suggested by: ume
Unlock the image vnode around the call of pmc PMC_FN_PROCESS_EXEC hook.
The hook calls vn_fullpath(9), that should not be executed with a vnode
lock held.
Approved by: re (kensmith)
In vfs_mark_atime(9), be resistent against reclaimed vnodes.
Assert that neccessary locks are taken, since vop might not be called.
Approved by: re (kensmith)
When joining a multicast group, the inp_lookup_mcast_ifp call
does a KASSERT that the group address is multicast, so the
check if this is indeed true and eventually return a EINVAL if not,
should be done before calling inp_lookup_mcast_ifp. This fixes a kernel
crash when calling setsockopt (sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,...)
with invalid group address.
Reviewed by: bms
Approved by: re (kib)
for stable branches:
- shift to MALLOC_PRODUCTION
- turn off automatic crash dumps
- Remove kernel debuggers, INVARIANTS*[1], WITNESS* from
GENERIC kernel config files[2]
[1] INVARIANTS* left on for ia64 by request marcel
[2] sun4v was left as-is
Reviewed by: marcel, kib
Approved by: re (implicit)
insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.
Approved by: re (kensmith)
In fhopen, vfs_ref() the mount point while vnode is unlocked, to prevent
vn_start_write(NULL, &mp) from operating on potentially freed or reused
struct mount *.
Remove unmatched vfs_rel() in cleanup.
Approved by: re (kensmith)
SYSCTLs which are inappropriate for a daily use of the machine (mostly
useful only by a developer which wants to run benchmarks on it).
Remove them before the release as long as we do not want to ship with
them in.
Now that the SYSCTLs are gone, instead than use static storage for some
constants, use real numeric constants in order to avoid eventual compiler
dumbiness and the risk to share a storage (and then a cache-line) among
CPUs when doing adaptive spinning together.
Pleasse note that the sys/linker_set.h inclusion in lockmgr and sx lock
support could have been gone, but re@ preferred them to be in order to
minimize the risk of problems on future merging.
Please note that this patch is not a MFC, but an 'edge case' as commit
directly to stable/8, which creates a diverging from HEAD.
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Approved by: re (kib)
fix adaptive spinning in lockmgr by using correctly GIANT_RESTORE and
continue statement and improve adaptive spinning for sx lock by just
doing once GIANT_SAVE.
Approved by: re (kib)
attached to the bridge, rather than just in the case
when some device cannot do TSO. Customer tests have
shown that even when all devices can do TSO that LRO
will cause problems when bridging.
Approved by: re
Don't attempt to bind the current thread to the CPU an IRQ is bound to
when removing an interrupt handler from an IRQ during shutdown. During
shutdown we are already bound to CPU 0 and this was triggering a panic.
Approved by: re (kib)
Allow a jail's name to be the same as its jid (which is the default if
no name is specified), and let a numeric name specify the jid for a new
jail when the jid isn't otherwise set. Still disallow other numeric
names.
Reviewed by: zec
Approved by: re (kib), bz (mentor)
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ.
Introduce separate kernel stack cache that keeps some limited amount of
preallocated kernel stacks to lower the latency of thread allocation.
Not a merge: instead of removing td_altkstack* members of struct thread,
replace them with placeholders to keep struct thread layout on the
stable branch.
Also, record r196640, r196644 and r196648 as merged.
Approved by: re (kensmith)
Make the mnt_writeopcount and mnt_secondary_writes counters,
used by the suspension code, not greater then mnt_ref reference
counter value.
MFC r196733:
Fix mount reference leak when V_XSLEEP is specified to vn_start_write().
Approved by: re (kensmith)
This patch fixes the following issues:
- Interface link-local address is not reachable within the
node that owns the interface, this is due to the mismatch
in address scope as the result of the installed interface
address loopback route. Therefore for each interface
address loopback route, the rt_gateway field (of AF_LINK
type) will be used to track which interface a given
address belongs to. This will aid the address source to
use the proper interface for address scope/zone validation.
- The loopback address is not reachable. The root cause is
the same as the above.
- Empty nd6 entries are created for the IPv6 loopback addresses
only for validation reason. Doing so will eliminate as much
of the special case (loopback addresses) handling code
as possible, however, these empty nd6 entries should not
be returned to the userland applications such as the
"ndp" command.
Since both of the above issues contain common files, these
files are committed together.
Reviewed by: bz
Approved by: re
This patch fixes an address scope violation. Considering the
scenario where an anycast address is assigned on one interface,
and a global address with the same scope is assigned on another
interface. In other words, the interface owns the anycast
address has only the link-local address as one other address.
Without this patch, "ping6" the anycast address from another
station will observe the source address of the returned ICMP6
echo reply has the link-local address, not the global address
that exists on the other interface in the same node.
Reviewed by: bz
Approved by: re
Rather than having enabled/disabled, implement a max queue depth.
While usually not an issue, this firewalls bugs in the code that may
run us out of memory.
Fix a memory exhaustion in the case where devctl was disabled, but the
link was bouncing. The check to queue was in the wrong place.
Implement a new sysctl hw.bus.devctl_queue to control the depth. Make
compatibility hacks for hw.bus.devctl_disable to ease transition.
Reviewed by: emaste@
Approved by: re@ (kib)
MFC after: asap
was previously dependent on), LRO gets turned off when bridging but
its been found that header split is still a performance win in that case.
Secondly, there was some interface specific control in stats code that
has been missing, and a logic error that resulted in bogus reporting.
Thanks to Manish and John of LineRateSystems for the report and help in
this code.
Approved by: re
Make sure rx descriptor ring align on 16 bytes. I guess the
alignment requirement could be multiple of 4 bytes but I think
using descriptor size would make intention clearer.
Previously the size of rx descriptor was not power of 2 so it
caused panic in bus_dmamem_alloc(9).
Reported by: Jeff Blank (jb000003 <> mr-happy dot com)
Approved by: re (kib)
fix a TX issue on big endian machines like powerpc or sparc64. Now
zyd(4) should work on all architectures.
Obtained from: OpenBSD
Approved by: re (kib)
- Improve pmap_change_attr() on i386 so that it is able to demote a large
(2/4MB) page into 4KB pages as needed. This should be fairly rare in
practice.
- Simplify pmap_change_attr() a bit:
- Always calculate the cache bits instead of doing it on-demand.
- Always set changed to TRUE rather than only doing it if it is false.
Approved by: re (kib)
In case an upper layer protocol tries to send a packet but the
L2 code does not have the ethernet address for the destination
within the broadcast domain in the table, we remember the
original mbuf in `la_hold' in arpresolve() and send out a
different packet with an arp request.
In case there will be more upper layer packets to send we will
free an earlier one held in `la_hold' and queue the new one.
Once we get a packet in, with which we can perfect our arp table
entry we send out the original 'on hold' packet, should there
be any.
Rather than continuing to process the packet that we received,
we returned without freeing the packet that came in, which
basically means that we leaked an mbuf for every arp request
we sent.
Rather than freeing the received packet and returning, continue
to process the incoming arp packet as well.
This should (a) improve some setups, also proxy-arp, in case it was an
incoming arp request and (b) resembles the behaviour FreeBSD had
from day 1, which alignes with RFC826 "Packet reception" (merge case).
Rename 'm0' to 'hold' to make the code more understandable as
well as diffable to earlier versions more easily.
Handle the link-layer entry 'la' lock comepletely in the block
where needed and release it as early as possible, rather than
holding it longer, down to the end of the function.
Found by: pointyhat, ns1
Bug hunting session with: erwin, simon, rwatson
Tested by: simon on cluster machines
Reviewed by: ratson, kmacy, julian
Approved by: re (kib)
Make sure FreeBSD binaries without .note.ABI-tag section work
correctly and do not match a colliding Debian GNU/kFreeBSD
brandinfo statements.
For this mark the Debian GNU/kFreeBSD brandinfo that it must have
an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
when comparing a possibly colliding set of options.
Due to SYSINIT we add the brandinfo in a non-deterministic order,
so native FreeBSD is not always first. We may want to consider
to force native FreeBSD to come first as well.
The only way a problem could currently be noticed is when running an
i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
brandinfo was matched first, as the fallback to ld-elf32.so.1 does
not exist in that case.
Reported and tested by: ticso
In collaboration with: kib
MFC after: 3 days
Approved by: re (rwatson)
Fix the conformance of poll(2) for sockets after r195423 by
returning POLLHUP instead of POLLIN for several cases. Now, the
tools/regression/poll results for FreeBSD are closer to that of the
Solaris and Linux.
Also, improve the POSIX conformance by explicitely clearing POLLOUT
when POLLHUP is reported in pollscan(), making the fix global.
Submitted by: bde
Reviewed by: rwatson
MFC r196556
Fix poll() on half-closed sockets, while retaining POLLHUP for fifos.
This reverts part of r196460, so that sockets only return POLLHUP if both
directions are closed/error. Fifos get POLLHUP by closing the unused
direction immediately after creating the sockets.
The tools/regression/poll/*poll.c tests now pass except for two other
things:
- if POLLHUP is returned, POLLIN is always returned as well instead of
only when there is data left in the buffer to be read
- fifo old/new reader distinction does not work the way POSIX specs it
Reviewed by: kib, bde
MFC r196554
Add some tests for poll(2)/shutdown(2) interaction.
Approved by: re (kensmith)
Swap the start/end virtual addresses in pmap_invalidate_cache_range().
This fixes the functionality on non SelfSnoop hardware.
Found by: rnoland
Submitted by: alc
Reviewed by: kib
Approved by: re (rwatson)
Mark the fake pages constructed by the OBJT_SG pager valid. This was
accidentally lost at one point during the PAT development. Without this
fix vm_pager_get_pages() was zeroing each of the pages.
Approved by: re (kib)
ATA_FLUSHCACHE is a 28bit format command, not 48.
MFC r196658:
Improve camcontrol ATA support:
- Tune protocol version reporting,
- Add supported DMA/PIO modes reporting.
- Fix IDENTIFY request for ATAPI devices.
- Remove confusing "-" for NCQ status.
MFC r196659:
Short ATA command format has 28bit address, not 36bit.
Rename ata_36bit_cmd() into ata_28bit_cmd(), while it didn't become legacy.
Approved by: re (ATA-CAM blanket)
causing a panic if it is killed due to a unsolved stack overflow
seen very late during shutdown on sparc64 when the gmirror worker
process exists, which is a regression introduced in 8.0.
Reviewed by: kib
Approved by: re (rwatson)
Fix a LOR between allprison_lock and vnode locks by releasing
allprison_lock before releasing a prison's root vnode.
PR: kern/138004
Reviewed by: kib
Approved by: re (rwatson), bz (mentor)
Fix a few panics in linuxulator + VIMAGE due to curvnet not being set.
This change affects only options VIMAGE builds.
Reviewed by: julian
Approved by: re (rwatson)
Introduce a separate sx lock for protecting lists of vnet sysinit
and sysuninit handlers.
Previously, sx_vnet, which is a lock designated for protecting
the vnet list, was (ab)used for protecting vnet sysinit / sysuninit
handler lists as well. Holding exclusively the sx_vnet lock while
invoking sysinit and / or sysuninit handlers turned out to be
problematic, since some of the handlers may attempt to wake up
another thread and wait for it to walk over the vnet list, hence
acquire a shared lock on sx_vnet, which in turn leads to a deadlock.
Protecting vnet sysinit / sysuninit lists with a separate lock
mitigates this issue, which was first observed with
flowtable_flush() / flowtable_cleaner() in sys/net/flowtable.c.
Reviewed by: rwatson, jhb
MFC after: 3 days
Approved by: re (rwatson)
Prefix on-link verification is being performed on statically
configured prefixes. Since these statically configured prefixes
do not have any associated advertising routers, these prefixes
are treated as unreachable and those prefix routes are deleted
from the routing table. Therefore bypass prefixes that are not
learned from router advertisements during prefix on-link check.
Reviewed by: hrs
Approved by: re
In ip_output(), the flow-table module must not try to cache L2/L3
information for interface of IFF_POINTOPOINT or IFF_LOOPBACK type.
Since the L2 information (rt_lle) is invalid for these interface
types, accidental caching attempt will trigger panic when the invalid
rt_lle reference is accessed.
When installing a new route, or when updating an existing route, the
user supplied gateway address may be an interface address (this is
particularly true for point-to-point interface related modules such
as ppp, if_tun, if_gif). Currently the routing command handler always
set the RTF_GATEWAY flag if the gateway address is given as part of the
command paramters. Therefore the gateway address must be verified against
interface addresses or else the route would be treated as an indirect
route, thus making that route unusable.
Reviewed by: kmacy, julian, rwatson
Approved by: re
Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.
Reviewed by: kmacy, rwatson
Approved by: re
When multiple interfaces exist in the system, with each interface having
an IPv6 address assigned to it, and if an incoming packet received on
one interface has a packet destination address that belongs to another
interface, the routing table is consulted to determine how to reach this
packet destination. Since the packet destination is an interface address,
the route table will return a host route with the loopback interface as
rt_ifp. The input code must recognize this fact, instead of using the
loopback interface, the input code performs a search to find the right
interface that owns the given IPv6 address.
Reviewed by: bz, gnn, kmacy
Approved by: re
It is possible for all the kthreads to exit (hci modules unloaded) which in
turn ends our usb process. This means the proc pointer becomes invalid and will
panic if a new kthread is added. Count the number of threads and clear the proc
pointer on the last one.
Approved by: re (kib)
Add IFNET_HOLD reserved pointer value for the ifindex ifnet array,
which allows an index to be reserved for an ifnet without making
the ifnet available for management operations. Use this in if_alloc()
while the ifnet lock is released between initial index allocation and
completion of ifnet initialization.
Add ifindex_free() to centralize the implementation of releasing an
ifindex value. Use in if_free() and if_vmove(), as well as when
releasing a held index in if_alloc().
Reviewed by: bz
Approved by: re (kib)
Break out allocation of new ifindex values from if_alloc() and if_vmove(),
and centralize in a single function ifindex_alloc(). Assert the
IFNET_WLOCK, and add missing IFNET_WLOCK in if_alloc(). This does not
close all known races in this code.
Reviewed by: bz
Approved by: re (kib)
Use locks specific to the lltable code, rather than borrow the ifnet
list/index locks, to protect link layer address tables. This avoids
lock order issues during interface teardown, but maintains the bug that
sysctl copy routines may be called while a non-sleepable lock is held.
Reviewed by: bz, kmacy, qingli
Approved by: re (kib)
Make if_grow static -- it's not used outside of if.c, and with the
internals destined to change, it's better if it remains that way.
Approved by: re (kib)
Fix argument ordering to memcpy as well as the size of the copy in the
(theoretical) case that pfi_buffer_cnt should be greater than ~_max.
Submitted by: pjd
Reviewed by: {krw,sthen,markus}@openbsd.org
Approved by: re (kib)
Rather than using IFNET_RLOCK() when iterating over (and modifying) the
ifnet list during if_ef load, directly acquire the ifnet_sxlock
exclusively. That way when if_alloc() recurses the lock, it's a write
recursion rather than a read->write recursion.
This code structure is arguably a bug, so add a comment indicating that
this is the case. Post-8.0, we should fix this, but this commit
resolves panic-on-load for if_ef.
Discussed with: bz, julian
Reported by: phk
Approved by: re (kib)
Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:
Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stablize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.
Reviewed by: bz, julian
Approved by: re (kib)
When moving ifnets from one vnet to another, and the ifnet
has ifaddresses of AF_LINK type which thus have an embedded
if_index "backpointer", we must update that if_index backpointer
to reflect the new if_index that our ifnet just got assigned.
This change affects only options VIMAGE builds.
Submitted by: bz
Reviewed by: bz
Approved by: re (rwatson), julian (mentor)
Approved by: re (rwatson)
When "jail -c vnet" request fails, the current code actually creates and
leaves behind an orphaned vnet. This change ensures that such vnets get
released.
This change affects only options VIMAGE builds.
Submitted by: jamie
Discussed with: bz
Approved by: re (rwatson), julian (mentor)
Approved by: re (rwatson)
Introduce a div_destroy() function which takes over per-vnet cleanup tasks
from the existing modevent / MOD_UNLOAD handler, and register div_destroy()
in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds. In
nooptions VIMAGE builds, div_destroy() will be invoked from the modevent
handler, resulting in effectively identical operation as it was prior this
change. div_destroy() also tears down hashtables used by ipdivert, which
were previously left behind on ipdivert kldunloads.
For options VIMAGE builds only, temporarily disable kldunloading of ipdivert,
because without introducing additional locking logic it is impossible to
atomically check whether all ipdivert instances in all vnets are idle, and
proceed with cleanup without opening a race window for a vnet to open an
ipdivert socket while ipdivert tear-down is in progress.
While here, staticize div_init(), because it is not used outside of
ip_divert.c.
In cooperation with: julian
Approved by: re (rwatson), julian (mentor)
Approved by: re (rwatson)
When registering a protocol to an existing protocol domain via
pf_proto_register(), iterate over all existing vnets to call protosw_init()
and thus the appropriate .pr_init() handler in the context of each vnet.
NB in the future we probably want to separate pr_init() handlers into
two, i.e. per-vnet and global, functions.
This change has no impact on nooptions VIMAGE builds.
Approved by: re (rwatson), julian (mentor)
Approved by: re (rwatson)
Don't try to power down PHY when alc(4) failed to map the device.
This fixes system crash when mapping alc(4) device failed in device
attach.
Reported by: Jim < stapleton.41 <> gmail DOT com >
Approved by: re (kib)
Add RTL8168DP/RTL8111DP device id. While I'm here append "8111D" to
the description of RTL8168D as RL_HWREV_8168D can be either
RTL8168D or RTL8111D.
PR: kern/137672
Approved by: re (kib)
Fix handling of .note.ABI-tag section for GNU systems [1].
Handle GNU/Linux according to LSB Core Specification 4.0,
Chapter 11. Object Format, 11.8. ABI note tag.
Also check the first word of desc, not only name, according to
glibc abi-tags specification to distinguish between Linux and
kFreeBSD.
Add explicit handling for Debian GNU/kFreeBSD, which runs
on our kernels as well [2].
In {amd64,i386}/trap.c, when checking osrel of the current process,
also check the ABI to not change the signal behaviour for Linux
binary processes, now that we save an osrel version for all three
from the lists above in struct proc [2].
These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
and Linux binaries on the same machine again for at least i386 and
amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).
PR: kern/135468
Submitted by: dchagin [1] (initial patch)
Suggested by: kib [2]
Tested by: Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by: kib
Approved by: re (kensmith)
Tweak the way that the ACPI and ISA bus drivers match hint devices to
BIOS-enumerated devices:
- Assume a device is a match if the memory and I/O ports match even if the
IRQ or DRQ is wrong or missing. Some BIOSes don't include an IRQ for
the atrtc device for example.
- Add a hack to better match floppy controller devices. Many BIOSes do not
include the starting port of the floppy controller listed in the hints
(0x3f0) in the resources for the device. So far, however, all the BIOS
variations encountered do include the 'port + 2' resource (0x3f2), so
adjust the matching for "fdc" devices to look for 'port + 2'.
Approved by: re (kib)
The svnversion string is only relevant when newvers.sh is called
during the kernel build process, the other places that call the
script do not make use of that information. So restrict execution
of the svnversion-related code to the kernel build context.
Approved by: re (kib)
Fix ipfw's initialization functions to get the correct order of evaluation
to allow vnet and non vnet operation. Move some functions from ip_fw_pfil.c
to ip_fw2.c and mode to mostly using the SYSINIT and VNET_SYSINIT handlers
instead of the modevent handler. Correct some spelling errors in comments
in the affected code. Note this bug fixes a crash in NON VIMAGE kernels when
ipfw is unloaded.
This patch is a minimal patch for 8.0
I have a much larger patch that actually fixes the underlying problems
that will be applied after 8.0
Reviewed by: zec@, rwatson@, bz@(earlier version)
Approved by: re (rwatson)
Don't allow access to the internals until it has all been set up.
Specifically, not until the per-vnet parts have been set up.
Submitted by: kmacy@
Reviewed by: julian@, zec@
Approved by: re(rwatson)
This patch fixes two bugs in sglist(9) and improves robustness of the API via
better semantics if a request to append an address range to an existing list
fails.
- When cloning an sglist, properly set the length in the new sglist instead of
leaving the new list empty.
- Properly compute the amount of data added to an sglist via
_sglist_append_buf(). This allows sglist_consume_uio() to properly update
uio_resid.
- When a request to append an address range to a scatter/gather list fails,
restore the sglist to the state it had at the start of the function call
instead of resetting it to an empty list.
Approved by: re (kib)
Fix a boot hang for hptrr(4) caused by changes introduced in r195534.
It is necessary to make sure cpi->transport is set for xpt_scan_bus() to
work properly.
Submitted by: Bernhard Schmidt (scb+freebsd-current <at> techwires
<dot> net)
Reviewed by: scottl
Approved by: re (kib)
Check whether the SMBIOS reports reasonable amount of memory. If it is
less than "avail memory", fall back to Maxmem to avoid user confusion.
We use SMBIOS information to display "real memory" since r190599 but
some broken SMBIOS implementation reported only half of actual memory.
Tested by: bz
Approved by: re (kib)
Rather than fix questionable ifnet list locking in the implementation of
the kern.polling.enable sysctl, remove the sysctl. It has been deprecated
since FreeBSD 6 in favour of per-ifnet polling flags.
Reviewed by: luigi
Approved by: re (kib)
Change the 'resid' parameter to sglist_consume_uio() from an int to a
size_t to match the recent type change of the uio_resid member of struct
uio.
Approved by: re (kib)
Fix CARP memory leaks on carp_if's malloc'd using M_CARP. This occurs when
CARP tries to free them using M_IFADDR after the last address for a virtual
host is removed and when detaching from the parent interface.
Approved by: re (kib), ken (mentor)
Our libc doesn't implement control method for XDR (only kernel does) and it
will always return failure. Fix this by bringing userland implementation of
xdrmem_control() back. This allow 'zpool import' to work again.
Reported by: Thomas Backman <serenity@exscape.org>
Reviewed by: kmacy
Approved by: re (kib)
moving a frequently executed flowtable syslog statement from being
conditional on bootverbose to conditional on a per-vnet flowtable
sysctl.
Approved by: re@
Temporarily enhance em(4) and igb(4) hack to take account for IFF_NOARP.
Without this changeset there will be no way to prevent these NICs from
sending ARP, which is harmful in server farms that is configured as
"Direct Server Return" behind a load balancer.
A better fix would remove the whole hack completely but it would be
later than 8.0-RELEASE.
Reviewed by: jfv, yongari
Approved by: re (kib)
Fix USB cache sync operations for platforms with non-coherent DMA.
- usb_pc_cpu_invalidate() is called between [consecutive] reads from a device,
so a sequence of BUS_DMASYNC_POSTREAD and _PREREAD should be used. Note we
cannot use or'ed shorthand ( _POSTREAD | _PREREAD) for BUS_DMASYNC flags, as
the low level bus dma sync operation is implementation dependent and we
cannot assume the required order of operations to be guaranteed.
- usb_pc_cpu_flush() is called before writing to a device, so
BUS_DMASYNC_PREWRITE should be used.
Submitted by: Grzegorz Bernacki
Reviewed by: HPS, arm@, usb@ ML
Tested by: HPS, Mike Tancsa
Approved by: re (kib)
Obtained from: Semihalf
Small changes to the warning message generated by pty(4):
- Only print the warning once, instead of filling up the screen.
- Use the word "legacy" for the pty_warningcnt description, to prevent
confusion.
- Use log() instead of printf().
Discussed with: rwatson, jhb
Approved by: re (kib)
If we cannot immediately get the pf_consistency_lock in the purge thread,
restart the scan after acquiring the lock the hard way. Otherwise we
might end up with a dead reference.
Approved by: re (kib)
Do not try to reevaluate current RX production index on each
loop iteration as it can be updated by the card while we
process the RX ring forcing us to process RX descriptors
for which DMA synchronisation operation has not been
performed. This fixes the bug when bge(4) drops packets
under high load.
Discussed with: yongari, marius
Approved by: re (kib)
- change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
- check that a cached flow corresponds to the same fib index as the
packet for which we are doing the lookup
- at interface detach time flush any flows referencing stale rtentrys
associated with the interface that is going away (fixes reported
panics)
- reduce the time between cleans in case the cleaner is running at
the time the eventhandler is called and the wakeup is missed less
time will elapse before the eventhandler returns
- separate per-vnet initialization from global initialization
(pointed out by jeli@)
Reviewed by: sam@
Approved by: re@
Backout r193289. r193289 restored page select bits to previous
value instead of blindly resetting it to 0. However, it seems page
select bits of some 88E1116 PHY is initialized to invalid one such
that restoring page select bits after programming broke MII
register access. The correct solution would be reset page select
bits to 0 in PHY attach stage but it would require more testing.
Since we're in BETA stage such a change would be dangerous so just
back it out.
This change should fix nfe(4) breakage on NVIDIA MCP55.
Reported by: Ryan Rogers < webmaster <> doghouserepair dot com >
Sam Fourman Jr. < sfourman <> gmail dot com >
Tested by: Ryan Rogers < webmaster <> doghouserepair dot com >
Sam Fourman Jr. < sfourman <> gmail dot com >
Approved by: re (kib)
for ATA_SETFEATURES/ATA_SF_SETXFER command which by definition transfers no
data. Most of controllers are irrelevant to this bug, but some nVidia's
doesn't.
Tested on: current@
Approved by: re (kib)
Apply the same patch as r196205 for nfs_upgrade_lock() and
nfs_downgrade_lock() to the experimental nfs client.
Approved by: re (kensmith), kib (mentor)
* Change the scope of the ASSERT_ATOMIC_LOAD() from a generic check to
a pointer-fetching specific operation check. Consequently, rename the
operation ASSERT_ATOMIC_LOAD_PTR().
* Fix the implementation of ASSERT_ATOMIC_LOAD_PTR() by checking
directly alignment on the word boundry, for all the given specific
architectures. That's a bit too strict for some common case, but it
assures safety.
* Add a comment explaining the scope of the macro
* Add a new stub in the lockmgr specific implementation
Tested by: marcel (initial version), marius
Reviewed by: rwatson, jhb (comment specific review)
Approved by: re (kib)
The start of the EFI GPT partition in the PMBR can always be represented
by CHS addressing. Don't define these fields as 0xff, but rather define
them correctly. This prevents boot problems on PCs where GPT is being
used.
PR: 115406
Submitted by: Kent Hauser <kent@khauser.net>
Approved by: re (kib)
Fix parse() so that the partition to boot (load /boot/loader) from can
be set. The syntax as printed in main() is used: 0:ad(0p3)/boot/loader
Reviewed by: jhb
Approved by: re (kib)
getcwd() (when __getcwd() fails) works by stating current directory, going up
(..), calling readdir and looking for previous directory inode. In case of
.zfs/ directory this doesn't work, because .zfs/ is hidden by default, so it
won't be visible in readdir output.
Fix this by implementing VPTOCNP for snapshot directories, so __getcwd()
doesn't fail and getcwd() doesn't have to use readdir method.
This fixes /bin/pwd from within .zfs/snapshot/<name>/.
Suggested by: kib
Approved by: re (rwatson)
Fix panic in zfs recv code. The last vnode (mountpoint's vnode) can have
0 usecount.
Reported by: Thomas Backman <serenity@exscape.org>
Approved by: re (kib)
Remove OpenSolaris taskq port (it performs very poorly in our kernel) and
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.
Approved by: re (kib)
Because taskqueue_run() can drop tq_mutex, we need to check if the
TQ_FLAGS_ACTIVE flag wasn't removed in the meantime, which means we missed a
wakeup.
Approved by: re (kib)
- Fix a race where /dev/zfs control device is created before ZFS is fully
initialized. Also destroy /dev/zfs before doing other deinitializations.
- Initialization through taskq is no longer needed and there is a race
where one of the zpool/zfs command loads zfs.ko and tries to do some work
immediately, but /dev/zfs is not there yet.
Reported by: pav
Approved by: re (kib)
Change the usb workers from kernel processes to threads, this is mostly a
cosmetic change to reduce cruft in the proc table.
Also change the idle wait message to `-` like how taskqueues are.
Reviewed by: julian
Approved by: re (kib)
* Fix a bug where PR-SCTP settings are ignore when using implicit
association setup.
* Fix a bug where message with illegal stream ids are not deleted.
* Fix a crash when reporting back unsent messages from the send_queue.
* Fix a bug related to INIT retransmission when the socket is already
closed.
* Fix a bug where associations were stalled when partial delivery API
was enabled.
* Fix a bug where the receive buffer size was smaller than the
partial_delivery_point.
Approved by: re, rrs (mentor)
Proprely intialize UART parameters at probe stage, so uart(4)
will initialize the FIFO memory correctly on attach. Before
that this values was intialized in only in at91_usart_bus_attach
which is called after the uart(4) memory allocation happens.
Approved by: re (kib)
MFC after: 1 week
In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.
Reviewed by: kmacy, rwatson
Approved by: re
Appease VNET_DEBUG - in if_vmove we temporarily switch i.e.
recurse from one vnet to another which is OK, so no need
to flood the console with warnings here.
Approved by: re (rwatson), julian (mentor)
Approved by: re (rwatson)
SCTP is not yet compatible with options VIMAGE kernels although it compiles
with VIMAGE defined, so explicitly disallow building such kernels.
Reviewed by: rrs
Approved by: re (rwatson), julian (mentor)
Approved by: re (rwatson)
Make VNET_DEBUG a standalone compile-time option, i.e. decouple it from
INVARIANTS.
Reviewed by: bz
Approved by: re (rwatson), julian (mentor)
Approved by: re (rwatson)
Add a new macro to test that a variable could be loaded atomically.
Check that the given variable is at most uintptr_t in size and that
it is aligned.
Note: ASSERT_ATOMIC_LOAD() uses ALIGN() to check for adequate
alignment -- however, the function of ALIGN() is to guarantee
alignment, and therefore may lead to stronger alignment
enforcement than necessary for types that are smaller than
sizeof(uintptr_t).
Add checks to mtx, rw and sx locks init functions to detect possible
breakage. This was used during debugging of the problem fixed with
r196118 where a pointer was on an un-aligned address in the dpcpu area.
In collaboration with: rwatson
Reviewed by: rwatson
Approved by: re (kib)
- Provide lapic_disable_pmc(), lapic_enable_pmc(), and lapic_reenable_pmc()
routines in the local APIC code that the hwpmc(4) driver can use to
manage the local APIC PMC interrupt vector.
- Do not enable the local APIC PMC interrupt vector by default when
HWPMC_HOOKS is enabled. Instead, the hwpmc(4) driver explicitly
enables the interrupt when it is succesfully initialized and disables
the interrupt when it is unloaded. This avoids enabling the interrupt
on unsupported CPUs which may result in spurious NMIs.
Reported by: rnoland
Reviewed by: jkoshy
Approved by: re (kib)
MFC after: 2 weeks
Take the number of allocated freeblks into consideration for
softdep_slowdown(), to prevent kernel memory exhaustioni on
mass-truncation.
Approved by: re (rwatson)