flags that has several bits cleared. The RF_WANTED and RF_FIRSTSHARE
bits are invalid in this context, and we want to defer setting RF_ACTIVE
in r_flags until later. This should make rman_get_flags() return
the correct answer in all cases.
Add a KASSERT() to catch callers which incorrectly pass the RF_WANTED
or RF_FIRSTSHARE flags.
Do a strict equality check on the share type bits of flags. In
particular, do an equality check on RF_PREFETCHABLE. The previous
code would allow one type of mismatch of RF_PREFETCHABLE but disallow
the other type of mismatch. Also, ignore the the RF_ALIGNMENT_MASK
bits since alignment validity should be handled by the amask check.
This field contains an integer value, but previous code did a strange
bitwise comparison on it.
Leave the original value of flags unmolested as a minor debug aid.
Change the start+amask overflow check to a KASSERT() since it is just
meant to catch a highly unlikely programming error in the caller.
Reviewed by: jhb
MFC after: 1 month
taskqueue worker thread(s) to.
For now it isn't a taskqueue/taskthread error to fail to pin
to the given cpuid.
Thanks to rpaulo@, kib@ and jhb@ for feedback.
Tested:
* igb(4), with local RSS patches to pin taskqueues.
TODO:
* ask the doc team for help in documenting the new API call.
* add a taskqueue_start_threads_cpuset() method which takes
a cpuset_t - but this may require a bunch of surgery to
bring cpuset_t into scope.
Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop(). This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.
Reviewed by: kib
Approved by: hrs (mentor)
on execve(2), it calls vmspace_exec(), which frees the current
vmspace. The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process. The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.
If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec. Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.
To fix, postpone the free of old vmspace until the threads are resumed
and exited. To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.
http://bugs.debian.org/743141
Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Replace with the existing loop termination test with a similar
condition from the nested "if" that may terminate the loop a bit
sooner, but still not too early. This condition can then be removed
from the nested "if". Relocate an operator to be style(9) compliant.
MFC after: 3 days
If the underlying protocol reported an error (e.g. because a connection was
closed while waiting in the queue), this error was also indicated by
returning a zero-length address. For all other kinds of errors (e.g.
[EAGAIN], [ENFILE], [EMFILE]), *addrlen is unmodified and there are
successful cases where a zero-length address is returned (e.g. a connection
from an unbound Unix-domain socket), so this error indication is not
reliable.
As reported in Austin Group bug #836, modifying *addrlen on error may cause
subtle bugs if applications retry the call without resetting *addrlen.
running at, guess the nearest value instead of looking for a value within
25 MHz of the observed frequency.
Prior to this change, if a system booted with Intel Turbo Boost enabled,
the dev.cpu.0.freq sysctl is nonfunctional, since the ACPI-reported
frequency for Turbo Boost states does not match the actual clock frequency
(and thus no levels are within 25 MHz of the observed frequency) and the
current performance level is read before a new level is set.
MFC after: 3 days
Relnotes: Bug fix in power management on CPUs with Intel Turbo Boost
Under enough load, the swi's can actually be preempted and migrated
to other currently free cores. When doing RSS experiments, this lead
to the per-CPU TCP timers not lining up any more with the RX CPU said
flows were ending up on, leading to increased lock contention.
Since there was a little pushback on flipping them on by default,
I've left the default at "don't pin."
The other less obvious problem here is that the default swi
is also the same as the destination swi for CPU #0. So if one
pins the swi on CPU #0, there's no default floating swi.
A nice future project would be to create a separate swi for
the "default" floating swi, as well as per-CPU swis that are
(optionally) pinned.
Tested:
* parallel TCP tests (2 x 1g unfortunately for now);
CPU: Intel(R) Xeon(R) CPU E5-2650
Note:
This is based on some initial investigation into RSS/TCP stack lock
contention on FreeBSD-HEAD whilst at Netflix in January 2014.
rman_reserve_resource_bound() to return incorrect results.
Continue the initial search until the first viable region is found.
Add a comment to explain the search termination test.
PR: kern/188534
Reviewed by: jhb (previous version)
MFC after: 1 week
for lockmgr and sx interlocks, but unused since optimised versions of
those sleep locks were introduced. This will save a (quite) small
amount of memory in all kernel configurations. The sleep mutex pool is
retained as it is used for 'struct bio' and several other consumers.
Discussed with: jhb
MFC after: 3 days
It can fail if pipe map is exhausted (as a result of too many pipes created),
but it is not fatal and could be provoked by unprivileged users. The only
consequence is worse performance with given pipe.
Reported by: ivoras
Suggested by: kib
MFC after: 1 week
If time_t is 64-bit (i.e. isn't 32-bit) allow any value of year, not
just years less than 2038.
Don't bother fixing the underflow in the case of years before 1903.
MFC after: 1 week
Sponsored by: DARPA, AFRL
pmap pvlist locks which are scaled by MAXCPU.
This allows an amd64 system to boot with MAXCPU set
to 256, which is currently FreeBSD's hard limit without
x2apic support.
Compile-tested for other arch's.
PR: 185831
Discussed with: jhb
MFC after: 3 weeks
old devd's and thus hosts that get IP addresses from DHCP was too much
of a POLA violation.
The sysctl may be removed again after r263758 has been merged to at
least stable/9 and stable/10, and releases have been cut from those
branches.
Discussed with: mjg
Reported by: theraven, rwatson
the cpufreq code. Replace its use with smp_started. There's at least
one userland tool that still looks at the kern.smp.active sysctl, so
preserve it but point it to smp_started as well.
Discussed with: peter, jhb
MFC after: 3 days
Obtained from: Netflix
Due to the way those timers are implemented, we can't handle very short
intervals. In addition to that mentioned patch caused math overflows
for short intervals. To avoid that round those intervals to 1 tick.
PR: kern/187668
MFC after: 1 week
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now use this interface.
Requested by: bapt, jmg, Ryan Lortie (implictly)
Reviewed by: mav, bde
execution to a emumation program via parsing of ELF header information.
With this kernel module and userland tool, poudriere is able to build
ports packages via the QEMU userland tools (or another emulator program)
in a different architecture chroot, e.g. TARGET=mips TARGET_ARCH=mips
I'm not connecting this to GENERIC for obvious reasons, but this should
allow the kernel module to be built by default and enable the building
of the userland tool (which automatically loads the kernel module).
Submitted by: sson@
Reviewed by: jhb@
Right now, init(8) cannot distinguish between an ACPI power button press
or a Ctrl+Alt+Del sequence on the keyboard. This is because
shutdown_nice() sends SIGINT to init(8) unconditionally, but later
modifies the arguments to reboot(2) to force a certain behaviour.
Instead of doing this, patch up the code to just forward the appropriate
signal to userspace. SIGUSR1 and SIGUSR2 can already be used to halt the
system.
While there, move waittime to the function where it's used; kern_reboot().
kqueue(2) already supports EVFILT_PROC. Add an EVFILT_PROCDESC that
behaves the same, but operates on a procdesc(4) instead. Only implement
NOTE_EXIT for now. The nice thing about NOTE_EXIT is that it also
returns the exit status of the process, meaning that we can now obtain
this value, even if pdwait4(2) is still unimplemented.
Notes:
- Simply reuse EVFILT_NETDEV for EVFILT_PROCDESC. As both of these will
be used on totally different descriptor types, this should not clash.
- Let procdesc_kqops_event() reuse the same structure as filt_proc().
The only difference is that procdesc_kqops_event() should also be able
to deal with the case where the process was already terminated after
registration. Simply test this when hint == 0.
- Fix some style(9) issues in filt_proc() to keep it consistent with the
newly added procdesc_kqops_event().
- Save the exit status of the process in pd->pd_xstat, as we cannot pick
up the proctree_lock from within procdesc_kqops_event().
Discussed on: arch@
Reviewed by: kib@
According to <sys/proc.h>, this field needs to be locked with either the
p_mtx or the p_slock. In this case the damage was quite small. Instead
of being reaped, the process would just be reparented to init, so it
could be reaped from there.
kqueue_scan() unlocking the kqueue to call f_event, knote() or
knote_fork() should not skip the knote. The knote is not going to
disappear during the influx time, and the mutual exclusion between
scan and knote() is ensured by both code pathes taking knlist lock.
The race appears since knlist lock is before kq lock, so KN_INFLUX
must be set, kq lock must be dropped and only then knlist lock can be
taken. The window between kq unlock and knlist lock causes lost
events.
Add a flag KN_SCAN to indicate that KN_INFLUX is set in a manner safe
for the knote(), and check for it to ignore KN_INFLUX in the knote*()
as needed. Also, in knote(), remove the lockless check for the
KN_INFLUX flag, which could also result in the lost notification.
Reported and tested by: Kohji Okuno <okuno.kohji@jp.panasonic.com>
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
few of them also build kern_clocksource.c. That strikes me as insane, but
maybe there's a good reason for it. Until I figure that out, un-break
the build by not referencing functions in kern_clocksource if NO_EVENTTIMERS
is defined.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
faults.
First, for accesses to direct map region should check for the limit by
which direct map is instantiated.
Second, for accesses to the kernel map, success returned from the
kernacc(9) does not guarantee that consequent attempt to read or write
to the checked address succeed, since other thread might invalidate
the address meantime. Add a new thread private flag TDP_DEVMEMIO,
which instructs vm_fault() to return error when fault happens on the
MAP_ENTRY_NOFAULT entry, instead of panicing. The trap handler would
then see a page fault from access, and recover in normal way, making
/dev/mem access safer.
Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.
Note that at least the second issue exists on other architectures, and
requires similar patching for md code.
Reported and tested by: clusteradm (gjb, sbruno)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Also, remove the expression which calculated the location of the
strings for a new image and grown over the time to be
non-comprehensible. Instead, calculate the offsets by steps, which
also makes fixing the alignments much cleaner.
Reported and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- Retire long time unused (basically always unused) sys__umtx_lock()
and sys__umtx_unlock() syscalls
- struct umtx and their supporting definitions
- UMUTEX_ERROR_CHECK flag
- Retire UMTX_OP_LOCK/UMTX_OP_UNLOCK from _umtx_op() syscall
__FreeBSD_version is not bumped yet because it is expected that further
breakages to the umtx interface will follow up in the next days.
However there will be a final bump when necessary.
Sponsored by: EMC / Isilon storage division
Reviewed by: jhb
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
change... This eliminates a cast, and also forces td_retval
(often 2 32-bit registers) to be aligned so that off_t's can be
stored there on arches with strict alignment requirements like
armeb (AVILA)... On i386, this doesn't change alignment, and on
amd64 it doesn't either, as register_t is already 64bits...
This will also prevent future breakage due to people adding additional
fields to the struct...
This gets AVILA booting a bit farther...
Reviewed by: bde
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.
Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.
Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
a single priority queue. If that queue had a thread or threads which
could not be migrated we would fail to steal load. This could cause
starvation in situations where cores are idle.
Submitted by: Doug Kilpatrick <dkilpatrick@isilon.com>
Tested by: pho
Reviewed by: mav
Sponsored by: EMC / Isilon Storage Division
perforce syntax and committed some unrelated files. Only devd files
should've been committed.
Reported by: imp
Pointy hat to: asomers
MFC after: 3 weeks
X-MFC-With: r262914
sbin/devd/devd.cc
Add a -q flag to devd that will suppress syslog logging at
LOG_NOTICE or below.
Requested by: ian@ and imp@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
buffers drop packets". It was caused by a check for the space available
in a sockbuf, but it was checking the wrong sockbuf.
sys/sys/sockbuf.h
sys/kern/uipc_sockbuf.c
Add sbappendaddr_nospacecheck_locked(), which is just like
sbappendaddr_locked but doesn't validate the receiving socket's
space. Factor out common code into sbappendaddr_locked_internal().
We shouldn't simply make sbappendaddr_locked check the space and
then call sbappendaddr_nospacecheck_locked, because that would cause
the O(n) function m_length to be called twice.
sys/kern/uipc_usrreq.c
Use sbappendaddr_nospacecheck_locked for SOCK_SEQPACKET sockets,
because the receiving sockbuf's size limit is irrelevant.
tests/sys/kern/unix_seqpacket_test.c
Now that 185813 is fixed, pipe_128k_8k fails intermittently due to
185812. Make it fail every time by adding a usleep after starting
the writer thread and before starting the reader thread in
test_pipe. That gives the writer time to fill up its send buffer.
Also, clear the expected failure message due to 185813. It actually
said "185812", but that was a typo.
PR: kern/185813
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
to use-after-free.
fdescfree proceeds to free file pointers once fd_refcnt reaches 0, but
kern_proc_{o,}filedesc_out only checked for hold count.
MFC after: 3 days
fdgrowtable() now only reallocates fd_map when necessary.
This fixes fdgrowtable() to use the same logic as fdescfree() for
when to free the fd_map. The logic in fdescfree() is intended to
not free the initial static allocation, however the fd_map grows
at a slower rate than the table does. The table is intended to hold
20 fd, but its initial map has many more slots than 20. The slot
sizing causes NDSLOTS(20) through NDSLOTS(63) to be 1 which matches
NDSLOTS(20), so fdescfree() was assuming that the fd_map was still
the initial allocation and not freeing it.
This partially reverts r244510 by reintroducing some of the logic
it removed in fdgrowtable().
Reviewed by: mjg
Approved by: bapt (mentor)
MFC after: 2 weeks
routine, now a platform can provide a pointer to an early_putc() routine
which is used instead of cn_putc(). Control can be handed off from early
printf support to standard console support by NULLing out the pointer
during standard console init.
This leverages all the existing error reporting that uses printf calls,
such as panic() which can now be usefully employed even in early
platform init code (useful at least to those who maintain that code and
build kernels with EARLY_PRINTF defined).
Reviewed by: imp, eadler
in the sysctl_root().
Note: SYSCTL_VNET_* macros can be removed as well. All is
needed to virtualize a sysctl oid is set CTLFLAG_VNET on it.
But for now keep macros in place to avoid large code churn.
Sponsored by: Nginx, Inc.
r261266:
Add a jail parameter, allow.kmem, which lets jailed processes access
/dev/kmem and related devices (i.e. grants PRIV_IO and PRIV_KMEM_WRITE).
This in conjunction with changing the drm driver's permission check from
PRIV_DRIVER to PRIV_KMEM_WRITE will allow a jailed Xorg server.
/dev/kmem and related devices (i.e. grants PRIV_IO and PRIV_KMEM_WRITE).
This in conjunction with changing the drm driver's permission check from
PRIV_DRIVER to PRIV_KMEM_WRITE will allow a jailed Xorg server.
Submitted by: netchild
MFC after: 1 week
It's common for multi-threaded processes to create a thread for
the purpose of synchronously processing signals. Allow such processes to
utilize a capabilities sandbox.
Discussed with: rwatson, pjd
MFC after: 2 weeks