Turn on the CTL disable tunable by default.
This will allow GENERIC configurations to boot on small memory boxes, but
not require end users who want to use CTL to recompile their kernel. They
can simply set kern.cam.ctl.disable=0 in loader.conf.
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.
Allocation of the callout array is changed to stardard kernel malloc
from a slightly obscure direct kernel_map allocation.
kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.
Reviewed by: davide
kern_timeout_callwheel_alloc() where it is actually used.
This is a mechanical move and no tuning parameters are changed.
The pre-allocated callout array is only used for legacy timeout(9)
calls and is only allocated and active on cpu0. Eventually all
remaining users of timeout(9) should switch to the callout_* API.
Reviewed by: davide
/usr/src/sys/modules/nvme/../../dev/nvme/nvme.c:211: warning: format '%qx' expects type 'long unsigned int', but argument 9 has type 'long long unsigned int' [-Wformat]
Fix a siftr(4) potential memory leak and INVARIANTS triggered kernel panic in
hashdestroy() by ensuring the last array index in the flow counter hash table is
flushed of entries.
MFC after: 3 days
response to an rtprio_thread() call, when the priority is different
than the old priority, and either the old or the new priority class is
not RTP_PRIO_NORMAL (timeshare).
The reasoning for the second half of the test is that if it's a change in
timeshare priority, then the scheduler is going to adjust that priority
in a way that completely wipes out the requested change anyway, so
what's the point? (If that's not true, then allowing a thread to change
its own timeshare priority would subvert the scheduler's adjustments and
let a cpu-bound thread monopolize the cpu; if allowed at all, that
should require priveleges.)
On the other hand, if either the old or new priority class is not
timeshare, then the scheduler doesn't make automatic adjustments, so we
should honor the request and make the priority change right away. The
reason the old class gets caught up in this is the very reason for this
change: when thread A changes the priority of its child thread B from
idle back to timeshare, thread B never actually gets moved to a
timeshare-range run queue unless there are some idle cycles available
to allow it to first get scheduled again as an idle thread.
Reviewed by: jhb@
using callout_reset_sbt() instead of callout_reset(). We can't remove
lower limit completely in this case because of significant processing
overhead, caused by unability to use direct callout execution due to using
process mutex in callout handler for sending SEGALRM signal. With support
of periodic events that would allow unprivileged user to abuse the system.
Reviewed by: davide
will prevent the kernel from linking if the device driver are included
without the virtio module. Remove pci and scbus for the same reason.
Also explain the relationship and necessity of the virtio and virtio_pci
modules. Currently in FreeBSD, we only support VirtIO PCI, but it could
be replaced with a different interface (like MMIO) and the device
(network, block, etc) will still function.
Requested by: luigi
Approved by: grehan (mentor)
MFC after: 3 days
This fixes a build error at the same time (unused variable "item"), if
the kernel is compiled without INVARIANTS.
Reported by: Hartmann, O. <ohartman@zedat.fu-berlin.de> (build error)
Reviewed by: Konstantin Belousov (kib@)
Without this, read data is mis-interpreted. This could trigger a panic,
as was the case on one computer where computed "recsize" was zero,
leading to a call to g_read_page() asking for 0 bytes.
The early commit is done to facilitate the off-tree work on the
porting of the Radeon driver.
Sponsored by: The FreeBSD Foundation
Debugged and tested by: dumbbell
MFC after: 1 month
When the VirtIO barrier feature is not negotiated, the driver
must enforce the proper ordering for BIO_ORDERED BIOs. All the
in-flight BIOs must complete before starting the BIO, and the
ordered BIO must complete before subsequent BIOs can start.
Also fix a few whitespace nits.
Reported by: neel
Approved by: grehan (mentor)
MFC after: 3 days
being compiled only when setting LEGACY_TX, this means you would
not get the drain when needed on detach!!
Thanks to Bryan Venteicher (bryanv@freebsd.org) for catching this
little gremlin!! :)
Fixes:
- flow control - don't override user value on re-init
- fix to make 1G optics work correctly
- change to interrupt enabling - some bits were incorrect
for certain hardware.
- certain stats fixes, remove a duplicate increment of
ierror, thanks to Scott Long for pointing these out.
- shared code link interface changed, requiring some
core code changes to accomodate this.
- add an m_adj() to ETHER_ALIGN on the recieve side, this
was requested by Mike Karels, thanks Mike.
- Multicast code corrections also thanks to Mike Karels.
the pid provider on a kernel compiled with INVARIANTS.
sys/cddl/contrib/opensolaris/uts/intel/dtrace/fasttrap_isa.c:
In fasttrap_probe_pid(), attempts to write to the
address space of the thread that fired the probe
must be performed with the process of the thread
held. Use _PHOLD() to ensure this is the case.
In fasttrap_probe_pid(), use proc_write_regs() instead
of calling set_regs() directly. proc_write_regs()
performs invariant checks to verify the calling
environment of set_regs(). PROC_LOCK()/UNLOCK() around
the call to proc_write_regs() so that it's invariants
are satisfied.
Sponsored by: Spectra Logic Corporation
Reviewed by: gnn, rpaulo
MFC after: 1 week
tunable by default.
This will allow GENERIC configurations to boot on small memory boxes, but
not require end users who want to use CTL to recompile their kernel. They
can simply set kern.cam.ctl.disable=0 in loader.conf.
The eventual solution to the memory usage problem is to change the way
CTL allocates memory to be more configurable, but this should fix things
for small memory situations in the mean time.
UPDATING: Explain the change in the CTL configuration, and
how users can enable CTL if they would like to use
it.
sys/conf/options: Add a new option, CTL_DISABLE, that prevents CTL
from initializing.
ctl.c: If CTL_DISABLE is turned on, don't initialize.
i386/conf/GENERIC,
amd64/conf/GENERIC: Re-enable device ctl, and add the CTL_DISABLE
option.
- Rewrite kevent() timeout implementation to allow sub-tick precision.
- Make the interval timings for EVFILT_TIMER more accurate. This also
removes an hack introduced in r238424.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
Fix kern_select() and sys_poll() so that they can handle sub-tick
precision for timeouts (in the same fashion it was done for nanosleep()
in r247797).
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
Specify that precision of 0.5s is enough for resource limitation.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
Specify that wakeup rate of 7.5-10Hz is enough for yarrow harvesting
thread.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
Specify that syslog doesn't need exactly 5 wakeups per second.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
kern_nanosleep() is now converted to use tsleep_sbt(). With this change
nanosleep() and usleep() can handle sub-tick precision for timeouts.
Also, try to help coalesce of events passing as argument to tsleep_bt()
a precision value calculated as a percentage of the sleep time.
This percentage is default 5%, but it can tuned according to users
need via the sysctl interface.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
- Switch syscons from timeout() to callout_reset_flags() and specify that
precision is not important there -- anything from 20 to 30Hz will be fine.
- Reduce syscons "refresh" rate to 1-2Hz when console is in graphics mode
and there is nothing to do except some polling for keyboard. Text mode
refresh would also be nice to have adaptive, but this change at least
should help laptop users who running X.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
As vm objects are type-stable there is no need to initialize the
resident splay tree pointer and the cache splay tree pointer in
_vm_object_allocate() but this could be done in the init UMA zone
handler.
The destructor UMA zone handler, will further check if the condition is
retained at every destruction and catch for bugs.
Sponsored by: EMC / Isilon storage division
Submitted by: alc
Introduce sbt variants of msleep(), msleep_spin(), pause(), tsleep() in
the KPI, allowing to specify timeout in 'sbintime_t' rather than ticks.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
Extend condvar(9) KPI introducing sbt variant of cv_timedwait. This
rely on the previously committed sleepq_set_timeout_sbt().
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
Convert sleepqueue(9) bits to the new callout KPI. Take advantage of
the possibility to run callback directly from hw interrupt context.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil
Make loadavg calculation callout direct. There are several reasons for it:
- it is very simple and doesn't worth context switch to SWI;
- since SWI is no longer used here, we can remove twelve years old hack,
excluding this SWI from from the loadavg statistics;
- it fixes problem when eventtimer (HPET) shares interrupt with some other
device, and that interrupt thread counted as permanent loadavg of 1; now
loadavg accounted before that interrupt thread is scheduled.
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, Fabian Keil, markj
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.
This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.
Together with: mav
Reviewed by: attilio, bde, luigi, phk
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo (amd64, sparc64), marius (sparc64), ian (arm),
markj (amd64), mav, Fabian Keil
pointers to the file structure receiving descriptors stopped to work when also
at least few kilobytes of data is being send. In the kernel the
soreceive_generic() function doesn't see control mbuf as the first mbuf and
unp_externalize() is never called, first 6(?) kilobytes of data is missing as
well on receiving end.
This breaks for example tmux.
I don't know yet why going from 8 bytes to sizeof(struct filedescent) per
descriptor (or even to 16 bytes per descriptor) breaks things, but to
work-around it for now use 8 bytes per file descriptor at the cost of memory
allocation.
Reported by: flo, Diane Bruce, Jan Beich <jbeich@tormail.org>
Simple testcase provided by: mjg
necessary because we already use parenthesis in zfs_cv_init().
This fixes a long standing bug where there would be an extra ")" at the
end of the string. This extra parenthesis would show up in the WCHAN of
the process (top, stty status, etc.).
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);
which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.
- Add manual pages for the new syscalls.
- Make the new syscalls available for processes in capability mode sandbox.
- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.
- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.
- Update procstat(1) to recognize the new capability rights.
- Document the new capability rights in cap_rights_limit(2).
Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des
valid if the flag OBJ_COLORED is set. Since _vm_object_allocate()
doesn't set this flag, it needn't initialize pg_color.
Sponsored by: EMC / Isilon Storage Division
devices. While at it, update the comment now that we know that MSI-X
doesn't work with ESXi 5.1 for Intel 82576 either and the underlying issue
is a bug in the MSI-X allocation code of the hypervisor.
Reported by: Harald Schmalzbauer
- Make the nomatch table const.
MFC after: 1 week
from the tree since few months (please note that the userland bits
were already disconnected since a long time, thus there is no need
to update the OLD* entries).
This is not targeted for MFC.
Rename the pv_entry_t iterator from pv_list to pv_next.
Besides being more correct technically (as the name seems to suggest
this is a list while it is an iterator), it will also be needed by
vm_radix work to avoid a nameclash on macro expansions.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc, jeff
Tested by: flo, pho, jhb, davide
interrupt shutdown handlers into filters. Shutdown_nice(9) acquires a sleep
lock, which filters shouldn't do. It also seems that kern_reboot(9) still
may require Giant to be hold.
- Correct an incorrect argument to shutdown_nice(9).
Submitted by: bde
interrupt shutdown handlers into filters. Shutdown_nice(9) acquires a sleep
lock, which filters shouldn't do. It also seems that kern_reboot(9) still
may require Giant to be hold.
Submitted by: bde
Include some flags of the nullfs mount itself:
MNT_RDONLY, MNT_NOEXEC, MNT_NOSUID, MNT_UNION, MNT_NOSYMFOLLOW.
This allows userland code calling statfs() or fstatfs() to see these flags.
In particular, this allows opendir() to detect that a -t nullfs -o union
mount needs deduplication (otherwise at least . and .. are returned twice)
and allows rtld to detect a -t nullfs -o noexec mount as noexec.
Turn off the MNT_ROOTFS flag from the underlying filesystem because the
nullfs mount is definitely not the root filesystem.
Reviewed by: kib
MFC after: 1 week
on the target directory descriptor, but only if this is renameat(2) and real
target directory descriptor is given (not AT_FDCWD). Without this fix regular
rename(2) fails if the target file already exists.
Reported by: Michael Butler <imb@protected-networks.net>
Reported by: Larry Rosenman <ler@lerctr.org>
Sponsored by: The FreeBSD Foundation
It unfortunately steals a fair chunk of RAM at startup even if it's not
actively used, which prevents FreeBSD VMs of 128MB from successfully
booting and running.
other architectures [1].
While here:
- Remove an unused and commented out include.
- Add a comment describing the file that other copies have.
- Fix the style of the defines and add a comment on what each one is.
Suggested by: [1] alc
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
an interrupt filter (some other drivers in the tree do the same). So
change the overtemperature and power fail interrupts from handlers in order
to code and get rid of a !INTR_MPSAFE handlers.
- Mark unused parameters as such.
- Use NULL instead of 0 for pointers.
MFC after: 1 week
fail interrupt handler, there seems to be either a broken batch of them
or a tendency to develop a defect which causes this interrupt to fire
inadvertedly. Given that apart from this problem these machines work
just fine, add a tunable allowing the setup of the power fail interrupt
to be disabled.
While at it, remove the DEBUGGER_ON_POWERFAIL compile time option and
make that behavior also selectable via the newly added tunable.
- Apparently, it's no longer a problem to call shutdown_nice(9) from within
an interrupt filter (some other drivers in the tree do the same). So
change the power fail interrupt from an handler in order to simplify the
code and get rid of a !INTR_MPSAFE handler.
- Use NULL instead of 0 for pointers.
MFC after: 1 week
Import a fix tighten assertion on SPA versions from vendor (Illumos).
Illumos ZFS issue:
3543 Feature flags causes assertion in spa.c to miss certain cases
MFC after: 2 weeks