sharing especially on the default CPU 0 callout_cpu structure.
This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.
Sponsored by: Intel
Reviewed by: jeff, attilio
MFC after: 1 week
only constraint that they have a lock cookie named mtx_lock.
This name, then, becames reserved from the struct that wants to use the
mtx(9) KPI and other locking primitives cannot reuse it for their
members.
Namely such structs are the current struct mtx and the new
struct mtx_padalign. The new structure will define an object which is
the same as the same layout of a struct mtx but will be allocated in
areas aligned to the cache line size and will be as big as a cache line.
This is supposed to give higher performance for highly contented mutexes
both spin or sleep (because of the adaptive spinning), where the cache
line contention results in too much traffic on the system bus.
The struct mtx_padalign can be used in a completely transparent way
with the mtx(9) KPI.
At the moment, a possibility to MFC the patch should be carefully
evaluated because this patch breaks the low level KPI
(not its representation though).
Discussed with: jhb
Reviewed by: jeff, andre
Reviewed by: mdf (earlier version)
Tested by: jimharris
executed. This means past the point where userret() is generally
executed.
Skip the td_pinned check if a callchain tracing is currently happening
and add a more robust check to pmc_capture_user_callchain() in order to
catch td_pinned leak past ast() in hwpmc case.
Reported and tested by: fabient
MFC after: 1 week
X-MFC: r240246
overwriting the return mbuf pointer with newly received data after
a loop. Instead append the new mbuf chain to the existing one.
Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream()
doesn't get confused.
For the remainder copy case in the mbuf delivery part deduct the
copied length len instead of the whole mbuf length. Additionally
don't depend on 'n' being being available which isn't true in the
case of MSG_PEEK.
Fix the MSG_WAITALL case by comparing against sb_hiwat. Before
it was looping for every receive as sb_lowat normally is zero.
Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't
properly handled.
Submitted by: trociny (except for the change in last paragraph)
mbuf's by doing proper testing with M_WRITABLE().
In m_collapse() replace an incomplete manual check for M_RDONLY
with the M_WRITABLE() macro that also tests for shared buffers
and other cases that make a particular mbuf immutable.
MFC after: 2 weeks
In the old TTY layer, SIGTTIN was correctly handled like this:
while (data should be read) {
send SIGTTIN if not foreground process group
read data
}
In the new TTY layer, however, this behaviour was changed, based on a
false interpretation of the standard:
send SIGTTIN if not foreground process group
while (data should be read) {
read data
}
Correct this by pushing tty_wait_background() into the ttydisc_read_*()
functions.
Reported by: koitsu
PR: kern/173010
MFC after: 2 weeks
A default install on large memory machines with multiple 10gigE interfaces
were not being given enough mbufs to do full bandwidth TCP or NFS traffic.
To keep the value somewhat reasonable, we scale back the number of
maxuers by 1/6 past the 384 point. This gives us enough mbufs for most
of our pretty basic 10gigE line-speed tests to complete.
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock. Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).
Sponsored by: Intel
Reviewed by: jeff
more appropriate named kernel options for the very distinct
send and receive path.
"options SOCKET_SEND_COW" enables VM page copy-on-write based
sending of data on an outbound socket.
NB: The COW based send mechanism is not safe and may result
in kernel crashes.
"options SOCKET_RECV_PFLIP" enables VM kernel/userspace page
flipping for special disposable pages attached as external
storage to mbufs.
Only the naming of the kernel options is changed and their
corresponding #ifdef sections are adjusted. No functionality
is added or removed.
Discussed with: alc (mechanism and limitations of send side COW)
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
Return EPERM if processes were found but they
were unable to be signaled.
Return the first error from p_cansignal if no signal was successful.
Reviewed by: jilles
Approved by: cperciva
MFC after: 1 week
Return EPERM if processes were found but they
were unable to be signaled.
Return the first error from p_cansignal if no signal was successful.
Reviewed by: jilles
Approved by: cperciva
MFC after: 1 week
output and replace it with a new visible sysctl kern.ipc.acceptqueue
of the same functionality. It specifies the maximum length of the
accept queue on a listen socket.
The old kern.ipc.somaxconn remains available for reading and writing
for compatibility reasons so that existing programs, scripts and
configurations continue to work. There no plans to ever remove the
orginal and now hidden kern.ipc.somaxconn.
GIANT from VFS. In addition, disconnect also netsmb, which is a base
requirement for SMBFS.
In the while SMBFS regular users can use FUSE interface and smbnetfs
port to work with their SMBFS partitions.
Also, there are ongoing efforts by vendor to support in-kernel smbfs,
so there are good chances that it will get relinked once properly locked.
This is not targeted for MFC.
GIANT from VFS. This code is particulary broken and fragile and other
in-kernel implementations around, found in other operating systems,
don't really seem clean and solid enough to be imported at all.
If someone wants to reconsider in-kernel NTFS implementation for
inclusion again, a fair effort for completely fixing and cleaning it
up is expected.
In the while NTFS regular users can use FUSE interface and ntfs-3g
port to work with their NTFS partitions.
This is not targeted for MFC.
GIANT from VFS. In addition, disconnect also netncp, which is a base
requirement for NWFS.
In the possibility of a future maintenance of the code and later
readd to the FreeBSD base, maybe we should think about a better location
for netncp. I'm not entirely sure the / top location is actually right,
however I will let network people to comment on that more specifically.
This is not targeted for MFC.
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).
Reviewed by: avg
Tested by: pho
MFC after: 1 week
division by zero later if event timer's minimal period is above one second.
For now it is just a theoretical possibility.
Found by: Clang Static Analyzer
instruction loads/stores at its will.
The macro __compiler_membar() is currently supported for both gcc and
clang, but kernel compilation will fail otherwise.
Reviewed by: bde, kib
Discussed with: dim, theraven
MFC after: 2 weeks
.. when deciding whether to continue tracing across suid/sgid exec.
Otherwise if root ktrace-d an unprivileged process and the processed
exec-ed a suid program, then tracing didn't continue across exec.
Reviewed by: bde, kib
MFC after: 22 days
When performing a non-blocking read(2), on a TTY while no data is
available, we should return EAGAIN. But if there's a modem disconnect,
we should return 0. Right now we only return 0 when doing a blocking
read, which is wrong.
MFC after: 1 month
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.
Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.
Tested by: pho (previous version)
MFC after: 2 weeks
I have to note that POSIX is simply stupid in how it describes O_EXEC/fexecve
and friends. Yes, not only inconsistent, but stupid.
In the open(2) description, O_RDONLY flag is described as:
O_RDONLY Open for reading only.
Taken from:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html
Note "for reading only". Not "for reading or executing"!
In the fexecve(2) description you can find:
The fexecve() function shall fail if:
[EBADF]
The fd argument is not a valid file descriptor open for executing.
Taken from:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
As you can see the function shall fail if the file was not open with O_EXEC!
And yet, if you look closer you can find this mess in the exec.html:
Since execute permission is checked by fexecve(), the file description
fd need not have been opened with the O_EXEC flag.
Yes, O_EXEC flag doesn't have to be specified after all. You can open a file
with O_RDONLY and you still be able to fexecve(2) it.
global variables are placed. When a module is loaded by link_elf
linker its variables from "set_vnet" linker set are copied to the
kernel "set_vnet" ("modspace") and all references to these variables
inside the module are relocated accordingly.
The issue is when a module is loaded that has references to global
variables from another, previously loaded module: these references are
not relocated so an invalid address is used when the module tries to
access the variable. The example is V_layer3_chain, defined in ipfw
module and accessed from ipfw_nat.
The same issue is with DPCPU variables, which use "set_pcpu" linker
set.
Fix this making the link_elf linker on a module load recognize
"external" DPCPU/VNET variables defined in the previously loaded
modules and relocate them accordingly. For this set_pcpu_list and
set_vnet_list are used, where the addresses of modules' "set_pcpu" and
"set_vnet" linker sets are stored.
Note, archs that use link_elf_obj (amd64) were not affected by this
issue.
Reviewed by: jhb, julian, zec (initial version)
MFC after: 1 month
getmq_read() and getmq_write() respectively, just like sys_kmq_timedreceive()
and sys_kmq_timedsend().
Sponsored by: FreeBSD Foundation
MFC after: 2 weeks
Well, in theory we can pass those two flags, because O_RDONLY is 0,
but we won't be able to read from a descriptor opened with O_EXEC.
Update the comment.
Sponsored by: FreeBSD Foundation
MFC after: 2 weeks