15472 Commits

Author SHA1 Message Date
Gleb Smirnoff
8d40bada3e Fix a degenerate case when soisdisconnected() would call soisconnected().
This happens when closing a socket with upcall, and trace is: soclose()->
... protocol ... -> soisdisconnected() -> socantrcvmore_locked() ->
sowakeup() -> soisconnected().

Right now this case is innocent for two reasons.  First, soisconnected()
doesn't clear SS_ISDISCONNECTED flag.  Second, the mutex to lock the
socket is the socket receive buffer mutex, and soisdisconnected() first
disables the receive buffer. But in future code, the mutex that locks the
socket will differ from the buffer mutex, and we would get undesired mutex
recursion.

The fix is to check SS_ISDISCONNECTED flag before calling upcall.
2017-06-08 06:16:47 +00:00
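The guard described in 8d40bada3e can be sketched in user-space C. This is only an illustration and not the committed kernel code: struct sock_sketch, count_upcall(), wakeup_sketch() and the flag value are hypothetical stand-ins, and only the name SS_ISDISCONNECTED comes from the commit message.

```c
#include <stddef.h>

/* Hypothetical flag value; only the name comes from the commit message. */
#define	SS_ISDISCONNECTED	0x2000

/* Hypothetical, heavily reduced stand-in for a socket. */
struct sock_sketch {
	int	so_state;				/* state flags */
	int	(*so_upcall)(struct sock_sketch *);	/* optional upcall */
	int	upcall_runs;				/* times the upcall ran */
};

/* Example upcall that only counts its invocations. */
int
count_upcall(struct sock_sketch *so)
{
	so->upcall_runs++;
	return (0);
}

/*
 * Wake-up path: skip the upcall once the socket is marked as
 * disconnected, so that teardown cannot re-enter the connect path.
 * Returns 1 if the upcall was invoked, 0 otherwise.
 */
int
wakeup_sketch(struct sock_sketch *so)
{
	if (so->so_upcall == NULL)
		return (0);
	if (so->so_state & SS_ISDISCONNECTED)
		return (0);
	so->so_upcall(so);
	return (1);
}
```

Calling wakeup_sketch() on a socket without the flag runs the upcall; once SS_ISDISCONNECTED is set, the same call returns 0 and the upcall is left alone.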
Marcelo Araujo
e0a6a23c6d Allow sysctl kern.vm_guest to return bhyve when running under bhyve.
Submitted by:	Sean Fagan <sef@ixsystems.com>
Reviewed by:	grehan
MFH:		4 weeks.
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D11090
2017-06-08 04:02:14 +00:00
Alan Cox
d90bf7d850 Originally, this file could be compiled as a user-space application for
testing purposes.  However, over the years, various changes to the kernel
have broken this feature.  This revision applies some fixes to get user-
space compilation working again.  There are no changes in this revision
to code that is used by the kernel.

MFC after:	3 days
2017-06-07 16:04:34 +00:00
Gleb Smirnoff
b3244df799 Provide typedef for socket upcall function.
While here change so_gen_t type to modern uint64_t.
2017-06-07 01:48:11 +00:00
Gleb Smirnoff
b94f68dc52 Remove a piece of dead code. 2017-06-07 01:21:34 +00:00
Alan Cox
03bdd65f18 When the function blist_fill() was added to the kernel in r107913, the swap
pager used a different scheme for striping the allocation of swap space
across multiple devices.  And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size.  Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected.  Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme.  This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.

Reviewed by:	kib
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11043
2017-06-06 03:32:17 +00:00
Allan Jude
e28f9b7d03 Jails: Optionally prevent jailed root from binding to privileged ports
You may now optionally specify allow.noreserved_ports to prevent root
inside a jail from using privileged ports (less than 1024)

PR:		217728
Submitted by:	Matt Miller <mattm916@pulsar.neomailbox.ch>
Reviewed by:	jamie, cem, smh
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D10202
2017-06-06 02:15:00 +00:00
Alan Cox
064650c180 Halve the memory being internally allocated by the blist allocator. In
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int.  (See r96849 and
r96851.)

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11028
2017-06-05 17:14:16 +00:00
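A rough way to see the waste described in 064650c180, under the assumption that a leaf stores its bitmap in a slot shared with a 64-bit field: if the bitmap type is only 32 bits wide, each leaf occupies the same space but tracks half as many blocks. The union names and leaf_bytes_sketch() are hypothetical and do not reproduce the blist data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Leaf whose bitmap is 32 bits wide although the slot is 64 bits (assumed layout). */
union leaf32_sketch {
	int64_t		avail;
	uint32_t	bitmap;
};

/* Leaf whose bitmap uses the whole 64-bit slot. */
union leaf64_sketch {
	int64_t		avail;
	uint64_t	bitmap;
};

/* Bytes of leaf storage needed to track nblocks blocks. */
size_t
leaf_bytes_sketch(size_t nblocks, size_t leaf_size, size_t bits_per_leaf)
{
	size_t nleaves;

	/* Round up: a partially used leaf still costs a whole leaf. */
	nleaves = (nblocks + bits_per_leaf - 1) / bits_per_leaf;
	return (nleaves * leaf_size);
}
```

Both unions have the same size, so tracking the same number of blocks with 64 bits per leaf needs half as many leaves as with 32 bits per leaf.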
Konstantin Belousov
3df7ebc4ed Add sysctl vfs.ino64_trunc_error controlling action on truncating
inode number or link count for the ABI compat binaries.

Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits.  Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures.  For
getdirentries(2), the knob skips the dirents which would cause non-trivial
truncation of d_ino.

EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').

Based on the discussion with:	bde
Sponsored by:	The FreeBSD Foundation
2017-06-05 11:40:30 +00:00
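The two behaviours described in 3df7ebc4ed can be sketched as follows. This is an assumption-laden illustration: ino64_to_ino32_sketch() is a hypothetical helper, and its trunc_error argument treats the vfs.ino64_trunc_error knob as a simple on/off switch, which may not match the knob's actual semantics.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Narrow a 64-bit inode number to the 32-bit field of an old ABI
 * structure.  trunc_error stands in for the sysctl knob.
 * Returns 0 on success, or EOVERFLOW when the knob is enabled and
 * the value does not fit into 32 bits.
 */
int
ino64_to_ino32_sketch(uint64_t ino, int trunc_error, uint32_t *out)
{
	if (ino > UINT32_MAX && trunc_error)
		return (EOVERFLOW);
	*out = (uint32_t)ino;	/* silently drops the upper 32 bits */
	return (0);
}
```

With the knob off, 0x100000001 is stored as 1 without any indication; with the knob on, the same call fails with EOVERFLOW and leaves the output untouched.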
Alan Cox
d712b799b5 The data type returned by vmoff() is too narrow in its range. This could
break the transmission of files longer than 4 GB on 32-bit architectures.

Reviewed by:	glebius, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10019
2017-06-03 16:19:33 +00:00
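The class of problem named in d712b799b5 can be illustrated with a small sketch. It does not reproduce vmoff(); the function names and the 4 KB page size are assumptions. It only shows how an offset computed in a 32-bit type wraps once it passes 4 GB, while the same computation in a 64-bit type does not.

```c
#include <stdint.h>

#define	PAGE_SHIFT_SKETCH	12	/* assumed 4 KB pages */

/* Offset computed in a 32-bit type, as on a 32-bit architecture. */
uint32_t
page_offset_narrow(uint32_t pindex)
{
	/* Unsigned arithmetic wraps modulo 2^32. */
	return (pindex << PAGE_SHIFT_SKETCH);
}

/* Offset computed in a 64-bit type; widen before shifting. */
int64_t
page_offset_wide(uint32_t pindex)
{
	return ((int64_t)pindex << PAGE_SHIFT_SKETCH);
}
```

For page index 0x100001 the narrow version yields 0x1000, whereas the wide version yields 0x100001000, the offset just past 4 GB.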
Konstantin Belousov
a7ca2c6ad0 Ensure that cached struct thread does not keep spurious td_su
reference on an UFS mount point.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-03 14:12:17 +00:00
Gleb Smirnoff
971af2a311 Rename accept filter getopt/setopt functions, so that they are prefixed
with module name and match other functions in the module.  There is no
functional change.
2017-06-02 17:49:21 +00:00
Gleb Smirnoff
810951ddc9 Style: unwrap lines that don't have a good reason to be wrapped. 2017-06-02 17:43:47 +00:00
Gleb Smirnoff
bd617e3b98 Remove write only flag UNP_HAVEPCCACHED. 2017-06-02 17:39:05 +00:00
Gleb Smirnoff
0c3c207ffd For UNIX sockets make vnode point not to the socket, but to the UNIX PCB,
since the latter is the thing that links together VFS and sockets.

While here, make the union in the struct vnode anonymous.
2017-06-02 17:31:25 +00:00
Mateusz Guzik
c7a6a1b325 mtx: fix whitespace damage in _mtx_trylock_flags_
MFC after:	3 days
2017-05-30 02:25:47 +00:00
Konstantin Belousov
03311f117b Use whole mnt_stat.f_fsid bits for st_dev.
Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid.  In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method of reporting st_dev by stat(2).

Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.

Note that the change is mostly cosmetic.  Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit.

Reviewed by:	avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by:	The FreeBSD Foundation
2017-05-27 17:00:30 +00:00
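The sign-extension effect mentioned in 03311f117b is plain C integer conversion and can be shown in isolation. The two function names are hypothetical, and the sketch does not reproduce vn_fsid(); it only contrasts converting a signed 32-bit value directly to 64 bits with converting it through an unsigned 32-bit type first.

```c
#include <stdint.h>

/* Direct conversion: a negative 32-bit value is sign-extended. */
uint64_t
fsid_to_dev_sign_extended(int32_t fsid0)
{
	return ((uint64_t)fsid0);
}

/* Going through uint32_t first keeps the upper 32 bits zero. */
uint64_t
fsid_to_dev_zero_extended(int32_t fsid0)
{
	return ((uint64_t)(uint32_t)fsid0);
}
```

For the input -2, the first function returns 0xfffffffffffffffe and the second returns 0x00000000fffffffe; for non-negative inputs both agree.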
Conrad Meyer
95b978955c procstat(1): Add TCP socket send/recv buffer size
Add TCP socket send and receive buffer size to procstat -f output.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10689
2017-05-26 22:17:44 +00:00
Allan Jude
c20feae640 Followup to r318765 (capsicumize cpuset_*affinity)
Update *sysent files
2017-05-24 01:01:57 +00:00
Allan Jude
f299c47b52 Allow cpuset_{get,set}affinity in capabilities mode
bhyve was recently sandboxed with capsicum, and needs to be able to
control the CPU sets of its vcpu threads

Reviewed by:	emaste, oshogbo, rwatson
MFC after:	2 weeks
Sponsored by:	ScaleEngine Inc.
Differential Revision:	https://reviews.freebsd.org/D10170
2017-05-24 00:58:30 +00:00
Steve Wills
a4aaba3b0a Add security.bsd.see_jail_proc
Add security.bsd.see_jail_proc sysctl to hide jail processes from non-root
users

Reviewed by:	jamie
Approved by:	allanjude
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D10770
2017-05-23 16:59:24 +00:00
Konstantin Belousov
ec95c622ff Regen. 2017-05-23 09:30:42 +00:00
Konstantin Belousov
6992112349 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in a backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion was done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
Ed Maste
bd309b323a Regen sysent after r318634, no open(2) in capability mode
Sponsored by:	The FreeBSD Foundation
2017-05-22 11:45:45 +00:00
Ed Maste
68fc8f3934 disallow open(2) in capability mode
Previously open(2) was allowed in capability mode, with a comment that
suggested this was likely the case to facilitate debugging. The system
call would still fail later on, but it's better to disallow the syscall
altogether.

We now have the kern.trap_enotcap sysctl or PROC_TRAPCAP_CTL proccontrol
to aid in debugging.

In any case libc has translated open() to the openat syscall since
r277032.

Reviewed by:	kib, rwatson
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10850
2017-05-22 11:43:19 +00:00
Mark Johnston
3bd485f968 Avoid open-coding PRI_UNCHANGED.
MFC after:	1 week
2017-05-18 18:24:11 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
John Baldwin
00f6cd3f56 Add sglist_append_sglist().
This function permits a range of one scatter/gather list to be appended to
another sglist.  This can be used to construct a scatter/gather list that
reorders or duplicates ranges from one or more existing scatter/gather
lists.

Sponsored by:	Chelsio Communications
2017-05-16 23:31:52 +00:00
Konstantin Belousov
391aba32e6 mnt_vnode_next_active: use conventional lock order when trylock fails.
Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.

Submitted by:	Tom Rix <trix@juniper.net>
Tested by:	pho
Differential revision:	https://reviews.freebsd.org/D10692
2017-05-15 10:02:45 +00:00
Konstantin Belousov
396a0d4455 Do not wake up sleeping thread in reschedule_signals() if the signal
is blocked.  The spurious wakeup might result in spurious EINTR.

The reschedule_signals() function is called when the calling thread
has the signal mask changed.  For each newly blocked signal, we try to
find a thread which might have the signal not blocked.  If no such
thread exists, sigtd() returns a random thread, which must not be woken
up.  I decided that re-checking, as suggested by PR submitter, is more
reasonable change than to change sigtd() interface, due to other uses
of sigtd().  signotify() already performs this check.

Submitted by:	Duane <parakleta@darkreality.org>
PR:	219228
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-05-12 15:34:59 +00:00
Mark Johnston
0c46712ca1 Let ptracestop() suspend threads sleeping in an SBDRY section.
When a thread enters ptracestop(), for example because it had received
SIGSTOP from ptrace(PT_ATTACH), it attempts to suspend other threads in
the same process. In the case of a thread sleeping interruptibly in an
SBDRY section, sig_suspend_threads() must wake the thread and allow it to
reach the user-mode boundary. However, sig_suspend_threads() would
erroneously avoid waking up such threads, resulting in an apparent hang.

Reviewed by:	kib
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2017-05-11 17:03:45 +00:00
Marius Strobl
26d877f5b8 - Also outside of the KOBJOPLOOKUP macro - which in turn is used by
  the code auto-generated for *.m - kobj_lookup_method(9) is useful;
  for example in back-ends or base class device drivers in order to
  determine whether a default method has been overridden. Thus, allow
  for the kobj_method_t pointer argument - used by KOBJOPLOOKUP in
  order to update the cache entry - of kobj_lookup_method(9), to be
  NULL. Actually, that pointer is redundant as it's just set to the
  same kobj_method_t that the kobj_lookup_method(9) function returns
  in the first place, but probably it serves to reduce the number of
  instructions generated for KOBJOPLOOKUP.
- For the same reason, move updating kobj_lookup_{hits,misses} (if
  KOBJ_STATS is defined) from kobj_lookup_method(9) to KOBJOPLOOKUP.
  As a side-effect, this gets rid of the convoluted approach of always
  incrementing kobj_lookup_hits in KOBJOPLOOKUP and then in case of
  a cache miss, decrementing it in kobj_lookup_method(9) again.
2017-05-08 21:08:39 +00:00
Brooks Davis
f19351aad8 Provide a freebsd32 implementation of sigqueue()
The previous misuse of sys_sigqueue() was sending random register or
stack garbage to 64-bit targets.  The freebsd32 implementation preserves
the sival_int member of value when signaling a 64-bit process.

Document the mixed ABI implementation of union sigval and the
incompatibility of sival_ptr with pointer integrity schemes.

Reviewed by:	kib, wblock
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10605
2017-05-05 18:49:39 +00:00
Mateusz Guzik
8066a14a3c cache: stop holding the ncneg_hot lock across purging
Only non-hot entries are purged so the lock is not needed in the first place.
This saves one lock/unlock pair.

MFC after:	1 week
2017-05-04 03:11:59 +00:00
Conrad Meyer
29dfb631d8 Extend cpuset_get/setaffinity() APIs
Add IRQ placement-only and ithread-only API variants. intr_event_bind
has been extended with sibling methods, as it has many more callsites in
existing code.

Reviewed by:	kib@, adrian@ (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10586
2017-05-03 18:41:08 +00:00
Konstantin Belousov
acd9f51725 Add asserts to verify stability of struct proc and struct thread layouts.
Some notes:
- Only i386 and amd64 layouts are checked, other Tier-1 (or close to
  it) architectures would benefit from the same check.
- Unconditional enabling of the asserts depends on the stability of locks
  memory layout.  If locks are optimized to avoid bloat when some debugging
  or profiling features are turned off, it makes sense to only assert layout
  for production configs.

Reviewed by:	badger, emaste, jhb, vangyzen
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D10526
2017-04-27 21:24:50 +00:00
Patrick Kelsey
1431521236 Remove unnecessary check for NULL mbuf in soreceive_generic().
This check has been redundant since it was introduced in r162554.

Reviewed by:	emaste, glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10322
2017-04-25 19:54:34 +00:00
Edward Tomasz Napierala
04005c2f92 Make it possible to terminate "show lockedbufs" by pressing "q".
MFC after:	2 weeks
2017-04-23 22:20:25 +00:00
Edward Tomasz Napierala
10be945708 Improve BUF_TRACKING by not displaying NULL entries.
Reviewed by:	cem
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10443
2017-04-23 17:39:31 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
  in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exception.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
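The early counter idea from 83c9dea1ba can be sketched in user-space C: a statistics pointer initially targets a statically allocated dummy, so updates are safe before the real allocator is available, and is redirected once allocation becomes possible. All names here are hypothetical, calloc() stands in for counter_u64_alloc(), and the sketch uses a single counter rather than per-CPU storage. Whether the kernel carries early counts over to the real counter is not stated in the commit message, so this sketch does not do so.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stands in for the spare pc_early_dummy_counter slot. */
uint64_t early_dummy_counter;

/* Statistics field; points at the dummy until a real counter exists. */
uint64_t *v_stat_sketch = &early_dummy_counter;

/* Safe to call at any time, including before allocation is possible. */
void
count_event_sketch(void)
{
	(*v_stat_sketch)++;
}

/*
 * Called once the allocator works: switch to a real, zeroed counter.
 * Returns 0 on success, -1 if the allocation fails.
 */
int
alloc_counter_sketch(void)
{
	uint64_t *c;

	c = calloc(1, sizeof(*c));
	if (c == NULL)
		return (-1);
	v_stat_sketch = c;
	return (0);
}
```

The point of the design is that the update path never has to test whether the counter has been allocated yet; it always dereferences a valid pointer.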
Gleb Smirnoff
fef0991322 Typo! 2017-04-17 17:07:51 +00:00
Gleb Smirnoff
9ed01c32e0 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
Gleb Smirnoff
6286dc78d4 Remove unneeded include of vm_phys.h. 2017-04-17 16:51:04 +00:00
Edward Tomasz Napierala
b66f26e931 Don't try to write out bufs that have already failed with ENXIO.
This fixes some panics after disconnecting mounted disks.

Submitted by:	imp (slightly different version, which I've then lost)
Reviewed by:	kib, imp, mckusick
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9674
2017-04-14 20:15:34 +00:00
Maxim Sobolev
63649db042 Restore ability to shutdown DGRAM sockets, still forcing ENOTCONN to be returned
by the shutdown(2) system call. This ability has been lost as part of the svn
revision 285910.

Reviewed by:	ed, rwatson, glebius, hiren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10351
2017-04-14 17:23:28 +00:00
Andrey V. Elsukov
57386f5dce Fix the build.
Reported by:	lwhsu
2017-04-14 10:21:38 +00:00
Andrey V. Elsukov
c33a231337 Rework r316770 to make it protocol independent and general, like we
do for streaming sockets.

Also do more cleanup in sbappendaddr_locked_internal() to prevent
leaking information from the existing mbuf to one that may be
created later by netgraph.

Suggested by:	glebius
Tested by:	Irina Liakh <spell at itl ua>
MFC after:	1 week
2017-04-14 09:00:48 +00:00
Andrew Turner
4e65501f13 Don't prefix zero with 0x in assym.s.
The arm64 binutils only accepts 0 as an offset to the Load-Acquire Register
instructions where llvm will accept both 0 and 0x0. The thread switching
code uses these with SCHED_ULE to block waiting for a lock to be released.
As the offset of the data to be loaded is zero this is safe, however it is
useful to keep the offset in the instruction to document what is being
loaded.

To work around this issue in binutils only generate the 0x prefix for
non-zero values.

Reported by:	kan
Sponsored by:	DARPA, AFRL
2017-04-13 15:43:44 +00:00
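The formatting rule from 4e65501f13 (emit the 0x prefix only for non-zero values) can be expressed as a small C helper. This is a sketch of the rule only: format_offset_sketch() is a hypothetical name, and the commit message does not show how the assym.s generator implements it.

```c
#include <stdio.h>

/*
 * Format an offset for an assembler symbol definition: a plain "0"
 * for zero, 0x-prefixed hexadecimal for everything else.
 * Returns the number of characters written, as snprintf() does.
 */
int
format_offset_sketch(char *buf, size_t len, unsigned long value)
{
	if (value == 0)
		return (snprintf(buf, len, "0"));
	return (snprintf(buf, len, "0x%lx", value));
}
```

A value of 0 is rendered as "0", which both assemblers accept as an offset, while 16 is rendered as "0x10".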
Patrick Kelsey
67d955aab4 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
Conrad Meyer
69cfbe8851 kern_descrip: Move kinfo_ofile size assert under COMPAT_FREEBSD7
The size and structure are not used outside of FreeBSD 7 compatibility ABIs.

Sponsored by:	Dell EMC Isilon
2017-04-07 05:00:09 +00:00