Commit Graph

238 Commits

Author SHA1 Message Date
attilio
899ab64514 Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
attilio
19b2ea9f81 The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
trociny
a5058251d5 Introduce a constant, ELF_NOTE_ROUNDSIZE, which evidently declare our
intention to use 4-byte padding for elf notes.

MFC after:	3 weeks
2013-05-01 14:59:16 +00:00
trociny
f172830e71 Add a new set of notes to a process core dump to store procstat data.
The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.

The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.

PR:		kern/173723
Suggested by:	jhb
Discussed with:	jhb, kib
Reviewed by:	jhb (initial version), kib
MFC after:	1 month
2013-04-16 19:19:14 +00:00
trociny
99603a0f21 Re-factor coredump routines. For each type of notes an output
function is provided, which is used either to calculate the note size
or output it to sbuf.  On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputing notes to sbuf.  For the sbuf a drain routine is
provided that writes data to a core file.

The main goal of the change is to make coredump to write notes
directly to the core file, without preliminary preparing them all in a
memory buffer.  Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
becomes an issue when we start to store procstat notes.

Reviewed by:	jhb (initial version), kib
Discussed with:	jhb, kib
MFC after:	3 weeks
2013-04-14 19:59:38 +00:00
attilio
d67371dab6 Switch some "low-hanging fruit" to acquire read lock on vmobjects
rather than write locks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-04-08 19:58:32 +00:00
trociny
c71ec9b060 Fill p_flags and p_align fields of the core dump note segement.
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:42:27 +00:00
trociny
67b311e2f1 Use 4-byte padding for core dump notes on both 32 and 64bit archs.
Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.

This is not expected to change alignment for currently generated core
dump notes, as the notes look to consist of structures with sizes
multiple of 8 on 64-bit archs. But there are plans to add additional
notes, where 4-byte vs 8-byte alignment makes difference.

Discussed with:	kib
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:40:49 +00:00
tijl
f0c5e71226 - Fix two possible overflows when testing if ELF program headers are on
the first page:
  1. Cast uint16_t operands in a multiplication to unsigned int because
     otherwise the implicit promotion to int results in a signed
     multiplication that can overflow and the behaviour on integer
     overflow is undefined.
  2. Replace (offset + size > PAGE_SIZE) with (size > PAGE_SIZE - offset)
     because the sum may overflow.
- Use the same tests to see if the path to the interpreter is on the first
  page. There's no overflow here because size is already limited by
  MAXPATHLEN, but the compiler optimises the new tests better. Also fix an
  off-by-one error.
- Simplify tests to see if an ELF note program header is on the first page.
  This also fixes an off-by-one error.

Reviewed by:	kib
MFC after:	1 week
2013-03-13 22:01:31 +00:00
attilio
15bf891afe Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to
their "write" versions.

Sponsored by:	EMC / Isilon storage division
2013-02-20 12:03:20 +00:00
attilio
658534ed5a Switch vm_object lock to be a rwlock.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
  get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
  files require including directly rwlock.h
2013-02-20 10:38:34 +00:00
kib
f24ec7b4ea Remove the ia64-specific code fragment, which effect is more cleanly
done by the call to trans_prot() function a line before.

Discussed with:	Oliver Pinter <oliver.pntr@gmail.com>
MFC after:	1 week
2013-02-10 20:08:33 +00:00
kib
560aa751e0 Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
kib
8f845e475e Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by:	pho (previous version)
MFC after:	2 weeks
2012-09-28 11:25:02 +00:00
kib
76522f4cab Fix several reads beyond the mapped first page of the binary in the
ELF parser. Specifically, do not allow note reader and interpreter
path comparision in the brandelf code to read past end of the page.
This may happen if specially crafter ELF image is activated.

Submitted by:	Lukasz Wojcik <lukasz.wojcik zoho com>
MFC after:	3 days
2012-07-19 11:15:53 +00:00
kib
7b36a08108 Implement mechanism to export some kernel timekeeping data to
usermode, using shared page.  The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs neccessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
kib
4e790f9b2b ELF image can have several PT_NOTE program headers. Look for the ELF
brand note in each header, instead of using only first one.

Reviewed by:	kan
Tested by:	andrew (arm), flo (sparc64)
MFC after:	3 weeks
2012-03-11 19:38:49 +00:00
kib
6392be1eb8 Finally, try to enable the nxstacks on amd64 and powerpc64 for both 64bit
and 32bit ABIs. Also try to enable nxstacks for PAE/i386 when supported,
and some variants of powerpc32.

MFC after:	2 months (if ever)
2012-01-30 07:56:00 +00:00
alc
091f2726d5 Explain why it is safe to unlock the vnode.
Requested by:	kib
2012-01-17 16:20:50 +00:00
alc
5210c69a89 Improve abstraction. Eliminate direct access by elf*_load_section()
to an OBJT_VNODE-specific field of the vm object.  The same
information can be just as easily obtained from the struct vattr that
is in struct image_params if the latter is passed to
elf*_load_section().  Moreover, by replacing the vmspace and vm
object parameters to elf*_load_section() with a struct image_params
parameter, we actually reduce the size of the object code.

In collaboration with:	kib
2012-01-17 00:27:32 +00:00
uqs
d61d88a310 Convert files to UTF-8 2012-01-15 13:23:18 +00:00
kib
8e118d38cf Control the execution permission of the readable segments for
i386 binaries on the amd64 and ia64 with the sysctl, instead of
unconditionally enabling it.

Reviewed by:	marcel
2011-10-15 12:35:18 +00:00
marcel
92e552423d In elf32_trans_prot() and when compiling for amd64 or ia64, add
PROT_EXECUTE when PROT_READ is needed. By default i386 allows
execution when reading is allowed and JDK 1.4.x depends on that.
2011-10-13 16:16:46 +00:00
trasz
4a17b24427 All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0.  This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
2011-07-06 20:06:44 +00:00
jonathan
8c932faae4 Add some checks to ensure that Capsicum is behaving correctly, and add some
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.

Approved by: mentor(rwatson), re(bz)
2011-06-30 10:56:02 +00:00
trasz
92bec9b84c Add accounting for most of the memory-related resources.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-05 20:23:59 +00:00
mdf
b291e9a365 Put the general logic for being a CPU hog into a new function
should_yield().  Use this in various places.  Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after:	1 week
2011-02-02 16:35:10 +00:00
kib
25001a6e2e Use the same expression to report stack protection mode for AT_STACKEXEC
as the expression used by exec_new_vmspace().
2011-01-08 18:41:19 +00:00
kib
90de4ffc1a In elf image activator, read and apply the stack protection mode from
PT_GNU_STACK program header, if present and enabled. Two new sysctls
are provided, kern.elf32.nxstack and kern.elf64.nxstack, that allow to
enable PT_GNU_STACK for ABIs of specified bitsize, if ABI decided to
support shared page.

Inform rtld about access mode of the stack initial mapping by
AT_STACKPROT aux vector.

At the moment, the default is disabled, waiting for the usermode
support bits.
2011-01-08 16:30:59 +00:00
kib
12d561f88a Collect code to translate between vm_prot_t and p_flags into helper
functions.

MFC after:	1 week
2011-01-08 16:02:14 +00:00
attilio
7718cbcbf4 Add the ability for GDB to printout the thread name along with other
thread specific informations.

In order to do that, and in order to avoid KBI breakage with existing
infrastructure the following semantic is implemented:
- For live programs, a new member to the PT_LWPINFO is added (pl_tdname)
- For cores, a new ELF note is added (NT_THRMISC) that can be used for
  storing thread specific, miscellaneous, informations. Right now it is
  just popluated with a thread name.

GDB, then, retrieves the correct informations from the corefile via the
BFD interface, as it groks the ELF notes and create appropriate
pseudo-sections.

Sponsored by:	Sandvine Incorporated
Tested by:	gianni
Discussed with:	dim, kan, kib
MFC after:	2 weeks
2010-11-22 14:42:13 +00:00
kib
d9f088a03e Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by:	marius (sparc64)
MFC after:	1 month
2010-08-17 08:55:45 +00:00
alfred
993bf6ff36 Don't leak core_buf or gzfile if doing a compressed core file and we
hit an error condition.

Obtained from: Juniper Networks
2010-04-30 03:13:24 +00:00
nwhitehorn
ac2318460c Add the ELF relocation base to struct image_params. This will be
required to correctly relocate the executable entry point's function
descriptor on powerpc64.
2010-03-25 14:31:26 +00:00
nwhitehorn
b60f1f5349 Change the way text_addr and data_addr are computed to use the
executable status of segments instead of detecting the main text segment
by which segment contains the program entry point. This affects
obreak() and is required for correct operation of that function
on 64-bit PowerPC systems. The previous behavior was apparently
required only for the Alpha, which is no longer supported.

Reviewed by:	jhb
Tested on:	amd64, sparc64, powerpc
2010-03-25 14:21:22 +00:00
nwhitehorn
142a4d2993 Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by:	kib, jhb
2010-03-11 14:49:06 +00:00
alfred
88b3bf6496 put calls to gzclose() under ifdef COMPRESS_USER_CORES to prevent
undefined symbols on kernels without this option.

Reported by: Alexander Best
2010-03-04 21:53:45 +00:00
alfred
f34ce3dd38 Merge projects/enhanced_coredumps (r204346) into HEAD:
Enhanced process coredump routines.

  This brings in the following features:
  1) Limit number of cores per process via the %I coredump formatter.
  Example:
    if corefilename is set to %N.%I.core AND num_cores = 3, then
    if a process "rpd" cores, then the corefile will be named
    "rpd.0.core", however if it cores again, then the kernel will
    generate "rpd.1.core" until we hit the limit of "num_cores".

    this is useful to get several corefiles, but also prevent filling
    the machine with corefiles.

  2) Encode machine hostname in core dump name via %H.

  3) Compress coredumps, useful for embedded platforms with limited space.
    A sysctl kern.compress_user_cores is made available if turned on.

    To enable compressed coredumps, the following config options need to be set:
    options COMPRESS_USER_CORES
    device zlib   # brings in the zlib requirements.
    device gzio   # brings in the kernel vnode gzip output module.

  4) Eventhandlers are fired to indicate coredumps in progress.

  5) The imgact sv_coredump routine has grown a flag to pass in more
  state, currently this is used only for passing a flag down to compress
  the coredump or not.

  Note that the gzio facility can be used for generic output of gzip'd
  streams via vnodes.

Obtained from: Juniper Networks
Reviewed by: kan
2010-03-02 06:58:58 +00:00
kib
2eb5677d22 If ET_DYN binary has non-zero base address for some reason, honour it
and do not relocate the binary to ET_DYN_LOAD_ADDR. This allows for the
binary author to influence address map of the process. In particular,
when the binary is actually an interpeter, this allows to have almost
usual process address map.

Communicate the relocation bias of the mapping for interpeter-less
ET_DYN binary, that is interperter itself, in AT_BASE aux entry. This
way, rtld is able to find its dynamic structure and relocate itself.
Note that mapbase in the rtld is still wrong and requires further
fixing.

Reported and tested by:	rwatson
Discussed with:	kan
MFC after:	3 days
2009-10-18 12:57:48 +00:00
kib
edf781a815 Map PIE binaries at non-zero base address.
Discussed with:	bz
Reviewed by:	kan
Tested by:	bz (i386, amd64), bsam (linux)
MFC after:	some time
2009-10-10 15:33:01 +00:00
kib
16ff64e8c6 Do not map segments of zero length.
Discussed with:	bz
Reviewed by:	kan
Tested by:	bz (i386, amd64), bsam (linux)
MFC after:	some time
2009-10-10 15:28:52 +00:00
bz
a0d8f55f8a Print a warning in case we cannot add more brandinfo because
we would overflow the MAX_BRANDS sized array.

Reviewed by:	kib
MFC After:	1 month
2009-10-03 10:50:00 +00:00
bz
840afe36da Make sure FreeBSD binaries without .note.ABI-tag section work
correctly and do not match a colliding Debian GNU/kFreeBSD
brandinfo statements.
For this mark the Debian GNU/kFreeBSD brandinfo that it must have
an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
when comparing a possibly colliding set of options.

Due to SYSINIT we add the brandinfo in a non-deterministic order,
so native FreeBSD is not always first. We may want to consider
to force native FreeBSD to come first as well.

The only way a problem could currently be noticed is when running an
i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
brandinfo  was matched first,  as the fallback to ld-elf32.so.1 does
not exist in that case.

Reported and tested by:	ticso
In collaboration with:	kib
MFC after:		3 days
2009-08-30 14:38:17 +00:00
bz
ba7b3afabc Fix handling of .note.ABI-tag section for GNU systems [1].
Handle GNU/Linux according to LSB Core Specification 4.0,
Chapter 11. Object Format, 11.8. ABI note tag.

Also check the first word of desc, not only name, according to
glibc abi-tags specification to distinguish between Linux and
kFreeBSD.

Add explicit handling for Debian GNU/kFreeBSD, which runs
on our kernels as well [2].

In {amd64,i386}/trap.c, when checking osrel of the current process,
also check the ABI to not change the signal behaviour for Linux
binary processes, now that we save an osrel version for all three
from the lists above in struct proc [2].

These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
and Linux binaries on the same machine again for at least i386 and
amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).

PR:		kern/135468
Submitted by:	dchagin [1] (initial patch)
Suggested by:	kib [2]
Tested by:	Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by:	kib
MFC after:	3 days
2009-08-24 16:19:47 +00:00
dchagin
01bf63c9fb Fix KBI breakage by r190520 which affects older linux.ko binaries:
1) Move the new field (brand_note) to the end of the Brandinfo structure.
2) Add a new flag BI_BRAND_NOTE that indicates that the brand_note pointer
   is valid.
3) Use the brand_note field if the flag BI_BRAND_NOTE is set and as old
   modules won't have the flag set, so the new field brand_note would be
   ignored.

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	kib (mentor)
MFC after:	6 days
2009-04-05 09:27:19 +00:00
kib
4c3e8a8b03 Fix several issues with parsing the notes for ELF objects.
Badly formed ELF note may cause the caclulated pointer to the next note
to point both after the note region, that was checked in the code, but
also to point before the region, that was not checked [1]. Remember the
first note location in note0 and leap out if the note is not between
note0 and note_end.

In the similar way, badly formed note may cause infinite loop by
pointing next note into the same or previous note. Guard against this by
limiting amount of loop iterations by arbitrary choosen big number.

For clarity, check the calculated note alignment in each iteration.

Reported by:	Chris Palmer <chris noncombatant org> [1]
PR:	kern/132886
Reviewed and tested by:	dchagin
MFC after:	3 days
2009-03-22 13:42:41 +00:00
kib
e905171fbe Supply AT_EXECPATH auxinfo entry to the interpreter, both for native and
compat32 binaries.

Tested by:	pho
Reviewed by:	kan
2009-03-17 12:53:28 +00:00
kib
37de637d31 Use the properly sized types for ELF object header and program headers.
This fixes osrel fetching from the FreeBSD branding note for the 64bit
platforms.

Reported by:	swell.k gmail com
Reviewed by:	dchagin
Tested by:	dchagin, swell.k gmail com
2009-03-17 09:50:40 +00:00
dchagin
2408b715a0 Implement new way of branding ELF binaries by looking to a
".note.ABI-tag" section.

The search order of a brand is changed, now first of all the
".note.ABI-tag" is looked through.

Move code which fetch osreldate for ELF binary to check_note() handler.

PR:		118473
Approved by:	kib (mentor)
2009-03-13 16:40:51 +00:00
rwatson
97295d8b75 When a statically linked binary is executed (or at least, one without
an interpreter definition in its program header), set the auxiliary
ELF argument AT_BASE to 0 rather than to the address that we would
have mapped the interpreter at if there had been one.

The ELF ABI specifications appear to be ambiguous as to the desired
behavior in this situation, as they define AT_BASE as the base address
of the interpreter, but do not mention what to do if there is none.
On Solaris, AT_BASE will be set to the base address of the static
binary if there is no interpreter, and on Linux, AT_BASE is set to 0.
We go with the Linux semantics as they are of more immediate utility
and allow the early runtime environment to know that the kernel has
not mapped an interpreter, but because AT_PHDR points at the ELF
header for the running binary, it is still possible to retrieve all
required mapping information when the process starts should it be
required.  Either approach would be preferable to our current behavior
of passing a pointer to an unmapped region of user memory as AT_BASE.

MFC after:	3 weeks
2009-01-25 12:07:43 +00:00