After KVA space was increased to 512GB on amd64 it became impractical
to use PTEs as entries in the minidump map of dumped pages, because size
of that map alone would already be 1GB.
Instead, we now use PDEs as page map entries and employ two stage lookup
in libkvm: virtual address -> PDE -> PTE -> physical address. PTEs are
now dumped as regular pages. Fixed page map size now is 2MB.
libkvm keeps support for accessing amd64 minidumps of version 1.
Support for 1GB pages is added.
Many thanks to Alan Cox for his guidance, numerous reviews, suggestions,
enhancments and corrections.
Reviewed by: alc [kernel part]
MFC after: 15 days
Additionally, because of sysctl(3) use (which is generally good), behaviour
for crash dumps differs slightly from behaviour for live kernels and this
will probably never be fixed entirely, so weaken that claim.
MFC after: 1 week
(DPCPU):
A new API, kvm_dpcpu_setcpu(3), selects the active CPU for the purposes
of DPCPU. Calls to kvm_nlist(3) will automatically translate DPCPU
symbols and return a pointer to the current CPU's version of the data.
Consumers needing to read the same symbol on several CPUs will invoke a
series of setcpu/nlist calls, one per CPU of interest.
This addition makes it possible for tools like netstat(1) to query the
values of DPCPU variables during crashdump analysis, and is based on
similar code handling virtualized global variables.
MFC after: 1 week
Sponsored by: Juniper Networks, Inc.
Similar to libexec/, do the same with lib/. Make WARNS=6 the norm and
lower it when needed.
I'm setting WARNS?=0 for secure/. It seems secure/ includes the
Makefile.inc provided by lib/. I'm not going to touch that directory.
Most of the code there is contributed anyway.
in a seperate array. As such we need to use kvm_read rather than bcopy
to populate the ki_groups field.
This fixes a crash when running ps -ax on a coredump.
Reported by: brucec
Tested by: brucec
MFC after: 3 days
value the kernel calculated directly as we already read it
with struct vnet. This will make kvm_vnet.c more resilent
in case of possible kernel changes.
Reviewed by: rwatson
Approved by: re (kib)
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
in up to 16 (KI_NGROUPS) values and steal a bit from ki_cr_flags
(all bits currently unused) to indicate overflow with the new flag
KI_CRF_GRP_OVERFLOW.
This fixes procstat -s.
Approved by: re (kib)
without VIMAGE virtualization in the kernel.
If we cannot resolve a symbol try to see if we can find it with
prefix of the virtualized subsystem, currently only "vnet_entry"
by identifying either the vnet of the current process for a
live system or the vnet of proc0 (or of dumptid if compiled
in a non-default way).
The way this is done currently allows us to only touch libkvm
but no single application. Once we are going to virtualize more
subsystems we will have to review this decision for better scaling.
Submitted by: rwatson (initial version of kvm_vnet.c, lots of ideas)
Reviewed by: rwatson
Approved by: re (kib)
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.
Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
support for virtual core files (aka minidumps). physical core
files are not supported.
The implementation is cross-tool ready and can be used in a non-
powerpc hosted debugger to analyze PowerPC core files. It also
accepts core files that still have the dump header, as can be
the case within Juniper where TFTP-based kernel core files are
supported and savecore is not used to "extract" the core file
from some dump device.
Obtained from: Juniper Networks, Inc.
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
'kern.cp_time'. For a live kernel it uses the sysctl. For a crashdump,
it first checks to see if the kernel has a 'cp_time' global symbol. If
it does, it uses that. If that doesn't work, when it uses the recently
added kvm_getmaxcpu(3) and kvm_getpcpu(3) routines to walk all the CPUs
and sum up their counters.
MFC after: 1 week
similar to _WANT_UCRED and _WANT_PRISON and seems to be much nicer than
defining _KERNEL.
It is also needed for my sys/refcount.h change going in soon.
global list of all files.
- Mark kvm_getfiles() as broken since the live version exports struct xfile
with no filelist at the head and does so incorrectly and the deadfiles
version exports struct file with a filelist at the head. It is not known
if either version works or complies to the manpage.
kthread_add() takes the same parameters as the old kthread_create()
plus a pointer to a process structure, and adds a kernel thread
to that process.
kproc_kthread_add() takes the parameters for kthread_add,
plus a process name and a pointer to a pointer to a process instead of just
a pointer, and if the proc * is NULL, it creates the process to the
specifications required, before adding the thread to it.
All other old kthread_xxx() calls return, but act on (struct thread *)
instead of (struct proc *). One reason to change the name is so that
any old kernel modules that are lying around and expect kthread_create()
to make a process will not just accidentally link.
fix top to show kernel threads by their thread name in -SH mode
add a tdnam formatting option to ps to show thread names.
make all idle threads actual kthreads and put them into their own idled process.
make all interrupt threads kthreads and put them in an interd process
(mainly for aesthetic and accounting reasons)
rename proc 0 to be 'kernel' and it's swapper thread is now 'swapper'
man page fixes to follow.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.
Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)
scheme allowed for 1024 PTE pages, each containing 256 PTEs.
This yielded 2GB of KVA. This is not enough to boot a kernel
on a 16GB box and in general too low for a 64-bit machine.
By adding a level of indirection we now have 1024 2nd-level
directory pages, each capable of supporting 2GB of KVA. This
brings the grand total to 2TB of KVA.
- Restore support for fetching swap information from crash dumps via
kvm_get_swapinfo(3) to fix pstat -T/-s on crash dumps.
Reviewed by: arch@, phk
MFC after: 1 week
bogusly used the kvm_powerpc.c file as a template for the license, but
then either wrote the code himself, or cribbed it from the kvm_i386
file. The only thing from the kvm_powerpc.c file was the license.
Correct this mistake with his blessing.
far more convenient for libkvm to work with because of the page table
block at the beginning. As a result, the MD code is smaller.
libkvm will automatically detect old vs mini dumps on i386 and amd64.
libkvm will handle i386 PAE and non-PAE modes. There is a PAE flag in
the i386 minidump header to signal the width of the entries in the
page table block.
Other convenient values are also present, such as kernbase and the direct
map addresses on amd64.
readable on certain random memory configurations. If the libkvm consumer
tried to read something that was in the very last pdpe, pde or pte slot,
it would bogusly fail.
This is broken in RELENG_6 too.
returned an lseek offset in a "u_long *" value, which can't express >4GB
offsets on 32 bit machines (eg: PAE). Change to "off_t *" for all.
Support ELF crashdumps on i386 and amd64.
Support PAE crashdumps on i386. This is done by auto-detecting the
presence of the IdlePDPT which means that PAE is active.
I used Marcel's _kvm_pa2off strategy and ELF header reader for ELF support
on amd64. Paul Saab ported the amd64 changes to i386 and we implemented
the PAE support from there.
Note that gdb6 in the src tree uses whatever libkvm supports. If you want
to debug an old crash dump, you might want to keep an old libkvm.so handy
and use LD_PRELOAD or the like. This does not detect the old raw dump
format.
Approved by: re
Extract the struct cdev pointer and the tty device from inside rather than
incorrectly casting the 'struct cdev *' pointer to a 'dev_t' int. Not
that this was particularly important since it was only used for reading
vmcore files.
- Add a comment noting that the ru_[us]times values being read aren't
actually valid and need to be computed from the raw values.
Submitted by: many (1)
but with slightly cleaned up interfaces.
The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.
The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.
Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.
Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.
The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.
A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.
Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.
Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week
ki_childutime, and ki_emul. Also uses the timeradd() macro to correct
the calculation of ki_childtime. That will correct the value returned
when ki_childtime.tv_usec > 1,000,000.
This also implements a new KERN_PROC_GID option for kvm_getprocs().
It also implements the KERN_PROC_RGID and KERN_PROC_SESSION options
which were added to sys/kern/kern_proc.c revision 1.203.
PR: bin/65803 (a very tiny piece of the PR)
Submitted by: Cyrille Lefevre
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
This enable us to use /dev/fwmem* as a core file.
e.g.
ps -M /dev/fwmem0.0 -N kernel.debug
dmesg -M /dev/fwmem0.0 -N kernel.debug
gdb -k -c /dev/fwmem0.0 kernel.debug
You need to set target EUI64 in hw.firewire.fwmem.eui64_hi/lo before
opening the device. On the target arch, (PCI) bus address must be
equivalent to physical address.
(We cannot use this for sparc64 because of IOMMU.)
No objection in: -audit
binaries in /bin and /sbin installed in /lib. Only the versioned files
reside in /lib, the .so symlink continues to live /usr/lib so the
toolchain doesn't need to be modified.
prime objectives are:
o Implement a syscall path based on the epc inststruction (see
sys/ia64/ia64/syscall.s).
o Revisit the places were we need to save and restore registers
and define those contexts in terms of the register sets (see
sys/ia64/include/_regset.h).
Secundairy objectives:
o Remove the requirement to use contigmalloc for kernel stacks.
o Better handling of the high FP registers for SMP systems.
o Switch to the new cpu_switch() and cpu_throw() semantics.
o Add a good unwinder to reconstruct contexts for the rare
cases we need to (see sys/contrib/ia64/libuwx)
Many files are affected by this change. Functionally it boils
down to:
o The EPC syscall doesn't preserve registers it does not need
to preserve and places the arguments differently on the stack.
This affects libc and truss.
o The address of the kernel page directory (kptdir) had to
be unstaticized for use by the nested TLB fault handler.
The name has been changed to ia64_kptdir to avoid conflicts.
The renaming affects libkvm.
o The trapframe only contains the special registers and the
scratch registers. For syscalls using the EPC syscall path
no scratch registers are saved. This affects all places where
the trapframe is accessed. Most notably the unaligned access
handler, the signal delivery code and the debugger.
o Context switching only partly saves the special registers
and the preserved registers. This affects cpu_switch() and
triggered the move to the new semantics, which additionally
affects cpu_throw().
o The high FP registers are either in the PCB or on some
CPU. context switching for them is done lazily. This affects
trap().
o The mcontext has room for all registers, but not all of them
have to be defined in all cases. This mostly affects signal
delivery code now. The *context syscalls are as of yet still
unimplemented.
Many details went into the removal of the requirement to use
contigmalloc for kernel stacks. The details are mostly CPU
specific and limited to exception_save() and exception_restore().
The few places where we create, destroy or switch stacks were
mostly simplified by not having to construct physical addresses
and additionally saving the virtual addresses for later use.
Besides more efficient context saving and restoring, which of
course yields a noticable speedup, this also fixes the dreaded
SMP bootup problem as a side-effect. The details of which are
still not fully understood.
This change includes all the necessary backward compatibility
code to have it handle older userland binaries that use the
break instruction for syscalls. Support for break-based syscalls
has been pessimized in favor of a clean implementation. Due to
the overall better performance of the kernel, this will still
be notived as an improvement if it's noticed at all.
Approved by: re@ (jhb)
that crept in recently. GCC will optimize the divides and multiplies for us.
Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU>
MFC after: 1 day
memory while mapping a virtual address to a physical address.
This allows us to work with virtual addresses for page tables,
provided it doesn't cause infinite recursion. Currently all
page tables are direct mapped.
userland. If someone wants to implement a backup p_siglist in the kernel
for compatability and to export one could. For now, just tell KVM to hand
an empty signal set off to the userland.
after adding __FBSDID().
Garbage-collected kvm_readswap(). This was once used by kvm_uread(), but
kvm_uread() now just reads /proc/<pid>/mem and procfs hopefully handles
swapped out pages.
next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)
While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.
called <machine/_types.h>.
o <machine/ansi.h> will continue to live so it can define MD clock
macros, which are only MD because of gratuitous differences between
architectures.
o Change all headers to make use of this. This mainly involves
changing:
#ifdef _BSD_FOO_T_
typedef _BSD_FOO_T_ foo_t;
#undef _BSD_FOO_T_
#endif
to:
#ifndef _FOO_T_DECLARED
typedef __foo_t foo_t;
#define _FOO_T_DECLARED
#endif
Concept by: bde
Reviewed by: jake, obrien
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
processes match the given criteria. Since revision 1.60 of malloc.c,
malloc() and friends return an invalid pointer when given a size of 0.
kvm_getprocs() uses sysctl() with a NULL oldp argument to get an
initial size, but does not check whether it's 0 before passing it to
realloc() (via _kvm_realloc()). Before the aforementioned malloc()
change, this resulted in a minimal allocation made and a valid poitner
returned, but now results in an invalid, but non-NULL, pointer being
returned. When this is passed to sysctl(), the latter returns EFAULT
(as it should).
I'll know as soon as I re-import it and compile it.. :-)
There is no longer a 'pri' strict in the proc struct.
the fields are scattered between the ksegrp and thread in question.
Make a slight change so that libkvm reaches the main thread via the
linked list, rather than assuming it is in the proc structure. Both
conditions are true in -current but only the first will be true in
the KSE M3 world.
argument to kvm_open() and kvm_openfiles() as unused.
BSD didn't read swap since kvm.c CSRG revision 5.21 (u-area is pageable
under new VM. no need to read from swap.)
The old !NEWVM code was removed in CSRG revision 5.23 (~ten years ago).
what broke ps on ia64. It probably also broke on alpha, but the fallback
method of using lseek/read on /proc/*/mem to read ps_strings seems to
work there. It doesn't on ia64 yet.
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.
Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
sysctls exporting swap information. When running on a live kernel,
the sysctl's will now be used instead of kvm_read, allowing consumers of
this interface to run without privilege (setgid kmem). Retain the
ability to run on coredumps, or on a kernel using kmem if explicitly
pointed at one.
A side effect of this change is that kvm_getswapinfo() is faster now in
the general case. If the SWIF_DUMP_TREE flag is given (pstat -ss does
this), the radix tree walker, which still uses kvm_read in any case, is
invoked, and therefore does require privilege.
Submitted by: Thomas Moestl <tmoestl@gmx.net>
Reviewed by: freebsd-audit
no longer contains kernel specific data structures, but rather
only scalar values and structures that are already part of the
kernel/user interface, specifically rusage and rtprio. It no
longer contains proc, session, pcred, ucred, procsig, vmspace,
pstats, mtx, sigiolst, klist, callout, pasleep, or mdproc. If
any of these changed in size, ps, w, fstat, gcore, systat, and
top would all stop working. The new structure has over 200 bytes
of unassigned space for future values to be added, yet is nearly
100 bytes smaller per entry than the structure that it replaced.