directory, and jail directory within procstat. While this functionality
is available already in fstat, encapsulating it in the kern.proc.filedesc
sysctl makes it accessible without using kvm and thus without needing
elevated permissions.
The new procstat output looks like:
PID COMM FD T V FLAGS REF OFFSET PRO NAME
76792 tcsh cwd v d -------- - - - /usr/src
76792 tcsh root v d -------- - - - /
76792 tcsh 15 v c rw------ 16 9130 - -
76792 tcsh 16 v c rw------ 16 9130 - -
76792 tcsh 17 v c rw------ 16 9130 - -
76792 tcsh 18 v c rw------ 16 9130 - -
76792 tcsh 19 v c rw------ 16 9130 - -
I am also bumping __FreeBSD_version for this as this new feature will be
used in at least one port.
Reviewed by: rwatson
Approved by: rwatson
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only acquire curthread as argument; this will lead in axing the additional
argument from both functions, making the code cleaner.
Reviewed by: jeff, kib
the provided lock or &blocked_lock. The thread may be temporarily
assigned to the blocked_lock by the scheduler so a direct comparison
can not always be made.
- Use THREAD_LOCKPTR_ASSERT() in the primary consumers of the scheduling
interfaces. The schedulers themselves still use more explicit asserts.
Sponsored by: Nokia
- Move recursion checking into rwlock inlines to free a bit for use with
adaptive spinners.
- Clear the RW_LOCK_WRITE_SPINNERS flag whenever the lock state changes
causing write spinners to restart their loop.
- Write spinners are limited by a count while readers hold the lock as
there is no way to know for certain whether readers are running still.
- In the read path block if there are write waiters or spinners to avoid
starving writers. Use a new per-thread count, td_rw_rlocks, to skip
starvation avoidance if it might cause a deadlock.
- Remove or change invalid assertions in turnstiles.
Reviewed by: attilio (developed parts of the patch as well)
Sponsored by: Nokia
This support tries to be as parallel as possible with other locking
primitives, but there are differences; more specifically:
- The base witness support is alredy equipped for allowing lock
duplication acquisition as lockmgr rely on this.
- In the case of lockmgr_disown() the lock result unlocked by witness
even if it is still held by the "kernel context"
- In the case of upgrading we can have 3 different situations:
* Total unlocking of the shared lock and nothing else
* Real witness upgrade if the owner is the first upgrader
* Shared unlocking and exclusive locking if the owner is not the first
upgrade but it is still allowed to upgrade
- LK_DRAIN is basically handled like an exclusive acquisition
Additively new options LK_NODUP and LK_NOWITNESS can now be used with
lockinit(): LK_NOWITNESS disables WITNESS for the specified lock while
LK_NODUP enable duplicated locks tracking. This will require manpages
update and a __FreeBSD_version bumping (addressed by further commits).
This patch also fixes a problem occurring if a lockmgr is held in
exclusive mode and the same owner try to acquire it in shared mode:
currently there is a spourious shared locking acquisition while what
we really want is a lock downgrade. Probabilly, this situation can be
better served with a EDEADLK failing errno return.
Side note: first testing on this patch alredy reveleated several LORs
reported, so please expect LORs cascades until resolved. NTFS also is
reported broken by WITNESS introduction. BTW, NTFS is exposing a lock
leak which needs to be fixed, and this patch can help it out if
rightly tweaked.
Tested by: kris, yar, Scot Hetzel <swhetzel at gmail dot com>
done in consumers code: using locks properties is much more appropriate.
Fix current code doing these bogus checks.
Note: Really, callout are not usable by all !(LC_SPINLOCK | LC_SLEEPABLE)
primitives like rmlocks doesn't implement the generic lock layer
functions, but they can be equipped for this, so the check is still
valid.
Tested by: matteo, kris (earlier version)
Reviewed by: jhb
- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internel() in sorflush(), and as a result avoid intializing
and destroying a socket buffer lock for the temporary stack copy of the
actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
things have gotten less ugly in sorflush() lately.
This makes socket close cleaner, and possibly also marginally faster.
MFC after: 3 weeks
referencing the files VM pages are returned from the network stack,
making changes to the file safe.
This flag does not guarantee that the data has been transmitted to the
other end.
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.
Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.
Likely break the only non-convetional user of the API, after informing
the relevant committer.
Update the mbuf(9) manual page, which was already out of sync on
this point.
Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.
This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.
This does not affect the memory layout or size of mbufs.
Parental oversight by: sam and rwatson.
No MFC is anticipated.
read socket buffers in shutdown() and close():
- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().
- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushin it.
To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.
Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>
resulted in the argument to the make_dev() to be a unit number.
Correct this by supplying a minor number to make_dev(), and using
the unit number for the calculation of the slave tty name.
Reported and tested by: Peter Holm
Reviewed by: jhb
Yet another pointy hat to: kib
MFC after: 1 day
while the thread does not hold the thread lock would stop blocking for
subsequent interruptible sleeps and would always immediately fail the
sleep with EWOULDBLOCK instead (even sleeps that didn't have a timeout).
Some background:
- KSE has a facility for allowing one thread to interrupt another thread.
During this process, the target thread aborts any interruptible sleeps
much as if the target thread had a pending signal. Once the target
thread acknowledges the interrupt, normal sleep handling resumes. KSE
manages this via the TDF_INTERRUPTED flag. Specifically, it sets the
flag when it sends an interrupt to another thread and clears it when
the interrupt is acknowledged. (Note that this is purely a software
interrupt sort of thing and has no relation to hardware interrupts
or kernel interrupt threads.)
- The old code for handling the sleep timeout race handled the race
by setting the TDF_INTERRUPT flag and faking a KSE-style thread
interrupt to the thread in the process of going to sleep. It probably
should have just checked the TDF_TIMEOUT flag in sleepq_catch_signals()
instead.
- The bug was that the sleepq code would set TDF_INTERRUPT but it was
never cleared. The sleepq code couldn't safely clear it in case there
actually was a real KSE thread interrupt pending for the target thread
(in fact, the sleepq timeout actually stomped on said pending interrupt).
Thus, any future interruptible sleeps (*sleep(.. PCATCH ..) or
cv_*wait_sig()) would see the TDF_INTERRUPT flag set and immediately
fail with EWOULDBLOCK. The flag could be cleared if the thread belonged
to a KSE process and another thread posted an interrupt to the original
thread. However, in the more common case of a non-KSE process, the
thread would pretty much stop sleeping.
- Fix the bug by just setting TDF_TIMEOUT in the sleepq timeout code and
not messing with TDF_INTERRUPT and td_intrval. With yesterday's fix to
fix sleepq_switch() to check TDF_TIMEOUT, this is now sufficient.
MFC after: 3 days
being properly cancelled by a timeout. In general there is a race
between a the sleepq timeout handler firing while the thread is still
in the process of going to sleep. In 6.x with sched_lock, the race was
largely protected by sched_lock. The only place it was "exposed" and had
to be handled was while checking for any pending signals in
sleepq_catch_signals().
With the thread lock changes, the thread lock is dropped in between
sleepq_add() and sleepq_*wait*() opening up a new window for this race.
Thus, if the timeout fired while the sleeping thread was in between
sleepq_add() and sleepq_*wait*(), the thread would be marked as timed
out, but the thread would not be dequeued and sleepq_switch() would
still block the thread until it was awakened via some other means. In
the case of pause(9) where there is no other wakeup, the thread would
never be awakened.
Fix this by teaching sleepq_switch() to check if the thread has had its
sleep canceled before blocking by checking the TDF_TIMEOUT flag and
aborting the sleep and dequeueing the thread if it is set.
MFC after: 3 days
Reported by: dwhite, peter
`kn_sdata' member of the newly registered knote. The problem is that
this member is overwritten by a call to kevent(2) with the EV_ADD flag,
targetted at the same kevent/knote. For instance, a userland application
may set the pointer to NULL, leading to a panic.
A testcase was provided by the submitter.
PR: kern/118911
Submitted by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 1 day
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present
Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h
KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.
Tested by: matteo
Introduce a new privilege allowing to set certain IP header options
(hop-by-hop, routing headers).
Leave a few comments to be addressed later.
Reviewed by: rwatson (older version, before addressing his comments)
a run-queue. If the priority is numerically raised only change lowpri
if we're certain it will be correct. Some slop is allowed however
previously we could erroneously raise lowpri for an idle cpu that a
thread had recently run on which lead to errors in load balancing
decisions.
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj
BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.
KPI results, obviously, broken so further commits will update manpages
and freebsd version.
Tested by: kris (on UFS and NFS)
unp_connect(): it is expected to return with the lock held, and two
possible error paths otherwise returned with it unlocked.
The fix committed here is slightly different from the patch in the
PR, but along an alternative line suggested in the PR.
PR: 119778
MFC after: 3 days
Submitted by: James Juran <james dot juran at baesystems dot com>
was missed. As result, pty_create_slave() may index out of the names[]
bounds, creating wrong slave tty names.
Tested by: kensmith
Reviewed by: jhb
MFC after: 3 days
other. The first one survives, the rest are removed. So far, it appears
only some acpi_perf(4) BIOS tables have these invalid states, but address
this in the core to be sure to handle other potential driver data.
PR: kern/114722
Tested by: stefan.lambrev / moneybookers.com
MFC after: 3 days
lowest priority on the queue for the current cpu vs curthread's
priority. In the case that curthread is waking up many threads of a
lower priority as would happen with a turnstile_broadcast() or wakeup()
of many threads this prevents them from all ending up on the current cpu.
- In sched_add() make the relationship between a scheduled ithread and
the current cpu advisory rather than strict. Only give the ithread
affinity for the current cpu if it's actually being scheduled from
a hardware interrupt. This prevents it from migrating when it simply
blocks on a lock.
Sponsored by: Nokia
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.
KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
panic but it won't actually lock anything.
This can lead some paths to reach lockmgr_disown() with inconsistent
lock which will let trigger the relative assertions.
Fix those in order to recognize panic situation and to not trigger.
Reported by: pho
Submitted by: kib
maintain a separate td_incruntime to hold unbilled CPU usage for
the thread that has the previous properties of td_runtime.
When thread information is requested using the thread monitoring
sysctls, export thread td_runtime instead of process rusage runtime
in kinfo_proc.
This restores the display of individual ithread and other kernel
thread CPU usage since inception in ps -H and top -SH, as well for
libthr user threads, valuable debugging information lost with the
move to try kthreads since they are no longer independent processes.
There is universal agreement that we should rewrite the process and
thread export sysctls, but this commit gets things going a bit
better in the mean time. Likewise, there are resevations about the
continued validity of statclock given the speed of modern processors.
Reviewed by: attilio, emaste, jhb, julian
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
Now, lockmgr() function can only be called passing curthread and the
KASSERT() is upgraded according with this.
In order to support on-the-fly owner switching, the new function
lockmgr_disown() has been introduced and gets used in BUF_KERNPROC().
KPI, so, results changed and FreeBSD version will be bumped soon.
Differently from previous code, we assume idle thread cannot try to
acquire the lockmgr as it cannot sleep, so loose the relative check[1]
in BUF_KERNPROC().
Tested by: kris
[1] kib asked for a KASSERT in the lockmgr_disown() about this
condition, but after thinking at it, as this is a well known general
rule, I found it not really necessary.
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.
Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
dev2udev() when a tty was being detached concurrently with the sysctl
handler:
- Hold the 'tty_list_mutex' lock while we read all the fields out of the
struct tty for copying out later. Previously the pty(4) and pts(4)
destroy routines could set t_dev to NULL, drop their reference on the
tty and destroy the cdev while the sysctl handler was attempting to
invoke dev2udev() on the cdev being destroyed. This happened when the
sysctl handler read the value of t_dev prior to it being set to NULL
either due to it being stale or due to timing races. By holding the
list lock we guarantee that the destroy routines will block in ttyrel()
in that case and not destroy the cdev until after we've copied all of our
data. We may see a NULL cdev pointer or we may see the previous value,
but the previous value will no longer point to a destroyed cdev if we
see it.
- Fix the ttyfree() routine used by tty device drivers in their detach
methods to use ttyrel() on the tty so we don't leak them. Also, fix it
to use the same order of operations as pty/pts destruction (set t_dev
NULL, ttyrel(), destroy_dev()) so it cooperates with the sysctl handler.
MFC after: 3 days
Tested by: avatar