Commit Graph

10521 Commits

Author SHA1 Message Date
Robert Watson
8e230e30b7 Attempt to improve convergence of POSIX semaphore code with style(9).
MFC after:	3 days
2008-05-16 18:10:07 +00:00
George V. Neville-Neil
49f287f8c5 Update the kernel to count the number of mbufs and clusters
(all types) used per socket buffer.

Add support to netstat to print out all of the socket buffer
statistics.

Update the netstat manual page to describe the new -x flag
which gives the extended output.

Reviewed by:	rwatson, julian
2008-05-15 20:18:44 +00:00
Attilio Rao
90356491d7 - Embed the recursion counter for any locking primitive directly in the
lock_object, using an unified field called lo_data.
- Replace lo_type usage with the w_name usage and at init time pass the
  lock "type" directly to witness_init() from the parent lock init
  function.  Handle delayed initialization before than
  witness_initialize() is called through the witness_pendhelp structure.
- Axe out LO_ENROLLPEND as it is not really needed.  The case where the
  mutex init delayed wants to be destroyed can't happen because
  witness_destroy() checks for witness_cold and panic in case.
- In enroll(), if we cannot allocate a new object from the freelist,
  notify that to userspace through a printf().
- Modify the depart function in order to return nothing as in the current
  CVS version it always returns true and adjust callers accordingly.
- Fix the witness_addgraph() argument name prototype.
- Remove unuseful code from itismychild().

This commit leads to a shrinked struct lock_object and so smaller locks,
in particular on amd64 where 2 uintptr_t (16 bytes per-primitive) are
gained.

Reviewed by:	jhb
2008-05-15 20:10:06 +00:00
John Baldwin
ccd3953e5f Go back to using the process command name (p_comm) for the file name and
command line arguments stored in the note at the beginning of a core dump
instead of the current thread name.

Reviewed by:	julian
2008-05-15 03:07:34 +00:00
Konstantin Belousov
48504cc25b Add the devctl notifications for the cdev create/destroy events.
Based on the submission by: Andriy Gapon <avg icyb net ua>
MFC after:	2 weeks
2008-05-14 14:29:54 +00:00
Julian Elischer
681e40627d fix typo in runz_fuzz
noticed by:Elijah Buck
2008-05-12 06:42:06 +00:00
Alan Cox
3202ed7523 Introduce a new parameter "superpage_align" to kmem_suballoc() that is
used to request superpage alignment for the submap.

Request superpage alignment for the kmem_map.

Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find().  (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)

Remove a stale comment from kmem_malloc().
2008-05-10 21:46:20 +00:00
Konstantin Belousov
e15864efd8 Kqueue_scan() may sleep when encountered the influx knotes. On the other
hand, it may cause other threads to sleep since kqueue_scan() may mark
some knotes as infux. This could lead to the deadlock.

Before kqueue_scan() sleeps, wakeup the threads that are waiting for the
influx knotes produced by this thread.

Tested by:	pho (previous version)
Reviewed by:	jmg
MFC after:	2 weeks
2008-05-10 11:37:05 +00:00
Konstantin Belousov
2e711e4d0d The kqueue_close() encountering the KN_INFLUX knotes on the kq being
closed is the legitimate situation. For instance, filedescriptor with
registered events may be closed in parallel with closing the kqueue.
Properly handle the case instead of asserting that this cannot happen.

Reported and tested by:	pho
Reviewed by:	jmg
MFC after:	2 weeks
2008-05-10 11:35:32 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Doug Rabson
06c85cef9d When blocking on an F_FLOCK style lock request which is upgrading a
shared lock to exclusive, drop the shared lock before deadlock
detection.

MFC after: 2 days
2008-05-09 10:34:23 +00:00
Pawel Jakub Dawidek
b109dd74fe - Export HZ value via kern.hz sysctl (this is the same name as for the
loader tunable).
- Document other sysctls in this file and also mark them as loader tunable
  via CTLFLAG_RDTUN flag.

Reviewed by:	roberto
2008-05-09 07:42:02 +00:00
Attilio Rao
688b98135c Add a new witness sysctl which returns the relations between any lock
and its children in the form:
"parent","child"
so that head and bottom of an oriented graph can be easilly detected and
various form of diagrams can be build.
The sysctl is called debug.witness.graphs and it is read-only; in order
to get the list of relations, a simple:
#sysctl debug.witness.graphs
will do the trick.

This approach has been choosen in order to support easilly things like
the DOT format and such.  Soon, an auto-explicative awk script, which
filters simple informations returned by the sysctl and converts them into
a real DOT script, will be committed to the repository between examples.

Discussed with:	rwatson
2008-05-07 21:41:36 +00:00
Kip Macy
c8c7ad9260 add malloc flag to blist so that it can be used in ithread context
Reviewed by: alc, bsdimp
2008-05-05 19:48:54 +00:00
John Baldwin
be00f6053b Fix a few edge cases with error handling in cpufreq(4)'s CPUFREQ_GET()
method:
- If the last of the child cpufreq drivers returns an error while trying to
  fetch its list of supported frequencies but an earlier driver found the
  requested frequency, don't return an error to the caller.
- If all of the child cpufreq drivers fail and the attempt to match the
  frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than
  returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN.

MFC after:	3 days
PR:		kern/121433
Reported by:	Eugene Grosbein  eugen ! kuzbass dot ru
2008-05-05 19:13:52 +00:00
Peter Wemm
43d7128c14 Expand kdb_alt_break a little, most commonly used with the option
ALT_BREAK_TO_DEBUGGER.  In addition to "Enter ~ ctrl-B" (to enter the
debugger), there is now "Enter ~ ctrl-P" (force panic) and
"Enter ~ ctrl-R" (request clean reboot, ala ctrl-alt-del on syscons).

We've used variations of this at work.  The force panic sequence is
best used with KDB_UNATTENDED for when you just want it to dump and
get on with it.

The reboot request is a safer way of getting into single user than
a power cycle.  eg: you've hosed the ability to log in (pam, rtld, etc).
It gives init the reboot signal, which causes an orderly reboot.

I've taken my best guess at what the !x86 and non-sio code changes
should be.

This also makes sio release its spinlock before calling KDB/DDB.
2008-05-04 23:29:38 +00:00
Attilio Rao
60e2edce55 sync_vnode() has some messy code about locking in order to deal with
mount fs needing Giant to be held when processing bufobjs.
Use a different subqueue for pending workitems on filesystems requiring
Giant. This simplifies the code notably and also reduces the number of
Giant acquisitions (and the whole processing cost).

Suggested by:	jeff
Reviewed by:	kib
Tested by:	pho
2008-05-04 13:54:55 +00:00
Julian Elischer
2182c0cfbf Attempt to make the print types more friendly to other architectures.
Prodded by: Max Laier
Help from: BMS, jhb
2008-04-30 20:00:30 +00:00
Julian Elischer
c59b9a7659 Document the kproc_kthread_add() call
and fix a small detail of its implementation.
MFC after: 1 week
2008-04-29 22:43:15 +00:00
Roman Divacky
d6891277a4 Lock filedesc exclusively when modifying fd_[cr]dir.
This is probably harmless but it's better to lock it
correctly.

Approved by:	kib (mentor)
2008-04-29 21:40:11 +00:00
Julian Elischer
6eeac1d921 Add an option (compiled out by default)
to profile outoing packets for a number of mbuf chain
related parameters
e.g. number of mbufs, wasted space.
probably will do with further work later.

Reviewed by: various
2008-04-29 21:23:21 +00:00
David Xu
3bba58f287 Fix compiling problem. 2008-04-29 05:48:05 +00:00
David Xu
727158f6f6 Introduce command UMTX_OP_WAIT_UINT_PRIVATE and UMTX_OP_WAKE_PRIVATE
to allow userland to specify that an address is not shared by multiple
processes.
2008-04-29 03:48:48 +00:00
Robert Watson
ae11a989e6 When writing trailers in sendfile(2), don't call kern_writev()
while holding the socket buffer lock.  These leads to an
immediate panic due to recursing the socket buffer lock.  This
bug was introduced in uipc_syscalls.c:1.240, but masked by
another bug until that was fixed in uipc_syscalls.c:1.269.

Note that the current fix isn't perfect, but better than
panicking: normally we guarantee that simultaneous invocations
of a system call to write on a stream socket won't be
interlaced, which is ensured by use of the socket buffer sleep
lock.  This is guaranteed for the sendfile headers, but not
trailers.  In practice, this is likely not a problem, but
should be fixed.

MFC after:	3 days
Pointy hat to:	andre (1.240), cperciva (1.269)
2008-04-27 15:50:00 +00:00
Kris Kennaway
5894445dad * Correct a mis-merge that leaked the PROC_LOCK [1]
* Return ENOENT on error instead of 0 [2]

Submitted by: rdivacky [1], kib [2]
2008-04-26 13:16:55 +00:00
Pawel Jakub Dawidek
3800322fe2 Implement 'show mount' command in DDB. Without argument, it prints short
info about all currently mounted file systems. When an address is given
as an argument, prints detailed info about the given mount point.

MFC after:	2 weeks
2008-04-26 13:04:48 +00:00
Jeff Roberson
6c47aaae12 - Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
 - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
   suspended in cpu specific states.  This function can fail and cause the
   scheduler to fall back to another mechanism (ipi).
 - Implement support for mwait in cpu_idle() on i386/amd64 machines that
   support it.  mwait is a higher performance way to synchronize cpus
   as compared to hlt & ipis.
 - Allow selecting the idle routine by name via sysctl machdep.idle.  This
   replaces machdep.cpu_idle_hlt.  Only idle routines supported by the
   current machine are permitted.

Sponsored by:	Nokia
2008-04-25 05:18:50 +00:00
Kris Kennaway
b1ba81d948 fdhold can return NULL, so add the one remaining missing check for this
condition.

Reviewed by:    attilio
MFC after:      1 week
2008-04-24 22:08:36 +00:00
Konstantin Belousov
12e79a9bbc Allow the vnode zone to return the unused memory. The vnode reference
count is/shall be properly maintained for the long time, and VFS
shall be safe against the vnode memory reclamation.

Proposed by:	jeff
Tested by:	pho
2008-04-24 09:58:33 +00:00
Poul-Henning Kamp
9b4a8ab7ba Now that all platforms use genclock, shuffle things around slightly
for better structure.

Much of this is related to <sys/clock.h>, which should really have
been called <sys/calendar.h>, but unless and until we need the name,
the repocopy can wait.

In general the kernel does not know about minutes, hours, days,
timezones, daylight savings time, leap-years and such.  All that
is theoretically a matter for userland only.

Parts of kernel code does however care: badly designed filesystems
store timestamps in local time and RTC chips almost universally
track time in a YY-MM-DD HH:MM:SS format, and sometimes in local
timezone instead of UTC.  For this we have <sys/clock.h>

<sys/time.h> on the other hand, deals with time_t, timeval, timespec
and so on.  These know only seconds and fractions thereof.

Move inittodr() and resettodr() prototypes to <sys/time.h>.
Retain the names as it is one of the few surviving PDP/VAX references.

Move startrtclock() to <machine/clock.h> on relevant platforms, it
is a MD call between machdep.c/clock.c.  Remove references to it
elsewhere.

Remove a lot of unnecessary <sys/clock.h> includes.

Move the machdep.disable_rtc_set sysctl to subr_rtc.c where it belongs.
XXX: should be kern.disable_rtc_set really, it's not MD.
2008-04-22 19:38:30 +00:00
Pawel Jakub Dawidek
d90d4eb28c Back-out previous revision. For now I can use _ddb() variants of stack(9) KPI,
as I use it for debugging only. Once someone will need it for more production
features, the change should be reconsider.

Requested by:	rwatson
2008-04-21 17:22:35 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Pawel Jakub Dawidek
f55f27f862 Allow linker_search_symbol_name() to be called with KLD lock held.
The linker_search_symbol_name() function is used by stack_print()
and stack_print() can be called from kernel module unload method.

MFC after:	1 week
2008-04-17 19:19:40 +00:00
Jeff Roberson
1690c6c1be - Add a metric to describe how busy a processor has been over the last
two ticks by counting the number of switches and the load when
   sched_clock() is called.
 - If the busy metric exceeds a threshold allow the idle thread to spin
   waiting for new work for a brief period to avoid using IPIs.  This
   reduces the cost on the sender and receiver as well as reducing wakeup
   latency considerably when it works.

Sponsored by:	Nokia
2008-04-17 09:56:01 +00:00
Jeff Roberson
8df78c41d6 - Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
 - In reset walk the children of kern_sched_stats and reset the counters
   via the oid_arg1 pointer.  This allows us to add arbitrary counters to
   the tree and still reset them properly.
 - Define a set of switch types to be passed with flags to mi_switch().
   These types are named SWT_*.  These types correspond to SCHED_STATS
   counters and are automatically handled in this way.
 - Make the new SWT_ types more specific than the older switch stats.
   There are now stats for idle switches, remote idle wakeups, remote
   preemption ithreads idling, etc.
 - Add switch statistics for ULE's pickcpu algorithm.  These stats include
   how much migration there is, how often affinity was successful, how
   often threads were migrated to the local cpu on wakeup, etc.

Sponsored by:	Nokia
2008-04-17 04:20:10 +00:00
Doug Rabson
a365ea5fba Fix compilation with LOCKF_DEBUG. 2008-04-16 14:08:12 +00:00
Konstantin Belousov
eab626f110 Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by:	kris
Tested by:	kris, pho
Discussed with:	jeff, dfr
MFC after:	2 weeks
2008-04-16 11:33:32 +00:00
David Xu
d61f3de656 Implement POSIX function tcgetsid() which returns session id.
PR: stand/107561
2008-04-15 08:33:32 +00:00
Marcel Moolenaar
495168ba8d Support and switch to the ULE scheduler:
o  Implement IPI_PREEMPT,
o  Set td_lock for the thread being switched out,
o  For ULE & SMP, loop while td_lock points to blocked_lock for
   the thread being switched in,
o  Enable ULE by default in GENERIC and SKI,
2008-04-15 05:02:42 +00:00
Randall Stewart
cf71e4381a Add pru_flush routine so a transport can
flush itself during Shutdown

MFC after:	1 week
2008-04-14 18:06:04 +00:00
Alan Cox
e384d8a89b Initialize the vm object's flags to include OBJ_NOSPLIT, just like the
vm objects that are used by System V shared memory segments.
2008-04-13 21:08:34 +00:00
Attilio Rao
22dd228d5d Use a "rel" memory barrier for disowning the lock as it cames from an
exclusive locking operation.
2008-04-13 01:21:56 +00:00
Attilio Rao
0b0100db88 struct lock_instance and struct lock_list_entry don't need to be in the
public namespace for WITNESS as they are only used internally so just
move them in the private namespace for the subsystem (with all related
supporting definitions).
2008-04-13 01:20:47 +00:00
Poul-Henning Kamp
8d24f82310 fix printf type confusion on amd64 2008-04-12 21:51:54 +00:00
Poul-Henning Kamp
c9ad6040dd Emit summaries of struct c(alender)t(ime) <-> struct timespec conversions
under bootverbose.

Struct ct is used for setting/reading real time clocks and I'm about
to Do Things to some of those, so a bit of preemptive debugging is
in order.

Remove a pointless __inline.
2008-04-12 20:35:56 +00:00
Attilio Rao
e5f94314ad - Re-introduce WITNESS support for lockmgr. About the old implementation
the only one difference is that lockmgr*() functions now accept
  LK_NOWITNESS flag which skips ordering for the instanced calling.
- Remove an unuseful stub in witness_checkorder() (because the above check
  doesn't allow ever happening) and allow witness_upgrade() to accept
  non-try operation too.
2008-04-12 19:57:30 +00:00
Attilio Rao
872b7289fd - Remove a stale comment.
- Add an extra assertion in order to catch malformed requested operations.
2008-04-12 13:56:17 +00:00
Attilio Rao
1859cffaef Add missing stubs for spinlocks cpuset and intrcnt.
Submitted by:	kris
2008-04-12 13:51:18 +00:00
Xin LI
31c50f53da Instead of rolling our own jail number allocation procedure, use
alloc_unr() to do it.

Submitted by:	Ed Schouten <ed 80386 nl>
PR:		kern/122270
MFC after:	1 month
2008-04-11 21:31:15 +00:00
John Baldwin
03c7442d75 Use kthread_exit() to terminate a taskqueue thread rather than kproc_exit()
now that the taskqueue threads are kthreads rather than kprocs.

Reported by:	kris
2008-04-11 17:35:54 +00:00
Jeff Roberson
9b33b154b5 - Add the interrupt vector number to intr_event_create so MI code can
lookup hard interrupt events by number.  Ignore the irq# for soft intrs.
 - Add support to cpuset for binding hardware interrupts.  This has the
   side effect of binding any ithread associated with the hard interrupt.
   As per restrictions imposed by MD code we can only bind interrupts to
   a single cpu presently.  Interrupts can be 'unbound' by binding them
   to all cpus.

Reviewed by:	jhb
Sponsored by:	Nokia
2008-04-11 03:26:41 +00:00
Pawel Jakub Dawidek
b03d720760 - Use LK_TYPE_MASK where needed. Actually after sys/sys/lockmgr.h:1.69 it is
no longer needed, but for now we still want to be consistent with other
  similar checks in the tree.
- Call ASSERT_VOP_ELOCKED() only when vget() returns 0.

Reviewed by:	jeff
2008-04-09 20:19:55 +00:00
Sam Leffler
6c6eaea6dd Do image loading in a context known to have a root directory:
o create a private task queue thread that sets up root and current
  directories (hooking mountroot event as needed); this is necessary
  because task queue threads are parented from proc0 and it does not
  have a reference to rootvnode (lost when / mounting moved to init)
o bounce image load + unload requests through the private task q so
  we can load images even when the request is made from a thread that
  does not have sufficient context (e.g. task q thread)
o add a check in the task q thread to fail requests before root is
  mounted (just in case)

Reviewed by:	jhb, mlaier, luigi (glance)
MFC after:	1 month
2008-04-09 19:07:48 +00:00
Sam Leffler
00c71fb7c3 o add a mountroot event handler that fires when / is mounted; this information
was lost when root started being mounted by init
o remove SI_SUB_MOUNT_ROOT since it's no longer meaningful

MFC after:	2 weeks
2008-04-08 17:53:33 +00:00
Sam Leffler
175611b668 change taskqueue_start_threads to create threads instead of proc's
Reviewed by:	jhb
2008-04-08 17:48:02 +00:00
Konstantin Belousov
48b05c3f82 Implement the linux syscalls
openat, mkdirat, mknodat, fchownat, futimesat, fstatat, unlinkat,
    renameat, linkat, symlinkat, readlinkat, fchmodat, faccessat.

Submitted by:	rdivacky
Sponsored by:	Google Summer of Code 2007
Tested by:	pho
2008-04-08 09:45:49 +00:00
Attilio Rao
e0f62984c1 - Use a different encoding for lockmgr options: make them encoded by
bit in order to allow per-bit checks on the options flag, in particular
  in the consumers code [1]
- Re-enable the check against TDP_DEADLKTREAT as the anti-waiters
  starvation patch allows exclusive waiters to override new shared
  requests.

[1] Requested by:	pjd, jeff
2008-04-07 14:46:38 +00:00
Don Lewis
8a3724388b vfs_syscalls.c 1.452 mistakenly swapped the behavior of chown() and lchown(). 2008-04-07 00:29:32 +00:00
Attilio Rao
047dd67e96 Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
alredy do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw().  These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock.  In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will be probabilly
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wakeup, are in the runqueue they can contend the lock with
the exclusive drainer.  This is hard to be fixed but the now committed
code mitigate this issue a lot better than the (past) CVS version.
In addition assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will be nomore supported
soon.

In order to avoid namespace pollution, stack.h is splitted into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI.  In this way, newly added _lockmgr.h can
just include _stack.h.

Kernel ABI results heavilly changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by:      kris, pho, jeff, danger
Reviewed by:    jeff
Sponsored by:   Google, Summer of Code program 2007
2008-04-06 20:08:51 +00:00
Jeff Roberson
ce62b59c88 - Correct a major error introduced in the per-cpu timeout commit. Sleep
and wakeup require the same wait channel to function properly.

Found by:	kris
Pointy hat:	me
2008-04-06 11:08:49 +00:00
John Baldwin
8aa9e82e67 Move INTR_FILTER from opt_global.h to its own header. 2008-04-05 20:13:15 +00:00
John Baldwin
1ee1b68792 Add a MI intr_event_handle() routine for the non-INTR_FILTER case. This
allows all the INTR_FILTER #ifdef's to be removed from the MD interrupt
code.
- Rename the intr_event 'eoi', 'disable', and 'enable' hooks to
  'post_filter', 'pre_ithread', and 'post_ithread' to be less x86-centric.
  Also, add a comment describe what the MI code expects them to do.
- On amd64, i386, and powerpc this is effectively a NOP.
- On arm, don't bother masking the interrupt unless the ithread is
  scheduled in the non-INTR_FILTER case to match what INTR_FILTER did.
  Also, don't bother unmasking the interrupt in the post_filter case if
  we never masked it.  The INTR_FILTER case had been doing this by having
  arm_unmask_irq for the post_filter (formerly 'eoi') hook.
- On ia64, stray interrupts are now masked for the non-INTR_FILTER case.
  They were already masked in the INTR_FILTER case.
- On sparc64, use the a NULL pre_ithread hook and use intr_enable_eoi() for
  both the 'post_filter' and 'post_ithread' hooks to match what the
  non-INTR_FILTER code did.
- On sun4v, retire the ithread wrapper hack by using an appropriate
  'post_ithread' hook instead (it's what 'post_ithread'/'enable' was
  designed to do even in 5.x).

Glanced at by:	piso
Reviewed by:	marius
Requested by:	marius [1], [5]
Tested on:	amd64, i386, arm, sparc64
2008-04-05 19:58:30 +00:00
Alan Cox
7630c26507 Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.)  UMA_SLAB_KERNEL is now required by the jumbo frame
allocators.  Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object.  In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT.  This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL.  This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks
2008-04-04 18:41:12 +00:00
Jeff Roberson
00ca09449d - Add sysctls at debug.rwlock to control the behavior of the speculative
spinning when readers hold a lock.  This spinning is speculative because,
   unlike the write case, we can not test whether the owners are running.
 - Add speculative read spinning for readers who are blocked by pending
   writers while a read lock is still held.  This allows the thread to
   spin until the write lock succeeds after which it may spin until the
   writer has released the lock.  This prevents excessive context switches
   when readers and writers both hold the lock for brief periods.

Sponsored by:	Nokia
2008-04-04 10:00:46 +00:00
Jeff Roberson
3bc8c68d9f - Add a Nokia copyright to cpuset to reflect their generous
contribution to this work.
2008-04-04 01:22:04 +00:00
Jeff Roberson
0502fe2e43 - Allow static_boost to specify no boost with '0', traditional kernel
fixed pri boost with '1' or any priority less than the current thread's
   priority with a value greater than two.  Default the boost to
   PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
   threads in the kernel.  This prevents these user-threads from also
   being scheduled as if they are high fixed-priority kernel threads.
 - Restore the setting of lowpri in tdq_choose().  It has to be either here
   or in sched_switch().  I accidentally removed it from both places.

Tested by:	kris
2008-04-04 01:16:18 +00:00
Jeff Roberson
03d17db7d5 - Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't
do this either.  Simply check P_NOLOAD.  It'd be nice if this was
   in a thread flag so we didn't have an extra cache miss every time we
   add and remove a thread from the run-queue.
2008-04-04 01:04:43 +00:00
Jeff Roberson
e4b1aa6210 - Fix a mis-merge that crept in during the softclock changes.
Spotted by:	jhb
2008-04-04 01:03:23 +00:00
David Xu
44253336b6 let umtxq_busy() only spin on mp machine. make function name
do_rwlock_unlock to be consistent with others.
2008-04-03 11:49:20 +00:00
Jeff Roberson
e8245292a7 - Convert two timeout users to the new callout_reset_curcpu() api.
Sponsored by:	Nokia
2008-04-02 11:21:42 +00:00
Jeff Roberson
8d809d5061 Implement per-cpu callout threads, wheels, and locks.
- Move callout thread creation from kern_intr.c to kern_timeout.c
 - Call callout_tick() on every processor via hardclock_cpu() rather than
   inspecting callout internal details in kern_clock.c.
 - Remove callout implementation details from callout.h
 - Package up all of the global variables into a per-cpu callout structure.
 - Start one thread per-cpu.  Threads are not strictly bound.  They prefer
   to execute on the native cpu but may migrate temporarily if interrupts
   are starving callout processing.
 - Run all callouts by default in the thread for cpu0 to maintain current
   ordering and concurrency guarantees.  Many consumers may not properly
   handle concurrent execution.
 - The new callout_reset_on() api allows specifying a particular cpu to
   execute the callout on.  This may migrate a callout to a new cpu.
   callout_reset() schedules on the last assigned cpu while
   callout_reset_curcpu() schedules on the current cpu.

Reviewed by:	phk
Sponsored by:	Nokia
2008-04-02 11:20:30 +00:00
Konstantin Belousov
35b450291a Add two missed chunks from the rev. 1.210, for the giant_read() and
giant_ioctl().

PR:	kern/122287
MFC after:	3 days
2008-04-02 11:11:58 +00:00
Jeff Roberson
1fd9b6a577 - Destroy the bo mtx when the vnode is destroyed. 2008-04-02 10:40:03 +00:00
David Xu
fadd84c58f Fix compiling problem for amd64. 2008-04-02 05:54:41 +00:00
David Xu
11b1023b7d Er, don't restart a timeout version. 2008-04-02 04:26:59 +00:00
David Xu
1a30511c61 Introduce kernel based userland rwlock. Each umtx chain now has two lists,
one for readers and one for writers, other types of synchronization
object just use first list.

Asked by: jeff
2008-04-02 04:08:37 +00:00
Attilio Rao
b31a149bbb Add rw_try_rlock() and rw_try_wlock() to rwlocks.
These functions try the specified operation (rlocking and wlocking) and
true is returned if the operation completes, false otherwise.

The KPI is enriched by this commit, so __FreeBSD_version bumping and
manpage updating will happen soon.

Requested by:	jeff, kris
2008-04-01 20:31:55 +00:00
Doug Rabson
60cdfde09f Don't try to use an SX lock while holding the vnode interlock.
Sponsored by:	Isilon Systems
2008-04-01 16:07:01 +00:00
Konstantin Belousov
f2296b585e Regen 2008-03-31 12:12:27 +00:00
Konstantin Belousov
7104518b07 Add the openat(), fexecve() and other *at() syscalls to the table.
Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:06:55 +00:00
Konstantin Belousov
632dbc19e2 Implement the fexecve(2) syscall.
Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:05:52 +00:00
Konstantin Belousov
e4193f25cb Implement the
openat(2), faccessat(2), fchmodat(2), fchownat(2), fstatat(2),
	futimesat(2), linkat(2), mkdirat(2), mkfifoat(2), mknodat(2),
	readlinkat(2), renameat(2), symlinkat(2)
syscalls.

Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:04:20 +00:00
Konstantin Belousov
57b4252e45 Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:01:21 +00:00
Konstantin Belousov
e314f69fff Add the support for the O_EXEC open(2) mode, as specified by the
POSIX Extended API Set Part 2 extension specification.

Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 11:57:18 +00:00
Konstantin Belousov
0a3af16a75 Add the utility function vn_commname() to retrieve the command name
from the vfs namecache, when available.

Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 11:53:03 +00:00
Jeff Roberson
a03ee0000e - Consistently return EDEADLK when presented with a new set that is
incompatible with existing bindings.
 - Try to copyout the setid in cpuset() before migrating the proc to the
   setid in case the user has supplied a bad buffer.
 - Rename cpuset_root() and cpuset_base() to cpuset_ref{root,base} to
   be more descriptive and free cpuset_root to be used as a different
   type of symbol.
 - Make cpuset_root the cpuset_t set of all cpus in the system.  This
   should contain the same bitmask as all_cpus presently.
 - Add a CPU_CMP() macro to compare two sets.
2008-03-30 11:31:14 +00:00
Jeff Roberson
5634d48667 - Don't allow calls to vn_lock() with no lock type requested. Callers
which simply want a reference should use vref().  Callers which want
   to check validity need to hold a lock while performing any action
   based on that validity.  vn_lock() would always release the interlock
   before returning making any action synchronous with the validity check
   impossible.
2008-03-29 23:36:26 +00:00
Jeff Roberson
069c6953a0 - Use vget() to lock the vnode rather than refing without a lock and
locking in separate steps.
2008-03-29 23:30:40 +00:00
Attilio Rao
71072af500 b_waiters cannot be adequately protected by the interlock because it is
dropped after the call to lockmgr() so just revert this approach using
something similar to the precedent one:
BUF_LOCKWAITERS() just checks if there are waiters (not the actual number
of them) and it is based on newly introduced lockmgr_waiters() which
returns if the lockmgr has waiters or not. The name has been choosen
differently by old lockwaiters() in order to not confuse them.

KPI results enriched by this commit so __FreeBSD_version bumping and
manpage update will be happening soon.
'struct buf' also changes, so kernel ABI is disturbed.

Bug found by:	jeff
Approved by:	jeff, kib
2008-03-28 12:30:12 +00:00
John Birrell
fc70c0bdd8 Regen after makesyscalls.sh change. 2008-03-27 01:55:06 +00:00
John Birrell
e994ea8e55 Generate another function for the DTrace syscall provider to specify
the syscall argument types.

This code is only compiled into the systrace kernel modul and has no
effect otherwise.
2008-03-27 01:53:44 +00:00
Poul-Henning Kamp
e465985885 The "free-lance" timer in the i8254 is only used for the speaker
these days, so de-generalize the acquire_timer/release_timer api
to just deal with speakers.

The new (optional) MD functions are:
	timer_spkr_acquire()
	timer_spkr_release()
and
	timer_spkr_setfreq()

the last of which configures the timer to generate a tone of a given
frequency, in Hz instead of 1/1193182th of seconds.

Drop entirely timer2 on pc98, it is not used anywhere at all.

Move sysbeep() to kern/tty_cons.c and use the timer_spkr*() if
they exist, and do nothing otherwise.

Remove prototypes and empty acquire-/release-timer() and sysbeep()
functions from the non-beeping archs.

This eliminate the need for the speaker driver to know about
i8254frequency at all.  In theory this makes the speaker driver MI,
contingent on the timer_spkr_*() functions existing but the driver
does not know this yet and still attaches to the ISA bus.

Syscons is more tricky, in one function, sc_tone(), it knows the hz
and things are just fine.

In the other function, sc_bell() it seems to get the period from
the KDMKTONE ioctl in terms if 1/1193182th second, so we hardcode
the 1193182 and leave it at that.  It's probably not important.

Change a few other sysbeep() uses which obviously knew that the
argument was in terms of i8254 frequency, and leave alone those
that look like people thought sysbeep() took frequency in hertz.

This eliminates the knowledge of i8254_freq from all but the actual
clock.c code and the prof_machdep.c on amd64 and i386, where I think
it would be smart to ask for help from the timecounters anyway [TBD].
2008-03-26 20:09:21 +00:00
Doug Rabson
a7ac0db6cb Regen. 2008-03-26 15:24:02 +00:00
Doug Rabson
dfdcada31e Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00
Scott Long
478cfc7300 Implement taskqueue_block() and taskqueue_unblock(). These functions allow
the owner of a queue to block and unblock execution of the tasks in the
queue while allowing tasks to continue to be added queue.  Combining this
with taskqueue_drain() allows a queue to be safely disabled.  The unblock
function may run (or schedule to run) the queue when it is called, just as
calling taskqueue_enqueue() would.

Reviewed by: jhb, sam
2008-03-25 22:38:45 +00:00
Ruslan Ermilov
ea26d58729 Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by:	arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
2008-03-25 09:39:02 +00:00
Ruslan Ermilov
b2798e2573 Regen after changing prototypes of cpuset_{get,set}affinity(). 2008-03-25 09:14:17 +00:00
Ruslan Ermilov
7f64829a5e Fixed type of the fourth argument of cpuset_{get,set}affinity(2) to be size_t.
Prodded by:	davidxu
2008-03-25 09:11:53 +00:00
Jeff Roberson
0ee6cecc9d - Greatly simplify vget() by removing the guarantee that any new
references to a vnode with VI_OWEINACT set will force the vinactive()
   call.  The kernel makes no guarantees about which reference was the
   last to close a file or when the actual inactive processing will
   happen.  The previous code was designed to preserve existing semantics
   in the face of shared locks, however, this was unnecessary.

Discussed with:	mckusick
2008-03-24 04:22:58 +00:00
Jeff Roberson
804e60d4cf - Don't acquire the vnode interlock in _vn_lock() unless no lock type
is requested.  Handle this case specially before the while loop.
 - Use the held vnode lock to check for VI_DOOMED.  The vnode lock and
   interlock must both be held to set VI_DOOMED so either one held, even
   shared, is sufficient to check it.

No objection by:	kib
2008-03-24 04:17:35 +00:00