o Do not use ipfw_insn_pipe->pipe_ptr in locate_flowset(). The
_ipfw_insn_pipe isn't touched by this commit to preserve ABI
compatibility.
o To optimize the lookup of the pipe/flowset in locate_flowset()
introduce hashes for pipes and queues:
- To preserve ABI compatibility utilize the place of global list
pointer for SLIST_ENTRY.
- Introduce locate_flowset(queue nr) and locate_pipe(pipe nr).
o Rework all the dummynet code to deal with the hashes, not global
lists. Also did some style(9) changes in the code blocks that were
touched by this sweep:
- Be conservative about flowset and pipe variable names on stack,
use "fs" and "pipe" everywhere.
- Cleanup whitespaces.
- Sort variables.
- Give variables more meaningful names.
- Uppercase and dots in comments.
- ENOMEM when malloc(9) failed.
- S3 Savage driver ported.
- Added support for ATI_fragment_shader registers for r200.
- Improved r300 support, needed for latest r300 DRI driver.
- (possibly) r300 PCIE support, needs X.Org server from CVS.
- Added support for PCI Matrox cards.
- Software fallbacks fixed for Rage 128, which used to render badly or hang.
- Some issues reported by WITNESS are fixed.
- i915 module Makefile added, as the driver may now be working, but is untested.
- Added scripts for copying and preprocessing DRM CVS for inclusion in the
kernel. Thanks to Daniel Stone for getting me started on that.
rare case of a stray interrupt to an unregistered source (such as a stray
interrupt from the 8259As when using APIC), this could result in a page
fault when it tried to walk the list of interrupt handlers to execute
INTR_FAST handlers. This bug was introduced with the intr_event changes,
so it's not present in 5.x or 6.x.
Submitted by: Mark Tinguely tinguely at casselton dot net
process as over the limit when its time is >= to the limit rather than >
the limit. Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit
and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit
yet. However, having the fraction exactly equal to 0 is rather rare, and
it is not worth the overhead to handle that edge case. With just the >
comparison, the process would have to exceed its limit by almost a second
before it was killed.
PR: kern/83192
Submitted by: Maciej Zawadzinski mzawadzinski at gmail dot com
Reviewed by: bde
MFC after: 1 week
chains and copying in mbufs from the body of the send logic, creating
a new function sosend_copyin(). This changes makes sosend() almost
readable, and will allow the same logic to be used by tailored socket
send routines.
MFC after: 1 month
Reviewed by: andre, glebius
this file. With ru@'s approval, change it to this version. In this case we
had to bump the version because the old parser would choke on | in the new
'or' syntax and consider that a device.
Approved by: ru@
rather than forcing the state to LOOK. If we are in the middle of parsing
a line when we have to do a FILL we would have lost any token we were in
the middle of parsing and would have treated the next character as being
at the start of a new line instead.
PR: kern/89181
Submitted by: Antony Mawer gnats at mawer dot org
MFC after: 1 week
Instead of echoing the code in a comment, try to describe why we split
up the evaluation in a special way.
The new optimization is mostly to move the evaluation of w = z*z later
so that everything else (except z = x*x) doesn't have to wait for w.
On Athlons, FP multiplication has a latency of 4 cycles so this
optimization saves 4 cycles per call provided no new dependencies are
introduced. Tweaking the other terms in to reduce dependencies saves
a couple more cycles in some cases (more on AXP than on A64; up to 8
cycles out of 56 altogether in some cases). The previous version had
a similar optimization for s = z*x. Special optimizations like these
probably have a larger effect than the simple 2-way vectorization
permitted (but not activated by gcc) in the old version, since 2-way
vectorization is not enough and the polynomial's degree is so small
in the float case that non-vectorizable dependencies dominate.
On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now
takes 34-55 cycles (was 39-59 cycles).
of between 1.0 and 1.8509 ulps for lgammaf(x) with x between -2**-21 and
-2**-70.
As usual, the cutoff for tiny args was not correctly translated to
float precision. It was 2**-70 but 2**-21 works. Not as usual, having
a too-small threshold was worse than a pessimization. It was just a
pessimization for (positive) args between 2**-70 and 2**-21, but for
the first ~50 million (negative) args below -2**-70, the general code
overflowed and gave a result of infinity instead of correct (finite)
results near 70*log(2). For the remaining ~361 million negative args
above -2**21, the general code gave almost-acceptable errors (lgamma[f]()
is not very accurate in general) but the pessimization was larger than
for misclassified tiny positive args.
Now the max error for lgammaf(x) with |x| < 2**-21 is 0.7885 ulps, and
speed and accuracy are almost the same for positive and negative args
in this range. The maximum error overall is still infinity ulps.
A cutoff of 2**-70 is probably wastefully small for the double precision
case. Smaller cutoffs can be used to reduce the max error to nearly
0.5 ulps for tiny args, but this is useless since the general algrorithm
for nearly-tiny args is not nearly that accurate -- it has a max error of
about 1 ulp.
gives a tiny but hopefully always free optimization in the 2 quadrants
to which it applies. On Athlons, it reduces maximum latency by 4 cycles
in these quadrants but has usually has a smaller effect on total time
(typically ~2 cycles (~5%), but sometimes 8 cycles when the compiler
generates poor code).
of the function name.
Added my (non-)copyright.
In k_tanf.c, added the first set of redundant parentheses to control
grouping which was claimed to be added in the previous commit.
returning float). The functions are renamed from __kernel_{cos,sin}f()
to __kernel_{cos,sin}df() so that misuses of them will cause link errors
and not crashes.
This version is an almost-routine translation with no special optimizations
for accuracy or efficiency. The not-quite-routine part is that in
__kernel_cosf(), regenerating the minimax polynomial with double
precision coefficients gives a coefficient for the x**2 term that is
not quite -0.5, so the literal 0.5 in the code and the related `hz'
variable need to be modified; also, the special code for reducing the
error in 1.0-x**2*0.5 is no longer needed, so it is convenient to
adjust all the logic for the x**2 term a little. Note that without
extra precision, it would be very bad to use a coefficient of other
than -0.5 for the x**2 term -- the old version depends on multiplication
by -0.5 being infinitely precise so as not to need even more special
code for reducing the error in 1-x**2*0.5.
This gives an unimportant increase in accuracy, from ~0.8 to ~0.501
ulps. Almost all of the error is from the final rounding step, since
the choice of the minimax polynomials so that their contribution to the
error is a bit less than 0.5 ulps just happens to give contributions that
are significantly less (~.001 ulps).
An Athlons, for uniformly distributed args in [-2pi, 2pi], this gives
overall speed increases in the 10-20% range, despite giving a speed
decrease of typically 19% (from 31 cycles up to 37) for sinf() on args
in [-pi/4, pi/4].
appeared to rely on all kinds of non-guaranteed behaviours: the
transfer abort code assumed that TDs with no interrupt timeout
configured would end up on the done queue within 20ms, the done
queue processing assumed that all TDs from a transfer would appear
at the same time, and there were access-after-free bugs triggered
on failed transfers.
Attempt to fix these problems by the following changes:
- Use a maximum (6-frame) interrupt delay instead of no interrupt
delay to ensure that the 20ms wait in ohci_abort_xfer() is enough
for the TDs to have been taken off the hardware done queue.
- Defer cancellation of timeouts and freeing of TDs until we either
hit an error or reach the final TD.
- Remove TDs from the done queue before freeing them so that it
is safe to continue traversing the done queue.
This appears to fix a hang that was reproducable with revision 1.67
or 1.68 of ulpt.c (earlier revisions had a different transfer
pattern). With certain HP printers, the command "true > /dev/ulpt0"
would cause ohci_add_done() to spin because the done queue had a
loop. The list corruption was caused by a 3-TD transfer where the
first TD completed but remained on the internal host controller
done queue because it had no interrupt timeout. When the transfer
timed out, the TD got freed and reused, so it caused a loop in the
done queue when it was inserted a second time from a different
transfer.
Reported by: Alex Pivovarov
MFC after: 1 week
refers to and add extra '#' comment characters at the beginning of two
lines that started with TABs, to avoid warnings like:
"/etc/make.conf", line 128: Unassociated shell command "# If set, you might need to adopt your"
"/etc/make.conf", line 129: Unassociated shell command "# nsswitch.conf(5) and remove `nis' entries."
PR: misc/89423
Submitted by: Scot W. Hetzel