when we create contexts. The meaning of the flags are documented in
<machine/ucontext.h>. I only list them here to help browsing the
commit logs:
_MC_FLAGS_ASYNC_CONTEXT
_MC_FLAGS_HIGHFP_VALID
_MC_FLAGS_KSE_SET_MBOX
_MC_FLAGS_RETURN_VALID
_MC_FLAGS_SCRATCH_VALID
Yes, _MC_FLAGS_KSE_SET_MBOX is a hack and I'm proud of it :-)
o For trap-based upcalls the argument (the kse_mailbox) to
the UTS must be written onto the kernel stack, not the
user stack. While here, deal with the fact that we may
be at a NaT collection point.
- Fix a bug in rl_dma_map_desc(): set the 'end of ring' bit in the
right descriptor (DESC_CNT - 1, not DESC_CNT). The 8139C+ is limited
to 64 descriptors and automatically wraps at 64 descriptors even
if the EOR bit isn't set, but the 8169 NIC can have up to 1024
descriptors per ring, so we must set the wrap point in the right
place.
- RealTek moved the RL_TIMERINT register from offset 0x54 to 0x58 in
the 8169 -- account for this.
- Added rl_gmii_readreg() and rl_gmii_writereg() routines.
- Fix rl_probe() to deal with the case where the base type is
not RL_8139.
The next step is to add jumbo buffer support.
Tested with the Xterasys XN-152 NIC (hard to beat $29 for a gigE NIC).
path into the kernel. Normally it's due to a syscall, but one can
also be created as the result of a clock interrupt (for example).
This now even more looks like exec_setregs().
While here, add an assert that we don't expect more than 8KB of
dirty registers on the kernel stack.
unconditionally restore ar.k7 (kernel memory stack) and ar.k6
(kernel register stack). I don't know what I was smoking then,
but if you unconditionally restore ar.k6, you also want to
compute its value unconditionally. By having the computation
predicated and dependent on whether we return to user mode, we
would end up writing junk (= invalid value for ar.bspstore) if
we would return to kernel mode. But the whole point of the
unconditional restoration was that there is a grey area where
we still need to have ar.k6 restored. If we restore with a junk
value, we would end up wedging the machine on the next interrupt.
So, unconditionally calculate the value we unconditionally write
to ar.k6.
o The previous braino was found while making the following change:
We used to clear the lower 9 bits of the value we write to ar.k6.
The meaning being that we know that the kernel register stack is
at least 512 byte aligned and simply clearing the lower 9 bits
allows us to return to a context of which we don't have dirty
registers on the kernel stack, even though the context that
entered the kernel does have dirty registers on the kernel stack.
By masking-off the lower bits, we correctly obtain the base of
the register stack without having to worry that we didn't actually
reached the base while unwinding it.
The change is to mask off the lower 13 bits, knowing that the
kernel register stack is always 8KB aligned. The advantage is that
we don't have to worry anymore if there's more than 512 bytes of
dirty registers on the kernel stack. A situation that frequently
occurs. In exec_setregs() in machdep.c:1.147 or older, we had to
deal with that situation by copying the active portion of the
register stack down in multiples of 512 bytes. Now that we mask off
the lower 13 bits we don't have to do that at all. Contemporary
IPF processors have a register file that can hold up to 96 stacked
registers (=784 bytes [incl. 2 NaT collections]). With no indication
that register files grow beyond a couple of hundred registers, we
should not have to worry about it anymore... and yes, 640KB is
enough for everybody :-)
This change helps setcontext(2) and cpu_set_upcall_kse() in that
they can return to completely different contexts without having to
mess with the kernel stack. Of course exec_setregs() doesn't need
to do that anymore as well.
queues lock such that it isn't held around the call to get_pv_entry(),
which calls uma_zalloc(). At the point of the call to get_pv_entry(), the
lock isn't necessary and holding it could lead to recursive acquisition,
which isn't allowed.
that the page's busy flag could be relied upon to synchronize access to the
pv list. I don't any longer. See, for example, the call to
pmap_insert_entry() from pmap_copy().)
(short) types for the port arg of inb() (rev.1.56). The warning started
working for u_short types with gcc-3.3. The pessimizations exposed
by this been fixed except for the cx and oltr drivers where the breakage
of the warning has been pushed to the drivers.
completenss. The pessimization is tiny compared with i/o port slowness
except on very old machines, but code that used signed short types for
i/o ports was unpessimized long ago, and the macro that detected it
recently started working for u_short types too. Use of bus space
should have made this moot long ago.
Not tested at runtime by: bde
completenss. The pessimization is tiny compared with i/o port slowness
except on very old machines, but code that used signed short types for
i/o ports was unpessimized long ago, and the macro that detected it
recently started working for u_short types too. Use of bus space
should have made this moot long ago.
Not tested at runtime by: bde
This change allows one to specify almost the complete traffic parameters
for IPoverATM channels through the routing table. Up to now we used
4 byte DL addresses (flag, vpi, vciH, vciL). This format is still allowed.
If the address is longer, however, the 5th byte is interpreted as the
traffic class (UBR, CBR, VBR or ABR) and the remaining bytes are the
parameters for this traffic class:
UBR: 0 byte or 3 byte PCR
CBR: 3 byte PCR
VBR: 3 byte PCR, 3 byte SCR, 3 byte MBS
ABR: 3 byte PCR, 3 byte MCR, 3 byte ICR, 3 byte TBE, 1 byte NRM,
1 byte TRM, 2 bytes ADTF, 1 byte RIF, 1 byte RDF and 1 byte CDF
A script to generate the corresponding 'route add' arguments will follow soon.
connection is to be established asynchronously, behave as in the
case of non-blocking mode:
- keep the SS_ISCONNECTING bit set thus indicating that
the connection establishment is in progress, which is the case
(clearing the bit in this case was just a bug);
- return EALREADY, instead of the confusing and unreasonable
EADDRINUSE, upon further connect(2) attempts on this socket
until the connection is established (this also brings our
connect(2) into accord with IEEE Std 1003.1.)
function prototypes. Use LIST_FOREACH instead of explicit loops.
The indentation of functions indendet by 4 space have been left alone.
2-space indented functions have been re-indented.
Eliminate a lot of checkes to make sure requests are not cross-device
which is unnecessary with the new layout. We know a sequential request
cannot possibly be cross-device because there is a reserved page between
the devices.
Remove a couple of comments which no longer are relevant.
i/o ports by calling the implementation-detail level below inb() and
outb() instead of inb() and outb(). Unpessimizing the types is too
hard since they are mainly used in microcode.
completenss. The pessimization is tiny compared with i/o port slowness
except on very old machines, but code that used signed short types for
i/o ports was unpessimized long ago, and the macro that detected it
recently started working for u_short types too. Use of bus space
should have made this moot long ago.
Not tested at runtime by: bde
- Build SGL's for ATA_PASSTHROUGH commands
- Fallback to using the sgl_offset when the opcode is unknown for building
SGL's/
- Add ioctl calls for adding and removing units.
- Define previously undefined AEN's
- Allocate memory for the ioctl payload in multiples of 512bytes.
MFC after: 1 week
need this for swapcontext(), KSE upcalls initiated from ast()
also need to save them so that we properly return the syscall
results after having had a context switch. Note that we don't
use r11 in the kernel. However, the runtime specification has
defined r8-r11 as return registers, so we put r11 in the context
as well. I think deischen@ was trying to tell me that we should
save the return registers before. I just wasn't ready for it :-)
o The EPC syscall code has 2 return registers and 2 frame markers
to save. The first (rp/pfs) belongs to the syscall stub itself.
The second (iip/cfm) belongs to the caller of the syscall stub.
We want to put the second in the context (note that iip and cfm
relate to interrupts. They are only being misused by the syscall
code, but are not part of a regular context).
This way, when the context is switched to again, we return to
the caller of setcontext(2) as one would expect.
o Deal with dirty registers on the kernel stack. The getcontext()
syscall will flush the RSE, so we don't expect any dirty registers
in that case. However, in thread_userret() we also need to save
the context in certain cases. When that happens, we are sure that
there are dirty registers on the kernel stack.
This implementation simply copies the registers, one at a time,
from the kernel stack to the user stack. NAT collections are not
dealt with. Hence we don't preserve NaT bits. A better solution
needs to be found at some later time.
We also don't deal with this in all cases in set_mcontext. No
temporay solution is implemented because it's not a showstopper.
The problem is that we need to ignore the dirty registers and we
automaticly do that for at most 62 registers. When there are more
than 62 dirty registers we have a memory "leak".
This commit is fundamental for KSE support.
sysctl:
- sysctlbyname("net.inet.ip.mfctable", ...)
- sysctlbyname("net.inet.ip.viftable", ...)
This change is needed so netstat can use sysctlbyname() to read
the data from those tables.
Otherwise, in some cases "netstat -g" may fail to report the
multicast forwarding information (e.g., if we run a multicast
router on PicoBSD).
* Bug fix: when sending IGMPMSG_WRONGVIF upcall to the multicast
routing daemon, set properly "im->im_vif" to the receiving
incoming interface of the packet that triggered that upcall
rather than to the expected incoming interface of that packet.
* Bug fix: add missing increment of counter "mrtstat.mrts_upcalls"
* Few formatting nits (e.g., replace extra spaces with TABs)
Submitted by: Pavlin Radoslavov <pavlin@icir.org>
code used to call rtrequest(RTM_DELETE, ...). This is a problem, because
the function that just has called us (route_output)
is not really happy with the route it just is creating beeing ripped out
from under it. Unfortunately we also cannot return an error from
ifa_rtrequest. Therefore mark the route just as RTF_REJECT.
control whether to accept RAs per-interface basis.
the new stuff ensures the backward compatibility;
- the kernel does not accept RAs on any interfaces by default.
- since the default value of the flag bit is on, the kernel accepts RAs
on all interfaces when net.inet6.ip6.accept_rtadv is 1.
Obtained from: KAME
MFC after: 1 week
preparation for supporting the OPENVCC and CLOSEVCC ioctls which
are needed for ng_atm. This required some re-organisation of the code
(mostly converting array indexes to pointers). This also gives us
an array of open vccs that will help in using the generic GETVCCS handler.
than i386 or AMD64, TP register points to thread mailbox, and they can not
atomically clear km_curthread in kse mailbox, in this case, thread retrieves
its thread pointer from TP register and sets flag TMF_NOUPCALL in its thread
mailbox to indicate a critical region.
the macro definition, and cause the generation of syntactically
incorrect code that gcc happens to accept.
Reviewed by: schweikh (mentor)
MFC after: 4 weeks
use vrele() instead of vput() on the parent directory vnode returned
by namei() in the case where it is equal to the target vnode. This
handles namei()'s somewhat strange (but documented) behaviour of
not locking either vnode when the two vnodes are equal and LOCKPARENT
but not LOCKLEAF is specified.
Note that since a vnode double-unlock is not currently fatal, these
coding errors were effectively harmless.
Spotted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
Reviewed by: mckusick
they haven't been counted before. This test was ommitted when bus_dmamap_load()
was merged into this function, and results in the pagesneeded field growing
without bounds when multiple deferrals happen.
Thanks to Paul Saab for beating his head against this for a few hours =-)
user space region. Hence, we need to test if 5 is greater than the
region; not greater equal.
This bug caused us to call ast() while interrupting kernel mode.
malloc and mbuf allocation all not requiring Giant.
1) ostat, fstat and nfstat don't need Giant until they call fo_stat.
2) accept can copyin the address length without grabbing Giant.
3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit.
4) move Giant grabbing from each indivitual recv* syscall to recvit.
- Move the enabling of interrupts out of assembly and into C a few
instructions later at cpu_critical_fork_exit(). This puts more of the
MD critical section implementation under the MD critical section API
making it easier to test and develop alternative implementations.
set in cpu_critical_fork_exit() anymore.
- As far as I can tell, cpu_thread_link() has never been used, not even
when it was originally added, so remove it.
Also change "Auto mode" to use a "special" value
instead of 0, and define and document it.
I had thought libpthread had already been switched to use auto mode but
it appears that patch hasn't been committed yet.
Discussed with: Davidxu
to not get any cross-device I/O requests. (The unallocated first page
protecting BSD labels already gave us this, but that hack may go away
at some point in time).
Remove the check for cross-device I/O requests in swap_pager_strategy.
Move the repeated statistics updating into flushchainbuf().
larger than normal frames, to account for the case where a bge(4) NIC
is used with VLANs. Since we set the IFCAP_VLAN_MTU flag, we must allow
reception of frames up to 1522 bytes in size rather than 1518.
Note that it is possible to work around this bug by doing:
# ifconfig bge0 mtu 1504
prior to configuring any VLAN interfaces.
o Remove alpha specific timer code (mc146818A) and compiled-out
calibration of said timer.
o Remove i386 inherited timer code (i8253) and related acquire and
release functions.
o Move sysbeep() from clock.c to machdep.c and have it return
ENODEV. Console beeps should be implemented using ACPI or if no
such device is described, using the sound driver.
o Move the sysctls related to adjkerntz, disable_rtc_set and
wall_cmos_clock from machdep.c to clock.c, where the variables
are.
o Don't hardcode a hz value of 1024 in cpu_initclocks() and don't
bother faking a stathz that's 1/8 of that. Keep it simple: hz
defaults to HZ and stathz equals hz. This is also how it's done
for sparc64.
o Keep a per-CPU ITC counter (pc_clock) and adjustment (pc_clockadj)
to calculate ITC skew and corrections. On average, we adjust the
ITC match register once every ~1500 interrupts for a duration of
2 consequtive interruprs. This is to correct the non-deterministic
behaviour of the ITC interrupt (there's a delay between the match
and the raising of the interrupt).
o Add 4 debugging sysctls to monitor clock behaviour. Those are
debug.clock_adjust_edges, debug.clock_adjust_excess,
debug.clock_adjust_lost and debug.clock_adjust_ticks. The first
counts the individual adjustment cycles (when the skew first
crosses the threshold), the second counts the number of times the
adjustment was excessive (any non-zero value is to be considered
a bug), the third counts lost clock interrupts and the last counts
the number of interrupts for which we applied an adjustment
(debug.clock_adjust_ticks / debug.clock_adjust_edges gives the
avarage duration of an individual adjustment -- should be ~2).
While here, remove some nearby (trivial) left-overs from alpha and
other cleanups.
swapbkva. Swapbkva mappings are explicitly managed using pmap_qenter(),
not on-demand by vm_fault(), making kmem_alloc_nofault() more appropriate.
Submitted by: tegge
generate the inode mode from a default ACL and creation mask,
implement ufs_sync_inode_from_acl() using acl_posix1e_newfilemode().
Since ACL_OVERRIDE_MASK/ACL_PRESERVE_MASK are defined, we no
longer need to explicitly pass in a "preserve_mask" field: this
is implicit in the use of POSIX.1e semantics.
Note: this change contains a semantic bugfix for new file creation:
we now intersect the ACL-generated mode and the cmode requested by
the user process. This means permissions on newly created file
objects will now be more conservative. In the future, we may want
to provide alternative semantics (similar to Solaris and Linux) in
which the ACL mask overrides the umask, permitting ACLs to broaden
the rights beyond the requested umask.
PR: 50148
Reported by: Ritz, Bruno <bruno_ritz@gmx.ch>
Obtained from: TrustedBSD Project
support routines in kern_acl.c:
- Define ACL_OVERRIDE_MASK and ACL_PRESERVE_MASK centrally in acl.h: the
mode bits that are (and aren't) stored in the ACL.
- Add acl_posix1e_acl_to_mode(): given a POSIX.1e extended ACL, generate
a compatibility mode (only the bits supported by the POSIX.1e ACL).
- acl_posix1e_newfilemode(): Given a requested creation mode and default
ACL, calculate the mode for the new file system object (only the bits
supported by the POSIX.1e ACL).
PR: 50148
Reported by: Ritz, Bruno <bruno_ritz@gmx.ch>
Obtained from: TrustedBSD Project
cases:
- Setting sticky bit on non-directory
- Setting setgid on a file with a group that isn't in the effective
or extended groups of the authorizing credential
I.e., test the requirement first, then do the privilege test,
rather than doing the privilege test regardless of the need for
privilege.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
interrupting user mode. The net effect of this bug is that a clock
interrupt does not cause rescheduling and processes are not
preempted. It only takes a "while (1);" to render the machine
useless.
This bug was introduced by the context changes and EPC syscall code.
Handling of ASTs was moved to C for clarity and ease of maintenance,
but was not added for the external interrupt case.
This needs to be revisited. We now have calls to do_ast() in trap(),
break_syscall() and ivt_External_Interrupt(). A single call in
exception_restore covers these 3 places without duplication. This
is where we handled ASTs prior to the overhaul, except that the
meat has been moved to do_ast(), a C function. This was the goal
to begin with.
Pointy hat: marcel
Use ->bio_children to count child buffers, rather than abuse the
bio_caller1 pointer.
Expand the relevant bits of waitchainbuf() inline, this clarifies
the code a little bit.
striping to a per device round-robin algorithm.
Because of the policy of not attempting to retain previous swap
allocation on page-out, this means that a newly added swap device
almost instantly takes its 1/N share of the I/O load but it takes
somewhat longer for it to assume it's 1/N share of the pages if there
is plenty of space on the other devices.
Change the 8G total swapspace limitation to 8G per device instead
by using a per device blist rather than one global blist. This
reduces the memory footprint by 75% (typically a couple hundred
kilobytes) for the common case with one swapdevice but NSWAPDEV=4.
Remove the compile time constant limit of number of swap devices,
there is no limit now. Instead of a fixed size array, store the
per swapdev structure in a TAILQ.
Total swap space is still addressed by a 32 bit page number and
therefore the upper limit is now 2^42 bytes = 16TB (for i386).
We still do not allocate the first page of each device in order to
give some amount of protection to any bsdlabel at the start of the
device.
A new device is appended after the existing devices in the swap space,
no attempt is made to fill in holes left behind by swapoff (this can
trivially be changed should it ever become a problem).
The sysctl vm.nswapdev now reflects the number of currently configured
swap devices.
Rename vm_swap_size to swap_pager_avail for consistency with other
exported names.
Change argument type for vm_proc_swapin_all() and swap_pager_isswapped()
to be a struct swdevt pointer rather than an index.
Not changed: we are still using blists to manage the free space,
but since the swapspace is no longer fragmented by the striping
different resource managers might fare better.
concurrent invocations from acquiring the same address(es). Also, in case
of an incomplete allocation, free any allocated pages.
In collaboration with: tegge
sure that uma_dbg_free() is called if we're about to call
uma_zfree_internal() but we're asking it to skip the dtor and
uma_dbg_free() call itself. So, if we're about to call
uma_zfree_internal() from uma_zfree_arg() and skip == 1, call
uma_dbg_free() ourselves.
EFI file system. When booting from a CD and there's already an EFI
system partition on the disk, setting the current device to unit 0
will select the harddisk. This invariably breaks installing FreeBSD
when other operating systems have been installed before.
We obviously want to do the same when we're booting over the network.
Maybe later.
Based on a patch (from memory) from: arun
there is code that blindly allocates LDTEs starting at slot 6
and I quess it doesn't really matter to us if they overwrite the BSDI
syscall slot, since it isn't a BSDI binary. Also add some code to help track
down other such users (commented out for now).
Reviewed by: deischen@
o correct BSSID setup in ah_writeAssocid for 5211 and 5212 (fixes
reception of broadcast frames after association)
o correct transmit retry counts returned by 5211 in ah_procTxDesc
o add missing regulatory domain support that caused use of 11b channels to be
disallowed with some cards (e.g. mini-pci cards in certain IBM laptops)
o miscellaneous fixes to regulatory domain support
o increase size of 5212 ANI table to avoid overflow
o add monitor mode
o remove OS_QSORT support
o fix handling of HAL_RXDESC_INTREQ in ah_setupRxDesc
o rewrite 5212 descriptor handling for portability
o FreeBSD: track alq_open API change
type. We know about header types 0, 1 and 2. Ignore the rest in the
MD i386 code when we're looking for bridges. You cannot look at the
vendor tag. And if you don't you certainly can't look at function > 0
if the device isn't there.
The new soekris boards' GEODE cpu has issues with the old way. This
is reported to have fixed it.
MFC After: 2 days
considered to be good to try when it otherwise has no clue about which
interrupts to try. This is a band-aide and we really should try to
balance the IRQs that we arbitrarily pick, but it should help some
people that would otherwise get bad IRQs.
recompiling the driver. See the comments near the top of "if_em.h"
for descriptions of these delays. Four new loader tunables control
the system-wide default values:
hw.em.tx_int_delay
hw.em.rx_int_delay
hw.em.tx_abs_int_delay
hw.em.rx_abs_int_delay
The tunables are specified in microseconds. The valid range is
0-67108 usec., and 0 means that the timer is disabled.
There are also four new sysctls (actually, a set of four for each
"em" device in the system) to query and change the interrupt delays
after the system is up:
hw.em0.tx_int_delay
hw.em0.rx_int_delay
hw.em0.tx_abs_int_delay (not present for 82542/3/4 adapters)
hw.em0.rx_abs_int_delay (not present for 82542/3/4 adapters)
It seems to be OK to change these values even while the adapter is
passing traffic.
Approved by: Prafulla Deuskar <pdeuskar@FreeBSD.ORG>
MFC after: 4 weeks
- Move isa/ppc* to sys/dev/ppc (repo-copy)
- Add an attachment method to ppc for puc
- In puc we need to walk the chain of parents.
Still to do, is to make ppc(4) & puc(4) work on other platforms. Testers
wanted.
PR: 38372 (in spirit done differently)
Verified by: Make universe (if I messed up a platform please fix)
mac_mls_subject_equal_ok() to mac_mls_subject_privileged(),
which more consistently reflects the fact that this is really
about our notion of privilege in the MLS policy.
Since we don't use suser() for privilege in MLS, remove
the suser check from the ifnet relabel ioctl, and replace it
with an MLS privilege check.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
another thread. We use the td_oncpu member of the other field to locate
it's associated CPU and then search the that CPU's list of spin locks
contained in its per-CPU data. This is not always safe and may in fact
panic or just not work, but it is useful in at least one case.
already checks suser on a network interface relabel, so don't dup it
here. Rely solely on the Biba definition of privilege, which is
already tested.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Submitted by: Andrew Reisse <areisse@nailabs.com>
callout when a specified number of lines have been output. This can be
used to implement pagers for ddb commands that output a lot of text. A
simple paging function is included that automatically rearms itself when
fired.
Reviewed by: bde, julian
This is controlled by a per-adapter sysctl hw.atm.hfaX.shape. When
set to 0, no shaping occures. When set to 1 at most 1 channel is
shaped. When set to 2 all CBR channels are shaped. Note, that the
latter may actually not work, because of the adapter supporting
the shaping of only one PDU at the same time.
also do it). Three problems have been encountered:
1. The initialisation command does not work in interrupt mode. Whether
this is a firmware bug or a feature is not clear. The original Fore
drivers execute the initialize command always in polling mode, so
it appears that this behaviour is expected. When we detect a 4.X.Y
firmware do busy wait on the command status.
2. The command code of the GET_PROM command has changed. This is an
unofficial command anyway. What was GET_PROM in 3.X.Y is CLEAR_STATS
in 4.X.Y (although unimplemented in the firmware). We need to
use the correct code depending on the firmware.
3. The 4.X.Y firmware can set the error flag in the command status
without also setting the completion flag (as the documenation says).
Check both variants.
An additional field in the per-card structure fu_ft4 is TRUE when we have
detected a 4.X.Y firmware. Otherwise it is false. The behaviour of the
driver when using a 3.X.Y firmware should be identical to the previous
behaviour.
This change will enable traffic shaping of (at least one) CBR channels.
The other option would be to remove it, but I can imagine it may be useful
for the forseeable future as we fiddle with segments in KSE and thr libraries,
that while many maps can exist and be loaded per tag, bus_dmamap_load() and
friends can only be called on one map at a time from the tag. This is
enforced via the mutex arguments in the tag.
Fixing this bug means that s/g lists can be arbitrarily long in length, and
also removes an ugly GNU-ism from the code. No API or ABI change is
incurred. Similar changes for other platforms is forthcoming.
pointer in the PCB of the corresponding thread if it's not the
current thread. This is needed for thr_create() to setup the
newly created thread from the context provided by the application.
This commit finalizes supporting libthr.
created not only with UMA_ZONE_VM but also with UMA_ZONE_NOFREE. In
the i386 case in particular, the pmap code would hook a special
page allocation routine that allocated from kernel_map and not kmem_map,
and so when/if the pageout daemon drained the zones, it could actually
push out slabs from the PV ENTRY zone but call UMA's default page_free,
which resulted in pages allocated from kernel_map being freed to
kmem_map; bad. kmem_free() ignores the return value of the
vm_map_delete and just returns. I'm not sure what the exact
repercussions could be, but it doesn't look good.
In the PAE case on i386, we also set-up a zone in pmap, so be
conservative for now and make that zone also ZONE_NOFREE and
ZONE_VM. Do this for the pmap zones for the other archs too,
although in some cases it may not be entirely necessarily. We'd
rather be safe than sorry at this point.
Perhaps all UMA_ZONE_VM zones should by default be also
UMA_ZONE_NOFREE?
May fix some of silby's crashes on the PV ENTRY zone.
or free a LDT entry. The function has following prototype:
int i386_set_ldt(int start_sel, union descriptor *descs, int num_sels);
Added Features:
o If start_sel is 0, num_sels is 1 and the descriptor pointed to by descs
is legal, then i386_set_ldt() will allocate a descriptor and return its
selector numbe
o If num_descs is 1, start_sels is valid, and descs is NULL, then
i386_set_ldt() will free that descriptor (making it available to be real-
located again later).
o If num_descs is 0, start_sels is 0 and descs is NULL then, as a special
case, i386_set_ldt() will free all descriptors.
Reviewed by: julian
in sync with the backend machdep code. When cpu_thread_init() does not
have the same idea of KSTACK_PAGES as the thing that created the kstack,
all hell breaks loose.
Bad alc! no cookie! :-)
on a non-blocking pipe in cases where select(2) returns the file
descriptor as ready for write. This in turns causes libc_r, for
one, to busy wait in such cases.
Note: it is a quick performance fix, a more complex fix might be
required in case this turns out to have unexpected side effects.
Reviewed by: silby
MFC after: 3 days
thread's pid to make debugging easier for people who don't want to have to
use the intended tool for these panics (witness).
Indirectly prodded by: kris
use "\n\" instead of "\" at the end of each source line, and don't use
semicolons). Fixed some older style bugs on the same lines (mainly
English errors in comments).
1) The race has to do with zone destruction. From the zone destructor we
would lock the zone, set the working set size to 0, then unlock the zone,
drain it, and then free the structure. Within the window following the
working-set-size set to 0 and unlocking of the zone and the point where
in zone_drain we re-acquire the zone lock, the uma timer routine could
have fired off and changed the working set size to something non-zero,
thereby potentially preventing us from completely freeing slabs before
destroying the zone (and thus leaking them).
2) The leak has to do with zone destruction as well. When destroying a
zone we would take care to free all the buckets cached in the zone, but
although we would drain the pcpu cache buckets, we would not free them.
This resulted in leaking a couple of bucket structures (512 bytes each)
per cpu on SMP during zone destruction.
While I'm here, also silence GCC warnings by turning uma_slab_alloc()
from inline to real function. It's too big to be an inline.
Reviewed by: JeffR
a long-standing mistake in the way a portion of a pipe's KVA is
allocated. Specifically, kmem_alloc_pageable() is inappropriate
for use in the "direct" case because it allows a preceding vm map entry
and vm object to be extended to support the new KVA allocation.
However, the direct case KVA allocation should not have a backing
vm object. This is corrected by using kmem_alloc_nofault().
Submitted by: tegge (with the above explanation by me)
tsb_foreach(), 0 signals to terminate the tsb traversal, so when
tsb_foreach() was used in pmap_protect() (which only happens when
the area to be protected is larger than PMAP_TSB_THRESH = 16MB), only
the first tsb entry in the specified range would be protected.
Reported by: Andrew Belashov <bel@orel.ru>
("UMA Zone") carefully, because it does not have pcpu caches allocated
at all. In the UP case, we did not catch this because one pcpu cache
is always allocated with the zone, but for the MP case, we were getting
bogus stats for this zone.
Tested by: Lukas Ertl <le@univie.ac.at>
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics this makes them more accurate while under heavy load.
Submitted by: tegge
so not only wastes memory but it can also cause a leak in zones that
will be destroyed later. The problem is that the slab allocation code
places newly created slabs on the partially allocated list because it
assumes that the caller will actually allocate some memory from it.
Failure to do so places an otherwise free slab on the partial slab list
where we wont find it later in zone_drain().
Continuously prodded to fix by: phk (Thanks)
with up to date comments. This fixes booting kernels with boot2
(except for loss of the features provided by loader) and is suitable
for MFC. Contrary to the old comments, most loaders don't clear the bss.
biosboot lost clearing of the bss in a code crunch in 1997, and boot2
never did it.
kan didn't notice the problem with gcc-3.3 putting variables that are
initialized to 0 in the bss until after committing gcc-3.3 because he
was already using essentially this patch. Before gcc-3.3, only the
non-critical `bootdev' variable was clobbered by clearing the bss.
MFC after: 3 days
messages are forwarded as netgraph control messages to the node
that is connected to the manage hook. If that hook is not connected,
the event is lost. Flow control events are converted to netgraph
flow control messages and send along the hook that is connected to
the flow controlled VC. ACR change events are converted to control
messages and sent along the hook for the given VC.
instead of int where the variable has to hold buffer lengths,
use u_int for things like number of network interfaces which
in principle can never be negative.
parts of the system about certain kinds of events, like changes
in the ABR rate, changes in the carrier state, PVC changes. The
main consumers of these events are the harp(4) pseudo-driver
and the ILMI daemon via ng_atm(4).
HIDENAME() macro seems to be unimplementable in C. (HIDENAME() used
to use invalid token pasting using ## for the STDC case until gcc
started rejecting that; now it uses unportable token pasting using
juxtaposition in all cases.) This reduces use of HIDENAME() in the
kernel to only i386 and amd64 profiling code so that it doesn't bite
most kernels whenever gcc becomes stricter. Problems with HIDENAME()
in userland are smaller because userland mostly doesn't use strict
flags yet. There are some advantages to hiding the name of mcount,
but newer arches shouldn't do it; only amd64 does.
MFC after: 3 days
On second thoughts hide tmpstk better by staticizing it.
- cut the version string at the newline, suppressing information about
who built the kernel and in what directory. Most of this information
was already lost to truncation.
- on i386, return the precise CPU class (if known) rather than just
"i386". Linux software which uses this information to select
which binary to run often does not know what to make of "i386".
allocation function. With this patch, it prevents continous growth of
the devbuf memory pool.
Tested with ssh <host> dd of=/dev/null < /dev/zero and vmstat -m | grep devbuf
to such devices. If a device fails due to this commit, add:
options DA_OLD_QUIRKS
to the kernel config and recompile. Then send the output of "camcontrol
inquiry da0" to scsi@freebsd.org so the quirk can be re-enabled.
to set np->n_size back to the desired size again after calling
nfs_meta_setsize(), since it could end up in nfs_loadattrcache() getting
called, which would change n_size back to the value it had before the
truncate request was issued. The result of this bug is that the size info
cached in the nfsnode becomes incorrect, lseek(fd, ofs, SEEK_END) seeks
past the end of the file, stat() returns the wrong size, etc.
PR: 41792
MFC after: 2 weeks
kernel ACL interfaces and system call names.
Break out UFS2 and FFS extattr delete and list vnode operations from
setextattr and getextattr to deleteextattr and listextattr, which
cleans up the implementations, and makes the results more readable,
and makes the APIs more clear.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
kern.file sysctl, don't return information about processes that
fail p_cansee(td, p). This prevents sockstat and related
programs from seeing file descriptors owned by processes not
in the same jail as the thread, as well as having implications
for MAC, etc.
This is a partial solution: it permits an information leak about
the number of descriptors in the sizing calculation (but this is
not new information, you can also get it from kern.openfiles),
and doesn't attempt to mask file descriptors based on the
properties of the descriptor, only the process referencing it.
However, it provides most of what you want under most
circumstances, without complicating the locking.
PR: 54211
Based on a patch submitted by: Pawel Jakub Dawidek <nick@garage.freebsd.pl>
receive 6 byte commands. Add a check for this flag to da(4) and cd(4) so
that they honor it. This is a quick workaround for many devices (especially
USB) that require da(4) quirks to operate. The more complete approach is
to finish the new transport code which will be aware of the SCSI version a
transport implements.
MFC after: 1 day
child when forking. This provides a consistent initial state.
Note that cpu_set_upcall() does not clear the per-CPU unique value as
it is followed by a call to set_mcontext(), which sets it accordingly.
in the `video_state' structure, to larger ones (from u_char to
u_short). Each can now hold values at least as large as the
size of the array it is meant to point into.
This eliminates warnings printed by GCC 3.3.1 and hence makes
pcvt compilable using -Werror.
memory in bus_dmamem_alloc(). This is possible now that
contigmalloc() supports the M_ZERO flag.
- Remove the locking of Giant around calls to contigmalloc() since
contigmalloc() now grabs Giant itself.
mainly to quiet a warning emitted by GCC 3.3 about comparing
a variable to a value which is larger than the former can hold.
The value was checked to make sure the `np->squeue' array is
not accessed behind its boundary.
This worked due to possibly accidental truncation when
(np->squeueput + 1) was larger than or equal to MAX_START (256)
when it was assigned to `qidx'.
`qidx' is used to hold the next position in the start queue
for an insertion. The new type was chosen because some other
code in the function ncr_freeze_devq() also uses plain integers
to hold those indices.
Wrapped the line after the closing parenthesis of an `if'
condition.
function unless the device is configured up. Without this fix, the
device ends up in the RUNNING state even though it is configured down.
Also, check the RUNNING flag before calling the if_start function, in
case the if_init function failed for one reason or another.
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
invariants/witness/etc on i386, sparc64, amd64 and alpha for GENERIC.
Lint probably still needs fixing, as do a couple of other drivers
that have broken recently and not been noticed.
in ntfs_writentvattr_plain and ntfs_readntvattr_plain, and purge the boot
block from the buffer cache if isn't exactly one cluster long. These two
changes work around the same buffer cache bug that ntfs_subr.c 1.30 tried
to, but in a different way. This may decrease throughput by reading smaller
amounts of data from the disk at a time, but may increase it by avoiding
bogus writes of clean buffers.
Problem (re)reported by Karel J. Bosschaart on -current.
switching anymore, so there's no need to save and restore GP. This
change breaks threaded applications linked against libc_r. Pull the
tier 2 card again: relink. This will link against libthr instead.
the "do-nothing" versions of __RCSID(), __RCSID_SOURCE(), __SCCSID(),
and __COPYRIGHT() were not strictly correct. They should not expand
into [nothing], because the ';' which follows them would then cause
a syntax error (in a strict C compiler, if not gcc...).
So, change the do-nothing versions of those macros to use the
'struct __hack' tactic, as was already used with __FBSDID().
Approved by: discussions with bde
MFC after: 1 week
on i486's (and probably i386's), but it has had very little effect
since gcc-2.7 or gcc-2.95. With gcc-3.3, it gave a small
pessimization for at least i386's, athlon-xp's and pentium4's, a
small optimization (I think) for pentium1's, and made no difference
for i386's. (movzbl is best for all the later processors, and the
micro-optimization was to stop it being used on i486's.)
a non-standard construct. Instead, redefine struct _ia64_fpreg as a
union and put a long double in it. On ia64 and for LP64, this is
defined by the ABI to have 16-byte alignment. For ILP32 a long double
has 4-byte alignment, but we don't support ILP32.
Note that the in-memory image of a long double does not match the in-
memory image of spilled FP registers. This means that one cannot use
the fpr_flt field to interpet the bits. For this reason we continue
to use an aggregate type.
connections. While this confuses tcpdump, it enables other applications
to see and analyze non-IP traffic (signalling, for example).
Pointed out by: Vincent Jardin <vjardin@wanadoo.fr>
OID as the other protocol family sub-trees do, that is equal to the
protocol family identifier. Make the ATM layer debugging flags
available under this tree.
Submitted by: Vincent Jardin <vjardin@wanadoo.fr>
MFC after: 2 weeks
set an initial value. This is aimed at getting us closer to being able to
turn -Werror back on and we can adjust the settings later on. Yes, we
could turn off -Wno-inline instead, but that would hide the effect of
gcc's bogo-estimator ignoring inline (either rightly or wrongly).
written as a template that when inlined is specialized for the caller
through constant value propagation and dead code elimination. Thus,
the specialized code that is generated for pmap_clear_reference() et
al. avoids several conditional branches inside of a loop.
fields in the low 32 bits of the local APIC ICR register. Use this macro
in place of APIC_RESV2_MASK when masking off existing bits from the ICR
when writing to it to send an IPI.
Tested by: scottl
but this just created a weird inconsistency when porting gdb(1).
Instead, we name each high FP register seperately, like we do for
all the other registers.
__attribute__((__always_inline__)) by adding an __always_inline macro
(used like __dead2 etc). __inline_damnit has also been suggested but we
have a precedent of keeping the names similar so they are easier to find.
These were a left over from when the private memory pools were
converted to use uma zones. The limit of UMA zones, however,
works differently. When a zone is limited to only one or two pages
than, on multi-cpu systems, processes can get stuck on the zonelimit,
because all remaining free items are in caches of other CPUs.
Also add rudimentary error handling in some places (panic) when a zone
cannot be created.
AMD64, gcc (and the ABI) expects the x87 unit to be running in 80/64
mode (not 64/53) so that it can use it for 'long double' operations. It
takes the expected precision differences into account when generating
code.
the SSE mxcsr register as well. Since gcc will intermix SSE2 and x87
FP code, the fpsetround() etc mode had better be the same.
There are hooks to enable these inlines to be instantiated inside libc
for non-gcc or C++ callers. (g++ doesn't like the inlines that tried
to extract an integer and convert it to an enum).
sis_ioctl() was called, so one had to use ifconfig each time the cable got
plugged in to be able to use the connection.
Do it a better way now, add a "in_tick" field in the softc structure,
call timeout() in sis_tick() and don't call it in sis_init() if in_tick is
non-zero.
Reported by: Landmark Networks
Pointy hat to: cognet
newer lucent/hermes firmware than indicated (investigating). I'm committing
this now since it shouldn't hurt anything.
o Vaguely related, add bogus frame length check from netbsd.
Obtained from: netbsd
LAZY_SWITCH changes. He pointed out the acpi code sets up an identity
mapping in the current vmspace and that got messed up by the %cr3 being
out of sync with the current page directory. As a workaround, restore
%cr3 across the sleep/resume. A more complete fix would be to undo the
lazy state and clear the pm_active bit from the borrowed pmap, but this
works and people are currently hurting. I'll clean this up.
This is mostly Ian's patch, plus a PAE tweak from me.