754 Commits

Author SHA1 Message Date
kib
c35d24e497 PTI for amd64.
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability.  PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords.  Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed.  The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame.  Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps.  They are registered using
pmap_pti_add_kva().  Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use.  In principle, they can be installed into kernel page table per
pmap with some work.  Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation.  I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
imp
e3a601760b Define xpt_path_inq.
This provides a nice wrarpper around the XPT_PATH_INQ ccb creation and
calling.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13387
2017-12-06 23:05:22 +00:00
pfg
1537078d8f sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 14:52:40 +00:00
sephe
74acc5b16c hyperv/hn: Enable transparent VF by default.
MFC after:	3 days
Sponsored by:	Microsoft
2017-10-11 05:28:51 +00:00
sephe
6366e7bab0 hyperv/hn: Workaround erroneous hash type observed on WS2016 for VF.
The background was described in r324489.

MFC after:	3 days
Sponsored by:	Microsoft
2017-10-11 05:15:49 +00:00
sephe
be6e031010 hyperv/hn: Workaround erroneous hash type observed on WS2016.
Background:
- UDP 4-tuple hash type is unconditionally enabled in Hyper-V on WS2016,
  which is _not_ affected by NDIS_OBJTYPE_RSS_PARAMS.
- Non-fragment UDP/IPv4 datagrams' hash type is delivered to VM as
  TCP_IPV4.

Currently this erroneous behavior only applies to WS2016/Windows10.

Force l3/l4 protocol check, if the RXed packet's hash type is TCP_IPV4,
and the Hyper-V is running on WS2016/Windows10.  If the RXed packet is
UDP datagram, adjust mbuf hash type to UDP_IPV4.

MFC after:	3 days
Sponsored by:	Microsoft
2017-10-10 08:32:03 +00:00
sephe
5a41a31fda hyperv/vmbus: Expose Hyper-V major version.
MFC after:	3 days
Sponsored by:	Microsoft
2017-10-10 08:23:19 +00:00
sephe
a2497afcb7 hyperv/vmbus: Add tunable to pin/unpin event tasks.
Event tasks are pinned to their respective CPU by default, in the same
fashion as they were.

Unpin the event tasks by setting hw.vmbus.pin_evttask to 0, if certain
CPUs serve special purpose.

MFC after:	3 days
Sponsored by:	Microsoft
2017-10-10 08:16:55 +00:00
sephe
e521cc6466 hyperv/hn: Fix options RSS building
Reported by:	np
MFC after:	1 week
Sponsored by:	Microsoft
2017-10-05 13:22:14 +00:00
sephe
11859da3f9 hyperv/hn: Unbreak i386 building.
Reported by:	cy
MFC after:	1 week
Sponsored by:	Microsoft
2017-09-28 07:02:56 +00:00
sephe
b69ef721ad hyperv/hn: Fix UDP checksum offload issue in Azure.
UDP checksum offload does not work in Azure if following conditions are
met:
- sizeof(IP hdr + UDP hdr + payload) > 1420.
- IP_DF is not set in IP hdr

Use software checksum for UDP datagrams falling into this category.

Add two tunables to disable UDP/IPv4 and UDP/IPv6 checksum offload, in
case something unexpected happened.

MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12429
2017-09-27 05:44:50 +00:00
sephe
8fc41247cb hyperv/hn: Set tcp header offset for CSUM/LSO offloading.
No observable effect; better safe than sorry.

MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12417
2017-09-27 04:42:40 +00:00
sephe
74e7b4226c hyperv/hn: Incease max supported MTU
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12365
2017-09-19 06:46:00 +00:00
sephe
284eaa3f70 hyperv/hn: Fix MTU setting
- Add size of an ethernet header to the value configured to NVS.  This
  does not seem to have any effects if MTU is 1500, but fix hypervisor
  side's setting if MTU > 1500.
- Override the MTU setting according to the view from the hypervisor
  side.

MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12352
2017-09-19 06:38:57 +00:00
sephe
becc8347a8 hyperv/hn: Apply VF's RSS setting
Since in Azure SYN and SYN|ACK go through the synthetic parts while the
rest of the same TCP flow goes through the VF, apply VF's RSS settings
to synthetic parts to have a consistent hash value/type for the same TCP
flow.

MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12333
2017-09-19 06:29:38 +00:00
sephe
ab9d242c8d hyperv/hn: Log RSS capabilities mask.
This helps to detect when UDP hash types can be supported.

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12177
2017-09-05 06:20:02 +00:00
sephe
464b6d2488 hyperv/hn: Implement SIOCGIFRSS{KEY,HASH}.
The conditional compiling in the review request is removed, since
these IOCTLs will be available in stable/10 and stable/11.

Reviewed by:	gallatin
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12175
2017-09-05 06:05:48 +00:00
sephe
6a2d79f56b hyperv: Update copyright for the files changed in 2017
MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11982
2017-08-14 06:00:50 +00:00
sephe
a4ceea128a hyperv/hn: Re-set datapath after synthetic parts reattached.
Do this even for non-transparent mode VF. Better safe than sorry.

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11981
2017-08-14 05:55:16 +00:00
sephe
3142e35944 hyperv/hn: Minor cleanup
MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11979
2017-08-14 05:46:50 +00:00
sephe
744451beef hyperv/hn: Fix/enhance receiving path when VF is activated.
- Update hn(4)'s stats properly for non-transparent mode VF.
- Allow BPF tapping to hn(4) for non-transparent mode VF.
- Don't setup mbuf hash, if 'options RSS' is set.
  In Azure, when VF is activated, TCP SYN and SYN|ACK go through hn(4)
  while the rest of segments and ACKs belonging to the same TCP 4-tuple
  go through the VF.  So don't setup mbuf hash, if a VF is activated
  and 'options RSS' is not enabled.  hn(4) and the VF may use neither
  the same RSS hash key nor the same RSS hash function, so the hash
  value for packets belonging to the same flow could be different!
- Disable LRO.
  hn(4) will only receive broadcast packets, multicast packets, TCP
  SYN and SYN|ACK (in Azure), LRO is useless for these packet types.
  For non-transparent, we definitely _cannot_ enable LRO at all, since
  the LRO flush will use hn(4) as the receiving interface; i.e.
  hn_ifp->if_input(hn_ifp, m).

While I'm here, remove unapplied comment and minor style change.

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11978
2017-08-14 05:40:52 +00:00
sephe
19ec4cf79d hyperv/hn: Update VF's ibytes properly under transparent VF mode.
While, I'm here add comment about why updating VF's imcast stat is
not necessary.

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11948
2017-08-14 05:30:02 +00:00
sephe
4b42c6e8a4 hyperv/hn: Implement transparent mode network VF.
How network VF works with hn(4) on Hyper-V in transparent mode:

- Each network VF has a cooresponding hn(4).
- The network VF and the it's cooresponding hn(4) have the same hardware
  address.
- Once the network VF is attached, the cooresponding hn(4) waits several
  seconds to make sure that the network VF attach routing completes, then:
  o  Set the intersection of the network VF's if_capabilities and the
     cooresponding hn(4)'s if_capabilities to the cooresponding hn(4)'s
     if_capabilities.  And adjust the cooresponding hn(4) if_capable and
     if_hwassist accordingly. (*)
  o  Make sure that the cooresponding hn(4)'s TSO parameters meet the
     constraints posed by both the network VF and the cooresponding hn(4).
     (*)
  o  The network VF's if_input is overridden.  The overriding if_input
     changes the input packet's rcvif to the cooreponding hn(4).  The
     network layers are tricked into thinking that all packets are
     neceived by the cooresponding hn(4).
  o  If the cooresponding hn(4) was brought up, bring up the network VF.
     The transmission dispatched to the cooresponding hn(4) are
     redispatched to the network VF.
  o  Bringing down the cooresponding hn(4) also brings down the network
     VF.
  o  All IOCTLs issued to the cooresponding hn(4) are pass-through'ed to
     the network VF; the cooresponding hn(4) changes its internal state
     if necessary.
  o  The media status of the cooresponding hn(4) solely relies on the
     network VF.
  o  If there are multicast filters on the cooresponding hn(4), allmulti
     will be enabled on the network VF. (**)
- Once the network VF is detached.  Undo all damages did to the
  cooresponding hn(4) in the above item.

NOTE:
No operation should be issued directly to the network VF, if the
network VF transparent mode is enabled.  The network VF transparent mode
can be enabled by setting tunable hw.hn.vf_transparent to 1.  The network
VF transparent mode is _not_ enabled by default, as of this commit.

The benefit of the network VF transparent mode is that the network VF
attachment and detachment are transparent to all network layers; e.g. live
migration detaches and reattaches the network VF.

The major drawbacks of the network VF transparent mode:
- The netmap(4) support is lost, even if the VF supports it.
- ALTQ does not work, since if_start method cannot be properly supported.

(*)
These decisions were made so that things will not be messed up too much
during the transition period.

(**)
This does _not_ need to go through the fancy multicast filter management
stuffs like what vlan(4) has, at least currently:
- As of this write, multicast does not work in Azure.
- As of this write, multicast packets go through the cooresponding hn(4).

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11803
2017-08-09 05:59:45 +00:00
sephe
5863c3132c hyperv/kvp: Use proper size macro for adapter id.
Submitted by:	Christopher Ertl <Christopher.Ertl microsoft com>
MFC after:	3 days
Sponsored by:	Microsoft
2017-08-03 01:44:40 +00:00
sephe
571da5685c hyperv/hn: Add comment about ether_ifattach event subscription.
MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11710
2017-08-01 02:55:43 +00:00
sephe
f1555b6b44 hyperv/hn: Renaming and minor cleanup
This prepares for the upcoming transparent VF support.

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11708
2017-08-01 02:45:54 +00:00
sephe
f305deb3bc hyperv/hn: Ignore LINK_SPEED_CHANGE status.
This status will be reported if the backend NIC is wireless; it's not
useful.  Due to the high frequency of the reporting, this could be
pretty annoying; ignore it.

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11651
2017-07-24 04:00:43 +00:00
sephe
e57775e637 hyperv/hn: Export VF list and VF-HN mapping
The VF-HN map will be used later on to implement "transparent VF".

MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11618
2017-07-24 03:52:32 +00:00
sephe
4d77702aa0 hyperv/storvsc: Force SPC3 for CDROM attached.
This unbreaks the CDROM attaching on GEN2 VMs.  On GEN1 VMs, CDROM is
attached to emulated ATA controller.

PR:		220790
Submitted by:	Hongjiang Zhang <honzhan microsoft com>
MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11634
2017-07-20 07:13:26 +00:00
jah
d1caaa9300 Clean up MD pollution of bus_dma.h:
--Remove special-case handling of sparc64 bus_dmamap* functions.
  Replace with a more generic mechanism that allows MD busdma
  implementations to generate inline mapping functions by
  defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>.  This
  is currently useful for sparc64, x86, and arm64, which all
  implement non-load dmamap operations as simple wrappers
  around map objects which may be bus- or device-specific.

--Remove NULL-checked bus_dmamap macros.  Implement the
  equivalent NULL checks in the inlined x86 implementation.
  For non-x86 platforms, these checks are a minor pessimization
  as those platforms do not currently allow NULL maps.  NULL
  maps were originally allowed on arm64, which appears to have
  been the motivation behind adding arm[64]-specific barriers
  to bus_dma.h, but that support was removed in r299463.

--Simplify the internal interface used by the bus_dmamap_load*
  variants and move it to bus_dma_internal.h

--Fix some drivers that directly include sys/bus_dma.h
  despite the recommendations of bus_dma(9)

Reviewed by:	kib (previous revision), marius
Differential Revision:	https://reviews.freebsd.org/D10729
2017-07-01 05:35:29 +00:00
sephe
94aedec1e5 hyperv/input: Remove unnecessary inclusion.
The unbreaks gcc compilation.

Submitted by:	Ryan Libby
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11415
2017-06-30 03:01:22 +00:00
sephe
4323de0de8 hyperv/storvsc: Reduce log verbosity
On some windows hosts TEST_UNIT_READY command will return
SRB_STATUS_ERROR and sense data "NOT READY asc:3a,1 (Medium
not present - tray closed)", this occurs periodically, and
not hurt anything else.  So, we prefer to ignore this kind
of errors.

PR:		219973
Submitted by:	Hongjiang Zhang <hongzhan microsoft com>
MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D11271
2017-06-21 06:44:56 +00:00
dexuan
dfcaf984ee hyperv/pcib: use the device serial number as PCI domain
Currently the PCI domain is initialized with the instance GUID in
vmbus_pcib_attach(). It turns out the GUID can change across VM reboot,
while some users want a persistent value for PCI domain. The solution is
that we can change to use the device serial number, which starts with 1
and is unique within a VM.

Obtained from:	Haiyang Zhang
MFC after:	1 day
Sponsored by:	Microsoft
2017-06-08 12:11:30 +00:00
sephe
5877c4f28d hyperv/vmbus: Reorganize vmbus device tree
For GEN1 Hyper-V, vmbus is attached to pcib0, which contains the
resources for PCI passthrough and SR-IOV.  There is no
acpi_syscontainer0 on GEN1 Hyper-V.

For GEN2 Hyper-V, vmbus is attached to acpi_syscontainer0, which
contains the resources for PCI passthrough and SR-IOV.  There is
no pcib0 on GEN2 Hyper-V.

The ACPI VMBUS device now only holds its _CRS, which is empty as
of this commit; its existence is mainly for upward compatibility.

Device tree structure is suggested by jhb@.

Tested-by:	dexuan@
Collabrated-wth:	dexuan@
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D10565
2017-05-10 05:28:14 +00:00
sephe
067e38e286 hyperv/kbd: Channel read expects non-NULL channel argument.
MFC after:	now
Sponsored by:	Microsoft
2017-05-05 03:28:30 +00:00
sephe
91a23f2d4a hyperv/hn: Use channel0, i.e. TX ring0, for TCP SYN/SYN|ACK.
Hyper-V hot channel effect:
Operation latency on hot channel is only _half_ of the operation
latency on cold channels.

This commit takes the advantage of the above Hyper-V host channel
effect, and can reduce more than 75% latency and more than 50%
latency stdev, i.e. lower and more stable/predictable latency,
for various types of web server workloads.

MFC after:	3 days
Sponsored by:	Microsoft
2017-04-24 07:52:27 +00:00
sephe
db3fd9c10a hyperv: Use kmem_malloc for hypercall memory due to NX bit change.
Reported by:	dexuan@
MFC after:	now
Sponsored by:	Microsoft
2017-04-19 02:39:48 +00:00
sephe
792d8e91e0 hyperv/kvp: Remove always false condition.
Reported by:	PVS
MFC after:	3 days
Sponsored by:	Microsoft
2017-04-14 05:29:27 +00:00
sephe
d63cadf896 hyperv/storvsc: Use ULL for 64bits value shift.
Reported by:	PVS
MFC after:	3 days
Sponsored by:	Microsoft
2017-04-14 05:25:21 +00:00
sephe
12926ed8ea hyperv/kbd: Remove unnecessary assignment.
Reported by:	PVS
MFC after:	3 days
Sponsored by:	Microsoft
2017-04-14 05:18:42 +00:00
sephe
27e51e2575 hyperv/hn: Fixat RNDIS rxfilter after the successful RNDIS init.
Under certain conditions on certain versions of Hyper-V, the RNDIS
rxfilter is _not_ zero on the hypervisor side after the successful
RNDIS initialization, which breaks the assumption of any following
code (well, it breaks the RNDIS API contract actually).  Clear the
RNDIS rxfilter explicitly, drain packets sneaking through, and drain
the interrupt taskqueues scheduled due to the stealth packets.

Reported by:	dexuan@
MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D10230
2017-04-05 08:25:22 +00:00
sephe
3d99062ca4 hyperv/storvsc: Fixup SRB status.
This unbreaks GEN2 Hyper-V cd support.

Submitted by:	Hongjiang Zhang <honzhan microsoft com>
Reviewed by:	dexuan@
MFC after:	3 days
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D10212
2017-04-05 08:15:47 +00:00
sephe
8da6a9ea75 hyperv/kbd: Add support for synthetic keyboard.
Synthetic keyboard is the only supported keyboard on GEN2 Hyper-V.

Submitted by:	Hongjiang Zhang <honzhan microsoft com>
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D10196
2017-04-05 05:01:23 +00:00
sephe
bf0a3b4be6 hyperv/hn: Misaligned chimney sending buffers should not be used
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D9714
2017-03-01 09:05:12 +00:00
sephe
357eaaacb0 hyperv/hn: Make sure that RNDIS packet message is at least 4B aligned.
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D9713
2017-03-01 08:50:41 +00:00
sephe
f72c601aa2 hyperv/hn: Simplify RNDIS packet total length calculation.
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D9712
2017-03-01 08:24:17 +00:00
sephe
2e63fa0867 hyperv/hn: Simplify RNDIS packet data offset calculation.
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D9699
2017-02-28 09:50:34 +00:00
imp
ce9844cd72 Convert PCIe Hot Plug to using pci_request_feature
Convert PCIe hot plug support over to asking the firmware, if any, for
permission to use the HotPlug hardware. Implement pci_request_feature
for ACPI. All other host pci connections to allowing all valid feature
requests.

Sponsored by: Netflix
2017-02-25 06:11:59 +00:00
dexuan
76cd3d1328 hyperv/hn: add devctl_notify for VF_UP/DOWN events
Reviewed by:	sephe
Approved by:	sephe (mentor)
MFC after:	2 weeks
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D9102
2017-01-24 09:27:13 +00:00
dexuan
66d60a9912 hyperv/hn: add a sysctl name for the VF interface
This makes it easier for the userland script to find the releated
VF interface.

Reviewed by:	sephe
Approved by:	sephe (mentor)
MFC after:	2 weeks
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D9101
2017-01-24 09:25:42 +00:00