Vendor import of llvm release_40 branch r296202:

https://llvm.org/svn/llvm-project/llvm/branches/release_40@296202
2017-02-25 14:40:33 +00:00
commit 9c618dddcd
parent 5a813558fc
11 changed files with 297 additions and 380 deletions
--- a/docs/ReleaseNotes.rst
+++ b/docs/ReleaseNotes.rst
@@ -5,12 +5,6 @@ LLVM 4.0.0 Release Notes
 .. contents::
    :local:
 .. warning::
   These are in-progress notes for the upcoming LLVM 4.0.0 release.  You may
   prefer the `LLVM 3.9 Release Notes <http://llvm.org/releases/3.9.0/docs
   /ReleaseNotes.html>`_.
 Introduction
 ============
@@ -28,74 +22,56 @@ them.
 Non-comprehensive list of changes in this release
 =================================================
 * The C API functions LLVMAddFunctionAttr, LLVMGetFunctionAttr,
  LLVMRemoveFunctionAttr, LLVMAddAttribute, LLVMRemoveAttribute,
  LLVMGetAttribute, LLVMAddInstrAttribute and
  LLVMRemoveInstrAttribute have been removed.
 * The C API enum LLVMAttribute has been deleted.
 .. NOTE
   For small 1-3 sentence descriptions, just add an entry at the end of
   this list. If your description won't fit comfortably in one bullet
   point (e.g. maybe you would like to give an example of the
   functionality, or simply have a lot to talk about), see the `NOTE` below
   for adding a new subsection.
 * The definition and uses of LLVM_ATTRIBUTE_UNUSED_RESULT in the LLVM source
  were replaced with LLVM_NODISCARD, which matches the C++17 [[nodiscard]]
  semantics rather than gcc's __attribute__((warn_unused_result)).
 * Minimum compiler version to build has been raised to GCC 4.8 and VS 2015.
 * The C API functions ``LLVMAddFunctionAttr``, ``LLVMGetFunctionAttr``,
  ``LLVMRemoveFunctionAttr``, ``LLVMAddAttribute``, ``LLVMRemoveAttribute``,
  ``LLVMGetAttribute``, ``LLVMAddInstrAttribute`` and
  ``LLVMRemoveInstrAttribute`` have been removed.
 * The C API enum ``LLVMAttribute`` has been deleted.
 * The definition and uses of ``LLVM_ATTRIBUTE_UNUSED_RESULT`` in the LLVM source
  were replaced with ``LLVM_NODISCARD``, which matches the C++17 ``[[nodiscard]]``
  semantics rather than gcc's ``__attribute__((warn_unused_result))``.
 * The Timer related APIs now expect a Name and Description. When upgrading code
  the previously used names should become descriptions and a short name in the
  style of a programming language identifier should be added.
-* LLVM now handles invariant.group across different basic blocks, which makes
+* LLVM now handles ``invariant.group`` across different basic blocks, which makes
  it possible to devirtualize virtual calls inside loops.
-* The aggressive dead code elimination phase ("adce") now remove
+* The aggressive dead code elimination phase ("adce") now removes
  branches which do not affect program behavior. Loops are retained by
  default since they may be infinite but these can also be removed
-  with LLVM option -adce-remove-loops when the loop body otherwise has
+  with LLVM option ``-adce-remove-loops`` when the loop body otherwise has
  no live operations.
 * The GVNHoist pass is now enabled by default. The new pass based on Global
  Value Numbering detects similar computations in branch code and replaces
  multiple instances of the same computation with a unique expression.  The
  transform benefits code size and generates better schedules.  GVNHoist is
-  more aggressive at -Os and -Oz, hoisting more expressions at the expense of
+  more aggressive at ``-Os`` and ``-Oz``, hoisting more expressions at the
-  execution time degradations.
+  expense of execution time degradations.
 * The llvm-cov tool can now export coverage data as json. Its html output mode
   has also improved.
-* ... next change ...
+Improvements to ThinLTO (-flto=thin)
 ------------------------------------
 Integration with profile data (PGO). When available, profile data
 enables more accurate function importing decisions, as well as
 cross-module indirect call promotion.
-.. NOTE
+Significant build-time and binary-size improvements when compiling with
-   If you would like to document a larger change, then you can add a
+debug info (-g).
   subsection about it right here. You can copy the following boilerplate
   and un-indent it (the indentation causes it to be inside this comment).
   Special New Feature
   -------------------
   Makes programs 10x faster by doing Special New Thing.
   Improvements to ThinLTO (-flto=thin)
   ------------------------------------
   * Integration with profile data (PGO). When available, profile data
     enables more accurate function importing decisions, as well as
     cross-module indirect call promotion.
   * Significant build-time and binary-size improvements when compiling with
     debug info (-g).
 LLVM Coroutines
 ---------------
 Experimental support for :doc:`Coroutines` was added, which can be enabled
-with ``-enable-coroutines`` in ``opt`` command tool or using
+with ``-enable-coroutines`` in ``opt`` the command tool or using the
 ``addCoroutinePassesToExtensionPoints`` API when building the optimization
 pipeline.
@@ -106,18 +82,18 @@ For more information on LLVM Coroutines and the LLVM implementation, see
 Regcall and Vectorcall Calling Conventions
 --------------------------------------------------
-Support was added for _regcall calling convention.
+Support was added for ``_regcall`` calling convention.
-Existing __vectorcall calling convention support was extended to include
+Existing ``__vectorcall`` calling convention support was extended to include
 correct handling of HVAs.
-The __vectorcall calling convention was introduced by Microsoft to
+The ``__vectorcall`` calling convention was introduced by Microsoft to
 enhance register usage when passing parameters.
 For more information please read `__vectorcall documentation
 <https://msdn.microsoft.com/en-us/library/dn375768.aspx>`_.
-The __regcall calling convention was introduced by Intel to 
+The ``__regcall`` calling convention was introduced by Intel to
 optimize parameter transfer on function call.
-This calling convention ensures that as many values as possible are 
+This calling convention ensures that as many values as possible are
 passed or returned in registers.
 For more information please read `__regcall documentation
 <https://software.intel.com/en-us/node/693069>`_.
@@ -127,7 +103,7 @@ Code Generation Testing
 Passes that work on the machine instruction representation can be tested with
 the .mir serialization format. ``llc`` supports the ``-run-pass``,
-``-stop-after``, ``-stop-before``, ``-start-after``, ``-start-before`` to to
+``-stop-after``, ``-stop-before``, ``-start-after``, ``-start-before`` to
 run a single pass of the code generation pipeline, or to stop or start the code
 generation pipeline at a given point.
@@ -211,9 +187,6 @@ changes landed in this release.
  ``&*I`` (if not ``end()``); alternatively, clients may refactor to use
  references for known-good nodes.
 Changes to the LLVM IR
 ----------------------
 Changes to the ARM Targets
 --------------------------
@@ -244,28 +217,6 @@ Changes to the ARM Targets
 A lot of work has also been done in LLD for ARM, which now supports more
 relocations and TLS.
 Changes to the MIPS Target
 --------------------------
 During this release ...
 Changes to the PowerPC Target
 -----------------------------
 During this release ...
 Changes to the X86 Target
 -------------------------
 During this release ...
 Changes to the AMDGPU Target
 -----------------------------
 During this release ...
 Changes to the AVR Target
 -----------------------------
@@ -297,8 +248,6 @@ Changes to the OCaml bindings
 External Open Source Projects Using LLVM 4.0.0
 ==============================================
 * A project...
 LDC - the LLVM-based D compiler
 -------------------------------
--- a/include/llvm/Transforms/Vectorize/SLPVectorizer.h
+++ b/include/llvm/Transforms/Vectorize/SLPVectorizer.h
@@ -92,12 +92,6 @@ private:
  /// collected in GEPs.
  bool vectorizeGEPIndices(BasicBlock *BB, slpvectorizer::BoUpSLP &R);
  /// Try to find horizontal reduction or otherwise vectorize a chain of binary
  /// operators.
  bool vectorizeRootInstruction(PHINode *P, Value *V, BasicBlock *BB,
                                slpvectorizer::BoUpSLP &R,
                                TargetTransformInfo *TTI);
  /// \brief Scan the basic block and look for patterns that are likely to start
  /// a vectorization chain.
  bool vectorizeChainsInBlock(BasicBlock *BB, slpvectorizer::BoUpSLP &R);
--- a/lib/Target/AMDGPU/SIInstructions.td
+++ b/lib/Target/AMDGPU/SIInstructions.td
@@ -996,6 +996,11 @@ def : Pat <
  (V_CMP_EQ_U32_e64 (S_AND_B32 (i32 1), $a), (i32 1))
 >;
 def : Pat <
  (i1 (trunc i16:$a)),
  (V_CMP_EQ_U32_e64 (S_AND_B32 (i32 1), $a), (i32 1))
 >;
 def : Pat <
  (i1 (trunc i64:$a)),
  (V_CMP_EQ_U32_e64 (S_AND_B32 (i32 1),
--- a/lib/Target/AMDGPU/VOP1Instructions.td
+++ b/lib/Target/AMDGPU/VOP1Instructions.td
@@ -607,12 +607,6 @@ def : Pat<
  (COPY $src)
 >;
 def : Pat<
  (i1 (trunc i16:$src)),
  (COPY $src)
 >;
 def : Pat <
  (i16 (trunc i64:$src)),
  (EXTRACT_SUBREG $src, sub0)
--- a/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
+++ b/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
@@ -41,6 +41,8 @@ STATISTIC(NumSDivs,     "Number of sdiv converted to udiv");
 STATISTIC(NumAShrs,     "Number of ashr converted to lshr");
 STATISTIC(NumSRems,     "Number of srem converted to urem");
 static cl::opt<bool> DontProcessAdds("cvp-dont-process-adds", cl::init(true));
 namespace {
  class CorrelatedValuePropagation : public FunctionPass {
  public:
@@ -405,6 +407,9 @@ static bool processAShr(BinaryOperator *SDI, LazyValueInfo *LVI) {
 static bool processAdd(BinaryOperator *AddOp, LazyValueInfo *LVI) {
  typedef OverflowingBinaryOperator OBO;
  if (DontProcessAdds)
    return false;
  if (AddOp->getType()->isVectorTy() || hasLocalDefs(AddOp))
    return false;
--- a/lib/Transforms/Scalar/Reassociate.cpp
+++ b/lib/Transforms/Scalar/Reassociate.cpp
@@ -1521,8 +1521,8 @@ Value *ReassociatePass::OptimizeAdd(Instruction *I,
      if (ConstantInt *CI = dyn_cast<ConstantInt>(Factor)) {
        if (CI->isNegative() && !CI->isMinValue(true)) {
          Factor = ConstantInt::get(CI->getContext(), -CI->getValue());
-          assert(!Duplicates.count(Factor) &&
+          if (!Duplicates.insert(Factor).second)
-                 "Shouldn't have two constant factors, missed a canonicalize");
+            continue;
          unsigned Occ = ++FactorOccurrences[Factor];
          if (Occ > MaxOcc) {
            MaxOcc = Occ;
@@ -1534,8 +1534,8 @@ Value *ReassociatePass::OptimizeAdd(Instruction *I,
          APFloat F(CF->getValueAPF());
          F.changeSign();
          Factor = ConstantFP::get(CF->getContext(), F);
-          assert(!Duplicates.count(Factor) &&
+          if (!Duplicates.insert(Factor).second)
-                 "Shouldn't have two constant factors, missed a canonicalize");
+            continue;
          unsigned Occ = ++FactorOccurrences[Factor];
          if (Occ > MaxOcc) {
            MaxOcc = Occ;
--- a/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -4026,40 +4026,36 @@ bool SLPVectorizerPass::tryToVectorize(BinaryOperator *V, BoUpSLP &R) {
  if (!V)
    return false;
  Value *P = V->getParent();
  // Vectorize in current basic block only.
  auto *Op0 = dyn_cast<Instruction>(V->getOperand(0));
  auto *Op1 = dyn_cast<Instruction>(V->getOperand(1));
  if (!Op0 || !Op1 || Op0->getParent() != P || Op1->getParent() != P)
    return false;
  // Try to vectorize V.
-  if (tryToVectorizePair(Op0, Op1, R))
+  if (tryToVectorizePair(V->getOperand(0), V->getOperand(1), R))
    return true;
-  auto *A = dyn_cast<BinaryOperator>(Op0);
+  BinaryOperator *A = dyn_cast<BinaryOperator>(V->getOperand(0));
-  auto *B = dyn_cast<BinaryOperator>(Op1);
+  BinaryOperator *B = dyn_cast<BinaryOperator>(V->getOperand(1));
  // Try to skip B.
  if (B && B->hasOneUse()) {
-    auto *B0 = dyn_cast<BinaryOperator>(B->getOperand(0));
+    BinaryOperator *B0 = dyn_cast<BinaryOperator>(B->getOperand(0));
-    auto *B1 = dyn_cast<BinaryOperator>(B->getOperand(1));
+    BinaryOperator *B1 = dyn_cast<BinaryOperator>(B->getOperand(1));
-    if (B0 && B0->getParent() == P && tryToVectorizePair(A, B0, R))
+    if (tryToVectorizePair(A, B0, R)) {
      return true;
-    if (B1 && B1->getParent() == P && tryToVectorizePair(A, B1, R))
+    }
    if (tryToVectorizePair(A, B1, R)) {
      return true;
    }
  }
  // Try to skip A.
  if (A && A->hasOneUse()) {
-    auto *A0 = dyn_cast<BinaryOperator>(A->getOperand(0));
+    BinaryOperator *A0 = dyn_cast<BinaryOperator>(A->getOperand(0));
-    auto *A1 = dyn_cast<BinaryOperator>(A->getOperand(1));
+    BinaryOperator *A1 = dyn_cast<BinaryOperator>(A->getOperand(1));
-    if (A0 && A0->getParent() == P && tryToVectorizePair(A0, B, R))
+    if (tryToVectorizePair(A0, B, R)) {
      return true;
-    if (A1 && A1->getParent() == P && tryToVectorizePair(A1, B, R))
+    }
    if (tryToVectorizePair(A1, B, R)) {
      return true;
    }
  }
-  return false;
+  return 0;
 }
 /// \brief Generate a shuffle mask to be used in a reduction tree.
@@ -4511,143 +4507,29 @@ static Value *getReductionValue(const DominatorTree *DT, PHINode *P,
  return nullptr;
 }
 namespace {
 /// Tracks instructons and its children.
 class WeakVHWithLevel final : public CallbackVH {
  /// Operand index of the instruction currently beeing analized.
  unsigned Level = 0;
  /// Is this the instruction that should be vectorized, or are we now
  /// processing children (i.e. operands of this instruction) for potential
  /// vectorization?
  bool IsInitial = true;
 public:
  explicit WeakVHWithLevel() = default;
  WeakVHWithLevel(Value *V) : CallbackVH(V){};
  /// Restart children analysis each time it is repaced by the new instruction.
  void allUsesReplacedWith(Value *New) override {
    setValPtr(New);
    Level = 0;
    IsInitial = true;
  }
  /// Check if the instruction was not deleted during vectorization.
  bool isValid() const { return !getValPtr(); }
  /// Is the istruction itself must be vectorized?
  bool isInitial() const { return IsInitial; }
  /// Try to vectorize children.
  void clearInitial() { IsInitial = false; }
  /// Are all children processed already?
  bool isFinal() const {
    assert(getValPtr() &&
           (isa<Instruction>(getValPtr()) &&
            cast<Instruction>(getValPtr())->getNumOperands() >= Level));
    return getValPtr() &&
           cast<Instruction>(getValPtr())->getNumOperands() == Level;
  }
  /// Get next child operation.
  Value *nextOperand() {
    assert(getValPtr() && isa<Instruction>(getValPtr()) &&
           cast<Instruction>(getValPtr())->getNumOperands() > Level);
    return cast<Instruction>(getValPtr())->getOperand(Level++);
  }
  virtual ~WeakVHWithLevel() = default;
 };
 } // namespace
 /// \brief Attempt to reduce a horizontal reduction.
 /// If it is legal to match a horizontal reduction feeding
-/// the phi node P with reduction operators Root in a basic block BB, then check
+/// the phi node P with reduction operators BI, then check if it
-/// if it can be done.
+/// can be done.
 /// \returns true if a horizontal reduction was matched and reduced.
 /// \returns false if a horizontal reduction was not matched.
-static bool canBeVectorized(
+static bool canMatchHorizontalReduction(PHINode *P, BinaryOperator *BI,
-    PHINode *P, Instruction *Root, BasicBlock *BB, BoUpSLP &R,
+                                        BoUpSLP &R, TargetTransformInfo *TTI,
-    TargetTransformInfo *TTI,
+                                        unsigned MinRegSize) {
    const function_ref<bool(BinaryOperator *, BoUpSLP &)> Vectorize) {
  if (!ShouldVectorizeHor)
    return false;
-  if (!Root)
+  HorizontalReduction HorRdx(MinRegSize);
  if (!HorRdx.matchAssociativeReduction(P, BI))
    return false;
-  if (Root->getParent() != BB)
+  // If there is a sufficient number of reduction values, reduce
-    return false;
+  // to a nearby power-of-2. Can safely generate oversized
-  SmallVector<WeakVHWithLevel, 8> Stack(1, Root);
+  // vectors and rely on the backend to split them to legal sizes.
-  SmallSet<Value *, 8> VisitedInstrs;
+  HorRdx.ReduxWidth =
-  bool Res = false;
+    std::max((uint64_t)4, PowerOf2Floor(HorRdx.numReductionValues()));
  while (!Stack.empty()) {
    Value *V = Stack.back();
    if (!V) {
      Stack.pop_back();
      continue;
    }
    auto *Inst = dyn_cast<Instruction>(V);
    if (!Inst || isa<PHINode>(Inst)) {
      Stack.pop_back();
      continue;
    }
    if (Stack.back().isInitial()) {
      Stack.back().clearInitial();
      if (auto *BI = dyn_cast<BinaryOperator>(Inst)) {
        HorizontalReduction HorRdx(R.getMinVecRegSize());
        if (HorRdx.matchAssociativeReduction(P, BI)) {
          // If there is a sufficient number of reduction values, reduce
          // to a nearby power-of-2. Can safely generate oversized
          // vectors and rely on the backend to split them to legal sizes.
          HorRdx.ReduxWidth =
              std::max((uint64_t)4, PowerOf2Floor(HorRdx.numReductionValues()));
-          if (HorRdx.tryToReduce(R, TTI)) {
+  return HorRdx.tryToReduce(R, TTI);
            Res = true;
            P = nullptr;
            continue;
          }
        }
        if (P) {
          Inst = dyn_cast<Instruction>(BI->getOperand(0));
          if (Inst == P)
            Inst = dyn_cast<Instruction>(BI->getOperand(1));
          if (!Inst) {
            P = nullptr;
            continue;
          }
        }
      }
      P = nullptr;
      if (Vectorize(dyn_cast<BinaryOperator>(Inst), R)) {
        Res = true;
        continue;
      }
    }
    if (Stack.back().isFinal()) {
      Stack.pop_back();
      continue;
    }
    if (auto *NextV = dyn_cast<Instruction>(Stack.back().nextOperand()))
      if (NextV->getParent() == BB && VisitedInstrs.insert(NextV).second &&
          Stack.size() < RecursionMaxDepth)
        Stack.push_back(NextV);
  }
  return Res;
 }
 bool SLPVectorizerPass::vectorizeRootInstruction(PHINode *P, Value *V,
                                                 BasicBlock *BB, BoUpSLP &R,
                                                 TargetTransformInfo *TTI) {
  if (!V)
    return false;
  auto *I = dyn_cast<Instruction>(V);
  if (!I)
    return false;
  if (!isa<BinaryOperator>(I))
    P = nullptr;
  // Try to match and vectorize a horizontal reduction.
  return canBeVectorized(P, I, BB, R, TTI,
                         [this](BinaryOperator *BI, BoUpSLP &R) -> bool {
                           return tryToVectorize(BI, R);
                         });
 }
 bool SLPVectorizerPass::vectorizeChainsInBlock(BasicBlock *BB, BoUpSLP &R) {
@@ -4717,42 +4599,67 @@ bool SLPVectorizerPass::vectorizeChainsInBlock(BasicBlock *BB, BoUpSLP &R) {
      if (P->getNumIncomingValues() != 2)
        return Changed;
      Value *Rdx = getReductionValue(DT, P, BB, LI);
      // Check if this is a Binary Operator.
      BinaryOperator *BI = dyn_cast_or_null<BinaryOperator>(Rdx);
      if (!BI)
        continue;
      // Try to match and vectorize a horizontal reduction.
-      if (vectorizeRootInstruction(P, getReductionValue(DT, P, BB, LI), BB, R,
+      if (canMatchHorizontalReduction(P, BI, R, TTI, R.getMinVecRegSize())) {
                                   TTI)) {
        Changed = true;
        it = BB->begin();
        e = BB->end();
        continue;
      }
     Value *Inst = BI->getOperand(0);
      if (Inst == P)
        Inst = BI->getOperand(1);
      if (tryToVectorize(dyn_cast<BinaryOperator>(Inst), R)) {
        // We would like to start over since some instructions are deleted
        // and the iterator may become invalid value.
        Changed = true;
        it = BB->begin();
        e = BB->end();
        continue;
      }
      continue;
    }
-    if (ShouldStartVectorizeHorAtStore) {
+    if (ShouldStartVectorizeHorAtStore)
-      if (StoreInst *SI = dyn_cast<StoreInst>(it)) {
+      if (StoreInst *SI = dyn_cast<StoreInst>(it))
-        // Try to match and vectorize a horizontal reduction.
+        if (BinaryOperator *BinOp =
-        if (vectorizeRootInstruction(nullptr, SI->getValueOperand(), BB, R,
+                dyn_cast<BinaryOperator>(SI->getValueOperand())) {
-                                     TTI)) {
+          if (canMatchHorizontalReduction(nullptr, BinOp, R, TTI,
-          Changed = true;
+                                          R.getMinVecRegSize()) ||
-          it = BB->begin();
+              tryToVectorize(BinOp, R)) {
-          e = BB->end();
+            Changed = true;
-          continue;
+            it = BB->begin();
            e = BB->end();
            continue;
          }
        }
      }
    }
    // Try to vectorize horizontal reductions feeding into a return.
-    if (ReturnInst *RI = dyn_cast<ReturnInst>(it)) {
+    if (ReturnInst *RI = dyn_cast<ReturnInst>(it))
-      if (RI->getNumOperands() != 0) {
+      if (RI->getNumOperands() != 0)
-        // Try to match and vectorize a horizontal reduction.
+        if (BinaryOperator *BinOp =
-        if (vectorizeRootInstruction(nullptr, RI->getOperand(0), BB, R, TTI)) {
+                dyn_cast<BinaryOperator>(RI->getOperand(0))) {
-          Changed = true;
+          DEBUG(dbgs() << "SLP: Found a return to vectorize.\n");
-          it = BB->begin();
+          if (canMatchHorizontalReduction(nullptr, BinOp, R, TTI,
-          e = BB->end();
+                                          R.getMinVecRegSize()) ||
-          continue;
+              tryToVectorizePair(BinOp->getOperand(0), BinOp->getOperand(1),
                                 R)) {
            Changed = true;
            it = BB->begin();
            e = BB->end();
            continue;
          }
        }
      }
    }
    // Try to vectorize trees that start at compare instructions.
    if (CmpInst *CI = dyn_cast<CmpInst>(it)) {
@@ -4765,14 +4672,16 @@ bool SLPVectorizerPass::vectorizeChainsInBlock(BasicBlock *BB, BoUpSLP &R) {
        continue;
      }
-      for (int I = 0; I < 2; ++I) {
+      for (int i = 0; i < 2; ++i) {
-        if (vectorizeRootInstruction(nullptr, CI->getOperand(I), BB, R, TTI)) {
+        if (BinaryOperator *BI = dyn_cast<BinaryOperator>(CI->getOperand(i))) {
-          Changed = true;
+          if (tryToVectorizePair(BI->getOperand(0), BI->getOperand(1), R)) {
-          // We would like to start over since some instructions are deleted
+            Changed = true;
-          // and the iterator may become invalid value.
+            // We would like to start over since some instructions are deleted
-          it = BB->begin();
+            // and the iterator may become invalid value.
-          e = BB->end();
+            it = BB->begin();
-          break;
+            e = BB->end();
            break;
          }
        }
      }
      continue;
--- a/test/CodeGen/AMDGPU/trunc.ll
+++ b/test/CodeGen/AMDGPU/trunc.ll
@@ -1,13 +1,15 @@
-; RUN: llc -march=amdgcn -verify-machineinstrs< %s | FileCheck -check-prefix=SI %s
+; RUN: llc -march=amdgcn -verify-machineinstrs< %s | FileCheck -check-prefix=GCN -check-prefix=SI %s
 ; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs< %s | FileCheck -check-prefix=GCN -check-prefix=VI  %s
 ; RUN: llc -march=r600 -mcpu=cypress < %s | FileCheck -check-prefix=EG %s
 declare i32 @llvm.r600.read.tidig.x() nounwind readnone
 define void @trunc_i64_to_i32_store(i32 addrspace(1)* %out, i64 %in) {
-; SI-LABEL: {{^}}trunc_i64_to_i32_store:
+; GCN-LABEL: {{^}}trunc_i64_to_i32_store:
-; SI: s_load_dword [[SLOAD:s[0-9]+]], s[0:1], 0xb
+; GCN: s_load_dword [[SLOAD:s[0-9]+]], s[0:1],
-; SI: v_mov_b32_e32 [[VLOAD:v[0-9]+]], [[SLOAD]]
+; GCN: v_mov_b32_e32 [[VLOAD:v[0-9]+]], [[SLOAD]]
 ; SI: buffer_store_dword [[VLOAD]]
 ; VI: flat_store_dword v[{{[0-9:]+}}], [[VLOAD]]
 ; EG-LABEL: {{^}}trunc_i64_to_i32_store:
 ; EG: MEM_RAT_CACHELESS STORE_RAW T0.X, T1.X, 1
@@ -18,12 +20,14 @@ define void @trunc_i64_to_i32_store(i32 addrspace(1)* %out, i64 %in) {
  ret void
 }
-; SI-LABEL: {{^}}trunc_load_shl_i64:
+; GCN-LABEL: {{^}}trunc_load_shl_i64:
-; SI-DAG: s_load_dwordx2
+; GCN-DAG: s_load_dwordx2
-; SI-DAG: s_load_dword [[SREG:s[0-9]+]],
+; GCN-DAG: s_load_dword [[SREG:s[0-9]+]],
-; SI: s_lshl_b32 [[SHL:s[0-9]+]], [[SREG]], 2
+; GCN: s_lshl_b32 [[SHL:s[0-9]+]], [[SREG]], 2
-; SI: v_mov_b32_e32 [[VSHL:v[0-9]+]], [[SHL]]
+; GCN: v_mov_b32_e32 [[VSHL:v[0-9]+]], [[SHL]]
-; SI: buffer_store_dword [[VSHL]],
+; SI: buffer_store_dword [[VSHL]]
 ; VI: flat_store_dword v[{{[0-9:]+}}], [[VSHL]]
 define void @trunc_load_shl_i64(i32 addrspace(1)* %out, i64 %a) {
  %b = shl i64 %a, 2
  %result = trunc i64 %b to i32
@@ -31,15 +35,17 @@ define void @trunc_load_shl_i64(i32 addrspace(1)* %out, i64 %a) {
  ret void
 }
-; SI-LABEL: {{^}}trunc_shl_i64:
+; GCN-LABEL: {{^}}trunc_shl_i64:
 ; SI: s_load_dwordx2 s{{\[}}[[LO_SREG:[0-9]+]]:{{[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0xd
-; SI: s_lshl_b64 s{{\[}}[[LO_SHL:[0-9]+]]:{{[0-9]+\]}}, s{{\[}}[[LO_SREG]]:{{[0-9]+\]}}, 2
+; VI: s_load_dwordx2 s{{\[}}[[LO_SREG:[0-9]+]]:{{[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0x34
-; SI: s_add_u32 s[[LO_SREG2:[0-9]+]], s[[LO_SHL]],
+; GCN: s_lshl_b64 s{{\[}}[[LO_SHL:[0-9]+]]:{{[0-9]+\]}}, s{{\[}}[[LO_SREG]]:{{[0-9]+\]}}, 2
-; SI: v_mov_b32_e32 v[[LO_VREG:[0-9]+]], s[[LO_SREG2]]
+; GCN: s_add_u32 s[[LO_SREG2:[0-9]+]], s[[LO_SHL]],
-; SI: s_addc_u32
+; GCN: v_mov_b32_e32 v[[LO_VREG:[0-9]+]], s[[LO_SREG2]]
 ; GCN: s_addc_u32
 ; SI: buffer_store_dword v[[LO_VREG]],
-; SI: v_mov_b32_e32
+; VI: flat_store_dword v[{{[0-9:]+}}], v[[LO_VREG]]
-; SI: v_mov_b32_e32
+; GCN: v_mov_b32_e32
 ; GCN: v_mov_b32_e32
 define void @trunc_shl_i64(i64 addrspace(1)* %out2, i32 addrspace(1)* %out, i64 %a) {
  %aa = add i64 %a, 234 ; Prevent shrinking store.
  %b = shl i64 %aa, 2
@@ -49,9 +55,9 @@ define void @trunc_shl_i64(i64 addrspace(1)* %out2, i32 addrspace(1)* %out, i64
  ret void
 }
-; SI-LABEL: {{^}}trunc_i32_to_i1:
+; GCN-LABEL: {{^}}trunc_i32_to_i1:
-; SI: v_and_b32_e32 v{{[0-9]+}}, 1, v{{[0-9]+}}
+; GCN: v_and_b32_e32 v{{[0-9]+}}, 1, v{{[0-9]+}}
-; SI: v_cmp_eq_u32
+; GCN: v_cmp_eq_u32
 define void @trunc_i32_to_i1(i32 addrspace(1)* %out, i32 addrspace(1)* %ptr) {
  %a = load i32, i32 addrspace(1)* %ptr, align 4
  %trunc = trunc i32 %a to i1
@@ -60,9 +66,30 @@ define void @trunc_i32_to_i1(i32 addrspace(1)* %out, i32 addrspace(1)* %ptr) {
  ret void
 }
-; SI-LABEL: {{^}}sgpr_trunc_i32_to_i1:
+; GCN-LABEL: {{^}}trunc_i8_to_i1:
-; SI: s_and_b32 s{{[0-9]+}}, 1, s{{[0-9]+}}
+; GCN: v_and_b32_e32 v{{[0-9]+}}, 1, v{{[0-9]+}}
-; SI: v_cmp_eq_u32
+; GCN: v_cmp_eq_u32
 define void @trunc_i8_to_i1(i8 addrspace(1)* %out, i8 addrspace(1)* %ptr) {
  %a = load i8, i8 addrspace(1)* %ptr, align 4
  %trunc = trunc i8 %a to i1
  %result = select i1 %trunc, i8 1, i8 0
  store i8 %result, i8 addrspace(1)* %out, align 4
  ret void
 }
 ; GCN-LABEL: {{^}}sgpr_trunc_i16_to_i1:
 ; GCN: s_and_b32 s{{[0-9]+}}, 1, s{{[0-9]+}}
 ; GCN: v_cmp_eq_u32
 define void @sgpr_trunc_i16_to_i1(i16 addrspace(1)* %out, i16 %a) {
  %trunc = trunc i16 %a to i1
  %result = select i1 %trunc, i16 1, i16 0
  store i16 %result, i16 addrspace(1)* %out, align 4
  ret void
 }
 ; GCN-LABEL: {{^}}sgpr_trunc_i32_to_i1:
 ; GCN: s_and_b32 s{{[0-9]+}}, 1, s{{[0-9]+}}
 ; GCN: v_cmp_eq_u32
 define void @sgpr_trunc_i32_to_i1(i32 addrspace(1)* %out, i32 %a) {
  %trunc = trunc i32 %a to i1
  %result = select i1 %trunc, i32 1, i32 0
@@ -70,11 +97,12 @@ define void @sgpr_trunc_i32_to_i1(i32 addrspace(1)* %out, i32 %a) {
  ret void
 }
-; SI-LABEL: {{^}}s_trunc_i64_to_i1:
+; GCN-LABEL: {{^}}s_trunc_i64_to_i1:
 ; SI: s_load_dwordx2 s{{\[}}[[SLO:[0-9]+]]:{{[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0xb
-; SI: s_and_b32 [[MASKED:s[0-9]+]], 1, s[[SLO]]
+; VI: s_load_dwordx2 s{{\[}}[[SLO:[0-9]+]]:{{[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0x2c
-; SI: v_cmp_eq_u32_e64 s{{\[}}[[VLO:[0-9]+]]:[[VHI:[0-9]+]]], [[MASKED]], 1{{$}}
+; GCN: s_and_b32 [[MASKED:s[0-9]+]], 1, s[[SLO]]
-; SI: v_cndmask_b32_e64 {{v[0-9]+}}, -12, 63, s{{\[}}[[VLO]]:[[VHI]]]
+; GCN: v_cmp_eq_u32_e64 s{{\[}}[[VLO:[0-9]+]]:[[VHI:[0-9]+]]], [[MASKED]], 1{{$}}
 ; GCN: v_cndmask_b32_e64 {{v[0-9]+}}, -12, 63, s{{\[}}[[VLO]]:[[VHI]]]
 define void @s_trunc_i64_to_i1(i32 addrspace(1)* %out, i64 %x) {
  %trunc = trunc i64 %x to i1
  %sel = select i1 %trunc, i32 63, i32 -12
@@ -82,11 +110,12 @@ define void @s_trunc_i64_to_i1(i32 addrspace(1)* %out, i64 %x) {
  ret void
 }
-; SI-LABEL: {{^}}v_trunc_i64_to_i1:
+; GCN-LABEL: {{^}}v_trunc_i64_to_i1:
 ; SI: buffer_load_dwordx2 v{{\[}}[[VLO:[0-9]+]]:{{[0-9]+\]}}
-; SI: v_and_b32_e32 [[MASKED:v[0-9]+]], 1, v[[VLO]]
+; VI: flat_load_dwordx2 v{{\[}}[[VLO:[0-9]+]]:{{[0-9]+\]}}
-; SI: v_cmp_eq_u32_e32 vcc, 1, [[MASKED]]
+; GCN: v_and_b32_e32 [[MASKED:v[0-9]+]], 1, v[[VLO]]
-; SI: v_cndmask_b32_e64 {{v[0-9]+}}, -12, 63, vcc
+; GCN: v_cmp_eq_u32_e32 vcc, 1, [[MASKED]]
 ; GCN: v_cndmask_b32_e64 {{v[0-9]+}}, -12, 63, vcc
 define void @v_trunc_i64_to_i1(i32 addrspace(1)* %out, i64 addrspace(1)* %in) {
  %tid = call i32 @llvm.r600.read.tidig.x() nounwind readnone
  %gep = getelementptr i64, i64 addrspace(1)* %in, i32 %tid
--- a/test/Transforms/CorrelatedValuePropagation/add.ll
+++ b/test/Transforms/CorrelatedValuePropagation/add.ll
@@ -1,4 +1,4 @@
-; RUN: opt < %s -correlated-propagation -S | FileCheck %s
+; RUN: opt < %s -correlated-propagation -cvp-dont-process-adds=false -S | FileCheck %s
 ; CHECK-LABEL: @test0(
 define void @test0(i32 %a) {
--- a/test/Transforms/Reassociate/basictest.ll
+++ b/test/Transforms/Reassociate/basictest.ll
@@ -222,3 +222,23 @@ define i32 @test15(i32 %X1, i32 %X2, i32 %X3) {
 ; CHECK-LABEL: @test15
 ; CHECK: and i1 %A, %B
 }
 ; PR30256 - previously this asserted.
 ; CHECK-LABEL: @test16
 ; CHECK: %[[FACTOR:.*]] = mul i64 %a, -4
 ; CHECK-NEXT: %[[RES:.*]] = add i64 %[[FACTOR]], %b
 ; CHECK-NEXT: ret i64 %[[RES]]
 define i64 @test16(i1 %cmp, i64 %a, i64 %b) {
 entry:
  %shl = shl i64 %a, 1
  %shl.neg = sub i64 0, %shl
  br i1 %cmp, label %if.then, label %if.end
 if.then:                                          ; preds = %entry
  %add1 = add i64 %shl.neg, %shl.neg
  %add2 = add i64 %add1, %b
  ret i64 %add2
 if.end:                                           ; preds = %entry
  ret i64 0
 }
--- a/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
+++ b/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
@@ -12,25 +12,26 @@ define float @baz() {
 ; CHECK-NEXT:    [[TMP0:%.*]] = load i32, i32* @n, align 4
 ; CHECK-NEXT:    [[MUL:%.*]] = mul nsw i32 [[TMP0]], 3
 ; CHECK-NEXT:    [[CONV:%.*]] = sitofp i32 [[MUL]] to float
-; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x float>, <2 x float>* bitcast ([20 x float]* @arr to <2 x float>*), align 16
+; CHECK-NEXT:    [[TMP1:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x float>, <2 x float>* bitcast ([20 x float]* @arr1 to <2 x float>*), align 16
+; CHECK-NEXT:    [[TMP2:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP3:%.*]] = fmul fast <2 x float> [[TMP2]], [[TMP1]]
+; CHECK-NEXT:    [[MUL4:%.*]] = fmul fast float [[TMP2]], [[TMP1]]
-; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
+; CHECK-NEXT:    [[ADD:%.*]] = fadd fast float [[MUL4]], [[CONV]]
-; CHECK-NEXT:    [[ADD:%.*]] = fadd fast float [[TMP4]], [[CONV]]
+; CHECK-NEXT:    [[TMP3:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
+; CHECK-NEXT:    [[TMP4:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[ADD_1:%.*]] = fadd fast float [[TMP5]], [[ADD]]
+; CHECK-NEXT:    [[MUL4_1:%.*]] = fmul fast float [[TMP4]], [[TMP3]]
-; CHECK-NEXT:    [[TMP6:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 2) to <2 x float>*), align 8
+; CHECK-NEXT:    [[ADD_1:%.*]] = fadd fast float [[MUL4_1]], [[ADD]]
-; CHECK-NEXT:    [[TMP7:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 2) to <2 x float>*), align 8
+; CHECK-NEXT:    [[TMP5:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 2) to <2 x float>*), align 8
-; CHECK-NEXT:    [[TMP8:%.*]] = fmul fast <2 x float> [[TMP7]], [[TMP6]]
+; CHECK-NEXT:    [[TMP6:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 2) to <2 x float>*), align 8
-; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <2 x float> [[TMP8]], i32 0
+; CHECK-NEXT:    [[TMP7:%.*]] = fmul fast <2 x float> [[TMP6]], [[TMP5]]
-; CHECK-NEXT:    [[ADD_2:%.*]] = fadd fast float [[TMP9]], [[ADD_1]]
+; CHECK-NEXT:    [[TMP8:%.*]] = extractelement <2 x float> [[TMP7]], i32 0
-; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <2 x float> [[TMP8]], i32 1
+; CHECK-NEXT:    [[ADD_2:%.*]] = fadd fast float [[TMP8]], [[ADD_1]]
-; CHECK-NEXT:    [[ADD_3:%.*]] = fadd fast float [[TMP10]], [[ADD_2]]
+; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <2 x float> [[TMP7]], i32 1
 ; CHECK-NEXT:    [[ADD_3:%.*]] = fadd fast float [[TMP9]], [[ADD_2]]
 ; CHECK-NEXT:    [[ADD7:%.*]] = fadd fast float [[ADD_3]], [[CONV]]
-; CHECK-NEXT:    [[ADD19:%.*]] = fadd fast float [[TMP4]], [[ADD7]]
+; CHECK-NEXT:    [[ADD19:%.*]] = fadd fast float [[MUL4]], [[ADD7]]
-; CHECK-NEXT:    [[ADD19_1:%.*]] = fadd fast float [[TMP5]], [[ADD19]]
+; CHECK-NEXT:    [[ADD19_1:%.*]] = fadd fast float [[MUL4_1]], [[ADD19]]
-; CHECK-NEXT:    [[ADD19_2:%.*]] = fadd fast float [[TMP9]], [[ADD19_1]]
+; CHECK-NEXT:    [[ADD19_2:%.*]] = fadd fast float [[TMP8]], [[ADD19_1]]
-; CHECK-NEXT:    [[ADD19_3:%.*]] = fadd fast float [[TMP10]], [[ADD19_2]]
+; CHECK-NEXT:    [[ADD19_3:%.*]] = fadd fast float [[TMP9]], [[ADD19_2]]
 ; CHECK-NEXT:    store float [[ADD19_3]], float* @res, align 4
 ; CHECK-NEXT:    ret float [[ADD19_3]]
 ;
@ -69,37 +70,40 @@ define float @bazz() {
 ; CHECK-NEXT:    [[TMP0:%.*]] = load i32, i32* @n, align 4
 ; CHECK-NEXT:    [[MUL:%.*]] = mul nsw i32 [[TMP0]], 3
 ; CHECK-NEXT:    [[CONV:%.*]] = sitofp i32 [[MUL]] to float
-; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x float>, <2 x float>* bitcast ([20 x float]* @arr to <2 x float>*), align 16
+; CHECK-NEXT:    [[TMP1:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x float>, <2 x float>* bitcast ([20 x float]* @arr1 to <2 x float>*), align 16
+; CHECK-NEXT:    [[TMP2:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP3:%.*]] = fmul fast <2 x float> [[TMP2]], [[TMP1]]
+; CHECK-NEXT:    [[MUL4:%.*]] = fmul fast float [[TMP2]], [[TMP1]]
-; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
+; CHECK-NEXT:    [[ADD:%.*]] = fadd fast float [[MUL4]], [[CONV]]
-; CHECK-NEXT:    [[ADD:%.*]] = fadd fast float [[TMP4]], [[CONV]]
+; CHECK-NEXT:    [[TMP3:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
+; CHECK-NEXT:    [[TMP4:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[ADD_1:%.*]] = fadd fast float [[TMP5]], [[ADD]]
+; CHECK-NEXT:    [[MUL4_1:%.*]] = fmul fast float [[TMP4]], [[TMP3]]
-; CHECK-NEXT:    [[TMP6:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 2) to <2 x float>*), align 8
+; CHECK-NEXT:    [[ADD_1:%.*]] = fadd fast float [[MUL4_1]], [[ADD]]
-; CHECK-NEXT:    [[TMP7:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 2) to <2 x float>*), align 8
+; CHECK-NEXT:    [[TMP5:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 2), align 8
-; CHECK-NEXT:    [[TMP8:%.*]] = fmul fast <2 x float> [[TMP7]], [[TMP6]]
+; CHECK-NEXT:    [[TMP6:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 2), align 8
-; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <2 x float> [[TMP8]], i32 0
+; CHECK-NEXT:    [[MUL4_2:%.*]] = fmul fast float [[TMP6]], [[TMP5]]
-; CHECK-NEXT:    [[ADD_2:%.*]] = fadd fast float [[TMP9]], [[ADD_1]]
+; CHECK-NEXT:    [[ADD_2:%.*]] = fadd fast float [[MUL4_2]], [[ADD_1]]
-; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <2 x float> [[TMP8]], i32 1
+; CHECK-NEXT:    [[TMP7:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 3), align 4
-; CHECK-NEXT:    [[ADD_3:%.*]] = fadd fast float [[TMP10]], [[ADD_2]]
+; CHECK-NEXT:    [[TMP8:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 3), align 4
 ; CHECK-NEXT:    [[MUL4_3:%.*]] = fmul fast float [[TMP8]], [[TMP7]]
 ; CHECK-NEXT:    [[ADD_3:%.*]] = fadd fast float [[MUL4_3]], [[ADD_2]]
 ; CHECK-NEXT:    [[MUL5:%.*]] = shl nsw i32 [[TMP0]], 2
 ; CHECK-NEXT:    [[CONV6:%.*]] = sitofp i32 [[MUL5]] to float
 ; CHECK-NEXT:    [[ADD7:%.*]] = fadd fast float [[ADD_3]], [[CONV6]]
-; CHECK-NEXT:    [[TMP11:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 4) to <2 x float>*), align 16
+; CHECK-NEXT:    [[TMP9:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 4), align 16
-; CHECK-NEXT:    [[TMP12:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 4) to <2 x float>*), align 16
+; CHECK-NEXT:    [[TMP10:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 4), align 16
-; CHECK-NEXT:    [[TMP13:%.*]] = fmul fast <2 x float> [[TMP12]], [[TMP11]]
+; CHECK-NEXT:    [[MUL18:%.*]] = fmul fast float [[TMP10]], [[TMP9]]
-; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <2 x float> [[TMP13]], i32 0
+; CHECK-NEXT:    [[ADD19:%.*]] = fadd fast float [[MUL18]], [[ADD7]]
-; CHECK-NEXT:    [[ADD19:%.*]] = fadd fast float [[TMP14]], [[ADD7]]
+; CHECK-NEXT:    [[TMP11:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 5), align 4
-; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <2 x float> [[TMP13]], i32 1
+; CHECK-NEXT:    [[TMP12:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 5), align 4
-; CHECK-NEXT:    [[ADD19_1:%.*]] = fadd fast float [[TMP15]], [[ADD19]]
+; CHECK-NEXT:    [[MUL18_1:%.*]] = fmul fast float [[TMP12]], [[TMP11]]
-; CHECK-NEXT:    [[TMP16:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 6) to <2 x float>*), align 8
+; CHECK-NEXT:    [[ADD19_1:%.*]] = fadd fast float [[MUL18_1]], [[ADD19]]
-; CHECK-NEXT:    [[TMP17:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 6) to <2 x float>*), align 8
+; CHECK-NEXT:    [[TMP13:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 6) to <2 x float>*), align 8
-; CHECK-NEXT:    [[TMP18:%.*]] = fmul fast <2 x float> [[TMP17]], [[TMP16]]
+; CHECK-NEXT:    [[TMP14:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 6) to <2 x float>*), align 8
-; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <2 x float> [[TMP18]], i32 0
+; CHECK-NEXT:    [[TMP15:%.*]] = fmul fast <2 x float> [[TMP14]], [[TMP13]]
-; CHECK-NEXT:    [[ADD19_2:%.*]] = fadd fast float [[TMP19]], [[ADD19_1]]
+; CHECK-NEXT:    [[TMP16:%.*]] = extractelement <2 x float> [[TMP15]], i32 0
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x float> [[TMP18]], i32 1
+; CHECK-NEXT:    [[ADD19_2:%.*]] = fadd fast float [[TMP16]], [[ADD19_1]]
-; CHECK-NEXT:    [[ADD19_3:%.*]] = fadd fast float [[TMP20]], [[ADD19_2]]
+; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <2 x float> [[TMP15]], i32 1
 ; CHECK-NEXT:    [[ADD19_3:%.*]] = fadd fast float [[TMP17]], [[ADD19_2]]
 ; CHECK-NEXT:    store float [[ADD19_3]], float* @res, align 4
 ; CHECK-NEXT:    ret float [[ADD19_3]]
 ;
@ -151,20 +155,24 @@ define float @bazzz() {
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[TMP0:%.*]] = load i32, i32* @n, align 4
 ; CHECK-NEXT:    [[CONV:%.*]] = sitofp i32 [[TMP0]] to float
-; CHECK-NEXT:    [[TMP1:%.*]] = load <4 x float>, <4 x float>* bitcast ([20 x float]* @arr to <4 x float>*), align 16
+; CHECK-NEXT:    [[TMP1:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x float>, <4 x float>* bitcast ([20 x float]* @arr1 to <4 x float>*), align 16
+; CHECK-NEXT:    [[TMP2:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP3:%.*]] = fmul fast <4 x float> [[TMP2]], [[TMP1]]
+; CHECK-NEXT:    [[MUL:%.*]] = fmul fast float [[TMP2]], [[TMP1]]
-; CHECK-NEXT:    [[TMP4:%.*]] = fadd fast float undef, undef
+; CHECK-NEXT:    [[TMP3:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[TMP5:%.*]] = fadd fast float undef, [[TMP4]]
+; CHECK-NEXT:    [[TMP4:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[RDX_SHUF:%.*]] = shufflevector <4 x float> [[TMP3]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
+; CHECK-NEXT:    [[MUL_1:%.*]] = fmul fast float [[TMP4]], [[TMP3]]
-; CHECK-NEXT:    [[BIN_RDX:%.*]] = fadd fast <4 x float> [[TMP3]], [[RDX_SHUF]]
+; CHECK-NEXT:    [[TMP5:%.*]] = fadd fast float [[MUL_1]], [[MUL]]
-; CHECK-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
+; CHECK-NEXT:    [[TMP6:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 2), align 8
-; CHECK-NEXT:    [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
+; CHECK-NEXT:    [[TMP7:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 2), align 8
-; CHECK-NEXT:    [[TMP6:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
+; CHECK-NEXT:    [[MUL_2:%.*]] = fmul fast float [[TMP7]], [[TMP6]]
-; CHECK-NEXT:    [[TMP7:%.*]] = fadd fast float undef, [[TMP5]]
+; CHECK-NEXT:    [[TMP8:%.*]] = fadd fast float [[MUL_2]], [[TMP5]]
-; CHECK-NEXT:    [[TMP8:%.*]] = fmul fast float [[CONV]], [[TMP6]]
+; CHECK-NEXT:    [[TMP9:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 3), align 4
-; CHECK-NEXT:    store float [[TMP8]], float* @res, align 4
+; CHECK-NEXT:    [[TMP10:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 3), align 4
-; CHECK-NEXT:    ret float [[TMP8]]
+; CHECK-NEXT:    [[MUL_3:%.*]] = fmul fast float [[TMP10]], [[TMP9]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = fadd fast float [[MUL_3]], [[TMP8]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = fmul fast float [[CONV]], [[TMP11]]
 ; CHECK-NEXT:    store float [[TMP12]], float* @res, align 4
 ; CHECK-NEXT:    ret float [[TMP12]]
 ;
 entry:
  %0 = load i32, i32* @n, align 4
@ -194,19 +202,23 @@ define i32 @foo() {
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[TMP0:%.*]] = load i32, i32* @n, align 4
 ; CHECK-NEXT:    [[CONV:%.*]] = sitofp i32 [[TMP0]] to float
-; CHECK-NEXT:    [[TMP1:%.*]] = load <4 x float>, <4 x float>* bitcast ([20 x float]* @arr to <4 x float>*), align 16
+; CHECK-NEXT:    [[TMP1:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x float>, <4 x float>* bitcast ([20 x float]* @arr1 to <4 x float>*), align 16
+; CHECK-NEXT:    [[TMP2:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 0), align 16
-; CHECK-NEXT:    [[TMP3:%.*]] = fmul fast <4 x float> [[TMP2]], [[TMP1]]
+; CHECK-NEXT:    [[MUL:%.*]] = fmul fast float [[TMP2]], [[TMP1]]
-; CHECK-NEXT:    [[TMP4:%.*]] = fadd fast float undef, undef
+; CHECK-NEXT:    [[TMP3:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[TMP5:%.*]] = fadd fast float undef, [[TMP4]]
+; CHECK-NEXT:    [[TMP4:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 1), align 4
-; CHECK-NEXT:    [[RDX_SHUF:%.*]] = shufflevector <4 x float> [[TMP3]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
+; CHECK-NEXT:    [[MUL_1:%.*]] = fmul fast float [[TMP4]], [[TMP3]]
-; CHECK-NEXT:    [[BIN_RDX:%.*]] = fadd fast <4 x float> [[TMP3]], [[RDX_SHUF]]
+; CHECK-NEXT:    [[TMP5:%.*]] = fadd fast float [[MUL_1]], [[MUL]]
-; CHECK-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
+; CHECK-NEXT:    [[TMP6:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 2), align 8
-; CHECK-NEXT:    [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
+; CHECK-NEXT:    [[TMP7:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 2), align 8
-; CHECK-NEXT:    [[TMP6:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
+; CHECK-NEXT:    [[MUL_2:%.*]] = fmul fast float [[TMP7]], [[TMP6]]
-; CHECK-NEXT:    [[TMP7:%.*]] = fadd fast float undef, [[TMP5]]
+; CHECK-NEXT:    [[TMP8:%.*]] = fadd fast float [[MUL_2]], [[TMP5]]
-; CHECK-NEXT:    [[TMP8:%.*]] = fmul fast float [[CONV]], [[TMP6]]
+; CHECK-NEXT:    [[TMP9:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 3), align 4
-; CHECK-NEXT:    [[CONV4:%.*]] = fptosi float [[TMP8]] to i32
+; CHECK-NEXT:    [[TMP10:%.*]] = load float, float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 3), align 4
 ; CHECK-NEXT:    [[MUL_3:%.*]] = fmul fast float [[TMP10]], [[TMP9]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = fadd fast float [[MUL_3]], [[TMP8]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = fmul fast float [[CONV]], [[TMP11]]
 ; CHECK-NEXT:    [[CONV4:%.*]] = fptosi float [[TMP12]] to i32
 ; CHECK-NEXT:    store i32 [[CONV4]], i32* @n, align 4
 ; CHECK-NEXT:    ret i32 [[CONV4]]
 ;