Top CUDA kernels by self CUDA time. All rows report 0.00% / 0.000us for Self CPU %, Self CPU, CPU total %, CPU total, and CPU time avg, and CUDA total equals Self CUDA in every row; those columns are folded into this note. Very long C++ template names are truncated with "...".

-------------------------------------------------------------------------------------------------------------  ------------  ------------  --------------  ----------
Name                                                                                                            Self CUDA     Self CUDA %   CUDA time avg   # of Calls
-------------------------------------------------------------------------------------------------------------  ------------  ------------  --------------  ----------
_ZN7cutlass6KernelINS_4gemm6kernel8DualGemmI... (mangled; bf16 DualGemm, EpilogueLHSActivationAndMul<SiLu>)     9.899s        16.95%        412.473us       24000
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_32x3_tn                                                     7.662s        13.12%        79.808us        96000
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_32x3_nn                                                     6.008s        10.29%        150.200us       40000
ampere_bf16_s1688gemm_bf16_128x128_ldg8_f2f_stages_32x1_nt                                                      5.317s        9.11%         332.333us       16000
void cutlass::Kernel2(cutlass_80_tensorop_bf16_s16816gemm_relu_bf16_128x256_32x3_tn_align8::Params)             5.008s        8.58%         208.668us       24000
ampere_bf16_s16816gemm_bf16_128x128_ldg8_f2f_stages_32x5_nt                                                     3.183s        5.45%         99.468us        32000
fmha_cutlassB_bf16_aligned_64x64_k64_seqaligned_sm80(PyTorchMemEffAttention::AttentionBackwardKernel::Params)   1.976s        3.38%         247.053us       8000
void at::native::elementwise_kernel<128, 2, ...c10::complex... MulFunctor...>                                   1.929s        3.30%         30.137us        64000
ampere_bf16_s1688gemm_bf16_128x128_ldg8_f2f_stages_32x1_nn                                                      1.873s        3.21%         234.165us       8000
fmha_cutlassF_bf16_aligned_64x64_rf_sm80(PyTorchMemEffAttention::AttentionKernel::Params)                       1.708s        2.93%         71.187us        24000
void at::native::unrolled_elementwise_kernel<...direct_copy_kernel_cuda...>                                     1.540s        2.64%         16.923us        91000
void at::native::vectorized_elementwise_kernel<4, at::native::CUDAFunctor_add, ...>                             1.496s        2.56%         18.696us        80000
void at::native::vectorized_elementwise_kernel<4, ...bfloat16_copy_kernel_cuda...>                              1.356s        2.32%         20.862us        65000
void at::native::multi_tensor_apply_kernel<...BinaryOpScalarFunctor, std::multiplies...>                        1.137s        1.95%         113.742us       10000
void at::native::multi_tensor_apply_kernel<...PointwiseOpScalarListFunctor, std::divides...>                    1.105s        1.89%         220.933us       5000
void at::native::unrolled_elementwise_kernel_for_multi_outputs<3, ...silu_bw_fused...>                          931.710ms     1.60%         116.464us       8000
void at::native::multi_tensor_apply_kernel<...PointwiseOpScalarFunctor, std::multiplies...>                     836.978ms     1.43%         167.396us       5000
void at::native::multi_tensor_apply_kernel<...TernaryOpScalarFunctor, at::native::LerpFunctor...>               829.295ms     1.42%         165.859us       5000
void at::native::vectorized_layer_norm_kernel(int, float, c10::BFloat16 const*, ...)                            675.114ms     1.16%         13.238us        51000
void at::native::multi_tensor_apply_kernel<...BinaryOpScalarListFunctor, std::divides...>                       570.153ms     0.98%         114.031us       5000
-------------------------------------------------------------------------------------------------------------  ------------  ------------  --------------  ----------
Self CPU time total: 35.994s
Self CUDA time total: 58.393s
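As a sanity check, the per-row numbers in the table are internally consistent: CUDA time avg = Self CUDA / # of Calls, and Self CUDA % = Self CUDA / 58.393 s (the reported self CUDA total). A quick verification over a few rows, with values copied from the table:

```python
# (self_cuda_seconds, self_cuda_pct, cuda_time_avg_us, num_calls) from the table
rows = [
    (9.899, 16.95, 412.473, 24000),    # CUTLASS DualGemm kernel (top row)
    (7.662, 13.12, 79.808, 96000),     # ampere_bf16_s16816gemm ... 32x3_tn
    (0.570153, 0.98, 114.031, 5000),   # multi_tensor_apply ... divides (last row)
]
TOTAL_CUDA_S = 58.393  # "Self CUDA time total" reported by the profiler

for self_s, pct, avg_us, calls in rows:
    # average per call, converted to microseconds
    assert abs(self_s * 1e6 / calls - avg_us) < 0.5
    # share of total self CUDA time, in percent
    assert abs(100 * self_s / TOTAL_CUDA_S - pct) < 0.05
print("table rows are self-consistent")
```

The small tolerances absorb the rounding the profiler applies when printing (three decimal places for times, two for percentages).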
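A table in this format is produced by PyTorch's profiler via key_averages().table(). The original invocation isn't shown, so as an assumption: the run above would have enabled ProfilerActivity.CUDA and sorted by self_cuda_time_total. A minimal CPU-only sketch of the same mechanism:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy workload to profile. The original run would also pass
# ProfilerActivity.CUDA and sort by "self_cuda_time_total" (assumption:
# the actual profiling setup is not shown in the output above).
x = torch.randn(256, 256)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        x @ x

# Aggregate events by kernel/op name and print a table like the one above.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

With CUDA enabled, the same call yields the Self CUDA, Self CUDA %, CUDA time avg, and # of Calls columns seen here.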