-
Notifications
You must be signed in to change notification settings - Fork 24.8k
Open
Labels
module: inductoroncall: pt2triagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Description
🐛 Describe the bug
I'm debugging the perf loss of a longformer model for PR (#120758). One thing I found is, for the following log-softmax kernel, padding the output of a log_softmax kernel slow it down from 1.1ms to 1.6ms. This is counter-intuitive since in general padding should help here. Full runnable kernel: https://gist.github.com/shunting314/20c3249c9b206f1abc8f7c9db208d712 .
@triton.jit
def triton_red_fused__log_softmax_6(in_ptr0, out_ptr2, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
xnumel = 4096
rnumel = 50265
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
rbase = tl.arange(0, RBLOCK)[None, :]
x0 = xindex
_tmp3 = tl.full([XBLOCK, RBLOCK], float("-inf"), tl.float32)
for roffset in range(0, rnumel, RBLOCK):
rindex = roffset + rbase
rmask = rindex < rnumel
r1 = rindex
tmp0 = tl.load(in_ptr0 + (r1 + (50272*x0)), rmask, eviction_policy='evict_last', other=0.0).to(tl.float32)
tmp1 = tmp0.to(tl.float32)
tmp2 = tl.broadcast_to(tmp1, [XBLOCK, RBLOCK])
tmp4 = triton_helpers.maximum(_tmp3, tmp2)
_tmp3 = tl.where(rmask, tmp4, _tmp3)
tmp3 = triton_helpers.max2(_tmp3, 1)[:, None]
_tmp10 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
for roffset in range(0, rnumel, RBLOCK):
rindex = roffset + rbase
rmask = rindex < rnumel
r1 = rindex
tmp5 = tl.load(in_ptr0 + (r1 + (50272*x0)), rmask, eviction_policy='evict_last', other=0.0).to(tl.float32)
tmp6 = tmp5.to(tl.float32)
tmp7 = tmp6 - tmp3
tmp8 = tl_math.exp(tmp7)
tmp9 = tl.broadcast_to(tmp8, [XBLOCK, RBLOCK])
tmp11 = _tmp10 + tmp9
_tmp10 = tl.where(rmask, tmp11, _tmp10)
tmp10 = tl.sum(_tmp10, 1)[:, None]
for roffset in range(0, rnumel, RBLOCK):
rindex = roffset + rbase
rmask = rindex < rnumel
r1 = rindex
tmp12 = tl.load(in_ptr0 + (r1 + (50272*x0)), rmask, eviction_policy='evict_first', other=0.0).to(tl.float32)
tmp13 = tmp12.to(tl.float32)
tmp14 = tmp13 - tmp3
tmp15 = tl_math.log(tmp10)
tmp16 = tmp14 - tmp15
tmp17 = tmp16.to(tl.float32)
# 1.6ms with this
tl.store(out_ptr2 + (r1 + (50272*x0)), tmp17, rmask)
# 1.1ms with this
# tl.store(out_ptr2 + (r1 + (50265*x0)), tmp17, rmask)
cc @ezyang @msaroufim @bdhirsh @anijain2305 @zou3519 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @jansel
cc @htyu not sure if it's due to triton or not.
Error logs
No response
Minified repro
No response
Versions
..
Metadata
Metadata
Assignees
Labels
module: inductoroncall: pt2triagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module