
Relation-aware Global Attention for Person Re-identification

Tags: Re-identification
Created: 2021/01/31 08:20
Publication: CVPR'20
Rate: 3
Source: https://arxiv.org/abs/1904.02998
Summary: (see the detail page)

Reference

As in most areas of computer vision, and in person re-identification especially, a model's attention capability plays a key role, because factors such as background clutter and pose variation across multi-camera views strongly affect feature learning.
A common way to strengthen attention is to design a dedicated convolutional block (module) and insert it into the network; representative examples are CBAM (Convolutional Block Attention Module) and the NL (Non-Local) module. CBAM uses a relatively large 7 x 7 kernel, but it is still a local attention module: it derives attention from a local receptive field only. The Non-Local module does exploit global feature relations, but it uses them merely as weights (the output is a weighted summation with the relations as weights). Moreover, experiments in another paper show that these relations are invariant to spatial information, that is, they fail to reflect where a feature is located.
The paper argues that effective attention must consider global feature relations and must properly reflect spatial information, and it proposes a new module, RGA (Relation-aware Global Attention), to overcome the limitations of the modules above.

RGA

The RGA module is split into two independent modules, RGA-S for spatial attention and RGA-C for channel attention. Both share the same mechanism, so describing one is enough; these notes cover RGA-S.
Consider a feature map $X \in \mathbb{R}^{C \times H \times W}$. For spatial attention, it is rearranged into a set of $N = H \times W$ feature nodes (vectors) of dimension $C$, $\{ x_i \mid i = 1, 2, \dots, N \}$. The relation between two features $x_i$ and $x_j$ can then be written as follows.
$$r_{i, j} = f_s(x_i, x_j) = \theta_s(x_i)^T \phi_s(x_j)$$
Here $\theta_s(x_i) = \text{ReLU}(\text{BN}(W_\theta x_i))$ and $\phi_s(x_i) = \text{ReLU}(\text{BN}(W_\phi x_i))$, and $W_\theta, W_\phi \in \mathbb{R}^{C/s_1 \times C}$ are parameters shared across all spatial positions. Because the two parameters are learned independently, the relation is directional: in general $r_{i, j} \neq r_{j, i}$.
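The pairwise relation above can be sketched in NumPy as follows. This is only an illustration: batch normalization is omitted, the weights are random rather than learned, the toy sizes are arbitrary, and the helper names `embed` and `pairwise_relation` are hypothetical.

```python
import numpy as np

def embed(nodes, weight):
    """ReLU(W x) for every node; BN is omitted in this sketch (assumption)."""
    return np.maximum(nodes @ weight.T, 0.0)

def pairwise_relation(feature_map, w_theta, w_phi):
    C, H, W = feature_map.shape
    nodes = feature_map.reshape(C, H * W).T   # (N, C): one C-dim node per position
    theta = embed(nodes, w_theta)             # (N, C/s1)
    phi = embed(nodes, w_phi)                 # (N, C/s1)
    return theta @ phi.T                      # R[i, j] = theta(x_i)^T phi(x_j)

rng = np.random.default_rng(0)
C, H, W, s1 = 16, 4, 2, 8
X = rng.standard_normal((C, H, W))
W_theta = rng.standard_normal((C // s1, C))
W_phi = rng.standard_normal((C // s1, C))
R = pairwise_relation(X, W_theta, W_phi)
print(R.shape)  # (8, 8)
```

Since `W_theta` and `W_phi` are separate matrices, `R` is generally not symmetric, which is what makes the relation directional.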
Collect these directional relation values into a matrix $R_s \in \mathbb{R}^{N \times N}$. The global relation of the $i$-th feature is then the stack of every relation that involves it, $r_i = [R_s(i, :), R_s(:, i)] \in \mathbb{R}^{2N}$.
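A small worked example of the stacking step, assuming the relation matrix is already available as a NumPy array (the function name `stack_relations` is hypothetical):

```python
import numpy as np

def stack_relations(R):
    # Row i of R holds relations from node i to every node;
    # column i holds relations from every node to node i.
    return np.concatenate([R, R.T], axis=1)   # (N, 2N), row i is r_i

R = np.arange(9, dtype=float).reshape(3, 3)
r = stack_relations(R)
print(r.shape)  # (3, 6)
print(r[1])     # [3. 4. 5. 1. 4. 7.]
```

For node 1, the first half `[3, 4, 5]` is row 1 of `R` and the second half `[1, 4, 7]` is column 1, so each $r_i$ has length $2N$.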
To compute spatial attention the paper also takes the original feature values into account, so the original feature $x_i$ is concatenated with $r_i$. Since the two vectors live in different vector spaces, each is first mapped by its own embedding and then concatenated as follows.
$$\tilde{y}_i = [\text{pool}_c(\psi_s(x_i)), \varphi_s(r_i)]$$
Here $\psi_s(x_i) = \text{ReLU}(\text{BN}(W_\psi x_i))$ with $W_\psi \in \mathbb{R}^{C/s_1 \times C}$, and $\varphi_s(r_i) = \text{ReLU}(\text{BN}(W_\varphi r_i))$ with $W_\varphi \in \mathbb{R}^{2N/(2 s_1) \times 2N}$. $\text{pool}_c$ denotes global average pooling (GAP) along the channel dimension, so the appearance part contributes a single value and $\tilde{y}_i \in \mathbb{R}^{1 + N/s_1}$.
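The embedding and concatenation can be sketched as below. Again BN is left out, the inputs and weights are random stand-ins, $s_1 = 4$ is used only to keep the toy shapes non-trivial, and `relu_linear` and `relation_aware_feature` are hypothetical names.

```python
import numpy as np

def relu_linear(x, weight):
    """ReLU(W x) for a batch of row vectors; BN is left out of this sketch."""
    return np.maximum(x @ weight.T, 0.0)

def relation_aware_feature(nodes, relations, w_psi, w_varphi):
    # pool_c: average the embedded appearance feature over its channels,
    # leaving one value per node.
    appearance = relu_linear(nodes, w_psi).mean(axis=1, keepdims=True)  # (N, 1)
    relation = relu_linear(relations, w_varphi)                         # (N, N/s1)
    return np.concatenate([appearance, relation], axis=1)               # (N, 1 + N/s1)

rng = np.random.default_rng(0)
N, C, s1 = 8, 16, 4
nodes = rng.standard_normal((N, C))
relations = rng.standard_normal((N, 2 * N))    # stand-in for the stacked r_i
W_psi = rng.standard_normal((C // s1, C))
W_varphi = rng.standard_normal((2 * N // (2 * s1), 2 * N))
y = relation_aware_feature(nodes, relations, W_psi, W_varphi)
print(y.shape)  # (8, 3)
```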
Finally, the attention value is computed from $\tilde{y}_i$ as follows. $W_1$ reduces the dimension by a ratio of $s_2$, and $W_2$ reduces it to 1.
$$a_i = \text{sigmoid}(W_2 \text{ReLU}(W_1 \tilde{y}_i))$$
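A minimal sketch of this last step, assuming the relation-aware features are given as an `(N, D)` array. The sizes, the random weights, and the function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(y, w1, w2):
    hidden = np.maximum(y @ w1.T, 0.0)    # (N, D/s2): W_1 shrinks the dimension
    return sigmoid(hidden @ w2.T)[:, 0]   # (N,): W_2 maps each node to one score

rng = np.random.default_rng(0)
N, D, s2 = 8, 4, 2
y = rng.standard_normal((N, D))
W1 = rng.standard_normal((D // s2, D))
W2 = rng.standard_normal((1, D // s2))
a = spatial_attention(y, W1, W2)
print(a.shape)  # (8,)
```

Each position gets one attention value between 0 and 1, so for RGA-S the result can be viewed as an $H \times W$ map after reshaping.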
The extension to RGA-C works the same way: simply treat the initial feature map as $C$ feature nodes of dimension $W \times H$.
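The only difference between the two variants is which axis is treated as the set of nodes. A short sketch of the two reshapes (helper names are hypothetical):

```python
import numpy as np

def spatial_nodes(feature_map):
    C, H, W = feature_map.shape
    return feature_map.reshape(C, H * W).T   # (H*W, C): one node per position

def channel_nodes(feature_map):
    C, H, W = feature_map.shape
    return feature_map.reshape(C, H * W)     # (C, H*W): one node per channel

X = np.arange(24, dtype=float).reshape(2, 3, 4)
print(spatial_nodes(X).shape, channel_nodes(X).shape)  # (12, 2) (2, 12)
```

With channel nodes the relation matrix becomes $C \times C$ instead of $N \times N$, and the resulting attention has one value per channel.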

Implementation

For a fair comparison, the backbone is ResNet-50, which can be regarded as the standard backbone. The down-sampling in the last stage (stage 4) is removed, and an RGA-SC module is placed after each of stages 1 to 4. The ratios are set to $s_1 = s_2 = 8$.
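To get a feel for the sizes involved, the sketch below walks through the feature-map shapes an RGA module would see after each ResNet-50 stage. The 256 x 128 input resolution is an assumption (it is not stated in these notes), and `rga_shapes` is a hypothetical helper; the channel counts and strides are those of a standard ResNet-50 with the last stage's stride set to 1.

```python
def rga_shapes(height, width, s1=8):
    """Per-stage shapes for ResNet-50 with the stage-4 down-sampling removed."""
    stem_stride = 4                      # stride-2 conv followed by stride-2 max-pool
    channels = [256, 512, 1024, 2048]    # output channels of stages 1 to 4
    strides = [1, 2, 2, 1]               # last stage keeps its resolution
    h, w = height // stem_stride, width // stem_stride
    rows = []
    for stage, (c, s) in enumerate(zip(channels, strides), start=1):
        h, w = h // s, w // s
        n = h * w                        # number of spatial nodes N
        rows.append({"stage": stage, "C": c, "H": h, "W": w,
                     "N": n, "C/s1": c // s1, "N/s1": n // s1})
    return rows

for row in rga_shapes(256, 128):
    print(row)
print([row["N"] for row in rga_shapes(256, 128)])  # [2048, 512, 128, 128]
```

Under this assumed input size, the spatial relation matrix after stage 1 would be 2048 x 2048, which shows why the embeddings reduce dimensions by $s_1$.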
The loss combines an identity loss (classification loss) with label smoothing and a triplet loss with hard mining. Re-ranking is not used.
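The two loss terms can be sketched in NumPy as below. The smoothing factor `epsilon=0.1`, the margin `0.3`, the batch-hard form of hard mining, and the function names are assumptions for illustration; the notes do not give these details.

```python
import numpy as np

def label_smoothing_cross_entropy(logits, labels, epsilon=0.1):
    """Cross-entropy against targets (1 - eps) * one_hot + eps / K."""
    num_classes = logits.shape[1]
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    targets = np.full_like(log_probs, epsilon / num_classes)
    targets[np.arange(len(labels)), labels] += 1.0 - epsilon
    return float(-(targets * log_probs).sum(axis=1).mean())

def batch_hard_triplet_loss(embeddings, ids, margin=0.3):
    """For each anchor, use its farthest positive and its closest negative."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))                     # (B, B) distances
    same = ids[:, None] == ids[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.maximum(margin + hardest_pos - hardest_neg, 0.0).mean())

logits = np.zeros((2, 4))
labels = np.array([1, 3])
print(label_smoothing_cross_entropy(logits, labels))

embeddings = np.array([[0.0], [4.0], [1.0], [5.0]])
ids = np.array([0, 0, 1, 1])
print(batch_hard_triplet_loss(embeddings, ids))
```

In the toy batch every anchor has its hardest positive at distance 4 and its hardest negative at distance 1, so each term is `0.3 + 4 - 1`.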
The paper reports SOTA results on the Market, CUHK, and MSMT datasets, and demonstrates the effectiveness of the RGA module through extensive ablation studies and comparisons with the CBAM and NL modules.
I found it quite interesting that the experiments demonstrate the module's effect from several angles, and I enjoyed how systematically the module itself is put together. I think it feels systematic because the paper offers a rationale for each mechanism (?), even where it does not fully prove it. I also enjoyed it because I personally believe attention will play a very important role not only in re-id but in many other tasks. It would be fun to run various experiments with this module.
Another interesting point is that the CAM visualizations show the model not focusing on the face. The paper conjectures that this is because the low resolution makes it impossible to obtain discriminability from the face. I am noting it here because it seems like a useful hint for future work.
E.O.D.