🔺

Feature Pyramid Networks for Object Detection

Tags
Model Architecture
Object Detection
Created
2021/01/31 10:47
Publication
CVPR'17
Rate
3
Source
https://arxiv.org/abs/1612.03144
Summary
(See the detail page)

Scale invariance is an important problem in object detection. Put simply, it means detection performance stays the same regardless of object size. For example, whether a person stands close to the camera and appears very large, or stands far away and appears very small, the detector should recognize them as a person equally well.
Earlier, mostly in the era of hand-engineered features, an image pyramid was used to obtain this scale invariance. Put simply, the same analysis (for example, with a fixed window size) is repeated while the scale of the image is increased, that is, while the image is progressively shrunk. (A larger scale can be understood as a wider field of view.)
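As a rough illustration of the image pyramid idea, here is a minimal, dependency-free sketch. The helper names (`downsample2x`, `image_pyramid`, `count_windows`), the 2x2 averaging, and the toy 64x64 image are all assumptions made for this example; classical detectors typically used finer scale steps and proper smoothing filters.

```python
def downsample2x(img):
    """Halve a grayscale image (a list of rows) by averaging 2x2 blocks."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def image_pyramid(img, min_size):
    """Return [img, img/2, img/4, ...] while the shorter side stays >= min_size."""
    levels = [img]
    while min(len(levels[-1]), len(levels[-1][0])) // 2 >= min_size:
        levels.append(downsample2x(levels[-1]))
    return levels


def count_windows(img, window, stride):
    """Number of fixed-size sliding windows that fit into the image."""
    h, w = len(img), len(img[0])
    return ((h - window) // stride + 1) * ((w - window) // stride + 1)


# Toy 64x64 image (hypothetical data, only the shape matters here).
image = [[float((i + j) % 256) for j in range(64)] for i in range(64)]
pyramid = image_pyramid(image, min_size=8)

# The same 8x8 window is slid over every level: on the small levels it
# covers a large part of the scene, so it can match large objects.
sizes = [(len(level), len(level[0])) for level in pyramid]
windows = [count_windows(level, window=8, stride=4) for level in pyramid]
print(sizes)    # [(64, 64), (32, 32), (16, 16), (8, 8)]
print(windows)  # [225, 49, 9, 1]
```

Every level has to be stored and analyzed separately, which is where the compute and memory cost of this approach comes from.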
However, such an image pyramid is very expensive in both compute and memory. Convolutional networks emerged against this background, and they are by themselves fairly robust to scale variation. Even so, the paper argues that, now that convolutional networks already perform very well, pushing accuracy further still requires a scale-invariant design built on a pyramid structure.
In any case, a CNN naturally has a pyramid structure. FPN builds a scale-invariant architecture by using the features at different scales that this pyramid already provides. Its structure is divided into two parts: bottom-up and top-down.

Bottom-up

The bottom-up pathway is simply the ordinary forward propagation we already know. Given that a CNN has a pyramid structure (its resolution shrinks step by step), the layers that share the same resolution are grouped into one stage. Within each stage, FPN uses the activation of the deepest layer, since it carries the strongest semantic features at that resolution. The paper uses ResNet and denotes these layers, corresponding to conv2, conv3, conv4, and conv5, as $\{C_2, C_3, C_4, C_5\}$. (The paper drops $C_1$ for memory efficiency.)
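To make the stage resolutions concrete, here is a small arithmetic sketch. It relies on the strides of 4, 8, 16, and 32 pixels that the paper gives for $C_2$ to $C_5$ relative to the input. The function name `stage_resolutions` and the assumption that the input size is divisible by 32 are mine, added only for illustration.

```python
# Strides of each bottom-up stage with respect to the input image.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def stage_resolutions(height, width):
    """Spatial size of each stage output.

    Assumes height and width are divisible by 32, so plain integer
    division gives the exact feature-map size.
    """
    if height % 32 or width % 32:
        raise ValueError("this sketch assumes an input size divisible by 32")
    return {name: (height // s, width // s) for name, s in STRIDES.items()}


resolutions = stage_resolutions(224, 224)
print(resolutions)
# {'C2': (56, 56), 'C3': (28, 28), 'C4': (14, 14), 'C5': (7, 7)}
```

Each stage halves the resolution of the one before it, which is why a 2x upsampling step is enough to align neighboring levels.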

Top-down

The top-down pathway is effectively the core of FPN. It takes the layer directly above in the top-down pathway, which carries stronger semantic features, upsamples it, and merges it with the corresponding bottom-up layer by element-wise addition. (The bottom-up activation first passes through a $1 \times 1$ convolution to match the channel dimension.) This is repeated down to the lowest layer, and a $3 \times 3$ convolution is finally applied to each merged layer to form the final feature maps. The topmost layer is formed by applying a $1 \times 1$ convolution to the top layer of the bottom-up pathway. In other words:
• Apply a $1 \times 1$ convolution to $C_5$ to form $P'_5$.
• Upsample $P'_5$ and merge it with the result of applying a $1 \times 1$ convolution to $C_4$ to form $P'_4$.
• Continue in the same way down to $P'_2$.
• Apply a $3 \times 3$ convolution to each $P'_n$ to form the final feature maps $\{P_5, P_4, \ldots, P_2\}$.
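The steps above can be sketched in plain Python on tiny feature maps. This is an illustrative sketch, not the paper's implementation: the helper names, the random weights, the toy channel counts, and the output width `d = 4` are all assumptions (the paper uses 256 channels). Nearest-neighbor 2x upsampling follows the paper. A real implementation would use a tensor library rather than nested lists.

```python
import random


def conv2d(x, w):
    """Zero-padded 'same' convolution.

    x: [C_in][H][W], w: [C_out][C_in][k][k] with odd k (1 or 3 here).
    """
    h, wd, k = len(x[0]), len(x[0][0]), len(w[0][0])
    r = k // 2
    out = [[[0.0] * wd for _ in range(h)] for _ in w]
    for o, w_o in enumerate(w):
        for c, x_c in enumerate(x):
            for i in range(h):
                for j in range(wd):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - r, j + dj - r
                            if 0 <= ii < h and 0 <= jj < wd:
                                out[o][i][j] += w_o[c][di][dj] * x_c[ii][jj]
    return out


def upsample2x(x):
    """Nearest-neighbor upsampling by a factor of 2 in height and width."""
    return [[[row[j // 2] for j in range(2 * len(row))]
             for row in ch for _ in range(2)] for ch in x]


def add(a, b):
    """Element-wise addition of two maps with identical shapes."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def fpn_top_down(c_maps, lateral_w, smooth_w):
    """c_maps: [C2, C3, C4, C5], fine to coarse. Returns [P2, P3, P4, P5]."""
    # P'5: 1x1 convolution on the coarsest bottom-up map.
    merged = [conv2d(c_maps[-1], lateral_w[-1])]
    # P'4 ... P'2: upsample the level above, add the 1x1-projected C_n.
    for level in range(len(c_maps) - 2, -1, -1):
        lateral = conv2d(c_maps[level], lateral_w[level])
        merged.insert(0, add(upsample2x(merged[0]), lateral))
    # A 3x3 convolution on every merged map gives the final P_n.
    return [conv2d(m, w) for m, w in zip(merged, smooth_w)]


def rand_weights(c_out, c_in, k, rng):
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


rng = random.Random(0)
d = 4                    # common output channel count (toy value)
channels = [2, 3, 4, 5]  # toy channel counts of C2..C5
sizes = [16, 8, 4, 2]    # each stage halves the resolution
c_maps = [[[[rng.random() for _ in range(s)] for _ in range(s)]
           for _ in range(c)] for c, s in zip(channels, sizes)]
lateral_w = [rand_weights(d, c, 1, rng) for c in channels]
smooth_w = [rand_weights(d, d, 3, rng) for _ in channels]

p_maps = fpn_top_down(c_maps, lateral_w, smooth_w)
shapes = [(len(p), len(p[0]), len(p[0][0])) for p in p_maps]
print(shapes)  # [(4, 16, 16), (4, 8, 8), (4, 4, 4), (4, 2, 2)]
```

Every output level keeps the resolution of its bottom-up counterpart but has the same channel count `d`, so the same prediction head can be applied to all of them.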
Predictions are then made from each of these final feature maps, which gives multi-scale object detection. By building the pyramid from the top down in this way, FPN gives every level of the pyramid a semantically strong feature map. As a result, the paper reports that sharing the head parameters across all scales performs about as well as learning separate parameters for each scale.
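A minimal sketch of what sharing the head across levels means in practice. The head here is reduced to a single $1 \times 1$ linear scorer, which is a simplification for illustration and not the head used in the paper. The name `shared_head`, the weights, and the toy pyramid values are all assumptions.

```python
def shared_head(feature, weights, bias):
    """Score every location of one pyramid level with a 1x1 linear head.

    feature: [C][H][W]; weights: one value per channel; returns [H][W].
    """
    h, w = len(feature[0]), len(feature[0][0])
    return [[bias + sum(wc * ch[i][j] for wc, ch in zip(weights, feature))
             for j in range(w)] for i in range(h)]


# Toy pyramid: every level has the same channel count d, only H and W differ.
d = 4
pyramid = {f"P{n}": [[[0.1 * n] * s for _ in range(s)] for _ in range(d)]
           for n, s in zip((2, 3, 4, 5), (16, 8, 4, 2))}

# One parameter set, reused on every level.
weights, bias = [0.5, -0.25, 1.0, 0.75], 0.1
scores = {name: shared_head(f, weights, bias) for name, f in pyramid.items()}
score_shapes = {name: (len(s), len(s[0])) for name, s in scores.items()}

shared_params = len(weights) + 1                # same head on all levels
separate_params = shared_params * len(pyramid)  # one head per level
print(score_shapes)
print(shared_params, separate_params)  # 5 20
```

Sharing only works because every level has the same channel dimension, so one set of weights fits all of them.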
The paper also says that FPN extends beyond object detection to tasks such as object segmentation. It shows an example, but I skipped it because I do not know anything about segmentation yet.
In any case, FPN uses the pyramidal architecture of a CNN to form a natural scale-invariant structure in a relatively simple way, and with it reached SOTA as an object detection backbone at the time (2017). Good papers, it seems to me, solve their problem in a way that is quite intuitive and, once you understand it, simpler than expected.
E.O.D.