Deep Learning for Generic Object Detection: A Survey

Li Liu (li.liu@oulu.fi) · Wanli Ouyang ([email protected]) · Xiaogang Wang ([email protected]) · Paul Fieguth (pfi[email protected]) · Jie Chen (jie.chen@oulu.fi) · Xinwang Liu ([email protected]) · Matti Pietikäinen (matti.pietikainen@oulu.fi)

National University of Defense Technology, Changsha, China · University of Oulu, Oulu, Finland · University of Sydney, Camperdown, Australia · Chinese University of Hong Kong, Sha Tin, China · University of Waterloo, Waterloo, Canada

Communicated by Bernt Schiele.

Abstract

Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.

Keywords Object detection · Deep learning · Convolutional neural networks · Object recognition

1 Introduction

As a longstanding, fundamental and challenging problem in computer vision, object detection (illustrated in Fig. 1) has been an active area of research for several decades (Fischler and Elschlager 1973). The goal of object detection is to determine whether there are any instances of objects from given categories (such as humans, cars, bicycles, dogs or cats) in an image and, if present, to return the spatial location and extent of each object instance (e.g., via a bounding box; Everingham et al. 2010; Russakovsky et al. 2015). As the cornerstone of image understanding and computer vision, object detection forms the basis for solving complex or high level vision tasks such as segmentation, scene understanding, object tracking, image captioning, event detection, and activity recognition. Object detection supports a wide range of applications, including robot vision, consumer electronics, security, autonomous driving, human computer interaction, content based image retrieval, intelligent video surveillance, and augmented reality.

Fig. 1 Most frequent keywords in ICCV and CVPR conference papers from 2016 to 2018. The size of each word is proportional to the frequency of that keyword. We can see that object detection has received significant attention in recent years.

Recently, deep learning techniques (Hinton and Salakhutdinov 2006; LeCun et al. 2015) have emerged as powerful methods for learning feature representations automatically from data. In particular, these techniques have provided major improvements in object detection, as illustrated in Fig. 3.

Fig. 3 An overview of recent object detection performance: we can observe a significant improvement in performance (measured as mean average precision) since the arrival of deep learning in 2012. a Detection results of winning entries in the VOC2007-2012 competitions, and b top object detection competition results in ILSVRC2013-2017 (results in both panels use only the provided training data).

As illustrated in Fig. 2, object detection can be grouped into one of two types (Grauman and Leibe 2011; Zhang et al. 2013): detection of specific instances versus the detection of broad categories. The first type aims to detect instances of a particular object (such as Donald Trump's face, the Eiffel Tower, or a neighbor's dog), essentially a matching problem. The goal of the second type is to detect (usually previously unseen) instances of some predefined object categories (for example humans, cars, bicycles, and dogs). Historically, much of the effort in the field of object detection has focused on the detection of a single category (typically faces and pedestrians) or a few specific categories. In contrast, over the past several years, the research community has started moving towards the more challenging goal of building general purpose object detection systems where the breadth of object detection ability rivals that of humans.

Fig. 2 Object detection includes localizing instances of a particular object (top), as well as generalizing to detecting object categories in general (bottom). This survey focuses on recent advances for the latter problem of generic object detection.

Krizhevsky et al. (2012a) proposed a Deep Convolutional Neural Network (DCNN) called AlexNet which achieved record breaking image classification accuracy in the Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al. 2015). Since that time, the research focus in most aspects of computer vision has been specifically on deep learning methods, indeed including the domain of generic object detection (Girshick et al. 2014; He et al. 2014; Girshick 2015; Sermanet et al. 2014; Ren et al. 2017). Although tremendous progress has been achieved, as illustrated in Fig. 3, we are unaware of comprehensive surveys of this subject over the past 5 years. Given the exceptionally rapid rate of progress, this article attempts to track recent advances and summarize their achievements in order to gain a clearer picture of the current panorama in generic object detection.

1.1 Comparison with Previous Reviews

Many notable object detection surveys have been published, as summarized in Table 1. These include many excellent surveys on the problem of specific object detection, such as pedestrian detection (Enzweiler and Gavrila 2009; Geronimo et al. 2010; Dollar et al. 2012), face detection (Yang et al. 2002; Zafeiriou et al. 2015), vehicle detection (Sun et al. 2006) and text detection (Ye and Doermann 2015). There are comparatively few recent surveys focusing directly on the problem of generic object detection, except for the work by Zhang et al. (2013), who conducted a survey on the topic of object class detection. However, the research reviewed in Grauman and Leibe (2011), Andreopoulos and Tsotsos (2013) and Zhang et al. (2013) is mostly pre-2012, and therefore prior to the recent striking success and dominance of deep learning and related methods.

Deep learning allows computational models to learn fantastically complex, subtle, and abstract representations, driving significant progress in a broad range of problems such as visual recognition, object detection, speech recognition, natural language processing, medical image analysis, drug discovery and genomics. Among different types of deep neural networks, DCNNs (LeCun et al. 1998, 2015; Krizhevsky et al. 2012a) have brought about breakthroughs in processing images, video, speech and audio. To be sure, there have been many published surveys on deep learning, including those of Bengio et al. (2013), LeCun et al. (2015), Litjens et al. (2017), Gu et al. (2018), and more recently in tutorials at ICCV and CVPR.
In contrast, although many deep learning based methods have been proposed for object detection, we are unaware of any comprehensive recent survey. A thorough review and summary of existing work is essential for further progress in object detection, particularly for researchers wishing to enter the field. Since our focus is on generic object detection, the extensive work on DCNNs for specific object detection, such as face detection (Li et al. 2015a; Zhang et al. 2016a; Hu et al. 2017), pedestrian detection (Zhang et al. 2016b; Hosang et al. 2015), vehicle detection (Zhou et al. 2016b) and traffic sign detection (Zhu et al. 2016b), will not be considered.

Table 1 Summary of related object detection surveys since 2000

No. | Survey title | References | Year | Venue | Content
1 | Monocular pedestrian detection: survey and experiments | Enzweiler and Gavrila (2009) | 2009 | PAMI | An evaluation of three pedestrian detectors
2 | Survey of pedestrian detection for advanced driver assistance systems | Geronimo et al. (2010) | 2010 | PAMI | A survey of pedestrian detection for advanced driver assistance systems
3 | Pedestrian detection: an evaluation of the state of the art | Dollar et al. (2012) | 2012 | PAMI | A thorough and detailed evaluation of detectors in monocular images
4 | Detecting faces in images: a survey | Yang et al. (2002) | 2002 | PAMI | First survey of face detection from a single image
5 | A survey on face detection in the wild: past, present and future | Zafeiriou et al. (2015) | 2015 | CVIU | A survey of face detection in the wild since 2000
6 | On road vehicle detection: a review | Sun et al. (2006) | 2006 | PAMI | A review of vision based on-road vehicle detection systems
7 | Text detection and recognition in imagery: a survey | Ye and Doermann (2015) | 2015 | PAMI | A survey of text detection and recognition in color imagery
8 | Toward category level object recognition | Ponce et al. (2007) | 2007 | Book | Representative papers on object categorization, detection, and segmentation
9 | The evolution of object categorization and the challenge of image abstraction | Dickinson et al. (2009) | 2009 | Book | A trace of the evolution of object categorization over 4 decades
10 | Context based object categorization: a critical survey | Galleguillos and Belongie (2010) | 2010 | CVIU | A review of contextual information for object categorization
11 | 50 years of object recognition: directions forward | Andreopoulos and Tsotsos (2013) | 2013 | CVIU | A review of the evolution of object recognition systems over 5 decades
12 | Visual object recognition | Grauman and Leibe (2011) | 2011 | Tutorial | Instance and category object recognition techniques
13 | Object class detection: a survey | Zhang et al. (2013) | 2013 | ACM CS | Survey of generic object detection methods before 2011
14 | Feature representation for statistical learning based object detection: a review | Li et al. (2015b) | 2015 | PR | Feature representation methods in statistical learning based object detection, including handcrafted and deep learning based features
15 | Salient object detection: a survey | Borji et al. (2014) | 2014 | arXiv | A survey for salient object detection
16 | Representation learning: a review and new perspectives | Bengio et al. (2013) | 2013 | PAMI | Unsupervised feature learning and deep learning, probabilistic models, autoencoders, manifold learning, and deep networks
17 | Deep learning | LeCun et al. (2015) | 2015 | Nature | An introduction to deep learning and applications
18 | A survey on deep learning in medical image analysis | Litjens et al. (2017) | 2017 | MIA | A survey of deep learning for image classification, object detection, segmentation and registration in medical image analysis
19 | Recent advances in convolutional neural networks | Gu et al. (2018) | 2017 | PR | A broad survey of the recent advances in CNN and its applications in computer vision, speech and natural language processing
20 | Tutorial: tools for efficient object detection | − | 2015 | ICCV15 | A short course for object detection only covering recent milestones
21 | Tutorial: deep learning for objects and scenes | − | 2017 | CVPR17 | A high level summary of recent work on deep learning for visual recognition of objects and scenes
22 | Tutorial: instance level recognition | − | 2017 | ICCV17 | A short course of recent advances on instance level recognition, including object detection, instance segmentation and human pose prediction
23 | Tutorial: visual recognition and beyond | − | 2018 | CVPR18 | A tutorial on methods and principles behind image classification, object detection, instance segmentation, and semantic segmentation
24 | Deep learning for generic object detection | Ours | 2019 | VISI | A comprehensive survey of deep learning for generic object detection

1.2 Scope

The number of papers on generic object detection based on deep learning is breathtaking. There are so many, in fact, that compiling any comprehensive review of the state of the art is beyond the scope of any reasonable length paper. As a result, it is necessary to establish selection criteria, and we have limited our focus to top journal and conference papers. Due to these limitations, we sincerely apologize to those authors whose works are not included in this paper. For surveys of work on related topics, readers are referred to the articles in Table 1. This survey focuses on major progress of the last 5 years, and we restrict our attention to still pictures, leaving the important subject of video object detection as a topic for separate consideration in the future.

The main goal of this paper is to offer a comprehensive survey of deep learning based generic object detection techniques, and to present some degree of taxonomy, a high level perspective and organization, primarily on the basis of popular datasets, evaluation metrics, context modeling, and detection proposal methods. The intention is that our categorization be helpful for readers to have an accessible understanding of similarities and differences between a wide variety of strategies. The proposed taxonomy gives researchers a framework to understand current research and to identify open challenges for future research.

The remainder of this paper is organized as follows. Related background and the progress made during the last 2 decades are summarized in Sect. 2. A brief introduction to deep learning is given in Sect. 3. Popular datasets and evaluation criteria are summarized in Sect. 4. We describe the milestone object detection frameworks in Sect. 5. From Sects. 6 to 9, fundamental sub-problems and the relevant issues involved in designing object detectors are discussed. Finally, in Sect. 10, we conclude the paper with an overall discussion of object detection, state-of-the-art performance, and future research directions.
2 Generic Object Detection

2.1 The Problem

Generic object detection, also called generic object category detection, object class detection, or object category detection (Zhang et al. 2013), is defined as follows. Given an image, determine whether or not there are instances of objects from predefined categories (usually many categories, e.g., 200 categories in the ILSVRC object detection challenge) and, if present, return the spatial location and extent of each instance. A greater emphasis is placed on detecting a broad range of natural categories, as opposed to specific object category detection where only a narrower predefined category of interest (e.g., faces, pedestrians, or cars) may be present. Although thousands of objects occupy the visual world in which we live, currently the research community is primarily interested in the localization of highly structured objects (e.g., cars, faces, bicycles and airplanes) and articulated objects (e.g., humans, cows and horses) rather than unstructured scenes (such as sky, grass and cloud).

The spatial location and extent of an object can be defined coarsely using a bounding box (an axis-aligned rectangle tightly bounding the object) (Everingham et al. 2010; Russakovsky et al. 2015), a precise pixelwise segmentation mask (Zhang et al. 2013), or a closed boundary (Lin et al. 2014; Russell et al. 2008), as illustrated in Fig. 4. To the best of our knowledge, for the evaluation of generic object detection algorithms, it is bounding boxes which are most widely used in the current literature (Everingham et al. 2010; Russakovsky et al. 2015), and therefore this is also the approach we adopt in this survey. However, as the research community moves towards deeper scene understanding (from image level object classification to single object localization, to generic object detection, and to pixelwise object segmentation), it is anticipated that future challenges will be at the pixel level (Lin et al. 2014).

Fig. 4 Recognition problems related to generic object detection: a image level object classification, b bounding box level generic object detection, c pixel-wise semantic segmentation, d instance level semantic segmentation.

There are many problems closely related to that of generic object detection. The goal of object classification or object categorization (Fig. 4a) is to assess the presence of objects from a given set of object classes in an image, i.e., assigning one or more object class labels to a given image and determining the presence without the need for location. The additional requirement to locate the instances in an image makes detection a more challenging task than classification. The object recognition problem denotes the more general problem of identifying/localizing all the objects present in an image, subsuming the problems of object detection and classification (Everingham et al. 2010; Russakovsky et al. 2015; Opelt et al. 2006; Andreopoulos and Tsotsos 2013). Generic object detection is also closely related to semantic image segmentation (Fig. 4c), which aims to assign each pixel in an image to a semantic class label. Object instance segmentation (Fig. 4d) aims to distinguish different instances of the same object class, as opposed to semantic segmentation which does not. (To the best of our knowledge, there is no universal agreement in the literature on the definitions of various vision subtasks. Terms such as detection, localization, recognition, classification, categorization, verification, identification, annotation, labeling, and understanding are often differently defined; Andreopoulos and Tsotsos 2013.)

2.2 Main Challenges

The ideal of generic object detection is to develop a general-purpose algorithm that achieves two competing goals of high quality/accuracy and high efficiency (Fig. 5). As illustrated in Fig. 6, high quality detection must accurately localize and recognize objects in images or video frames, such that the large variety of object categories in the real world can be distinguished (i.e., high distinctiveness), and such that object instances from the same category, subject to intra-class appearance variations, can be localized and recognized (i.e., high robustness). High efficiency requires that the entire detection task runs in real time with acceptable memory and storage demands.

Fig. 5 Taxonomy of challenges in generic object detection.

2.2.1 Accuracy Related Challenges

Challenges in detection accuracy stem from (1) the vast range of intra-class variations and (2) the huge number of object categories.

Intra-class variations can be divided into two types: intrinsic factors and imaging conditions. In terms of intrinsic factors, each object category can have many different object instances, possibly varying in one or more of color, texture, material, shape, and size, such as the "chair" category shown in Fig. 6i. Even in a more narrowly defined class, such as human or horse, object instances can appear in different poses, subject to nonrigid deformations or with the addition of clothing.

Imaging condition variations are caused by the dramatic impacts unconstrained environments can have on object appearance, such as lighting (dawn, day, dusk, indoors), physical location, weather conditions, cameras, backgrounds, illuminations, occlusion, and viewing distances. All of these conditions produce significant variations in object appearance, such as illumination, pose, scale, occlusion, clutter, shading, blur and motion, with examples illustrated in Fig. 6a-h. Further challenges may be added by digitization artifacts, noise corruption, poor resolution, and filtering distortions.

In addition to intraclass variations, the large number of object categories, on the order of 10^4 to 10^5, demands great discrimination power from the detector to distinguish between subtly different interclass variations, as illustrated in Fig. 6j. In practice, current detectors focus mainly on structured object categories, such as the 20, 200 and 91 object classes in PASCAL VOC (Everingham et al. 2010), ILSVRC (Russakovsky et al. 2015) and MS COCO (Lin et al. 2014), respectively. Clearly, the number of object categories under consideration in existing benchmark datasets is much smaller than can be recognized by humans.

Fig. 6 Changes in appearance of the same class with variations in imaging conditions (a-h). There is an astonishing variation in what is meant to be a single object class (i). In contrast, the four images in j appear very similar, but in fact are from four different object classes. Most images are from ImageNet (Russakovsky et al. 2015) and MS COCO (Lin et al. 2014).

2.2.2 Efficiency and Scalability Related Challenges

The prevalence of social media networks and mobile/wearable devices has led to increasing demands for analyzing visual data. However, mobile/wearable devices have limited computational capabilities and storage space, making efficient object detection critical.

The efficiency challenges stem from the need to localize and recognize, with computational complexity growing with the (possibly large) number of object categories, and with the (possibly very large) number of locations and scales within a single image, such as the examples in Fig. 6c, d.

A further challenge is that of scalability: a detector should be able to handle previously unseen objects, unknown situations, and high data rates. As the number of images and the number of categories continue to grow, it may become impossible to annotate them manually, forcing a reliance on weakly supervised strategies.
2.3 Progress in the Past 2 Decades

Early research on object recognition was based on template matching techniques and simple part-based models (Fischler and Elschlager 1973), focusing on specific objects whose spatial layouts are roughly rigid, such as faces. Before 1990 the leading paradigm of object recognition was based on geometric representations (Mundy 2006; Ponce et al. 2007), with the focus later moving away from geometry and prior models towards the use of statistical classifiers [such as Neural Networks (Rowley et al. 1998), SVM (Osuna et al. 1997) and Adaboost (Viola and Jones 2001; Xiao et al. 2003)] based on appearance features (Murase and Nayar 1995a; Schmid and Mohr 1997). This successful family of object detectors set the stage for most subsequent research in this field.

The milestones of object detection in more recent years are presented in Fig. 7, in which two main eras (SIFT vs. DCNN) are highlighted. The appearance features moved from global representations (Murase and Nayar 1995b; Swain and Ballard 1991; Turk and Pentland 1991) to local representations that are designed to be invariant to changes in translation, scale, rotation, illumination, viewpoint and occlusion. Handcrafted local invariant features gained tremendous popularity, starting from the Scale Invariant Feature Transform (SIFT) feature (Lowe 1999), and the progress on various visual recognition tasks was based substantially on the use of local descriptors (Mikolajczyk and Schmid 2005) such as Haar-like features (Viola and Jones 2001), SIFT (Lowe 2004), Shape Contexts (Belongie et al. 2002), Histogram of Gradients (HOG) (Dalal and Triggs 2005), Local Binary Patterns (LBP) (Ojala et al. 2002), and region covariances (Tuzel et al. 2006). These local features are usually aggregated by simple concatenation or feature pooling encoders such as the Bag of Visual Words approach, introduced by Sivic and Zisserman (2003) and Csurka et al. (2004), Spatial Pyramid Matching (SPM) of BoW models (Lazebnik et al. 2006), and Fisher Vectors (Perronnin et al. 2010).

Fig. 7 Milestones of object detection and recognition, including feature representations (Csurka et al. 2004; Dalal and Triggs 2005; He et al. 2016; Krizhevsky et al. 2012a; Lazebnik et al. 2006; Lowe 1999, 2004; Perronnin et al. 2010; Simonyan and Zisserman 2015; Sivic and Zisserman 2003; Szegedy et al. 2015; Viola and Jones 2001; Wang et al. 2009), detection frameworks (Felzenszwalb et al. 2010b; Girshick et al. 2014; Sermanet et al. 2014; Uijlings et al. 2013; Viola and Jones 2001), and datasets (Everingham et al. 2010; Lin et al. 2014; Russakovsky et al. 2015). The time period up to 2012 is dominated by handcrafted features; a transition took place in 2012 with the development of DCNNs for image classification by Krizhevsky et al. (2012a), with methods after 2012 dominated by related deep networks. Most of the listed methods are highly cited and won a major ICCV or CVPR prize. See Sect. 2.3 for details.

For years, the multistage hand tuned pipelines of handcrafted local descriptors and discriminative classifiers dominated a variety of domains in computer vision, including object detection, until the significant turning point in 2012 when DCNNs (Krizhevsky et al. 2012a) achieved their record-breaking results in image classification.

The use of CNNs for detection and localization (Rowley et al. 1998) can be traced back to the 1990s, with a modest number of hidden layers used for object detection (Vaillant et al. 1994; Rowley et al. 1998; Sermanet et al. 2013), successful in restricted domains such as face detection. However, more recently, deeper CNNs have led to record-breaking improvements in the detection of more general object categories, a shift which came about when the successful application of DCNNs in image classification (Krizhevsky et al. 2012a) was transferred to object detection, resulting in the milestone Region-based CNN (RCNN) detector of Girshick et al. (2014).

The successes of deep detectors rely heavily on vast training data and large networks with millions or even billions of parameters. The availability of GPUs with very high computational capability and of large-scale detection datasets [such as ImageNet (Deng et al. 2009; Russakovsky et al. 2015) and MS COCO (Lin et al. 2014)] plays a key role in their success. Large datasets have allowed researchers to target more realistic and complex problems from images with large intra-class variations and inter-class similarities (Lin et al. 2014; Russakovsky et al. 2015). However, accurate annotations are labor intensive to obtain, so detectors must consider methods that can relieve annotation difficulties or can learn with smaller training datasets.

The research community has started moving towards the challenging goal of building general purpose object detection systems whose ability to detect many object categories matches that of humans. This is a major challenge: according to cognitive scientists, human beings can identify around 3000 entry level categories and 30,000 visual categories overall, and the number of categories distinguishable with domain expertise may be on the order of 10^5 (Biederman 1987a). Despite the remarkable progress of the past years, designing an accurate, robust, efficient detection and recognition system that approaches human-level performance on 10^4 to 10^5 categories is undoubtedly an unresolved problem.

3 A Brief Introduction to Deep Learning

Deep learning has revolutionized a wide range of machine learning tasks, from image classification and video processing to speech recognition and natural language understanding. Given this tremendously rapid evolution, there exist many recent survey papers on deep learning (Bengio et al. 2013; Goodfellow et al. 2016; Gu et al. 2018; LeCun et al. 2015; Litjens et al. 2017; Pouyanfar et al. 2018; Wu et al. 2019; Young et al. 2018; Zhang et al. 2018d; Zhou et al. 2018a; Zhu et al. 2017). These surveys have reviewed deep learning techniques from different perspectives (Bengio et al. 2013; Goodfellow et al. 2016; Gu et al. 2018; LeCun et al. 2015; Pouyanfar et al. 2018; Wu et al. 2019; Zhou et al. 2018a), or with applications to medical image analysis (Litjens et al. 2017), natural language processing (Young et al. 2018), speech recognition systems (Zhang et al. 2018d), and remote sensing (Zhu et al. 2017).

Convolutional Neural Networks (CNNs), the most representative models of deep learning, are able to exploit the basic properties underlying natural signals: translation invariance, local connectivity, and compositional hierarchies (LeCun et al. 2015). A typical CNN, illustrated in Fig. 8, has a hierarchical structure and is composed of a number of layers to learn representations of data with multiple levels of abstraction (LeCun et al. 2015). We begin with a convolution

x_i^{l-1} * w_{i,j}^{l}    (1)

between an input feature map x_i^{l-1} from the previous layer l-1 and a 2D convolutional kernel (or filter, or weights) w_{i,j}^{l}. This convolution appears over a sequence of layers, subject to a nonlinear operation \sigma, such that

x_j^{l} = \sigma\Big( \sum_{i=1}^{N^{l-1}} x_i^{l-1} * w_{i,j}^{l} + b_j^{l} \Big),    (2)

with a convolution now between the N^{l-1} input feature maps x_i^{l-1} and the corresponding kernels w_{i,j}^{l}, plus a bias term b_j^{l}. The elementwise nonlinear function \sigma(\cdot) is typically a rectified linear unit (ReLU),

\sigma(x) = \max\{x, 0\}.    (3)

Finally, pooling corresponds to the downsampling/upsampling of feature maps. These three operations (convolution, nonlinearity, pooling) are illustrated in Fig. 8a; CNNs having a large number of layers, a "deep" network, are referred to as Deep CNNs (DCNNs), with a typical DCNN architecture illustrated in Fig. 8b.

Fig. 8 a Illustration of three operations that are repeatedly applied by a typical CNN: convolution with a number of linear filters, nonlinearities (e.g. ReLU), and local pooling (e.g. max pooling). The M feature maps from a previous layer are convolved with N different filters (here shown as size 3 × 3 × M), using a stride of 1. The resulting N feature maps are then passed through a nonlinear function (e.g. ReLU), and pooled (e.g. taking a maximum over 2 × 2 regions) to give N feature maps at a reduced resolution. b Illustration of the architecture of VGGNet (Simonyan and Zisserman 2015), a typical CNN with 11 weight layers. An image with 3 color channels is presented as the input. The network has 8 convolutional layers, 3 fully connected layers, 5 max pooling layers and a softmax classification layer. The last three fully connected layers take features from the top convolutional layer as input in vector form. The final layer is a C-way softmax function, C being the number of classes. The whole network can be learned from labeled training data by optimizing an objective function (e.g. mean squared error or cross entropy loss) via stochastic gradient descent (Color figure online).

Most layers of a CNN consist of a number of feature maps, within which each pixel acts like a neuron. Each neuron in a convolutional layer is connected to feature maps of the previous layer through a set of weights w (essentially a set of 2D filters). As can be seen in Fig. 8b, the early CNN layers are typically composed of convolutional and pooling layers, while the later layers are normally fully connected. From earlier to later layers, the input image is repeatedly convolved, and with each layer, the receptive field or region of support increases. In general, the initial CNN layers extract low-level features (e.g., edges), with later layers extracting more general features of increasing complexity (Zeiler and Fergus 2014; Bengio et al. 2013; LeCun et al. 2015; Oquab et al. 2014).
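To make the three operations in Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch (not from the original survey) of a single convolutional layer followed by ReLU and 2 × 2 max pooling; the array shapes, variable names and toy input are illustrative assumptions, and, as in most deep learning libraries, the "convolution" is implemented as cross-correlation rather than a flipped-kernel convolution.

```python
import numpy as np

def conv2d(x, w, b):
    """Eq. (2) without the nonlinearity: x is (C_in, H, W), w is (C_out, C_in, k, k), b is (C_out,)."""
    c_out, c_in, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for j in range(c_out):                       # one output feature map per filter bank
        for r in range(h_out):
            for c in range(w_out):
                y[j, r, c] = np.sum(x[:, r:r + k, c:c + k] * w[j]) + b[j]
    return y

def relu(x):
    """Eq. (3): elementwise rectified linear unit."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """2 x 2 max pooling with stride 2, applied to each feature map independently."""
    c, h, w = x.shape
    x = x[:, :h - h % 2, :w - w % 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

# Toy forward pass: a 3-channel 8x8 "image" and 4 filters of size 3x3.
rng = np.random.default_rng(0)
image = rng.standard_normal((3, 8, 8))
weights = rng.standard_normal((4, 3, 3, 3)) * 0.1
bias = np.zeros(4)
features = max_pool2x2(relu(conv2d(image, weights, bias)))
print(features.shape)  # (4, 3, 3)
```

Stacking several such layers, with fully connected layers and a softmax at the end, yields exactly the kind of DCNN architecture sketched in Fig. 8b.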
DCNNs have a number of outstanding advantages: a hierarchical structure to learn representations of data with multiple levels of abstraction, the capacity to learn very complex functions, and the ability to learn feature representations directly and automatically from data with minimal domain knowledge. What has particularly made DCNNs successful is the availability of large scale labeled datasets and of GPUs with very high computational capability.

Despite the great successes, known deficiencies remain. In particular, there is an extreme need for labeled training data and a requirement for expensive computing resources, and considerable skill and experience are still needed to select appropriate learning parameters and network architectures. Trained networks are poorly interpretable, there is a lack of robustness to degradations, and many DCNNs have shown serious vulnerability to attacks (Goodfellow et al. 2015), all of which currently limit the use of DCNNs in real-world applications.

4 Datasets and Performance Evaluation

4.1 Datasets

Datasets have played a key role throughout the history of object recognition research, not only as a common ground for measuring and comparing the performance of competing algorithms, but also in pushing the field towards increasingly complex and challenging problems. In particular, recently, deep learning techniques have brought tremendous success to many visual recognition problems, and it is the large amounts of annotated data which play a key role in that success. Access to large numbers of images on the Internet makes it possible to build comprehensive datasets in order to capture a vast richness and diversity of objects, enabling unprecedented performance in object recognition.

For generic object detection, there are four famous datasets: PASCAL VOC (Everingham et al. 2010, 2015), ImageNet (Deng et al. 2009), MS COCO (Lin et al. 2014) and Open Images (Kuznetsova et al. 2018). The attributes of these datasets are summarized in Table 2, and selected sample images are shown in Fig. 9. There are three steps to creating large-scale annotated datasets: determining the set of target object categories, collecting a diverse set of candidate images to represent the selected categories on the Internet, and annotating the collected images, typically by designing crowdsourcing strategies. Recognizing space limitations, we refer interested readers to the original papers (Everingham et al. 2010, 2015; Lin et al. 2014; Russakovsky et al. 2015; Kuznetsova et al. 2018) for detailed descriptions of these datasets in terms of construction and properties.

The four datasets form the backbone of their respective detection challenges. Each challenge consists of a publicly available dataset of images together with ground truth annotation and standardized evaluation software, and an annual competition and corresponding workshop. Statistics for the number of images and object instances in the training, validation and testing datasets for the detection challenges are given in Table 3. The most frequent object classes in the VOC, COCO, ILSVRC and Open Images detection datasets are visualized in Table 4.

PASCAL VOC (Everingham et al. 2010, 2015) is a multi-year effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, creating the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. Starting from only four categories in 2005, the dataset has increased to 20 categories that are common in everyday life. Since 2009, the number of images has grown every year, but with all previous images retained to allow test results to be compared from year to year. Due to the availability of larger datasets like ImageNet, MS COCO and Open Images, PASCAL VOC has gradually fallen out of fashion.

ILSVRC, the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015), is derived from ImageNet (Deng et al. 2009), scaling up PASCAL VOC's goal of standardized training and evaluation of detection algorithms by more than an order of magnitude in the number of object classes and images. ImageNet1000, a subset of ImageNet images with 1000 different object categories and a total of 1.2 million images, has been fixed to provide a standardized benchmark for the ILSVRC image classification challenge.
MS COCO is a response to the criticism of ImageNet that objects in its dataset tend to be large and well centered, making the ImageNet dataset atypical of real-world scenarios. To push for richer image understanding, researchers created the MS COCO database (Lin et al. 2014) containing complex everyday scenes with common objects in their natural context, closer to real life, where objects are labeled using fully-segmented instances to provide more accurate detector evaluation. The COCO object detection challenge (Lin et al. 2014) features two object detection tasks: using either bounding box output or object instance segmentation output. COCO introduced three new challenges:

1. It contains objects at a wide range of scales, including a high percentage of small objects (Singh and Davis 2018);
2. Objects are less iconic and amid clutter or heavy occlusion;
3. The evaluation metric (see Table 5) encourages more accurate object localization.

Just like ImageNet in its time, MS COCO has become the standard for object detection today.

OICOD (the Open Image Challenge Object Detection) is derived from Open Images V4 (now V5 in 2019) (Kuznetsova et al. 2018), currently the largest publicly available object detection dataset. OICOD differs from previous large scale object detection datasets like ILSVRC and MS COCO, not merely in terms of the significantly increased number of classes, images, bounding box annotations and instance segmentation mask annotations, but also regarding the annotation process. In ILSVRC and MS COCO, instances of all classes in the dataset are exhaustively annotated, whereas for Open Images V4 a classifier was applied to each image and only those labels with sufficiently high scores were sent for human verification. Therefore in OICOD only the object instances of human-confirmed positive labels are annotated.

Table 2 Popular databases for object recognition (example images from PASCAL VOC, ImageNet, MS COCO and Open Images are shown in Fig. 9)

Dataset | Total images | Categories | Images per category | Objects per image | Image size | Started year | Highlights
PASCAL VOC (2012) (Everingham et al. 2015) | 11,540 | 20 | 303-4087 | 2.4 | 470 × 380 | 2005 | Covers only 20 categories that are common in everyday life; large number of training images; close to real-world applications; significantly larger intraclass variations; objects in scene context; multiple objects in one image; contains many difficult samples
ImageNet (Russakovsky et al. 2015) | 14 million+ | 21,841 | − | 1.5 | 500 × 400 | 2009 | Large number of object categories; more instances and more categories of objects per image; more challenging than PASCAL VOC; backbone of the ILSVRC challenge; images are object-centric
MS COCO (Lin et al. 2014) | 328,000+ | 91 | − | 7.3 | 640 × 480 | 2014 | Even closer to real world scenarios; each image contains more instances of objects and richer object annotation information; contains object segmentation annotation data that is not available in the ImageNet dataset
Places (Zhou et al. 2017a) | 10 million+ | 434 | − | − | 256 × 256 | 2014 | The largest labeled dataset for scene recognition; four subsets Places365 Standard, Places365 Challenge, Places 205 and Places88 as benchmarks
Open Images (Kuznetsova et al. 2018) | 9 million+ | 6000+ | − | 8.3 | Varied | 2017 | Annotated with image level labels, object bounding boxes and visual relationships; Open Images V5 supports large scale object detection, object instance segmentation and visual relationship detection

Fig. 9 Some example images with object annotations from PASCAL VOC, ILSVRC, MS COCO and Open Images. See Table 2 for a summary of these datasets.

Table 3 Statistics of commonly used object detection datasets

Challenge | Object classes | Train images | Val images | Test images | Train annotated objects | Val annotated objects | Images (Train+Val) | Boxes (Train+Val) | Boxes/Image

PASCAL VOC object detection challenge:
VOC07 | 20 | 2501 | 2510 | 4952 | 6301 (7844) | 6307 (7818) | 5011 | 12,608 | 2.5
VOC08 | 20 | 2111 | 2221 | 4133 | 5082 (6337) | 5281 (6347) | 4332 | 10,364 | 2.4
VOC09 | 20 | 3473 | 3581 | 6650 | 8505 (9760) | 8713 (9779) | 7054 | 17,218 | 2.3
VOC10 | 20 | 4998 | 5105 | 9637 | 11,577 (13,339) | 11,797 (13,352) | 10,103 | 23,374 | 2.4
VOC11 | 20 | 5717 | 5823 | 10,994 | 13,609 (15,774) | 13,841 (15,787) | 11,540 | 27,450 | 2.4
VOC12 | 20 | 5717 | 5823 | 10,991 | 13,609 (15,774) | 13,841 (15,787) | 11,540 | 27,450 | 2.4

ILSVRC object detection challenge:
ILSVRC13 | 200 | 395,909 | 20,121 | 40,152 | 345,854 | 55,502 | 416,030 | 401,356 | 1.0
ILSVRC14 | 200 | 456,567 | 20,121 | 40,152 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1
ILSVRC15 | 200 | 456,567 | 20,121 | 51,294 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1
ILSVRC16 | 200 | 456,567 | 20,121 | 60,000 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1
ILSVRC17 | 200 | 456,567 | 20,121 | 65,500 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1

MS COCO object detection challenge:
MS COCO15 | 80 | 82,783 | 40,504 | 81,434 | 604,907 | 291,875 | 123,287 | 896,782 | 7.3
MS COCO16 | 80 | 82,783 | 40,504 | 81,434 | 604,907 | 291,875 | 123,287 | 896,782 | 7.3
MS COCO17 | 80 | 118,287 | 5000 | 40,670 | 860,001 | 36,781 | 123,287 | 896,782 | 7.3
MS COCO18 | 80 | 118,287 | 5000 | 40,670 | 860,001 | 36,781 | 123,287 | 896,782 | 7.3

Open Images challenge object detection (OICOD, based on Open Images V4, Kuznetsova et al. 2018):
OICOD18 | 500 | 1,643,042 | 100,000 | 99,999 | 11,498,734 | 696,410 | 1,743,042 | 12,195,144 | 7.0

Notes: Object statistics for the VOC challenges list the non-difficult objects used in the evaluation, with all annotated objects in parentheses. The annotations on the test sets are not publicly released, except for PASCAL VOC2007. For the COCO challenge, prior to 2017, the test set had four splits (Dev, Standard, Reserve, and Challenge), each having about 20K images. Starting in 2017, the train and val sets are arranged differently, and the test set is divided into two roughly equally sized splits of about 20,000 images each, Test Dev and Test Challenge, with the other two splits removed. Note that the 2017 Test Dev/Challenge splits contain the same images as the 2015 Test Dev/Challenge splits, so results across the years are directly comparable.

Table 4 Most frequent object classes for each detection challenge, shown as word clouds for a VOC, b COCO, c ILSVRC and d Open Images. The size of each word is proportional to the frequency of that class in the training dataset.
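As a concrete illustration of how these detection datasets are typically consumed, the following is a small sketch (not from the survey) that reads MS COCO style annotations with the pycocotools package; the annotation file path is a placeholder, and any other reader of COCO-format JSON would serve equally well.

```python
from pycocotools.coco import COCO

# Hypothetical path to a COCO-format annotation file (e.g. the val2017 instances split).
coco = COCO("annotations/instances_val2017.json")

cat_ids = coco.getCatIds()   # the object categories used in the detection task
img_ids = coco.getImgIds()   # all image ids in this split
print(len(cat_ids), "categories,", len(img_ids), "images")

# Inspect the ground truth for one image: each annotation carries a category id,
# an axis-aligned bounding box [x, y, width, height] and a segmentation mask.
img_info = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img_info["id"], iscrowd=False)
for ann in coco.loadAnns(ann_ids):
    name = coco.loadCats(ann["category_id"])[0]["name"]
    print(name, ann["bbox"])
```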
4.2 Evaluation Criteria

There are three criteria for evaluating the performance of detection algorithms: detection speed in Frames Per Second (FPS), precision, and recall. The most commonly used metric is Average Precision (AP), derived from precision and recall. AP is usually evaluated in a category specific manner, i.e., computed for each object category separately. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is adopted as the final measure of performance. (In object detection challenges, such as PASCAL VOC and ILSVRC, the winning entry of each object category is that with the highest AP score, and the winner of the challenge is the team that wins on the most object categories. The mAP is also used as the measure of a team's performance, and is justified since the ranking of teams by mAP was always the same as the ranking by the number of object categories won; Russakovsky et al. 2015.) More details on these metrics can be found in Everingham et al. (2010), Everingham et al. (2015), Russakovsky et al. (2015) and Hoiem et al. (2012).

The standard outputs of a detector applied to a testing image I are the predicted detections {(b_j, c_j, p_j)}_j, indexed by object j, of Bounding Box (BB) b_j, predicted category c_j, and confidence p_j. A predicted detection (b, c, p) is regarded as a True Positive (TP) if

- The predicted category c equals the ground truth label c^g.
- The overlap ratio IOU (Intersection Over Union) (Everingham et al. 2010; Russakovsky et al. 2015)

IOU(b, b^g) = \frac{area(b \cap b^g)}{area(b \cup b^g)},    (4)

between the predicted BB b and the ground truth b^g is not smaller than a predefined threshold \varepsilon, where \cap and \cup denote intersection and union, respectively. A typical value of \varepsilon is 0.5.

Otherwise, it is considered as a False Positive (FP). The confidence level p is usually compared with some threshold \beta to determine whether the predicted class label c is accepted. (It is worth noting that for a given threshold \beta, multiple detections of the same object in an image are not all considered correct detections; only the detection with the highest confidence level is considered a TP and the rest are FPs.)

AP is computed separately for each of the object classes, based on Precision and Recall. For a given object class c and a testing image I_i, let {(b_{ij}, p_{ij})}_{j=1}^{M} denote the detections returned by a detector, ranked by confidence p_{ij} in decreasing order. Each detection (b_{ij}, p_{ij}) is either a TP or an FP, which can be determined via the algorithm in Fig. 10. Based on the TP and FP detections, the precision P(\beta) and recall R(\beta) (Everingham et al. 2010) can be computed as a function of the confidence threshold \beta, so by varying the confidence threshold different pairs (P, R) can be obtained, in principle allowing precision to be regarded as a function of recall, i.e. P(R), from which the Average Precision (AP) (Everingham et al. 2010; Russakovsky et al. 2015) can be found.

Fig. 10 The algorithm for determining TPs and FPs by greedily matching object detection results to ground truth boxes.

Since the introduction of MS COCO, more attention has been placed on the accuracy of the bounding box location. Instead of using a fixed IOU threshold, MS COCO introduces a few metrics (summarized in Table 5) for characterizing the performance of an object detector. For instance, in contrast to the traditional mAP computed at a single IOU of 0.5, AP_coco is averaged across all object categories and multiple IOU values from 0.5 to 0.95 in steps of 0.05. Because 41% of the objects in MS COCO are small and 24% are large, the metrics AP^small_coco, AP^medium_coco and AP^large_coco are also introduced. Finally, Table 5 summarizes the main metrics used in the PASCAL, ILSVRC and MS COCO object detection challenges, with metric modifications for the Open Images challenges proposed in Kuznetsova et al. (2018).

Table 5 Summary of commonly used metrics for evaluating object detectors

Metric | Meaning | Definition and description
TP | True positive | A true positive detection, per Fig. 10
FP | False positive | A false positive detection, per Fig. 10
\beta | Confidence threshold | A confidence threshold for computing P(\beta) and R(\beta)
\varepsilon | IOU threshold | VOC: typically around 0.5. ILSVRC: min(0.5, wh / ((w+10)(h+10))), where w × h is the size of a ground truth box. MS COCO: ten IOU thresholds \varepsilon ∈ {0.5 : 0.05 : 0.95}
P(\beta) | Precision | The fraction of correct detections out of the total detections returned by the detector with confidence of at least \beta
R(\beta) | Recall | The fraction of all N objects detected by the detector having a confidence of at least \beta
AP | Average Precision | Computed over the different levels of recall achieved by varying the confidence \beta
mAP | mean Average Precision | VOC: AP at a single IOU, averaged over all classes. ILSVRC: AP at a modified IOU, averaged over all classes. MS COCO: AP_coco is mAP averaged over ten IOUs {0.5 : 0.05 : 0.95}; AP^{IOU=0.5}_coco is mAP at IOU = 0.50 (the PASCAL VOC metric); AP^{IOU=0.75}_coco is mAP at IOU = 0.75 (strict metric); AP^{small}_coco is mAP for small objects of area smaller than 32^2; AP^{medium}_coco is mAP for objects of area between 32^2 and 96^2; AP^{large}_coco is mAP for large objects of area bigger than 96^2
AR | Average Recall | The maximum recall given a fixed number of detections per image, averaged over all categories and IOU thresholds
AR (MS COCO) | Average Recall | AR^{max=1}_coco is AR given 1 detection per image; AR^{max=10}_coco is AR given 10 detections per image; AR^{max=100}_coco is AR given 100 detections per image; AR^{small}_coco is AR for small objects of area smaller than 32^2; AR^{medium}_coco is AR for objects of area between 32^2 and 96^2; AR^{large}_coco is AR for large objects of area bigger than 96^2
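To make Eq. (4) and the greedy matching procedure of Fig. 10 concrete, here is a minimal sketch (not taken from the survey) of IOU computation, greedy TP/FP assignment for one class in one image, and the resulting AP from ranked detections; the box format and the simple all-point integration of the precision-recall curve are illustrative assumptions rather than the exact PASCAL or COCO protocol.

```python
import numpy as np

def iou(box_a, box_b):
    """Eq. (4). Boxes are [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_detections(dets, scores, gts, eps=0.5):
    """Greedy matching in decreasing confidence order: each ground truth box can
    absorb at most one detection; remaining overlapping detections become FPs."""
    order = np.argsort(-np.asarray(scores))
    matched, tp = set(), np.zeros(len(dets), dtype=bool)
    for d in order:
        best, best_iou = -1, eps
        for g, gt in enumerate(gts):
            o = iou(dets[d], gt)
            if g not in matched and o >= best_iou:
                best, best_iou = g, o
        if best >= 0:
            matched.add(best)
            tp[d] = True
    return tp[order], len(gts)   # TP flags in ranked order, and the number of ground truths

def average_precision(tp_ranked, num_gt):
    """All-point AP: area under the precision-recall curve traced by the ranked detections."""
    tp_cum = np.cumsum(tp_ranked)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.arange(1, len(tp_ranked) + 1)
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Toy example: three detections of one class against two ground truth boxes.
dets = [[10, 10, 50, 50], [12, 12, 48, 48], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.6]
gts = [[11, 11, 49, 49], [100, 105, 145, 150]]
tp_ranked, n_gt = match_detections(dets, scores, gts)
print(average_precision(tp_ranked, n_gt))
```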
5 Detection Frameworks

There has been steady progress in object feature representations and classifiers for recognition, as evidenced by the dramatic change from handcrafted features (Viola and Jones 2001; Dalal and Triggs 2005; Felzenszwalb et al. 2008; Harzallah et al. 2009; Vedaldi et al. 2009) to learned DCNN features (Girshick et al. 2014; Ouyang et al. 2015; Girshick 2015; Ren et al. 2015; Dai et al. 2016c). In contrast, in terms of localization, the basic "sliding window" strategy (Dalal and Triggs 2005; Felzenszwalb et al. 2010b, 2008) remains mainstream, although with some efforts to avoid exhaustive search (Lampert et al. 2008; Uijlings et al. 2013). However, the number of windows is large and grows quadratically with the number of image pixels, and the need to search over multiple scales and aspect ratios further increases the search space. Therefore, the design of efficient and effective detection frameworks plays a key role in reducing this computational cost. Commonly adopted strategies include cascading, sharing feature computation, and reducing per-window computation.

This section reviews detection frameworks, listed in Fig. 11 and Table 11, covering the milestone approaches appearing since deep learning entered the field, organized into two main categories:

(a) Two stage detection frameworks, which include a preprocessing step for generating object proposals;
(b) One stage detection frameworks, or region proposal free frameworks, which use a single method that does not separate out the detection proposal process.

Sections 6-9 will discuss fundamental sub-problems involved in detection frameworks in greater detail, including DCNN features, detection proposals, and context modeling.

Fig. 11 Milestones in generic object detection, including DetectorNet (Szegedy et al.), OverFeat (Sermanet et al.), MultiBox (Erhan et al.), RCNN (Girshick et al.), MSC Multibox (Szegedy et al.), SPPNet (He et al.), NIN (Lin et al.), VGGNet (Simonyan and Zisserman), GoogLeNet (Szegedy et al.), Fast RCNN (Girshick), Faster RCNN (Ren et al.), YOLO (Redmon et al.), SSD (Liu et al.), ResNet (He et al.), RFCN (Dai et al.), YOLO9000 (Redmon and Farhadi), Feature Pyramid Network (FPN) (Lin et al.), DenseNet (Huang et al.), Mask RCNN (He et al.), RetinaNet (Lin et al.) and CornerNet (Law and Deng).

5.1 Region Based (Two Stage) Frameworks

In a region-based framework, category-independent region proposals (also called object proposals or detection proposals: a set of candidate regions or bounding boxes in an image that may potentially contain an object; Chavali et al. 2016; Hosang et al. 2016) are generated from an image, CNN (Krizhevsky et al. 2012a) features are extracted from these regions, and then category-specific classifiers are used to determine the category labels of the proposals. As can be observed from Fig. 11, DetectorNet (Szegedy et al. 2013), OverFeat (Sermanet et al. 2014), MultiBox (Erhan et al. 2014) and RCNN (Girshick et al. 2014) independently and almost simultaneously proposed using CNNs for generic object detection.

RCNN (Girshick et al. 2014): Inspired by the breakthrough image classification results obtained by CNNs and the success of selective search for region proposals with handcrafted features (Uijlings et al. 2013), Girshick et al. (2014, 2016) were among the first to explore CNNs for generic object detection and developed RCNN, which integrates AlexNet (Krizhevsky et al. 2012a) with the selective search region proposal method (Uijlings et al. 2013). As illustrated in detail in Fig. 12, training an RCNN framework consists of a multistage pipeline:

1. Region proposal computation: Class agnostic region proposals, which are candidate regions that might contain objects, are obtained via selective search (Uijlings et al. 2013).
2. CNN model finetuning: Region proposals, which are cropped from the image and warped into the same size, are used as the input for fine-tuning a CNN model pretrained using a large-scale dataset such as ImageNet. At this stage, all region proposals with at least 0.5 IOU overlap with a ground truth box are defined as positives for that ground truth box's class and the rest as negatives. (Please refer to Sect. 4.2 for the definition of IOU.)
3. Class specific SVM classifier training: A set of class-specific linear SVM classifiers are trained using fixed length features extracted with the CNN, replacing the softmax classifier learned by fine-tuning. For training the SVM classifiers, positive examples are defined to be the ground truth boxes for each class. A region proposal with less than 0.3 IOU overlap with all ground truth instances of a class is negative for that class. Note that the positive and negative examples defined for training the SVM classifiers are different from those for fine-tuning the CNN.
4. Class specific bounding box regressor training: Bounding box regression is learned for each object class with CNN features.

Fig. 12 Illustration of the RCNN detection framework (Girshick et al. 2014, 2016).

In spite of achieving high object detection quality, RCNN has notable drawbacks (Girshick 2015):

1. Training is a multistage pipeline, slow and hard to optimize because each individual stage must be trained separately.
2. SVM classifier and bounding box regressor training is expensive in both disk space and time, because CNN features need to be extracted from each object proposal in each image, posing great challenges for large scale detection, particularly with very deep networks such as VGG16 (Simonyan and Zisserman 2015).
3. Testing is slow, since CNN features are extracted per object proposal in each test image, without shared computation.

All of these drawbacks have motivated successive innovations, leading to a number of improved detection frameworks such as SPPNet, Fast RCNN and Faster RCNN, as follows.

SPPNet (He et al. 2014): During testing, CNN feature extraction is the main bottleneck of the RCNN detection pipeline, which requires the extraction of CNN features from thousands of warped region proposals per image. As a result, He et al. (2014) introduced traditional spatial pyramid pooling (SPP) (Grauman and Darrell 2005; Lazebnik et al. 2006) into CNN architectures. Since convolutional layers accept inputs of arbitrary sizes, the requirement of fixed-size images in CNNs is due only to the Fully Connected (FC) layers; therefore He et al. added an SPP layer on top of the last convolutional (CONV) layer to obtain features of fixed length for the FC layers. With this SPPNet, RCNN obtains a significant speedup without sacrificing any detection quality, because it only needs to run the convolutional layers once on the entire test image to generate fixed-length features for region proposals of arbitrary size. While SPPNet accelerates RCNN evaluation by orders of magnitude, it does not result in a comparable speedup of detector training. Moreover, fine-tuning in SPPNet (He et al. 2014) is unable to update the convolutional layers before the SPP layer, which limits the accuracy of very deep networks.

Fast RCNN (Girshick 2015): Girshick proposed Fast RCNN (Girshick 2015), which addresses some of the disadvantages of RCNN and SPPNet while improving on their detection speed and quality. As illustrated in Fig. 13, Fast RCNN enables end-to-end detector training by developing a streamlined training process that simultaneously learns a softmax classifier and class-specific bounding box regression, rather than separately training a softmax classifier, SVMs, and Bounding Box Regressors (BBRs) as in RCNN/SPPNet. Fast RCNN employs the idea of sharing the computation of convolution across region proposals, and adds a Region of Interest (RoI) pooling layer between the last CONV layer and the first FC layer to extract a fixed-length feature for each region proposal. Essentially, RoI pooling uses warping at the feature level to approximate warping at the image level. The features after the RoI pooling layer are fed into a sequence of FC layers that finally branch into two sibling output layers: softmax probabilities for object category prediction, and class-specific bounding box regression offsets for proposal refinement. Compared to RCNN/SPPNet, Fast RCNN improves efficiency considerably, typically 3 times faster in training and 10 times faster in testing. Thus there is higher detection quality, a single training process that updates all network layers, and no storage required for feature caching.

Fig. 13 High level diagrams of the leading frameworks for generic object detection. The properties of these methods are summarized in Table 11.
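Since RoI pooling is the key operation that lets Fast RCNN share one convolutional forward pass across all proposals, a minimal sketch may help; the following is an illustrative NumPy version (not the authors' implementation), assuming a feature map in (C, H, W) layout, a known feature stride, and a proposal given as an image-space box.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7, stride=16):
    """Max-pool the feature-map region under an image-space RoI into a fixed grid.

    feature_map: (C, H, W) CONV features for the whole image.
    roi: [x1, y1, x2, y2] in image coordinates.
    stride: total downsampling factor of the backbone (assumed 16, as for VGG16 conv5).
    """
    c, h, w = feature_map.shape
    # Project the RoI onto the feature map and clip it to the map's extent.
    x1 = max(0, int(np.floor(roi[0] / stride)))
    y1 = max(0, int(np.floor(roi[1] / stride)))
    x2 = min(w, max(int(np.ceil(roi[2] / stride)), x1 + 1))
    y2 = min(h, max(int(np.ceil(roi[3] / stride)), y1 + 1))

    xs = np.linspace(x1, x2, output_size + 1).astype(int)  # bin edges along width
    ys = np.linspace(y1, y2, output_size + 1).astype(int)  # bin edges along height
    out = np.zeros((c, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        ya = min(ys[i], y2 - 1)
        yb = min(max(ys[i + 1], ya + 1), y2)
        for j in range(output_size):
            xa = min(xs[j], x2 - 1)
            xb = min(max(xs[j + 1], xa + 1), x2)
            out[:, i, j] = feature_map[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out  # flattening gives a fixed-length C * output_size * output_size feature

features = np.random.rand(512, 38, 50)   # e.g. conv5 features of a 600 (H) x 800 (W) image
pooled = roi_pool(features, roi=[48, 112, 400, 320])
print(pooled.shape)                      # (512, 7, 7)
```

Every proposal, regardless of its size or aspect ratio, is thereby reduced to the same fixed-length vector for the FC layers, which is exactly what allows the convolutional computation to be shared.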
They showed that RFCN with anchors) of different scales and aspect ratios at each CONV ResNet101 (He et al. 2016) could achieve comparable accu- feature map location. The anchor positions are image content racy to Faster RCNN, often at faster running times. independent, but the feature vectors themselves, extracted Mask RCNN He et al. (2017) proposed Mask RCNN to from anchors, are image content dependent. Each anchor is tackle pixelwise object instance segmentation by extend- mapped to a lower dimensional vector, which is fed into two ing Faster RCNN. Mask RCNN adopts the same two stage sibling FC layers—an object category classification layer and pipeline, with an identical first stage (RPN), but in the sec- a box regression layer. In contrast to detection in Fast RCNN, ond stage, in parallel to predicting the class and box offset, the features used for regression in RPN are of the same shape Mask RCNN adds a branch which outputs a binary mask for as the anchor box, thus k anchors lead to k regressors. RPN each RoI. The new branch is a Fully Convolutional Network shares CONV features with Fast RCNN, thus enabling highly (FCN) (Long et al. 2015; Shelhamer et al. 2017) on top of a efficient region proposal computation. RPN is, in fact, a kind CNN feature map. In order to avoid the misalignments caused of Fully Convolutional Network (FCN) (Long et al. 2015; by the original RoI pooling (RoIPool) layer, a RoIAlign Shelhamer et al. 2017); Faster RCNN is thus a purely CNN layer was proposed to preserve the pixel level spatial cor- based framework without using handcrafted features. respondence. With a backbone network ResNeXt101-FPN For the VGG16 model (Simonyan and Zisserman 2015), (Xie et al. 2017; Lin et al. 2017a), Mask RCNN achieved Faster RCNN can test at 5 FPS (including all stages) on a top results for the COCO object instance segmentation and GPU, while achieving state-of-the-art object detection accu- bounding box object detection. It is simple to train, general- racy on PASCAL VOC 2007 using 300 proposals per image. izes well, and adds only a small overhead to Faster RCNN, The initial Faster RCNN in Ren et al. (2015) contains sev- running at 5 FPS (He et al. 2017). eral alternating training stages, later simplified in Ren et al. Chained Cascade Network and Cascade RCNN The (2017). essence of cascade (Felzenszwalb et al. 2010a; Bourdev Concurrent with the development of Faster RCNN, Lenc and Brandt 2005; Li and Zhang 2004) is to learn more dis- and Vedaldi (2015) challenged the role of region proposal criminative classifiers by using multistage classifiers, such generation methods such as selective search, studied the role that early stages discard a large number of easy negative of region proposal generation in CNN based detectors, and samples so that later stages can focus on handling more diffi- found that CNNs contain sufficient geometric information cult examples. Two-stage object detection can be considered for accurate object detection in the CONV rather than FC as a cascade, the first detector removing large amounts of layers. They showed the possibility of building integrated, background, and the second stage classifying the remaining simpler, and faster object detectors that rely exclusively on regions. Recently, end-to-end learning of more than two cas- CNNs, removing region proposal generation methods such caded classifiers and DCNNs for generic object detection as selective search. 
were proposed in the Chained Cascade Network (Ouyang RFCN (Region based Fully Convolutional Network) et al. 2017a), extended in Cascade RCNN (Cai and Vasconce- While Faster RCNN is an order of magnitude faster than los 2018), and more recently applied for simultaneous object Fast RCNN, the fact that the region-wise sub-network still detection and instance segmentation (Chen et al. 2019a), win- needs to be applied per RoI (several hundred RoIs per image) ning the COCO 2018 Detection Challenge. led Dai et al. (2016c) to propose the RFCN detector which is Light Head RCNN In order to further increase the detec- fully convolutional (no hidden FC layers) with almost all tion speed of RFCN (Dai et al. 2016c), Li et al. (2018c)pro- computations shared over the entire image. As shown in posed Light Head RCNN, making the head of the detection Fig. 13, RFCN differs from Faster RCNN only in the RoI network as light as possible to reduce the RoI computation. sub-network. In Faster RCNN, the computation after the RoI In particular, Li et al. (2018c) applied a convolution to pro- pooling layer cannot be shared, so Dai et al. (2016c) proposed duce thin feature maps with small channel numbers (e.g., using all CONV layers to construct a shared RoI sub-network, 490 channels for COCO) and a cheap RCNN sub-network, and RoI crops are taken from the last layer of CONV features leading to an excellent trade-off of speed and accuracy. 123 International Journal of Computer Vision (2020) 128:261–318 277 5.2 Unified (One Stage) Frameworks deep networks. It is one of the most influential object detec- tion frameworks, winning the ILSVRC2013 localization and The region-based pipeline strategies of Sect. 5.1 have dom- detection competition. OverFeat performs object detection inated since RCNN (Girshick et al. 2014), such that the via a single forward pass through the fully convolutional leading results on popular benchmark datasets are all based layers in the network (i.e. the “Feature Extractor”, shown on Faster RCNN (Ren et al. 2015). Nevertheless, region- in Fig. 14a). The key steps of object detection at test time based approaches are computationally expensive for current can be summarized as follows: mobile/wearable devices, which have limited storage and computational capability, therefore instead of trying to opti- 1. Generate object candidates by performing object clas- mize the individual components of a complex region-based sification via a sliding window fashion on multiscale pipeline, researchers have begun to develop unified detection images OverFeat uses a CNN like AlexNet (Krizhevsky strategies. et al. 2012a), which would require input images ofa fixed Unified pipelines refer to architectures that directly pre- size due to its fully connected layers, in order to make dict class probabilities and bounding box offsets from full the sliding window approach computationally efficient, images with a single feed-forward CNN in a monolithic set- OverFeat casts the network (as shown in Fig. 14a) into ting that does not involve region proposal generation or post a fully convolutional network, taking inputs of any size, classification / feature resampling, encapsulating all compu- by viewing fully connected layers as convolutions with tation in a single network. Since the whole pipeline is a single kernels of size 1 × 1. OverFeat leverages multiscale fea- network, it can be optimized end-to-end directly on detection tures to improve the overall performance by passing up to performance. 
six enlarged scales of the original image through the net- DetectorNet (Szegedy et al. 2013) were among the first to work (as shown in Fig. 14b), resulting in a significantly explore CNNs for object detection. DetectorNet formulated increased number of evaluated context views. For each object detection a regression problem to object bounding of the multiscale inputs, the classifier outputs a grid of box masks. They use AlexNet (Krizhevsky et al. 2012a) predictions (class and confidence). and replace the final softmax classifier layer with a regres- 2. Increase the number of predictions by offset max pooling sion layer. Given an image window, they use one network In order to increase resolution, OverFeat applies offset to predict foreground pixels over a coarse grid, as well as max pooling after the last CONV layer, i.e. perform- four additional networks to predict the object’s top, bottom, ing a subsampling operation at every offset, yielding left and right halves. A grouping process then converts the many more views for voting, increasing robustness while predicted masks into detected bounding boxes. The network remaining efficient. needs to be trained per object type and mask type, and does 3. Bounding box regression Once an object is identified, not scale to multiple classes. DetectorNet must take many a single bounding box regressor is applied. The classi- crops of the image, and run multiple networks for each part fier and the regressor share the same feature extraction on every crop, thus making it slow. (CONV) layers, only the FC layers need to be recomputed OverFeat, proposed by Sermanet et al. (2014) and illus- after computing the classification network. trated in Fig. 14, can be considered as one of the first 4. Combine predictions OverFeat uses a greedy merge strat- single-stage object detectors based on fully convolutional egy to combine the individual bounding box predictions across all locations and scales. OverFeat has a significant speed advantage, but is less accu- rate than RCNN (Girshick et al. 2014), because it was difficult to train fully convolutional networks at the time. The speed (a) advantage derives from sharing the computation of convolu- tion between overlapping windows in the fully convolutional network. OverFeat is similar to later frameworks such as YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016), except that the classifier and the regressors in OverFeat are trained sequentially. (b) YOLO Redmon et al. (2016) proposed YOLO (You Only Look Once), a unified detector casting object detection as a regression problem from image pixels to spatially sep- arated bounding boxes and associated class probabilities, Fig. 14 Illustration of the OverFeat (Sermanet et al. 2014) detection framework illustrated in Fig. 13. Since the region proposal generation 123 278 International Journal of Computer Vision (2020) 128:261–318 stage is completely dropped, YOLO directly predicts detec- as VGG (Simonyan and Zisserman 2015), followed by sev- tions using a small set of candidate regions . Unlike region eral auxiliary CONV layers, progressively decreasing in size. based approaches (e.g. Faster RCNN) that predict detections The information in the last layer may be too coarse spa- based on features from a local region, YOLO uses features tially to allow precise localization, so SSD performs detection from an entire image globally. 
In particular, YOLO divides over multiple scales by operating on multiple CONV feature an image into an S × S grid, each predicting C class prob- maps, each of which predicts category scores and box off- abilities, B bounding box locations, and confidence scores. sets for bounding boxes of appropriate sizes. For a 300 ×300 By throwing out the region proposal generation step entirely, input, SSD achieves 74.3% mAP on the VOC2007 test at 59 YOLO is fast by design, running in real time at 45 FPS and FPS versus Faster RCNN 7 FPS / mAP 73.2% or YOLO 45 Fast YOLO (Redmon et al. 2016) at 155 FPS. Since YOLO FPS / mAP 63.4%. sees the entire image when making predictions, it implicitly CornerNet Recently, Law and Deng (2018) questioned the encodes contextual information about object classes, and is dominant role that anchor boxes have come to play in SoA less likely to predict false positives in the background. YOLO object detection frameworks (Girshick 2015;Heetal. 2017; makes more localization errors than Fast RCNN, resulting Redmon et al. 2016; Liu et al. 2016). Law and Deng (2018) from the coarse division of bounding box location, scale and argue that the use of anchor boxes, especially in one stage aspect ratio. As discussed in Redmon et al. (2016), YOLO detectors (Fu et al. 2017; Lin et al. 2017b; Liu et al. 2016; may fail to localize some objects, especially small ones, pos- Redmon et al. 2016), has drawbacks (Law and Deng 2018; sibly because of the coarse grid division, and because each Lin et al. 2017b) such as causing a huge imbalance between grid cell can only contain one object. It is unclear to what positive and negative examples, slowing down training and extent YOLO can translate to good performance on datasets introducing extra hyperparameters. Borrowing ideas from the with many objects per image, such as MS COCO. work on Associative Embedding in multiperson pose estima- YOLOv2 and YOLO9000 Redmon and Farhadi (2017) tion (Newell et al. 2017), Law and Deng (2018) proposed proposed YOLOv2, an improved version of YOLO, in which CornerNet by formulating bounding box object detection the custom GoogLeNet (Szegedy et al. 2015) network is as detecting paired top-left and bottom-right keypoints .In replaced with the simpler DarkNet19, plus batch normal- CornerNet, the backbone network consists of two stacked ization (He et al. 2015), removing the fully connected layers, Hourglass networks (Newell et al. 2016), with a simple cor- and using good anchor boxes learned via kmeans and multi- ner pooling approach to better localize corners. CornerNet scale training. YOLOv2 achieved state-of-the-art on standard achieved a 42.1% AP on MS COCO, outperforming all pre- detection tasks. Redmon and Farhadi (2017) also introduced vious one stage detectors; however, the average inference YOLO9000, which can detect over 9000 object categories in time is about 4FPS on a Titan X GPU, significantly slower real time by proposing a joint optimization method to train than SSD (Liu et al. 2016) and YOLO (Redmon et al. 2016). simultaneously on an ImageNet classification dataset and CornerNet generates incorrect bounding boxes because it is a COCO detection dataset with WordTree to combine data challenging to decide which pairs of keypoints should be from multiple sources. Such joint training allows YOLO9000 grouped into the same objects. To further improve on Cor- to perform weakly supervised detection, i.e. detecting object nerNet, Duan et al. 
(2019) proposed CenterNet to detect each classes that do not have bounding box annotations. object as a triplet of keypoints, by introducing one extra key- SSD In order to preserve real-time speed without sacrific- point at the centre of a proposal, raising the MS COCO AP to ing too much detection accuracy, Liu et al. (2016) proposed 47.0%, but with an inference speed slower than CornerNet. SSD (Single Shot Detector), faster than YOLO (Redmon et al. 2016) and with an accuracy competitive with region- based detectors such as Faster RCNN (Ren et al. 2015). SSD 6 Object Representation effectively combines ideas from RPN in Faster RCNN (Ren et al. 2015), YOLO (Redmon et al. 2016) and multiscale As one of the main components in any detector, good feature CONV features (Hariharan et al. 2016) to achieve fast detec- representations are of primary importance in object detection tion speed, while still retaining high detection quality. Like (Dickinson et al. 2009; Girshick et al. 2014; Gidaris and YOLO, SSD predicts a fixed number of bounding boxes Komodakis 2015; Zhu et al. 2016a). In the past, a great deal and scores, followed by an NMS step to produce the final of effort was devoted to designing local descriptors [e.g., detection. The CNN network in SSD is fully convolutional, SIFT (Lowe 1999) and HOG (Dalal and Triggs 2005)] and to whose early layers are based on a standard architecture, such explore approaches [e.g., Bag of Words (Sivic and Zisserman 2003) and Fisher Vector (Perronnin et al. 2010)] to group and YOLO uses far fewer bounding boxes, only 98 per image, compared to about 2000 from Selective Search. The idea of using keypoints for object detection appeared previously Boxes of various sizes and aspect ratios that serve as object candidates. in DeNet (TychsenSmith and Petersson 2017). 123 International Journal of Computer Vision (2020) 128:261–318 279 (Zeiler and Fergus 2014) VGGNet (Simonyan and Zisserman 2015), GoogLeNet (Szegedy et al. 2015), Inception series (Ioffe and Szegedy 2015; Szegedy et al. 2016, 2017), ResNet (He et al. 2016), DenseNet (Huang et al. 2017a) and SENet (Hu et al. 2018b), summarized in Table 6, and where the improvement over time is seen in Fig. 15. A further review of recent CNN advances can be found in Gu et al. (2018). The trend in architecture evolution is for greater depth: AlexNet has 8 layers, VGGNet 16 layers, more recently ResNet and DenseNet both surpassed the 100 layer mark, and it was VGGNet (Simonyan and Zisserman 2015) and GoogLeNet (Szegedy et al. 2015) which showed that increas- ing depth can improve the representational power. As can be observed from Table 6, networks such as AlexNet, OverFeat, ZFNet and VGGNet have an enormous number of param- eters, despite being only a few layers deep, since a large Fig. 15 Performance of winning entries in the ILSVRC competitions fraction of the parameters come from the FC layers. Newer from 2011 to 2017 in the image classification task networks like Inception, ResNet, and DenseNet, although having a great depth, actually have far fewer parameters by abstract descriptors into higher level representations in order avoiding the use of FC layers. to allow the discriminative parts to emerge; however, these With the use of Inception modules (Szegedy et al. 2015)in feature representation methods required careful engineering carefully designed topologies, the number of parameters of and considerable domain expertise. 
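A rough back-of-the-envelope count illustrates the point made above that the FC layers dominate the parameter budget of AlexNet/VGG-style networks, whereas individual CONV layers are comparatively cheap. The numbers below are approximate (weights only, biases ignored), and the 134M total is taken from Table 6.

```python
# Approximate weight counts for VGG16-style layers (biases ignored).
fc6 = 512 * 7 * 7 * 4096        # first FC layer: ~102.8M parameters
fc7 = 4096 * 4096               # second FC layer: ~16.8M parameters
conv3x3 = 3 * 3 * 512 * 512     # a single 3x3, 512-to-512 CONV layer: ~2.4M parameters
total = 134e6                   # VGG16 total from Table 6 (final prediction layer excluded)
print(f"FC6 + FC7: {(fc6 + fc7) / 1e6:.1f}M of roughly {total / 1e6:.0f}M parameters")
print(f"one 3x3 CONV layer: {conv3x3 / 1e6:.2f}M parameters")
```

Roughly ninety percent of the weights sit in the two large FC layers, which is why fully convolutional designs such as GoogLeNet, ResNet and DenseNet can be far deeper yet smaller.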
GoogLeNet is dramatically reduced, compared to AlexNet, In contrast, deep learning methods (especially deep ZFNet or VGGNet. Similarly, ResNet demonstrated the CNNs) can learn powerful feature representations with mul- effectiveness of skip connections for learning extremely deep tiple levels of abstraction directly from raw images (Bengio networks with hundreds of layers, winning the ILSVRC et al. 2013; LeCun et al. 2015). As the learning procedure 2015 classification task. Inspired by ResNet (He et al. 2016), reduces the dependency of specific domain knowledge and InceptionResNets (Szegedy et al. 2017) combined the Incep- complex procedures needed in traditional feature engineer- tion networks with shortcut connections, on the basis that ing (Bengio et al. 2013; LeCun et al. 2015), the burden for shortcut connections can significantly accelerate network feature representation has been transferred to the design of training. Extending ResNets, Huang et al. (2017a) proposed better network architectures and training procedures. DenseNets, which are built from dense blocksconnecting The leading frameworks reviewed in Sect. 5 [RCNN (Gir- each layer to every other layer in a feedforward fashion, lead- shick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN ing to compelling advantages such as parameter efficiency, (Ren et al. 2015), YOLO (Redmon et al. 2016), SSD (Liu et al. implicit deep supervision , and feature reuse. Recently, He 2016)] have persistently promoted detection accuracy and et al. (2016) proposed Squeeze and Excitation (SE) blocks, speed, in which it is generally accepted that the CNN archi- which can be combined with existing deep architectures to tecture (Sect. 6.1 and Fig. 15) plays a crucial role. As a result, boost their performance at minimal additional computational most of the recent improvements in detection accuracy have cost, adaptively recalibrating channel-wise feature responses been via research into the development of novel networks. by explicitly modeling the interdependencies between con- Therefore we begin by reviewing popular CNN architectures volutional feature channels, and which led to winning the used in Generic Object Detection, followed by a review of ILSVRC 2017 classification task. Research on CNN archi- the effort devoted to improving object feature representa- tectures remains active, with emerging networks such as tions, such as developing invariant features to accommodate Hourglass (Law and Deng 2018), Dilated Residual Networks geometric variations in object scale, pose, viewpoint, part (Yu et al. 2017), Xception (Chollet 2017), DetNet (Lietal. deformation and performing multiscale analysis to improve 2018b), Dual Path Networks (DPN) (Chen et al. 2017b), Fish- object detection over a wide range of scales. Net (Sun et al. 2018), and GLoRe (Chen et al. 2019b). 6.1 Popular CNN Architectures DenseNets perform deep supervision in an implicit way, i.e. individ- CNN architectures (Sect. 3) serve as network backbones used ual layers receive additional supervision from other layers through the in the detection frameworks of Sect. 5. Representative frame- shorter connections. The benefits of deep supervision have previously works include AlexNet (Krizhevsky et al. 2012b), ZFNet been demonstrated in Deeply Supervised Nets (DSN) (Lee et al. 2015). 123 280 International Journal of Computer Vision (2020) 128:261–318 Table 6 DCNN architectures that were commonly used for generic object detection No. 
Each entry lists the architecture, #Paras (×10^6), #Layers (CONV+FC), Top-5 test error on ImageNet, the detector in which the architecture was first used, and highlights.

1. AlexNet (Krizhevsky et al. 2012b): 57; 5+2; 15.3%; first used in Girshick et al. (2014). The first DCNN found effective for ImageNet classification; the historical turning point from hand-crafted features to CNNs; winner of the ILSVRC2012 image classification competition.
2. ZFNet (fast) (Zeiler and Fergus 2014): 58; 5+2; 14.8%; first used in He et al. (2014). Similar to AlexNet, differing in convolution stride, filter size, and number of filters for some layers.
3. OverFeat (Sermanet et al. 2014): 140; 6+2; 13.6%; first used in Sermanet et al. (2014). Similar to AlexNet, differing in convolution stride, filter size, and number of filters for some layers.
4. VGGNet (Simonyan and Zisserman 2015): 134; 13+2; 6.8%; first used in Girshick (2015). Increases network depth significantly, step by step, by stacking 3 × 3 convolution filters.
5. GoogLeNet (Szegedy et al. 2015): 6; 22; 6.7%; first used in Szegedy et al. (2015). Uses the Inception module, which applies multiple branches of convolutional layers with different filter sizes and then concatenates the feature maps produced by these branches; the first inclusion of the bottleneck structure and global average pooling.
6. Inception v2 (Ioffe and Szegedy 2015): 12; 31; 4.8%; first used in Howard et al. (2017). Faster training with the introduction of batch normalization.
7. Inception v3 (Szegedy et al. 2016): 22; 47; 3.6%; −. Inclusion of separable convolution and spatial resolution reduction.
8. YOLONet (Redmon et al. 2016): 64; 24+1; −; first used in Redmon et al. (2016). A network inspired by GoogLeNet, used in the YOLO detector.
9. ResNet50 (He et al. 2016): 23.4; 49; 3.6% (ResNets); first used in He et al. (2016). With identity mapping, substantially deeper networks can be learned.
10. ResNet101 (He et al. 2016): 42; 100; −; first used in He et al. (2016). Requires fewer parameters than VGG by using the global average pooling and bottleneck introduced in GoogLeNet.
11. InceptionResNet v1 (Szegedy et al. 2017): 21; 87; 3.1% (Ensemble); −. Combination of identity mapping and the Inception module, with computational cost similar to Inception v3 but a faster training process.
12. InceptionResNet v2 (Szegedy et al. 2017): 30; 95; −; first used in Huang et al. (2017b). A costlier residual version of Inception, with significantly improved recognition performance.
13. Inception v4 (Szegedy et al. 2017): 41; 75; −; −. An Inception variant without residual connections, with roughly the same recognition performance as InceptionResNet v2 but significantly slower.
14. ResNeXt (Xie et al. 2017): 23; 49; 3.0%; first used in Xie et al. (2017). Repeats a building block that aggregates a set of transformations with the same topology.
15. DenseNet201 (Huang et al. 2017a): 18; 200; −; first used in Zhou et al. (2018b). Concatenates each layer with every other layer in a feed-forward fashion; alleviates the vanishing gradient problem, encourages feature reuse, and reduces the number of parameters.
16. DarkNet (Redmon and Farhadi 2017): 20; 19; −; first used in Redmon and Farhadi (2017). Similar to VGGNet, but with significantly fewer parameters.
17. MobileNet (Howard et al. 2017): 3.2; 27+1; −; first used in Howard et al. (2017). Lightweight deep CNNs using depth-wise separable convolutions.
18. SE ResNet (Hu et al. 2018b): 26; 50; 2.3% (SENets); first used in Hu et al. (2018b). Channel-wise attention via a novel block called Squeeze and Excitation; complementary to existing backbone CNNs.

Regarding the statistics for "#Paras" and "#Layers", the final FC prediction layer is not taken into consideration.
“Test Error” column indicates the Top 5 classification test error on ImageNet1000. When ambiguous, the “#Paras”, “#Layers”, and “Test Error” refer to: OverFeat (accurate model), VGGNet16, ResNet101 DenseNet201 (Growth Rate 32, DenseNet-BC), ResNeXt50 (32*4d), and SE ResNet50 282 International Journal of Computer Vision (2020) 128:261–318 The training of a CNN requires a large-scale labeled and use features from the top layer of the CNN as object rep- dataset with intraclass diversity. Unlike image classification, resentations; however, detecting objects across a large range detection requires localizing (possibly many) objects from an of scales is a fundamental challenge. A classical strategy to image. It has been shown (Ouyang et al. 2017b) that pretrain- address this issue is to run the detector over a number of ing a deep model with a large scale dataset having object level scaled input images (e.g., an image pyramid) (Felzenszwalb annotations (such as ImageNet), instead of only the image et al. 2010b; Girshick et al. 2014;Heetal. 2014), which level annotations, improves the detection performance. How- typically produces more accurate detection, with, however, ever, collecting bounding box labels is expensive, especially obvious limitations of inference time and memory. for hundreds of thousands of categories. A common scenario is for a CNN to be pretrained on a large dataset (usually with 6.2.1 Handling of Object Scale Variations a large number of visual categories) with image-level labels; the pretrained CNN can then be applied to a small dataset, Since a CNN computes its feature hierarchy layer by layer, directly, as a generic feature extractor (Razavian et al. 2014; the sub-sampling layers in the feature hierarchy already lead Azizpour et al. 2016; Donahue et al. 2014;Yosinskietal. to an inherent multiscale pyramid, producing feature maps at 2014), which can support a wider range of visual recogni- different spatial resolutions, but subject to challenges (Hari- tion tasks. For detection, the pre-trained network is typically haran et al. 2016; Long et al. 2015; Shrivastava et al. 2017). fine-tuned on a given detection dataset (Donahue et al. In particular, the higher layers have a large receptive field and 2014; Girshick et al. 2014, 2016). Several large scale image strong semantics, and are the most robust to variations such classification datasets are used for CNN pre-training, among as object pose, illumination and part deformation, but the res- them ImageNet1000 (Deng et al. 2009; Russakovsky et al. olution is low and the geometric details are lost. In contrast, 2015) with 1.2 million images of 1000 object categories, lower layers have a small receptive field and rich geomet- Places (Zhou et al. 2017a), which is much larger than Ima- ric details, but the resolution is high and much less sensitive geNet1000 but with fewer classes, a recent Places-Imagenet to semantics. Intuitively, semantic concepts of objects can hybrid (Zhou et al. 2017a), or JFT300M (Hinton et al. 2015; emerge in different layers, depending on the size of the Sun et al. 2017). objects. So if a target object is small it requires fine detail Pretrained CNNs without fine-tuning were explored for information in earlier layers and may very well disappear at object classification and detection in Donahue et al. (2014), later layers, in principle making small object detection very Girshick et al. (2016), Agrawal et al. 
(2014), where it was challenging, for which tricks such as dilated or “atrous” con- shown that detection accuracies are different for features volution (Yu and Koltun 2015; Dai et al. 2016c; Chen et al. extracted from different layers; for example, for AlexNet pre- 2018b) have been proposed, increasing feature resolution, trained on ImageNet, FC6 / FC7 / Pool5 are in descending but increasing computational complexity. On the other hand, order of detection accuracy (Donahue et al. 2014; Girshick if the target object is large, then the semantic concept will et al. 2016). Fine-tuning a pre-trained network can increase emerge in much later layers. A number of methods (Shrivas- detection performance significantly (Girshick et al. 2014, tava et al. 2017; Zhang et al. 2018e; Lin et al. 2017a; Kong 2016), although in the case of AlexNet, the fine-tuning perfor- et al. 2017) have been proposed to improve detection accu- mance boost was shown to be much larger for FC6 / FC7 than racy by exploiting multiple CNN layers, broadly falling into for Pool5, suggesting that Pool5 features are more general. three types of multiscale object detection: Furthermore, the relationship between the source and target datasets plays a critical role, for example that ImageNet based 1. Detecting with combined features of multiple layers; CNN features show better performance for object detection 2. Detecting at multiple layers; than for human action (Zhou et al. 2015; Azizpour et al. 3. Combinations of the above two methods. 2016). (1) Detecting with combined features of multiple CNN lay- ers Many approaches, including Hypercolumns (Hariharan 6.2 Methods For Improving Object Representation et al. 2016), HyperNet (Kong et al. 2016), and ION (Bell et al. 2016), combine features from multiple layers before Deep CNN based detectors such as RCNN (Girshick et al. making a prediction. Such feature combination is commonly 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. accomplished via concatenation, a classic neural network 2015) and YOLO (Redmon et al. 2016), typically use the deep idea that concatenates features from different layers, archi- CNN architectures listed in Table 6 as the backbone network tectures which have recently become popular for semantic segmentation (Long et al. 2015; Shelhamer et al. 2017;Har- Fine-tuning is done by initializing a network with weights optimized iharan et al. 2016). As shown in Fig. 16a, ION (Bell et al. for a large labeled dataset like ImageNet. and then updating the net- work’s weights using the target-task training set. 2016) uses RoI pooling to extract RoI features from multiple 123 International Journal of Computer Vision (2020) 128:261–318 283 ION (Bell et al. 2016). On the other hand, however, it is natural to detect objects of different scales using features of approximately the same size, which can be achieved by detecting large objects from downscaled feature maps while detecting small objects from upscaled feature maps. There- fore, in order to combine the best of both worlds, some recent works propose to detect objects at multiple layers, and the resulting features obtained by combining features from dif- ferent layers. This approach has been found to be effective for segmentation (Long et al. 2015; Shelhamer et al. 2017) and human pose estimation (Newell et al. 2016), has been (b) (a) widely exploited by both one-stage and two-stage detec- tors to alleviate problems of scale variation across object Fig. 16 Comparison of HyperNet and ION. 
LRN is local response nor- malization, which performs a kind of “lateral inhibition” by normalizing instances. Representative methods include SharpMask (Pin- over local input regions (Jia et al. 2014) heiro et al. 2016), Deconvolutional Single Shot Detector (DSSD) (Fu et al. 2017), Feature Pyramid Network (FPN) (Lin et al. 2017a), Top Down Modulation (TDM)(Shrivastava et al. 2017), Reverse connection with Objectness prior Net- layers, and then the object proposals generated by selective search and edgeboxes are classified by using the concatenated work (RON) (Kong et al. 2017), ZIP (Li et al. 2018a), Scale Transfer Detection Network (STDN) (Zhou et al. 2018b), features. HyperNet (Kong et al. 2016), shown in Fig. 16b, follows a similar idea, and integrates deep, intermediate and RefineDet (Zhang et al. 2018a), StairNet (Woo et al. 2018), shallow features to generate object proposals and to predict Path Aggregation Network (PANet) (Liu et al. 2018c), Fea- objects via an end to end joint training strategy. The com- ture Pyramid Reconfiguration (FPR) (Kong et al. 2018), bined feature is more descriptive, and is more beneficial for DetNet (Lietal. 2018b), Scale Aware Network (SAN) (Kim localization and classification, but at increased computational et al. 2018), Multiscale Location aware Kernel Representa- complexity. tion (MLKP) (Wang et al. 2018) and M2Det (Zhao et al. 2019), as shown in Table 7 and contrasted in Fig. 17. (2) Detecting at multiple CNN layers A number of recent approaches improve detection by predicting objects of differ- Early works like FPN (Lin et al. 2017a), DSSD (Fu et al. 2017), TDM (Shrivastava et al. 2017), ZIP (Li et al. 2018a), ent resolutions at different layers and then combining these predictions: SSD (Liu et al. 2016) and MSCNN (Cai et al. RON (Kong et al. 2017) and RefineDet (Zhang et al. 2018a) 2016), RBFNet (Liu et al. 2018b), and DSOD (Shen et al. construct the feature pyramid according to the inherent multi- 2017). SSD (Liu et al. 2016) spreads out default boxes of scale, pyramidal architecture of the backbone, and achieved different scales to multiple layers within a CNN, and forces encouraging results. As can be observed from Fig. 17a1– each layer to focus on predicting objects of a certain scale. f1, these methods have very similar detection architectures RFBNet (Liu et al. 2018b) replaces the later convolution lay- which incorporate a top-down network with lateral connec- ers of SSD with a Receptive Field Block (RFB) to enhance tions to supplement the standard bottom-up, feed-forward the discriminability and robustness of features. The RFB is network. Specifically, after a bottom-up pass the final high level semantic features are transmitted back by the top-down a multibranch convolutional block, similar to the Inception block (Szegedy et al. 2015), but combining multiple branches network to combine with the bottom-up features from inter- mediate layers after lateral processing, and the combined with different kernels and convolution layers (Chen et al. 2018b). MSCNN (Cai et al. 2016) applies deconvolution on features are then used for detection. As can be seen from multiple layers of a CNN to increase feature map resolution Fig. 17a2–e2, the main differences lie in the design of the before using the layers to learn region proposals and pool fea- simple Feature Fusion Block (FFB), which handles the selec- tures. Similar to RFBNet (Liu et al. 2018b), TridentNet (Li tion of features from different layers and the combination of et al. 
2019b) constructs a parallel multibranch architecture multilayer features. where each branch shares the same transformation param- FPN (Lin et al. 2017a) shows significant improvement as eters but with different receptive fields; dilated convolution a generic feature extractor in several applications including with different dilation rates are used to adapt the receptive object detection (Lin et al. 2017a, b) and instance segmen- tation (He et al. 2017). Using FPN in a basic Faster RCNN fields for objects of different scales. (3) Combinations of the above two methods Features from system achieved state-of-the-art results on the COCO detec- tion dataset. STDN (Zhou et al. 2018b) used DenseNet different layers are complementary to each other and can improve detection accuracy, as shown by Hypercolumns (Huang et al. 2017a) to combine features of different layers (Hariharan et al. 2016), HyperNet (Kong et al. 2016) and and designed a scale transfer module to obtain feature maps 123 284 International Journal of Computer Vision (2020) 128:261–318 Table 7 Summary of properties of representative methods in improving DCNN feature representations for generic object detection Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (1) Single ION (Bell et al. SS+EB VGG16 Fast RCNN 79.4 (07+12) 76.4 (07+12) 55.733.1 CVPR16 Use features from detection with 2016) MCG+RPN multiple layers; use multilayer spatial recurrent features neural networks for modeling contextual information; the Best Student Entry and the 3rd overall in the COCO detection challenge 2015 HyperNet (Kong RPN VGG16 Faster RCNN 76.3 (07+12) 71.4 (07T+12) −− CVPR16 Use features from et al. 2016) multiple layers for both region proposal and region classification PVANet (Kim RPN PVANet Faster RCNN 84.9 84.2 (07T+12+CO) −− NIPSW16 Deep but lightweight; et al. 2016) (07+12+CO) Combine ideas from concatenated ReLU (Shang et al. 2016), Inception (Szegedy et al. 2015), and HyperNet (Kong et al. 2016) International Journal of Computer Vision (2020) 128:261–318 285 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (2) Detection at SDP+CRC (Yang EB VGG16 Fast RCNN 69.4 (07) −− − CVPR16 Use features in multiple multiple layers et al. 2016b) layers to reject easy negatives via CRC, and then classify remaining proposals using SDP MSCNN (Cai RPN VGG Faster RCNN Only Tested on KITTI ECCV16 Region proposal and et al. 2016) classification are performed at multiple layers; includes feature upsampling; end to end learning MPN SharpMask VGG16 Fast RCNN −− 51.933.2 BMVC16 Concatenate features (Zagoruyko (Pinheiro from different et al. 2016) et al. 2016) convolutional layers and features of different contextual regions; loss function for multiple overlap thresholds; ranked 2nd in both the COCO15 detection and segmentation challenges DSOD (Shen Free DenseNet SSD 77.7 (07+12) 72.2 (07T+12) 47.329.3 ICCV17 Concatenate feature et al. 2017) sequentially, like DenseNet. Train from scratch on the target dataset without pre-training RFBNet (Liu Free VGG16 SSD 82.2 (07+12) 81.2 (07T+12) 55.734.4 ECCV18 Propose a multi-branch et al. 2018b) convolutional block similar to Inception (Szegedy et al. 
2015), but using dilated convolution 286 International Journal of Computer Vision (2020) 128:261–318 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (3) Combination DSSD (Fu et al. Free ResNet101 SSD 81.5 (07+12) 80.0 (07T+12) 53.333.2 2017 Use Conv-Deconv, as of (1) and (2) 2017) shown in Fig. 17c1, c2 FPN (Lin et al. RPN ResNet101 Faster RCNN−− 59.136.2 CVPR17 Use Conv-Deconv, as 2017a) shown in Fig. 17a1, a2; Widely used in detectors TDM RPN ResNet101 Faster RCNN−− 57.736.8 CVPR17 Use Conv-Deconv, as (Shrivastava VGG16 shown in Fig. 17b2 et al. 2017) RON (Kong et al. RPN VGG16 Faster RCNN 81.3 80.7 (07T+12+CO) 49.527.4 CVPR17 Use Conv-deconv, as 2017) (07+12+CO) shown in Fig. 17d2; Add the objectness prior to significantly reduce object search space ZIP (Li et al. RPN Inceptionv2 Faster RCNN 79.8 (07+12) −− − IJCV18 Use Conv-Deconv, as 2018a) shown in Fig. 17f1. Propose a map attention decision (MAD) unit for features from different layers International Journal of Computer Vision (2020) 128:261–318 287 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO STDN (Zhou Free DenseNet169 SSD 80.9 (07+12) − 51.031.8 CVPR18 A new scale transfer et al. 2018b) module, which resizes features of different scales to the same scale in parallel RefineDet RPN VGG16 Faster RCNN 83.8 (07+12) 83.5 (07T+12) 62.941.8 CVPR18 Use cascade to obtain (Zhang et al. ResNet101 better and less 2018a) anchors. Use Conv-deconv, as shown in Fig. 17e2 to improve features PANet (Liu et al. RPN ResNeXt101 Mask RCNN −− 67.2 47.4 CVPR18 Shown in Fig. 17g. 2018c) +FPN Based on FPN, add another bottom-up path to pass information between lower and topmost layers; adaptive feature pooling. Ranked 1st and 2nd in COCO 2017 tasks DetNet (Li et al. RPN DetNet59+FPN Faster RCNN −− 61.740.2 ECCV18 Introduces dilated 2018b) convolution into the ResNet backbone to maintain high resolution in deeper layers; Shown in Fig. 17i FPR (Kong et al. − VGG16 SSD 82.4 (07+12) 81.1 (07T+12) 54.334.6 ECCV18 Fuse task oriented 2018) ResNet101 features across different spatial locations and scales, globally and locally; Shown in Fig. 17h M2Det (Zhao − SSD VGG16 ResNet101−− 64.644.2 AAAI19 Shown in Fig. 17j, et al. 2019) newly designed top down path to learn a set of multilevel features, recombined to construct a feature pyramid for object detection 288 International Journal of Computer Vision (2020) 128:261–318 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (4) Model DeepIDNet SS+ EB AlexNet ZFNet RCNN 69.0 (07) −− 25.6 CVPR15 Introduce a deformation geometric (Ouyang et al. OverFeat constrained pooling transforms 2015) GoogLeNet layer, jointly learned with convolutional layers in existing DCNNs. Utilize the following modules that are not trained end to end: cascade, context modeling, model averaging, and bounding box location refinement in the multistage detection pipeline DCN (Dai et al. RPN ResNet101 RFCN 82.6 (07+12) − 58.037.5 CVPR17 Design deformable 2017) IRN convolution and deformable RoI pooling modules that can replace plain convolution in existing DCNNs DPFCN (Mordan AttractioNet ResNet RFCN 83.3 (07+12) 81.2 (07T+12) 59.139.1 IJCV18 Design a deformable et al. 
2018) (Gidaris and part based RoI pooling Komodakis layer to explicitly 2016) select discriminative regions around object proposals Details for Groups (1), (2), and (3) are provided in Sect. 6.2. Abbreviations: Selective Search (SS), EdgeBoxes (EB), InceptionResNet (IRN). Conv-Deconv denotes the use of upsampling and convolutional layers with lateral connections to supplement the standard backbone network. Detection results on VOC07, VOC12 and COCO were reported with mAP@IoU = 0.5, and the additional COCO results are computed as the average of mAP for IoU thresholds from 0.5 to 0.95. Training data: “07”←VOC2007 trainval; “07T”←VOC2007 trainval and test; “12”←VOC2012 trainval; CO← COCO trainval. The COCO detection results were reported with COCO2015 Test-Dev, except for MPN (Zagoruyko et al. 2016) which reported with COCO2015 Test-Standard International Journal of Computer Vision (2020) 128:261–318 289 Fig. 17 Hourglass architectures: Conv1 to Conv5 are the main Conv DSSD (Fu et al. 2017), RON (Kong et al. 2017), RefineDet (Zhang blocks in backbone networks such as VGG or ResNet. The figure com- et al. 2018a), ZIP (Li et al. 2018a), PANet (Liu et al. 2018c), FPR pares a number of feature fusion blocks (FFB) commonly used in recent (Kong et al. 2018), DetNet (Li et al. 2018b) and M2Det (Zhao et al. approaches: FPN (Lin et al. 2017a), TDM (Shrivastava et al. 2017), 2019). FFM feature fusion module, TUM thinned U-shaped module 123 290 International Journal of Computer Vision (2020) 128:261–318 with different resolutions. The scale transfer module can be variations other than just scale, which we group into three directly embedded into DenseNet with little additional cost. categories: More recent work, such as PANet (Liu et al. 2018c), FPR (Kong et al. 2018), DetNet (Li et al. 2018b), and M2Det • Geometric transformations, (Zhao et al. 2019), as shown in Fig. 17g–j, propose to further • Occlusions, and improve on the pyramid architectures like FPN in different • Image degradations. ways. Based on FPN, Liu et al. designed PANet (Liu et al. 2018c) (Fig. 17g1) by adding another bottom-up path with To handle these intra-class variations, the most straightfor- clean lateral connections from low to top levels, in order ward approach is to augment the training datasets with a to shorten the information path and to enhance the feature sufficient amount of variations; for example, robustness to pyramid. Then, an adaptive feature pooling was proposed to rotation could be achieved by adding rotated objects at many aggregate features from all feature levels for each proposal. orientations to the training data. Robustness can frequently In addition, in the proposal sub-network, a complementary be learned this way, but usually at the cost of expensive train- branch capturing different views for each proposal is cre- ing and complex model parameters. Therefore, researchers ated to further improve mask prediction. These additional have proposed alternative solutions to these problems. steps bring only slightly extra computational overhead, but Handling of geometric transformations DCNNs are inher- are effective and allowed PANet to reach 1st place in the ently limited by the lack of ability to be spatially invariant COCO 2017 Challenge Instance Segmentation task and 2nd to geometric transformations of the input data (Lenc and place in the Object Detection task. Kong et al. proposed FPR Vedaldi 2018; Liu et al. 2017; Chellappa 2016). The intro- (Kong et al. 
2018) by explicitly reformulating the feature duction of local max pooling layers has allowed DCNNs to pyramid construction process [e.g. FPN (Lin et al. 2017a)] enjoy some translation invariance, however the intermediate as feature reconfiguration functions in a highly nonlinear but feature maps are not actually invariant to large geometric efficient way. As shown in Fig. 17h1, instead of using a top- transformations of the input data (Lenc and Vedaldi 2018). down path to propagate strong semantic features from the Therefore, many approaches have been presented to enhance topmost layer down as in FPN, FPR first extracts features robustness, aiming at learning invariant CNN representations from multiple layers in the backbone network by adaptive with respect to different types of transformations such as concatenation, and then designs a more complex FFB module scale (Kim et al. 2014; Bruna and Mallat 2013), rotation (Fig. 17h2) to spread strong semantics to all scales. Li et al. (Bruna and Mallat 2013; Cheng et al. 2016; Worrall et al. (2018b) proposed DetNet (Fig. 17i1) by introducing dilated 2017; Zhou et al. 2017b), or both (Jaderberg et al. 2015). One convolutions to the later layers of the backbone network in representative work is Spatial Transformer Network (STN) order to maintain high spatial resolution in deeper layers. (Jaderberg et al. 2015), which introduces a new learnable Zhao et al. (2019) proposed a MultiLevel Feature Pyramid module to handle scaling, cropping, rotations, as well as non- Network (MLFPN) to build more effective feature pyramids rigid deformations via a global parametric transformation. for detecting objects of different scales. As can be seen from STN has now been used in rotated text detection (Jaderberg Fig. 17j1, features from two different layers of the backbone et al. 2015), rotated face detection and generic object detec- are first fused as the base feature, after which a top-down tion (Wang et al. 2017). path with lateral connections from the base feature is created Although rotation invariance may be attractive in certain to build the feature pyramid. As shown in Fig. 17j2, j5, the applications, such as scene text detection (He et al. 2018; FFB module is much more complex than those like FPN, in Ma et al. 2018), face detection (Shi et al. 2018), and aerial that FFB involves a Thinned U-shaped Module (TUM) to imagery (Ding et al. 2018; Xia et al. 2018), there is limited generate a second pyramid structure, after which the feature generic object detection work focusing on rotation invariance maps with equivalent sizes from multiple TUMs are com- because popular benchmark detection datasets (e.g. PAS- bined for object detection. The authors proposed M2Det by CAL VOC, ImageNet, COCO) do not actually present rotated integrating MLFPN into SSD, and achieved better detection images. performance than other one-stage detectors. Before deep learning, Deformable Part based Models (DPMs) (Felzenszwalb et al. 2010b) were successful for 6.3 Handling of Other Intraclass Variations generic object detection, representing objects by compo- nent parts arranged in a deformable configuration. Although Powerful object representations should combine distinctive- DPMs have been significantly outperformed by more recent ness and robustness. A large amount of recent work has been object detectors, their spirit still deeply influences many devoted to handling changes in object scale, as reviewed in recent detectors. DPM modeling is less sensitive to transfor- Sect. 6.2.1. 
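As a concrete illustration of the top-down pathway with lateral connections shared by FPN and the related pyramid designs reviewed in Sect. 6.2.1, the following is a minimal PyTorch-style sketch; the ResNet-like channel widths, the nearest-neighbour upsampling and the 3 × 3 smoothing convolutions are common choices assumed here, not the exact configuration of any one paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    """Minimal FPN-style fusion: 1x1 lateral convs project backbone features
    C3..C5 to a common width, each level is upsampled and added to the
    next-finer lateral feature, and a 3x3 conv smooths the result."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats: [C3, C4, C5], finest resolution first
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down accumulation
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [sm(p) for sm, p in zip(self.smooth, laterals)]  # [P3, P4, P5]

# Example with ResNet-like widths and strides 8/16/32 on a 256 x 256 input.
c3 = torch.randn(1, 512, 32, 32)
c4 = torch.randn(1, 1024, 16, 16)
c5 = torch.randn(1, 2048, 8, 8)
p3, p4, p5 = TopDownPyramid()((c3, c4, c5))
```

The fused maps P3 to P5 can then feed per-level detection heads, which is how such pyramid detectors assign small objects to high-resolution levels and large objects to coarse levels.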
As discussed in Sect. 2.2 and summarized in mations in object pose, viewpoint and nonrigid deformations, Fig. 5, object detection still requires robustness to real-world motivating researchers (Dai et al. 2017; Girshick et al. 2015; 123 International Journal of Computer Vision (2020) 128:261–318 291 Mordan et al. 2018; Ouyang et al. 2015; Wan et al. 2015)to Bar 2004) that context plays an essential role in human explicitly model object composition to improve CNN based object recognition, and it is recognized that a proper mod- detection. The first attempts (Girshick et al. 2015; Wan et al. eling of context helps object detection and recognition 2015) combined DPMs with CNNs by using deep features (Torralba 2003; Oliva and Torralba 2007; Chen et al. 2018b, learned by AlexNet in DPM based detection, but without 2015a; Divvala et al. 2009; Galleguillos and Belongie 2010), region proposals. To enable a CNN to benefit from the built- especially when object appearance features are insufficient in capability of modeling the deformations of object parts, a because of small object size, object occlusion, or poor image number of approaches were proposed, including DeepIDNet quality. Many different types of context have been discussed (Ouyang et al. 2015), DCN (Dai et al. 2017) and DPFCN (Divvala et al. 2009; Galleguillos and Belongie 2010), and (Mordan et al. 2018) (shown in Table 7). Although simi- can broadly be grouped into one of three categories: lar in spirit, deformations are computed in different ways: DeepIDNet (Ouyang et al. 2017b) designed a deformation 1. Semantic context: The likelihood of an object to be found constrained pooling layer to replace regular max pooling, to in some scenes, but not in others; learn the shared visual patterns and their deformation prop- 2. Spatial context: The likelihood of finding an object in erties across different object classes; DCN (Dai et al. 2017) some position and not others with respect to other objects designed a deformable convolution layer and a deformable in the scene; RoI pooling layer, both of which are based on the idea of 3. Scale context: Objects have a limited set of sizes relative augmenting regular grid sampling locations in feature maps; to other objects in the scene. and DPFCN (Mordan et al. 2018) proposed a deformable part-based RoI pooling layer which selects discriminative A great deal of work (Chen et al. 2015b; Divvala et al. parts of objects around object proposals by simultaneously 2009; Galleguillos and Belongie 2010; Malisiewicz and optimizing latent displacements of all parts. Efros 2009; Murphy et al. 2003; Rabinovich et al. 2007; Handling of occlusions In real-world images, occlu- Parikh et al. 2012) preceded the prevalence of deep learning, sions are common, resulting in information loss from object and much of this work has yet to be explored in DCNN-based instances. A deformable parts idea can be useful for occlu- object detectors (Chen and Gupta 2017;Huetal. 2018a). sion handling, so deformable RoI Pooling (Dai et al. 2017; The current state of the art in object detection (Ren et al. Mordan et al. 2018; Ouyang and Wang 2013) and deformable 2015; Liu et al. 2016;Heetal. 2017) detects objects with- convolution (Dai et al. 2017) have been proposed to allevi- out explicitly exploiting any contextual information. It is ate occlusion by giving more flexibility to the typically fixed broadly agreed that DCNNs make use of contextual informa- geometric structures. Wang et al. 
(2017) propose to learn an tion implicitly (Zeiler and Fergus 2014; Zheng et al. 2015) adversarial network that generates examples with occlusions since they learn hierarchical representations with multiple and deformations, and context may be helpful in dealing with levels of abstraction. Nevertheless, there is value in exploring occlusions (Zhang et al. 2018b). Despite these efforts, the contextual information explicitly in DCNN based detectors occlusion problem is far from being solved; applying GANs (Hu et al. 2018a; Chen and Gupta 2017; Zeng et al. 2017), so to this problem may be a promising research direction. the following reviews recent work in exploiting contextual Handling of image degradations Image noise is a com- cues in DCNN- based object detectors, organized into cate- mon problem in many real-world applications. It is frequently gories of global and local contexts, motivated by earlier work caused by insufficient lighting, low quality cameras, image in Zhang et al. (2013), Galleguillos and Belongie (2010). compression, or the intentional low-cost sensors on edge Representative approaches are summarized in Table 8. devices and wearable devices. While low image quality may be expected to degrade the performance of visual recogni- 7.1 Global Context tion, most current methods are evaluated in a degradation free and clean environment, evidenced by the fact that PASCAL Global context (Zhang et al. 2013; Galleguillos and Belongie VOC, ImageNet, MS COCO and Open Images all focus on 2010) refers to image or scene level contexts, which can serve relatively high quality images. To the best of our knowledge, as cues for object detection (e.g., a bedroom will predict the there is so far very limited work to address this problem. presence of a bed). In DeepIDNet (Ouyang et al. 2015), the image classification scores were used as contextual features, and concatenated with the object detection scores to improve 7 Context Modeling detection results. In ION (Bell et al. 2016), Bell et al. pro- posed to use spatial Recurrent Neural Networks (RNNs) to In the physical world, visual objects occur in particular envi- explore contextual information across the entire image. In ronments and usually coexist with other related objects. SegDeepM (Zhu et al. 2015), Zhu et al. proposed a Markov There is strong psychological evidence (Biederman 1972; random field model that scores appearance as well as context 123 292 International Journal of Computer Vision (2020) 128:261–318 Table 8 Summary of detectors that exploit context information, with labelling details as in Table 7 Group Detector name Region proposal Backbone DCNN Pipelined Used mAP@IoU = 0.5 mAP Published in Highlights VOC07 VOC12 COCO Global context SegDeepM (Zhu SS+CMPC VGG16 RCNN VOC10 VOC12 − CVPR15 Additional features et al. 2015) extracted from an enlarged object proposal as context information DeepIDNet SS+EB AlexNet ZFNet RCNN 69.0 (07) −− CVPR15 Use image classification (Ouyang et al. scores as global 2015) contextual information to refine the detection scores of each object proposal ION (Bell et al. 
SS+EB VGG16 Fast RCNN 80.177.933.1 CVPR16 The contextual 2016) information outside the region of interest is integrated using spatial recurrent neural networks CPF (Shrivastava RPN VGG16 Faster RCNN 76.4 (07+12) 72.6 (07T+12) − ECCV16 Use semantic and Gupta segmentation to 2016) provide top-down feedback Local context MRCNN SS VGG16 SPPNet 78.2 (07+12) 73.9 (07+12) − ICCV15 Extract features from (Gidaris and multiple regions Komodakis surrounding or inside 2015) the object proposals. Integrate the semantic segmentation-aware features GBDNet (Zeng CRAFT (Yang Inception v2 Fast RCNN 77.2 (07+12) − 27.0 ECCV16 TPAMI18 A GBDNet module to et al. 2016, et al. 2016a) ResNet269 learn the relations of 2017) PolyNet multiscale (Zhang et al. contextualized regions 2017) surrounding an object proposal; GBDNet passes messages among features from different context regions through convolution between neighboring support regions in two directions International Journal of Computer Vision (2020) 128:261–318 293 Table 8 continued Group Detector name Region proposal Backbone DCNN Pipelined Used mAP@IoU = 0.5 mAP Published in Highlights VOC07 VOC12 COCO ACCNN (Li SS VGG16 Fast RCNN 72.0 (07+12) 70.6 (07T+12) − TMM17 Use LSTM to capture et al. 2017b) global context. Concatenate features from multi-scale contextual regions surrounding an object proposal. The global and local context features are concatenated for recognition CoupleNet (Zhu RPN ResNet101 RFCN 82.7 (07+12) 80.4 (07T+12) 34.4 ICCV17 Concatenate features et al. 2017a) from multiscale contextual regions surrounding an object proposal. Features of different contextual regions are then combined by convolution and element-wise sum SMN (Chen and RPN VGG16 Faster RCNN 70.0 (07) −− ICCV17 Model object-object Gupta 2017) relationships efficiently through a spatial memory network. Learn the functionality of NMS automatically ORN (Hu et al. RPN ResNet101 Faster RCNN −− 39.0 CVPR18 Model the relations of a 2018a) +DCN set of object proposals through the interactions between their appearance features and geometry. Learn the functionality of NMS automatically SIN (Liu et al. RPN VGG16 Faster RCNN 76.0 (07+12) 73.1 (07T+12) 23.2 CVPR18 Formulate object 2018d) detection as graph-structured inference, where objects are graph nodes and relationships the edges 294 International Journal of Computer Vision (2020) 128:261–318 for each detection, and allows each candidate box to select a contextual region and semantically segmented regions), in segment out of a large pool of object segmentation proposals order to obtain a richer and more robust object representa- and score the agreement between them. In Shrivastava and tion. All of these features are combined by concatenation. Gupta (2016), semantic segmentation was used as a form of Quite a number of methods, all closely related to MRCNN, contextual priming. have been proposed since then. The method in Zagoruyko et al. (2016) used only four contextual regions, organized in a foveal structure, where the classifiers along multiple paths 7.2 Local Context are trained jointly end-to-end. Zeng et al. (2016), Zeng et al. (2017) proposed GBDNet (Fig. 18b) to extract features from Local context (Zhang et al. 2013; Galleguillos and Belongie multiscale contextualized regions surrounding an object pro- 2010; Rabinovich et al. 2007) considers the relationship posal to improve detection performance. 
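The global-context strategy described above (for example combining image-level cues with per-region features, in the spirit of DeepIDNet and ACCNN) can be sketched as a detection head that concatenates a globally pooled backbone feature with each RoI feature; the layer sizes and the use of global average pooling here are illustrative assumptions, not the implementation of any particular method in Table 8.

```python
import torch
import torch.nn as nn

class GlobalContextHead(nn.Module):
    """Illustrative detection head: an image-level feature obtained by global
    average pooling of the backbone map is concatenated with every RoI feature
    before classification and box regression."""
    def __init__(self, roi_dim=1024, ctx_dim=1024, num_classes=81):
        super().__init__()
        self.cls = nn.Linear(roi_dim + ctx_dim, num_classes)
        self.reg = nn.Linear(roi_dim + ctx_dim, 4 * num_classes)

    def forward(self, roi_feats, backbone_map):
        # roi_feats: (num_rois, roi_dim); backbone_map: (1, ctx_dim, H, W)
        ctx = backbone_map.mean(dim=(2, 3))          # (1, ctx_dim) global context
        ctx = ctx.expand(roi_feats.size(0), -1)      # broadcast to every RoI
        fused = torch.cat([roi_feats, ctx], dim=1)
        return self.cls(fused), self.reg(fused)

scores, deltas = GlobalContextHead()(torch.randn(100, 1024),
                                     torch.randn(1, 1024, 25, 38))
```

Richer variants replace the simple concatenation with gated message passing (GBDNet) or attention over contextual locations (ACCNN), as summarized in Table 8.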
In contrast to the among locally nearby objects, as well as the interactions somewhat naive approach of learning CNN features for each between an object and its surrounding area. In general, mod- region separately and then concatenating them, GBDNet eling object relations is challenging, requiring reasoning passes messages among features from different contextual about bounding boxes of different classes, locations, scales regions. Noting that message passing is not always helpful, etc. Deep learning research that explicitly models object rela- but dependent on individual samples, Zeng et al. (2016)used tions is quite limited, with representative ones being Spatial gated functions to control message transmission. Li et al. Memory Network (SMN) (Chen and Gupta 2017), Object (2017b) presented ACCNN (Fig. 18c) to utilize both global Relation Network (Hu et al. 2018a), and Structure Inference and local contextual information: the global context was Network (SIN) (Liu et al. 2018d). In SMN, spatial memory captured using a Multiscale Local Contextualized (MLC) essentially assembles object instances back into a pseudo subnetwork, which recurrently generates an attention map for image representation that is easy to be fed into another CNN an input image to highlight promising contextual locations; for object relations reasoning, leading to a new sequential local context adopted a method similar to that of MRCNN reasoning architecture where image and memory are pro- (Gidaris and Komodakis 2015). As shown in Fig. 18d, Cou- cessed in parallel to obtain detections which further update pleNet (Zhu et al. 2017a) is conceptually similar to ACCNN memory. Inspired by the recent success of attention mod- (Lietal. 2017b), but built upon RFCN (Dai et al. 2016c), ules in natural language processing (Vaswani et al. 2017), which captures object information with position sensitive ORN processes a set of objects simultaneously through the RoI pooling, CoupleNet added a branch to encode the global interaction between their appearance feature and geometry. context with RoI pooling. It does not require additional supervision, and it is easy to embed into existing networks, effective in improving object recognition and duplicate removal steps in modern object 8 Detection Proposal Methods detection pipelines, giving rise to the first fully end-to-end object detector. SIN (Liu et al. 2018d) considered two kinds An object can be located at any position and scale in an of context: scene contextual information and object relation- image. During the heyday of handcrafted feature descrip- ships within a single image. It formulates object detection as tors [SIFT (Lowe 2004), HOG (Dalal and Triggs 2005) and a problem of graph inference, where the objects are treated LBP (Ojala et al. 2002)], the most successful methods for as nodes in a graph and relationships between objects are object detection [e.g. DPM (Felzenszwalb et al. 2008)] used modeled as edges. sliding window techniques (Viola and Jones 2001; Dalal and A wider range of methods has approached the con- Triggs 2005; Felzenszwalb et al. 2008; Harzallah et al. 2009; text challenge with a simpler idea: enlarging the detec- Vedaldi et al. 2009). However, the number of windows is tion window size to extract some form of local context. 
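As a small illustration of the geometric side of the object-relation reasoning above (SMN, ORN, SIN), the sketch below computes scale-invariant pairwise box geometry features of the kind typically fed to a relation or attention module; the exact encoding, and whatever embedding and appearance interaction follow it, are assumptions rather than the published implementations.

```python
import torch

def pairwise_box_geometry(boxes, eps=1e-6):
    """Scale-invariant geometric features for every ordered pair of boxes.

    boxes: (N, 4) tensor of [x1, y1, x2, y2].
    Returns (N, N, 4): log-encoded relative center offsets and relative sizes,
    a common geometric input to relation/attention modules over object proposals.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=eps)
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=eps)

    dx = torch.log((cx[None, :] - cx[:, None]).abs().clamp(min=eps) / w[:, None])
    dy = torch.log((cy[None, :] - cy[:, None]).abs().clamp(min=eps) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)
```

A relation module would combine such geometric features with per-proposal appearance features to weight the influence of each object on every other object.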
huge, growing with the number of pixels in an image, and Representative approaches include MRCNN (Gidaris and the need to search at multiple scales and aspect ratios further Komodakis 2015), Gated BiDirectional CNN (GBDNet) increases the search space . Therefore, it is computationally Zeng et al. (2016), Zeng et al. (2017), Attention to Con- too expensive to apply sophisticated classifiers. text CNN (ACCNN) (Li et al. 2017b), CoupleNet (Zhu et al. Around 2011, researchers proposed to relieve the tension 2017a), and Sermanet et al. (2013). In MRCNN (Gidaris between computational tractability and high detection qual- and Komodakis 2015) (Fig. 18a), in addition to the features extracted from the original object proposal at the last CONV 12 4 Sliding window based detection requires classifying around 10 – layer of the backbone, Gidaris and Komodakis proposed to 10 windows per image. The number of windows grows significantly 6 7 extract features from a number of different regions of an to 10 –10 windows per image when considering multiple scales and object proposal (half regions, border regions, central regions, aspect ratios. 123 International Journal of Computer Vision (2020) 128:261–318 295 (b) (a) (d) (c) Fig. 18 Representative approaches that explore local surrounding contextual features: MRCNN (Gidaris and Komodakis 2015), GBDNet (Zeng et al. 2016, 2017), ACCNN (Li et al. 2017b) and CoupleNet (Zhu et al. 2017a); also see Table 8 ity by using detection proposals (Van de Sande et al. 2011; paper, because object proposals have applications beyond Uijlings et al. 2013). Originating in the idea of objectness object detection (Arbeláez et al. 2012; Guillaumin et al. proposed by Alexe et al. (2010), object proposals are a set 2014; Zhu et al. 2017b). We refer interested readers to the of candidate regions in an image that are likely to contain recent surveys (Hosang et al. 2016; Chavali et al. 2016) which objects, and if high object recall can be achieved with a mod- provide in-depth analysis of many classical object proposal est number of object proposals (like one hundred), significant algorithms and their impact on detection performance. Our speed-ups over the sliding window approach can be gained, interest here is to review object proposal methods that are allowing the use of more sophisticated classifiers. Detection based on DCNNs, output class agnostic proposals, and are proposals are usually used as a pre-processing step, limit- related to generic object detection. ing the number of regions that need to be evaluated by the In 2014, the integration of object proposals (Van de detector, and should have the following characteristics: Sande et al. 2011; Uijlings et al. 2013) and DCNN features (Krizhevsky et al. 2012a) led to the milestone RCNN (Gir- shick et al. 2014) in generic object detection. Since then, 1. High recall, which can be achieved with only a few pro- detection proposal has quickly become a standard prepro- posals; cessing step, based on the fact that all winning entries in 2. Accurate localization, such that the proposals match the the PASCAL VOC (Everingham et al. 2010), ILSVRC (Rus- object bounding boxes as accurately as possible; and sakovsky et al. 2015) and MS COCO (Lin et al. 2014) object 3. Low computational cost. detection challenges since 2014 used detection proposals (Girshick et al. 2014; Ouyang et al. 2015; Girshick 2015; The success of object detection based on detection proposals Ren et al. 2015; Zeng et al. 2017;Heetal. 2017). (Van de Sande et al. 
2011; Uijlings et al. 2013) has attracted Among object proposal approaches based on traditional broad interest (Carreira and Sminchisescu 2012; Arbeláez low-level cues (e.g., color, texture, edge and gradients), et al. 2014; Alexe et al. 2012; Cheng et al. 2014; Zitnick Selective Search (Uijlings et al. 2013), MCG (Arbeláez et al. and Dollár 2014; Endres and Hoiem 2010; Krähenbühl and 2014) and EdgeBoxes (Zitnick and Dollár 2014) are among Koltun 2014; Manen et al. 2013). A comprehensive review the more popular. As the domain rapidly progressed, tra- of object proposal algorithms is beyond the scope of this ditional object proposal approaches (Uijlings et al. 2013; Hosang et al. 2016; Zitnick and Dollár 2014), which were We use the terminology detection proposals, object proposals and adopted as external modules independent of the detectors, region proposals interchangeably. 123 296 International Journal of Computer Vision (2020) 128:261–318 became the speed bottleneck of the detection pipeline (Ren ground. Li et al. (2018a) proposed ZIP to improve RPN by et al. 2015). An emerging class of object proposal algorithms predicting object proposals with multiple convolutional fea- (Erhan et al. 2014; Ren et al. 2015; Kuo et al. 2015; Ghodrati ture maps at different network depths to integrate both low et al. 2015; Pinheiro et al. 2015; Yang et al. 2016a)using level details and high level semantics. The backbone used in DCNNs has attracted broad attention. ZIP is a “zoom out and in” network inspired by the conv and Recent DCNN based object proposal methods generally deconv structure (Long et al. 2015). fall into two categories: bounding box based and object Finally, recent work which deserves mention includes segment based, with representative methods summarized in Deepbox (Kuo et al. 2015), which proposed a lightweight Table 9. CNN to learn to rerank proposals generated by EdgeBox, and Bounding Box Proposal Methods are best exemplified by DeNet (TychsenSmith and Petersson 2017) which introduces the RPC method of Ren et al. (2015), illustrated in Fig. 19. bounding box corner estimation to predict object proposals RPN predicts object proposals by sliding a small network efficiently to replace RPN in a Faster RCNN style detector. over the feature map of the last shared CONV layer. At each Object Segment Proposal Methods Pinheiro et al. (2015), sliding window location, k proposals are predicted by using Pinheiro et al. (2016) aim to generate segment proposals that k anchor boxes, where each anchor box is centered at some are likely to correspond to objects. Segment proposals are location in the image, and is associated with a particular scale more informative than bounding box proposals, and take a and aspect ratio. Ren et al. (2015) proposed integrating RPN step further towards object instance segmentation (Hariha- and Fast RCNN into a single network by sharing their convo- ran et al. 2014; Dai et al. 2016b;Lietal. 2017e). In addition, lutional layers, leading to Faster RCNN, the first end-to-end using instance segmentation supervision can improve the per- detection pipeline. RPN has been broadly selected as the formance of bounding box object detection. The pioneering proposal method by many state-of-the-art object detectors, work of DeepMask, proposed by Pinheiro et al. (2015), seg- as can be observed from Tables 7 and 8. ments proposals learnt directly from raw image data with a Instead of fixing apriori a set of anchors as MultiBox deep network. 
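To make the anchor mechanism of RPN described above concrete, here is a minimal, illustrative sketch of dense anchor generation over a feature map; the stride, scales, and aspect ratios are typical values rather than the exact configuration of any particular detector.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors at every feature-map cell.

    Returns a (feat_h * feat_w * k, 4) array of [x1, y1, x2, y2] boxes in image
    coordinates, centered on the image locations corresponding to the cells.
    """
    base = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)   # area ~ s^2, aspect ratio w/h = r
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                            # (k, 4) anchors centered at the origin

    # centers of all feature-map cells, mapped back to image coordinates
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)  # (H*W, 4)

    anchors = shifts[:, None, :] + base[None, :, :]  # (H*W, k, 4)
    return anchors.reshape(-1, 4)

# e.g. a 38x50 feature map from a roughly 600x800 image yields 38 * 50 * 9 = 17,100 anchors
```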
Similarly to RPN, after a number of shared (Erhan et al. 2014; Szegedy et al. 2014) and RPN (Ren et al. convolutional layers DeepMask splits the network into two 2015), Lu et al. (2016) proposed generating anchor locations branches in order to predict a class agnostic mask and an by using a recursive search strategy which can adaptively associated objectness score. Also similar to the efficient slid- guide computational resources to focus on sub-regions likely ing window strategy in OverFeat (Sermanet et al. 2014), to contain objects. Starting with the whole image, all regions the trained DeepMask network is applied in a sliding win- visited during the search process serve as anchors. For any dow manner to an image (and its rescaled versions) during anchor region encountered during the search procedure, a inference. More recently, Pinheiro et al. (2016) proposed scalar zoom indicator is used to decide whether to further par- SharpMask by augmenting the DeepMask architecture with tition the region, and a set of bounding boxes with objectness a refinement module, similar to the architectures shown in scores are computed by an Adjacency and Zoom Network Fig. 17 (b1) and (b2), augmenting the feed-forward net- (AZNet), which extends RPN by adding a branch to com- work with a top-down refinement process. SharpMask can pute the scalar zoom indicator in parallel with the existing efficiently integrate spatially rich information from early fea- branch. tures with strong semantic information encoded in later layers Further work attempts to generate object proposals by to generate high fidelity object masks. exploiting multilayer convolutional features. Concurrent Motivated by Fully Convolutional Networks (FCN) for with RPN (Ren et al. 2015), Ghodrati et al. (2015)pro- semantic segmentation (Long et al. 2015) and DeepMask posed DeepProposal, which generates object proposals by (Pinheiro et al. 2015; Dai et al. 2016a) proposed Instance- using a cascade of multiple convolutional features, building FCN to generate instance segment proposals. Similar to an inverse cascade to select the most promising object loca- DeepMask, the InstanceFCN network is split into two fully tions and to refine their boxes in a coarse-to-fine manner. convolutional branches, one to generate instance sensitive An improved variant of RPN, HyperNet (Kong et al. 2016) score maps, the other to predict the objectness score. Hu et al. designs Hyper Features which aggregate multilayer convolu- (2017) proposed FastMask to efficiently generate instance tional features and shares them both in generating proposals segment proposals in a one-shot manner, similar to SSD (Liu and detecting objects via an end-to-end joint training strat- et al. 2016), in order to make use of multiscale convolutional egy. Yang et al. (2016a) proposed CRAFT which also used features. Sliding windows extracted densely from multiscale a cascade strategy, first training an RPN network to generate convolutional feature maps were input to a scale-tolerant object proposals and then using them to train another binary attentional head module in order to predict segmentation Fast RCNN network to further distinguish objects from back- masks and objectness scores. FastMask is claimed to run at 13 FPS on 800 × 600 images. The concept of “anchor” first appeared in Ren et al. (2015). 123 International Journal of Computer Vision (2020) 128:261–318 297 Table 9 Summary of object proposal methods using DCNN. 
Bold values indicates the number of object proposals Proposer name Backbone Detector tested Recall@IoU (VOC07) Detection results (mAP) Published in Highlights network 0.50.70.9 VOC07 VOC12 COCO Bounding box object proposal methods MultiBox1 AlexNet RCNN −−− 29.0(10)−− CVPR14 Learns a class agnostic (Erhan et al. (12) regressor on a small 2014) set of 800 predefined anchor boxes. Do not share features for detection DeepBox (Kuo VGG16 Fast RCNN 0.96 (1000)0.84 (1000)0.15 (1000) −− 37.8(500) ICCV15 Use a lightweight CNN et al. 2015) ([email protected]) to learn to rerank proposals generated by EdgeBox. Can run at 0.26s per image. Do not share features for detection RPN (Ren et al. VGG16 Faster RCNN 0.97 (300) 0.79 (300) 0.04 (300) 73.2(300) 70.4(300) 21.9(300) NIPS15 The first to generate 2015, 2017) 0.98 (1000) 0.84 (1000) 0.04 (1000) (07+12) (07++12) object proposals by sharing full image convolutional features with detection. Most widely used object proposal method. Significant improvements in detection speed DeepProposal VGG16 Fast RCNN 0.74 (100) 0.58 (100) 0.12 (100) 53.2(100)−− ICCV15 Generate proposals (Ghodrati et al. 0.92 (1000) 0.80 (1000) 0.16 (1000) (07) inside a DCNN in a 2015) multiscale manner. Share features with the detection network CRAFT (Yang VGG16 Faster RCNN 0.98 (300)0.90 (300)0.13 (300)75.7 71.3 (12) − CVPR16 Introduced a et al. 2016a) (07+12) classification network (i.e. two class Fast RCNN) cascade that comes after the RPN. Not sharing features extracted for detection 298 International Journal of Computer Vision (2020) 128:261–318 Table 9 continued Proposer name Backbone Detector tested Recall@IoU (VOC07) Detection results (mAP) Published in Highlights network 0.50.70.9 VOC07 VOC12 COCO AZNet (Lu et al. VGG16 Fast RCNN 0.91 (300)0.71 (300)0.11 (300)70.4 (07) − 22.3 CVPR16 Use coarse-to-fine 2016) search: start from large regions, then recursively search for subregions that may contain objects. Adaptively guide computational resources to focus on likely subregions ZIP (Li et al. Inception v2 Faster RCNN 0.85 (300) 0.74 (300) 0.35 (300) 79.8 −− IJCV18 Generate proposals 2018a) COCO COCO COCO (07+12) using conv-deconv network with multilayers; Proposed a map attention decision (MAD) unit to assign the weights for features from different layers DeNet ResNet101 Fast RCNN 0.82 (300)0.74 (300)0.48 (300)77.1 73.9 (07++12) 33.8 ICCV17 A lot faster than Faster (TychsenSmith (07+12) RCNN; Introduces a and Petersson bounding box corner 2017) estimation for predicting object proposals efficiently to replace RPN; Does not require predefined anchors International Journal of Computer Vision (2020) 128:261–318 299 Table 9 continued Proposer name Backbone Detector tested Box proposals (AR, COCO) Segment proposals (AR, COCO) Published in Highlights network Segment proposal methods DeepMask VGG16 Fast RCNN 0.33 (100), 0.48 (1000)0.26 (100), 0.37 (1000) NIPS15 First to generate object (Pinheiro et al. mask proposals with 2015) DCNN; Slow inference time; Need segmentation annotations for training; Not sharing features with detection network; Achieved mAP of 69.9% (500) with Fast RCNN InstanceFCN VGG16 −− 0.32 (100), 0.39 (1000) ECCV16 Combines ideas of FCN (Daietal. (Long et al. 2015)and 2016a) DeepMask (Pinheiro et al. 2015). Introduces instance sensitive score maps. Needs segmentation annotations to train the network SharpMask MPN Fast RCNN 0.39 (100), 0.53 (1000)0.30 (100), 0.39 (1000) ECCV16 Leverages features at (Pinheiro et al. (Zagoruyko multiple convolutional 2016) et al. 
2016) layers by introducing a top-down refinement module. Does not share features with detection network. Needs segmentation annotations for training FastMask (Hu ResNet39 − 0.43 (100), 0.57 (1000)0.32 (100), 0.41 (1000) CVPR17 Generates instance et al. 2017) segment proposals efficiently in one-shot manner similar to SSD (Liu et al. 2016). Uses multiscale convolutional features. Uses segmentation annotations for training The detection results on COCO are based on mAP@IoU[0.5, 0.95], unless stated otherwise 300 International Journal of Computer Vision (2020) 128:261–318 but without reducing training samples; SNIPER allows for efficient multiscale training, only processing context regions around ground truth objects at the appropriate scale, instead of processing a whole image pyramid. Peng et al. (2018) studied a key factor in training, the minibatch size, and proposed MegDet, a Large MiniBatch Object Detector, to enable the training with a much larger minibatch size than before (from 16 to 256). To avoid the failure of convergence and significantly speed up the training process, Peng et al. (2018) proposed a learning rate policy and Cross GPU Batch Normalization, and effectively utilized 128 GPUs, allowing Fig. 19 Illustration of the region proposal network (RPN) introduced MegDet to finish COCO training in 4 hours on 128 GPUs, in Ren et al. (2015) and winning the COCO 2017 Detection Challenge. Reducing Localization Error In object detection, the Inter- 9 Other Issues section Over Union (IOU) between a detected bounding box and its ground truth box is the most popular evalua- Data Augmentation Performing data augmentation for learn- tion metric, and an IOU threshold (e.g. typical value of 0.5) is required to define positives and negatives. From Fig. 13, ing DCNNs (Chatfield et al. 2014; Girshick 2015; Girshick et al. 2014) is generally recognized to be important for visual in most state of the art detectors (Girshick 2015; Liu et al. 2016;Heetal. 2017; Ren et al. 2015; Redmon et al. 2016) recognition. Trivial data augmentation refers to perturbing an image by transformations that leave the underlying cate- object detection is formulated as a multitask learning prob- gory unchanged, such as cropping, flipping, rotating, scaling, lem, i.e., jointly optimizing a softmax classifier which assigns translating, color perturbations, and adding noise. By artifi- object proposals with class labels and bounding box regres- cially enlarging the number of samples, data augmentation sors, localizing objects by maximizing IOU or other metrics helps in reducing overfitting and improving generalization. between detection results and ground truth. Bounding boxes It can be used at training time, at test time, or both. Never- are only a crude approximation for articulated objects, con- sequently background pixels are almost invariably included theless, it has the obvious limitation that the time required for training increases significantly. Data augmentation may in a bounding box, which affects the accuracy of classifi- cation and localization. The study in Hoiem et al. (2012) synthesize completely new training images (Peng et al. 2015; Wang et al. 2017), however it is hard to guarantee that the syn- shows that object localization error is one of the most influ- thetic images generalize well to real ones. Some researchers ential forms of error, in addition to confusion between similar (Dwibedi et al. 2017; Gupta et al. 2016) proposed augment- objects. 
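Since IOU is the criterion that defines positives, negatives, and localization quality throughout this section, a minimal reference implementation for axis-aligned boxes is sketched below (the box format and names are illustrative).

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection is typically counted as correct only if its IoU with a ground
# truth box exceeds a threshold, e.g. iou(det, gt) >= 0.5.
```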
Localization error could stem from insufficient over- ing datasets by pasting real segmented objects into natural lap (smaller than the required IOU threshold, such as the images; indeed, Dvornik et al. (2018) showed that appro- green box in Fig. 20) or duplicate detections (i.e., multiple priately modeling the visual context surrounding objects is overlapping detections for an object instance). Usually, some crucial to place them in the right environment, and proposed post-processing step like NonMaximum Suppression (NMS) a context model to automatically find appropriate locations (Bodla et al. 2017; Hosang et al. 2017) is used for eliminat- ing duplicate detections. However, due to misalignments the on images to place new objects for data augmentation. Novel Training Strategies Detecting objects under a wide bounding box with better localization could be suppressed during NMS, leading to poorer localization quality (such as range of scale variations, especially the detection of very small objects, stands out as a key challenge. It has been shown the purple box shown in Fig. 20). Therefore, there are quite (Huang et al. 2017b; Liu et al. 2016) that image resolution a few methods aiming at improving detection performance has a considerable impact on detection accuracy, therefore by reducing localization error. scaling is particularly commonly used in data augmentation, MRCNN (Gidaris and Komodakis 2015) introduces iter- since higher resolutions increase the possibility of detecting ative bounding box regression, where an RCNN is applied small objects (Huang et al. 2017b). Recently, Singh et al. several times. CRAFT (Yang et al. 2016a) and AttractioNet proposed advanced and efficient data argumentation meth- (Gidaris and Komodakis 2016) use a multi-stage detection ods SNIP (Singh and Davis 2018) and SNIPER (Singh et al. sub-network to generate accurate proposals, to forward to Fast RCNN. Cai and Vasconcelos (2018) proposed Cas- 2018b) to 1 illustrate the scale invariance problem, as sum- marized in Table 10. Motivated by the intuitive understanding cade RCNN, a multistage extension of RCNN, in which a sequence of detectors is trained sequentially with increasing that small and large objects are difficult to detect at smaller and larger scales, respectively, SNIP introduces a novel train- ing scheme that can reduce scale variations during training, Please refer to Sect. 4.2 for more details on the definition of IOU. 123 International Journal of Computer Vision (2020) 128:261–318 301 Table 10 Representative methods for training strategies and class imbalance handling Detector name Region proposal Backbone DCNN Pipelined used VOC07 results VOC12 results COCO results Published in Highlights MegDet (Peng RPN ResNet50+FPN FasterRCNN −− 52.5 CVPR18 Allow training with et al. 2018) much larger minibatch size than before by introducing cross GPU batch normalization; Can finish the COCO training in 4 hours on 128 GPUs and achieved improved accuracy; Won COCO2017 detection challenge SNIP (Singh RPN DPN (Chen et al. RFCN −− 48.3 CVPR18 A new multiscale et al. 2018b) 2017b)+DCN training scheme. (Dai et al. Empirically examined 2017) the effect of up-sampling for small object detection. During training, only select objects that fit the scale of features as positive samples SNIPER (Singh RPN ResNet101+DCN Faster RCNN −− 47.6 2018 An efficient multiscale et al. 2018b) training strategy. 
Process context regions around ground-truth instances at the appropriate scale OHEM SS VGG16 Fast RCNN 78.9 (07+12) 76.3 (07++12) 22.4 CVPR16 A simple and effective (Shrivastava Online Hard Example et al. 2016) Mining algorithm to improve training of region based detectors 302 International Journal of Computer Vision (2020) 128:261–318 Table 10 continued Detector name Region proposal Backbone DCNN Pipelined used VOC07 results VOC12 results COCO results Published in Highlights FactorNet SS GooglNet RCNN −−− CVPR16 Identify the imbalance (Ouyang et al. in the number of 2016) samples for different object categories; propose a divide-and-conquer feature learning scheme Chained Cascade SS CRAFT VGG Fast RCNN, 80.4 (07+12) −− ICCV17 Jointly learn DCNN and (Cai and Inceptionv2 Faster RCNN (SS+VGG) multiple stages of Vasconcelos cascaded classifiers. 2018) Boost detection accuracy on PASCAL VOC 2007 and ImageNet for both fast RCNN and Faster RCNN using different region proposal methods Cascade RCNN RPN VGG ResNet101 Faster RCNN −− 42.8 CVPR18 Jointly learn DCNN and (Cai and +FPN multiple stages of Vasconcelos cascaded classifiers, 2018) which are learned using different localization accuracy for selecting positive samples. Stack bounding box regression at multiple stages RetinaNet (Lin − ResNet101 +FPN RetinaNet −− 39.1 ICCV17 Propose a novel Focal et al. 2017b) Loss which focuses training on hard examples. Handles well the problem of imbalance of positive and negative samples when training a one-stage detector Results on COCO are reported with Test Dev. The detection results on COCO are based on mAP@IoU[0.5, 0.95] International Journal of Computer Vision (2020) 128:261–318 303 proposed Focal Loss to address this by rectifying the Cross Entropy loss, such that it down-weights the loss assigned to correctly classified examples. Li et al. (2019a) studied this issue from the perspective of gradient norm distribution, and proposed a Gradient Harmonizing Mechanism (GHM) to handle it. 10 Discussion and Conclusion Generic object detection is an important and challenging problem in computer vision and has received considerable attention. Thanks to remarkable developments in deep learn- Fig. 20 Localization error could stem from insufficient overlap or duplicate detections. Localization error is a frequent cause of false pos- ing techniques, the field of object detection has dramatically itives (Color figure online) evolved. As a comprehensive survey on deep learning for generic object detection, this paper has highlighted the recent IOU thresholds, based on the observation that the output of a achievements, provided a structural taxonomy for methods according to their roles in detection, summarized existing detector trained with a certain IOU is a good distribution to train the detector of the next higher IOU threshold, in order to popular datasets and evaluation criteria, and discussed perfor- mance for the most representative methods. We conclude this be sequentially more selective against close false positives. This approach can be built with any RCNN-based detector, review with a discussion of the state of the art in Sect. 10.1, and is demonstrated to achieve consistent gains (about 2 to an overall discussion of key issues in Sect. 10.2, and finally 4 points) independent of the baseline detector strength, at a suggested future research directions in Sect. 10.3. marginal increase in computation. There is also recent work (Jiang et al. 2018; Rezatofighi et al. 2019; Huang et al. 
2019) 10.1 State of the Art Performance formulating IOU directly as the optimization objective, and in proposing improved NMS results (Bodla et al. 2017;He A large variety of detectors has appeared in the last few et al. 2019; Hosang et al. 2017; TychsenSmith and Petersson years, and the introduction of standard benchmarks, such as 2018), such as Soft NMS (Bodla et al. 2017) and learning PASCAL VOC (Everingham et al. 2010, 2015), ImageNet NMS (Hosang et al. 2017). (Russakovsky et al. 2015) and COCO (Lin et al. 2014), has Class Imbalance Handling Unlike image classification, made it easier to compare detectors. As can be seen from object detection has another unique problem: the serious our earlier discussion in Sects. 5–9, it may be misleading imbalance between the number of labeled object instances to compare detectors in terms of their originally reported and the number of background examples (image regions performance (e.g. accuracy, speed), as they can differ in not belonging to any object class of interest). Most back- fundamental / contextual respects, including the following ground examples are easy negatives, however this imbalance choices: can make the training very inefficient, and the large num- ber of easy negatives tends to overwhelm the training. In • Meta detection frameworks, such as RCNN (Girshick the past, this issue has typically been addressed via tech- et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN niques such as bootstrapping (Sung and Poggio 1994). More (Ren et al. 2015), RFCN (Dai et al. 2016c), Mask RCNN recently, this problem has also seen some attention (Li et al. (He et al. 2017), YOLO (Redmon et al. 2016) and SSD 2019a; Lin et al. 2017b; Shrivastava et al. 2016). Because (Liu et al. 2016); the region proposal stage rapidly filters out most background • Backbone networks such as VGG (Simonyan and Zis- regions and proposes a small number of object candidates, serman 2015), Inception (Szegedy et al. 2015; Ioffe and this class imbalance issue is mitigated to some extent in Szegedy 2015; Szegedy et al. 2016), ResNet (He et al. two-stage detectors (Girshick et al. 2014; Girshick 2015; 2016), ResNeXt (Xie et al. 2017), and Xception (Chollet Ren et al. 2015;Heetal. 2017), although example mining 2017) etc. listed in Table 6; approaches, such as Online Hard Example Mining (OHEM) • Innovations such as multilayer feature combination (Lin (Shrivastava et al. 2016), may be used to maintain a rea- et al. 2017a; Shrivastava et al. 2017;Fuetal. 2017), sonable balance between foreground and background. In the deformable convolutional networks (Dai et al. 2017), case of one-stage object detectors (Redmon et al. 2016;Liu deformable RoI pooling (Ouyang et al. 2015; Dai et al. et al. 2016), this imbalance is extremely serious (e.g. 100,000 2017), heavier heads (Ren et al. 2016; Peng et al. 2018), background examples to every object). Lin et al. (2017b) and lighter heads (Li et al. 2018c); 123 304 International Journal of Computer Vision (2020) 128:261–318 • Pretraining with datasets such as ImageNet (Russakovsky 10.2 Summary and Discussion et al. 2015), COCO (Lin et al. 2014), Places (Zhou et al. 2017a), JFT (Hinton et al. 2015) and Open Images With hundreds of references and many dozens of methods (Krasin et al. 2017); discussed throughout this paper, we would now like to focus • Different detection proposal methods and different num- on the key factors which have emerged in generic object bers of object proposals; detection based on deep learning. 
• Train/test data augmentation, novel multiscale training strategies (Singh and Davis 2018; Singh et al. 2018b) (1) Detection frameworks: two stage versus one stage etc, and model ensembling. In Sect. 5 we identified two major categories of detection Although it may be impractical to compare every recently frameworks: region based (two stage) and unified (one stage): proposed detector, it is nevertheless valuable to integrate representative and publicly available detectors into a com- • When large computational cost is allowed, two-stage mon platform and to compare them in a unified manner. detectors generally produce higher detection accuracies There has been very limited work in this regard, except for than one-stage, evidenced by the fact that most winning Huang’s study (Huang et al. 2017b) of the three main fam- approaches used in famous detection challenges like are ilies of detectors [Faster RCNN (Ren et al. 2015), RFCN predominantly based on two-stage frameworks, because (Dai et al. 2016c) and SSD (Liu et al. 2016)] by varying the their structure is more flexible and better suited for region backbone network, image resolution, and the number of box based classification. The most widely used frameworks are Faster RCNN (Ren et al. 2015), RFCN (Dai et al. proposals. As can be seen from Tables 7, 8, 9, 10, 11,wehavesum- 2016c) and Mask RCNN (He et al. 2017). marized the best reported performance of many methods on • It has been shown in Huang et al. (2017b) that the detec- three widely used standard benchmarks. The results of these tion accuracy of one-stage SSD (Liu et al. 2016)isless methods were reported on the same test benchmark, despite sensitive to the quality of the backbone network than rep- their differing in one or more of the aspects listed above. resentative two-stage frameworks. Figures 3 and 21 present a very brief overview of the state • One-stage detectors like YOLO (Redmon et al. 2016) and of the art, summarizing the best detection results of the PAS- SSD (Liu et al. 2016) are generally faster than two-stage CAL VOC, ILSVRC and MSCOCO challenges; more results ones, because of avoiding preprocessing algorithms, can be found at detection challenge websites (ILSVRC 2018; using lightweight backbone networks, performing pre- MS COCO 2018; PASCAL VOC 2018). The competition diction with fewer candidate regions, and making the winner of the open image challenge object detection task classification subnetwork fully convolutional. However, achieved 61.71% mAP in the public leader board and 58.66% two-stage detectors can run in real time with the intro- mAP on the private leader board, obtained by combining the duction of similar techniques. In any event, whether one detection results of several two-stage detectors including Fast stage or two, the most time consuming step is the feature RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), FPN extractor (backbone network) (Law and Deng 2018;Ren (Lin et al. 2017a), Deformable RCNN (Dai et al. 2017), and et al. 2015). Cascade RCNN (Cai and Vasconcelos 2018). In summary, the • It has been shown (Huang et al. 2017b; Redmon et al. backbone network, the detection framework, and the avail- 2016; Liu et al. 2016) that one-stage frameworks like ability of large scale datasets are the three most important YOLO and SSD typically have much poorer performance factors in detection accuracy. 
Ensembles of multiple models, when detecting small objects than two-stage architec- the incorporation of context features, and data augmentation tures like Faster RCNN and RFCN, but are competitive all help to achieve better accuracy. in detecting large objects. In less than 5 years, since AlexNet (Krizhevsky et al. 2012a) was proposed, the Top5 error on ImageNet classifica- There have been many attempts to build better (faster, more tion (Russakovsky et al. 2015) with 1000 classes has dropped accurate, or more robust) detectors by attacking each stage from 16% to 2%, as shown in Fig. 15. However, the mAP of of the detection framework. No matter whether one, two or the best performing detector (Peng et al. 2018)onCOCO multiple stages, the design of the detection framework has (Lin et al. 2014), trained to detect only 80 classes, is only converged towards a number of crucial design choices: at 73%, even at 0.5 IoU, illustrating how object detection is much harder than image classification. The accuracy and • A fully convolutional pipeline robustness achieved by the state-of-the-art detectors far from • Exploring complementary information from other corre- satisfies the requirements of real world applications, so there lated tasks, e.g., Mask RCNN (He et al. 2017) remains significant room for future improvement. • Sliding windows (Ren et al. 2015) 123 International Journal of Computer Vision (2020) 128:261–318 305 Table 11 Summary of properties and performance of milestone detection frameworks for generic object detection Detector name RP Backbone DCNN Input ImgSize VOC07 results VOC12 results Speed (FPS) Published in Source code Highlights and Disadvantages Region based (Sect. 5.1) RCNN (Girshick SS AlexNet Fixed 58.5 (07) 53.3 (12) < 0.1 CVPR14 Caffe Matlab Highlights: First to integrate CNN et al. 2014) with RP methods; Dramatic performance improvement over previous state of the artP Disadvantages: Multistage pipeline of sequentially-trained (External RP computation, CNN finetuning, each warped RP passing through CNN, SVM and BBR training); Training is expensive in space and time; Testing is slow SPPNet (He et al. SS ZFNet Arbitrary 60.9 (07) − < 1 ECCV14 Caffe Matlab Highlights: First to introduce SPP 2014) into CNN architecture; Enable convolutional feature sharing; Accelerate RCNN evaluation by orders of magnitude without sacrificing performance; Faster than OverFeat Disadvantages: Inherit disadvantages of RCNN; Does not result in much training speedup; Fine-tuning not able to update the CONV layers before SPP layer Fast RCNN SS AlexNet VGGM Arbitrary 70.0 (VGG) 68.4 (VGG) < 1 ICCV15 Caffe Python Highlights: First to enable (Girshick 2015) VGG16 (07+12) (07++12) end-to-end detector training (ignoring RP generation); Design a RoI pooling layer; Much faster and more accurate than SPPNet; No disk storage required for feature caching Disadvantages: External RP computation is exposed as the new bottleneck; Still too slow for real time applications 306 International Journal of Computer Vision (2020) 128:261–318 Table 11 continued Detector name RP Backbone DCNN Input ImgSize VOC07 results VOC12 results Speed (FPS) Published in Source code Highlights and Disadvantages Faster RCNN RPN ZFnet VGG Arbitrary 73.2 (VGG) 70.4 (VGG) < 5 NIPS15 Caffe Matlab Highlights: Propose RPN for (Ren et al. 
(07+12) (07++12) Python generating nearly cost-free and 2015) high quality RPs instead of selective search; Introduce translation invariant and multiscale anchor boxes as references in RPN; Unify RPN and Fast RCNN into a single network by sharing CONV layers; An order of magnitude faster than Fast RCNN without performance loss; Can run testing at 5 FPS with VGG16 Disadvantages: Training is complex, not a streamlined process; Still falls short of real time RCNNR(Lenc New ZFNet +SPP Arbitrary 59.7 (07) − <5BMVC15 − Highlights: Replace selective and Vedaldi search with static RPs; Prove the 2015) possibility of building integrated, simpler and faster detectors that rely exclusively on CNN Disadvantages: Falls short of real time; Decreased accuracy from poor RPs RFCN (Dai et al. RPN ResNet101 Arbitrary 80.5 (07+12) 77.6 (07++12) < 10 NIPS16 Caffe Matlab Highlights: Fully convolutional 2016c) 83.6 82.0 detection network; Design a set (07+12+CO) (07++12+CO) of position sensitive score maps using a bank of specialized CONV layers; Faster than Faster RCNN without sacrificing much accuracy Disadvantages: Training is not a streamlined process; Still falls short of real time International Journal of Computer Vision (2020) 128:261–318 307 Table 11 continued Detector name RP Backbone DCNN Input ImgSize VOC07 results VOC12 results Speed (FPS) Published in Source code Highlights and Disadvantages Mask RCNN (He RPN ResNet101 Arbitrary 50.3 (ResNeXt101) (COCO Result) < 5 ICCV17 Caffe Matlab Highlights: A simple, flexible, and et al. 2017) ResNeXt101 Python effective framework for object instance segmentation; Extends Faster RCNN by adding another branch for predicting an object mask in parallel with the existing branch for BB prediction; Feature Pyramid Network (FPN) is utilized; Outstanding performance Disadvantages: Falls short of real time applications Unified (Sect. 5.2) OverFeat − AlexNet like Arbitrary −− < 0.1 ICLR14 c++ Highlights: Convolutional feature (Sermanetetal. sharing; Multiscale image 2014) pyramid CNN feature extraction; Won the ISLVRC2013 localization competition; Significantly faster than RCNN Disadvantages: Multi-stage pipeline sequentially trained; Single bounding box regressor; Cannot handle multiple object instances of the same class; Too slow for real time applications YOLO (Redmon − GoogLeNet like Fixed 66.4 (07+12) 57.9 (07++12) < 25 (VGG) CVPR16 DarkNet Highlights: First efficient unified et al. 2016) detector; Drop RP process completely; Elegant and efficient detection framework; Significantly faster than previous detectors; YOLO runs at 45 FPS, Fast YOLO at 155 FPS; Disadvantages: Accuracy falls far behind state of the art detectors; Struggle to localize small objects 308 International Journal of Computer Vision (2020) 128:261–318 Table 11 continued Detector name RP Backbone DCNN Input ImgSize VOC07 results VOC12 results Speed (FPS) Published in Source code Highlights and Disadvantages YOLOv2 − DarkNet Fixed 78.6 (07+12) 73.5 (07++12) < 50 CVPR17 DarkNet Highlights: Propose a faster (Redmon and DarkNet19; Use a number of Farhadi 2017) existing strategies to improve both speed and accuracy; Achieve high accuracy and high speed; YOLO9000 can detect over 9000 object categories in real time Disadvantages: Not good at detecting small objects SSD (Liu et al. 
− VGG16 Fixed 76.8 (07+12) 74.9 (07++12) < 60 ECCV16 Caffe Python Highlights: First accurate and 2016) 81.5 80.0 efficient unified detector; (07+12+CO) (07++12+CO) Effectively combine ideas from RPN and YOLO to perform detection at multi-scale CONV layers; Faster and significantly more accurate than YOLO; Can run at 59 FPS; Disadvantages: Not good at detecting small objects See Sect. 5 for a detailed discussion. Some architectures are illustrated in Fig. 13. The properties of the backbone DCNNs can be found in Table 6 Training data: “07”←VOC2007 trainval; “07T”←VOC2007 trainval and test; “12”←VOC2012 trainval; “CO”←COCO trainval. The “Speed” column roughly estimates the detection speed with a single Nvidia Titan X GPU RP region proposal; SS selective search; RPN region proposal network; RC N N  R RCNN minus R and used a trivial RP method International Journal of Computer Vision (2020) 128:261–318 309 • Fusing information from different layers of the backbone. The evidence from recent success of cascade for object detec- tion (Cai and Vasconcelos 2018; Cheng et al. 2018a, b) and instance segmentation on COCO (Chen et al. 2019a) and other challenges has shown that multistage object detection could be a future framework for a speed-accuracy trade-off. A teaser investigation is being done in the 2019 WIDER Challenge (Loy et al. 2019). (2) Backbone networks As discussed in Sect. 6.1, backbone networks are one of the main driving forces behind the rapid improvement of detection performance, because of the key role played by dis- criminative object feature representation. Generally, deeper backbones such as ResNet (He et al. 2016), ResNeXt (Xie Fig. 21 Evolution of object detection performance on COCO (Test-Dev et al. 2017), InceptionResNet (Szegedy et al. 2017) perform results). Results are quoted from (Girshick 2015;Heetal. 2017;Ren better; however, they are computationally more expensive et al. 2017). The backbone network, the design of detection framework and the availability of good and large scale datasets are the three most and require much more data and massive computing for train- important factors in detection accuracy ing. Some backbones (Howard et al. 2017; Iandola et al. 2016; Zhang et al. 2018c) were proposed for focusing on speed • Using dilated convolutions (Li et al. 2018b, 2019b): A instead, such as MobileNet (Howard et al. 2017) which has simple and effective method to incorporate broader con- been shown to achieve VGGNet16 accuracy on ImageNet text and maintain high resolution feature maps. with only the computational cost and model size. Back- • Using anchor boxes of different scales and aspect ratios: bone training from scratch may become possible as more Drawbacks of having many parameters, and scales and training data and better training strategies are available (Wu aspect ratios of anchor boxes are usually heuristically and He 2018; Luo et al. 2018, 2019). determined. • Up-scaling: Particularly for the detection of small objects, (3) Improving the robustness of object representation high-resolution networks (Sun et al. 2019a, b) can be developed. It remains unclear whether super-resolution The variation of real world images is a key challenge in object techniques improve detection accuracy or not. recognition. The variations include lighting, pose, deforma- tions, background clutter, occlusions, blur, resolution, noise, Despite recent advances, the detection accuracy for small and camera distortions. objects is still much lower than that of larger ones. 
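As a simple illustration of the image-pyramid strategy listed above, the sketch below runs a detector over several rescaled copies of an image, maps the boxes back to the original resolution, and merges the results with standard NMS; the detector callable, the scale set, and the IoU threshold are assumptions for illustration.

```python
import torch
from torchvision.ops import nms

def detect_over_pyramid(image, detector, scales=(0.5, 1.0, 2.0), iou_thresh=0.5):
    """Run `detector` on rescaled copies of `image` and merge the detections.

    image:    (3, H, W) tensor
    detector: callable returning (boxes[N, 4] in the resized image, scores[N])
    Returns merged (boxes, scores) in original image coordinates.
    """
    all_boxes, all_scores = [], []
    for s in scales:
        h, w = int(image.shape[1] * s), int(image.shape[2] * s)
        resized = torch.nn.functional.interpolate(
            image[None], size=(h, w), mode="bilinear", align_corners=False)[0]
        boxes, scores = detector(resized)
        all_boxes.append(boxes / s)          # map boxes back to the original resolution
        all_scores.append(scores)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thresh)    # suppress duplicate detections across scales
    return boxes[keep], scores[keep]
```

In practice NMS is applied per class, and score-decaying variants such as Soft-NMS can be substituted for the hard suppression used here.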
Therefore, the detection of small objects remains one of the key challenges in object detection. Perhaps localization requirements need to be generalized as a function of scale, since certain applications, e.g. autonomous driving, only require the identification of the existence of small objects within a larger region, and exact localization is not necessary.

(3.2) Deformation, occlusion, and other factors

As discussed in Sect. 2.2, there are approaches to handling geometric transformation, occlusions, and deformation, mainly based on two paradigms. The first is a spatial transformer network, which uses regression to obtain a deformation field and then warps features according to the deformation field (Dai et al. 2017). The second is based on a deformable part-based model (Felzenszwalb et al. 2010b), which finds the maximum response to a part filter with spatial constraints taken into consideration (Ouyang et al. 2015; Girshick et al. 2015; Wan et al. 2015).

Rotation invariance may be attractive in certain applications, but there is limited generic object detection work focusing on rotation invariance, because popular benchmark detection datasets (PASCAL VOC, ImageNet, COCO) do not have large variations in rotation. Occlusion handling is intensively studied in face detection and pedestrian detection, but very little work has been devoted to occlusion handling for generic object detection. In general, despite recent advances, deep networks are still limited by the lack of robustness to a number of variations, which significantly constrains their real-world applications.

(4) Context reasoning

As introduced in Sect. 7, objects in the wild typically coexist with other objects and environments. It has been recognized that contextual information (object relations, global scene statistics) helps object detection and recognition (Oliva and Torralba 2007), especially for small objects, occluded objects, and with poor image quality. There was extensive work preceding deep learning (Malisiewicz and Efros 2009; Murphy et al. 2003; Rabinovich et al. 2007; Divvala et al. 2009; Galleguillos and Belongie 2010), and also quite a few works in the era of deep learning (Gidaris and Komodakis 2015; Zeng et al. 2016, 2017; Chen and Gupta 2017; Hu et al. 2018a). How to efficiently and effectively incorporate contextual information remains to be explored, possibly guided by how human vision uses context, based on scene graphs (Li et al. 2017d), or via the full segmentation of objects and scenes using panoptic segmentation (Kirillov et al. 2018).

(5) Detection proposals

Detection proposals significantly reduce search spaces. As recommended in Hosang et al. (2016), future detection proposals will surely have to improve in repeatability, recall, localization accuracy, and speed. Since the success of RPN (Ren et al. 2015), which integrated proposal generation and detection into a common framework, CNN based detection proposal generation methods have dominated region proposal. It is recommended that new detection proposals should be assessed for object detection, instead of evaluating detection proposals alone.

(6) Other factors

As discussed in Sect. 9, there are many other factors affecting object detection quality: data augmentation, novel training strategies, combinations of backbone models, multiple detection frameworks, incorporating information from other related tasks, methods for reducing localization error, handling the huge imbalance between positive and negative samples, mining of hard negative samples, and improving loss functions.

10.3 Research Directions

Despite the recent tremendous progress in the field of object detection, the technology remains significantly more primitive than human vision and cannot yet satisfactorily address real-world challenges like those of Sect. 2.2. We see a number of long-standing challenges:

• Working in an open world: being robust to any number of environmental changes, being able to evolve or adapt.
• Object detection under constrained conditions: learning from weakly labeled data or few bounding box annotations, wearable devices, unseen object categories etc.
• Object detection in other modalities: video, RGBD images, 3D point clouds, lidar, remotely sensed imagery etc.

Based on these challenges, we see the following directions of future research:

(1) Open World Learning The ultimate goal is to develop object detection capable of accurately and efficiently recognizing and localizing instances in thousands or more object categories in open-world scenes, at a level competitive with the human visual system. Object detection algorithms are unable, in general, to recognize object categories outside of their training dataset, although ideally there should be the ability to recognize novel object categories (Lake et al. 2015; Hariharan and Girshick 2017). Current detection datasets (Everingham et al. 2010; Russakovsky et al. 2015; Lin et al. 2014) contain only a few dozen to hundreds of categories, significantly fewer than those which can be recognized by humans. New larger-scale datasets (Hoffman et al. 2014; Singh et al. 2018a; Redmon and Farhadi 2017) with significantly more categories will need to be developed.

(2) Better and More Efficient Detection Frameworks One of the reasons for the success in generic object detection has been the development of superior detection frameworks, both region-based [RCNN (Girshick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), Mask RCNN (He et al. 2017)] and one-stage detectors [YOLO (Redmon et al. 2016), SSD (Liu et al. 2016)]. Region-based detectors have higher accuracy, one-stage detectors are generally faster and simpler. Object detectors depend heavily on the underlying backbone networks, which have been optimized for image classification, possibly causing a learning bias; learning object detectors from scratch could be helpful for new detection frameworks.
123 International Journal of Computer Vision (2020) 128:261–318 311 (3) Compact and Efficient CNN Features CNNs have 2019). Even more constrained, zero shot object detection increased remarkably in depth, from several layers [AlexNet localizes and recognizes object classes that have never been (Krizhevsky et al. 2012b)] to hundreds of layers [ResNet seen before (Bansal et al. 2018; Demirel et al. 2018; Rah- (He et al. 2016), DenseNet (Huang et al. 2017a)]. These man et al. 2018b, a), essential for life-long learning machines networks have millions to hundreds of millions of param- that need to intelligently and incrementally discover new eters, requiring massive data and GPUs for training. In order object categories. reduce or remove network redundancy, there has been grow- (8) Object Detection in Other Modalities Most detectors ing research interest in designing compact and lightweight are based on still 2D images; object detection in other modal- networks (Chen et al. 2017a; Alvarez and Salzmann 2016; ities can be highly relevant in domains such as autonomous Huang et al. 2018;Howardetal. 2017; Lin et al. 2017c;Yu vehicles, unmanned aerial vehicles, and robotics. These et al. 2018) and network acceleration (Cheng et al. 2018c; modalities raise new challenges in effectively using depth Hubara et al. 2016; Han et al. 2016;Lietal. 2017a, c;Wei (Chen et al. 2015c; Pepik et al. 2015; Xiang et al. 2014;Wu et al. 2018). et al. 2015), video (Feichtenhofer et al. 2017; Kang et al. (4) Automatic Neural Architecture Search Deep learning 2016), and point clouds (Qi et al. 2017, 2018). bypasses manual feature engineering which requires human (9) Universal Object Detection: Recently, there has been experts with strong domain knowledge, however DCNNs increasing effort in learning universal representations, those require similarly significant expertise. It is natural to con- which are effective in multiple image domains, such as nat- sider automated design of detection backbone architectures, ural images, videos, aerial images, and medical CT images such as the recent Automated Machine Learning (AutoML) (Rebuffi et al. 2017, 2018). Most such research focuses on (Quanming et al. 2018), which has been applied to image image classification, rarely targeting object detection (Wang classification and object detection (Cai et al. 2018; Chen et al. et al. 2019), and developed detectors are usually domain spe- 2019c;Ghiasietal. 2019; Liu et al. 2018a; Zoph and Le 2016; cific. Object detection independent of image domain and Zoph et al. 2018). cross-domain object detection represent important future (5) Object Instance Segmentation For a richer and more directions. detailed understanding of image content, there is a need to The research field of generic object detection is still far tackle pixel-level object instance segmentation (Lin et al. from complete. However given the breakthroughs over the 2014;Heetal. 2017;Huetal. 2018c), which can play an past 5 years we are optimistic of future developments and important role in potential applications that require the pre- opportunities. cise boundaries of individual objects. Acknowledgements Open access funding provided by University of (6) Weakly Supervised Detection Current state-of-the- Oulu including Oulu University Hospital. The authors would like to art detectors employ fully supervised models learned from thank the pioneering researchers in generic object detection and other labeled data with object bounding boxes or segmentation related fields. 
The authors would also like to express their sincere appre- masks (Everingham et al. 2015; Lin et al. 2014; Russakovsky ciation to Professor Jiˇrí Matas, the associate editor and the anonymous reviewers for their comments and suggestions. This work has been sup- et al. 2015; Lin et al. 2014). However, fully supervised learn- ported by the Center for Machine Vision and Signal Analysis at the ing has serious limitations, particularly where the collection University of Oulu (Finland) and the National Natural Science Foun- of bounding box annotations is labor intensive and where the dation of China under Grant 61872379. number of images is large. Fully supervised learning is not Open Access This article is distributed under the terms of the Creative scalable in the absence of fully labeled training data, so it Commons Attribution 4.0 International License (http://creativecomm is essential to understand how the power of CNNs can be ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, leveraged where only weakly / partially annotated data are and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative provided (Bilen and Vedaldi 2016; Diba et al. 2017; Shi et al. Commons license, and indicate if changes were made. 2017). (7) Few / Zero Shot Object Detection The success of deep detectors relies heavily on gargantuan amounts of annotated References training data. When the labeled data are scarce, the perfor- mance of deep detectors frequently deteriorates and fails Agrawal, P., Girshick, R., & Malik, J. (2014). Analyzing the perfor- to generalize well. In contrast, humans (even children) can mance of multilayer neural networks for object recognition. In learn a visual concept quickly from very few given exam- ECCV (pp. 329–344). ples and can often generalize well (Biederman 1987b;Lake Alexe, B., Deselaers, T., & Ferrari, V. (2010). What is an object? In CVPR (pp. 73–80). et al. 2015; FeiFei et al. 2006). Therefore, the ability to learn from only few examples, few shot detection, is very appealing (Chen et al. 2018a; Dong et al. 2018; Finn et al. 2017; Kang Although side information may be provided, such as a wikipedia et al. 2018;Lakeetal. 2015; Ren et al. 2018; Schwartz et al. page or an attributes vector. 123 312 International Journal of Computer Vision (2020) 128:261–318 Alexe, B., Deselaers, T., & Ferrari, V. (2012). Measuring the objectness Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, of image windows. IEEE TPAMI, 34(11), 2189–2202. Z., Shi, J., Ouyang, W., et al. (2019a). Hybrid task cascade for Alvarez, J., & Salzmann, M. (2016). Learning the number of neurons instance segmentation. In CVPR. in deep networks. In NIPS (pp. 2270–2278). Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. Andreopoulos, A., & Tsotsos, J. (2013). 50 years of object recognition: (2015a), Semantic image segmentation with deep convolutional Directions forward. Computer Vision and Image Understanding, nets and fully connected CRFs. In ICLR. 117(8), 827–891. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. Arbeláez, P., Hariharan, B., Gu, C., Gupta, S., Bourdev, L., & Malik, J. (2018b). DeepLab: Semantic image segmentation with deep con- (2012). Semantic segmentation using regions and parts. In CVPR volutional nets, atrous convolution, and fully connected CRFs. (pp. 3378–3385). IEEE TPAMI, 40(4), 834–848. 
Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Chen, Q., Song, Z., Dong, J., Huang, Z., Hua, Y., & Yan, S. (2015b). Multiscale combinatorial grouping. In CVPR (pp. 328–335). Contextualizing object detection and classification. IEEE TPAMI, Azizpour, H., Razavian, A., Sullivan, J., Maki, A., & Carlsson, S. 37(1), 13–27. (2016). Factors of transferability for a generic convnet represen- Chen, X., & Gupta, A. (2017). Spatial memory for context reasoning in tation. IEEE TPAMI, 38(9), 1790–1802. object detection. In ICCV. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., & Divakaran, A. Chen, X., Kundu, K., Zhu, Y., Berneshawi, A. G., Ma, H., Fidler, S., & (2018). Zero shot object detection. In ECCV. Urtasun, R. (2015c) 3d object proposals for accurate object class Bar, M. (2004). Visual objects in context. Nature Reviews Neuroscience, detection. In NIPS (pp. 424–432). 5(8), 617–629. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., & Feng J. (2017b). Dual path Bell, S., Lawrence, Z., Bala, K., & Girshick, R. (2016). Inside outside networks. In NIPS (pp. 4467–4475). net: Detecting objects in context with skip pooling and recurrent Chen, Y., Rohrbach, M., Yan, Z., Yan, S., Feng, J., & Kalantidis, Y. neural networks. In CVPR (pp. 2874–2883). (2019b), Graph based global reasoning networks. In CVPR. Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object Chen, Y., Yang, T., Zhang, X., Meng, G., Pan, C., & Sun, J. recognition using shape contexts. IEEE TPAMI, 24(4), 509–522. (2019c). DetNAS: Neural architecture search on object detection. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: arXiv:1903.10979. A review and new perspectives. IEEE TPAMI, 35(8), 1798–1828. Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., & Huang, T. (2018a). Biederman, I. (1972). Perceiving real world scenes. IJCV, 177(7), 77– Decoupled classification refinement: Hard false positive suppres- 80. sion for object detection. arXiv:1810.04002. Biederman, I. (1987a). Recognition by components: A theory of human Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., & Huang, T. (2018b). image understanding. Psychological Review, 94(2), 115. Revisiting RCNN: On awakening the classification power of faster Biederman, I. (1987b). Recognition by components: A theory of human RCNN. In ECCV. image understanding. Psychological Review, 94(2), 115. Cheng, G., Zhou, P., & Han, J. (2016). RIFDCNN: Rotation invariant Bilen, H., & Vedaldi, A. (2016). Weakly supervised deep detection and fisher discriminative convolutional neural networks for object networks. In CVPR (pp. 2846–2854). detection. In CVPR (pp. 2884–2893). Bodla, N., Singh, B., Chellappa, R., & Davis L. S. (2017). SoftNMS Cheng, M., Zhang, Z., Lin, W., & Torr, P. (2014). BING: Binarized improving object detection with one line of code. In ICCV (pp. normed gradients for objectness estimation at 300fps. In CVPR 5562–5570). (pp. 3286–3293). Borji, A., Cheng, M., Jiang, H., & Li, J. (2014). Salient object detection: Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2018c). Model compres- A survey, 1, 1–26. arXiv:1411.5878v1. sion and acceleration for deep neural networks: The principles, Bourdev, L., & Brandt, J. (2005). Robust object detection via soft cas- progress, and challenges. IEEE Signal Processing Magazine, cade. CVPR, 2, 236–243. 35(1), 126–136. Bruna, J., & Mallat, S. (2013). Invariant scattering convolution net- Chollet, F. (2017). Xception: Deep learning with depthwise separable works. IEEE TPAMI, 35(8), 1872–1886. 
convolutions. In CVPR (pp. 1800–1807). Cai, Z., & Vasconcelos, N. (2018). Cascade RCNN: Delving into high Cinbis, R., Verbeek, J., & Schmid, C. (2017). Weakly supervised quality object detection. In CVPR. object localization with multi-fold multiple instance learning. Cai, Z., Fan, Q., Feris, R., & Vasconcelos, N. (2016). A unified multi- IEEE TPAMI, 39(1), 189–203. scale deep convolutional neural network for fast object detection. Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). In ECCV (pp. 354–370). Visual categorization with bags of keypoints. In ECCV Workshop Cai, H., Yang, J., Zhang, W., Han, S., & Yu, Y. et al. (2018) Path-level on statistical learning in computer vision. network transformation for efficient architecture search. In ICML. Dai, J., He, K., Li, Y., Ren, S., & Sun, J. (2016a). Instance sensitive Carreira, J., & Sminchisescu, C. (2012). CMPC: Automatic object seg- fully convolutional networks. In ECCV (pp. 534–549). mentation using constrained parametric mincuts. IEEE TPAMI, Dai, J., He, K., & Sun J. (2016b). Instance aware semantic segmentation 34(7), 1312–1328. via multitask network cascades. In CVPR (pp. 3150–3158). Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Dai, J., Li, Y., He, K., & Sun, J. (2016c). RFCN: Object detection via Return of the devil in the details: Delving deep into convolutional region based fully convolutional networks. In NIPS (pp. 379–387). nets. In BMVC. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Chavali, N., Agrawal, H., Mahendru, A., & Batra, D. (2016). Object Deformable convolutional networks. In ICCV. proposal evaluation protocol is gameable. In CVPR (pp. 835–844). Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for Chellappa, R. (2016). The changing fortunes of pattern recognition and human detection. CVPR, 1, 886–893. computer vision. Image and Vision Computing, 55, 3–5. Demirel, B., Cinbis, R. G., & Ikizler-Cinbis, N. (2018). Zero shot object Chen, G., Choi, W., Yu, X., Han, T., & Chandraker M. (2017a). Learning detection by hybrid region embedding. In BMVC. efficient object detection models with knowledge distillation. In Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). ImageNet: NIPS. A large scale hierarchical image database. In CVPR (pp. 248–255). Chen, H., Wang, Y., Wang, G., & Qiao, Y. (2018a). LSTD: A low shot Diba, A., Sharma, V., Pazandeh, A. M., Pirsiavash, H., & Van Gool L. transfer detector for object detection. In AAAI. (2017). Weakly supervised cascaded convolutional networks. In CVPR (Vol.3,p.9). 123 International Journal of Computer Vision (2020) 128:261–318 313 Dickinson, S., Leonardis, A., Schiele, B., & Tarr, M. (2009). The evolu- Ghiasi, G., Lin, T., Pang, R., & Le, Q. (2019). NASFPN: Learn- tion of object categorization and the challenge of image abstraction ing scalable feature pyramid architecture for object detection. in object categorization: Computer and human vision perspectives. arXiv:1904.07392. Cambridge: Cambridge University Press. Ghodrati, A., Diba, A., Pedersoli, M., Tuytelaars, T., & Van Gool, L. Ding, J., Xue, N., Long, Y., Xia, G., & Lu, Q. (2018). Learning RoI trans- (2015). DeepProposal: Hunting objects by cascading deep convo- former for detecting oriented objects in aerial images. In CVPR. lutional layers. In ICCV (pp. 2578–2586). Divvala, S., Hoiem, D., Hays, J., Efros, A., & Hebert, M. (2009). An Gidaris, S., & Komodakis, N. (2015). 
Object detection via a multiregion empirical study of context in object detection. In CVPR (pp. 1271– and semantic segmentation aware CNN model. In ICCV (pp. 1134– 1278). 1142). Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detec- Gidaris, S., & Komodakis, N. (2016). Attend refine repeat: Active box tion: An evaluation of the state of the art. IEEE TPAMI, 34(4), proposal generation via in out localization. In BMVC. 743–761. Girshick, R. (2015). Fast R-CNN. In ICCV (pp. 1440–1448). Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature et al. (2014). DeCAF: A deep convolutional activation feature for hierarchies for accurate object detection and semantic segmenta- generic visual recognition. ICML, 32, 647–655. tion. In CVPR (pp. 580–587). Dong, X., Zheng, L., Ma, F., Yang, Y., & Meng, D. (2018). Few-example Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2016). Region-based object detection with model communication. IEEE Transactions convolutional networks for accurate object detection and segmen- on Pattern Analysis and Machine Intelligence, 41(7), 1641–1654. tation. IEEE TPAMI, 38(1), 142–158. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). Cen- Girshick, R., Iandola, F., Darrell, T., & Malik, J. (2015). Deformable terNet: Keypoint triplets for object detection. arXiv:1904.08189. part models are convolutional neural networks. In CVPR (pp. 437– Dvornik, N., Mairal, J., & Schmid, C. (2018). Modeling visual context 446). is key to augmenting object detection datasets. In ECCV (pp. 364– Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. 380). Cambridge: MIT press. Dwibedi, D., Misra, I., & Hebert, M. (2017). Cut, paste and learn: Goodfellow, I., Shlens, J., & Szegedy, C. (2015). Explaining and har- Surprisingly easy synthesis for instance detection. In ICCV (pp. nessing adversarial examples. In ICLR. 1301–1310). Grauman, K., & Darrell, T. (2005). The pyramid match kernel: Dis- Endres, I., & Hoiem, D. (2010). Category independent object propos- criminative classification with sets of image features. ICCV, 2, als. In K. Daniilidis, P. Maragos, & N. Paragios (Eds.), European 1458–1465. Conference on Computer Vision (pp. 575–588). Berlin: Springer. Grauman, K., & Leibe, B. (2011). Visual object recognition. Synthesis Enzweiler, M., & Gavrila, D. M. (2009). Monocular pedestrian detec- Lectures on Artificial Intelligence and Machine Learning, 5(2), tion: Survey and experiments. IEEE TPAMI, 31(12), 2179–2195. 1–181. Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. (2014). Scalable Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., et al. object detection using deep neural networks. In CVPR (pp. 2147– (2018). Recent advances in convolutional neural networks. Pattern 2154). Recognition, 77, 354–377. Everingham, M., Eslami, S., Gool, L. V., Williams, C., Winn, J., & Guillaumin, M., Küttel, D., & Ferrari, V. (2014). Imagenet autoan- Zisserman, A. (2015). The pascal visual object classes challenge: notation with segmentation propagation. International Journal of A retrospective. IJCV, 111(1), 98–136. Computer Vision, 110(3), 328–348. Everingham, M., Gool, L. V., Williams, C., Winn, J., & Zisserman, A. Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text (2010). The pascal visual object classes (voc) challenge. IJCV, localisation in natural images. In CVPR (pp. 2315–2324). 88(2), 303–338. Han, S., Dally, W. J., & Mao, H. 
(2016). Deep Compression: Compress- Feichtenhofer, C., Pinz, A., & Zisserman, A. (2017). Detect to track and ing deep neural networks with pruning, trained quantization and track to detect. In ICCV (pp. 918–927). huffman coding. In ICLR. FeiFei, L., Fergus, R., & Perona, P. (2006). One shot learning of object Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2014). Simulta- categories. IEEE TPAMI, 28(4), 594–611. neous detection and segmentation. In ECCV (pp. 297–312). Felzenszwalb, P., Girshick, R., & McAllester, D. (2010a). Cascade Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2016). Object object detection with deformable part models. In CVPR (pp. 2241– instance segmentation and fine-grained localization using hyper- 2248). columns. IEEE Transactions on Pattern Analysis and Machine Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010b). Intelligence, 39(4), 627–639. Object detection with discriminatively trained part based models. Hariharan, B., & Girshick R. B. (2017). Low shot visual recognition by IEEE TPAMI, 32(9), 1627–1645. shrinking and hallucinating features. In ICCV (pp. 3037–3046). Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discrimi- Harzallah, H., Jurie, F., & Schmid, C. (2009). Combining efficient object natively trained, multiscale, deformable part model. In CVPR (pp. localization and image classification. In ICCV (pp. 237–244). 1–8). He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask RCNN. Finn, C., Abbeel, P., & Levine, S. (2017). Model agnostic meta learning In ICCV. for fast adaptation of deep networks. In ICML (pp. 1126–1135). He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling Fischler, M., & Elschlager, R. (1973). The representation and matching in deep convolutional networks for visual recognition. In ECCV of pictorial structures. IEEE Transactions on Computers, 100(1), (pp. 346–361). 67–92. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Surpassing human-level performance on ImageNet classification. Deconvolutional single shot detector. arXiv:1701.06659. In ICCV (pp. 1026–1034). Galleguillos, C., & Belongie, S. (2010). Context based object catego- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for rization: A critical survey. Computer Vision and Image Under- image recognition. In CVPR (pp. 770–778). standing, 114, 712–722. He, T., Tian, Z., Huang, W., Shen, C., Qiao, Y., & Sun, C. (2018). An end Geronimo, D., Lopez, A. M., Sappa, A. D., & Graf, T. (2010). Survey of to end textspotter with explicit alignment and attention. In CVPR pedestrian detection for advanced driver assistance systems. IEEE (pp. 5020–5029). TPAMI, 32(7), 1239–1258. 123 314 International Journal of Computer Vision (2020) 128:261–318 He, Y., Zhu, C., Wang, J., Savvides, M., & Zhang, X. (2019). Bounding Kang, K., Ouyang, W., Li, H., & Wang, X. (2016). Object detection box regression with uncertainty for accurate object detection. In from video tubelets with convolutional neural networks. In CVPR CVPR. (pp. 817–825). Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality Kim, A., Sharma, A., & Jacobs, D. (2014). Locally scale invariant con- of data with neural networks. Science, 313(5786), 504–507. volutional neural networks. In NIPS. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in Kim, K., Hong, S., Roh, B., Cheon, Y., & Park, M. (2016). PVANet: a neural network. 
arXiv:1503.02531. Deep but lightweight neural networks for real time object detec- Hoffman, J., Guadarrama, S., Tzeng, E. S., Hu, R., Donahue, J., Gir- tion. In NIPSW. shick, R., Darrell, T., & Saenko, K. (2014). LSDA: Large scale Kim, Y, Kang, B.-N., & Kim, D. (2018). SAN: Learning relationship detection through adaptation. In NIPS (pp. 3536–3544). between convolutional features for multiscale object detection. In Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in ECCV (pp. 316–331). object detectors. In ECCV (pp. 340–353). Kirillov, A., He, K., Girshick, R., Rother, C., & Dollár, P. (2018). Panop- Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes tic segmentation. arXiv:1801.00868. for effective detection proposals? IEEE TPAMI, 38(4), 814–829. Kong, T., Sun, F., Tan, C., Liu, H., & Huang, W. (2018). Deep feature Hosang, J., Benenson, R., & Schiele, B. (2017). Learning nonmaximum pyramid reconfiguration for object detection. In ECCV (pp. 169– suppression. In ICCV. 185). Hosang, J., Omran, M., Benenson, R., & Schiele, B. (2015). Taking a Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., & Chen, Y. (2017). RON: deeper look at pedestrians. In Proceedings of the IEEE conference Reverse connection with objectness prior networks for object on computer vision and pattern recognition (pp. 4073–4082). detection. In CVPR. Howard, A., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, Kong, T., Yao, A., Chen, Y., & Sun, F. (2016). HyperNet: Towards T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient accurate region proposal generation and joint object detection. In convolutional neural networks for mobile vision applications. In CVPR (pp. 845–853). CVPR. Krähenbühl, P., & Koltun, V. (2014), Geodesic object proposals. In Hu, H., Gu, J., Zhang, Z., Dai, J., & Wei, Y. (2018a). Relation networks ECCV. for object detection. In CVPR. Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., AbuElHaija, S., Hu, H., Lan, S., Jiang, Y., Cao, Z., & Sha, F. (2017). FastMask: Segment Kuznetsova, A., et al. (2017). OpenImages: A public dataset for multiscale object candidates in one shot. In CVPR (pp. 991–999). large scale multilabel and multiclass image classification. Dataset Hu, J., Shen, L., & Sun, G. (2018b). Squeeze and excitation networks. available from https://storage.googleapis.com/openimages/web/ In CVPR. index.html. Hu, P., & Ramanan, D. (2017). Finding tiny faces. In CVPR (pp. 1522– Krizhevsky, A., Sutskever, I., & Hinton, G. (2012a). ImageNet clas- 1530). sification with deep convolutional neural networks. In NIPS (pp. Hu, R., Dollár, P., He, K., Darrell, T., & Girshick, R. (2018c). Learning 1097–1105). to segment every thing. In CVPR. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012b). ImageNet clas- Huang, G., Liu, S., van der Maaten, L., & Weinberger, K. (2018). Con- sification with deep convolutional neural networks. In NIPS (pp. denseNet: An efficient densenet using learned group convolutions. 1097–1105). In CVPR. Kuo, W., Hariharan, B., & Malik, J. (2015). DeepBox: Learning object- Huang, G., Liu, Z., Weinberger, K. Q., & van der Maaten, L. (2017a). ness with convolutional networks. In ICCV (pp. 2479–2487). Densely connected convolutional networks. In CVPR. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., PontTuset, Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., J., et al. (2018). 
The open images dataset v4: Unified image classi- Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., & Murphy, fication, object detection, and visual relationship detection at scale. K. (2017b). Speed/accuracy trade offs for modern convolutional arXiv:1811.00982. object detectors. In CVPR. Lake, B., Salakhutdinov, R., & Tenenbaum, J. (2015). Human level Huang, Z., Huang, L., Gong, Y., Huang, C., & Wang, X. (2019). Mask concept learning through probabilistic program induction. Science, scoring rcnn. In CVPR. 350(6266), 1332–1338. Hubara, I., Courbariaux, M., Soudry, D., ElYaniv, R., & Bengio, Y. Lampert, C. H., Blaschko, M. B., & Hofmann, T. (2008). Beyond sliding (2016). Binarized neural networks. In NIPS (pp. 4107–4115). windows: Object localization by efficient subwindow search. In Iandola, F., Han, S., Moskewicz, M., Ashraf, K., Dally, W., & Keutzer, CVPR (pp. 1–8). K. (2016). SqueezeNet: Alexnet level accuracy with 50x fewer Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired parameters and 0.5 mb model size. arXiv:1602.07360. keypoints. In ECCV. ILSVRC detection challenge results. (2018). http://www.image-net. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: org/challenges/LSVRC/. Spatial pyramid matching for recognizing natural scene categories. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep CVPR, 2, 2169–2178. network training by reducing internal covariate shift. In Interna- LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, tional conference on machine learning (pp. 448–456). 521, 436–444. Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial trans- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient former networks. In NIPS (pp. 2017–2025). based learning applied to document recognition. Proceedings of Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, the IEEE, 86(11), 2278–2324. R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional Lee, C., Xie, S., Gallagher, P., Zhang, Z., & Tu, Z. (2015). Deeply architecture for fast feature embedding. In ACM MM (pp. 675– supervised nets. In Artificial intelligence and statistics (pp. 562– 678). 570). Jiang, B., Luo, R., Mao, J., Xiao, T., & Jiang, Y. (2018). Acquisition Lenc, K., & Vedaldi, A. (2015). R-CNN minus R. In BMVC15. of localization confidence for accurate object detection. In ECCV Lenc, K., & Vedaldi, A. (2018). Understanding image representations (pp. 784–799). by measuring their equivariance and equivalence. In IJCV. Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., & Darrell, T. (2018). Few Li, B., Liu, Y., & Wang, X. (2019a). Gradient harmonized single stage shot object detection via feature reweighting. arXiv:1812.01866. detector. In AAAI. 123 International Journal of Computer Vision (2020) 128:261–318 315 Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2017a). Loy, C., Lin, D., Ouyang, W., Xiong, Y., Yang, S., Huang, Q., et al. Pruning filters for efficient convnets. In ICLR. (2019). WIDER face and pedestrian challenge 2018: Methods and Li, H., Lin, Z., Shen, X., Brandt, J., & Hua, G. (2015a). A convolutional results. arXiv:1902.06854. neural network cascade for face detection. In CVPR (pp. 5325– Lu, Y., Javidi, T., & Lazebnik, S. (2016). Adaptive object detection 5334). using adjacency and zoom prediction. In CVPR (pp. 2351–2359). Li, H., Liu, Y., Ouyang, W., & Wang, X. (2018a). Zoom out and in Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). 
Towards understanding network with map attention decision for region proposal and object regularization in batch normalization. In ICLR. detection. In IJCV. Luo, P., Zhang, R., Ren, J., Peng, Z., & Li, J. (2019). Switch- Li, J., Wei, Y., Liang, X., Dong, J., Xu, T., Feng, J., et al. (2017b). able normalization for learning-to-normalize deep representation. Attentive contexts for object detection. IEEE Transactions on Mul- IEEE Transactions on Pattern Analysis and Machine Intelligence. timedia, 19(5), 944–954. https://doi.org/10.1109/TPAMI.2019.2932062. Li, Q., Jin, S., & Yan, J. (2017c). Mimicking very efficient network for Malisiewicz, T., & Efros, A. (2009). Beyond categories: The visual object detection. In CVPR (pp. 7341–7349). memex model for reasoning about object relationships. In NIPS. Li, S. Z., & Zhang, Z. (2004). Floatboost learning and statistical face Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., et al. (2018). detection. IEEE TPAMI, 26(9), 1112–1123. Arbitrary oriented scene text detection via rotation proposals. IEEE Li, Y., Chen, Y., Wang, N., & Zhang, Z. (2019b). Scale aware trident TMM, 20(11), 3111–3122. networks for object detection. arXiv:1901.01892. Manen, S., Guillaumin, M., & Van Gool, L. (2013). Prime object propos- Li, Y., Ouyang, W., Zhou, B., Wang, K., & Wang, X. (2017d). Scene als with randomized prim’s algorithm. In CVPR (pp. 2536–2543). graph generation from objects, phrases and region captions. In Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of ICCV (pp. 1261–1270). local descriptors. IEEE TPAMI, 27(10), 1615–1630. Li, Y., Qi, H., Dai, J., Ji, X., & Wei, Y. (2017e). Fully convolutional Mordan, T., Thome, N., Henaff, G., & Cord, M. (2018). End to end instance aware semantic segmentation. In CVPR (pp. 4438–4446). learning of latent deformable part based representations for object Li, Y., Wang, S., Tian, Q., & Ding, X. (2015b). Feature representation detection. In IJCV (pp. 1–21). for statistical learning based object detection: A review. Pattern MS COCO detection leaderboard. (2018). http://cocodataset.org/# Recognition, 48(11), 3542–3559. detection-leaderboard. Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2018b). DetNet: Mundy, J. (2006). Object recognition in the geometric era: A retrospec- A backbone network for object detection. In ECCV. tive. In J. Ponce, M. Hebert, C. Schmid, & A. Zisserman (Eds.), Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2018c). Light Book toward category level object recognition (pp. 3–28). Berlin: head RCNN: In defense of two stage object detector. In CVPR. Springer. Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. Murase, H., & Nayar, S. (1995a). Visual learning and recognition of 3D (2017a). Feature pyramid networks for object detection. In CVPR. objects from appearance. IJCV, 14(1), 5–24. Lin, T., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017b). Focal loss Murase, H., & Nayar, S. (1995b). Visual learning and recognition of 3d for dense object detection. In ICCV. objects from appearance. IJCV, 14(1), 5–24. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dol- Murphy, K., Torralba, A., & Freeman, W. (2003). Using the forest to see lár, P., & Zitnick, L. (2014). Microsoft COCO: Common objects the trees: A graphical model relating features, objects and scenes. in context. In ECCV (pp. 740–755). In NIPS. Lin, X., Zhao, C., & Pan, W. (2017c). Towards accurate binary convo- Newell, A., Huang, Z., & Deng, J. (2017). 
Associative embedding: lutional neural network. In NIPS (pp. 344–352). End to end learning for joint detection and grouping. In NIPS (pp. Litjens, G., Kooi, T., Bejnordi, B., Setio, A., Ciompi, F., Ghafoorian, 2277–2287). M., et al. (2017). A survey on deep learning in medical image Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks analysis. Medical Image Analysis, 42, 60–88. for human pose estimation. In ECCV (pp. 483–499). Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L., FeiFei, L., Ojala, T., Pietikäinen, M., & Maenpää, T. (2002). Multiresolution gray- Yuille, A., Huang, J., & Murphy, K. (2018a). Progressive neural scale and rotation invariant texture classification with local binary architecture search. In ECCV (pp. 19–34). patterns. IEEE TPAMI, 24(7), 971–987. Liu, L., Fieguth, P., Guo, Y., Wang, X., & Pietikäinen, M. (2017). Local Oliva, A., & Torralba, A. (2007). The role of context in object recogni- binary features for texture classification: Taxonomy and experi- tion. Trends in cognitive sciences, 11(12), 520–527. mental study. Pattern Recognition, 62, 135–160. Opelt, A., Pinz, A., Fussenegger, M., & Auer, P. (2006). Generic object Liu, S., Huang, D., & Wang, Y. (2018b). Receptive field block net for recognition with boosting. IEEE TPAMI, 28(3), 416–431. accurate and fast object detection. In ECCV. Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018c). Path aggregation transferring midlevel image representations using convolutional network for instance segmentation. In CVPR (pp. 8759–8768). neural networks. In CVPR (pp. 1717–1724). Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object local- Berg, A. (2016). SSD: Single shot multibox detector. In ECCV ization for free? weakly supervised learning with convolutional (pp. 21–37). neural networks. In CVPR (pp. 685–694). Liu, Y., Wang, R., Shan, S., & Chen, X. (2018d). Structure inference Osuna, E., Freund, R., & Girosit, F. (1997). Training support vector net: Object detection using scene level context and instance level machines: An application to face detection. In CVPR (pp. 130– relationships. In CVPR (pp. 6985–6994). 136). Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional net- Ouyang, W., & Wang, X. (2013). Joint deep learning for pedestrian works for semantic segmentation. In Proceedings of the IEEE detection. In ICCV (pp. 2056–2063). Conference on Computer Vision and Pattern Recognition (pp. Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, 3431–3440). S., Wang, Z., Loy, C.-C., et al. (2015). DeepIDNet: Deformable Lowe, D. (1999). Object recognition from local scale invariant features. deep convolutional neural networks for object detection. In CVPR ICCV, 2, 1150–1157. (pp. 2403–2412). Lowe, D. (2004). Distinctive image features from scale-invariant key- Ouyang, W., Wang, X., Zhang, C., & Yang, X. (2016). Factors in fine- points. IJCV, 60(2), 91–110. tuning deep model for object detection with long tail distribution. In CVPR (pp. 864–873). 123 316 International Journal of Computer Vision (2020) 128:261–318 Ouyang, W., Wang, K., Zhu, X., & Wang, X. (2017a). Chained cascade Ren, S., He, K., Girshick, R., Zhang, X., & Sun, J. (2016). Object detec- network for object detection. In ICCV. tion networks on convolutional feature maps. IEEE Transactions Ouyang, W., Zeng, X., Wang, X., Qiu, S., Luo, P., Tian, Y., et al. 
(2017b). on Pattern Analysis and Machine Intelligence, 39(7), 1476–1481. DeepIDNet: Object detection with deformable part based convo- Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, lutional neural networks. IEEE TPAMI, 39(7), 1320–1334. S. (2019). Generalized intersection over union: A metric and a loss Parikh, D., Zitnick, C., & Chen, T. (2012). Exploring tiny images: The for bounding box regression. In CVPR. roles of appearance and contextual information for machine and Rowley, H., Baluja, S., & Kanade, T. (1998). Neural network based face human object recognition. IEEE TPAMI, 34(10), 1978–1991. detection. IEEE TPAMI, 20(1), 23–38. PASCAL VOC detection leaderboard. (2018). http://host.robots.ox.ac. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. uk:8080/leaderboard/main_bootstrap.php (2015). ImageNet large scale visual recognition challenge. IJCV, Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., & Sun, 115(3), 211–252. J. (2018). MegDet: A large minibatch object detector. In CVPR. Russell, B., Torralba, A., Murphy, K., & Freeman, W. (2008). LabelMe: Peng, X., Sun, B., Ali, K., & Saenko, K. (2015). Learning deep object A database and web based tool for image annotation. IJCV, 77(1– detectors from 3d models. In ICCV (pp. 1278–1286). 3), 157–173. Pepik, B., Benenson, R., Ritschel, T., & Schiele, B. (2015). What is Schmid, C., & Mohr, R. (1997). Local grayvalue invariants for image holding back convnets for detection? In German conference on retrieval. IEEE TPAMI, 19(5), 530–535. pattern recognition (pp. 517–528). Schwartz, E., Karlinsky, L., Shtok, J., Harary, S., Marder, M., Pankanti, Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the fisher S., Feris, R., Kumar, A., Giries, R., & Bronstein, A. (2019). Rep- kernel for large scale image classification. In ECCV (pp. 143–156). Met: Representative based metric learning for classification and Pinheiro, P., Collobert, R., & Dollar, P. (2015). Learning to segment one shot object detection. In CVPR. object candidates. In NIPS (pp. 1990–1998). Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & Pinheiro, P., Lin, T., Collobert, R., & Dollár, P. (2016). Learning to LeCun, Y. (2014). OverFeat: Integrated recognition, localization refine object segments. In ECCV (pp. 75–91). and detection using convolutional networks. In ICLR. Ponce, J., Hebert, M., Schmid, C., & Zisserman, A. (2007). Toward Sermanet, P., Kavukcuoglu, K., Chintala, S., & LeCun, Y. (2013). Pedes- category level object recognition. Berlin: Springer. trian detection with unsupervised multistage feature learning. In Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M. P., et al. CVPR (pp. 3626–3633). (2018). A survey on deep learning: Algorithms, techniques, and Shang, W., Sohn, K., Almeida, D., & Lee, H. (2016). Understanding and applications. ACM Computing Surveys, 51(5), 92:1–92:36. improving convolutional neural networks via concatenated recti- Qi, C. R., Liu, W., Wu, C., Su, H., & Guibas, L. J. (2018). Frustum fied linear units. In ICML (pp. 2217–2225). pointnets for 3D object detection from RGBD data. In CVPR (pp. Shelhamer, E., Long, J., & Darrell, T. (2017). Fully convolutional net- 918–927). works for semantic segmentation. IEEE TPAMI. Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep Shen, Z., Liu, Z., Li, J., Jiang, Y., Chen, Y., & Xue, X. (2017). learning on point sets for 3D classification and segmentation. 
In DSOD: Learning deeply supervised object detectors from scratch. CVPR (pp. 652–660). In ICCV. Quanming, Y., Mengshuo, W., Hugo, J. E., Isabelle, G., Yiqi, H., Yufeng, Shi, X., Shan, S., Kan, M., Wu, S., & Chen, X. (2018). Real time rotation L., et al. (2018). Taking human out of learning applications: A invariant face detection with progressive calibration networks. In survey on automated machine learning. arXiv:1810.13306. CVPR. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, Shi, Z., Yang, Y., Hospedales, T., & Xiang, T. (2017). Weakly supervised S. (2007). Objects in context. In ICCV. image annotation and segmentation with objects and attributes. Rahman, S., Khan, S., & Barnes, N. (2018a). Polarity loss for zero shot IEEE TPAMI, 39(12), 2525–2538. object detection. arXiv:1811.08982. Shrivastava, A., & Gupta A. (2016), Contextual priming and feedback Rahman, S., Khan, S., & Porikli, F. (2018b). Zero shot object detection: for Faster RCNN. In ECCV (pp. 330–348). Learning to simultaneously recognize and localize novel concepts. Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region based In ACCV. object detectors with online hard example mining. In CVPR (pp. Razavian, R., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN 761–769). features off the shelf: An astounding baseline for recognition. In Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2017). Beyond CVPR workshops (pp. 806–813). skip connections: Top down modulation for object detection. In Rebuffi, S., Bilen, H., & Vedaldi, A. (2017). Learning multiple visual CVPR. domains with residual adapters. In Advances in neural information Simonyan, K., & Zisserman, A. (2015). Very deep convolutional net- processing systems (pp. 506–516). works for large scale image recognition. In ICLR. Rebuffi, S., Bilen, H., & Vedaldi A. (2018). Efficient parametrization Singh, B., & Davis, L. (2018). An analysis of scale invariance in object of multidomain deep neural networks. In CVPR (pp. 8119–8127). detection-SNIP. In CVPR. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only Singh, B., Li, H., Sharma, A., & Davis, L. S. (2018a). RFCN 3000 at look once: Unified, real time object detection. In CVPR (pp. 779– 30fps: Decoupling detection and classification. In CVPR. 788). Singh, B., Najibi, M., & Davis, L. S. (2018b). SNIPER: Efficient mul- Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. tiscale training. arXiv:1805.09300. In CVPR. Sivic, J., & Zisserman, A. (2003). Video google: A text retrieval Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, approach to object matching in videos. International Conference J. B., Larochelle, H., & Zemel R. S. (2018). Meta learning for on Computer Vision (ICCV), 2, 1470–1477. semisupervised few shot classification. In ICLR. Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards unreasonable effectiveness of data in deep learning era. In ICCV real time object detection with region proposal networks. In NIPS (pp. 843–852). (pp. 91–99). Sun, K., Xiao, B., Liu, D., & Wang, J. (2019a). Deep high resolution Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster RCNN: Towards representation learning for human pose estimation. In CVPR. real time object detection with region proposal networks. IEEE TPAMI, 39(6), 1137–1149. 
123 International Journal of Computer Vision (2020) 128:261–318 317 Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., et al. (2019b). Woo, S., Hwang, S., & Kweon, I. (2018). StairNet: Top down semantic High resolution representations for labeling pixels and regions. aggregation for accurate one shot detection. In WACV (pp. 1093– CoRR.,. arXiv:1904.04514. 1102). Sun, S., Pang, J., Shi, J., Yi, S., & Ouyang, W. (2018). FishNet: A Worrall, D. E., Garbin, S. J., Turmukhambetov, D., & Brostow, G. J. versatile backbone for image, region, and pixel level prediction. In (2017). Harmonic networks: Deep translation and rotation equiv- NIPS (pp. 754–764). ariance. In CVPR (Vol. 2). Sun, Z., Bebis, G., & Miller, R. (2006). On road vehicle detection: A Wu, Y., & He, K. (2018). Group normalization. In ECCV (pp. 3–19). review. IEEE TPAMI, 28(5), 694–711. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Yu, P. S. (2019). A com- Sung, K., & Poggio, T. (1994). Learning and example selection for prehensive survey on graph neural networks. arXiv:1901.00596. object and pattern detection. MIT AI Memo (1521). Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, Swain, M., & Ballard, D. (1991). Color indexing. IJCV, 7(1), 11–32. J. (2015). 3D ShapeNets: A deep representation for volumetric Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., shapes. In CVPR (pp. 1912–1920). Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper Xia, G., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., with convolutions. In CVPR (pp. 1–9). Pelillo, M., & Zhang, L. (2018). DOTA: A large-scale dataset for Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. (2017). Inception object detection in aerial images. In CVPR (pp. 3974–3983). v4, inception resnet and the impact of residual connections on Xiang, Y., Mottaghi, R., & Savarese, S. (2014). Beyond PASCAL: A learning. In AAAI (pp. 4278–4284). benchmark for 3D object detection in the wild. In WACV (pp. 75– Szegedy, C., Reed, S., Erhan, D., Anguelov, D., & Ioffe, S. (2014). 82). Scalable, high quality object detection. arXiv:1412.1441. Xiao, R., Zhu, L., & Zhang, H. (2003). Boosting chain learning for Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object detection. In ICCV (pp. 709–715). object detection. In NIPS (pp. 2553–2561). Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). residual transformations for deep neural networks. In CVPR. Rethinking the inception architecture for computer vision. In Yang, B., Yan, J., Lei, Z., & Li, S. (2016a). CRAFT objects from images. CVPR (pp. 2818–2826). In CVPR (pp. 6043–6051). Torralba, A. (2003). Contextual priming for object detection. IJCV, Yang, F., Choi, W., & Lin, Y. (2016b). Exploit all the layers: Fast and 53(2), 169–191. accurate CNN object detector with scale dependent pooling and Turk, M. A., & Pentland, A. (1991). Face recognition using eigenfaces. cascaded rejection classifiers. In CVPR (pp. 2129–2137). In CVPR (pp. 586–591). Yang, M., Kriegman, D., & Ahuja, N. (2002). Detecting faces in images: Tuzel, O., Porikli, F., & Meer P. (2006). Region covariance: A fast Asurvey. IEEE TPAMI, 24(1), 34–58. descriptor for detection and classification. In ECCV (pp. 589–600). Ye, Q., & Doermann, D. (2015). Text detection and recognition in TychsenSmith, L., & Petersson, L. (2017). DeNet: Scalable real time imagery: A survey. IEEE TPAMI, 37(7), 1480–1500. 
object detection with directed sparse sampling. In ICCV. Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How trans- TychsenSmith, L., & Petersson, L. (2018). Improving object localiza- ferable are features in deep neural networks? In NIPS (pp. tion with fitness nms and bounded iou loss. In CVPR. 3320–3328). Uijlings, J., van de Sande, K., Gevers, T., & Smeulders, A. (2013). Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends Selective search for object recognition. IJCV, 104(2), 154–171. in deep learning based natural language processing. IEEE Com- Vaillant, R., Monrocq, C., & LeCun, Y. (1994). Original approach for the putational Intelligence Magazine, 13(3), 55–75. localisation of objects in images. IEE Proceedings Vision, Image Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated and Signal Processing, 141(4), 245–250. convolutions. arXiv preprint arXiv:1511.07122. Van de Sande, K., Uijlings, J., Gevers, T., & Smeulders, A. (2011). Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. Segmentation as selective search for object recognition. In ICCV In CVPR (Vol.2,p.3). (pp. 1879–1886). Yu, R., Li, A., Chen, C., Lai, J., et al. (2018). NISP: Pruning networks Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, using neuron importance score propagation. In CVPR. A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you Zafeiriou, S., Zhang, C., & Zhang, Z. (2015). A survey on face detection need. In NIPS (pp. 6000–6010). in the wild: Past, present and future. Computer Vision and Image Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple Understanding, 138, 1–24. kernels for object detection. In ICCV (pp. 606–613). Zagoruyko, S., Lerer, A., Lin, T., Pinheiro, P., Gross, S., Chintala, S., Viola, P., & Jones, M. (2001). Rapid object detection using a boosted & Dollár, P. (2016). A multipath network for object detection. In cascade of simple features. CVPR, 1, 1–8. BMVC. Wan, L., Eigen, D., & Fergus, R. (2015). End to end integration of a Zeiler, M., & Fergus, R. (2014). Visualizing and understanding convo- convolution network, deformable parts model and nonmaximum lutional networks. In ECCV (pp. 818–833). suppression. In CVPR (pp. 851–859). Zeng, X., Ouyang, W., Yan, J., Li, H., Xiao, T., Wang, K., et al. (2017). Wang, H., Wang, Q., Gao, M., Li, P., & Zuo, W. (2018). Multiscale Crafting gbd-net for object detection. IEEE Transactions on Pat- location aware kernel representation for object detection. In CVPR. tern Analysis and Machine Intelligence, 40(9), 2109–2123. Wang, X., Cai, Z., Gao, D., & Vasconcelos, N. (2019). Towards universal Zeng, X., Ouyang, W., Yang, B., Yan, J., & Wang, X. (2016). Gated object detection by domain attention. arXiv:1904.04402. bidirectional cnn for object detection. In ECCV (pp. 354–369). Wang, X., Han, T., & Yan, S. (2009). An HOG-LBP human detector Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016a). Joint face detection with partial occlusion handling. In International conference on and alignment using multitask cascaded convolutional networks. computer vision (pp. 32–39). IEEE SPL, 23(10), 1499–1503. Wang, X., Shrivastava, A., & Gupta, A. (2017). A Fast RCNN: Hard Zhang, L., Lin, L., Liang, X., & He, K. (2016b). Is faster RCNN doing positive generation via adversary for object detection. In CVPR. well for pedestrian detection? In ECCV (pp. 443–457). Wei, Y., Pan, X., Qin, H., Ouyang, W., & Yan, J. (2018). 
Quantization Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. (2018a). Single shot mimic: Towards very tiny CNN for object detection. In ECCV (pp. refinement neural network for object detection. In CVPR. 267–283). Zhang, S., Yang, J., & Schiele, B. (2018b). Occluded pedestrian detec- tion through guided attention in CNNs. In CVPR (pp. 2056–2063). 123 318 International Journal of Computer Vision (2020) 128:261–318 Zhang, X., Li, Z., Change Loy, C., & Lin, D. (2017). PolyNet: A pursuit Zhou, Y., Ye, Q., Qiu, Q., & Jiao, J. (2017b). Oriented response net- of structural diversity in very deep networks. In CVPR (pp. 718– works. In CVPR (pp. 4961–4970). 726). Zhu, X., Tuia, D., Mou, L., Xia, G., Zhang, L., Xu, F., et al. (2017). Zhang, X., Yang, Y., Han, Z., Wang, H., & Gao, C. (2013). Object class Deep learning in remote sensing: A comprehensive review and list detection: A survey. ACM Computing Surveys, 46(1), 10:1–10:53. of resources. IEEE Geoscience and Remote Sensing Magazine, Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018c). ShuffleNet: An 5(4), 8–36. extremely efficient convolutional neural network for mobile Zhu, X., Vondrick, C., Fowlkes, C., & Ramanan, D. (2016a). Do we devices. In CVPR. need more training data? IJCV, 119(1), 76–92. Zhang, Z., Geiger, J., Pohjalainen, J., Mousa, A. E., Jin, W., & Schuller, Zhu, Y., Urtasun, R., Salakhutdinov, R., & Fidler, S. (2015). SegDeepM: B. (2018d). Deep learning for environmentally robust speech Exploiting segmentation and context in deep neural networks for recognition: An overview of recent developments. ACM Trans- object detection. In CVPR (pp. 4703–4711). actions on Intelligent Systems and Technology, 9(5), 49:1–49:28. Zhu, Y., Zhao, C., Wang, J., Zhao, X., Wu, Y., & Lu, H. (2017a). Zhang, Z., Qiao, S., Xie, C., Shen, W., Wang, B., & Yuille, A. (2018e). CoupleNet: Coupling global structure with local parts for object Single shot object detection with enriched semantics. In CVPR. detection. In ICCV. Zhao, Q., Sheng, T., Wang, Y., Tang, Z., Chen, Y., Cai, L., & Ling, H. Zhu, Y., Zhou, Y., Ye, Q., Qiu, Q., & Jiao, J. (2017b). Soft proposal (2019). M2Det: A single shot object detector based on multilevel networks for weakly supervised object localization. In ICCV (pp. feature pyramid network. In AAAI. 1841–1850). Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, Zhu, Z., Liang, D., Zhang, S., Huang, X., Li, B., & Hu, S. (2016b). D., Huang, C., & Torr, P. (2015). Conditional random fields as Traffic sign detection and classification in the wild. In CVPR (pp. recurrent neural networks. In ICCV (pp. 1529–1537). 2110–2118). Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Zitnick, C., & Dollár, P. (2014). Edge boxes: Locating object proposals Object detectors emerge in deep scene CNNs. In ICLR. from edges. In ECCV (pp. 391–405). Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016a). Zoph, B., & Le, Q. (2016). Neural architecture search with reinforce- Learning deep features for discriminative localization. In CVPR ment learning. arXiv preprint arXiv:1611.01578. (pp. 2921–2929). Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. (2018). Learning trans- Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017a). ferable architectures for scalable image recognition. In CVPR (pp. Places: A 10 million image database for scene recognition. IEEE 8697–8710). Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452–1464. 
Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., & Sun, M. (2018a). Graph neural networks: A review of methods and applications. arXiv:1812.08434.
Zhou, P., Ni, B., Geng, C., Hu, J., & Xu, Y. (2018b). Scale transferrable object detection. In CVPR.
Zhou, Y., Liu, L., Shao, L., & Mellor, M. (2016b). DAVE: A unified framework for fast vehicle detection and annotation. In ECCV (pp. 278–293).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Abstract

Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research. Keywords Object detection · Deep learning · Convolutional neural networks · Object recognition 1 Introduction chler and Elschlager 1973). The goal of object detection is to determine whether there are any instances of objects from As a longstanding, fundamental and challenging problem given categories (such as humans, cars, bicycles, dogs or in computer vision, object detection (illustrated in Fig. 1) cats) in an image and, if present, to return the spatial loca- has been an active area of research for several decades (Fis- tion and extent of each object instance (e.g., via a bounding box Everingham et al. 2010; Russakovsky et al. 2015). As the cornerstone of image understanding and computer vision, Communicated by Bernt Schiele. object detection forms the basis for solving complex or high Li Liu level vision tasks such as segmentation, scene understand- li.liu@oulu.fi ing, object tracking, image captioning, event detection, and Wanli Ouyang activity recognition. Object detection supports a wide range [email protected] of applications, including robot vision, consumer electronics, Xiaogang Wang security, autonomous driving, human computer interaction, [email protected] content based image retrieval, intelligent video surveillance, Paul Fieguth and augmented reality. pfi[email protected] Recently, deep learning techniques (Hinton and Salakhut- Jie Chen dinov 2006; LeCun et al. 2015) have emerged as powerful jie.chen@oulu.fi methods for learning feature representations automatically Xinwang Liu from data. In particular, these techniques have provided [email protected] major improvements in object detection, as illustrated in Matti Pietikäinen Fig. 3. matti.pietikainen@oulu.fi As illustrated in Fig. 2, object detection can be grouped into one of two types (Grauman and Leibe 2011; Zhang et al. National University of Defense Technology, Changsha, China 2013): detection of specific instances versus the detection of University of Oulu, Oulu, Finland broad categories. The first type aims to detect instances of University of Sydney, Camperdown, Australia a particular object (such as Donald Trump’s face, the Eiffel Chinese University of Hong Kong, Sha Tin, China Tower, or a neighbor’s dog), essentially a matching problem. University of Waterloo, Waterloo, Canada 123 262 International Journal of Computer Vision (2020) 128:261–318 (a) (b) Fig. 3 An overview of recent object detection performance: we can observe a significant improvement in performance (measured as mean Fig. 1 Most frequent keywords in ICCV and CVPR conference papers average precision) since the arrival of deep learning in 2012. 
a Detection from 2016 to 2018. The size of each word is proportional to the fre- results of winning entries in the VOC2007-2012 competitions, and b quency of that keyword. We can see that object detection has received top object detection competition results in ILSVRC2013-2017 (results significant attention in recent years in both panels use only the provided training data) over the past 5 years. Given the exceptionally rapid rate of progress, this article attempts to track recent advances and summarize their achievements in order to gain a clearer pic- ture of the current panorama in generic object detection. 1.1 Comparison with Previous Reviews Many notable object detection surveys have been published, as summarized in Table 1. These include many excellent sur- veys on the problem of specific object detection, such as pedestrian detection (Enzweiler and Gavrila 2009; Geron- Fig. 2 Object detection includes localizing instances of a particular imo et al. 2010; Dollar et al. 2012), face detection (Yang object (top), as well as generalizing to detecting object categories in et al. 2002; Zafeiriou et al. 2015), vehicle detection (Sun et al. general (bottom). This survey focuses on recent advances for the latter 2006) and text detection (Ye and Doermann 2015). There are problem of generic object detection comparatively few recent surveys focusing directly on the problem of generic object detection, except for the work by The goal of the second type is to detect (usually previ- Zhang et al. (2013) who conducted a survey on the topic ously unseen) instances of some predefined object categories of object class detection. However, the research reviewed (for example humans, cars, bicycles, and dogs). Historically, in Grauman and Leibe (2011), Andreopoulos and Tsotsos much of the effort in the field of object detection has focused (2013) and Zhang et al. (2013) is mostly pre-2012, and there- on the detection of a single category (typically faces and fore prior to the recent striking success and dominance of pedestrians) or a few specific categories. In contrast, over deep learning and related methods. the past several years, the research community has started Deep learning allows computational models to learn moving towards the more challenging goal of building gen- fantastically complex, subtle, and abstract representations, eral purpose object detection systems where the breadth of driving significant progress in a broad range of problems such object detection ability rivals that of humans. as visual recognition, object detection, speech recognition, Krizhevsky et al. (2012a) proposed a Deep Convo- natural language processing, medical image analysis, drug lutional Neural Network (DCNN) called AlexNet which discovery and genomics. Among different types of deep neu- achieved record breaking image classification accuracy in the ral networks, DCNNs (LeCun et al. 1998, 2015; Krizhevsky Large Scale Visual Recognition Challenge (ILSVRC) (Rus- et al. 2012a) have brought about breakthroughs in processing sakovsky et al. 2015). Since that time, the research focus in images, video, speech and audio. To be sure, there have been most aspects of computer vision has been specifically on deep many published surveys on deep learning, including that of learning methods, indeed including the domain of generic Bengio et al. (2013), LeCun et al. (2015), Litjens et al. (2017), object detection (Girshick et al. 2014;Heetal. 2014;Gir- Gu et al. 
(2018), and more recently in tutorials at ICCV and shick 2015; Sermanet et al. 2014; Ren et al. 2017). Although CVPR. tremendous progress has been achieved, illustrated in Fig. 3, In contrast, although many deep learning based methods we are unaware of comprehensive surveys of this subject have been proposed for object detection, we are unaware of 123 International Journal of Computer Vision (2020) 128:261–318 263 Table 1 Summary of related object detection surveys since 2000 No. Survey title References Year Venue Content 1 Monocular pedestrian detection: Enzweiler and Gavrila (2009) 2009 PAMI An evaluation of three pedestrian survey and experiments detectors 2 Survey of pedestrian detection for Geronimo et al. (2010) 2010 PAMI A survey of pedestrian detection advanced driver assistance for advanced driver assistance systems systems 3 Pedestrian detection: an evaluation Dollar et al. (2012) 2012 PAMI A thorough and detailed evaluation of the state of the art of detectors in monocular images 4 Detecting faces in images: a survey Yang et al. (2002) 2002 PAMI First survey of face detection from a single image 5 A survey on face detection in the Zafeiriou et al. (2015) 2015 CVIU A survey of face detection in the wild: past, present and future wild since 2000 6 On road vehicle detection: a review Sun et al. (2006) 2006 PAMI A review of vision based on-road vehicle detection systems 7 Text detection and recognition in Ye and Doermann (2015) 2015 PAMI A survey of text detection and imagery: a survey recognition in color imagery 8 Toward category level object Ponce et al. (2007) 2007 Book Representative papers on object recognition categorization, detection, and segmentation 9 The evolution of object Dickinson et al. (2009) 2009 Book A trace of the evolution of object categorization and the challenge categorization over 4 decades of image abstraction 10 Context based object Galleguillos and Belongie (2010) 2010 CVIU A review of contextual information categorization: a critical survey for object categorization 11 50 years of object recognition: Andreopoulos and Tsotsos (2013) 2013 CVIU A review of the evolution of object directions forward recognition systems over 5 decades 12 Visual object recognition Grauman and Leibe (2011) 2011 Tutorial Instance and category object recognition techniques 13 Object class detection: a survey Zhang et al. (2013) 2013 ACM CS Survey of generic object detection methods before 2011 14 Feature representation for Li et al. (2015b) 2015 PR Feature representation methods in statistical learning based object statistical learning based object detection: a review detection, including handcrafted and deep learning based features 15 Salient object detection: a survey Borji et al. (2014) 2014 arXiv A survey for salient object detection 16 Representation learning: a review Bengio et al. (2013) 2013 PAMI Unsupervised feature learning and and new perspectives deep learning, probabilistic models, autoencoders, manifold learning, and deep networks 17 Deep learning LeCun et al. (2015) 2015 Nature An introduction to deep learning and applications 18 A survey on deep learning in Litjens et al. (2017) 2017 MIA A survey of deep learning for medical image analysis image classification, object detection, segmentation and registration in medical image analysis 19 Recent advances in convolutional Gu et al. 
(2018) 2017 PR A broad survey of the recent neural networks advances in CNN and its applications in computer vision, speech and natural language processing 20 Tutorial: tools for efficient object − 2015 ICCV15 A short course for object detection detection only covering recent milestones 123 264 International Journal of Computer Vision (2020) 128:261–318 Table 1 continued No. Survey title References Year Venue Content 21 Tutorial: deep learning for objects − 2017 CVPR17 A high level summary of recent and scenes work on deep learning for visual recognition of objects and scenes 22 Tutorial: instance level recognition − 2017 ICCV17 A short course of recent advances on instance level recognition, including object detection, instance segmentation and human pose prediction 23 Tutorial: visual recognition and − 2018 CVPR18 A tutorial on methods and beyond principles behind image classification, object detection, instance segmentation, and semantic segmentation 24 Deep learning for generic object Ours 2019 VISI A comprehensive survey of deep detection learning for generic object detection any comprehensive recent survey. A thorough review and researchers a framework to understand current research and summary of existing work is essential for further progress in to identify open challenges for future research. object detection, particularly for researchers wishing to enter The remainder of this paper is organized as follows. the field. Since our focus is on generic object detection, the Related background and the progress made during the last extensive work on DCNNs for specific object detection, such 2 decades are summarized in Sect. 2. A brief introduction as face detection (Li et al. 2015a; Zhang et al. 2016a;Huetal. to deep learning is given in Sect. 3. Popular datasets and 2017), pedestrian detection (Zhang et al. 2016b; Hosang et al. evaluation criteria are summarized in Sect. 4. We describe 2015), vehicle detection (Zhou et al. 2016b) and traffic sign the milestone object detection frameworks in Sect. 5.From detection (Zhu et al. 2016b) will not be considered. Sects. 6 to 9, fundamental sub-problems and the relevant issues involved in designing object detectors are discussed. Finally, in Sect. 10, we conclude the paper with an overall 1.2 Scope discussion of object detection, state-of-the- art performance, and future research directions. The number of papers on generic object detection based on deep learning is breathtaking. There are so many, in fact, that compiling any comprehensive review of the state of the art is beyond the scope of any reasonable length paper. As a result, 2 Generic Object Detection it is necessary to establish selection criteria, in such a way that we have limited our focus to top journal and conference 2.1 The Problem papers. Due to these limitations, we sincerely apologize to those authors whose works are not included in this paper. For Generic object detection, also called generic object category surveys of work on related topics, readers are referred to the detection, object class detection, or object category detec- articles in Table 1. This survey focuses on major progress of tion (Zhang et al. 2013), is defined as follows. Given an the last 5 years, and we restrict our attention to still pictures, image, determine whether or not there are instances of objects leaving the important subject of video object detection as a from predefined categories (usually many categories, e.g., topic for separate consideration in the future. 
200 categories in the ILSVRC object detection challenge) The main goal of this paper is to offer a comprehensive and, if present, to return the spatial location and extent of survey of deep learning based generic object detection tech- each instance. A greater emphasis is placed on detecting niques, and to present some degree of taxonomy, a high a broad range of natural categories, as opposed to specific level perspective and organization, primarily on the basis object category detection where only a narrower predefined of popular datasets, evaluation metrics, context modeling, category of interest (e.g., faces, pedestrians, or cars) may and detection proposal methods. The intention is that our be present. Although thousands of objects occupy the visual categorization be helpful for readers to have an accessi- world in which we live, currently the research community is ble understanding of similarities and differences between primarily interested in the localization of highly structured a wide variety of strategies. The proposed taxonomy gives objects (e.g., cars, faces, bicycles and airplanes) and artic- 123 International Journal of Computer Vision (2020) 128:261–318 265 (a) (b) (c) (d) Fig. 4 Recognition problems related to generic object detection: a Fig. 5 Taxonomy of challenges in generic object detection image level object classification, b bounding box level generic object detection, c pixel-wise semantic segmentation, d instance level semantic segmentation et al. 2006; Andreopoulos and Tsotsos 2013). Generic object detection is closely related to semantic image segmentation ulated objects (e.g., humans, cows and horses) rather than (Fig. 4c), which aims to assign each pixel in an image to a unstructured scenes (such as sky, grass and cloud). semantic class label. Object instance segmentation (Fig. 4d) The spatial location and extent of an object can be defined aims to distinguish different instances of the same object coarsely using a bounding box (an axis-aligned rectangle class, as opposed to semantic segmentation which does not. tightly bounding the object) (Everingham et al. 2010;Rus- sakovsky et al. 2015), a precise pixelwise segmentation mask 2.2 Main Challenges (Zhang et al. 2013), or a closed boundary (Lin et al. 2014; Russell et al. 2008), as illustrated in Fig. 4. To the best of The ideal of generic object detection is to develop a general- our knowledge, for the evaluation of generic object detec- tion algorithms, it is bounding boxes which are most widely purpose algorithm that achieves two competing goals of high quality/accuracy and high efficiency (Fig. 5). As illustrated used in the current literature (Everingham et al. 2010;Rus- in Fig. 6, high quality detection must accurately local- sakovsky et al. 2015), and therefore this is also the approach ize and recognize objects in images or video frames, such we adopt in this survey. However, as the research community that the large variety of object categories in the real world moves towards deeper scene understanding (from image level can be distinguished (i.e., high distinctiveness), and that object classification to single object localization, to generic object instances from the same category, subject to intra- object detection, and to pixelwise object segmentation), it is class appearance variations, can be localized and recognized anticipated that future challenges will be at the pixel level (i.e., high robustness). High efficiency requires that the entire (Lin et al. 2014). 
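To make the bounding box representation described above concrete, the short Python sketch below (ours, purely illustrative; the coordinate conventions are assumptions, not taken from any particular dataset) shows the two box encodings most commonly used in practice and the conversion between them.

```python
# Two common encodings of an axis-aligned bounding box (illustrative only):
#   corner form: [x1, y1, x2, y2]  -- top-left and bottom-right corners
#   center form: [cx, cy, w, h]    -- box centre, width and height

def corners_to_center(box):
    """Convert [x1, y1, x2, y2] to [cx, cy, w, h]."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1]

def center_to_corners(box):
    """Convert [cx, cy, w, h] to [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return [cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0]

print(corners_to_center([10, 20, 110, 220]))       # [60.0, 120.0, 100, 200]
print(center_to_corners([60.0, 120.0, 100, 200]))  # [10.0, 20.0, 110.0, 220.0]
```

Different datasets store boxes in slightly different conventions (e.g. two corner points versus one corner plus width and height), but conversion between such forms is straightforward.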
There are many problems closely related to that of generic detection task runs in real time with acceptable memory and storage demands. object detection . The goal of object classification or object categorization (Fig. 4a) is to assess the presence of objects from a given set of object classes in an image; i.e., assigning 2.2.1 Accuracy Related Challenges one or more object class labels to a given image, determin- ing the presence without the need of location. The additional Challenges in detection accuracy stem from (1) the vast range requirement to locate the instances in an image makes detec- of intra-class variations and (2) the huge number of object tion a more challenging task than classification. The object categories. recognition problem denotes the more general problem of Intra-class variations can be divided into two types: intrin- identifying/localizing all the objects present in an image, sic factors and imaging conditions. In terms of intrinsic subsuming the problems of object detection and classifica- factors, each object category can have many different object tion (Everingham et al. 2010; Russakovsky et al. 2015; Opelt instances, possibly varying in one or more of color, tex- ture, material, shape, and size, such as the “chair” category To the best of our knowledge, there is no universal agreement in the shown in Fig. 6i. Even in a more narrowly defined class, such literature on the definitions of various vision subtasks. Terms such as as human or horse, object instances can appear in different detection, localization, recognition, classification, categorization, veri- poses, subject to nonrigid deformations or with the addition fication, identification, annotation, labeling, and understanding are often differently defined (Andreopoulos and Tsotsos 2013). of clothing. 123 266 International Journal of Computer Vision (2020) 128:261–318 in object appearance, such as illumination, pose, scale, occlusion, clutter, shading, blur and motion, with examples illustrated in Fig. 6a–h. Further challenges may be added by digitization artifacts, noise corruption, poor resolution, and (d) (a) (b) (c) filtering distortions. In addition to intraclass variations, the large number of 4 5 object categories, on the order of 10 –10 , demands great dis- crimination power from the detector to distinguish between (e) (f) (g) (h) subtly different interclass variations, as illustrated in Fig. 6j. In practice, current detectors focus mainly on structured object categories, such as the 20, 200 and 91 object classes in PASCAL VOC (Everingham et al. 2010), ILSVRC (Rus- sakovsky et al. 2015) and MS COCO (Lin et al. 2014) (i) respectively. Clearly, the number of object categories under consideration in existing benchmark datasets is much smaller than can be recognized by humans. (j) 2.2.2 Efficiency and Scalability Related Challenges Fig. 6 Changes in appearance of the same class with variations in imag- ing conditions (a–h). There is an astonishing variation in what is meant The prevalence of social media networks and mobile/wearable to be a single object class (i). In contrast, the four images in j appear devices has led to increasing demands for analyzing visual very similar, but in fact are from four different object classes. Most data. However, mobile/wearable devices have limited com- images are from ImageNet (Russakovsky et al. 2015) and MS COCO putational capabilities and storage space, making efficient (Lin et al. 2014) object detection critical. 
The efficiency challenges stem from the need to localize Imaging condition variations are caused by the dra- and recognize, computational complexity growing with the matic impacts unconstrained environments can have on (possibly large) number of object categories, and with the object appearance, such as lighting (dawn, day, dusk, (possibly very large) number of locations and scales within indoors), physical location, weather conditions, cameras, a single image, such as the examples in Fig. 6c, d. backgrounds, illuminations, occlusion, and viewing dis- A further challenge is that of scalability: A detector should tances. All of these conditions produce significant variations be able to handle previously unseen objects, unknown situ- Fig. 7 Milestones of object detection and recognition, including feature datasets (Everingham et al. 2010; Lin et al. 2014; Russakovsky et al. representations (Csurka et al. 2004; Dalal and Triggs 2005;Heetal. 2015). The time period up to 2012 is dominated by handcrafted fea- 2016;Krizhevskyetal. 2012a; Lazebnik et al. 2006;Lowe 1999, 2004; tures, a transition took place in 2012 with the development of DCNNs Perronnin et al. 2010; Simonyan and Zisserman 2015; Sivic and Zisser- for image classification by Krizhevsky et al. (2012a), with methods after man 2003; Szegedy et al. 2015; Viola and Jones 2001;Wangetal. 2009), 2012 dominated by related deep networks. Most of the listed methods detection frameworks (Felzenszwalb et al. 2010b; Girshick et al. 2014; are highly cited and won a major ICCV or CVPR prize. See Sect. 2.3 Sermanet et al. 2014; Uijlings et al. 2013; Viola and Jones 2001), and for details 123 International Journal of Computer Vision (2020) 128:261–318 267 ations, and high data rates. As the number of images and tion. However, more recently, deeper CNNs have led to the number of categories continue to grow, it may become record-breaking improvements in the detection of more gen- impossible to annotate them manually, forcing a reliance on eral object categories, a shift which came about when the weakly supervised strategies. successful application of DCNNs in image classification (Krizhevsky et al. 2012a) was transferred to object detec- 2.3 Progress in the Past 2 Decades tion, resulting in the milestone Region-based CNN (RCNN) detector of Girshick et al. (2014). Early research on object recognition was based on template The successes of deep detectors rely heavily on vast train- ing data and large networks with millions or even billions of matching techniques and simple part-based models (Fischler and Elschlager 1973), focusing on specific objects whose parameters. The availability of GPUs with very high compu- spatial layouts are roughly rigid, such as faces. Before 1990 tational capability and large-scale detection datasets [such as the leading paradigm of object recognition was based on geo- ImageNet (Deng et al. 2009; Russakovsky et al. 2015) and metric representations (Mundy 2006; Ponce et al. 2007), with MS COCO (Lin et al. 2014)] play a key role in their suc- the focus later moving away from geometry and prior mod- cess. Large datasets have allowed researchers to target more els towards the use of statistical classifiers [such as Neural realistic and complex problems from images with large intra- Networks (Rowley et al. 1998), SVM (Osuna et al. 1997) and class variations and inter-class similarities (Lin et al. 2014; Adaboost (Viola and Jones 2001; Xiao et al. 2003)] based on Russakovsky et al. 2015). 
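As a rough, back-of-the-envelope illustration of why the number of candidate locations and scales matters (the numbers below are ours and purely illustrative), consider a naive exhaustive sliding-window search:

```python
# Illustrative count of candidate windows for a naive exhaustive search:
# one window centred at every pixel, over a handful of scales and aspect
# ratios, before any pruning or cascading.  (All numbers are assumptions.)
width, height = 1000, 800        # image size in pixels
num_scales = 5                   # e.g. window sizes from ~50 px to ~800 px
num_aspect_ratios = 3            # e.g. 1:2, 1:1, 2:1

num_windows = width * height * num_scales * num_aspect_ratios
print(f"{num_windows:,} candidate windows per image")   # 12,000,000
```

Numbers of this magnitude are what motivate the shared computation and proposal-pruning strategies reviewed in Sect. 5.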
However, accurate annotations are labor intensive to obtain, so detectors must consider meth- appearance features (Murase and Nayar 1995a; Schmid and Mohr 1997). This successful family of object detectors set ods that can relieve annotation difficulties or can learn with smaller training datasets. the stage for most subsequent research in this field. The milestones of object detection in more recent years are The research community has started moving towards the presented in Fig. 7, in which two main eras (SIFT vs. DCNN) challenging goal of building general purpose object detec- are highlighted. The appearance features moved from global tion systems whose ability to detect many object categories representations (Murase and Nayar 1995b; Swain and Bal- matches that of humans. This is a major challenge: accord- lard 1991; Turk and Pentland 1991) to local representations ing to cognitive scientists, human beings can identify around that are designed to be invariant to changes in translation, 3000 entry level categories and 30,000 visual categories over- all, and the number of categories distinguishable with domain scale, rotation, illumination, viewpoint and occlusion. Hand- crafted local invariant features gained tremendous popularity, expertise may be to the order of 10 (Biederman 1987a). Despite the remarkable progress of the past years, designing starting from the Scale Invariant Feature Transform (SIFT) feature (Lowe 1999), and the progress on various visual an accurate, robust, efficient detection and recognition sys- 4 5 recognition tasks was based substantially on the use of local tem that approaches human-level performance on 10 –10 descriptors (Mikolajczyk and Schmid 2005) such as Haar- categories is undoubtedly an unresolved problem. like features (Viola and Jones 2001), SIFT (Lowe 2004), Shape Contexts (Belongie et al. 2002), Histogram of Gradi- ents (HOG) (Dalal and Triggs 2005) Local Binary Patterns (LBP) (Ojala et al. 2002), and region covariances (Tuzel et al. 3 A Brief Introduction to Deep Learning 2006). These local features are usually aggregated by simple concatenation or feature pooling encoders such as the Bag of Deep learning has revolutionized a wide range of machine Visual Words approach, introduced by Sivic and Zisserman learning tasks, from image classification and video process- (2003) and Csurka et al. (2004), Spatial Pyramid Matching ing to speech recognition and natural language understand- (SPM) of BoW models (Lazebnik et al. 2006), and Fisher ing. Given this tremendously rapid evolution, there exist Vectors (Perronnin et al. 2010). many recent survey papers on deep learning (Bengio et al. For years, the multistage hand tuned pipelines of hand- 2013; Goodfellow et al. 2016;Guetal. 2018; LeCun et al. crafted local descriptors and discriminative classifiers dom- 2015; Litjens et al. 2017; Pouyanfar et al. 2018;Wuetal. inated a variety of domains in computer vision, including 2019; Young et al. 2018; Zhang et al. 2018d; Zhou et al. object detection, until the significant turning point in 2012 2018a; Zhu et al. 2017). These surveys have reviewed deep when DCNNs (Krizhevsky et al. 2012a) achieved their learning techniques from different perspectives (Bengio et al. record-breaking results in image classification. 2013; Goodfellow et al. 2016;Guetal. 2018; LeCun et al. The use of CNNs for detection and localization (Row- 2015; Pouyanfar et al. 2018;Wuetal. 2019; Zhou et al. ley et al. 
1998) can be traced back to the 1990s, with a modest number of hidden layers used for object detection (Vaillant et al. 1994; Rowley et al. 1998; Sermanet et al. 2013), successful in restricted domains such as face detection.

2018a), or with applications to medical image analysis (Litjens et al. 2017), natural language processing (Young et al. 2018), speech recognition systems (Zhang et al. 2018d), and remote sensing (Zhu et al. 2017).

Fig. 8 a Illustration of three operations that are repeatedly applied by a typical CNN: convolution with a number of linear filters; nonlinearities (e.g. ReLU); and local pooling (e.g. max pooling). The M feature maps from a previous layer are convolved with N different filters (here shown as size 3 × 3 × M), using a stride of 1. The resulting N feature maps are then passed through a nonlinear function (e.g. ReLU), and pooled (e.g. taking a maximum over 2 × 2 regions) to give N feature maps at a reduced resolution. b Illustration of the architecture of VGGNet (Simonyan and Zisserman 2015), a typical CNN with 11 weight layers. An image with 3 color channels is presented as the input. The network has 8 convolutional layers, 3 fully connected layers, 5 max pooling layers and a softmax classification layer. The last three fully connected layers take features from the top convolutional layer as input in vector form. The final layer is a C-way softmax function, C being the number of classes. The whole network can be learned from labeled training data by optimizing an objective function (e.g. mean squared error or cross entropy loss) via stochastic gradient descent (Color figure online)

Convolutional Neural Networks (CNNs), the most representative models of deep learning, are able to exploit the basic properties underlying natural signals: translation invariance, local connectivity, and compositional hierarchies (LeCun et al. 2015). A typical CNN, illustrated in Fig. 8, has a hierarchical structure and is composed of a number of layers to learn representations of data with multiple levels of abstraction (LeCun et al. 2015). We begin with a convolution

$x_i^{l-1} \ast w_{i,j}^{l}$   (1)

between an input feature map $x_i^{l-1}$ from the previous layer $l-1$, convolved with a 2D convolutional kernel (or filter, or weights) $w_{i,j}^{l}$. This convolution appears over a sequence of layers, subject to a nonlinear operation $\sigma$, such that

$x_j^{l} = \sigma\left(\sum_{i=1}^{N^{l-1}} x_i^{l-1} \ast w_{i,j}^{l} + b_j^{l}\right)$,   (2)

with a convolution now between the $N^{l-1}$ input feature maps $x_i^{l-1}$ and the corresponding kernels $w_{i,j}^{l}$, plus a bias term $b_j^{l}$. The elementwise nonlinear function $\sigma(\cdot)$ is typically a rectified linear unit (ReLU) for each element,

$\sigma(x) = \max\{x, 0\}$.   (3)

Finally, pooling corresponds to the downsampling/upsampling of feature maps. These three operations (convolution, nonlinearity, pooling) are illustrated in Fig. 8a; CNNs having a large number of layers, a "deep" network, are referred to as Deep CNNs (DCNNs), with a typical DCNN architecture illustrated in Fig. 8b.

Most layers of a CNN consist of a number of feature maps, within which each pixel acts like a neuron. Each neuron in a convolutional layer is connected to feature maps of the previous layer through a set of weights $w_{i,j}$ (essentially a set of 2D filters). As can be seen in Fig. 8b, where the early CNN layers are typically composed of convolutional and pooling layers, the later layers are normally fully connected. From earlier to later layers, the input image is repeatedly convolved, and with each layer, the receptive field or region of support increases. In general, the initial CNN layers extract low-level features (e.g., edges), with later layers extracting more general features of increasing complexity (Zeiler and Fergus 2014; Bengio et al. 2013; LeCun et al. 2015; Oquab et al. 2014).
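To make Eqs. (1)-(3) concrete, the toy NumPy sketch below applies one convolutional layer (summing over input feature maps, adding a bias, and applying a ReLU) followed by 2 × 2 max pooling. It is purely illustrative and ours, not the implementation of any particular network: it uses unpadded, stride-1 cross-correlation (the usual deep learning convention for "convolution") and nested loops, whereas real frameworks rely on heavily optimized routines.

```python
import numpy as np

def conv_relu(x, w, b):
    """One convolutional layer in the sense of Eqs. (1)-(3).

    x: input feature maps x_i^{l-1}, shape (M, H, W)
    w: filters w_{i,j}^{l},          shape (N, M, k, k)
    b: biases b_j^{l},               shape (N,)
    Returns the N output maps x_j^{l} after the ReLU of Eq. (3).
    Stride 1, no padding; 'convolution' here is cross-correlation.
    """
    M, H, W = x.shape
    N, M2, k, _ = w.shape
    assert M == M2
    out = np.zeros((N, H - k + 1, W - k + 1))
    for j in range(N):                        # each output map x_j^l
        acc = np.full(out.shape[1:], b[j], dtype=float)
        for i in range(M):                    # sum over input maps, Eq. (2)
            for r in range(acc.shape[0]):
                for c in range(acc.shape[1]):
                    acc[r, c] += np.sum(x[i, r:r + k, c:c + k] * w[j, i])
        out[j] = np.maximum(acc, 0.0)         # ReLU, Eq. (3)
    return out

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling of feature maps of shape (N, H, W)."""
    N, H, W = x.shape
    x = x[:, :H - H % s, :W - W % s]          # drop any ragged border
    return x.reshape(N, x.shape[1] // s, s, x.shape[2] // s, s).max(axis=(2, 4))

# Toy usage: a 3-channel 8x8 input and four 3x3 filters.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w = 0.1 * rng.standard_normal((4, 3, 3, 3))
features = max_pool(conv_relu(x, w, np.zeros(4)))
print(features.shape)                         # (4, 3, 3)
```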
DCNNs have a number of outstanding advantages: a hierarchical structure to learn representations of data with multiple levels of abstraction, the capacity to learn very complex functions, and learning feature representations directly and automatically from data with minimal domain knowledge. What has particularly made DCNNs successful has been the availability of large scale labeled datasets and of GPUs with very high computational capability.

Despite the great successes, known deficiencies remain. In particular, there is an extreme need for labeled training data and a requirement of expensive computing resources, and considerable skill and experience are still needed to select appropriate learning parameters and network architectures. Trained networks are poorly interpretable, there is a lack of robustness to degradations, and many DCNNs have shown serious vulnerability to attacks (Goodfellow et al. 2015), all of which currently limit the use of DCNNs in real-world applications.

4 Datasets and Performance Evaluation

4.1 Datasets

Datasets have played a key role throughout the history of object recognition research, not only as a common ground for measuring and comparing the performance of competing algorithms, but also pushing the field towards increasingly complex and challenging problems. In particular, recently, deep learning techniques have brought tremendous success to

dation and testing datasets for the detection challenges are given in Table 3. The most frequent object classes in VOC, COCO, ILSVRC and Open Images detection datasets are visualized in Table 4.

PASCAL VOC Everingham et al. (2010, 2015) is a multiyear effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, creating the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. Starting from only four categories in 2005, the dataset has increased to 20 categories that are common in everyday life. Since 2009, the number of images has grown every year, but with all previous images retained to allow test results to be compared from year to year. Due to the availability of larger datasets like ImageNet, MS COCO and Open Images, PASCAL VOC has gradually fallen out of fashion.

ILSVRC, the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015), is derived from ImageNet (Deng et al. 2009), scaling up PASCAL VOC's goal of standardized training and evaluation of detection algorithms by more than an order of magnitude in the number of object classes and images. ImageNet1000, a subset of ImageNet images with 1000 different object categories and a total of 1.2 million images, has been fixed to provide a standardized benchmark for the ILSVRC image classification challenge.
many visual recognition problems, and it is the large amounts MS COCO is a response to the criticism of ImageNet that of annotated data which play a key role in their success. objects in its dataset tend to be large and well centered, mak- Access to large numbers of images on the Internet makes it ing the ImageNet dataset atypical of real-world scenarios. possible to build comprehensive datasets in order to capture To push for richer image understanding, researchers created a vast richness and diversity of objects, enabling unprece- the MS COCO database (Lin et al. 2014) containing com- dented performance in object recognition. plex everyday scenes with common objects in their natural For generic object detection, there are four famous context, closer to real life, where objects are labeled using datasets: PASCAL VOC (Everingham et al. 2010, 2015), fully-segmented instances to provide more accurate detec- ImageNet (Deng et al. 2009), MS COCO (Lin et al. 2014) tor evaluation. The COCO object detection challenge (Lin and Open Images (Kuznetsova et al. 2018). The attributes et al. 2014) features two object detection tasks: using either of these datasets are summarized in Table 2, and selected bounding box output or object instance segmentation output. sample images are shown in Fig. 9. There are three steps to COCO introduced three new challenges: creating large-scale annotated datasets: determining the set of target object categories, collecting a diverse set of candidate 1. It contains objects at a wide range of scales, including a images to represent the selected categories on the Internet, high percentage of small objects (Singh and Davis 2018); and annotating the collected images, typically by designing 2. Objects are less iconic and amid clutter or heavy occlu- crowdsourcing strategies. Recognizing space limitations, we sion; refer interested readers to the original papers (Everingham 3. The evaluation metric (see Table 5) encourages more et al. 2010, 2015; Lin et al. 2014; Russakovsky et al. 2015; accurate object localization. Kuznetsova et al. 2018) for detailed descriptions of these datasets in terms of construction and properties. Just like ImageNet in its time, MS COCO has become the The four datasets form the backbone of their respective standard for object detection today. detection challenges. Each challenge consists of a publicly OICOD (the Open Image Challenge Object Detection) is available dataset of images together with ground truth anno- derived from Open Images V4 (now V5 in 2019) (Kuznetsova tation and standardized evaluation software, and an annual et al. 2018), currently the largest publicly available object competition and corresponding workshop. Statistics for the number of images and object instances in the training, vali- The annotations on the test set are not publicly released, except for PASCAL VOC2007. 123 270 International Journal of Computer Vision (2020) 128:261–318 Table 2 Popular databases for object recognition Dataset Total images Categories Images per category Objects per image Image size Started year Highlights name PASCAL 11,540 20 303–4087 2.4 470 × 380 2005 Covers only 20 categories that are VOC (2012) common in everyday life; Large (Evering- number of training images; Close ham et al. 
to real-world applications; 2015) Significantly larger intraclass variations; Objects in scene context; Multiple objects in one image; Contains many difficult samples ImageNet 14 millions+ 21,841 − 1.5 500 × 400 2009 Large number of object categories; (Rus- More instances and more sakovsky categories of objects per image; et al. 2015) More challenging than PASCAL VOC; Backbone of the ILSVRC challenge; Images are object-centric MS COCO 328,000+ 91 − 7.3 640 × 480 2014 Even closer to real world scenarios; (Lin et al. Each image contains more 2014) instances of objects and richer object annotation information; Contains object segmentation notation data that is not available in the ImageNet dataset Places 10 millions+ 434 −− 256 × 256 2014 The largest labeled dataset for (Zhou et al. scene recognition; Four subsets 2017a) Places365 Standard, Places365 Challenge, Places 205 and Places88 as benchmarks Open 9 millions+ 6000+− 8.3 Varied 2017 Annotated with image level labels, Images object bounding boxes and visual (Kuznetsova relationships; Open Images V5 et al. 2018) supports large scale object detection, object instance segmentation and visual relationship detection Example images from PASCAL VOC, ImageNet, MS COCO and Open Images are shown in Fig. 9 (a) (c) (d) (b) Fig. 9 Some example images with object annotations from PASCAL VOC, ILSVRC, MS COCO and Open Images. See Table 2 for a summary of these datasets detection dataset. OICOD is different from previous large of classes, images, bounding box annotations and instance scale object detection datasets like ILSVRC and MS COCO, segmentation mask annotations, but also regarding the anno- not merely in terms of the significantly increased number tation process. In ILSVRC and MS COCO, instances of all 123 International Journal of Computer Vision (2020) 128:261–318 271 Table 3 Statistics of commonly used object detection datasets Challenge Object classes Number of images Number of annotated objects Summary (Train+Val) Train Val Test Train Val Images Boxes Boxes/Image PASCAL VOC object detection challenge VOC07 20 2501 2510 4952 6301(7844) 6307(7818) 5011 12,608 2.5 VOC08 20 2111 2221 4133 5082(6337) 5281(6347) 4332 10,364 2.4 VOC09 20 3473 3581 6650 8505(9760) 8713(9779) 7054 17,218 2.3 VOC10 20 4998 5105 9637 11,577(13,339) 11,797(13,352) 10,103 23,374 2.4 VOC11 20 5717 5823 10,994 13,609(15,774) 13,841(15,787) 11,540 27,450 2.4 VOC12 20 5717 5823 10,991 13,609(15,774) 13,841(15,787) 11,540 27,450 2.4 ILSVRC object detection challenge ILSVRC13 200 395,909 20,121 40,152 345,854 55,502 416,030 401,356 1.0 ILSVRC14 200 456,567 20,121 40,152 478,807 55,502 476,668 534,309 1.1 ILSVRC15 200 456,567 20,121 51,294 478,807 55,502 476,668 534,309 1.1 ILSVRC16 200 456,567 20,121 60,000 478,807 55,502 476,668 534,309 1.1 ILSVRC17 200 456,567 20,121 65,500 478,807 55,502 476,668 534,309 1.1 MS COCO object detection challenge MS COCO15 80 82,783 40,504 81,434 604,907 291,875 123,287 896,782 7.3 MS COCO16 80 82,783 40,504 81,434 604,907 291,875 123,287 896,782 7.3 MS COCO17 80 118,287 5000 40,670 860,001 36,781 123,287 896,782 7.3 MS COCO18 80 118,287 5000 40,670 860,001 36,781 123,287 896,782 7.3 Open images challenge object detection (OICOD)(BasedonopenimagesV4 Kuznetsova et al. 2018) OICOD18 500 1,643,042 100,000 99,999 11,498,734 696,410 1,743,042 12,195,144 7.0 Object statistics for VOC challenges list the non-difficult objects used in the evaluation (all annotated objects). 
For the COCO challenge, prior to 2017, the test set had four splits (Dev, Standard, Reserve,and Challenge), with each having about 20K images. Starting in 2017, the test set has only the Dev and Challenge splits, with the other two splits removed. Starting in 2017, the train and val sets are arranged differently, and the test set is divided into two roughly equally sized splits of about 20,000 images each: Test Dev and Test Challenge. Note that the 2017 Test Dev/Challenge splits contain the same images as the 2015 Test Dev/Challenge splits, so results across the years are directly comparable classes in the dataset are exhaustively annotated, whereas can be found in Everingham et al. (2010), Everingham et al. for Open Images V4 a classifier was applied to each image (2015), Russakovsky et al. (2015), Hoiem et al. (2012). and only those labels with sufficiently high scores were sent The standard outputs of a detector applied to a testing for human verification. Therefore in OICOD only the object image I are the predicted detections {(b , c , p )} , indexed j j j j instances of human-confirmed positive labels are annotated. by object j, of Bounding Box (BB) b , predicted category c , j j and confidence p . A predicted detection (b, c, p) is regarded as a True Positive (TP) if 4.2 Evaluation Criteria • The predicted category c equals the ground truth label There are three criteria for evaluating the performance of c . detection algorithms: detection speed in Frames Per Second • The overlap ratio IOU (Intersection Over Union) (Ever- (FPS), precision, and recall. The most commonly used met- ingham et al. 2010; Russakovsky et al. 2015) ric is Average Precision (AP), derived from precision and recall. AP is usually evaluated in a category specific manner, area (b ∩ b ) i.e., computed for each object category separately. To com- IOU(b, b ) = , (4) area (b ∪ b ) pare performance over all object categories, the mean AP (mAP) averaged over all object categories is adopted as the between the predicted BB b and the ground truth b is final measure of performance . More details on these metrics not smaller than a predefined threshold ε, where ∩ and In object detection challenges, such as PASCAL VOC and ILSVRC, Footnote 3 continued the winning entry of each object category is that with the highest AP performance, and is justified since the ranking of teams by mAP was score, and the winner of the challenge is the team that wins on the most always the same as the ranking by the number of object categories won object categories. The mAP is also used as the measure of a team’s (Russakovsky et al. 2015). 123 272 International Journal of Computer Vision (2020) 128:261–318 Table 4 Most frequent object classes for each detection challenge threshold different pairs (P, R) can be obtained, in principle allowing precision to be regarded as a function of recall, i.e. P(R), from which the Average Precision (AP) (Everingham et al. 2010; Russakovsky et al. 2015) can be found. Since the introduction of MS COCO, more attention has been placed on the accuracy of the bounding box location. Instead of using a fixed IOU threshold, MS COCO introduces (a) (b) a few metrics (summarized in Table 5) for characterizing the performance of an object detector. For instance, in contrast to the traditional mAP computed at a single IoU of 0.5, AP coco is averaged across all object categories and multiple IOU val- ues from 0.5to0.95 in steps of 0.05. 
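A minimal Python implementation of the IOU of Eq. (4) is given below (ours, for illustration only; it assumes boxes in [x1, y1, x2, y2] corner form):

```python
def iou(box_a, box_b):
    """Intersection over Union (Eq. 4) for two axis-aligned boxes.

    Boxes are given as [x1, y1, x2, y2] corner coordinates (an assumed
    convention for this sketch); returns a value in [0, 1].
    """
    # Intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))   # 0.333...
```

With the typical threshold ε = 0.5 (the traditional PASCAL VOC setting), a predicted box with the overlap in this example (IOU ≈ 0.33) would not be accepted as a true positive.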
Because 41% of the objects in MS COCO are small and 24% are large, metrics large small medium AP , AP and AP are also introduced. Finally, coco coco coco Table 5 summarizes the main metrics used in the PASCAL, ILSVRC and MS COCO object detection challenges, with metric modifications for the Open Images challenges pro- posed in Kuznetsova et al. (2018). (c) 5 Detection Frameworks There has been steady progress in object feature represen- tations and classifiers for recognition, as evidenced by the dramatic change from handcrafted features (Viola and Jones 2001; Dalal and Triggs 2005; Felzenszwalb et al. 2008; Harzallah et al. 2009; Vedaldi et al. 2009) to learned DCNN features (Girshick et al. 2014; Ouyang et al. 2015; Girshick (d) 2015; Ren et al. 2015; Dai et al. 2016c). In contrast, in terms of localization, the basic “sliding window” strategy (Dalal The size of each word is proportional to the frequency of that class in and Triggs 2005; Felzenszwalb et al. 2010b, 2008) remains the training dataset mainstream, although with some efforts to avoid exhaustive search (Lampert et al. 2008; Uijlings et al. 2013). However, cup denote intersection and union, respectively. A typical the number of windows is large and grows quadratically value of ε is 0.5. with the number of image pixels, and the need to search over multiple scales and aspect ratios further increases the search space. Therefore, the design of efficient and effec- Otherwise, it is considered as a False Positive (FP). The con- tive detection frameworks plays a key role in reducing this fidence level p is usually compared with some threshold β computational cost. Commonly adopted strategies include to determine whether the predicted class label c is accepted. cascading, sharing feature computation, and reducing per- AP is computed separately for each of the object classes, window computation. based on Precision and Recall. For a given object class c and This section reviews detection frameworks, listed in a testing image I ,let {(b , p )} denote the detections i ij ij j =1 Fig. 11 and Table 11, the milestone approaches appearing returned by a detector, ranked by confidence p in decreasing ij since deep learning entered the field, organized into two main order. Each detection (b , p ) is either a TP or an FP, which ij ij categories: can be determined via the algorithm in Fig. 10. Based on the TP and FP detections, the precision P(β) and recall R(β) (a) Two stage detection frameworks, which include a pre- (Everingham et al. 2010) can be computed as a function of processing step for generating object proposals; the confidence threshold β, so by varying the confidence (b) One stage detection frameworks, or region proposal free frameworks, having a single proposed method which It is worth noting that for a given threshold β, multiple detections of does not separate the process of the detection proposal. the same object in an image are not considered as all correct detections, and only the detection with the highest confidence level is considered as a TP and the rest as FPs. 123 International Journal of Computer Vision (2020) 128:261–318 273 Table 5 Summary of commonly used metrics for evaluating object detectors Metric Meaning Definition and description TP True positive A true positive detection, per Fig. 10 FP False positive A false positive detection, per Fig. 
10 β Confidence threshold A confidence threshold for computing P(β) and R(β) ε IOU threshold VOC Typically around 0.5 wh ILSVRC min(0.5, ); w × h is the size of a GT box (w+10)(h+10) MS COCO Ten IOU thresholds ε ∈{0.5 : 0.05 : 0.95} P(β) Precision The fraction of correct detections out of the total detections returned by the detector with confidence of at least β R(β) Recall The fraction of all N objects detected by the detector having a confidence of at least β AP Average Precision Computed over the different levels of recall achieved by varying the confidence β mAP mean Average Precision VOC AP at a single IOU and averaged over all classes ILSVRC AP at a modified IOU and averaged over all classes MS COCO AP : mAP averaged over ten IOUs: {0.5 : 0.05 : 0.95}; coco IOU=0.5 AP : mAP at IOU = 0.50 (PASCAL VOC metric); coco IOU=0.75 AP :mAP at IOU = 0.75 (strict metric); coco small 2 AP : mAP for small objects of area smaller than 32 ; coco medium 2 2 AP : mAP for objects of area between 32 and 96 ; coco large AP : mAP for large objects of area bigger than 96 ; coco AR Average Recall The maximum recall given a fixed number of detections per image, averaged over all categories and IOU thresholds max=1 AR Average Recall MS COCO AR : AR given 1 detection per image; coco max=10 AR : AR given 10 detection per image; coco max=100 AR : AR given 100 detection per image; coco small 2 AR : AR for small objects of area smaller than 32 ; coco medium 2 2 AR : AR for objects of area between 32 and 96 ; coco large AR : AR for large objects of area bigger than 96 ; coco 5.1 Region Based (Two Stage) Frameworks In a region-based framework, category-independent region proposals are generated from an image, CNN (Krizhevsky et al. 2012a) features are extracted from these regions, and then category-specific classifiers are used to determine the category labels of the proposals. As can be observed from Fig. 11, DetectorNet (Szegedy et al. 2013), OverFeat (Ser- manet et al. 2014), MultiBox (Erhan et al. 2014) and RCNN (Girshick et al. 2014) independently and almost simultane- ously proposed using CNNs for generic object detection. RCNN (Girshick et al. 2014): Inspired by the break- through image classification results obtained by CNNs and Fig. 10 The algorithm for determining TPs and FPs by greedily match- the success of the selective search in region proposal for hand- ing object detection results to ground truth boxes crafted features (Uijlings et al. 2013), Girshick et al. (2014, 2016) were among the first to explore CNNs for generic Sections 6–9 will discuss fundamental sub-problems involved object detection and developed RCNN, which integrates in detection frameworks in greater detail, including DCNN features, detection proposals, and context modeling. Object proposals, also called region proposals or detection proposals, are a set of candidate regions or bounding boxes in an image that may potentially contain an object (Chavali et al. 2016;Hosangetal. 2016). 123 274 International Journal of Computer Vision (2020) 128:261–318 VGGNet Faster RCNN (Simonyan and Zisserman) (Ren et al.) YOLO9000 NIN (Redmon and Farhadi) (Lin et al.) Fast RCNN CornerNet (Girshick) RCNN GoogLeNet RFCN Mask RCNN ResNet (Law and Deng) (Girshick et al.) (Szegedy et al.) (Dai et al.) (He et al.) (He et al.) DetectorNet DenseNet MultiBox MSC Multibox SSD RetinaNet (Szegedy et al.) (Huang et al.) (Erhan et al.) (Szegedy et al.) (Liu et al.) (Lin et al.) SPPNet YOLO Feature Pyramid Network (He et al.) (Redmon et al.) 
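Tying together the quantities of Table 5 and the greedy matching of Fig. 10, the sketch below computes a per-class AP. It is our illustrative reading, not reference code: the exact matching and interpolation conventions differ slightly between VOC, ILSVRC and COCO, and COCO additionally averages the result over the ten IOU thresholds {0.5 : 0.05 : 0.95}.

```python
import numpy as np

def iou(a, b):
    """IOU of [x1, y1, x2, y2] boxes (same convention as the earlier sketch)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, gt_boxes, iou_thr=0.5):
    """AP for one class, with greedy matching in the spirit of Fig. 10.

    detections: list of (image_id, box, score) predicted for this class.
    gt_boxes:   dict image_id -> list of ground-truth boxes for this class.
    Each ground-truth box may be matched by at most one detection; any
    further detection of the same object counts as a false positive.
    """
    detections = sorted(detections, key=lambda d: -d[2])   # by confidence
    matched = {img: [False] * len(b) for img, b in gt_boxes.items()}
    num_gt = sum(len(b) for b in gt_boxes.values())

    tp, fp = [], []
    for img, box, _ in detections:
        cands = gt_boxes.get(img, [])
        best, best_iou = -1, iou_thr
        for k, g in enumerate(cands):          # best unmatched ground truth
            ov = iou(box, g)
            if ov >= best_iou and not matched[img][k]:
                best, best_iou = k, ov
        if best >= 0:
            matched[img][best] = True
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)

    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # All-point interpolated AP (area under the precision-recall curve);
    # one common convention among several used by the challenges.
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, p_interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

Averaging this per-class AP over all classes gives the mAP of Table 5; repeating the computation at the ten COCO IOU thresholds and averaging again gives the COCO-style AP.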
OverFeat (FPN) (Lin et al.) (Sermanet et al.) Fig. 11 Milestones in generic object detection this stage, all region proposals with  0.5IOU overlap with a ground truth box are defined as positives for that ground truth box’s class and the rest as negatives. 3. Class specific SVM classifiers training A set of class- specific linear SVM classifiers are trained using fixed length features extracted with CNN, replacing the soft- max classifier learned by fine-tuning. For training SVM classifiers, positive examples are defined to be the ground truth boxes for each class. A region proposal with less than 0.3 IOU overlap with all ground truth instances of a class is negative for that class. Note that the positive and negative examples defined for training the SVM classi- fiers are different from those for fine-tuning the CNN. 4. Class specific bounding box regressor training Bounding box regression is learned for each object class with CNN features. In spite of achieving high object detection quality, RCNN has notable drawbacks (Girshick 2015): 1. Training is a multistage pipeline, slow and hard to opti- mize because each individual stage must be trained Fig. 12 Illustration of the RCNN detection framework (Girshick et al. 2014, 2016) separately. 2. For SVM classifier and bounding box regressor training, it is expensive in both disk space and time, because CNN AlexNet (Krizhevsky et al. 2012a) with a region proposal features need to be extracted from each object proposal selective search (Uijlings et al. 2013). As illustrated in detail in each image, posing great challenges for large scale in Fig. 12, training an RCNN framework consists of multi- detection, particularly with very deep networks, such as stage pipelines: VGG16 (Simonyan and Zisserman 2015). 3. Testing is slow, since CNN features are extracted per 1. Region proposal computation Class agnostic region pro- object proposal in each test image, without shared com- putation. posals, which are candidate regions that might contain objects, are obtained via a selective search (Uijlings et al. 2013). All of these drawbacks have motivated successive innova- 2. CNN model finetuning Region proposals, which are tions, leading to a number of improved detection frameworks cropped from the image and warped into the same size, such as SPPNet, Fast RCNN, Faster RCNN etc., as follows. are used as the input for fine-tuning a CNN model pre- trained using a large-scale dataset such as ImageNet. At Please refer to Sect. 4.2 for the definition of IOU. 123 International Journal of Computer Vision (2020) 128:261–318 275 SPPNet (He et al. 2014) During testing, CNN feature extraction is the main bottleneck of the RCNN detection pipeline, which requires the extraction of CNN features from thousands of warped region proposals per image. As a result, He et al. (2014) introduced traditional spatial pyramid pooling (SPP) (Grauman and Darrell 2005; Lazebnik et al. 2006) into CNN architectures. Since convolutional layers accept inputs of arbitrary sizes, the requirement of fixed- sized images in CNNs is due only to the Fully Connected (FC) layers, therefore He et al. added an SPP layer on top of the last convolutional (CONV) layer to obtain features of fixed length for the FC layers. With this SPPNet, RCNN obtains a significant speedup without sacrificing any detec- tion quality, because it only needs to run the convolutional layers once on the entire test image to generate fixed-length features for region proposals of arbitrary size. 
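The IOU-based label assignments used in stages 2 and 3 above can be summarized in a few lines. The sketch below is our simplified reading of those rules (per-class bookkeeping is collapsed into a single best-overlap check, and the helper iou_fn is assumed to be an IOU function such as the one in Sect. 4.2):

```python
def assign_rcnn_labels(proposal, gt, iou_fn):
    """Illustrative label assignment for one region proposal.

    proposal: candidate box; gt: list of (box, class_name) ground truths;
    iou_fn: an IOU function such as the one sketched in Sect. 4.2.
    Returns (finetune_label, svm_label); a simplified sketch only.
    """
    overlaps = [(iou_fn(proposal, box), cls) for box, cls in gt] or [(0.0, None)]
    best_iou, best_cls = max(overlaps, key=lambda t: t[0])

    # CNN fine-tuning: >= 0.5 IOU with a ground truth box -> that class,
    # otherwise background.
    finetune_label = best_cls if best_iou >= 0.5 else "background"

    # SVM training: only the ground-truth boxes themselves are positives;
    # proposals with < 0.3 IOU against every ground truth are negatives;
    # everything in between is ignored.
    if best_iou < 0.3:
        svm_label = "negative"
    elif best_iou >= 0.999:      # essentially the ground truth box itself
        svm_label = best_cls
    else:
        svm_label = "ignored"
    return finetune_label, svm_label
```

The fact that the fine-tuning and SVM stages use different definitions of positives and negatives is one of the reasons the RCNN training pipeline is awkward to optimize as a whole.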
While SPPNet accelerates RCNN evaluation by orders of magnitude, it does not result in a comparable speedup of the detector training. Moreover, fine-tuning in SPPNet (He et al. 2014) is unable to update the convolutional layers before the SPP layer, which limits the accuracy of very deep networks. Fast RCNN (Girshick 2015) Girshick proposed Fast RCNN (Girshick 2015) that addresses some of the dis- advantages of RCNN and SPPNet, while improving on their detection speed and quality. As illustrated in Fig. 13, Fast RCNN enables end-to-end detector training by devel- oping a streamlined training process that simultaneously learns a softmax classifier and class-specific bounding box regression, rather than separately training a softmax clas- sifier, SVMs, and Bounding Box Regressors (BBRs) as in RCNN/SPPNet. Fast RCNN employs the idea of sharing the computation of convolution across region proposals, and adds a Region of Interest (RoI) pooling layer between the last CONV layer and the first FC layer to extract a fixed-length feature for each region proposal. Essentially, RoI pooling uses warping at the feature level to approx- imate warping at the image level. The features after the RoI pooling layer are fed into a sequence of FC layers that finally branch into two sibling output layers: softmax prob- abilities for object category prediction, and class-specific bounding box regression offsets for proposal refinement. Compared to RCNN/SPPNet, Fast RCNN improves the effi- ciency considerably—typically 3 times faster in training and 10 times faster in testing. Thus there is higher detection qual- ity, a single training process that updates all network layers, and no storage required for feature caching. Faster RCNN (Ren et al. 2015, 2017) Although Fast RCNN significantly sped up the detection process, it still relies on external region proposals, whose computation is exposed as the new speed bottleneck in Fast RCNN. Recent work has shown that CNNs have a remarkable ability to local- Fig. 13 High level diagrams of the leading frameworks for generic ize objects in CONV layers (Zhou et al. 2015, 2016a; Cinbis object detection. The properties of these methods are summarized in Table 11 et al. 2017; Oquab et al. 2015; Hariharan et al. 2016), an 123 276 International Journal of Computer Vision (2020) 128:261–318 ability which is weakened in the FC layers. Therefore, the prior to prediction. However, Dai et al. (2016c) found that this selective search can be replaced by a CNN in producing naive design turns out to have considerably inferior detection region proposals. The Faster RCNN framework proposed accuracy, conjectured to be that deeper CONV layers are by Ren et al. (2015, 2017) offered an efficient and accu- more sensitive to category semantics, and less sensitive to rate Region Proposal Network (RPN) for generating region translation, whereas object detection needs localization rep- proposals. They utilize the same backbone network, using resentations that respect translation invariance. Based on this features from the last shared convolutional layer to accom- observation, Dai et al. (2016c) constructed a set of position- plish the task of RPN for region proposal and Fast RCNN for sensitive score maps by using a bank of specialized CONV region classification, as shown in Fig. 13. layers as the FCN output, on top of which a position-sensitive RPN first initializes k reference boxes (i.e. the so called RoI pooling layer is added. 
They showed that RFCN with anchors) of different scales and aspect ratios at each CONV ResNet101 (He et al. 2016) could achieve comparable accu- feature map location. The anchor positions are image content racy to Faster RCNN, often at faster running times. independent, but the feature vectors themselves, extracted Mask RCNN He et al. (2017) proposed Mask RCNN to from anchors, are image content dependent. Each anchor is tackle pixelwise object instance segmentation by extend- mapped to a lower dimensional vector, which is fed into two ing Faster RCNN. Mask RCNN adopts the same two stage sibling FC layers—an object category classification layer and pipeline, with an identical first stage (RPN), but in the sec- a box regression layer. In contrast to detection in Fast RCNN, ond stage, in parallel to predicting the class and box offset, the features used for regression in RPN are of the same shape Mask RCNN adds a branch which outputs a binary mask for as the anchor box, thus k anchors lead to k regressors. RPN each RoI. The new branch is a Fully Convolutional Network shares CONV features with Fast RCNN, thus enabling highly (FCN) (Long et al. 2015; Shelhamer et al. 2017) on top of a efficient region proposal computation. RPN is, in fact, a kind CNN feature map. In order to avoid the misalignments caused of Fully Convolutional Network (FCN) (Long et al. 2015; by the original RoI pooling (RoIPool) layer, a RoIAlign Shelhamer et al. 2017); Faster RCNN is thus a purely CNN layer was proposed to preserve the pixel level spatial cor- based framework without using handcrafted features. respondence. With a backbone network ResNeXt101-FPN For the VGG16 model (Simonyan and Zisserman 2015), (Xie et al. 2017; Lin et al. 2017a), Mask RCNN achieved Faster RCNN can test at 5 FPS (including all stages) on a top results for the COCO object instance segmentation and GPU, while achieving state-of-the-art object detection accu- bounding box object detection. It is simple to train, general- racy on PASCAL VOC 2007 using 300 proposals per image. izes well, and adds only a small overhead to Faster RCNN, The initial Faster RCNN in Ren et al. (2015) contains sev- running at 5 FPS (He et al. 2017). eral alternating training stages, later simplified in Ren et al. Chained Cascade Network and Cascade RCNN The (2017). essence of cascade (Felzenszwalb et al. 2010a; Bourdev Concurrent with the development of Faster RCNN, Lenc and Brandt 2005; Li and Zhang 2004) is to learn more dis- and Vedaldi (2015) challenged the role of region proposal criminative classifiers by using multistage classifiers, such generation methods such as selective search, studied the role that early stages discard a large number of easy negative of region proposal generation in CNN based detectors, and samples so that later stages can focus on handling more diffi- found that CNNs contain sufficient geometric information cult examples. Two-stage object detection can be considered for accurate object detection in the CONV rather than FC as a cascade, the first detector removing large amounts of layers. They showed the possibility of building integrated, background, and the second stage classifying the remaining simpler, and faster object detectors that rely exclusively on regions. Recently, end-to-end learning of more than two cas- CNNs, removing region proposal generation methods such caded classifiers and DCNNs for generic object detection as selective search. 
were proposed in the Chained Cascade Network (Ouyang RFCN (Region based Fully Convolutional Network) et al. 2017a), extended in Cascade RCNN (Cai and Vasconce- While Faster RCNN is an order of magnitude faster than los 2018), and more recently applied for simultaneous object Fast RCNN, the fact that the region-wise sub-network still detection and instance segmentation (Chen et al. 2019a), win- needs to be applied per RoI (several hundred RoIs per image) ning the COCO 2018 Detection Challenge. led Dai et al. (2016c) to propose the RFCN detector which is Light Head RCNN In order to further increase the detec- fully convolutional (no hidden FC layers) with almost all tion speed of RFCN (Dai et al. 2016c), Li et al. (2018c)pro- computations shared over the entire image. As shown in posed Light Head RCNN, making the head of the detection Fig. 13, RFCN differs from Faster RCNN only in the RoI network as light as possible to reduce the RoI computation. sub-network. In Faster RCNN, the computation after the RoI In particular, Li et al. (2018c) applied a convolution to pro- pooling layer cannot be shared, so Dai et al. (2016c) proposed duce thin feature maps with small channel numbers (e.g., using all CONV layers to construct a shared RoI sub-network, 490 channels for COCO) and a cheap RCNN sub-network, and RoI crops are taken from the last layer of CONV features leading to an excellent trade-off of speed and accuracy. 123 International Journal of Computer Vision (2020) 128:261–318 277 5.2 Unified (One Stage) Frameworks deep networks. It is one of the most influential object detec- tion frameworks, winning the ILSVRC2013 localization and The region-based pipeline strategies of Sect. 5.1 have dom- detection competition. OverFeat performs object detection inated since RCNN (Girshick et al. 2014), such that the via a single forward pass through the fully convolutional leading results on popular benchmark datasets are all based layers in the network (i.e. the “Feature Extractor”, shown on Faster RCNN (Ren et al. 2015). Nevertheless, region- in Fig. 14a). The key steps of object detection at test time based approaches are computationally expensive for current can be summarized as follows: mobile/wearable devices, which have limited storage and computational capability, therefore instead of trying to opti- 1. Generate object candidates by performing object clas- mize the individual components of a complex region-based sification via a sliding window fashion on multiscale pipeline, researchers have begun to develop unified detection images OverFeat uses a CNN like AlexNet (Krizhevsky strategies. et al. 2012a), which would require input images ofa fixed Unified pipelines refer to architectures that directly pre- size due to its fully connected layers, in order to make dict class probabilities and bounding box offsets from full the sliding window approach computationally efficient, images with a single feed-forward CNN in a monolithic set- OverFeat casts the network (as shown in Fig. 14a) into ting that does not involve region proposal generation or post a fully convolutional network, taking inputs of any size, classification / feature resampling, encapsulating all compu- by viewing fully connected layers as convolutions with tation in a single network. Since the whole pipeline is a single kernels of size 1 × 1. OverFeat leverages multiscale fea- network, it can be optimized end-to-end directly on detection tures to improve the overall performance by passing up to performance. 
six enlarged scales of the original image through the net- DetectorNet (Szegedy et al. 2013) were among the first to work (as shown in Fig. 14b), resulting in a significantly explore CNNs for object detection. DetectorNet formulated increased number of evaluated context views. For each object detection a regression problem to object bounding of the multiscale inputs, the classifier outputs a grid of box masks. They use AlexNet (Krizhevsky et al. 2012a) predictions (class and confidence). and replace the final softmax classifier layer with a regres- 2. Increase the number of predictions by offset max pooling sion layer. Given an image window, they use one network In order to increase resolution, OverFeat applies offset to predict foreground pixels over a coarse grid, as well as max pooling after the last CONV layer, i.e. perform- four additional networks to predict the object’s top, bottom, ing a subsampling operation at every offset, yielding left and right halves. A grouping process then converts the many more views for voting, increasing robustness while predicted masks into detected bounding boxes. The network remaining efficient. needs to be trained per object type and mask type, and does 3. Bounding box regression Once an object is identified, not scale to multiple classes. DetectorNet must take many a single bounding box regressor is applied. The classi- crops of the image, and run multiple networks for each part fier and the regressor share the same feature extraction on every crop, thus making it slow. (CONV) layers, only the FC layers need to be recomputed OverFeat, proposed by Sermanet et al. (2014) and illus- after computing the classification network. trated in Fig. 14, can be considered as one of the first 4. Combine predictions OverFeat uses a greedy merge strat- single-stage object detectors based on fully convolutional egy to combine the individual bounding box predictions across all locations and scales. OverFeat has a significant speed advantage, but is less accu- rate than RCNN (Girshick et al. 2014), because it was difficult to train fully convolutional networks at the time. The speed (a) advantage derives from sharing the computation of convolu- tion between overlapping windows in the fully convolutional network. OverFeat is similar to later frameworks such as YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016), except that the classifier and the regressors in OverFeat are trained sequentially. (b) YOLO Redmon et al. (2016) proposed YOLO (You Only Look Once), a unified detector casting object detection as a regression problem from image pixels to spatially sep- arated bounding boxes and associated class probabilities, Fig. 14 Illustration of the OverFeat (Sermanet et al. 2014) detection framework illustrated in Fig. 13. Since the region proposal generation 123 278 International Journal of Computer Vision (2020) 128:261–318 stage is completely dropped, YOLO directly predicts detec- as VGG (Simonyan and Zisserman 2015), followed by sev- tions using a small set of candidate regions . Unlike region eral auxiliary CONV layers, progressively decreasing in size. based approaches (e.g. Faster RCNN) that predict detections The information in the last layer may be too coarse spa- based on features from a local region, YOLO uses features tially to allow precise localization, so SSD performs detection from an entire image globally. 
In particular, YOLO divides over multiple scales by operating on multiple CONV feature an image into an S × S grid, each predicting C class prob- maps, each of which predicts category scores and box off- abilities, B bounding box locations, and confidence scores. sets for bounding boxes of appropriate sizes. For a 300 ×300 By throwing out the region proposal generation step entirely, input, SSD achieves 74.3% mAP on the VOC2007 test at 59 YOLO is fast by design, running in real time at 45 FPS and FPS versus Faster RCNN 7 FPS / mAP 73.2% or YOLO 45 Fast YOLO (Redmon et al. 2016) at 155 FPS. Since YOLO FPS / mAP 63.4%. sees the entire image when making predictions, it implicitly CornerNet Recently, Law and Deng (2018) questioned the encodes contextual information about object classes, and is dominant role that anchor boxes have come to play in SoA less likely to predict false positives in the background. YOLO object detection frameworks (Girshick 2015;Heetal. 2017; makes more localization errors than Fast RCNN, resulting Redmon et al. 2016; Liu et al. 2016). Law and Deng (2018) from the coarse division of bounding box location, scale and argue that the use of anchor boxes, especially in one stage aspect ratio. As discussed in Redmon et al. (2016), YOLO detectors (Fu et al. 2017; Lin et al. 2017b; Liu et al. 2016; may fail to localize some objects, especially small ones, pos- Redmon et al. 2016), has drawbacks (Law and Deng 2018; sibly because of the coarse grid division, and because each Lin et al. 2017b) such as causing a huge imbalance between grid cell can only contain one object. It is unclear to what positive and negative examples, slowing down training and extent YOLO can translate to good performance on datasets introducing extra hyperparameters. Borrowing ideas from the with many objects per image, such as MS COCO. work on Associative Embedding in multiperson pose estima- YOLOv2 and YOLO9000 Redmon and Farhadi (2017) tion (Newell et al. 2017), Law and Deng (2018) proposed proposed YOLOv2, an improved version of YOLO, in which CornerNet by formulating bounding box object detection the custom GoogLeNet (Szegedy et al. 2015) network is as detecting paired top-left and bottom-right keypoints .In replaced with the simpler DarkNet19, plus batch normal- CornerNet, the backbone network consists of two stacked ization (He et al. 2015), removing the fully connected layers, Hourglass networks (Newell et al. 2016), with a simple cor- and using good anchor boxes learned via kmeans and multi- ner pooling approach to better localize corners. CornerNet scale training. YOLOv2 achieved state-of-the-art on standard achieved a 42.1% AP on MS COCO, outperforming all pre- detection tasks. Redmon and Farhadi (2017) also introduced vious one stage detectors; however, the average inference YOLO9000, which can detect over 9000 object categories in time is about 4FPS on a Titan X GPU, significantly slower real time by proposing a joint optimization method to train than SSD (Liu et al. 2016) and YOLO (Redmon et al. 2016). simultaneously on an ImageNet classification dataset and CornerNet generates incorrect bounding boxes because it is a COCO detection dataset with WordTree to combine data challenging to decide which pairs of keypoints should be from multiple sources. Such joint training allows YOLO9000 grouped into the same objects. To further improve on Cor- to perform weakly supervised detection, i.e. detecting object nerNet, Duan et al. 
(2019) proposed CenterNet to detect each classes that do not have bounding box annotations. object as a triplet of keypoints, by introducing one extra key- SSD In order to preserve real-time speed without sacrific- point at the centre of a proposal, raising the MS COCO AP to ing too much detection accuracy, Liu et al. (2016) proposed 47.0%, but with an inference speed slower than CornerNet. SSD (Single Shot Detector), faster than YOLO (Redmon et al. 2016) and with an accuracy competitive with region- based detectors such as Faster RCNN (Ren et al. 2015). SSD 6 Object Representation effectively combines ideas from RPN in Faster RCNN (Ren et al. 2015), YOLO (Redmon et al. 2016) and multiscale As one of the main components in any detector, good feature CONV features (Hariharan et al. 2016) to achieve fast detec- representations are of primary importance in object detection tion speed, while still retaining high detection quality. Like (Dickinson et al. 2009; Girshick et al. 2014; Gidaris and YOLO, SSD predicts a fixed number of bounding boxes Komodakis 2015; Zhu et al. 2016a). In the past, a great deal and scores, followed by an NMS step to produce the final of effort was devoted to designing local descriptors [e.g., detection. The CNN network in SSD is fully convolutional, SIFT (Lowe 1999) and HOG (Dalal and Triggs 2005)] and to whose early layers are based on a standard architecture, such explore approaches [e.g., Bag of Words (Sivic and Zisserman 2003) and Fisher Vector (Perronnin et al. 2010)] to group and YOLO uses far fewer bounding boxes, only 98 per image, compared to about 2000 from Selective Search. The idea of using keypoints for object detection appeared previously Boxes of various sizes and aspect ratios that serve as object candidates. in DeNet (TychsenSmith and Petersson 2017). 123 International Journal of Computer Vision (2020) 128:261–318 279 (Zeiler and Fergus 2014) VGGNet (Simonyan and Zisserman 2015), GoogLeNet (Szegedy et al. 2015), Inception series (Ioffe and Szegedy 2015; Szegedy et al. 2016, 2017), ResNet (He et al. 2016), DenseNet (Huang et al. 2017a) and SENet (Hu et al. 2018b), summarized in Table 6, and where the improvement over time is seen in Fig. 15. A further review of recent CNN advances can be found in Gu et al. (2018). The trend in architecture evolution is for greater depth: AlexNet has 8 layers, VGGNet 16 layers, more recently ResNet and DenseNet both surpassed the 100 layer mark, and it was VGGNet (Simonyan and Zisserman 2015) and GoogLeNet (Szegedy et al. 2015) which showed that increas- ing depth can improve the representational power. As can be observed from Table 6, networks such as AlexNet, OverFeat, ZFNet and VGGNet have an enormous number of param- eters, despite being only a few layers deep, since a large Fig. 15 Performance of winning entries in the ILSVRC competitions fraction of the parameters come from the FC layers. Newer from 2011 to 2017 in the image classification task networks like Inception, ResNet, and DenseNet, although having a great depth, actually have far fewer parameters by abstract descriptors into higher level representations in order avoiding the use of FC layers. to allow the discriminative parts to emerge; however, these With the use of Inception modules (Szegedy et al. 2015)in feature representation methods required careful engineering carefully designed topologies, the number of parameters of and considerable domain expertise. 
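To illustrate the grid-based prediction used by the unified detectors above, the sketch below decodes a YOLO-style S × S × (B·5 + C) output tensor with the S = 7, B = 2, C = 20 configuration reported for PASCAL VOC; the decoding conventions shown are a simplified assumption rather than the reference implementation.

```python
import numpy as np

# Decode a YOLO-style output tensor: the image is divided into an S x S grid,
# and each cell predicts B boxes (x, y, w, h, confidence) plus C shared class
# probabilities. Random values stand in for a real network output here.
S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)

candidates = []
for row in range(S):
    for col in range(S):
        cell = pred[row, col]
        class_probs = cell[B * 5:]                   # shared across the B boxes of a cell
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
            cx, cy = (col + x) / S, (row + y) / S    # (x, y) are offsets within the cell
            score = conf * class_probs.max()         # class-specific confidence
            candidates.append((cx, cy, w, h, score, int(class_probs.argmax())))

# A real detector would now threshold these scores and apply NMS to obtain
# the final detections (7 x 7 x 2 = 98 candidate boxes per image).
```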
GoogLeNet is dramatically reduced, compared to AlexNet, In contrast, deep learning methods (especially deep ZFNet or VGGNet. Similarly, ResNet demonstrated the CNNs) can learn powerful feature representations with mul- effectiveness of skip connections for learning extremely deep tiple levels of abstraction directly from raw images (Bengio networks with hundreds of layers, winning the ILSVRC et al. 2013; LeCun et al. 2015). As the learning procedure 2015 classification task. Inspired by ResNet (He et al. 2016), reduces the dependency of specific domain knowledge and InceptionResNets (Szegedy et al. 2017) combined the Incep- complex procedures needed in traditional feature engineer- tion networks with shortcut connections, on the basis that ing (Bengio et al. 2013; LeCun et al. 2015), the burden for shortcut connections can significantly accelerate network feature representation has been transferred to the design of training. Extending ResNets, Huang et al. (2017a) proposed better network architectures and training procedures. DenseNets, which are built from dense blocks connecting The leading frameworks reviewed in Sect. 5 [RCNN (Gir- each layer to every other layer in a feedforward fashion, lead- shick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN ing to compelling advantages such as parameter efficiency, (Ren et al. 2015), YOLO (Redmon et al. 2016), SSD (Liu et al. implicit deep supervision, and feature reuse. Recently, Hu 2016)] have persistently promoted detection accuracy and et al. (2018b) proposed Squeeze and Excitation (SE) blocks, speed, in which it is generally accepted that the CNN archi- which can be combined with existing deep architectures to tecture (Sect. 6.1 and Fig. 15) plays a crucial role. As a result, boost their performance at minimal additional computational most of the recent improvements in detection accuracy have cost, adaptively recalibrating channel-wise feature responses been via research into the development of novel networks. by explicitly modeling the interdependencies between con- Therefore we begin by reviewing popular CNN architectures volutional feature channels, and which led to winning the used in Generic Object Detection, followed by a review of ILSVRC 2017 classification task. Research on CNN archi- the effort devoted to improving object feature representa- tectures remains active, with emerging networks such as tions, such as developing invariant features to accommodate Hourglass (Law and Deng 2018), Dilated Residual Networks geometric variations in object scale, pose, viewpoint, part (Yu et al. 2017), Xception (Chollet 2017), DetNet (Li et al. deformation and performing multiscale analysis to improve 2018b), Dual Path Networks (DPN) (Chen et al. 2017b), Fish- object detection over a wide range of scales. Net (Sun et al. 2018), and GLoRe (Chen et al. 2019b). 6.1 Popular CNN Architectures DenseNets perform deep supervision in an implicit way, i.e. individ- CNN architectures (Sect. 3) serve as network backbones used ual layers receive additional supervision from other layers through the in the detection frameworks of Sect. 5. Representative frame- shorter connections. The benefits of deep supervision have previously works include AlexNet (Krizhevsky et al. 2012b), ZFNet been demonstrated in Deeply Supervised Nets (DSN) (Lee et al. 2015). Table 6 DCNN architectures that were commonly used for generic object detection No.
DCNN architecture #Paras (×10 ) #Layers (CONV+FC) Test error (Top 5) First used in Highlights 1 AlexNet (Krizhevsky et al. 2012b)57 5 +215.3% Girshick et al. (2014) The first DCNN found effective for ImageNet classification; the historical turning point from hand-crafted features to CNN; Winning the ILSVRC2012 Image classification competition 2 ZFNet (fast) (Zeiler and Fergus 2014)58 5 +214.8% He et al. (2014) Similar to AlexNet, different in stride for convolution, filter size, and number of filters for some layers 3 OverFeat (Sermanet et al. 2014) 140 6 +213.6% Sermanet et al. (2014) Similar to AlexNet, different in stride for convolution, filter size, and number of filters for some layers 4 VGGNet (Simonyan and Zisserman 2015) 134 13 +26.8% Girshick (2015) Increasing network depth significantly by stacking 3 × 3 convolution filters and increasing the network depth step by step 5 GoogLeNet (Szegedy et al. 2015)6 22 6.7% Szegedy et al. (2015) Use Inception module, which uses multiple branches of convolutional layers with different filter sizes and then concatenates feature maps produced by these branches. The first inclusion of bottleneck structure and global average pooling 6 Inception v2 (Ioffe and Szegedy 2015)12 31 4.8% Howard et al. (2017) Faster training with the introduce of batch normalization 7 Inceptionv3(Szegedyetal. 2016)22 47 3.6% Inclusion of separable convolution and spatial resolution reduction 8 YOLONet (Redmon et al. 2016)64 24 + 1 − Redmon et al. (2016) A network inspired by GoogLeNet used in YOLO detector 9 ResNet50 (He et al. 2016)23.449 3.6% (ResNets) He et al. (2016) With identity mapping, substantially deeper networks can be learned International Journal of Computer Vision (2020) 128:261–318 281 Table 6 continued No. DCNN architecture #Paras (×10 ) #Layers (CONV+FC) Test error (Top 5) First used in Highlights 10 ResNet101 (He et al. 2016) 42 100 He et al. (2016) Requires fewer parameters than VGG by using the global average pooling and bottleneck introduced in GoogLeNet 11 InceptionResNet v1 (Szegedy et al. 2017)21 87 3.1% (Ensemble) Combination of identity mapping and Inception module, with similar computational cost of Inception v3, but faster training process 12 InceptionResNet v2 Szegedy et al. (2017) 30 95 (Huang et al. 2017b) A costlier residual version of Inception, with significantly improved recognition performance 13 Inception v4 Szegedy et al. (2017) 41 75 An Inception variant without residual connections, with roughly the same recognition performance as InceptionResNet v2, but significantly slower 14 ResNeXt (Xie et al. 2017)23 49 3.0% Xie et al. (2017) Repeating a building block that aggregates a set of transformations with the same topology 15 DenseNet201 (Huang et al. 2017a) 18 200 − Zhou et al. (2018b) Concatenate each layer with every other layer in a feed forward fashion. Alleviate the vanishing gradient problem, encourage feature reuse, reduction in number of parameters 16 DarkNet (Redmon and Farhadi 2017)20 19 − Redmon and Farhadi (2017) Similar to VGGNet, but with significantly fewer parameters 17 MobileNet (Howard et al. 2017)3.227 + 1 − Howard et al. (2017) Light weight deep CNNs using depth-wise separable convolutions 18 SE ResNet (Hu et al. 2018b)26 50 2.3% (SENets) Hu et al. (2018b) Channel-wise attention by a novel block called Squeeze and Excitation. Complementary to existing backbone CNNs Regarding the statistics for “#Paras” and “#Layers”, the final FC prediction layer is not taken into consideration. 
“Test Error” column indicates the Top 5 classification test error on ImageNet1000. When ambiguous, the “#Paras”, “#Layers”, and “Test Error” refer to: OverFeat (accurate model), VGGNet16, ResNet101 DenseNet201 (Growth Rate 32, DenseNet-BC), ResNeXt50 (32*4d), and SE ResNet50 282 International Journal of Computer Vision (2020) 128:261–318 The training of a CNN requires a large-scale labeled and use features from the top layer of the CNN as object rep- dataset with intraclass diversity. Unlike image classification, resentations; however, detecting objects across a large range detection requires localizing (possibly many) objects from an of scales is a fundamental challenge. A classical strategy to image. It has been shown (Ouyang et al. 2017b) that pretrain- address this issue is to run the detector over a number of ing a deep model with a large scale dataset having object level scaled input images (e.g., an image pyramid) (Felzenszwalb annotations (such as ImageNet), instead of only the image et al. 2010b; Girshick et al. 2014;Heetal. 2014), which level annotations, improves the detection performance. How- typically produces more accurate detection, with, however, ever, collecting bounding box labels is expensive, especially obvious limitations of inference time and memory. for hundreds of thousands of categories. A common scenario is for a CNN to be pretrained on a large dataset (usually with 6.2.1 Handling of Object Scale Variations a large number of visual categories) with image-level labels; the pretrained CNN can then be applied to a small dataset, Since a CNN computes its feature hierarchy layer by layer, directly, as a generic feature extractor (Razavian et al. 2014; the sub-sampling layers in the feature hierarchy already lead Azizpour et al. 2016; Donahue et al. 2014;Yosinskietal. to an inherent multiscale pyramid, producing feature maps at 2014), which can support a wider range of visual recogni- different spatial resolutions, but subject to challenges (Hari- tion tasks. For detection, the pre-trained network is typically haran et al. 2016; Long et al. 2015; Shrivastava et al. 2017). fine-tuned on a given detection dataset (Donahue et al. In particular, the higher layers have a large receptive field and 2014; Girshick et al. 2014, 2016). Several large scale image strong semantics, and are the most robust to variations such classification datasets are used for CNN pre-training, among as object pose, illumination and part deformation, but the res- them ImageNet1000 (Deng et al. 2009; Russakovsky et al. olution is low and the geometric details are lost. In contrast, 2015) with 1.2 million images of 1000 object categories, lower layers have a small receptive field and rich geomet- Places (Zhou et al. 2017a), which is much larger than Ima- ric details, but the resolution is high and much less sensitive geNet1000 but with fewer classes, a recent Places-Imagenet to semantics. Intuitively, semantic concepts of objects can hybrid (Zhou et al. 2017a), or JFT300M (Hinton et al. 2015; emerge in different layers, depending on the size of the Sun et al. 2017). objects. So if a target object is small it requires fine detail Pretrained CNNs without fine-tuning were explored for information in earlier layers and may very well disappear at object classification and detection in Donahue et al. (2014), later layers, in principle making small object detection very Girshick et al. (2016), Agrawal et al. 
(2014), where it was challenging, for which tricks such as dilated or “atrous” con- shown that detection accuracies are different for features volution (Yu and Koltun 2015; Dai et al. 2016c; Chen et al. extracted from different layers; for example, for AlexNet pre- 2018b) have been proposed, increasing feature resolution, trained on ImageNet, FC6 / FC7 / Pool5 are in descending but increasing computational complexity. On the other hand, order of detection accuracy (Donahue et al. 2014; Girshick if the target object is large, then the semantic concept will et al. 2016). Fine-tuning a pre-trained network can increase emerge in much later layers. A number of methods (Shrivas- detection performance significantly (Girshick et al. 2014, tava et al. 2017; Zhang et al. 2018e; Lin et al. 2017a; Kong 2016), although in the case of AlexNet, the fine-tuning perfor- et al. 2017) have been proposed to improve detection accu- mance boost was shown to be much larger for FC6 / FC7 than racy by exploiting multiple CNN layers, broadly falling into for Pool5, suggesting that Pool5 features are more general. three types of multiscale object detection: Furthermore, the relationship between the source and target datasets plays a critical role, for example that ImageNet based 1. Detecting with combined features of multiple layers; CNN features show better performance for object detection 2. Detecting at multiple layers; than for human action (Zhou et al. 2015; Azizpour et al. 3. Combinations of the above two methods. 2016). (1) Detecting with combined features of multiple CNN lay- ers Many approaches, including Hypercolumns (Hariharan 6.2 Methods For Improving Object Representation et al. 2016), HyperNet (Kong et al. 2016), and ION (Bell et al. 2016), combine features from multiple layers before Deep CNN based detectors such as RCNN (Girshick et al. making a prediction. Such feature combination is commonly 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. accomplished via concatenation, a classic neural network 2015) and YOLO (Redmon et al. 2016), typically use the deep idea that concatenates features from different layers, archi- CNN architectures listed in Table 6 as the backbone network tectures which have recently become popular for semantic segmentation (Long et al. 2015; Shelhamer et al. 2017;Har- Fine-tuning is done by initializing a network with weights optimized iharan et al. 2016). As shown in Fig. 16a, ION (Bell et al. for a large labeled dataset like ImageNet. and then updating the net- work’s weights using the target-task training set. 2016) uses RoI pooling to extract RoI features from multiple 123 International Journal of Computer Vision (2020) 128:261–318 283 ION (Bell et al. 2016). On the other hand, however, it is natural to detect objects of different scales using features of approximately the same size, which can be achieved by detecting large objects from downscaled feature maps while detecting small objects from upscaled feature maps. There- fore, in order to combine the best of both worlds, some recent works propose to detect objects at multiple layers, and the resulting features obtained by combining features from dif- ferent layers. This approach has been found to be effective for segmentation (Long et al. 2015; Shelhamer et al. 2017) and human pose estimation (Newell et al. 2016), has been (b) (a) widely exploited by both one-stage and two-stage detec- tors to alleviate problems of scale variation across object Fig. 16 Comparison of HyperNet and ION. 
LRN is local response nor- malization, which performs a kind of “lateral inhibition” by normalizing instances. Representative methods include SharpMask (Pin- over local input regions (Jia et al. 2014) heiro et al. 2016), Deconvolutional Single Shot Detector (DSSD) (Fu et al. 2017), Feature Pyramid Network (FPN) (Lin et al. 2017a), Top Down Modulation (TDM)(Shrivastava et al. 2017), Reverse connection with Objectness prior Net- layers, and then the object proposals generated by selective search and edgeboxes are classified by using the concatenated work (RON) (Kong et al. 2017), ZIP (Li et al. 2018a), Scale Transfer Detection Network (STDN) (Zhou et al. 2018b), features. HyperNet (Kong et al. 2016), shown in Fig. 16b, follows a similar idea, and integrates deep, intermediate and RefineDet (Zhang et al. 2018a), StairNet (Woo et al. 2018), shallow features to generate object proposals and to predict Path Aggregation Network (PANet) (Liu et al. 2018c), Fea- objects via an end to end joint training strategy. The com- ture Pyramid Reconfiguration (FPR) (Kong et al. 2018), bined feature is more descriptive, and is more beneficial for DetNet (Lietal. 2018b), Scale Aware Network (SAN) (Kim localization and classification, but at increased computational et al. 2018), Multiscale Location aware Kernel Representa- complexity. tion (MLKP) (Wang et al. 2018) and M2Det (Zhao et al. 2019), as shown in Table 7 and contrasted in Fig. 17. (2) Detecting at multiple CNN layers A number of recent approaches improve detection by predicting objects of differ- Early works like FPN (Lin et al. 2017a), DSSD (Fu et al. 2017), TDM (Shrivastava et al. 2017), ZIP (Li et al. 2018a), ent resolutions at different layers and then combining these predictions: SSD (Liu et al. 2016) and MSCNN (Cai et al. RON (Kong et al. 2017) and RefineDet (Zhang et al. 2018a) 2016), RBFNet (Liu et al. 2018b), and DSOD (Shen et al. construct the feature pyramid according to the inherent multi- 2017). SSD (Liu et al. 2016) spreads out default boxes of scale, pyramidal architecture of the backbone, and achieved different scales to multiple layers within a CNN, and forces encouraging results. As can be observed from Fig. 17a1– each layer to focus on predicting objects of a certain scale. f1, these methods have very similar detection architectures RFBNet (Liu et al. 2018b) replaces the later convolution lay- which incorporate a top-down network with lateral connec- ers of SSD with a Receptive Field Block (RFB) to enhance tions to supplement the standard bottom-up, feed-forward the discriminability and robustness of features. The RFB is network. Specifically, after a bottom-up pass the final high level semantic features are transmitted back by the top-down a multibranch convolutional block, similar to the Inception block (Szegedy et al. 2015), but combining multiple branches network to combine with the bottom-up features from inter- mediate layers after lateral processing, and the combined with different kernels and convolution layers (Chen et al. 2018b). MSCNN (Cai et al. 2016) applies deconvolution on features are then used for detection. As can be seen from multiple layers of a CNN to increase feature map resolution Fig. 17a2–e2, the main differences lie in the design of the before using the layers to learn region proposals and pool fea- simple Feature Fusion Block (FFB), which handles the selec- tures. Similar to RFBNet (Liu et al. 2018b), TridentNet (Li tion of features from different layers and the combination of et al. 
2019b) constructs a parallel multibranch architecture multilayer features. where each branch shares the same transformation param- FPN (Lin et al. 2017a) shows significant improvement as eters but with different receptive fields; dilated convolution a generic feature extractor in several applications including with different dilation rates are used to adapt the receptive object detection (Lin et al. 2017a, b) and instance segmen- tation (He et al. 2017). Using FPN in a basic Faster RCNN fields for objects of different scales. (3) Combinations of the above two methods Features from system achieved state-of-the-art results on the COCO detec- tion dataset. STDN (Zhou et al. 2018b) used DenseNet different layers are complementary to each other and can improve detection accuracy, as shown by Hypercolumns (Huang et al. 2017a) to combine features of different layers (Hariharan et al. 2016), HyperNet (Kong et al. 2016) and and designed a scale transfer module to obtain feature maps 123 284 International Journal of Computer Vision (2020) 128:261–318 Table 7 Summary of properties of representative methods in improving DCNN feature representations for generic object detection Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (1) Single ION (Bell et al. SS+EB VGG16 Fast RCNN 79.4 (07+12) 76.4 (07+12) 55.733.1 CVPR16 Use features from detection with 2016) MCG+RPN multiple layers; use multilayer spatial recurrent features neural networks for modeling contextual information; the Best Student Entry and the 3rd overall in the COCO detection challenge 2015 HyperNet (Kong RPN VGG16 Faster RCNN 76.3 (07+12) 71.4 (07T+12) −− CVPR16 Use features from et al. 2016) multiple layers for both region proposal and region classification PVANet (Kim RPN PVANet Faster RCNN 84.9 84.2 (07T+12+CO) −− NIPSW16 Deep but lightweight; et al. 2016) (07+12+CO) Combine ideas from concatenated ReLU (Shang et al. 2016), Inception (Szegedy et al. 2015), and HyperNet (Kong et al. 2016) International Journal of Computer Vision (2020) 128:261–318 285 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (2) Detection at SDP+CRC (Yang EB VGG16 Fast RCNN 69.4 (07) −− − CVPR16 Use features in multiple multiple layers et al. 2016b) layers to reject easy negatives via CRC, and then classify remaining proposals using SDP MSCNN (Cai RPN VGG Faster RCNN Only Tested on KITTI ECCV16 Region proposal and et al. 2016) classification are performed at multiple layers; includes feature upsampling; end to end learning MPN SharpMask VGG16 Fast RCNN −− 51.933.2 BMVC16 Concatenate features (Zagoruyko (Pinheiro from different et al. 2016) et al. 2016) convolutional layers and features of different contextual regions; loss function for multiple overlap thresholds; ranked 2nd in both the COCO15 detection and segmentation challenges DSOD (Shen Free DenseNet SSD 77.7 (07+12) 72.2 (07T+12) 47.329.3 ICCV17 Concatenate feature et al. 2017) sequentially, like DenseNet. Train from scratch on the target dataset without pre-training RFBNet (Liu Free VGG16 SSD 82.2 (07+12) 81.2 (07T+12) 55.734.4 ECCV18 Propose a multi-branch et al. 2018b) convolutional block similar to Inception (Szegedy et al. 
2015), but using dilated convolution 286 International Journal of Computer Vision (2020) 128:261–318 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (3) Combination DSSD (Fu et al. Free ResNet101 SSD 81.5 (07+12) 80.0 (07T+12) 53.333.2 2017 Use Conv-Deconv, as of (1) and (2) 2017) shown in Fig. 17c1, c2 FPN (Lin et al. RPN ResNet101 Faster RCNN−− 59.136.2 CVPR17 Use Conv-Deconv, as 2017a) shown in Fig. 17a1, a2; Widely used in detectors TDM RPN ResNet101 Faster RCNN−− 57.736.8 CVPR17 Use Conv-Deconv, as (Shrivastava VGG16 shown in Fig. 17b2 et al. 2017) RON (Kong et al. RPN VGG16 Faster RCNN 81.3 80.7 (07T+12+CO) 49.527.4 CVPR17 Use Conv-deconv, as 2017) (07+12+CO) shown in Fig. 17d2; Add the objectness prior to significantly reduce object search space ZIP (Li et al. RPN Inceptionv2 Faster RCNN 79.8 (07+12) −− − IJCV18 Use Conv-Deconv, as 2018a) shown in Fig. 17f1. Propose a map attention decision (MAD) unit for features from different layers International Journal of Computer Vision (2020) 128:261–318 287 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO STDN (Zhou Free DenseNet169 SSD 80.9 (07+12) − 51.031.8 CVPR18 A new scale transfer et al. 2018b) module, which resizes features of different scales to the same scale in parallel RefineDet RPN VGG16 Faster RCNN 83.8 (07+12) 83.5 (07T+12) 62.941.8 CVPR18 Use cascade to obtain (Zhang et al. ResNet101 better and less 2018a) anchors. Use Conv-deconv, as shown in Fig. 17e2 to improve features PANet (Liu et al. RPN ResNeXt101 Mask RCNN −− 67.2 47.4 CVPR18 Shown in Fig. 17g. 2018c) +FPN Based on FPN, add another bottom-up path to pass information between lower and topmost layers; adaptive feature pooling. Ranked 1st and 2nd in COCO 2017 tasks DetNet (Li et al. RPN DetNet59+FPN Faster RCNN −− 61.740.2 ECCV18 Introduces dilated 2018b) convolution into the ResNet backbone to maintain high resolution in deeper layers; Shown in Fig. 17i FPR (Kong et al. − VGG16 SSD 82.4 (07+12) 81.1 (07T+12) 54.334.6 ECCV18 Fuse task oriented 2018) ResNet101 features across different spatial locations and scales, globally and locally; Shown in Fig. 17h M2Det (Zhao − SSD VGG16 ResNet101−− 64.644.2 AAAI19 Shown in Fig. 17j, et al. 2019) newly designed top down path to learn a set of multilevel features, recombined to construct a feature pyramid for object detection 288 International Journal of Computer Vision (2020) 128:261–318 Table 7 continued Group Detector name Region Backbone Pipelined mAP@IoU = 0.5 mAP Published in Highlights proposal DCNN used VOC07 VOC12 COCO COCO (4) Model DeepIDNet SS+ EB AlexNet ZFNet RCNN 69.0 (07) −− 25.6 CVPR15 Introduce a deformation geometric (Ouyang et al. OverFeat constrained pooling transforms 2015) GoogLeNet layer, jointly learned with convolutional layers in existing DCNNs. Utilize the following modules that are not trained end to end: cascade, context modeling, model averaging, and bounding box location refinement in the multistage detection pipeline DCN (Dai et al. RPN ResNet101 RFCN 82.6 (07+12) − 58.037.5 CVPR17 Design deformable 2017) IRN convolution and deformable RoI pooling modules that can replace plain convolution in existing DCNNs DPFCN (Mordan AttractioNet ResNet RFCN 83.3 (07+12) 81.2 (07T+12) 59.139.1 IJCV18 Design a deformable et al. 
2018) (Gidaris and part based RoI pooling Komodakis layer to explicitly 2016) select discriminative regions around object proposals Details for Groups (1), (2), and (3) are provided in Sect. 6.2. Abbreviations: Selective Search (SS), EdgeBoxes (EB), InceptionResNet (IRN). Conv-Deconv denotes the use of upsampling and convolutional layers with lateral connections to supplement the standard backbone network. Detection results on VOC07, VOC12 and COCO were reported with mAP@IoU = 0.5, and the additional COCO results are computed as the average of mAP for IoU thresholds from 0.5 to 0.95. Training data: “07”←VOC2007 trainval; “07T”←VOC2007 trainval and test; “12”←VOC2012 trainval; CO← COCO trainval. The COCO detection results were reported with COCO2015 Test-Dev, except for MPN (Zagoruyko et al. 2016) which reported with COCO2015 Test-Standard International Journal of Computer Vision (2020) 128:261–318 289 Fig. 17 Hourglass architectures: Conv1 to Conv5 are the main Conv DSSD (Fu et al. 2017), RON (Kong et al. 2017), RefineDet (Zhang blocks in backbone networks such as VGG or ResNet. The figure com- et al. 2018a), ZIP (Li et al. 2018a), PANet (Liu et al. 2018c), FPR pares a number of feature fusion blocks (FFB) commonly used in recent (Kong et al. 2018), DetNet (Li et al. 2018b) and M2Det (Zhao et al. approaches: FPN (Lin et al. 2017a), TDM (Shrivastava et al. 2017), 2019). FFM feature fusion module, TUM thinned U-shaped module 123 290 International Journal of Computer Vision (2020) 128:261–318 with different resolutions. The scale transfer module can be variations other than just scale, which we group into three directly embedded into DenseNet with little additional cost. categories: More recent work, such as PANet (Liu et al. 2018c), FPR (Kong et al. 2018), DetNet (Li et al. 2018b), and M2Det • Geometric transformations, (Zhao et al. 2019), as shown in Fig. 17g–j, propose to further • Occlusions, and improve on the pyramid architectures like FPN in different • Image degradations. ways. Based on FPN, Liu et al. designed PANet (Liu et al. 2018c) (Fig. 17g1) by adding another bottom-up path with To handle these intra-class variations, the most straightfor- clean lateral connections from low to top levels, in order ward approach is to augment the training datasets with a to shorten the information path and to enhance the feature sufficient amount of variations; for example, robustness to pyramid. Then, an adaptive feature pooling was proposed to rotation could be achieved by adding rotated objects at many aggregate features from all feature levels for each proposal. orientations to the training data. Robustness can frequently In addition, in the proposal sub-network, a complementary be learned this way, but usually at the cost of expensive train- branch capturing different views for each proposal is cre- ing and complex model parameters. Therefore, researchers ated to further improve mask prediction. These additional have proposed alternative solutions to these problems. steps bring only slightly extra computational overhead, but Handling of geometric transformations DCNNs are inher- are effective and allowed PANet to reach 1st place in the ently limited by the lack of ability to be spatially invariant COCO 2017 Challenge Instance Segmentation task and 2nd to geometric transformations of the input data (Lenc and place in the Object Detection task. Kong et al. proposed FPR Vedaldi 2018; Liu et al. 2017; Chellappa 2016). The intro- (Kong et al. 
2018) by explicitly reformulating the feature duction of local max pooling layers has allowed DCNNs to pyramid construction process [e.g. FPN (Lin et al. 2017a)] enjoy some translation invariance, however the intermediate as feature reconfiguration functions in a highly nonlinear but feature maps are not actually invariant to large geometric efficient way. As shown in Fig. 17h1, instead of using a top- transformations of the input data (Lenc and Vedaldi 2018). down path to propagate strong semantic features from the Therefore, many approaches have been presented to enhance topmost layer down as in FPN, FPR first extracts features robustness, aiming at learning invariant CNN representations from multiple layers in the backbone network by adaptive with respect to different types of transformations such as concatenation, and then designs a more complex FFB module scale (Kim et al. 2014; Bruna and Mallat 2013), rotation (Fig. 17h2) to spread strong semantics to all scales. Li et al. (Bruna and Mallat 2013; Cheng et al. 2016; Worrall et al. (2018b) proposed DetNet (Fig. 17i1) by introducing dilated 2017; Zhou et al. 2017b), or both (Jaderberg et al. 2015). One convolutions to the later layers of the backbone network in representative work is Spatial Transformer Network (STN) order to maintain high spatial resolution in deeper layers. (Jaderberg et al. 2015), which introduces a new learnable Zhao et al. (2019) proposed a MultiLevel Feature Pyramid module to handle scaling, cropping, rotations, as well as non- Network (MLFPN) to build more effective feature pyramids rigid deformations via a global parametric transformation. for detecting objects of different scales. As can be seen from STN has now been used in rotated text detection (Jaderberg Fig. 17j1, features from two different layers of the backbone et al. 2015), rotated face detection and generic object detec- are first fused as the base feature, after which a top-down tion (Wang et al. 2017). path with lateral connections from the base feature is created Although rotation invariance may be attractive in certain to build the feature pyramid. As shown in Fig. 17j2, j5, the applications, such as scene text detection (He et al. 2018; FFB module is much more complex than those like FPN, in Ma et al. 2018), face detection (Shi et al. 2018), and aerial that FFB involves a Thinned U-shaped Module (TUM) to imagery (Ding et al. 2018; Xia et al. 2018), there is limited generate a second pyramid structure, after which the feature generic object detection work focusing on rotation invariance maps with equivalent sizes from multiple TUMs are com- because popular benchmark detection datasets (e.g. PAS- bined for object detection. The authors proposed M2Det by CAL VOC, ImageNet, COCO) do not actually present rotated integrating MLFPN into SSD, and achieved better detection images. performance than other one-stage detectors. Before deep learning, Deformable Part based Models (DPMs) (Felzenszwalb et al. 2010b) were successful for 6.3 Handling of Other Intraclass Variations generic object detection, representing objects by compo- nent parts arranged in a deformable configuration. Although Powerful object representations should combine distinctive- DPMs have been significantly outperformed by more recent ness and robustness. A large amount of recent work has been object detectors, their spirit still deeply influences many devoted to handling changes in object scale, as reviewed in recent detectors. DPM modeling is less sensitive to transfor- Sect. 6.2.1. 
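The multiscale designs reviewed above (FPN, DSSD, TDM, RON, RefineDet, FPR, PANet, M2Det) differ mainly in their feature fusion blocks, but share a top-down pathway with lateral connections. A minimal PyTorch-style sketch of that shared pattern follows; the channel widths and the nearest-neighbour upsampling are illustrative assumptions rather than any particular paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Sketch of an FPN-like top-down pathway with lateral connections."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions project each backbone stage to a common width.
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convolutions smooth each merged map before it is used for detection.
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)]

# Example: ResNet-like stage outputs at strides 8, 16 and 32 for a 256x256 input;
# small objects are detected on the high-resolution p3, large objects on p5.
c3 = torch.randn(1, 512, 32, 32)
c4 = torch.randn(1, 1024, 16, 16)
c5 = torch.randn(1, 2048, 8, 8)
p3, p4, p5 = TopDownFusion()(c3, c4, c5)
```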
As discussed in Sect. 2.2 and summarized in mations in object pose, viewpoint and nonrigid deformations, Fig. 5, object detection still requires robustness to real-world motivating researchers (Dai et al. 2017; Girshick et al. 2015; 123 International Journal of Computer Vision (2020) 128:261–318 291 Mordan et al. 2018; Ouyang et al. 2015; Wan et al. 2015)to Bar 2004) that context plays an essential role in human explicitly model object composition to improve CNN based object recognition, and it is recognized that a proper mod- detection. The first attempts (Girshick et al. 2015; Wan et al. eling of context helps object detection and recognition 2015) combined DPMs with CNNs by using deep features (Torralba 2003; Oliva and Torralba 2007; Chen et al. 2018b, learned by AlexNet in DPM based detection, but without 2015a; Divvala et al. 2009; Galleguillos and Belongie 2010), region proposals. To enable a CNN to benefit from the built- especially when object appearance features are insufficient in capability of modeling the deformations of object parts, a because of small object size, object occlusion, or poor image number of approaches were proposed, including DeepIDNet quality. Many different types of context have been discussed (Ouyang et al. 2015), DCN (Dai et al. 2017) and DPFCN (Divvala et al. 2009; Galleguillos and Belongie 2010), and (Mordan et al. 2018) (shown in Table 7). Although simi- can broadly be grouped into one of three categories: lar in spirit, deformations are computed in different ways: DeepIDNet (Ouyang et al. 2017b) designed a deformation 1. Semantic context: The likelihood of an object to be found constrained pooling layer to replace regular max pooling, to in some scenes, but not in others; learn the shared visual patterns and their deformation prop- 2. Spatial context: The likelihood of finding an object in erties across different object classes; DCN (Dai et al. 2017) some position and not others with respect to other objects designed a deformable convolution layer and a deformable in the scene; RoI pooling layer, both of which are based on the idea of 3. Scale context: Objects have a limited set of sizes relative augmenting regular grid sampling locations in feature maps; to other objects in the scene. and DPFCN (Mordan et al. 2018) proposed a deformable part-based RoI pooling layer which selects discriminative A great deal of work (Chen et al. 2015b; Divvala et al. parts of objects around object proposals by simultaneously 2009; Galleguillos and Belongie 2010; Malisiewicz and optimizing latent displacements of all parts. Efros 2009; Murphy et al. 2003; Rabinovich et al. 2007; Handling of occlusions In real-world images, occlu- Parikh et al. 2012) preceded the prevalence of deep learning, sions are common, resulting in information loss from object and much of this work has yet to be explored in DCNN-based instances. A deformable parts idea can be useful for occlu- object detectors (Chen and Gupta 2017;Huetal. 2018a). sion handling, so deformable RoI Pooling (Dai et al. 2017; The current state of the art in object detection (Ren et al. Mordan et al. 2018; Ouyang and Wang 2013) and deformable 2015; Liu et al. 2016;Heetal. 2017) detects objects with- convolution (Dai et al. 2017) have been proposed to allevi- out explicitly exploiting any contextual information. It is ate occlusion by giving more flexibility to the typically fixed broadly agreed that DCNNs make use of contextual informa- geometric structures. Wang et al. 
(2017) propose to learn an tion implicitly (Zeiler and Fergus 2014; Zheng et al. 2015) adversarial network that generates examples with occlusions since they learn hierarchical representations with multiple and deformations, and context may be helpful in dealing with levels of abstraction. Nevertheless, there is value in exploring occlusions (Zhang et al. 2018b). Despite these efforts, the contextual information explicitly in DCNN based detectors occlusion problem is far from being solved; applying GANs (Hu et al. 2018a; Chen and Gupta 2017; Zeng et al. 2017), so to this problem may be a promising research direction. the following reviews recent work in exploiting contextual Handling of image degradations Image noise is a com- cues in DCNN- based object detectors, organized into cate- mon problem in many real-world applications. It is frequently gories of global and local contexts, motivated by earlier work caused by insufficient lighting, low quality cameras, image in Zhang et al. (2013), Galleguillos and Belongie (2010). compression, or the intentional low-cost sensors on edge Representative approaches are summarized in Table 8. devices and wearable devices. While low image quality may be expected to degrade the performance of visual recogni- 7.1 Global Context tion, most current methods are evaluated in a degradation free and clean environment, evidenced by the fact that PASCAL Global context (Zhang et al. 2013; Galleguillos and Belongie VOC, ImageNet, MS COCO and Open Images all focus on 2010) refers to image or scene level contexts, which can serve relatively high quality images. To the best of our knowledge, as cues for object detection (e.g., a bedroom will predict the there is so far very limited work to address this problem. presence of a bed). In DeepIDNet (Ouyang et al. 2015), the image classification scores were used as contextual features, and concatenated with the object detection scores to improve 7 Context Modeling detection results. In ION (Bell et al. 2016), Bell et al. pro- posed to use spatial Recurrent Neural Networks (RNNs) to In the physical world, visual objects occur in particular envi- explore contextual information across the entire image. In ronments and usually coexist with other related objects. SegDeepM (Zhu et al. 2015), Zhu et al. proposed a Markov There is strong psychological evidence (Biederman 1972; random field model that scores appearance as well as context 123 292 International Journal of Computer Vision (2020) 128:261–318 Table 8 Summary of detectors that exploit context information, with labelling details as in Table 7 Group Detector name Region proposal Backbone DCNN Pipelined Used mAP@IoU = 0.5 mAP Published in Highlights VOC07 VOC12 COCO Global context SegDeepM (Zhu SS+CMPC VGG16 RCNN VOC10 VOC12 − CVPR15 Additional features et al. 2015) extracted from an enlarged object proposal as context information DeepIDNet SS+EB AlexNet ZFNet RCNN 69.0 (07) −− CVPR15 Use image classification (Ouyang et al. scores as global 2015) contextual information to refine the detection scores of each object proposal ION (Bell et al. 
SS+EB VGG16 Fast RCNN 80.177.933.1 CVPR16 The contextual 2016) information outside the region of interest is integrated using spatial recurrent neural networks CPF (Shrivastava RPN VGG16 Faster RCNN 76.4 (07+12) 72.6 (07T+12) − ECCV16 Use semantic and Gupta segmentation to 2016) provide top-down feedback Local context MRCNN SS VGG16 SPPNet 78.2 (07+12) 73.9 (07+12) − ICCV15 Extract features from (Gidaris and multiple regions Komodakis surrounding or inside 2015) the object proposals. Integrate the semantic segmentation-aware features GBDNet (Zeng CRAFT (Yang Inception v2 Fast RCNN 77.2 (07+12) − 27.0 ECCV16 TPAMI18 A GBDNet module to et al. 2016, et al. 2016a) ResNet269 learn the relations of 2017) PolyNet multiscale (Zhang et al. contextualized regions 2017) surrounding an object proposal; GBDNet passes messages among features from different context regions through convolution between neighboring support regions in two directions International Journal of Computer Vision (2020) 128:261–318 293 Table 8 continued Group Detector name Region proposal Backbone DCNN Pipelined Used mAP@IoU = 0.5 mAP Published in Highlights VOC07 VOC12 COCO ACCNN (Li SS VGG16 Fast RCNN 72.0 (07+12) 70.6 (07T+12) − TMM17 Use LSTM to capture et al. 2017b) global context. Concatenate features from multi-scale contextual regions surrounding an object proposal. The global and local context features are concatenated for recognition CoupleNet (Zhu RPN ResNet101 RFCN 82.7 (07+12) 80.4 (07T+12) 34.4 ICCV17 Concatenate features et al. 2017a) from multiscale contextual regions surrounding an object proposal. Features of different contextual regions are then combined by convolution and element-wise sum SMN (Chen and RPN VGG16 Faster RCNN 70.0 (07) −− ICCV17 Model object-object Gupta 2017) relationships efficiently through a spatial memory network. Learn the functionality of NMS automatically ORN (Hu et al. RPN ResNet101 Faster RCNN −− 39.0 CVPR18 Model the relations of a 2018a) +DCN set of object proposals through the interactions between their appearance features and geometry. Learn the functionality of NMS automatically SIN (Liu et al. RPN VGG16 Faster RCNN 76.0 (07+12) 73.1 (07T+12) 23.2 CVPR18 Formulate object 2018d) detection as graph-structured inference, where objects are graph nodes and relationships the edges 294 International Journal of Computer Vision (2020) 128:261–318 for each detection, and allows each candidate box to select a contextual region and semantically segmented regions), in segment out of a large pool of object segmentation proposals order to obtain a richer and more robust object representa- and score the agreement between them. In Shrivastava and tion. All of these features are combined by concatenation. Gupta (2016), semantic segmentation was used as a form of Quite a number of methods, all closely related to MRCNN, contextual priming. have been proposed since then. The method in Zagoruyko et al. (2016) used only four contextual regions, organized in a foveal structure, where the classifiers along multiple paths 7.2 Local Context are trained jointly end-to-end. Zeng et al. (2016), Zeng et al. (2017) proposed GBDNet (Fig. 18b) to extract features from Local context (Zhang et al. 2013; Galleguillos and Belongie multiscale contextualized regions surrounding an object pro- 2010; Rabinovich et al. 2007) considers the relationship posal to improve detection performance. 
In contrast to the among locally nearby objects, as well as the interactions somewhat naive approach of learning CNN features for each between an object and its surrounding area. In general, mod- region separately and then concatenating them, GBDNet eling object relations is challenging, requiring reasoning passes messages among features from different contextual about bounding boxes of different classes, locations, scales regions. Noting that message passing is not always helpful, etc. Deep learning research that explicitly models object rela- but dependent on individual samples, Zeng et al. (2016)used tions is quite limited, with representative ones being Spatial gated functions to control message transmission. Li et al. Memory Network (SMN) (Chen and Gupta 2017), Object (2017b) presented ACCNN (Fig. 18c) to utilize both global Relation Network (Hu et al. 2018a), and Structure Inference and local contextual information: the global context was Network (SIN) (Liu et al. 2018d). In SMN, spatial memory captured using a Multiscale Local Contextualized (MLC) essentially assembles object instances back into a pseudo subnetwork, which recurrently generates an attention map for image representation that is easy to be fed into another CNN an input image to highlight promising contextual locations; for object relations reasoning, leading to a new sequential local context adopted a method similar to that of MRCNN reasoning architecture where image and memory are pro- (Gidaris and Komodakis 2015). As shown in Fig. 18d, Cou- cessed in parallel to obtain detections which further update pleNet (Zhu et al. 2017a) is conceptually similar to ACCNN memory. Inspired by the recent success of attention mod- (Lietal. 2017b), but built upon RFCN (Dai et al. 2016c), ules in natural language processing (Vaswani et al. 2017), which captures object information with position sensitive ORN processes a set of objects simultaneously through the RoI pooling, CoupleNet added a branch to encode the global interaction between their appearance feature and geometry. context with RoI pooling. It does not require additional supervision, and it is easy to embed into existing networks, effective in improving object recognition and duplicate removal steps in modern object 8 Detection Proposal Methods detection pipelines, giving rise to the first fully end-to-end object detector. SIN (Liu et al. 2018d) considered two kinds An object can be located at any position and scale in an of context: scene contextual information and object relation- image. During the heyday of handcrafted feature descrip- ships within a single image. It formulates object detection as tors [SIFT (Lowe 2004), HOG (Dalal and Triggs 2005) and a problem of graph inference, where the objects are treated LBP (Ojala et al. 2002)], the most successful methods for as nodes in a graph and relationships between objects are object detection [e.g. DPM (Felzenszwalb et al. 2008)] used modeled as edges. sliding window techniques (Viola and Jones 2001; Dalal and A wider range of methods has approached the con- Triggs 2005; Felzenszwalb et al. 2008; Harzallah et al. 2009; text challenge with a simpler idea: enlarging the detec- Vedaldi et al. 2009). However, the number of windows is tion window size to extract some form of local context. 
huge, growing with the number of pixels in an image, and Representative approaches include MRCNN (Gidaris and the need to search at multiple scales and aspect ratios further Komodakis 2015), Gated BiDirectional CNN (GBDNet) increases the search space . Therefore, it is computationally Zeng et al. (2016), Zeng et al. (2017), Attention to Con- too expensive to apply sophisticated classifiers. text CNN (ACCNN) (Li et al. 2017b), CoupleNet (Zhu et al. Around 2011, researchers proposed to relieve the tension 2017a), and Sermanet et al. (2013). In MRCNN (Gidaris between computational tractability and high detection qual- and Komodakis 2015) (Fig. 18a), in addition to the features extracted from the original object proposal at the last CONV 12 4 Sliding window based detection requires classifying around 10 – layer of the backbone, Gidaris and Komodakis proposed to 10 windows per image. The number of windows grows significantly 6 7 extract features from a number of different regions of an to 10 –10 windows per image when considering multiple scales and object proposal (half regions, border regions, central regions, aspect ratios. 123 International Journal of Computer Vision (2020) 128:261–318 295 (b) (a) (d) (c) Fig. 18 Representative approaches that explore local surrounding contextual features: MRCNN (Gidaris and Komodakis 2015), GBDNet (Zeng et al. 2016, 2017), ACCNN (Li et al. 2017b) and CoupleNet (Zhu et al. 2017a); also see Table 8 ity by using detection proposals (Van de Sande et al. 2011; paper, because object proposals have applications beyond Uijlings et al. 2013). Originating in the idea of objectness object detection (Arbeláez et al. 2012; Guillaumin et al. proposed by Alexe et al. (2010), object proposals are a set 2014; Zhu et al. 2017b). We refer interested readers to the of candidate regions in an image that are likely to contain recent surveys (Hosang et al. 2016; Chavali et al. 2016) which objects, and if high object recall can be achieved with a mod- provide in-depth analysis of many classical object proposal est number of object proposals (like one hundred), significant algorithms and their impact on detection performance. Our speed-ups over the sliding window approach can be gained, interest here is to review object proposal methods that are allowing the use of more sophisticated classifiers. Detection based on DCNNs, output class agnostic proposals, and are proposals are usually used as a pre-processing step, limit- related to generic object detection. ing the number of regions that need to be evaluated by the In 2014, the integration of object proposals (Van de detector, and should have the following characteristics: Sande et al. 2011; Uijlings et al. 2013) and DCNN features (Krizhevsky et al. 2012a) led to the milestone RCNN (Gir- shick et al. 2014) in generic object detection. Since then, 1. High recall, which can be achieved with only a few pro- detection proposal has quickly become a standard prepro- posals; cessing step, based on the fact that all winning entries in 2. Accurate localization, such that the proposals match the the PASCAL VOC (Everingham et al. 2010), ILSVRC (Rus- object bounding boxes as accurately as possible; and sakovsky et al. 2015) and MS COCO (Lin et al. 2014) object 3. Low computational cost. detection challenges since 2014 used detection proposals (Girshick et al. 2014; Ouyang et al. 2015; Girshick 2015; The success of object detection based on detection proposals Ren et al. 2015; Zeng et al. 2017;Heetal. 2017). (Van de Sande et al. 
2011; Uijlings et al. 2013) has attracted Among object proposal approaches based on traditional broad interest (Carreira and Sminchisescu 2012; Arbeláez low-level cues (e.g., color, texture, edge and gradients), et al. 2014; Alexe et al. 2012; Cheng et al. 2014; Zitnick Selective Search (Uijlings et al. 2013), MCG (Arbeláez et al. and Dollár 2014; Endres and Hoiem 2010; Krähenbühl and 2014) and EdgeBoxes (Zitnick and Dollár 2014) are among Koltun 2014; Manen et al. 2013). A comprehensive review the more popular. As the domain rapidly progressed, tra- of object proposal algorithms is beyond the scope of this ditional object proposal approaches (Uijlings et al. 2013; Hosang et al. 2016; Zitnick and Dollár 2014), which were We use the terminology detection proposals, object proposals and adopted as external modules independent of the detectors, region proposals interchangeably. 123 296 International Journal of Computer Vision (2020) 128:261–318 became the speed bottleneck of the detection pipeline (Ren ground. Li et al. (2018a) proposed ZIP to improve RPN by et al. 2015). An emerging class of object proposal algorithms predicting object proposals with multiple convolutional fea- (Erhan et al. 2014; Ren et al. 2015; Kuo et al. 2015; Ghodrati ture maps at different network depths to integrate both low et al. 2015; Pinheiro et al. 2015; Yang et al. 2016a)using level details and high level semantics. The backbone used in DCNNs has attracted broad attention. ZIP is a “zoom out and in” network inspired by the conv and Recent DCNN based object proposal methods generally deconv structure (Long et al. 2015). fall into two categories: bounding box based and object Finally, recent work which deserves mention includes segment based, with representative methods summarized in Deepbox (Kuo et al. 2015), which proposed a lightweight Table 9. CNN to learn to rerank proposals generated by EdgeBox, and Bounding Box Proposal Methods are best exemplified by DeNet (TychsenSmith and Petersson 2017) which introduces the RPC method of Ren et al. (2015), illustrated in Fig. 19. bounding box corner estimation to predict object proposals RPN predicts object proposals by sliding a small network efficiently to replace RPN in a Faster RCNN style detector. over the feature map of the last shared CONV layer. At each Object Segment Proposal Methods Pinheiro et al. (2015), sliding window location, k proposals are predicted by using Pinheiro et al. (2016) aim to generate segment proposals that k anchor boxes, where each anchor box is centered at some are likely to correspond to objects. Segment proposals are location in the image, and is associated with a particular scale more informative than bounding box proposals, and take a and aspect ratio. Ren et al. (2015) proposed integrating RPN step further towards object instance segmentation (Hariha- and Fast RCNN into a single network by sharing their convo- ran et al. 2014; Dai et al. 2016b;Lietal. 2017e). In addition, lutional layers, leading to Faster RCNN, the first end-to-end using instance segmentation supervision can improve the per- detection pipeline. RPN has been broadly selected as the formance of bounding box object detection. The pioneering proposal method by many state-of-the-art object detectors, work of DeepMask, proposed by Pinheiro et al. (2015), seg- as can be observed from Tables 7 and 8. ments proposals learnt directly from raw image data with a Instead of fixing apriori a set of anchors as MultiBox deep network. 
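To make the anchor parametrization used by RPN concrete, the following sketch enumerates the k = 9 reference boxes per sliding position; the stride of 16 and the three scales and three aspect ratios follow the common VGG16 setting of Ren et al. (2015), while the exact coordinate conventions are an illustrative assumption, since they vary across implementations.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate RPN-style anchors: k = len(scales) * len(ratios) boxes per position."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # centre in image coordinates
            for s in scales:
                for r in ratios:                              # r is the width:height ratio
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # box area is kept at s * s
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (feat_h * feat_w * 9, 4) boxes as (x1, y1, x2, y2)

# For a roughly 600 x 800 input at stride 16 the feature map is about 38 x 50,
# giving around 17k anchors; RPN scores and regresses each of them.
print(make_anchors(38, 50).shape)  # (17100, 4)
```

Proposals are then obtained by applying the predicted regression offsets to these anchors and keeping the top-scoring boxes after non-maximum suppression.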
Instead of fixing a priori a set of anchors, as in MultiBox (Erhan et al. 2014; Szegedy et al. 2014) and RPN (Ren et al. 2015), Lu et al. (2016) proposed generating anchor locations by using a recursive search strategy which can adaptively guide computational resources to focus on sub-regions likely to contain objects. Starting with the whole image, all regions visited during the search process serve as anchors. For any anchor region encountered during the search procedure, a scalar zoom indicator is used to decide whether to further partition the region, and a set of bounding boxes with objectness scores is computed by an Adjacency and Zoom Network (AZNet), which extends RPN by adding a branch to compute the scalar zoom indicator in parallel with the existing branch.

Further work attempts to generate object proposals by exploiting multilayer convolutional features. Concurrent with RPN (Ren et al. 2015), Ghodrati et al. (2015) proposed DeepProposal, which generates object proposals by using a cascade of multiple convolutional features, building an inverse cascade to select the most promising object locations and to refine their boxes in a coarse-to-fine manner. An improved variant of RPN, HyperNet (Kong et al. 2016), designs Hyper Features which aggregate multilayer convolutional features and shares them both in generating proposals and in detecting objects via an end-to-end joint training strategy. Yang et al. (2016a) proposed CRAFT, which also uses a cascade strategy, first training an RPN to generate object proposals and then using them to train another binary Fast RCNN network to further distinguish objects from background. Li et al. (2018a) proposed ZIP to improve RPN by predicting object proposals with multiple convolutional feature maps at different network depths, in order to integrate both low level details and high level semantics. The backbone used in ZIP is a "zoom out and in" network inspired by the conv and deconv structure of Long et al. (2015).

Finally, recent work which deserves mention includes DeepBox (Kuo et al. 2015), which proposed a lightweight CNN to learn to rerank proposals generated by EdgeBoxes, and DeNet (TychsenSmith and Petersson 2017), which introduces bounding box corner estimation to predict object proposals efficiently, replacing RPN in a Faster RCNN style detector.

Object Segment Proposal Methods (Pinheiro et al. 2015, 2016) aim to generate segment proposals that are likely to correspond to objects. Segment proposals are more informative than bounding box proposals, and take a step further towards object instance segmentation (Hariharan et al. 2014; Dai et al. 2016b; Li et al. 2017e). In addition, using instance segmentation supervision can improve the performance of bounding box object detection. The pioneering work of DeepMask, proposed by Pinheiro et al. (2015), learns segment proposals directly from raw image data with a deep network. Similarly to RPN, after a number of shared convolutional layers DeepMask splits the network into two branches, one predicting a class agnostic mask and the other an associated objectness score. Also similar to the efficient sliding window strategy in OverFeat (Sermanet et al. 2014), the trained DeepMask network is applied in a sliding window manner to an image (and its rescaled versions) during inference. More recently, Pinheiro et al. (2016) proposed SharpMask by augmenting the DeepMask architecture with a refinement module, similar to the architectures shown in Fig. 17 (b1) and (b2), augmenting the feed-forward network with a top-down refinement process. SharpMask can efficiently integrate the spatially rich information of early features with the strong semantic information encoded in later layers to generate high fidelity object masks.

Motivated by Fully Convolutional Networks (FCN) for semantic segmentation (Long et al. 2015) and by DeepMask (Pinheiro et al. 2015), Dai et al. (2016a) proposed InstanceFCN to generate instance segment proposals. Similar to DeepMask, the InstanceFCN network is split into two fully convolutional branches, one to generate instance sensitive score maps, the other to predict the objectness score. Hu et al. (2017) proposed FastMask to efficiently generate instance segment proposals in a one-shot manner, similar to SSD (Liu et al. 2016), in order to make use of multiscale convolutional features. Sliding windows extracted densely from multiscale convolutional feature maps are input to a scale-tolerant attentional head module in order to predict segmentation masks and objectness scores. FastMask is claimed to run at 13 FPS on 800 × 600 images.
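Table 9 below reports proposal quality as Recall@IoU, i.e. the fraction of ground truth boxes covered by at least one proposal whose intersection over union exceeds a threshold, for a fixed proposal budget. The sketch below, written only for illustration (the box format, threshold and budget are assumptions, not tied to any particular benchmark code), shows how such a recall number can be computed.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def recall_at_iou(gt_boxes, proposals, thresh=0.5, budget=300):
    """Fraction of ground truth boxes matched by any of the top `budget` proposals."""
    proposals = proposals[:budget]
    hits = sum(1 for gt in gt_boxes if iou(gt, proposals).max() >= thresh)
    return hits / max(len(gt_boxes), 1)

if __name__ == "__main__":
    gt = np.array([[10, 10, 50, 50], [60, 60, 100, 120]], dtype=float)
    props = np.array([[12, 8, 48, 52], [0, 0, 30, 30], [55, 65, 105, 115]], dtype=float)
    print(recall_at_iou(gt, props, thresh=0.5, budget=3))  # 1.0 for this toy example
```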
Table 9 Summary of object proposal methods using DCNNs. Recall@IoU is reported on VOC07 at IoU thresholds 0.5, 0.7 and 0.9, and AR denotes average recall on COCO; the number of object proposals used is given in parentheses, and training data are indicated as (07), (12), (07+12), etc. The detection results on COCO are based on mAP@IoU[0.5, 0.95], unless stated otherwise.

Bounding box object proposal methods:
• MultiBox1 (Erhan et al. 2014). Backbone: AlexNet; detector tested: RCNN. Detection mAP: VOC07 29.0 (10 proposals, trained on 12). Published in CVPR14. Highlights: learns a class agnostic regressor on a small set of 800 predefined anchor boxes; does not share features for detection.
• DeepBox (Kuo et al. 2015). Backbone: VGG16; detector tested: Fast RCNN. Recall@IoU: 0.96 / 0.84 / 0.15 (1000). Detection mAP: COCO 37.8 (500), [email protected]. Published in ICCV15. Highlights: uses a lightweight CNN to learn to rerank proposals generated by EdgeBoxes; can run at 0.26 s per image; does not share features for detection.
• RPN (Ren et al. 2015, 2017). Backbone: VGG16; detector tested: Faster RCNN. Recall@IoU: 0.97 / 0.79 / 0.04 (300) and 0.98 / 0.84 / 0.04 (1000). Detection mAP: VOC07 73.2 (300, 07+12); VOC12 70.4 (300, 07++12); COCO 21.9 (300). Published in NIPS15. Highlights: the first to generate object proposals by sharing full-image convolutional features with detection; the most widely used object proposal method; significant improvements in detection speed.
• DeepProposal (Ghodrati et al. 2015). Backbone: VGG16; detector tested: Fast RCNN. Recall@IoU: 0.74 / 0.58 / 0.12 (100) and 0.92 / 0.80 / 0.16 (1000). Detection mAP: VOC07 53.2 (100, 07). Published in ICCV15. Highlights: generates proposals inside a DCNN in a multiscale manner; shares features with the detection network.
• CRAFT (Yang et al. 2016a). Backbone: VGG16; detector tested: Faster RCNN. Recall@IoU: 0.98 / 0.90 / 0.13 (300). Detection mAP: VOC07 75.7 (07+12); VOC12 71.3 (12). Published in CVPR16. Highlights: introduces a classification network (i.e. a two class Fast RCNN) cascade that comes after the RPN; does not share features extracted for detection.
• AZNet (Lu et al. 2016). Backbone: VGG16; detector tested: Fast RCNN. Recall@IoU: 0.91 / 0.71 / 0.11 (300). Detection mAP: VOC07 70.4 (07); COCO 22.3. Published in CVPR16. Highlights: uses a coarse-to-fine search, starting from large regions and recursively searching for subregions that may contain objects; adaptively guides computational resources to focus on likely subregions.
• ZIP (Li et al. 2018a). Backbone: Inception v2; detector tested: Faster RCNN. Recall@IoU (COCO): 0.85 / 0.74 / 0.35 (300). Detection mAP: VOC07 79.8 (07+12). Published in IJCV18. Highlights: generates proposals with a conv-deconv network using multiple layers; proposes a map attention decision (MAD) unit to assign weights to features from different layers.
• DeNet (TychsenSmith and Petersson 2017). Backbone: ResNet101; detector tested: Fast RCNN. Recall@IoU: 0.82 / 0.74 / 0.48 (300). Detection mAP: VOC07 77.1 (07+12); VOC12 73.9 (07++12); COCO 33.8. Published in ICCV17. Highlights: much faster than Faster RCNN; introduces bounding box corner estimation to predict object proposals efficiently, replacing RPN; does not require predefined anchors.

Segment proposal methods (box and segment proposal AR on COCO):
• DeepMask (Pinheiro et al. 2015). Backbone: VGG16; detector tested: Fast RCNN. Box AR: 0.33 (100), 0.48 (1000); segment AR: 0.26 (100), 0.37 (1000). Published in NIPS15. Highlights: first to generate object mask proposals with a DCNN; slow inference; needs segmentation annotations for training; does not share features with the detection network; achieved an mAP of 69.9% (500 proposals) with Fast RCNN.
• InstanceFCN (Dai et al. 2016a). Backbone: VGG16. Segment AR: 0.32 (100), 0.39 (1000). Published in ECCV16. Highlights: combines the ideas of FCN (Long et al. 2015) and DeepMask (Pinheiro et al. 2015); introduces instance sensitive score maps; needs segmentation annotations to train the network.
• SharpMask (Pinheiro et al. 2016). Backbone: MPN (Zagoruyko et al. 2016); detector tested: Fast RCNN. Box AR: 0.39 (100), 0.53 (1000); segment AR: 0.30 (100), 0.39 (1000). Published in ECCV16. Highlights: leverages features at multiple convolutional layers by introducing a top-down refinement module; does not share features with the detection network; needs segmentation annotations for training.
• FastMask (Hu et al. 2017). Backbone: ResNet39. Box AR: 0.43 (100), 0.57 (1000); segment AR: 0.32 (100), 0.41 (1000). Published in CVPR17. Highlights: generates instance segment proposals efficiently in a one-shot manner similar to SSD (Liu et al. 2016); uses multiscale convolutional features; uses segmentation annotations for training.

Fig. 19 Illustration of the region proposal network (RPN) introduced in Ren et al. (2015)

9 Other Issues

Data Augmentation Performing data augmentation for learning DCNNs (Chatfield et al. 2014; Girshick 2015; Girshick et al. 2014) is generally recognized to be important for visual recognition. Trivial data augmentation refers to perturbing an image by transformations that leave the underlying category unchanged, such as cropping, flipping, rotating, scaling, translating, color perturbations, and adding noise. By artificially enlarging the number of samples, data augmentation helps in reducing overfitting and improving generalization, and it can be used at training time, at test time, or both. Nevertheless, it has the obvious limitation that the time required for training increases significantly. Data augmentation may also synthesize completely new training images (Peng et al. 2015; Wang et al. 2017), however it is hard to guarantee that the synthetic images generalize well to real ones. Some researchers (Dwibedi et al. 2017; Gupta et al. 2016) proposed augmenting datasets by pasting real segmented objects into natural images; indeed, Dvornik et al. (2018) showed that appropriately modeling the visual context surrounding objects is crucial to placing them in the right environment, and proposed a context model to automatically find appropriate locations in images at which to place new objects for data augmentation.
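As a concrete illustration of "trivial" augmentation that leaves labels consistent, the sketch below horizontally flips an image and updates the box coordinates accordingly. It is a generic example, not code from any of the cited works, and assumes boxes in (x1, y1, x2, y2) pixel format.

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image (H, W, C) and its boxes (N, 4) as (x1, y1, x2, y2).

    Flipping maps a horizontal coordinate x to W - x (boxes treated as
    continuous coordinates), so the new x1 comes from the old x2 and
    vice versa; y coordinates are unchanged.
    """
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()
    new_boxes = boxes.copy().astype(float)
    new_boxes[:, 0] = w - boxes[:, 2]   # new x1 from old x2
    new_boxes[:, 2] = w - boxes[:, 0]   # new x2 from old x1
    return flipped, new_boxes

if __name__ == "__main__":
    img = np.zeros((100, 200, 3), dtype=np.uint8)
    boxes = np.array([[10, 20, 60, 80]], dtype=float)
    _, fb = hflip_with_boxes(img, boxes)
    print(fb)  # [[140. 20. 190. 80.]]
```

Random crops and scale jitter are handled analogously, clipping or discarding boxes that fall outside the transformed image region.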
Novel Training Strategies Detecting objects under a wide range of scale variations, and especially detecting very small objects, stands out as a key challenge. It has been shown (Huang et al. 2017b; Liu et al. 2016) that image resolution has a considerable impact on detection accuracy, and scaling is therefore particularly commonly used in data augmentation, since higher resolutions increase the possibility of detecting small objects (Huang et al. 2017b). Recently, Singh et al. proposed the advanced and efficient multiscale training methods SNIP (Singh and Davis 2018) and SNIPER (Singh et al. 2018b) to address the scale invariance problem, as summarized in Table 10. Motivated by the intuitive understanding that small and large objects are difficult to detect at smaller and larger scales, respectively, SNIP introduces a novel training scheme that can reduce scale variations during training, but without reducing training samples; SNIPER allows for efficient multiscale training by only processing context regions around ground truth objects at the appropriate scale, instead of processing a whole image pyramid. Peng et al. (2018) studied a key factor in training, the minibatch size, and proposed MegDet, a Large MiniBatch Object Detector, to enable training with a much larger minibatch size than before (from 16 to 256). To avoid convergence failure and to significantly speed up training, Peng et al. (2018) proposed a learning rate policy and Cross GPU Batch Normalization, and effectively utilized 128 GPUs, allowing MegDet to finish COCO training in 4 hours on 128 GPUs and winning the COCO 2017 Detection Challenge.
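The core idea of scale normalized training can be sketched in a few lines: for each image scale in a pyramid, only ground truth boxes whose resized size falls inside a valid range contribute as positives, while the rest are ignored. The code below is a simplified illustration of that selection rule with made-up ranges; it is not the reference SNIP implementation.

```python
import numpy as np

# Hypothetical valid object-size ranges (in resized pixels) for each image scale.
VALID_RANGES = {0.5: (120, 10000), 1.0: (40, 160), 2.0: (0, 80)}

def select_positives(gt_boxes, scale):
    """Return the ground truth boxes usable as positives at a given image scale.

    A box participates in training only if its geometric mean side length,
    after resizing the image by `scale`, lies in the valid range for that scale.
    """
    lo, hi = VALID_RANGES[scale]
    keep = []
    for x1, y1, x2, y2 in gt_boxes:
        size = np.sqrt((x2 - x1) * (y2 - y1)) * scale
        if lo <= size <= hi:
            keep.append([x1, y1, x2, y2])
    return np.array(keep).reshape(-1, 4)

if __name__ == "__main__":
    gts = np.array([[0, 0, 30, 30], [0, 0, 300, 300]], dtype=float)
    for s in (0.5, 1.0, 2.0):
        # The small box is trained only at the 2.0 scale, the large box only at 0.5.
        print(s, len(select_positives(gts, s)))
```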
Table 10 Representative methods for training strategies and class imbalance handling. Results on COCO are reported on Test-Dev and are based on mAP@IoU[0.5, 0.95].
• MegDet (Peng et al. 2018). Region proposal: RPN; backbone: ResNet50+FPN; pipeline: Faster RCNN; COCO 52.5. Published in CVPR18. Highlights: allows training with a much larger minibatch size than before by introducing cross GPU batch normalization; can finish COCO training in 4 hours on 128 GPUs with improved accuracy; won the COCO 2017 detection challenge.
• SNIP (Singh and Davis 2018). Region proposal: RPN; backbone: DPN (Chen et al. 2017b) + DCN (Dai et al. 2017); pipeline: RFCN; COCO 48.3. Published in CVPR18. Highlights: a new multiscale training scheme; empirically examined the effect of up-sampling for small object detection; during training, only objects that fit the scale of the features are selected as positive samples.
• SNIPER (Singh et al. 2018b). Region proposal: RPN; backbone: ResNet101+DCN; pipeline: Faster RCNN; COCO 47.6. Published in 2018. Highlights: an efficient multiscale training strategy; processes context regions around ground-truth instances at the appropriate scale.
• OHEM (Shrivastava et al. 2016). Region proposal: SS; backbone: VGG16; pipeline: Fast RCNN; VOC07 78.9 (07+12); VOC12 76.3 (07++12); COCO 22.4. Published in CVPR16. Highlights: a simple and effective Online Hard Example Mining algorithm to improve the training of region based detectors.
• FactorNet (Ouyang et al. 2016). Region proposal: SS; backbone: GoogLeNet; pipeline: RCNN. Published in CVPR16. Highlights: identifies the imbalance in the number of samples for different object categories; proposes a divide-and-conquer feature learning scheme.
• Chained Cascade. Region proposal: SS, CRAFT; backbone: VGG, Inception v2; pipeline: Fast RCNN, Faster RCNN; VOC07 80.4 (07+12, SS+VGG). Published in ICCV17. Highlights: jointly learns the DCNN and multiple stages of cascaded classifiers; boosts detection accuracy on PASCAL VOC 2007 and ImageNet for both Fast RCNN and Faster RCNN using different region proposal methods.
• Cascade RCNN (Cai and Vasconcelos 2018). Region proposal: RPN; backbone: VGG, ResNet101+FPN; pipeline: Faster RCNN; COCO 42.8. Published in CVPR18. Highlights: jointly learns the DCNN and multiple stages of cascaded classifiers, which are trained with increasing localization accuracy requirements for selecting positive samples; stacks bounding box regression at multiple stages.
• RetinaNet (Lin et al. 2017b). Backbone: ResNet101+FPN; pipeline: RetinaNet; COCO 39.1. Published in ICCV17. Highlights: proposes a novel Focal Loss which focuses training on hard examples; handles well the imbalance of positive and negative samples when training a one-stage detector.

Reducing Localization Error In object detection, the Intersection Over Union (IOU) between a detected bounding box and its ground truth box is the most popular evaluation metric, and an IOU threshold (e.g. a typical value of 0.5) is required to define positives and negatives (please refer to Sect. 4.2 for more details on the definition of IOU). From Fig. 13, in most state of the art detectors (Girshick 2015; Liu et al. 2016; He et al. 2017; Ren et al. 2015; Redmon et al. 2016) object detection is formulated as a multitask learning problem, i.e., jointly optimizing a softmax classifier, which assigns class labels to object proposals, and bounding box regressors, which localize objects by maximizing IOU or other metrics between detection results and ground truth. Bounding boxes are only a crude approximation for articulated objects, and consequently background pixels are almost invariably included in a bounding box, which affects the accuracy of classification and localization. The study in Hoiem et al. (2012) shows that object localization error is one of the most influential forms of error, in addition to confusion between similar objects. Localization error can stem from insufficient overlap (smaller than the required IOU threshold, such as the green box in Fig. 20) or from duplicate detections (i.e., multiple overlapping detections for an object instance). Usually, some post-processing step like NonMaximum Suppression (NMS) (Bodla et al. 2017; Hosang et al. 2017) is used to eliminate duplicate detections. However, due to misalignments, the bounding box with better localization could be suppressed during NMS, leading to poorer localization quality (such as the purple box shown in Fig. 20). Therefore, quite a few methods aim at improving detection performance by reducing localization error.

Fig. 20 Localization error could stem from insufficient overlap or duplicate detections. Localization error is a frequent cause of false positives (Color figure online)

MRCNN (Gidaris and Komodakis 2015) introduces iterative bounding box regression, where an RCNN is applied several times. CRAFT (Yang et al. 2016a) and AttractioNet (Gidaris and Komodakis 2016) use a multi-stage detection sub-network to generate accurate proposals, which are forwarded to Fast RCNN. Cai and Vasconcelos (2018) proposed Cascade RCNN, a multistage extension of RCNN, in which a sequence of detectors is trained sequentially with increasing IOU thresholds, based on the observation that the output of a detector trained with a certain IOU threshold is a good distribution from which to train the detector for the next, higher IOU threshold, making the detectors sequentially more selective against close false positives. This approach can be built on any RCNN-based detector, and is demonstrated to achieve consistent gains (about 2 to 4 points) independent of the baseline detector strength, at a marginal increase in computation. There is also recent work (Jiang et al. 2018; Rezatofighi et al. 2019; Huang et al. 2019) formulating IOU directly as the optimization objective, as well as work proposing improved NMS (Bodla et al. 2017; He et al. 2019; Hosang et al. 2017; TychsenSmith and Petersson 2018), such as Soft NMS (Bodla et al. 2017) and learning NMS (Hosang et al. 2017).
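To make the NMS discussion concrete, the sketch below implements greedy NMS together with the linear score-decay variant described in Soft NMS (Bodla et al. 2017). This is a simplified illustration rather than the authors' released code, and the IoU helper and thresholds are assumed for the example.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5, soft=False, score_thresh=0.001):
    """Greedy NMS; with soft=True, overlapping boxes are down-weighted
    (linear Soft NMS) instead of being removed outright."""
    boxes, scores = boxes.astype(float), scores.astype(float).copy()
    keep, idx = [], list(range(len(boxes)))
    while idx:
        best = max(idx, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break
        keep.append(best)
        idx.remove(best)
        if not idx:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[idx])
        if soft:
            scores[idx] *= np.where(overlaps > iou_thresh, 1.0 - overlaps, 1.0)
        else:
            idx = [i for i, o in zip(idx, overlaps) if o <= iou_thresh]
    return keep

if __name__ == "__main__":
    b = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
    s = np.array([0.9, 0.8, 0.7])
    print(nms(b, s))              # hard NMS keeps boxes 0 and 2
    print(nms(b, s, soft=True))   # soft NMS keeps all three, box 1 down-weighted
```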
Class Imbalance Handling Unlike image classification, object detection has another unique problem: the serious imbalance between the number of labeled object instances and the number of background examples (image regions not belonging to any object class of interest). Most background examples are easy negatives, but this imbalance can make training very inefficient, and the large number of easy negatives tends to overwhelm the training. In the past, this issue was typically addressed via techniques such as bootstrapping (Sung and Poggio 1994). More recently, the problem has again received attention (Li et al. 2019a; Lin et al. 2017b; Shrivastava et al. 2016). Because the region proposal stage rapidly filters out most background regions and proposes a small number of object candidates, this class imbalance is mitigated to some extent in two-stage detectors (Girshick et al. 2014; Girshick 2015; Ren et al. 2015; He et al. 2017), although example mining approaches, such as Online Hard Example Mining (OHEM) (Shrivastava et al. 2016), may be used to maintain a reasonable balance between foreground and background. In the case of one-stage object detectors (Redmon et al. 2016; Liu et al. 2016), the imbalance is extremely serious (e.g. 100,000 background examples to every object). Lin et al. (2017b) proposed Focal Loss to address this by rectifying the Cross Entropy loss, such that it down-weights the loss assigned to correctly classified examples. Li et al. (2019a) studied the issue from the perspective of the gradient norm distribution, and proposed a Gradient Harmonizing Mechanism (GHM) to handle it.
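The effect of Focal Loss is easiest to see in code: relative to binary cross entropy, each example's loss is scaled by (1 − p_t)^γ, so well classified (easy) examples contribute almost nothing. The sketch below is a minimal binary-classification illustration with the commonly quoted defaults γ = 2 and α = 0.25 taken as assumed parameter values; it is not the RetinaNet reference implementation.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss, averaged over examples.

    probs  : predicted foreground probabilities in (0, 1)
    labels : 1 for foreground (object), 0 for background
    The modulating factor (1 - p_t)^gamma shrinks the loss of well classified
    examples; alpha re-weights foreground versus background terms.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    return loss.mean()

if __name__ == "__main__":
    # One hard positive and many easy negatives: with focal loss, the easy
    # negatives are strongly down-weighted compared with plain cross entropy.
    probs = np.array([0.3] + [0.01] * 1000)
    labels = np.array([1] + [0] * 1000)
    ce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()
    print("cross entropy:", round(ce, 5), "focal:", round(focal_loss(probs, labels), 5))
```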
10 Discussion and Conclusion

Generic object detection is an important and challenging problem in computer vision and has received considerable attention. Thanks to remarkable developments in deep learning techniques, the field of object detection has dramatically evolved. As a comprehensive survey on deep learning for generic object detection, this paper has highlighted the recent achievements, provided a structural taxonomy of methods according to their roles in detection, summarized existing popular datasets and evaluation criteria, and discussed performance for the most representative methods. We conclude this review with a discussion of the state of the art in Sect. 10.1, an overall discussion of key issues in Sect. 10.2, and suggested future research directions in Sect. 10.3.

10.1 State of the Art Performance

A large variety of detectors has appeared in the last few years, and the introduction of standard benchmarks, such as PASCAL VOC (Everingham et al. 2010, 2015), ImageNet (Russakovsky et al. 2015) and COCO (Lin et al. 2014), has made it easier to compare detectors. As can be seen from our earlier discussion in Sects. 5–9, however, it may be misleading to compare detectors in terms of their originally reported performance (e.g. accuracy, speed), as they can differ in fundamental/contextual respects, including the following choices:

• Meta detection frameworks, such as RCNN (Girshick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), RFCN (Dai et al. 2016c), Mask RCNN (He et al. 2017), YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016);
• Backbone networks such as VGG (Simonyan and Zisserman 2015), Inception (Szegedy et al. 2015; Ioffe and Szegedy 2015; Szegedy et al. 2016), ResNet (He et al. 2016), ResNeXt (Xie et al. 2017), and Xception (Chollet 2017), as listed in Table 6;
• Innovations such as multilayer feature combination (Lin et al. 2017a; Shrivastava et al. 2017; Fu et al. 2017), deformable convolutional networks (Dai et al. 2017), deformable RoI pooling (Ouyang et al. 2015; Dai et al. 2017), heavier heads (Ren et al. 2016; Peng et al. 2018), and lighter heads (Li et al. 2018c);
• Pretraining with datasets such as ImageNet (Russakovsky et al. 2015), COCO (Lin et al. 2014), Places (Zhou et al. 2017a), JFT (Hinton et al. 2015) and Open Images (Krasin et al. 2017);
• Different detection proposal methods and different numbers of object proposals;
• Train/test data augmentation, novel multiscale training strategies (Singh and Davis 2018; Singh et al. 2018b), etc., and model ensembling.

Although it may be impractical to compare every recently proposed detector, it is nevertheless valuable to integrate representative and publicly available detectors into a common platform and to compare them in a unified manner. There has been very limited work in this regard, except for Huang's study (Huang et al. 2017b) of the three main families of detectors [Faster RCNN (Ren et al. 2015), RFCN (Dai et al. 2016c) and SSD (Liu et al. 2016)], varying the backbone network, image resolution, and the number of box proposals.

As can be seen from Tables 7, 8, 9, 10 and 11, we have summarized the best reported performance of many methods on three widely used standard benchmarks. The results of these methods were reported on the same test benchmarks, despite their differing in one or more of the aspects listed above.

Figures 3 and 21 present a very brief overview of the state of the art, summarizing the best detection results of the PASCAL VOC, ILSVRC and MS COCO challenges; more results can be found at the detection challenge websites (ILSVRC 2018; MS COCO 2018; PASCAL VOC 2018). The winner of the Open Images challenge object detection task achieved 61.71% mAP on the public leaderboard and 58.66% mAP on the private leaderboard, obtained by combining the detection results of several two-stage detectors including Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), FPN (Lin et al. 2017a), Deformable RCNN (Dai et al. 2017), and Cascade RCNN (Cai and Vasconcelos 2018). In summary, the backbone network, the detection framework, and the availability of large scale datasets are the three most important factors in detection accuracy. Ensembles of multiple models, the incorporation of context features, and data augmentation all help to achieve better accuracy.

In less than 5 years since AlexNet (Krizhevsky et al. 2012a) was proposed, the Top5 error on ImageNet classification (Russakovsky et al. 2015) with 1000 classes has dropped from 16% to 2%, as shown in Fig. 15. However, the mAP of the best performing detector (Peng et al. 2018) on COCO (Lin et al. 2014), trained to detect only 80 classes, is only around 73%, even at 0.5 IoU, illustrating how much harder object detection is than image classification. The accuracy and robustness achieved by state-of-the-art detectors are far from satisfying the requirements of real world applications, so there remains significant room for future improvement.
10.2 Summary and Discussion

With hundreds of references and many dozens of methods discussed throughout this paper, we would now like to focus on the key factors which have emerged in generic object detection based on deep learning.

(1) Detection frameworks: two stage versus one stage

In Sect. 5 we identified two major categories of detection frameworks: region based (two stage) and unified (one stage):

• When a large computational cost is allowed, two-stage detectors generally produce higher detection accuracy than one-stage ones, evidenced by the fact that most winning approaches in famous detection challenges are predominantly based on two-stage frameworks, whose structure is more flexible and better suited for region based classification. The most widely used frameworks are Faster RCNN (Ren et al. 2015), RFCN (Dai et al. 2016c) and Mask RCNN (He et al. 2017).
• It has been shown in Huang et al. (2017b) that the detection accuracy of the one-stage SSD (Liu et al. 2016) is less sensitive to the quality of the backbone network than that of representative two-stage frameworks.
• One-stage detectors like YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016) are generally faster than two-stage ones, because they avoid preprocessing algorithms, use lightweight backbone networks, perform prediction with fewer candidate regions, and make the classification subnetwork fully convolutional. However, two-stage detectors can run in real time with the introduction of similar techniques. In any event, whether one stage or two, the most time consuming step is the feature extractor (backbone network) (Law and Deng 2018; Ren et al. 2015).
• It has been shown (Huang et al. 2017b; Redmon et al. 2016; Liu et al. 2016) that one-stage frameworks like YOLO and SSD typically have much poorer performance than two-stage architectures like Faster RCNN and RFCN when detecting small objects, but are competitive in detecting large objects.

There have been many attempts to build better (faster, more accurate, or more robust) detectors by attacking each stage of the detection framework. No matter whether one, two or multiple stages, the design of the detection framework has converged towards a number of crucial design choices:

• A fully convolutional pipeline;
• Exploring complementary information from other correlated tasks, e.g., Mask RCNN (He et al. 2017);
• Sliding windows (Ren et al. 2015);
• Fusing information from different layers of the backbone.

The evidence from the recent success of cascades for object detection (Cai and Vasconcelos 2018; Cheng et al. 2018a, b) and instance segmentation on COCO (Chen et al. 2019a) and other challenges has shown that multistage object detection could be a future framework for a speed-accuracy trade-off. A teaser investigation is being done in the 2019 WIDER Challenge (Loy et al. 2019).

Table 11 Summary of properties and performance of milestone detection frameworks for generic object detection. See Sect. 5 for a detailed discussion; some architectures are illustrated in Fig. 13, and the properties of the backbone DCNNs can be found in Table 6. Training data: "07" = VOC2007 trainval; "07T" = VOC2007 trainval and test; "12" = VOC2012 trainval; "CO" = COCO trainval. The speed values roughly estimate the detection speed with a single Nvidia Titan X GPU. RP = region proposal; SS = selective search; RPN = region proposal network; "RCNN minus R" uses a trivial RP method.

Region based frameworks (Sect. 5.1):
• RCNN (Girshick et al. 2014). RP: SS; backbone: AlexNet; input size: fixed; VOC07 58.5 (07); VOC12 53.3 (12); speed < 0.1 FPS; CVPR14; source code: Caffe, Matlab. Highlights: first to integrate CNNs with RP methods; dramatic performance improvement over the previous state of the art. Disadvantages: multistage pipeline of sequentially trained components (external RP computation, CNN finetuning, each warped RP passing through the CNN, SVM and BBR training); training is expensive in space and time; testing is slow.
• SPPNet (He et al. 2014). RP: SS; backbone: ZFNet; input size: arbitrary; VOC07 60.9 (07); speed < 1 FPS; ECCV14; source code: Caffe, Matlab. Highlights: first to introduce SPP into a CNN architecture; enables convolutional feature sharing; accelerates RCNN evaluation by orders of magnitude without sacrificing performance; faster than OverFeat. Disadvantages: inherits the disadvantages of RCNN; does not result in much training speedup; fine-tuning cannot update the CONV layers before the SPP layer.
• Fast RCNN (Girshick 2015). RP: SS; backbone: AlexNet, VGGM, VGG16; input size: arbitrary; VOC07 70.0 (VGG, 07+12); VOC12 68.4 (VGG, 07++12); speed < 1 FPS; ICCV15; source code: Caffe, Python. Highlights: first to enable end-to-end detector training (ignoring RP generation); designs a RoI pooling layer; much faster and more accurate than SPPNet; no disk storage required for feature caching. Disadvantages: external RP computation is exposed as the new bottleneck; still too slow for real time applications.
• Faster RCNN (Ren et al. 2015). RP: RPN; backbone: ZFNet, VGG; input size: arbitrary; VOC07 73.2 (VGG, 07+12); VOC12 70.4 (VGG, 07++12); speed < 5 FPS; NIPS15; source code: Caffe, Matlab, Python. Highlights: proposes RPN for generating nearly cost-free and high quality RPs instead of selective search; introduces translation invariant and multiscale anchor boxes as references in RPN; unifies RPN and Fast RCNN into a single network by sharing CONV layers; an order of magnitude faster than Fast RCNN without performance loss; can run testing at 5 FPS with VGG16. Disadvantages: training is complex, not a streamlined process; still falls short of real time.
• RCNN minus R (Lenc and Vedaldi 2015). RP: new (static); backbone: ZFNet+SPP; input size: arbitrary; VOC07 59.7 (07); speed < 5 FPS; BMVC15. Highlights: replaces selective search with static RPs; demonstrates the possibility of building integrated, simpler and faster detectors that rely exclusively on CNNs. Disadvantages: falls short of real time; decreased accuracy from poor RPs.
• RFCN (Dai et al. 2016c). RP: RPN; backbone: ResNet101; input size: arbitrary; VOC07 80.5 (07+12), 83.6 (07+12+CO); VOC12 77.6 (07++12), 82.0 (07++12+CO); speed < 10 FPS; NIPS16; source code: Caffe, Matlab. Highlights: fully convolutional detection network; designs a set of position sensitive score maps using a bank of specialized CONV layers; faster than Faster RCNN without sacrificing much accuracy. Disadvantages: training is not a streamlined process; still falls short of real time.
• Mask RCNN (He et al. 2017). RP: RPN; backbone: ResNet101, ResNeXt101; input size: arbitrary; COCO result 50.3 (ResNeXt101); speed < 5 FPS; ICCV17; source code: Caffe, Matlab, Python. Highlights: a simple, flexible, and effective framework for object instance segmentation; extends Faster RCNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box prediction; utilizes the Feature Pyramid Network (FPN); outstanding performance. Disadvantages: falls short of real time applications.

Unified frameworks (Sect. 5.2):
• OverFeat (Sermanet et al. 2014). Backbone: AlexNet-like; input size: arbitrary; speed < 0.1 FPS; ICLR14; source code: C++. Highlights: convolutional feature sharing; multiscale image pyramid CNN feature extraction; won the ILSVRC2013 localization competition; significantly faster than RCNN. Disadvantages: multi-stage, sequentially trained pipeline; single bounding box regressor; cannot handle multiple object instances of the same class; too slow for real time applications.
• YOLO (Redmon et al. 2016). Backbone: GoogLeNet-like; input size: fixed; VOC07 66.4 (07+12); VOC12 57.9 (07++12); speed < 25 FPS (VGG); CVPR16; source code: DarkNet. Highlights: first efficient unified detector; drops the RP process completely; elegant and efficient detection framework; significantly faster than previous detectors; YOLO runs at 45 FPS and Fast YOLO at 155 FPS. Disadvantages: accuracy falls far behind state of the art detectors; struggles to localize small objects.
• YOLOv2 (Redmon and Farhadi 2017). Backbone: DarkNet; input size: fixed; VOC07 78.6 (07+12); VOC12 73.5 (07++12); speed < 50 FPS; CVPR17; source code: DarkNet. Highlights: proposes the faster DarkNet19; uses a number of existing strategies to improve both speed and accuracy; achieves high accuracy and high speed; YOLO9000 can detect over 9000 object categories in real time. Disadvantages: not good at detecting small objects.
• SSD (Liu et al. 2016). Backbone: VGG16; input size: fixed; VOC07 76.8 (07+12), 81.5 (07+12+CO); VOC12 74.9 (07++12), 80.0 (07++12+CO); speed < 60 FPS; ECCV16; source code: Caffe, Python. Highlights: first accurate and efficient unified detector; effectively combines ideas from RPN and YOLO to perform detection at multiscale CONV layers; faster and significantly more accurate than YOLO; can run at 59 FPS. Disadvantages: not good at detecting small objects.

(2) Backbone networks

Fig. 21 Evolution of object detection performance on COCO (Test-Dev results). Results are quoted from Girshick (2015), He et al. (2017) and Ren et al. (2017). The backbone network, the design of the detection framework and the availability of good, large scale datasets are the three most important factors in detection accuracy

As discussed in Sect. 6.1, backbone networks are one of the main driving forces behind the rapid improvement of detection performance, because of the key role played by discriminative object feature representation. Generally, deeper backbones such as ResNet (He et al. 2016), ResNeXt (Xie et al. 2017) and InceptionResNet (Szegedy et al. 2017) perform better; however, they are computationally more expensive and require much more data and massive computation for training. Some backbones (Howard et al. 2017; Iandola et al. 2016; Zhang et al. 2018c) were instead proposed with a focus on speed, such as MobileNet (Howard et al. 2017), which has been shown to achieve VGGNet16 accuracy on ImageNet with only a small fraction of the computational cost and model size. Backbone training from scratch may become possible as more training data and better training strategies become available (Wu and He 2018; Luo et al. 2018, 2019).

(3) Improving the robustness of object representation

The variation of real world images is a key challenge in object recognition. The variations include lighting, pose, deformations, background clutter, occlusions, blur, resolution, noise, and camera distortions.

(3.1) Object scale and small object size

Large variations of object scale, particularly those of small objects, pose a great challenge. Here we summarize and discuss the main strategies identified in Sect. 6.2:

• Using image pyramids: they are simple and effective, helping to enlarge small objects and to shrink large ones. They are computationally expensive, but are nevertheless commonly used during inference for better accuracy.
• Using features from convolutional layers of different resolutions: in early work like SSD (Liu et al. 2016), predictions are performed independently, and no information from other layers is combined or merged. It is now quite standard to combine features from different layers, e.g. in FPN (Lin et al. 2017a); a minimal sketch of such a top-down fusion is given after this list.
• Using dilated convolutions (Li et al. 2018b, 2019b): a simple and effective way to incorporate broader context while maintaining high resolution feature maps.
• Using anchor boxes of different scales and aspect ratios: these have the drawback of introducing many parameters, and the scales and aspect ratios of the anchor boxes are usually determined heuristically.
• Up-scaling: particularly for the detection of small objects, high-resolution networks (Sun et al. 2019a, b) can be developed. It remains unclear whether super-resolution techniques improve detection accuracy or not.
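The following sketch illustrates the second strategy above in its simplest form: a coarse, semantically strong feature map is upsampled and added to a laterally projected finer map, in the spirit of FPN-style top-down fusion. It is a toy NumPy illustration in which random projection matrices stand in for learned 1 × 1 convolutions; it is not an excerpt from any detector's code.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def lateral_project(feat, weight):
    """1x1 convolution as a per-pixel linear map: (C_in, H, W) -> (C_out, H, W)."""
    c, h, w = feat.shape
    return (weight @ feat.reshape(c, h * w)).reshape(-1, h, w)

def top_down_merge(fine, coarse, w_fine, w_coarse):
    """Project both maps to a common channel width, upsample the coarse one, add."""
    return lateral_project(fine, w_fine) + upsample2x(lateral_project(coarse, w_coarse))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c4 = rng.standard_normal((256, 16, 16))   # finer map (e.g. stride 16)
    c5 = rng.standard_normal((512, 8, 8))     # coarser map (e.g. stride 32)
    w4 = rng.standard_normal((128, 256)) * 0.01
    w5 = rng.standard_normal((128, 512)) * 0.01
    p4 = top_down_merge(c4, c5, w4, w5)
    print(p4.shape)  # (128, 16, 16): a fused map at the finer resolution
```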
There- fore, the detection of small objects remains one of the key (3.1) Object scale and small object size challenges in object detection. Perhaps localization require- ments need to be generalized as a function of scale, since Large variations of object scale, particularly those of small certain applications, e.g. autonomous driving, only require objects, pose a great challenge. Here a summary and discus- the identification of the existence of small objects within a sion on the main strategies identified in Sect. 6.2: larger region, and exact localization is not necessary. • Using image pyramids: They are simple and effective, (3.2) Deformation, occlusion, and other factors helping to enlarge small objects and to shrink large ones. They are computationally expensive, but are nevertheless commonly used during inference for better accuracy. As discussed in Sect. 2.2, there are approaches to han- • Using features from convolutional layers of different dling geometric transformation, occlusions, and deformation resolutions: In early work like SSD (Liu et al. 2016), mainly based on two paradigms. The first is a spatial predictions are performed independently, and no infor- transformer network, which uses regression to obtain a mation from other layers is combined or merged. Now deformation field and then warp features according to the it is quite standard to combine features from different deformation field (Dai et al. 2017). The second is based on layers, e.g. in FPN (Lin et al. 2017a). a deformable part-based model (Felzenszwalb et al. 2010b), 123 310 International Journal of Computer Vision (2020) 128:261–318 which finds the maximum response to a part filter with spa- related tasks, methods for reducing localization error, han- tial constraints taken into consideration (Ouyang et al. 2015; dling the huge imbalance between positive and negative Girshick et al. 2015; Wan et al. 2015). samples, mining of hard negative samples, and improving Rotation invariance may be attractive in certain applica- loss functions. tions, but there are limited generic object detection work focusing on rotation invariance, because popular benchmark 10.3 Research Directions detection datasets (PASCAL VOC, ImageNet, COCO) do not have large variations in rotation. Occlusion handling is inten- Despite the recent tremendous progress in the field of object sively studied in face detection and pedestrian detection, but detection, the technology remains significantly more primi- very little work has been devoted to occlusion handling for tive than human vision and cannot yet satisfactorily address generic object detection. In general, despite recent advances, real-world challenges like those of Sect. 2.2. We see a number deep networks are still limited by the lack of robustness to of long-standing challenges: a number of variations, which significantly constrains their real-world applications. • Working in an open world: being robust to any number of environmental changes, being able to evolve or adapt. (4) Context reasoning • Object detection under constrained conditions: learning from weakly labeled data or few bounding box annota- As introduced in Sect. 7, objects in the wild typically coexist tions, wearable devices, unseen object categories etc. with other objects and environments. 
It has been recog- • Object detection in other modalities: video, RGBD nized that contextual information (object relations, global images, 3D point clouds, lidar, remotely sensed imagery scene statistics) helps object detection and recognition (Oliva etc. and Torralba 2007), especially for small objects, occluded objects, and with poor image quality. There was extensive work preceding deep learning (Malisiewicz and Efros 2009; Based on these challenges, we see the following directions Murphy et al. 2003; Rabinovich et al. 2007; Divvala et al. of future research: 2009; Galleguillos and Belongie 2010), and also quite a few (1) Open World Learning The ultimate goal is to develop works in the era of deep learning (Gidaris and Komodakis object detection capable of accurately and efficiently recog- 2015; Zeng et al. 2016, 2017; Chen and Gupta 2017;Huetal. nizing and localizing instances in thousands or more object 2018a). How to efficiently and effectively incorporate con- categories in open-world scenes, at a level competitive with textual information remains to be explored, possibly guided the human visual system. Object detection algorithms are by how human vision uses context, based on scene graphs unable, in general, to recognize object categories outside of (Li et al. 2017d), or via the full segmentation of objects and their training dataset, although ideally there should be the scenes using panoptic segmentation (Kirillov et al. 2018). ability to recognize novel object categories (Lake et al. 2015; Hariharan and Girshick 2017). Current detection datasets (5) Detection proposals (Everingham et al. 2010; Russakovsky et al. 2015; Lin et al. 2014) contain only a few dozen to hundreds of categories, Detection proposals significantly reduce search spaces. As significantly fewer than those which can be recognized by recommended in Hosang et al. (2016), future detection pro- humans. New larger-scale datasets (Hoffman et al. 2014; posals will surely have to improve in repeatability, recall, Singh et al. 2018a; Redmon and Farhadi 2017) with signifi- localization accuracy, and speed. Since the success of RPN cantly more categories will need to be developed. (Ren et al. 2015), which integrated proposal generation and (2) Better and More Efficient Detection Frameworks One detection into a common framework, CNN based detection of the reasons for the success in generic object detection has proposal generation methods have dominated region pro- been the development of superior detection frameworks, both posal. It is recommended that new detection proposals should region-based [RCNN (Girshick et al. 2014), Fast RCNN (Gir- be assessed for object detection, instead of evaluating detec- shick 2015), Faster RCNN (Ren et al. 2015), Mask RCNN tion proposals alone. (He et al. 2017)] and one-stage detectors [YOLO (Redmon et al. 2016), SSD (Liu et al. 2016)]. Region-based detectors (6) Other factors have higher accuracy, one-stage detectors are generally faster and simpler. Object detectors depend heavily on the under- As discussed in Sect. 9, there are many other factors affecting lying backbone networks, which have been optimized for object detection quality: data augmentation, novel train- image classification, possibly causing a learning bias; learn- ing strategies, combinations of backbone models, multiple ing object detectors from scratch could be helpful for new detection frameworks, incorporating information from other detection frameworks. 
(3) Compact and Efficient CNN Features CNNs have increased remarkably in depth, from several layers [AlexNet (Krizhevsky et al. 2012b)] to hundreds of layers [ResNet (He et al. 2016), DenseNet (Huang et al. 2017a)]. These networks have millions to hundreds of millions of parameters, requiring massive data and GPUs for training. In order to reduce or remove network redundancy, there has been growing research interest in designing compact and lightweight networks (Chen et al. 2017a; Alvarez and Salzmann 2016; Huang et al. 2018; Howard et al. 2017; Lin et al. 2017c; Yu et al. 2018) and in network acceleration (Cheng et al. 2018c; Hubara et al. 2016; Han et al. 2016; Li et al. 2017a, c; Wei et al. 2018).

(4) Automatic Neural Architecture Search Deep learning bypasses manual feature engineering, which requires human experts with strong domain knowledge, yet DCNN architecture design itself requires similarly significant expertise. It is therefore natural to consider the automated design of detection backbone architectures, such as via the recent Automated Machine Learning (AutoML) (Quanming et al. 2018), which has been applied to image classification and object detection (Cai et al. 2018; Chen et al. 2019c; Ghiasi et al. 2019; Liu et al. 2018a; Zoph and Le 2016; Zoph et al. 2018).

(5) Object Instance Segmentation For a richer and more detailed understanding of image content, there is a need to tackle pixel-level object instance segmentation (Lin et al. 2014; He et al. 2017; Hu et al. 2018c), which can play an important role in potential applications that require the precise boundaries of individual objects.

(6) Weakly Supervised Detection Current state-of-the-art detectors employ fully supervised models learned from labeled data with object bounding boxes or segmentation masks.

(7) Few / Zero Shot Object Detection Even more constrained than few shot detection, zero shot object detection localizes and recognizes object classes that have never been seen before (Bansal et al. 2018; Demirel et al. 2018; Rahman et al. 2018b, a), which is essential for life-long learning machines that need to intelligently and incrementally discover new object categories.

(8) Object Detection in Other Modalities Most detectors are based on still 2D images; object detection in other modalities can be highly relevant in domains such as autonomous vehicles, unmanned aerial vehicles, and robotics. These modalities raise new challenges in effectively using depth (Chen et al. 2015c; Pepik et al. 2015; Xiang et al. 2014; Wu et al. 2015), video (Feichtenhofer et al. 2017; Kang et al. 2016), and point clouds (Qi et al. 2017, 2018).

(9) Universal Object Detection Recently, there has been increasing effort in learning universal representations, those which are effective across multiple image domains, such as natural images, videos, aerial images, and medical CT images (Rebuffi et al. 2017, 2018). Most such research focuses on image classification, rarely targeting object detection (Wang et al. 2019), and the detectors developed so far are usually domain specific. Object detection independent of image domain, and cross-domain object detection, represent important future directions.

The research field of generic object detection is still far from complete. However, given the breakthroughs over the past 5 years, we are optimistic about future developments and opportunities.

Acknowledgements Open access funding provided by University of Oulu including Oulu University Hospital. The authors would like to thank the pioneering researchers in generic object detection and other related fields.
The authors would also like to express their sincere appre- masks (Everingham et al. 2015; Lin et al. 2014; Russakovsky ciation to Professor Jiˇrí Matas, the associate editor and the anonymous reviewers for their comments and suggestions. This work has been sup- et al. 2015; Lin et al. 2014). However, fully supervised learn- ported by the Center for Machine Vision and Signal Analysis at the ing has serious limitations, particularly where the collection University of Oulu (Finland) and the National Natural Science Foun- of bounding box annotations is labor intensive and where the dation of China under Grant 61872379. number of images is large. Fully supervised learning is not Open Access This article is distributed under the terms of the Creative scalable in the absence of fully labeled training data, so it Commons Attribution 4.0 International License (http://creativecomm is essential to understand how the power of CNNs can be ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, leveraged where only weakly / partially annotated data are and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative provided (Bilen and Vedaldi 2016; Diba et al. 2017; Shi et al. Commons license, and indicate if changes were made. 2017). (7) Few / Zero Shot Object Detection The success of deep detectors relies heavily on gargantuan amounts of annotated References training data. When the labeled data are scarce, the perfor- mance of deep detectors frequently deteriorates and fails Agrawal, P., Girshick, R., & Malik, J. (2014). Analyzing the perfor- to generalize well. In contrast, humans (even children) can mance of multilayer neural networks for object recognition. In learn a visual concept quickly from very few given exam- ECCV (pp. 329–344). ples and can often generalize well (Biederman 1987b;Lake Alexe, B., Deselaers, T., & Ferrari, V. (2010). What is an object? In CVPR (pp. 73–80). et al. 2015; FeiFei et al. 2006). Therefore, the ability to learn from only few examples, few shot detection, is very appealing (Chen et al. 2018a; Dong et al. 2018; Finn et al. 2017; Kang Although side information may be provided, such as a wikipedia et al. 2018;Lakeetal. 2015; Ren et al. 2018; Schwartz et al. page or an attributes vector. 123 312 International Journal of Computer Vision (2020) 128:261–318 Alexe, B., Deselaers, T., & Ferrari, V. (2012). Measuring the objectness Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, of image windows. IEEE TPAMI, 34(11), 2189–2202. Z., Shi, J., Ouyang, W., et al. (2019a). Hybrid task cascade for Alvarez, J., & Salzmann, M. (2016). Learning the number of neurons instance segmentation. In CVPR. in deep networks. In NIPS (pp. 2270–2278). Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. Andreopoulos, A., & Tsotsos, J. (2013). 50 years of object recognition: (2015a), Semantic image segmentation with deep convolutional Directions forward. Computer Vision and Image Understanding, nets and fully connected CRFs. In ICLR. 117(8), 827–891. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. Arbeláez, P., Hariharan, B., Gu, C., Gupta, S., Bourdev, L., & Malik, J. (2018b). DeepLab: Semantic image segmentation with deep con- (2012). Semantic segmentation using regions and parts. In CVPR volutional nets, atrous convolution, and fully connected CRFs. (pp. 3378–3385). IEEE TPAMI, 40(4), 834–848. 
Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Chen, Q., Song, Z., Dong, J., Huang, Z., Hua, Y., & Yan, S. (2015b). Multiscale combinatorial grouping. In CVPR (pp. 328–335). Contextualizing object detection and classification. IEEE TPAMI, Azizpour, H., Razavian, A., Sullivan, J., Maki, A., & Carlsson, S. 37(1), 13–27. (2016). Factors of transferability for a generic convnet represen- Chen, X., & Gupta, A. (2017). Spatial memory for context reasoning in tation. IEEE TPAMI, 38(9), 1790–1802. object detection. In ICCV. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., & Divakaran, A. Chen, X., Kundu, K., Zhu, Y., Berneshawi, A. G., Ma, H., Fidler, S., & (2018). Zero shot object detection. In ECCV. Urtasun, R. (2015c) 3d object proposals for accurate object class Bar, M. (2004). Visual objects in context. Nature Reviews Neuroscience, detection. In NIPS (pp. 424–432). 5(8), 617–629. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., & Feng J. (2017b). Dual path Bell, S., Lawrence, Z., Bala, K., & Girshick, R. (2016). Inside outside networks. In NIPS (pp. 4467–4475). net: Detecting objects in context with skip pooling and recurrent Chen, Y., Rohrbach, M., Yan, Z., Yan, S., Feng, J., & Kalantidis, Y. neural networks. In CVPR (pp. 2874–2883). (2019b), Graph based global reasoning networks. In CVPR. Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object Chen, Y., Yang, T., Zhang, X., Meng, G., Pan, C., & Sun, J. recognition using shape contexts. IEEE TPAMI, 24(4), 509–522. (2019c). DetNAS: Neural architecture search on object detection. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: arXiv:1903.10979. A review and new perspectives. IEEE TPAMI, 35(8), 1798–1828. Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., & Huang, T. (2018a). Biederman, I. (1972). Perceiving real world scenes. IJCV, 177(7), 77– Decoupled classification refinement: Hard false positive suppres- 80. sion for object detection. arXiv:1810.04002. Biederman, I. (1987a). Recognition by components: A theory of human Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., & Huang, T. (2018b). image understanding. Psychological Review, 94(2), 115. Revisiting RCNN: On awakening the classification power of faster Biederman, I. (1987b). Recognition by components: A theory of human RCNN. In ECCV. image understanding. Psychological Review, 94(2), 115. Cheng, G., Zhou, P., & Han, J. (2016). RIFDCNN: Rotation invariant Bilen, H., & Vedaldi, A. (2016). Weakly supervised deep detection and fisher discriminative convolutional neural networks for object networks. In CVPR (pp. 2846–2854). detection. In CVPR (pp. 2884–2893). Bodla, N., Singh, B., Chellappa, R., & Davis L. S. (2017). SoftNMS Cheng, M., Zhang, Z., Lin, W., & Torr, P. (2014). BING: Binarized improving object detection with one line of code. In ICCV (pp. normed gradients for objectness estimation at 300fps. In CVPR 5562–5570). (pp. 3286–3293). Borji, A., Cheng, M., Jiang, H., & Li, J. (2014). Salient object detection: Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2018c). Model compres- A survey, 1, 1–26. arXiv:1411.5878v1. sion and acceleration for deep neural networks: The principles, Bourdev, L., & Brandt, J. (2005). Robust object detection via soft cas- progress, and challenges. IEEE Signal Processing Magazine, cade. CVPR, 2, 236–243. 35(1), 126–136. Bruna, J., & Mallat, S. (2013). Invariant scattering convolution net- Chollet, F. (2017). Xception: Deep learning with depthwise separable works. IEEE TPAMI, 35(8), 1872–1886. 