Middle-level Fusion for Lightweight RGB-D Salient Object Detection

Most existing lightweight RGB-D salient object detection (SOD) models are based on either a two-stream or a single-stream structure. The former first uses two sub-networks to extract unimodal features from the RGB and depth images, respectively, and then fuses them for SOD, while the latter directly extracts multi-modal features from the input RGB-D images and then focuses on exploiting cross-level complementary information. However, two-stream structure based models inevitably require more parameters, and single-stream structure based ones cannot fully exploit the cross-modal complementary information since they ignore the modality difference. To address these issues, we propose to employ a middle-level fusion structure for designing a lightweight RGB-D SOD model, which first employs two sub-networks to extract low- and middle-level unimodal features, respectively, and then fuses the extracted middle-level unimodal features so that a single subsequent sub-network can extract the corresponding high-level multi-modal features. Different from existing models, this structure can effectively exploit the cross-modal complementary information and significantly reduce the network's parameters simultaneously. On this basis, a novel lightweight SOD model is designed, which contains an information-aware multi-modal feature fusion (IMFF) module for effectively capturing the cross-modal complementary information and a lightweight feature-level and decision-level feature fusion (LFDF) module for aggregating the feature-level and decision-level saliency information in different stages with fewer parameters. Our proposed model has only 3.9M parameters and runs at 33 FPS. Experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods.

such as salient objects sharing similar appearances with the backgrounds and images with complex backgrounds. Recently, researchers have tried to introduce depth images to address these issues, since depth images can provide geometrical information about the scene that complements the RGB images [18]. By virtue of the complementary information within RGB and depth (RGB-D) images, RGB-D SOD methods have made significant progress [19]-[37].
However, compared with unimodal RGB SOD models, most existing state-of-the-art (SOTA) RGB-D SOD models [20], [22], [23], [28], [31]-[34], [38], [39] require more computational cost and memory to accurately detect the salient objects, since they need to process the information of two modalities. This restricts their real-life applications, most of which run on resource-constrained devices such as mobile phones and on-board computers in cars. To address this issue, some lightweight RGB-D SOD models have been presented, which can be roughly divided into two-stream structure based models [40] and single-stream structure based models [35], [41], [42].
As shown in Fig. 1(a), two-stream structure based models [40] usually reduce their parameters by designing lightweight unimodal feature extraction, multi-modal feature fusion and saliency prediction modules. For example, ATSA [40] designs a lightweight asymmetric two-stream architecture, which uses a standard network for RGB images and a lightweight depth network (DepthNet) for depth images, to limit the parameters introduced by the depth branch. It then proposes a depth attention module (DAM) to ensure that the depth features effectively guide the RGB features by exploiting the discriminative power of depth cues. Considering the modality discrepancy between the RGB and depth images, the two-stream structure enables RGB-D SOD models to better exploit the complementary information within RGB-D images for SOD. However, these models inevitably require more computational cost and memory.
As shown in Fig. 1(b) and (c), compared with two-stream structure based models, single-stream structure based models first reduce their parameters by using only one sub-network for feature extraction. They then design different lightweight multi-level feature fusion strategies to effectively exploit the extracted features across different levels for SOD. For example, as in Fig. 1(b), DANet [41] concatenates the RGB and depth images as the four-channel input of a SOD model. As in Fig. 1(c), inspired by knowledge distillation, A2dele [42] and CoNet [35] transfer knowledge from the depth stream to the RGB stream in the training phase and then employ the RGB stream alone for SOD in the testing phase. Although single-stream structure based models can effectively reduce the network's parameters, they cannot fully exploit the cross-modal complementary information, since they use a shared sub-network for feature extraction and thus ignore the modality difference between the RGB and depth images. Meanwhile, the depth information cannot be fully transferred into the RGB stream, and the transferred information cannot fully represent the depth information extracted from the depth images, due to the modality discrepancy. As a result, these models still exhibit a large performance gap with respect to other RGB-D SOD models.
As shown in Fig. 1(d), to overcome the limitations of existing lightweight models, we revisit the middle-level feature fusion structure. It first employs two sub-networks for extracting low-level unimodal features from the RGB and depth images, respectively, and then designs a multi-modal feature fusion module to fuse the extracted unimodal RGB and depth features. After that, it employs another sub-network to extract high-level multi-modal features. Finally, it designs a saliency prediction module for deducing the saliency maps. Compared with existing models, the middle-level feature fusion structure has several advantages. First, compared with the two-stream structure, it significantly reduces the network's parameters by (1) using one sub-network for extracting high-level features, since high-level features contain far more feature channels than low-level ones, and (2) using the multi-modal feature fusion module only once. Second, compared with the single-stream structure, it can well extract the cross-modal complementary information from the input RGB-D images, since it employs two independent sub-networks for low-level feature extraction and a shared sub-network for multi-modal feature fusion as well as high-level feature extraction. Therefore, in this paper, we propose a novel middle-level feature fusion structure based lightweight RGB-D SOD model. To the best of our knowledge, this is the first work that employs the middle-level feature fusion structure for lightweight SOD.
In this model, a novel information-aware multi-modal feature fusion (IMFF) module is first designed to effectively capture the cross-modal complementary information within RGB-D images. The idea behind our IMFF module is that multi-modal feature fusion aims to exploit all the useful information in the RGB and depth images, including both their complementary and redundant parts. The difficulty is that we do not know which local regions contain useful information during the training and testing phases. However, the unimodal features extracted from an arbitrary local region of an RGB or depth image can indirectly reflect the amount of information in that region. Generally speaking, a highly informative region is more likely to contain useful information, while a less informative region (e.g., a low-quality or background region) tends to contain more ordinary information. Therefore, the local regions of the input RGB and depth images that carry useful information may be identified by searching for their corresponding informative regions. Meanwhile, compared with RGB SOD, the informative local regions of the input RGB and depth images may be more easily distinguished by comparing their unimodal features from the same local regions in an information-aware feature space. To this end, in our proposed IMFF module, the local unimodal features from the RGB and depth images are first projected into an information-aware feature space, and then the differences in the amount of their contained information are compared to determine whether each local region of the RGB (depth) image is informative.
Then, a novel lightweight feature-level and decision-level feature fusion (LFDF) module is designed to aggregate the feature-level and decision-level saliency information in different stages with fewer parameters. In our LFDF module, all the input features are first reduced to 64 channels to reduce the model's parameters. Then, the LFDF module employs a novel feature-level and decision-level saliency information aggregation structure to aggregate all the information in the features and the saliency maps across different stages. This also makes up for the performance degradation caused by the parameter reduction.
In summary, the main contributions of this work are as follows: (1) By revisiting middle-level feature fusion, a novel lightweight RGB-D SOD model is presented in this paper, which achieves high efficiency, good accuracy and a small model size, thus facilitating SOD's real-life applications.
(2) A novel information-aware multi-modal feature fusion (IMFF) module is designed to exploit all the discriminative saliency information in the RGB and depth images. Different from most existing models which employ simple fusion strategies (e.g., concatenation and element-wise addition), our proposed IMFF module fuses multi-modal features according to the amount of their contained information.
(3) A novel lightweight feature-level and decision-level feature fusion (LFDF) module is presented to effectively aggregate the feature-level and decision-level saliency information of different stages with fewer parameters for better saliency prediction.
The rest of this paper is organized as follows. Section 2 briefly introduces previous works related to RGB and RGB-D salient object detection. Section 3 presents the details of the proposed method, including the architecture and loss function. Section 4 conducts a series of experiments to validate the proposed model. Finally, Section 5 concludes this paper.

II. RELATED WORK

A. RGB SOD
Conventional RGB SOD models mainly integrate different kinds of hand-designed features and prior knowledge to model the focused attention of human beings [9], [43], [44]. Recently, convolutional neural network (CNN) based RGB SOD models [5]-[17] have dominated this field due to their capability of simultaneously extracting high-level global information and low-level local details. Many of them try to employ cross-level contextual information for saliency detection. For example, DSSNet [7] designs an enhanced HED architecture to aggregate the multi-level context information from the deeper layers to the shallower ones with the aid of multiple short connections. Afterwards, some RGB SOD models introduce multi-scale context information to handle the large variations in the shapes and sizes of salient objects. For example, MINet [13] proposes a self-interaction module that enables the network to adaptively extract multi-scale information from the input images. By integrating self-interaction modules into the saliency prediction module, the network can adaptively deal with the scale variation of different samples during the training and testing stages.
More recently, the edge information of salient objects has been revisited to address the blurred-boundary problem. For example, ENFNet [14] proposes an edge guidance block to embed edge prior knowledge into hierarchical feature maps. The edge guidance block simultaneously performs feature-wise manipulation and spatial-wise transformation for effective edge embedding. Besides, part-object relationships have also been exploited to solve problems that fundamentally hinge on relational inference for visual saliency detection [16], [17]. For example, both TSPOANet [17] and TSPORTNet [16] employ the Capsule Network (CapsNet) to dig into part-object relationships for SOD.

B. RGB-D SOD
Recently, RGB-D SOD has received great research interest, as it exploits the complementary information in RGB-D images for boosting SOD. A complete survey on RGB-D SOD methods is beyond the scope of this paper, and we refer the readers to a recent survey [45] for more details. In general, most existing RGB-D SOD models can be summarized into three categories, i.e., pixel-level fusion, feature-level fusion and decision-level fusion.
Pixel-level fusion based models [19], [24], [26], [36], [41], [46] directly take the RGB and depth images as a four-channel input of the SOD model. For example, DANet [41] employs a single-stream encoder that concatenates the input RGB and depth images as a four-channel input and then designs a depth-enhanced dual attention module to filter the mutual interference between the depth prior and the appearance prior, thereby enhancing the overall contrast between foreground and background.
Feature-level fusion based models [20], [22], [32]-[34], [38], [39] first extract the unimodal RGB and depth features from the input RGB and depth images, respectively, and then fuse them to capture their complementary information for SOD. For example, JCUF [39] first employs two sub-networks to extract unimodal RGB and depth features from the input RGB and depth images, respectively, and then designs a multi-branch feature fusion module to jointly use the fused cross-modal features and the unimodal RGB and depth features for SOD. To effectively capture the cross-modal complementary information within RGB-D images, EBFS [34] designs a novel multi-modal feature interaction module to simultaneously capture the first-order and second-order statistical characteristics between the unimodal RGB and depth features.
Decision-level fusion based models [21], [47], [48] first deduce two saliency maps from the input RGB and depth images, respectively, and then fuse the two saliency maps by using well-designed weight maps. For example, QAMSOD [48] first deduces two saliency maps from the input RGB and depth images, respectively, and then fuses them using two weight maps. Notably, the two weight maps are generated by a deep reinforcement learning algorithm.
Although RGB-D SOD has achieved great progress recently, most existing RGB-D SOD models require high computational cost and memory consumption to obtain high accuracy. Considering this, we propose a novel lightweight RGB-D SOD model in this paper.

III. PROPOSED MODEL
As shown in Fig. 2, the proposed lightweight RGB-D SOD model employs a middle-level feature fusion structure. Given the input RGB image (denoted by I_r) and depth image (denoted by I_d), two sub-networks are first employed to extract their unimodal features, respectively. As a result, three levels of unimodal RGB and depth features (denoted by F_r^i and F_d^i, i=1,2,3, respectively) are obtained from the input RGB and depth images. Then, F_r^3 and F_d^3 are fed into our proposed information-aware multi-modal feature fusion (IMFF) module to exploit their cross-modal complementary information, yielding the third-level fused features F_rd^3. After that, the third-level fused features are fed into another sub-network to extract the corresponding high-level cross-modal features, producing two further levels of multi-modal features F_rd^i, i=4,5. Finally, the extracted multi-modal features at different levels are fed into our proposed lightweight feature-level and decision-level feature fusion (LFDF) module for detecting salient objects. It should be noted that, as shown in Fig. 2, the second-level fused features F_rd^2 are obtained by element-wise addition of the extracted unimodal features F_r^2 and F_d^2. Details about these modules are discussed in the following.

A. Feature extractors
As shown in Fig. 2, there are three feature extractors in our proposed lightweight RGB-D SOD model. Two of them are employed to extract low-level unimodal features from the input RGB and depth images, respectively, while the remaining one extracts the high-level cross-modal features. Taking advantage of existing lightweight technologies, we choose ShuffleNet [49], a classic lightweight classification network, as our feature extractor. Specifically, the two sub-networks for unimodal feature extraction use the same structure as the first three convolutional blocks of ShuffleNet. Their parameters are first pre-trained on ImageNet [50] and then independently fine-tuned in our proposed network. The sub-network for extracting high-level multi-modal features uses the same structure as the last two convolutional blocks of ShuffleNet; its parameters are also pre-trained on ImageNet and fine-tuned in our model.

B. IMFF module
As discussed in Section I, the proposed IMFF module aims to exploit all the discriminative saliency information in the RGB and depth images for SOD. Based on the assumption that the unimodal features can indirectly reflect the informative and non-informative regions in the input RGB and depth images, the proposed IMFF module tries to fuse the unimodal RGB and depth features from the local regions with an abundant amount of information, since informative regions are more likely to contain discriminative saliency information, while non-informative regions may be of low quality or belong to the background. Meanwhile, as shown in Fig. 2, the multi-modal feature fusion module is employed only once, rather than several times as in two-stream structure based models, which significantly reduces our network's parameters.
Specifically, the IMFF module is only employed for the third level of unimodal RGB features F_r^3 and depth features F_d^3 ∈ R^{C3×W3×H3}, because we experimentally find that they have a proper receptive field for reflecting the informativeness of different local regions in the input images. Here, C3, W3 and H3 are the number of feature channels, the width and the height of the third-level unimodal RGB or depth features, respectively. As mentioned above, the features in F_r^3 and F_d^3 can indirectly reflect the amount of information contained in different local regions of the input RGB and depth images. Considering this, as shown in Fig. 3, the third-level unimodal RGB and depth features are first projected into an information-aware feature space by using a shared transfer function, i.e.,

F̂_r^3 = Conv(F_r^3, θ1), F̂_d^3 = Conv(F_d^3, θ1),

where Conv(*, θ1) is a convolutional layer with parameters θ1, which serves as the transfer function, and F̂_r^3 and F̂_d^3 ∈ R^{C3×W3×H3} denote the projected information-aware features. By doing so, as shown in Fig. 4, each local feature of F̂_r^3 and F̂_d^3 can reflect the amount of information contained in its corresponding local RGB or depth image region (i.e., its receptive field) in this information-aware feature space.
Then, given the information-aware features of different regions of the RGB and depth images, we can analyse the relative informativeness between the two modalities in each local region and further determine the informative and non-informative regions of the RGB and depth images. For example, given an arbitrary local region of the input RGB and depth images, suppose that, by comparing their information-aware features, we find that their total amount of information is large, their amount of shared information is small and the difference in their amounts of information is large. We may then infer that the local regions of both the input RGB and depth images are likely to be informative and that there is abundant complementary information between them in this local region.
Therefore, as shown in Fig. 3, we analyse the relations between the information-aware features F̂_r^3 and F̂_d^3. Specifically, their total amount of information, their amount of shared information and the difference in their amounts of information are computed by

F_sum = F̂_r^3 + F̂_d^3, F_mul = F̂_r^3 * F̂_d^3, F_sub = F̂_r^3 − F̂_d^3,

where +, * and − denote element-wise addition, multiplication and subtraction, and F_sum, F_mul and F_sub denote the resulting interaction features. After that, as shown in Fig. 3, the selection weights (i.e., w_r and w_d ∈ R^{C3×W3×H3}) for each local region of the input RGB and depth images are generated by

[w_r, w_d] = P(S(Conv(Cat(F_sum, F_mul, F_sub), θ2))),

where Conv(*, θ2) denotes a 1×1 convolutional layer with parameters θ2 for fitting the inherent relations between those information-aware features, Cat(*) denotes the concatenation operation, S(*) denotes the softmax function and P(*) denotes the channel-wise separation operation.
Here, each local weight in w_r and w_d reflects the informativeness of the corresponding local region in the input RGB or depth image. Meanwhile, for each local region, we use channel-wise selection rather than spatial-wise selection to further select the informative local features and discard the non-informative ones. Finally, given w_r and w_d, the fused features are obtained by

F_rd^3 = w_r * F_r^3 + w_d * F_d^3.

As shown in Fig. 5, we visualize some w_r and w_d in different regions of the input RGB and depth images, respectively. Generally speaking, the RGB images contain more information than the depth images. Therefore, as shown in Fig. 5(c)-(e), the IMFF module assigns higher weights to the unimodal RGB features than to the unimodal depth features, i.e., the points below the red line outnumber those above it, especially for the RGB-D images in the second row, whose RGB image is of higher quality than the depth image. Furthermore, local regions 1 are non-informative regions for both the RGB and depth images, local regions 2 are informative regions for the RGB images only, and local regions 3 are informative regions for both. It can be seen that, for local region 1, the IMFF module generates nearly equal w_r and w_d with values of 0.4~0.6. This may result from the fact that the unimodal RGB and depth features in those non-informative local regions have low activation values. For local regions 2, the IMFF module generates much higher w_r than w_d. Overall, the IMFF module exhibits its best feature selection ability for features in the informative regions.
Therefore, by virtue of our proposed IMFF module, our model's ability to capture the cross-modal complementary information between the RGB and depth images is significantly improved, while its total number of parameters is effectively reduced.
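A minimal sketch of such an information-aware fusion is given below. The kernel size of the shared transfer function, the use of a plain softmax across the two modalities and the absence of normalization layers are all our assumptions; only the overall structure (shared projection, sum/product/difference interactions, 1×1 convolution, softmax, channel-wise separation, weighted fusion) follows the description above.

```python
import torch
import torch.nn as nn


class IMFF(nn.Module):
    """Sketch of the information-aware multi-modal feature fusion module."""

    def __init__(self, channels):
        super().__init__()
        # Shared transfer function projecting both modalities into a common
        # information-aware feature space (theta_1 in the text).
        self.transfer = nn.Conv2d(channels, channels, 3, padding=1)
        # 1x1 convolution fitting the relations among the interaction terms
        # (theta_2 in the text); outputs one weight map per modality.
        self.relate = nn.Conv2d(3 * channels, 2 * channels, 1)

    def forward(self, f_r, f_d):
        p_r, p_d = self.transfer(f_r), self.transfer(f_d)
        total = p_r + p_d    # total amount of information
        shared = p_r * p_d   # shared information
        diff = p_r - p_d     # difference in information
        w = self.relate(torch.cat([total, shared, diff], dim=1))
        # Softmax across the two modalities, then channel-wise separation
        # so w_r + w_d = 1 at every channel and position.
        w = w.view(w.size(0), 2, -1, w.size(2), w.size(3))
        w = torch.softmax(w, dim=1)
        w_r, w_d = w[:, 0], w[:, 1]
        return w_r * f_r + w_d * f_d
```

Because the softmax is taken per channel and position, each channel of each local region independently chooses how much to trust the RGB versus the depth features, matching the channel-wise selection described above.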

C. LFDF module
Given the extracted multi-modal features (F_rd^i, i=2, 3, 4, 5), our proposed LFDF module aims to effectively exploit the multi-level complementary information. To this end, as shown in Fig. 2, it aggregates not only the saliency information contained in the features of different levels (i.e., feature-level information) but also the saliency information contained in the saliency maps deduced at different levels (i.e., decision-level information), in two directions. This makes up for the information loss caused by feature channel reduction.
Specifically, as shown in Fig. 2, the channels of all the multi-modal features F_rd^i (i=2,3,4,5) are first reduced to 64 to decrease the computational complexity and memory usage of our proposed LFDF module, i.e.,

F̄_rd^i = Conv(F_rd^i, β_i),

where Conv(*, β_i) denotes a convolutional layer with parameters β_i and F̄_rd^i denotes the channel-reduced features. Similarly, all the other features in our proposed LFDF module are also set to 64 channels. Compared with using the original channel numbers, this effectively reduces our network's parameters in the subsequent steps. However, reducing the feature channels inevitably reduces the amount of saliency information contained in the multi-level features, which in turn degrades our model's performance. Our proposed LFDF module addresses this issue by fully exploiting the feature-level and decision-level information. To this end, as shown in Fig. 2, it aggregates the complementary information across different levels of features and saliency maps in two directions.
Concretely, the LFDF module first aggregates the feature-level and decision-level saliency information from the shallow levels to the high levels to obtain more discriminative features and a more accurate saliency map at the current level, i.e.,

→F_i = Conv(Cat(F_rd^i, Resize(→F_{i−1}), Resize(→S_{i−1})), γ_i), →S_i = Conv(→F_i, α_i),

where F_rd^i denotes the i-th level of the multi-modal features, Conv(*, γ_i) denotes a convolutional layer with parameters γ_i, Cat(*) denotes the concatenation operation, and Resize(*) denotes the bilinear interpolation that resizes features from other levels into the shape of the i-th level features. →S_i denotes the corresponding saliency map and Conv(*, α_i) denotes a 1×1 convolutional layer with parameters α_i for saliency map prediction. In this way, the multi-modal features are re-exploited multiple times, and the saliency map from the previous step guides our model to better extract high-level semantic information at the current stage.
After that, the feature-level and decision-level saliency information is further aggregated from the high levels to the shallow levels by

←F_i = Conv(Cat(F_rd^i, Resize(←F_{i+1}), Resize(←S_{i+1})), ϑ_i), ←S_i = Conv(←F_i, ε_i),

where Conv(*, ϑ_i) denotes a convolutional layer with parameters ϑ_i, ←S_i denotes the corresponding saliency map and Conv(*, ε_i) denotes a 1×1 convolutional layer with parameters ε_i for saliency map prediction. In this way, the feature-level and decision-level saliency information is effectively aggregated from the high levels to the shallow levels.
Finally, the final saliency map S_o is obtained by jointly employing the feature-level and decision-level saliency information, i.e.,

S_o = Conv(Cat(→F_2, ←F_2), ι),

where →F_2 and ←F_2 denote the shallowest aggregated features from the two directions and Conv(*, ι) denotes a 1×1 convolutional layer with parameters ι. By virtue of our proposed LFDF module, the computational complexity and memory usage of our model are significantly decreased by reducing the channels of all the features in the LFDF module. Meanwhile, the accompanying loss of information is well compensated by increasing the use efficiency of the feature-level and decision-level saliency information. The experimental results in Section IV prove that our proposed LFDF module can make up for the performance drop, to some extent, by aggregating the feature-level and decision-level saliency information.
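The bidirectional aggregation described above can be sketched as follows. The convolution configurations, whether the backward sweep consumes the raw reduced features (used here) or the forward-sweep outputs, and the final fusion at the shallowest level are our assumptions; the sketch keeps the essential structure of channel reduction to 64, two sweeps that each carry the previous level's features and saliency map, and a joint final prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LFDF(nn.Module):
    """Sketch of the lightweight feature/decision-level fusion module."""

    def __init__(self, in_channels, mid=64):
        super().__init__()
        n = len(in_channels)
        # Reduce every level to `mid` channels to keep the module light.
        self.reduce = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_channels])
        # Shallow-to-deep sweep: current features + resized previous
        # features + resized previous saliency map.
        self.fwd = nn.ModuleList(
            [nn.Conv2d(mid if i == 0 else 2 * mid + 1, mid, 3, padding=1)
             for i in range(n)])
        self.fwd_pred = nn.ModuleList([nn.Conv2d(mid, 1, 1) for _ in range(n)])
        # Deep-to-shallow sweep, symmetric to the above.
        self.bwd = nn.ModuleList(
            [nn.Conv2d(mid if i == 0 else 2 * mid + 1, mid, 3, padding=1)
             for i in range(n)])
        self.bwd_pred = nn.ModuleList([nn.Conv2d(mid, 1, 1) for _ in range(n)])
        self.final = nn.Conv2d(2 * mid, 1, 1)  # joint final prediction

    def _sweep(self, feats, convs, preds):
        outs, maps, prev, prev_s = [], [], None, None
        for f, conv, pred in zip(feats, convs, preds):
            if prev is not None:
                prev = F.interpolate(prev, size=f.shape[2:],
                                     mode="bilinear", align_corners=False)
                prev_s = F.interpolate(prev_s, size=f.shape[2:],
                                       mode="bilinear", align_corners=False)
                f = torch.cat([f, prev, prev_s], dim=1)
            prev = conv(f)
            prev_s = pred(prev)
            outs.append(prev)
            maps.append(prev_s)
        return outs, maps

    def forward(self, feats):  # feats ordered shallow -> deep (levels 2..5)
        feats = [r(f) for r, f in zip(self.reduce, feats)]
        f_fwd, s_fwd = self._sweep(feats, self.fwd, self.fwd_pred)
        f_bwd, s_bwd = self._sweep(feats[::-1], self.bwd, self.bwd_pred)
        # Fuse the shallowest aggregated features of both directions.
        top = F.interpolate(f_fwd[0], size=f_bwd[-1].shape[2:],
                            mode="bilinear", align_corners=False)
        s_o = self.final(torch.cat([top, f_bwd[-1]], dim=1))
        return s_o, s_fwd, s_bwd
```

All intermediate saliency maps are returned alongside the final one, since every map receives supervision during training.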

D. Loss Function
We employ the cross-entropy (CE) loss and an edge loss to train our proposed network. The CE loss is widely used in SOD and is expressed as

L_ce(S, Y) = −[Y log(S) + (1 − Y) log(1 − S)],

where S denotes the corresponding saliency map and Y denotes the ground truth. The edge loss is used to refine the boundaries of the generated saliency maps and is expressed as

L_edge(S, Y) = MSE(Sobel(S), Sobel(Y)),

where MSE(*) denotes the MSE loss and Sobel(*) denotes the Sobel edge detector. For better training, the CE loss and edge loss are applied to all saliency maps generated by our model, including the final saliency map S_o and the middle-level saliency maps →S_i and ←S_i, i = 2, 3, 4, 5. Therefore, the overall loss is the sum of the CE and edge losses over all of these saliency maps.
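The combined loss can be sketched as below; the Sobel implementation (gradient magnitude via fixed 3×3 kernels) and the unweighted sum over all maps are our assumptions, since the paper does not give the exact formulation.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Sobel kernels for horizontal and vertical gradients.
_KX = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
_KY = _KX.t()


def sobel(x):
    # x: (B, 1, H, W) saliency map; returns the gradient magnitude.
    kx = _KX.view(1, 1, 3, 3).to(x)
    ky = _KY.view(1, 1, 3, 3).to(x)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)


def saliency_loss(pred_logits, target):
    # Binary cross-entropy on the map plus MSE between Sobel edges.
    s = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy(s, target)
    edge = F.mse_loss(sobel(s), sobel(target))
    return ce + edge


def total_loss(all_preds, target):
    # all_preds: the final map S_o plus every intermediate map, each
    # resized to the ground-truth resolution before supervision.
    losses = [saliency_loss(
        F.interpolate(p, size=target.shape[2:],
                      mode="bilinear", align_corners=False),
        target) for p in all_preds]
    return sum(losses)
```

Supervising the intermediate maps in both sweeps gives every stage of the LFDF module a direct training signal, which is what allows the heavily channel-reduced features to remain discriminative.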

IV. EXPERIMENTS

A. Datasets
Our experiments are conducted on four widely used RGB-D SOD datasets: NJU2000 [25], NLPR [51], STEREO [52] and SIP [26]. Among them, the NJU2000 dataset [25] contains 2003 annotated RGB-D images covering diverse objects and complex, challenging scenarios. The NLPR dataset [51] contains 1000 RGB-D images captured by Kinect, covering a variety of indoor and outdoor scenes under different illumination conditions. The STEREO dataset [52] contains 797 RGB-D images. SIP [26] is a recently proposed dataset containing 1000 accurately annotated high-resolution RGB-D images.
For fair comparisons, we follow the same data split as in [26], [53] and [33]. Concretely, we randomly sample 1485 RGB-D images from the NJU2K dataset and 700 RGB-D images from the NLPR dataset as our training set. The remaining images of the NJU2K and NLPR datasets and the whole STEREO and SIP datasets are used for testing.
B. Evaluation Metrics

Four widely used metrics are adopted for evaluation, i.e., F-measure (F_β), mean absolute error (MAE), S-measure (S_λ) and E-measure (E_γ). Among them, the F-measure is a weighted harmonic mean of Precision and Recall, which evaluates the overall performance of a salient object detection model. It is defined as

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall),

where we set β² = 0.3 as suggested in [25]. Here, Precision and Recall are computed by comparing the ground truths with the binarized saliency maps under different thresholds. MAE computes the difference between the saliency map S and the ground truth Y, and is expressed as

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − Y(x, y)|,

where W and H are the width and height of the saliency map (or ground truth), respectively.
The S-measure (S_λ) was recently proposed in [54] to evaluate the structural similarity between the saliency map and the ground truth. It jointly computes the region-aware (Sr) and object-aware (So) structural similarities as the final structure metric:

S_λ = α · So + (1 − α) · Sr,

where α ∈ [0, 1] is a balance parameter and is set to 0.5. More details can be found in [54]. The E-measure (E_γ) [55] considers pixel-level and image-level errors by simultaneously capturing global statistics and local pixel matching information, and is formulated as

E_γ = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} φ_FM(x, y),

where W and H are the width and height of the saliency maps and φ_FM(*) is the enhanced alignment matrix whose details are given in [55]. Besides, we also report the number of parameters (in millions, M) of existing models and their corresponding inference speed (FPS). To obtain the inference speed, we first randomly sample 500 RGB-D images from the training set, resize them to 352×352, feed them into the corresponding models and compute the average inference speed.
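The MAE and F-measure computations can be sketched as follows. Saliency maps and ground truths are assumed to be arrays in [0, 1]; the adaptive threshold (twice the mean saliency value) is a common convention shown here for illustration, whereas the paper sweeps multiple thresholds.

```python
import numpy as np


def mae(sal, gt):
    # Mean absolute error between saliency map and ground truth.
    return np.abs(sal - gt).mean()


def f_measure(sal, gt, beta2=0.3):
    # Binarize with the adaptive threshold 2 * mean (capped at 1),
    # then compute the weighted harmonic mean of precision and recall.
    thresh = min(2 * sal.mean(), 1.0)
    pred = sal >= thresh
    tp = np.logical_and(pred, gt > 0.5).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
```

With β² = 0.3, the F-measure weights precision more heavily than recall, which is why it is the standard setting in the SOD literature.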

C. Implementation
We implement our proposed model with the PyTorch [56] toolbox on an NVIDIA 2080Ti GPU. All the parameters of our proposed model, except those of the feature extractors, are initialized with Xavier initialization. We use SGD with Nesterov momentum to optimize our model; the learning rate, weight decay and mini-batch size are set to 2e-3, 5e-4 and 4, respectively. Furthermore, the learning rate decays by a factor of 0.8 every 20 epochs. All the images are resized to 224 × 224 in the training phase.
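The optimization setup above translates directly into PyTorch; the momentum value itself (0.9) is our assumption, as the paper does not state it.

```python
import torch

# SGD with Nesterov momentum, lr 2e-3, weight decay 5e-4, and the lr
# decayed by a factor of 0.8 every 20 epochs via StepLR.
model = torch.nn.Conv2d(3, 1, 3)  # stand-in for the full SOD model
optimizer = torch.optim.SGD(model.parameters(), lr=2e-3, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.8)
```

During training, `scheduler.step()` is called once per epoch after `optimizer.step()`, so the learning rate drops to 1.6e-3 after epoch 20, 1.28e-3 after epoch 40, and so on.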

D. Ablation Experiments and Analyses
In this section, ablation experiments for each component of our proposed model are performed on NJU2000 to investigate their validity and contributions.
1) At which level to fuse the multi-modal features: To investigate the impact of multi-modal feature fusion at different levels, several versions of our proposed method (denoted Input F, L1 F, L2 F, L3 F, L4 F and L5 F, respectively) are compared. Specifically, we keep the LFDF module fixed and employ the IMFF module at different levels. Input F denotes that the input RGB and depth images are concatenated as the four-channel input of our model. L1 F, ..., L5 F denote that the IMFF module is employed on the features of level 1, level 2, level 3, level 4 and level 5, respectively.
The quantitative results of these models are shown in Table I. It can be seen that employing the IMFF module on higher-level features requires more parameters, since those features have more channels. Furthermore, the performance of our proposed model first increases as the IMFF module moves from level 1 to level 3 and then drops as it moves from level 3 to level 5. This may result from the fact that, when applying the IMFF module at one of the first two levels, the corresponding features have relatively small receptive fields, and the information extracted from such small local regions of the input RGB and depth images cannot effectively reflect their amount of information. Conversely, for the high-level features, the receptive fields are too large to effectively capture the cross-modal complementary information within the RGB-D images.
Employing the IMFF module on the features of the third level achieves a good balance between model size and performance. Therefore, in this paper, we choose L3 F as our final model.

2) Ablation experiments for each module:
We then investigate the contribution of each component of our proposed model. Specifically, the 'Baseline' model denotes our proposed lightweight RGB-D SOD model with both the IMFF module and the LFDF module removed. As shown in Table II, both the IMFF module and the LFDF module (i.e., 'Baseline+IMFF' and 'Baseline+LFDF') improve the performance of RGB-D SOD. This verifies that the proposed IMFF module can effectively capture the cross-modal complementary information and the proposed LFDF module can well exploit the cross-level complementary information. Furthermore, with the collaboration of the two modules, our full model (i.e., 'Baseline+IMFF+LFDF') obtains the best performance.

E. Comparison with the State-of-the-Art
The proposed model and some existing state-of-the-art RGB-D salient object detection models are evaluated on four benchmark datasets: NJU2000 [25], NLPR [51], STEREO [52] and SIP [26]. The compared state-of-the-art models include PCA [20], TSAA [22], CPFP [23], D3Net [26], JCUF [39], ICNet [29], ASIFN [28], UCNet [57], DMRA [53], SSF [33], JLDCF [36], BBSNet [34], BIANet [37], EBFS [32], A2dele [42], DANet [41] and CoNet [35]. Among them, A2dele [42], DANet [41] and CoNet [35] are lightweight models. For fair comparisons, the saliency maps produced by these existing models are provided by their authors and evaluated with our evaluation code. Here, 'OUR-VGG16' denotes that the feature extractors of our proposed model employ the structure of the VGG16 network [58] rather than ShuffleNet, while 'OUR-ShuffleNet' denotes that they employ the structure of ShuffleNet [49].
1) Quantitative analysis: Table III shows the quantitative results of the state-of-the-art models. On the NJU2000 dataset, the proposed 'OUR-VGG16' achieves the best performance in terms of MAE and Eγ and obtains competitive results in terms of Fβ and Sλ. On the NLPR and STEREO datasets, 'OUR-VGG16' achieves competitive results with respect to the other state-of-the-art models. On the SIP dataset, 'OUR-VGG16' obtains the best performance in all of the metrics. Furthermore, compared with existing state-of-the-art models, 'OUR-ShuffleNet' achieves competitive results in all of the metrics. Moreover, compared with existing lightweight RGB-D SOD models, 'OUR-ShuffleNet' obtains the best performance with the fewest parameters.
Furthermore, as shown in Table IV, 'OUR-VGG16' has 18.4 million parameters, making it one of the most lightweight and fastest RGB-D SOD models, except for A2dele [42] and ATSA [40]. 'OUR-ShuffleNet' has only 3.9 million parameters. However, compared with 'OUR-VGG16', its inference speed is lower due to its higher structural complexity; nevertheless, 'OUR-ShuffleNet' can still meet the requirements of most real-time applications.
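As a minimal illustration of how the parameter counts and speeds in Table IV are typically obtained, one may count the trainable parameters and time forward passes in PyTorch; the small stand-in network below is hypothetical and not the paper's model:

```python
import time
import torch
import torch.nn as nn

def count_params_m(model):
    # Number of trainable parameters, in millions.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def measure_fps(model, size=224, runs=10):
    # Average frames per second over `runs` forward passes on one image.
    model.eval()
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        model(x)                          # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return runs / (time.perf_counter() - start)

# Hypothetical small network standing in for the SOD model.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))

print(f"{count_params_m(net):.4f}M params, {measure_fps(net):.1f} FPS")
```

On a GPU, `torch.cuda.synchronize()` should be called before reading the timer, since CUDA kernels launch asynchronously.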
2) Qualitative analysis: Visualization results under different scenarios are illustrated in Fig. 6. As shown in the first two rows of Fig. 6, for images of relatively simple scenes, most state-of-the-art methods accurately detect the salient objects. Furthermore, as shown in the last four rows of Fig. 6, for relatively complex scenes, our proposed model obtains competitive and even better results than those state-of-the-art models. This further verifies the effectiveness of our proposed model.

V. CONCLUSION
In this paper, we proposed the first middle-level fusion structure based lightweight RGB-D SOD model. By revisiting the middle-level fusion structure, the proposed model significantly reduces the network's parameters. Furthermore, the proposed IMFF module can effectively capture the cross-modal complementary information with fewer parameters by exploiting the amount of information from different local regions of the RGB and depth images. Moreover, the proposed LFDF module can effectively extract the cross-level complementary information by jointly fusing the feature-level and decision-level information across levels. Based on the middle-level fusion structure, our proposed model has only 3.9M parameters and runs at 33 FPS. Furthermore, experimental results on several benchmarks show that, by virtue of the proposed IMFF and LFDF modules, our proposed model can compensate, to some extent, for the performance drop caused by reducing the parameters.