Stereo Refinement Dehazing Network

between the coarse dehazed results of the WSCDN and the refined results of the GSRN. We present the dehazed results

Response: Thanks for your advice. We compare our SRDNet with BidNet, both of which are stereo dehazing methods, on the 3D detection task. Tab. 2 (i.e., Tab. VII in the manuscript) shows that our method obtains higher accuracy on every AP3D metric in the different hazy scenes and has better perceptual quality.
4. Some Typos:
(1) Page3Line50: The weight-sharing coarse dehazing network (WSCDN) enjoys a (an) encoder-decoder structure and is shared by the left view and the right view.
(2) Page6Line49: The values of SSIM also reduce more than 0.0018 dB (delete dB) compared with employing the GCSR module.
(3) Page7Line51: Our SRDNet achieves an obvious gain with 3.36 dB in the PSNR and 0.054 dB (delete dB) in the SSIM compared with the MSBDN.
Response: Thank you very much for pointing out these problems. Following your suggestion, we have corrected all of the typos you mentioned in the resubmitted version. Furthermore, we have polished the manuscript and marked the changes in red.
To Reviewer #2
1. (1) The author only summarizes the two shortcomings of the previous methods, but does not explain the improvement of their method in terms of these shortcomings. For example, how to solve the generalization problem?
Response: Thanks for your comment. In paragraph 2, line 58, right column, page 1 of the resubmitted version, we describe how our method addresses these shortcomings, including the generalization problem. For clarity, we restate the two shortcomings and then explain how our method addresses each of them. Specifically, the previous stereo dehazing methods have two shortcomings: ① They simultaneously restore clear images and predict disparity. Disparity estimation is time-consuming, and estimating it from hazy images is even more challenging. A small error in disparity gives rise to a large variation in depth and in the estimated haze-free image. Furthermore, because it is hard to optimize the two tasks jointly, it is preferable not to use disparity directly for haze removal. Although BidNet constructs a matrix along the horizontal dimension to mine information from the cross views, the memory needed to construct the matrix becomes very large as the width of the input image grows, as can be observed from Tab. 4. ② They are based on the atmosphere scattering model, so their performance depends heavily on the joint accurate estimation of the transmission map and the atmospheric light.
To address the above two shortcomings: ① Our SRDNet concentrates on the dehazing task and does not predict disparity, so the network is optimized for dehazing alone. Instead of constructing the matrix, the SRDNet concatenates the left features with the right features and mines the depth information through a stereo feature extractor. As the input width grows, its memory demand increases far more slowly than that of the matrix construction. We also design a guided channel and spatial refinement (GCSR) module that separates the features to select the useful information for each view, which suppresses the negative effect of inaccurate and irrelevant information.
② Our SRDNet is an end-to-end deep learning model that directly restores haze-free stereo images and does not depend on the joint accurate estimation of the transmission map and the atmospheric light. In addition, our method is not restricted to the computational relation of the physical model and can model more complicated and redundant computational relations to fit and generalize to real foggy scenes.
In addition, experimental results demonstrate the superiority of the proposed method over both types of methods, represented by SSMDN and BidNet.
(2) The contribution is just a two-stage dehazing net, which is somewhat limited.
Response: Thanks for your comment. In the last three paragraphs of Section I of the resubmitted version, we have re-summarized our core contributions. Importantly, we have also clarified and emphasized the challenges behind the contributions. The challenges fall mainly into two aspects. On the one hand, directly applying a two-stage dehazing net from the domain of single image dehazing to stereo image dehazing is not effective, because it lacks a mechanism for exploiting the stereo information that is helpful for dehazing. In the domain of stereo image dehazing, it is challenging to design an effective two-stage dehazing net that works in a coarse-to-fine way.
On the other hand, it is challenging to use stereo information to positively refine the coarse dehazed images, because stereo information such as disparity/depth/distance is not accurate, and employing inaccurate information can even damage the dehazed results. Therefore, it is considerably difficult to design a stereo dehazing framework that utilizes stereo information while remaining immune to the negative effect of inaccurate information.
To summarize, our contribution lies in how to deal with the challenges in designing a two-stage dehazing net in the relatively new domain of stereo image dehazing.
It is the proposed SRDNet (Stereo Refinement Dehazing Network) that effectively deals with the above-mentioned challenges. Our SRDNet is not a simple two-stage dehazing net. It incorporates a weight-sharing coarse dehazing network (WSCDN) and a guided separated refinement network (GSRN). The GSRN learns the residues for the corresponding views to refine the coarse dehazed image pair through a stereo feature extractor and a guided channel and spatial refinement (GCSR) module. The stereo feature extractor makes full use of the information of cross views. The GCSR module separates the features for different views and predicts the corresponding residues, which suppresses the negative effect of inaccurate and irrelevant information. We also construct a simple two-stage dehazing net by replacing the GCSR module with a 3×3 convolutional layer. This simple two-stage dehazing net is compared with our method in Tab. 3 (i.e., Tab. IV in the manuscript).
The contributions of our method can be mainly divided into three aspects: We propose a stereo refinement dehazing network (SRDNet) to directly recover the clean stereo images in a coarse-to-fine fashion, which is the first attempt to address stereo image dehazing progressively. The SRDNet makes full use of the information collaboratively encoded in the cross views without employing disparity or a correlation matrix. Our SRDNet is not limited to the simple physical model, and learns a more complicated model that better matches real foggy scenes. This is of great importance for eliminating the performance degradation of stereo-based 3D detectors caused by foggy inputs.
The SRDNet incorporates a weight-sharing coarse dehazing network (WSCDN) and a guided separated refinement network (GSRN). The WSCDN removes a part of the haze and obtains a coarse dehazed image pair. The GSRN learns the residues for the corresponding views to refine the coarse dehazed image pair through a stereo feature extractor and a guided channel and spatial refinement (GCSR) module. The stereo feature extractor makes full use of the information of cross views. The GCSR module separates the features for different views and predicts the corresponding residues, which suppresses the negative effects of inaccurate and irrelevant information.
(a) Disparity prediction is a challenging task, and it is hard to optimize the two tasks jointly. A small error in disparity gives rise to a large variation in depth and in the estimated haze-free image. In hazy scenes, it is hard to estimate the correct disparity map or the correct correlation matrix. (b) Although the matrix is computed only along the horizontal dimension, the memory needed for the matrix multiplication is large when the input image is wide. In contrast, our SRDNet concentrates on the dehazing task and does not predict disparity, so the network is optimized for dehazing alone. Instead of constructing the matrix, the SRDNet concatenates the left features with the right features and mines the depth information through a stereo feature extractor. As the input width grows, its memory demand increases far more slowly than that of the matrix construction. In order to select the useful information for each view without introducing confusing information, we also design a guided channel and spatial refinement (GCSR) module to separate the features of each view. As shown in Tab. I in the manuscript, our SRDNet outperforms the other methods in dehazing performance by a large margin, which also demonstrates the effectiveness of our method.
(2) Maybe the proposed method can solve these problems, but the introduction part is not clear, and should be rewritten.
Response: Thanks for your advice. We have rewritten and updated the introduction in a clearer and more organized way. For better understanding, the changed parts are marked in blue in the resubmitted version.
3. The results on Drivingstereo dataset only compare with MSBDN on the quantitative results. More comparisons are needed to make the results on the real data set convincing.
Response: Thanks for your comment. We compare the quantitative results of more methods on the Drivingstereo dataset in Tab. 5 (i.e., Tab. III of the manuscript). Additionally, we provide more results in Fig. 1 (i.e., Fig. 6 in the resubmitted manuscript) to make the results on the real dataset convincing. Fig. 1 gives qualitative comparisons of our dehazed results with those of the state-of-the-art methods MSBDN, BidNet and GCANet on real hazy images from the Drivingstereo dataset. It can be observed that quite a lot of haze remains in the results of MSBDN, and that BidNet introduces color distortion. Compared with GCANet, our method performs better in the sky regions of row 4 and the road regions of row 5. Overall, our method produces visually appealing results.
To Reviewer #3
1. For GCSR, only the feature from each view is used as the input. Why the residue information of each view is not fed to GCSR like the SFE? Please clarify it.
Response: Thanks for your comment. This was an error in Fig. 2(b) of the manuscript; we have corrected the figure and the corresponding description. We use the residue information to conduct the channel refinement, and the feature from each view is used in the spatial refinement, as shown in Fig. 2 (i.e., Fig. 2(b) in the resubmitted manuscript).
2. Why the max and average pooling are both used in GCSR and the basic block, but only the max-pooling is used in WSCDN? Please clarify the reasons and add more ablation experiments.
Response: Thanks for your comments. There may be an ambiguity. In the GCSR module and the basic block, we use global average pooling (GAP) and global max pooling (GMP) to gather global statistical information and discriminative information, respectively. Our WSCDN consists of basic blocks and max-pooling operators with stride 2 that extract features. The max-pooling with stride 2 in the WSCDN is used to expand the receptive field. We have modified the description and the figures in the resubmitted manuscript to eliminate the ambiguity.
In addition, we have added ablation experiments in Tab. 6 (i.e., Tab. IV in the resubmitted manuscript) to explore the effects of the global max-pooling and the global average-pooling in the basic block. They show that combining the GAP and the GMP extracts richer information and obtains the best dehazing performance.
3. In Section IV-C, the authors should also add some subjective comparisons between the dehazing results of WSCDN and the proposed two-stage method.
Response: Thanks for your comment. We have modified the citation in our resubmitted manuscript.

It would be better to add the comparison of parameters and FLOPS for different methods.
Response: Thanks for your advice. We have added the comparison of parameters and FLOPS for different methods in Tab. 7 (i.e., Tab. II in the manuscript). As Tab. 7 shows, our SRDNet achieves a better trade-off between performance and computational cost than SSMDN, GCANet, MSBDN, and BidNet.

Abstract-The performance of stereo vision tasks degrades when haze exists in the input stereo image pair. Independently applying a single image dehazing algorithm to the left and right images is not optimal. To overcome this problem, we propose an effective framework, called SRDNet, for simultaneously dehazing stereo images. The main idea of SRDNet is to make full use of the stereo information from cross views to improve dehazing performance. It explicitly employs neither disparity estimation nor a correlation matrix. SRDNet comprises two parts: a weight-sharing coarse dehazing network (WSCDN) and a guided separated refinement network (GSRN). The WSCDN is used to predict a coarse dehazed image pair. Then the GSRN predicts the residues for the different views by extracting the fused information of cross views and separating the features of the different views with a guided channel and spatial refinement module. The residues are added to the coarse dehazed pair to refine it and remove the remaining haze. The experimental results demonstrate that our proposed SRDNet surpasses previous image dehazing methods by a significant margin both quantitatively and qualitatively. Moreover, our SRDNet can serve as a preprocessing step for stereo image-based 3D object detection and boost the 3D detection accuracy in hazy scenes.

I. INTRODUCTION
Stereo vision has numerous advantages over monocular vision. For example, stereo vision is able to provide more precise depth and three-dimensional information about objects and scenes [1], [2], [3], [4], [5], [6]. Therefore, stereo vision is widely used in practical applications such as advanced driving assistance systems, self-driving vehicles, unmanned surface vessels, and human-machine intelligence. However, the visibility of stereo images and the scene understanding ability of stereo vision deteriorate when haze occurs. Low-level vision tasks such as stereo dehazing [7], [8], [9], [10] and stereo deraining [11], [12], [13], which restore stereo images from corrupted inputs, are therefore necessary and have attracted increasing research attention in the computer vision community. Moreover, stereo images provide more information from cross views, which could boost the performance of dehazing methods since dehazing is depth related.
There are two strategies for dehazing stereo images. The first and most straightforward strategy is to independently dehaze the left image and the right image captured by a binocular vision system. It can be accomplished by applying existing single image dehazing methods such as GCANet [14] and MSBDN [15]. However, single image dehazing methods do not utilize the relationship between the binocular images. Therefore, directly applying single image dehazing methods is not optimal for dehazing binocular images. The second strategy [7], [8], [10] is to use stereo image dehazing methods, which utilize the depth information contained in the stereo image pairs to help predict the dehazed images and thereby demonstrate the advantage of stereo images. Na et al. [10] and Song et al. [8] explicitly estimated disparity and merged the intermediate features for disparity estimation into a dehazing network. Because disparity estimation from hazy images is a more challenging problem and it is hard to optimize the two tasks jointly, it is preferable not to use disparity directly for haze removal. BidNet [7] dehazes the binocular images by mining the correlation between the left and right images through a correlation matrix, without explicitly estimating disparity, and achieves state-of-the-art performance. Although the matrix is computed only along the horizontal dimension, the computational resources and memory needed for the matrix multiplication become very large as the width of the input image grows. The above stereo image dehazing methods are based on the atmosphere scattering model [16] and utilize the depth information contained in the stereo image pairs to help predict the transmission maps. Despite their success, the performance of these model-based methods depends heavily on the joint accurate estimation of the transmission map and the atmospheric light. The model is too crude to fit real foggy scenes, and the dehazed results of the model-based methods on real-world hazy images are unsatisfactory.
[Figure caption fragment] ... [15] and BidNet [7] methods for a hazy image from the Stereo Foggy Cityscapes dataset [7].
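The memory argument above can be made concrete with a rough element count. The sketch below is only illustrative: it assumes that a row-wise correlation matrix stores one W x W matrix per image row (H x W x W entries) and that concatenation stacks the left and right features along the channel axis (H x W x 2C entries). These shapes, the function names, and the channel count of 64 are assumptions made for the example and are not taken from the BidNet or SRDNet implementations.

```python
def correlation_matrix_elements(height: int, width: int) -> int:
    """Entries in an assumed row-wise left/right correlation volume (H x W x W)."""
    return height * width * width


def concatenated_feature_elements(height: int, width: int, channels: int) -> int:
    """Entries after concatenating left and right features along channels (H x W x 2C)."""
    return height * width * 2 * channels


if __name__ == "__main__":
    channels = 64  # hypothetical number of feature channels per view
    height = 256   # kept fixed so that only the width changes
    for width in (256, 512, 1024, 2048):
        corr = correlation_matrix_elements(height, width)
        cat = concatenated_feature_elements(height, width, channels)
        print(f"W={width:5d}  correlation={corr:>14,d}  concat={cat:>14,d}")
```

Under these assumptions, doubling the width quadruples the size of the correlation volume but only doubles the size of the concatenated features, which is the kind of gap the paragraph describes.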
To address the above issues, we design a stereo refinement dehazing network (SRDNet) in this paper to directly transform hazy stereo images into haze-free stereo images in a coarse-to-fine manner. Our SRDNet is not restricted to the computational relation of the physical model and can model more complicated and redundant computational relations to fit and generalize to real foggy scenes. Our SRDNet employs neither disparity nor a correlation matrix. It concentrates on the dehazing task and does not predict disparity, so the network is optimized for dehazing alone. First, a part of the haze is removed from both the left view and the right view by our weight-sharing coarse dehazing network, called WSCDN. Then we generate the residues between the coarse dehazed stereo image pair and the input foggy stereo image pair, which are combined with the features from the WSCDN as the input to a guided separated refinement network (GSRN). The GSRN is composed of a stereo feature extractor and a guided channel and spatial refinement (GCSR) module. Instead of constructing the matrix, the SRDNet concatenates the left features with the right features and mines the depth information through the stereo feature extractor. As the width of the input grows, the required computational resources and memory increase far more slowly than those of the matrix construction. The stereo feature extractor makes full use of the information of cross views. The GCSR module separates the features for different views and predicts the corresponding residues for the coarse haze-free image pair. The residues help refine the coarse dehazed stereo image pair to remove the remaining haze.
To summarize, our contributions are threefold:
(1) We propose a stereo refinement dehazing network (SRDNet) to directly recover the clean stereo images in a coarse-to-fine fashion, which is the first attempt to address stereo image dehazing progressively. The SRDNet makes full use of the information collaboratively encoded in the cross views without employing disparity or a correlation matrix. Our SRDNet is not limited to the simple physical model, and learns a more complicated model that better matches real foggy scenes. This is of great importance for eliminating the performance degradation of stereo-based 3D detectors caused by foggy inputs.
(2) The SRDNet incorporates a weight-sharing coarse dehazing network (WSCDN) and a guided separated refinement network (GSRN). The WSCDN removes a part of the haze and obtains a coarse dehazed image pair. The GSRN learns the residues for the corresponding views to refine the coarse dehazed image pair through a stereo feature extractor and a guided channel and spatial refinement (GCSR) module. The stereo feature extractor makes full use of the information of cross views. The GCSR module separates the features for different views and predicts the corresponding residues, which suppresses the negative effects of inaccurate and irrelevant information.
(3) Experiments demonstrate that our proposed SRDNet surpasses previous state-of-the-art image dehazing methods by a large margin both quantitatively and qualitatively. Specifically, our method outperforms the state-of-the-art stereo dehazing method by 4.70 dB and 4.44 dB in PSNR for the binocular image pair on the Stereo Foggy Cityscapes dataset. Moreover, our SRDNet can serve as a preprocessing step for stereo image-based 3D object detection and boost the 3D detection accuracy in hazy scenes. With the SRDNet as preprocessing, the average precision improves by 16.23% in the heavy haze condition on the easy set of the KITTI Val dataset.
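Since the gains above are reported in PSNR, it may help to recall how that metric is computed. The following is a minimal sketch using the standard definition, PSNR = 10 log10(MAX^2 / MSE); the function name, the flat-list image representation, and the assumption that pixel values lie in [0, 1] are illustrative choices and do not describe the evaluation code used for the paper.

```python
import math


def psnr(pred, target, max_value=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [0.2, 0.4, 0.6, 0.8]
    restored = [0.25, 0.35, 0.65, 0.75]  # every pixel is off by 0.05
    print(f"PSNR = {psnr(restored, clean):.2f} dB")
```

Because the scale is logarithmic, a gain of x dB corresponds to dividing the mean squared error by 10^(x/10); a gain of 4.70 dB therefore means the error is roughly 2.95 times smaller.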

A. Single Image Dehazing Methods
Single image dehazing methods can be divided into two categories: prior-based approaches and learning-based approaches. Most dehazing methods rely on the atmosphere scattering model formulated as:

$$I(x) = J(x)\,t(x) + A\,(1 - t(x)), \tag{1}$$

where $I(x)$ and $J(x)$ denote the hazy image and the clear image, respectively. $A$ is the global atmospheric light intensity, and $t(x)$ represents the transmission map. $t(x)$ is a function of depth: $t(x) = e^{-\beta d(x)}$, in which $\beta$ and $d(x)$ are the atmosphere scattering parameter and the distance, respectively. Prior-based approaches [29], [30], [31], [32], [33] employ strong priors as extra constraints to estimate the transmission maps and the global atmospheric lights, and then compute the haze-free results according to the atmosphere scattering model mentioned above. In order to boost the visibility of hazy images, Tan et al. [33] proposed to maximize the local contrast. The dark channel prior (DCP) [29] was put forward to estimate the transmission maps and restore clean outdoor images. Berman et al. [32] developed an effective non-local path prior for single image dehazing.
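As a worked illustration of the atmosphere scattering model, the sketch below synthesizes a hazy pixel with the standard form I(x) = J(x) t(x) + A (1 - t(x)), where t(x) = exp(-beta d(x)), and then inverts it when A and t are known. The function names, the sample values, and the lower bound `t_min` on the transmission are assumptions made for this example; prior-based methods differ precisely in how they estimate A and t, which is not shown here.

```python
import math


def transmission(depth, beta):
    """t(x) = exp(-beta * d(x)) for a single pixel."""
    return math.exp(-beta * depth)


def add_haze(clear, depth, airlight, beta):
    """Forward model: I = J * t + A * (1 - t)."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)


def remove_haze(hazy, depth, airlight, beta, t_min=1e-3):
    """Inverse model: J = (I - A) / t + A, with t clamped away from zero."""
    t = max(transmission(depth, beta), t_min)
    return (hazy - airlight) / t + airlight


if __name__ == "__main__":
    clear, depth, airlight, beta = 0.3, 20.0, 0.9, 0.05
    hazy = add_haze(clear, depth, airlight, beta)
    print(f"hazy pixel      = {hazy:.4f}")
    print(f"recovered pixel = {remove_haze(hazy, depth, airlight, beta):.4f}")
```

The inversion divides by t(x), so distant pixels with small transmission amplify any error in the estimated airlight or depth, which is one reason why methods that depend on accurate estimates of both quantities can struggle on real scenes.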
Recently, deep monocular dehazing methods have achieved great success. The earlier learning-based approaches [34], [35], [36], [37] also rely on the atmosphere scattering model: they first use a CNN to estimate the transmission maps and atmospheric lights, and then restore the clear images. Zhang et al. [18] regard image dehazing as an iterative process: they first divide a hazy image into different regions and then optimize the atmospheric light and the transmission simultaneously and iteratively based on local physical features. Several recent works [38], [14], [39], [40], [15], [41] reduce the image dehazing problem to an image-to-image translation problem. The Enhanced Pix2pix Dehazing Network (EPDN) [39] utilizes a generative adversarial network augmented with a well-designed enhancer to restore clear images directly. The Gated Context Aggregation Network (GCANet) [14] adopts the smoothed dilation technique and fuses multi-level features for haze removal. The GCANet learns the residue between the clear image and the input foggy one. In contrast, the learning target of our guided separated refinement network is the residues between the coarse dehazed stereo image pairs and the final haze-free ones. GridDehazeNet [40] is an enhanced GridNet [42] with residual dense blocks [43] for dehazing. The Multi-Scale Boosted Dehazing Network (MSBDN) [15] applies the boosting strategy and the error back-projection technique to improve dehazing performance. To handle the wide diffusion caused by haze, an end-to-end Pyramid Global Context Network (PGCNet) [17] is proposed to learn the global context. Zhang et al. [19] designed two modules that adaptively fuse multi-level features to keep fine details and extract semantics.

B. Stereo Image Dehazing Methods
Stereo image processing has attracted increasing attention due to advantages such as providing comparable depth accuracy. Methods based on stereo images have made great progress in tasks such as 3D object detection [5], [4], [6] and stereo matching [44], [45], [46]. In the past two years, 18 stereo-based 3D object detectors have appeared on the leaderboard of the 3D detection evaluation on the KITTI website. Stereo images provide more information from cross views and have thus been used to improve the quality of various low-level tasks, including super-resolution [47], stereo image deraining [11], [12], [13] and stereo image dehazing [7], [8], [9], [10]. Li et al. presented a method for recovering clear images from foggy videos that jointly predicts scene depths. They argued that stereo matching and dehazing can reinforce each other. Song et al. [8] and Yun et al. [10] both proposed deep-learning-based multi-task methods that simultaneously estimate a clear latent image and disparity from a hazy stereo image pair. The intermediate features for disparity estimation are fused into a dehazing network so that the two tasks enhance each other. Song et al. [48] extend the work [8] by introducing an attentional feature fusion to effectively integrate depth-related features from the matching cost and the haze transmission. Their attentional feature fusion consists of a channel/spatial attention fusion and a gated fusion. The channel/spatial attention fusion is conducted separately on the stereo features or on the transmission features in a self-learning way. The gated fusion adaptively fuses the stereo features and the transmission features through a learned weight map. In contrast, we propose a guided channel and spatial refinement (GCSR) module to extract the features of each view from the mixed stereo features. The GCSR module is composed of a guided channel refinement and a guided spatial refinement, which are not learned in a self-learning way but under the guidance of the residue information from the WSCDN and the feature from each view. Recently, BidNet [7] was proposed, which does not explicitly estimate disparity. Instead, it explicitly computes a correlation matrix of the left and right features, which is closely related to disparity. By contrast, our SRDNet employs neither disparity nor a correlation matrix. Moreover, the above methods are based on the atmosphere scattering model, which is too simple to fit real foggy scenes.
To address the aforementioned issues, this paper designs a stereo refinement dehazing network to directly recover the clean stereo pair from the foggy input pair, which utilizes the information from cross views and is more effective than single image dehazing methods in stereo tasks without estimating disparity.

III. METHODS
The visibility of stereo images and the scene understanding ability of stereo vision are degraded when haze exists. Stereo dehazing could be a preprocessing step for high-level vision tasks such as stereo image-based 3D object detection. Unlike existing stereo dehazing methods, which are based on the physical model, we propose in this paper an end-to-end stereo refinement dehazing network (SRDNet) to directly recover the clean stereo images from the foggy input pair. It restores the haze-free stereo images in a coarse-to-fine manner: it first learns a coarse haze-free image pair and then learns the residues that refine the coarse images by excavating the depth information from the stereo image pair.

IEEE Transactions on Circuits and Systems for Video Technology
A. Overall Architecture
Fig. 2(a) shows the overall architecture of the SRDNet, which contains two parts: a weight-sharing coarse dehazing network (WSCDN) and a guided separated refinement network (GSRN). The WSCDN directly predicts a coarse dehazed image pair, and its weights are shared between the left and right images. Most of the haze is removed in the coarse dehazed stereo pair. The residues between the coarse dehazed stereo image pair and the input foggy stereo image pair are combined with the left and right features from the WSCDN and input to the GSRN. The GSRN uses a stereo feature extractor to fuse the information of cross views instead of predicting disparity. In addition, the GSRN uses a guided channel and spatial refinement module to separate the features for the different views and predict the residues. The predicted residues refine the coarse dehazed images, removing the remaining haze and yielding clearer image pairs.
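The data flow of the two stages can be summarized in a few lines. The sketch below is schematic only: `coarse_net` and `refine_net` are hypothetical callables standing in for the WSCDN and the GSRN, images are flat lists of pixel values, and the two toy functions exist solely so that the example runs. None of these names come from the authors' code.

```python
def srdnet_forward(hazy_left, hazy_right, coarse_net, refine_net):
    """Two-stage data flow: shared coarse stage, then joint residue refinement."""
    # Weight sharing: the same callable processes both views.
    coarse_l, feat_l = coarse_net(hazy_left)
    coarse_r, feat_r = coarse_net(hazy_right)

    # Residues between the coarse results and the foggy inputs.
    res_l = [c - i for c, i in zip(coarse_l, hazy_left)]
    res_r = [c - i for c, i in zip(coarse_r, hazy_right)]

    # The refinement stage sees both views and returns one residue per view.
    residue_l, residue_r = refine_net(feat_l, feat_r, res_l, res_r)

    # The predicted residues are added to the coarse results.
    final_l = [c + r for c, r in zip(coarse_l, residue_l)]
    final_r = [c + r for c, r in zip(coarse_r, residue_r)]
    return (coarse_l, coarse_r), (final_l, final_r)


def toy_coarse_net(image):
    """Toy stand-in: halves every pixel and returns the input as its feature."""
    return [0.5 * v for v in image], list(image)


def toy_refine_net(feat_l, feat_r, res_l, res_r):
    """Toy stand-in: predicts a constant residue of 0.25 for every pixel."""
    return [0.25] * len(feat_l), [0.25] * len(feat_r)


if __name__ == "__main__":
    coarse, final = srdnet_forward([0.5, 1.0], [1.0, 0.0], toy_coarse_net, toy_refine_net)
    print("coarse:", coarse)
    print("final: ", final)
```

The point of the sketch is the wiring: one set of coarse-stage weights serves both views, while the refinement stage receives the features and coarse residues of both views at once.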

B. Weight-Sharing Coarse Dehazing Network
The weight-sharing coarse dehazing network (WSCDN) has an encoder-decoder structure and is shared by the left view and the right view. The encoder of the WSCDN takes the left foggy image $I_l$ or the right foggy image $I_r$ as input and stacks basic blocks and max-pooling with stride 2 to extract features. The encoder first uses a basic block to learn better input features and then downsamples them through max-pooling followed by a basic block; this is repeated 4 times and progressively enlarges the receptive field. The decoder correspondingly applies bilinear interpolation followed by one basic block, 4 times, to restore the detailed structure. Feature maps at the same level of the encoder and the decoder are concatenated to preserve spatial information at each resolution. The final feature maps $F_l$ and $F_r$ output by the weight-shared decoders for the left view and the right view are further fed into two separate $3 \times 3$ convolutional layers to restore the coarse dehazed image pair $\hat{J}^c_l$ and $\hat{J}^c_r$, from which most of the haze has been removed.
Basic Block: As shown in Fig. 2(c), a basic block is composed of a $3 \times 3$ convolutional layer with ReLU and another $3 \times 3$ convolutional layer with a self-gated mechanism. The self-gated mechanism learns the channel-wise weight $G_s$ for the input feature $F_{in}$ to recalibrate the features adaptively. We apply global max pooling (GMP) and global average pooling (GAP) along the spatial dimension to obtain the global spatial information, which is formulated as Eq. 2. Then we use two $1 \times 1$ convolutional layers, followed by ReLU and sigmoid respectively, to further fuse the useful information and generate channel-wise weights, which are multiplied with the input feature to recalibrate it along the channel dimension.

$$G_s = \sigma\big(C_{1\times1}\big(\delta\big(C_{1\times1}\big(\mathrm{Concat}(P_{max}(F_{in}), P_{avg}(F_{in}))\big)\big)\big)\big), \tag{2}$$

where Concat means concatenating the outputs of the global max pooling $P_{max}$ and the global average pooling $P_{avg}$, $C_{1\times1}$ denotes a $1 \times 1$ convolutional layer, and $\sigma$ and $\delta$ are the sigmoid function and the ReLU function, respectively. Finally, the output feature is obtained as:

$$F_{out} = G_s \otimes F_{in},$$

where $\otimes$ refers to the channel-wise product.
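The self-gated mechanism described above (global max and average pooling, concatenation, two 1 x 1 convolutions with ReLU and sigmoid, then a channel-wise product) can be sketched in plain Python. On a pooled vector a 1 x 1 convolution reduces to a matrix-vector product, which is what the sketch uses. The hidden width, the omission of biases, the random weights, and all function names are assumptions made for illustration; they are not the trained layers of the basic block.

```python
import math
import random


def global_max_pool(feature):
    """Per-channel maximum over all spatial positions of a C x H x W feature."""
    return [max(max(row) for row in channel) for channel in feature]


def global_avg_pool(feature):
    """Per-channel mean over all spatial positions of a C x H x W feature."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature]


def linear(weights, vector):
    """A 1 x 1 convolution applied to a pooled vector is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def self_gate(feature, w1, w2):
    """Return the channel weights and the recalibrated feature."""
    pooled = global_max_pool(feature) + global_avg_pool(feature)  # Concat, length 2C
    hidden = [max(0.0, h) for h in linear(w1, pooled)]            # ReLU
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(w2, hidden)]  # sigmoid
    out = [[[g * x for x in row] for row in channel]
           for g, channel in zip(gate, feature)]                  # channel-wise product
    return gate, out


if __name__ == "__main__":
    random.seed(0)
    channels, height, width, hidden_size = 4, 3, 3, 2  # hypothetical sizes
    feature = [[[random.uniform(-1, 1) for _ in range(width)]
                for _ in range(height)] for _ in range(channels)]
    w1 = [[random.uniform(-1, 1) for _ in range(2 * channels)] for _ in range(hidden_size)]
    w2 = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(channels)]
    gate, _ = self_gate(feature, w1, w2)
    print("channel weights:", [round(g, 3) for g in gate])
```

Each channel weight lies strictly between 0 and 1, so the gate can only attenuate a channel relative to the others, which is how it recalibrates the feature along the channel dimension.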

C. Guided Separated Refinement Network
In order to utilize the information from cross views and refine the predicted coarse dehazed stereo images $\hat{J}^c_l$ and $\hat{J}^c_r$, we design a guided separated refinement network, called GSRN. The GSRN first uses a stereo feature extractor to extract stereo mixed features and fuse the information from cross views. Then a guided channel and spatial refinement (GCSR) module separates the mixed features for the different views. The separated features predict the residues $\hat{R}_l$ and $\hat{R}_r$ for refining the left coarse dehazed image and the right coarse dehazed image, respectively.

Stereo Feature Extractor: For the stereo feature extractor, we combine the left feature $F_l$ and the right feature $F_r$ output from the WSCDN, the original stereo foggy images $I_l$ and $I_r$, and the predicted coarse dehazed stereo images $\hat{J}^c_l$ and $\hat{J}^c_r$ as the input $S_{in}$, which is formulated as: Specifically, we input the residues between the coarse dehazed stereo image pair and the input foggy stereo image pair to add cues about the haze that has already been detected. The stereo feature extractor has a structure similar to that of the WSCDN. It contains three Basic Block-MaxPooling stages and three bilinear interpolation-Basic Block stages, in which skip connections are applied between features across scales (s = 2, 4, 8) with the same dimension. The extractor only downsamples the input with stride 8 to keep more detail information compared with the WSCDN. Through the stereo feature extractor, the mixed feature $F_m$ is obtained, which includes the information from cross views.

Guided Channel and Spatial Refinement Module: If the mixed feature $F_m$ output from the stereo feature extractor were directly utilized to predict the residues for the coarse dehazed stereo pair, some confusing information would be introduced. Therefore, we design a guided channel and spatial refinement (GCSR) module to guide the network to learn the respective residue for the corresponding view. As shown in Fig. 2(b), our GCSR module consists of two steps: guided channel refinement and guided spatial refinement. The guided channel refinement is similar to the self-gated mechanism in the basic block. The difference is that the guided channel refinement learns the channel weights $G^{left}_{cr}$ for the mixed feature from the left feature $F^{res}_l$ instead of from the mixed feature itself. The left feature $F^{res}_l$ is learned from the coarse residues ($\hat{J}^c_l - I_l$) through a 3 × 3 convolution. The detailed process for predicting the residue of the left view is given below: $G^{left}_{cr} = \sigma(C_{1\times1}(\delta(C_{1\times1}(P_c))))$, where $F^{left}_{cr}$ denotes the left feature after guided channel refinement by the left feature. Analogously, we replace the left feature $F^{res}_l$ by the right feature $F^{res}_r$ learned from the coarse residues ($\hat{J}^c_r - I_r$) to guide the channel refinement and obtain the refined right feature $F^{right}_{cr}$.
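The guided channel refinement described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the hidden width, and the random weights are hypothetical; $P_c$ is assumed to be the concatenation of the global max-pooled and global average-pooled guide feature; and the refined feature is assumed to be the channel-wise product of the learned gate with the mixed feature.

```python
import math
import random


def global_pool(feature):
    """feature: list of C channels, each an H x W nested list.
    Returns per-channel global max pooling and global average pooling."""
    p_max = [max(max(row) for row in ch) for ch in feature]
    p_avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    return p_max, p_avg


def linear(weights, vec):
    """A 1x1 convolution applied to a pooled (1 x 1 spatial) tensor
    reduces to a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def guided_channel_refine(mixed, guide, w1, w2):
    """Gate the channels of `mixed` with weights learned from `guide`.
    Assumed form: gate = sigmoid(C1x1(ReLU(C1x1(P_c))))."""
    p_max, p_avg = global_pool(guide)
    p_c = p_max + p_avg                                   # Concat, length 2C
    hidden = [max(0.0, h) for h in linear(w1, p_c)]       # ReLU
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(w2, hidden)]  # sigmoid
    refined = [[[gate[c] * v for v in row] for row in ch]
               for c, ch in enumerate(mixed)]             # channel-wise product
    return gate, refined


# Toy example with hypothetical sizes and random (untrained) weights.
random.seed(0)
C, H, W, HID = 4, 3, 3, 2
mixed = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
guide = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(2 * C)] for _ in range(HID)]
w2 = [[random.uniform(-1, 1) for _ in range(HID)] for _ in range(C)]
gate, refined = guided_channel_refine(mixed, guide, w1, w2)
```

Calling the same function with the right guide feature in place of the left one would give the gate for the right view, mirroring the symmetric treatment in the text.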
As for the guided spatial refinement, we use a 3 × 3 convolutional layer to learn the spatial offsets $\Delta p^{left}_k$ from the left feature $F_l$ for the left view. Acting as the kernel offsets of a deformable convolution operator, $\Delta p^{left}_k$ augments the regular sampling grid $G$ at position $p_0$, yielding a refined feature $F^{left}_{sr}$, as follows: where $F^{left}_{cr}$ is the input feature to be sampled, and $G$ is the regular grid used to sample the input features (e.g., for a 3 × 3 kernel with dilation 1, $G = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$). $p_k$ is a position in $G$, whose corresponding convolutional weight is $w_k$. Finally, we apply another 3 × 3 convolutional layer on the spatially refined feature $F^{left}_{sr}$ to predict the left residue $\hat{R}_l$. The residue $\hat{R}_r$ for the right view is learned analogously. The final dehazed stereo image pair $\hat{J}_l$ and $\hat{J}_r$ is generated as follows:
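For orientation, the two relations described in this paragraph can be written out as below. Both are assumptions inferred from the surrounding definitions rather than the paper's typeset equations. The first is the standard deformable-convolution form, with the weight at grid position $p_k$ written as $w_k$. The second assumes that each predicted residue is added to its coarse dehazed image.

```latex
% Assumed: standard deformable-convolution form of the guided spatial refinement
F^{left}_{sr}(p_0) = \sum_{p_k \in G} w_k \cdot F^{left}_{cr}\bigl(p_0 + p_k + \Delta p^{left}_k\bigr)

% Assumed: additive residual refinement of the coarse dehazed images
\hat{J}_l = \hat{J}^c_l + \hat{R}_l, \qquad \hat{J}_r = \hat{J}^c_r + \hat{R}_r
```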

D. Loss Function
Our SRDNet is trained with the smooth L1 loss and the perceptual loss [49]. The total loss $L$ is defined as: where $L_S$ and $L_P$ are the smooth L1 loss and the perceptual loss, respectively. $\hat{J}^c$ and $J$ are the predicted coarse dehazed image and the ground truth, respectively. $\hat{R}$ is the predicted residue for the corresponding coarse dehazed image.
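As a reference for the smooth L1 term, the following is a minimal sketch of the usual smooth L1 definition on flattened pixel values. The threshold `beta = 1.0` is an assumption (it matches the default of PyTorch's `SmoothL1Loss`). The perceptual loss and the weighting between the two terms are not reproduced, and the function name is hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss over two equally long sequences of pixel values.
    Quadratic for small errors (|d| < beta), linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


# Small error falls in the quadratic regime, large error in the linear regime.
small = smooth_l1([0.0], [0.5])   # 0.5 * 0.5^2 = 0.125
large = smooth_l1([0.0], [2.0])   # 2.0 - 0.5 = 1.5
```

The quadratic region keeps gradients small near convergence, while the linear region limits the influence of large per-pixel errors.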

IV. EXPERIMENTS
It is a great challenge to collect a large-scale foggy stereo dataset including real-world foggy stereo images and their clear counterparts for learning-based stereo dehazing methods. To address this problem, Pang et al. [7] extended the Foggy Cityscapes dataset to a Stereo Foggy Cityscapes dataset with 8,925 stereo foggy image pairs in the training set and 1500

A. Training Details
We train the SRDNet in PyTorch with an input size of 256 × 256 and augment the training data with random vertical flips. We set the training batch size to 8 and the total number of epochs to 30. We use the Adam optimizer [50], where $\beta_1$ and $\beta_2$ are set to the default values of 0.9 and 0.999, respectively. We employ the cosine annealing strategy [51] to adjust the learning rate from the initial value $1 \times 10^{-3}$ to 0. The cosine function is formulated as: where the total number of batches is $T$, and $l_0$ and $l_t$ are the initial learning rate and the learning rate at batch $t$, respectively. The training is carried out on 2 TitanX GPUs and only one GPU is used for testing.
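The schedule can be sketched as follows, assuming the standard cosine annealing form $l_t = \frac{1}{2} l_0 (1 + \cos(t\pi / T))$, which decays from $l_0$ at $t = 0$ to 0 at $t = T$. The function name is hypothetical. The value of $T$ below is only an illustration: it assumes the 8,925 Stereo Foggy Cityscapes training pairs, a batch size of 8, 30 epochs, and that the last partial batch of each epoch is kept.

```python
import math


def cosine_lr(t, total_batches, lr0=1e-3):
    """Learning rate at batch t under cosine annealing from lr0 down to 0."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / total_batches))


# Illustrative batch count (assumption: the last partial batch is kept).
steps_per_epoch = math.ceil(8925 / 8)
total = steps_per_epoch * 30

lr_start = cosine_lr(0, total)
lr_mid = cosine_lr(total // 2, total)
lr_end = cosine_lr(total, total)
```

In a PyTorch training loop, `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` set to the number of batches and `eta_min=0`, stepped once per batch, would give the same curve.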

B. Comparison with State-of-the-art Methods
The proposed network is tested on the synthetic Stereo Foggy Cityscapes validation set for qualitative and quantitative comparisons with state-of-the-art methods, including SSMDN [8], GCANet [14], MSBDN [15] and BidNet [7]. We use the PSNR and the SSIM [52] to evaluate the quality of the restored images. In addition, we compare the number of parameters and the FLOPs of the different methods. For fair comparisons, we re-train GCANet and MSBDN according to the training details provided in their papers on the same Stereo Foggy Cityscapes training set and evaluate them on the same Stereo Foggy Cityscapes validation set as ours. It is worth noting that we test all methods with an image size of 1024 × 512.
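For reference, the PSNR used here can be computed as in the sketch below. It assumes pixel values normalized to [0, 1], and the function name is hypothetical. The PSNR is expressed in dB, whereas the SSIM is a dimensionless index.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 gives MSE = 0.01, i.e. a PSNR of 20 dB.
example = psnr([0.5, 0.5], [0.6, 0.4])
```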
Quantitative Results: Tab. I shows the quantitative comparison on the Stereo Foggy Cityscapes validation set between our SRDNet and SSMDN [8], GCANet [14], MSBDN [15] and BidNet [7] in terms of the PSNR and the SSIM. The single image dehazing methods only restore the left images. The stereo image dehazing methods, BidNet and our SRDNet, obtain the dehazed left images and the dehazed right images simultaneously. It can be found that our proposed SRDNet surpasses all four state-of-the-art methods by a wide margin.

Fig. 3. Qualitative comparison of the state-of-the-art methods [14], [15], [7] with the presented SRDNet on the Stereo Foggy Cityscapes validation set.

... have thick fog. We can observe that the MSBDN cannot remove the haze entirely, especially in row 5. The results of the BidNet in the sky regions of the first three rows are unsatisfactory. The GCANet recovers images with excessive brightness relative to the ground truth. In addition, a great amount of haze still remains in the sky of the first row and the building of the fifth row for the GCANet. In contrast, our method achieves better and more visually appealing results. Analogously, the corresponding right images dehazed by our method are also appealing. In addition, we present some haze-free stereo image pairs produced by our method in Fig. 4.

C. Ablation Study
We conduct the ablation study on the Stereo Foggy Cityscapes validation set. Tab. III shows the impacts of the WSCDN and our GCSR module. When we remove the GSRN and use the WSCDN directly to restore the clear stereo pair, the PSNR values in Tab. III are reduced by 1.72 dB and 1.77 dB. This demonstrates that dehazing only once is not optimal and that our GSRN indeed refines the dehazing results. To demonstrate the effectiveness of the GCSR module, we perform an experiment in which the GCSR module is replaced by a 3 × 3 convolutional layer. As shown in Tab. III, the PSNR decreases by 1.57 dB and 1.78 dB for the left dehazed images and the right dehazed images, respectively.

The values of the SSIM also drop by more than 0.0018 compared with employing the GCSR module, which shows that the concatenated stereo features contain confusing information and that our GCSR module can separate the useful information belonging to the left image from that belonging to the right image. We further conduct ablation experiments to explore the effects of the max pooling and the average pooling in the Basic Block of the WSCDN in Tab. IV. Tab. IV shows that combining the global average pooling (GAP) and the global max pooling (GMP), which gather global statistic information and discriminative information respectively, extracts more abundant information. We also provide subjective comparisons between the dehazing results of the WSCDN and the results of the GSRN in Fig. 5, which demonstrate that the GSRN indeed refines the dehazing results.

We leverage the fog simulation pipeline described in [7] to add fog to the sunny and cloudy sequences in the Drivingstereo dataset, and randomly divide the dataset into a training set and a validation set. We sample the atmospheric light randomly from 0.7 to 1.0 and use β ∈ {0.005, 0.01, 0.02} for each stereo image pair. We fine-tune our model, BidNet [7], MSBDN [15] and GCANet [14], all pre-trained on the Stereo Foggy Cityscapes training set, on the generated foggy Drivingstereo training set containing 2400 stereo pairs and evaluate them on the generated validation set containing 800 foggy stereo pairs. Tab. V compares the dehazing performance of our SRDNet and the other methods on the synthetic foggy Drivingstereo validation set. Our SRDNet achieves an obvious gain of 3.36 dB in the PSNR and 0.054 in the SSIM compared with the MSBDN. In terms of the SSIM, our method outperforms the second-best method, GCANet, by 0.0055.

Fig. 6 gives a qualitative comparison of our dehazed results with the state-of-the-art methods MSBDN [15], BidNet [7] and GCANet [14] on the real hazy images from the Drivingstereo dataset. It can be observed that quite a lot of haze remains in the results of MSBDN, and that BidNet introduces color distortion. Compared with GCANet, our method performs better in the sky regions of row 4 and the road of row 5. Our method produces visually appealing results, and the right images dehazed by our method are of analogous quality.

Fig. 6. Evaluation on real foggy stereo images from the Drivingstereo dataset [53]. Columns: Foggy, MSBDN [15], BidNet [7], GCANet [14], SRDNet. We only present the left dehazed images.

E. Perceptual Quality for High-level Vision Tasks
As stereo dehazing algorithms are usually used as a pre-processing step for high-level computer vision tasks such as 3D object detection, the accuracy of 3D object detection can be treated as an indirect indicator of the stereo dehazing quality. We adopt the accuracy of stereo image-based 3D object detection on the KITTI dataset to evaluate the perceptual quality of our dehazing method. The KITTI dataset [54] is a challenging benchmark for evaluating the performance of 3D object detection, which is divided into a training set and a validation set with 3,712 images and 3,769 images respectively. In order to generate foggy stereo images for the KITTI dataset, we first estimate the depth map for each image with the stereo matching method PSMNet [55], and then use the depth map to synthesize foggy stereo images using the fog simulation pipeline described in [7]. This synthetic dataset is referred to as the Stereo Foggy KITTI dataset in this work. We sample the atmospheric light randomly from 0.7 to 1.0 and use β ∈ {0.02, 0.04, 0.06} for each stereo image pair. Hence, there are 11,136 stereo foggy image pairs for training and 11,307 stereo foggy image pairs for validation. We first train the SRDNet on the Stereo Foggy KITTI training set following Sec. IV-A. On the Stereo Foggy KITTI validation set, our SRDNet improves the PSNR values from 11.89 dB and 10.96 dB to 22.83 dB and 22.14 dB for the left view and the right view respectively. For the SSIM, our SRDNet brings gains of 0.2301 and 0.2303 for the left view and the right view respectively. For the 3D detection accuracy, we choose Stereo R-CNN [4] pre-trained on the KITTI clear training set to evaluate the dehazed results of our method in light (β = 0.02), medium (β = 0.04), and heavy (β = 0.06) foggy scenes. Specifically, the Stereo R-CNN model uses ResNet101 [56] and FPN [57] as the backbone.
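The fog synthesis step can be illustrated with the sketch below. The exact pipeline of [7] is not reproduced here. The sketch assumes the standard atmospheric scattering model $I = J\,t + A\,(1 - t)$ with transmission $t = e^{-\beta d}$, where $d$ is the scene depth and $A$ is the atmospheric light. The function name, the example pixel, and the depth values are hypothetical.

```python
import math
import random


def add_fog(pixel, depth, beta, airlight):
    """Apply the (assumed) atmospheric scattering model to one RGB pixel.
    pixel: clear-scene colour in [0, 1]; depth: scene depth;
    beta: attenuation coefficient; airlight: atmospheric light in [0, 1]."""
    t = math.exp(-beta * depth)  # transmission decreases with depth
    return tuple(c * t + airlight * (1.0 - t) for c in pixel)


betas = [0.02, 0.04, 0.06]              # light, medium, heavy fog
random.seed(0)
airlight = random.uniform(0.7, 1.0)     # atmospheric light sampled in [0.7, 1.0]

clear = (0.2, 0.4, 0.6)
near = add_fog(clear, 0.0, betas[2], airlight)    # zero depth: pixel unchanged
far = add_fog(clear, 1.0e4, betas[2], airlight)   # very far: approaches airlight

# Three fog densities per stereo pair reproduce the stated dataset sizes.
train_pairs = 3712 * len(betas)
val_pairs = 3769 * len(betas)
```

Under this model a larger β lowers the transmission at every depth, which is consistent with the light, medium, and heavy settings used in the text.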
Generally, the metrics of 3D detection and 3D localization performance are the Average Precision for the 3D box ($AP_{3d}$) and for the bird's-eye view ($AP_{bv}$). $AP_E$, $AP_M$ and $AP_L$ are the average precision on the easy, moderate and hard sets divided according to the KITTI setting, respectively. Tab. VI compares the 3D detection accuracy ($AP_{3d}$) of Stereo R-CNN alone and of SRDNet followed by Stereo R-CNN in foggy scenes using IoU = 0.7 on the Stereo Foggy KITTI validation set, which shows that our SRDNet, as the pre-processing step for the detector, stably boosts the accuracy under light, medium, and heavy haze. Specifically, the heavy haze degrades $AP_{3d}$ by 30 ... hard sets. By appending the SRDNet, the $AP_E$, $AP_M$ and $AP_L$ improve by 16.23%, 12.27% and 7.98% respectively in the heavy haze condition. Further, as shown in Tab. VI, the haze degrades the 3D localization accuracy ($AP_{bv}$) of the Stereo R-CNN. After prepending our SRDNet, the $AP_{bv}$ for the bird's-eye view obtains a notable absolute gain, demonstrating the high perceptual quality of our stereo dehazing method. We also compare our SRDNet with BidNet, both of which are binocular dehazing methods, in the 3D detection task. In terms of each metric of $AP_{3d}$ in the light, medium, and heavy hazy scenes, Tab. VII shows that our method obtains higher accuracy and has better perceptual quality. Fig. 7 gives some visual stereo detection results of Stereo R-CNN under light, medium, and heavy haze. The bird's-eye view images projected from the 3D boxes are also presented. When the haze gets heavier, more objects are missed by the Stereo R-CNN. After prepending the SRDNet, the missed objects are correctly detected and located. Our SRDNet is flexible and can pre-process the foggy stereo inputs for up-to-date stereo-based 3D object detectors, which mitigates the degradation caused by the foggy inputs.

In Tab. VI, Heavy + S and Heavy + SR are short for Heavy + Stereo R-CNN and Heavy + SRDNet followed by Stereo R-CNN, respectively; similarly for the other groups.