Coal Image Recognition Method Based on Improved Semantic Segmentation Model of PSPNET Network



Introduction
The coal industry has always been one of the important pillars of China's economy. Currently, in order to meet the growing energy demands across the country, the intelligent construction of coal mines has become an important direction for the future development of the coal industry. The goal of this construction is to establish a high-quality, efficient, safe, and low-pollution coal production system by introducing advanced technologies and equipment. One of the key technologies in the process of intelligent construction is coal-rock recognition. However, underground mining presents challenges due to the complex and variable environment, as well as the unique characteristics of each mine. The "one mine, one condition" feature becomes one of the constraints and difficulties in the development and application of coal-rock recognition technology in current intelligent coal mine construction.
For a long time, domestic scholars have conducted extensive research on coal-rock recognition, which can be broadly divided into contact detection and non-contact detection. Contact detection methods mainly distinguish coal from rock by the different vibration signals generated by the shearer during coal-rock cutting (Liu Junli, Zhao Haoge, & Li Changyou., 2013). Examples include the vibration monitoring method proposed by the US Bureau of Mines Pittsburgh Research Center in 1982 and the comprehensive coal-rock recognition method proposed by Tian Liyong et al. in 2016, which is based on the analysis of the forces on the shearer drum arm idler wheel shaft (Tian LY, Mao J, & Wang QM., 2016). These techniques establish intelligent coal-cutting trajectories based on the different stresses generated during coal-rock excavation. However, the identification accuracy of vibration sensors is easily affected by other vibrations during mining, and the approach struggles when coal and rock produce similar vibration frequencies. To avoid these issues, researchers have proposed non-contact detection methods, typically radioactive detection and infrared detection. Radioactive detection distinguishes coal seams from rock layers by their different absorption peaks and valleys of radiation, and is represented by the gamma-ray detection method initially proposed by British Solder Electrical Instrument. However, due to the limitations of radiation wavelength, it can only detect shallow coal seams, and the need to retain roof coal reduces the coal recovery rate. Infrared detection methods analyze the spectra reflected after the radiated light reaches the surface of coal and rock in order to differentiate substance types. Examples include the combined visible-near-infrared and thermal infrared detection technology proposed by Song Liang et al. in 2017 (SONG Liang, LIU Shanjun, YU Jasmine, MAO Yachun, & WU Lixin., 2017) and the active-excited thermal infrared detection method proposed by Qiang Zhang et al. (Qiang Zhang, Junming Liu, Jieying Gu, & Ying Tian., 2022). However, radiation is easily affected by coal dust and water mist underground, and when coal and rock have similar thermophysical properties, it becomes difficult to distinguish them. In 2019, Haijian Wang et al. (H. Wang et al., 2019) proposed an infrared thermal image feature recognition method based on the cutting process, but it has strict application requirements. In 2021, Muqin Tian et al. (M. Tian, Q. Li, C. Xv, Y. Yang, & Z. Li., 2021) used the vibration signals generated by the shearer when cutting coal and rock of different proportions as a sample information library for different cutting signals. They converted the vibration signals generated during the cutting process into two-dimensional images using the Gramian Angular Difference Field (GADF) and the Gramian Angular Summation Field (GASF), and introduced a transfer-learning ResNet-34 network model for recognition. However, this method also has limited applicability and cannot be widely used.
With the development of artificial intelligence algorithms and computing devices in recent years, image recognition has gradually become a hot topic in current research. The main approach is to utilize the differences in surface texture and color between coal and rock for autonomous discrimination. For example, in 2018, Huang Lei et al. (Huang L., & Guo Chaoya., 2018) improved the feature performance of coal using local binary pattern algorithms, variation functions, and local variance maps. With the development of machine learning and hardware devices, classification and recognition algorithms such as large-scale neural networks have been widely used in coal-rock image recognition. Wang Jiancai et al. (Wang Jiancai, Li Jin, Li Zhijun, Shi Jianting, & Tang Yaorui., 2022) introduced an attention mechanism based on the YOLO V5 framework, which significantly improved both the accuracy and speed of recognition. Wang Guofa et al. (Wang Guofa, Ren Huaiwei, Zhao Guorui, Zhang Desheng, Wen Zhiguo, Meng Lingyu, & Gong Shixin., 2022) proposed fusion recognition using a local binarization fusion CNN to extract features of the upper coal layer, achieving high recognition accuracy. Hu Tongxing et al. (Hua T.X., Xing C.E., & Zhao L., 2019) achieved good results in coal-rock recognition using the Faster R-CNN framework. Zhang Haibo (Zhang, Haibo., 2022) adopted a fusion method combining vision and lidar, fusing image recognition pixels with lidar point clouds and using KD trees for fusion and search, resulting in a recognition rate of over 93.2%. The research above shows that image recognition technology is feasible for coal-rock recognition. However, it is susceptible to image blurring caused by coal dust during cutting and to refraction caused by water mist, which lower the signal-to-noise ratio and reduce segmentation accuracy. Coal semantic segmentation algorithms also still face challenges in segmenting coal edges and details.
In order to further improve the generality and accuracy of image-based coal rock recognition techniques, this study focuses on the impact of different defogging techniques on image denoising and the improvement of the PSPNet semantic segmentation model by inserting an attention module. It proposes a fusion method that combines multiple techniques, namely the coal-rock recognition method with enhanced image features. A comparison is made between the effects of various defogging techniques on denoised images and the extraction of network features. The improved PSPNet network is used for semantic segmentation, and multi-parameter training is adopted to obtain the best-performing model. Experimental results show that using feature-enhanced semantic segmentation for coal-rock recognition has certain advantages compared to ordinary images and provides a new direction for the multi-parameter fusion of coal-rock recognition technology.

The PSPNET Network Model
The essence of semantic segmentation is to classify each pixel in an image. Currently, the mainstream framework for scene analysis is based on fully convolutional networks (FCN). However, due to the diversity of scenes and the continuity of information features, there are still many shortcomings in the application of semantic segmentation. To address the integration of contextual information features, the Pyramid Scene Parsing Network (PSPNet) introduces a pyramid pooling module (PPM). This module consists of four levels. The first level is the coarsest: it applies global average pooling to the input feature layer to generate a single-bin output. The second and third levels divide the feature layer into 2x2 and 3x3 sub-regions, respectively, and then perform average pooling (Avgpool) on each sub-region. The fourth level is the finest, dividing the feature layer into 6x6 sub-regions and then pooling each sub-region. Because the feature maps output by the different levels have different scales, a 1x1 convolution is applied at the end of each level to reduce the channel dimension and maintain the weight of the global features. Finally, the low-dimensional feature maps are upsampled to match the scale of the original feature map, and the feature maps from the different levels are concatenated to form the global feature of the module.
The basic steps of its operation are as follows: given an input image (a), the last convolutional layer's feature map is obtained using CNN (b). The pyramid parsing module is then applied to obtain representations of different sub-regions. Subsequently, upsampling and concatenation layers are used to form the final feature representation, which contains local and global information (c). Finally, this representation is fed into convolutional layers to obtain the final pixel-wise predictions (d).
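The pyramid pooling steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the 1x1 convolutions use random weights in place of learned ones, and nearest-neighbour upsampling is assumed here for brevity.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) feature map into a (C, bins, bins) grid."""
    c, h, w = feat.shape
    out = np.empty((c, bins, bins), dtype=feat.dtype)
    for i in range(bins):
        # Bin boundaries: floor of the start index, ceiling of the end index.
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def upsample_nearest(feat, size):
    """Resize a (C, h, w) map to (C, H, W) by nearest-neighbour lookup."""
    _, h, w = feat.shape
    H, W = size
    rows = (np.arange(H) * h) // H
    cols = (np.arange(W) * w) // W
    return feat[:, rows][:, :, cols]

def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6), rng=None):
    """Concatenate the input with four pooled, reduced, and upsampled branches."""
    rng = np.random.default_rng(0) if rng is None else rng
    c, h, w = feat.shape
    reduced = c // len(bin_sizes)          # each branch keeps C / 4 channels
    branches = [feat]
    for bins in bin_sizes:
        pooled = adaptive_avg_pool(feat, bins)
        # Random stand-in for a learned 1x1 convolution (assumption).
        weight = rng.standard_normal((reduced, c)) / np.sqrt(c)
        squeezed = np.einsum("oc,chw->ohw", weight, pooled)
        branches.append(upsample_nearest(squeezed, (h, w)))
    return np.concatenate(branches, axis=0)

features = np.random.default_rng(1).standard_normal((8, 12, 12))
fused = pyramid_pooling(features)
print(fused.shape)  # (16, 12, 12): 8 input channels plus 4 branches of 2 channels
```

The output keeps the original feature map as its first channels, so local detail is preserved alongside the pooled global context.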

CBAM Attention Mechanisms
In the field of cognition, human attention subconsciously focuses on the parts of interest while ignoring other parts, which provides a theoretical basis for the rational use of information processing resources. Attention mechanisms need to address two main issues: determining which parts need to be focused on, and deciding how to allocate limited resources to the vital parts that need to be processed. Attention mechanisms can mainly be divided into soft attention and hard attention, and can be further divided from multiple perspectives into channel-based, multimodal, and clustering-based modules.
mas.ccsenet.org Modern Applied Science Vol. 17, No. 2
The attention mechanism module CBAM (Woo, S., Park, J., Lee, JY., & Kweon, I.S., 2018) was introduced into the model structure to improve the perception of global image information and fine-grained features such as coal edges, as shown in Figure 2. In this structure, the inverted residual structure of the MobileNetV2 (Sandler, Mark, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, & Liang-Chieh Chen., 2018) backbone network is added directly to the input and output features of this module in the PPM structure. However, the PPM structure does not distinguish the weights of the individual feature channels, so features that are irrelevant to the recognition target persist and the extracted feature information is inaccurate. Therefore, in this paper, the CBAM attention module is added to the original PPM structure. The principle is to use a channel attention module and a spatial attention module for feature extraction. The channel attention module learns the relationship between channels by performing global average pooling and fully connected layer operations on the feature maps of different channels, while the spatial attention module weights different spatial locations of the feature map through convolution and concatenation to extract spatially useful information, thus adding weighting information to each feature channel. CBAM makes full use of the channel information and spatial information of the feature map. It focuses on salient features in the boundary region of the coal seam, suppresses the transmission of useless features, and improves the representational capability of the network.
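The channel and spatial attention described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the model used in this paper: the weights are random placeholders for learned parameters, the reduction ratio and 7x7 kernel size are example values, and the max-pooled branches follow the original CBAM formulation of Woo et al. (2018).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Per-channel weights for a (C, H, W) map.

    w1 (C//r, C) and w2 (C, C//r) form the shared two-layer MLP.
    """
    avg = feat.mean(axis=(1, 2))           # global average pooling
    mx = feat.max(axis=(1, 2))             # global max pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    return sigmoid(mlp(avg) + mlp(mx))     # shape (C,)

def spatial_attention(feat, kernel):
    """Per-location weights from a (2, k, k) convolution kernel."""
    # Stack the channel-wise mean and max maps, then convolve them.
    stacked = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    h, w = feat.shape[1:]
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            window = padded[:, i:i + h, j:j + w]
            out += (kernel[:, i, j, None, None] * window).sum(axis=0)
    return sigmoid(out)                    # shape (H, W)

def cbam(feat, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    feat = feat * channel_attention(feat, w1, w2)[:, None, None]
    return feat * spatial_attention(feat, kernel)[None, :, :]

rng = np.random.default_rng(0)
C, r, k = 8, 4, 7                          # example sizes (assumptions)
x = rng.standard_normal((C, 10, 10))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, k, k)) * 0.1
y = cbam(x, w1, w2, kernel)
print(y.shape)  # (8, 10, 10), same shape as the input
```

Because both attention maps lie between 0 and 1, the module only rescales the input features, which is why it can be inserted into an existing branch without changing tensor shapes.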

PSPNET Network Architecture Improvement
In the network architecture proposed in this paper, the backbone network incorporates an improved attention mechanism based on average pooling of sub-regions in the pyramid parsing module, which keeps the model lightweight. On top of obtaining global features, a Convolutional Block Attention Module (CBAM) is introduced to focus on the details of the segmentation targets. The overall structure is illustrated in Figure 3. Although pooling reduces computational complexity, it also leads to a loss of information in the data. Incorporating the CBAM attention module into the different pooling layers of the PPM partially compensates for this loss, because CBAM is a lightweight attention mechanism that combines spatial and channel information. It adaptively learns the correlations between channels and the relationships between spatial locations, enabling the network to better capture crucial information in the images.
In addition, this study also incorporates the CBAM module into the concatenation (concat) part. Since it operates after the final merging of the feature maps, it captures the most essential features in the data. By utilizing the channel and spatial attention mechanisms of CBAM, the network can better extract the features relevant to the current target and enhance the extraction of local feature details, thus improving the model's generalization ability.
With this architecture, better attention to the details of coal features can be achieved, enhancing the recognition of target edges and effectively distinguishing them from other confusing objects.

Creation of the Data Set
An excellent neural network model requires high-quality dataset support, and the quality and accuracy of the dataset depend on the quality of data collection and processing.
In this experiment, coal images are captured with a camera for recognition, and the data are divided into a training set and a validation set for model training. The validation set is not involved in training but is used to determine the optimal hyperparameters of the model, improve model accuracy, and reduce the chances of overfitting. To enhance the model's generalization performance, a series of augmentation techniques such as random brightness transformation, random contrast transformation, and random flipping are applied to the dataset images to increase the number of images. The dataset images are also normalized to facilitate the model's processing of input data.
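The augmentation and normalization steps listed above could look like the following NumPy sketch. The brightness and contrast ranges, the flip probability, and the function names are illustrative assumptions; the paper does not state the values it used. The label mask is flipped together with the image so that the pixel labels stay aligned.

```python
import numpy as np

def augment(image, mask, rng, max_brightness=0.2, contrast_range=(0.8, 1.2)):
    """Return one randomly augmented (image, mask) pair.

    image: float array in [0, 1] with shape (H, W, 3)
    mask:  integer label array with shape (H, W)
    """
    # Random brightness: add a constant offset to every pixel.
    out = image + rng.uniform(-max_brightness, max_brightness)
    # Random contrast: scale deviations around the mean intensity.
    mean = out.mean()
    out = (out - mean) * rng.uniform(*contrast_range) + mean
    # Random horizontal flip, applied to image and mask together.
    if rng.random() < 0.5:
        out, mask = out[:, ::-1], mask[:, ::-1]
    return np.clip(out, 0.0, 1.0), mask

def normalize(image, mean, std):
    """Standardize each channel with the given per-channel mean and std."""
    return (image - mean) / std

rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))
mask = rng.integers(0, 2, size=(4, 6))
aug_image, aug_mask = augment(image, mask, rng)
normalized = normalize(image, image.mean(axis=(0, 1)), image.std(axis=(0, 1)))
```

In practice the per-channel mean and standard deviation would be computed once over the training set and reused for validation images.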

Pre-processing of Data Sets
The dark channel prior is an effective method for image dehazing. It was chosen in this study after comparing it with other methods, such as channel prior, multi-scale Retinex, Gaussian dehazing, median dehazing, adaptive histogram equalization dehazing, and adaptive histogram dehazing, on both infrared and non-infrared images. The dark channel prior algorithm is based on the mathematical principle of the dark channel prior (Lu Qiuju, & Han Tuanjun., 2021), which can be described as follows:

$$J^{dark}(x) = \min_{y \in \Omega(x)} \left( \min_{c \in \{r, g, b\}} J^{c}(y) \right) \tag{1}$$

where $J^{dark}(x)$ represents the dark channel in the image, $J^{c}$ is a color channel of $J$, and $\Omega(x)$ is a local patch centered at $x$. This is then substituted into Equation (2), the haze generation formula:

$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr) \tag{2}$$

where $I(x)$ is the original image, $J(x)$ is the image after dehazing, $A$ represents the global atmospheric light component, and $t(x)$ represents the transmission rate. Finally, through a series of mathematical transformations, we can obtain:

$$J(x) = \frac{I(x) - A}{t(x)} + A \tag{3}$$

This study compares the dehazing effect and the time consumption of the methods on foggy images from underground mines. The time consumed for 512x512 non-infrared images is shown in Figure 5. From the graph, it can be observed that the three dehazing techniques, namely Dark Channel, Adaptive Histogram, and Adaptive Contrast, have similar time consumption. However, as the resulting images show, the dehazing effect of the Dark Channel algorithm is superior to that of the Adaptive Histogram and Adaptive Contrast algorithms. It is worth noting that the Adaptive Histogram and Adaptive Contrast algorithms make significant adjustments to the color tones of the images, which may alter the semantic content of the scenes. The Dark Channel algorithm, on the other hand, preserves the details of the images more effectively. Therefore, the Dark Channel dehazing algorithm was chosen for preprocessing.
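As an illustration of the dark channel dehazing chosen above, the following NumPy sketch inverts the haze model I(x) = J(x)t(x) + A(1 - t(x)). It is a simplified sketch, not the preprocessing code used in this study: the patch size, the haze-retention factor `omega`, and the lower bound `t0` on the transmission are assumed defaults taken from common practice, and no transmission refinement step is included.

```python
import numpy as np

def min_filter(img2d, patch):
    """Sliding-window minimum over a patch x patch neighbourhood."""
    pad = patch // 2
    padded = np.pad(img2d, pad, mode="edge")
    h, w = img2d.shape
    out = np.full((h, w), np.inf)
    for i in range(patch):
        for j in range(patch):
            out = np.minimum(out, padded[i:i + h, j:j + w])
    return out

def dark_channel(img, patch=15):
    """Minimum over the color channels, then over the local patch."""
    return min_filter(img.min(axis=2), patch)

def estimate_atmospheric_light(img, dark, top=0.001):
    """Pick the brightest input pixel among the haziest dark-channel pixels."""
    n = max(1, int(dark.size * top))
    idx = np.argsort(dark.ravel())[-n:]
    candidates = img.reshape(-1, 3)[idx]
    return candidates[candidates.sum(axis=1).argmax()]

def dehaze(img, patch=15, omega=0.95, t0=0.1):
    """Return the restored image, the transmission map, and A."""
    dark = dark_channel(img, patch)
    A = np.maximum(estimate_atmospheric_light(img, dark), 1e-6)
    # Transmission estimate from the dark channel of the normalized image.
    t = 1.0 - omega * dark_channel(img / A, patch)
    t = np.maximum(t, t0)                  # avoid dividing by tiny values
    J = (img - A) / t[..., None] + A       # invert the haze model
    return np.clip(J, 0.0, 1.0), t, A

# Synthetic hazy image: uniform transmission 0.6 and atmospheric light 0.9.
rng = np.random.default_rng(0)
clear = rng.random((20, 20, 3))
hazy = clear * 0.6 + 0.9 * (1 - 0.6)
restored, t, A = dehaze(hazy, patch=3)
```

Images are assumed to be float arrays in [0, 1] with shape (H, W, 3); a real pipeline would typically also smooth the transmission map before inverting the model.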
According to the performance metrics in Table 1 and Table 2, the recognition metrics obtained with the dehazed images are higher than those obtained with the original images. The dehazing algorithm enhances the clarity of coal in the images, leading to improved recognition performance in the aforementioned networks.

Experimental Environment and Parameter Settings
The experimental model was built using the PyTorch deep learning framework and trained and tested in a GPU environment. The detailed configuration of the experimental environment is shown in Table 3, and the training parameters are presented in Table 4.

Verification of MobileNetV2+ Performance
Network Performance
In the PSPNET network structure, MobileNetV2, Xception, and ResNet101 were used as backbones for the ablation study on the validation set. Figure 10 compares the mean Intersection over Union (MIoU) and the inference time for the different backbones. The results indicate that when MobileNetV2 is used as the backbone, the PSPNET+ model has the fastest inference time and runtime speed, and its MIoU value is only slightly lower than that of the model using ResNet101. Additionally, the MS gain results for MIoU, calculated in Table 5, demonstrate that the performance of the PSPNET model is optimal when MobileNetV2 is used as the backbone. Therefore, considering both the accuracy and the processing speed of the model, the optimal choice is to use MobileNetV2 as the backbone.
According to the experimental results, introducing either form of attention module significantly improves performance compared to the original network structure. Additionally, the CBAM attention module performs better than the ECA module in the PPM module's four pooling layers. In both the PPM module and the final concatenation (Concat) module, the CBAM module outperforms the ECA module in capturing fine details. This can be explained by the dual attention mechanisms of the CBAM module, channel and spatial. The channel attention mechanism adaptively learns the importance of each channel and assigns larger weights to channels that contribute more to the task. The spatial attention mechanism assigns larger weights to locations that contribute more to the task. This approach enhances the network's focus on important information and improves its attention to critical regions. Therefore, the CBAM module can more accurately distinguish the fine texture of coal, thus improving the precision of the network.
The impact of the CBAM module on the performance of the PSPNET+ model is shown in Table 6. When the attention modules are introduced simultaneously in the PPM and CAT parts, the accuracy and MIoU value of the CBAM model improve by 5.59% and 1.43%, respectively, compared to the original model, and by 0.63% and 0.47%, respectively, compared to the ECA model. This indicates that the use of the CBAM attention module reduces the influence of irrelevant features on the model and enhances the learning of key coal-rock targets and the integration of information across different levels. As a result, it strengthens the feature discrimination capability and segmentation accuracy of the network model. The comparative analysis of different coal-rock image segmentation models through ablation experiments is shown in Figure 9. The commonly used HRNet network model and the improved PSPNET+ network model are selected for analysis, comparing their accuracy and loss function values on the same dataset and with the same network parameters. The improved PSPNET+ model in this study achieves an accuracy of 65.04%, which outperforms the other models. This improvement is attributed to the incorporation of the CBAM attention module in the PSPNET+ model, which enhances its capability to extract detailed features. The attention module helps to mitigate the influence of irrelevant features, such as non-coal substances, and highlights the detailed features of coal-rock regions. It effectively combines spatial and channel features, thereby enhancing the model's feature extraction and learning capabilities.
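For reference, the accuracy, MPA, and MIoU metrics used in these comparisons can be computed from a confusion matrix. The sketch below uses the standard definitions with a tiny made-up prediction; it is not the evaluation script of this study, and the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Pixel accuracy, mean pixel accuracy (MPA), and mean IoU (MIoU).

    Assumes every class appears at least once in the ground truth.
    """
    tp = np.diag(cm).astype(float)
    per_class_acc = tp / cm.sum(axis=1)
    per_class_iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "mpa": per_class_acc.mean(),
        "miou": per_class_iou.mean(),
    }

# Toy 2x4 label maps with two classes (0 = background, 1 = coal).
target = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
pred = np.array([[0, 1, 1, 1], [0, 0, 1, 1]])
cm = confusion_matrix(pred, target, num_classes=2)
metrics = segmentation_metrics(cm)
print(metrics["accuracy"])  # 0.75, since 6 of the 8 pixels are correct
```

In this toy case the per-class IoU values are 2/4 and 4/6, so MIoU is lower than pixel accuracy; the same gap between accuracy and MIoU is common when classes are imbalanced.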
Furthermore, by using MobileNetV2 as the backbone network, together with pre-training and transfer learning on the widely used VOC dataset, the model's parameter volume is effectively reduced while its generalization ability and segmentation accuracy are improved. Therefore, the improved PSPNET model exhibits optimal overall performance and better meets the requirements for both accuracy and real-time performance in coal-rock image segmentation. The performance metrics of the different coal-rock image segmentation models are extracted and listed in Table 7. It can be observed that the PSPNET network architecture has fewer parameters and produces smaller models, and that it scores higher than the other models in terms of both time and accuracy. Figure 9 shows the accuracy of each network. Based on Table 7 and Figure 9, it can be concluded that the improved PSPNET network maintains high recognition accuracy while having smaller computational and parameter requirements than the other networks.
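One reason a MobileNetV2 backbone yields a small model is that its blocks are built around depthwise separable convolutions. The short calculation below compares weight counts for a single layer; the layer sizes are hypothetical and are not taken from the networks in Table 7, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a k x k depthwise convolution plus a 1x1 pointwise one."""
    return k * k * c_in + c_in * c_out

standard = standard_conv_params(256, 256)         # 589824
separable = depthwise_separable_params(256, 256)  # 67840
print(standard, separable, round(standard / separable, 2))
```

For this example layer the separable form needs roughly one ninth of the weights, which illustrates how a lightweight backbone can cut the parameter count without changing the layer's input and output shapes.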
To further demonstrate the advantages of the proposed improvement over other networks, random images from the test dataset were selected for testing. HRNet, Deeplabv3+, PSPNET, and the optimized PSPNET+ were compared in the tests. The results show that the improved PSPNET model can extract more detailed features and achieve higher recognition accuracy. Figure 10 shows that, in the recognition tests with the added attention modules, the edges and details are significantly improved compared to the original PSPNET network and the other networks, and the recognized region expands noticeably outward. Based on the data in Table 9, the improved network achieves a rate of 96.8%, which is a significant improvement over the other networks. These results indicate that introducing CBAM modules in the PPM and concatenation layers does further extract and learn the detailed features of coal, as reflected in the recognition tests.
In coal recognition, edge and detail features are often important cues for identification. Therefore, the improved PSPNET with CBAM module can better handle this task and achieve higher network performance.

Conclusion
This study proposed an improved method for coal-rock interface image recognition using the enhanced PSPNET model and feature enhancement techniques, and conducted semantic segmentation experiments on coal. The main conclusions are as follows:
(1) The lightweight neural network MobileNetV2 was used as the backbone, and the CBAM attention mechanism was introduced, effectively enhancing the network's ability to extract image detail features while minimizing the impact on computational efficiency. Transfer learning was used to reduce the influence of sample distribution differences on network performance.
(2) Through laboratory ablation experiments, it was found that the improved PSPNET+ model achieved an accuracy of 65.04% and an MIoU of 63.34% on a self-made coal-rock segmentation dataset. The model size was 9.4M. Compared to other models, the improved PSPNET network model exhibited excellent performance in terms of running time, accuracy, MIoU, MPA, etc., effectively enabling semantic segmentation of coal. This validates the feasibility and practicality of the model in coal-rock recognition.
The proposed method for coal-rock interface image recognition in this paper provides a new research idea and theoretical model for addressing coal-rock recognition tasks in underground coal mines. However, further research is needed to overcome the interference factors present in underground mining environments, such as water mist and coal dust, in order to deploy the model in practical underground coal mines. Additionally, to improve the recognition accuracy and robustness of the model, a large number of actual underground coal-rock interface images need to be collected to construct a coal-rock data sample library, enrich the semantic information of coal, and achieve better results in training.