[Keywords]
[Abstract]
Objective To address the strong subjectivity and low efficiency of traditional origin tracing for Chinese herbal medicines, this study explores an origin-tracing method based on mid-infrared spectroscopy (MIRS) combined with machine learning models, with the aim of providing new technical support for origin tracing.
Methods Starting from a mid-infrared spectral dataset of Chinese medicinal materials, the spectra were preprocessed by outlier handling, missing-value imputation, first-order differencing, high-variance feature selection, distance calculation and clustering analysis, and locally linear embedding (LLE) for dimensionality reduction. Six machine learning models were then constructed: support vector machine (SVM), random forest (RF), light gradient boosting machine (LightGBM), K-nearest neighbor (KNN), extreme gradient boosting (XGBoost), and artificial neural network (ANN). The Ivy algorithm (IVYA) was used to optimize the model parameters. A multidimensional evaluation framework was established, combining the area under the receiver operating characteristic (ROC) curve (AUC) with accuracy, recall, precision, and F1 score (F1), and was used to identify the model best suited to origin tracing of Chinese herbal medicines.
Results SVM (macro-average AUC = 0.998, F1 = 0.949, accuracy = 0.949, recall = 0.949, precision = 0.956) and ANN (macro-average AUC = 0.999, F1 = 0.940, accuracy = 0.939, recall = 0.939, precision = 0.944) performed best in identifying the origins of Chinese medicinal materials. Their origin predictions for unknown samples were highly consistent with each other, and their core evaluation metrics were markedly higher than those of RF, LightGBM, KNN, and XGBoost. The SVM model can accurately capture differences in the functional-group vibrations of the chemical constituents of medicinal materials from different origins, while the ANN model can simulate the nonlinear coupling of the multiple factors that shape origin-specific characteristics. The two models not only distinguished the 11 origins efficiently but were also complementary: the globally optimal separating hyperplane of SVM suits origins with clearly distinct spectral features, whereas the distributed representation capability of ANN is better suited to complex cases with highly overlapping features. MIRS combined with SVM and ANN is rapid, non-destructive, and high-throughput for origin tracing of Chinese herbal medicines. It provides an objective tool for large-scale origin tracing of geo-authentic (Dao-di) medicinal materials and a practical technical pathway for the modernization of the Chinese herbal medicine industry.
Conclusion MIRS combined with SVM, RF, LightGBM, KNN, XGBoost, and ANN is effective for origin tracing of Chinese herbal medicines. Among these models, SVM and ANN can serve as the preferred models and offer an interdisciplinary solution for origin tracing. Integrating multimodal data and deep learning techniques may further improve model performance in future work.
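As an illustration of two of the preprocessing steps and of the evaluation metrics named in the abstract, the following minimal sketch uses only the Python standard library. It is not the authors' implementation. The function names `first_difference`, `top_variance_features`, and `macro_metrics` are hypothetical, and the toy spectra and labels are invented for demonstration. The abstract states that the AUC is macro-averaged; treating precision, recall, and F1 as macro-averaged too (the unweighted mean of per-class values) is an assumption made here. LLE, the six classifiers, and IVYA tuning are outside the scope of this sketch.

```python
from collections import Counter
from statistics import pvariance


def first_difference(spectrum):
    """First-order difference of one spectrum: d[i] = x[i+1] - x[i]."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def top_variance_features(rows, k):
    """Return the (sorted) indices of the k columns with the largest variance."""
    variances = [pvariance(col) for col in zip(*rows)]
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Assumption: "macro" means the unweighted mean of the per-class values.
    """
    labels = sorted(set(y_true) | set(y_pred))
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        prec = hits[label] / pred_count[label] if pred_count[label] else 0.0
        rec = hits[label] / true_count[label] if true_count[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(hits.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy absorbance values for three samples at four wavenumbers (invented).
    spectra = [
        [0.10, 0.30, 0.35, 0.20],
        [0.12, 0.28, 0.40, 0.18],
        [0.50, 0.55, 0.90, 0.85],
    ]
    diffed = [first_difference(s) for s in spectra]
    keep = top_variance_features(diffed, k=2)
    print("kept feature indices:", keep)

    # Toy origin labels (invented) to exercise the metric computation.
    y_true = ["A", "A", "B", "B", "C", "C"]
    y_pred = ["A", "A", "B", "C", "C", "C"]
    print(macro_metrics(y_true, y_pred))
```

With macro averaging, each origin contributes equally to the reported score regardless of how many samples it has, so a model that performs poorly on one small origin class is penalized as strongly as one that fails on a large class.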
[CLC number]
TP18; R282.23
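[Funding]
Natural Science Foundation of Sichuan Province (2024NSFSC2109); National Natural Science Foundation of China, General Program (82574868); Technological Innovation R&D Project of the Chengdu Science and Technology Bureau (2024-YF05-01999-SN); Key Open Project of the Ministry of Education Basic Medical Research Innovation Center for Metabolic Cardiovascular Diseases (xnykdxcxzx-2025-05); General Program of the Sichuan Provincial Administration of Traditional Chinese Medicine (25MSZX079)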