[关键词]
[摘要]
目的 针对近红外光谱(near-infrared spectroscopy,NIRS)建模过程依赖“地毯式”试错、缺乏理论指导的问题,以金振口服液(Jinzhen Oral Liquid,JOL)固含量预测模型为研究对象,从光谱信息质量角度揭示最佳模型的优化方向,并验证其在模型维护中的应用价值。方法 采集380个样本的NIRS和固含量数据。经9种预处理方法处理后,分别使用偏最小二乘法(partial least squares,PLS)和支持向量回归(support vector regression,SVR)建立固含量的NIRS预测模型。创新性地引入香农熵、主成分分析(principal component analysis,PCA)以及自编码器,构建一个从信息丰富度、线性结构集中度和非线性结构可捕获性3个维度量化光谱信息质量的评价框架。最后通过系统分析光谱信息特性与模型性能间的关联,揭示最佳预处理方法的方向,并将此关联规律应用于新增294个样本时的模型维护,以筛选最佳光谱数据集。结果 对于谱峰宽泛重叠的NIRS,其信息密度与信息保留率与PLS模型性能呈负相关。基于此关联规律,成功预测出模型维护时的最佳数据集,其建模效果(Rp2=0.990 9)显著优于其他数据集。结论 研究发现的关联规律能够有效解释预处理对模型性能的影响,为光谱模型的优化与维护提供了理论依据和指导工具,实现了从“盲目试错”到“主动改善”的转变,为建立标准化、智能化的近红外光谱模型构建与维护流程提供了新思路。
[Keywords]
[Abstract]
Objective Near-infrared spectroscopy (NIRS) modeling relies on exhaustive trial-and-error and lacks theoretical guidance. Taking the solid content prediction model of Jinzhen Oral Liquid (JOL, 金振口服液) as a case study, this work aimed to identify, from the perspective of spectral information quality, the direction in which the best model should be optimized, and to verify the value of this approach in model maintenance. Methods NIRS and solid content data were collected for 380 samples. The spectra were processed with nine preprocessing methods, and solid content prediction models were then built using partial least squares (PLS) and support vector regression (SVR). Shannon entropy, principal component analysis (PCA), and an autoencoder were introduced to construct an evaluation framework that quantifies spectral information quality along three dimensions: information richness, linear structure concentration, and nonlinear structure capturability. The relationship between spectral information characteristics and model performance was then analyzed systematically to identify the direction of the optimal preprocessing method, and the resulting relationship was applied to model maintenance with 294 newly added samples to select the optimal spectral dataset. Results For NIRS with broad, overlapping peaks, both information density and information retention rate were negatively correlated with PLS model performance. Based on this relationship, the optimal dataset for model maintenance was successfully predicted, and its modeling performance (prediction-set coefficient of determination Rp² = 0.9909) was significantly better than that of the other datasets. Conclusion The identified relationship effectively explains the effect of preprocessing on model performance. It provides a theoretical basis and a practical guide for optimizing and maintaining spectral models, enabling a shift from “blind trial-and-error” to “active improvement”, and offers a new approach to establishing a standardized, intelligent workflow for NIRS model construction and maintenance.
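The information-richness dimension mentioned in the Methods can be illustrated with a minimal sketch, assuming a histogram-based Shannon entropy estimate. The abstract does not state how the entropy is binned, nor how information density and information retention rate are defined, so the bin count, the function names `shannon_entropy` and `first_difference`, and the synthetic spectrum below are all hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins=16):
    """Histogram-based Shannon entropy (in bits) of a list of intensities.

    The bin count is an assumption; the abstract does not specify one.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant signal falls into a single bin
    width = (hi - lo) / n_bins
    # Assign each value to a bin; clamp the maximum into the last bin.
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def first_difference(values):
    """Crude first-derivative preprocessing, used here only as a stand-in."""
    return [b - a for a, b in zip(values, values[1:])]


# Synthetic spectrum with two broad, overlapping bands (not real JOL data).
spectrum = [
    math.exp(-((i - 40) / 15) ** 2) + 0.6 * math.exp(-((i - 60) / 20) ** 2)
    for i in range(100)
]

print(f"raw spectrum:     {shannon_entropy(spectrum):.3f} bits")
print(f"first difference: {shannon_entropy(first_difference(spectrum)):.3f} bits")
```

Comparing the entropy of a spectrum before and after each preprocessing step gives one number per candidate dataset, which could then be set against model performance in the way the abstract describes.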
[CLC number]
R283.6
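[Funding]
Special Project for Industrial Foundation Reconstruction and High-Quality Development of Manufacturing, Ministry of Industry and Information Technology of China (TC2308068); Open Fund of the State Key Laboratory of Technologies for Chinese Medicine Pharmaceutical Process Control and Intelligent Manufacturing (SKL2023D02003)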