[关键词]
[摘要]
目的 探究常用关联与聚类分析方法对中医处方数据的适用性,基于现有方法提出一种在中医处方数据挖掘研究中关联与聚类分析部分的良好实践方案。方法 结合中医处方数据特点,从算法原理层面指出各分析方法与中医处方数据的不匹配之处;以列举典型真实研究示例的形式展示不同方法在应用时可能出现的不合理现象;对具备优化价值的方法提出其应用时的优化方法;对于关联分析方法,基于真实医案数据横向对比所有算法的计算结果以呈现算法间的差异。结果 除Pearson、Spearman和Phi相关系数外,所有药物关联分析方法应用于中医处方数据时均存在不同程度的局限性,在对比研究中,支持度与置信度对高频药物的计算结果均偏高,欧氏距离与汉明距离易高估低频药物间的关联性,余弦相似度与Jaccard系数对频次差异较大的药物组合的计算结果偏低,互信息与改良互信息对全部组合的计算结果几乎均存在异常,提升度对高频药物的计算结果偏低;在常用的药物聚类方法中,系统聚类与极大团算法对药物的划分效果最佳,但前者易在类中纳入低相关药物,后者的类别划分结果则过于模糊,经优化后,两者的局限性均得到改善。结论 在涉及药物关联计算时,应首选Pearson、Spearman和Phi相关系数;在进行药物聚类分析时,可借鉴优化后的应用方法,利用系统聚类分析药物类别的凝聚过程,利用极大团算法探索药物的多维应用方向。
[Keywords]
[Abstract]
Objective To investigate the applicability of common association and clustering analysis methods to traditional Chinese medicine (TCM) prescription data, and to propose, based on existing methods, a good-practice scheme for the association and clustering analysis components of TCM prescription data mining research. Methods Taking the characteristics of TCM prescription data into account, mismatches between each analytical method and TCM prescription data were identified at the level of algorithmic principle. Typical real-world research examples were presented to show the unreasonable results that can arise when the different methods are applied. For methods worth optimizing, optimized ways of applying them were proposed. For association analysis methods, the results of all algorithms were compared side by side on real medical case data to show the differences between algorithms. Results Except for the Pearson, Spearman, and Phi correlation coefficients, all drug association analysis methods showed limitations of varying degree when applied to TCM prescription data. In the comparative study, support and confidence gave inflated values for high-frequency drugs; Euclidean and Hamming distances tended to overestimate the association between low-frequency drugs; cosine similarity and the Jaccard coefficient gave low values for drug combinations whose frequencies differ widely; mutual information and modified mutual information gave abnormal values for almost all combinations; and lift gave low values for high-frequency drugs. Among the commonly used drug clustering methods, hierarchical clustering and the maximal clique algorithm partitioned drugs best, but the former tends to admit weakly correlated drugs into clusters, while the latter produces overly ambiguous category boundaries. After optimization, the limitations of both methods were mitigated.
Conclusion For drug association calculations, the Pearson, Spearman, and Phi correlation coefficients should be the first choice. For drug clustering analysis, the optimized application methods can be adopted: hierarchical clustering to analyze the agglomeration process of drug categories, and the maximal clique algorithm to explore the multidimensional application directions of drugs.
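As an illustration of the frequency effect summarized in the Results, the following minimal sketch computes several of the named measures for one drug pair from 0/1 co-occurrence counts. It uses the standard textbook definitions, which are assumed here and may differ in detail from the implementations compared in the study. The function name `association_measures`, the drug names, and the 25 toy prescriptions are all hypothetical and are not the study's data.

```python
import math


def association_measures(prescriptions, a, b):
    """Textbook association measures for drugs a and b.

    prescriptions: list of sets of drug names (hypothetical toy input).
    """
    n = len(prescriptions)
    n_a = sum(1 for p in prescriptions if a in p)
    n_b = sum(1 for p in prescriptions if b in p)
    n_ab = sum(1 for p in prescriptions if a in p and b in p)

    support = n_ab / n                       # P(a and b)
    confidence = n_ab / n_a                  # P(b | a)
    lift = confidence / (n_b / n)            # P(b | a) / P(b)
    jaccard = n_ab / (n_a + n_b - n_ab)      # |A and B| / |A or B|
    cosine = n_ab / math.sqrt(n_a * n_b)
    # Phi coefficient: Pearson correlation of the two 0/1 indicator columns.
    phi = (n * n_ab - n_a * n_b) / math.sqrt(
        n_a * (n - n_a) * n_b * (n - n_b)
    )
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "jaccard": jaccard,
        "cosine": cosine,
        "phi": phi,
    }


# Hypothetical toy data: two high-frequency drugs (each in 20 of 25
# prescriptions) that co-occur exactly as often as independence predicts.
toy = (
    [{"gancao", "fuling"}] * 16
    + [{"gancao"}] * 4
    + [{"fuling"}] * 4
    + [{"baizhu"}]
)

result = association_measures(toy, "gancao", "fuling")
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the two drugs are statistically independent, so Phi is 0 and lift is 1, while support (0.64) and confidence (0.80) are still high purely because both drugs are frequent.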
[CLC number]
TP18;R28
[Funding]
National Key Research and Development Program of China (2023YFC3502900)