Objective To identify key genes in the "inflammation-cancer transformation" of colorectal cancer from public databases, to screen genes by combining the strengths of machine learning and neural network models, to verify the feasibility of the key genes as predictive indicators and their immune relevance, and to predict potential traditional Chinese medicines (TCM) for prevention and treatment.
Methods Two Gene Expression Omnibus (GEO) datasets, GSE66407 and GSE166925, were used to screen differentially expressed genes arising during the progression from normal colon to inflamed colon (ulcerative colitis) and finally to colorectal cancer. Core genes were further screened by constructing a protein-protein interaction (PPI) network. Three classification machine learning algorithms (gradient boosting machine, random forest and decision tree) were trained to predict the outcome. The core genes selected by the three algorithms were intersected and used to build an artificial neural network model under a deep learning framework, and the prediction accuracy of the model was verified internally and externally. Real-time quantitative polymerase chain reaction (RT-PCR), combined with immunohistochemical data from the Human Protein Atlas (HPA), was used to analyze the expression of the key genes in cancer tissues and normal tissues of patients with newly diagnosed colorectal cancer. The association between the core genes and immune cells was analyzed with the single sample gene set enrichment analysis (ssGSEA) algorithm. Genes with prognostic significance for colorectal cancer were screened by survival analysis. Finally, related TCM were predicted from the core genes.
Results A total of 152 common differentially expressed genes were identified across the whole "inflammation-cancer transformation" process of colorectal cancer. These genes were involved in biological processes such as tumor immune response and inflammatory response. After screening by the three machine learning algorithms, eight core genes were finally determined: tissue inhibitor of metalloproteinases 1 (TIMP1), matrix metalloproteinase 1 (MMP1), C-X-C motif chemokine ligand 13 (CXCL13), pleckstrin (PLEK), granzyme B (GZMB), CXCL5, CXCL3 and CXCL8. In the neural network model under the deep learning framework, the prediction accuracy for the normal group and the tumor group was 86.3% and 92.3%, respectively, in the training set, and 80.0% and 80.6%, respectively, in the external validation set. The eight core genes were simultaneously involved in multiple immune infiltration processes. RT-PCR combined with HPA database analysis showed that the expression of TIMP1, MMP1, GZMB, CXCL5, CXCL8 and CXCL3 was significantly increased in cancer tissues of colorectal cancer patients (P < 0.05), whereas the expression of CXCL13 was significantly decreased (P < 0.05). PLEK expression was decreased in cancer tissues, but the difference was not significant. TIMP1, CXCL13 and CXCL8 had a significant impact on overall survival. TCM prediction based on the seven core genes that differed significantly in experimental validation showed that the predicted medicines were mainly heat-clearing and deficiency-tonifying medicines; their natures were mainly cold, warm and neutral, and bitter, pungent and sweet flavours accounted for the largest proportions.
Conclusion TIMP1, MMP1, CXCL13, GZMB, CXCL5, CXCL3 and CXCL8 can serve as key molecules for predicting the "inflammation-cancer transformation" process of colorectal cancer. Early TCM intervention targeting these molecules may be of great significance for the early diagnosis and treatment of colorectal cancer.
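
Two steps named in the abstract, intersecting the genes selected by the three machine learning algorithms and reporting prediction accuracy separately for the normal and tumor groups, can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the top-k selection rule, the placeholder gene names (`GENE_A` to `GENE_D`) and all numeric values are assumptions made for illustration only. The abstract does not state how each algorithm's gene list was derived, so a simple "top-k by importance score" rule is assumed here.

```python
from typing import Dict, List, Sequence, Set


def top_k_genes(importance: Dict[str, float], k: int) -> Set[str]:
    """Return the k genes with the highest importance score (assumed rule)."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])


def intersect_top_genes(importances: Sequence[Dict[str, float]], k: int) -> List[str]:
    """Genes present in the top-k list of every model, sorted by name."""
    selected = [top_k_genes(imp, k) for imp in importances]
    return sorted(set.intersection(*selected))


def per_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Fraction of samples of each true class that were predicted correctly."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + int(truth == pred)
    return {label: correct[label] / totals[label] for label in totals}


# Toy importance scores (hypothetical values, not results from the study).
gbm = {"GENE_A": 0.40, "GENE_B": 0.30, "GENE_C": 0.20, "GENE_D": 0.10}
forest = {"GENE_A": 0.35, "GENE_B": 0.10, "GENE_C": 0.30, "GENE_D": 0.25}
tree = {"GENE_A": 0.50, "GENE_B": 0.05, "GENE_C": 0.45, "GENE_D": 0.00}

# Keep only genes that all three models rank among their top three.
core_genes = intersect_top_genes([gbm, forest, tree], k=3)

# Toy labels (hypothetical): four normal and four tumor samples.
y_true = ["normal"] * 4 + ["tumor"] * 4
y_pred = ["normal", "normal", "normal", "tumor"] + ["tumor"] * 4
accuracy = per_class_accuracy(y_true, y_pred)

print(core_genes)
print(accuracy)
```

Reporting accuracy per class, as the abstract does, avoids a single overall figure hiding weak performance on the smaller group when the normal and tumor groups differ in size.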