利用实体与依存句法结构特征的病历短文本分类方法

      Short Text Classification of EMR Based on Entities and Dependency Parser

      • 摘要: 近年来,电子病历文本的分类、挖掘成为医学大数据研究的基础。该文提出一种利用实体与依存句法结构分析构特征集的电子病历短文本分类方法。首先对病历文本进行自然语言处理,包括分句、分词、词性标注以及实体提取,构建实体词典,利用TF-IDF方法构建词-文本矩阵并利用潜在语义分析LSA方法进行词汇特征的选择,然后分析病历文本的依存句法关系,挖掘出词汇之间的依存关系并构建特征三元组作为分类特征的扩展,最后构建出分类特征向量集对病历短文本进行分类。实验证明,相比于未进行特征扩展的短文本分类,所提方法能有效地提高分类器的分类性能,其分类的准确率与F值均有明显的提高。

         

        Abstract: Nowadays, text classification and text mining of Electronic Medical Record(EMR) have become the basis of the Big Data research in biomedical fields. This paper proposes a method using entity dictionaries and dependency parser as the feature to do the classification of short texts in EMR. It used NLP to preprocess the texts first including sentence segmentation, word segmentation, part of speech and entity extraction. Then several entity dictionaries were built according to the result of NLP. After that the TF-IDF and LSA were deployed to select the vocabulary feature. Then considering the characters of EMR, dependency parser was done to the texts and triple dependency relation features would be used as the expanding feature for text classification. The result of the experiment shows that comparing to the classification which used vocabulary features only, the proposed method can effectively improve the performance of classifier and the precision and F-value are obviously higher.

         

      /

      返回文章
      返回