HIT-DEID: A De-identification Tool for English Electronic Medical Records

Posted by: admin    Published: 2015-08-23 11:05:10

Zengjian Liu	Buzhou Tang	Xiaolong Wang	Qingcai Chen
liuzengjian.hit@gmail.com	tangbuzhou@gmail.com	wangxl@insun.hit.edu.cn	qingcai.chen@gmail.com

Introduction:

HIT-DEID is a de-identification tool for English electronic medical records. It identifies twenty-five types of protected health information (PHI), including the eighteen types defined by HIPAA. The tool is built on Conditional Random Fields (CRF) and was developed on the corpus of the 2014 i2b2 (Informatics for Integrating Biology and the Bedside) clinical natural language processing (NLP) challenge. The earlier version submitted to the 2014 i2b2 challenge achieved a micro F-score of 91.24% under the "strict" criterion, ranking second among all systems in the challenge [1][2]. After the challenge, we improved it by introducing a new tokenization module, updating the rules, and adding a postprocessing module. The current HIT-DEID tool achieves a micro F-score of 93.13% under the "strict" criterion on the challenge corpus (as shown in the following table).
The released HIT-DEID model is trained on both the training and test sets of the challenge so that it is suited to practical use.

Criterion	Precision (P)	Recall (R)	F-score (F)
Token	0.9701	0.9416	0.9556
Strict	0.9534	0.9102	0.9313
Relaxed	0.9549	0.9116	0.9327

Here, "token" checks whether a predicted token exactly matches a token in a gold phrase of the same category. "Strict" counts a PHI instance as correctly extracted only when it has exactly the same boundaries and category as a gold instance. "Relaxed" counts a PHI instance as correctly extracted when it has the same category as a gold instance and mostly overlaps it, with a mismatch of at most two characters allowed at the end.
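
The criteria above can be illustrated with a short sketch. This is not the official i2b2 evaluation script: the (start, end, category) tuple layout, the function names, and the reading of "relaxed" as "same start offset, end offset within two characters" are assumptions made for this example. The last part recomputes the F-score column from the reported precision and recall, using F = 2PR / (P + R).

```python
from __future__ import division, print_function


def strict_match(pred, gold):
    """Strict: identical start offset, end offset, and category."""
    return pred == gold


def relaxed_match(pred, gold, tolerance=2):
    """Relaxed: same category and start; end may be off by `tolerance` chars.

    Assumption: one possible reading of "only allow two characters
    mismatched at the end".
    """
    p_start, p_end, p_cat = pred
    g_start, g_end, g_cat = gold
    return (p_cat == g_cat and p_start == g_start
            and abs(p_end - g_end) <= tolerance)


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: (start offset, end offset, PHI category).
gold = (10, 20, "DOCTOR")
pred = (10, 22, "DOCTOR")

print(strict_match(pred, gold))   # False: the end offsets differ
print(relaxed_match(pred, gold))  # True: the end is off by two characters

# Precision and recall as reported in the table above.
for name, p, r in [("Token", 0.9701, 0.9416),
                   ("Strict", 0.9534, 0.9102),
                   ("Relaxed", 0.9549, 0.9116)]:
    print("%-8s F = %.4f" % (name, f_score(p, r)))
```

Rounded to four decimals, the computed values agree with the F-score column (0.9556, 0.9313, 0.9327).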

References:

[1] Stubbs A, Kotfila C, Uzuner O. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics, 2015.

[2] Liu Z, Chen Y, Tang B, et al. Automatic de-identification of electronic medical records using token-level and character-level conditional random fields. Journal of Biomedical Informatics, 2015.

Download:

Please fill out the [Application Form] and send it to Zengjian Liu (liuzengjian.hit@gmail.com) or Buzhou Tang (tangbuzhou@gmail.com) to obtain the source code of HIT-DEID. The trained model can be downloaded here: [HITdeid_crfsuite_model].

System configuration:

1) Linux system (tested on CentOS 6.4 and Ubuntu 12.04)

2) Environment requirements:

Bash

Python 2.7.*

Java (1.4 or later; Java 1.7 is recommended)

3) External Software:

a. CRFsuite:

-- Download the source code from: http://www.chokkan.org/software/crfsuite/

-- Unpack it and install CRFsuite on the system

or

-- Download the binary distribution and copy the "crfsuite" file to "HITdeid/tools/crfsuite/".

b. Stanford POS tagger:

-- Download the "basic English Stanford Tagger" from: http://www-nlp.stanford.edu/software/tagger.shtml

-- Unzip it, then rename the directory and move it to "HITdeid/tools/stanford-postagger/".

-- Make sure that "stanford-postagger.jar" and "models/english-left3words-distsim.tagger" exist in that directory.
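
The layout described above can be checked before the first run with a small script. This is only a sketch, not part of HIT-DEID: the function name and the default root directory "HITdeid" are assumptions, and the "crfsuite" entry applies only when the binary was copied into "HITdeid/tools/crfsuite/" rather than installed system-wide.

```python
from __future__ import print_function
import os

# Relative paths taken from the setup steps above.
REQUIRED_FILES = [
    os.path.join("tools", "crfsuite", "crfsuite"),
    os.path.join("tools", "stanford-postagger", "stanford-postagger.jar"),
    os.path.join("tools", "stanford-postagger", "models",
                 "english-left3words-distsim.tagger"),
]


def missing_files(hitdeid_root):
    """Return the required files that are absent under `hitdeid_root`."""
    return [rel for rel in REQUIRED_FILES
            if not os.path.isfile(os.path.join(hitdeid_root, rel))]


# Assumption: the script is started from the directory containing "HITdeid".
missing = missing_files("HITdeid")
if missing:
    print("Missing external files:")
    for rel in missing:
        print("  " + rel)
else:
    print("All external files found.")
```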

Usage:
1) Show the help information.
#bash bin/runHITdeid.sh
2) Make sure the HITdeid model is available.
-- Run with '-m' to specify the model file, or move it to "HITdeid/models/HITdeid_crfsuite.model".
3) Run HITdeid; an input directory must be specified.
#bash bin/runHITdeid.sh -i [inputdir]
-- The outputs are written to "HITdeid/output/"; use '-o' to specify a different output directory.
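
For batch use, the command line above can be assembled from Python. The following is a minimal sketch rather than an interface shipped with HIT-DEID: the helper names are hypothetical, it assumes the script is started from the "HITdeid" directory, and it assumes that '-o' and '-m' take a path as the next argument in the same way as '-i'.

```python
from __future__ import print_function
import subprocess


def build_command(input_dir, output_dir=None, model_file=None,
                  script="bin/runHITdeid.sh"):
    """Build the argument list for runHITdeid.sh (hypothetical helper)."""
    cmd = ["bash", script, "-i", input_dir]
    if output_dir is not None:
        cmd += ["-o", output_dir]  # assumed form: -o [outputdir]
    if model_file is not None:
        cmd += ["-m", model_file]  # assumed form: -m [modelfile]
    return cmd


def run_hitdeid(input_dir, output_dir=None, model_file=None, cwd="HITdeid"):
    """Run the script from `cwd`; raises CalledProcessError on non-zero exit."""
    subprocess.check_call(
        build_command(input_dir, output_dir, model_file), cwd=cwd)


# Only print the command here; call run_hitdeid(...) to execute it.
print(build_command("records/", output_dir="deid_out/"))
# ['bash', 'bin/runHITdeid.sh', '-i', 'records/', '-o', 'deid_out/']
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.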

Contact:

Authors: Zengjian Liu, Buzhou Tang

Email: liuzengjian.hit@gmail.com, tangbuzhou@gmail.com

Address: Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, 518055.
