Introduction:
HIT-DEID is a de-identification tool for English electronic medical records. It identifies twenty-five types of protected health information (PHI), including the eighteen types defined by HIPAA. The tool is built on conditional random fields (CRF) and trained on the corpus of the 2014 i2b2 (Informatics for Integrating Biology and the Bedside) clinical natural language processing (NLP) challenge. The previous version, submitted to the 2014 i2b2 challenge, achieved a micro F-score of 91.24% under the "strict" criterion, ranking second among all systems in the challenge [1][2]. After the challenge, we improved it by adding a new tokenization module, updating the rules, and adding a postprocessing module. The current HIT-DEID tool achieves a micro F-score of 93.13% under the "strict" criterion on the challenge corpus (as shown in the following table).
The released HIT-DEID tool is trained on both the training and test sets of the challenge so that it is ready for practical use.
| | Precision (P) | Recall (R) | F-score (F) |
|---|---|---|---|
| Token | 0.9701 | 0.9416 | 0.9556 |
| Strict | 0.9534 | 0.9102 | 0.9313 |
| Relaxed | 0.9549 | 0.9116 | 0.9327 |
Here, "token" checks whether a predicted token exactly matches a token in a gold phrase of the same category. "Strict" means that a PHI instance counts as correctly extracted only when it exactly matches a gold instance in both boundary and category. "Relaxed" means that a PHI instance counts as correctly extracted when it has the same category as a gold instance and mostly overlaps with it (at most two mismatched characters at the end are allowed).
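The "strict" and "relaxed" criteria can be illustrated with a minimal sketch. This is not the official i2b2 evaluation script. It assumes that each PHI instance is a hypothetical (start, end, category) tuple of character offsets, and it reads "two characters mismatched at the end" as "same start offset, end offsets differing by at most two". The function names are illustrative only.

```python
def strict_match(pred, gold):
    # Same boundaries and same category.
    return pred == gold


def relaxed_match(pred, gold, slack=2):
    # Assumed reading of "relaxed": same start and category,
    # end offset may differ by at most `slack` characters.
    return (pred[0] == gold[0]
            and pred[2] == gold[2]
            and abs(pred[1] - gold[1]) <= slack)


def micro_prf(preds, golds, match):
    # Each gold instance can be matched by at most one prediction.
    used = set()
    tp = 0
    for p in preds:
        for i, g in enumerate(golds):
            if i not in used and match(p, g):
                used.add(i)
                tp += 1
                break
    precision = tp / float(len(preds)) if preds else 0.0
    recall = tp / float(len(golds)) if golds else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: the DATE prediction ends two characters early,
# and the AGE prediction has no gold counterpart.
gold = [(0, 10, "NAME"), (20, 30, "DATE")]
pred = [(0, 10, "NAME"), (20, 28, "DATE"), (40, 45, "AGE")]

strict_scores = micro_prf(pred, gold, strict_match)
relaxed_scores = micro_prf(pred, gold, relaxed_match)
```

On this toy data the DATE prediction counts only under the relaxed criterion, so the relaxed scores are higher than the strict ones, mirroring the small gap between the "Strict" and "Relaxed" rows of the table.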
References:
[1] Stubbs A, Kotfila C, Uzuner O. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics, 2015.
[2] Liu Z, Chen Y, Tang B, et al. Automatic de-identification of electronic medical records using token-level and character-level conditional random fields. Journal of Biomedical Informatics, 2015.
Download:
To get the source code of HIT-DEID, please fill out the [Application Form] and send it to Zengjian Liu (liuzengjian.hit@gmail.com) or Buzhou Tang (tangbuzhou@gmail.com). The trained model can be downloaded here: [HITdeid_crfsuite_model].
System configuration:
1) Linux system (tested on CentOS 6.4 and Ubuntu 12.04)
2) Environment requirements:
Bash
Python 2.7.*
Java (1.4 or later; Java 1.7 is recommended)
3) External Software:
a. CRFsuite:
-- Download source code at: http://www.chokkan.org/software/crfsuite/
-- Unpack and install CRFsuite on the system
or
-- Download the binary release and copy the "crfsuite" file to "HITdeid/tools/crfsuite/".
b. Stanford POS tagger:
-- Download "basic English Stanford Tagger" at: http://www-nlp.stanford.edu/software/tagger.shtml
-- Unzip the archive, then rename the directory and move it to "HITdeid/tools/stanford-postagger/".
-- Please make sure that "stanford-postagger.jar" and "models/english-left3words-distsim.tagger" exist.
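The external software setup above can be checked before the first run with a small script. This is a sketch, not part of HIT-DEID. It assumes that "models/english-left3words-distsim.tagger" sits inside "HITdeid/tools/stanford-postagger/", and that a system-wide CRFsuite install puts an executable named "crfsuite" on PATH. The function names are hypothetical.

```python
import os

# Files the setup steps say must exist, relative to the HITdeid root.
# Assumption: the tagger model lives under the stanford-postagger directory.
REQUIRED_FILES = [
    os.path.join("tools", "stanford-postagger", "stanford-postagger.jar"),
    os.path.join("tools", "stanford-postagger", "models",
                 "english-left3words-distsim.tagger"),
]
LOCAL_CRFSUITE = os.path.join("tools", "crfsuite", "crfsuite")


def on_path(name):
    # Portable PATH lookup that also works on Python 2.7.
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, name)):
            return True
    return False


def missing_dependencies(root):
    # Return the relative paths that could not be found under `root`.
    missing = [p for p in REQUIRED_FILES
               if not os.path.isfile(os.path.join(root, p))]
    # CRFsuite is accepted either as a local copy or as a system install.
    has_local = os.path.isfile(os.path.join(root, LOCAL_CRFSUITE))
    if not has_local and not on_path("crfsuite"):
        missing.append(LOCAL_CRFSUITE)
    return missing
```

Calling missing_dependencies("HITdeid") from the parent directory returns an empty list when everything is in place, and otherwise lists the paths that still need to be provided.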
Usage:
1) Show the help information.
#bash bin/runHITdeid.sh
2) Make the HITdeid model available.
-- Run with '-m' to specify the model file, or move the model to "HITdeid/models/HITdeid_crfsuite.model".
3) Run HITdeid. An input directory must be specified.
#bash bin/runHITdeid.sh -i [inputdir]
-- The outputs are written to "HITdeid/output/"; use '-o' to specify a different output directory.
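For batch jobs, the command line above can be assembled from Python. The following is a minimal sketch that uses only the '-i', '-o' and '-m' options mentioned above. It assumes that '-o' and '-m' each take a path argument and that the script is run from the HITdeid root directory. The helper names are hypothetical and not part of HIT-DEID.

```python
import subprocess


def build_command(input_dir, output_dir=None, model_path=None):
    # Only the options described in the usage notes are used here.
    cmd = ["bash", "bin/runHITdeid.sh", "-i", input_dir]
    if output_dir:
        cmd += ["-o", output_dir]   # assumed: '-o' takes a directory path
    if model_path:
        cmd += ["-m", model_path]   # assumed: '-m' takes a model file path
    return cmd


def run_hitdeid(root, input_dir, output_dir=None, model_path=None):
    # Run from the HITdeid root so that "bin/runHITdeid.sh" resolves.
    # Returns the exit status of the script.
    return subprocess.call(
        build_command(input_dir, output_dir, model_path), cwd=root)
```

For example, run_hitdeid("HITdeid", "/data/notes", output_dir="/data/deid") would run the tool on a hypothetical "/data/notes" directory and write the results to "/data/deid".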
Contact:
Authors: Zengjian Liu, Buzhou Tang
Email: liuzengjian.hit@gmail.com, tangbuzhou@gmail.com
Address: Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, 518055.