HIT-DEID is a de-identification tool for English electronic medical records. It identifies twenty-five types of protected health information (PHI), including eighteen types of HIPAA-defined PHI. The tool is based on conditional random fields (CRFs) trained on the corpus of the 2014 i2b2 (Informatics for Integrating Biology and the Bedside) clinical natural language processing (NLP) challenge. The previous version, submitted to the 2014 i2b2 challenge, achieved a micro F-score of 91.24% under the "strict" criteria, ranking second among all systems in the challenge. After the challenge, we improved it by using a new tokenization module, updating the rules, and adding a postprocessing module. The current HIT-DEID tool achieves a micro F-score of 93.13% under the "strict" criteria on the challenge corpus (as shown in the following table).
The released HIT-DEID model is trained on both the training and test sets of the challenge, so that it is ready for practical use.
Stubbs A, Kotfila C, Uzuner O. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1[J]. Journal of Biomedical Informatics, 2015. [PDF]
Liu Z, Chen Y, Tang B, et al. Automatic de-identification of electronic medical records using token-level and character-level conditional random fields[J]. Journal of Biomedical Informatics, 2015. [PDF]
To get the source code of HIT-DEID, please fill out the [Application Form] and send it to Zengjian Liu (email@example.com) or Buzhou Tang (firstname.lastname@example.org). The trained model can be downloaded here: [HITdeid_crfsuite_model].
1) Linux system (tested on CentOS 6.4 and Ubuntu 12.04)
2) Environment requirements:
Java (1.4 or later; Java 1.7 is recommended)
3) External Software:
a. CRFsuite:
-- Download the source code at: http://www.chokkan.org/software/crfsuite/
-- Unpack and install CRFsuite on the system.
-- Download the binary file and copy the "crfsuite" file to "HITdeid/tools/crfsuite/".
b. Stanford POS tagger:
-- Download "basic English Stanford Tagger" at: http://www-nlp.stanford.edu/software/tagger.shtml
-- Unzip it, then rename the directory and move it to "HITdeid/tools/stanford-postagger/".
-- Make sure that "stanford-postagger.jar" and "models/english-left3words-distsim.tagger" exist.
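The directory layout described above can be checked before the first run. The following is a minimal sketch, not part of HIT-DEID: the function name `check_hitdeid_layout` is hypothetical, and it assumes that "models/english-left3words-distsim.tagger" is located inside "HITdeid/tools/stanford-postagger/".

```shell
# Hypothetical helper (not shipped with HIT-DEID): reports which of the
# external files listed above are missing under a HITdeid directory.
# Assumption: the tagger model sits under tools/stanford-postagger/models/.
check_hitdeid_layout() {
    local root="${1:-HITdeid}"   # HITdeid root directory, defaults to ./HITdeid
    local status=0
    local rel
    for rel in \
        tools/crfsuite/crfsuite \
        tools/stanford-postagger/stanford-postagger.jar \
        tools/stanford-postagger/models/english-left3words-distsim.tagger
    do
        if [ ! -f "$root/$rel" ]; then
            echo "missing: $root/$rel"
            status=1
        fi
    done
    # 0 when every file was found, 1 otherwise
    return "$status"
}
```

Calling `check_hitdeid_layout /path/to/HITdeid` prints one "missing: ..." line per absent file and returns a non-zero status if anything is missing. It does not verify the Java installation, which can be checked separately with `java -version`.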
1) Show the help information.
2) Make sure the HITdeid model is available.
-- Run with '-m' to specify the model file, or just move it to "HITdeid/models/HITdeid_crfsuite.model".
3) Run HITdeid. An input directory must be specified.
#bash bin/runHITdeid.sh -i [inputdir]
-- The outputs are written to "HITdeid/output/" by default; use '-o' to specify a different output directory.
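The three options mentioned above ('-i', '-o' and '-m') can be combined in one call. The following is a small sketch of a wrapper that assembles such a command line. The function name `hitdeid_cmd` and the example paths are hypothetical, and combining all three options in a single invocation is an assumption based on the descriptions above. The function only prints the command and does not run it.

```shell
# Hypothetical wrapper (not shipped with HIT-DEID): prints the command line
# for bin/runHITdeid.sh. Only '-i' is required; '-o' and '-m' are added
# when an output directory or a model file is given.
hitdeid_cmd() {
    local input="${1:-}"    # input directory (required)
    local output="${2:-}"   # output directory (optional)
    local model="${3:-}"    # model file (optional)
    if [ -z "$input" ]; then
        echo "error: an input directory is required" >&2
        return 1
    fi
    local cmd="bash bin/runHITdeid.sh -i $input"
    if [ -n "$output" ]; then
        cmd="$cmd -o $output"
    fi
    if [ -n "$model" ]; then
        cmd="$cmd -m $model"
    fi
    echo "$cmd"
}
```

For example, `hitdeid_cmd records deid_out my.model` prints `bash bin/runHITdeid.sh -i records -o deid_out -m my.model`, which can be inspected and then run from the HITdeid directory. Paths containing spaces are not handled by this sketch.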
Author: Zengjian Liu, Buzhou Tang
Address: Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, 518055.