Evaluating the Robustness of Biomedical Concept Normalization

Motivation

Concept normalization is the process of standardizing ambiguous, variable terms in text (or resolving missing links in graphs), thus facilitating effective information access. In the biomedical domain, concept normalization involves linking entity mentions in text to standard concepts in a knowledge base or an ontology. BERT and its variants with domain-specific pre-training achieve impressive performance on this task.

It is important to note, however, that these same models have been shown to be insensitive to word-order permutations and vulnerable to adversarial attacks. This lack of robustness prevents NLP systems from being deployed in real-world settings. For the normalization task, especially in the biomedical domain, such attacks and their effects have not been explored. Our work examines the effect of various heuristics-based input transformations and adversarial attacks on the task of normalizing biomedical concepts. This direction of work is important because malicious attacks on normalization models in production could wreak havoc in application domains such as the healthcare industry.

Fig. 1: Even after applying invalid transformations to the text, BERT still predicts with high confidence that the link between, say, “inhibitors into cox-double” and “ace inhibitors” is correct.

Key Contributions

  • We systematically study the effect of 13 different input transformations, some of which are inspired by domain-specific, hand-crafted rules while others involve word-level modifications and word-order variations.

  • We propose imperceptible adversarial attacks that lead to a significant drop in model performance (from 86.0 to 58.05 F1-score) on the NCBI$^{[1]}$ dataset, revealing the brittle nature of top-performing normalization models.

  • Finally, we explore existing mitigation strategies to make the models sensitive to invalid input transformations for the normalization task.

Methodology

Our Task: Biomedical Concept Normalization

Given a text corpus $H$ and an ontology $O$, the aim of concept normalization is to find a mapping function $f$ that maps a mention $m$ in the corpus $H$ to a concept $c$ in the ontology $O$, i.e., $c = f(m)$.
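To make the interface concrete, here is a minimal sketch of $f$ framed as nearest-concept search in a shared embedding space (the embedding representation and concept list below are illustrative stand-ins, not the actual pipeline):

```python
import numpy as np

def normalize(mention_vec: np.ndarray,
              concept_vecs: np.ndarray,
              concept_ids: list[str]) -> str:
    """c = f(m): map a mention embedding to the nearest ontology concept.

    mention_vec:  (d,)   embedding of the mention m
    concept_vecs: (n, d) embeddings of the n ontology concepts
    concept_ids:  the n concept identifiers, aligned with concept_vecs
    """
    # Pick the concept whose embedding is closest to the mention.
    dists = np.linalg.norm(concept_vecs - mention_vec, axis=1)
    return concept_ids[int(np.argmin(dists))]
```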

Human Perceptible Input Transformations

An input transformation is defined as a function $\sigma: X \to X'$ that acts on an input $X$ to produce $X'$ such that $f(X')$ is not defined. Since the normalization task takes a pair of texts as input, one being the mention-level text and the other being the candidate concepts, transforming either one suffices for a robustness analysis. We first list the domain-specific transformations.

| Domain-Specific Input Transformation | Mention/Candidate | Set of modified examples |
| --- | --- | --- |
| Hyphenation | hereditary breast and ovarian cancer | {"hereditary-breast and ovarian cancer", "hereditary breast-and ovarian cancer", "hereditary breast and-ovarian cancer", "hereditary breast and ovarian-cancer"} |
| Number Replacement | c9 deficiency | {"cnine deficiency", "cix deficiency"} |
| Disorder Synonym and Mention Term Concatenation | behavioral abnormalities | behavioral episodes |
| Stemming | chromosomal fragmentation during meiosis | chromosom fragment during meiosi |
| Subject Object Conversion | hereditary breast and ovarian cancer | cancer from hereditary breast and ovarian |
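As an illustration, the Hyphenation transformation above can be generated mechanically by fusing each adjacent token pair; a minimal sketch (our own illustration, not the authors' released code):

```python
def hyphenation_variants(mention: str) -> set[str]:
    """Generate all single-hyphen variants of a mention by fusing
    each adjacent pair of tokens, as in the table above."""
    tokens = mention.split()
    variants = set()
    for i in range(len(tokens) - 1):
        fused = tokens[:i] + [tokens[i] + "-" + tokens[i + 1]] + tokens[i + 2:]
        variants.add(" ".join(fused))
    return variants

print(hyphenation_variants("hereditary breast and ovarian cancer"))
# {'hereditary-breast and ovarian cancer', 'hereditary breast-and ovarian cancer', ...}
```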

The following are generic input transformations.

  • Lexical-overlap based transformations - These retain the bag of words but change the word order in four different ways (a short sketch of these appears after this list):

    • Sort - sorting input tokens; 

    • Reverse - reversing the token sequence; 

    • Shuffle - shuffling tokens randomly; 

    • Copy Sort - transforming a candidate $c$ as a copy of the mention $m$ with the words sorted alphabetically.

  • Gradient-level transformations - These use the change in the loss for the $i$-th input token in a given mention $m$ to rank the tokens by their importance (a gradient-ranking sketch also follows the list). Based on this ranking, four operations are defined:

    • Drop - the least important token in a mention is dropped;

    • Repeat - the least important token is repeated;

    • Replace - the least important token is replaced by random tokens;

    • Copy One - the most important token is copied from the mention $m$ and used as the only token in the candidate $c$.

  • Chained Input Transformations - Multiple input transformations are “chained” together, i.e., transformations are applied sequentially to generate a new input. For example, the original mention “ace inhibitors” is transformed to “inhibitors into cox-double” by the chained application of three transformations in order: Subject-Object Conversion, followed by Number Replacement, and finally Hyphenation.
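The four lexical-overlap transformations are simple enough to sketch directly (a minimal illustration; the paper's exact tokenization may differ):

```python
import random

def sort_tokens(text: str) -> str:
    """Sort: alphabetically sort the input tokens."""
    return " ".join(sorted(text.split()))

def reverse_tokens(text: str) -> str:
    """Reverse: reverse the token sequence."""
    return " ".join(reversed(text.split()))

def shuffle_tokens(text: str, seed: int = 0) -> str:
    """Shuffle: randomly permute the tokens."""
    tokens = text.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)

def copy_sort(mention: str) -> str:
    """Copy Sort: the candidate becomes a sorted copy of the mention."""
    return sort_tokens(mention)
```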
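The gradient-level transformations need a token-importance ranking first. Here is a hedged PyTorch sketch, where `model` and `loss_fn` are placeholders for the fine-tuned ranker and its training loss:

```python
import torch

def rank_tokens_by_importance(model, loss_fn, input_embeds, target):
    """Rank the tokens of one mention by the gradient magnitude of the
    loss w.r.t. their embeddings (largest gradient = most important).

    input_embeds: (seq_len, dim) token-embedding tensor with
                  requires_grad=True.
    """
    loss = loss_fn(model(input_embeds), target)
    (grad,) = torch.autograd.grad(loss, input_embeds)
    importance = grad.norm(dim=-1)                 # one score per token
    return importance.argsort(descending=True).tolist()
```

Drop, Repeat, and Replace then act on the last index in this ranking (the least important token), while Copy One takes the first.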

Human Imperceptible Input Transformations - Adversarial Attacks

The transformations above produce changes that are detectable by humans, whereas adversarial modifications are imperceptible. Adversarial attacks confuse models and lead to performance degradation. We propose two different adversarial ranking attacks for the normalization task. Adversarial ranking attacks can alter the rank of a candidate by adding a perturbation to the candidate list, where the rank of a candidate is its position in the ordered candidate set. We adopt this approach to study the effect of adding imperceptible perturbations on candidate ranking.

  • Adversarial Ranking Attack (Adv-Rank) - For a chosen set of candidates $X = \{c_1, c_2, \ldots, c_n\}$ with respect to a specific mention from the set $M = \{m_1, m_2, \ldots, m_n\}$, we perform two types of adversarial ranking attacks:

    • Mention Attack (MA): an attack targeted at a mention $m_i$.

    • Candidate Attack (CA): an attack targeted at a candidate $c_i$.

MA+ and MA- are defined as variants of the Mention Attack that raise or lower the rank of a candidate set $C$ by perturbing a single mention $m_i$. Similarly, CA+ and CA- are defined as variants of the Candidate Attack that raise or lower the rank of a single candidate $c$ with respect to the mention set $M$.

The ranks are altered by adding a universal perturbation $r$. For a deep neural network, the final ranking order is determined by the positions of the samples in a common embedding space, so adding an adversarial perturbation can alter the ranking. This is done by minimizing a surrogate loss in the form of a triplet loss. For CA+ it is defined as: $$L_{CA+}\left(c_i, M; X\right) = \sum_{m_k \in M}\sum_{c_j \in X}\left(d(m_k, c_i) - d(m_k, c_j)\right)_+$$ where $k = \{1, 2, \ldots, |M|\}$ and $j = \{1, 2, \ldots, |X|\}$. Here $X$ denotes the set of all candidates, $M$ denotes the set of all mentions, $c_i$ is the candidate whose rank is raised w.r.t. the mention set $M$, and $d(\cdot,\cdot)$ is a distance function, typically the Euclidean distance.
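A direct PyTorch transcription of the CA+ surrogate loss above (a sketch under the stated definitions, using Euclidean distance for $d(\cdot,\cdot)$):

```python
import torch

def ca_plus_loss(c_i: torch.Tensor,
                 mentions: torch.Tensor,
                 candidates: torch.Tensor) -> torch.Tensor:
    """L_{CA+}(c_i, M; X): hinge over distance gaps that pushes c_i
    closer to every mention than every other candidate is.

    c_i:        (d,)     embedding of the candidate whose rank is raised
    mentions:   (|M|, d) mention embeddings
    candidates: (|X|, d) embeddings of all candidates
    """
    d_target = torch.cdist(mentions, c_i.unsqueeze(0))   # d(m_k, c_i): (|M|, 1)
    d_others = torch.cdist(mentions, candidates)         # d(m_k, c_j): (|M|, |X|)
    # (d(m_k, c_i) - d(m_k, c_j))_+ summed over all m_k and c_j
    return torch.clamp(d_target - d_others, min=0).sum()
```

Minimizing this loss over a perturbation $r$ added to $c_i$ pulls $c_i$ ahead of the other candidates for every mention.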

  • Least Similar Entity Concatenation (LSEC) - LSEC modifies the candidate set by concatenating a given candidate $c$ with the most dissimilar entity under the same parent in the ontology, i.e., its most dissimilar sibling. The following steps detail the LSEC attack for a mention $m$ and candidate $c$ (a sketch follows this list) -

    • Find the concept identifier of $c$ that links it to an ontology.

    • Access the immediate ancestor, which corresponds to the parent concept, and find the set of existing siblings.

    • Compute the pair-wise cosine similarity between the candidate concept and each of its siblings.

    • Select the most dissimilar entity and append it to the candidate $c$ to form $c'$.
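A sketch of these steps, assuming a hypothetical `ontology` interface (parent/children/name lookups) and an `embed` function returning unit-norm vectors; neither is the authors' actual API:

```python
import numpy as np

def lsec_attack(c_id: str, ontology, embed) -> str:
    """Append the least-similar sibling's name to candidate c.

    ontology.parent(cid)   -> parent concept id
    ontology.children(cid) -> list of child concept ids
    ontology.name(cid)     -> surface name of a concept
    embed(text)            -> unit-norm embedding vector for a name
    """
    parent = ontology.parent(c_id)                       # steps 1-2
    siblings = [s for s in ontology.children(parent) if s != c_id]
    c_vec = embed(ontology.name(c_id))
    # Cosine similarity reduces to a dot product for unit-norm vectors.
    sims = [float(c_vec @ embed(ontology.name(s))) for s in siblings]  # step 3
    worst = siblings[int(np.argmin(sims))]               # most dissimilar sibling
    return ontology.name(c_id) + " " + ontology.name(worst)  # step 4: c'
```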

Experiment and Evaluation

Datasets - We conduct experiments across three datasets, namely,

  • NCBI disease$^{[1]}$

  • BC5CDR Disease$^{[2]}$

  • BC5CDR Chemical$^{[2]}$

Models - For evaluating the transformations, we perform experiments on BERT-based Ranking$^{[3]}$. We use the following BERT models for reporting the results on BERT-based Ranking (in this blog we show results for only some models; please check out our paper for more insights). A short loading sketch follows the list.

  • BioBERT$^{[4]}$

  • Clinical-BERT$^{[5]}$

  • PubMed BERT$^{[6]}$

  • BERT$^{[7]}$

  • As an additional model, we use Triplet Search ConNorm$^{[8]}$, which is trained using a triplet objective, to observe the effect of the transformations on a different model.
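These checkpoints are available on the Hugging Face hub; for example, BioBERT can be loaded as below (a minimal sketch; the exact checkpoint name is an assumption, and the ranking head on top is the part described in [3]):

```python
from transformers import AutoModel, AutoTokenizer

name = "dmis-lab/biobert-base-cased-v1.1"   # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("hereditary breast and ovarian cancer", return_tensors="pt")
outputs = model(**inputs)                          # last_hidden_state: (1, seq_len, 768)
cls_embedding = outputs.last_hidden_state[:, 0]    # [CLS] vector used for ranking
```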

Results with Input Transformations - The following table shows the agreement scores of BioBERT fine-tuned on each of the specified datasets; a small sketch of how agreement can be computed follows the table. Fine-tuned BioBERT exhibits high agreement scores, indicating that the model maintains its original predictions even on transformed, invalid inputs. Copy-Sort and Copy-One are the transformations for which the model can identify invalid input, since those reduce a mention/candidate to copies of selected words and lack its other terms. The agreement scores are comparatively lower for gradient-based perturbations, where the important tokens are altered, and a similar drop is observed for chained transformations. On the contrary, for each individual transformation (except stemming, which changes the meaning of the input), the agreement scores are much higher. The model might be picking up spurious correlations related to the surface forms of biomedical entity names or phrases during training. The high agreement scores on the lexical transformations also confirm that the model is insensitive to word order.

| Input Transformation | NCBI disease | BC5CDR disease | BC5CDR chemical |
| --- | --- | --- | --- |
| Unperturbed | 0.93 | 0.92 | 0.94 |
| Hyphenation (H) | 0.89 | 0.88 | 0.92 |
| Subject Object Conversion (SOC) | 0.90 | 0.89 | 0.93 |
| Number Replacement (Num-R) | 0.88 | 0.89 | 0.91 |
| Disorder Synonym/Mention Term Concatenation (C) | 0.90 | 0.89 | 0.91 |
| Stemming (S) | 0.87 | 0.87 | 0.92 |
| H + SOC | 0.89 | 0.90 | 0.94 |
| H + SOC + Num-R | 0.87 | 0.80 | 0.93 |
| H + SOC + Num-R + C + S | 0.86 | 0.87 | 0.92 |
| Copy-Sort | 0.91 | 0.90 | 0.14 |
| Sort | 0.90 | 0.89 | 0.92 |
| Reverse | 0.91 | 0.89 | 0.92 |
| Shuffle | 0.91 | 0.89 | 0.92 |
| Drop | 0.85 | 0.85 | 0.88 |
| Repeat | 0.84 | 0.84 | 0.88 |
| Replace | 0.83 | 0.82 | 0.80 |
| Copy-One | 0.76 | 0.77 | 0.87 |
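Agreement here is the fraction of examples on which the model's top-ranked concept is unchanged by the transformation (our reading of the metric); a minimal sketch:

```python
def agreement(preds_original: list[str], preds_transformed: list[str]) -> float:
    """Fraction of examples where the transformed input keeps the
    original top-ranked concept prediction."""
    assert len(preds_original) == len(preds_transformed)
    same = sum(o == t for o, t in zip(preds_original, preds_transformed))
    return same / len(preds_original)
```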

Results with Adversarial Ranking Attacks - The next table reports the change in average rank under the Mention Attack (MA+) using a cosine-distance triplet loss (values are percentages; the % sign is omitted). For the designated embedding shift, there is a considerable shift in the average rank; the default value without the attack is $50\%$. The goal of this attack is to arbitrarily alter the ranked candidate list after the model generates it. This is a black-box attack, and the resulting rank shifts lead to a $4\%$ to $5\%$ dip in model performance for BERT-based ranking models. A sketch of the rank bookkeeping follows the table.

| Dataset | Emb. Shift | Attack | Clinical BERT | BioBERT |
| --- | --- | --- | --- | --- |
| NCBI | 1.3 | MA+ | ~50 → 51.8 | ~50 → 44.2 |
| BC5CDR Chemical | 1.3 | MA+ | ~50 → 42.6 | ~50 → 50.9 |
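Reading the table, the rank of a target is reported as a percentage of the candidate list, so an unattacked candidate sits near the middle at ~50. Under that reading, a minimal sketch of the bookkeeping (our own illustration):

```python
def average_rank_percent(ranks: list[int], num_candidates: int) -> float:
    """Average position of the targeted candidates, expressed as a
    percentage of the candidate-list length (ranks are 1-indexed)."""
    return 100.0 * sum(ranks) / (len(ranks) * num_candidates)

# Targets sitting mid-list in a 100-candidate set -> ~50 (the no-attack default).
print(average_rank_percent([48, 52, 50], num_candidates=100))  # 50.0
```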

Results with Least Similar Entity Concatenation - The final table below reports the F1-score before and after the Least Similar Entity Concatenation (LSEC) attack on the NCBI dataset, along with the average confidence of the post-attack predictions. Compared to Adv-Rank, LSEC is the stronger attack: it confuses the model, degrading performance while the model remains highly confident. In addition to the main experiments above, we perform the following additional experiments:

  • Varying the percentage (%) of inputs that are transformed

  • Are the transformations undetectable across different approaches?

  • Analysis of mitigation strategies for Input Transformations

| Model | F1 (before attack) | F1 (after attack) | Avg. confidence (after attack) |
| --- | --- | --- | --- |
| PubMed BERT | 0.853 | 0.580 | 0.986 |
| BioBERT | 0.859 | 0.582 | 0.990 |

Conclusion

Our work studies the effect of input transformations (word-level modifications and word-order variations) and adversarial attacks on the task of biomedical concept normalization. Our experiments show that BERT-based normalization models find it hard to identify invalid samples generated through input transformations. In addition, we propose two types of adversarial attacks, one of which forms natural-looking adversaries while the other affects the ranking of candidate/mention sets by adding imperceptible perturbations; both lead to model performance degradation. We conduct our experiments with different BERT-based normalization models that operate on a ranking-based approach, and we investigate mitigation strategies and techniques to increase the models' sensitivity to input transformations.

References

  1. Doğan, Rezarta Islamaj, Robert Leaman, and Zhiyong Lu. "NCBI disease corpus: a resource for disease name recognition and concept normalization." Journal of biomedical informatics 47 (2014): 1-10.

  2. Li, Jiao, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. "BioCreative V CDR task corpus: a resource for chemical disease relation extraction." Database 2016 (2016).

  3. Ji, Zongcheng, Qiang Wei, and Hua Xu. "Bert-based ranking for biomedical entity normalization." AMIA Summits on Translational Science Proceedings 2020 (2020): 269.

  4. Lee, Jinhyuk, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining." Bioinformatics 36, no. 4 (2020): 1234-1240.

  5. Alsentzer, Emily, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. "Publicly available clinical BERT embeddings." arXiv preprint arXiv:1904.03323 (2019).

  6. Gu, Yu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. "Domain-specific language model pretraining for biomedical natural language processing." ACM Transactions on Computing for Healthcare (HEALTH) 3, no. 1 (2021): 1-23.

  7. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

  8. Xu, Dongfang, and Steven Bethard. "Triplet-Trained Vector Space and Sieve-Based Search Improve Biomedical Concept Normalization." Association for Computational Linguistics (ACL), 2021.