Posted by Katrin Affolter on March 6, 2020
Did you ever wonder why computational linguists always say “natural language is ambiguous and various,” as though that explains why their tasks are complicated? I mean, I get it: There are many different languages and they have different structures and rules. But extracting pathologies from anonymized medical reports shouldn’t be that difficult. You know, because pathologies have official, medical names. For example, “spinal disk herniation” is just that. Perhaps you need to check for the Latin name “prolapsus disci intervertebralis” too – because in medicine everyone loves Latin. But that’s it, right? Hmm… unless they’re pressed for time and shorten it to “disk herniation.” I could imagine that, because it is obvious that a disk herniation affects the spine and not the knee… And now that I think about it, I have seen some other versions, like “herniated disk” or “slipped disk.” Oh, and in British English you would write “disc” with a ‘c’ instead of a ‘k’! Never mind, I’m starting to see why natural language is called “various.”
As mentioned in the first NLP blog entry, we data scientists are trying to identify all pathologies and body parts in radiology reports. While it is important to detect as many of these as possible, variability in natural language is a big challenge to overcome. To do so, we need to think of all possible ways to describe a medical term. This already seems like a big task, but to make it an even bigger challenge, we try to solve this problem in German. Remember the example compound “Dampfschifffahrtsgesellschaftskapitänswitwe”? German speakers also do that with medical terms. Because language never makes anything easy for anyone, there is no single, correct way to combine words. Let us look once again at disk herniation. In German, one possible way to describe this pathology is by using the two words “Bandscheibe” (disk) and “Vorfall” (prolapse). Want to guess how many ways there are to combine those two words?
Bandscheibenvorfall, Bandscheiben-Vorfall, Vorfall der Bandscheibe, etc…
The last one is not a compound word, but it is still a valid term in common use. The first example, “Bandscheibenvorfall”, uses ‘n’ as an interfix. The second one, “Bandscheiben-Vorfall,” is the same compound with a hyphen added to make its two components easier to tell apart. Compare this with the first NLP blog entry, where I showed how “Staubecken” can have two different interpretations. The hyphen is optional: as a rule of thumb, it can always be added or omitted. Helpful, isn’t it?
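To make this concrete, here is a minimal sketch of how the three variants above could be caught with a single regular expression. It is only an illustration, not the approach we actually use: the helper name `compound_pattern` and its parameters are made up for this example, and it assumes that the interfix and the multi-word paraphrase are supplied by hand.

```python
import re

def compound_pattern(modifier, head, interfix="", phrase_form=None):
    """Build a case-insensitive regex for variants of a two-part compound.

    Hypothetical helper for illustration only; the interfix and the
    multi-word paraphrase are assumed to be supplied by hand.
    """
    # Closed or hyphenated compound: modifier + interfix + optional hyphen + head.
    alternatives = [
        re.escape(modifier) + re.escape(interfix) + r"-?" + re.escape(head)
    ]
    if phrase_form is not None:
        # Multi-word paraphrase; tolerate any whitespace between the words.
        words = [re.escape(word) for word in phrase_form.split()]
        alternatives.append(r"\s+".join(words))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

pattern = compound_pattern(
    "Bandscheibe", "Vorfall", interfix="n", phrase_form="Vorfall der Bandscheibe"
)

for text in [
    "Bandscheibenvorfall",
    "Bandscheiben-Vorfall",
    "Vorfall der Bandscheibe",
    "Bandscheibe unauffällig",
]:
    print(text, "->", bool(pattern.search(text)))
```

Note that this sketch already misses inflected forms such as the genitive “Bandscheibenvorfalls,” because the pattern insists on a word boundary right after “Vorfall.” One more variant to think of.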
And this was just one possible way to mention the pathology disk herniation. In German, English words are often used directly without any translation (anglicisms). For example, the word “Baby” is widely used and the German word “Neugeborenes” is almost never used. Anglicisms on their own are difficult enough, but sometimes they are also used in compounds with German words, like “Bandscheiben-Bulging,” meaning the bulging of a disk. This combination of terms from different languages, or pseudo-anglicisms, complicates the issue even further. You could say that to search for pathologies in German, you need to know three languages (German, Latin and English) and all the combinations which are possible between them.
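A rough sketch shows how quickly the number of spellings grows once head words from different languages can attach to the same German modifier. The function `surface_forms` and the tiny word lists are hypothetical and only reuse the examples from this post; they are not a real vocabulary, and the two head words do not describe the same pathology.

```python
from itertools import product

def surface_forms(modifiers, heads):
    """Enumerate closed and hyphenated spellings for every modifier/head pair.

    Hypothetical helper; the modifiers are assumed to already carry their
    interfix (e.g. "Bandscheiben").
    """
    forms = set()
    for modifier, head in product(modifiers, heads):
        forms.add(modifier + head.lower())  # closed compound
        forms.add(modifier + "-" + head)    # hyphenated compound
    return sorted(forms)

# Illustrative word lists taken from the examples in this post.
forms = surface_forms(["Bandscheiben"], ["Vorfall", "Bulging"])
for form in forms:
    print(form)
```

One modifier and two head words already give four spellings. Every additional synonym, in any of the three languages, multiplies that number.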
In the end, for a single pathology we have a huge variety of possible words (including compound words) which we need to identify in the radiology reports. While analyzing radiology reports, we often encounter new words being used to describe a pathology. More than once I have thought, “Please, don’t come up with another one!”