This page contains an interactive visualisation of the patient embedding from the paper "Hospital-wide Natural Language Processing summarising health data of 1 million patients over a decade".
Click a cluster to see more details in the panel on the right. You can zoom and pan in the figure using the mouse.
Cluster number:
Patients in cluster:
Top 5 SNOMED codes:
Clustering of patients based on SNOMED disorder codes detected in free text. A sample of 100,000 patients was embedded based on normalised annotation counts for all SNOMED disorder codes detected in at least 1,000 patients at King's College Hospital. These vectors were reduced to 50 dimensions using PCA and then to 2 dimensions using t-SNE. Colour indicates cluster membership (50 clusters) assigned by agglomerative clustering with Ward linkage.
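The embedding steps described above can be sketched with scikit-learn as follows. This is only an illustration on small synthetic data, not the code used for the paper. Several details are assumptions: the row-wise normalisation of counts (the exact normalisation is not stated here), the choice to cluster the 50-dimensional PCA vectors rather than the 2-dimensional t-SNE coordinates, and all function and variable names.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_and_cluster(counts, n_pca=50, n_clusters=50, seed=0):
    """Return (2-D coordinates, cluster labels) for a patients x codes count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: normalise each patient's counts to sum to 1.
    row_sums = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    normalised = counts / row_sums
    # Reduce to 50 dimensions with PCA, then to 2 dimensions with t-SNE.
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(normalised)
    xy = TSNE(n_components=2, random_state=seed).fit_transform(reduced)
    # Assumption: Ward-linkage clustering is run on the PCA-reduced vectors.
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(reduced)
    return xy, labels


# Synthetic stand-in for annotation counts: 300 patients, 60 codes.
rng = np.random.default_rng(0)
demo_counts = rng.poisson(1.0, size=(300, 60))
demo_xy, demo_labels = embed_and_cluster(demo_counts)
```

Each row of `demo_xy` gives the plotted position of one patient, and `demo_labels` gives the cluster used to colour it.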
The prevalence of SNOMED codes is calculated for each cluster, with the count of each code propagated up the SNOMED ontology to all of its parents. The following SNOMED codes are then removed because they are uninformative (most are high-level parent codes with 100% prevalence in all clusters): 138875005, 64572001, 301857004, 123946008, 118234003, 404684003, 362965005. When a cluster is selected, up to 5 codes are shown. These are the most prevalent codes that are relevant to at least 50% of the patients in the cluster.
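A minimal sketch of this selection for a single cluster is given below. It assumes that "propagated up" means each patient is counted once for every ancestor of their codes, and that the codes shown are those with prevalence of at least 50%, ranked by prevalence. The `toy:` codes, the toy parent map and the function names are hypothetical. Only the excluded codes are taken from the text.

```python
from collections import Counter

# Uninformative codes listed in the text.
EXCLUDED = {"138875005", "64572001", "301857004", "123946008",
            "118234003", "404684003", "362965005"}


def ancestors(code, parents):
    """Return every ancestor of `code` in a child -> list-of-parents map."""
    seen = set()
    stack = list(parents.get(code, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, ()))
    return seen


def top_codes(patient_codes, parents, k=5, min_prevalence=0.5):
    """Return up to k (code, prevalence) pairs for one cluster."""
    if not patient_codes:
        return []
    counts = Counter()
    for codes in patient_codes:
        expanded = set(codes)
        for code in codes:
            expanded |= ancestors(code, parents)
        # Each patient contributes at most once per code.
        counts.update(expanded - EXCLUDED)
    n = len(patient_codes)
    ranked = sorted(
        ((count / n, code) for code, count in counts.items()
         if count / n >= min_prevalence),
        key=lambda item: (-item[0], item[1]),
    )
    return [(code, prevalence) for prevalence, code in ranked[:k]]


# Hypothetical toy ontology and cluster (one set of codes per patient).
toy_parents = {
    "toy:mi": ["toy:heart"],
    "toy:af": ["toy:heart"],
    "toy:heart": ["64572001"],
    "toy:asthma": ["64572001"],
    "64572001": ["404684003"],
    "404684003": ["138875005"],
}
toy_cluster = [{"toy:mi"}, {"toy:af"}, {"toy:mi", "toy:asthma"}, {"toy:asthma"}]
toy_top = top_codes(toy_cluster, toy_parents)
```

In this toy cluster the parent code `toy:heart` ranks first because it collects patients from both of its children, which is the effect that propagation is meant to achieve.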
For performance reasons, this visualisation is further subsampled to 20% of the original data, stratified by cluster.
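One way to draw such a stratified subsample is sketched below, using only the Python standard library. The function name, the fixed seed and the rule of keeping at least one patient per cluster are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction=0.2, seed=0):
    """Return sorted indices of a sample that keeps `fraction` of each cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)
    kept = []
    for members in by_cluster.values():
        # Assumption: keep at least one patient so no cluster disappears.
        n_keep = max(1, round(len(members) * fraction))
        kept.extend(rng.sample(members, n_keep))
    return sorted(kept)


# Three clusters of 50, 100 and 10 patients.
sample_labels = [0] * 50 + [1] * 100 + [2] * 10
sample_kept = stratified_subsample(sample_labels)
sample_sizes = [sum(1 for i in sample_kept if sample_labels[i] == c) for c in (0, 1, 2)]
```

Sampling within each cluster separately keeps the relative cluster sizes in the figure the same as in the full data.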