Sanofi: NLP text mining and other AI tech to advance findings, improve data use

By Melissa Fassbender


(Image: Getty/yucelyilmaz)


Real world evidence extraction is one area where Sanofi is using NLP text mining, a technology with applications across the R&D pipeline, from target identification to clinical trial design and pharmacovigilance.

Sanofi used Linguamatics’ artificial intelligence (AI) based natural language processing (NLP) text-mining software to process various literature sources as part of its multiple sclerosis (MS) biomarker project.

According to the company, it is using NLP and text analytics in several spaces, including target identification and prioritization, drug repurposing, interpretation of genes/proteins identified by ‘omics experiments, and full patent text mining for new targets. 

Sanofi is also using text mining to support clinical trial site selection and study design, as well as pharmacovigilance, among other areas.

To further discuss how the company is leveraging the technology – which recently received the Frost & Sullivan Global Product Leadership Award – Outsourcing-Pharma (OSP) caught up with Dongyu Liu (DL), associate director of translational sciences at Sanofi.

OSP: Why has it become even more important to have a comprehensive understanding of the genetic associations for the disease of interest?

DL:​ A key requirement in any drug development project is a comprehensive understanding of the genetic associations between the target gene and diseases. When we are working to develop a drug, we have a higher chance of success if we understand the genetic linkage between our selected target gene and diseases.

If we are able to identify causal gene mutations associated with a disease, there is a better chance that we can develop a drug that corrects the mutation.

OSP: Could you describe the MS biomarker project? What was the goal?

DL: The goal of the project was to identify new biomarkers for MS by exploring the association of HLA alleles and haplotypes with diseases and drug sensitivity. We used NLP technology to extract information from the unstructured text of millions of scientific publications.

We built a catalog of HLA allele annotations and established a workflow for HLA typing and analysis based on whole-exome sequencing. This identified more than 400 HLA alleles. We used NLP to search the literature to annotate the association of the HLA alleles with diseases and drug hypersensitivity.
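
As a rough illustration of the literature-annotation step Liu describes, the sketch below pairs HLA allele names with disease terms that appear in the same sentence. It is a minimal stand-in, not Sanofi's or Linguamatics' method: the regular expression, the `DISEASE_TERMS` list, and the `cooccurrences` helper are all assumptions made for this example, and same-sentence co-occurrence is a much cruder signal than a real NLP engine would use.

```python
import re

# Assumed pattern for allele names in standard HLA nomenclature,
# e.g. "HLA-DRB1*15:01" (gene, asterisk, colon-separated fields).
ALLELE_RE = re.compile(r"HLA-[A-Z]+\d*\*\d{2,3}(?::\d{2,3}){0,3}")

# Hypothetical disease vocabulary; a real project would load an ontology.
DISEASE_TERMS = ["multiple sclerosis", "narcolepsy"]


def cooccurrences(text):
    """Return (allele, disease) pairs that share a sentence in `text`."""
    pairs = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        alleles = ALLELE_RE.findall(sentence)
        lowered = sentence.lower()
        diseases = [d for d in DISEASE_TERMS if d in lowered]
        for allele in alleles:
            for disease in diseases:
                pairs.append((allele, disease))
    return pairs


# Illustrative input written for this example, not taken from the project.
sample = (
    "HLA-DRB1*15:01 has been linked to multiple sclerosis. "
    "No allele is named in this sentence about narcolepsy."
)
print(cooccurrences(sample))
```

Each returned pair is only a candidate annotation; in the workflow described here, such candidates would still need curation before entering a knowledge base.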

We were not only able to identify all 22 previously published autoimmune diseases and drug sensitivities associated with HLA alleles and haplotypes, we also uncovered an additional 33 novel unpublished disease and drug sensitivity associations – more than doubling previously known associations.

The curated annotations were fed into a searchable knowledge base for broad use within the Sanofi team in its search for novel biomarkers.

OSP: How is Sanofi using AI-based natural language processing (NLP) text-mining software to process an extensive collection of literature sources? What are the benefits of using this software? And how has its use evolved in the industry?

DL: We use NLP text mining to transform text from internet documents, patents, clinical trials, electronic health records (EHRs), conference reports and other literature from an unstructured format into structured text.

We use it for early drug research, gene disease mapping, target identification and prioritization, drug repurposing, interpretation of genes/proteins identified by ‘omics experiments and full patent text mining for new targets.

We also use it for different areas along the drug development continuum such as opportunity scouting, pharmacovigilance, competitive intelligence, and social media analysis.

The biggest advantage of NLP is that it allows us to quickly get the information we need from the source documents. If we were only able to do a Google-like keyword search, a lot of information would be missed.

With NLP, we can apply different ontologies in our searches, such as a disease search from literature that includes several different disease variations. We are then able to extract the details we need.
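
The point about ontologies can be sketched in a few lines: instead of searching for one literal string, each mention is matched against every known variant of a term and mapped back to a preferred name. The example below is a simplified illustration under stated assumptions; `DISEASE_ONTOLOGY`, `build_matcher`, and `extract_diseases` are hypothetical names, the two-entry dictionary stands in for a real ontology, and nothing here reflects the Linguamatics product's actual interface.

```python
import re

# Hypothetical mini-ontology: preferred term -> surface variants seen in text.
DISEASE_ONTOLOGY = {
    "multiple sclerosis": ["multiple sclerosis", "MS", "disseminated sclerosis"],
    "rheumatoid arthritis": ["rheumatoid arthritis", "RA"],
}


def build_matcher(ontology):
    """Compile one regex over all variants plus a variant -> term lookup."""
    variant_to_term = {}
    for term, variants in ontology.items():
        for variant in variants:
            variant_to_term[variant.lower()] = term
    # Longest variants first so multiword names win over short ones.
    alternatives = sorted(variant_to_term, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, variant_to_term


def extract_diseases(text, ontology=DISEASE_ONTOLOGY):
    """Return structured rows for every ontology variant found in `text`."""
    pattern, variant_to_term = build_matcher(ontology)
    return [
        {
            "mention": m.group(0),
            "preferred_term": variant_to_term[m.group(0).lower()],
            "start": m.start(),
        }
        for m in pattern.finditer(text)
    ]


text = "Patients with disseminated sclerosis or RA were enrolled."
for row in extract_diseases(text):
    print(row)
```

A plain keyword search for "multiple sclerosis" would return nothing for this sentence, while the ontology lookup maps both mentions to their preferred terms. Case-insensitive matching of short abbreviations such as "MS" will produce false positives on real text, which is one reason production systems rely on context rather than string matching alone.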

OSP: Why did Sanofi decide to use the Linguamatics product specifically?

DL: Around 2011, we started to evaluate different NLP vendors. We did a thorough review of the options, including Linguamatics.

We found that Linguamatics had a very good NLP engine, and the flexibility to apply different ontologies was very important to us. It's easy to plug in any ontology or dictionary, and you can extract any domain knowledge. We have been working with them since then.

OSP: How is Sanofi using NLP and text analytics in other areas of R&D?

DL: One area where we are using NLP text mining is extracting real-world evidence. Precision medicine holds great promise but requires extensive data based on real-world evidence from EHRs and other sources.

We want to understand how drugs work outside of the clinical trial environment to improve outcomes, so we are investing heavily in that area to gain more information. 

Beyond R&D, we are using text mining along the bench-to-bedside pipeline, in areas as diverse as opportunity scouting, pharmacovigilance, competitive intelligence, and social media analysis.

OSP: What do you expect the use of NLP to look like in the future?

DL: Down the road, I think we will be able to apply more machine learning and combine NLP with other AI technologies. This will allow us to advance our findings and make even better use of the data we extract.
