PRESS RELEASE | Bangalore, India, October 2019
Named entity recognition (NER) is an integral part of developing NLP algorithms. The variability of bio-medical literature, in both notation and context, poses a major challenge for entity recognition. Molecular Connections today announced the launch of its data extraction service, “curated gold standard training data for customers”, for key industries such as healthcare and pharmaceuticals.
Molecular Connections’ subject matter experts are highly experienced in creating gold standard training data sets that accommodate inter-individual variability. They adopt a triple blind approach, in which the corpus is manually annotated by two or three different annotators for various pharmacologically relevant entity types, so that machine learning algorithms can be trained efficiently. The work includes defining guidelines for the classification of bio-medical entities to eliminate ambiguity, and mapping each entity to the right standard identifiers.
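One common way to quantify how far independent annotators agree is Cohen's kappa, computed for each pair of annotators. The sketch below is illustrative only: the release does not state which agreement measure Molecular Connections uses, and the annotator names, entity labels and token sequence are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def pairwise_kappa(annotations):
    """Kappa for every pair of annotators in a {name: labels} mapping."""
    return {
        (a, b): cohen_kappa(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }


# Hypothetical token-level labels for one five-token sentence.
annotations = {
    "annotator_1": ["DRUG", "O", "DISEASE", "O", "O"],
    "annotator_2": ["DRUG", "O", "DISEASE", "O", "O"],
    "annotator_3": ["DRUG", "O", "O", "O", "O"],
}
scores = pairwise_kappa(annotations)
for pair, kappa in scores.items():
    print(pair, round(kappa, 3))
```

Pairs with low kappa point to the entity types where the annotation guidelines may need a clearer rule.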
All extracted evidence is traceable to the original research articles and linked to concepts across multiple ontologies for knowledge organization. Ontologies used include UMLS™, MeSH™, MedDRA™ and ATC to name a few, and are continuously extended as required by customers.
Speaking on the occasion, Mr. Vidyendra S., CTO, Molecular Connections, said, “The triple blind approach is highly essential for building high precision and recall NER models. Molecular Connections provides two or three sets of annotated data for each bio-medical document, along with detailed guidelines defined for all entity types of interest. The various scenarios that bring in ambiguity are discussed, and the resolution offered is proven to be extremely useful for building machine learning algorithms.”
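Precision and recall for an NER model are usually measured against the gold standard annotations. The following is a minimal sketch, assuming exact-match scoring in which a predicted entity counts as correct only when its character span and type both match a gold entity. The spans and entity types shown are hypothetical, and the release does not describe the scoring scheme actually used.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 under exact-match scoring.

    Each entity is a (start, end, entity_type) tuple.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold standard and model output for one document.
gold = {(0, 7, "DRUG"), (16, 21, "DISEASE"), (30, 38, "GENE")}
predicted = {(0, 7, "DRUG"), (16, 21, "DRUG")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the second prediction has the right span but the wrong type, so it lowers precision and recall alike. This is the kind of ambiguity that annotation guidelines aim to resolve.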
Molecular Connections has also carried out end-to-end projects in which it built the gold standard training data sets, from preprocessing documents through reading and extracting relevant entities, evaluating their relevance and linking them to ontologies, and then developed custom NER modules for its partners. Molecular Connections has also modelled knowledge graphs stored in a graph database that allows computational analysis of connected data. The data is accessible through an Application Programming Interface (API) and a graphical web front end. Users can run semantic and Boolean searches to find evidence, discover articles and cause-effect relationships, and visualize their results without technical knowledge of NLP technologies.
Molecular Connections’ proprietary platforms include:
MCAPS™: Molecular Connections Annotation Professional System, a proprietary A&I workflow solution. Evolved over 15 years of content mining and curation experience, the platform provides state-of-the-art technology solutions for A&I-based workflows, Journal Production Systems, Content delivery platforms, MIS, Quality Control and Security. The platform scales seamlessly to support high traffic and data processing needs and is used by many publishers for their indexing workflows. It is completely modular and is a plug-and-play solution.
MCPARSE™: Molecular Connections’ proprietary document Parsing module. The platform plays a key role in various Content Acquisition systems.
With a core ML/AI module in the background, this platform parses content from PDF, XML, HTML, DOC, Excel, LaTeX and other formats. The inbuilt transformation engine parses raw text and metadata and helps convert metadata to standard XML formats.
MCLEXiCON™: Molecular Connections’ proprietary Ontology/Taxonomy management system. It is a web-based platform enabling large-scale thesaurus/Ontology management. It has an inbuilt module to facilitate complex validation. The platform also has an API to integrate with popular content management systems such as Microsoft SharePoint. The in-house content store boasts more than 3.5 million terms from public ontologies across different subject areas, which are regularly updated and maintained.
MCMiNER™: Molecular Connections’ proprietary ML/AI platform for text mining. Equipped with state-of-the-art modules for NER, Content Classification, Recommendation and more, it plays a major role in text mining and is the backbone of most hybrid curation workflows (MCAPS™). Along with the MCLEXiCON™ content store, MCMiNER™ provides more than 40 ready-made APIs for content indexing, classification and related tasks across different subject areas.