IBM.Next: QuerioDALI - Architectural View
For any search engine that takes an unstructured query as input, the main steps are query construction, data/document retrieval, and presentation of results. Using natural language (NL) as input allows the system to better capture the user's information needs and retrieve more precise results (query construction). Building and querying knowledge graphs allows the system to answer queries across heterogeneous systems, with both enterprise and open data.
- Data needs to be ingested into our semantic infrastructure and enriched by adding geo-spatial and semantic dimensions: extracting datatypes, entities and meaningful relationships among them, annotating and linking entities across different datasets, and resolving entities. Annotations and linkages over these data repositories are stored in the context store. To create these annotations, which give meaning to the entities and provide a shared vocabulary, a pool of Linked Open Data (LOD) models and widely known ontologies is used.
- Deep parsing, using Watson assets, is used to understand more complex user needs, allowing non-technical users to query the distributed knowledge graphs in natural language. Watson NLP extracts the named entities and predicate argument structures (PAS triples).
- These PAS structures are mapped (partially or completely) onto contextual facts in the knowledge graphs, which can be combined to create a view that answers the user query.
- (Future work) Collect users' feedback to further refine the query through contextual recommendations, exploration and visualization.
NLP Deep Parsing:
The linguistic question analysis receives as input an unstructured text question and identifies syntactic and semantic elements of the question, which are encoded as structured information. The analysis of the question uses Watson's deep parsing and part of the NLP pipeline used in Jeopardy, implemented as a UIMA (Unstructured Information Management Architecture) application. Watson's parsing is general purpose and all the semantic components used are domain-independent.
In addition to the Watson Slot Grammar parser (ESG), the predicate argument structure (PAS), the R2 named-entity recogniser, and the Subtree Pattern matching framework, which executes a set of domain-independent rules to construct the PAS triples from the dependency tree, we also use external NE extraction tools over the Web of Data, in particular the IBM Alchemy API.
The entities extracted from the user query are matched to entities (classes, instances, properties or literals) in the knowledge graphs to be queried, each with a confidence score. To bridge the gap between the vocabulary used by the users and that of the knowledge sources to be queried, the user query terms are semantically expanded using a set of semantic annotators. The confidence score is therefore calculated from string distance metrics (exact vs. approximate match) and semantic distance (exact match vs. an alternative, lexically related name).
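The scoring idea above can be sketched as follows. This is only an illustration, not QuerioDALI's actual formula: the function names, the use of Python's `difflib.SequenceMatcher` as the string distance, and the 0.8 penalty applied to matches reached through an expanded term are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical penalty for a match reached through an expanded (lexically
# related) term instead of the user's own term; the value is an assumption.
EXPANSION_PENALTY = 0.8


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means a case-insensitive exact match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_confidence(query_term: str, expansions: list[str], label: str) -> float:
    """Score a candidate entity label against a query term and its expansions."""
    # Direct comparison with the user's own term (exact or approximate match).
    best = string_similarity(query_term, label)
    # Comparison through alternative names, discounted by the penalty.
    for alt in expansions:
        best = max(best, EXPANSION_PENALTY * string_similarity(alt, label))
    return best


# "movie" barely resembles "Film" as a string, but its expansion "film" matches
# exactly, so the score is the penalised exact match.
print(match_confidence("movie", ["film", "picture"], "Film"))  # 0.8
```

Keeping the two signals separate makes it easy to swap in another string metric or to weight each annotator differently.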
New background knowledge sources or annotators can be added at any time by specifying them in a configuration file, including the graph name, the location (a file or a SPARQL endpoint) and whether the source has been loaded and indexed by DALI into a context store (e.g., in the case of RDF/OWL files). Models loaded into the Context Store are stored using Jena TDB and indexed using Lucene LARQ, which allows full-text searches to be performed as part of a SPARQL query.
The sources used as annotators can be modified according to the scenario, but typically they include:
- WordNet: to find synonyms and hypernyms, and to study the relatedness between two words by measuring their depth and path in the taxonomy.
- A DBpedia endpoint: to find alternative names, superclasses and hypernym terms, obtained by following the IS-A taxonomy, Wikipedia redirects, owl:sameAs relationships, and the dcterms:subject and YAGO type taxonomies.
- Schema.org: widely used by search engines as a common vocabulary.
- Domain specific models: such as the social care or 211 taxonomy.
Graph Pattern Search and Ranking:
Given all matches for the query terms in a PAS triple, QuerioDALI searches for graph patterns (GPs) among them in the relevant knowledge graphs. A GP can be automatically translated into a SPARQL query and consists of: (1) basic graph patterns (BGPs) that belong to the same graph; (2) JOINs and UNIONs between the BGPs; (3) solution modifiers such as ORDER BY DESC/ASC, COUNT, OFFSET and LIMIT; (4) a confidence score, which is the combined score of the matches involved in all the BGPs in the GP; and (5) the variables that are the focus of the GP, if any.
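To make the GP structure concrete, here is a minimal sketch of how such a pattern could be held in memory and serialised to SPARQL. The `GraphPattern` class and its fields are hypothetical, and the sketch covers only a single BGP with an optional LIMIT; JOINs, UNIONs and the other solution modifiers are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GraphPattern:
    """Hypothetical, simplified GP: one BGP inside a single named graph."""
    graph: str                                       # graph the BGP belongs to
    triples: list[tuple[str, str, str]]              # (subject, predicate, object)
    focus: list[str] = field(default_factory=list)   # focus variables, if any
    confidence: float = 1.0                          # combined score of the matches
    limit: Optional[int] = None                      # optional solution modifier

    def to_sparql(self) -> str:
        # Select the focus variables, or everything when there is no focus.
        select = " ".join(self.focus) if self.focus else "*"
        body = " ".join(f"{s} {p} {o} ." for s, p, o in self.triples)
        query = f"SELECT {select} WHERE {{ GRAPH <{self.graph}> {{ {body} }} }}"
        if self.limit is not None:
            query += f" LIMIT {self.limit}"
        return query


gp = GraphPattern(
    graph="http://dbpedia.org",
    triples=[
        ("?film", "<http://dbpedia.org/ontology/starring>",
         "<http://dbpedia.org/resource/Benicio_del_Toro>"),
        ("?film", "<http://dbpedia.org/ontology/director>", "?director"),
    ],
    focus=["?director"],
    confidence=0.9,
    limit=10,
)
print(gp.to_sparql())
```

Storing the confidence score and focus variables next to the triples keeps everything needed for later ranking and joining in one object.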
Only the graph patterns that produce bindings are selected. The answers to a query are obtained by combining (joining) the graph patterns, each of which holds partial answers for one PAS triple, to obtain a complete answer. A confidence score is assigned according to how well the graph pattern matches the PAS triple in the user query. If there are alternative translations or interpretations, answers are ranked according to the combined score of the GPs joined to answer the query.
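The selection and ranking step might look like the sketch below. The text does not say how the scores are combined, so multiplying the GP scores is an assumption, as are the function name and the plain tuple representation of a GP as (confidence, bindings).

```python
from math import prod


def rank_interpretations(interpretations):
    """Rank alternative interpretations of a query by combined GP score.

    Each interpretation is a list of (confidence, bindings) pairs, one per
    graph pattern joined to answer the query.
    """
    ranked = []
    for gps in interpretations:
        # Skip interpretations in which some GP produced no bindings.
        if any(not bindings for _, bindings in gps):
            continue
        # Assumption: the combined score is the product of the GP scores.
        ranked.append((prod(conf for conf, _ in gps), gps))
    # Highest combined score first.
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


a = [(0.9, ["x1"]), (0.5, ["y1"])]
b = [(0.8, ["x2"]), (0.7, ["y2"])]
c = [(1.0, []), (1.0, ["y3"])]      # dropped: first GP has no bindings
for score, gps in rank_interpretations([a, b, c]):
    print(round(score, 2), gps)
```

With a product, one weak match lowers the whole interpretation, which is why `b` outranks `a` here even though `a` contains the single best match.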
Pattern Templates. The search for the GPs that best translate a PAS is based on the following parameterised templates. The appropriate template(s) are executed according to the type of the candidate entities, searching for direct patterns first and for indirect ones only when no direct ones are found. Including all patterns (based on the candidate types) and sub-patterns, there are currently 11 direct patterns and 7 indirect ones, 18 GP sub-patterns in total, categorized into 10 generic ones as shown in the following table (based on the type of the candidate matches for the PAS):
If there are matches for the properties, a filter is added: FILTER (?property = <property1> || ?property = <property2> ...)
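A filter of this shape can be generated from the list of matched property URIs. The helper below is a hypothetical sketch; its name and its default variable name are not taken from QuerioDALI.

```python
def property_filter(property_uris, var="?property"):
    """Build a SPARQL FILTER restricting `var` to the matched property URIs."""
    if not property_uris:
        return ""  # no property matches: leave the variable unconstrained
    clauses = " || ".join(f"{var} = <{uri}>" for uri in property_uris)
    return f"FILTER ({clauses})"


print(property_filter([
    "http://dbpedia.org/ontology/director",
    "http://dbpedia.org/property/director",
]))
```

Returning an empty string when nothing matched lets the caller append the result to a query body unconditionally.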
Merging GPs across the same or different graphs:
To merge answers across graphs, the GPs should belong to the same SPARQL endpoint or to federated SPARQL endpoints. To merge GPs, the first step is to find the join term for the query, which may not be explicitly the same in each graph. For instance, in the query "Is Eplerenone having side effects for Teresa's conditions?" the join term is "side effects" in the SIDER ontology ("the side effects of Eplerenone") and "conditions" in the patient data ("conditions of Teresa"). The second step is to find common bindings for the joins on the fly. For join terms across graphs, bindings will most likely have different URIs even if they represent the same term, so we perform syntactic merging (based on labels) and semantic merging (based on linkage, such as owl:sameAs or skos:closeMatch links across entities). For example, Teresa's condition "diabetes" will have a different URI than the Eplerenone side effect "type 2 diabetes", but the two terms are semantically linked.
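The two merging strategies can be illustrated with a small sketch. The function name, the dictionary-based representation of bindings, the precomputed set of links and the `example.org` URIs are all assumptions made for the example; they do not reflect the real SIDER or patient data identifiers.

```python
def merge_bindings(left, right, links):
    """Join bindings from two graphs on equal labels or explicit links.

    left, right: dicts mapping URI -> label for the join term in each graph.
    links: set of frozenset({uri_a, uri_b}) pairs, assumed to be collected
           beforehand from owl:sameAs / skos:closeMatch statements.
    """
    merged = []
    for l_uri, l_label in left.items():
        for r_uri, r_label in right.items():
            # Syntactic merging: labels agree after normalisation.
            syntactic = l_label.strip().lower() == r_label.strip().lower()
            # Semantic merging: the two URIs are explicitly linked.
            semantic = frozenset((l_uri, r_uri)) in links
            if syntactic or semantic:
                merged.append((l_uri, r_uri))
    return merged


conditions = {"http://example.org/patient/diabetes": "diabetes"}
side_effects = {
    "http://example.org/sider/type_2_diabetes": "type 2 diabetes",
    "http://example.org/sider/chest_pain": "chest pain",
}
links = {frozenset(("http://example.org/patient/diabetes",
                    "http://example.org/sider/type_2_diabetes"))}
print(merge_bindings(conditions, side_effects, links))
```

Here the labels differ, so the pair is joined only through the explicit link, mirroring the "diabetes" versus "type 2 diabetes" case described above.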
Example 1 (DBpedia): Question: Who is the director of movies starring Benicio del Toro and Catherine Zeta-Jones?
Answer: Steven Soderbergh (director of the movie Traffic, starring both Benicio del Toro and Catherine Zeta-Jones)
Example 2 (answers in SIDER, for side effects of drugs, and DrugBank, for drugs targeted to asthma; join term: drugs)
Question: What are the side effects of drugs targeted to asthma?
Answer: List of side effects such as chest pain (in the evidence figure)