Converting NL queries to RDBMS statements

Problem statement

While developing an enterprise search engine, we came into multiple use cases where multiple clients have a large amount of structured data (mainly RDBMS), where they want their non-technical employees to search for expected answers through those databases. For e.g., if there is a table like below:

Architecture diagram

The user can ask “let me know about the department number of Nirun” to get the response “20-B”, instead of writing the following SQL query

SELECT DEPT_NUM FROM EMPLOYEE WHERE FIRST_NAME = “NIRUN”;

Summarizing these are the requirements break-down

Users will ask in a Natural language
The query needs to be classified whether to prepare SQL query to hit on DB or to search on free-flowing text documents
Users can ask in multiple languages
Users should not be expected to remember the exact table/column name of the entire DB
User’s query can expect a response from multiple tables or DBs
Entire table content may not be available for ingesting; on larger DBs, only the DB schema and sample few rows will be shared by the client for processing
For security reasons only the query will be sent as an API to client DB and more than a certain amount of columns and rows cannot be extracted at a time

Solution Approach

We have decided to divide the solution into two separate phases.

Analysis phase
Incremental release phase

In this first part, only the analysis phase is mentioned.

1. Analysis Phase

In this phase, we have gone through multiple tools and concepts available over the internet. The following libraries and concepts we have analyzed.

Deep learning based solutions
Statistical based solutions

Deep learning based solutions

There are many deep learning based solutions available, a consolidated list you can find here:- https://github.com/salesforce/WikiSQL

These models are tested on top of a certain type of data; WikiSQL. WikiSQL are annotated NL query to table data like there is SQuAD data by Stanford for Q&A system. Although some model claim to go beyond 85% accuracy, these are data hungry i.e. some models need to ingest the entire table content data while training, on the other hand, other models need labeled data as per the schema of the table. This approach may work for a PoC, research work and few tables, however, are not scalable and hard to get annotated data.

Statistical based solutions

There are few libraries which drew our attention:

ln2sql

This is the best library we found. It has some cool features.

Features

GPL 3.0 license
Take DB dump as input .sql file
Only the DB dump is enough, no need for direct access to the whole DB
GUI available for ingesting
NLP input and SQL query like JSON in output
Multi-language supported
Customized thesaurus supported i.e. on a way are synonyms
Customized stop word supported
Most of the nodes are supported

Unavailable features

It uses TreeTagger, which later says to contact the developers for commercial use.
Python 3 only
Rule-based query template generation
Only working on MySQL DB
Entity extraction and schema mapping with the elements
Pseudo-column support not there i.e. SYSADMIN, SYSTIMESTAMP etc.
Only single table support

quepy

This is the second best library.

Features

License seems like BSD
Uses NLTK Tagger which does Tokenizing, PoS Tagging and Lemmatizing
A customized RegEx support called, refo

Unavailable features

Only supports Sparql and MQL, no support for SQL
It has not been maintained over a year
The customized regex is needed many efforts to understand and customize on need-basis

NLIDB

Features

Apache 2.0, easy license to use
Parse tree looks good

Unavailable features

Written in Java, More efforts to customize in Python
No good documentation
It has not been maintained over a year

Possible solutions already built by Non-open source companies

Other Good resources

SQLizer

The final conclusion of the analysis phase

ln2sql can be leveraged for taking DB dump as input for the initial release.
ln2sql’s customized thesaurus and stop word support are also good
SQLizer has mentioned a few good practices, those will be checked.
For scalability and customization point of view rest, all need to be built as an in-house product.

Thanks for your time.