PROJECT REPORT ON
Natural Language Database

DEPARTMENT OF INFORMATION TECHNOLOGY
IMPERIAL COLLEGE OF ENGINEERING AND RESEARCH, WAGHOLI, PUNE
Contents

1 Introduction
  1.1 Motivation
  1.2 Background

2 Literature Survey
  2.1 Advantages of NLDBI System
  2.2 Disadvantages of NLDBI System
  2.3 Some Already Developed NLDBI Systems
  2.4 Commercially Available Systems
  2.5 Various Approaches Used for Development of NLDBI Systems
      2.5.1 Symbolic Approach (Rule Based Approach)
      2.5.2 Empirical Approach (Corpus Based Approach)
      2.5.3 Connectionist Approach (Using Neural Network)

3 Proposed Work
  3.1 Problem Definition
  3.2 Features
  3.3 Goals
  3.4 Objectives
  3.5 Constraints

4 Project Design
  4.1 SRS - Software Tools
  4.2 Hardware Tools
  4.3 System Architecture
  4.4 UML Diagrams
      4.4.1 Use Case Diagram
      4.4.2 Sequence Diagram
      4.4.3 Activity Diagram
  4.5
      4.5.1 Module Tree
      4.5.2 Database Design
  4.6 Screen-shots

5 Test Design & Report
  5.1 Test Design for NLDBI
  5.2 Test Report
      5.2.1 Input/Output testing
      5.2.2 Suggestion list testing

6 Implementation and Coding
  6.1 List of questions answered by the system
      6.1.1 Who questions
      6.1.2 What questions
      6.1.3 Which questions
      6.1.4 When questions
  6.2 Parse Trees
  6.3 Context Free Grammar
      6.3.1 LEX specification
      6.3.2 YACC specification

7 Schedule of work

8 Conclusion and Future Scope
  8.1 Conclusion
  8.2 Future Scope

Bibliography
List of Figures

1.1 Existing NLDBI system with old user interface development techniques
4.1 System Architecture
4.2 System Architecture
4.3 System Architecture
4.4 System Architecture
4.5 System Architecture
4.6 NLDBI screen-shot interfaced with movie database and designed to answer questions related to movies
5.1 System behaviour for wrong question inputs
5.2 System behaviour for irrelevant inputs
5.3 System behaviour for correct questions
5.4 System behaviour for no input or null input
5.5 System behaviour if data not present in database
5.6 System's suggestion list behaviour for no keyword inputs
5.7 System's suggestion list behaviour for relevant keyword inputs
5.8 System's suggestion list behaviour for irrelevant inputs
6.1 Who questions class one
6.2 Who questions class two
6.3 Who questions class three
6.4 What questions class one
6.5 What questions class two
6.6 What questions class three
6.7 What questions class four
7.1 Gantt chart for preliminary research
7.2 Gantt chart for initial system design
7.3 Gantt chart for final development phase
Chapter 1 Introduction
A Natural Language Interface to Database (NLDBI) can act as an alternative interface for finding structured information in a database, particularly on small hand-held devices, because writing questions in natural language is much easier for a casual user than the complicated and time-consuming navigation required by traditional database interfaces. NLDBI shifts the burden away from the user: instead of learning how to use an interface, the user simply describes his or her information need to the system. NLDBI thus demands fewer input-output and processing facilities, which makes it more useful for mobile devices. The natural language interface to databases is not a new area; research has been going on for a long time. Most NLDBI systems developed so far have used shallow analysis. Shallow analysis uses domain knowledge rather than linguistic knowledge to interpret the meaning of the input question, which results in a low success rate. This is the main difficulty in putting NLDBI to any practical use. To increase the success rate, deep analysis of the input question can be used. Deep analysis uses linguistic knowledge for detailed understanding, and has the advantage of portability, which is not possible in shallow analysis because of its heavy dependency on the database. To explore deep analysis for NLDBI, movie inquiry is selected as the domain. For the movie inquiry domain a corpus of questions is collected. This corpus is then analysed to find the patterns of questions used in the domain. Based on these patterns a context free grammar is designed to represent the syntax of input questions. Representing syntax is an important step, as in the deep analysis method the subsequent steps depend on it.
1.1 Motivation
Downsides of existing query languages (e.g., SeRQL, SPARQL):
1. Complex syntax
2. Not easy to learn
3. Writing queries is an error-prone task
4. Requires understanding of Semantic Web technologies

Downsides of existing systems (Figure 1.1): the system requires the user to first select the columns that he would like to use to make his query. Next the user has to write an appropriate query with no guidance from the system. Then the user is supposed to press "Do query" to generate the output. This system appears to be complex and requires the user to perform a lot of steps with no guidance whatsoever. Even though there are so many technologies available for designing rich and user-friendly user interfaces, the existing systems still seem difficult to use, leaving scope to apply the knowledge of Human Computer Interaction to develop a user-friendly system that performs better than the existing ones.
Figure 1.1: Existing NLDBI system with old user interface development techniques
1.2 Background
Traditionally, most natural language interfaces have followed a natural-language front end paradigm, whereby natural language queries are translated into formal database queries (i.e., into some formal query language). These queries are then processed by a database management system. The requested data could be returned as is, or could be passed back to the natural-language component, which generates a natural-language version of the data. This effectively divides the problem of natural-language database access into two sub-problems: the natural-language component and the database access component. The methods for translating natural language to formal query languages have varied over the years. The Lunar Sciences Natural Language Information System (LUNAR) is one of the oldest natural-language interfaces to a database: it performed a simple syntactic analysis of the English query, followed by semantic analysis of the resulting parse tree.
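As a sketch of this front-end paradigm (not this report's actual implementation; the pattern table, table name and data below are invented for illustration), a minimal pattern-matching translator can map a question to a parameterized SQL query and run it:

```python
import sqlite3

# Hypothetical pattern table: question prefix -> SQL template.
PATTERNS = {
    "who is the director of": "SELECT director_name FROM movie_info WHERE movie_name = ?",
}

def translate(question: str):
    """Translate a natural-language question into (sql, params), or None."""
    q = question.lower().rstrip(" ?")
    for prefix, sql in PATTERNS.items():
        if q.startswith(prefix):
            value = q[len(prefix):].strip()
            return sql, (value,)
    return None  # question is outside the linguistic coverage

# Demo against a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_info (movie_name TEXT, director_name TEXT)")
conn.execute("INSERT INTO movie_info VALUES ('inception', 'Christopher Nolan')")

sql, params = translate("Who is the director of Inception ?")
row = conn.execute(sql, params).fetchone()
print(row[0])  # -> Christopher Nolan
```

Real systems replace the flat pattern table with a full parser, but the two sub-problems (language analysis, then database access) separate in exactly this way.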
Chapter 2 Literature Survey

2.1 Advantages of NLDBI System

One advantage of NLDBIs is supposed to be that the user is not required to learn an artificial communication language. Formal query languages like SQL are difficult to learn and master, at least by non-computer-specialists.
No Need for Training
Graphical interfaces and form-based interfaces are easier for occasional users; still, invoking forms, linking frames, selecting restrictions from menus, etc. constitute artificial communication languages that have to be learned and mastered by the end-user. In contrast, an ideal NLDBI would allow queries to be formulated in the user's native language.
Better for Some Questions
It has been argued that there are some kinds of questions (e.g. questions involving negation, or quantification) that can be easily expressed in natural language, but that seem difficult (or at least tedious) to express using graphical or form-based interfaces. For example, "Which department has no programmers?" (negation), or "Which company supplies every department?" (universal quantification), can be easily expressed in natural language, but would be difficult to express in most graphical or form-based interfaces. Questions like the above can, of course, be expressed in database query languages like SQL, but complex query language expressions may have to be written.
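To illustrate the point, the two example questions can be written in SQL; the schema and data below are invented for this demonstration. Negation maps naturally to NOT IN, and universal quantification is expressed here as relational division by counting:

```python
import sqlite3

# Illustrative schema and data (invented for this example).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, role TEXT);
CREATE TABLE supplies  (company TEXT, dept TEXT);
INSERT INTO employees VALUES ('ann','sales','clerk'), ('bob','it','programmer');
INSERT INTO supplies  VALUES ('acme','sales'), ('acme','it'), ('globex','it');
""")

# "Which department has no programmers?" -- negation via NOT IN.
no_prog = conn.execute("""
    SELECT DISTINCT dept FROM employees
    WHERE dept NOT IN (SELECT dept FROM employees WHERE role = 'programmer')
""").fetchall()

# "Which company supplies every department?" -- universal quantification,
# expressed as relational division: the company's distinct departments
# must cover all departments.
every = conn.execute("""
    SELECT company FROM supplies GROUP BY company
    HAVING COUNT(DISTINCT dept) = (SELECT COUNT(DISTINCT dept) FROM employees)
""").fetchall()

print(no_prog)  # [('sales',)]
print(every)    # [('acme',)]
```

The one-sentence English questions thus expand into nested, multi-clause SQL, which is exactly the tedium the NLDBI is meant to hide.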
Queries that involve multiple database tables, like "list the address of the farmers who got a bonus greater than 10000 rupees for the crop of wheat", are more difficult to form in a graphical user interface than in a natural language interface.
2.2 Disadvantages of NLDBI System
A frequent complaint against NLDBIs is that the system's linguistic capabilities are not obvious to the user. Current NLDBIs can only cope with limited subsets of natural language. Users find it difficult to understand (and remember) what kinds of questions the NLDBI can or cannot cope with. Formal query languages, form-based interfaces, and graphical interfaces typically do not suffer from these problems. In the case of formal query languages, the syntax of the query language is usually well-documented, and any syntactically correct query is guaranteed to be given an answer. In the case of form-based and graphical interfaces, the user can usually understand what sorts of questions can be input by browsing the options offered on the screen, and any query that can be input is guaranteed to be given an answer. For example, suppose the database doesn't contain information about the profit of farmers and the user inputs the query "which farmer gets maximum profit". This query is outside the linguistic coverage of the system.
Linguistic vs. Conceptual Failures
When the NLDBI cannot understand a question, it is often not clear to the user whether the rejected question is outside the system's linguistic coverage, or outside the system's conceptual coverage. Thus, users often try to rephrase questions referring to concepts the system does not know (e.g. rephrasing questions about salaries to a system that knows nothing about salaries), because they think that the problem is caused by the system's limited linguistic coverage. In other cases, users do not try to rephrase questions the system could conceptually handle, because they do not realize that the particular phrasing of the question is outside the linguistic coverage, and that an alternative phrasing of the same question could be answered. Some NLDBIs attempt to solve this problem by providing diagnostic messages showing the reason a question cannot be handled. For example, the user asks the query "list the names of farmers who are 35 years old"; the database has information about the age of the farmer, but the word "age" is not in the query. This query is not outside the linguistic coverage, but conceptually it is not right, because the system does not understand what 35 represents, i.e. whether it is an age or an address.
Users Assume Intelligence
NLDBI users are often misled by the system's ability to process natural language: they assume that the system is intelligent, that it has common sense, or that it can deduce facts, while in fact most NLDBIs have no reasoning abilities. This problem does not arise in formal query languages, form-based interfaces, and graphical interfaces, where the capabilities of the system are more obvious to the user. For example, when the user asks the query "list the names of farmers who are 35 years old", he or she does not specify the word "age", assuming that the system will understand it automatically. But the system is not so intelligent.
Inappropriate Medium
It has been argued that natural language is not an appropriate medium for communicating with a computer system. Natural language is claimed to be too verbose or too ambiguous for human-computer interaction. NLDBI users have to type long questions, while in form-based interfaces only fields have to be filled in, and in graphical interfaces most of the work can be done by mouse-clicking. In a natural language interface the user has to type a full sentence with all the connectors (articles, prepositions, etc.), but in graphical or form-based interfaces this is not required.
2.3 Some Already Developed NLDBI Systems
LUNAR System
The LUNAR system used two databases, one for the chemical analysis and the other for literature references. The LUNAR system uses an Augmented Transition Network (ATN) parser and Woods' Procedural Semantics. According to Woods, the LUNAR system's performance was quite impressive: it managed to handle 78% of the requests.
LADDER
This system was developed in 1978. It was designed as a natural language interface to a database of information about US Navy ships. According to Hendrix, the LADDER system uses a semantic grammar to parse questions to query a distributed database. Although semantic grammars helped to implement systems with impressive characteristics, the resulting systems proved difficult to port to different application domains. Indeed, a different grammar had to be developed whenever LADDER was configured for a new application. The system uses the semantic grammars technique, which interleaves syntactic and semantic processing. Question answering is done by parsing the input and mapping the parse tree to a database query. LADDER was implemented in LISP. At the time of its creation, it was able to process a database equivalent to a relational database with 14 tables and 100 attributes.
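A toy illustration of the semantic-grammar idea, in which the categories are domain concepts (ships, attributes) rather than syntactic ones (NP, VP). The ship names, attribute list, and rule below are invented, not LADDER's actual grammar:

```python
import re

# Domain categories stand in for non-terminals like <SHIP> and <ATTRIBUTE>.
SHIPS = {"kennedy", "enterprise"}
ATTRS = {"length", "speed"}

# One "semantic grammar" rule: syntax and domain semantics interleaved.
RULE = re.compile(r"what is the (\w+) of the (\w+)\??")

def parse(question: str):
    m = RULE.fullmatch(question.lower().strip())
    if m and m.group(1) in ATTRS and m.group(2) in SHIPS:
        return {"attribute": m.group(1), "ship": m.group(2)}
    return None  # syntactically or semantically outside the grammar

print(parse("What is the length of the Kennedy?"))
# {'attribute': 'length', 'ship': 'kennedy'}
```

The portability problem is visible even in this sketch: moving to a movie domain means rewriting both the categories and the rules.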
CHAT-80
This is one of the best-known NLDBIs of the early eighties. CHAT-80 was implemented entirely in Prolog. It transformed English questions into Prolog expressions, which were evaluated against the Prolog database. The code of CHAT-80 was circulated widely and formed the basis of several other experimental NLDBIs. The database of CHAT-80 consists of facts (i.e. oceans, major seas, major rivers and major cities) about 150 countries of the world, and a small English vocabulary that is enough for querying the database.
2.4 Commercially Available Systems
The SQL-Tutor program tutors students by assisting them through a number of database questions from four different databases. A student model is kept for each student based on query constraints (each constraint represents a part of the query that is necessary to answer the question). Each time a particular query constraint is used, SQL-Tutor records whether it was used successfully or unsuccessfully. In this way a model of a student's strengths and weaknesses is generated, and SQL-Tutor can select questions which reinforce problem areas or introduce new query concepts.
INTELLECT: It was from Trinzic (formed by the merger of AICorp and Aion).
2.5 Various Approaches Used for Development of NLDBI Systems
Natural language is a topic of interest from the computational viewpoint due to the implicit ambiguity that language possesses. Several researchers have applied different techniques to deal with language.
2.5.1 Symbolic Approach (Rule Based Approach)
Natural Language Processing appears to be a strongly symbolic activity. Words are symbols that stand for objects and concepts in the real world, and they are put together into sentences that obey well-specified grammar rules. Hence, for several decades Natural Language Processing research was dominated by the symbolic approach. Knowledge about language is explicitly encoded in rules or other forms of representation. Language is analysed at various levels to obtain information, and rules are then applied to this information to achieve linguistic functionality. As human language capabilities include rule-based reasoning, they are supported well by symbolic processing. In symbolic processing, rules are formed for every level of linguistic analysis, and the approach tries to capture the meaning of the language based on these rules.
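A minimal sketch of explicitly encoded symbolic rules; the rule table and category names are invented for illustration. Each rule is hand-written knowledge about the language, applied directly to the input:

```python
# Hand-written rules: interrogative word -> expected answer type.
RULES = [
    ("who",   "PERSON"),
    ("when",  "DATE"),
    ("which", "ENTITY"),
    ("what",  "ENTITY"),
]

def answer_type(question: str) -> str:
    """Classify a question by applying the symbolic rules in order."""
    first = question.lower().split()[0]
    for keyword, qtype in RULES:
        if first == keyword:
            return qtype
    return "UNKNOWN"

print(answer_type("Who directed Titanic?"))  # PERSON
print(answer_type("When was it released?"))  # DATE
```

The defining property is that every piece of linguistic knowledge is explicit and inspectable; nothing is learned from data.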
2.5.2 Empirical Approach (Corpus Based Approach)
Empirical approaches are based on statistical analysis, as well as other data-driven analysis, of raw data in the form of text corpora. A corpus is a collection of machine-readable text. The approach has been around since NLP began in the early 1950s, but only in the last 10 years or so has empirical NLP emerged as a major alternative to rationalist rule-based Natural Language Processing. Corpora are primarily used as a source of information about language, and a number of techniques have emerged to enable the analysis of corpus data. Syntactic analysis can be achieved on the basis of statistical probabilities estimated from a training corpus. Lexical ambiguities can be resolved by considering the likelihood of one or another interpretation on the basis of context. Recent research in computational linguistics indicates that empirical or corpus-based methods are currently the most promising approach for developing robust, efficient Natural Language Processing (NLP) systems. These methods automate the acquisition of much of the complex knowledge required for NLP by training on suitably annotated natural language corpora, e.g. tree-banks of parsed sentences. Most empirical NLP methods employ statistical techniques such as n-gram models, Hidden Markov Models (HMMs), and Probabilistic Context Free Grammars (PCFGs). Given the successes of empirical NLP methods, researchers have recently begun to apply learning methods to the construction of information extraction systems. Several different symbolic and statistical methods have been employed, but most of them are used to generate one part of a larger information extraction system. Researchers have also experimented with n-gram based language modeling and claimed to develop a language-independent approach to IR and Natural Language Processing.
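A small sketch of the empirical idea: bigram probabilities estimated from a toy corpus (the corpus is invented), using the maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1):

```python
from collections import Counter

# Toy training corpus; real systems train on large annotated corpora.
corpus = "who directed titanic who produced titanic who directed avatar".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w2: str, w1: str) -> float:
    """Maximum-likelihood bigram probability P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("directed", "who"))  # 2/3: "who" is followed by "directed" 2 of 3 times
```

Here nothing about English is encoded by hand; the model's knowledge is entirely whatever statistics the corpus yields, which is what makes the approach portable across languages and domains.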
2.5.3 Connectionist Approach (Using Neural Network)
Since human language capabilities are based on neural networks in the brain, Artificial Neural Networks (also called connectionist networks) provide an essential starting point for modelling language processing. In recent years the field of connectionist processing has seen remarkable development. The sub-symbolic neural network approach holds a lot of promise for modelling the cognitive foundations of language processing. Instead of symbols, the approach is based on distributed representations that correspond to statistical regularities in language. There has also been significant research applying neural-network methods to language processing. However, there has been relatively little recent language research using sub-symbolic learning, although some recent systems have successfully employed decision trees, transformation rules, and other symbolic methods. The SHRUTI system is a neurally inspired system for event modelling and temporal processing at a connectionist level.
Chapter 3 Proposed Work

3.1 Problem Definition

A Natural Language Interface to Database system provides an interface which helps the user query the database in his or her natural language. Query languages like SQL are very difficult for common people to use, and very hard to learn. So, to make database applications easy to use for people who don't know query languages, Natural Language Interfaces for various databases have been developed.
3.2 Features
1. The GUI is very similar to a chatting application, making the GUI familiar to the user and hence easy to use.
2. The system provides keyword-based suggestions to help with finding the appropriate question. The system filters the list of suggested questions based on the keywords entered.
3. Provides question analysis information, i.e. the goal of your question and the constraints applied.
4. Provides the SQL query generated during translation, which may help the user to learn or validate the queries.
5. The system also provides auto spell check to correct misspelled words.
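The keyword-based suggestion filter described in feature 2 can be sketched as follows; the question list and the matching rule are simplified assumptions, not the system's actual code:

```python
# A few of the system's question templates (abbreviated for illustration).
QUESTIONS = [
    "Who is the director of [movie name] ?",
    "Who is the producer of [movie name] ?",
    "What is the rating of [movie name] ?",
]

def suggest(typed: str):
    """Keep only the templates containing every keyword typed so far."""
    keywords = typed.lower().split()
    return [q for q in QUESTIONS
            if all(k in q.lower() for k in keywords)]

print(suggest("who director"))
# ['Who is the director of [movie name] ?']
```

As the user types more keywords the list narrows, which is how the interface steers the user toward a question the system is guaranteed to understand.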
3.3 Goals
The goal of NLP as stated above is to accomplish human-like language processing. The choice of the word "processing" is very deliberate, and should not be replaced with "understanding". Although the field of NLP was originally referred to as Natural Language Understanding (NLU) in the early days of AI, it is well agreed today that while the goal of NLP is true NLU, that goal has not yet been accomplished. A full NLU system would be able to:
1. Paraphrase an input text
2. Translate the text into another language
3. Answer questions about the contents of the text
4. Draw inferences from the text
While NLP has made serious inroads into accomplishing goals 1 to 3, the fact that NLP systems cannot, of themselves, draw inferences from text means that NLU still remains the goal of NLP. There are more practical goals for NLP, many related to the particular application for which it is being utilized. For example, an NLP-based IR system has the goal of providing more precise, complete information in response to a user's real information need. The goal of the NLP system here is to represent the true meaning and intent of the user's query, which can be expressed as naturally in everyday language as if the user were speaking to a reference librarian. Also, the contents of the documents being searched will be represented at all their levels of meaning, so that a true match between need and response can be found no matter how either is expressed in its surface form. The goal of natural language processing is thus to design and build computer systems that are able to analyse natural languages like German or English, and that generate their outputs in a natural language too. Typical applications of NLP are information retrieval, language understanding, and text classification.
3.4 Objectives
The objectives of the system are:
1. Provide a set of questions that provides answers to most of the users' queries.
2. Divide each question into two parts:
   (a) Goal: the data that the user wishes to find.
   (b) Condition: the constraints used to narrow down the results to find the appropriate data.
3. Compile the question, translating the goals and conditions into a syntactically correct and database-relevant query.
4. Retrieve the answers and present them in a well-formatted way.
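Step 3 above, compiling a goal and conditions into a query, can be sketched as follows. This is a simplified assumption, not the system's actual compiler; the column names follow the movie schema used later in this report:

```python
# Compile a goal (column to select) and conditions (column -> value
# constraints) into a parameterized SQL query.
def compile_query(goal: str, conditions: dict) -> tuple:
    where = " AND ".join(f"{col} = ?" for col in conditions)
    sql = f"SELECT {goal} FROM movie"
    if where:
        sql += f" WHERE {where}"
    return sql, tuple(conditions.values())

sql, params = compile_query("rating", {"movie_name": "Titanic"})
print(sql)     # SELECT rating FROM movie WHERE movie_name = ?
print(params)  # ('Titanic',)
```

Keeping goal and conditions separate, as objective 2 requires, is what makes this compilation step a simple, mechanical mapping.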
3.5 Constraints
The constraints of the system are:
1. It uses a corpus-based approach; hence the system is not capable of answering any question outside the corpus.
2. The natural-language interface is designed for a specific database. Hence any change in the structure of the database will need a lot of time to translate into the interface.
3. The system answers one question at a time, i.e. the question must contain only one goal.
Chapter 4 Project Design

4.2 Hardware Tools
4.3 System Architecture
4.4 UML Diagrams

4.4.1 Use Case Diagram
4.4.2 Sequence Diagram
4.4.3 Activity Diagram
4.5

4.5.1 Module Tree
4.5.2 Database Design
Movie Table
This table contains a list of movie names and other details related to the movies, like the genre of the movie, its rating, budget, language and release date. As it is the main table, it contains references to the actor, director and producer tables, using actor id, director id and producer id as foreign keys. This is done to normalize the database to 3NF.
Movie
Parameter name    Parameter Type
movie name        varchar(50)
director id       int(11)
producer id       int(11)
actor id          int(11)
genre             varchar(50)
rating            int(11)
budget            varchar(50)
language          varchar(50)
release date      varchar(50)
Actor Table
This table contains a list of all the actors that have acted in the movies stored in the Movie Table. The primary key of this table is actor id.
Actor
Parameter name    Parameter Type
actor id          int(11)
actor name        varchar(50)
gender            varchar(10)
Director Table
This table contains a list of all the directors that have directed the movies stored in the Movie Table. The primary key in this table is the director id.
Director
Parameter name    Parameter Type
director id       int(11)
director name     varchar(50)
Producer Table
This table contains a list of all the producers that have produced the movies stored in the Movie Table. The primary key of this table is producer id.
Producer
Parameter name    Parameter Type
producer id       int(11)
producer name     varchar(50)
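The four tables above can be expressed as DDL. The sketch below uses SQLite types in place of MySQL's int(11)/varchar(50) and is illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actor    (actor_id INTEGER PRIMARY KEY, actor_name TEXT, gender TEXT);
CREATE TABLE director (director_id INTEGER PRIMARY KEY, director_name TEXT);
CREATE TABLE producer (producer_id INTEGER PRIMARY KEY, producer_name TEXT);
-- the main table references the other three, as described above (3NF)
CREATE TABLE movie (
    movie_name   TEXT,
    director_id  INTEGER REFERENCES director(director_id),
    producer_id  INTEGER REFERENCES producer(producer_id),
    actor_id     INTEGER REFERENCES actor(actor_id),
    genre        TEXT,
    rating       INTEGER,
    budget       TEXT,
    language     TEXT,
    release_date TEXT
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['actor', 'director', 'movie', 'producer']
```

The foreign keys from movie to the three name tables are what force the generated SQL for "who" questions to join across tables.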
4.6 Screen-shots
Figure 4.6: NLDBI screen-shot interfaced with movie database and designed to answer questions related to movies
Chapter 5 Test Design & Report

5.1 Test Design for NLDBI

The tests have been divided into 2 parts:
1. Input/Output testing
   (a) Test for wrong questions
   (b) Test for irrelevant input
   (c) Test for correct questions
   (d) Test for no input
   (e) Test for data not present
2. Suggestion list testing
   (a) Test for no keywords
   (b) Test for relevant keywords
   (c) Test for irrelevant keywords
Test Description

Input/Output testing: This test is designed to test how the system responds to the data that is given as input. The main output, SQL query, goal and conditions generated by the system are considered for testing.

Suggestion list testing: This test is designed to test whether the suggestion utility works as intended, as this utility plays an important role in making the system easy to use and is a key feature of the system.
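The input/output tests can be sketched as a small table-driven harness; the respond function and its data are invented for illustration and stand in for the real system:

```python
# Stand-in for the real system's question-answering entry point.
def respond(question: str) -> str:
    known = {"who directed titanic ?": "James Cameron"}
    q = question.lower().strip()
    if not q:
        return ""  # null input: the system should not react
    return known.get(q, "I didn't understand your question")

# Each case pairs an input with the expected response, mirroring the
# Expected output / Actual output columns of the test report.
cases = [
    ("Who directed Titanic ?", "James Cameron"),         # correct question
    ("blah blah", "I didn't understand your question"),  # irrelevant input
    ("", ""),                                            # no / null input
]
for question, expected in cases:
    assert respond(question) == expected
print("all input/output tests passed")
```

A table-driven harness like this makes adding a new test category (e.g. "data not present") a one-line change.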
5.2 Test Report

The test report contains screenshots and the description of each test with the observed output. The report is proof that the system has passed all the intended tests under the specified constraints.
5.2.1 Input/Output testing
Description: Designed to test the system for wrong questions, or questions out of the domain.
Expected output: The system prompts the user that the query asked cannot be answered.
Actual output: System prompts "I didn't understand your question".
Expected output: The system should not generate any goal.
Actual output: The system does not generate any goal.
Expected output: The system should not generate any condition.
Actual output: The system does not generate any condition.
Expected output: The system should not generate any query.
Actual output: The system does not generate any query.
Passed: Yes
Description: Designed to test the system for irrelevant input.
Expected output: The system prompts the user that the query asked cannot be answered.
Actual output: System prompts "I didn't understand your question".
Expected output: The system should not generate any goal.
Actual output: The system does not generate any goal.
Expected output: The system should not generate any condition.
Actual output: The system does not generate any condition.
Expected output: The system should not generate any query.
Actual output: The system does not generate any query.
Passed: Yes
Description: Designed to test the system for correct questions that are within its answering domain.
Expected output: The system answers the question with correct data.
Actual output: System answers the question with the right data.
Expected output: The system should generate the correct goal.
Actual output: The system generates the right goal.
Expected output: The system should generate the correct conditions.
Actual output: The system generates the right conditions.
Expected output: The system should generate the correct query.
Actual output: The system generates the right query.
Passed: Yes
Description: Designed to test the system for no or null input.
Expected output: The system should not react to null input.
Actual output: System doesn't respond.
Expected output: The system should not generate any goal.
Actual output: The system does not generate any goal.
Expected output: The system should not generate any condition.
Actual output: The system does not generate any condition.
Expected output: The system should not generate any query.
Actual output: The system does not generate any query.
Passed: Yes
Description: Designed to test the system when it does not have the data to answer the question.
Expected output: The system prompts the user that the data is not present to answer the question.
Actual output: System prompts "Data Not Found".
Expected output: The system should generate the correct goal.
Actual output: The system generates the right goal.
Expected output: The system should generate the correct conditions.
Actual output: The system generates the right conditions.
Expected output: The system should generate the correct query.
Actual output: The system generates the right query.
Passed: Yes
5.2.2 Suggestion list testing

1. Test for no keywords
Description: Designed to test the suggestion utility for no keywords.
Expected output: The system should not display the suggestion list.
Actual output: System reacts as expected by not displaying the suggestion list.
Passed: Yes

Table 5.6: Test report for no keywords

2. Test for relevant keywords
Figure 5.7: Systems suggestion list behaviour for relevant keyword inputs
Expected output: The system should display the suggestion list with helpful suggestions.
Actual output: System displays the suggestions correctly.
Passed: Yes

Table 5.7: Test report for relevant keywords

3. Test for irrelevant keywords
Expected output: The system should not display the suggestion list.
Actual output: System reacts as expected by not displaying the suggestion list.
Passed: Yes
Chapter 6 Implementation and Coding

6.1 List of questions answered by the system

6.1.1 Who questions
1. Who is the director of [movie name] ?
2. Who is the producer of [movie name] ?
3. Who is the actor of [movie name] ?
4. Who is the director of the movie [movie name] ?
5. Who is the producer of the movie [movie name] ?
6. Who is the actor of the movie [movie name] ?
7. Who directed [movie name] ?
8. Who produced [movie name] ?
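The eight templates above collapse into a few pattern classes. A sketch of how one such class might be matched and mapped to SQL; the regexes and the SQL here are illustrative assumptions, not the system's actual LEX/YACC grammar (the column and table names follow this report's schema):

```python
import re

# Each rule covers a whole class of "Who" templates via optional phrases.
WHO_RULES = [
    (re.compile(r"who (?:is the )?direct(?:or of|ed)(?: the movie)? (.+?) ?\?"),
     "SELECT director_name FROM director NATURAL JOIN movie WHERE movie_name = ?"),
    (re.compile(r"who (?:is the )?produc(?:er of|ed)(?: the movie)? (.+?) ?\?"),
     "SELECT producer_name FROM producer NATURAL JOIN movie WHERE movie_name = ?"),
]

def who_to_sql(question: str):
    q = question.lower()
    for pattern, sql in WHO_RULES:
        m = pattern.fullmatch(q)
        if m:
            return sql, (m.group(1),)
    return None  # not a recognized "who" question

sql, params = who_to_sql("Who directed Titanic ?")
print(params)  # ('titanic',)
```

One regex per class keeps templates 1, 4 and 7 (and likewise 2, 5 and 8) in a single rule, which is the same factoring the grammar in Section 6.3 performs with productions.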
6.1.2 What questions
1. What is the genre of [movie name] ?
2. What is the rating of [movie name] ?
3. What is the cast of [movie name] ?
4. What is the release date of [movie name] ?
5. What is the language of [movie name] ?
6. What is the duration of [movie name] ?
7. What is the budget of [movie name] ?
8. What are the movies directed by [director name] ?
9. What are the movies produced by [producer name] ?
10. What are the movies acted by [actor name] ?
11. What are the movies directed by [director name] and produced by [producer name] ?
12. What are the movies produced by [producer name] and directed by [director name] ?
13. What are the movies directed by [director name] and acted by [actor name] ?
14. What are the movies acted by [actor name] and directed by [director name] ?
15. What are the movies acted by [actor name] and produced by [producer name] ?
16. What are the movies produced by [producer name] and acted by [actor name] ?
17. What are the movies directed by [director name] and produced by [producer name] and acted by [actor name] ?
18. What are the movies directed by [director name] and acted by [actor name] and produced by [producer name] ?
19. What are the movies produced by [producer name] and directed by [director name] and acted by [actor name] ?
20. What are the movies produced by [producer name] and acted by [actor name] and directed by [director name] ?
21. What are the movies acted by [actor name] and produced by [producer name] and directed by [director name] ?
22. What are the movies acted by [actor name] and directed by [director name] and produced by [producer name]?
23. What are the movies directed by [director name] in [year]?
24. What are the movies produced by [producer name] in [year]?
25. What are the movies acted by [actor name] in [year]?
26. What are the movies directed by [director name] and produced by [producer name] in [year]?
27. What are the movies produced by [producer name] and directed by [director name] in [year]?
28. What are the movies directed by [director name] and acted by [actor name] in [year]?
29. What are the movies acted by [actor name] and directed by [director name] in [year]?
30. What are the movies acted by [actor name] and produced by [producer name] in [year]?
31. What are the movies produced by [producer name] and acted by [actor name] in [year]?
32. What are the movies directed by [director name] and produced by [producer name] and acted by [actor name] in [year]?
33. What are the movies directed by [director name] and acted by [actor name] and produced by [producer name] in [year]?
34. What are the movies produced by [producer name] and directed by [director name] and acted by [actor name] in [year]?
35. What are the movies produced by [producer name] and acted by [actor name] and directed by [director name] in [year]?
36. What are the movies acted by [actor name] and produced by [producer name] and directed by [director name] in [year]?
37. What are the movies acted by [actor name] and directed by [director name] and produced by [producer name] in [year]?
38. What are the [genre] movies released in [year]?
39. What are the [genre] movies directed by [director name]?
40. What are the [genre] movies produced by [producer name]?
41. What are the [genre] movies acted by [actor name]?
42. What are the [genre] movies directed by [director name] and produced by [producer name]?
43. What are the [genre] movies produced by [producer name] and directed by [director name]?
44. What are the [genre] movies directed by [director name] and acted by [actor name]?
45. What are the [genre] movies acted by [actor name] and directed by [director name]?
46. What are the [genre] movies acted by [actor name] and produced by [producer name]?
47. What are the [genre] movies produced by [producer name] and acted by [actor name]?
48. What are the [genre] movies directed by [director name] and produced by [producer name] and acted by [actor name]?
49. What are the [genre] movies directed by [director name] and acted by [actor name] and produced by [producer name]?
50. What are the [genre] movies produced by [producer name] and directed by [director name] and acted by [actor name]?
51. What are the [genre] movies produced by [producer name] and acted by [actor name] and directed by [director name]?
52. What are the [genre] movies acted by [actor name] and produced by [producer name] and directed by [director name]?
53. What are the [genre] movies acted by [actor name] and directed by [director name] and produced by [producer name]?
54. What are the [genre] movies directed by [director name] in [year]?
55. What are the [genre] movies produced by [producer name] in [year]?
56. What are the [genre] movies acted by [actor name] in [year]?
57. What are the [genre] movies directed by [director name] and produced by [producer name] in [year]?
58. What are the [genre] movies produced by [producer name] and directed by [director name] in [year]?
59. What are the [genre] movies directed by [director name] and acted by [actor name] in [year]?
60. What are the [genre] movies acted by [actor name] and directed by [director name] in [year]?
61. What are the [genre] movies acted by [actor name] and produced by [producer name] in [year]?
62. What are the [genre] movies produced by [producer name] and acted by [actor name] in [year]?
63. What are the [genre] movies directed by [director name] and produced by [producer name] and acted by [actor name] in [year]?
64. What are the [genre] movies directed by [director name] and acted by [actor name] and produced by [producer name] in [year]?
65. What are the [genre] movies produced by [producer name] and directed by [director name] and acted by [actor name] in [year]?
66. What are the [genre] movies produced by [producer name] and acted by [actor name] and directed by [director name] in [year]?
67. What are the [genre] movies acted by [actor name] and produced by [producer name] and directed by [director name] in [year]?
68. What are the [genre] movies acted by [actor name] and directed by [director name] and produced by [producer name] in [year]?
6.1.3
Which questions
1. Which [genre] movies released in [year]?
2. Which movies are directed by [director name]?
3. Which movies are produced by [producer name]?
4. Which movies are acted by [actor name]?
5. Which movies are directed by [director name] and produced by [producer name]?
6. Which movies are produced by [producer name] and directed by [director name]?
7. Which movies are directed by [director name] and acted by [actor name]?
8. Which movies are acted by [actor name] and directed by [director name]?
9. Which movies are acted by [actor name] and produced by [producer name]?
10. Which movies are produced by [producer name] and acted by [actor name]?
11. Which movies are directed by [director name] and produced by [producer name] and acted by [actor name]?
12. Which movies are directed by [director name] and acted by [actor name] and produced by [producer name]?
13. Which movies are produced by [producer name] and directed by [director name] and acted by [actor name]?
14. Which movies are produced by [producer name] and acted by [actor name] and directed by [director name]?
15. Which movies are acted by [actor name] and produced by [producer name] and directed by [director name]?
16. Which movies are acted by [actor name] and directed by [director name] and produced by [producer name]?
17. Which movies are directed by [director name] in [year]?
18. Which movies are produced by [producer name] in [year]?
19. Which movies are acted by [actor name] in [year]?
20. Which movies are directed by [director name] and produced by [producer name] in [year]?
21. Which movies are produced by [producer name] and directed by [director name] in [year]?
22. Which movies are directed by [director name] and acted by [actor name] in [year]?
23. Which movies are acted by [actor name] and directed by [director name] in [year]?
24. Which movies are acted by [actor name] and produced by [producer name] in [year]?
25. Which movies are produced by [producer name] and acted by [actor name] in [year]?
26. Which movies are directed by [director name] and produced by [producer name] and acted by [actor name] in [year]?
27. Which movies are directed by [director name] and acted by [actor name] and produced by [producer name] in [year]?
28. Which movies are produced by [producer name] and directed by [director name] and acted by [actor name] in [year]?
29. Which movies are produced by [producer name] and acted by [actor name] and directed by [director name] in [year]?
30. Which movies are acted by [actor name] and produced by [producer name] and directed by [director name] in [year]?
31. Which movies are acted by [actor name] and directed by [director name] and produced by [producer name] in [year]?
32. Which [genre] movies are released in [year]?
33. Which [genre] movies are directed by [director name]?
34. Which [genre] movies are produced by [producer name]?
35. Which [genre] movies are acted by [actor name]?
36. Which [genre] movies are directed by [director name] and produced by [producer name]?
37. Which [genre] movies are produced by [producer name] and directed by [director name]?
38. Which [genre] movies are directed by [director name] and acted by [actor name]?
39. Which [genre] movies are acted by [actor name] and directed by [director name]?
40. Which [genre] movies are acted by [actor name] and produced by [producer name]?
41. Which [genre] movies are produced by [producer name] and acted by [actor name]?
42. Which [genre] movies are directed by [director name] and produced by [producer name] and acted by [actor name]?
43. Which [genre] movies are directed by [director name] and acted by [actor name] and produced by [producer name]?
44. Which [genre] movies are produced by [producer name] and directed by [director name] and acted by [actor name]?
45. Which [genre] movies are produced by [producer name] and acted by [actor name] and directed by [director name]?
46. Which [genre] movies are acted by [actor name] and produced by [producer name] and directed by [director name]?
47. Which [genre] movies are acted by [actor name] and directed by [director name] and produced by [producer name]?
48. Which [genre] movies are directed by [director name] in [year]?
49. Which [genre] movies are produced by [producer name] in [year]?
50. Which [genre] movies are acted by [actor name] in [year]?
51. Which [genre] movies are directed by [director name] and produced by [producer name] in [year]?
52. Which [genre] movies are produced by [producer name] and directed by [director name] in [year]?
53. Which [genre] movies are directed by [director name] and acted by [actor name] in [year]?
54. Which [genre] movies are acted by [actor name] and directed by [director name] in [year]?
55. Which [genre] movies are acted by [actor name] and produced by [producer name] in [year]?
56. Which [genre] movies are produced by [producer name] and acted by [actor name] in [year]?
57. Which [genre] movies are directed by [director name] and produced by [producer name] and acted by [actor name] in [year]?
58. Which [genre] movies are directed by [director name] and acted by [actor name] and produced by [producer name] in [year]?
59. Which [genre] movies are produced by [producer name] and directed by [director name] and acted by [actor name] in [year]?
60. Which [genre] movies are produced by [producer name] and acted by [actor name] and directed by [director name] in [year]?
61. Which [genre] movies are acted by [actor name] and produced by [producer name] and directed by [director name] in [year]?
62. Which [genre] movies are acted by [actor name] and directed by [director name] and produced by [producer name] in [year]?
6.1.4
When questions
6.2
Parse Trees
These parse trees are generated by capturing the commonalities in the list of questions.
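The parse-tree figures themselves are not reproduced in this text version. As an illustrative sketch only (the derivation below follows one plausible reading of the grammar in Section 6.3.2 and is not copied from the figures), a question such as "what is the director of [movie name]?" would parse roughly as:

```
START
 ├─ SBAR
 │   ├─ WHNP ─ WP ("what")
 │   └─ SBARQ ─ SQ
 │        ├─ VBZ ("is")
 │        ├─ NPZ ─ DT ("the")  NNtemp ─ NN ("director")
 │        └─ PP ─ IN ("of")  NPZ ─ NNPtemp ─ NNP (movie name)
 └─ QS ("?")
```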
6.3
Context Free Grammar
The context-free grammar is the heart of the compiler: it translates the natural language question into an SQL query through lexical analysis, syntax analysis and semantic analysis.
6.3.1
LEX specification:
The lex specification defines all the tokens used in the compiler. Each word entered by the user first goes through lexical analysis. In this step the compiler classifies the words according to the regular expressions given below and then passes the resulting tokens to the other phases of the compiler.
[?]                     { yylval.s = strdup(yytext); return QS; }
directed|acted|produced { yylval.s = strdup(yytext); return VBD; }
released                { yylval.s = strdup(yytext); return VBN; }
who|what                { yylval.s = strdup(yytext); return WP; }
is                      { yylval.s = strdup(yytext); return VBZ; }
the                     { yylval.s = strdup(yytext); return DT; }
horror|comedy|action|drama|suspense|thriller|scifi { yylval.s = strdup(yytext); return JJ; }
and                     { yylval.s = strdup(yytext); return CC; }
which                   { yylval.s = strdup(yytext); return WDT; }
when                    { yylval.s = strdup(yytext); return WRB; }
did                     { yylval.s = strdup(yytext); return VBDX; }
director|producer|actor|actress|movie|releasedate|genre|rating|cast|language|duration|budget|movies|year { yylval.s = strdup(yytext); return NN; }
in                      { yylval.s = strdup(yytext); return IN; }
[0-9]+                  { if (atoi(yytext) > 1950 && atoi(yytext) < 2012) { yylval.s = strdup(yytext); return CD; } }
[a-zA-Z]+               { yylval.s = strdup(yytext); return NNP; }
%%
int yywrap() { return 1; }
6.3.2
YACC Specification:
%{
#include <stdio.h>
#include <string.h>

int mflag = 0, i = 0;
char goal[5][100];
char condition[5][100];
char test[5][100];
int gp = 0, cp = 0;

int yylex(void);
int yyerror(char *s1);
void addcon(char x[100]);
void addgoal(char x[100]);
void addtest(char x[100]);
%}

%union { int y; char *s; }

%token <s> QS
%token <s> VBD
%token <s> WP
%token <s> NN
%token <s> NNP
%token <s> VBZ
%token <s> DT
%token <s> IN
%token <s> VBP
%token <s> CD
%token <s> CC
%token <s> VBN
%token <s> JJ
%token <s> WDT
%token <s> VBDX
%token <s> VB
%token <s> WRB

%%
START : SBAR QS {
        /* print the goal-condition table collected during parsing */
        printf("=============goal==============\n");
        for (i = 0; i < gp; i++)
            printf("%s\n", goal[i]);
        printf("=============conn==============\n");
        for (i = 0; i < cp; i++)
            printf("%s =\t%s\n", condition[i], test[i]);
    };

SBAR : WHNP SBARQ | WHADVP SQ ;
SBARQ : S | SQ ;
S : NPW VPJJ | NPW VBP VP ;
NPW : JJtemp NNtemp ;
WHADVP : WRB { addgoal("releasedate"); addcon("moviename"); } ;
WHNP : WP | WDT ;
SQ : VBZ NPZ PP | VBP NPX | VBDX NNPtemp VB | VBDtemp NPZ ;
NPX : NPZ VP | NPJJ VPJJ | NPJJ VP ;
NPZ : DT NNtemp | DT NNtemp NNPtemp | NNPtemp | NNtemp
    | CD { addcon("releasedate"); addtest($1); } ;
NPJJ : DT JJtemp NNtemp ;
VP : VP CC VPX | VPX ;
VPX : VBDtemp PP PPX ;
VPJJ : VBN PP ;
PP : IN NPZ ;
PPX : /* empty */ | PP ;
JJtemp : JJ { addcon("genre"); addtest($1); } ;
VBDtemp : VBD {
        if (mflag == 0) {
            /* before "movies" is seen, the verb names the goal column */
            if (strcmp($1, "directed") == 0) addgoal("director");
            if (strcmp($1, "produced") == 0) addgoal("producer");
            if (strcmp($1, "acted") == 0) addgoal("actor");
            addcon("moviename");
        } else {
            /* after "movies", the verb names a condition column */
            if (strcmp($1, "directed") == 0) addcon("director");
            if (strcmp($1, "produced") == 0) addcon("producer");
            if (strcmp($1, "acted") == 0) addcon("actor");
        }
    };
NNtemp : NN {
        if (strcmp($1, "movie") == 0)
            ; /* "movie" itself contributes nothing */
        else if (strcmp($1, "year") == 0) { addgoal("releasedate"); addcon("moviename"); }
        else if (strcmp($1, "movies") == 0) { mflag = 1; addgoal("moviename"); }
        else { addgoal($1); addcon("moviename"); }
    };
NNPtemp : NNP { addtest($1); } ;
%%
int main() { yyparse(); return 0; }
int yyerror(char *s1) { printf("%s", s1); return 0; }
void addcon(char x[100]) { strcpy(condition[cp++], x); }
void addgoal(char x[100]) { strcpy(goal[gp++], x); }
void addtest(char x[100]) { strcpy(test[cp - 1], x); }
Schedule of work
8.1
Conclusion
The Natural Language Interface to Database system for movie enquiry accepts a query in natural language and translates it into an SQL query by means of a compiler tailored for this purpose. The SQL query is then executed on the movie database and the result is returned to the user. The major concepts in this system are:
1. Tokenizing the input query and checking whether all the tokens are relevant.
2. Analysing the query, dividing it into goals and conditions, and using this information to form the appropriate SQL query.
3. Executing the SQL query on the database to retrieve the required information and presenting the result to the user.
8.2
Future Scope
Bibliography
[1] Avinash J. Agrawal and Dr. O. G. Kakde, Corpus Based Context Free Grammar Design for Natural Language Interface to Database, ISSN (Online): 1694-0814.
[2] Aho, Ullman, Principles of Compiler Design, Narosa Publications, 2002.
[3] Androutsopoulos I., Ritchie G.D., and Thanisch P., Natural Language Interfaces to Databases: An Introduction, Journal of Natural Language Engineering, Part 1, 1995, pp. 29-81.
[4] The Stanford Parser (statistical parser), http://nlp.stanford.edu/software/lexparser.
[5] Ruwanpura, S. (2000), SQ-HAL: Natural Language to SQL Translator, Monash University, http://www.csse.monash.edu.au/hons/projects/2000/Supun.Ruwanpura.
[6] Elaine Rich and Kevin Knight, Artificial Intelligence, McGraw Hill Companies Inc., 2006, Chapter 15, pp. 377-426.
[7] Sleator, D., Temperley, D. (1991), Parsing English with a Link Grammar, Carnegie Mellon University Computer Science Technical Report, October 1991.
[8] A. El-Mouadib, Zakaria Suliman Zubi, Ahmed A. Almagrous, I. El-Feghi, Interactive Natural Language Interface, WSEAS Transactions on Computers, issue 4, volume 8, 2009, pp. 5-9.
[9] Tanveer Siddiqui, U.S. Tiwary, Natural Language Processing and Information Retrieval, Oxford University Press, 2008.
[10] Suh, K.S., Perkins, W.C., The effects of a system echo in a restricted natural language database interface for novice users, IEEE System Sciences, 4, pp. 594-599, 1994.