Behera PDF
RELATIONS
). The parser takes care of both external and internal sandhi in Sanskrit words.

1 INTRODUCTION

Parsing is the "de-linearization" of linguistic input; that is, the use of grammatical rules and other knowledge sources to determine the functions of words in the input sentence. Getting an efficient and unambiguous parse of natural languages has been a subject of wide interest in the field of artificial intelligence over the past 50 years. Instead of providing a substantial amount of information manually, there has been a shift towards using machine learning algorithms in every possible NLP task. Among the most important elements in this toolkit are state machines, formal rule systems, logic, as well as probability theory and [...]

[...] literature (Huet, 2005) with some computational toolkits (Huet, 2002), and there is work going on towards developing a mathematical model and dependency grammar of Sanskrit (Huet, 2006), the proposed Sanskrit parser is being developed for using the Sanskrit language as an Indian networking language (INL). The utility of advanced techniques such as stochastic parsing and machine learning in designing a Sanskrit parser needs to be verified.

We have used deterministic finite automata for morphological analysis. We have identified the basic linguistic framework which shall facilitate the effective emergence of Sanskrit as INL. To achieve this goal, a computational grammar has been developed for the processing of the Sanskrit language. Sanskrit has a rich system of inflectional endings (vibhakti). The computational grammar described here takes the concepts of vibhakti and karaka relations from the Paninian framework and uses them to get an efficient parse for Sanskrit text. The grammar is written in the 'utsarga apavaada' approach, i.e. rules are arranged in several layers, each layer forming the exceptions to the previous one. We are working towards encoding Paninian grammar to get a robust analysis of Sanskrit sentences. The Paninian framework has been successfully applied to Indian languages for dependency grammars (Sangal, 1993), where constraint-based parsing is used and the mapping between karaka and vibhakti is via a TAM (tense, aspect, modality) table. We have made rules from Panini's grammar for the mapping. Also, finite state automata are used for the analysis instead of finite state transducers. The problem is that the Paninian grammar is generative, and it is not straightforward to invert the grammar to get a Sanskrit analyzer, i.e. it is difficult to rely on Panini's sutras alone to build the analyzer. There will be a lot of ambiguities (due to options given in the sutras, as well as a single word having multiple analyses). We therefore need a hybrid scheme which uses some statistical methods for the analysis of a sentence. A probabilistic approach is currently not integrated within the parser since we do not have a Sanskrit corpus to work with, but we hope that in the very near future we will be able to apply statistical methods.

The paper is arranged as follows. Section 2 explains in a nutshell the computational processing of any Sanskrit corpus. We have codified the nominal and verb forms in Sanskrit in a directly computable form. Our algorithm for processing these texts and preparing Sanskrit lexicon databases is presented in Section 3. The complete parser is described in Section 4, where we discuss how we perform morphological analysis and, from it, relation analysis. Results are enumerated in Section 5. Discussion, conclusions and future work follow in Section 6.

2 A STANDARD METHOD FOR ANALYZING SANSKRIT TEXT

The basic framework for analyzing the Sanskrit corpus is discussed in this section. For every word in a given sentence, the machine is supposed to identify the word in the following structure: <Word> <Base> <Form> <Relation>.

The structure contains the root word (<Base>), its form (<attributes of word>) and its relation with the verb/action or subject of that sentence. This analysis is done so as to completely disambiguate the meaning of the word in context.

2.1 <Word>

Given a sentence, the parser identifies a single word and processes it using the guidelines laid out in this section. If it is a compound word, then the compound word with sandhi has to be undone. For example: […] = […] + […].

2.2 <Base>

The base is the original, uninflected form of the word. Finite verb forms, other simple words and compound words are each indicated differently. For simple words: the computer activates the DFA on the ISCII code (ISCII, 1999) of the Sanskrit text. For compound words: the computer shows the nesting of internal and external sandhi using nested parentheses, undoing the sandhi changes between the component words.

2.3 <Form>

The <Form> of a word contains the information regarding declensions for nominals and state for verbs.

• For undeclined words, just write u in this column.

• For nouns, write first m, f or n to indicate the gender, followed by a number for the case (1 through 7, or 8 for vocative), and s, d or p to indicate singular, dual or plural.

• For adjectives and pronouns, write first a, followed by the indications, as for nouns, of gender (skipping this for pronouns unmarked for gender), case and number.

• For verbs, in one column indicate the class (gaṇa) and voice. Show the class by a number from 1 to 11. Follow this (in the same column) by '1' for parasmaipada, '2' for ātmanepada and '3' for ubhayapada. For finite verb forms, give the root. Then (in the same column) show the tense as given in Table 3. Then show the inflection in the same column, if there is one. For finite forms, show [...]
Table 1: Codes for <Form>
  pa/   passive
  ca/   causative
  de/   desiderative
  fr/   frequentative

Table 2: Codes for Finite Forms, showing the Person and the Number
  1     person […]
  2     person […]
  3     person […]
  s     singular
  d     dual
  p     plural

Table 3: Codes for Finite Verb Forms, showing the Tense
  pr    present
  if    imperfect
  iv    imperative
  op    optative
  ao    aorist
  pe    perfect
  fu    future
  f2    second future
  be    benedictive
  co    conditional

Table 4: Codes for <Relation>
  v     main verb
  vs    subordinate verb
  s     subject (of the sentence or a subordinate clause)
  o     object (of a verb or preposition)
  g     destination (gati) of a verb of motion
  a     adjective
  n     noun modifying another in apposition
  d     predicate nominative
  m     other modifier
  p     preposition
  c     conjunction
  u     vocative, with no syntactic connection
  q     quoted sentence or phrase
  r     definition of a word or phrase (in a commentary)
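
The nominal part of the <Form> coding in Section 2.3 can be decoded mechanically. The following is a minimal sketch, assuming that the gender, case and number codes are written without separators (for example m1s). The names decode_form and NominalForm are hypothetical, and the English case names follow the conventional order of the Sanskrit cases; none of this is taken from the parser itself.

```python
import re
from typing import NamedTuple, Optional

# Assumed lookup tables; the codes come from Section 2.3, the English names are illustrative.
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter"}
NUMBER = {"s": "singular", "d": "dual", "p": "plural"}
CASE = {1: "nominative", 2: "accusative", 3: "instrumental", 4: "dative",
        5: "ablative", 6: "genitive", 7: "locative", 8: "vocative"}


class NominalForm(NamedTuple):
    kind: str
    gender: Optional[str]
    case: Optional[str]
    number: Optional[str]


# Optional leading 'a' (adjective/pronoun), optional gender, case digit, number letter.
_FORM = re.compile(r"^(a)?([mfn])?([1-8])([sdp])$")


def decode_form(code: str) -> NominalForm:
    """Decode a <Form> code for an undeclined word, noun, adjective or pronoun."""
    if code == "u":
        return NominalForm("undeclined", None, None, None)
    match = _FORM.match(code)
    if match is None:
        raise ValueError(f"not a nominal <Form> code: {code!r}")
    is_adjective, gender, case, number = match.groups()
    if not is_adjective and gender is None:
        # Only pronouns unmarked for gender may skip the gender letter.
        raise ValueError(f"noun code without gender: {code!r}")
    return NominalForm(
        kind="adjective/pronoun" if is_adjective else "noun",
        gender=GENDER.get(gender),
        case=CASE[int(case)],
        number=NUMBER[number],
    )
```

Under these assumptions, decode_form("m1s") describes a masculine noun in the nominative singular, and decode_form("a6p") a pronoun unmarked for gender in the genitive plural.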
[Tables 5 and 6 are not legible in the source.]

Let us illustrate this structure for the noun […]; nominative, singular declension: […]. This is encoded in the following syntax: (163{1, 1, 1, 1}). Here 163 is the ISCII code of the declension (Table 6). The four 1's in the curly brackets represent Class, Case, Gender and Number respectively (Table 5).

The meaning of the verb is said to be both vyapara (action, activity, cause) and phala (fruit, result, effect). Syntactically, its meaning is invariably linked with the meaning of the verb "to do". In our analysis of verbs, we have found that they are classified into 11 classes ([…], Table 7). While coding the endings, each class is subdivided according to "[…]" knowledge, […], and […]; each of which is again sub-classified into 3 […], which we have denoted as pada. Each verb sub-class again has 10 lakaaras, which are used to express the tense of the action. Further, depending upon the form of the sentence, a division of form into […], […] and […] has been made. This classification is referred to as voice. This structure is explained in Table 7.
Table 7: Attributes of the declension for verbs
  PRESENT, Singular ([…])
  Person   Endings   ISCII Code
  First    […]       219194

• δ((q_x, a), Y) = δ(q_Y, a) or δ(q_Y, b), with a, b ∈ {0, 1}
• q0 = <q0, 0>
Separate database files for nominals and verbs have been maintained, which can be populated as more and more Sanskrit corpora are mined for data. The Sanskrit rule base is prepared using the "Sanskrit Database Maker" developed during this work.

3.2 Deterministic Finite Automata: Sanskrit Rule Base

We have used deterministic finite automata (DFA) (Hopcroft, 2002) to compute the Sanskrit rule base, which we developed as described in Section 3.1. Before we explain the DFA, let us define it. A deterministic finite automaton consists of:

1. A finite set of states, often denoted Q.

2. A finite set of input symbols, often denoted Σ.

3. A transition function that takes as arguments a state and an input symbol and returns a state, commonly denoted δ.

4. A start state, one of the states in Q, denoted q0.

5. A set of final or accepting states F. The set F is a subset of Q.

Thus, we can define a DFA in this "five-tuple" notation: A = (Q, Σ, δ, q0, F). With this short discussion of the DFA, we shall proceed to the DFA structure for our Sanskrit Rule Base. Since we are representing any word by ISCII codes that range from 161 to 234, we have effectively 74 input symbols. In the notation given below, we represent the character set by {C0, C1, . . . , C73}, where Ci is the character corresponding to the ISCII code 161 + i.

In this work, we have built our DFA in matrix form, with each row representing the behavior of a particular state. A given row has 74 columns, and the entry in a particular column of that row stores the state we move to on receiving the input corresponding to that column. In addition, each row carries the information whether or not it is a final state.

For example, D[32][5] = 36 conveys that in the DFA matrix D[i][j], in the 32nd state, if the input is C5, we move to state no. 36. (Note that C5 is the character corresponding to the ISCII code 166.)

In the graph below, we give an example of how the DFA looks as a tree structure. This particular graph is constructed for the verb declensions for the class […]. The pada is […] and the tense is present tense. The search in this DFA proceeds as follows: if the first ending of the input corresponds to one of the states 163, 195 or 219, we move ahead in the DFA; otherwise the input is not found in this tree. On getting a match, the search continues in the matched branch.

In general, the search in the DFA is done as follows (we take the example of searching for […] in the DFA tree constructed above):

• Firstly, an input word is given as the input to the user interface in Devanagari format.

• The word is changed to its equivalent ISCII code (203212195163 in this case).

• The automaton reads the forms in the reverse order to lemmatize them. In our DFA, we [...] We can verify it from the graph given.

• The remaining part of the word is sent to the database engine of the program to verify it and to get its attributes. The word corresponding to the stem (the Devanagari equivalent of 203212, that is […]) will be sent to the database.

[Figure: DFA tree for the verb endings. Legible node labels: 219, 219194, 219194232, 219194232198, 219204, 194, 198, 204, 218, 232.]
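
The matrix form of the DFA and the reverse search described above can be sketched as follows. This is a minimal illustration, not the parser's actual code: the class MatrixDFA and the helpers column and split_codes are hypothetical names, the marker -1 for a missing transition is an assumption, and the two endings registered in the usage lines (195163, obtained by removing the stem 203212 from the example word, and 219194 from Table 7) are chosen only for demonstration.

```python
from typing import List

ISCII_MIN, ISCII_MAX = 161, 234
NUM_SYMBOLS = ISCII_MAX - ISCII_MIN + 1   # 74 columns, C0 .. C73
DEAD = -1                                 # assumed marker for "no transition"


def column(code: int) -> int:
    """Map an ISCII code to its column index i (character Ci)."""
    if not ISCII_MIN <= code <= ISCII_MAX:
        raise ValueError(f"ISCII code out of range: {code}")
    return code - ISCII_MIN


def split_codes(digits: str) -> List[int]:
    """Split a digit string such as '203212195163' into three-digit ISCII codes."""
    return [int(digits[i:i + 3]) for i in range(0, len(digits), 3)]


class MatrixDFA:
    """DFA stored as a matrix: rows[state][column] is the next state."""

    def __init__(self) -> None:
        self.rows = [[DEAD] * NUM_SYMBOLS]   # state 0 is the start state
        self.final = [False]                 # per-row "is final state" flag

    def add_ending(self, codes: List[int]) -> None:
        """Register an ending, stored last character first."""
        state = 0
        for code in reversed(codes):
            col = column(code)
            if self.rows[state][col] == DEAD:
                self.rows.append([DEAD] * NUM_SYMBOLS)
                self.final.append(False)
                self.rows[state][col] = len(self.rows) - 1
            state = self.rows[state][col]
        self.final[state] = True

    def longest_ending(self, codes: List[int]) -> int:
        """Read the word from its end; return the length of the longest accepted ending."""
        state, best = 0, 0
        for length, code in enumerate(reversed(codes), start=1):
            state = self.rows[state][column(code)]
            if state == DEAD:
                break
            if self.final[state]:
                best = length
        return best


dfa = MatrixDFA()
dfa.add_ending(split_codes("195163"))
dfa.add_ending(split_codes("219194"))

codes = split_codes("203212195163")
n = dfa.longest_ending(codes)
stem = codes[:len(codes) - n]   # the part that would be sent to the database
```

Storing the endings in reverse lets the automaton strip an ending by reading the word from its last character, which matches the reverse reading order described in the list above; the stem left over is what the database lookup would receive.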
• The junction […] is a candidate for […]. So the following breaks are made: 1. […] + […], 2. […] + […], 3. […] + […], 4. […] + […]. For each break, the left-hand word is first sent to the DFA, and only if it is a valid word is the right-hand word sent. In this case, the first solution turns out to be the correct one.

2. […]: In this case, the junction is […] or […]. The corresponding break-ups are:

• […]: ([…] or […]) + ([…] or […]).
• […]: ([…] or […]) + ([…] or […]).

The algorithm remains the same as in the previous case.

3. […]: In this case, the junction is […], […], […] or […]. The corresponding break-ups are:

• […]: ([…] or […]) + ([…] or […]).
• […]: ([…] or […]) + ([…] or […]).
• […]: ([…] or […]) + ([…] or […]).
• […]: ([…] or […]) + ([…] or […]).

The algorithm follows the same guidelines.

[…] in the sense of place of pronunciation. Also, there is a specific significance of the first, second, third etc. letter of a specific string. The following ruleset is made:

• Define string s1, with the first five entries of […] and the 6th entry as […]. Also, define s2, with the first five entries of […] and the 6th entry as […]. The rule says: the junction is a + halanta + c, and the break-up will be b + halanta and c, where a, c ∈ s1, b ∈ s2 and the positions of a and b are the same in the respective strings. For example, in the word […], the junction is […] + halanta + […]. The break-up will be […] + halanta and […]. Hence we get […] + […].

• Define string s1, with the first five entries of […] and the 6th entry as […]. Also, define s2, with the first five entries of […] and the 6th entry as […]. The rule says: the junction is a + halanta + c, and the break-up will be b + halanta and c, where a, c ∈ s1, b ∈ s2 and the positions of a and b are the same in the respective strings. For example, in the word […], the junction is […] + halanta + […]. […] is the third character
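
The break-up procedure for vowel junctions (generate every candidate break, send the left-hand word for validation first, and send the right-hand word only if the left one is valid) can be sketched as below. Because the Devanagari characters of the rules are not legible here, the sketch assumes a stand-in rule table in IAST transliteration based on the well-known guṇa sandhi (e from a/ā + i/ī, o from a/ā + u/ū). The names SANDHI_RULES, candidate_breaks and split_sandhi are hypothetical, and is_valid stands in for the DFA and database check.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

# Assumed illustrative table: junction vowel -> (possible left finals, possible right initials).
SANDHI_RULES: Dict[str, Tuple[Tuple[str, ...], Tuple[str, ...]]] = {
    "e": (("a", "ā"), ("i", "ī")),
    "o": (("a", "ā"), ("u", "ū")),
}


def candidate_breaks(word: str, pos: int) -> Iterator[Tuple[str, str]]:
    """Yield every (left, right) pair obtained by undoing the sandhi at index pos."""
    rule = SANDHI_RULES.get(word[pos])
    if rule is None:
        return
    left_finals, right_initials = rule
    for left_final in left_finals:
        for right_initial in right_initials:
            yield word[:pos] + left_final, right_initial + word[pos + 1:]


def split_sandhi(word: str, is_valid: Callable[[str], bool]) -> Optional[Tuple[str, str]]:
    """Return the first break whose left word and then right word are both valid."""
    for pos in range(1, len(word) - 1):
        for left, right in candidate_breaks(word, pos):
            if not is_valid(left):
                continue          # the right-hand word is checked only after the left passes
            if is_valid(right):
                return left, right
    return None


# Toy lexicon standing in for the DFA plus database lookup.
lexicon = {"sūrya", "udaya"}
result = split_sandhi("sūryodaya", lexicon.__contains__)
```

With this toy lexicon, the junction o in sūryodaya is undone into sūrya + udaya. A fuller treatment would return all valid breaks and leave the choice to later stages, since a junction may admit more than one analysis.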
[…] is determined as the verb. Similarly, if I say […], the subject […] is determined. […] is a kind of appositive expression to the inflectional ending of the verb […]. We have used this concept for analyzing nominal sentences. That is, the verb is determined from the subject. Mostly, only the forms of […] are used, and relations are defined with respect to that. Although the analysis done is not exhaustive, some rules have been built to deal with them. Most of the time, relations in a nominal sentence are indicated by pronouns, adjectives and genitives. For example, in the sentence […], there is […] of the verb […] in the sentence by the subject […]. Hence […] is related to the verb as subject. […] is a pronoun referring to […] and […] is an adjective referring to […]. Similarly, […]. In this sentence, […] is a pronoun referring to […] and […] is a genitive to […]. Here again, there will be […] of the verb and […] will be related to the verb as subject.
Figure 6: Semantic net representation of the sentence […] (legible edge labels: adjective).

[…] with a good enough lexicon so that we can work in the direction of […] in Sanskrit sentences. Also, we are working on giving all the rules of Panini the shape of multiple layers. In fact, many of the rules are unimplementable because they deal with intentions, desires etc. For that, we need to build an ontology schema. The sandhi analysis is not complete and some exceptional rules are not coded. Also, not all the