Annotating Urdu Corpus

The document discusses part-of-speech annotation and named entity recognition for Pakistani languages. It begins by motivating the need for annotated corpora to build natural language processing tools for tasks like machine translation and information extraction. It then covers topics like defining appropriate part-of-speech tagsets that balance granularity with practical annotation needs, and examples of named entity annotation schemes. The document also provides examples of syntactic representation formats like phrase structure trees and dependency graphs.

Tafseer Ahmed

DHA Suffa University, Karachi


Presentation Plan
 Motivation
 Part of Speech Annotation
 Annotation for Named Entity (NE)
   Practical task
 Dependency Annotation
   Practical task
 Other Annotations
Annotated Corpus – preview
Motivation
 Software tools for processing annotated corpora are available:
 Part of Speech Tagger
 Named Entity Recognizer
 Chunker/Shallow Parser
 Language Modelling (& Grammar Learning)

 The annotated corpus can be used in


 Statistical Machine Translation
 Information Extraction
Motivation
 Hence, tools are available to create software applications.
 However, the missing ingredient for Pakistani languages is an annotated corpus.
Part of speech
 A part of speech (also a word class, a lexical class, or a lexical category) is a linguistic category of words (or, more precisely, lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question.

John  saw   the      saw
Noun  Verb  Article  Noun
“Traditional” POS Tag set
 Noun
 Pronoun
 Adjective
 Verb
 Adverb
 Preposition
 Conjunction
 Interjection
Tag set sizes
 3 Tags: Arabic (“Tradition”): اسم، فعل، حرف
 8 Tags: English (“Tradition”)
 48 Tags: Penn Treebank Tagset
 282 Tags: Hardie’s Urdu Tagset


Granularity Problem
 Adjective:
   good, bad, اچھا، برا: adjective
   some, many, چند، کئی: quantifier
   first, second, پہلا، دوسرا: ordinal

 Verb:
   go, read, جا، پڑھ: main verb
   is, was, ہے، تھا: helping verb / auxiliary
   can, may, سک، چاہیے: modal verb
POS tagset for Computation
Sample Text - English
one/CD charge/NN of/IN filing/NN a/DT false/JJ
return/NN and/CC was/VBD fined/VBN $/$ 5,000/CD
and/CC sentenced/VBN to/TO 18/CD months/NNS
in/IN prison/NN ./.
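
The slash-delimited sample above is easy to read programmatically. The following is a minimal Python sketch, not part of the original slides: the function name `parse_tagged` is hypothetical, and it assumes every token has the form word/TAG with the tag after the last slash.

```python
def parse_tagged(text):
    """Return a list of (word, tag) pairs from slash-delimited text."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so tokens such as
        # "./." and "$/$" are handled correctly.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# The sample sentence from the slide (Penn Treebank tags).
sample = (
    "one/CD charge/NN of/IN filing/NN a/DT false/JJ "
    "return/NN and/CC was/VBD fined/VBN $/$ 5,000/CD "
    "and/CC sentenced/VBN to/TO 18/CD months/NNS "
    "in/IN prison/NN ./."
)

tagged = parse_tagged(sample)
print(tagged[:3])
```

Splitting on the last slash rather than the first keeps punctuation and currency tokens intact, which matters for news text.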
Urdu Tagset - Issues
 Granularity
 From Hardie (282 tags) to Hassan (42)

 Syntactic versus functional tag


 Noun
 Verb
Urdu Tagset - Issues
 Coarse-grained tags
   better for machine learning
   easy/fast annotation

 Fine-grained tags
   more information
Urdu Tagset - Issues
 Syntactic versus functional behavior
 Noun (NN)
‫میں نے پانی پیا‬
vs.
‫میں نے سبق یاد کیا‬
vs.
‫میں گھر کے اندر گیا‬
CLE Urdu Tagset
Sample Text - Urdu
CLE POS Tagged Data
Tagset for other languages
 Sindhi
 J A Mahar (2010)
 Mutee u Rahman (2012)
“Universal” Tagset
 Naseem et al., 2010
 Google (Petrov et al., 2012)

Used (and modified) by

 Nivre’s Universal Dependencies
 TweeboParser (CMU)
Google “Universal” Tagset
 NOUN (nouns): کتاب، کراچی
 VERB (verbs): پڑھتا، ہے، رہا
 ADJ (adjectives): اچھا، چند، پہلا
 ADV (adverbs): روزانہ، بہت، تقریباً
 PRON (pronouns): میں، تم، وہ
 DET (determiners and articles): وہ
 ADP (prepositions and postpositions): نے، اندر
 NUM (numerals): 2, دو
 CONJ (conjunctions): اور، یا، لیکن
 PRT (particles)
 ‘.’ (punctuation marks): -، ؟
 X (a catch-all for other categories)
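
A coarse tagset like this is typically used by mapping a language-specific tagset onto it. The Python sketch below is illustrative only: the dictionary covers just a handful of Penn Treebank tags, its entries are written from memory of the published English mapping and should be checked against it, and the names `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical.

```python
# Partial, illustrative mapping from Penn Treebank tags to the
# 12 universal tags (assumption: not the complete published table).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBN": "VERB",
    "JJ": "ADJ", "CD": "NUM", "DT": "DET",
    "IN": "ADP", "CC": "CONJ", "TO": "PRT",
    "$": ".", ".": ".",
}


def to_universal(tagged):
    """Replace each fine-grained tag with its universal tag."""
    # Tags missing from the table fall back to the catch-all X.
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged]


penn = [("was", "VBD"), ("fined", "VBN"), ("$", "$"), ("5,000", "CD")]
print(to_universal(penn))
```

The same approach would apply to an Urdu tagset: annotate with the finer tags, then derive the coarse ones with a lookup table.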
Nivre’s “Universal” Tagset
 ADJ: adjective
 ADP: adposition
 ADV: adverb
 AUX: auxiliary verb
 CONJ: coordinating conjunction
 DET: determiner
 INTJ: interjection
 NOUN: noun
 NUM: numeral
 PART: particle
 PRON: pronoun
 PROPN: proper noun
 PUNCT: punctuation
 SCONJ: subordinating conjunction
 SYM: symbol
 VERB: verb
 X: other
An Example (using Petrov Tagset)
ذہین لڑکی نے روزانہ ایک اچھی کتاب پڑھی

ذہین ADJ
لڑکی NOUN
نے ADP
روزانہ ADV
ایک NUM
اچھی ADJ
کتاب NOUN
پڑھی VERB


An Example – Noun Features
English: boy|NN boys|NNS

Form        Number    Example
Nominative  Singular  اچھا لڑکا آیا
Nominative  Plural    اچھے لڑکے آئے
Oblique     Singular  اچھے لڑکے نے کہا
Oblique     Plural    اچھے لڑکوں نے کہا
An Example – Verb Features
English: walk|VB walks|VBZ walked|VBD reading|VBG

Word      Number    Gender     Person  Form
چلتا ہے   Singular  Masculine  3       imperfective
چلتی ہے   Singular  Feminine   3       imperfective
چلتی ہیں  Plural    Feminine   3       imperfective
چلا تھا   Singular  Masculine  3       perfective
چلی تھی   Singular  Feminine   3       perfective
چلوں گا   Singular  -          1       subjunctive
چلنا      Singular  Masculine  -       infinitive
Beyond POS Tagging

Named Entity Recognition


Named Entities
سعید اجمل 14 اکتوبر 1977ء کو فیصل آباد میں پیدا ہوئے

 Person
 Organization
 Location

 Date
 Time
 Money
 Percent
 Misc
Inside Outside Beginning (IOB)
بل گیٹس مائیکروسافٹ کے بانی ہیں

Word         POS   IOB
بل           Noun  Person-B
گیٹس         Noun  Person-I
مائیکروسافٹ  Noun  Organization-B
کے           AdP   O
بانی         Noun  O
ہیں          Verb  O
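
Once a sentence is labelled this way, the entity spans can be recovered by grouping consecutive labels. The following Python sketch is an illustration, not code from the slides: `extract_entities` is a hypothetical name, and it assumes the suffix notation used above (Type-B opens an entity, Type-I continues it, O is outside any entity).

```python
def extract_entities(rows):
    """Group (word, label) rows into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, label in rows:
        # "Person-B" -> ("Person", "B"); plain "O" -> ("", "O").
        etype, _, position = label.rpartition("-")
        if position == "I" and etype == current_type:
            current_words.append(word)  # continue the open entity
            continue
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        if position in ("B", "I"):
            # A stray I label without a matching B is treated as a new entity.
            current_type, current_words = etype, [word]
        else:
            current_type, current_words = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


rows = [
    ("بل", "Person-B"),
    ("گیٹس", "Person-I"),
    ("مائیکروسافٹ", "Organization-B"),
    ("کے", "O"),
    ("بانی", "O"),
    ("ہیں", "O"),
]
print(extract_entities(rows))
```

The B label is what lets two adjacent entities of the same type be kept apart, which is why the scheme needs three labels rather than a simple inside/outside pair.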
Practical Task
 Annotating English file

 WebAnno Introduction

 Creating Urdu POS Tagset

 Annotating a news item (from today’s newspaper)


Syntactic Representation
Phrase Structure

vs

Dependency Structure
Phrase Structure vs Dependency
Constituent and Functional Structures
 Lexical Functional Grammar (LFG)
Urdu Examples
“Universal” Dependencies
Recap – An Urdu Example
ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں

Word     Lemma  POS   Features
ذہین     ذہین   Adj   Gend=Fem Num=Pl Form=Obl
لڑکیوں   لڑکی   Noun  Gend=Fem Num=Pl Form=Obl
نے       نے     AdP
اچھی     اچھا   Adj   Gend=Fem Num=Pl Form=Nom
کتابیں   کتاب   Noun  Gend=Fem Num=Pl Form=Nom
پڑھیں    پڑھ    Verb  Form=Perf Gend=Fem Pers=3 Num=Pl
تھیں     ہے     Aux
Recap – An Urdu Example
CoNLL Format
 CoNLL (Conference on Natural Language Learning) format
 Representing graph (and other tags) in text file

Id
Word
Lemma
Coarse Grained POS
Fine Grained POS
Features
Host
Dependency Type
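
The eight columns listed above can be read from a tab-separated text file, one token per line. The Python sketch below makes several assumptions that are not in the slides: features are joined with `|` and an empty field is written `_` (as in the CoNLL-X shared-task convention), the fine-grained POS simply repeats the coarse one, and the host ids and dependency types in the sample rows are illustrative guesses rather than the presenter's annotation. The names `Token`, `parse_feats` and `parse_line` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    word: str
    lemma: str
    cpos: str    # coarse-grained POS
    pos: str     # fine-grained POS
    feats: dict  # e.g. {"Gend": "Fem", "Num": "Pl"}
    head: int    # id of the host word; 0 marks the root
    deprel: str  # dependency type


def parse_feats(field):
    """Turn 'Gend=Fem|Num=Pl' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|"))


def parse_line(line):
    """Parse one tab-separated token line into a Token."""
    cols = line.rstrip("\n").split("\t")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


# Host ids and dependency types are illustrative assumptions.
rows = [
    "1\tذہین\tذہین\tAdj\tAdj\tGend=Fem|Num=Pl|Form=Obl\t2\tamod",
    "2\tلڑکیوں\tلڑکی\tNoun\tNoun\tGend=Fem|Num=Pl|Form=Obl\t6\tnsubj",
    "3\tنے\tنے\tAdP\tAdP\t_\t2\tcase",
    "4\tاچھی\tاچھا\tAdj\tAdj\tGend=Fem|Num=Pl|Form=Nom\t5\tamod",
    "5\tکتابیں\tکتاب\tNoun\tNoun\tGend=Fem|Num=Pl|Form=Nom\t6\tobj",
    "6\tپڑھیں\tپڑھ\tVerb\tVerb\tForm=Perf|Gend=Fem|Pers=3|Num=Pl\t0\troot",
    "7\tتھیں\tہے\tAux\tAux\t_\t6\taux",
]

sentence = [parse_line(r) for r in rows]
root = next(t for t in sentence if t.head == 0)
print(root.word, root.feats)
```

Storing the host as a token id is what lets a plain text file represent the whole dependency graph: each line names exactly one incoming edge.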
CoNLL Format
Tools for Annotation
 GATE (General Architecture for Text Engineering), Sheffield
 Brat, Manchester, Tokyo, ...
 WebAnno, Darmstadt
 ...
