
Statistical Natural Language Processing
Çağrı Çöltekin /tʃaːɾˈɯ tʃœltecˈɪn/
[email protected]
University of Tübingen, Seminar für Sprachwissenschaft

Summer Semester 2017

Motivation Overview Practical matters Next

Why study (statistical) NLP

• (Most of) you are studying in a ‘computational linguistics’ program
• Many practical applications
• Investigating basic questions in linguistics and cognitive science (and more)

Application examples

For profit (engineering):
• Machine translation
• Question answering
• Information retrieval
• Dialog systems
• Summarization
• Text classification
• Text mining/analytics
• Sentiment analysis
• Speech recognition/synthesis
• Automatic grading
• Forensic linguistics

For fun (research):
• Modeling cognitive/social behavior
• Authorship attribution
• Investigating language change through time and space
• (Automatic) corpus annotation for linguistic research

Layers of linguistic analysis

[Figure: layers of linguistic analysis, from top to bottom: discourse, semantics, syntax, morphology, phonetics / phonology. Analysis side: Discourse analysis, Semantic analysis, Parsing, Morphological Analysis, Speech Recognition. Generation side: Sentence Planning, Sentence Generation, Word Generation, Speech Synthesis.]


Annotation layers: example

Layers: →Tokens →POS Tags →Morphology →Syntax

Token   POS tag   Morphology   Dependency relation
From    ADP       -            case
the     DET       Def          det
AP      PROPN     Sing         obl
comes   VERB      3s,Pres      root
this    DET       Sing,Dem     det
story   NOUN      Sing         nsubj
:       PUNCT     -            punct
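The four annotation layers above can be kept together in one record per token. The following is a minimal Python sketch, not the course's own code: the Token class, its field names and the helper functions are hypothetical, and the head indices are an assumption reconstructed from the relation labels, since the arcs themselves appear only in the slide's figure. An empty morphology field is written as "_", as is customary in CoNLL-U files.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the sentence
    form: str    # the token itself
    upos: str    # part-of-speech tag
    feats: str   # morphological features, "_" if none
    head: int    # index of the head token, 0 for the root (assumed)
    deprel: str  # dependency relation to the head


# "From the AP comes this story :" with all four layers.
SENTENCE = [
    Token(1, "From", "ADP", "_", 3, "case"),
    Token(2, "the", "DET", "Def", 3, "det"),
    Token(3, "AP", "PROPN", "Sing", 4, "obl"),
    Token(4, "comes", "VERB", "3s,Pres", 0, "root"),
    Token(5, "this", "DET", "Sing,Dem", 6, "det"),
    Token(6, "story", "NOUN", "Sing", 4, "nsubj"),
    Token(7, ":", "PUNCT", "_", 4, "punct"),
]


def root(sentence):
    """Return the token whose head is 0 (the root of the tree)."""
    return next(t for t in sentence if t.head == 0)


def children(sentence, head_idx):
    """Return the forms of all tokens attached to the given head."""
    return [t.form for t in sentence if t.head == head_idx]


if __name__ == "__main__":
    for t in SENTENCE:
        print(t.idx, t.form, t.upos, t.feats, t.head, t.deprel, sep="\t")
    print("root:", root(SENTENCE).form)
    print("dependents of the root:", children(SENTENCE, root(SENTENCE).idx))
```

Keeping every layer on the same token record makes it easy to query across layers, for example to find all singular nouns that are subjects.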

Typical NLP pipeline

• Text processing / normalization
• Word/sentence tokenization
• POS tagging
• Morphological analysis
• Syntactic parsing
• Semantic parsing
• Named entity recognition
• Coreference resolution
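As a rough illustration of the first two pipeline stages, here is a minimal Python sketch that chains normalization, sentence splitting and word tokenization. The function names and regular expressions are illustrative assumptions; real tokenizers must handle many more cases (abbreviations, numbers, URLs, and so on).

```python
import re


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Words (with an optional apostrophe part) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


def pipeline(text):
    """Each stage consumes the output of the previous one."""
    return [tokenize(s) for s in split_sentences(normalize(text))]


if __name__ == "__main__":
    raw = "Time flies like an arrow.   Fruit flies\nlike a banana!"
    for tokens in pipeline(raw):
        print(tokens)
```

Later stages (tagging, parsing, and so on) would take the token lists as input in the same way, which is what makes the architecture a pipeline.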

Do we need a pipeline?

• Most “traditional” NLP architectures are based on a pipeline approach:
  – tasks are done individually, and the results are passed on to the next level up
• Joint learning (e.g., POS tagging and syntax) often improves the results
• End-to-end learning (without intermediate layers) is another (recent/trending) approach

On the word ‘statistical’

“But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” (Chomsky 1968)

• Some linguistic traditions emphasize(d) the use of ‘symbolic’, rule-based methods
• Some NLP systems are rule-based (especially those from the 1980s and 1990s)
• Virtually all modern NLP systems include some sort of statistical component

What is difficult with NLP?

• Combinatorial problems (computational complexity)
• Ambiguity
• Data sparseness


NLP and computational complexity

• How many possible parses does a sentence have?
• How many ways can you align two (parallel) sentences?
• How do we calculate the probability of a sentence based on the probabilities of the words in it?
• Many similar questions we deal with have an exponential search space
• Naive approaches are often computationally intractable
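To get a feel for these growth rates, the sketch below counts the unlabeled binary bracketings of an n-word sentence (the Catalan numbers) and two simple ways of counting word alignments between sentences of n and m words. The function names are hypothetical, and which alignment count applies depends on the restrictions a particular alignment model imposes.

```python
from math import comb


def binary_bracketings(n):
    """Number of binary trees over n words: the (n-1)th Catalan number."""
    if n < 1:
        raise ValueError("need at least one word")
    return comb(2 * (n - 1), n - 1) // n


def free_alignments(n, m):
    """Any subset of the n*m possible word pairs may be linked."""
    return 2 ** (n * m)


def one_link_alignments(n, m):
    """Each of the m target words links to one of n source words or to nothing."""
    return (n + 1) ** m


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, binary_bracketings(n), one_link_alignments(n, n))
```

Even under the restricted alignment count, a brute-force search over all candidates quickly becomes infeasible for sentences of ordinary length, which is why dynamic programming and approximate search matter in practice.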


NLP and ambiguity
fun with newspaper headlines

• FARMER BILL DIES IN HOUSE
• TEACHER STRIKES IDLE KIDS
• SQUAD HELPS DOG BITE VICTIM
• BAN ON NUDE DANCING ON GOVERNOR’S DESK
• PROSTITUTES APPEAL TO POPE
• KIDS MAKE NUTRITIOUS SNACKS
• DRUNK GETS NINE MONTHS IN VIOLIN CASE
• MINERS REFUSE TO WORK AFTER DEATH


More ambiguities
we do not recognize many of them at first read

• Time flies like an arrow; fruit flies like a banana
• Outside of a dog, a book is a man’s best friend; inside it’s too hard to read
• One morning I shot an elephant in my pajamas. How he got in my pajamas, I don’t know
• Don’t eat the pizza with knife and fork; the one with anchovies is better
• Hearing voices? Then you’re not alone!
• No parking on both sides.
• They are canning peas.
• My job was keeping him alive.
• We watched another fly.
• Double job pay.
• He fed her cat food.
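One way to see why such sentences are hard for a machine is to enumerate the part-of-speech sequences that a lexicon allows. The sketch below does this for "Time flies like an arrow". The toy lexicon is an assumption made up for illustration, not a resource from the course, and a real lexicon would list more readings.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the tags it may take.
LEXICON = {
    "time": ["NOUN", "VERB"],
    "flies": ["NOUN", "VERB"],
    "like": ["VERB", "ADP"],
    "an": ["DET"],
    "arrow": ["NOUN"],
}


def tag_sequences(words, lexicon):
    """Return every combination of per-word tags allowed by the lexicon."""
    options = [lexicon[w.lower()] for w in words]
    return list(product(*options))


if __name__ == "__main__":
    sentence = "Time flies like an arrow".split()
    sequences = tag_sequences(sentence, LEXICON)
    print(len(sequences), "candidate tag sequences")
    for seq in sequences:
        print(" ".join(f"{w}/{t}" for w, t in zip(sentence, seq)))
```

The number of candidates is the product of the per-word options, so it grows exponentially with sentence length; a statistical tagger ranks these candidates instead of treating them as equally plausible.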

Even more ambiguities
with pretty pictures

[Figure: cartoon from Cartoon Theories of Linguistics, SpecGram Vol CLIII, No 4, 2008. http://specgram.com/CLIII.4/school.gif]

Statistical methods and data sparsity

• Statistical methods (machine learning) are the best way we know to deal with ambiguities
• Even for rule-based approaches, a statistical disambiguation component is necessary
• Machine learning methods require (annotated) data
• But …

Languages are full of rare events
word frequencies in a small corpus

[Figure: relative frequency (y-axis, 0.00 to 0.06) plotted against rank (x-axis, 0 to 250); a long tail follows …]
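A rank/frequency table like the one behind this plot can be computed in a few lines. This is a minimal sketch under stated assumptions: the function name and the crude regular-expression tokenization are illustrative, and the corpus used for the slide's figure is not known.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, relative frequency) triples, most frequent first."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return [
        (rank, word, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for rank, word, rel_freq in rank_frequency(sample):
        print(rank, word, round(rel_freq, 3))
```

On a real corpus, most of the rows at the bottom of this table are words that occur only once or twice, which is the long tail the plot refers to.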

What is in this course

• Quick introduction / refreshers on important prerequisites
• The computational linguist’s toolbox: basic methods and tools in NLP
• Some applications of NLP

What is in this course
Preliminaries

• Linear algebra, some concepts from calculus
• Probability theory
• Information theory
• Statistical inference
• Some topics from machine learning
  – Regression & classification
  – Sequence learning (HMMs)
  – Neural networks and deep learning
  – Unsupervised learning

What is in this course
NLP Tools and techniques

• Tokenization, normalization, segmentation
• N-gram language models
• Part of speech tagging
• Statistical parsing
• Sequence alignment
• Distributed representations (of words and other linguistic objects)
• Text classification

What is in this course
Applications

• Statistical machine translation
• Sentiment analysis
• Topic models

What is not in this course

• Cutting edge, latest methods & applications
• In-depth treatment of particular topics
• Introduction to terms / concepts from linguistics

Logistics

• Lectures: Mon/Wed/Fri 12:15 at Hörsaal 0.02. Normally:
  – Mon/Wed: formal lectures
  – Fri: hands-on exercises
• Office hours: Wed 10:00-12:00 (room 1.09), or by appointment (email [email protected])
• Course web page: http://sfs.uni-tuebingen.de/~ccoltekin/courses/snlp
• We also have a Moodle page (linked from the course web page)

Reading material

• Daniel Jurafsky and James H. Martin (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second edition. Pearson Prentice Hall. isbn: 978-0-13-504196-3
  – Draft chapters of the third edition are available at http://web.stanford.edu/~jurafsky/slp3/
• Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer series in statistics. Springer-Verlag New York. isbn: 9780387848587. url: http://web.stanford.edu/~hastie/ElemStatLearn/

Grading / evaluation

• Three graded homework assignments (10 % each)
• Final exam (70 %)
• Many non-graded (but not optional) exercises
• Attendance
  – 5 % (bonus) if you miss only one or two classes
  – you lose one point for each additional class you miss
• Up to 5 % additional bonus points for Easter eggs:
  – first person finding intentional trivial mistakes in the course material gets 5 %

Practical sessions

• Tutor: Kuan Yu ⟨[email protected]⟩
• All programming exercises (graded or non-graded) should be done in Python
• The exercises are not graded, but they should not be considered optional

Next

Fri (this week and next): a hands-on introduction to Python
Mon: Mathematical preliminaries (some linear algebra and bits from calculus)
Wed: Probability theory

References / additional reading material

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. isbn: 978-0387-31073-2.

Chomsky, Noam (1968). “Quine’s empirical assumptions”. In: Synthese 19.1, pp. 53–68. doi: 10.1007/BF00568049.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer series in statistics. Springer-Verlag New York. isbn: 9780387848587. url: http://web.stanford.edu/~hastie/ElemStatLearn/.

Jurafsky, Daniel and James H. Martin (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second edition. Pearson Prentice Hall. isbn: 978-0-13-504196-3.

Manning, Christopher D. and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press. isbn: 9780262133609.
