Informatics Educational Institutions & Programs

Sentence boundary disambiguation (SBD), also known as sentence breaking or sentence boundary detection, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, a decimal point, an ellipsis, or part of an email address rather than the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.^[1] Similarly, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.

Languages like Japanese and Chinese have unambiguous sentence-ending markers.
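As a contrast with the ambiguous Latin-script period, the following minimal sketch splits text on dedicated sentence-final marks. The set of marks (ideographic full stop, full-width exclamation mark, full-width question mark) and the function name `split_cjk_sentences` are assumptions made for illustration; quoted speech and other edge cases are not handled.

```python
import re

# Sentence-final marks assumed for this sketch: ideographic full stop,
# full-width exclamation mark, full-width question mark.
SENTENCE_END = "。！？"


def split_cjk_sentences(text):
    """Return runs of text that each end at one of the SENTENCE_END marks."""
    pattern = f"[^{SENTENCE_END}]+[{SENTENCE_END}]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


print(split_cjk_sentences("今日は晴れです。明日は雨です。"))
print(split_cjk_sentences("我喜欢读书。你呢？"))
```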

Strategies

The standard "vanilla" approach to locating the end of a sentence:

(a) If it's a period, it ends a sentence.

(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.

(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct.^[2] The remaining 5% often involve abbreviated names such as "D. H. Lawrence" (with whitespace between the initials), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation or non-standard usage of punctuation.
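The three rules above can be sketched as follows. This is one possible reading, not a reference implementation: it assumes the rules are applied in order, with each later rule overriding the earlier ones, that tokens are separated by whitespace, and that the abbreviation list (`ABBREVIATIONS`, hypothetical and far from complete) is supplied by hand.

```python
# Hypothetical hand-compiled abbreviation list, lowercased for lookup.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "a.m.", "p.m.", "e.g.", "i.e.", "etc."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split whitespace-tokenized text using rules (a), (b) and (c)."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        is_boundary = True                      # (a) a period ends a sentence
        if token.lower() in abbreviations:
            is_boundary = False                 # (b) unless it closes an abbreviation
            if next_token is not None and next_token[0].isupper():
                is_boundary = True              # (c) unless the next token is capitalized
        if is_boundary:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("The meeting is at 5 p.m. on Friday. Bring notes."))
print(split_sentences("Dr. Smith arrived."))
```

Under this reading the first input is split correctly into two sentences, while the second is wrongly split after "Dr." because rule (c) fires on the capitalized name. Errors of this kind are among those that keep such rules from being fully accurate.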

Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model.^[3] The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
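To illustrate the idea of learning from pre-marked documents, the sketch below derives a list of non-terminal tokens from sentences whose boundaries are already known. It is a deliberately simple stand-in, not the maximum entropy model or the SATZ neural network mentioned above; the function name, the `min_count` parameter, and the training sentences are all hypothetical.

```python
from collections import Counter


def learn_non_terminal_tokens(marked_sentences, min_count=1):
    """Collect period-final tokens that occur before the end of a marked sentence.

    Each item of marked_sentences is one sentence, so any token ending in a
    period that is not the last token cannot be a sentence boundary.
    """
    counts = Counter()
    for sentence in marked_sentences:
        tokens = sentence.split()
        for token in tokens[:-1]:               # skip the true sentence-final token
            if token.endswith("."):
                counts[token.lower()] += 1
    return {token for token, n in counts.items() if n >= min_count}


# Hypothetical pre-marked training data: one sentence per string.
training = [
    "Dr. Jones met Mr. Lee at noon.",
    "They spoke for approx. two hours.",
]
learned = learn_non_terminal_tokens(training)
print(sorted(learned))
```

The learned set could then replace a hand-compiled abbreviation list in a rule-based splitter. Statistical models go further by weighing many contextual features of each candidate boundary instead of a single token lookup.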

Software

Examples of use of Perl-compatible regular expressions (PCRE)

((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
$sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
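As a usage sketch, the first pattern above can be tried with Python's `re` module. The adaptation below is an assumption for illustration: the capturing groups are made non-capturing so that `re.split` returns only the sentences, and the sample text is invented.

```python
import re

# The first pattern above with non-capturing groups: split on whitespace that
# follows a lowercase letter or digit plus ., ? or ! (optionally a closing
# double quote) and precedes a capital letter (optionally an opening quote).
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))(?:\s|\r\n)(?="?[A-Z])'
)

text = 'He paid $3.50 for it. She said "no." Then they left.'
sentences = BOUNDARY.split(text)
for sentence in sentences:
    print(sentence)
```

The decimal point in "$3.50" is left alone because no whitespace follows it. An abbreviation followed by a capitalized word, such as "Dr. Smith", would still be split, since the pattern has no abbreviation list.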

Online use, libraries, and APIs

sent_detector – Java
Lingua-EN-Sentence – Perl
Sentence.pm – Perl
SATZ – An Adaptive Sentence Segmentation System – by David D. Palmer – C

Toolkits that include sentence detection

References

[1] Stamatatos, E.; Fakotakis, N.; Kokkinakis, G. "Automatic Extraction of Rules for Sentence Boundary Disambiguation". University of Patras. Retrieved 2009-01-03.
[2] O'Neil, John. "Doing Things with Words, Part Two: Sentence Boundary Detection". Retrieved 2009-01-03.
[3] Reynar, J. C.; Ratnaparkhi, A. "A Maximum Entropy Approach to Identifying Sentence Boundaries" (PDF). Retrieved 2009-01-03.

External links

Search for 'sentence boundary disambiguation', Google Scholar.

Revision as of 23:11, 21 September 2017
