2104 – Lucene's Stemming Fails on Curly Apostrophes

Summary: Lucene's Stemming Fails on Curly Apostrophes

Status:	CONFIRMED

Alias:	None

Product:	E3
Classification:	Unclassified
Component:	index
Version:	trunk
Hardware:	All All

Importance:	P3 normal
Target Milestone:	---
Assignee:	Richard Harms
QA Contact:

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-09-12 15:23 CDT by Richard Harms
Modified:	2017-07-15 11:04 CDT
CC List:	0 users

See Also:

Description Richard Harms 2014-09-12 15:23:33 CDT

Item description: “couldn’t,” (curly) search for: “couldn’t,” (curly) works.
Item description: “couldn't,” (plain) search for: “couldn’t,” (curly) fails.
Item description: “couldn’t,” (curly) search for: “couldn't,” (plain) works.
Item description: “couldn't,” (plain) search for: “couldn't,” (plain) works.

Looks like the stemming code in Lucene isn't aware of curly apostrophes.

Comment 1 Richard Harms 2014-09-12 17:10:06 CDT

The problem isn't in the stemming; it's in the tokenization, StandardTokenizer specifically.

An apostrophe splits a contraction into two words ("couldn't" -> "couldn" and "t"). A smart apostrophe seems to leave it as one word. (This needs to be verified; researching the problem on Google is what led to this conclusion.)

There is little to no information available on what to do about this, which is kind of stunning. It may really come down to just replacing the smart apostrophe with a regular one before the text is handed over to Lucene for any purpose.
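A minimal sketch of that replacement, in plain Java with no Lucene dependency. The class and method names (ApostropheNormalizer, normalize) are made up for illustration, and it assumes the only character that needs mapping is U+2019. The same call would have to be applied to both the item text at index time and the query text at search time.

```java
// Hypothetical helper, not part of Lucene or E3.
final class ApostropheNormalizer {
    private ApostropheNormalizer() {}

    /**
     * Returns the text with every right single quotation mark (U+2019)
     * replaced by a plain ASCII apostrophe. Null is passed through.
     */
    static String normalize(String text) {
        if (text == null) {
            return null;
        }
        // One-for-one character swap, so the string length does not change.
        return text.replace('\u2019', '\'');
    }
}
```

Because this is a blunt character swap, it also rewrites a U+2019 that is being used as a closing single quote rather than as an apostrophe.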

Comment 2 Richard Harms 2014-09-12 18:07:08 CDT

Subclass EnglishAnalyzer. Override initReader(…) to add a CharFilter.

Do something like MappingCharFilter, but one that can replace the contents of words, maybe based on regular expression matches? Match and replace contractions.
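A rough sketch of the regular-expression idea, working on a plain String rather than as a real CharFilter. The class name and the pattern are assumptions for illustration. It treats a curly apostrophe as part of a contraction only when there is a letter on both sides, and leaves other uses (such as a closing single quote) alone. Wiring this into initReader is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or E3.
final class ContractionApostropheRewriter {
    // A right single quotation mark (U+2019) with a letter immediately
    // before and immediately after it. The lookbehind and lookahead do not
    // consume the letters, so only the apostrophe itself is replaced.
    private static final Pattern CURLY_IN_WORD =
            Pattern.compile("(?<=\\p{L})\u2019(?=\\p{L})");

    private ContractionApostropheRewriter() {}

    /** Replaces in-word curly apostrophes with a plain ASCII apostrophe. */
    static String rewrite(String text) {
        return CURLY_IN_WORD.matcher(text).replaceAll("'");
    }
}
```

Since each match is a single character replaced by a single character, the output has the same length as the input, which should make offset tracking trivial if this is later turned into a CharFilter.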

Comment 3 Richard Harms 2014-09-12 18:20:51 CDT

There may be something here, too.

https://issues.apache.org/jira/browse/LUCENE-3884

http://lucene.apache.org/core/4_10_0/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html