Bug 2104 - Lucene's Stemming Fails on Curly Apostrophes
Summary: Lucene's Stemming Fails on Curly Apostrophes
Status: CONFIRMED
Alias: None
Product: E3
Classification: Unclassified
Component: index
Version: trunk
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Harms
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-09-12 15:23 CDT by Richard Harms
Modified: 2017-07-15 11:04 CDT
0 users

See Also:


Attachments

Description Richard Harms 2014-09-12 15:23:33 CDT
Item description: “couldn’t,” (curly) search for: “couldn’t,” (curly) works.
Item description: “couldn't,” (plain) search for: “couldn’t,” (curly) fails.
Item description: “couldn’t,” (curly) search for: “couldn't,” (plain) works.
Item description: “couldn't,” (plain) search for: “couldn't,” (plain) works.

Looks like the stemming code in Lucene isn't aware of curly apostrophes.
Comment 1 Richard Harms 2014-09-12 17:10:06 CDT
The problem isn't in the stemming; it's in the tokenization, specifically in StandardTokenizer.

A plain apostrophe splits a contraction into two words ("couldn't" -> "couldn" and "t"). A smart (curly) apostrophe seems to leave the contraction as one word. (This needs to be verified; researching the problem on Google is what led to this conclusion.)

There is little to no information available on what to do about this, which is kind of stunning. It really may come down to just replacing the smart apostrophe with a regular one before the text is handed over to Lucene for any purpose.
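
A minimal sketch of that replace-before-Lucene idea, in plain Java with no Lucene dependency. The class and method names (ApostropheNormalizer, normalize) are hypothetical and not part of E3 or Lucene. It assumes that only U+2019 (right single quotation mark) needs to be mapped to U+0027, and that the same call is applied both to item text at index time and to the query string at search time.

```java
// Hypothetical helper; not part of E3 or Lucene.
final class ApostropheNormalizer {
    private ApostropheNormalizer() {}

    /**
     * Replaces every curly apostrophe (U+2019) with a plain one (U+0027).
     * Assumption: no other quote characters need to be mapped.
     */
    static String normalize(String text) {
        if (text == null) {
            return null;
        }
        return text.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        // Curly input is rewritten; plain input passes through unchanged.
        System.out.println(normalize("couldn\u2019t,"));
        System.out.println(normalize("couldn't,"));
    }
}
```

Because this replaces every U+2019, a closing single quote written with that character is flattened as well. That is probably harmless for search, but it is a blanket replacement rather than a contraction-aware one.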
Comment 2 Richard Harms 2014-09-12 18:07:08 CDT
Subclass EnglishAnalyzer. Override initReader(…) to add a CharFilter.

Use something like MappingCharFilter, but one that can replace text inside words, maybe based on regular expression matches? Match contractions and replace the curly apostrophe in them.
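
A rough sketch of the regular-expression idea, again in plain Java so it runs without Lucene. The class name ContractionApostropheFilter and the pattern are assumptions for illustration. The pattern treats a U+2019 that sits directly between two letters as a contraction apostrophe and leaves every other U+2019 alone. Wiring the same pattern into initReader(…) through a CharFilter is not shown; Lucene's own PatternReplaceCharFilter may already cover that step, which would need to be checked against the Lucene version in use.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of E3 or Lucene.
final class ContractionApostropheFilter {
    // Matches U+2019 only when a letter precedes it and a letter follows it.
    private static final Pattern CURLY_INSIDE_WORD =
            Pattern.compile("(?<=\\p{L})\u2019(?=\\p{L})");

    private ContractionApostropheFilter() {}

    /** Replaces in-word curly apostrophes with plain ones (U+0027). */
    static String normalize(String text) {
        return CURLY_INSIDE_WORD.matcher(text).replaceAll("'");
    }

    public static void main(String[] args) {
        // The apostrophe inside the contraction is rewritten;
        // the surrounding curly quotes are left as they are.
        System.out.println(normalize("She said \u2018I couldn\u2019t\u2019 twice."));
    }
}
```

Compared with a blanket character replacement, this keeps curly quotation marks intact, at the cost of missing forms where no letter follows the apostrophe, such as a plural possessive.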