Frequency of words in non-English language text: how can I merge singular and plural forms etc.?
I'm sorting French-language words in some text files by frequency, with a focus on insight rather than statistical significance. The challenge is preserving accented characters and dealing with the elided article forms in front of vowels (l', d') when shaping word tokens for sorting.
The topic of the most frequent words in a file takes many shapes (1 | 2 | 3 | 4). So I put together this function using GNU utilities:
compt1 () grep -hEo "[[:alnum:]_'-]+"
...which trades spaces for newlines; trims a character followed by punctuation at the beginning of the line; converts everything to lowercase; uses this compact grep construct, which matches word-constituent characters, to create tokens; then removes the stop words; and finally does the usual sorting. The stop file contains a segment with individual characters, so you have to be careful with how it's used, but the analysis provided on how to create stems for words in different languages is really interesting!
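The function body above appears truncated; based on the description, the full pipeline presumably looks something like the following sketch. The stop-file name (stop.txt) and the exact trim expression are my assumptions, not the original:

```shell
# Hedged reconstruction of the described pipeline; stop.txt and the sed
# trim expression are placeholders, not the original function body.
compt1 () {
  cat "$@" |
  tr ' ' '\n' |                      # trade spaces for newlines
  sed "s/^.[[:punct:]]//" |          # trim a character + punctuation at line start (l', d')
  tr '[:upper:]' '[:lower:]' |       # lowercase (ASCII only; the original used GNU sed \L)
  grep -hEo "[[:alnum:]_'-]+" |      # tokenize on word-constituent characters
  grep -Fvxf stop.txt |              # remove stop words (exact whole-line matches)
  sort | uniq -c | sort -rn          # the usual frequency sort
}
```

With input containing "l'heure", this counts heure rather than l'heure, which is the behavior described above.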
Now when I compare the frequency of a significant word with the output of grep -c directly on the files, the counts are close enough, within some margin of error.
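Part of that margin of error is that grep -c counts matching lines, not occurrences. A quick way to see the difference, with a made-up one-line sample:

```shell
# grep -c counts matching LINES; grep -o | wc -l counts occurrences.
# sample.txt is a hypothetical example, not the original data.
printf "L'heure passe. Une heure encore.\n" > sample.txt
grep -c 'heure' sample.txt           # 1 (one matching line)
grep -o 'heure' sample.txt | wc -l   # 2 (two occurrences)
```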
Questions:
- How could I modify this to merge the frequency of plurals with their singular forms, i.e. words sharing a common prefix with a varying one-character suffix?
- Would the grep part in particular work with what's available on OSX?
1. I cannot provide the source data, but I can provide this file as an example. The words heure and enfant in the text illustrate the point. The former appears twice, including once as "l'heure", and helps validate whether the command works. The latter appears in both singular and plural forms (enfant/enfants) and would benefit from being merged here.
shell-script text-processing sed portability natural-language
I've just been told by an expert on this kind of thing that you'll never be able to do this properly with sed & co. You should try a stemmer instead. It was also suggested that you might get better answers on Stack Overflow or Linguistics.
– terdon♦
Jul 19 '14 at 14:32
OK, I just checked with one of the Linguistics mods and they said it's on topic there. Since they deal with this kind of thing professionally, they might be able to help out more. The question is perfectly on topic here though so it's completely up to you. If you would like it migrated, just flag for moderator attention and let us know.
– terdon♦
Jul 19 '14 at 16:00
@terdon Thank you very much for looking into it and enabling the answer! As for a migration path I just don't know. As I find the topic interesting I should maybe expand my account to Linguistics and write a better focused question which links to this here...
– jus cogens prime
Jul 19 '14 at 22:18
As you wish. The linguists are willing to take this so it's really up to you, whatever you prefer.
– terdon♦
Jul 20 '14 at 12:38
edited Apr 13 '17 at 12:36 by Community♦
asked Jul 19 '14 at 13:59 by jus cogens prime
3 Answers
You really are not going to be able to do this with a simplistic sed
script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.
That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.
That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of the language, both to handle the inflections and to distinguish homographs.
I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:
# 1st strip leading articles
s/^L'//; # Catalan
s O
x;
# 2nd strip interior particles
s/\b[dl]'//g; # Catalan
s i gx;
That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g
first.
Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.
Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball
module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.
But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.
This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they might be better answered on Linguistics.SE; I don’t know.
Thank you for taking the time to expose the underlying considerations. I had never considered that language-specific morphology expertise would be required, but I did get that right away as I read Mr Porter's take on French: in particular the region with the vowel after the first non-vowel, and the second region with the same construct. I thought that if I had a stem file for all the French stems, then I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty!
– jus cogens prime
Jul 19 '14 at 22:31
Natural language processing is complex. Doing it with regular expressions is like parsing HTML with regular expressions, only worse. Read tchrist's excellent answer for some insight into how to approach your problem. I'm going to briefly answer the part about the portability of your use of unix text processing tools.
The common denominator to all modern unix-like systems is the POSIX specification. The most useful resource is the Open Group Specification Issue 6 a.k.a. Single Unix Specification version 3 (OGS Issue 7 = SUS version 4 is not fully implemented on many systems), which includes and extends POSIX and, usefully, is available online and for download (e.g. in Debian). If you're only interested in portability to non-embedded Linux (and Cygwin) and to OSX, check the GNU manuals and the OSX man pages.
You are using several non-POSIX options to grep, but all of them are available in both GNU and OSX (OSX uses the grep from FreeBSD, which seeks to emulate most GNU constructs). If you want POSIX, you'll need to avoid a few options:
- grep -h to suppress the file name: call grep on one file at a time, or pass the files to cat first.
- grep -o to output only the matched part: use sed or awk instead.
- grep -w to match only whole words: search for a pattern like (^|[^[:alnum:]])needle($|[^[:alnum:]]) instead.
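For instance, the whole-word workaround from the last bullet behaves like grep -w. A quick sketch with made-up input:

```shell
# POSIX stand-in for grep -w: require a non-alphanumeric character
# (or a line boundary) on each side of the word.
printf 'enfant\nenfants\nbel enfant\n' |
  grep -E '(^|[^[:alnum:]])enfant($|[^[:alnum:]])'
```

This matches "enfant" and "bel enfant" but not "enfants".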
You are using one GNU-only construct in sed: the \L directive to lowercase a replacement in the s command. There's nothing like that in other sed implementations. In general, you can use awk instead: break down the input to isolate the string to replace and call tolower. To lowercase the whole input, call tr '[:upper:]' '[:lower:]'.
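A minimal illustration of the awk replacement (ASCII input; how tolower treats accented capitals varies by awk implementation and locale):

```shell
# Lowercase the whole input with awk's tolower; portable for ASCII,
# implementation-dependent for accented capitals.
printf 'HEURE Enfants\n' | awk '{ print tolower($0) }'   # heure enfants
```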
Thank you! I had forgotten about that sed bit. I used it because non-heirloom tr can't do capitalized accented chars, but I'll use awk for that like you suggested. Plus I'll make a habit of validating target versions of the utilities so as to save time during design instead of adapting afterwards!
– jus cogens prime
Jul 20 '14 at 5:01
The selected answer really provides a great introduction to the challenges in the fields of Natural Language Processing and Computational Linguistics, and there is surely further information on the dedicated SE sites. I wanted to provide a complement which underscores these challenges and gives me a temporary "fix".
I think I can, in some cases (see footnote 1), trim a final s with sed to achieve a fairly safe yet interesting result:
s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/
This compacts some 50 lines in the provided sample when used with the original function.
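An alternative to rewriting tokens before counting is to fold the tallies afterwards: whenever both word and word+"s" were counted, add the plural's count to the singular. This is my sketch, not part of the original function:

```shell
# Merge "Xs" counts into "X" after tallying, but only when the bare
# singular also occurs; this avoids mangling words that merely end in s.
printf 'enfant\nenfants\nenfants\nheure\n' |
  sort | uniq -c |
  awk '{ n[$2] = $1 }
       END {
         # collect plural keys first, then fold, to avoid mutating
         # the array while iterating over it
         for (w in n) if ((w "s") in n) plural[w "s"] = w
         for (p in plural) { n[plural[p]] += n[p]; delete n[p] }
         for (w in n) print n[w], w
       }' |
  sort -rn
```

On this input it prints "3 enfant" and "1 heure", merging enfant/enfants exactly as asked in the question.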
So I tried sed with the following, which is both incomplete and not working as intended, but it showcases the difficulties and, in my opinion, helps in understanding what the answer explained:
sed -E '
h;
s/^(par|col|tap.*)/\1/
t RVv
h;
s/^(par|col|tap.*)/\1/
t RVc
h;
s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*)$/\1/
t RVnotpctv_v
h;
s/^(.*.[aeiouyâàëéêèïîôûù]....*)/\1/
t RVnotpctother
b
:RVv
s/^(par|col|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*)/\1/
t R1
:RVc
s/^(par|col|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
t R1
:RVnotpctv_v
s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)$/\1/
t R1
:RVnotpctother
s/^(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
t R1
:R1
s/ement$|ements$|ité$|ités$|if$|ive$|ifs$|ives$|euse$|euses$//
s/é$|ée$|ées$|és$|èrent$|er$|era$|erai$|eraIent$|erais$|erait$|eras$|erez$|eriez$|erions$|erons$|eront$|ez$|iez$|ions$|eons$//
s/eâmes$|eât$|eâtes$|ea$|eai$|eaIent$|eais$|eait$|eant$|eante$|eantes$|eants$|eas$|easse$|eassent$|easses$|eassiez$|eassions$//
s/âmes$|ât$|âtes$|a$|ai$|aIent$|ais$|ait$|ant$|ante$|antes$|ants$|as$|asse$|assent$|asses$|assiez$|assions$//
s/[bcdfghjklmnpqrstvwxz]îmes$|ît$|îtes$|i$|ie$|ies$|ir$|ira$|irai$|iraIent$|irais$|irait$|iras$|irent$|irez$|iriez$|irions$|irons$|iront$|is$|issaIent$|issais$|issait$|issant$|issante$|issantes$|issants$|isse$|issent$|isses$|issez$|issiez$|issions$|issons$|it$//
s/Y/i/
s/ç/c/
t R2
:R2
s/ance$|iqUe$|isme$|able$|iste$|eux$|ances$|iqUes$|ismes$|ables$|istes$//
s/atrice$|ateur$|ation$|atrices$|ateurs$|ations$//
s/logie$|logies$/log/
s/usion$|ution$|usions$|utions$/u/
t Res
:Res
##Residual
s/ier$|ière$|Ier$|Ière$/i/
s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/
##Undouble
s/(en)n$/\1/
s/(on)n$/\1/
s/(et)t$/\1/
s/(el)l$/\1/
s/(eil)l$/\1/
##Unaccent
s/(.*)(é)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
s/(.*)(è)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
s/(.*)e$/\1/
t
'
In some instances it succeeds at stripping a word down to some stem, but there is a very conscious choice to avoid dealing with words containing only a few characters, because the script implements only some small features (and not R2, for instance), and badly at that. But it compacts another 50-60 lines in the sample, as it includes the prior sed expression (see footnote 2). For further insight I'll look into Linguistics!
1. This is all based on my "understanding" of the pseudo-code/description of the Snowball French algorithm.
2. It is wrong in many instances, but running it interactively on the command line provided me with the insight I was looking for when looking at words like parlons and bonbons. I realized there is nothing intrinsic in these two words that dictates why the first one (a verb) has to be stripped of its ons while the other (a noun) loses only its s. It's about parsing the parts of speech, as was explained...
You really are not going to be able to do this with a simplistic sed
script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.
That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.
That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of language, including for the inflections and to distinguish homographs.
I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:
# 1st strip leading articles
s/^L'//; # Catalan
s O
x;
# 2nd strip interior particles
s/b[dl]'//g; # Catalan
s i gx;
That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g
first.
Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.
Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball
module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.
But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.
This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they might be probably better answered on Linguistics.SE; I don’t know.
2
Thank you for taking the time to expose the underlying considerations. I had never considered that language-speficic morphology expertise would be required - but I did get that right away, as I read Mr Porter's take on French. In particular the region with the vowel after the first non vowel; and the second region with the same construct. I thought if I had a stem file for all the French stems, then I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty!
– jus cogens prime
Jul 19 '14 at 22:31
add a comment |
You really are not going to be able to do this with a simplistic sed
script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.
That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.
That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of language, including for the inflections and to distinguish homographs.
I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:
# 1st strip leading articles
s/^L'//; # Catalan
s O
x;
# 2nd strip interior particles
s/b[dl]'//g; # Catalan
s i gx;
That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g
first.
Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.
Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball
module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.
But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.
This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they might be probably better answered on Linguistics.SE; I don’t know.
2
Thank you for taking the time to expose the underlying considerations. I had never considered that language-speficic morphology expertise would be required - but I did get that right away, as I read Mr Porter's take on French. In particular the region with the vowel after the first non vowel; and the second region with the same construct. I thought if I had a stem file for all the French stems, then I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty!
– jus cogens prime
Jul 19 '14 at 22:31
add a comment |
You really are not going to be able to do this with a simplistic sed
script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.
That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.
That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of language, including for the inflections and to distinguish homographs.
I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:
# 1st strip leading articles
s/^L'//; # Catalan
s O
x;
# 2nd strip interior particles
s/b[dl]'//g; # Catalan
s i gx;
That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g
first.
Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.
Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball
module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.
But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.
This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they might be probably better answered on Linguistics.SE; I don’t know.
You really are not going to be able to do this with a simplistic sed
script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.
That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.
That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of the language, both to handle the inflections and to distinguish homographs.
I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:
# 1st strip leading articles
s/^L'//;        # Catalan
s{...}{}ox;     # (rest of this pattern was lost in extraction)
# 2nd strip interior particles
s/\b[dl]'//g;   # Catalan
s{...}{}igx;    # (rest of this pattern was lost in extraction)
That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g first.
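As a sketch of that normalization in the shell (assuming UTF-8 input throughout), the pre-pass can be a single sed substitution run before the tokenizing step:

```shell
# Fold U+2019 (the typographic apostrophe) into the ASCII apostrophe so
# that l’heure and l'heure end up as the same token before counting.
printf "l’heure\n" | sed "s/’/'/g"
```

Without it, a tokenizer built on [[:alnum:]_'-]+ would split l’heure at the curly quote.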
Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.
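For the encoding side, iconv is the usual conversion tool. A sketch converting Latin-1 input to UTF-8 before any word splitting (the exact source-encoding name for MacRoman files varies between iconv implementations, so check iconv -l):

```shell
# Convert ISO-8859-1 bytes to UTF-8 so accented characters survive the
# later sed/grep stages; \351 and \350 are é and è in Latin-1.
printf 'caf\351 cr\350me\n' | iconv -f ISO-8859-1 -t UTF-8
```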
Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball module knows how to do this, and CPAN can be searched for other Perl modules having to do with French linguistics.
But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.
This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they would probably be better answered on Linguistics.SE; I don’t know.
answered Jul 19 '14 at 14:53 by tchrist
Thank you for taking the time to expose the underlying considerations. I had never considered that language-specific morphology expertise would be required - but I did get that right away as I read Mr Porter's take on French: in particular the region with the vowel after the first non-vowel, and the second region with the same construct. I thought that if I had a stem file for all the French stems, then I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty!
– jus cogens prime
Jul 19 '14 at 22:31
Natural language processing is complex. Doing it with regular expressions is like parsing HTML with regular expressions, only worse. Read tchrist's excellent answer for some insight as to how to approach your problem. I'm going to briefly answer the part about the portability of your use of unix text processing tools.
The common denominator to all modern unix-like systems is the POSIX specification. The most useful resource is the Open Group Specification Issue 6 a.k.a. Single Unix Specification version 3 (OGS Issue 7 = SUS version 4 is not fully implemented on many systems), which includes and extends POSIX and, usefully, is available online and for download (e.g. in Debian). If you're only interested in portability to non-embedded Linux (and Cygwin) and to OSX, check the GNU manuals and the OSX man pages.
You are using several non-POSIX options to grep, but all of them are available in both GNU and OSX (OSX uses the grep from FreeBSD, which seeks to emulate most GNU constructs). If you want POSIX, you'll need to avoid a few options:
- grep -h to suppress the file name: call grep on one file at a time, or pass the files to cat first.
- grep -o to output only the matched part: use sed or awk instead.
- grep -w to match only whole words: search for a pattern like (^|[^[:alnum:]])needle($|[^[:alnum:]]).
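Along those lines, a POSIX-portable sketch of the tokenizing stage: cat takes over from grep -h, and tr takes over from grep -o by turning every run of non-word characters into a newline (the printf stands in for real input; the file names in the comment are placeholders):

```shell
# POSIX-portable tokenizer: lowercase, split on non-word characters,
# drop empty lines, then count. Replace printf with: cat file1.txt file2.txt
printf 'Le chat; les CHATS!\n' |
  tr '[:upper:]' '[:lower:]' |
  tr -cs "[:alnum:]'_-" '[\n*]' |
  sed '/^$/d' |
  sort | uniq -c | sort -rn
```

How tr treats multibyte accented characters varies by implementation and locale, so this may need adjusting on OSX.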
You are using one GNU-only construct in sed: the \L directive to lowercase a replacement in the s command. There's nothing like that in other sed implementations. In general, you can use awk instead: break down the input to isolate the string to replace and call tolower. To lowercase the whole input, call tr '[:upper:]' '[:lower:]'.
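Both replacements behave identically on ASCII input; how they treat accented capitals depends on the implementation and the locale:

```shell
# Whole-input lowercasing without GNU sed's \L: tr, or awk's tolower().
printf 'Heure ENFANTS\n' | tr '[:upper:]' '[:lower:]'
printf 'Heure ENFANTS\n' | awk '{ print tolower($0) }'
```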
Thank you! I had forgotten about that sed bit. I used it because non-heirloom tr can't do capitalized accented chars, but I'll use awk for that like you suggested. Plus I'll make a habit of validating target versions of the utilities so as to save time during design instead of adapting afterwards!
– jus cogens prime
Jul 20 '14 at 5:01
answered Jul 20 '14 at 2:40 by Gilles
The selected answer really provides a great introduction to the challenges in the fields of Natural Language Processing and Computational Linguistics, and there is surely further information on the dedicated SE sites. I wanted to provide a complement which underscores these challenges and provides me with a temporary "fix".
I think I can in some cases1 trim the last s with sed to achieve a pretty safe yet interesting result:
s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/
This compacts some 50 lines in the provided sample when used with the original function.
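Plugged into a minimal counting pipeline (written here with BRE-style escapes for portability), the rule folds enfants into enfant and heures into heure while leaving a word like bras alone, since a is deliberately absent from the character class:

```shell
# Merge tokens whose final s follows a character from the class
# (consonants plus e, y and some accented vowels), then count.
printf 'enfant\nenfants\nheure\nheures\nbras\n' |
  sed 's/\(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö]\)s$/\1/' |
  sort | uniq -c | sort -rn
```

It is only a heuristic: invariable words ending in s (fils, corps) will still be clipped whenever the preceding letter happens to be in the class.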
So I tried sed with the following, which is both incomplete and not working as intended - but it showcases the difficulties and is, in my opinion, helpful in understanding what the answer explained:
sed -E '
h;
s/^(par|col|tap.*)/\1/
t RVv
h;
s/^(par|col|tap.*)/\1/
t RVc
h;
s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*)$/\1/
t RVnotpctv_v
h;
s/^(.*.[aeiouyâàëéêèïîôûù]....*)/\1/
t RVnotpctother
b
:RVv
s/^(par|col|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*)/\1/
t R1
:RVc
s/^(par|col|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
t R1
:RVnotpctv_v
s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)$/\1/
t R1
:RVnotpctother
s/^(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
t R1
:R1
s/ement$|ements$|ité$|ités$|if$|ive$|ifs$|ives$|euse$|euses$//
s/é$|ée$|ées$|és$|èrent$|er$|era$|erai$|eraIent$|erais$|erait$|eras$|erez$|eriez$|erions$|erons$|eront$|ez$|iez$|ions$|eons$//
s/eâmes$|eât$|eâtes$|ea$|eai$|eaIent$|eais$|eait$|eant$|eante$|eantes$|eants$|eas$|easse$|eassent$|easses$|eassiez$|eassions$//
s/âmes$|ât$|âtes$|a$|ai$|aIent$|ais$|ait$|ant$|ante$|antes$|ants$|as$|asse$|assent$|asses$|assiez$|assions$//
s/[bcdfghjklmnpqrstvwxz]îmes$|ît$|îtes$|i$|ie$|ies$|ir$|ira$|irai$|iraIent$|irais$|irait$|iras$|irent$|irez$|iriez$|irions$|irons$|iront$|is$|issaIent$|issais$|issait$|issant$|issante$|issantes$|issants$|isse$|issent$|isses$|issez$|issiez$|issions$|issons$|it$//
s/Y/i/
s/ç/c/
t R2
:R2
s/ance$|iqUe$|isme$|able$|iste$|eux$|ances$|iqUes$|ismes$|ables$|istes$//
s/atrice$|ateur$|ation$|atrices$|ateurs$|ations$//
s/logie$|logies$/log/
s/usion$|ution$|usions$|utions$/u/
t Res
:Res
##Residual
s/ier$|ière$|Ier$|Ière$/i/
s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/
##Undouble
s/(en)n$/\1/
s/(on)n$/\1/
s/(et)t$/\1/
s/(el)l$/\1/
s/(eil)l$/\1/
##Unaccent
s/(.*)(é)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
s/(.*)(è)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
s/(.*)e$/\1/
t
'
In some instances it succeeds at stripping the word to some stem, but there is a very conscious choice to avoid dealing with words containing only a few characters, because it implements only a few of the features (and not R2, for instance), and badly at that. But it compacts another 50-60 lines in the sample, as it includes the prior sed expression.2 For further insight I'll look into Linguistics!
1. This is all based on my "understanding" of the pseudo-code/description of the Snowball French algorithm.
2. It is wrong in many instances, but running it interactively on the line provided me with the insight I was looking for when looking at words like parlons and bonbons. I realized there is nothing intrinsic in these two words which dictates why the first one (a verb) has to be stripped of its ons while the other (a noun) only of its s. It's about parsing the parts of speech, as was explained...
answered Jul 21 '14 at 12:16 by jus cogens prime
I've just been told by an expert on this kind of thing that you'll never be able to do this properly with sed & co. You should try a stemmer instead. It was also suggested that you might get better answers on Stack Overflow or Linguistics.
– terdon♦
Jul 19 '14 at 14:32
OK, I just checked with one of the Linguistics mods and they said it's on topic there. Since they deal with this kind of thing professionally, they might be able to help out more. The question is perfectly on topic here though so it's completely up to you. If you would like it migrated, just flag for moderator attention and let us know.
– terdon♦
Jul 19 '14 at 16:00
@terdon Thank you very much for looking into it and enabling the answer! As for a migration path I just don't know. As I find the topic interesting I should maybe expand my account to Linguistics and write a better focused question which links to this here...
– jus cogens prime
Jul 19 '14 at 22:18
As you wish. The linguists are willing to take this so it's really up to you, whatever you prefer.
– terdon♦
Jul 20 '14 at 12:38