Frequency of words in non-English language text: how can I merge singular and plural forms etc.?

I'm sorting French-language words in some text files by frequency, with a focus on insight rather than statistical significance. The challenge is preserving accented characters and dealing with the article forms used in front of vowels (l', d') when shaping word tokens for sorting.



The topic of the most frequent words in a file takes many shapes (1 | 2 | 3 | 4). So I put together this function using GNU utilities:



compt1 () grep -hEo "[[:alnum:]_'-]+" 


...which trades spaces for newlines; trims a character followed by punctuation at the beginning of a line; converts everything to lowercase; uses the compact grep construct above, which matches word-constituent characters, to shape tokens; removes the stop words; and finally does the usual frequency sorting. The stop file contains a segment of individual characters, so you have to be careful with how it is matched, but the analysis it provides on how to create stems for words in different languages is really interesting!
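A self-contained sketch of a pipeline along those lines (hedged: the stopwords file name, the exact sed expressions, and the whole-line -x stop-word match are my assumptions, and \L in sed is GNU-only):

```shell
# Sketch of a word-frequency pipeline as described above.
# `stopwords` is a placeholder name: a file with one stop word per line.
compt2 () {
  tr ' ' '\n' < "$1" |              # one token candidate per line
  sed 's/.*/\L&/' |                 # lowercase (GNU sed only)
  grep -hEo "[[:alnum:]_'-]+" |     # shape word tokens, keeping l'/d' forms
  grep -vxFf stopwords |            # drop stop words (whole-line, fixed strings)
  sort | uniq -c | sort -rn         # the usual frequency sort
}
```

Note that with this character class the apostrophe is part of a token, so l'heure is counted as a single word rather than as heure.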



Now, when I compare the frequency of a significant word with the output of grep -c run directly on the files, the counts are close enough, within some margin of error.
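One reason the numbers only roughly agree, illustrated on a made-up file: grep -c counts matching lines, not occurrences, so a word appearing twice on one line contributes 1 to grep -c but 2 to the token pipeline.

```shell
# grep -c counts matching lines; grep -o piped to wc -l counts occurrences.
# sample.txt is a throwaway illustration file.
printf "il est l'heure et l'heure passe\nquelle heure est-il\n" > sample.txt
grep -c 'heure' sample.txt          # 2 matching lines
grep -o 'heure' sample.txt | wc -l  # 3 occurrences
```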




Questions:



  • How could I modify this to merge the frequency of plurals with their
    singular forms, i.e. words sharing a common prefix with a varying
    one-character suffix?

  • Would the grep part in particular work with the tools shipped on OS X?


1. I cannot provide the source data, but I can provide this file as an example. The words heure and enfant in that text are illustrative: the former appears twice, including once as "l'heure", and helps validate whether the command works; the latter appears in both singular and plural forms (enfant/enfants) and would benefit from being merged here.
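For the first bullet under Questions, a rough sketch of a post-processing step, not a real stemmer: it folds the count of a word ending in s into the count of its singular, but only when the singular itself occurs in the output (the word list here is made up):

```shell
# Input: "count word" lines as produced by `sort | uniq -c`.
# Merge trailing-s forms into their singular when the singular occurs too.
printf '3 enfant\n2 enfants\n2 heure\n1 bonbons\n' |
awk '{ n[$2] = $1 }
END {
  for (w in n) if (w ~ /s$/) plural[++m] = w   # collect first, then merge
  for (i = 1; i <= m; i++) {
    w = plural[i]; s = substr(w, 1, length(w) - 1)
    if (s in n) { n[s] += n[w]; delete n[w] }
  }
  for (w in n) print n[w], w
}' | sort -rn
```

Here bonbons is left alone because bonbon never occurs on its own, which sidesteps some false merges but of course misses plural-only words.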





























  • I've just been told by an expert on this kind of thing that you'll never be able to do this properly with sed & co. You should try a stemmer instead. It was also suggested that you might get better answers on Stack Overflow or Linguistics.

    – terdon
    Jul 19 '14 at 14:32












  • OK, I just checked with one of the Linguistics mods and they said it's on topic there. Since they deal with this kind of thing professionally, they might be able to help out more. The question is perfectly on topic here though so it's completely up to you. If you would like it migrated, just flag for moderator attention and let us know.

    – terdon
    Jul 19 '14 at 16:00











  • @terdon Thank you very much for looking into it and enabling the answer! As for a migration path I just don't know. As I find the topic interesting I should maybe expand my account to Linguistics and write a better focused question which links to this here...

    – jus cogens prime
    Jul 19 '14 at 22:18











  • As you wish. The linguists are willing to take this so it's really up to you, whatever you prefer.

    – terdon
    Jul 20 '14 at 12:38















shell-script text-processing sed portability natural-language






asked Jul 19 '14 at 13:59 by jus cogens prime; edited Apr 13 '17 at 12:36 by Community

3 Answers
You really are not going to be able to do this with a simplistic sed script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.



That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.



That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of the language, both to handle the inflections and to distinguish homographs.



I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:




# 1st strip leading articles
s/^L'//;          # Catalan
s{...}{}x;        # (pattern lost in formatting)
# 2nd strip interior particles
s/\b[dl]'//g;     # Catalan
s{...}{}gx;       # (pattern lost in formatting)


That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g first.
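A minimal sketch of that normalization for French, combining the curly-quote fix with the l'/d' stripping described in the question (the input words are made up):

```shell
# Turn U+2019 (the curly apostrophe) into a straight one, then strip
# the elided articles l' and d' at the start of each token.
printf "l’autre\nd'abord\nheure\n" |
sed "s/’/'/g; s/^[dl]'//"
```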



Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.



Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.



But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.



This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you've ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they would probably be better answered on Linguistics.SE; I don't know.






  • Thank you for taking the time to expose the underlying considerations. I had never considered that language-specific morphology expertise would be required, but I did get that right away as I read Mr Porter's take on French: in particular the region with the vowel after the first non-vowel, and the second region with the same construct. I thought that if I had a stem file with all the French stems, I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty!

    – jus cogens prime
    Jul 19 '14 at 22:31


















Natural language processing is complex. Doing it with regular expressions is like parsing HTML with regular expressions, only worse. Read tchrist's excellent answer for some insight as to how to approach your problem. I'm going to briefly answer the part about the portability of your use of unix text processing tools.



The common denominator to all modern unix-like systems is the POSIX specification. The most useful resource is the Open Group Specification Issue 6 a.k.a. Single Unix Specification version 3 (OGS Issue 7 = SUS version 4 is not fully implemented on many systems), which includes and extends POSIX and, usefully, is available online and for download (e.g. in Debian). If you're only interested in portability to non-embedded Linux (and Cygwin) and to OSX, check the GNU manuals and the OSX man pages.



You are using several non-POSIX options to grep, but all of them are available in both GNU and OSX (OSX uses the grep from FreeBSD which seeks to emulate most GNU constructs). If you want POSIX, you'll need to avoid a few options:




  • grep -h to suppress the file name: call grep on one file at a time, or pass the files to cat first.


  • grep -o to output only the matched part: use sed or awk instead.


  • grep -w to match only whole words: search for a pattern like (^|[^[:alnum:]])needle($|[^[:alnum:]]).

You are using one GNU-only construct in sed: the \L directive to lowercase a replacement in the s command. There is nothing like it in other sed implementations. In general, you can use awk instead: break the input down to isolate the string to replace and call tolower. To lowercase the whole input, call tr '[:upper:]' '[:lower:]'.
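Minimal sketches of those portable replacements (ASCII only here; whether tr and awk's tolower handle accented capitals depends on the implementation and locale, so treat that part as an assumption to verify):

```shell
# Whole-stream lowercasing without GNU sed's \L:
printf 'Un ENFANT\n' | tr '[:upper:]' '[:lower:]'

# POSIX-friendly stand-in for `grep -w enfant`:
printf 'enfant\nenfants\n' |
grep -E '(^|[^[:alnum:]])enfant($|[^[:alnum:]])'
```

The second command matches the line enfant but not enfants, because the s after the word is alphanumeric.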






  • Thank you! I had forgotten about that sed bit. I used it because non-heirloom tr can't do capitalized accented chars but I'll use awk for that like you suggested. Plus I'll make a habit of validating target versions of the utilities so as to save time during design instead of adapting afterwards!

    – jus cogens prime
    Jul 20 '14 at 5:01


















The selected answer really provides a great introduction to the challenges in the fields of Natural Language Processing and Computational Linguistics, and there is surely further information on the dedicated SE sites. I wanted to provide a complement which underscores these challenges and gives me a temporary "fix".




I think I can, in some cases [1], trim the last s with sed to achieve a fairly safe yet interesting result:



s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/


This compacts some 50 lines in the provided sample when used with the original function.
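For instance, the rule applied to a few words (sed -E for extended regular expressions; bras survives because a vowel, not a consonant, precedes its final s, and s itself is deliberately absent from the class):

```shell
# Strip a final s only when it follows a consonant other than s,
# so plural-looking words such as "bras" are left untouched.
printf 'enfants\nbonbons\nbras\n' |
sed -E 's/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/'
```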



So I tried sed with the following, which is both incomplete and not working as intended, but it showcases the difficulties and, in my opinion, helps in understanding what the answer explained:



sed -E '
h;
s/^(par|col|tap.*)/\1/
t RVv

h;
s/^(par|col|tap.*)/\1/
t RVc

h;
s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*)$/\1/
t RVnotpctv_v

h;
s/^(.*.[aeiouyâàëéêèïîôûù]....*)/\1/
t RVnotpctother
b

:RVv
s/^(par|col|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*)/\1/
t R1

:RVc
s/^(par|col|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
t R1

:RVnotpctv_v
s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)$/\1/
t R1

:RVnotpctother
s/^(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
t R1

:R1
s/ement$|ements$|ité$|ités$|if$|ive$|ifs$|ives$|euse$|euses$//
s/é$|ée$|ées$|és$|èrent$|er$|era$|erai$|eraIent$|erais$|erait$|eras$|erez$|eriez$|erions$|erons$|eront$|ez$|iez$|ions$|eons$//
s/eâmes$|eât$|eâtes$|ea$|eai$|eaIent$|eais$|eait$|eant$|eante$|eantes$|eants$|eas$|easse$|eassent$|easses$|eassiez$|eassions$//
s/âmes$|ât$|âtes$|a$|ai$|aIent$|ais$|ait$|ant$|ante$|antes$|ants$|as$|asse$|assent$|asses$|assiez$|assions$//
s/[bcdfghjklmnpqrstvwxz]îmes$|ît$|îtes$|i$|ie$|ies$|ir$|ira$|irai$|iraIent$|irais$|irait$|iras$|irent$|irez$|iriez$|irions$|irons$|iront$|is$|issaIent$|issais$|issait$|issant$|issante$|issantes$|issants$|isse$|issent$|isses$|issez$|issiez$|issions$|issons$|it$//
s/Y/i/
s/ç/c/
t R2

:R2
s/ance$|iqUe$|isme$|able$|iste$|eux$|ances$|iqUes$|ismes$|ables$|istes$//
s/atrice$|ateur$|ation$|atrices$|ateurs$|ations$//
s/logie$|logies$/log/
s/usion$|ution$|usions$|utions$/u/
t Res

:Res
##Residual
s/ier$|ière$|Ier$|Ière$/i/
s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/
##Undouble
s/(en)n$/\1/
s/(on)n$/\1/
s/(et)t$/\1/
s/(el)l$/\1/
s/(eil)l$/\1/
##Unaccent
s/(.*)(é)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
s/(.*)(è)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
s/(.*)e$/\1/
t
'


In some instances it succeeds at stripping the word down to some stem, but there is a very conscious choice to avoid dealing with words containing only a few characters, because it implements only a few of the features (and not R2, for instance), and badly at that. But it compacts another 50-60 lines in the sample, as it includes the prior sed expression [2]. For further insight I'll look into Linguistics!




1. This is all based on my "understanding" of the pseudo-code/description of the Snowball French stemming algorithm.



2. It is wrong in many instances, but running it interactively provided me with the insight I was looking for when looking at words like parlons and bonbons. I realized there is nothing intrinsic in these two words which dictates why the first one (a verb) has to be stripped of its ons while the other (a noun) loses only its s. It's about parsing the parts of speech, as was explained...






share|improve this answer
























    Your Answer








    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "106"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f145462%2ffrequency-of-words-in-non-english-language-text-how-can-i-merge-singular-and-pl%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    10





    +100









    You really are not going to be able to do this with a simplistic sed script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.



    That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.



    That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of language, including for the inflections and to distinguish homographs.



    I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:




    # 1st strip leading articles
    s/^L'//; # Catalan
    s O
    x;
    # 2nd strip interior particles
    s/b[dl]'//g; # Catalan
    s i gx;


    That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g first.



    Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.



    Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.



    But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.



    This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they might be probably better answered on Linguistics.SE; I don’t know.






    share|improve this answer


















    • 2





      Thank you for taking the time to expose the underlying considerations. I had never considered that language-speficic morphology expertise would be required - but I did get that right away, as I read Mr Porter's take on French. In particular the region with the vowel after the first non vowel; and the second region with the same construct. I thought if I had a stem file for all the French stems, then I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty!

      – jus cogens prime
      Jul 19 '14 at 22:31















    10





    +100









    You really are not going to be able to do this with a simplistic sed script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.



    That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.



    That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of language, including for the inflections and to distinguish homographs.



    I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:




    # 1st strip leading articles
    s/^L'//; # Catalan
    s O
    x;
    # 2nd strip interior particles
    s/b[dl]'//g; # Catalan
    You really are not going to be able to do this with a simplistic sed script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.



    That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.



    That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of the language, both to handle the inflections and to distinguish homographs.



    I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:




    # 1st strip leading articles
    s/^L'//;        # Catalan
    s{ … }{}x;
    # 2nd strip interior particles
    s/\b[dl]'//g;   # Catalan
    s{ … }{}gix;


    That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g first.
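    As a quick illustration of that normalization step, the same thing can be done in a pipeline before tokenizing (a minimal sketch; assumes UTF-8 input):

```shell
# Replace U+2019 RIGHT SINGLE QUOTATION MARK with a plain ASCII
# apostrophe, so that l’autre and l'autre tokenize identically.
printf "l’autre\n" | sed "s/’/'/g"
```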



    Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.
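    For the encoding point, iconv can re-encode input before any of the word-counting tools see it (a sketch; `\351` below is 'é' in ISO-8859-1, and the source encoding name is whatever your files actually use):

```shell
# Re-encode ISO-8859-1 (Latin-1) bytes as UTF-8; without this step,
# accented characters from a Latin-1 file would be mangled by a
# UTF-8-aware tokenizing pipeline.
printf '\351t\351\n' | iconv -f ISO-8859-1 -t UTF-8
```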



    Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.



    But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.



    This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they would probably be better answered on Linguistics.SE; I don’t know.






    answered Jul 19 '14 at 14:53 by tchrist

    • 2

      Thank you for taking the time to expose the underlying considerations. I had never considered that language-specific morphology expertise would be required - but I did get that right away, as I read Mr Porter's take on French: in particular the region with the vowel after the first non-vowel, and the second region with the same construct. I thought that if I had a stem file for all the French stems, then I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty!

      – jus cogens prime
      Jul 19 '14 at 22:31












    Natural language processing is complex. Doing it with regular expressions is like parsing HTML with regular expressions, only worse. Read tchrist's excellent answer for some insight as to how to approach your problem. I'm going to briefly answer the part about the portability of your use of unix text processing tools.



    The common denominator to all modern unix-like systems is the POSIX specification. The most useful resource is the Open Group Specification Issue 6 a.k.a. Single Unix Specification version 3 (OGS Issue 7 = SUS version 4 is not fully implemented on many systems), which includes and extends POSIX and, usefully, is available online and for download (e.g. in Debian). If you're only interested in portability to non-embedded Linux (and Cygwin) and to OSX, check the GNU manuals and the OSX man pages.



    You are using several non-POSIX options to grep, but all of them are available in both GNU and OSX (OSX uses the grep from FreeBSD which seeks to emulate most GNU constructs). If you want POSIX, you'll need to avoid a few options:




    • grep -h to suppress the file name: call grep on one file at a time, or pass the files to cat first.


    • grep -o to output only the matched part: use sed or awk instead.


    • grep -w to match only whole words: search for a pattern like (^|[^[:alnum:]])needle($|[^[:alnum:]]).
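    To illustrate the last point, the bracket-expression pattern matches enfant as a whole word but not enfants (a sketch using the example words from the question):

```shell
# POSIX-portable whole-word match: the needle must be bounded by a
# line edge or a non-alphanumeric character on each side.
printf 'enfant\nenfants\n' | grep -E '(^|[^[:alnum:]])enfant($|[^[:alnum:]])'
```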

    You are using one GNU-only construct in sed: the \L directive to lowercase a replacement in the s command. There's nothing like that in other sed implementations. In general, you can use awk instead: break down the input to isolate the string to replace and call tolower. To lowercase the whole input, call tr '[:upper:]' '[:lower:]'.
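    A portable stand-in for sed's \L, following that advice (a sketch; whether accented capitals are folded depends on the locale in effect):

```shell
# Lowercase the whole input with awk's tolower (POSIX awk).
printf 'Heure ENFANTS\n' | awk '{ print tolower($0) }'
```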






    • Thank you! I had forgotten about that sed bit. I used it because non-heirloom tr can't do capitalized accented chars but I'll use awk for that like you suggested. Plus I'll make a habit of validating target versions of the utilities so as to save time during design instead of adapting afterwards!

      – jus cogens prime
      Jul 20 '14 at 5:01















    edited May 23 '17 at 12:40 by Community

    answered Jul 20 '14 at 2:40 by Gilles


























    The selected answer really provides a great introduction to the challenges in the field of Natural Language Processing and Computational Linguistics, and there is surely further information on the dedicated SE sites. I wanted to provide a complement which underscores these challenges and provides me with a temporary "fix".




    I think I can in some cases1 trim the last s with sed to achieve a pretty safe yet interesting result:



    s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/


    This compacts some 50 lines in the provided sample when used with the original function.
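    For instance, applied to the two example words from the question (a sketch; sed -E syntax assumed, as in GNU or BSD sed):

```shell
# Trim a final s that follows a consonant (or accented vowel, per the
# bracket expression), merging enfants into enfant but leaving heure
# alone since it does not end in s.
printf 'enfants\nheure\n' | sed -E 's/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/'
```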



    So I tried sed with the following, which is both incomplete and not working as intended - but it showcases the difficulties and, in my opinion, helps in understanding what the answer explained:



    sed '

    h;
    s/^(par|col|tap.*)/\1/
    t RVv

    h;
    s/^(par|col|tap.*)/\1/
    t RVc

    h;
    s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*)$/\1/
    t RVnotpctv_v

    h;
    s/^(.*.[aeiouyâàëéêèïîôûù]....*)/\1/
    t RVnotpctother
    b

    :RVv
    s/^(par|col|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*)/\1/
    t R1

    :RVc
    s/^(par|col|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
    t R1

    :RVnotpctv_v
    s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)$/\1/
    t R1

    :RVnotpctother
    s/^(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/\1/
    t R1

    :R1
    s/ement$|ements$|ité$|ités$|if$|ive$|ifs$|ives$|euse$|euses$//
    s/é$|ée$|ées$|és$|èrent$|er$|era$|erai$|eraIent$|erais$|erait$|eras$|erez$|eriez$|erions$|erons$|eront$|ez$|iez$|ions$|eons$//
    s/eâmes$|eât$|eâtes$|ea$|eai$|eaIent$|eais$|eait$|eant$|eante$|eantes$|eants$|eas$|easse$|eassent$|easses$|eassiez$|eassions$//
    s/âmes$|ât$|âtes$|a$|ai$|aIent$|ais$|ait$|ant$|ante$|antes$|ants$|as$|asse$|assent$|asses$|assiez$|assions$//
    s/[bcdfghjklmnpqrstvwxz]îmes$|ît$|îtes$|i$|ie$|ies$|ir$|ira$|irai$|iraIent$|irais$|irait$|iras$|irent$|irez$|iriez$|irions$|irons$|iront$|is$|issaIent$|issais$|issait$|issant$|issante$|issantes$|issants$|isse$|issent$|isses$|issez$|issiez$|issions$|issons$|it$//
    s/Y/i/
    s/ç/c/
    t R2

    :R2
    s/ance$|iqUe$|isme$|able$|iste$|eux$|ances$|iqUes$|ismes$|ables$|istes$//
    s/atrice$|ateur$|ation$|atrices$|ateurs$|ations$//
    s/logie$|logies$/log/
    s/usion$|ution$|usions$|utions$/u/
    t Res

    :Res
    ##Residual
    s/ier$|ière$|Ier$|Ière$/i/
    s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/
    ##Undouble
    s/(en)n$/\1/
    s/(on)n$/\1/
    s/(et)t$/\1/
    s/(el)l$/\1/
    s/(eil)l$/\1/
    ##Unaccent
    s/(.*)(é)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
    s/(.*)(è)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
    s/(.*)e$/\1/
    t
    '


    In some instances it succeeds at stripping the word to some stem, but there is a very conscious choice to avoid dealing with words containing only a few characters, because it only implements a few of the features (and not R2, for instance), and badly at that. But it compacts another 50-60 lines in the sample, as it includes the prior sed expression.2 For further insight I'll look into Linguistics!




    1. This is all based on my "understanding" of the pseudo-code/description of the Snowball French algorithm.



    2. It is wrong in many instances, but running it interactively on the line provided me with the insight I was looking for when looking at words like parlons and bonbons. I realized there is nothing intrinsic in these two words which dictates why the first one (a verb) has to be stripped of its ons while the other (a noun) only of its s. It's about parsing the parts of speech, as was explained...
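    Footnote 2 can be demonstrated directly: a purely suffix-based rule treats the verb and the noun identically (a sketch):

```shell
# Stripping -ons is right for the verb parlons (stem parl-) but
# mangles the noun bonbons, which should only lose its plural s;
# no suffix rule alone can tell the two apart without
# part-of-speech information.
printf 'parlons\nbonbons\n' | sed 's/ons$//'
```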






    share|improve this answer





























      1














      The selected answer really provides a great introduction to the challenges in the field of Natural Language Processing and Computational Linguistics and there is surely further information on dedicated SE assets. I wanted to provide a complement which underscores these challenges and provides me with a temporary "fix".




      I think I can in some cases1 trim the last s with sed to achieve a pretty safe yet interesting result:



      s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/1/


      This compacts some 50 lines in the provided sample when used with the original function.



      So I tried sed with the following, which is both incomplete and not working as intended - but showcases difficulties and is helpful in my opinion in understanding what the answer explained:



      sed '

      h;
      s/^(par|col|tap.*)/1/
      t RVv

      h;
      s/^(par|col|tap.*)/1/
      t RVc

      h;
      s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*)$/1/
      t RVnotpctv_v

      h;
      s/^(.*.[aeiouyâàëéêèïîôûù]....*)/1/
      t RVnotpctother
      b

      :RVv
      s/^(par|col|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*)/1/
      t R1

      :RVc
      s/^(par|col|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/1/
      t R1

      :RVnotpctv_v
      s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)$/1/
      t R1

      :RVnotpctother
      s/^(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/1/
      t R1

      :R1
      s/ement$|ements$|ité$|ités$|if$|ive$|ifs$|ives$|euse$|euses$//
      s/é$|ée$|ées$|és$|èrent$|er$|era$|erai$|eraIent$|erais$|erait$|eras$|erez$|eriez$|erions$|erons$|eront$|ez$|iez$|ions$|eons$//
      s/eâmes$|eât$|eâtes$|ea$|eai$|eaIent$|eais$|eait$|eant$|eante$|eantes$|eants$|eas$|easse$|eassent$|easses$|eassiez$|eassions$//
      s/âmes$|ât$|âtes$|a$|ai$|aIent$|ais$|ait$|ant$|ante$|antes$|ants$|as$|asse$|assent$|asses$|assiez$|assions$//
      s/[bcdfghjklmnpqrstvwxz]îmes$|ît$|îtes$|i$|ie$|ies$|ir$|ira$|irai$|iraIent$|irais$|irait$|iras$|irent$|irez$|iriez$|irions$|irons$|iront$|is$|issaIent$|issais$|issait$|issant$|issante$|issantes$|issants$|isse$|issent$|isses$|issez$|issiez$|issions$|issons$|it$//
      s/Y/i/
      s/ç/c/
      t R2

      :R2
      s/ance$|iqUe$|isme$|able$|iste$|eux$|ances$|iqUes$|ismes$|ables$|istes$//
      s/atrice$|ateur$|ation$|atrices$|ateurs$|ations$//
      s/logie$|logies$/log/
      s/usion$|ution$|usions$|utions$/u/
      t Res

      :Res
      ##Residual
      s/ier$|ière$|Ier$|Ière$/i/
      s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/1/
      ##Undouble
      s/(en)n$/1/
      s/(on)n$/1/
      s/(et)t$/1/
      s/(el)l$/1/
      s/(eil)l$/1/
      ##Unaccent
      s/(.*)(é)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/1e3/
      s/(.*)(è)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/1e3/
      s/(.*)e$/1/
      t
      '


      In some instances it succeeds at stripping the word to some stem but there is a very conscious choice to avoid dealing with words containing only a few characters because it only implements some little features(and not R2 for instance), and badly at that. But it compacts another 50-60 lines in the sample, as it includes the prior sed expression.2 For further insight I'll look into Linguistics!




      1. This is all based on my "understanding" of the pseudo-code/description of the snowball french algorithm.



      2. It is wrong in many instances but running it interactively on the line provided me with the insight I was looking for when looking at words like parlons et bonbons. I realized there is nothing intrinsic in these two words which dictactes why the first one(verb) has to be stipped of its ons while the other(a noun) only of its s. It's about parsing the parts of speech as was explained...






      share|improve this answer



























        1












        1








        1







        The selected answer really provides a great introduction to the challenges in the field of Natural Language Processing and Computational Linguistics and there is surely further information on dedicated SE assets. I wanted to provide a complement which underscores these challenges and provides me with a temporary "fix".




        I think I can in some cases1 trim the last s with sed to achieve a pretty safe yet interesting result:



        s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/1/


        This compacts some 50 lines in the provided sample when used with the original function.



        So I tried sed with the following, which is both incomplete and not working as intended - but showcases difficulties and is helpful in my opinion in understanding what the answer explained:



        sed '

        h;
        s/^(par|col|tap.*)/1/
        t RVv

        h;
        s/^(par|col|tap.*)/1/
        t RVc

        h;
        s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*)$/1/
        t RVnotpctv_v

        h;
        s/^(.*.[aeiouyâàëéêèïîôûù]....*)/1/
        t RVnotpctother
        b

        :RVv
        s/^(par|col|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*)/1/
        t R1

        :RVc
        s/^(par|col|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/1/
        t R1

        :RVnotpctv_v
        s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)$/1/
        t R1

        :RVnotpctother
        s/^(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/1/
        t R1

        :R1
        s/ement$|ements$|ité$|ités$|if$|ive$|ifs$|ives$|euse$|euses$//
        s/é$|ée$|ées$|és$|èrent$|er$|era$|erai$|eraIent$|erais$|erait$|eras$|erez$|eriez$|erions$|erons$|eront$|ez$|iez$|ions$|eons$//
        s/eâmes$|eât$|eâtes$|ea$|eai$|eaIent$|eais$|eait$|eant$|eante$|eantes$|eants$|eas$|easse$|eassent$|easses$|eassiez$|eassions$//
        s/âmes$|ât$|âtes$|a$|ai$|aIent$|ais$|ait$|ant$|ante$|antes$|ants$|as$|asse$|assent$|asses$|assiez$|assions$//
        s/[bcdfghjklmnpqrstvwxz]îmes$|ît$|îtes$|i$|ie$|ies$|ir$|ira$|irai$|iraIent$|irais$|irait$|iras$|irent$|irez$|iriez$|irions$|irons$|iront$|is$|issaIent$|issais$|issait$|issant$|issante$|issantes$|issants$|isse$|issent$|isses$|issez$|issiez$|issions$|issons$|it$//
        s/Y/i/
        s/ç/c/
        t R2

        :R2
        s/ance$|iqUe$|isme$|able$|iste$|eux$|ances$|iqUes$|ismes$|ables$|istes$//
        s/atrice$|ateur$|ation$|atrices$|ateurs$|ations$//
        s/logie$|logies$/log/
        s/usion$|ution$|usions$|utions$/u/
        t Res

        :Res
        ##Residual
        s/ier$|ière$|Ier$|Ière$/i/
        s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/1/
        ##Undouble
        s/(en)n$/1/
        s/(on)n$/1/
        s/(et)t$/1/
        s/(el)l$/1/
        s/(eil)l$/1/
        ##Unaccent
        s/(.*)(é)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/1e3/
        s/(.*)(è)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/1e3/
        s/(.*)e$/1/
        t
        '


        In some instances it succeeds at stripping the word to some stem but there is a very conscious choice to avoid dealing with words containing only a few characters because it only implements some little features(and not R2 for instance), and badly at that. But it compacts another 50-60 lines in the sample, as it includes the prior sed expression.2 For further insight I'll look into Linguistics!




        1. This is all based on my "understanding" of the pseudo-code/description of the snowball french algorithm.



        2. It is wrong in many instances but running it interactively on the line provided me with the insight I was looking for when looking at words like parlons et bonbons. I realized there is nothing intrinsic in these two words which dictactes why the first one(verb) has to be stipped of its ons while the other(a noun) only of its s. It's about parsing the parts of speech as was explained...






        share|improve this answer















        The selected answer really provides a great introduction to the challenges in the field of Natural Language Processing and Computational Linguistics and there is surely further information on dedicated SE assets. I wanted to provide a complement which underscores these challenges and provides me with a temporary "fix".




        I think I can in some cases1 trim the last s with sed to achieve a pretty safe yet interesting result:



        s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/1/


        This compacts some 50 lines in the provided sample when used with the original function.



        So I tried sed with the following, which is both incomplete and not working as intended - but showcases difficulties and is helpful in my opinion in understanding what the answer explained:



        sed '

        h;
        s/^(par|col|tap.*)/1/
        t RVv

        h;
        s/^(par|col|tap.*)/1/
        t RVc

        h;
        s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*)$/1/
        t RVnotpctv_v

        h;
        s/^(.*.[aeiouyâàëéêèïîôûù]....*)/1/
        t RVnotpctother
        b

        :RVv
        s/^(par|col|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*)/1/
        t R1

        :RVc
        s/^(par|col|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/1/
        t R1

        :RVnotpctv_v
        s/^([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)$/1/
        t R1

        :RVnotpctother
        s/^(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*)/1/
        t R1

        :R1
        s/ement$|ements$|ité$|ités$|if$|ive$|ifs$|ives$|euse$|euses$//
        s/é$|ée$|ées$|és$|èrent$|er$|era$|erai$|eraIent$|erais$|erait$|eras$|erez$|eriez$|erions$|erons$|eront$|ez$|iez$|ions$|eons$//
        s/eâmes$|eât$|eâtes$|ea$|eai$|eaIent$|eais$|eait$|eant$|eante$|eantes$|eants$|eas$|easse$|eassent$|easses$|eassiez$|eassions$//
        s/âmes$|ât$|âtes$|a$|ai$|aIent$|ais$|ait$|ant$|ante$|antes$|ants$|as$|asse$|assent$|asses$|assiez$|assions$//
        s/[bcdfghjklmnpqrstvwxz]îmes$|ît$|îtes$|i$|ie$|ies$|ir$|ira$|irai$|iraIent$|irais$|irait$|iras$|irent$|irez$|iriez$|irions$|irons$|iront$|is$|issaIent$|issais$|issait$|issant$|issante$|issantes$|issants$|isse$|issent$|isses$|issez$|issiez$|issions$|issons$|it$//
        s/Y/i/
        s/ç/c/
        t R2

        :R2
        s/ance$|iqUe$|isme$|able$|iste$|eux$|ances$|iqUes$|ismes$|ables$|istes$//
        s/atrice$|ateur$|ation$|atrices$|ateurs$|ations$//
        s/logie$|logies$/log/
        s/usion$|ution$|usions$|utions$/u/
        t Res

        :Res
        ##Residual
        s/ier$|ière$|Ier$|Ière$/i/
        s/(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö])s$/\1/
        ##Undouble
        s/(en)n$/\1/
        s/(on)n$/\1/
        s/(et)t$/\1/
        s/(el)l$/\1/
        s/(eil)l$/\1/
        ##Unaccent
        s/(.*)(é)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
        s/(.*)(è)([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*)$/\1e\3/
        s/(.*)e$/\1/
        t
        '


        In some instances it succeeds at stripping a word down to a stem, but there is a very conscious choice to avoid dealing with words of only a few characters, because the script implements only a handful of the algorithm's features (not R2, for instance), and imperfectly at that. Still, it compacts another 50-60 lines in the sample, since it subsumes the prior sed expression.2 For further insight I'll look into Linguistics!
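        For the narrower goal of merging plurals with singulars, a much cruder approach than the stemmer above is to strip a trailing s from sufficiently long tokens before counting. This is a minimal sketch, not the script above: the sample text, the elision handling, and the 4-character length threshold are illustrative assumptions. It uses only sed -E, which BSD sed on OS X also understands:

```shell
#!/bin/sh
# Sketch: merge plural and singular counts by dropping a trailing "s"
# from tokens of 4+ characters before counting. Short function words
# like "les" are left alone by the length threshold.
printf "%s\n" "l'enfant joue" "les enfants jouent" |
  tr ' ' '\n' |                 # one token per line
  tr '[:upper:]' '[:lower:]' |  # normalize case
  sed -E "s/^[ld]'//" |         # drop elided articles l', d'
  sed -E 's/^(.{3,})s$/\1/' |   # enfants -> enfant; "les" untouched
  sort | uniq -c | sort -rn     # frequency, highest first
```

With this, "l'enfant" and "enfants" both count toward enfant (frequency 2), while verb endings such as jouent remain unmerged, which is exactly the gap the stemmer tries to close.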




        1. This is all based on my "understanding" of the pseudo-code/description of the Snowball French stemming algorithm.



        2. It is wrong in many instances, but running it interactively line by line gave me the insight I was looking for when examining words like parlons and bonbons. I realized there is nothing intrinsic in these two words that dictates why the first (a verb) has to be stripped of its ons while the second (a noun) loses only its s. It comes down to parsing parts of speech, as was explained...
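        That pitfall can be shown with a one-liner: a naive ons rule, blind to part of speech, mangles the noun the same way it stems the verb.

```shell
# Naive "ons" stripping treats the noun "bonbons" like a
# first-person-plural verb; suffix rules alone cannot tell them apart.
printf 'parlons\nbonbons\n' | sed -E 's/ons$//'
# parlons -> parl  (plausible verb stem)
# bonbons -> bonb  (wrong: should be bonbon)
```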







        edited Apr 13 '17 at 12:54 by Community

        answered Jul 21 '14 at 12:16 by jus cogens prime