Finding incorrect YAML headers


I am trying to identify which files in my project have incorrect headers. The files all start like this:



---
header:
.
.
.
title:
some header:
.
.
.
more headers:
level:
.
.
.
---


Here . . . only stands for more headers. The headers contain no indentation. Using the following expression, I have been able to extract the YAML header from every file.



grep -Przo --include=*.md "^---(.|\n)*?---" .


Now I want to list the incorrect YAML headers.



  • Every YAML header must have a title: some text

  • Every YAML header must have language: [a-z]{2}

  • It must contain either an external: .* or an author: .*.

  • The placement of title:, level:, external: and language: varies.

I tried to do something like



grep -rL --include=*.md -e "external: .*" -e "author: .*" .


However, the problem with this is that it searches the entire file, not just the YAML header. So I guess solving the issues above boils down to how I can feed the YAML header result from my previous search into grep again. I tried



grep -Przo --include=*.md "^---(.|\n)*?---" . | xargs -0 grep "title:";


However, this gave me the error "No such file or directory", so I am a bit uncertain how to proceed.



Examples:



---
title: Rull-en-ball
level: 1
author: Transkribert og oversatt fra [Unity3D](http://unity3d.com)
translator: Bjørn Fjukstad
license: Oversatt fra [unity3d.com](https://unity3d.com/learn/tutorials/projects/roll-ball-tutorial)
language: nb
---


Correct YAML, has an author, language and title.



---
title: Mini Golf
level: 2
language: en
external: http://appinventor.mit.edu/explore/ai2/minigolf.html
---


Correct YAML, has a title, language, and external instead of author.



---
title: 'Stjerner og galakser'
level: 2
logo: ../../assets/img/ccuk_logo.png
license: '[Code Club World Limited Terms of Service](https://github.com/CodeClub/scratch-curriculum/blob/master/LICENSE.md)'
translator: 'Ole Andreas Ramsdal'
language: nb
---


Incorrect YAML header, missing author.








  • Could you replace the . . . with actual data, including "correct" headers as well as "incorrect" headers, so that we know when a solution is working as intended?
    – Jeff Schaller
    Jul 20 at 13:35










  • Also, the yaml I've seen (for Ansible) has indentation; does yours?
    – Jeff Schaller
    Jul 20 at 13:36










  • @JeffSchaller, no indentation. I will update my question accordingly.
    – Øistein Søvik
    Jul 20 at 13:40
















edited Jul 20 at 13:47
























asked Jul 19 at 18:03









Øistein Søvik

1 Answer



accepted










Here's one way to do it. I assume you have bash (to loop recursively through the files), sed, and awk. Instead of using bash, you could alternatively use find with -exec to search for the files.



The general flow is:



  1. ask bash for the list of *.md files, recursively

  2. pass each file to sed to extract the YAML header

  3. pass that YAML header to awk for validation

  4. if the header fails validation, print the filename

The script:



#!/bin/bash
shopt -s globstar

for file in **/*.md
do
  # use sed to extract the header
  sed -n '/^---$/,/^---$/p' "$file" |
  awk '
    BEGIN {
      good_title=0
      good_lang=0
      good_extaut=0
    }
    /^title: .*/ { good_title=1 }
    /^language: [a-z][a-z]$/ { good_lang=1 }
    /^author: .*/ { good_extaut=1 }
    /^external: .*/ { good_extaut=1 }
    END {
      if (good_title && good_lang && good_extaut) {
        exit 0
      } else {
        exit 1
      }
    }
  ' || printf "Incorrect header found in %s\n" "$file"
done


You can easily adjust the regex matching patterns in the awk script to be stricter or looser, depending on your exact requirements (perhaps you want alphanumeric characters instead of "any", as the current . in your example has).
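As one illustration of tightening the patterns, here is a minimal sketch. The function name strict_header_ok and the specific rules are my own assumptions, not requirements from the question: it reads an already-extracted header on standard input, insists that title: and author: carry at least one non-space character, and that external: looks like an http(s) URL.

```shell
# Hypothetical stricter validator: reads a YAML header on stdin,
# exits 0 if it passes and 1 otherwise.
strict_header_ok() {
  awk '
    /^title: *[^ ]/          { title = 1 }   # title must not be empty
    /^language: [a-z][a-z]$/ { lang = 1 }    # exactly two lowercase letters
    /^author: *[^ ]/         { extaut = 1 }  # author must not be empty
    /^external: https?:\/\// { extaut = 1 }  # external must look like a URL
    END { exit !(title && lang && extaut) }
  '
}
```

It would slot into the loop in place of the awk call, for example: sed -n '/^---$/,/^---$/p' "$file" | strict_header_ok || printf "Incorrect header found in %s\n" "$file"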



The sed statement extracts the YAML header by:



  • suppressing default-printing (-n)

  • asking for a range of addresses delimited by two matches of the same pattern: beginning of line, ---, end of line; the second match must occur after the first.

  • that range of addresses is then printed
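To see the range address on its own, here is a small sketch that writes a throwaway sample file (its contents are invented for the demo) and prints only the header. The helper name extract_header is hypothetical.

```shell
# Hypothetical helper: print the ----delimited header of one file.
extract_header() {
  sed -n '/^---$/,/^---$/p' "$1"
}

# Throwaway sample file; the contents are made up for illustration.
sample=$(mktemp)
printf '%s\n' '---' 'title: Mini Golf' 'language: en' '---' 'Body text' > "$sample"

extract_header "$sample"
# prints:
# ---
# title: Mini Golf
# language: en
# ---

rm -f "$sample"
```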

The awk script is a little over-built, but I wanted to spell it out for clarity. Each time awk is called, it sets three flag variables to zero or false. If we see lines that match our criteria, we set the corresponding flag to one/true. Once all the lines have been seen, we return success or failure based on the status of those flags -- they must all be true in order to "pass" validation.
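For comparison, the same check can be squeezed into a single awk program that reads the file directly. This is only a sketch, under the assumption that the header always starts on the first line of the file; check_header is a hypothetical name. Because it stops at the closing ---, it also sidesteps one quirk of the sed range: a sed range re-opens on any later --- line (a Markdown horizontal rule, say), so text from the body could otherwise satisfy a check.

```shell
# Hypothetical compact validator: exits 0 if the front matter of file "$1"
# passes, 1 otherwise. Assumes the header begins on line 1.
check_header() {
  awk '
    NR == 1 && $0 != "---"   { exit 1 }            # no opening delimiter
    NR > 1  && $0 == "---"   { closed = 1; exit }  # closing delimiter: stop reading
    /^title: /               { title = 1 }
    /^language: [a-z][a-z]$/ { lang = 1 }
    /^(author|external): /   { extaut = 1 }
    END { exit !(closed && title && lang && extaut) }
  ' "$1"
}
```

Inside the for loop it would be used as: check_header "$file" || printf "Incorrect header found in %s\n" "$file"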



With these appropriately-named sample files scattered into the current directory and a subdirectory:



$ tree .
.
├── bad1.md
├── good1.md
├── good2.md
└── subdir
    ├── bad1.md
    └── good1.md

1 directory, 5 files


... the script outputs:



Incorrect header found in bad1.md
Incorrect header found in subdir/bad1.md
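If globstar is not available, the find with -exec route mentioned at the start could look roughly like the sketch below. The wrapper name list_bad_headers is hypothetical, and the awk part is condensed compared with the script above. The inner programs are double-quoted because the whole sh -c body is already single-quoted, which is why the $ anchors are backslash-escaped.

```shell
# Hypothetical wrapper: report every *.md file under a directory
# (default: the current one) whose header fails validation.
list_bad_headers() {
  find "${1:-.}" -type f -name '*.md' -exec sh -c '
    for file do
      sed -n "/^---\$/,/^---\$/p" "$file" |
      awk "
        /^title: /                { title = 1 }
        /^language: [a-z][a-z]\$/ { lang = 1 }
        /^(author|external): /    { extaut = 1 }
        END { exit !(title && lang && extaut) }
      " || printf "Incorrect header found in %s\n" "$file"
    done
  ' sh {} +
}
```

Calling list_bad_headers . from the project root should list the same files as the loop, except that find prefixes each path with ./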





        answered Jul 20 at 21:28









        Jeff Schaller
