Git - prune every whitespace-separated word originally introduced by a specific author in a project's history

We have a project under git revision control with only a single branch. We need to remove every whitespace-separated word that was first introduced into a given file by a specific author.



To clarify, at this point we have the HEAD checked out. Now, in an example file named introduction.tex, if there is a sentence "Enlargement of the user-base is beneficial ...", I'd like a bash script with suitable git commands that:



  1. Parse the current whitespace-separated word (in the example, for the first iteration, this will be Enlargement), maybe by using a regex like \b[A-Za-z]+\b for word detection.

  2. Check whether the word is at least 5 characters long. If not, keep moving to the next word until this condition is satisfied. If satisfied, move to #3 below.

  3. Check the entire history of the project to find out who originally made the commit that introduced this word.

  4. If the author of that specific commit matches johndoe, remove the word under consideration from the file.

  5. Repeat #1 -- #4 until all words from the file have been parsed and the words originally introduced by the specific author have been pruned.
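
Steps 1 and 2 can be sketched in shell as follows. This is a minimal sketch, not a complete solution: the function name candidate_words is made up for illustration, and it treats any run of 5 or more ASCII letters as a word rather than splitting strictly on whitespace. As a result user-base yields nothing (user and base are 4 letters each), and LaTeX command names such as \section would be picked up as section and need to be filtered separately.

```shell
#!/usr/bin/env bash
# candidate_words [FILE...]
# Hypothetical helper: list each distinct run of 5 or more ASCII letters
# found in the given files (or on standard input), one per line.
candidate_words() {
  # -o prints only the matched text, -h suppresses file-name prefixes,
  # LC_ALL=C keeps [A-Za-z] and the sort order byte-based and predictable.
  LC_ALL=C grep -ohE '[A-Za-z]{5,}' "$@" | LC_ALL=C sort -u
}
```

Calling candidate_words introduction.tex gives the list of words to look up in the history. For the example sentence it prints Enlargement and beneficial.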

Treatment of High-Frequency Occurrence Words:



It is important to ignore common words like a, an, the, of, for, if, then, but, else, not, any, or, nor. So I propose a minimum length of 5 characters for a word to qualify for removal.



Basically the idea is to eliminate or revert English-like contributions made by a particular author. How can this be done?



Post-Processing by latexdiff:



This question is about producing a diff report after removing the other author's contributions. After pruning the text (i.e. once I have an answer to this question), I intend to use latexdiff, a standard yet amazing Perl script that can detect these word removals (or indeed any other difference between the two LaTeX files) and produce a composite PDF, highlighting the removed words with red strikethroughs. All I need to do is identify and remove the words originally introduced by the other author (i.e. my core question here). All sentences in the composite PDF will therefore remain coherent with no loss of meaning: the removed words stay in the same location but carry red strikethrough marks.



Background and Context:



This is in an academic context. The git project is a LaTeX repo of a manuscript. I am in an authorship dispute with a co-author of a paper, which therefore was not submitted to any journal. We are both PhD students. So that each of us can claim copyright of our own words for use in our respective theses, our PhD advisor has asked us to submit our respective claims on the words each of us introduced into the manuscript, so that we can reuse them in our theses and steer clear of plagiarism accusations. We both committed to the same repo, and I am now thinking of leveraging the power of git and the shell, along with git-grep, sed, awk, perl or whatever, to help me claim, with integrity, the words I actually contributed. Your help will be much appreciated.



Starting Point:



git log -S'enlargement' --oneline -- introduction.tex correctly shows the list of commits that touch that case-sensitive word, i.e. enlargement in this case. The oldest commit in the list identifies the author who introduced it. We are simply looking for the "big, technical" words that explain a concept for the first time. I am already doing this manually with that starter command, but I need to automate it because there are around 10 such files, and I obviously don't want to repeat this by hand for every 5+ character word in every file.
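
The manual lookup above can be wrapped in a small function. This is a sketch under stated assumptions: first_author is a hypothetical name, the history is linear (a single branch, as described), and the file was never renamed (otherwise --follow would be needed). Note also that -S is case-sensitive and matches substrings, so a short word can match inside a longer one.

```shell
#!/usr/bin/env bash
# first_author WORD FILE
# Hypothetical helper: print the author name of the oldest commit in which
# the number of occurrences of WORD in FILE changed (git's pickaxe, -S).
first_author() {
  local word=$1 file=$2
  # --reverse lists the oldest matching commit first; %an is the author name.
  git log --reverse --format='%an' -S"$word" -- "$file" | head -n 1
}
```

A word then qualifies for removal when the output of first_author "$word" introduction.tex equals johndoe.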







  • I imagine since you’re trying to go to all this trouble, that reverting the author’s commits wouldn’t be appropriate instead?
    – Stephen Kitt
    Jul 18 at 11:47










  • @StephenKitt I have now updated the question to explain the context.
    – Krishna
    Jul 18 at 11:56







  • Your step 3, as is, is flawed, and I really think you need to imagine doing things otherwise. Why? Because after having extracted the words, while you can use git log -S, you lose the fact of where in the file this word was. You can have the same word multiple times... I would instead work on each commit one by one, using git blame. Tracking a word in history without context will create more problems than solutions. Also, I doubt that your whole problem could be fully automated.
    – Patrick Mevzek
    Jul 19 at 1:17
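
For comparison, the git blame route suggested in the last comment could start from something like this. It is a rough sketch: lines_by_author is a hypothetical name, and blame attributes each line to whoever last changed it, which is not necessarily the person who first introduced a given word on that line.

```shell
#!/usr/bin/env bash
# lines_by_author AUTHOR FILE
# Hypothetical helper: print the lines of FILE (at HEAD) that git blame
# attributes to AUTHOR, matched against the exact author name.
lines_by_author() {
  local author=$1 file=$2
  # --line-porcelain repeats the full commit header for every line and
  # prefixes the line content itself with a TAB character.
  git blame --line-porcelain -- "$file" |
    awk -v who="$author" '
      /^author / { current = substr($0, 8) }            # remember the name
      /^\t/      { if (current == who) print substr($0, 2) }
    '
}
```

This keeps the position of each word in the file, at the cost of working line by line rather than word by word.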
















edited Jul 18 at 19:51
asked Jul 18 at 11:42
Krishna
