Git - prune every whitespace-separated word originally introduced by specific author in project's history
We have a project under git revision control with only a single branch. We need to remove every whitespace-separated word that was first introduced into a given file by a specific author.
To clarify, at this point we have HEAD checked out. Now, in an example file named introduction.tex, if there is a sentence "Enlargement of the user-base is beneficial ...", I'd like a bash script with suitable git commands that:
1. Parses the current whitespace-separated word (in the example, for the first iteration, this will be Enlargement), maybe by using a regex like \b[A-Za-z]+\b for word detection.
2. Checks whether the word is at least 5 characters long. If not, moves on to the next word until this condition is satisfied. If satisfied, moves to #3 below.
3. Checks the entire history of the project to find out who originally made the commit that introduced this word.
4. If the author of that specific commit matches johndoe, removes the word under consideration from the file.
5. Repeats #1 -- #4 until all words in the file have been parsed and the words originally introduced by the specific author have been pruned.
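Steps 1 and 2 can be sketched with standard tools alone. This is a minimal sketch, not a complete solution: the function name candidate_words is made up for illustration, and it assumes a word is a run of ASCII letters. Under that assumption LaTeX command names (for example the section in \section) are also picked up, and a compound such as user-base is split at the hyphen. It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# candidate_words FILE
# Print each distinct word of 5 or more ASCII letters found in FILE,
# one per line. The helper name is hypothetical; the 5-letter threshold
# is the one proposed in the question.
candidate_words() {
    grep -oE '[A-Za-z]{5,}' "$1" | LC_ALL=C sort -u
}

# Usage: candidate_words introduction.tex
```

Words shorter than 5 letters never reach the later steps, which is what keeps a, an, the and similar words out of the history search.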
Treatment of High-Frequency Occurrence Words:
It is important to ignore common keywords like a, an, the, of, for, if, then, but, else, not, any, or, nor. So I propose a minimum length of 5 characters for a word to qualify for removal.
Basically, the idea is to eliminate or revert English-like contributions made by a particular author. How can this be done?
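Once a list of words attributed to the other author exists, the removal itself could look like the sketch below. The function name prune_words and the idea of keeping the word list in a separate file are assumptions made for illustration. Matching is case-sensitive and only whole runs of ASCII letters are compared, so removing Enlargement leaves Enlargements untouched.

```shell
#!/bin/sh
# prune_words WORDLIST FILE
# Print FILE with every whole word listed in WORDLIST (one word per line)
# removed. Hypothetical helper; writes to stdout, FILE is not modified.
prune_words() {
    awk '
        # First file: remember each word that has to be dropped.
        FILENAME == ARGV[1] { drop[$0] = 1; next }
        # Second file: copy the line, skipping letter runs found in drop[].
        {
            out = ""; rest = $0
            while (match(rest, /[A-Za-z]+/)) {
                word = substr(rest, RSTART, RLENGTH)
                out = out substr(rest, 1, RSTART - 1)
                if (!(word in drop)) out = out word
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$1" "$2"
}

# Usage: prune_words johndoe-words.txt introduction.tex > introduction.pruned.tex
```

The surrounding whitespace and punctuation are left as they were, so the pruned file keeps the same line structure as the original.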
Post-Processing by latexdiff:
The goal is to produce a diff report after removing the contributions of the author. After pruning the text (i.e. once I have the answer to this question), I intend to use latexdiff, a standard yet amazing Perl script that can detect these word removals (or indeed any other difference between the two LaTeX files) and produce a composite LaTeX file which, compiled to PDF, highlights the removed words with red strikethroughs. All that I need to do is identify and remove the words originally introduced by the other author (i.e. my core question here). All sentences in the composite PDF will therefore remain coherent with no loss of meaning: the removed words stay in the same location and simply have red strikethrough marks over them.
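For reference, the latexdiff step could be wrapped as in the sketch below. The function name make_diff_tex and the file names in the usage line are placeholders. It assumes latexdiff is on the PATH and is called in its basic form, with the old file first and the new file second, writing the marked-up LaTeX to standard output.

```shell
#!/bin/sh
# make_diff_tex OLD NEW OUT
# Write to OUT a LaTeX file in which text present in OLD but missing
# from NEW is marked as deleted. Hypothetical wrapper around latexdiff.
make_diff_tex() {
    latexdiff "$1" "$2" > "$3"
}

# Usage (file names are placeholders):
#   make_diff_tex introduction.tex introduction.pruned.tex introduction.diff.tex
#   pdflatex introduction.diff.tex
```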
Background and Context:
This is in an academic context. The git project is a LaTeX repo of a manuscript. I am in an authorship dispute with a co-author of a paper, which therefore was not submitted to any journal. We are both PhD students. So that each of us can claim copyright over our own words for use in our respective theses, our PhD advisor has asked us to submit our respective claims on the words each of us introduced into the manuscript, so that we can reuse them in our theses and steer clear of plagiarism accusations. We both committed to the same repo, and now I am thinking of leveraging the power of git and shell along with git-grep, sed, awk, perl or whatever to help me claim, with integrity, the correct words I contributed. Your help will be much appreciated.
Starting Point:
git log --oneline -S'enlargement' -- introduction.tex
correctly shows the list of commits that add or remove an occurrence of that case-sensitive word, i.e. enlargement in this case. The oldest commit in the list helps to identify the committing author. We are simply looking for the "big, technical words" that first explain a concept. I am already doing this manually with that starter git command, but I need to automate it because there are around 10 such files. I obviously don't want to do this manually for every 5+ character word in every file.
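The manual lookup described here could be automated along the following lines. This is only a sketch: first_author is a made-up name, and it takes the oldest commit reported by git log -S to be the one that introduced the word, which ignores where in the file the word occurs and treats the search string as a plain substring.

```shell
#!/bin/sh
# first_author WORD FILE
# Print the author name of the oldest commit that changed the number of
# occurrences of WORD in FILE. Hypothetical helper; run inside the repo.
first_author() {
    # --reverse lists the oldest matching commit first; %an is the author name.
    git log --reverse --format='%an' -S"$1" -- "$2" | head -n 1
}

# Usage:
#   if [ "$(first_author enlargement introduction.tex)" = "johndoe" ]; then
#       echo "enlargement was first introduced by johndoe"
#   fi
```

Because -S matches substrings, a search for base would also react to commits that only touch user-base, so the result still needs a manual check for short or common stems.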
shell-script text-processing git text-formatting bash-functions
I imagine since you're trying to go to all this trouble, that reverting the author's commits wouldn't be appropriate instead?
– Stephen Kitt, Jul 18 at 11:47
@StephenKitt I have now updated the question to explain the context.
– Krishna, Jul 18 at 11:56
Your step 3, as is, is flawed, and I really think you need to approach this differently. Why? Because after having extracted the words, while you can use git log -S, you lose track of where in the file the word was. The same word can appear multiple times... I would instead work on each commit one by one, using git blame. Tracking a word through history without context will create more problems than it solves. Also, I doubt that your whole problem could be fully automated.
– Patrick Mevzek, Jul 19 at 1:17
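The git blame route suggested in the last comment keeps the positional context that git log -S loses. A minimal sketch, assuming the author name is matched exactly and that line-level attribution is an acceptable starting point (the helper name lines_by_author is made up):

```shell
#!/bin/sh
# lines_by_author AUTHOR FILE
# Print the lines of FILE, as of HEAD, that git blame attributes to AUTHOR.
# Hypothetical helper; run inside the repo.
lines_by_author() {
    git blame --line-porcelain -- "$2" |
        awk -v who="$1" '
            # Header line "author NAME": remember whether it is the one wanted.
            /^author / { mine = (substr($0, 8) == who) }
            # Content lines start with a TAB in porcelain output.
            /^\t/      { if (mine) print substr($0, 2) }
        '
}

# Usage: lines_by_author johndoe introduction.tex
```

Blame credits a whole line to whoever changed it last, so a line that one author wrote and the other merely corrected is attributed to the corrector. That is one reason the result still needs to be reviewed by hand.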
edited Jul 18 at 19:51
asked Jul 18 at 11:42
Krishna