Git - prune every whitespace-separated word originally introduced by specific author in project's history

We have a project under git revision control with only a single branch. We need to remove every new whitespace-separated word that was introduced for the first ever time in a given file by a specific author.

To clarify, at this point we have the HEAD checked out. Now, in an example file named introduction.tex, if there is a sentence "Enlargement of the user-base is beneficial ...", I'd like a bash script with suitable git commands that:

1. Parses the current whitespace-separated word (in the example, for the first iteration, this will be Enlargement), maybe by using a regex like \b[A-Za-z]+\b for word detection.

2. Checks whether the word is at least 5 characters long. If not, it keeps moving to the next word until this condition is satisfied. If satisfied, it moves to #3 below.

3. Checks the entire history of the project to find out who originally made the commit that introduced this word.

4. If the author of that specific commit matches johndoe, removes the word under consideration from the file.

5. Repeats #1 -- #4 until all words from the file have been parsed and the words originally introduced by the specific author have been pruned.
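Steps 1 and 2 above could be sketched roughly as follows. This is only a minimal sketch: the function name candidate_words is hypothetical, and it treats a "word" as a run of ASCII letters rather than a strictly whitespace-separated token, so user-base is split into user and base, and LaTeX command names such as \section would also be picked up.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the question): read text on standard input
# and print each distinct alphabetic word of at least MIN_LEN letters,
# one per line. MIN_LEN defaults to 5.
candidate_words() {
    local min_len="${1:-5}"
    # grep -o prints every match on its own line; -E enables the {n,} interval.
    # LC_ALL=C makes the sort order byte-wise and therefore reproducible.
    grep -oE "[A-Za-z]{${min_len},}" | LC_ALL=C sort -u
}

# Example with the sentence from the question.
printf '%s\n' 'Enlargement of the user-base is beneficial' | candidate_words 5
```

With the sample sentence, only Enlargement and beneficial pass the 5-letter threshold, which also filters out the short common words.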

Treatment of High-Frequency Occurrence Words:

It is important to ignore common keywords like a, an, the, of, for, if, then, but, else, not, any, or, nor. So I propose a minimum length of 5 characters for a word to qualify for removal.

Basically the idea is to eliminate or revert English-like contributions made by a particular author. How can this be done?

Post-Processing by latexdiff:

This question is about producing a diff report after removing the other author's contributions. After pruning the text (i.e. once I have an answer to this question), I intend to use latexdiff, a standard yet amazing Perl script that can detect these word removals (or indeed any other difference between the two LaTeX files) and output a composite PDF highlighting the removed words with red strikethroughs. All I need to do is identify and remove the words originally introduced by the other author (my core question here). All sentences in the composite PDF will therefore remain coherent with no loss of meaning: the removed words stay in the same location and simply carry red strikethrough marks.

Background and Context:

This is in an academic context. The git project is a LaTeX repo of a manuscript. I am in an authorship dispute with a co-author of a paper, which therefore did not get submitted to any journal. We are both PhD students. So that each of us can claim copyright on our own words for use in our respective theses, our PhD advisor has asked us to submit our respective claims on the words each of us introduced in the manuscript, so that we can reuse them in our theses and steer clear of plagiarism accusations. We both committed to the same repo, and now I am thinking of leveraging the power of git and shell along with git-grep, sed, awk, perl or whatever else helps me claim, with integrity, the words I actually contributed. Your help will be much appreciated.

Starting Point:

git log --oneline -S 'enlargement' -- introduction.tex lists the commits that changed the number of occurrences of that case-sensitive string (enlargement in this case). The oldest commit in the list identifies the author who introduced it. We are simply looking for the "big, technical words" that first explain a concept. I am already doing this manually with that starter git command, but I need to automate it because there are around 10 such files, and I obviously don't want to do this by hand for every 5+ character word in every file.

edited Jul 18 at 19:51

asked Jul 18 at 11:42

Krishna


I imagine since you're trying to go to all this trouble, that reverting the author's commits wouldn't be appropriate instead?
– Stephen Kitt
Jul 18 at 11:47

@StephenKitt I have now updated the question to explain the context.
– Krishna
Jul 18 at 11:56

Your step 3, as is, is flawed and I really think you need to imagine doing things otherwise. Why? Because after having extracted the words, while you can use git log -S you lose the fact of where in the file this word was. You can have multiple times the same word... I would instead work on each commit one by one and using git blame. Tracking a word, in history, without context will create more problems than solutions. Also I doubt that your whole problem could be fully automated.
– Patrick Mevzek
Jul 19 at 1:17
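As a rough illustration of the git blame route suggested in the last comment, the sketch below prints the recorded author next to every line of a file at HEAD. The function name blame_authors is hypothetical. It is only a starting point: blame works per line, so a line is attributed to whoever touched it last, not to whoever first wrote each word in it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): print "author<TAB>line text"
# for every line of FILE as attributed by git blame at HEAD.
blame_authors() {
    local file="$1"
    # --line-porcelain repeats the full commit header for every line; the
    # header has an "author <name>" field and the content line starts with a tab.
    git blame --line-porcelain -- "$file" |
    awk '
        /^author / { author = substr($0, 8) }           # remember current author
        /^\t/      { print author "\t" substr($0, 2) }  # emit author and content
    '
}
```

For example, blame_authors introduction.tex | grep '^johndoe' would list the lines currently attributed to that author, keeping the positional context that a history-wide word search loses.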
