Splitting text files based on a semi-regular expression

I have a number of fairly large text files that I want to split into a bunch of smaller files (the number of output files will vary from file to file).

All of them follow the same pattern:

 "id": 999999,
 "url": "https://***",
 "name": "****",
 "name_abbreviation": "****",
 "decision_date": "****",
 "docket_number": "****",
 "first_page": "***",
 "last_page": "***",
 "citations": [
 
 "type": "***",
 "cite": "***"
 ,
 
 "type": "***",
 "cite": "***"
 
 ],
 "volume": 
 "url": "https://***",
 "volume_number": "**"
 ,
 "reporter": 
 "url": "***",
 "full_name": "***"
 ,
 "court": 
 "url": "https://***",
 "id": ***,
 "slug": "***",
 "name": "***",
 "name_abbreviation": "***"
 ,
 "jurisdiction": 
 "url": "https://***",
 "id": **,
 "slug": "**",
 "name": "***.",
 "name_long": "***",
 "whitelisted": ***
 ,
 "casebody": 
 "status": "ok",
 "data": 
 "attorneys": [
 "****",
 "***"
 ],
 "opinions": [
 
 "type": "***",
 "text": "INSERT MANY LINES OF TEXT",
 "author": "***"
 
 ],
 "judges": [
 "***"
 ],
 "parties": [
 "***"
 ],
 "head_matter": "***"
 
 
 },

This block then repeats a variable number of times.

I am trying to split the file so that each repeat ends up in its own new text file: from the first instance of "id": 999999, through the body of text and the final "head_matter" field, up to (but not including) the next top-level "id": 999999.

My problem is that each repeat contains three '"id": ' lines, but I only want to split at the first of them.

[A solution using awk, grep, or csplit would be preferable, since this is going into a larger C shell script.]
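
Going only by the sample above, one way to tell the three "id" lines apart is their neighbours: the top-level "id" is followed by a "url" line, while the two nested ones (under "court" and "jurisdiction") come directly after a "url" line. The following is a minimal awk sketch built on that assumption; it is not a tested solution for the real files. The function name split_cases, the default prefix case_, and the zero-padded output names are all hypothetical, and the heuristic would break if a real record orders its fields differently or if a line of opinion text happens to begin with "id":.

```shell
# split_cases INPUT [PREFIX]
# Hypothetical helper: writes PREFIX0001.txt, PREFIX0002.txt, ...
# Assumption (from the sample only): a nested "id" line always comes
# directly after a "url" line, and a top-level "id" line never does.
split_cases() {
    awk -v prefix="${2:-case_}" '
        # Start a new output file at an "id" line whose previous line
        # was not a "url" line, i.e. at the first "id" of a record.
        /^[[:space:]]*"id":/ && prev !~ /^[[:space:]]*"url":/ {
            if (out != "") close(out)   # avoid running out of open files
            out = sprintf("%s%04d.txt", prefix, ++n)
        }
        # Lines seen before the first top-level "id" are skipped.
        out != "" { print > out }
        # Remember this line so the next one can inspect it.
        { prev = $0 }
    ' "$1"
}
```

Called as `split_cases bigfile.txt case_`, this would leave one file per record in the current directory. The function wrapper is POSIX sh; from a C shell script, saving the awk program to a file and running it with `awk -f` is probably the simplest route. Any line that sits between two records, such as an opening brace in the JSON version, ends up at the end of the preceding output file.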

asked 3 hours ago by Sara Alexandra (new contributor)

edited 3 hours ago by Rui F Ribeiro

Perl is the tool you need. I'm on my phone, will explain more later.
– waltinator
3 hours ago

that looks suspiciously like json; is it?
– Jeff Schaller
2 hours ago

I have a json version! I've been working with the txt version, but I have access to both!
– Sara Alexandra
2 hours ago

regular-expression split
