How can I wget from a list with multiple lines into one file name?
I would like to wget a list of items that I'm retrieving from an XML file.
I'm using sed to clean up the XML, and I'm ending up with output like this:
CountofMonteCristo.zip
English.
http://www.archive.org/download/count_monte_cristo_0711_librivox/count_monte_cristo_0711_librivox_64kb_mp3.zip
Alexandre.
Dumas.
LettersofTwoBrides.zip
English.
http://www.archive.org/download/letters_brides_0709_librivox/letters_brides_0709_librivox_64kb_mp3.zip
Honoréde.
Balzac.
BleakHouse.zip
English.
http://www.archive.org/download/bleak_house_cl_librivox/bleak_house_cl_librivox_64kb_mp3.zip
Charles.
Dickens.
I'd like to use wget -i to download these files as
Language.Lastname.Firstname.Title.zip
I'm open to re-arranging the file somehow so that I can use
$filename $url
I've tried a few different sed commands. Sed is what I've used to clean up the XML tags, but I can't figure out how to move text to the appropriate place. The titles, names, and languages will vary for each file.
EDIT: Before cleaning up the tags with sed, each line is wrapped in tags, such as English and FileTitle.
I think this could be helpful in identifying patterns to re-arrange things.
EDIT2: Here's the XML source
EDIT3: Something like this looks like it would work, but I'm having trouble modifying it to suit my needs.
My ultimate goal is to organize all of the files into folders, with a hierarchy of Language -> AuthorLastnameFirstname -> Files.zip
If what I'm doing is not best practice, I'm open to other methods.
Thanks
bash wget
Can we see an example of the original .xml file?
— D'Arcy Nader
Apr 2 at 17:39
Thanks for asking, I've edited my question to add a link to the XML source.
— Matt Zabojnik
Apr 2 at 17:41
Why in the name of feral ponies are you trying to use a regular expression tool to parse and reform XML data? Use a DOM parser or other tool that is designed to parse XML to ingest the data and spit out what you need.
— DopeGhoti
Apr 2 at 17:42
@DopeGhoti, Can you elaborate? I've done this before on another site using this method, so this is all I'm familiar with. Docs, examples, suggestions for DOM parsing would be helpful. Also, I'm doing this on a headless Ubuntu machine, using wget to retrieve the XML file, if that matters.
— Matt Zabojnik
Apr 2 at 17:45
Question comments are not a place to dive into the weeds, but I will point you first to the manual page for xpath.
— DopeGhoti
Apr 2 at 18:12
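A minimal illustration of the DOM-parsing approach the comments suggest, using Python's built-in xml.etree instead of sed; the tag names used here (book, language, url_zip_file, first_name, last_name) are assumptions about the feed's structure, and the code runs against an inline sample rather than the live API:

```python
import xml.etree.ElementTree as ET

# Inline sample shaped like the (assumed) LibriVox feed structure.
xml_data = """<books>
  <book>
    <title>CountofMonteCristo</title>
    <language>English</language>
    <url_zip_file>http://www.archive.org/download/count_monte_cristo_0711_librivox/count_monte_cristo_0711_librivox_64kb_mp3.zip</url_zip_file>
    <authors>
      <author><first_name>Alexandre</first_name><last_name>Dumas</last_name></author>
    </authors>
  </book>
</books>"""

root = ET.fromstring(xml_data)
pairs = []
for book in root.findall('book'):
    lang = book.findtext('language')
    first = book.findtext('.//first_name')
    last = book.findtext('.//last_name')
    title = book.findtext('title')
    url = book.findtext('url_zip_file')
    # One "filename url" line per book, ready for a download loop
    pairs.append('{}.{}.{}.{}.zip {}'.format(lang, last, first, title, url))

for p in pairs:
    print(p)
```

Because the parser works on the document tree rather than on lines of text, the reordering into Language.Lastname.Firstname.Title.zip is just a format string.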
asked Apr 2 at 17:25 by Matt Zabojnik; edited Apr 2 at 19:12
3 Answers
If what I'm doing is not best practice, I'm open to other methods.
I am going to suggest you don't use bash or sed for this, and instead go with Python, which is a much better fit for parsing the XML you need to parse. I have just written and tested this with Python 3.6 and it does exactly what you have asked.
#!/usr/bin/python3
# Let's import the modules we need
import os
import requests
import wget
from bs4 import BeautifulSoup as bs
# Assign the url to a variable (not essential as we
# only use it once, but it's pythonic)
url = 'https://librivox.org/api/feed/audiobooks/?offset=0&limit=3&fields=%7Blanguage,authors,title,url_zip_file%7B'
# Use requests to fetch the raw xml
r = requests.get(url)
# Use BeautifulSoup and lxml to parse the raw xml so
# we can do stuff with it
s = bs(r.text, 'lxml')
# We need to find the data we need. This will find it and create some
# python lists for us to loop through later
# Find all xml tags named 'url_zip_file' and assign them to a variable
links = s.find_all('url_zip_file')
# Find all xml tags named 'last_name' and assign them to a variable
last_names = s.find_all('last_name')
# Find all xml tags named 'first_name' and assign them to a variable
first_names = s.find_all('first_name')
# Find all xml tags named 'language' and assign them to a variable
language = s.find_all('language')
# Assign the language to a variable
english = language[0].text
# Make our new language directory
os.mkdir(english)
# cd into our new language directory
os.chdir(english)
# Loop through the last names (ln), first names (fn) and links
# so we can make the directories, download the file, rename the
# file, then go back a directory and loop again.
# Note: the '{}' format placeholders were stripped when this code was
# rendered to the page; the patterns below are a reconstruction.
for ln, fn, link in zip(last_names, first_names, links):
    os.mkdir('Author{}{}'.format(ln.text, fn.text))
    os.chdir('Author{}{}'.format(ln.text, fn.text))
    filename = wget.download(link.text)
    os.rename(filename, '{}.{}.{}'.format(ln.text, fn.text, filename))
    os.chdir('../')
You can either save this to a file or just paste/type into a python3 interpreter cli, it's up to you.
You will need to install python3-wget and beautifulsoup4 using pip or easy_install etc.
The API also provides JSON output (with &format=json), so bs4 might not be needed at all.
— muru
Apr 3 at 4:37
If you can use jq
, the Librivox API also provides JSON output, and it's probably easier to parse JSON with jq
than XML with proper XML tools.
u='https://librivox.org/api/feed/audiobooks/?offset=0&limit=3&fields=%7Blanguage,authors,title,url_zip_file%7B&format=json'
curl "$u" -sL |
  jq -r '.books[] | "\(.language).\(.authors[0].last_name + .authors[0].first_name).\(.title).zip", .url_zip_file'
Gives output like:
English.DumasAlexandre.Count of Monte Cristo.zip
http://www.archive.org/download/count_monte_cristo_0711_librivox/count_monte_cristo_0711_librivox_64kb_mp3.zip
English.BalzacHonoré de.Letters of Two Brides.zip
http://www.archive.org/download/letters_brides_0709_librivox/letters_brides_0709_librivox_64kb_mp3.zip
English.DickensCharles.Bleak House.zip
http://www.archive.org/download/bleak_house_cl_librivox/bleak_house_cl_librivox_64kb_mp3.zip
After that, it's relatively simple to use xargs
:
curl "$u" -sL |
  jq -r '.books[] | "\(.language).\(.authors[0].last_name + .authors[0].first_name).\(.title).zip", .url_zip_file' |
  xargs -d '\n' -n2 wget -O
Here xargs passes two lines at a time as arguments to wget, with the first line becoming the -O option's parameter (the output filename) and the second the URL.
Though I'd recommend a Python-based solution like Jamie's, except using JSON and Python's builtin JSON capabilities instead of bs4.
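A sketch of that JSON-plus-stdlib variant, assuming the API's response is shaped like the jq output above (a top-level books array with language, title, authors, and url_zip_file fields); it runs against a hard-coded sample here rather than a live request, which a real script would fetch with urllib.request.urlopen:

```python
import json

# Sample shaped like the (assumed) LibriVox JSON response.
raw = json.dumps({
    "books": [
        {
            "title": "Bleak House",
            "language": "English",
            "url_zip_file": "http://www.archive.org/download/bleak_house_cl_librivox/bleak_house_cl_librivox_64kb_mp3.zip",
            "authors": [{"first_name": "Charles", "last_name": "Dickens"}],
        }
    ]
})

data = json.loads(raw)
downloads = []
for book in data["books"]:
    author = book["authors"][0]
    # Same naming scheme as the jq filter: Language.LastFirst.Title.zip
    name = "{}.{}{}.{}.zip".format(
        book["language"], author["last_name"], author["first_name"], book["title"]
    )
    downloads.append((name, book["url_zip_file"]))

for name, url in downloads:
    print(name, url)
```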
Brute force.
If your parsed xml is in books
while read a; read b; read c; read d; read e; do wget "$c" -O "$b$e$d$a"; echo "$c"; done < books
Just recompose your lines as variables and you are good to go as long as your record blocks are padded to 5 lines.
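The same recomposition can be sketched in Python, which sidesteps the quoting pitfalls of shell variables; it assumes the same 5-line record layout (title, language, URL, first name, last name, each already ending in a dot where needed) and uses an inline sample in place of the books file:

```python
# Group the cleaned-up list into 5-line records:
# Title.zip / Language. / URL / First. / Last.
lines = """CountofMonteCristo.zip
English.
http://www.archive.org/download/count_monte_cristo_0711_librivox/count_monte_cristo_0711_librivox_64kb_mp3.zip
Alexandre.
Dumas.""".splitlines()

records = [lines[i:i + 5] for i in range(0, len(lines), 5)]
jobs = []
for title, language, url, first, last in records:
    # Language.Lastname.Firstname.Title.zip, as asked for in the question
    jobs.append((language + last + first + title, url))

for name, url in jobs:
    # a real run would call: subprocess.run(['wget', url, '-O', name])
    print(name, url)
```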
answered Apr 2 at 23:15 by Jamie Lindsey; edited Apr 3 at 6:14
answered Apr 3 at 5:03 by muru; edited Apr 3 at 6:26
answered Apr 2 at 21:06 by bu5hman