Trace web redirection in bash script
2014-07
I want to make a script to download the image at best resolution (not the preview image) from a deviantART link, as if I clicked the "Download" button.
However, it seems that deviantART redirects the browser to download the image from another source, and I can't figure out how to obtain this source from a bash script.
For example, I want to give this link as input:
http://earthsong9405.deviantart.com/art/The-Big-Boys-357700214
And get the image located here:
http://fc05.deviantart.net/fs71/f/2013/077/1/c/the_big_boys_by_earthsong9405-d5wyr92.png
Via the address given by the link in the download button:
http://www.deviantart.com/download/357700214/the_big_boys_by_earthsong9405-d5wyr92.png?token=add3c3dbf4112b7140930c574a819878509c7ebc&ts=1403209394
As long as the page's code stays in this format, you can do it with a little script like the one below:
MyUrl=$1
File_Url=$(wget -q -O - "$MyUrl") # put the page's HTML in a variable
Line=$(echo "$File_Url" | grep -e 'meta name="og:image"') # select only the relevant line
# echo "$Line"
Img=$(echo "$Line" | sed -e 's/.*<meta name="og:image" content="//' -e 's/".*//')
# echo "$Img"
wget -q "$Img"
The URL of the image you are searching for is included in a meta name="og:image" tag. So you can download the page with wget, pass it through grep to select a unique line, and strip what is not needed with sed. Once you have obtained the URL of your image this way (in the script it ends up in the variable Img), you can use wget again to download it.
This works only as long as the page's internal markup stays in this format. Otherwise you will have to find another way to select the one tag that interests you.
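If the exact tag layout drifts, the extraction step can be made slightly more defensive by isolating the attribute value instead of deleting fixed strings around it. The markup below is a hypothetical sample, not the real page:

```shell
# Hypothetical sample of the page fragment we care about:
html='<head><meta name="og:image" content="http://fc05.deviantart.net/fs71/f/pic.png"></head>'

# Match the whole attribute, then keep only its value:
Img=$(printf '%s' "$html" \
  | grep -o 'meta name="og:image" content="[^"]*"' \
  | sed 's/.*content="//; s/"$//')

echo "$Img"
```

This survives extra attributes or surrounding markup on the same line, which the fixed-string sed version does not.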
I want to use Wget to save single web pages (not recursively, not whole sites) for reference. Much like Firefox's "Web Page, complete".
My first problem is: I can't get Wget to save background images specified in the CSS. Even if it did save the background image files I don't think --convert-links would convert the background-image URLs in the CSS file to point to the locally saved background images. Firefox has the same problem.
My second problem is: if there are images on the page I want to save that are hosted on another server (like ads), these won't be included. --span-hosts doesn't seem to solve that problem with the command below.
I'm using:
wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://domain.tld/webpage.html
From the wget manual (1.12):
"Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’: "
wget -E -H -k -K -p url
Also, in case robots.txt is disallowing you, add -e robots=off
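Putting the manual's suggestion and the robots override together gives one command. The URL is a placeholder, and the command is only echoed here so the sketch stays network-free:

```shell
# Placeholder URL; substitute the page you actually want to save:
url="http://domain.tld/webpage.html"

# -E: add .html extensions, -H: span hosts, -k: convert links,
# -K: keep originals, -p: page requisites, -e robots=off: ignore robots.txt
cmd="wget -E -H -k -K -p -e robots=off $url"
echo "$cmd"
```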
The wget command offers the option --mirror, which does the same thing as:
$ wget -r -N -l inf --no-remove-listing
You can also throw in -x to create a whole directory hierarchy for the site, including the hostname. You might not have been able to find this if you aren't using the newest version of wget, however.
It sounds like wget and Firefox are not parsing the CSS for links to include those files in the download. You could work around those limitations by wget'ing what you can, and scripting the link extraction from any CSS or Javascript in the downloaded files to generate a list of files you missed. Then a second run of wget on that list of links could grab whatever was missed (use the -i flag to specify a file listing URLs).
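As a rough sketch of that second pass, assuming a downloaded stylesheet named style.css (file and path names below are invented for the example):

```shell
# Hypothetical downloaded stylesheet:
cat > style.css <<'EOF'
body { background-image: url(images/bg.png); }
h1 { background: #fff url("images/logo.png") no-repeat; }
EOF

# Pull every url(...) reference out, then strip parentheses and quotes:
grep -o 'url([^)]*)' style.css \
  | sed "s/^url(//; s/)$//; s/^[\"']//; s/[\"']$//" > missing-urls.txt

cat missing-urls.txt
# A second pass would then be: wget -i missing-urls.txt
```

Relative paths in the list would still need to be resolved against the page's base URL before feeding them to wget -i.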
If you like Perl, there's a CSS::Parser module on CPAN that may give you an easy means to extract links in this fashion.
Note that wget only parses certain HTML markup (href/src) and CSS URIs (url()) to determine what page requisites to get. You might try using Firefox addons like DOM Inspector or Firebug to figure out whether the third-party images you aren't getting are being added through Javascript; if so, you'll need to resort to a script or Firefox plugin to get them too.