Trace web redirection in bash script

2014-07-08
  • Epholys

    I want to make a script to download the image at its best resolution (not the preview image) from a deviantART link, as if I had clicked the "Download" button.

    However, it seems that deviantART redirects the browser to download the image from another source, and I can't find out how to get this source from the bash script.

    For example, I want to give this link as input:

    http://earthsong9405.deviantart.com/art/The-Big-Boys-357700214

    And get the image located here:

    http://fc05.deviantart.net/fs71/f/2013/077/1/c/the_big_boys_by_earthsong9405-d5wyr92.png

    Via the address given by the link in the download button:

    http://www.deviantart.com/download/357700214/the_big_boys_by_earthsong9405-d5wyr92.png?token=add3c3dbf4112b7140930c574a819878509c7ebc&ts=1403209394
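
    For reference, one way to see where a link like the download URL above ultimately ends up is to let curl follow the whole redirect chain and print only the final URL. This is just a sketch of that idea, assuming curl is available; the token in the example link will long since have expired:

    # Follow every redirect with HEAD requests and print only the final URL curl lands on
    # (drop -I if the server does not answer HEAD requests properly)
    DownloadUrl='http://www.deviantart.com/download/357700214/the_big_boys_by_earthsong9405-d5wyr92.png?token=add3c3dbf4112b7140930c574a819878509c7ebc&ts=1403209394'
    curl -s -I -L -o /dev/null -w '%{url_effective}\n' "$DownloadUrl"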

  • Answers
  • Hastur

    As long as the page's source code stays in this form, you can do it with a little script like the one below:

    MyUrl=$1

    Page=$(wget -q -O - "$MyUrl")                            # put the HTML of the page in a variable
    Line=$(echo "$Page" | grep -e 'meta name="og:image"')    # select only the line with the og:image tag
    # echo "$Line"
    Img=$(echo "$Line" | sed -e 's/<meta name="og:image" content="//g' -e 's/">//g')
    # echo "$Img"
    wget -q "$Img"                                           # download the image itself
    

    The URL of the image you are searching for is included in a <meta name="og:image"> tag.
    So you can download the page with wget, pipe it through grep to select that single line, and strip away what is not needed with sed.
    Once you have obtained the URL of your image this way (in the script it is stored in the variable Img), you can use wget again to download it.

    This is valid only as long as the internal code of the page keeps this structure. Otherwise you will have to find another way to select the unique tag that interests you.
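
    As a side note, the same idea can be written as a single pipeline. This is only a sketch of an equivalent approach, assuming GNU grep with the -o option:

    MyUrl=$1

    # Fetch the page, keep only the og:image meta tag, strip everything but its content, then download it
    Img=$(wget -q -O - "$MyUrl" \
          | grep -o '<meta name="og:image" content="[^"]*"' \
          | sed -e 's/.*content="//' -e 's/"$//')
    wget -q "$Img"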


  • Related Question

    command line - Save a single web page (with background images) with Wget
  • user14124

    I want to use Wget to save single web pages (not recursively, not whole sites) for reference. Much like Firefox's "Web Page, complete".

    My first problem is: I can't get Wget to save background images specified in the CSS. Even if it did save the background image files, I don't think --convert-links would convert the background-image URLs in the CSS file to point to the locally saved background images. Firefox has the same problem.

    My second problem is: If there are images on the page I want to save that are hosted on another server (like ads), these won't be included. --span-hosts doesn't seem to solve that problem with the command line below.

    I'm using:

    wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://domain.tld/webpage.html


  • Related Answers
  • Greg Dean

    From the wget manual (1.12):

    "Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’: "

    wget -E -H -k -K -p url
    

    Also, in case robots.txt is disallowing you, add -e robots=off to the command.
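
    Putting those options together with the robots override, the whole fetch becomes one command (the URL is just the placeholder from the question):

    # -E adjust extensions, -H span hosts, -k convert links, -K keep .orig backups, -p page requisites
    wget -E -H -k -K -p -e robots=off http://domain.tld/webpage.html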

  • k0pernikus

    The wget command offers the option --mirror, which does the same thing as:

    $ wget -r -N -l inf --no-remove-listing
    

    You can also throw in -x to create a whole directory hierarchy for the site, including the hostname.

    You might not have been able to find this if you aren't using the newest version of wget, however.
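
    As a quick sketch of how that looks in practice (domain.tld is just a placeholder):

    # --mirror is shorthand for -r -N -l inf --no-remove-listing;
    # -x (--force-directories) additionally forces the full directory hierarchy, hostname included
    wget --mirror -x http://domain.tld/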

  • quack quixote

    It sounds like wget and Firefox are not parsing the CSS for links to include those files in the download. You could work around those limitations by wget'ing what you can, and scripting the link extraction from any CSS or Javascript in the downloaded files to generate a list of files you missed. Then a second run of wget on that list of links could grab whatever was missed (use the -i flag to specify a file listing URLs).
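
    A minimal bash sketch of that extraction step, assuming the stylesheets from the first run sit in the current directory and that the url() references are absolute (relative ones would still need to be resolved against the page URL):

    # Pull the url(...) references out of every downloaded stylesheet,
    # strip the wrapper and any quotes, and hand the list back to wget.
    grep -hoE 'url\([^)]+\)' ./*.css \
        | sed -e 's/^url(//' -e 's/)$//' -e "s/[\"']//g" \
        > missing-urls.txt
    wget -i missing-urls.txt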

    If you like Perl, there's a CSS::Parser module on CPAN that may give you an easy means to extract links in this fashion.

    Note that wget only parses certain HTML markup (href/src) and CSS URIs (url()) to determine which page requisites to get. You might try using Firefox addons like DOM Inspector or Firebug to figure out whether the 3rd-party images you aren't getting are being added through Javascript -- if so, you'll need to resort to a script or Firefox plugin to get them too.