networking - Accessing subpage of website by IP

07
2014-07
  • Clox

    I'm writing a PHP application that uses cURL to scrape data off websites, but loading a website with curl is very slow. It is a lot slower than loading the same page in Chrome, even though Chrome also loads other things like stylesheets and images while the PHP application does not.

    Anyway, I read that curl might have problems with DNS lookup so accessing by IP could be a lot faster.

    But I'm not sure how to do that.

    Let's take Google as an example. I can open a command prompt and run "ping www.google.com". It answers with:

    Pinging www.google.com [74.125.232.114] with 32 bytes of data...
    

    So I can use that IP address, which works. But what if I'd like to access, for instance, www.google.com/doodles?

    If I try pinging that address, it says it couldn't find the host, and opening http://74.125.232.114/doodles does not work either.

    (Error: Not Found The requested URL /doodles was not found on this server.)
    

    So how do I access that by IP?

  • Answers
  • mtak

    You are trying to access name-based virtual host (VirtualHost) websites by IP. The problem is that curl then doesn't send the hostname it's trying to reach to the web server (the Host header carries the IP instead), so the web server doesn't know which site to serve. The same server might host both google.com and gmail.com, and it can't tell which one you want because curl doesn't say.

    To let curl use a hostname, you could modify your /etc/hosts file with the following information:

    74.125.232.114 google.com
    

    (On Windows you can find this file in C:\Windows\System32\Drivers\etc\hosts)

    If you let curl do a request to google.com, your OS will find google.com in the /etc/hosts file and not even try a DNS lookup, which should be much faster.


    That being said, it would be much better to fix your DNS settings. Have you tried putting the nameservers of your provider (or Google Public DNS) in /etc/resolv.conf? For example:

    nameserver 8.8.8.8
    nameserver 8.8.4.4
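
    Before changing resolver settings, it may help to confirm that name lookup really is the slow step. The following is a minimal Python sketch, not part of the original answer: `time_dns_lookup` is a hypothetical helper that times one lookup through the system resolver. It assumes the PHP application resolves names the same way, which may not hold exactly, and the OS may cache repeated lookups.

```python
import socket
import sys
import time


def time_dns_lookup(hostname, port=80):
    """Return (seconds, addresses) for one lookup through the system resolver."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    elapsed = time.perf_counter() - start
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


if __name__ == "__main__":
    # Hostnames come from the command line; "localhost" is only a placeholder default.
    for name in sys.argv[1:] or ["localhost"]:
        try:
            seconds, addresses = time_dns_lookup(name)
            print(f"{name}: {seconds * 1000:.1f} ms -> {addresses}")
        except socket.gaierror as exc:
            print(f"{name}: lookup failed ({exc})")
```

    If the reported time for a hostname is only a few milliseconds, DNS is probably not what makes the scraper slow.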
    
  • hek2mgl

    If the DNS response time is that large, you should fix the DNS settings in your network. Have a look at /etc/resolv.conf and check whether the nameserver(s) listed there are still available. If not, add a working DNS server (on top). You could use Google's DNS service, for example:

    nameserver 8.8.8.8
    

    If you need to keep the slow DNS servers for some reason (for example, because your application uses internal DNS names that are not available on the internet), you can still modify your /etc/hosts file and add the hostname for 74.125.232.114 there:

    74.125.232.114 www.google.com
    

    With the usual settings in /etc/nsswitch.conf, the system consults /etc/hosts before performing a DNS request.
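
    The lookup order described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the real order is decided by the C library according to /etc/nsswitch.conf, and `parse_hosts` and `resolve` are hypothetical helper names, not system APIs.

```python
import socket


def parse_hosts(text):
    """Map each hostname in hosts-file formatted text to its first listed address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), address)
    return table


def resolve(hostname, table):
    """Return (address, source): the static table wins, DNS is only the fallback."""
    address = table.get(hostname.lower())
    if address is not None:
        return address, "hosts"
    return socket.gethostbyname(hostname), "dns"


if __name__ == "__main__":
    sample = "74.125.232.114 www.google.com  # entry from the answer above\n"
    print(resolve("www.google.com", parse_hosts(sample)))
```

    A name found in the table never reaches the DNS fallback, which is why a hosts entry sidesteps a slow nameserver.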

  • barlop

    Use -L to follow the redirect (curl www.google.com says the page has moved).

    It has also been mentioned that when you request by IP, the Host header doesn't carry the hostname.

    So specify the Host header yourself:

    curl -L -H "Host: www.google.com" 173.194.34.115/doodles
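
    For comparison, the same idea (connect to the IP, name the site in the Host header) can be sketched in Python with the standard http.client module. `fetch_by_ip` is a hypothetical helper, and unlike curl -L it does not follow redirects; it only reports the status and any Location header.

```python
import http.client


def fetch_by_ip(ip, hostname, path="/", port=80, timeout=10):
    """GET `path` from the server at `ip`, naming `hostname` in the Host header."""
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # Supplying Host ourselves stops http.client from filling in the IP.
        conn.request("GET", path, headers={"Host": hostname})
        response = conn.getresponse()
        return response.status, response.getheader("Location"), response.read()
    finally:
        conn.close()
```

    Calling fetch_by_ip("173.194.34.115", "www.google.com", "/doodles") would correspond to the curl command above without -L, assuming that IP still answers for Google. In the PHP application from the question, the matching option is presumably CURLOPT_HTTPHEADER.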


  • Related Question

    dns - How do I properly access Google through IPV6?
  • Charles Offenbacher

    I'm trying to access Google through IPv6. However, it seems to want to send me back to IPv4! I did a DNS lookup on IPv6.google.com at http://centralops.net/co/ and found their IP, then tried this...

    root@server:/logs# wget http://[2607:f8b0:4003:c00::6a]/
    --2011-09-14 12:10:13--  http://[2607:f8b0:4003:c00::6a]/
    Connecting to 2607:f8b0:4003:c00::6a:80... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: http://www.google.com/ [following]
    --2011-09-14 12:10:14--  http://www.google.com/
    Resolving www.google.com... 74.125.113.106, 74.125.113.147, 74.125.113.99, ...
    Connecting to www.google.com|74.125.113.106|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: `index.html.2'
    
    [ <=>                                        ] 11,670      --.-K/s   in 0.02s
    
    2011-09-14 12:10:14 (474 KB/s) - `index.html.2' saved [11670]
    

    How do I access Google (or other websites) solely over IPv6?

    I tested Facebook as well, essentially same result (301 redirect).


  • Related Answers
  • Kevin Reid

    The "identity" (origin) of a web site is determined by the hostname you access it by. This redirect may be simply to ensure the site works as intended (e.g. having access to your login session cookie), not specifically to reject IPv6 access.

    Try adding an IPv6 address for www.google.com in your hosts file instead, or using wget --header="Host: www.google.com" http://[2607:f8b0:4003:c00::6a]/ to override the URL-determined host header.

  • glglgl

    To avoid problems, Google announces its AAAA records only to DNS resolvers known to work.

    From http://www.google.com/intl/en/ipv6/:

    Google over IPv6 uses the IPv4 address of your DNS resolver to determine whether a network is IPv6-capable. If you enable Google over IPv6 for your resolver, IPv6 users of that resolver will receive AAAA records for IPv6-enabled Google services.
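
    Whether your own resolver is handed AAAA records can be checked with a short Python sketch. `ipv6_addresses` is a hypothetical helper; it asks the system resolver for IPv6 addresses only and treats a failed lookup as "none visible", which is an assumption of this sketch rather than a precise diagnosis.

```python
import socket


def ipv6_addresses(hostname, port=80):
    """Return the IPv6 addresses the system resolver reports for `hostname`."""
    try:
        infos = socket.getaddrinfo(
            hostname, port, family=socket.AF_INET6, proto=socket.IPPROTO_TCP
        )
    except socket.gaierror:
        return []  # no AAAA record visible through this resolver, or the lookup failed
    # For IPv6 the sockaddr is (host, port, flowinfo, scope_id); keep the host part.
    return sorted({info[4][0] for info in infos})
```

    An empty list for www.google.com alongside a non-empty one for ipv6.google.com would be consistent with the behaviour quoted above.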

  • Jens Erat

    Find a carrier on the trusted testers list. Then lots of Google domains will be accessible over IPv6.

    SixXS is on this list, for example, but you need to reconfigure and use their name servers.

    Sorry, I don't know any further carriers on this list.

  • Tom Wijsman

    At least for the search engine, the URL http://ipv6.google.com should work.