networking - Accessing subpage of website by IP
2014-07
I'm writing a PHP application that uses cURL to scrape data off websites, but loading a page with curl is very slow. It is a lot slower than loading the same page in Chrome, even though Chrome also loads a lot of other things like stylesheets and images while the PHP application does not.
Anyway, I read that curl might have problems with DNS lookup so accessing by IP could be a lot faster.
But I'm not sure how to do that.
Let's take Google as an example. I can open a command prompt and run "ping www.google.com". It answers with:
Pinging www.google.com [74.125.232.114] with 32 bytes of data...
So I can use that IP address, which works. But what if I'd like to access, for instance, www.google.com/doodles?
If I try pinging that address, it says it couldn't find the host, and requesting http://74.125.232.114/doodles does not work either.
(Error: Not Found The requested URL /doodles was not found on this server.)
So how do I access that by IP?
You are trying to access name-based VirtualHost websites by IP. The problem is that, given only an IP address, curl doesn't send the hostname it's trying to reach to the web server (the Host header carries the IP instead), so the web server doesn't know which site to serve. The server behind google.com might also host gmail.com, and it doesn't know what to give you because curl doesn't ask for either by name.
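To illustrate why the request by IP ends in "Not Found", here is a minimal Python sketch of how a name-based virtual-host server might choose a site from the Host header. The site table, the directory paths, and the `pick_site` helper are all hypothetical; real servers (Google's included) are configured quite differently.

```python
# Hypothetical table of virtual hosts that share one IP address.
SITES = {
    "www.google.com": "/srv/search",
    "mail.google.com": "/srv/mail",
}
DEFAULT_SITE = "/srv/default"  # used when the Host header matches nothing


def pick_site(host_header):
    """Return the (made-up) document root for a Host header value."""
    host = host_header.strip().lower()   # hostnames are case-insensitive
    if not host.startswith("["):         # leave bracketed IPv6 literals alone
        host = host.split(":", 1)[0]     # drop an optional ":port" suffix
    return SITES.get(host, DEFAULT_SITE)


# A request by name reaches the intended site...
print(pick_site("www.google.com"))   # /srv/search
# ...but a request by bare IP falls through to the default site,
# which may have no /doodles page at all.
print(pick_site("74.125.232.114"))   # /srv/default
```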
To let curl use a hostname, you could add the following entry to your /etc/hosts file:
74.125.232.114 google.com
(On Windows you can find this file at C:\Windows\System32\Drivers\etc\hosts)
If you then let curl make a request to google.com, your OS will find google.com in the /etc/hosts file and not even try a DNS lookup, which should be much faster.
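As a rough illustration of the lookup described above, the following Python sketch searches hosts-file text for a name. It is not how the operating system actually implements this (that lives in the system resolver library); `lookup_in_hosts` and the `sample` text are hypothetical names used only for this example.

```python
def lookup_in_hosts(hosts_text, hostname):
    """Return the first IP mapped to hostname in hosts-file text, or None."""
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        # Everything after "#" is a comment.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or an address without names
        ip, names = fields[0], fields[1:]
        if wanted in (name.lower() for name in names):
            return ip
    return None


sample = """
127.0.0.1       localhost
74.125.232.114  google.com   # the entry suggested in this answer
"""
print(lookup_in_hosts(sample, "google.com"))      # 74.125.232.114
print(lookup_in_hosts(sample, "www.google.com"))  # None
```

Note that names are matched exactly: an entry for google.com does not cover www.google.com, so list every name you intend to request.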
That being said, it would be much better to fix your DNS settings. Have you tried editing the /etc/resolv.conf file to list the nameservers of your provider (or Google Public DNS)?
nameserver 8.8.8.8
nameserver 8.8.4.4
If the DNS response time is that long, you should fix the DNS settings in your network. Have a look at /etc/resolv.conf and check whether the nameserver(s) listed there are still available. If not, add a working DNS server (at the top). You could use Google's DNS service, for example:
nameserver 8.8.8.8
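Before changing anything, it may help to confirm that name resolution really is the slow part. The sketch below times a lookup through the system resolver using Python's `socket.getaddrinfo`; the `time_lookup` helper and its `resolve` parameter are hypothetical names introduced here, and the result reflects the whole resolution path (including /etc/hosts and any caching), not one specific nameserver.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.getaddrinfo):
    """Resolve hostname and return (elapsed_seconds, sorted unique addresses)."""
    start = time.perf_counter()
    results = resolve(hostname, 80, proto=socket.IPPROTO_TCP)
    elapsed = time.perf_counter() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    addresses = sorted({entry[4][0] for entry in results})
    return elapsed, addresses


# Example (performs a real lookup, so the numbers depend on your network):
#   elapsed, addresses = time_lookup("www.google.com")
#   print(f"{elapsed * 1000:.1f} ms -> {addresses}")
```

If the measured time is consistently small, the slowness probably lies elsewhere in the request.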
If for some reason you need to keep the slow DNS servers (for example, because your application uses internal DNS names that are not available on the internet), you can still modify your /etc/hosts file and add the hostname for 74.125.232.114 there:
74.125.232.114 www.google.com
With common settings in /etc/nsswitch.conf, the system consults /etc/hosts before performing a DNS request.
Use -L to follow the redirect (curl www.google.com reports that the page has moved). It has also been mentioned that when you connect via IP, the Host header doesn't get filled in with the hostname.
Well then, how about specifying the Host header yourself?
curl -L -H "Host: www.google.com" 173.194.34.115/doodles
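The same idea can be sketched in Python with the standard library's `http.client`: connect to the IP, but present the real name in the Host header. The `fetch_by_ip` helper is a hypothetical name for this example, and unlike `curl -L` it does not follow redirects; it only reports the Location header so you can see where the server points.

```python
import http.client


def fetch_by_ip(ip, host, path="/", port=80, timeout=10):
    """GET path from the server at ip, presenting host in the Host header.

    Returns (status, location, body); location is None when the
    response carries no Location header.
    """
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # Supplying our own Host header stops http.client from
        # generating one from the IP address.
        conn.request("GET", path, headers={"Host": host})
        response = conn.getresponse()
        return response.status, response.getheader("Location"), response.read()
    finally:
        conn.close()


# Example (makes a real request; the IP is the one from the curl command):
#   status, location, body = fetch_by_ip("173.194.34.115", "www.google.com", "/doodles")
```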
I'm trying to access Google over IPv6. However, it seems to want to send me back to IPv4! I did a DNS lookup on ipv6.google.com at http://centralops.net/co/ and found their IP, then tried this:
root@server:/logs# wget http://[2607:f8b0:4003:c00::6a]/
--2011-09-14 12:10:13-- http://[2607:f8b0:4003:c00::6a]/
Connecting to 2607:f8b0:4003:c00::6a:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.google.com/ [following]
--2011-09-14 12:10:14-- http://www.google.com/
Resolving www.google.com... 74.125.113.106, 74.125.113.147, 74.125.113.99, ...
Connecting to www.google.com|74.125.113.106|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html.2'
[ <=> ] 11,670 --.-K/s in 0.02s
2011-09-14 12:10:14 (474 KB/s) - `index.html.2' saved [11670]
How do I access Google (or other websites) solely over IPv6?
I tested Facebook as well, essentially same result (301 redirect).
The "identity" (origin) of a web site is determined by the hostname you access it by. This redirect may be simply to ensure the site works as intended (e.g. having access to your login session cookie), not specifically to reject IPv6 access.
Try adding an IPv6 address for www.google.com to your hosts file instead, or use wget --header="Host: www.google.com" http://[2607:f8b0:4003:c00::6a]/ to override the URL-determined Host header.
To avoid problems, Google announces its AAAA records only to DNS resolvers known to work.
From http://www.google.com/intl/en/ipv6/:
Google over IPv6 uses the IPv4 address of your DNS resolver to determine whether a network is IPv6-capable. If you enable Google over IPv6 for your resolver, IPv6 users of that resolver will receive AAAA records for IPv6-enabled Google services.
Find a carrier on the trusted testers list. Then lots of Google domains will be accessible over IPv6.
SixXS is on this list, for example, but you need to reconfigure your system to use their name servers.
Sorry, I don't know of any other carriers on this list.
At least for the search engine, the URL http://ipv6.google.com should work.
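To check whether your resolver hands out AAAA records for a given name, a small Python sketch along these lines may help. It asks the system resolver for IPv6 results only; `ipv6_addresses` and its `resolve` parameter are hypothetical names for this example, and an empty result can mean either that no AAAA records are visible to you or that resolution failed for another reason.

```python
import socket


def ipv6_addresses(hostname, resolve=socket.getaddrinfo):
    """Return the sorted IPv6 addresses the resolver reports for hostname."""
    try:
        results = resolve(hostname, 80, family=socket.AF_INET6,
                          proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # no IPv6 result available for this name
    # For IPv6, sockaddr is (address, port, flowinfo, scope_id).
    return sorted({entry[4][0] for entry in results})


# Example (performs real lookups; results depend on your resolver):
#   print(ipv6_addresses("ipv6.google.com"))
#   print(ipv6_addresses("www.google.com"))
```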