Hadoop exposes its status interface over HTTP, using DNS names.
Amazon uses internal DNS names in its AWS cloud.
Therefore, if you try to access the Hadoop status interface from outside the cloud (from your office PC, for example), you will fail.
This guide suggests a solution to this problem, using a clever workaround.
Problem Description
Hadoop builds its status HTML pages and links from the hostname of the machine it runs on. This hostname is the machine's internal DNS name, which is resolvable only from within the AWS cloud, using the AWS internal DNS server:
cat /etc/resolv.conf
nameserver 172.16.0.23
domain ec2.internal
search ec2.internal
The problem arises when you try to access the status page from your development box (office, home, mobile, telepathic...). Your machine will try to resolve ip-10-202-30-119.ec2.internal using your ISP's DNS server, which will obviously fail.
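If you want to confirm the symptom on your own machine, a quick check like the one below can help. It is only a sketch: the function name can_resolve is made up for this example, and it assumes getent is available (it is on most Linux systems). Run from outside AWS, the internal name from this guide should come back as unresolvable.

```shell
#!/bin/sh
# can_resolve HOSTNAME   (hypothetical helper name)
# Prints "resolvable" if the local resolver can map HOSTNAME to an address,
# "unresolvable" otherwise (also when getent itself is missing).
can_resolve() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "resolvable"
  else
    echo "unresolvable"
  fi
}

# Example, using the internal name from this guide:
#   can_resolve ip-10-202-30-119.ec2.internal
```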
Solution
The solution combines several cool tricks, with a SOCKS5 proxy at its core. Let's get started:
- Install Firefox 3 http://www.mozilla.com/firefox/
- Install FoxyProxy Standard, a proxy management add-on for Firefox, from https://addons.mozilla.org/en-US/firefox/addon/2464/
- Configure FoxyProxy as follows:
- Right-click the FoxyProxy icon and choose Options
- Choose "Add New Proxy"
General
Proxy Name = AWS Internal
Proxy Details
Select * Manual proxy configuration
Host or IP Address = localhost Port = 6666
[v] SOCKS proxy? (x) SOCKS v5
URL Patterns
Create 2 patterns: *ec2.internal*, *domu*.internal*
* Tip: You might want to consider a 3rd pattern if you are, for example, in Israel... *pandora.com*
- Make sure you have selected "Use proxies based on their pre-defined patterns and priorities" from the FoxyProxy options menu.
- This step differs for Windows and Linux users
Windows
//TODO// Configure PuTTY with dynamic tunneling.
Linux
nohup ssh -CqTnN -D 6666 ubuntu@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com &
OK... now try to access
http://ip-10-202-30-119.ec2.internal:50070/dfshealth.jsp and you should get a nice GUI that lets you browse your HDFS file system.
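If you would rather check the tunnel from the command line before touching the browser, something along these lines should work. This is a sketch and not part of the original setup: the function names start_tunnel and check_via_tunnel and the DRY_RUN switch are invented for illustration, and it uses ssh -f to send the tunnel to the background instead of nohup. curl's --socks5-hostname option hands the hostname to the proxy, so the internal name is resolved on the EC2 side and not by your ISP.

```shell
#!/bin/sh
SOCKS_PORT=6666   # same local port as in the FoxyProxy settings

# start_tunnel USER@HOST   (hypothetical helper name)
# Opens a SOCKS5 listener on localhost:$SOCKS_PORT, tunneled over SSH.
# With DRY_RUN=1 the command is printed instead of executed.
start_tunnel() {
  # -f  go to background after authentication
  # -N  no remote command, forwarding only
  # -D  dynamic (SOCKS) port forwarding on the given local port
  # -C  compress traffic    -q  quiet mode
  # -T  no pseudo-terminal  -n  read stdin from /dev/null
  set -- ssh -f -CqTnN -D "$SOCKS_PORT" "$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# check_via_tunnel URL   (hypothetical helper name)
# Fetches URL through the tunnel and prints only the HTTP status code.
check_via_tunnel() {
  curl --silent --show-error --max-time 10 \
       --socks5-hostname "localhost:$SOCKS_PORT" \
       --output /dev/null --write-out '%{http_code}\n' "$1"
}

# Example (host names are the placeholders used in this guide):
#   start_tunnel ubuntu@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
#   check_via_tunnel http://ip-10-202-30-119.ec2.internal:50070/dfshealth.jsp
```

A status code of 200 from the second call would suggest the tunnel and the remote DNS lookup both work, and any remaining trouble is on the Firefox/FoxyProxy side.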
How it works
You are using SSH to create a SOCKS proxy tunneled over an SSH connection. On one hand, your SSH client accepts SOCKS5 proxy requests on port 6666 (localhost) and forwards them to the host inside the AWS cloud. On the other hand, Firefox is set up (via FoxyProxy) to send any request whose URL matches one of the patterns for internal AWS host names through that proxy. So what happens is the following:
Firefox: "Hmmm, I need to do an HTTP GET for ip-10-202-30-119.ec2.internal. Who should I be talking to? Oh! I know, FoxyProxy tells me I should delegate this to my friend, the SOCKS5 proxy at localhost:6666. Cool."
Firefox: Hey localhost:6666 (socks5 proxy), do me a favor, get me the content for ip-10-202-30-119.ec2.internal:50070
localhost:6666: Hmmm, ok.
localhost:6666 -> ssh -> ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com: HTTP GET http://ip-10-202-30-119.ec2.internal:50070
ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com: Hey, I know him. Let me do a quick HTTP GET on his ass
ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com: HTTP GET 10.202.30.119:50070
ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com -> ssh -> localhost:6666 -> Firefox
This completes the cycle. At this point (a few steps skipped for clarity) you, the user, can see the content returned by the HTTP GET request.
Mission accomplished.
Note that other DNS addresses (those not in the AWS EC2 cloud) are not served over the SOCKS proxy. For example, if you go to http://google.com, the browser connects directly. That is exactly what you want to happen, because traffic that goes via the proxy has its costs (bandwidth, latency).
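The routing decision described above can be sketched in a few lines of shell. This is an illustration and not FoxyProxy's actual code: the function name route_for_url is made up, and lowercasing the URL first is an assumption that mimics case-insensitive matching, so that host names containing "domU" still match the *domu*.internal* pattern.

```shell
#!/bin/sh
# route_for_url URL   (hypothetical helper name)
# Prints "proxy" if URL matches one of the two wildcard patterns configured
# in FoxyProxy earlier in this guide, "direct" otherwise.
route_for_url() {
  # Assumption: matching is case-insensitive, so lowercase the URL first.
  url=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$url" in
    *ec2.internal*|*domu*.internal*) echo "proxy" ;;
    *) echo "direct" ;;
  esac
}

route_for_url "http://ip-10-202-30-119.ec2.internal:50070/dfshealth.jsp"  # prints: proxy
route_for_url "http://google.com"                                         # prints: direct
```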
Hope you enjoyed this, comments (as always) are welcome.
Maxim.