Java Bytecode Assembly, Interesting concepts in the art of Computer Science/Software Development, maybe from time to time other pesky corp. related development issues.

Monday, August 9, 2010

Accessing the Hadoop web interface on AWS

Hadoop exposes its status interface over HTTP, and the pages it serves refer to machines by DNS name.
Amazon uses internal DNS names inside the AWS cloud.

Therefore, if you try to access the Hadoop status interface from outside the cloud (from your office PC, for example), you will fail.

This guide suggests a solution to this problem, using a clever workaround.

Problem Description

Hadoop builds its status HTML pages and links from the current machine's hostname. That hostname is the machine's internal DNS name, which is resolvable only from within the AWS cloud, using the AWS internal DNS server:
cat /etc/resolv.conf 
domain ec2.internal
search ec2.internal 
The problem appears when you try to access the status page from your development box (office, home, mobile, telepathic...). Your machine will try to resolve ip-10-202-30-119.ec2.internal using your ISP's DNS server, which will obviously fail.
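As a small illustration of why the name is useless outside the cloud, here is a sketch in Python. It assumes the usual EC2 naming convention, where ip-A-B-C-D.ec2.internal encodes the private IPv4 address A.B.C.D. The helper name private_ip_from_internal_name is made up for this example. Even if your ISP could resolve the name, the address behind it is a private one that is not routable from your office.

```python
import ipaddress
import re

# Assumption: EC2 internal names of the form ip-A-B-C-D.ec2.internal
# encode the instance's private IPv4 address A.B.C.D.
_INTERNAL_NAME = re.compile(r"^ip-(\d+)-(\d+)-(\d+)-(\d+)\.ec2\.internal$")


def private_ip_from_internal_name(hostname):
    """Return the IPv4 address encoded in an EC2 internal name, or None."""
    match = _INTERNAL_NAME.match(hostname.lower())
    if match is None:
        return None
    try:
        return ipaddress.ip_address(".".join(match.groups()))
    except ValueError:
        # Octets out of range, so this is not a real address.
        return None


if __name__ == "__main__":
    addr = private_ip_from_internal_name("ip-10-202-30-119.ec2.internal")
    # The address falls in 10.0.0.0/8, which is private address space.
    print(addr, addr.is_private)
```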


The solution combines several cool tricks, all built on top of a SOCKS5 proxy. Let's get started:

  1. Install Firefox 3
  2. Install FoxyProxy Standard, a proxy management add-on for Firefox.
  3. Configure FoxyProxy as follows:

    1. Right-click the FoxyProxy icon and choose Options

    2. Choose "Add New Proxy"

      Proxy Name = AWS Internal

      Proxy Details
      Select * Manual proxy configuration
      Host or IP Address = localhost Port = 6666
      [v] SOCKS proxy? (x) SOCKS v5

      URL Patterns
      Create 2 patterns: *ec2.internal*, *domu*.internal*
      * Tip: You might want to consider a 3rd pattern if you are, for example, from Israel...
    3. Make sure you have selected "Use proxies based on their pre-defined patterns and priorities" from the FoxyProxy options menu.
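  4. This step differs for Windows and Linux users.
    Windows: //TODO// Configure PuTTY with dynamic tunneling.
    Linux: run the ssh command below, which opens a SOCKS proxy on local port 6666 (-D 6666). Append the user and host of a machine inside AWS that you can SSH into.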


    nohup ssh -CqTnN -D 6666
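Before pointing the browser at the tunnel, it can help to confirm that something is actually listening on the local SOCKS port. The following is a minimal sketch in Python, not part of the original setup. The helper name is_port_open is made up for this example, and it only checks that the port accepts TCP connections, not that the SOCKS proxy behind it works.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Port 6666 is the local port passed to ssh -D in the step above.
    if is_port_open("localhost", 6666):
        print("Something is listening on localhost:6666")
    else:
        print("Nothing is listening on localhost:6666, is the ssh tunnel up?")
```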

OK... now try to access http://ip-10-202-30-119.ec2.internal:50070/dfshealth.jsp. You should get a nice GUI that lets you browse your HDFS file system.

How it works

You are creating a SOCKS proxy tunneled over SSH. On one side, your ssh client accepts SOCKS5 proxy requests on port 6666 (localhost) and forwards them to the host inside the AWS cloud. On the other side, Firefox is set (using FoxyProxy) to send any request whose URL matches the patterns for internal AWS host names through that proxy. So what happens is the following:

Firefox: "Hmmm, I need to do an HTTP GET for ip-10-202-30-119.ec2.internal. Who should I be talking to? OH! I know, FoxyProxy tells me I should delegate this to my friend, the SOCKS5 proxy at localhost:6666. Cool."
Firefox: "Hey localhost:6666 (SOCKS5 proxy), do me a favor, get me the content for ip-10-202-30-119.ec2.internal:50070."
localhost:6666: "Hmmm, OK."
localhost:6666 -> ssh -> host inside AWS: "HTTP GET http://ip-10-202-30-119.ec2.internal:50070? Hey, I know him. Let me do a quick HTTP GET on him."
HTTP response -> ssh -> localhost:6666 -> Firefox

This completes the cycle. At this point (I skipped a few steps for clarity) you, the user, can see the content returned by the HTTP GET request.
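The detail that makes the internal name work is that the proxy request can carry the host name itself, so the name is resolved on the AWS side of the tunnel. Here is a sketch in Python of the SOCKS5 CONNECT request (as defined in RFC 1928) for the NameNode address above. The function name socks5_connect_request is made up for this example. It also assumes the browser is configured to hand the host name to the proxy rather than resolve it locally, which is a browser or FoxyProxy setting and not something the sketch can verify.

```python
import struct


def socks5_connect_request(hostname, port):
    """Build a SOCKS5 CONNECT request that addresses the target by name."""
    name = hostname.encode("ascii")
    if not 0 < len(name) <= 255:
        raise ValueError("host name must be 1..255 bytes long")
    # VER=0x05, CMD=0x01 (CONNECT), RSV=0x00, ATYP=0x03 (domain name),
    # then a one-byte name length, the name, and the port in network order.
    header = b"\x05\x01\x00\x03"
    return header + bytes([len(name)]) + name + struct.pack("!H", port)


if __name__ == "__main__":
    request = socks5_connect_request("ip-10-202-30-119.ec2.internal", 50070)
    # The unresolved host name travels inside the request as plain bytes.
    print(request)
```

Because the name is sent as-is, your local machine never has to resolve it.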

Mission accomplished.

Note that other DNS addresses (those not in the AWS EC2 cloud) are not served over the SOCKS proxy. For example, if you go to a public website, the browser connects to it directly. That is exactly what you want to happen, because traffic that goes through the proxy has its costs (bandwidth, latency).
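The routing decision can be approximated in a few lines of Python. This is only a sketch: it uses shell-style wildcards from the standard fnmatch module with the two patterns from the FoxyProxy step, and FoxyProxy's own matching rules may differ in details such as case handling. The function name uses_proxy and the domU and example.com URLs are made up for illustration.

```python
from fnmatch import fnmatchcase

# The two wildcard patterns configured in FoxyProxy earlier in this post.
PROXY_PATTERNS = ["*ec2.internal*", "*domu*.internal*"]


def uses_proxy(url):
    """Return True if the URL matches one of the proxy patterns."""
    lowered = url.lower()  # assumption: matching is case-insensitive
    return any(fnmatchcase(lowered, pattern) for pattern in PROXY_PATTERNS)


if __name__ == "__main__":
    for url in (
        "http://ip-10-202-30-119.ec2.internal:50070/dfshealth.jsp",
        "http://domU-12-31-39-00-A1-B2.compute-1.internal:50030/",
        "http://example.com/",
    ):
        route = "SOCKS proxy at localhost:6666" if uses_proxy(url) else "direct"
        print(url, "->", route)
```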

Hope you enjoyed this, comments (as always) are welcome.

About Me

Tel Aviv, Israel
I work in one of Israel's hi-tech companies. I do five-nines system development with some system administration. This blog will focus on the technical side of things, with some philosophical strings attached.