Installing And Using Hadoop

Using the Hadoop HDFS Filesystem

The HDFS filesystem is an abstraction layer on top of the local filesystem of the node(s). There are two built-in methods to access and manipulate the HDFS filesystem: the CLI and a web-based GUI.

HDFS CLI commands

The Hadoop HDFS filesystem CLI commands are part of the hadoop binary and are accessed using the dfs option (the fs option behaves the same way). Add the -fs option to override the filesystem specified in the config file, or pass -fs local to use the local filesystem.
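As a quick sketch of the -fs override (assuming the pseudo-cluster set up earlier in this tutorial; the guard just lets the script run harmlessly on machines where the hadoop binary is absent):

```shell
# Guard: skip quietly when the hadoop binary is not on the PATH.
if command -v hadoop >/dev/null 2>&1; then
    # Uses the filesystem named in the config file:
    hadoop dfs -ls /
    # Overrides the config and lists the local filesystem instead:
    hadoop dfs -fs local -ls /tmp
else
    echo "hadoop not found; skipping"
fi
```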

Listing files

Listing files is done with the -ls switch; optionally pass a path to list. The -lsr switch lists files recursively.

$ hadoop fs -ls /
Found 3 items
drwxrwxrwx   -          0 1969-12-31 18:00 /
drwxrwxrwx   -          0 1969-12-31 18:00 /tmp
drwxrwxrwx   -          0 1969-12-31 18:00 /user

Checking disk usage

Disk usage can be found with the -du option, which is similar to the Linux du command.

$ hadoop dfs -du /tmp
Found 2 items
1205891     hdfs://localhost:9000/tmp/feder16.txt
301987      hdfs://localhost:9000/tmp/hadoop-hadoop
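For comparison, the local Linux du command reports the same kind of per-path usage. The directory below is purely illustrative, and the -b flag (exact bytes, matching the style of the HDFS -du listing) is a GNU coreutils extension:

```shell
# Create a small directory to measure (illustrative only).
mkdir -p /tmp/du-demo
printf 'hello\n' > /tmp/du-demo/file.txt

du -s /tmp/du-demo           # total blocks used under the directory
du -b /tmp/du-demo/file.txt  # GNU du: exact size in bytes
```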

Creating directories

Creating directories is done with the -mkdir flag, with the directory to create passed as an argument.

$ hadoop dfs -mkdir /in

Copying and moving files

Copying and moving files are done with the -cp and -mv options respectively. The source and destination are passed, in that order, after the -cp or -mv.

$ hadoop dfs -cp /tmp/feder16.txt /in
$ hadoop dfs -mv /tmp/feder16.txt /tmp/feder16.txt.2  

Adding files to HDFS

Adding files to HDFS is done with the -put option, which was covered briefly in the pseudo-cluster section. The -copyFromLocal option is an alias for -put. Pass the arguments in source then destination order. There is also a -moveFromLocal option that deletes the local source file after it has been uploaded to HDFS.

$ hadoop dfs -put feder16.txt /
$ hadoop dfs -copyFromLocal feder16.txt /in

Retrieving files from HDFS

After running a job the output data needs to be downloaded from HDFS. The -get option is used for that; the -copyToLocal option is an alias for -get. To delete files from HDFS once they've been copied back to the local disk use the -moveToLocal option.

$ hadoop dfs -get /in/feder16.txt .
$ hadoop dfs -copyToLocal /in/feder16.txt .

Deleting files

Deleting files is done with the -rm option. To delete files and directories recursively use -rmr.

$ hadoop dfs -rm /in/feder16.txt
Deleted hdfs://localhost:9000/in/feder16.txt
$ hadoop dfs -rmr /in            
Deleted hdfs://localhost:9000/in

There are many more commands, including cat, tail, chmod, chown and stat; for a full list run hadoop dfs -help.
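A brief sketch of a few of these, using the /in/feder16.txt file from the earlier examples (guarded again so it is harmless where hadoop is not installed):

```shell
if command -v hadoop >/dev/null 2>&1; then
    hadoop dfs -cat /in/feder16.txt    # print the whole file to stdout
    hadoop dfs -tail /in/feder16.txt   # print the last kilobyte of the file
    hadoop dfs -stat /in/feder16.txt   # print basic status for the path
else
    echo "hadoop not found; skipping"
fi
```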

Web UI

In addition to the CLI, Hadoop HDFS has an integrated web interface available at http://localhost:50070/. This interface provides information about the NameNode and the ability to browse the filesystem.
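A quick way to check whether the NameNode web UI is answering (assuming the default port 50070 used throughout this tutorial; the command exits 0 either way):

```shell
# curl -s fails silently when nothing is listening on the port.
if curl -s -o /dev/null http://localhost:50070/; then
    echo "NameNode web UI is up"
else
    echo "NameNode web UI not reachable"
fi
```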
