Hadoop Distributed File System Introduction

Introduction

  • Built for Scale
    • Fault-tolerant and designed to be deployed on low-cost hardware
    • Hundreds or thousands of server machines, each storing part of the file system’s data.
    • Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
    • Designed for batch processing rather than interactive use by users. 
    • Emphasizes high throughput of data access rather than low latency of data access.
    • A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.
  • Simple Coherency Model
    • HDFS applications need a write-once-read-many access model for files. 
    • A file, once created, written, and closed, need not be changed. 
    • This assumption simplifies data coherency issues and enables high-throughput data access. 
    • A MapReduce application or a web crawler application fits this model perfectly (see the command-line sketch after this list). 
  • Moving Computation is Cheaper than Moving Data
    • A computation requested by an application is much more efficient if it is executed near the data it operates on. 
    • This minimizes network congestion and increases the overall throughput of the system.
    • HDFS provides interfaces for applications to move themselves closer to where the data is located.
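
To make the write-once-read-many model concrete, here is a minimal command-line sketch. It assumes a running cluster and an existing /user/test directory; notes.txt is a placeholder file name:

echo "first version" > notes.txt
hdfs dfs -put notes.txt /user/test/notes.txt      # create and write the file once
hdfs dfs -cat /user/test/notes.txt                # read it as many times as needed
hdfs dfs -put notes.txt /user/test/notes.txt      # fails: destination exists (write-once)
hdfs dfs -put -f notes.txt /user/test/notes.txt   # -f replaces the file wholesale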

Architecture

  • HDFS has a master/slave architecture. 
  • Single NameNode in a cluster:
    • manages the file system namespace
    • regulates access to files by clients. 
    • executes file system namespace operations like opening, closing, and renaming files and directories
    • determines the mapping of blocks to DataNodes.
  • A number of DataNodes:
    • Usually one per node in the cluster; each manages the storage attached to the node it runs on. 
    • Responsible for serving read and write requests from the file system’s clients. 
    • Also perform block creation, deletion, and replication upon instruction from the NameNode.
  • HDFS exposes a file system namespace and allows user data to be stored in files. 
  • Storage
    • Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes (see the fsck sketch below). 
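
The block-to-DataNode mapping that the NameNode maintains can be inspected with fsck. A sketch, assuming the /user/test/input/README.txt file used in the command examples below:

hdfs fsck /user/test/input/README.txt -files -blocks -locations   # show each block and the DataNodes holding its replicas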

Install Hadoop and start services 

brew install hadoop

# Set in .bashrc
alias hstart="/usr/local/Cellar/hadoop/3.2.1/sbin/start-all.sh"
alias hstop="/usr/local/Cellar/hadoop/3.2.1/sbin/stop-all.sh"
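
Before the first start, the NameNode's metadata directory has to be initialized once. A minimal sketch, assuming the Homebrew layout and the hstart alias above:

hdfs namenode -format    # one-time: initialize the NameNode metadata directory
hstart                   # start the HDFS and YARN daemons via the alias above
jps                      # confirm NameNode, DataNode, and other daemons are running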

Basic commands

# If the Hadoop binaries are not on your PATH:
cd /usr/local/Cellar/hadoop/3.2.1/bin

hdfs dfs -put <localsrc> ... <HDFS_dest_Path>    # copy one or more local files into HDFS

hdfs dfs -cat /user/test/input/README.txt    # print a file's contents
hdfs dfs -ls /user/test/input/README.txt     # show the file's listing

hdfs getconf -confKey fs.defaultFS           # print the default file system URI
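
Putting the basics together, a hypothetical round trip (all paths and file names are placeholders):

hdfs dfs -mkdir -p /user/test/input                     # create a directory tree
hdfs dfs -put README.txt /user/test/input/              # upload a local file
hdfs dfs -ls /user/test/input                           # list the directory
hdfs dfs -get /user/test/input/README.txt ./copy.txt    # download it back
hdfs dfs -rm /user/test/input/README.txt                # remove it from HDFS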
