Hadoop Course Content

 
Introduction: The Motivation for Hadoop
  • Problems with traditional large-scale systems
  • Requirements for a new approach
 
Hadoop Basic Concepts
  • An Overview of Hadoop
  • The Hadoop Distributed File System
  • Hands-On Exercise
  • How MapReduce Works
  • Hands-On Exercise
  • Anatomy of a Hadoop Cluster
  • Other Hadoop Ecosystem Components
 
Writing a MapReduce Program
  • Examining a Sample MapReduce Program, with Several Examples
  • Basic API Concepts
  • The Driver Code
  • The Mapper
  • The Reducer
  • Hadoop's Streaming API
 
Delving Deeper Into The Hadoop API
  • More About ToolRunner
  • Testing with MRUnit
  • Reducing Intermediate Data With Combiners
  • The configure and close methods for Map/Reduce Setup and Teardown
  • Writing Partitioners for Better Load Balancing
  • Hands-On Exercise
  • Directly Accessing HDFS
  • Using the Distributed Cache
  • Hands-On Exercise
 
 
 
Performing Several Hadoop Jobs
  • The configure and close Methods
  • Sequence Files
  • Record Reader
  • Record Writer
  • Role of Reporter
  • Output Collector
  • Processing video files and audio files
  • Processing image files
  • Processing XML files
  • Counters
  • Directly Accessing HDFS
  • ToolRunner
  • Using The Distributed Cache
 
Common MapReduce Algorithms
  • Sorting and Searching
  • Indexing
  • Classification/Machine Learning
  • Term Frequency - Inverse Document Frequency
  • Word Co-Occurrence
  • Hands-On Exercise: Creating an Inverted Index
  • Identity Mapper
  • Identity Reducer
  • Exploring well known problems using MapReduce applications
 
Using HBase
  • What is HBase?
  • HBase API
  • Managing large data sets with HBase
  • Using HBase in Hadoop applications
  • Hands-On Exercise
 
Using Hive and Pig
  • Hive Basics
  • Pig Basics
  • Hands-On Exercise
 
Testing with MRUnit
  • Logging
  • Classification/Machine Learning
 
Advanced MapReduce Programming
  • A Recap of the MapReduce Flow
  • The Secondary Sort
  • Customized InputFormats and OutputFormats
  • Pipelining Jobs With Oozie
  • Map-Side Joins
  • Reduce-Side Joins
 
Joining Data Sets in MapReduce
  • Map-Side Joins
  • The Secondary Sort
  • Reduce-Side Joins
 
Monitoring and Debugging on a Production Cluster
  • Counters
  • Skipping Bad Records
  • Rerunning failed tasks with IsolationRunner
 
Tuning for Performance in MapReduce
  • Reducing network traffic with a combiner
  • Partitioners
  • Reducing the amount of input data
  • Using Compression
  • Reusing the JVM
  • Running with speculative execution
  • Refactoring code and rewriting algorithms
  • Parameters affecting performance
  • Other Performance Aspects