Package | Description |
---|---|
org.apache.hadoop | |
org.apache.hadoop.conf | Configuration of system parameters. |
org.apache.hadoop.filecache | |
org.apache.hadoop.fs | An abstract file system API. |
org.apache.hadoop.fs.ftp | |
org.apache.hadoop.fs.kfs | A client for the Kosmos filesystem (KFS). |
org.apache.hadoop.fs.permission | |
org.apache.hadoop.fs.s3 | A distributed, block-based implementation of FileSystem that uses Amazon S3 as a backing store. |
org.apache.hadoop.fs.s3native | A distributed implementation of FileSystem for reading and writing files on Amazon S3. |
org.apache.hadoop.fs.shell | |
org.apache.hadoop.http | |
org.apache.hadoop.io | Generic i/o code for use when reading and writing data to the network, to databases, and to files. |
org.apache.hadoop.io.compress | |
org.apache.hadoop.io.compress.bzip2 | |
org.apache.hadoop.io.compress.zlib | |
org.apache.hadoop.io.file.tfile | |
org.apache.hadoop.io.retry | A mechanism for selectively retrying methods that throw exceptions under certain circumstances. |
org.apache.hadoop.io.serializer | This package provides a mechanism for using different serialization frameworks in Hadoop. |
org.apache.hadoop.ipc | Tools to help define network clients and servers. |
org.apache.hadoop.ipc.metrics | |
org.apache.hadoop.log | |
org.apache.hadoop.mapred | A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) built of commodity hardware in a reliable, fault-tolerant manner. |
org.apache.hadoop.mapred.jobcontrol | Utilities for managing dependent jobs. |
org.apache.hadoop.mapred.join | Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map. |
org.apache.hadoop.mapred.lib | Library of generally useful mappers, reducers, and partitioners. |
org.apache.hadoop.mapred.lib.aggregate | Classes for performing various counting and aggregations. |
org.apache.hadoop.mapred.lib.db | org.apache.hadoop.mapred.lib.db Package |
org.apache.hadoop.mapred.pipes | Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce. |
org.apache.hadoop.mapred.tools | |
org.apache.hadoop.mapreduce | |
org.apache.hadoop.mapreduce.lib.input | |
org.apache.hadoop.mapreduce.lib.map | |
org.apache.hadoop.mapreduce.lib.output | |
org.apache.hadoop.mapreduce.lib.partition | |
org.apache.hadoop.mapreduce.lib.reduce | |
org.apache.hadoop.metrics | This package defines an API for reporting performance metric information. |
org.apache.hadoop.metrics.file | Implementation of the metrics package that writes the metrics to a file. |
org.apache.hadoop.metrics.ganglia | Implementation of the metrics package that sends metric data to Ganglia. |
org.apache.hadoop.metrics.jvm | |
org.apache.hadoop.metrics.spi | The Service Provider Interface for the Metrics API. |
org.apache.hadoop.metrics.util | |
org.apache.hadoop.net | Network-related classes. |
org.apache.hadoop.record | Hadoop record I/O contains classes and a record description language translator for simplifying serialization and deserialization of records in a language-neutral manner. |
org.apache.hadoop.record.compiler | This package contains classes needed for code generation from the hadoop record compiler. |
org.apache.hadoop.record.compiler.ant | |
org.apache.hadoop.record.compiler.generated | This package contains code generated by JavaCC from the Hadoop record syntax file rcc.jj. |
org.apache.hadoop.record.meta | |
org.apache.hadoop.security | |
org.apache.hadoop.security.authorize | |
org.apache.hadoop.util | Common utilities. |
org.apache.hadoop.util.bloom | |
org.apache.hadoop.util.hash | |
Package | Description |
---|---|
org.apache.hadoop.examples | Hadoop example code. |
org.apache.hadoop.examples.dancing | This package is a distributed implementation of Knuth's dancing links algorithm that can run under Hadoop. |
org.apache.hadoop.examples.terasort | This package consists of 3 map/reduce applications for Hadoop to compete in the annual terabyte sort competition. |
Package | Description |
---|---|
org.apache.hadoop.streaming | Hadoop Streaming is a utility which allows users to create and run Map-Reduce jobs with any executables (e.g. Unix shell utilities) as the mapper and/or the reducer. |
Package | Description |
---|---|
org.apache.hadoop.contrib.utils.join | |
Package | Description |
---|---|
org.apache.hadoop.contrib.index.example | |
org.apache.hadoop.contrib.index.lucene | |
org.apache.hadoop.contrib.index.main | |
org.apache.hadoop.contrib.index.mapred | |
Package | Description |
---|---|
org.apache.hadoop.contrib.failmon | |
Hadoop is a software framework that lets one easily write and run applications that process vast amounts of data. It primarily consists of the Hadoop Distributed FileSystem (HDFS) and an implementation of the Map-Reduce programming paradigm.
If your platform does not have the required software (such as ssh and rsync), you will have to install it. For example, on Ubuntu Linux:
```sh
$ sudo apt-get install ssh
$ sudo apt-get install rsync
```
On Windows, if you did not install the required software when you installed Cygwin, start the Cygwin installer and select the missing packages (openssh, for example).
First, you need to get a copy of the Hadoop code.
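For example, to unpack a downloaded release tarball (the version number here is only an illustration; use whatever release you obtained):

```sh
tar xzf hadoop-0.20.2.tar.gz   # hypothetical release tarball
cd hadoop-0.20.2
```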
Edit the file conf/hadoop-env.sh to define at least JAVA_HOME.
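For example, the line might look like this (the JDK path below is an assumption; point it at your own Java installation):

```sh
# In conf/hadoop-env.sh -- set JAVA_HOME to the root of your JDK install.
# The path below is an assumption; substitute your own.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```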
Try the following command:
```sh
bin/hadoop
```

This will display the documentation for the Hadoop command script.
By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
Create an input directory containing a few text files, then run one of the example programs, as sketched below. The grep example will display counts for each match of the regular expression.
Note that input is specified as a directory containing input files and that output is also specified as a directory where parts are written.
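A minimal sketch of the standalone run (the jar name and regular expression match the pseudo-distributed run shown later; using conf/*.xml as sample input is an assumption, and any small text files will do):

```sh
mkdir input
cp conf/*.xml input   # assumed sample input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
cat output/*          # the part files written to the output directory
```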
For pseudo-distributed operation, you must specify the JobTracker (MapReduce master) host and port. This is specified with the configuration property mapred.job.tracker.
(We also set the HDFS replication level to 1 in order to reduce warnings when running on a single node.)
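A minimal sketch of such a configuration, written as shell so it is self-contained (the conf file names assume the 0.20-era layout, and localhost:9001 is a conventional but assumed JobTracker address):

```sh
# Hypothetical sketch: point MapReduce at a local JobTracker and set the
# HDFS replication level to 1 for a single-node setup.
cat > conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF

cat > conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```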
Now check that the command `ssh localhost` does not require a password. If it does, execute the following commands:
```sh
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
```
A new distributed filesystem must be formatted with the following command, run on the master node:
```sh
bin/hadoop namenode -format
```
The Hadoop daemons are started with the following command:
```sh
bin/start-all.sh
```
Daemon log output is written to the logs/ directory.
Input files are copied into the distributed filesystem as follows:
```sh
bin/hadoop fs -put input input
```
Things are run as before, but output must be copied locally to examine it:
```sh
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output   # copy the results out of HDFS
cat output/*
```

When you're done, stop the daemons with:

```sh
bin/stop-all.sh
```
Fully distributed operation is just like the pseudo-distributed operation described above, except that the master daemons' host and port settings (such as mapred.job.tracker) must name your master node rather than localhost.
Finally, list all slave hostnames or IP addresses in your conf/slaves file, one per line. Then format your filesystem and start your cluster on your master node, as above.
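For example, a conf/slaves file for a three-worker cluster might look like this (the hostnames are hypothetical):

```
slave1.example.com
slave2.example.com
slave3.example.com
```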
Copyright © 2010 The Apache Software Foundation