Apache Hadoop is a framework for performing large-scale distributed computations in a cluster. Unlike MPI (which we discussed here), it is based on Java technologies. It may thus be slower than MPI, but it can reap the full benefit of the rich libraries and programming environment of the Java ecosystem (just think about all the things we already did in this course). One of the most well-known ways to use Hadoop is to perform computations following the MapReduce pattern (which is loosely similar to scatter/gather/reduce in MPI).
Let us now briefly compare the use cases of MPI versus MapReduce with Hadoop. MPI is the technology of choice if
- communication is expensive and the bottleneck of our application,
- frequent communication is required between processes solving related sub-problems,
- the available hardware is homogeneous (and we can use an MPI implementation optimized for it),
- processes need to be organized in groups or topological structures to make efficient use of collective communication to achieve high performance,
- the size of the data that needs to be transmitted is small in comparison to the runtime of the computations, and when
- we do not need to worry much about exchanging data with a heterogeneous distributed application environment.
Hadoop, on the other hand, covers use cases where
- communication is not the bottleneck, because computation takes much longer than communication (think Machine Learning), when
- the environment is heterogeneous,
- processes do not need to be organized in a special way and
- the division of tasks into sub-problems can be done efficiently by just slicing the input data into equal-sized pieces, where
- sub-problems have batch job character, where
- data is unstructured (e.g., text) and potentially huge (eating away the advantages of MPI-style communication), or where
- data comes from and results must be pushed back to other applications in the environment, say to HTTP/Java Servlet/Web Service stacks.
The first example is the `wordCount` project, which is based on the famous word counting MapReduce example in the version provided by Luca Menichetti meniluca@gmail.com under the GNU General Public License version 2. The original version of the example is nicely discussed in this blog entry. Further, similar explanations are given here and here.
This example takes one or multiple text files in the folder `input` and produces a word count overview in the folder `output`. The paths to these two folders are its command line parameters. Here we use an older version of the description of this course as input. The mapper will receive tuples of the form:
<Integer (line number), Text (the line contents)>
The mapper will split each line into words (ignoring punctuation marks). For each word of the line, it will emit a tuple
<Text (the word), WriteableInteger (always value 1)>
If a word occurs multiple times in a line, one token is emitted for each occurrence. After this mapping step, the reducer will receive tokens such as
<Text (the word), Iterable<WriteableInteger> (number of occurrences)>
Hadoop has put all the `WriteableInteger` values generated by the mapping step which belong to the same `Text` (the word) key into an `Iterable` list for us. Thus, for each word that the mapper has discovered, we get a list of numbers. All we have to do is to add them up and emit tuples of the form:
<Text (the word), WriteableInteger (total number of occurrences)>
The reducer here also acts as a combiner, meaning that a reduction step is also performed locally before the results of each mapper are sent to the central reduction step. This way we can already add up some word counts locally, and the amount of data that needs to be sent to the central reducer decreases, as tuples for the same word are already merged. This is possible in this simple form because the output format of the reducer is the same as the output format of the mapper, except that the `WriteableInteger` part will not necessarily have value `1` afterwards.
After the reduction step, we therefore know how often each word occurred in the text. Furthermore, since the tuples are sorted automatically before reduction, the word/occurrences list is also nicely sorted alphabetically.
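To make this data flow more concrete, below is a minimal sketch of such a mapper and reducer written against Hadoop's `org.apache.hadoop.mapreduce` API. It is not the exact code of the `wordCount` project: the class names, the tokenization rule, and the use of Hadoop's `IntWritable` (which plays the role of the `WriteableInteger` above) are illustrative assumptions, but the tuple flow matches the description above.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Mapper sketch: for every word of a line, emit the tuple <word, 1>. */
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key: position of the line in the input; value: the line contents.
    // Split the line into words, ignoring punctuation and other non-letters.
    for (String token : value.toString().split("[^\\p{L}]+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // one tuple per occurrence of the word
      }
    }
  }
}

/** Reducer sketch (also usable as combiner): sum all counts for one word. */
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get(); // add up the per-occurrence (or pre-combined) counts
    }
    total.set(sum);
    context.write(key, total); // emit <word, total number of occurrences>
  }
}
```

Because the reducer's input and output tuple types are identical, the same class can also be registered as combiner in the job setup (via `job.setCombinerClass(...)`), which is exactly the local pre-reduction discussed above; a driver sketch showing this wiring follows the run instructions further below.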
The second example, `webFinder`, uses MapReduce to find the interconnections between different web sites. It accepts a list of URLs as input, i.e., one or multiple text file(s) where each line represents a single URL. The program has three command line parameters: the input folder, the output folder, and the maximum depth limit for spidering the websites. This limit is an optional parameter and `1` by default, meaning that for each base URL, one level of links is followed. The mapper receives tuples of type
<Integer (line number), Text (the line contents, i.e., the URL)>
Each URL is loaded recursively, similar to how a bot from a search engine would do it. When reading a web page (using Java's URLConnection object), we look for links/usages of other resources, such as images, CSS, JavaScript, links, frames, and iframes. The latter three are traced recursively up to a given maximum depth (the third, optional, command line parameter of the program). For each linked resource, the mapper will output a tuple
<Text (with the URL to the resource), Text (with the base URL from the input)>
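As a rough illustration of this mapping step, the sketch below loads a single page and emits one such tuple per discovered link. It is deliberately simplified and is not the actual `webFinder` mapper: it uses a crude regular expression instead of real HTML parsing, ignores frames and iframes, does not recurse up to the depth limit, and the class name is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Simplified sketch: load one page and emit <resource URL, base URL> pairs. */
class LinkMapper extends Mapper<LongWritable, Text, Text, Text> {
  // very rough pattern for href="..." and src="..." attributes
  private static final Pattern LINK =
      Pattern.compile("(?:href|src)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String base = value.toString().trim(); // one URL per input line
    if (base.isEmpty()) {
      return;
    }
    try {
      URLConnection connection = new URL(base).openConnection();
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(connection.getInputStream()))) {
        String line;
        while ((line = in.readLine()) != null) {
          Matcher matcher = LINK.matcher(line);
          while (matcher.find()) {
            // resolve relative links against the base URL and emit the tuple
            String resource = new URL(new URL(base), matcher.group(1)).toString();
            context.write(new Text(resource), new Text(base));
          }
        }
      }
    } catch (IOException e) {
      // dodgy links that lead nowhere: just log and continue
      System.err.println("Cannot read " + base + ": " + e);
    }
  }
}
```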
This means that the reducer receives elements of the form
<Text (with the URL to the resource), Iterable<Text> (all base URLs linking to the resource)>
The reduction step is very simple: all input pairs that contain just a single element in their values, i.e., which stand for resources linked from only a single one of the originally specified URLs, are discarded. This leaves only resources linked from multiple input URLs; these resources are thus shared amongst different sites. The reducer returns tuples of
<Text (with the URL to the shared resource), List<Text> (with all originally specified sites linking to the shared resource)>
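A reduction step implementing this filtering could look roughly like the following sketch (again, the class name is illustrative and this is not necessarily how the `webFinder` project implements it): it collects the distinct base URLs for each resource and only emits a tuple if more than one site links to the resource.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Keep only resources that are linked from more than one of the input sites. */
class SharedResourceReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text resource, Iterable<Text> baseUrls, Context context)
      throws IOException, InterruptedException {
    List<String> sites = new ArrayList<>();
    for (Text base : baseUrls) {
      String url = base.toString();
      if (!sites.contains(url)) { // the same site may link a resource repeatedly
        sites.add(url);
      }
    }
    if (sites.size() > 1) { // discard resources used by only a single site
      context.write(resource, new Text(sites.toString()));
    }
  }
}
```

Writing the collected list via its `toString()` representation would produce output in a `[site1, site2]` form similar to the listing shown further below.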
This will help us to understand how different websites are connected and which resources are vital, i.e., which dependencies are needed by several sites to work properly. We can test this program (see point 2.5) with an example input as shown below and maximum depth `1`. Running the program will take some time.
http://www.bing.com
http://www.tudou.com
http://www.youku.com
http://www.qq.com
http://www.baidu.com
http://www.sogou.com
The final output (stored in `output/part-r-00000` and obtained via `bin/hdfs dfs -copyToLocal output/part-r-00000 ./list.txt`) for the above input and maximum depth `1`, at the time of this writing, is this:
http://c.youku.com/aboutcn/youtu [http://www.tudou.com, http://www.youku.com]
http://c.youku.com/abouteg/youku [http://www.tudou.com, http://www.youku.com]
http://c.youku.com/abouteg/youtu [http://www.tudou.com, http://www.youku.com]
http://cbjs.baidu.com/js/m.js [http://www.baidu.com, http://www.qq.com]
http://css.tudouui.com/skin/__g/img/sprite.gif [http://www.tudou.com, http://www.youku.com]
http://events.youku.com/global/scripts/jquery-1.8.3.js [http://www.tudou.com, http://www.youku.com]
http://events.youku.com/global/scripts/youku.js [http://www.tudou.com, http://www.youku.com]
http://images.china.cn/images1/ch/appxz/2.jpg [http://www.qq.com, http://www.youku.com]
http://images.china.cn/images1/ch/appxz/3.jpg [http://www.qq.com, http://www.youku.com]
http://js.tudouui.com/v3/dist/js/lib_6.js [http://www.tudou.com, http://www.youku.com]
http://mail.qq.com [http://www.baidu.com, http://www.qq.com]
http://minisite.youku.com/mini_common/urchin.js [http://www.tudou.com, http://www.youku.com]
http://player.youku.com/jsapi [http://www.tudou.com, http://www.youku.com]
http://qzone.qq.com [http://www.baidu.com, http://www.qq.com]
http://res.mfs.ykimg.com/051000004D92DF6197927339BA04E210.js [http://www.tudou.com, http://www.youku.com]
http://static.youku.com/user/img/avatar/80/5.jpg [http://www.tudou.com, http://www.youku.com]
http://static.youku.com/user/img/avatar/80/9.jpg [http://www.tudou.com, http://www.youku.com]
http://weibo.com [http://www.baidu.com, http://www.qq.com]
http://www.12377.cn [http://www.baidu.com, http://www.qq.com, http://www.youku.com]
http://www.12377.cn/node_548446.htm [http://www.qq.com, http://www.youku.com]
http://www.bjjubao.org [http://www.baidu.com, http://www.youku.com]
http://www.china.com.cn/player/video.js [http://www.qq.com, http://www.youku.com]
http://www.ellechina.com [http://www.qq.com, http://www.youku.com]
http://www.hao123.com [http://www.baidu.com, http://www.qq.com]
http://www.hd315.gov.cn/beian/view.asp?bianhao=010202006082400023 [http://www.tudou.com, http://www.youku.com]
http://www.miibeian.gov.cn [http://www.qq.com, http://www.tudou.com, http://www.youku.com]
http://www.miibeian.gov.cn/publish/query/indexFirst.action [http://www.tudou.com, http://www.youku.com]
http://www.pclady.com.cn [http://www.baidu.com, http://www.qq.com]
http://www.qq.com [http://www.baidu.com, http://www.qq.com]
http://www.shjbzx.cn [http://www.qq.com, http://www.tudou.com]
http://www.tudou.com [http://www.tudou.com, http://www.youku.com]
http://www.tudou.com/about/cn [http://www.tudou.com, http://www.youku.com]
http://www.tudou.com/about/en [http://www.tudou.com, http://www.youku.com]
http://www.youku.com [http://www.baidu.com, http://www.tudou.com, http://www.youku.com]
http://www.youku.com/show_page/id_z8dc3fdeedcb911e3a705.html [http://www.tudou.com, http://www.youku.com]
http://y.qq.com [http://www.baidu.com, http://www.qq.com]
https://www.alipay.com [http://www.baidu.com, http://www.youku.com]
In other words, we can see that some of the top sites in China link to each other, as tudou and youku do extensively. Several pages also share common resources: qq and youku both use http://www.china.com.cn/player/video.js and http://images.china.cn/images1/ch/appxz/2.jpg, for instance. If we had added more links to check or increased the search depth, we would probably have found more relationships.
By the way, we also find lots of dodgy links which lead to nowhere or contain illegal characters, causing some warnings to be logged by our system.
If you import one of the example projects in this folder in Eclipse, it may first show you a lot of errors. (I recommend using Eclipse Mars or later.) These projects are Maven projects, so you should "update" them first in Eclipse by doing the following. Let's say you want to import the `wordCount` project:
- Make sure that you can see the `package view` on the left-hand side of the Eclipse window.
- Right-click on the project (`wordCount`) in the `package view`.
- In the opening pop-up menu, left-click on `Maven`.
- In the opening sub-menu, left-click on `Update Project...`.
- In the opening window...
  - Make sure the project (`wordCount`) is selected.
  - Make sure that `Update project configuration from pom.xml` is selected.
  - You can also select `Clean projects`.
  - Click `OK`.
- Now the structure of the project in the `package view` should slightly change, the project will be re-compiled, and the errors should disappear.
Now you can actually build the imported project(s), i.e., generate a `jar` file that you can pass to Hadoop. Let's say you want to build the `wordCount` project.
- Make sure that you can see the `package view` on the left-hand side of the Eclipse window.
- Right-click on the project (`wordCount`) in the `package view`.
- In the opening pop-up menu, choose `Run As`.
- In the opening sub-menu, choose `Run Configurations...`.
- In the opening window, choose `Maven Build`.
- In the new window `Run Configurations`/`Create, manage, and run configurations`, choose `Maven Build` in the small white pane on the left side.
- Click `New launch configuration` (the first symbol from the left on top of the small white pane).
- Write a useful name for this configuration in the `Name` field. You can use this configuration again later.
- In the tab `Main`, enter the `Base directory` of the project; this is the folder called `hadoop/wordCount` containing the Eclipse/Maven project.
- Under `Goals`, enter `clean compile package`. This will build a `jar` archive.
- Click `Apply`.
- Click `Run`.
- The build will start, and you will see its status output in the console window.
- The folder `target` will contain a file `wordCount-full.jar` after the build. This is the executable archive with our application.
Under Linux, you can also simply run `make_linux.sh` in this project's folder to build the example without Eclipse, given that you have Maven installed.
In order to test our example, we now need to set up a single-node Hadoop cluster. For this, we follow the guide given at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html. Here we provide the installation guide for Hadoop 2.7.2 under Linux/Ubuntu.
Here we discuss how to download and unpack Hadoop.
- Install the prerequisites by running `sudo apt-get install ssh rsync`.
- Go into a base folder where you want to install Hadoop. Let's call this folder `X`.
- Download Hadoop from one of the mirrors provided at http://www.apache.org/dyn/closer.cgi/hadoop/common/. I choose http://www-eu.apache.org/dist/hadoop/common/ and from there hadoop-2.7.2, from where I download hadoop-2.7.2.tar.gz into `X`. If you chose a different Hadoop version, replace `2.7.2` accordingly in the following steps.
- Once hadoop-2.7.2.tar.gz has fully been downloaded, I can either do `Extract Here` in the file explorer or run `tar -xf hadoop-2.7.2.tar.gz` in the terminal window to extract the archive.
- A new folder named `X/hadoop-2.7.2` should have appeared.
- In order to run Hadoop, you must have `JAVA_HOME` set correctly. Open the file `X/hadoop-2.7.2/etc/hadoop/hadoop-env.sh`. Find the line `export JAVA_HOME=${JAVA_HOME}` and replace it with `export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which javac))))`.
We can now test whether everything above has turned out well and all is downloaded, unpacked, and set up correctly.
- In the terminal, enter `X/hadoop-2.7.2/` and execute the command `bin/hadoop`. It should display some help and command line options.
- We can further test whether Hadoop works by running the single-node example from the tutorial. Therefore, in your terminal enter

  ```
  mkdir input
  cp etc/hadoop/*.xml input
  bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
  cat output/*
  ```

  The third command should produce a lot of logging output and the last one should say something like `1 dfsadmin`. If that is the case, you are doing well.
To really use Hadoop in a pseudo-distributed fashion on our local computer, we have to do more:
- Enter the directory `X/hadoop-2.7.2/etc/hadoop` in order to create the basic configuration.
- Open the file `core-site.xml` in the text editor. It should already exist; if it does not, something is wrong, but you can try creating it. Remove everything in the file and store the following text, then save and close the file. In other words, the complete contents of the file should become:

  ```xml
  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>
  ```

- Open the file `hdfs-site.xml` in the text editor. It should already exist; if it does not, something is wrong, but you can try creating it. Remove everything in the file and store the following text, then save and close the file. In other words, the complete contents of the file should become:

  ```xml
  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>
  ```
In order to run Hadoop in a pseudo-distributed fashion, we need to enable passwordless SSH connections to the local host.
- In the terminal, execute `ssh localhost` to test whether you can open a secure shell connection to your current, local computer without needing a password.
- It may say something like:

  ```
  ssh: connect to host localhost port 22: Connection refused
  ```

  If it does say this (i.e., you did not install the prerequisites…), then do `sudo apt-get install ssh` and it may say something like

  ```
  Reading package lists... Done
  Building dependency tree
  Reading state information... Done
  The following extra packages will be installed:
    libck-connector0 ncurses-term openssh-server openssh-sftp-server ssh-import-id
  Suggested packages:
    rssh molly-guard monkeysphere
  The following NEW packages will be installed:
    libck-connector0 ncurses-term openssh-server openssh-sftp-server ssh ssh-import-id
  0 upgraded, 6 newly installed, 0 to remove and 0 not upgraded.
  Need to get 661 kB of archives.
  After this operation, 3,528 kB of additional disk space will be used.
  Do you want to continue? [Y/n] y
  ...
  Setting up ssh-import-id (4.5-0ubuntu1) ...
  Processing triggers for ufw (0.34-2) ...
  Setting up ssh (1:6.9p1-2ubuntu0.2) ...
  Processing triggers for libc-bin (2.21-0ubuntu4.1) ...
  Processing triggers for systemd (225-1ubuntu9.1) ...
  Processing triggers for ureadahead (0.100.0-19) ...
  ```

  OK, now you've got SSH installed. Do `ssh localhost` again.
- It may ask you something like

  ```
  The authenticity of host 'localhost (127.0.0.1)' can't be established.
  ECDSA key fingerprint is SHA256:HZUVFF77GAh5cF/sg8YhjRf1gSGJ9ui5ksdf2GAl5Ha.
  Are you sure you want to continue connecting (yes/no)?
  ```

  If it does ask you this, just type `yes` and hit enter (it may then say something like `Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.`). If it does not ask you this, it does not matter.
- The important thing is the next step: if it asks you something like `xyz@localhost's password:`, hit `Ctrl-C` and do the things below. Otherwise, you can directly skip to the next point 2.4.5. So, if you were asked for a password, enter the following into your terminal:

  ```
  ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  chmod 0600 ~/.ssh/authorized_keys
  ```

- You will see some text such as `Generating public/private dsa key pair.` followed by a couple of other things. After completing the above commands, you should test the result by again executing `ssh localhost`. You will now no longer be asked for a password and directly receive a welcome message, something like `Welcome to Ubuntu 15.10 (GNU/Linux 4.2.0-35-generic x86_64)` or whatever Linux distribution you use. Via an ssh connection, you can, basically, open a terminal to and run commands on a remote computer (which, in this case, is your own, current computer). You can return to the normal (non-ssh) terminal by entering `exit` and pressing return, after which you will be notified that `Connection to localhost closed.`
We now want to test whether our installation and setup works correctly by further following the steps given in the tutorial.
- Format the HDFS file system by entering `bin/hdfs namenode -format` followed by return. You will receive a lot of log output.
- Start the `NameNode` and `DataNode` daemons by running `sbin/start-dfs.sh`. You may get some logging output messages, which *may* be followed by something like

  ```
  The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
  ECDSA key fingerprint is SHA256:HZUVFF77GAh5cF/sg8YhjRf1gSGJ9ui5ksdf2GAl5Ha.
  Are you sure you want to continue connecting (yes/no)?
  ```

  which you would answer with `yes` followed by a hit to the enter button. If, after that, you get a message like `0.0.0.0: packet_write_wait: Connection to 127.0.0.1: Broken pipe`, enter `sbin/stop-dfs.sh`, hit return, and do `sbin/start-dfs.sh` again.
- In your web browser, open http://localhost:50070/. It should display a web page giving an overview of the Hadoop system now running on your local computer.
- Now we can set up the required stuff for the example jobs (making HDFS directories and copying the input files). Make sure to replace `<userName>` with your user/login name on your current machine.

  ```
  bin/hdfs dfs -mkdir /user
  bin/hdfs dfs -mkdir /user/<userName>
  bin/hdfs dfs -put etc/hadoop input
  ```

- We can now run the job via

  ```
  bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
  ```

- We obtain the output of the job via

  ```
  bin/hdfs dfs -get output output
  cat output/*
  ```

- Like in the local test from point 2.4.2, after a lot of log output, we should see something like `1 dfsadmin`.
- Finally, we need to shut down Hadoop by running `sbin/stop-dfs.sh`.
We now want to run one of the provided examples. Let us assume we want to run the `wordCount` example. For other examples, just replace `wordCount` with their names in the following text. I assume that the `distributedComputingExamples` repository is located in a folder `Y` on your machine.
- Open a terminal and enter your Hadoop installation folder. I assume you installed Hadoop version `2.7.2` into a folder named `X`, so you would `cd` into `X/hadoop-2.7.2/`.
- We want to start with a "clean" file system, so let us repeat some of the setup steps. Don't forget to replace `<userName>` with your local login/user name.

  ```
  bin/hdfs namenode -format
  ```

  (answer with `Y` when asked whether to re-format the file system)

  ```
  sbin/start-dfs.sh
  bin/hdfs dfs -mkdir /user
  bin/hdfs dfs -mkdir /user/<userName>
  ```

  If you actually properly cleaned up the file system after running your last examples (see the second-to-last step here), you just need to do `sbin/start-dfs.sh` and do not need to format the HDFS.
- Copy the input data of the example into HDFS. You find this data in the example folder `Y/distributedComputingExamples/hadoop/wordCount/input`. So you will perform `bin/hdfs dfs -put Y/distributedComputingExamples/hadoop/wordCount/input input`. Make sure to replace `Y` with the proper path. If copying fails, go to "2.6. Troubleshooting".
- Do `bin/hdfs dfs -ls input` to check if the files have properly been copied.
- You can now do `bin/hadoop jar Y/distributedComputingExamples/hadoop/wordCount/target/wordCount-full.jar input output`. This command will start the main class of the example, which resides in the fat jar `wordCount-full.jar`, with the parameters `input` and `output` (a sketch of what such a main class can look like is shown right after this list). `input` here is the input folder, which we previously have copied to the Hadoop file system. `output` is the output folder to be created. If you execute this command, you will see lots of logging information.
- Do `bin/hdfs dfs -ls output`. You will see output like

  ```
  Found 2 items
  -rw-r--r--   1 tweise supergroup          0 2016-04-22 18:48 output/_SUCCESS
  -rw-r--r--   1 tweise supergroup        303 2016-04-22 18:48 output/part-r-00000
  ```

- You can read the results via `bin/hdfs dfs -cat output/part-r-00000 | less`, which will result - in the case of the `wordCount` example - in something like

  ```
  A            1
  API          4
  Actually     2
  All          1
  Apache       1
  As           2
  Axis         1
  Based        1
  Both         1
  By           1
  C            2
  CSS          2
  Calls        1
  Cascading    1
  Communication 1
  Control      2
  Datagram     1
  Description  2
  Each         1
  Everything   2
  Extensible   1
  Finally      1
  For          3
  ```

- You can download the result file to your local folder via `bin/hdfs dfs -copyToLocal output/part-r-00000 .`. Now you will find the text file with the results in the current folder, i.e., `X/hadoop-2.7.2/`.
- After being done, you should clean up the file system by deleting the `input` and `output` folder in the HDFS (not the original input folder of the project!). This way, you do not need to format the HDFS for the next example. Anyway, you do:

  ```
  bin/hdfs dfs -rm -R input
  bin/hdfs dfs -rm -R output
  ```

- Finally, shut down the system by calling `sbin/stop-dfs.sh`.
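For completeness, here is a minimal sketch of what the main class started by the `bin/hadoop jar ... input output` command could look like. It is not the actual driver of the `wordCount` project; the class names are the illustrative ones from the mapper/reducer sketch earlier in this document. It merely shows how the `input` and `output` command line parameters end up as the input and output paths of a Hadoop `Job`, and how the reducer is also registered as combiner.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hypothetical driver: wires mapper and reducer together and takes the
 *  input/output folders from the command line, as in the steps above. */
public class WordCountDriver {

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);     // from the sketch further above
    job.setCombinerClass(WordCountReducer.class);  // local pre-reduction (combiner)
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // the "input" argument
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // the "output" argument

    System.exit(job.waitForCompletion(true) ? 0 : 1); // run and wait for the job
  }
}
```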
Sometimes, you may try to copy a file or folder to HDFS and get an error stating that no such file or directory exists. In that case, do the following:
- Execute `sbin/stop-dfs.sh`.
- Delete the folder `/tmp/hadoop-<userName>`, where `<userName>` is to be replaced with your local login/user name.
- Now perform

  ```
  bin/hdfs namenode -format
  sbin/start-dfs.sh
  bin/hdfs dfs -mkdir /user
  bin/hdfs dfs -mkdir /user/<userName>
  ```

- If you now repeat the operation that failed before, it should succeed.
Some of the examples take some inspiration from the maven-hadoop-java-wordcount-template by H3ml3t, for which no licensing information is provided. Our examples, however, are entirely different in several ways, for instance in the way we build fat jars. Anyway, this original project is nicely described in this blog entry.
Furthermore, our `wordCount` example is based on the well-known word counting example for Hadoop's MapReduce functionality, namely on the version provided by Luca Menichetti meniluca@gmail.com under the GNU General Public License version 2.