## Hadoop (HDFS) Foreign Data Wrapper for PostgreSQL
This PostgreSQL extension implements a Foreign Data Wrapper (FDW) for Hadoop (HDFS).
Please note that this version of hdfs_fdw works with PostgreSQL and EDB Postgres Advanced Server 9.3, 9.4 and 9.5. Work is underway for certification with 9.6 Beta.
### What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. Detailed information can be found here. Hadoop can be downloaded from this location.
### What Is Apache Hive?
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. There are two versions of the Hive server, HiveServer1 and HiveServer2, which can be downloaded from this [site][4].
### Installation
To compile the Hadoop foreign data wrapper, the Hive C client library is needed. This library can be downloaded from Apache.
### Download and Install Thrift
- Download Thrift:

  ```
  wget http://apache.osuosl.org/thrift/0.9.2/thrift-0.9.2.tar.gz
  ```

- Extract Thrift:

  ```
  tar -zxvf thrift-0.9.2.tar.gz
  ```

- Compile and install Thrift:

  ```
  cd thrift-0.9.2
  ./configure
  make
  make install
  ```

- Compile and install fb303:

  ```
  cd thrift-0.9.2
  cd contrib/fb303
  ./bootstrap.sh
  ./configure
  make
  make install
  ```
The detailed installation manual can be found at http://thrift.apache.org/docs/install/
### Compile the HiveClient Library (libhive.so)
The libhive library files for HiveServer1 were downloaded from the sites listed below, and HiveServer2 support was then added to them. You do not need to download these files yourself; they are included in the libhive folder.
- https://svn.apache.org/repos/asf/hive/trunk/service/src/gen/thrift/gen-cpp/
- https://svn.apache.org/repos/asf/hive/trunk/odbc/src/cpp
- https://svn.apache.org/repos/asf/hive/trunk/ql/src/gen/thrift/gen-cpp/
- https://svn.apache.org/repos/asf/hive/trunk/metastore/src/gen/thrift/gen-cpp/
```
$ cd libhive
$ make
$ make install
```
### Compile and Install hdfs_fdw
- To build on POSIX-compliant systems, you need to ensure that the `pg_config` executable is in your path when you run `make`. This executable is typically in your PostgreSQL installation's `bin` directory. For example:

  ```
  $ export PATH=/usr/local/pgsql/bin/:$PATH
  ```

- Compile the code using make:

  ```
  $ make USE_PGXS=1
  ```

- Finally, install the foreign data wrapper:

  ```
  # make USE_PGXS=1 install
  ```
Please note that the hdfs_fdw extension has only been tested on Ubuntu systems, but it should work on other UNIX-like systems without any problems.
## How To Start Hadoop
Detailed installation instructions for Hadoop can be found on this site. Here are the steps to stop and start the Hadoop services.
#### HDFS on Single Node

```
# sbin/stop-dfs.sh
# sbin/start-dfs.sh
```
#### YARN on Single Node
```
# sbin/stop-yarn.sh
# sbin/start-yarn.sh
```
## How to Start HiveServer
#### Starting HiveServer1
```
cd /usr/local/hive/
bin/hive --service hiveserver -v
```
#### Starting HiveServer2
```
cd /usr/local/hive/
bin/hive --service hiveserver2 --hiveconf hive.server2.authentication=NOSASL
```
## Usage
The following parameters can be set on a HiveServer foreign server object:
- `host`: Address or hostname of the HiveServer. Defaults to `127.0.0.1`.
- `port`: Port number of the HiveServer. Defaults to `10000`.
- `client_type`: HiveServer1 or HiveServer2. Defaults to HiveServer1.
- `connect_timeout`: Connection timeout; the default value is 300 seconds.
- `query_timeout`: Query timeout; the default value is 300 seconds.
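For example, a server object for a HiveServer2 instance with non-default connection settings might be declared as follows. This is only an illustrative sketch: the server name `hdfs_server2` and the host and port values are hypothetical, not taken from the example later in this document.

```sql
-- Illustrative sketch: a server object for a HiveServer2 instance with
-- non-default connection settings. All FDW option values are passed as strings.
CREATE SERVER hdfs_server2
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host 'hive.example.com',
             port '10001',
             client_type 'HiveServer2',
             connect_timeout '120',
             query_timeout '600');
```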
The following parameters can be set on a Hive foreign table object:
- `dbname`: Name of the Hive database to query. This is a mandatory option.
- `table_name`: Name of the Hive table. Defaults to the same name as the foreign table.
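As a sketch of the table-level options, the foreign table below maps onto a Hive table whose name differs from the foreign table's name. The database name `hive_db`, the table name `web_logs_2016`, and the column list are hypothetical, and the server `hdfs_server` is assumed to have been created as in the full example further below.

```sql
-- Illustrative sketch: dbname is mandatory; table_name is only needed when
-- the Hive table's name differs from the foreign table's name.
CREATE FOREIGN TABLE weblogs_archive (
    client_ip TEXT,
    uri       TEXT
)
SERVER hdfs_server
OPTIONS (dbname 'hive_db', table_name 'web_logs_2016');
```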
The following example queries weblogs sample data stored in Hive from PostgreSQL.
Step 1: Download weblogs_parse and follow the instructions from this site.
Step 2: Upload the weblogs_parse.txt file using these commands.
```
hadoop fs -mkdir /weblogs
hadoop fs -mkdir /weblogs/parse
hadoop fs -put weblogs_parse.txt /weblogs/parse/part-00000
hadoop fs -cp /weblogs/parse/part-00000 /user/hive/warehouse/weblogs/
```
Step 3: Start HiveServer
```
bin/hive --service hiveserver -v
```
Step 4: Start the Hive client and connect to the HiveServer.

```
hive -h 127.0.0.1
```
Step 5: Create Table in Hive
```sql
CREATE TABLE weblogs (
    client_ip         STRING,
    full_request_date STRING,
    day               STRING,
    month             STRING,
    month_num         INT,
    year              STRING,
    hour              STRING,
    minute            STRING,
    second            STRING,
    timezone          STRING,
    http_verb         STRING,
    uri               STRING,
    http_status_code  STRING,
    bytes_returned    STRING,
    referrer          STRING,
    user_agent        STRING)
row format delimited
fields terminated by '\t';
```
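Before switching to PostgreSQL, it can be useful to confirm from the Hive client that the table actually sees the uploaded file. A simple count, assuming the table and data created in the steps above, is enough for a sanity check.

```sql
-- Sanity check from the Hive client: a non-zero count indicates that the
-- weblogs table picked up the uploaded weblogs_parse.txt data.
SELECT count(*) FROM weblogs;
```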
Now we are ready to use the weblogs table in PostgreSQL. We need to follow these steps:
```sql
-- load extension first time after install
CREATE EXTENSION hdfs_fdw;

-- create server object
CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host '127.0.0.1');

-- create user mapping
CREATE USER MAPPING FOR postgres
    SERVER hdfs_server;

-- create foreign table
CREATE FOREIGN TABLE weblogs
(
    client_ip         TEXT,
    full_request_date TEXT,
    day               TEXT,
    month             TEXT,
    month_num         INTEGER,
    year              TEXT,
    hour              TEXT,
    minute            TEXT,
    second            TEXT,
    timezone          TEXT,
    http_verb         TEXT,
    uri               TEXT,
    http_status_code  TEXT,
    bytes_returned    TEXT,
    referrer          TEXT,
    user_agent        TEXT
)
SERVER hdfs_server
OPTIONS (dbname 'db', table_name 'weblogs');
```
```
-- select from table
postgres=# SELECT DISTINCT client_ip IP, count(*)
           FROM weblogs GROUP BY IP HAVING count(*) > 5000;
       ip        | count
-----------------+-------
 683.615.622.618 | 13505
 14.323.74.653   | 16194
 13.53.52.13     |  5494
 361.631.17.30   | 64979
 363.652.18.65   | 10561
 325.87.75.36    |  6498
 322.6.648.325   | 13242
 325.87.75.336   |  6500
(8 rows)
```
```sql
CREATE TABLE premium_ip
(
    client_ip TEXT,
    category  TEXT
);

INSERT INTO premium_ip VALUES ('683.615.622.618','Category A');
INSERT INTO premium_ip VALUES ('14.323.74.653','Category A');
INSERT INTO premium_ip VALUES ('13.53.52.13','Category A');
INSERT INTO premium_ip VALUES ('361.631.17.30','Category A');
INSERT INTO premium_ip VALUES ('361.631.17.30','Category A');
INSERT INTO premium_ip VALUES ('325.87.75.336','Category B');
```
```
postgres=# SELECT hd.client_ip IP, pr.category, count(hd.client_ip)
           FROM weblogs hd, premium_ip pr
           WHERE hd.client_ip = pr.client_ip
           AND hd.year = '2011'
           GROUP BY hd.client_ip,pr.category;
       ip        |  category  | count
-----------------+------------+-------
 14.323.74.653   | Category A |  9459
 361.631.17.30   | Category A | 76386
 683.615.622.618 | Category A | 11463
 13.53.52.13     | Category A |  3288
 325.87.75.336   | Category B |  3816
(5 rows)
```
```
postgres=# EXPLAIN VERBOSE SELECT hd.client_ip IP, pr.category, count(hd.client_ip)
           FROM weblogs hd, premium_ip pr
           WHERE hd.client_ip = pr.client_ip
           AND hd.year = '2011'
           GROUP BY hd.client_ip,pr.category;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
 HashAggregate  (cost=221.40..264.40 rows=4300 width=64)
   Output: hd.client_ip, pr.category, count(hd.client_ip)
   Group Key: hd.client_ip, pr.category
   ->  Merge Join  (cost=120.35..189.15 rows=4300 width=64)
         Output: hd.client_ip, pr.category
         Merge Cond: (pr.client_ip = hd.client_ip)
         ->  Sort  (cost=60.52..62.67 rows=860 width=64)
               Output: pr.category, pr.client_ip
               Sort Key: pr.client_ip
               ->  Seq Scan on public.premium_ip pr  (cost=0.00..18.60 rows=860 width=64)
                     Output: pr.category, pr.client_ip
         ->  Sort  (cost=59.83..62.33 rows=1000 width=32)
               Output: hd.client_ip
               Sort Key: hd.client_ip
               ->  Foreign Scan on public.weblogs hd  (cost=100.00..10.00 rows=1000 width=32)
                     Output: hd.client_ip
                     Remote SQL: SELECT client_ip FROM weblogs WHERE ((year = '2011'))
```
## Regression
To execute the regression tests, follow the steps below.
- Open /etc/hosts and add the following line (the IP address is that of the Hive server machine).

  ```
  127.0.0.1 hive.server
  ```

- Run the Hive server without NOSASL authentication using the following command.

  ```
  ./hive --service hiveserver2
  ```

- Load the sample data for the test cases by running the following script.

  ```
  hdfs_fdw/test/insert_hive.sh
  ```

- Run the Hive server with NOSASL authentication using the following command.

  ```
  ./hive --service hiveserver2 --hiveconf hive.server2.authentication=NOSASL
  ```

- Execute the regression tests from the hdfs_fdw directory using the following command.

  ```
  make installcheck
  ```
## TODO
- Hadoop Installation Instructions
- Write-able support
- Flume support
- Regression test cases
## Contributing
If you experience a bug, please create a new issue; if you have a fix for it, please create a pull request. Before submitting a bugfix or new feature, please read the contributing guidelines.
## Support
This project will be modified to maintain compatibility with new PostgreSQL and EDB Postgres Advanced Server releases.
If you require commercial support, please contact the EnterpriseDB sales team, or check whether your existing PostgreSQL support provider can also support hdfs_fdw.
### License
Copyright (c) 2011 - 2016, EnterpriseDB Corporation
Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies.
See the LICENSE file for full details.