1. Prepare the data
Data-generation script: gendata.sh
#!/bin/bash
# Usage: gendata.sh <output_file> <tag>
# Appends 10,000,000 lines of the form "<i>,<tag><i>" to <output_file>.
file=$1
s=$2
touch "$file"
for ((i = 0; i < 10000000; i++))
do
    name="${i},${s}${i}"
    echo "$name" >> "$file"
done
echo 'show testdata'
head "$file"
Generate the data:
First create ten small files, 10 million records each:
bash gendata.sh name.txt name
bash gendata.sh zhuzhi.txt zhuzhi
bash gendata.sh minzu.txt minzu
bash gendata.sh jg.txt jg
bash gendata.sh gj.txt gj
bash gendata.sh dz.txt dz
bash gendata.sh abcd.txt abcd
bash gendata.sh efgh.txt efgh
bash gendata.sh xyz.txt xyz
bash gendata.sh opq.txt opq
total 1.8G
-rw-r--r-- 1 root root 189M Feb 9 10:35 abcd.txt
-rw-r--r-- 1 root root 170M Feb 9 10:32 dz.txt
-rw-r--r-- 1 root root 189M Feb 9 10:38 efgh.txt
-rw-r--r-- 1 root root 170M Feb 9 10:28 gj.txt
-rw-r--r-- 1 root root 170M Feb 9 10:25 jg.txt
-rw-r--r-- 1 root root 199M Feb 9 10:22 minzu.txt
-rw-r--r-- 1 root root 189M Feb 9 10:08 name.txt
-rw-r--r-- 1 root root 180M Feb 9 10:49 opq.txt
-rw-r--r-- 1 root root 180M Feb 9 10:41 xyz.txt
-rw-r--r-- 1 root root 208M Feb 9 10:19 zhuzhi.txt
Large file, 100 million records (presumably generated with the loop bound in gendata.sh raised to 100,000,000):
bash gendata.sh name1000.txt name
-rw-r--r-- 1 root root 2.1G Feb 9 10:50 name1000.txt
2. Test: 10 small files, 10 million records each (~180 MB per file), 100 million records / 1.8 GB in total, no optimization at all
Create the table in Hive:
create table hyl_test_par(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
Manually create the partition directory:
hadoop fs -mkdir -p /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003
Upload the data and fix ownership:
hadoop fs -put *.txt /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003/
hadoop fs -chown -R hive /apps/hive/warehouse/hyl_test_par/
Repair the partition metadata (the directory now exists on HDFS, but the metastore doesn't know about it yet):
0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;
+------------+--+
| partition |
+------------+--+
+------------+--+
No rows selected (0.125 seconds)
0: jdbc:hive2://cluster09.hzhz.co:10000> msck repair table hyl_test_par;
No rows affected (0.551 seconds)
0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;
+--------------------------------+--+
| partition |
+--------------------------------+--+
| sys_sj=20170209/sys_type=2003 |
+--------------------------------+--+
1 row selected (0.123 seconds)
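msck repair table scans the table's HDFS location and registers any partition directories that the metastore does not yet know about. The same partition could have been registered explicitly instead; an equivalent alternative (not what was run here) would be:
alter table hyl_test_par add if not exists partition(sys_sj='20170209', sys_type='2003');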
Test:
select count(*) from hyl_test_par where name <> ' ' ;
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (189.116 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (118.107 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (117.551 seconds)
Run 4:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (117.44 seconds)
Run 5:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (113.291 seconds)
====================================== mysterious divider =======================================
Re-ran in the afternoon: more than 10x faster!!! (Most plausibly warm Tez containers plus the OS page cache, rather than anything about the table itself.)
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.274 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.525 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.11 seconds)
Run 4:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.722 seconds)
3. Test: 10 small files, 10 million records each (~180 MB per file), 100 million records / 1.8 GB in total, on a table with ANALYZE'd partition statistics
In the same way, create the table for the ANALYZE test and load the data:
create table hyl_test_par_ana(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
hadoop fs -mkdir -p /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003
hadoop fs -put *.txt /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003/
hadoop fs -chown -R hive /apps/hive/warehouse/hyl_test_par_ana/
show partitions hyl_test_par_ana;
msck repair table hyl_test_par_ana;
show partitions hyl_test_par_ana;
Analyze the partition statistics:
0: jdbc:hive2://cluster09.hzhz.co:10000> analyze table hyl_test_par_ana partition(sys_sj=20170209,sys_type=2003) compute statistics ;
INFO : Session is already open
INFO : Dag name: analyze table hyl_test_par_ana ...statistics(Stage-0)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0021)
.
.
.
INFO : Partition default.hyl_test_par_ana{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]
No rows affected (126.81 seconds)
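The statistics gathered by ANALYZE are persisted in the metastore and can be read back at any time; a quick check (a hypothetical follow-up, not part of the original session) looks like this:
describe formatted hyl_test_par_ana partition(sys_sj='20170209', sys_type='2003');
-- the output should include the same numFiles, numRows, totalSize and
-- rawDataSize values that the ANALYZE job printed above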
Test:
select count(*) from hyl_test_par_ana;
0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_ana;
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (0.086 seconds)
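The sub-second answer is no accident: when statistics are fresh and stats-based answering is enabled (evidently the case on this cluster), Hive serves a bare count(*) straight from the metastore's numRows without launching a job. The relevant switch:
set hive.compute.query.using.stats=true;
-- with this on, a predicate-free count(*) reads numRows from the metastore;
-- adding any WHERE clause (as below) forces a real scan again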
Now the same query with a predicate:
select count(*) from hyl_test_par_ana where name <> ' ';
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (118.239 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (121.687 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (121.319 seconds)
====================================== mysterious divider =======================================
Re-ran in the afternoon: more than 10x faster!!!
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (10.923 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (6.058 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (6.45 seconds)
Run 4:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (6.218 seconds)
4. Test: one large file, 2.0 GB, 100 million records, on a table with no optimization
Create a table with the same schema backed by a single large file, and upload the data:
create table hyl_test_par_big(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
hadoop fs -mkdir -p /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003
hadoop fs -put name1000.txt /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003/
hadoop fs -chown -R hive /apps/hive/warehouse/hyl_test_par_big/
[hdfs@cluster13 tmp]$ hadoop fs -du -h /apps/hive/warehouse/hyl_test_par_big/
2.0 G /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209
show partitions hyl_test_par_big;
msck repair table hyl_test_par_big;
show partitions hyl_test_par_big;
Test:
select count(*) from hyl_test_par_big where name <> ' ';
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.356 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.3 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (5.861 seconds)
Run 4:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (5.675 seconds)
Run 5:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.814 seconds)
====================================== mysterious divider =======================================
Afternoon test:
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.933 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (4.435 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (5.868 seconds)
Run 4:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.403 seconds)
Run 5:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.814 seconds)
5. Test: 10 small files (~180 MB each), 100 million records in total, on a table populated via insert into
Load the data from the first table into a new table with INSERT INTO (Hive gathers statistics automatically during this step), then test. Note that the first attempt below uses CREATE TABLE AS SELECT, which turns the partition columns into ordinary data columns (hence, presumably, the larger totalSize of ~3.3 GB in its log):
create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);
0: jdbc:hive2://cluster09.hzhz.co:10000> create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);
INFO : Session is already open
INFO : Dag name: create table hyl_test_par_auto a...rand(123)(Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)
INFO : Map 1: -/- Reducer 2: 0/10
INFO : Map 1: 0/119 Reducer 2: 0/10
.
.
.
INFO : Moving data to directory hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto from hdfs://myBigdata/apps/hive/warehouse/.hive-staging_hive_2017-02-09_13-05-59_036_2906983779886430780-1/-ext-10001
INFO : Table default.hyl_test_par_auto stats: [numFiles=10, numRows=100000000, totalSize=3327777800, rawDataSize=3227777800]
[hdfs@cluster13 tmp]$ hadoop fs -ls -h hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003
Found 10 items
-rwxrwxrwx 2 hive hdfs 183.9 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000000_0
-rwxrwxrwx 2 hive hdfs 183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000001_0
-rwxrwxrwx 2 hive hdfs 183.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000002_0
-rwxrwxrwx 2 hive hdfs 184.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000003_0
-rwxrwxrwx 2 hive hdfs 183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000004_0
-rwxrwxrwx 2 hive hdfs 183.3 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000005_0
-rwxrwxrwx 2 hive hdfs 184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000006_0
-rwxrwxrwx 2 hive hdfs 184.0 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000007_0
-rwxrwxrwx 2 hive hdfs 184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000008_0
-rwxrwxrwx 2 hive hdfs 183.4 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000009_0
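distribute by rand(123) hashes every row to a random reducer, and each of the 10 reducers writes one output file, which is why the listing above shows 10 nearly identical ~184 MB files. The reducer count is therefore the knob that controls the output file count, e.g. (the same approach is used in part 7 below):
set mapred.reduce.tasks=10;
-- 10 reducers -> any following INSERT/CTAS with "distribute by rand(...)"
-- writes 10 evenly sized files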
Test:
select count(*) from hyl_test_par_auto where name <> ' ';
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (14.653 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (13.989 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (9.236 seconds)
drop table hyl_test_par_auto;
create table hyl_test_par_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);
0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);
INFO : Session is already open
INFO : Dag name: insert into table hyl_test_par_a...rand(123)(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)
INFO : Map 1: 0/119 Reducer 2: 0/10
.
.
.
INFO : Loading data to table default.hyl_test_par_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_13-26-47_915_3059621581501435386-1/-ext-10000
INFO : Partition default.hyl_test_par_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]
No rows affected (135.325 seconds)
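The stats line in the log above appears because hive.stats.autogather (on by default) computes basic statistics as part of the INSERT itself, so no separate ANALYZE pass is needed. A hypothetical double-check of what was stored:
set hive.stats.autogather=true;
-- on by default: INSERT/CTAS gathers numFiles/numRows/totalSize inline
describe formatted hyl_test_par_auto partition(sys_sj='20170209', sys_type='2003');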
Test:
select count(*) from hyl_test_par_auto where name <> ' ';
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.303 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.56 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.446 seconds)
Run 4:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.643 seconds)
0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_auto ;
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (0.098 seconds)
====================================== mysterious divider =======================================
Afternoon test:
Run 1:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.282 seconds)
Run 2:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.414 seconds)
Run 3:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (6.452 seconds)
Run 4:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (6.24 seconds)
6. Summary:
Time (s) \ Setup | 10*180MB*none | 10*180MB*analyze | 1*2GB*none | 10*180MB*insert into |
Run 1 | 11.274 | 10.923 | 11.933 | 11.282 |
Run 2 | 11.525 | 6.058 | 4.435 | 3.414 |
Run 3 | 11.11 | 6.45 | 5.868 | 6.452 |
Run 4 | 11.722 | 6.218 | 3.403 | 6.24 |
Average | 11.40775 | 7.41225 | 6.40975 | 6.847 |
Warm average (run 1 excluded) | 11.45233333 | 6.242 | 4.568666667 | 5.368666667 |
Takeaway: once caches are warm, gathered statistics, fewer/larger files, and insert into-built tables each roughly halve the query time compared with the raw unoptimized layout.
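For the statistics gathered in these tests to pay off at query time, the optimizer also has to be allowed to read them; assuming this Hive line supports them (they exist in Hive 1.x/2.x), the relevant switches are:
set hive.compute.query.using.stats=true;   -- answer count(*)/min/max from stats
set hive.stats.fetch.partition.stats=true; -- let the planner read partition stats
set hive.stats.fetch.column.stats=true;    -- let the planner read column stats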
7. Hunting for further optimizations:
A single file, on a table populated via insert into:
set mapred.reduce.tasks=1;
create table hyl_test_par_big_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);
0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: insert into table hyl_test_par_b...rand(123)(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0023)
INFO : Map 1: -/- Reducer 2: 0/1
INFO : Map 1: 0/119 Reducer 2: 0/1
.
.
.
INFO : Loading data to table default.hyl_test_par_big_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_big_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_14-30-05_480_3967529948260649900-1/-ext-10000
INFO : Partition default.hyl_test_par_big_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=1, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]
No rows affected (104.637 seconds)
Test:
select count(*) from hyl_test_par_big_auto where name <> ' ';
Runs:
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.744 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (4.188 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (4.041 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (5.198 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.788 seconds)
Time (s) \ Setup | 10*180MB*none | 10*180MB*analyze | 1*2GB*none | 10*180MB*insert into | 1*2GB*none*insert into |
Run 1 | 11.274 | 10.923 | 11.933 | 11.282 | 11.744 |
Run 2 | 11.525 | 6.058 | 4.435 | 3.414 | 4.188 |
Run 3 | 11.11 | 6.45 | 5.868 | 6.452 | 4.041 |
Run 4 | 11.722 | 6.218 | 3.403 | 6.24 | 5.198 |
Average | 11.40775 | 7.41225 | 6.40975 | 6.847 | 6.29275 |
Warm average (run 1 excluded) | 11.45233333 | 6.242 | 4.568666667 | 5.368666667 | 4.475666667 |
Optimization: column-level statistics
analyze table hyl_test_par_ana compute statistics for columns;
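Column-level statistics (min/max, null count, distinct count, column lengths) also land in the metastore; in this Hive line they can be read back per column (a hypothetical check, not part of the original session):
describe formatted hyl_test_par_ana name;
-- prints min/max, num_nulls, distinct_count and avg/max col_len for "name"
After that, the test query is repeated: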
select count(*) from hyl_test_par_ana where name <> ' ';
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.519 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (7.688 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (6.456 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (5.651 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.413 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.62 seconds)
Vectorized query execution: rows are processed in batches of 1024 instead of one at a time:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
select count(*) from hyl_test_par where name <> ' ';
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.54 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (3.859 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (4.134 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.844 seconds)
+------------+--+
| _c0 |
+------------+--+
| 100000000 |
+------------+--+
1 row selected (11.089 seconds)
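The erratic timings above (two runs back up at ~11 s) suggest vectorization never really kicked in: in this generation of Hive, vectorized execution generally applies only to ORC-backed tables, and all of the tables here are plain textfile. Whether a query actually vectorizes can be checked from the plan (a sketch, not part of the original session):
set hive.vectorized.execution.enabled=true;
explain select count(*) from hyl_test_par where name <> ' ';
-- a vectorized plan shows "Execution mode: vectorized" under the Map
-- operator tree; for a textfile table that line will typically be absent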