Big Data. Report on Laboratory Work No. 1 for the course "Big Data and Cloud Technologies"
Sevastopol State University (Federal State Autonomous Educational Institution of Higher Education)

Department of Information Systems
Povkh Andrey Anatolyevich
Institute of Information Technologies and Control in Technical Systems

Year 2, group ИСм-18-1-о

09.04.02 Information Systems and Technologies (master's level)

Report

on Laboratory Work No. 1

for the course "Big Data and Cloud Technologies"

on the topic "Investigating Ways of Using the Apache Hadoop Ecosystem"
Practicum supervisor
Senior Lecturer V. A. Stroganov

Sevastopol, 2019
The objective of this work is to study the purpose of the main components of the Apache Hadoop ecosystem and to investigate ways of using the Apache Hadoop ecosystem to query structured data.


Table 1. Task variants

Task number
Output condition
all orders with the status "Complete"
Output using the scan function


After the initial configuration of the virtual machine (Cloudera QuickStart VM version 5.13 was used), an attempt was made to import the data from the MySQL DBMS into HDFS using the Apache Sqoop utility with the following command:

sqoop import-all-tables \
-m 1 \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--compression-codec=snapy \
--as-avrodatafile \

This produced the following output:

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

19/09/11 12:02:01 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0

19/09/11 12:02:02 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

19/09/11 12:02:03 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

19/09/11 12:02:07 INFO tool.CodeGenTool: Beginning code generation

19/09/11 12:02:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1

19/09/11 12:02:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1

19/09/11 12:02:07 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce

Note: /tmp/sqoop-cloudera/compile/885be4e58d989ac5705c6d3cc6cb6f94/categories.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

19/09/11 12:02:19 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/885be4e58d989ac5705c6d3cc6cb6f94/categories.jar

19/09/11 12:02:19 WARN manager.MySQLManager: It looks like you are importing from mysql.

19/09/11 12:02:19 WARN manager.MySQLManager: This transfer can be faster! Use the --direct

19/09/11 12:02:19 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.

19/09/11 12:02:19 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)

19/09/11 12:02:19 INFO mapreduce.ImportJobBase: Beginning import of categories

19/09/11 12:02:19 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

19/09/11 12:02:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

19/09/11 12:02:22 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: com.cloudera.sqoop.io.UnsupportedCodecException: snapy
As the log shows, the job failed with an IOException caused by the compression codec name snapy, which Sqoop does not recognize (UnsupportedCodecException); the name is most likely a misspelling of snappy. Since compression is not essential here, the import was retried without it:

sqoop import-all-tables -m 1 --connect "jdbc:mysql://quickstart:3306/retail_db" --username=retail_dba --password=cloudera --as-avrodatafile --warehouse-dir=/user/hive/warehouse
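The codec failure in the first attempt came from a one-letter typo in the codec name. The sketch below shows one way such a name could be checked before launching a long-running import. The list `KNOWN_CODECS` and the helper `check_codec` are hypothetical and exist only for illustration; the set of names a given Sqoop build actually accepts may differ.

```python
import difflib

# Assumption: an illustrative list of codec short names. The names actually
# accepted depend on the Sqoop/Hadoop build and are not taken from the report.
KNOWN_CODECS = ["snappy", "gzip", "bzip2", "deflate"]


def check_codec(name):
    """Return (is_known, suggestion) for a codec short name."""
    if name in KNOWN_CODECS:
        return True, None
    # Look for the closest known name to hint at a probable typo.
    close = difflib.get_close_matches(name, KNOWN_CODECS, n=1)
    return False, (close[0] if close else None)


if __name__ == "__main__":
    ok, hint = check_codec("snapy")
    if not ok:
        print("unknown codec 'snapy'; did you mean %r?" % hint)
```

A check of this kind only guards against misspellings; it does not confirm that the codec libraries are installed on the cluster.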


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:02:07 INFO mapreduce.JobSubmitter: number of splits:1

19/09/12 00:02:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0002

19/09/12 00:02:08 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0002

19/09/12 00:02:08 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0002/

19/09/12 00:02:08 INFO mapreduce.Job: Running job: job_1568180694054_0002

19/09/12 00:02:59 INFO mapreduce.Job: Job job_1568180694054_0002 running in uber mode : false

19/09/12 00:02:59 INFO mapreduce.Job: map 0% reduce 0%

19/09/12 00:03:53 INFO mapreduce.Job: map 100% reduce 0%

19/09/12 00:03:54 INFO mapreduce.Job: Job job_1568180694054_0002 completed successfully

19/09/12 00:03:55 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=172206

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=87

HDFS: Number of bytes written=1032483

HDFS: Number of read operations=4

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Other local map tasks=1

Total time spent by all maps in occupied slots (ms)=51787

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=51787

Total vcore-milliseconds taken by all map tasks=51787

Total megabyte-milliseconds taken by all map tasks=53029888

Map-Reduce Framework

Map input records=12435

Map output records=12435

Input split bytes=87

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1166

CPU time spent (ms)=9850

Physical memory (bytes) snapshot=149233664

Virtual memory (bytes) snapshot=1511247872

Total committed heap usage (bytes)=50921472

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=1032483

19/09/12 00:03:55 INFO mapreduce.ImportJobBase: Transferred 1,008.2842 KB in 118.8311 seconds (8.485 KB/sec)

19/09/12 00:03:55 INFO mapreduce.ImportJobBase: Retrieved 12435 records.

19/09/12 00:03:55 INFO tool.CodeGenTool: Beginning code generation

19/09/12 00:03:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1

19/09/12 00:03:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce

Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/departments.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

19/09/12 00:03:58 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/departments.jar

19/09/12 00:03:58 INFO mapreduce.ImportJobBase: Beginning import of departments

19/09/12 00:03:58 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

19/09/12 00:03:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1

19/09/12 00:03:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1

19/09/12 00:03:58 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/departments.avsc

19/09/12 00:03:58 INFO client.RMProxy: Connecting to ResourceManager at /

19/09/12 00:03:58 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:04:00 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:04:06 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:04:08 INFO db.DBInputFormat: Using read commited transaction isolation

19/09/12 00:04:08 INFO mapreduce.JobSubmitter: number of splits:1

19/09/12 00:04:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0003

19/09/12 00:04:10 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0003

19/09/12 00:04:10 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0003/

19/09/12 00:04:10 INFO mapreduce.Job: Running job: job_1568180694054_0003

19/09/12 00:04:55 INFO mapreduce.Job: Job job_1568180694054_0003 running in uber mode : false

19/09/12 00:04:55 INFO mapreduce.Job: map 0% reduce 0%

19/09/12 00:05:36 INFO mapreduce.Job: map 100% reduce 0%

19/09/12 00:05:38 INFO mapreduce.Job: Job job_1568180694054_0003 completed successfully

19/09/12 00:05:38 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=171312

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=87

HDFS: Number of bytes written=450

HDFS: Number of read operations=4

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Other local map tasks=1

Total time spent by all maps in occupied slots (ms)=37974

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=37974

Total vcore-milliseconds taken by all map tasks=37974

Total megabyte-milliseconds taken by all map tasks=38885376

Map-Reduce Framework

Map input records=6

Map output records=6

Input split bytes=87

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=833

CPU time spent (ms)=3600

Physical memory (bytes) snapshot=138702848

Virtual memory (bytes) snapshot=1510150144

Total committed heap usage (bytes)=50921472

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=450

19/09/12 00:05:38 INFO mapreduce.ImportJobBase: Transferred 450 bytes in 99.9828 seconds (4.5008 bytes/sec)

19/09/12 00:05:38 INFO mapreduce.ImportJobBase: Retrieved 6 records.

19/09/12 00:05:38 INFO tool.CodeGenTool: Beginning code generation

19/09/12 00:05:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1

19/09/12 00:05:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce

Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/order_items.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

19/09/12 00:05:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/order_items.jar

19/09/12 00:05:42 INFO mapreduce.ImportJobBase: Beginning import of order_items

19/09/12 00:05:42 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

19/09/12 00:05:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1

19/09/12 00:05:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1

19/09/12 00:05:42 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/order_items.avsc

19/09/12 00:05:43 INFO client.RMProxy: Connecting to ResourceManager at /

19/09/12 00:05:44 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:05:45 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:05:53 INFO db.DBInputFormat: Using read commited transaction isolation

19/09/12 00:05:53 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:05:53 INFO mapreduce.JobSubmitter: number of splits:1

19/09/12 00:05:54 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:05:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0004

19/09/12 00:05:54 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0004

19/09/12 00:05:54 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0004/

19/09/12 00:05:54 INFO mapreduce.Job: Running job: job_1568180694054_0004

19/09/12 00:06:40 INFO mapreduce.Job: Job job_1568180694054_0004 running in uber mode : false

19/09/12 00:06:40 INFO mapreduce.Job: map 0% reduce 0%

19/09/12 00:07:32 INFO mapreduce.Job: map 100% reduce 0%

19/09/12 00:07:34 INFO mapreduce.Job: Job job_1568180694054_0004 completed successfully

19/09/12 00:07:34 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=171910

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=87

HDFS: Number of bytes written=3933008

HDFS: Number of read operations=4

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Other local map tasks=1

Total time spent by all maps in occupied slots (ms)=48890

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=48890

Total vcore-milliseconds taken by all map tasks=48890

Total megabyte-milliseconds taken by all map tasks=50063360

Map-Reduce Framework

Map input records=172198

Map output records=172198

Input split bytes=87

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1102

CPU time spent (ms)=11930

Physical memory (bytes) snapshot=148275200

Virtual memory (bytes) snapshot=1511383040

Total committed heap usage (bytes)=50921472

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=3933008

19/09/12 00:07:34 INFO mapreduce.ImportJobBase: Transferred 3.7508 MB in 111.664 seconds (34.3963 KB/sec)

19/09/12 00:07:34 INFO mapreduce.ImportJobBase: Retrieved 172198 records.

19/09/12 00:07:34 INFO tool.CodeGenTool: Beginning code generation

19/09/12 00:07:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1

19/09/12 00:07:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce

Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/orders.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

19/09/12 00:07:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/orders.jar

19/09/12 00:07:37 INFO mapreduce.ImportJobBase: Beginning import of orders

19/09/12 00:07:37 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

19/09/12 00:07:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1

19/09/12 00:07:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1

19/09/12 00:07:38 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/orders.avsc

19/09/12 00:07:38 INFO client.RMProxy: Connecting to ResourceManager at /

19/09/12 00:07:39 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:07:44 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:07:45 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:07:45 INFO db.DBInputFormat: Using read commited transaction isolation

19/09/12 00:07:46 INFO mapreduce.JobSubmitter: number of splits:1

19/09/12 00:07:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0005

19/09/12 00:07:47 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0005

19/09/12 00:07:47 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0005/

19/09/12 00:07:47 INFO mapreduce.Job: Running job: job_1568180694054_0005

19/09/12 00:08:35 INFO mapreduce.Job: Job job_1568180694054_0005 running in uber mode : false

19/09/12 00:08:35 INFO mapreduce.Job: map 0% reduce 0%

19/09/12 00:09:22 INFO mapreduce.Job: map 100% reduce 0%

19/09/12 00:09:25 INFO mapreduce.Job: Job job_1568180694054_0005 completed successfully

19/09/12 00:09:25 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=171499

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=87

HDFS: Number of bytes written=1779793

HDFS: Number of read operations=4

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Other local map tasks=1

Total time spent by all maps in occupied slots (ms)=45340

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=45340

Total vcore-milliseconds taken by all map tasks=45340

Total megabyte-milliseconds taken by all map tasks=46428160

Map-Reduce Framework

Map input records=68883

Map output records=68883

Input split bytes=87

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=953

CPU time spent (ms)=9510

Physical memory (bytes) snapshot=133357568

Virtual memory (bytes) snapshot=1511342080

Total committed heap usage (bytes)=50921472

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=1779793

19/09/12 00:09:25 INFO mapreduce.ImportJobBase: Transferred 1.6973 MB in 107.2342 seconds (16.2082 KB/sec)

19/09/12 00:09:25 INFO mapreduce.ImportJobBase: Retrieved 68883 records.

19/09/12 00:09:25 INFO tool.CodeGenTool: Beginning code generation

19/09/12 00:09:25 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `products` AS t LIMIT 1

19/09/12 00:09:25 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce

Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/products.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

19/09/12 00:09:28 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/products.jar

19/09/12 00:09:28 INFO mapreduce.ImportJobBase: Beginning import of products

19/09/12 00:09:28 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

19/09/12 00:09:28 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `products` AS t LIMIT 1

19/09/12 00:09:28 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `products` AS t LIMIT 1

19/09/12 00:09:28 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/products.avsc

19/09/12 00:09:28 INFO client.RMProxy: Connecting to ResourceManager at /

19/09/12 00:09:29 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:09:30 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:09:33 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:09:38 INFO db.DBInputFormat: Using read commited transaction isolation

19/09/12 00:09:38 INFO mapreduce.JobSubmitter: number of splits:1

19/09/12 00:09:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0006

19/09/12 00:09:39 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0006

19/09/12 00:09:39 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0006/

19/09/12 00:09:39 INFO mapreduce.Job: Running job: job_1568180694054_0006

19/09/12 00:10:24 INFO mapreduce.Job: Job job_1568180694054_0006 running in uber mode : false

19/09/12 00:10:24 INFO mapreduce.Job: map 0% reduce 0%

19/09/12 00:11:05 INFO mapreduce.Job: map 100% reduce 0%

19/09/12 00:11:07 INFO mapreduce.Job: Job job_1568180694054_0006 completed successfully

19/09/12 00:11:07 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=171804

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=87

HDFS: Number of bytes written=175677

HDFS: Number of read operations=4

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Other local map tasks=1

Total time spent by all maps in occupied slots (ms)=37548

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=37548

Total vcore-milliseconds taken by all map tasks=37548

Total megabyte-milliseconds taken by all map tasks=38449152

Map-Reduce Framework

Map input records=1345

Map output records=1345

Input split bytes=87

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=905

CPU time spent (ms)=5410

Physical memory (bytes) snapshot=131756032

Virtual memory (bytes) snapshot=1510158336

Total committed heap usage (bytes)=50921472

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=175677

19/09/12 00:11:07 INFO mapreduce.ImportJobBase: Transferred 171.5596 KB in 98.676 seconds (1.7386 KB/sec)

19/09/12 00:11:07 INFO mapreduce.ImportJobBase: Retrieved 1345 records.
In this case the operation completed successfully; the last job (the products table) retrieved 1345 records. The log also contains many exception messages, but since they are all logged at the WARN level and every job reports successful completion, they can be ignored.
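A log of this length is easier to verify with a small script than by eye. The following is a minimal sketch, assuming the log lines have the same shape as in the listing above ("Beginning import of <table>" followed later by "Retrieved <N> records."). The function name `summarize` and the sample text are illustrative; the sample lines are copied from the log above.

```python
import re

IMPORT_RE = re.compile(r"ImportJobBase: Beginning import of (\w+)")
RETRIEVED_RE = re.compile(r"ImportJobBase: Retrieved (\d+) records\.")
LEVEL_RE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (INFO|WARN|ERROR) ")


def summarize(log_text):
    """Return ({table: record_count}, {level: line_count}) for a Sqoop log."""
    records = {}
    levels = {"INFO": 0, "WARN": 0, "ERROR": 0}
    current = None  # table whose import is in progress
    for line in log_text.splitlines():
        level = LEVEL_RE.match(line)
        if level:
            levels[level.group(1)] += 1
        started = IMPORT_RE.search(line)
        if started:
            current = started.group(1)
            continue
        retrieved = RETRIEVED_RE.search(line)
        if retrieved and current is not None:
            records[current] = int(retrieved.group(1))
            current = None
    return records, levels


SAMPLE = """\
19/09/12 00:03:58 INFO mapreduce.ImportJobBase: Beginning import of departments
19/09/12 00:03:58 WARN hdfs.DFSClient: Caught exception
19/09/12 00:05:38 INFO mapreduce.ImportJobBase: Retrieved 6 records.
19/09/12 00:09:28 INFO mapreduce.ImportJobBase: Beginning import of products
19/09/12 00:11:07 INFO mapreduce.ImportJobBase: Retrieved 1345 records.
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A "Retrieved" line with no preceding "Beginning import" line, as happens where the listing is cut off, is skipped rather than attributed to a guessed table.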

Next, the result was inspected:


[cloudera@quickstart ]$ hadoop fs -ls /user/hive/warehouse

Found 6 items

drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:01 /user/hive/warehouse/categories

drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:03 /user/hive/warehouse/customers

drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:05 /user/hive/warehouse/departments

drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:07 /user/hive/warehouse/order_items

drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:09 /user/hive/warehouse/orders

drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:11 /user/hive/warehouse/products

[cloudera@quickstart ]$ hadoop fs -ls /user/hive/warehouse/categories

Found 2 items

-rw-r--r-- 1 cloudera supergroup 0 2019-09-12 00:01 /user/hive/warehouse/categories/_SUCCESS

-rw-r--r-- 1 cloudera supergroup 1534 2019-09-12 00:01 /user/hive/warehouse/categories/part-m-00000.avro
The import also created .avsc files with the data schemas in the home directory:

[cloudera@quickstart ]$ ls -1 *.avsc






Note that, unlike the expected names, these file names lack the "sqoop_import_" prefix.

Contents of one of the files:

[cloudera@quickstart ]$ vim categories.avsc

{
  "type" : "record",
  "name" : "categories",
  "doc" : "Sqoop import of categories",
  "fields" : [ {
    "name" : "category_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_id",
    "sqlType" : "4"
  }, {
    "name" : "category_department_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_department_id",
    "sqlType" : "4"
  }, {
    "name" : "category_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "category_name",
    "sqlType" : "12"
  } ],
  "tableName" : "categories"
}
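The schema above is ordinary JSON, so its fields can be listed programmatically. The sketch below parses a copy of the categories schema and reports each field's name, its non-null type, and whether it is nullable (in Avro, a union such as [ "null", "int" ] means the value may be absent). The helper name `describe_fields` is hypothetical.

```python
import json

# Copy of the categories schema shown above, with the enclosing braces.
SCHEMA_TEXT = """
{
  "type": "record",
  "name": "categories",
  "doc": "Sqoop import of categories",
  "fields": [
    {"name": "category_id", "type": ["null", "int"], "default": null,
     "columnName": "category_id", "sqlType": "4"},
    {"name": "category_department_id", "type": ["null", "int"], "default": null,
     "columnName": "category_department_id", "sqlType": "4"},
    {"name": "category_name", "type": ["null", "string"], "default": null,
     "columnName": "category_name", "sqlType": "12"}
  ],
  "tableName": "categories"
}
"""


def describe_fields(schema_text):
    """Return a list of (field_name, non_null_type, nullable) tuples."""
    schema = json.loads(schema_text)
    result = []
    for field in schema["fields"]:
        types = field["type"]
        if isinstance(types, list):  # Avro union, e.g. ["null", "int"]
            nullable = "null" in types
            non_null = [t for t in types if t != "null"]
        else:
            nullable = False
            non_null = [types]
        result.append((field["name"], non_null[0], nullable))
    return result


if __name__ == "__main__":
    for name, avro_type, nullable in describe_fields(SCHEMA_TEXT):
        print(name, avro_type, "nullable" if nullable else "required")
```

The "sqlType" values are java.sql.Types codes: 4 is INTEGER and 12 is VARCHAR, which matches the int and string Avro types.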
Copying the data schemas to HDFS:

[cloudera@quickstart ]$ sudo -u hdfs hadoop fs -mkdir /user/examples

[cloudera@quickstart ]$ sudo -u hdfs hadoop fs -chmod +rw /user/examples

[cloudera@quickstart ]$ hadoop fs -copyFromLocal /*.avsc /user/examples

19/09/12 00:30:11 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:30:11 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:30:12 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

19/09/12 00:30:12 WARN hdfs.DFSClient: Caught exception


at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1281)

at java.lang.Thread.join(Thread.java:1355)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
The same exceptions seen earlier appear again, but since they are again logged only at the WARN level, they can be ignored.

Using Impala, tables were created on top of the previously imported data:

CREATE EXTERNAL TABLE categories STORED AS AVRO LOCATION 'hdfs:///user/hive/warehouse/categories' TBLPROPERTIES

CREATE EXTERNAL TABLE customers STORED AS AVRO LOCATION 'hdfs:///user/hive/warehouse/customers' TBLPROPERTIES

CREATE EXTERNAL TABLE departments STORED AS AVRO LOCATION 'hdfs:///user/hive/warehouse/departments' TBLPROPERTIES



CREATE EXTERNAL TABLE order_items STORED AS AVRO LOCATION 'hdfs:///user/hive/warehouse/order_items' TBLPROPERTIES

CREATE EXTERNAL TABLE products STORED AS AVRO LOCATION 'hdfs:///user/hive/warehouse/products' TBLPROPERTIES
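The statements above break off after TBLPROPERTIES, so the exact properties used in the report are not known. For Avro-backed tables, the schema is commonly supplied through the `avro.schema.url` table property. The sketch below generates one statement per table under that assumption. The schema location `hdfs:///user/examples/<table>.avsc` is also an assumption, based on the copy step and the unprefixed file names described earlier. The function name `create_table_sql` is hypothetical.

```python
# Tables created by the import, as listed in /user/hive/warehouse.
TABLES = ["categories", "customers", "departments",
          "order_items", "orders", "products"]


def create_table_sql(table,
                     warehouse="hdfs:///user/hive/warehouse",
                     schema_dir="hdfs:///user/examples"):
    """Build a CREATE EXTERNAL TABLE statement for one Avro-backed table.

    Assumption: the schema file is named <table>.avsc and sits in schema_dir.
    """
    return (
        f"CREATE EXTERNAL TABLE {table} STORED AS AVRO "
        f"LOCATION '{warehouse}/{table}' "
        f"TBLPROPERTIES ('avro.schema.url'='{schema_dir}/{table}.avsc');"
    )


if __name__ == "__main__":
    for name in TABLES:
        print(create_table_sql(name))
```

Generating the statements from one list keeps the table set consistent with the imported directories, including orders, which has no statement in the listing above.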

As an example, we select the 10 most popular product categories (Fig. 1).
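The query itself is not preserved in the text, so the following is only a sketch of what such a selection might look like. It runs against an in-memory SQLite database with invented toy rows instead of Impala. The column names `order_item_product_id`, `order_item_quantity`, `product_id` and `product_category_id` are assumptions about the retail_db schema; only the columns of categories appear in the report. Popularity is taken here to mean the number of order lines per category, which is also an assumption.

```python
import sqlite3

# Assumed column names; "popularity" is counted as order lines per category.
QUERY = """
SELECT c.category_name, COUNT(oi.order_item_quantity) AS cnt
FROM order_items oi
JOIN products p ON oi.order_item_product_id = p.product_id
JOIN categories c ON c.category_id = p.product_category_id
GROUP BY c.category_name
ORDER BY cnt DESC, c.category_name
LIMIT 10
"""


def demo_connection():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE categories (category_id INTEGER,
                                 category_department_id INTEGER,
                                 category_name TEXT);
        CREATE TABLE products (product_id INTEGER,
                               product_category_id INTEGER);
        CREATE TABLE order_items (order_item_product_id INTEGER,
                                  order_item_quantity INTEGER);
        INSERT INTO categories VALUES (1, 1, 'Cleats'), (2, 1, 'Fishing');
        INSERT INTO products VALUES (10, 1), (20, 2);
        INSERT INTO order_items VALUES (10, 1), (10, 2), (20, 5);
    """)
    return conn


def top_categories(conn):
    """Return up to 10 (category_name, order_line_count) rows."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for name, count in top_categories(demo_connection()):
        print(name, count)
```

Replacing COUNT with SUM over the quantity column would rank categories by units sold instead of by order lines; which of the two the report used cannot be determined from the text.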

