MINISTRY OF EDUCATION AND SCIENCE OF THE RUSSIAN FEDERATION
Sevastopol State University
Institute of Information Technology and Control in Technical Systems
Department of Information Systems

Povkh Andrey Anatolyevich
2nd year, group ISm-18-1-o
09.04.02 Information Systems and Technologies (Master's degree programme)

REPORT
on laboratory work No. 1
in the course "Big Data and Cloud Technologies"
Topic: "Exploring ways of using the Apache Hadoop ecosystem"

Pass mark ____________________ ________ (date)
Practicum supervisor: Senior Lecturer V.A. Stroganov (position) (signature) (initials, surname)

Sevastopol 2019

1 PURPOSE OF THE WORK

Study the purpose of the main components of the Apache Hadoop ecosystem. Explore ways of using the Apache Hadoop ecosystem for querying structured data.

2 TASK VARIANT
3 PROCEDURE

After the initial configuration of the virtual machine (Cloudera QuickStart VM 5.13 was used), an attempt was made to import the data from the relational DBMS into HDFS with the Apache Sqoop utility, using the following command:

sqoop import-all-tables \
  -m 1 \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username=retail_dba \
  --password=cloudera \
  --compression-codec=snapy \
  --as-avrodatafile \
  --warehouse-dir=/user/hive/warehouse

This produced the following output:

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/09/11 12:02:01 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/09/11 12:02:02 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/09/11 12:02:03 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/09/11 12:02:07 INFO tool.CodeGenTool: Beginning code generation
19/09/11 12:02:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1
19/09/11 12:02:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1
19/09/11 12:02:07 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/885be4e58d989ac5705c6d3cc6cb6f94/categories.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/09/11 12:02:19 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/885be4e58d989ac5705c6d3cc6cb6f94/categories.jar
19/09/11 12:02:19 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/09/11 12:02:19 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/09/11 12:02:19 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/09/11 12:02:19 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/09/11 12:02:19 INFO mapreduce.ImportJobBase: Beginning import of categories
19/09/11 12:02:19 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/09/11 12:02:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/09/11 12:02:22 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: com.cloudera.sqoop.io.UnsupportedCodecException: snapy

As can be seen, the run failed with an I/O error caused by the unrecognized compression codec "snapy".
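The root cause is simply the misspelled codec name in the --compression-codec option. As a sketch of an alternative fix (not attempted in this work), the codec could instead be given by its correct name, e.g. by the full Hadoop codec class:

sqoop import-all-tables \
  -m 1 \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username=retail_dba \
  --password=cloudera \
  --compression-codec=org.apache.hadoop.io.compress.SnappyCodec \
  --as-avrodatafile \
  --warehouse-dir=/user/hive/warehouse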
Since compression is not essential here, the import was retried without the codec option:

sqoop import-all-tables \
  -m 1 \
  --connect "jdbc:mysql://quickstart:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --as-avrodatafile \
  --warehouse-dir=/user/hive/warehouse

java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
19/09/12 00:02:07 INFO mapreduce.JobSubmitter: number of splits:1
19/09/12 00:02:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0002
19/09/12 00:02:08 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0002
19/09/12 00:02:08 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0002/
19/09/12 00:02:08 INFO mapreduce.Job: Running job: job_1568180694054_0002
19/09/12 00:02:59 INFO mapreduce.Job: Job job_1568180694054_0002 running in uber mode : false
19/09/12 00:02:59 INFO mapreduce.Job: map 0% reduce 0%
19/09/12 00:03:53 INFO mapreduce.Job: map 100% reduce 0%
19/09/12 00:03:54 INFO mapreduce.Job: Job job_1568180694054_0002 completed successfully
19/09/12 00:03:55 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=172206
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=1032483
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=51787
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=51787
        Total vcore-milliseconds taken by all map tasks=51787
        Total megabyte-milliseconds taken by all map tasks=53029888
    Map-Reduce Framework
        Map input records=12435
        Map output records=12435
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=1166
        CPU time spent (ms)=9850
        Physical memory (bytes) snapshot=149233664
        Virtual memory (bytes) snapshot=1511247872
        Total committed heap usage (bytes)=50921472
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=1032483
19/09/12 00:03:55 INFO mapreduce.ImportJobBase: Transferred 1,008.2842 KB in 118.8311 seconds (8.485 KB/sec)
19/09/12 00:03:55 INFO mapreduce.ImportJobBase: Retrieved 12435 records.
19/09/12 00:03:55 INFO tool.CodeGenTool: Beginning code generation
19/09/12 00:03:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1
19/09/12 00:03:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/departments.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/09/12 00:03:58 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/departments.jar 19/09/12 00:03:58 INFO mapreduce.ImportJobBase: Beginning import of departments 19/09/12 00:03:58 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address 19/09/12 00:03:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1 19/09/12 00:03:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1 19/09/12 00:03:58 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/departments.avsc 19/09/12 00:03:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 19/09/12 00:03:58 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:04:00 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:04:06 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:04:08 INFO db.DBInputFormat: Using read commited transaction isolation 19/09/12 00:04:08 INFO mapreduce.JobSubmitter: number of splits:1 19/09/12 00:04:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0003 19/09/12 00:04:10 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0003 19/09/12 00:04:10 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0003/ 19/09/12 00:04:10 INFO mapreduce.Job: Running job: job_1568180694054_0003 19/09/12 00:04:55 INFO mapreduce.Job: Job job_1568180694054_0003 running in uber mode : false 19/09/12 00:04:55 INFO mapreduce.Job: map 0% reduce 0% 19/09/12 00:05:36 INFO mapreduce.Job: map 100% reduce 0% 19/09/12 00:05:38 INFO mapreduce.Job: Job job_1568180694054_0003 completed successfully 19/09/12 00:05:38 INFO mapreduce.Job: Counters: 30 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=171312 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=87 HDFS: Number of bytes written=450 HDFS: Number of read operations=4 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched 
map tasks=1 Other local map tasks=1 Total time spent by all maps in occupied slots (ms)=37974 Total time spent by all reduces in occupied slots (ms)=0 Total time spent by all map tasks (ms)=37974 Total vcore-milliseconds taken by all map tasks=37974 Total megabyte-milliseconds taken by all map tasks=38885376 Map-Reduce Framework Map input records=6 Map output records=6 Input split bytes=87 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=833 CPU time spent (ms)=3600 Physical memory (bytes) snapshot=138702848 Virtual memory (bytes) snapshot=1510150144 Total committed heap usage (bytes)=50921472 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=450 19/09/12 00:05:38 INFO mapreduce.ImportJobBase: Transferred 450 bytes in 99.9828 seconds (4.5008 bytes/sec) 19/09/12 00:05:38 INFO mapreduce.ImportJobBase: Retrieved 6 records. 19/09/12 00:05:38 INFO tool.CodeGenTool: Beginning code generation 19/09/12 00:05:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1 19/09/12 00:05:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/order_items.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 19/09/12 00:05:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/order_items.jar 19/09/12 00:05:42 INFO mapreduce.ImportJobBase: Beginning import of order_items 19/09/12 00:05:42 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address 19/09/12 00:05:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1 19/09/12 00:05:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1 19/09/12 00:05:42 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/order_items.avsc 19/09/12 00:05:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 19/09/12 00:05:44 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:05:45 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:05:53 INFO db.DBInputFormat: Using read commited transaction isolation 19/09/12 00:05:53 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:05:53 INFO mapreduce.JobSubmitter: number of splits:1 19/09/12 00:05:54 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:05:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0004 19/09/12 00:05:54 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0004 19/09/12 00:05:54 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0004/ 19/09/12 00:05:54 INFO mapreduce.Job: Running job: job_1568180694054_0004 19/09/12 00:06:40 INFO mapreduce.Job: Job job_1568180694054_0004 running in uber mode : false 19/09/12 00:06:40 INFO mapreduce.Job: map 0% reduce 0% 19/09/12 00:07:32 INFO mapreduce.Job: map 100% reduce 0% 19/09/12 00:07:34 INFO mapreduce.Job: Job job_1568180694054_0004 completed successfully 19/09/12 00:07:34 INFO mapreduce.Job: Counters: 30 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=171910 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=87 HDFS: Number of bytes written=3933008 HDFS: Number of read operations=4 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched map tasks=1 Other local map tasks=1 Total time spent by all maps in occupied slots (ms)=48890 Total time spent by all reduces in occupied slots (ms)=0 Total time spent by all map tasks (ms)=48890 Total vcore-milliseconds taken by all map tasks=48890 Total megabyte-milliseconds taken by all map tasks=50063360 Map-Reduce Framework Map input records=172198 Map output records=172198 Input split bytes=87 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=1102 CPU time spent (ms)=11930 Physical memory (bytes) snapshot=148275200 Virtual memory (bytes) snapshot=1511383040 Total committed heap usage (bytes)=50921472 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=3933008 19/09/12 00:07:34 INFO mapreduce.ImportJobBase: Transferred 3.7508 MB in 111.664 seconds (34.3963 KB/sec) 19/09/12 00:07:34 INFO mapreduce.ImportJobBase: Retrieved 172198 records. 19/09/12 00:07:34 INFO tool.CodeGenTool: Beginning code generation 19/09/12 00:07:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1 19/09/12 00:07:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/orders.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 19/09/12 00:07:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/orders.jar 19/09/12 00:07:37 INFO mapreduce.ImportJobBase: Beginning import of orders 19/09/12 00:07:37 INFO Configuration.deprecation: mapred.job.tracker is deprecated. 
Instead, use mapreduce.jobtracker.address 19/09/12 00:07:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1 19/09/12 00:07:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1 19/09/12 00:07:38 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/orders.avsc 19/09/12 00:07:38 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 19/09/12 00:07:39 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:07:44 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:07:45 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:07:45 INFO db.DBInputFormat: Using read commited transaction isolation 19/09/12 00:07:46 INFO mapreduce.JobSubmitter: number of splits:1 19/09/12 00:07:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0005 19/09/12 00:07:47 INFO impl.YarnClientImpl: Submitted application application_1568180694054_0005 19/09/12 00:07:47 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0005/ 19/09/12 00:07:47 INFO mapreduce.Job: Running job: job_1568180694054_0005 19/09/12 00:08:35 INFO mapreduce.Job: Job job_1568180694054_0005 running in uber mode : false 19/09/12 00:08:35 INFO mapreduce.Job: map 0% reduce 0% 19/09/12 00:09:22 INFO mapreduce.Job: map 100% reduce 0% 19/09/12 00:09:25 INFO mapreduce.Job: Job job_1568180694054_0005 completed successfully 19/09/12 00:09:25 INFO mapreduce.Job: Counters: 30 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=171499 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=87 HDFS: Number of bytes written=1779793 HDFS: Number of read operations=4 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched map tasks=1 Other local map tasks=1 Total time spent by all maps in occupied slots (ms)=45340 Total time spent by all reduces in occupied slots (ms)=0 Total time spent by all map tasks (ms)=45340 Total vcore-milliseconds taken by all map tasks=45340 Total megabyte-milliseconds taken by all map tasks=46428160 
Map-Reduce Framework Map input records=68883 Map output records=68883 Input split bytes=87 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=953 CPU time spent (ms)=9510 Physical memory (bytes) snapshot=133357568 Virtual memory (bytes) snapshot=1511342080 Total committed heap usage (bytes)=50921472 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=1779793 19/09/12 00:09:25 INFO mapreduce.ImportJobBase: Transferred 1.6973 MB in 107.2342 seconds (16.2082 KB/sec) 19/09/12 00:09:25 INFO mapreduce.ImportJobBase: Retrieved 68883 records. 19/09/12 00:09:25 INFO tool.CodeGenTool: Beginning code generation 19/09/12 00:09:25 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `products` AS t LIMIT 1 19/09/12 00:09:25 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce Note: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/products.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 19/09/12 00:09:28 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/products.jar 19/09/12 00:09:28 INFO mapreduce.ImportJobBase: Beginning import of products 19/09/12 00:09:28 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address 19/09/12 00:09:28 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `products` AS t LIMIT 1 19/09/12 00:09:28 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `products` AS t LIMIT 1 19/09/12 00:09:28 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/56c17b47fb4edda02e7879974232fc55/products.avsc 19/09/12 00:09:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 19/09/12 00:09:29 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:09:30 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:09:33 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1281) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 19/09/12 00:09:38 INFO db.DBInputFormat: Using read commited transaction isolation 19/09/12 00:09:38 INFO mapreduce.JobSubmitter: number of splits:1 19/09/12 00:09:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568180694054_0006 19/09/12 00:09:39 
INFO impl.YarnClientImpl: Submitted application application_1568180694054_0006
19/09/12 00:09:39 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1568180694054_0006/
19/09/12 00:09:39 INFO mapreduce.Job: Running job: job_1568180694054_0006
19/09/12 00:10:24 INFO mapreduce.Job: Job job_1568180694054_0006 running in uber mode : false
19/09/12 00:10:24 INFO mapreduce.Job: map 0% reduce 0%
19/09/12 00:11:05 INFO mapreduce.Job: map 100% reduce 0%
19/09/12 00:11:07 INFO mapreduce.Job: Job job_1568180694054_0006 completed successfully
19/09/12 00:11:07 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=171804
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=175677
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=37548
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=37548
        Total vcore-milliseconds taken by all map tasks=37548
        Total megabyte-milliseconds taken by all map tasks=38449152
    Map-Reduce Framework
        Map input records=1345
        Map output records=1345
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=905
        CPU time spent (ms)=5410
        Physical memory (bytes) snapshot=131756032
        Virtual memory (bytes) snapshot=1510158336
        Total committed heap usage (bytes)=50921472
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=175677
19/09/12 00:11:07 INFO mapreduce.ImportJobBase: Transferred 171.5596 KB in 98.676 seconds (1.7386 KB/sec)
19/09/12 00:11:07 INFO mapreduce.ImportJobBase: Retrieved 1345 records.

This time the import completed successfully; the last table (products) returned 1345 records. The log still contains many InterruptedException stack traces, but since they are reported only at the WARN level they can be ignored. The result was then inspected:

[cloudera@quickstart ]$ hadoop fs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:01 /user/hive/warehouse/categories
drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:03 /user/hive/warehouse/customers
drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:05 /user/hive/warehouse/departments
drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:07 /user/hive/warehouse/order_items
drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:09 /user/hive/warehouse/orders
drwxr-xr-x - cloudera supergroup 0 2019-09-12 00:11 /user/hive/warehouse/products
[cloudera@quickstart ]$ hadoop fs -ls /user/hive/warehouse/categories
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2019-09-12 00:01 /user/hive/warehouse/categories/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 1534 2019-09-12 00:01 /user/hive/warehouse/categories/part-m-00000.avro

The import also created .avsc files with the data schemas in the home directory:

[cloudera@quickstart ]$ ls -1 *.avsc
categories.avsc
customers.avsc
departments.avsc
order_items.avsc
orders.avsc
products.avsc

Note that, unlike the expected names, these files lack the "sqoop_import_" prefix.
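To double-check what was actually written, the imported Avro files could also be inspected directly. A minimal sketch (not part of the assignment, and assuming the avro-tools utility shipped with CDH is available on the PATH; otherwise it can be run as java -jar avro-tools.jar):

hadoop fs -copyToLocal /user/hive/warehouse/categories/part-m-00000.avro .
avro-tools getschema part-m-00000.avro          # print the Avro schema embedded in the data file
avro-tools tojson part-m-00000.avro | head      # dump the first records as JSON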
The contents of one of these files:

[cloudera@quickstart ]$ vim categories.avsc
{
  "type" : "record",
  "name" : "categories",
  "doc" : "Sqoop import of categories",
  "fields" : [ {
    "name" : "category_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_id",
    "sqlType" : "4"
  }, {
    "name" : "category_department_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_department_id",
    "sqlType" : "4"
  }, {
    "name" : "category_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "category_name",
    "sqlType" : "12"
  } ],
  "tableName" : "categories"
}

Copying the data schemas to HDFS:

[cloudera@quickstart ]$ sudo -u hdfs hadoop fs -mkdir /user/examples
[cloudera@quickstart ]$ sudo -u hdfs hadoop fs -chmod +rw /user/examples
[cloudera@quickstart ]$ hadoop fs -copyFromLocal ~/*.avsc /user/examples
19/09/12 00:30:11 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
19/09/12 00:30:11 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
19/09/12 00:30:12 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
19/09/12 00:30:12 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

The same exceptions appear again, but since they are once more only at the WARN level they can be ignored.
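As a simple check (not shown in the original log), the copied schemas can be listed to confirm they actually landed in HDFS:

hadoop fs -ls /user/examples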
Using Impala, external tables were then created over the previously imported data:

CREATE EXTERNAL TABLE categories
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/categories'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart/user/examples/categories.avsc');

CREATE EXTERNAL TABLE customers
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/customers'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart/user/examples/customers.avsc');

CREATE EXTERNAL TABLE departments
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/departments'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart/user/examples/departments.avsc');

CREATE EXTERNAL TABLE orders
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/orders'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart/user/examples/orders.avsc');

CREATE EXTERNAL TABLE order_items
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/order_items'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart/user/examples/order_items.avsc');

CREATE EXTERNAL TABLE products
STORED AS AVRO
LOCATION 'hdfs:///user/hive/warehouse/products'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart/user/examples/products.avsc');

As an example, a query selecting the 10 most popular product categories was executed (Fig. 1).
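The exact query behind Fig. 1 is not reproduced in the report. A minimal sketch of such a query is given below; the categories columns (category_id, category_name) come from the schema shown above, while the join columns of products and order_items (product_id, product_category_id, order_item_product_id, order_item_quantity) are assumed from the standard retail_db layout and may need adjusting:

SELECT c.category_name,
       COUNT(oi.order_item_quantity) AS cnt
FROM order_items oi
JOIN products   p ON oi.order_item_product_id = p.product_id
JOIN categories c ON p.product_category_id   = c.category_id
GROUP BY c.category_name
ORDER BY cnt DESC
LIMIT 10;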