sqoop 命令

1/列出mysql数据库中的所有数据库sqoop list-databases -connect jdbc:mysql://localhost:3306/ -username root -password 1234562/连接mysql并列出test数据库中的表sqoop list-tables -connect jdbc:mysql://localhost:3306/test -username root -password 1234563/将关系型数据的表结构复制到hive中，只是复制表结构内容不复制sqoop create-hive-table -connect jdbc:mysql://localhost:3306/test -table sqoop_testTabinMySql -username root -password 123456 -hive-table testNewTabInHive4/从关系数据库导入文件到hive中sqoop import -connect jdbc:mysql://localhost:3306/zxtest -username root -password 123456 -table sqoop_test -hive-import -hive-table s_test -m 15/将hive中的表数据导入到mysql 中，在进行导入前 mysql中的表hive_test 必须提前创建好sqoop export -connect jdbc:mysql://localhost:3306/zxtest -username root -password root -table hive_test -export-dir /user/hive/warehouse/new_test_partition/dt=2012-03-056/从数据库导出表的数据到HDFS上的文件sqoop import -connect jdbc:mysql://localhost:3306/compression -username=hadoop -password=123456 -table HADOOP_USER_INFO -m 1 -target -dir /user/test7/数据库增量导入表数据到hdfs中sqoop import -connect jdbc:mysql://localhost:3306/compression -username=hadoop -password=123456 -table HADOOP_USER_INFO -m 1 -target -dir /user/test -check -column id -incremental append -last-value 3Importsqoop 数据导入具有以下特点：1.支持文本文件(--as-textfile)、avro(--as-avrodatafile)、SequenceFiles(--as-sequencefile)。 RCFILE暂未支持，默认为文本2.支持数据追加，通过--apend指定3.支持table列选取（--column），支持数据选取（--where），和--table一起使用4.支持数据选取，例如读入多表join后的数据'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) ‘，不可以和--table同时使用5.支持map数定制(-m)6.支持压缩(--compress)7.支持将关系数据库中的数据导入到Hive(--hive-import)、HBase(--hbase-table) 数据导入Hive分三步：1）导入数据到HDFS 2）Hive建表 3）使用“LOAD DATA INPAHT”将数据LOAD到表中数据导入HBase分二部：1）导入数据到HDFS 2）调用HBase put操作逐行将数据写入表*import是将关系数据库迁移到HDFS上默认目录是/user/${user.name}/${tablename}，可以通过--target-dir设置hdfs上的目标目录。export是import的反向过程，将hdfs上的数据导入到关系数据库中由于sqoop是通过map完成数据的导入，各个map过程是独立的，没有事物的概念，可能会有部分map数据导入失败的情况。为了解决这一问题，sqoop中有一个折中的办法，即是指定中间 staging表，成功后再由中间表导入到结果表。这一功能是通过 --staging-table指定，同时staging表结构也是需要提前创建出来的:sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root -password passwd --table sds --export-dir /user/guojian/sds --staging-table sds_tmp需要说明的是，在使用 --direct， --update-key或者--call存储过程的选项时，staging中间表是不可用的。create-hive-table将关系数据库表导入到hive表中参数说明–hive-homeHive的安装目录，可以通过该参数覆盖掉默认的hive目录–hive-overwrite覆盖掉在hive表中已经存在的数据–create-hive-table默认是false,如果目标表已经存在了，那么创建任务会失败–hive-table后面接要创建的hive表–table指定关系数据库表名sqoop create-hive-table --connect jdbc:mysql://192.168.81.176/sqoop --username root -password passwd --table sds --hive-table sds_bak默认sds_bak是在default数据库的。这一步需要依赖HCatalog，需要先安装HCatalog，否则报如下错误：Hive history file=/tmp/guojian/hive_job_log_cfbe2de9-a358-4130-945c-b97c0add649d_1628102887.txtFAILED: ParseException line 1:44 mismatched input ')' expecting Identifier near '(' in column specificationmetastore 配置sqoop job的共享元数据信息，这样多个用户定义和执行sqoop job在这一 metastore中。默认存储在~/.sqoop启动：sqoop metastore关闭：sqoop metastore --shutdownmetastore文件的存储位置是在 conf/sqoop-site.xml中 sqoop.metastore.server.location 配置，指向本地文件。metastore可以通过TCP/IP访问，端口号可以通过 sqoop.metastore.server.port配置，默认是16000。客户端可以通过指定 sqoop.metastore.client.autoconnect.url或使用 --meta-connect，配置为 jdbc:hsqldb:hsql://:/sqoop，例如 jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop。Sqoop will read entire content of the password file and use it as a password. This will include any trailing white space characters such as new line characters that are added by default by most of the text editors. You need to make sure that your password file contains only characters that belongs to your password. On the command line you can use command echo with switch -n to store password without any trailing white space characters. For example to store password secret you would call echo -n "secret" > password.file.Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL’s Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.For example, to connect to a SQLServer database, first download the driver from microsoft.com and install it in your Sqoop lib path.Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.When importing a free-form query, you must specify a destination directory with --target-dir.NoteIf you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results. 即 where中不能有orSqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. 默认开启4个taskWhen performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.If a table does not have a primary key defined and the --split-by is not provided, then import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.Sqoop will copy the jars in $SQOOP_HOME/lib folder to job cache every time when start a Sqoop job. When launched by Oozie this is unnecessary since Oozie use its own Sqoop share lib which keeps Sqoop dependencies in the distributed cache. Oozie will do the localization on each worker node for the Sqoop dependencies only once during the first Sqoop job and reuse the jars on worker node for subsquencial jobs. Using option --skip-dist-cache in Sqoop command when launched by Oozie will skip the step which Sqoop copies its dependencies to job cache and save massive I/O.MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:$ sqoop import --connnect--table foo --warehouse-dir /shared \When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \ --direct -- --default-character-set=latin1By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However the default mapping might not be suitable for everyone and might be overridden by --map-column-java (for changing mapping to Java) or --map-column-hive (for changing Hive mapping).Sqoop is expecting comma separated list of mapping in form=. For example:

$ sqoop import ... --map-column-java id=String,value=Integer

Sqoop will rise exception in case that some configured mapping will not be used.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data.

You can import data in one of two file formats: delimited text or SequenceFiles.

Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.

reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed

By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.

While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class' toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.

When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import.

When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.

2016年9月2日阅读笔记

Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.

7.3. Example Invocations

The following examples illustrate how to use the import tool in a variety of situations.

A basic import of a table named EMPLOYEES in the corp database:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES

A basic import requiring a login: