The ability to work with compressed data in a data warehouse is well understood, and it becomes even more important when the warehouse size runs into hundreds of terabytes or petabytes. Hadoop provides a way to interact with compressed files natively: tools like Hive and Pig can read and process these compressed files directly, reducing storage requirements to a great extent.
In this post, I am going to describe how easily a Hive query can be executed against a compressed data file. All screenshots are from my HDInsight cluster.
I am also using the new Windows Azure PowerShell tool to interact with my HDInsight cluster. It is straightforward and simple to work with.
(1) First, I used the following commands to connect to my HDInsight cluster:
# Provide the Windows Azure subscription name, and the Azure Storage account
# and container used for the default HDInsight file system.
$subscriptionName = "Windows Azure MSDN - Visual Studio Ultimate"
$storageAccountName = "mydemocluster"
$containerName = "sample"

# Provide the HDInsight cluster name where you want to run the Hive job.
$clusterName = "mydemocluster"

Add-AzureAccount
Select-AzureSubscription -SubscriptionName $subscriptionName
Use-AzureHDInsightCluster $clusterName
Although I have used Add-AzureAccount to authenticate myself on the interactive console, the same can also be scripted using certificates.
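For unattended scripts, here is a minimal sketch of the certificate approach (the subscription ID and thumbprint below are placeholders, and it assumes a management certificate is already uploaded to the subscription and installed in the current user's certificate store):

# Sketch of certificate-based authentication for unattended scripts.
# The subscription ID and thumbprint are placeholders; a management
# certificate matching the thumbprint must already exist in the
# CurrentUser\My store and be uploaded to the subscription.
$subscriptionId = "00000000-0000-0000-0000-000000000000"
$thumbprint = "0123456789ABCDEF0123456789ABCDEF01234567"
$cert = Get-Item "Cert:\CurrentUser\My\$thumbprint"
Set-AzureSubscription -SubscriptionName $subscriptionName `
    -SubscriptionId $subscriptionId -Certificate $cert
Select-AzureSubscription -SubscriptionName $subscriptionName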
(2) Next, I created a sample table by invoking Hive from PowerShell:
$queryString = "CREATE EXTERNAL TABLE Sample_Data (
                    Exchange STRING,
                    Stock STRING,
                    Date STRING,
                    High STRING,
                    Low STRING,
                    Open STRING,
                    Close STRING,
                    Field1 STRING,
                    Field2 STRING)
                ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                STORED AS TEXTFILE
                LOCATION '/Data/HiveExternal';"
Invoke-Hive -Query $queryString
Notice that I have created an external table in Hive, since I am not doing any aggregation in this example: with an external table, Hive simply points at the files under the LOCATION path rather than taking ownership of the data.
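As an optional sanity check, you can ask Hive to describe the table it just created:

# Optional check: list the columns Hive recorded for the new table.
$queryString = "DESCRIBE sample_data;"
Invoke-Hive -Query $queryString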
(3) Now I will move the compressed file into this location so that I can run a Hive query against the data. I am using Cloud Explorer to move the file; it has a free version and presents a Windows Explorer-like interface for interacting with Azure blob storage.
SampleZipFile.zip contains a bunch of CSV files all compressed together. The file is copied to the same location that the external Hive table points to.
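If you prefer to stay in PowerShell instead of using Cloud Explorer, here is a minimal sketch of the same upload (assuming the $storageAccountName and $containerName variables from step 1, and that SampleZipFile.zip sits in the current directory):

# Upload the compressed file to the external table's location.
# The blob path mirrors the LOCATION '/Data/HiveExternal' used above.
$storageKey = (Get-AzureStorageKey $storageAccountName).Primary
$context = New-AzureStorageContext -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageKey
Set-AzureStorageBlobContent -File ".\SampleZipFile.zip" `
    -Container $containerName -Blob "Data/HiveExternal/SampleZipFile.zip" `
    -Context $context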
(4) Now that the data is in place, I can simply query it with Hive. Hive will automatically run map/reduce jobs on the data nodes that hold chunks of this compressed file!
$queryString = "select count(*) from sample_data;"
Invoke-Hive -Query $queryString
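Any other HiveQL runs the same way. For example, a quick peek at a few rows (an illustrative query using columns from the table defined above):

# Pull a handful of rows to confirm the fields parse as expected.
$queryString = "SELECT Stock, High, Low FROM sample_data LIMIT 10;"
Invoke-Hive -Query $queryString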
That's it!
Comments
- Anonymous
January 30, 2014
Great post. Simple walk-through and very effective.