Overview
Google Cloud Storage stores files that can be downloaded and shared.
Google BigQuery data can be extracted and loaded into Google Cloud Storage, where it can then be shared with other users.
Magnus features the GS Export Task, which exports data from a source Google BigQuery table to a Google Cloud Storage destination.
The GS Export Task provides options for export format, compression, and field delimiter.
Front Panel
Flip Panel
Adding GS Export Task
1. Under Add Task, select the desired position from the "Add to position" drop-down list, then click +GS Export.
Source Table
1. Enter the source BQ table to export.
Source Table Options
a. Click the icon to open the BQ selector and choose an existing BigQuery table as the source.
b. The source BQ table can also be entered manually, in the format projectID:datasetID.tableID, where the project ID is optional. If the project ID is not provided, the user's billing project will be used.
For example: myDataset.myResult
c. Parameters can be used when specifying the source table.
Source Table Permissions
d. User must have read permission for the source table.
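As a sketch of how a source table reference in the format above might be resolved, assuming a hypothetical helper function and a stand-in billing project (neither is part of Magnus itself):

```python
# Hypothetical sketch: resolve a BQ table spec of the form
# projectID:datasetID.tableID, where the project ID is optional.
# BILLING_PROJECT is an assumed stand-in for the user's billing project.

BILLING_PROJECT = "my-billing-project"

def resolve_source_table(spec, billing_project=BILLING_PROJECT):
    """Split a BQ table spec into (project, dataset, table)."""
    if ":" in spec:
        project, rest = spec.split(":", 1)
    else:
        # No project ID given: fall back to the user's billing project.
        project, rest = billing_project, spec
    dataset, table = rest.split(".", 1)
    return project, dataset, table
```

For example, `resolve_source_table("myDataset.myResult")` falls back to the billing project, while `resolve_source_table("myProject:myDataset.myResult")` uses the explicit project.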
Destination
1. Enter the destination GS URI.
Destination Requirements
a. The destination GS URI must be prefixed with gs://
b. The GS bucket must already exist and user must have write permission to the bucket.
c. If there are subdirectories in the destination GS URI, they will be created if they don't already exist.
d. The BigQuery export API has a limit of 1 GB of data per file, so if the source BQ table is larger than 1 GB, a wildcard pattern is required in the destination GS URI.
Only a single-wildcard URI is supported; Magnus currently doesn't support multiple-wildcard URIs.
e. Parameters can be used when specifying the destination GS URI.
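The destination rules above can be sketched as a small validation routine. The function name and the size check are illustrative assumptions, not Magnus's actual implementation:

```python
# Hypothetical validation of a destination GS URI, reflecting the rules above:
# the gs:// prefix is required, at most one '*' wildcard is allowed, and a
# wildcard is required when the source table exceeds 1 GB.

def validate_destination_uri(uri, table_bytes=0, one_gb=1 << 30):
    if not uri.startswith("gs://"):
        raise ValueError("destination GS URI must be prefixed with gs://")
    wildcards = uri.count("*")
    if wildcards > 1:
        raise ValueError("Magnus supports at most one wildcard in the GS URI")
    if table_bytes > one_gb and wildcards == 0:
        # BigQuery writes at most 1 GB per exported file.
        raise ValueError("source table exceeds 1 GB; a wildcard pattern is required")
    return True
```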
Export Options
2. Click on the Export Format link to specify the export format.
a. These export formats are supported:
i. CSV
ii. JSON
iii. Avro
b. For CSV export format:
i. User has the option to compress the data to GZIP format
ii. User can specify the field delimiter
iii. User has the option to include header
c. For JSON export format:
i. User has the option to compress the data to GZIP format
ii. Field delimiter does not apply
iii. Header does not apply
d. For Avro export format:
i. User has the option to compress the data to DEFLATE or SNAPPY format.
ii. Field delimiter does not apply
iii. Header does not apply
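The per-format options above can be summarized as a small lookup table. The dictionary keys and the checker function are illustrative, not Magnus's actual configuration schema:

```python
# Sketch of the export-format option matrix described above: which
# compression codecs each format supports, and whether a field delimiter
# applies. Names are illustrative assumptions.

EXPORT_FORMATS = {
    "CSV":  {"compression": ("GZIP",),            "field_delimiter": True},
    "JSON": {"compression": ("GZIP",),            "field_delimiter": False},
    "Avro": {"compression": ("DEFLATE", "SNAPPY"), "field_delimiter": False},
}

def check_export_options(fmt, compression=None, field_delimiter=None):
    """Raise ValueError if an option is not valid for the chosen format."""
    opts = EXPORT_FORMATS[fmt]
    if compression is not None and compression not in opts["compression"]:
        raise ValueError(f"{compression} is not supported for {fmt}")
    if field_delimiter is not None and not opts["field_delimiter"]:
        raise ValueError(f"field delimiter does not apply to {fmt}")
    return True
```

For example, a GZIP-compressed CSV with a custom delimiter passes, while specifying a field delimiter for JSON raises an error.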
Processing Location
Select a location where the BQ extract job will run. BQ jobs that run in a specific location may only reference data in that location. For data in either the US or EU, you may choose auto-select to run the BQ job in the location where the data resides. For data in other locations, you must specify the processing location explicitly.
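The location rule above can be sketched as follows; the function and the set of auto-select locations are illustrative assumptions based on the description, not Magnus internals:

```python
# Sketch of the processing-location rule: US or EU data may use auto-select
# (the job runs where the data resides); other locations must be explicit.

AUTO_SELECT_LOCATIONS = {"US", "EU"}

def processing_location(data_location, requested="auto-select"):
    if requested == "auto-select":
        if data_location in AUTO_SELECT_LOCATIONS:
            return data_location  # run the BQ job where the data resides
        raise ValueError(
            f"data in {data_location} requires an explicit processing location")
    return requested
```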
Limitations
- The list of collected items (such as BQ jobs collected in asynchronous mode) cannot exceed 5 million characters in length.
Execution Mode
User can specify to execute the GS Export Task synchronously or asynchronously. When the GS Export Task is executed in synchronous mode, the Task will wait for the BQ extract job to complete before going on to the next Task. When the GS Export Task is executed in asynchronous mode, the Task will not wait for the BQ extract job to complete. It will go on to the next Task immediately after the BQ extract job is submitted. To specify the Execution Mode:
- Click on the Advanced icon
- Specify the Execution Mode
- For asynchronous Execution Mode, user can choose to collect the BQ job in a custom parameter. User can then use Hub Task to wait for the collected BQ Jobs to complete. See Hub Task for more details.
- By default, the new BQ job will be appended to the specified parameter. To overwrite the parameter instead, check Overwrite List:
Important:
If the user chooses to append to the specified parameter rather than overwrite it, care must be taken to ensure the parameter is cleared before first use. Because this parameter is Workflow-owned, its state is preserved across Workflow executions; if it is not cleared before first use, new items will accumulate in the parameter with each Workflow execution.
Here are some suggestions on how to clear the parameter before first use:
- In the Workflow, identify the first GS Export Task with asynchronous execution mode, and check the checkbox Overwrite List. This ensures that the parameter is cleared and overwritten for the first use. Then, for subsequent GS Export Tasks with asynchronous execution mode, leave the checkbox Overwrite List unchecked, so that new items will be appended to the parameter.
- At the beginning of the Workflow, add a Script Task to clear the parameter. This is a Script Task that returns an empty string and its Return Parameter is set to the parameter to clear.
- At the end of the Workflow, add a Script Task to clear the parameter. This is a Script Task that returns an empty string and its Return Parameter is set to the parameter to clear.
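The append-versus-overwrite behavior can be illustrated with a toy model of a Workflow-owned list parameter whose state persists across executions. All names here are illustrative, not Magnus's actual parameter mechanism:

```python
# Toy model: a Workflow-owned parameter persists across Workflow executions,
# so the first asynchronous GS Export Task should overwrite the list (or the
# parameter should be cleared) to avoid accumulating items from earlier runs.

def record_job(params, name, job_id, overwrite=False):
    if overwrite:
        params[name] = []          # "Overwrite List" checked: start fresh
    params.setdefault(name, []).append(job_id)

params = {}                        # persists across executions in this model
for execution in range(2):         # simulate two Workflow executions
    record_job(params, "jobs", f"job-a-{execution}", overwrite=True)  # first task overwrites
    record_job(params, "jobs", f"job-b-{execution}")                  # later tasks append
assert params["jobs"] == ["job-a-1", "job-b-1"]  # no leftovers from run 0
```

If the first task did not overwrite, the list after the second execution would also contain the jobs from the first execution, which is exactly the accumulation the suggestions above guard against.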