Dealing with Unstructured Data in Snowflake
The storage of unstructured data is on the rise in modern data warehouses due to the increasing collection of text data and metadata in an unstructured format. Like most data warehousing solutions, Snowflake offers support to load and manipulate semi-structured files and data. This is essential, as, modern data warehousing systems need to store large amounts of semi-structured data like emails, reviews, and metadata. This has led to the development of many Nosql databases that are designed to handle large amounts of unstructured data. Snowflake tries to bridge the gap between traditional RDBMS and Nosql databases via native support, for loading and manipulating semi-structured data.
Snowflake supports the JSON, Avro, Parquet, ORC and XML semi-structured file formats for loading data into its tables. For each file, the top-level object is loaded in a separate row. With this native support, these file formats can be integrated with Snowflake`s serverless Snowpipe service and also be queried from external/internal Stages.
How relational/columnar warehouses handle unstructured data:
Warehouses usually handle unstructured data using two approaches:
- Flatten unstructured data into structured columns – the advantage of this approach is, performance optimizations like, indexing and clustering
- Provide a datatype like xml or json to store data in unstructured format. This approach while being able to store all the data leads to parsing during querying, resulting in slower performance
How snowflake handles unstructured data:
Snowflake offers a ‘variant’ data type to handle unstructured data. The maximum size of this data type is 16 mb or the size of a micro-partition. This is similar to approach 2 listed above. Additionally, snowflake tries to create columns from the keys of unstructured data, based on the repetition of the keys.
Snowflake documentation describes in detail about how this is achieved. However here are some columns that are ignored while searching for keys:
- Keys with different data types across all documents/semi-structured partitions
- Keys with null values
This hybrid approach ensures query performance optimization. It ensures native support for semi-structured data using the ‘Variant’ data type and key-value access, using the json/xml paradigm and functions like Flatten, Lateral flatten etc.
Recommendations from snowflake for processing semi-structured data:
- When you already know the end use of data, Snowflake recommends, comparing of approaches 1 and 2 to figure out the right mix of flattening Vs semi-structured data formats. Factors to consider here would be, date-time keys or ID keys for clustering and pruning of partitions during querying
- When you do not know the end use of data, Snowflake recommends using the variant data type to load the semi-structured data without flattening
- In case there is a date time field or an Id field that can be used for clustering, they can be converted to a separate column while keeping the rest of the data in a semi-structured format
With Snowflake, there is a possibility to create a single unified warehouse for both structured and unstructured data sources in any enterprise. Its advantages are, maintaining single unified storage of data and the possibility to use structured and unstructured data for analysis and feature engineering.
To know more about our expertise, write to us at [email protected]