When working with Snowflake to load data from external stage zero-byte (0b) or 1-byte (1b) files are not accessible. In this blog post, we'll explore these challenges and outline potential solutions for handling such file ingestion problems.
The problem - Querying Files from an External Stage
While querying an external stage, zero-byte (0b) and one-byte (1b) files are not accessible through the standard SELECT query in Snowflake, creating an obstacle when that file's metadata is part of your dataset.
For files larger than 1b the content of the queried files can be accessed via the $1 column, and metadata about the files can be retrieved using several built-in metadata columns. These include:
METADATA$FILENAME: The full path to the data file the current row belongs to.
METADATA$FILE_ROW_NUMBER: The row number of the record within the data file.
METADATA$FILE_CONTENT_KEY: A checksum of the data file.
METADATA$FILE_LAST_MODIFIED: The last modified timestamp of the data file.
METADATA$START_SCAN_TIME: The timestamp when the scanning of the file started.
Here’s an example of such a query:
SELECT $1, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER FROM @my_stage (FILE_FORMAT => my_custom_format, PATTERN => '.*\.csv'); |
This query will work for files with content greater than 1 byte. However, files that are 0b or 1b in size are not accessible via the standard SELECT query.
Workaround - listing the files
A useful workaround is to query the results of the LIST stage command or use a SELECT from the stage's directory to access the metadata for these small files.
Here's how you can list the files in the stage and retrieve metadata:
LIST @my_stage; SET qid=LAST_QUERY_ID(); SELECT “name”,“size”, “md5”, “last_modified” FROM table(result_scan($qid)) ; |
Alternatively, you can use SELECT with the DIRECTORY option:
SELECT * FROM DIRECTORY(@my_stage); |
This approach allows you to access the file names, sizes, hash key, and last modified dates, even for small files that cannot be directly queried.
Workaround - adding metadata
If you want to combine metadata from 0b and 1b files with the data from larger files that can be queried, you can use a UNION ALL to merge the results. This approach ensures that all relevant file information is captured. Here's an example of how this can be achieved:
WITH large_files AS ( SELECT $1 AS FILE_CONTENT, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER FROM @my_stage (FILE_FORMAT => my_custom_format, PATTERN => '.*\.csv') ), small_files AS ( SELECT FILENAME, LAST_MODIFIED FROM DIRECTORY(@my_stage) WHERE FILE_SIZE <= 1 ) SELECT * FROM large_files UNION ALL SELECT null AS FILE_CONTENT, FILENAME, null AS FILE_ROW_NUMBER FROM small_files; |
This query selects the data from files larger than 1b and combines it with metadata from smaller files. The same can be done using the LIST stage command. In that case you will need to replace the small files query with the LIST command and query form the query results provided above.
Additional context
Understanding which file types can potentially be 0b or 1b is crucial when dealing with external stages. Below is a breakdown of common file types and their potential to be zero or one byte in size:
CSV Files:
CSV files can technically be 1 byte if they contain an empty column represented by a single character such as a comma. 0b CSV file, however, would be invalid as it wouldn't contain even the structural elements (like delimiters).
JSON Files:
JSON files cannot be 1 byte because they require at least {} to form a valid empty JSON object.
A 0b JSON file is similarly invalid.
Text Files (TXT):
TXT files can be 1 byte and still valid, as they may contain a single character or a newline. A 0b TXT file, though, would be empty and contain no useful information.
Parquet Files:
Parquet files are a structured format and can't be 1 byte. These files have a header and metadata that make it impossible for them to be valid at such a small size. A 0b Parquet file would be considered corrupt or incomplete.
Avro Files:
Similar to Parquet, Avro is a structured format with metadata, and it cannot be valid if it's only 1 byte. 0b Avro files would also be invalid.
Conclusion
Handling file ingestion issues in Snowflake, especially with 0b and 1b files, can be challenging. However, by using workarounds like LIST commands and DIRECTORY queries, and understanding which file types can be zero or one byte, you can better manage the ingestion process. These solutions will help you ensure that no data—regardless of file size—is left behind when querying from external stages.
If you're working with file types that may be susceptible to these issues, it’s important to account for them during your ingestion workflows. Keep in mind the tips outlined here to handle file ingestion more effectively and avoid common pitfalls.
Aslan Hattukai
Data Engineer
Comments