Using Gzip for Storage Optimisation in Large CSV Data Sets

How to work with CSV.gzip files in Python and decompress them through the CLI.


Working with CSV files can be a hassle, especially when the files are large. One way to make the process easier is to compress the files using gzip, which can significantly reduce the file size.

In this post, I’ll show you how to work with CSV.gzip files using Python and how you can decompress them through the command line interface so they can be opened in an application such as Excel.

Working with CSV.gzip files in Python

First, you’ll need to import the gzip module and the csv module. You can do this by running the following code:

import gzip
import csv

Next, you’ll need to open the gzipped CSV file. You can do this using the gzip.open() function, which works just like the built-in open() function, but automatically decompresses the file. Here’s an example:

with gzip.open('data.csv.gz', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

In this example, we’re using the with statement to open the file data.csv.gz. The 'rt' mode means “read, text”: gzip.open() decompresses the file on the fly and returns a text stream, just as the built-in open() would in plain 'r' mode. The csv.reader() function then wraps that stream in a reader object that can be iterated over to read the rows of the CSV file.
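If your file has a header row, csv.DictReader can return each row as a dictionary keyed by column name, which is often more readable than indexing into lists. A minimal sketch, assuming headers like the ones written in the next example:

import gzip
import csv

with gzip.open('data.csv.gz', 'rt') as f:
    reader = csv.DictReader(f)  # uses the first row as field names
    for row in reader:
        # values are read as strings, e.g. {'Ticker': 'TSLA', 'Price': '143.0', ...}
        print(row['Ticker'], row['Price'])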

It is also possible to write data to a CSV.gzip file by using the gzip.open() function in write mode. Here’s an example:

with gzip.open('data.csv.gz', 'wt', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Ticker', 'Price', 'P/E Ratio'])
    writer.writerow(['TSLA', 143.00, 44.33])
    writer.writerow(['AAPL', 140.30, 23.32])

In this example, we’re using the with statement to open the file data.csv.gz in write mode. The 'wt' mode means “write, text”: gzip.open() compresses the text you write to the stream. Passing newline='' follows the csv module’s recommendation, so the writer’s own line endings aren’t translated. The csv.writer() function then returns a writer object whose writerow() method writes the rows of the CSV file.
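If you already have an uncompressed data.csv on disk, you don’t need to go through the csv module at all; you can compress the whole file in one go with shutil.copyfileobj(). A minimal sketch:

import gzip
import shutil

# stream data.csv into a gzip-compressed copy without loading it all into memory
with open('data.csv', 'rb') as src, gzip.open('data.csv.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)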

Working with CSV.gzip files in Python is a great way to save space and make your data processing tasks more efficient. With the gzip and csv modules, you can easily read and write compressed CSV files with minimal code.
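As an aside, if you use pandas, recent versions infer the compression from the .gz extension, so reading and writing gzipped CSV files is a one-liner each. A sketch, assuming pandas is installed:

import pandas as pd

# compression='infer' is the default, so the .gz extension is enough
df = pd.read_csv('data.csv.gz')
df.to_csv('data.csv.gz', index=False)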

How to decompress a CSV.gzip file using the CLI

You can decompress a CSV.gzip file from the command line interface (CLI) with the gunzip command, which decompresses files that were compressed with gzip. Here’s an example of how to use it on a CSV.gzip file:

gunzip data.csv.gz

This command decompresses the file data.csv.gz and creates a new file named data.csv, which you can then open in Excel. Note that gunzip removes the original .gz file by default; most versions accept a -k flag to keep it.

Alternatively, you can use the zcat command:

zcat data.csv.gz > data.csv

This command decompresses data.csv.gz to standard output and redirects the output into a new file named data.csv, leaving the original data.csv.gz in place.

If you don’t have the gunzip or zcat command installed, you can install the gzip package using your package manager, such as apt or yum.

Once the command has run, you will have the decompressed file data.csv, which you can open in Excel and work with as you would any other CSV file.
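If you’d rather stay in Python than shell out, the same one-shot decompression can be done with the gzip and shutil modules. A minimal sketch, equivalent to gunzip but keeping the original file:

import gzip
import shutil

# write the decompressed data.csv; data.csv.gz is left untouched
with gzip.open('data.csv.gz', 'rb') as src, open('data.csv', 'wb') as dst:
    shutil.copyfileobj(src, dst)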

How to download a directory from S3 using the AWS CLI

By using the AWS CLI and its `aws s3 cp` command, you can download a folder directly from an S3 bucket to your local machine.


The AWS CLI has the functionality you need to download a folder directly from an AWS S3 bucket to your local machine.

To get started, make sure you have the AWS CLI installed, then create a folder such as ~/data on your local machine where you wish to store your S3 bucket downloads.

Using the aws s3 cp [bucketURI] [localDirPath] command, you can download a single file directly from an S3 bucket to your local machine, but to make this work with folders or directories you also need to pass the --recursive flag.

The command below tells the CLI to recursively download all files and folders from the S3 bucket URI to the ~/data directory on your local machine.

aws s3 cp s3://your-s3-bucket/path ~/data --recursive

Performing a dry run

If it’s a large folder with a lot of files, you may wish to do a dry run first by passing the --dryrun flag. This simulates the download without actually transferring any files, highlighting any issues or errors along the way.

aws s3 cp s3://your-s3-bucket/path ~/data --recursive --dryrun

Filtering file types

By default, downloading with the --recursive flag includes every file under the given S3 path. If you only want to include files of a certain type in your download request, you can filter them using the --exclude and --include flags.

It’s important to note that to use the --include flag correctly, you first have to exclude all files with the --exclude "*" flag, then add --include flags for your chosen file types. The order matters when setting both of these, as filters that appear later in the command take precedence.

The example below recursively downloads all files from the specified S3 bucket location that have a .csv extension.

aws s3 cp s3://your-s3-bucket/path ~/data --recursive --exclude "*" --include "*.csv"

To download multiple file types in a single request, you can pass additional --include flags, as in the example below, which downloads both .csv and .xls files.

aws s3 cp s3://your-s3-bucket/path ~/data --recursive --exclude "*" --include "*.csv" --include "*.xls"
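Finally, if you need the same recursive download from a Python script rather than the shell, here’s a minimal boto3 sketch. The bucket name and prefix are placeholders, and it assumes boto3 is installed and your AWS credentials are already configured:

import os
import boto3

s3 = boto3.client('s3')
bucket = 'your-s3-bucket'   # placeholder bucket name
prefix = 'path/'            # placeholder key prefix ("folder")
dest = os.path.expanduser('~/data')

# list every object under the prefix, one page at a time
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):
            continue  # skip zero-byte "directory" placeholder keys
        local_path = os.path.join(dest, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)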

References

You can read more about the available flags and options in the official AWS CLI documentation.
