Working with Data

In this section, we will go over how to bring training data into your project. You need both training and test datasets for your supervised learning algorithms.

Data Folder

Every project has a /volumes/data volume pre-configured for storing both training data and any intermediate data output. You can access this folder via /volumes/data in your JupyterLab IDE terminal. You can also access the /volumes/data folder from your CLI; please refer to the CLI Reference for more details.

You can list the contents of the data volume using roro volumes:ls data.
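From a notebook or the JupyterLab terminal you can also inspect the volume with plain Python. The following is a minimal sketch using only the standard library; the files it prints are simply whatever you have stored in the volume.

import os

# List the contents of the data volume along with file sizes
for name in os.listdir('/volumes/data'):
    path = os.path.join('/volumes/data', name)
    print(name, os.path.getsize(path), 'bytes')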

Using wget

wget is a free utility for non-interactive download of data from the web. In many cases the training data is hosted either internally or externally and is accessible via a URL. You can use this utility to download data to your project.

Use wget -O <destination-path> <source-path>

# Download a file from the GNU FTP server
$ wget -O /volumes/data/file_name.tar.gz http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz

# Download data from a public S3 bucket
$ wget -O /volumes/data/file_name.jpg http://bucket_name.s3.amazonaws.com/file_name.jpg

# Download to the current directory
$ wget http://bucket_name.s3.amazonaws.com/file_name.jpg

You can also use wget from within a notebook to download files into the /notebooks folder of your project volume. Just prefix the command with !, like so:

!wget <source-path>

!wget http://bucket_name.s3.amazonaws.com/file_name.jpg
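If you prefer to stay in Python instead of shelling out to wget, the standard library can perform the same download. This is a minimal sketch assuming a publicly accessible URL; the URL and file name below are placeholders.

import urllib.request

# Download a publicly accessible file directly into the data volume
url = 'http://bucket_name.s3.amazonaws.com/file_name.jpg'
urllib.request.urlretrieve(url, '/volumes/data/file_name.jpg')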

Using S3 Utilities

Amazon S3 is routinely used as a store for very large data files. You can access these data files using the AWS CLI and boto3. You will need your AWS credentials handy to use this method. You can store your aws_access_key_id and aws_secret_access_key in the project's environment variables. You can also use these credentials directly in a notebook or a file, but we advise against this practice.
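For example, boto3 reads the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables automatically, so storing your credentials under those names keeps them out of your code. A minimal sketch, assuming you have set those two environment variables for the project:

import os
import boto3

# boto3 picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the
# environment on its own; passing them explicitly, as below, is
# equivalent and makes the dependency on the environment variables clear.
s3 = boto3.resource(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)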

Every file in S3 has a prefix and a file_name; together they form the object's key, and both are required to download the file correctly.

# install AWS CLI
$ pip install awscli

# Using the AWS CLI cp command: download to the current directory
$ aws s3 cp s3://bucket_name/prefix/file_name .

# Download to the data volume
$ aws s3 cp s3://bucket_name/prefix/file_name /volumes/data/file_name

You can also use the boto3 Python library from AWS to access and download data.

>>> import boto3
>>> s3 = boto3.resource('s3')

# Print all the buckets you can access
>>> for bucket in s3.buckets.all():
...     print(bucket.name)

# Download data from the S3 bucket
>>> s3.Bucket(bucket_name).download_file(file_prefix + '/' + file_name, '/volumes/data/' + file_name)
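If you need to pull every object under a prefix rather than a single file, boto3 can list and download them in one loop. The following is a minimal sketch, assuming your credentials are already configured; bucket_name and 'prefix/' are placeholders.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')  # placeholder bucket name

# Walk every object under the prefix and download it into the data
# volume, keeping only the trailing file name from each key.
for obj in bucket.objects.filter(Prefix='prefix/'):
    file_name = obj.key.rsplit('/', 1)[-1]
    if file_name:  # skip "folder" placeholder keys that end with '/'
        bucket.download_file(obj.key, '/volumes/data/' + file_name)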

Moving Data from Your Local Machine

You can also upload data from your local machine to the /volumes/data folder. Ensure that you are in the correct project folder, the one that contains the roro.yml configuration file.

# Upload file from local machine to cloud
$ roro cp ./dataset.txt volume:/dataset.txt

# Download file from cloud to local machine
$ roro cp volume:/dataset.txt ./dataset.txt

roro cp works well for smaller files, with an overall size of less than 100 MB. For larger datasets, we recommend the two methods described above for better speed and consistency.
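If your dataset is a directory of many small files, packing it into a single archive before copying keeps the transfer to one object and makes the size limit easier to reason about. A minimal sketch using the Python standard library; the paths are hypothetical examples.

import shutil

# Pack ./dataset/ into dataset.zip next to roro.yml; copy the archive
# with roro cp and unpack it on the other side.
shutil.make_archive('dataset', 'zip', root_dir='./dataset')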

Feedback

Help us improve the documentation. Flag errors and issues, or request how-tos, guides, and tutorials in the #documentation channel on our Slack.