Working with Data
In this section, we will go over how to bring training data into your projects. You need both training and test datasets for your supervised learning algorithms.
Every project has a /volumes/data volume pre-configured for storing both training data and any intermediate data output. You can access this folder via /volumes/data in your JupyterLab IDE terminal. You can also access the /volumes/data folder from your CLI; refer to the CLI Reference for more details.
You can list the contents of the /volumes/data folder using roro volumes:ls data.
wget is a free utility for non-interactive download of files from the web. In many cases, training data is hosted either internally or externally and accessible via a URL. You can use this utility to download data to your project.
wget -O <destination-path> <source-path>
# Download data from a URL
$ wget -O /volumes/data/file_name.zip http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz

# Download data from a public S3 bucket
$ wget -O /volumes/data/file_name.jpg http://bucket_name.s3.amazonaws.com/file_name.jpg

# Download to the current directory
$ wget http://bucket_name.s3.amazonaws.com/file_name.jpg
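The same download can also be done from Python using only the standard library. This is a minimal sketch; the commented URL and destination path are placeholders, not real endpoints.

```python
from urllib.request import urlretrieve

# Minimal Python equivalent of `wget -O <destination-path> <source-path>`.
def download(url, destination):
    # urlretrieve saves the response body to `destination` and returns
    # the local path along with the response headers.
    path, headers = urlretrieve(url, destination)
    return path

# Example (placeholder URL and path):
# download("http://bucket_name.s3.amazonaws.com/file_name.jpg",
#          "/volumes/data/file_name.jpg")
```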
You can also use wget from within a notebook to download files to the /notebooks folder of your project volume. Just prefix the command with !, like so:
!wget -O /notebooks/file_name.jpg http://bucket_name.s3.amazonaws.com/file_name.jpg
Using S3 Utilities
Amazon S3 is routinely used as a store for very large data files. You can access these data files using the AWS CLI and boto3. You will need your AWS credentials handy to use either method. You can store your aws_secret_access_key in the project's environment variables. You can also use these credentials directly in a notebook or a file, but we advise against this practice.
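One way to keep secrets out of notebook code is to read them from the environment. The sketch below assumes the standard AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), which boto3 also reads automatically; the helper function is our own illustration, not part of any library.

```python
import os

def s3_credentials_from_env():
    """Collect AWS credentials from environment variables so they never
    appear in a notebook or source file. Returns a dict of keyword
    arguments suitable for passing to boto3.Session(**kwargs)."""
    return {
        "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }

# Usage (requires boto3 and valid credentials):
#   import boto3
#   s3 = boto3.Session(**s3_credentials_from_env()).resource("s3")
```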
Every file in S3 has a prefix and a file_name. Both are required to download the file correctly.
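To make the prefix/file_name split concrete, here is a small helper (our own illustration, not part of boto3) that breaks an `s3://` URI into its bucket, prefix, and file name:

```python
def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, prefix, file_name).

    e.g. "s3://bucket_name/training/images/cat.jpg"
         -> ("bucket_name", "training/images", "cat.jpg")
    """
    path = uri[len("s3://"):]
    # The first path segment is the bucket; the rest is the object key.
    bucket, _, key = path.partition("/")
    # Everything before the last "/" in the key is the prefix.
    prefix, _, file_name = key.rpartition("/")
    return bucket, prefix, file_name
```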
# Install the AWS CLI
$ pip install awscli

# Download using the AWS CLI cp command (to the current directory)
$ aws s3 cp s3://bucket_name/prefix/file_name .

# Download to the data volume
$ aws s3 cp s3://bucket_name/prefix/file_name /volumes/data/file_name
You can also use the boto3 Python library from AWS to access and download data.
>>> import boto3
>>> s3 = boto3.resource('s3')

# Print all the buckets you can access
>>> for bucket in s3.buckets.all():
...     print(bucket.name)

# Download data from the S3 bucket
>>> s3.Bucket(bucket_name).download_file(file_prefix + '/' + file_name, 'destination-path/file_name')
Moving data from local machine
You can also upload data from your local machine to the /volumes/data folder. Ensure that you are in the correct project folder, the one containing the roro.yml configuration file.
# Upload a file from your local machine to the cloud
$ roro cp ./dataset.txt volume:/dataset.txt

# Download a file from the cloud to your local machine
$ roro cp volume:/dataset.txt ./dataset.txt
roro cp works well with smaller files, of overall size less than 100MB. For larger files, we recommend the methods described above for better speed and consistency.
Help us improve the documentation: flag errors and issues, or request how-tos, guides, and tutorials on the #documentation channel on our Slack.