cloudDS module
Title : Cloud data access from Python scripts.
Description : A Python class for accessing files from GCS, S3 or local storage within Python scripts for data analysis. (Handles authenticating the data source, and retrieving and uploading files from/to buckets.)
Functions
def get_logger(
)
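A minimal hypothetical usage sketch, assuming the module is imported as `cloud` (as in the examples below), and that get_logger takes no arguments (as shown above) and returns a standard `logging.Logger` configured for console output:
# get a logger for console output (return type is an assumption)
>>> logger = cloud.get_logger()
>>> logger.info('cloudDS logger ready')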
Classes
class DataSource
A data source class that handles authentication, buckets and files.
Ancestors (in MRO)
- DataSource
- builtins.object
Methods
def __init__(
self, name='local', set_default=True, log=False, **kwargs)
DataSource class initialization.
Initialize the DataSource with the base libcloud drivers for the cloud storage, handle authentication, set the bucket/container and create a local temp directory. Also sets a default environment variable for the storage if the argument is supplied.
Keyword Arguments:
name {str} -- Name of the datasource (default: {'local'})
set_default {bool} -- Set default env name for DataSource on system (default: {True})
log {bool} -- whether to log to the console (default: {False})
Arguments:
**kwargs:
auth_keys {str | tuple} -- authentication keys or env names of the keys, as a string or a tuple of strings (default: {''})
auth_profile {str} -- profile name of aws whose credentials are to be used (default: {'default'})
creds_path {str} -- path to the credentials file containing access id and keys for aws (default: {'~/.aws/credentials'})
bucket {str} -- bucket name on the cloud storage to be used (default: {None})
cwd {str} -- path on the cloud storage inside the specified bucket to be used as the cloud working directory (default: {None})
temp_path {str} -- path on the local system where a tempdir is created for the cloudDS instance (default: {'/tmp/'})
override {bool} -- overrides the existing temp directory path in the environment variable if true (default: {False})
clear_old {bool} -- when creating a new temp path using override, should the previous temp be deleted? (default: {False})
Raises:
IOError -- raised when the data storage cannot be set up
Examples:
(default init): params(defaults) -- set_default(True), log(False)
# set up DataSource with the storage type set as default on an env variable.
>>> s3 = cloud.DataSource('s3')
# set up DataSource w/o the storage type set as default on an env variable.
>>> s3 = cloud.DataSource('s3', set_default=False)
# set up DataSource with logs enabled
>>> s3 = cloud.DataSource('s3', log=True)
(authentication): params(defaults) -- auth_keys(''), auth_profile('default'), creds_path('~/.aws/credentials')
# set up DataSource with creds from the awscli config (default profile)
>>> s3 = cloud.DataSource('s3')
# set up DataSource with creds from the awscli config with a different profile.
>>> s3 = cloud.DataSource('s3', auth_profile='statlas')
# set up DataSource with creds from an awscli config file at a path
>>> s3 = cloud.DataSource('s3', creds_path='path/to/credentials/file')
>>> s3 = cloud.DataSource('s3', creds_path='path/to/credentials/file', auth_profile='statlas')
# set up DataSource with authentication keys
>>> s3 = cloud.DataSource('s3', auth_keys='ID, KEY')
>>> s3 = cloud.DataSource('s3', auth_keys=('ID', 'KEY'))
>>> s3 = cloud.DataSource('s3', auth_keys='ENV_VAR_ID, ENV_VAR_KEY')
>>> s3 = cloud.DataSource('s3', auth_keys=('ENV_VAR_ID', 'ENV_VAR_KEY'))
(set bucket and working directory): params(defaults) -- bucket(None), cwd(None)
# set up DataSource with a bucket
>>> s3 = cloud.DataSource('s3', bucket='bucket-name')
# set up DataSource with a working directory on the storage;
# along with cwd, the bucket argument is required.
>>> s3 = cloud.DataSource('s3', bucket='bucket-name', cwd='path/on/bucket')
(temp directory): params(defaults) -- temp_path('/tmp/'), override(False), clear_old(False)
# set up DataSource with the temp directory in the system temp path
>>> s3 = cloud.DataSource('s3')
# set up DataSource with the temp directory in a specific path
>>> s3 = cloud.DataSource('s3', temp_path='local/temp/path')
# reset the DataSource with the temp directory in a different path;
# sets up the new temp by overriding the env variable and clearing
# the old temp directory if it exists.
>>> s3 = cloud.DataSource('s3', temp_path='some/path', override=True, clear_old=True)
def authenticate(
self, auth_keys='', **kwargs)
Authenticate the cloud storage with authentication keys
By default, the function retrieves keys from the default profile of the credentials file from awscli path. Alternatively, a string of id and key or env variable names separated by a comma or as a tuple of strings can be passed. Also, a different profile or a path for the credentials can be provided.
Keyword Arguments:
auth_keys {str | tuple} -- Access Id and secret key (cases: {'idx,key', ('idx', 'key'), 'ENVID,ENVKEY', ('ENVID', 'ENVKEY')}) (default: {''})
**kwargs:
auth_profile {str} -- profile name of aws whose credentials are to be used (default: {'default'})
creds_path {str} -- path to the credentials file containing access id and keys for aws (default: {'~/.aws/credentials'})
Returns:
{bool} -- true if authentication succeeds, else false
Examples:
# authenticate the storage with creds from the awscli config
# (default profile)
>>> s3.authenticate()
# authenticate the storage with creds from the awscli config with a
# different profile.
>>> s3.authenticate(auth_profile='statlas')
# authenticate the storage with creds from an awscli config file
# at a path
>>> s3.authenticate(creds_path='path/to/credentials/file')
>>> s3.authenticate(creds_path='path/to/credentials/file', auth_profile='statlas')
def parse_aws_profile(
self, credentials_path, profile)
Parse the aws access id and key from the credentials file for the requested profile.
Arguments:
credentials_path {str} -- path to the credentials file containing the aws access id and key
profile {str} -- name of the profile whose credentials are to be used
Returns:
{list} -- a list containing aws id and key.
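Examples (a minimal sketch; the credentials path and profile name shown are the awscli defaults mentioned above and are illustrative):
# parse the access id and key for a profile from the awscli credentials file
>>> aws_id, aws_key = s3.parse_aws_profile('~/.aws/credentials', 'default')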
def set_bucket(
self, bucket_name, set_default=False)
Set a bucket from data storage as default and retrieve the container.
Arguments:
bucket_name {str} -- Name of the bucket/container on the cloud storage
Keyword Arguments:
set_default {bool} -- Set an environment variable for the bucket as the default for the data storage if true (default: {False})
Returns:
{bool} -- true if bucket/container is retrieved else false
Examples:
# set a bucket on the connected storage
>>> s3.set_bucket('bucket-name')
# set a bucket on the connected storage with the set-default option
>>> s3.set_bucket('bucket-name', set_default=True)
def set_local_temp_path(
self, path='/tmp/', **kwargs)
Creates a local temp directory for the data storage for the current instance. The temp path where the directory is created can be specified.
By default, the dir is created in '/tmp/'
Arguments:
path {str} -- local temp path where the dir is created (default: {'/tmp/'})
**kwargs:
override {bool} -- overrides the existing temp directory path in the environment variable if true (default: {False})
clear_old {bool} -- when creating a new temp path using override, should the previous temp be deleted? (default: {False})
Returns:
{bool} -- true if the dir is created, else false
Examples:
# set/create the temp directory in the path from the
# 'CLOUD_DIR' env variable if it exists, else the system temp
>>> s3.set_local_temp_path()
# set/create the temp directory in a specific path
# when the 'CLOUD_DIR' env variable doesn't exist,
# else the path is taken from the env.
>>> s3.set_local_temp_path(temp_path='local/temp/path')
# reset the DataSource's temp directory with a different path;
# sets up the new temp by overriding the env variable, with the option to
# clear the old temp directory if it exists.
# the 'CLOUD_DIR' env variable is updated with the new path.
>>> s3.set_local_temp_path(temp_path='some/path', override=True, clear_old=True)
def set_wd(
self, path, **kwargs)
Set path on the bucket/container from data storage as the working directory/default.
Arguments:
path {str} -- Path on the local/cloud storage to be set as the working directory
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage to be used (default: {cls.bucket_name})
Returns:
{bool} -- true if working directory is set else false
Examples:
# set the working directory on the connected storage
>>> s3.set_wd('bucketpath')
# set the working directory on the connected storage when the bucket is not set
>>> s3.set_wd('bucketpath', bucket='bucket-name')
def clear_temp(
self)
Clear the temp directory by deleting all the files recursively.
Raises:
Exception -- if failed to clear the tempdir
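Examples (a minimal sketch, assuming an `s3` DataSource instance as in the examples above):
# delete all files from the instance's local temp directory
>>> s3.clear_temp()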
def create_dir(
self, path)
Create directory(s) with checks.
Arguments:
path {str} -- path for the directory to be created
Raises:
OSError -- if failed to create the dir(s)
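Examples (a minimal sketch; the path shown is illustrative):
# create a local directory if it does not already exist
>>> s3.create_dir('local/path/to/new/dir')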
def parse_cloudpath(
self, path, set_bucket=False)
Parse the cloud path (s3) into bucket name and bucketpath.
Arguments:
path {str} -- cloud path with bucketname and path (Examples: s3://bucket/bucketpath/to/file, bucket/bucketpath/to/file, arn:aws:s3:::bucket/bucketpath/to/file for S3)
set_bucket {bool} -- set the parsed bucket on the cloud storage (default: {False})
Returns:
{tuple} -- a tuple of bucketname and bucketpath to file/directory
Examples:
# parse an s3 bucketpath into bucket and path
>>> bname, bpath = s3.parse_cloudpath('s3://bname/bpath')
>>> bname, bpath = s3.parse_cloudpath('bname/bpath')
>>> bname, bpath = s3.parse_cloudpath('arn:aws:s3:::bname/bpath')
# set the bucket from the parsed bucketpath
>>> bname, bpath = s3.parse_cloudpath('bname/bpath', set_bucket=True)
def list_buckets(
self)
List all the current buckets in the authenticated data storage.
Returns:
bucket_list {list} -- List of all bucket-names in the defined cloud storage
Example:
# list buckets from the connected storage
>>> s3.list_buckets()
def list_files(
self, path='', use_cwd=True, filename_pattern=None, recursive=False, full_name=True, list_dirs=False, list_objs=True, limit=None, obj=False, **kwargs)
List all the files in the supplied path on the data storage.
If recursive is False, then list only the "depth=0" items (dirs & objects)
If recursive is True, then list recursively all objects (no dirs).
Keyword Arguments:
path {str} -- path name (bucketpath) (default: {''}, searches in the working directory)
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, the full path is to be provided in the [path] arg (default: {True})
filename_pattern {str} -- pattern to be followed in listing files (default: {None})
recursive {bool} -- should sub-directories be listed as well? (default: {False})
full_name {bool} -- whether to return the full path to the files (default: {True})
list_dirs {bool} -- list the directories as well if true; only applies when recursive is False (default: {False})
list_objs {bool} -- list the objects as well if true, when recursive is False (default: {True})
limit {int} -- number of objects to be listed from the path (default: {None})
obj {bool} -- return the listed paths as S3Obj if true, else a list of filenames (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage from which the files are listed (default: {cls.bucket_name})
Returns:
if obj is True, returns a list of S3Obj, else a list of strings (filenames)
S3Obj {namedtuple} -- a namedtuple with key, modified time, file size and ETag for the s3 object
file_list {list} -- List of all the file paths retrieved
Examples:
# list all S3 objects under some dir, with the
# working directory (cwd) set for the storage
>>> flist = s3.list_files('relative/path/to/cwd', recursive=True)
# list all S3 objects under some dir
>>> flist = s3.list_files('some/dir', use_cwd=False)
# list all S3 objects under some dir when the bucket is not set
>>> flist = s3.list_files('some/dir', use_cwd=False, bucket='bucket-name')
# list all S3 objects under some dir recursively
# and return a list of S3Obj instead of just path names
>>> flist = s3.list_files('some/dir', use_cwd=False, recursive=True, obj=True)
# non-recursive listing under some dir:
>>> flist = s3.list_files('some/dir', use_cwd=False)
# non-recursive listing under some dir, listing only dirs:
>>> flist = s3.list_files('some/dir', use_cwd=False, recursive=False, list_objs=False)
# list files, limiting to the first 30 objects
>>> flist = s3.list_files('some/dir', use_cwd=False, recursive=False, limit=30)
# recursive listing under some dir with a pattern
>>> flist = s3.list_files('some/dir', use_cwd=False, recursive=True, filename_pattern='*file_pat*.csv')
# list files from a specific bucketpath recursively,
# returning just the filenames
>>> flist = s3.list_files('some/dir', use_cwd=False, recursive=True, full_name=False)
def file_exists(
self, path, use_cwd=False, **kwargs)
Check if a given file with path exists on data storage.
Arguments:
path {str} -- path to the file.
Keyword Arguments:
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, the full path is to be provided in the [path] arg (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage where the file is checked (default: {cls.bucket_name})
Returns:
{bool} -- true if exists else false
Examples:
# check if a file on bucket 'bname' exists
>>> s3.file_exists('some/path/to/file', bucket='bname')
# check if a file path relative to cwd exists
>>> s3.file_exists('relative/path/to/cwd')
def import_file(
self, bucketpath, localpath='>/', use_cwd=False, overwrite=True, delete_on_failure=True, awscli=False, **kwargs)
Download a file from cloud data storage to a local path. (defaults to tempdir)
Arguments:
bucketpath {str} -- full path to the file on the bucket to be downloaded from cloud storage (relative path to working dir on bucket, if use_cwd is true) (supply '' if bucketpath is same as cwd)
Keyword Arguments:
localpath {str} -- Local path where the file is to be downloaded (default: {'>/'}, the tempdir)
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, the full path is to be provided in the [bucketpath] arg (default: {False})
overwrite {bool} -- Overwrite an existing file on the local system (default: {True})
delete_on_failure {bool} -- Delete the partial file on download failure (default: {True})
awscli {bool} -- Set to True to download files in parallel; make sure awscli is set up (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage from where the file is to be imported (default: {cls.bucket_name})
Returns:
{bool} -- true if the import succeeds, else false
Examples:
# import a file from cloud to the local temp dir from bucket 'bname'
>>> s3.import_file('bpath/to/file', bucket='bname')
# import a file from cloud to a specific local dir using awscli
>>> s3.import_file('bpath/to/file', 'local/dir/path', awscli=True)
def import_files(
self, bucketpath_list, localpath='>/', use_cwd=False, overwrite=True, delete_on_failure=True, awscli=False, **kwargs)
Download a list of files from cloud data storage to a local path. (defaults to tempdir)
Arguments:
bucketpath_list {list} -- List of full path to the files on the bucket to be downloaded from cloud storage (relative paths to working dir on bucket, if use_cwd is true)
Keyword Arguments:
localpath {str} -- Local path where the files are downloaded (default: {'>/'}, the tempdir)
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, full paths are to be provided in the [bucketpath_list] arg (default: {False})
overwrite {bool} -- Overwrite existing files on the local system (default: {True})
delete_on_failure {bool} -- Delete partial files on download failure (default: {True})
awscli {bool} -- Set to True to download files in parallel; make sure awscli is set up (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage from where the files are to be imported (default: {cls.bucket_name})
Returns:
{bool} -- true if all imports succeed, else false
Examples:
# import a list of files from cloud to the local temp dir
# from bucket 'bname'
>>> s3.import_files(flist_bpaths, bucket='bname')
# import files from cloud to a specific local dir using awscli
>>> s3.import_files(flist_bpaths, 'local/dir/path', awscli=True)
def import_folder(
self, bucketpath, localpath='>/', use_cwd=False, recursive=True, pattern=None, overwrite=True, delete_on_failure=True, follow_folder_structure=False, awscli=False, **kwargs)
Download all the files in a bucket path either recursively or not, into the localpath. All the files can be downloaded to a single folder or it can follow the folder structure of the files as on the cloud storage.
Arguments:
bucketpath {str} -- full path to the folder on the bucket to be downloaded from cloud storage (relative path to working dir on bucket, if use_cwd is true) (supply '' if bucketpath is same as cwd)
Keyword Arguments:
localpath {str} -- Local folder path where the files from the folder on cloud storage are downloaded (default: {'>/'}, the tempdir)
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, the full path is to be provided in the [bucketpath] arg (default: {False})
recursive {bool} -- Should the files be downloaded recursively? (default: {True})
pattern {str} -- Filename pattern to be followed in listing files (default: {None})
overwrite {bool} -- Overwrite existing files on the local system (default: {True})
delete_on_failure {bool} -- Delete partial files on download failure (default: {True})
follow_folder_structure {bool} -- whether the folder structure of the files should be followed as on the bucketpath (default: {False})
awscli {bool} -- Set to True to download files in parallel; make sure awscli is set up (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage from where the files are to be imported (default: {cls.bucket_name})
Returns:
status {bool} -- true if all imports succeed, else false
Examples:
# import a folder from cloud to the local temp dir
# from bucket 'bname' recursively, without following
# the folder structure
>>> s3.import_folder('bpath/folder', bucket='bname')
# import a folder from cloud to a specific local dir using awscli,
# following the folder structure
>>> s3.import_folder('bpath/folder', 'local/dir', bucket='bname', follow_folder_structure=True, awscli=True)
# import a folder from cloud to a local dir non-recursively,
# with a filename pattern, using awscli
>>> s3.import_folder('bpath/folder', 'local/dir', bucket='bname', recursive=False, awscli=True, pattern='*.tif')
# import a folder from a path relative to cwd to a local dir
# recursively, following the folder structure
>>> s3.import_folder('relative/bpath/to/folder', 'local/dir', use_cwd=True, bucket='bname', follow_folder_structure=True)
def export_file(
self, localpath, bucketpath, filename=None, use_cwd=False, overwrite=False, awscli=False, **kwargs)
Upload a file from the localpath to the cloud data storage.
Arguments:
localpath {str} -- path to the file on local to be uploaded to cloud storage (supply '>/' at the start of the filename to export a file from the tempdir)
bucketpath {str} -- path (excluding filename) on the cloud storage where the file is uploaded (relative path to the working dir on the bucket, if use_cwd is true)
Keyword Arguments:
filename {str} -- filename to use if only a directory name is supplied in localpath (default: {None})
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, the full path is to be provided in the [bucketpath] arg (default: {False})
overwrite {bool} -- Overwrite an existing file on the cloud storage (default: {False})
awscli {bool} -- Set to True to upload files in parallel; make sure awscli is set up (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage to which the file is to be exported (default: {cls.bucket_name})
Returns:
{bool} -- true if the export succeeds, else false
Examples:
# export a file to cloud from the local temp dir to bucket 'bname'
>>> s3.export_file('localpath/to/file', 'bpath/folder', overwrite=True, bucket='bname')
>>> s3.export_file('localpath/dir/', 'relative/bpath/to/folder', filename=local_filename, use_cwd=True, overwrite=True, bucket='bname')
# export a file to cloud from a local path using awscli
>>> s3.export_file('localpath/to/file', 'bpath/folder', awscli=True, bucket='bname')
def export_files(
self, localpaths, bucketpath, use_cwd=False, overwrite=False, awscli=False, **kwargs)
Upload a list of files from the localpath to the cloud data storage.
Arguments:
localpaths {list} -- list of paths to the files on local to be uploaded to cloud storage (supply '>/' at the start of the filenames to export files from the tempdir)
bucketpath {str} -- path (excluding filename) on the cloud storage where the files are uploaded
Keyword Arguments:
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, the full path is to be provided in the [bucketpath] arg (default: {False})
overwrite {bool} -- Overwrite existing files on the cloud storage (default: {False})
awscli {bool} -- Set to True to upload files in parallel; make sure awscli is set up (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage to which the files are to be exported (default: {cls.bucket_name})
Returns:
{bool} -- true if all exports succeed, else false
Examples:
# export a list of files to cloud from the local temp dir
# to bucket 'bname'
>>> s3.export_files(flist_localpaths, 'bpath/folder', overwrite=True, bucket='bname')
# export a list of files to cloud from a local path using awscli
>>> s3.export_files(flist_localpaths, 'bpath/folder', awscli=True)
def export_folder(
self, localpath, bucketpath, recursive=False, pattern=None, use_cwd=False, overwrite=False, follow_folder_structure=False, awscli=False, **kwargs)
Export/upload all the files in a local path, either recursively or not, to the bucketpath, with the option to choose files using a pattern. All the files can be exported to a single path on the bucket, or they can follow the folder structure of the files as on the localpath.
Arguments:
localpath {str} -- Local folder path from where the files are to be exported to cloud storage (supply '>/' at the start of the filepath if the folder is in the tempdir)
bucketpath {str} -- full path to the folder on the bucket where the files are exported on cloud storage; if following the folder structure, include the destination folder name as well (relative path to the working dir on the bucket, if use_cwd is true)
Keyword Arguments:
recursive {bool} -- Should sub-directories be listed as well? (default: {False})
pattern {str} -- Filename pattern to be followed in listing files (default: {None})
use_cwd {bool} -- whether to use the set working dir for the bucket; if false, the full path is to be provided in the [bucketpath] arg (default: {False})
overwrite {bool} -- Overwrite existing files on the cloud storage (default: {False})
follow_folder_structure {bool} -- whether the folder structure of the files should be followed as on the localpath (default: {False})
awscli {bool} -- Set to True to upload files in parallel; make sure awscli is set up (default: {False})
**kwargs:
bucket {str} -- Name of the bucket on the cloud storage to which the files are to be exported (default: {cls.bucket_name})
Returns:
{bool} -- true if all exports succeed, else false
Examples:
# export a folder from a local dir to the cloud
# on bucket 'bname', without following
# the folder structure
>>> s3.export_folder('local/dir', 'bpath/folder', bucket='bname')
# export a folder from a local dir to the cloud using awscli,
# following the folder structure
>>> s3.export_folder('local/dir', 'bpath/folder', bucket='bname', follow_folder_structure=True, awscli=True)
# export a folder from a local dir to the cloud non-recursively,
# with a filename pattern, using awscli
>>> s3.export_folder('local/dir', 'bpath/folder', bucket='bname', recursive=False, awscli=True, pattern='*.tif')
# export a folder from a local dir to a path relative to cwd
# recursively, following the folder structure
>>> s3.export_folder('local/dir', 'relative/bpath/to/folder', use_cwd=True, bucket='bname', follow_folder_structure=True)
Instance Variables
var auth_status
Authentication status for the cloud storage
var bucket_name
Name of the bucket that's set as default on the authenticated cloud storage
var container
Libcloud container instance for the bucket from the authenticated cloud storage
var creds_default
Default credentials path for AWS S3.
var cwd
Default working directory on the bucketpath for the authenticated cloud storage
var local_temp
Path of local temp directory
var log
Whether logging to the console is enabled
var storage_type
Storage Name ('s3', 'gcs', 'local')
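A short end-to-end sketch combining the calls above (assumes the module is imported as `cloud`, valid awscli credentials, and that the bucket and paths shown exist; all names are illustrative):
# connect to S3, set a bucket and a working directory, with logging enabled
>>> s3 = cloud.DataSource('s3', bucket='bucket-name', cwd='path/on/bucket', log=True)
# list csv files under the working directory
>>> flist = s3.list_files('', recursive=True, filename_pattern='*.csv')
# download a file (relative to cwd) into the local tempdir
>>> s3.import_file('data/input.csv', use_cwd=True)
# upload a result from the tempdir back to a path relative to cwd
>>> s3.export_file('>/output.csv', 'results', use_cwd=True)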
class ReadCloudFiles
A subclass of DataSource for reading specific file types from cloud storage, including S3 and GCS.
Ancestors (in MRO)
- ReadCloudFiles
- DataSource
- builtins.object
Methods
def __init__(
self, name='local', set_default=True, log=False, **kwargs)
Inheritance:
DataSource.__init__
DataSource class initialization.
Initialize the DataSource with the base libcloud drivers for the cloud storage, handle authentication, set the bucket/container and create a local temp directory. Also sets a default environment variable for the storage if the argument is supplied.
Keyword Arguments:
name {str} -- Name of the datasource (default: {'local'})
set_default {bool} -- Set default env name for DataSource on system (default: {True})
log {bool} -- whether to log to the console (default: {False})
Arguments:
**kwargs:
auth_keys {str | tuple} -- authentication keys or env names of the keys, as a string or a tuple of strings (default: {''})
auth_profile {str} -- profile name of aws whose credentials are to be used (default: {'default'})
creds_path {str} -- path to the credentials file containing access id and keys for aws (default: {'~/.aws/credentials'})
bucket {str} -- bucket name on the cloud storage to be used (default: {None})
cwd {str} -- path on the cloud storage inside the specified bucket to be used as the cloud working directory (default: {None})
temp_path {str} -- path on the local system where a tempdir is created for the cloudDS instance (default: {'/tmp/'})
override {bool} -- overrides the existing temp directory path in the environment variable if true (default: {False})
clear_old {bool} -- when creating a new temp path using override, should the previous temp be deleted? (default: {False})
Raises:
IOError -- raised when the data storage cannot be set up
Examples:
(default init): params(defaults) -- set_default(True), log(False)
# set up DataSource with the storage type set as default on an env variable.
>>> s3 = cloud.DataSource('s3')
# set up DataSource w/o the storage type set as default on an env variable.
>>> s3 = cloud.DataSource('s3', set_default=False)
# set up DataSource with logs enabled
>>> s3 = cloud.DataSource('s3', log=True)
(authentication): params(defaults) -- auth_keys(''), auth_profile('default'), creds_path('~/.aws/credentials')
# set up DataSource with creds from the awscli config (default profile)
>>> s3 = cloud.DataSource('s3')
# set up DataSource with creds from the awscli config with a different profile.
>>> s3 = cloud.DataSource('s3', auth_profile='statlas')
# set up DataSource with creds from an awscli config file at a path
>>> s3 = cloud.DataSource('s3', creds_path='path/to/credentials/file')
>>> s3 = cloud.DataSource('s3', creds_path='path/to/credentials/file', auth_profile='statlas')
# set up DataSource with authentication keys
>>> s3 = cloud.DataSource('s3', auth_keys='ID, KEY')
>>> s3 = cloud.DataSource('s3', auth_keys=('ID', 'KEY'))
>>> s3 = cloud.DataSource('s3', auth_keys='ENV_VAR_ID, ENV_VAR_KEY')
>>> s3 = cloud.DataSource('s3', auth_keys=('ENV_VAR_ID', 'ENV_VAR_KEY'))
(set bucket and working directory): params(defaults) -- bucket(None), cwd(None)
# set up DataSource with a bucket
>>> s3 = cloud.DataSource('s3', bucket='bucket-name')
# set up DataSource with a working directory on the storage;
# along with cwd, the bucket argument is required.
>>> s3 = cloud.DataSource('s3', bucket='bucket-name', cwd='path/on/bucket')
(temp directory): params(defaults) -- temp_path('/tmp/'), override(False), clear_old(False)
# set up DataSource with the temp directory in the system temp path
>>> s3 = cloud.DataSource('s3')
# set up DataSource with the temp directory in a specific path
>>> s3 = cloud.DataSource('s3', temp_path='local/temp/path')
# reset the DataSource with the temp directory in a different path;
# sets up the new temp by overriding the env variable and clearing
# the old temp directory if it exists.
>>> s3 = cloud.DataSource('s3', temp_path='some/path', override=True, clear_old=True)
def read_raster(
self, filepath, parse_prefix=None, relpath=None, save=True, awscli=False, bucket=None, read_function=None, *args, **kwargs)
Imports the raster file from cloud path to local temp directory and reads the raster using supplied read function.
example functions: gdal.Open, rasterio.open, h5py.File, etc.
raster file types: all the drivers supported by the read function, and also zip, tar, gzip files if the read function supports them (gdal, rasterio, etc.)
Arguments:
filepath {str} -- full path to the raster file to be read, on the storage
*args {} -- any positional arguments that are to be supplied to the read function
Keyword Arguments:
parse_prefix {str} -- prefix to be used for files like zip, tar, gzip, supplied to the read function (examples: '/vsizip/', '/vsitar/', '/vsigzip/', 'zip://', 'tar://', 'gzip://') (default: {None})
relpath {str} -- relative path of the file inside a compressed file that is to be read (requires parse_prefix) (default: {None})
bucket {str} -- Name of the bucket on the cloud storage from where the file is to be imported (default: {None}, takes from cls.bucket_name)
save {bool} -- whether the file is to be saved after reading (for cloud storage) (default: {True})
awscli {bool} -- Set to True to import the file using awscli; make sure awscli is set up (default: {False})
read_function {function} -- function to be used to read the raster (default: {None})
**kwargs {} -- any keyword arguments that are to be supplied to the read function
Returns:
ds_raster {raster dataset} -- the raster dataset if the read succeeds, else None
Examples:
# read a raster file from cloud storage using a read function
>>> ras = s3.read_raster('bucket/path/to/raster.tif', read_function=rasterio.open)
# read a raster file from cloud storage and delete the file after
# loading it into a variable, with awscli
>>> ras = s3.read_raster('bucket/path/to/raster.tif', awscli=True, save=False, read_function=rasterio.open)
# read a compressed zip raster file from cloud storage.
# provide the relative path of the raster within the compressed file
# under the `relpath` variable (starts with `/`)
>>> ras = s3.read_raster('bucket/path/to/raster.tif.zip', parse_prefix='/vsizip/', relpath='/raster.tif', awscli=True, save=False, read_function=rasterio.open)
# read a compressed tar raster file from cloud storage.
# provide the relative path of the raster within the compressed file
# under the `relpath` variable (starts with `/`)
>>> ras = s3.read_raster('bucket/path/to/raster_group.tif.tar', parse_prefix='/vsitar/', relpath='/raster.tif', awscli=True, save=False, read_function=rasterio.open)
# read a compressed gzip raster file from cloud storage.
# provide the relative path of the raster within the compressed file
# under the `relpath` variable (starts with `/`)
>>> ras = s3.read_raster('bucket/path/to/raster_group.tif.gz', parse_prefix='/vsigzip/', relpath='/raster.tif', awscli=True, save=False, read_function=rasterio.open)
def read_vector(
self, filepath, parse_prefix=None, relpath=None, layername=None, vdriver='GPKG', save=True, awscli=False, bucket=None, read_function=None, *args, **kwargs)
Imports the vector and its dependencies from the cloud path to the local temp directory and reads the vector dataset using the supplied read function.
example functions: ogr.Open, geopandas.read_file, fiona.open, etc.
shapefile dependency file extensions: (dbf|prj|shp|shx|cpg|qpj|sbn)
also zip, gzip, tar files if using a supported function (ogr, fiona, geopandas, etc.)
Arguments:
filepath {str} -- path to the vector file on the cloud storage
*args {} -- any positional arguments that are to be supplied to the read function
Keyword Arguments:
parse_prefix {str} -- prefix to be used for files like zip, tar, gzip, supplied to the read function (examples: '/vsizip/', '/vsitar/', '/vsigzip/', 'zip://', 'tar://', 'gzip://') (default: {None})
relpath {str} -- relative path of the file inside a compressed file that is to be read (requires parse_prefix) (default: {None})
layername {str} -- layer name from the vector file to be read (default: {None}, taken from filename)
vdriver {str} -- vector driver of the input file (examples: 'GPKG', 'GeoJSON', 'Shapefile') (default: {'GPKG'})
bucket {str} -- Name of the bucket on the cloud storage from where the file is to be imported (default: {None}, takes from cls.bucket_name)
save {bool} -- whether the file(s) is/are to be saved after reading (for cloud storage) (default: {True})
awscli {bool} -- Set to True to import the file using awscli; make sure awscli is set up (default: {False})
read_function {function} -- function to be used to read the vector (default: {None})
**kwargs {} -- any keyword arguments that are to be supplied to the read function
Returns:
ds_vector {vector dataset} -- the vector dataset if the read succeeds, else None
Examples:
# read a vector file from cloud storage using a read function.
# `GPKG` is the vector driver by default.
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg', vdriver='GPKG', bucket='bucket-name', read_function=geopandas.read_file)
# read a particular layer from the vector dataset.
# reads the default layer if not provided. (driver = 'GPKG')
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg', vdriver='GPKG', layername='vector_layer', bucket='bucket-name', read_function=geopandas.read_file)
# supply kwargs to the read function (driver = 'GPKG');
# example kwarg `bbox` for geopandas
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg', vdriver='GPKG', bucket='bucket-name', read_function=geopandas.read_file, bbox=(west, south, east, north))
# read a vector file from cloud storage and delete the file after
# loading it into a variable, with awscli (driver = 'shapefile')
>>> vec = s3.read_vector('bucket/path/to/vector.shp', vdriver='shapefile', awscli=True, bucket='bucket-name', save=False, read_function=geopandas.read_file)
# driver = 'geojson'
>>> vec = s3.read_vector('bucket/path/to/vector.geojson', vdriver='geojson', awscli=True, bucket='bucket-name', save=False, read_function=geopandas.read_file)
# read a compressed zip vector file from cloud storage.
# provide the relative path of the vector within the compressed file
# under the `relpath` variable (starts with `/`)
>>> vec = s3.read_vector('bucket/path/to/vector.shp.zip', vdriver='shapefile', parse_prefix='/vsizip/', relpath='/vector.shp', awscli=True, save=False, bucket='bucket-name', read_function=ogr.Open)
# read a compressed tar vector file from cloud storage.
# provide the relative path of the vector within the compressed file
# under the `relpath` variable (starts with `/`)
>>> vec = s3.read_vector('bucket/path/to/vector.geojson.tar', vdriver='geojson', parse_prefix='/vsitar/', relpath='/vector.geojson', awscli=True, save=False, bucket='bucket-name', read_function=fiona.open)
# read a compressed gzip vector file from cloud storage.
# provide the relative path of the vector within the compressed file
# under the `relpath` variable (starts with `/`)
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg.gz', vdriver='gpkg', parse_prefix='/vsigzip/', relpath='/vector.gpkg', awscli=True, save=False, bucket='bucket-name', read_function=geopandas.read_file)
def read_table(
self, filepath, bucket=None, save=False, read_function=None, *args, **kwargs)
Imports the table from cloud path to local temp directory and reads the table/dataframe using supplied read function.
example functions: pandas.read_csv, xlrd.open_workbook, etc.
table file types: '.csv', '.xlsx', '.xls', '.txt', '.xml', '.odt', '.rtf'
Arguments:
filepath {str} -- full path to the table/file on cloud storage
*args {} -- any positional arguments that are to be supplied to the read function
Keyword Arguments:
bucket {str} -- Name of the bucket on the cloud storage from where the file is to be imported (default: {None}, takes from cls.bucket_name)
save {bool} -- whether the file(s) is/are to be saved after reading (for cloud storage) (default: {False})
read_function {function} -- function to be used to read the table (default: {None})
**kwargs {} -- any keyword arguments that are to be supplied to the read function
Returns:
table {table/dataframe} -- the table/dataframe if the read succeeds, else None
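Examples (a minimal sketch; the bucket paths and read-function kwargs shown are illustrative):
# read a csv table from cloud storage into a pandas DataFrame
>>> df = s3.read_table('bucket/path/to/table.csv', bucket='bucket-name', read_function=pandas.read_csv)
# pass keyword arguments through to the read function, e.g. a separator
>>> df = s3.read_table('bucket/path/to/table.txt', read_function=pandas.read_csv, sep='\t')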
def read_json(
self, filepath, bucket=None, save=False, read_function=None, *args, **kwargs)
Imports the json file from cloud path to local temp directory and reads the file using supplied read function.
example functions: json.load, geopandas.read_file, etc.
json file types: '.json', '.geojson'
Arguments:
filepath {str} -- full path to the json/geojson file on cloud storage
*args {} -- any positional arguments that are to be supplied to the read function
Keyword Arguments:
bucket {str} -- Name of the bucket on the cloud storage from where the file is to be imported (default: {None}, takes from cls.bucket_name)
save {bool} -- whether the file(s) is/are to be saved after reading (for cloud storage) (default: {False})
read_function {function} -- function to be used to read the json (default: {None})
**kwargs {} -- any keyword arguments that are to be supplied to the read function
Returns:
json_out {json object/geodataframe} -- the json object/geodataframe if the read succeeds, else None
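Examples (a minimal sketch; the paths shown are illustrative and the read functions are the ones listed above):
# read a json file from cloud storage
>>> data = s3.read_json('bucket/path/to/data.json', bucket='bucket-name', read_function=json.load)
# read a geojson file into a GeoDataFrame
>>> gdf = s3.read_json('bucket/path/to/data.geojson', bucket='bucket-name', read_function=geopandas.read_file)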