
cloudDS module

Title : Cloud data access from Python scripts.

Description : A Python class for accessing files on GCS, S3 or the local filesystem from Python scripts for data analysis. (Handles authenticating the data source, and retrieving and uploading files from/to buckets.)
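
Example usage (a minimal sketch, assuming the cloudDS module is imported as `cloud` as in the examples below; the bucket name and file paths are placeholders):

# connect to S3, pull a file into the temp dir, push a result back
>>> s3 = cloud.DataSource('s3', bucket='bucket-name')
>>> s3.import_file('path/on/bucket/data.csv')
>>> s3.export_file('>/result.csv', 'path/on/bucket')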

Functions

def get_logger()

Classes

class DataSource

A data source class that handles authentication, buckets and files.

Ancestors (in MRO)

Static methods

def __init__(

self, name='local', set_default=True, log=False, **kwargs)

Datasource class initialization

Initialize the Datasource with the base libcloud driver for the cloud storage, handle authentication, set the bucket/container and create a local temp directory. Also sets a default environment variable for the storage if the corresponding args are supplied.

Keyword Arguments:

name {str} -- Name of the datasource
                (default: {'local'})

set_default {bool} -- Set default env name for Datasource on system
                (default: {True})

log {bool} -- whether to use log on console
                (default: {False})

Arguments:

**kwargs:
    auth_keys {str | tuple} -- authentication keys or env names of
                        the keys as a string or a tuple of strings
                        (default: {''})

    auth_profile {str} -- profile name of aws whose credentials are
                        to be used
                        (default: {'default'})

    creds_path {str} -- path to the credentials file containing
                        access id and keys for aws
                        (default: {'~/.aws/credentials'})

    bucket {str}    -- bucket name on the cloud storage to be used
                        (default: {None})

    cwd {str}       -- path on the cloud storage inside the bucket
                        specified to be used as the cloud working
                        directory
                        (default: {None})

    temp_path {str} -- path on the local system where a tempdir is
                        created for the cloudDS instance.
                        (default: {'/tmp/'})

    override {bool} -- overrides existing temp directory path
                        in environment variable if true
                        (default: {False})

    clear_old {bool} -- when creating new temp path using
                    override, should the previous temp be deleted?
                    (default: {False})

Raises:

IOError -- In cases of failure to set up the data storage,
            the function raises an error

Examples:

(default init):
params(defaults) -- set_default(True), log(False)
# set up DataSource with storage type set as default on env variable.
>>> s3 = cloud.DataSource('s3')

# set up DataSource w/o storage type set as default on env variable.
>>> s3 = cloud.DataSource('s3', set_default=False)

# set up DataSource with logs enabled
>>> s3 = cloud.DataSource('s3', log=True)

(authentication):
params(defaults) -- auth_keys(''), auth_profile('default'),
                    creds_path('~/.aws/credentials')
# set up DataSource with creds from awscli config (default profile)
>>> s3 = cloud.DataSource('s3')

# set up DataSource with creds from awscli config with different profile.
>>> s3 = cloud.DataSource('s3', auth_profile='statlas')

# set up DataSource with creds from awscli config file from a path
>>> s3 = cloud.DataSource('s3', creds_path='path/to/credentials/file')
>>> s3 = cloud.DataSource('s3', creds_path='path/to/credentials/file',
                       auth_profile='statlas')

# set up DataSource with authentication keys
>>> s3 = cloud.DataSource('s3', auth_keys='ID, KEY')
>>> s3 = cloud.DataSource('s3', auth_keys=('ID', 'KEY'))
>>> s3 = cloud.DataSource('s3', auth_keys='ENV_VAR_ID, ENV_VAR_KEY')
>>> s3 = cloud.DataSource('s3', auth_keys=('ENV_VAR_ID', 'ENV_VAR_KEY'))

(set bucket and working directory):
params(defaults) -- bucket(None), cwd(None)
# set up DataSource with a bucket
>>> s3 = cloud.DataSource('s3', bucket='bucket-name')

# set up DataSource with working directory on storage
# along with cwd, bucket argument is required.
>>> s3 = cloud.DataSource('s3', bucket='bucket-name',
                        cwd='path/on/bucket')

(temp directory):
params(defaults) -- temp_path('/tmp/'), override(False),
                    clear_old(False)
# set up DataSource with temp directory in system temp path
>>> s3 = cloud.DataSource('s3')

# set up DataSource with temp directory in a specific path
>>> s3 = cloud.DataSource('s3', temp_path='local/temp/path')

# reset Datasource with temp directory with different path
# sets up new temp by overriding env variable and clearing
# old temp directory if exists.
>>> s3 = cloud.DataSource('s3', temp_path='some/path',
                        override=True, clear_old=True)

def authenticate(

self, auth_keys='', **kwargs)

Authenticate the cloud storage with authentication keys

By default, the function retrieves keys from the default profile of the credentials file from awscli path. Alternatively, a string of id and key or env variable names separated by a comma or as a tuple of strings can be passed. Also, a different profile or a path for the credentials can be provided.

Keyword Arguments:

auth_keys {str | tuple} -- Access Id and secret key
                        (cases: {
                                 'idx,key',
                                 ('idx', 'key'),
                                 'ENVID,ENVKEY',
                                 ('ENVID', 'ENVKEY')
                                })
                        (default: {''})

**kwargs:
    auth_profile {str} -- profile name of aws whose credentials are
                        to be used
                        (default: {'default'})

    creds_path {str} -- path to the credentials file containing
                        access id and keys for aws
                        (default: {'~/.aws/credentials'})

Returns:

{bool} -- true if authentication is success else false

Examples:

# authenticate storage with creds from awscli config
# (default profile)
>>> s3.authenticate()

# authenticate storage with creds from awscli config with
# different profile.
>>> s3.authenticate(auth_profile='statlas')

# authenticate storage with creds from awscli config file
# from a path
>>> s3.authenticate(creds_path='path/to/credentials/file')
>>> s3.authenticate(creds_path='path/to/credentials/file',
                    auth_profile='statlas')
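
A sketch of direct key-based authentication, following the auth_keys cases listed above ('ID'/'KEY' and the env variable names are placeholders):

# authenticate storage with keys, or with env variable names
# holding the keys
>>> s3.authenticate(auth_keys='ID, KEY')
>>> s3.authenticate(auth_keys=('ENV_VAR_ID', 'ENV_VAR_KEY'))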

def parse_aws_profile(

self, credentials_path, profile)

Parse the aws access id and key from the credentials file for the requested profile.

Arguments:

credentials_path {str} -- path to credentials file containing
                        aws access id and key

profile {str} -- name of the profile whose credentials are to
                be used

Returns:

{list} -- a list containing aws id and key.
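
Example:

A sketch (the credentials path and profile name are illustrative):

# parse the access id and key for a named profile
>>> aws_id, aws_key = s3.parse_aws_profile('~/.aws/credentials',
                                           'statlas')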

def set_bucket(

self, bucket_name, set_default=False)

Set a bucket from data storage as default and retrieve the container.

Arguments:

bucket_name {str} -- Name of the bucket/container on the cloud
                      storage

Keyword Arguments:

set_default {bool} -- Set an environment variable for the bucket
                      as default for data storage if true
                      (default: {False})

Returns:

{bool} -- true if bucket/container is retrieved else false

Examples:

# set bucket on connected storage
>>> s3.set_bucket('bucket-name')

# set bucket on connected storage w/ set default option
>>> s3.set_bucket('bucket-name', set_default=True)

def set_local_temp_path(

self, path='/tmp/', **kwargs)

Creates a local temp directory for the data storage for the current instance. The path where the directory is created can be specified.

By default, the dir is created in '/tmp/'

Arguments:

path {str} -- local temp path where the dir is created
                (default: {'/tmp/'})

**kwargs:

    override {bool} -- overrides existing temp directory path
                        in environment variable if true
                        (default: {False})

    clear_old {bool} -- when creating new temp path using
                        override, should the previous temp be deleted?
                        (default: {False})

Returns:

{bool} -- true if dir created else false

Examples:

# set/create temp directory in path from
# 'CLOUD_DIR' env variable if exists else system temp
>>> s3.set_local_temp_path()

# set/create temp directory in a specific path
# when 'CLOUD_DIR' env variable doesn't exist,
# else path is taken from env.
>>> s3.set_local_temp_path(path='local/temp/path')

# reset Datasource's temp directory with a different path.
# sets up a new temp by overriding the env variable, with the option
# to clear the old temp directory if it exists.
# the 'CLOUD_DIR' env variable is updated with the new path.
>>> s3.set_local_temp_path(path='some/path',
                           override=True, clear_old=True)

def set_wd(

self, path, **kwargs)

Set path on the bucket/container from data storage as the working directory/default.

Arguments:

path {str} -- Path on the local/cloud storage to be set as
                working directory

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

{bool} -- true if working directory is set else false

Examples:

# set working directory on connected storage
>>> s3.set_wd('bucketpath')

# set working directory on connected storage when bucket is not set
>>> s3.set_wd('bucketpath', bucket='bucket-name')

def clear_temp(

self)

Clear the temp directory by deleting all the files recursively.

Raises:

Exception -- if failed to clear the tempdir
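
Example:

A minimal sketch (assuming s3 is a DataSource instance with a temp directory set up):

# delete all files from the instance's local temp directory
>>> s3.clear_temp()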

def create_dir(

self, path)

Create directory(s) with checks.

Arguments:

path {str} -- path for the directory to be created

Raises:

OSError -- if failed to create the dir(s)
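
Example:

A minimal sketch (the directory path is illustrative):

# create the directory if it doesn't already exist
>>> s3.create_dir('local/output/dir')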

def parse_cloudpath(

self, path, set_bucket=False)

Parse the cloud path (s3) into bucket name and bucketpath.

Arguments:

path {str} -- cloud path with bucketname and path
                (Examples: s3://bucket/bucketpath/to/file,
                           bucket/bucketpath/to/file,
                           arn:aws:s3:::bucket/bucketpath/to/file
                           for S3)

set_bucket {bool} -- set parsed bucket on the cloud storage.
                (default: {False})

Returns:

{tuple} -- a tuple of bucketname and bucketpath to file/directory

Examples:

# parse s3 bucketpath into bucket and path
>>> bname, bpath = s3.parse_cloudpath('s3://bname/bpath')
>>> bname, bpath = s3.parse_cloudpath('bname/bpath')
>>> bname, bpath = s3.parse_cloudpath('arn:aws:s3:::bname/bpath')

# set the bucket from parsed bucketpath
>>> bname, bpath = s3.parse_cloudpath('bname/bpath', set_bucket=True)

def list_buckets(

self)

List all the current buckets in the authenticated data storage.

Returns:

bucket_list {list} -- List of all bucket-names in the defined
                        cloud storage

Example:

# list buckets from connected storage
>>> s3.list_buckets()

def list_files(

self, path='', use_cwd=True, filename_pattern=None, recursive=False, full_name=True, list_dirs=False, list_objs=True, limit=None, obj=False, **kwargs)

List all the files in the supplied path on the data storage.

If recursive is False, then list only the "depth=0" items (dirs & objects)

If recursive is True, then list recursively all objects (no dirs).

Keyword Arguments:

path {str}     -- path name (bucketpath)
                    (default: {''}, searches in working directory)

use_cwd {bool} -- Whether to use the set working dir for bucket
                    if false, full path is to be provided in
                    [path] arg
                    (default: {True})

filename_pattern {str} -- pattern to be followed in listing files
                    (default: {None})

recursive {bool} -- Should you list sub-directories as well?
                    (default: {False})

full_name {bool} -- whether to return full path to the files
                    (default: {True})

list_dirs {bool} -- list the directories as well if true
                    lists only when recursive is False
                    (default: {False})

list_objs {bool} -- list the objects as well if true,
                    when recursive is False
                    (default: {True})

limit {int}      -- number of objects to be listed from the path
                    (default: {None})

obj {bool}       -- return the listed paths as S3Obj if true else
                    list of filenames
                    (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

if obj is True, returns a list of S3Obj, else a list of strings
(filenames)

S3Obj -- a namedtuple with key, modified time, file size
                    and ETag for the s3 object.

file_list {list} -- List of all the file paths retrieved

Examples:

# list through all S3 objects under some dir with
# working directory (cwd) set for the storage
>>> flist = s3.list_files('relative/path/to/cwd', recursive=True)

# list through all S3 objects under some dir
>>> flist = s3.list_files('some/dir', use_cwd=False)

# list through all S3 objects under some dir when bucket is not set
>>> flist = s3.list_files('some/dir', use_cwd=False,
                          bucket='bucket-name')

# list through all S3 objects under some dir recursively
# and return list of S3Obj instead of just path names
>>> flist = s3.list_files('some/dir', use_cwd=False,
                          recursive=True, obj=True)

# non-recursive listing under some dir:
>>> flist = s3.list_files('some/dir', use_cwd=False)

# non-recursive listing under some dir, listing only dirs:
>>> flist = s3.list_files('some/dir', use_cwd=False,
                          recursive=False, list_objs=False)

# list files limiting only to first 30 objects
>>> flist = s3.list_files('some/dir', use_cwd=False,
                          recursive=False, limit=30)

# recursive listing under some dir with a pattern.
>>> flist = s3.list_files('some/dir', use_cwd=False,
                          recursive=True,
                          filename_pattern='*file_pat*.csv')

# list files from specific bucketpath recursively,
# just returning filenames.
>>> flist = s3.list_files('some/dir', use_cwd=False,
                          recursive=True, full_name=False)

def file_exists(

self, path, use_cwd=False, **kwargs)

Check if a given file with path exists on data storage.

Arguments:

path {str} -- path to the file.

Keyword Arguments:

use_cwd {bool} -- Whether to use the set working dir for bucket
                if false, full path is to be provided in
                [path] arg
                (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

{bool} -- true if exists else false

Examples:

# check if file on bucket 'bname' exists.
>>> s3.file_exists('some/path/to/file', bucket='bname')

# check if a file path relative to cwd exists.
>>> s3.file_exists('relative/path/to/cwd')

def import_file(

self, bucketpath, localpath='>/', use_cwd=False, overwrite=True, delete_on_failure=True, awscli=False, **kwargs)

Download a file from cloud data storage to a local path. (defaults to tempdir)

Arguments:

bucketpath {str} -- full path to the file on the bucket to be
                    downloaded from cloud storage
                    (relative path to working dir on bucket,
                    if use_cwd is true)
                    (supply '' if bucketpath is same as cwd)

Keyword Arguments:

localpath {str} -- Local path where the file is to be downloaded
                    (default: {'>/'}, to the tempdir)

use_cwd {bool} -- Whether to use the set working dir for bucket
                    if false, full path is to be provided in
                    [bucketpath] arg
                    (default: {False})

overwrite {bool} -- Overwrite existing file on the local system
                    (default: {True})

delete_on_failure {bool} -- Delete partial file on download failure
                    (default: {True})

awscli {bool} -- Set as True to download files in parallel, make sure
                awscli is set up.
                (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

{bool} -- true if import success else false

Examples:

# import a file from cloud to local temp dir from bucket 'bname'
>>> s3.import_file('bpath/to/file', bucket='bname')

# import a file from cloud to specific local dir using awscli
>>> s3.import_file('bpath/to/file', 'local/dir/path', awscli=True)

def import_files(

self, bucketpath_list, localpath='>/', use_cwd=False, overwrite=True, delete_on_failure=True, awscli=False, **kwargs)

Download a list of files from cloud data storage to a local path. (defaults to tempdir)

Arguments:

bucketpath_list {list} -- List of full path to the files on the
                    bucket to be downloaded from cloud storage
                    (relative paths to working dir on bucket,
                    if use_cwd is true)

Keyword Arguments:

localpath {str} -- Local path where the files are downloaded.
                    (default: {'>/'}, to the tempdir)

use_cwd {bool} -- Whether to use the set working dir for bucket
                    if false, full path is to be provided in
                    [bucketpath_list] arg
                    (default: {False})

overwrite {bool} -- Overwrite existing file on the local system
                    (default: {True})

delete_on_failure {bool} -- Delete partial file on download failure
                    (default: {True})

awscli {bool} -- Set as True to download files in parallel, make sure
                awscli is set up.
                (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

{bool} -- true if all imports success else false

Examples:

# import a list of files from cloud to local temp dir
# from bucket 'bname'
>>> s3.import_files(flist_bpaths, bucket='bname')

# import a list of files from cloud to a specific local dir using awscli
>>> s3.import_files(flist_bpaths, 'local/dir/path', awscli=True)

def import_folder(

self, bucketpath, localpath='>/', use_cwd=False, recursive=True, pattern=None, overwrite=True, delete_on_failure=True, follow_folder_structure=False, awscli=False, **kwargs)

Download all the files in a bucket path either recursively or not, into the localpath. All the files can be downloaded to a single folder or it can follow the folder structure of the files as on the cloud storage.

Arguments:

bucketpath {str} -- full path to the folder on the bucket to be
                    downloaded from cloud storage
                    (relative path to working dir on bucket,
                    if use_cwd is true)
                    (supply '' if bucketpath is same as cwd)

Keyword Arguments:

localpath {str} --  Local folder path where the files from the
                    folder on cloud storage are downloaded.
                    (default: {'>/'}, to the tempdir)

use_cwd {bool} -- Whether to use the set working dir for bucket
                    if false, full path is to be provided in
                    [bucketpath] arg
                    (default: {False})

recursive {bool} -- Should the files be downloaded recursively?
                    (default: {True})

pattern {str} -- Filename pattern to be followed in listing files
                    (default: {None})

overwrite {bool} -- Overwrite existing file on the local system
                    (default: {True})

delete_on_failure {bool} -- Delete partial file on download failure
                    (default: {True})

follow_folder_structure {bool} -- if the folder structure of the
                    files should be followed as on the bucketpath
                    (default: {False})

awscli {bool} -- Set as True to download files in parallel, make sure
                awscli is set up.
                (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

status {bool} - true if all imports success else false

Examples:

# import a folder from cloud to local temp dir
# from bucket 'bname' recursively, without following
# folder structure
>>> s3.import_folder('bpath/folder', bucket='bname')

# import a folder from cloud to specific local dir using awscli
# with following folder structure
>>> s3.import_folder('bpath/folder', 'local/dir', bucket='bname',
                     follow_folder_structure=True, awscli=True)

# import a folder from cloud to a specific local dir non-recursively,
# with a filename pattern using awscli
>>> s3.import_folder('bpath/folder', 'local/dir', bucket='bname',
                     recursive=False, awscli=True, pattern='*.tif')

# import a folder from a bucketpath relative to cwd to a specific
# local dir recursively, following the folder structure
>>> s3.import_folder('relative/bpath/to/folder', 'local/dir',
                     use_cwd=True, bucket='bname',
                     follow_folder_structure=True)

def export_file(

self, localpath, bucketpath, filename=None, use_cwd=False, overwrite=False, awscli=False, **kwargs)

Upload a file from the localpath to the cloud data storage.

Arguments:

localpath {str} -- path to the file on local to be uploaded to
                    cloud storage
                    (supply '>/' at the start of filename,
                    to export a file from the tempdir)

bucketpath {str} -- path (excluding filename) on the cloud storage
                    where the file is uploaded.
                    (relative path to working dir on bucket,
                    if use_cwd is true)

Keyword Arguments:

filename {str} -- if only directory name is supplied with localpath
                    (default: {None})

use_cwd {bool} -- Whether to use the set working dir for bucket
                    if false, full path is to be provided in
                    [bucketpath] arg
                    (default: {False})

overwrite {bool} -- Overwrite existing file on the cloud storage
                    (default: {False})

awscli {bool} -- Set as True to upload files in parallel, make sure
                awscli is set up.
                (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

{bool} -- true if export success else false

Examples:

# export a file from a local path to the cloud bucket 'bname'
>>> s3.export_file('localpath/to/file', 'bpath/folder',
                   overwrite=True, bucket='bname')
>>> s3.export_file('localpath/dir/', 'relative/bpath/to/folder',
                   filename=local_filename, use_cwd=True,
                   overwrite=True, bucket='bname')

# export a file to cloud from local path using awscli
>>> s3.export_file('localpath/to/file', 'bpath/folder',
                   awscli=True, bucket='bname')

def export_files(

self, localpaths, bucketpath, use_cwd=False, overwrite=False, awscli=False, **kwargs)

Upload a list of files from the localpath to the cloud data storage.

Arguments:

localpaths {list} -- list of paths to the files on local to be
                    uploaded to cloud storage
                    (supply '>/' at the start of filenames,
                    to export a file from the tempdir)

bucketpath {str} -- path (excluding filename) on the cloud storage
                    where the file is uploaded.

Keyword Arguments:

use_cwd {bool} -- Whether to use the set working dir for bucket
                    if false, full path is to be provided in
                    [bucketpath] arg
                    (default: {False})

overwrite {bool} -- Overwrite existing file on the cloud storage
                    (default: {False})

awscli {bool} -- Set as True to upload files in parallel, make sure
                awscli is set up.
                (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

{bool} -- true if all exports success else false

Examples:

# export a list of files to cloud from local temp dir
# from bucket 'bname'
>>> s3.export_files(flist_localpaths, 'bpath/folder',
                   overwrite=True, bucket='bname')

# export a list of files to cloud from local path using awscli
>>> s3.export_files(flist_localpaths, 'bpath/folder',
                    awscli=True)

def export_folder(

self, localpath, bucketpath, recursive=False, pattern=None, use_cwd=False, overwrite=False, follow_folder_structure=False, awscli=False, **kwargs)

Export/upload all the files in a local path, either recursively or not, into the bucketpath, with the option to choose files using a pattern. All the files can be exported to a single path on the bucket, or they can follow the folder structure of the files as on the localpath.

Arguments:

localpath {str}  -- Local folder path from where the files are to
                    be exported to cloud storage.
                    (supply '>/' at the start of filepath,
                    if folder is in the tempdir)

bucketpath {str} -- full path to the folder on the bucket where
                    the files are exported on cloud storage
                    if following the folder structure,
                    include the destination folder name as well
                    (relative path to working dir on bucket,
                    if use_cwd is true)

Keyword Arguments:

recursive {bool} -- Should you list sub-directories as well?
                    (default: {False})

pattern {str} -- Filename pattern to be followed in listing files
                    (default: {None})

use_cwd {bool} -- Whether to use the set working dir for bucket
                    if false, full path is to be provided in
                    [bucketpath] arg
                    (default: {False})

overwrite {bool} -- Overwrite existing file on the cloud storage
                    (default: {False})

follow_folder_structure {bool} -- if the folder structure of the
                    files should be followed as on the localpath
                    (default: {False})

awscli {bool} -- Set as True to upload files in parallel, make sure
                awscli is set up.
                (default: {False})

**kwargs:

    bucket {str} -- Name of the bucket on the cloud storage from
                    where the file is to be imported
                    (default: {cls.bucket_name})

Returns:

{bool} -- true if all exports success else false

Examples:

# export a local folder to bucket 'bname' on the cloud,
# without following the folder structure
>>> s3.export_folder('local/dir', 'bpath/folder', bucket='bname')

# export a local folder to the cloud using awscli,
# following the folder structure
>>> s3.export_folder('local/dir', 'bpath/folder', bucket='bname',
                     follow_folder_structure=True, awscli=True)

# export a local folder to the cloud non-recursively,
# with a filename pattern, using awscli
>>> s3.export_folder('local/dir', 'bpath/folder', bucket='bname',
                     recursive=False, awscli=True, pattern='*.tif')

# export a local folder to a bucketpath relative to cwd,
# following the folder structure
>>> s3.export_folder('local/dir', 'relative/bpath/to/folder',
                     use_cwd=True, bucket='bname',
                     follow_folder_structure=True)

Instance Variables

var auth_status

Authentication status for the cloud storage

var bucket_name

Name of the bucket that's set as default on the authenticated cloud storage

var container

Libcloud container instance for the bucket from the authenticated cloud storage

var creds_default

Default credentials path for AWS S3.

var cwd

Default working directory on the bucketpath for the authenticated cloud storage

var local_temp

Path of local temp directory

var log

Whether log messages are printed to the console

var storage_type

Storage Name ('s3', 'gcs', 'local')

class ReadCloudFiles

A subclass of DataSource for reading specific file types (rasters, vectors, tables, JSON) from cloud storage, including S3 and GCS.

Ancestors (in MRO)

DataSource

Static methods

def __init__(

self, name='local', set_default=True, log=False, **kwargs)

Inheritance: DataSource.__init__

Datasource class initialization. See DataSource.__init__ above for the full description, keyword arguments and examples.

def read_raster(

self, filepath, parse_prefix=None, relpath=None, save=True, awscli=False, bucket=None, read_function=None, *args, **kwargs)

Imports the raster file from the cloud path to the local temp directory and reads the raster using the supplied read function.

Example functions: gdal.Open, rasterio.open, h5py.File, etc. Raster file types: all the drivers supported by the read function, and also zip, tar and gzip files if the read function supports them (gdal, rasterio, etc.)

Arguments:

filepath {str} -- full path to the raster file to be read,
                on the storage

*args {} -- any positional arguments that are to be supplied to
                the read function

Keyword Arguments:

parse_prefix {str} -- prefix to be used for files like zip,
                tar, gzip to be supplied to the read function.
                (examples: '/vsizip/', '/vsitar/', '/vsigzip/',
                           'zip://', 'tar://', 'gzip://')
                (default: {None})

relpath {str} -- relative path of the file inside a compressed
                file that is to be read. (requires parse_prefix)
                (default: {None})

bucket {str} -- Name of the bucket on the cloud storage from
                where the file is to be imported
                (default: {None}, takes from cls.bucket_name)

save {bool} -- if the file is to be saved after reading the file
                (for cloud storage)
                (default: {True})

awscli {bool} -- Set as True to import the file with awscli, make sure
                awscli is set up.
                (default: {False})

read_function {function} -- function to be used to read the raster
                (default: {None})

**kwargs {} -- any keyword arguments that are to be supplied to
                the read function

Returns:

ds_raster {raster dataset} -- if read success else None

Examples:

# read a raster file from cloud storage using a read function.
>>> ras = s3.read_raster('bucket/path/to/raster.tif',
                         read_function=rasterio.open)

# read a raster file from cloud storage and delete the file after
# loading into a variable, with awscli
>>> ras = s3.read_raster('bucket/path/to/raster.tif', awscli=True,
                         save=False, read_function=rasterio.open)

# read a compressed zip raster file from cloud storage.
# provide the relative path of the raster within the compressed file
# under `relpath`. (starts with `/`)
>>> ras = s3.read_raster('bucket/path/to/raster.tif.zip',
                         parse_prefix='/vsizip/',
                         relpath='/raster.tif',
                         awscli=True, save=False,
                         read_function=rasterio.open)

# read a compressed tar raster file from cloud storage.
# provide the relative path of the raster within the compressed file
# under `relpath`. (starts with `/`)
>>> ras = s3.read_raster('bucket/path/to/raster_group.tif.tar',
                         parse_prefix='/vsitar/',
                         relpath='/raster.tif',
                         awscli=True, save=False,
                         read_function=rasterio.open)

# read a compressed gzip raster file from cloud storage.
# provide the relative path of the raster within the compressed file
# under `relpath`. (starts with `/`)
>>> ras = s3.read_raster('bucket/path/to/raster_group.tif.gz',
                         parse_prefix='/vsigzip/',
                         relpath='/raster.tif',
                         awscli=True, save=False,
                         read_function=rasterio.open)

def read_vector(

self, filepath, parse_prefix=None, relpath=None, layername=None, vdriver='GPKG', save=True, awscli=False, bucket=None, read_function=None, *args, **kwargs)

Imports the vector and its dependencies from the cloud path to the local temp directory and reads the vector dataset using the supplied read function.

Example functions: ogr.Open, geopandas.read_file, fiona.open, etc. Shapefile dependency file extensions: (dbf|prj|shp|shx|cpg|qpj|sbn). Also zip, gzip and tar files if using a supported function (ogr, fiona, geopandas, etc.)

Arguments:

filepath {str} -- path to the vector file on the cloud storage

*args {} -- any positional arguments that are to be supplied to
                the read function

Keyword Arguments:

parse_prefix {str} -- prefix to be used for files like zip,
                tar, gzip to be supplied to the read function.
                (examples: '/vsizip/', '/vsitar/', '/vsigzip/',
                           'zip://', 'tar://', 'gzip://')
                (default: {None})

relpath {str} -- relative path of the file inside a compressed
                file that is to be read. (requires parse_prefix)
                (default: {None})

layername {str} -- layername from the vector file to be read
                (default: {None}, taken from filename)

vdriver {str} -- vector driver of the input file.
                (examples: 'GPKG', 'GeoJSON', 'Shapefile')
                (default: {'GPKG'})

bucket {str} -- Name of the bucket on the cloud storage from
                where the file is to be imported
                (default: {None}, takes from cls.bucket_name)

save {bool} -- if the file(s) is/are to be saved after reading the
                file (for cloud storage)
                (default: {True})

awscli {bool} -- Set as True to import the file with awscli, make sure
                awscli is set up.
                (default: {False})

read_function {function} -- function to be used to read the vector
                (default: {None})

**kwargs {} -- any keyword arguments that are to be supplied to
                the read function

Returns:

ds_vector {vector dataset} -- if read success else None

Examples:

# read a vector file from cloud storage using a read function.
# `GPKG` is the vector driver by default.
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg',
                         vdriver='GPKG',
                         bucket='bucket-name',
                         read_function=geopandas.read_file)

# driver = 'GPKG'
# read a particular layer from the vector dataset.
# reads the default layer if not provided.
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg',
                         vdriver='GPKG',
                         layername='vector_layer',
                         bucket='bucket-name',
                         read_function=geopandas.read_file)

# supply kwargs to the read function
# driver = 'GPKG'
# example kwarg `bbox` for geopandas
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg',
                         vdriver='GPKG',
                         bucket='bucket-name',
                         read_function=geopandas.read_file,
                         bbox=(west, south, east, north))

# read a vector file from cloud storage and delete the file after
# loading into a variable, with awscli
# driver = 'shapefile'
>>> vec = s3.read_vector('bucket/path/to/vector.shp',
                         vdriver='shapefile', awscli=True,
                         bucket='bucket-name', save=False,
                         read_function=geopandas.read_file)

# driver = 'geojson'
>>> vec = s3.read_vector('bucket/path/to/vector.geojson',
                         vdriver='geojson', awscli=True,
                         bucket='bucket-name', save=False,
                         read_function=geopandas.read_file)

# read a compressed zip vector file from cloud storage.
# provide the relative path of the vector within the compressed file
# under `relpath`. (starts with `/`)
>>> vec = s3.read_vector('bucket/path/to/vector.shp.zip',
                         vdriver='shapefile',
                         parse_prefix='/vsizip/',
                         relpath='/vector.shp',
                         awscli=True, save=False,
                         bucket='bucket-name',
                         read_function=ogr.Open)

# read a compressed tar vector file from cloud storage.
# provide the relative path of the vector within the compressed file
# under `relpath`. (starts with `/`)
>>> vec = s3.read_vector('bucket/path/to/vector.geojson.tar',
                         vdriver='geojson',
                         parse_prefix='/vsitar/',
                         relpath='/vector.geojson',
                         awscli=True, save=False,
                         bucket='bucket-name',
                         read_function=fiona.open)

# read a compressed gzip vector file from cloud storage.
# provide the relative path of the vector within the compressed file
# under `relpath`. (starts with `/`)
>>> vec = s3.read_vector('bucket/path/to/vector.gpkg.gz',
                         vdriver='gpkg',
                         parse_prefix='/vsigzip/',
                         relpath='/vector.gpkg',
                         awscli=True, save=False,
                         bucket='bucket-name',
                         read_function=geopandas.read_file)

def read_table(

self, filepath, bucket=None, save=False, read_function=None, *args, **kwargs)

Imports the table from the cloud path to the local temp directory and reads the table/dataframe using the supplied read function.

Example functions: pandas.read_csv, xlrd.open_workbook, etc. Table file types: '.csv', '.xlsx', '.xls', '.txt', '.xml', '.odt', '.rtf'

Arguments:

filepath {str} -- full path to table/file on cloud storage.

*args {} -- any positional arguments that are to be supplied to
                the read function

Keyword Arguments:

bucket {str} -- Name of the bucket on the cloud storage from
                where the file is to be imported
                (default: {None}, takes from cls.bucket_name)

save {bool} -- if the file(s) is/are to be saved after reading the
                file (for cloud storage)
                (default: {False})

read_function {function} -- function to be used to read the table
                (default: {None})

**kwargs {} -- any keyword arguments that are to be supplied to
                the read function

Returns:

table {table/dataframe} -- if read success else None
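
Examples:

A minimal sketch following the example functions named above (assuming pandas is imported; the bucket path and name are placeholders):

# read a csv table from cloud storage into a dataframe
>>> df = s3.read_table('bucket/path/to/table.csv',
                       bucket='bucket-name',
                       read_function=pandas.read_csv)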

def read_json(

self, filepath, bucket=None, save=False, read_function=None, *args, **kwargs)

Imports the json file from the cloud path to the local temp directory and reads the file using the supplied read function.

Example functions: json.load, geopandas.read_file, etc. JSON file types: '.json', '.geojson'

Arguments:

filepath {str} -- full path to json/geojson file on cloud storage.

*args {} -- any positional arguments that are to be supplied to
                the read function

Keyword Arguments:

bucket {str} -- Name of the bucket on the cloud storage from
                where the file is to be imported
                (default: {None}, takes from cls.bucket_name)

save {bool} -- if the file(s) is/are to be saved after reading the
                file (for cloud storage)
                (default: {False})

read_function {function} -- function to be used to read the json
                (default: {None})

**kwargs {} -- any keyword arguments that are to be supplied to
                the read function

Returns:

json_out {json object/geodataframe} -- if read success else None
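
Examples:

A minimal sketch following the example functions named above (assuming the json module is imported; the bucket path and name are placeholders):

# read a json file from cloud storage
>>> data = s3.read_json('bucket/path/to/data.json',
                        bucket='bucket-name',
                        read_function=json.load)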