Utils
IBM PAIRS Utilities: A collection of tools complementing the RESTful API wrapper.
Copyright 2019-2021 Physical Analytics, IBM Research. All Rights Reserved.
SPDX-License-Identifier: BSD-3-Clause
- class utils.PAIRSProject(queryList, auth=None, downloadDir='./downloads', overwriteExisting=False, maxConcurrent=2, logEverySeconds=30)
Utility class to submit a large number of queries to IBM PAIRS. The class leverages ibmpairs.paw.PAIRSQuery and maintains a local queue.
Usage
>>> project = PAIRSProject(queryList)
>>> project.submitAllQueued()
Queries contained in queryList can be either query JSONs or paw.PAIRSQuery objects. In the latter case, only queries that have not yet been submitted will be submitted when calling submitAllQueued. Once queries have completed processing, the downloaded data can be found in the directory indicated by downloadDir. (There can be multiple download directories if queryList contains paw.PAIRSQuery objects.)
While queries are running, the class gives a periodic status update via Python’s logging module. Note that this happens at the logging.INFO level. A rudimentary setup for this would be as follows:
>>> import logging
>>> logging.basicConfig(level = <log-level required for your application>)
>>> pawLogger = logging.getLogger('ibmpairs.paw')
>>> pawLogger.setLevel(logging.ERROR)
>>> pairsUtilsLogger = logging.getLogger('ibmpairs.utils')
>>> pairsUtilsLogger.setLevel(logging.INFO)
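The effect of the setup above can be checked with a small, self-contained sketch. The wrapper function `configure_pairs_logging` and its default `app_level` are hypothetical and only serve the illustration; the logger names are the ones used above.

```python
import logging


def configure_pairs_logging(app_level=logging.WARNING):
    """Quiet ibmpairs.paw while keeping PAIRSProject status updates visible."""
    # Root configuration for the hosting application
    # (app_level is a hypothetical default, pick what your application needs).
    logging.basicConfig(level=app_level)
    # Let only errors through from the API wrapper.
    logging.getLogger('ibmpairs.paw').setLevel(logging.ERROR)
    # Status updates are emitted at INFO, so this logger must allow INFO.
    utils_logger = logging.getLogger('ibmpairs.utils')
    utils_logger.setLevel(logging.INFO)
    return utils_logger


utils_logger = configure_pairs_logging()
paw_logger = logging.getLogger('ibmpairs.paw')

# INFO messages pass for ibmpairs.utils but are filtered for ibmpairs.paw.
print(utils_logger.isEnabledFor(logging.INFO))
print(paw_logger.isEnabledFor(logging.INFO))
```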
The class stores queries in 4 queues, accessible as
>>> project.queries['queued']
>>> project.queries['running']
>>> project.queries['completed']
>>> project.queries['failed']
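As a rough sketch, the four queues could be summarized with a small helper. This assumes that `project.queries` behaves like a dictionary mapping queue names to sized collections, which is not confirmed here; `summarize_queues` is a hypothetical helper and the example uses a plain dictionary as a stand-in.

```python
def summarize_queues(queries):
    """Return the number of entries in each of the four queues."""
    names = ('queued', 'running', 'completed', 'failed')
    # A missing queue name is counted as empty.
    return {name: len(queries.get(name, [])) for name in names}


# Stand-in for project.queries; the string entries are placeholders,
# not real queries.
example_queries = {
    'queued': ['q3', 'q4'],
    'running': ['q2'],
    'completed': ['q1'],
    'failed': [],
}
summary = summarize_queues(example_queries)
print(summary)
```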
One can obtain a list of all query JSONs in one particular queue by calling getQueryJSONs('<queue name>'). The following is then feasible:
>>> import json
>>> completedQueries = oldProject.getQueryJSONs('completed')
>>> with open('completedQueries.json', 'w') as fp:
...     json.dump(completedQueries, fp)
>>> # ... some other code ...
>>> with open('completedQueries.json', 'r') as fp:
...     recoveredQueries = json.load(fp)
>>> newProject = PAIRSProject(recoveredQueries)
The properties of the paw library make it quite simple to work with completed queries even if the program hosting the PAIRSProject object has been terminated. Assume the data of completed queries is stored in <downloads/> (typically the value of downloadDir). Then the following builds an index of what is in that directory:
>>> from glob import glob
>>> from ibmpairs import paw
>>> zippedQueries = glob('downloads/*.zip')
>>> queries = [paw.PAIRSQuery(z) for z in zippedQueries]
>>> for q in queries:
...     q.list_layers()
Crucially, the list_layers function here parses the contents of a query without loading the data into memory. (This is in contrast to create_layers.)
- Parameters:
queryList (list) – list containing a mix of PAIRS query JSONs and paw.PAIRSQuery objects. For paw.PAIRSQuery objects, only those which have not been submitted yet will be submitted.
auth (str, str) – user name and password as a tuple for access to pairsHost
overwriteExisting (bool) – whether to destroy existing locally cached data; if not, the latest locally cached data is used, where "latest" is defined by alphanumerical ordering of the PAIRS query ID
downloadDir (str) – directory in which to store downloaded data
maxConcurrent (int) – maximum number of concurrent queries. Note that the maximum number of concurrent queries might be limited server side for a particular user. There is no guarantee that a user can submit maxConcurrent queries at a given time.
logEverySeconds (int) – time interval in seconds at which the class sends status messages to its logger (via logging.INFO)
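To illustrate what "latest by alphanumerical ordering" means for overwriteExisting, here is a minimal sketch based on plain string ordering. It is not the library's actual selection code; the helper `latest_cached` and the file names (query IDs with a `.zip` suffix) are hypothetical.

```python
def latest_cached(file_names):
    """Pick the entry that sorts last alphanumerically, or None if empty."""
    # Plain string comparison: characters are compared left to right.
    return sorted(file_names)[-1] if file_names else None


# Hypothetical cached files named after made-up PAIRS query IDs.
cached = [
    '1577836800_00000002.zip',
    '1580515200_00000001.zip',
    '1577836800_00000001.zip',
]
latest = latest_cached(cached)
print(latest)
```

Note that string ordering matches chronological ordering only if the IDs have a fixed width, which is an assumption of this sketch.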
- getQueryJSONs(status)
Returns all query JSONs in the queue self.queries[status].
- Parameters:
status (string) – indicates queue from which query JSONs should be returned.
- Returns:
list of PAIRS query JSONs
- Return type:
list
- submitAllQueued(cosInfoJSON=None, printStatus=False)
Submits all queries in the local queue. Ensures that there are always maxConcurrent queries running. (Note that the maximum number of concurrent queries might be limited server side for a particular user. There is no guarantee that a user can submit maxConcurrent queries at a given time.)
- Parameters:
cosInfoJSON (dict) –
IBM PAIRS with Cloud Object Storage bucket information like
```JSON
{
    "provider": "ibm",
    "endpoint": "https://s3.us.cloud-object-storage.appdomain.cloud",
    "bucket": "<your bucket name>",
    "token": "<your secret token for bucket>"
}
```
If set, the query result is published in the cloud and not stored locally on your machine. This is a useful feature in combination with IBM Watson Studio notebooks.
printStatus (bool) – triggers printing the poll status information of downloading a query
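The idea of keeping up to maxConcurrent queries running from a local queue can be sketched as a small simulation. This is not the implementation of submitAllQueued; `run_with_limit` and the polling callback `is_done` are hypothetical stand-ins, and no PAIRS server is contacted.

```python
from collections import deque


def run_with_limit(jobs, max_concurrent, is_done):
    """Keep at most max_concurrent jobs running until the local queue is empty.

    is_done(job) is a hypothetical polling callback that returns True once
    a job has finished. Returns the completed jobs and the peak number of
    jobs that were running at the same time.
    """
    queued = deque(jobs)
    running, completed = [], []
    peak = 0
    while queued or running:
        # Top up the running set from the local queue.
        while queued and len(running) < max_concurrent:
            running.append(queued.popleft())
        peak = max(peak, len(running))
        # Poll every running job; finished ones move to completed.
        still_running = []
        for job in running:
            (completed if is_done(job) else still_running).append(job)
        running = still_running
    return completed, peak


# Simulated jobs: each one needs a given number of polls before it is done.
polls_needed = {'a': 1, 'b': 3, 'c': 2, 'd': 1}
polls_seen = {name: 0 for name in polls_needed}


def is_done(job):
    polls_seen[job] += 1
    return polls_seen[job] >= polls_needed[job]


completed, peak = run_with_limit(list(polls_needed), max_concurrent=2, is_done=is_done)
print(completed, peak)
```

A real polling loop would wait between polls and would also have to handle failed queries and server-side limits, both of which this sketch leaves out.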