Utils

IBM PAIRS Utilities: A collection of tools complementing the RESTful API wrapper.

Copyright 2019-2021 Physical Analytics, IBM Research. All Rights Reserved.

SPDX-License-Identifier: BSD-3-Clause

class utils.PAIRSProject(queryList, auth=None, downloadDir='./downloads', overwriteExisting=False, maxConcurrent=2, logEverySeconds=30)

Utility class to submit a large number of queries to IBM PAIRS. The class leverages ibmpairs.paw.PAIRSQuery and maintains a local queue.

Usage

>>> project = PAIRSProject(queryList)
>>> project.submitAllQueued()

Queries contained in queryList can be either query JSONs or paw.PAIRSQuery objects. In the latter case, only queries that have not previously been submitted will be submitted when calling submitAllQueued. Once queries have completed processing, the downloaded data can be found in the directory indicated by downloadDir. (There can be multiple download directories if queryList contains paw.PAIRSQuery objects.)

While queries are running, the class gives a periodic status update via Python’s logging module. Note that this happens at the logging.INFO level. A rudimentary setup for this would be as follows:

>>> import logging
>>> logging.basicConfig(level = <log-level required for your application>)
>>> pawLogger = logging.getLogger('ibmpairs.paw')
>>> pawLogger.setLevel(logging.ERROR)
>>> pairsUtilsLogger = logging.getLogger('ibmpairs.utils')
>>> pairsUtilsLogger.setLevel(logging.INFO)

The class stores queries in four queues, accessible as

>>> project.queries['queued']
>>> project.queries['running']
>>> project.queries['completed']
>>> project.queries['failed']
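
For a quick progress check, the four queues can be summarized by size. The following is a minimal sketch that assumes each queue supports len(); the helper queueSizes and the stand-in dictionary exampleQueries are hypothetical and not part of ibmpairs.

```python
QUEUE_NAMES = ('queued', 'running', 'completed', 'failed')

def queueSizes(queries):
    """Count the entries per queue of a mapping shaped like project.queries."""
    return {name: len(queries[name]) for name in QUEUE_NAMES}

# Stand-in for project.queries (assumption: real queues also support len()).
exampleQueries = {
    'queued': ['q3', 'q4'],
    'running': ['q2'],
    'completed': ['q1'],
    'failed': [],
}
print(queueSizes(exampleQueries))
```

With a real project, the call would be queueSizes(project.queries).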

One can obtain a list of all query JSONs in one particular queue by calling getQueryJSONs('<queue name>'). The following is then feasible:

>>> import json
>>> completedQueries = oldProject.getQueryJSONs('completed')
>>> with open('completedQueries.json', 'w') as fp:
...     json.dump(completedQueries, fp)
>>> # ... some other code ...
>>> with open('completedQueries.json', 'r') as fp:
...     recoveredQueries = json.load(fp)
>>> newProject = PAIRSProject(recoveredQueries)
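
The same idea extends to all four queues, for example to resubmit only what has not completed yet. The sketch below relies solely on getQueryJSONs as documented here and assumes its return values are JSON-serializable; the helper names dumpAllQueues and loadUnfinished, as well as the file layout, are hypothetical.

```python
import json

QUEUE_NAMES = ('queued', 'running', 'completed', 'failed')

def dumpAllQueues(project, path):
    """Write the query JSONs of every queue to one file, keyed by queue name."""
    snapshot = {name: project.getQueryJSONs(name) for name in QUEUE_NAMES}
    with open(path, 'w') as fp:
        json.dump(snapshot, fp)
    return snapshot

def loadUnfinished(path):
    """Return the query JSONs that had not completed when the file was written."""
    with open(path, 'r') as fp:
        snapshot = json.load(fp)
    return snapshot['queued'] + snapshot['running'] + snapshot['failed']
```

A later session could then create PAIRSProject(loadUnfinished('queues.json')) and call submitAllQueued on it.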

The properties of the paw library make it quite simple to work with completed queries even if the program hosting the PAIRSProject object has been terminated. Assume the data of completed queries is stored in <downloads/> (typically the value of downloadDir). Then the following builds an index of what is in that directory:

>>> from glob import glob
>>> from ibmpairs import paw
>>> zippedQueries = glob('downloads/*.zip')
>>> queries = [paw.PAIRSQuery(z) for z in zippedQueries]
>>> for q in queries:
...     q.list_layers()

Crucially, the list_layers function parses the contents of a query without loading the data into memory. (This is in contrast to create_layers.)

Parameters:
  • queryList (list) – list containing a mix of PAIRS query JSONs and paw.PAIRSQuery objects. For paw.PAIRSQuery objects, only those which have not been submitted yet will be submitted.

  • auth (str, str) – user name and password as tuple for access to pairsHost

  • overwriteExisting (bool) – destroy locally cached data if it exists; otherwise, use the latest locally cached data, where "latest" is defined by alphanumerical ordering of the PAIRS query ID

  • downloadDir (str) – directory in which to store downloaded data

  • maxConcurrent (int) – maximum number of concurrent queries. Note that the maximum number of concurrent queries might be limited server side for a particular user. There is no guarantee that a user can submit maxConcurrent queries at a given time.

  • logEverySeconds (int) – time interval in seconds at which the class sends status messages to its logger (via logging.INFO)
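
To illustrate what "latest by alphanumerical ordering" means for overwriteExisting, the sketch below picks the ID that sorts last under plain string comparison. It is not the library's code: the helper latestQueryID is hypothetical, and the IDs are made up for the example.

```python
def latestQueryID(queryIDs):
    """Return the ID that sorts last in plain string order, or None if empty."""
    queryIDs = list(queryIDs)
    return max(queryIDs) if queryIDs else None

# Made-up query IDs (assumption: IDs are compared as plain strings).
cachedIDs = ['1577836800_00012345', '1546300800_00098765', '1577836800_00000042']
print(latestQueryID(cachedIDs))
```

Note that string ordering only matches chronological ordering if the IDs are constructed so that later queries sort later.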

getQueryJSONs(status)

Returns all query JSONs in the queue self.queries[status].

Parameters:

status (string) – indicates queue from which query JSONs should be returned.

Returns:

list of PAIRS query JSONs

Return type:

list

submitAllQueued(cosInfoJSON=None, printStatus=False)

Submits all queries in the local queue. Ensures that there are always maxConcurrent queries running. (Note that the maximum number of concurrent queries might be limited server side for a particular user. There is no guarantee that a user can submit maxConcurrent queries at a given time.)
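
The queueing behaviour described above can be pictured with a small simulation. This is a simplified sketch of a bounded local queue, not the ibmpairs implementation: simulateQueue is a hypothetical helper, and each job's duration is given in abstract polling ticks.

```python
from collections import deque

def simulateQueue(durations, maxConcurrent):
    """Run jobs from a local queue with at most maxConcurrent running at once.

    durations maps a job name to the number of ticks the job needs.
    Returns the completion order and the peak number of concurrent jobs.
    """
    queued = deque(durations)   # job names in submission order
    running = {}                # job name -> remaining ticks
    completed = []
    peak = 0
    while queued or running:
        # Top up free slots from the local queue.
        while queued and len(running) < maxConcurrent:
            job = queued.popleft()
            running[job] = durations[job]
        peak = max(peak, len(running))
        # One polling interval passes; finished jobs leave the running set.
        for job in list(running):
            running[job] -= 1
            if running[job] <= 0:
                del running[job]
                completed.append(job)
    return completed, peak

completed, peak = simulateQueue({'a': 3, 'b': 1, 'c': 2, 'd': 1}, maxConcurrent=2)
print(completed, peak)
```

In the real service, the server-side limit mentioned above may keep the number of running queries below maxConcurrent, which this sketch does not model.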

Parameters:
  • cosInfoJSON (dict) –

    Cloud Object Storage bucket information for IBM PAIRS, for example:

    {
        "provider": "ibm",
        "endpoint": "https://s3.us.cloud-object-storage.appdomain.cloud",
        "bucket": "<your bucket name>",
        "token": "<your secret token for bucket>"
    }

    If set, the query result is published in the cloud and not stored locally on your machine. This is a useful feature in combination with IBM Watson Studio notebooks.

  • printStatus (bool) – triggers printing of the polling status while a query is being downloaded
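
As a sketch of how the cosInfoJSON argument might be assembled without hard-coding credentials, the following builds the dictionary from environment variables. The helper buildCosInfo and the variable names PAIRS_COS_BUCKET and PAIRS_COS_TOKEN are assumptions made for this example; only the four keys and the endpoint URL come from the description above.

```python
import os

def buildCosInfo(bucket, token,
                 endpoint='https://s3.us.cloud-object-storage.appdomain.cloud'):
    """Assemble a dictionary with the four keys expected for cosInfoJSON."""
    return {
        'provider': 'ibm',
        'endpoint': endpoint,
        'bucket': bucket,
        'token': token,
    }

# Hypothetical environment variable names; placeholders are used if unset.
cosInfoJSON = buildCosInfo(
    bucket=os.environ.get('PAIRS_COS_BUCKET', '<your bucket name>'),
    token=os.environ.get('PAIRS_COS_TOKEN', '<your secret token for bucket>'),
)
```

The result would then be passed as project.submitAllQueued(cosInfoJSON=cosInfoJSON).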