Extended descriptors to handle very large records - MdsWiki

Extended Descriptors

The current data descriptors (implemented as C structures) used internally in MDSplus originated from the descriptors used in OpenVMS. These descriptors describe how data buffers are to be interpreted. They include information such as the kind of data represented (e.g. 8-, 16-, 32- or 64-bit signed or unsigned integers, various floating point types, strings, and MDSplus-specific types such as signals and actions), the structure of the descriptor (e.g. an array descriptor, a scalar descriptor or a record descriptor), one or more memory pointers to where the actual data resides or to other descriptors, and various other information such as how many bytes the data item occupies. The descriptors are used throughout the internals of MDSplus and in the data files when storing data. Unfortunately, these descriptors were designed long before 64-bit computers became mainstream, so the data quantities they can describe are limited in size. For example, arrays of data in MDSplus are limited to approximately 2 gigabytes because some of the array descriptor shape fields are currently signed 32-bit integers.
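The 2 gigabyte ceiling follows directly from the range of a signed 32-bit integer. The short Python sketch below illustrates the arithmetic only; the helper names are made up for this example and are not part of MDSplus. It checks the 1000 x 1000 x 2000 single-precision array used in the session at the end of this page against that range.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit field can hold


def array_bytes(shape, item_size):
    """Total data bytes for an array with the given shape and element size."""
    total = item_size
    for dim in shape:
        total *= dim
    return total


def fits_in_int32(nbytes):
    """Return True if a byte count can be stored in a signed 32-bit field."""
    return 0 <= nbytes <= INT32_MAX


# 4-byte floats; shapes chosen for illustration
small = array_bytes((1000, 1000, 500), 4)   # 2,000,000,000 bytes
large = array_bytes((1000, 1000, 2000), 4)  # 8,000,000,000 bytes

print(fits_in_int32(small))  # True
print(fits_in_int32(large))  # False
```

The larger array needs roughly four times the space a signed 32-bit size field can express, so it cannot be described by the original descriptors.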

A prototype was developed to extend the descriptors to handle much larger sizes. Care was taken to ensure that all data previously stored in MDSplus trees would still be accessible without converting the data files. We also tried to retain as much compatibility as possible with existing applications. To do this, we implemented an entirely new set of descriptor structures, with larger fields where appropriate, alongside the original set. At the interfaces to the MDSplus internals (i.e. where data is read from and written to MDSplus data files, and at the points where applications may call into MDSplus internal routines), the new code detects old-style descriptors on incoming data and converts them to new-style descriptors, performs all internal operations using new-style descriptors, and, when exporting data to files or applications, converts the descriptors back to the old style if the data described does not exceed the capacity of the old-style descriptors. With this approach, the new code could be used interchangeably with older programs. Only when very large data records are encountered would older versions of MDSplus and user applications be unable to recognize the new data structures.
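The convert-at-the-boundary scheme described above can be sketched roughly as follows. This is a simplified illustration in Python, not the prototype's code: the two classes, their fields and the function names are hypothetical stand-ins for the old and new C descriptor structures.

```python
from dataclasses import dataclass

INT32_MAX = 2**31 - 1


@dataclass
class OldArrayDescriptor:
    """Hypothetical stand-in for an original descriptor (32-bit size field)."""
    dtype: str
    nbytes: int

    def __post_init__(self):
        if not 0 <= self.nbytes <= INT32_MAX:
            raise OverflowError("size does not fit in a signed 32-bit field")


@dataclass
class NewArrayDescriptor:
    """Hypothetical stand-in for an extended descriptor (64-bit size field)."""
    dtype: str
    nbytes: int


def import_descriptor(desc):
    """On the way in: upgrade old-style descriptors, pass new ones through."""
    if isinstance(desc, OldArrayDescriptor):
        return NewArrayDescriptor(desc.dtype, desc.nbytes)
    return desc


def export_descriptor(desc):
    """On the way out: downgrade whenever the data fits the old layout."""
    if desc.nbytes <= INT32_MAX:
        return OldArrayDescriptor(desc.dtype, desc.nbytes)
    return desc  # too large; only new-style readers can handle it


internal = import_descriptor(OldArrayDescriptor("FLOAT", 4000))
print(type(export_descriptor(internal)).__name__)   # OldArrayDescriptor

huge = NewArrayDescriptor("FLOAT", 8_000_000_000)
print(type(export_descriptor(huge)).__name__)       # NewArrayDescriptor
```

Because anything that fits is always written out in the old form, older programs only ever see a new-style descriptor when the record is too large for them to handle anyway.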

While this seemed like a manageable and clearly defined project, we quickly discovered that it would require a much larger effort than anticipated. It required meticulous modification of almost every module in MDSplus, since not only did the structure definitions need to be changed but so did many local variable declarations, routine arguments, structure initialization statements and other types of declarations. Almost 3 man-months went into making a somewhat working prototype. Then we discovered two very large stumbling blocks to completing this endeavor: testing and huge memory requirements. I wish I could report that there is an extensive suite of regression tests covering all the functionality of MDSplus, but regrettably this is not the case. Several attempts at making such a suite have been made over the years, but they quickly falter and soon become obsolete. Also, the behavior of the expression evaluator, tdi, in its attempt to handle the large set of different data types and do something appropriate depending on whether an operand is a scalar, array, signal, etc. is, well, not fully understood by anyone but perhaps the original developer. Many user applications have been developed by trial and error with the expression evaluator and therefore depend on certain behavior. Putting together tests that exercise all of the code paths would be quite an undertaking. On top of that, running such tests with very large data elements would require a computer with hundreds of gigabytes of RAM and a fast processor to complete them in a timely fashion. During the evaluation of even a simple tdi expression, data elements may be copied several times, and several of these copies may coexist in memory for much of the evaluation, so the memory and CPU requirements are significant when handling large data items. We discovered this very quickly when running just simple tests on a fast system with 32 GB of RAM. Doing extensive testing on that system would be virtually impossible.
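A back-of-the-envelope calculation shows how quickly coexisting copies exhaust memory. The Python sketch below assumes an 8 GB record (the size of the array in the session at the end of this page) and treats "32 GB" as 32 GiB; the number of simultaneous copies made by the evaluator is an illustrative assumption, not a measured value.

```python
GIB = 2**30

record_bytes = 1000 * 1000 * 2000 * 4  # 8,000,000,000 bytes of 4-byte floats
ram_bytes = 32 * GIB                   # assumed reading of "32 GB of RAM"


def peak_bytes(record, live_copies):
    """Memory needed if `live_copies` copies of the record coexist."""
    return record * live_copies


# How many whole copies fit, ignoring the OS and everything else in memory
max_copies = ram_bytes // record_bytes
print(max_copies)  # 4

for copies in (2, 4, 5):
    fits = peak_bytes(record_bytes, copies) <= ram_bytes
    print(copies, "copies fit" if fits else "copies do not fit")
```

With only four copies possible even in the best case, an expression that holds a handful of temporaries at once leaves no headroom on such a machine.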

For the above reasons, this project has been put on hold until either there is a compelling demand for this enhancement or readily available computer hardware is powerful enough to do more extensive testing. In the meantime, it is highly recommended that users use segmented records to deal with larger data sets. This approach enables applications to read in data in smaller chunks. If they need to manipulate the entire data element, they can still do so by reading the segments into program variables and manipulating the data more efficiently than the general-purpose expression evaluator of MDSplus would.

We got close, though:

On 11/17/2010:

IDL> x=findgen(1000,1000,2000)
IDL> mdsopen,'test',-1
IDL> mdsput,'gub','$',x
IDL> mdstcl
TCL> dir/full gub


      Status: on,parent is on, usage signal
      Data inserted: 17-NOV-2010 17:09:58.12    Owner: [1751,1750]
      Dtype: DTYPE_FS              Class: CLASS_A             Length: 8000000052 bytes

Total of 1 node.
TCL> exit
IDL> y=mdsvalue('gub')
IDL> help,y
Y               FLOAT     = Array[1000, 1000, 2000]
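
The segmented-record approach recommended above amounts to processing one chunk at a time, so the full record never has to be in memory at once. The Python sketch below illustrates that pattern only: `iter_segments` is a hypothetical stand-in that generates dummy data, where a real application would read each segment from the tree.

```python
def iter_segments(num_segments, segment_len):
    """Hypothetical segment reader: yields one chunk of samples at a time."""
    for i in range(num_segments):
        start = i * segment_len
        yield [float(v) for v in range(start, start + segment_len)]


def running_mean(segments):
    """Mean over all samples, holding only one segment in memory at a time."""
    total = 0.0
    count = 0
    for seg in segments:
        total += sum(seg)
        count += len(seg)
    return total / count if count else float("nan")


mean = running_mean(iter_segments(num_segments=8, segment_len=1000))
print(mean)  # 3999.5
```

Peak memory here is set by the segment length rather than by the total record size, which is what makes this approach workable on ordinary hardware.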