Re: Document warehouse for/on VMS??

From: Alan Winston - SSRL Admin Cmptg Mgr (winston_at_SSRL.SLAC.STANFORD.EDU)
Date: 03/16/04

Date: Mon, 15 Mar 2004 23:19:48 GMT

In article <ROl5c.1303$>, "John Smith" <> writes:
>Thinking about a possible upcoming project...
>-approximately 100 million documents (.pdf / .pps / .doc / .tif / xml /and
>some other formats) need to be stored, indexed, and available for display
>via a web browser/plug-in
>- each document is variable size, anywhere from 1 to 100's of pages long
>- users have to be able to access specific pages of the documents if they
>are multi-page ones
>Two basic approaches in a roll-your-own solution:
>a) Store the file name, location, description in a rdbms or OR store and
>have the ability in a program to go to the required location and open the
>file. The challenge here is that if the file moves or is deleted/renamed,
>my DB is out of whack.
>b) Store the 'objects' as blobs in a rdbms along with various other
>attributes and indices to specific sections of interest. Then the app would
>have to copy the doc to the file system when the user wanted to access.
>Maybe a little harder to implement, but better for maintaining db integrity.

You don't have to involve the file system at all. Store the blobs in an
rdbms, and have your application, running as a CGI, fetch the blobs, write
appropriate headers and then the binary document itself out to its link to
the webserver. Same strategy as generating a plot and writing it out to
the webserver without an intermediate file; we do that here.
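
A minimal sketch of that fetch-and-write step, in Python. Everything named
here is an assumption for illustration: the `documents` table, its columns,
and the helper functions are hypothetical, and sqlite3 stands in for the
rdbms only so the example runs anywhere (a real setup would use whatever
driver talks to your database).

```python
import io
import sqlite3


def write_document(conn, doc_id, out):
    """Fetch one blob and write a CGI response (headers, blank line, body)
    to the binary stream `out`. Returns True if the document was found."""
    row = conn.execute(
        "SELECT mime_type, body FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        # CGI convention: a Status header tells the webserver what to return.
        out.write(b"Status: 404 Not Found\r\n"
                  b"Content-Type: text/plain\r\n\r\n"
                  b"No such document\n")
        return False
    mime_type, body = row
    header = (
        f"Content-Type: {mime_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    out.write(header.encode("ascii"))
    out.write(body)  # raw bytes straight from the database, no temp file
    return True


def make_demo_db():
    """Build a throwaway in-memory database holding one fake document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "doc_id INTEGER PRIMARY KEY, mime_type TEXT, body BLOB)"
    )
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?)",
        (1, "application/pdf", b"%PDF-1.4 demo bytes"),
    )
    conn.commit()
    return conn


def render_to_bytes(conn, doc_id):
    """Convenience wrapper: capture the whole response in memory."""
    buf = io.BytesIO()
    write_document(conn, doc_id, buf)
    return buf.getvalue()
```

In a real CGI you would pass the process's binary standard output (in
Python, `sys.stdout.buffer`) as `out`, so the bytes go straight down the
link to the webserver.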

>In a file system approach under VMS, my memory tells me that there are/were
>a couple of limitations on the number of nested directory levels possible
>and/or number of files. Since my current apps only deal with a rather
>limited number of files outside the db I can't be sure if this vaguely
>recollected 'issue' is still a problem.

It's certainly less of a problem under ODS-5, and under ODS-2 since VMS 7.2.
You used to be limited to eight levels of directory structure (which you could
fake out with rooted directory structures and renaming, but with possible
trouble); there also used to be noticeable performance difficulties in
maintaining large directories. Both of these issues have been addressed.

>Other solutions would include 3rd party apps, if available on VMS -
>Documentum springs to mind as an application of this type, but it isn't
>available on VMS.

>Any thoughts on how to handle an app/requirement like this on VMS?

Basic approach given above: store all the documents as blobs in a dbms,
have CGIs fetch 'em from the db and blat 'em out directly. (The loss here
vs. filesystem is in frequently-accessed files, which the webserver won't
cache if they appear to be dynamically generated. But Apache 1.3 doesn't
cache anyway. You'd have to figure out how to tweak your db to make it
cache big, frequently-accessed documents.)

Lot of flexibility. You can put Rdb on a backend machine (not the
webserver); it's fast, not very fragile, and supports LARGE databases very
well. (It sounds like you're doing lots of access and not very much
updating, so TPC-type data isn't really of any use. MySQL is supposed to
be optimized to blat mostly-static data out onto webpages as fast as
possible, but I have no clue how well it compares to Rdb. It _is_
available on VMS nowadays.)

Your CGIs can run on the same box as the webserver or on a different one,
through various approaches: FastCGI, OSU DECnet script execution (via
the OSU module on Apache if you're running Apache), or a dedicated server
process and a module to talk to it (easiest to implement on Apache or
WASD). (The FastCGI and dedicated server approaches don't require a
VMS-based webserver.)

You can talk to Rdb from Java, Perl, Python, and (in theory, anyway,
haven't seen it working yet) PHP via the SQLNET4RDB (Oracle Call Interface)
support. All true for Oracle Server, of course, but everything I hear
about Oracle suggests fragility to me.

Webservers: CSWS is most industry-standard, WASD is fastest, OSU has most
third-party CGI tools available.

-- Alan

 Disclaimer: I speak only for myself, not SLAC or SSRL   Phone:  650/926-3056
 Paper mail to: SSRL -- SLAC BIN 99, 2575 Sand Hill Rd, Menlo Park CA   94025