Thursday 31 March 2011

ADMIRAL: the final push

Our most recent development review is at http://imageweb.zoo.ox.ac.uk/wiki/index.php/20110331_Quick_Review.

Over the past few months, the ADMIRAL project has been moving from a mode centred largely on feature development to one of stabilization and maintenance, so that we can ramp up user engagement activities.  As part of this, the project management style has evolved towards shorter, less elaborately planned sprints, summarized on the project plan page.  Each sprint essentially consists of a combined review and planning session, held at approximately 1-week intervals and driven by requirements recorded in the project issues list.

We intend to complete all items recorded as high and medium priority in the issues list, except where they are blocked by noted issues that we are unable to resolve with available resources.  This represents a kind of feature freeze in the ADMIRAL data store function, with enhancements focused on stabilization and manageability of the system.  With ADMIRAL user features stabilized, and a stable deployment of Databank, we will update all of the deployed systems, and encourage researchers to deposit real research data sets from ADMIRAL to Databank for preservation and publication.  Getting real research data deposited and published with DataCite DOIs is the main project goal that we now want to see realized before the project ends.

Specifically, over the next three months, we aim to:
  • Test and integrate the remaining Databank features (see issue list items tagged "Databank")
  • Issue 9: displaying content tree of dataset prior to confirmation of submission
  • Issue 42: a web interface for user administration
  • Issue 45: basic Debian packaging for ADMIRAL (which we expect to allow us to deploy easily on more recent versions of Ubuntu)
  • if and when time permits, picking up and progressing some of the lower priority technical debt issues
while also dealing with any other critical issues that may arise.
    In parallel, we will engage with the various research groups to learn more about how and to what extent they are using ADMIRAL, and encourage them to start submitting datasets to the Databank service.

    Friday 4 March 2011

    ADMIRAL Sprint 17

    We have recently completed our review of Sprint 17.

    This review was somewhat overdue, as we've been very busy with follow-ups to ADMIRAL deployment with two additional research groups (making a total of three deployments now).  The main achievements over the past month have been:


    • Two new ADMIRAL deployments; Silk group storage upgraded to use departmental iSCSI facility
    • Resolved awkward technical Apache+LDAP issue
    • Started construction of stand-alone demonstration environment
    • Deployment and management improvements
    • Bug-fixing and usability improvements
    • Documentation of technical problem areas
    • Benefits case study write-up
    • ADMIRAL packaging adopted for 1st prototype of Wf4Ever project


    One of the recent lessons is that the general level of requirement for data storage has increased dramatically since our initial user surveys.  Where most groups were originally content with 200-400GB of storage, they are now asking for Terabytes (due to increased use of high definition video for observations).  So the ability to connect to a departmental iSCSI network storage facility has turned out to be a crucial development for us, especially for new research proposals that are required to include data management and sharing plans.

    Resolving the Apache+LDAP problems has been a most satisfying advance; the awkwardness of the Apache web server configuration had been a long-standing difficulty for us, and we will now be able to simplify the overall ADMIRAL configuration and monitoring.

    Looking forward, as we enter the final stages of this project, we intend to change our approach to sprint planning.  Instead of preparing a separate plan, we intend to be more reactive, responding to issues in the project issue list (http://code.google.com/p/admiral-jiscmrd/issues/list), as these most closely reflect user feedback and other issues that need to be addressed.  We will still undertake periodic reviews to help us ensure that efforts are sensibly focused.  In addition to dealing with the issue list, two other developments are planned:
    • Web interface for user management
    • Investigation of Debian installation package for ADMIRAL deployment
    The rationale for choosing these is that they appear to be key features for facilitating continued management and new deployment of ADMIRAL systems within the department.

    Tuesday 22 February 2011

    AJAX content negotiation browser differences

    We have been experiencing hard-to-explain problems with the behaviour of ADMIRAL web pages in different browsers. They would work fine for Firefox, but not with IE, Safari or Google Chrome.

    JavaScript (using jQuery) is used at the client end to retrieve RDF/XML information from the ADMIRAL server using AJAX calls. The server is capable of returning different formats for the requested data, basing its decision on HTTP Accept headers.

    The calling Javascript code looks like this:
    jQuery.ajax({
        type:         "GET",
        url:          "...",
        username:     "...",
        password:     "...",
        dataType:     "text",
        beforeSend:   function (xhr)
          {
            xhr.setRequestHeader("Accept", "application/rdf+xml");
          },
        success:      function (data, status, xhr)
          {
            ...
          },
        error:        function (xhr, status)
          {
            ...
          },
        cache:        false
    });
    Using Wireshark to observe the HTTP traffic, we find that Firefox sends the following header:
    Accept: application/rdf+xml
    But when using Safari we see:
    Accept: text/plain,*/*,application/rdf+xml
    IE and Chrome also send something different from Firefox, but at the time of writing we've lost the exact trace.

    The effect of this has been that even when we write an Ajax call to accept just RDF/XML, the server is seeing the additional Accept header options and in some cases is choosing the wrong response format when using browsers other than Firefox.

    We have not yet found a simple work-around that works in all situations. But, generally, servers need to be aware that browsers sometimes add commonly required options to the HTTP Accept header. Prioritizing matches for uncommon content options might go some way to ensuring consistent behaviour across browsers. E.g. in the case illustrated here, servers should favour the less common option application/rdf+xml over the more common text/plain content type.  Also favouring non-wildcard matches that appear later in the Accept header may help in some cases.
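The selection heuristic suggested above can be sketched as follows. This is only an illustration under stated assumptions, not the ADMIRAL server's actual code: the function names, the `COMMON_TYPES` list and the decision to ignore `q` values are all hypothetical simplifications.

```javascript
// Hypothetical list of media types that browsers tend to add unasked.
const COMMON_TYPES = ["text/plain", "text/html", "*/*"];

// Split an Accept header into bare media types, dropping parameters
// such as ";q=0.8" (a deliberate simplification for this sketch).
function parseAccept(header) {
  return header
    .split(",")
    .map((item) => item.split(";")[0].trim().toLowerCase())
    .filter((item) => item.length > 0);
}

// Choose a response type from those the server supports.
function chooseContentType(acceptHeader, supported) {
  const offered = parseAccept(acceptHeader);
  const candidates = offered.filter((t) => supported.includes(t));
  // Favour uncommon types; among those, favour the one listed last.
  const uncommon = candidates.filter((t) => !COMMON_TYPES.includes(t));
  if (uncommon.length > 0) return uncommon[uncommon.length - 1];
  if (candidates.length > 0) return candidates[candidates.length - 1];
  // Only a wildcard matched: fall back to the server's first type.
  if (offered.includes("*/*") && supported.length > 0) return supported[0];
  return null;
}

const supportedTypes = ["application/rdf+xml", "text/plain"];
console.log(chooseContentType("text/plain,*/*,application/rdf+xml", supportedTypes));
```

With the Safari header shown earlier, this sketch selects application/rdf+xml, the same result as for the Firefox header. A production server would also need to honour q values, which this sketch ignores.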

    Monday 14 February 2011

    Apache configuration for LDAP access control

    For some time now, HTTP access to the ADMIRAL file store has been restricted because we were unable to use a combination of Apache Require valid-user, Require user and Require ldap-attribute directives to combine by-username and by-group access.  For example, we have been unable to realize access control for this scenario:

    /data/private        all registered users have read-only access
    /data/private/UserA  read/write access for UserA,
                         read-only access for research group leaders,
                         all others denied
    /data/private/UserB  read/write access for UserB,
                         read-only access for research group leaders,
                         all others denied 

    We had been finding that any attempt to put an access control statement on the /data/private directory resulted in the access control statements for the user private directories being ignored. Also, we were finding that we could not allow UserA to have read/write access to their personal area, while allowing read access to any user designated as a research group leader.

    The Apache configuration we had been trying to use looked something like this:

    # Default read access to all areas for registered users
    <Directory /home/data>
        Require valid-user
    </Directory>
    # Define access to area for user $username
    <Location /data/private/$username>
        Order Deny,Allow
        Allow from all
        <LimitExcept REPORT GET OPTIONS PROPFIND>
          Require ldap-user $username
        </LimitExcept>
        <Limit PROPFIND OPTIONS GET REPORT>
          Require user $username 
          Require ldap-attribute gidNumber=$RGLeaderGID
        </Limit>
    </Location>

    As this would not work as required, we ended up disabling access to the /data area, leaving users unable to use HTTP to browse between the different user areas, and configuring the research group leader's read-only access using a configured username, making the configuration very much more fragile than needed:

    # Define access to area for user $username
    <Location /data/private/$username>
        Order Deny,Allow
        Allow from all
        <LimitExcept REPORT GET OPTIONS PROPFIND>
          Require user $username
        </LimitExcept>
        <Limit PROPFIND OPTIONS GET REPORT>
          Require user $username $RGLeaderName
        </Limit>
    </Location>

    We recently realized that Apache recognizes different access control providers, and that access to a given area cannot be controlled by a combination of providers.  There is a default provider, invoked by Require user ... and Require valid-user, and there is an LDAP provider. Thus, the presence of a matching Require ldap-attribute directive meant that any matching Require user directives were being ignored.
    "When no LDAP-specific Require directives are used, authorization is allowed to fall back to other modules as if AuthzLDAPAuthoritative was set to off"
    http://httpd.apache.org/docs/2.2/mod/mod_authnz_ldap.html
    The overlooked implication here was that if any LDAP authorization directive is matched, the non-LDAP directives are ignored, even when applied to a more specific directory. The picture is further muddied by the fact that the non-LDAP authorization handlers have limited access to LDAP values (via an LDAP PAM module?), so that Require user would still respond to LDAP user entries.

    Having identified the problem, the fix is easy.  Use Require ldap-user instead of Require user:
      
    # Default access to all areas
    <Directory /home/data>
        Require ldap-attribute gidnumber=600
        Require ldap-attribute gidnumber=601
        Require ldap-attribute gidnumber=602
    </Directory>
    <Location /data/shared/$username>
        Order Deny,Allow
        Allow from all
        <LimitExcept REPORT GET OPTIONS PROPFIND>
          Require ldap-user $username
        </LimitExcept>
        <Limit PROPFIND OPTIONS GET REPORT>
          Require ldap-attribute gidNumber=$RGLeaderGID
          Require ldap-attribute gidNumber=$RGMemberGID
        </Limit>
    </Location>

    There is a lingering issue of the mixed <Directory> and <Location> directives: using <Location> for the /data URI path is not working, and right now we are not sure exactly why, but it seems to be related to either file system symlinks to the data directories, or Apache Alias directives controlling the mapping to same.

    Wednesday 9 February 2011

    Ajax Authentication in Firefox and Safari

    We've recently been tripped up by a difference between the way that Firefox and Safari handle authentication in Ajax requests.  We have a jQuery Ajax call whose essential elements are like this:

    jQuery.ajax({
            type:         "GET",
            url:          "/admiral-test/datasets/"+datasetName,
            username:     username,
            password:     password,
            cache:        false,
            success:      function (data, status, xhr)
              {
                ...
              },
            error:        function (xhr, status)
              {
                ...
              }
            });

    The username and password here are credentials needed for accessing non-public information on the Databank repository server.  We find this works fine with Firefox, but when accessing some Databank services using Safari (and possibly IE) we get HTTP 403 Forbidden responses, despite the fact that we provide correct credentials.

    We diagnosed the problem using Wireshark to monitor the HTTP protocol exchanges.  It is worth noting that Wireshark can trace encrypted HTTPS traffic if a copy of the server private key is provided.  A summary of our investigations is at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_with_Safari_and_IE.

    What we observed was that when credentials are supplied in the Ajax call, Firefox always includes an appropriate HTTP Authorization (sic) header. Safari, on the other hand, does not initially include this, but instead re-sends the request with an Authorization header in response to an HTTP 401 Unauthorized status return.  Both behaviours are correct within the HTTP specification. Our problem was caused by the fact that the Databank service was responding to a request without the Authorization header with a 403 instead of a 401 response.  The 403 response explicitly indicates that re-issuing the request with credentials will not make any difference (in our case, incorrectly).
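The distinction between the two status codes can be sketched as a small decision function. This is a hypothetical illustration, not Databank's code: `respondToRequest`, the `verify` and `mayAccess` callbacks and the realm name are all assumptions made for the example.

```javascript
// Hypothetical sketch of the status decision for a protected resource.
// verify(authorization) returns a user name, or null if the credentials
// are bad; mayAccess(user) says whether that user may see the resource.
function respondToRequest(headers, verify, mayAccess) {
  // A 401 response carries a challenge so that the client can retry.
  const challenge = { "WWW-Authenticate": 'Basic realm="example"' };
  const authorization = headers["authorization"];
  if (!authorization) {
    // No credentials yet: ask for them (401), do not refuse outright.
    return { status: 401, headers: challenge };
  }
  const user = verify(authorization);
  if (user === null) {
    // Credentials supplied but not valid: challenge again.
    return { status: 401, headers: challenge };
  }
  if (!mayAccess(user)) {
    // Known user without permission: retrying will not help.
    return { status: 403, headers: {} };
  }
  return { status: 200, headers: {} };
}

// Stub callbacks for illustration only.
const verifyStub = (auth) => (auth === "Basic good" ? "UserA" : null);
const accessStub = (user) => user === "UserA";
console.log(respondToRequest({}, verifyStub, accessStub).status);
```

With this ordering, a client that behaves like Safari gets a 401 on its first request, without credentials, and can then re-send the request with an Authorization header.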

    There is a separate issue about whether we actually need to provide credentials in the Ajax call: in other parts of our system, we have found that the browser (well, Firefox, anyway) will intelligently pop up a credentials box if an Ajax request needs authentication credentials - this clearly depends on getting a 401 response to the original request, so is something that should be tested when the Databank server is fixed.

    Tuesday 8 February 2011

    Reading RDF/XML in Internet Explorer with rdfQuery

    We've just spent the better part of two days tracking down a stupid bug in Internet Explorer.

    Under the guise of providing better security, Internet Explorer will not recognize as XML any MIME type other than text/xml or application/xml, and then only when the URI (or Content-disposition header filename) ends with .xml [1].  (I say guise of better security, because a server or intercept that is determined to falsely label XML data can do so in any case: refusing to believe the server's content-type when the data properly conforms to that type does not help;  maybe what they are really protecting against is Windows' flawed model of using the filename pattern to determine how to open a file.)

    In our case, we use jQuery to request XML data, and pass the resulting jQuery XML object to rdfQuery to build a local RDF "databank" from which metadata can be extracted.  On Firefox and Safari, this works just fine.  But on Internet Explorer it fails with a "parseerror", which is generated by jQuery.ajax when the retrieved data does not match the requested xml type.

    Fortunately, rdfQuery databank.load is also capable of parsing RDF from plain text as well as from a parsed XML document structure.  So the fix is simple, albeit not immediately obvious: when performing the jQuery.ajax operation, request text rather than XML data.  For example:

    jQuery.ajax({
            type:         "GET",
            url:          "/admiral-test/datasets/"+datasetName,
            username:     "...",
            password:     "...",
            dataType:     "text",    // To work on IE, NOT "xml"!
            cache:        false,
            beforeSend:   function (xhr)
              {
                xhr.setRequestHeader("Accept", "application/rdf+xml");
              },
            success:      function (data, status, xhr)
              {
                // rdfQuery parses the RDF/XML directly from the text
                var databank = jQuery.rdf.databank();
                databank.load(data);
                ...
              },
            error:        function (xhr, status)
              {
                ...
              }
            });

    Sigh!

    [1] http://technet.microsoft.com/en-us/library/cc787872(WS.10).aspx


    Monday 24 January 2011

    Data Storage Costs

    The ADMIRAL project is creating a local data management facility and testing its use with researchers in the Zoology Department at Oxford University. The facility is built using a standard Linux operating system running in a VMware hosting environment installed within the department.  The environment has a limited amount of local fast SCSI disk storage, but is also capable of using network storage via the iSCSI protocol.

    One of the research groups for whom we deployed an ADMIRAL system started using it so enthusiastically that they rapidly filled up their allocated space, at which point they stopped using the system without telling us.

    With opportune timing, the Zoology Department has recently installed an iSCSI network storage array facility, to which disk storage capacity can be added on an as-needed basis. This facility is to be paid for on a cost recovery basis by projects that use it. We have performed tests to prove that we can use this facility within the ADMIRAL data management framework, and are now waiting for the Silk Group and IT Support to order and install the additional storage capacity so we can deploy an enlarged ADMIRAL server to meet the Silk Group's storage requirements.

    We have also been in discussion with other research groups not currently involved with ADMIRAL about storage provision to meet funding body requirements for data sharing. Recently released BBSRC data sharing requirements (http://www.bbsrc.ac.uk/web/FILES/Policies/data-sharing-policy.pdf) stipulate that research data should be made available for sharing for at least 10 years beyond the life of a research project (i.e. a total of 12-14 years for a typical BBSRC-funded project), and claim to allow the cost of this to be added to a research grant proposal. See section "BBSRC Data Sharing Policy Statement" in their data sharing policy document. This does not require that data be kept online for this period, but considering cost, attrition rate and time to obsolescence of alternative solutions, some kind of online facility would appear to be as cost effective as any.

    The cost of providing the departmental network storage facility has been estimated at about £400 per Terabyte over a 5 year period.  This is estimated on a hardware cost recovery basis, allowing for a normal rate of disk failures over the 5 year period, but not including departmental IT support personnel costs.  In discussions with our IT support team, we estimate that the project+10 year duration of required availability will approximately double the hardware-only cost of delivering on such a requirement. Based on these discussions, my current recommendation to Zoology department researchers bidding to meet these requirements would be to cost data preservation and sharing to meet BBSRC requirements at £1000/Terabyte, assuming that departmental IT support remains committed to operational management of the storage server. I would also recommend that they allow 1 person day for each month of the project at appropriate FEC cost to cover data management support, especially if the research team itself does not have IT server systems expertise: I estimate this would reasonably cover ongoing support of an ADMIRAL-like system for a project.
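As a rough worked example, the recommendation above can be turned into a small calculation. The function name, the 2 Terabyte and 36 month project size and the £300 FEC day rate below are hypothetical figures chosen only to show the arithmetic. The £1000/Terabyte and one-day-per-month figures are the ones recommended in the text.

```javascript
// Figures taken from the recommendation in the text.
const STORAGE_COST_PER_TB = 1000;   // GBP per Terabyte, project + 10 years
const SUPPORT_DAYS_PER_MONTH = 1;   // person days of data management support

// Hypothetical helper: estimate data management costs for a proposal.
// fecDayRate is the Full Economic Cost of one person day, supplied by
// the caller because the text does not give a figure for it.
function estimateDataManagementCost(terabytes, projectMonths, fecDayRate) {
  const storage = terabytes * STORAGE_COST_PER_TB;
  const supportDays = projectMonths * SUPPORT_DAYS_PER_MONTH;
  const support = supportDays * fecDayRate;
  return { storage, supportDays, support, total: storage + support };
}

// Illustrative inputs only: 2 Terabytes, 36 months, GBP 300 per day.
console.log(estimateDataManagementCost(2, 36, 300));
```

For these illustrative inputs the storage element is £2000 and the support element is 36 person days, so under these assumptions the support cost is much the larger part of the estimate.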

    For comparison, Oxford University Computing Service offers a 5 year data archive facility with multiple offsite tape backups for a Full Economic Cost of about £4000/Terabyte. They do not offer an archive service of longer duration. (It is important to distinguish here between a backup service and an archive service: the same OUCS facility provides a daily backup service, but said backups are destroyed after just a few months of no update or access.)

    The university library service has tentatively indicated a slightly higher cost per Terabyte for perpetual storage.  The meaning of "perpetual" here is open to debate, but the intent is to maintain copies of the data in storage for at least several decades, much as historical books and papers are held by the Bodleian Library for the long term.