Showing posts with label technicalDevelopment. Show all posts

Tuesday, 22 February 2011

AJAX content negotiation browser differences

We have been experiencing hard-to-explain problems with the behaviour of ADMIRAL web pages in different browsers: pages would work fine in Firefox, but not in IE, Safari or Google Chrome.

Javascript (using jQuery) is used at the client end to retrieve RDF/XML information from the ADMIRAL server using AJAX calls. The server is capable of returning different formats for the requested data, basing its decision on HTTP Accept headers.

The calling Javascript code looks like this:
jQuery.ajax({
        type:         "GET",
        url:          "...",
        username:     "...",
        password:     "...",
        dataType:     "text",
        beforeSend:   function (xhr)
          {
            xhr.setRequestHeader("Accept", "application/rdf+xml");
          },
        success:      function (data, status, xhr)
          {
            ...
          },
        error:        function (xhr, status)
          {
            ...
          },
        cache:        false
        });
Using Wireshark to observe the HTTP traffic, we find that Firefox sends the following header:
Accept: application/rdf+xml
But when using Safari we see:
Accept: text/plain,*/*,application/rdf+xml
IE and Chrome also send something different from Firefox, but at the time of writing we've lost the exact trace.

The effect of this has been that even when we write an Ajax call to accept just RDF/XML, the server is seeing the additional Accept header options and in some cases is choosing the wrong response format when using browsers other than Firefox.

We have not yet found a simple work-around that works in all situations. In general, though, servers need to be aware that browsers sometimes add commonly requested types to the HTTP Accept header. Prioritizing matches for uncommon content types may go some way towards ensuring consistent behaviour across browsers: in the case illustrated here, the server should favour the less common option application/rdf+xml over the more common text/plain content type. Favouring non-wildcard matches that appear later in the Accept header may also help in some cases.
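As a sketch of that server-side strategy, the following Javascript illustrates a matcher that prefers exact matches in the server's own preference order (most specific type first) and treats a wildcard as "server's choice". This is illustrative code only, not the actual ADMIRAL server implementation; the function name is made up, and q-values in the Accept header are deliberately ignored for simplicity.

```javascript
// Illustrative sketch (not the ADMIRAL server code): choose a response
// format from an Accept header, preferring exact matches in the server's
// own preference order, with */* falling back to the server's first choice.
// q-values are ignored in this sketch.
function chooseFormat(acceptHeader, supported) {
    // supported: types in server preference order, least common first,
    // e.g. ["application/rdf+xml", "text/plain"]
    var offered = acceptHeader.split(",").map(function (s) {
        return s.trim().split(";")[0];  // strip any q-value parameters
    });
    // First pass: exact matches, honouring the server's preference order
    for (var i = 0; i < supported.length; i++) {
        if (offered.indexOf(supported[i]) !== -1) {
            return supported[i];
        }
    }
    // Second pass: a wildcard lets the server pick its first choice
    if (offered.indexOf("*/*") !== -1) {
        return supported[0];
    }
    return null;  // no acceptable format
}
```

With this ordering, the Safari header shown above ("text/plain,*/*,application/rdf+xml") still yields application/rdf+xml, because the server's preference, not the header order, decides between exact matches.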

Monday, 14 February 2011

Apache configuration for LDAP access control

For some time now, HTTP access to the ADMIRAL file store has been restricted because we were unable to use a combination of Apache Require valid-user, Require user and Require ldap-attribute directives to combine by-username and by-group access.  For example, we have been unable to realize access control for this scenario:

/data/private        all registered users have read-only access
/data/private/UserA  read/write access for UserA,
                     read-only access for research group leaders,
                     all others denied
/data/private/UserB  read/write access for UserB,
                     read-only access for research group leaders,
                     all others denied 

We had been finding that any attempt to put an access control statement on the /data/private directory resulted in the access control statements for the user private directories being ignored. Also, we were finding that we could not allow UserA to have read/write access to their personal area, while allowing read access to any user designated as a research group leader.

The Apache configuration we had been trying to use looked something like this:

# Default read access to all areas for registered users
<Directory /home/data>
    Require valid-user
</Directory>
# Define access to area for user $username
<Location /data/private/$username>
    Order Deny,Allow
    Allow from all
    <LimitExcept REPORT GET OPTIONS PROPFIND>
      Require ldap-user $username
    </LimitExcept>
    <Limit PROPFIND OPTIONS GET REPORT>
      Require user $username 
      Require ldap-attribute gidNumber=$RGLeaderGID
    </Limit>
</Location>

As this would not work as required, we ended up disabling access to the /data area (leaving users unable to use HTTP to browse between the different user areas) and configuring the research group leader's read-only access with a hard-coded username, making the configuration very much more fragile than necessary:

# Define access to area for user $username
<Location /data/private/$username>
    Order Deny,Allow
    Allow from all
    <LimitExcept REPORT GET OPTIONS PROPFIND>
      Require user $username
    </LimitExcept>
    <Limit PROPFIND OPTIONS GET REPORT>
      Require user $username $RGLeaderName
    </Limit>
</Location>

We recently realized that Apache recognizes different access control providers, and that access to a given area cannot be controlled by a combination of providers.  There is a default provider, invoked by Require user ... and Require valid-user, and there is an LDAP provider. Thus, the presence of a matching Require ldap-attribute directive meant that any matching Require user directives were being ignored.
"When no LDAP-specific Require directives are used, authorization is allowed to fall back to other modules as if AuthzLDAPAuthoritative was set to off"
http://httpd.apache.org/docs/2.2/mod/mod_authnz_ldap.html
The overlooked implication here was that if any LDAP authorization directive is matched, the non-LDAP directives are ignored, even when applied to a more specific directory. The picture is further muddied by the fact that the non-LDAP authorization handlers have limited access to LDAP values (via an LDAP PAM module?), so that Require user would still respond to LDAP user entries.

Having identified the problem, the fix is easy.  Use Require ldap-user instead of Require user:
  
# Default access to all areas
<Directory /home/data>
    Require ldap-attribute gidNumber=600
    Require ldap-attribute gidNumber=601
    Require ldap-attribute gidNumber=602
</Directory>
<Location /data/shared/$username>
    Order Deny,Allow
    Allow from all
    <LimitExcept REPORT GET OPTIONS PROPFIND>
      Require ldap-user $username
    </LimitExcept>
    <Limit PROPFIND OPTIONS GET REPORT>
      Require ldap-attribute gidNumber=$RGLeaderGID
      Require ldap-attribute gidNumber=$RGMemberGID
    </Limit>
</Location>

There is a lingering issue with the mixed <Directory> and <Location> directives: using <Location> for the /data URI path does not work, and right now we are not sure exactly why, but it seems to be related either to file system symlinks to the data directories, or to the Apache Alias directives controlling the mapping to them.

Wednesday, 9 February 2011

Ajax Authentication in Firefox and Safari

We've recently been tripped up by a difference between the way that Firefox and Safari handle authentication in Ajax requests.  We have a jQuery Ajax call whose essential elements look like this:

jQuery.ajax({
        type:         "GET",
        url:          "/admiral-test/datasets/"+datasetName,
        username:     username,
        password:     password,
        cache:        false,
        success:      function (data, status, xhr)
          {
            ...
          },
        error:        function (xhr, status)
          {
            ...
          }
        });

The username and password here are credentials needed for accessing non-public information on the Databank repository server.  We find this works fine with Firefox, but when accessing some Databank services using Safari (and possibly IE) we get HTTP 403 Forbidden responses, despite the fact that we provide correct credentials.

We diagnosed the problem using Wireshark to monitor the HTTP protocol exchanges.  It is worth noting that Wireshark can trace encrypted HTTPS traffic if a copy of the server private key is provided.  A summary of our investigations is at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_with_Safari_and_IE.

What we observed was that when credentials are supplied in the Ajax call, Firefox always includes an appropriate HTTP Authorization (sic) header. Safari, on the other hand, does not initially include this, but instead re-sends the request with an Authorization header in response to an HTTP 401 Unauthorized status. Both behaviours are correct within the HTTP specification. Our problem was caused by the Databank service responding to a request without an Authorization header with a 403 rather than a 401 response. A 403 response explicitly indicates that re-issuing the request with credentials will not make any difference, which in our case was simply not true.
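The status-code rule the server should follow can be sketched as below. This is hypothetical Javascript for illustration only, not Databank's actual implementation (Databank is a separate server-side system); the function name and realm string are made up.

```javascript
// Sketch of the rule discussed above: a request with missing or wrong
// credentials should get 401 plus a WWW-Authenticate challenge, inviting
// the client (e.g. Safari) to re-send the request with an Authorization
// header. 403 should be reserved for cases where credentials genuinely
// will not help. Hypothetical code, not Databank's actual implementation.
function authStatus(authorizationHeader, credentialsValid) {
    if (!authorizationHeader || !credentialsValid) {
        // Missing or invalid credentials: challenge, allowing a retry
        return { status: 401,
                 headers: { "WWW-Authenticate": 'Basic realm="databank"' } };
    }
    return { status: 200, headers: {} };
}
```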

There is a separate issue about whether we actually need to provide credentials in the Ajax call: elsewhere in our system, we have found that the browser (well, Firefox, anyway) will helpfully pop up a credentials box when an Ajax request needs authentication. This clearly depends on getting a 401 response to the original request, so it is something to re-test once the Databank server is fixed.

Tuesday, 8 February 2011

Reading RDF/XML in Internet Explorer with rdfQuery

We've just spent the better part of two days tracking down a stupid bug in Internet Explorer.

Under the guise of providing better security, Internet Explorer will not recognize as XML any MIME type other than text/xml or application/xml, and then only when the URI (or Content-disposition header filename) ends with .xml [1].  (I say guise of better security, because a server or intercept that is determined to falsely label XML data can do so in any case: refusing to believe the server's content-type when the data properly conforms to that type does not help;  maybe what they are really protecting against is Windows' flawed model of using the filename pattern to determine how to open a file.)

In our case, we use jQuery to request XML data, and pass the resulting jQuery XML object to rdfQuery to build a local RDF "databank" from which metadata can be extracted.  On Firefox and Safari, this works just fine.  But on Internet Explorer it fails with a "parseerror", which is generated by jQuery.ajax when the retrieved data does not match the requested xml type.

Fortunately, rdfQuery databank.load is also capable of parsing RDF from plain text as well as from a parsed XML document structure.  So the fix is simple, albeit not immediately obvious: when performing the jQuery.ajax operation, request text rather than XML data.  For example:

jQuery.ajax({
        type:         "GET",
        url:          "/admiral-test/datasets/"+datasetName,
        username:     "...",
        password:     "...",
        dataType:     "text",    // To work on IE, NOT "xml"!
        cache:        false,
        beforeSend:   function (xhr)
          {
            xhr.setRequestHeader("Accept", "application/rdf+xml");
          },
        success:      function (data, status, xhr)
          {
            var databank = jQuery.rdf.databank();
            databank.load(data);
            ...
          },
        error:        function (xhr, status)
          {
            ...
          }
        });

Sigh!

[1] http://technet.microsoft.com/en-us/library/cc787872(WS.10).aspx


Saturday, 28 August 2010

WissKi project for scientific collaboration and data sharing

As part of my CLAROS-related activity, I've been taking a poke around the WissKi project (http://www.wiss-ki.eu/), which is a German-funded, Drupal-based collaboration platform for scientific research and data sharing, and which also uses CIDOC-CRM as a base ontology.

Generally, this looks like an interesting project and I wonder if we shouldn't be looking to establish links with other data management work in Oxford and beyond. I have been asked to attend a WissKi meeting in September, so it will be interesting to see what common themes we can find.

Among other things, they have assembled a couple of useful ontology-related lists:

Monday, 23 August 2010

Gridworks for data cleaning?

I've noticed a fair buzz recently from open government data people about Gridworks, and specifically this blog post from Jeni Tennison:
http://www.jenitennison.com/blog/node/145
I'm reminded of some problems faced publishing the FlyWeb data (http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project), and also of some discussions with Alistair Miles about tooling for cleaning up Malariagen data (http://www.malariagen.net/).

Unsurprisingly, similar problems appear to be faced in publishing government data as open linked data, and the solution finding favour there is Gridworks. If it works for them, then I figure it should also work for some of the research data we are trying to deal with.  I'm thinking this is something we should look to explore in later phases of the ADMIRAL project, under the broad heading of building more formal structures around raw data (WP6).

Tuesday, 6 July 2010

SWORD white paper: relevant to ADMIRAL?

I've just read through a white paper about directions for the SWORD deposit protocol: http://sword2depositlifecycle.jiscpress.org/

I recognize here many of the discussion points we've been having about the ADMIRAL submission API to the library service's RDF Databank:

  • submitting datasets as packages of files
  • selective updating within dataset packages
  • accessing manifest and content
  • accessing metadata about the package
  • etc.
I'm not advocating at this stage that we should be trying to track the SWORD work, but I do think we should try to ensure that nothing prevents us from creating a full SWORD interface to RDF Databank at some stage in the future.

Friday, 2 July 2010

RDF implementation for Shuffl, via RDF-in-JSON

I'm throwing my hat into the RDF-in-JSON ring, planning an implementation of RDF/XML serialization for Shuffl workspaces and cards using rdfQuery [1] and JRON [2].  My initial design notes are at http://code.google.com/p/shuffl/wiki/JRON_implementation_notes.

  1. http://code.google.com/p/rdfquery/
  2. http://decentralyze.com/2010/06/04/from-json-to-rdf-in-six-easy-steps-with-jron/

Friday, 7 May 2010

mod_dav, do we have a problem?

Tracking down a strange bug this morning, using Javascript code in Firefox+Firebug against Apache 2.2 and mod_dav. My test code does a PUT, followed soon after by a HEAD to the same URI (part of logic that tests whether the file just created actually exists). But the HEAD request returns inconsistent results: sometimes 404, sometimes not, running the same test suite with no other changes. I did manage to see the incorrect sequence in a Wireshark trace, so I think that lets Firefox off the hook here.

A simple, but dubious, workaround is to put a 100ms delay in the test suite after initially creating the test fixture data; now all tests run fine, repeatedly.
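A less fragile alternative to a fixed delay would be to poll until the resource check succeeds, with a bounded number of retries. Here is a sketch of such a helper; the check callback is a hypothetical stand-in for the actual HEAD request, and this is not the code in our test suite.

```javascript
// Sketch: instead of a fixed delay, poll until a check succeeds or the
// retry budget is exhausted. `check(callback)` is a hypothetical function
// that calls back with true when the resource exists (for instance, a
// HEAD request that returns 200). Not our actual test-suite code.
function waitFor(check, interval, maxTries, done) {
    var tries = 0;
    function attempt() {
        check(function (ok) {
            if (ok) {
                done(true);                      // resource now visible
            } else if (++tries >= maxTries) {
                done(false);                     // give up after maxTries
            } else {
                setTimeout(attempt, interval);   // try again shortly
            }
        });
    }
    attempt();
}
```

In the test fixture this would replace the blind 100ms sleep with "wait until the PUT resource is actually visible, or fail loudly".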

I still find that Firefox can be a bit inconsistent in the time it takes to do things (garbage collection?), so some tests time out occasionally, but I can live with this.

Using: MacOS 10.5 and 10.6, Apache 2.2 (XAMPP distribution), Firefox 3.5.7, jQuery 1.4.2

Tuesday, 27 April 2010

Listing a directory using WebDAV

I found it surprisingly hard to find this simple recipe on the web, so thought I'd document it here. The difficulty of achieving this with AtomPub has been one factor holding back wider use and further development of Shuffl, so hopefully that will be changing.

To use WebDAV to list the contents of a directory, issue an HTTP request like this:

PROPFIND /webdav/ HTTP/1.1
Host: localhost
Depth: 1

<?xml version="1.0"?>
<a:propfind xmlns:a="DAV:">
  <a:prop><a:resourcetype/></a:prop>
</a:propfind>

Or, using curl, a Linux shell command like this:

curl -i -X PROPFIND http://localhost/webdav/ --upload-file - -H "Depth: 1" <<end
<?xml version="1.0"?>
<a:propfind xmlns:a="DAV:">
<a:prop><a:resourcetype/></a:prop>
</a:propfind>
end

The response is an XML file like this:

HTTP/1.1 207 Multi-Status
Date: Tue, 27 Apr 2010 09:38:30 GMT
Server: Apache/2.2.14 (Unix) DAV/2 mod_ssl/2.2.14 OpenSSL/0.9.8l PHP/5.3.1 mod_perl/2.0.4 Perl/v5.10.1
Content-Length: 706
Content-Type: text/xml; charset="utf-8"

<?xml version="1.0" encoding="utf-8"?>
<D:multistatus xmlns:D="DAV:" xmlns:ns0="DAV:">

<D:response xmlns:lp1="DAV:">
  <D:href>/webdav/</D:href>
  <D:propstat>
    <D:prop>
      <lp1:resourcetype><D:collection/></lp1:resourcetype>
    </D:prop>
    <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat>
</D:response>

<D:response xmlns:lp1="DAV:">
  <D:href>/webdav/README</D:href>
  <D:propstat>
    <D:prop>
      <lp1:resourcetype/>
    </D:prop>
    <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat>
</D:response>

<D:response xmlns:lp1="DAV:">
  <D:href>/webdav/shuffltest/</D:href>
  <D:propstat>
    <D:prop>
      <lp1:resourcetype><D:collection/></lp1:resourcetype>
    </D:prop>
    <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat>
</D:response>

</D:multistatus>

There, that was easy!
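For a client, the useful content of that Multi-Status response is the list of hrefs and whether each one is a collection (i.e. a directory). A rough Javascript sketch of extracting that is below; it is regex-based purely for illustration, assumes the D: prefix used in Apache's responses, and a real client should use a proper XML parser with namespace support instead.

```javascript
// Illustrative sketch only: extract hrefs and collection flags from a
// PROPFIND 207 Multi-Status body. Assumes the "D:" namespace prefix that
// Apache mod_dav happens to use; a real client should parse the XML
// properly and resolve namespaces rather than pattern-match prefixes.
function listEntries(multistatusXml) {
    var entries = [];
    var responseRe = /<D:response[\s\S]*?<\/D:response>/g;
    var m;
    while ((m = responseRe.exec(multistatusXml)) !== null) {
        var block = m[0];
        var href = (block.match(/<D:href>([^<]*)<\/D:href>/) || [])[1];
        // A collection (directory) is marked by <D:collection/> in its
        // resourcetype property; plain files have an empty resourcetype.
        var isCollection = /<D:collection\/>/.test(block);
        entries.push({ href: href, collection: isCollection });
    }
    return entries;
}
```

Applied to the response shown above, this would yield /webdav/ and /webdav/shuffltest/ as collections and /webdav/README as a plain resource.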

Tuesday, 30 March 2010

First attempt at a security model for ADMIRAL

As we are offering to safeguard real users' research data, I thought we should exercise "due diligence" to ensure we are doing so in a reasonable fashion.  To this end, I have been working on a security model for the ADMIRAL data stores.

This is new territory for me, and I'm fairly sure there are many things that I have overlooked, failed to properly think through, or just got plain wrong.  But, like all the ADMIRAL working documents, it's public and open to review, which in this case I would eagerly welcome.

The security model is documented at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_LSDS_security_model.

I fully expect to have to revisit this over the course of the project, as requirements are developed and concerns are identified.  I hope that by starting now with an imperfect model, we'll have plenty of time to clean it up and make it fit for purpose.

Thursday, 25 March 2010

Reviewing new survey returns - "steady as she goes"

We held a brief meeting to review some additional data usage survey returns from the Behaviour group. Notes are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100325_Data_Surveys_Review.

There is nothing here to suggest any change in our main priorities, but some interest was noted for automatic versioning of data and data visualization.

Meanwhile, we're making progress on some tricky access control configuration issues to meet specific requirements from the Silk group, and are learning to use Linux ACLs to meet the requirements.  Some notes about how this is done are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_ACL_file_access_control, but until we have a full test suite in place, this remains work in progress.

Friday, 19 March 2010

Sheer curation as "curation by addition"

We are finally getting stuck into selecting metadata for ADMIRAL data capture for preservation, via Databank (http://databank.ouls.ox.ac.uk/). This blog post by Ben O'Steen is most helpful:
http://oxfordrepo.blogspot.com/2008/10/modelling-and-storing-phonetics.html:
"... the way I believe is best for this type of data is to capture and to curate by addition. Rather than try to get systems to learn the individual ways that researchers will store their stuff, we need to capture whatever they give us and, initially, present that to end-users. In other words, not to sweat it that the data we've put out there has a very narrow userbase, as the act of curation and preservation takes time."
I think this very nicely articulates the tone for the ADMIRAL project as we set out to curate our research partners' data.
As I write this, we've just had a follow-up meeting with one of our research group partners, CH. It is very interesting to note that what we're aiming to offer initially duplicates a facility the research group have already provisioned for themselves (shared filestore), but with just enough additional capability to be useful (automatic daily backup); in this sense we really are adding small capabilities to researchers' existing practices. Capturing elements from this and moving them to Databank should prove to be another small addition.
Other related links include:

It will be interesting to see how well we can deploy the "curation by addition".

Friday, 5 February 2010

ADMIRAL public code repository established

We have finally created a public code repository for the ADMIRAL project. It is at http://code.google.com/p/admiral-jiscmrd/.

We had put off establishing a code repository for ADMIRAL until we had a clearer view of the requirements:

  • Should it be public, or private?  If we are using the repository for user-related information, then a public repository is not appropriate.  In any case, changes to Shuffl performed as part of the ADMIRAL project will be maintained within the Shuffl project (http://code.google.com/p/shuffl/).
  • What kind of version management is required? We had in mind to use Mercurial, for reasons mentioned elsewhere (http://imageweb.zoo.ox.ac.uk/wiki/index.php/Mercurial_repository_publication), but did not want to commit in case any specific reasons to use a different versioning system were discovered.

Over the past couple of weeks, we have been investing some effort into configuring Samba, Apache and WebDAV to work with Kerberos authentication (http://imageweb.zoo.ox.ac.uk/wiki/index.php/Zakynthos_Configuration).  It seems that what we are doing with Kerberos is (a) pushing at the boundaries of what is commonly deployed and documented, and (b) something that a number of people have asked about doing, so we felt it was time to put some of the things we are learning into public view.

We've chosen a Google Code project ("admiral-jiscmrd", as "admiral" was already taken) and have decided to go with the original plan of using Mercurial version management.

The initial content of the public code repository is a set of scripts and configuration files we are developing to automate the assembly of a virtual machine image for file sharing and web access with access control linked to a Kerberos SSO authentication infrastructure (in our case, Oxford University's).