
Monday, 14 February 2011

Apache configuration for LDAP access control

For some time now, HTTP access to the ADMIRAL file store has been restricted because we were unable to use a combination of Apache Require valid-user, Require user and Require ldap-attribute directives to combine by-username and by-group access control.  For example, we have been unable to realize access control for this scenario:

/data/private        all registered users have read-only access
/data/private/UserA  read/write access for UserA,
                     read-only access for research group leaders,
                     all others denied
/data/private/UserB  read/write access for UserB,
                     read-only access for research group leaders,
                     all others denied 

We had been finding that any attempt to put an access control statement on the /data/private directory resulted in the access control statements for the user private directories being ignored. Also, we were finding that we could not allow UserA to have read/write access to their personal area, while allowing read access to any user designated as a research group leader.

The Apache configuration we had been trying to use looked something like this:

# Default read access to all areas for registered users
<Directory /home/data>
    Require valid-user
</Directory>
# Define access to area for user $username
<Location /data/private/$username>
    Order Deny,Allow
    Allow from all
    <LimitExcept REPORT GET OPTIONS PROPFIND>
      Require ldap-user $username
    </LimitExcept>
    <Limit PROPFIND OPTIONS GET REPORT>
      Require user $username 
      Require ldap-attribute gidNumber=$RGLeaderGID
    </Limit>
</Location>

As this would not work as required, we ended up disabling access to the /data area, leaving users unable to use HTTP to browse between the different user areas, and configuring the research group leader's read-only access with a hard-coded username, making the configuration far more fragile than it needed to be:

# Define access to area for user $username
<Location /data/private/$username>
    Order Deny,Allow
    Allow from all
    <LimitExcept REPORT GET OPTIONS PROPFIND>
      Require user $username
    </LimitExcept>
    <Limit PROPFIND OPTIONS GET REPORT>
      Require user $username $RGLeaderName
    </Limit>
</Location>

We recently realized that Apache recognizes different access control providers, and that access to a given area cannot be controlled by a combination of providers.  There is a default provider, invoked by Require user ... and Require valid-user, and there is an LDAP provider.  Thus, the presence of a matching Require ldap-attribute directive meant that any matching Require user directives were being ignored.
"When no LDAP-specific Require directives are used, authorization is allowed to fall back to other modules as if AuthzLDAPAuthoritative was set to off"
http://httpd.apache.org/docs/2.2/mod/mod_authnz_ldap.html
The overlooked implication here was that if any LDAP-specific Require directive is used, the non-LDAP directives are ignored, even when they apply to a more specific directory. The picture is further muddied by the fact that the non-LDAP authorization handlers have limited access to LDAP values (via an LDAP PAM module?), so that Require user would still respond to LDAP user entries.

Having identified the problem, the fix is easy.  Use Require ldap-user instead of Require user:
  
# Default access to all areas
<Directory /home/data>
    Require ldap-attribute gidNumber=600
    Require ldap-attribute gidNumber=601
    Require ldap-attribute gidNumber=602
</Directory>
<Location /data/shared/$username>
    Order Deny,Allow
    Allow from all
    <LimitExcept REPORT GET OPTIONS PROPFIND>
      Require ldap-user $username
    </LimitExcept>
    <Limit PROPFIND OPTIONS GET REPORT>
      Require ldap-attribute gidNumber=$RGLeaderGID
      Require ldap-attribute gidNumber=$RGMemberGID
    </Limit>
</Location>
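
An alternative that we have not tested would be to keep the mixed Require user and Require ldap-attribute directives, and explicitly set the mod_authnz_ldap directive AuthzLDAPAuthoritative to off so that authorization can fall back from the LDAP provider to the other providers.  A minimal sketch only, with an illustrative username and group number (depending on module ordering, AuthzUserAuthoritative off may also be needed):

# Untested sketch: let authorization fall through from the LDAP
# provider to mod_authz_user when the LDAP check does not match
<Location /data/private/UserA>
    AuthzLDAPAuthoritative off
    <Limit PROPFIND OPTIONS GET REPORT>
      Require user UserA
      Require ldap-attribute gidNumber=600
    </Limit>
</Location>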

There is a lingering issue with the mixed <Directory> and <Location> directives: using <Location> for the /data URI path is not working, and right now we are not sure exactly why, but it seems to be related either to file system symlinks to the data directories, or to the Apache Alias directives that control the mapping to them.

Wednesday, 9 February 2011

Ajax Authentication in Firefox and Safari

We've recently been tripped up by a difference between the ways that Firefox and Safari handle authentication in Ajax requests.  We have a jQuery Ajax call whose essential elements are like this:

jQuery.ajax({
        type:         "GET",
        url:          "/admiral-test/datasets/"+datasetName,
        username:     username,
        password:     password,
        cache:        false,
        success:      function (data, status, xhr)
          {
            ...
          },
        error:        function (xhr, status)
          {
            ...
          }
        });

The username and password here are credentials needed for accessing non-public information on the Databank repository server.  We find this works fine with Firefox, but when accessing some Databank services using Safari (and possibly IE) we get HTTP 403 Forbidden responses, despite the fact that we provide correct credentials.

We diagnosed the problem using Wireshark to monitor the HTTP protocol exchanges.  It is worth noting that Wireshark can trace encrypted HTTPS traffic if a copy of the server private key is provided.  A summary of our investigations is at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_with_Safari_and_IE.

What we observed was that when credentials are supplied in the Ajax call, Firefox always includes an appropriate HTTP Authorization (sic) header. Safari, on the other hand, does not initially include it, but instead re-sends the request with an Authorization header in response to an HTTP 401 Unauthorized status return.  Both behaviours are correct within the HTTP specification. Our problem was caused by the fact that the Databank service was responding to requests without an Authorization header with a 403 rather than a 401 response.  A 403 response explicitly indicates that re-issuing the request with credentials will not make any difference (in our case, incorrectly).

There is a separate question of whether we actually need to provide credentials in the Ajax call: in other parts of our system, we have found that the browser (well, Firefox, anyway) will intelligently pop up a credentials box if an Ajax request needs authentication. This clearly depends on getting a 401 response to the original request, so it is something to re-test once the Databank server is fixed.
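
For reference, a minimal sketch of that credential-free variant, using the same illustrative URL as above; it can only work if the server answers the initial unauthenticated request with a 401, so that the browser knows to prompt for credentials:

// Sketch only: no username/password options are given, so the browser's
// own credentials dialog is used; this relies on the server returning
// 401 (not 403) when a request arrives without an Authorization header
jQuery.ajax({
        type:         "GET",
        url:          "/admiral-test/datasets/"+datasetName,
        cache:        false,
        success:      function (data, status, xhr)
          {
            // ... process the response data
          },
        error:        function (xhr, status)
          {
            // ... report failure (e.g. the user cancelled the dialog)
          }
        });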

Monday, 24 January 2011

Data Storage Costs

The ADMIRAL project is creating a local data management facility and testing its use with researchers in the Zoology Department at Oxford University. The facility is built using a standard Linux operating system running in a VMware hosting environment installed within the department.  The environment has a limited amount of local fast SCSI disk storage, but is also capable of using network storage via the iSCSI protocol.

One of the research groups for whom we deployed an ADMIRAL system started using it so enthusiastically that they rapidly filled their allocated space, at which point they stopped using the system without telling us.

With opportune timing, the Zoology Department has recently installed an iSCSI network storage array facility, to which disk storage capacity can be added on an as-needed basis. This facility is to be paid for on a cost recovery basis by projects that use it. We have performed tests to prove that we can use this facility within the ADMIRAL data management framework, and are now waiting for the Silk Group and IT Support to order and install the additional storage capacity so we can deploy an enlarged ADMIRAL server to meet the Silk Group's storage requirements.

We have also been in discussion with other research groups not currently involved with ADMIRAL about storage provision to meet funding body requirements for data sharing. Recently released BBSRC data sharing requirements (http://www.bbsrc.ac.uk/web/FILES/Policies/data-sharing-policy.pdf) stipulate that research data should be made available for sharing for at least 10 years beyond the life of a research project (i.e. a total of 12-14 years for a typical BBSRC-funded project), and the policy allows the cost of this to be added to a research grant proposal. See the section "BBSRC Data Sharing Policy Statement" in their data sharing policy document. This does not require that data be kept online for this period, but considering the cost, attrition rate and time to obsolescence of the alternatives, some kind of online facility would appear to be as cost-effective as any.

The cost of providing the departmental network storage facility has been estimated at about £400 per Terabyte over a 5 year period.  This is estimated on a hardware cost recovery basis, allowing for a normal rate of disk failures over the 5 year period, but not including departmental IT support personnel costs.  In discussions with our IT support team, we estimate that the project+10 year duration of required availability will approximately double the hardware-only cost of meeting such a requirement. Based on these discussions, my current recommendation to Zoology Department researchers bidding to meet these requirements would be to cost data preservation and sharing to meet BBSRC requirements at £1000/TB, assuming that departmental IT support remains committed to operational management of the storage server. I would also recommend that they allow 1 person-day for each month of the project, at appropriate FEC cost, to cover data management support, especially if the research team itself does not have IT server systems expertise: I estimate this would reasonably cover ongoing support of an ADMIRAL-like system for a project.

For comparison, Oxford University Computing Service offers a 5 year data archive facility with multiple offsite tape backups for a Full Economic Cost of about £4000/Terabyte. They do not offer an archive service of longer duration. (It is important to distinguish here between a backup service and an archive service: the same OUCS facility provides a daily backup service, but said backups are destroyed after just a few months of no update or access.)

The university library service has tentatively indicated a slightly higher cost per Terabyte for perpetual storage.  The meaning of "perpetual" here is open to debate, but the intent is to maintain copies of the data in storage for at least several decades, much as historical books and papers are held by the Bodleian Library for the long term.

Tuesday, 27 April 2010

WebDAV and Javascript same-origin violations

We've noticed some strange problems using WebDAV to access a server running on the local development machine (i.e. "localhost"). We're using Ajax code running in Firefox to issue the WebDAV HTTP requests, and an Apache 2.2 server running mod_dav, etc., to service them. We're using a combination of FireBug and Wireshark to monitor HTTP traffic.

The immediate symptom we see is that HTTP requests using methods other than GET or POST (specifically DELETE and MKCOL) are being rejected with access-denied errors without being logged by FireBug. But looking at a Wireshark trace, what we see is an HTTP OPTIONS request and response, with no further HTTP exchange for the requested operation.

What appears to be happening is that the HTTP OPTIONS request is being used to perform "pre-flight" checking of a cross-origin resource sharing request, per http://www.w3.org/TR/access-control/, and the server's response is causing the request to be refused.

This was puzzling for us in two ways:

  1. That a request to localhost was being rejected in this way, and
  2. The use of the cross-origin sharing protocol, which is performed "under the hood" by recent versions of Firefox.

The rejection of localhost requests is not consistent: on a MacBook the request is allowed (still using Firefox and Apache), but on some Ubuntu systems it is refused. When the request is refused, the workaround is to send it explicitly to the fully qualified domain name rather than just "localhost". (This is a bit of a pain, as it means our test cases are not fully portable, and I'm hoping we can later find an Apache configuration option to allow this level of resource sharing.)
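
For the record, one configuration option along those lines (although, as the update below explains, it turned out not to be needed in our case) would be to have Apache emit the cross-origin response headers itself using mod_headers.  This is an untested sketch; the location, origin and method list are illustrative only:

# Untested sketch: requires mod_headers; origin and methods are illustrative
<Location /webdav>
    Header set Access-Control-Allow-Origin "http://localhost"
    Header set Access-Control-Allow-Methods "GET, POST, OPTIONS, PROPFIND, MKCOL, DELETE"
    Header set Access-Control-Allow-Headers "Authorization, Content-Type, Depth"
</Location>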

UPDATE

It turns out that the above "observation" is a complete red herring. We had failed to notice that the web page was being loaded using the full domain name of the host, rather than http://localhost/.... When a URI of the form http://localhost/... is used to load the HTML page that invokes the Javascript code, then WebDAV access to localhost works just as expected.

Wireshark

We've been using Wireshark to help debug and understand protocol flows. I've used Wireshark before, and its predecessor Ethereal, but I've been very impressed at how easy recent versions are to install and use (on Linux and MacOS, at least) for high-level software debugging.

The HTTP protocol decode is really useful, and it handles messy details like re-assembling TCP packets so that protocol units are clearly displayed.

Also, it works very well with the local loopback interface, so it's not necessary to figure out arcane filters to exclude background network traffic when debugging a local client/server interaction.

Under Linux, remember that the libcap utilities used to grant capture privileges are also needed - the Ubuntu package name is libcap2-bin. Under recent versions of Ubuntu, it is also necessary to set appropriate privileges: see http://wiki.wireshark.org/CaptureSetup/CapturePrivileges.

Monday, 18 January 2010

Selecting a platform for ADMIRAL

Part of the past week or so has been spent coming to a (tentative) decision on the basic platform for ADMIRAL data sharing. The requirements are summarized at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_LSDS_requirements_and_survey.

Reviewing the requirements, none of the more exotic options seemed to adequately address all the points. We've also been giving heightened consideration to (a) ongoing supportability of the platform by departmental IT, and (b) allowing users to use their normal SSO credentials for accessing the shared data - this is turning out to be a more important feature for user acceptance than we had originally allowed for. All this directs us towards a platform that consists primarily of very common components:
  • Ubuntu-based Linux server (JeOS)
  • CIFS for local file sharing (mainly because reliable clients are standard in all common desktop operating systems)
  • Apache2+WebDAV for remote file sharing (we might have tried to use WebDAV for all file sharing, but have concerns that it could be awkward to set up in some environments)
  • Apache2+WebDAV to provide the basis for web application access to data, supporting additional services such as annotation and visualization
  • For remote field workers, we plan to experiment with Dropbox to synchronize with the shared file service as and when Internet connections allow.
  • Mercurial will be trialled as an option for providing versioning of research data.  The main advantage of Mercurial over Subversion for this is that it doesn't leave any hidden files in the working directory (e.g., I have used Subversion to version-manage Linux configuration files (such as those in /etc/...), and occasionally find that the hidden .svn directories can cause problems with some system management tools).  Mercurial is also a distributed version management system (unlike Subversion), and might be trialled as an alternative to Dropbox for synchronizing with remote workers.
  • SSH/SFTP as a fallback for access to files and other facilities.  SSH is a protocol that often succeeds where others fail, and can be used to tunnel arbitrary service protocols if necessary.  There are quite easy-to-use (though not seamless) SFTP clients for Windows (e.g. WinSCP), MacOS (e.g. CyberDuck) and Linux (e.g. FileZilla?).
For deployment, I'm currently planning to use Ubuntu-hosted KVM virtualization.  The other obvious choice would be VMware, as that is widely used, but I have found that remote access to a VMware server or similar hosting environment from non-Windows clients can be problematic.  Also, it appears that KVM is well integrated with Ubuntu's cloud computing infrastructure (UEC/Eucalyptus), which is itself API-compatible with Amazon EC2.  This seems to give us a range of deployment options.

For using single-sign-on (SSO) credentials, the Oxford University SSO mechanisms are underpinned by Kerberos.  It seems that all of the key features proposed for use (CIFS, HTTP and SSH) can be configured to use Kerberos authentication, so we should be able to use standard SSO credentials for accessing all the main services.

Daily automatic backup will be provided by installing a standard Tivoli client on the system, which will perform scheduled backup to the University Hierarchical File Storage (HFS) service.  Alternative backup mechanisms could easily be configured in different deployments.

This combination of well-tried software seems to be able to meet all of our initial requirements, and provide a basis for additional features as requirements are identified.

Our aim is to create a system that will continue to be used after the ADMIRAL project completes, so it is important that it be something our departmental IT support can maintain.  To this end, the various choices are subject to review when I can have a proper discussion with our IT support staff, who will have more experience of likely operational problems.

It is worth noting that there are two other data management projects in Oxford with some similar requirements (NeuroHub and EIDCSR); we have arranged to keep in touch, pool resources and adopt a common solution to the extent that makes sense for each project.  The choices indicated here remain subject to review in discussion with these other projects.

Wednesday, 2 December 2009

Initial project planning and reporting framework

The day-to-day ADMIRAL project planning and reporting framework is being set up along the lines used for the Shuffl JISC Rapid Innovation project.  A wiki will be used for the outline project plan and sprint schedule, with separate links to individual sprint plans.  This is available at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_planning.  Other ongoing progress reports and commentary will be provided via this blog.

Posting tags will be used to allow aggregation of reports by interested parties, based on the tags allocated for the JISC rapid innovation projects, but using JISCMRD in place of JISCRI.