Tuesday, 30 March 2010

First attempt at a security model for ADMIRAL

As we are offering to safeguard real users' research data, I thought we should exercise "due diligence" to check that we are doing so in a reasonable fashion.  To this end, I have been working on a security model for the ADMIRAL data stores.

This is new territory for me, and I'm fairly sure there are many things that I have overlooked, failed to properly think through, or just got plain wrong.  But, like all the ADMIRAL working documents, it's public and open to review, which in this case I would eagerly welcome.

The security model is documented at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_LSDS_security_model.

I fully expect to have to revisit this over the course of the project, as requirements are developed and concerns are identified.  I hope that by starting now with an imperfect model, we'll have plenty of time to clean it up and make it fit for purpose.

Thursday, 25 March 2010

Reviewing new survey returns - "steady as she goes"

We held a brief meeting to review some additional data usage survey returns from the Behaviour group. Notes are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100325_Data_Surveys_Review.

There is nothing here to suggest any change in our main priorities, but some interest was noted for automatic versioning of data and data visualization.

Meanwhile, we're making progress on some tricky access control configuration issues to meet specific requirements from the Silk group, and are learning to use Linux ACLs to do so.  Some notes about how this is done are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_ACL_file_access_control, but until we have a full test suite in place, this remains work in progress.
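To give a flavour of what Linux ACL configuration involves, here is a minimal sketch in Python that assembles a setfacl command line. It is not our actual configuration: the user name, group name and directory path are hypothetical, the helper functions are invented for illustration, and the permission check is deliberately simplified. The sketch only builds the command; it does not run it.

```python
import shlex


def acl_entry(kind, name, perms):
    """Format one ACL entry, e.g. ('u', 'alice', 'rwx') -> 'u:alice:rwx'."""
    if kind not in ("u", "g"):
        raise ValueError("kind must be 'u' (user) or 'g' (group)")
    # Simplified check: exactly three characters, each either its letter or '-'.
    if len(perms) != 3 or any(c not in (letter, "-")
                              for c, letter in zip(perms, "rwx")):
        raise ValueError("perms must look like 'rwx' or 'r-x'")
    return f"{kind}:{name}:{perms}"


def setfacl_command(path, entries, recursive=False, default=False):
    """Build the argument list for a 'setfacl -m' call (does not execute it)."""
    args = ["setfacl"]
    if recursive:
        args.append("-R")   # apply to the directory tree
    if default:
        args.append("-d")   # set default ACLs, inherited by new files
    args += ["-m", ",".join(entries), path]
    return args


# Hypothetical names and path, for illustration only.
entries = [
    acl_entry("u", "researcher1", "rwx"),
    acl_entry("g", "silkgroup", "r-x"),
]
print(shlex.join(setfacl_command("/home/data/shared", entries, recursive=True)))
# prints: setfacl -R -m u:researcher1:rwx,g:silkgroup:r-x /home/data/shared
```

Keeping the command construction separate from its execution means the intended permissions can be reviewed (or unit tested) before anything is changed on a live file system.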

Friday, 19 March 2010

The role of workflows

These are some notes from a discussion with my colleague, Jun Zhao, who has been asked about using our research partners as use-case studies for workflow sharing.

Our immediate response to this request was one of skepticism, based on our belief that none of our research partners would be willing to try using workflow-based tools because we couldn't see that they would gain sufficient benefits to justify the "activation energy" of deploying and learning to use such tools. In the past, our partners have been dismissive of using even very simple systems for which they could not perceive immediate benefits.

This was a somewhat surprising conclusion given the enthusiasm for workflow sharing among other bioinformatics researchers, and also researchers in other disciplines, and we wondered why this might be.

We considered each of our research group partners, covering Drosophila genomics, evolutionary development, animal behaviour, mechanical properties and evolutionary factors affecting silk, and elephant conservation in Africa. We noticed that:

  • each research group used quite manually intensive experimental procedures and data analysis, of which the mechanized data analysis portions were quite a small proportion,
  • the nature of the procedures and analysis techniques used in the different groups was very diverse, with very little opportunity for sharing between them.

This seems to stand in contrast to genetic studies that screen large numbers of samples for varying levels of gene products, or high throughput sequencing looking for significant similarities or differences in the gene sequences of different sample populations. The closest our research partners come to this is the evolutionary development group, who use shotgun gene sequencing approaches to look for interesting gene products, but even here the particular workflows used appear to be highly dependent on the experimental hypothesis being tested.

What conclusions can we draw from this? Mainly, we think, that it would be a mistake for computer scientists and software tool developers to assume that a set of tools that has been found useful by one group of researchers is useful to all groups studying similar natural phenomena. Details of experiment design would appear to be a dominant indicator for the suitability of a particular type of tool.

Sprint 5 plan review

Because of the long duration of this sprint (ADMIRAL Sprint 5, 3.5 weeks), the plan has been reviewed part way through. The main changes are pushing back on diagramming activities, and pushing forwards on file synchronization and access control interface investigations.

See: http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_5.

Sheer curation as "curation by addition"

We are finally getting stuck into selecting metadata for ADMIRAL data capture for preservation, via Databank (http://databank.ouls.ox.ac.uk/). This blog post by Ben O'Steen is most helpful:
http://oxfordrepo.blogspot.com/2008/10/modelling-and-storing-phonetics.html:
"... the way I believe is best for this type of data is to capture and to curate by addition. Rather than try to get systems to learn the individual ways that researchers will store their stuff, we need to capture whatever they give us and, initially, present that to end-users. In other words, not to sweat it that the data we've put out there has a very narrow userbase, as the act of curation and preservation takes time."
I think this very nicely articulates the tone for the ADMIRAL project as we set out to curate our research partners' data.
As I write this, we've just had a follow-up meeting with one of our research group partners, CH. It is very interesting to note that what we're aiming to offer initially duplicates a facility the research group have already provisioned for themselves (shared filestore), but with just enough additional capability to be useful (automatic daily backup), so in this sense we really are adding small capabilities to researchers' existing practices. Capturing elements from this and moving them to Databank should prove to be another small addition.
Other related links include:

It will be interesting to see how well we can put "curation by addition" into practice.

Meeting with Silk Group researcher

Held a meeting today with CH of the Silk Group. This was partly a follow-up from the data surveys, and partly to prepare for our first live LSDS deployment. There were (mercifully) few surprises, the main points noted being:

  • Expectations for usability of the access control interface are set by the LaCie NAS box that the group currently use. For the time being we'll configure the users manually, and later we'll look into a UI for creating and modifying LDAP entries.
  • File sharing with automatic backup is an important advance in functionality over bare NAS.
  • Desirability/priority of looking at automatic harvesting to LSDS is raised by our discussions; we will raise the priority of looking at solutions for this. (We've already tried to deploy the Fascinator "Watcher", but that didn't work for us. Another promising option is iFolder.)

Meeting notes are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100319_meeting_CH

Tuesday, 16 March 2010

ADMIRAL virtualization environment running

Installing VMWare ESXi

Over the past few days, we have received and installed new server hardware, installed VMWare ESXi, and transferred our test file sharing system build to run in the ESXi environment.

VMWare ESXi is a "bare metal" virtualization host; that is, it installs directly onto some server hardware, rather than onto an existing operating system (in contrast to systems like VMWare Server).

Installing and using ESXi turned out to be very easy - much easier than, say, setting up a VMWare Server under Linux. In hindsight, I think this is because ESXi is a dedicated environment, with really no choices to be made, and hence far fewer system components to be configured. This is fortunate, as the VMWare documentation is pretty impenetrable. The biggest hurdle to overcome was the fact that the ESXi system management console software (vSphere) has to be installed on a Windows client, XP service pack 2 or later, which was slightly awkward for us, since we are mostly a Linux and Mac shop these days.


Installing ADMIRAL file sharing for the Silk group

Transferring the virtual machine images from our KVM test environment to ESXi went pretty smoothly. A small change to the script used to run the virtual machine image builder (vmbuilder) allows VMWare disk images to be generated directly. Copying these files to the ESXi system is a slightly fiddly 2-stage process, taking about 20 minutes in total as the disk image file is quite large, at over 600MB.

Getting the system running under ESXi required some rethinking of our original approach. While the pre-built image would boot directly into a newly created VM, the networking would not work in our chosen VM configuration until VMWare tools is installed. This in turn requires that the original system image has Linux kernel development tools installed (Ubuntu kernel-package and linux-headers), which are largely responsible for the large size of the disk image file. With this taken care of, the VMWare tools installation runs very smoothly, and the system can be rebooted with functional networking. (NOTE: ESXi does not support NAT, only bridged networking, so an IP address must be allocated for the new server before networking can be activated.)

The need to install VMWare tools to get networking capability means that our original plan of doing much of the system configuration automatically on first system boot has been dropped in favour of using manually initiated scripts to handle post-boot system configuration. Scripts for configuring certificates, LDAP, Samba and automatic backups require a fair degree of user interaction, but are otherwise quite straightforward.

After all this, our test suite (which checks file access via Samba, file access via WebDAV and direct HTTP access) ran straight away against the new server. For this, having the pyunit-based test suite was a real boon.
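For readers unfamiliar with this style of testing, here is a minimal sketch of what a pyunit (unittest) check of direct HTTP access might look like. It is not taken from the ADMIRAL test suite: it is written for Python 3, the user name, password and file path are hypothetical, and a small stand-in server on the local machine takes the place of the real file sharing server so that the example runs anywhere. The Samba and WebDAV checks are not shown.

```python
import base64
import http.server
import threading
import unittest
import urllib.error
import urllib.request

# Hypothetical credentials, for illustration only.
TEST_USER = "testuser"
TEST_PASS = "secret"


def basic_auth(user, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return "Basic " + token


class StandInHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for the real server: one resource behind HTTP Basic auth."""

    def do_GET(self):
        if self.headers.get("Authorization") != basic_auth(TEST_USER, TEST_PASS):
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="data"')
            self.end_headers()
            return
        body = b"sample data\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep test output quiet


class TestHttpAccess(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Port 0 asks the operating system for any free port.
        cls.server = http.server.HTTPServer(("127.0.0.1", 0), StandInHandler)
        cls.base_url = "http://127.0.0.1:%d" % cls.server.server_address[1]
        threading.Thread(target=cls.server.serve_forever, daemon=True).start()
        # Bypass any proxy settings so requests go straight to the local server.
        cls.opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    @classmethod
    def tearDownClass(cls):
        cls.server.shutdown()
        cls.server.server_close()

    def fetch(self, path, user=None, password=None):
        req = urllib.request.Request(self.base_url + path)
        if user is not None:
            req.add_header("Authorization", basic_auth(user, password))
        return self.opener.open(req, timeout=5)

    def test_authorized_read(self):
        with self.fetch("/data/file.txt", TEST_USER, TEST_PASS) as resp:
            self.assertEqual(resp.status, 200)
            self.assertEqual(resp.read(), b"sample data\n")

    def test_anonymous_denied(self):
        with self.assertRaises(urllib.error.HTTPError) as ctx:
            self.fetch("/data/file.txt")
        self.assertEqual(ctx.exception.code, 401)
```

Saved as a module, this can be run with python -m unittest. Pointing base_url at a real server instead of the stand-in is what lets the same tests be re-run unchanged against a newly built system.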

Currently, we are using Ubuntu 9.10 ("Karmic") for our ADMIRAL server platform. We do intend to update to version 10.04 ("Lucid") when it becomes available, as this version has been designated for "long term support". The LDAP configuration in 9.10 does seem to be something of a work-in-progress, so we fully expect to revisit this work when we come to upgrade the base system. We have also dropped Kerberos from our platform for the time being, because of difficulties getting it to work with a range of common clients. LDAP seems to be a reasonable compromise, as it allows all ADMIRAL facilities to be authenticated and authorized from a common source.

More information about the setup is at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_VMWare_ESXi_notes.

Monday, 8 March 2010

Sprint 5 plan

The ADMIRAL sprint 5 plan has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_5.

Notes from the planning meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_5.

Notes from sprint 4 retrospective meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_4. The summary position for this sprint is:

Good progress was made on a number of fronts, with all but two tasks (i.e. DropBox testing and databank submission metadata requirements) completed to the point of allowing further progress. But getting all the security features to work exactly as intended is still proving tricky. A good start has been made on automated testing of the LSDS features. For the next sprint, we aim to use the progress so far (LDAP for authentication and authorization), focus on a first live deployment, and start to think about basic annotation of datasets.
The value of pair working and group working is showing itself. The web site blitz set out to create a published web site, and that was achieved in a day. Pair working has helped us to get up and running quickly (e.g. setting up the test framework). But GK needs to back off unplanned involvement until asked!
We've also held the first of a series of one-to-one meetings with our research users, notes of which are posted in the project wiki. The main purpose of these meetings has been to clarify and extend the information gleaned from the initial data audit surveys, especially with respect to understanding their requirements regarding data volumes and frequency of access. The meetings have been kept short and sweet, as we don't want the researchers to feel that we're eating into their valuable time whenever we ask for a meeting.