
January 2015 saw a major update to the tiny little Raspberry Pi single-board computer, which I have previously written about here.  The Pi now has four times the number of "cores" on the same chip and four times the amount of memory of the original, and is roughly six times as powerful.  Yet it is the same astonishing price, just £25.

This puts the Pi well into the frame for establishing a community digital archive running Free and Open Source archiving software, with some caveats.  Certainly, for a community concerned about the practicalities of establishing a digital archive, the initial hardware cost can now be contained to the cost of a pub meal.  What is more, the power required to run a Pi (or several of them) is extremely low, around 2.5W by my tests, which is roughly what an electronic item on standby consumes.  This makes ongoing electricity costs a rounding error rather than a major cost factor.  Finally, the small size of the Pi makes tucking a community archive into a corner of a community building simple.

The two biggest caveats regarding running a community digital archive on a Pi are data security and multimedia processing, and both need to be understood and worked around.

Regarding data security, the biggest issue is that the Pi uses a small SD or micro-SD card: cheap, solid-state storage, but with the drawbacks that the cards have a limited life, are occasionally slow, and offer limited capacity.  These issues can be worked around by taking regular backups, holding spare SD cards on site in case of failure, and using a USB-connected conventional disk for the data.  One would never use something like a Pi where absolute and constant data reliability was required, but in an archiving context, processes can be developed to ensure this issue is by no means a show-stopper.
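To make that concrete, a minimal sketch of the sort of nightly routine I have in mind is below, assuming the archive data sits on a USB disk mounted at /mnt/archive and the backup disk at /mnt/backup - both paths, and the device name for the SD card, are just illustrative choices, not anything prescribed.

#!/bin/sh
# Sketch of a nightly backup for a Pi-based archive (paths and devices are assumptions)
# Mirror the archive data from the USB data disk to a second USB backup disk
rsync -a --delete /mnt/archive/ /mnt/backup/archive/
# Keep an image of the SD card too, so a failed card can simply be re-flashed
dd if=/dev/mmcblk0 of=/mnt/backup/sdcard.img bs=4M

Run from cron, something of that shape keeps both the data and the operating system card recoverable; the details will of course vary with how the disks are laid out.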

While the new Pi is much more powerful than its predecessor, it remains a relatively underpowered device by modern standards.  When used purely in the archiving context, this is of no great concern, but if the "server" is also used to manipulate multimedia files, a process which is usually processor-intensive, one would not be playing to the strengths of the Pi.  Having said that, the Pi as a multimedia playback device is quite amazing in its capability, and there is a real case for using them in interpretation and display contexts.
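As a small illustration of that display role, playing an exhibit video on the Pi can be as simple as a single command using the omxplayer tool available on Raspbian (the file name here is purely an example):

omxplayer -o hdmi /home/pi/exhibit-video.mp4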

It is also a brilliant platform for prototyping, and for developing transportable archiving.  It is possible to imagine a requirement where on-site archiving capabilities were required.  The Pi makes this very easy.

Does this removal of a barrier make a difference to your own archiving requirements?

Omeka is interesting digital archiving software in that it combines the storage requirement with interpretive and presentational capabilities.  This can be a two-edged sword, but for some requirements, Omeka is attractive.  Omeka is also Free Software and is published under the GNU General Public Licence.  It is much simpler to install than, say, DSpace, in that it is written in PHP, which most servers support easily and which is widely understood.  And I would suggest that simplicity is of real value when running long-lived archives.  Omeka is a good example, along with WordPress, of such an application, often running on a "LAMP" software stack - Linux, Apache web server, MySQL database and PHP.

One of the beauties of Free Software is that it is often possible to alter software specifics, as the same service is provided by other software.   This is important, because it means, for example, that a very busy web site can use better performing software to the same end, or, as may be the case for community archives, a lower specification machine or virtual machine may use software that performs adequately with fewer resources.  One example of this is to replace the Apache2 web server with a more nimble alternative, such as Nginx or Lighttpd.  So the "LAMP" stack becomes "LLMP" or "LNMP". The use case of Omeka running on a little Raspberry Pi springs to mind, and that may be worth an article here later.
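On a Debian-based system, swapping the "A" for an "L" might look roughly like this; the package names are the ones I would expect on a Wheezy-era system, so treat them as assumptions rather than a recipe.

# Install Lighttpd and PHP as a CGI binary, then enable PHP via FastCGI
apt-get install lighttpd php5-cgi
lighty-enable-mod fastcgi fastcgi-php
service lighttpd force-reload

The MySQL and PHP layers stay exactly as they are; only the web server changes.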

But in fact, the way Omeka is written, it effectively requires the LAMP stack, which is unusual for a PHP application.  The reason for this is that it includes a file called .htaccess, a facility that the Apache2 web server uses to instruct the server to behave in particular ways, and which, in Omeka's case, makes the server rewrite some parts of the web address, the URL.  The upshot is that, although there is no technical reason for requiring Apache as the web server, the fact is that it is written with that requirement.

It should be possible to translate Omeka's .htaccess rules into Lighttpd's configuration file, but web references to this don't have anything definitive.  Now I must point out that I have not run this on a fully operational live server, but the following configuration does work with Omeka, so satisfy yourself that it works for you.  Add this to your lighttpd.conf file.

############ Omeka ####
## At last - this works -
## info from https://www.drupal.org/node/43782
## and other sources
url.rewrite-once = (
## Admin pages
  "^/omeka/admin([^.?]*)\?(.*)$" => "/omeka/admin/index.php?q=$1&$2",
  "^/omeka/admin([^.?]*)$" => "/omeka/admin/index.php?q=$1",
## Main pages
  "^/omeka/([^.?]*)\?(.*)$" => "/omeka/index.php?q=$1&$2",
  "^/omeka/([^.?]*)$" => "/omeka/index.php?q=$1",
)
# Ensure ini files and .htaccess are excluded: change the "static-file.exclude-extensions"
# setting higher up this config file to look like this:
## static-file.exclude-extensions = ( ".php", ".pl", ".fcgi", ".ini", ".htaccess" )
### End of Omeka section


I am not a Lighttpd or Omeka expert, but I thought I would post this here in the hope that this helps others who may wish to run Omeka in this way.

Addendum - I see Omeka has retweeted my tweet about this article.  If you can improve this config, please email me so that we can make information on running Omeka under Lighttpd (or Nginx, for that matter) more widely accessible.


One of the constant themes you will find on this web site is the concept of taking a long view regarding running a digital archive.  This truism is sometimes in conflict with the world in which it operates, the technological and digital world, which is driven by constant expectations of "upgrades", "features", "faster" and other implications of improvement.  In the consumer digital world, we are used to the short life spans of technology, but in the world of systems providing particular services, such as a community archive, change for the sake of change is not always welcome.

But we must live in the real world, and the reality is that, after a while, the developers no longer wish to support older systems.  This is fully understandable; if you have spent time improving your software and solving bugs, you don't want to have to deal with those bugs in older versions when you have already fixed them in newer versions.  The Free Software world does give you the option of providing your own support, so you are not bound by the services your supplier wishes to offer.  But sooner or later, part of the longevity equation means keeping reasonably up to date.

In the case of the Assynt Community Digital Archive, we received an email at the beginning of 2014 pointing out that the version of DSpace that we were running was considered to be at the end of its life, and it was recommended that we upgrade.  We chose not to do that at the time for a variety of reasons, but summer is a good time to work on the systems, as in communities like ours most of the voluntary effort on the Archive itself happens in the slower-paced winter months.

One of the beauties of the way in which the Assynt Archive is implemented is that it uses the concept of virtualisation.  This type of technology, and the reasons for it suiting a community project so well, are explained elsewhere on this website.  Virtualisation allows one to run an entire system independently of the physical hardware that underlies it.  In the case of doing an upgrade, this means that it was possible to take a copy of the entire virtual machine and work on that, such that the live Archive was not in any way at risk as part of the process.  It also means that, as you go through a complex update procedure, you can take "snapshots" along the way, so that any oopsies do not mean hours of wasted effort.
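As a rough sketch of what that looks like with KVM managed through libvirt's virsh and virt-clone tools (the domain and snapshot names are purely illustrative, and your own virtualisation setup may well be managed differently):

# Clone the production VM so the upgrade work happens on a copy
virt-clone --original dspace-vm --name dspace-vm-upgrade --auto-clone
# Take a snapshot before each risky step, so a mistake is a rollback rather than a rebuild
virsh snapshot-create-as dspace-vm-upgrade pre-1.8-upgrade "before the 1.7.2 to 1.8 step"
# Roll back if something goes wrong
virsh snapshot-revert dspace-vm-upgrade pre-1.8-upgrade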

An update to something like DSpace is not always easy for lay people to understand.  Updating systems software such as DSpace is not like updating productivity software on a laptop or desktop, where it's a case of inserting a CD or downloading a zip file and clicking "Setup.exe" or an "Installer" icon.  DSpace needs a runtime framework and a build framework, which consist of the Java runtime system and various other components.  In addition, it needs an industrial-strength database.  So it is the type of process that only suitably skilled people should undertake.
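To give a flavour of what those components amount to on a Debian system, the prerequisites boil down to something like the line below; the exact package names depend on the release, so these Wheezy-era names are an assumption.

# Java runtime, build tools, servlet container and database for a DSpace host
apt-get install openjdk-7-jdk maven ant tomcat7 postgresql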

Another early design decision was to do the least amount of customisation possible, ideally restricting the customisation to a logo and naming.  This pays dividends when it comes to upgrading.  It means simply following the processes outlined for the upgrade: download and unpack the new version, build it using the supplied tools, make the required changes to the database, deploy the new system and start it all up.  The bits of customisation, if they are restricted to the minimum, need not affect that ideal process too much.
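In shell terms, each of those steps has roughly the following shape.  This is only a sketch of the general pattern, not the documented procedure for any particular DSpace version, and the file and directory names vary from release to release.

# Unpack the new release and build it with Maven (names are illustrative)
tar xzf dspace-4.1-src-release.tar.gz
cd dspace-4.1-src-release/dspace
mvn package
# Apply the upgrade to the existing installation using the supplied Ant target
cd target/dspace-4.1-build
ant update
# Copy or link the freshly built webapps into Tomcat, then restart it
service tomcat7 restart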

In our case, though, we needed to go through upgrades from version 1.7.2 to version 4.1.  As this is not advisable in one step, it meant carrying out the upgrade process from version 1.7.2 to version 1.8, then 1.8 to 3.0, then 3.0 to 4.1.  At each step, it is necessary to carry out a battery of tests to ensure that each step is working.  The skills involved are a mixture of Linux/Unix skills, some Java development skills, PostgreSQL database skills and some experience as to how these types of things work.  Version 4.1 also required updates to the deployment system, Tomcat, as well as the build mechanism, Maven, and ideally to the database, PostgreSQL.  This meant that one step was also to upgrade the operating system running the virtual machine from Debian 6, "squeeze", to Debian 7, "wheezy".  Fortunately, this is a well documented and bullet-proof procedure.  But with all that work, we are now.... I nearly said "future-proof," but maybe immediate-future-proof would be more accurate - until the end of 2016 or maybe 2017 anyway, when we expect the next steps to be very similar to these.
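For completeness, the operating system step follows the standard Debian pattern, something along these lines - with the usual caveat that the release notes should be read first.

# Point the package sources at the new release, then upgrade in two stages
sed -i 's/squeeze/wheezy/g' /etc/apt/sources.list
apt-get update
apt-get upgrade
apt-get dist-upgrade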

The virtual machine can then be transferred back to the live Archive, and it will magically be running the new version.

At the start of the Assynt Community Digital Archive project, there were a lot of unknowns.  Among these were the likely initial disk capacity requirements and how to implement them.  It quickly became clear that the Archive would require more than just a machine in the corner, as it had to run a fullish range of network services to be long lasting.  Among these were directory services, email, deployment (web) services, security services and remote access capabilities.  These would be deployed as virtualised services with considerable separation of services, rather than simply running a single server, which would have been possible.  The main reason for this was resilience and the opportunity to make changes to one sub-system without affecting the others, a principle which has worked well for me in the past.  It's the type of design decision that only experience of what it takes to run services after the initial installation can bring.

The possibility of greater complexity than is ideal therefore raised its head, and that affected the choice of hardware.  Rather than go for a NAS storage option, the choice was made for an initial local storage RAID system on the main server.  This was purely an attempt to lessen the number of systems running on the network, and therefore an attempt to reduce complexity.  The cost differences were not that great.  The choice of hardware RAID, though, was a bad one.

The server itself had to be a name brand, and as I had had good success with IBM's x86 server range, we plumped for an x3200 running SATA disks on IBM's M1015 RAID card, an OEM version of an LSI RAID controller.  This was the first time I had used a SATA RAID controller, previous experience being SCSI-based.

The first problem was that Debian Squeeze, in 2010, did not support the controller as part of a standard installation.  For reasons stated above, I did not want a special arrangement to shoe-horn Squeeze onto the machine, but Red Hat based distributions worked well enough.  While previously CentOS was the Red Hat clone to go for, at that time a lack of resources had left them without updates for quite a while, and their future was unclear.  Scientific Linux, though, the Red Hat clone developed and supported by CERN, Fermilab and others, looked as though it was a good option, and installation was a breeze.  All, that is, except for one compromise, which was that Red Hat and their clones do not support JFS as standard.  We'll leave the file system choice for another post.

Having installed the base system as a KVM virtual machine host, performance testing of the disks threw up horrors.  Write speeds were pitiful, and every now and again the whole machine would simply stop responding for minutes at a time.  To cut a long googling story short, the issue is that the basic RAID card has no onboard cache, let alone a battery backup for one, and the effect on write speeds in particular is quite astonishing.  It's hard to give an estimate of the speeds at that time, because they varied so wildly, but initial system caching was lightning fast, while subsequent speeds dropped to zero at times.
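The kind of quick-and-dirty test I mean is nothing more sophisticated than writing a large file and forcing it through to the disks, along these lines; the path is just an illustrative mount point for the array.

# Write 1GB and flush it to disk, so the page cache doesn't flatter the result
dd if=/dev/zero of=/srv/vmstore/ddtest bs=1M count=1024 conv=fdatasync
rm /srv/vmstore/ddtest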

Then the disks started failing.  Again, let's cut the long story short.  Three disks, all Seagates with the same manufacturing date, failed.  IBM only support SATA disks for a year, though the Seagate version of the same OEM disks is warrantied for 5 years.  These two events, the pitiful performance and the disk failures, left a strong impression and have dented my confidence in lower-end IBM kit.  By the time the server went dead, when the disks finally failed lemming-like, I had reason to bless the decision to implement as virtual machines, and the system could be transferred to desktop machines running a KVM host within a few hours.  Again, the recovery process deserves its own post.

The kitty was also bare, as is to be expected with community projects which do not generate an income, so the only thing to be done was to be creative.  This entailed sacrificing four of the external removable backup disks, which contained 2TB Western Digital Caviar disks, and cannibalising them for the IBM.

By this time (January 2013) Debian Wheezy was maturing, and as Debian is used on all the VMs, it would be nicer if Debian was also running on the bare metal.  This time, Debian detected the RAID controller, allowing implementation of JFS.  Wheezy also recognised an EFI based system and installed itself accordingly.  The RAID was created as three virtual disks, one small one for the Debian plus KVM system, one 1TB partition for the virtual machine containers and one 2.6TB partition for the actual archive data.

Performance under Wheezy and JFS was much more consistent, but still absolutely dreadful.  I cannot believe IBM sell or sold the M1015 as a RAID5 controller without the additional hardware that apparently allows it to perform.  On testing, though, if the JFS partitions are mounted with the "sync" option, the performance stops fluctuating wildly once the system resources are used up, BUT performance then peaks at around 4.5MB/s, a good 10 times slower than it ought to be.  It is therefore a toss-up as to whether to allow cached performance for the usual smaller writes, while stunning the machine when larger writes are necessary, or to favour consistent operation over occasional flashes of acceptable speed.
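For anyone wanting to try the same trade-off, the "sync" behaviour is just a mount option on the data partitions, so an /etc/fstab entry of roughly this shape does it; the device and mount point here are assumptions, and will be whatever your own layout uses.

# Archive data partition on the RAID, mounted synchronously: consistency over speed
/dev/sdb1   /srv/archive   jfs   defaults,sync   0   2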

The message is simple - don't buy an M1015 if you want to write to the disks at anything other than glacial pace.

It has to be said that, in practice, this isn't a killer issue, but more of an occasional hassle.  However, it does feel as though it is something hanging over one's head, and is an unnecessary distraction.

The one alternative, which is attractive, is to use software RAID.  But that then defeats the point of having a nice managed RAID system, and when the disks couldn't be trusted, it sounded like a daft idea.  It is also necessary to boot from a separate disk, of course, further negating redundancy.  The portability of the overall system may allow testing this at some point in the future though.
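Should that test ever happen, Linux software RAID needs nothing more exotic than mdadm; a rough sketch, assuming four data disks presented individually by the controller, with illustrative device names, would be:

# Build a four-disk RAID5 array in software, format it as JFS and mount it
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.jfs /dev/md0
mount /dev/md0 /srv/archive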