Thoughts On Personal Digital Archiving 2017

Last month was our first time participating in the Personal Digital Archiving (PDA) conference at Stanford. We had a great experience meeting archivists, activists and technologists who are all involved in building great digital archiving projects. It was also an opportunity for us to show Kumbu and gather useful feedback. But beyond the immediate feedback, PDA was also an opportunity to learn and reflect on the state of Personal Digital Archiving. In this article I will try to summarize what I’ve learned through the various presentations, posters and conversations that Arnaud, Bart and I had during the conference.

‍

‍

Personal

One of the values of PDA was to be confronted with what the actors in this space perceived as “personal”. It turns out Personal means many different things, and these meanings are often interwoven.

I see three main threads: who is this about? who created it? who is this for?

‍

Who is this about?

Personal can be about collecting data and artefacts about one person. Even there, there is much variation, whether we’re considering people archiving quantified self data (a lot of content about one person), personal finances, a curated and annotated list of artefacts about one famous person (such as famous writers or musicians), or personally made content in the DC library memory lab.

‍

Who created it?

Personal can mean stuff that an individual created (writers, musicians, or everyday people taking screenshots), that a single individual collected (video games, data for research, journalism), or that several individuals have captured as part of a community (community events, protests).

‍

Who is it for?

Personal can also be considered based on who you’re building the archive for: one person (for a diary, a collection, or work), a small group of family and friends, a community (from activists to scholars) or the entire world.

‍

Personal is what you make of it

‍

As Melody Condron put it, Personal is what you make of it. And while I think there is value in defining concepts precisely, I think in the context of PDA, this was just right.

‍

By avoiding too restrictive a definition of Personal, the conference opened up the possibility of exploring a wider variety of uses and bringing in ideas from many different fields, to everyone’s benefit.

‍

Digital

What is digital, really? In my presentation about Kumbu, I argued that we could differentiate between physical, digital, “born digital” and “born mobile” content. I also naïvely assumed that for physical and digital content, the problems were mostly solved. I guess that’s what happens when you’re solely focused on one content category.

‍

Transforming physical content into digital content is not a solved problem.

Many people still struggle with media format transition, from paper to digital, or from super-8 movies to mp4. These operations take money, time and know-how, and the means to do them are still not as widespread as one might suppose. I found the initiative of the Washington Library of opening a lab and allowing anyone to book an appointment to use their digitization devices brilliant: it empowers people, fulfills the community role of a library, and, with people’s consent, builds a common archive of personal media.

‍

Accessing digital and born digital content raises so many issues

Only a limited subset of born digital content is in a readily accessible format. This is usually the case for text documents, photos, and sometimes videos. But as many discovered when processing computer-based archives, a lot of specialized tools use proprietary formats, with few viewers available on the market.

‍

I was (happily) surprised to see emulation used in a widespread fashion, from games made accessible through mobile arcade machines to entire PCs, so that researchers could explore Salman Rushdie’s archive in the context of his computer. This raises many questions though, from the practical (how should archivists be trained to discover and use specialized software?) to the legal (I know that the Internet Archive and libraries have copyright exemptions, but how the public uses said software is bound to run into unforeseen licensing issues, especially as we move towards server-bound apps).

For web content, it was encouraging to see more effort put towards making great WARC-aware tools. Both WAIL and WebRecorder were presented at PDA, and are worth keeping an eye on as they are now progressing faster than ever.

Finally, as was made obvious during the ePADD presentation, you sometimes have to alter or redact content before making it available to the public. But the volume of an e-mail archive can make that difficult. I found ePADD’s use of named entity extraction very well done; it could easily rival paid software. Overall, I must say I was impressed by the quality of the software on display at PDA.

‍

Classification algorithms are dull and make everything boring

‍

Do you want your wedding to be like anyone else’s wedding? Then by all means let Apple’s, Google’s and Facebook’s algorithms classify it.

‍

Do you want your archives to reflect all the societal biases of the time? Easy, just let machine learning do the work. There’s a lot of excitement about machine learning and image and concept recognition. While these add to the indexing and description process, they do have several drawbacks, from making everything similar to mislabeling (which can be either hilarious or horrifically racist). I was happy to see this subject touched upon during the conference: I think it is one of the areas where we went too fast with the use of technology, and there’s a need to pause and think before we take this further.

‍

The key insights for me are:

Born digital content requires different thinking about “what is an artefact” and at what level things should be saved.

Accessing digital content can be contextualized using emulation, which is sometimes useful.

The practices around these topics are being elaborated right now, by people like X and Y. They are what they are because of a mix of transposed archiving know-how, the availability of tools (and sometimes IT people), and the perceived future value of the source material. PDA was a good snapshot of how the world does these things, and I really encourage everyone to watch this space, and what these people are doing, because all of this is going to percolate into our lives in various ways.

‍

Archiving

At least we should all agree on what archiving is, right? Well, yes and no. Archives serve many purposes and those were well represented at PDA.

‍

What’s an archive for, really?

PDA offered a diverse view of what people had in mind when they mentioned “archives”. For some, this was about freeing the data from quantified self and apps so that it could be repurposed and re-used in other ways; for others it was to ensure compliance, or to preserve something before it disappears. MyData made the case for extracting personal data and putting it into its own archive. I liked seeing this variety of views on display.

‍

Curating born digital content requires creative thinking

When looking at a born digital archive donation, such as the Rushdie Archive, where they received entire computers, what do you consider the lowest-level archival artefact? Do you archive the whole computer? Do you describe each file? What about complex, multi-level formats (say a website, or a wiki)? All these questions are currently answered mostly on an ad-hoc basis, balancing the volume of content to describe, the availability of file format and application expertise, and the perceived future value of the content. All those are hard challenges, and I sense they will become even more complex as we move on to tackle social and born mobile content. In some respects, it is a technical debate, and I was grateful to see many archiving departments and libraries working in conjunction with IT for operational purposes. But besides the technical parts, I think the archiving practice is pretty well equipped to deal with it, assuming people can take a step back and be realistic about how little we’ll ever know about the perceived value of these archives in the future.

‍

PDA was light on technology, and this was a good thing

There were some interesting archive-tech talks, especially by CERN about building an OAIS-compatible “dark archive”, but overall, the conference talks didn’t go too deep into the technologies involved. I think this was a good thing, as looking at actual practice is often more enlightening than posturing about technology (and it’s also more fun). But I wonder if conversations around trust in storage (sometimes blockchain-based), MyData, IPFS, or even GDPR wouldn’t have been interesting to have, at least at a high level. Listening to speakers describe the kind of hardware they were running their archives on woke up the technologist in me I guess :)

‍

Archives, too, are what you make of them

‍

“Archive” as a concept is messy, but to me it covers a set of remarkably similar practices. I think keeping an all-inclusive definition of “Archive” is what makes PDA an interesting conference.

‍

Conference

I really enjoyed PDA. And I think it has many ingredients of a successful conference:

A diverse community, centered around a proximity of practices rather than a particular professional community. It’s a good way to meet interesting and smart people outside of your usual circles.

A single track, with speakers being participants and a willingness to engage beyond their talks. Single track conferences are the best in my opinion, as they are closer to a shared common experience.

A balanced dose of content with a willingness to share more than the successes. Content is still the most important thing to me in a conference. Having been to countless tech conferences that were light on content, I found PDA had a very good balance, managing to stay entertaining while tackling important issues. Speakers were very honest in sharing their journey, talking about successes, but also failures and challenges. Keeping the length to about 20 minutes per talk was just right.

A great setting, with most technical and logistical issues sorted out beforehand. Apart from a few wi-fi glitches at the beginning, I thought this was a very well run conference. And the Stanford campus and area were a wonderful environment.

After PDA, I was inspired and energized, and bursting with new ideas on how to make Kumbu even better than it is now. So I’ll end this with a shout out to the organizers and the amazing PDA community, and see you next year at PDA 2018!

‍

Kumbu at PDA

We were really lucky to have the opportunity to show Kumbu at PDA. Right before the conference, we enabled collection sharing, so here’s a collection of us doing Kumbu stuff at PDA. I’m also sharing a transcript of my presentation here, if you’re interested.

‍

That was a lot of fun! If you were at PDA, or if you’re interested in Personal Digital Archiving, please reach out. We’re always happy to have meaningful conversations about Personal Digital Archiving.

Thanks to Kumbu.