by Erica Titkemeyer: Co-Principal Investigator, Project Director
Since the late 1980s, staff at Wilson Library Special Collections within UNC Libraries at the University of North Carolina at Chapel Hill have worked diligently to migrate and digitize unique audiovisual recordings with the goal of long-term preservation and access. In 2014, funded by a series of Andrew W. Mellon Foundation grants, grant staff began developing a central database that could collect and easily report on relevant data, including metadata that can aid in setting expectations for future digitization initiatives and storage growth. The database, Jitterbug, serves as the authoritative location for information about original analog items, preservation master files and transfers.
When considering digitization timelines, costs and digital storage needs, the two most useful data points are file duration and file size. Using MySQL queries, we are able to examine these data points at the item level, by format, and by collection. Since we only began requiring these two fields in the last three years of digitization, not all audio records report duration and file size. For this reason, the number of audio records examined varies between the two data points.
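As a rough illustration of the kind of per-format query described above, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the `files` table, its column names and the sample rows are all hypothetical; they are not Jitterbug's actual schema or data. The point it shows is that counting each field separately skips empty values, which is why the record totals for duration and file size can differ.

```python
import sqlite3

# Hypothetical schema and made-up rows for illustration only;
# this is not Jitterbug's real table layout or data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (item_id INTEGER, format TEXT, collection TEXT, "
    "duration_seconds INTEGER, size_gb REAL)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
    [
        (1, "audiocassette", "A", 1800, 1.0),
        (1, "audiocassette", "A", 1750, None),  # older record: no file size
        (2, "open reel audiotape", "B", 3600, 2.0),
        (3, "open reel audiotape", "B", None, None),  # neither field recorded
    ],
)


def summarize_by_format(connection):
    """Return per-format record counts and duration statistics.

    COUNT(column) ignores NULLs, so the duration and file-size
    counts are reported separately for each format.
    """
    query = """
        SELECT format,
               COUNT(duration_seconds) AS n_duration,
               COUNT(size_gb)          AS n_size,
               MIN(duration_seconds),
               MAX(duration_seconds),
               AVG(duration_seconds)
        FROM files
        GROUP BY format
        ORDER BY format
    """
    return connection.execute(query).fetchall()


for row in summarize_by_format(conn):
    print(row)
```

Swapping `GROUP BY format` for `GROUP BY collection` would give the same summary by collection.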
Finally, each record entered in our database represents a single file from an item. In most cases the number of files for an audio item correlates with the number of sides on that item (usually two sides for audiocassettes, audiotapes and audio discs), while video has a 1:1 ratio of item to file.
Table 1: Duration data by audio format
58,897 total records report audio duration
Table 2: File size (GB) data by audio format
30,996 total records report audio file size
In looking at both of these tables it is clear that file sizes and durations can vary wildly, making it difficult to predict digitization timelines, costs and storage needs. The longest open reel audiotape file, which runs three hours (side two is equally long, at 02:59:54), is a prime example of the unstandardized field recordings we can receive. These recordings are difficult to transfer because of their inconsistent and slow speeds, but fortunately they do not make up the majority of our transfers.
Since we collect data at the file level, we could use Jitterbug to predict the percentage of double-sided recordings by format or even collection. This information could become helpful if we were to take on an addition to a collection and want to develop estimates for the total transfer hours or digital storage required based on what has already been digitized.
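A minimal sketch of how that percentage might be computed from file-level records follows. It assumes each record can be reduced to an (item id, format) pair and treats any item with two or more files as double-sided; both are simplifying assumptions, and the function name and sample records are made up for illustration.

```python
from collections import Counter, defaultdict


def double_sided_share(file_records):
    """Return {format: percent of items with two or more files}.

    file_records is an iterable of (item_id, format) pairs, one per file.
    Treating "two or more files" as "double-sided" is an assumption.
    """
    files_per_item = Counter()
    format_of_item = {}
    for item_id, fmt in file_records:
        files_per_item[item_id] += 1
        format_of_item[item_id] = fmt

    items_total = defaultdict(int)
    items_multi = defaultdict(int)
    for item_id, n_files in files_per_item.items():
        fmt = format_of_item[item_id]
        items_total[fmt] += 1
        if n_files >= 2:
            items_multi[fmt] += 1

    return {fmt: 100.0 * items_multi[fmt] / items_total[fmt] for fmt in items_total}


# Made-up records: two audiocassettes (one with two sides) and one two-sided disc.
sample = [
    (1, "audiocassette"),
    (1, "audiocassette"),
    (2, "audiocassette"),
    (3, "audio disc"),
    (3, "audio disc"),
]
print(double_sided_share(sample))
```

Multiplying the resulting share by an average side duration for the format would give a first estimate of transfer hours for an incoming addition.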
Table 3: Duration data by video format
Table 4: File size (GB) data by video format
Similar to audio, we have a wide range of durations and file sizes in digitized video recordings. While 6-hour-long VHS tapes are relatively rare, they do appear in our collections. As we continue exploring our data, it would be interesting to see if we can identify specific collections or types of collections that are more likely to contain long play (LP) or extended play (EP) tapes.
Graph 1: ½”, U-Matic and VHS file sizes v. durations
One final direction I considered when pulling data for this post was how we might be able to more accurately estimate file sizes based on video formats. Currently we assume 875 MB per minute for our FFV1-encoded preservation files as a general rule, but that number can clearly change based on the quality of the analog original. After graphing all of the formats by their durations and file sizes, I saw that VHS data could stand in as an average, with a rising slope that cut through the center of all other format slopes (ignoring the LP and EP outliers).
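The 875 MB per minute rule of thumb translates into a one-line estimate. The sketch below assumes decimal units (1 GB = 1,000 MB), and the function name is just an illustrative choice.

```python
MB_PER_MINUTE = 875  # general rule for FFV1 preservation files, per this post


def estimate_size_gb(duration_minutes, mb_per_minute=MB_PER_MINUTE):
    """Estimate preservation file size in GB, assuming 1 GB = 1,000 MB."""
    return duration_minutes * mb_per_minute / 1000


print(estimate_size_gb(30))   # a 30-minute tape
print(estimate_size_gb(360))  # a 6-hour tape
```

By this rule a 30-minute tape comes to about 26 GB and a 6-hour tape to 315 GB, which is the baseline that a given original may fall above or below.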
The two formats that appeared to diverge the most from the VHS slope were ½” open reel video and U-Matic. Based on the ½” we have digitized thus far (all black and white), it’s clear that the resulting files are much smaller than the VHS average. As an example, one 30-minute ½” open reel videotape is 50% of the size of a VHS tape of a similar duration. On the other end of the spectrum are our U-Matic tapes, which are more likely to produce larger file sizes. For example, one 30-minute U-Matic tape produced a file 25% larger than the same VHS mentioned above. Though we don’t always need to account for these minor differences, it could become helpful to have separate MB per minute estimates for formats that have proven over time to fall well below or above the average.
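One way to arrive at separate MB per minute figures is to divide each format's total file size by its total duration. The sketch below does that; the function name is hypothetical, and the sample values are made up, chosen only to mirror the 50% and 25% differences described above rather than taken from our actual records.

```python
from collections import defaultdict


def mb_per_minute_by_format(records):
    """Return {format: total MB / total minutes}.

    records is an iterable of (format, duration_minutes, size_mb) tuples.
    Records with no usable duration are skipped.
    """
    total_mb = defaultdict(float)
    total_minutes = defaultdict(float)
    for fmt, minutes, size_mb in records:
        if minutes > 0:
            total_mb[fmt] += size_mb
            total_minutes[fmt] += minutes
    return {fmt: total_mb[fmt] / total_minutes[fmt] for fmt in total_minutes}


# Illustrative, made-up values (not real measurements).
sample_records = [
    ("VHS", 30, 26250.0),
    ("VHS", 60, 52500.0),
    ("1/2 inch open reel", 30, 13125.0),  # half the size of the 30-minute VHS
    ("U-Matic", 30, 32812.5),             # 25% larger than the 30-minute VHS
]
for fmt, rate in mb_per_minute_by_format(sample_records).items():
    print(f"{fmt}: {rate:.1f} MB/min")
```

Dividing totals rather than averaging per-file rates keeps very short files from skewing the estimate, though LP and EP outliers would still need to be filtered out first.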
To date, we have digitization data for 29,173 audio items and 1,191 video items. While these records do not represent all content digitized, they do represent the bulk of the work that has been completed in the last three years. As we continue to merge legacy data into the database, we can expand our dataset and continue to better predict costs and resources.
Erica Titkemeyer – Co-Principal Investigator, Project Director
Steve Weiss – Co-Principal Investigator, Curator (Southern Folklife Collection)
Anne Wells – Audiovisual Archivist
Sharon Luong – Applications Developer
Andrew Crook – Audiovisual Archives Assistant
Melanie Meents – Audiovisual Archives Assistant
Brian Paulson – Preservation Audio Engineer
Dan Hockstein – Preservation Audio Engineer
John Loy – Preservation Audio Engineer