by Forrest Greenwood, Quality Control Specialist, Media Digitization and Preservation Initiative, Indiana University
Let’s say, for the sake of argument, that you have a gigantic computer folder containing over 400 digitized films, each in its own sub-folder. And then let’s say that you want to find out how many of these films are black and white, or how many are silent, or how many were scanned on the fifth of February but not the fourth or the sixth. How would you go about doing that? Ordinarily, you might cross-reference a cataloguing database. But what if that database is still being built or is incomplete?
This is a very real scenario that we have to contend with on the film side of MDPI. Our solution to this problem is to apply some good old-fashioned IT muscle in the form of grep queries and regular expressions, two tools that have been around for the better part of five decades but have yet to enter the popular-computing consciousness. “Global Regular Expression Print,” better known by its acronym grep, is a text-searching tool that was first created for UNIX operating systems in the mid-1970s and is now included in most distributions of UNIX, Linux, and BSD. A regular expression is a string of characters that defines a pattern of text. Put the two together and you have a powerful toolset for searching not just for individual words or phrases, but for highly complex and variable patterns.
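As a minimal sketch of that combination, here is a one-line search using the command-line grep found on Unix-like systems (PowerGREP, discussed below, wraps the same idea in a graphical interface); the sample text is made up for illustration:

```shell
# 'colou?r' is a regular expression, not a literal word: the ? makes
# the preceding 'u' optional, so one pattern matches both spellings.
printf 'color\ncolour\ncontour\n' | grep -E 'colou?r'
```

grep prints the lines that match the pattern: here, color and colour, but not contour.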
According to Brian Kernighan, Professor of Computer Science at Princeton, grep was originally developed in 1974 by Ken Thompson while both he and Kernighan were working at Bell Labs.[1] One of their colleagues, Lee McMahon, wanted to perform what we would now call natural language processing on the Federalist Papers in order to determine their authorship. However, the primary text editor for Unix at the time, ed, could only operate on text held in memory, and the Federalist Papers ran to around 1 megabyte, far more than the 32 or 64 kilobytes of memory installed in the PDP-11 computer. Kernighan recalls that Thompson wrote grep overnight at McMahon’s request so that regular-expression operations could be performed on text stored on disk, which made processing the Federalist Papers possible. Today’s computers, of course, do not suffer the memory limitations of the computers of the 1970s, but grep lives on as a fundamentally useful tool for text parsing and searching.
In the film unit of MDPI, every delivery from Memnon is accompanied by a human-readable XML file that lists various characteristics of both the original film and the output scan. These XML files serve as powerful supplements to our own databases, and they are what allow us to perform queries on the queue of films waiting for quality control (QC) inspection. Because these XML files are generated by a script, and because they adhere to a standardized format agreed upon by both IU and Memnon, we know that certain strings of text will reliably appear in each one. By searching for these strings, we can quickly determine how many films in the queue exhibit particular characteristics.
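The actual schema of these files is not reproduced in this post, so the sketch below is hypothetical: apart from the <Sound> and <Date> tags discussed later, the barcode, tag names, and values are invented for illustration. It simply shows the kind of small, predictable sidecar file that makes this approach work:

```shell
# Write a hypothetical human-readable sidecar file; everything here
# other than the <Sound> and <Date> tags is invented for illustration.
cat > MDPI_40000001234567.xml <<'EOF'
<Film>
  <Barcode>40000001234567</Barcode>
  <Title>Untitled Home Movie</Title>
  <Sound>Silent</Sound>
  <Date>2018-10-15</Date>
</Film>
EOF
```

Because the script-generated tags are predictable, a plain-text search for a tag like <Sound>Silent</Sound> is as reliable as a database query.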
Now, grep is a UNIX and Linux command-line tool, but MDPI QC operations happen in the world of Microsoft Windows, which has no built-in grep utility. As a result, we use a program called PowerGREP, developed by Just Great Software, which provides grep functionality through a graphical user interface. PowerGREP’s interface is complicated, but only because it offers vast, customizable control over searches, giving us text-parsing capabilities on par with the original grep.[2]
Let’s go back to our scenario from the beginning of this post. We have ~400 films in the QC queue, and we want to find out how many of them are silent films (as opposed to sound films). This is an easy query to perform, as it involves matching a simple string of text. In short, we tell PowerGREP to search our queue folder for Memnon’s human-readable XML files – and only those XML files. MDPI generates a great many documents in XML format, but not all of them contain the information we need. Fortunately, these files all have standardized filenames, so I can specify to PowerGREP that I’m looking only for files whose names match the pattern MDPI_<barcode>.xml (where <barcode> is the 14-digit barcode that identifies a film within MDPI’s systems; this is the pattern for Memnon’s human-readable XML files), and not, say, MDPI_<barcode>_01_mezzRaw.xml, which is a machine-readable file generated by our QC software.

Within the human-readable XML files generated by Memnon, I’m looking for the following string: <Sound>Silent</Sound>. This is the XML tag that signifies that the film was scanned without sound; i.e., that it is a silent film. (Or perhaps a sound film scanned as silent at the request of the collection holder, but that is a subject for another time….) I can simply tell PowerGREP to search all instances of MDPI_<barcode>.xml for the string <Sound>Silent</Sound> and get a quick count of how many silent films are in the QC queue. PowerGREP will even generate a list of all the files that contain this string, and because the film’s unique identifier (its barcode) is part of the file name, we can quickly and easily sort out those films, if necessary, for special or expedited processing.
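A command-line equivalent of that PowerGREP query might look like the following sketch. The file contents (including the <Sound>Optical</Sound> value standing in for a sound film) are invented for illustration, but the filename filter and the search string mirror the ones described above:

```shell
# Create a small hypothetical queue: two human-readable sidecar files
# and one machine-readable QC file that should be excluded.
printf '<Sound>Silent</Sound>\n'  > MDPI_40000001234567.xml
printf '<Sound>Optical</Sound>\n' > MDPI_40000007654321.xml
printf 'machine-readable data\n'  > MDPI_40000001234567_01_mezzRaw.xml

# Restrict the search to MDPI_<barcode>.xml (exactly 14 digits, no
# suffix), list the files containing the silent-film tag (-l),
# then count them.
ls | grep -E '^MDPI_[0-9]{14}\.xml$' \
   | xargs grep -l '<Sound>Silent</Sound>' \
   | wc -l
```

The filename pattern `^MDPI_[0-9]{14}\.xml$` is what keeps the _01_mezzRaw.xml file out of the results, just as the filename mask does in PowerGREP.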
But let’s say I want to perform a more advanced query. Let’s say I want to know which films were scanned within a particular range of dates. Here, I’m not looking for a single static string of text; I’m looking for a variable string of text that matches a particular pattern. Here is where regular expressions come into play.
Regular-expression syntax uses special characters to define patterns. The + sign, for example, matches one or more occurrences of whatever item precedes it, so the construct r+ matches the character r one or more times. \d, for another example, matches a digit. If I search for \d+, that tells grep (or PowerGREP) to match one or more digits in a row. I can also escape a special character by putting a backslash before it. If I’m searching for decimal numbers, I would want to escape the decimal point with a backslash, because in a regular expression an unescaped dot is a wildcard that matches any single character. The string \d\.\d+, for example, matches a single digit followed by a literal decimal point and one or more further digits; it would return results for both 0.0 and 6.25. We can also group part of a pattern by enclosing it in parentheses, and make that group optional by following it with the * special character, which matches zero or more occurrences of the preceding item. The string \d+(\.\d+)* will thus match numbers whether or not they have a decimal part.
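Command-line grep can demonstrate these pieces, with one caveat: \d is Perl-style shorthand (which PowerGREP supports), while grep’s POSIX extended mode spells the digit class as [0-9]. With that substitution, the decimal-number pattern above becomes:

```shell
# [0-9]+ matches one or more digits; (\.[0-9]+)* is a group of a
# literal dot plus more digits, repeated zero or more times; ^ and $
# anchor the match so we test whole lines.
printf '42\n6.2\n0.0\nabc\n' | grep -E '^[0-9]+(\.[0-9]+)*$'
```

This prints 42, 6.2, and 0.0, but not abc.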
Armed with this information, let’s consider a complete regular expression: <Date>2018-(10-(0[1-9]|1[0-9]|2[0-9]|3[0-1])|11-(0[1-9]|1[0-9]|2[0-2]))</Date>. What is this string attempting to find? We can infer from the <Date> XML tag that it is trying to find a date or range of dates, but how do we parse the string between the opening and closing tags? The | is a special character meaning match either the item before the | or the item after it, so the string r|date would return matches for both rate and date. Square brackets, meanwhile, define a character class: the expression matches any one of the characters inside the brackets, so gr[ae]y would return matches for both gray and grey. For digits, we can include a hyphen within a class to establish a range, so [0-9] matches any digit from 0 through 9. By extension, then, we can see that the string 0[1-9]|1[0-9]|2[0-9]|3[0-1] attempts to match any two-digit number between 01 and 31. Our complete regular-expression query, then – <Date>2018-(10-(0[1-9]|1[0-9]|2[0-9]|3[0-1])|11-(0[1-9]|1[0-9]|2[0-2]))</Date> – attempts to match any date between October 1, 2018, and November 22, 2018, in yyyy-mm-dd format.
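This pattern can be tried out directly with command-line grep, since it happens to use no \d shorthand and so is valid POSIX extended syntax as well as PowerGREP syntax. The sample <Date> lines below are invented for illustration:

```shell
# Dates between Oct 1 and Nov 22, 2018 should match; the others
# should not.
printf '<Date>2018-10-01</Date>\n<Date>2018-11-22</Date>\n<Date>2018-11-23</Date>\n<Date>2018-09-30</Date>\n' \
  | grep -E '<Date>2018-(10-(0[1-9]|1[0-9]|2[0-9]|3[0-1])|11-(0[1-9]|1[0-9]|2[0-2]))</Date>'
```

Only the first two lines (2018-10-01 and 2018-11-22) are printed: November 23 fails every alternative in the 11- branch, and September is not listed as a month at all.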
There is, of course, a vast amount of power contained within regular expressions, and a brief article like this can only scratch the proverbial surface of what is possible. But hopefully this has provided some understanding of how we use basic programming and query functions within MDPI QC, and perhaps it has inspired some ideas about how you might be able to make use of these tools within your own organization or project.
[1] “Where GREP Came From – Computerphile,” YouTube, July 6, 2018, https://www.youtube.com/watch?v=NTfOnGZUZDk. This video is an interview with Brian Kernighan in which he discusses the history of grep, the limitations of the PDP-11, and the basic operations of ed that formed the structure around which grep was built.
[2] PowerGREP also uses its own particular flavor of regular expressions, which are slightly different in syntax from the GNU regular expressions used in grep. For the purposes of this post, all sample regular expressions are written using PowerGREP’s flavor of regular expressions.