The Format Identifier
The format identifier uses Siegfried to identify file formats based on byte sequence signatures in the PRONOM registry.
It streams files one by one from the staging bucket through its identification algorithms and records the results in Redis. If you look at the Redis file record, you'll notice the following fields:
- file_format - this is the format as identified by the format identifier
- format_identified_by - this will be either "siegfried", indicating that Siegfried matched the file to an entry in the PRONOM registry, or "ext map", indicating that Siegfried did not match the file, so it was identified by its extension.
- format_match_type - describes how the match was determined. "signature" means Siegfried matched the file against a PRONOM signature, "extension" means either Siegfried or a circuit breaker (see below) matched it by extension, and "container" means Siegfried matched it as a container type (e.g. tar, zip, jar, rar, or certain Microsoft Office file types that contain multiple internal file streams).
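As a rough illustration of how these fields combine, the sketch below turns a file record into a one-line summary. It assumes the record has already been read from Redis and decoded into a plain dictionary keyed by the field names above; the function name, the fallback strings, and the sample format value are hypothetical and are not taken from the format identifier source.

```python
from typing import Mapping


def describe_identification(record: Mapping[str, str]) -> str:
    """Summarize how a file's format was determined.

    `record` is a hypothetical, already-decoded file record containing the
    three fields described above. How the record is actually serialized in
    Redis is not covered here.
    """
    file_format = record.get("file_format", "unknown")
    identified_by = record.get("format_identified_by", "")
    match_type = record.get("format_match_type", "")

    if identified_by == "ext map":
        # Siegfried did not produce the match; the extension map did.
        how = "the extension map"
    elif identified_by == "siegfried" and match_type == "signature":
        how = "Siegfried, by PRONOM byte signature"
    elif identified_by == "siegfried" and match_type == "container":
        how = "Siegfried, as a container format"
    elif identified_by == "siegfried" and match_type == "extension":
        how = "Siegfried, falling back to file extension"
    else:
        how = "an unrecognized combination of fields"

    return f"{file_format}: identified by {how}"
```

A record with format_identified_by set to "siegfried" and format_match_type set to "extension" is the case worth watching for, since it means Siegfried ran but could not confirm the format from the file's bytes.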
Short Circuiting and Extension Matching
Siegfried may crash when attempting to identify certain container formats if the file is corrupt or internally inconsistent. This happens most often with proprietary Microsoft Office container formats, as listed in the CrashableFormats map in the format identifier source code.
When the format identifier encounters a file with a crashable extension, it skips Siegfried's PRONOM-matching algorithms and matches on extension only. In this case, the format_identified_by attribute of the Redis file record will be set to "ext map", and the format_match_type will be set to "extension", indicating that Siegfried didn't even attempt to identify it.
In other cases, Siegfried may try and fail to do a byte signature match, then fall back to an extension match. In these cases, format_identified_by will be "siegfried" and format_match_type will be "extension".
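The decision flow described in this section can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: CRASHABLE_EXTENSIONS and EXTENSION_MAP are hypothetical stand-ins whose entries are guesses for demonstration only (the real list lives in the CrashableFormats map), and run_siegfried is an assumed callable that returns a (format, match type) pair, or None when Siegfried finds no match.

```python
import os
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-ins. The entries below are illustrative guesses, not the
# contents of the real CrashableFormats map or extension map.
CRASHABLE_EXTENSIONS = {".doc", ".xls", ".ppt"}
EXTENSION_MAP = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".txt": "text/plain",
}
DEFAULT_FORMAT = "application/octet-stream"

# Assumed shape of a Siegfried call: returns (format, match_type), where
# match_type is "signature", "container", or "extension", or None if no match.
SiegfriedRunner = Callable[[str], Optional[Tuple[str, str]]]


def extension_match(ext: str) -> Dict[str, str]:
    """Build a record for a file identified by the extension map alone."""
    return {
        "file_format": EXTENSION_MAP.get(ext, DEFAULT_FORMAT),
        "format_identified_by": "ext map",
        "format_match_type": "extension",
    }


def identify(path: str, run_siegfried: SiegfriedRunner) -> Dict[str, str]:
    """Identify a file, short-circuiting Siegfried for crashable extensions."""
    ext = os.path.splitext(path)[1].lower()

    # Circuit breaker: never hand a crashable format to Siegfried.
    if ext in CRASHABLE_EXTENSIONS:
        return extension_match(ext)

    result = run_siegfried(path)
    if result is None:
        # Siegfried found nothing, so fall back to the extension map.
        return extension_match(ext)

    file_format, match_type = result
    return {
        "file_format": file_format,
        "format_identified_by": "siegfried",
        "format_match_type": match_type,
    }
```

Passing the Siegfried call in as a parameter keeps the sketch self-contained and makes it easy to confirm that crashable files never reach Siegfried at all.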
Resources
Siegfried uses large amounts of network I/O and memory because it runs large files through a number of internal functions. Memory, in particular, is a problem in our low-resource Docker containers. To prevent out-of-memory exceptions, the format identifier containers process only one file at a time. This makes the format identifier a bottleneck in the ingest pipeline. This worker is also the most likely to scale to multiple instances, even under light loads.
The format identifier still occasionally dies before completing its work. This may be due to out-of-memory exceptions, or it may be due to the worker running on spot instances, which AWS may reclaim when it needs the capacity back.
In either case, simply requeuing the item in the Registry fixes the problem. Requeue to the Format Identification stage and the identifier will pick up where the dead worker left off.
External Services
Service | Function |
---|---|
S3 Staging Bucket | Worker streams files from staging through a format identification function to determine file format. |
Redis | Worker updates file records in Redis with file format and some metadata about how the file format was determined. |
Registry | Source of WorkItem record describing work to be done. |
NSQ | Distributes WorkItem IDs to workers and tracks their status. |
Source Files
Worker | Service | Files | Definition |
---|---|---|---|
Format Identifier | Ingest | Task Worker App | Identifies the format of files within a bag. |