forensics - .ddd file - Verity Documentum?
2014-07
I work in computer forensics - one of the data sets that I have recently been asked to analyse contains a number of .ddd files that I have so far been unable to open.
Reading through these files in a text/hex editor reveals various mentions of 'Verity Inc version 5.5.0'. Some intense googling reveals they may be related to some old document management software called 'verity documentum'.
These files date back to 2003 - a little before my time! Verity has since been bought by a company called 'Autonomy Corp', which was then purchased by HP. As expected, no-one at HP has any idea what I'm talking about, and all the Verity/Autonomy contacts I have tried to communicate with have been dead ends.
Asking the 'more experienced' members, has anyone come across these kinds of files or this software before? If so, do you have any idea how to open them or convert them to a more readable format?
Verity collections
Verity, Inc. is the company behind the K2 enterprise search engine. Verity's technology has been included in various third-party software such as ColdFusion (from version 5 all the way to version 9.0.1), PeopleSoft, OrCAD, and PaperPort.
An individual collection represents a logical group of documents plus a set of metadata about those documents. The specific information stored for a collection includes various word indexes, an internal documents table containing document field information, and logical pointers to the actual document files.
Source: Features of Collections - Contents of Collection Indexes
Directory structure
From the Verity Collection Reference:
Each collection includes the following subdirectories:
assists
Contains files that give general collection information and assist in optimizing searches, such as spanning word lists (*.wld), the collection "about" file (*.abt), and ngram indexes (*.ngm).
morgue
Contains collection files scheduled for deletion.
parts
Contains the internal fields table (*.ddd) and the word index (*.did) for each of the partitions in the collection.
pdd
Contains the partition map file (*.pdd) for the collection.
style
The style set that configures the collection. Contains both gateway style files and collection style files.
temp
Temporary storage used by Verity Spider and K2 Spider.
topicidx
Contains indexed topic sets, if they exist for this collection.
trans
Contains files (*.trn) that store information on pending indexing transactions.
work
Temporary storage for files being processed.
Source: Verity Collection Reference
Depending on the collection, some of the folders listed above might be empty or missing entirely. The style and the parts folders are the most relevant ones.
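If you want a quick overview of what a given collection actually contains, something like the following might help. It is only a minimal sketch: the subdirectory names come from the list quoted above, while the function name and the three status labels are my own invention.

```python
from pathlib import Path

# Subdirectory names taken from the Verity Collection Reference list above.
EXPECTED_SUBDIRS = (
    "assists", "morgue", "parts", "pdd", "style",
    "temp", "topicidx", "trans", "work",
)

def inspect_collection(collection_dir):
    """Report whether each expected subdirectory is missing, empty or populated.

    The labels "missing", "empty" and "populated" are illustrative only.
    """
    root = Path(collection_dir)
    report = {}
    for name in EXPECTED_SUBDIRS:
        sub = root / name
        if not sub.is_dir():
            report[name] = "missing"
        elif any(sub.iterdir()):
            report[name] = "populated"
        else:
            report[name] = "empty"
    return report
```

Calling inspect_collection(r"X:\collection") returns a dictionary keyed by folder name, so you can see at a glance whether parts and style are there.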
Partitions
When indexing documents, the Verity engine stores document metadata in units called partitions. Each partition contains metadata (typically a full-word index) for a set of documents consisting of anywhere from 1 to 64K documents. The Verity engine does not actually copy your document; rather, a partition contains all of the metadata associated with the documents that make them searchable, including:
The internal documents table including fields; some fields are defined by default, and custom fields may be defined, like "Title" and "Author".
The full word index of the words (sometimes referred to as the word list) in the documents of that partition.
Each partition consists of a word list and a documents table, which are named after a sequential 8-digit number (e.g. 00000001.did and 00000001.ddd). Both are stored as binary files.
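Given that naming scheme, pairing up the files of each partition is straightforward. The sketch below assumes the partition files sit directly inside the parts folder and that anything not named with exactly eight digits can be ignored; the function name is hypothetical.

```python
import re
from pathlib import Path

# Partition files are named after an 8-digit number, e.g. 00000001.ddd
PARTITION_NAME = re.compile(r"^\d{8}$")

def list_partitions(parts_dir):
    """Map each partition number to its documents table (.ddd) and word list (.did).

    A value of None means the corresponding file was not found.
    """
    partitions = {}
    for path in sorted(Path(parts_dir).iterdir()):
        ext = path.suffix.lower().lstrip(".")
        if ext in ("ddd", "did") and PARTITION_NAME.match(path.stem):
            entry = partitions.setdefault(path.stem, {"ddd": None, "did": None})
            entry[ext] = path
    return partitions
```

A partition with a .ddd but no .did (or vice versa) shows up with None for the missing half, which may be worth noting when dealing with incomplete evidence.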
The fields within the documents table are defined by the following collection style files:
style.ddd defines fields used internally by the Verity engine, identified by an initial underscore character (_).
style.sfl defines standard fields (many of which are commented out to limit the size of the documents table).
style.ufl defines custom fields that are not included in style.sfl.
The value of each field can be filled in from source documents or can be provided explicitly. If a field is blank, it has not been populated.
Source: Using browse
Viewing partition data
All Verity products come bundled with some maintenance and troubleshooting tools. Among them are didump and browse. The former can be used to display the contents of the word lists; the latter can be used to display indexed document fields.
browse
The program accepts a single parameter, which is the path of a .ddd file:
browse.exe "X:\collection\parts\00000001.ddd"
After successfully opening a file it will display the available options:
BROWSE OPTIONS
?) help
q) quit
c) Number of entries in field
_) Toggle viewing fields beginning with '_'
v) Toggle viewing selected fields
##) Display all fields in specified record number
Dispatch/Compound field options:
n) No dispatch
d) Dispatch
s) Dispatch as stream
Count the number of records
To check the number of indexed records you can type c, and then specify VdkVgwKey as the field, which is the primary key used to identify each entry in the document table:
Action (? for help): c
Number of entries in field named: VdkVgwKey
There are (58) entries in the field (VdkVgwKey)
Display a specific record
All indexes are zero-based. For example, to get the first entry, type 0 and press Enter:
Record number: 0
0 _DDFLAG FIX-unsg ( 1) = 0x00
1 _DDVALUE VAR-text ( 0) =
2 _DDVALUE_OF FIX-unsg ( 4) = 0
3 _DDVALUE_SZ FIX-unsg ( 2) = 0
4 _DBVERSION CON-text ( 7) = vdk060
5 _DDDSTAMP FIX-date ( 4) = 17-Apr-2003 01:51:06 pm
6 _DOCIDX FIX-text ( 12) = ☺
7 _PARTDESC FIX-text ( 32) = vdk150.dll (Verity, Inc. Version
8 _STYLE AUT-text ( 58) = C:/Users/Test/Desktop/coll/style/style.ddd
9 _DOCID FIX-unsg ( 4) = 1
10 _SECURITY FIX-unsg ( 4) = 0
12 VdkVgwKey_IX FIX-unsg ( 3) = 53
13 VdkVgwKey_MI WRM-text ( 93) = C:\Documents and Settings\khakkara.RATIONAL
\Desktop\DOCCD\rational_clearcase_lt\cc_admin.pdf
14 VdkVgwKey_MX WRM-text ( 75) = C:\Documents and Settings\khakkara.RATIONAL
\Desktop\DOCCD\using_search.pdf
15 VdkVgwKey_OF FIX-unsg ( 4) = 32
16 VdkVgwKey_SZ FIX-unsg ( 2) = 75
17 Exists FIX-unsg ( 1) = 100
18 IsAChunk FIX-unsg ( 1) = 0
19 LargeDoc FIX-unsg ( 1) = 187
20 StartPage FIX-unsg ( 4) = 1
21 EndPage FIX-unsg ( 4) = 0
22 StartPageFrom FIX-unsg ( 4) = 0
23 EndPageAt FIX-unsg ( 4) = 0
24 FileName VAR-text ( 24) = ()(.)(using_search.pdf)
25 PageMap VAR-text ( 4) = D
26 NumPages FIX-unsg ( 4) = 2
27 PermanentID FIX-text ( 32) = 177032712d4a99426aa238bdad896ba2
28 WXEVersion FIX-unsg ( 1) = 2
29 FTS_Title VAR-text ( 41) = Using Search with Rational Documentation
30 FTS_Subject VAR-text ( 0) =
31 FTS_Author VAR-text ( 18) = Rational Software
32 FTS_Keywords VAR-text ( 57) = search, find, full-text Rational Version 20
03.06.00 Beta
33 FTS_Creator VAR-text ( 15) = FrameMaker 7.0
34 FTS_Producer VAR-text ( 34) = Acrobat Distiller 5.0.5 (Windows)
35 FTS_CreationDate FIX-xdat ( 4) = 02-Jul-2002 09:01:00 pm
36 FTS_ModificationDate FIX-xdat ( 4) = 03-Apr-2003 10:08:00 pm
37 DOC DSP-text ( -1) = C:\Documents and Settings\khakkara.RATIONAL
\Desktop\DOCCD\using_search.pdf
38 DOC_FN VAR-text ( 75) = C:/Documents and Settings/khakkara.RATIONAL
/Desktop/DOCCD/using_search.pdf
39 FileName_OF FIX-unsg ( 4) = 32
40 FileName_SZ FIX-unsg ( 2) = 24
41 PageMap_OF FIX-unsg ( 4) = 105
42 PageMap_SZ FIX-unsg ( 2) = 4
43 FTS_Title_OF FIX-unsg ( 4) = 32
44 FTS_Title_SZ FIX-unsg ( 2) = 41
45 FTS_Subject_OF FIX-unsg ( 4) = 0
46 FTS_Subject_SZ FIX-unsg ( 2) = 0
47 FTS_Author_OF FIX-unsg ( 4) = 32
48 FTS_Author_SZ FIX-unsg ( 2) = 18
49 FTS_Keywords_OF FIX-unsg ( 4) = 32
50 FTS_Keywords_SZ FIX-unsg ( 2) = 57
51 FTS_Creator_OF FIX-unsg ( 4) = 90
52 FTS_Creator_SZ FIX-unsg ( 2) = 15
53 FTS_Producer_OF FIX-unsg ( 4) = 56
54 FTS_Producer_SZ FIX-unsg ( 2) = 34
55 DOC_OF FIX-unsg ( 4) = 0
56 DOC_SZ FIX-unsg ( 4) = 4294967295
57 DOC_FN_OF FIX-unsg ( 4) = 32
58 DOC_FN_SZ FIX-unsg ( 2) = 75
59 InstanceID FIX-text ( 32) = 77b25f03d16bf386317bd13c3eba7d5e
60 InstanceID_IX FIX-unsg ( 3) = 22
61 DirID VAR-text ( 6) = ()(.)
62 DirID_IX FIX-unsg ( 3) = 0
63 DirID_OF FIX-unsg ( 4) = 32
64 DirID_SZ FIX-unsg ( 2) = 6
By pressing Enter again you can display the next record.
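If you save the text of a record from the console, you may want it in a structured form. Here is a rough sketch of a parser; the column layout (index, field name, type, size in parentheses, then the value) is inferred purely from the sample output above and is not a documented format. It also assumes that lines not matching that layout are wrapped continuations of the previous value, and that the text passed in holds a single record.

```python
import re

# Layout inferred from the sample: "<index> <name> <TYPE-kind> ( <size>) = <value>"
FIELD_LINE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<name>\S+)\s+(?P<type>[A-Z]{3}-\w+)"
    r"\s+\(\s*(?P<size>-?\d+)\)\s+=\s?(?P<value>.*)$"
)

def parse_browse_record(text):
    """Turn the text of one displayed record into a dict keyed by field name."""
    fields = {}
    last = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        match = FIELD_LINE.match(line)
        if match:
            last = match.group("name")
            fields[last] = {
                "index": int(match.group("index")),
                "type": match.group("type"),
                "size": int(match.group("size")),
                "value": match.group("value"),
            }
        elif last is not None:
            # Assumed to be a wrapped continuation of the previous value.
            fields[last]["value"] += line.strip()
        # Anything before the first field line (e.g. the prompt) is skipped.
    return fields
```

For example, parse_browse_record(text)["FTS_Title"]["value"] would give the document title. Wrapped values are joined without a separator, so a value that happened to wrap exactly at a space will lose that space.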
Obtaining the Verity utilities
The easiest way to get a copy is to download some software which includes them. For example, the PaperPort application bundled with some Dell multifunction printers includes them, as do old ColdFusion trial versions.
Manual installation
I'll use the PaperPort 15-day trial as an example.
Download the trial. Here are the direct links:
Open the executable using 7-Zip, and extract the PaperPort folder somewhere.
Open a command prompt and navigate to the folder you just extracted:
cd /d "X:\Whatever\PaperPort"
Extract all the files by running the MSI installer in administrative mode:
msiexec /a "Nuance PaperPort 14.msi" targetdir="%cd%\Temp"
Proceed with the installation. When the installer has finished you'll find the Verity tools in the following folder:
X:\Whatever\PaperPort\Temp\program files\Nuance\PaperPort\Verity\vdk\_nti40\bin
Sample collections
Here are some Verity collections I found around the web. They might be useful for testing purposes or simply to better understand how they work:
- ftp://ftp.boulder.ibm.com/software/rational/docs/v2003/win_solutions/index/
- http://www.oecd-nea.org/dbdata/nds_jefreports/jefreport-17/Searches/Searchall/
- http://www.appservgrid.com/documentation/docs/rdbms10g/windows/index/
- http://jotm.objectweb.org/related/ccontrol/Images/Index/
- http://hydro.tg.free.fr/doc/hydro/oleostart/Indice/MASTER/
- http://signal.ee.bilkent.edu.tr/defevent/srchidx/absidx/
- http://www.nt.ntnu.no/users/skoge/prost/proceedings/acc04/ACC2004/
Is there a site that lists popular file formats?
I am aware that it would be hard to find out which file formats are popular, and I do not need exact popularity, just an approximation. I guess companies that develop anti-malware software could have some information about files that are scanned.
I have found a few pages that list all file formats (like http://en.wikipedia.org/wiki/List_of_file_formats), but I need just a few popular ones.
It would be nice if the list could be filtered by type (audio, video...) or platform (Windows, Linux, Mac...), but that is optional.
Some background: I am testing file upload for a web application, and I do not want to test all file formats, just popular ones.
Sure. http://www.fileinfo.com/
If you are testing file uploads, I wouldn't be too concerned with file formats per se, unless you are post-processing them after they are uploaded. What I would test is:
- Very large files
- Files that may be insecure, such as buffer overflow payloads.
- Use some sort of checksum to ensure that the files are uploaded correctly without error, particularly on a flaky connection.
- Disconnecting partway through an upload and seeing what state that leaves the server and clients in
- Uploads that take longer than the web server session timeout
- Files from different filesystems: HFS+, ext3, NTFS and FAT32
- Very long filenames
- Filenames with multiple dots
- Filenames with punctuation, underscores, dashes
etc
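To make a couple of those points concrete, here is a small sketch. It assumes your test can read both the original file and the copy stored by the server from disk; SHA-256 is just one reasonable choice of checksum, and the function names and sample filenames are made up for illustration.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so very large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def upload_is_intact(local_path, stored_path):
    """True if the stored copy has the same checksum as the original."""
    return sha256_of(local_path) == sha256_of(stored_path)

def edge_case_filenames(max_len=255):
    """Hypothetical sample names for the filename cases listed above."""
    return [
        "a" * (max_len - 4) + ".txt",    # very long filename
        "archive.tar.gz.bak",            # multiple dots
        "report_final-v2 (copy)!.pdf",   # punctuation, underscores, dashes
    ]
```

The 255-character default is only a common filesystem limit, so adjust it to whatever your server and clients are supposed to accept.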
Could not resist this one --- TXT format!
The 'plain-text' files that manage to get messed up across Unix and Windows platforms all the time.
+1 to Bruce for approaching the question correctly.
@Željko Filipin, if you know there is different behavior for some formats,
get that list specifically and check for it -- why look at all the formats in the world?
That list itself should suggest if other formats need to be checked.
Here's a handmade CSV of all the most popular filetypes according to FileInfo.com, and here's a CSV of all the file types listed.