2014-11-27 - Progress - Tony Finch
My current project is to replace Cambridge University's DNS servers. The first stage of this project is to transfer the code from SCCS to Git so that it is easier to work with.
Ironically, to do this I have ended up spending lots of time working with SCCS and RCS, rather than Git. This mainly involved developing analysis and conversion tools to get things into a fit state for Git.
If you find yourself in a similar situation, you might find these tools helpful.
- Background
- Developmestuction
- PLN
- sccs2rcs proves inadequate
- sccs2rcs1
- sccscheck
- sccsprefix
- rcsappend
- files2rcs
- An aside on file name restrictions
- rcsdeadify
- tar2usermap
- sccs2cvs
- pre-uplift, mid-uplift, post-uplift
- sccscommitters
- Uplifting cvs to git
- Wrapping up
Background
Cambridge was allocated three Class B networks in the 1980s: first the Computer Lab got 128.232.0.0/16 in 1987; then the Department of Engineering got 129.169.0.0/16 in 1988; and eventually the Computing Service got 131.111.0.0/16 in 1989 for the University (and related institutions) as a whole.
The oldest records I have found date from September 1990 and list about 300 registrations. The next two departments to get connected were the Statistical Laboratory and Molecular Biology (I can't say in which order). The Statslab was allocated 131.111.20.0/24, which it has kept for 24 years! Things picked up in 1991, when the JANET IP Service was started and rapidly took over from X.25. (Last month I blogged about connectivity for Astronomy in Cambridge in 1991.)
I have found these historical nuggets in our ip-register directory tree. This contains the infrastructure and history of IP address and DNS registration in Cambridge going back a quarter century. But it isn't just an archive: it is a working system which has been in production that long. Because of this, converting the directory tree to Git presents certain challenges.
Developmestuction
The ip-register directory tree contains a mixture of:
- Source code, mostly with SCCS history
- Production scripts, mostly with SCCS history
- Configuration files, mostly with SCCS history
- The occasional executable
- A few upstream perl libraries
- Output files and other working files used by the production scripts
- Secrets, such as private keys and passwords
- Mail archives
- Historical artifacts, such as old preserved copies of parts of the directory tree
- Miscellaneous files without SCCS history
- Editor backup files with ~ suffixes
My aim was to preserve all of this as faithfully as I could, while converting it to Git in a way that represents the history in a useful manner.
PLN
The rough strategy was:
- Take a copy of the ip-register directory tree, preserving modification times. (There is no need to preserve owners, because any useful ownership information was lost when the directory tree moved off the Central Unix Service before that shut down in 2008.)
- Convert from SCCS to RCS file-by-file. Converting between these formats is a simple one-to-one mapping. Files without SCCS history will get very short artificial RCS histories created from their modification times and editor backup files.
- Convert the RCS tree to CVS. This is basically just moving files around, because a CVS repository is little more than a directory tree of RCS files.
- Convert the CVS repository to Git using git cvsimport. This is the only phase that needs to do cross-file history analysis, and other people have already produced a satisfactory solution.
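The RCS-to-CVS step really is just moving files around. As a minimal sketch of the idea (a hypothetical helper, not the actual conversion script), copying a tree of `,v` files into a CVS module is all it takes:

```python
import os
import shutil

def rcs_tree_to_cvs(rcs_root, cvs_module):
    """Copy a tree of RCS ,v files into a CVS module directory.

    A CVS repository is little more than a directory tree of RCS
    files, so mirroring the layout is the whole of this step.
    """
    for dirpath, dirnames, filenames in os.walk(rcs_root):
        rel = os.path.relpath(dirpath, rcs_root)
        target = os.path.join(cvs_module, rel)
        os.makedirs(target, exist_ok=True)
        for name in filenames:
            if name.endswith(",v"):
                shutil.copy2(os.path.join(dirpath, name),
                             os.path.join(target, name))
```

(`copy2` preserves modification times, which matters when timestamps are the only history some files have.)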
Simples! ... Not.
sccs2rcs proves inadequate
I first tried ESR's sccs2rcs Python script. Unfortunately I rapidly ran into a number of showstoppers.
- It didn't work with Solaris SCCS, which is what was available on the ip-register server.
- It destructively updates the SCCS tree, losing information about the relationship between the working files and the SCCS files.
- It works on a whole directory tree, so it doesn't give you file-by-file control.
I fixed a bug or two but very soon concluded the program was entirely the wrong shape.
(In the end, the Solaris incompatibility became moot when I installed GNU CSSC on my FreeBSD workstation to do the conversion. But the other problems with sccs2rcs remained.)
sccs2rcs1
So I wrote a small script called sccs2rcs1 which just converts one SCCS file to one RCS file, and gives you control over where the RCS and temporary files are placed. This meant that I would not have to shuffle RCS files around: I could just create them directly in the target CVS repository. Also, sccs2rcs1 uses RCS options to avoid the need to fiddle with checkout locks, which is a significant simplification.
The main regression compared to sccs2rcs is that sccs2rcs1 does not support branches, because I didn't have any files with branches.
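The heart of any such converter is walking the SCCS delta table, whose entries record each revision's id, date, and committer. A rough sketch of that parsing step (based on the documented SCCS file format, where delta lines start with a control-A byte; this is illustrative, not code from sccs2rcs1):

```python
import re

# An SCCS history file's delta table contains lines of the form
#   ^Ad D <sid> <yy/mm/dd> <hh:mm:ss> <committer> <serial> <pred>
# where ^A is a control-A (0x01) byte.  A converter replays this
# table to recreate each revision with its original date and author.
DELTA_RE = re.compile(
    rb"\x01d D (\S+) (\d\d/\d\d/\d\d \d\d:\d\d:\d\d) (\S+) (\d+) (\d+)")

def sccs_deltas(sccs_bytes):
    """Return (sid, date, committer) tuples, oldest first."""
    deltas = [(m.group(1).decode(), m.group(2).decode(),
               m.group(3).decode())
              for m in DELTA_RE.finditer(sccs_bytes)]
    # the delta table lists the newest delta first
    return list(reversed(deltas))
```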
sccscheck
At this point I needed to work out how I was going to co-ordinate the invocations of sccs2rcs1 to convert the whole tree. What was in there?!
I wrote a fairly quick-and-dirty script called sccscheck which analyses a directory tree and prints out notes on various features and anomalies. A significant proportion of the code exists to work out the relationship between working files, backup files, and SCCS files.
I could then start work on determining what fix-ups were necessary before the SCCS-to-CVS conversion.
sccsprefix
One notable part of the ip-register directory tree was the archive subdirectory, which contained lots of gzipped SCCS files with date stamps. What relationship did they have to each other? My first guess was that they might be successive snapshots of a growing history, and that the corresponding SCCS files in the working part of the tree would contain the whole history.
I wrote sccsprefix to verify if one SCCS file is a prefix of another, i.e. that it records the same history up to a certain point.
This proved that the files were NOT snapshots! In fact, the working SCCS files had been periodically moved to the archive, and new working SCCS files started from scratch. I guess this was to cope with the files getting uncomfortably large and slow for 1990s hardware.
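The prefix test itself is simple once the per-delta metadata has been extracted from each file. A sketch of the idea (hypothetical representation, not the real sccsprefix):

```python
def is_history_prefix(older, newer):
    """True if `older` records the same history as the start of `newer`.

    Both arguments are lists of per-delta metadata, e.g. (sid, date,
    author) tuples, oldest first.  If the older file is a prefix, the
    newer file simply continues it; if not, the two histories are
    independent -- which is how the archived ip-register files turned
    out to be: moved aside and restarted, not snapshots.
    """
    return len(older) <= len(newer) and newer[:len(older)] == older
```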
rcsappend
So to represent the history properly in Git, I needed to combine a series of SCCS files into a linear history. It turns out to be easier to construct commits with artificial metadata (usernames, dates) with RCS than with SCCS, so I wrote rcsappend to add the commits from a newer RCS file as successors of commits in an older file.
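The core of appending one history to another is just renumbering: the newer file's commits become successors of the older file's, provided the dates don't interleave. A sketch under that assumption (hypothetical data model, not the real rcsappend):

```python
def append_history(older, newer):
    """Linearize two single-branch histories, oldest first.

    Each history is a list of (date, author, message) commits with
    sortable dates.  The combined history is renumbered as RCS
    revisions 1.1, 1.2, ... with the newer file's commits appended
    as successors of the older file's.
    """
    combined = older + newer
    # sanity check: the archived file must wholly predate the new one
    assert all(combined[i][0] <= combined[i + 1][0]
               for i in range(len(combined) - 1)), "histories overlap"
    return [("1.%d" % (i + 1), commit)
            for i, commit in enumerate(combined)]
```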
Converting the archived SCCS files was then a combination of sccs2rcs1 and rcsappend. Unfortunately this was VERY slow, because RCS takes a long time to check out old revisions. This is because an RCS file contains a verbatim copy of the latest revision and a series of diffs going back one revision at a time. The SCCS format is more clever and so takes about the same time to check out any revision.
So I changed sccs2rcs1 to incorporate an append mode, and used that to convert and combine the archived SCCS files, as you can see in the ipreg-archive-uplift script. This still takes ages to convert and linearize nearly 20,000 revisions in the history of the hosts.131.111 file - an RCS checkin rewrites the entire RCS file, so checkins get slower as the number of revisions grows. Fortunately I don't need to run it many times.
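A back-of-envelope cost model shows why this is quadratic: checking out revision r of n applies roughly n − r reverse diffs, so linearizing the whole file costs on the order of n²/2 patch applications (this is my rough model of the behaviour, not a measured profile):

```python
def rcs_linearize_cost(n):
    """Rough cost model for copying all n revisions out of one RCS
    file: checking out revision r applies about n - r reverse diffs,
    so the total number of patch applications grows quadratically.
    """
    return sum(n - r for r in range(1, n + 1))

# For a ~20,000-revision history this is on the order of 2e8 patch
# applications, which is why the uplift takes ages -- but it only
# needs to run a few times.
```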
files2rcs
There are a lot of files in the ip-register tree without SCCS histories, which I wanted to preserve. Many of them have old editor backup ~ files, which could be used to construct a wee bit of history (in the absence of anything better). So I wrote files2rcs to build an RCS file from this kind of miscellanea.
An aside on file name restrictions
At this point I need to moan a bit.
Why does RCS object to file names that start with a comma. Why.
I tried running these scripts on my Mac at home. It mostly worked, except for the directories which contained files like DB.cam (source file) and db.cam (generated file). I added a bit of support in the scripts to cope with case-insensitive filesystems, so I can use my Macs for testing. But the bulk conversion runs very slowly, I think because it generates too much churn in the Spotlight indexes.
rcsdeadify
One significant problem is dealing with SCCS files whose working files have been deleted. In some SCCS workflows this is a normal state of affairs - see for instance the SCCS support in the POSIX make XSI extensions.
However, in the ip-register directory tree this corresponds to files that are no longer needed. Unfortunately the SCCS history generally does not record when the file was deleted. It might be possible to make a plausible guess from manual analysis, but perhaps it is more truthful to record an artificial revision saying the file was not present at the time of conversion.
Like SCCS, RCS does not have a way to represent a deleted file. CVS uses a convention on top of RCS: when a file is deleted it puts the RCS file in an "Attic" subdirectory and adds a revision with a "dead" status. The rcsdeadify script applies this convention to an RCS file.
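The Attic half of the convention can be sketched like this (a hypothetical helper, not the real rcsdeadify; committing the final "dead" revision itself would be done with the RCS tools, e.g. `ci -s dead`, and is omitted here):

```python
import os
import shutil

def move_to_attic(rcs_file):
    """Apply CVS's deletion convention to an RCS file: move foo,v
    into an Attic subdirectory next to it.  (A full rcsdeadify would
    also check in a final revision with state 'dead'; that step is
    not shown here.)"""
    directory, name = os.path.split(rcs_file)
    attic = os.path.join(directory, "Attic")
    os.makedirs(attic, exist_ok=True)
    dest = os.path.join(attic, name)
    shutil.move(rcs_file, dest)
    return dest
```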
tar2usermap
There are situations where it is possible to identify a meaningful committer and deletion time. Where a .tar.gz archive exists, it records the original file owners. The tar2usermap script records the file owners from the tar files. The contents can then be unpacked and converted as if they were part of the main directory, using the usermap file to provide the correct committer IDs. After that the files can be marked as deleted at the time the tarfile was created.
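Tar archives carry owner names in each member header, so extracting a usermap is straightforward. A sketch of the idea using Python's stdlib tarfile module (not the actual tar2usermap script; using the newest member mtime as a stand-in for the archive's creation time is my assumption):

```python
import tarfile

def tar_usermap(tar_path):
    """Map each member's path to its recorded owner name, and note
    the newest member mtime as a rough guess at when the tarball
    was created (for dating the deletions)."""
    usermap = {}
    newest = 0
    with tarfile.open(tar_path) as tf:
        for member in tf:
            usermap[member.name] = member.uname or str(member.uid)
            newest = max(newest, member.mtime)
    return usermap, newest
```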
sccs2cvs
The main conversion script is sccs2cvs, which evacuates an SCCS working tree into a CVS repository, leaving behind a tree of (mostly) empty directories. It is based on a simplified version of the analysis done by sccscheck, with more careful error checking of the commands it invokes. It uses sccs2rcs1, files2rcs, and rcsappend to handle each file.
The rcsappend case occurs when there is an editor backup ~ file which is older than the oldest SCCS revision, in which case sccs2cvs uses rcsappend to combine the output of sccs2rcs1 and files2rcs. This could be done more efficiently with sccs2rcs1's append mode, but for the ip-register tree it doesn't cause a big slowdown.
To cope with the varying semantics of missing working files, sccs2cvs leaves behind a tombstone where it expected to find a working file. This takes the form of a symlink pointing to 'Attic'. Another script can then deal with these tombstones as appropriate.
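Spotting the tombstones afterwards is a simple tree walk. A sketch of what such a follow-up script might do (hypothetical helper, not one of the actual scripts):

```python
import os

def find_tombstones(tree):
    """Find tombstone symlinks in an evacuated working tree:
    symlinks whose target is literally 'Attic'."""
    tombstones = []
    for dirpath, dirnames, filenames in os.walk(tree):
        # a dangling symlink shows up in filenames; one that resolves
        # to an existing Attic directory shows up in dirnames
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and os.readlink(path) == "Attic":
                tombstones.append(path)
    return tombstones
```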
pre-uplift, mid-uplift, post-uplift
Before sccs2cvs can run, the SCCS working tree should be reasonably clean. So the overall uplift process goes through several phases:
- Fetch and unpack a copy of the SCCS working tree;
- pre-uplift fixups (these should be the minimum changes that are required before conversion to CVS, such as moving secrets out of the working tree);
- sccs2cvs;
- mid-uplift fixups (this should include any adjustments to the earlier history, such as marking when files were deleted in the past);
- git cvsimport or cvs-fast-export | git fast-import;
- post-uplift fixups (this is when to delete cruft which is now preserved in the git history).
For the ip-register directory tree, the pre-uplift phase also includes ipreg-archive-uplift which I described earlier. Then in the mid-uplift phase the combined histories are moved into the proper place in the CVS repository so that their history is recorded in the right place.
Similarly, for the tarballs, the pre-uplift phase unpacks them in place, and moves the tar files aside. Then the mid-uplift phase rcsdeadifies the tree that was inside the tarball.
I have not stuck to my guidelines very strictly: my scripts delete quite a lot of cruft in the pre-uplift phase. In particular, they delete duplicated SCCS history files from the archives, and working files which are generated by scripts.
sccscommitters
SCCS/RCS/CVS all record committers by simple user IDs, whereas git uses names and email addresses. So git-cvsimport and cvs-fast-export can be given an authors file containing the translation. The sccscommitters script produces a list of user IDs as a starting point for an authors file.
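The authors file format accepted by git cvsimport is one `userid=Full Name <email>` line per committer. A sketch of turning a raw list of user IDs into a skeleton to be filled in by hand (the names and addresses below are placeholders, not real data, and this is not the actual sccscommitters script):

```python
def authors_skeleton(userids):
    """Produce skeleton authors-file lines in the format that
    git cvsimport (and cvs-fast-export) accept:
        userid=Full Name <email>
    The right-hand sides are placeholders to edit by hand."""
    return ["%s=Unknown <%s@example.org>" % (uid, uid)
            for uid in sorted(set(userids))]
```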
Uplifting cvs to git
At first I tried git cvsimport, since I have successfully used it before. In this case it turned out not to be the path to swift enlightenment - it was taking about 3s per commit. This is mainly because it checks out files from oldest to newest, so it falls foul of the same performance problem that my rcsappend program did, as I described above.
So I compiled cvs-fast-export and fairly soon I had a populated repository: nearly 30,000 commits at 35 commits per second, so about 100 times faster. The fast-import/export format allows you to provide file contents in any order, independent of the order they appear in commits. The fastest way to get the contents of each revision out of an RCS file is from newest to oldest, so that is what cvs-fast-export does.
There are a few niggles with cvs-fast-export, so I have a patch which fixes them in a fairly dumb manner (without adding command-line switches to control the behaviour):
- In RCS and CVS style, cvs-fast-export replaces empty commit messages with "*** empty log message ***", whereas I want it to leave them empty.
- cvs-fast-export makes a special effort to translate CVS's ignored-file behaviour into git by synthesizing a .gitignore file into every commit. This is wrong for the ip-register tree.
- Exporting the hosts.131.111 file takes a long time, during which cvs-fast-export appears to stall. I added a really bad progress meter to indicate that work was being performed.
Wrapping up
Overall this has taken more programming than I expected, and more time, very much following the pattern that the last 10% takes the same time as the first 90%. And I think the initial investigations - before I got stuck in to the conversion work - probably took the same time again.
There is one area where the conversion could perhaps be improved: the archived dumps of various subdirectories have been converted in the location that the tar files were stored. I have not tried to incorporate them as part of the history of the directories from which the tar files were made. On the whole I think combining them, coping with renames and so on, would take too much time for too little benefit. The multiple copies of various ancient scripts are a bit weird, but it is fairly clear from the git history what was going on.
So, let us declare the job DONE, and move on to building new DNS servers!