First Time Linux

Synchronizing files

For those people with more than one computer, synchronizing them is a common problem. How do you keep them up to date, how do you make sure you've got the most recent changes you made, and most important of all, how do you avoid accidentally copying the wrong version of a file and losing hours or days of work?

When I just had a laptop, this was no problem - all the files were on there. Then it sadly died, and I got the Shuttle desktop - also no problem since all the files were on there. Now I've got this new netbook, and the problem has suddenly become real. How can I work on files on the netbook and make sure that I don't have to spend hours copying them (and just them!) back onto the desktop?

There are of course many products on the market to help with this problem - it's the same problem you get when synchronizing appointments and contact information with a mobile phone, for example, and a similar one to organising off-site backups of recently-changed files. On this page we'll have a look at a command line tool called rsync.

Obviously the end goal is to set up a synchronization process so that it automatically does its stuff over the network. It should be fast, efficient, and secure. But in order to get there we'll need a few baby steps first, and this means using rsync to synchronize the contents of two folders on the same machine. Once that's done, then we can look at the networking aspects later.

Locally running rsync

In order to demonstrate how rsync works in a simple way, we'll construct two directories of files on the same machine, and use rsync to synchronize them. Here's an outline of the two directories. We can imagine that in the future, one directory will be on a desktop machine and the other will be on a laptop or netbook. Files have been edited on both machines and we want to efficiently synchronize them so we've got the latest versions on both systems and (hopefully) don't lose work.

Desktop

file1.txt:
  This is the main version of the first file.
  It's quite simple.
file2.txt:
  This is the second file.
  This won't be modified by either side.
file3.txt:
  This one will be changed on the host but not on the laptop.
  See, now the main copy has been changed on the desktop.
file4.txt:
  (doesn't exist on the desktop)
file5.txt:
  This file only exists on the host, so it's like it's been deleted from the second set.

Laptop

file1.txt:
  This is the second version of the first file.
  It's now been edited, with some extra text added to it.
file2.txt:
  This is the second file.
  This won't be modified by either side.
file3.txt:
  This one will be changed on the host but not on the laptop.
file4.txt:
  This file is new in the second set.
file5.txt:
  (doesn't exist on the laptop)

What's important to note here is that rsync is not a version control system. When it compares two versions of a file, it simply takes the more recent one. It can't tell whether both copies have been edited independently - if they have, the one which was saved more recently wins, and any edits in the other version will be lost.

Similarly, if we look at file5.txt, rsync will not be able to tell whether the file was created on the desktop, or whether it used to be on both systems and has now been deleted from the laptop. So there's no way for rsync to know whether to delete the one on the desktop, or copy it to the laptop. Obviously, copying to the laptop is the safer thing to do, but then that might be frustrating if you explicitly wanted to delete it and it comes back again (and again) after each synchronize. But that's the price you pay for using a synchronizing tool instead of a version control tool. Pick what you want to use.

Ok, now we've got our two directories, what do we want to do? Let's synchronize in both directions, firstly from the laptop to the desktop, and then from the desktop to the laptop - in this case the order doesn't really matter. The options -tuv specify to preserve timestamps (t), only update files which are newer on the source side (u), and be verbose about what's being done (v).

$ rsync -tuv laptop/* desktop/
file1.txt
file4.txt

$ rsync -tuv desktop/* laptop/
file3.txt
file5.txt

Note that file1.txt (edited on the laptop) and file4.txt (created on the laptop) are successfully copied to the desktop. Also, file3.txt (edited later on the desktop) and file5.txt (created on the desktop / deleted on the laptop) are copied to the laptop. Also note that file2.txt was recognised to be the same on both and so wasn't transferred at all. The whole thing is done using clever comparison algorithms and checksums, so the transfers are very efficient, and they can also be compressed if you want.
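
For example, adding the z option compresses the data during the transfer. It won't gain anything between two folders on the same disk, but it becomes useful later when synchronizing over the network:

rsync -tuvz laptop/* desktop/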

Obviously, if you really did want to delete file5.txt, you'd have to manually delete it on both the laptop and the desktop before the next time rsync is run.

Additional options

The commands shown above only synchronize the files in the given directories, but none of their subdirectories. Fortunately, this is easily fixed by adding the -r option to the commands, and then all subdirectories will be properly handled as well.

But perhaps the files in one of the directories are under CVS or Subversion, in which case there will be lots of other little files there which don't need synchronizing. You can easily omit them from consideration by simply adding the -C option as well (that's a capital letter C).
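
For example, the laptop-to-desktop command from earlier, with both of these options added, would look something like this:

rsync -Ctuvr laptop/* desktop/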

Instead of just synchronizing every file, we could say we only want to synchronize .txt files, simply by putting laptop/*.txt instead. However, that would stop us from doing a recursive synchronize with the -r option, because the subdirectories wouldn't match our filter of *.txt - irritating. Instead you'd have to work with a more complicated and awkward set of include and exclude filters, such as:

rsync -tuvr --include='*/' --include='*.txt' --exclude='*' laptop/* desktop/
This could also be useful for specifying just Java files or just HTML files, for example.

Another useful option while you're experimenting with rsync is a so-called "dry run", which doesn't actually copy any files, but just reports what would be synchronized if it were run properly. This option is called --dry-run, or -n in its short form, and is especially useful together with the v option for verbose output. If the output is what you expect, then you can simply repeat the command without the -n option.
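
For example, to preview the recursive synchronize from above without actually copying anything:

rsync -Ctuvrn laptop/* desktop/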

TODO: rsync between machines with a USB stick

Over the network

Of course, if all rsync could do was synchronize folders on the same machine, it wouldn't solve our problem with the laptop and the desktop. Somehow we've got to get it working over the network. And in this particular case it's going to be a wireless network.

There are some intermediate steps to go through first though, to make sure that our synchronization will be secure and simple.

Setting up ssh

We'll use ssh (secure shell) to connect from the laptop to the desktop. The laptop already has the openssh-client installed by default, so all we need to do is install the ssh server on the desktop. The package we need is conveniently called ssh so aptitude can get it simply from the repositories.
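
On a Debian-based system like this one, that would be something along the lines of:

sudo aptitude install ssh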

To test this out, we try the following from the laptop:

ssh user@192.168.x.x

Here, obviously, the user and the IP address must match the details of the server. If your username is the same on both machines then you don't need to give it here - just the IP address will do. Then simply enter the password for that user on that machine, and you're in with a console prompt - you can list files and directories as if you were sitting in front of the other machine, and of course edit them using a console editor such as vi.

At this point it doesn't seem too impressive - it's not much different from having access to a shared network drive which is old news. However, with ssh you really are logged into the remote machine, and can even start programs, for example try playing an ogg file with totem from the command line - what do you think will happen?

A neat trick is to forward the graphical output of programs running on the remote machine so that their windows appear on the local display. This needs the -X option when starting ssh, so the command would look like this:

ssh -X 192.168.1.1

Then you can start, for example, OpenOffice Writer (oowriter) from the console, and even though it isn't installed on this laptop, it runs on the remote machine and shows its display on the laptop. Very clever. When you save the file, of course, because the program is running on the remote machine it'll save to the file system on the remote machine too.
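
Putting those two steps together, a session might look something like this:

ssh -X 192.168.1.1     # log in with X forwarding enabled
oowriter               # runs on the remote machine, but the window appears here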

Using scp

Accessing the desktop from the command line is great, but what about copying files from one machine to the other? That's the next step, using scp or "secure copy":

scp -p path/to/sourcefile ip.add.re.ss:newfilename

The previous command copies a single file from the local machine (in this case the laptop) onto the specified remote machine (here the desktop). The new file will be created in the home directory in this example.

A second example shows copying a file from the other machine into the current directory on the local machine - in this case from the desktop onto this laptop:

scp -p ip.add.re.ss:path/to/remotefile ./
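
As with ssh, you can give an explicit username if it's different on the two machines, and a target directory on the remote side (which must already exist) - for example, with a made-up path:

scp -p path/to/sourcefile user@ip.add.re.ss:path/to/targetdir/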

Fixing computer names

Up until now we've referred to the other machine using its IP address, which is a bit clumsy. First we have to find out what this address is, and because it's obtained via DHCP from the router, it might change next time we try the same command. So it's good to have a way to refer to the machines by name instead.

One way to solve this is to configure the desktop with a fixed IP address instead of using DHCP. Then this fixed address could be stored on the laptop and associated with a memorable name. This is a little ugly though, because as soon as you get more than one fixed IP address, you have to worry about multiple configuration places and the possibility of conflicts.

A more elegant solution is to go into the router config and set up an IP address "reservation", so that the desktop still uses DHCP but the router recognises its MAC address and always gives it the same, specified IP address. Then the IP address configuration is in one place, on the router, and other clients can still use DHCP without conflicts.

You still need to let the laptop know about this though, so you need to add a line to its /etc/hosts file specifying the IP address and its associated name. You can then ping and ssh to the desktop using just that name. For convenience this should be the same as the hostname of the other machine, but that isn't required.
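
The line in /etc/hosts is simply the IP address followed by the name - something like this, using a made-up address and the name we'll use below:

192.168.1.2     mahogany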

So the ssh and scp commands now look like this:

ssh mahogany                                          # Secure login to the machine called mahogany
scp -p path/to/sourcefile mahogany:newfilename        # Copy a file from here to mahogany
scp -p mahogany:path/to/remotefile ./                 # Copy a file from mahogany to here

Now we're nearly there, we just need to run rsync over ssh and we've got our synchronization working!

rsync over ssh

For this we just need to add the option -e ssh to specify that the rsync connection should go via ssh to the other machine. Then we specify the paths as before, to go in both directions:

rsync -Ctuv -e ssh localdir/* mahogany:remotedir/
rsync -Ctuv -e ssh mahogany:remotedir/* localdir/
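
Once you're happy with the options, these two commands could go into a little script so that the whole synchronize becomes a single command - a minimal sketch, using the same example paths as above:

#!/bin/sh
# Two-way synchronize between localdir here and remotedir on mahogany
rsync -Ctuv -e ssh localdir/* mahogany:remotedir/
rsync -Ctuv -e ssh mahogany:remotedir/* localdir/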

Using an authentication key

When using scp or rsync over ssh, obviously you need to authenticate yourself somehow, and the default way to do this is using a password. But if you're doing a lot of copying and rsyncing, typing in the same password dozens of times can get annoying. Instead, you can set up an authentication key, so that you're automatically authenticated without having to enter a password.

The basic way it works is that on the client, you generate a two-part key, consisting of a private part and a public part. The private part stays on the client machine, but the public part (which isn't secret) gets copied to the host machine. In our case the netbook is the client and the desktop is the host. It sounds a little risky to remove the password protection on the desktop, but with a little thought it's clear that it's not a problem.

Only the netbook has the private part of the key, so if somebody hasn't got the netbook, they can't get into the host without a password. The other users of the netbook haven't got read rights to the private key, so they can't use it either. And if the netbook is stolen, you can simply revoke the authentication by removing the public key from the desktop. Only if the netbook is silently and completely compromised does it pose a problem for the desktop, but then there are more serious implications if that happens. They'd probably have difficulties getting into the wireless network anyway, as the desktop isn't visible outside.

You may also wonder why it's not a problem for somebody else to initiate such a scheme and get themselves automatically authenticated like this. The trick is that it requires somebody to be able to plant their public key in the user home of the desktop. And if they can do that, then they're already in the desktop anyway.

So how to set it up? Basically it's as simple as running the command ssh-keygen on the client machine, then copying the public keyfile across to the host (for example, using scp) and adding it to the ~/.ssh/authorized_keys file there. There's a concise walkthrough at Wikipedia, and everything else you need is in man ssh-keygen.
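
As a rough sketch, the steps from the netbook would look something like this (the filenames assume an RSA key with the default name):

ssh-keygen -t rsa                      # generate the key pair; an empty passphrase gives the password-free login described above
scp -p ~/.ssh/id_rsa.pub mahogany:     # copy the public part to the desktop
ssh mahogany 'mkdir -p ~/.ssh && cat id_rsa.pub >> ~/.ssh/authorized_keys'

Many systems also have an ssh-copy-id command which does the copying and appending in one step.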

And tada! No more password required for ssh, or scp, or rsync! Result!