seq and xxd

March 18, 2010

Here are two interesting little tools, one of which should definitely be in your arsenal, and another that you still might find useful every now and again.

seq is, simply, a tool that prints out a sequence of numbers. For instance, run

seq -w 01 10

and it will output the numbers 01 through 10, one per line (shown here on a single line to save space):

01 02 03 04 05 06 07 08 09 10

This is useful, for example, when you want to run a command that affects a bunch of hosts:

for k in $(seq -w 1 10); do ssh host$k uptime; done

The -w option pads every number with leading zeros to the same width, so 01 is output instead of 1, and the loop above visits host01 through host10.
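Before pointing a loop like this at real machines, it can help to print the host names it would use. This is only a sketch: the "host" prefix is a made-up naming scheme, and echo stands in for ssh so nothing touches the network.

```shell
# Build the list of zero-padded host names; nothing connects anywhere.
# The "host" prefix is a hypothetical naming scheme used for illustration.
hosts=$(for k in $(seq -w 1 10); do echo "host$k"; done)

# Print the names that a real loop would hand to ssh.
echo "$hosts"
```

Once the list looks right, swapping the echo for the ssh command gives you the real thing.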

Another tool that I’ve found useful is xxd. In short, it produces a hex dump of a file.

[rmiller@pacific]# echo "hello world" > hello
[rmiller@pacific]# xxd hello
0000000: 6865 6c6c 6f20 776f 726c 640a            hello world.

The good part is that the -r option reverses the dump and recreates the original file.

[rmiller@pacific]# xxd hello > hello.out
[rmiller@pacific]# xxd -r hello.out
hello world

So you can use this tool to edit files byte by byte: dump the file, change the hex in a text editor, and run xxd -r on the result. One rather unusual use I’ve found for it was pasting an RPM onto a system that I only had serial console access to. I just ran xxd on the RPM, copied the dump into the buffer, and pasted it into a file on the remote server. A quick xxd -r, and voila. RPM.
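If you try the paste trick, it is worth checking that the file survived the round trip. Here is a minimal sketch that dumps a small file, rebuilds it, and compares checksums. The scratch directory and file names are made up for the example, and the whole thing is skipped if xxd isn't installed.

```shell
# Work in a scratch directory so nothing existing is overwritten.
work=$(mktemp -d)
printf 'hello world\n' > "$work/hello"

if command -v xxd >/dev/null 2>&1; then
    xxd "$work/hello" > "$work/hello.hex"           # binary -> text dump
    xxd -r "$work/hello.hex" > "$work/hello.copy"   # text dump -> binary
    # Compare checksums of the original and the rebuilt copy.
    if [ "$(cksum < "$work/hello")" = "$(cksum < "$work/hello.copy")" ]; then
        roundtrip=identical
    else
        roundtrip=different
    fi
else
    roundtrip=skipped   # xxd is not installed on this machine
fi

echo "round trip: $roundtrip"
rm -rf "$work"
```

On a real transfer you would run cksum on both machines and compare the two results by eye.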

You could also do this via kermit and other related protocols, but those are so ancient I never bothered figuring them out…

Hope these are a couple of useful tools for you.


An aside.

March 8, 2010

Today I’m not going to talk about any specific aspect of Linux, or any other operating system. Instead I’m going to talk about the OS from a more holistic perspective.

Operating systems, programs, and the like need to present themselves to humans in a way that humans can understand. This seems like a no-brainer, but it’s not, because once something presents itself to us in such a way, we start to think of it in terms of the interface that is presented to us. For example, in Linux, entries in the process table are presented as “processes”. Because of this presentation, we start to think of processes as their own entities, even, in a way, as living things. This is great when we need to understand how things work in a broad, generalized way. But when you get down and dirty into the guts of the OS, it actually hurts, because processes are NOT objects, not really. They are simply entries in a process table that the kernel uses to decide which tasks get CPU cycles.

Operating systems have – and by design – done such a great job of abstracting the kernel and its services that it’s just this entity that sits in the background, like a black box, controlling everything but you don’t even think about it. I think about my web browser, and my email client, etc., but I almost never think of the kernel.

This is great for an end user. It’s a really bad idea for a professional sysadmin.

The kernel may sit in the background, but it controls everything. It is “God”, in a sense. Everything that goes in and out of every subsystem is done with at least the awareness of the kernel and, except in situations such as DMA transfers (something I have seen go awry as well), with its full knowledge and permission. If something in the kernel is a little off, then at best the system becomes unstable, and at worst you lose all of your data.

You can’t afford to not understand how the kernel works.

This doesn’t mean that you have to dive into all hundred million lines of code (if they’re even available). What it does mean is you have to understand how all the subsystems work and fit together. And while you’ll never learn it all, the more you do, the more you will be able to handle those really sticky and incomprehensible things that just seem to happen, without pulling out what’s left of your hair.


SIGSTOP and SIGKILL

March 7, 2010

Sometimes you can find interesting things to post about just by looking at what people are searching for. Here’s an interesting search query that I thought I’d address, because I certainly understand wanting to understand the “why” of things just as much as the “how”. Today’s query is:

“why sigkill and sigstop signals cannot be ignored by a process in linux”.

The short answer is, because the process never even sees these signals. They control program execution.

Signals are, as I’ve mentioned, a form of interprocess communication: a way of telling a program that some action is required of it. But don’t forget that while programs can receive and act on signals, the operating system (Linux, in this case) is the entity responsible for dispatching them. Also don’t forget that a process is entirely under the control of the kernel, and exists only at its forbearance.

When the kernel receives (or even sends) a SIGKILL, it treats the signal as a statement that the process is in an unworkable state and should be terminated with extreme prejudice. The kernel never hands the signal to the process; it basically just stops the process and tears it down. The process never even knows what hit it.

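The same thing applies to SIGSTOP, except that the kernel suspends the process instead of terminating it; the process stays frozen until it is sent a SIGCONT. SIGSEGV is a different case: a process is allowed to install a handler for it, but the default action is to terminate the process, because once it oversteps its bounds it can’t be trusted.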

The only circumstances in which an unstoppable signal is deferred are when the process is in uninterruptible IO wait or otherwise stuck in kernel space. The signal stays pending and is acted on once the process leaves kernel space.
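You can watch the difference for yourself. The sketch below starts a throwaway child that ignores SIGTERM, shows that it survives one, and then shows that SIGKILL still takes it out. The one-second sleeps are just a crude way of giving the child time to react.

```shell
# Start a child that explicitly ignores SIGTERM.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
pid=$!
sleep 1                      # give the child time to install its trap

kill -TERM "$pid"            # politely ask it to exit
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    survived_term=yes        # still alive: the signal was ignored
else
    survived_term=no
fi

kill -KILL "$pid"            # the child gets no say in this one
status=0
wait "$pid" 2>/dev/null || status=$?

echo "survived SIGTERM: $survived_term"
echo "exit status after SIGKILL: $status"   # 128 + 9 when killed by SIGKILL
```

The shell reports a process killed by a signal as 128 plus the signal number, which is where the 137 comes from.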


htop

March 6, 2010

Well, I’m back to it. I haven’t blogged here for a year or so for multiple reasons, such as changing jobs, moving, etc. I am now working as a Systems Engineer for another Internet company in Irvine, CA.

Anyway, right to it. Here’s a right useful tool. htop.

htop is useful because it gives you an idea of usage per core instead of an average. And it’s pretty too.

I find it useful when attempting to debug Java problems. I had an issue with two separate Java programs that were hanging on a 16 core box. Running htop showed me that one of the cores was pegged at 100%, and this coincided with an error saying the heap space was set too low. So I upped the heap space, and voila. Fixed.

How is that helpful? The second java program just hung and gave me no such error. But having seen the same kind of problem in htop, I knew exactly how to fix it, and fix it I did. Thereby allowing me to save the day. All in a day’s work for a Sysadmin.



Process States

January 17, 2009

If you look in ps for a process, you will usually see the characters S or R… and sometimes others. But what do they mean?

The kernel contains something called a run queue. When a process is ready to run, it tells the kernel that it needs some cycles from the CPU. Once it does this, it is said to be “in the run queue”. Its status at this point is runnable, or R. (A process that is actually executing on a CPU also shows up as R.)

When the process is not in the run queue, for example, when it is waiting on input or on something else that does not require processing time from the CPU, it is said to be in the “sleeping” state (interruptible sleep), or S.

D state is particularly annoying. It means “uninterruptible sleep”, which is almost always a wait on IO: the process is stuck in a system call waiting for some IO to complete. When it is in this state, you cannot do anything with the process. You can send it signals, but they are simply queued until the process leaves D state. There are only two ways out of this state: either fix the condition that is causing it (it’s usually NFS related), or reboot the system. There are no other options.

T state, as I mentioned previously, means the process is stopped. You will need to send a SIGCONT to start the process again.

Z means the process is defunct, also known as a zombie. Either kill the parent, reboot the box, or live with it. A zombie is actually fairly harmless, except that it takes up a slot in the process table.

There are some other values as well. You can “man ps” to find out what they are (look under PROCESS STATE CODES). You’ll find the output of ps to be much more informative than you thought, once you know how to read it.
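To see a couple of these states side by side, here is a small sketch that starts one sleeping process and one busy loop and reads the state letter for each. It reads the letter straight from /proc/<pid>/status, so it assumes a Linux system with /proc mounted. The proc_state helper is something I made up for the example, not a standard command.

```shell
# Hypothetical helper: print the one-letter state the kernel reports for a PID.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

sleep 30 &                            # waits on a timer, needs no CPU
sleeper=$!
sh -c 'while :; do :; done' &         # spins, always wants the CPU
spinner=$!
sleep 1                               # let both children settle

sleeper_state=$(proc_state "$sleeper")
spinner_state=$(proc_state "$spinner")
echo "sleeping child: $sleeper_state"
echo "spinning child: $spinner_state"

# Clean up both children.
kill "$sleeper" "$spinner"
wait "$sleeper" "$spinner" 2>/dev/null || true
```

The sleeper should show up as S and the spinner as R.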


Signals

January 16, 2009

Signals are one of the most visible aspects of the Linux operating system. They are also one of the least understood. Every sysadmin, even the PFYs who aren’t PFed yet, knows how to kill a process. But do you know how this works underneath? Do you know how flexible the Linux signalling system truly is?

If not, you’re about to find out.

Signals are yet another one of those kernel interfaces, like system calls and device drivers. They are not IPC in the sense that they cannot, by themselves, be used to send information to a program. They are basically an asynchronous way of telling a program that something is expected of it. They are also the kernel’s way of telling a program that something has happened.

There are three signals that are the most common, and a few more that are less common but just as important. These are:

Critically important

  • SIGKILL (9) – Terminate a process. Noninterruptible.
  • SIGSEGV (11) – Segmentation Fault.
  • SIGTERM (15) – Terminate a process in an orderly fashion.

And then there are the less well known but at least as important:

  • SIGBUS (7) – Bus Error
  • SIGCHLD (17) – Child process terminated
  • SIGSTOP (19) – Stop executing
  • SIGCONT (18) – Continue executing after a stop

All of these different signals have a specific meaning. (The numbers shown are the ones used on x86 Linux; some of them differ on other architectures.)

SIGSEGV is one you’ll be very familiar with. It means “Segmentation Fault”. This is actually triggered by something very deep in the hardware itself, but is usually caused by a careless programmer. It is raised when a program attempts to write to or read from memory that it has not actually been given. I’ll go into more details on that when I write about virtual memory.

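This signal can be caught, but there is rarely anything useful a handler can do beyond logging the failure and exiting. The default action is to terminate the process and, if core dumps are enabled, write a core file.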

SIGTERM and SIGKILL are two ways of saying “kill the process”. The difference is that SIGTERM can be caught or even ignored, so the process can decide not to listen to this signal. It does not have the same option when it comes to SIGKILL. When you run kill without specifying a signal, a SIGTERM is sent. When you run kill -9, a SIGKILL is sent.

Because SIGKILL can’t be caught or ignored, the process does not have the ability to clean up after itself, and whatever it was doing at the time is left in an indeterminate state. Think of it this way: a SIGTERM is quitting time for the day, and you get to pack up and take everything with you. A SIGKILL is like a fire alarm: you drop everything and leave the building.
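Here is what “packing up” can look like in practice. This is a sketch, not a recipe: a background subshell installs a SIGTERM handler that writes a note to a scratch file and exits cleanly. The marker file and the message are made up for the example.

```shell
marker=$(mktemp)             # scratch file the handler will write to

(
    # On SIGTERM: leave a note, then exit cleanly.
    trap 'echo "cleaned up" > "$marker"; exit 0' TERM
    while :; do sleep 1; done
) &
pid=$!
sleep 1                      # let the subshell install its handler

kill -TERM "$pid"            # quitting time
status=0
wait "$pid" || status=$?

cleanup_note=$(cat "$marker")
echo "exit status: $status, note: $cleanup_note"
rm -f "$marker"
```

Send the same subshell a SIGKILL instead and the note never gets written.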

SIGBUS is a bus error. This also originates deep in the hardware, but you’ll get this either under the same circumstances as SIGSEGV, or when hardware is failing. It’s just as catastrophic to a program as SIGSEGV.

SIGCHLD is the reason zombie processes exist. When a Linux process spawns a child (I’ll go into this process some other time), it basically owns the child. When the child dies, the parent process is notified of this fact via a SIGCHLD signal. The parent process is required to call the wait() system call in order to “reap” the child process. Between the moment the child dies and the moment the parent reaps it, the child exists only as an entry in the process table. This is what is known as a defunct process, or zombie. So if you see a defunct process that sticks around, one of two things has happened: the parent process is unable to reap the child, or whoever wrote the parent process screwed up.
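If you want to see a zombie without waiting for buggy software to make one, the following sketch produces one on purpose. A shell starts a short-lived child and then replaces itself with sleep, which never calls wait(). The child dies, nobody reaps it, and it sits there as a zombie. The script assumes Linux with /proc mounted and finds the child by scanning for the parent's PID.

```shell
# The shell forks a short-lived child, then execs sleep, which never reaps it.
sh -c 'sleep 0.2 & exec sleep 5' &
parent=$!
sleep 1                      # by now the child has exited and is a zombie

# Look for a process whose parent is $parent and report its state letter.
zombie_state=$(
    for f in /proc/[0-9]*/status; do
        awk -v ppid="$parent" '
            /^State:/ { state = $2 }
            /^PPid:/  { if ($2 == ppid) print state }
        ' "$f" 2>/dev/null
    done | head -n 1
)
echo "state of the unreaped child: $zombie_state"

# Killing the parent lets init adopt and reap the zombie.
kill "$parent"
wait "$parent" 2>/dev/null || true
```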

I really have no idea why it was designed this way – I’m sure there’s some historical reason that will make perfect sense once I hear it, but it seems like an extra step to me.

SIGSTOP and SIGCONT are two special signals. SIGSTOP is sent to a process to tell it to stop. At this point, if you run ps on the process, it will show up with a status of “T”. Then, it will start executing again when you send SIGCONT.
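A quick way to convince yourself of this is to stop and restart a throwaway process and watch its state change. The sketch below does that with a sleep, reading the state letter from /proc/<pid>/status (so it assumes Linux). The proc_state helper is a made-up convenience function.

```shell
# Hypothetical helper: print the one-letter state the kernel reports for a PID.
proc_state() { awk '/^State:/ { print $2 }' "/proc/$1/status"; }

sleep 30 &
pid=$!
sleep 1

kill -STOP "$pid"                       # freeze it
sleep 1
state_after_stop=$(proc_state "$pid")

kill -CONT "$pid"                       # thaw it
sleep 1
state_after_cont=$(proc_state "$pid")

echo "after SIGSTOP: $state_after_stop"
echo "after SIGCONT: $state_after_cont"

# Clean up.
kill "$pid"
wait "$pid" 2>/dev/null || true
```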

Strace and other processes that attach to a running process use these signals.

There are many other signals as well. SIGUSR1 and SIGUSR2 are some pretty interesting ones, because they’re user defined. Some processes will listen for these signals and do some interesting things, such as increasing logging, for example.

Look in /usr/include/asm/signal.h for a complete list of signals, or run kill -l.
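kill -l will also translate a single number into a name, which is handy when a log only gives you the number. A small sketch, using the three numbers from the “critically important” list above:

```shell
# Translate signal numbers into names using the shell's kill builtin.
for n in 9 11 15; do
    echo "$n = SIG$(kill -l "$n")"
done
```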


strace

January 15, 2009

I’m getting a lot of posts in tonight so that I won’t feel so bad when I wait till the weekend to write more. :-) Besides, this topic is important enough that I just want to get it out there.

Strace is one of the single most important troubleshooting tools you will ever use. And I say that without a trace of hyperbole. A few years ago I had a client who had been doing UNIX Administration for a very long time – longer, I think, than I had. However, it seemed like every time he called me with a problem that he couldn’t solve, I’d have it fixed in a few minutes. Finally, with amazement in his voice, he asked me what my secret was. And I showed him how to use strace.

Strace is, simply put, a way to “spy” on programs to see how they are interacting with the kernel. It does this by attaching to the process using a special debug interface (ptrace), and watching which system calls are called (and with which arguments), and what the return codes are. In many cases, if you know which system call is returning an error, and which arguments it is passed, figuring out what needs to be done to fix the problem becomes almost trivial.

Let’s do an example. Here’s an strace of a touch command that fails.

-bash-3.2$ touch blah/blah
touch: cannot touch `blah/blah': No such file or directory
-bash-3.2$ strace touch blah/blah
execve("/bin/touch", ["touch", "blah/blah"], [/* 21 vars */]) = 0
brk(0) = 0x9f3a000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=44525, ...}) = 0
mmap2(NULL, 44525, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f0f000
close(3) = 0
open("/lib/librt.so.1", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\2008\252\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=44060, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0e000
mmap2(0xaa2000, 33324, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xaa2000
mmap2(0xaa9000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6) = 0xaa9000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000?\213\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1602320, ...}) = 0
mmap2(0x89e000, 1320356, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x89e000
mmap2(0x9db000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13d) = 0x9db000
mmap2(0x9de000, 9636, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x9de000
close(3) = 0
open("/lib/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360g\241\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=125744, ...}) = 0
mmap2(0xa12000, 90592, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xa12000
mmap2(0xa25000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12) = 0xa25000
mmap2(0xa27000, 4576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xa27000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0d000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f0d6c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xaa9000, 4096, PROT_READ) = 0
mprotect(0x9db000, 8192, PROT_READ) = 0
mprotect(0xa25000, 4096, PROT_READ) = 0
mprotect(0x895000, 4096, PROT_READ) = 0
munmap(0xb7f0f000, 44525) = 0
set_tid_address(0xb7f0d708) = 31920
set_robust_list(0xb7f0d710, 0xc) = 0
rt_sigaction(SIGRTMIN, {0xa163d0, [], SA_SIGINFO}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0xa162e0, [], SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
uname({sys="Linux", node="katesama.duskglow.com", ...}) = 0
brk(0) = 0x9f3a000
brk(0x9f5b000) = 0x9f5b000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=56471120, ...}) = 0
mmap2(NULL, 2097152, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7d0d000
close(3) = 0
close(0) = 0
open("blah/blah", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK|O_LARGEFILE, 0666) = -1 ENOENT (No such file or directory)
futimesat(AT_FDCWD, "blah/blah", NULL) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/locale.alias", O_RDONLY) = 0
fstat64(0, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f19000
read(0, "# Locale name alias data base.\n#"..., 4096) = 2528
read(0, "", 4096) = 0
close(0) = 0
munmap(0xb7f19000, 4096) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "touch: ", 7touch: ) = 7
write(2, "cannot touch `blah/blah\'", 24cannot touch `blah/blah') = 24
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
) = 1
close(1) = 0
exit_group(1) = ?
-bash-3.2$

The important part is near the end of the output. Note the open() system call, where touch tried to open blah/blah (the directory blah doesn’t exist, so it fails), and its return code, ENOENT. ENOENT is a system error code. You can find a nearly comprehensive list of them in /usr/include/asm-generic/errno-base.h and /usr/include/asm-generic/errno.h, or you can do a man errno if you don’t care what the actual numbers are. It means, literally, “No such file or directory” (the touch program translates the code into that human-readable message).

But you are literally seeing what you’ve asked the kernel to do, how it responded, and why it said it couldn’t do it.

My example is pretty simple. But what about programs that don’t log well, or at all? What about programs that spit out generic errors that you don’t understand and mean nothing to you? That’s where strace becomes valuable.

Strace has a few useful options.

-f tells strace to follow forks. Many processes will spawn children, and by default strace won’t attach to them. It will with this option.

-s <num> sets the maximum length of the strings strace prints. The default is 32 characters, so a larger value shows you more of each filename or buffer (a smaller one shows less, but you probably won’t use it that way).

-e <system call> lets you name one or more system calls (for example, -e trace=open,read) so that strace prints only those and ignores the rest. You can see how verbose strace is, so this will help you in some cases to pare away the garbage and find what you need.
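Putting the three options together looks something like the sketch below. It traces a touch that is guaranteed to fail and prints only the lines mentioning ENOENT. A few assumptions: the target path is made up, -e trace=file is the class that covers every call taking a file name (rather than naming open individually), and the whole thing falls back to a message if strace isn't installed.

```shell
# A path whose parent directory is assumed not to exist.
target=/nonexistent-dir/blah

if command -v strace >/dev/null 2>&1; then
    # -f: follow forks; -s 200: print up to 200 bytes of each string;
    # -e trace=file: only show system calls that take a file name.
    trace_output=$(strace -f -s 200 -e trace=file touch "$target" 2>&1 || true)
else
    trace_output="strace is not installed on this machine"
fi

# Show just the failures; if there are none, show everything we captured.
echo "$trace_output" | grep ENOENT || echo "$trace_output"
```

Compared with the full trace above, this cuts the output down to a handful of lines, and the failing call is easy to spot.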

Strace is not a panacea, though. There are two circumstances where it’s not advised. The first is when you try to run an suid program: for security reasons, a program started under strace by an ordinary user runs without its setuid privileges, so it won’t behave as it normally would. The other is when the program you are running is time critical. Strace allows most programs to run at an acceptable speed, but it does noticeably slow them down.

Also, sometimes when you quit strace, a program will continue to think it’s in attached state and halt entirely. You can fix this by sending the program a SIGCONT (kill -CONT <pid>).

This is one tool you simply cannot afford to leave out of your bag of tricks. Strace. Don’t have root without it.


Different kinds of files

January 15, 2009

Linux has many different kinds of files. First let’s start with a little more basic discussion: what is a file?

Basically, a file is anything that can have a file descriptor associated with it.

What is a file descriptor?

Ahh. Glad you asked. Sit down, this could take a bit.

Linux is, for the most part, POSIX compliant, which means its API (Application Programming Interface) is consistent with a set of standards developed for Unix a long time ago. POSIX defines a set of system calls (system calls are basically a way of requesting services from the kernel) and library calls. Anything that is POSIX compliant is going to have the same basic core API, although the standards are vague enough that there is a little wiggle room here and there.

There are some important system calls when it comes to file manipulation. Four of them are:

  • open
  • read
  • write
  • close

These system calls make up the foundations of file manipulation, although there are other calls that are just as important to do things like erasing or moving a file.

A file descriptor is a number returned from an open() system call (or from a related call such as socket() or pipe()). That’s all it is: a number. However, once a file is opened, that descriptor is used to tell the kernel which opened file you are trying to operate on. The descriptor is passed into any other system call that is referencing that file, such as read(), write(), and close().

So, basically, a file is anything you can open using open().
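You can play with file descriptors directly from the shell, which makes the open/write/read/close cycle easy to see. In this sketch the descriptor numbers 8 and 9 are arbitrary choices, and the scratch file is made up. In a C program the kernel picks the number for you; the shell lets you pick it yourself.

```shell
scratch=$(mktemp)

exec 8> "$scratch"            # open the file for writing as descriptor 8
echo "hello via fd 8" >&8     # write to descriptor 8
exec 8>&-                     # close descriptor 8

exec 9< "$scratch"            # open it again for reading as descriptor 9
read -r line <&9              # read one line from descriptor 9
exec 9<&-                     # close descriptor 9

echo "$line"
rm -f "$scratch"
```

Notice that after the first exec the file name is never mentioned again. Everything goes through the number.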

You will find that nearly everything in Linux is a file – including network connections (though these don’t appear on the filesystem, you interact with them in nearly the same way as you do a regular file).

There are several different types of files.

  • File
  • Directory
  • Link
  • Named pipe
  • Block special file
  • Character special file
  • Socket

All of these file types ARE files, but they show up differently when listing a filesystem (the first character of the permissions shows you what kind of file it is) and more importantly behave differently when you try to operate on them.

A regular file (indicated by a “-”) is just that, a regular file. You can write to it, read from it, erase it, or whatever.

A directory (indicated by a “d”) is basically a file that contains a list of other files. It’s still a file, however.

A symbolic link (indicated by an “l”) is a file that points to another file, in such a way that the libraries and OS know how to follow it.

A named pipe (indicated by a “p”) is basically a FIFO (first in first out) that is exposed on the filesystem. These are used in interprocess communication: a process can have it open for reading, for example, while another has it open for writing. A socket (indicated by an “s”) is similar to a named pipe, except that data can flow in both directions.

Character special and block special files (indicated by a “c” and a “b”) are both ways to interface with kernel devices. For example, /dev/null is a character special file. When you write into it, the kernel takes the bytes and dumps them into the bitbucket. Other drivers do different things, for example, /dev/tty. When you do an ls -l of one of these, you’ll see a device major and device minor number. These numbers are the kernel’s way of keeping track of what goes where. You could rename /dev/null to /dev/Bush if you wanted to, and as long as it had the same major and minor numbers it would behave identically. The kernel doesn’t care what it’s called, only what it is.
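To see those type characters for yourself, here is a sketch that creates one of each file type that doesn't need root and prints the first character of its ls -l line. The scratch directory and the type_char helper are made up for the example.

```shell
dir=$(mktemp -d)

# Hypothetical helper: print the first character of the ls -l line for a path.
type_char() { ls -ld "$1" | cut -c1; }

touch "$dir/regular"                 # regular file
mkdir "$dir/directory"               # directory
ln -s regular "$dir/symlink"         # symbolic link
mkfifo "$dir/fifo"                   # named pipe

regular_type=$(type_char "$dir/regular")
directory_type=$(type_char "$dir/directory")
symlink_type=$(type_char "$dir/symlink")
fifo_type=$(type_char "$dir/fifo")
null_type=$(type_char /dev/null)     # character special file

printf '%s regular\n%s directory\n%s symlink\n%s fifo\n%s /dev/null\n' \
    "$regular_type" "$directory_type" "$symlink_type" "$fifo_type" "$null_type"

rm -rf "$dir"
```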

Now that you understand what the different types of files are, how about a little tip on how to use them?

You probably already know about “ls”, so I’m not going to go into it. But did you know about lsof? lsof will show you all of the open files on your system – including network connections. (Remember I told you that network connections were files too? Here’s proof).

Another useful little command is mknod. This is how you create the character special and block special files (though don’t do it directly if you can avoid it, use MAKEDEV instead). This is useful to know if you, somehow, end up with /dev/null as a regular file. (It happens).
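If you ever do have to rebuild /dev/null by hand, you need its major and minor numbers. The sketch below reads them from a healthy system using GNU stat (an assumption; other stat implementations use different flags). The mknod line is left as a comment because it needs root and you really don't want to run it by accident.

```shell
# %t and %T print the major and minor device numbers in hex (GNU stat).
null_numbers=$(stat -c '%t:%T' /dev/null)
echo "/dev/null major:minor = $null_numbers"

# Recreating the device by hand would look like this (needs root):
#   mknod -m 666 /dev/null c 1 3
```

On Linux, /dev/null is character device 1, 3.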

And don’t forget about the simple but tried-and-true command, ln. This creates symbolic links if given with the -s option, and hard links if not (but I’m not going to go into what those are right now).

Unexpectedly complex, huh? You’ll find every aspect of the Linux OS to be like that – a beguiling simplicity overlaying a fiendishly complex nest of interrelated subsystems.

It’s worth it to know all of these things, though. You never know when that kind of knowledge will come in handy.


Simple tip: Inodes

January 15, 2009

Sometimes when on a Linux system, you will encounter a problem with running out of disk space when it appears that you haven’t. This is because there are actually two different resources on a Linux filesystem.

  • Disk space – this is the total amount of space allotted to all of your files.
  • Inodes – this is the number of files you can create on the filesystem.

You might be running out of inodes. This will usually happen if you have a lot of small files. You can find out by running

df -i

And you will get output similar to this:

---> root@machine (0.04)# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup00-LogVol00
                     2080768   46917 2033851    3% /
/dev/xvda1             26104      35   26069    1% /boot
none                  262144       1  262143    1% /dev/shm

On some Linux filesystems, you can change this value on a running filesystem. It’s not recommended to try doing so on ext2/ext3, and the option doesn’t even exist on some later versions of tune2fs. The best way to increase this value is to copy your data off, rebuild the filesystem with a larger inode count (using the -N option to mke2fs), and copy the data back on. Or you could find another drive or storage medium and mount it underneath your root directory, thereby giving you another disk full of inodes to use. The path you’ll want to take is dependent on your own circumstances and goals.
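Once df -i tells you that you are short on inodes, the next question is which directory is eating them. Here is a rough sketch: it pulls the IUse% figure for / and then counts files under a directory. The count_files helper and the scratch directory are made up for the example. On a real system you would point count_files at suspects like mail spools or cache directories.

```shell
# -P keeps each filesystem on one line, so the fifth column is always IUse%.
inode_use=$(df -P -i / | awk 'NR == 2 { print $5 }')
echo "inodes in use on /: $inode_use"

# Hypothetical helper: count regular files under a directory,
# without crossing into other filesystems.
count_files() { find "$1" -xdev -type f | wc -l | tr -d ' '; }

# Demonstrate on a scratch directory holding five empty files.
demo=$(mktemp -d)
for i in 1 2 3 4 5; do : > "$demo/file$i"; done
demo_count=$(count_files "$demo")
echo "files under $demo: $demo_count"
rm -rf "$demo"
```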


Backups

January 11, 2009

This is one of those topics that separates the men from the boys, so to speak. Backups. Always, always, always have them.

Do not treat RAID arrays as if they are inviolable. It’s always a possibility that more than one disk could fail. And it’s also a possibility that the hardware itself could fail, thus corrupting the array. RAID is more fault tolerant than one disk, but only just. Don’t depend on it. A high-traffic company learned this a few weeks ago, taking no backups and depending on RAID mirroring. I’d be surprised if the sysadmin who worked there ever works again. Not doing at least some kind of backup is stupid, and possibly even negligent.

It doesn’t really matter what medium you put the backups on. If you have a small amount of critical data, you might consider putting it on CDs or DVDs and sending it offsite. If you have a lot of servers, you might want to consider tape. For my personal systems, I just copy it off onto hard drives that aren’t physically onsite. Restoring through a DSL connection is a pain, but it beats losing everything.

There are several ways of doing backups, and the good news is for Linux systems all of the software you need comes with your system. You can use scp (which is good for brute force backups). You can use rsync (which will keep your backups up to date, good for having a recent copy, not good for having archival copies you can go back to). mkisofs and cdrecord are good tools to create your backups to send to DVD. There are also enterprisey systems like Amanda. There are lots of different ways to do it and I won’t go into them here, but you can feel free to put your favorite method in the comments if you think it will help.
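For the rsync case, the smallest useful version looks something like this sketch. Both directories are scratch directories created just for the demonstration; in real life the destination would be another disk or another machine. It is skipped if rsync isn't installed.

```shell
src=$(mktemp -d)     # stands in for the data you care about
dest=$(mktemp -d)    # stands in for the backup drive
echo "important data" > "$src/file.txt"

if command -v rsync >/dev/null 2>&1; then
    # -a preserves permissions and timestamps; --delete removes files from
    # the copy that no longer exist in the source, keeping it an exact mirror.
    rsync -a --delete "$src/" "$dest/current/"
    backup_copy=$(cat "$dest/current/file.txt")
else
    backup_copy="rsync not installed"
fi

echo "backup contains: $backup_copy"
rm -rf "$src" "$dest"
```

Remember that a mirror like this faithfully copies your mistakes too, which is why it is good for a recent copy and not for an archive.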

The main thing that I want to get across in this particular article is – do it! It doesn’t take much time to set it up, and you’re not going to think much about it until things die – at which point it will (maybe even literally, depending on whose data you’re storing) save your life.
