Sunday, November 19, 2006

Backing up Blogger (beta)

I upgraded this blog a few weeks ago and forgot about my old automated backup script, which has not worked well since. It is scheduled to run weekly, and since the upgrade it runs nearly forever. It also chews up over 4 gigabytes of space. I have no idea where it would stop, since I typically notice my Linux server running slowly and stop the script before it finishes.

Anyway, I finally fixed up the backup script, and since I didn't find one quite like it, here it is:

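#!/bin/sh
# Run from the script's own directory so the archive lands next to it.
cd "$(dirname "$0")" || exit 1
TMP=/tmp/blogbackup
# Start from a clean scratch directory (leftovers from a failed run are discarded).
rm -rf "$TMP"
mkdir -p "$TMP"
cd "$TMP" || exit 1

# Fetch the front page, which links to every monthly archive.
wget --progress=dot robgreene.blogspot.com
# Pull out the monthly archive URLs (first match on each line).
gawk '/[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_archive\.html/ {
    if (match($0, "http://robgreene\\.blogspot\\.com/[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_archive\\.html"))
        print substr($0, RSTART, RLENGTH)
}' <index.html >toget
# Download each archive page plus its images, CSS, and JavaScript.
wget --progress=dot --page-requisites --span-hosts --convert-links \
    -erobots=off --input-file=toget

# Return to the script directory, pack up the download, and clean up.
cd - || exit 1
tar czvf "robgreene-blog-$(date +'%Y%m%d').tgz" "$TMP"
rm -rf "$TMP"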

Be careful of the word wrap.

Basically, I grab my index page and then search it for the monthly archive links of the form ####_##_##_archive.html (# = digit). Once those are found, the script downloads the monthly archives along with all of their images, CSS, JavaScript, and other page requisites.
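To try the link-matching step on its own, without touching the network, something like the following should work. It is only a sketch: the helper name extract_archive_links and the sample HTML line are made up for illustration, and it swaps gawk for grep -oE (the -o option is a GNU/BSD extension, not plain POSIX). Unlike a single match() call per line, it also picks up several links on the same line and drops duplicates.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print every monthly
# archive URL it contains, one per line, sorted and de-duplicated.
extract_archive_links() {
    grep -oE 'http://robgreene\.blogspot\.com/[0-9]{4}_[0-9]{2}_[0-9]{2}_archive\.html' |
        sort -u
}

# Made-up sample standing in for index.html: two archive links on one line.
sample='<a href="http://robgreene.blogspot.com/2006_11_01_archive.html">Nov</a> <a href="http://robgreene.blogspot.com/2006_10_01_archive.html">Oct</a>'

printf '%s\n' "$sample" | extract_archive_links
```

The output could be redirected to a file and handed to wget with --input-file in the same way as the toget file.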
