Anyway, I finally fixed up the backup script, and since I didn't find one quite like it, here it is:
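# Work from the script's own directory so the tarball ends up next to it.
cd "$(dirname "$0")" || exit 1
TMP=/tmp/blogbackup
# -p avoids an error if a previous run left the directory behind.
mkdir -p "$TMP"
cd "$TMP" || exit 1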
wget --progress=dot robgreene.blogspot.com
gawk '{ \
while (match($0, \
"http://robgreene\\.blogspot\\.com/[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_archive\\.html")) { \
print substr($0,RSTART,RLENGTH); $0 = substr($0,RSTART+RLENGTH); } }' <index.html >toget
wget --progress=dot --page-requisites --span-hosts --convert-links \
-erobots=off --input-file=toget
cd -
tar czvf "robgreene-blog-$(date +'%Y%m%d').tgz" "$TMP"
rm -rf "$TMP"
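Be careful of the word wrap when copying: lines ending in a backslash continue on the next line, and everything else should stay on a single line.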
Basically, I grab my index page and then search it for the monthly archive links of the format ####_##_##_archive.html (# = digit). Once those are found, wget downloads each monthly archive along with its images, CSS, JavaScript, and any other page requisites. Finally, the whole temp directory gets tarred up into a date-stamped .tgz and then removed.
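If you want to see the link matching on its own, here is a rough sketch of the same search done with grep instead of gawk. It is not what the script above uses, and it assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX). The sample line and its dates are made up for illustration, and extract_archive_links is just a name I picked.

```shell
# Made-up sample line standing in for index.html, with two archive links on one line.
sample='<a href="http://robgreene.blogspot.com/2007_01_01_archive.html">Jan</a> <a href="http://robgreene.blogspot.com/2007_02_01_archive.html">Feb</a>'

# Hypothetical helper: reads HTML on stdin, prints each monthly archive URL once.
# -o prints only the matched text, one match per line; -E allows {n} repeat counts.
extract_archive_links() {
    grep -oE 'http://robgreene\.blogspot\.com/[0-9]{4}_[0-9]{2}_[0-9]{2}_archive\.html' | sort -u
}

links=$(printf '%s\n' "$sample" | extract_archive_links)
printf '%s\n' "$links"
```

In the real script you would feed it index.html and redirect to toget, e.g. extract_archive_links <index.html >toget. The sort -u is only there to drop repeated links.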
