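Backing Up CSDN Blog Posts

This is an old post of mine, originally published at http://blog.csdn.net/marksinoberg/article/details/70946107
Lately I have noticed that many fellow bloggers want to move away from CSDN, for reasons I won't go into here. Migrating a blog is a big job that costs a lot of time and effort, so I decided to release this fairly practical tool of mine to make the move a little easier.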

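Preface

For a while now, people in my chat groups and fellow bloggers have kept raising the same question: "Why doesn't the CSDN blog have a backup feature?" To some extent this shows that people value their articles as knowledge products more and more, and that they care more about keeping their data safe.

So I tried writing a tool dedicated to backing up a CSDN blogger's posts.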

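CSDN Blog Backup Tool

Core

I call it the core, but there is not much to it. Strictly speaking it is just a pile of code and hardly deserves the name.

Login Module

Your first question as a reader is probably why a login module is needed at all.

The reason is that, without logging in, the article API does not return the article content. To keep things simple, the tool logs in once and reuses that session to fetch the articles.

There is no need to worry about the safety of your username and password: the tool does not store any information about you, so you can use it with confidence (read the code if you don't believe me).

The code of the login module is simple too: it just simulates a login to CSDN.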

# coding: utf8

# @Author: 郭 璞
# @File: login.py                                                                 
# @Time: 2017/4/28                                   
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: CSDN login for returning the same session for backing up the blogs.

import requests
from bs4 import BeautifulSoup
import json

class Login(object):
    """
    Log in and return a session that is reused for backing up the blog. Requires your account's username and password.
    """
    def __init__(self, username, password):
        if username and password:
            self.username = username
            self.password = password
            # the common headers for this login operation.
            self.headers = {
                'Host': 'passport.csdn.net',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            }
        else:
            raise Exception('Need Your username and password!')
    def login(self):
        loginurl = 'https://passport.csdn.net/account/login'
        # get the 'token' for webflow
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assemble the form data for the login POST request.
        self.token = soup.find('input', {'name': 'lt'})['value']

        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit'
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)

        # get the session
        return self.session if response.status_code==200 else None

    def getSource(self, url):
        """
        Test helper; safe to delete.
        :param url: post URL, e.g. http://blog.csdn.net/marksinoberg/article/details/70432419
        :return: dict holding the post's title and Markdown source
        """
        username, article_id = url.split('/')[3], url.split('/')[-1]
        # print(username, article_id)
        backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(article_id, username)
        # Copy the headers so the shared login headers are not modified.
        tempheaders = dict(self.headers)
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=backupurl, headers=tempheaders)
        result = json.loads(response.text)
        return {
            'title': result['data']['title'],
            'markdowncontent': result['data']['markdowncontent'],
        }

Once the simulated login succeeds we hold a session in the logged-in state, which the modules that follow will use.
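The form-scraping step of the login can be tried in isolation. The following is a minimal sketch, not the tool's own code: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without dependencies, and SAMPLE_LOGIN_HTML, HiddenInputParser and build_payload are hypothetical names made up for illustration. The real CSDN login page may look different.

```python
from html.parser import HTMLParser

# Hypothetical login-form fragment (an assumption; the real page may differ).
SAMPLE_LOGIN_HTML = """
<form id="fm1" action="/account/login" method="post">
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="hidden" name="lt" value="LT-12345-abc" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
"""


class HiddenInputParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        if attrs.get('type') == 'hidden' and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def build_payload(html, username, password):
    """Merge the hidden form fields with the user's credentials."""
    parser = HiddenInputParser()
    parser.feed(html)
    payload = dict(parser.fields)
    payload['username'] = username
    payload['password'] = password
    return payload


payload = build_payload(SAMPLE_LOGIN_HTML, 'alice', 'secret')
print(sorted(payload))
```

Collecting every hidden field, rather than naming lt and execution one by one, keeps the payload complete if the form gains another hidden token.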

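Backup Module

My first idea was to fetch the page source directly, extract the article body, and convert the HTML to a Markdown file with some custom logic. But complex posts have deeply nested HTML, and tables in particular were more than I could handle, so that route was technically hard.

Then, quite by accident, I found an API that returns the article's JSON data, including the title and the original Markdown source.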

'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
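How a post link turns into that API address can be sketched as follows. This is only an illustration under the assumption that post links always have the shape http://blog.csdn.net/<username>/article/details/<id>; build_backup_url is a hypothetical helper name, not something from the tool.

```python
from urllib.parse import urlencode, urlparse

API = 'http://write.blog.csdn.net/mdeditor/getArticle'


def build_backup_url(post_url):
    """Turn a post URL into the getArticle API URL for the same post."""
    parts = urlparse(post_url).path.strip('/').split('/')
    # Expected path segments: [username, 'article', 'details', article id]
    if len(parts) != 4 or parts[1:3] != ['article', 'details']:
        raise ValueError('Unexpected post URL: {}'.format(post_url))
    username, article_id = parts[0], parts[3]
    query = urlencode({'id': article_id, 'username': username})
    return '{}?{}'.format(API, query)


print(build_backup_url('http://blog.csdn.net/marksinoberg/article/details/70432419'))
```

Validating the path first means a malformed link fails loudly instead of producing a request for the wrong article.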

That is remarkably convenient. The backup logic itself follows.

# coding: utf8

# @Author: 郭 璞
# @File: backup.py                                                                 
# @Time: 2017/4/28                                   
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Back up the blog by fetching and storing the markdown file.
import json
import os
import re

class Backup(object):
    """
    Build the API url for one post, then fetch and save its markdown file and pictures.
    """
    def __init__(self, session, backupurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'write.blog.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # Build the API url from the article id and the username, e.g.
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        username, article_id = backupurl.split('/')[3], backupurl.split('/')[-1]
        self.backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(article_id, username)
        self.session = session

    def getSource(self):
        # Get the title and markdown content for the assigned url.
        # Copy the headers so per-request changes do not leak into other requests.
        tempheaders = dict(self.headers)
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=self.backupurl, headers=tempheaders)
        result = json.loads(response.text)
        return {
            'title': result['data']['title'],
            'markdowncontent': result['data']['markdowncontent'],
        }

    def downloadpic(self, picurl, outputpath):
        # Copy the headers so the image host does not overwrite the shared ones.
        tempheaders = dict(self.headers)
        tempheaders['Host'] = 'img.blog.csdn.net'
        tempheaders['Upgrade-Insecure-Requests'] = '1'
        response = self.session.get(url=picurl, headers=tempheaders)
        print(response.status_code)
        # normalize the path separator of your OS
        outputpath = outputpath.replace(os.sep, '/')
        print(outputpath)
        if response.status_code == 200:
            # the with block closes the file, so no explicit close() is needed
            with open(outputpath, 'wb') as f:
                f.write(response.content)
            print("{} saved in {} succeed!".format(picurl, outputpath))
        else:
            raise Exception("Picture Url: {} downloading failed!".format(picurl))

    def getpicurls(self):
        # Non-greedy capture, so two images on one line yield two separate urls.
        pattern = re.compile(r"!\[.*?\]\((.*?)\)")
        markdowncontent = self.getSource()['markdowncontent']
        return re.findall(pattern=pattern, string=markdowncontent)

    def backup(self, outputpath='./'):
        try:
            source = self.getSource()
            foldername = source['title']
            foldername = os.path.join(outputpath, foldername)
            if not os.path.exists(foldername):
                os.mkdir(foldername)
            # write the markdown file
            filename = os.path.join(foldername, source['title'])

            with open(filename+".md", 'w', encoding='utf8') as f:
                f.write(source['markdowncontent'])
            # save pictures
            imgfolder = os.path.join(foldername, 'img')
            if not os.path.exists(imgfolder):
                os.mkdir(imgfolder)
            for index, picurl in enumerate(self.getpicurls()):
                imgpath = os.path.join(imgfolder, str(index)+'.png')
                try:
                    self.downloadpic(picurl=picurl, outputpath=imgpath)
                except Exception as e:
                    # This may raise, for example:
                    # requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
                    print('Skipping picture {}: {}'.format(picurl, e))
        except Exception as e:
            print('Something went wrong again. Details: {}'.format(e))
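Pulling the picture links out of the Markdown source is the step most likely to go wrong, because a greedy capture such as (.*) can swallow several links that sit on the same line. Here is a small sketch of one way to extract them; IMAGE_PATTERN, find_image_urls and the sample text are my own illustrative names and data, and the pattern also tolerates an optional title after the URL.

```python
import re

# ![alt](url) or ![alt](url "title"); the URL stops at whitespace or ')'.
IMAGE_PATTERN = re.compile(r'!\[[^\]]*\]\(([^)\s]+)[^)]*\)')


def find_image_urls(markdown):
    """Return every image URL in the Markdown text, in document order."""
    return IMAGE_PATTERN.findall(markdown)


# Made-up sample text for illustration only.
sample = (
    'Intro ![one](http://img.blog.csdn.net/a) and '
    '![two](http://img.blog.csdn.net/b "title") end.\n'
    '![](http://img.blog.csdn.net/c)\n'
    'A plain [link](http://example.com) is ignored.'
)

for url in find_image_urls(sample):
    print(url)
```

Ordinary links are skipped because the pattern requires the leading exclamation mark.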

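Post Scanning Module

In principle the scanning module needs no login: starting from your username, it walks the list pages one by one and collects the links to all of your posts. Save those links, feed them to the backup logic above, and run through them in a loop.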

# coding: utf8

# @Author: 郭 璞
# @File: blogscan.py                                                                 
# @Time: 2017/4/28                                   
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Scan your blog's list pages and collect the links of all your posts.
import requests
from bs4 import BeautifulSoup
import re

class BlogScanner(object):
    """
    Scan for all blogs
    """
    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # get the page count
        response = requests.get(url=self.rooturl+"/"+self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        pagecontainer = soup.find('div', {'class': 'pagelist'})
        # the last number in the pager text is the total page count
        pages = re.findall(r'(\d+)', pagecontainer.find('span').get_text())[-1]

        # construct the blog list urls, like: http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages)+1):
            # get the blog link of each list page
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, str(index))
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                # print(alinks)
                for alink in alinks:
                    link = alink.find('a').attrs['href']
                    link = self.rooturl +link
                    self.bloglinks.append(link)
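            except Exception as e:
                # format the exception; concatenating str + Exception raises TypeError
                print('Something unexpected happened!\n{}'.format(e))
                continue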

        return self.bloglinks
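The paging arithmetic in scan() can be checked without touching the network. The sketch below assumes the pager text ends with the total page count (the sample string is my guess at what the pager looks like, not a captured value), and page_count and list_urls are hypothetical helper names. It also falls back to one page when no number is found, a case in which the listing above would raise because the pager element is missing.

```python
import re


def page_count(pager_text):
    """Return the last number in the pager text, or 1 if there is none."""
    numbers = re.findall(r'\d+', pager_text or '')
    return int(numbers[-1]) if numbers else 1


def list_urls(username, pages):
    """Build the url of every list page, numbered from 1 to pages."""
    template = 'http://blog.csdn.net/{}/article/list/{}'
    return [template.format(username, index) for index in range(1, pages + 1)]


# Assumed pager text: "128 posts, 9 pages in total".
pages = page_count('128条 共9页')
for url in list_urls('marksinoberg', pages):
    print(url)
```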

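With that, the three main modules are done.

Demo

Next, a demonstration of how to use the tool.

How to Use

The first step is, of course, to download the source code.
Then adapt the code below: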

# coding: utf8

# @Author: 郭 璞
# @File: Main.py                                                                 
# @Time: 2017/4/28                                   
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: The entrance of this blog backup tool.

from csdnbackup.login import Login
from csdnbackup.backup import Backup
from csdnbackup.blogscan import BlogScanner
import random
import time
import getpass

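username = input('Username: ')
password = getpass.getpass(prompt='Password: ')

loginer = Login(username=username, password=password)
session = loginer.login()
# login() returns None when the login request does not come back with status 200
if session is None:
    raise SystemExit('The login request failed.')

scanner = BlogScanner(username)
links = scanner.scan()

for link in links:
    backupper = Backup(session=session, backupurl=link)
    # sleep for a random number of seconds between posts to avoid hammering the server
    timefeed = random.choice([1,3,5,7,2,4,6,8])
    print('Sleeping for a random {} seconds'.format(timefeed))
    time.sleep(timefeed)
    backupper.backup(outputpath='./')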

Finally, run python Main.py

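Results

Here are the results of a run.

[Screenshot: first, the overview (testing was not finished yet, so only these few posts had been downloaded)]
[Screenshot: then a single post]
[Screenshot: the post's Markdown content]
[Screenshot: the pictures of a single post]
[Screenshot: viewing a picture]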

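Summary

Finally, let me reflect on where this tool still falls short.

Folder-creation errors caused by post titles: these are caught with exception handling.
Server countermeasures triggered by requesting too fast: a random sleep delay was added, but it treats the symptom rather than the cause.
There is no logging module yet. Posts that fail to back up should be recorded; after the backup run finishes, the error log should be parsed and those posts retried.
Testing is still insufficient: it runs fine on my machine, but other people may hit all sorts of odd problems.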
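Two of the gaps listed above, unsafe folder names and missing retries, could be closed roughly as follows. This is a sketch under assumptions rather than part of the tool: safe_name and backup_all are hypothetical helpers, and backup_one stands for any callable that raises when a post fails to back up (the Backup.backup method shown earlier swallows its exceptions, so it would need to re-raise to be used this way).

```python
import re


def safe_name(title, replacement='_'):
    """Replace characters that are not allowed in file or folder names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip(' .')
    return cleaned or 'untitled'


def backup_all(links, backup_one, max_rounds=3):
    """Back up every link, retrying failures; return the links that still fail."""
    pending = list(links)
    for round_number in range(1, max_rounds + 1):
        failed = []
        for link in pending:
            try:
                backup_one(link)
            except Exception as exc:
                # Record the failure so it can be retried in the next round.
                print('round {}: {} failed: {}'.format(round_number, link, exc))
                failed.append(link)
        pending = failed
        if not pending:
            break
    return pending


print(safe_name('C/C++: what?'))
```

The list returned by backup_all is the natural thing to write to a log file, so a later run can pick up exactly the posts that never succeeded.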

Finally, here is the link to the source code; if you find it interesting, please give it a star.

https://github.com/guoruibiao/csdn-blog-backup-tool

Last edited: 2017.12.11 00:59:55
