web_bash Notes 5

wget

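Used to download files, for example:

# Download the home page HTML
wget http://ayqy.net
# Download multiple files
wget http://www.example.com http://ayqy.net

In the example above, the address without www returns a 301. wget follows the redirect automatically, downloads index.html, and saves it in the current directory. By default the local file keeps the remote file name; if a file with that name already exists, a suffix is appended automatically (for example index.html.1)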
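The suffix rule can be imitated locally without touching the network. The following is only a sketch of the naming behavior described above: next_free_name is a hypothetical helper written for this note, not something wget exposes.

```bash
#!/bin/bash
# Sketch: keep the name if it is free, otherwise append .1, .2, ...
# next_free_name is a hypothetical helper, not part of wget.
next_free_name() {
    local name="$1" n=1
    if [ ! -e "$name" ]; then
        printf '%s\n' "$name"
        return
    fi
    # Count upwards until an unused suffix is found.
    while [ -e "$name.$n" ]; do
        n=$((n + 1))
    done
    printf '%s\n' "$name.$n"
}

# Try it in a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
first=$(next_free_name "$demo_dir/index.html")
touch "$demo_dir/index.html" "$demo_dir/index.html.1"
second=$(next_free_name "$demo_dir/index.html")
echo "${first##*/} ${second##*/}"
rm -rf "$demo_dir"
```

With index.html and index.html.1 already present, the helper settles on index.html.2.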

Two URL formats are supported (https:// URLs follow the http form):

# http
http://host[:port]/directory/file
# ftp
ftp://host[:port]/directory/file
# With username/password authentication
http://user:password@host/path
ftp://user:password@host/path
# Or
wget --user=user --password=password URL

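The output file name is set with the -O option:

# Write to a file
wget http://ayqy.net -O page.html
# - means standard output
wget http://ayqy.net -O -

Note: it must be an uppercase O. A lowercase o writes progress and error messages to the given log file. If the specified file already exists, it is overwritten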

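Other commonly used options:

# POST
wget --post-data 'a=1&b=2' http://www.example.com
# Or
wget --post-file post-body.txt http://www.example.com
# Resume an interrupted download
wget -c http://www.example.com
# Try at most 3 times on error
wget -t 3 http://www.example.com
# Limit the download rate to 1k (bytes per second) to avoid saturating the bandwidth
wget --limit-rate 1k http://www.example.com
# Limit the total download size to avoid using too much disk space
wget -Q 1m http://www.example.com http://www.example.com

P.S. Limiting the total download size depends on the server providing Content-Length; without it the limit cannot be enforced. Also, according to the wget manual, the quota never cuts off a file that is already downloading; it is only checked once each file has finished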

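In addition, wget has very capable crawling features:

# Recursively crawl all pages, downloading them one by one
wget --mirror http://www.ayqy.net
# Limit the depth to 1 level; must be used together with the -r recursive option
wget -r -l 1 http://www.ayqy.net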

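It can also update incrementally, downloading only new files (files that do not exist locally, or whose last-modified time is more recent):

# -N compares timestamps for incremental updates and downloads only new files
wget -N http://node.ayqy.net

If the file on the server has not changed, the next run skips the download and prints:

Server file no newer than local file `index.html' -- not retrieving.

P.S. Of course, incremental updates depend on the server providing Last-Modified. Without it, incremental updating is not possible, and by default the file is downloaded again and overwritten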
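The "only fetch when newer" decision can be tried out with two local files. This is a rough analogy rather than wget's implementation: remote.html stands in for the server copy, and should_fetch and decide are hypothetical helpers.

```bash
#!/bin/bash
# Sketch of the timestamp decision, using two local files.
# should_fetch succeeds when a download would be needed.
should_fetch() {
    local remote="$1" local_copy="$2"
    # Fetch when there is no local copy, or the remote one is newer (-nt).
    [ ! -e "$local_copy" ] || [ "$remote" -nt "$local_copy" ]
}

decide() {
    if should_fetch "$1" "$2"; then echo fetch; else echo skip; fi
}

demo_dir=$(mktemp -d)
remote="$demo_dir/remote.html"
local_copy="$demo_dir/local.html"

touch -t 202001010000 "$remote"
missing=$(decide "$remote" "$local_copy")       # no local copy yet

touch -t 202101010000 "$local_copy"
newer_local=$(decide "$remote" "$local_copy")   # local copy is more recent

touch -t 202201010000 "$remote"
newer_remote=$(decide "$remote" "$local_copy")  # server copy was updated

echo "$missing $newer_local $newer_remote"
rm -rf "$demo_dir"
```

The three cases come out as fetch, skip, fetch, which matches the behavior described for -N.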

P.S. For more about wget, see the GNU Wget 1.18 Manual

curl

More powerful than wget: besides downloading files, it can send requests (GET/POST/PUT/DELETE/HEAD and so on) and set request headers. It supports HTTP, HTTPS, FTP and other protocols, as well as cookies, UA, authentication, and more

It is often used to test RESTful APIs:

# Create
curl -X POST http://localhost:9108/user/ayqy
# Delete
curl -X DELETE http://localhost:9108/user/ayqy
# Update
curl -X PUT http://localhost:9108/user/ayqy/cc
# Read
curl -X GET http://localhost:9108/user/ayqy

Submitting a form with POST:

# Simulate a form submission
curl -d 'a=1&b=2' --trace-ascii /dev/stdout http://www.example.com

# Request headers and request body
=> Send header, 148 bytes (0x94)
0000: POST / HTTP/1.1
0011: Host: www.example.com
0028: User-Agent: curl/7.43.0
0041: Accept: */*
004e: Content-Length: 7
0061: Content-Type: application/x-www-form-urlencoded
0092:
=> Send data, 7 bytes (0x7)
0000: a=1&b=2

-d is short for --data (--data-ascii is an alias). The other three variants are --data-raw, --data-binary, and --data-urlencode, of which --data-urlencode URL-encodes the parameter value
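To make "URL-encodes the parameter value" concrete, here is a minimal percent-encoding sketch. urlencode is a hypothetical helper that handles ASCII input only; it illustrates the general idea and is not curl's code (curl's exact output, for spaces in particular, may differ between versions).

```bash
#!/bin/bash
# Sketch: percent-encode everything except unreserved characters.
# ASCII input only; urlencode is a hypothetical helper, not part of curl.
urlencode() {
    local s="$1" out='' rest c
    while [ -n "$s" ]; do
        rest=${s#?}            # everything after the first character
        c=${s%"$rest"}         # the first character itself
        s=$rest
        case $c in
            [A-Za-z0-9.~_-]) out=$out$c ;;
            # "'x" makes printf use the character code of x.
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

encoded=$(urlencode 'a b&c=d')
echo "$encoded"
```

The characters that would otherwise split or terminate a form field (space, &, =) come out as %20, %26, and %3D.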

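--trace-ascii prints the request/response headers and the request/response bodies. Alternatively, the request can be inspected through a proxy tool:

# -x or --proxy sends the request through the proxy; otherwise the proxy cannot capture it
curl -d 'a=1&b=2' -x http://127.0.0.1:8888 http://www.example.com

Like wget, curl can also download files, except that by default it writes to standard output instead of a file:

# Print the response body directly
curl http://ayqy.net

This returns a simple 301 page. curl does not follow redirects automatically (unless -L is given), which can be used to trace redirects (although capturing the traffic directly is simpler and more direct)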
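One way to trace a redirect by hand is to read the Location header from the response headers. The sketch below works on made-up sample headers so it runs offline; location_of is a hypothetical helper, and the Location value is an assumption, not captured output.

```bash
#!/bin/bash
# Sketch: pull the redirect target out of response headers read from stdin.
# location_of is a hypothetical helper.
location_of() {
    # Header names are case-insensitive; HTTP lines may end in CR, so strip it.
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Made-up headers, standing in for the output of: curl -sI http://ayqy.net
sample_headers='HTTP/1.1 301 Moved Permanently
Content-Type: text/html
Location: http://www.ayqy.net/
'

target=$(printf '%s' "$sample_headers" | location_of)
echo "redirects to: $target"
```

Against a live server the same helper could be fed from curl -sI URL, repeating the request with each new target until no Location header comes back.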

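A file can be downloaded with output redirection or the -o option:

# Write to a file; progress information is printed by default
curl http://ayqy.net > 301.html
# Or
curl http://ayqy.net -o 301.html
# Use the file name from the URL
curl http://ayqy.net/index.html -O
# Fails when the URL contains no file name
curl http://ayqy.net -O
# Silent download, no progress information
curl http://ayqy.net --silent -o 301.html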
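Why -O needs a file name in the URL becomes clearer when the name is derived by hand. This is a simplified sketch: remote_name is a hypothetical helper, it is not curl's own logic, and query strings are ignored.

```bash
#!/bin/bash
# Sketch: take the text after the last slash of the path as the file name.
# remote_name is a hypothetical helper, not curl's implementation.
remote_name() {
    local rest="${1#*://}"     # drop the scheme (http://, https://, ...)
    case $rest in
        */*) printf '%s\n' "${rest##*/}" ;;  # text after the last slash
        *)   printf '\n' ;;                  # no path at all, so no name
    esac
}

with_file=$(remote_name 'http://ayqy.net/index.html')
bare_host=$(remote_name 'http://ayqy.net')
echo "with file: [$with_file]"
echo "bare host: [$bare_host]"
```

The bare host yields an empty name, which is the situation in which -O has nothing to save to.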

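An interesting command:

# Install nvm with curl
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.1/install.sh | bash

The value of the o option is -, which sends the output to standard output; the pipe then hands it to bash for execution. The whole line fetches an online bash script and runs it

The wget version is similar:

# Install nvm with wget
wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.33.1/install.sh | bash

The -q option silences wget so that the output stays clean, -O - sends the download to standard output, and bash then executes it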
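The pipe-to-bash pattern can be tried safely by replacing the download with printf. In this sketch the script text is a made-up stand-in for a remote install.sh.

```bash
#!/bin/bash
# Sketch: what "download | bash" does, with the download replaced by printf.
# script_text is a made-up stand-in for a remote script.
script_text='greeting="hello from a piped script"
echo "$greeting"'

# bash reads the script from standard input and executes it.
output=$(printf '%s\n' "$script_text" | bash)
echo "$output"
```

Because bash runs whatever arrives on the pipe, a more cautious approach with a real URL is to save the script to a file with -o, read it, and only then run it.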

curl's real strength is that it can modify request header fields:

# Set the Referer field
curl --referer http://ayqy.net http://node.ayqy.net
# Set a cookie
curl -v --cookie 'isVisted=true' http://localhost:9103
# Or use -H to set any header field
curl -v -H 'Cookie: isVisted=true' http://localhost:9103
curl -v -H 'Cookie: isVisted=true' -H 'Referer: http://a.com' http://localhost:9103
# Write the returned cookies to a file
curl http://localhost:9103 -c cookie.txt
# Set the UA
curl -v -A 'hello, i am android' 'http://localhost:9105'
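When several headers are involved, collecting the options in a bash array keeps values that contain spaces intact. The sketch below reuses the header values from the examples above and only prints the resulting command; nothing is sent.

```bash
#!/bin/bash
# Sketch: build the -H options in an array, then print the command.
# Requires bash (arrays and printf %q); no request is made.
headers=('Cookie: isVisted=true' 'Referer: http://a.com')

curl_args=(-v)
for h in "${headers[@]}"; do
    curl_args+=(-H "$h")
done
curl_args+=('http://localhost:9103')

# %q prints each argument with shell-safe quoting.
printf 'curl'
printf ' %q' "${curl_args[@]}"
printf '\n'
```

To actually send the request, the last three printf lines would be replaced with curl "${curl_args[@]}".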

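Other features and options:

# Show a download progress bar
curl http://ayqy.net --progress-bar -o 301.html
# Resume a download
# Specify the offset manually: skip 15 bytes, so the DOCTYPE declaration is skipped
curl http://node.ayqy.net -C 15
# Compute the offset automatically (similar to wget -c)
curl http://node.ayqy.net -C -
# Limit the download rate (the limit also applies when writing to standard output instead of a file)
curl http://www.ayqy.net > ayqy.html --limit-rate 1k
# Limit the total download size (in bytes)
curl http://node.ayqy.net --max-filesize 100
# Username/password authentication
curl -v -u username:password http://example.com
# Print only the response headers
# The www host returns far fewer fields
curl -I http://node.ayqy.net
curl -I http://www.ayqy.net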

Batch-downloading images

curl makes simple jobs like this easy:

#!/bin/bash
# Batch-download images

# Check the argument count
if [ $# -ne 3 ];
then
    echo "Usage: $0 -d <dir> <url>"
    exit 1
fi

# Read the arguments
for i in {1..3};
do
    case $1 in
    -d) shift; dir=$1; shift;;
     *) url=${url:-$1}; shift;;
    esac
done

# Extract the base URL (scheme and host)
baseurl=$(echo "$url" | grep -Eo 'https?://[a-z.]+')

# Fetch the source, filter out img tags, extract src,
# and turn root-relative paths (/a.png) into absolute URLs
tmpFile="/tmp/img_url_$$.tmp"
curl "$url" --silent \
    | grep -Eo '<img\s[^>]*src="[^"]+"[^>]*>' \
    | sed 's/.*src="\([^"]*\)".*/\1/' \
    | sed "s;^/\([^/]\);$baseurl/\1;" \
    > "$tmpFile"
echo "save image urls to $tmpFile"

# Create the directory
mkdir -p "$dir"
cd "$dir" || exit 1

# Download the images
while read -r imgUrl;
do
    filename=${imgUrl##*/}
    curl "$imgUrl" --silent > "$filename"
    echo "save to $dir/$filename"
done < "$tmpFile"

echo 'done'

Run the script above to grab the images from Pengfu (pengfu.com):

./imgdl.sh http://www.pengfu.com -d imgs
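The extraction step can be checked offline by feeding the same kind of pipeline some inline HTML. This is a sketch of a variant of the script's grep/sed chain; the HTML snippet and base URL are made up for illustration.

```bash
#!/bin/bash
# Sketch: filter img tags, extract src, make root-relative paths absolute.
# The HTML and base URL below are made up.
baseurl='http://www.example.com'
html='<p>intro</p>
<img class="a" src="/static/cat.png" alt="cat">
<img src="http://cdn.example.com/dog.jpg">'

urls=$(printf '%s\n' "$html" \
    | grep -Eo '<img [^>]*src="[^"]+"[^>]*>' \
    | sed 's/.*src="\([^"]*\)".*/\1/' \
    | sed "s;^/\([^/]\);$baseurl/\1;")

# Derive the local file name the same way the script does.
printf '%s\n' "$urls" | while read -r imgUrl; do
    echo "$imgUrl -> ${imgUrl##*/}"
done
```

Only paths starting with a single slash get the base URL prepended; URLs that are already absolute pass through unchanged.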

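The core part is straightforward: fetch the source, find the img tags, extract src, then loop and download. The argument handling uses a small trick:

# Read the arguments
for i in {1..3};
do
    case $1 in
    -d) shift; dir=$1; shift;;
     *) url=${url:-$1}; shift;;
    esac
done

The shift command pops the first of the command arguments ($1...n). It means the same as the shift method on arrays in other languages: the first item is removed and the remaining items move forward, so the loop only needs to examine $1. The case statement matches option names and values, consuming each argument as it is read and always reading the first one. For example, for a key-value argument such as -d <dir>, first shift away -d, then read <dir>, and finally shift away <dir> as well before moving on to the following arguments in the next iteration

The advantage of reading arguments this way is that their order is not fixed. A key and its value must of course stay together, but the arguments can otherwise appear in any order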
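The order independence is easy to confirm by wrapping the same read-one-drop-one logic in a function and calling it both ways. In this sketch parse_args is a hypothetical name, and the loop runs until the arguments are used up instead of a fixed three times.

```bash
#!/bin/bash
# Sketch: shift-based parsing in a function, called with both argument orders.
# parse_args is a hypothetical name; it sets the globals dir and url.
parse_args() {
    dir='' url=''
    # Loop until every argument has been consumed.
    while [ $# -gt 0 ]; do
        case $1 in
        -d) shift; dir=$1; shift;;
         *) url=${url:-$1}; shift;;
        esac
    done
}

parse_args -d imgs http://www.pengfu.com
first="$dir $url"
parse_args http://www.pengfu.com -d imgs
second="$dir $url"

echo "$first"
echo "$second"
```

Both calls end with dir=imgs and url=http://www.pengfu.com, whichever argument comes first.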

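Here ${url:-$1} means: if the variable url is set and non-empty, use the value of url; otherwise use the value of $1. This feature is called parameter expansion:

${parameter:-word}

If parameter is unset or empty, the result is word; otherwise it is the value of parameter

${parameter:=word}

Used to set a default value. If parameter is unset or empty, word is assigned to parameter. Positional parameters (such as $1, $2 ... $n) and special parameters cannot be assigned this way (because they are read-only)

${parameter:?word}

Used to catch the error of a variable being unset or empty. If parameter is unset or empty, word is written to standard error (for example parameter: word; if word is omitted, the message is parameter null or not set), and a non-interactive shell exits the script. If parameter is set and non-empty, the result is the value of parameter

${parameter:+word}

Used to check whether a variable is set. If parameter is unset or empty, the result is empty; otherwise it is word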
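The four forms can be compared side by side in a few lines. This is a small sketch with made-up variable names; the :? case runs in a subshell so that its failure does not end the demo.

```bash
#!/bin/bash
# Sketch: the four colon forms on unset, empty, and set variables.
# All variable names are made up for the demo.
unset name
empty=''
set_var='value'

a=${name:-fallback}        # name is unset  -> fallback (name stays unset)
b=${empty:=assigned}       # empty is empty -> assigned, and empty now holds it
c=${set_var:+replacement}  # set_var is set -> replacement
d=${name:+replacement}     # name is unset  -> empty string

# :? makes a non-interactive shell exit, so run it in a subshell
# and capture the message it writes to standard error.
msg=$( (: "${name:?is required}") 2>&1 )
status=$?

echo "a=$a b=$b empty=$empty c=$c d=[$d]"
echo "exit status of the :? subshell: $status"
echo "$msg"
```

Note that only := changes the variable itself; the other three forms affect the expansion result and leave the variable as it was.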

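In addition, each of the four has a variant without the :, which only tests whether parameter is unset, so a parameter that is set but empty is treated as set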
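The difference only shows up for a variable that is set but empty, as in this short sketch (variable names are made up):

```bash
#!/bin/bash
# Sketch: with and without the colon, on a variable that is set but empty.
empty=''
unset missing

with_colon=${empty:-default}     # empty is treated as missing -> default
without_colon=${empty-default}   # empty is treated as set     -> empty string
unset_case=${missing-default}    # really unset                -> default

echo "with_colon=[$with_colon] without_colon=[$without_colon] unset_case=[$unset_case]"
```

The colon-less form is the one to reach for when an empty string is a legitimate value that should not be replaced by the default.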

P.S. For more on parameter expansion, see the Bash Reference Manual: Shell Parameter Expansion
