wget
Used to download files, for example:
# Download the homepage HTML
wget http://ayqy.net
# Download multiple files
wget http://www.example.com http://ayqy.net
In the example above, the address without www returns a 301; wget follows it automatically, downloads index.html, and saves it to the current directory. By default the local filename matches the remote one; if a file with that name already exists, a numeric suffix (.1, .2, …) is appended automatically.
Two URL formats are supported:
# http
http://host[:port]/directory/file
# ftp
ftp://host[:port]/directory/file
# With username and password authentication
http://user:password@host/path
ftp://user:password@host/path
# Or
wget --user=user --password=password URL
The saved filename is specified through the -O option:
# Output to a file
wget http://ayqy.net -O page.html
# - indicates standard output
wget http://ayqy.net -O -
Note: it must be an uppercase O. Lowercase -o instead logs progress and error information to the specified file, overwriting it if it already exists.
Other common options:
# POST
wget --post-data 'a=1&b=2' http://www.example.com
# Or
wget --post-file post-body.txt http://www.example.com
# Resume download
wget -c http://www.example.com
# Retry 3 times on error
wget -t 3 http://www.example.com
# Limit download rate to 1k to avoid saturating bandwidth
wget --limit-rate 1k http://www.example.com
# Limit total download size to avoid taking up too much disk space
wget -Q 1m http://www.example.com http://www.example.com
P.S. Limiting the total size depends on the Content-Length header provided by the server; if it is absent, the quota cannot be enforced. Note also that the quota never aborts a file that has already started downloading; it only prevents further files from starting.
In addition, wget has very powerful crawler functionality:
# Recursively crawl all pages and download them one by one
wget --mirror http://www.ayqy.net
# Specify a depth of 1, to be used with the -r recursive option
wget -r -l 1 http://www.ayqy.net
It can also perform incremental updates, downloading only new files (files that do not exist locally, or whose last-modified time on the server is newer than the local copy's):
# -N compares timestamps for incremental updates, only downloading new files
wget -N http://node.ayqy.net
If the file on the server has not changed, the next run skips the download and prints:
Server file no newer than local file `index.html' -- not retrieving.
P.S. Of course, incremental updates depend on the Last-Modified header provided by the server; if it is absent, incremental updates are impossible and wget simply downloads and overwrites the file.
P.S. For more information about wget, please check the GNU Wget 1.18 Manual.
curl
curl is more powerful than wget: besides downloading files, it can send requests of any method (GET/POST/PUT/DELETE/HEAD, etc.), set request headers, and speak HTTP, HTTPS, FTP, and other protocols, with support for cookies, User-Agent strings, authentication, and more.
Often used to test RESTful APIs:
# Create
curl -X POST http://localhost:9108/user/ayqy
# Delete
curl -X DELETE http://localhost:9108/user/ayqy
# Update
curl -X PUT http://localhost:9108/user/ayqy/cc
# Retrieve
curl -X GET http://localhost:9108/user/ayqy
Submit a form via POST:
# Simulate form submission
curl -d 'a=1&b=2' --trace-ascii /dev/stdout http://www.example.com
# Request headers and request body
=> Send header, 148 bytes (0x94)
0000: POST / HTTP/1.1
0011: Host: www.example.com
0028: User-Agent: curl/7.43.0
0041: Accept: */*
004e: Content-Length: 7
0061: Content-Type: application/x-www-form-urlencoded
0092:
=> Send data, 7 bytes (0x7)
0000: a=1&b=2
-d is short for --data (equivalent to --data-ascii). The three related variants are --data-raw, --data-binary, and --data-urlencode; the last one URL-encodes the parameter values.
--trace-ascii dumps the request/response headers and bodies; alternatively, the traffic can be inspected through a proxy tool:
# Point curl at the proxy with -x (or --proxy); otherwise the tool cannot capture the request
curl -d 'a=1&b=2' -x http://127.0.0.1:8888 http://www.example.com
It can also download files like wget, but it defaults to outputting to standard output instead of writing to a file:
# Output response content directly
curl http://ayqy.net
You will get a simple 301 page; curl does not follow redirects automatically, which makes it handy for tracing them (though capturing packets directly is even more straightforward).
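To make this behavior concrete, here is a self-contained sketch against a throwaway local server; the port 8301, the /target path, and the python3 helper are all assumptions of this demo, not part of curl itself.

```shell
#!/bin/sh
# Hypothetical demo: a local server that answers "/" with a 301 to "/target"
python3 - <<'PY' &
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def _redirect(self):
        self.send_response(301)
        self.send_header('Location', '/target')
        self.end_headers()

    def do_GET(self):
        if self.path == '/target':
            body = b'arrived'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self._redirect()

    def do_HEAD(self):
        if self.path == '/target':
            self.send_response(200)
            self.end_headers()
        else:
            self._redirect()

    def log_message(self, *args):
        pass  # silence per-request logging

HTTPServer(('127.0.0.1', 8301), Handler).serve_forever()
PY
server_pid=$!
sleep 1  # give the server a moment to start

# Without -L curl stops at the 301; -sI prints only the response headers
location=$(curl -sI http://127.0.0.1:8301/ | grep -i '^location')
echo "$location"

# With -L curl follows the redirect and returns the final body
body=$(curl -sL http://127.0.0.1:8301/)
echo "$body"

kill "$server_pid"
```

The -L flag is curl's opt-in equivalent of wget's automatic redirect following.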
Downloading files can be done through output redirection or the -o option:
# Write to a file, progress information will be output by default
curl http://ayqy.net > 301.html
# Or
curl http://ayqy.net -o 301.html
# Use the filename from the URL
curl http://ayqy.net/index.html -O
# If the URL path contains no filename, -O fails
curl http://ayqy.net -O
# Silent download, no progress information output
curl http://ayqy.net --silent -o 301.html
A very interesting command:
# Install nvm using curl
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.1/install.sh | bash
The argument to -o is -, meaning standard output, which is then piped into bash for execution; the whole line fetches and runs an online bash script.
wget is similar:
# Install nvm using wget
wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.33.1/install.sh | bash
The -q option silences output to ensure a clean result, and -O - redirects to standard output, which is then passed to the bash command for execution.
The power of curl lies in its ability to modify request header field values:
# Specify the referer field
curl --referer http://ayqy.net http://node.ayqy.net
# Set a cookie
curl -v --cookie 'isVisted=true' http://localhost:9103
# Or use -H to set any header field
curl -v -H 'Cookie: isVisted=true' http://localhost:9103
curl -v -H 'Cookie: isVisted=true' -H 'Referer: http://a.com' http://localhost:9103
# Write cookies from the response to a file (send them back later with -b cookie.txt)
curl http://localhost:9103 -c cookie.txt
# Set UA
curl -v -A 'hello, i am android' 'http://localhost:9105'
Other features and options:
# Show a simple progress bar (the default is a full progress meter)
curl http://ayqy.net --progress-bar -o 301.html
# Resume download
# Manually specify an offset: skip the first 15 bytes (here, the DOCTYPE declaration)
curl http://node.ayqy.net -C 15
# Automatically calculate offset (similar to wget -c)
curl http://node.ayqy.net -C -
# Limit download rate (if not redirected to a file, output to standard output is also rate-limited)
curl http://www.ayqy.net > ayqy.html --limit-rate 1k
# Limit total download size
curl http://node.ayqy.net --max-filesize 100
# Username and password authentication
curl -v -u username:password http://example.com
# Only output response headers
# The www host returns fewer header fields
curl -I http://node.ayqy.net
curl -I http://www.ayqy.net
Batch Downloading Images
It's easy to accomplish similar simple tasks using curl:
#!/bin/bash
# Batch download images
# Check number of arguments
if [ $# -ne 3 ];
then
echo "Usage: $0 -d <dir> <url>"
exit 1
fi
# Extract arguments
for i in {1..3};
do
case $1 in
-d) shift; dir=$1; shift;;
*) url=${url:-$1}; shift;;
esac
done
# Extract base URL (scheme plus host)
baseurl=$(echo "$url" | grep -Eo 'https?://[a-zA-Z0-9.-]+')
# Get source code, filter img tags, extract src
tmpFile="/tmp/img_url_$$.tmp"
curl "$url" --silent \
| grep -Eo '<img[^>]*src="[^"]+"[^>]*>' \
| sed 's/.*src="\([^"]*\)".*/\1/' \
> "$tmpFile"
echo "save image urls to $tmpFile"
# Convert root-relative paths to absolute URLs
# (BSD/macOS sed syntax; with GNU sed, drop the '' after -i)
sed -i '' "s;^/;$baseurl/;g" "$tmpFile"
# Create directory
mkdir -p "$dir"
cd "$dir" || exit 1
# Download images
while read -r imgUrl;
do
filename=${imgUrl##*/}
curl "$imgUrl" --silent > "$filename"
echo "save to $dir/$filename"
done < "$tmpFile"
echo 'done'
Run the script above to scrape images from Pengfu:
./imgdl.sh http://www.pengfu.com -d imgs
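The extraction step can be tried in isolation. The HTML snippet below is made up for the demo, and the grep pattern is a tightened variant that keeps each match inside a single tag (it uses [^>]* rather than a greedy .* that could swallow neighboring tags):

```shell
# Extract img src values from a sample line of HTML
html='<p><img class="pic" src="/logo.png" alt="x"><img src="http://cdn.example.com/a.jpg"></p>'
srcs=$(echo "$html" \
  | grep -Eo '<img[^>]*src="[^"]+"[^>]*>' \
  | sed 's/.*src="\([^"]*\)".*/\1/')
echo "$srcs"
# -> /logo.png
# -> http://cdn.example.com/a.jpg
```

grep -Eo prints each tag match on its own line, so sed only ever sees one src per line.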
The core part is simple: get the source code, find the img tags, extract src, and iterate through to download. There's a small trick in the parameter extraction part:
# Extract arguments
for i in {1..3};
do
case $1 in
-d) shift; dir=$1; shift;;
*) url=${url:-$1}; shift;;
esac
done
The shift command pops the first element off the positional parameters ($1…$n), just like the shift/array-shift operation in other languages: remove the first element and move the rest forward. The loop therefore only ever needs to look at $1. The case statement matches either a parameter name or a value, consuming arguments one at a time, always reading the first. For a key-value pair such as -d <dir>, it first shifts away -d, reads <dir>, then shifts <dir> away as well, leaving the next argument in place for the following iteration.
The advantage of reading parameters this way is that their order is not fixed; a key and its value must of course stay adjacent, but the parameters themselves may appear in any order.
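If strict option ordering is acceptable, the same parsing can be done with the shell's built-in getopts; this is only an alternative sketch (the parse_args function name is made up, and getopts requires options to precede the positional operands, so it is less order-tolerant than the shift loop):

```shell
#!/bin/sh
# Alternative argument parsing with the POSIX getopts builtin
parse_args() {
  local OPTIND opt     # reset getopts state for each call
  dir=.                # default output directory
  while getopts 'd:' opt; do
    case $opt in
      d) dir=$OPTARG ;;
      *) echo "Usage: imgdl.sh [-d <dir>] <url>" >&2; return 1 ;;
    esac
  done
  shift $((OPTIND - 1))  # drop the parsed options
  url=$1                 # first remaining operand is the URL
}

parse_args -d imgs http://www.pengfu.com
echo "dir=$dir url=$url"
# -> dir=imgs url=http://www.pengfu.com
```

getopts handles clustered short options and missing-argument errors for free, at the cost of the free ordering the hand-rolled loop allows.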
Here, ${url:-$1} indicates that if the variable url exists and is not empty, its value is taken; otherwise, the value of $1 is taken. This feature is called parameter expansion:
${parameter:-word}
If parameter is unset or empty, word is substituted; otherwise the value of parameter is used.
${parameter:=word}
Sets a default value: if parameter is unset or empty, word is assigned to parameter. Positional parameters ($1, $2, …) and special parameters cannot be assigned this way (they are read-only).
${parameter:?word}
Flags an unset-or-empty variable as an error: if parameter is unset or empty, word is written to standard error (as parameter: word; if word is omitted, the message is parameter: parameter null or not set), and a non-interactive shell exits immediately. If parameter is set and non-empty, its value is substituted.
${parameter:+word}
Tests whether a variable is set: if parameter is unset or empty, nothing is substituted; otherwise word is substituted.
In addition, all four forms have a colon-less variant that tests only whether parameter is unset; an empty value then counts as set.
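The four colon forms, and the empty-counts-as-set behavior of the colon-less variant, can be seen directly in the shell (the variable names below are arbitrary):

```shell
#!/bin/sh
# The four ':' parameter-expansion forms in action
unset name

echo "${name:-guest}"        # unset -> guest (name stays unset)
echo "${name:=guest}"        # unset -> assigns guest to name
echo "$name"                 # -> guest

flag=on
echo "${flag:+enabled}"      # set and non-empty -> enabled
echo "${missing:+enabled}"   # unset -> empty line

# :? would abort a non-interactive shell, so probe it in a subshell
(: "${missing:?is required}") 2>/dev/null || echo "missing not set"

# The colon-less forms treat an empty-but-set variable as set:
empty=''
echo "${empty-fallback}"     # empty line (empty counts as set)
echo "${empty:-fallback}"    # -> fallback
```

This is why ${url:-$1} in the script only captures $1 on the first pass: once url holds a value, later expansions leave it untouched.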
P.S. For more information about parameter expansion, please check the Bash Reference Manual: Shell Parameter Expansion.