
Text_bash notes 4

2017-03-26 · Tags: Tool, grep, sed, cut, awk, Mac sed

Text processing is among the most common tasks in shell scripting; the main tools covered here are grep, cut, sed, awk, and paste.

grep

Used for text searching and matching file content. The syntax format is: grep pattern filename. For example:

# Find all lines containing 'for'
grep 'for' test.sh
# Search across multiple files
grep 'for' test.sh bak.sh
# Highlight matched parts
grep 'for' test.sh --color=auto

By default grep uses basic regular expressions (BRE); extended regular expressions require the -E (extended) option:

# Find all lines starting with echo
grep -E '^\s*echo' test.sh

Or use the egrep command, which uses extended regular expressions by default (it is equivalent to grep -E):

# Same as above
egrep '^\s*echo' test.sh
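The practical difference shows up with ERE-only operators such as + and ?. A quick comparison, using input that contains both a literal 'a+b' and a repeated 'a':

```shell
# Basic regex (default): '+' is an ordinary character, so 'a+b' matches the literal text
printf 'a+b\naab\n' | grep 'a+b'      # a+b
# Extended regex: '+' means "one or more of the preceding", so it matches 'aab'
printf 'a+b\naab\n' | grep -E 'a+b'   # aab
```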

Other options and features:

# Output only the matched part
grep -o -E '\s[a-zA-Z]\s' test.sh
# Output only non-matching lines (invert match)
grep -v -E '\s[a-zA-Z]\s' test.sh
# Count matching lines
grep -c -E '\s[a-zA-Z]\s' test.sh
# Count the number of matches
grep -o -E '\s[a-zA-Z]\s' test.sh | wc -l
# Output matching lines and their line numbers
grep -n -E '\s[a-zA-Z]\s' test.sh
# Output the filenames of matches (invert is -L)
grep -l 'return' test.sh bak.sh return.sh
# Recursively search directories, output filenames and line numbers
grep -n -R  'echo' .
# Ignore case
grep -i "ECho" test.sh
# Limit filenames for directory search
# Note: the include parameter value must be quoted, unlike the find command
grep -R '=>' . --include '*.jsx'
# Exclude specific file formats and directories from directory search
grep -R '' . --exclude '*.md' --exclude-dir 'node_modules'
# Output \0 as the terminator, usually used with -l to output only filenames for xargs -0
grep "echo" . -R -l -Z | xargs ls -l
# Silent match: prints nothing to stdout; exits with status 0 on a match
if echo ' abcd' | grep -q -E '^\s*abc'; then echo 'starts with abc'; fi

In addition to locating matches, you can also output the context of the matches:

# Output the matching line and the following 2 lines
seq 10 | grep '4' -A 2
# Output the matching line and the preceding 2 lines
seq 10 | grep '4' -B 2
# Output the matching line and 2 lines before and after
seq 10 | grep '4' -C 2

cut

There are 3 ways to split: -c by character, -f by field, and -b by byte.

Split by character:

# Extract characters 3 through 5 of each line
echo $'1 2 3 4\n5 6 7 8' | cut -c 3-5
# From the 3rd character to the end of the line
echo $'1 2 3 4\n5 6 7 8' | cut -c 3-
# Up to the 5th character
echo $'1 2 3 4\n5 6 7 8' | cut -c -5
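Individual positions and ranges can also be combined in a single invocation, comma-separated:

```shell
# Extract character 1 plus characters 3 through 5
echo '12345678' | cut -c 1,3-5   # 1345
```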

Split by field (column): each column is treated as a field and the specified columns are extracted, similar to awk:

echo $'1 2 3 4\n5 6 7 8' | cut -d ' ' -f 1,3

Note: The big caveat is the delimiter. The default is the tab character (type Ctrl+V then Tab to enter one literally). The -d option sets a different delimiter, but it must be a single character, which is limiting: cut cannot collapse runs of spaces, so it only suits strictly single-character separators.

For example, extracting the PID and CMD columns from ps results:

# awk solves the problem perfectly
ps | awk '{print $1,$4}'
# cut is hard to use
# Default cut by tab is ineffective
ps | cut -f 1,4
# Specifying space for cut yields incorrect results
ps | cut -d ' ' -f 1,4
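A common workaround is to squeeze repeated spaces with tr -s before cutting, though note that a squeezed leading space still produces an empty first field, so awk remains the more robust choice:

```shell
# Squeeze runs of spaces into one, then cut on a single space
echo 'a   b   c' | tr -s ' ' | cut -d ' ' -f 2   # b
```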

Split by byte; multi-byte character boundaries are ignored by default:

# Default cross-character splitting; Chinese characters are corrupted
echo "想做个好人" | cut -b 2-4
# The -n option does not split multi-byte characters, resulting in `想`
echo "想做个好人" | cut -n -b 2-4

sed

sed (stream editor) is a non-interactive editor and a common text-processing tool. Its most frequent use is text replacement:

# Remove leading whitespace from lines
echo $' \t  我想左对齐' | sed  $'s/^[[:space:]]*\t*//g'

Another common function is in-place file replacement (replacing and writing results back to the original file):

# Replace all words in test.txt with [word]
echo $'this is a new file\nnext line' > test.txt
sed -i '' -E 's/[[:alpha:]]{1,}/[word]/g' test.txt

P.S. On Mac, sed -i for in-place replacement must specify a backup filename (though it can be an empty string). Also, Mac's sed is quite different from GNU sed, lacking features like +, ?, \b, etc. For more differences, see Differences between sed on Mac OSX and other “standard” sed?.
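When a script must run under both GNU and BSD/Mac sed, one portable option (an illustrative sketch; the demo.txt filename is arbitrary) is to skip -i entirely and rename a temporary file:

```shell
# Works with any POSIX sed: write to a temp file, then replace the original
printf 'foo bar\n' > demo.txt
sed 's/foo/FOO/' demo.txt > demo.txt.tmp && mv demo.txt.tmp demo.txt
cat demo.txt   # FOO bar
```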

The delimiter is usually /, but any symbol can be used:

# Semicolon
echo $'\t\t\t我想左对齐' | sed $'s;^\t*;;'
# On Mac, it can even be `|`
echo $'\t\t\t我想左对齐' | sed $'s|^\t*||'
# A literal occurrence of the delimiter character inside the pattern must be escaped
echo '&c' | sed -E 's;&[[:alpha:]]{1,}\;;\&;'

Other common options:

# /pattern/d deletes matching lines
sed '/^$/d' test.sh
# & represents the matched part
echo 'abc de' | sed -E 's/[[:alpha:]]{1,}/[&]/g'
# \123.. back-references
echo 'aabcc' | sed 's/\([[:alpha:]]\)\1/[\1x2]/g'
# sed 'expr1; expr2...' applies multiple regexes in sequence, equivalent to a pipe
echo 'aabcc' | sed 's/\([[:alpha:]]\)\1/[\1x2]/g;s/\].*\[/][/'

Note: Capture parentheses in the back-reference example must be escaped.
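sed can also act like grep: -n suppresses the default printing of every line, and the p command then prints only the lines that match:

```shell
# Print only lines containing 'b', similar to grep 'b'
printf 'abc\ndef\nbcd\n' | sed -n '/b/p'
```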

awk

Usually used for column extraction, for example:

# PID and CMD columns
ps | awk '{print $1, $4}'

awk is very powerful and can operate on both columns and lines. The general format is:

awk 'BEGIN{ print "start" } pattern1{ command } END{ print "end" }' file

BEGIN, END, and pattern blocks are all optional. The BEGIN block executes first, then one line is read from input, each pattern block is executed sequentially until all content is read, and finally the END block executes.

pattern is also optional; if omitted, the statements in the block are executed unconditionally for every line. For example:

# Output as-is
echo $'1 2\n3 4' | awk '{print}'
# Count lines
echo $'1 2\n3 4' | awk 'BEGIN{lineCount=0} {lineCount++} END{print lineCount}'

print is special: space-separated arguments are concatenated upon output, while comma-separated arguments are separated by spaces. For example:

# Output 123
echo '' | awk '{print 1 2 3}'
# Output 1 2 3
echo '' | awk '{print 1,2,3}'
# Output 1-2-3
echo '' | awk '{print 1"-"2"-"3}'

Built-in Variables

awk has some special built-in variables:

  • NR: number of records, the current line number

  • NF: number of fields, the number of fields in the current line

  • $0: the text content of the current line

  • $1, $2, $3...: the text content of the nth field in the current line

So there is a simpler way to count lines:

echo $'1 2\n3 4' | awk 'END{print NR}'

NR is updated for every line read; when the END block is reached, it represents the total line count.

Note: In awk, accessing variable values does not require $, whether they are built-in or custom variables.
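A variable can even be used to build a field reference: since NF holds the field count, $NF is always the last field of the current line:

```shell
# Print the number of fields and the last field
echo 'a b c' | awk '{print NF, $NF}'   # 3 c
```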

Passing External Variables

External variables cannot be used directly in awk; they must be passed in:

# Output empty
x=3; echo '' | awk '{print x}'
# Output 3
x=3; echo '' | awk -v x=$x '{print x}'

There is a simpler way to pass multiple external variables:

# Output 3 4 5
x=3; y=4; z=5; echo '' | awk -v x=$x -v y=$y -v z=$z '{print x,y,z}'
# Simple way
x=3; y=4; z=5; echo '' | awk '{print x,y,z}' x=$x y=$y z=$z

Each var=value pair is passed as a command-line argument right after the program; awk performs the assignment when it reaches that argument position, before reading the input that follows.
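Another route (standard in POSIX awk) is the built-in ENVIRON array, which exposes the environment variables awk was started with:

```shell
# Read an environment variable inside awk via ENVIRON
# (the prefix assignment puts x into awk's environment)
x=3 awk 'BEGIN{print ENVIRON["x"]}'   # 3
```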

getline

Generally used to read the next line. Usage is as follows:

# Output the first line
echo $'1 2\n3 4' | awk 'BEGIN{getline line; print line}'
# Skip the first line (discards the 'total xxx' line)
ls -l | awk 'BEGIN{getline} {print $0}'

getline without arguments updates $0, $1, $2, etc. (it does not when used with arguments). For example:

# With arguments: does not update field variables
echo $'1 2\n3 4' | awk 'BEGIN{print $0; getline line; print $0}'
# Without arguments: updates field variables
echo $'1 2\n3 4' | awk 'BEGIN{print $0; getline; print $0}'

Executing Other Commands

Executing other commands in awk is also unique:

# $0 is the output of md5 test.sh
echo '' | awk '{"md5 test.sh" | getline; print $0}'
# Or
echo '' | awk '{"md5 test.sh" | getline md5; print md5}'
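When the command's output is not needed inside awk (it goes straight to stdout), the system() function is simpler than the cmd | getline idiom; it returns the command's exit status:

```shell
# Run a command directly; its output is not captured by awk
awk 'BEGIN{ status = system("echo hello"); print "exit:", status }'
```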

Loops and Conditions

C-style loops, conditions, and other structures can be used in awk:

# while loop
seq 10 | awk 'BEGIN{while (getline){print $0}}'
# for loop
seq 10 | awk 'BEGIN{for(i=0; i<10; i++){getline; print $0}}'
# Conditional statement
seq 10 | awk 'BEGIN{for(i=0; i<10; i++){getline; if ($1 % 2) {print $0}}}'

These features make awk very powerful and convenient for line-by-line file processing.
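A typical pattern combines an unconditional block that accumulates state with an END block that reports it, for example summing a column:

```shell
# Sum the first column; uninitialized awk variables start at 0
seq 10 | awk '{ sum += $1 } END { print sum }'   # 55
```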

P.S. For more statement structures and built-in functions, please check man awk.

Other Options

Commonly used options:

# Specify the input field separator; the default is whitespace (runs of spaces/tabs)
echo 'a;b;c' | awk -F ';' '{print $2}'
# Or
echo 'a;b;c' | awk 'BEGIN{FS=";"} {print $2}'
# Specify the output delimiter
echo 'a b c' | awk 'BEGIN{OFS="\t"} {print $1,$2,$3}'
# Pattern filtering
# Line number less than 2
echo $'1 2\n3 4' | awk 'NR < 2{print $0}'
# Line number between 2 and 4
seq 10 | awk 'NR==2,NR==4{print $0}'
# Match regular expression
echo $'1 2\n3 4' | awk '/^3/{print $0}'

Processing File Content

Read line by line:

# Input redirection
while read -r line; do echo "$line"; done < test.sh
# Or subshell
cat test.sh | (while read -r line; do echo "$line"; done)

Read individual fields in a line:

line='1 2 3 4'; IFS=' '; for field in $line; do echo $field; done
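In bash, read -a splits a line directly into an array in one step (the variable names here are arbitrary):

```shell
# Split a line into a bash array; indices start at 0
line='1 2 3 4'
read -r -a fields <<< "$line"
echo "${fields[2]}"   # 3
```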

Read each character in a field:

field='word'; for ((i=0;i<${#field};i++)) do echo ${field:i:1}; done

This uses a substring extraction trick ${field:i:1}, with the format ${var:start_index:length}. The start index can be negative, representing counting from the end:

# Extract the last 2 characters
field='abcdef'; echo ${field:(-2):2}
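Parameter expansion also covers substitution (bash): ${var/pat/rep} replaces the first match and ${var//pat/rep} replaces them all:

```shell
s='abcabc'
echo "${s/b/X}"    # aXcabc  (first match only)
echo "${s//b/X}"   # aXcaXc  (all matches)
```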

P.S. Shell's support for string processing is truly incredibly powerful.

paste

Join text content by column. While cat joins by line, paste can join by column:

seq 3 > no.txt
echo $'吃饭\n睡觉\n打豆豆' > action.txt
# Join by line
cat no.txt action.txt
# Join by column
paste no.txt action.txt

The paste results are as follows:

# paste no.txt action.txt | sed -n l
1\t吃饭$
2\t睡觉$
3\t打豆豆$

The default delimiter is the tab character; the -d option can be used to specify other delimiters:

# Joined results separated by semicolons
paste -d ';' no.txt action.txt | sed -n l
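paste can also serialize a single input's lines into one row with -s, which pairs naturally with -d:

```shell
# Join all lines of stdin into one comma-separated line
seq 3 | paste -s -d ',' -   # 1,2,3
```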
