Bash remove token from text file or array (conditionally)

I want to remove the 2nd token in some lines of the file DATA.TXT (34,000 lines): remove the second token of each line if it is not an 8-digit integer.

How can I do this in bash? (Probably with a bash ‘explode’ on the text file.) Would I be better off using bash, JS, or PHP for this task, which I have to do each month? Would it be easier and faster to process DATA.TXT directly as a text file, or to convert it to an array first?

EXAMPLE:
OLD:
COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*

NEW: “RAYMOND” is removed. Good!
COURTEMANCHE,STEVEN 10004331 07/31/2024 PA 1603*

ANOTHER EX.
OLD:
RACZKA,ALAN V 10001901 12/31/2099 MA 1469*

NEW: “V” is removed. Good!
RACZKA,ALAN 10001901 12/31/2099 MA 1469*

But some of the other lines already have an 8-digit integer as the second token, in which case there is no need to process that line, e.g. “CAMPBELL,ROBERT”.

DATA.TXT:

RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*

I’m very much a novice with bash, but here’s my attempt.

replace.sh

#!/usr/bin/env bash
# colours
NORMAL=$(tput sgr0 setaf 15)
PRIMARY=$(tput setaf 10 bold)
SECONDARY=$(tput setaf 5 bold)

function find_file() {
    local name=$1
    local found_file
    # Quote "$name" so spaces or glob characters reach find unchanged
    found_file=$(find . -type f -name "$name")

    if [[ -z $found_file ]]; then
        return 1
    fi
    echo "$found_file"
}


function find_and_replace() {
    local name=$1
    local file=""

    if [[ -z $name ]]; then
        # Pass the text as arguments so it is never interpreted as a printf format
        printf '%sNo filename entered%s\n' "$PRIMARY" "$NORMAL"
        return 1
    fi

    file=$(find_file "$name")

    if [[ -z $file ]]; then
        printf '%sFile %s%s%s does not exist%s\n' "$PRIMARY" "$SECONDARY" "$name" "$PRIMARY" "$NORMAL"
        return 1
    fi
    # works with given examples, but may need tweaking
    # (\s and the trailing i flag are GNU sed extensions)
    sed -E 's/([a-z]+,[a-z]+)\s[a-z]+/\1/gi' "$file" > "$(dirname "$file")/new_$(basename "$file")"
}

# The function's exit status becomes the script's status, whether sourced or executed
find_and_replace "$1"

sample file example.txt

RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*

Command line, Linux (WSL)

. replace.sh example.txt

Outputs to file new_example.txt

RACZKA,ALAN 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
COURTEMANCHE,STEVEN 10004331 07/31/2024 PA 1603*
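
For what it’s worth, the regex above decides by looking for letters after the name, not by the 8-digit rule from the question. Here is a minimal sketch that tests the second field directly with awk. It assumes tokens are separated by spaces or tabs and that collapsing runs of whitespace to a single space is acceptable; the function name strip_second_token is made up for illustration.

```bash
#!/usr/bin/env bash
# strip_second_token: hypothetical helper; reads lines on stdin, writes to stdout.
strip_second_token() {
    awk '
        # Field 2 is exactly eight digits: leave the line untouched
        length($2) == 8 && $2 ~ /^[0-9]+$/ { print; next }
        # Otherwise rebuild the line from field 1 and fields 3..NF
        {
            out = $1
            for (i = 3; i <= NF; i++) out = out " " $i
            print out
        }
    '
}

# Demo on the sample lines from the thread
strip_second_token <<'EOF'
RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*
EOF
```

On the real file this would be something like strip_second_token < DATA.TXT > DATA_NEW.TXT. Note that it removes only one token per line, so a name with two extra tokens before the number would still need another pass.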

Why mess with colors? This seems silly and unnecessary.

Why do you include this line twice?
local name=$1

Wouldn’t this be simpler:

sed -E -i ‘s/([a-z]+,[a-z]+)\s[a-z]+/\1/gi’ “DATA.TXT”

Thank you, Sir!

I worked from one of my previous bash files and just opted to leave them in. It’s only error messages, no big deal. Feel free to remove them if you think it is silly.

I was presuming you didn’t want to overwrite the existing file — e.g. if there is an edge case where the regex doesn’t match and replace as intended. It doesn’t have to have a prefix of new_. Again feel free to change.

It’s an example, and as previously mentioned bash is not my area of expertise. Just thought I would have a go at it.
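
On the text-file-versus-array part of the question: bash can split each line into an array with read -a, so there is no need to load the whole file into an array first. Below is a pure-bash sketch, assuming whitespace-separated tokens; the name drop_second_token is hypothetical. It is likely slower than sed or awk, though for 34,000 lines once a month that probably does not matter much.

```bash
#!/usr/bin/env bash
# drop_second_token: hypothetical helper; reads lines on stdin, writes to stdout.
drop_second_token() {
    local -a tok
    local eight_digits='^[0-9]{8}$'
    # The || clause also handles a final line that lacks a trailing newline
    while read -r -a tok || (( ${#tok[@]} )); do
        if (( ${#tok[@]} >= 2 )) && [[ ! ${tok[1]} =~ $eight_digits ]]; then
            # Rebuild the array without element 1 (the second token)
            tok=("${tok[0]}" "${tok[@]:2}")
        fi
        # "${tok[*]}" joins the elements with single spaces
        printf '%s\n' "${tok[*]}"
    done
}

# Demo on two of the sample lines
drop_second_token <<'EOF'
RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
EOF
```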


What am I missing here?

I tested this

sed -E -i ‘s/([a-z]+,[a-z]+)\s[a-z]+/\1/gi’ “DATA.TXT”

It does not do what it is supposed to do. In fact, it does not do anything.

Well, at least for me, executing it on the command line would require not putting quotes around the input filename. Your OS may be different. (Mine, Linux Mint 21.3 Cinnamon, borked with sed: can't read “data.txt”: No such file or directory.) It also doesn’t like fancy quote marks, so make sure your apostrophes are apostrophes and not curly fancy things.

EDIT: Correction. Tripped myself up with my own words. The fancy quote thing is what ate the filename.
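
A small illustration of why the curly quotes ate the filename: the shell strips straight quotes before the command runs, but it treats curly quotes as ordinary characters, so they stay part of the argument that sed then tries to open. The filename is just the one from the thread; nothing is read from disk here.

```bash
# Straight quotes are shell syntax: they are removed before the command sees the name
printf '[%s]\n' "DATA.TXT"    # prints [DATA.TXT]

# Curly quotes are ordinary characters: they remain part of the argument
printf '[%s]\n' “DATA.TXT”    # prints [“DATA.TXT”]
```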

Yes. Mine is Mint too.
However, it does not seem to work on my Mac, but it does work on my Linux machine.

sed -E 's/([a-z]+,[a-z]+)\s[a-z]+/\1/gi' "test.txt" > "test_new.txt"

test.txt

RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*

test_new.txt

RACZKA,ALAN 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
COURTEMANCHE,STEVEN 10004331 07/31/2024 PA 1603*

There is more than one flag option for extended regexes, so it may be the -E flag.

Options:
-E
-r
--regexp-extended

Use extended regular expressions rather than basic regular expressions. Extended regexps are those that egrep accepts; they can be clearer because they usually have fewer backslashes. Historically this was a GNU extension, but the -E extension has since been added to the POSIX standard (http://austingroupbugs.net/view.php?id=528), so use -E for portability. GNU sed has accepted -E as an undocumented option for years, and *BSD seds have accepted -E for years as well, but scripts that use -E might not port to other older systems. See Extended regular expressions.

Without the extended version I could not use the + operator, and zero-or-many was tripping me up.
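
On the Mac problem, another possibility besides -E: \s and the trailing i flag are GNU extensions rather than POSIX sed features, so other seds may not accept them. Here is a sketch that sticks to bracket expressions, which should be more portable. I have not tested it on macOS, and the function name strip_middle_name is made up.

```bash
#!/usr/bin/env bash
# strip_middle_name: hypothetical helper; same idea as the regex in the thread,
# but with [[:space:]] instead of \s and explicit [A-Za-z] instead of the i flag.
strip_middle_name() {
    sed -E 's/^([A-Za-z]+,[A-Za-z]+)[[:space:]]+[A-Za-z]+/\1/' "$@"
}

# Demo on the sample lines; with a file it would be: strip_middle_name test.txt > test_new.txt
strip_middle_name <<'EOF'
RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*
EOF
```

The -i option differs too: GNU sed takes -i on its own, while the BSD sed shipped with macOS expects an argument after it (for example -i ''). Writing to a new file, as above, sidesteps that difference.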

I will also add, it was late :biggrin:

Just to add, my version had sourceFile > destinationFile

sed -E 's/([a-z]+,[a-z]+)\s[a-z]+/\1/gi' "$file" > "$(dirname "$file")/new_$(basename "$file")"

without a pipe, sed will replace in-place.


Ah ok, didn’t know that.

Sorry for being pedantic, but it is called a ‘redirection operator’ isn’t it? I have used the pipe ‘|’ operator and it works like compose.

Well, sed works with either. In general I was using “pipe” to mean “if it has no other defined outflow, it assumes to operate in-place in the file stream” (as sed stands for “stream editor”).
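
A quick way to check where the output actually goes. By default sed writes the edited text to standard output and leaves the file alone; it only modifies the file when given -i, or when the output is redirected to a new file that then replaces the original. The scratch file and its one-line content below are purely illustrative.

```bash
#!/usr/bin/env bash
f=$(mktemp)
printf 'RACZKA,ALAN V 10001901\n' > "$f"

# No -i and no redirection: the edited line is printed, the file is not touched
printed=$(sed -E 's/ V / /' "$f")
unchanged=$(cat "$f")

# Redirect to a new file, then move it over the original
sed -E 's/ V / /' "$f" > "$f.new" && mv "$f.new" "$f"
changed=$(cat "$f")

printf 'printed by sed:    %s\n' "$printed"
printf 'file after step 1: %s\n' "$unchanged"
printf 'file after step 2: %s\n' "$changed"
rm -f "$f"
```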
