I want to remove the 2nd token in some lines of file DATA.TXT (34,000 lines)
Remove the second token of each line if it is not an 8-digit integer.
How can I do this in bash? (Probably with a bash equivalent of ‘explode’ on the text file.) Would I be better off using bash, JS, or PHP for this task, which I have to do each month? Would it be easier and faster to process DATA.TXT as a text file, or to convert it to an array first?
EXAMPLE:
OLD:
COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*
NEW: “RAYMOND” is removed. Good!
COURTEMANCHE,STEVEN 10004331 07/31/2024 PA 1603*
ANOTHER EX.
OLD:
RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
NEW: “V” is removed. Good!
RACZKA,ALAN 10001901 12/31/2099 MA 1469*
But, some of the other tuples already have a second token as an 8-digit integer, in which case no need to process that line. Like: “CAMPBELL,ROBERT”
DATA.TXT:
RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
lCOURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*
Very much a novice with bash, but here’s my attempt
replace.sh
#!/usr/bin/env bash
# Meant to be sourced (". replace.sh FILE"), hence the final "return".

# colours
NORMAL=$(tput sgr0 setaf 15)
PRIMARY=$(tput setaf 10 bold)
SECONDARY=$(tput setaf 5 bold)

# Print the path of the file with the given name found under the current directory.
function find_file() {
    local name=$1
    local found_file
    found_file=$(find . -type f -name "$name")
    if [[ -z $found_file ]]; then
        return 1
    fi
    echo "$found_file"
}

# Write a copy of the file, prefixed with new_, with the extra name token removed.
function find_and_replace() {
    local name=$1
    local file=""
    if [[ -z $name ]]; then
        printf "${PRIMARY}No filename entered${NORMAL}\n\r"
        return 1
    fi
    file=$(find_file "$name")
    if [[ -z $file ]]; then
        printf "${PRIMARY}File ${SECONDARY}$name${PRIMARY} does not exist${NORMAL}\n\r"
        return 1
    fi
    # works with given examples, but may need tweaking
    sed -E 's/([a-z]+,[a-z]+)\s[a-z]+/\1/gi' "$file" > "$(dirname "$file")/new_$(basename "$file")"
}

find_and_replace "$1"
return 0
sample file example.txt
RACZKA,ALAN V 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
lCOURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*
command line linux(wsl)
. replace.sh example.txt
Outputs to file new_example.txt
RACZKA,ALAN 10001901 12/31/2099 MA 1469*
CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*
lCOURTEMANCHE,STEVEN 10004331 07/31/2024 PA 1603*
I worked from one of my previous bash files and just opted to leave the colour codes in. They are only used for the error messages, so no big deal. Feel free to remove them if you think they are silly.
I was presuming you didn’t want to overwrite the existing file — e.g. if there is an edge case where the regex doesn’t match and replace as intended. It doesn’t have to have a prefix of new_. Again feel free to change.
It’s an example, and as previously mentioned bash is not my area of expertise. Just thought I would have a go at it.
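For comparison, here is a minimal sketch that applies the rule from the question literally: drop the second token unless it is exactly eight digits. It uses awk, which splits each line into whitespace-separated fields (the "explode" step) without loading the file into an array. The function name strip_second_token is made up for illustration, and the sketch assumes tokens are separated by spaces or tabs. Changed lines are rebuilt with single spaces, and only one token is removed, so a line with two middle names would still need another pass.

```bash
#!/usr/bin/env bash
# Hypothetical helper (name invented for this sketch): remove field 2 from
# each line unless field 2 is exactly eight digits.
strip_second_token() {
    awk '
    {
        # Leave the line untouched when field 2 is already the 8-digit number.
        if (NF < 2 || (length($2) == 8 && $2 ~ /^[0-9]+$/)) {
            print
            next
        }
        # Otherwise rebuild the line from field 1 and fields 3..NF.
        line = $1
        for (i = 3; i <= NF; i++) line = line " " $i
        print line
    }' "$@"
}

# Demo on lines modelled on the sample data.
printf '%s\n' \
    'RACZKA,ALAN V 10001901 12/31/2099 MA 1469*' \
    'CAMPBELL,ROBERT 10002826 12/31/2099 MA 1900*' \
    'COURTEMANCHE,STEVEN RAYMOND 10004331 07/31/2024 PA 1603*' |
    strip_second_token
```

On a real file this could be run as strip_second_token DATA.TXT > new_DATA.TXT, which keeps the original untouched in the same way as the sed version.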
Well, at least for me, executing it on the command line required not putting quotes around the input filename. Your OS may differ. (Mine, Linux Mint 21.3 Cinnamon, borked with sed: can't read “data.txt”: No such file or directory.) It also doesn't like fancy quote marks, so make sure your quotes and apostrophes are the plain straight kind and not curly ones.
EDIT: Correction. Tripped myself up with my own words. The fancy quote thing is what ate the filename.
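To illustrate the quote problem, here is a small sketch (the variable names are invented for the demo). The shell treats straight double quotes as quoting syntax and strips them, but curly quotes are ordinary characters, so they end up as part of the filename that sed then cannot find.

```bash
#!/usr/bin/env bash
# Straight quotes are shell syntax: the value is just data.txt
straight="data.txt"
# Curly quotes are ordinary characters: they become part of the value
curly=“data.txt”

printf 'straight: %s\n' "$straight"
printf 'curly:    %s\n' "$curly"

if [ "$straight" != "$curly" ]; then
    echo "The two names differ, so they refer to different files."
fi
```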
There is more than one flag option for extended regexes, so it may be the -E flag.
Options: -E -r --regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps are those that egrep accepts; they can be clearer because they usually have fewer backslashes. Historically this was a GNU extension, but the -E extension has since been added to the POSIX standard (http://austingroupbugs.net/view.php?id=528), so use -E for portability. GNU sed has accepted -E as an undocumented option for years, and *BSD seds have accepted -E for years as well, but scripts that use -E might not port to other older systems. See Extended regular expressions.
Without the extended version I could not use the + operator, and zero-or-many was tripping me up.
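As a rough illustration of the difference, the same substitution can be written both ways. This is a sketch on one sample line, using an anchored, uppercase-only pattern that is slightly stricter than the one in replace.sh. In basic syntax the group needs \( \) and "one or more" has to be spelled \{1,\}. With -E the plain ( ) and + work.

```bash
#!/usr/bin/env bash
line='RACZKA,ALAN V 10001901 12/31/2099 MA 1469*'

# Basic regular expressions (no -E): escaped group, \{1,\} for "one or more".
bre=$(printf '%s\n' "$line" | sed 's/^\([A-Z]\{1,\},[A-Z]\{1,\}\) [A-Z]\{1,\} /\1 /')

# Extended regular expressions (-E): unescaped group and the + operator.
ere=$(printf '%s\n' "$line" | sed -E 's/^([A-Z]+,[A-Z]+) [A-Z]+ /\1 /')

printf 'basic:    %s\n' "$bre"
printf 'extended: %s\n' "$ere"
```

Both commands should print the line with the "V" removed. The extended form is simply easier to read.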
I will also add, it was late
Just to add, my version had sourceFile > destinationFile
sed -E 's/([a-z]+,[a-z]+)\s[a-z]+/\1/gi' "$file" > "$(dirname $file)/new_$(basename $file)"
Well, sed works with either. Note, though, that sed (the “stream editor”) writes its result to standard output by default, so without a redirect to a new file or the -i (in-place) option, the original file is left unchanged.
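A short sketch of the two output styles, run in a scratch directory so nothing real is touched. The file names and the variable names are made up for the demo, and the pattern is a stricter, anchored variant of the one above. The attached-suffix form -i.bak is assumed here because both GNU and BSD sed accept it, whereas a bare -i behaves differently between the two.

```bash
#!/usr/bin/env bash
# Scratch directory and file names are hypothetical, for illustration only.
tmp=$(mktemp -d)
src="$tmp/example.txt"
printf '%s\n' 'RACZKA,ALAN V 10001901 12/31/2099 MA 1469*' > "$src"

script='s/^([A-Z]+,[A-Z]+) [A-Z]+ /\1 /'

# 1) Default: sed prints to stdout, so redirect into a new file.
sed -E "$script" "$src" > "$tmp/new_example.txt"
after_redirect=$(cat "$src")               # source is still unchanged
new_copy=$(cat "$tmp/new_example.txt")

# 2) In place: sed rewrites the source and keeps a backup as example.txt.bak.
sed -E -i.bak "$script" "$src"
after_inplace=$(cat "$src")
backup=$(cat "$src.bak")

printf 'source after redirect: %s\n' "$after_redirect"
printf 'new copy:              %s\n' "$new_copy"
printf 'source after -i.bak:   %s\n' "$after_inplace"
printf 'backup:                %s\n' "$backup"

rm -rf "$tmp"
```

Redirecting to a new file is the more cautious choice for a monthly job, since a pattern that misfires cannot damage the original.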