UTF8 insert Byte Order Mark

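UTF-8 in general does not need a BOM (Byte Order Mark), but some libraries that read and write files require one. In UTF-16 and UTF-32 the BOM indicates the byte order; it is needed there unless the byte order is fixed some other way, for example by labelling the data as UTF-16LE or UTF-16BE.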

More information can be found on Wikipedia.

Show the difference
With the following command you can determine which encoding a file uses.

Command: file mytest.file

Output without BOM: mytest.file: UTF-8 Unicode text

Output with BOM: mytest.file: UTF-8 Unicode (with BOM) text
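The exact wording printed by file may differ between versions, so it can be more robust to look at the bytes directly: a UTF-8 BOM is the three bytes EF BB BF at the very start of the file. The following is a minimal sketch of such a check; the function name has_bom is a hypothetical helper and not part of any package.

```shell
#!/bin/bash
# has_bom FILE
# Succeeds (exit status 0) if FILE starts with the UTF-8 BOM bytes EF BB BF.
# "has_bom" is a hypothetical helper name chosen for this example.
has_bom() {
    # Dump the first three bytes as hex and strip od's padding.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

Usage: has_bom mytest.file && echo "with BOM" || echo "without BOM"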

Prerequisite
The following software package installs the user-space utility "uconv".

Command: apt-get install libicu-dev
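Before working on a whole tree it can help to try the conversion on a single file. The sketch below wraps the uconv call in a function and, as an assumption that goes beyond the original text, falls back to prepending the three BOM bytes by hand when uconv is not installed. The name add_bom is hypothetical, and the input is assumed to be UTF-8 without a BOM.

```shell
#!/bin/bash
# add_bom SRC DST
# Writes a copy of SRC to DST with a UTF-8 BOM in front.
# "add_bom" is a hypothetical helper; SRC is assumed to be UTF-8 without a BOM.
add_bom() {
    if command -v uconv >/dev/null 2>&1; then
        # Name both encodings explicitly instead of relying on the locale.
        uconv -f utf-8 -t utf-8 --add-signature "$1" > "$2"
    else
        # Fallback (assumption, not from the original page):
        # write the bytes EF BB BF in octal, then the file content.
        { printf '\357\273\277'; cat "$1"; } > "$2"
    fi
}
```

Usage: add_bom mytest.file /tmp/mytest.file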

Script
If you like, you can turn the BASEDIR and TARGETDIR variables into parameters passed to the script. The script duplicates the filesystem tree into the target directory; the source remains unchanged.
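One possible way to take the two directories from the command line is sketched here. The function name set_dirs is hypothetical, and falling back to the paths used in the script when an argument is missing is an assumption made for this example.

```shell
#!/bin/bash
# set_dirs [BASEDIR [TARGETDIR]]
# Sets BASEDIR and TARGETDIR from the arguments; a missing argument
# falls back to the default path used in the script.
# "set_dirs" is a hypothetical helper name.
set_dirs() {
    BASEDIR="${1:-/root/messages}"
    TARGETDIR="${2:-/tmp/messages}"
}

# Pass the script's own arguments through.
set_dirs "$@"
echo "Source: $BASEDIR"
echo "Target: $TARGETDIR"
```

Saved under a hypothetical name such as addbom.sh, the script could then be called as ./addbom.sh /srv/in /srv/out.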

Be aware: The script deletes the target directory on each run!
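Because of that deletion, a small sanity check before the rm call may be worthwhile, especially once the directories come from parameters. This is only a sketch of the idea; the function name check_target and the exact set of rejected values are assumptions, not part of the original script.

```shell
#!/bin/bash
# check_target TARGETDIR BASEDIR
# Fails (non-zero exit status) if deleting TARGETDIR looks dangerous:
# empty value, the root directory, or the same path as the source.
# "check_target" is a hypothetical helper name.
check_target() {
    case "$1" in
        ""|"/")
            echo "Refusing to delete target '$1'" >&2
            return 1
            ;;
    esac
    if [ "$1" = "$2" ]; then
        echo "Target and source are the same directory: $1" >&2
        return 1
    fi
    return 0
}
```

Usage: check_target "$TARGETDIR" "$BASEDIR" || exit 1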


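#!/bin/bash

BASEDIR=/root/messages
TARGETDIR=/tmp/messages

# Start from an empty target directory on every run.
rm -Rf "$TARGETDIR"
mkdir -p "$TARGETDIR"

# Walks the current directory. $1 is the path relative to BASEDIR
# (empty at the top level, otherwise ending in "/"), so nested
# directories end up at the same place below TARGETDIR.
function RecursiveConvert {
    local rel="$1" f OUTPUT
    for f in *
    do
        # In an empty directory the glob stays unexpanded; skip it.
        [ -e "$f" ] || continue
        if [ -d "$f" ]; then
            echo "Directory: $f"
            mkdir -v "$TARGETDIR/$rel$f"
            (cd "$f" && RecursiveConvert "$rel$f/")
        else
            echo "File: $f"
            # -b prints only the description, without the file name.
            OUTPUT=$(file -b "$f")
            echo "$OUTPUT"
            if [ "$OUTPUT" = "UTF-8 Unicode text" ]; then
                echo "UNICODE WITHOUT BOM"
                echo "Converting....."
                # Name both encodings instead of relying on the locale.
                uconv -f utf-8 -t utf-8 --add-signature "$f" > "$TARGETDIR/$rel$f"
                echo "......done!"
            else
                echo "Other file encoding $OUTPUT"
                echo "Copying....."
                cp -v "$f" "$TARGETDIR/$rel$f"
                echo ".....done"
            fi
        fi
    done
}

(cd "$BASEDIR" && RecursiveConvert "")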
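
After a run, one way to confirm that the target really mirrors the source is to compare the relative path listings of both trees. The sketch below does that with find and sort; the function name same_tree is hypothetical, and it compares names only, not file contents.

```shell
#!/bin/bash
# same_tree DIR1 DIR2
# Succeeds if both directories contain the same relative paths.
# Only names are compared, not file contents.
# "same_tree" is a hypothetical helper name.
same_tree() {
    [ "$(cd "$1" && find . | sort)" = "$(cd "$2" && find . | sort)" ]
}
```

Usage: same_tree /root/messages /tmp/messages && echo "trees match"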