Workaround for file name problems when unzipping archives with CJK encodings

Unzip 5.x had an -O option to specify the encoding of file names in a ZIP archive, but when 6.0 arrived with Unicode support, that option disappeared. CJK users therefore need to take special care with the handling and conversion of legacy encodings while they switch to UTF-8.

Here is my workaround for this problem. Install the p7zip and convmv packages on your system first, then:
$ env LC_ALL=C 7z x file.zip
$ convmv -f gbk -t utf8 --notest *

File names extracted by unzip cannot be converted to the correct ones no matter what you do with them, but those extracted by 7z can be converted by convmv.

Going one step further, we can automate this with a script:
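#!/bin/sh
# Extract with 7z in the C locale, reverse the list of extracted paths (so entries
# are renamed before their parent directories), then convert the names from GBK to UTF-8.
LC_ALL=C /usr/bin/7z x -y "$1" | sed -n 's/^Extracting //p' | sed '1!G;h;$!d' | xargs convmv -f gbk -t utf8 --notest >/dev/null 2>&1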

Save it as unzip.sh, then try:
$ sh unzip.sh file.zip
This behaves like unzip, but additionally converts the file name encoding from GBK to UTF-8. Moreover, convmv can detect whether a file name is already UTF-8 encoded and will skip it.
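
To get a feel for what "already UTF-8 encoded" means, here is a rough sketch that checks each name in the current directory with iconv. It is not how convmv implements its check; the helper name is_utf8 is made up for illustration, and it assumes an iconv that exits non-zero on invalid input (GNU iconv does).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument is a valid UTF-8 byte sequence.
# Assumption: iconv exits non-zero when the input is invalid in the source encoding.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Report, for every entry in the current directory, whether its name is valid UTF-8.
for name in *; do
    if is_utf8 "$name"; then
        echo "already UTF-8: $name"
    else
        echo "needs conversion: $name"
    fi
done
```

Note that plain ASCII names always pass this check, and a legacy-encoded name can occasionally happen to be valid UTF-8 by coincidence, so treat the result as a hint only.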

If your file names are in another encoding, replace “gbk” with the appropriate name.
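
If you are not sure which encoding an archive uses, one way to guess is to decode a sample name under several candidates and see which one reads correctly. The sketch below does this with iconv; the helper name preview_encodings and the candidate list are only examples, and the encoding names your iconv accepts may differ (check iconv -l).

```shell
#!/bin/sh
# Hypothetical helper: print how a raw name decodes under each candidate encoding.
# Usage: preview_encodings RAW_NAME ENCODING...
preview_encodings() {
    raw=$1
    shift
    for enc in "$@"; do
        # The assignment's exit status is that of the pipeline, i.e. of iconv.
        if decoded=$(printf '%s' "$raw" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%s: %s\n' "$enc" "$decoded"
        else
            printf '%s: (not valid in this encoding)\n' "$enc"
        fi
    done
}

# Sample input: the GBK bytes of "中文", written as octal escapes for portability.
preview_encodings "$(printf '\326\320\316\304')" GBK BIG5 SHIFT_JIS EUC-KR
```

Only the line that shows readable text (here the GBK one) tells you which name to pass to convmv -f. Several encodings may decode without error and still produce nonsense.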

This work by Aron Xu is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported.