Wednesday, 28 August 2013

Convert multiple unicode in a string to character

Problem: I have a string such as 'Buna$002C_TexasBuna$002C_Texas', where each $
is followed by a four-digit hexadecimal Unicode code point. I want to replace
each of these codes with the character it represents.
In Perl, a code point written as "\x{002C}" inside a double-quoted string
literal is converted to the corresponding character. Below is some sample code.
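#!/usr/bin/perl
use strict;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');  # avoid "Wide character" warnings
my $string = "Hello \x{263A}!\n";
my @arr = split //, $string;          # one element per character
print "@arr";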
I am processing a file that contains 10 million records, so these strings
arrive in a scalar variable rather than as literals. To get the same effect
as above, I substitute each $XXXX (four digits) with \x{XXXX} as below.
$str = 'Buna$002C_TexasBuna$002C_Texas';
$str =~s/\$(.{4})/\\x\{$1\}/g;
$str = "$str";
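This gives me
Buna\x{002C}_TexasBuna\x{002C}_Texas
This is because the line $str = "$str" interpolates the variable $str but does
not re-scan its value. Escape sequences such as \x{002C} are only processed in
string literals in the source code, so Perl leaves the text stored in $str as is.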
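The difference can be seen in a minimal sketch like the one below. The variable names ($literal, $built, $copy) are only illustrative and are not part of the original script.

```perl
use strict;
use warnings;

my $literal = "\x{002C}";   # escape handled when the literal is compiled: one character ","
my $built   = '\x{002C}';   # single quotes: eight literal characters, as produced by the s/// above
my $copy    = "$built";     # interpolates the variable only; its contents are not re-scanned

printf "%d %d\n", length($literal), length($copy);   # prints "1 8"
```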
Is there any way to force Perl to interpolate the contents of $str as well?
OR
Is there any other method to achieve this? I do not want to extract each
code, pack it using pack "U4",0x002C, and then substitute it back. Something
in one line (like the unsuccessful attempt below) would be fine.
$str =~s/\$(.{4})/pack("U4",$1)/g;
I know the above is wrong, but can I do something like it?
For the input string $str = 'Buna$002C_TexasBuna$002C_Texas', the desired output is
Buna,_TexasBuna,_Texas.
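One possible one-line approach is sketched below. It assumes the four characters after each $ are always hexadecimal digits, and the helper name decode_dollar_escapes is hypothetical. The /e modifier makes Perl evaluate the replacement as code, so hex() can turn the captured text into a number and chr() can turn that number into a character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace each "$" followed by four hex digits with the character at that
# code point. decode_dollar_escapes is a hypothetical helper name.
sub decode_dollar_escapes {
    my ($text) = @_;
    # /e evaluates the replacement as Perl code: hex("002C") is 44,
    # and chr(44) is ",". /g repeats this for every match.
    $text =~ s/\$([0-9A-Fa-f]{4})/chr(hex($1))/ge;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');   # needed once code points above 255 appear
my $str = 'Buna$002C_TexasBuna$002C_Texas';
print decode_dollar_escapes($str), "\n";   # prints Buna,_TexasBuna,_Texas
```

The pack attempt above fails for two reasons: without /e the replacement is treated as plain text, and "U4" expects four numbers while $1 is the string "002C". Writing pack("U", hex($1)) together with /e should behave the same as chr(hex($1)). Restricting the pattern to [0-9A-Fa-f]{4} instead of .{4} leaves a $ that is not followed by four hex digits untouched.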
