November 13, 2006
Question

recode_string: replacing diacritical marks

hi:

MacOS 10.4.8, PHP 5.1.6, MySQL 4.1.21

i have a DB with some fields (collation: latin1_general_ci, type:
VARCHAR & TEXT) containing texts in several occidental european
languages. for a search function i need to replace diacritical marks

á -> a
è -> e
ü -> u

...

and i've been trying the recode_string function but the example in
php.net isn't working (shows nothing):

<?php
echo recode_string("us..flat", "The following character has a
diacritical mark: &aacute;");
?>

any solution apart from using a lookup table?

tia,

jdoe

November 27, 2006
then, if the remote MySQL server has 'latin1' as default charset, do i
need to include a SET NAMES 'utf8' after every connection to the DB in
every page?

Michael Fesser wrote:
> .oO(John Doe)
>
>> Michael Fesser wrote:
>>> .oO(John Doe)
>>>> i have a DB with some fields (collation: latin1_general_ci, type:
>>>> VARCHAR & TEXT) containing texts in several occidental european
>>>> languages.
>>> I always use UTF-8 to store my data.
>>>
>> even if your data only uses English charset for example?
>
> Yes. Even for plain English UTF-8 can be quite useful (not only in the
> DB, but also on the website output), for example if you want to use
> typographically correct quotation marks or some other special chars that
> are not part of Latin1. Personally I simply don't want to use character
> references anymore, I want to type (and store) all chars literally.
>
> If you do it consistently all the way from the DB until the final output
> is sent to the browser, there won't be a problem. Nearly every browser
> used today can handle UTF-8, even the old Netscape 4.
>
> Of course using Latin1 is fine as well, especially if it's an existing
> project. But for new projects I would definitely start with UTF-8.
November 17, 2006
thanks a lot to Micha and David for your useful info!!!
November 13, 2006
.oO(John Doe)

>Michael Fesser wrote:
>> .oO(John Doe)
>>> i have a DB with some fields (collation: latin1_general_ci, type:
>>> VARCHAR & TEXT) containing texts in several occidental european
>>> languages.
>>
>> I always use UTF-8 to store my data.
>>
>even if your data only uses English charset for example?

Yes. Even for plain English UTF-8 can be quite useful (not only in the
DB, but also on the website output), for example if you want to use
typographically correct quotation marks or some other special chars that
are not part of Latin1. Personally I simply don't want to use character
references anymore, I want to type (and store) all chars literally.

If you do it consistently all the way from the DB until the final output
is sent to the browser, there won't be a problem. Nearly every browser
used today can handle UTF-8, even the old Netscape 4.

Of course using Latin1 is fine as well, especially if it's an existing
project. But for new projects I would definitely start with UTF-8.

>isn't there any
>storage overload?

Not much. In UTF-8 the most common chars (those in the ASCII table) still
require just a single byte. Only the more special or "foreign" chars
need two or more bytes.

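To make the overhead concrete, here is a small sketch. The helper name
utf8_byte_length() is made up for illustration, and the sample strings are
written as byte escapes so the result doesn't depend on how the script file
itself is saved. strlen() counts bytes, not characters, which is exactly
what matters for storage.

```php
<?php
// Hypothetical helper: number of bytes a UTF-8 string occupies.
// strlen() counts bytes, not characters.
function utf8_byte_length($s) {
    return strlen($s);
}

$samples = array(
    'plain a'   => "a",            // ASCII, U+0061
    'a acute'   => "\xC3\xA1",     // U+00E1
    'euro sign' => "\xE2\x82\xAC", // U+20AC
);

foreach ($samples as $label => $char) {
    echo $label, ': ', utf8_byte_length($char), " byte(s)\n";
}
```

So a name like "María" takes 6 bytes in UTF-8 where Latin1 would need 5.
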
>> I'm not familiar with the GNU Recode extension, but from having a quick
>> look at the Recode documentation I'm wondering how the example above is
>> supposed to work at all.
>>
>> In the given start charset "us" the string "&aacute;" doesn't have any
>> diacritical marks, because it's just an ampersand, followed by an a and
>> so on. It should work if the start charset would be "h" (for HTML) ...
>>
>> I think I'll install that extension on my machine and do some tests.
>> It looks interesting.
>>
>ok, let us know whatever you find!

Nothing special so far, just that the example from the PHP manual
doesn't work. ;) With "h..flat" it works as expected and described
above. Maybe I should file a bug report, but actually it's not something
critical.
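
If the Recode extension isn't available, something similar can be sketched
with core PHP functions only. This is just one possible approach, not what
Recode does internally: it converts the text to named HTML entities, keeps
the base letter of each accent entity, and decodes the rest again. The
function name strip_diacritics() is made up, and the sketch assumes the
input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of PHP or Recode: fold accented Latin
// letters to their base letters by going through named HTML entities.
function strip_diacritics($text) {
    // 1. Turn accented letters into entities such as "&aacute;" or
    //    "&uuml;". Assumes $text is valid UTF-8.
    $entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

    // 2. Keep only the base letter of each accent entity.
    $entities = preg_replace(
        '/&([A-Za-z])(?:acute|grave|circ|uml|tilde|cedil|ring|slash);/',
        '$1',
        $entities
    );

    // 3. Decode whatever entities are left (for example "&amp;").
    return html_entity_decode($entities, ENT_QUOTES, 'UTF-8');
}

// The three sample chars from the original question, as UTF-8 bytes.
echo strip_diacritics("\xC3\xA1 \xC3\xA8 \xC3\xBC"), "\n";
```

For a search function the same folding would have to be applied to both the
search term and the stored text before comparing them. Ligatures such as
"ß" or "æ" are left unchanged by this sketch.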

>> Another thing: How is your data stored in the database - does it contain
>> any HTML character references like &aacute;?
>
>precisely i was wondering if i should store "Mar&iacute;a" instead of
>"María" in the DB.

Generally speaking, you should always store raw data in the database
whenever possible. This not only applies to texts in different
languages, but also to date and time information, for example. When you
need a special formatting or encoding for your target media (be it a
website, a plain text file, a PDF or whatever), do it when you output
the data. That's usually the most flexible way.

So in this case you should store "María".
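
A minimal sketch of what "encode on output" could look like for HTML. The
helper name html_out() is made up, and the sample value is assumed to be
stored as UTF-8:

```php
<?php
// Hypothetical helper: escape a raw UTF-8 value for HTML output.
// Only the HTML special chars are converted; accented letters stay
// literal because the page is sent as UTF-8.
function html_out($raw) {
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// Raw value as it would come from the DB ("Mar", i with acute, "a & Co.").
$name = "Mar\xC3\xADa & Co.";

echo '<p>', html_out($name), "</p>\n";
```

The same raw value can then go into a plain text file or a PDF without
having to undo any HTML encoding first.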

>i've made testings with several browsers in PC/Mac
>without problems using the latest but i guess for older browsers may be
>a problem

Just make sure that the server tells the browser which encoding was used
for the document. This is done with a "Content-Type" header, which is
usually sent automatically by the server. But sometimes it's necessary
to override it. In my scripts I use something like this at the very
beginning:

header('Content-Type: text/html; charset=UTF-8');

This makes sure the browser knows how to interpret the received data.
No problems so far.

Micha
November 13, 2006
John Doe wrote:
>
> Michael Fesser wrote:
>>
>> I always use UTF-8 to store my data.
>>
> even if your data only uses English charset for example? isn't there any
> storage overload?

There shouldn't be. In UTF-8, each ASCII character uses 1 byte.
Accented Latin characters use 2 bytes.

--
David Powers
Adobe Community Expert
Author, "Foundation PHP for Dreamweaver 8" (friends of ED)
http://foundationphp.com/
November 13, 2006

Michael Fesser wrote:
> .oO(John Doe)
>> i have a DB with some fields (collation: latin1_general_ci, type:
>> VARCHAR & TEXT) containing texts in several occidental european
>> languages.
>
> I always use UTF-8 to store my data.
>
even if your data only uses English charset for example? isn't there any
storage overload?

> I'm not familiar with the GNU Recode extension, but from having a quick
> look at the Recode documentation I'm wondering how the example above is
> supposed to work at all.
>
> In the given start charset "us" the string "&aacute;" doesn't have any
> diacritical marks, because it's just an ampersand, followed by an a and
> so on. It should work if the start charset would be "h" (for HTML) ...
>
> I think I'll install that extension on my machine and do some tests.
> It looks interesting.
>
ok, let us know whatever you find!

> Another thing: How is your data stored in the database - does it contain
> any HTML character references like &aacute;?

precisely i was wondering if i should store "Mar&iacute;a" instead of
"María" in the DB. i've made testings with several browsers in PC/Mac
without problems using the latest but i guess for older browsers may be
a problem

thanks for your time

jdoe

November 13, 2006
.oO(John Doe)

>MacOS 10.4.8, PHP 5.1.6, MySQL 4.1.21
>
>i have a DB with some fields (collation: latin1_general_ci, type:
>VARCHAR & TEXT) containing texts in several occidental european
>languages.

I always use UTF-8 to store my data.

>for a search function i need to replace diacritical marks
>
>á -> a
>è -> e
>ü -> u
>
>...
>
>and i've been trying the recode_string function but the example in
>php.net isn't working (shows nothing):
>
><?php
>echo recode_string("us..flat", "The following character has a
>diacritical mark: &aacute;");
>?>

I'm not familiar with the GNU Recode extension, but from having a quick
look at the Recode documentation I'm wondering how the example above is
supposed to work at all.

In the given start charset "us" the string "&aacute;" doesn't have any
diacritical marks, because it's just an ampersand, followed by an a and
so on. It should work if the start charset would be "h" (for HTML) ...

I think I'll install that extension on my machine and do some tests.
It looks interesting.

Another thing: How is your data stored in the database - does it contain
any HTML character references like &aacute;?

Micha