[cfe-dev] Handling Unicode : code points vs. code units

Henrique Almeida hdante at gmail.com
Mon Jun 15 17:30:32 CDT 2009


 I've found a default algorithm for finding the boundaries of
user-perceived characters (called grapheme clusters in Unicode). The
most general version of the algorithm works like this: a grapheme
cluster is either "CRLF" or is composed of zero or more "Prepend"
characters, followed by either a "Hangul-syllable" or any other code
point that is not a control, followed by zero or more
"Grapheme_Extend" or "Spacing_Mark" code points. So, every "character"
in the editor should have this format. Because "Prepend" characters
attach to the code point that follows them, it's not possible to use
the increment approach directly. In the worst case, a state machine
would do the job.

 http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
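 A rough sketch of that state machine, under heavy simplifying
assumptions: the input is already a sequence of Grapheme_Cluster_Break
property classes (so no Unicode tables are needed), Hangul syllables
are left out entirely, and the names GCB, isBoundary and countClusters
are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, reduced set of Grapheme_Cluster_Break classes.
// Hangul classes are deliberately omitted from this sketch.
enum GCB { Other, CR, LF, Control, Prepend, Extend, SpacingMark };

static bool isControlLike(GCB c) {
  return c == CR || c == LF || c == Control;
}

// Returns true if a cluster boundary lies between prev and next.
bool isBoundary(GCB prev, GCB next) {
  if (prev == CR && next == LF) return false;                   // CR x LF
  if (isControlLike(prev) || isControlLike(next)) return true;  // controls stand alone
  if (next == Extend || next == SpacingMark) return false;      // x Extend, x SpacingMark
  if (prev == Prepend) return false;                            // Prepend x
  return true;                                                  // otherwise break
}

// Counts grapheme clusters by looking at each adjacent pair of classes.
std::size_t countClusters(const std::vector<GCB>& props) {
  if (props.empty()) return 0;
  std::size_t clusters = 1;
  for (std::size_t i = 1; i < props.size(); ++i)
    if (isBoundary(props[i - 1], props[i])) ++clusters;
  return clusters;
}
```

 The only state carried along is the class of the previous code point,
which is what the "Prepend" rule needs.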

2009/6/15 Henrique Almeida <hdante at gmail.com>:
>  (I don't know how the code is structured, I'm answering from a
> conceptual point of view)
>
>  You could still use an integer for keeping track of the column number
> if it's possible to change the "increment" operation. IIRC Unicode has
> enough information to decide if the next code point will allocate a
> new graphical space or not (and combining code points are stored after
> the allocating code point). The new class does exactly the same, but
> the "increment" approach would avoid creating a new class.
>
>  PS: This wouldn't handle text that changes direction in the middle of
> the line, but I think editors don't deal with that either (gnome
> editor, for example, starts counting in reverse if I write some text
> from right to left). So, both line counts would be equally wrong. :-)
>
> 2009/6/15 AlisdairM(public) <public at alisdairm.net>:
>> Hopefully the last question before I start posting some patches on this!
>>
>> I think the big problem we face correctly handling extended characters in UTF-8 (and to a lesser extent UCNs in identifiers) is that much/all of the current code assumes that code points and code units are the same.
>>
>> In Unicode terms, a character is identified by a code point: a value of up to 21 bits that denotes a single character from the Unicode character set.  A code point maps directly to one code unit in UTF-32, but may require multiple 'code units' in UTF-8 or UTF-16.
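 To make the code point / code unit distinction concrete, here is a
small sketch that computes how many code units one code point needs in
UTF-8 and in UTF-16. The function names utf8Units and utf16Units are
made up for this example; both return 0 for values that cannot be
encoded (surrogates and anything above U+10FFFF).

```cpp
#include <cstddef>

// Number of 8-bit code units UTF-8 needs for one code point,
// or 0 if the value is a surrogate or out of range.
std::size_t utf8Units(unsigned long cp) {
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  if (cp <= 0x10FFFF) return 4;
  return 0;
}

// Number of 16-bit code units UTF-16 needs for one code point,
// or 0 if the value is a surrogate or out of range.
std::size_t utf16Units(unsigned long cp) {
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
  if (cp < 0x10000) return 1;
  if (cp <= 0x10FFFF) return 2;
  return 0;
}
```

 For example, U+00E9 (e with acute accent) is one code point but two
UTF-8 code units, and U+20AC (euro sign) is three.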
>>
>> The effect appears any time a character outside the basic 7-bit ASCII range occurs in a string literal or a comment.  The column numbers for any diagnostic on that line will be wrong beyond that character.
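 A minimal sketch of the discrepancy, assuming the line is already
valid UTF-8 (no validation is done here) and using a made-up helper
name, codePointColumn. It converts a byte offset into a 0-based column
counted in code points by skipping continuation bytes.

```cpp
#include <cstddef>

// Returns the 0-based column, counted in code points, of the byte at
// `offset` in `line`.  Assumes `line` is valid UTF-8 and that `offset`
// does not exceed its length.
std::size_t codePointColumn(const char* line, std::size_t offset) {
  std::size_t column = 0;
  for (std::size_t i = 0; i < offset; ++i) {
    unsigned char b = static_cast<unsigned char>(line[i]);
    // Continuation bytes have the form 10xxxxxx and do not start
    // a new code point, so they are not counted.
    if ((b & 0xC0) != 0x80) ++column;
  }
  return column;
}
```

 For the line "\xC3\xA9 = 1;" (an e with acute accent followed by
" = 1;"), the '=' sits at byte offset 3 but at code point column 2, so
a diagnostic that reports the byte offset is off by one.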
>>
>> Fundamentally, if we want to get this 'right' we should stop talking about characters and deal exclusively with character sequences.  Once we are dealing with UTF-8 we can no longer assume a 'character' will fit into a single 'char' variable.  I am not yet sure how pervasive such a change would be, as AFAICT most functions are already passing around pointers (in)to text buffers.  The difference may be more in how we approach the code than in the source itself.
>>
>> However, one place it really does matter is reporting column numbers in diagnostics.  We need to report column numbers in terms of characters, or code positions, rather than code units as today.
>>
>> In order to clarify when we are dealing explicitly with code positions, I propose to introduce a new class to describe such offsets rather than using a simple integer variable.  Thanks to inlining this class should have performance equivalent to a regular integer apart from in a couple of use cases.  The majority of the codebase should be unaffected, continuing to work in terms of byte offsets into a buffer.  However, whenever we need to render a source-file column number we should go via this new type.  The opaque type should catch many code point vs. code unit mix-ups at compile time rather than at run time, although I don't have an exhaustive list of APIs that should be updated yet, so we must also stay alert for APIs using the wrong types.
>>
>>
>> A quick sketch of the class would look a little (or a lot!) like:
>>
>> #include <cstddef>
>>
>> struct CharacterPos {
>>   // We must declare a default constructor as there is another
>>   // user declared constructor in this class.
>>   // Choose to always initialize the position member. This means
>>   // that CharacterPos is not a POD class.  In C++0x we might
>>   // consider using = default, which might leave position
>>   // uninitialized, although 0x is more fine-grained in its
>>   // usage of PODs and trivial operations, so explicit initialization
>>   // is probably still the best choice.
>>   CharacterPos() : position() {}
>>
>>   // The remaining special members are left implicit in order to
>>   // preserve triviality.  In C++0x we would explicitly default them.
>> //   CharacterPos( CharacterPos const & rhs) = default;
>> //   ~CharacterPos() = default;
>> //   CharacterPos & operator=( CharacterPos const & rhs ) = default;
>>
>>   // Constructor iterates string from start to offset
>>   // counting UTF-8 characters, i.e. code points.
>>   // Throws an exception if str is not a valid UTF-8 encoding.
>>   CharacterPos( char const * str, std::size_t offset );
>>
>>   // Iterates str, returning a pointer to the initial code unit
>>   // of the UTF-8 character at 'position'.
>>   // Throws an exception if str is not a valid UTF-8 encoding.
>>   char const * Offset( char const * str ) const;
>>
>>   CharacterPos & operator+=( CharacterPos rhs ) {
>>      position += rhs.position;
>>      return *this;
>>   }
>>
>>   CharacterPos & operator-=( CharacterPos rhs ) {
>>      position -= rhs.position;
>>      return *this;
>>   }
>>
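>>   std::ptrdiff_t operator-( CharacterPos rhs ) const {
>>      // Convert before subtracting so that a smaller lhs gives a
>>      // negative result without relying on unsigned wraparound.
>>      return static_cast<std::ptrdiff_t>( position )
>>           - static_cast<std::ptrdiff_t>( rhs.position );
>>   }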
>>
>>   bool operator==( CharacterPos rhs ) const { return position == rhs.position; }
>>   bool operator<( CharacterPos rhs ) const { return position < rhs.position; }
>>   bool operator<=( CharacterPos rhs ) const { return position <= rhs.position; }
>>   bool operator!=( CharacterPos rhs ) const { return !(*this == rhs); }
>>   bool operator>( CharacterPos rhs ) const { return rhs < *this; }
>>   bool operator>=( CharacterPos rhs ) const { return rhs <= *this; }
>>
>>
>> private:
>>   std::size_t position;
>> };
>>
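>> // Declared inline so that these definitions can live in a header
>> // without violating the one-definition rule.
>> inline CharacterPos operator+( CharacterPos lhs, CharacterPos rhs ) {
>>   return lhs += rhs;
>> }
>>
>> inline char const * operator+( char const * lhs, CharacterPos rhs ) {
>>   return rhs.Offset( lhs );
>> }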
>>
>>
>> Note that two operations in here have linear complexity rather than constant:
>>   CharacterPos( char const * str, std::size_t offset );
>>   char const * Offset( char const * str ) const;
>>
>> These are also the important APIs that define why the class exists.
>> In all other ways it should be a reasonable arithmetic type.
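 To show where the linear cost comes from, here is a sketch of what
the body of Offset might do, written as a free function. The name
advanceCodePoints is made up, and the sketch assumes a NUL-terminated,
already-valid UTF-8 buffer (so it does not throw, unlike the proposed
member).

```cpp
#include <cstddef>

// Returns a pointer to the lead byte of the code point that is `count`
// code points past the start of `str`.  If the string holds fewer code
// points, the pointer to the terminating NUL is returned.
// Assumes `str` is NUL-terminated, valid UTF-8.
const char* advanceCodePoints(const char* str, std::size_t count) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
  while (*p != 0) {
    if ((*p & 0xC0) != 0x80) {  // lead byte: start of a code point
      if (count == 0) break;    // reached the requested code point
      --count;
    }
    ++p;                        // every byte on the way is visited once
  }
  return reinterpret_cast<const char*>(p);
}
```

 Every byte before the target has to be inspected, because the width
of each code point is only known from its lead byte.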
>>
>> I am opting for pass-by-value rather than pass-by-reference-to-const as that is typically more efficient for small data types, although obviously I have no performance measurements to back that up yet.
>>
>> Also note that these same two APIs have to deal with badly encoded UTF-8 streams, and indicate failure by throwing an exception.  I have informally picked up that LLVM/Clang prefer to avoid exceptions as error reporting mechanisms. If this is likely to be an issue I would appreciate guidance on an alternate error reporting mechanism for those same APIs - especially the failed constructor.
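 One possible exception-free shape, offered only as a sketch: a free
function that reports failure through its return value and writes the
result to an out-parameter, so a CharacterPos would only ever be built
from a count that is already known to be good. The name
countCodePoints and its signature are made up here. The check covers
only lead/continuation byte structure; a complete validator would also
reject overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cstddef>

// Counts the code points in [str, str + length).
// Returns true and stores the result in `count` on success.
// Returns false, leaving `count` untouched, if the byte structure is
// not well-formed UTF-8 (bad lead byte, stray or missing continuation).
bool countCodePoints(const char* str, std::size_t length,
                     std::size_t& count) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
  std::size_t n = 0;
  std::size_t i = 0;
  while (i < length) {
    unsigned char b = p[i];
    std::size_t extra;  // continuation bytes expected after b
    if (b < 0x80) extra = 0;
    else if (b >= 0xC2 && b <= 0xDF) extra = 1;
    else if (b >= 0xE0 && b <= 0xEF) extra = 2;
    else if (b >= 0xF0 && b <= 0xF4) extra = 3;
    else return false;                          // invalid lead byte
    if (extra > length - i - 1) return false;   // sequence is truncated
    for (std::size_t k = 1; k <= extra; ++k)
      if ((p[i + k] & 0xC0) != 0x80) return false;
    i += extra + 1;
    ++n;
  }
  count = n;
  return true;
}
```

 The caller decides what to do on failure, for example emitting a
diagnostic about the encoding and falling back to byte columns.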
>>
>> AlisdairM
>>
>
>
>
> --
>  Henrique Dante de Almeida
>  hdante at gmail.com
>



-- 
 Henrique Dante de Almeida
 hdante at gmail.com