Re: libregex library

From: Tim Robbins (tjr_at_freebsd.org)
Date: 11/22/04

    To: Sean Chittenden <sean@chittenden.org>
    Date: Mon, 22 Nov 2004 21:48:21 +1100
    
    

    On Sun, 2004-11-21 at 10:06 -0800, Sean Chittenden wrote:
    > >> Has there been any thought given to moving to the modified Henry
    > >> Spencer regex library used in NetBSD & OpenBSD's libc?
    > >
    > > des@dwp ~% head -3 /usr/src/lib/libc/regex/COPYRIGHT
    > > Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved.
    > > This software is not subject to any license of the American Telephone
    > > and Telegraph Company or of the Regents of the University of
    > > California.
    >
    > I think maybe what Ben was referring to was that Spencer has released
    > an updated version of his regexp library that doesn't penalize wide
    > character locales. I believe our current one performs terribly on
    > everything but one byte character sets, whereas the newer Spencer
    > library performs as well as one could hope with wide characters. The
    > PostgreSQL group did some testing and found Spencer's library to be the
    > fastest wide character regexp engine while still maintaining very good
    > levels of performance for single byte character sets. You'll have to
    > check the PostgreSQL archives for details: it's been two years since
    > that change was committed to their tree. -sc

    I think you'd be surprised at how poorly Henry Spencer's new code
    performs in all but the most contrived test cases, regardless of locale.

    You'll find that it performs especially poorly in multibyte locales
    because the matcher itself does not work directly with multibyte
    characters. Instead, the strings must first be entirely converted to
    wide characters, which means reading every single input byte, calling
    mbrtowc() on it, then storing the results in temporary scratch space,
    even if the characters don't participate in the match at all (e.g. all
    characters but the first when matching against patterns like "^x"). The
    FreeBSD 5 regex code only performs the conversion when necessary, and
    can often reject impossible matches without performing a single
    conversion in single-byte and UTF-8 locales.
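
    As a rough illustration of the difference described above, the sketch
    below matches the pattern "^x" against a multibyte string in two ways:
    one converts the whole string with mbrtowc() first, the other converts
    only the first character. The function names and the `conversions`
    counter are hypothetical and exist only for this example; neither
    function is taken from Henry Spencer's code or from the FreeBSD 5
    regex code, and real matchers are considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical instrumentation: counts mbrtowc() calls made so far. */
size_t conversions;

/*
 * Eager approach: convert the entire multibyte string to wide characters
 * in scratch space, then test the first one. Returns 1 on match, 0 on no
 * match, -1 on allocation failure or an invalid/incomplete sequence.
 */
int match_anchor_eager(const char *s, wchar_t want)
{
    assert(s != NULL);
    size_t len = strlen(s);
    wchar_t *buf = malloc((len + 1) * sizeof *buf);
    if (buf == NULL)
        return -1;

    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t n = 0;
    const char *p = s;
    while (*p != '\0') {
        size_t used = mbrtowc(&buf[n], p, len - (size_t)(p - s), &st);
        conversions++;
        if (used == (size_t)-1 || used == (size_t)-2 || used == 0) {
            free(buf);
            return -1;
        }
        p += used;
        n++;
    }
    int hit = (n > 0 && buf[0] == want);
    free(buf);
    return hit;
}

/*
 * Lazy approach: "^x" can only match at the start, so convert just the
 * first character and never look at the rest of the string.
 */
int match_anchor_lazy(const char *s, wchar_t want)
{
    assert(s != NULL);
    /* Look at no more than MB_CUR_MAX bytes, stopping at the terminator. */
    size_t avail = 0;
    while (avail < MB_CUR_MAX && s[avail] != '\0')
        avail++;
    if (avail == 0)
        return 0;

    mbstate_t st;
    memset(&st, 0, sizeof st);
    wchar_t first;
    size_t used = mbrtowc(&first, s, avail, &st);
    conversions++;
    if (used == (size_t)-1 || used == (size_t)-2 || used == 0)
        return -1;
    return first == want;
}
```

    Calling both functions on the same input and comparing `conversions`
    shows the cost gap: the eager version makes one mbrtowc() call per
    character and allocates scratch space proportional to the input, while
    the lazy version makes a single call. This sketch still performs one
    conversion; in an ASCII-compatible encoding such as UTF-8, a matcher
    could plausibly go further and compare the first byte directly.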

    (This assumes your input strings are given as multibyte character
    strings, as is common in UNIX, not wide character strings, as may be
    common in PostgreSQL.)

    Tim

    _______________________________________________
    freebsd-arch@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-arch
    To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

