Fast and efficient CString replacement
Typical applications contain lots of string operations, and MFC includes the CString class for precisely that purpose. Unfortunately, it suffers from major problems. The most important are:
- CStrings cannot be extended - their header file is buried within MFC
- CStrings are slow. Catenating a simple value requires copying the string into a new buffer.
- CStrings internally call malloc/free so often that memory becomes very fragmented, and your application incurs a major performance hit.
- Reference counting (the ability to quickly assign one CString to another without copying the characters) was first implemented in the MFC library accompanying Visual C++ 5. Besides, it's not that efficient.
This article describes a class named CStr, which is similar to CString in many respects and can be used interchangeably with it in most cases. However, it improves on CString in the following areas:
- The definition and implementation are open - you can easily edit its header file to include much-needed facilities.
- The class is compatible both with MFC and with simple Win32-based applications.
- The class includes a much better method for reference counting. It also supports a buffer larger than the number of characters in the string, so catenation (and assignment of longer strings) becomes a super-fast process.
- The class caches data blocks of commonly used sizes (typically 4, 8, 12, etc., up to 320; this is configurable). When your program destroys a string object, the data is not returned to the memory manager, but is kept in a cache pool. The next time CStr needs a block of that size (and that happens very often), it gets the block very quickly. Memory fragmentation is also severely reduced in that way.
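The block cache described in the last point can be pictured as a set of free lists, one per block size. The sketch below only illustrates that idea and is not the code from CStrMgr.cpp: the names BlockCache, Acquire, Release and ReleaseAll are hypothetical, the 4-byte step and the 320-byte limit are taken from the figures above, and all locking is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of a per-size block cache; not the actual CStr code.
class BlockCache {
public:
    static const std::size_t kStep = 4;        // block sizes are multiples of 4
    static const std::size_t kMaxCached = 320; // larger blocks bypass the cache

    BlockCache() : lists_(kMaxCached / kStep) {}
    ~BlockCache() { ReleaseAll(); }

    // Rounds a request up to the next multiple of kStep.
    static std::size_t RoundUp(std::size_t bytes) {
        return (bytes + kStep - 1) / kStep * kStep;
    }

    // Returns a block of at least `bytes` bytes, reusing a cached one if possible.
    void* Acquire(std::size_t bytes) {
        assert(bytes > 0);
        std::size_t size = RoundUp(bytes);
        if (size <= kMaxCached) {
            std::vector<void*>& list = lists_[size / kStep - 1];
            if (!list.empty()) {
                void* p = list.back();
                list.pop_back();
                return p;
            }
        }
        return std::malloc(size);
    }

    // Gives a block back; small blocks are parked in the cache instead of freed.
    void Release(void* p, std::size_t bytes) {
        assert(bytes > 0);
        std::size_t size = RoundUp(bytes);
        if (size <= kMaxCached) lists_[size / kStep - 1].push_back(p);
        else std::free(p);
    }

    // Returns every cached block to the memory manager.
    void ReleaseAll() {
        for (std::size_t i = 0; i < lists_.size(); ++i) {
            for (std::size_t j = 0; j < lists_[i].size(); ++j) std::free(lists_[i][j]);
            lists_[i].clear();
        }
    }

    // Number of blocks currently parked in the cache.
    std::size_t CachedCount() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < lists_.size(); ++i) n += lists_[i].size();
        return n;
    }

private:
    BlockCache(const BlockCache&);            // copying omitted in this sketch
    BlockCache& operator=(const BlockCache&);
    std::vector<std::vector<void*> > lists_;  // one free list per block size
};
```

Because a released block goes back onto the list for its size, the next request of the same size is served without touching malloc at all, which is where both the speed gain and the reduced fragmentation would come from.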
CStr features
CStr supports most of the features of CString. The following snippet from CStr.h shows some of the more important features and friend functions:
class CStr
{
// Construction, copying, assignment
public:
CStr();
CStr(const CStr& source);
CStr(const char* s, CPOS prealloc = 0);
CStr(CPOS prealloc);
void operator=(const CStr& source);
void operator=(const char* s);
~CStr();
CStr(const CString& source, CPOS prealloc = 0);
void operator=(const CString& source);
// Get attributes, get data, compare
BOOL IsEmpty() const;
CPOS GetLength() const;
operator const char* () const;
const char* GetString() const; // Same as above
char GetFirstChar() const;
char GetLastChar() const;
char operator[](CPOS idx) const;
char GetAt(CPOS idx) const; // Same as above
void GetLeft (CPOS chars, CStr& result);
void GetRight (CPOS chars, CStr& result);
void GetMiddle (CPOS start, CPOS chars, CStr& result);
int Find (char ch, CPOS startat = 0) const;
int ReverseFind (char ch, CPOS startat = (CPOS) -1) const;
int Compare (const char* match) const; // -1, 0 or 1
int CompareNoCase (const char* match) const; // -1, 0 or 1
// Operators == and != are also predefined
// Global modifications
void Empty(); // Sets length to 0, but keeps buffer around
void Reset(); // This also releases the buffer
void GrowTo(CPOS size);
void Compact(CPOS only_above = 0);
static void CompactFree();
void Format(const char* fmt, ...);
void FormatRes(UINT resid, ...);
BOOL LoadString(UINT resid);
// Catenation, truncation
void operator += (const CStr& obj);
void operator += (const char* s);
void AddString(const CStr& obj); // Same as +=
void AddString(const char* s); // Same as +=
void AddChar(char ch);
void AddChars(const char* s, CPOS startat, CPOS howmany);
void AddStringAtLeft(const CStr& obj);
void AddStringAtLeft(const char* s);
void AddInt(int value);
void AddDouble(double value, UINT after_dot);
void RemoveLeft(CPOS count);
void RemoveMiddle(CPOS start, CPOS count);
void RemoveRight(CPOS count);
void TruncateAt(CPOS idx);
friend CStr operator+(const CStr& s1, const CStr& s2);
friend CStr operator+(const CStr& s, const char* lpsz);
friend CStr operator+(const char* lpsz, const CStr& s);
// Window operations and other utilities
void GetWindowText (CWnd* wnd);
// Miscellaneous implementation methods
protected:
// These may be reimplemented by the user
static void ThrowIfNull(void* p);
static void ThrowPgmError();
static void ThrowNoUnicode();
static void ThrowBadIndex();
#ifndef CSTR_LARGE_STRINGS
static void ThrowTooLarge();
#endif
};
BOOL operator ==(const CStr& s1, const CStr& s2);
BOOL operator ==(const CStr& s1, LPCTSTR s2);
BOOL operator ==(LPCTSTR s1, const CStr& s2);
BOOL operator !=(const CStr& s1, const CStr& s2);
BOOL operator !=(const CStr& s1, LPCTSTR s2);
BOOL operator !=(LPCTSTR s1, const CStr& s2);
Tech note: the CPOS type and length limitations
Normally, CStr supports strings with up to 65500 characters. This increases the speed a bit, and saves 4 bytes per string. In some cases you might need to work with very large strings. To do this, define the symbol CSTR_LARGE_STRINGS before including CStr.h in your project.
CPOS is a custom type identifying either the character length of a string, or a character position. It is defined as either a 16-bit WORD (with normal strings) or a 32-bit UINT (when supporting strings with up to 2^32 characters).
If you work at compiler warning level 3, you will be able to freely mix UINT with CPOS. If you work at level 4, you may need to typecast, or use the CPOS type to prevent warnings.
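As a rough illustration of the two configurations, the conditional definition might look like the sketch below. The exact declarations in CStr.h may differ. Here std::uint16_t and std::uint32_t stand in for the Win32 WORD and UINT types so that the fragment compiles without windows.h, and ToCPOS is a hypothetical helper showing the kind of explicit cast that warning level 4 asks for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Assumed stand-ins for the Win32 types WORD (16 bits) and UINT (32 bits).
#ifdef CSTR_LARGE_STRINGS
typedef std::uint32_t CPOS;   // lengths and positions need 32 bits
#else
typedef std::uint16_t CPOS;   // lengths and positions fit in 16 bits
#endif

// Hypothetical helper: narrows a wider value to CPOS with an explicit cast,
// which is what avoids "possible loss of data" warnings at level 4.
inline CPOS ToCPOS(std::size_t value) {
    // The value must fit in whichever type was selected above.
    assert(value <= static_cast<std::size_t>(std::numeric_limits<CPOS>::max()));
    return static_cast<CPOS>(value);
}
```

Defining CSTR_LARGE_STRINGS before this fragment switches CPOS to the 32-bit variant without touching any code that uses the type.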
Tech note: using CStr in place of LPCSTR (const char*)
Like CString, the class described here also has a predefined operator typecast to LPCSTR. This is why you can use CStr where LPCSTR is expected. You can even use CStr in functions declared as requiring CString, but this is not very efficient, since the compiler will generate a temporary CString instance for you.
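The mechanism behind this is a user-defined conversion operator. The sketch below uses a hypothetical MiniStr class (a stand-in, not CStr itself) and a hypothetical CountChars function to show how operator const char* lets an object be passed directly where const char* is expected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a string class with an implicit C-string conversion.
class MiniStr {
public:
    explicit MiniStr(const char* s) : text_(s) {}
    operator const char*() const { return text_.c_str(); }   // implicit conversion
    const char* GetString() const { return text_.c_str(); }  // same result, spelled out
private:
    std::string text_;
};

// Any function taking const char* accepts a MiniStr without a cast.
inline std::size_t CountChars(const char* s) {
    assert(s != 0);
    return std::strlen(s);
}
```

One caveat applies to any class built this way: the compiler does not apply the conversion for arguments passed through the "..." of a variadic function such as printf, so an explicit cast or a call like GetString() is needed there.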
Tech note: managing buffer length
One of the most important advantages of CStr is that it allows you to specify the length of the buffer that will hold the string. If you anticipate a string will soon grow to 80 bytes, you can request a buffer of that size, even if its initial content is only 7 bytes long. This saves a huge amount of reallocation and copy operations if you add to the string later.
To specify a larger buffer when constructing the string, use the following definitions:
CStr(); // No preallocation
CStr(const char* s, CPOS prealloc = 0); // Buffer chars as second param
CStr(CPOS prealloc); // Buffer chars as only param
To increase the buffer size for an existing string, call
void GrowTo(CPOS size); // If the buffer is smaller, increases it
Attempting to grow the buffer to a size no larger than what is currently allocated is harmless; the buffer is only ever increased by this call.
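The effect of preallocation can be seen in a small sketch. GrowBuf below is a hypothetical, stripped-down buffer class written for this illustration (it is not CStr, and it counts its own reallocations purely so the difference can be observed); only the prealloc parameter and the GrowTo name mirror the declarations above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of a string buffer with explicit capacity; not CStr itself.
class GrowBuf {
public:
    explicit GrowBuf(const char* s, std::size_t prealloc = 0)
        : data_(0), len_(0), cap_(0), reallocs_(0) {
        std::size_t n = std::strlen(s);
        Reserve(n > prealloc ? n : prealloc);
        std::memcpy(data_, s, n + 1);
        len_ = n;
        reallocs_ = 0;   // count only reallocations that happen after construction
    }
    ~GrowBuf() { std::free(data_); }

    // Enlarges the buffer if it is smaller than `size`; otherwise does nothing.
    void GrowTo(std::size_t size) { if (size > cap_) Reserve(size); }

    // Appends text; reallocates only when the result no longer fits.
    void Append(const char* s) {
        std::size_t n = std::strlen(s);
        GrowTo(len_ + n);
        std::memcpy(data_ + len_, s, n + 1);
        len_ += n;
    }

    const char* c_str() const { return data_; }
    std::size_t Capacity() const { return cap_; }
    int Reallocations() const { return reallocs_; }

private:
    GrowBuf(const GrowBuf&);             // copying omitted in this sketch
    GrowBuf& operator=(const GrowBuf&);

    void Reserve(std::size_t chars) {
        char* p = static_cast<char*>(std::realloc(data_, chars + 1)); // +1 for '\0'
        assert(p != 0);
        data_ = p;
        cap_ = chars;
        ++reallocs_;
    }

    char* data_;
    std::size_t len_, cap_;
    int reallocs_;
};
```

A GrowBuf constructed with a 7-character string and prealloc of 80 absorbs later appends without reallocating, while the same string constructed without preallocation has to reallocate on the very first append.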
At some point (particularly if you store many strings in memory) you may decide that a given string won't be changed, and that its originally allocated buffer may be too large and waste memory. On this occasion you may call
void Compact(CPOS only_above = 0);
Passing only_above=4, for example, means "reallocate and copy to smaller buffer only if 4 or more bytes would be saved".
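The decision rule can be sketched as follows. This is an assumption about how such a threshold would typically be applied, not the code from CStrMgr.cpp; Buf, MakeBuf, Compact and FreeBuf are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical plain buffer: `cap` counts characters, excluding the final '\0'.
struct Buf {
    char* data;
    std::size_t len;
    std::size_t cap;
};

// Creates a buffer holding `s` with room for at least `cap` characters.
inline Buf MakeBuf(const char* s, std::size_t cap) {
    Buf b;
    b.len = std::strlen(s);
    b.cap = cap > b.len ? cap : b.len;
    b.data = static_cast<char*>(std::malloc(b.cap + 1));
    assert(b.data != 0);
    std::memcpy(b.data, s, b.len + 1);
    return b;
}

// Shrinks the buffer to fit the text, but only if that saves at least
// `only_above` bytes (and at least one byte). Returns true if it shrank.
inline bool Compact(Buf& b, std::size_t only_above = 0) {
    std::size_t saved = b.cap - b.len;
    if (saved == 0 || saved < only_above) return false;
    char* p = static_cast<char*>(std::realloc(b.data, b.len + 1));
    if (p == 0) return false;   // on failure, keep the old, larger buffer
    b.data = p;
    b.cap = b.len;
    return true;
}

inline void FreeBuf(Buf& b) {
    std::free(b.data);
    b.data = 0;
    b.len = b.cap = 0;
}
```

With 3 spare bytes, a call with only_above set to 4 leaves the buffer alone, while a call with 3 shrinks it; the threshold simply keeps tiny savings from triggering a reallocation and copy.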
It is important to know that the buffers for freed strings are not really deallocated. Thus, at certain points in your program (for example, after a large memory-consuming operation) you may wish to invoke a manual "garbage collection" that will return all pooled memory to the memory manager. To do this, call
CStr::CompactFree() // static method
You should always call this method in the ExitInstance() method of your CWinApp class; otherwise, MFC will complain about memory leaks.
Tech note: using Format and FormatRes
CStr::Format and FormatRes are sprintf-like functions. The only difference is that the first takes a pointer to a const character string describing the format parameters (like sprintf does), while the second loads the string from the resource table.
Format specifiers can be looked up in your C++ RTL documentation under "printf".
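For readers curious how a sprintf-like member can build a string of the right size, here is a minimal sketch. It relies on C++11 std::vsnprintf to measure the output first, which is an assumption of this example; the article does not show how CStr::Format sizes its buffer, and FormatString is a hypothetical free function returning std::string rather than a CStr method.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: formats printf-style arguments into a std::string.
inline std::string FormatString(const char* fmt, ...) {
    assert(fmt != 0);
    va_list args;
    va_start(args, fmt);

    // First pass: ask vsnprintf how many characters the result needs.
    va_list copy;
    va_copy(copy, args);
    int needed = std::vsnprintf(static_cast<char*>(0), 0, fmt, copy);
    va_end(copy);

    std::string result;
    if (needed > 0) {
        // Second pass: write into a buffer of exactly the right size.
        std::vector<char> buf(static_cast<std::size_t>(needed) + 1);
        std::vsnprintf(&buf[0], buf.size(), fmt, args);
        result.assign(&buf[0], static_cast<std::size_t>(needed));
    }
    va_end(args);
    return result;
}
```

A call such as FormatString("%d items at %.2f", 3, 1.5) yields "3 items at 1.50", using the same specifiers that printf accepts.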
Tech note: using Empty() or Reset()
Note that when you assign characters to a CStr object, its buffer may be increased if necessary, but it is never decreased. This holds even if you call Empty(): the string is left with zero length, but the allocated buffer stays intact for further use.
If you know that you will not assign to this empty string for a long time, it is better to call Reset() instead of Empty(). This will not only set the length to 0 characters, but also deallocate (or rather, return to the cache pool) the string buffer. This is especially important if you reset a large string (say, 512 bytes or more).
Note the presence of the CSTR_DITEMS constant in CStr.h. This constant identifies the maximum string size for which the "cached buffers" mechanism is in effect. Strings larger than this size are always passed to malloc/free. Thus, if you load a 500-kilobyte text file into a CStr, you need not worry that the memory will not be released when you destroy the object.
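The difference between the two calls is easiest to see on a toy buffer. ResettableBuf below is a hypothetical class written for this illustration; its Reset simply frees the memory, whereas CStr (as described above) would return small buffers to its cache pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical buffer class contrasting Empty() with Reset(); not CStr itself.
class ResettableBuf {
public:
    ResettableBuf() : data_(0), len_(0), cap_(0) {}
    ~ResettableBuf() { std::free(data_); }

    // Copies `s` in; the buffer grows when needed but never shrinks here.
    void Assign(const char* s) {
        std::size_t n = std::strlen(s);
        if (n > cap_) {
            char* p = static_cast<char*>(std::realloc(data_, n + 1));
            assert(p != 0);
            data_ = p;
            cap_ = n;
        }
        std::memcpy(data_, s, n + 1);
        len_ = n;
    }

    // Length becomes 0; the buffer stays allocated for later reuse.
    void Empty() {
        len_ = 0;
        if (data_) data_[0] = '\0';
    }

    // Length becomes 0 and the buffer is given up (this sketch just frees it).
    void Reset() {
        std::free(data_);
        data_ = 0;
        len_ = 0;
        cap_ = 0;
    }

    std::size_t Length() const { return len_; }
    std::size_t Capacity() const { return cap_; }

private:
    ResettableBuf(const ResettableBuf&);            // copying omitted in this sketch
    ResettableBuf& operator=(const ResettableBuf&);
    char* data_;
    std::size_t len_, cap_;
};
```

After Empty() the capacity is unchanged, so the next assignment of a similar size costs no allocation; after Reset() the capacity is zero and the next assignment has to obtain a buffer again.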
Tech note: using the string support in single-threaded applications
The supplied class is designed to be completely safe in a multithreaded application, and uses some critical sections to achieve this. If you have a single-threaded app, or you are sure to use CStr from only one thread (and I really mean sure!) you can define the symbol CSTR_NOT_MT_SAFE.
This will omit any references to critical sections, and may speed up your string operations by between 5% (if you manage the string data itself) and 30% (if you do a lot of string assignments and reassignments).
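A compile-time switch of this kind is usually implemented with a lock type that turns into an empty object when the symbol is defined. The sketch below shows the pattern under stated assumptions: it uses std::mutex as a portable stand-in for the Win32 critical sections the class actually uses, and PoolLock, PoolMutex, SharedRefs, AddRef and Release are hypothetical names, not taken from CStr.

```cpp
#include <cassert>
#ifndef CSTR_NOT_MT_SAFE
#include <mutex>
#endif

#ifdef CSTR_NOT_MT_SAFE
// Single-threaded build: the lock is an empty object and costs nothing.
struct PoolLock {
    PoolLock() {}
};
#else
// Multithreaded build: the lock holds a process-wide mutex for its lifetime.
inline std::mutex& PoolMutex() {
    static std::mutex m;
    return m;
}
struct PoolLock {
    PoolLock() : guard_(PoolMutex()) {}
private:
    std::lock_guard<std::mutex> guard_;
};
#endif

// A shared counter standing in for any data shared between string objects.
inline int& SharedRefs() {
    static int refs = 0;
    return refs;
}

// Both operations take the lock first; with CSTR_NOT_MT_SAFE it compiles away.
inline int AddRef() {
    PoolLock lock;
    return ++SharedRefs();
}

inline int Release() {
    PoolLock lock;
    assert(SharedRefs() > 0);
    return --SharedRefs();
}
```

The calling code is identical in both builds, which is why the choice can be made with a single #define before the header is included.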
How to use CStr
Remember that CStr is NOT compatible with UNICODE yet (if enough interest gathers, I will make a UNICODE version). When including the class in an MFC project, I suggest that you put the following references:
In stdafx.h: #include "CStr.h" // This, in turn, includes cstrimp.h
In stdafx.cpp: #include "CStrMgr.cpp" // Put this include after everything else
Thus, the string support headers will be precompiled, and you won't need to include them everywhere.
Note that CStrMgr.cpp (the implementation file) is designed to be included in another CPP, not added to the project. If you don't like this, just insert a #include "stdafx.h" at its beginning, and add it to the project file.
There are some conditional symbols you may wish to define right before including CStr.h:
- #define CSTR_LARGE_STRINGS: normally, strings can hold up to 65500 characters. Define this conditional to increase the range to 2^32 characters. This incurs a 4 byte penalty per object, and probably some small speed hit.
- #define CSTR_OWN_ERRORS: you will probably want to define this symbol in larger applications. If you do, you will have to implement a couple of methods and functions that handle critical situations, such as out-of-memory conditions and program errors (e.g. an out-of-bounds character reference).
- If you do NOT use MFC, you will have to provide a body for the get_stringres() function. It should just return an instance handle so that CStr knows where to load string resources from. The sample application shows an implementation of this function.
- #define CSTR_NOT_MT_SAFE: If you have a single-threaded application, or are completely sure that only one of your threads will use the string support subsystem, this will improve speed significantly. Be very careful - defining this and using CStr from multiple threads might cause very hard-to-detect errors in your application!
Have fun using CStr! Any comments are welcome. Also, I will be glad to add features provided that they seem to be useful to at least 3-4 people, and they do not take too much time. Write to me at kamen@kami.com
Download demo executable - 33 KB
Download source - 11 KB Updated October 17, 1998
Comments
How can I change it to be suitable for pure C++?
Posted by Legacy on 01/12/2003 12:00am
Originally posted by: fanhua
I need to work in a workstation environment such as SUN. I have worked hard to change it, but it does not work on the workstation (pure C++ environment). What should I do?
Thank YOU guys!
Bug in code and article
Posted by Legacy on 10/02/2002 12:00am
Originally posted by: CallMeJoe
It was the negative comments about CString that caught my eye:
"CStrings cannot be extended - their header file is buried within MFC"
Buried? It's a class; you can't "bury" the header file! I first subclassed CString with 1.52c. I had to make a minor change for 32-bit MFC and for the new CString classes, but other than that it's worked just fine.
"CStrings are slow. Catenating a simple value requires copying the string into a new buffer."
Again, this isn't true. CStrings are actually quite fast and your own benchmarks bear that out.
"CStrings internally call malloc/free so often that memory becomes very fragmented, and your application incurs a major performance hit."
Again, completely false. Have you even looked at the source for CString? Again, your own benchmarks contradict this remark.
"Reference counting (the ability to quickly assign one CStr to another without copying the characters) was first implemented in the MFC library accompanying Visual C++ 5. Besides, it's not that efficient."
And? It's actually quite efficient, but has been discarded in MFC 7.0 due to a fundamental, albeit rare, problem with most reference counting classes that has no real solution.
I would have enjoyed benchmarking your class, but it's no longer freely available so there is no way to test your vaunted results. (Making me sign up even if for free does not make this freely available.)
Note that I did compile the version available here and it was chock full of bugs. When I did get it to run, it was sometimes faster and sometimes slower than CString. The assignment operation was 40% slower, and concatenation degraded rapidly: 100 concatenations totalling about 3k were 300% slower.
By comparison, when CStr was faster than CString (for those tests that worked), CString was always within 5% of the speed of CStr.
What do you mean by "fast" ?
Posted by Legacy on 04/15/2002 12:00am
Originally posted by: ET Tan
I am looking for a really fast string class, as my program operates on large strings, doing search and replace of substrings.
I downloaded and looked at your source code, but I don't see how it is fast.
Converting a CString to a double
Posted by Legacy on 10/07/2001 12:00am
Originally posted by: giacomo moro
I would like to know how to convert a CString to a double. I have written the following simple example: in a dialog I have put three edit boxes, and I would like the third to be the sum of the values put in the first and the second (for instance, if I put 5.8 in the first and 2.4 in the second, 8.2 should appear in the third). After reading your article I know what to do if the values are integers, but I don't know what to do if the values are float or double. I would like to put values of type CString in the edit boxes; if the variables associated with the edit boxes are float or double, a zero appears in the edit box, and I don't want the zero to appear in the edit boxes when I execute the program.
Could you help me?
My best regards, Giacomo Moro
Extended Find function..
Posted by Legacy on 12/12/2000 12:00am
Originally posted by: JongGurl Moon
Bug to += operator
Posted by Legacy on 11/29/2000 12:00am
Originally posted by: Andrei Bozântan
Just try this
#include "CStr.h"
int main()
{
    CStr s("1");
    for (int i = 0; i < 20; i++)
        s += s; // appends the string to itself on every pass
    return 0;
}
Improvement
Posted by Legacy on 11/08/2000 12:00am
Originally posted by: Bard
Doesn't work with ATL!!!
Posted by Legacy on 11/04/2000 12:00am
Originally posted by: Iliya
I'm writing a shell extension using the ATL library without MFC. Your code doesn't compile: the linker reports "unresolved external symbol "void * __cdecl get_stringres(void)"". If I try to put #include "CStrMgr.cpp" in stdafx.cpp, I get many different "unresolved external" errors.
Total efficiency can not be Generic.
Posted by Legacy on 07/30/1999 12:00am
Originally posted by: Brad Hochgesang
I see no problem with reinventing the wheel when it gains you a performance advantage. But, any good programmer realizes that a generally fast routine or class that is made for general applications will not be super-optimized for all applications.
CString has its place. If you need to do non-intensive routines that require simple string manipulation, by all means use CString. You would have to be crazy to re-write a string class for a simple application.
On the other hand, I once needed a string class to search a database containing fields of undetermined lengths that would be inserted in alphabetical order. Had I used a List or Vector class, which I very well could have, and in fact should have had I not needed extreme speed, it would have taken way too long. Instead I wrote a specialized linked list class that could be searched with a binary search (well, a slightly modified binary search) and came up with a tremendously fast program. In that case, reinventing the wheel was useful.
There are a lot of people out there who complain about MFC and Microsoft code being too slow. Well, I'll bet your code may be slow for my applications as well. Your code is optimized (I hope) for your programs. Microsoft's code is optimized for general use, where usability and functionality tend to be thought of before lightning speed.
Re-write a string class if you want to, I say. Better, though, to optimize it for your purpose. General performance gains seem to be getting smaller and smaller these days.
It is a very useful object
Posted by Legacy on 06/28/1999 12:00am
Originally posted by: Fan Xia
I think the usefulness of this project lies not in improving the CString class but in inventing a new string class which is similar to it. That way I can get away from the huge MFC library. It also makes it possible to make my code more portable to other platforms (e.g. Linux in the future). If someone can re-invent the MFC classes and make the source code available to everyone, I will really appreciate him/her. Keep up the great work, Kamen.