文章網址: Optimize string use: a case study
Hackmd好讀版: [閱讀筆記]Optimize string use: a case study
最近真是挫折,修了C越覺得自己其實不懂C,修了OSDI的課發現越來越不懂OS…不過越是不懂就越要努力把知識補起來!
這篇是一篇介紹如何最佳化string coding(C/C++)的文章的閱讀心得&重點摘要,以及自己的一些深入研究。
該篇文章很長,我也還沒看完,不過這裡記錄到了原作者的第一個最佳化case study的重點和心得。以下有著大量的文字敘述以及程式碼,請小心服用,不過看完應該能學到不少。
Why Strings Are a Problem
Furthermore, the behavior of std::string has been changed over the years to keep up with changes in the C++ standard. This means that a conforming std::string implementation from a C++98 compiler may not behave the same way as a std::string implementation after C++11.
Strings have some behaviors that make them expensive to use, no matter the implementation. They are dynamically allocated, they behave as values in expressions, and their implementation requires a lot of copying.
Strings Are Dynamically Allocated
std:string是動態分配memory的,使用到了大量的複製,動態配置的成本極高,在C裡面動態配置的寫法類似下方這樣:1
2
3
4char* p = (char*) malloc(7);
strcpy(p, "string");
...
free(p);
不過,string的內部mem space(internal character buffer)仍然是固定的,當操作string造成空間超出原本配置的大小時,會重新分配一段空間,再將內容複製過去。
std::string implementations do a trick to amortize the cost of reallocating storage for the character buffer as the string grows. Instead of making a request to the memory manager for the exact number of characters needed, the string implementation rounds up the request to some larger number of characters. For instance, some implementations round up the request to the next power of 2. The string then has the capacity to grow to twice its current size before needing to call the into the memory manager again. The next time an operation needs to extend the length of the string, there is room in the existing buffer, avoiding the need to allocate a new buffer. The benefit of this trick is that the cost of appending characters to a string approaches a constant asymptotically as the string grows longer. The cost of this trick is that strings carry around some unused space. If the string implements a policy of rounding up requests to a power of 2, up to half the storage in a string may be unused.
- 一種技巧是,每次都為請求的空間配置搭約多兩倍的記憶體(round up the request to the next power of 2),從而降低了重新配置的次數
- 不過缺點就是可能會有一些空間沒使用到
Strings Are Values
strings是values,如同下方integer是values的範例:
1 | int i,j; |
string也有相同的特性
1 | std::string s1, s2; |
Because strings are values, the results of string expressions are also values. If you concatenate strings, as in the statement s1 = s2 + s3 + s4;, the result of s2 + s3 is a newly allocated temporary string value. The result of concatenating s4 to this temporary string is another temporary string value. This value replaces the previous value of s1. Then the dynamically allocated storage for the first temporary string and the previous value of s1 are freed. This adds up to a lot of calls into the memory manager.
- 所以在對string做操作的時候,並不是更改原本memory的value,而是先去產生一個新的string儲存結果,然後在把原本的memory釋放掉。因此string的相關操作會大量的呼叫到memory manager
Strings Do a Lot of Copying
Copy on Write: 讓string都先指向相同的空間,透過counter來統計複製的數量,有需要修改時在另外開一個空間配置
There is a well-known programming idiom for things that behave as values but are expensive to copy. It is called “copy on write,” and often abbreviated COW in C++ literature (see not available). In a COW string, the dynamically allocated storage can be shared between strings. A reference count lets each string know if it is using shared storage. When one string is assigned to another, only a pointer is copied, and the reference count is incremented. Any operation that changes a string’s value first checks to see that there is only one pointer to that storage. If multiple strings point to the storage, any mutating operation (any operation that may change the contents of the string) allocates new storage and makes a copy of the string before making its change
1 | COWstring s1, s2; |
- 儘管這樣在assignment和argument-passing時cost很低,但non-const references和any call to a mutating function的cost就很大
- 在平行化時,Cow的cost也很大,需要確保同時間不會有人更改共同的空間
In C++11 and later, the burden of copying is somewhat reduced by the presence of rvalue references and the move semantics (see not available) that go with them. If a function takes an rvalue reference as argument, the string can do an inexpensive pointer copy when the actual argument is an rvalue expression, saving one copy.
Case Study: First Attempt at Optimizing Strings
考慮以下code:
std::string remove_ctrl(std::string s) { |
- 糟糕透了,這僅僅只是一段可以執行的危險代碼,因為有著對大量記憶體的更動,每次更改string都會重新產生一個memory去更改,然後再把原本的刪除
- 假設string長度100,則會create storage和release storage各100次
- 此外,根據assignment的實作不同,可能會有更多的memory call
- If strings are implemented using the copy-on-write idiom, then the assignment operator performs an efficient pointer copy and increments the reference count.
- If strings have a non–shared buffer implementation, then the assignment operator must copy the contents of the temporary string. If the implementation is naïve, or result‘s buffer does not have enough capacity, then the assignment operator also allocates a new buffer to copy into. This results in 100 copy operations and as many as 100 additional allocations.
- If the compiler implements C++11-style rvalue references and move semantics, then the fact that the concatenation expression’s result is an rvalue allows the compiler to call result‘s move constructor instead of its copy constructor. The result is that the program performs an efficient pointer copy.
在作者的設備上,each call took 24.8 microseconds,數字沒有意義,僅作為後續優化的比較參考值,接下來作者會介紹如何一步步的優化
Use Mutating String Operations to Eliminate Temporaries
std::string remove_ctrl_mutating(std::string s) { |
This improvement comes from eliminating all the calls to allocate temporary string objects to hold the concatenation expression result, and the associated copying and deleting of temporaries. Depending on the string implementation, allocation and copying on assignment are also eliminated.
- 修改第五行的concatenation expression,省去了暫時儲存計算結果的 temporary object allocate, 存回去的copying和釋放的deleting。
現在each call是1.72 microseconds per call,進步了13倍
從source code理解上述兩種寫法的差異
看不懂上面這兩個差在哪?來看source code
- 從libstdc++可以看到
operator+=
其實就是呼叫了string::append()
926 | ... |
- 來看看
string:append()
都在做啥
269 | void string::Append(const value_type *str, size_type n) |
原來就是先用reserve()
string的capacity變大,然後memcpy過去,這樣就不用產生一個占存的memory space先記錄改變的結果,然後在assign了(這就是operator+
在做的事情)
- 來看看
operator+
在幹啥
1 | // operator+ |
注意到他先創造了一個新的variable basic_string
,然後初始值放其中一個string,才把另一個變數進行append
Reduce Reallocation by Reserving Storage
上面提到,string空間不足時的re-allocate也是成本,透過reserve()是確認好大小以減少re-allocate的次數。
881 | std::string remove_ctrl_reserve(std::string s) { |
Not only does use of reserve() eliminate reallocation of the string buffer, but it also improves the cache locality of the data accessed by the function
- 除了降低重新配置的cost,也增加了cache locality,使得data能夠更快速的存取
A test using remove_ctrl_reserve() consumes 1.47 microseconds per call, an improvement of 17% over remove_ctrl_mutating().
Eliminate Copying of String Arguments
When a string expression is passed into a function by value, the formal argument (in this case, s) is copy-constructed.
在function傳遞參數時,參數會被copy,根據string的implementation,copy的動作有多種可能的情況:
- If strings are implemented using the copy-on-write idiom, then the compiler generates a call to the copy constructor, which performs an efficient pointer copy and increments the reference count.
- If strings have a nonshared buffer implementation, then the copy constructor must allocate a new buffer and copy the contents of the actual argument.
- If the compiler implemented C++11-style rvalue references and move semantics, then if the actual argument is an expression, it will be an rvalue, so the compiler will generate a call to the move constructor, resulting in an efficient pointer copy. If the actual argument is a variable, then the formal argument’s copy constructor is called, resulting in an allocation-and-copy. Rvalue references and move semantics are described in more detail in not available.
1 | std::string remove_ctrl_ref_args(std::string const& s) { |
- 由於s不會被修改,所以不用另外copy一份,透過const reference來節省空間
The result is a surprise. The timing test of remove_ctrl_ref_args() took 1.60 microseconds per call, 8% worse than remove_ctrl_reserve(). …Reference variables are implemented as pointers. So, everywhere s appears in remove_ctrl_ref_args(), the program dereferences a pointer that it did not have to dereference in remove_ctrl_reserve(). I hypothesize that this extra work might be enough to account for the reduced performance.
- 但在這一步的結果卻沒有優化到code(在Visual Studio上),作者推測是VS對於copy有額外做一些事情,還有因為reference是透過dereference pointer實作的,在dereference上可能有一些額外的工作增加了時間
Eliminate Pointer Dereference Using Iterators
String iterators are simple pointers into the character buffer. That saves two dereference operations versus the non-iterator code in the loop.
- string iterators直接指向string的char buffer,所以不用在額外做兩次的dereference(loop一次合併一次)
1 | std::string remove_ctrl_ref_args_it(std::string const& s) { |
The timing test for remove_ctrl_ref_args_it() produced a satisfying result of 1.04 microseconds per call.
remove_ctrl_ref_args_it() contains one other optimization of note. The value s.end(), used to control the for loop, is cached on loop initialization. This saves another 2n indirections, where n is the length of the argument string.
- 另一種版本的優化是,把
s.end()
位址先記錄下來,就不用每次loop都重找一次
為什麼是2n?
來看看string::end()
的source code,可以看到每次呼叫end()
,都會取得原始的pointer + size(),這個別需要一次operation(可以在追下去看__get_pointer
和size()
各別做了什麼事情),迴圈執行了n次,所以2n可能是這樣來的
910 | const_iterator end() const _NOEXCEPT |
Eliminate Copying of Returned String Values
為了回傳結果,function內必須要create一個變數,最後在將該變數copy給等待接收的變數中。有些compiler會幫忙做優化(though the compiler is permitted to elide (that is, simplify by removing) the copy construction if it can),不過如果要確保沒有copy,有兩種方法,一種是所有C++版本都適用的,也就是call function的時候同時reference一個保存結果的string進來(這也是編譯器優化時在做的事情)
1 | void remove_ctrl_ref_result_it( |
Measured performance of remove_ctrl_ref_result_it() is 1.02 microseconds per call, about 2% faster than the previous version.
不過這種用法要注意,比方說下列的情況就會得到不正確的結果(return empty string)
1 | std::string foo("this is a string"); |
Use Character Arrays Instead of Strings
要效能?用C++幹嘛,用C阿
To use the C-style string functions, the programmer must choose either to manually allocate and free character buffers, or to use static arrays dimensioned to worst-case sizes
- 不過C需要手動配置和釋放記憶體,以及以worst-case決定static array size
Declaring a bunch of static arrays is problematic if memory is at all constrained. However, there is usually room to statically declare large temporary buffers in local storage (that is, on the function call stack). These buffers are reclaimed at negligible runtime cost when the function exits. Except in the most constrained embedded environments, it is no problem to declare a worst-case buffer of 1,000 or even 10,000 characters on the stack.
- 在記憶體有限的狀況下,宣告一堆static array是不好的,不過在function內宣告的memory由於function結束時會釋放出來,除非是真的記憶體非常受限的狀況下(embedded system),不然宣告個1000~10000的 char空間還是沒問題的
1 | void remove_ctrl_cstrings(char* destp, char const* srcp, size_t size) { |
remove_ctrl_cstrings() took 0.15 microseconds per call in the timing test. This is 6 times faster than its predecessor, and an astonishing 170 times faster than the original function. One reason for the improvement is the elimination of several function calls, with a corresponding improvement in cache locality.
- 比上一版快了6倍,比最原始版本快了170倍,我的老天鵝阿!
- 減少了大量的function call和增加了cache locality(char* 是連續的memory space)
Stop and Think
儘管過程中一步步的優化了程式碼,但這些效能可能是犧牲了簡單性和安全性換來的,比方說
- remove_ctrl_ref_result_it()具有潛在的不正確使用方式
- remove_ctrl_cstrings()需要手動管理char*的記憶體空間,設計上較複雜
C++ offers developers a range of choices between simple and safe code that is slow and radically fast code that must be used carefully. Advocates of other languages may call this a weakness, but for optimization, it is one of the greatest strengths of C++.
In the end, the team must answer the question, “How much do we need this performance improvement?”
Summary of First Optimization Attempt
作者也比較了debug mode 和 release mode的差異,能看出好的coding對於程式效能真的影響很大
你知道花為什麼會笑嗎?
: 因為它有梗。
…好啦,雖然不知道有多少人能看到最後這裡,不過恭喜你看完原文的第一部份了!
第二部分之後閱讀完在和大家分享~