关于几种字符串查找算法的对比分析


昨日在新生命团队钉钉群中,看到大石头老师分享了他们实现的 IndexOf 算法,据说可以做到 O(1),第一反应是几乎不可能,了解之后得知是使用了 Boyer Moore 字符串搜索算法,据说这种算法常用于 IDE 工具的查找,比 KMP 更快,所以有了对各种常用字符串查找算法做一下 Benchmark 的想法。

横向对比的各个字符串查找算法

string.IndexOf

此方法为 FLC 中实现的 string 类型的方法,以下源码均已开源在 GitHub

1
2
3
4
5
6
7
8
9
// Determines the position within this string of the first occurence of the specified
// string, according to the specified search criteria. The search begins at
// the first character of this string, it is case-sensitive and ordinal (code-point)
// comparison is used.
//
[Pure]
public int IndexOf(String value) {
return IndexOf(value, StringComparison.CurrentCulture);
}

可以看到,默认使用了 StringComprison.CurrentCulture,关于其具体的定义,可以参照官方文档

通常情况我们使用 CurrentCulture 或是 Ordinal 都能很好的满足需求。


继续向下看(源码为节选,省略了无关内容)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
[Pure]
public int IndexOf(String value, StringComparison comparisonType) {
return IndexOf(value, 0, this.Length, comparisonType);
}
[Pure]
[System.Security.SecuritySafeCritical]
public int IndexOf(String value, int startIndex, int count, StringComparison comparisonType) {
// Validate inputs
if (value == null)
throw new ArgumentNullException("value");

if (startIndex < 0 || startIndex > this.Length)
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_Index"));

if (count < 0 || startIndex > this.Length - count)
throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_Count"));
Contract.EndContractBlock();

switch (comparisonType) {
case StringComparison.CurrentCulture:
return CultureInfo.CurrentCulture.CompareInfo.IndexOf(this, value, startIndex, count, CompareOptions.None);

case StringComparison.CurrentCultureIgnoreCase:
return CultureInfo.CurrentCulture.CompareInfo.IndexOf(this, value, startIndex, count, CompareOptions.IgnoreCase);

case StringComparison.InvariantCulture:
return CultureInfo.InvariantCulture.CompareInfo.IndexOf(this, value, startIndex, count, CompareOptions.None);

case StringComparison.InvariantCultureIgnoreCase:
return CultureInfo.InvariantCulture.CompareInfo.IndexOf(this, value, startIndex, count, CompareOptions.IgnoreCase);

case StringComparison.Ordinal:
return CultureInfo.InvariantCulture.CompareInfo.IndexOf(this, value, startIndex, count, CompareOptions.Ordinal);

case StringComparison.OrdinalIgnoreCase:
if (value.IsAscii() && this.IsAscii())
return CultureInfo.InvariantCulture.CompareInfo.IndexOf(this, value, startIndex, count, CompareOptions.IgnoreCase);
else
return TextInfo.IndexOfStringOrdinalIgnoreCase(this, value, startIndex, count);

default:
throw new ArgumentException(Environment.GetResourceString("NotSupported_StringComparison"), "comparisonType");
}
}

下面继续跟着源码来到 CompareInfo

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
public unsafe virtual int IndexOf(String source, char value, CompareOptions options)
{
if (source==null)
throw new ArgumentNullException("source");
Contract.EndContractBlock();

return IndexOf(source, value, 0, source.Length, options);
}

[System.Security.SecuritySafeCritical] // auto-generated
[ResourceExposure(ResourceScope.None)]
[ResourceConsumption(ResourceScope.Process, ResourceScope.Process)]
public unsafe virtual int IndexOf(String source, char value, int startIndex, int count, CompareOptions options)
{
// Validate inputs
if (source == null)
throw new ArgumentNullException("source");

if (startIndex < 0 || startIndex > source.Length)
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_Index"));

if (count < 0 || startIndex > source.Length - count)
throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_Count"));
Contract.EndContractBlock();

if (options == CompareOptions.OrdinalIgnoreCase)
{
//
return source.IndexOf(value.ToString(), startIndex, count, StringComparison.OrdinalIgnoreCase);
}

// Validate CompareOptions
// Ordinal can't be selected with other flags
if ((options & ValidIndexMaskOffFlags) != 0 && (options != CompareOptions.Ordinal))
throw new ArgumentException(Environment.GetResourceString("Argument_InvalidFlag"), "options");

// to let the sorting DLL do the call optimization in case of Ascii strings, we check if the strings are in Ascii and then send the flag RESERVED_FIND_ASCII_STRING to
// the sorting DLL API SortFindString so sorting DLL don't have to check if the string is Ascii with every call to SortFindString.
return InternalFindNLSStringEx(
m_dataHandle, m_handleOrigin, m_sortName,
GetNativeCompareFlags(options) | Win32Native.FIND_FROMSTART | ((source.IsAscii() && (value <= '\x007f')) ? RESERVED_FIND_ASCII_STRING : 0),
source, count, startIndex, new String(value, 1), 1);
}

// InternalFindNLSStringEx parameters is not exactly matching kernel32::FindNLSStringEx parameters.
// Call through to NewApis::FindNLSStringEx so we can get the right behavior
[System.Security.SecurityCritical] // auto-generated
[ResourceExposure(ResourceScope.None)]
[DllImport(JitHelpers.QCall, CharSet = CharSet.Unicode)]
[SuppressUnmanagedCodeSecurity]
private static extern int InternalFindNLSStringEx(IntPtr handle, IntPtr handleOrigin, String localeName, int flags, String source, int sourceCount, int startIndex, string target, int targetCount);

可以看到,最终是调用了 InternalFindNLSStringEx

不同的选项被转换成了不同的 flag

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
[Pure]
internal static int GetNativeCompareFlags(CompareOptions options)
{
// some NLS VM functions can handle COMPARE_OPTIONS_ORDINAL
// in which case options should be simply cast to int instead of using this function
// Does not look like the best approach to me but for now I am going to leave it as it is
//
Contract.Assert(options != CompareOptions.OrdinalIgnoreCase, "[CompareInfo.GetNativeCompareFlags]CompareOptions.OrdinalIgnoreCase should be handled separately");

// Use "linguistic casing" by default (load the culture's casing exception tables)
int nativeCompareFlags = NORM_LINGUISTIC_CASING;

if ((options & CompareOptions.IgnoreCase) != 0) { nativeCompareFlags |= NORM_IGNORECASE; }
if ((options & CompareOptions.IgnoreKanaType) != 0) { nativeCompareFlags |= NORM_IGNOREKANATYPE; }
if ((options & CompareOptions.IgnoreNonSpace) != 0) { nativeCompareFlags |= NORM_IGNORENONSPACE; }
if ((options & CompareOptions.IgnoreSymbols) != 0) { nativeCompareFlags |= NORM_IGNORESYMBOLS; }
if ((options & CompareOptions.IgnoreWidth) != 0) { nativeCompareFlags |= NORM_IGNOREWIDTH; }
if ((options & CompareOptions.StringSort) != 0) { nativeCompareFlags |= SORT_STRINGSORT; }

// Suffix & Prefix shouldn't use this, make sure to turn off the NORM_LINGUISTIC_CASING flag
if (options == CompareOptions.Ordinal) { nativeCompareFlags = COMPARE_OPTIONS_ORDINAL; }

Contract.Assert(((options & ~(CompareOptions.IgnoreCase |
CompareOptions.IgnoreKanaType |
CompareOptions.IgnoreNonSpace |
CompareOptions.IgnoreSymbols |
CompareOptions.IgnoreWidth |
CompareOptions.StringSort)) == 0) ||
(options == CompareOptions.Ordinal), "[CompareInfo.GetNativeCompareFlags]Expected all flags to be handled");

Contract.Assert((nativeCompareFlags & RESERVED_FIND_ASCII_STRING) == 0, "[CompareInfo.GetNativeCompareFlags] RESERVED_FIND_ASCII_STRING shouldn't be set here");

return nativeCompareFlags;
}

可以看到继续前进涉及到了 CLR 层的 comnlsinfo

1
2
3
4
5
6
7
8
9
10
11
12
//
// Search for the character in the string.
//
Buffer1 = pString1->GetBuffer();
Buffer2 = pString2->GetBuffer();

if (dwFlags == COMPARE_OPTIONS_ORDINAL)
{
iRetVal = FastIndexOfString(Buffer1, StartIndex, EndIndex, Buffer2, StringLength2);
goto lExit;
}

其中 COMPARE_OPTIONS_ORDINAL 涉及到的是 FastIndexOfString

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40

INT32 COMNlsInfo::FastIndexOfString(__in WCHAR *source, INT32 startIndex, INT32 endIndex, __in_ecount(patternLength) WCHAR *pattern, INT32 patternLength)
{
CONTRACTL {
NOTHROW;
GC_NOTRIGGER;
MODE_COOPERATIVE;
SO_TOLERANT;
PRECONDITION(CheckPointer(source));
PRECONDITION(CheckPointer(pattern));
PRECONDITION(startIndex >= 0);
PRECONDITION(endIndex >= 0);
PRECONDITION(patternLength>= 0);
} CONTRACTL_END

int endPattern = endIndex - patternLength + 1;

if (endPattern<0) {
return -1;
}

if (patternLength <= 0) {
return startIndex;
}

WCHAR patternChar0 = pattern[0];
for (int ctrSrc = startIndex; ctrSrc<=endPattern; ctrSrc++) {
if (source[ctrSrc] != patternChar0)
continue;
int ctrPat;
for (ctrPat = 1; (ctrPat < patternLength) && (source[ctrSrc + ctrPat] == pattern[ctrPat]); ctrPat++) {
;
}
if (ctrPat == patternLength) {
return (ctrSrc);
}
}

return (-1);
}

其余的则是 IndexOfString

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
FCIMPL7(INT32, COMNlsInfo::IndexOfString,
INT_PTR pNativeCompareInfo,
INT32 LCID,
StringObject* pString1UNSAFE, // String to search in
StringObject* pString2UNSAFE, // String we're looking for
INT32 StartIndex, // Index to start at in search string
INT32 Count, // # of chars to search
INT32 dwFlags)
{
CONTRACTL
{
THROWS;
DISABLED(GC_TRIGGERS);
MODE_ANY;
SO_TOLERANT;
PRECONDITION(CheckPointer((LPVOID)pNativeCompareInfo));
PRECONDITION(CheckPointer(pString1UNSAFE));
PRECONDITION(CheckPointer(pString2UNSAFE));
} CONTRACTL_END;

INT32 iRetVal = -1;
STRINGREF pString1 = (STRINGREF) pString1UNSAFE;
STRINGREF pString2 = (STRINGREF) pString2UNSAFE;

HELPER_METHOD_FRAME_BEGIN_RET_2(pString1, pString2);

WCHAR *Buffer1 = NULL;
WCHAR *Buffer2 = NULL;

//
// Get the arguments.
//
int StringLength1;
int StringLength2;
StringLength1 = pString1->GetStringLength();
StringLength2 = pString2->GetStringLength();

//
// Get the arguments.
//
_ASSERTE(StartIndex >= 0 && StartIndex <= StringLength1);

int EndIndex = StartIndex - 1 + Count;

_ASSERTE(Count >= 0 && EndIndex < StringLength1);

//
// Check the ranges.
//
if (StringLength1 == 0)
{
if (StringLength2 == 0)
iRetVal = 0;
// else iRetVal = -1 (not found)
goto lExit;
}

//
// See if we have an empty string 2.
//
if (StringLength2 == 0)
{
iRetVal = StartIndex;
goto lExit;
}

//
// Search for the character in the string.
//
Buffer1 = pString1->GetBuffer();
Buffer2 = pString2->GetBuffer();

if (dwFlags == COMPARE_OPTIONS_ORDINAL)
{
iRetVal = FastIndexOfString(Buffer1, StartIndex, EndIndex, Buffer2, StringLength2);
goto lExit;
}

//For dwFlags, 0 is the default, 1 is ignore case, we can handle both.
if (dwFlags<=1 && IS_FAST_COMPARE_LOCALE(LCID))
{
//If we've never before looked at whether this string has high chars, do so now.
if (IS_STRING_STATE_UNDETERMINED(pString1->GetHighCharState()))
{
COMString::InternalCheckHighChars(pString1);
}

//If we've never before looked at whether this string has high chars, do so now.
if (IS_STRING_STATE_UNDETERMINED(pString2->GetHighCharState()))
{
COMString::InternalCheckHighChars(pString2);
}

//If neither string has high chars, we can use a much faster comparison algorithm.
if (IS_FAST_INDEX(pString1->GetHighCharState()) && IS_FAST_INDEX(pString2->GetHighCharState()))
{
if (dwFlags==0)
iRetVal = FastIndexOfString(Buffer1, StartIndex, EndIndex, Buffer2, StringLength2);
else
iRetVal = FastIndexOfStringInsensitive(Buffer1, StartIndex, EndIndex, Buffer2, StringLength2);
goto lExit;
}
}

iRetVal = ((NativeCompareInfo*)(pNativeCompareInfo))->IndexOfString(
Buffer1, Buffer2, StartIndex, EndIndex, StringLength2, dwFlags, FALSE);

// We checked flags in managed code, so this shouldn't happen.
_ASSERTE(iRetVal != INDEXOF_INVALID_FLAGS);
// if (iRetVal == INDEXOF_INVALID_FLAGS) {
// COMPlusThrowArgumentException(L"flags", L"Argument_InvalidFlag");
// }

lExit: ;
HELPER_METHOD_FRAME_END();
return iRetVal;

}
FCIMPLEND

FCIMPL5(INT32, COMNlsInfo::IndexOfStringOrdinalIgnoreCase,
INT_PTR ptr,
StringObject* pString1UNSAFE, // String to search in
StringObject* pString2UNSAFE, // String we're looking for
INT32 StartIndex, // Index to start at in search string
INT32 Count) // # of chars to search
{
CONTRACTL
{
THROWS;
DISABLED(GC_TRIGGERS);
MODE_ANY;
SO_TOLERANT;
PRECONDITION(ptr != NULL);
PRECONDITION(CheckPointer(pString1UNSAFE));
PRECONDITION(CheckPointer(pString2UNSAFE));
} CONTRACTL_END;

INT32 iRetVal = -1;
STRINGREF pString1 = (STRINGREF) pString1UNSAFE;
STRINGREF pString2 = (STRINGREF) pString2UNSAFE;

HELPER_METHOD_FRAME_BEGIN_RET_2(pString1, pString2);

WCHAR *Buffer1 = NULL;
WCHAR *Buffer2 = NULL;

//
// Get the arguments.
//
int StringLength1;
int StringLength2;
StringLength1 = pString1->GetStringLength();
StringLength2 = pString2->GetStringLength();

//
// Get the arguments.
//
_ASSERTE(StartIndex >= 0 && StartIndex <= StringLength1);

int EndIndex = StartIndex - 1 + Count;

_ASSERTE(Count >= 0 && EndIndex < StringLength1);

//
// Check the ranges.
//
if (StringLength1 == 0)
{
if (StringLength2 == 0)
iRetVal = 0;
// else iRetVal = -1 (not found)
goto lExit;
}

//
// See if we have an empty string 2.
//
if (StringLength2 == 0)
{
iRetVal = StartIndex;
goto lExit;
}

//
// Search for the character in the string.
//
Buffer1 = pString1->GetBuffer();
Buffer2 = pString2->GetBuffer();

//If we've never before looked at whether this string has high chars, do so now.
if (IS_STRING_STATE_UNDETERMINED(pString1->GetHighCharState()))
{
COMString::InternalCheckHighChars(pString1);
}

//If we've never before looked at whether this string has high chars, do so now.
if (IS_STRING_STATE_UNDETERMINED(pString2->GetHighCharState()))
{
COMString::InternalCheckHighChars(pString2);
}

//If neither string has high chars, we can use a much faster comparison algorithm.
if (IS_FAST_INDEX(pString1->GetHighCharState()) && IS_FAST_INDEX(pString2->GetHighCharState())) {
iRetVal = FastIndexOfStringInsensitive(Buffer1, StartIndex, EndIndex, Buffer2, StringLength2);
}
else {
NativeTextInfo* pNativeTextInfo = (NativeTextInfo*)ptr;
iRetVal = pNativeTextInfo->IndexOfStringOrdinalIgnoreCase(Buffer1, StartIndex, EndIndex, Buffer2, StringLength2);
}

lExit: ;
HELPER_METHOD_FRAME_END();
return iRetVal;

}

Span<Char>.IndexOf

System.Span<T> 是在 .NET 中发挥关键作用的新值类型。使用它,可以表示任意内存的相邻区域,无论相应内存是与托管对象相关联,还是通过互操作由本机代码提供,亦或是位于堆栈上。除了具有上述用途外,它仍能确保安全访问和高性能特性,就像数组一样。
官方在 System.Memory.dll 中提供了 MemoryExtensions.IndexOf 方法,同样可以用于字符串的查找。

简单看一下源码
MemoryExtensions.Globalization.cs

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
/// <summary>
/// Reports the zero-based index of the first occurrence of the specified <paramref name="value"/> in the current <paramref name="span"/>.
/// <param name="span">The source span.</param>
/// <param name="value">The value to seek within the source span.</param>
/// <param name="comparisonType">One of the enumeration values that determines how the <paramref name="span"/> and <paramref name="value"/> are compared.</param>
/// </summary>
public static int IndexOf(this ReadOnlySpan<char> span, ReadOnlySpan<char> value, StringComparison comparisonType)
{
string.CheckStringComparison(comparisonType);

if (comparisonType == StringComparison.Ordinal)
{
return SpanHelpers.IndexOf(
ref MemoryMarshal.GetReference(span),
span.Length,
ref MemoryMarshal.GetReference(value),
value.Length);
}

switch (comparisonType)
{
case StringComparison.CurrentCulture:
case StringComparison.CurrentCultureIgnoreCase:
return CultureInfo.CurrentCulture.CompareInfo.IndexOf(span, value, string.GetCaseCompareOfComparisonCulture(comparisonType));

case StringComparison.InvariantCulture:
case StringComparison.InvariantCultureIgnoreCase:
return CompareInfo.Invariant.IndexOf(span, value, string.GetCaseCompareOfComparisonCulture(comparisonType));

default:
Debug.Assert(comparisonType == StringComparison.OrdinalIgnoreCase);
return CompareInfo.IndexOfOrdinalIgnoreCase(span, value, fromBeginning: true);
}
}

可以看到,只有 StringComparison.Ordinal 下,调用了 SpanHelpers.IndexOf 方法,其余参数下依旧采用了和 string.IndexOf 一致的 CompareInfo 处理。
SpanHelpers.IndexOf

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
public static int IndexOf(ref byte searchSpace, int searchSpaceLength, ref byte value, int valueLength)
{
Debug.Assert(searchSpaceLength >= 0);
Debug.Assert(valueLength >= 0);

if (valueLength == 0)
return 0; // A zero-length sequence is always treated as "found" at the start of the search space.

byte valueHead = value;
ref byte valueTail = ref Unsafe.Add(ref value, 1);
int valueTailLength = valueLength - 1;
int remainingSearchSpaceLength = searchSpaceLength - valueTailLength;

int offset = 0;
while (remainingSearchSpaceLength > 0)
{
// Do a quick search for the first element of "value".
int relativeIndex = IndexOf(ref Unsafe.Add(ref searchSpace, offset), valueHead, remainingSearchSpaceLength);
if (relativeIndex == -1)
break;

remainingSearchSpaceLength -= relativeIndex;
offset += relativeIndex;

if (remainingSearchSpaceLength <= 0)
break; // The unsearched portion is now shorter than the sequence we're looking for. So it can't be there.

// Found the first element of "value". See if the tail matches.
if (SequenceEqual(ref Unsafe.Add(ref searchSpace, offset + 1), ref valueTail, (nuint)valueTailLength)) // The (nunit)-cast is necessary to pick the correct overload
return offset; // The tail matched. Return a successful find.

remainingSearchSpaceLength--;
offset++;
}
return -1;
}

Boyer Moore

截图来自维基百科词条:博耶-穆尔字符串搜索算法

这里直接上 C# 版本源码实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
public static int BoyerMooreIndexOf(this string source, string pattern, int offset = 0, int count = -1) => BoyerMooreIndexOf(Encoding.UTF8.GetBytes(source), Encoding.UTF8.GetBytes(pattern), offset, count);

public static int BoyerMooreIndexOf(this byte[] source, byte[] pattern, int offset = 0, int count = -1)
{
if (source == null) throw new ArgumentNullException(nameof(source));
if (pattern == null) throw new ArgumentNullException(nameof(pattern));

var total = source.Length;
var length = pattern.Length;

if (count > 0 && total > offset + count) total = offset + count;
if (total == 0 || length == 0 || length > total) return -1;

// 初始化坏字符,即不匹配字符
var bads = new int[256];
for (var i = 0; i < 256; i++)
bads[i] = length;


var last = length - 1;
for (var i = 0; i < last; i++)
bads[pattern[i]] = last - i;


var index = offset;
while (index <= total - length)
{
// 尾部开始比较
for (var i = last; source[index + i] == pattern[i]; i--)
if (i == 0) return index;


// 坏字符规则:后移位数 = 坏字符的位置 - 搜索词中的上一次出现位置
index += bads[source[index + last]];
}

return -1;
}

KMP

截图来自维基百科词条:KMP算法

同样直接上源码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
 public static int KMPSearch(this string text, string pattern)
{
var n = text.Length;
var m = pattern.Length;

if (n < m) return -1;
if (n == m && text == pattern) return 0;
if (m == 0) return 0;

var lpsArray = new int[m];

LongestPrefixSuffix(pattern, ref lpsArray);

int i = 0, j = 0;
while (i < n)
{
if (text[i] == pattern[j])
{
i++;
j++;
}
if (j == m)
return i - j;

if (i >= n || text[i] == pattern[j]) continue;
if (j != 0)
j = lpsArray[j - 1];
else
i++;
}

return -1;
}

public static void LongestPrefixSuffix(string pattern, ref int[] lpsArray)
{
var m = pattern.Length;
var len = 0;
lpsArray[0] = 0;
var i = 1;

while (i < m)
{
if (pattern[i] == pattern[len])
lpsArray[i++] = ++len;
else if (len == 0)
lpsArray[i++] = 0;
else
len = lpsArray[len - 1];
}
}

算法性能横向测试

本次横向对比了 IndexOf IndexOfOrdinal SpanIndexOf SpanIndexOfOrdinal BoyerMooreIndexOf KMPIndexOf,在查找样本长度 100, 10000, 1000000, 100000000 四种数据规模下,被查找字符串分别处于样本 开头、中间、结尾、随机位置 四种情况下的性能。

测试完整源码

Program.cs

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;
using System.Text;

namespace BenchmarkSample
{
public static class Program
{
[RPlotExporter]
public class IndexOfTest
{
private string _haystack;
private string _needle = "needle";

[Params(100, 10000, 1000000, 100000000)]
public int HaystackSize { get; set; }

[Params(SubstringLocationEnum.Start, SubstringLocationEnum.Middle, SubstringLocationEnum.End,
SubstringLocationEnum.Random)]
public SubstringLocationEnum Location { get; set; }

public IndexOfTest()
{
_haystack = GetRandomString(HaystackSize, _needle, Location); ;
}

[Benchmark]
public int IndexOf() => _haystack.IndexOf(_needle);
[Benchmark]
public int IndexOfOrdinal() => _haystack.IndexOf(_needle, StringComparison.Ordinal);
[Benchmark]
public int SpanIndexOf() => _haystack.AsSpan().IndexOf(_needle);
[Benchmark]
public int SpanIndexOfOrdinal() => _haystack.AsSpan().IndexOf(_needle, StringComparison.Ordinal);
[Benchmark]
public int BoyerMooreIndexOf() => _haystack.BoyerMooreIndexOf(_needle);
[Benchmark]
public int KMPIndexOf() => _haystack.KMPSearch(_needle);

}

public enum SubstringLocationEnum
{
Start = 0, Middle, End, Random
}

/// <summary>
/// 生成随机字符串并加入自定义串
/// </summary>
/// <param name="length">目标字符串的长度</param>
/// <param name="custom">要插入的自定义串</param>
/// <param name="location">插入的位置 0:开头 1:中间 2:尾部 3:随机</param>
/// <returns></returns>
public static string GetRandomString(int length, string custom, SubstringLocationEnum location = SubstringLocationEnum.Middle)
=> GetRandomString(length, true, true, true, true, null, custom, (int)location);

/// <summary>
/// 生成随机字符串
/// </summary>
/// <param name="length">目标字符串的长度</param>
/// <param name="useNum">是否包含数字,1=包含,默认为包含</param>
/// <param name="useLow">是否包含小写字母,1=包含,默认为包含</param>
/// <param name="useUpp">是否包含大写字母,1=包含,默认为包含</param>
/// <param name="useSpe">是否包含特殊字符,1=包含,默认为不包含</param>
/// <param name="customChars">要包含的自定义字符,直接输入要包含的字符列表</param>
/// <param name="custom">要插入的自定义串</param>
/// <param name="location">插入的位置 0:开头 1:中间 2:尾部 3:随机</param>
/// <returns>指定长度的随机字符串</returns>
public static string GetRandomString(int length, bool useNum = true, bool useLow = true, bool useUpp = true, bool useSpe = true, string customChars = null, string custom = null, int location = 0)
{
var b = new byte[4];
new System.Security.Cryptography.RNGCryptoServiceProvider().GetBytes(b);
Random r = new(BitConverter.ToInt32(b, 0));
var charList = new List<char>();
var sb = new StringBuilder();
custom ??= string.Empty;
charList.AddRange(custom);
if (useNum) charList.AddRange("0123456789");
if (useLow) charList.AddRange("abcdefghijklmnopqrstuvwxyz");
if (useUpp) charList.AddRange("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
if (useSpe) charList.AddRange("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~");
var charArray = charList.ToArray();

for (var i = 0; i < length; i++)
{
sb.Append(charArray[r.Next(0, charArray.Length - 1)]);
}

var insertIndex = location switch
{
0 => 0,
1 => length / 2,
2 => length - 1,
3 => r.Next(0, length),
_ => -1
};

return sb.Insert(insertIndex, custom).ToString();
}

public static int BoyerMooreIndexOf(this string source, string pattern, int offset = 0, int count = -1) => BoyerMooreIndexOf(Encoding.UTF8.GetBytes(source), Encoding.UTF8.GetBytes(pattern), offset, count);
/// <summary>Boyer Moore 字符串搜索算法,比KMP更快,常用于IDE工具的查找</summary>
/// <param name="source"></param>
/// <param name="pattern"></param>
/// <param name="offset"></param>
/// <param name="count"></param>
/// <returns></returns>
public static int BoyerMooreIndexOf(this byte[] source, byte[] pattern, int offset = 0, int count = -1)
{
if (source == null) throw new ArgumentNullException(nameof(source));
if (pattern == null) throw new ArgumentNullException(nameof(pattern));

var total = source.Length;
var length = pattern.Length;

if (count > 0 && total > offset + count) total = offset + count;
if (total == 0 || length == 0 || length > total) return -1;

// 初始化坏字符,即不匹配字符
var bads = new int[256];
for (var i = 0; i < 256; i++)
bads[i] = length;


var last = length - 1;
for (var i = 0; i < last; i++)
bads[pattern[i]] = last - i;


var index = offset;
while (index <= total - length)
{
// 尾部开始比较
for (var i = last; source[index + i] == pattern[i]; i--)
if (i == 0) return index;


// 坏字符规则:后移位数 = 坏字符的位置 - 搜索词中的上一次出现位置
index += bads[source[index + last]];
}

return -1;
}
public static int KMPSearch(this string text, string pattern)
{
var n = text.Length;
var m = pattern.Length;

if (n < m) return -1;
if (n == m && text == pattern) return 0;
if (m == 0) return 0;

var lpsArray = new int[m];

LongestPrefixSuffix(pattern, ref lpsArray);

int i = 0, j = 0;
while (i < n)
{
if (text[i] == pattern[j])
{
i++;
j++;
}
if (j == m)
return i - j;

if (i >= n || text[i] == pattern[j]) continue;
if (j != 0)
j = lpsArray[j - 1];
else
i++;
}

return -1;
}

public static void LongestPrefixSuffix(string pattern, ref int[] lpsArray)
{
var m = pattern.Length;
var len = 0;
lpsArray[0] = 0;
var i = 1;

while (i < m)
{
if (pattern[i] == pattern[len])
lpsArray[i++] = ++len;
else if (len == 0)
lpsArray[i++] = 0;
else
len = lpsArray[len - 1];
}
}

public static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<IndexOfTest>();
Console.ReadLine();
}
}
}

测试结果

因为测试结果比较多,生成的图表一大堆,就不放了,直接放时间表。

1
2
3
4
5
6
7
8

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.100-preview.4.21255.9
[Host] : .NET 6.0.0 (6.0.21.25307), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.25307), X64 RyuJIT


MethodHaystackSizeLocationMeanErrorStdDevMedian
IndexOf100Start1,907.369 ns10.5384 ns9.8576 ns1,909.585 ns
IndexOfOrdinal100Start10.262 ns0.0898 ns0.0750 ns10.266 ns
SpanIndexOf100Start6.445 ns0.1105 ns0.1033 ns6.402 ns
SpanIndexOfOrdinal100Start11.435 ns0.1470 ns0.1148 ns11.426 ns
BoyerMooreIndexOf100Start152.684 ns0.8215 ns0.6860 ns152.648 ns
KMPIndexOf100Start4.804 ns0.0463 ns0.0387 ns4.797 ns
IndexOf100Middle1,928.897 ns23.5317 ns20.8602 ns1,928.090 ns
IndexOfOrdinal100Middle10.455 ns0.0554 ns0.0518 ns10.458 ns
SpanIndexOf100Middle6.515 ns0.0611 ns0.0571 ns6.522 ns
SpanIndexOfOrdinal100Middle11.298 ns0.0352 ns0.0329 ns11.296 ns
BoyerMooreIndexOf100Middle152.590 ns3.0697 ns7.9786 ns150.171 ns
KMPIndexOf100Middle4.900 ns0.0791 ns0.0661 ns4.907 ns
IndexOf100End1,911.385 ns12.0498 ns11.2714 ns1,908.234 ns
IndexOfOrdinal100End10.713 ns0.1678 ns0.1401 ns10.653 ns
SpanIndexOf100End6.891 ns0.1039 ns0.0972 ns6.849 ns
SpanIndexOfOrdinal100End11.380 ns0.0908 ns0.0805 ns11.378 ns
BoyerMooreIndexOf100End149.144 ns2.7001 ns2.3936 ns148.958 ns
KMPIndexOf100End4.750 ns0.0201 ns0.0188 ns4.753 ns
IndexOf100Random1,926.007 ns37.7531 ns35.3143 ns1,925.254 ns
IndexOfOrdinal100Random10.446 ns0.1609 ns0.1427 ns10.404 ns
SpanIndexOf100Random6.384 ns0.0295 ns0.0261 ns6.383 ns
SpanIndexOfOrdinal100Random11.130 ns0.0977 ns0.0816 ns11.109 ns
BoyerMooreIndexOf100Random145.757 ns1.7552 ns1.4656 ns145.716 ns
KMPIndexOf100Random4.769 ns0.0271 ns0.0253 ns4.770 ns
IndexOf10000Start1,889.613 ns30.0781 ns28.1351 ns1,876.613 ns
IndexOfOrdinal10000Start10.447 ns0.1852 ns0.1641 ns10.433 ns
SpanIndexOf10000Start6.631 ns0.0485 ns0.0453 ns6.612 ns
SpanIndexOfOrdinal10000Start11.409 ns0.0353 ns0.0331 ns11.413 ns
BoyerMooreIndexOf10000Start149.268 ns2.6223 ns6.3331 ns146.790 ns
KMPIndexOf10000Start4.882 ns0.0362 ns0.0282 ns4.874 ns
IndexOf10000Middle1,925.414 ns14.1439 ns13.2303 ns1,924.021 ns
IndexOfOrdinal10000Middle10.617 ns0.0675 ns0.0598 ns10.599 ns
SpanIndexOf10000Middle6.610 ns0.0871 ns0.0772 ns6.610 ns
SpanIndexOfOrdinal10000Middle11.441 ns0.0652 ns0.0610 ns11.437 ns
BoyerMooreIndexOf10000Middle161.190 ns3.1946 ns3.2806 ns161.188 ns
KMPIndexOf10000Middle4.930 ns0.0398 ns0.0372 ns4.934 ns
IndexOf10000End1,924.827 ns13.6193 ns12.7395 ns1,922.824 ns
IndexOfOrdinal10000End10.900 ns0.1292 ns0.1209 ns10.893 ns
SpanIndexOf10000End6.816 ns0.0734 ns0.0651 ns6.825 ns
SpanIndexOfOrdinal10000End11.402 ns0.0474 ns0.0396 ns11.387 ns
BoyerMooreIndexOf10000End163.459 ns3.1810 ns3.6632 ns162.601 ns
KMPIndexOf10000End5.143 ns0.0412 ns0.0385 ns5.145 ns
IndexOf10000Random2,000.571 ns39.6725 ns38.9637 ns2,004.035 ns
IndexOfOrdinal10000Random10.909 ns0.2449 ns0.5723 ns10.663 ns
SpanIndexOf10000Random6.673 ns0.0237 ns0.0222 ns6.666 ns
SpanIndexOfOrdinal10000Random10.826 ns0.1216 ns0.1078 ns10.797 ns
BoyerMooreIndexOf10000Random163.361 ns3.6494 ns10.3528 ns161.226 ns
KMPIndexOf10000Random4.882 ns0.0330 ns0.0309 ns4.872 ns
IndexOf1000000Start1,917.655 ns5.6120 ns5.2495 ns1,917.252 ns
IndexOfOrdinal1000000Start10.716 ns0.1213 ns0.1135 ns10.709 ns
SpanIndexOf1000000Start6.572 ns0.0252 ns0.0197 ns6.572 ns
SpanIndexOfOrdinal1000000Start11.066 ns0.0776 ns0.0726 ns11.066 ns
BoyerMooreIndexOf1000000Start155.307 ns3.1059 ns3.6974 ns156.110 ns
KMPIndexOf1000000Start4.843 ns0.0137 ns0.0122 ns4.846 ns
IndexOf1000000Middle1,920.323 ns18.2098 ns17.0335 ns1,922.634 ns
IndexOfOrdinal1000000Middle10.683 ns0.0894 ns0.0836 ns10.660 ns
SpanIndexOf1000000Middle6.569 ns0.0653 ns0.0579 ns6.570 ns
SpanIndexOfOrdinal1000000Middle11.046 ns0.0928 ns0.0868 ns11.040 ns
BoyerMooreIndexOf1000000Middle153.735 ns3.0260 ns7.9716 ns149.605 ns
KMPIndexOf1000000Middle4.948 ns0.0780 ns0.0730 ns4.944 ns
IndexOf1000000End1,922.367 ns15.6553 ns13.8780 ns1,920.379 ns
IndexOfOrdinal1000000End18.310 ns0.2918 ns0.2730 ns18.192 ns
SpanIndexOf1000000End6.586 ns0.0560 ns0.0523 ns6.580 ns
SpanIndexOfOrdinal1000000End10.961 ns0.0885 ns0.0785 ns10.963 ns
BoyerMooreIndexOf1000000End152.627 ns2.9601 ns7.1490 ns151.619 ns
KMPIndexOf1000000End4.875 ns0.0351 ns0.0328 ns4.859 ns
IndexOf1000000Random1,911.500 ns6.8142 ns6.3740 ns1,912.532 ns
IndexOfOrdinal1000000Random10.602 ns0.0849 ns0.0752 ns10.602 ns
SpanIndexOf1000000Random6.673 ns0.0511 ns0.0478 ns6.673 ns
SpanIndexOfOrdinal1000000Random11.070 ns0.1190 ns0.1113 ns11.049 ns
BoyerMooreIndexOf1000000Random149.230 ns3.0118 ns5.0320 ns147.710 ns
KMPIndexOf1000000Random4.858 ns0.0327 ns0.0290 ns4.859 ns
IndexOf100000000Start1,915.633 ns14.4868 ns13.5510 ns1,917.496 ns
IndexOfOrdinal100000000Start10.772 ns0.1435 ns0.1342 ns10.764 ns
SpanIndexOf100000000Start6.827 ns0.0666 ns0.0623 ns6.806 ns
SpanIndexOfOrdinal100000000Start10.977 ns0.0593 ns0.0526 ns10.989 ns
BoyerMooreIndexOf100000000Start146.921 ns2.9779 ns2.6398 ns147.245 ns
KMPIndexOf100000000Start4.833 ns0.0151 ns0.0134 ns4.832 ns
IndexOf100000000Middle1,906.318 ns5.4295 ns5.0788 ns1,905.934 ns
IndexOfOrdinal100000000Middle10.725 ns0.2305 ns0.2467 ns10.689 ns
SpanIndexOf100000000Middle6.862 ns0.0724 ns0.0642 ns6.851 ns
SpanIndexOfOrdinal100000000Middle11.001 ns0.0903 ns0.0845 ns11.019 ns
BoyerMooreIndexOf100000000Middle152.705 ns2.4382 ns2.2807 ns152.401 ns
KMPIndexOf100000000Middle4.850 ns0.0262 ns0.0245 ns4.846 ns
IndexOf100000000End1,923.149 ns9.6492 ns9.0258 ns1,921.531 ns
IndexOfOrdinal100000000End10.718 ns0.1682 ns0.1491 ns10.759 ns
SpanIndexOf100000000End6.807 ns0.0234 ns0.0207 ns6.804 ns
SpanIndexOfOrdinal100000000End11.071 ns0.1070 ns0.1001 ns11.076 ns
BoyerMooreIndexOf100000000End147.044 ns2.9552 ns2.7643 ns147.472 ns
KMPIndexOf100000000End4.966 ns0.1067 ns0.0891 ns4.949 ns
IndexOf100000000Random1,925.095 ns30.9030 ns28.9067 ns1,923.878 ns
IndexOfOrdinal100000000Random10.486 ns0.0910 ns0.0806 ns10.498 ns
SpanIndexOf100000000Random7.009 ns0.1658 ns0.1628 ns7.015 ns
SpanIndexOfOrdinal100000000Random11.080 ns0.0898 ns0.0840 ns11.112 ns
BoyerMooreIndexOf100000000Random155.551 ns3.1132 ns4.2614 ns155.335 ns
KMPIndexOf100000000Random4.887 ns0.0573 ns0.0536 ns4.871 ns

总结

  1. 不同算法在数据集大小不同和关键字符串位置不同的情况下,表现都很稳定。
  2. 默认的 IndexOf 最好使用 StringComparison.Ordinal ,比默认的 StringComparison.CurrentCulture 快两个数量级,平均大约是 20 倍的差距。
  3. 算法效率由快到慢排序为 KMPIndexOf SpanIndexOf IndexOfOrdinal SpanIndexOfOrdinal BoyerMooreIndexOf IndexOf,效率比率大约为:
KMPIndexOfSpanIndexOfIndexOfOrdinalSpanIndexOfOrdinalBoyerMooreIndexOfIndexOf
122.5335200

PS:以上结果仅为本人测试所得,供本人日后编程参考,不排除有代码 BUG 的可能,欢迎斧正。

欢迎关注我的其它发布渠道