Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Introduction
There are times when an application may require a very large index of strings for some specific purpose, and you desire the ability to search for strings that begin with given characters. A simple spell check, or syntax matching routine might want to check to see if each word entered by a user exists in a collection of known words that contains thousands of entries, and support auto-complete functionality. A research assistant application might want to store every word in a series of books or articles and allow the user to find any given word or series of starting characters in any paragraph (or otherwise defined section) of the content.
While there are certainly existing ways to store a collection of strings and quickly find a given entry in the collection, the difficulty lies in quickly finding a string that begins with a given set of characters.
Collection Classes in .Net
The generic collection classes provided by the .Net Framework are well thought-out and perform quite well for the great majority of a developer’s collection handling needs. The only real issues arise when the contents of a collection get too large and you want to use a method like Contains to see if an element exists in the collection. The Contains method is an O(n) operation, meaning that it takes proportionally longer to execute the larger the collection becomes.
Many times the solution to the problem is to use a collection backed by a HashSet. So long as all of the items in the collection are unique, an index can be built and organized using the hash value for each element in the collection making the Contains method of a HashSet an O(1) operation (it takes the same amount of time to execute regardless of the number of elements in the collection).
However, the HashSet can have its limitations as well. Depending on the objects being used it may be possible to generate a hash collision (two objects in the collection produce the same hash value), though this generally won’t happen unless you are using custom types. There’s also the fact that there is only one hash value per element to search on.
For a collection of Strings, in particular, this could be a problem. In the case of strings, you may want the ability to search for a full string in the collection, or just the starting characters to find all strings that begin with those characters. A HashSet would not be suitable for this because there is no way to search for the substrings without enumerating the entire collection.
A More Capable Solution for Strings
There is another solution to fast searches over a large collection of strings, but it comes with a cost. It is possible to build a collection that offers performance approaching O(1) for any number of strings in a collection; however, to do so we must be willing to sacrifice memory consumption for performance. If we are willing to consume more memory than the raw contents of the collection requires, we can create a collection of strings backed by a character dictionary of character dictionaries.
A String is, by definition, an Array of Characters. So the string “Hello World” is also an array containing the characters:
(0) = ‘H’
(1) = ‘e’
(2) = ‘l’
(3) = ‘l’
(4) = ‘o’
(5) = ‘ ‘
(6) = ‘W’
(7) = ‘o’
(8) = ‘r’
(9) = ‘l’
(10) = ‘d’
Instead of using one array to hold the characters of the string, we can reorganize a string such that each letter is contained in a Dictionary(Of Char, Dictionary(Of Char)) creating a tree-like structure of characters. For example, if we added the words “Hello” and “Help” to the collection, we would have a tree of dictionaries like:
(0) = ‘H’
(0) = ‘e’
(0) = ‘l’
(0) = ‘l’
(0) = ‘o’
(1) = ‘p’
So the letter “H” is the first key in the root dictionary. The value for the key “H” is another Dictionary(Of Char) and its first key is the letter “e”, and so on to the “o” in “Hello”. For the word “Help”, the “H”, “e”, and “l” are already in the tree, so the first “l” now gets a second child of the letter “p”.
By this logic, we also have the word “Hell” in our collection, but we didn’t add that word… so how do we tell the difference between words explicitly added to the collection and those added accidentally? We can use Chr(0) as a terminator character to indicate the end of an intended string. In this way, a letter with a child of Chr(0) will indicate the end of a complete word. So our previous example then becomes:
(0) = ‘H’
(0) = ‘e’
(0) = ‘l’
(0) = ‘l’
(0) = ‘o’
(0) = Chr(0)
(1) = ‘p’
(0) = Chr(0)
As you can begin to see, with an extra character on each string and a new dictionary for each new letter in a word, the amount of memory consumed is much greater than is required to merely represent the characters in the string. But at the same time, because we have organized the characters into a tree structure, the greatest number of iterations required to find any word in the collection is equal to the longest word in the collection; so searches become very fast. And if we choose to execute a search that does not look for the termination character, we can quickly find all words in the collection that start with a given set of letters.
Implementing the Collection
Although the basic premise of the collection is fairly straightforward, the code implementation can be a bit tricky in that our tree of characters must still be able to expose itself as a collection of complete strings. To achieve this, we need to create a custom enumerator for our collection that knows how to find each complete string within the tree of characters.
We can begin by creating a “SearchableStringCollection” class which implements ICollection(Of String) and declares a backing store of Dictionary(Of Char, SearchableStringCollection):
Public Class SearchableStringCollection
Implements ICollection(Of String)
Private _ListCore As New Dictionary(Of Char, SearchableStringCollection)
End Class
There are a few simple class members that we can implement without first defining our actual character storage and retrieval mechanisms. First we can implement the Count property by simply defining a private _Count field to hold the number of full strings in the collection. We’ll want to track this total manually as items are added to and removed from the collection because our backing store will not have this information readily at hand (we would have to enumerate the entire collection to find and count all of the complete strings, and that would have undesirable performance).
Private _Count As Integer
Public ReadOnly Property Count As Integer Implements System.Collections.Generic.ICollection(Of String).Count
Get
Return _Count
End Get
End Property
The IsReadOnly property is easily implemented as the collection is not Read-Only, so the property can always return False.
Public ReadOnly Property IsReadOnly As Boolean Implements System.Collections.Generic.ICollection(Of String).IsReadOnly
Get
Return False
End Get
End Property
The Clear method is also straightforward and can simply clear the entire backing store, resetting the _Count field to zero.
Public Sub Clear() Implements System.Collections.Generic.ICollection(Of String).Clear
_ListCore.Clear()
_Count = 0
End Sub
The CopyTo method is also simple in that it only needs to enumerate each string in the collection and copy it to the specified array. This method will be an O(n) operation and will take proportionally longer to execute based on the size of the provided array. However, this is also the only method in the collection which is not high-performance.
Public Sub CopyTo(array() As String, arrayIndex As Integer) Implements System.Collections.Generic.ICollection(Of String).CopyTo
For Each entry As String In Me
array(arrayIndex) = entry
arrayIndex += 1
If arrayIndex = array.Length Then Exit For
Next
End Sub
The primary functionality methods Add, Contains, and Remove will require helper methods to assist in their execution, and we’ll look at each more closely later in this article. We’ll also want to add a “ContainsStartingWith” method to see if a partial match exists in the collection, as well as a “Find” method which can return all strings beginning with a given set of characters.
The only remaining class members then are the implementations of GetEnumerator. The implementation coming from IEnumerable can be easily coded to just return the strongly-typed implementation coming from IEnumerable (Of String). We can also change the access level of this member to Protected so that the class only exposes a single GetEnumerator method.
However, the actual implementation of GetEnumerator from IEnumerable(Of String) will need to return an instance of the custom enumerator class which is capable of finding all complete strings within the tree of characters.
Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of String) Implements System.Collections.Generic.IEnumerable(Of String).GetEnumerator
End Function
Protected Function GetEnumeratorCore() As System.Collections.IEnumerator Implements System.Collections.IEnumerable.GetEnumerator
Return GetEnumerator()
End Function
Recursive Helper Methods
Let’s take a look at the definitions of the four primary helper methods used to add, remove, and find strings in the collection:
Protected Overridable Function OnAdd(characters() As Char, index As Integer) As Boolean
End Function
Protected Overridable Function OnContains(characters() As Char, index As Integer) As Boolean
End Function
Protected Overridable Function OnFind(characters() As Char, index As Integer) As String()
End Function
Protected Overridable Function OnRemove(characters() As Char, index As Integer) As Boolean
End Function
As you can see, each method takes a character array and an index as parameters. The reason for this design is to facilitate calling each of these methods in a recursive fashion. A recursive method is simply a function or subroutine which continues to call itself while some condition is met. Recursion works well with tree structures of data because you need to “walk the branches” of the tree whenever you work with the data. This means that the primary worker methods only need to know a starting position within the tree at which they should begin their work. Each method can then call itself, passing the original character array it received along with an updated index indicating the next position at which processing needs to continue.
Adding Strings to the Collection
The OnAdd method will provide the functionality of walking the character tree, looking for existing characters and adding new branches when needed. The code for this method consists of four conditional checks with one opportunity for recursion:
Protected Overridable Function OnAdd(characters() As Char, index As Integer) As Boolean
Dim result As Boolean = False
If characters.Length > 0 Then
If Not _ListCore.ContainsKey(characters(index)) Then
_ListCore.Add(characters(index), New SearchableStringCollection)
If index = characters.Length - 1 Then result = True
End If
If index < characters.Length - 1 Then
result = _ListCore(characters(index)).OnAdd(characters, index + 1)
End If
End If
Return result
End Function
The method begins by declaring a result variable of the returning type (Boolean). This value will be returned at the end of the method to indicate whether or not a new item was added to the collection (if the provided complete string already exists in the collection, no new string will be added). Next it checks the trivial case that the provided character array is empty. Providing that some characters were supplied, the method then begins to look for existing characters or add new branches to the tree.
The method checks to see if the specified character exists in the local character dictionary. If that character does not yet exist, a new branch is created and the character is added to the collection. It is important to note that when this method is first called from an instance of SearchableStringCollection, the _ListCore field is pointing to the “root” dictionary of the tree, however, since each branch of the tree contains another SearchableStringCollection, in subsequent recursive calls the _ListCore field will point to the dictionary for the particular branch of the tree that execution has walked to.
Once the current character has been processed, the method repeats itself for each remaining character in the originally provided string. However, it is again important to note that the target of this second call to OnAdd is the branch stemming from the root collection; it is not a recursive call into the same object instance. This is how each helper method recursively walks the character tree.
Removing Strings from the Collection
A string can be removed from the collection by simply finding and removing the termination character (Chr0), however, in order to properly clean up the collection we need to recursively move backward through the characters in the string and remove any which do not have another child other than the character we are analyzing.
Protected Overridable Function OnRemove(characters() As Char, index As Integer) As Boolean
If _ListCore.ContainsKey(characters(index)) Then
If index < characters.Length - 1 Then
If _ListCore(characters(index)).OnRemove(characters, index + 1) Then
If _ListCore(characters(index))._ListCore.Count = 0 Then
_ListCore.Remove(characters(index))
End If
Return True
End If
Else
_ListCore.Remove(characters(index))
Return True
End If
End If
Return False
End Function
To achieve this, the method first checks to see if the current character is in the current dictionary. If not, there is nothing to remove so the method returns false. If the character is found, then the method checks to see if it is on the last character of the string. If so, the character can be removed and the removal of the specified word is complete. If the current character is not the last character, then the method tries to remove the next character in the string. If this fails, it means there is no match for the specified word so the method returns false. If the remove succeeds, then the method checks to see if there are any other existing branches for this character, and if not, it then removes the current character since there are no more words using the character. If there are other characters in the list, then the method returns true indicating that the specified word was removed.
This can be a little hard to follow, but the important thing to keep in mind is that each time the method calls itself the instance of _ListCore becomes the SearchableStringCollection on the next branch of the dictionary. If you are having trouble following the logic, try using the previous example of the words "Hello World" and "Help" and write down the steps that the method would take to remove the word "Help". In the first step, the OnRemove method would be passed the characters() "H, e, l, p" and an index of 0. So the first line of code would return true because the root _ListCore contains the letter "H". The index would not yet be on the last character so the method would call OnRemove on the SearchableStringCollection for the key "H" in the current _ListCore. So step 2 would then begin with the same characters but an index of 1, and _ListCore would now be the SearchableStringCollection for the "H" branch of the tree. If you continue writing down the logic in this manner, the execution of the method may become clearer.
Finding Strings in the Collection
Searching for a single matching string within the collection is a fairly simple process. We only need to check to see if each character in the string exists at the corresponding branch of the character tree. This makes the OnContains method short and simple:
Protected Overridable Function OnContains(characters() As Char, index As Integer) As Boolean
If _ListCore.ContainsKey(characters(index)) Then
If index < characters.Length - 1 Then
Return _ListCore(characters(index)).OnContains(characters, index + 1)
End If
Return True
End If
Return False
End Function
The method only needs to check the current instance of _ListCore for the current character specified by index. If the current index is not the last character, then the process repeats for the next branch of the tree. Matching an exact word, or any starting characters, is simply a matter of including a terminating Chr(0) in the characters() array to match an exact word or excluding to match starting characters.
But if we want to find and return all strings that start with a given set of characters, we are going to need the custom enumerator mentioned earlier in this article. We'll take a look at the OnFind method and then get to the implementation of the custom enumerator.
Protected Overridable Function OnFind(characters() As Char, index As Integer) As String()
Dim result As New List(Of String)
If _ListCore.ContainsKey(characters(index)) Then
If index < characters.Length - 1 Then
result.AddRange(_ListCore(characters(index)).OnFind(characters, index + 1))
Else
Using enm As New SearchableStringCollectionEnumerator(_ListCore(characters(index))._ListCore.GetEnumerator)
While enm.MoveNext
result.Add(String.Concat(New String(characters), enm.Current))
End While
End Using
End If
End If
Return result.ToArray
End Function
The method begins by creating a temporary List(Of String) to store each of the matches found while searching the collection. The method then checks to see if the current character is in the current dictionary and if so the determines whether to continue looking for starting characters (index is not yet at the last character), or proceeds to enumerate all of the existing strings from the current branch of the character tree. Once the last of the specified characters is found in the tree, the enumerator takes over doing the work of finding all of the complete strings that begin with the specified characters.
Implementing the SearchableStringCollectionEnumerator
To create a custom string enumerator we only need to define a class which inherits from IEnumerator(Of String). Our implementation will require a special constructor so that we can create an enumerator at any point in the character tree as well as a list of sub-enumerators used for walking the branches of the tree and a StringBuilder for constructing the current value of an instance of the enumerator.
Public Class SearchableStringCollectionEnumerator
Implements IEnumerator(Of String)
Private _BaseEnumerator As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection))
Private _Enumerators As New List(Of IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)))
Private _Current As New System.Text.StringBuilder
Protected Friend Sub New(baseEnumerator As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)))
_BaseEnumerator = baseEnumerator
_Enumerators.Add(_BaseEnumerator)
End Sub
End Class
Whenever an instance of SearchableStringCollectionEnumerator is created it must be supplied with an instance of IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)). This is the type returned by a call to GetEnumerator on an instance of our _ListCore dictionary. When an enumerator is created for the entire SearchableStringCollection, the constructor is passed the enumerator provided by the root dictionary. When using a method like OnFind, the enumerator is created using the enumerator provided by the SearchableStringCollection belonging to the last character in the specified set of characters to find.
The baseEnumerator variable always points to the first enumerator used to create the class instance. This enumerator is added to the list of working enumerators being used to walk the character tree.
Implementing IEnumerator means that we must provide three class members: a Current property, and the methods MoveNext and Reset. The Current property will simply return the characters contained in the StringBuilder. Since we are implementing a generic interface we must also have an implementation for the base IEnumerator interface, though we can hide this implementation and return the generic one's value just as we did for GetEnumerator in the SearchableStringCollection class.
Public ReadOnly Property Current As String Implements System.Collections.Generic.IEnumerator(Of String).Current
Get
Return _Current.ToString
End Get
End Property
Protected ReadOnly Property CurrentCore As Object Implements System.Collections.IEnumerator.Current
Get
Return Current
End Get
End Property
The Reset method is fairly simple as well. Its only requirement is to return the enumerator to the state it was in when it was created. This can be accomplished by clearing the StringBuilder and working enumerator list and then adding the base enumerator back into the working list. We'll clean up any of the working enumerators that were created by calls to MoveNext and reset the base enumerator back to its initial state.
Public Sub Reset() Implements System.Collections.IEnumerator.Reset
_Current.Clear()
_Enumerators.Remove(_BaseEnumerator)
_BaseEnumerator.Reset()
For Each enm As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)) In _Enumerators
enm.Dispose()
Next
_Enumerators.Clear()
_Enumerators.Add(_BaseEnumerator)
End Sub
The MoveNext method is quite short, but relies on a "BuildCurrent" helper method to do all of the real work.
Public Function MoveNext() As Boolean Implements System.Collections.IEnumerator.MoveNext
Dim result As Boolean = BuildCurrent()
If _Current.Length > 0 Then _Current.Length -= 1
Return result
End Function
BuildCurrent will return a value indicating whether or not there are any more strings in the enumerator. Since all strings explicitly added to the collection end with a Chr(0), the MoveNext method strips the last character from the current string before returning its result. The BuildCurrent helper method cannot strip this character because it is recursive and relies on the contents of the StringBuilder during its recursive execution.
Private Function BuildCurrent() As Boolean
Dim result As Boolean = False
If _Enumerators.Count > 0 Then
If _Enumerators(_Enumerators.Count - 1).MoveNext() Then
_Current.Append(_Enumerators(_Enumerators.Count - 1).Current.Key)
If Not _Current(_Current.Length - 1) = Chr(0) Then
_Enumerators.Add(_Enumerators(_Enumerators.Count - 1).Current.Value._ListCore.GetEnumerator)
result = BuildCurrent()
Else
result = True
End If
Else
_Enumerators.RemoveAt(_Enumerators.Count - 1)
If _Current.Length > 0 Then _Current.Length -= 1
result = BuildCurrent()
End If
End If
Return result
End Function
The method begins by creating a result variable initialized to false (meaning there are no more strings in the enumerator). If there are still enumerators in the working collection, then the method calls MoveNext on the last one. Keep in mind that these are enumerators from the various instance of _ListCore from each branch of the character tree. If MoveNext returns true, then that character is appended to the current string. If that character was not the termination character (Chr0) then the method gets an enumerator from the SearchableStringCollection associated with that character and continues building the string. If MoveNext returns false, that enumerator is removed from the working collection, the last character (if any) is trimmed from the current string, and BuildCurrent is called again with the previous working enumerator. The previous enumerator will be looking over words that begin with one less character than the current enumerator was looking at, and this is why we must remove the last character from the current string.
Finally we must implement IDisposable so that our enumerator can be cleaned up when it is used implicitly such as by a For Each statement.
Private disposedValue As Boolean
Protected Overridable Sub Dispose(disposing As Boolean)
If Not Me.disposedValue Then
If disposing Then
For Each enm As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)) In _Enumerators
enm.Dispose()
Next
_Enumerators.Clear()
_Current.Clear()
_BaseEnumerator.Dispose()
_BaseEnumerator = Nothing
End If
End If
Me.disposedValue = True
End Sub
Public Sub Dispose() Implements IDisposable.Dispose
Dispose(True)
GC.SuppressFinalize(Me)
End Sub
This method simply disposes any remaining enumerators and clears the working collection and current string.
Tying it All Together
With the custom enumerator in place, we can now implement the primary public worker methods of the SearchableStringCollection class. And since the custom enumerator does all of the heavy lifting, these method implementations are short and easy.
The Add method will only need to call the OnAdd helper method can then increment the _Count field if the method succeeds:
Public Sub Add(item As String) Implements System.Collections.Generic.ICollection(Of String).Add
If OnAdd((item & Chr(0)).ToArray, 0) Then
_Count += 1
End If
End Sub
Similarly, the Remove method needs only call the OnRemove helper method and decrement the _Count field if the call succeeds:
Public Function Remove(item As String) As Boolean Implements System.Collections.Generic.ICollection(Of String).Remove
Dim result As Boolean = OnRemove((item & Chr(0)).ToCharArray, 0)
If result Then _Count -= 1
Return result
End Function
For the GetEnumerator implementation we can return a new instance of SearchableStringCollectionEnumerator using the root dictionary's enumerator:
Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of String) Implements System.Collections.Generic.IEnumerable(Of String).GetEnumerator
Return New SearchableStringCollectionEnumerator(Me._ListCore.GetEnumerator)
End Function
Our Contains and ContainsStartingWith methods will pass the characters of the specified word to the OnContains method, with Contains adding a termination character (Chr0) to the specified string:
Public Function Contains(item As String) As Boolean Implements System.Collections.Generic.ICollection(Of String).Contains
Return OnContains((item & Chr(0)).ToCharArray, 0)
End Function
Public Function ContainsStartingWith(startsWith As String) As Boolean
Return OnContains(startsWith.ToCharArray, 0)
End Function
And finally the Find method need only return the result of a call to OnFind:
Public Function Find(startsWith As String) As String()
Return OnFind(startsWith.ToCharArray, 0)
End Function
Summary
It is possible to build a high-performance string collection capable of storing a very large number of strings while still providing fast search capability for whole and partial string matches. But doing so involves a trade-off of performance for memory consumption. Most any PC-based application will be able to handle this trade-off; however, mobile applications may not be able to do something like this.
It is also possible to expand this concept into a generic dictionary of string keys. Instead of Implementing ICollection(Of String), you could implement IDictionary(Of String, T) to create a generic SearchableStringDictionary(Of T) class. You would add a field to the class to hold the associated value and then implement all of the dictionary functionality similar to the collection functionality outlined above.
Although it doesn't take a lot of code to implement a SearchableStringCollection class, the logic of the code can be a bit daunting to follow. I highly recommend that you copy the sample from Appendix A into Visual Studio and then write some code to fill the collection with strings and exercise its basic methods. Step through the code and follow the execution of calls to OnAdd, OnRemove, OnContains, and OnFind. That may be the best way to come to understand what the code is actually doing.
Appendix A: Complete Code Sample
Option Strict On
''' <summary>
''' Represents a collection of strings similar in performance to a HashSet(Of String) that offers partial string searches.
''' </summary>
''' <remarks></remarks>
Public Class SearchableStringCollection
Implements ICollection(Of String)
Private _ListCore As New Dictionary(Of Char, SearchableStringCollection)
''' <summary>
''' Adds the specified string to the collection if it does not already exist.
''' </summary>
''' <param name="item">The string to add to the collection.</param>
''' <remarks></remarks>
Public Sub Add(item As String) Implements System.Collections.Generic.ICollection(Of String).Add
If OnAdd((item & Chr(0)).ToArray, 0) Then
_Count += 1
End If
End Sub
''' <summary>
''' Clears all strings from the collection.
''' </summary>
''' <remarks></remarks>
Public Sub Clear() Implements System.Collections.Generic.ICollection(Of String).Clear
_ListCore.Clear()
_Count = 0
End Sub
''' <summary>
''' Gets a value indicating whether the collection contains the specified string.
''' </summary>
''' <param name="item">The exact string to find.</param>
''' <returns>True if the collection contains the specified string, otherwise false.</returns>
''' <remarks></remarks>
Public Function Contains(item As String) As Boolean Implements System.Collections.Generic.ICollection(Of String).Contains
Return OnContains((item & Chr(0)).ToCharArray, 0)
End Function
''' <summary>
''' Gets a value indicating whether the collection contains any strings that begin with the specified characters.
''' </summary>
''' <param name="startsWith">The starting characters to search for.</param>
''' <returns>True if the collection contains one or more strings beginning with the specified characters, otherwise false.</returns>
''' <remarks></remarks>
Public Function ContainsStartingWith(startsWith As String) As Boolean
Return OnContains(startsWith.ToCharArray, 0)
End Function
''' <summary>
''' Copies the contents of the collection to the specified array.
''' </summary>
''' <param name="array">The array of strings to copy the collection into.</param>
''' <param name="arrayIndex">The location in the array at which copying begins.</param>
''' <remarks></remarks>
Public Sub CopyTo(array() As String, arrayIndex As Integer) Implements System.Collections.Generic.ICollection(Of String).CopyTo
For Each entry As String In Me
array(arrayIndex) = entry
arrayIndex += 1
If arrayIndex = array.Length Then Exit For
Next
End Sub
Private _Count As Integer
''' <summary>
''' Gets the number of complete strings contained in the collection.
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property Count As Integer Implements System.Collections.Generic.ICollection(Of String).Count
Get
Return _Count
End Get
End Property
''' <summary>
''' Gets an array of strings that begin with the specified characters.
''' </summary>
''' <param name="startsWith">The starting characters to find.</param>
''' <returns>An array of strings beginning with the specified characters.</returns>
''' <remarks></remarks>
Public Function Find(startsWith As String) As String()
Return OnFind(startsWith.ToCharArray, 0)
End Function
''' <summary>
''' Gets an enumerator that iterates over the strings in the collection.
''' </summary>
''' <returns></returns>
''' <remarks></remarks>
Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of String) Implements System.Collections.Generic.IEnumerable(Of String).GetEnumerator
Return New SearchableStringCollectionEnumerator(Me._ListCore.GetEnumerator)
End Function
Protected Function GetEnumeratorCore() As System.Collections.IEnumerator Implements System.Collections.IEnumerable.GetEnumerator
Return GetEnumerator()
End Function
''' <summary>
''' Returns false indicating that the collection is not read-only.
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property IsReadOnly As Boolean Implements System.Collections.Generic.ICollection(Of String).IsReadOnly
Get
Return False
End Get
End Property
''' <summary>
''' Removes the specified string from the collection.
''' </summary>
''' <param name="item">The string to remove from the collection.</param>
''' <returns>True if the string is found and removed, otherwise false.</returns>
''' <remarks></remarks>
Public Function Remove(item As String) As Boolean Implements System.Collections.Generic.ICollection(Of String).Remove
Dim result As Boolean = OnRemove((item & Chr(0)).ToCharArray, 0)
If result Then _Count -= 1
Return result
End Function
Protected Overridable Function OnAdd(characters() As Char, index As Integer) As Boolean
Dim result As Boolean = False
If characters.Length > 0 Then
If Not _ListCore.ContainsKey(characters(index)) Then
_ListCore.Add(characters(index), New SearchableStringCollection)
If index = characters.Length - 1 Then result = True
End If
If index < characters.Length - 1 Then
result = _ListCore(characters(index)).OnAdd(characters, index + 1)
End If
End If
Return result
End Function
Protected Overridable Function OnContains(characters() As Char, index As Integer) As Boolean
If _ListCore.ContainsKey(characters(index)) Then
If index < characters.Length - 1 Then
Return _ListCore(characters(index)).OnContains(characters, index + 1)
End If
Return True
End If
Return False
End Function
Protected Overridable Function OnFind(characters() As Char, index As Integer) As String()
Dim result As New List(Of String)
If _ListCore.ContainsKey(characters(index)) Then
If index < characters.Length - 1 Then
result.AddRange(_ListCore(characters(index)).OnFind(characters, index + 1))
Else
Using enm As New SearchableStringCollectionEnumerator(_ListCore(characters(index))._ListCore.GetEnumerator)
While enm.MoveNext
result.Add(String.Concat(New String(characters), enm.Current))
End While
End Using
End If
End If
Return result.ToArray
End Function
Protected Overridable Function OnRemove(characters() As Char, index As Integer) As Boolean
If _ListCore.ContainsKey(characters(index)) Then
If index < characters.Length - 1 Then
If _ListCore(characters(index)).OnRemove(characters, index + 1) Then
If _ListCore(characters(index))._ListCore.Count = 0 Then
_ListCore.Remove(characters(index))
End If
Return True
End If
Else
_ListCore.Remove(characters(index))
Return True
End If
End If
Return False
End Function
''' <summary>
''' Enumerates the elements of a SearchableStringCollection.
''' </summary>
''' <remarks></remarks>
Public Class SearchableStringCollectionEnumerator
Implements IEnumerator(Of String)
Private _Enumerators As New List(Of IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)))
Private _Current As New System.Text.StringBuilder
Private _BaseEnumerator As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection))
Protected Friend Sub New(baseEnumerator As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)))
_BaseEnumerator = baseEnumerator
_Enumerators.Add(_BaseEnumerator)
End Sub
''' <summary>
''' Gets the element in the collection at the current position of the enumerator.
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property Current As String Implements System.Collections.Generic.IEnumerator(Of String).Current
Get
Return _Current.ToString
End Get
End Property
Protected ReadOnly Property CurrentCore As Object Implements System.Collections.IEnumerator.Current
Get
Return Current
End Get
End Property
''' <summary>
''' Advances the enumerator to the next element in the collection.
''' </summary>
''' <returns></returns>
''' <remarks></remarks>
Public Function MoveNext() As Boolean Implements System.Collections.IEnumerator.MoveNext
Dim result As Boolean = BuildCurrent()
If _Current.Length > 0 Then _Current.Length -= 1
Return result
End Function
''' <summary>
''' Sets the enumerator to its initial position, which is before the first element in the collection.
''' </summary>
''' <remarks></remarks>
Public Sub Reset() Implements System.Collections.IEnumerator.Reset
_Current.Clear()
_Enumerators.Remove(_BaseEnumerator)
_BaseEnumerator.Reset()
For Each enm As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)) In _Enumerators
enm.Dispose()
Next
_Enumerators.Clear()
_Enumerators.Add(_BaseEnumerator)
End Sub
Private Function BuildCurrent() As Boolean
Dim result As Boolean = False
If _Enumerators.Count > 0 Then
If _Enumerators(_Enumerators.Count - 1).MoveNext() Then
_Current.Append(_Enumerators(_Enumerators.Count - 1).Current.Key)
If Not _Current(_Current.Length - 1) = Chr(0) Then
_Enumerators.Add(_Enumerators(_Enumerators.Count - 1).Current.Value._ListCore.GetEnumerator)
result = BuildCurrent()
Else
result = True
End If
Else
_Enumerators.RemoveAt(_Enumerators.Count - 1)
If _Current.Length > 0 Then _Current.Length -= 1
result = BuildCurrent()
End If
End If
Return result
End Function
#Region "IDisposable Support"
Private disposedValue As Boolean
Protected Overridable Sub Dispose(disposing As Boolean)
If Not Me.disposedValue Then
If disposing Then
For Each enm As IEnumerator(Of KeyValuePair(Of Char, SearchableStringCollection)) In _Enumerators
enm.Dispose()
Next
_Enumerators.Clear()
_Current.Clear()
_BaseEnumerator.Dispose()
_BaseEnumerator = Nothing
End If
End If
Me.disposedValue = True
End Sub
Public Sub Dispose() Implements IDisposable.Dispose
Dispose(True)
GC.SuppressFinalize(Me)
End Sub
#End Region
End Class
End Class
Appendix B: Performance Results from a Practical Application
As a research utility for my own personal use, I wrote a program that I call "Cross Bible Viewer" which uses the string-keyed dictionary version of the class described in this article. The program parses seven translations of the Christian Bible, indexing every word by verse. The program also offers language dictionary look-ups for any word using a number of English dictionaries. But the relevant part for this article is the performance of the SearchableStringDictionary.
All of the words from seven translations of The Bible (24,946 words), indexing a collection of verses (the value is an instance of VerseLocation containing Translation, Book, Chapter, and Verse information), consume approximately 1.2GB of Windows memory (compared to 34.6MB of source XML data for all seven translations). While this sounds like a lot (and it is), it is still well within the working limits of modern computers. If I search for the word "love", it takes longer to display the results in a ListBox then it does to gather them from the collection. Gathering 2,119 results and loading them into a ListBox (after BeginUpdate and before EndUpdate) takes just 100 milliseconds. Checking to see if the dictionary contains the word "love" takes 0.087 ms of that total 100. If I search for all words beginning with the letters "the", it takes 6.97 ms to determine that there are 316,599 verses (across all seven translations) with matching words (it takes a minute and a half to actually display all of those results in the program).
So in the right setting, this kind of performance is well worth the cost of additional RAM.
Acknowledgments
I've experimented with many kinds of tree collections in the past; sometimes the experiments work and sometimes they don't. I had thought of trying to create something like this several years ago, but wasn't sure it would turn out to be one of the successful experiments and didn't try to build it. It wasn't until I had a conversation with another community contributor, CrazyPennie, in which he showed a simple example of what he called a "binary dictionary", that it was confirmed that splitting a string into a tree of characters was a viable thing to do.
So special thanks go to CrazyPennie for showing me that it would be worth the effort of developing a full .Net implementation of a character-tree backed string collection.