Continuing the journey of optimizing a text-similarity HTTP request handler.
In the first part, I told a little tale of how I stumbled upon a Medium article with one of those click-friendly “Go vs Rust” headings, which led me onto the path of trying to figure out if I could make the Go implementation perform better.
In this part, I’ll continue by looking at different options for JSON parsing and for the maps used in the text-similarity algorithm.
Go’s encoding/json is convenient and works fine, but it is not terribly performant, with some reflection happening under the hood, especially when unmarshalling to map[string]interface{}. Good options exist though!
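For context, the SimilarityRequest and SimilarityResponse types used in the benchmarks below are not shown in this part; presumably they look roughly like this (field names taken from the JSON payload, struct tags assumed):
type SimilarityRequest struct {
    Text1 string `json:"text1"`
    Text2 string `json:"text2"`
}

type SimilarityResponse struct {
    Similarity     float64 `json:"similarity"`
    Interpretation string  `json:"interpretation"`
}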
Let’s put a few alternatives to the test using go test -bench on our typical JSON input of two 10000-character lines. First, the encoding/json benchmark code:
func BenchmarkEncodingJSON(b *testing.B) {
    // Set up sample request and re-use structs.
    reqData := []byte(fmt.Sprintf(`{"text1":"%s", "text2":"%s"}`, text1, text2))
    req := SimilarityRequest{}
    resp := &SimilarityResponse{
        Similarity:     0.54,
        Interpretation: string(InterpretationResultModerately),
    }
    for i := 0; i < b.N; i++ {
        // Really simple implementation
        _ = json.Unmarshal(reqData, &req)
        _ = req.Text1
        _ = req.Text2
        _, _ = json.Marshal(resp)
    }
}
Note that we’re skipping error checking to keep the benchmark (and sample code) focused on performance. Also, note the explicit assigns to _, which are sometimes needed, or the Go compiler may decide to skip function calls entirely!
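A common way to achieve the same thing (a general benchmarking pattern, not code from this project) is to assign the result to a package-level “sink” variable, which the compiler cannot prove is unused:
// Hypothetical example: doWork is a placeholder for the function being benchmarked.
// Assigning its result to a package-level variable keeps the call from being
// eliminated as dead code.
var sink string

func BenchmarkDoWork(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = doWork()
    }
}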
simdjson-go provides a custom JSON parser tailored to process large and complex JSON structures at speeds of potentially gigabytes per second. It utilizes SIMD (Single Instruction, Multiple Data) through Go assembly to maximize performance.
Implementation-wise, the API is quite different compared to encoding/json, and in order to get the best performance, the developer needs to pre-allocate the data structures needed so they can be reused to minimize memory allocations.
Implemented as a Go benchmark, a semi-optimized solution parsing the two lines from the input JSON looks like this:
func BenchmarkJSONSimd(b *testing.B) {
    // Use same input for all iterations. Each line is 10000 characters.
    reqData := []byte(fmt.Sprintf(`{"text1":"%s", "text2":"%s"}`, text1, text2))
    // Define response struct once.
    resp := &SimilarityResponse{
        Similarity:     0.54,
        Interpretation: string(InterpretationResultModerately),
    }
    // Pre-allocate some data-structures for re-use.
    var lem1 = &simdjson.Element{}
    var lem2 = &simdjson.Element{}
    var pj = &simdjson.ParsedJson{}
    for i := 0; i < b.N; i++ {
        // Parse input
        pj, _ = simdjson.Parse(reqData, pj)
        // Use simdjson-go's ForEach API to iterate over fields on the root level.
        _ = pj.ForEach(func(i simdjson.Iter) error {
            // Extract text1 and text2 (skipping error checking to make benchmark more fair)
            _, _ = i.FindElement(lem1, "text1")
            _, _ = lem1.Iter.String()
            _, _ = i.FindElement(lem2, "text2")
            _, _ = lem2.Iter.String()
            return nil
        })
        // Return JSON is marshalled using encoding/json since simdjson-go only provides parsing.
        _, _ = json.Marshal(resp)
    }
}
The implementation in the HTTP handler is similar to the above benchmark, e.g.:
func similarityHandler(c *fiber.Ctx) error {
    body := c.BodyRaw() // use the raw request body bytes
    p1, err := simdjson.Parse(body, nil)
    if err != nil {
        return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
            "error": "Bad Request",
        })
    }
    var text1, text2 string
    _ = p1.ForEach(func(i simdjson.Iter) error {
        // assign text1 and text2 here
        return nil
    })
    // ... rest omitted ...
While not as simple as the c.BodyParser(&req) provided by Fiber (with encoding/json as the decoder), it’s not too complex either.
Another JSON-parsing option for Go is buger/jsonparser, which takes a somewhat different approach compared to simdjson-go. buger/jsonparser is oriented around avoiding memory allocations and being very efficient at extracting one or a few values from a larger payload.
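For instance, pulling out a single value by key (not how the handler below does it, but illustrative of that “extract a few values” style) can be a one-liner using the library's GetString function:
// Extract the value of the "text1" key directly from the raw request bytes.
text1, err := jsonparser.GetString(reqData, "text1")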
Benchmark code:
func BenchmarkJSONParser(b *testing.B) {
    // Same reqData and pre-baked response as the other JSON benchmarks.
    reqData := []byte(fmt.Sprintf(`{"text1":"%s", "text2":"%s"}`, text1, text2))
    resp := &SimilarityResponse{
        Similarity:     0.54,
        Interpretation: string(InterpretationResultModerately),
    }
    // Known keys to use when extracting from the input
    key1 := []byte(`text1`)
    key2 := []byte(`text2`)
    for i := 0; i < b.N; i++ {
        // Use the ObjectEach function with a custom callback function
        _ = jsonparser.ObjectEach(reqData, func(key []byte, value []byte, dataType jsonparser.ValueType, offset int) error {
            // If key is "text1" etc.
            if bytes.Equal(key, key1) {
                _ = value // assignment to variable outside closure scope
            }
            if bytes.Equal(key, key2) {
                _ = value
            }
            return nil
        })
        // Use stdlib to marshal response, jsonparser does not provide marshalling
        _, _ = json.Marshal(resp)
    }
}
I’ll skip most of the example implementation code in the HTTP handler; it’s just the core of the code inside the loop, with the []byte being provided by Fiber’s c.BodyRaw():
func similarityHandler(c *fiber.Ctx) error {
    var text1, text2 []byte
    _ = jsonparser.ObjectEach(c.BodyRaw(), func(key []byte, value []byte, _ jsonparser.ValueType, _ int) error {
        if bytes.Equal(key, key1) {
            text1 = byteReplacer.Replace(value) // Use byteReplacer from first part of blog directly on []byte.
        }
        // ... rest omitted ...
Running go test -bench=. over the three different JSON implementations resulted in the following:
BenchmarkEncodingJSON-16 21247 55931 ns/op
BenchmarkJSONSimd-16 82616 13861 ns/op
BenchmarkJSONParser-16 119608 9811 ns/op
The clear winner of this benchmark is buger/jsonparser, providing almost 6x the performance of encoding/json. simdjson-go is about 4x faster, but for this particular use case it cannot keep up with buger/jsonparser.
Re-running the 2.5-minute web service benchmark with buger/jsonparser results in:
checks.........................: 100.00% 1488834 out of 1488834
✓ http_req_duration..............: avg=75.89ms min=622µs med=6.99ms max=4.82s p(90)=254.27ms p(95)=509.4ms
http_reqs......................: 744417 4962.830158/s
A few runs of simdjson-go for completeness averaged to:
checks.........................: 100.00% 1454662 out of 1454662
✓ http_req_duration..............: avg=77.67ms min=614µs med=11.31ms max=3.68s p(90)=238.19ms p(95)=456.54ms
iterations.....................: 727331 4848.81476/s
To summarize with the results from part 1:
Go regexp : 2266 req/s, avg: 168 ms.
Go strings.Replacer : 4296 req/s, avg. 87.2 ms.
Go cheat : 4488 req/s, avg: 84.0 ms.
Go byteReplacer : 4512 req/s, avg: 83.5 ms.
Go byteReplacer + simdjson-go : 4849 req/s, avg: 77.7 ms.
Go byteReplacer + jsonparser : 4963 req/s, avg: 75.9 ms.
Rust : 5045 req/s, avg: 74 ms.
Awesome! Switching to a faster JSON parser netted us another 400-500 req/s, now almost on par with the Rust solution!
Can we do even better? Perhaps, read on!
After the JSON has been parsed and the texts have been normalized, the next step is to count the number of times each word occurs in each string.
The Medium article’s implementation does this using the standard Go map:
frequencyMap := make(map[string]int)
for _, word := range words {
    frequencyMap[word]++
}
Very straightforward. Is it “fast”? Maybe; let’s find out by comparing with an alternate map implementation.
In 2023, the team behind the DoltDB SQL database announced SwissMap, which is a Go-native implementation of a hashtable, more closely described here.
The SwissMap syntax is accessible enough; the equivalent of the frequency map code above looks like this:
func generateFrequencyMapSwiss(words []string) *swiss.Map[string, int] {
    frequencyMap := swiss.NewMap[string, int](uint32(len(words)))
    for _, word := range words {
        num, ok := frequencyMap.Get(word)
        if !ok {
            frequencyMap.Put(word, 1)
        } else {
            frequencyMap.Put(word, num+1)
        }
    }
    return frequencyMap
}
Since it’s not the built-in map, we cannot index into it and increment in a single line. It technically works to always execute frequencyMap.Put(word, num+1) instead of the if-else statement, but for some reason that consistently performs slightly worse.
Results:
BenchmarkFrequencyStandardMap-16 8222 130927 ns/op
BenchmarkFrequencyStandardMapWithCapacity-16 13032 91923 ns/op
BenchmarkFrequencySwissMap-16 8534 123810 ns/op
Overall, performance for this particular write-heavy benchmark is not that good. Note that the SwissMap implementation sets len(words) as the initial capacity, just like the faster standard map variant. Without an initial capacity, SwissMap was 20-25% slower.
Interestingly enough, in Go 1.24 the internal map implementation has been rewritten to be based on Swiss tables. I’ll make sure to re-benchmark this once 1.24 is out.
A part of the text-similarity algorithm is to obtain a set of all unique words from both input texts. The default implementation uses a map[string]struct{} for this, the “de-facto” way to create a set (since keys are unique) in Go without external dependencies:
words1 := strings.Split(string(text1), " ")
words2 := strings.Split(string(text2), " ")
fm1 := generateFrequencyMap(words1)
fm2 := generateFrequencyMap(words2)
// Iterate over keys of the frequency maps and add to
// new map to create the set
uw := make(map[string]any, 0)
for word := range fm1 {
    uw[word] = struct{}{}
}
for word := range fm2 {
    uw[word] = struct{}{}
}
// use maps.Keys from the maps package to get a slice containing
// unique words.
uniqueWords := maps.Keys(uw)
We can try to use github.com/scylladb/go-set/strset and also operate directly on the result of those strings.Split calls:
words1 := strings.Split(string(text1), " ")
words2 := strings.Split(string(text2), " ")
uw := strset.NewWithSize(len(words1) + len(words2))
uw.Add(append(words1, words2...)...)
uniqueWords := uw.List()
Those for-loops from the original solution can be removed.
To benchmark this, a full 2m30s test was executed:
checks.........................: 100.00% 1537420 out of 1537420
✓ http_req_duration..............: avg=73.31ms min=605µs med=9.37ms max=4.71s p(90)=234.21ms p(95)=453.78ms
iterations.....................: 768710 5124.658582/s
5124.65 req/s with 73.3 ms average latency! Now we’re faster than the Rust solution! Do note that our switch to strset also entailed a possibly more efficient set creation algorithm, so in some ways we’re no longer strictly comparing language/library efficiency.
When declaring a map in Go using make, one can optionally specify an initial capacity for the map. In the following example, the number of words in the input is used to set the initial capacity:
frequencyMap := make(map[string]int, len(words))
Specifying a sufficiently large initial size of a map up front will consume more memory, but should also mean fewer incremental memory allocations and less internal bucket resizing, re-hashing etc. as new items are added. It may also result in larger contiguous blocks of memory being allocated, making the garbage collector’s job easier later on.
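The benchmark functions behind the BenchmarkFrequencyStandardMap* numbers aren’t included in the article; a minimal sketch of what the two variants could look like (re-using the text1 sample string from the JSON benchmarks) is:
func BenchmarkFrequencyStandardMap(b *testing.B) {
    words := strings.Split(text1, " ")
    for i := 0; i < b.N; i++ {
        frequencyMap := make(map[string]int) // no initial capacity
        for _, word := range words {
            frequencyMap[word]++
        }
    }
}

func BenchmarkFrequencyStandardMapWithCapacity(b *testing.B) {
    words := strings.Split(text1, " ")
    for i := 0; i < b.N; i++ {
        frequencyMap := make(map[string]int, len(words)) // pre-sized up front
        for _, word := range words {
            frequencyMap[word]++
        }
    }
}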
A quick microbenchmark of the generateFrequencyMap function with and without setting an initial capacity based on the total number of words produces a slightly surprising result:
BenchmarkFrequencyStandardMap-16 8469 130563 ns/op
BenchmarkFrequencyStandardMapWithCapacity-16 13141 91397 ns/op
We see a ~55% increase in throughput! This may come at the cost of using slightly more memory, which should be pretty marginal overall though.
I updated all map initializations (there are 3 of them) in the text-similarity handler so they are initialized in this manner. Note that this is done on top of the strset and JSON improvements we’ve already covered.
Running the 2m30s benchmark again:
checks.........................: 100.00% 1834446 out of 1834446
✓ http_req_duration..............: avg=59.16ms min=475µs med=13.72ms max=1.94s p(90)=188.33ms p(95)=284.84ms
iterations.....................: 917223 6114.715968/s
6114.7 req/s! Another 20% increase. The size of the increase surprised me, but I guess that when doing 5000+ req/s, reducing the number of memory allocations will increase raw performance, as well as putting less strain on the GC.
Time for another detour. This time we’ll use go tool pprof to capture a profile of memory allocations before and after the changes to map initialization. Use go tool pprof with -alloc_space and -alloc_objects to capture snapshots of the total number of allocations. One can also use -inuse_space and -inuse_objects to get the current number of active allocations and the amount of heap memory in use.
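As a concrete example (assuming the service exposes net/http/pprof on port 6060, just like in the CPU-profiling example further down), collecting an allocation profile could look like this:
go tool pprof -alloc_objects main http://localhost:6060/debug/pprof/heap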
I’ve collected alloc_space and alloc_objects data after the first 30 seconds of the test run, resulting in:
Scenario | Requests performed | Num allocations | Memory allocated | Allocations per request |
---|---|---|---|---|
No initial map capacity | 152 227 | 17 722 736 | 75.83 GB | 116.4 |
With initial map capacity | 181 626 | 4 398 554 | 77.68 GB | 24.22 |
As seen, specifying an initial capacity for Go’s map in this particular scenario, where we expect something like 500-1000 entries (unique words) per map, means that the Go runtime doesn’t have to grow the capacity dynamically as more and more entries are added. Reducing the number of allocations by four fifths in a hot path definitely improves overall performance in this case!
If you want to learn more about Go map internals, VictoriaMetrics has a nice article on the topic.
The original implementation is well written and easy to understand. I did however identify a few cases where the code could be written slightly more efficiently. In particular, the function calculateIDF (IDF is an acronym for Inverse Document Frequency; I recommend reading the original Medium article for details) iterates over all unique words, counting the total number of occurrences over all documents in one for-loop, and then performs another for-loop to compute a score using a logarithmic function for each word in the overall corpus.
I realized that the original calculateIDF function could be rewritten to use a single for-loop, and that the result of math.Log(1.0 + 2.0/(float64(c1+c2)+1.0)) can only have two possible outcomes, depending on whether the given word exists in exactly 1 or 2 of the documents, since this web service only deals with 2 documents at a time.
So instead of calculating the IDF for every word on every request, we can pre-calculate it for 1 and 2 occurrences and just apply it per word. Also, note that this removes one map initialization altogether, putting less strain on the memory subsystem.
// Pre-computed IDF scores for 1 and 2 occurrences
var oneOccurrenceIDF = math.Log(1.0 + 2.0/(float64(1)+1.0))
var twoOccurrencesIDF = math.Log(1.0 + 2.0/(float64(2)+1.0))

func calculateIDFOptimized(uniqueWords []string, fm1, fm2 map[string]int) []float64 {
    idf := make([]float64, len(uniqueWords))
    var occurrences = 0
    var ok = false
    for i, word := range uniqueWords {
        occurrences = 0
        if _, ok = fm1[word]; ok {
            occurrences++
        }
        if _, ok = fm2[word]; ok {
            occurrences++
        }
        if occurrences == 1 {
            idf[i] = oneOccurrenceIDF
        } else if occurrences == 2 {
            idf[i] = twoOccurrencesIDF
        }
    }
    return idf
}
The calculateIDF simplification provided another nice performance boost for the 2m30s benchmark:
checks.........................: 100.00% 1981694 out of 1981694
✓ http_req_duration..............: avg=46.85ms min=425µs med=19.53ms max=1.26s p(90)=131.14ms p(95)=189.42ms
iterations.....................: 990847 6605.453127/s
Over 6600 req/s, and an average latency of just 46.85 ms. We can also see that max has gone from about 6 seconds in the original implementation to just 1.26 seconds. The 95th percentile has gone from 985 ms to 189 ms and is now slightly better than the Rust implementation, though a max of 1.26 seconds is still far from Rust’s 314 ms.
I suspect Go’s “stop-the-world” garbage collector pauses may be the reason for the high max value.
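One way to test that suspicion (not something I did as part of this article) would be to run the service with GC tracing enabled, GODEBUG=gctrace=1 ./main, and check whether GC pauses line up with the latency spikes.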
We’ve basically improved performance 3x over the original, using alternate methods for text sanitation, JSON parsing, map and set initialization, plus some algorithmic improvements.
To see if there is any other relatively low-hanging fruit, it’s time for a new 30-second CPU profile run.
Pro-tip! go tool pprof may become confused when collecting profiles, which becomes evident when doing a weblist on a .pprof file and seeing out-of-date or even deleted source code. I’m not sure why this happens; maybe Go caches an old source-code version? To mitigate strange weblist output, supply the path to your binary when running go tool pprof to collect the CPU profile, e.g.:
go tool pprof main http://localhost:6060/debug/pprof/profile\?seconds\=30
In this case, my executable file is named main and I’m in the text-similarity/go folder.
After all these optimizations, it turns out we are kind of out of low-hanging things left to improve. Running top in pprof interactive mode only shows syscall and various runtime nodes.
(pprof) top
Showing nodes accounting for 157.43s, 76.30% of 206.33s total
Dropped 413 nodes (cum <= 1.03s)
Showing top 10 nodes out of 133
flat flat% sum% cum cum%
85.28s 41.33% 41.33% 85.35s 41.37% syscall.syscall
14.65s 7.10% 48.43% 14.65s 7.10% runtime.madvise
11s 5.33% 53.76% 11s 5.33% runtime.usleep
10.75s 5.21% 58.97% 10.75s 5.21% runtime.kevent
8.61s 4.17% 63.15% 19.55s 9.48% runtime.mapassign_faststr
8.31s 4.03% 67.17% 8.31s 4.03% runtime.pthread_kill
5.59s 2.71% 69.88% 5.59s 2.71% aeshashbody
4.78s 2.32% 72.20% 7.85s 3.80% runtime.findObject
4.76s 2.31% 74.51% 4.76s 2.31% runtime.procyield
3.70s 1.79% 76.30% 6.07s 2.94% runtime.mapaccess2_faststr
Running weblist main.go gives us more detailed information; the most time-consuming code snippets are inside the external strset module, generateFrequencyMap and calculateTF.
125 . . func generateFrequencyMap(words []string) map[string]int {
126 . 1.11s frequencyMap := make(map[string]int, len(words))
127 460ms 500ms for _, word := range words {
128 40ms 9.82s frequencyMap[word]++
129 . . }
130 . .
131 . . return frequencyMap
132 . . }
145 . . func calculateTF(uniqueWords []string, frequencyMap map[string]int, total int) []float64 {
146 . 350ms tf := make([]float64, len(uniqueWords))
147 60ms 70ms for i, word := range uniqueWords {
148 1.08s 5.93s tf[i] = float64(frequencyMap[word]) / float64(total)
149 . . }
150 . .
151 . . return tf
152 . . }
Building the frequency map involves a lot of map reads and writes, in particular frequencyMap[word]++, which in one statement both finds the correct map entry and then updates it. The weblist GUI has a pretty cool feature where statements are expandable, so one can see the Go assembly code generated for the statement. For frequencyMap[word]++, the actual assembly is not trivial:
40ms 9.82s frequencyMap[word]++
10ms 10ms 1339265: MOVQ AX, BX main.go:128
. . 1339268: LEAQ typerel.*+246232(SB), AX main.go:128
. 4.77s 133926f: CALL runtime.mapassign_faststr(SB) main.go:128
10ms 10ms 1339274: INCQ 0(AX) main.go:128
⋮
. . 133928e: MOVQ 0x120(SP), AX main.go:128
⋮
. . 1339345: MOVQ AX, BX main.go:128
10ms 20ms 1339348: LEAQ typerel.*+246232(SB), AX main.go:128
. 5s 133934f: CALL runtime.mapassign_faststr(SB) main.go:128
10ms 10ms 1339354: INCQ 0(AX) main.go:128
⋮
. . 133936e: MOVQ 0x118(SP), AX main.go:128
I won’t pretend that I know Go assembly well enough to explain why the compiler seemingly generates the same block of instructions twice, but we do see that almost all time is spent in the runtime.mapassign_faststr function.
Generally speaking, one should almost never try to outsmart the compiler by hand-writing assembly code, though it can sometimes be beneficial to write a function slightly differently, build it and then inspect the assembly to see what the compiler actually output, in particular any differences.
For example, changing frequencyMap[word]++ to frequencyMap[word] = frequencyMap[word]+1 yields exactly the same Go assembly.
However, if we try separating read and assign into two different lines:
for _, word := range words {
    val := frequencyMap[word]
    frequencyMap[word] = val + 1
}
The total time spent goes from ~9 to ~13 seconds, and the listing of the generated assembly is roughly twice as large, with a total of 4 calls to runtime.mapaccess1_faststr and runtime.mapassign_faststr.
The key take-away here is NOT to write your own assembly, but that looking at the generated assembly MAY provide clues to whether a particular snippet of source code performs better or worse. Also, remember that Go assembly is an “intermediate” assembly that the toolchain will compile to actual machine-specific assembly. See the Go blog for a good summary on Go assembly.
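If you want to inspect the generated intermediate assembly outside of pprof’s weblist view, one option (a standard compiler flag, not something used in this article) is to have the compiler print it at build time:
go build -gcflags=-S .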
In the CPU profile, we saw that the function calculateTF includes the line tf[i] = float64(frequencyMap[word]) / float64(total), taking about 5 seconds of total CPU time. Since total remains the same for a given call to calculateTF, and division is a quite expensive operation on the CPU, let’s see if we can rewrite it a bit.
Implementation:
func calculateTFMultiply(uniqueWords []string, frequencyMap map[string]int, total int) []float64 {
    tf := make([]float64, len(uniqueWords))
    fraction := 1 / float64(total) // pre-compute the fraction before iterating
    for i, word := range uniqueWords {
        tf[i] = float64(frequencyMap[word]) * fraction // multiply by fraction instead of dividing by total.
    }
    return tf
}
The rewrite results in slightly better performance for the variant using multiplication:
BenchmarkCalculateTF-16 50061 23333 ns/op
BenchmarkCalculateTFMultiply-16 52477 22972 ns/op
Note that doing * fraction instead of / total may result in a tiny rounding difference.
The 2m30s web service benchmark shows a small improvement as well:
checks.........................: 100.00% 2064722 out of 2064722
✓ http_req_duration..............: avg=43.5ms min=429µs med=18.82ms max=1.04s p(90)=119.91ms p(95)=175.68ms
iterations.....................: 1032361 6882.294302/s
The CPU profiling shows that most of the time spent in the similarityHandler code goes to map reads/writes and strset. As for the calculateSimilarity function that computes dot products and the final similarity score, one could think that this math-heavy section would be a good candidate for SIMD, possibly providing large speed-ups for certain mathematical computations. Nevertheless, the CPU profiling shows quite negligible time being spent in the calculateSimilarity function. It’s possible SIMD may speed up synthetic benchmarks in isolation, but since the Go compiler does not inline functions implemented in assembly, small/trivial assembly functions may in practice be slower due to not being inlined at the call site.
I might still do some SIMD experimentation later, but I don’t think this particular web service would benefit that much.
The idea behind using a worker pool is that each worker can pre-allocate the necessary data structures, such as map and slice, once at startup and then re-use them for each handled “job”, where a “job” consists of processing a single HTTP similarity request. This approach would almost certainly reduce the number of memory allocations, but it could also lead to goroutine congestion, since the workers would use channels to pass work and results between producing and consuming goroutines.
This is also something I might consider doing; perhaps overall performance would be worse, but lower memory/GC pressure could mean that the max duration would become much lower due to less GC?
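To make the idea a bit more concrete, here is a minimal sketch of such a worker pool. The job type, queue wiring and the computeSimilarity stub are hypothetical and not taken from the actual implementation; clear() requires Go 1.21+:
// A job carries the two normalized input texts and a channel for the result.
type job struct {
    text1, text2 []byte
    result       chan float64
}

// computeSimilarity is a stand-in for the real text-similarity computation,
// re-using the caller-provided frequency maps instead of allocating new ones.
func computeSimilarity(text1, text2 []byte, fm1, fm2 map[string]int) float64 {
    // ... build frequency maps into fm1/fm2 and compute the score ...
    return 0.0
}

func startWorkers(n int, jobs <-chan job) {
    for i := 0; i < n; i++ {
        go func() {
            // Pre-allocate once per worker; re-used for every job.
            fm1 := make(map[string]int, 1024)
            fm2 := make(map[string]int, 1024)
            for j := range jobs {
                clear(fm1) // empty the maps so they can be re-used for the next job
                clear(fm2)
                j.result <- computeSimilarity(j.text1, j.text2, fm1, fm2)
            }
        }()
    }
}
The HTTP handler would then create a job, send it on the jobs channel and block on the result channel, rather than allocating fresh maps for every request.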
In part 1 and in this article, we took the code from the original Medium article, benchmarked it on my MacBook and then improved upon the Go solution. The final benchmark results:
Go regexp : 2266 req/s, avg: 168 ms.
Go strings.Replacer : 4296 req/s, avg. 87.2 ms.
Go cheat : 4488 req/s, avg: 84.0 ms.
Go byteReplacer : 4512 req/s, avg: 83.5 ms.
Go byteReplacer + simdjson-go : 4849 req/s, avg: 77.7 ms.
Go byteReplacer + jsonparser : 4963 req/s, avg: 75.9 ms.
Rust : 5045 req/s, avg: 74 ms.
Go byteReplacer + jsonparser + custom set : 5124 req/s, avg: 73.3 ms.
Go as above + initial map capacity : 6115 req/s, avg: 59.2 ms.
Go as above + optimized calculateIDF : 6605 req/s, avg: 48.9 ms.
Go as above + optimized calculateTF : 6882 req/s, avg: 43.5 ms.
The Go performance was more than tripled and is now surpassing the Rust solution by about 36%.
I would like to stress that this doesn’t mean I think Go is “faster” than Rust (it generally isn’t), and I’m sure a sufficiently skilled Rust developer (I’m not one) could make the Rust solution perform at least as well as the improved Go solution. Rust vs Go wasn’t the point of this two-part blog; it was about the journey of profiling, pinpointing bottlenecks, making optimizations and experimenting over and over until running out of low-hanging fruit, hopefully learning something along the way.
Also, remember that “premature optimization is the root of all evil”, as famously stated by Donald Knuth (or maybe it was Tony Hoare, opinions seem to differ). I personally agree that one first and foremost needs to be concerned with the correctness, security and maintainability of the software we write. Only when identified as mission-critical does optimizing software for latency, throughput, memory consumption or whatever your metric is become a priority.
That said, for Go, there are some things one can keep in mind when writing software in general, which might help keep performance good:
- Be conscious of memory allocations, e.g. when using make(). Allocated memory will usually be garbage collected, which affects performance. For example, instead of allocating memory for a new slice inside a for-loop using make, create the slice outside the loop and re-use its memory by calling myslice := myslice[:0] on each new iteration.
- Reflection-based solutions such as encoding/json are generally seen as “slower”.
- The standard library often offers faster alternatives to convenient but expensive constructs, as strings.Replacer showed us.
- go test -bench is powerful and easy to work with - but micro-benchmarking often reports results in isolation that do not carry over to overall program performance, especially under heavy load.

That’s enough wisdom for today. I hope you enjoyed reading about my little optimization journey. A big shout-out to Dmytro Misik for the original article that got me started, and for open-sourcing his source code!
Thanks for reading! // Erik Lupander