
A Go optimization journey, part 2

// Erik Lupander

Continuing the journey of optimizing a text-similarity HTTP request handler.

1. Introduction

In the first part, I told a little tale about how I stumbled upon a Medium article with one of those click-friendly “Go vs Rust” headings, which set me on the path of trying to figure out whether I could make the Go implementation perform better.

In this part, I’ll continue by looking at different options for JSON-parsing and maps, used in the text-similarity algorithm.

2. JSON

Go’s encoding/json is convenient and works fine, but it is not terribly performant: reflection happens under the hood, and unmarshalling to map[string]interface{} is especially costly. Good alternatives exist though!

Let’s put a few alternatives to the test using go test -bench on our typical JSON input of two 10000 char lines. First, the encoding/json benchmark code:

func BenchmarkEncodingJSON(b *testing.B) {
	// Set up sample request and re-use structs.
	reqData := []byte(fmt.Sprintf(`{"text1":"%s", "text2":"%s"}`, text1, text2))
	req := SimilarityRequest{}
	resp := &SimilarityResponse{
		Similarity:     0.54,
		Interpretation: string(InterpretationResultModerately),
	}
	for i := 0; i < b.N; i++ {
		// Really simple implementation
		_ = json.Unmarshal(reqData, &req)
		_ = req.Text1
		_ = req.Text2
		_, _ = json.Marshal(resp)
	}
}

Note that we’re skipping error checking to keep the benchmark (and sample code) focused on the performance aspects. Also note the explicit assignments to _, which are sometimes needed; without them the Go compiler may decide to skip the function calls entirely!

2.1 simdjson-go

simdjson-go provides a custom JSON parser tailored to process large and complex JSON structures at speeds of potentially gigabytes per second. It utilizes SIMD (Single instruction, multiple data) through Go assembly to maximize performance.

Implementation-wise, the API is quite different from encoding/json, and to get the best performance the developer needs to pre-allocate the data structures involved so they can be reused, minimizing memory allocations.

Implemented as a Go benchmark, a semi-optimized solution parsing the two lines from the input JSON looks like this:

func BenchmarkJSONSimd(b *testing.B) {
	// Use same input for all iterations. Each line is 10000 characters.
	reqData := []byte(fmt.Sprintf(`{"text1":"%s", "text2":"%s"}`, text1, text2))
	
	// Define response struct once.
	resp := &SimilarityResponse{
		Similarity:     0.54,
		Interpretation: string(InterpretationResultModerately),
	}

	// Pre-allocate some data-structures for re-use.
	var lem1 = &simdjson.Element{}
	var lem2 = &simdjson.Element{}
	var pj = &simdjson.ParsedJson{}
	for i := 0; i < b.N; i++ {
		
		// Parse input
		pj, _ = simdjson.Parse(reqData, pj)

		// Use simdjson-go's ForEach API to iterate over fields on the root level.
		_ = pj.ForEach(func(i simdjson.Iter) error {
			
			// Extract text1 and text2 (skipping error checking to make benchmark more fair)
			_, _ = i.FindElement(lem1, "text1")
			_, _ = lem1.Iter.String()

			_, _ = i.FindElement(lem2, "text2")
			_, _ = lem2.Iter.String()

			return nil
		})
		
		// Return JSON is marshalled using encoding/json since simdjson-go only provides parsing.
		_, _ = json.Marshal(resp)
	}
}

Implementation in the HTTP handler is similar to the above benchmark, e.g.:

func similarityHandler(c *fiber.Ctx) error {
	body := c.BodyRaw() // the raw request body as a []byte
	p1, err := simdjson.Parse(body, nil)
	if err != nil {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "Bad Request",
		})
	}

	var text1, text2 string
	_ = p1.ForEach(func(i simdjson.Iter) error {
		// assign text1 and text2 here
		return nil
	})
	// ... rest omitted ...

While not as simple as the c.BodyParser(&req) provided by Fiber (with encoding/json as the decoder), it’s not too complex either.
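For comparison, the simple encoding/json-backed variant would look roughly like this (a sketch; SimilarityRequest is the struct from the benchmarks, and calculateSimilarityResponse is a hypothetical helper standing in for the rest of the handler):

func similarityHandlerStdlib(c *fiber.Ctx) error {
	// BodyParser decodes the request body with the configured decoder (encoding/json by default).
	req := SimilarityRequest{}
	if err := c.BodyParser(&req); err != nil {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "Bad Request",
		})
	}
	resp := calculateSimilarityResponse(req.Text1, req.Text2) // hypothetical helper
	return c.JSON(resp)
}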

2.2 buger/jsonparser

Another JSON-parsing option for Go is buger/jsonparser, which takes a somewhat different approach than simdjson-go: it is oriented around avoiding memory allocations and is particularly efficient at extracting one or a few values from a larger payload.
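As a quick illustration of that focus (this is not the handler code, just an example, with reqData being the same raw JSON bytes as in the benchmarks), extracting a single value directly from the raw bytes is a one-liner with the Get family of functions:

// Pull out a single string value without unmarshalling the whole document.
text1, err := jsonparser.GetString(reqData, "text1")
if err != nil {
	// key missing or not a string
}
_ = text1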

Benchmark code:

func BenchmarkJSONParser(b *testing.B) {
	// Same reqData and pre-baked response as the other JSON benchmarks.
	reqData := []byte(fmt.Sprintf(`{"text1":"%s", "text2":"%s"}`, text1, text2))
	resp := &SimilarityResponse{
		Similarity:     0.54,
		Interpretation: string(InterpretationResultModerately),
	}

	// Known keys to use when extracting from the input
	key1 := []byte(`text1`)
	key2 := []byte(`text2`)
	for i := 0; i < b.N; i++ {
		// Use the ObjectEach function with a custom callback function
		_ = jsonparser.ObjectEach(reqData, func(key []byte, value []byte, dataType jsonparser.ValueType, offset int) error {
			// If key is "text1" etc.
			if bytes.Equal(key, key1) {
				_ = value // assignment to variable outside closure scope
			}
			if bytes.Equal(key, key2) {
				_ = value
			}
			return nil
		})
		// Use stdlib to marshal response, jsonparser does not provide marshalling
		_, _ = json.Marshal(resp)
	}
}

I’ll skip most of the implementation code in the HTTP handler; it’s essentially the core of the benchmark loop above, with the []byte provided by Fiber’s c.BodyRaw():

func similarityHandler(c *fiber.Ctx) error {

	var text1, text2 []byte
	_ = jsonparser.ObjectEach(c.BodyRaw(), func(key []byte, value []byte, _ jsonparser.ValueType, _ int) error {
		if bytes.Equal(key, key1) {
			text1 = byteReplacer.Replace(value) // Use byteReplacer from first part of blog directly on []byte.
		}

2.3 JSON benchmark results

Running go test -bench=. over the three different JSON implementations resulted in the following:

BenchmarkEncodingJSON-16                     	   21247	     55931 ns/op
BenchmarkJSONSimd-16                         	   82616	     13861 ns/op
BenchmarkJSONParser-16                       	  119608	      9811 ns/op

The clear winner for this benchmark is buger/jsonparser, providing almost 6x the performance of encoding/json. simdjson-go is about 4x faster, but for this particular use case, it cannot keep up with buger/jsonparser.

2.4 Web Service benchmark

Re-running the 2.5 minute web service benchmark with buger/jsonparser results in:

  checks.........................: 100.00% 1488834 out of 1488834
✓ http_req_duration..............: avg=75.89ms min=622µs  med=6.99ms max=4.82s   p(90)=254.27ms p(95)=509.4ms 
  http_reqs......................: 744417  4962.830158/s

A few runs of simdjson-go for completeness averaged to:

    checks.........................: 100.00% 1454662 out of 1454662  
  ✓ http_req_duration..............: avg=77.67ms min=614µs  med=11.31ms max=3.68s   p(90)=238.19ms p(95)=456.54ms
    iterations.....................: 727331  4848.81476/s

To summarize with the results from part 1:

Go regexp                       : 2266 req/s, avg: 168 ms.
Go strings.Replacer             : 4296 req/s, avg. 87.2 ms.
Go cheat                        : 4488 req/s, avg: 84.0 ms.
Go byteReplacer                 : 4512 req/s, avg: 83.5 ms.
Go byteReplacer + simdjson-go   : 4849 req/s, avg: 77.7 ms.
Go byteReplacer + jsonparser    : 4963 req/s, avg: 75.9 ms.
Rust                            : 5045 req/s, avg: 74 ms.

Awesome! Switching to a faster JSON parser netted us another 400-500 req/s, now almost on par with the Rust solution!

Can we do even better? Perhaps, read on!

3. Faster maps and sets

After the JSON has been parsed and the texts have been normalized, the next step is to start counting the number of times each word per string occurs.

The Medium article’s implementation does this using the standard Go map:

	frequencyMap := make(map[string]int)
	for _, word := range words {
		frequencyMap[word]++
	}

Very straightforward. Is it “fast”? Maybe, let’s find out by comparing with an alternate map implementation.

3.1 Swissmap

In 2023, the team behind the DoltDB SQL database announced SwissMap, a Go implementation of the “Swiss Tables” hash table design, described more closely in their announcement blog post.

The SwissMap API is accessible enough; the equivalent of the frequency map code above looks like this:

func generateFrequencyMapSwiss(words []string) *swiss.Map[string, int] {
	frequencyMap := swiss.NewMap[string, int](uint32(len(words)))
	for _, word := range words {
		num, ok := frequencyMap.Get(word)
		if !ok {
			frequencyMap.Put(word, 1)
		} else {
			frequencyMap.Put(word, num+1)
		}
	}
	return frequencyMap
}

Since it’s not the built-in map, we cannot index into it and increment in a single statement. It technically works to always execute frequencyMap.Put(word, num+1) instead of the if-else statement (num is zero when the key is missing), but for some reason that consistently performs slightly worse.

Results:

BenchmarkFrequencyStandardMap-16                	    8222	    130927 ns/op
BenchmarkFrequencyStandardMapWithCapacity-16    	   13032	     91923 ns/op
BenchmarkFrequencySwissMap-16                   	    8534	    123810 ns/op

Overall, performance for this particular write-heavy benchmark is not that good. Note that the SwissMap implementation sets len(words) as initial capacity just like the faster standard map one. Without initial capacity, SwissMap was 20-25% slower.

Interestingly enough, in Go 1.24, the internal map implementation has been rewritten to be based on Swiss tables. I’ll make sure to re-benchmark this once 1.24 is out.

3.2 Faster Set

A part of the text-similarity algorithm is to obtain a set of all unique words from both input texts. The default implementation uses a map[string]struct{} for this, the “de-facto” way to create a set (since keys are unique) in Go without external dependencies:

    words1 := strings.Split(string(text1), " ")
    words2 := strings.Split(string(text2), " ")

	fm1 := generateFrequencyMap(words1)
	fm2 := generateFrequencyMap(words2)

	// Iterate over keys of the frequency maps and add to 
	// new map to create the set
	uw := make(map[string]struct{})
	for word := range fm1 {
		uw[word] = struct{}{}
	}
	for word := range fm2 {
		uw[word] = struct{}{}
	}

	// use maps.Keys from the maps package to get a slice containing 
	// unique words.
	uniqueWords := maps.Keys(uw)

We can try to use github.com/scylladb/go-set/strset and also operate directly on the result of those strings.Split calls:

words1 := strings.Split(string(text1), " ")
words2 := strings.Split(string(text2), " ")

uw := strset.NewWithSize(len(words1) + len(words2))
uw.Add(append(words1, words2...)...)
uniqueWords := uw.List()

Those for-loops from the original solution can be removed.

To benchmark this, a full 2m30s test was executed:

     checks.........................: 100.00% 1537420 out of 1537420
   ✓ http_req_duration..............: avg=73.31ms  min=605µs  med=9.37ms  max=4.71s    p(90)=234.21ms p(95)=453.78ms
     iterations.....................: 768710  5124.658582/s

5124.65 req/s with 73.3 ms average latency! Now, we’re faster than the Rust solution! Do note that our switch to strset also entailed a possibly more efficient set creation algorithm, so in some way we’re not strictly comparing language/library efficiency anymore.

3.3 Setting initial map capacity

When declaring a map in Go using make, one can optionally specify an initial capacity for the map. In the following example, the number of words is used as the initial capacity:

frequencyMap := make(map[string]int, len(words))

Specifying a sufficiently large initial size for a map up front will consume more memory, but should also mean fewer incremental memory allocations and less internal bucket resizing, re-hashing etc. as new items are added. It may also result in larger contiguous blocks of memory being allocated, making the garbage collector’s job easier later on.

A quick microbenchmark of the generateFrequencyMap function with/without setting an initial capacity based on the total number of words produces a slightly surprising result:

BenchmarkFrequencyStandardMap-16                	    8469	    130563 ns/op
BenchmarkFrequencyStandardMapWithCapacity-16    	   13141	     91397 ns/op

Time per operation drops from ~131µs to ~91µs, roughly a 43% increase in throughput! This may come at the cost of using slightly more memory, which overall should be pretty marginal though.

I updated all map initializations (there are three of them) in the text-similarity handler to specify an initial capacity in this manner. Note that this is done on top of the strset and JSON improvements we’ve already covered.

Running the 2m30s benchmark again:

     checks.........................: 100.00% 1834446 out of 1834446
   ✓ http_req_duration..............: avg=59.16ms  min=475µs    med=13.72ms max=1.94s    p(90)=188.33ms p(95)=284.84ms
     iterations.....................: 917223  6114.715968/s

6114.7 req/s! Another 20% increase. The size of the increase surprised me, but I guess that when doing 5000+ req/s, reducing the number of memory allocations both increases raw performance and puts less strain on the GC.

3.3.1 Tracking allocations

Time for another detour. This time we’ll use go tool pprof to capture memory allocation profiles before and after the changes to map initialization. Use go tool pprof with -alloc_space and -alloc_objects to look at the total amount of memory and the total number of objects allocated; -inuse_space and -inuse_objects instead show the amount of heap memory and number of allocations currently live.
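Assuming the service exposes net/http/pprof on port 6060 as in part 1, and with the binary named main as before, collecting and inspecting the allocation data looks something like this:

go tool pprof -alloc_objects main http://localhost:6060/debug/pprof/heap
go tool pprof -alloc_space main http://localhost:6060/debug/pprof/heap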

I’ve collected alloc_space and alloc_objects data after the first 30 seconds of the test run, resulting in:

Scenario                     Requests performed   Num allocations   Memory allocated   Allocations per request
No initial map capacity      152 227              17 722 736        75.83 GB           116.4
With initial map capacity    181 626              4 398 554         77.68 GB           24.22

As seen, specifying an initial capacity for Go’s map in this particular scenario, where we expect something like 500-1000 entries (unique words) per map, means that the Go runtime doesn’t have to grow the map dynamically as more and more entries are added. Cutting the number of allocations per request by roughly four fifths in a hot path definitely improves overall performance in this case!

If you want to learn more about Go map internals, VictoriaMetrics has a nice article on the topic.

3.4 A code tweak

The original implementation is well written and easy to understand. I did, however, identify a few cases where the code could be written slightly more efficiently. In particular, the function calculateIDF (IDF stands for Inverse Document Frequency; I recommend reading the original Medium article for details) first iterates over all unique words, counting in how many of the documents each word occurs, and then runs a second for-loop to compute a logarithm-based score for each word in the overall corpus.

I realized that the original calculateIDF function could be re-written to use a single for-loop, and that the result of math.Log(1.0 + 2.0/(float64(c1+c2)+1.0)) can only have two possible outcomes, depending on whether the given word occurs in exactly one or in both documents, since this web service only deals with two documents at a time.

So instead of calculating the IDF for every word in every request, we can pre-calculate it for 1 or 2 occurrences and just apply it per-word. Also, note that this removes one map initialization altogether, putting less strain on the memory subsystem.

// Pre-computed IDF scores for 1 and 2 occurrences
var oneOccurrenceIDF = math.Log(1.0 + 2.0/(float64(1)+1.0)) 
var twoOccurrencesIDF = math.Log(1.0 + 2.0/(float64(2)+1.0))

func calculateIDFOptimized(uniqueWords []string, fm1, fm2 map[string]int) []float64 {
	idf := make([]float64, len(uniqueWords))

	var occurrences = 0
	var ok = false
	for i, word := range uniqueWords {
		occurrences = 0
		if _, ok = fm1[word]; ok {
			occurrences++
		}
		if _, ok = fm2[word]; ok {
			occurrences++
		}
		if occurrences == 1 {
			idf[i] = oneOccurrenceIDF
		} else if occurrences == 2 {
			idf[i] = twoOccurrencesIDF
		}
	}
	return idf
}

3.4.1 Result

The calculateIDF simplification provided another nice performance boost for the 2m30s benchmark:

     checks.........................: 100.00% 1981694 out of 1981694
   ✓ http_req_duration..............: avg=46.85ms  min=425µs    med=19.53ms max=1.26s    p(90)=131.14ms p(95)=189.42ms
     iterations.....................: 990847  6605.453127/s

Over 6600 req/s, with an average latency of just 46.85 ms. We can also see that max has gone from about 6 seconds in the original implementation down to just 1.26 seconds, and the 95th percentile from 985 ms to 189 ms. The 95th percentile is now slightly better than the Rust implementation’s, though the max of 1.26 seconds is still far from Rust’s 314 ms.

I suspect garbage collection pauses (Go’s GC is mostly concurrent, but it does briefly stop the world) may be the reason for the high max value.
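One way to verify that suspicion (not something I have done here) would be to run the service with GC tracing enabled and check whether the reported pauses line up with the latency spikes:

GODEBUG=gctrace=1 ./main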

4. What’s left to optimize?

We’ve basically tripled the performance of the original using alternate methods for text sanitization, JSON parsing, map and set initialization, plus some algorithmic improvements.

To see if there is any other relatively low-hanging fruit left, it’s time for a new 30-second CPU profile run.

4.1 Profile collection troubleshooting

Pro-tip! go tool pprof may become confused when collecting profiles, which becomes evident when running weblist on a .pprof file and seeing out-of-date or even deleted source code. I’m not sure why this happens; maybe pprof caches an old version of the source? To mitigate strange weblist output, supply the path to your binary when collecting the CPU profile with go tool pprof, e.g.:

go tool pprof main http://localhost:6060/debug/pprof/profile\?seconds\=30

In this case, my executable file is named main and I’m in the text-similarity/go folder.

4.2 CPU profile results

After all these optimizations, it turns out we are more or less out of low-hanging fruit. Running top in pprof’s interactive mode mostly shows syscall and various runtime nodes.

(pprof) top
Showing nodes accounting for 157.43s, 76.30% of 206.33s total
Dropped 413 nodes (cum <= 1.03s)
Showing top 10 nodes out of 133
      flat  flat%   sum%        cum   cum%
    85.28s 41.33% 41.33%     85.35s 41.37%  syscall.syscall
    14.65s  7.10% 48.43%     14.65s  7.10%  runtime.madvise
       11s  5.33% 53.76%        11s  5.33%  runtime.usleep
    10.75s  5.21% 58.97%     10.75s  5.21%  runtime.kevent
     8.61s  4.17% 63.15%     19.55s  9.48%  runtime.mapassign_faststr
     8.31s  4.03% 67.17%      8.31s  4.03%  runtime.pthread_kill
     5.59s  2.71% 69.88%      5.59s  2.71%  aeshashbody
     4.78s  2.32% 72.20%      7.85s  3.80%  runtime.findObject
     4.76s  2.31% 74.51%      4.76s  2.31%  runtime.procyield
     3.70s  1.79% 76.30%      6.07s  2.94%  runtime.mapaccess2_faststr

Running weblist main.go gives us more detailed information; the most time-consuming code snippets are inside the external strset module, generateFrequencyMap and calculateTF.

    125            .          .           func generateFrequencyMap(words []string) map[string]int { 
    126            .      1.11s           	frequencyMap := make(map[string]int, len(words)) 
    127        460ms      500ms           	for _, word := range words { 
    128         40ms      9.82s           		frequencyMap[word]++ 
    129            .          .           	} 
    130            .          .            
    131            .          .           	return frequencyMap 
    132            .          .           } 
    145            .          .           func calculateTF(uniqueWords []string, frequencyMap map[string]int, total int) []float64 { 
    146            .      350ms           	tf := make([]float64, len(uniqueWords)) 
    147         60ms       70ms           	for i, word := range uniqueWords { 
    148        1.08s      5.93s           		tf[i] = float64(frequencyMap[word]) / float64(total) 
    149            .          .           	} 
    150            .          .            
    151            .          .           	return tf 
    152            .          .           } 

Building the frequency map involves a lot of map reads and writes, in particular frequencyMap[word]++ which in one statement both finds the correct map entry and then updates it. The weblist GUI has a pretty cool feature where statements are expandable, so one can see the Go assembly code generated for the statement. For frequencyMap[word]++, the actual assembly is not trivial:

40ms      9.82s           		frequencyMap[word]++ 
10ms       10ms  1339265:                     MOVQ AX, BX                                                  main.go:128
   .          .  1339268:                     LEAQ typerel.*+246232(SB), AX                                main.go:128
   .      4.77s  133926f:                     CALL runtime.mapassign_faststr(SB)                           main.go:128
10ms       10ms  1339274:                     INCQ 0(AX)                                                   main.go:128
                     ⋮
   .          .  133928e:                     MOVQ 0x120(SP), AX                                           main.go:128
                     ⋮
   .          .  1339345:                     MOVQ AX, BX                                                  main.go:128
10ms       20ms  1339348:                     LEAQ typerel.*+246232(SB), AX                                main.go:128
   .         5s  133934f:                     CALL runtime.mapassign_faststr(SB)                           main.go:128
10ms       10ms  1339354:                     INCQ 0(AX)                                                   main.go:128
                     ⋮
   .          .  133936e:                     MOVQ 0x118(SP), AX                                           main.go:128

I won’t pretend that I know Go assembly well enough to explain why the compiler seemingly generates the same block of instructions twice, but we do see that almost all time is spent in the runtime.mapassign_faststr function.

Generally speaking, one should almost never try to outsmart the compiler by hand-writing assembly code. It can, however, sometimes be beneficial to write a function slightly differently, build, and then inspect what the compiler output, in particular any differences in the generated assembly.

For example, changing frequencyMap[word]++ to frequencyMap[word] = frequencyMap[word]+1 yields exactly the same Go assembly.

However, if we try separating read and assign into two different lines:

	for _, word := range words {
		val := frequencyMap[word]
		frequencyMap[word] = val + 1
	}

The total time spent goes from ~9 to ~13 seconds, and a listing of the generated assembly is roughly twice as large, with a total of 4 calls to runtime.mapaccess1_faststr and runtime.mapassign_faststr.

The key take-away here is NOT to write your own assembly, but that looking at the generated assembly MAY provide clues as to whether a particular snippet of source code performs better or worse. Also, remember that Go assembly is an “intermediate” assembly that the toolchain translates into actual machine-specific instructions. See the Go blog for a good summary of Go assembly.
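If you want to look at the generated assembly outside of pprof’s weblist view, the toolchain can produce it for you; something along these lines (the symbol name is just an example):

go build -gcflags=-S ./... 2> assembly.txt
go tool objdump -s generateFrequencyMap main

The first command prints the compiler’s intermediate Go assembly during the build, while the second disassembles the machine code for matching symbols in the already-built main binary.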

4.3 One more improvement

In the CPU profile, we saw that the function calculateTF includes the line tf[i] = float64(frequencyMap[word]) / float64(total), taking about 5 seconds of total CPU time. Since (a) total remains the same for a given call to calculateTF and (b) division is a fairly expensive operation on the CPU, let’s see if we can rewrite it a bit.

Implementation:

func calculateTFMultiply(uniqueWords []string, frequencyMap map[string]int, total int) []float64 {
	tf := make([]float64, len(uniqueWords))
	fraction := 1 / float64(total) // pre-compute the fraction before iterating
	for i, word := range uniqueWords {
		tf[i] = float64(frequencyMap[word]) * fraction // multiply by fraction instead of dividing by total.
	}
	return tf
}

4.3.1 Results

The rewrite results in slightly better performance for the variant using multiplication:

BenchmarkCalculateTF-16            	   50061	     23333 ns/op
BenchmarkCalculateTFMultiply-16    	   52477	     22972 ns/op

Note that doing * fraction instead of / total may result in a tiny rounding difference.

The 2m30s web service benchmark shows a small improvement as well:

     checks.........................: 100.00% 2064722 out of 2064722    
   ✓ http_req_duration..............: avg=43.5ms   min=429µs    med=18.82ms max=1.04s    p(90)=119.91ms p(95)=175.68ms
     iterations.....................: 1032361 6882.294302/s

4.4 Other improvements?

4.4.1 Using SIMD?

The CPU profiling shows that most of the time spent in the similarityHandler code goes to map reads/writes and strset. As for the calculateSimilarity function that computes dot products and the final similarity score, one could think that this math-heavy section would be a good candidate for SIMD, which can provide large speed-ups for certain mathematical computations. However, the CPU profile shows negligible time being spent in calculateSimilarity. SIMD might speed up synthetic benchmarks in isolation, but since the Go compiler does not inline assembly, small/trivial functions implemented in assembly may in practice be slower because they cannot be inlined at the call site.

I might still do some SIMD experimentation later, but I don’t think this particular web service would benefit that much.

4.4.2 Using worker pools?

The idea behind using a worker pool is that each worker can pre-allocate the necessary data structures (maps, slices) once at startup and then re-use them for each handled “job”, where a “job” consists of processing a single HTTP similarity request. This approach would almost certainly reduce the number of memory allocations, but it could also lead to goroutine congestion on the channels used to pass work and results between the producing and consuming goroutines.

This is also something I might try. Perhaps overall throughput would be worse, but lower memory/GC pressure could mean that the max duration drops significantly.
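A minimal sketch of the idea, with a hypothetical job type and a re-usable frequency map per worker (this is not code from the actual service):

// job is a hypothetical unit of work: one similarity request to process.
type job struct {
	words1, words2 []string
	result         chan float64
}

// startWorkers launches n workers that each own their pre-allocated data structures.
func startWorkers(n int, jobs <-chan job) {
	for i := 0; i < n; i++ {
		go func() {
			// Allocated once per worker and re-used for every job.
			frequencyMap := make(map[string]int, 1024)
			for j := range jobs {
				// clear (Go 1.21+) removes all entries but keeps the allocated storage.
				clear(frequencyMap)
				for _, w := range j.words1 {
					frequencyMap[w]++
				}
				// ... compute TF/IDF and the similarity score using j.words2 as well ...
				j.result <- 0.0 // placeholder result
			}
		}()
	}
}

The HTTP handler would then put a job on the channel and block on the result channel, trading some channel overhead for near-zero allocations per request.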

5. Closing words

In part 1 and in this article, we took the code from the original Medium article, benchmarked it on my MacBook, and then improved upon the Go solution. The final benchmark results:

Go regexp                                   : 2266 req/s, avg: 168 ms.
Go strings.Replacer                         : 4296 req/s, avg. 87.2 ms.
Go cheat                                    : 4488 req/s, avg: 84.0 ms.
Go byteReplacer                             : 4512 req/s, avg: 83.5 ms.
Go byteReplacer + simdjson-go               : 4849 req/s, avg: 77.7 ms.
Go byteReplacer + jsonparser                : 4963 req/s, avg: 75.9 ms.
Rust                                        : 5045 req/s, avg: 74 ms.
Go byteReplacer + jsonparser + custom set   : 5124 req/s, avg: 73.3 ms.
Go as above + initial map capacity          : 6115 req/s, avg: 59.2 ms.
Go as above + optimized calculateIDF        : 6605 req/s, avg: 48.9 ms.
Go as above + optimized calculateTF         : 6882 req/s, avg: 43.5 ms.

The Go performance was more than tripled and is now surpassing the Rust solution by about 36%.

I would like to stress that this doesn’t mean I think Go is “faster” than Rust (it generally isn’t), and I’m sure a sufficiently skilled Rust developer (I’m not one) could make the Rust solution perform at least as well as the improved Go solution. Rust vs Go wasn’t the point of this two-part blog; it was about the journey of profiling, pinpointing bottlenecks, making optimizations, and experimenting over and over until running out of low-hanging fruit, hopefully learning something along the way.

Also, remember that “premature optimization is the root of all evil”, as famously stated by Donald Knuth (or maybe it was Tony Hoare; opinions seem to differ). I personally agree that one first and foremost needs to be concerned with the correctness, security and maintainability of the software we write. Only when it has been identified as mission-critical does optimizing software for latency, throughput, memory consumption, or whatever your metric is, become a priority.

That said, for Go, there are some things one can keep in mind when writing software in general, which might help keep performance good:

  • Keep track of your memory allocations and use of make(). Allocated memory will eventually be garbage collected, which affects performance. For example, instead of allocating a new slice inside a for-loop using make, create the slice outside the loop and re-use its memory with myslice = myslice[:0] on each new iteration (see the sketch after this list).
  • It’s often better to allocate once up-front than incrementally, provided you are not memory-constrained and have a rough estimate of what to set as the capacity.
  • Packages making heavy use of reflection such as encoding/json are generally seen as “slower”.
  • Go’s regexp package is safe and easy to work with, albeit somewhat slow. Using regular expressions isn’t always the right answer when finding or replacing stuff in text, as the eminent strings.Replacer showed us.
  • Go’s profiling tools are superb. Use them, not only for improving performance - they’re also invaluable for tracking down memory and goroutine leaks.
  • Algorithmic efficiency sometimes comes at the cost of hard-to-read code. Even if you optimize to use a single for-loop instead of two, a cool lookup table or some clever bit-masking, will the next developer understand what the code does and how it works?
  • Micro-benchmarking is one tool among others, and while go test -bench is powerful and easy to work with, micro-benchmarks measure results in isolation that often do not carry over to overall program performance, especially under heavy load.
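A tiny sketch of the slice re-use pattern from the first bullet (batches is just a placeholder input):

// Allocate the slice once, outside the loop.
buf := make([]int, 0, 1024)
for _, batch := range batches {
	// Reset the length but keep the backing array, avoiding a new allocation per iteration.
	buf = buf[:0]
	for _, v := range batch {
		if v > 0 {
			buf = append(buf, v)
		}
	}
	// ... use buf ...
}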

That’s enough wisdom for today. I hope you enjoyed reading about my little optimization journey. A big shout-out to Dmytro Misik for the original article that got me started, and for open-sourcing his source code!

Thanks for reading! // Erik Lupander
