Wednesday, November 16, 2011

GO App Engine datastore operations design


GO App Engine datastore.Load/Save uses goroutines and channels to iterate over datastore entity properties, causing overhead.

Background
With GAE 1.6.0, support for indexed properties, hooks, etc. was introduced with a nice, elegant design built on a PropertyLoadSaver interface that uses channels (as an iterator).

I noticed that, after updating my code to use PropertyList, some of my application requests started taking about double the time they took before. While I was still using datastore.Map, my requests had taken roughly the same amount of time as before the upgrade.

On digging further, I found the following in the implementation:
    appengine/datastore/load.go
      func loadEntity(dst interface{}, src *pb.EntityProto) ...
          c := make(chan Property, 32)
          errc := make(chan os.Error, 1)
          go protoToProperties(c, errc, src)
    appengine/datastore/save.go
      func saveEntity(defaultAppID string, key *Key, src interface{}) ...
          c := make(chan Property, 32)
          donec := make(chan struct{})
          go func() { ... }

That is, for each entity (analogous to each row in a table), we create and use:
     1 goroutine and 2 channels.

The deprecated datastore.Map retrieval bypasses this Channel/Goroutine dance, which is why my response time did not change until I switched to datastore.PropertyList.

Concerns:
- For each Get/Save request that loads/saves n entities, n goroutines are started and 2*n channels are created.
  This is analogous to starting 1 goroutine and creating 2 channels for each row in a SQL result set.
    For example, one API RPC call that returns 100 entities will cause 100 goroutines to be spawned and 200 channels to be created just for that one API call.
- However, the conversions are still serialized
  (i.e. one load/save conversion is performed before the next).
  So we don't gain any parallelism but pay a large cost.
- This is a significant overhead for a simple iteration where concurrency is not a goal.
- There's still significant allocation (which is what we were trying to avoid).
  - each "row" or entity causes a new goroutine and channel to be created.
  - each channel (for iterating properties) has a buffer size of 32.
  There ends up potentially being more allocation than if we just returned []Property.
- Implementation detail bleeding into the API, making it harder to optimize later.
  By tying the API to channels, we cannot optimize our way out of this later on.
- Also, we seem to be using the "channels as iterator" anti-pattern, which is frowned upon, especially by the GO team. See:
  Russ Cox: http://code.google.com/p/go/source/detail?r=ed32ab5693
  Russ Cox: https://groups.google.com/d/msg/golang-nuts/jb4YfdFwmmM/P-55mxV0a8oJ
  David Symonds: https://groups.google.com/d/msg/golang-nuts/bcAWzaSYC0Y/nk-b5fUR_loJ
  http://stackoverflow.com/questions/5033605/common-programming-mistakes-for-go-developers-to-avoid/5034195#5034195

Can we do without the goroutines/channels, especially in the API? This way, we can use different implementations.

Alternative solution using iterators
An alternative, equally elegant solution would just use iterators:
- type PropertyIterator interface:
  Next() (Property, os.Error) //To signal end, os.EOF/datastore.Done is returned
- type PropertyLoadSaver interface:
  PropertyLoad(p PropertyIterator) os.Error
  PropertySave() (PropertyIterator, os.Error)

For implementations of PropertyIterator:
- type PropertyIteratorFunc func() (Property, os.Error)
  Next() (Property, os.Error) //calls itself
- type PropertyList []Property:
  Iterator() PropertyIterator //actually a PropertyIteratorFunc
- type ChanPropertyIterator chan Property: //implements PropertyIterator itself
  Next() (Property, os.Error) //does a <- on channel, and returns os.EOF/datastore.Done appropriately
  //(optional: for people who prefer goroutine/channel alloc over slice alloc;
  //this would be like the current solution today, but not exposed in the API)
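
The outline above can be fleshed out roughly as follows. This is a sketch under assumptions, not a drop-in patch for the SDK: Property is a simplified stand-in, Done is a locally defined sentinel standing in for os.EOF/datastore.Done, the drain helper is hypothetical, and the code uses the modern error type so that it compiles with a current Go toolchain.

```go
package main

import (
	"errors"
	"fmt"
)

// Property is a simplified stand-in for datastore.Property (assumption).
type Property struct {
	Name  string
	Value interface{}
}

// Done signals the end of iteration (stand-in for os.EOF/datastore.Done).
var Done = errors.New("no more properties")

// PropertyIterator is the only thing the API would expose.
type PropertyIterator interface {
	Next() (Property, error)
}

// PropertyIteratorFunc adapts a plain function to PropertyIterator.
type PropertyIteratorFunc func() (Property, error)

// Next calls the function itself.
func (f PropertyIteratorFunc) Next() (Property, error) { return f() }

// PropertyList iterates over a slice: no goroutine, no channel.
type PropertyList []Property

// Iterator returns a PropertyIteratorFunc that walks the slice.
func (l PropertyList) Iterator() PropertyIterator {
	i := 0
	return PropertyIteratorFunc(func() (Property, error) {
		if i >= len(l) {
			return Property{}, Done
		}
		p := l[i]
		i++
		return p, nil
	})
}

// ChanPropertyIterator keeps the channel-based approach available,
// but as an implementation choice hidden behind the interface.
type ChanPropertyIterator chan Property

// Next receives from the channel and reports Done once it is closed.
func (c ChanPropertyIterator) Next() (Property, error) {
	p, ok := <-c
	if !ok {
		return Property{}, Done
	}
	return p, nil
}

// drain is a hypothetical consumer that works with any implementation.
func drain(it PropertyIterator) ([]Property, error) {
	var out []Property
	for {
		p, err := it.Next()
		if err == Done {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, p)
	}
}

func main() {
	list := PropertyList{{"Title", "hello"}, {"Count", 3}}
	fromList, _ := drain(list.Iterator())

	c := make(chan Property, 2)
	c <- Property{"Title", "hello"}
	c <- Property{"Count", 3}
	close(c)
	fromChan, _ := drain(ChanPropertyIterator(c))

	fmt.Println(len(fromList), len(fromChan))
}
```

The consumer never sees which implementation it was handed, which is the point: the channel version can be kept, dropped or replaced without touching the API.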

Since GO Runtime is still experimental, making a contained API change should be ok.

But RPC dominates the overhead per request. Why focus on goroutines/channels use?

Definitely, the RPC time will dominate the overhead from a goroutine and 2 channels. However, we're talking about potentially hundreds or thousands of goroutines per request (equal to the number of "rows" returned by, or sent to, the API call). E.g. for a GET that returns 100 entities, that's 100 goroutines and 200 channels created to service that 1 API call. And the goroutines/channels we're creating have nothing to do with concurrency: we're just using them as iterators.

Also, within our application code, we still have to optimize our code (and especially our exported APIs), even though we know that RPC overhead will overshadow it.

Main Concern: Implementation bleeds into the API
My main concern is that this bleeds into the API. By using Iterators, you can use channels and a goroutine in the implementation, and change that afterwards, without application users having to know about it.

The alternative implementation proposed above shows how this can be done using iterators. It's trivial to implement (in GO code) and you can gain what you want, without restricting your implementation:
- Objects don't need to exist any longer than it takes to populate the fields
- Intermediate state is supported
- No need to pass around []Property for a large entity

However, the API is not tied to an implementation, so you can implement it with goroutines/channels, or with a list. User code that passes a PropertyLoadSaver can use whatever is most applicable/optimized for its use case. For example, in my user code, I can pass PropertyList into each call and will not incur the overhead of goroutines/channels.

Have others solved similar problems using goroutines/channels? Where?
It seems that the use of goroutines/channels as iterators is not done in other *similar* places:
- See datastore.Query whose iteration doesn't expose goroutines/channels
- See exp/sql/driver whose iteration doesn't expose goroutines/channels (just a Next([]interface{}) method)

What is the performance overhead (load on CPU, RAM) with this? Does it scale?

Initially, when I did this, I ran some rudimentary tests to find the maximum number of goroutines I could create on my machine and how much resources it took.

The summary of the results is that, on a 2.0GHz core, I could start a maximum of 5e5 (500,000) goroutines which basically did nothing (beyond that, I got errors). The RAM usage was 2.0GB.

An app engine instance is 600MHz single core with a 128MB limit. That's about 1/3 the CPU clock speed and 1/16 the memory. (Even my nexus one has way more resources than that.)

In summary, 2.0GHz, 2GB RAM produced 500,000 goroutines max. I wonder how many a 600MHz, 128MB app engine instance would accommodate.

I'd suspect a few thousand goroutines on such a tiny "computer" (600MHz, 128MB) would tax the system. However, it's really easy to get into such a situation with the current design. If most of the time is spent on RPC (I/O) and CPU load is low, GO can easily support a large number of concurrent requests. 50 concurrent requests each retrieving 200 entities will mean 10,000 goroutines (plus 20,000 channels) at the same time, just serving API requests, and imposed by the SDK runtime (i.e. not by application code, which we can control or tune). In this scenario, the runtime is imposing an overhead which does not seem necessary.

If we expect that most people will pass a PropertyList to calls to GetXXX or PutXXX, then the goroutine/channel is completely redundant.

Also, remember that each goroutine allocates an initial stack of 4K, so each goroutine has a cost in memory allocation, which becomes non-trivial under load.

The rudimentary go code used to run this test is available at:
- Shared online: You can download a go file, to compile and run on your computer, here.
- On Golang Play: You need to run this on your local computer

Monday, November 14, 2011

Testing Go App Engine Applications natively



With changes to allow concurrent requests in Go App Engine, testing support follows naturally and natively.

Following the support for concurrent requests described previously, testing support is as easy as ensuring the following is called one time before your test is run. I have tested it and it works flawlessly.

(I call this once using sync.Once.Do(...) within init method)
//don't conflict with http socket that devserver's go_app uses (switch ugorji with your username)
 os.Remove("unix:/tmp/dev_appserver_ugorji_8080_socket_http_2")
 flag.Set("addr_http", "unix:/tmp/dev_appserver_ugorji_8080_socket_http_2")
 flag.Set("addr_api", "unix:/tmp/dev_appserver_ugorji_8080_socket_api")
 go appengine_internal.Main()
 time.Sleep(1e9)

Once this is done, and you have a Python Dev Server running, then all your normal calls work.

To create a context and use it:
 req, _ := http.NewRequest("GET", "/", nil)
 req.Header.Set("X-AppEngine-Inbound-AppId", "dev~app")
 ctx := appengine.NewContext(req)
We need to do the dance of setting flags and stuff because appengine_internal.Main is what is called by your app's main() method. It uses parameters passed on the command line, and it internally starts a server socket for http (which is why we have to run it in a goroutine). We have to use this function because it is the one that is exported.

We really only need the initAPI function (which would be a 1-line call to make testing seamless).
    appengine_internal.InitAPI("unix", "/tmp/dev_appserver_ugorji_8080_socket_api")

To make this easier, it would be nice if the appengine_internal.initAPI function were exported as:
    appengine_internal.InitAPI(netw, addr string)

Enable Concurrent Requests in Go App Engine SDK


This details how to enable concurrent requests in the Go App Engine SDK.

UPDATES:
Nov 15:
Added that python sdk is currently not threadsafe. This shows how to make GO side threadsafe, and still test concurrency in your application (even though only 1 API request is processed at a time).


Background

The GO App Engine SDK has a pretty elegant design which I wish the Java App Engine SDK had. The full SDK, with support for the app engine services, is implemented once (in Python), and a new language runtime (like Go) can be introduced quickly, leveraging that investment (as opposed to duplicating it). Brilliant.

It also simulates what happens in production to an extent, where there's an App Engine instance that runs your application, but uses RPC (remote procedure calls) to access services provided by App Engine.

In this setup, the Python SDK which supports all the App Engine Runtime services acts as two things:
  • A front end.
    Non-app requests are handled by the Python SDK front-end, and app requests get proxied over to the Go Application Instance.
  • RPC Server.
    All App Engine services reside on the Python SDK. The Go Application uses RPC to access those services.

Getting everything to work is pretty neat.
  • The Python Dev Server creates a Go Instance as a child process
  • The Go Instance creates a Server Socket which the Python SDK uses to proxy http requests to it
  • The Python Dev Server creates a Server Socket which the Go Instance uses to send API requests to it
  • Only one request happens at a time, as detailed below.

When a request comes through the Python SDK for the Go App, the following happens:
  • The Python SDK creates a socket to Go Instance and sends the http request to it
  • The Go Instance handles the request.
    For any API calls, it makes a socket connection to the Python SDK API, and sends the request and receives the API response back.
  • The Go Instance sends the response back to the Python SDK
  • The Python SDK forwards the response to the client
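
The request flow above can be mimicked in a few lines of Go, which I find helps when reasoning about it. This is only an illustration of the topology: three in-process HTTP servers from net/http/httptest stand in for the Python SDK front end, the Go instance and the Python SDK API server, whereas the real SDK uses its own sockets and RPC format. All names here are made up.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// roundTrip sends one request through front end -> app -> API and back.
func roundTrip() (string, error) {
	// Stand-in for the Python SDK's API (RPC) server.
	api := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "api-result")
	}))
	defer api.Close()

	// Stand-in for the Go instance: handles the request and calls the API.
	app := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		resp, err := http.Get(api.URL)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		fmt.Fprintf(w, "app saw: %s", body)
	}))
	defer app.Close()

	// Stand-in for the Python SDK front end: proxies requests to the app.
	target, err := url.Parse(app.URL)
	if err != nil {
		return "", err
	}
	front := httptest.NewServer(httputil.NewSingleHostReverseProxy(target))
	defer front.Close()

	// The client only ever talks to the front end.
	resp, err := http.Get(front.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	out, err := roundTrip()
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```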

Currently, the design has some limitations that allow only one request to be handled at a time:
  • This implementation uses CGI
  • Handling socket communication only occurs within the context of a request i.e. the sockets are not listened to unless a single request is in process

Objective:

The objective here is to support concurrent requests. This can be done by making the Python SDK a full proxy, with standalone support as an API RPC Server (outside the context of a request).

This will allow more involved testing scenarios:
  • Have tests running directly on the GO Server within the regular context of a request (including common work done before and after a handler is called)
  • Have tests using the Python SDK server directly for API calls

To summarize, these are the things we hope to achieve:
  • Let the Python SDK be a true and full proxy to the Go Instance, allowing concurrent requests to be proxied and handled.
  • Honor allow_skipped_files flag (to allow skipped files e.g. test files, etc)
    Allowing skipped files in development is very necessary for tests, pre-building, etc.
  • Support testing framework, which can access the Python SDK as an API server without going through a request.
    This way, testing can involve just starting a Python Dev Server (even if no http request happens).
To achieve this, the following changes are necessary:
  • Use the Python 2.7 SDK which allows for concurrent requests
  • Use WSGI (as opposed to CGI) which allows for concurrent requests 
  • Have API socket listening and handling be always-on (not only when a http request is in process).
    Use a thread to listen to and respond to all API socket communication (listening and handling)
  • Have a setup/init function that is run when the Python SDK is started for a GO Runtime, as opposed to a one-time run when a http request happens
This support is achieved by minor edits to 2 files, a more involved edit to 1 file, and a one-line change in your app.yaml:
  1. google/appengine/ext/go/__init__.py (more involved edit)
  2. google/appengine/tools/dev_appserver_main.py (minor edit)
  3. google/appengine/tools/dev_appserver.py (minor edit)
  4. app.yaml (to reference the WSGI app instead of _go_app)
I've shared a folder containing all of the changed files online here. Feel free to download the changed files and follow through. For all changes, look for the name "ugorji" in a comment in the file before each change.

But Python SDK does *NOT* support concurrent requests
Yes. Even with these changes, requests to the Python SDK are still inherently single threaded:
  • dev_appserver...serve_forever() will handle one request at a time
  • dev_appserver is not thread safe. In the midst of multiple threads handling requests, it gets datastore collisions and barfs
Thus, these changes will make the GO side run concurrently. A user can add a back-door http listen port and access the GO instance directly. I do this within an init() or sync.Once.Do(...) surrounded by an if appengine.IsDevAppServer() { ... }

    http.HandleFunc("/", ...)
    http.ListenAndServe(":9999", nil)

Also, within your top-level request handling code, do a check to ensure the header for contexts is set. This is necessary because the Python SDK will add this to the headers proxied to your application. Bypassing the python proxy requires that at a minimum, you set this yourself (before creating any appengine.Context).

    if r.Header.Get("X-AppEngine-Inbound-AppId") == "" {
        r.Header.Set("X-AppEngine-Inbound-AppId", "dev~app")
    }

After that, you can make requests at http://localhost:9999 and get access to your application. Requests through this url can run concurrently. Access to APIs will still be serial (one at a time) but you can still test concurrency in general for your application. This way, only API requests block but everything else runs concurrently.

When Python SDK becomes thread safe, we only need to make a few changes to be compliant.

  1. On Go AppEngine end, update the following:
    1. appengine_internal.InitAPI:
      just store the network address for the API server
    2. appengine_internal.call:
      open/close a connection to API server for each request
  2. On Python SDK extension:
    1. ext/go/__init__.py:
      change DelegateServer to listen(n) where n is number of concurrent requests supported e.g. listen(10)