Category Archives: C#

Records and the ‘with’ operator, redux

In my previous blog post I described some behaviour of C# record types which was unexpected to me, though entirely correct according to the documentation. This is a follow-up post to that one, so if you haven’t read that one yet, please do so – I won’t go over all the same ground.

Is this a problem or not?

The previous blog post made it to Hacker News, and the comments there, on Bluesky, and on the post itself have been decidedly mixed.

Several people believe this is working as intended; several believe it’s a terrible decision on the part of the C# language team.

Perhaps unsurprisingly, the most insightful comment was from Eric Lippert, who referred to a post on the Programming Language Design and Implementation Stack Exchange site. Eric has answered the question thoroughly, as ever.

I believe the difference in opinions comes down to differing interpretations of what “with” is requesting. Eric wrote:

The with makes a copy of the record, changing only the value of the property you identified as wishing to change, so it ought not to be surprising that nothing else changed.

That’s not how I’ve ever thought of with – I haven’t expected it to say “this object but with these different properties”, but instead “a new record, with the same parameters as the original one, but with these different parameters“. It’s a subtle distinction – sufficiently subtle that I hadn’t even bothered to think about it until running into this problem – but I suspect it explains how different people think about the same feature in different ways. I wouldn’t have thought of “setting the property” because I think of records as being immutable to start with: that the only way you can end up with a record where the property returns a different value is by providing that different value as part of construction. (Again, to be crystal clear: I don’t think I’ve found any bits of documentation which are incorrect. It’s my mental model that has been wrong.)

I haven’t gone back over previous YouTube videos describing the feature – either from the C# team itself or from other developers – to see whether a) it’s described in terms of setting properties rather than parameters; or b) the videos describe the distinction in order to make it clear which is “right”.

In my defence, even when you do have a better mental model for how records work, this is a pretty easy mistake to make, and you need to be on the ball to spot it in code review. The language absolutely allows you to write records which aren’t just “lightweight data records” in the same way that you do for classes – so I don’t think it should be surprising that folks are going to do that.

So, after this introductory spiel, this post has two aspects to it:

  • How am I going to stop myself from falling into the same trap again?
  • What changes have I made within the Election 2029 code base?

Trap avoidance: Roslyn Analyzers

In the previous post, I mentioned writing a Roslyn analyzer as a possible way forward. My initial hope was to have a single analyzer which would just spot the use of with operators targeting any parameter which was used during initialization.

That initial attempt worked to some extent – it would have spotted the dangerous code from the original blog post – but it only worked when the source code for the record and the source code using the with operator were in the same project. I’ve now got a slightly better solution with two analyzers, which can even work with package references where you may not have access to the source code for the record at all… so long as the package author is using the same analyzers! (This will make more sense when you’ve seen the analyzers.)

The source code of the analyzers is on GitHub and the analyzers themselves are in the JonSkeet.RoslynAnalyzers NuGet package. To install them in a project, just add this to an item group in your project file:

<PackageReference Include="JonSkeet.RoslynAnalyzers" Version="1.0.0-beta.6"
        PrivateAssets="all"
        IncludeAssets="runtime; build; native; contentfiles; analyzers"/>

Obviously, it’s all very beta – and there are lots of corner cases it probably wouldn’t find at the moment. (Pull requests are welcome.) But it scratches my particular itch for now. (If someone else wants to take the idea and run with it in a more professional, supported way, ideally in a package with dozens of other useful analyzers, that’s great.)

As I mentioned, there are two analyzers, with IDs of JS0001 and JS0002. Let’s look at how they work by going back to the original demo code from the previous post. Here’s the complete buggy code:

// Record
public sealed record Number(int Value)
{
    public bool Even { get; } = (Value & 1) == 0;
}

// Use of record
var n2 = new Number(2);
var n3 = n2 with { Value = 3 };
Console.WriteLine(n2); // Output: Number { Value = 2, Even = True }
Console.WriteLine(n3); // Output: Number { Value = 3, Even = True }

Adding the analyzer package highlights the int Value parameter declaration in Number, with this warning:

JS0001 Record parameter ‘Value’ is used during initialization; it should be annotated with [DangerousWithTarget]

Currently, there’s no code fix, but we need to do two things:

  • Declare an attribute called DangerousWithTargetAttribute
  • Apply the attribute to the parameter

Here’s the complete attribute and record code with the fix applied:

[AttributeUsage(AttributeTargets.Parameter)]
internal sealed class DangerousWithTargetAttribute : Attribute;

public sealed record Number([DangerousWithTarget] int Value)
{
    public bool Even { get; } = (Value & 1) == 0;
}

The attribute doesn’t have to be internal, and indeed in my election code base it’s not. But it can be, even if you’re using the record from a different assembly. The analyzer doesn’t care what namespace it’s in or any other details (although it does currently have to be called DangerousWithTargetAttribute rather than just DangerousWithTarget).

At this point:

  • The source code makes it clear to humans that we know it would be dangerous to set the Value property in a with operator on a Number
  • The compiled code makes that clear (to the other analyzer) as well

After applying the above change, we get a different warning – this time on n2 with { Value = 3 }:

JS0002: Record parameter ‘Value’ is annotated with [DangerousWithTarget]

(Both of these warnings have more detailed descriptions associated with them as well as the summary.)

Now you know the problem exists, it’s up to you to fix it… and there are multiple different ways you could do that. Let’s try to get warning-free by replacing our precomputed property with one which is computed on demand. The analyzers don’t try to tell you if [DangerousWithTarget] is applied where it doesn’t need to be, so this code compiles without any warnings, but it doesn’t remove our JS0002 warning:

// No warning here, but the expression 'n2 with { Value = 3 }' still warns.
public sealed record Number([DangerousWithTarget] int Value)
{
    public bool Even => (Value & 1) == 0;
}

As it happens, this has proved unexpectedly useful within the Election2029 code, where even though a parameter isn’t used in initialization, there’s an expected consistency between parameters which should discourage the use of the with operator to set one of them.

Once we remove the [DangerousWithTarget] attribute from the parameter though, all the warnings are gone:

public sealed record Number(int Value)
{
    public bool Even => (Value & 1) == 0;
}

The analyzer ignores the Even property because it doesn’t have an initializer – it’s fine to use Value for computed properties after initialization.

A new pattern for Election2029

So, what happened when I enabled the analyzers in my Election2029 project? (Let’s leave aside the bits where it didn’t work first time… there’s a reason the version number is 1.0.0-beta.6 already.)

Predictably enough, a bunch of records were flagged for not specifying the [DangerousWithTarget] attribute… and when I’d applied it, there were just one or two places where I was using the with operator in an unsafe way. (Of course, I checked whether the original bug which had highlighted the issue for me in the first place was caught by the analyzers – and it was.)

For most of the records, the precomputation feels okay to me. They’re still fundamentally pretty lightweight records, with a smattering of precomputation which would feel pointlessly inefficient if I made it on-demand. I like the functionality that I’m given automatically by virtue of them being records. I’ve chosen to leave those as records, knowing that at least if I try to use the with operator in a dangerous way, I’ll be warned about it.

However, there are two types – ElectionCoreContext and ElectionContext, which I wrote about earlier – which have a lot of precomputation. They feel more reasonable as classes. Initially, I converted them into just “normal” classes, with a primary constructor and properties. It felt okay, but not quite right somehow. I liked the idea of the record type for just the canonical information for the context… so I’ve transformed ElectionContext like this (there’s something similar for ElectionCoreContext):

public sealed class ElectionContext : IEquatable<ElectionContext>
{
    public ElectionContextSourceData SourceData { get; }

    // Bunch of properties proxying access
    public Instant Timestamp => SourceData.Timestamp;
    // ...

    public ElectionContext(ElectionContextSourceData sourceData)
    {
        // Initialization and validation
    }

    public sealed record ElectionContextSourceData(Instant Timestamp, ...)
    {
        // Equals and GetHashCode, but nothing else
    }
}

At this point:

  • I’ve been able to add validation to the constructor. I couldn’t do that with a record in its primary constructor.
  • It’s really clear what’s canonical information vs derived data – I could even potentially refactor the storage layer to only construct and consume the ElectionContextSourceData, for example. (I’m now tempted to try that. I suspect it would be somewhat inefficient though, as it uses the derived data to look things up when deserializing.)
  • I can still use the with operator with the record, when I need to (which is handy in a few places – see the sketch after this list)
  • There’s no risk of the derived data being out of sync with the canonical data, because the ordering is very explicit
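
For example, tweaking the timestamp now looks something like this – a sketch rather than code from the repository, with existingContext and newTimestamp standing in for whatever is to hand:

// Apply 'with' to the source-data record, then rebuild the wrapper class,
// so all the derived data is recomputed (and validated) consistently.
var updatedContext = new ElectionContext(
    existingContext.SourceData with { Timestamp = newTimestamp });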

Ignoring the naming (and possibly the nesting), is this a useful pattern? I wouldn’t want to do it for every record, but for these two core, complex types it feels like it’s working well so far. It’s early days though.

Conclusion

I’m really pleased that I can now use records more safely, even if I’m using them in ways that other folks may not entirely condone. I may well change my mind and go back to using regular classes for all but the most cut-and-dried cases. But for now, the approaches I’ve got of “use records where it feels right, even if that means precomputation” and “use classes to wrap records where there’s enough behaviour to justify it” are working reasonably well.

I don’t really expect other developers to use my analyzers (although you’re very welcome to do so, of course) – but the fact that they’re even feasible points to Roslyn being a bit of a miracle. I’m not recommending either my “careful use of records slightly beyond their intended use” or “class wrapping a record” approaches yet. I’ve got plenty of time to refactor if they don’t work out for the Election2029 project. But I’d still be interested in getting feedback on whether my decisions at least seem somewhat reasonable to others.

Unexpected inconsistency in records

The other day, I was trying to figure out a bug in my code, and it turned out to be a misunderstanding on my part as to how C# records work. It’s entirely possible that I’m the only one who expected them to work in the way that I did, but I figured it was worth writing about just in case it catches anyone else out.

As it happens, this is something I discovered when making a change to my 2029 UK general election site, but it isn’t actually related to the election, so I haven’t included it in the election site blog series.

Recap: nondestructive mutation

When records were introduced into C#, the “nondestructive mutation” with operator was introduced at the same time. The idea is that record types can be immutable, but you can easily and efficiently create a new instance which has the same data as an existing instance, but with some different property values.

For example, suppose you were to have a record like this:

public sealed record HighScoreEntry(string PlayerName, int Score, int Level);

You could then have code of:

HighScoreEntry entry = new("Jon", 5000, 50);

var updatedEntry = entry with { Score = 6000, Level = 55 };

This doesn’t change the data in the first instance (so entry.Score would still be 5000).

Recap: derived data

Records don’t allow you to specify constructor bodies for the primary constructor (something I meant to write about in my earlier post about records and collections), but you can initialize fields (and therefore auto-implemented properties) based on the values for the parameters in the primary constructor.

So as a very simple (and highly contrived) example, you could create a record which determines whether or not a value is odd or even on initialization:

public sealed record Number(int Value)
{
    public bool Even { get; } = (Value & 1) == 0;
}

At first glance, this looks fine:

var n2 = new Number(2);
var n3 = new Number(3);
Console.WriteLine(n2); // Output: Number { Value = 2, Even = True }
Console.WriteLine(n3); // Output: Number { Value = 3, Even = False }

So far, so good. Until this week, I’d thought that was all fine.

Oops: mixing with and derived data

The problem comes when mixing these two features. If we change the code above (while leaving the record itself the same) to create the second Number using the with operator instead of by calling the constructor, the output becomes incorrect:

var n2 = new Number(2);
var n3 = n2 with { Value = 3 };
Console.WriteLine(n2); // Output: Number { Value = 2, Even = True }
Console.WriteLine(n3); // Output: Number { Value = 3, Even = True }

“Value = 3, Even = True” is really not good.

How does this happen? Well, for some reason I’d always assumed that the with operator called the constructor with the new values. That’s not actually what happens. The with operator above translates into code roughly like this:

// This won't compile, but it's roughly what is generated.
var n3 = n2.<Clone>$();
n3.Value = 3;

The <Clone>$ method (at least in this case) calls a generated copy constructor (Number(Number)) which copies both Value and the backing field for Even.
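
To make that concrete, here’s a hand-written copy constructor with roughly the same effect as the generated one. (The real generated code copies the compiler-named backing field for Even rather than going via the properties, but the outcome is the same.)

public sealed record Number(int Value)
{
    public bool Even { get; } = (Value & 1) == 0;

    // Roughly equivalent to the generated copy constructor: both Value and
    // the already-computed Even are copied verbatim, so Even is never
    // recalculated for the clone that 'with' then modifies.
    private Number(Number original)
    {
        Value = original.Value;
        Even = original.Even;
    }
}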

This is all documented – but currently without any warning about the possible inconsistency it can introduce. (I’ll be emailing Microsoft folks to see if we can get something in there.)

Note that because Value is set after the cloning operation, we couldn’t write a copy constructor to do the right thing here anyway. (At least, not in any sort of straightforward way – I’ll mention a convoluted approach later.)

In case anyone is thinking “why not just use a computed property?” – obviously this works fine:

public sealed record Number(int Value)
{
    public bool Even => (Value & 1) == 0;
}

Any property that can easily be computed on demand like this is great – as well as not exhibiting the problem from this post, it’s more efficient in memory too. But that really wouldn’t work for a lot of the properties in the records I use in the election site, where often the record is constructed with collections which are then indexed by ID, or other relatively expensive computations are performed.
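
To give a flavour of the kind of precomputation I mean, here’s a contrived sketch – the type and property names are purely illustrative, not taken from the real election models – of a record which indexes a collection by ID at construction time:

using System.Collections.Immutable;

public sealed record Constituency(string Code, string Name);

public sealed record ConstituencySet(ImmutableList<Constituency> Constituencies)
{
    // Built once at construction; computing this on every lookup would
    // defeat the point of having an index at all.
    public ImmutableDictionary<string, Constituency> ConstituenciesByCode { get; } =
        Constituencies.ToImmutableDictionary(c => c.Code);
}

A with expression which replaced Constituencies on a record like this would, of course, leave ConstituenciesByCode stale in exactly the way described above.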

What can we do?

So far, I’ve thought of four ways forward, none of them pleasant. I’d be very interested to hear recommendations from others.

Option 1: Shrug and get on with life

Now I know about this, I can avoid using the with operator for anything but “simple” records. If there are no computed properties or fields, the with operator is still really useful.

There’s a risk that I might use the with operator on a record type which is initially “simple” and then later introduce a computed member, of course. Hmm.

Option 2: Write a Roslyn analyzer to detect the problem

In theory, at least for any records being used within the same solution in which they’re declared (which is everything for my election site) it should be feasible to write a Roslyn analyzer which:

  • Analyzes every member initializer in every declared record to see which parameters are used
  • Analyzes every with operator usage to see which parameters are being set
  • Records an error if there’s any intersection between the two

That’s quite appealing and potentially useful to others. It does have the disadvantage of having to implement the Roslyn analyzer though. It’s been a long time since I’ve written an analyzer, but my guess is that it’s still a fairly involved process. If I actually find the time, this is probably what I’ll do – but I’m hoping that someone comments that either the analyzer already exists, or explains why it isn’t needed anyway.

Update, 2025-07-29: I’ve written a pair of analyzers! See my follow-up post for more details.

Option 3: Figure out a way of using with safely

I’ve been trying to work out how to potentially use Lazy<T> to defer computing any properties until they’re first used, which would come after the with operator set new values for properties. I’ve come up with the pattern below – which I think works, but is ever so messy. Adopting this pattern wouldn’t require every new parameter in the parent record to be reflected in the nested type – only for parameters used in computed properties.

public sealed record Number(int Value)
{
    private readonly Lazy<ComputedMembers> computed =
        new(() => new(Value), LazyThreadSafetyMode.ExecutionAndPublication);

    public bool Even => computed.Value.Even;

    private Number(Number other)
    {
        Value = other.Value;
        // Defer creating the ComputedMembers instance until it's first used -
        // by which point 'with' will have set any new property values.
        computed = new(() => new(this), LazyThreadSafetyMode.ExecutionAndPublication);
    }

    // This is a struct (or could be a class) rather than a record,
    // to avoid creating a field for Value. We only need the computed properties.
    // (We don't even really need to use a primary
    // constructor, and in some cases it might be best not to.)
    private struct ComputedMembers(int Value)
    {
        internal ComputedMembers(Number parent) : this(parent.Value)
        {
        }

        public bool Even { get; } = (Value & 1) == 0;
    }
}

This is:

  • Painful to remember to do
  • A lot of extra code to start with (although after it’s been set up, adding a new computed member isn’t too bad)
  • Inefficient in terms of memory, due to adding a Lazy<T> instance

The inefficiency is likely to be irrelevant in “large” records, but it makes it painful to use computed properties in “small” records with only a couple of parameters, particularly if those are just numbers etc.

Option 4: Request a change to the language

I bring this up only for completeness. I place a lot of trust in the C# design team: they’re smart folks who think things through very carefully. I would be shocked to discover that I’m the first person to raise this “problem”. I think it’s much more likely that the pros and cons of this behaviour have been discussed at length, and alternatives discussed and prototyped, before landing on the current behaviour as the least-worst option.

Now maybe the Roslyn compiler could start raising warnings (option 2) so that I don’t have to write an analyzer – and maybe there are alternatives that could be added to C# for later versions (ideally giving more flexibility for initialization within records in general, e.g. a specially named member that is invoked when the instance is “ready” and which can still write to read-only properties)… but I’m probably not going to start creating a proposal for that without explicit encouragement to do so.

Conclusion

It’s very rare that I discover a footgun in C#, but this really feels like one to me. Maybe it’s only because I’ve used computed properties so extensively in my election site – maybe records really aren’t designed to be used like this, and half of my record types should really be classes instead.

I don’t want to stop using records, and I’m definitely not encouraging anyone else to do so either. I don’t want to stop using the with operator, and again I’m not encouraging anyone else to do so. I hope this post will serve as a bit of a wake-up call to anyone who is using with in an unsound way though.

Oh, and of course if I do write a Roslyn analyzer capable of detecting this, I’ll edit this post to link to it. (As noted earlier, this is that post.)

Election 2029: Postcodes

After a pretty practical previous post about records and collections, this post is less likely to give anyone ideas about how they might tackle a problem in their own project, and doesn’t have any feature requests for Microsoft either. It’s an area I’ve found really fun though.

An introduction to postcodes in the UK

I realise that most of the readers of this blog post will probably not be in the UK. Most countries have postal codes of some description, but the details vary a lot by country. UK postcodes are quite different in scale to US zipcodes, for example. A UK postcode is quite fine-grained – often just a part of a single street – so knowing a house/flat number and a postcode is usually enough to get to a precise address.

Importantly for my election site, every postcode is in a single constituency – and indeed that’s what I want to use them for. My constituencies page allows you to start typing a constituency name or postcode, and it filters the results as you type. I suspect that a significant proportion of the UK population knows their postcode but not the name of their constituency (particularly after the boundary changes in 2023) so it’s helpful to be able to specify either.

Wikipedia has a lot more information about UK postcodes but I’ll summarize it briefly here. Note that I only care about modern postcodes – the Wikipedia article has details about the history of how things evolved, but my site doesn’t need to care about legacy postcodes.

A postcode consists of an outcode (or outward code) followed by an incode (or inward code). The outcode is an area followed by a district, and the incode is a sector followed by a unit. As an example, my postcode is RG30 4TT, which breaks down into:

  • Outcode: RG30
    • Area: RG
    • District: 30
  • Incode: 4TT
    • Sector: 4
    • Unit: TT

Incodes are nice and uniform: they’re always a digit followed by two letters. (And those two letters are within a 20-character alphabet.) Outcodes are more “interesting” as they fall into one of the six formats below, where ‘9’ represents “any digit” and ‘A’ represents “a letter” (although the alphabet varies):

  • AA9
  • AA99
  • A9
  • A99
  • A9A
  • AA9A

Note how the first two formats for outcodes only vary by whether they have one or two digits – and remember that an incode always starts with a digit. This isn’t a problem when parsing text that is intended to represent a complete postcode (as you can tell the length of the outcode by assuming the final three characters are the incode) – but when I need to parse an incomplete postcode, it can be ambiguous. For example, “RG31 4FG” and “RG3 1AA” are both valid postcodes, and as I don’t want to force users to type the space, “RG31” should display constituencies for both (“Reading West and Mid Berkshire” and “Reading Central” respectively).

Requirements

The requirements for my use of postcodes in the election site are pretty straightforward:

  • Ingest data from an external source (the Office for National Statistics was what I found)
  • Store everything we need in a compact format, ideally in a single Firestore document (so 1MB)
  • Keep the data in memory, still in a reasonably compact format
  • Provide an API which takes “a postcode prefix” as input and returns “the set of all constituencies covered by postcodes starting with that prefix”, sufficiently quickly for it to feel “instant”

Interestingly, Wikipedia refers to “a table of all 1.7 million postcodes” but the file I ingest currently has 2,712,507 entries. I filter out a few, but I still end up with 2,687,933 distinct postcodes. I don’t know whether the Wikipedia number is just a typo, or whether there’s a genuine discrepancy there.

Spoiler alert: the network latency easily dominates the latency for making a request to the API. Using airport wifi in Munich (which is pretty good for free wifi) I see a roundtrip time of about 60ms, and the page feels like it’s updating pretty much instantaneously as I type. But the time within ASP.NET Core is less than a millisecond (and that’s including filtering by constituency name as well).

Storage and in-memory formats

Logically, the input data is just a sequence of “postcode, constituency code” entries. A constituency code is 9 characters long. That means a very naive representation of 2,712,507 entries, each taking 16 characters (expecting an average of 7 characters for the postcode, and 9 characters for the constituency code) would be just under 42MB – and that’s before we take into account separating the entries. Surely we can do better.

There’s a huge amount of redundancy here – we’re bound to be able to do better. For a start, my naive calculation assumed using a whole byte per character, even though every character we need is in the range A-Z, 0-9 (so an alphabet of 36 characters) – and several of the characters within those values are required to just be digits.

Grouping by outcode

But before we start writing complex code to represent the entry string values in fewer bytes, there’s a lot more redundancy in the actual data. It’s not like postcodes are randomly allocated across all constituencies. Most postcodes within the same sector (e.g. “RG30 4”) will be in the same constituency. Even when we group by outcode (e.g. “RG30”) there are relatively few constituencies represented in each group.

At the time of writing, there are 3077 outcodes:

  • 851 are all in 1 constituency
  • 975 are in 2 constituencies
  • 774 are in 3
  • 348 are in 4
  • 98 are in 5
  • 21 are in 6
  • 8 are in 7
  • 2 are in 8

We also want to differentiate between invalid and valid postcodes – so we can think of “no constituency at all” as an additional constituency in each of the lines above.

In each outcode, there are 4000 possible incodes, which are always the same (0AA, 0AB .. 9ZY, 9ZZ). So for each outcode, we just need an array of 4000 values.

A simple representation with a byte per value would be 4000 bytes per map, with 3077 such maps (one per outcode) – that’s 12,308,000 bytes.
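
For illustration, here’s one possible way of mapping an incode to a slot in such an array. Note that the exact 20-letter alphabet and the index ordering are assumptions for this sketch rather than details taken from the real code:

// Maps an incode such as "4TT" to an index in the range 0-3999:
// 10 possible sectors * 20 * 20 possible letter pairs = 4000 slots.
static int IncodeToIndex(string incode)
{
    const string alphabet = "ABDEFGHJLNPQRSTUWXYZ"; // assumed 20-letter incode alphabet
    int sector = incode[0] - '0';                   // 0-9
    int first = alphabet.IndexOf(incode[1]);        // 0-19
    int second = alphabet.IndexOf(incode[2]);       // 0-19
    return (sector * 20 + first) * 20 + second;
}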

Compressing the maps

Can we do better by using the fact that we don’t really need a whole byte for each value? For the 851 outcodes in a single constituency, each of those values would be 0 or 1. For the 975 outcodes in two constituencies, each value would be 0, 1 or 2 – and so on. Even without splitting things across bytes, we could store those maps as:

  • 851 maps with a single bit per value (500 bytes per map)
  • 975 + 774 maps with two bits per value (1000 bytes per map)
  • The remaining (348 + 98 + 21 + 8 + 2) maps with four bits per value (2000 bytes per map)

That’s a total of (851*500) + ((975 + 774) * 1000) + ((348 + 98 + 21 + 8 + 2) * 2000) = 3,128,500 bytes. That sounds great – it’s saved three quarters of the space needed for the maps! But it’s still well over the maximum size of a Firestore document. And, of course, for each outcode we also need the outcode text, and the list of constituencies represented in the map. That shouldn’t be much, but it doesn’t help an already-blown size budget.

At this point, I tried multiple options, including grouping by sector instead of outcode (as few outcodes actually have all ten possible sectors, and within a sector there are likely to be fewer constituencies, so we’ll have more sectors which can be represented by one or two bits per value). I also tried a hand-rolled run-length encoding. I did manage to get things down quite a lot – but the code became increasingly complicated. Fundamentally, what I was doing was performing compression, and trying to apply some knowledge about the data to achieve a decent level of compression.

Around this point, I decided to stop trying to make the code smarter, and instead try making the code simpler and lean on existing code being smart. The simple 4000-byte maps contain a lot of redundancy. Rather than squeezing that redundancy out manually, I tried general-purpose compression – using DeflateStream for both compression and decompression. The result was beyond what I’d hoped: the total size of the compressed maps is just 500,602 bytes. That leaves plenty of room for the extra information on a per-outcode basis, while staying well under the 1MB document limit.
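
For anyone curious, the round trip really is tiny. Here’s a minimal sketch – not the actual election code – of compressing and decompressing a single 4000-byte map:

using System.IO;
using System.IO.Compression;

public static class OutcodeMapCompression
{
    public static byte[] Compress(byte[] map)
    {
        using var output = new MemoryStream();
        using (var deflate = new DeflateStream(output, CompressionLevel.Optimal))
        {
            deflate.Write(map, 0, map.Length);
        }
        return output.ToArray();
    }

    public static byte[] Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var deflate = new DeflateStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        deflate.CopyTo(output);
        return output.ToArray();
    }
}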

I would note that when I wrote about storage options, I mentioned the possibility of using Google Cloud Storage instead of Firestore – at which point there’d really be no need for this compression at all. While it would be nice to remove the compression code, it’s a very isolated piece of complexity – which is only actually about 15 lines of code in the end.

Processing a request

So with the storage requirements satisfied, are we done? Pretty much, it turns out. I keep everything uncompressed in memory, with each per-outcode map storing a byte per incode – leading to just over 12MB in memory, as shown earlier. That’s fine. Each per-outcode map is stored as an OutcodeMapping.

The process of converting a prefix into a set of possible constituencies is fiddly simply due to the ambiguity in the format. I don’t return anything for a single character – it’s reasonable to require users to type in two characters of a postcode before we start doing anything with it.

It’s easy to store a dictionary of “first two characters” to “list of possible OutcodeMapping entries”. That list is always reasonably short, and can be pruned down very quickly to “only the entries that actually match the supplied prefix”. For each matching OutcodeMapping, we then need to find the constituencies that might be represented by the incode part of whatever’s left of the supplied prefix.

For example, if the user has typed “RG30” then we’ll quickly filter down to “RG3” and “RG30” as matching outcodes. From that, we need to say:

  • Which constituencies are represented by incodes for RG3 which start with “0”?
  • Which constituencies are represented by incodes for RG30? (We’ve already consumed the whole of the supplied prefix for that outcode.)

Splitting the work into outcode and then incode makes life much simpler, because the incode format is so regular. For example, if the user had typed in “RG30 4” then the questions above become:

  • Which constituencies are represented by incodes for RG3 which start with “04”?
    • There can’t be any of those, because the second character of an incode is never a digit.
  • Which constituencies are represented by incodes for RG30 which start with “4”?

We always have zero, one, two or three characters to process within an outcode mapping (there’s a sketch of this after the list):

  • Zero characters: the full set of constituencies for this outcode
  • One character: the set of constituencies for that “sector” (which we can cache really easily; there are only ten sectors per outcode)
  • Two characters: the fiddliest one – look through all 20 possibilities for that “half unit” in the map.
  • Three characters: a single postcode, so just look it up in the map; the result will be either nothing (if the postcode is invalid) or a single-entry result
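
Putting those cases together, here’s a sketch of the lookup within a single outcode map. Again, this is illustrative: the indexing assumptions match the earlier sketch, and I’m assuming a value of 0 in the map means “no constituency”.

// Returns the distinct constituency values present in the map for the
// remaining 0-3 characters of prefix within this outcode.
static HashSet<byte> MatchingConstituencies(byte[] map, string incodePrefix)
{
    const string alphabet = "ABDEFGHJLNPQRSTUWXYZ"; // assumed incode alphabet, as before
    int start = 0;
    int length = 4000;
    if (incodePrefix.Length >= 1)
    {
        start = (incodePrefix[0] - '0') * 400; // one sector covers 400 slots
        length = 400;
    }
    if (incodePrefix.Length >= 2)
    {
        int letter = alphabet.IndexOf(incodePrefix[1]);
        if (letter < 0)
        {
            return new HashSet<byte>(); // e.g. a digit: never valid as the second incode character
        }
        start += letter * 20; // one "half unit" covers 20 slots
        length = 20;
    }
    if (incodePrefix.Length == 3)
    {
        int letter = alphabet.IndexOf(incodePrefix[2]);
        if (letter < 0)
        {
            return new HashSet<byte>();
        }
        start += letter;
        length = 1;
    }
    var result = new HashSet<byte>();
    for (int i = start; i < start + length; i++)
    {
        if (map[i] != 0) // 0 = invalid postcode / no constituency (assumed)
        {
            result.Add(map[i]);
        }
    }
    return result;
}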

I suspect that there’s more efficiency that could be squeezed out of my implementation here – but it’s already fast enough that I really don’t care. At some point I might do a load test to see how many requests I can process at a time while still keeping the latency low… but I’d be very surprised if this caused a problem. (And hey, Cloud Run will just spin up some more instances if necessary.)

Conclusion

There are two things I love about this topic. Firstly, it’s so isolated. You don’t need to know anything about the rest of the site, election polls, parties etc to understand this small domain. You need to know about postcodes and constituencies, but that’s it. Aside from anything else, that makes it a pleasant topic to write about. Secondly, the results seem so wildly impressive, with relatively little effort. In around half a megabyte of data, we’re storing information about over 2.7 million postcodes, and the final lookup is dirt cheap.

There’s an interesting not-quite-contradiction in the approaches taken in the post, too.

Firstly, even though the data is sort of mapping a 6/7 character input (postcode) to a 9 character output (constituency code), it’s very far from random. There are only 650 constituencies, and the constituencies are obviously very “clustered” with respect to the postcodes (so two postcodes which only differ in the fifth or later character are very likely to belong to the same constituency). The approach I’ve taken uses what we know about the data as a first pass – so that aspect is very domain-specific. Just compressing a file of “postcode constituency-code” lines only gets us down to just over 7MB.

In contrast, when we’ve got as far as “for each outcode, we have a map of 4000 entries, each to a number between 0 and 8 inclusive” (with awareness that for many outcodes, the numbers are either in the range 0-1 or 0-2 inclusive), trying to be “clever” didn’t work out terribly well. The code became increasingly complicated as I tried to pack as much information as possible into each byte… whereas when I applied domain-agnostic compression to each map, the results were great.

I don’t think it’s particularly rare to need to find that balance between writing code to try to take advantage of known data characteristics, and relying on existing domain-agnostic algorithms. This is just one of the clearest examples of it that I’ve come across.

Records and Collections

This post is to some extent a grab-bag of points of friction I’ve encountered when using records and collections within the election site.

Records recap

This may end up being the most generally useful blog post in this series. Although records have been in the language since C# 9, I haven’t used them much myself. (I’ve been looking forward to using them for over a decade, but that’s a different matter.)

Having decided to make all my data models immutable, using records (always sealed records in my case) to implement those models in C# was pretty much a no-brainer. Just specify the properties you want using the same format as primary constructors, and the compiler does a bunch of boilerplate work for you.

As a simple example, consider the following record declaration:

public sealed record Candidate(int Id, string Name, int? MySocietyId, int? ParliamentId);

That generates code roughly equivalent to this:

public sealed class Candidate : IEquatable<Candidate>
{
    public int Id { get; }
    public string Name { get; }
    public int? MySocietyId { get; }
    public int? ParliamentId { get; }

    public Candidate(int id, string name, int? mySocietyId, int? parliamentId)
    {
        Id = id;
        Name = name;
        MySocietyId = mySocietyId;
        ParliamentId = parliamentId;
    }

    public override bool Equals(object? obj) => obj is Candidate other && Equals(other);

    public override int GetHashCode()
    {
        // The real code also uses EqualityContract, skipped here.
        int idHash = EqualityComparer<int>.Default.GetHashCode(Id);
        int hash = idHash * -1521134295;
        int nameHash = EqualityComparer<string>.Default.GetHashCode(Name);
        hash = (hash + nameHash) * -1521134295;
        int mySocietyIdHash = EqualityComparer<int?>.Default.GetHashCode(MySocietyId);
        hash = (hash + mySocietyIdHash) * -1521134295;
        int parliamentIdHash = EqualityComparer<int?>.Default.GetHashCode(ParliamentId);
        hash = (hash + parliamentIdHash) * -1521134295;
        return hash;
    }

    public bool Equals(Candidate? other)
    {
        if (ReferenceEquals(this, other))
        {
            return true;
        }
        if (other is null)
        {
            return false;
        }
        // The real code also uses EqualityContract, skipped here.
        return EqualityComparer<int>.Default.Equals(Id, other.Id) &&
            EqualityComparer<string>.Default.Equals(Name, other.Name) &&
            EqualityComparer<int?>.Default.Equals(MySocietyId, other.MySocietyId) &&
            EqualityComparer<int?>.Default.Equals(ParliamentId, other.ParliamentId);
    }

    public static bool operator==(Candidate? left, Candidate? right)
    {
        if (ReferenceEquals(left, right))
        {
            return true;
        }
        if (left is null)
        {
            return false;
        }
        return left.Equals(right);
    }

    public static bool operator!=(Candidate? left, Candidate? right) => !(left == right);

    public override string ToString() =>
        $"Candidate {{ Id = {Id}, Name = {Name}, MySocietyId = {MySocietyId}, ParliamentId = {ParliamentId} }}";

    public void Deconstruct(out int Id, out string Name, out int? MySocietyId, out int? ParliamentId) =>
        (Id, Name, MySocietyId, ParliamentId) = (this.Id, this.Name, this.MySocietyId, this.ParliamentId);
}

(This could be written a little more compactly using primary constructors, but I’ve kept to “old school” C# to avoid confusion.)

Additionally, the compiler allows the with operator to be used with records, to create a new instance based on an existing instance and some updated properties. For example:

var original = new Candidate(10, "Jon", 20, 30);
var updated = original with { Id = 40, Name = "Jonathan" };

That’s all great! Except when it’s not quite…

Record equality

As shown above, the default implementation of equality for records uses EqualityComparer<T>.Default for each of the properties. That’s fine when the default equality comparer for the property type is what you want – but that’s not always the case. In our election data model case, most of the types are fine – but ImmutableList<T> is not, and we use that quite a lot.

ImmutableList<T> doesn’t override Equals and GetHashCode itself – so it has reference equality semantics. What I really want is to use an equality comparer for the element type, and say that two immutable lists are equal if they have the same count, and the elements are equal when considered pairwise. That’s easy enough to implement – along with a suitable GetHashCode method. It could easily be wrapped in a type that implements IEqualityComparer<ImmutableList<T>>, although it so happens I haven’t done that yet.
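
For what it’s worth, the comparer itself only needs a few lines. A sketch – not the exact code from the election site – might look like this:

using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;

// Compares two immutable lists by count and pairwise element equality.
public sealed class ImmutableListEqualityComparer<T> : IEqualityComparer<ImmutableList<T>>
{
    private readonly IEqualityComparer<T> elementComparer;

    public ImmutableListEqualityComparer(IEqualityComparer<T>? elementComparer = null) =>
        this.elementComparer = elementComparer ?? EqualityComparer<T>.Default;

    public bool Equals(ImmutableList<T>? x, ImmutableList<T>? y)
    {
        if (ReferenceEquals(x, y))
        {
            return true;
        }
        if (x is null || y is null || x.Count != y.Count)
        {
            return false;
        }
        return x.SequenceEqual(y, elementComparer);
    }

    public int GetHashCode(ImmutableList<T> obj)
    {
        var hash = new HashCode();
        foreach (var item in obj)
        {
            hash.Add(item, elementComparer);
        }
        return hash.ToHashCode();
    }
}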

Unfortunately, the way that records work in C#, there’s no way of specifying an equality comparer to be used for a given property. If you implement the Equals and GetHashCode methods directly, those are used instead of the generated versions (and the Equals(object) generated code will still use the version you’ve implemented) but it does mean you’ve got to implement it for all the properties. This in turn means that if you add a new property in the record, you need to remember to modify Equals and GetHashCode (something I’ve forgotten to do at least once) – whereas if you’re happy to use the default generated implementation, adding a property is trivial.

What I’d really like would be some way of indicating to the compiler that it should use a specified type to obtain the equality comparer (which could be assumed to be stateless) for a property. For example, imagine we have these types:

// Imagine this is in the framework...
public interface IEqualityComparerProvider
{
    static abstract IEqualityComparer<T> GetEqualityComparer<T>();
}

// As is this...
[AttributeUsage(AttributeTargets.Property)]
public sealed class EqualityComparerAttribute : Attribute
{
    public Type ProviderType { get; }

    public EqualityComparerAttribute(Type providerType)
    {
        ProviderType = providerType;
    }
}

Now I could implement the interface like this:

public sealed class CollectionEqualityProvider : IEqualityComparerProvider
{
    public static IEqualityComparer<T> GetEqualityComparer<T>()
    {
        var type = typeof(T);
        if (!type.IsGenericType)
        {
            throw new InvalidOperationException("Unsupported type");
        }
        var genericTypeDefinition = type.GetGenericTypeDefinition();
        if (genericTypeDefinition == typeof(ImmutableList<>))
        {
            // Instantiate and return an appropriate equality comparer
        }
        if (genericTypeDefinition == typeof(ImmutableDictionary<,>))
        {
            // Instantiate and return an appropriate equality comparer
        }
        // etc...
        throw new InvalidOperationException("Unsupported type");
    }
}

It’s unfortunate that filling in those commented-out parts would involve further reflection – but it would certainly be feasible.

We could then declare a record like this:

public sealed record Ballot(
    Constituency Constituency,
    [EqualityComparer(typeof(CollectionEqualityProvider))] ImmutableList<Candidacy> Candidacies);

… and I’d expect the compiler to generate code such as:

public sealed class Ballot
{
    private static readonly IEqualityComparer<ImmutableList<Candidacy>> candidaciesComparer =
        CollectionEqualityProvider.GetEqualityComparer<ImmutableList<Candidacy>>();

    // Skip code that would be generated as it is today.

    public bool Equals(Ballot? other)
    {
        if (ReferenceEquals(this, other))
        {
            return true;
        }
        if (other is null)
        {
            return false;
        }
        return EqualityComparer<Constituency>.Default.Equals(Constituency, other.Constituency) &&
            candidaciesComparer.Equals(Candidacies, other.Candidacies);
    }

    public override int GetHashCode()
    {
        int constituencyHash = EqualityComparer<Constituency>.Default.GetHashCode(Constituency);
        int hash = constituencyHash * -1521134295;
        int candidaciesHash = candidaciesComparer.GetHashCode(Candidacies);
        hash = (hash + candidaciesHash) * -1521134295;
        return hash;
    }
}

I’m sure there are other ways of doing this. The attribute could instead specify the name of a private static read-only property used to obtain the equality comparer, removing the interface. Or the GetEqualityComparer method could be non-generic with a Type parameter instead (leaving the compiler to generate a cast after calling it). I’ve barely thought about it – but the important thing is that the requirement of having a custom equality comparison for a single property becomes independent of all the other properties. If you already have a record with 9 properties where the default equality comparison is fine, then adding a 10th property which requires more customization is easy – whereas today, you’d need to implement Equals and GetHashCode including all 10 properties.

(The same could be said for string formatting for the properties, but it’s not an area that has bitten me yet.)

The next piece of friction I’ve encountered is also about equality, but in a different direction.

Reference equality

If you remember from my post about data models, within a single ElectionContext, reference equality for models is all we ever need. The site never needs to fetch (say) a constituency result from the 2024 election from one context by specifying a Constituency from a different context. Indeed, if I ever found code that did try to do that, it would probably indicate a bug: everything within any given web request should refer to the same ElectionContext.

Given that, when I’m creating an ImmutableDictionary<Constituency, Result>, I want to provide an IEqualityComparer<Constituency> which only performs reference comparisons. While this seems trivial, I found that it made a pretty significant difference to the time spent constructing view-models when the context is reloaded.

I’d expected it would be easy to find a reference equality comparer within the framework – but if there is one, I’ve missed it.

Update, 2025-03-27T21:04Z, thanks to Michael Damatov

As Michael pointed out in comments, there is one in the framework: System.Collections.Generic.ReferenceEqualityComparer – and I remember finding it when I first discovered I needed one. But I foolishly dismissed it. You see, it’s non-generic:

public sealed class ReferenceEqualityComparer :
    System.Collections.Generic.IEqualityComparer<object>,
    System.Collections.IEqualityComparer

That’s odd and not very useful, I thought at the time. Why would I only want IEqualityComparer<object> rather than a generic one?

Oh Jon. Foolish, foolish Jon.

IEqualityComparer<T> is contravariant in T – so there’s an implicit reference conversion from IEqualityComparer<object> to IEqualityComparer<X> for any class type X.
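
In other words, given any class type – the election site’s Constituency, say – this just works:

// The framework comparer is declared as IEqualityComparer<object>, but
// contravariance lets it satisfy IEqualityComparer<Constituency> too.
IEqualityComparer<Constituency> comparer = ReferenceEqualityComparer.Instance;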

I have now removed my own generic ReferenceEqualityComparer<T> type… although it’s meant I’ve had to either cast or explicitly specify some type arguments where previously the types were inferred via the type of the comparer.

End of update

I’ve now made a habit of using reference equality comparisons everywhere within the data models, which has made it worth adding some extension methods – and these probably don’t make much sense to add to the framework (although they could easily be supplied by a NuGet package):

public static ImmutableDictionary<TKey, TValue> ToImmutableReferenceDictionary<TSource, TKey, TValue>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TValue> elementSelector) where TKey : class =>
    source.ToImmutableDictionary(keySelector, elementSelector, ReferenceEqualityComparer<TKey>.Instance);

public static ImmutableDictionary<TKey, TSource> ToImmutableReferenceDictionary<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector) where TKey : class =>
    source.ToImmutableDictionary(keySelector, ReferenceEqualityComparer<TKey>.Instance);

public static ImmutableDictionary<TKey, TValue> ToImmutableReferenceDictionary<TKey, TValue>(
    this IDictionary<TKey, TValue> source) where TKey : class =>
    source.ToImmutableDictionary(ReferenceEqualityComparer<TKey>.Instance);

public static ImmutableDictionary<TKey, TValue> ToImmutableReferenceDictionary<TKey, TValue, TSourceValue>(
    this IDictionary<TKey, TSourceValue> source, Func<KeyValuePair<TKey, TSourceValue>, TValue> elementSelector) where TKey : class =>
    source.ToImmutableDictionary(pair => pair.Key, elementSelector, ReferenceEqualityComparer<TKey>.Instance);

(I could easily add similar methods for building lookups as well, of course.) Feel free to take issue with the names – while they’re only within the election repo, I’m not going to worry too much about them.
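
In use, they read along these lines (the variable and property names here are just for illustration):

// Key the 2024 results by their constituency, comparing keys by reference.
var resultsByConstituency = results2024.ToImmutableReferenceDictionary(r => r.Constituency);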

Interlude

Why not make reference equality the default?

I could potentially kill two birds with one stone here. If I often want reference equality, and “deep” equality is relatively hard to achieve, why not just provide Equals and GetHashCode methods that make all my records behave with reference equality comparisons?

That’s certainly an option – but I do lean on the deep equality comparison for testing purposes: if I load the same context twice for example, the results should be equal, otherwise there’s something wrong.

Moreover, as record types encourage deep equality, it feels like I’d be subverting their natural behaviour by specifying reference equality comparisons. While I’m not expecting anyone else to ever see this code, I don’t like writing code which would confuse readers who come with expectations based on how most code works.

End of interlude

Speaking of extension methods for commonly-used comparers…

Ordinal string comparisons

String comparisons make me nervous. I’m definitely not an internationalisation expert, but I know enough to know it’s complicated.

I also know enough to be reasonably confident that the default string comparisons are ordinal for Equals and GetHashCode, but culture-sensitive for CompareTo. As I say, I’m reasonably confident in that – but I always find it hard to validate, so given that I almost always want to use ordinal comparisons, I like to be explicit. Previously I’ve specified StringComparer.Ordinal (or StringComparer.OrdinalIgnoreCase just occasionally) but – just as above with the reference equality comparer – that gets irritating if you’re using it a lot.

I’ve therefore created another bunch of extension methods, just to make it clear that I want to use ordinal string comparisons – even if (in the case of equality) that would already be the default.

I won’t bore you with the full methods, but I’ve got:

  • OrderByOrdinal
  • OrderByOrdinalDescending
  • ThenByOrdinal
  • ThenByOrdinalDescending
  • ToImmutableOrdinalDictionary (4 overloads, like the ones above for ToImmutableReferenceDictionary)
  • ToOrdinalDictionary (4 overloads again)
  • ToOrdinalLookup (2 overloads)

(I don’t actually use ToOrdinalLookup much, but it feels sensible to implement all of them.)
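
As an indication of how thin these wrappers are, a couple of them might look something like this (assumed shapes rather than the exact election-site code):

using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;

public static class OrdinalExtensions
{
    // Orders by a string key using ordinal comparison rules.
    public static IOrderedEnumerable<TSource> OrderByOrdinal<TSource>(
        this IEnumerable<TSource> source, Func<TSource, string> keySelector) =>
        source.OrderBy(keySelector, StringComparer.Ordinal);

    // Builds an immutable dictionary with ordinal key comparisons.
    public static ImmutableDictionary<string, TSource> ToImmutableOrdinalDictionary<TSource>(
        this IEnumerable<TSource> source, Func<TSource, string> keySelector) =>
        source.ToImmutableDictionary(keySelector, StringComparer.Ordinal);
}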

Would these be useful in the framework? Possibly. I can see why they’re not there – string is “just another type” really… but I bet a high proportion of uses of LINQ end up with strings as keys in some form or another. Possibly I should suggest this for MoreLINQ – although having started the project over 15 years ago, I haven’t contributed to it for over a decade…

Primary constructor and record “call hierarchy” niggle in VS

I use “call hierarchy” in Visual Studio all the time. Put your cursor on a member, then Ctrl-K, Ctrl-T and you can see everything that calls that member, and what calls the caller, etc.

For primary constructor and record parameters, “find references” works (Ctrl-K, Ctrl-R) but “call hierarchy” doesn’t. I’m okay with “call hierarchy” not working for primary constructor parameters, but as the record parameters become properties, I’d expect to see the call hierarchy for them just as I can with any other property.

More frustrating though is the inability to see the call hierarchy for “calling the constructor”. Given that the declaration of the class/record sort of acts as the declaration of the constructor as well, I’d have thought that putting your cursor on the class/record declaration (in the name) would work. It’s not that it’s ambiguous – Visual Studio just complains that “Cursor must be on a member name”. You can get at the calls by expanding the source file entry in Solution Explorer, but it’s weird only to have to do that for this one case.

Feature requests (for the C# language, .NET, and Visual Studio)

In summary, I love records, and love the immutable collections – but some friction could be reduced with the introduction of:

  • Some way of controlling (on a per-property basis) which equality comparer is used in the generated code
  • Equality comparers for immutable collections, with the ability to specify the element comparisons to use
  • An IEqualityComparer<T> implementation which performs reference comparisons
  • “Call Hierarchy” showing the calls to the constructors for primary constructors and records

Conclusion

Some of the niggles I’ve found with records and collections are at least somewhat specific to my election site, although I strongly suspect that I’m not the only developer with immutable collections in their records, with a desire to use them in equality comparisons.

Overall, records have served me well so far in the site, and I’m definitely pleased that they’re available, even if there are still possible improvements to be made. Similarly, it’s lovely to have immutable collections just naturally available – but some help in performing comparisons with them would be welcome.

Lessons from election night

Introduction

On Thursday (July 4th, 2024) the UK held a general election. There are many, many blog posts, newspaper articles, podcast episodes etc covering the politics of it, and the lessons that the various political parties may need to learn. I, on the other hand, learned very different lessons on the night of the 4th and the early morning of the 5th.

In my previous blog post, I described the steps I’d taken at that point to build my election web site. At the time, there was no JavaScript – I later added the map view, interactive view and live view which all do require JavaScript. Building those three views, adding more prediction providers, and generally tidying things up a bit took a lot of my time in the week and a half between the blog post and the election – but the election night itself was busier still.

Only two things really went “wrong” as such on the night, though they were pretty impactful.

Result entry woes

Firstly, the web site used to crowd source results for Democracy Club had issues. I don’t know the details, and I’m certainly not looking to cause any trouble or blame anyone. But just before 2am, the web site no longer loaded, which means no new results were being added. My site doesn’t use the Democracy Club API directly – instead, it loads data from a Firestore database, and I have a command-line tool to effectively copy the data from the Democracy Club API to Firestore. It worked very smoothly to start with – in fact the first result came in while I was coding a new feature (using the exit poll as another prediction provider) and I didn’t even notice. But obviously, when the results stop being submitted, that’s a problem.

At first, I added the results manually via the Firestore console, clearing the backlog of results that I’d typed into a text document as my wife had been calling them out from the TV. I’d hoped the web site problems were just a blip, and that I could just keep up via the manual result entry while the Democracy Club folks sorted it out. (It seemed unlikely that I’d be able to help fix the site, so I tried to avoid interrupting their work instead.) At one point the web site did come back briefly, but then went down again – at which point I decided to assume that it wouldn’t be reliable again during the night, and that I needed a more efficient solution than using the Firestore console. I checked every so often later on, and found that the web site did come back every so often, but it was down as often as it was up, so after a while I stopped even looking. Maybe it was all sorted by the time I’d got my backup solution ready.

That backup solution was to use Google Sheets. This was what I’d intended from the start of the project, before I knew about Democracy Club at all. I’ve only used the Google Sheets API to scrape data from sheets, but it makes that really quite simple. The code was already set up, including a simple “row to dictionary” mapping utility method, and I already had a lot of the logic to avoid re-writing existing results in the existing tooling targeting Democracy Club – so creating a new tool to combine those bits didn’t take more than about 20 minutes to write. Bear in mind though that this is at 2:30am, with more results coming in all the time, and I’d foolishly had a mojito earlier on.

After a couple of brief teething problems, the spreadsheet result sync tool was in place. I just needed to type the winning party into the spreadsheet next to the constituency name, and every 10 seconds the tool would check for changes and upload any new results. It was a frantic job trying to keep up with the results as they came in (or at least be close to keeping up), but it worked.

Then the site broke, at 5:42am.

Outage! 11 minutes of (partial) downtime

The whole site has been developed rapidly, with no unit tests and relatively little testing in general, beyond what I could easily check with ad hoc data. (In particular, I would check new prediction data locally before deploying to production.) I’d checked a few things with test results, but I hadn’t tested this statement:

Results2024Predicted = context.Predictions.Select(ps => (ps, GetResults(ps.GetPartyOrNotPredicted)))
    // Ignore prediction sets with no predictions in this nation.
    .Where(pair => pair.Item2[Party.NotPredicted] != Constituencies.Count)
    .ToList();

The comment indicates the purpose of the Where call – I have a sort of “fake” value in the Party enum for “this seat hasn’t been predicted, or doesn’t have a result”. That worked absolutely fine – until enough results had come in that at about 5:42am one of the nations (I forget which one) no longer had any outstanding seats. At that point, the dictionary in pair.Item2 (yes, it would be clearer with a named tuple element) didn’t have Party.NotPredicted as a key, and this code threw an exception.

One of the friends I was with spotted that the site was down before I did, and I was already working on it when I received a direct message on Twitter from Sam Freedman about the outage. Yikes. Fortunately by now the impact of the mojito was waning, but the lack of sleep was a significant impairment. In fact it wasn’t the whole site that was down – just the main view. Those looking at the “simple”, “live”, “map” or “interactive” views would still have been fine. But that’s relatively cold comfort.

While this isn’t the fix I would have written with more time, this is what I pushed at 5:51am:

Results2024Predicted = context.Predictions.Select(ps => (ps, GetResults(ps.GetPartyOrNotPredicted)))
    // Ignore prediction sets with no predictions in this nation.
    .Where(pair => !pair.Item2.TryGetValue(Party.NotPredicted, out var count) || count != Constituencies.Count)
    .ToList();

Obviously I tested that locally before pushing to production, but I was certainly keen to get it out immediately. Fortunately, the fix really was that simple. At 5:53am, through the magic of Cloud Build and Kubernetes, the site was up and running again.

So those were the two really significant issues of the night. There were some other mild annoyances which I’ll pick up on below, but overall I was thrilled.

What went well?

Overall, this has been an immensely positive experience. It went from a random idea in chat with a friend on June 7th to a web site I felt comfortable sharing via Twitter, with a reasonable amount of confidence that it could survive modest viral popularity. Links in a couple of Sam Freedman’s posts definitely boosted the profile, and monitoring suggested I had about 30 users on the “live” view, which refreshes the content via JavaScript every 10 seconds. Obviously 30 users isn’t huge, but I’ll definitely take it – this was in the middle of the night, with plenty of other ways of getting coverage.

I’ve learned lots of “small” things about Razor pages, HTML, CSS and JavaScript, as well as plenty of broader aspects that I’ve described below.

Other than the short outage just before 6am – which obviously I’m kicking myself about – the site behaved itself really well. The fact that I felt confident deploying a new feature (the exit poll predictions) at 11:30pm, and removing a feature (the swing reporting, which was being calculated incorrectly from majority percentages) at 3am, is an indication of how happy I am with the code overall. I aimed to create a simple site, and I did so.

What would I do differently next time?

Some of the points below were thoughts I’d had before election night. Some had been considered beforehand, but were only confirmed as “yes, this really did turn out to be a problem” on the night itself. Some were completely unexpected.

Don’t drink!

At about 7pm, I’d been expecting to spend the time after the exit poll was announced developing a tool to populate my result database from a spreadsheet, as I hadn’t seen any confirmation from Democracy Club that the results pages were going to be up. During dinner, I saw messages on Slack saying it would all be okay – so I decided it would be okay to have a cocktail just after the exit polls came out. After all, I wasn’t really expecting to be active beyond confirming results on the Democracy Club page.

That was a mistake, as the next 10 hours were spent:

  • Adding the exit poll feature (which I really should have anticipated)
  • Developing the spreadsheet-to-database result populator anyway
  • Frantically adding results to the spreadsheet as quickly as I could

I suspect all of that would have been slightly easier with a clear head.

Avoid clunky data entry where possible (but plan ahead)

When the Democracy Club result confirmation site went down, I wasn’t sure what to do. I had to choose between committing to “I need new tooling now” – accepting that there’d be no result updates while I was writing it – and doing what I could to add results manually via the Firestore console, hoping that the result site would be back up shortly.

I took the latter option, and that was a mistake – I should have gone straight for writing the tool. But really, the mistake was not writing the tool ahead of time. If I’d written the tool days before just in case, not only would I have saved that coding time on the night, but I could also have added more validation to avoid data entry errors.

To be specific: I accidentally copied a load of constituency names into my result spreadsheet where the party names should have been. They were dutifully uploaded to Firestore, and I then deleted each of those records manually. I then pasted the same set of constituency names into the same (wrong) place in the spreadsheet again, because I’m a muppet. In my defence, this was probably at about 6am – but that’s why it would have been good to have written the tool in advance, anticipating data entry errors. (The second time I made the mistake, I adjusted the tool so that fixing the spreadsheet would fix the data in Firestore too.)

Better full cache invalidation than “redeploy the site”

A couple of times – again due to manual data entry, this time of timestamp values – the site ended up polling for results that (according to the data) wouldn’t be uploaded until two hours in the future. Likewise, even before the night itself, my “reload non-result data every 10 minutes” policy was slightly unfortunate. (I’d put a couple of candidates in the wrong seats.) I always had a way of flushing the cache: just redeploy the site. The cache was only in memory, after all. Redeploying is certainly effective – but it’s clunky and annoying.

In the future, I expect to have something in the database to say “reload all data now”. That may well be a Firestore document which also contains other site options such as how frequently to reload other data. I may investigate the IOptionsMonitor interface for that.
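
For what it’s worth, here’s a minimal sketch of the sort of shape IOptionsMonitor usage could take – the SiteOptions class and the reload logic are entirely hypothetical, and Firestore would need a custom configuration provider (or periodic re-binding) for changes to be picked up automatically:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Options;

// Hypothetical options class for this sketch.
public class SiteOptions
{
    public TimeSpan ResultReloadInterval { get; set; } = TimeSpan.FromSeconds(10);
    public bool ReloadAllDataNow { get; set; }
}

public class DataReloadService : BackgroundService
{
    private readonly IOptionsMonitor<SiteOptions> options;

    public DataReloadService(IOptionsMonitor<SiteOptions> options) => this.options = options;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // OnChange fires whenever the underlying configuration source reports a change,
        // so a "reload all data now" flag could trigger an immediate refresh.
        options.OnChange(opts => { /* trigger a full reload here */ });
        while (!stoppingToken.IsCancellationRequested)
        {
            // Regular reload work goes here...
            await Task.Delay(options.CurrentValue.ResultReloadInterval, stoppingToken);
        }
    }
}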

Better “no result update” than “site down”

The issue with the site going down was embarrassing, of course. I started thinking about how I could avoid that in the future. Most of the site is really very static – the only thing that drives any change in page content is when some aspect of the data is reloaded. With the existing code, there’s no “load the data” within the serving path – it’s all updated periodically with a background service. The background service then provides an ElectionContext which can be retrieved from all the Razor page code-behind classes, and that’s effectively transformed into a view-model for the page. The view-model is then cached while the ElectionContext hasn’t changed, to avoid recomputing how many seats have been won by each party etc.

The bug that brought the site down – or rather, the main view – was in the computation of the view-model. If the code providing the ElectionContext instead provided the view-model, keeping the view-model computation out of the serving path, then a failure to build the view-model would just mean stale data instead of a page load failure. (At least until the server was restarted, of course.) Admittedly if the code computing the view-model naively transformed the ElectionContext into all the view-models, then a failure in one would cause all the view-models to fail to update. This should be relatively easy to avoid though.
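
As a sketch of what I mean – hypothetical code, not what’s currently deployed – each view-model would get its own little cache that keeps serving the previous value if a rebuild fails:

using System;

// ElectionContext is the real type described above; everything else here is hypothetical.
public class CachedViewModelProvider<TViewModel>
{
    private readonly Func<ElectionContext, TViewModel> factory;
    private TViewModel current;

    public CachedViewModelProvider(Func<ElectionContext, TViewModel> factory) =>
        this.factory = factory;

    public TViewModel Current => current;

    public void Update(ElectionContext context)
    {
        try
        {
            current = factory(context);
        }
        catch (Exception e)
        {
            // Keep serving the previous (stale) view-model: a bug in one view-model
            // shouldn't take the page down, or stop other view-models from updating.
            Console.Error.WriteLine(e);
        }
    }
}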

My plan for the future is to have three clear layers in the new site:

  • Underlying model, which is essentially the raw data for the election, loaded from Firestore and normalized
  • View-models, which provide exactly what the views need but which don’t actually depend on anything in ASP.NET Core itself (except maybe HtmlString)
  • The views, with the view-models injected into the Razor pages

I expect to use a separate project for each of these, which should help to enforce layering and make it significantly easier to test the code.

Move data normalization and validation to earlier in the pipeline

The current site loads a lot of data from Google Sheets, using Firestore just for results. There’s a lot of prediction-provider-specific code used to effectively transform those spreadsheets into a common format. This led to multiple problems:

  • In order to check whether the data was valid with the transformation code, I had to start the web site
  • The normalization happened every time the data was loaded
  • If a prediction provider changed the spreadsheet format (which definitely happened…) I had to modify the code for it to handle both the old and the new format
  • Adopting a new prediction provider (or even just a new prediction set) always required redeploying the site
  • Loading data from Google Sheets is relatively slow (compared with Firestore) and the auth model for Sheets is more geared towards user credentials than services

All of this can be fixed by changing the process. If I move from “lots of code in the site to load from Sheets” to “lots of individual tools which populate Firestore, and a small amount of code in the site to read from Firestore” most of those problems go away. The transformation code can load all of the data and validate it before writing anything to Firestore, so we should never have any data that will cause the site itself to have problems. Adding a new prediction set or a new prediction provider should be a matter of adding collections and documents to Firestore, which the site can just pick up dynamically – no site-side code changes required.

The tooling doesn’t even have to load from Google Sheets necessarily. In a couple of cases, my process was actually “scrape HTML from a site, reformat the HTML as a CSV file, then import that CSV file into Google Sheets.” It would be better to just “scrape HTML, transform, upload to Firestore” without all the intermediate steps.

With that new process, I’d have been less nervous about adding the “exit poll prediction provider” on election night.

Capture more data

I had to turn down one feature – listing the size of swings and having a “biggest swings of the night” section – due to not capturing enough data. I’d hoped that “party + majority % in 2019” and “party + majority % in 2024” would be enough to derive the swing, but it doesn’t work quite that way. In the future, I want to capture as much data as possible about the results (both past and present). That will initially mean “all the voting information in each election” but may also mean a richer data model for predictions – instead of bucketing the predictions into toss-up/lean/likely/safe, it would be good to be able to present the original provider data around each prediction, whether that’s a predicted vote share or a “chance of the seat going to this party” – or just a toss-up/lean/likely/safe bucketing. I’m hoping that looking at all the predictions from this time round will provide enough of an idea of how to design that data model for next time.

Tests

Tests are good. I’m supportive of testing. I don’t expect to write comprehensive tests for a future version, but where I can see the benefit, I would like to easily be able to write and run tests. That may well mean just one complex bit of functionality getting a load of testing and everything else being lightweight, but that would be better than nothing.

In designing for testability, it’s likely that I’ll also make sure I can run the site locally without connecting to any Google Cloud services… while I’ll certainly have a Firestore “test” database separate from “prod”, it would be nice if I could load the same data just from local JSON files too.
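
One possible shape for that – again just a sketch, with hypothetical names apart from ElectionContext – is to hide the data loading behind a small interface, with one implementation backed by Firestore and another backed by local JSON files:

using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public interface IElectionDataSource
{
    Task<ElectionContext> LoadContextAsync(CancellationToken cancellationToken);
}

// Used by tests and local development: loads the same shape of data from files on disk.
public class LocalJsonElectionDataSource : IElectionDataSource
{
    private readonly string directory;

    public LocalJsonElectionDataSource(string directory) => this.directory = directory;

    public async Task<ElectionContext> LoadContextAsync(CancellationToken cancellationToken)
    {
        using var stream = File.OpenRead(Path.Combine(directory, "election-context.json"));
        return await JsonSerializer.DeserializeAsync<ElectionContext>(
            stream, cancellationToken: cancellationToken);
    }
}

// A production implementation would wrap the Firestore client in the same way.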

What comes next?

I enjoyed this whole experience so much that I’ve registered the https://election2029.uk domain. I figure if I put some real time into this, instead of cobbling it all together in under a month, I could really produce something that would be useful to a much larger group of people. At the moment, I’m planning to use Cloud Run to host the site (still using ASP.NET Core for the implementation) but who knows what could change between now and the next election.

Ideally, this would be open source from the start, but there are some issues doing that which could be tricky to get around, at least at the moment. Additionally, I’d definitely want to build on Google Cloud again, and with a site that’s so reliant on data, it would be odd to say “hey, you can look at the source for the site, but the data is all within my Google Cloud project, so you can’t get at it.” (Making the data publicly readable is another option, but that comes with issues too.) Maybe over the next few years I’ll figure out a good way of handling this, but I’m putting that question aside for the moment.

I’m still going to aim to keep it pretty minimal in terms of styling, only using JavaScript where it really makes sense to do so. Currently, I’m not using any sort of framework (Vue, React, etc) and if I can keep things that way, I think I’ve got more chance of being able to understand what I’m doing – but I acknowledge that if the site becomes larger, the benefits of a framework might outweigh the drawbacks. It does raise the question of which one I’d pick though, given the timescale of the project…

Beyond 2029, I’ll be starting to think about retirement. This project has definitely made me wonder whether retiring from full-time commercial work but providing tech tooling for progressive think-tanks might be a very pleasant way of easing myself into fuller retirement. But that’s a long way off…

Building an election website

Introduction

I don’t know much about my blog readership, so let’s start off with two facts that you may not be aware of:

  • I live in the UK.
  • The UK has a general election on July 4th 2024.

I’m politically engaged, and this is a particularly interesting election. The Conservative party have been in office for 14 years, and all the polls show them losing the upcoming election massively. Our family is going to spend election night with some friends, staying up for as many of the results as we can while still getting enough sleep for me to drive safely home the next day.

I recently started reading Comment is Freed, the Substack for Sam and Lawrence Freedman. This Substack is publishing an article every day in the run-up to the election, and I’m particularly interested in Sam’s brief per-constituency analysis and predictions. It was this site that made me want to create my own web site for tracking the election results – primarily for on-the-night results, but also for easy information lookup later.

In particular, I wanted to see how well the per-seat predictions matched reality. Pollsters in the UK are generally predicting three slightly different things:

  • Overall vote share (what proportion of votes went to each party)
  • Overall seat tallies (how many of the 650 individual constituencies each party wins)
  • Per-seat winners (sometimes with predicted majorities; sometimes with probabilities of winning)

The last of these typically manifests as what is known as an MRP prediction: Multi-level Regression and Poststratification. They’re relatively new, and we’re getting a lot of them in this election.

After seeing those MRPs appear over time, I reflected – and in retrospect this was obvious – that instead of only keeping track of how accurate Sam Freedman’s predictions were, it would be much more interesting to look at the accuracy of all the MRPs I could get permission to use.

At the time of this writing, the site includes data from the following providers:

  • The Financial Times
  • Survation
  • YouGov
  • Ipsos
  • More in Common
  • Britain Elects (as published in The New Statesman)

I’m expecting to add predictions from Focaldata and Sam Freedman in the next week.

Information on the site

The site is at https://jonskeet.uk/election2024, and it just has three pages:

  • The full view (or in colour) contains:
    • Summary information:
      • 2019 (notional) results and 2024 results so far
      • Predictions and their accuracy so far (in terms of proportion of declared results which were correctly called)
      • Hybrid “actual result if we know it, otherwise predicted” results for each prediction set
      • 2019/2024 and predicted results for the four nations of the UK
    • Per-seat information:
      • The most recent results
      • The biggest swings (for results where the swing is known; there may be results which don’t yet have majority information)
      • Recent “surprises” where a surprise is deemed to be “a result where at least half the predictions were wrong”
      • “Contentious” constituencies – i.e. ones where the predictions disagree most with each other
      • Notable losses/wins – I’ve picked candidates that I think will be most interesting to users, mostly cabinet and shadow cabinet members
      • All constituencies, in alphabetical order
  • The simple view (or in colour) doesn’t include predictions at all. It contains:
    • 2019 (notional) results and 2024 results so far
    • Recent results
    • Notable losses/wins
  • An introduction page so that most explanatory text can be kept off the main pages.

I have very little idea how much usage the site will get at all, but I’m hoping that folks who want a simple, up-to-date view of recent results will use the simple view, and those who want to check specific constituencies and see how the predictions are performing will use the full view.

The “colour mode” is optional because I’m really unsure whether I like it. In colour mode, results are colour-coded by party and (for predictions) likelihood. It does give an “at a glance” impression of the information, but only if you’ve checked which columns you’re looking at to start with.

Implementation

This is a coding blog, and the main purpose of writing this post was to give a bit of information about the implementation to anyone interested.

The basic architecture is:

  • ASP.NET Core Razor Pages, running on Google Kubernetes Engine (where my home page was already hosted)
  • Constituency information, “notable candidates” and predictions are stored in Google Drive
  • Result information for the site is stored in Firestore
  • Result information originates from the Democracy Club API, and a separate process uploads the data to Firestore
  • Each server refreshes its in-memory result data every 10 seconds and candidate/prediction data every 10 minutes via a background hosted service

A few notes on each of these choices…

I was always going to implement this in ASP.NET Core, of course. I did originally look at making it a Cloud Function, but currently the Functions Framework for .NET doesn’t support Razor. It doesn’t really need to, mind you: I could just deploy straight to Cloud Run. That would have been a better fit in terms of rapid scaling, to be honest; my web site service in my GKE cluster only runs on two nodes, and the cluster itself has three. If I spot a vast amount of traffic on the night, I can expand the cluster, but I don’t expect that to be nearly as quick to scale as Cloud Run would be. Note to self: possibly deploy to Cloud Run as a backup, and redirect traffic on the night. It would take a bit of work to get the custom domain set up though. This is unlikely to actually be required: the busiest period is likely to be when most of the UK is asleep anyway, and the site is doing so little work per request that it should be able to support at least several hundred requests per second without any changes.

Originally, I put all information, including results, in Google Drive. This is a data source I already use for my local church rota, and after a little initial setup with credential information and granting the right permissions, it’s really simple to use. Effectively I load a single sheet from the overall spreadsheet in each API request, with a trivial piece of logic to map each row into a dictionary from column name to value. Is this the most efficient way of storing and retrieving data? Absolutely not. But it’s not happening often, it ends up being really easy-to-read code, and the data is very easy to create and update. (As of 2024-06-24, I’ve added the ability to load several sheets within a single request, which unexpectedly simplified some other code too. But the “treat each row as a string-to-string dictionary” design remains.)
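
For the curious, the row-to-dictionary logic really is trivial – something along these lines (a sketch using the Sheets v4 client library’s ValueRange type, not the exact code in the site):

using System.Collections.Generic;
using System.Linq;
using Google.Apis.Sheets.v4.Data;

static class SheetRowMapper
{
    internal static List<Dictionary<string, string>> ToRowDictionaries(ValueRange range)
    {
        // First row is the header; every other row becomes a column-name-to-value dictionary.
        var headers = range.Values[0].Select(cell => cell?.ToString() ?? "").ToList();
        return range.Values.Skip(1)
            .Select(row => headers
                .Select((header, index) =>
                    (header, value: index < row.Count ? row[index]?.ToString() ?? "" : ""))
                .ToDictionary(pair => pair.header, pair => pair.value))
            .ToList();
    }
}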

For each “prediction provider” I store the data using the relevant sheets from the original spreadsheets downloaded from the sites. (Most providers have a spreadsheet available; I’ve only had to resort to scraping in a couple of cases.) Again, this is inefficient – it means fetching data for columns I’ll never actually access. But it means when a provider releases a new poll, I can have the site using it within minutes.

An alternative approach would be to do what I’ve done for results – I could put all the prediction information in Firestore in a consistent format. That would keep the site code straightforward, moving the per-provider code to tooling used to populate the Firestore data. If I were starting again from scratch, I’d probably do that – probably still using Google Sheets as an intermediate representation. It doesn’t make any significant difference to the performance of the site, beyond the first few seconds after deployment. But it would probably be nice to only have a single source of data.

The “raw” data is stored in what I’ve called an ElectionContext – this is what’s reloaded by the background service. This doesn’t contain any processed information such as “most recent” results or “contentious results”. Each of the three page models then has a static cache. A request for a new model where the election context hasn’t changed just reuses the existing model. This is currently done by setting ViewData.Model in the page model, to refer to the cached model. There may well be a more idiomatic way of doing this, but it works well. The upshot is that although the rendered page isn’t cached (and I could look into doing that of course), everything else is – most requests don’t need to do anything beyond simple rendering of already-processed data.
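
To make that concrete, the shape is roughly this – ElectionContext is the real type, but the provider and view-model names here are hypothetical:

using Microsoft.AspNetCore.Mvc.RazorPages;

public class IndexModel : PageModel
{
    // Shared across requests; only rebuilt when the background service has
    // swapped in a new ElectionContext instance.
    private static FullViewModel cachedViewModel;

    private readonly ElectionContextProvider contextProvider;

    public IndexModel(ElectionContextProvider contextProvider) =>
        this.contextProvider = contextProvider;

    public void OnGet()
    {
        var context = contextProvider.Current;
        var cached = cachedViewModel;
        // FullViewModel (hypothetical) exposes the context it was built from.
        if (cached is null || !ReferenceEquals(cached.Context, context))
        {
            // Recompute "most recent results", "contentious seats" etc. exactly once
            // per context change, rather than on every request.
            cached = new FullViewModel(context);
            cachedViewModel = cached;
        }
        ViewData.Model = cached;
    }
}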

I was very grateful to be informed about the Democracy Club API – I was expecting to have to enter all the result data manually myself (which was one reason for keeping it in Google Sheets). The API isn’t massively convenient, as it involves mapping party IDs to parties, ballot paper IDs to constituency IDs, and then fetching the results – but it only took a couple of hours to get the upload process for Firestore working. One downside of this approach is that I really won’t be able to test it end-to-end before the night – it would be lovely to have a fake server (running the same code) that I could ask to “start replaying 2019 election results” for example… but never mind. (I’ve tested it against the 2019 election results, to make sure I can actually do the conversion and upload etc.) You might be expecting this to be hosted in some sort of background service as well… but in reality it’s just a console application which I’ll run from my laptop on the night. Nothing to deploy, should be easy to debug and fix if anything goes wrong.

In terms of the UI for the site itself, the kind way to put it would be “efficient and simplistic”. It’s just HTML and CSS, and no request will trigger any other requests. The CSS is served inline (rather than via a separate CSS resource) – it’s small enough not to be a problem, and that felt simpler than making sure I handled caching appropriately. There’s no JS at all – partly because it’s not necessary, and partly because my knowledge of JS is almost non-existent. Arguably with JS in place I could make it autorefresh… but that’s about all I’d want to do, and it feels like more trouble than it’s worth. The good news is that this approach ends up with a really small page size. In non-colour mode, the simple view is currently about 2.5K, and the full view is about 55K. Both will get larger as results come in, but I’d be surprised to see them exceed 10K and 100K respectively, which means the site will probably be among the most bandwidth-efficient ways of accessing election data on the night.

Conclusion

I’ve had a lot of fun working on this. I’ll leave the site up after the election, possibly migrating all the data to Firestore at some point.

I’ve experienced yet again the joy of working on something slightly out of my comfort zone (I’ve learned bits of HTML and CSS I wasn’t aware of before, learned more about Razor pages, and used C# records more than I have elsewhere – and I love collection expressions) that is also something I want to use myself. It’s been great.

Unfortunately at the moment I can’t really make the code open source… but I’ll consider doing so after the election, as a separate standalone site (as opposed to part of my home page). It shouldn’t be too hard to do – although I should warn readers that the code is very much in the “quick and dirty hack” style.

Feedback welcome in the comments – and of course, I encourage the use of the site on July 4th/5th and afterwards…

Variations in the VISCA protocol

Nearly three years ago, I posted about some fun I’d been having with VISCA using C#. As a reminder, VISCA is a camera control protocol, originally used over dedicated serial ports, but more recently over IP.

Until this week, all the cameras I’d worked with were very similar – PTZOptics, Minrray and ZowieTek all produce hardware which at least gives the impression of coming out of the same factory, with the same “base” firmware that’s then augmented by the specific company. I’ve seen differences in the past, but they’re largely similar in terms of VISCA implementation.

This week, my Obsbot Tail Air arrived. I’ve been looking at Obsbot for a while, attracted by the small form factor, reasonable price, and fun object tracking functionality. However, earlier models didn’t have the combination of the two features I was most interested in: VISCA and NDI support. The Tail Air has both. I bought it in the hope that I could integrate it with my church A/V system (At Your Service) – allowing for auto-tracking and portability (as the Tail Air is battery powered and wireless).

The NDI integration worked flawlessly from the start. Admittedly the Tail Air takes a lot of bandwidth by default – I’ve turned the bitrate down to “low” in order to get to a reasonable level (which still looks great). But fundamentally it just worked.

VISCA was trickier – hence this blog post. First, there wasn’t documentation on whether it was using TCP or UDP, or which port it was listening on. To be clear, the product’s only just launched, and I’m sure the documentation will improve over time. Fortunately, there was information on the Facebook group, suggesting that other people had got some VISCA clients working with UDP port 52381.

To start with, that was going to cause a bit of a problem as my implementation only supported TCP. However, it was pretty easy to change it to support UDP; the code had already been written to isolate the transport from other aspects, at least to some extent. Fortunately, the PTZOptics camera supports both TCP and UDP, so it was easy to test the UDP implementation that way.

Unfortunately, the implementation that worked for my PTZOptics camera entirely failed for the Tail Air. After doing a bit of digging, I found out why.

It turns out that there are two variants of VISCA over IP – what I’ve called “raw” and “encapsulated”, but that’s definitely not official terminology. In the “raw” version, each message is between 3 and 16 bytes:

  • The first byte designates the type of message, the source and destination devices, and normal/broadcast mode.
  • There are 1-14 “payload” bytes
  • The last byte is always 0xff (and no other byte ever should be)

The “encapsulated” version uses the same data part as the “raw” version, but with an additional header of 8 bytes:

  • The first two bytes indicate the message “type” (command, inquiry, reply, device setting, control, control reply)
  • The next two bytes indicate the length of the data to follow (even though it’s never more than 16 bytes…)
  • The final four header bytes indicate a sequence number (initially 00 00 00 00, then 00 00 00 01 etc)

So for example, the raw command for “get power status” is 81-09-04-00-ff.

The encapsulated command for the same request (with a sequence number of 00-00-00-00) is 01-10-00-05-00-00-00-00-81-09-04-00-FF.
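
In code, the encapsulation is just a matter of prepending that 8-byte header to the raw message. A sketch (not the actual library code):

static class ViscaEncapsulation
{
    internal static byte[] Encapsulate(byte[] rawMessage, ushort messageType, uint sequenceNumber)
    {
        // 2 bytes of message type, 2 bytes of payload length, 4 bytes of sequence number
        // (all big-endian), followed by the raw message itself.
        byte[] packet = new byte[8 + rawMessage.Length];
        packet[0] = (byte) (messageType >> 8);
        packet[1] = (byte) messageType;
        packet[2] = (byte) (rawMessage.Length >> 8);
        packet[3] = (byte) rawMessage.Length;
        packet[4] = (byte) (sequenceNumber >> 24);
        packet[5] = (byte) (sequenceNumber >> 16);
        packet[6] = (byte) (sequenceNumber >> 8);
        packet[7] = (byte) sequenceNumber;
        rawMessage.CopyTo(packet, 8);
        return packet;
    }
}

// Encapsulate(new byte[] { 0x81, 0x09, 0x04, 0x00, 0xFF }, messageType: 0x0110, sequenceNumber: 0)
// produces 01-10-00-05-00-00-00-00-81-09-04-00-FF, matching the example above.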

Once I’d figured that out (and hand-crafted the “get power status” packet to check that I was on the right lines), the rest was fairly straightforward. My VISCA library now allows the use of TCP or UDP, and raw or encapsulated format.

The Tail Air still doesn’t behave quite the same as my PTZOptics camera in terms of VISCA support, mind you:

  • It ignores power standby/on commands.
  • The “set pan/tilt” command ignores the specified tilt speed, using the pan speed for both pan and tilt.
  • The “set pan/tilt” command replies that it’s completed immediately – instead of waiting until the camera has actually finished moving.
  • The pan/tilt and zoom limits are (understandably) different.

Still, none of that should prevent fuller integration within At Your Service. I need to take account of the different pan/tilt/zoom limits, allowing them to be configurable, but after that I should have a reasonably workable system… as well as a handy little test camera to take along for demo purposes!

All the code changes can be seen in the CameraControl directory of my demo code GitHub repo.

The Tail Air is not the only new toy I’ve received this week… I’ve also taken delivery of an Allen & Heath CQ20B digital mixer, so I’ve been playing with that today as well, trying to work out how to integrate that into DigiMixer. I’m really hoping to get a chance to write some more blog posts on DigiMixer over the Christmas holidays… watch this space.

Taking .NET MAUI for a spin

I’ve been keeping an eye on MAUI – the .NET Multi-platform App UI – for a while, but I’ve only recently actually given it a try.

MAUI is essentially the evolution of Xamarin.Forms, embracing WinUI 3 and expanding from a mobile focus to desktop apps as well. It’s still in preview at the time of writing, but only just – release candidate 1 came out on April 12th 2022.

I’ve been considering it as a long-term future for my V-Drum Explorer application, which is firmly a desktop app, just to make it available on macOS as well as Windows. However, when a friend mentioned that if only I had a mobile app for At Your Service (my church A/V system), it would open up new possibilities… well, that sounded like an opportunity to take MAUI for a spin.

This blog post is about initial impressions. It’s worth being really clear about that – please take both praise and criticism of MAUI with a pinch of salt. I’m not a mobile developer, I’m not a UI designer, I haven’t tried doing anything massively taxing with MAUI, and I may well be doing a bunch of things in the wrong way.

What’s the goal?

Besides “having fun”, the aim is to end up with a workable mobile app for At Your Service (AYS from here onwards). In an ideal world, that would work on iPhones, iPads, Android phones and Android tablets. In reality, the number of people who will ever use the app is likely to be 1 or 2 – and both of us have Android phones. So that’s all I’ve actually tested with. I may at some point try to build and test with an iPad just for kicks, but I’m not really bothered by it. As it happens, I’ve tried the Windows app version, but that hasn’t worked out for me so far – more details later.

So what does this mobile app need to do? While I dare say it may be feasible to replicate almost everything you can do with AYS, that’s not the aim here. I have no desire to create new service plans on my phone, nor to edit hymn words etc. The aim is only to use the application to “direct” a service without having to physically sit behind the A/V desk.

Crucially, there’s no sensible way that a mobile app could completely replace the desktop app, at least with our current hardware. While a lot of the equipment we use is networked (specifically the cameras and the mixer), the projector in the church building is connected directly to the desktop computer via an HDMI cable. (OBS Studio captures that output as a virtual webcam for Zoom.) Even if everything could be done with an entirely standalone mobile app, it would mean reimplementing or at least adapting a huge amount of code.

Instead, the aim is to make the mobile app an alternative control mechanism for an instance of AYS running on the church desktop in the normal way. I want it to be able to handle all the basic functionality used during a service:

  • Switch between “scenes” (where a scene in AYS is something like “a hymn” or “a reading” or “the preacher in a particular place”; switching between scenes brings up all the appropriate microphones and cameras, as well as whatever text/PowerPoint/video needs to be displayed)
  • Change pages in scenes with text content (e.g. hymns and liturgy)
  • Change slides in PowerPoint presentations
  • Play/pause for videos, along with volume control and simple “back 10 seconds” and “forward 10 seconds” buttons
  • Basic camera controls, across multiple cameras
    • Move to a preset
    • Change whether the camera is shown or not, and how it’s shown (e.g. “top right corner”)
  • Basic mixer controls
    • Mute/unmute microphones
    • Potentially change the volume for microphones – if I do this, I might want to change the volume for the Zoom output independently of the in-building output

What’s the architecture?

The desktop AYS system already has a slightly split architecture: the main application is 64-bit, but it launches a 32-bit secondary app which is a combined WPF + ASP.NET Core server to handle Zoom. (Until fairly recently, the Zoom SDK didn’t support 64-bit apps, and the 32-bit address space ended up causing problems when decoding multiple cameras.) That meant it wasn’t much of a stretch to figure out at least one possible architecture for the mobile app:

  • The main (desktop) AYS system runs an ASP.NET Core server
  • The mobile app connects to the main system via HTTP, polling for current status and making control requests such as “switch to scene 2”.

Arguably, it would be more efficient to use a gRPC stream to push updates from the desktop system to the mobile app, and at some point I might introduce gRPC into the mix, but frequent polling (about every 100ms) seems to work well enough. Sticking to just JSON and “regular” HTTP for requests and responses also makes it simple to test some aspects in a browser.

One quirk of both of the servers is that although they receive the requests on threadpool threads, almost all of them use the WPF dispatcher for execution. This means I don’t need to worry about (say) a status request seeing half the information from before a scene change and half the information after a scene change. It also means that the rest of the AYS desktop code can still assume that anything that affects the UI will happen on the dispatcher thread.

Even without using gRPC, I’ve made a potentially silly choice of effectively rolling my own request handlers instead of using Web API. There’s a certain amount of wheel reinvention going on, and I may well refactor that away at some point. It does allow for some neatness though: there’s a shared project containing the requests and responses, and each request is decorated (via an attribute) with the path on which it should be served. The “commands” (request handlers) on the server are generic in the request/response types, and an abstract base class registers that command with the path on the request type. Likewise when making a request, a simple wrapper class around HttpClient can interrogate the request type to determine the path to use. At some point I may try to refactor the code to keep that approach – avoiding duplication of path information – while not doing quite as much wheel reinvention as at the moment.
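
To make that less abstract, here’s a sketch of the shape – ApiClient is the wrapper mentioned above, but the rest of the names, and all of the code, are hypothetical rather than the real AYS implementation:

using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Reflection;
using System.Threading.Tasks;

[AttributeUsage(AttributeTargets.Class)]
public sealed class RequestPathAttribute : Attribute
{
    public string Path { get; }
    public RequestPathAttribute(string path) => Path = path;
}

// Shared project: the request type carries its own path.
[RequestPath("/api/scenes/change")]
public class ChangeSceneRequest
{
    public int SceneIndex { get; set; }
}

// Server side: the base class picks up the path from the request type,
// so registration can't get out of sync with the client.
public abstract class CommandBase<TRequest, TResponse>
{
    public string Path { get; } =
        typeof(TRequest).GetCustomAttribute<RequestPathAttribute>()!.Path;

    public abstract Task<TResponse> ExecuteAsync(TRequest request);
}

// Client side: a thin wrapper around HttpClient interrogates the same attribute.
public class ApiClient
{
    private readonly HttpClient client;

    public ApiClient(HttpClient client) => this.client = client;

    public async Task<TResponse> SendAsync<TRequest, TResponse>(TRequest request)
    {
        string path = typeof(TRequest).GetCustomAttribute<RequestPathAttribute>()!.Path;
        var response = await client.PostAsJsonAsync(path, request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadFromJsonAsync<TResponse>();
    }
}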

What does the UI look like? (And how does it work?)

When I first started doing a bit of research into how to create a MAUI app, there was a pleasant coincidence: I’d expected a tabbed UI, with one of the tabs for each part of the functionality listed above. As it happens, that’s made particularly easy in MAUI with the Shell page. Fortunately I found the documentation for that before going down the route of a more manually-crafted use of tabs. The shell automatically removes the tab indicator if only one tab is visible, and basically handles things reasonably simply.

The binding knowledge I’ve gradually gained from building WPF apps (specifically V-Drum Explorer and AYS) was almost immediately applicable – fortunately I saw documentation noting that the DataContext in WPF is BindingContext in MAUI, and from there it was pretty straightforward. The code is “mostly-MVVM” in a style that I’ve found to be pretty pragmatic when writing AYS: I’m not dogmatic about the views being completely code-free, but almost everything is in view models. I’ve always found command binding to be more trouble than it’s worth, so there are plenty of event handlers in the views that just delegate directly to the view model.

There’s a separate view model for each tab, and an additional “home” tab (and corresponding view model) which is just about choosing a system to connect to. (I haven’t yet implemented any sort of discovery broadcast. I don’t even have app settings – it’s just a manually-curated set of URLs to connect to.) The “home” view model contains a reference to each of the other view models, and they all have two features (not yet via an interface, although that could come soon):

  • Update the view model based on a status polling response
  • A property to determine whether the tab for the view model should be visible. (If there’s no text being displayed, we don’t need to display the text tab, etc.)

I’m not using any frameworks for MVVM: I have a pretty simplistic ViewModelBase which makes it easy enough to raise property-changed events, including automatically raising events for related properties that are indicated by attributes. I know that at some point I should probably investigate C# source generators to remove the boilerplate, but it’s low down my priority list.
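
For context, the sort of thing I mean is roughly this – a simplified, hypothetical sketch rather than the actual code:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Reflection;
using System.Runtime.CompilerServices;

// Declares that when the decorated property changes, these other (usually computed)
// properties should raise change notifications too.
[AttributeUsage(AttributeTargets.Property)]
public sealed class AlsoNotifyAttribute : Attribute
{
    public string[] Properties { get; }
    public AlsoNotifyAttribute(params string[] properties) => Properties = properties;
}

public abstract class ViewModelBase : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    protected bool SetProperty<T>(ref T field, T value, [CallerMemberName] string name = null)
    {
        if (EqualityComparer<T>.Default.Equals(field, value))
        {
            return false;
        }
        field = value;
        RaisePropertyChanged(name);
        // Raise change notifications for any related properties declared via the attribute.
        var related = GetType().GetProperty(name)
            ?.GetCustomAttribute<AlsoNotifyAttribute>()?.Properties ?? Array.Empty<string>();
        foreach (var property in related)
        {
            RaisePropertyChanged(property);
        }
        return true;
    }

    protected void RaisePropertyChanged(string name) =>
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(name));
}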

MAUI supports dependency injection, and at some point when investigating navigating between different tabs (which initially didn’t work for reasons I still don’t understand) I moved to using DI for the view models, and it’s worked well. The program entry point is very readable (partly due to a trivial ConfigureServices extension method which I expect to be provided out-of-the-box at some point):

public static MauiApp CreateMauiApp() => MauiApp
    .CreateBuilder()
    .UseMauiApp<App>()
    .ConfigureFonts(fonts => fonts
        .AddFont("OpenSans-Regular.ttf", "OpenSansRegular")
        .AddFont("OpenSans-Semibold.ttf", "OpenSansSemibold"))
    .ConfigureServices(services => services
        .AddSingleton<AppShell>()
        .AddSingleton<HomeViewModel>()
        .AddSingleton<MixerViewModel>()
        .AddSingleton<MultiCameraViewModel>()
        .AddSingleton<TextViewModel>()
        .AddSingleton<MediaViewModel>()
        .AddSingleton<PowerPointViewModel>()
        .AddSingleton<ScenesViewModel>()
        .AddSingleton<ApiClient>())
    .Build();
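
(For reference, the ConfigureServices extension method I mean is about as trivial as it gets – something like this sketch:)

using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Maui.Hosting;

public static class MauiAppBuilderExtensions
{
    // Just lets the service registrations sit inside the fluent builder chain above.
    public static MauiAppBuilder ConfigureServices(
        this MauiAppBuilder builder, Action<IServiceCollection> configure)
    {
        configure(builder.Services);
        return builder;
    }
}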

I’ve had to tweak the default style very slightly: the default “unselected tab” colour is almost unreadably faint, and for my use case I really need to be able to see which tabs are available at any given time. Fortunately the styling is pretty clear – it didn’t take much experimentation to get the effect I wanted. Likewise I added extra styles for the next/previous buttons for the PowerPoint and text tabs.

Sources of truth

One aspect I always find interesting in this sort of UI is what the source of truth is. As an example, what should happen when I select a different text page to display? Obviously I need to send a request to the main application to make the change, but what should the UI do? Should it immediately update, expecting that the request will be successful? Or should it only update when we next get a status polling response that indicates the change?

I’ve ended up going for the latter approach, after initially using the former. The main reason for this is to make the UI smoother. It’s easy to end up with race conditions when there’s no one source of truth. For example, here’s a situation I’ve seen happen:

  • T=0: Make status request
  • T=1: Status response: text page 3 is selected
  • T=2: Start status request
  • T=3: User clicks on page 4
  • T=4: Start “move to page 4” request
  • T=5: Status response: text page 3 is selected
  • T=6: Page change response: OK
  • T=7: Start status request
  • T=8: Status response: text page 4 is selected

(These aren’t meant to be times in seconds or anything – just a sequence of instants in time.)

If the UI changes at T=3 to show page 4 as the selected one, then it ends up bouncing back to page 3 at T=5, then back to page 4 at T=8. That looks really messy.

If instead we say that the only thing that can change the UI displaying the selected page is a status response, then we stay with page 3 selected from T=1 until T=8. The user needs to wait a little longer to see the result, but it doesn’t bounce between two sources of truth. As I’m polling every ~100ms, it doesn’t actually take very long to register. This also has the additional benefit that if the “change page” request fails, the UI still reflects the reality of the system state.

If this all sounds familiar from another blog post, that’s because it is. When originally writing about controlling a digital mixer using OSC and an X-Touch Mini, I observed the same thing. I’m sure there are plenty of cases where this approach doesn’t apply, but it seems to be working for me at the moment. It does affect how binding is used – effectively I don’t want to “allow” a list item to be selected, instead reacting to the intent to select it.
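
In code, that means the tap handler only ever sends the request – it never touches the selection state itself. A sketch (TextViewModel is the real view-model name from the DI registration above, but the members and the page type here are hypothetical):

using Microsoft.Maui.Controls;

public partial class TextPage : ContentPage
{
    private readonly TextViewModel viewModel;

    public TextPage(TextViewModel viewModel)
    {
        InitializeComponent();
        BindingContext = this.viewModel = viewModel;
    }

    // Tapping expresses intent only; the highlighted page changes when a later
    // status polling response reports that the main system has actually moved.
    private async void OnPageTapped(object sender, ItemTappedEventArgs e)
    {
        if (e.Item is TextPageItem page)
        {
            await viewModel.RequestPageChangeAsync(page.Index);
        }
    }
}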

Screenshots

This section shows the tabs available, without very much explanation. I really wanted to include two of my favourite features: PowerPoint slide previews (little thumbnail images of the slides) and camera snapshots (so the user can see what a camera is currently pointing at, even if that camera isn’t being displayed on-screen at the moment). Unfortunately, images seem to be somewhat broken in RC-1 at the moment. I can get the PowerPoint slides to display in my ListView if I just use an ImageCell, but that’s too restrictive. I can’t get the camera preview to display at all. I think it’s due to this issue, but I’m not entirely sure.

With that caveat out of the way, let’s have a little tour of the app.

Home tab

On starting the app, it’s not connected to any system, so the user has to select one from the drop-down and connect. Notice how there are no tabs shown at this point.

Home tab (disconnected)

After connecting, the app shows the currently-loaded service (if there is one). If there’s a service loaded that contains any scenes at all, the Scenes tab is visible. The Mixer and Cameras tabs will always be visible when connected to a system (unless that system has no sound channels or no cameras, which seems unlikely).

In the screenshot below, the Text tab is also visible, because it so happens that the current scene contains text.

Home tab (connected)

Scenes tab

The Scenes tab shows the list of scenes, indicating which one is currently “active” (if any). If there is an active scene, the “Stop Scene” button is visible. (I’m considering having it just disabled if there’s no active scene, rather than it appearing and disappearing.)

Tapping on a scene launches it – and if that scene has text, PowerPoint or a video, it automatically navigates to that tab (as the next thing the user will probably want to do is interact with that tab).

Scenes tab

Text tab

The text tab shows the various pages of the text being displayed. Even though AYS supports styling (different colours of text, bold, italic etc) the preview is just plain text. It’s deliberately set at about 3 1/2 lines of text, which makes it obvious when there’s more being displayed than just what’s in the preview.

The user can select different pages by tapping on them – or just keep using the “next” button in the top right. The selected page is scrolled into view when there are more pages available than can be shown at a time.

Text tab

PowerPoint tab

The PowerPoint tab is like the text tab, but for PowerPoint slides. The screenshot below looks pretty awful due to the image display bug mentioned earlier. When preview images are working, they appear on the right hand side of the list view. (The slide numbers are still displayed.)

PowerPoint tab

Media tab

The media tab is used for videos, audio, and pictures. (There’s nothing that can usefully be done with a single picture; at some point I want to create the idea of a “multi-picture media item” as an alternative to creating a PowerPoint presentation where each slide is just an image.)

As noted before, simple controls are available:

  • Play/pause
  • Back/forward (10 seconds)
  • Volume up/down (in increments of 10 – a slider would be feasible, but not terribly useful)

One thing I’ve found very useful in AYS in general is the indicator for the current position and the total length of the media item. The screenshot below shows that the media filename is shown in this tab – whereas it’s not in the PowerPoint tab at the moment (nor the title of the text item in the Text tab). I could potentially move the title to become the title of the tab, and put it in all three places… I’m really not sure at the moment.

Media tab

Mixer tab

The mixer tab shows which microphones are muted (toggled off) or unmuted (toggled on) as well as their current output gain within the church building (the numbers on the left hand side, in dB). At the moment, the only interaction is to mute and unmute channels; I’m not sure whether I’ll ever implement tweaking the volume. The intention is that this app is only for basic control – I’d still expect the user to be in the church building and able to access the computer for fine-grained control where necessary.

Mixer tab

Cameras tab

The cameras tab starts off with nothing useful: you have to select a camera in order to interact with it. At that point you can:

  • Change its window position
  • Change the “corner size” – when a camera position is top-left, top-right, bottom-left or bottom-right, you can change that to be 1/2, 1/3, 1/4 or 1/5 of the size of the window
  • Move it to a different preset
  • Take a preview snapshot (currently not working)

As you can see from the screenshot below (taken from the church configuration) we have quite a few presets. Note that unlike the Scene/Text/PowerPoint tabs, there’s no concept of a “currently selected” preset, at least at the moment. Once the camera has moved to a preset, it can be moved separately on the desktop system, with a good deal more fine-tuning available. (That’s how new presets are created: pan/tilt/zoom to the right spot, then set that up as a new preset.) That fine-tuning isn’t available at all on the mobile app. At some point I could add “up a bit”, “down a bit” etc, but anything more than that would require a degree of responsiveness that I just don’t think I’d get with the current architecture. But again, I think that’s fine for the use cases I’m actually interested in.

Cameras tab

Conclusion

So that’s the app. There are two big questions, of course:

  • Is it useful?
  • What’s MAUI like to work with?

The answer to the first is an unqualified “yes” – more so than I’d expected. Just a couple of days ago, on Maundy Thursday, we had a communion service with everyone seated around a table. A couple of weeks earlier, I would have had to be sat apart from the rest of the congregation, at the A/V desk. That would definitely have disrupted the sense of fellowship, at least for me – and I suspect it would have made others feel slightly awkward too. With the mobile app, I was able to control it all discreetly from my place around the table.

In the future, I’m expecting to use the app mostly at the start of a service, if I have other stewarding duties that might involve me being up at the lectern to give verbal notices, for example. I still expect that for most services I’ll use the desktop AYS interface, complete with Stream Deck and X-Touch Mini… but it’s really nice to have the mobile app as an option.

In terms of MAUI – my feelings vary massively from minute to minute.

Let’s start off with the good: two weeks ago, this application didn’t exist at all. I literally started it on April 5th, and I used it to control almost every aspect of the A/V on April 10th. That’s despite me never having used either MAUI or Xamarin.Forms before, hardly doing any mobile development before, MAUI not being fully released yet, and all of the development only taking place in spare time. (I don’t know exactly how long I spent in those five days, but it can’t have been more than 8-12 hours.)

Despite being fully functional (and genuinely useful), the app required relatively little code to implement, and will be easy to maintain. Most of the time, debugging worked well through either the emulator or my physical device, allowing UI changes to be made without restarting (this was variable) and regular debugger operations (stepping through code) worked far better than it feels they have any right to given the multiple layers involved.

It’s not all sunshine and roses though:

  • The lack of a designer isn’t a huge problem, but it did make everything that bit harder when getting started.
  • Various bugs existed in the MAUI version I was using last week, some of which have now been fixed… but at the same time, other bugs have been introduced, such as the image one mentioned above.
  • I’ve seen various crashes that are pretty much infeasible for me to diagnose and debug, given my lack of knowledge of the underlying system:
    • One is an IllegalStateException with a message of “The specified child already has a parent. You must call removeView() on the child’s parent first.”
    • One is a NullPointerException for Object.toString()
    • I don’t know how to reproduce either of them right now.
  • Even when images were working, getting the layout and scaling right for them was very much a matter of trial and error. Various other aspects of layout have been surprising as well – I don’t know whether my expectations are incorrect, or whether these were bugs. I’m used to layouts sometimes being a bit of a mystery, but these were very odd.
  • The Windows app should provide an easy way of prototyping functionality without needing an emulator… and the home tab appears to work fine. Unfortunately the other tabs don’t magically appear (as they do on Android) after connecting, which makes it hard to make any further progress.
  • Sometimes the emulator seems to get stuck, and I can’t deploy to it. Unsticking it seems to be hit and miss. I don’t know whether this is an issue in the emulator itself, or how VS and MAUI are interacting with it.

In short, it’s very promising – but this doesn’t really feel like it’s release-candidate-ready yet. Maybe my stability expectations are too high, or maybe I’ve just been unlucky with the bugs I happen to have hit, but it doesn’t feel like I’ve been doing anything particularly unusual. I’m hopeful that things will continue to improve though, and maybe it’ll all be rock solid in 6 months or so.

I can see myself using MAUI for some desktop apps in the future – but I suspect that for anything that doesn’t naturally feel like it would just fit into a mobile app (with my limited design skills) I expect to keep using WPF. Now that I’ve got a bit of experience with MAUI, I can’t see myself porting V-Drum Explorer to it any time soon – it very much feels like “a mobile app framework that lets you run those mobile apps on the desktop”. That’s not a criticism as such; I suspect it’s an entirely valid product direction choice, it just happens not to be what I’m looking for.

All the problems aside, I’m still frankly astonished at getting a working, useful mobile app built in less than a week (and then polished a bit over the following week). Hats off to the MAUI team, and I look forward to seeing the rough edges become smoother in future releases.

What’s up with TimeZoneInfo on .NET 6? (Part 2)

In part 1, we ended up with a lot of test data specified in a text file, but without working tests – and with a conundrum as to how we’d test the .NET Core 3.1 data which requires additional information about the “hidden” AdjustmentRule.BaseUtcOffsetDelta property.

As with the previous blog post, this one is fairly stream-of-consciousness – you’ll see me changing my mind and spotting earlier mistakes as I go along. It’s not about trying to give advice – it’s intended for anyone who is interested in my thought process. If that’s not the kind of material you enjoy, I suggest you skip this post.

Abstracting platform differences

Well, time does wonders, and an answer to making most of the code agnostic to whether it’s running under .NET 6 or not now seems obvious: I can introduce my own class which is closer to the .NET 6 AdjustmentRule class, in terms of accessible properties. As it happens, a property of StandardOffset for the rule makes my code simpler than adding TimeZoneInfo.BaseUtcOffset and AdjustmentRule.BaseUtcOffsetDelta together every time I need it. But fundamentally I can put all the information I need into one class, and populate that class appropriately (without even needing a TimeZoneInfo) for tests, and use the TimeZoneInfo where necessary in the production code. (We have plenty of tests that use the actual TimeZoneInfo – but using just data from adjustment rules makes it easy to experiment with the Unix representation while on Windows.)

That means adding some derived data to our text file for .NET Core 3.1 – basically working out what the AdjustmentRule.BaseUtcOffsetDelta would be for that rule. We can do that by finding one instant in time within the rule, asking the TimeZoneInfo whether that instant is in daylight savings, and then comparing the prediction of “zone base UTC offset and maybe rule daylight savings” with the actual UTC offset.
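
The derivation itself is only a few lines – something like this sketch (not the actual Noda Time code, and it assumes the rule lasts at least a week so that the sample instant is safely inside it):

static TimeSpan DeriveBaseUtcOffsetDelta(TimeZoneInfo zone, TimeZoneInfo.AdjustmentRule rule)
{
    // Pick an instant comfortably inside the rule...
    DateTime sample = rule.DateStart.AddDays(7);
    // ...then compare the actual offset with the prediction of
    // "zone base UTC offset, plus the daylight delta if we're in daylight time".
    TimeSpan actual = zone.GetUtcOffset(sample);
    TimeSpan predicted = zone.BaseUtcOffset +
        (zone.IsDaylightSavingTime(sample) ? rule.DaylightDelta : TimeSpan.Zero);
    // Any difference is the "hidden" BaseUtcOffsetDelta.
    return actual - predicted;
}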

With that in place, we get most of the way to passing tests – at least for the fixed data we’re loading from the text file.

Hidden data

There’s one problem though: Europe/Dublin in 1960. We have this adjustment rule in both .NET Core 3.1 and .NET 6:

1960-04-10 - 1960-10-02: Daylight delta: +00; DST starts April 10 at 03:00:00 and ends October 02 at 02:59:59.999

Now I know that should actually be treated as “daylight savings of 1 hour, standard offset of UTC+0”. The TimeZoneInfo knows that as well – if you ask for the UTC offset at (say) June 1st 1960, you get the correct answer of UTC+1, and if you ask the TimeZoneInfo whether it’s in daylight savings or not, it returns true… but in most rules, a “daylight delta” of 0 means “this isn’t in daylight time”.

I believe this is basically a problem with the conversion from the internal rules to the publicly-available rules which loses some “hidden” bits of information. But basically it means that when I create my standardized adjustment rule, I need to provide some extra information. That’s annoying in terms of specifying yet more data in the text file, but it’s doable.

Given that the Unix rules and the Windows rules are really quite different, and I’ve already got separate paths for them (and everything still seems to be working on Windows), at this point I think it’s worth only using the “enhanced” adjustment rule code for Unix. That has the helpful property that we don’t need different ways of identifying the first instant in an adjustment rule: for Unix, you always use the daylight saving transition time of day; for Windows you never do.

At this point, I’m actually rethinking my strategy of “introduce a new type”. It’s got so much more than I really wanted it to have, I think I’m just going to split what was originally one method into two:

// Hard to test: needs a time zone
internal static BclAdjustmentRule FromUnixAdjustmentRule(TimeZoneInfo zone, TimeZoneInfo.AdjustmentRule rule)

... becomes ...

// Same signature as before, but now it just extracts appropriate information and calls the one below.
internal static BclAdjustmentRule FromUnixAdjustmentRule(TimeZoneInfo zone, TimeZoneInfo.AdjustmentRule rule)

// Easier to test with data from a text file
internal static BclAdjustmentRule FromUnixAdjustmentRule(TimeZoneInfo.AdjustmentRule rule,
    string standardName, string daylightName, TimeSpan zoneStandardOffset, TimeSpan ruleStandardOffset,
    bool forceDaylightSavings)

The unit tests that I’m trying to get to pass with just rule data only need to call the second method. The production code (tested in other unit tests, but only on the “right” system) will call the first method.

(In the end, it turns out it’s easier to make the second method return a ZoneInterval rather than a BclAdjustmentRule, but the idea is the same.)

Are we nearly there yet?

At this point, I wish I’d paid slightly more attention while changing the code… because the code that did work for America/Sao_Paulo in 2018 is now failing for .NET 6. I can see why it’s failing now – I’m not quite so sure why it worked a few hours before.

The problem is in this pair of adjustment rules:

2017-10-15 - 2017-12-31: Daylight delta: +01; DST starts October 15 at 00:00:00 and ends December 31 at 23:59:59.999
2018-01-01 - 2018-02-17: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends February 17 at 23:59:59.999

These should effectively be merged: the end of the first rule should be the start of the second rule. In most rules, we can treat the rule as starting at the combination of “start date and start transition time” with a UTC offset of “the base offset of the zone” (not the rule). We then treat the rule as ending at the combination of “end date and end transition time” with a UTC offset of “the base offset of the zone + daylight delta of the rule”. But that doesn’t work in the example above: we end up with a gap of an hour between the two rules.

There’s a horrible hack that might fix this: if the end date is on December 31st with an end transition time of 23:59:59 (with any subsecond value), we could ignore daylight savings.
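As a sketch (ignoring BaseUtcOffsetDelta for simplicity, and using my own method name rather than the real code), the end-instant calculation including that hack would look something like this:

// Sketch only: compute the UTC end instant of a converted Unix rule, treating
// "ends December 31st at 23:59:59.something" as standard time (the hack above).
static DateTime GetRuleEndUtc(TimeZoneInfo zone, TimeZoneInfo.AdjustmentRule rule)
{
    var endTime = rule.DaylightTransitionEnd.TimeOfDay;
    var endLocal = rule.DateEnd + endTime.TimeOfDay;

    bool endsAtEndOfYear =
        rule.DateEnd.Month == 12 && rule.DateEnd.Day == 31 &&
        endTime.Hour == 23 && endTime.Minute == 59 && endTime.Second == 59;

    // Normally the end of the rule is expressed in "base offset + daylight delta";
    // with the hack in place, we ignore the daylight delta for end-of-year endings.
    var offset = zone.BaseUtcOffset + (endsAtEndOfYear ? TimeSpan.Zero : rule.DaylightDelta);
    return endLocal - offset;
}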

In implementing that, I found a commented-out piece of code which did exactly this, but which had effectively been abandoned in the refactoring described above – so that explains my confusion about why the code had only just stopped working.

With that in place, the data-based unit tests are green.

Now to run the main set of unit tests…

TimeZoneInfo-based unit tests

Just to reiterate, I have two types of tests for this code:

  • Unit tests based on rules described in a text file. These can be run on any implementation, and always represent “the rule data we’d expect to see on Unix”. There are currently 16 tests here – specific periods of history for specific time zones.
  • Unit tests based on the TimeZoneInfo objects exposed by the BCL on the system we’re running on. These include checking every time zone, and ensuring that every transition between 1950 and either 2037 or 2050 (depending on the system, for reasons I won’t go into now) is the same between the TimeZoneInfo representation, and the Noda Time representation of the TimeZoneInfo.

The first set of tests is what we’ve been checking so far – I now need to get the second set of tests working in at least four contexts: .NET Core 3.1 and .NET 6, on both Windows and Linux.
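Conceptually, the second set is just “convert the zone, walk through the period, and compare offsets at each transition”. A much-simplified sketch (not the real test code, which does rather more) looks roughly like this:

// Simplified sketch of the TimeZoneInfo-based tests (not the real test code).
static void AssertConversionMatchesBcl(TimeZoneInfo bclZone, Instant start, Instant end)
{
    var nodaZone = BclDateTimeZone.FromTimeZoneInfo(bclZone);
    foreach (var interval in nodaZone.GetZoneIntervals(start, end))
    {
        // Sample each zone interval at its start (or at the overall start instant,
        // for an interval which extends back before that).
        var sample = interval.HasStart && interval.Start > start ? interval.Start : start;
        var expected = bclZone.GetUtcOffset(sample.ToDateTimeUtc());
        var actual = nodaZone.GetUtcOffset(sample).ToTimeSpan();
        Assert.AreEqual(expected, actual, $"Offset mismatch in {bclZone.Id} at {sample}");
    }
}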

When I started this journey, the tests were working on Windows (both .NET Core 3.1 and .NET 6) and on Linux on .NET Core 3.1. It was only .NET 6 that was failing. Let’s start by checking that we haven’t broken anything on Windows… yes, everything is still working there. That’s pretty unsurprising, given that I’ve aimed to keep the Windows code path exactly the same as it was before. (There’s a potential optimization I can apply using the new BaseUtcOffsetDelta property on .NET 6, but I can apply that later.)

Next is testing .NET Core 3.1 on Linux – this used to work, but I wouldn’t be surprised to see problems introduced by the changes… and indeed, the final change I made broke lots of time zones due to trying to add daylight savings to DateTime.MaxValue. That’s easily fixed… and with that fix in place there are two errors. That’s fine – I’ll check those later and add text-file-data-based tests for those two time zones. Let’s check .NET 6 first though, which had large numbers of problems before… now we have 14. Definite progress! Those 14 failures seem to fall into two categories, so I’ll address those first.
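(As an aside, the DateTime.MaxValue fix is essentially just a clamp – something along the lines of the sketch below, rather than the exact code.)

// Sketch only: adding a daylight delta to DateTime.MaxValue throws, so clamp instead.
// (The helper name is mine, not the real code.)
static DateTime AddClamped(DateTime value, TimeSpan delta) =>
    DateTime.MaxValue - value <= delta ? DateTime.MaxValue : value + delta;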

Adding more test data and fixing one problem

First, I’m going to commit the code I’ve got. It definitely needs changing, but if I try something that doesn’t help, I want to be able to get back to where I was.

Next, let’s improve the exception messages thrown by the .NET 6 code. There are two exceptions, which currently look like this:

(For Pacific/Wallis and others)
System.InvalidOperationException : Zone recurrence rules have identical transitions. This time zone is broken.

(For America/Creston and others)
System.ArgumentException : The start Instant must be less than the end Instant (Parameter 'start')

Both have stack traces, of course, but those don’t tell me what the invalid values are, which means I can’t easily find the relevant rules to work on.

After a few minutes of work, this is fixed and the output is much more informative:

(For Pacific/Wallis and others)
System.InvalidOperationException : Zone recurrence rules have identical transitions. This time zone is broken. Transition time: 0002-12-31T11:45:00Z

(For America/Creston and others)
System.ArgumentException : The start Instant must be less than the end Instant. start: 1944-01-01T07:00:00Z; end: 1944-01-01T06:01:00Z (Parameter 'start')

The broken rule for Pacific/Wallis is particularly interesting – year 2 AD! So let’s see what the rules look like in textual form. First, let’s look at Pacific/Wallis:

Pacific/Wallis

.NET Core 3.1:
Base offset = 12
0001-01-01 - 1900-12-31: Base UTC offset delta: +00:15; Daylight delta: +00; DST starts January 01 at 00:00:00 and ends December 31 at 23:44:39
1900-12-31 - 2038-01-19: Daylight delta: +00; DST starts December 31 at 23:44:40 and ends January 19 at 15:14:06
2038-01-19 - 9999-12-31: Daylight delta: +00; DST starts January 19 at 15:14:07 and ends December 31 at 23:59:59

.NET 6.0:
Base offset = 12
0001-01-01 - 0001-12-31: Base UTC offset delta: +00:15; Daylight delta: +00; DST starts January 01 at 00:00:00 and ends December 31 at 23:59:59.999
0002-01-01 - 1899-12-31: Base UTC offset delta: +00:15; Daylight delta: +00; DST starts January 01 at 00:00:00 and ends December 31 at 23:59:59.999
1900-01-01 - 1900-12-31: Base UTC offset delta: +00:15; Daylight delta: +00; DST starts January 01 at 00:00:00 and ends December 31 at 23:44:39.999

Noda Time zone intervals:
0001-01-01T00:00:00Z - 1900-12-31T11:44:40Z, +12:15:20, +0
1900-12-31T11:44:40Z - 9999-12-31T23:59:59Z, +12, +0

The first Noda Time zone interval extends from the start of time, and the second one extends to the end of time. I haven’t yet decided whether I’ll actually try to represent all of this in the “regular” kind of test. The offsets shown as +00:15 should actually be +00:15:20, but it looks like .NET doesn’t handle sub-minute offsets. That’s interesting… I can easily change the Noda Time data to expect the rounded +12:15, of course.

Both .NET Core 3.1 and .NET 6 have pretty “interesting” representations here:

  • Why does .NET Core 3.1 have a new rule in 2038? It’s no coincidence that the instant being represented is 2³¹ seconds after the Unix epoch, I’m sure… but there’s no need for a new rule.
  • Why does .NET 6 have one rule for year 1AD and a separate rule for years 2 to 1899 inclusive?
  • Why use an offset rounded to the nearest minute, but keep the time zone transition at 1900-12-31T11:44:40Z?

It’s not clear to me just from inspection why this would cause the Noda Time conversion to fail, admittedly. But that’ll be fun to dig into. Before we do, let’s find the test data for America/Creston, around 1944:

America/Creston

.NET Core 3.1:
Base offset = -7
1942-02-09 - 1944-01-01: Daylight delta: +01; DST starts February 09 at 02:00:00 and ends January 01 at 00:00:59
1943-12-31 - 1944-04-01: Daylight delta: +00; DST starts December 31 at 23:01:00 and ends April 01 at 00:00:59
1944-04-01 - 1944-10-01: Daylight delta: +01; DST starts April 01 at 00:01:00 and ends October 01 at 00:00:59
1944-09-30 - 1967-04-30: Daylight delta: +00; DST starts September 30 at 23:01:00 and ends April 30 at 01:59:59

.NET 6.0:
Base offset = -7
1942-02-09 - 1942-12-31: Daylight delta: +01; DST starts February 09 at 02:00:00 and ends December 31 at 23:59:59.999
1943-01-01 - 1943-12-31: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends December 31 at 23:59:59.999
1944-01-01 - 1944-01-01: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends January 01 at 00:00:59.999
1944-04-01 - 1944-10-01: Daylight delta: +01; DST starts April 01 at 00:01:00 and ends October 01 at 00:00:59.999
1967-04-30 - 1967-10-29: Daylight delta: +01; DST starts April 30 at 02:00:00 and ends October 29 at 01:59:59.999

Noda Time zone intervals:

1942-02-09T09:00:00Z - 1944-01-01T06:01:00Z, -7, +1
1944-01-01T06:01:00Z - 1944-04-01T07:01:00Z, -7, +0
1944-04-01T07:01:00Z - 1944-10-01T06:01:00Z, -7, +1
1944-10-01T06:01:00Z - 1967-04-30T09:00:00Z, -7, +0

Well those are certainly “interesting” rules – and I can immediately see why Noda Time has rejected the .NET 6 rules. The third rule starts at 1944-01-01T07:00:00Z (assuming that the 1944-01-01T00:00:00 is in zone standard time of UTC-7) and finishes at 1944-01-01T06:01:00Z (assuming that the 1944-01-01T00:00:59.999 is in daylight time).

Part of the problem is that we’ve said before that if a rule ends on December 31st at 23:59:59, we’ll interpret that as being in standard time instead of being in daylight time… which means that the second rule would finish at 1944-01-01T07:00:00Z – but we genuinely want it to finish at 06:00:00Z, and maybe understand the third rule to mean 1944-01-01T06:00:00Z to 1944-01-01T06:01:00Z for that last minute of daylight time before we observe standard time until April 1st.

We could do that by adding special rules:

  • If a rule appears to end before it starts, assume that the start should be treated as daylight time instead of standard time. (That would make the third rule valid, covering 1944-01-01T06:00:00Z to 1944-01-01T06:01:00Z.)
  • If one rule ends after the next one starts, change its end point to the start of the next one. I currently have the reverse logic to this, changing the start point of the next one instead. That wouldn’t help us here. I can’t remember exactly why I’ve got the current logic: I need to add some comments on this bit of code…

Astonishingly, this works, getting us down to 8 errors on .NET 6. Of these, 6 are the same kind of error as Pacific/Wallis, but 2 are unfortunately of the form “you created a time zone successfully, but it doesn’t give the same results as the original one”. Hmm.

Still, let’s commit this change and move on to Pacific/Wallis.

Handling Pacific/Wallis

Once I’d added the Pacific/Wallis data, those tests passed – which means the problem must lie somewhere in how the results of the rules are interpreted in order to build up a DateTimeZone from the converted rules. That logic’s already in a BuildMap method, which I just need to make internal instead of private. That also contains some code which we’re essentially duplicating in the test code (around inserting standard zone intervals if they’re missing, and coalescing some zone intervals together). At some point I want to refactor both the production and test code to remove the duplication – but I want to get to working code first.

I’ll add a new test, just for Pacific/Wallis (as that’s the only test case we’ve got which is complete from the start to the end of time), which just constructs the map. I expect it will throw an exception, so I’m not actually going to assert anything about the result yet.

Hmm. It doesn’t throw. That’s weird. Let’s rerun the full time zone tests to make sure we still have a problem at all… yes, it’s still failing.

At this point, my suspicion is that some of the code that is “duplicated” between production and test code really isn’t quite duplicated at all. Debugging the code on Linux is going to be annoying, so let’s go about this the old-fashioned way: logging.

I’d expected to be able to log the zone interval for each part of the map… but PartialZoneIntervalMap.GetZoneInterval fails, which is really, really weird. What’s even weirder is that the stack trace includes StandardDaylightAlternatingMap – which is only used in the Windows rules.

All my unit tests assume we’ve recognized that the adjustment rules are from Unix… but the ones for Pacific/Wallis actually look like Windows ones: on .NET 6, they start on January 1st, and end on December 31st.

Let’s add a grotty hack: I happen to know that Windows time zone data never has any “really old” rules other than ones that start at the beginning of time – if we see anything with a start year that’s not 1 and isn’t after (say) 1600, that’s going to be a Unix rule.
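Expressed as a predicate over a single adjustment rule (the real check looks across all the rules for a zone, and the method name is mine), the extra condition is roughly:

// Grotty hack, sketched: Windows data never has "really old" rules other than ones
// starting at the beginning of time, so a start year in the range 2-1600 means Unix data.
static bool LooksLikeUnixRule(TimeZoneInfo.AdjustmentRule rule) =>
    rule.DateStart.Year != 1 && rule.DateStart.Year <= 1600;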

Put that extra condition in, and lo and behold, Pacific/Wallis starts working – hooray!

Let’s rerun everything…

So, running all the tests on every framework/platform pair that I can easily test, we get:

  • Linux, .NET 6: 2 failures – Australia/Broken_Hill and Antarctica/Macquarie
  • Linux, .NET Core 3.1: 3 failures – Asia/Dhaka, Australia/Broken_Hill and Antarctica/Macquarie
  • Windows, .NET 6: all tests pass
  • Windows, .NET Core 3.1: all tests pass

All the remaining failures are of the form “the offset is wrong at instant X”.

So, joy of joys, we’re back to collecting more data and adding more test cases.

First though, I’ll undo making a few things internal that didn’t actually help in the end. I might redo them later, but I don’t need them now. Basically the extra test to create the full map didn’t give me any more insight. (There’s a bunch of refactoring to do when I’ve got all the tests passing, but I might as well avoid having more changes than are known to be helpful.)

Going for green

At this point, I should reveal that I have a hunch. One bit of code I deleted when rewriting the “Unix rule to zone interval” conversion code handled the opposite of the issue I described earlier, where rules are in daylight savings but have a DaylightDelta of zero. The code I deleted basically said “If the time zone says it’s not in daylight savings, but DaylightDelta is non-zero, then treat the delta as zero anyway.” So I’m hoping that’s the issue, but I want to get the test data first. I’m also hoping that it’s the same for all three time zones that are having problems. We’ll start with Australia/Broken_Hill, which is failing during 1999. Dumping the rules in .NET 6 and .NET Core 3.1 under Linux, and looking at the Noda Time tzvalidate page, I get:

Base UTC offset: 09:30

.NET Core 3.1:
1998-10-25 - 1999-03-28: Daylight delta: +01; DST starts October 25 at 02:00:00 and ends March 28 at 02:59:59
1999-03-28 - 1999-10-31: Daylight delta: +00; DST starts March 28 at 02:00:00 and ends October 31 at 01:59:59
1999-10-31 - 1999-12-31: Daylight delta: +01; DST starts October 31 at 02:00:00 and ends December 31 at 23:59:59
1999-12-31 - 2000-03-26: Daylight delta: +01; DST starts December 31 at 23:00:00 and ends March 26 at 02:59:59

.NET 6:

1999-01-01 - 1999-03-28: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends March 28 at 02:59:59.999
1999-10-31 - 1999-12-31: Daylight delta: +01; DST starts October 31 at 02:00:00 and ends December 31 at 23:59:59.999
1999-12-31 - 1999-12-31: Daylight delta: +01; DST starts December 31 at 23:00:00 and ends December 31 at 23:59:59.999
2000-01-01 - 2000-03-26: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends March 26 at 02:59:59.999

Noda Time:
1998-10-24T16:30:00Z - 1999-03-27T16:30:00Z, +09:30, +01
1999-03-27T16:30:00Z - 1999-10-30T16:30:00Z, +09:30, +00
1999-10-30T16:30:00Z - 2000-03-25T16:30:00Z, +10:30, +01

Annoyingly, my test data parser doesn’t handle partial-hour base UTC offsets at the moment, so I’m going to move on to Antarctica/Macquarie – I’ll put Broken Hill back in later if I need to. The text format of Broken Hill would be:

Australia/Broken_Hill
Base offset = 09:30
---
1998-10-25 - 1999-03-28: Daylight delta: +01; DST starts October 25 at 02:00:00 and ends March 28 at 02:59:59
1999-03-28 - 1999-10-31: Daylight delta: +00; DST starts March 28 at 02:00:00 and ends October 31 at 01:59:59
1999-10-31 - 1999-12-31: Daylight delta: +01; DST starts October 31 at 02:00:00 and ends December 31 at 23:59:59
1999-12-31 - 2000-03-26: Daylight delta: +01; DST starts December 31 at 23:00:00 and ends March 26 at 02:59:59
---
1999-01-01 - 1999-03-28: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends March 28 at 02:59:59.999
1999-10-31 - 1999-12-31: Daylight delta: +01; DST starts October 31 at 02:00:00 and ends December 31 at 23:59:59.999
1999-12-31 - 1999-12-31: Daylight delta: +01; DST starts December 31 at 23:00:00 and ends December 31 at 23:59:59.999
2000-01-01 - 2000-03-26: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends March 26 at 02:59:59.999
---
1998-10-24T16:30:00Z - 1999-03-27T16:30:00Z, +09:30, +01
1999-03-27T16:30:00Z - 1999-10-30T16:30:00Z, +09:30, +00
1999-10-30T16:30:00Z - 2000-03-25T16:30:00Z, +10:30, +01

However, let’s have a look at Antarctica/Macquarie, which is broken in 2009. Here’s the data:

Antarctica/Macquarie
Base UTC offset: +10

.NET Core 3.1:
2008-10-05 - 2009-04-05: Daylight delta: +01; DST starts October 05 at 02:00:00 and ends April 05 at 02:59:59
2009-04-05 - 2009-10-04: Daylight delta: +00; DST starts April 05 at 02:00:00 and ends October 04 at 01:59:59
2009-10-04 - 2009-12-31: Daylight delta: +01; DST starts October 04 at 02:00:00 and ends December 31 at 23:59:59
2009-12-31 - 2011-04-03: Daylight delta: +01; DST starts December 31 at 23:00:00 and ends April 03 at 02:59:59

.NET 6.0:
2008-10-05 - 2008-12-31: Daylight delta: +01; DST starts October 05 at 02:00:00 and ends December 31 at 23:59:59.999
2009-01-01 - 2009-04-05: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends April 05 at 02:59:59.999
2009-10-04 - 2009-12-31: Daylight delta: +01; DST starts October 04 at 02:00:00 and ends December 31 at 23:59:59.999
2009-12-31 - 2009-12-31: Daylight delta: +01; DST starts December 31 at 23:00:00 and ends December 31 at 23:59:59.999
2010-01-01 - 2010-12-31: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends December 31 at 23:59:59.999
2011-01-01 - 2011-04-03: Daylight delta: +01; DST starts January 01 at 00:00:00 and ends April 03 at 02:59:59.999

Noda Time:
2008-10-04T16:00:00Z - 2009-04-04T16:00:00Z +10, +01
2009-04-04T16:00:00Z - 2009-10-03T16:00:00Z +10, +00
2009-10-03T16:00:00Z - 2011-04-02T16:00:00Z +10, +01

(The .NET 6 data needs to go as far as 2011 in order to include all the zone intervals for 2009, because there were no changes in 2010.)

Good news! The test fails… but only in .NET Core 3.1.

(A few days later… this blog post is being written in sporadic bits of spare time.)

Okay, let’s check I can still reproduce this – in .NET 6 on Linux, the BclDateTimeZone test that converts a TimeZoneInfo to a BclDateTimeZone fails for Antarctica/Macquarie because it gives the wrong offset at 2009-10-04T00:00:00Z – the TimeZoneInfo reports +11, and the BclDateTimeZone reports +10. But the unit test for the .NET 6 data apparently gives the correct ZoneInterval. Odd.

Again, this is tricky to debug, so I’ll add some logging. Just a simple “new” test that logs all of the zone intervals in the relevant period. The results are:

Australian Eastern Standard Time: [2008-04-05T16:00:00Z, 2008-10-04T16:00:00Z) +10 (+00)
Australian Eastern Daylight Time: [2008-10-04T16:00:00Z, 2009-04-04T16:00:00Z) +11 (+01)
Australian Eastern Standard Time: [2009-04-04T16:00:00Z, 2009-12-31T13:00:00Z) +10 (+00)
Australian Eastern Daylight Time: [2009-12-31T13:00:00Z, 2011-04-02T16:00:00Z) +11 (+01)
Australian Eastern Standard Time: [2011-04-02T16:00:00Z, 2011-10-01T16:00:00Z) +10 (+00)
Australian Eastern Daylight Time: [2011-10-01T16:00:00Z, 2012-03-31T16:00:00Z) +11 (+01)

It looks like we’re not adding in the implicit standard time zone interval between April and October 2009. This is code that I’m aiming to deduplicate between the production code and the rule-data-oriented unit tests – it looks like the unit test code is doing the right thing, but the production code isn’t.

(In the process of doing this, I’ve decided to suppress warning CA1303 – it’s completely useless for me, and actively hinders simple debug-via-console-logging.)

Adding extra logging to the BuildMap method, it looks like we’ve already lost the information by the time we get there: there’s no sign of the October 2009 date anywhere. Better look further back in the code…

… and immediately spot the problem. This code, intended to handle overlapping rules:

convertedRules[i - 1] = convertedRules[i].WithEnd(convertedRules[i].Start);

… should be:

convertedRules[i - 1] = convertedRules[i - 1].WithEnd(convertedRules[i].Start);

Again, that’s code which isn’t covered by the rule-data-oriented unit tests. I’m really looking forward to removing that duplication. Anyway, let’s see what that fix leaves us with… oh. It hasn’t actually fixed it. Hmm.

Ha. I fixed it in this blog post but not in the actual code. So it’s not exactly a surprise that the tests were still broken!

Having actually fixed it, the only failing test in .NET 6 on Linux is now the one testing the .NET Core 3.1 data for Antarctica/Macquarie. Hooray! Running the tests for .NET Core 3.1, that’s also the only failing test. The real time zone seems to be okay. That’s odd… and suggests that my test data was incorrectly transcribed. Time to check it again… no, it really is that test data. Hmm. Maybe there’s another bug in the code that’s meant to be duplicated between production and tests, but this time with the bug in the test code.

Aha… the .NET Core 3.1 test code didn’t have the “first fix up overlapping rules” code that’s in the .NET 6 tests. The circumstances in which that fix-up is needed happen much more rarely when using the .NET Core 3.1 rules – this is the first time we’d needed it for the rule-data-oriented tests, but it was happening unconditionally in the production code. So that makes sense.

Copy that code (which now occurs three times!) and the tests pass.

All the tests are green, across Windows and Linux, .NET Core 3.1 and .NET 6.0 – wahoo!

Time for refactoring

First things first: commit the code that’s working. I’m pretty reasonable at refactoring, but I wouldn’t put it past myself to mess things up.

Okay, let’s try to remove as much of the code in the tests as possible. They should really be pretty simple. It’s pretty easy to extract the code that fixes up overlapping adjustment rules – with a TODO comment that says I should really add a test to make sure the overlap is an expected kind (i.e. by exactly the amount of daylight savings, which should be the same for both rules). I’ll add that later.

The part about coalescing adjacent intervals is trickier though – that’s part of a process creating a “full” zone interval map, extending to the start and end of time. It’s useful to do that, but it leaves us with a couple of awkward aspects of the existing test data. Sometimes we have DST adjustment rules at the start or end of the .NET 6 data just so that the desired standard time rule can be generated.

In the tests we just committed, we accounted for that separately by removing those “extra” rules before validating the results. It’s harder to do that in the new tests, and it goes against the aim of making the tests as simple as possible. Additionally, if the first or last zone interval is a standard time one, the “create full map” code will extend that interval to the start or end of time respectively, which isn’t what we want either.

Rather than adding more code to handle this, I’ll just document that all test data must start and end with a daylight zone interval – or the start/end of time. Most of the test data already complies with this – I just need to add a bit more information for the others.

Interestingly, while doing this, I found that there’s an odd discrepancy between data sources for Europe/Prague in 1945 in terms of when daylight savings started:

  • TimeZoneInfo says April 2nd in one rule, and then May 8th in another
  • Noda Time (and tzvalidate, and zdump) says April 2nd
  • timeanddate.com says April 8th

Fortunately, specifying both of the rules from TimeZoneInfo ends up with the tests passing, so I’ll take that.

With those data changes in, everything’s nicer and green – so let’s commit again.

Next up, there’s a bit of rearranging of the test code itself. Literally just moving code around, mostly moving it into the two nested helper classes.

Again, run all the tests – still green, so let’s commit again. (I’m planning on squashing all of these commits together by the end, of course.)

Next there’s a simplification option which I’d noted before but never found time to implement – just removing a bit of redundancy. While fixing that, I’ve noticed we’ve got IZoneIntervalMap and IZoneIntervalMapWithMinMax… if we could add min/max offset properties to IZoneIntervalMap, we could simplify things further. I’ve just added a TODO for this though, as it’s a larger change than I want to include immediately. Run tests, commit on green.

When comments become more than comments

Now for some more comments. The code I’ve got works, but the code to handle the corner cases in BclAdjustmentRule.ConvertUnixRuleToBclAdjustmentRule isn’t commented clearly. For every case, I’m going to have a comment that:

  • Explains what the data looks like
  • Explains what it’s intended to mean
  • Gives a sample zone + framework (ideally with a date) so I can look at the data again later on

Additionally, the code that goes over a set of converted Unix rules and fixes up the end instant for overlapping rules needs:

  • More comments including an example
  • Validation to ensure the fix-up only occurs in an expected scenario
  • A test for that validation (with deliberately broken rules)

When trying to do this, I’ve found it hard to justify some of the code. It’s all just a bit too hacky. With tests in place that can be green, I’ve tried to improve things – in particular, there was some code to handle a rule ending at the very end of the year, which changed the end point from daylight time to standard time. That feels like it’s better handled in the “fix-up” code… but even that’s really hard to justify.

What I can do is leave the code there with some examples of what fails. I’m still hoping I can simplify it more later though.

A new fly in the ointment

In the course of doing this, I’ve discovered one additional tricky aspect: in .NET 6, the last adjustment rule can be a genuine alternating one rather than a fixed one. For example, Europe/London finishes with:

2037-10-25 - 9999-12-31: Daylight delta: +01; DST starts Last Sunday of March; 01:00:00 and ends Last Sunday of October; 02:00:00

Currently we don’t take that into account, and that will make life trickier. Sigh. That rule isn’t too hard to convert, but it means I’m unlikely to get there today after all.

It shouldn’t be too hard to test this though: we currently have this line of code in BclDateTimeZoneTest:

// Currently .NET Core doesn't expose the information we need to determine any DST recurrence
// after the final tzif rule. For the moment, limit how far we check.
// See https://github.com/dotnet/corefx/issues/17117
int endYear = TestHelper.IsRunningOnDotNetCoreUnix ? 2037 : 2050;

That now needs to be “on .NET Core 3.1, use 2037; on .NET 6 use 2050”. With that change in place, I expect the tests to fail. I’ve decided I won’t actually implement that in the first PR; let’s get all the existing tests working first, then extend them later.
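When I do get to it, the change itself should be trivial – something like this, where IsRunningOnDotNetCore31Unix is a hypothetical helper (the real property name may well end up different):

// On .NET Core 3.1 we still can't see the recurrence after the final tzif rule, so stop
// at 2037; on .NET 6 the final rule is exposed, so we can check out to 2050.
int endYear = TestHelper.IsRunningOnDotNetCore31Unix ? 2037 : 2050;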

Let’s get it merged…

Even though there’s more work to do, this is much better than it was.

It’s time to get it merged, and then finish up the leftover work. I might also file an issue asking for Microsoft to improve the documentation and see if they’re able to provide sample code that makes sense of all of this…

Diagnosing an ASP.NET Core hard crash

As part of my church A/V system (At Your Service), I run a separate local web server to interact with the Zoom SDK. Initially this was because the Zoom SDK would only run in 32-bit processes and I needed a 64-bit process to handle the memory requirements for the rest of the app. However, it’s also proven useful in terms of keeping the Zoom meeting for a church service alive if At Your Service crashes. Obviously I try hard to avoid that happening, but when interoperating with a lot of native code (LibVLC, NDI, the Stream Deck, PowerPoint via COM) there are quite a few avenues for crashes. The web server runs ASP.NET Core within a WPF application to make it easy to interact with logs while it’s running, and to give the Zoom SDK a normal event dispatcher.

Yesterday, when trying to change my error handling code significantly, I found that the web server was crashing hard, with no obvious trace of the cause. I’d already spent a little time trying to figure out what was going on, but I couldn’t get to the bottom of it. I know the immediate cause of the crash, and I’ve fixed that fairly easily – but I want to harden the web server against any further bugs I might introduce. I figured it would be useful to blog about that process as I went along.

What I know so far

The immediate crash was due to an exception being thrown in an async void method.

Relevant threading aspects:

  • I start the ASP.NET Core app in a separate thread (although that’s probably unnecessary anyway, now that I think about it) calling IHost.Start
  • I have handlers for Dispatcher.UnhandledException and TaskScheduler.UnobservedTaskException
  • I execute all Zoom-specific code on the WPF Dispatcher thread

The immediate error came from code like the following. You can ignore the way this effectively reproduces Web API to some extent… it’s the method body that’s important.

public abstract class CommandBase<TRequest, TResponse> : CommandBase
{
    public override async Task HandleRequest(HttpContext context)
    {
        var reader = new StreamReader(context.Request.Body);
        var text = await reader.ReadToEndAsync();
        var request = JsonUtilities.Parse<TRequest>(text);

        var dispatcher = Application.Current.Dispatcher;
        try
        {
            var response = await dispatcher.Invoke(() => ExecuteAsync(request));
            var responseJson = JsonUtilities.ToJson(response);
            await context.Response.WriteAsync(responseJson);
        }
        catch (ZoomSdkException ex)
        {
            SetExceptionResponse(new ZoomExceptionResponse { /* More info here */ });
        }

        async void SetExceptionResponse(ZoomExceptionResponse response)
        {
            var responseJson = JsonUtilities.ToJson(response);
            await context.Response.WriteAsync(responseJson);
            context.Response.StatusCode = 500;
        }
    }

    public abstract Task<TResponse> ExecuteAsync(TRequest request);
}

There are at least three problems here:

  • I’m trying to set HttpResponse.StatusCode after writing the body
  • The SetExceptionResponse method is async void (generally a bad idea)
  • I’m not awaiting the call to SetExceptionResponse (which I can’t, due to it returning void)

(It’s also a bit pointless having a local method there. This code could do with being rewritten when I don’t have Covid brain fog, but hey…)

The first of these causes an InvalidOperationException to be thrown. The second and third, between them, cause the app to crash. The debugger has been no help here in working out what’s going on.
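For reference, a minimal fix (a sketch, not necessarily the exact code now in At Your Service) is to set the status code before writing anything in the error path, and to do the writing inline with a normal await rather than via an async void helper:

public override async Task HandleRequest(HttpContext context)
{
    var reader = new StreamReader(context.Request.Body);
    var text = await reader.ReadToEndAsync();
    var request = JsonUtilities.Parse<TRequest>(text);

    var dispatcher = Application.Current.Dispatcher;
    try
    {
        var response = await dispatcher.Invoke(() => ExecuteAsync(request));
        var responseJson = JsonUtilities.ToJson(response);
        await context.Response.WriteAsync(responseJson);
    }
    catch (ZoomSdkException ex)
    {
        // Set the status code before any of the body has been written...
        context.Response.StatusCode = 500;
        // ... and write the error response directly, with no async void in sight.
        var errorJson = JsonUtilities.ToJson(new ZoomExceptionResponse { /* Populate from ex as before */ });
        await context.Response.WriteAsync(errorJson);
    }
}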

Create a console app to start ASP.NET Core

It feels like this should be really easy to demonstrate in a simple console app that does nothing but start a web server which fails in this particular way.

At this stage I should say how much I love the new top-level statements in C# 10. They make simple complete examples an absolute joy. So let’s create a console app, change the Sdk attribute in the project file to Microsoft.NET.Sdk.Web, and see what we can do. I’m aware that with ASP.NET Core 6 there are probably even simpler ways of starting the server, but this will do for now:

using System.Net;

var host = Host.CreateDefaultBuilder()
    .ConfigureWebHostDefaults(builder => builder
        .ConfigureKestrel((context, options) => options.Listen(IPAddress.Loopback, 8080))
        .Configure(application => application.Run(HandleRequest)))
    .Build();
host.Start();
host.WaitForShutdown();

async Task HandleRequest(HttpContext context)
{
    await context.Response.WriteAsync("Testing");
}

Trying to run that initially brought up prompts about IIS Express and trusting SSL certificates – all very normal for a regular web app, but not what I want here. After editing launchSettings.json to a simpler set of settings:

{
  "profiles": {
    "AspNetCoreCrash": {
      "commandName": "Project"
    }
  }
}

… I can now start the debugger, then open up localhost:8080 and get the testing page. Great.

Reproduce the exception

Next step: make sure I can throw the InvalidOperationException in the same way as the original code. This is easy, just replacing the body of the HandleRequest method:

async Task HandleRequest(HttpContext context)
{
    await context.Response.WriteAsync("Testing");
    context.Response.StatusCode = 500;
}

Sure enough the console logs show that it’s failed as expected:

System.InvalidOperationException: StatusCode cannot be set because the response has already started.
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ThrowResponseAlreadyStartedException(String value)
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.set_StatusCode(Int32 value)
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseFeature.set_StatusCode(Int32 value)
  at Microsoft.AspNetCore.Http.DefaultHttpResponse.set_StatusCode(Int32 value)
  at Program.<<Main>$>g__HandleRequest|0_1(HttpContext context) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 19
  at Microsoft.WebTools.BrowserLink.Net.BrowserLinkMiddleware.ExecuteWithFilterAsync(IHttpSocketAdapter injectScriptSocket, String requestId, HttpContext httpContext)
  at Microsoft.AspNetCore.Watch.BrowserRefresh.BrowserRefreshMiddleware.InvokeAsync(HttpContext context)
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

… but (again, as expected) the server is still running. It’s interesting that BrowserLink occurs in the stack trace – I suspect that wouldn’t be the case in my real application.

Let’s try making the failure occur in the same way as in At Your Service:

async Task HandleRequest(HttpContext context)
{
    // In AYS we await executing code in the dispatcher;
    // Task.Yield should take us off the synchronous path.
    await Task.Yield();
    WriteError();

    async void WriteError()
    {
        await context.Response.WriteAsync("Testing");
        context.Response.StatusCode = 500;
    }
}

This time we get a longer stack trace, and the process quits, just like in AYS:

System.InvalidOperationException: StatusCode cannot be set because the response has already started.
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ThrowResponseAlreadyStartedException(String value)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.set_StatusCode(Int32 value)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseFeature.set_StatusCode(Int32 value)
   at Microsoft.AspNetCore.Http.DefaultHttpResponse.set_StatusCode(Int32 value)
   at Program.<>c__DisplayClass0_0.<<<Main>$>g__WriteError|4>d.MoveNext() in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 19
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallback.<>c.<.cctor>b__6_0(QueueUserWorkItemCallback quwi)
   at System.Threading.ExecutionContext.RunForThreadPoolUnsafe[TState](ExecutionContext executionContext, Action`1 callback, TState& state)
   at System.Threading.QueueUserWorkItemCallback.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

This happens in both the debugger and when running from the command line.

Setting a break point in the WriteError method shows a stack trace like this:

   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at Program.<>c__DisplayClass0_0.<<Main>$>g__WriteError|4()
   at Program.<<Main>$>g__HandleRequest|0_1(HttpContext context) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 14
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.<>c.<OutputCorrelationEtwEvent>b__6_0(Action innerContinuation, Task continuationIdTask)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.ContinuationWrapper.Invoke()
   at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.RunAction(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

There’s nothing about ASP.NET Core in there at all… so I wonder if we can take that out of the equation too?

Reproducing the crash in a pure console app

To recap, I’m expecting at this stage that to reproduce the crash I should:

  • Write an async void method that throws an exception
  • Call that method from a regular async method

Let’s try:

#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
await NormalAsyncMethod();
Console.WriteLine("Done");

async Task NormalAsyncMethod()
{
    await Task.Yield();
    Console.WriteLine("Start ofNormalAsyncMethod");
    BrokenAsyncMethod();
    Console.WriteLine("End of NormalAsyncMethod");
}

async void BrokenAsyncMethod()
{
    await Task.Yield();
    throw new Exception("Bang");
}

Hmm. That exits normally:

$ dotnet run
Start of NormalAsyncMethod
End of NormalAsyncMethod
Done

But maybe there’s a race condition between the main thread finishing and the problem crashing the process? Let’s add a simple sleep:

#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
await NormalAsyncMethod();
Thread.Sleep(1000);
Console.WriteLine("Done");
// Remainder of code as before

Yup, this time it crashes hard:

Start of NormalAsyncMethod
End of NormalAsyncMethod
Unhandled exception. System.Exception: Bang
   at Program.<<Main>$>g__BrokenAsyncMethod|0_1() in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 15
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

Interlude: what about async Task?

At this point I’m remembering some of what I’ve learned about how async void methods handle exceptions. What happens if we turn it into an async Task method instead? At that point, the Task returned by the method (which we ignore) will have the exception, and as by default unobserved task exceptions no longer crash the process, maybe we’ll be okay. So just changing BrokenAsyncMethod to:

async Task BrokenAsyncMethod()
{
    throw new Exception("Bang");
}

(and ignoring the warning at the call site)… the program no longer crashes. (I could subscribe to TaskScheduler.UnobservedTaskException but I’m not that bothered… I’m pretty convinced it would fire, at least eventually.)
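(For what it’s worth, subscribing looks something like this; the event is only raised when the faulted task is finalized, which is why I say “at least eventually”.)

// Sketch: observe otherwise-unobserved task exceptions. The event is raised when the
// faulted task is garbage collected and finalized, so it may fire much later.
TaskScheduler.UnobservedTaskException += (sender, args) =>
{
    Console.WriteLine($"Unobserved task exception: {args.Exception.Message}");
    args.SetObserved();
};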

Do all ThreadPool exceptions crash the app?

We don’t need to use async methods to execute code on the thread pool. What happens if we just write a method which throws an exception, and call that from the thread pool?

ThreadPool.QueueUserWorkItem(ThrowException);
Thread.Sleep(1000);
Console.WriteLine("Done");

void ThrowException(object? state)
{
    throw new Exception("Bang!");
}

Yup, that crashes:

Unhandled exception. System.Exception: Bang!
   at Program.<<Main>$>g__ThrowException|0_0(Object state) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 7
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

At this point some readers (if there still are any…) may be surprised that this is a surprise to me. It’s been a long time since I’ve interacted with the thread pool directly, and taking down the process like this feels a little harsh to me. (There are pros and cons, certainly. I’m not trying to argue that Microsoft made the wrong decision here.)

Can we change the ThreadPool behaviour?

Given that we have things like TaskScheduler.UnobservedTaskException, I’d expect there to be something similar for the thread pool… but I can’t see anything. It looks like this is behaviour that changed with .NET 2.0 – back in 1.x, thread pool exceptions didn’t tear down the application.

After a bit more research, I found AppDomain.UnhandledException. This allows us to react to an exception that’s about to take down the application, but it doesn’t let us mark it as “handled”.

Here’s an example:

AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
    Console.WriteLine($"Unhandled exception: {((Exception)args.ExceptionObject).Message}");
ThreadPool.QueueUserWorkItem(ThrowException);
Thread.Sleep(1000);
Console.WriteLine("Done");

void ThrowException(object? state) =>
    throw new Exception("Bang!");

Running this code a few times, I always get output like this:

Unhandled exception: Bang!
Unhandled exception. System.Exception: Bang!
   at Program.<<Main>$>g__ThrowException|0_1(Object state) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 8
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

… but sometimes “Done” is printed too. So I guess there’s some uncertainty about how quickly the AppDomain is torn down.

Hardening At Your Service

Given what I know now, I don’t think I can easily stop the web server for Zoom interactions from terminating if I have a bug – but I can make it easier to find that bug afterwards. I already normally write the log out to a text file when the app exits, but that only happens for an orderly shutdown.

Fortunately, it looks like the AppDomain.UnhandledException is given enough time to write the log out before the process terminates. Temporarily reverting to the broken code allows me to test that – and yes, I get a log with an appropriate critical error.
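The handler itself is only a few lines – something like this, where LogCriticalError and WriteLogToFile are stand-ins for whatever the real At Your Service logging code does:

// Sketch of the hardening: record the exception and flush the in-memory log to disk
// before the process is torn down. (LogCriticalError and WriteLogToFile are stand-ins.)
AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
{
    LogCriticalError((Exception) args.ExceptionObject);
    WriteLogToFile();
};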

Conclusion

I knew that async void methods were generally bad, but I hadn’t quite appreciated how dangerous they are, particularly when executed from a thread pool thread.

While I’m not thrilled with the final result, I at least understand it now, and can find similar errors more easily in the future. The “not understanding” part was the main motivation for this blog post – given that I’d already found the immediate bug, I could have just fixed it and ignored the worrying lack of diagnostic information… but I always find it tremendously unsettling when I can’t explain significant behaviour. It’s not always worth investigating immediately but it’s generally useful to come back to it later on and keep diving deeper until you’ve got to the bottom of it.

I haven’t put the source code for this blog post in GitHub, as there are so many iterations – and because it’s all in the post itself. Shout if you’d find it valuable, and I’ll happily add it after all.