Tracing Notifications – Slack Engineering

Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to stay on top of important information. Poor notification completeness erodes the trust of all Slack users.

Notifications flow through almost all of the systems in our infrastructure. As illustrated in Figure 1 below, a notification request flows through the webapp (our application logic and web / Desktop client monorepo), job queue, push service, and several third-party services before hitting our iOS, Android, Desktop, or web clients.

Further, the decision about when and where to send a notification is also very complicated, as shown in Figure 2 below, which is from our 2017 blog post (also summarized here).

Since 2017, our notification workflow has only grown more complex with the addition of new features like Huddles and Canvas. As a result, fixing notification issues can lead to multi-day debugging sessions across multiple teams. Customer tickets related to notifications also had the lowest NPS scores and took the longest time to resolve compared to other customer issues.

Debugging notification issues within our systems was difficult because each system had a different logging pipeline and data format, so an investigation meant looking at data in different formats across different backends. The context in which events were logged also varied across systems, prolonging any investigation. The result was a process that took several days and required deep expertise in every part of the stack just to understand what happened.

We began a project to trace the flow of notifications across our systems to address these challenges. The goal was to standardize the data format and semantics of events to make notification data easier to understand and debug. We wanted to answer questions about a notification such as: was it sent, where was it sent, was it viewed, and did the user open it? This post documents our multi-quarter, cross-organizational journey of tracing notifications throughout Slack’s backend systems, and how we use this trace data to improve the Slack customer experience for everyone.

Notification flow

The sequence of steps involved in sending and receiving a notification is something we’ve dubbed the “notification flow.” The first step to improve the notification flow was to model the steps in the notification process the same way across all our clients. We also aimed to capture all events in a common data model, consistently in the same format.

We created a notification spec to understand all the events in a notification trace. This involved identifying all the events in a trace, creating an idealized funnel, and setting the context in which each event would be logged. We also had to agree on the semantics of a span and the names of the events, which was a challenging task across different platforms. The result is a notification flow (simplified for this blog post), shown in the image below.

Mapping notification flow to a trace

After we finished planning the flow of our system, we needed to pick a way to keep track of that information. We chose to use SlackTrace because a trace was a natural way to represent a flow, and all the parts of our system can already send information in the span event format. However, we encountered two major challenges when modeling notification flows as traces.

  1. 100% sampling for notification flows: Unlike backend requests, which were sampled at 1%, notification flows shouldn’t be sampled, since our customer experience (CE) team wanted 100% fidelity to answer all customer requests. In some scenarios like `@here` and `@channel`, a push notification message could potentially be sent to hundreds of thousands of users across multiple devices, resulting in billions of spans for a single trace of a Slack message. A trace with potentially billions of spans would wreak havoc on our trace ingestion pipeline and storage backends. No sampling would also force us to trace every Slack message sent.
  2. Tracing notifications as a flow separate from the original message-sent trace: Currently, OpenTelemetry (OpenTracing) instrumentation tightly couples tracing to a request context. In a notification flow, this tight coupling would break, since the notification flow executes in multiple contexts and doesn’t cleanly map to a single request context. Further, mixing multiple trace contexts also made implementing tracing across our code challenging.

To solve both of these challenges, we decided to model each notification sent as its own trace. To tie the sender’s trace to each of the notifications sent, we used span links to causally link the spans together. Each notification was assigned a notification_id, which was used as the trace_id for the notification flow.

This approach has several advantages:

  • Since SlackTrace’s instrumentation doesn’t tightly couple trace context propagation with request context propagation, modeling these flows as separate traces drastically simplified the trace instrumentation.
  • Since each notification sent was its own trace, the traces were smaller and easier to store and query.
  • It allowed 100% sampling for notification traces, while keeping the sender’s sampling rate at 1%.
  • Span linking helped us preserve causality for the trace data.
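
The modeling choice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SlackTrace’s actual API: the `Span` class, `fan_out`, `keep_trace`, and `SAMPLE_RATES` are hypothetical names. The sketch shows one trigger span in the sender’s (request) trace, one small trace per notification whose trace_id is the notification_id, span links tying the two together, and separate sampling rates for the two kinds of trace.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical minimal span record; field names are assumptions."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    links: List[str] = field(default_factory=list)  # trace_ids of linked traces
    tags: Dict[str, str] = field(default_factory=dict)


def fan_out(request_id: str, recipients: List[str]) -> Tuple[Span, List[Span]]:
    """Create the trigger span plus one notify span (and trace) per recipient."""
    trigger = Span(name="notification:trigger", trace_id=request_id)
    notify_spans = []
    for user_id in recipients:
        notification_id = uuid.uuid4().hex      # doubles as the trace_id
        trigger.links.append(notification_id)   # span link: sender -> notification
        notify_spans.append(
            Span(
                name="notification:notify",
                trace_id=notification_id,
                parent_id=trigger.span_id,
                tags={"user_id": user_id},
            )
        )
    return trigger, notify_spans


# Sampling rates from the post: 1% for request traces, 100% for notifications.
SAMPLE_RATES = {"request": 0.01, "notification": 1.0}


def keep_trace(kind: str, draw: float) -> bool:
    """Keep a trace if a uniform random draw in [0, 1) falls under its rate."""
    return draw < SAMPLE_RATES[kind]
```

Because each recipient gets its own trace, an `@channel` message produces many small traces rather than one enormous one, and the sender’s trace only carries a list of links.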

Different teams worked together to map each step in the notification flow to a span. The result is the table shown below.

| Span name | Description | Trace id | Parent span id | Span tags |
|---|---|---|---|---|
| notification:trigger | Determine whether the notification should be sent. | Trace_id is the request id. Span links have a list of the notification_ids sent. | | trigger_type (DM, @here, @channel), user_id, team_id, channel_id, message_ts, notification_id |
| notification:notify | Notify the user on all of their clients. | Trace_id is notification_id. | ID of notification:trigger span. | user_id, team_id, channel_id, message_ts |
| notification:sent | Notification is sent to all of the Slack clients on the user’s device. | Trace_id is notification_id. | ID of notification:notify span. | channel_id, platform-specific notification tags |
| notification:received | Notification is received on the user’s Slack client. | Trace_id is notification_id. | ID of notification:sent span. | Service name is client name and client tags. |
| notification:opened | User opened a notification on the device. | Trace_id is notification_id. | ID of notification:received span. | Service name is client name and client tags. |
| notification:read in app | User clicked on the notification to view it in the app. The start of the span is right after opening. The end of the span is when the message is rendered in the channel. | Trace_id is notification_id. | ID of notification:opened span. | Service name is client name and client tags. |
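
One way to make a spec like this checkable is to encode the parent relationships as data and validate incoming spans against them. The following is a minimal sketch, not Slack’s tooling: the dict-based span shape and the `check_trace` helper are assumptions for illustration, and the caller is expected to pass the linked notification:trigger span along with the spans of the notification trace.

```python
# Expected parent span name for each span, transcribed from the table above.
EXPECTED_PARENT = {
    "notification:notify": "notification:trigger",
    "notification:sent": "notification:notify",
    "notification:received": "notification:sent",
    "notification:opened": "notification:received",
    "notification:read in app": "notification:opened",
}


def check_trace(spans):
    """Return a list of problems found in one notification flow.

    Each span is a dict with at least "name", "span_id", and "parent_id"
    (a hypothetical shape chosen for this sketch).
    """
    by_id = {s["span_id"]: s for s in spans}
    problems = []
    for s in spans:
        expected = EXPECTED_PARENT.get(s["name"])
        if expected is None:
            continue  # the trigger span, or a span outside the spec
        parent = by_id.get(s["parent_id"])
        if parent is None:
            problems.append(f'{s["name"]}: parent span not found')
        elif parent["name"] != expected:
            problems.append(
                f'{s["name"]}: parent is {parent["name"]}, expected {expected}'
            )
    return problems
```

An empty result means every span hangs off the parent the spec calls for; anything else points at an instrumentation bug on a specific client or service.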

Advantages of modeling a notification flow as a trace

Representing the notification flow as a Trace/SpanEvent has the following advantages over our existing methods.

  • Consistent data format: Since all the services reported the data as a Span, the data from various backend and client systems was in the same format.
  • Service name to identify source: We set the service name field to Desktop, iOS, or Android to uniquely identify the client or service that generated an event.
  • Standard names for contexts: We used the span name and service name to uniquely identify an event across systems. For example, the service name for a notification:received event would be iOS, Android, or Web, to accurately tag these events. Previously, the events from these three clients would have different formats and it was hard to query them uniformly.
  • Standardized timestamps and duration fields: All the events have a consistent timestamp in the same resolution and time zone as the rest of the events. If there is a duration associated with an event, we set the duration field; for a one-off event, we set it to a default value of 1. This provided a single place for storing all of our duration information.
  • Built-in sessions: We use the notification ID as the trace ID for the entire flow. As a result, all the events in a flow are already sessionized and there’s no need to further sessionize the data. With our earlier events, we couldn’t use the notification ID as the join key everywhere, since only some events would have a notification ID. For example, the notification triggered or notification read events wouldn’t have a notification ID in them. We can use the trace ID to tie these events together instead of using bespoke events.
  • Clean, simple, and reliable instrumentation: Since a trace is sessionized, we only need to add the tags to the trace once when we model the notification flow as a trace. This also made the instrumentation code cleaner, simpler, and more reliable, since the changes were localized to small parts of the code that can be unit tested well. It also made the data easier to use, since there is only one join key instead of a bespoke join key for some subset of events.
  • Flexible data model: This model is also flexible and extensible. If a client needs to add additional context, they can add additional tags to an existing span. If none of the existing spans are a fit, they can add a new span to the trace, without altering the existing trace data or trace queries.
  • No duplicate events: The SpanID in the event helped capture the uniqueness of events at the source. This reduced the number of events that were double reported and removed the need to de-dupe events in our backend again. The older method reported Thrift objects without unique IDs, which led to using de-dupe jobs to identify double reporting of events.
  • Span linking for tying related traces together: Linking spans across traces helps preserve causality without resorting to ad hoc data modeling.
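
Several of the properties above (unique span IDs, the trace ID as the single join key, a default duration for one-off events) are easy to see in a small sketch. The event shape and the `sessionize` helper below are hypothetical, and the unit of the default duration is not stated in this post, so treat the constant as a placeholder.

```python
from collections import defaultdict

DEFAULT_DURATION = 1  # default for one-off events, per the post; unit unspecified


def sessionize(events):
    """Drop events with a repeated span_id and group the rest by trace_id.

    Each event is a dict with "trace_id", "span_id", "timestamp", and an
    optional "duration" (an assumed shape for this illustration).
    """
    seen = set()
    sessions = defaultdict(list)
    for event in events:
        if event["span_id"] in seen:
            continue  # double-reported event: the span_id identifies it
        seen.add(event["span_id"])
        span = dict(event)  # copy so the caller's data is not mutated
        span.setdefault("duration", DEFAULT_DURATION)
        sessions[span["trace_id"]].append(span)
    for spans in sessions.values():
        spans.sort(key=lambda s: s["timestamp"])
    return dict(sessions)
```

With the notification ID doubling as the trace ID, the grouping step is a plain key lookup; no separate sessionization or de-dupe job is needed downstream.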

How we use notification trace data at Slack

After several quarters of hard work by multiple teams, we were able to trace notifications end-to-end across all the Slack clients. Our traces were sent to a real-time store and our data warehouse using the trace ingestion pipeline.

Developers use the notification trace data to triage issues. Previously, tracking notification failures involved going through the logs of multiple systems to understand where a notification was dropped. This process was involved and took several hours of very senior engineers’ time to understand what went on. After notification tracing, however, anyone was able to look at a trace of the notification to see precisely where a notification was sent and where in the flow it was dropped.
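
As a rough illustration of that kind of triage, the sketch below walks the idealized flow for one notification trace and reports the first stage that has no span. The `FLOW` list and `first_missing_stage` helper are hypothetical names, not part of SlackTrace. Note that a missing late stage is not always a failure: a user may simply never open a notification.

```python
# Stages that share the notification_id as their trace_id, in flow order.
FLOW = [
    "notification:notify",
    "notification:sent",
    "notification:received",
    "notification:opened",
    "notification:read in app",
]


def first_missing_stage(span_names):
    """Return (last stage reached, first stage with no span or None)."""
    present = set(span_names)
    last = None
    for stage in FLOW:
        if stage not in present:
            return last, stage
        last = stage
    return last, None
```

For a trace containing only notify and sent spans, the helper reports that the flow stopped before notification:received, which narrows the investigation to delivery rather than to the webapp or the client UI.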

Our customer experience team uses trace data to triage customer issues much faster these days. We now know precisely where in the notification flow a message dropped. Since our traces are easier to read, our CE engineers can look at a trace to learn what happened to a notification and answer a customer’s query, instead of escalating it to the development team, who then had to comb through the many logs. This helped us triage our notifications much more quickly, and reduced the time to triage notification tickets for our CE team by 30%.

Notification analytics

Currently, we ingest notification trace data into Elasticsearch/Grafana and our data warehouse.

Our iOS engineers and Android engineers have started using this data to build Grafana dashboards and alerts to understand the performance of our clients. Typically, client engineers don’t use dashboarding tools like Grafana, but our client engineers have used them very effectively to triage and debug issues in our notification flow.

We have also ingested this data into our data warehouse, where anyone can run complex analytics on it. Initially, data scientists used this data to understand performance regressions in our clients over long periods of time.

The span event format and tracing system also had an unexpected benefit. Our data scientists used this data to build a product analytics dashboard showing funnel analytics on notification flows, to better understand notification open rates. Typically, that product analytics data would be captured by a separate set of instrumentation ingested via a different pipeline into the data warehouse. However, since we sent the trace data to the data warehouse, our data scientists can use it to compute funnel analytics and get the same insights.
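
A funnel computation over span data can be sketched as below. This is an assumption-laden illustration rather than the dashboard’s actual query: the `FUNNEL` stages, the dict-based span shape, and the `funnel` helper are all hypothetical. It counts distinct traces (that is, distinct notifications) reaching each stage, so a notification delivered to several clients is only counted once per stage.

```python
FUNNEL = ["notification:sent", "notification:received", "notification:opened"]


def funnel(spans):
    """Return [(stage, distinct trace count, rate relative to the first stage)]."""
    traces_by_stage = {stage: set() for stage in FUNNEL}
    for span in spans:
        if span["name"] in traces_by_stage:
            traces_by_stage[span["name"]].add(span["trace_id"])
    base = len(traces_by_stage[FUNNEL[0]])
    rows = []
    for stage in FUNNEL:
        count = len(traces_by_stage[stage])
        rows.append((stage, count, count / base if base else 0.0))
    return rows
```

The last row’s rate is the open rate relative to notifications sent; the same shape of query works in a warehouse by grouping on span name and counting distinct trace IDs.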

An even more extraordinary outcome was that the data scientists were able to mine the trace data to identify and report bugs in the application and the instrumentation. In the two years since, notification traces have been used many times outside of the initial use case. This shows the advantage of using trace data as a single source of truth: it supports multiple use cases.

Conclusion

Modeling flows or funnels as a trace is a great idea, but there are some challenges. In this blog post we have shown how Slack modeled notification flows as traces, the challenges we faced, and how we overcame them through careful modeling.

Implementing notification tracing wouldn’t have been possible without decoupling trace context propagation from the request context in the SlackTrace framework. The instrumentation helped us quickly and cleanly implement tracing across multiple backend services, while avoiding the negative side effects of existing libraries, such as cluttered instrumentation and large traces. Currently, we instrument several other flows in the production Slack app using the same strategy.

Modeling notification flows as trace data helped our CE team resolve notification issues 30% faster while also reducing escalations to the development team.

In addition to the original use case of debugging notification issues, notification trace data was also used for calculating funnel analytics for production analytics use cases. Modeling product analytics data as traces provides high-quality data in a consistent format across all of our complex stack. Further, the built-in sessionization of trace data simplified our analytics pipeline by eliminating additional jobs to de-dupe and sessionize the data. In the past two years, backend and frontend developers and data scientists have used the trace data as a single source of truth for multiple use cases.

The success of notification tracing has encouraged several other use cases at Slack where flows are modeled as traces. Today there are at least a dozen tracers running concurrently in the Slack app.

Interested in taking on interesting projects, making people’s work lives easier, or optimizing some code? We’re hiring! 💼 Apply now