This Microhack is organized into three challenges:
Each challenge has a set of tasks that need to be completed in order to move on to the next challenge. It is advisable to complete the challenges and tasks in the prescribed order.
For the next tasks, we will use the LogisticsTelemetry table (which obtains data from the Event Hub).
By taking 10 records, we can see that the telemetry column has a JSON structure. In this task, we will use an 'Update Policy' to manipulate the raw data in the LogisticsTelemetry table (the source table) and transform the JSON data into separate columns that will be ingested into a new table we'll create (referred to as the "target" or "destination" table).
An update policy is like an internal Extract-Transform-Load (ETL) step. It can help you manipulate or enrich the data as it gets ingested into the source table (e.g., extracting JSON into separate columns, creating a new calculated column, or joining the new records with a static dimension table that is already in your database). For these cases, using an update policy is a very common and powerful practice.
Each time records get ingested into the source table, the query we define in the update policy will run on those records (and only on the newly ingested records; other existing records in the source table aren't visible to the update policy when it runs). The results of the query will be appended to the target table.
Though ADX is purpose-built for interactive, ad-hoc queries over nested JSON data, parsing the JSON data into typed columns at ingestion time can improve performance and reduce query complexity for end users - particularly if you have a lot of queries that need to access the same fields.
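To make the mechanics concrete, here is a minimal sketch of the two pieces involved, using hypothetical table and function names (not the ones used in this challenge): a stored function that shapes the raw records, and an update policy on the target table that invokes it for every new batch.
.create-or-alter function ExpandRawEvents()
{
    RawEvents
    | extend d = todynamic(payload)
    | project deviceId = tostring(d.deviceId), value = toreal(d.value)
}
.alter table ExpandedEvents policy update
@'[{"IsEnabled": true, "Source": "RawEvents", "Query": "ExpandRawEvents()", "IsTransactional": false}]'
In this sketch, ExpandedEvents must already exist with the schema produced by the function; the policy then appends the function's output to it whenever records land in RawEvents.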
We want to create a new table (called LogisticsTelemetryManipulated), with a calculated column (we will call it NumOfTagsCalculated) that contains the result of the calculation: telemetry.TotalTags - telemetry.LostTags
The schema of the new (destination) table will be:
(deviceId:string, enqueuedTime:datetime, NumOfTagsCalculated:long, Temp:real)
As a reminder, our telemetry data looks like this (note that the order of the keys may be different):
{
    "BatteryLife": 73,
    "Light": "70720.236143472212",
    "Tilt": "18.608539012789223",
    "Humidity": "60.178854625386215",
    "Shock": "-4.6141182265359628",
    "Pressure": "529.61165751122712",
    "ActiveTags": 165,
    "TransportationMode": "Ocean",
    "Status": "Online",
    "LostTags": 9,
    "Temp": "7.5054504554744765",
    "TotalTags": 185,
    "Location": {
        "alt": 1361.0469,
        "lon": -107.7473,
        "lat": 36.0845
    }
}
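As a quick, standalone illustration of the dynamic-field access and the calculation involved (a sketch only, not the full solution to the task below), you can paste this into the query editor:
print telemetry = todynamic('{"TotalTags": 185, "LostTags": 9, "Temp": "7.5054504554744765"}')
| extend NumOfTagsCalculated = tolong(telemetry.TotalTags) - tolong(telemetry.LostTags), Temp = toreal(telemetry.Temp)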
✍🏻 Run this command to create the target table.
.create table LogisticsTelemetryManipulated (deviceId:string, enqueuedTime:datetime, NumOfTagsCalculated:long, Temp:real)
✍🏻 Complete the function that extracts the desired columns and types.
🕵🏻 Hint: You can test the query outside of the .create-or-alter function to verify it works as expected. Then paste the logic into the function to create it.
.create-or-alter function ManipulateLogisticsTelemetryData()
{
<Complete the query>
}
✍🏻 Now that the transformation function has been created, let's set the update policy of the destination table so that it invokes our new function for each new batch of data.
🕵🏻 Hint: Refer to the docs for the syntax of the update policy.
<Complete the command>
LogisticsTelemetryManipulated
| take 10
At the end of this activity, you should see the data from the source table LogisticsTelemetry transformed into the destination table LogisticsTelemetryManipulated.
References:
📌 Use table: LogisticsTelemetryHistorical
✍🏻 Use a 'let' statement to create a list of the 10 device IDs which have the highest Shock. Then, use this list in a subsequent query to find the overall average temperature of these 10 devices.
You can use the 'let' statement to set a variable name equal to an expression or a function.
let statements are useful for:
breaking up a complex expression into multiple parts, each represented by a variable
defining constants outside of the query body, for readability
defining a value or function once and reusing it several times within the query
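For example, a minimal sketch of the let-then-filter pattern on made-up data (not the challenge answer):
let interestingIds = datatable(deviceId:string) ["a", "b", "c"];
datatable(deviceId:string, Temp:real) ["a", 7.1, "b", 6.4, "z", 9.9]
| where deviceId in (interestingIds)
| summarize avg(Temp)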
🕵🏻 Hint 1: in operator
🕵🏻 Hint 2: let
🕵🏻 Hint 3: Remember to include a semicolon (;) at the end of your let statement.
📌 Use table: LogisticsTelemetryHistorical
✍🏻 Write a query to show a time chart of the historical number of records, by transportation mode. Use 10-minute bins.
Example result:
📌 Use table: LogisticsTelemetryHistorical
✍🏻 Write a query to show, on a map, the locations (based on longitude and latitude) of the 10 devices with the highest temperature in the last 90 days.
🕵🏻 Hint 1: top operator
🕵🏻 Hint 2: render scatterchart with (kind=map)
Once the map is displayed, you can click on the locations. Note that in order to show more details in the balloon, you need to change the render phrase to include series=<TempColumn>.
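The general shape of such a map rendering looks roughly like this (hypothetical table and column names; for kind=map, the first two projected columns are read as longitude and latitude):
SomeDeviceTable
| project lon, lat, Temp
| render scatterchart with (kind=map)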
Example result:
References:
Range is a tabular operator: it generates a single-column table whose values are start, start + step, and so on, up to and including stop.
Run the following query and review the results:
range MyNumbers from 1 to 8 step 2
Range also works with dates:
range LastWeek from ago(7d) to now() step 1d
We will use the range operator as part of the time series creation in the next tasks.
Many interesting use cases apply machine learning algorithms to derive insights from telemetry data. Often, these algorithms require a strictly structured dataset as their input; the raw log data usually doesn't match the required structure and size. We will see how we can use the make-series operator to create well-curated time-series data.
Then, we can use built-in functions like series_decompose_anomalies. Anomalies (outliers) will be detected by the Kusto service and highlighted as red dots on the time series chart.
A time series is a collection of observations of well-defined data items obtained through repeated measurements over time and listed in time order. Most commonly, the data points are consistently measured at equally spaced intervals. For example, measuring the temperature of the room each minute of the day would comprise a time series. Data collected irregularly is not a time series.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting, for example, is the use of a model to predict future values based on previously observed values.
Time series decomposition involves thinking of a series as a combination of 4 components:
trends (increasing or decreasing value in the series)
seasonality (repeating short-term cycle in the series)
baseline (the predicted value of the series, which is the sum of seasonal and trend components)
noise (the residual random variation in the series)
We can use built-in functions that use time series decomposition to forecast future metric values and/or detect anomalous values.
This is what a time series looks like:
The summarize operator does not add "null bins": rows for time bin values for which there's no corresponding row in the table. It's a good idea to "pad" the table with those bins. Advanced built-in ML capabilities like anomaly detection need the data points to be consistently measured at equally spaced intervals. The make-series operator can create such a "complete" series.
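Here is a tiny, self-contained example of that padding behavior on made-up data: the 00:10 bin has no sample, yet it still shows up in the series with the default value.
datatable(ts:datetime, v:real)
[
    datetime(2022-07-01 00:00:00), 1.0,
    datetime(2022-07-01 00:25:00), 3.0
]
| make-series avg_v = avg(v) default = 0.0 on ts from datetime(2022-07-01 00:00:00) to datetime(2022-07-01 00:30:00) step 10m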
📌 Use table: LogisticsTelemetryHistorical
✍🏻 Write a query to create an anomaly chart of the average shock.
For this task, we will provide more instructions. To generate the series, start with:
let min_t = (toscalar(LogisticsTelemetryHistorical | summarize min(enqueuedTime)));
let max_t = (toscalar(LogisticsTelemetryHistorical | summarize max(enqueuedTime)));
let step_interval = 10m;
LogisticsTelemetryHistorical
| make-series avg_shock_series=avg(Shock) on (enqueuedTime) from (min_t) to (max_t) step step_interval
Now, we will use this avg_shock_series and run series_decompose_anomalies.
This built-in function takes an expression containing a series (dynamic numerical array) as input, and extracts anomalous points with scores.
| extend anomalies_flags = series_decompose_anomalies(avg_shock_series, 1)
| render anomalychart with(anomalycolumns=anomalies_flags, title='Avg Shock Anomalies')
The anomalies/outliers can be clearly spotted in the 'anomalies_flags' points.
Example result:
🤔 Wait! It looks a lot like it simply figured out the highest and lowest Shock values. Can you tweak the parameters to make it less sensitive?
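One knob worth experimenting with (a suggestion to verify against the docs): the second argument of series_decompose_anomalies is the anomaly score threshold (1.5 by default), so raising it should flag fewer points, for example:
| extend anomalies_flags = series_decompose_anomalies(avg_shock_series, 2.5)
| render anomalychart with(anomalycolumns=anomalies_flags, title='Avg Shock Anomalies')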
References:
💡 FOR THE NEXT TASKS, WE WILL USE THE NYC TAXI DATA
The NYC Taxi data was ingested into the NYCTaxiRides table during Microhack 1. If the proctor hasn't provided the data set, use this Azure Open Dataset on NYC Taxi Rides by ingesting it into your ADX cluster.
📌 Use table: NYCTaxiRides
Write some queries to get familiar with this table.
✍🏻 After gaining some familiarity, write a query to create a pie chart of the payment types. Use tostring() to convert the payment_type to a string before rendering the pie chart.
For now, the payment types will be represented by numbers. We will join with the payment type lookup table in a later task.
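The general shape of such a query is a summarize followed by a render (hypothetical table and column names, not the answer itself):
SomeTable
| summarize count() by Category = tostring(SomeCode)
| render piechart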
Example result:
📌 Use table: NYCTaxiRides
✍🏻 Write a query to create a column chart showing the number of rides for each day of the week, across the entire data set. You can use 1, 2, ..., 7 to denote Sunday through Saturday.
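If it helps, here is a sketch of one way to bucket by day of week on made-up data (note that dayofweek() returns a timespan, with 0d meaning Sunday):
datatable(pickup:datetime) [datetime(2022-07-03), datetime(2022-07-04), datetime(2022-07-04)]
| extend DayOfWeek = toint(dayofweek(pickup) / 1d) + 1   // 1 = Sunday ... 7 = Saturday
| summarize Rides = count() by DayOfWeek
| render columnchart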
Example result:
References:
📌 Use table: NYCTaxiRides
✍🏻 Write a query to find out whether the tip amount correlates with the number of passengers in the taxi between 1 July 2022 and 31 July 2022. Restrict the number of passengers to a maximum of 4.
🕵🏻 Hint: How does the average tip amount change with passenger count and pickup time?
Example result:
📌 Use table: NYCTaxiRides
✍🏻 Write a query to draw an anomaly chart for the tip amount in the month of July 2022.
🕵🏻 Hint 1: make-series for the average tip amount, with 1-hour steps
🕵🏻 Hint 2: Use series_decompose_anomalies with this series and a parameter of 5 (sensitivity level)
Example result:
📌 Use table: NYCTaxiRides
The externaldata operator returns a table whose schema is defined in the query itself, and whose data is read directly from an external storage artifact, such as a blob in Azure Blob Storage, a file in Azure Data Lake Storage, or even a file in a GitHub repository. Since the data is not ingested into ADX, it cannot be indexed, compressed, or stored in the hot cache. For best performance, we recommend ingesting the data. External data can, however, be used in sporadic cases where you do not want to ingest the data.
Take a look at this csv file: https://raw.githubusercontent.com/Azure/azure-kusto-microhack/main/assets/ExternalData/payment_type_lookup.csv.
The file represents the mapping between the numeric code of the payment type and its description.
✍🏻 Here is how we can use KQL to handle this external data:
let payment_type_lookup_data = (externaldata (code:string,description:string)
[ @"https://raw.githubusercontent.com/Azure/azure-kusto-microhack/main/assets/ExternalData/payment_type_lookup.csv" ]
with(format="csv", ignoreFirstRecord=true))
| project code = tolong(code), description;
payment_type_lookup_data
References:
📌 Use table: NYCTaxiRides
The taxi rides table has a payment_type field. This is a numeric code signifying how the passenger paid for the trip. Use the payment_type_lookup to map the payment code to its description. Use a leftouter join to merge the rows of the two tables into a new table, by matching values of the payment code column.
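The join itself has this general shape (hypothetical table names; the lookup goes on the right side, matched on the shared code column):
FactTable
| join kind=leftouter (DimTable) on $left.code == $right.code
When both sides use the same column name, writing on code is equivalent.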
✍🏻 Perform the join operation and render a time chart of the number of records per payment type over time, with 1-day bins, based on data between 2022-07-01 and 2022-07-31.
🤔 What is the most common method of payment for rides? Credit cards or cash? What does it look like over time?
Example result:
References:
📌 Use table: NYCTaxiRides
This task comprises multiple steps that lead to one overall query.
✍🏻 Create a time chart that will show:
the number of rides during July 2022
a forecast of the number of rides for the first week of August 2022 (using the series_decompose_forecast function)
Additional Information:
🕵🏻 Hint 1: Start your query with:
let min_t = datetime(2022-07-01);
let max_t = datetime(2022-08-07); // Note that there is no data in the first week of August. We will forecast the data for this week.
NYCTaxiRides
| where tpep_dropoff_datetime between (min_t .. max_t)
🕵🏻 Hint 2: Make a series of the number of rides on tpep_pickup_datetime between these dates. Use steps of 30 minutes.
🕵🏻 Hint 3: Use series_decompose_forecast with this series and a second parameter of 24*7. The second parameter is an integer specifying the number of points at the end of the series to predict (forecast). These points are excluded from the learning (regression) process. We will use 24*7 additional data points in order to forecast a week forward.
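Putting the hints together, the forecasting step itself has roughly this shape (assuming a series named num_rides_series produced by your make-series step; the second argument should cover a week's worth of bins at whatever step size you chose):
| extend forecast = series_decompose_forecast(num_rides_series, 24*7)
| render timechart with (title='NYC taxi rides forecast')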
Example result:
✍🏻 Using the Dashboard feature of Azure Data Explorer, build a dashboard using the outputs of any 5 queries (on the LogisticsTelemetryHistorical table) that you created in the previous challenges, with the following improvements:
Include filters in the dashboard so that the queries do not need to be modified when the user wants to analyze the charts with different parameter values. For example, users would like to analyze the charts over the last week, the last 14 days, and the last month. Users would also like to analyze the charts by different transportation modes.
🕵🏻 Hint 1: In the query window, explore the "Share" menu.
Example dashboards:
References:
✍🏻 Visualize the outputs of Task 5 (Pie Chart) and Task 7 (Datetime Operations) in Challenge 5 in Power BI using the DirectQuery mode.
🕵🏻 Hint 1: In the query window, explore the "Share" menu.
There are multiple ways to connect ADX and Power BI depending on the use case. For this microhack, we will use the DirectQuery method. Feel free to explore other methods listed in the docs.
References:
🎉 Congrats! You've completed the second Microhack. Keep going with Microhack 3: Advanced KQL, Policies, Security