Work with data in Ballerina
This simple guide helps you understand how query expressions (a.k.a. language integrated queries) can be used on data to produce new data.
Table of contents Expand all Collapse all
Ballerina has first-class support for writing SQL-like queries to process data. Language-integrated queries can process any Ballerina iterable.
In this tutorial, you will be writing queries to filter, sort, and join with different data sets and produce new data sets.
Set up the prerequisites
To complete this tutorial, you need:
- A command terminal
- A text editor
Tip: Preferably, Visual Studio Code with the Ballerina extension.
- A Ballerina installation
Create a new Ballerina project
To create a new Ballerina package, use the bal new
command as shown below.
$ bal new query_expressions
This will create a new Ballerina package called query_expressions
.
The bal new
command generates the below files inside query_expressions
.
cd query_expressions
tree .
.
├── Ballerina.toml
└── main.bal
0 directories, 2 files
Define the dataset to be processed
To keep the tutorial simple, you will be using an in-memory table to store the COVID-19 dataset. Each record of type CovidEntry
in the table represents the COVID-19 data related to a particular country. The iso_code
is used to uniquely identify a country and other fields are self-explanatory.
public type CovidEntry record {|
readonly string iso_code;
string country;
decimal cases;
decimal deaths;
decimal recovered;
decimal active;
|};
public final table<CovidEntry> key(iso_code) covidTable = table [
{iso_code: "AFG", country: "Afghanistan", cases: 159303, deaths: 7386, recovered: 146084, active: 5833},
{iso_code: "SL", country: "Sri Lanka", cases: 598536, deaths: 5243, recovered: 568637, active: 14656},
{iso_code: "US", country: "USA", cases: 69808350, deaths: 880976, recovered: 43892277, active: 25035097},
{iso_code: "IND", country: "India", cases: 80808350, deaths: 980976, recovered: 33892279, active: 35035095}
];
Copy the above code snippet to main.bal
. This will act as our COVID-19 Dataset.
Filter the data
Let’s define a function, which will filter out the records, which have values higher than 100,000 for the cases
field.
public function filterCountriesByCases(table<CovidEntry> dataTable, decimal noOfCases) returns string[] {
string[] filteredCountries = from CovidEntry entry in dataTable
where entry.cases > noOfCases
select entry.country;
return filteredCountries;
}
The filterCountriesByCases
function uses a query expression to iterate the records in the dataTable
table and filter only the records, which have more than noOfCases
cases. To filter, a where
clause with a condition is used. All records, which satisfy the condition in the where
clause will be selected. Copy the above code snippet and paste it on to main.bal
just below the definition of the covidTable
.
Let’s call the filterCountriesByCases
function from inside the main
function.
public function main() {
string[] countries = filterCountriesByCases(covidTable, 10000000);
io:println("Countries with more than 10 million cases: ", countries);
}
In the main
function, you have called the filterCountriesByCases
function and have provided covidTable
and 10000000
as parameters so that the function will filter the countries, which have more than 10000000 COVID-19 cases. In the next line, you have printed the result from the function. Copy the above code onto main.bal
and execute bal run
from within the query_expression
project folder.
The result will print something similar to the below.
Compiling source
.../query_expressions:0.1.0
Running executable
Countries with more than 10 million cases: ["USA","India"]
Sort countries by COVID-19 deaths
Let’s define a new function to find the top three countries with the highest number of COVID-19 deaths. In this function, you will use another query to sort and retrieve a limited number of records from the table.
public function findCountriesByHighestNoOfDeaths(table<CovidEntry> dataTable, int n) returns [string, decimal][] {
[string, decimal][] countriesWithDeaths = from CovidEntry entry in dataTable
order by entry.deaths descending
limit n
select [entry.country, entry.deaths];
return countriesWithDeaths;
}
The findCountriesByHighestNoOfDeaths
function uses query expression to find the top three countries with the highest COVID-19 deaths. The order by
clause is used to sort the records in the table in descending
order and the limit
clause to limit the number of output records of the query to n
. As the result, the query produces an array of tuples of type [string, decimal]
. Each tuple contains the country name and the number of reported deaths. The produced array is in descending order by the number of deaths.
Let’s call the findCountriesByHighestNoOfDeaths
function from within the main
function to find the top three countries by the number of deaths.
public function main() {
string[] countries = filterCountriesByCases(covidTable, 10000000);
io:println("Countries with more than 10 million cases: ", countries);
[string, decimal][] countriesWithDeaths = findCountriesByHighestNoOfDeaths(covidTable, 3);
io:println("Countries with highest deaths:", countriesWithDeaths);
}
Copy the last two lines of the above code onto main.bal
and execute bal run
from within the query_expression
project folder.
The output will look like the below.
Compiling source
.../query_expressions:0.1.0
Running executable
Countries with more than 10 million cases: ["USA","India"]
Countries with highest deaths:[["India",980976],["USA",880976],["Afghanistan",7386]]
Join with another datasource
Using query expressions, you can join two collections and produce a new collection. The join
operation is similar to the SQL join
operation. In this tutorial, you will be joining the covidTable
with a string array, which contains three countries.
public function findRecoveredPatientsOfCountries(table<CovidEntry> dataTable, string[] countries) returns [string, decimal][] {
[string, decimal][] countriesWithRecovered = from CovidEntry entry in dataTable
join string country in countries on entry.country equals country
select [entry.country, entry.recovered];
return countriesWithRecovered;
}
The findRecoveredPatientsOfCountries
function uses a query expression to join the dataTable
table with an array of strings named countries
. The join condition is provided after the on
keyword.
For every record in the dataTable
, all the elements in the countries
array will be joined. The output array of tuples will have the country and the number of recovered patients only if the condition after the on
keyword is satisfied for that particular pair of table record and element of the array being joined.
Now, change the existing main
function by calling the findRecoveredPatientsOfCountries
function at the end to get the number of recovered patients.
You will get the number of recovered patients in the USA, India, and Afghanistan. The updated main
function should look like the one below.
public function main() {
string[] countries = filterCountriesByCases(covidTable, 10000000);
io:println("Countries with more than 10 million cases: ", countries);
[string, decimal][] countriesWithDeaths = findCountriesByHighestNoOfDeaths(covidTable, 3);
io:println("Countries with highest deaths:", countriesWithDeaths);
string[] c = ["USA", "India", "Afghanistan"];
[string, decimal][] countriesWithRecovered = findRecoveredPatientsOfCountries(covidTable, c);
io:println("Countries with number of Recovered patients:", countriesWithRecovered);
}
Now, if you execute bal run
from inside the query_expressions
, you will get an output similar to below.
Compiling source
.../query_expressions:0.1.0
Running executable
Countries with more than 10 million cases: ["USA","India"]
Countries with highest deaths:[["India",980976],["USA",880976],["Afghanistan",7386]]
Countries with number of Recovered patients:[["Afghanistan",146084],["USA",43892277],["India",33892279]]
Find discrepancies in the dataset
This example shows how you can use the let
clause to maintain an intermediate state while iterating a collection using query expression and how to use that intermediate state for further processing.
For example, in this dataset, the total number of reported cases should be equal to the sum of the number of deaths, recovered, and active. If they are not equal, an error should have occurred while the dataset is populated.
Let’s define a function called printErroneousData
to find any erroneous records in the dataset. If there is any record in which the number of reported cases
is not equal to the sum of recovered
, active
, and deaths
, this function will print it.
public function printErroneousData(table<CovidEntry> dataTable) {
string[] countries = from CovidEntry entry in dataTable
let decimal sum = entry.recovered + entry.deaths + entry.active
where entry.cases != sum
select entry.country;
if countries.length() > 0 {
io:println("Found erroneous entries for countries: ", countries);
}
}
Here, you use the sum
variable to hold the results of intermediate calculations in the query expression.
Note: Even though this particular example uses a separate variable to demonstrate the usage of the
let
clause, you can do the calculations inline from within thewhere
clause also.
Now, let’s call this function inside the main
function. The final version of the main
function looks like below.
public function main() {
string[] countries = filterCountriesByCases(covidTable, 10000000);
io:println("Countries with more than 10 million cases: ", countries);
[string, decimal][] countriesWithDeaths = findCountriesByHighestNoOfDeaths(covidTable, 3);
io:println("Countries with highest deaths:", countriesWithDeaths);
string[] c = ["USA", "India", "Afghanistan"];
[string, decimal][] countriesWithRecovered = findRecoveredPatientsOfCountries(covidTable, c);
io:println("Countries with number of Recovered patients:", countriesWithRecovered);
printErroneousData(covidTable);
}
Now, execute bal run
from within the query_expression
folder, and you will see an output similar to below.
Compiling source
.../query_expressions:0.1.0
Running executable
Countries with more than 10 million cases: ["USA","India"]
Countries with highest deaths:[["India",980976],["USA",880976],["Afghanistan",7386]]
Countries with number of Recovered patients:[["Afghanistan",146084],["USA",43892277],["India",33892279]]
Found erroneous entries for countries: ["Sri Lanka","India"]