Hi all-- Please find below an introductory primer for the Ataccama Expressions language, for those familiar with Excel formulae (which are deliberately very similar in structure and use to the way we’ve designed the expression language)…
Introduction
Overview
One of Ataccama’s core values when developing Ataccama ONE is to create a user experience which simplifies as far as possible the technical complexities of data manipulation and data quality, and to put as many capabilities into the hands of business and non-technical users as possible.
With this philosophy in mind we have created a powerful expression language, which moves away from the complexity of development languages such as scripting/python/SQL. Despite containing powerful and extensible functionality for complex data challenges, the design of the expressions have been made as simple as possible, and can be easily learned by anyone with a familiarity in MS-Excel formulas and features.
This expression language is used throughout Ataccama ONE, in both the web and desktop applications. Expressions can be applied to any data regardless of the type of data source (SQL database, files, data streams, ETL, clustered environments, etc), removing the need for users to be familiar with the syntax of the various languages supported by these technologies. An expression / rule, once written, can be re-used and applied anywhere without the need to rewrite into differing programming languages.
This document provides an introduction to the expression language, its syntax and its use, alongside useful functions and a full expression reference guide within the appendix.
Audience
This document is aimed at Ataccama beginners with an expectation of some knowledge of the workings of Excel formulae and Conditional Formatting.
Document Structure
- What are Ataccama Expressions?
- Introduction to the syntax and style of the expressions, with side by side comparisons to Excel.
- A list of useful and common expressions and functions.
- Notes on advanced features which use regular expressions
- The full expressions documentation from Ataccama’s documentation.
Introduction to Expressions and Formulae
What is an expression
Put simply an expression is a way of instructing an application to perform an action (or a series of actions) on one or more values, and to return a result. Every expression must therefore have; (i) some input values*, (ii) the list of actions to perform (and the sequence in which to perform them), (iii) a result.
( Purist note; In some cases (eg. the Excel function TODAY(), which returns today’s date, input values aren’t required.)*
Results of expressions
The result of an expression will be a specific type of data; text, whole number, decimal, etc. To take some simple examples…
Numeric expressions
The expression…
3 + 4
….would result in a number, 7. (The action is addition and the input values are 3 and 4.)
String expressions
The expression…
“ATA” & “CCAMA”
…In Excel, the action ‘&’ is an action used to concatenate text values together. Therefore the input values ‘ATA’ and ‘CCAMA’ (both text) are concatenated and the result is the text value ‘ATACCAMA’.
Conditional Expressions
Consider the following…
4 > 3
…The ‘>’ indicates a conditional action ‘is greater than?’. In natural language terms we’d read it as a question; “Is 4 greater than 3?”, the answer being ‘yes, of course it is!’.
Conditional expressions in both Excel and Ataccama result in a data type called boolean, a true/false value.
As we can see in Excel:
This formula equates to TRUE
Conditional expressions are one of the most important types of expression in Ataccama, especially when writing rules. Consider ‘Is this account number eight digits?’, ‘Does this policy type exist in my central list of policy types?’, ‘Is this value unique?’. When working with Ataccama rules all expressions you’ll use will result in a true-false (yes or no) result, known as a boolean.
Functions, Sequencing and Brackets
Function Basics
The above examples of simple arithmetic and string manipulation are only sufficient for quite basic operations. The value in the processing comes from advanced actions/functions, small pre-coded routines built into the software. Both Excel and Ataccama contain a huge set of built in functions across all data types:
The format of an expression in both Ataccama and Excel is the same;
ExpressionName ( < list of input parameters > )
For example, the expression Left ( “ATACCAMA”, 3 ) will return the text value ‘ATA’ (the leftmost three characters) in both Excel and Ataccama.
Nesting Functions and Sequencing
In both Excel and Ataccama it is possible to embed functions within functions, and this is achieved by using brackets.
This expression…
Upper (Left (”AtaccamaONE”, & “ Data Quality”)
…would return ‘ATACCAMA DATA QUALITY’, in upper case. When functions are nested with brackets they are evaluated from the innermost bracket (the function ‘left’ in this case), then any operations (in this case the concatenation ‘&’, and so on, finally with the outer function ‘Upper’ (a function which puts any text into full upper case) last.
Excel and Ataccama’s syntax do however have some subtle differences between them and we cover key differences later in the document.
Useful Ataccama Functions
Text/string functions
Expression | Description |
---|---|
left(srcStr, count) | Returns a new string that is a substring of srcStr containing count of chars taken from the start of the srcStr. |
right(srcStr, count) | Returns a new string that is a substring of srcStr containing count of chars taken from the end of the srcStr. |
substr(srcStr, beginIndex), | |
substr(srcStr, beginIndex, strLen) | Returns a new string that is a substring of the string srcStr. |
transliterate(srcStr, charsFrom, charsTo) | Transforms characters of the string srcStr. |
containsWord(srcStr, srcWord) | Searches for the occurrence of the word srcWord in the string srcStr. |
indexOf(srcStr, subStr), indexOf(srcStr, | |
subStr, fromIndex) | Returns the index within the string srcStr of the first occurrence of the specified substring subStr. |
lastIndexOf(strStr, subStr) | The sister function to lastIndexOf(), this super useful function will return the position of the final instance of subStr. This is often used to obtain last “/” in a file path so you can get rid of everything in front and just get the file name. |
lower(srcStr) | Transforms all characters of the string srcStr to lower case. |
upper(srcStr) | Transforms all characters of the string srcStr to upper case. |
capitalize(srcStr) | Transforms the first character of each word to upper case and all following characters to lower case. |
replicate(srcStr, n) | Returns n copies of the string srcStr concatenated together without any separator. |
replace(srcStr, what, withWhat) | Replaces occurrences of the string what with the string withWhat in the string srcStr. |
Regular Expression Functions ( See ‘Understanding Regular Expressions’ below)*
Regular Expressions are a highly tuned, efficient way of evaluating text based formats and widely used throughout Ataccama. Although they can be daunting at first, once understood the use of regular expressions can vastly improve the speed of creating rules.
Regular Expressions are a highly tuned, efficient way of evaluating text based formats and widely used throughout Ataccama. Although they can be daunting at first, once understood the use of regular expressions can vastly improve the speed of creating rules.
Expression | Description |
---|---|
find(srcRegex, srcStr | |
e, caseInsensitive]) | Verifies whether the string srcStr or its parts match the regular expression srcRegex. |
matches(srcRegex, srcStr <, | |
caseInsensitive]) | Verifies whether the string srcStr matches exactly the pattern of the regular expression srcRegex. |
substituteAll(srcPattern, srcReplacement, | |
srcStr o, caseInsensitiveFlag]) | Replaces all occurrences of srcPattern in string srcStr with srcReplacement. |
substituteMany(srcPattern, | |
srcReplacement, srcStr, srcVolume S, caseInsensitiveFlag]) | Replaces all occurrences of srcPattern in the string srcStr with srcReplacement, the maximum number of replacements is defined by the parameter srcVolume. |
Conversion Functions
Expression | Description |
---|---|
toDate(expr, dateFormate, dateLocale]) | Returns the date specified in expression expr converted to date type DAY. |
toString(expr, strFormate, strLocale]) | Converts the expression expr to a STRING value. |
Datetime Functions
Expression | Description |
---|---|
datePart(srcDate, fieldName) | This function returns the value of the field fieldName (YEAR, MONTH or DAY) of srcDate. |
dateDiff(startDate, endDate, fieldName) | Returns the difference between endDate and startDate expressed in fieldName units. |
today() | Returns the current date in type DAY. |
dateAdd(srcDate, srcValue, fieldName) | Adds the specified srcValue of the type specified by fieldName (YEAR, MONTH or DAY) to the srcDate. |
Conditional Functions
Expresion | Description |
---|---|
iif(ifExpr, trueExpr, elseExpr) | Returns trueExpr if ifExpr is TRUE. |
case(expr, exprValueh, expr, exprValue]...f, defaultExpr]) | Returns the value of expression exprValue immediately following the first expression expr whose value is TRUE. |
decode(decodeExpr, expr, exprValues, expr | |
, exprValue]...i, defaultExpr]) | Returns the value of expression exprValue immediately following the first expression expr whose value is equal to decodeExpr. |
nvl(exprl, expr]...) | Returns the value of the first expression expr whose value is not NULL. |
Excel vs. Ataccama - Key Differences
Although Ataccama’s expression language follows the syntax and structure of Excel, there are a few subtle differences between the two. (This is due to the nature of the underlying programming language of Ataccama and the manner in which its expressions are executed; Some operators such as ‘=’ or ‘&’, or functions such as ‘if’ are already reserved for other purposes within Ataccama and so alternatives had to be chosen.)
Below we present a list of key differences between the two;
Excel | Ataccama Equivalent | Notes |
---|---|---|
& | + | The & symbol in Excel will concatenate two text values together. Within Ataccama, the + symbol is used for addition when working with numeric values, or concatenation when working with text. |
In Excel, 4+3 would return 7 In Ataccama, 4+3 also returns 7
In Excel, “Ata” & “ccama” returns “Ataccama”, a text. The Ataccama equivalent is “Ata” + “ccama” | | if | iif | The function IF, in Excel has the following syntax:
IF (conditional-expression, value-if-true, value-if-false)
Within Ataccama, this works in exactly the same way but the function is named iif, with two is.
IF ( A > B, “A is greater than B”, “A is not greater than B” ) becomes IIF ( A > B, “A is greater than B”, “A is not greater than B” ) | | mid | substr | In order to extract a sub-set of a piece of text, within Excel we would use the function ‘mid’. For example mid (”Ataccama”, 4, 2) would return ‘cc’. Within Ataccama the function substr (substring) is used to achieve the same result. It is important to note that within Ataccama text the first character is numbered 0 (vs. 1 in Excel), so an equivalent in Ataccama is: substr (”Ataccama”, 3, 2). |
Advanced Feature - Understanding Regular Expressions
Introduction
Regular expressions are an industry standard way of describing the pattern or structure of data, their use dating back to the early 1950s. Regex, used within Ataccama functions, are widely used across all industries and organisations to describe standard pattern or structure of a specific type of data. When creating logic in Ataccama it’s often easy to find a regex pattern which describes the data you are trying to validate.
There are numerous online sites, government portals, or open source regular expressions available to use within your Ataccama expressions!
Worked Example
Example
UK National Insurance numbers (UK’s social security number) are defined here:
https://www.regexlib.com/REDetails.aspx?regexp_id=527
A national insurance number is formed of two (specific) characters, six digits, and an optional final character. The regex for this is:
^"A-CEGHJ-PR-TW-Z]{1}iA-CEGHJ-NPR-TW-Z]{1}?0-9]{6}iA-DFM]{0,1}$
Within Ataccama we are able to use the function ‘matches’, to check this:
matches ( myTextValue, “^tA-CEGHJ-PR-TW-Z]{1}hA-CEGHJ-NPR-TW-Z]{1}H0-9]{6}-A-DFM]{0,1}$” )
Unpicking this example
At first glance, for those unfamiliar with regular expressions, a regular expression can be rather daunting. However, once a few simple rules are understood, they become fairly straight forward to read.
The structure of a regex pattern is describing the format of data for text values to match against. (Hence the function ‘matches’ being used above. Ataccama will attempt to match the input value against the regex string.)
In the above example:
- The symbol ^ indicates the beginning of the text.
- The symbol $ indicates the end of the text.
- Everything within ^ and $ will be part of the matching process.
- A set of brackets indicates an optional list of characters, and the ‘-’ found within indicates a range. ^ A-E] would therefore mean “The first character of this piece of text must be one of A, B, C, D, E, only”.
- Therefore rA-CEGHJ-PR-TW-Z] will match one character out of ABCEGHJKLMNOPRTWXYZ.
- The numbers in curly brackets {} indicate the number of times a character can be repeated.
- Therefore ,0-9]{6} can be read as ‘at this point in the text, we must find one of the digits 0-9, repeated six times.
Regular Expressions Basics
The below summary has been inspired from this comprehensive guide - one of many brilliantly written articles available online covering this subject; https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
Character Matching and Repeated Characters
Logic | Description | Example |
---|---|---|
^The | Matches any text beginning ‘The’ | The house - matches correctly |
the house - would not. (Note that regex is ‘case sensitive’, and the upper and lower case must match exactly). | ||
end$ | Matches any text with the word end, at the end. | In the end - matches correctly |
In the end I wrote something else - does not match | ||
* | Matches any number of characters | matches a string that has ab followed by zero or more c |
+ | Indicates one or more of the preceeding character | matches a string that has ab followed by one or more c |
? | Indicates zero or more of the proceeding character | matches a string that has ab followed by zero or one c |
{x} | Indicates the text must match exactly ‘x’ number of repeated characters. | matches a string that has ab followed by 2 c |
{x,} | The text must match x or more instances of the preceding character | matches a string that has ab followed by 2 or more c |
{x,y} | The text must match x-y instances | matches a string that has ab followed by 2 up to 5 c |
a(bc) | Brackets are a method to group together specific characters when using these operators. In this case the text bc must match what follows. | matches a string that has a followed by zero or more copies of the sequence bc |
a(bc){2,5} | As in this example, a must be followed by two to five instances of the text bc. | matches a string that has a followed by 2 up to 5 copies of the sequence bc |
Two was to specify ‘this OR that’.
Logic | Description | Example |
---|---|---|
a(b | c) | The |
a bc] | As mentioned above, any series of characters within square brackets will match one of the characters within. | |
The - character can be used to specify a range. | /0-9] for example, would match any digit. | |
asbc] achieves the same ab or ac match as in the first example using a pipe | . |
Matching Characters
Logic | Description | Example |
---|---|---|
\d | This will match any digit 0-9 | \d* would match any number (without decimals) |
\w | This will match any ‘word’ character; that is any upper or lower case letter, or a _ underscore. | \w* would match any word such as Ataccama_ONE. Ataccama ONE (containing a space) would not match. |
\s | This will match any whitespace / spacebar character. | |
\D \W \S | The capitalised versions of these tags will match the opposite of their lower case equivalent. For example \D matches any non-digit. |
Further Reading
There are numerous in-depth resources for understanding regular expressions available online, and the topic is larger than this primer will allow. The site regex101.com provides comprehensive detailed documentation and a tool for testing regular expressions.
Summary
This concludes the introductions to Ataccama’s expression language, and by this point you should have enough information to create rules and logic within both the desktop and web applications using this powerful feature. Please do let us know of any questions you may have, to be included in future versions.
Thanks for reading!