Data Source: Overview of Data Lake
Data Lake is a search engine-based NO-SQL database. Users can ingest petabyte, zettabyte, yottabyte of records both structured and unstructured for Analytics, Storage, Machine Learning and deep learning.
Data Source is a connection pool in eZintegrations platform to retrieve data in JSON format.
Response from Data Lake source is stored in key bizdata_dataset_response. If you are using Single Line to Multiline Operations as the next operation in your pipeline, then the Chop key will have value as
['bizdata_dataset_response']
Goldfinch Analytics Data Lake Source has the following parameters:
Data Lake Version: Data Lake Version is the Data Lake Name and its version assigned to the organization.
Index / Table Name: Index or Table name that you want to retrieve data from Data source. For the list of table names or indices, check Datalake section in the Visualization product.
Pagination Wait Time: By default, it is 2m, where m is minute. Pagination is a standard API capability. Data source retrieves data in a paginated way. This parameter is to set how long to wait for the next page. If the response of a Table/Index is very high (having 100+ keys in a single record), try increasing the Pagination Wait Time. In general, 2m is sufficient to stream data. Use this when you have high network traffic congestion.
Can also use h for hours and s for seconds
Timeout: By default, it is 2m, where m is minute. In general, 2m is high enough to get a response from Data sou. Increase this when response from Data Lake is slow. This can happen when the Data Lake Cluster size is small. Reach out to support team to increase cluster size of Goldfinch Analytics Data Lake.
Can also use h for hours and s for seconds
Size: By default, it is 1000. Size is number of streaming record count from Data Lake source. The source will stream the records inside pipeline in chunks of 1000 and move them to operations and finally to Data Target. This can be increased to max of 10,000 records in cases such as one-time historical data loads.
For better performance and durability 1000 is recommended. This size helps to process 1000 records in target faster and ensures efficiency in real-time data processing.
Query: JSON Body based query to retrieve data from tables/index of Data Lake.
Get all the Records from a table
This example retrieves all the records from a table. Similar to SQL: select * from table
{
"query": {
"match_all": {}
}
}
Get all Records with specific columns/keys from a table
Example: retrieves only store_number and customer_number. Similar to SQL:
select store_number, customer_number from table
{
"_source": ["store_number", "customer_number"],
"query": {
"match_all": {}
}
}
Get specific records with specific column/keys from a table
Example: retrieves only employee_id and employee_name where employee_id = 130. Similar to SQL:
select employee_id, employee_name from table where employee_id=130
{
"query": {
"match": {
"employee_id": 130
}
},
"_source": {
"includes": ["employee_id", "employee_name"]
}
}
Get Specific Columns/keys with Filters from a table
Similar to SQL:
select Project,title,Assigned To,Priority,Created By,createdDateTime,dueDateTime from table where Project='Project ABC' and Priority is not null and percentComplete=100
{
"size": 50,
"sort": [{}],
"_source": ["Project", "title", "Assigned To", "Priority", "Created By", "createdDateTime", "dueDateTime"],
"query": {
"bool": {
"must": [
{ "query_string": { "query": "*" }},
{ "query_string": { "query": "Project:\"Project ABC\" AND Priority:[* TO *] AND NOT percentComplete:100" }},
{ "bool": { "should": [] }}
],
"must_not": []
}
}
}
Get Specific Columns/keys with Filters when key name contains spaces
Similar to SQL:
SELECT ThreadId, Ticket Created At FROM YourTableName WHERE Status != 'Closed' AND Thread Type = 'create'
Note: Thread Type becomes Thread\\ Type in JSON.
{
"size": 1000,
"sort": [{}],
"_source": ["ThreadId", "Ticket Created At"],
"query": {
"bool": {
"must": [
{ "query_string": { "query": "*" }},
{ "query_string": { "query": "NOT Status:\"Closed\" AND Thread\\ Type: \"create\"" }},
{ "bool": { "should": [] }}
],
"must_not": []
}
}
}
SELECT * FROM table WHERE asn IS NULL
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "asn"
}
}
}
}
}
SELECT id,ipAddress FROM table WHERE asn IS NULL
{
"_source": ["id", "ipAddress"],
"query": {
"bool": {
"must_not": {
"exists": {
"field": "asn"
}
}
}
}
}
SELECT * FROM table WHERE asn IS NULL AND ipAddress = ‘{%ipAddress%}’
{
"query": {
"bool": {
"must": [
{ "term": { "ipAddress": "{%ipAddress%}" }}
],
"must_not": [
{ "exists": { "field": "asn" }}
]
}
}
}
SELECT asn, as FROM table WHERE as IS NOT NULL AND ipAddress = ‘{%ipAddress%}’
{
"_source": ["asn", "as"],
"query": {
"bool": {
"must": [
{ "term": { "ipAddress": "{%ipAddress%}" }},
{ "exists": { "field": "as" }}
]
}
}
}
SELECT asn, as FROM table WHERE as IS NOT NULL AND ipAddress = ‘{%ipAddress%}’ TOP 1
{
"_source": ["asn", "as"],
"query": {
"bool": {
"must": [
{ "term": { "ipAddress": "{%ipAddress%}" }},
{ "exists": { "field": "as" }}
]
}
},
"size": 1,
"terminate_after": 1
}
SELECT asn, as FROM table WHERE as IS NOT NULL AND ipAddress = ‘{%ipAddress%}’ order record by ascending TOP 1
{
"_source": ["asn", "as"],
"query": {
"bool": {
"must": [
{ "term": { "ipAddress": "{%ipAddress%}" }},
{ "exists": { "field": "as" }}
]
}
},
"size": 1,
"terminate_after": 1,
"sort": [
{
"_doc": { "order": "asc" }
}
]
}
