Overview
The Data Lake in eZintegrations is a search-engine-based NoSQL database designed to store and process massive volumes of structured and unstructured data for analytics, storage, machine learning, and deep learning.
A Data Source in eZintegrations acts as a connection pool that retrieves data from the Data Lake and delivers it in JSON format to the integration pipeline.
Responses from the Data Lake source are stored under the bizdata_dataset_response key for further processing.
When to Use
Use Data Lake as a Source when large-scale analytical or operational data needs to be retrieved and processed within an Integration Bridge. Typical uses include:
- Extracting analytical datasets
- Processing historical records
- Streaming operational data
- Supporting reporting workflows
- Feeding machine learning pipelines
How It Works
The Data Lake Source retrieves records using JSON-based queries.
Data is streamed in chunks according to the configured Size and pagination settings.
Retrieved records are stored in the bizdata_dataset_response key and passed to downstream operations and targets.
When using Single Line to Multiline Operations, the Chop key must be set to:
["bizdata_dataset_response"]
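Downstream operations then see the retrieved records wrapped under that key. As an illustration (the exact envelope is an assumption; the field names and values are hypothetical, borrowed from the query examples below), a streamed batch might look like:

```json
{
  "bizdata_dataset_response": [
    { "employee_id": 130, "employee_name": "Jane Doe" },
    { "employee_id": 131, "employee_name": "John Roe" }
  ]
}
```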
Data Lake Source Parameters
Data Lake Version
Specifies the Data Lake name and version assigned to the organization.
Index / Table Name
Defines the index or table from which data is retrieved.
Available indices and tables can be found in the Datalake section of the Visualization product.
Pagination Wait Time
Controls how long the system waits for the next page of data.
- Default: 2m (2 minutes)
- Supports: m (minutes), h (hours), s (seconds)
- Increase for large responses or high network congestion
Timeout
Defines the maximum wait time for receiving a response.
- Default: 2m
- Increase when the Data Lake responds slowly
- May be required for small cluster sizes
Size
Controls the number of records streamed per batch.
- Default: 1000
- Maximum: 10000
- Recommended: 1000 for optimal performance
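For capacity planning, it can help to estimate how many batches a run will stream at a given Size setting. A minimal sketch (assuming the source streams ceil(total / Size) batches, which is an inference from the settings above rather than documented behavior):

```python
import math

def page_count(total_records: int, size: int = 1000) -> int:
    """Estimated number of batches streamed at a given Size setting."""
    return math.ceil(total_records / size)

print(page_count(25_000))          # 25 batches at the default Size of 1000
print(page_count(25_001, 10_000))  # 3 batches at the maximum Size of 10000
```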
Query
Defines the JSON-based query used to retrieve records from the Data Lake.
Query Examples
Get All Records
{
  "query": {
    "match_all": {}
  }
}
Get Specific Columns
{
  "_source": ["store_number", "customer_number"],
  "query": {
    "match_all": {}
  }
}
Filter by Field Value
{
  "query": {
    "match": {
      "employee_id": 130
    }
  },
  "_source": {
    "includes": ["employee_id", "employee_name"]
  }
}
Filter with Multiple Conditions
{
  "size": 50,
  "sort": [{}],
  "_source": ["Project", "title", "Assigned To", "Priority", "Created By", "createdDateTime", "dueDateTime"],
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "*" }},
        { "query_string": { "query": "Project:\"Project ABC\" AND Priority:[* TO *] AND NOT percentComplete:100" }},
        { "bool": { "should": [] }}
      ],
      "must_not": []
    }
  }
}
Key Names with Spaces
{
  "size": 1000,
  "sort": [{}],
  "_source": ["ThreadId", "Ticket Created At"],
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "*" }},
        { "query_string": { "query": "NOT Status:\"Closed\" AND Thread\\ Type: \"create\"" }},
        { "bool": { "should": [] }}
      ],
      "must_not": []
    }
  }
}
Check for NULL Values
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "asn"
        }
      }
    }
  }
}
Dynamic Filter Using Sprintf
{
  "query": {
    "bool": {
      "must": [
        { "term": { "ipAddress": "{%ipAddress%}" }}
      ],
      "must_not": [
        { "exists": { "field": "asn" }}
      ]
    }
  }
}
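The {%ipAddress%} marker above is a Sprintf placeholder that eZintegrations resolves at runtime. As a rough local illustration of that substitution (a sketch only; resolve_placeholders is a hypothetical helper, not part of the product):

```python
import json

# Query template from the example above, with a Sprintf-style placeholder.
TEMPLATE = """
{
  "query": {
    "bool": {
      "must": [
        { "term": { "ipAddress": "{%ipAddress%}" }}
      ],
      "must_not": [
        { "exists": { "field": "asn" }}
      ]
    }
  }
}
"""

def resolve_placeholders(template: str, values: dict) -> dict:
    """Substitute each {%key%} marker, then parse the result as JSON."""
    for key, value in values.items():
        template = template.replace("{%" + key + "%}", str(value))
    return json.loads(template)

query = resolve_placeholders(TEMPLATE, {"ipAddress": "10.0.0.7"})
print(query["query"]["bool"]["must"][0]["term"]["ipAddress"])  # 10.0.0.7
```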
Limit and Sort Results
{
  "_source": ["asn", "as"],
  "query": {
    "bool": {
      "must": [
        { "term": { "ipAddress": "{%ipAddress%}" }},
        { "exists": { "field": "as" }}
      ]
    }
  },
  "size": 1,
  "terminate_after": 1,
  "sort": [
    { "_doc": { "order": "asc" } }
  ]
}
Frequently Asked Questions
What is Data Lake Source in eZintegrations?
It is a source connector that retrieves structured and unstructured data from the Goldfinch Analytics Data Lake.
Where is the response stored?
All retrieved data is stored under the bizdata_dataset_response key.
What is the recommended batch size?
The recommended size is 1000 records for balanced performance and reliability.
Can I use dynamic values in queries?
Yes. Dynamic values can be passed using Sprintf placeholders.
When should I increase timeout and pagination time?
Increase these values when working with large datasets, slow networks, or small cluster sizes.
Notes
- Always validate queries before production deployment.
- Use selective filters to avoid loading unnecessary data.
- Optimize size and pagination for performance.
- Monitor cluster capacity for large workloads.
- Maintain consistent query structures across integrations.
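The first note above can be partly automated with a quick local sanity check before deployment. A minimal sketch (validate_query is a hypothetical helper; it only checks JSON syntax and the presence of a top-level "query" key, not Data Lake semantics):

```python
import json

def validate_query(raw: str) -> dict:
    """Parse a query body and confirm it carries a top-level "query" key."""
    body = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if "query" not in body:
        raise ValueError('missing top-level "query" key')
    return body

# A well-formed match_all query passes the check.
validate_query('{ "query": { "match_all": {} } }')

# Malformed JSON (trailing comma) is rejected before deployment.
try:
    validate_query('{ "query": { "match_all": {} }, }')
except json.JSONDecodeError:
    print("rejected malformed query")
```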