This page covers the backlog for openIDL.
|1||18-JAN-21||Extraction pattern - tech|
What is the tech for the extraction pattern? map/reduce, optimized for scale, GraphQL or others?
- The extraction pattern model currently uses a map reduce function in MongoDB. This locks us into MongoDB and uses a closed environment without access to the outside world we’ll need for correlating other data like census. The extraction capability must be reimagined.
- Suitable for simple data source layout. Can't do cross lookup, validation or reference data checks
- Currently works for mongoDB only. Won't work for other sources, example AAIS' data lake (Hadoop/Cloudera)
- GraphQL seems like a strong candidate. POC is required to validate the hypothesis
- Discuss any other candidates that should be considered
|2||18-JAN-21||How to assert data integrity|
How to assert data integrity?
- Do we expect same answer if the question is asked more than once? Within the parameters of use case, scope, time (duration) etc. For example: total premium for all Auto policies from 1/1/2021 to 12/31/2021 for purposes of stat reporting
- Do we expect a different answer if the question is asked more than once? Within the parameters of use case, scope, time (duration) etc. For example: new data that has been committed to a past period
- HLF response: see https://hyperledger-fabric.readthedocs.io/en/v0.6/API/CoreAPI.html#chaincode
- Solution option: A checksum after record is locked & written to the chain, store the acknowledgement (chaincode hash) from the HLF to a control DB and map it to a record set
|3||18-JAN-21||How to assert data quality|
How to run technical and business validation on data and certify the data?
- Technical rules: may include JSON schema validation, format, cardinality check
- Business rules: enum check, field-to-transaction-to-record-set-to-dataset validations, reference data
- Error threshold: calculate the error threshold. NAIC allows for up to 5% error rate
- Timing: when should validation be applied? As the data arrives into the HDS (staging→core), or just before extraction (time pressure, lost opportunity with time that could be used to rectify errors), timeouts on the extraction API etc.
|4||18-JAN-21||Common Rule Set|
Is it possible to provide a common set of rules that can be used by all carriers against their data before making it available to the extraction?
- Assumption: rules are standardized across openIDL members for a given use case. Example, rules related to Auto stat reporting
|5||24-JAN-21||Is the HDS data source permanent or temporary?|
When the question (extraction request) is asked, the HDS (interfacing API) is expected to be available to respond.
- HDS API is required to be available when the request is due. For example, end of a quarter or end of Feb for an annual stat report
- The type of HDS stack is driven by the use case. For example, stat reports are due annually, AIPSO reports are due quarterly etc.
- There is no requirement for the HDS API to be available outside of scheduled activity. It is optional for members to have the HDS API available outside of scheduled activity
- Consideration is to be given for the HDS to be available for internal systems to write to the HDS; or allow write to a scheduled batch event (no daily or more frequent writes)
- There is delineation between the HDS data source and HDS API. The HDS data source & HDS API can be decoupled. Storage costs can be optimized by holding data in low cost storage (like AWS Glacier), and shift it to immediate access storage (like AWS S3) for the HDS API to be able to query and respond in reasonable time
|6||18-JAN-21||How do we support calculating the error rate?|
Reporting practice allows for an error rate of up to 5%.
Options to consider are
- to process the entire (large to very large) dataset in-memory
- break up (shard) the data set and apply validation at a more manageable layer (see data set hierarchy below) and write to a control DB
For example, for annual stat reporting, the data set is presented in this hierarchy (illustrative)
- Time period
- Entities and sub-entities
- Months/Duration and/or ZIP codes
- Policy and Loss group
- Policy and Loss sub-group
- Records making up the sub-group
|7||18-JAN-21||Reference data validation|
Where to host reference data service? Within member's enterprise or within node?
When to apply reference data validation:
- at rest: when the data is at rest in the HDS. This would mean that the data is washed and prepared for extraction. Recommended
- at extraction: when the data is being queried. Not recommended
- after extraction: at the Analytics node. Not recommended
Placement of reference data validation may be different depending on the scenario.
Reference data implementation options:
- lookup tables with list of values/enums
- APIs. For example, USPS ZIP code validation API
- Should APIs be looked up at runtime or in the background (call an API and localize the reference data)
|8||18-JAN-21||Reference data lookup services/APIs|
Which APIs to look up? For example, USPS state/zip validation, Carfax etc.
We can be opportunistic with this and certainly allows for MVP, tactical and strategic solutions
|9||18-JAN-21||Reference data lookup services/APIs - pricing model||Who pays (assumption - whoever owns the data pays), and how to charge the consumers (via assigned accounts, via centralized billing account prorated to consumption etc.). Who signs the vendor contracts|
|10||18-JAN-21||Separating the Hyperledger Fabric Network from the data access||Can a carrier participate in the network from a hosted node without putting the data there? That is, can we give a carrier access to the network without them having the data access portion hosted in the same node. The HLF runtimes are not required to run in the carrier, and only a simple api is made available for extraction.|
|11||18-JAN-21||Simplify the technical footprint||Can we simplify the architecture so that there are not so many technologies required?|
|12||18-JAN-21||Hosted nodes||Should we consider hosted nodes for the HLF network instead of requiring all carriers who desire data privacy to host the network?|
|13||19-JAN-21||HDS vs Interface Spec for nodes||Do we need HDS to be uniform or only have uniform interface spec and communicate via service?|