05 / Data Sourcing
How do we source the data?
User Input
Electronic Medical Records
Research Database
06 / Data Processing
How do we process the data?
This is the step-by-step process I'd take:
Data cleaning: handle missing values, remove duplicates, correct inaccuracies (see the sketch after this list)
Create baseline: normalize + scale data so features share a consistent range
Data encoding: turn qualitative data into quantitative values
Handle outliers
Feature engineering: create features so the AI can start to recognize patterns
Collect time periods for certain interactions (e.g., treatment duration)
Text data processing for NLP (e.g., community discussions, medical literature)
Diversify + merge multiple data sets
Privacy: ensure all data is de-identified and handled according to privacy requirements
Account for underrepresented groups in the data sets (e.g., certain locations, HHI, etc.)
Prioritize data sets that are relevant to the features; deprioritize irrelevant data sets
Data splitting: split data into 3 categories: training, validation + test sets
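A minimal pandas/scikit-learn sketch of the cleaning, encoding, outlier, and scaling steps above. The table and column names (age, treatment_type, treatment_duration_days) are hypothetical placeholders, not the actual data set:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy patient-interaction table; column names are illustrative only.
df = pd.DataFrame({
    "age": [34, 34, None, 51, 29],
    "treatment_type": ["therapy", "therapy", "medication", None, "therapy"],
    "treatment_duration_days": [30, 30, 90, 45, 400],
})

# Data cleaning: remove duplicates, impute or drop missing values.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())   # impute numeric gaps
df = df.dropna(subset=["treatment_type"])          # require key fields

# Data encoding: turn qualitative data into quantitative columns (one-hot).
df = pd.get_dummies(df, columns=["treatment_type"])

# Handle outliers: clip numeric values to the 1st-99th percentile range.
low, high = df["treatment_duration_days"].quantile([0.01, 0.99])
df["treatment_duration_days"] = df["treatment_duration_days"].clip(low, high)

# Baseline: normalize + scale numeric features onto a common scale.
numeric_cols = ["age", "treatment_duration_days"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```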
07 / Data Sets
How do we ensure the data sets are accurate?
1. Training Set: Train the AI; uses 80% of data set
Historical user data: past interactions, treatment outcomes, community engagement
Anonymized data from research databases + clinical studies
Features representing: user profiles, medical history + preferences
2. Validation Set: Validate + fine-tune; uses 10-15% of data set
This would help with hyperparameter tuning
3. Testing Set: Evaluate the final performance; uses 10-15% of data (see the split sketch after this list)
Represents a completely unseen dataset for the model
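One way to realize the 80% / 10-15% / 10-15% split described above (here 80/10/10) with scikit-learn; the feature matrix and labels are placeholders standing in for the processed data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and labels (assumed already assembled).
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First carve off the 80% training set...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# ...then split the remaining 20% evenly into validation and test
# sets, giving the 80/10/10 ratio. Stratifying keeps class
# proportions consistent across all three sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)
```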
08 / Privacy Concerns
What are the privacy and data concerns? How will we resolve or combat?
This is the process I'd take for each privacy + data concern:
Concern: Difficulty in tracking and responding to security incidents.
Mitigation: Implement audit trails to log user activities and monitor the system for unusual behavior, enabling rapid response to security incidents.

Concern: Violation of data protection regulations such as GDPR, HIPAA, or other local laws.
Mitigation: Ensure strict adherence to relevant data protection laws and obtain informed consent from users regarding data usage and storage practices.

Concern: Preserving user privacy by preventing the identification of individuals.
Mitigation: Anonymize personally identifiable information (PII) in the dataset and ensure that aggregated results cannot be traced back to individual users (see the sketch after this list).

Concern: Protecting data in transit and at rest to prevent unauthorized access.
Mitigation: Implement strong encryption protocols for communication between users and the AI platform, as well as for storing data.

Concern: Lack of transparency about data usage and AI model purposes.
Mitigation: Obtain explicit informed consent from users regarding the use of their data for AI model training, research, and improvement purposes.

Concern: Unidentified vulnerabilities in the system.
Mitigation: Conduct regular security audits and penetration testing to identify and address potential vulnerabilities in the AI software.

Concern: Unauthorized access to stored data.
Mitigation: Use secure, compliant data storage solutions with access controls, regular security audits, and compliance checks.

Concern: Intercepting sensitive information during communication.
Mitigation: Ensure secure communication channels using encryption protocols, especially during virtual consultations and data exchanges.

Concern: Lack of clarity regarding how user data is used and shared.
Mitigation: Clearly communicate privacy policies to users, detailing the purpose of data collection, storage, and usage. Provide users with options to control their data preferences.

Concern: Unauthorized access to user profiles and medical information.
Mitigation: Employ robust user authentication mechanisms (e.g., multi-factor authentication) and ensure strict authorization controls to limit access based on user roles and permissions.

Concern: Lock-in of user data without options for portability.
Mitigation: Provide users with the ability to export their data, promoting data portability and transparency.
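As one illustration of the anonymization mitigation, a minimal sketch that pseudonymizes direct identifiers with salted one-way hashes so records can still be linked internally without exposing PII. The field names and the specific approach are assumptions; a production system would also need key management and checks on quasi-identifiers (e.g., k-anonymity):

```python
import hashlib
import os

# Secret salt; must stay stable across the pipeline run and be stored
# securely, separate from the data.
SALT = os.urandom(16)

def pseudonymize(value: str) -> str:
    """Return a one-way salted hash of a PII value (e.g., an email)."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

# Hypothetical record: the direct identifier is replaced, while
# non-identifying fields are kept for modeling.
record = {"email": "user@example.com", "age": 42}
record["email"] = pseudonymize(record["email"])
```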