
Transient EMR cluster or Use an EMR cluster?


#1

A clinical trial will rely on medical sensors to remotely assess patient health. Each physician who participates in the trial requires visual reports each morning. The reports are built from aggregations of all the sensor data taken each minute.
What is the most cost-effective solution for creating this visualization each day?

A. Use Spark streaming on EMR to aggregate the patient sensor data every 15 minutes and generate a QuickSight visualization on the new data each morning for the physician to review.
B. Use Kinesis Aggregators Library to generate reports for reviewing the patient sensor data and generate a QuickSight visualization on the new data each morning for the physician to review.
C. Use a transient EMR cluster that shuts down after use to aggregate the patient sensor data each night and generate a QuickSight visualization on the new data each morning for the physician to review.
D. Use an EMR cluster to aggregate the patient sensor data each night and provide Zeppelin notebooks that look at the new data residing on the cluster each morning for the physician to review.


#2

I think it should be C, because of “cost-effective”. A and D both keep an EMR cluster running without shutting it down, so they do not save much on cost. For B, the Kinesis Aggregators Library (https://github.com/awslabs/amazon-kinesis-aggregators) can run on Elastic Beanstalk (EB), on EC2, or inside an Amazon Kinesis-enabled application, but I don’t see whether it can run on Lambda. Plus, Lambda now has a 15-minute timeout; I am not sure the aggregator could finish the job in time.
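To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The hourly rate, the node count, and the one-hour job runtime are all made-up assumptions for illustration, not AWS prices or anything stated in the question.

```python
# All numbers below are illustrative assumptions, not real AWS prices.
HOURLY_RATE_PER_NODE = 0.25  # hypothetical combined EC2 + EMR rate in USD
NODES = 3                    # hypothetical cluster size
NIGHTLY_JOB_HOURS = 1        # assumed runtime of the nightly aggregation


def daily_cluster_cost(hours_running: float) -> float:
    """Cost of keeping the whole cluster up for the given hours in one day."""
    return HOURLY_RATE_PER_NODE * NODES * hours_running


# Options A and D keep the cluster up around the clock.
always_on = daily_cluster_cost(24)
# Option C only pays while the nightly job is running.
transient = daily_cluster_cost(NIGHTLY_JOB_HOURS)

print(f"always-on: ${always_on:.2f}/day, transient: ${transient:.2f}/day")
```

Whatever the real rates are, the ratio is what matters: a cluster that runs only for the nightly job is billed for a fraction of the hours.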


#3

Agreed with C, since the EMR cluster will be shut down when the job completes.
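For anyone wondering what "transient" looks like in practice, below is a minimal sketch of the request one might pass to boto3's `run_job_flow`. The key setting is `KeepJobFlowAliveWhenNoSteps`, which tells EMR to terminate the cluster once all steps have finished. The cluster name, release label, instance types, S3 paths, and the use of the default EMR roles are assumptions for illustration only.

```python
def transient_cluster_request(script_s3_uri: str, log_s3_uri: str) -> dict:
    """Build run_job_flow parameters for a cluster that terminates after its steps."""
    return {
        "Name": "nightly-sensor-aggregation",  # hypothetical name
        "ReleaseLabel": "emr-6.15.0",          # assumed release
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_s3_uri,
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # This is what makes the cluster transient: no idle cluster
            # is left running after the last step completes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "aggregate-sensor-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(params: dict) -> str:
    """Submit the request; needs boto3 and AWS credentials, so not run here."""
    import boto3
    return boto3.client("emr").run_job_flow(**params)["JobFlowId"]
```

A nightly scheduler would call `launch(transient_cluster_request(...))` with the real script and log locations; the aggregation script itself is not shown.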


#4

I also believe it is C, because the question clearly targets the most cost-effective solution. It doesn’t make sense to keep an entire cluster running if I just need it to apply aggregations and then dump the results into a visualization report on QuickSight. However, how will EMR process the patient data? It requires a prior collection process that has to be set up somehow. The text doesn’t specify that this data is already on S3; if it is, maybe my cluster can read it through EMRFS.

Therefore, I would say that a collection process should be set up for that purpose, which then transforms the payload and separates the processed data into daily folders within a bucket. For visualization, QuickSight could then map this daily repository and aggregate the information into bar charts or some other form.
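As a rough sketch of the daily-folder idea, the snippet below builds a date-based key prefix and rolls per-minute readings up into one summary per patient per day. The `sensor-data/` prefix, the `(patient_id, timestamp, value)` record shape, and the chosen statistics are all assumptions for illustration; the question does not say how the data is laid out.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_prefix(ts: datetime) -> str:
    """Return a hypothetical S3 key prefix, one 'folder' per calendar day."""
    return f"sensor-data/{ts:%Y/%m/%d}/"


def aggregate_by_patient_and_day(readings):
    """Summarize (patient_id, timestamp, value) readings per patient per day."""
    buckets = defaultdict(list)
    for patient_id, ts, value in readings:
        buckets[(patient_id, ts.date())].append(value)
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for key, values in buckets.items()
    }
```

On a real cluster the same grouping would more likely be written as a Spark job reading one day's prefix; plain Python is used here only to keep the example self-contained.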

Why not the others:
A: Even though I could keep an EMR cluster running constantly, this is more expensive than the Kinesis service, and what Spark would do I can easily do with my consumer library before dumping into S3.
C: This one would have been my first thought, but then how is the data collected every minute?
D: Personally, I don’t believe Zeppelin is a good option for presenting data to a physician.

Thanks so much for the feedback. I’m thinking of taking the exam in early December.


#6

“Cost-effective” is an important keyword in the question. Option C looks more suitable because the “transient” EMR cluster is used for the data engineering, after which it could push the data back to S3, and the QuickSight (serverless) visualization would then be created on top of the data stored in S3. The question needs a bit more context, I believe.

None of the options on its own satisfies the use case in the question unless we make some assumptions of our own.