Sample Size Determination and Justification for Medical Device

I need input from the experts out there on determining sample sizes for assembly process validation, for both OQ and PQ. I am struggling to determine the sample size for challenging the process window study. In general, I run OQ with 3 runs: all parameters at Low, all parameters at High, and a confirmation run at Nominal. My question is: how can I determine and justify the run size and the sample size per run for variable responses? I understand that it depends on risk level, the process, etc. In general, what are the methods or guidance? FDA does not provide guidance on sample size or run size.
Similarly, I have the same question for PQ. I run 3 PQ runs at Nominal, which is standard. I still cannot determine and justify the run size or batch size of each run, or determine and justify the inspection sample size for variable responses that require a capability study.
I have no problem with attribute responses; for those I can use % Confidence Level vs. % Reliability. My problem is only with variable responses. I look forward to learning from your inputs.

There are no guidances that I recall. Assuming the distribution is Gaussian, the variability should be judged relative to the range of the specification (there is not much point in having the variability larger than the specification range). If the range is 98.0 to 102.0%, I have used a variability of 0.1%. If the range is 95.0 to 105.0%, then 0.5%.

Hi Boomer. Your statement does not make sense to me. Can you give me an example? BTW, it seems like you are a chemist. It will probably not be the same for devices.

I was trained as a chemist. The example I used above was for content uniformity. For a medical device, the manufacturing process is Gaussian; that is, it produces a widget with a constituent (which you test) that has a distribution around a mean. The specification range should contain this distribution. You use SPC (statistical process control) and a ‘control chart’ to view/see this distribution.

The visual representation of the Gaussian distribution is called a ‘histogram’.

I understand all about distributions and SPC. I need an example to calculate sample size: how many samples should I take for an X-bar and R chart so that the distribution represents the population and shows that my process is stable for that particular batch? I think this discussion is going off track from my original sample size question.

Pete Thakor

Sr. Manufacturing Engineer

D 8184286597 M 2096292557

RedMed Motor Technologies Inc. 9540 De Soto Ave, Chatsworth, CA 91311


1st: IQ/OQ/PQ applies only to the manufacturing equipment (not the process). It is called ‘commissioning and qualification’.

2nd: Sample size is determined by your risk analysis (unless it has been pre-determined for you). If you have a ‘check weigher’ on your packaging line, then every unit is tested, and any OOS unit is rejected. If you are making 100 units per year, then I would assume every unit is checked. If you are assembling 1,000 units per second, then you use statistics to determine your sampling subpopulation and assume a ‘risk’ of 0.1%, 0.01%, or 0.001%. The specifications (and tolerances) are set by the OEM (auto?) client.

We need more information like what widget, how many…

I got it. I think you don’t need to respond anymore.



Can I try to re-direct a little bit here?

The FDA does NOT require you to run OQ at high, low, and middle (nominal) during qualification. In fact, how could you possibly do this? If you have 1 factor that changes, you have 3 studies. If you have 2 factors, you have 9 studies (3^2). If you have 3 factors all tested at high/middle/low, you have 27 studies (3^3), etc. Most equipment has many, many variables that can change. So which ones do you test at high/middle/low? You cannot just do “all at high”, “all at low”, and “all at middle”, because the worst case might be variable 1 at high and variable 2 at low. So again, I repeat: FDA does not require you to do OQ at high/middle/low.
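The combinatorial growth described above is easy to confirm. Here is a minimal Python sketch (the factor names are hypothetical, just for illustration) that enumerates a 3-level full-factorial design:

```python
from itertools import product

def full_factorial(factors, levels=("low", "mid", "high")):
    """Enumerate every combination of levels across the given factors."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(levels, repeat=len(names))]

# Hypothetical factors for an assembly process
runs = full_factorial(["time", "temperature", "pressure"])
print(len(runs))  # 3 factors at 3 levels -> 3**3 = 27 runs
```

Each additional factor multiplies the run count by 3, which is exactly why exhaustive high/middle/low testing becomes impractical so quickly.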

So then the question arises: “How can one possibly define the qualified ranges of operation?” The answer lies in the FDA 2011 process validation guidance.

A search for “statistics” shows 15 results in the 15-page document. The HEAVY emphasis is on statistical understanding of your process. The guidance references having a statistician on the team multiple times.

With statistical analysis you can use a Design of Experiments (DOE) to test the full ranges of operation. This can be done in an engineering study (not qualification). I know this goes against the grain of what you are probably used to. But a DOE actually generates much more information than qualification could, generates it quicker, produces a response curve, and can be used to optimize your process to run at the best conditions.

So now to answer your original question, “how many samples are needed?” The DOE will also tell you that. If you work with the statistician and define the required criticality and accuracy needs, the statistical models will tell you how many samples are needed.

This whole approach is definitely novel, but it is one example of ICH’s Quality by Design and lifecycle approach. Again, this probably goes against what you are used to, but I promise/guarantee that the approach is faster, better, cheaper, and more accepted than a simple high/middle/low approach to validation. (By the way, I rarely promise/guarantee anything in life.)

In my experience we pushed for IQ/OQ to test that the equipment ran correctly (alarms checked out, interlocks checked out, materials of construction were correct, etc.). Then we did DOE and engineering studies to learn how our process works on the equipment. Then we finished SOPs and batch records. Then we did PQ, PPQ (PV), etc. I’ll admit we kind of made up the order of operations I just described (as I’ve not seen too many companies adopt the FDA 2011 guideline completely). But after we came up with that process, I transferred to another major pharma company (one of the top 2 in the country) for OSD, and they are doing IQ, then OQ, then notebook studies, then PQ as well.

Anyway, good luck! Feel free to follow up with disagreement or any questions you might have. I’m actually excited for you - there is such an opportunity to be cutting edge in validation practices (validation isn’t typically cutting edge).

Hi Jared. Thanks for the response. Chief, the process characterization or DOE is performed prior to OQ, as you explained above. During the DOE, for example, we come up with 3 factors that affect our responses. In theory you are correct: if I run 3 factors as a full factorial, that is 27 runs. During the DOE, we use 10 samples to measure responses. Then, during OQ for the process, we still run 3 runs at a higher sample size: all Highs, all Lows, and Nominal as a confirmation run.
I just need to know the method for determining and justifying sample size for OQ, and especially for PQ. I don’t need to know what IQ/OQ/PQ is.

Use 1/10th size for OQ. Use full scale for PQ.

Again, I would not do 3 runs during OQ, but proceed with it if you’d like.

1/10 of what?
I would like to see responses from experts in medical device manufacturing. Pharmaceutical drug validation is somewhat different from medical device validation.

@pthakor everyone is just trying to help, so please show some appreciation… this is a great community, so let’s keep the tone pleasant.

Thank you.

I am not being arrogant or offensive. I really appreciate everyone who has responded to help me out. The way I understood them, the responses I got have nothing to do with my questions.
All I need to know is the run size, sample size, and justification for OQ and PQ. I am not looking for a definition of IQ/OQ/PQ. Also, the responses I got are, I assume, from experts in the pharmaceutical industry, where process validation can differ from medical device Class I/II/III process validation.
If you think I was not respectful, I will be more careful next time.


Pete. No worries. But calling someone “Chief” is very condescending. I think that is where the defensive reaction came from.

But I’ll stick by my answer that a statistician will answer your question regarding run size, sample size, and accept/reject allowances.

Note: My interpretation is applicable to medical device manufacturing. Please see below. I am sure this may not answer everything you asked, but it helps to clarify a few things raised above.

The GHTF guidance on process validation (Global Harmonization Task Force (GHTF; medical devices), Process Validation Guidance, edition 2, January 2004), and in particular its ‘Annex A: Statistical methods and tools for process validation’, would be beneficial for you. This guidance is specific to medical device manufacturing and is recognized by FDA in the “Guidance for Industry - Process Validation: General Principles and Practices” (January 2011).
Per these references:
Manufacturing equipment is qualified and documented in IQ, while the manufacturing process is validated and documented in OQ and PQ, in general.

  1. An initial qualification of the equipment used and provision of necessary services – also known as installation qualification (IQ);
  2. A demonstration that the process will produce acceptable results and establishment of limits (worst case) of the process parameters – also known as operational qualification (OQ); and
  3. Establishment of long-term process stability – also known as performance qualification (PQ).

I also recommend reading the article “Selecting Statistically Valid Sampling Plans” by the industry guru Dr. Wayne Taylor.

You seem very frustrated with the calculations - I feel you. You are a senior engineer so I won’t preach more, but dude, calm down and show some love to the brothers of this community.

There is no clear guidance for determining sample size (wouldn’t that be great?). But FDA is smart, and they don’t want to get locked into a single number.
Here is some explanation. If you can afford to produce a large number of units, I recommend the following calculation:

n = Z_{α/2}² × σ² / E²

Where:
n = required sample size;
Z_{α/2} = standard normal value for the confidence level (1.96 for 95% confidence);
σ² = variance (since this is a Pass/Fail test with no historical data, use the worst-case proportion variance p(1 − p) = 0.5 × 0.5 = 0.25);
E = maximum error of the estimate (a small value is used here: 0.05).

Using these values, n = 1.96² × 0.25 / 0.05² ≈ 384 samples.
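As a quick sanity check on that arithmetic, here is a minimal Python sketch of the margin-of-error formula (the 0.5 worst-case proportion is the assumption stated above, not a universal rule):

```python
from statistics import NormalDist

def margin_of_error_n(confidence: float, p: float, e: float) -> float:
    """Sample size from the survey margin-of-error formula n = z^2 * p(1-p) / E^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z, e.g. 1.96 for 95%
    return z ** 2 * p * (1 - p) / e ** 2

n = margin_of_error_n(confidence=0.95, p=0.5, e=0.05)
print(round(n))  # ~384; in practice round up to the next whole unit
```

Note that if you plug in a continuous-data standard deviation of σ = 1 instead of the 0.25 proportion variance, the same formula gives a much larger n, which is why the variance assumption matters.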

Now, if you have historical data, generate an OC curve, which will give you the exact number of samples you need to pull to VALIDATE your PROCESS. (You know the difference between validation and qualification, so I won’t explain that.)

By the way, where did you get this three (3) OQ and three (3) PQ run convention from?


Hello Swagat. Thanks for the response. I understand FDA does not have guidance. The formula you showed above is the margin-of-error formula for a sampling poll or survey. It is not exactly applicable to process validation; I have been advised by an FDA auditor a couple of times not to use margin of error for process validation.
Now, for OQ, you need to challenge the process window limits at the High and Low levels. For example, if I have time and temperature as input parameters affecting my response, I need to make sure my process is robust at least at the all-High limits and all-Low limits of time and temperature (we either accept the risk of not testing interactions, or we have shown prior to OQ that the interactions do not affect the response). The third run is at Nominal as a confirmation run, to improve the probability of PQ success.
For PQ, I prefer to perform a minimum of 3 runs. If I run PQ as only 1 run, a pass or fail could be a fluke. If I run 2 PQ runs and one passes while the other fails, you are getting nowhere. With a minimum of 3 PQ runs you can at least draw some conclusion (two pass or two fail out of 3), and it is easier to perform root cause analysis on a failure. Again, there is no guidance saying you must have 3 runs.

Interesting! Most probably they did not like that formula because it accounts for Type I error only. Good to know.

I read a very interesting article and took some portions of it for my own understanding. I hope this may help you:

For continuous data one can use tolerance intervals - also known as the K-value method. Under this approach, you determine the confidence level and reliability, then use the ISO 16269-6 standard for tolerance intervals to find the value of K, as shown in the formulas below:
K Value Formula:
X̄ + k × S < U
X̄ − k × S > L
E.g., suppose a 95 percent confidence and 95 percent reliability are required for a process with a mean of 10.0 and a standard deviation of 0.55, with a specification of 8.5 to 12.0.
The mean, 10.0, is closest to the lower specification limit, so we solve for k:
10 − k × 0.55 > 8.5, so k = 2.727
Going to ISO 16269-6, consult the table in the document’s Appendix D, looking for k = 2.727 at 95 percent confidence and 95 percent reliability. The ISO-recommended sample size in the table for this example is 20 to 22, so 21 samples can be selected. If all 21 units are between 8.5 and 12.0 when tested, then we can say with 95 percent confidence that 95 percent of the individual values will be within 8.5 and 12.0.
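If the ISO table is not at hand, the two-sided normal tolerance factor can be approximated numerically. The sketch below (Python, standard library only; it uses Howe's approximation for the tolerance factor and the Wilson-Hilferty approximation for the chi-square quantile, so the values are close to but not identical to the table) searches for the smallest n whose factor drops to the required k = 2.727:

```python
from statistics import NormalDist

def chi2_lower_quantile(p: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile at probability p."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def two_sided_k(n: int, confidence: float = 0.95, reliability: float = 0.95) -> float:
    """Howe's approximation to the two-sided normal tolerance factor for n samples."""
    z_rel = NormalDist().inv_cdf((1 + reliability) / 2)
    chi2 = chi2_lower_quantile(1 - confidence, n - 1)
    return z_rel * ((n - 1) * (1 + 1 / n) / chi2) ** 0.5

# Smallest n whose 95/95 two-sided tolerance factor is at or below the required k = 2.727
n = next(n for n in range(2, 200) if two_sided_k(n) <= 2.727)
print(n)  # 21, consistent with the 20-to-22 range quoted from the ISO table
```

This reproduces the worked example: 21 samples for a required k of 2.727 at 95/95.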
Another way to determine sample size for continuous data is via the standard hypothesis-testing formula n = (Zα + Zβ)² × S² / E² (a variant of what I posted above), where Zα represents the Type I error, Zβ the Type II error, S the standard deviation, and E the meaningful difference between the true population mean and what is observed in the sample. With the E value, the goal is to get as close as possible to the true population mean, so a small E is desirable.
For example, assume a Type I error rate of 5 percent, meaning a 95 percent confidence level (Zα = 1.96); a Type II error rate of 1 percent, for a 99 percent reliability figure (Zβ = 2.326); an unknown standard deviation and unknown E, but a requirement to detect a 0.5 standard deviation shift (so S = 1 and E = 0.5). It is assumed that this is a two-sided specification. Using the formula, n = (1.96 + 2.326)² × 1² / 0.5², so the sample size n is 74.
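A few lines of Python confirm the arithmetic of that worked example:

```python
import math

def shift_detection_n(z_alpha: float, z_beta: float, s: float, e: float) -> int:
    """Sample size via n = (z_alpha + z_beta)^2 * s^2 / e^2, rounded up to a whole unit."""
    return math.ceil((z_alpha + z_beta) ** 2 * s ** 2 / e ** 2)

# 95% confidence (z = 1.96), 99% reliability (z = 2.326), S = 1, detect a 0.5-sigma shift
print(shift_detection_n(1.96, 2.326, 1.0, 0.5))  # 74
```

Halving E quadruples n, which is why the choice of "meaningful difference" dominates the sample-size budget.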

I understand running OQ at the boundary limits, but you don’t need three different runs for that. Also, the days of “1 run is by chance and 3 runs is validation” are long gone (except for autoclaves, sigh!), but again, if you can afford it, then sure!