Background: According to the industrialization of countries and development of societies, tests and methods are required to employ people in industries and organizations and make the best selection in getting workforce. Interviews, Written tests , and multiple choice tests are common methods used in employing people. Among these methods , multiple choice tests is the easiest one because of the facility of correction and getting results. In this study ,general and technical tests which were held in 1389 and 1390 in Mashhad University of Medical Sciences were studied based on Item Response Theory(IRT) for determining the quality of multiple choice tests and the effective factors in their results.
Methods: In this applied study ,the participants included male and female general and special medicine students in the fields of emergency medicine and euro logy of Mashhad University of Medical Sciences. As in educational assessment sampling is not allowed therefore all available data was judged. The study was conducted on 251 answer sheets of 6 tests (2 tests of atagers,1 of internship ,and 3 tests of residency) based on IRT.
The comparison of the difficulty level ,diagnosis capability, and line slope of each question based on IRT showed that in the first test which was related to residency questions number 1,5,and 27 had better omitted because their line slope was out of the interval of 0.5 to 205.Questions of 3 and 26 because of being bigger than the degree of freedom had meaningful qui squares and should be omitted from the test. Questions number 14 and 9 should be omitted their scores ignored because of low .
Conclusion: After the determination of the three parameters of slope 12ag' type="#_x0000_t75"> ,difficulty 12bg' type="#_x0000_t75">,and success by chance 12cg' type="#_x0000_t75">,for each of the 284 items ,and omitting the incorrect irems,ᶿ ability of the participants was determined,and therefore valid items were identified and differentiated in this way:out of 251 studied questions based on line slope ,item difficulty, and guess in the six tests it is considered that 7 questions with line slope of out of the interval of 0.5 to 2.5,10 questions were bigger than three times of degree of freedom and had meaningful qui square,9 questions with guess parameter of higher than 0.25 were invalid and should be omitted. In a 31 item test which was related to residency 7 items were omitted because of being invalid ,therefore it is worth to plan such questions by which examinees are assessed properly and the intervention of other factors in getting their real score is prevented.
One of the important goals of education is the determination of capabilities, aptitudes, and limitations of members of the society .So that chances of growth and improvement are prepared by social ,economical ,and professional leadership. In this way accurate tools and methods of testing and assessment are necessary and no educational system can reach its goals without applying them(1).
Different kinds of tests are used in a wide variety in schools ,psychotherapy centers ,industry, army, and governmental organizations for giving advice, identification of mental issues, making a choice, leadership and job selection. psycho -educational tests are generally used in assessing individual differences in the decision making process .Test, measure individual differences from the point of talents and personality traits(2).
Because of great improvements in the case of individual identification in behavioral science, different theories have been invented in the field of tests and testing.Normally psychotherapists use mathematic models for planning the items and analyzing the scores in order to express their theories.A mathematic model includes some theories about the data which explains the specific relation among observable constructions and unobservable constructions of the model(3).
If we consider a multiple choice test as a tool for assessing a kind of trait or specification of a person ,the most basic question about the test will be that what traits or specifications the mentioned test measures?and how well does it measure that trait or specification?(4)
The Classic Model
The Spearman model is based on this theory that tells us that each score can be considered as a combination of two additive components , which are the true score and random error score .In other words, by taking a test of a person or a group of people ,some scores are achieved which are although favored by the tester to consider them as the real measurement of the trait or capability but because of some factors this score does not express the real amount of that trait or the level of the person b itself and it is just an observed score and nothing much.(5)
In the classic theory of the test the difference in standard variables such as item difficulty (the ratio of correct answers) cannot be used for the evaluation of these side-takings.Sometimes the criteria of determination of the item (the difference of the relation of correct answers of criteria groups groups to test items) can not delete these side takings.These criteria can not find the side taking of in group difference in a specification which is measured by the test. (6)
Accessibility to computer which started in 1960s ,provided the invention of the testing theory of latent trait and adaptive assessment with computer in late 1970s and early 1980. Latent trait is an unobservable trait which determines a specific collection of stability and coordination among individuals along with the differences between them all at once.(7)As computer was used in analyzing the data of psychotherapy ,the theory of which has been invented previously but could not be used was used.
The practicality of latent trait theory caused a lot of changes in the performance of psychotherapy tests which affected all the job which was done in 60 years in classic theory (8).
Psychotherapists and testing experts turned in to such theories with more interest and gradually scientific texts and computer software spread among them for the purpose of psychotherapy. It is not the matter of advantages or disadvantages of new or classic theories anymore and studies are conducted in the case of choosing the appropriate model, selection of faster and more accurate methods for calculating the parameters of the models, performing stronger tests for the determination of the appropriateness of data models.(6)
The new theories of psychotherapy first was strongly connected to latent trait
In a way that can be seen a lot in the review of literature and background of new theories but nowadays IRT with the theory of specific slope has become more popular and seems more appropriate for testing and data analysis.(9)
Item-Response Theory and Specific Slope of the Item
The Item-Response theory is usually shown by IRT.(10)
Tests in which the items are harmonious from the point of content, it is logical to consider a united dimension of trait which is the fundamental function in all items of the test. This trait dimension which does not necessarily have to be psychologically simple, is statistically considered a united structure which acts as a determining factor of success in all items of the test and based on the imagination of the trait dimension which is a fundamental specification with which the test is measured.ICC or specific slope can be examined.(7)
Specific slope of the item is a function which relates success probability of the triable place the item to the testable position in the measured fundamental trait . (Torkashvand)
Two terms of specific slope and IRT are used interchangeably a lot.
Specific slope of the item is a slope which shows the possibility of giving a correct answer to an item(
12pg' type="#_x0000_t75"> ) as a function of different trait levels(ᶿ) which leads to success in answering the item.IRT considers both the role of the items of the test and answers to them.(5)
In this applied study ,the participants included male and female students of general and specialized medicine. Specialty majors included emergency medicine and euro logy in Mashhad University of Medical Sciences. As sampling is not allowed in educational assessment ,therefore all available data was measured.
In this study,according to the sensitivity of the university in some cases and
Lack of archives of the multiple choice questions of the previous years ,just the questions of two fields of emergency medicine and euro logy of 2010 and 2011 were accessible and it was not possible to get access to other
questions. Thus the questionnaires , answer keys ,and answer sheets were received and the study was done on the residents of the two mentioned specialties. Finally this study was conducted on 251 answer sheets collected from taking 6 tests (2 tests of stagership,1 test of internship, and 3 tests of residency)by using IRT model.
For each file a separate table was planned in which the first row included the number of items,the second row the answer key ,and the following rows related to the answers students had given.Data of each table was analyzed.
For data analysis IRT model was used as follows.
1_What is slope line of each item?
2_Ehat is the item difficulty?
3_What is the probability of choosing the correct answer by chance?
4_How much is the capability (ᶿ) of students?
5_What are the valid items in each test?
The study was conducted on 251 answer sheets related to 6 tests of Mashhad University of Medical Sciences based on IRT model.The results are as following.
Before data analysis ,it is necessary to show the type of the test,number of examinees,and number of items of each test in table no.1,then the tertiary parameters of each test along with students' capabilities(ᶿ) are determined,and finally the progress score of the students based on the standard score ,average of 80,and standard deviation of 10 are given.
Table 1-Data of conducting 6 tests on 251 students of Mashhad University of Medical Sciences
Number of examineesا
Number of females
Number of males
Number of questionsا
Success in an individual item ,even more than success in the whole test, depends on the examinee's status of latent trait ,systematic, and various random factors. For a multiple choice item to be acceptable ,it is necessary that the success chance of the examinee ,in the mentioned item increases continuously along with the promotion of his rank in latent trait .
A function which shows the success probability in the item is called
Latent trait ,or awareness function ,or item specific slope .(11)
This can be described by three parameters:
1_The parameter which shows the bend slope ,it means when the relation of change in success probability moves toward above latent trait. This parameter is shown through the sign of a ,and its usual amount is from 0.5 to 2.5.(12)
2_The parameter that shows specific slope of the item is placed where in
latent trait ,which is the parameter which expresses the item difficulty and is shown by b. Its usual amount is from _2.5 to 2.5,item difficulty for amounts less than -2.5 is considered very easy, and for amounts more than 2.5 very difficult.(12)
3_The parameter which shows the base line curve for very low levels of the, it means that the level of success by chance for multiple choice questions is expressed with C and in multiple choice items its usual amount is 0.25.(12)
The tertiary parameters of the first residency test were calculated and their amount are shown in table 2 along with the number of examinees, amount of qui square, and degree of freedom. .
Table 2-The Determination of Parameters of the First Residency Test(n=19)
As it was mentioned the amounts of a in table 2 are determined from 0.5 to 2.5 and amounts other than this interval have been determined as invalid items by the computer and shown by Item Deleted . Therefore items 1,5,and 27 which are mentioned below are invalid and deleted from the rest of the calculations.
On the other hand,experimentally qui square (the second column on the right) need to be less than three times of degree of freedom (the first column on the right),otherwise the amount of qui square is meaningful and the reliability of the item gets in risk.(11)The amount of qui square in items number three and twenty six are respectively 72.057 and 61.439are larger than the three times of degree of freedom (48) and are meaningful ,thus items three and twenty six are invalid and need to be deleted.
In the first residency test ,parameter a is in the interval of 0.618(item 2) to 1.817(item 13) .It is normal that the more the amount of a,the steeper the bend slope is ,which means that in item
13 with increase in the level of latent trait ,the probability of success in the question increases rapidly .In other words this item just relies on latent trait and nothing else much, in a way that a person with enough latent trait is sure about his success and a person with not enough latent trait is almost sure about his failure.In contrast ,the item slope of item 2 is almost flat (a=0.618) and in the case of item 2 with increase in the level of latent trait ,the probability of success slowly increases. In other words success in replying item 2 depends on factors other than latent trait to a great extent and the level of having latent trait hardly affects success in replying item 2. Experimentally amounts less than 0.5 for a ,causes the item becomes inappropriate. In the first test of residency item difficulty or the amounts of b varies from -2.294 (item 3) to 3.0 ( items 7,8,15,17,18).The amounts b2.5 shows that item difficulty is very high or the item is very difficult. Therefore just item 3 is very easy, items 25(b=-1.532),24(b=-1.080),20(b=-1.006),26 (b=-0.660),19(b=-0.577),22 (b=-0.188),16 (b=-0.133),and 21 (b=-0.064) are the easiest questions respectively ,and on the other hand , items 7,8,15,17,18 (b=3.0) are very difficult. items4(b=2.008).12(b=2.005),6(b=1.762)14(b=0.729),23(b=0.695),9(b=0.661),10(b=0.562), 30(b=0.484),29(b=0.359),2(b=0.342),13(b=0.192),28(b=0.064),and 11(b=0.006) were difficult to easy items respectively .In the first test of residency ,parameter C expresses the probability of selecting the correct choice by chance.The amounts of C lie between 0.160(item 13) and 0.330(item 7).In this test ,C=0.25 shows that selecting the right choice among the others is not possible through chance, larger amounts like C=0.330 in item 7 shows that selecting the right choice is possible by little chance and is related less to .
On the other hand, C
The scale of ability (ᶿ) of participants of the first test of residency based on standard score ( mean of 0 and standard deviation of 1) and also based on progress score (mean of 80 and standard deviation
of 10) is shown in table 3
Table 3_Gender and Ability of Examinees in the First Test of Residency based on ᶿ and progress scale (n=19)
As the data of table 3 shows the standard score of ability (ᶿ) of the examinees lie between -1.59 to 1.47,minus scores express being less than the mean and positive scores show latent trait in more than average in residents.Corresponding with ᶿ the progress score of students lie in the interval of 64.10 to 94.65.
In the second test of residency the amounts of a in items 5 and 33 are determined as invalid because of standing out of the interval and are deleted.
Qui square in item 16 was 21.607 which was bigger than three times of degree of freedom (21) and is meaningful,therefore item 16 is invalid and needs to be deleted.
In the second test of residency item difficulty or b lies between _1.608 (item 32) to 3.0(items 6,17,18,23,28).b<_2.5 shows that item difficulty is very low or in other words the item is very easy,in contrast the amounts of b>2.4 express that item difficulty is very high or the item is very difficult .Therefore item 32 is the easiest and items 6,17,18,23,28 (b=3.0) are very difficult .In the second test of residency ,parameter c of item 1,26,and 31 show that choosing the right answer among others is possible with a little chance and by guessing and is less related to latent trait .
The standard ability scores of the participants lie between _1.45 to 1.45,the negative scores are less than the average ,and the positive
scores are more than the average. The progress scores of the participants lie between 65.48 to 95.36.
In the third test if internship as the allowed amounts for are between 0.5 to 2.5 and according to the fact that all the numbers in column a placed in this interval ,therefore all the items of the tests are valid and none are deleted .Qui square in items 17 and 19 are 84.245 and 77.347 respectively and are bigger than the three times of degree of freedom and meaningful, thus items 17 and 19 are invalid and must be deleted.
In internship test, item difficulty or the amounts of b lie between _2.552 (item 17) to 3.0(items 3,4,5,8,9,11,18,22,24,25,26). b<_2.5 shows that item difficulty is very low and in other words the item is very easy and in contrast b>2.5 shows that the item is very difficult or high item difficulty. Therefore just item 17 is very easy and items 13 (b=_1.558),23(b=_1.399), 12 (b=_1.227) , 19 (b=_1.011), 14 (b=_0.849),15 (b=_0.778), 6(b=_0.755), 21 (b=_0.655), 10 (b=_0.500),27 (b=_0.434), 20 (b=_0.092), and 1 (b=_0.077) are the easiest items respectively .On the other hand, items 3,4,5,8,9,11,18,22,
24,25,26 (3.0) are very difficult and items 2 (b=1.916), 28 (b=1.786), 7 (b=1.202) and (b=0.750) are difficult to easy items respectively.
In internship test ,parameter c for the items 2 and 22 shows that the right choice can be guessed and is little related to latent trait.
The standard score of ability (ᶿ) of the participants is between _1.76 to 1.33, the negative scores are less than the average and the positive ones are higher than the average .The progress scores of the participants lie between 62.38 to 93.27.
In the fourth test of residency the amount of a in item 72 is out of the
interval ,therefore it is invalid and must be deleted.
The amounts of qui square in item 21 and 59 are 110.682 and 58.971 respectively which are bigger than the three times of degree of freedom and are meaningful .Thus items 21 and 59 are also invalid and need to be deleted.
In the fourth test of residency item difficulty (b) ranges from _2.388 (item 48) to 3.0 (items 12,4,1,18,20,23,27,29,30,31,37,40,41,43,47,51,
55,56,57,60,61,65,66,67,68,69,70,71).b<_2.5 shows high item difficulty or a very hard question. In this way just item 48 is easy.
In the fourth test of residency ,parameter c introduces the probability of choosing the right answer by chance .Its amounts range from 0.130 (items 43 and 59) to 0.410 (item 27).
The standard scores of ability (ᶿ) of the participants range from _1.49 to 1.39,negative scores are less than the average and positive ones are higher than the average. Progress scores of students vary from 65.11 to 93.94.
In the fifth test of internship as the amounts of a lie between 0.5 to 2.5 therefore all the item are valid and none are deleted .Qui square of item 36 is 55.185 which is larger than the three times of degree of freedom (51) and so is meaningful, thus item 36 is invalid and must be deleted.
In the fifth test of internship item difficulty (b) lie between _2.920 (item 36) to 3.0 (items 28,43,46,47,48,49,51,53,54,57,58,60). b< _2.5 shows that item difficulty is very low and in other words the item is very easy and need to be deleted ,on the other hand b>2.5 expresses high item difficulty or a very difficult item.
Therefore just item 36 is very easy and items 28,43,46,47,48,49,51,53,
54,57,58,60 are very difficult.
In the fifth test of internship ,the amount of c range from 0.020 (item 49) to 0.440(item 45).
ᶿ lies between _1.36 to 1.70,negative scores are higher than average and positive scores are higher than average. Progress scores of students are placed between 66.40 to 97.04.
In the sixth test of internship as it is mentioned in table 12 the amounts of a lie between 0.5 to 2.5 and any amount except this interval is considered as invalid by the computer and marked as Item Deleted. Thus item 30 is invalid and is deleted.
Qui squares of items 7,31,35 are 85.178,53.120,54.635 respectively and are bigger than the three times of degree of freedom and are meaningful, therefore items 7,31,and 35 are invalid and must be deleted .In the sixth test of internship item difficulty or b ranges from _3.0 (item 35) to 3.0 (items 48,52,56,60) ._2.5 shows that item difficulty is very low and in other words the item is very easy ,on the other hand the amounts of b
In the sixth test of internship,parameter c shows the probability of selecting the right choice by chance.Its amounts vary between 0.150 (item 48) to 0.410 (item 21) ,which shows that in item 21 selecting the correct choice is possible by guessing and is less related to latent trait .The standard ability score of the participants lie between _1.61 to 1.71,the negative scores are lower than the average and
positive scores are higher than average .Progress scores scores of the students lie between 46.94 to 90.46.
The main purpose of this study was to refine the items and determination valid and invalid items and recognition of parameters of each item .The advantage of this study over the classic one is that it is free of testing and in other words it does not have a dependent test and each item is analyzed independently.Therefore the organization which need them can supply their resources and provide tests of the highest standard and get real results.
Zolfaghari (2007)conducted a study on 457 participants which were selected randomly from the first ,second, and third level of secondary school of Ferdows .For studying the parameters of the test factor analysis and tertiary parameter model were used which was determined that this test has more than one dimension.For the determination of the properness of this test with the examinees awareness function and specific slope function of the item showed that this test is proper and along with the ability of the examinees.As a result the first hypothesis was rejected and the second one was accepted.
The results of holding the test show that the comparison of item difficulty, judgement capability ,and slope line of each item expressed that in file 1 which related to residents items 1,5,and 27 had better be omitted because their line slopes was out of the interval of 0.5 to 2.5.Items 3 and 26 must also be deleted as they have a meaningful qui square .Items 9 and 14 should also be deleted because of low .
In a 31 item test which related to the residency level ,7 items had to be deleted because of not being valid thus it is worth paying the essential attention in planning tests so that the examinees are assessed appropriately and other factors do not interfere.
The studied tests are far from standard conditions .For planning tests more attention must be paid so that their results are reliable.
1- Ebrahimifar T. Studying the second scale 2 of cattle's intelligence test that is against the culture by using the classical method and IRT among the girls and boys of Tehran province. MS. Dissertation. Islamic Azad University of Tehran, 1999. [In Persian].
2- Pashasharifi H. The fundamentals of psychometry and the assessment of psychology. Tehran: Nour Hekmat; 1995.
3- Thorndike R. Applied psychometrics. Tehran: Tehran University; 1996.
4- Joker B. Assessment of the scale 2 of the intelligence test that is according to the Cattle culture. MS. Dissertation. Shiraz University; 1993. [In Persian].
5- Zolfaqhari QH. Discussing about the dimensions of the content of Riven test and assessing its parameters according to the question and answer hypothesis. MS. Dissertation. Islamic Azad University of Tehran; 2007. [In Persian].
6- Sepasi H. The classical test view and its limitation. The journal of educational sciences of Chamran University 1995; 2:133 [Persian].
Abbassi F. Studying the second scale 2 of cattle's intelligence test that is against the culture by using the classical method and IRT among the girls and boys of Tehran province. MS. Dissertation. Islamic Azad University of Tehran, 1999. [Persian].
8- Alimohamadi F. Using the IRT model in balancing the test of self-respect of Izneck and Kooper for the girl students in the guidance school and the first grade of high school in Tehran according to the three oarameter IRT model. Islamic Azad University of Tehran, 1997. [ Persian].
10- Human H. Statistical presumption in behavioral research. Tehran: Peik Farhang; 1994. [In Persian].
11- Human H. Psychological and educational assessing and the way of providing test. Tehran: Parsa; 1994. [Persian].
12- Human H. Comparing the Rash one parameter method with a two parameter method with the Tehran-Bines data for assessing the intelligence of the students from11 to 15. MS. Dissertation. Allame Tabatabaee University of Tehran,1994.