Investigation of Three-Tier Diagnostic and Multiple Choice Tests on Chemistry Concepts with Response Change Behaviour

This study aims to investigate the test scores of the three-tier diagnostic chemistry test (TDCT₃) and the multiple-choice chemistry test (MCCT) in terms of response change behaviour (RCB). It is a descriptive study examining the item response efforts in TDCT₃ and MCCT in a computerized testing environment (Quizzer test program, QTP). In both TDCT₃ and MCCT, QTP maintains a continuous record for each tier of the test. Participants in the study are students in the Science Education Department at a state university in the Aegean region of Turkey (n=115). The study was conducted in two groups: there were 58 students in Group 1 and 57 students in Group 2. In Group 1, a TDCT₃ was used; in Group 2, an MCCT was applied. Tests were distributed by random sampling between Group 1 and Group 2. The data were collected by adding a confirmation tier to the TDCT₂ involving 44 items. The TDCT₃ was applied to 115 pre-service teachers; the reliability coefficient of the test was found to be 0.72. SPSS and MS Excel programs were used to analyse the data. Data were analysed using descriptive statistical methods. According to the results of the study, the rate of responding to test items with RCB is approximately 7–12 per cent for both tests. Another important result is that RCB does not provide an advantage or disadvantage in terms of scoring.


Introduction
The first studies on misconceptions in scientific concepts appeared in the early 1970s, and by the middle of the 1980s they were the focus of several researchers (Driver, 1981; Linke & Venz, 1979; Osborne & Cosgrove, 1983; Tamir, 1971). Up to 2015, there were many studies on misconceptions in science education. Open-ended questions, multiple-choice tests (MCTs) and interviews were used in 91 per cent of these studies. A very small proportion (9%) made use of tier diagnostic tests (TDTs) (Gürel, Eryılmaz, & McDermott, 2015). Nowadays, misconceptions are determined by TDTs, and these tests are supported by interviews. Two-tier diagnostic tests (TDT₂), three-tier diagnostic tests (TDT₃) and four-tier diagnostic tests (TDT₄) are being developed (Odom & Barrow, 1995; Peşman & Eryılmaz, 2010).
Multi-tier TDTs may be more standardized than other tests. In addition, since the knowledge of students is questioned in the second tier (II) of TDTs, the chance factor can be reduced. In the context of measurement theories, reducing the chance factor lowers the error rate of a measurement tool and thus increases its reliability (Çakır & Aldemir, 2011; Özbayrak & Kartal, 2012). However, there are some problems with the writing of distractors in TDTs (II). In addition, the use of a small number of test items and the scoring of the tiers with different models in TDTs raise validity and reliability problems (Xiao, Han, Koenig, Xiong, & Bao, 2018). TDTs (II) can be designed as MCTs or open-ended tests. It may be a disadvantage for TDTs (II) to have open-ended items. For example, in a diagnostic test developed based on the PISA (Programme for International Student Assessment) exam, it was stated that students could easily complete the first tier with its multiple-choice items, but had difficulty writing reasons for their responses in the second tier, which consisted of open-ended items (Sadıç & Çam, 2015). The validity and reliability problems of the tests can be resolved if TDTs are prepared with item choices based on standard misconceptions.
Because the second tier is integrated into the first tier of TDTs, response time and test performance are affected. As a result, these tests can only be applied with a limited number of items, and TDTs can therefore be criticized in terms of content validity. Research has found that while students were successful in tests with numerical problem content, they failed in TDTs, whose items often address the relationships between concepts (Bernhard, 2000; Crouch & Mazur, 2001). There are also validity and reliability problems arising from the scoring of open-ended questions, drawing-based qualitative measurement tools and MCTs. Therefore, these measurement tools are inadequate for measuring students' abilities, skills and knowledge. For these reasons, the need for studies on adaptive TDTs is increasing day by day.

How are TDTs Prepared and Scored?
Although TDTs are similar to MCTs in terms of their practice, there are differences in the development process of the tests. Firstly, in the preparation of a multi-tier TDT, concept map drawing comes to the forefront. It is important to show the relationships between concepts correctly while preparing a TDT, because TDTs pay attention to the relationships between these concepts. While preparing a TDT, the learning objectives in the curriculum are first determined on the concept map; then misconceptions from the literature pursuant to the determined learning objectives are listed; finally, the test items are created (Treagust, 1988). The first tier (I) of a TDT is similar to an MCT item that measures concepts about the subject; TDT (II) is composed of choices that are thought to be related to TDT (I). A TDT (II) prepared in this way may consist of items which express the reasoning for the response in connection with TDT (I) (Mutlu & Şeşen, 2015; Taber, 1999; Uyulgan, Akkuzu, & Alpat, 2014). In some studies, the test development process may follow methods different from that proposed by Treagust (1988). Some researchers prefer to take validated and reliable test items used in theses, scientific articles, and national and international exams for the first tier when developing TDTs (Sadıç & Çam, 2015). After TDT (II), an 'Are you confident of your response?' question is added, and TDT₂ is converted into TDT₃. The students are asked to answer this question with 'Yes/No', and their consistent attitude towards the concept is determined: in fact, it is the student's consistent attitude that proves the existence of misconceptions in the test item. One of the choices in TDTs should be correct in terms of scientific proposition, and the propositions in the other choices should contain misconceptions. Similar to TDT₂ and TDT₃, a TDT₄ can be prepared. In TDT₄, after tiers (I) and (III), the question confirming the students' confidence in their answers is added twice. In summary, tiers (II) and (IV) of TDT₄ are the confirmation stages.
It is important to identify misconceptions as well as to measure current knowledge and the curriculum objectives achieved at the beginning or end of the learning process (Bektaş & Kudubeş, 2014). The scoring of TDTs is done differently from other measurement and evaluation tools. In the scoring of a TDT in the light of classical test theory, when the participant correctly answers both tiers of the TDT, the score of the test item becomes 1; in other cases, the score of the test item is 0. A binary rating such as 0/1 is considered to give a more reliable computation. However, there is still controversy about the reliability and scoring of TDTs and MCTs (Bademci, 2006; Taber, 2017). Due to reliability and scoring difficulties, each tier of a TDT can be evaluated under separate parameters with logistic or Rasch models (Xiao, Han, Koenig, Xiong, & Bao, 2018). In TDTs, it is important to find the reason for the students' answers alongside the score and to determine any misconceptions about the subject. Scoring the tiers in a TDT is advantageous both in determining the lack of knowledge and in arriving at test scores. Hestenes and Halloun (1995) proposed the definitions of the false positive (FP) and the false negative (FN) as evidence of external validity in TDTs. According to Hestenes and Halloun (1995), FP is defined as the correct response to the test item with a confident attitude based on an incorrect reason, while FN is defined as an incorrect response to the test item with a confident attitude based on the correct reason. The researchers also noted that minimizing the probability of FPs and FNs could provide higher validity in TDTs. FN for external validity in TDTs should be less than 10 per cent (Gürçay & Gülbaş, 2015; Şen & Yılmaz, 2017). However, it is very difficult to reduce FP in TDTs. Due to the nature of the test, students can choose the right alternative in the content layer even if they have misconceptions (Peşman & Eryılmaz, 2010).

ies.ccsenet.org International Education Studies Vol. 13, No. 9;2020
There are some problems with scoring TDTs (I) and TDTs (II) separately or jointly: the chance factor, preference interactions between tiers, and uncertainties in calculating the reliability coefficient. In the model proposed by Hestenes and Halloun (1995), if the students know that TDTs (I) and TDTs (II) will be scored together, their choices can contribute positively to FP (the first tier is correct, the second tier is incorrect). However, more in-depth scientific understanding and reasoning processes may not be determined by their scoring model. In this case, students can avoid guessing on the test items. If students know that TDTs (I) and TDTs (II) will be scored separately, their choices may have a negative effect on FP. Students can make separate estimates for both tiers and increase the chance factor in the TDT tiers. In this type of scoring, students can establish a systematic relationship between the tiers and thus make predictions (Xiao, Han, Koenig, Xiong, & Bao, 2018). Compared with their model, this study may be important in terms of revealing the positive and negative aspects of the three-tier diagnostic chemistry test (TDCT₃).
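The difference between joint and separate scoring of the two content tiers can be illustrated with a minimal sketch. The function names and the sample responses are hypothetical, and the one-point-per-tier rule used for separate scoring is an assumption made for illustration rather than a scheme taken from the cited studies.

```python
def score_joint(tier1_correct, tier2_correct):
    """Joint (0/1) scoring: an item scores 1 only when both tiers are correct."""
    return [int(a and b) for a, b in zip(tier1_correct, tier2_correct)]


def score_separate(tier1_correct, tier2_correct):
    """Separate scoring (assumed rule): one point for each correct tier."""
    return [int(a) + int(b) for a, b in zip(tier1_correct, tier2_correct)]


# Hypothetical responses of one student to four items
tier1 = [True, True, False, False]   # tier (I): content response correct?
tier2 = [True, False, True, False]   # tier (II): reasoning response correct?

joint = score_joint(tier1, tier2)
separate = score_separate(tier1, tier2)
print(joint, sum(joint))
print(separate, sum(separate))
```

Under joint scoring only the first item earns credit, whereas separate scoring also rewards the second and third items, where a single tier may have been answered correctly by chance.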

Response Change Behaviour (RCB) in Tests
The effects of students' RCB on test scores can be examined. There is a common belief that the first response to a test item is correct and that the second choice is incorrect if the response is changed later. Although test participants worry about changing responses, they persistently change them (Cox-Davenport, Haynes, & Lawson, 2014). Therefore, the effect of RCB on test scores has always been a matter of interest. For this purpose, permanent markings and deleted markings were initially examined in paper and pencil tests. Nowadays, RCB and response time can be tracked by the software system in computerized tests. This recording feature of computers may be an important source of data for researchers, and may be evaluated as a parameter in ability estimation.
When the deleted markings or choices in open-ended, true-false and multiple-choice paper and pencil items were examined, it was found that there was a general increase in the test scores of test participants (Al-Hamly & Coombe, 2003; Baştürk, 2011; Beck, 1978; Cox-Davenport, Haynes, & Lawson, 2014; Kim, 2019; Lynch & Smith, 1972). Only Noorbala and Mohammadi (2011) reported that RCB had a negative effect on test scores, in a study conducted with medical students. It was found that test participants with a higher test score or more talent showed less frequent RCB than weaker candidates (Beck, 1978; McMorris, Schwarz, Richichi, Fisher, Buczek, Chevalier, & Meland, 1991). It was found that repeating RCB on a test item does not contribute to the test score (Lynch & Smith, 1972). It was observed that the test type had no effect on RCB (McMorris, Schwarz, Richichi, Fisher, Buczek, Chevalier, & Meland, 1991). There were no significant differences between the sexes in studies of RCB (Baştürk, 2011). In the comparison of RCB with variables such as test item difficulty, the frequency of RCB was parallel to item difficulty: students showed more RCB with difficult items (Baştürk, 2011; Beck, 1978; Lynch & Smith, 1972). Some studies have emphasized that students should be encouraged to change their responses (Al-Hamly & Coombe, 2003; Casteel, 1991; McMorris et al., 1991). On the other hand, participants who change their responses during the test spend more time and exhaust their minds, so their response performance on test items may be affected. For this reason, RCB can be used in two-, three- and four-parameter logistic modelling for ability estimation in tests (Kim, 2019; Yen, Ho, Liao, & Chen, 2012). In addition, when the probability value for RCB is used in three-parameter logistic models, students engaging in cheating can be identified.
If the RCB is examined in computerized test environments, it can be determined how many times the RCB is repeated and how long it takes to decide. Therefore, RCB should be considered in computerized tests (Van Der Linden & Jeon, 2012). RCB should be considered when developing test items (Lynch & Smith, 1972). RCB can be utilized in test development processes.
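How a computerized test record could yield the number of response changes and the decision time per item can be sketched as follows. The log format (chronological tuples of item number, selected choice and seconds spent) and the function name `summarise_rcb` are hypothetical; the actual record format of the test software used in this study is not described in the text.

```python
from collections import defaultdict


def summarise_rcb(events):
    """Summarise one student's log.

    events: chronological (item_id, choice, seconds) tuples (assumed format).
    Returns {item_id: (number_of_response_changes, total_seconds)}.
    """
    history = defaultdict(list)
    for item_id, choice, seconds in events:
        history[item_id].append((choice, seconds))

    summary = {}
    for item_id, records in history.items():
        choices = [choice for choice, _ in records]
        # A change is counted each time a recorded choice differs from the previous one
        changes = sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)
        summary[item_id] = (changes, sum(seconds for _, seconds in records))
    return summary


# Hypothetical log: the student changes the response to item 2 once
log = [(1, "A", 20), (2, "C", 35), (2, "B", 15), (3, "D", 10)]
summary = summarise_rcb(log)
rcb_items = [item for item, (changes, _) in summary.items() if changes > 0]
print(summary)
print(rcb_items)
```

Items with zero changes correspond to single response behaviour; items with one or more changes correspond to RCB.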

Applications of Computer-Based Tests from Past to Present
Computers have been used in the administration offices of schools as an auxiliary tool outside the learning process since the 1980s, and in the classroom as a teaching tool since the 1990s. Since the 2000s, their use in measurement and evaluation has become widespread and is today very active (Linden & Glas, 2002). When the literature on the use of computers for measurement and evaluation is examined, it is seen that they are applied in different ways (Aybek, 2012). In computer-based tests, the presentation of test items on the screen can take different forms: similar to paper and pencil tests, one by one in sequential order, one by one in blended order, or according to students' preference. In addition, multimedia and visuals can be used for the presentation of test items on the screen. Data in computer-based tests can be collected with external data loggers, a computer connected to the network centre, or online through an internet web server. The data obtained from computer-based tests can be processed on the basis of classical test theory, as in paper and pencil tests, or on the basis of item response theory by computer-adapted algorithmic methods. Computer-adapted algorithmic methods can be used for different variables such as test response time, personal attention and motivation data, in addition to test scores (Tabakçıoğlu, Çizmeci, & Ayberkin, 2016; Weiss & Kingsbury, 1984). Computers can be used in developing TDTs and determining misconceptions. Lin (2016) used computer-based TDTs to identify misconceptions about electrical circuits. Maier, Wolf and Randler (2016) showed that misconceptions can be better determined by the automatic feedback given to students during the application of a multi-tiered diagnostic test in a computer software environment. Yang, Hwang, Yang, and Hwang (2015) determined that students' computer programming skills were observed better with a computer-based TDT. In this study, the aim was to determine the positive and negative aspects of a computer-based TDT.

Quizzer Test Program (QTP)
Since there was no licensed computer-supported test program suitable for the purpose of this study, the Quizzer test program (QTP) was developed by an expert software programmer in computer teaching technologies at the university where the research was conducted. During the academic semester, pilot trials of QTP were carried out and missing aspects were corrected. QTP was intended to allow the tests to be administered easily and effectively in the experimental and control groups. Examples of QTP are shown in Figures 1 and 2 in the Data Collection section.

Problem Status, Importance and Purpose of the Study
The literature review shows that there are very few studies on response behaviour in computer-supported exams. In addition, no computer-supported exams were encountered in relation to TDTs, especially TDCTs. Yang and Sianturi (2018) used a computerized online test in a three-tier mathematical diagnostic test, but this test was not related to chemistry. Chiang and Chiu (2015) performed a computer-supported chemistry test, but this test was not a TDCT; its purpose was to reveal mental models in chemistry. There are not many studies investigating RCB with a computer-supported TDCT. An indicator of misconceptions is that students insist on their response. Students may exhibit RCB for some test items in TDTs. Because RCB shows that the student is not completely confident about the concept of the test item, if the student responds to the test item with a single response behaviour, it can be understood that he/she knows the response and is confident of it. It can be said that if the decision in the single response behaviour of the student is wrong, the student has misconceptions, because the misconception arises from insisting that the correct response is wrong. Therefore, if the majority of students insist on their decision in single response behaviour, this may indicate misconceptions. There are no data showing the positive or negative contribution of RCB to tiers in TDTs, and its advantages and disadvantages in relation to conventional tests are not known. In this context, the aim is to compare QTP-supported TDCT₃ and QTP-supported MCCT considering students' response behaviours, to investigate RCB in both tests, and to determine the effect between the tiers of TDCT₃.

Research Questions
The problem explored in the study is: What are the differences between the three-tier chemistry diagnostic test (TDCT₃) and the multiple-choice chemistry test (MCCT) for RCB? In this context, the following sub-problems were used in addressing the problem.
1) Do RCB percentages differ significantly between TDCT₃ and MCCT?
2) Is there a significant difference between the correct response percentages of TDCT₃ and MCCT according to single response behaviour?
3) What are the rates of false positive (FP) and false negative (FN) for TDCT₃ according to single response behaviour?
4) What are the trends of correct (TC) and incorrect (TIC) responses for TDCT₃ and MCCT according to RCB?

Method
This study is a descriptive study aiming to examine the test scores of students who participated in computer-supported TDCT₃ and computer-supported MCCT considering RCB. A comparative research design was used within the scope of a non-experimental research design. In a comparative design, the difference between two or more events or cases is investigated (Fraenkel & Wallen, 2000; McMillan & Schumacher, 2010). Therefore, experimental and control groups were formed to determine different features of TDCT₃ and MCCT for this study, but no experimental treatment process affecting the groups was performed. The data were collected within the context of the Chemistry II course in the Science Teacher Training programme at a state university in Turkey.

Participants in the Study
The participants were pre-service science teachers at a state university in Turkey (n=115) in the 2017-2018 academic year. Experimental and control groups were selected randomly from the classes in which the students were officially registered. For this reason, the study was conducted with two groups determined by random sampling. In Group 1 (n=57), TDCT₃ was performed, while MCCT was applied in Group 2 (n=58). Participants comprised 89 female and 26 male pre-service teachers, distributed as 46 female and 11 male pre-service teachers in the experimental group and 43 female and 15 male pre-service teachers in the control group.

Data Collection
The data were collected by adding a confirmation tier to the 44-item TDCT₂ developed by Mutlu and Şeşen (2015). The test covered chemistry concepts such as acids and bases, electrochemistry, thermodynamics, chemical kinetics and equilibrium. Their test was developed with 151 pre-service teachers, and its reliability was found to be 0.84. In this study, a third tier was added to their test, converting TDCT₂ to TDCT₃. TDCT₃ was then applied to 115 pre-service teachers. TDCT₃ (I) and TDCT₃ (II) were coded by the graded scoring of Milenković, Hrin, Segedinac, and Horvat (2016), and the reliability of TDCT₃ was calculated as 0.72. KR-20 was found to be 0.51 when an item was scored 1 for a correct response in both tiers (first and second) of TDCT₃ and 0 in other response situations. When the KR-20 coefficients of TDCT₃ (I) and TDCT₃ (II) were calculated separately, the results were 0.18 for TDCT₃ (I) and 0.51 for TDCT₃ (II). During the test process, response performance on the items following an item marked by RCB may be affected by participants' RCB (Kim, 2019; Yen, Ho, Liao, & Chen, 2012). In other words, scores of test items may be affected by RCB during the test process. TDCT₃ (I) included the question forms of items and their distractors; TDCT₃ (II) included the misconception choices, which made a causative inquiry related to TDCT₃ (I); and the final tier was the stage in which the responses were confirmed. MCCT involved the question forms of items, their distractors and the confirmatory response in the last part of the test item. The test structures for TDCT₃ (I) and MCCT were the same. Therefore, there may be differences in reliability coefficients. These test items, taken from Mutlu and Şeşen (2015) and transferred to QTP, were applied as TDCT₃ for the experimental group and MCCT for the control group. The tests were performed individually in the computer laboratory. Figure 1 shows the process of time recording in QTP for TDCT₃. Figure 2 shows sample screenshots from QTP.
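The KR-20 coefficients reported above are computed from dichotomous (0/1) item scores. A minimal sketch of the standard Kuder-Richardson 20 formula is given below; the function name `kr20`, the use of the population variance of total scores and the small score matrix are assumptions made for illustration and are not the study's data or software.

```python
from statistics import pvariance


def kr20(scores):
    """KR-20 for a matrix of 0/1 item scores (one row per examinee).

    Assumes at least two items and non-zero variance of the total scores.
    """
    n = len(scores)           # number of examinees
    k = len(scores[0])        # number of items
    # Proportion of examinees answering each item correctly
    p = [sum(row[j] for row in scores) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Population variance of total scores (assumption; some texts use sample variance)
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical scores of four examinees on three items
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
coefficient = kr20(scores)
print(round(coefficient, 2))
```

The same function can be applied to the joint 0/1 item scores or to each tier separately, which is how different coefficients can arise for the same test.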
The FP and FN rates can be calculated for a TDT using Table 1. The FN rate is the power to distinguish FNs from TPs and is calculated by FN/(FN+TP). The FP rate is the power to distinguish FPs from TNs and is calculated by FP/(FP+FN+TN). Table 1 can also be used in analysing response change behaviours for both TDCT₃ and MCCT. When Table 1 for TDCT₃ is adapted to RCB for both test types, 'first response in RCB' is used instead of TDCT₃ (I), and 'second response in RCB' instead of TDCT₃ (II). The confirmation tier is the last tier. In RCB, students used the right to reply a second time if they were not confident about their responses. It is therefore accepted that the students are confident of their own answers, since they moved on to the next item with the 'I am confident' preference at the confirmation tier in the second response process. Therefore, in Table 2, there is no 'unconfident' choice and the table is reduced. The final version of Table 2 is given below. TC and TIC rates can be calculated for TDCT₃ and MCCT using Table 2.
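The FN and FP rates defined above can be sketched in code. The classification follows the definitions used in this paper (FP: first tier correct, second tier incorrect; FN: first tier incorrect, second tier correct), and the two formulas are implemented exactly as written in the text. The function names and the response pairs are hypothetical, and all responses are assumed to have been confirmed as confident.

```python
from collections import Counter


def classify(tier1_correct, tier2_correct):
    """Label one confident response pair as TP, FP, FN or TN."""
    if tier1_correct and tier2_correct:
        return "TP"
    if tier1_correct:
        return "FP"   # correct response, incorrect reason
    if tier2_correct:
        return "FN"   # incorrect response, correct reason
    return "TN"


def fn_rate(counts):
    """FN rate as stated in the text: FN/(FN+TP)."""
    return counts["FN"] / (counts["FN"] + counts["TP"])


def fp_rate(counts):
    """FP rate as stated in the text: FP/(FP+FN+TN)."""
    return counts["FP"] / (counts["FP"] + counts["FN"] + counts["TN"])


# Hypothetical (tier I correct?, tier II correct?) pairs for ten responses
pairs = (
    [(True, True)] * 2
    + [(True, False)] * 1
    + [(False, True)] * 2
    + [(False, False)] * 5
)
counts = Counter(classify(t1, t2) for t1, t2 in pairs)
print(dict(counts), fn_rate(counts), fp_rate(counts))
```

Both functions divide by zero when the relevant categories are empty, so a real analysis would need to handle that case.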

Calculating TC and TIC Rates for TDCT₃ and MCCT According to RCB
The TC rate is the power to distinguish TC responses from SC responses and is calculated by TC/(TC+SC).
The TIC rate is the power to distinguish TIC responses from SI responses and is calculated by TIC/(TIC+SI).
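The two ratios above can be computed directly from category counts. In the sketch below, the counts TC, SC, TIC and SI are taken as given (their tabulation comes from Table 2, which is not reproduced in this text), and the function name and the numbers are hypothetical.

```python
def trend_rate(trend_count, reference_count):
    """Generic ratio trend/(trend + reference), e.g. TC/(TC+SC) or TIC/(TIC+SI)."""
    total = trend_count + reference_count
    if total == 0:
        raise ValueError("no responses in these categories")
    return trend_count / total


# Hypothetical counts for one test
tc, sc = 6, 18     # TC responses and SC responses
tic, si = 3, 27    # TIC responses and SI responses

tc_rate = trend_rate(tc, sc)      # TC/(TC+SC)
tic_rate = trend_rate(tic, si)    # TIC/(TIC+SI)
print(tc_rate, tic_rate)
```

Using one helper for both rates keeps the two formulas structurally identical, which mirrors how they are defined in the text.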

Results
Findings are listed in order of the sub-problems: the percentage of RCB; the percentage of correct responses according to single response behaviour; the rates of FP and FN for TDCT₃ according to single response behaviour; and TC and TIC rates related to RCB.

Findings on Percentages of RCB

Table 3 shows the findings for the first sub-problem, explaining the percentage of RCB. According to Table 3, the percentage of RCB was 12.73 for TDCT₃ and 8.09 for MCCT. The percentages of students showing only single response behaviour, without RCB, were 5.3 (3/57=0.053) for TDCT₃ and 10.3 (6/58=0.103) for MCCT. Since the test used in the experimental group was TDCT₃, the connections between TDCT₃ (I) and TDCT₃ (II) led the students to RCB. The majority of the students in both tests chose to respond to most of the items in the test without showing any RCB.

Table 4 shows the findings of the second sub-problem, explaining correct responses according to single response behaviour. In the descriptive values of Table 4, the average percentage of correct responses according to single response behaviour was 34.61 for TDCT₃ (I), 30.74 for TDCT₃ (II) and 35.31 for MCCT. There was no significant difference in descriptive values between TDCT₃ (I) and MCCT. There is a significant difference between TDCT₃ (I) and TDCT₃ (II). Similarly, there is a significant difference between MCCT and TDCT₃ (II). In TDCT₃ (II), there was a slight decrease in scores. In the single response behaviour, the average percentage of correct responses from students ranged between 30 and 35 per cent. TDCT₃ (II) shows an average correct response percentage of 30.74, which may indicate that students have misconceptions at this rate.

Table 5 shows the findings for the third sub-problem, explaining the rates of FP and FN according to single response behaviour. In Table 5, the rates of FP and FN for TDCT₃ are 0.15 and 0.46 respectively. That is, when the number of participants who answer TDCT₃ (I) incorrectly and TDCT₃ (II) correctly is divided by the number of participants who answer both tiers correctly, the joint conditional rate is 0.46. In the same way, when the number of participants who answer TDCT₃ (I) correctly and TDCT₃ (II) incorrectly is divided by the number of participants who answer both tiers incorrectly, the joint conditional rate is 0.15.

Table 6 shows the findings of the fourth sub-problem, explaining TC and TIC rates for TDCT₃ and MCCT according to RCB. Table 6 shows that, across both tests, the MCCT trend rates are highest in both TC and TIC, while the TDCT₃ (II) trend rates are lowest in both TC and TIC.
In both groups, it is seen that students change from an incorrect choice to a correct choice. This trend is greatest in MCCT, while TDCT₃ (I) is slightly lower than MCCT. This trend rate is very low in TDCT₃ (II). The relationship between the tiers of TDCT₃ makes it difficult for students to distinguish the correct from the incorrect choice. Students' tendency to change from the correct choice to the incorrect choice is less than their tendency to change from the incorrect choice to the correct choice. TDCT₃ (II) has a low trend from correct to incorrect choice. MCCT directs students from the correct choice to the incorrect choice. All tiers of test items give clues to students.

Discussion
The findings on the percentages of RCB show that students participating in both the TDCT₃ and the MCCT insisted on responding to some, but not all, of the test items with single response behaviour. In other words, they do not prefer RCB. Nevertheless, the majority of participants in both tests (TDCT₃/MCCT) felt the need to complete the test with RCB. This need appears to be higher in TDCT₃: the students participating in TDCT₃ were less insistent on continuing the test with single response behaviour than those doing the MCCT and relied more on RCB. It can be said that TDCT₃ is more advantageous in terms of measurement and evaluation than MCCT. The tendency of the students' choices in TDCT₃ suggests that they engaged in reasoning. However, this rate is not very high when the values of TDCT₃ and MCCT are compared. From the perspective of TDCT₃, the criticism of MCCT is that it does not involve a reasoning process. Another important finding is that students replied to most items in the test with single response behaviour, even though the majority of the students completed the test using RCB. This finding may be evidence that students do not want to use RCB. In this case, it is not advantageous to carry out tests with RCB.
In the single response behaviour, the average percentages of correct responses in TDCT₃ (I) and MCCT were equivalent. However, the average percentage in TDCT₃ (II) was lower than the average percentage in both MCCT and TDCT₃ (I). This finding differs from the result in Adodo's (2013) comparison of an MCT with a TDT₂. Adodo (2013) compared the adequacy of the multiple-choice test with the TDT₂ in a pre-test and post-test control group experiment. In his study, a slightly higher score was observed for TDT₂. The present finding is similar to the studies by Li and Yang (2010), Yang, Li and Lin (2008), and Yang and Lin (2015). In these studies, it was found that the rate of correct responses was higher in the first tier (I) than in the second tier (II). In fact, in Yang and Lin's (2015) study, the percentage of correct responses in TDT (I) was around 50 per cent, while TDT (II) was around 25 per cent. There are studies showing that TDT (I) yields a higher score than TDT (II) (Arslan, Çiğdemoğlu, & Moseley, 2012; Peşman & Eryılmaz, 2010; Şen & Yılmaz, 2017). In this study, a significant difference was found between TDCT₃ (I) and TDCT₃ (II). However, the difference is not great compared to studies in the related literature. Yang and Lin (2015) considered that TDT (I) and TDT (II) were evaluated by the students as two separate problems. The low scores in TDCT₃ (II) can be explained in two ways: firstly, TDCT₃ (I) may affect TDCT₃ (II), or vice versa; secondly, TDCT₃ (II) may be somewhat more difficult, because more cognitive processes are thought to be performed in TDT (II) (Yang & Lin, 2015). In TDT (II), only one of the four choices included scientific knowledge and the other three contained misconceptions. When the tests are examined in terms of the chance factor, TDCT₃ (II) is disadvantageous in terms of scoring compared to both TDCT₃ (I) and MCCT. This disadvantage should be considered when scoring TDCT₃ (Xiao, Han, Koenig, Xiong, & Bao, 2018). Furthermore, the misconception rate of 69.26 per cent (100%-30.74%) according to the single response behaviour is a significant contribution to the research on TDCT₃ (II).
In single response behaviour, the rates of FP and FN for TDCT₃ are 15 per cent and 46 per cent respectively; both rates are above the critical point of 10 per cent (Hestenes & Halloun, 1995; Şen, Yılmaz, & Geban, 2018). It is difficult to reduce the rate of FP. Due to the nature of the tests, students may choose the right alternative according to the content of the test even if they have misconceptions (Peşman & Eryılmaz, 2010). This study showed that the explanation or causal reasoning in TDCT₃ (II) was more difficult than TDCT₃ (I). The TDCT₃ FP rate in this study is similar to the results of the Rasch model in the TDT study conducted by Xiao, Han, Koenig, Xiong, and Bao (2018). In this study, it is considered normal for the rates of FP and FN to be higher than the critical value. The FP and FN for TDTs can be evaluated as parameters in logistic analysis. If the FP and FN rates are taken together according to RCB, more concordant results can be achieved.
In this study, the TDCT₃ and MCCT findings regarding TC and TIC rates are compared according to RCB. The rate of TC was lowest in TDCT₃ (II). It is seen that students change their preferences from an incorrect choice to a correct choice in RCB. This trend is highest in the MCCT and very low in TDCT₃ (II). The relationship between TDCT₃ (I) and TDCT₃ (II) can distract students from moving from an incorrect to a correct choice. The TIC rate was also highest in MCCT. Moreover, students' TIC rate was lower than their TC rate for both TDCT₃ and MCCT. The TIC rate was lowest in TDCT₃ (II); this trend's rate is higher in MCCT than in TDCT₃ (II). TDCT₃ (I) and TDCT₃ (II) provide students with clues.

Conclusion
In general, the following results were obtained from the research. Firstly, although the majority of the students completed the test with RCB, they preferred to answer the majority of the items in the test with single response behaviour. In TDTs, this response tendency differs from other tests. Secondly, single response behaviour did not produce a significant difference between the TDCT₃ (I) and MCCT scores. It was concluded that TDCT₃ (II) is more difficult than TDCT₃ (I). Thirdly, TDCT₃ (I) and TDCT₃ (II) were found to incorporate a guiding feature according to the single response behaviour, but the direction could not be determined. This could be because the item difficulty of the two tiers of the TDT is different, and the item structures are different and are contrary to the theories. Furthermore, the majority of the choices in TDT (II) consist of possible misconceptions, and so have a significant advantage in terms of the chance factor. In this respect, TDTs can be criticized. Fourthly, it was found that RCB did not provide an advantage or disadvantage in terms of scoring. The RCB percentage showed that students definitely have misconceptions. Qualitative interviews with these students can lead to detailed results. Finally, although the response time for TDCT₃ was longer than for MCCT due to TDCT₃ (II), the scores of TDCT₃ (I) were equivalent to the scores of MCCT, indicating that there were no negative aspects of TDTs in terms of time.
This study was limited to the data of TDCT₃ and MCCT. Misconceptions can be better confirmed in TDCT₄ than in TDCT₃ through its confidence-questioning stages. However, in the four-tier test, compared to the three-tier test, the scoring system becomes more complex and difficult to apply in QTP. For these reasons, TDCT₃ and MCCT were preferred. In future research, the study can be expanded by testing with TDT₄ in better computerized testing environments.