The Factor Structure and Invariance of an Observational Checklist to Measure Children’s Emergent Literacy Skill Development across Male and Female Samples

The purpose of this study was to test the factor structure of the Jumpstart School Success Checklist (JSSC) and its measurement invariance (factor structure similarity) across male and female samples, based on national Jumpstart data (N = 5,545). Factor analytic results supported conceptualizing the JSSC item-level data in terms of a bifactor model (Gibbons & Hedeker, 1992), where each scale item related to a primary factor (Literacy) in addition to one sub-domain: Language Arts or Social Relationships. A comparison of the equivalence of the JSSC factor structure across sex groups indicated that the scale’s factor structure met partial measurement invariance (Byrne, Shavelson, & Muthén, 1989). A follow-up latent means structure analysis indicated that females had slightly higher latent means across the factors than males. Study implications pertain to (a) the degree to which the JSSC scores function similarly across sex groups, and (b) how factorial invariance research can be used to examine raters’ judgments of students’ literacy skill development.

for a number of reasons, including: limited funds to cover the costs of the assessments and materials, limited availability of qualified individuals to administer the tests, and the difficulty of assessing a large number of students in an efficient, timely manner. While direct assessments may prove useful in controlled, empirical studies of the effectiveness of instruction or programs in promoting children's literacy skill acquisition, in many cases they do not offer program providers a quick, efficient method of assessment in a naturalistic environment.
On the other hand, indirect assessments can be broadly characterized as instruments that involve informants (e.g., teachers) evaluating a child's emergent literacy skills. Both observational rating forms and checklists represent indirect measures that have been promoted to assess young children's behavior and literacy outcomes (e.g., Neuman & Roskos, 2007). Cabell, Justice, Zucker, and Kilday (2010) identify several attractive features of these measures: they are time and cost efficient; they are convenient to complete; they eliminate the influence of child characteristics (e.g., mood) on testing; and they may offer more specific developmental information than direct instruments provide. Notwithstanding these benefits, there are important factors to consider when using them to assess literacy skills. For instance, these measures are not designed to measure strengths or weaknesses in particular dimensions of literacy skills (Lonigan, 2006). Popham (2000) identifies the scoring scale, raters (e.g., teachers), and scoring procedure as key sources of error associated with the practice of evaluating student outcomes. Nonetheless, such instruments enjoy widespread use as one source of information used by early childhood programs to evaluate childhood outcomes, as well as within empirical research (e.g., Lapointe, Ford, & Zumbo, 2007). Consequently, a body of research is emerging regarding the psychometric properties of these types of measures (e.g., predictive validity; Cabell et al., 2011).
Given the use of indirect assessments across practical and research settings, imperative questions related to their use include (a) whether the resultant scores represent the scale's theoretical factor structure, and (b) the extent to which obtained scores have similar measurement properties across diverse student sub-groups (e.g., sex, race/ethnicity). Factor analysis represents a broad class of statistical procedures to investigate empirical questions related to the structure of scale data. Thompson (2004) identifies three purposes of factor analysis: (a) gather empirical evidence on test score validity, (b) develop theory on hypothetical constructs (e.g., literacy), and (c) summarize relationships among variables using factor scores. The two major classes of factor analysis are exploratory and confirmatory factor analysis. Whereas exploratory factor analysis (EFA) is a data-driven approach to identifying the number of factors underlying scale data, confirmatory factor analysis (CFA) is based on the use of a priori information (or theory) to test the number of factors explaining the relationship among a set of observed variables (e.g., rating scale items).
Measurement invariance is a desired property of test scores that indicates that the psychometric properties (e.g., discrimination) of the scores are the same across compared groups (e.g., experimental vs. control). Within the factor analytic framework, measurement invariance is tested using multi-sample CFA by testing the statistical fit of competing models that differ in terms of the model parameters set equal across compared groups (Millsap & Yun-Tein, 2004). A finding of measurement invariance indicates that scores can be interpreted similarly across groups, whereas a lack of invariance indicates that scores cannot be interpreted the same way. Thus, a lack of invariance indicates that across-group score disparities may be due to trait (e.g., literacy) differences in addition to measurement error (Raju, Laffitte, & Byrne, 2002). Based on these considerations, measurement invariance research provides literacy researchers vital information on the meaning of obtained scores, as well as an avenue to pursue research into students' literacy development.
Structural equation modeling (SEM) offers a valuable model-based approach to investigate the relationships among observed (e.g., items) and latent (e.g., literacy) variables (Bollen, 1989). SEM can also be used to formally test the measurement invariance of scale items to judge whether the psychometric properties of scores are similar across diverse groups (e.g., sex, race/ethnicity). This entails fitting and comparing a series of increasingly restrictive models that differ by the particular item parameter(s) constrained equal across groups. Measurement model parameters of interest include: factor loadings (i.e., discrimination), thresholds, and residuals (error terms). Factor loadings characterize the relationship between the observed and latent variables and are directly related to item discrimination (Bock & Gibbons, 2010). Thresholds indicate the point on the trait continuum where there is a given probability of selecting a particular response option over the next lowest category (e.g., selecting Agree over Neutral). Residuals indicate the amount of item variance unexplained by the underlying latent trait, or unexplained error. The degree to which these model parameters are similar across groups corresponds to the level of invariance of an instrument's factor structure. A finding of partial measurement invariance (Byrne, Shavelson, & Muthén, 1989) provides a basis for comparing groups on the underlying latent means.
The purpose of this study was twofold. First, the study tested the factor structure of the Jumpstart School Success Checklist (JSSC), a 15-item rating scale completed by informants (e.g., mentors) regarding the literacy skills of preschool-aged children enrolled in Jumpstart, a national supplemental pre-kindergarten program designed to promote young children's language and literacy skills (see www.jstart.org for program description and background). Second, the study tested the extent to which the JSSC factor structure was invariant (or similar) across male and female samples.
Study findings are designed to contribute to the literature base regarding the psychometric properties of indirect or observational measures to assess young children's emergent language and literacy skills.

Participants
Study data included the item-level pretest data of males (n = 2,760) and females (n = 2,739) with complete data comprising the 2007-2008 JSSC dataset (N = 5,545). As reported in Table 1, females comprised 50.30% of the sample, and slightly less than half of the sample (47.75%) was enrolled in the Jumpstart program. The largest share of the sample was African American (40.19%), followed by Hispanic (31.40%), White (16.50%), Asian (6.55%), and other (5.37%). The majority of the children in the sample spoke English only (62.90%), and the average age was 48.78 months (SD = 6.14; range = 36 to 59 months).

Instrumentation
The JSSC is a 15-item observational rating form designed to assess preschool students' literacy skills. Each scale item relates to a specific area of literacy (e.g., using vocabulary, relating to adults) and corresponds to either the Language Arts or Social Relationships subscale. As shown in Table 2, the Language Arts sub-domain consists of 8 items and Social Relationships consists of 7 items. The instrument is administered as a pre- and post-test and completed by program providers (e.g., mentors), who rate each child's literacy skills across the items using a 5-point scale based on the child's demonstration of specific levels of literacy proficiency. The scale also collects student demographic information, such as date of birth, sex, and language spoken, among others.

Data Analysis
Due to the limited availability of information on the underlying JSSC factor structure, the sample was randomly divided in half to investigate scale dimensionality. CFA was used to test the JSSC two-factor model of the item-level data, based on the first random group data (n = 2,749). CFA was deemed appropriate since the JSSC is a theoretically-based instrument designed to measure preschool students' literacy skills across the domains of language arts and social behavior (Kline, 2005). Model specification entailed first fitting a correlated two-factor model to the data, with the first eight items specified to the Language Arts factor and the remaining seven items specified to the Social Relationships factor (see Table 2). Model-data fit was based on inspection of the statistical fit of the theoretical model to the data.
Due to the ordinal nature of the item-level data (e.g., Likert scale), robust weighted least squares (WLSMV; Muthén, du Toit, & Spisic, 1997) was used for parameter estimation using MPLUS 5.0 (Muthén & Muthén, 1998-2006). Model fit was evaluated in terms of the following fit statistics: chi-square statistic (WLSMV), root mean square error of approximation (RMSEA), and comparative fit index (CFI). The RMSEA provides a measure of the discrepancy between the actual and estimated variance-covariance matrix per degree of freedom. RMSEA values less than .06 were used to indicate good model fit and those less than .08 suggested reasonable fit (Hu & Bentler, 1999). The CFI provides a measure of the discrepancy between a restricted and null model in relation to the fit of the null model (Bentler, 1990), with values equal to or above .95 used to indicate adequate fit (Hu & Bentler, 1999).
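As a rough illustration of how these two indices are computed, the following Python sketch uses the standard textbook formulas. The function names are hypothetical, and the formulas are applied to a plain chi-square statistic; the WLSMV chi-square and degrees of freedom reported by MPLUS are adjusted quantities, so values computed this way may only approximate the software output.

```python
import math

def rmsea(chi2, df, n):
    """Textbook single-group RMSEA: excess chi-square per df, scaled by n - 1."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index: 1 minus the ratio of model to null noncentrality."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model, 0.0)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null

# Hypothetical chi-square values, for illustration only.
print(round(rmsea(chi2=300.0, df=50, n=1001), 3))
print(round(cfi(300.0, 50, 5050.0, 66), 3))
```

With the values reported for the two-factor model in the Results (a chi-square of 2,623.17 on 56 degrees of freedom, fit to the first random half, n = 2,749), `rmsea` returns a value that rounds to .13.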
Provided the two-factor model of the JSSC did not report acceptable model-data fit (as based on the above fit statistics), an EFA based on principal axis factoring with promax rotation was used to further investigate the scale's underlying factor structure based on the second random group data. The use of EFA to investigate scale dimensionality following a poorly fitting confirmatory-based model was intended to address the issue of finding an acceptable model due to chance when testing a series of modified models (MacCallum, Roznowski, & Necowitz, 1992). As the eigenvalue greater than 1.00 rule has been found to result in the overextraction of factors (Zwick & Velicer, 1986), factor retention was based on comparing the eigenvalues of the EFA to those obtained from a parallel analysis (Henson & Roberts, 2006; O'Connor, 2000). To obtain a parsimonious model that demonstrated simple structure (Thurstone, 1947), items reporting cross-loadings (>.30) on multiple factors were considered for removal from subsequent analyses (Hinkin, 1998). Subsequently, CFA was used to test the fit of the exploratory-based model.
Provided that a model was fit to the JSSC data that reported acceptable model-data fit, multisample confirmatory factor analysis (MCFA) was used to formally test the measurement invariance of model parameters (e.g., factor loadings; Millsap & Yun-Tein, 2004). MCFA was used to test the invariance of the following matrices: (a) factor loadings (pattern coefficients), (b) thresholds, and (c) error variances. Factor loadings report the strength of association between items and the underlying scale factors (e.g., Language Arts) and, thus, provide critical test score validity information (Keith, 1997). Thresholds indicate the location on the underlying trait continuum (e.g., Literacy) where a student would be assigned to a particular item response category (rating of 1, 2, 3, 4, or 5) (Bock & Gibbons, 2010). For rating scale data, there are m - 1 thresholds (where m equals the number of rating categories). For the JSSC, there are 5 - 1 = 4 threshold parameters for each item. The error variances deal with the amount of unexplained error in items, and represent the final parameters tested for invariance. Each model parameter provides relevant information on the functioning of JSSC scale scores.
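To make the roles of the loading, thresholds, and error variance concrete, the sketch below computes the probability of each of the five rating categories at a given trait level under a normal-ogive (probit) model for ordered-categorical items. The function name and all parameter values are hypothetical and chosen for illustration; they are not JSSC estimates.

```python
from statistics import NormalDist

def category_probabilities(eta, loading, thresholds, residual_var=1.0):
    """Probability of each ordered rating category at trait level `eta`.

    `thresholds` must be in increasing order; m categories need m - 1 values.
    """
    phi = NormalDist().cdf
    scale = residual_var ** 0.5
    # Probability that the latent response exceeds each threshold.
    exceed = [phi((loading * eta - tau) / scale) for tau in thresholds]
    cumulative = [1.0] + exceed + [0.0]
    # Adjacent differences give the probability of each category.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item: a loading of 1.0 and four ordered thresholds (5 categories).
probs = category_probabilities(eta=0.0, loading=1.0, thresholds=[-0.4, 0.1, 0.7, 1.5])
print([round(p, 3) for p in probs])
```

In this parameterization, a student whose trait level equals a threshold divided by the loading has a probability of .50 of being rated above that threshold, so a lower threshold in one group means that a higher rating is reached at a lower trait level in that group.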
Following Millsap and Yun-Tein (2004), invariance testing of ordered-categorical data was based on testing the statistical difference between a series of increasingly restrictive nested measurement models. The models differed in terms of the matrices (e.g., factor loadings) or parameters (e.g., thresholds) constrained equal across groups. First, an acceptable CFA model was fit to each group's data (referred to as the free model). This baseline (or free) model provided the basis for the subsequent test of the invariance of the factor loading matrix. This was conducted by constraining the matrix of factor loadings equal across groups to obtain the likelihood chi-square value of the constrained model. The statistical significance of the likelihood chi-square difference value (obtained by comparing the likelihood chi-square statistics of the free and constrained models) provided an indication of whether the factor loading matrix was invariant.
A nonsignificant chi-square difference statistic (based on use of the DIFFTEST option in MPLUS, as per Muthén & Muthén, 1998-2006) was used to judge the invariance of model parameters. Conversely, a statistically significant chi-square difference statistic indicated that the constrained model resulted in a decline in model-data fit and that at least one parameter in the matrix lacked invariance. Subsequently, each parameter within the matrix was individually tested for invariance (Reise, Widaman, & Pugh, 1993). Parameters found to lack invariance were specified to be freely estimated in subsequent invariance tests. Sequential tests of nested model comparisons were continued until all matrices (thresholds, residuals) and corresponding parameters were tested for invariance.
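The decision rule above rests on the p-value of a chi-square difference statistic. The sketch below, with a hypothetical helper name, only converts a given difference statistic and its degrees of freedom into an upper-tail p-value using the Python standard library. It does not reproduce the DIFFTEST computation itself: under WLSMV the difference statistic is not obtained by simply subtracting the two model chi-square values.

```python
import math

def chi2_upper_tail(x, df):
    """Upper-tail p-value of a chi-square statistic with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting values: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Difference statistic reported in the Results for the test of all loadings.
p_value = chi2_upper_tail(104.97, 18)
print(p_value < 0.01)
```

For the test of all factor loadings reported in the Results (a difference statistic of 104.97 on 18 degrees of freedom), the resulting p-value falls below .01, in line with the reported conclusion.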
As the chi-square difference statistic is known to reject the null hypothesis of equivalent model parameters in invariance testing based on trivial differences in large samples, the incremental changes of the CFI and RMSEA values were also used in the tests of measurement invariance (Cheung & Rensvold, 2002). Furthermore, the procedures for invariance testing using WLSMV, as well as the theta parameterization option in MPLUS to test error variance equality, were employed (Muthén & Muthén, 1998-2006). If the JSSC factor structure met at least partial measurement invariance (Byrne et al., 1989), across-group differences in the latent means were examined. Inspection of latent mean differences entailed constraining the latent mean of Group 1 to zero and freely estimating the latent mean of Group 2 (Muthén & Muthén, 1998-2006). An effect size estimate (as per Hancock, 2004) was used to indicate the magnitude of the difference between the latent means, with values interpreted as: small (0.2), medium (0.5), and large (0.8) (Cohen, 1988).
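A standardized latent mean difference of this kind is commonly computed by dividing the estimated difference in latent means by the square root of the pooled factor variance. The sketch below assumes that form, with a sample-size-weighted pooled variance; the function name and input values are hypothetical, and the exact estimator used in the study is not given in the text.

```python
def latent_mean_effect_size(mean_diff, var_1, var_2, n_1, n_2):
    """Standardized latent mean difference using a pooled factor variance.

    `mean_diff` is the Group 2 latent mean when the Group 1 mean is fixed to zero.
    """
    pooled_var = (n_1 * var_1 + n_2 * var_2) / (n_1 + n_2)
    return abs(mean_diff) / pooled_var ** 0.5

# Hypothetical estimates: a latent mean difference of 0.20 and factor
# variances of 1.00 and 1.10 in two groups of equal size.
print(round(latent_mean_effect_size(0.20, 1.00, 1.10, 500, 500), 3))
```

Because the Group 1 latent mean is fixed to zero, the freely estimated Group 2 mean is itself the difference being standardized.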

Results
A test of a two-factor model of the JSSC item-level data resulted in unacceptable model-data fit, χ²(56) = 2,623.17, CFI = .90, RMSEA = .13. A subsequent EFA supported a one-factor model, based on the retention of empirical factors by comparing the eigenvalues from the EFA to those obtained in a parallel analysis (Henson & Roberts, 2006; O'Connor, 2000). Eigenvalues for the first two factors, based on the EFA, were 9.61 and 0.53, whereas those based on the parallel analysis were 0.13 and 0.10, respectively. A factor is retained if its eigenvalue based on the EFA is greater than that obtained from the parallel analysis. It can therefore be inferred that a general dominant factor (Literacy) underlies the JSSC scale data.
A subsequent test of a series of CFA models suggested that the JSSC scale data may be modeled in terms of a bifactor model (Gibbons et al., 2007; Gibbons & Hedeker, 1992; Immekus & Imbrie, 2008). Figure 1 illustrates the path diagram depicting the final bifactor model of the JSSC item-level data, which reported acceptable model-data fit, χ²(52) = 2,107.15, CFI = .96, RMSEA = .08. Notably, to achieve a simple structure where each item reports a clear relationship to a designated factor, Item 10 was dropped from the analysis due to cross-loading on both sub-domains. As shown, all items were specified to load on a primary dimension, with specific items also allowed to load (or relate) on a secondary dimension (i.e., Language Arts, Social Relationships). The basis of the bifactor model is that each scale item is related to a primary dimension in addition to one sub-domain: Language Arts or Social Relationships. Table 3 reports results based on nested model comparisons between the bifactor model and competing one- and two-factor structures of the JSSC (Rindskopf & Rose, 1988). Model comparisons supported conceptualizing the JSSC in terms of a bifactor model (Figure 1) with a primary (Literacy) dimension and two sub-domains: Language Arts and Social Relationships. The bifactor model is well suited for modeling survey data that are typically based on sampling items from sub-domains (e.g., Language Arts, Social Relationships) situated within a broader domain (e.g., literacy). The bifactor model (Figure 1) reported acceptable model-data fit for the data of males (χ² = 1,088.18, p < .01) and females (χ² = 1,039.39, p < .01), as well as for the entire sample, χ²(101) = 2,127.57, p < .01, CFI = .96, RMSEA = .08. Although the chi-square statistic was statistically significant (p < .01), the statistic is well known to be influenced by sample size, and the other fit statistics were acceptable (e.g., CFI > .95).
Table 4 reports the factor loadings and residual errors of the items across groups.
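The bifactor structure described above can be written compactly using the usual latent response formulation for ordinal items. The notation is an assumption introduced here for illustration and does not come from the original analysis: G denotes the primary (Literacy) factor, s(j) the single sub-domain to which item j belongs, and the thresholds are bounded by minus and plus infinity. Bifactor models typically also specify the primary and sub-domain factors as uncorrelated.

```latex
% Latent response for child i on item j: one primary and one sub-domain loading
y^{*}_{ij} = \lambda_{jG}\,\eta_{iG} + \lambda_{j,s(j)}\,\eta_{i,s(j)} + \varepsilon_{ij}

% Observed rating c = 1, ..., 5 determined by four ordered thresholds per item
y_{ij} = c \quad \text{if} \quad \tau_{j,c-1} < y^{*}_{ij} \le \tau_{j,c},
\qquad \tau_{j,0} = -\infty, \quad \tau_{j,5} = +\infty
```

Invariance testing then asks whether the loadings, the thresholds, and the variances of the error terms take the same values in the male and female samples.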
Based on an acceptable fit of the bifactor model to the data, analyses proceeded to testing model parameters for invariance. The initial test that all factor loadings were equal across groups indicated that at least one lacked invariance, χ² difference (df = 18) = 104.97, p < .01. Subsequently, a test of the invariance of the factor loadings on the primary (Literacy) factor was conducted, which indicated that one or more loadings differed across groups, χ² difference (df = 12) = 51.91, p < .01. A sequential test of the invariance of each factor loading indicated that the factor loadings of the following items differed across groups on the primary Literacy factor: 4, 5, and 13. Last, a test of the factor loadings comprising the Social Relationships sub-domain indicated a lack of invariance among the parameters, χ² difference (df = 4) = 62.27, p < .01. A follow-up test of the invariance of each factor loading on the Social Relationships factor indicated that the following three parameters lacked invariance: Item 11, Item 14, and Item 15.
Table 5 reports the threshold parameters for each item across sex groups. Threshold parameters indicate the location on the underlying trait continuum (Literacy) where a student would be assigned by a rater (e.g., mentor, teacher) to a particular item response category (e.g., 2, 3, 4, or 5). Therefore, for Item 1, the point on the underlying trait continuum where males and females would be more likely to receive a rating of 2 instead of a 1 would be approximately -.40 (almost half a standard deviation below the mean of 0 on a z-score scale with mean = 0 and standard deviation = 1), and .10 (above the mean) for a rating of 3 instead of a 2.
www.ccsenet.org/jedp Journal of Educational and Developmental Psychology Vol. 3, No. 1;
For Item 5, which lacked invariance, the 2-3 threshold (being assigned to category 3 over category 2) indicated that males and females had different points on the underlying continuum for receiving a rating of 3. In this case, females had a higher likelihood than males of being rated in this category at a lower trait level. That is, compared to males, females were more likely to be assigned a rating of 3 at lower levels of Literacy.
Results indicated that one or more threshold parameters differed across groups, Χ 2 Difference (df = 27) = 117.04, p < .01. Subsequently, the four thresholds (one less than number of response categories) for each item were tested for equality across groups. Results indicated that specific thresholds (see Table 12) for the following items lacked invariance: 5, 6, 7, 11, 12, 13, and 15.
Of less importance was the degree to which each item's residual error was similar across gender groups. Residual errors are reported in Table 11. A lack of invariance among error terms indicated that there were different amounts of error in each group's item scores. A test of the similarity of the JSSC item error terms indicated that one or more lacked invariance, χ² difference (df = 11) = 43.36, p < .01. In particular, the following items' error terms lacked invariance: 2, 5, 7, 9, 13, and 15. Notably, a lack of invariance among error terms is often not considered in applied research, and a finding of error term invariance would represent a very strict level of invariance.
Based on the JSSC factor structure demonstrating partial measurement invariance (Byrne et al., 1989), a follow-up comparison of latent mean differences was conducted. Results indicated that females had higher standing on the underlying traits than males. Based on Cohen's (1988) interpretation of effect sizes, the magnitude of the difference in favor of females was small across the latent factors: primary (Literacy) factor (.17), Language Arts (.16), and Social Relationships (.09).

Discussion
loadings, thresholds, and error variances. Factor loadings are considered the most important model parameters since they indicate the strength of the relationship between the items (observed variables) and factors (unobserved variables) (Keith, 1997). More specifically, factor loadings indicate the degree to which an item measures a particular factor and deal directly with test score validity. The finding that the factor loadings of Items 4, 5, and 13 on the primary (Literacy) factor lacked invariance indicates that these items are not equally discriminating across male and female preschool-aged children. That is, Items 4 and 5 were slightly more discriminating for males than females, whereas Item 13 was more discriminating for females than males. To recall, discrimination deals with an item's ability to tease apart differences between students with low and high literacy skills.
Specific factor loadings also lacked invariance on the secondary factors of Language Arts and Social Relationships. Again, the lack of invariance of the secondary loading of Item 7 on the Language Arts factor indicated that the item was more discriminating for females than males. Likewise, the lack of invariance for Item 11, Item 14, and Item 15 on the Social Relationships factor indicated that each item was more discriminating for females than males. The largest difference was for Item 15, whereas only slight differences were found for Items 11 and 14.
From a practical standpoint, the findings of certain items being more discriminating for males than females, and vice versa, indicate areas for further consideration, not necessarily scale revision. For example, the factor loading of Item 4 on the primary dimension reported the largest discrepancy between males and females. The item deals with the awareness of sounds in words and was more discriminating for males. Practically speaking, this indicated that it was easier for raters (i.e., mentors) to identify males who could or could not show awareness of the sounds in words than females. In terms of the factor loading of Item 7 on the Language Arts factor, the item was more discriminating for females. Therefore, raters were more likely to identify females who were or were not beginning to read than males; or, it was more difficult to identify (or discriminate between) males who were beginning to read.
On the Social Relationships factor, Items 11, 14, and 15 were slightly more discriminating for females. That is, raters (e.g., teachers, mentors) were more readily able to discriminate among female students than among male students in initiating play (Item 11), relating to adults (Item 14), and relating to other children (Item 15). Overall, these findings point to areas of consideration to determine why those completing the JSSC may be better able to discriminate among the literacy skills of one sex than the other. This represents a potential area for more research (e.g., factors associated with students' demonstration of literacy skills and subsequent observer ratings), not necessarily a finding that warrants revising the JSSC. That only a few factor loadings lacked invariance provides evidence that the items seem to measure the intended traits roughly the same across gender groups.
Several of the item thresholds were also found to lack invariance across gender groups. The finding of threshold differences deals with the ability level of the student and his/her likelihood of being assigned to a particular response category, such as: Strongly Disagree, Disagree, Neutral, Agree, and Strongly Agree. For example, consider Item 5 (demonstrating knowledge about books), which reported a lack of invariance for thresholds 2 (point on scale of going from a rating of 1 to 2), 3 (point on scale of going from a rating of 2 to 3), and 4 (point on scale of going from a rating of 3 to 4). For this item, males had a reported threshold of -.13 and females had a value of -.23 for threshold 2, which indicated the point on the underlying trait continuum for a probability of receiving a score of 2 instead of 1. Here, females with lower literacy skills were more likely to receive a score of 2 than males, who needed a higher trait level (-.13) to receive a rating of 2 on Item 5. Similarly, the trait level for females to receive a rating of 4 on the item was .65, compared to .77 for males. Thus, depending on the item, the trait level needed to be assigned to a particular categorical rating differed across gender groups for certain JSSC items. Similar to the finding of factor loading differences, the lack of invariance among certain thresholds provides a basis for more research into why a rater (e.g., mentor) may not assign the same score to a student based on gender.
Although not as critical as the aforementioned findings, specific item error terms differed across groups. Error terms that differed across groups included those of Items 2, 5, 7, 9, 13, and 15. The lack of invariance among error terms indicated that there are different amounts of unexplained variance in the item scores of males and females, after accounting for the variance explained by the latent traits (i.e., Literacy, Language Arts, & Social Relationships). This finding is not surprising given the types of error that may influence student ratings on the JSSC, including: experience level of the rater, students' demonstration of literacy skills, and raters' familiarity with the student being rated, among many others.
for a comparison of latent mean differences across gender groups. A comparison of male and female latent means is desirable because it accounts for differences in the measurement properties of the instrument itself. In the present study, females were found to have higher standing on the latent traits underlying the JSSC item-level data (i.e., Literacy, Language Arts, & Social Relationships) than males. For the primary dimension of Literacy, the difference between the latent means of males and females was small, as based on the effect size of .17. Similarly, the difference in the latent means of males and females on the Language Arts sub-domain was small (effect size equal to .16), whereas negligible differences were found on the sub-domain of Social Relationships (.09). Notably, these differences should be interpreted cautiously given the finding of partial measurement invariance.
The JSSC was found to display partial measurement invariance. Whereas specific factor loadings, thresholds, and error variances were found to differ across groups, such differences are not likely to impact programmatic decisions based on test scores. Rather, these findings suggest some areas of future consideration, such as why certain literacy skills, such as showing awareness of sounds in words (Item 4), discriminate between high and low ability males more than females, and vice versa. One way to further explore this issue is to test the generalizability of these findings by conducting the same analyses based on more recent Jumpstart data; alternatively, conducting focused interviews with raters (e.g., mentors) to gauge the factors they consider when assigning a particular rating to a student may provide a valuable source of information on this topic. As such, current findings do not necessarily lend themselves to specific recommendations to modify the JSSC instrument itself. Instead, the results point to areas for further inquiry into the judgments of raters when rating the literacy skills of male and female students.