DecodeME: the biggest ME/CFS study ever

The first results are in from DecodeME, the largest research project ever undertaken on myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS). More than 15,000 patients shared their DNA to uncover the underlying pathology of the disease. The results show 8 hits: regions in the human genome where ME/CFS patients differ significantly from controls. Most of the implicated genes point to the brain and nervous system.

The letters of our genetic code

Much has already been written about the DecodeME preprint. But because it’s such a major study, it’s worth exploring the methodology and results in more depth.

Let’s start with what is measured and compared. All ME/CFS participants mailed their DNA to the researchers using saliva samples. DNA has a structure that looks like a twisted ladder with four possible letters in between: A, T, C, and G. These letters represent nitrogenous bases, the fundamental building blocks of our genetic code. They pair in a specific way: A pairs with T on the other strand, while C pairs with G.

Human DNA has 3.2 billion of these ‘base pairs’, but most of them are the same in everyone or not relevant. What matters are the letters that vary among people and determine the way we function. These are called single nucleotide polymorphisms (SNPs – pronounced as ‘snips’). Some people may, for example, have a letter T where others have a letter C. The DecodeME study tested more than 8 million of these SNPs.

Below is an example of a single SNP. It indicates if someone has a letter T rather than a C at a specific location in the human genome. Although a location can have multiple variants, there are usually only two, which are called alleles.

17:52183006:C:T
17 is the chromosome
52183006 is the location on that chromosome
C is the non-effect allele
T is the effect allele

In this example, one of the 8 hits from DecodeME, ME/CFS patients more often had the letter T at this location than controls. Unfortunately, the location can be ambiguous because there are different ‘builds’ of our genome where the same number may refer to different regions. Therefore, SNPs are sometimes referred to by an unambiguous rsID. For example, rs34626694 is the rsID for our SNP (17:52183006:C:T) mentioned above.
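To make the notation concrete, here is a minimal Python sketch that splits such an identifier into its four parts. The `Variant` class and the `parse_variant` function are our own illustrative names; they are not part of any DecodeME tooling.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """One SNP written as chromosome:position:non-effect allele:effect allele."""
    chromosome: str
    position: int
    non_effect_allele: str
    effect_allele: str


def parse_variant(variant_id: str) -> Variant:
    """Split an identifier such as '17:52183006:C:T' into its four fields."""
    chromosome, position, non_effect, effect = variant_id.split(":")
    return Variant(chromosome, int(position), non_effect, effect)


hit = parse_variant("17:52183006:C:T")
print(hit.chromosome, hit.position, hit.non_effect_allele, hit.effect_allele)
```

Note that the position only has meaning relative to a genome build, so in practice you would keep track of the build alongside the identifier.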

Linkage disequilibrium (LD)

Genetic research doesn’t focus on a single SNP, though. This is because SNPs are often highly correlated with each other. If a person has a letter A here, then she’s likely to have a letter T there, and so forth.

The reason for this is that our DNA letters are not fully independent. Some are often inherited together. The technical term for this is ‘linkage disequilibrium’ (LD), and it’s key to understanding genetic studies. LD is the reason why scientists can test so many DNA letters at relatively low cost. They only need a few good reads to infer all the others. DecodeME, for example, only used 400,000 well-measured SNPs. The millions of others were imputed.

LD is also crucial to understanding the results. Rather than a single SNP sticking out, we usually have a group of correlated SNPs that are different between patients and controls. If your data doesn’t include the top SNP, you can check one of the SNPs nearby that is highly correlated with it. That’s why the DecodeME preprint speaks of ‘loci’, regions in the genome where patients and controls tend to have different letters.
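LD between two SNPs is commonly expressed as r², the squared correlation between the letters found together on the same chromosome copy. The sketch below computes it for two made-up sets of haplotypes, where each pair records whether a chromosome carries the variant letter (1) or not (0) at SNP 1 and SNP 2. The data are invented for illustration and have nothing to do with the DecodeME cohort.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs from (snp1, snp2) allele pairs coded 0/1.

    Assumes both SNPs actually vary in the sample; otherwise the
    denominator is zero.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # deviation from independence
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical haplotypes: the two letters mostly travel together ...
linked = [(1, 1)] * 4 + [(0, 0)] * 5 + [(1, 0)] * 1
# ... versus every combination being equally common.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)] * 3

print(round(ld_r_squared(linked), 3))
print(round(ld_r_squared(unlinked), 3))
```

An r² close to 1 means that knowing the letter at one SNP almost tells you the letter at the other, which is what makes imputation possible.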

In between genes

Much like letters on a page can form a word, the letters of our DNA can form a gene. A gene is a stretch of DNA with instructions for making one of the many molecules our body needs to function. Humans have around 20,000 to 25,000 genes.

Sometimes having a different letter in a gene doesn’t make any difference, like using synonymous words. In other cases, it completely disrupts how a protein works, resulting in diseases such as sickle cell disease or cystic fibrosis.

Most diseases, however, do not result from a single mutation like the examples above. Large genetic studies have shown that for most illnesses, the genetic risk is spread out over many (often hundreds of) different loci in the genome. Most often, the implicated SNPs do not lie within genes but between them. They do not change the gene itself or the molecule that it makes. Instead, they slightly alter how much the gene is turned on or off, like a volume knob.

Let’s take a look at one of these loci on chromosome 17 using the DecodeME data. On the x-axis, we have the genomic position of almost 3000 tested SNPs in this region. The one in the middle at the very top is our SNP (17:52183006:C:T) that we looked at earlier.

We created this graph ourselves using DecodeME summary data, but Supplementary Figure 5E in the preprint shows the region as well. We zoomed out to have extra context.

The y-axis shows p-values for the comparison between patients and controls. These indicate how unlikely the data would be if there were no group differences. Values lower than 0.00000005 (a 5 preceded by 8 zeros – shown in the graph by a dotted line) are considered statistically significant and indicate that differences between groups are unlikely to have arisen by random variation. In the graph, we use the –log10 of p-values, so the higher on the plot, the lower the p-value, and the more unusual the SNP.
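The transformation used on the y-axis is easy to reproduce. The short Python sketch below converts the genome-wide significance threshold from the text into its height on the plot; the function name `minus_log10` is our own, and the p-values you pass in would come from the summary statistics.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # the significance cut-off quoted in the text


def minus_log10(p_value: float) -> float:
    """Height of a SNP on the plot: smaller p-values give larger values."""
    return -math.log10(p_value)


threshold_height = minus_log10(GENOME_WIDE_THRESHOLD)
print(round(threshold_height, 2))  # 7.3
```

So any SNP that sits above roughly 7.3 on the y-axis passes the significance threshold.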

The location of protein-coding genes is added in color below the SNPs. The significant SNPs at the top fall near the gene CA10. This suggests that these SNPs may influence the expression of this gene.

Small effects

Because genetic risk is spread out over many SNPs, each accounts for only a very small effect. Indeed, the difference in prevalence of the SNP hits is only about 1 to 2 percentage points between patients and controls. The top SNP on chromosome 17, for example, was present in 34.7% of ME/CFS patients compared to 32.9% in the control group. Not exactly a big difference.

| SNP | Combined prevalence | Odds ratio | Prevalence ME/CFS | Prevalence controls |
| --- | --- | --- | --- | --- |
| 1:173846152:T:C | 0.325 | 0.927 | 0.3095 | 0.3259 |
| 6:26239176:A:G | 0.261 | 1.086 | 0.2763 | 0.2601 |
| 6:97984426:C:CA | 0.546 | 0.934 | 0.5300 | 0.5470 |
| 12:118202773:C(T^13):C | 0.139 | 1.100 | 0.1501 | 0.1383 |
| 13:53194927:GT:G | 0.287 | 1.077 | 0.3015 | 0.2861 |
| 15:54866724:A:G | 0.312 | 1.082 | 0.3282 | 0.3110 |
| 17:52183006:C:T | 0.330 | 1.084 | 0.3470 | 0.3290 |
| 20:48914387:T:TA | 0.634 | 1.095 | 0.6536 | 0.6328 |

We estimated the prevalence in patients and controls using the odds ratio and frequency of the effect allele in the combined cohort from Table 3 in the preprint.
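For readers who want to reproduce this kind of estimate, here is a rough Python sketch of one way to do it. It searches for the control frequency at which the odds ratio and the weighted average of cases and controls both match the reported numbers. The share of cases in the combined cohort (`CASE_FRACTION`) is an assumption on our part, based on roughly 15,600 cases and 260,000 controls, and the function names are ours; this may differ in detail from how the numbers in the table were derived.

```python
def case_frequency(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by the control frequency and the odds ratio."""
    case_odds = odds_ratio * control_freq / (1 - control_freq)
    return case_odds / (1 + case_odds)


def split_frequency(combined_freq, odds_ratio, case_fraction, steps=50):
    """Return (case_freq, control_freq) whose weighted mean equals combined_freq."""
    low, high = 0.0, 1.0
    for _ in range(steps):  # bisection on the control frequency
        control = (low + high) / 2
        combined = (case_fraction * case_frequency(control, odds_ratio)
                    + (1 - case_fraction) * control)
        if combined < combined_freq:
            low = control
        else:
            high = control
    return case_frequency(control, odds_ratio), control


# ASSUMPTION: approximate share of cases in the combined cohort.
CASE_FRACTION = 15_579 / (15_579 + 259_909)

# Combined frequency and odds ratio of the chromosome 17 hit from the table.
case_freq, control_freq = split_frequency(0.330, 1.084, CASE_FRACTION)
print(round(case_freq, 3), round(control_freq, 3))
```

Because controls vastly outnumber cases, the control frequency ends up very close to the combined frequency, and the odds ratio then determines how far the case frequency sits from it.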

The eight SNP hits that were associated with ME/CFS are common and occur in roughly 14% to 63% of the general population. In other words, these do not determine if you have ME/CFS or not. These SNPs are just the tip of the iceberg. DecodeME found eight significant ones, but as with other human traits and diseases, there are probably hundreds of SNPs that contribute to the risk of having ME/CFS.

A pointer to the problem

So why are these SNPs important if they have such small effects? There are two main reasons. The first is that the effects themselves don’t matter much because the SNPs are pointers to a problem, not the problem itself. They are just clues to what might be going wrong in the body. Even when the effect of the pointer itself is small, it can still signal a major problem.

A recent paper in Nature backs this up. It examined data on drug development and found that effect sizes from genetic studies did not influence the chance that a drug will be successful. The authors give the example of HMGCR, an enzyme involved in cholesterol synthesis. Variants in the HMGCR gene identified by genetic studies have only a small effect. However, drugs such as statins, which inhibit HMGCR, produce substantial reductions in cholesterol and cardiovascular risk.

How is this possible? We formulated a (made-up) analogy that might help to understand the issue, but you can skip this part if you are already on track.

SNPs as pointers – the dam analogy
Suppose an illness is caused by a structure somewhere in the body that lets cells through that it should hold back, like a dam that is breaking. There is a gene X that helps to create a simple protein that is one of many support structures in the dam. An SNP nearby acts like a volume knob for gene X, turning it down and making it a little bit harder to synthesise the support protein. The difference is minor and this protein is only a minuscule part of what holds the dam. There are many other mechanisms involved in strengthening the dam that involve feedback loops and complex interactions. These are much more important, but gene X is simple and straightforward. It only has one job.

In genetic studies of the illness, the SNP close to gene X might show up with a small effect size. The other genes might not show up because the mechanisms are too complex and intertwined, or there is a signal, but it’s ambiguous and hard to interpret. Luckily, gene X points to the problem: the dam is breaking! Scientists now understand what is causing the illness. They know a lot of biology, so they can do much more to support the dam than gene X could by coding its little protein. They can create drugs that ensure the dam no longer breaks.

So, in this analogy, the dam might be breaking even in those who do not have the SNP variant that was lowering the expression of gene X. And fixing the dam might cause physiological changes and benefits that are out of proportion to the effect size of gene X. This is a simplified and made-up analogy (the dam-analogy has nothing to do with ME/CFS and DecodeME), but we hope it helps to show why the effect sizes found in genetic studies are not crucial: it’s what they are pointing to that matters!

The wonder of DNA

There’s another reason why the SNPs are important despite their small effects: our genetic code is set at birth and doesn’t change.

In other ME/CFS studies, there’s a constant uncertainty that differences might be due to deconditioning, a different sleep-wake rhythm, or the supplements and drugs patients are taking. DNA is a smart way around this problem. It comes first, before ME/CFS and lifestyle changes. When DNA variants are associated with ME/CFS, they suggest a causal relationship with the disease.

The big sample size of DecodeME is like a microscope that allows us to zoom in on ME/CFS, closer than ever before. It’s the highest-resolution picture of the illness that we have. And because it looks at DNA, we know that it uncovers causal relations. It’s not only a microscopic image of the illness but also a sneak peek under the hood.

What do the SNPs point to?

You’re now probably wondering what the sneak peek reveals: what do the SNPs point to? Unfortunately, it’s not so easy to tell. In our example on chromosome 17, there was only one protein-coding gene nearby. But many other loci are stacked with genes, and we do not know which ones are causally related to ME/CFS (it could be one or multiple).  

Because of LD, the genetic picture we get is blurred and ambiguous. Take, for example, the region on chromosome 1 that was associated with ME/CFS. The graph below shows it’s a crowded place. The DecodeME preprint highlights 11 genes in this region, especially RABGAP1L, but we don’t know which ones are truly related to ME/CFS and which ones are accidental bystanders.

We created this graph ourselves using DecodeME summary data, but Supplementary Figure 5A in the preprint shows the region as well. We zoomed out to have extra context.

In addition to that, each gene is usually involved in multiple biological pathways and expressed in various tissues. So one quickly gets a complex web of potential explanations of what the SNP signal might mean. Our next blog post will go deeper into this. For now, we’ll zoom out and take a bird’s-eye view. Using a tool called MAGMA, researchers can take all implicated genes and see where in the body they are expressed. The answer for ME/CFS is, overwhelmingly, in the brain. As the graph below shows, the significant tissues (indicated in red) were all located in the brain.

This plot was created using FUMA; it’s similar to Figure 3 in the preprint.

Many of the implicated genes in DecodeME point to the development and communication of neurons. If ME/CFS were a war, then the brain would be its main battlefield. There are some potential pointers to the immune system as well, but these are less clear.

With another tool called LD Score Regression (LDSC), we can also look at the genetic correlation between ME/CFS and other diseases registered in the UK Biobank. There were substantial correlations with many illness categories, especially those related to gut problems, fatigue, pain, and depression.

| Trait in the UK Biobank | Genetic correlation (rg) | Bonferroni-corrected p-value |
| --- | --- | --- |
| Non-cancer illness code, self-reported: irritable bowel syndrome | 0.75 | 0.00015 |
| Non-cancer illness code, self-reported: chronic fatigue syndrome | 0.70 | 0.00005 |
| Sleeping too much | 0.66 | 0.00028 |
| Number of things worried about during worst period of anxiety | 0.61 | 0.02336 |
| Treatment/medication code: amitriptyline | 0.61 | 0.00000 |
| Recent feelings of tiredness or low energy | 0.61 | 0.00000 |
| Mental health problems ever diagnosed by a professional: Depression | 0.60 | 0.00000 |
| Never eat eggs, dairy, wheat, sugar: Dairy products | 0.60 | 0.00096 |
| Diagnoses – main ICD10: M47 Spondylosis | 0.59 | 0.01585 |
| Frequency of tiredness / lethargy in last 2 weeks | 0.57 | 0.00000 |

The top 10 genetic correlations between DecodeME and 3,167 traits in the UK Biobank. Calculated using BIGA GWAS: https://bigagwas.org.
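A note on the last column: when thousands of traits are tested at once, some will look significant by chance alone. The Bonferroni correction guards against this by multiplying each raw p-value by the number of tests, capped at 1. The Python sketch below shows the arithmetic; the raw p-value is a made-up example, and we assume the number of tests equals the 3,167 traits mentioned in the caption.

```python
N_TRAITS = 3167  # number of UK Biobank traits compared (from the caption)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-corrected p-value: raw p-value times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


raw_p = 1e-6  # hypothetical raw p-value, for illustration only
corrected_p = bonferroni(raw_p, N_TRAITS)
print(corrected_p)
```

A correlation only stays significant after correction if its raw p-value was small enough to survive being multiplied by a few thousand.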

There were also significant correlations with schizophrenia (rg = 0.53) and childhood asthma (rg = 0.31), while there were none with multiple sclerosis, rheumatoid arthritis, Crohn’s disease, and diabetes (types 1 and 2). We wouldn’t take these data too seriously, however, because for some categories, such as schizophrenia, the correlation changed dramatically based on how the diagnosis was recorded. For many other illnesses, such as lupus or autism, no correlation data were available.

Modest heritability

Lastly, DecodeME also provided an estimate of the heritability of ME/CFS, which was 9.5%.

This estimate is called the ‘SNP-based heritability’, and it’s based on LD Score Regression. This method checks whether SNPs that are in strong linkage disequilibrium with others tend to show stronger statistical signals. SNPs that are in high LD with other SNPs represent more of the underlying heritability of ME/CFS than those that are not. Therefore, they should have more significant results. The slope of the regression gives an indication of how much this is the case.
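The idea can be illustrated with a toy simulation in Python. Under the standard LD Score Regression model, the expected test statistic of a SNP is 1 + N × h2 × (LD score) / M, where N is the sample size and M the number of SNPs. The sketch below simulates statistics that follow this relationship and recovers h2 from the slope of a simple regression. All numbers are invented, and the real LDSC software does considerably more (regression weights, an intercept for confounding, standard errors, conversion to the liability scale), so treat this as an illustration of the principle only.

```python
import random


def simulate_chi_square(ld_scores, n_samples, h2, rng):
    """Draw one test statistic per SNP with mean 1 + N * h2 * ld / M."""
    m = len(ld_scores)
    stats = []
    for ld in ld_scores:
        expected = 1 + n_samples * h2 * ld / m
        # A squared standard normal has mean 1, so this has mean `expected`.
        stats.append(expected * rng.gauss(0, 1) ** 2)
    return stats


def ols_slope(x, y):
    """Slope of an ordinary least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    return covariance / variance


rng = random.Random(0)   # fixed seed so the toy example is repeatable
N_SAMPLES = 50_000       # hypothetical sample size
N_SNPS = 20_000          # hypothetical number of SNPs
TRUE_H2 = 0.1            # heritability used to generate the toy data

ld_scores = [rng.uniform(1, 200) for _ in range(N_SNPS)]
chi_square = simulate_chi_square(ld_scores, N_SAMPLES, TRUE_H2, rng)

# The slope equals N * h2 / M, so rescale it to get the heritability back.
estimated_h2 = ols_slope(ld_scores, chi_square) * N_SNPS / N_SAMPLES
print(round(estimated_h2, 3))
```

The estimate should land close to the value of 0.1 that was used to generate the data, which is the logic behind reading heritability off the regression slope.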

It’s a useful measure, but research has shown that SNP-based estimates are usually considerably lower than heritability estimates from twin studies (this is referred to as the ‘missing heritability problem’). The most useful interpretation of SNP heritability is probably to compare it to similar estimates for other diseases. You can search for similar heritability estimates in the UK Biobank using the ‘UKB SNP-Heritability Browser’ provided by the Neale Lab. Some examples are given in the table below.

| Phenotype | Heritability (liability scale) |
| --- | --- |
| Diagnoses – main ICD10: F20 Schizophrenia | 0.259 |
| Diagnoses – main ICD10: K50 Crohn’s disease [regional enteritis] | 0.241 |
| Type 1 diabetes | 0.219 |
| Diagnoses – main ICD10: G35 Multiple sclerosis | 0.117 |
| Depression | 0.237 |
| Non-cancer illness code, self-reported: asthma | 0.170 |
| Non-cancer illness code, self-reported: chronic fatigue syndrome | 0.0879 |
| Non-cancer illness code, self-reported: fibromyalgia | 0.0112 |

Several estimates of SNP-heritability from traits in the UK Biobank provided by the Neale Lab.

The heritability of ME/CFS turns out to be fairly typical: it was similar to the mean heritability of all traits in the UK Biobank (h2 = 0.1). Compared to many other diseases, however, it was modest, less than, for example, the estimates for schizophrenia, Crohn’s disease, or Type 1 diabetes.

What’s up next?

The DecodeME data we discussed was only a first analysis, which hasn’t been peer-reviewed yet. The authors plan to do a more in-depth analysis, for example, of the sex chromosomes and the HLA region. The latter is a location on chromosome 6 that is stacked with genes involved in autoimmune diseases.

Also, DecodeME looked only at common SNPs, where the frequency of the minor allele was 1% or higher. Luckily, the authors have plans to study the rare SNPs as well in a study called SequenceME. Rare SNPs might show larger and clearer effects than the common SNPs analyzed in DecodeME.

In our next blog article, we will take a closer look at the eight hits from DecodeME and try to figure out which genes are causally related to ME/CFS. We will also analyze whether the DecodeME results might be due to confounding or selection bias. And we will make a case that most of the implicated genes point to neurons and their synapses in the brain. Stay tuned!

Acknowledgement

Many thanks to forestglip and others on the Science for ME forum for their analysis of the DecodeME results and for helping us explore many complex tools for genetic studies such as FUMA, MAGMA, Locus Zoom, BIGA GWAS, and LDSC.
