Event ID: 967292
Event Started: 4/3/2008 8:28:10 AM ET
Please stand by for realtime transcript.

Captioner: I am unable to access audio at the link that was given https://webmeeting.N I H.gov/R 43516274/course.

Captioner: I am connected to the site now and waiting for the class to begin. Please stand by for realtime captioning.

Can you hear me?

Hello, this is Kate. I am testing the voice-over IP.

Captioner: I can see your chat messages, but I cannot hear audio.

Captioner: No, I still cannot hear you.

[Captioner unable to send captions due to technical difficulties on client end. There is video, but no audio. Standing by. If there are any further questions, please contact Caption Colorado at 1-800-590-4197.]

[Captioner still without audio. If there are any further questions, please contact Caption Colorado at 1-800-590-4197.]

Now, when you are searching, they have set the maximum target sequences that you will get back at 100. That is a very minimal number to see. If you are looking at something distantly related, you will probably not get it in that first 100 sequences. However, you can choose to have up to about 20,000 sequences come back. Most of the time you won't receive that many, but you could choose to have that. There are other ways. You can choose, but it is possible to pull that number back. And some of the parameters -- remember, I showed you that you have to reformat the results.

If you take this default to start with, it will not let you choose any more. That is the upper limit on that set of results. So I always recommend that people set it to at least 1,000. That way they can go back when they reformat their results and see more than they saw before. If you don't do that, all you are going to see is the top 100. Okay? This I told you about: automatically adjust parameters for short sequences. This used to be a completely separate search. But because we used Google it caused lots of problems. They just simply incorporated it in. It automatically looks for short sequences.

This is our expect threshold. It is set at 10. Ten is a huge number for an expect threshold. What that means is that a search I do is set so loosely that I could find a false positive hit. The reason that they allow this is because they want people to bring everything back even if it is not a real hit. They want you to get as much back as you can. The default used to be 500. I would bet it is because they do a trillion searches a year. We will see what they do with their competitional file. So the expect threshold can be changed. You can change that. If you want to see things that are more closely related, you can set that limit to a much smaller number. This means you're only going to look at what is very closely related. And I'll show you what those results look like. Right. If you look at the results, they give you the chance that this is a false positive.

This gets at more about how this works. The algorithm is basically a two-hit process. This is actually easier to show in protein. In proteins the word size is three. With amino acids you have to have three. It is set at seven in nucleotides. What happens is we put together every combination possible of those seven, and it has it in what is called a [ indiscernible ]. So as it goes along it can see what matches to what part of that table.
It has scores assigned based upon the match. In nucleotides the scores are derived from a match and a mismatch. So for every match I get plus two, and for every mismatch I get minus three. What happens as it goes along is it looks at all of those. If a match does not rise above a certain minimum score, which you can set, it throws it out. It does not look any longer at that particular match in this table. Then if it does hit above that threshold score, it goes down and extends this search in either direction by 25 nucleotides. Then it looks again for a score above or equal to that threshold score. If it does not find it by extending out [ indiscernible ], it throws it out. This makes it very powerful. The first algorithm that came about for searching was called [ indiscernible ]. At that time there were probably about, I don't know -- this was back in the 1980s. There were probably only about 20,000 sequences in the database. With the computers at that time, it took 24 hours to get a match. So you can see this is running against billions of sequences and finding new matches in seconds. But when you do that, when you increase speed, you decrease sensitivity. Okay? You're throwing out things that may actually have some hits. We have enough sequences out there now that we're probably going to find everything you're interested in within these parameters, but there is an area of matching that is hard to distinguish. There is no set line where you can say that is a match and that isn't a match. When you get down to that level across your entire sequence, you're getting into the twilight zone. You don't really know. This is where you have to go back to the bench. You take this finding and set up experiments to determine whether that is a real match. That is up to the scientists.
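The word table and extension step described above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it seeds on a single exact word hit rather than two hits, extends without gaps, and stops extending when the running score falls a fixed amount below the best score seen. The function names, the `threshold` value, and the `drop` value are assumptions made for the example; only the word size of 7 and the +2/-3 scores come from the session.

```python
from collections import defaultdict

MATCH, MISMATCH = 2, -3   # match/mismatch scores quoted in the session
WORD_SIZE = 7             # nucleotide word size quoted in the session


def build_word_table(query, w=WORD_SIZE):
    """Map every length-w word in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def extend_one_way(query, subject, q, s, step, drop):
    """Ungapped extension in one direction; returns (best score gain, length used)."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += MATCH if query[q] == subject[s] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop:   # hypothetical drop-off rule
            break
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, w=WORD_SIZE, threshold=20, drop=10):
    """Return (query_start, subject_start, length, score) for hits >= threshold."""
    table = build_word_table(query, w)
    hits = set()
    for si in range(len(subject) - w + 1):
        for qi in table.get(subject[si:si + w], ()):
            right, rlen = extend_one_way(query, subject, qi + w, si + w, +1, drop)
            left, llen = extend_one_way(query, subject, qi - 1, si - 1, -1, drop)
            score = w * MATCH + left + right
            if score >= threshold:
                hits.add((qi - llen, si - llen, llen + w + rlen, score))
    return sorted(hits, key=lambda h: -h[3])


query = "GATTACAGCCTGAATCGTCA"
print(seed_and_extend(query, "TTTT" + query + "GGGG"))
```

Raising `threshold` in this sketch throws away more weak seeds, which is the speed-for-sensitivity trade described above.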

[Speaker Unclear].

The statistics will stay the same based upon what you put in. The threshold might go up; you could change that threshold. What that does is either lower the stringency of the search or raise it. What you find might be different. If you lower the stringency you will find things that are more of a [ indiscernible ]. Okay? It doesn't matter. That was just -- in this particular -- yes, I'll show you other things that run. The value is based upon the natural logarithm. I can't really write the natural logarithm number in there.

BLAST allows gaps. Why should it allow gaps? If you're only doing local alignment, why not just find every piece that matches and just forget about those? It is because if you are looking at a sequence where there happens to be an insertion or deletion, that is going to cause a gap in your query and it is going to have to reach over. That is biologically real and relevant. Without gaps, because there is an insertion in the other sequence you are trying to match against, you would not find that match, because it may not rise up in the score. You want to extend your alignment as far as you can. If you allow any gap you could match anything to anything else. So you have to have certain limits on how you allow matches and gaps. There are no statistics for putting gaps in. So these have all been determined empirically. They know that these are the ones that work. In order to open up a gap I'm going to charge you four points. In order to extend it, for every nucleotide you extend it we're going to charge you four more points off of your score. So they are lowering scores so that this sequence does not rise up to your threshold. So you are penalizing after that gap. Right. That is just what they use as the default. If I changed it to 1-2, what do you think? It would be a more stringent search. Whereas if I went to the other end and gave it one, it would be a less stringent search.
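To make the gap charges concrete, here is a minimal Python sketch that scores an already-aligned pair of nucleotide strings. It uses the figures quoted in the session (plus two, minus three, four to open, four per gapped position) as defaults. The function name is hypothetical, and the convention that a gap of length k costs the opening charge plus k times the extension charge is an assumption of this example.

```python
def score_alignment(a, b, match=2, mismatch=-3, gap_open=4, gap_extend=4):
    """Score two aligned strings of equal length; '-' marks a gap position."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if gap_side != side:
                score -= gap_open      # charged once when a gap run starts
            score -= gap_extend        # charged for every gapped position
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score


print(score_alignment("ACGTACGT", "ACGTACGT"))      # no gaps
print(score_alignment("ACGT-ACGT", "ACGTTACGT"))    # one-position gap
print(score_alignment("ACGT--ACGT", "ACGTTTACGT"))  # two-position gap
```

With these defaults, each gap pulls the score down quickly, so an alignment has to pick up a good run of matches on the far side of the gap before it is worth reaching over.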
This is one of the reasons I tell people not to use nucleotides. When you are looking at nucleotides you have a choice of four. You have a 25 percent chance of making a match. If you're looking at amino acids you have 20 different amino acids. There is much less chance of it being there. Let's switch over to the BLAST for proteins. I will run that. You have a lot of the same things up here. By default it starts here with gender. I think they are encouraging people to use [ indiscernible ]. You have choices that are different. There aren't as many. You don't have mRNA and genomic DNA. This is fairly well put together. You have patented sequences and structural sequences and you have garbage: the environmental samples that are not well known and often translated DNA. We will talk about the [ indiscernible ] if we can get to it. Here is the matrix. It is not simple. They actually are given names. PAM was not somebody's girlfriend. That stands for [ indiscernible ] accepted mutation. Before I get into the matrices let's take a break. Come back at 11:15.

[Session on break until 11:15 a.m.]

Okay. Let's get started. We are going to start talking here about what the differences in the matrices are and why you might want to use a different matrix and why you want to use protein if you can. Here are the choices. We are going to have to go outside. I guess I am displaying my desktop. I forgot the W, but it still came up. It is actually [ indiscernible ]. This is the module setup. It is set up in exactly the same way. If we go to modules and similarity searching. You guys don't have to go along here. I want you to know because it is the same kind of information. It is more detailed. Here is my theme. What I wanted to show you was the matrices.

Like I said, PAM stands for percent accepted mutation. It was a unit -- I don't have the reference. It is back in the early 1980s. She aligned sequences by hand, 1,500 sequences by hand, and looked for the accepted changes in those sequences. Okay? So it was protein work back then. You look at the accepted changes between those proteins. She defined a 1 percent change of the amino acids as one percent accepted mutation. If you then built a matrix that showed that 1 percent change -- let me show you. Okay. So the first definition was for that 1 percent change. She then took that and, using matrix algebra, expanded it out up to 250. So she started with a 1 percent change and has gone up to 250. How do you get a 250 percent change? There is a possibility these have been substituted. For the PAM matrix, the lower the number, the more stringent the search. You are down closer to that difference. The higher the number, the less stringent the search. Does that make sense?

Then there is BLOSUM; this is the default for protein. This was done in a very different way. They took a database called BLOSUM. That stands for block sum. He took the blocks and then asked what the substitutions in amino acids are that are allowed within that block at 60%. Okay?
And these are the numbers that were derived from that analysis. You are penalized for some of these exchanges, and you get fairly high numbers for others. So a rare amino acid like tyrosine is going to be much higher than a common amino acid. The thought is that BLOSUM is actually more biologically relevant. Each of the BLOSUM matrices was done in the same way. If I were to do a search, would that be more or less stringent? Right. It is less stringent -- I mean it is more stringent because 85 percent are identical. Therefore fewer substitutions were allowed. So the higher the BLOSUM number, the more stringent the search. That is because of the way they were derived. There are statistics, and they are somewhat complicated. It is about percents. Yes. Percent identity in terms of how it was derived, but it means different things.

Let's go back here. Go back to the protein page. I'm going to ask you some questions. If I had a sequence and I wanted to see if I could match that sequence in a zebrafish, what would you choose as a parameter? Which matrix do you think you might choose? BLOSUM. That is because that is the default. That is what everybody selects. The lower number probably would work better than the higher number. Everybody goes by default here. Almost no one changes. So they aren't optimized. Because of the evolutionary distance you want a less stringent search. All right. They aren't that different. You could use eighty. You may not find some things. There is no set pattern for whether a sequence is highly conserved or not. It depends very much on what the function is and how much the sequence determines the function of the protein. There is a protein in the yeast. I can't remember now. It has exactly the same function. You can exchange those. You can rescue a mutation with them. They aren't identical.
So what is more important in that case is actually the structure of the protein rather than the actual sequence. The sequence determines the structure, but that does not mean that if you get one structure from one set of amino acids you can't derive the same structure from a different set. Some things are so structurally dependent that the amino acid sequence is not as important as the structure of the protein.
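As a small illustration of how a substitution matrix turns an alignment into a number, the Python sketch below adds up pair scores for two short peptides. The table holds only a handful of entries taken from the standard BLOSUM62 matrix; it is a deliberately tiny subset for illustration, and the helper names and the example peptides are hypothetical.

```python
# A few entries from the standard BLOSUM62 matrix (illustrative subset only).
# Keys are stored with the two residues in alphabetical order.
BLOSUM62_SUBSET = {
    ("W", "W"): 11,  # tryptophan: rare, so an identity scores high
    ("C", "C"): 9,
    ("Y", "Y"): 7,
    ("A", "A"): 4,   # alanine and leucine: common, identities score lower
    ("I", "I"): 4,
    ("L", "L"): 4,
    ("I", "L"): 2,   # chemically similar exchange, still positive
    ("W", "Y"): 2,
    ("A", "W"): -3,  # dissimilar exchange, penalized
}


def pair_score(x, y):
    """Look up the score for aligning residue x with residue y."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def ungapped_score(p, q):
    """Sum pair scores over two equal-length peptides (no gaps)."""
    if len(p) != len(q):
        raise ValueError("peptides must have the same length")
    return sum(pair_score(x, y) for x, y in zip(p, q))


print(ungapped_score("WLA", "WLA"))  # identical peptides
print(ungapped_score("WLA", "WIA"))  # leucine exchanged for isoleucine
print(ungapped_score("WLA", "ALA"))  # tryptophan exchanged for alanine
```

The same three-residue window scores very differently depending on which residue is exchanged, which is the information a nucleotide match/mismatch scheme cannot carry.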

[Speaker Unclear]

Yeah, I will show you. Now we have set up very nice place to save your search parameters. They have set up ways to save your results. They have done that within the past year.

[Speaker Unclear]

Is there a difference?

[Speaker Unclear]

No, actually, I'm not. I know.

[Speaker Unclear]

There are actually tools out there. There is actually kind of a nice of all site called [ indiscernible ]. It shows you the connections between them in a distance between organisms.

[Speaker Unclear]

Right.

[Speaker Unclear]

The organism itself is. Yeah. You are using nucleotides to look across distances. If you are using genomics you don't have much choice. There are a couple of other things I need to tell you about why you want to use protein if you can. If you look at -- I told you already these are much more complex. They are allowing you to take advantage of that complexity to find the protein that is in an organism in a way that you cannot do with nucleotides. There are only four nucleotides. So with protein you have a lot more choice. You can specify a lot more. There is also the fact that the nucleotide scoring, you know, plus two and minus three, is making the assumption that you can substitute any one of them for any other. That means that each of these substitutions should occur at the same frequency, and they don't. A purine-for-purine change happens at a much higher frequency than a purine-for-pyrimidine change. Okay? So even this simple matrix is not real. It doesn't match biological reality. [ speaker unclear ]
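The point about unequal substitution frequencies can be illustrated with a short Python sketch that separates transitions (purine to purine, or pyrimidine to pyrimidine) from transversions (purine to pyrimidine or the reverse). The alternative scores used here, a milder penalty for transitions, are hypothetical values chosen for the example and are not BLAST defaults.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Label a pair of nucleotides as a match, transition, or transversion."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def flat_score(x, y):
    """The simple scheme from the session: every mismatch costs the same."""
    return 2 if x == y else -3


def weighted_score(x, y, transition=-1, transversion=-3):
    """Hypothetical alternative: transitions are penalized less."""
    kind = classify(x, y)
    return {"match": 2, "transition": transition, "transversion": transversion}[kind]


for pair in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C")]:
    print(pair, classify(*pair), flat_score(*pair), weighted_score(*pair))
```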

Databases are big enough now you probably will get some results. Depending on what the sequence is that you put in and how much change it has had over time. It is hard to predict. I don't think -- it is hard to make such a mistake that you'll break the system.

[ inaudible ] everything very stringent? [ indiscernible ] come at the top.

You can do that. What you're going to get generally is the same sequence back again. If you really make it a stringent search, you'll get your same organism back, and then you'll get other organisms. The advantage is that you can narrow down your search, if someone wants to find a Drosophila gene. All of these things up here at the top of the page are simply used as a way of narrowing your search down. They're not really there to do anything else except help you narrow the search. There is no point in putting in anything here on this top part of it which is why so many of them do it that way.

[ inaudible ]

You can definitely think about it that way. Okay? You see the choices for word size. Remember I talked about nucleotide BLAST set to 7? Here you have a choice of 3 or 2. Which one is more stringent?

[ inaudible ]

That's right. You're having to match 3 to reach your score, whereas here you only have to match 2 to reach a threshold score. Okay? To match a 3 versus 2, the match with 2 is going to take twice as long as the match with 3 actually. It doesn't really matter because it is so fast, but it does increase the [ indiscernible ]. Down here you'll see again some choices. I would recommend leaving this alone because it is fairly complicated. NCBI itself has flipped back and forth between all of these choices, setting them as the default over the past five years, so actually these three have been set as default at different times during the past five years. Compositional adjustments are pretty well set. You can leave those alone. Because of the way BLAST works, let me show you. Down here you have filters and mask. None of these are set on by default. However, if you choose "no adjustment", you will see that that filter comes on automatically. Okay? One of the reasons I think they've set the maximum target sequences at 100 is because the default search is now more stringent, or not more stringent but more computationally demanding, because they're adjusting the matrix every time you do a search, which is why you no longer have to filter out the [ indiscernible ]. Okay? When I show you PSI-BLAST, if we can get to PSI-BLAST, we may not get to PSI-BLAST. It is kind of -- PSI-BLAST is a very powerful tool, and I would like to show it to you because I think you can show your researchers it is another tool that they don't use and they should. This will be clearer, what compositional score matrix adjustment [ indiscernible ], and I will show you that maybe even after lunch, but we'll see how much we can get through here. The filters, since they're not used, why are they even there? Filters can actually be used as tools. Okay?
A lot of people do searches, and they'll do a search with, let's say, a transmembrane protein. So many proteins have a seven-transmembrane domain, so you have a protein that goes through the membrane of a cell multiple times. For each time it goes through the membrane it has to have what's called a transmembrane domain which allows the acids, so the amino acids are charged such that they can sit in this fatty acid layer. Okay? Each of these transmembrane domains has a somewhat similar sequence composition. So if you do a search, you may only get back a transmembrane domain protein, and you see the same thing every time you do a search because of how many of these transmembrane domains are in this amino acid sequence. What you can do if you want to see if there are other functions outside of the transmembrane domain regions that are important is to put your sequence in with those transmembrane domains masked as lower case letters. Okay? Simply take your sequence, look at all the transmembrane regions, change those to lower case, keep all the other pieces upper case. If I now mask those, what that means is that the search, this initial search that BLAST does, is not seeing those [ indiscernible ] pieces of the sequence. It doesn't look at those. It skips over the lower case. Okay?

[ inaudible ]

Right, right. You simply -- it is easy to do in Word, change case on your letters. If you know where your transmembrane regions are, simply change those to lower case, and then it is skipping over those when it does that initial search. Then what you want to do is mask for lookup table only, because what that does is you're not going to get the extension from those sequences because you didn't see them, but it then includes them in the score. You're bringing them back in to make the score the same as it would have been if you had included them in the initial search. You don't want to change the score that you're getting, but you want to skip those so you're not seeing them in that initial search. Almost nobody uses this as that type of tool either. There are lots of ways to use this. A lot of people have conserved domains, and all they get back is things that have this conserved domain. You can mask that conserved domain by lower case. Okay? It is a very cool tool to use, and it gets you lots of different results. You can look for pieces of function outside of that domain that you're always seeing without taking that domain out of the context of the protein. See what they see that's different. You're still going to get a lot of the same proteins because you're going to get matches that are outside, but you also will start picking up new things. I can guarantee that. Things that you didn't see because that conserved domain is so dominant in the database itself, so there are thousands of homeobox proteins in the database. You search it with a homeobox domain in it, you'll get thousands, but if you take it out this way, you'll get other things. It makes no difference. I teach a class to the postdocs and graduate students and sometimes faculty, and I call it BLAST tips and tricks. This is one of those little tips that I tell people. Okay. We've covered the whole screen here. Anybody have questions on any parts of this?
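The lower-case trick does not have to be done by hand in a word processor. The Python sketch below lower-cases chosen regions of a sequence and then lists the words that an initial lookup would still see if lower-case letters are skipped. The function names, the 1-based inclusive coordinates, and the example peptide are all assumptions made for the illustration.

```python
def mask_regions(seq, regions):
    """Return seq in upper case with the given 1-based inclusive regions lower-cased."""
    chars = list(seq.upper())
    for start, end in regions:
        if not 1 <= start <= end <= len(chars):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        for i in range(start - 1, end):
            chars[i] = chars[i].lower()
    return "".join(chars)


def unmasked_words(seq, w=3):
    """List (position, word) for every length-w word with no lower-case letters."""
    return [
        (i, seq[i:i + w])
        for i in range(len(seq) - w + 1)
        if seq[i:i + w].isupper()
    ]


masked = mask_regions("MKTLLVAGG", [(4, 6)])  # hypothetical peptide and region
print(masked)
print(unmasked_words(masked))
```

Any word that touches the masked region drops out of the seed list, while the residues themselves stay in the sequence, so they are still there when an alignment seeded elsewhere is scored.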
There is another thing you can do up here which is to query a subrange of your sequence. If you have a 1,600 amino acid sequence, you can set your query so that it only sees 400 [ indiscernible ]. Again, you can pick out ways of looking at that sequence that are slightly different than if you just [ indiscernible ], without having to do a lot of manipulation in Microsoft Word. Let's go back here. Let's go back to our slides. Did you reset the home page on this? Here is what I was telling you about with regard to how BLAST works to give you the cutoff, okay? Let's go to the next sequence. You saw before that what Michael Crichton did was to put in a cloning vector as the sequence. Mark actually wrote to him and said "this is stupid." [ Laughter ] You need to put in something that actually has more relevance, so Mark put together a sequence which is right here. Let's copy that and put it into blastn. As you can see, if you set your thing for nucleotide collection and somewhat similar sequences, it should come back to the same setting, and I am going to set mine to 1,000 just for fun. Hit the BLAST search. Go back to the module. Go to BLAST. Go to "Another book, another sequence." Cut and paste that into nucleotide BLAST. This is "Lost World." Mark has just joined our center for biomedical and [ indiscernible ] school at Harvard. He is a great guy actually. I didn't know him before, but now I am getting to know him. Go back to the BLAST page. Paste it in. People often complain to me about how long BLAST searches take, very impatient scientists, and actually the quickest way you can get the BLAST search to run is to run it at 4:00 in the morning, because now it is noon, so the people in California are coming online trying to do their search and it takes a while. [ unclear speaker ]

Let's take a look at what the results are. Here we have the results, and we can scroll down. What do we see? We see a bird, a chicken, a frog, we see human. We see cows, primates, dogs, all kinds of stuff. Mark made sure he picked up everything he could pick up out of that grouping. Okay? He included something that was going to be evolutionarily relevant to that book. We'll go back. We'll go back. Let's go back to our -- sometimes BLAST won't let you go backwards because it keeps trying to do the search as you're hitting the button. There we go. Okay. Here is my sequence. I pasted it in. That's just the home page for BLAST. I selected nucleotide BLAST, since this is the nucleotide sequence. Paste it in. Okay? It is on this slide.

[ inaudible ]

You can't take it out. Go back one more, I think. Keep going back. Go back. You're missing the slide there.

Okay.

We did skip a lot of slides.

No. Keep going. Keep going. Sorry. [ Laughter ] I am not following the slides along here. We'll go back to the story in just a second. What I did -- actually, every one of these modules has at the end a slide list, so if you click on the end button, there is a slide list. What I did was choose "Another book, another sequence."

[ inaudible ]

Then what I did was go to nucleotide BLAST, paste in my sequence. Mine was already set to nucleotide collection, already set for somewhat similar sequences. I set it for 1000 results, and I hit the BLAST button.

[ inaudible ]

Yep. Nope, nope. All you would see would be human results. We'll go over the results now.

[ inaudible ]

Serm has the longest match across the entire distance of the protein. It means it is still searching, so it checks in with you just to let you know it is still searching. Mark chose the chicken simply because birds are the most closely related to dinosaurs. [ Laughter ]

[ inaudible ]

Sparrows are little [ indiscernible ]. [ Laughter ]. Okay. Here are our results. I still have it set on the advanced search. Let's just go across what this results page looks like and what you're seeing. Okay. Everybody knows about accession numbers. What is that first accession number? mRNA, from what database? From which nucleotide database? Good. Okay. Here is the description. This is an erythroid transcription factor. It is one of those that, when it is turned into a protein, comes in and turns genes on or off. He made it specific because this is supposed to be dinosaur blood taken from a mosquito, so it is a blood [ indiscernible ] specific transcription factor. Right?

Here is where scores come in. We talked about scores. Remember? The matrices give you numbers that are simply added up. Okay? You have to reach a certain threshold score in order to be included in the search. You also have to reach a certain threshold score to be included in the list. It is listed here based upon score, so the highest scores come to the top and the lowest scores are at the bottom. Score is the number you want to pay attention to here. Here we have the score. This is another important one -- you don't see this when you're not looking at the advanced page: the percent of the query coverage. That means if you look up here at the top on the graphic, you can see that none of these match across the entire protein. We're covering about 66% with that top match. Okay? The rest match at the 40% level. Okay? That's what this graphic is showing you. This graphic is also showing you the scores based upon the color that you see here. Anything that's red has a score greater than 200. Anything that's purple has a score in the range of 80 to 200. Okay? Each of these black graphic lines is actually a link that will take you down to the alignment. Let's go back up and talk about the results a little bit more. Here we have our total score.
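Query coverage is simple arithmetic: the share of query positions that fall inside at least one aligned segment. The Python sketch below merges overlapping segments before counting so that no position is counted twice. The function name, the coordinates, and the query length are hypothetical values for illustration, not numbers read off the results page in the session.

```python
def query_coverage(segments, query_len):
    """Percent of the query covered by 1-based inclusive (start, end) segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or touching segment: extend the previous one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(end - start + 1 for start, end in merged)
    return 100.0 * covered / query_len


# Two overlapping segments on a hypothetical 200-nucleotide query.
print(query_coverage([(1, 50), (40, 100)], 200))
```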
The maximum score equals the total score in this case. This is the result of having no gaps in that alignment. Okay? The reason that the maximum score does not equal this total score in these other cases is because there is gaps included. Okay? Make sense?

[ inaudible ]

Say that again.

We have a maximum score and a total score. Okay? The first one, there are no gaps in that alignment, so the maximum score equals the total score. Almost all of the rest of these, the maximum score is less than the total score -- I mean the total score is generally higher than the maximum score because you subtracted out the gap. Okay? So you're scoring across this whole region. That would be a score that you would get if you were aligned across that region where this alignment is. You don't get that. Let me show you. It is easier -- let's go down and show you that alignment, and I can show you -- there are gaps in this one. Let me think about what that means then. This is based on the statistic, so if you look at this top match, it goes across starting at nucleotide 39 all the way up to 953. Here we're starting at 1584 to 1938. It is a much smaller alignment, so your score is going to be less simply because you're including less of the sequence as an alignment. Okay? Then what the maximum score could be is what would be a perfect match across the entire line. There has been an insertion or a deletion in one or the other -- a mutation of some type. It could be that this is a small insertion here or it could be that this is a small deletion. The assumption always is that these are derived from a common ancestor or that they have evolved to be mostly aligned. So this lower case actually has to do with filters, so if you look at the parameters for nucleotide sequence, let's go back, here at the bottom the filters are turned on, including mask for lookup table only, and [ indiscernible ] didn't used to be. You're filtering out the low complexity. If you look at the places in the sequence that is -- you have this whole series of ATATATAT.
We get matches to that at random quite often. Let's go back up here and look at one more thing and then we'll break for lunch and come back and run a few other types of BLAST searches. E-value. Remember the expect value that was set at 10 when we were setting our BLAST parameters? It is set at 10 to get as many results back as you can. This is that same statistic. What this number is telling you is what is the chance that this is a false positive match? In this particular case it is zero. Okay? Then you go to the natural logarithms: two times e to the minus 175th. It is a very, very small number, so the chance that it is a false positive is also quite small. Okay? The number that you want to pay attention to, that makes the most difference, is the score because those will vary based upon the actual alignment. This number is just a predictor of whether it is a false positive or not. You can have a whole line of zeros up here, but the scores may be quite different. The higher the score, the better the match. There is a relative relationship. Like I said, you may have a whole line of zeros simply because you have lots of good matches. The scores may vary somewhat. There is a transition. The E-values get larger as you go down the list. They're all [ indiscernible ].
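The relationship between score and E-value can be made concrete with the standard Karlin-Altschul formulas, which are not spelled out in the session: a raw score S is converted to a bit score S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S'), where m is the query length and n is the database length. In the Python sketch below, the `lam` and `k` arguments are placeholders and all numeric inputs are hypothetical; real values depend on the scoring system in use.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score (Karlin-Altschul)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


# Hypothetical query and database sizes, chosen only to show the trend.
m, n = 1_000, 10_000_000_000
for bits in (30, 60, 120, 600, 2000):
    print(bits, e_value(bits, m, n))
```

Every additional bit halves the E-value, so once the scores are large enough the E-values of quite different alignments all shrink to numbers that display as zero, while the scores themselves still distinguish the better alignment.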

[ inaudible ]

There are insignificant differences in E-value even though there are significant differences in the scores. All I am saying is that this number is less meaningful than the score number is with regard to that match. You can pay attention to both, but it is the score that makes the difference in whether it is a good alignment. You have an E-value of 0 here, you're pretty sure it is a good match. I mean, you are sure. You're positive it is a good match, but your scores over there would tell you which is the better alignment.

[ inaudible ]

I am not sure I understand why. I usually don't use this screen, so I am not sure what that is. Let me find out. Just realign them based on that parameter. Let me find out, and I will let you know. I am not completely sure why they listed [ indiscernible ] maximum. I don't

[ inaudible ]

I don't think so. We will find out what those are.

I just rearranged by score the alignment.

[ inaudible ]

Not sure why you get higher total scores than higher maximum scores. You can see you tend to get much smaller matches -- this is actually 51 nucleotides matched against a whole chromosome. This was the top one from the list here. It had a huge total score but a small maximum score. They're going to be the ones that have the best alignment. That first list looks for everything, and the best matches come to the top. The reason that I told you to reset that to 1000 is back up here in reformat. I go here and I say descriptions. If I had not changed that -- you see there is nothing now below 1000. If I had not changed that 100, there would be nothing below 100, so it just takes away your choice.

[ inaudible ]

Yeah.

Right. Right. If you look at when I resorted, now because I brought up to the top, it is still basing -- when I brought up to the top that other total score, when I resorted, what it did was bring up more of the other sequences that actually are very low in terms of maximum score. It is bringing them up so that they are setting a new threshold for what is seen as being significant [ indiscernible ].

[ inaudible ]

That makes sense to me, but it still doesn't make sense to me why you get some total scores that are not necessarily the ones that are showing in the alignment.

Right. [ inaudible ]

Outside of the alignment.

[ inaudible ]

Okay. Okay. That makes sense. That makes sense. See this line here between these? This is actually -- this scores higher than this does. The high-scoring pairs may be here, but when you're looking at this line, it is including these as part of the maximum score. Okay? You don't see -- because the total score from that region -- you don't see the total score because it is a local alignment that you don't see the maximum score because this would be the maximum score. You see the total score. Okay? We can talk about it more. I understand it. Those are just the top, the ones that you see in that graphic. You click on those, there are not 104 lines there, but there are 104 items because this is a redundant database; we're looking at lots of different things that match the same because they're the same sequence. If you click on one of those lower ones, you will see there may be six or seven different GI numbers or different accession numbers associated with each one of those that is a match. It is because of the redundancy of this database. It wouldn't be quite as bad.
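One way to read the two columns, following how the results table is generally documented: each database sequence can have several separately aligned segments (high-scoring pairs), the maximum score is the score of the best single segment, and the total score is the sum over all segments from that sequence. The Python sketch below shows the arithmetic. The subject names and scores are invented for the example and are not the values on the screen in the session.

```python
def summarize(hsp_scores_by_subject):
    """For each subject, report the best single segment score and the sum of all."""
    return {
        subject: {"max": max(scores), "total": sum(scores), "segments": len(scores)}
        for subject, scores in hsp_scores_by_subject.items()
    }


# Hypothetical hits: one long alignment versus many short ones on a chromosome.
hits = {
    "mRNA_hit": [850],
    "chromosome_hit": [95, 90, 88, 60, 60, 55],
}

summary = summarize(hits)
by_max = sorted(summary, key=lambda s: -summary[s]["max"])
by_total = sorted(summary, key=lambda s: -summary[s]["total"])
print(summary)
print("sorted by max score:  ", by_max)
print("sorted by total score:", by_total)
```

Under this reading the two scores are equal whenever a subject has a single aligned segment, and a long subject such as a whole chromosome can collect a large total from many short segments even though no single segment scores highly.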

[ inaudible ]

Because you would see chromosomal regions versus -- yeah.

[ inaudible ]

Right.

Chicken. Then you have statistically less [ inaudible ], so how do you make sense of that?

What it means is that this top match is the best alignment. These other matches are also good alignments. You're still finding things that are significant matches.

[ inaudible ]

This was nucleotide. They are real. Those alignments are real.

[ inaudible ]

Those pieces of DNA in those other organisms are there. We know that sequence is there in that other organism and matches this sequence that we [ indiscernible ].

[ inaudible ]

That's what evolution is, change over time. Matches across 66% of the query.

What a scientist does, it depends on why they ask these questions. Okay? If they're asking an evolutionary question, they may take this whole set of things and do an alignment, a multiple sequence alignment, and look at what the relationship is between organisms based on this sequence. Okay? The chicken is really not a very good model organism. They may look down here and see if they can find mouse or [ indiscernible ]. It just depends on why they want to know the answer, why they want to know what the relationship is. There are lots of different reasons for using this to pull similar sequences out. Let's wait until after lunch. I will talk about what the molecule is and how you talk about the molecule after lunch. Let's take lunch and we'll come back here at 1:20. Is that okay?

Captioner: Breaking for lunch until 1:20 p.m. Captioner present and standing by.

We are back, and I need to close this window. And then close this window. We just got the results back from the sequence that Mark had put in. Actually if you -- let me go back to the home page for our course. Going back to the slide list here, and I'm bringing up -- we've gone through all of these steps, and I'll show you that in a little bit. He actually has a publication. He wrote up the fact that the sequence was in there and that he had done the analysis. It was rejected by Science, rejected by Nature. So this paper is actually published in BioTechniques. And then after that he submitted the next sequence. If you take this sequence and translate it into protein, you'll be able to look at it and see that he's written in here NIH in the amino acid code. One of those parts that didn't match, he added that in. I'm going to go back to the search to show you a couple of the other things from the results quickly.

I can always tell when people in California are starting to get to work because BLAST slows down. There are other ways I can show you. You can submit jobs to run later and get the results, and there are different ways of saving results now. So you'll see that. I'll show you some of these things. If you look here, there are some things that we didn't cover. The links over here, each one of these stands for a different database. U is UniGene, and what that contains is a small compilation of information about your particular protein. It lists what the closest matches are from different organisms. This is also found in a database called [ indiscernible ] but it is the top sequence of there. It lists -- this information that is here are the top matches from other organisms. There's also information that is found in a database called HomoloGene. So a bit of information here that is found in several different places. In gene expression. This is a relative gene expression pattern, and this piece of information is not that useful in this particular case because it's showing you lots of different expressions from that same tissue that your original -- this is gene expression in multiple tissues, normalized. There are much more specific databases for looking at specific gene expression that you can see. It is found however in one of the proteins that is found in this list here, a specific transcription factor somewhere on here. There are quite a few here. The sequence, so you can download those. That's not it. Let's go back. We can get back to our BLAST results with the other links. I went too far. The E stands for expression, and it is the Gene Expression Omnibus database. So there are the GEO results for our particular sequence, and this is data that comes in from microarray experiments that people submit. This particular experiment, they are looking at homocysteine effect on cardiac neural cells.
These are experiments where they are looking at expression of genes in cells under different conditions. In this particular case, they took a set of cells, cardiac muscle cells, and put them -- probably cultured cells. No. Yeah. They are cultured cardiac cells. So they are looking at how the presence and absence of homocysteine affects gene expression. These are -- this is actually the data. Click on the link at the top there. I just clicked -- there was a thumbnail picture of the chart right here. You can actually get the raw data for each one of these experiments if you want to look at them yourself. So you can download the raw data for each one of the experiments, which I find quite useful because when I am evaluating a tool that I may be deciding to purchase for different types of analysis, this gives me a whole bunch of things that I can use that have already been analyzed. I can take a look at what the results should be. I can try them out with the new tools that I'm using and make sure that they do the same thing. It's nice to get the raw data to play around with. This is the actual relative expression value, RG in the experimental condition. There's the listing, the controls, the treated, which one of those it is and the value of the expression of your gene. This does have an inhibitory effect on expression across the board. This is a different result. G stands for gene info. This takes us back to Entrez Gene, the database that Chris will talk about later on. Any other questions about those? I don't know where you are. Oh, well this top one is actually from the database. You can tell that by the NA underscore. We get information but not as much information as you would get from Entrez Gene. So it will give you information about publications and things like that but probably you would get more if you went to Entrez Gene for that. We talked a little bit about why it's better to use nucleotide BLAST.
One thing I didn't mention because -- we looked at the genetic code yesterday and saw that it's degenerate. So you have multiple codons for the same amino acid. Because of that fact, there are instances where you can have a codon for serine that does not align to another codon for serine even though both call for serine. That is the reason why it's better to use protein sequence analysis. The [ Speaker/Audio Faint & Unclear ] code can cause problems. You can do a BLAST search by using a translated BLAST, blastx. What it does is it translates the nucleotide and searches against the protein database with all six of your reading frames. To remind you about the six reading frames. So here we are going to use that program, so copy that sequence down. We are going to use blastx here. Leave the standard genetic code. We'll search against the nonredundant protein sequences. Okay. We are going to leave everything up here the same. Everything up at the top will be the same. The only thing we are going to change right now is the limit, to a thousand instead of a hundred down here. I just do that routinely. It's a habit for me now. Using the default parameters for everything else. That sequence from the slide that we were on. Go to -- go to one of these and open it up and go back -- did you set your home page for the -- down here. Modules. Go to tools. Then hit options. And then say okay. Every time you are going to come back to this page. Then you want to go to BLAST and the BLAST list and then we are on -- you want to copy that sequence and put it in the search. This is the screen -- not the advanced screen. This is the standard screen that most people see when they do a BLAST search. You don't have all of the score information. You just have the single score, the E value and then the link.
You don't have the length of sequence that is matching against the percentage of the sequence that is matching against, but you have everything else here. Okay. So what it has done is translate all of this in six frames and then do a search against the database with a minimum -- with the sequence, so there must be -- these are essentially the same results here that we got with the nucleotide search. We have a high match to chicken at the top and then we go down to other organisms. I don't know what Monodelphis is. As you see, we are not getting the same order of organisms here. There we had chicken first, then we had frog next, and here we are not seeing frog; cow, man. There's rat, mouse. There is the zebrafish. Another fish. I am not seeing frog here at all. There it is. So there is the first frog sequence. You see you get different results when you use the protein match against the database than what you get with the nucleotide match. Those are the scores. It's just showing you what color each of these -- so any match that has a score greater than 200 will be red in this list and then each of the colors is within the range. This is actually the alignment. So this is the graph of the alignment. If you think of this top bar as our query, this shows you how much of that query sequence we are matching against with this sequence from the database. It's not exactly which part matches. It's more of a general -- it matches in this region of your protein. Right. I almost never run this search, so it's actually not used very often. I actually don't know if it gives you an indication. I don't see it here. You'd have to take your sequence out and translate it. Here's the frame. Yeah. There it is. It tells you which frame it's in. Right there.
These are matching to that same frame, so it must be that the only significant match in the database is that particular -- because we've sequenced genomes now, we pulled a lot of things where the structure of the DNA in a region predicts that there should be a coding region there but no one has seen it as a protein. No one knows what its function is. It doesn't match necessarily -- it hasn't been a good enough match. Nobody has done any work to prove that it does the function that it does here. So one of the things that is really lacking in biology right now is functional annotation. We need to prove function for these sequences that we have no idea what they do. Hypothetical simply means we don't know what it does. We can't say for sure. The function, not that it's a match. It may be hypothetical that it is a protein at all, because it's based upon a predicted structure for a gene that no one has seen a transcript for, no one has got a messenger for; it's just hypothetical in terms of whether it even exists as a gene. Usually on the form you specify whether your sequence codes for a protein so it gets translated.
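
The six-frame translation step that the blastx discussion above relies on can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code, not the code NCBI runs; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate dna starting at offset (0, 1 or 2); unknown codons become X."""
    protein = []
    for i in range(offset, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frames(dna):
    """Return a dict mapping frame (+1..+3, -1..-3) to its translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames


print(six_frames("ATGGCC"))
```

The same table also illustrates the degeneracy point: TCT and AGC differ at every position yet both encode serine, so two nucleotide sequences can disagree where their protein translations agree.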

[ inaudible ].

There's another hands-on thing for you to do. Take this sequence and plug it in, copy it and plug it in. So you want to put it into blastn, and the question is what function might this gene have. There's an amino acid translation down below. So you can use that translation, which is better. Go to the next slide. You can click up there. Keep going. There is an amino acid sequence down below there. And there it is. Remember that this is not FASTA, so don't copy that. Amino acid translation. This is the conserved domain. It just shows you that as the first screen as it is doing the search for BLAST. That's just a default. It shows you conserved domains. It's probably already open. Yeah. There you go. It says down here to do it against the Swiss-Prot database. Sorry I didn't point that out before.

Get your results. You probably didn't run it against the Swiss-Prot database. If you want to do it against Swiss-Prot, you have to go back and restart your search. Do it from here, but in the Swiss-Prot database. It depends on how busy the servers are there. [ inaudible ] that's the default setting. So you didn't change your default setting. [ inaudible ] let's take a look. Maybe there are only 100 matches. The Swiss-Prot database is not that large because of the way it was put together to start with. People who are into actual [ inaudible ] building trees, they don't really do anything, so I'm not sure that they really give you very much information, and it's based on a basic local alignment of this one protein. And so it's not very accurate information. [ inaudible ]. Organisms you can pull out. This is a database that has been limited to those things that were originally sequenced as amino acid, so it's a small database, but it's a well curated database. It's different in terms of what they show you than some of the Entrez databases. But if we look at the results here, what is the function of this protein? You can see it right from the title. This is one of those that I told you about where you may get lots of E values of zero but the scores will differ significantly even with that E value of zero, which is why you pay more attention to the score than you do to the E value. So you see over here you have top results of human [ indiscernible ]; the best alignment is the human because that is our original sequence, so you are going to have the best match there, but these other matches are highly significant to give you an E value of zero. The function of this gene is to go in and repair mutations that it finds.
So it looks for things where the base pairs don't match, but sometimes you can get a chemical reaction that will change for instance an A to a T, and so you have an AA base pair and this will go along that strand and repair that, pull out that A that wasn't supposed to be there. So during this whole process of DNA replication, there is proofreading that goes on in the cell all the time based upon matching. But the enzyme doesn't know whether it was an A or a T so it may change the wrong one. This was the Swiss-Prot database. Go back up and scroll up. The database is all nonredundant. You are on Entrez nucleotide pages. This probably will take you to the Swiss-Prot database. It's right here. [ inaudible ]. Yeah. [ inaudible ]. That's fine. But it accepts all different formats for putting in. If you wanted to change it to an actual FASTA format, you would have to put a description line, but just the sequence is fine. Some databases and some tools out on the web look for that caret as a signifier, so they won't accept raw sequence like that. Another thing that, like it says, is relatively new shows you, based upon BLAST with this single protein, what the relationships are between the organisms in the search. So it's called a distance tree of results. You can make yourself dizzy, if you'd like. Most of the time I would say that these types of trees, nobody uses. I'm not sure why they put it here actually. It gives you a quick idea, I think, but it does not represent good taxonomy. This is a single small [ inaudible ] to be used in a local alignment search, which means that you may be throwing things out here that would be good matches. There are lots of problems with building trees in this way. So no taxonomist will use a tree built this way. It gives you a quick way of looking at what groups might be represented here. You can play around with getting dizzy. You figure out the function a little bit for that gene without doing a lot of lab work.
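
For the point above about raw sequence versus FASTA: a FASTA record is a description line beginning with the ">" character (the "caret" the speaker mentions), followed by the sequence. A small Python sketch of wrapping a raw sequence that way follows; the function names, the record name, and the 60-character line width are illustrative assumptions, not requirements of any particular tool.

```python
def to_fasta(name, raw_sequence, width=60):
    """Wrap a raw sequence in a FASTA record with a '>' description line."""
    # Remove spaces and line breaks that often come along with copy and paste.
    cleaned = "".join(raw_sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def looks_like_fasta(text):
    """Return True if the text starts with a '>' description line."""
    return text.lstrip().startswith(">")


record = to_fasta("my_query", "atggcc tga")
print(record)
print(looks_like_fasta(record), looks_like_fasta("ATGGCCTGA"))
```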
Let's keep this set. Don't get rid of that page and I will show you a couple of other things here that you can do. So this slide talks about using an RID. RID stands for request ID. It's not as important as it used to be because what you have here are your recent results. So if you click on recent results, you can see that it keeps those for you. They are there for up to 24 hours. They may disappear based upon how busy they are, but they can be there for up to 24 hours. And if you want, what you can do is you take this request ID, paste it in here and get your job back. When you submit something, you will see on the first screen the request ID there. So if somebody has a long list of BLAST searches they want to do, they can take that list, submit them, and then come back in the morning and look at the results. Okay? BLAST allows you to run multiple searches from that same query box. So as long as you put a carriage return in between each sequence and there is some way of distinguishing between sequences, you can put up to 100 sequences in that query box. How many of you have used My NCBI? So it knows me. You don't use -- I want to make sure that everybody knows about -- I want to show you one thing about My NCBI. So I do lots of different searches on a regular basis that are emailed to me. I keep track of all the publications that come from my school by doing My NCBI on a regular basis. I know what is published because when you get that result back in the first three weeks that that result is there at PubMed, all of the authors' addresses are there. If you set this up to do a search now, you will get all the publications as long as you have the designation. So I do this -- but, see, because it's Harvard Medical School, they always put Harvard Medical School in their affiliation. Oftentimes your author may not be the first author on the paper. Exactly. I know now because I've been collecting this data, I know that Harvard publishes 170 papers a week.
It will do my search once a week and send me back the results. I get it in an email. Of course the other thing you can do is set up your filter. If I do a search in PubMed, what I have full access to from Harvard the first three weeks that it is there. [ inaudible ]. The affiliation field. For some reason the index in PubMed throws it out. If you put in Harvard Medical School and then put the brackets, the address -- [ inaudible ]. When you -- the other nice thing that is really nice to show your faculty, this is really -- I am getting off track, but I can't believe librarians are not using My NCBI. You can do a search that is very complex, so here's a search that I've built. It's a very long search. It's updated -- actually I update this one on a daily basis, and everything that comes in to me is the newest. So I don't see all the old stuff that I've already seen before. It's only what is new since the last time I searched it. So faculty members love this because they can see what is new. You would miss some of that. It's the Harvard Medical School address, with the bracket on it. Okay. A little bit of a distraction there. The way you use it here is if -- so let's say we go back to our results here. It's showing me as not being signed in. I don't know if I refresh this, whether it will show me as being signed in. And actually there is usually a save -- that's where it is. Here. When you save, what it is saving is not the results. It's saving the strategy. So when you take that saved strategy, what it will do is simply reinsert your search the way you had set it up before and do the search for you. So it's a way of saving your strategies when you are repeating a search. This is valuable also for faculty because they tend to forget what they did to get the result that they wanted to get. Letting them know about My NCBI and the ability to save things is a good thing.
A couple of other things now I want to show you about -- let's go back to this result. So it is there once you have come back to this page. If I go in here now to reformat these results, there are some choices that I can make here. I have choices for the alignment view that you see. It is what you get at the bottom of your page. You can have pairwise with dots for identities. You can have query-anchored or flat query-anchored. Do you still have your results? Okay. Go in and play with these and see what difference that makes. Use query-anchored with dots for identities first.
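
The "dots for identities" display named above can be sketched in Python as follows. This is only an illustration of the idea, assuming two already-aligned strings of equal length with "-" marking gaps; the function name and the example sequences are hypothetical, and this is not how NCBI renders its report.

```python
def dots_for_identities(aligned_query, aligned_subject):
    """Show the subject row with '.' wherever it is identical to the query.

    Both inputs must already be aligned (same length, '-' for gaps).
    Residues that differ from the query, including residues opposite a
    query gap (insertions), are printed as letters so they stand out.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    row = []
    for q, s in zip(aligned_query, aligned_subject):
        row.append("." if q == s and q != "-" else s)
    return "".join(row)


query = "MKTAYIA"
print(query)
print(dots_for_identities(query, "MKSAYLA"))
```

In a highly conserved family most of each row collapses to dots, which is why the differences are easy to pick out by eye.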

Query-anchored with dots for identities. You have to be on the results page. [ inaudible ]. You go to reformat. Well, it's because we have linked out, so it just goes through all the things that we have. I can show you. Okay. This saves the sequence and plugs it in for you and saves everything that you set as parameters so it knows exactly what search you did. There is a warning on there that says won't save anything than 10,000 times larger won't save. Here is a different looking result. We don't have pairwise matches any more. Why would anyone want to use something like this? What are you looking for when you use a result that looks like this rather than pairwise? What's different? You are trying to pick out the differences. So with dots for identities, these are highly conserved amino acids within this protein. They are all the same, but you have differences from your query sequence in everything that you see that is not a dot. So if you have a highly conserved protein and you want to see if there are SNPs or any differences, it's a fast way of doing this. Another thing that this does with the query-anchored view is show you insertions. So this particular sequence has an insertion here of two amino acids. Just because of where it's pointing. You can change this. So if you are looking for an insertion or deletion -- let's go back and reformat these results again. Let's do flat query-anchored with dots for identities. It has flattened out all the results. So even the insertions and deletions are on the same line. You can see where that insertion took place. So that insertion was between the I and the S, I think it was. So here's the amino acid insertion. Okay? If you scroll down this list, it will look different because you don't have things sticking out that are not included. It's indicated an insertion somewhere along that sequence. Because -- when you are doing an alignment here, there's an insertion somewhere in this line of sequences. I don't know where it is yet.
You have to scroll all the way down and look where those two amino acids are actually there instead of -- right. That's where the -- [ inaudible ] somewhere in that line there is an insertion. Because it's flat, you can't do a real alignment where there is an insertion. You have to leave that gap. Okay? There are three insertions somewhere down the line here. There are two somewhere down the line here. Because of the number of dashes that are there. One other useful tool from the alignment views that I should show you. Let's go back to reformat these results. Go all the way down here to the bottom, to the hit table, and click on view report. This is a simple table of the results, and what this is is a space table that you can download into itself. People who are doing [ indiscernible ] love this format because they can do a BLAST search, get the results, reload it into another program and go from there. Oh yeah. For those types of alignments, yeah. Okay. Any questions about these alignment views? Let's go back again to reformat results. When you run a search strategy, you can go back and relimit. So if you've gotten too many results, you can relimit to a specific organism, if you want. You can put a query, if you want. And here is where I think this is a very nice tool that again people don't know about. You can put in a range of expect values here. Okay? So if I am looking for something and in the first thousand results I don't see something that I want, what I can do is look at that list of thousand results, take the expect value of the last result on that list -- [ inaudible ] 150,000, and say, okay, this is a highly conserved protein. There is too much in the database that I'm seeing. What I can do is set the expect minimum and maximum to be somewhere in that range and walk my way down the list of proteins that are out there in the database to look at.
So you can see a completely different set of proteins every time you change that range of expect values that you are going to bring back. You don't want to put your minimum expect value at 10 because that means that everything you get back is going to be a false hit. You want to be reasonable in the number you are setting there, but it will change what you see significantly, and it's a good way of extending your search in a way that you can easily analyze it. Does that make sense? You guys are doing great. We are spending a lot of time here. I think -- are there other things that I should show them? Oh, yeah. PSI-BLAST. Okay. PSI-BLAST is a powerful search and it's only available under protein BLAST. Sorry. I've had trouble with it recently. Let's just try it from here and see what it looks like. Okay. This is the first result from a BLAST, a sequence that we already looked at. PSI-BLAST stands for position-specific iterated BLAST, and what you use it for is picking out family members from a protein family. Okay? Does everybody understand what a protein family is? Okay. So one of the examples I showed you was an aminotransferase. There are a lot of proteins that have this aminotransferase. It is a set of proteins that may have a similar function, maybe not the same thing but a similar function. So you have these transcription factors, and one type of transcription factor is called the homeodomain. So there is a homeodomain family of proteins. The way that PSI-BLAST works is that it makes its own matrix every time it runs a search. It does this initial search and gets this alignment for these proteins, and from that alignment it starts to look at which amino acids are conserved in this set of proteins. It then gives a higher score to those that are conserved within that position of that sequence, and it uses that as the scoring matrix. So let me show you -- I can give you a little bit of graphic.
I have a slide that might show it better but I don't know if I want to take the time to look at that slide. Remember in our scoring matrix we had all the amino acids here and all the amino acids here and we had numbers for the substitution values. What you have here is not all the amino acids; you have your query sequence, each amino acid within its position. So over here you'll have 2/10. You'll have an 8. You'll have a 2/12. You'll have an M. So it's aligned based on its position within your query sequence, and what PSI-BLAST does is it goes in and checks all of the results, the alignment, and looks for the conserved amino acids within that alignment for that particular position. So at position 213 you may have a serine, and over here in the matrix, against serine, this will give you a score of 1. But at position 225 down here, a serine may give you a score of 7. Okay? Because of the conservation at that position within this alignment. Okay? This example is the one from the slide that I showed. This particular protein is a serine, and that serine is the active site nucleophile. That is necessary for the function of that enzyme. So each time we build a new matrix, the conservation of those positions is stronger and stronger, and it allows us to find highly conserved amino acid positions in the alignment that we wouldn't see with just a regular straight BLAST [ inaudible ]. In that same position within the alignment of that whole group of things that you are seeing. [ inaudible ]. Only from the results that you are seeing. So you don't actually -- if I were to show you that matrix, you wouldn't be able to interpret it because it's not a straight matrix like these are. But it is dependent upon the alignment that is done already in the first BLAST search. Okay? And what you do to run this, now if you look at the results page you'll see that there are differences here that you don't see in a regular BLAST search.
So every single one of the results has a new -- [ inaudible ] here. Down here, when you start getting into higher E values, you are getting closer to maybe not getting a hit, but just to see what you get if you change. It doesn't matter if you do it or not because you would weed out anything that is false from it. I'm going to run it now and you will see what the results are. Teach your faculty how to use this. You will do them a huge favor. I guarantee they will find things that they haven't been able to with a BLAST result. It's not in that list. If you go to protein BLAST. Scroll down. You can scroll from here actually. It's down -- go back up just a bit. I did this from the results page; when I went to reformat results, there is a PSI-BLAST choice on the formatting. This is computationally intensive work because you are building a new matrix every time you run this. So it takes longer to run these searches. You can bring up -- this is a problem that I've been getting every time I run a search now. I get an error message no matter what I run. So I've written to NCBI but I haven't gotten a good answer back. So now what you see here, the green ball means that this was checked on the previous iteration. When you come here one of the things you see here is get to the first; when I push that I get to the first sequence that we did not see. Okay? You see that we've got lots and lots of new things that we did not see on that first search. The recommendation is that you run this search to convergence, meaning that you don't find anything new. I ran a homeobox protein that was fairly short for 20 iterations. So it can take a long time to do this to completion, but each of those results that you are getting as you are going through this search is significant. Each of those new results is a significant match. So pay attention to those. PSI-BLAST is a tool that is used to look for new members of a protein family. We'll let him set up and finish up.
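
The position-specific scoring idea described above, where the same amino acid can score differently depending on how conserved its position is, can be sketched in Python. This is a deliberately simplified illustration: it assumes a uniform background frequency and a flat pseudocount, which is not the weighting PSI-BLAST actually uses, and the aligned sequences are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_position_scores(aligned_sequences, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment column."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # simplifying assumption
    columns = []
    for pos in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[pos] for seq in aligned_sequences
                         if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        columns.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return columns


# Invented alignment: the middle column is a fully conserved serine.
alignment = ["GSA", "TSV", "ASL", "CSI"]
scores = build_position_scores(alignment)
print(round(scores[1]["S"], 2), round(scores[0]["S"], 2))
```

Serine scores well in the conserved middle column and poorly in the first column, where it never occurs, which mirrors the position 213 versus position 225 example in the talk.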
The first new sequence is up there fairly high and then the next sequence is down a way, and as I keep going down I will see more sequences showing up. These results still have fairly low expect values and the scores -- loading. It's too long to load. It's not responding to the click here. The researcher will know what it's seeing. What do you mean by recommendation? [ inaudible ]. Okay. [ inaudible ]. There are lots of things you can do. These pages are printable, so you can print out that list. One of the best ways to do it is to get that hit table and put it in a spreadsheet and you can save it there. I don't know. How many of you guys have people who are using electronic notebooks? Electronic notebooks for their lab notebooks. So they are starting to come in. With those kinds of tools it is much easier to document those things and to save that kind of information than having to write things down. But saving an Excel spreadsheet in an electronic notebook is a snap. PDAs. [ inaudible ]. As an experiment. [ inaudible ]. That's a good idea. If you do a search for somebody and in the top hits that you get there is an Entrez Gene link, send them that information that is in Entrez Gene.
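
As an illustration of putting the hit table in a spreadsheet and of walking through a range of expect values, here is a minimal Python sketch. It assumes the common tab-separated BLAST tabular layout of twelve columns with "#" comment lines; the column names, function names, and sample rows (including the subject identifiers) are invented for the example.

```python
import csv
import io

# Assumed column order of the standard 12-column tabular hit table.
COLUMNS = ["query_id", "subject_id", "percent_identity", "alignment_length",
           "mismatches", "gap_opens", "q_start", "q_end",
           "s_start", "s_end", "evalue", "bit_score"]


def parse_hit_table(text):
    """Parse tab-separated hit-table text into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError("unexpected number of columns: " + line)
        row = dict(zip(COLUMNS, fields))
        row["evalue"] = float(row["evalue"])
        row["bit_score"] = float(row["bit_score"])
        rows.append(row)
    return rows


def within_expect_range(rows, low, high):
    """Keep only hits whose E value lies between low and high, inclusive."""
    return [row for row in rows if low <= row["evalue"] <= high]


def to_csv(rows):
    """Return the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample rows for illustration only.
SAMPLE = "\n".join([
    "# hypothetical hit table",
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t10\t209\t0.0\t410",
    "query1\tsubjB\t71.0\t180\t50\t2\t5\t184\t1\t178\t2e-50\t190",
    "query1\tsubjC\t35.0\t60\t39\t0\t20\t79\t300\t359\t0.5\t31.2",
])

rows = parse_hit_table(SAMPLE)
print(to_csv(within_expect_range(rows, 1e-60, 1e-10)))
```

Saving the CSV text to a file gives something that opens directly in a spreadsheet, and changing the low and high bounds shows a different slice of the same results.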

Let's take a break until 3:00 o'clock. We'll be pushing it for structure and gene.


[Event on break until 3:00 p.m.]

Okay. If you're going to go ahead and go through all of this with us, and I highly encourage you to go through this structure hands-on. Do this structure hands-on with us because a lot of it will take five minutes. So be psychologically prepared to do some things on your own machine. So the good news is, the first part of this is review, and this should look very similar to what you saw yesterday morning. Is everybody remembering? Yes? Okay. So the primary protein changes is what you will see. They're all coming together and of their complex. You will see examples. We talked a lot about the fact that protein structures are harder and more complicated to visualize [ indiscernible ] and where they are in space. There are two techniques we are focusing on. One is X-ray crystallography. So for those of you who remember Rosalind Franklin and DNA early on. And then also nuclear magnetic resonance. So those are the two techniques, and we will see examples of them. How many of you have used Structure before? It used to be very hard to tease out from the records you were looking at what kind of structure it was. They have made it very clear and have little tabs. And then additionally we also talked about the fact that you can essentially guess what the structure might be, computationally speaking. Those are only predictions, and we talked a bit about how people do and don't value predictive models. There aren't very many predictive models that are available. And actually in the Structure database they don't include the models at all. Here are two visualizations. These are just flat screen shots. On the left-hand side it's nuclear magnetic resonance. This technique produces lots of models. You can see that lots and lots and lots are brought together. Each one of these is a model, and they're putting them together. When we go live into the viewer make a [ indiscernible ] for yourself of this number.
This is the Protein Data Bank identifier for this. This will be a nice one for you to visualize, more so potentially than one of the other ones you have. We already know what it looks like, so you will be familiar with it. Proteins have a bit of movement, so there are different models. This is kind of representing how the protein lives. And then this is the backbone of it. There is the ID number for that, too. You can write that down as well. Does everybody see the shadows? Can anybody tell me what they think those are? You cannot really see them in this model, but you can see them fairly clearly in the other model. They are calcium ions. You will see these referred to as the [ indiscernible ], which just means a molecule that is connected to the other structures. Sometimes they will be tying the structures together. You will see links that talk about this. One of the things you want to think about, when you are helping people do searches and visualizations, is this: if their goal is really to see where these are connected, in which of the two models are you more able to visualize the calcium connections? Yes. So which one would you use? What would you show with this? Some of it is about the focus. If the focus is to see the natural movement of the proteins and how the different models are, you might be hoping to get an NMR model. But if you just want to bare-bones it, this might be an easier thing. It is not often that you have both kinds of visualization. I seem to always use this example because this is one of the few where I have both. Most people know what calcium is. Feel free to use this particular example as a teaching example. She provided a lot more information. So it is good in terms of teaching because you may already know about it, but also because it's pretty easy to look at. Does anybody think it is hard to look at, or easy? Let me not assume that you agree with me. 
The question was: if you were going to try to show those elements, would you use X-ray crystallography? I would search for examples that have [ indiscernible ] that are visible regardless. The likelihood that you see more of them is fairly high, but not necessarily. I think that's it for what it was about to see attached or that I was decease of the attached. One of the examples says, I want to look at this, but I want to look at it in its complex with DNA so that I can see how it connects and not just see it floating around by itself. Yes, please. I am just repeating that the crystal structure really gives you the higher resolution, but it may be that all you are able to do is NMR. It is lucky we have an example that has both techniques, because that is not all that common in the database. We are going to be talking a lot about Protein Data Bank and how Entrez Structure is taking the information and repackaging it. You might find a lot of your protein people -- and David talked about this on day one -- are Protein Data Bank users. They don't use Entrez Structure. They are happy using that. If they want to use different viewers -- let me just ask, how many of you have chemical visualization programs on the library public computers? Okay. A few of you. What other programs are you using in your libraries? Does anybody remember? How many of you have Cn3D, which is NCBI's? That's free. All of the ones we will show and talk about are free. If you serve a lot of chemists, the likelihood that you have Protein Explorer or if you are going further down the food chain. Also there is very to get to all of them. That is why you want to be in sync. If all of your faculty are demonstrating a particular one in class and you are bringing somebody into the library to work with you and say, I want you to learn Cn3D, that might not go together so well. 
Not that everybody cannot broaden their horizons, but you should be able to use what your faculty are sharing things in. So Protein Data Bank is an international database of 3D structures. If you have the structure, you generally have the associated [ indiscernible ]. It does not work the other way. You have lots and lots of protein sequences, of which very few have an experimentally determined structure, and you see that in the kind of retrieval numbers you get. That is usually the case. It is possible to have the structure of the protein and not have the sequence, but that is much less common. If somebody went to all of the work to do the structure, they likely also went to the smaller amount of work to make sure they have the sequence. You can see that people can submit their structure data directly, and there are different formats. There are actually tools. They have the tool called PPB explorer where you can generate the 3D for your structure based upon the experimental data that you have. And it has some of those theoretical models. They are not in Entrez because nobody trusts them. So here is a sample. We can also go live on the web site if you prefer. That's fine. So here it has some of the basic information. It is from a human. This is experimental data, so what kind of structure is this? Is it X-ray crystallography? No. It is an NMR. So a lot of this should look similar. You have a journal field, title, reference, very similar to what you have seen before. This is compound information -- you see that this is an enzyme. It is the Enzyme Commission number. If these have registry numbers, sometimes you will see those, depending upon what you are looking at. And this particular one has five. You remember the one we were just looking at had like 30 models. This has five. We will come back to the terminology. We talked about residue yesterday, in terms of using residue as opposed to amino acid. You will see a little bit of variation in terminology. 
And what you have here is information about how they generated this structure, so a bit of the experimental data. And then you start getting all of that 3D coordinate data. All of these are coordinates for how it should appear. In each of these you can see all of the amino acids. This is each amino acid and how it appears and where it appears. Okay? Does that seem fairly straightforward? What happens is these are all in Protein Data Bank and they have numbers. So this is the PDB ID number, and what happens is that NCBI goes to Protein Data Bank, picks the experimentally derived ones, brings them in, and reformats them into NCBI format. It looks more like XML. They take it from this and reformat it, add the different information that they have, and assign it a Molecular Modeling Database number, and there it lives. So all of these should have two numbers. One is the Protein Data Bank number and the other is the Molecular Modeling Database number, which is the Entrez Structure number. You can search on either one. Okay. I am trying to think of how to rephrase the question. Every one of the articles -- most of the structures in there have been published, so they probably do have some associated articles or references in them. So it is possible, and we will see that in a minute. When you go into the database you can link to the papers that the structures were described in. I think if your goal is seeking papers, this would be the case where I would go and do whatever search I was going to do and then use the display feature to say, show me the ones with hard links to structures. Let me just -- so if I were here -- this is an Alzheimer's-related protein. If I had searched for Alzheimer's and I wanted to say, show me the things with structure links, I would come up here to Display and say, show me. The problem you have is that these are in mostly alphabetical order, but not completely. The structure links are down here. 
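As a rough illustration of the coordinate lines described above: in the fixed-column PDB text format, each ATOM line carries the atom name, the residue (amino acid) name, the chain, the residue number, and the x, y, z coordinates. The sketch below is a minimal Python reader for one such line. The sample values and the name `parse_atom_line` are invented for illustration and are not taken from the record shown in class.

```python
# Minimal sketch: read one ATOM line of a PDB-format file by column position.
# The sample values below are invented for illustration only.

SAMPLE = "".join([
    "ATOM  ",      # columns 1-6   record type
    "    1",       # columns 7-11  atom serial number
    " ",           # column 12     unused
    " CA ",        # columns 13-16 atom name
    " ",           # column 17     alternate location
    "MET",         # columns 18-20 residue (amino acid) name
    " ",           # column 21     unused
    "A",           # column 22     chain identifier
    "   1",        # columns 23-26 residue number
    " " * 4,       # columns 27-30 insertion code plus unused columns
    "  27.340",    # columns 31-38 x coordinate
    "  24.430",    # columns 39-46 y coordinate
    "   2.614",    # columns 47-54 z coordinate
    "  1.00",      # columns 55-60 occupancy
    "  9.67",      # columns 61-66 temperature factor
    " " * 10,      # columns 67-76 not used in this sketch
    " C",          # columns 77-78 element symbol
])


def parse_atom_line(line):
    """Return the main fields of an ATOM/HETATM line as a dictionary."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE)
print(atom["res_name"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

Reading by column position rather than splitting on spaces matters here because neighboring fields in this format can run together with no space between them.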
Show me the ones with hard structure links, and you see that out of those thousands and thousands of papers there are a hundred and three unique structures referenced by all of those papers. If my goal was to look at the relationship between structures and papers, and I am more familiar with the literature, I approach it from this side as opposed to the other. There is no reason you cannot do it the other way as well. This just gives me more words to work with, because I can do a much more specific search. That is my take on it, but you could do it either way. Yes? Okay. So the scientific literature around structure has adopted the same kind of publishing and depositing requirements that people have long had. If people want to publish new experimentally derived structures, they must submit them to Protein Data Bank, which means eventually NCBI will take them and put them into Structure. Do you have a question? For those of you who haven't looked recently, what we were excited to see is that you actually get the previews. This used to be a database where you just saw the brief summary results, and now they actually give you a preview of what you are looking at. Also, as I told you, they only take the experimental ones. They don't take the theoretical models. They add some explicit information and they add all those other links, because NCBI has other databases that they can link to. They store records in this format, which you're going to see on the next slide. Okay. You can see that it looks very different from the coding that you saw on the Protein Data Bank record, but most of the elements will be the same. There are kind of two things about this format. One is that everything else that works with each other and connects is in a similar format. I think that's part of it. They just want to synchronize the field formats where there are common fields across resources. The other thing is that this is a viewer. 
It is really optimized to show structure and sequence at the same time and let you play back and forth between the two. There are a lot of different viewers. Most of them are designed more to show you the structure and not to help you work between the structure and the sequence. So you might have cases where somebody should be using this because they really want to use sequence and structure together. The other tools are just kind of adding that functionality in now. Okay. And I think there is a note on this slide. It says that the Cn3D structure viewer reads this format, whereas other things don't read this. They read the PDB format record. Okay. So this is an example of what -- this is so many versions ago you don't even need to look at it. You'll see two things. One is that people will be taking static images of things to put in their PowerPoint slides or in their journal articles, so there will be some need for capturing images. I'll show you how to export just the structure pictures as static images, because that was one of the common questions we used to get. It has become easy now, though. The other thing is that people will try to embed live functioning images. These are just the basic raw records. What we are going to work on is how you annotate so that you point out what is meaningful. It is pointing people to which part of the structure is of biological and historical interest. What is going on with it? There is labeling and annotation, and that must be saved. Whether it gets saved in a screen capture or by you saving the file is up to whatever your faculty member or you personally are going to do. I tend to like to go live when I'm teaching, but because, you know, stuff always happens, I tend to also have saved a couple of them. So if I have to open it up to show a lot of annotation, to get people excited about what it could look like, here is a good one. 
Now, let me show you how to get from the naked one to the edited one. It helps to do some of this step ahead of time. I'm big on having backups in case of failure. Okay. Let's talk about Entrez Structure for a minute. We're already here. If you don't have this, here on the left-hand side it gives you the link where you can download it. Even if you're not quite prepared to roll it out on all of your library public computers, you should at least go ahead and download it and play with it and get a sense of what it is going to do. It is loaded on your computers so you can follow along on these exercises today. So the top part should look very similar. We have been looking at this over and over again. You certainly do have different limits and things like that, but for the most part all of this is pretty much the same. One of the major differences is the sort by. If you're used to looking at that, if you look here, it is going to feel very different. Frankly, I don't use this very much, because most of the searches I'm doing are just very basic keyword searches: I want to show what the structure of insulin looks like. How do I bring that up? I don't necessarily focus on these things, but just in case you're interested in any of this, it is worth knowing that they are different. Okay? What is helpful is this section here. These are the default tabs. Even though we are logged in as David -- these are all things that are already there. These are default tabs inherent in Structure. So remember I said if you were trying to figure out which ones are which, it's very easy. You click on the tab and say, show me those 20. It's nice here that if you mouse over, you'll see it popping up to say -- so even if you didn't remember my conversation with you at all about X-ray crystallography, if you mouse over that you are going to get at least a little bit of information about what it is and what the crew was and all of that. Okay? 
So if I want just the models, I click on that tab. It is going to redraw just the 20 that are from that. If I want to see the ones that have a ligand with them, you see I can say, show me the 64 that have -- and if you don't remember, that's fine. This is going to tell you what it is trying to do for you. So it's hard to see at this level. And I'll show you how we get rid of all that stuff that keeps you from visualizing it when we open the record. So these, too. Most of these have records. When we get toward the end of the structure section we are going to spend a bit of time talking about this, which is the small molecule databases that NCBI rolled out in the last year, year-and-a-half, in terms of the Roadmap and looking more at small molecules. So those of you who support chemists as well as biologists and clinical researchers perhaps already have seen that a bit. And going back to the very first part of the very first lecture yesterday morning, you can choose if you want structures that came from bacterial or eukaryotic organisms. In this case I don't think there is too much bacterial Alzheimer's. In case you're like, why are there so few, remember what we searched on. Does anybody remember? We did our initial search here. When we follow this link it isn't saying, search for Alzheimer's. If I had that up here -- this is saying I want to search the Structure database for any mention of the word Alzheimer's, and you'll see that I get 12. This is telling me that there are 12 records in this database that have the word Alzheimer's somewhere in the record. What we were looking at before was every record that had the word Alzheimer's that happens to have a structure associated with it -- now, it could be -- you could have two things going on, and maybe you'll see -- what is one of the things that could be going on that explains the difference? Yes. That could be. Even if I get to all, let's be fair and make this a better comparison. 
If I keyword search, I get 12, and I have a hundred and something when I come over. What do you guys think the difference is? Right. That's one possibility. Even though the paper mentions it, the structure attached might be a structure related to something else that paper is talking about, whether it's the structure of a drug that is being discussed, or the paper is about Alzheimer's and six other conditions. So that is one thing that might make you think some of those 103 things are false. What could be another reason that might make you go in reverse and make you think that the stuff here is giving you more? Yes, there are a lot more words. You have a chance of hitting it with the title and abstract. So if they are talking about Alzheimer's you have more opportunity. Depending upon the level of information -- remember those levels from PDB -- you pretty much get the definition and maybe some details about the experiment. So if they did not actually say Alzheimer's in one of those three fields, you're not going to get it. If this one says something about [ indiscernible ], but the other structure just said, this is the structure, you won't retrieve that by searching the word Alzheimer's. Does that make sense to everybody? So he is asking about the publication date. When we get to look at a more complete version you will see the publication date. They are trying in this case to keep it short and to give people something to click through to. You can apply limits. One is the publication date limit. We'll go into that. You can also sort them by that date, and that's probably just as effective as in the literature. So you could sort them by the deposit date. You would be seeing things in order of when they were deposited. At any rate, the whole point is that there are multiple ways to do this, depending upon whether you want to search very broadly and get some false drops versus whether you're really just looking for a couple of clear examples. 
That depends on what you are doing. Let's see how much overlap there was. What I'm doing is ordering them to see how much uniqueness there was. There were two databases and structure guarantees they show me the structure says. Let's see what these were and why we did not retrieve them. This is a refresher on how the structure differs. So we have two records here. I'm going to show them. I'm trying to think here. Which would you want to look at? I'm going to just visualize them one at a time. So here is the reference. You were talking about the literature reference. Here it is. It was deposited in 2000. I'm just going to do a find and see where it says Alzheimer's. You probably already know. You are faster than I am. I need to spell it correctly. I'm just looking. It's volume 90. Yes. That's interesting. That is unusual. I'm looking for where it said Alzheimer's. Do you remember what I said yesterday about when you teach? Know your examples. This is what I get for going off script. This is pretty much the whole record. I will definitely come back to figuring this out. Okay. Well, let's look at the other one. Yes. Yes, these things link across the board. This is a link back to the paper. This shouldn't be what draws it back. It should be something within that record. I will look at those. At any rate, I just illustrated why it is important. Do what I say, not what I do. I wanted you to see the overlap. You're going to be working with somebody and you have to make the judgment: do I start with the Structure database or PubMed? It all depends upon how comfortable you are and how comfortable the person you're working with is. We have a lot of these at Cornell. There are people who are familiar with the literature and are trying to transition. For some of those people, following the links out and seeing what's going on will be the best strategy. 
You won't want to waste the time of others, but you know the plans and you'll make the right judgments that will work for your situation. Okay. Good. Yeah. That's absolutely true. What happens if you do Alzheimer's? It's true. All of the strategies you would use for text word searching in any other database you should use in all of these. If you did that -- let's just see. You get a lot more. One of the things is that people aren't using truncation or doing OR searches. They expect the database to resolve that stuff for them. People who are used to using this and having it do magical things behind the scenes will probably assume all of these other databases will do magical things. You're right. If you're actually teaching somebody in the database, it is a really important teaching point. Okay. And you see that we do a lot better with that. Right. Right. There are definitely timing issues with any of these names. We talked about it with genes earlier. It used to be called S182. And I think that particular gene does not have a known structure. When I teach with that I have to use a different example. If I was trying to make sure there wasn't one, I would have to do OR searching or deal with that question of the old stuff versus new stuff. Truncation is the easiest, quickest way to deal with that. We should go ahead and open up a structure and kind of get our hands dirty. You can go into more details of the structure or you have the little viewers. So this is one of the things about the new features that gets a little confusing. All that does is blow it up to a slightly bigger image. This does not open up the opportunity for manipulation. It isn't that clear to people unless they stop and read. If they really want to launch it they can go through this interim-level record or they can just launch it directly. Let me say a couple things about this. 
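To make the truncation point above concrete, here is a minimal sketch of building the same text-word search two ways against the NCBI E-utilities search service: once with the exact word and once with a truncated stem. The assumptions are that the Structure database is addressed as `db=structure` and that a trailing asterisk is read as truncation; the helper name `build_search_url` is made up for this example. The code only builds the addresses and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service (ESearch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, db="structure", retmax=20):
    """Build an ESearch address for a text-word search of one Entrez database.

    db="structure" is assumed here to be the name of the Structure database.
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return ESEARCH + "?" + query


# Exact word versus truncated stem; the asterisk asks for any word ending.
exact_url = build_search_url("alzheimer's")
truncated_url = build_search_url("alzheimer*")

print(exact_url)
print(truncated_url)
```

Pasting each address into a browser and comparing the counts that come back is one way to show people that the database is not doing the truncation for them.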
We discovered that this is the PubMed link and that for some reason they don't like to list the journal articles, because this is the third one we've seen. That's a good practice. That would be a good practice. This was deposited in January of this year. It's a human structure. You have the ID number for the database. Let me say a word about this. VAST stands for Vector Alignment Search Tool. When you're doing structure-to-structure similarity you're dealing with it at the structure's physical level. This is in the sense that you're looking at how it's aligned. That is why it is called a vector. It is not like the sequence alignment we saw earlier. This is your three-dimensional model. It is how this aligns in space. So you might have two things that have very different sequences, but in space they fold up and seem very similar, and they have 3D coordinates. This looks at what is similar. So doing a structure-to-structure similarity search is going to give you different results than doing a protein-sequence-to-protein-sequence search using the sequence. Depending upon where things are and what they're doing, they can fold and behave differently. This is an algorithm to look for things with a similar 3D structure. There are so many 3D coordinates that it would take a lot of time if this wasn't pre-computed. I'll click on it to open it in a new tab so that you can see. It should be pretty quick. These are the chains of other similar proteins. Here you are in the related structures. Well, there is only one. Well, there aren't that many to begin with. Also, there aren't that many that are that similar to one of these. Let me close this. This is a good follow-up. This has two sections. You see one of the domains is this blue one and the other is -- should we say, magenta. You will see that there is the domain family. We were talking about protein families earlier. This is a domain that carries through. 
This top level is the overall sequence. Most of these have the underlying sequence levels, because the powerful thing we're going to do is that you can see the structure and sequence working together. You can see that we also have something here. It should take you and show you what the compound is. I don't exactly know what that is. A methyl ethyl something. Okay. Let's do it. Let's go ahead and go into the actual structure. It should prompt you to open it or save it. It should open two windows. A lot of times when people see multiple things popping up they start closing things. I have seen people close the sequence window thinking it's not supposed to be part of what is going on. Conveniently, everything is color mapped. You can at least see where things are. If you highlight this you should see -- it's hard to see with the lights on. Maybe you can see it. This corresponding section turned yellow. The section you highlight turns yellow. So if you're trying to orient yourself or you have a particular piece you're trying to visualize, you can do that. So let's talk about this. You notice I'm spinning it around just by grabbing it and moving it around. Did all of you get the structure open okay? If you did not, now is the time to say something. The first thing we'll do is spin it around. These crayons and arrows -- I hear various names. They are there to show you the directionality of the protein. Sometimes once you have seen how things are going, people will take those away because they kind of obscure everything else. They are helpful to get your initial orientation. We are going to walk through this menu. One of the things you can do is in the Style menu, where you have lots of different options. I'm going to go into Edit Global Style. In this window, if you go up to Style and Edit Global Style, you have lots of different -- yes. Okay. So just click on this, and it should open. Okay. It's okay that you have a different structure. 
It doesn't matter. Remember we talked about the alpha helix and beta sheets. We have helix objects and strand objects. That is what these are. I don't need to see these any more. I take away my big green crayon things. I'm sure there is some other big scientific name for them, but I don't know it. I'm taking those away. Now I'm going to take away the strand objects as well. Now I can see more clearly what is going on. As I start to work on this and do annotations, I don't have those directional objects in the way. Notice this says apply after every change. If that isn't the default, you are going to have to hit apply after you make each change. I don't always remember what each of these is going to do, and by having it apply automatically I can go, yeah, that's what happens. You have options about how to visualize things. I'm going to say done with this. Let me just find a good view. My colleague who used to teach for the National Training Center used to have this example for a drug target. If you turned it, suddenly you could see the exact spot where the drug was intended to plug in, and it made sense. Depending upon who you are working with, take care to find an example that would get them excited about how it's relevant. Do a literature search. Look for interesting drug targets. Okay. Let's walk through a few of these things and then we can play with it. The really basic obvious thing is the file saving. The thing I want to point out is that if they're going to use the image as a flat image in a PowerPoint or an article, they have to export PNG. This is the Portable Network Graphics file format. You are used to JPEGs and GIFs. This is just another format. You just save it. A lot of people have never seen that. When they see export -- they are looking for something that says save image as. They see that and don't realize it is what it is. Of course if you use any of these images you should -- bless you. You should encourage your faculty to say where they came from. 
The only things to look can be different. There is a citation at the bottom. That is a citation for how to cite the Molecular Modeling Database. I'm sure you saw over and over on the BLAST slides that the first thing that came up is the publication citation. It has become so ubiquitous that the behavior is often not to cite it, but it is still the right thing to do. The citation is down here at the bottom. That is when this was first published. This isn't the most recent version. Yes. You could approach it either way. That's fair. Were you going to say something? Right. That's what I'm saying. You could make a judgment either way. This clearly tells you the article that describes this as a tool; this was the first version of it that was available, and this is the simple citation. Lots of people are trying to figure out appropriate ways to cite that image from that record within the database, to be more specific. Probably the other thing people will do is cite whatever scientific paper that image was derived from. If they're using that image they will probably list that paper -- they might just cite the paper. That will be cited. Typically you should also cite the tools. You see that with statistical software. Some people will cite the version number and location, or the Research Triangle. They will cite the version they were using. So there is a whole chain of behavior, but they should at least say something about what they have done. Okay. I'll jump back to my record here. But it's a common question. You can imagine that the resolution you get from taking a screen shot and cropping out the piece is not nearly as good as what you get from exporting. Okay. We have lots of other choices. How many of you have open houses where you like to show fun things, or little events or displays or exhibits? What they will say is, the 3D structures are cool. Let's show that. Especially if you have kids coming. One of the things that you can do is animate the display. 
We can actually make it so that the others go away. You can say, I want you to spin this structure so that I can have it moving and people can come in and look at it. So you can set it to do that automatically. I'm going to stop it because I'm not trying to induce anything. If you go back to something -- you can tell I am a public health person. And you can actually tell it to play its way through this. That is a really cool thing to be able to do. All right. It should be pretty apparent. Yes. You can of course make your window bigger as well. If you make your window really big and zoom in and spin it -- that is what is really cool. Yes. For a large protein it takes a lot of computational energy. The other thing is, how many of you have a visualization wall? A few of you do. I know you do at Harvard. There are many different kinds, but it is a big display screen of various types that you use to be able to really kind of walk up and look at what is going on, and display lots of images side by side and have the space to see them and juxtapose them. At my new place we have one that is mostly used for geographic information systems, so we can visualize big maps and things like that. A lot of bioinformatics groups have that kind of technology so that they can really make this much bigger. I don't think I turned the one at Cornell on. It is upstairs. If you don't remember how many times you clicked through so much stuff, you can just restore it to what the original thing was supposed to be. You don't have to keep hitting undo. Okay. This goes a little more into what you might like to do with this, not for beautification, but actually for scientific discovery. In the case of this structure I have nothing in mind. It is only for you to play with. If you were looking at particular domains to focus in on, or to view certain aspects of it, this really lets you pick in and out what you want to look at. 
You see a lot of these have arrows to give you a little bit of additional choice. Remember that when we were on the very first structure summary page we said we had our sequence line, and then we had two domains and we had a family. The Saturday they -- does everybody vaguely remember? The structure might have many chains or multiple domains. You could say, I don't want to look at the whole thing. You could have done that when you first started with it from the display screen. Usually it is easier to pop it open and start to narrow down what you want to look at later. Okay. We were already in the Style menu. Let me just say a few things about these. Does everybody remember from chemistry class the big ball and stick? If people like to visualize things in different ways, you have different options. If you have people who like to visualize things a certain way, you can accommodate. They can change color schemes. Although, I think usually I tend to leave the default color scheme, because later on you have things you can do to say, show me whether it is hydrophobic: whether it tries to avoid water or likes to go toward water. This is different functionality. The more you mess with your other colors, the more you can make messes. I generally don't change a lot of this, but somebody who knows what they're doing and understands that chemical property should feel free to change the colors of their protein. I just don't bother to do it. Yes. There you go. Labeling is one area I tend to look at. I showed you down here that you could highlight certain pieces of your sequence and visualize where they were. Sometimes you want to actually be seeing the labels. So you have a choice of three-letter or one-letter codes for the amino acids; you can specify that. You can say, show me. This here is to say, show me. When we're looking at the calcium, this is the thing that says, show me that calcium is calcium. 
If you have the label on, it will have this on. This isn't the calcium one. This was that compound. I just wanted to show you. So these are the one-letter codes. I'll say, show me every 50th. Let's see. It will actually do it. Yes. So what you get, you have a one-letter code. I don't know all 20 of them by heart, but I'm sure your scientists do. Does anybody have a guess about what the number is? It is where along the sequence you are. Yes. You will probably see this called the [ indiscernible ], depending upon what you're looking at. It is just as if you numbered your whole sequence from one up to wherever the sequence ended. Earlier you saw it was from 1-768. These are just numbers along the way of where it is at. Okay. So what is meaningful? What is meaningful in labeling really depends upon what you're looking at. Clearly if you start going down to something like show me every fifth label -- let's see. This one actually isn't automatic. You start to see that you get more congestion. There is a certain point of what you try to see. Is there something meaningful at that distance? Most of the time I'm just teaching them how to do it mechanically. The choice of distance, or what it's worth to them to show, is really based upon what they're trying to do. I'm just teaching mechanics. And you also have details, which just gives you some details about how it's rendered. These are all defaults that I don't normally change. But if somebody really wanted the tubes to be fatter or they really wanted -- yes, that is pretty much about the thickness and width. So if that made sense for what they were doing, they could do that. Does it seem pretty straightforward? You have some favorites. If you said fuchsia made me puke -- one of the primary things I've seen, we have a colleague who is red and green color blind. You saw that the domain was in red. I have some colleagues who will change that so that they don't have things in the colors that they can't distinguish.
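The "show me every 50th" labeling described above can be pictured with a small sketch. This is only an illustration of the idea, not how the viewer itself is implemented: the function name, the toy sequence, and the choice to number from 1 and label positions that are exact multiples of the spacing are all assumptions.

```python
def residue_labels(sequence, every):
    """Return (position, one-letter code) pairs for every `every`-th residue.

    Positions are numbered from 1, as if the whole sequence were numbered
    from one up to its end. Which residues a real viewer picks is an assumption here.
    """
    if every < 1:
        raise ValueError("every must be at least 1")
    return [(pos, code) for pos, code in enumerate(sequence, start=1)
            if pos % every == 0]


# Toy sequence: the 20 one-letter amino acid codes, repeated twice (not a real protein).
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 2

sparse = residue_labels(toy_sequence, 10)   # few labels, easy to read
dense = residue_labels(toy_sequence, 5)     # more labels, more congestion
print(sparse)
print(len(sparse), "labels versus", len(dense), "labels")
```

A smaller spacing gives more labels on the same structure, which is the congestion mentioned above; the right spacing depends on what the viewer is trying to see.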
That way they can visualize better. That is actually the only time I've seen somebody change it for any reason other than a scientific one. These are the shortcuts. If you did not get into global and just wanted to change it, you could say, all right. I am a ball-and-stick person. I like to look at it in ball and stick. I'd like to look at it in space fill. That is the one you're used to seeing from chemistry class. Let me return this. But look what happened. This accidentally put -- somebody tell me. How do I get rid of these? Does anybody remember? Edit. Remove strand objects. What else? Helix objects. You guys now know how to get rid of it. Excellent. Those were all of the rendering shortcuts. We also have coloring shortcuts, which is what we were just talking about. Depending upon what you're working on you can do all kinds of things. If there is something about temperature -- sometimes if the site on the structure is active it's going to have a higher temperature than it might have otherwise. These are kind of more scientific questions, but if it's helpful to view them in that capacity you have the option. These are all defaults. The chemistry community has defaults for how to show temperature, how to show [ indiscernible ]. So for anybody who is looking at that, it will be the same. Okay. Does everyone see what's going on? Which end is the hot end? Probably the top. You also have a section. Look at this free-floating end. That is also very warm in an area where everything else is a little more cool. You can see things that might seem interesting. Okay. I almost hesitate to say, are there any questions, because there probably could be lots. You aren't expected to know the answers. Hopefully the help is very good. It's the same kind of situation. Most people don't actually take the time to read help. Seriously, are there any questions? No? Okay. The other thing, we have avoided this so far.
I think what happens most of the time is that everybody gets all excited about the structure part and kind of forgets the fact that the sequence is down here. So we have already shown how highlighting those works and how the color changes down in here to correspond. You have a couple of different options. One of the things that you can do is that you can actually -- of course I haven't imported anything. You can actually import different data, a different sequence, into this and try to align it against this sequence data to see how it matches up and whether it would be interesting. So let's say that I had a sequence that I absolutely could not find a structure for, but I found some related sequences. I said, let me match it up and then see how it could be looking. You could bring that in and do that, but before we commit to doing that, what I want to do is back up and talk for a minute about related sequences and structures, and how that works. The slides have a lot of good information, and then we'll come back to this. Actually, we would use this as an example. There is another example in the slides where we have a simple sequence. Go ahead and close this. Everybody still with me? Okay. Don't worry. Definitely close the windows so that you don't slow down -- unless all of you have super high performing computers at your institution. Our computer lab does not tend to perform well with a bunch of windows open. Here is the conundrum we were talking about yesterday and earlier today. These are old numbers. These are last year's numbers. You can see the ratio of protein records. It's really disparate. Some 18 million protein records and 50,000 structure records, because it is a lot of experimental work and not a lot of places have the capacity to generate experimental structure data. So the likelihood of matching and having a structure record is fairly slim.
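To get a feel for the size of that discrepancy, the two rounded counts quoted above can be turned into a ratio. This is just arithmetic on the figures as spoken (about 18 million and about 50,000); the variable names are illustrative, and the counts themselves were described as last year's numbers.

```python
# Rounded counts as quoted in the session; both are approximate and dated.
protein_records = 18_000_000
structure_records = 50_000

ratio = protein_records / structure_records
share = structure_records / protein_records

print(f"about {ratio:.0f} protein records for every structure record")
print(f"structure records are about {share:.2%} of the protein record count")
```

On these figures there are a few hundred protein records for every structure record, which is why a direct structure match for a given sequence is unlikely.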
You see that we could actually figure out the numbers ourselves, but I don't think that is as important as the sense of how big of a discrepancy it is in terms of a ratio. So if you don't find -- if going to Structure and doing your searching directly, or going in and searching and then following the related structure links, if neither of those approaches yields anything that looks remotely like the item that you're scientifically interested in, there are several avenues that you can use to try to get at what is going on. What we have done already is what he is calling here front door one, which is go to the Structure database and do the search for words. Use the best creative words and truncation and synonyms. The second way is to go to Protein and display the associated structure record. Let's actually go to this and try that same approach. Get yourself in there. Do a search. I am just going to do Alzheimer's. Okay. So when I do that I get 6,000 records. Did everyone get there okay? Yes. Okay. And so I have 6,000 records. What I'm going to do is go ahead and say, you know what, show me any of those records that have a structure link. That is one avenue. That is if I want to see the structures that are available. Let me do that first. So out of 6,000 protein records -- I can't believe that's true. Remember, this goes back to what we were saying. This is a teaching example that we had not gone through. It happened to you as well? When I came back the second time it was fine. Okay. So we're having late afternoon hiccup issues. Okay. Okay. That is comforting. I was wondering how it could be that all of those records did have sequences. That would be bad. So that is another avenue. We did the direct one. This is the protein. And then, for my whole group of retrievals, tell me anything that was attached to any of them. You also have the option of dealing with related structures and related sequences.
So for any individual record you can go directly -- this one doesn't have a direct structure record. Do you see that? If it actually had its own experimental structure that went with this particular sequence you would see a structure link here that was just labeled as structure. So you don't see structure, but what you do see is related structure, which is going to -- it has a structure link. You'll have related structure and related sequences. So these two things are actually related. And they're going to try to -- I am going to explain this once, but it is also on the slides, and you should read it again. The relationship here is a bit layered. Remember I said most of the structure-to-structure comparisons are done using that Vector Alignment Search Tool? That would require a structure, so that you were saying, show me structures related to that structure. We have already just ascertained that this thing does not actually have a structure. Is everybody with me? It does not have its own 3D structural model. It does not have one. So if it says related structure, how does that work out when this doesn't have a 3D structure? How can I actually find a related structure? Does everybody see why it might be a little confusing? What related structure does is say, take the related sequences and show me the structures that relate to the related sequences. In every other kind of thing you're working with in these databases the related thing builds off of these. The related sequences link is here. You're taking this sequence and doing a precomputed BLAST search and taking a certain cutoff level and saying, these are the ones that are related. Does that make sense to everybody? In the same way that these have related articles, this is a precomputed search. It gives the [ indiscernible ] after a certain cutoff -- yes. So for this it will be the default search; that is the cut point they will use. And some people are happy with that.
Some people want to come up with their own list of related things. These are default and they're precomputed. You don't have to BLAST and do your own search. You can just see what a typical BLAST search would generate. Yeah, right. I'll show you what happens if you go here. It is literally going to come back. I did not select it. It shouldn't do that. You should not have to select this. I think it is having one of these kind of afternoon problems. You should get a whole bunch -- yes. Okay. At any rate. The related sequences -- Do you want to?

[Speaker Unclear]

Okay. So what would really be happening in terms of the related structures is that it is really collapsing these two steps. Instead of making you do related sequences and then showing you the structures, it is collapsing those two steps into one. So when you come here and say, show me the related structure, what it is doing is, it's doing your related sequences work and then it's trying to show you the structures that are related to the related sequences. Like I said, read it again in the explanation, because I think it is one of those things that is a little bit like, what am I retrieving again? [ indiscernible ], but you can see it has 217 hits that have known structures. And a lot of conserved domains across here. You see that because it was doing these -- you also get the expect values for the database. Actually, I'm not clear. From looking at these I would be saying that these values must relate to the related sequences work as well as to BLAST. I have to look at that. Yes, I don't know. It used to not do that for you. So I'll clarify that. But basically the values work the same way. Everything you are looking at in this set, these are really, really low. You wouldn't expect with these things -- I'm sorry. These -- if you mouse over these you can see that they are actually the alignments to that particular piece of the sequence. This is interesting because it says related structures and it is all about the structure, but it is all based on the sequence similarity. You really still have to follow these things out. You see that you can view the 3D structure. So for each of these alignments where it is saying, you know what, my sequences were pretty similar, you can say, let me see the structure of that. And so we had 217 hits.
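The "two steps collapsed into one" idea can be sketched with toy data. Everything here is hypothetical: the record names, the two lookup tables, and the function are stand-ins meant only to show the logic of going from a protein to its precomputed sequence neighbors and then to the structures attached to those neighbors. It is not the database's actual implementation.

```python
# Hypothetical toy tables (not real records).
# Step 1 data: protein -> precomputed related sequences, best match first.
RELATED_SEQUENCES = {
    "protA": ["protB", "protC", "protD"],
}
# Step 2 data: protein -> structure records attached directly to it.
DIRECT_STRUCTURES = {
    "protB": ["struct1"],
    "protD": ["struct2", "struct3"],
}


def related_structures(protein_id):
    """Collapse the two steps: related sequences, then their structures."""
    hits = []
    for neighbor in RELATED_SEQUENCES.get(protein_id, []):
        for structure in DIRECT_STRUCTURES.get(neighbor, []):
            hits.append((neighbor, structure))
    return hits


# protA has no structure of its own, yet it still gets "related structures".
print(DIRECT_STRUCTURES.get("protA"))
print(related_structures("protA"))
```

Note that every hit is justified only by sequence similarity to the neighbor, which is why the structures still have to be opened and compared before drawing any conclusion.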
If you look at several of them and you had several where the sequence is the same, and you go and visualize the structure and all of the structures also looked very similar, you might start to feel somewhat confident that the sequence you're looking at, which is very similar to all of these sequences, might also be structurally similar. If every one of these structures looks completely different, if they all have the same kind of sequence but the structures look radically different, then for the idea that you could use this sequence and structure, you might say, there is no point in saying these might be related. At this point you haven't viewed them at all. You view several of the best matches to see what is going on. This is what she called back door or front door number two. I can't ever remember. Okay. Right. So this was the back door. This was, find a similar protein sequence and see if any had a structure. We just happened to do that in one step. And then number two, [ indiscernible ] not really doing well with sequences. Let me just focus on those conserved domains. Those things are very likely to carry through. You can at least look at the structure of those domains. You have a couple of places to look at that. Let me go back. I am trying to think which of these was actually the window that -- so you see that you have the domains for each of these. You could certainly go in and look for the conserved domains related to Alzheimer's and see which ones carried through. There are several that relate to Alzheimer's. Okay. So this is just an example of one of them. If you remember from what we looked at just a minute ago with related structures, some of those had a couple of these in them as well. So they should start to look familiar. And so you have a lot here to work with. You can also search the database directly.
If you were saying, I want to come in here and see what conserved domain is related to what's going on, this is a tool that lets you say, here is my sequence. See if there is a conserved domain in it. Whether you are providing the sequence or telling it to get a particular number. This is within that process. I don't think -- I don't think you can actually text word search in the conserved domain database. You would be searching somewhere else and then saying, show me that domain. That is what I'm saying. You can't. If you look at the box that tells you what kind of data it will take -- this is important. You kind of hope that they tell you what data they take. If you are in [ indiscernible ] you know what kind of data it takes. All of this takes text. As you start to move out to this kind of database it should be telling you what kind of data you can put in it. And if it doesn't, we should tell them to put a little line under it that says what kind of data it can take. You see, it is talking about format. We've been talking about that over and over. How is everybody holding up? We have about 10 more minutes. Is everybody okay? Okay. And of course they say that these simply shed light on the possible structure of the things; the actual structure must be confirmed. Bless you. So once you get past the immediate thing of, I searched on what I was looking for and found the structure for what I was searching on directly, all of these other things about it might be related and it might not are at best kind of insights into what you actually need to figure out experimentally. Okay. So let's actually go ahead and do this. So the example here is, I want to try to find a structure for the tumor suppressor protein. I would like to see it complexed with DNA. Don't ask me. Try to figure out how. Actually, do you know why? There you go. It makes sense. You'll see when we look at it, but basically it just means connected or interacting in space with, in terms of this.
Sorry, I should repeat the question. The question was what complexed with DNA meant, which is what I just said. Don't copy what it says on the screen exactly. If I am telling you to try to find it complexed with DNA, this is only getting you halfway there. Build upon that. Yes, you can use that. Like I said, we have been using operators, truncation. You can use nesting with parentheses. All of your typical searching arsenal is still available to you. Good question. Everyone feels like they have done this? Okay. So most people, what did you type in? Some people just did p53 AND DNA. Did anybody try anything else? Okay. You added the complex? Oh, you have complex before the word DNA. Okay. And you can certainly do things like this. Just keep adding words, but you then become more and more at the whim of whoever entered the data into the record, which can be variable across the depth at which they help you out. So it certainly narrows it down. I was just trained in an era where it had to be capitalized. [ indiscernible ]. You really don't even need it. I try to use it when I teach anyway to help people separate the different pieces of what I'm doing. Well, you might need it if you run into a situation where it is treating something as a phrase; if you really don't want it to be a phrase you might break it up intentionally. A lot of my students never use AND. That's fine. So actually he was talking about how the AND and OR and NOT behave, and the capitalization, and whether it will treat it as a stop word. One of the things that I will say is, we talked about this cutting across a shared search engine; the fact is the way things behave sometimes, for example, capitalization really doesn't matter, even for the OR or NOT, most of the time, but it will still matter in some of these others. So if you're not sure it's better to capitalize. The fact is, look at your details and see how it worked or didn't work.
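As a small illustration of the advice to capitalize operators and to nest with parentheses, here is a sketch that assembles search strings of the kind typed in this exercise. The helper names are made up for this example, and the synonym "TP53" is added only to show nesting; neither comes from the session, and nothing here calls a real search service.

```python
def all_of(*terms):
    """Join terms with a capitalized AND, skipping empty entries."""
    cleaned = [t.strip() for t in terms if t.strip()]
    return " AND ".join(cleaned)


def any_of(*terms):
    """Join synonyms with a capitalized OR and nest them in parentheses."""
    cleaned = [t.strip() for t in terms if t.strip()]
    return "(" + " OR ".join(cleaned) + ")"


# The simple query several people typed.
print(all_of("p53", "DNA"))
# Adding a word narrows the search further.
print(all_of("p53", "DNA", "complex"))
# Nesting: a synonym group (illustrative) combined with another term.
print(all_of(any_of("p53", "TP53"), "DNA"))
```

Writing the operators in capitals costs nothing where case is ignored and avoids surprises where it is not, and checking the search details afterwards shows how the string was actually interpreted.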
Some of them will still behave a little strangely. What other question have we been talking about asking that we did not ask here? How have we been focusing a lot of our searches? In some places we had been searching in particular fields. In terms of the scientific issue, what have we been asking of most of our searches? [ indiscernible ]. What aspect were we really trying to get at? Well, yes. Some of what we have been dealing with is the data, but we have also been dealing with the organisms. We have been doing human searches and mouse searches. We didn't really talk about this, but looking at what we have here, just from the display you're seeing, is this all human data? You don't actually know whether some of this -- p53 is also in the mouse. It is also in the rat. In some cases it also has human in the title, but remember we talked about there being a lot of things that are also human. So if you really only wanted human you could certainly add that, or you could just deal with it at a browsing level. Some of those, two of them weren't human. In this particular case it might not have been a big deal. If you looked at your initial results and had half mouse structures and half human structures and you only wanted the human, the same strategy of organism searching can be used. The bracket is what tells you which field that is. If I just typed that in without the brackets -- if you type in human without the field name, what is going to happen is it will search human in all fields. So in this small set it is probably not going to make a big difference, but we saw examples yesterday with the difference between doing human in all fields versus [ indiscernible ]; you did see some differences. In this case it does make a difference. One of them wasn't human as the organism. It had the word human in it. So doing it as a free-floating word gave a different result. Probably no big deal here, but potentially a big deal in another database.
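The difference between typing human as a free-floating word and restricting it to the organism field can be shown with a toy search over made-up records. The records, field names, and the simple substring matching below are all assumptions for illustration; a real search engine indexes and matches terms differently.

```python
# Made-up records for illustration only.
RECORDS = [
    {"id": 1, "title": "p53 core domain bound to DNA", "organism": "human"},
    {"id": 2, "title": "p53 homolog compared with human p53", "organism": "mouse"},
    {"id": 3, "title": "p53 tetramerization domain", "organism": "rat"},
]


def search(term, field=None):
    """Return ids of records containing `term`.

    With field=None the term is looked for in every text field (all fields);
    with a field name it is looked for in that field only.
    """
    term = term.lower()
    hits = []
    for record in RECORDS:
        if field is None:
            text = " ".join(str(v) for k, v in record.items() if k != "id")
        else:
            text = str(record[field])
        if term in text.lower():
            hits.append(record["id"])
    return hits


print(search("human"))               # all fields: also catches the mouse record
print(search("human", "organism"))   # organism field only
```

Record 2 is a mouse record that merely mentions the word human, so it appears in the all-fields search and drops out once the term is tied to the organism field.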
Remember that this only has about 50,000 records. Where it makes a big difference is one of the searches I end up doing most often. Insulin. You can imagine that there is pig insulin and all kinds of other insulin. You may definitely want to focus your organism searches in different ways. That is a good comment. Okay. This was fairly easy for everybody to do. Did we actually look at -- did we actually look at any of them? We had a question about what it meant to be complexed. Which one do we want to look at? Somebody named one earlier. Let's look at number eight, which is also the one that is referenced in the slide. If you want to stand up and see what the slide says, you'll be able to. Great. You can probably see it from here without even opening it. Guess which part of this is the DNA and which part is the structure. What does DNA look like? A double helix. Is it that top part? Part of this is visualization. You can actually visualize. I'll open it a little bigger. Okay. And so you also can see here what's going on. This is a protein that has three chains. You actually have three. Each one is in [ indiscernible ]. These are little ones. You see all the different letters. But then you also can see this is a protein sequence. You can see that without doing any highlighting or anything. And if you were trying to figure out kind of where the connection could be you can kind of -- you can see this end of it where it is. You can certainly go through and start seeing where the nature of the bonding and relationships are, if what you're trying to figure out is how it complexes with it. And then what do we have free-floating? Well, they aren't free-floating. What are we visualizing? Everybody remembers the periodic table, even if we don't remember the amino acid code list. [ indiscernible ]. Let me see. Which of the many -- okay.
This is compiled by either X-ray crystallography, which is like taking a massively, massively high quality photograph of it, or nuclear magnetic resonance, where they put it in a bath of some kind of -- and take pictures of it in solution. Yes. You can put it in, but you have it in solution. Right. Right. I guess I'm trying to think what kind of -- the whole microscope, I'm trying to remember those. There is some other name for it, like the actual -- yes. Yes. Okay. So X-ray crystallography, to repeat what is being said, is that you are shooting X-rays at the crystal and seeing how they diffract from that. That's what you have a picture of. That would make sense. If you're back in the search, for the ones that are X-ray crystal, if you mouse over, this says X-ray diffraction. That gives you a little bit of insight. Okay. So I think the rest of these -- I don't necessarily think we have to do this again because it's basically -- these exercises are here and you can certainly do them again if you want them for reinforcement. We've done quite a few protein searches. The idea of searching for this phrase and then choosing the structure links, hopefully that's something you feel pretty comfortable with. Okay. And we looked at the structure record for this. Did we just jump right into this? Right. This is the previous one. It is easy to get confused about how many windows you have open. Let me just bring this record back up for one moment. Okay. There are a couple of things here that are a little different. And you can see this is a much older structure that has been around for a while. The protein chains were from what organism? Humans. But the DNA, what was the DNA from? It is synthetic construct according to this. So you're going to run into this a lot as you look at things that are complexed with something. Don't get too completely caught up in that. Just be aware. I should have said that.
We are here in taxonomy, and this is what organism your sequence or your structure is coming from. So this field would have taken you to the human record. This takes you to the synthetic construct. So -- [ indiscernible ] is it Mr. and not Doctor? [ indiscernible ]. All right. Any other -- all right. This is the other way of looking at it. Remember we had three different chains and colors. They are all part of this DNA-binding family. So even if we had not found this particular structure with these, we could have said, let me go look into the domain and look for the p53, which would at least give me some information about this situation. In this case you see that you cannot really see this, but if you mouse over it, it tells you what all of this is for. What I will do is just pop over, because that is the thing that we have not really addressed. Because this is kind of new -- it's not really new any longer, but they don't give it much attention. How many of you have looked at PubChem? About half. That's great. So I think a couple of these have [ indiscernible ] about this. This is probably a particularly bad record to come in off of, because zinc is probably not the most exciting compound that you have in here. Let me go back and just orient you for one second. [ indiscernible ]. I skipped over many, many slides. Just go forward a lot until you get to -- it gives you lots of review of what we've gone over today. So basically we have three pieces. You'll hear this called the small molecule database. Basically this takes the other chemical databases -- well, they aren't really NCBI databases. Most of this information resides in Specialized Information Services, the people who do the toxicology work. ChemID. They are trying to bring these chemical and protein databases more together and under the same rubric. But these -- instead of [ indiscernible ] small molecules there are other molecules that are over here.
So there are three components, and you can see the sources. We were just talking about some. You have the National Cancer Institute's various substances databases. They're doing a lot of screening of possible chemotherapeutic things. The National Institute of Standards and Technology. The WebBook. All of those substances are in here. We talked about the [ indiscernible ]. The substances related to those are few. You get an idea of the scope of data in there. And then the compound database is basically taking a view of the substances that came from this. There is one for each substance and then one for each unique compound. There is a lot of interrelation, and we'll come back to this when we look over that record. Then there is the bioassay database. As they're taking all of these substances and trying to figure out whether they have any potentially interesting behavior, these are assays to detect whether these are bioactive molecules or whether they are useful in the screening process -- and for those of you who support medicinal chemists and people who are interested in drug databases and drug design, this will probably become more and more interesting for them. These are the three parts of it, and really I would refer you -- we talked about the module this morning. It has many slides about this. There is also a whole short course. So if you're interested probably the best thing to do is actually review that. I'm just going to open it in another window real quick. I think this isn't being taught anymore either, but they have all of the lecture slides and handouts. Those of you who support chemists will want to pop over to the link. Let me close this. I'm sorry, in the actual slide there was supposed to be a hot link to it. [ indiscernible ]. You can't even see what number slide you are on. Yes. The fact is, if you get to the web site and just type in PubChem you will probably find it. It lives on their server.
If you're over in the actual database -- you see there are a lot of linkages going on. The toxicology database links to [ indiscernible ] structures, with each one having a unique ID number. [ indiscernible ], zinc is a really bad example probably, but since it's what's open, let's just run with it. You can use all of the subheadings you would normally use. And you're used to having subsets. This is the same thing. You can just focus in. You get 75 synonyms, which probably isn't very important here, but for other compounds that potentially have lots of other names, having those could be interesting. You have chemical properties when you get into the WebBook. And then you have links. Here we have biological properties. For this it probably isn't a big deal. For some other things where this is a unique resource, this becomes more helpful, especially if you don't have [ indiscernible ]. So let's actually pick something that is a little bit more interesting. You see here you can choose between the three. Of course they did not structure it like any other database we have been searching. Everything is different in how it is laid out. But I think I was doing pepstat. This is normally what happens if you're searching it directly. You have the same look. There are three assays. There are two proteins. You have related links. You can draw chemicals. You can use a drawing tool so that if you don't have a name but you want to draw a substance you can go ahead and use this. At any rate, you can get an analysis of known bioactivities from those. So I'll stop here because we're running over. The Metro people want us out. Think about this a little bit. If you have questions tomorrow morning we can entertain those for a minute or two before we move on. Is everybody okay? Questions? Concerns? Head explosions about to occur? Good. All right. Well, have a good night and we'll see you tomorrow morning. Tomorrow is the early day. If that means some of you are looking to change train schedules, I just want to remind you.
I felt bad about yesterday. And Lawrence said to me, be slower [ indiscernible ]. Today was [ indiscernible ]. Even for us. Even when I watch David teach I learn something every time. I didn't know that. Okay. It's a lot. Were you guys okay back here? I'm feeling the love from you, Kevin. I'm not surprised. I thought absence would make the heart grow fonder. After this you'll probably never see me again. I don't think I get to come back to New York unless I'm coming back on vacation. And I assure you if I come back on vacation I will not be --

[Event concluded.]