Software for searching for sequence similarity.
The programs can be run standalone, but most people use them through a web browser.

Site

http://www.ebi.ac.uk/Tools/sss/

Wikipedia preview

Source (authority): the free encyclopedia Wikipedia, as of 2016/05/05 23:59:58 (JST)

wiki ja


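FASTA is a bioinformatics software package for performing sequence alignment of DNA nucleotide sequences and protein amino acid sequences.

BLAST is another well-known software package for sequence alignment.

The first version, named FASTP, was developed by David J. Lipman and William R. Pearson, who described it in a paper published in 1985^[1].

It was originally designed to search protein sequence databases for amino acid sequence similarity. The 1988 version of FASTA added the ability to search for similarity among DNA nucleotide sequences^[2]. FASTA uses a more sophisticated algorithm than FASTP and evaluates statistical significance. The FASTA software package includes several programs for aligning protein amino acid sequences and DNA nucleotide sequences.

FASTA is pronounced "FAST-Aye". The name stands for "FAST-All", a collective term for "FAST-P" (protein) alignment and "FAST-N" (nucleotide) alignment.

The current version of the FASTA software package can perform the following searches. The sequence submitted to search a sequence database is called the query.

Search a nucleotide sequence database with a nucleotide query
Translate a nucleotide query into amino acid sequences and search an amino acid sequence database
Search an amino acid sequence database with an amino acid query
Search a nucleotide sequence database (translated into amino acid sequences) with an amino acid query
Search an amino acid sequence database with multiple peptides (short peptide chains) as the query

Searches that account for frameshift mutations are also possible. Sequence databases can also be searched and compared with SSEARCH, which implements the Smith-Waterman algorithm, at the cost of slower processing.

The main purpose of the FASTA software package is to calculate accurate similarity statistics. These statistics let biologists judge which alignments are likely to be meaningful and infer homology.

The FASTA software package is distributed from the University of Virginia FTP server.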

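FASTA format

FASTA uses a format called FASTA format to describe sequence data. FASTA format is plain text. Each sequence record consists of a single header line beginning with ">", followed by one or more lines containing the sequence itself. On the header line, ">" is followed by a string that identifies the sequence and then by a string that describes it (both may be omitted). There must be no space between ">" and the identifier. It is recommended that every line in a FASTA-format file be shorter than 80 characters. Another line beginning with ">" ends the current sequence record and starts a new one.

An example of the FASTA file format follows.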

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
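
The record layout described above can be read with a few lines of code. The following is a minimal sketch, not the parser used by the FASTA package itself: the function names parse_fasta and _make_record are hypothetical, the first record is abbreviated from the example above, and the second record (seq2) is invented purely for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The identifier runs up to the first space; the rest is the description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous record and opens a new one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Illustrative input: first record abbreviated, second record hypothetical.
example = """>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
IENY
>seq2 hypothetical second record
ACGT
"""
records = list(parse_fasta(example.splitlines()))
for identifier, description, sequence in records:
    print(identifier, "|", description, "|", len(sequence))
```

Sequence lines are concatenated without separators, so the line width of the file does not affect the parsed sequence.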

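In FASTA format, sequence strings are written using the amino acid or nucleic acid codes defined by IUB/IUPAC. Lowercase letters are accepted and are converted to uppercase internally by FASTA. In addition, "-" (hyphen) denotes a gap, "U" denotes selenocysteine, and "*" denotes a translation stop. FASTA cannot process a query sequence correctly if it contains digits. Before running FASTA, digits must be removed or replaced with an appropriate letter ("N" for an unknown nucleic acid base, "X" for an unknown amino acid).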
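
The clean-up described above (uppercasing, and removing or replacing digits) might be done before submitting a query. This is a minimal sketch under stated assumptions: clean_query is a hypothetical helper, not part of the FASTA package, and whether digits should be dropped or replaced depends on what they represent in the input.

```python
import re


def clean_query(seq: str, is_protein: bool = False,
                replace_digits: bool = False) -> str:
    """Prepare a raw query string for a FASTA search.

    Digits are removed by default (useful when they are position numbers);
    with replace_digits=True they become "X" (protein) or "N" (nucleotide).
    """
    if replace_digits:
        placeholder = "X" if is_protein else "N"
    else:
        placeholder = ""
    seq = re.sub(r"\d", placeholder, seq)   # handle digits
    seq = re.sub(r"\s+", "", seq)           # drop spaces and line breaks
    return seq.upper()                      # mirror FASTA's uppercasing


print(clean_query("1 acgt 61 ttga"))
print(clean_query("mk5l", is_protein=True, replace_digits=True))
```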

Nucleic acid codes accepted by FASTA
Code	Meaning
A	Adenine
C	Cytosine
G	Guanine
T	Thymine
U	Uracil
R	G or A (puRine)
Y	T or C (pYrimidine)
K	G or T (Ketone)
M	A or C (aMino group)
S	G or C (Strong interaction)
W	A or T (Weak interaction)
B	G, T or C (not A; B follows A in the alphabet)
D	G, A or T (not C; D follows C)
H	A, C or T (not G; H follows G)
V	G, C or A (not T, not U; V follows U)
N	A, G, C or T (aNy; unknown)
-	Gap
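
The ambiguity codes in the table can be expressed as a lookup from each code to the set of bases it stands for. The sketch below is illustrative only: the names IUPAC_NUCLEOTIDES and bases_compatible are hypothetical, and treating U as equivalent to T is an assumption made here for simplicity.

```python
# Each code maps to the concrete bases it may represent (table above).
# Assumption: U is treated as T so that RNA and DNA letters compare equal.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def bases_compatible(a: str, b: str) -> bool:
    """Return True if two codes can represent at least one common base."""
    set_a = set(IUPAC_NUCLEOTIDES[a.upper()])
    set_b = set(IUPAC_NUCLEOTIDES[b.upper()])
    return bool(set_a & set_b)


print(bases_compatible("R", "A"))  # R covers G and A
print(bases_compatible("B", "A"))  # B is "not A"
```

The gap character "-" is deliberately left out of the lookup, so passing it raises a KeyError rather than silently matching.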

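Amino acid codes accepted by FASTA
Code	Meaning
A	Alanine
B	Aspartic acid or asparagine
C	Cysteine
D	Aspartic acid
E	Glutamic acid
F	Phenylalanine
G	Glycine
H	Histidine
I	Isoleucine
K	Lysine
L	Leucine
M	Methionine
N	Asparagine
P	Proline
Q	Glutamine
R	Arginine
S	Serine
T	Threonine
U	Selenocysteine
V	Valine
W	Tryptophan
Y	Tyrosine
Z	Glutamic acid or glutamine
X	Unknown (any)
*	Translation stop
-	Gap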
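
A protein query can be checked against the codes in the table before it is searched. This is a small sketch; the names AMINO_ACID_CODES and invalid_residues are hypothetical, and the accepted set is taken only from the table above, not from the FASTA source code.

```python
# Codes listed in the table above, including "*" (stop) and "-" (gap).
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def invalid_residues(seq: str) -> list:
    """Return the sorted, distinct characters that are not in the table."""
    return sorted({ch for ch in seq.upper() if ch not in AMINO_ACID_CODES})


print(invalid_residues("MKTAYJ1"))  # J and the digit are not in the table
print(invalid_residues("mkv*"))     # lowercase input is accepted
```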

References

^ Lipman, D. J.; Pearson, W. R. (1985). "Rapid and sensitive protein similarity searches." Science 227 (4693): 1435–1441. PMID 2983426.
^ Pearson, W. R.; Lipman, D. J. (1988). "Improved tools for biological sequence comparison." Proc. Natl. Acad. Sci. USA 85 (8): 2444–2448. PMID 3162770.

External links

Description of the FASTA format (in English)
FASTA server at the University of Virginia - distributes the FASTA software package
GenBank to FASTA converter

wiki en


FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.^[1] Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

History

The original FASTP program was designed for protein sequence similarity searching. FASTA added the ability to do DNA:DNA searches and translated protein:DNA searches, and also provided a more sophisticated shuffling program for evaluating statistical significance.^[2] Several programs in this package allow the alignment of protein sequences and DNA sequences.

Uses

FASTA is pronounced "fast A", and stands for "FAST-All", because it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.

The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA (with frameshifts), and ordered or unordered peptide searches. Recent versions of the FASTA package include special translated search algorithms that correctly handle frameshift errors (which six-frame-translated searches do not handle very well) when comparing nucleotide to protein sequence data.
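
To make the term "six-frame translation" concrete, the sketch below translates a DNA sequence in all three reading frames on both strands using the standard genetic code. It shows only the plain six-frame approach that the paragraph contrasts with FASTA's frameshift-aware algorithms, not those algorithms themselves. The function names are hypothetical, and codons containing letters other than A, C, G and T are translated as "X" by assumption.

```python
BASES = "TCAG"
# Standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become "X" (assumption)."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return translations keyed by frame name: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

A single inserted or deleted base shifts the correct translation from one of these frames to another partway through the sequence, which is why independent per-frame comparisons handle frameshift errors poorly.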

In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm.
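
The core of the Smith-Waterman algorithm is a dynamic-programming recurrence that never lets a cell score drop below zero. The following is a minimal sketch of that recurrence, not the SSEARCH implementation: the function name smith_waterman_score and the scoring values (match 2, mismatch -1, linear gap -2) are illustrative assumptions, whereas real searches typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    prev = [0] * (len(b) + 1)   # scores for the previous row
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]              # column 0 is always 0
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The cost is proportional to the product of the two sequence lengths, which is why the heuristic programs are much faster for database searches.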

A major focus of the package is the calculation of accurate similarity statistics, so that biologists can judge whether an alignment is likely to have occurred by chance, or whether it can be used to infer homology. The FASTA package is available from fasta.bioch.virginia.edu.

A web interface for submitting sequences and searching the online databases of the European Bioinformatics Institute (EBI) with the FASTA programs is also available.

The FASTA file format used as input for this software is now widely used by other sequence database search tools (such as BLAST) and sequence alignment programs (Clustal, T-Coffee, etc.).

Search method

FASTA takes a given nucleotide or amino acid sequence and searches a corresponding sequence database by using local sequence alignment to find matches of similar database sequences.

The FASTA program follows a largely heuristic method which contributes to the high speed of its execution. It initially observes the pattern of word hits, word-to-word matches of a given length, and marks potential matches before performing a more time-consuming optimized search using a Smith-Waterman type of algorithm.

The size taken for a word, given by the parameter ktup, controls the sensitivity and speed of the program. Increasing the ktup value decreases the number of background hits that are found. From the word hits that are returned, the program looks for segments that contain a cluster of nearby hits. It then investigates these segments for a possible match.
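
The effect of ktup on the number of word hits can be seen with a small lookup-table sketch. This is an illustration under stated assumptions, not FASTA's code: word_hits and hits_per_diagonal are hypothetical names, only exact word matches are counted, and a diagonal is identified by the offset j - i between the subject and query positions.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, ktup: int) -> list:
    """Return (i, j) pairs where query[i:i+ktup] == subject[j:j+ktup]."""
    table = defaultdict(list)          # word -> positions in the query
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    hits = []
    for j in range(len(subject) - ktup + 1):
        for i in table.get(subject[j:j + ktup], ()):
            hits.append((i, j))
    return hits


def hits_per_diagonal(hits: list) -> dict:
    """Count hits on each diagonal; clustered hits share a diagonal."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return dict(counts)


query, subject = "ACGTAC", "TTACGTAC"
for ktup in (1, 2):
    hits = word_hits(query, subject, ktup)
    print(ktup, len(hits), hits_per_diagonal(hits))
```

With the larger word size fewer scattered hits survive, while the diagonal that carries the real match still stands out.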

There are some differences between fastn and fastp relating to the type of sequences used but both use four steps and calculate three scores to describe and format the sequence similarity results. These are:

Identify regions of highest density in each sequence comparison, taking ktup to equal 1 or 2.

In this step, all or a group of the identities between two sequences are found using a lookup table. The ktup value determines how many consecutive identities are required for a match to be declared, so the smaller the ktup value, the more sensitive the search. Users frequently take ktup = 2 for protein sequences and ktup = 4 or 6 for nucleotide sequences; short oligonucleotides are usually run with ktup = 1. The program then finds all similar local regions between the two sequences, represented as diagonals of a certain length in a dot plot, by counting ktup matches and penalizing intervening mismatches. In this way, the local regions with the highest density of matches on a diagonal are isolated from background hits. For protein sequences, BLOSUM50 values are used for scoring ktup matches. This ensures that groups of identities with high similarity scores contribute more to the local diagonal score than identities with low similarity scores. Nucleotide sequences use the identity matrix for the same purpose. The best 10 local regions selected from all the diagonals are then saved.

Rescan the regions taken using the scoring matrices, trimming the ends of each region to include only the residues contributing to the highest score.

Rescan the 10 regions taken. This time, use the relevant scoring matrix while rescoring, so that runs of identities shorter than the ktup value are allowed. Conservative replacements that contribute to the similarity score are also taken into account while rescoring. Although protein sequences use the BLOSUM50 matrix, scoring matrices based on the minimum number of base changes required for a specific replacement, on identities alone, or on an alternative measure of similarity such as PAM can also be used with the program. For each of the diagonal regions rescanned this way, a subregion with the maximum score is identified. The initial scores found in step 1 are used to rank the library sequences; the highest score is referred to as the init1 score.
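
Finding the maximal-scoring subregion of a diagonal is a maximum-sum run problem. The sketch below rescans one diagonal and trims it to the best-scoring stretch. It is an illustration only: best_subregion is a hypothetical name, and a simple identity scheme (match +5, mismatch -4) is assumed in place of a real matrix such as BLOSUM50.

```python
def best_subregion(query: str, subject: str, diagonal: int,
                   match: int = 5, mismatch: int = -4) -> tuple:
    """Return (score, query_start, query_end) of the best run on a diagonal.

    diagonal is the offset j - i between subject and query positions;
    query_end is exclusive.
    """
    i_start = max(0, -diagonal)
    i_end = min(len(query), len(subject) - diagonal)
    best = (0, i_start, i_start)
    run_score, run_start = 0, i_start
    for i in range(i_start, i_end):
        s = match if query[i] == subject[i + diagonal] else mismatch
        if run_score <= 0:
            # A non-positive prefix cannot help; restart the run here.
            run_score, run_start = 0, i
        run_score += s
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


# Mismatching ends are trimmed away, leaving the ACGT core.
print(best_subregion("TTTACGTGGG", "CCCACGTAAA", diagonal=0))
```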

If several initial regions with scores greater than a CUTOFF value are found in an alignment, check whether the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the scores of the joined regions, with a penalty of 20 points for each gap. This initial similarity score (initn) is used to rank the library sequences. The score of the single best initial region found in step 2 is reported (init1).

Here the program calculates an optimal alignment of initial regions as a combination of compatible regions with maximal score. This optimal alignment of initial regions can be rapidly calculated using a dynamic programming algorithm. The resulting score, initn, is used to rank the library sequences. This joining process increases sensitivity but decreases selectivity. A carefully calculated cutoff value is therefore used to control where this step is applied; the value is approximately one standard deviation above the average score expected from unrelated sequences in the library. For example, a 200-residue query sequence with ktup = 2 uses a cutoff value of 28.
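
The joining step can be viewed as chaining compatible regions with dynamic programming. The sketch below is one way to express that idea under stated assumptions: the Region type and initn_score are hypothetical names, two regions are treated as compatible when the second starts after the first ends in both sequences, and the penalty of 20 per join is taken from the description above.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    q_start: int   # start in the query
    q_end: int     # end in the query (exclusive)
    s_start: int   # start in the library sequence
    s_end: int     # end in the library sequence (exclusive)
    score: int


def initn_score(regions: List[Region], join_penalty: int = 20) -> int:
    """Best total score of a chain of compatible regions."""
    ordered = sorted(regions, key=lambda r: (r.q_start, r.s_start))
    best = []   # best[k]: best chain score ending with ordered[k]
    for k, r in enumerate(ordered):
        chain = r.score
        for j in range(k):
            p = ordered[j]
            # p can precede r only if it ends before r starts in both.
            if p.q_end <= r.q_start and p.s_end <= r.s_start:
                chain = max(chain, best[j] + r.score - join_penalty)
        best.append(chain)
    return max(best, default=0)


regions = [
    Region(0, 10, 0, 10, 50),
    Region(15, 25, 18, 28, 40),
    Region(12, 20, 2, 9, 30),    # overlaps the first region in the library
]
print(initn_score(regions))                 # joined score (initn)
print(max(r.score for r in regions))        # single best region (init1)
```

Joining only pays off when the added region scores more than the penalty, so weak regions are left out of the chain.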

Use a banded Smith-Waterman algorithm to calculate an optimal score for alignment.

This step uses a banded Smith-Waterman algorithm to create an optimised score (opt) for each alignment of the query sequence to a database (library) sequence. It takes a band of 32 residues centered on the init1 region of step 2 to calculate the optimal alignment. After all sequences have been searched, the program plots the initial scores of each database sequence in a histogram and calculates the statistical significance of the opt score. For protein sequences, the final alignment is produced using a full Smith-Waterman alignment. For DNA sequences, a banded alignment is provided.
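
Restricting the Smith-Waterman recurrence to a band around one diagonal can be sketched as follows. This is an illustrative simplification, not the FASTA implementation: banded_sw_score is a hypothetical name, the scoring values are assumptions, gaps are linear, and cells outside the band are simply treated as having score 0.

```python
def banded_sw_score(a: str, b: str, diagonal: int, band: int = 32,
                    match: int = 2, mismatch: int = -1,
                    gap: int = -2) -> int:
    """Best local score using only cells within band/2 of a diagonal.

    diagonal is the offset j - i between positions in b and a.
    """
    half = band // 2
    prev = {}     # column -> score for the previous row, band cells only
    best = 0
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i + diagonal - half)
        hi = min(len(b), i + diagonal + half)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,
                prev.get(j - 1, 0) + sub,   # diagonal move
                prev.get(j, 0) + gap,       # gap in b
                curr.get(j - 1, 0) + gap,   # gap in a
            )
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best


a = "ACGTACGT"
b = "T" * 30 + "ACGTACGT"
print(banded_sw_score(a, b, diagonal=30))          # band covers the match
print(banded_sw_score(a, b, diagonal=0, band=4))   # band misses the match
```

The work per query position is bounded by the band width rather than by the length of the library sequence, at the cost of missing alignments that fall outside the band.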

The FASTA programs find regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences, as well as to help identify members of gene families.

Protein

Protein–protein FASTA
Protein–protein Smith–Waterman (ssearch)
Global protein–protein (Needleman–Wunsch) (ggsearch)
Global/local protein–protein (glsearch)
Protein–protein with unordered peptides (fasts)
Protein–protein with mixed peptide sequences (fastf)

Nucleotide

Nucleotide–nucleotide (DNA/RNA fasta)
Ordered nucleotides vs nucleotide (fastm)
Unordered nucleotides vs nucleotide (fasts)

Translated

Translated DNA (with frameshifts, e.g. ESTs) vs proteins (fastx/fasty)
Protein vs translated DNA (with frameshifts) (tfastx/tfasty)
Peptides vs translated DNA (tfasts)

Statistical significance

Protein vs protein shuffle (prss)
DNA vs DNA shuffle (prss)
Translated DNA vs protein shuffle (prfx)

Local duplications

Local protein alignments (lalign)
Plot protein alignment "dot-plot" (plalign)
Local DNA alignments (lalign)
Plot DNA alignment "dot-plot" (plalign)

References

^ Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity searches". Science 227 (4693): 1435–41. doi:10.1126/science.2983426. PMID 2983426.
^ Pearson, WR; Lipman, DJ (1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America 85 (8): 2444–8. doi:10.1073/pnas.85.8.2444. PMC 280013. PMID 3162770.

External links

FASTA Website
EBI's FASTA page - EBI's page for accessing FASTA services.

